
Scraping Data with Automize and Puppeteer - A Deep Dive into Table Scraping

Welcome back to our series on writing automation scripts! Today, we're tackling a complex scraping challenge with a layout that seems to test our scraping skills to their limits. We’re diving into a variation of quotes to scrape, humorously named "Table Full" due to its chaotic table layout. As you might expect, scraping data from such an unconventional arrangement can be particularly tricky.

This blog post was generated from a tutorial video, which you can watch here.

Understanding the Challenge

The table in question comes with an unfortunate amount of padding and weird structuring, making it a challenge to navigate through its layers and extract the information we need. Typically, such layouts don’t mesh well with conventional scraping techniques.

To conquer this task, we’re going to leverage Automize to build our selectors and Puppeteer to drive the browser. Let’s see how we can piece together the data effectively.

Getting Started with Automize

First things first, open up Automize. Our initial step is to locate the first quote in the disorganized table. As we explore the table structure, we find that selecting the first data cell (td) doesn’t yield the results we want: the quotes and their tags sit in alternating rows, with padding cells scattered between them that we need to ignore.

After some investigation, we decide to skip the initial rows and go with a strategy that captures every other row. This means identifying the table body (tbody) and setting our selector correctly to ensure we capture only the essential data.

Transitioning to Puppeteer

With our selectors ready, we shift our focus back to Puppeteer to continue building out our script. It’s crucial to remember that when using Puppeteer, we aim to grab all relevant entries. While a single dollar sign ($) in Puppeteer fetches the first element, two dollar signs ($$) allow us to retrieve all matching elements.
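To make that difference concrete, here’s a tiny runnable sketch. Note that the fakePage object below is only a stand-in so the snippet works without launching a browser; with real Puppeteer you’d call these same methods on an actual Page.

```javascript
// page.$(sel) resolves to the first matching ElementHandle (or null);
// page.$$(sel) resolves to an array of handles, one per match.
async function countRows(page) {
  const firstRow = await page.$('tbody tr');
  const allRows = await page.$$('tbody tr');
  return { hasFirst: firstRow !== null, total: allRows.length };
}

// Stand-in for a real Puppeteer Page, just for demonstration:
const fakePage = {
  $: async () => ({}),          // pretend "first element"
  $$: async () => [{}, {}, {}], // pretend "all three elements"
};

countRows(fakePage).then(r => console.log(r.hasFirst, r.total)); // prints: true 3
```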

We adjust our JavaScript to ensure we’re capturing each row, effectively advancing by pairs to account for quotes and their corresponding tags. Here’s a quick look at the logic:

```javascript
// Runs inside page.evaluate(), where DOM APIs are available.
const rows = Array.from(document.querySelectorAll('tbody tr'));
rows.shift(); // ignore the first (header) row
for (let i = 0; i + 1 < rows.length; i += 2) {
    const quote = rows[i].querySelector('td').innerText;
    const tags = rows[i + 1].querySelectorAll('a');
    // Further processing to retrieve tag texts...
}
```

With this structure, we ensure that we’re processing the rows correctly, effectively ignoring extraneous data.

Looping Through and Capturing Data

As we loop through our selected rows, we extract the inner text of both quotes and corresponding tags. Utilizing evaluate, we run functions that allow us to grab these texts directly from the DOM, ensuring we acquire all necessary details.
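As a sketch of what that evaluate callback can look like: extractRecords is our own name for it, and the fakeRows objects below merely mimic the two tr elements of one quote/tag pair so the snippet runs outside the browser.

```javascript
// The callback passed to page.evaluate() runs in the page, so it can use
// DOM APIs directly. It walks the rows in pairs: even index = quote row,
// odd index = tags row.
function extractRecords(rows) {
  const records = [];
  for (let i = 0; i + 1 < rows.length; i += 2) {
    const quote = rows[i].querySelector('td').innerText.trim();
    const tagLinks = Array.from(rows[i + 1].querySelectorAll('a'));
    records.push({ quote, tags: tagLinks.map(a => a.innerText) });
  }
  return records;
}

// Stand-ins for one quote row and its matching tags row:
const fakeRows = [
  { querySelector: () => ({ innerText: '“Some quote.” by Author' }) },
  { querySelectorAll: () => [{ innerText: 'life' }, { innerText: 'books' }] },
];

console.log(extractRecords(fakeRows)); // one record: the quote text plus its two tags
```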

Finally, we push our neatly bundled results into a record that we can later export as CSV:

```javascript
records.push({
    quote: quoteText,
    tags: tagTexts.join(", ")
});
```

After running the script, we were pleasantly surprised to see our extracted data neatly organized into a CSV file—just what we intended!
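For the curious, the CSV step itself is small. Here’s a minimal sketch (toCsv is a hypothetical helper, not part of Puppeteer or Automize); it wraps every field in double quotes and doubles any embedded quotes, RFC 4180 style, so commas inside a quote don’t break the columns.

```javascript
// Minimal CSV serializer for the records we collected.
function toCsv(records) {
  const escape = v => `"${String(v).replace(/"/g, '""')}"`;
  const header = 'quote,tags';
  const lines = records.map(r => [r.quote, r.tags].map(escape).join(','));
  return [header, ...lines].join('\n');
}

const records = [{ quote: 'To be, or not to be.', tags: 'life, choice' }];
console.log(toCsv(records));
// prints:
// quote,tags
// "To be, or not to be.","life, choice"
```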

Reflecting on the Process

This exercise emphasized the importance of understanding HTML structures when scraping data. By methodically approaching the problem—examining the DOM, adjusting our parsing approach, and leveraging both Automize and Puppeteer—we successfully navigated a challenging scraping task.

If you found this tutorial helpful, be sure to leave your thoughts, video ideas, or feature requests for Automize in the comments section below. Your feedback keeps us motivated!

Thank you for tuning in, and join us next time as we explore more automation techniques!
