In the world of data, one of the most common challenges is accessing information. Very often, the data we need is not available in downloadable formats (such as CSV or Excel), but is instead embedded within web pages. This is where web scraping comes into play.
What Is Web Scraping?
Web scraping is a technique that allows you to extract data from websites using code. Web pages are written in HTML (HyperText Markup Language), a language that uses tags to structure content (headings, paragraphs, tables, links, etc.).
Since HTML is essentially text, it can be read and analyzed by programs called parsers, which make it possible to automatically locate and retrieve the desired information.
In practice, instead of manually copying and pasting data from a website, we can write a script that does it for us.
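As a minimal sketch of what a parser does, the example below uses Python's standard-library html.parser module to pull every link out of a small page. The sample HTML, the LinkCollector class, and the extract_links function are all invented for illustration; they are one possible approach, not a recommended or complete scraper.

```python
from html.parser import HTMLParser

# Hypothetical sample page, invented for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Fruit prices</h1>
    <p>Updated weekly.</p>
    <a href="/apples">Apples</a>
    <a href="/pears">Pears</a>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return its links as (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links(SAMPLE_HTML))
```

The parser walks through the tags in order and calls a handler for each one, so the script only has to say which tags it cares about.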
How Do You Obtain HTML Data?
HTML data can be collected in two main ways:
By manually downloading the HTML source code of a web page
By programmatically accessing the website via HTTP requests (for example, using a GET request)
Once the HTML is obtained, it can be analyzed and transformed into structured data ready for analysis.
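The two steps above (obtain the HTML, then turn it into structured data) could look roughly like the sketch below, which uses only the Python standard library. The function names fetch_html and table_to_records, the User-Agent string, and the sample table are assumptions made for this example. The sketch also assumes the table's first row holds the column names.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page with an HTTP GET request and return its HTML as text."""
    # The User-Agent value is a made-up placeholder for this example.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None     # cells of the row being read
        self._cell = None    # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Use the first table row as field names and return the rest as dicts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical HTML, as if it had been saved manually or returned by fetch_html().
SAMPLE_TABLE = """
<table>
  <tr><th>item</th><th>price</th></tr>
  <tr><td>Apples</td><td>1.20</td></tr>
  <tr><td>Pears</td><td>1.50</td></tr>
</table>
"""

records = table_to_records(SAMPLE_TABLE)
print(records)
```

fetch_html is defined but deliberately not called here, so the example runs without network access. Whichever way the HTML is obtained, the same table_to_records step applies to it.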
When Not to Use Web Scraping
It’s important to clarify a crucial point: web scraping is not always allowed.
Many websites impose specific restrictions in their Terms and Conditions, and ignoring them can lead to legal issues. For this reason, it’s essential to do your homework before starting any scraping activity.
Here are some fundamental guidelines to follow:
Always check the website’s Terms and Conditions
Consider whether your data usage is personal, academic, or commercial
Act ethically and responsibly
If the website offers a public API, use it instead of scraping
Send HTTP requests at a reasonable frequency
Avoid large bursts of simultaneous requests, which can overload a server and resemble a denial-of-service (DoS) attack
Stay informed about laws and regulations related to web scraping
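The guidelines about request frequency can be made concrete with a small throttle. The sketch below is one possible way to do it, not a standard recipe: the Throttle class, the fetch_all helper, and the two-second interval are all hypothetical, and the right interval depends on the website in question. Requests are sent strictly one after another, never in parallel.

```python
import time


class Throttle:
    """Enforce a minimum pause between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval   # seconds between requests
        self._clock = clock                # injectable so it can be tested
        self._sleep = sleep
        self._last = None                  # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, throttle):
    """Fetch URLs one at a time, pausing between requests."""
    pages = []
    for url in urls:
        throttle.wait()
        pages.append(fetch(url))
    return pages


if __name__ == "__main__":
    # Stand-in for a real download function; it only echoes the URL.
    def fake_fetch(url):
        return f"<html>{url}</html>"

    polite = Throttle(min_interval=2.0)   # assumed value, adjust per site
    print(fetch_all(["/page1", "/page2"], fake_fetch, polite))
```

Passing the download function in as an argument keeps the pacing logic separate from the code that actually talks to the website.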
Web scraping is not just a technical matter, but also an ethical one. There are excellent articles that explore this topic further, such as “Ethics in Web Scraping” on Towards Data Science.
API vs Web Scraping
Whenever possible, choose an API over scraping.
Why?
APIs are more stable: they don’t depend on a website’s layout
They are specifically designed to provide data
They offer data that is already structured and easy to use
They scale better with increased request volume
Web scraping, on the other hand, is fragile: even a small change in the website’s HTML code (a redesign, a new tag, a different class) can completely break your script.
Golden rule: if an official API exists, use it.
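To illustrate the fragility described above, the sketch below reads the same price twice: once from a JSON body like the one an API might return, and once from HTML by looking for a specific class name. The data, the class names "price" and "amount", and the helper functions are all invented for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical data, invented for illustration.
API_RESPONSE = '{"product": "Widget", "price": 9.99}'
PAGE_V1 = '<div><span class="price">9.99</span></div>'
PAGE_V2 = '<div><span class="amount">9.99</span></div>'   # after a redesign


class PriceFinder(HTMLParser):
    """Find the text of the first <span> whose class list contains "price"."""

    def __init__(self):
        super().__init__()
        self.price = None
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes and self.price is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.price = data.strip()
            self._inside = False


def scrape_price(html):
    """Return the price found in the HTML, or None if the markup changed."""
    finder = PriceFinder()
    finder.feed(html)
    finder.close()
    return float(finder.price) if finder.price is not None else None


def api_price(body):
    """Return the price from a JSON body with a "price" field."""
    return json.loads(body)["price"]


print(api_price(API_RESPONSE))   # read directly from structured data
print(scrape_price(PAGE_V1))     # works while the class is "price"
print(scrape_price(PAGE_V2))     # the renamed class is no longer found
```

The API version depends only on the field name in the response, while the scraping version depends on a detail of the page's markup that can change without notice.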
Key Terms to Know
For beginners, here are some essential concepts:
HTML (HyperText Markup Language): the markup language used to create web pages
Parser: a program that reads HTML text and breaks it into its tags and content so that information can be extracted
Web Scraping: a technique for extracting data from websites using code
Web scraping is a powerful and highly useful tool for anyone working with data, but it must be used deliberately and responsibly. Understanding how HTML works, when to use scraping, and when to avoid it is essential to becoming an effective and, above all, ethical data wrangler.
If you want to work with web data, remember: respect websites, respect the rules, and prefer an official API over scraping whenever one exists.