Werden wir Helden für einen Tag


My webscraping inferno

Posted on Oct 21, 2022 by Chung-hong Chan


[Figure: client-side content encryption]

This is a chronicle of the mistakes I made in a research project. To be honest, I am still fixing all my mistakes on my own and I am very unlikely to fix all of them. I will suffer from these mistakes for the rest of my life. The research project is suffering from my mistakes too. There are many wrong technical decisions, or indecisions. But I think the biggest mistake is my pomposity. My self-righteousness. My inability to communicate.

As I said last time: Once people get old, they are entitled to hate. Thus I can say this loud and clear: I hate web scraping.

We teach web scraping in quantitative methods classes to future social scientists. Your classes might be different, but that is what I did. I would probably opt not to teach web scraping to my students again (supposing there is any chance for me to teach in the future, which is extremely unlikely). I am not saying web scraping is not useful. As Deen Freelon has said, web scraping is a possible alternative to APIs in the so-called “Post-API age”.

As a class exercise, web scraping is very empowering and students learn how different web technologies work. In my class, I taught my students to scrape all the lyrics of one of their favourite musicians and then make a word cloud. I asked my students to present the word cloud and then we guessed who the musician was. My students enjoyed this activity. They enjoyed it because it is an easy-peasy toy example (a sketch of it follows below). We teach web scraping as a “nice-to-have” technique, so a toy example suffices. Also, you will usually see “Introduction to Web Scraping” as a book, a course, or whatever. You rarely see “Advanced Web Scraping”. The problem is, we rarely teach our students the practical implications of this “nice-to-have” technique: the ethics, the project management, the resource allocation, and most importantly, the difficulty.
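For the record, the whole toy exercise fits in a page of code. A minimal sketch, assuming a hypothetical lyrics site (the domain, the CSS selectors, and the artist path are all made up; any real site will differ and will have its own terms of service):

```python
from collections import Counter

import requests
from bs4 import BeautifulSoup

# Hypothetical artist page listing links to individual lyric pages
BASE = "https://lyrics.example.com"

def scrape_lyrics(artist_path: str) -> Counter:
    """Collect word frequencies across all lyric pages of one artist."""
    index = BeautifulSoup(requests.get(BASE + artist_path).text, "html.parser")
    counts: Counter = Counter()
    for link in index.select("a.song"):          # made-up CSS selector
        page = BeautifulSoup(requests.get(BASE + link["href"]).text, "html.parser")
        lyrics = page.select_one("div.lyrics")   # made-up CSS selector
        if lyrics is not None:
            counts.update(lyrics.get_text().lower().split())
    return counts

# The top words are the "word cloud"; guessing the musician is left to the class
print(scrape_lyrics("/artist/some-musician").most_common(20))
```

At this scale, everything is pleasant. Nothing below is.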

First things first: I can’t think of any website owner who would welcome someone scraping his or her website. Also, I can’t recall any person who does web scraping saying he or she enjoys doing it. Simply put: web scraping is mutually assured destruction (or laam caau, 攬炒, in my language). You won’t destroy each other if you don’t strike, I mean, scrape. But “If We Burn, You Burn With Us.” You scrape, so now you are at war with the website owners.

Let’s assume researchers are reckless and really enjoy scraping websites. There are always some warriors or cowboys in academia, usually in the higher command. The moment you choose (or are assigned) to do that, you are in a cat-and-mouse game. And your opponents are business website owners, and they will throw in your face all the anti-scraping measures at their disposal: Captcha, client-side content encryption, Cloudflare, etc. Ah, yes, paywalls. They have every business reason to prevent you from scraping their websites. And they will put whatever resources they have into protecting their websites from being scraped.
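You cannot really defeat those measures, but the least a scraper can do is notice when it is being pushed back and back off. A minimal sketch (the user-agent string and the retry limits are my own made-up choices):

```python
import time

import requests

def polite_get(url: str, max_retries: int = 5) -> requests.Response | None:
    """Fetch a URL, backing off exponentially when the server pushes back."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(
            url,
            headers={"User-Agent": "research-bot (contact: me@example.org)"},  # hypothetical
        )
        # 403/429/503 are the usual "go away" responses from anti-scraping layers
        if resp.status_code in (403, 429, 503):
            time.sleep(delay)
            delay *= 2  # exponential backoff
            continue
        return resp
    return None  # give up; the site won this round
```

Anything smarter than a status code, a Captcha or client-side encryption, and this backoff loop is already useless.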

At this point, the research depends on whether you can win this cat-and-mouse game. Do you have the same level of resources as your opponents to defeat these protective measures? Now you know how stressful it is to manage this kind of “nice-to-have” data collection activity. You lose sleep. You work like mad, multiple all-nighters till you drop.

Although I argue social scientists don’t have the resources and skills to win this cat-and-mouse game, let’s assume your technical skills are wonderful and you believe you can win it. Great confidence! So, let’s move on.

We usually severely underestimate the workload. In our toy examples, we might teach our students to scrape one page, a bunch of pages, or at most all web pages of one website. In real-life research, we usually (severely) underestimate the work of scraping all pages of many websites, or even all pages of all websites of a certain kind. One can easily bite off (way) more than one can chew.

Again, let’s assume you have incredible skills and you can work 96 hours a day. And you have the wonderful talent, time, and persistence to write undefeatable scrapers that scrape many websites without the websites blocking you. Now you know that you need to scrape over 2 million pages. If you were Google, you could scrape 9,000 or more pages per second with an incredible cluster of crawlers and get all the pages in less than 4 minutes. However, we can’t even scrape 1 page per second most of the time. Your scraper is not as welcome as Google’s. You would usually be allocated one machine and one IP.
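The back-of-envelope arithmetic is sobering (2 million pages is the figure above; the rates are illustrative):

```python
pages = 2_000_000  # the figure from the paragraph above

for rate in (9_000, 1, 0.1):  # pages per second: Google-scale, hopeful, realistic
    days = pages / rate / 86_400
    print(f"at {rate} page(s)/s: {days:.1f} days")

# at 9000 page(s)/s: 0.0 days   (about 3.7 minutes)
# at 1 page(s)/s: 23.1 days
# at 0.1 page(s)/s: 231.5 days  (see the database woes below)
```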

Again, let’s assume you have Google-scale scraping IP ranges. Before diving into technical questions, let’s get back to the resource question. Remember I said “one machine and one IP”? Sometimes, the storage allocated to you can be as small as a few gigabytes. In my language, you need to “eat your own flesh for meals” (食自己), or “even a good housewife cannot cook a meal without rice” (巧婦難為無米之炊; well, that’s sexist, like many proverbs in my language). For the audience from the West: “There shall no straw be given you, yet ye shall make bricks without straw.”
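How small is “a few gigabytes” against 2 million pages? Assuming an average raw HTML page of about 100 KB (my assumption; real page sizes vary wildly):

```python
pages = 2_000_000
avg_page_kb = 100  # assumed average size of one raw HTML page

total_gb = pages * avg_page_kb / 1_000_000
print(f"about {total_gb:.0f} GB of raw HTML")  # about 200 GB, versus a few GB of quota
```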

How are you going to store all these pages? A database? Which DBMS? Do you know enough about DBMSs to design an efficient database? Let’s assume you are a Stanford-grad CS lucky son of a gun and you dream in relational algebra every fucking night. You want to add an index to speed up the query? No storage. You want to denormalize to speed up the query? No storage. You want to do a fucking simple left join? No RAM!
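To make the dilemma concrete, here is a toy SQLite sketch (the file name, table, and columns are made up). The index buys query speed, but it is paid for in exactly the disk space you do not have:

```python
import sqlite3

con = sqlite3.connect("scrape.db")  # hypothetical database file
con.executescript("""
CREATE TABLE IF NOT EXISTS page (
    url     TEXT PRIMARY KEY,
    site    TEXT,
    fetched TEXT,
    html    TEXT
);
-- speeds up per-site queries, but the index itself eats disk space
CREATE INDEX IF NOT EXISTS idx_page_site ON page (site);
""")
con.commit()
```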

The problem is that you haven’t been given enough disk space and memory to do any space-time optimization. The space-time trade-off forces you to trade speed for space (a very 80s thing to do; it reminds me of programming a C64 or an Apple ][ with 64K of RAM). And then you are struggling with the speed again. Your dream of scraping just 1 page a second would be reduced to maybe 0.1 pages per second due to the extremely bad database performance.
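One concrete way to make that trade, using the same made-up schema as above: compress every page before it hits the disk, and pay a decompression cost on every single read:

```python
import sqlite3
import zlib

con = sqlite3.connect("scrape.db")  # same hypothetical database file as above
con.execute("CREATE TABLE IF NOT EXISTS page_z (url TEXT PRIMARY KEY, html BLOB)")

def store(url: str, html: str) -> None:
    # level 9 squeezes hardest; every read now pays a decompression cost
    blob = zlib.compress(html.encode("utf-8"), level=9)
    con.execute("INSERT OR REPLACE INTO page_z VALUES (?, ?)", (url, blob))

def load(url: str) -> str | None:
    row = con.execute("SELECT html FROM page_z WHERE url = ?", (url,)).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None
```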

OK. Now you can s/you/I/g the above, and assume the answer to every one of the above “let’s assume” assumptions is “hell no” for me. You can understand what kind of deep shit I am in.

Welcome to my webscraping inferno! And my life sucks!

