#Scraping Tutorial
##(or how I stopped waiting for each page to load and scraped a webcomic instead.)
So I wanted to read this webcomic, but I have terrible internet and the pages were taking too long to load. On top of that, having downloaded webcomics before, I did not want HTML output. I wanted CBRs (Comic Book RARs, i.e. RAR archives renamed), which my Android devices can read. Also, the images have title text (mouseovers) that needs to be stored and displayed with each page.
###Requirements
- For the web scraping I used requests and BeautifulSoup.
- The image work, i.e. putting the mouseover text below the image, is handled by Pillow, a superior fork of the Python Imaging Library.
- Finally, for the text layout, I found this recipe that worked wonders. (This is a salient feature of Python: just insane third-party support.) Props to the author; it saved me some work.
This was not a difficult job. Python has gotten me out of much more problematic situations. This article is written as a pitch for selling Python to people shopping for a first or second language.
###Let’s size up the Enemy
Before any scraping is done, we have to check out the website and look for patterns. Automation is all about recognizing patterns.
Looking at the first page and the link behind the ‘last comic’ navigation button, it is clear that the final id is 289.
Looking at a few random page sources, it is clear that

```html
<div id="comicbody"><a href="/index.php?id=176"><img title="''Irritate your subordinates. It's really fun.'' -Sun Tzu" src="http://www.paranatural.net/comics/2013-06-28-chapter-4-page-18.png" id="comic" border="0" /><br /></a></div>
```

is the div with all the information we need: the title text (mouseovers) and the image source.
###Time to write some code
Performing GET requests using the requests module is terribly easy. Using the `get()` method of the requests module, you obtain a response object whose `status_code` and `content` attributes contain the response status and the HTML content respectively.
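For instance, fetching the page from the div above:

```python
import requests

response = requests.get('http://www.paranatural.net/index.php?id=176')
print(response.status_code)   # 200 if all went well
html = response.content       # the raw HTML of the page
```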
Given that network operations are prone to timeouts and other connectivity issues (more so if you have bad internet), we ought to wrap this page fetching in a function which retries a few times before giving up.
This function takes a `url` and an integer `max_attempts`, and attempts to get the page up to `max_attempts` times, returning immediately with the content in case of success (code 200).
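Something like this, assuming any non-200 response or connection error counts as a failed attempt:

```python
import time
import requests

def get_page(url, max_attempts=5):
    """GET `url`, retrying up to `max_attempts` times before giving up."""
    for _ in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return response.content
        except requests.RequestException:
            pass                 # timeout or connection error: retry
        time.sleep(1)            # small pause between attempts
    raise IOError('could not fetch %s in %d attempts' % (url, max_attempts))
```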
Okay, cool. But now we need to dig out the `<div id="comicbody">` from the content.
###Enter BeautifulSoup
HTML parsing can be tricky if you are not using the right tools, but BeautifulSoup simplifies the process vastly, making simple web scraping almost trivial. To work with BeautifulSoup, you first make a soup object using `BeautifulSoup(<html string>)`. To look for a specific div with a known id in the entire document, you can use the `find()` method. The objects returned by this can access child elements using the `.` (dot) syntax. To get an attribute of an element, you use the `get()` method.
Let's wrap this in a function so that we don't have to deal with the soupy details while setting up the whole contraption.
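A sketch of such a wrapper (assuming the bs4 import; the name `get_comic_info` is my invention):

```python
from bs4 import BeautifulSoup

def get_comic_info(html):
    """Return (image_url, title_text) dug out of the comicbody div."""
    soup = BeautifulSoup(html)
    div = soup.find('div', id='comicbody')   # find the div by its id
    img = div.a.img                          # child access via the dot syntax
    return img.get('src'), img.get('title')  # attributes via get()
```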
But wait! We need to download the images as well. That is the point, isn’t it?
No problem. We will just use the `get_page` function in another download function which will fetch the images for us. Once we have the contents of a successful request response, all we need to do is write them to a local file. This is done with the built-in `open()` function; content can be written using the `write()` method on the file object returned by `open()`.
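A sketch of that helper (the 'out' directory it writes to is set up later in the article):

```python
import os

def download_image(url, out_dir='out'):
    """Fetch the image with get_page() and save it under its own name in out_dir."""
    data = get_page(url)
    filename = os.path.join(out_dir, os.path.basename(url))
    with open(filename, 'wb') as f:   # 'wb': image bytes, not text
        f.write(data)
    return filename
```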
The scraping part is officially over. We have all the required functions, and the rest is just calling them in a loop passing to them new urls each time.
But before we put everything together, let us get the mouseover texts into the images so that the comic book reading software can display them.
###A picture is worth 1K words..
#####(except when it is not)
####The Job:
1. Get the original image.
2. Extend it at the bottom just the right amount to accommodate the mouseover text.
3. Put the text at the bottom.
Parts 1 and 3 can be easily done with Pillow, but part 2 can be a little tricky (like 20 minutes of experimentation). Fortunately, this is Python and someone on the internet has already spent those 20 minutes. We use this image_utils module that does exactly what we need.
To make it work with Pillow (it is written for PIL), take a look at porting PIL code to Pillow. It is just a matter of changing the imports. The example code in the link is quite self-explanatory.
First you get a special `ImageText` object, initializing it with size and background color tuples (a 2-tuple for size and a 4-tuple for the RGBA background color). Then, calling the `write_text_box()` method on the `ImageText` object, we can write text to this image.
Based on the parameters you provide to `write_text_box()`, you can get it to align and size any text, split it into lines according to a max width, and put it on the image. The `image` attribute of the `ImageText` object is the PIL/Pillow compatible image object.
The `write_text_box()` method also returns a size tuple of the text box, which will be very useful in determining how much to expand the original image.
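In outline, usage might look like this (a sketch: the call below follows my reading of the recipe's signature, and the font file is a placeholder you would need to have on disk):

```python
from image_utils import ImageText

# 2-tuple for size, 4-tuple for the RGBA background color.
canvas = ImageText((600, 200), background=(255, 255, 255, 255))

# Wrap the text into a 560px-wide box; per the recipe, the size
# of the resulting text box is returned.
size = canvas.write_text_box((20, 10), "''Irritate your subordinates...''",
                             box_width=560, font_filename='DejaVuSans.ttf',
                             font_size=14, color=(0, 0, 0))

canvas.image.save('text_box.png')   # .image is the underlying PIL/Pillow image
```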
Reading through the Pillow docs, we can see that we can do standard operations like opening, saving, cropping, copying, and pasting images using the various methods it exposes.
####The Plan:
* Get the original image's size.
* Create a temporary `ImageText` object of sufficient height and width.
* Write text to the temp object using `write_text_box()` with `box_width` = original image's width. (If a width is provided, the module automatically manages the number of lines the text is split into.)
* Get the height of the text box from the call to `write_text_box()`.
* Make a new blank Pillow image with width = original width and height = original height + height of text box + 20 pixels (this is the standard offset, determined by trial and error by me).
* Crop the temporary image to extract just the text box portion.
* Paste the original image on the new blank image.
* Paste the cropped text box at the bottom of the new image.
* Profit.
Code:
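The original listing isn't reproduced inline here, so below is a sketch that follows the plan; `add_text` is an assumed name and the `write_text_box()` call reflects my reading of the recipe:

```python
from PIL import Image
from image_utils import ImageText

def add_text(image_path, text, out_path, font='DejaVuSans.ttf'):
    """Extend the comic page at the bottom and write the mouseover text there."""
    original = Image.open(image_path)
    width, height = original.size

    # Temporary canvas, comfortably taller than any title text will need.
    temp = ImageText((width, 500), background=(255, 255, 255, 255))

    # write_text_box() wraps the text to box_width and (per the recipe)
    # returns the size of the resulting text box.
    _, text_height = temp.write_text_box((0, 0), text, box_width=width,
                                         font_filename=font, font_size=14,
                                         color=(0, 0, 0))

    # New blank image: original height + text height + the 20px offset.
    combined = Image.new('RGB', (width, height + text_height + 20),
                         (255, 255, 255))
    combined.paste(original, (0, 0))
    # Crop just the text box out of the temp canvas and paste it below.
    combined.paste(temp.image.crop((0, 0, width, text_height)),
                   (0, height + 10))
    combined.save(out_path)
```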
###Some loose ends
As you might have noticed, we have used an ‘out’ directory to keep the downloaded files, and we will be using a ‘final’ directory to store the processed images. It will be awkward if those folders are not present, so let us make them. The `os` module in the standard library has a lot of convenience functions that help with this path stuff.
We will also be using the `os.path.join()` and `os.path.basename()` functions. They are fairly straightforward: `join` combines path components, and `basename` gives you the file name from a full path. These functions are very useful if you use Windows and are trapped in the eternal tragedy of the slashes. Another helpful function, `os.getcwd()`, gets the current working directory.
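For instance:

```python
>>> import os
>>> for d in ('out', 'final'):       # create the directories if missing
...     if not os.path.exists(d):
...         os.makedirs(d)
...
>>> os.path.join('out', 'page.png')
'out/page.png'
>>> os.path.basename('/tmp/out/page.png')
'page.png'
```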
We also store the output path in a string using `join` and `getcwd`:

```python
outpath = os.path.join(os.getcwd(), 'final')
```
###The Mainloop
All the prerequisites have been taken care of and only putting it all together remains.
Here is the code:
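(A sketch standing in for the original listing, built from the helpers above; `get_comic_info` and `add_text` are the names I assumed earlier.)

```python
base_url = 'http://www.paranatural.net/index.php?id='
# Generator expression producing every page URL; the final id is 289.
urls = (base_url + str(n) for n in range(1, 290))

for i, url in enumerate(urls):
    html = get_page(url)
    img_url, title = get_comic_info(html)
    local = download_image(img_url)            # lands in 'out'
    # zfill pads the page number so filenames sort correctly: 001, 002, ...
    name = str(i + 1).zfill(3) + local[-4:]    # slicing keeps the '.png'
    add_text(local, title, os.path.join(outpath, name))
```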
Four peculiar things have been used here: `enumerate`, a generator expression, `zfill`, and slicing syntax.
`enumerate` is a way to deal with indexes in `for..in` style loops. This is how it works:
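```python
>>> for index, name in enumerate(['spam', 'eggs', 'ham']):
...     print(index, name)
...
0 spam
1 eggs
2 ham
```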
`zfill` just takes a string and pads it with zeros in front until it reaches the supplied width. It is useful for converting a ‘1’ to a ‘001’, which makes sorting easier:
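```python
>>> '1'.zfill(3)
'001'
>>> '42'.zfill(3)
'042'
```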
A generator expression is a shorthand way of writing a loop. Say you want an iterator (something we can loop over with `for..in`) that contains the squares of the first n natural numbers. You can do this:
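```python
>>> squares = (x * x for x in range(1, 6))   # squares of the first 5 naturals
>>> list(squares)
[1, 4, 9, 16, 25]
```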
Slicing syntax is a super convenient way of referring to sections of a list or a string. This is one of the reasons why string work is such a breeze in Python.
Say there is a string ‘DeadParrot’:
Element | Forward Index | Reverse Index |
---|---|---|
D | 0 | -10 |
e | 1 | -9 |
a | 2 | -8 |
d | 3 | -7 |
P | 4 | -6 |
a | 5 | -5 |
r | 6 | -4 |
r | 7 | -3 |
o | 8 | -2 |
t | 9 | -1 |
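A few examples with that string:

```python
>>> s = 'DeadParrot'
>>> s[0:4]     # characters 0 through 3
'Dead'
>>> s[-6:]     # the last six characters
'Parrot'
>>> s[::-1]    # the whole string, reversed
'torraPdaeD'
```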
###The full code
That was ~60 lines of Python, btw ;) in case you hadn’t noticed.