Craigslist Web Scraping: A Step-by-Step Guide

Recently, I've been considering relocating. And what better way to make sure I'm getting a good deal than to comb through Craigslist's entire "population" of housing listings? That's a job for... Python and web scraping!
In this article, I'll walk you through the code I use to scrape East Bay Area Craigslist for apartments. The code, and/or the URL arguments, can be changed to retrieve data for any location, category, property type, and so on. Isn't that amazing?

I'll provide GitHub gists for each cell in the original Jupyter Notebook. Clone the repo if you want to see the entire code at once. Otherwise, have fun reading and following along!

Obtaining the Data

First and foremost, I wanted to make use of the get function from the requests package. Then I created a variable called response and assigned it the result of calling get on the base URL. By base URL, I mean the URL of the first page you want to retrieve data from, minus any additional arguments. I went to the East Bay apartments section and checked the "Has Picture" filter to narrow the search just a little, so it's not a genuine base URL.
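Here's a minimal sketch of that request step; the exact URL and the hasPic filter parameter are assumptions based on the search described above:

```python
from requests import get

# East Bay apartments, filtered to listings that have a picture
url = 'https://sfbay.craigslist.org/search/eby/apa?hasPic=1'
response = get(url)

print(response.status_code)  # 200 means the request succeeded
```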


Then I imported BeautifulSoup from bs4, which is the module that can actually parse the HTML of the web page fetched from the server. Once the posts were collected (next step), I double-checked their type and length to make sure they matched the number of posts on the page (there are 120). My import statements and setup code are shown below:
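A minimal sketch of that setup, reusing the response object from the request above:

```python
from bs4 import BeautifulSoup

# Parse the raw HTML returned by the server
html_soup = BeautifulSoup(response.text, 'html.parser')
```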
I located the posts by using the find_all method on the newly created html_soup variable in the code above. To discover the parent tag of the postings, I needed to look at the website's structure. It's <li class="result-row">, as you can see in the screenshot below. That's the tag for a single post, which is simply the box containing all of the elements I gathered!
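A sketch of that step, with the sanity checks described above:

```python
# Each post lives in an <li class="result-row"> element
posts = html_soup.find_all('li', class_='result-row')

# Sanity checks: the result type, and the post count for one page (120)
print(type(posts))  # <class 'bs4.element.ResultSet'>
print(len(posts))   # 120
```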

Before you scale this up, make sure to work in the following manner: grab the first post and all the variables you want from it, make sure you know how to access each of them for one post before looping over the entire page, and finally, make sure you've scraped one page successfully before adding the loop that goes through all the pages.

The type check confirms that posts is a bs4.element.ResultSet.

Because a ResultSet is indexed, I examined the first apartment by indexing posts[0]. Surprisingly, it's all the code from that <li> tag!
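For example, assuming the posts variable from above:

```python
# A ResultSet supports indexing, so grab the first post to inspect it
post_one = posts[0]
print(post_one)  # the full HTML of the first <li class="result-row"> element
```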

I got the date and time by using the 'datetime' attribute on the tag of class 'result-date'. By grabbing the 'datetime' attribute directly, I saved a step in data cleaning, since there was no need to convert a display string into a datetime object. This could alternatively be a one-liner by appending ['datetime'] to the end of the .find() call, but I separated it into two lines for clarity.
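A sketch of both versions; the <time> tag name is an assumption based on Craigslist's result markup at the time:

```python
# Two-line version: find the tag, then read its 'datetime' attribute
post_one_time = post_one.find('time', class_='result-date')
post_one_datetime = post_one_time['datetime']  # e.g. '2019-01-01 12:00'

# Equivalent one-liner
post_one_datetime = post_one.find('time', class_='result-date')['datetime']
```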

Because the number of bedrooms and the square footage live in the same tag, I split the text and cleaned each value element by element. I got the neighbourhood from the text of the <span> tag of class "result-hood".
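A sketch of that extraction; the 'housing' class name and the '2br - 900ft2' text format are assumptions about the page layout:

```python
# Bedrooms and square footage share one tag, e.g. '2br - 900ft2',
# so split the text and clean each piece separately
housing = post_one.find('span', class_='housing').text.split()
bedrooms = int(housing[0].replace('br', ''))  # '2br'    -> 2
sqft = int(housing[2].replace('ft2', ''))     # '900ft2' -> 900

# The neighbourhood is the text of the <span class="result-hood"> tag
neighbourhood = post_one.find('span', class_='result-hood').text.strip()
```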

The following block is a loop over all of the East Bay pages. Because square footage and bedroom count aren't always listed, I included a series of if statements inside the for loop to handle every case.
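Here's a sketch of that loop. The URL, the page count, and the 's' offset parameter (Craigslist pages its results 120 at a time) are assumptions for illustration:

```python
from random import randint
from time import sleep

from bs4 import BeautifulSoup
from requests import get

results = []

for page in range(25):  # assumed page count for the East Bay search
    response = get('https://sfbay.craigslist.org/search/eby/apa',
                   params={'hasPic': 1, 's': page * 120})
    sleep(randint(1, 3))  # pause between requests to be polite to the server

    html_soup = BeautifulSoup(response.text, 'html.parser')

    for post in html_soup.find_all('li', class_='result-row'):
        # Skip posts that are missing a neighbourhood tag
        if post.find('span', class_='result-hood') is None:
            continue

        datetime_posted = post.find('time', class_='result-date')['datetime']
        neighbourhood = post.find('span', class_='result-hood').text.strip()

        # Square footage and bedroom count are not always present,
        # so check each piece of the housing text before parsing it
        bedrooms, sqft = None, None
        housing = post.find('span', class_='housing')
        if housing is not None:
            for piece in housing.text.split():
                if 'ft2' in piece:
                    sqft = int(piece.replace('ft2', ''))
                elif 'br' in piece:
                    bedrooms = int(piece.replace('br', ''))

        results.append({'datetime': datetime_posted,
                        'neighbourhood': neighbourhood,
                        'bedrooms': bedrooms,
                        'sqft': sqft})
```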
