Scrape Booking.com Data and Extract Hotel Information

urussword377 (32) in #web-scraping • last month

Have you ever needed to gather hotel data in bulk for analysis? Whether you're building a travel comparison website or conducting market research, scraping Booking.com can give you the hotel insights you need. In this guide, we'll show you how to use Python to extract data from Booking.com, including hotel names, ratings, prices, addresses, and more.
By the end of this article, you’ll have a solid foundation for scraping structured data from a website that applies dynamic content loading and anti-scraping measures.

Why This Matters

Before diving into the code, let’s acknowledge one thing: data is power. In the travel industry, few sites expose as much hotel information as Booking.com. With current data on hotel locations, prices, customer reviews, and more, you can compare properties, track pricing, and spot market trends. Let’s make it happen.

Getting the Required Libraries

First, let’s get our environment set up. We need a few Python libraries to help us with this project:

Requests: This library will help us send HTTP requests to the Booking.com pages and fetch their HTML content.
lxml: This is our go-to tool for parsing HTML and extracting data using XPath.
json: Python’s built-in json module will help us handle the embedded structured data.
csv: We'll use the built-in csv module to save the data into a neat CSV file.

You can install the required libraries with this simple pip command:

pip install requests lxml

The json and csv modules are part of Python's standard library, so they need no installation.

Decoding the URL and Data Organization

Scraping a website effectively begins with understanding its structure. Each hotel page on Booking.com contains JSON-LD data: a structured block, embedded in the HTML inside a script tag of type application/ld+json, that holds information like the hotel name, location, and pricing. That block is what we’re going to scrape, which is simpler than picking values out of the visible page markup.
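To make the idea concrete, here is a minimal sketch of what such a JSON-LD block might look like and how Python reads it. The payload is invented for illustration and only loosely follows the schema.org Hotel shape; the fields a real page exposes may differ.

```python
import json

# Hypothetical JSON-LD payload shaped like a schema.org "Hotel" entry.
# Every value below is made up for illustration.
sample_json_ld = """
{
  "@context": "https://schema.org",
  "@type": "Hotel",
  "name": "Example Hotel",
  "hasMap": "https://maps.example.com/?q=example-hotel",
  "priceRange": "Prices start at $120 per night",
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": 8.7},
  "address": {"@type": "PostalAddress", "streetAddress": "1 Example Street"},
  "image": "https://images.example.com/hotel.jpg"
}
"""

# json.loads turns the text into ordinary nested dicts
hotel = json.loads(sample_json_ld)
print(hotel["name"])
print(hotel["aggregateRating"]["ratingValue"])
print(hotel["address"]["streetAddress"])
```

Once parsed, the data is a plain dictionary, so nested values such as the rating or street address are reached with ordinary key lookups.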

The Complete Scraping Process

Let’s dive in. Booking.com’s anti-scraping mechanisms might block you if you don’t approach the task correctly. So, we'll combine browser-like headers with proxies to reduce the chance that our requests get flagged.

Sending HTTP Requests with Headers

Headers act like a disguise for your scraper. Without them, the request carries the default python-requests user agent, which Booking.com can easily identify as a bot and may block. To make that less likely, we’ll send custom headers that mimic a real user’s browser session.

Here’s sample code that sets up the headers and sends the request:

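import requests
from lxml.html import fromstring

# Replace with the Booking.com hotel page URLs you want to scrape
urls_list = ["https://example.com"]

# Browser-like headers, defined once because every request uses the same set
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36',
}

for url in urls_list:
    # A timeout keeps the loop from hanging on a stalled connection
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # stop early on 4xx/5xx responses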
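Because the loop fetches URLs back to back, one simple way to avoid overloading the server is to pause between requests. The sketch below is an assumption about how you might do that, not something Booking.com documents: polite_pause, its default delays, and the injectable sleep parameter are all hypothetical names chosen for illustration.

```python
import random
import time

def polite_pause(base_seconds=2.0, jitter_seconds=1.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the delay that was used. The sleep argument can be swapped
    out (for example in tests) so nothing actually waits.
    """
    delay = base_seconds + random.uniform(0, jitter_seconds)
    sleep(delay)
    return delay

# Possible use inside the URL loop (network call shown as a comment):
# for url in urls_list:
#     response = requests.get(url, headers=headers, timeout=30)
#     polite_pause()
```

The random jitter keeps requests from arriving at perfectly regular intervals; the right base delay depends on how many pages you fetch and is a judgment call.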

The Value of Proxies

Proxies are crucial for scraping sites like Booking.com. They help avoid IP bans by rotating between different IP addresses. Booking.com may limit the number of requests from a single IP, so spreading the requests across multiple proxies is essential.

Here’s how to add proxies:

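# Replace proxy_ip:port with your proxy's address.
# Both entries usually point at the same http:// proxy URL: the key is the
# scheme of the target site, not of the proxy. Use https:// only if your
# proxy itself accepts TLS connections.
proxies = {
    'http': 'http://proxy_ip:port',
    'https': 'http://proxy_ip:port'
}
response = requests.get(url, headers=headers, proxies=proxies)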
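The snippet above uses a single proxy. To actually rotate between IP addresses, one option is to cycle through a pool and build a fresh proxies dictionary for each request. This is a minimal sketch: the proxy addresses are placeholders, and PROXY_POOL and next_proxies are hypothetical names.

```python
from itertools import cycle

# Placeholder proxy endpoints; replace with proxies you have access to.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

# cycle() yields the pool entries in order and starts over at the end
_proxy_cycle = cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy_url = next(_proxy_cycle)
    return {"http": proxy_url, "https": proxy_url}

# Possible use inside the URL loop (network call shown as a comment):
# response = requests.get(url, headers=headers, proxies=next_proxies(), timeout=30)
```

Round-robin rotation is the simplest policy. Picking a proxy at random or dropping proxies that fail are common variations.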

Extracting JSON Data from HTML

Now that we’ve successfully sent a request, we need to extract the embedded JSON-LD data from the page. Here’s how to do that:

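import json

parser = fromstring(response.text)

# Extract JSON-LD data from the page
embedded_json = parser.xpath('//script[@type="application/ld+json"]/text()')
if not embedded_json:
    raise ValueError("No JSON-LD block found; the page may be blocked or changed")
json_data = json.loads(embedded_json[0])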
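Taking the first match works when the page has exactly one JSON-LD script. A page can carry several (for example, a breadcrumb list next to the hotel entry), so a more defensive variant is to parse every block and keep the one with the expected type. The sketch below assumes the hotel entry is marked with "@type": "Hotel"; pick_hotel_block is a hypothetical helper that takes the list of strings returned by the XPath query.

```python
import json

def pick_hotel_block(raw_blocks, wanted_type="Hotel"):
    """Return the first parsed JSON-LD object whose @type matches, or None."""
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # A script tag may hold a single object or a list of objects
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == wanted_type:
                return item
    return None

# Invented example input, standing in for the XPath result
blocks = [
    '{"@type": "BreadcrumbList"}',
    'not valid json',
    '[{"@type": "Hotel", "name": "Example Hotel"}]',
]
print(pick_hotel_block(blocks))
```

With real pages you would call pick_hotel_block(embedded_json) and treat a None result as a page to skip or retry.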

Gathering Hotel Information

Once we have the JSON data, extracting hotel details is a breeze. Here’s an example of how to grab the essentials:

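name = json_data['name']
location = json_data['hasMap']
price_range = json_data['priceRange']
rating_value = json_data['aggregateRating']['ratingValue']
# 'address' is a nested object; this assumes the schema.org PostalAddress shape
address = json_data['address']['streetAddress']
image_url = json_data['image']

# all_data is a list created once, before the URL loop (all_data = [])
all_data.append({
    "Name": name, "Location": location, "Price Range": price_range,
    "Rating": rating_value, "Street Address": address, "Image URL": image_url,
})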
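Direct key lookups raise a KeyError as soon as one hotel page lacks a field, which is common for listings with no rating yet. A more forgiving sketch uses dict.get so that missing values become None. extract_hotel is a hypothetical helper, and the nested key names assume the schema.org layout.

```python
def extract_hotel(json_data):
    """Build one CSV row from a JSON-LD dict, using None for missing fields."""
    rating = json_data.get("aggregateRating") or {}
    address = json_data.get("address") or {}
    # The address may be a nested object or, on some pages, a plain string
    if isinstance(address, dict):
        street = address.get("streetAddress")
    else:
        street = address
    return {
        "Name": json_data.get("name"),
        "Location": json_data.get("hasMap"),
        "Price Range": json_data.get("priceRange"),
        "Rating": rating.get("ratingValue"),
        "Street Address": street,
        "Image URL": json_data.get("image"),
    }

# Invented input with no rating and no image
row = extract_hotel({
    "name": "Example Hotel",
    "address": {"streetAddress": "1 Example Street"},
})
print(row)
```

The returned keys match the CSV column names used in the next step, so each row can be appended to a list and written out unchanged.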

Writing Data to CSV

Once we’ve extracted the data, it’s time to save it. Here’s how you can write it into a CSV file for later analysis:

import csv

# all_data is a list of dicts, one per hotel, keyed by the field names below
with open('booking_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Street Address", "Image URL"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    # Write header and data rows
    writer.writeheader()
    writer.writerows(all_data)
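To see what the writer expects, here is a small self-contained sketch that writes two invented rows to an in-memory buffer instead of a file. The row values are made up; the second row leaves some fields empty to show how csv.DictWriter handles gaps.

```python
import csv
import io

fieldnames = ["Name", "Location", "Price Range", "Rating", "Street Address", "Image URL"]

# Invented sample rows; each dict is one hotel, keyed by the column names
all_data = [
    {
        "Name": "Example Hotel",
        "Location": "https://maps.example.com/?q=1",
        "Price Range": "$120",
        "Rating": 8.7,
        "Street Address": "1 Example Street",
        "Image URL": "https://images.example.com/1.jpg",
    },
    # Missing keys and None values are both written as empty cells
    {"Name": "Second Hotel", "Rating": None},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(all_data)

print(buffer.getvalue())
```

Swapping the buffer for open('booking_data.csv', 'w', newline='') gives the file-based version shown above.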

Wrapping Up

By following these steps, you can scrape Booking.com hotel pages and collect names, ratings, prices, and addresses for analysis. Always be mindful of the website’s terms of service, and pace your requests so you don't overload the server. With this foundation, you can start building your own data-driven insights.
