Efficiently Scrape Medium Articles Using Python: Tools and Techniques
Ever wanted to analyze content from one of the largest writing platforms on the internet? Medium is a goldmine of articles, and with a bit of Python you can collect that data automatically. In this guide, we’ll break down how to scrape Medium articles using Python, extracting useful information such as the article title, author, publication, and body text. Whether you're gathering data for research, monitoring specific authors, or tracking trends, this tutorial will show you the ropes.
What You Need Before You Start
For this project, we’ll be scraping a Medium article titled “9 Python Built-in Decorators That Optimize Your Code Significantly.” Here’s what you’ll need to set up:
- Requests – For sending HTTP requests to Medium.
- lxml – To parse HTML content easily.
- Pandas – For storing and exporting the scraped data to CSV.
To install these libraries, just run:
pip install requests
pip install lxml
pip install pandas
Bypass Bot Detection with Headers and Proxies
Medium employs bot detection to prevent excessive or unauthorized scraping. So, how can you get around this? Simple—headers and proxies. These act like a disguise for your request, making it seem like it’s coming from a real user.
Headers simulate a request from a browser, including essential details like the browser type, language, and other attributes.
Here’s what you need to add to your request headers:
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
    # Other header values...
}
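If you plan to fetch more than one page, it can be convenient to attach these headers to a requests.Session, so every call reuses them (along with any cookies and the underlying connection). Here is a minimal sketch, assuming the headers dictionary defined above:
import requests

# Session setup: the headers (and any cookies Medium sets) are reused
# by every request made through this session
session = requests.Session()
session.headers.update(headers)

# Example: response = session.get('https://medium.com/some-article')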
Proxies route your requests through a different IP address (and a rotating proxy switches that address between requests), making it much harder for Medium to block you. Here's an example configuration:
proxies = {
    'http': 'http://IP:PORT',
    'https': 'http://IP:PORT'
}
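If your provider gives you several proxy endpoints, one simple approach is to pick a different one for each request. This is only a sketch; the endpoints below are placeholders, and the exact URL format depends on your proxy provider:
import random

# Placeholder endpoints; replace with real ones from your proxy provider
proxy_pool = [
    'http://IP1:PORT',
    'http://IP2:PORT',
]

def random_proxies():
    """Build a proxies dict around a randomly chosen endpoint."""
    chosen = random.choice(proxy_pool)
    return {'http': chosen, 'https': chosen}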
Sending the Request
Now that we have our headers set up, let’s send the request to the Medium article:
import requests
url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'
response = requests.get(url, headers=headers, proxies=proxies)
If you’re using a proxy server, Medium won’t see your real IP address, which is crucial if you want to scrape large volumes of data without triggering blocks.
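Before parsing, it’s also worth confirming the request actually succeeded, since Medium may return a non-200 status if it suspects automated traffic. Here is a minimal sketch with a simple retry loop (the retry count and delay are arbitrary choices, not Medium-specific values):
import time
import requests

def fetch(url, headers, proxies, retries=3, delay=5):
    """Fetch a URL, retrying a few times when the response is not 200 OK."""
    for _ in range(retries):
        response = requests.get(url, headers=headers, proxies=proxies)
        if response.status_code == 200:
            return response
        time.sleep(delay)  # wait a little before trying again
    response.raise_for_status()  # give up and raise the last HTTP error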
Extracting the Essentials from the Article
Once we’ve got the HTML content, we need to parse it to grab the article’s title, author, publication name, date, content, and other elements.
We'll use lxml to do this. Here’s how:
from lxml.html import fromstring
# Parse the HTML content
parser = fromstring(response.text)
# Extract specific data using XPath queries
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]
# Note: the class names below are auto-generated by Medium and change often;
# inspect the page and update this selector if it stops matching
content = '\n '.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))
The power of XPath is that it lets us precisely target HTML elements, even deep within complex structures. The data-testid attributes are relatively stable, but the body-text selector relies on Medium’s generated class names, so expect to revise it from time to time.
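One more caveat: each xpath() call returns a list, and indexing it with [0] will raise an IndexError if the markup changes and nothing matches. A small, hypothetical helper makes the extraction more forgiving:
def first_or_default(results, default=''):
    """Return the first XPath match, stripped, or a default if nothing matched."""
    return results[0].strip() if results else default

title = first_or_default(parser.xpath('//h1[@data-testid="storyTitle"]/text()'))
author = first_or_default(parser.xpath('//a[@data-testid="authorName"]/text()'))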
Writing Your Data to CSV
Once you've extracted the relevant data, you’ll want to save it for analysis or reporting. Pandas is perfect for this. We’ll convert our scraped data into a DataFrame and export it to a CSV file:
import pandas as pd
# Store the data in a dictionary
article_data = {
    'Title': title,
    'Author': author,
    'Publication': publication_name,
    'Date': publication_date,
    'Content': content,
}
# Save the data to a CSV file
df = pd.DataFrame([article_data])
df.to_csv('medium_article_data.csv', index=False)
print("Data saved to medium_article_data.csv")
And just like that, you’ve got your article saved in a CSV file, ready to be analyzed or shared.
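The same pattern scales to more than one article: loop over a list of URLs, collect one dictionary per article, and write them all to a single CSV. A rough sketch, assuming the request and extraction steps above are wrapped in a hypothetical scrape_article() function:
article_urls = [
    'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017',
    # add more article URLs here
]

# scrape_article() is a hypothetical wrapper around the steps shown above
rows = [scrape_article(url) for url in article_urls]

df = pd.DataFrame(rows)
df.to_csv('medium_articles.csv', index=False)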
Full Script
Here’s the complete code for scraping an article from Medium:
import requests
from lxml.html import fromstring
import pandas as pd
# Headers to mimic a browser request
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'en-IN,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36',
}
proxies = {
    'http': 'http://IP:PORT',
    'https': 'http://IP:PORT'
}
# Requesting the page
url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'
response = requests.get(url, headers=headers, proxies=proxies)
# Parsing the page
parser = fromstring(response.text)
# Extract data
title = parser.xpath('//h1[@data-testid="storyTitle"]/text()')[0]
author = parser.xpath('//a[@data-testid="authorName"]/text()')[0]
publication_name = parser.xpath('//a[@data-testid="publicationName"]/p/text()')[0]
publication_date = parser.xpath('//span[@data-testid="storyPublishDate"]/text()')[0]
# Note: these class names are auto-generated by Medium and change often;
# update this selector if it stops matching
content = '\n '.join(parser.xpath('//div[@class="ci bh ga gb gc gd"]/p/text()'))
# Saving data to CSV
article_data = {
    'Title': title,
    'Author': author,
    'Publication': publication_name,
    'Date': publication_date,
    'Content': content,
}
df = pd.DataFrame([article_data])
df.to_csv('medium_article_data.csv', index=False)
print("Data saved to medium_article_data.csv")
Final Thoughts
Scraping Medium, or any website, should be done responsibly. Overloading servers with excessive requests can impact performance, and scraping without permission could violate terms of service. Always check the site’s robots.txt file and review its terms before starting.
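If you want to automate the robots.txt check, Python’s standard library can do it for you with urllib.robotparser. A small sketch; the '*' user agent here simply means “any crawler”:
from urllib import robotparser

# Download and parse Medium's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url('https://medium.com/robots.txt')
rp.read()

article_url = 'https://medium.com/techtofreedom/9-python-built-in-decorators-that-optimize-your-code-significantly-bc3f661e9017'
print('Allowed by robots.txt' if rp.can_fetch('*', article_url) else 'Disallowed by robots.txt')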