
Develop a Web Scraper Using BeautifulSoup and Requests to Extract Data from a Website in Python


To develop a web scraper using BeautifulSoup and Requests in Python, follow the steps below. This example walks you through scraping basic data from a webpage, such as titles, links, and text content.

Step 1: Install Required Libraries

First, install the necessary libraries using pip:

pip install beautifulsoup4 requests

Step 2: Import Libraries

In your Python script, import the required libraries:

import requests
from bs4 import BeautifulSoup

Step 3: Send an HTTP Request to the Website

Use the requests library to get the content of the web page:

url = 'https://example.com'  # Replace with the website URL you want to scrape
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    print("Successfully retrieved the webpage")
else:
    print(f"Failed to retrieve webpage. Status code: {response.status_code}")

Step 4: Parse the HTML Content with BeautifulSoup

Once the page is successfully retrieved, use BeautifulSoup to parse the HTML content:

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Print the page title to verify it's working
print(soup.title.string)

Step 5: Extract Data

Now, let’s extract specific data from the webpage. Suppose we want to extract all the article titles and their associated links from a blog page. First, inspect the webpage with your browser’s developer tools (for example, Chrome DevTools) to determine which HTML tags hold the data you need.

For this example, let’s assume the titles are in <h2> tags and the links are in <a> tags within those headers:

# Find all <h2> tags (assuming the blog titles are inside <h2>)
titles = soup.find_all('h2')

# Loop through each title and extract the text and link
for title in titles:
    # Extract the title text
    title_text = title.get_text()

    # Find the first <a> tag within the <h2> and extract the 'href' attribute
    link = title.find('a')['href']

    print(f'Title: {title_text}')
    print(f'Link: {link}')
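If the page structure is predictable, a CSS selector can target the link elements directly instead of looping over the headings. Here is a short sketch using BeautifulSoup’s select() method; it assumes, as above, that each title link sits inside an <h2>:

# Select every <a> tag that is a descendant of an <h2>
for link_tag in soup.select('h2 a'):
    print(f"Title: {link_tag.get_text()}")
    print(f"Link: {link_tag.get('href', 'No link found')}")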

Step 6: Handle Potential Errors and Edge Cases

To avoid issues like missing links or malformed HTML, you can add some error handling:

for title in titles:
    title_text = title.get_text()

    # Handle cases where there might not be a link
    link_tag = title.find('a')
    if link_tag:
        link = link_tag.get('href', 'No link found')
    else:
        link = 'No link found'

    print(f'Title: {title_text}')
    print(f'Link: {link}')
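Network failures (DNS errors, timeouts, connection resets) are another common edge case. A minimal sketch that catches them around the request itself (the 10-second timeout is an arbitrary choice):

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")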

Step 7: Save the Data to a File (Optional)

You may want to save the scraped data to a file for later use. For example, you can write the titles and links to a CSV file:

import csv

with open('scraped_data.csv', mode='w', newline='', encoding='utf-8') as file:  # utf-8 handles non-ASCII titles
    writer = csv.writer(file)
    writer.writerow(['Title', 'Link'])  # Write header

    for title in titles:
        title_text = title.get_text()
        link_tag = title.find('a')
        link = link_tag['href'] if link_tag else 'No link found'

        writer.writerow([title_text, link])


Full Script Example

Here’s the complete script with all steps:

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://example.com'  # Replace with the actual URL you want to scrape
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title.string)

    titles = soup.find_all('h2')

    # Write the scraped data to a CSV file
    with open('scraped_data.csv', mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Title', 'Link'])  # CSV header

        for title in titles:
            title_text = title.get_text()
            link_tag = title.find('a')
            link = link_tag['href'] if link_tag else 'No link found'
            writer.writerow([title_text, link])

    print("Scraping completed and data saved to 'scraped_data.csv'")
else:
    print(f"Failed to retrieve webpage. Status code: {response.status_code}")

Notes:

  • Respect robots.txt: Always check the website’s robots.txt file to confirm that scraping is allowed for the pages you want to access (see the sketch below).
  • Polite scraping: To avoid overwhelming servers, add time delays between requests with time.sleep(), especially when scraping multiple pages (also shown below).
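As a rough sketch of both points, the standard library’s urllib.robotparser can check robots.txt before you fetch a page, and time.sleep() adds a pause between requests (the delay length and URLs here are placeholders):

import time
from urllib import robotparser

import requests

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder site
rp.read()

pages = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs
for page in pages:
    if rp.can_fetch('*', page):   # is this path allowed for generic crawlers?
        response = requests.get(page)
        # ... parse the response here ...
    time.sleep(2)                 # pause between requests to be polite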
