To create a program to download images from a website using Scrapy in Python, follow this step-by-step guide. Scrapy is a powerful web scraping framework, and it makes tasks like extracting images straightforward.
Step 1: Install Scrapy
First, install Scrapy using pip:
pip install scrapy
Step 2: Create a New Scrapy Project
In your terminal or command prompt, create a new Scrapy project:
scrapy startproject image_downloader
Navigate into the newly created image_downloader folder:
cd image_downloader
Step 3: Define an Image Spider
Next, create a spider to crawl the website and download images. Inside the spiders/ directory, create a new file called image_spider.py.
cd image_downloader/spiders
touch image_spider.py
Edit the image_spider.py file and add the following code:
import scrapy


class ImageSpider(scrapy.Spider):
    name = 'image_spider'

    # Define the URL you want to scrape
    start_urls = ['https://example.com']  # Replace with the actual URL

    def parse(self, response):
        # Extract image URLs using a CSS selector
        image_urls = response.css('img::attr(src)').getall()
        # Ensure the image URLs are absolute (urljoin resolves relative URLs against the page URL)
        image_urls = [response.urljoin(url) for url in image_urls]
        yield {
            'image_urls': image_urls
        }
Step 4: Enable the Image Pipeline
To download the images automatically, Scrapy provides an ImagesPipeline. Enable it by editing the settings.py file in the image_downloader directory. Note that the ImagesPipeline depends on the Pillow library, so install it first with pip install Pillow if you haven't already. In image_downloader/settings.py, update the settings:
# Enable the ImagesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Define the folder where images will be stored
IMAGES_STORE = 'downloaded_images'

# Optional: skip images smaller than these dimensions
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
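The pipeline supports a few more optional settings worth knowing about, such as thumbnail generation and download expiry. Here is a short sketch; the thumbnail sizes and expiry period are illustrative values, not recommendations:

# Optional: generate thumbnails alongside each full-size image
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Optional: don't re-download images fetched within the last 90 days
IMAGES_EXPIRES = 90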
Step 5: Customize the Image Storage Path (Optional)
If you want to customize how images are saved, you can extend the ImagesPipeline. Create a new pipeline class in the project's pipelines.py file (the settings entry below assumes it lives there).
For example, this code can modify how the file paths are generated:
from scrapy.pipelines.images import ImagesPipeline

class CustomImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the last segment of the image URL
        image_name = request.url.split('/')[-1]
        return f'custom_folder/{image_name}'

Then enable the custom pipeline in settings.py in place of the default one:

ITEM_PIPELINES = {
    'image_downloader.pipelines.CustomImagePipeline': 1,
}
Step 6: Run the Spider
To run the spider and start downloading the images, use the following command from the project root (the folder containing scrapy.cfg):
scrapy crawl image_spider
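If you also want a record of the scraped image URLs, Scrapy's feed exports can write the yielded items to a file. For example, the following uses the -O flag (available in Scrapy 2.1+) to write, and overwrite, a JSON file:

scrapy crawl image_spider -O images.json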
Step 7: Check the Downloaded Images
Once the spider finishes running, you can find the downloaded images in the downloaded_images folder (or the path you specified in IMAGES_STORE). By default, the ImagesPipeline saves each image in a full/ subdirectory, under a filename derived from a SHA-1 hash of its URL.
Full Example Spider Code
Here’s the full code for image_spider.py:
import scrapy


class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com']  # Replace with the target URL

    def parse(self, response):
        # Extract all image URLs from the page
        image_urls = response.css('img::attr(src)').getall()
        # Convert relative URLs to absolute URLs
        image_urls = [response.urljoin(url) for url in image_urls]
        # Return the image URLs as an item
        yield {
            'image_urls': image_urls
        }
Notes:
- Respect robots.txt: Projects created with scrapy startproject obey the site’s robots.txt by default (ROBOTSTXT_OBEY = True in settings.py). Ensure you are following the site’s scraping policies.
- Pagination: If the website has multiple pages, you can extend the spider to follow pagination links by adding logic inside the parse() method; a minimal sketch is shown after these notes.
- Custom image filters: Scrapy’s ImagesPipeline can also filter based on file type, size, and other criteria.
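Here is a minimal sketch of the pagination idea above: replace ImageSpider’s parse() method with something like this. The a.next selector is an assumption for illustration; swap it for whatever matches the target site’s “next page” link:

def parse(self, response):
    # Collect and yield the image URLs on the current page
    image_urls = [response.urljoin(url) for url in response.css('img::attr(src)').getall()]
    yield {'image_urls': image_urls}

    # Follow the "next page" link, if any ('a.next' is a hypothetical selector)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)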
This setup will allow you to scrape images and download them into a folder.