
Create a Program to Download Images from a Website Using Scrapy in Python


To create a program to download images from a website using Scrapy in Python, follow this step-by-step guide. Scrapy is a powerful web scraping framework, and it makes tasks like extracting images straightforward.

Step 1: Install Scrapy

First, install Scrapy using pip. The ImagesPipeline used later in this guide depends on Pillow for image processing, so install it as well:

pip install scrapy pillow

Step 2: Create a New Scrapy Project

In your terminal or command prompt, create a new Scrapy project:

scrapy startproject image_downloader

Navigate into the newly created image_downloader folder:

cd image_downloader
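The startproject command generates a standard layout. The files referenced in the following steps live here:

```text
image_downloader/
├── scrapy.cfg            # deploy configuration
└── image_downloader/     # the project's Python package
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py      # custom pipelines go here (Step 5)
    ├── settings.py       # project settings (Step 4)
    └── spiders/          # spider modules (Step 3)
        └── __init__.py
```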

Step 3: Define an Image Spider

Next, create a spider to crawl the website and download images. Inside the spiders/ directory, create a new file called image_spider.py.

cd image_downloader/spiders
touch image_spider.py

Edit the image_spider.py file and add the following code:

import scrapy

class ImageSpider(scrapy.Spider):
    name = 'image_spider'

    # Define the URL you want to scrape
    start_urls = ['https://example.com']  # Replace with the actual URL

    def parse(self, response):
        # Extract image URLs using an XPath or CSS selector
        image_urls = response.css('img::attr(src)').getall()

        # Ensure that the image URLs are complete (if relative, prepend the base URL)
        image_urls = [response.urljoin(url) for url in image_urls]

        yield {
            'image_urls': image_urls
        }

Step 4: Enable the Image Pipeline

To automatically download the images, Scrapy provides a built-in ImagesPipeline. Enable it by editing the settings.py file in the image_downloader package. In image_downloader/settings.py, update the settings:

# Enable the ImagesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Define the folder where images will be stored
IMAGES_STORE = 'downloaded_images'

# Optional: Filter based on minimum dimensions of images
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

Step 5: Customize the Image Storage Path (Optional)

If you want to customize how images are saved, you can extend the ImagesPipeline. Add a new pipeline class to the project's pipelines.py file (the ITEM_PIPELINES path used when enabling it assumes it lives there).

For example, this code can modify how the file paths are generated:

class CustomImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Customize the file name based on image URL or other criteria
        image_name = request.url.split('/')[-1]
        return f'custom_folder/{image_name}'

Then register the custom pipeline in image_downloader/settings.py, replacing the default ImagesPipeline entry from Step 4 so the images are not processed twice:

ITEM_PIPELINES = {
    'image_downloader.pipelines.CustomImagePipeline': 1,
}

Step 6: Run the Spider

To run the spider and start downloading the images, use the following command:

scrapy crawl image_spider

Step 7: Check the Downloaded Images

Once the spider finishes running, you can find the downloaded images in the downloaded_images folder (or the path you specified in IMAGES_STORE).
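By default the ImagesPipeline stores files under a full/ subdirectory of IMAGES_STORE, named by the SHA-1 hash of the image URL. Here is a small standard-library sketch to list everything the pipeline saved:

```python
import os

def list_downloaded_images(store_dir='downloaded_images'):
    """Walk the IMAGES_STORE folder and return relative paths of all files."""
    found = []
    for root, _dirs, files in os.walk(store_dir):
        for name in files:
            path = os.path.join(root, name)
            found.append(os.path.relpath(path, store_dir))
    return sorted(found)

# Print every image the pipeline saved
for path in list_downloaded_images():
    print(path)
```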


Full Example Spider Code

Here’s the full code for image_spider.py:

import scrapy

class ImageSpider(scrapy.Spider):
    name = 'image_spider'

    start_urls = ['https://example.com']  # Replace with the target URL

    def parse(self, response):
        # Extract all image URLs from the page
        image_urls = response.css('img::attr(src)').getall()

        # Convert relative URLs to absolute URLs
        image_urls = [response.urljoin(url) for url in image_urls]

        # Return the image URLs as an item
        yield {
            'image_urls': image_urls
        }

Notes:

  1. Respect Robots.txt: Scrapy will obey the website’s robots.txt by default. Ensure you are following the site’s scraping policies.
  2. Pagination: If the website has multiple pages, you can extend the spider to follow pagination links by adding logic inside the parse() method to handle multiple pages.
  3. Custom Image Filters: Scrapy’s ImagesPipeline can also filter based on file type, size, and other criteria.

This setup will allow you to scrape images and download them into a folder.
