To create a program to download images from a website using Scrapy in Python, follow this step-by-step guide. Scrapy is a powerful web scraping framework, and it makes tasks like extracting images straightforward.
Step 1: Install Scrapy
First, install Scrapy using pip:
pip install scrapy
Step 2: Create a New Scrapy Project
In your terminal or command prompt, create a new Scrapy project:
scrapy startproject image_downloader
Navigate into the newly created image_downloader folder:
cd image_downloader
Step 3: Define an Image Spider
Next, create a spider to crawl the website and download images. Inside the spiders/ directory, create a new file called image_spider.py.
cd image_downloader/spiders
touch image_spider.py
Edit the image_spider.py file and add the following code:
import scrapy


class ImageSpider(scrapy.Spider):
    name = 'image_spider'

    # Define the URL you want to scrape
    start_urls = ['https://example.com']  # Replace with the actual URL

    def parse(self, response):
        # Extract image URLs using a CSS selector
        image_urls = response.css('img::attr(src)').getall()
        # Ensure the image URLs are absolute (urljoin resolves relative URLs against the page URL)
        image_urls = [response.urljoin(url) for url in image_urls]
        yield {
            'image_urls': image_urls
        }
Step 4: Enable the Image Pipeline
To download the images automatically, Scrapy provides an ImagesPipeline. Enable it by editing the settings.py file in the image_downloader directory. Note that the ImagesPipeline depends on the Pillow library, so install it first with pip install Pillow if you haven't already. In image_downloader/settings.py, update the settings:
# Enable the ImagesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Define the folder where images will be stored
IMAGES_STORE = 'downloaded_images'

# Optional: skip images smaller than these dimensions
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
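The pipeline supports a few more optional settings worth knowing about, such as thumbnail generation and download expiry. Here is a short sketch; the thumbnail sizes and expiry period are illustrative values, not recommendations:

# Optional: generate thumbnails alongside each full-size image
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Optional: don't re-download images fetched within the last 90 days
IMAGES_EXPIRES = 90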
Step 5: Customize the Image Storage Path (Optional)
If you want to customize how images are saved, you can extend the ImagesPipeline. Create a new pipeline class in the project's pipelines.py file (the settings entry below assumes it lives there).
For example, this code can modify how the file paths are generated:
from scrapy.pipelines.images import ImagesPipeline

class CustomImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the last segment of the image URL
        image_name = request.url.split('/')[-1]
        return f'custom_folder/{image_name}'

Then enable the custom pipeline in settings.py in place of the default one:

ITEM_PIPELINES = {
    'image_downloader.pipelines.CustomImagePipeline': 1,
}
Step 6: Run the Spider
To run the spider and start downloading the images, use the following command from the project root (the folder containing scrapy.cfg):
scrapy crawl image_spider
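If you also want a record of the scraped image URLs, Scrapy's feed exports can write the yielded items to a file. For example, the following uses the -O flag (available in Scrapy 2.1+) to write, and overwrite, a JSON file:

scrapy crawl image_spider -O images.json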
Step 7: Check the Downloaded Images
Once the spider finishes running, you can find the downloaded images in the downloaded_images folder (or the path you specified in IMAGES_STORE). By default, the ImagesPipeline saves each image in a full/ subdirectory, under a filename derived from a SHA-1 hash of its URL.
Full Example Spider Code
Here’s the full code for image_spider.py:
import scrapy


class ImageSpider(scrapy.Spider):
    name = 'image_spider'
    start_urls = ['https://example.com']  # Replace with the target URL

    def parse(self, response):
        # Extract all image URLs from the page
        image_urls = response.css('img::attr(src)').getall()
        # Convert relative URLs to absolute URLs
        image_urls = [response.urljoin(url) for url in image_urls]
        # Return the image URLs as an item
        yield {
            'image_urls': image_urls
        }
Notes:
- Respect robots.txt: Projects created with scrapy startproject obey the site’s robots.txt by default (ROBOTSTXT_OBEY = True in settings.py). Ensure you are following the site’s scraping policies.
- Pagination: If the website has multiple pages, you can extend the spider to follow pagination links by adding logic inside the parse() method; a minimal sketch is shown after these notes.
- Custom image filters: Scrapy’s ImagesPipeline can also filter based on file type, size, and other criteria.
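Here is a minimal sketch of the pagination idea above: replace ImageSpider’s parse() method with something like this. The a.next selector is an assumption for illustration; swap it for whatever matches the target site’s “next page” link:

def parse(self, response):
    # Collect and yield the image URLs on the current page
    image_urls = [response.urljoin(url) for url in response.css('img::attr(src)').getall()]
    yield {'image_urls': image_urls}

    # Follow the "next page" link, if any ('a.next' is a hypothetical selector)
    next_page = response.css('a.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)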
This setup will allow you to scrape images and download them into a folder.