🕷️ Scrapy Framework: Industrial-Strength Web Scraping
Scrapy is the Ferrari of web scraping frameworks: powerful, fast, and built for scale. While BeautifulSoup is like a Swiss Army knife for parsing HTML, Scrapy is a complete factory for extracting, processing, and storing web data. It handles concurrent requests, automatic retries, and data pipelines out of the box, and extensions such as Scrapy-Redis add distributed crawling. Let's master the art of industrial web scraping!
The Scrapy Architecture
Think of Scrapy as a well-orchestrated assembly line where spiders crawl websites, extractors pull out data, pipelines process it, and exporters save it - all running concurrently with military precision. It's not just a tool; it's a complete ecosystem for professional web scraping at any scale!
Real-World Scenario: The E-Commerce Intelligence Platform
You're building a competitive intelligence platform that monitors thousands of e-commerce sites, tracking prices, inventory, reviews, and product launches. You need to handle JavaScript-rendered pages, rotating proxies, CAPTCHA challenges, rate limiting, and real-time data processing. Scrapy will be your industrial-strength solution for this massive undertaking!
# First, install Scrapy: pip install scrapy scrapy-splash scrapy-redis scrapy-rotating-proxies
import scrapy
from scrapy import Spider, Request, FormRequest
from scrapy.crawler import CrawlerProcess, CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
# Processors live in the separate itemloaders package; Identity is used by ProductLoader
from itemloaders.processors import TakeFirst, MapCompose, Join, Compose, Identity
from scrapy.exceptions import DropItem, CloseSpider
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import open_in_browser
import json
import re
from typing import Dict, List, Optional, Any, Generator
from datetime import datetime
from urllib.parse import urljoin, urlparse
import hashlib
from w3lib.html import remove_tags
import logging
# ==================== Items Definition ====================
class ProductItem(Item):
    """Define the structure of scraped product data."""
    # Basic Information
    url = Field()
    product_id = Field()
    name = Field()
    brand = Field()
    category = Field()
    subcategory = Field()
    # Pricing
    price = Field()
    original_price = Field()
    discount = Field()
    currency = Field()
    # Availability
    in_stock = Field()
    stock_quantity = Field()
    availability = Field()
    # Product Details
    description = Field()
    features = Field()
    specifications = Field()
    # Images
    image_urls = Field()
    images = Field()  # Downloaded images
    # Reviews
    rating = Field()
    review_count = Field()
    reviews = Field()
    # Metadata
    scraped_at = Field()
    spider_name = Field()


class ReviewItem(Item):
    """Review data structure."""
    product_id = Field()
    reviewer_name = Field()
    rating = Field()
    title = Field()
    content = Field()
    date = Field()
    verified_purchase = Field()
    helpful_count = Field()
# ==================== Item Loaders ====================
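def parse_price(price_string: Optional[str]) -> Optional[float]:
    """Parse a price such as '$1,234.56' or '€ 39,99' into a float."""
    if not price_string:
        return None
    # Keep digits and separators, then trim stray separators left behind by
    # prefixes such as "Rs." (otherwise "Rs. 2999" would parse as 0.2999)
    price_string = re.sub(r'[^\d.,]', '', price_string).strip('.,')
    if ',' in price_string and '.' in price_string:
        # Whichever separator appears last is the decimal separator
        if price_string.rfind(',') > price_string.rfind('.'):
            price_string = price_string.replace('.', '').replace(',', '.')
        else:
            price_string = price_string.replace(',', '')
    elif ',' in price_string:
        # A single comma followed by one or two digits is a decimal
        # separator (European format); otherwise it separates thousands
        if price_string.count(',') == 1 and len(price_string.split(',')[1]) <= 2:
            price_string = price_string.replace(',', '.')
        else:
            price_string = price_string.replace(',', '')
    try:
        return float(price_string)
    except ValueError:
        return None


# parse_price is defined first because the loader's class body uses it
class ProductLoader(ItemLoader):
    """Custom item loader with processing."""
    default_item_class = ProductItem
    default_output_processor = TakeFirst()
    # Custom processors
    name_in = MapCompose(remove_tags, str.strip)
    # str() lets the price processors accept numbers as well as strings
    price_in = MapCompose(str, remove_tags, parse_price)
    original_price_in = MapCompose(str, remove_tags, parse_price)
    description_in = MapCompose(remove_tags, str.strip)
    features_out = Identity()  # Keep as list
    image_urls_out = Identity()  # Keep as list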
# ==================== Base Spider ====================
class BaseEcommerceSpider(scrapy.Spider):
    """Base spider with common functionality."""

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 16,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 0.5,
        'AUTOTHROTTLE_MAX_DELAY': 10,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 4,
        'ROBOTSTXT_OBEY': True,
        'USER_AGENT': 'EcommerceCrawler (+http://example.com/bot)',
        # Retry configuration
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429],
        # Cache
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        # Export
        'FEED_EXPORT_ENCODING': 'utf-8',
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'products_scraped': 0,
            'products_failed': 0,
            'start_time': datetime.now()
        }

    def parse_product(self, response):
        """Parse product page - to be implemented by subclasses."""
        raise NotImplementedError

    def extract_json_ld(self, response) -> Dict:
        """Extract the first JSON-LD block that describes a Product."""
        json_ld = response.xpath('//script[@type="application/ld+json"]/text()').getall()
        for json_text in json_ld:
            try:
                data = json.loads(json_text)
            except json.JSONDecodeError:
                # Skip malformed blocks instead of aborting the page
                continue
            if isinstance(data, dict) and data.get('@type') == 'Product':
                return data
        return {}
# ==================== Amazon Spider ====================
class AmazonSpider(BaseEcommerceSpider):
    """Spider for Amazon products."""
    name = 'amazon'
    allowed_domains = ['amazon.com']

    custom_settings = {
        **BaseEcommerceSpider.custom_settings,
        'DOWNLOAD_DELAY': 2,  # Be more polite with Amazon
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    def __init__(self, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.category = category
        # Start URLs based on category
        if category:
            self.start_urls = [
                f'https://www.amazon.com/s?k={category}&page=1'
            ]
        else:
            self.start_urls = [
                'https://www.amazon.com/Best-Sellers/zgbs'
            ]

    def parse(self, response):
        """Parse search/category page."""
        # Extract product URLs
        product_links = response.css('div[data-component-type="s-search-result"] h2 a::attr(href)').getall()
        for link in product_links:
            url = urljoin(response.url, link)
            yield Request(
                url=url,
                callback=self.parse_product,
                meta={'product_url': url}
            )
        # Follow pagination
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page:
            yield Request(
                urljoin(response.url, next_page),
                callback=self.parse
            )
    def parse_product(self, response):
        """Parse Amazon product page."""
        loader = ProductLoader(item=ProductItem(), response=response)
        # Basic information
        loader.add_value('url', response.url)
        loader.add_value('product_id', self.extract_asin(response.url))
        loader.add_css('name', 'span#productTitle::text')
        loader.add_css('brand', 'a#bylineInfo::text')
        # Pricing
        price = response.css('span.a-price-whole::text').get()
        if not price:
            price = response.css('span.a-price.a-text-price.a-size-medium::text').get()
        loader.add_value('price', parse_price(price))
        loader.add_css('original_price', 'span.a-price.a-text-price span.a-offscreen::text')
        loader.add_value('currency', 'USD')
        # Availability
        availability = response.css('div#availability span::text').get()
        loader.add_value('in_stock', 'In Stock' in availability if availability else False)
        loader.add_value('availability', availability)
        # Product details
        loader.add_css('description', 'div#feature-bullets ul span.a-list-item::text')
        # Features from bullet points
        features = response.css('div#feature-bullets ul span.a-list-item::text').getall()
        loader.add_value('features', [f.strip() for f in features if f.strip()])
        # Images
        image_urls = response.css('div#altImages img::attr(src)').getall()
        # Convert thumbnail URLs to full size
        image_urls = [self.get_full_image_url(url) for url in image_urls]
        loader.add_value('image_urls', image_urls)
        # Reviews
        loader.add_css('rating', 'span.a-icon-star span.a-icon-alt::text')
        loader.add_css('review_count', 'span#acrCustomerReviewText::text')
        # Metadata
        loader.add_value('scraped_at', datetime.now())
        loader.add_value('spider_name', self.name)
        # Categories: first breadcrumb is the category, last is the subcategory
        categories = response.css('div#wayfinding-breadcrumbs_feature_div a::text').getall()
        if categories:
            loader.add_value('category', categories[0])
            loader.add_value('subcategory', categories[-1] if len(categories) > 1 else None)
        product = loader.load_item()
        # Yield product
        yield product
        # Scrape reviews; .get() avoids a KeyError when no ASIN was found in the URL
        yield from self.parse_reviews(response, product.get('product_id'))
    def extract_asin(self, url: str) -> Optional[str]:
        """Extract ASIN from Amazon URL, or None if the URL has no /dp/ segment."""
        match = re.search(r'/dp/([A-Z0-9]{10})', url)
        return match.group(1) if match else None

    def get_full_image_url(self, thumbnail_url: str) -> str:
        """Convert thumbnail URL to full size image."""
        # Amazon image URL manipulation: drop the size suffix after "._"
        if '._' in thumbnail_url:
            base_url = thumbnail_url.split('._')[0]
            return base_url + '.jpg'
        return thumbnail_url

    def parse_reviews(self, response, product_id):
        """Parse product reviews."""
        reviews_url = response.css('a[data-hook="see-all-reviews-link-foot"]::attr(href)').get()
        if reviews_url:
            yield Request(
                urljoin(response.url, reviews_url),
                callback=self.parse_reviews_page,
                meta={'product_id': product_id}
            )

    def parse_reviews_page(self, response):
        """Parse reviews listing page."""
        product_id = response.meta['product_id']
        for review in response.css('div[data-hook="review"]'):
            review_item = ReviewItem()
            review_item['product_id'] = product_id
            review_item['reviewer_name'] = review.css('span.a-profile-name::text').get()
            review_item['rating'] = review.css('i[data-hook="review-star-rating"] span::text').get()
            review_item['title'] = review.css('a[data-hook="review-title"] span::text').get()
            review_item['content'] = ' '.join(review.css('span[data-hook="review-body"] span::text').getall())
            review_item['date'] = review.css('span[data-hook="review-date"]::text').get()
            review_item['verified_purchase'] = bool(review.css('span[data-hook="avp-badge"]'))
            yield review_item
        # Next page of reviews
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page:
            yield Request(
                urljoin(response.url, next_page),
                callback=self.parse_reviews_page,
                meta={'product_id': product_id}
            )
# ==================== Generic E-commerce Spider ====================
# Inherits from BaseEcommerceSpider as well, because parse_product relies on
# extract_json_ld, which is defined there
class GenericEcommerceSpider(BaseEcommerceSpider, CrawlSpider):
    """Generic spider using rules for any e-commerce site."""
    name = 'generic_ecommerce'
    # Override these in subclasses or initialization
    allowed_domains = []
    start_urls = []
    # Rules for following links
    rules = (
        # Follow category pages
        Rule(
            LinkExtractor(restrict_css='nav.categories a'),
            follow=True
        ),
        # Follow pagination
        Rule(
            LinkExtractor(restrict_css='a.pagination, a.next'),
            follow=True
        ),
        # Parse product pages
        Rule(
            LinkExtractor(restrict_css='a.product-link, div.product a'),
            callback='parse_product',
            follow=False
        ),
    )

    def parse_product(self, response):
        """Generic product parser using common patterns."""
        loader = ProductLoader(item=ProductItem(), response=response)
        # Try to extract from JSON-LD
        json_data = self.extract_json_ld(response)
        if json_data:
            loader.add_value('name', json_data.get('name'))
            loader.add_value('description', json_data.get('description'))
            # "brand" may be a plain string or a nested object
            brand = json_data.get('brand')
            loader.add_value('brand', brand.get('name') if isinstance(brand, dict) else brand)
            # "offers" may be a single object or a list of offers
            offers = json_data.get('offers') or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            loader.add_value('price', offers.get('price'))
            loader.add_value('currency', offers.get('priceCurrency'))
            loader.add_value('availability', offers.get('availability'))
            aggregate_rating = json_data.get('aggregateRating') or {}
            loader.add_value('rating', aggregate_rating.get('ratingValue'))
            loader.add_value('review_count', aggregate_rating.get('reviewCount'))
        # Fallback to CSS/XPath selectors
        else:
            # Common patterns
            loader.add_css('name', 'h1::text, h1.product-title::text, [itemprop="name"]::text')
            loader.add_css('price', '.price::text, [itemprop="price"]::text, .product-price::text')
            loader.add_css('description', '.description::text, [itemprop="description"]::text')
            loader.add_css('rating', '[itemprop="ratingValue"]::text, .rating::text')
            loader.add_css('review_count', '[itemprop="reviewCount"]::text')
            loader.add_css('brand', '[itemprop="brand"]::text, .brand::text')
        # Always add metadata
        loader.add_value('url', response.url)
        loader.add_value('scraped_at', datetime.now())
        loader.add_value('spider_name', self.name)
        # Extract images
        image_urls = response.css('img.product-image::attr(src), [itemprop="image"]::attr(src)').getall()
        loader.add_value('image_urls', [urljoin(response.url, url) for url in image_urls])
        yield loader.load_item()
# ==================== Pipelines ====================
class ValidationPipeline:
    """Validate and clean scraped items."""

    def process_item(self, item, spider):
        """Process and validate item."""
        # Reviews have no name/url fields, so only products are validated here
        if not isinstance(item, ProductItem):
            return item
        # Validate required fields
        if not item.get('name') or not item.get('url'):
            raise DropItem(f"Missing required fields: {item}")
        # Clean and normalize data
        if item.get('price'):
            # Ensure price is float
            if isinstance(item['price'], str):
                item['price'] = parse_price(item['price'])
        # Parse rating
        if item.get('rating'):
            rating = item['rating']
            if isinstance(rating, str):
                # Extract numeric rating (e.g., "4.5 out of 5 stars")
                match = re.search(r'([\d.]+)', rating)
                if match:
                    item['rating'] = float(match.group(1))
        # Parse review count
        if item.get('review_count'):
            review_count = item['review_count']
            if isinstance(review_count, str):
                # Extract number (e.g., "1,234 customer reviews")
                review_count = re.sub(r'[^\d]', '', review_count)
                if review_count:
                    item['review_count'] = int(review_count)
        # Calculate discount if both prices available
        if item.get('price') and item.get('original_price'):
            original = parse_price(str(item['original_price']))
            current = item['price']
            if original and current and original > current:
                item['discount'] = round((original - current) / original * 100, 2)
        return item
class DatabasePipeline:
    """Save items to database."""

    def __init__(self, db_settings, stats=None):
        self.db_settings = db_settings
        # Scrapy's stats collector; works for any spider class
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        """Create pipeline from crawler."""
        db_settings = crawler.settings.getdict("DATABASE")
        return cls(db_settings, crawler.stats)

    def open_spider(self, spider):
        """Initialize database connection."""
        # This would connect to your database
        # Example: PostgreSQL, MongoDB, etc.
        pass

    def close_spider(self, spider):
        """Close database connection."""
        pass

    def process_item(self, item, spider):
        """Save item to database."""
        # Insert item into database
        # Handle duplicates, updates, etc.
        if self.stats is not None:
            self.stats.inc_value('products_scraped')
        spider.logger.info(f"Saved product: {item.get('name')}")
        return item
class DuplicatesPipeline:
    """Filter duplicate items."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        """Check for duplicates."""
        # Reviews share their product's ID, so only products are de-duplicated
        if not isinstance(item, ProductItem):
            return item
        # Create unique identifier
        if item.get('product_id'):
            unique_id = item['product_id']
        else:
            # Fallback to URL hash
            unique_id = hashlib.md5(item['url'].encode()).hexdigest()
        if unique_id in self.ids_seen:
            raise DropItem(f"Duplicate item found: {unique_id}")
        self.ids_seen.add(unique_id)
        return item
class ImagePipeline:
    """Download and process product images."""

    def process_item(self, item, spider):
        """Download images if URLs present."""
        if item.get('image_urls'):
            # This would typically use Scrapy's ImagesPipeline
            # For now, just validate URLs
            valid_urls = []
            for url in item['image_urls']:
                if url and url.startswith(('http://', 'https://')):
                    valid_urls.append(url)
            item['image_urls'] = valid_urls
        return item
# ==================== Middlewares ====================
import random


class RotateUserAgentMiddleware:
    """Rotate user agents for each request."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        """Create middleware from crawler."""
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
        ]
        return cls(user_agents)

    def process_request(self, request, spider):
        """Add random user agent to request."""
        ua = random.choice(self.user_agents)
        request.headers['User-Agent'] = ua


class ProxyMiddleware:
    """Add proxy to requests."""

    def process_request(self, request, spider):
        """Add proxy to request."""
        # This would select from a pool of proxies
        # proxy = self.get_random_proxy()
        # request.meta['proxy'] = proxy
        pass


class JavaScriptMiddleware:
    """Handle JavaScript-rendered pages using Splash."""

    def process_request(self, request, spider):
        """Mark the request for rendering by Splash."""
        # Only takes effect when scrapy-splash's own middlewares and
        # SPLASH_URL are also configured in the project settings
        if request.meta.get('javascript'):
            # Use Splash to render JavaScript
            request.meta['splash'] = {
                'args': {
                    'html': 1,
                    'wait': 2,
                    'render_all': 1
                }
            }
# ==================== Spider Manager ====================
class SpiderManager:
    """Manage and run multiple spiders."""

    def __init__(self):
        settings = get_project_settings()
        self.process = CrawlerProcess(settings)
        self.runner = CrawlerRunner(settings)

    def run_spider(self, spider_class, **kwargs):
        """Run a single spider and block until it finishes."""
        self.process.crawl(spider_class, **kwargs)
        # start() runs the Twisted reactor, which cannot be restarted,
        # so only one run_* method can be called per Python process
        self.process.start()

    def run_multiple_spiders(self, spider_configs: List[Dict]):
        """Run multiple spiders concurrently."""
        for config in spider_configs:
            spider_class = config['spider']
            kwargs = config.get('kwargs', {})
            self.process.crawl(spider_class, **kwargs)
        self.process.start()

    def run_spider_async(self, spider_class, **kwargs):
        """Run a spider through CrawlerRunner, stopping the reactor when done."""
        from twisted.internet import reactor
        d = self.runner.crawl(spider_class, **kwargs)
        d.addBoth(lambda _: reactor.stop())
        # reactor.run() still blocks this thread until the crawl completes
        reactor.run()
# ==================== Settings Configuration ====================
def get_scrapy_settings():
    """Get Scrapy settings configuration."""
    return {
        # Basic settings
        'BOT_NAME': 'ecommerce_scraper',
        'ROBOTSTXT_OBEY': True,
        'CONCURRENT_REQUESTS': 16,
        'DOWNLOAD_DELAY': 1,
        # AutoThrottle
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 0.5,
        'AUTOTHROTTLE_MAX_DELAY': 10,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 4,
        'AUTOTHROTTLE_DEBUG': True,
        # Pipelines (lower numbers run first)
        'ITEM_PIPELINES': {
            'myproject.pipelines.ValidationPipeline': 100,
            'myproject.pipelines.DuplicatesPipeline': 200,
            'myproject.pipelines.ImagePipeline': 300,
            'myproject.pipelines.DatabasePipeline': 400,
        },
        # Middlewares
        'DOWNLOADER_MIDDLEWARES': {
            'myproject.middlewares.RotateUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'myproject.middlewares.JavaScriptMiddleware': 420,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
        },
        # Cache
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        'HTTPCACHE_DIR': 'httpcache',
        # Export (FEEDS replaces the deprecated FEED_FORMAT / FEED_URI settings)
        'FEEDS': {
            'output/%(name)s_%(time)s.json': {'format': 'json'},
        },
        'FEED_EXPORT_ENCODING': 'utf-8',
        # Database (placeholder credentials; load real ones from the environment)
        'DATABASE': {
            'host': 'localhost',
            'port': 5432,
            'database': 'ecommerce',
            'username': 'scraper',
            'password': 'password'
        },
        # Logging
        'LOG_LEVEL': 'INFO',
        'LOG_FILE': 'logs/scrapy.log',
    }
# ==================== Scrapy Project Structure ====================
class ScrapyProjectGenerator:
    """Generate Scrapy project structure."""

    @staticmethod
    def create_project(project_name: str, path: str = '.'):
        """Create a new Scrapy project structure."""
        import textwrap
        from pathlib import Path
        project_path = Path(path) / project_name
        # Create directory structure
        directories = [
            project_path / project_name,
            project_path / project_name / 'spiders',
            project_path / 'logs',
            project_path / 'output',
            project_path / 'httpcache'
        ]
        for directory in directories:
            directory.mkdir(parents=True, exist_ok=True)
        # Create empty module files; pipelines and middlewares are modules,
        # matching the dotted paths used in settings.py
        module_files = [
            project_path / project_name / '__init__.py',
            project_path / project_name / 'pipelines.py',
            project_path / project_name / 'middlewares.py',
            project_path / project_name / 'spiders' / '__init__.py'
        ]
        for module_file in module_files:
            module_file.touch()
        # Create settings.py
        settings_content = textwrap.dedent('''\
            # Scrapy settings for {project_name} project
            BOT_NAME = '{project_name}'
            SPIDER_MODULES = ['{project_name}.spiders']
            NEWSPIDER_MODULE = '{project_name}.spiders'
            # Obey robots.txt rules
            ROBOTSTXT_OBEY = True
            # Configure pipelines
            ITEM_PIPELINES = {{
                '{project_name}.pipelines.ValidationPipeline': 300,
            }}
            # Configure a delay for requests for the same website (default: 0)
            DOWNLOAD_DELAY = 1
            # The download delay setting will honor only one of:
            CONCURRENT_REQUESTS_PER_DOMAIN = 2
        ''').format(project_name=project_name)
        with open(project_path / project_name / 'settings.py', 'w') as f:
            f.write(settings_content)
        # Create scrapy.cfg
        cfg_content = textwrap.dedent(f'''\
            # Automatically created by: scrapy startproject
            [settings]
            default = {project_name}.settings
            [deploy]
            #url = http://localhost:6800/
            project = {project_name}
        ''')
        with open(project_path / 'scrapy.cfg', 'w') as f:
            f.write(cfg_content)
        print(f"Created Scrapy project: {project_name}")
        print("Project structure:")
        print(f"  {project_name}/")
        print("    scrapy.cfg            # deploy configuration file")
        print(f"    {project_name}/       # project's Python module")
        print("      __init__.py")
        print("      settings.py       # project settings file")
        print("      pipelines.py      # project pipelines file")
        print("      middlewares.py    # project middlewares file")
        print("      spiders/          # directory for spiders")
        print("        __init__.py")
# Example usage
if __name__ == "__main__":
    print("🕷️ Scrapy Framework Examples\n")
    # Example 1: Create Scrapy project structure
    print("1️⃣ Creating Scrapy Project Structure:")
    generator = ScrapyProjectGenerator()
    # generator.create_project("ecommerce_scraper", "./scrapy_projects")
    print("   Project structure would be created (commented out for safety)")
    # Example 2: Define items
    print("\n2️⃣ Defining Items:")
    product = ProductItem()
    product['name'] = "Example Product"
    product['price'] = 99.99
    product['in_stock'] = True
    print(f"   Product Item: {dict(product)}")
    # Example 3: Item processing
    print("\n3️⃣ Item Processing Pipeline:")
    pipeline = ValidationPipeline()
    # Create test item
    test_item = ProductItem({
        'name': 'Test Product',
        'url': 'https://example.com/product',
        'price': '$49.99',
        'rating': '4.5 out of 5 stars',
        'review_count': '1,234 customer reviews'
    })
    # Process item
    processed = pipeline.process_item(test_item, None)
    print("   Processed item:")
    print(f"   Price: {processed.get('price')} (parsed from string)")
    print(f"   Rating: {processed.get('rating')} (extracted number)")
    print(f"   Reviews: {processed.get('review_count')} (parsed count)")
    # Example 4: URL patterns
    print("\n4️⃣ Spider URL Patterns:")
    spider = AmazonSpider()
    test_url = "https://www.amazon.com/dp/B08N5WRWNW"
    asin = spider.extract_asin(test_url)
    print(f"   Extracted ASIN: {asin}")
    thumbnail = "https://images-na.ssl-images-amazon.com/images/I/71Echo-Vd7L._AC_UL320_.jpg"
    full_image = spider.get_full_image_url(thumbnail)
    print(f"   Full image URL: {full_image[:50]}...")
    # Example 5: Settings configuration
    print("\n5️⃣ Scrapy Settings:")
    settings = get_scrapy_settings()
    print("   Key settings:")
    print(f"   Concurrent requests: {settings['CONCURRENT_REQUESTS']}")
    print(f"   Download delay: {settings['DOWNLOAD_DELAY']}s")
    print(f"   AutoThrottle: {settings['AUTOTHROTTLE_ENABLED']}")
    print(f"   Cache enabled: {settings['HTTPCACHE_ENABLED']}")
    # Example 6: Duplicate detection
    print("\n6️⃣ Duplicate Detection:")
    dup_pipeline = DuplicatesPipeline()
    items = [
        ProductItem({'product_id': 'ABC123', 'url': 'https://example.com/1'}),
        ProductItem({'product_id': 'XYZ789', 'url': 'https://example.com/2'}),
        ProductItem({'product_id': 'ABC123', 'url': 'https://example.com/3'}),  # Duplicate
    ]
    for item in items:
        try:
            dup_pipeline.process_item(item, None)
            print(f"   ✅ Processed: {item.get('product_id')}")
        except DropItem as e:
            print(f"   ❌ Dropped: {e}")
    # Example 7: Price parsing
    print("\n7️⃣ Price Parsing:")
    test_prices = [
        "$49.99",
        "€ 39,99",
        "1,234.56",
        "Rs. 2999",
        "¥5,000"
    ]
    for price_str in test_prices:
        parsed = parse_price(price_str)
        print(f"   '{price_str}' → {parsed}")
    # Example 8: Spider statistics (hard-coded sample values, not measured results)
    print("\n8️⃣ Spider Statistics:")
    print("   Spider run summary:")
    print("   Products scraped: 1234")
    print("   Products failed: 12")
    print("   Success rate: 99.0%")
    print("   Average response time: 0.8s")
    print("   Cache hit rate: 35%")
    # Example 9: Scrapy commands
    print("\n9️⃣ Common Scrapy Commands:")
    commands = [
        ("scrapy startproject myproject", "Create new project"),
        ("scrapy genspider amazon amazon.com", "Generate spider"),
        ("scrapy crawl amazon", "Run spider"),
        ("scrapy crawl amazon -o products.json", "Run with output"),
        ("scrapy shell 'https://example.com'", "Interactive shell"),
        ("scrapy view https://example.com", "View in browser"),
        ("scrapy bench", "Run benchmark test"),
    ]
    for cmd, desc in commands:
        print(f"   $ {cmd}")
        print(f"     {desc}")
    print("\n✅ Scrapy framework demonstration complete!")
Key Takeaways and Best Practices 🎯
- Use Item Classes: Define clear data structures for consistency.
- Implement Pipelines: Process, validate, and store data systematically.
- Configure AutoThrottle: Let Scrapy automatically adjust speed.
- Handle Errors Gracefully: Use retry middleware and error callbacks.
- Cache Responses: Save bandwidth and time during development.
- Use Selectors Efficiently: Scrapy translates CSS selectors to XPath internally, so pick whichever reads more clearly and keep expressions specific.
- Monitor Performance: Use Scrapy's built-in stats collection.
- Scale Horizontally: Use Scrapy-Redis for distributed crawling.
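To make the AutoThrottle point concrete, here is a plain-Python sketch of the delay-update rule as the Scrapy documentation describes it: the target delay is the response latency divided by `AUTOTHROTTLE_TARGET_CONCURRENCY`, the next delay is the average of the previous delay and that target, non-200 responses may not lower the delay, and the result stays between `DOWNLOAD_DELAY` and `AUTOTHROTTLE_MAX_DELAY`. The function name `next_delay` and its parameters are invented for illustration, and Scrapy's own extension may differ in detail.

```python
def next_delay(prev_delay, latency, status=200,
               target_concurrency=4.0, min_delay=1.0, max_delay=10.0):
    """Return the download delay to use after one response (illustrative sketch)."""
    # Delay that would keep roughly `target_concurrency` requests in flight
    target = latency / target_concurrency
    # Move halfway from the previous delay toward the target
    new_delay = (prev_delay + target) / 2.0
    # Stay within DOWNLOAD_DELAY .. AUTOTHROTTLE_MAX_DELAY
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses often return quickly; do not let them speed the crawl up
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay


# A slow server (12 s latency) pushes the delay up from 1 s to 2 s
print(next_delay(1.0, 12.0))
# A fast 503 response is not allowed to lower a 4 s delay
print(next_delay(4.0, 2.0, status=503))
```

The defaults mirror the values used in the settings earlier in this section (`DOWNLOAD_DELAY` of 1, `AUTOTHROTTLE_MAX_DELAY` of 10, target concurrency of 4).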
Scrapy Framework Best Practices
Mastering Scrapy transforms you from a casual scraper to an industrial-scale data extraction engineer. You now have the power to build robust, scalable, and maintainable web scraping systems that can handle millions of pages efficiently. Whether you're building price monitors, search engines, or data aggregation platforms, Scrapy provides the industrial-strength foundation you need!
Pro Tip: Scrapy isn't just a library - it's a complete framework with opinions about how web scraping should be done. Embrace its architecture! Always define Items for data structure, use ItemLoaders for consistent extraction, implement Pipelines for processing, and leverage Middlewares for request/response handling. AutoThrottle is your friend - it automatically adjusts crawling speed based on how quickly the server responds. Use Scrapy Shell for testing selectors before writing spiders. Cache responses during development to avoid hitting servers repeatedly. For JavaScript-heavy sites, integrate Scrapy-Splash or Scrapy-Selenium. When scaling, use Scrapy-Redis for distributed crawling across multiple machines. Most importantly: Scrapy is asynchronous by design - don't block the reactor with synchronous code! Let Scrapy's Twisted-based engine handle concurrency for maximum performance.
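For the distributed-crawling advice, a settings fragment along these lines is the usual starting point with the scrapy-redis package. Treat it as a sketch: the Redis URL is an assumed local instance, and the package must be installed separately (it appears in the pip command at the top of this section).

```python
# settings.py fragment for scrapy-redis (illustrative values)

# Keep the request queue in Redis so several crawler processes share it
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the seen-request fingerprints across all processes
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Assumed local Redis instance; point this at your own server
REDIS_URL = "redis://localhost:6379"
```

Each machine runs the same spider with these settings, and Redis hands out requests so no URL is fetched twice.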