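Following on from the previous article, "Python Crawler Project in Practice: Scraping Douban Books Top 250 with Common Libraries", we upgrade that project into a small but complete teaching-grade codebase. It builds on the earlier Douban Books crawler, integrates four advanced directions at once, and keeps the structure clear, runnable, and extensible.

⚠️ Reminder: this is for learning and classroom demonstration only. Do not send high-frequency requests to Douban.

Project name: DoubanBookSpider-Pro

A Douban Books crawler that combines Scrapy project structure + distributed deduplication + anti-scraping strategies + asynchronous fetching.

I. Project structure (key points)

```text
douban_book_spider_pro/
├── scrapy.cfg
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── douban_book_spider_pro/
    ├── __init__.py
    ├── settings.py
    ├── pipelines.py
    ├── middlewares.py
    ├── items.py
    ├── db.py
    └── spiders/
        ├── __init__.py
        ├── top250_spider.py       # Scrapy main spider
        ├── async_spider.py        # aiohttp async spider
        ├── distributed_spider.py  # Redis distributed spider
        └── anti_spider_demo.py    # anti-scraping strategy demo
```

II. Dependency list (requirements.txt)

```text
scrapy==2.11
redis==5.0
# scrapy-redis is imported by distributed_spider.py and the settings below
scrapy-redis
fake-useragent
requests
beautifulsoup4
lxml
pyquery
selenium
playwright
aiohttp
aioredis
```

```bash
pip install -r requirements.txt
playwright install
```

III. Item definition (items.py)

```python
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    rating = scrapy.Field()
    detail_url = scrapy.Field()
```

IV. Database wrapper (db.py)

```python
import redis

REDIS_HOST = "localhost"
REDIS_PORT = 6379


def get_redis():
    # decode_responses=True returns str instead of bytes
    return redis.Redis(host=REDIS_HOST, port=REDIS_PORT, decode_responses=True)
```

V. Direction 1: Scrapy project structure (top250_spider.py)

```python
import scrapy

from douban_book_spider_pro.items import BookItem


class Top250Spider(scrapy.Spider):
    name = "top250"
    allowed_domains = ["book.douban.com"]
    start_urls = ["https://book.douban.com/top250"]

    def parse(self, response):
        for item in response.css(".item"):
            book = BookItem()
            book["title"] = item.css(".title a::attr(title)").get()
            book["author"] = item.css(".author::text").get(default="").strip()
            book["rating"] = item.css(".rating_nums::text").get()
            book["detail_url"] = item.css(".title a::attr(href)").get()
            yield book

        # Follow the "next page" link until there is none
        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

✅ Demonstrates:
- A standardized Spider
- Item encapsulation
- An extensible Pipeline
- Automatic pagination

The CSS selectors are the ones used in this article. Check them against the current page markup before relying on the extracted fields, since a selector that matches nothing yields `None` rather than an error.

VI. Direction 2: distributed deduplication (distributed_spider.py)

```python
from scrapy_redis.spiders import RedisSpider

from douban_book_spider_pro.items import BookItem


class DistributedSpider(RedisSpider):
    name = "distributed"
    # Redis key the spider reads its start URLs from
    redis_key = "douban:start_urls"

    def parse(self, response):
        for item in response.css(".item"):
            book = BookItem()
            book["title"] = item.css(".title a::attr(title)").get()
            book["rating"] = item.css(".rating_nums::text").get()
            yield book
```

Key configuration in settings.py:

```python
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = "localhost"
REDIS_PORT = 6379
```

Start several spider instances to scale horizontally. A `RedisSpider` idles until a start URL is pushed onto its `redis_key`, so push `https://book.douban.com/top250` onto the `douban:start_urls` list (for example with `redis-cli lpush`) once the instances are running. All instances share the same Redis-backed queue and duplicate filter.

VII. Direction 3: anti-scraping strategies (middlewares.py)

```python
import random
import time

from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, *args, **kwargs):
        # Initialize the base class so its user_agent attribute exists
        super().__init__(*args, **kwargs)
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.ua.random


class RandomDelayMiddleware:
    def process_request(self, request, spider):
        # Demo only: time.sleep() blocks Scrapy's event loop, so every
        # in-flight request waits. Setting DOWNLOAD_DELAY = 1 (with the
        # default RANDOMIZE_DOWNLOAD_DELAY = True) gives the same
        # 0.5 to 1.5 second range without blocking.
        time.sleep(random.uniform(0.5, 1.5))
```

Enable them in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    "douban_book_spider_pro.middlewares.RandomUserAgentMiddleware": 400,
    "douban_book_spider_pro.middlewares.RandomDelayMiddleware": 500,
}
```

✅ Includes:
- User-Agent randomization
- Randomized request intervals
- Room to add a proxy pool (example omitted)

VIII. Direction 4: asynchronous fetching (async_spider.py)

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

from douban_book_spider_pro.items import BookItem

URL = "https://book.douban.com/top250"


async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()


async def parse_html(html):
    soup = BeautifulSoup(html, "lxml")
    for item in soup.select(".item"):
        title_link = item.select_one(".title a")
        rating = item.select_one(".rating_nums")
        if title_link is None or rating is None:
            continue  # skip rows the selectors do not match
        book = BookItem()
        book["title"] = title_link.get("title")
        book["rating"] = rating.text
        print(book)


async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, URL)
        await parse_html(html)


if __name__ == "__main__":
    asyncio.run(main())
```

✅ Characteristics:
- Non-blocking I/O
- High concurrency
- Suited to large-scale crawling

This sample fetches a single page, so it shows the non-blocking pattern rather than actual concurrency. The concurrency benefit appears only when several pages are fetched at once, and the rate should still be kept low for Douban.

IX. Docker deployment

Dockerfile:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt && playwright install --with-deps
CMD ["scrapy", "crawl", "top250"]
```

```yaml
# docker-compose.yml
version: "3"
services:
  spider:
    build: .
    depends_on:
      - redis
  redis:
    image: redis:7
```

Inside the Compose network the spider container reaches Redis under the service name `redis`, so `REDIS_HOST` should be `redis` rather than `localhost` when running this way.

X. Overall data flow

```text
aiohttp / Scrapy
       ↓
UA + delay + proxy
       ↓
Redis deduplication
       ↓
BookItem
       ↓
Pipeline
       ↓
JSON / DB
```

Through this project you have seen:

✅ Scrapy project architecture
✅ A Redis-based distributed crawler
✅ Common anti-scraping strategies
✅ An asynchronous crawler
✅ An approach to Docker deployment

What you can do next:

✅ Plug Playwright / Selenium into a Scrapy Downloader Middleware
✅ Write the data to MySQL / MongoDB / Elasticsearch
✅ Add a front-end visualization (Flask + ECharts)
✅ Work through a Scrapy vs aiohttp performance comparison and when to choose each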
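
The data flow above ends in "Pipeline → JSON / DB", and the project tree lists pipelines.py, but the article does not show its contents. The following is a minimal sketch of what a JSON Lines pipeline could look like. The class name `JsonLinesPipeline` and the output file `books.jsonl` are assumptions made for illustration and are not part of the original project.

```python
import json


class JsonLinesPipeline:
    """Write each scraped item as one JSON object per line (hypothetical example)."""

    def __init__(self, path="books.jsonl"):
        # The default output path is an assumption, not from the article.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # dict() accepts both scrapy.Item objects and plain dicts.
        # ensure_ascii=False keeps Chinese titles readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # pass the item on to any later pipeline
```

To try it, the class would be registered in settings.py through `ITEM_PIPELINES`, for example with the key `douban_book_spider_pro.pipelines.JsonLinesPipeline` and a priority such as 300. Returning the item from `process_item` is what lets a second pipeline (say, a database writer) run after this one.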