12.6.2 Don't Bring the Target Site Down: Rate Control with Request Frequency and Concurrency Limits
In One Sentence
The core of rate control is "don't be greedy": your crawler is only one of the target site's countless visitors and should not hog its resources.
The Rate Limiter
typescript
class RateLimiter {
  private lastRequest = 0;
  private minInterval: number;

  constructor(requestsPerSecond: number) {
    this.minInterval = 1000 / requestsPerSecond;
  }

  // Wait until at least minInterval has elapsed since the previous request.
  async wait() {
    const now = Date.now();
    const elapsed = now - this.lastRequest;
    const waitTime = Math.max(0, this.minInterval - elapsed);
    if (waitTime > 0) {
      await new Promise(r => setTimeout(r, waitTime));
    }
    this.lastRequest = Date.now();
  }
}
// Usage: at most 2 requests per second
const limiter = new RateLimiter(2);

async function fetchWithRateLimit(url: string) {
  await limiter.wait();
  return fetch(url);
}
Concurrency Control
typescript
class ConcurrencyLimiter {
  private running = 0;
  private queue: (() => void)[] = [];

  constructor(private maxConcurrent: number) {}

  async acquire() {
    if (this.running >= this.maxConcurrent) {
      // Wait for release() to hand a slot directly to this waiter.
      // The counter is not decremented in between, so a new caller
      // cannot slip in ahead of the queue and push us over the limit.
      await new Promise<void>(resolve => this.queue.push(resolve));
      return;
    }
    this.running++;
  }

  release() {
    const next = this.queue.shift();
    if (next) {
      next(); // hand the slot to the next waiter; `running` stays unchanged
    } else {
      this.running--;
    }
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}
// Usage: at most 5 concurrent requests
const concurrency = new ConcurrencyLimiter(5);

async function crawlUrls(urls: string[]) {
  return Promise.all(
    urls.map(url => concurrency.run(() => fetch(url)))
  );
}
Putting It Together: A Crawler Scheduler
typescript
interface CrawlerConfig {
  maxConcurrent: number;
  requestsPerSecond: number;
  retryAttempts: number;
  timeout: number;
}

class Crawler {
  private rateLimiter: RateLimiter;
  private concurrency: ConcurrencyLimiter;

  constructor(private config: CrawlerConfig) {
    this.rateLimiter = new RateLimiter(config.requestsPerSecond);
    this.concurrency = new ConcurrencyLimiter(config.maxConcurrent);
  }

  // One rate-limited, concurrency-limited fetch with a timeout.
  // (config.retryAttempts is not wired up here; the exponential-backoff
  // section below shows one way to add retries.)
  async fetch(url: string): Promise<Response> {
    return this.concurrency.run(async () => {
      await this.rateLimiter.wait();
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), this.config.timeout);
      try {
        const response = await fetch(url, { signal: controller.signal });
        return response;
      } finally {
        clearTimeout(timeout);
      }
    });
  }

  async crawl(urls: string[], onResult: (url: string, data: unknown) => void) {
    const results = await Promise.allSettled(
      urls.map(async url => {
        const response = await this.fetch(url);
        const data = await response.text();
        onResult(url, data);
        return data;
      })
    );
    return results;
  }
}
// Usage
const crawler = new Crawler({
  maxConcurrent: 5,
  requestsPerSecond: 2,
  retryAttempts: 3,
  timeout: 10000,
});
Exponential Backoff
When a request comes back with 429 (Too Many Requests):
typescript
async function fetchWithBackoff(url: string, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url);
    if (response.status === 429) {
      // Prefer the server's Retry-After hint (in seconds); otherwise
      // back off exponentially: 1s, 2s, 4s, 8s, ...
      const retryAfter = response.headers.get('Retry-After');
      const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : Math.pow(2, attempt) * 1000;
      console.log(`Got 429, retrying in ${delay}ms`);
      await new Promise(r => setTimeout(r, delay));
      continue;
    }
    return response;
  }
  throw new Error('Retry attempts exhausted');
}
AI Collaboration Guide
- Core intent: have the AI implement sensible rate control for your crawler.
- Requirement-definition formula:
  "Please implement a crawler scheduler that makes at most 2 requests per second and at most 5 concurrent requests, and handles 429 responses with exponential backoff."
- Key terms: rate limiting, concurrency control, exponential backoff, 429 Too Many Requests
Pitfall Guide
- Honor Crawl-delay: if robots.txt specifies a delay, respect it (a simple way to read it is sketched after this list).
- Watch the response codes: 429 and 503 usually mean you are going too fast.
- Randomize your intervals: add random jitter so your request pattern is not perfectly regular (also shown in the sketch below).
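The first and third points are easy to bolt onto the RateLimiter above. The sketch below is only illustrative: parseCrawlDelay, waitWithJitter, and setupLimiter are hypothetical helper names, and the robots.txt handling assumes a single global Crawl-delay line rather than per-User-agent rule groups.
typescript
// Hypothetical helpers, not part of the classes above.

// Read a global Crawl-delay (in seconds) from a robots.txt body.
// Assumption: a simple file with one Crawl-delay line; real files group
// directives per User-agent, which this sketch ignores.
function parseCrawlDelay(robotsTxt: string): number | null {
  const match = robotsTxt.match(/^\s*Crawl-delay:\s*(\d+(?:\.\d+)?)/im);
  return match ? parseFloat(match[1]) : null;
}

// Wait for the rate limiter, then add up to `maxJitterMs` of random delay
// so requests do not arrive on an exact, machine-like rhythm.
async function waitWithJitter(limiter: RateLimiter, maxJitterMs = 300) {
  await limiter.wait();
  await new Promise(r => setTimeout(r, Math.random() * maxJitterMs));
}

// Usage sketch: derive the request rate from Crawl-delay when present,
// falling back to 2 requests per second otherwise.
async function setupLimiter(origin: string): Promise<RateLimiter> {
  const robots = await fetch(`${origin}/robots.txt`).then(r => (r.ok ? r.text() : ''));
  const delay = parseCrawlDelay(robots);
  return new RateLimiter(delay ? 1 / delay : 2);
}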
