
12.6.2 Don't Bring the Target Site Down: Rate Control via Request Frequency and Concurrency Limits

In One Sentence

The core of rate control is "don't be greedy": your crawler is just one of countless visitors to the target site, and it shouldn't hog more than its share of resources.

Rate Limiter

typescript
class RateLimiter {
  private lastRequest = 0;
  private minInterval: number;

  constructor(requestsPerSecond: number) {
    this.minInterval = 1000 / requestsPerSecond;
  }

  async wait() {
    // Reserve the next slot immediately, so concurrent callers queue up
    // behind each other instead of all passing the elapsed-time check at once.
    const now = Date.now();
    const scheduled = Math.max(now, this.lastRequest + this.minInterval);
    this.lastRequest = scheduled;

    const waitTime = scheduled - now;
    if (waitTime > 0) {
      await new Promise(r => setTimeout(r, waitTime));
    }
  }
}

// Usage: at most 2 requests per second
const limiter = new RateLimiter(2);

async function fetchWithRateLimit(url: string) {
  await limiter.wait();
  return fetch(url);
}
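Note that throttling only works if every caller shares the same limiter instance; constructing a new RateLimiter per request would give each call a fresh history and let it pass immediately.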

Concurrency Control

typescript
class ConcurrencyLimiter {
  private running = 0;
  private queue: (() => void)[] = [];

  constructor(private maxConcurrent: number) {}

  async acquire() {
    if (this.running < this.maxConcurrent) {
      this.running++;
      return;
    }
    // Wait for a releaser to hand over its slot. The count transfers
    // directly, so a late acquire() can't sneak past the limit.
    await new Promise<void>(resolve => this.queue.push(resolve));
  }

  release() {
    const next = this.queue.shift();
    if (next) {
      next(); // pass the slot straight to the next waiter
    } else {
      this.running--;
    }
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Usage: at most 5 concurrent requests
const concurrency = new ConcurrencyLimiter(5);

async function crawlUrls(urls: string[]) {
  return Promise.all(
    urls.map(url => concurrency.run(() => fetch(url)))
  );
}
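Keep in mind that Promise.all rejects as soon as any single fetch fails. The combined scheduler below uses Promise.allSettled instead, so one bad URL doesn't abort the whole batch.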

A Combined Crawler Scheduler

typescript
interface CrawlerConfig {
  maxConcurrent: number;     // upper bound on in-flight requests
  requestsPerSecond: number; // global request rate
  retryAttempts: number;     // retry budget (see the backoff section below)
  timeout: number;           // per-request timeout in milliseconds
}

class Crawler {
  private rateLimiter: RateLimiter;
  private concurrency: ConcurrencyLimiter;
  
  constructor(private config: CrawlerConfig) {
    this.rateLimiter = new RateLimiter(config.requestsPerSecond);
    this.concurrency = new ConcurrencyLimiter(config.maxConcurrent);
  }

  async fetch(url: string): Promise<Response> {
    // Hold a concurrency slot for the whole request, and pace the
    // actual dispatch through the rate limiter.
    return this.concurrency.run(async () => {
      await this.rateLimiter.wait();
      
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), this.config.timeout);
      
      try {
        const response = await fetch(url, { signal: controller.signal });
        return response;
      } finally {
        clearTimeout(timeout);
      }
    });
  }

  async crawl(urls: string[], onResult: (url: string, data: unknown) => void) {
    const results = await Promise.allSettled(
      urls.map(async url => {
        const response = await this.fetch(url);
        const data = await response.text();
        onResult(url, data);
        return data;
      })
    );
    
    return results;
  }
}

// Usage example
const crawler = new Crawler({
  maxConcurrent: 5,
  requestsPerSecond: 2,
  retryAttempts: 3,
  timeout: 10000,
});
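A minimal call sketch; the URLs here are placeholders:

typescript
const urls = ['https://example.com/page-1', 'https://example.com/page-2'];

const results = await crawler.crawl(urls, (url, data) => {
  console.log(`${url}: ${(data as string).length} bytes`);
});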

Exponential Backoff

When you get a 429 (Too Many Requests):

typescript
async function fetchWithBackoff(url: string, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch(url);
    
    if (response.status === 429) {
      // Prefer the server's Retry-After hint (in seconds) when it's numeric;
      // otherwise fall back to exponential backoff. Retry-After may also be
      // an HTTP date, which Number() turns into NaN, triggering the fallback.
      const retryAfter = Number(response.headers.get('Retry-After'));
      const delay = Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : 2 ** attempt * 1000;
      
      console.log(`Got 429, retrying in ${delay}ms`);
      await new Promise(r => setTimeout(r, delay));
      continue;
    }
    
    return response;
  }
  
  throw new Error('Retry attempts exhausted');
}
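CrawlerConfig declares retryAttempts, but the Crawler above never consumes it. One way to wire the two together is a thin wrapper around crawler.fetch; fetchViaCrawler is our own helper name, and the retry budget is passed explicitly since config is private:

typescript
// Sketch: send every request through the crawler's rate and concurrency
// limits, retrying throttled responses with exponential backoff.
async function fetchViaCrawler(crawler: Crawler, url: string, retries: number): Promise<Response> {
  for (let attempt = 0; attempt < retries; attempt++) {
    const response = await crawler.fetch(url);
    if (response.status !== 429 && response.status !== 503) return response;
    await new Promise(r => setTimeout(r, 2 ** attempt * 1000));
  }
  throw new Error('Retry attempts exhausted');
}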

AI Collaboration Guide

  • Core intent: have the AI implement sensible rate control for your crawler.
  • Prompt formula: "Please implement a crawler scheduler that allows at most 2 requests per second and at most 5 concurrent requests, with exponential backoff for 429 responses."
  • Key terms: rate limiting, concurrency control, exponential backoff, 429 Too Many Requests.

Pitfalls

  • Honor Crawl-delay: if robots.txt specifies a delay, respect it.
  • Watch response codes: 429 and 503 usually mean you're going too fast.
  • Randomize intervals: add random jitter so your request pattern isn't perfectly regular (a sketch follows below).
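
A minimal jitter sketch; jitteredWait is our own helper, not part of the classes above, and the ±20% range is an arbitrary choice:

typescript
// Sleep for the base interval plus or minus up to 20% random jitter,
// so successive requests don't land on a perfectly regular beat.
async function jitteredWait(baseMs: number) {
  const jitter = (Math.random() - 0.5) * 0.4 * baseMs; // -20% .. +20%
  await new Promise(r => setTimeout(r, Math.max(0, baseMs + jitter)));
}

// Usage: roughly 2 requests per second, with a randomized gap
await jitteredWait(500);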