Endurance Testing (Soak Testing)

xdd(雄弟弟)2025-12-25

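In one sentence: endurance testing checks whether a system stays stable under sustained load over a long period, exposing problems such as memory leaks.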

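🌟 Quick Intuition

Think of it as a marathon:

Sprint (load test):
- Distance: 100 m
- Time: 10 seconds
- Tests: explosive power

Marathon (endurance test):
- Distance: 42 km
- Time: 3 hours
- Tests: stamina, stability

Questions it answers:
- Can the system hold its performance over a long run?
- Does it get slower the longer it runs?
- Does it crash partway through?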

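📌 Core Concepts

What is endurance testing?

Endurance testing: running the system under sustained load for a long time to verify its stability and uncover problems such as memory leaks and resource exhaustion.

Endurance testing vs. other performance tests

Dimension	Endurance test	Load test	Stress test
Duration	Long (hours to days)	Short (minutes to hours)	Short (minutes)
Load	Normal load	Expected load	Beyond expected load
Goal	Find memory leaks and resource exhaustion	Verify performance	Find the limit
Focus	Stability	Response time	Breaking point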

🎯 Real-World Case

Case: Netflix endurance test

Background: Netflix's video streaming service

Endurance test goals:

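Load: 1 million concurrent users
Duration: 72 hours (3 days)
Purpose: surface problems that only appear after long uptime

Endurance test procedure: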

from locust import HttpUser, task, between

class NetflixUser(HttpUser):
    # Each simulated user pauses 5-10 seconds between tasks
    wait_time = between(5, 10)

    @task(5)
    def watch_video(self):
        """Watch a video (the dominant operation)."""
        self.client.get("/stream/video/123")

    @task(2)
    def browse_catalog(self):
        """Browse the catalog."""
        self.client.get("/catalog")

    @task(1)
    def search(self):
        """Search."""
        self.client.get("/search?q=action")

# Endurance test configuration
# Users: 1 million
# Duration: 72 hours
# Load: constant

Test results:

Time | Response time | Throughput | Error rate | Memory usage | CPU
-----|---------------|------------|------------|--------------|-----
0h   | 100 ms        | 100k TPS   | 0%         | 2 GB         | 50%
6h   | 120 ms        | 100k TPS   | 0%         | 3 GB         | 50%
12h  | 150 ms        | 100k TPS   | 0.1%       | 4 GB         | 50%
24h  | 200 ms        | 90k TPS    | 0.5%       | 6 GB         | 55%
48h  | 300 ms        | 80k TPS    | 1%         | 8 GB         | 60%
72h  | 500 ms        | 70k TPS    | 2%         | 10 GB        | 65%

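Problems found:
1. Memory leak: memory usage grew from 2 GB to 10 GB
2. Performance degradation: response time rose from 100 ms to 500 ms
3. Throughput drop: from 100k TPS down to 70k TPS
4. Rising error rate: from 0% to 2%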
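
To put the findings above on a common scale, the following minimal sketch computes the relative change between the first (0h) and last (72h) samples of the results table. The helper name `percent_change` and the dictionary keys are illustrative choices, not part of the original test setup. The error rate is left out because its starting value is 0%, so a relative change is undefined.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) * 100.0 / start

# First and last rows of the results table (0h and 72h)
first = {"response_ms": 100, "throughput_tps": 100_000, "memory_gb": 2}
last = {"response_ms": 500, "throughput_tps": 70_000, "memory_gb": 10}

changes = {name: percent_change(first[name], last[name]) for name in first}

for name, change in changes.items():
    print(f"{name}: {change:+.0f}%")
```

Read this way, response time and memory both grew by 400% while throughput fell by 30%, which is the pattern of gradual degradation an endurance test is meant to catch.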

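Root causes:
- Video cache entries were never released
- Database connections were never closed
- Log files grew too large

Remediation:

1. Fix the memory leak (release the video cache)
2. Close database connections after use
3. Rotate logs (clean up old logs on a schedule)
4. Restart the service periodically (every 24 hours)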

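✅ Endurance Testing Workflow

1. Define the endurance targets

Endurance targets:
- Load: 80% of the expected load
- Duration: 24-72 hours
- Metrics to monitor:
  - Response time
  - Throughput
  - Memory usage
  - CPU usage
  - Error rate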

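2. Design the endurance scenario

from locust import LoadTestShape

class EnduranceTestShape(LoadTestShape):
    """Endurance test: constant load over a long run."""

    def tick(self):
        run_time = self.get_run_time()

        # Hold 10,000 users (spawn rate 100/s) for 72 hours
        if run_time < 72 * 3600:
            return 10000, 100
        else:
            # Returning None stops the test
            return None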
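
A soak run often starts with a short ramp-up so the system is not hit with the full load at once. The sketch below is one possible variant, written as a plain function so it can be checked without Locust installed. The function name `soak_stage`, the 10-minute ramp, and the parameter names are assumptions added for illustration; the original scenario uses a constant load only.

```python
def soak_stage(run_time, ramp_seconds=600, soak_seconds=72 * 3600,
               target_users=10000, spawn_rate=100):
    """Return (user_count, spawn_rate) for the elapsed run_time, or None to stop.

    Hypothetical helper: ramps linearly to target_users, then holds steady.
    """
    if run_time < ramp_seconds:
        # Linear ramp, starting from at least one user
        users = max(1, int(target_users * run_time / ramp_seconds))
        return users, spawn_rate
    if run_time < ramp_seconds + soak_seconds:
        # Steady state: hold the target load for the whole soak window
        return target_users, spawn_rate
    # Ramp and soak window are both over
    return None

print(soak_stage(0))      # start of the ramp
print(soak_stage(300))    # halfway through the ramp
print(soak_stage(3600))   # one hour in, steady state
```

Inside a `LoadTestShape` subclass, `tick()` could simply return `soak_stage(self.get_run_time())`.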

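3. Run the endurance test

# Run the endurance test with Locust.
# --headless runs without the web UI; --run-time only applies with --headless or --autostart.
# If the locustfile defines a LoadTestShape, Locust ignores --users and --spawn-rate.
locust -f endurance-test.py \
    --headless \
    --users 10000 \
    --spawn-rate 100 \
    --run-time 72h \
    --host https://example.com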

4. Analyze long-term trends

import matplotlib.pyplot as plt

# Plot the performance trends (named "hours" to avoid shadowing the time module)
hours = [0, 6, 12, 24, 48, 72]
response_time = [100, 120, 150, 200, 300, 500]
memory_usage = [2, 3, 4, 6, 8, 10]

plt.figure(figsize=(12, 5))

# Left panel: response time
plt.subplot(1, 2, 1)
plt.plot(hours, response_time)
plt.title('Response time trend')
plt.xlabel('Time (hours)')
plt.ylabel('Response time (ms)')

# Right panel: memory usage
plt.subplot(1, 2, 2)
plt.plot(hours, memory_usage)
plt.title('Memory usage trend')
plt.xlabel('Time (hours)')
plt.ylabel('Memory (GB)')

plt.show()
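
A plot shows the trend, but a number is easier to track between runs. The following sketch fits a least-squares line to the memory samples from the case study and extrapolates when a memory ceiling would be reached. The 16 GB ceiling and the helper names `linear_trend` and `hours_until` are assumptions for illustration only, and a straight line is just a first approximation of real leak behavior.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_until(limit, slope, intercept):
    """Hour at which the fitted line reaches limit, or None if it never does."""
    if slope <= 0:
        return None
    return (limit - intercept) / slope

hours = [0, 6, 12, 24, 48, 72]
memory_gb = [2, 3, 4, 6, 8, 10]

slope, intercept = linear_trend(hours, memory_gb)
eta = hours_until(16, slope, intercept)  # 16 GB is a hypothetical memory ceiling

print(f"memory growth: {slope:.3f} GB/hour")
print(f"projected to reach 16 GB at about hour {eta:.0f}")
```

On these samples the fitted growth is roughly 0.11 GB per hour, so a run that looks healthy at 72 hours would still be expected to hit the assumed ceiling a couple of days later.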

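📊 Common Problems

1. Memory leak 💧

# Problematic code
class VideoCache:
    cache = {}  # class-level dict shared by all instances; entries are never evicted

    def get_video(self, video_id):
        if video_id not in self.cache:
            self.cache[video_id] = load_video(video_id)
        return self.cache[video_id]

# Fixed code
from functools import lru_cache

class VideoCache:
    # staticmethod keeps "self" out of the cache key, so the cache does not pin instances
    @staticmethod
    @lru_cache(maxsize=1000)  # cap the cache; least recently used entries are evicted
    def get_video(video_id):
        return load_video(video_id)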
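
To confirm that a suspected leak is real before fixing it, one option is to measure allocation growth across repeated runs of the same workload. This is a minimal sketch using the standard-library `tracemalloc` module; `measure_growth`, `leaky_workload`, and `clean_workload` are hypothetical names, and the module-level list stands in for an unbounded cache like the one above.

```python
import tracemalloc

def measure_growth(workload, iterations=3):
    """Run workload repeatedly; return traced-memory growth in bytes per round."""
    tracemalloc.start()
    try:
        growth = []
        previous, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            workload()
            current, _peak = tracemalloc.get_traced_memory()
            growth.append(current - previous)
            previous = current
        return growth
    finally:
        tracemalloc.stop()

leaky_store = []  # stands in for a cache that is never evicted

def leaky_workload():
    # Every call retains another ~100 KB
    leaky_store.append(bytearray(100_000))

def clean_workload():
    # The buffer is released before the function returns
    buffer = bytearray(100_000)
    del buffer

print("leaky:", measure_growth(leaky_workload))
print("clean:", measure_growth(clean_workload))
```

A workload whose growth stays near zero round after round is behaving; one that adds a similar amount every round is a leak candidate worth profiling further.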

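2. Resource exhaustion 📉

# Problematic code
def query_database():
    conn = create_connection()
    result = conn.execute("SELECT * FROM users")
    return result
    # The connection is never closed!

# Fixed code
from contextlib import closing

def query_database():
    # closing() guarantees conn.close() on exit; with some drivers (e.g. sqlite3)
    # the connection's own "with" block only manages the transaction
    with closing(create_connection()) as conn:
        # Fetch the rows before the connection is closed
        return conn.execute("SELECT * FROM users").fetchall()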
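
A self-contained way to try the close-on-exit pattern is with the standard-library `sqlite3` module, used here only as a stand-in for whatever database the real service talks to. The function name `fetch_users`, the in-memory database, and the sample row are all illustrative assumptions.

```python
import sqlite3
from contextlib import closing

def fetch_users(db_path=":memory:"):
    """Open a connection, read all users, and always close the connection."""
    # closing() calls conn.close() even if a query raises
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
        conn.execute("INSERT INTO users VALUES (1, 'alice')")
        # fetchall() materializes the rows while the connection is still open
        return conn.execute("SELECT id, name FROM users").fetchall()

rows = fetch_users()
print(rows)
```

Returning a materialized list rather than a live cursor matters here: a cursor handed back to the caller would be unusable once the connection has been closed.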

3. Oversized log files 📁

# Problematic configuration
import logging

logging.basicConfig(
    filename='app.log',
    level=logging.DEBUG
)
# The log file grows without bound!

# Fixed configuration
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    'app.log',
    maxBytes=100*1024*1024,  # roll over at 100 MB
    backupCount=5  # keep 5 rotated backups
)
logging.basicConfig(handlers=[handler])
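
The effect of rotation is easy to verify on a small scale. The sketch below writes a few thousand lines through a `RotatingFileHandler` with deliberately tiny limits and lists the files that remain. The function name `write_rotating_logs`, the logger name `soak_demo`, and the 10 KB / 2-backup limits are assumptions chosen so the demo rotates quickly; they are not recommended production values.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

def write_rotating_logs(log_dir, messages=2000):
    """Write many log lines with small rotation limits; return the files left on disk."""
    logger = logging.getLogger("soak_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler = RotatingFileHandler(
        os.path.join(log_dir, "app.log"),
        maxBytes=10_000,  # roll over at about 10 KB
        backupCount=2,    # keep app.log.1 and app.log.2 only
    )
    logger.addHandler(handler)
    try:
        for i in range(messages):
            logger.info("request %d handled", i)
    finally:
        logger.removeHandler(handler)
        handler.close()
    return sorted(os.listdir(log_dir))

with tempfile.TemporaryDirectory() as tmp:
    files = write_rotating_logs(tmp)
    sizes = [os.path.getsize(os.path.join(tmp, name)) for name in files]

print(files, sizes)
```

However many lines are written, disk usage stays bounded by roughly `maxBytes * (backupCount + 1)`, which is the property a multi-day run depends on.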

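🛠️ Endurance Testing Tools

Tool	Highlights	Rating
Locust	Python, easy to use, suited to long runs	⭐⭐⭐⭐⭐
k6	High performance, cloud native	⭐⭐⭐⭐⭐
JMeter	Feature rich	⭐⭐⭐⭐
Prometheus	Metrics monitoring	⭐⭐⭐⭐⭐
Grafana	Visualization	⭐⭐⭐⭐⭐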

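📊 Best Practices

1. Monitor long-term trends

# Prometheus scrape configuration
scrape_configs:
  - job_name: 'endurance-test'
    scrape_interval: 1m
    static_configs:
      # 9090 is Prometheus's own default port; point this at the
      # metrics endpoint of the system under test
      - targets: ['localhost:9090']

# Metrics to track (illustrative names, not part of the Prometheus config schema):
#   - response_time_trend
#   - memory_usage_trend
#   - cpu_usage_trend
#   - error_rate_trend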

2. Set up alerts

# Alert rules (pseudo-config)
alerts:
  - name: memory_leak
    condition: memory_usage grows > 10% per hour
    action: send an alert email

  - name: performance_degradation
    condition: response_time grows > 20% per hour
    action: send an alert email
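
Before relying on growth-rate thresholds like these, it helps to replay them against known data. The sketch below evaluates a "grows more than N% per hour" rule over a series of samples, averaging the growth across each sampling interval. The function name `hourly_growth_alerts` and this averaging interpretation are assumptions; a real alerting system may define the rate differently.

```python
def hourly_growth_alerts(hours, values, threshold_pct_per_hour):
    """Return (hour, rate) pairs where average growth per hour exceeds the threshold."""
    alerts = []
    for i in range(1, len(hours)):
        elapsed = hours[i] - hours[i - 1]
        growth_pct = (values[i] - values[i - 1]) * 100.0 / values[i - 1]
        rate = growth_pct / elapsed  # average percent growth per hour
        if rate > threshold_pct_per_hour:
            alerts.append((hours[i], round(rate, 2)))
    return alerts

# Samples from the case-study results table
hours = [0, 6, 12, 24, 48, 72]
memory_gb = [2, 3, 4, 6, 8, 10]
response_ms = [100, 120, 150, 200, 300, 500]

memory_alerts = hourly_growth_alerts(hours, memory_gb, 10)
response_alerts = hourly_growth_alerts(hours, response_ms, 20)
tighter_memory_alerts = hourly_growth_alerts(hours, memory_gb, 5)

print(memory_alerts, response_alerts, tighter_memory_alerts)
```

Under this reading, neither the 10% memory rule nor the 20% response-time rule fires on the case-study samples, even though memory grew fivefold over 72 hours. Slow leaks may therefore call for lower per-hour thresholds or a comparison against the start-of-test baseline.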

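3. Check periodically

def hourly_check():
    """Run once per hour; the metric getters and alert() are placeholders."""
    current_metrics = get_current_metrics()
    baseline_metrics = get_baseline_metrics()

    # Check for performance degradation
    if current_metrics['response_time'] > baseline_metrics['response_time'] * 1.5:
        alert("Response time is more than 50% above baseline")

    # Check for a memory leak
    if current_metrics['memory'] > baseline_metrics['memory'] * 1.2:
        alert("Memory usage is more than 20% above baseline")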

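🔗 Related Topics

[[负载测试]] (Load Testing) - short-duration load tests
[[压力测试]] (Stress Testing) - probing the system's limits
[[性能测试]] (Performance Testing) - the broader category that includes endurance testing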

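💡 Quick Reference

Endurance Testing Cheat Sheet
================

Definition: testing a system's stability under sustained load over a long period

Core goals:
🕐 Run for a long time (24-72 hours)
💧 Find memory leaks
📉 Find performance degradation
🔍 Find resource exhaustion

Endurance testing workflow:
1. Define the endurance targets
2. Design the endurance scenario
3. Run the endurance test
4. Analyze long-term trends

Common problems:
- Memory leaks
- Resource exhaustion
- Oversized log files
- Performance degradation

Recommended tools:
- Locust (testing)
- Prometheus (monitoring)
- Grafana (visualization)

Best practices:
1. Monitor long-term trends
2. Set up alerts
3. Check periodically
4. Analyze root causes

Endurance vs. the others:
- Endurance: long duration, stability
- Load: short duration, performance
- Stress: overload, limits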