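1 Basic Concepts
1.1 Vector Databases
01.Definition of a Vector Database
a.Basic Concepts
a.Description
A vector database is a database system built to store, index, and query high-dimensional vectors. It performs semantic retrieval by computing similarity between vectors and is widely used in AI workloads such as recommendation, image search, and natural language processing. With a suitable index, a vector database can serve collections ranging from millions to billions of vectors with similarity queries that complete in milliseconds.
b.Code example
---
# Core concepts of a vector database
# Vector: a high-dimensional array such as [0.1, 0.2, 0.3, ..., 0.n]
# Similarity: a distance metric (Euclidean distance, cosine similarity, ...) that scores how close two vectors are
# Index: a data structure that speeds up vector retrieval (for example HNSW or IVF)
import numpy as np
# Example: cosine similarity between two vectors
vector1 = np.array([0.1, 0.2, 0.3])
vector2 = np.array([0.2, 0.3, 0.4])
similarity = np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
print(f"Cosine similarity: {similarity}")
---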
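The metrics named above (Euclidean distance, cosine similarity, inner product) can be compared side by side. The following is a minimal standard-library sketch; the helper names `l2_distance`, `inner_product`, `cosine_similarity`, and `normalize` are illustrative and not part of any Milvus API. It also shows why cosine similarity is often implemented as an inner product over normalized vectors.

```python
import math

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: larger means more similar."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

v1 = [0.1, 0.2, 0.3]
v2 = [0.2, 0.3, 0.4]

print(f"L2 distance:       {l2_distance(v1, v2):.4f}")
print(f"Inner product:     {inner_product(v1, v2):.4f}")
print(f"Cosine similarity: {cosine_similarity(v1, v2):.4f}")

# After normalization, the inner product equals the cosine similarity.
u1, u2 = normalize(v1), normalize(v2)
print(f"IP of unit vectors: {inner_product(u1, u2):.4f}")
```

For unit-length vectors the three metrics produce the same ranking, since the squared L2 distance equals 2 - 2·cos.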
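b.Use Cases
a.Description
Vector databases are used across many AI domains. In recommendation systems, users and items are represented as vectors to drive personalized recommendations. In image search, images are encoded as vectors so that similar images can be found from an example image. In natural language processing, they support semantic search, question answering, and RAG applications. In anomaly detection, vector distance is used to flag patterns that lie far from normal data.
b.Code example
---
# Typical use cases (pseudocode: embedding_model, vector_db, get_user_embedding,
# image_encoder, user_id and image are placeholders, not defined here)
# 1. Semantic search: turn text into a vector and retrieve by similarity
query_text = "什么是人工智能"
query_vector = embedding_model.encode(query_text)
results = vector_db.search(query_vector, top_k=5)
# 2. Recommendation: find similar users from a user vector
user_vector = get_user_embedding(user_id)
similar_users = vector_db.search(user_vector, top_k=10)
# 3. Image search: search by image
image_vector = image_encoder.encode(image)
similar_images = vector_db.search(image_vector, top_k=20)
---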
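02.Vector Databases vs. Traditional Databases
a.Data Type Differences
a.Description
Traditional databases mainly store structured data (numbers, strings, dates, and so on), and queries rely on exact matches or range comparisons. Vector databases store high-dimensional vectors (commonly 128 to 1536 dimensions), and queries rely on similarity computation. Traditional databases use B-tree and hash indexes, while vector databases use ANN (approximate nearest neighbor) indexes such as HNSW and IVF. The query semantics differ accordingly: a traditional query returns exactly the rows that satisfy a predicate, whereas a vector query returns the approximately nearest neighbors of the query vector.
b.Code example
---
# Traditional database query (exact match, SQL)
SELECT * FROM products WHERE category = 'electronics' AND price < 1000;
# Vector database query (similarity search, Python)
from pymilvus import Collection
collection = Collection("products")
search_vector = [[0.1, 0.2, 0.3, ...]] # query vector (elided; its length must equal the field dimension)
results = collection.search(
    data=search_vector,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},  # nprobe applies to IVF-type indexes
    limit=10
)
---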
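b.Performance Characteristics
a.Description
Traditional databases excel at exact queries and transaction processing and provide ACID guarantees. Vector databases are strong at high-dimensional similarity search, where approximate nearest neighbor algorithms avoid scanning every vector and reach sub-linear query time in practice. Scaling a relational database horizontally is often constrained by joins and transactions, whereas vector databases such as Milvus are designed for horizontal scaling from the start. On latency, a well-tuned vector database can typically answer queries over millions of vectors in milliseconds, depending on hardware, dimensionality, and index parameters.
b.Code example
---
# Performance comparison (sketch: assumes an existing DB-API `cursor`,
# a Milvus `collection`, and a `query_vector`)
# Traditional database: exact lookup through a B-tree index, O(log n)
import time
start = time.perf_counter()
cursor.execute("SELECT * FROM users WHERE id = 12345")
print(f"Traditional database query took: {time.perf_counter() - start}s")
# Vector database: approximate search. The cost depends on the index:
# roughly logarithmic for graph indexes such as HNSW; for IVF it is
# proportional to the number of vectors in the nprobe clusters visited
start = time.perf_counter()
results = collection.search(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10
)
print(f"Vector database query took: {time.perf_counter() - start}s")
# On million-scale data, latency below 10 ms is commonly achievable, but it is not guaranteed
---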
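To make the role of `nprobe` concrete, here is a toy inverted-file (IVF) search in pure Python. It is a simplified sketch, not how Milvus implements IVF: the centroids are random samples instead of k-means centers, and all function names are illustrative. Searching only the `nprobe` closest buckets trades recall for speed; probing every bucket reproduces the exact result.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, nlist, seed=0):
    """Pick nlist centroids (random samples here) and bucket every vector
    under its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    buckets = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:top_k]

def exact_search(query, vectors, top_k):
    """Brute-force baseline that scans every vector."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:top_k]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
query = [rng.random() for _ in range(8)]

centroids, buckets = build_ivf(vectors, nlist=32)
exact = exact_search(query, vectors, top_k=10)
approx = ivf_search(query, vectors, centroids, buckets, nprobe=4, top_k=10)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@10 with nprobe=4: {recall:.2f}")
```

Raising `nprobe` moves the result toward the exact answer at the cost of scanning more vectors, which is the same trade-off the `nprobe` search parameter controls for IVF indexes.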
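1.2 Milvus Architecture
01.System Architecture
a.Cloud-Native Design
a.Description
Milvus uses a cloud-native architecture that separates storage from compute and supports elastic scaling. The system has four layers: the access layer (load balancing and request routing), the coordinator layer (metadata management and task scheduling), the worker layer (data processing and query execution), and the storage layer (metadata store, object storage, and message queue). Because each component can be scaled independently, this design improves availability and maintainability.
b.Code example
---
# Milvus architecture components
# 1. Access Layer
# - Proxy: accepts client requests and balances load
# - Exposes gRPC and RESTful APIs
# 2. Coordinator Service
# - Root Coordinator: handles DDL operations (create/drop collection)
# - Data Coordinator: manages segments and binlogs
# - Query Coordinator: manages query nodes and load balancing
# - Index Coordinator: manages index build tasks
#   (in Milvus 2.3 and later this role is merged into the Data Coordinator)
# 3. Worker Nodes
# - Query Node: executes vector search
# - Data Node: persists data
# - Index Node: builds vector indexes
# 4. Storage
# - Object storage (MinIO/S3): stores vector data and indexes
# - Metadata store (etcd): stores collection schemas and metadata
# - Message queue (Pulsar/Kafka): data streaming and log replication
---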
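b.Distributed Features
a.Description
Milvus supports distributed deployment and achieves high availability through sharding and replicas. Writes are distributed across shards, and data within a collection is organized into segments. A segment is bounded by a configurable size rather than a fixed vector count, and it is sealed once it reaches that limit. At query time, multiple Query Nodes process different segments in parallel and the partial results are merged. The cluster can be scaled out or in dynamically, and newly added nodes take over part of the load. In-memory replicas keep queries available when a Query Node fails, while durability comes from the storage layer.
b.Code example
---
from pymilvus import (
    connections, utility,
    Collection, CollectionSchema, FieldSchema, DataType,
)
# Connect to a Milvus cluster
connections.connect(
    alias="default",
    host="milvus-cluster.example.com",
    port="19530"
)
# Check the server version
print(f"Milvus version: {utility.get_server_version()}")
# Specify the number of shards when creating a collection
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection(
    name="distributed_collection",
    schema=schema,
    shards_num=4  # 4 shards to spread write load
)
# Replicas are requested at load time, once an index has been created:
# collection.load(replica_number=2)
---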
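The scatter-gather pattern described above, where each Query Node searches its own segments and the partial results are then merged, can be sketched with the standard library. The data layout and the function names `search_segment` and `merge_results` are illustrative assumptions, not Milvus internals.

```python
import heapq

def search_segment(segment, query, top_k):
    """Brute-force top-k inside one segment; returns (distance, primary_key) pairs."""
    scored = (
        (sum((x - y) ** 2 for x, y in zip(vec, query)), pk)
        for pk, vec in segment
    )
    return heapq.nsmallest(top_k, scored)

def merge_results(partials, top_k):
    """Merge per-segment results into one global top-k list."""
    return heapq.nsmallest(top_k, (hit for part in partials for hit in part))

# Three segments, each a list of (primary_key, vector) rows.
segments = [
    [(1, [0.0, 0.0]), (2, [1.0, 1.0])],
    [(3, [0.1, 0.1]), (4, [5.0, 5.0])],
    [(5, [0.2, 0.0])],
]
query = [0.0, 0.0]

# "Scatter": every segment is searched independently (in parallel in a real cluster).
partials = [search_segment(seg, query, top_k=2) for seg in segments]
# "Gather": only top_k candidates per segment need to be shipped and merged.
top = merge_results(partials, top_k=3)
print([pk for _, pk in top])  # [1, 3, 5]
```

Each segment only has to return its own `top_k` hits, so the merge cost grows with the number of segments rather than the number of vectors.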
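02.Core Components
a.Proxy
a.Description
The Proxy is the access layer of Milvus. It receives client requests and routes them to backend services. It exposes a unified API over gRPC and RESTful protocols, validates requests, checks parameters, and aggregates results. In cluster mode, a load balancer distributes requests across multiple Proxy instances to maintain availability. The Proxy is stateless and can be scaled horizontally.
b.Code example
---
# Example Proxy configuration (milvus.yaml); key names and defaults vary by version
proxy:
  port: 19530
  grpc:
    serverMaxRecvSize: 536870912 # 512MB
    serverMaxSendSize: 536870912
    clientMaxRecvSize: 104857600 # 100MB
    clientMaxSendSize: 104857600
  http:
    enabled: true
    port: 9091
  timeTickInterval: 200 # ms
  msgStream:
    timeTick:
      bufSize: 512
  maxTaskNum: 1024 # maximum number of concurrent tasks
# Clients connect through the Proxy (Python)
from pymilvus import connections
connections.connect(
    alias="default",
    host="proxy.milvus.svc.cluster.local",
    port="19530",
    user="username",
    password="password"
)
---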
b.Coordinators
a.Description
Coordinators handle metadata management and task scheduling. The Root Coordinator manages creation and deletion of collections and partitions and maintains the global timestamp. The Data Coordinator manages segment allocation and merging and coordinates data persistence. The Query Coordinator balances load across query nodes and assigns segments to them. The Index Coordinator schedules index build tasks and tracks index state. Coordinators keep their metadata in etcd and register there for service discovery, which is what allows a standby instance to take over.
b.Code example
---
# Coordinator workflow example
import numpy as np
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
# 1. Root Coordinator: create a collection
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=128)
])
# The Root Coordinator handles the DDL request
collection = Collection("example", schema=schema)
# 2. Data Coordinator: insert data
data = [
    [i for i in range(1000)],
    [[np.random.random() for _ in range(128)] for _ in range(1000)]
]
collection.insert(data) # the Data Coordinator allocates segments
# 3. Index Coordinator: build an index
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128}
}
collection.create_index("vector", index_params) # schedules the build task
# 4. Query Coordinator: run a query
collection.load() # the Query Coordinator assigns segments to Query Nodes
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
results = collection.search([[0.1]*128], "vector", search_params, limit=10)
---
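c.Worker Nodes
a.Description
Worker nodes carry out the actual data processing. Query Nodes load indexes and execute vector search, querying multiple segments in parallel. Data Nodes persist data by writing binlogs to object storage. Index Nodes build vector indexes and support multiple index types. Because durable state lives in the storage layer, worker nodes hold only reloadable in-memory state and receive their tasks from the Coordinators. When a node fails, the Coordinator reassigns its tasks to other nodes.
b.Code example
---
# Example worker node configuration (milvus.yaml); key names vary by version
# Query Node
queryNode:
  cacheSize: 32 # GB, cache size
  gracefulStopTimeout: 30 # graceful shutdown timeout
  stats:
    publishInterval: 1000 # statistics publish interval (ms)
  dataSync:
    flowGraph:
      maxQueueLength: 1024
      maxParallelism: 1024
  segcore:
    chunkRows: 1024 # rows per segment chunk
# Data Node
dataNode:
  dataSync:
    flowGraph:
      maxQueueLength: 1024
  flush:
    insertBufSize: 16777216 # 16MB
# Index Node
indexNode:
  scheduler:
    buildParallel: 1 # number of indexes built in parallel
# Inspect worker state from Python
from pymilvus import utility
# Segments currently loaded on Query Nodes
segments = utility.get_query_segment_info("collection_name")
for seg in segments:
    print(f"Node ID: {seg.nodeID}, Segment: {seg.segmentID}, State: {seg.state}")
---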
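1.3 Core Features
01.High-Performance Search
a.Millisecond-Level Response
a.Description
Milvus achieves millisecond-level query latency through optimized index algorithms and memory management. On million-scale data, an HNSW index can typically answer in single-digit milliseconds, although the exact figure depends on hardware, dimensionality, and search parameters. GPU-enabled builds offer GPU indexes for further speedup. Loading the index into memory ahead of time avoids disk I/O on the query path. Batch queries are supported to raise throughput.
b.Code example
---
import time
from pymilvus import Collection
collection = Collection("benchmark")
collection.load() # preload the index into memory
# Single-query latency test
query_vector = [[0.1] * 128]
start = time.perf_counter()
results = collection.search(
    data=query_vector,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 64}},  # ef applies to HNSW
    limit=10
)
latency = (time.perf_counter() - start) * 1000
print(f"Query latency: {latency:.2f}ms")
# Batch queries raise throughput
batch_vectors = [[0.1] * 128 for _ in range(100)]
start = time.perf_counter()
results = collection.search(
    data=batch_vectors,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=10
)
total_time = time.perf_counter() - start
qps = len(batch_vectors) / total_time
print(f"Batch query QPS: {qps:.2f}")
---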
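A single timed call is noisy, so latency is usually reported as percentiles over many runs after a warm-up. The helper below is a generic sketch: `measure_latency` and `fake_query` are illustrative names, and `fake_query` stands in for a real `collection.search(...)` call so that the example runs without a server.

```python
import statistics
import time

def measure_latency(run_query, warmup=5, repeats=100):
    """Time run_query() repeatedly and return latency percentiles in milliseconds."""
    for _ in range(warmup):
        run_query()  # warm caches and connections; not measured
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000)
    samples_ms.sort()
    p99_index = min(len(samples_ms) - 1, int(0.99 * len(samples_ms)))
    return {
        "p50": statistics.median(samples_ms),
        "p99": samples_ms[p99_index],
        "max": samples_ms[-1],
    }

def fake_query():
    """Placeholder workload; swap in a real search call when benchmarking."""
    sum(i * i for i in range(1000))

stats = measure_latency(fake_query, warmup=2, repeats=50)
print({name: f"{value:.3f}ms" for name, value in stats.items()})
```

To benchmark Milvus itself, pass a function that wraps the search call, for example `lambda: collection.search(...)` with the parameters used above.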
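b.Large-Scale Data Support
a.Description
Milvus supports storage and retrieval of billion-scale vector data. With the distributed architecture, data is spread across multiple nodes. The segment mechanism splits data into size-bounded blocks that are easy to manage and query. Indexes are built per segment, so newly sealed segments can be indexed without rebuilding existing ones. Quantization (for example product quantization) reduces memory and storage cost. Loaded data is served from memory, while the persistent copy lives in object storage.
b.Code example
---
import numpy as np
from pymilvus import Collection, utility
# Collection statistics
collection = Collection("large_scale")
stats = collection.num_entities
print(f"Total vectors: {stats:,}")
# Large-scale insert
batch_size = 10000
total_vectors = 10000000 # 10 million vectors
for i in range(0, total_vectors, batch_size):
    data = [
        list(range(i, i + batch_size)),
        np.random.random((batch_size, 128)).tolist()
    ]
    collection.insert(data)
    if (i + batch_size) % 100000 == 0:
        print(f"Inserted {i + batch_size:,} rows")
# Flush once at the end; frequent manual flushes create many small segments
collection.flush()
# Build an index suited to large-scale retrieval
index_params = {
    "index_type": "IVF_PQ", # PQ quantization lowers memory usage
    "metric_type": "L2",
    "params": {
        "nlist": 2048, # more cluster centroids
        "m": 8, # number of PQ sub-vectors (must divide dim)
        "nbits": 8
    }
}
collection.create_index("embedding", index_params)
---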
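The memory saving from the `IVF_PQ` parameters above can be checked with simple arithmetic. The sketch below counts only vector payload bytes; codebooks, centroids, primary keys, and other index overhead are deliberately ignored, so real usage will be higher. The function names are illustrative.

```python
def raw_bytes(num_vectors, dim, bytes_per_float=4):
    """Size of uncompressed float32 vectors."""
    return num_vectors * dim * bytes_per_float

def pq_code_bytes(num_vectors, m, nbits):
    """Size of PQ codes: m sub-vector codes of nbits each per vector."""
    return num_vectors * m * nbits // 8

num_vectors, dim = 10_000_000, 128
raw = raw_bytes(num_vectors, dim)
compressed = pq_code_bytes(num_vectors, m=8, nbits=8)

print(f"Raw float32 vectors: {raw / 1e9:.2f} GB")        # 5.12 GB
print(f"PQ codes (m=8, nbits=8): {compressed / 1e6:.0f} MB")  # 80 MB
print(f"Compression ratio: {raw // compressed}x")        # 64x
```

The 64x reduction comes at the cost of recall, because distances are computed on quantized codes rather than on the original vectors.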
02.Flexible Scaling
a.Horizontal Scaling
a.Description
Milvus supports horizontal scaling: Query Nodes, Data Nodes, and Index Nodes can be added dynamically. New nodes join the cluster automatically and the Coordinators rebalance the load. Adding Query Nodes raises query throughput, adding Data Nodes improves write performance, and adding Index Nodes speeds up index builds. Scaling is designed to take place without interrupting online service, and rolling upgrades are supported.
b.Code example
---
# Horizontal scaling on Kubernetes
# (deployment names depend on the Helm release name)
# 1. Scale Query Nodes (query performance)
# kubectl scale deployment milvus-querynode --replicas=5
# 2. Scale Data Nodes (write performance)
# kubectl scale deployment milvus-datanode --replicas=3
# 3. Scale Index Nodes (index build speed)
# kubectl scale deployment milvus-indexnode --replicas=2
# Observe the effect from the application side
import concurrent.futures
import requests
from pymilvus import connections, Collection
connections.connect("default", host="milvus-proxy", port="19530")
# Health check on the metrics port
response = requests.get("http://milvus-proxy:9091/healthz")
print(f"Cluster health: {response.status_code} {response.text}")
# Test performance after scaling
collection = Collection("test")
collection.load(replica_number=2) # 2 replicas for higher query concurrency
# Concurrent query test
def search_task(query_id):
    collection.search(
        data=[[0.1] * 128],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10
    )
    return query_id
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(search_task, i) for i in range(1000)]
    results = [f.result() for f in futures]
print(f"Concurrent queries completed: {len(results)} requests")
---
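b.Separation of Storage and Compute
a.Description
Milvus separates storage from compute: vector data and indexes are stored in object storage (MinIO or S3), and compute nodes keep no durable state. Storage and compute can therefore be scaled independently, which lowers cost. Compute nodes can be started and destroyed on demand, which enables elastic scaling. Durability and replication are delegated to the object store, which can provide multiple copies and cross-region replication. Metadata is stored in etcd, which can be deployed as a highly available cluster.
b.Code example
---
# Example configuration for storage/compute separation (milvus.yaml)
# Object storage (MinIO); the credentials below are MinIO defaults and must be changed in production
minio:
  address: minio.example.com
  port: 9000
  accessKeyID: minioadmin
  secretAccessKey: minioadmin
  useSSL: false
  bucketName: milvus-bucket
  rootPath: file # data root path
  useIAM: false
  iamEndpoint: ""
# Or use AWS S3
# minio:
#   address: s3.amazonaws.com
#   port: 443
#   accessKeyID: YOUR_ACCESS_KEY
#   secretAccessKey: YOUR_SECRET_KEY
#   useSSL: true
#   bucketName: milvus-data
#   rootPath: milvus
#   useIAM: true
#   iamEndpoint: ""
#   region: us-west-2
# Metadata store (etcd)
etcd:
  endpoints:
    - etcd-0.etcd:2379
    - etcd-1.etcd:2379
    - etcd-2.etcd:2379
  rootPath: by-dev # metadata root path
  metaSubPath: meta
  kvSubPath: kv
# Message queue (Pulsar)
pulsar:
  address: pulsar://pulsar-proxy:6650
  maxMessageSize: 5242880 # 5MB
# Advantages of this architecture
# 1. Compute nodes keep no durable state and can scale quickly
# 2. The storage layer scales independently, up to PB-scale data
# 3. Data is persisted in low-cost object storage
# 4. Several clusters can share one object store when each uses its own rootPath
---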
03.Multi-Language Support
a.SDK Ecosystem
a.Description
Milvus provides SDKs for several languages, including Python, Java, Go, Node.js, and C++. All SDKs talk to the same gRPC interface, although feature coverage can lag between them. The Python SDK is the most mature and offers the most complete API and examples. The Java SDK suits enterprise applications. The Go SDK is lightweight and fits microservice architectures. The Node.js SDK targets JavaScript and TypeScript backends. Support for retries and connection management differs between SDKs, so check each SDK's documentation.
b.Code example
---
# Python SDK
from pymilvus import connections, Collection
connections.connect("default", host="localhost", port="19530")
collection = Collection("example")
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
results = collection.search([[0.1]*128], "vector", search_params, limit=10)
# Java SDK
// import io.milvus.client.*;
//
// MilvusServiceClient client = new MilvusServiceClient(
//     ConnectParam.newBuilder()
//         .withHost("localhost")
//         .withPort(19530)
//         .build()
// );
//
// SearchParam searchParam = SearchParam.newBuilder()
//     .withCollectionName("example")
//     .withVectorFieldName("vector")
//     .withVectors(Arrays.asList(Arrays.asList(0.1f, 0.2f, ...)))
//     .withTopK(10)
//     .build();
// R<SearchResults> response = client.search(searchParam);
# Go SDK
// import "github.com/milvus-io/milvus-sdk-go/v2/client"
// import "github.com/milvus-io/milvus-sdk-go/v2/entity"
//
// c, _ := client.NewGrpcClient(context.Background(), "localhost:19530")
// sp, _ := entity.NewIndexIvfFlatSearchParam(10) // nprobe = 10
// searchResult, _ := c.Search(
//     context.Background(),
//     "example",
//     []string{},
//     "",
//     []string{"id"},
//     []entity.Vector{entity.FloatVector{0.1, 0.2, ...}},
//     "vector",
//     entity.L2,
//     10,
//     sp,
// )
# Node.js SDK (parameter names differ between SDK versions)
// const { MilvusClient } = require("@zilliz/milvus2-sdk-node");
//
// const client = new MilvusClient("localhost:19530");
// const results = await client.search({
//     collection_name: "example",
//     vectors: [[0.1, 0.2, ...]],
//     search_params: { nprobe: 10 },
//     limit: 10
// });
---
b.RESTful API
a.Description
Milvus provides a RESTful API for cross-language calls and quick integration. The API runs over HTTP with JSON requests and responses. It covers the core operations, including collection management, data operations, and search. It suits lightweight clients and web applications and supports authentication and access control. Paths and payload formats differ between Milvus versions, so check the REST reference for the release you run.
b.Code example
---
# Sketch of REST calls. The base URL, paths, and payload field names follow this
# document's example and are not guaranteed to match any specific Milvus release.
import requests
base_url = "http://localhost:9091/api/v1"
# 1. Create a collection
create_payload = {
    "collection_name": "rest_example",
    "schema": {
        "fields": [
            {"name": "id", "dtype": "Int64", "is_primary": True},
            {"name": "vector", "dtype": "FloatVector", "params": {"dim": 128}}
        ]
    }
}
response = requests.post(f"{base_url}/collection", json=create_payload)
print(f"Create collection: {response.json()}")
# 2. Insert data
insert_payload = {
    "collection_name": "rest_example",
    "fields_data": [
        {"field_name": "id", "type": "Int64", "field": [1, 2, 3]},
        {"field_name": "vector", "type": "FloatVector", "field": [[0.1]*128, [0.2]*128, [0.3]*128]}
    ]
}
response = requests.post(f"{base_url}/entities", json=insert_payload)
print(f"Insert: {response.json()}")
# 3. Search
search_payload = {
    "collection_name": "rest_example",
    "vectors": [[0.15] * 128],
    "dsl_type": "Dsl",
    "params": {"nprobe": 10},
    "limit": 5
}
response = requests.post(f"{base_url}/search", json=search_payload)
print(f"Search results: {response.json()}")
# 4. Get collection info
response = requests.get(f"{base_url}/collection/info?collection_name=rest_example")
print(f"Collection info: {response.json()}")
---
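2 Quick Start
2.1 Installation and Deployment
01.Docker Deployment
a.Standalone Installation
a.Description
Docker Compose offers a quick way to deploy Milvus standalone, which suits development and test environments. Standalone mode runs all Milvus components in a single container, alongside separate etcd and MinIO containers, so it uses few resources and is simple to deploy. Data is persisted to volumes and survives restarts. By default, port 19530 serves gRPC connections and port 9091 serves the HTTP health and metrics endpoints. Standalone performance is bounded by a single server, and high availability is not supported.
b.Code example
---
# 1. Download docker-compose.yml
wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
# 2. Start Milvus
docker-compose up -d
# 3. Check container status
docker-compose ps
# Sample output:
# NAME COMMAND SERVICE STATUS PORTS
# milvus-standalone "/tini -- milvus run…" standalone running 0.0.0.0:9091->9091/tcp, 0.0.0.0:19530->19530/tcp
# milvus-minio "/usr/bin/docker-ent…" minio running 9000/tcp
# milvus-etcd "etcd -advertise-cli…" etcd running 2379-2380/tcp
# 4. View logs
docker-compose logs -f standalone
# 5. Stop the service
docker-compose down
# 6. Data persistence settings (docker-compose.yml)
# volumes:
#   - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
---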
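b.Cluster Installation
a.Description
The cluster edition can be deployed with Docker Compose as a set of separate components: Proxy, Coordinators, worker nodes, and so on. Each component runs independently and can be scaled and upgraded on its own. External storage (MinIO/S3) and a message queue (Pulsar/Kafka) must be configured. A Compose-based cluster on a single host is mainly useful for evaluation; for production with high availability, the Kubernetes deployments described in the next section are the usual choice. The cluster edition needs considerably more resources, and at least three servers are recommended.
b.Code example
---
# 1. Download the cluster configuration
wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-cluster-docker-compose.yml -O docker-compose.yml
# 2. Edit the configuration (optional)
# Adjust resource limits and replica counts in docker-compose.yml
# 3. Start the cluster
docker-compose up -d
# 4. Check the status of all components
docker-compose ps
# Sample output:
# NAME SERVICE STATUS
# milvus-rootcoord rootcoord running
# milvus-datacoord datacoord running
# milvus-querycoord querycoord running
# milvus-indexcoord indexcoord running
# milvus-proxy proxy running
# milvus-querynode querynode running
# milvus-datanode datanode running
# milvus-indexnode indexnode running
# milvus-minio minio running
# milvus-etcd etcd running
# milvus-pulsar pulsar running
# 5. Scale Query Nodes (query performance)
# Note: --scale fails if the service sets a fixed container_name in the compose file
docker-compose up -d --scale querynode=3
# 6. Health check
curl http://localhost:9091/healthz
---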
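02.Kubernetes Deployment
a.Helm Installation
a.Description
The Helm chart is a quick way to deploy Milvus on a Kubernetes cluster. Helm provides parameterized configuration for resource limits, replica counts, storage types, and more. Rolling updates and rollbacks help keep the service stable. The deployment integrates with Kubernetes ecosystem tools such as Prometheus for monitoring and Grafana for dashboards. It suits large production environments and can be combined with autoscaling.
b.Code example
---
# 1. Add the Milvus Helm repository
# (newer chart releases are published at https://zilliztech.github.io/milvus-helm/)
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update
# 2. Create a namespace
kubectl create namespace milvus
# 3. Option A: install Milvus with default settings
helm install milvus milvus/milvus --namespace milvus
# 4. Option B: customize the installation (create values.yaml)
cat > values.yaml <<EOF
cluster:
  enabled: true
image:
  all:
    repository: milvusdb/milvus
    tag: v2.3.0
proxy:
  replicas: 2
queryNode:
  replicas: 3
  resources:
    limits:
      cpu: 4
      memory: 8Gi
dataNode:
  replicas: 2
indexNode:
  replicas: 1
minio:
  enabled: true
  mode: standalone
pulsar:
  enabled: true
etcd:
  replicaCount: 3
EOF
# 5. Install with the custom values (instead of step 3; a release name can only be installed once)
helm install milvus milvus/milvus -f values.yaml --namespace milvus
# 6. Check deployment status
kubectl get pods -n milvus
# 7. Expose the service (LoadBalancer)
kubectl expose deployment milvus-proxy --type=LoadBalancer --name=milvus-service --port=19530 -n milvus
# 8. Get the external IP
kubectl get svc milvus-service -n milvus
# 9. Upgrade Milvus
helm upgrade milvus milvus/milvus -f values.yaml --namespace milvus
# 10. Uninstall
helm uninstall milvus --namespace milvus
---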
b.Operator Deployment
a.Description
Milvus Operator is the Kubernetes-native way to deploy Milvus: a cluster is defined through a CRD. The Operator manages the cluster lifecycle, including deployment, upgrades, scaling, and failure recovery. Configuration is declarative: you describe the desired state and the Operator reconciles toward it. It offers finer-grained control, so each component can be configured separately. It suits scenarios that need deep customization and automated operations.
b.Code example
---
# 1. Install Milvus Operator
kubectl apply -f https://raw.githubusercontent.com/milvus-io/milvus-operator/main/deploy/manifests/deployment.yaml
# 2. Verify the Operator installation
kubectl get pods -n milvus-operator
# 3. Define a Milvus cluster (milvus-cluster.yaml)
cat > milvus-cluster.yaml <<EOF
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: my-milvus
  namespace: default
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: true
    storage:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: true
    pulsar:
      inCluster:
        deletionPolicy: Delete
        pvcDeletion: true
  components:
    proxy:
      replicas: 2
      resources:
        limits:
          cpu: 2
          memory: 4Gi
    queryNode:
      replicas: 3
      resources:
        limits:
          cpu: 4
          memory: 8Gi
    dataNode:
      replicas: 2
    indexNode:
      replicas: 1
  config:
    minio:
      bucketName: milvus-bucket
EOF
# 4. Deploy the cluster
kubectl apply -f milvus-cluster.yaml
# 5. Check cluster status
kubectl get milvus my-milvus -o yaml
# 6. Scale Query Nodes
kubectl patch milvus my-milvus --type='json' -p='[{"op": "replace", "path": "/spec/components/queryNode/replicas", "value": 5}]'
# 7. List all resources
kubectl get all -l app.kubernetes.io/instance=my-milvus
# 8. Delete the cluster
kubectl delete milvus my-milvus
---
03.Local Development
a.Python Environment
a.Description
Milvus Lite starts Milvus inside a local Python environment without Docker or Kubernetes. It is a lightweight edition for development, testing, and prototyping. It supports most core features and is API-compatible with the full version. Data is stored on the local file system, which simplifies debugging. Resource usage is small enough to run on a laptop. The example uses the `milvus` package with `default_server`; newer Milvus Lite releases are packaged differently, so check the documentation for your version.
b.Code example
---
# 1. Install Milvus Lite (shell)
pip install milvus
# 2. Start Milvus Lite (Python)
import numpy as np
from milvus import default_server
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
# Start the local server
default_server.start()
# 3. Connect
connections.connect(
    alias="default",
    host='127.0.0.1',
    port=default_server.listen_port
)
# 4. Create a collection
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name="dev_test", schema=schema)
# 5. Insert data (ids are generated because auto_id=True)
data = [
    np.random.random((100, 128)).tolist()
]
collection.insert(data)
# 6. Create an index
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128}
}
collection.create_index("embedding", index_params)
# 7. Search
collection.load()
results = collection.search(
    data=np.random.random((1, 128)).tolist(),
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5
)
# 8. Stop the server
default_server.stop()
# 9. Clean up data
default_server.cleanup()
---
b.Development Tools
a.Description
Several tools make development with Milvus more efficient. Attu is the official GUI, with visual collection management, data browsing, and querying. Milvus CLI is a command-line tool for interactive use and script automation. Birdwatcher is a debugging tool that inspects internal state and metadata. Together they help developers understand and debug Milvus quickly.
b.Code example
---
# 1. Run Attu (web GUI)
# Inside the container, localhost refers to the container itself;
# use the host's IP address if Milvus runs on the Docker host
docker run -p 8000:3000 -e MILVUS_URL=localhost:19530 zilliz/attu:latest
# Open http://localhost:8000
# Features:
# - Visual collection management
# - Data browsing and editing
# - Vector search testing
# - Index management
# - System monitoring
# 2. Install Milvus CLI
pip install milvus-cli
# Start the CLI
milvus_cli
# Example CLI commands (syntax may differ between CLI versions):
# connect -h localhost -p 19530
# list collections
# describe collection -c my_collection
# show index -c my_collection
# query -c my_collection -f "id > 100" -o id,vector
# search -c my_collection -v "[0.1, 0.2, ...]" -l 10
# 3. Use Birdwatcher (debugging tool)
# docker run -it --rm --network host milvusdb/birdwatcher:latest
# Birdwatcher commands:
# connect --etcd localhost:2379
# show collections
# show segments
# show segment-info --segment-id 12345
# show channel-watch
# 4. Debugging from Python
import logging
from pymilvus import connections, utility, Collection
connections.connect("default", host="localhost", port="19530")
# List all collections
collections = utility.list_collections()
print(f"Collections: {collections}")
# Inspect a collection
collection = Collection("my_collection")
print(f"Schema: {collection.schema}")
print(f"Entities: {collection.num_entities}")
print(f"Indexes: {collection.indexes}")
# Inspect loaded segments
segments = utility.get_query_segment_info("my_collection")
for seg in segments:
    print(f"Segment {seg.segmentID}: {seg.num_rows} rows, state={seg.state}")
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("pymilvus")
logger.setLevel(logging.DEBUG)
---
2.2 Connecting to the Database
01.Connection Configuration
a.Basic Connection
a.Description
Connecting to a Milvus server with the PyMilvus SDK requires a host address and a port. The default port is 19530 (gRPC). Once established, the connection is registered globally under an alias, and later operations use it. Multiple aliases are supported, so one process can connect to several Milvus instances at the same time. A connection can be shared across threads.
b.Code example
---
from pymilvus import connections, utility, Collection
# Basic connection
connections.connect(
    alias="default", # connection alias
    host="localhost",
    port="19530"
)
# Verify the connection
print(f"Server version: {utility.get_server_version()}")
# Multiple connections
connections.connect(
    alias="cluster1",
    host="milvus-cluster1.example.com",
    port="19530"
)
connections.connect(
    alias="cluster2",
    host="milvus-cluster2.example.com",
    port="19530"
)
# Use a specific connection
collection1 = Collection("test", using="cluster1")
collection2 = Collection("test", using="cluster2")
---
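b.Authenticated Connection
a.Description
Milvus supports username/password authentication to protect data. Once authentication is enabled on the server, every connection must present valid credentials. Multiple users can be created and granted different permissions. Credentials are verified when the connection is established and are attached to subsequent requests automatically. Enabling authentication is recommended in production.
b.Code example
---
from pymilvus import connections, utility
# Connect with a username and password
connections.connect(
    alias="default",
    host="localhost",
    port="19530",
    user="username",
    password="password"
)
# Create a user (requires root privileges)
utility.create_user(
    user="new_user",
    password="secure_password",
    using="default"
)
# Change the password
utility.reset_password(
    user="new_user",
    old_password="secure_password",
    new_password="new_secure_password",
    using="default"
)
# List all users
users = utility.list_usernames(using="default")
print(f"Users: {users}")
# Delete the user
utility.delete_user(user="new_user", using="default")
---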
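02.Connection Pool Management
a.Connection Pool Configuration
a.Description
PyMilvus manages the underlying gRPC connection for each alias, and gRPC multiplexes concurrent requests over a channel, so one alias can be shared by many threads. Timeout and TLS options can be passed when connecting. Whether an explicit `pool_size` argument has any effect depends on the PyMilvus version (it originates from the 1.x client), so verify it against the SDK reference before relying on it. Reusing connections rather than reconnecting per request noticeably improves performance under high concurrency.
b.Code example
---
import concurrent.futures
from pymilvus import connections, Collection
# Connection parameters
connections.connect(
    alias="default",
    host="localhost",
    port="19530",
    pool_size=10, # pool size (version-dependent; may be ignored by PyMilvus 2.x)
    timeout=30, # connection timeout (seconds)
    secure=False # set True to use TLS (see the TLS section for certificate options)
)
# List connections
print(connections.list_connections())
# Connection details
conn_info = connections.get_connection_addr("default")
print(f"Connection info: {conn_info}")
# Concurrency test over the shared connection
def query_task(task_id):
    collection = Collection("test")
    results = collection.query(
        expr="id > 0",
        limit=10,
        output_fields=["id"]
    )
    return len(results)
# 100 queries with up to 50 running concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(query_task, i) for i in range(100)]
    results = [f.result() for f in futures]
print(f"Completed {len(results)} concurrent queries")
---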
b.Connection Lifecycle
a.Description
Connections can be disconnected and re-established explicitly. Disconnecting releases the connection's resources but does not affect collections that are already loaded on the server. An application should disconnect before it exits. The connection state can be checked to see whether a connection exists and is usable. Aliases allow several connections to be managed and switched between.
b.Code example
---
from pymilvus import connections, utility
# Check whether a connection exists
has_connection = connections.has_connection("default")
print(f"Connection exists: {has_connection}")
# Disconnect
connections.disconnect("default")
# Reconnect
connections.connect(
    alias="default",
    host="localhost",
    port="19530"
)
# Connection health check
try:
    version = utility.get_server_version()
    print(f"Connection OK, server version: {version}")
except Exception as e:
    print(f"Connection error: {e}")
    # Try to reconnect
    connections.disconnect("default")
    connections.connect(
        alias="default",
        host="localhost",
        port="19530"
    )
# Disconnect all connections
for alias, _ in connections.list_connections():
    connections.disconnect(alias)
# Context manager (disconnects automatically)
class MilvusConnection:
    def __init__(self, alias, host, port):
        self.alias = alias
        self.host = host
        self.port = port
    def __enter__(self):
        connections.connect(
            alias=self.alias,
            host=self.host,
            port=self.port
        )
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        connections.disconnect(self.alias)
# Use the context manager
with MilvusConnection("temp", "localhost", "19530"):
    print(f"Version: {utility.get_server_version(using='temp')}")
# The connection is closed on exit
---
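03.Advanced Configuration
a.TLS Encryption
a.Description
Milvus supports TLS to protect data in transit. One-way TLS requires a server certificate; mutual TLS additionally requires a client certificate and key. With TLS enabled, all traffic is encrypted, which guards against man-in-the-middle attacks. It is appropriate on public networks or wherever security requirements are high. TLS adds some performance overhead in exchange for stronger security.
b.Code example
---
from pymilvus import connections
# One-way TLS
connections.connect(
    alias="secure",
    host="milvus.example.com",
    port="19530",
    secure=True, # enable TLS
    server_pem_path="/path/to/server.pem", # server certificate
    server_name="milvus.example.com", # server name (used for certificate verification)
    user="username",
    password="password"
)
# Mutual TLS (client certificate)
connections.connect(
    alias="mutual_tls",
    host="milvus.example.com",
    port="19530",
    secure=True,
    server_pem_path="/path/to/server.pem",
    client_pem_path="/path/to/client.pem", # client certificate
    client_key_path="/path/to/client.key", # client private key
    ca_pem_path="/path/to/ca.pem", # CA certificate
    server_name="milvus.example.com"
)
# Server-side TLS configuration (milvus.yaml); the TLS mode itself is
# enabled under the server's security settings
# tls:
#   serverPemPath: /path/to/server.pem
#   serverKeyPath: /path/to/server.key
#   caPemPath: /path/to/ca.pem
# Generate a self-signed certificate (for testing only)
# openssl req -x509 -newkey rsa:4096 -keyout server.key -out server.pem -days 365 -nodes
---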
b.Load Balancing
a.Description
In a cluster, clients can connect to multiple Proxy nodes through a load balancer, which improves availability and throughput. The client connects to the load balancer address, and requests are distributed to the backend Proxies. Common strategies such as round-robin and least-connections are supported by the load balancer. When a Proxy fails, the load balancer removes it from rotation. This setup provides better fault tolerance and scalability.
b.Code example
---
import time
from pymilvus import connections, utility
# Connect to the load balancer
connections.connect(
    alias="cluster",
    host="milvus-lb.example.com", # load balancer address
    port="19530"
)
# Load balancer Service on Kubernetes
# apiVersion: v1
# kind: Service
# metadata:
#   name: milvus-proxy-lb
# spec:
#   type: LoadBalancer
#   selector:
#     app: milvus-proxy
#   ports:
#     - protocol: TCP
#       port: 19530
#       targetPort: 19530
# DNS round-robin (multiple Proxy addresses)
# DNS records:
# milvus.example.com -> 10.0.1.1
# milvus.example.com -> 10.0.1.2
# milvus.example.com -> 10.0.1.3
connections.connect(
    alias="dns_lb",
    host="milvus.example.com", # resolved when the connection is created
    port="19530"
)
# Client-side retry
def connect_with_retry(max_retries=3, retry_delay=5):
    for attempt in range(max_retries):
        try:
            connections.connect(
                alias="default",
                host="milvus-lb.example.com",
                port="19530",
                timeout=10
            )
            version = utility.get_server_version()
            print(f"Connected, version: {version}")
            return True
        except Exception as e:
            print(f"Connection failed (attempt {attempt + 1}/{max_retries}): {e}")
            if attempt < max_retries - 1:
                time.sleep(retry_delay)
    return False
connect_with_retry()
---
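A fixed retry delay can cause many clients to reconnect at the same instant after an outage. A common refinement is exponential backoff with jitter. The helper below is a generic sketch that is independent of PyMilvus: `retry_with_backoff` and its parameters are illustrative names, and the `sleep` argument is injectable so the behavior can be tested without waiting.

```python
import random
import time

def retry_with_backoff(operation, max_retries=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the delay cap after each failure.

    The actual wait is a random fraction (50% to 100%) of the cap, which
    spreads out retries from many clients. The last failure is re-raised.
    """
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(cap * random.uniform(0.5, 1.0))

# Demo: an operation that fails twice before succeeding.
attempts = {"count": 0}

def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("proxy not ready")
    return "connected"

delays = []
result = retry_with_backoff(flaky_connect, sleep=delays.append)
print(result, attempts["count"], len(delays))  # connected 3 2
```

In practice `operation` would wrap the `connections.connect(...)` call and a cheap request such as `utility.get_server_version()`.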
2.3 Basic Operations
01.Collection Operations
a.Creating a Collection
a.Description
A collection is the basic data unit in Milvus, comparable to a table in a relational database. Creating one requires a schema that defines field names, data types, dimensions, and so on. A primary key field is mandatory and can be generated automatically. A vector field must declare its dimension, and every inserted vector must match it. Existing fields cannot be changed after creation, so the schema should be designed carefully.
b.Code example
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
# Define fields
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
# Create the schema
schema = CollectionSchema(
    fields=fields,
    description="文档向量库",
    enable_dynamic_field=False # whether dynamic fields are allowed
)
# Create the collection
collection = Collection(
    name="documents",
    schema=schema,
    using="default",
    shards_num=2 # number of shards
)
print(f"Collection created: {collection.name}")
print(f"Schema: {collection.schema}")
---
b.Inspecting a Collection
a.Description
You can list all collections and inspect a collection's details, including its schema and statistics. The Collection object exposes the entity count, index information, load state, and more. This information helps in understanding data volume and system state. You can also check whether a collection exists to avoid creating it twice.
b.Code example
---
from pymilvus import utility, Collection
# List all collections
collections = utility.list_collections()
print(f"All collections: {collections}")
# Check whether a collection exists
has_collection = utility.has_collection("documents")
print(f"Collection exists: {has_collection}")
# Get the Collection object
collection = Collection("documents")
# Schema
print(f"Schema: {collection.schema}")
print(f"Description: {collection.description}")
# Statistics
print(f"Entity count: {collection.num_entities}")
# Index information
indexes = collection.indexes
for index in indexes:
    print(f"Indexed field: {index.field_name}")
    print(f"Index params: {index.params}")
# Load state
print(f"Load state: {utility.load_state('documents')}")
# Collection properties (attribute availability depends on the PyMilvus version)
properties = collection.properties
print(f"Properties: {properties}")
---
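02.Inserting Data
a.Batch Insert
a.Description
Data is inserted in column-oriented form: one list per field. A single insert call either succeeds or fails as a whole. The return value contains the list of inserted primary keys. Batch inserts are recommended for throughput, with roughly 1,000 to 10,000 rows per call. Whether newly inserted rows are visible to a search depends on the consistency level; `flush()` seals and persists the data and also refreshes `num_entities`.
b.Code example
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# Prepare data (column-oriented)
ids = [i for i in range(1000)]
titles = [f"文档{i}" for i in range(1000)]
embeddings = np.random.random((1000, 128)).tolist()
# Insert
data = [ids, titles, embeddings]
insert_result = collection.insert(data)
print(f"Inserted: {insert_result.insert_count} rows")
print(f"Primary keys: {insert_result.primary_keys[:10]}...") # first 10
# Auto-generated primary keys (assumes this collection was created with auto_id=True)
collection_auto = Collection("auto_id_collection")
data_auto = [titles, embeddings] # no id column needed
insert_result = collection_auto.insert(data_auto)
# Flush: seal and persist the data so that num_entities reflects it
collection.flush()
print(f"Entity count after flush: {collection.num_entities}")
---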
b.Single-Row Insert
a.Description
Although Milvus is optimized for batch inserts, inserting one row at a time is also supported. It fits real-time streaming scenarios where records arrive individually. Throughput is lower than with batches, but each record is written with less delay. Accumulating small batches balances throughput against latency, so a buffering layer in the application that inserts once enough rows have accumulated is recommended.
b.Code example
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# Single-row insert
single_data = [
    [1001], # id
    ["单条文档"], # title
    [[np.random.random() for _ in range(128)]] # embedding
]
collection.insert(single_data)
# Real-time insert with buffering
class BufferedInserter:
    def __init__(self, collection, buffer_size=100):
        self.collection = collection
        self.buffer_size = buffer_size
        self.buffer = {"ids": [], "titles": [], "embeddings": []}
    def insert(self, id, title, embedding):
        self.buffer["ids"].append(id)
        self.buffer["titles"].append(title)
        self.buffer["embeddings"].append(embedding)
        if len(self.buffer["ids"]) >= self.buffer_size:
            self.flush()
    def flush(self):
        # Sends the buffered rows to Milvus (this is not collection.flush())
        if len(self.buffer["ids"]) > 0:
            data = [
                self.buffer["ids"],
                self.buffer["titles"],
                self.buffer["embeddings"]
            ]
            self.collection.insert(data)
            self.buffer = {"ids": [], "titles": [], "embeddings": []}
            print(f"Batch-inserted {len(data[0])} rows")
# Use the buffered inserter
inserter = BufferedInserter(collection, buffer_size=100)
for i in range(250):
    inserter.insert(
        id=2000 + i,
        title=f"实时文档{i}",
        embedding=[np.random.random() for _ in range(128)]
    )
inserter.flush() # send the remaining rows
---
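03.Querying Data
a.Primary Key Query
a.Description
A primary key query retrieves entities by exact key and returns the requested fields. It is the fastest kind of query because the lookup targets specific keys instead of scanning by a general filter. Several keys can be queried in one call. Specifying only the fields you need reduces the amount of data transferred. Like any query, it requires the collection to be loaded first.
b.Code example
---
from pymilvus import Collection
collection = Collection("documents")
collection.load() # queries require a loaded collection
# Query a single primary key
results = collection.query(
    expr="id == 1",
    output_fields=["id", "title", "embedding"]
)
print(f"Result: {results}")
# Query several primary keys
ids_to_query = [1, 10, 100, 1000]
results = collection.query(
    expr=f"id in {ids_to_query}",
    output_fields=["id", "title"]
)
for result in results:
    print(f"ID: {result['id']}, Title: {result['title']}")
# Range query
results = collection.query(
    expr="id > 100 and id < 200",
    output_fields=["id", "title"],
    limit=10
)
print(f"Range query returned {len(results)} rows")
---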
b.Scalar Filtering
a.Description
Scalar fields can be filtered with a SQL-like expression syntax. Supported operators include comparison (==, !=, >, <, >=, <=), logical (and, or, not), and membership (in, not in). Conditions can be combined into complex queries. The collection must be loaded before querying; a scalar index is optional and speeds up filtering. Query performance depends on data volume and on how selective the filter is.
b.Code example
---
import json
from pymilvus import Collection
collection = Collection("documents")
collection.load() # load into memory
# String prefix match
results = collection.query(
    expr='title like "文档1%"',
    output_fields=["id", "title"],
    limit=10
)
# Multiple conditions
results = collection.query(
    expr='id > 100 and id < 500 and title like "文档%"',
    output_fields=["id", "title"]
)
# IN query (json.dumps yields double-quoted string literals)
titles_to_find = ["文档1", "文档10", "文档100"]
results = collection.query(
    expr=f'title in {json.dumps(titles_to_find, ensure_ascii=False)}',
    output_fields=["id", "title"]
)
# Compound expression
results = collection.query(
    expr='(id > 100 and id < 200) or (id > 800 and id < 900)',
    output_fields=["id", "title"],
    limit=20
)
# Pagination (Milvus caps offset + limit, so deep paging needs another approach)
page_size = 100
offset = 0
while True:
    results = collection.query(
        expr="id > 0",
        output_fields=["id", "title"],
        limit=page_size,
        offset=offset
    )
    if len(results) == 0:
        break
    print(f"Page {offset // page_size + 1}: {len(results)} rows")
    offset += page_size
---
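Building filter strings by interpolating Python lists works for integers but is fragile for strings, because Python's `repr` chooses the quote style for you. A small helper can serialize values consistently and split long key lists into batches. This is a sketch with hypothetical helper names (`in_expr`, `chunked`); it assumes the expression parser accepts JSON-style double-quoted string literals.

```python
import json

def in_expr(field, values):
    """Build a `field in [...]` filter with JSON-serialized values.

    Strings are always double-quoted and embedded quotes are escaped.
    """
    return f"{field} in {json.dumps(list(values), ensure_ascii=False)}"

def chunked(values, size):
    """Yield consecutive slices of at most `size` items."""
    values = list(values)
    for start in range(0, len(values), size):
        yield values[start:start + size]

print(in_expr("id", [1, 10, 100]))
print(in_expr("title", ["文档1", "文档10"]))

# One expression per batch keeps each filter string short.
for batch in chunked(range(5), 2):
    print(in_expr("id", batch))
```

Each generated string can be passed as the `expr` argument of `collection.query(...)` or `collection.delete(...)`.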
04.数据删除
a.按表达式删除
a.功能说明
通过表达式删除满足条件的实体。删除操作是异步的,立即返回但数据可能不会立即删除。支持按主键、标量字段或组合条件删除。删除大量数据时建议分批进行,避免单次删除过多影响性能。删除后的空间不会立即释放,需要等待compaction操作。
b.代码示例
---
from pymilvus import Collection
collection = Collection("documents")
# 删除单条记录
expr = "id == 1001"
collection.delete(expr)
# 批量删除
ids_to_delete = [1, 2, 3, 4, 5]
expr = f"id in {ids_to_delete}"
collection.delete(expr)
# 条件删除
expr = "id > 2000 and id < 2100"
collection.delete(expr)
# 删除所有数据(慎用)
# expr = "id > 0"
# collection.delete(expr)
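# Delete a large ID range in batches
batch_size = 1000
start_id = 3000
end_id = 10000
for i in range(start_id, end_id, batch_size):
    upper = min(i + batch_size, end_id)
    expr = f"id >= {i} and id < {upper}"
    collection.delete(expr)
    print(f"Deleted IDs {i} to {upper - 1}")
# Flush so the delete operations are persisted
collection.flush()
# num_entities may still count deleted rows until compaction has run
print(f"Entity count reported after delete: {collection.num_entities}")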
---
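When the IDs to delete are not a contiguous range, the batching advice above can be applied to an arbitrary ID list. The sketch below only builds the expression strings; `iter_delete_exprs` is a hypothetical helper and the batch size is an illustrative choice.

```python
def iter_delete_exprs(ids, batch_size=1000, field="id"):
    """Yield one '<field> in [...]' delete expression per batch of integer IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        yield f"{field} in {chunk}"

exprs = list(iter_delete_exprs(range(1, 8), batch_size=3))
for expr in exprs:
    print(expr)  # each string could be passed to collection.delete(expr)
```

Keeping each expression small bounds the work done by any single delete call.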
b.Compaction压缩
a.功能说明
Compaction是Milvus的后台维护操作,用于合并小segment和清理已删除的数据。删除操作只是标记删除,实际空间通过compaction释放。Compaction会重组数据,提高查询性能。可以手动触发compaction,也可以等待自动执行。Compaction过程中collection仍可正常使用,但可能影响性能。
b.代码示例
---
from pymilvus import Collection, utility
collection = Collection("documents")
# Trigger compaction manually
collection.compact()
print("Compaction triggered")
# Block until the compaction started above has finished
collection.wait_for_compaction_completed()
state = collection.get_compaction_state()
print(f"Compaction state: {state.state}")
# Inspect the compaction plans (source segments merged into a target segment)
plans = collection.get_compaction_plans()
for plan in plans.plans:
    print(f"Source segments: {plan.sources}, target segment: {plan.target}")
# 配置自动compaction(milvus.yaml)
# dataCoord:
# enableCompaction: true
# enableAutoCompaction: true
# compaction:
# min:
# interval: 60 # 最小间隔(秒)
# max:
# interval: 3600 # 最大间隔(秒)
# 查看segment信息
segments = utility.get_query_segment_info(collection.name)
total_size = sum(seg.num_rows for seg in segments)
print(f"总segment数: {len(segments)}, 总行数: {total_size}")
for seg in segments[:5]:  # show the first 5 segments
    print(f"Segment {seg.segmentID}: {seg.num_rows} rows, state={seg.state}")
---
3 Collection管理
3.1 Schema定义
01.字段类型
a.标量字段
a.功能说明
Milvus支持多种标量数据类型,包括整数(INT8, INT16, INT32, INT64)、浮点数(FLOAT, DOUBLE)、布尔值(BOOL)、字符串(VARCHAR)和JSON。标量字段用于存储元数据和过滤条件。VARCHAR类型需要指定最大长度。JSON类型支持嵌套结构,可以存储复杂的元数据。标量字段可以建立索引,加速过滤查询。
b.代码示例
---
from pymilvus import FieldSchema, DataType
# 整数类型
id_field = FieldSchema(
name="id",
dtype=DataType.INT64,
is_primary=True,
auto_id=False
)
age_field = FieldSchema(
name="age",
dtype=DataType.INT32
)
# 浮点数类型
score_field = FieldSchema(
name="score",
dtype=DataType.FLOAT
)
# 布尔类型
active_field = FieldSchema(
name="is_active",
dtype=DataType.BOOL
)
# 字符串类型
title_field = FieldSchema(
name="title",
dtype=DataType.VARCHAR,
max_length=500
)
# JSON类型
metadata_field = FieldSchema(
name="metadata",
dtype=DataType.JSON
)
# 所有标量类型示例
fields = [
id_field,
age_field,
score_field,
active_field,
title_field,
metadata_field
]
---
b.向量字段
a.功能说明
向量字段存储高维向量数据,是Milvus的核心字段类型。支持FLOAT_VECTOR(浮点向量)、BINARY_VECTOR(二值向量)和FLOAT16_VECTOR(半精度向量)。必须指定向量维度,维度在创建后不可修改。一个collection可以包含多个向量字段,支持多模态检索。向量字段必须建立索引才能进行相似度搜索。
b.代码示例
---
from pymilvus import FieldSchema, DataType
# 浮点向量(最常用)
embedding_field = FieldSchema(
name="embedding",
dtype=DataType.FLOAT_VECTOR,
dim=128 # 向量维度
)
# 高维向量
high_dim_field = FieldSchema(
name="high_dim_embedding",
dtype=DataType.FLOAT_VECTOR,
dim=1536 # OpenAI ada-002维度
)
# 二值向量(节省存储空间)
binary_field = FieldSchema(
name="binary_embedding",
dtype=DataType.BINARY_VECTOR,
dim=512 # 维度必须是8的倍数
)
# 半精度向量(节省内存)
fp16_field = FieldSchema(
name="fp16_embedding",
dtype=DataType.FLOAT16_VECTOR,
dim=256
)
# 多向量字段(多模态)
text_vector = FieldSchema(
name="text_embedding",
dtype=DataType.FLOAT_VECTOR,
dim=768
)
image_vector = FieldSchema(
name="image_embedding",
dtype=DataType.FLOAT_VECTOR,
dim=512
)
# 向量字段集合
vector_fields = [
embedding_field,
high_dim_field,
binary_field,
fp16_field,
text_vector,
image_vector
]
---
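The storage savings mentioned for binary and half-precision vectors follow directly from the bytes each element takes. A small sketch of that arithmetic is given below; it covers raw vector data only and ignores index and metadata overhead, and the function names are illustrative.

```python
def vector_bytes(dtype, dim):
    """Raw bytes needed for one vector of the given type and dimension."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    if dtype == "FLOAT_VECTOR":      # 32-bit float per dimension
        return dim * 4
    if dtype == "FLOAT16_VECTOR":    # 16-bit float per dimension
        return dim * 2
    if dtype == "BINARY_VECTOR":     # 1 bit per dimension
        if dim % 8 != 0:
            raise ValueError("binary vector dim must be a multiple of 8")
        return dim // 8
    raise ValueError(f"unknown vector type: {dtype}")

def raw_size_mb(dtype, dim, num_vectors):
    """Raw vector storage in MB for num_vectors vectors."""
    return vector_bytes(dtype, dim) * num_vectors / (1024 * 1024)

for dtype, dim in [("FLOAT_VECTOR", 128), ("FLOAT16_VECTOR", 256), ("BINARY_VECTOR", 512)]:
    print(f"{dtype} dim={dim}: {vector_bytes(dtype, dim)} bytes per vector")
```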
02.Schema配置
a.基本Schema
a.功能说明
Schema定义了collection的结构,包括所有字段的定义。必须包含一个主键字段,可以设置为自动生成。可以添加描述信息,便于理解collection用途。Schema创建后不可修改,需要谨慎设计。建议在设计阶段充分考虑业务需求和扩展性。
b.代码示例
---
from pymilvus import CollectionSchema, FieldSchema, DataType
# 定义字段
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=5000),
FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=50),
FieldSchema(name="timestamp", dtype=DataType.INT64),
FieldSchema(name="score", dtype=DataType.FLOAT),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
]
# 创建Schema
schema = CollectionSchema(
fields=fields,
description="文档搜索系统",
enable_dynamic_field=False
)
# 查看Schema信息
print(f"字段数量: {len(schema.fields)}")
for field in schema.fields:
print(f"字段: {field.name}, 类型: {field.dtype}, 主键: {field.is_primary}")
# Schema验证
print(f"主键字段: {schema.primary_field.name}")
print(f"自动ID: {schema.auto_id}")
---
b.动态Schema
a.功能说明
动态Schema允许插入未在Schema中定义的字段,提供更大的灵活性。动态字段会自动推断类型,存储在内部的JSON字段中。适合元数据结构不固定的场景,如用户自定义属性。动态字段可以用于过滤查询,但性能不如预定义字段。启用动态Schema会增加一定的存储开销。
b.代码示例
---
from pymilvus import CollectionSchema, FieldSchema, DataType, Collection
# 启用动态Schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(
fields=fields,
description="动态Schema示例",
enable_dynamic_field=True # 启用动态字段
)
collection = Collection("dynamic_collection", schema=schema)
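# Insert data that includes dynamic fields.
# Dynamic fields are not declared in the schema, so each entity is passed as a
# dict (row-based insert) and the extra keys are stored as dynamic fields.
rows = [
    {"id": 1, "embedding": [0.1] * 128, "title": "标题1", "score": 100, "metadata": {"tag": "AI"}},
    {"id": 2, "embedding": [0.2] * 128, "title": "标题2", "score": 200, "metadata": {"tag": "ML"}},
    {"id": 3, "embedding": [0.3] * 128, "title": "标题3", "score": 300, "metadata": {"tag": "DL"}},
]
collection.insert(rows)
collection.flush()
# The vector field needs an index before the collection can be loaded
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "FLAT", "metric_type": "L2", "params": {}}
)
# Query the dynamic fields
collection.load()
results = collection.query(
    expr="id > 0",
    output_fields=["id", "title", "score", "metadata"]
)
for result in results:
    print(f"ID: {result['id']}, Title: {result.get('title')}, Score: {result.get('score')}")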
---
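Dynamic fields are identified by name, so entities that carry them are naturally expressed as one dict per entity. If the data already exists as parallel columns, a small conversion step produces that shape; `columns_to_rows` below is a hypothetical helper written for illustration.

```python
def columns_to_rows(field_names, columns):
    """Turn parallel columns into a list of per-entity dicts."""
    if len(field_names) != len(columns):
        raise ValueError("need exactly one column per field name")
    if len({len(col) for col in columns}) > 1:
        raise ValueError("all columns must have the same length")
    return [dict(zip(field_names, values)) for values in zip(*columns)]

rows = columns_to_rows(
    ["id", "title", "score"],
    [[1, 2], ["标题1", "标题2"], [100, 200]],
)
print(rows)
```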
03.主键设计
a.自增主键
a.功能说明
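With auto_id enabled, Milvus generates the primary key itself and guarantees that it is globally unique. The generated values are 64-bit integers allocated by the server, so they are best treated as opaque identifiers rather than parsed for meaning. Auto-generated keys simplify insertion because the application does not have to maintain IDs, which suits cases where no custom ID is needed. The IDs generally increase over time but are not guaranteed to be consecutive.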
b.代码示例
---
from pymilvus import CollectionSchema, FieldSchema, DataType, Collection
import numpy as np
# 定义自增主键Schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields, description="自增ID示例")
collection = Collection("auto_id_collection", schema=schema)
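# Insert data (no id column is supplied)
texts = [f"文本{i}" for i in range(100)]
embeddings = np.random.random((100, 128)).tolist()
data = [texts, embeddings]  # note: no id field
insert_result = collection.insert(data)
collection.flush()
# Read back the generated IDs
generated_ids = list(insert_result.primary_keys)
print(f"Generated IDs: {generated_ids[:10]}")
# Index and load the collection so it can be queried
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "FLAT", "metric_type": "L2", "params": {}}
)
collection.load()
# Query by the generated IDs
results = collection.query(
    expr=f"id in {generated_ids[:5]}",
    output_fields=["id", "text"]
)
for result in results:
    print(f"ID: {result['id']}, Text: {result['text']}")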
---
b.自定义主键
a.功能说明
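A custom primary key is supplied by the application and can be a business ID or a UUID. The application is responsible for keeping keys unique: a plain insert does not check for duplicates, so inserting the same key twice can leave two entities with that key instead of raising an error. When existing keys need to be overwritten, use upsert or delete the old entity first. Custom keys make integration with existing systems easier because entities can be looked up directly by business ID. Primary keys may be INT64 or VARCHAR, and a VARCHAR key can declare a max_length of up to 65535.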
b.代码示例
---
from pymilvus import CollectionSchema, FieldSchema, DataType, Collection
import uuid
import numpy as np
# INT64自定义主键
fields_int = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema_int = CollectionSchema(fields=fields_int, description="INT64主键")
collection_int = Collection("custom_int_id", schema=schema_int)
# 插入数据(提供自定义ID)
ids = [1000 + i for i in range(100)]
texts = [f"文本{i}" for i in range(100)]
embeddings = [[np.random.random() for _ in range(128)] for _ in range(100)]
data = [ids, texts, embeddings]
collection_int.insert(data)
# VARCHAR主键(UUID)
fields_str = [
FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=36, is_primary=True, auto_id=False),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema_str = CollectionSchema(fields=fields_str, description="VARCHAR主键")
collection_str = Collection("custom_str_id", schema=schema_str)
# 使用UUID作为主键
uuids = [str(uuid.uuid4()) for _ in range(100)]
data = [uuids, texts, embeddings]
collection_str.insert(data)
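# Index and load before querying by UUID
collection_str.flush()
collection_str.create_index(
    field_name="embedding",
    index_params={"index_type": "FLAT", "metric_type": "L2", "params": {}}
)
collection_str.load()
results = collection_str.query(
    expr=f'id == "{uuids[0]}"',
    output_fields=["id", "text"]
)
print(f"UUID query result: {results[0]}")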
# 业务ID示例(如订单号)
order_ids = [f"ORDER{i:08d}" for i in range(100)]
data = [order_ids, texts, embeddings]
collection_str.insert(data)
---
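Because uniqueness of custom primary keys has to be ensured by the application, it can help to deduplicate a batch on the client before inserting it. The following is a minimal sketch under the assumption that the last row for a given key should win; `dedupe_by_key` is a hypothetical helper name.

```python
def dedupe_by_key(rows, key="id"):
    """Keep one row per primary key; later rows replace earlier ones."""
    latest = {}
    for row in rows:
        latest[row[key]] = row  # same key seen again: overwrite in place
    return list(latest.values())

batch = [
    {"id": "ORDER00000001", "text": "old"},
    {"id": "ORDER00000002", "text": "other"},
    {"id": "ORDER00000001", "text": "new"},
]
unique_rows = dedupe_by_key(batch)
print(unique_rows)
```

This only removes duplicates within one batch; keys that already exist in the collection still need a lookup or an overwrite strategy on the server side.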
04.Schema最佳实践
a.字段选择
a.功能说明
合理选择字段类型可以优化存储和性能。只包含必要的字段,避免冗余数据。VARCHAR字段设置合理的最大长度,过大会浪费存储空间。对于高频过滤的字段,建议建立标量索引。JSON字段适合存储非结构化元数据,但查询性能不如预定义字段。
b.代码示例
---
from pymilvus import CollectionSchema, FieldSchema, DataType
# 优化前:字段过多,类型不合理
fields_bad = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=10000), # 过大
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=50000), # 过大
FieldSchema(name="author", dtype=DataType.VARCHAR, max_length=5000), # 过大
FieldSchema(name="tags", dtype=DataType.VARCHAR, max_length=10000), # 应该用JSON
FieldSchema(name="metadata", dtype=DataType.VARCHAR, max_length=10000), # 应该用JSON
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
# 优化后:字段精简,类型合理
fields_good = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200), # 合理长度
FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=50), # 用于过滤
FieldSchema(name="timestamp", dtype=DataType.INT64), # 时间戳(便于范围查询)
FieldSchema(name="metadata", dtype=DataType.JSON), # 灵活的元数据
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema_good = CollectionSchema(
fields=fields_good,
description="优化的Schema设计"
)
# 字段索引策略
# 1. 主键自动索引
# 2. 向量字段必须建索引
# 3. 高频过滤字段建标量索引
# 4. JSON字段不建索引(性能考虑)
---
b.版本管理
a.功能说明
Schema一旦创建就不可修改,需要做好版本管理。可以通过collection名称包含版本号来管理不同版本。数据迁移时,创建新collection并逐步迁移数据。使用别名机制,应用层无需感知collection变化。建议在开发阶段充分测试Schema设计,避免频繁变更。
b.代码示例
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType, utility
import time
import numpy as np
# Schema版本管理策略
# V1版本
fields_v1 = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema_v1 = CollectionSchema(fields=fields_v1, description="V1版本")
collection_v1 = Collection("documents_v1", schema=schema_v1)
# V2版本(增加字段)
fields_v2 = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=50), # 新增
FieldSchema(name="timestamp", dtype=DataType.INT64), # 新增
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=256) # 维度变化
]
schema_v2 = CollectionSchema(fields=fields_v2, description="V2版本")
collection_v2 = Collection("documents_v2", schema=schema_v2)
# Data migration function
def migrate_data(source_collection, target_collection, batch_size=1000):
    source_collection.load()
    # Iterate over the source in batches; an iterator avoids the
    # offset + limit cap that applies to plain paginated queries
    iterator = source_collection.query_iterator(
        batch_size=batch_size,
        expr="id > 0",
        output_fields=["id", "text", "embedding"]
    )
    migrated = 0
    while True:
        results = iterator.next()
        if len(results) == 0:
            iterator.close()
            break
        # Convert the rows to column format
        ids = [r["id"] for r in results]
        texts = [r["text"] for r in results]
        # upgrade_embedding is assumed to be a user-supplied function (not
        # defined here) that maps a 128-dim vector to 256 dims
        embeddings = [upgrade_embedding(r["embedding"]) for r in results]
        # Fill the new fields with defaults
        categories = ["default"] * len(results)
        timestamps = [int(time.time())] * len(results)
        # Insert into the target collection
        data = [ids, texts, categories, timestamps, embeddings]
        target_collection.insert(data)
        migrated += len(results)
        print(f"Migrated {migrated} rows")
    target_collection.flush()
# 使用别名进行平滑切换
utility.create_alias(collection_name="documents_v1", alias="documents")
# 迁移完成后切换别名
# utility.alter_alias(collection_name="documents_v2", alias="documents")
# 应用层代码不变
collection = Collection("documents") # 通过别名访问
---
3.2 创建Collection
01.Collection创建方法
a.基本创建
a.功能说明
创建Collection需要提供名称和Schema定义。Collection名称必须唯一,不能与已存在的collection重复。可以指定分片数量,影响并行查询性能。创建后立即返回Collection对象,但不会自动加载到内存。建议在创建后立即创建索引,避免后续数据插入时的性能问题。Collection名称支持字母、数字和下划线,长度不超过255字符。
b.代码示例
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
# 定义Schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields, description="文档集合")
# 创建Collection
collection = Collection(
name="documents",
schema=schema,
using="default",
shards_num=2
)
print(f"Collection创建成功: {collection.name}")
print(f"分片数量: {collection.shards_num}")
print(f"Schema: {collection.schema}")
# 验证创建
from pymilvus import utility
assert utility.has_collection("documents")
---
b.从已有Collection创建
a.功能说明
可以通过Collection名称获取已存在的collection对象。这种方式不会重新创建collection,只是获取引用。适合在不同模块或进程中访问同一个collection。如果collection不存在会抛出异常,可以先检查是否存在。获取的Collection对象与原对象共享相同的元数据和数据。多个Collection对象可以指向同一个collection,修改会互相影响。
b.代码示例
---
from pymilvus import Collection, utility
# Check whether the collection exists
if utility.has_collection("documents"):
    # Get the existing collection
    collection = Collection("documents")
    print(f"Got collection: {collection.name}")
    print(f"Entity count: {collection.num_entities}")
    print(f"Schema: {collection.schema}")
else:
    print("Collection does not exist")
# Get the collection if it exists, otherwise create it
# (schema is a CollectionSchema defined as in the previous example)
def get_or_create_collection(name, schema, shards_num=2):
    if utility.has_collection(name):
        return Collection(name)
    else:
        return Collection(name, schema=schema, shards_num=shards_num)
collection = get_or_create_collection("documents", schema)
# 多个引用示例
collection1 = Collection("documents")
collection2 = Collection("documents")
# 两个对象指向同一个collection
print(f"相同collection: {collection1.name == collection2.name}")
---
02.Collection配置
a.分片配置
a.功能说明
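The shard count determines how inserted data is distributed and how much write parallelism the collection has. More shards can raise write throughput but add management overhead, so choose the count from the expected data volume and load. One or two shards are usually enough on a single machine, while a cluster can use more. The shard count cannot be changed after creation, so pick it carefully. Each shard manages its own portion of the data, and a query is processed across all shards in parallel.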
b.代码示例
---
from pymilvus import Collection, CollectionSchema
# 单分片(小数据量,<100万)
collection_small = Collection(
name="small_collection",
schema=schema,
shards_num=1
)
# 多分片(大数据量,>1000万)
collection_large = Collection(
name="large_collection",
schema=schema,
shards_num=4
)
# Choose the shard count from the estimated data volume
def calculate_shards(estimated_entities):
    if estimated_entities < 1000000:
        return 1
    elif estimated_entities < 10000000:
        return 2
    elif estimated_entities < 100000000:
        return 4
    else:
        return 8
shards = calculate_shards(5000000)
collection = Collection(
name="dynamic_shards",
schema=schema,
shards_num=shards
)
print(f"数据量: 5000000, 分片数: {shards}")
# 查看分片信息
print(f"Collection分片数: {collection.shards_num}")
---
b.属性配置
a.功能说明
Collection支持设置多种属性,如TTL(数据过期时间)、副本数量等。TTL可以自动清理过期数据,适合时效性数据。副本数量影响查询性能和可用性,更多副本可以提高查询吞吐量。属性可以在创建后修改,提供灵活的配置能力。TTL以秒为单位,0表示永不过期。副本数量建议设置为2-3,过多会增加存储开销。
b.代码示例
---
from pymilvus import Collection
collection = Collection("documents")
# 设置TTL(秒)
collection.set_properties(properties={"collection.ttl.seconds": 86400}) # 1天
print("TTL设置为1天")
# 设置副本数量
collection.set_properties(properties={"collection.replica.number": 2})
print("副本数量设置为2")
# 查看属性
properties = collection.properties
print(f"Collection属性: {properties}")
# 批量设置属性
collection.set_properties(properties={
"collection.ttl.seconds": 172800, # 2天
"collection.replica.number": 3
})
# 删除TTL(永不过期)
collection.set_properties(properties={"collection.ttl.seconds": 0})
print("TTL已禁用")
# 常用属性配置
# 1. 缓存数据(短期)
cache_collection = Collection("cache")
cache_collection.set_properties(properties={"collection.ttl.seconds": 3600}) # 1小时
# 2. 日志数据(中期)
log_collection = Collection("logs")
log_collection.set_properties(properties={"collection.ttl.seconds": 604800}) # 7天
# 3. 持久数据(长期)
persistent_collection = Collection("persistent")
persistent_collection.set_properties(properties={"collection.ttl.seconds": 0}) # 永久
---
03.别名管理
a.创建别名
a.功能说明
别名是collection的另一个名称,可以用于平滑升级和版本管理。一个collection可以有多个别名,一个别名只能指向一个collection。通过别名访问collection,应用层无需感知实际的collection名称。适合在数据迁移或Schema变更时使用。别名操作是原子的,切换过程中不会影响服务。
b.代码示例
---
from pymilvus import utility, Collection
# 创建别名
utility.create_alias(
collection_name="documents_v1",
alias="documents"
)
print("别名创建成功")
# 通过别名访问
collection = Collection("documents") # 实际访问documents_v1
print(f"实际collection: {collection.name}")
# 查看别名列表
aliases = utility.list_aliases("documents_v1")
print(f"别名列表: {aliases}")
# 一个collection多个别名
utility.create_alias("documents_v1", "docs")
utility.create_alias("documents_v1", "doc_collection")
# 所有别名都指向同一个collection
col1 = Collection("documents")
col2 = Collection("docs")
col3 = Collection("doc_collection")
print(f"实体数量一致: {col1.num_entities == col2.num_entities == col3.num_entities}")
---
b.切换别名
a.功能说明
可以将别名切换到另一个collection,实现平滑升级。切换操作是原子的,不会出现中间状态。适合在新旧版本切换时使用,应用层无需修改代码。切换前建议先验证新collection的数据完整性。可以通过别名实现蓝绿部署和灰度发布。
b.代码示例
---
from pymilvus import utility, Collection
# 初始状态:别名指向v1
utility.create_alias("documents_v1", "documents")
# 创建新版本collection
collection_v2 = Collection("documents_v2", schema=new_schema)
# ... 迁移数据到v2 ...
# 切换别名到v2
utility.alter_alias(
collection_name="documents_v2",
alias="documents"
)
print("别名已切换到v2")
# 现在通过别名访问的是v2
collection = Collection("documents")
print(f"当前版本: {collection.name}")
# 蓝绿部署示例
def blue_green_deployment(old_collection, new_collection, alias):
# 1. 验证新collection
new_col = Collection(new_collection)
assert new_col.num_entities > 0, "新collection数据为空"
# 2. 切换别名
utility.alter_alias(
collection_name=new_collection,
alias=alias
)
print(f"已切换到新版本: {new_collection}")
# 3. 保留旧版本一段时间,以便回滚
# 如果需要回滚
# utility.alter_alias(collection_name=old_collection, alias=alias)
blue_green_deployment("documents_v1", "documents_v2", "documents")
# 删除别名
utility.drop_alias("documents")
print("别名已删除")
---
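The switch-and-rollback flow above can be reasoned about with a tiny in-memory model of the alias table. The sketch below is not the Milvus implementation; `AliasRegistry` is a hypothetical class that mirrors the create/alter/drop semantics described in this section so the flow can be exercised without a server.

```python
class AliasRegistry:
    """Toy model: each alias points at exactly one collection."""

    def __init__(self, collections):
        self._collections = set(collections)
        self._aliases = {}

    def _require(self, collection_name):
        if collection_name not in self._collections:
            raise KeyError(f"unknown collection: {collection_name}")

    def create_alias(self, collection_name, alias):
        self._require(collection_name)
        if alias in self._aliases:
            raise ValueError(f"alias already exists: {alias}")
        self._aliases[alias] = collection_name

    def alter_alias(self, collection_name, alias):
        """Repoint an existing alias; returns the previous target for rollback."""
        self._require(collection_name)
        if alias not in self._aliases:
            raise KeyError(f"unknown alias: {alias}")
        previous = self._aliases[alias]
        self._aliases[alias] = collection_name  # one assignment, no half-way state
        return previous

    def drop_alias(self, alias):
        del self._aliases[alias]

    def resolve(self, alias):
        return self._aliases[alias]

registry = AliasRegistry(["documents_v1", "documents_v2"])
registry.create_alias("documents_v1", "documents")
previous = registry.alter_alias("documents_v2", "documents")   # go live on v2
after_switch = registry.resolve("documents")
registry.alter_alias(previous, "documents")                    # roll back to v1
after_rollback = registry.resolve("documents")
print(after_switch, after_rollback)
```

Returning the previous target from `alter_alias` is a design choice that makes the rollback step explicit in calling code.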
04.Collection元数据
a.查看元数据
a.功能说明
Collection包含丰富的元数据信息,包括Schema定义、统计信息、索引信息等。通过元数据可以了解collection的结构和状态。元数据查询不需要加载collection,性能开销小。可以用于监控和管理collection。元数据会实时更新,反映collection的最新状态。
b.代码示例
---
from pymilvus import Collection, DataType, utility
collection = Collection("documents")
# Schema信息
print(f"Collection名称: {collection.name}")
print(f"描述: {collection.description}")
print(f"Schema: {collection.schema}")
# Field information
for field in collection.schema.fields:
    print(f"Field: {field.name}")
    print(f"  Type: {field.dtype}")
    print(f"  Primary key: {field.is_primary}")
    if field.dtype == DataType.FLOAT_VECTOR:
        print(f"  Dimension: {field.params.get('dim')}")
    if field.dtype == DataType.VARCHAR:
        print(f"  Max length: {field.params.get('max_length')}")
# 统计信息
print(f"实体数量: {collection.num_entities}")
print(f"分片数量: {collection.shards_num}")
# 索引信息
indexes = collection.indexes
for index in indexes:
print(f"索引字段: {index.field_name}")
print(f"索引参数: {index.params}")
# 加载状态
load_state = utility.load_state("documents")
print(f"加载状态: {load_state}")
# 属性信息
properties = collection.properties
print(f"属性: {properties}")
---
b.监控统计
a.功能说明
可以通过元数据监控collection的使用情况和性能指标。统计信息包括实体数量、segment信息、内存占用等。定期监控可以及时发现问题,如数据倾斜、内存不足等。可以基于统计信息进行容量规划和性能优化。Milvus提供了丰富的监控API和指标。
b.代码示例
---
from pymilvus import Collection, utility
from pymilvus.client.types import LoadState
import time
collection = Collection("documents")
# Monitoring function
def monitor_collection(collection_name, interval=60):
    while True:
        collection = Collection(collection_name)
        # Basic statistics
        print(f"\n=== {time.strftime('%Y-%m-%d %H:%M:%S')} ===")
        print(f"Entity count: {collection.num_entities:,}")
        # Segment information
        segments = utility.get_query_segment_info(collection_name)
        print(f"Segment count: {len(segments)}")
        total_rows = sum(seg.num_rows for seg in segments)
        print(f"Total rows: {total_rows:,}")
        # Group segments by state
        state_counts = {}
        for seg in segments:
            state = seg.state
            state_counts[state] = state_counts.get(state, 0) + 1
        print(f"Segment states: {state_counts}")
        # Memory estimate (only meaningful while the collection is loaded)
        if utility.load_state(collection_name) == LoadState.Loaded:
            vector_dim = 128  # must match the collection's vector dimension
            vector_size = total_rows * vector_dim * 4  # float32
            print(f"Estimated vector memory: {vector_size / 1024 / 1024:.2f} MB")
        time.sleep(interval)
# 启动监控(在后台线程中运行)
import threading
monitor_thread = threading.Thread(
target=monitor_collection,
args=("documents", 60),
daemon=True
)
monitor_thread.start()
# 性能指标收集
def collect_metrics(collection_name):
collection = Collection(collection_name)
metrics = {
"name": collection_name,
"entities": collection.num_entities,
"shards": collection.shards_num,
"load_state": str(utility.load_state(collection_name)),
"timestamp": time.time()
}
# 添加segment信息
segments = utility.get_query_segment_info(collection_name)
metrics["segments"] = len(segments)
metrics["total_rows"] = sum(seg.num_rows for seg in segments)
return metrics
metrics = collect_metrics("documents")
print(f"指标: {metrics}")
---
3.3 加载和释放
01.加载Collection
a.加载到内存
a.功能说明
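A collection is not loaded into memory when it is created; load() has to be called explicitly. Loading places the data and indexes in Query Node memory, which is required before searching or querying. By default load() blocks until loading has finished; pass _async=True to return immediately and then track progress with utility.loading_progress or utility.load_state. Large collections can take a long time to load, so schedule the load for off-peak hours. A loaded collection occupies memory, so plan capacity against the server configuration. Loading reads every segment and index file, which makes network and disk I/O the main bottlenecks.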
b.代码示例
---
from pymilvus import Collection, utility
from pymilvus.client.types import LoadState
import time
collection = Collection("documents")
# Start loading without blocking (load() blocks by default)
print("Loading collection...")
collection.load(_async=True)
# Poll until loading completes
while True:
    state = utility.load_state("documents")
    if state == LoadState.Loaded:
        print("Load complete")
        break
    elif state == LoadState.Loading:
        print(f"Loading... {utility.loading_progress('documents')}")
        time.sleep(1)
    else:
        print(f"Unexpected load state: {state}")
        break
# Check the load state
print(f"Current state: {utility.load_state('documents')}")
# Changing the replica count requires releasing the collection first
collection.release()
collection.load(replica_number=2)
print("Loaded with 2 replicas")
# Monitor load progress with elapsed time
def monitor_load_progress(collection_name, check_interval=1):
    start_time = time.time()
    while True:
        state = utility.load_state(collection_name)
        elapsed = time.time() - start_time
        if state == LoadState.Loaded:
            print(f"Load complete in {elapsed:.2f}s")
            break
        elif state == LoadState.Loading:
            print(f"Loading... {elapsed:.2f}s elapsed")
            time.sleep(check_interval)
        else:
            print(f"Unexpected load state: {state}")
            break
monitor_load_progress("documents")
---
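Polling loops like the ones above have no upper bound on how long they wait. A generic bounded wait is sketched below, with the state source passed in as a callable so it can wrap a load-state check or be tested with a fake; `wait_until` and its parameters are illustrative names, not a pymilvus API.

```python
import time

def wait_until(get_state, done_state, timeout=60.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns done_state or the timeout expires.

    Returns True when done_state was observed, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == done_state:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Fake state source standing in for a real load-state check
states = iter(["Loading", "Loading", "Loaded"])
finished = wait_until(lambda: next(states), "Loaded", timeout=5, interval=0)
timed_out = wait_until(lambda: "Loading", "Loaded", timeout=0, interval=0)
print(finished, timed_out)
```

Injecting `clock` and `sleep` keeps the helper testable without real delays.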
b.分区加载
a.功能说明
可以只加载部分分区到内存,节省资源。适合数据按时间或类别分区的场景,只加载热数据分区。分区加载可以显著减少内存占用,提高加载速度。查询时只能查询已加载的分区,未加载分区的数据不可见。可以动态加载和释放分区,实现冷热数据分离。分区加载特别适合时间序列数据,如日志、监控数据等。
b.代码示例
---
from pymilvus import Collection, Partition, utility
collection = Collection("documents")
# 创建分区
partition_2024 = Partition(collection, "2024")
partition_2025 = Partition(collection, "2025")
partition_2026 = Partition(collection, "2026")
# 只加载2026分区(最新数据)
partition_2026.load()
print("已加载2026分区")
# 查询只在已加载分区中进行
results = collection.search(
data=[[0.1]*128],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 10}},
limit=10,
partition_names=["2026"]
)
print(f"搜索结果: {len(results[0])} 条")
# 加载多个分区
collection.load(partition_names=["2025", "2026"])
print("已加载2025和2026分区")
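# Dynamic partition management
def load_recent_partitions(collection, months=3):
    from datetime import datetime
    # Work out the names of the most recent monthly partitions (e.g. "202601")
    # by stepping back one calendar month at a time; subtracting 30-day
    # blocks can repeat or skip a month
    now = datetime.now()
    year, month = now.year, now.month
    partitions_to_load = []
    for _ in range(months):
        partitions_to_load.append(f"{year}{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    # Load the partitions (they must already exist in the collection)
    collection.load(partition_names=partitions_to_load)
    print(f"Loaded partitions for the last {months} months: {partitions_to_load}")
load_recent_partitions(collection, months=3)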
# 释放特定分区
partition_2024.release()
print("已释放2024分区")
# Check the load state of each partition
for partition in collection.partitions:
    state = utility.load_state("documents", partition_names=[partition.name])
    print(f"Partition {partition.name}: {state}")
---
02.释放Collection
a.释放内存
a.功能说明
释放操作会将collection从内存中卸载,释放Query Node的内存资源。释放后无法进行搜索查询,但数据仍然保存在存储层。适合临时使用的collection或需要释放内存的场景。释放是异步操作,立即返回但可能需要时间完成。释放后可以重新加载,不影响数据完整性。释放操作不会删除数据,只是从内存中移除。
b.代码示例
---
from pymilvus import Collection, utility
import time
collection = Collection("documents")
# 释放Collection
print("开始释放Collection...")
collection.release()
# 等待释放完成
time.sleep(1)
state = utility.load_state("documents")
print(f"释放后状态: {state}")
# 验证释放
assert state == utility.LoadState.NotLoad, "释放失败"
# 释放特定分区
from pymilvus import Partition
partition = Partition(collection, "2024")
partition.release()
print("已释放2024分区")
# 释放所有分区
collection.release()
print("已释放所有分区")
# 重新加载
collection.load()
print("已重新加载")
# 释放前检查
def safe_release(collection_name):
state = utility.load_state(collection_name)
if state == utility.LoadState.Loaded:
collection = Collection(collection_name)
collection.release()
print(f"已释放: {collection_name}")
return True
elif state == utility.LoadState.NotLoad:
print(f"未加载,无需释放: {collection_name}")
return True
else:
print(f"状态异常: {state}")
return False
safe_release("documents")
---
b.内存管理
a.功能说明
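Managing when collections are loaded and released keeps memory use under control. Load only the collections that are actively used and release inactive ones periodically. Loaded data lives in the Milvus Query Nodes, so base loading decisions on server-side memory metrics rather than on the memory of the client process. Partition-level loading gives finer-grained control over memory. Do not rely on the system to free memory on its own: when Query Nodes run short of memory a load request can simply fail, so release collections explicitly. An LRU policy in the application can automate which collections stay loaded.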
b.代码示例
---
from pymilvus import Collection, utility
import psutil
import time
from collections import OrderedDict
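def get_memory_usage():
    """Return the resident memory of the current (client) process in MB.

    Note: this measures the Python client, not the Milvus Query Nodes that
    hold loaded collections; use server-side metrics in production.
    """
    process = psutil.Process()
    return process.memory_info().rss / 1024 / 1024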
# LRU Collection管理器
class CollectionManager:
def __init__(self, max_memory_mb=8192, max_collections=5):
self.max_memory_mb = max_memory_mb
self.max_collections = max_collections
self.loaded_collections = OrderedDict()
self.access_count = {}
def load_collection(self, collection_name):
# 如果已加载,更新访问时间
if collection_name in self.loaded_collections:
self.loaded_collections.move_to_end(collection_name)
self.access_count[collection_name] += 1
return
# 检查内存使用
current_memory = get_memory_usage()
# 内存不足或collection数量超限,释放最久未使用的
while (current_memory > self.max_memory_mb * 0.8 or
len(self.loaded_collections) >= self.max_collections):
if not self.loaded_collections:
break
old_name, _ = self.loaded_collections.popitem(last=False)
Collection(old_name).release()
print(f"释放Collection: {old_name}")
time.sleep(0.5)
current_memory = get_memory_usage()
# 加载新collection
collection = Collection(collection_name)
collection.load()
self.loaded_collections[collection_name] = time.time()
self.access_count[collection_name] = 1
print(f"加载Collection: {collection_name}")
def release_all(self):
"""释放所有collection"""
for name in list(self.loaded_collections.keys()):
Collection(name).release()
self.loaded_collections.clear()
self.access_count.clear()
print("已释放所有collection")
def get_stats(self):
"""获取统计信息"""
return {
"loaded_count": len(self.loaded_collections),
"memory_mb": get_memory_usage(),
"collections": list(self.loaded_collections.keys()),
"access_count": self.access_count
}
# 使用管理器
manager = CollectionManager(max_memory_mb=8192, max_collections=3)
# 模拟访问
manager.load_collection("documents")
manager.load_collection("images")
manager.load_collection("videos")
# 访问已加载的collection
manager.load_collection("documents") # 更新访问时间
# 加载新collection(会触发释放)
manager.load_collection("audio")
# 查看统计
stats = manager.get_stats()
print(f"统计信息: {stats}")
# 定期清理
def periodic_cleanup(manager, interval=300):
"""定期清理不活跃的collection"""
while True:
time.sleep(interval)
current_time = time.time()
to_release = []
for name, load_time in manager.loaded_collections.items():
# 超过5分钟未访问
if current_time - load_time > 300:
to_release.append(name)
for name in to_release:
Collection(name).release()
del manager.loaded_collections[name]
print(f"清理不活跃collection: {name}")
# 启动定期清理(后台线程)
import threading
cleanup_thread = threading.Thread(
target=periodic_cleanup,
args=(manager, 300),
daemon=True
)
cleanup_thread.start()
---
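The eviction order of the LRU policy described above can be checked in isolation by separating the bookkeeping from the Milvus calls. The sketch below is a simplified variant, not the manager from this section: `LruLoader` is a hypothetical class that takes the load and release actions as callables (in real use they might wrap `Collection(name).load()` and `Collection(name).release()`) and evicts purely by count.

```python
from collections import OrderedDict

class LruLoader:
    """Track loaded collections and evict the least recently used one."""

    def __init__(self, load, release, max_loaded=3):
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._load = load
        self._release = release
        self.max_loaded = max_loaded
        self._loaded = OrderedDict()  # oldest access first

    def access(self, name):
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return
        while len(self._loaded) >= self.max_loaded:
            victim, _ = self._loaded.popitem(last=False)
            self._release(victim)
        self._load(name)
        self._loaded[name] = True

    def loaded(self):
        return list(self._loaded)

# Record the actions instead of calling a server
loaded_log, released_log = [], []
lru = LruLoader(loaded_log.append, released_log.append, max_loaded=3)
for name in ["documents", "images", "videos", "documents", "audio"]:
    lru.access(name)
print(lru.loaded(), released_log)
```

Because "documents" was accessed again before "audio" arrived, "images" is the least recently used entry and is the one released.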
03.副本管理
a.副本配置
a.功能说明
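A replica is a complete in-memory copy of a collection that raises query throughput and availability. Multiple replicas serve queries in parallel, which improves concurrency. The replica count is given when the collection is loaded; to change it, release the collection and load it again with the new count. Every replica takes the same amount of memory, and the count cannot exceed what the available Query Nodes can host, so check resource limits first. Replicas are placed on different Query Nodes for load balancing, and when one replica fails, requests are served by the others.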
b.代码示例
---
from pymilvus import Collection, utility
collection = Collection("documents")
# 加载时指定副本数量
collection.load(replica_number=2)
print("已加载2个副本")
# Inspect the replicas
replicas = collection.get_replicas()
print(f"Replica count: {len(replicas.groups)}")
for i, replica in enumerate(replicas.groups):
    print(f"Replica {i}:")
    print(f"  Replica ID: {replica.id}")
    print(f"  Shards: {replica.shards}")
    print(f"  Nodes: {replica.group_nodes}")
# 动态调整副本数量
collection.release()
collection.load(replica_number=3)
print("副本数量已调整为3")
# 副本负载均衡测试
import concurrent.futures
import time
def query_task(task_id):
    start = time.time()
    results = collection.search(
        data=[[0.1]*128],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=10
    )
    return time.time() - start
# 100 queries spread over 50 worker threads
wall_start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(query_task, i) for i in range(100)]
    times = [f.result() for f in futures]
wall_elapsed = time.time() - wall_start
avg_time = sum(times) / len(times)
print(f"Average query latency: {avg_time*1000:.2f}ms")
# Throughput is completed queries over wall-clock time, not over summed latencies
print(f"QPS: {len(times) / wall_elapsed:.2f}")
---
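Since every replica holds a full copy of the data, memory grows linearly with the replica count. A rough planning sketch is given below; it counts raw float32 vector data only, and the check that replicas must not exceed the number of Query Nodes is stated here as an assumption about the deployment. `replica_plan` is a hypothetical helper.

```python
def replica_plan(num_rows, dim, replicas, query_nodes, bytes_per_value=4):
    """Estimate raw vector memory per replica and in total (MB)."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if replicas > query_nodes:
        # Assumption: each replica needs at least one Query Node of its own
        raise ValueError("replica count exceeds the number of query nodes")
    per_replica_mb = num_rows * dim * bytes_per_value / (1024 * 1024)
    return {
        "per_replica_mb": per_replica_mb,
        "total_mb": per_replica_mb * replicas,
    }

plan = replica_plan(num_rows=1_000_000, dim=128, replicas=2, query_nodes=3)
print(plan)
```

Index structures and scalar fields add to these figures, so treat the result as a lower bound.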
b.副本监控
a.功能说明
可以监控副本的状态和负载分布,确保系统正常运行。副本信息包括副本ID、所在节点、分片分布等。通过监控可以发现副本不均衡、节点故障等问题。Milvus会自动管理副本的分布和故障转移。建议定期检查副本状态,及时发现和处理异常。
b.代码示例
---
from pymilvus import Collection, utility
import time
collection = Collection("documents")
collection.load(replica_number=2)
# Replica monitoring function
def monitor_replicas(collection_name, interval=60):
    while True:
        collection = Collection(collection_name)
        replicas = collection.get_replicas()
        print(f"\n=== {time.strftime('%Y-%m-%d %H:%M:%S')} ===")
        print(f"Replica count: {len(replicas.groups)}")
        for i, replica in enumerate(replicas.groups):
            print(f"\nReplica {i}:")
            print(f"  ID: {replica.id}")
            print(f"  Shards: {len(replica.shards)}")
            print(f"  Nodes: {len(replica.group_nodes)}")
            # Shard details
            for shard in replica.shards:
                print(f"  Channel: {shard.channel_name}")
                print(f"    Leader node: {shard.shard_leader}")
                print(f"    Nodes: {shard.shard_nodes}")
        # Collect every node that hosts a replica
        all_nodes = set()
        for replica in replicas.groups:
            all_nodes.update(replica.group_nodes)
        print(f"\nTotal nodes: {len(all_nodes)}")
        print(f"Node list: {all_nodes}")
        # Count replicas per node to check the balance
        node_replica_count = {}
        for replica in replicas.groups:
            for node in replica.group_nodes:
                node_replica_count[node] = node_replica_count.get(node, 0) + 1
        print(f"Replicas per node: {node_replica_count}")
        time.sleep(interval)
# 启动监控
import threading
monitor_thread = threading.Thread(
target=monitor_replicas,
args=("documents", 60),
daemon=True
)
monitor_thread.start()
# Replica health check
def check_replica_health(collection_name):
    collection = Collection(collection_name)
    replicas = collection.get_replicas()
    if len(replicas.groups) == 0:
        return False, "No replicas"
    # Check every replica
    for replica in replicas.groups:
        if len(replica.group_nodes) == 0:
            return False, f"Replica {replica.id} has no nodes"
        if len(replica.shards) == 0:
            return False, f"Replica {replica.id} has no shards"
    return True, "All replicas healthy"
healthy, message = check_replica_health("documents")
print(f"Health check: {message}")
---
3.4 删除Collection
01.删除操作
a.删除Collection
a.功能说明
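Dropping a collection permanently deletes it together with all of its data and indexes. Releasing the collection first is a sensible precaution so that nothing is still being served from it, although the drop itself does not depend on it. The operation cannot be undone, so back up the data beforehand. Once dropped, the collection name can be reused. Dropping a large collection may take a while, so schedule it for off-peak hours. The drop cleans up the associated metadata, index files and data files, and from the client's perspective the collection is removed as a whole rather than piece by piece.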
b.代码示例
---
from pymilvus import Collection, utility
import time
# 检查Collection是否存在
if utility.has_collection("documents"):
collection = Collection("documents")
# 检查加载状态
state = utility.load_state("documents")
if state == utility.LoadState.Loaded:
# 释放Collection
collection.release()
print("已释放Collection")
time.sleep(1)
# 删除Collection
utility.drop_collection("documents")
print("Collection已删除")
# 验证删除
assert not utility.has_collection("documents"), "删除失败"
else:
print("Collection不存在")
# 安全删除函数
def safe_drop_collection(collection_name):
try:
if not utility.has_collection(collection_name):
print(f"Collection不存在: {collection_name}")
return True
collection = Collection(collection_name)
# 释放(如果已加载)
state = utility.load_state(collection_name)
if state == utility.LoadState.Loaded:
collection.release()
time.sleep(1)
# 删除
utility.drop_collection(collection_name)
print(f"已删除: {collection_name}")
return True
except Exception as e:
print(f"删除失败: {e}")
return False
# 使用安全删除
safe_drop_collection("test_collection")
# Confirm before dropping
def drop_with_confirmation(collection_name, confirm=input):
    if not utility.has_collection(collection_name):
        print("Collection does not exist")
        return False
    collection = Collection(collection_name)
    entity_count = collection.num_entities
    print(f"Warning: about to drop collection '{collection_name}'")
    print(f"It contains {entity_count:,} entities")
    # Only proceed when the user explicitly answers "yes"
    if confirm("Confirm drop? (yes/no): ").strip().lower() != "yes":
        print("Drop cancelled")
        return False
    collection.release()
    utility.drop_collection(collection_name)
    print("Drop complete")
    return True
drop_with_confirmation("documents")
---
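b.Batch deletion
a.Description
Several collections can be dropped in one pass, which is useful for cleaning up test data or expired data. A consistent naming convention (a prefix, a suffix, or an embedded date) makes the targets easy to select by pattern. Because a loose pattern can also match collections that matter, preview the matched names and ask for a second confirmation before dropping anything. Batch drops fit well into scheduled cleanup jobs for temporary and test collections.
b.Code example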
---
from pymilvus import utility, Collection
import re
from datetime import datetime, timedelta
# 列出所有Collection
all_collections = utility.list_collections()
print(f"所有Collection: {all_collections}")
# 删除测试Collection(前缀为test_)
for name in all_collections:
if name.startswith("test_"):
collection = Collection(name)
collection.release()
utility.drop_collection(name)
print(f"已删除测试Collection: {name}")
# 删除临时Collection(前缀为temp_)
def drop_temp_collections():
for name in utility.list_collections():
if name.startswith("temp_"):
safe_drop_collection(name)
drop_temp_collections()
# 删除过期Collection(基于命名规则)
def drop_expired_collections(days=30):
"""删除超过指定天数的collection"""
pattern = r"collection_(\d{8})" # collection_20240101
cutoff_date = datetime.now() - timedelta(days=days)
dropped_count = 0
for name in utility.list_collections():
match = re.match(pattern, name)
if match:
date_str = match.group(1)
try:
date = datetime.strptime(date_str, "%Y%m%d")
if date < cutoff_date:
collection = Collection(name)
collection.release()
utility.drop_collection(name)
print(f"删除过期Collection: {name} (日期: {date_str})")
dropped_count += 1
except ValueError:
print(f"日期格式错误: {name}")
print(f"共删除 {dropped_count} 个过期Collection")
drop_expired_collections(days=30)
# 按模式批量删除
def drop_by_pattern(pattern, dry_run=True):
"""按正则表达式模式删除collection"""
regex = re.compile(pattern)
to_drop = []
for name in utility.list_collections():
if regex.match(name):
to_drop.append(name)
print(f"匹配到 {len(to_drop)} 个Collection:")
for name in to_drop:
collection = Collection(name)
print(f" {name} ({collection.num_entities:,} 条数据)")
if dry_run:
print("(预览模式,未实际删除)")
return
# 实际删除
for name in to_drop:
safe_drop_collection(name)
# 预览要删除的collection
drop_by_pattern(r"^backup_\d+$", dry_run=True)
# 实际删除
# drop_by_pattern(r"^backup_\d+$", dry_run=False)
---
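The date-based selection used for expired collections can be checked without a Milvus server by separating "which names qualify" from "drop them". The sketch below is one way to do that with the standard library only. The function name `select_expired` and the `collection_YYYYMMDD` naming convention are assumptions taken from the example above, not anything Milvus prescribes.

```python
import re
from datetime import date, datetime, timedelta

# Assumed naming convention: collection_YYYYMMDD (nothing before or after)
DATED_NAME = re.compile(r"collection_(\d{8})")

def select_expired(names, days, today):
    """Return the names whose embedded date is more than `days` days before `today`."""
    cutoff = today - timedelta(days=days)
    expired = []
    for name in names:
        match = DATED_NAME.fullmatch(name)
        if match is None:
            continue  # not a dated collection, leave it alone
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that are not a real calendar date
        if stamp < cutoff:
            expired.append(name)
    return expired

names = [
    "collection_20240101",
    "collection_20240301",
    "collection_20241341",      # invalid month, skipped
    "collection_20240101_tmp",  # extra suffix, skipped by fullmatch
    "documents",
]
print(select_expired(names, days=30, today=date(2024, 3, 15)))
```

Passing `today` explicitly keeps the function deterministic; a cleanup job would pass `date.today()` and hand the returned names to the drop helper only after a preview.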
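02.Data cleanup
a.Clearing data
a.Description
To empty a collection while keeping its structure, delete all of its entities instead of dropping it. The schema and index definitions survive, so new data can be inserted right away without recreating anything. This suits collections that are emptied regularly, such as temporary caches or test fixtures. Deleted entities are only marked as deleted at first; storage is reclaimed after compaction. For large volumes, delete in batches so that a single request does not time out, and keep in mind that for very large collections dropping and recreating can be quicker than deleting every entity.
b.Code example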
---
from pymilvus import Collection
import time
collection = Collection("documents")
# 方法1: 删除所有数据(简单但可能超时)
expr = "id >= 0" # 匹配所有记录
collection.delete(expr)
# 刷新删除操作
collection.flush()
# 触发compaction释放空间
collection.compact()
print(f"清空后实体数量: {collection.num_entities}")
# Method 2: clear a large collection in batches
def clear_collection_data(collection, batch_size=10000):
    """Delete all entities in batches (assumes non-negative integer primary keys)."""
    collection.load()  # query() needs a loaded collection
    total_deleted = 0
    while True:
        # Fetch one batch of IDs; Strong consistency keeps rows deleted
        # in the previous round from being returned again
        results = collection.query(
            expr="id >= 0",
            output_fields=["id"],
            limit=batch_size,
            consistency_level="Strong"
        )
        if len(results) == 0:
            break
        # Delete this batch
        ids = [r["id"] for r in results]
        collection.delete(f"id in {ids}")
        total_deleted += len(ids)
        print(f"Deleted {len(ids)} entities, total so far: {total_deleted}")
        # Throttle the deletes a little
        time.sleep(0.1)
    # Flush, then compact and wait for the compaction to finish
    collection.flush()
    print("Compacting...")
    collection.compact()
    collection.wait_for_compaction_completed()
    print(f"Done, deleted {total_deleted} entities in total")
    # num_entities may lag behind until compaction has merged the segments
    print(f"Current entity count: {collection.num_entities}")

clear_collection_data(collection, batch_size=10000)
# 方法3: 按条件清空
def clear_by_condition(collection, expr):
"""按条件删除数据"""
# 先查询要删除的数量
results = collection.query(
expr=expr,
output_fields=["id"],
limit=16384 # 最大限制
)
print(f"匹配到 {len(results)} 条数据")
if len(results) == 0:
return
# 删除
collection.delete(expr)
collection.flush()
print(f"已删除 {len(results)} 条数据")
# 删除旧数据
clear_by_condition(collection, "timestamp < 1640000000")
# 删除特定类别
clear_by_condition(collection, 'category == "test"')
---
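b.Backup and restore
a.Description
Back up the data before deleting it so that a mistaken delete can be reversed. The data can be exported to files or copied into a new collection. For production use, a dedicated tool such as milvus-backup is worth evaluating alongside hand-written export scripts. A backup policy should cover both regular backups and a backup right before any delete. Restoring means recreating the collection and importing the data again, so a backup has to contain the schema definition as well as all rows. Compressing the backup files reduces the storage they need.
b.Code example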
---
from pymilvus import Collection, CollectionSchema, utility
import json
import gzip
import pickle
from datetime import datetime
# Back up a collection: schema as JSON, rows as gzip-compressed pickle batches
def backup_collection(collection_name, backup_dir="./backups", batch_size=10000):
    import os
    os.makedirs(backup_dir, exist_ok=True)
    collection = Collection(collection_name)
    collection.load()
    # Save the schema
    schema_dict = {
        "fields": [
            {
                "name": f.name,
                "dtype": f.dtype.name,  # e.g. "INT64"; stable across Python versions
                "is_primary": f.is_primary,
                "auto_id": f.auto_id,
                "params": f.params
            }
            for f in collection.schema.fields
        ],
        "description": collection.schema.description
    }
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    schema_file = f"{backup_dir}/{collection_name}_schema_{timestamp}.json"
    with open(schema_file, "w") as f:
        json.dump(schema_dict, f, indent=2)
    print(f"Schema saved: {schema_file}")
    # Save the rows in batches. query_iterator (pymilvus 2.3+) replaces limit/offset
    # paging, because offset + limit is capped at 16384 per query.
    iterator = collection.query_iterator(
        batch_size=batch_size,
        expr="id >= 0",
        output_fields=["*"]
    )
    total = 0
    batch_num = 0
    while True:
        results = iterator.next()
        if len(results) == 0:
            iterator.close()
            break
        data_file = f"{backup_dir}/{collection_name}_data_{timestamp}_batch{batch_num:04d}.pkl.gz"
        with gzip.open(data_file, "wb") as f:
            pickle.dump(list(results), f)
        print(f"Batch {batch_num} saved: {len(results)} rows")
        total += len(results)
        batch_num += 1
    print(f"Backup finished: {total} rows in {batch_num} batches")
    return schema_file, batch_num
# Restore a collection from a backup written by backup_collection
def restore_collection(collection_name, schema_file, backup_dir, batch_count):
    import os
    from pymilvus import FieldSchema, DataType
    # Read the schema
    with open(schema_file, "r") as f:
        schema_dict = json.load(f)
    # Rebuild the schema
    fields = []
    for f in schema_dict["fields"]:
        dtype = getattr(DataType, f["dtype"].split(".")[-1])
        field = FieldSchema(
            name=f["name"],
            dtype=dtype,
            is_primary=f.get("is_primary", False),
            auto_id=f.get("auto_id", False),
            **f.get("params", {})
        )
        fields.append(field)
    schema = CollectionSchema(
        fields=fields,
        description=schema_dict.get("description", "")
    )
    # Drop the old collection (if it exists)
    if utility.has_collection(collection_name):
        safe_drop_collection(collection_name)
    # Create the new collection
    collection = Collection(collection_name, schema=schema)
    print(f"Collection created: {collection_name}")
    # The schema file is named <source>_schema_<YYYYMMDD_HHMMSS>.json; recover both
    # parts so the data files are found even when restoring under a new name
    base = os.path.basename(schema_file)[:-len(".json")]
    source_name, timestamp = base.rsplit("_schema_", 1)
    total_restored = 0
    for batch_num in range(batch_count):
        data_file = f"{backup_dir}/{source_name}_data_{timestamp}_batch{batch_num:04d}.pkl.gz"
        if not os.path.exists(data_file):
            print(f"Batch file not found: {data_file}")
            continue
        # Read the batch (only unpickle files you created yourself)
        with gzip.open(data_file, "rb") as f:
            batch_data = pickle.load(f)
        # Convert rows to columns, in schema order, skipping auto-generated IDs
        data_list = [
            [item[field.name] for item in batch_data]
            for field in schema.fields
            if not field.auto_id
        ]
        collection.insert(data_list)
        total_restored += len(batch_data)
        print(f"Batch {batch_num} restored: {len(batch_data)} rows")
    # Flush
    collection.flush()
    print(f"Restore finished: {total_restored} rows")
    print(f"Current entity count: {collection.num_entities}")
# 使用备份和恢复
# 备份
schema_file, batch_count = backup_collection("documents", "./backups")
# 恢复
# restore_collection("documents_restored", schema_file, "./backups", batch_count)
# 定期备份任务
def scheduled_backup(collection_name, backup_dir, interval_hours=24):
import time
while True:
try:
print(f"开始备份: {datetime.now()}")
backup_collection(collection_name, backup_dir)
print("备份完成")
except Exception as e:
print(f"备份失败: {e}")
time.sleep(interval_hours * 3600)
# 启动定期备份(后台线程)
import threading
backup_thread = threading.Thread(
target=scheduled_backup,
args=("documents", "./backups", 24),
daemon=True
)
backup_thread.start()
---
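The backup above stores batches with pickle, which is convenient but should only ever be loaded from files you wrote yourself. As a hedged alternative, the sketch below writes each batch as gzip-compressed JSON Lines using only the standard library. The helper names `write_batch` and `read_batch` are illustrative and not part of pymilvus, and the approach assumes every field value is JSON-serializable (float vectors as lists are; binary vectors stored as bytes would need extra encoding).

```python
import gzip
import json
import os
import tempfile

def write_batch(path, rows):
    """Write a list of row dicts as gzip-compressed JSON Lines; return the row count."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)

def read_batch(path):
    """Read the rows back in the order they were written."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip two rows shaped like query results from the "documents" collection
sample_rows = [
    {"id": 1, "title": "文档1", "category": "技术", "timestamp": 1700000000, "embedding": [0.1, 0.2]},
    {"id": 2, "title": "文档2", "category": "新闻", "timestamp": 1700000001, "embedding": [0.3, 0.4]},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    batch_path = os.path.join(tmp_dir, "documents_batch0000.jsonl.gz")
    written = write_batch(batch_path, sample_rows)
    restored_rows = read_batch(batch_path)
print(written, restored_rows == sample_rows)
```

A text format like this can also be inspected with ordinary tools and loaded by programs other than Python, at the cost of somewhat larger files than pickle for vector-heavy data.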
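4 Data Operations
4.1 Inserting data
01.Insert formats
a.Column-based insert
a.Description
Milvus stores data by column, and a column-based insert passes the data in the same shape: one list per field, in schema order, with every list the same length. This is the standard insert format and avoids any per-row conversion on the client. A single insert call either succeeds or fails as a whole. The returned result carries the list of inserted primary keys and the insert count. When new rows become visible to queries depends on the collection's consistency level; calling flush() seals and persists the pending data and brings num_entities up to date. Insert in batches: roughly 1,000 to 10,000 rows per call is a good starting point.
b.Code example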
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# Prepare the data (column format); every list must contain exactly 1000 items
ids = [i for i in range(1000)]
titles = [f"文档{i}" for i in range(1000)]
categories = (["技术", "新闻", "博客"] * 334)[:1000]  # cycle the three values, then trim 1002 -> 1000
timestamps = [1700000000 + i for i in range(1000)]
embeddings = [[np.random.random() for _ in range(128)] for _ in range(1000)]
# 插入数据(按Schema字段顺序)
data = [ids, titles, categories, timestamps, embeddings]
insert_result = collection.insert(data)
print(f"插入成功: {insert_result.insert_count} 条")
print(f"主键列表: {insert_result.primary_keys[:10]}...")
# 刷新数据(使数据立即可见)
collection.flush()
print(f"刷新后实体数量: {collection.num_entities}")
# 验证插入
results = collection.query(
expr="id in [0, 1, 2]",
output_fields=["id", "title", "category"]
)
for r in results:
print(f"ID: {r['id']}, Title: {r['title']}, Category: {r['category']}")
---
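Because a column-based insert requires one list per field and equal lengths throughout, data that arrives as rows has to be pivoted first. The following is a minimal sketch of that conversion in plain Python. The helper names `rows_to_columns` and `check_column_lengths` are hypothetical, and the field order is assumed to be passed in by the caller (for example, read from the collection schema).

```python
def rows_to_columns(rows, field_names):
    """Turn a list of row dicts into one list per field, in the given field order."""
    columns = []
    for name in field_names:
        missing = [i for i, row in enumerate(rows) if name not in row]
        if missing:
            raise KeyError(f"field {name!r} is missing in rows {missing}")
        columns.append([row[name] for row in rows])
    return columns

def check_column_lengths(columns):
    """Return the common column length, or raise if the columns disagree."""
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0

rows = [
    {"id": 1, "title": "a", "embedding": [0.1, 0.2]},
    {"id": 2, "title": "b", "embedding": [0.3, 0.4]},
]
columns = rows_to_columns(rows, ["id", "title", "embedding"])
print(columns)
print(check_column_lengths(columns))
```

Running the length check on the client turns a mismatch into an immediate, descriptive error instead of a failed request.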
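b.Row-based (dictionary) insert
a.Description
Besides column-based inserts, Milvus also accepts a list of dictionaries, one per row, keyed by field name. This format reads more naturally and suits data that already arrives as JSON or Python dictionaries, although it may be slightly slower than a column-based insert. Every dictionary must contain all fields that are not auto-generated. Key order does not matter, because fields are matched by name. With a dynamic schema, keys that are not declared in the schema are stored as dynamic fields, which makes this format the more flexible of the two.
b.Code example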
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# 准备数据(字典列表格式)
data = [
{
"id": 2000 + i,
"title": f"文档{2000 + i}",
"category": "技术",
"timestamp": 1700000000 + i,
"embedding": [np.random.random() for _ in range(128)]
}
for i in range(100)
]
# 插入数据
insert_result = collection.insert(data)
print(f"插入成功: {insert_result.insert_count} 条")
# 混合字段顺序
data_mixed = [
{
"embedding": [0.1] * 128,
"id": 3000,
"timestamp": 1700000000,
"category": "新闻",
"title": "文档3000"
},
{
"title": "文档3001",
"id": 3001,
"embedding": [0.2] * 128,
"category": "博客",
"timestamp": 1700000001
}
]
collection.insert(data_mixed)
collection.flush()
# 动态Schema示例
collection_dynamic = Collection("dynamic_collection")
data_dynamic = [
{
"id": 1,
"embedding": [0.1] * 128,
"extra_field1": "额外数据", # 动态字段
"extra_field2": 123
}
]
collection_dynamic.insert(data_dynamic)
---
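02.Handling data types
a.Vector data
a.Description
Vectors are the core data type in Milvus, and every vector must match the dimension declared in the schema. Python lists and NumPy arrays are both accepted. Float vectors hold float32 values and may have any positive dimension up to the server's limit (32,768 by default). Binary vectors are passed as bytes with 8 dimensions packed into each byte, so the dimension must be a multiple of 8. Milvus stores vectors as given and does not normalize them on insert; if the metric you use expects unit-length vectors (for example, inner product used as a stand-in for cosine similarity), normalize them before inserting. Validate dimensions before inserting to avoid runtime errors.
b.Code example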
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# Python list format: every vector needs exactly 128 values
embedding_bad = [[0.1, 0.2, 0.3] * 43 for _ in range(10)]  # 129 values each; a 128-dim field rejects these
embedding_list = [[0.1] * 128 for _ in range(10)]  # correct: 128 values each
# NumPy array格式
embedding_np = np.random.rand(10, 128).astype(np.float32)
# 转换为list(Milvus接受)
embedding_from_np = embedding_np.tolist()
# 插入向量数据
ids = list(range(4000, 4010))
titles = [f"文档{i}" for i in range(4000, 4010)]
categories = ["技术"] * 10
timestamps = [1700000000] * 10
data = [ids, titles, categories, timestamps, embedding_from_np]
collection.insert(data)
# 二值向量示例
from pymilvus import CollectionSchema, FieldSchema, DataType
fields_binary = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="binary_vector", dtype=DataType.BINARY_VECTOR, dim=512)
]
schema_binary = CollectionSchema(fields=fields_binary)
collection_binary = Collection("binary_collection", schema=schema_binary)
# Generate binary vectors (512 dimensions = 64 bytes); dtype=np.uint8 keeps each element to one byte
binary_vectors = [np.random.randint(0, 256, 64, dtype=np.uint8).tobytes() for _ in range(10)]
ids_binary = list(range(10))
data_binary = [ids_binary, binary_vectors]
collection_binary.insert(data_binary)
# 向量维度验证
def validate_vectors(vectors, expected_dim):
for i, vec in enumerate(vectors):
if len(vec) != expected_dim:
raise ValueError(f"向量 {i} 维度错误: {len(vec)}, 期望: {expected_dim}")
return True
validate_vectors(embedding_from_np, 128)
print("向量维度验证通过")
---
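The rule that a binary vector's dimension must be a multiple of 8 follows from the packing: eight 0/1 dimensions share one byte. The sketch below shows that packing in plain Python. `pack_bits` is a hypothetical helper, and the most-significant-bit-first order is an assumption (it matches NumPy's default `packbits` order); what matters for distance metrics such as Hamming is that every vector is packed the same way.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into bytes, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError(f"dimension {len(bits)} is not a multiple of 8")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError(f"expected 0 or 1, got {bit!r}")
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)

# A 16-dimensional binary vector becomes 2 bytes
print(pack_bits([1, 0, 0, 0, 0, 0, 0, 1] + [0] * 8))

# A 512-dimensional binary vector becomes 64 bytes
vector_512 = pack_bits([0, 1] * 256)
print(len(vector_512))
```

The resulting `bytes` objects are the form a BINARY_VECTOR column expects, one object per entity.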
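b.Scalar data
a.Description
Scalar fields cover integers, floating-point numbers, strings, booleans and similar types. A VARCHAR value must fit within the field's max_length; values that are too long are rejected, so truncate or validate them on the client. JSON fields accept nested structures and can hold complex objects. Integer types have fixed ranges, and out-of-range values raise an error. Store timestamps as Unix timestamps in an INT64 field. NULL handling depends on the version: older Milvus releases require a value for every field, while newer releases let a field be declared nullable.
b.Code example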
---
from pymilvus import Collection
import json
import time
collection = Collection("documents")
# 整数类型
ids = [5000, 5001, 5002]
ages = [25, 30, 35] # INT32
# 浮点类型
scores = [95.5, 87.3, 92.1] # FLOAT
ratings = [4.5, 3.8, 4.9] # DOUBLE
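# String type: VARCHAR max_length is measured in bytes, and CJK characters
# take 3 bytes each in UTF-8, so truncate the encoded form rather than the str
long_title = "很长的标题" * 100
if len(long_title.encode("utf-8")) > 200:
    # errors="ignore" drops a character that was cut in half by the byte slice
    long_title = long_title.encode("utf-8")[:200].decode("utf-8", errors="ignore")
titles = [
    "短标题",
    long_title,
    "中等长度的标题"
]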
# 布尔类型
is_active = [True, False, True]
# JSON类型
metadata = [
{"author": "张三", "tags": ["AI", "ML"], "views": 1000},
{"author": "李四", "tags": ["DL"], "views": 500},
{"author": "王五", "tags": ["NLP", "CV"], "views": 800}
]
# 时间戳
timestamps = [
int(time.time()),
int(time.time()) - 86400, # 1天前
int(time.time()) - 172800 # 2天前
]
# 向量
embeddings = [[0.1] * 128 for _ in range(3)]
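# Insert the columns that the "documents" schema declares, in schema order
# (id, title, category, timestamp, embedding). ages, scores, ratings, is_active
# and metadata above only illustrate value formats; they can be inserted only
# into a collection whose schema declares matching fields.
categories = ["技术"] * 3
data = [ids, titles, categories, timestamps, embeddings]
collection.insert(data)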
# 类型转换
def convert_data_types(data_dict):
"""确保数据类型正确"""
converted = {}
# 整数转换
if "id" in data_dict:
converted["id"] = int(data_dict["id"])
# 字符串长度限制
if "title" in data_dict:
title = str(data_dict["title"])
converted["title"] = title[:200] # 截断
# 时间戳转换
if "timestamp" in data_dict:
ts = data_dict["timestamp"]
if isinstance(ts, str):
from datetime import datetime
dt = datetime.fromisoformat(ts)
converted["timestamp"] = int(dt.timestamp())
else:
converted["timestamp"] = int(ts)
# JSON序列化
if "metadata" in data_dict:
if isinstance(data_dict["metadata"], dict):
converted["metadata"] = data_dict["metadata"]
else:
converted["metadata"] = json.loads(data_dict["metadata"])
return converted
# 使用转换函数
raw_data = {
"id": "6000",
"title": "x" * 300,
"timestamp": "2024-01-01T00:00:00",
"metadata": '{"key": "value"}'
}
converted = convert_data_types(raw_data)
print(f"转换后: {converted}")
---
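03.Bulk insert tuning
a.Batch size
a.Description
Batch size directly affects insert throughput and memory use. Start with 1,000 to 10,000 rows per call: smaller batches pay more network overhead per row, while larger ones risk timeouts or running out of memory. Tune the value for your row size and network conditions, and lower it as the vector dimension grows. A short benchmark is the most reliable way to find the best value. Milvus also caps the size of a single insert request (the limit is configurable and depends on the deployment), and requests above it fail.
b.Code example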
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 测试不同批次大小
def test_batch_size(collection, total_count, batch_size):
start_time = time.time()
for i in range(0, total_count, batch_size):
batch_end = min(i + batch_size, total_count)
batch_count = batch_end - i
# 生成批次数据
ids = list(range(i, batch_end))
titles = [f"文档{j}" for j in range(i, batch_end)]
categories = ["技术"] * batch_count
timestamps = [1700000000] * batch_count
embeddings = [[np.random.random() for _ in range(128)] for _ in range(batch_count)]
# 插入
data = [ids, titles, categories, timestamps, embeddings]
collection.insert(data)
# 刷新
collection.flush()
elapsed = time.time() - start_time
qps = total_count / elapsed
return elapsed, qps
# 测试不同批次大小
total_count = 10000
for batch_size in [100, 500, 1000, 5000, 10000]:
elapsed, qps = test_batch_size(collection, total_count, batch_size)
print(f"批次大小: {batch_size:5d}, 耗时: {elapsed:.2f}s, QPS: {qps:.2f}")
# 自适应批次大小
def adaptive_batch_insert(collection, data_generator, vector_dim=128):
# 估算单条数据大小(字节)
single_size = vector_dim * 4 + 1000 # 向量 + 元数据
# 目标批次大小:10MB
target_size = 10 * 1024 * 1024
batch_size = max(100, min(10000, target_size // single_size))
print(f"自适应批次大小: {batch_size}")
batch = []
for item in data_generator:
batch.append(item)
if len(batch) >= batch_size:
collection.insert(batch)
batch = []
# 插入剩余数据
if batch:
collection.insert(batch)
# 使用自适应批次
def data_gen():
for i in range(10000):
yield {
"id": 10000 + i,
"title": f"文档{i}",
"category": "技术",
"timestamp": 1700000000,
"embedding": [0.1] * 128
}
adaptive_batch_insert(collection, data_gen())
---
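b.Concurrent inserts
a.Description
Milvus accepts concurrent inserts, and several clients or threads inserting at once can raise throughput considerably. Guard against primary-key collisions by giving each thread its own ID range. Concurrency adds load on the server, so match the number of workers to what the server can handle, and reuse connections instead of opening one per request. Too many workers can lower throughput or cause timeouts.
b.Code example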
---
from pymilvus import Collection
import numpy as np
import concurrent.futures
import time
collection = Collection("documents")
# 单线程插入函数
def insert_batch(start_id, count):
ids = list(range(start_id, start_id + count))
titles = [f"文档{i}" for i in ids]
categories = ["技术"] * count
timestamps = [1700000000] * count
embeddings = [[np.random.random() for _ in range(128)] for _ in range(count)]
data = [ids, titles, categories, timestamps, embeddings]
result = collection.insert(data)
return result.insert_count
# Concurrent insert benchmark
def concurrent_insert_test(total_count, num_workers, batch_size, base_id):
    start_time = time.time()
    # One task per batch, each with its own non-overlapping ID range
    tasks = []
    for offset in range(0, total_count, batch_size):
        count = min(batch_size, total_count - offset)
        tasks.append((base_id + offset, count))
    # Run the batches on a pool of num_workers threads
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(insert_batch, start_id, count) for start_id, count in tasks]
        results = [f.result() for f in futures]
    # Flush
    collection.flush()
    elapsed = time.time() - start_time
    total_inserted = sum(results)
    rows_per_second = total_inserted / elapsed
    return elapsed, rows_per_second

# Compare different levels of concurrency; each run gets its own ID range
total_count = 10000
for run, num_workers in enumerate([1, 2, 4, 8]):
    base_id = 200000 + run * total_count
    elapsed, rows_per_second = concurrent_insert_test(total_count, num_workers, 1000, base_id)
    print(f"workers: {num_workers}, elapsed: {elapsed:.2f}s, rows/s: {rows_per_second:.2f}")
# 生产者-消费者模式
import queue
import threading
def producer(data_queue, total_count):
"""生产数据"""
for i in range(total_count):
item = {
"id": 30000 + i,
"title": f"文档{i}",
"category": "技术",
"timestamp": 1700000000,
"embedding": [np.random.random() for _ in range(128)]
}
data_queue.put(item)
# 发送结束信号
for _ in range(4): # 4个消费者
data_queue.put(None)
def consumer(data_queue, collection, batch_size=1000):
"""消费并插入数据"""
batch = []
while True:
item = data_queue.get()
if item is None: # 结束信号
break
batch.append(item)
if len(batch) >= batch_size:
collection.insert(batch)
batch = []
# 插入剩余数据
if batch:
collection.insert(batch)
# 启动生产者-消费者
data_queue = queue.Queue(maxsize=1000)
# 启动生产者
producer_thread = threading.Thread(target=producer, args=(data_queue, 10000))
producer_thread.start()
# 启动消费者
consumer_threads = []
for _ in range(4):
t = threading.Thread(target=consumer, args=(data_queue, collection, 1000))
t.start()
consumer_threads.append(t)
# 等待完成
producer_thread.join()
for t in consumer_threads:
t.join()
collection.flush()
print("并发插入完成")
---
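Giving each thread its own ID range is easy to get subtly wrong when the total does not divide evenly by the number of workers, since integer division silently drops the remainder. A small sketch of an even split follows. `split_id_range` is a hypothetical helper name, and it assumes contiguous integer primary keys assigned by the client.

```python
def split_id_range(start_id, total_count, num_workers):
    """Split [start_id, start_id + total_count) into contiguous (start, count) ranges.

    The first `total_count % num_workers` workers get one extra ID, so no ID
    is lost and no two ranges overlap. Workers with nothing to do are omitted.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(total_count, num_workers)
    ranges = []
    next_start = start_id
    for worker in range(num_workers):
        count = base + (1 if worker < extra else 0)
        if count:
            ranges.append((next_start, count))
        next_start += count
    return ranges

print(split_id_range(20000, 10, 4))
```

Each `(start, count)` pair can be submitted as one task to a thread pool; the counts always add up to `total_count`.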
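4.2 Deleting data
01.Delete methods
a.Delete by expression
a.Description
Deleting by filter expression is the main way to remove entities in Milvus. The expression can target primary keys, scalar fields, or a combination of conditions, and it uses the same syntax as query filters, including compound logic. A delete call returns quickly: matching entities are marked as deleted and stop appearing in queries, but the storage they occupy is reclaimed only after compaction. Keep each delete to at most 16,384 rows.
b.Code example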
---
from pymilvus import Collection
collection = Collection("documents")
# 删除单条记录(按主键)
expr = "id == 1001"
collection.delete(expr)
print("已删除ID为1001的记录")
# 批量删除(按主键列表)
ids_to_delete = [1, 2, 3, 4, 5]
expr = f"id in {ids_to_delete}"
collection.delete(expr)
print(f"已删除{len(ids_to_delete)}条记录")
# 范围删除
expr = "id > 2000 and id < 2100"
collection.delete(expr)
print("已删除ID在2000-2100之间的记录")
# 按标量字段删除
expr = 'category == "test"'
collection.delete(expr)
print("已删除测试类别的记录")
# 复杂条件删除
expr = '(category == "test" or category == "temp") and timestamp < 1700000000'
collection.delete(expr)
print("已删除符合条件的记录")
# 刷新删除操作
collection.flush()
print(f"当前实体数量: {collection.num_entities}")
# 安全删除函数
def safe_delete(collection, expr, dry_run=False):
"""安全删除,支持预览模式"""
# 先查询要删除的数据
try:
results = collection.query(
expr=expr,
output_fields=["id"],
limit=16384
)
count = len(results)
print(f"匹配到 {count} 条记录")
if count == 0:
print("没有匹配的记录")
return 0
if dry_run:
print("(预览模式,未实际删除)")
return count
# 实际删除
collection.delete(expr)
collection.flush()
print(f"已删除 {count} 条记录")
return count
except Exception as e:
print(f"删除失败: {e}")
return 0
# 使用安全删除
safe_delete(collection, "id > 5000", dry_run=True) # 预览
safe_delete(collection, "id > 5000", dry_run=False) # 实际删除
---
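b.Batched deletion
a.Description
When a large amount of data has to go (a million rows or more), delete it in batches so that no single request puts too much load on the system. Batching caps the number of rows per call, and a short pause between batches gives the server time to catch up. Batch size and pause length need to be chosen together. Querying a batch of IDs and then deleting exactly those IDs gives precise control over each step.
b.Code example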
---
from pymilvus import Collection
import time
collection = Collection("documents")
# 分批删除大量数据
def batch_delete(collection, expr, batch_size=1000, sleep_interval=0.1):
"""分批删除数据"""
total_deleted = 0
while True:
# 查询一批要删除的ID
results = collection.query(
expr=expr,
output_fields=["id"],
limit=batch_size
)
if len(results) == 0:
break
# 删除这批数据
ids = [r["id"] for r in results]
delete_expr = f"id in {ids}"
collection.delete(delete_expr)
total_deleted += len(ids)
print(f"已删除 {len(ids)} 条数据,累计: {total_deleted}")
# 暂停
if sleep_interval > 0:
time.sleep(sleep_interval)
# 刷新
collection.flush()
print(f"分批删除完成,共删除 {total_deleted} 条数据")
return total_deleted
# 删除旧数据
batch_delete(collection, "timestamp < 1600000000", batch_size=1000)
# 按ID范围分批删除
def delete_by_id_range(collection, start_id, end_id, batch_size=1000):
"""按ID范围分批删除"""
total_deleted = 0
for i in range(start_id, end_id, batch_size):
batch_end = min(i + batch_size, end_id)
expr = f"id >= {i} and id < {batch_end}"
collection.delete(expr)
total_deleted += (batch_end - i)
print(f"已删除 ID {i} 到 {batch_end},累计: {total_deleted}")
time.sleep(0.1)
collection.flush()
print(f"范围删除完成,共删除 {total_deleted} 条数据")
return total_deleted
delete_by_id_range(collection, 10000, 20000, batch_size=1000)
# Batched delete with progress reporting
def batch_delete_with_progress(collection, expr, batch_size=1000):
    """Delete matching rows in batches and report progress."""
    # Count the matches first; count(*) (Milvus 2.3+) is not limited to 16384 rows
    count_result = collection.query(expr=expr, output_fields=["count(*)"])
    total_count = count_result[0]["count(*)"]
    if total_count == 0:
        print("No matching rows")
        return 0
    print(f"{total_count} rows to delete")
    deleted = 0
    start_time = time.time()
    while True:
        # Fetch one batch; Strong consistency keeps already-deleted rows out
        results = collection.query(
            expr=expr,
            output_fields=["id"],
            limit=batch_size,
            consistency_level="Strong"
        )
        if len(results) == 0:
            break
        # Delete the batch
        ids = [r["id"] for r in results]
        collection.delete(f"id in {ids}")
        deleted += len(ids)
        progress = min(100.0, (deleted / total_count) * 100)
        elapsed = time.time() - start_time
        print(f"Progress: {progress:.1f}% ({deleted}/{total_count}), elapsed: {elapsed:.1f}s")
        time.sleep(0.1)
    collection.flush()
    total_time = time.time() - start_time
    print(f"Delete finished, total time: {total_time:.1f}s")
    return deleted

batch_delete_with_progress(collection, 'category == "temp"', batch_size=1000)
---
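When the IDs to delete are already known on the client, the query step can be skipped and the list simply cut into bounded `id in [...]` expressions. The following is a minimal sketch of that step. `delete_expressions` is a hypothetical helper, and it assumes integer primary keys in a field named `id`, as in the examples above.

```python
def delete_expressions(ids, batch_size=1000, field="id"):
    """Yield one `field in [...]` filter expression per batch of integer IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    unique_ids = sorted(set(ids))  # drop duplicates so no ID is sent twice
    for start in range(0, len(unique_ids), batch_size):
        chunk = unique_ids[start:start + batch_size]
        yield f"{field} in {chunk}"

expressions = list(delete_expressions([5, 1, 3, 1, 2], batch_size=2))
print(expressions)
```

Each yielded string can be passed to `collection.delete(...)`, with a pause between calls if the server needs breathing room.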
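02.Deletion strategies
a.Soft-delete flag
a.Description
A soft delete marks a row as deleted in a flag field instead of removing it. The history is preserved and the row can be restored, which suits workloads that need auditing or rollback. Soft-deleted rows still take up storage and need periodic cleanup, and every query has to filter them out. A scheduled job can later remove soft-deleted rows for good. The approach buys flexibility at the cost of extra storage and query overhead.
b.Code example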
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import time
# 创建带软删除标记的Schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="is_deleted", dtype=DataType.BOOL), # 软删除标记
FieldSchema(name="deleted_at", dtype=DataType.INT64), # 删除时间
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields, description="支持软删除")
collection = Collection("soft_delete_collection", schema=schema)
# 插入数据(初始未删除)
data = [
[1, 2, 3], # id
["文档1", "文档2", "文档3"], # title
[False, False, False], # is_deleted
[0, 0, 0], # deleted_at
[[0.1]*128, [0.2]*128, [0.3]*128] # embedding
]
collection.insert(data)
collection.flush()
# 软删除函数
def soft_delete(collection, ids):
"""软删除指定ID的记录"""
if not ids:
return
# 查询现有数据
results = collection.query(
expr=f"id in {ids}",
output_fields=["*"]
)
if not results:
print("没有找到要删除的记录")
return
# 先删除旧记录
collection.delete(f"id in {ids}")
# 重新插入,标记为已删除
deleted_time = int(time.time())
ids_list = [r["id"] for r in results]
titles = [r["title"] for r in results]
is_deleted = [True] * len(results)
deleted_at = [deleted_time] * len(results)
embeddings = [r["embedding"] for r in results]
data = [ids_list, titles, is_deleted, deleted_at, embeddings]
collection.insert(data)
collection.flush()
print(f"软删除 {len(ids)} 条记录")
# 使用软删除
soft_delete(collection, [1, 2])
# 查询未删除的数据
results = collection.query(
expr="is_deleted == false",
output_fields=["id", "title"]
)
print(f"未删除的记录: {results}")
# 恢复软删除的数据
def undelete(collection, ids):
"""恢复软删除的记录"""
results = collection.query(
expr=f"id in {ids} and is_deleted == true",
output_fields=["*"]
)
if not results:
print("没有找到要恢复的记录")
return
# 删除旧记录
collection.delete(f"id in {ids}")
# 重新插入,标记为未删除
ids_list = [r["id"] for r in results]
titles = [r["title"] for r in results]
is_deleted = [False] * len(results)
deleted_at = [0] * len(results)
embeddings = [r["embedding"] for r in results]
data = [ids_list, titles, is_deleted, deleted_at, embeddings]
collection.insert(data)
collection.flush()
print(f"恢复 {len(ids)} 条记录")
undelete(collection, [1])
# 定期清理软删除的数据
def cleanup_soft_deleted(collection, days=30):
"""清理超过指定天数的软删除数据"""
cutoff_time = int(time.time()) - (days * 86400)
# 查询要清理的数据
results = collection.query(
expr=f"is_deleted == true and deleted_at < {cutoff_time}",
output_fields=["id"],
limit=16384
)
if not results:
print("没有需要清理的数据")
return
# 真正删除
ids = [r["id"] for r in results]
collection.delete(f"id in {ids}")
collection.flush()
print(f"清理 {len(ids)} 条软删除数据")
cleanup_soft_deleted(collection, days=30)
---
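Since every query against a soft-delete collection has to exclude flagged rows, it helps to add that condition in one place rather than by hand at each call site. The sketch below wraps an arbitrary filter expression. `live_only` is a hypothetical helper, and the flag name `is_deleted` is taken from the schema in the example above.

```python
def live_only(expr=None, flag="is_deleted"):
    """Combine a filter expression with the soft-delete guard.

    The original expression is parenthesized so that an `or` inside it
    cannot bypass the guard.
    """
    guard = f"{flag} == false"
    if expr is None or not expr.strip():
        return guard
    return f"({expr}) and {guard}"

print(live_only())
print(live_only('category == "a" or category == "b"'))
```

Without the parentheses, `category == "a" or category == "b" and is_deleted == false` would let soft-deleted rows of category "a" through, because `and` binds more tightly than `or`.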
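b.Scheduled cleanup
a.Description
Scheduled cleanup removes expired data automatically. The criteria can be a timestamp, access frequency, or other conditions, which makes it a good fit for time-sensitive data such as logs and caches. It can run as a scheduled job or in a background thread. The retention policy should weigh business requirements against storage cost. Run cleanup during off-peak hours to limit the impact on regular traffic, and compact afterwards to reclaim space.
b.Code example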
---
from pymilvus import Collection
import time
import threading
from datetime import datetime, timedelta
collection = Collection("documents")
# 基于时间戳的清理
def cleanup_by_timestamp(collection, days=30):
"""删除超过指定天数的数据"""
cutoff_time = int(time.time()) - (days * 86400)
expr = f"timestamp < {cutoff_time}"
# 分批删除
total_deleted = 0
batch_size = 1000
while True:
results = collection.query(
expr=expr,
output_fields=["id"],
limit=batch_size
)
if len(results) == 0:
break
ids = [r["id"] for r in results]
collection.delete(f"id in {ids}")
total_deleted += len(ids)
print(f"已清理 {len(ids)} 条数据,累计: {total_deleted}")
time.sleep(0.1)
collection.flush()
collection.compact()
print(f"清理完成,共删除 {total_deleted} 条数据")
return total_deleted
cleanup_by_timestamp(collection, days=30)
# 定时清理任务
def scheduled_cleanup(collection, interval_hours=24, retention_days=30):
"""定时清理任务"""
while True:
try:
print(f"开始清理: {datetime.now()}")
deleted = cleanup_by_timestamp(collection, days=retention_days)
print(f"清理完成: 删除 {deleted} 条数据")
except Exception as e:
print(f"清理失败: {e}")
# 等待下次清理
time.sleep(interval_hours * 3600)
# 启动定时清理(后台线程)
cleanup_thread = threading.Thread(
target=scheduled_cleanup,
args=(collection, 24, 30),
daemon=True
)
cleanup_thread.start()
# 按类别清理
def cleanup_by_category(collection, categories_to_delete):
"""删除指定类别的数据"""
for category in categories_to_delete:
expr = f'category == "{category}"'
results = collection.query(
expr=expr,
output_fields=["id"],
limit=16384
)
if results:
ids = [r["id"] for r in results]
collection.delete(f"id in {ids}")
print(f"已删除类别 '{category}': {len(ids)} 条数据")
collection.flush()
collection.compact()
cleanup_by_category(collection, ["test", "temp", "draft"])
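# Size-based cleanup: keep at most max_entities rows, removing the oldest first
import heapq

class CleanupManager:
    def __init__(self, collection, max_entities=1000000):
        self.collection = collection
        self.max_entities = max_entities

    def check_and_cleanup(self):
        """Check the entity count and remove the oldest rows if it is over the cap."""
        current_count = self.collection.num_entities
        if current_count <= self.max_entities:
            print(f"Current count {current_count}, nothing to clean up")
            return
        # Number of rows to remove
        to_delete = current_count - self.max_entities
        print(f"Current count {current_count}, need to delete {to_delete}")
        # A query cannot sort by timestamp, so scan (timestamp, id) pairs for the
        # whole collection; a limited query would return arbitrary rows, not the
        # oldest ones. query_iterator requires pymilvus 2.3+.
        iterator = self.collection.query_iterator(
            batch_size=1000,
            expr="id >= 0",
            output_fields=["id", "timestamp"]
        )
        pairs = []
        while True:
            rows = iterator.next()
            if not rows:
                iterator.close()
                break
            pairs.extend((r["timestamp"], r["id"]) for r in rows)
        # Pick the rows with the smallest timestamps
        ids_to_delete = [pk for _, pk in heapq.nsmallest(to_delete, pairs)]
        # Delete in batches
        batch_size = 1000
        for i in range(0, len(ids_to_delete), batch_size):
            batch = ids_to_delete[i:i + batch_size]
            self.collection.delete(f"id in {batch}")
            print(f"Deleted {len(batch)} old rows")
        self.collection.flush()
        self.collection.compact()
        print(f"Cleanup finished, current count: {self.collection.num_entities}")

# Use the size-based cleanup
manager = CleanupManager(collection, max_entities=1000000)
manager.check_and_cleanup()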
---
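Time-based cleanup hinges on one small calculation: turning a retention period into a cutoff timestamp and a filter expression. Isolating that step makes it testable with a fixed clock. In the sketch below, `retention_expr` is a hypothetical helper, and the field name `timestamp` holding Unix seconds in an INT64 field is an assumption carried over from the examples above.

```python
from datetime import datetime, timedelta, timezone

def retention_expr(now, retention_days, field="timestamp"):
    """Build a filter expression matching rows older than the retention period.

    `now` must be a timezone-aware datetime so the cutoff does not depend on
    the local timezone of the machine running the job.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    if retention_days < 0:
        raise ValueError("retention_days must not be negative")
    cutoff = now - timedelta(days=retention_days)
    return f"{field} < {int(cutoff.timestamp())}"

fixed_now = datetime(2024, 1, 31, tzinfo=timezone.utc)
print(retention_expr(fixed_now, 30))
```

A scheduled job would call it with `datetime.now(timezone.utc)` and pass the result to the batched delete.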
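4.3 Updating data
01.Update mechanism
a.Upsert
a.Description
Milvus updates data through upsert (update + insert): if the primary key already exists the row is updated, otherwise it is inserted. An upsert is atomic, which keeps the data consistent. As described here, an upsert replaces the whole row and does not update individual fields, so every field has to be supplied, including the vector. It is slightly slower than a plain insert because the existing primary key has to be handled. Upsert suits data that must stay current, such as a document store that is updated in real time.
b.Code example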
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# Upsert单条数据
data = [
[1], # id (已存在则更新,不存在则插入)
["更新后的标题"], # title
["技术"], # category
[1700000000], # timestamp
[[0.9] * 128] # embedding (新向量)
]
collection.upsert(data)
collection.flush()
print("Upsert完成")
# 验证更新
results = collection.query(
expr="id == 1",
output_fields=["id", "title"]
)
print(f"更新后: {results}")
# 批量Upsert
ids = [10, 11, 12, 13, 14] # 部分存在,部分不存在
titles = [f"更新文档{i}" for i in ids]
categories = ["技术"] * len(ids)
timestamps = [1700000000] * len(ids)
embeddings = [[np.random.random() for _ in range(128)] for _ in ids]
data = [ids, titles, categories, timestamps, embeddings]
result = collection.upsert(data)
print(f"Upsert数量: {result.upsert_count}")
collection.flush()
# Upsert字典格式
data_dict = [
{
"id": 20,
"title": "字典格式更新",
"category": "新闻",
"timestamp": 1700000000,
"embedding": [0.5] * 128
},
{
"id": 21,
"title": "字典格式插入",
"category": "博客",
"timestamp": 1700000001,
"embedding": [0.6] * 128
}
]
collection.upsert(data_dict)
collection.flush()
print("字典格式Upsert完成")
---
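b.Update strategies
a.Description
Because an upsert replaces the whole row, changing one field means querying the full row, modifying it, and upserting it back. The extra round trip makes this a poor fit for high-frequency updates. Caching rows in the application can cut down on queries, and when only the vector changes it helps to keep the stored metadata to the necessary minimum. Update in batches for efficiency. Updates create new segments, so compact periodically.
b.Code example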
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
collection.load()
# Update a single field
def update_field(collection, id, field_name, new_value):
    """Update one field of one row via query + upsert."""
    # Fetch the current row
    results = collection.query(
        expr=f"id == {id}",
        output_fields=["*"]
    )
    if not results:
        print(f"ID {id} does not exist")
        return False
    # Change the field
    record = dict(results[0])
    record[field_name] = new_value
    # Upsert in column format: one single-element list per field, in schema order
    data = [[record[f.name]] for f in collection.schema.fields if not f.auto_id]
    collection.upsert(data)
    collection.flush()
    print(f"Updated {field_name} of ID {id}")
    return True

update_field(collection, 1, "title", "新标题")
# 批量更新字段
def batch_update_field(collection, ids, field_name, new_values):
"""批量更新字段"""
if len(ids) != len(new_values):
raise ValueError("ID和值的数量不匹配")
# 查询现有数据
results = collection.query(
expr=f"id in {ids}",
output_fields=["*"]
)
# 创建ID到记录的映射
records_map = {r["id"]: r for r in results}
# 准备更新数据
updated_records = []
for id, new_value in zip(ids, new_values):
if id in records_map:
record = records_map[id]
record[field_name] = new_value
updated_records.append(record)
if not updated_records:
print("没有找到要更新的记录")
return
# 转换为列式格式
field_data = {}
for field in collection.schema.fields:
if not field.auto_id:
field_data[field.name] = [r[field.name] for r in updated_records]
data = [field_data[f.name] for f in collection.schema.fields if not f.auto_id]
collection.upsert(data)
collection.flush()
print(f"已更新 {len(updated_records)} 条记录的 {field_name}")
batch_update_field(collection, [1, 2, 3], "category", ["AI", "ML", "DL"])
# Update the vector
def update_embedding(collection, id, new_embedding):
    """Replace the vector of one row via query + upsert."""
    results = collection.query(
        expr=f"id == {id}",
        output_fields=["*"]
    )
    if not results:
        print(f"ID {id} does not exist")
        return False
    record = dict(results[0])
    record["embedding"] = new_embedding
    # Column format: one single-element list per field, in schema order
    data = [[record[f.name]] for f in collection.schema.fields if not f.auto_id]
    collection.upsert(data)
    collection.flush()
    print(f"Updated the vector of ID {id}")
    return True

new_vector = [np.random.random() for _ in range(128)]
update_embedding(collection, 1, new_vector)
# 条件批量更新
def conditional_update(collection, expr, field_name, new_value):
"""根据条件批量更新字段"""
# 查询符合条件的记录
results = collection.query(
expr=expr,
output_fields=["*"],
limit=16384
)
if not results:
print("没有匹配的记录")
return 0
# 更新字段
for record in results:
record[field_name] = new_value
# 转换为列式格式
field_data = {}
for field in collection.schema.fields:
if not field.auto_id:
field_data[field.name] = [r[field.name] for r in results]
data = [field_data[f.name] for f in collection.schema.fields if not f.auto_id]
collection.upsert(data)
collection.flush()
print(f"已更新 {len(results)} 条记录")
return len(results)
# 将所有test类别改为tech类别
conditional_update(collection, 'category == "test"', "category", "tech")
---
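The query-modify-upsert pattern has a pure step in the middle: merging the changed fields into the full row. Checking that step on its own catches misspelled field names before anything is sent to the server. The sketch below is a hedged illustration. `apply_patch` is a hypothetical helper, and the list of field names is assumed to come from the collection schema.

```python
def apply_patch(record, patch, field_names):
    """Return a full row with `patch` applied, restricted to the schema's fields.

    Raises KeyError if the patch names an unknown field or the stored record
    lacks a schema field, since an upsert needs every field to be present.
    """
    unknown = sorted(set(patch) - set(field_names))
    if unknown:
        raise KeyError(f"unknown fields in patch: {unknown}")
    missing = [name for name in field_names if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    merged = {name: record[name] for name in field_names}
    merged.update(patch)
    return merged

schema_fields = ["id", "title", "category", "timestamp", "embedding"]
stored = {"id": 1, "title": "old", "category": "test", "timestamp": 1700000000, "embedding": [0.1, 0.2]}
updated = apply_patch(stored, {"title": "new", "category": "tech"}, schema_fields)
print(updated)
```

The merged dictionary can be upserted as a single row, and the original `stored` dictionary is left unchanged, which is convenient when rows are cached in the application.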
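02.Incremental updates
a.Re-encoding vectors
a.Description
When a document's content changes, its vector has to be regenerated and updated as well. This is the most common update scenario in a vector database, because the vector must stay consistent with the content it represents. Use the same encoding model for every update so that all vectors live in the same vector space. Incremental updates suit applications with a steady stream of changes, such as news or social media. Process update requests in batches for efficiency. After a large volume of updates, compaction or an index rebuild may be needed to keep query performance steady.
b.Code example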
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()

# Stand-in for a vector encoder
def encode_text(text):
    """Encode text as a vector (use a real embedding model in practice)."""
    # Random values simulate the encoder output here
    return [np.random.random() for _ in range(128)]

# Update a document's content and re-encode its vector
def update_document(collection, doc_id, new_title, new_content):
    """Update the document and regenerate its vector."""
    # Fetch the current row
    results = collection.query(
        expr=f"id == {doc_id}",
        output_fields=["*"]
    )
    if not results:
        print(f"Document {doc_id} does not exist")
        return False
    # Re-encode the vector
    new_embedding = encode_text(new_title + " " + new_content)
    # Update the row
    record = dict(results[0])
    record["title"] = new_title
    record["embedding"] = new_embedding
    record["timestamp"] = int(time.time())  # refresh the timestamp
    # Upsert in column format: one single-element list per field, in schema order
    data = [[record[f.name]] for f in collection.schema.fields if not f.auto_id]
    collection.upsert(data)
    collection.flush()
    print(f"Updated document {doc_id}")
    return True

update_document(collection, 1, "新标题", "新内容...")
# 批量重新编码
def batch_reencode(collection, doc_ids):
"""批量重新编码向量"""
# 查询文档
results = collection.query(
expr=f"id in {doc_ids}",
output_fields=["*"]
)
if not results:
print("没有找到文档")
return 0
# 重新编码
updated_records = []
for record in results:
# 重新编码
new_embedding = encode_text(record["title"])
record["embedding"] = new_embedding
record["timestamp"] = int(time.time())
updated_records.append(record)
# 转换为列式格式
field_data = {}
for field in collection.schema.fields:
if not field.auto_id:
field_data[field.name] = [r[field.name] for r in updated_records]
data = [field_data[f.name] for f in collection.schema.fields if not f.auto_id]
collection.upsert(data)
collection.flush()
print(f"已重新编码 {len(updated_records)} 个文档")
return len(updated_records)
batch_reencode(collection, [1, 2, 3, 4, 5])
# 增量更新队列
import queue
import threading
import time
class IncrementalUpdater:
    def __init__(self, collection, batch_size=100, flush_interval=5):
        self.collection = collection
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.update_queue = queue.Queue()
        self.running = False
    def start(self):
        """Start the background update thread."""
        self.running = True
        self.worker_thread = threading.Thread(target=self._worker, daemon=True)
        self.worker_thread.start()
    def stop(self):
        """Stop the update thread; the worker flushes pending updates before it exits."""
        self.running = False
        self.worker_thread.join()
    def submit_update(self, doc_id, title, content):
        """Queue an update request."""
        self.update_queue.put((doc_id, title, content))
    def _worker(self):
        """Background loop: collect requests and flush them in batches."""
        batch = []
        last_flush = time.time()
        while self.running:
            try:
                # Wait up to one second for the next request
                batch.append(self.update_queue.get(timeout=1))
            except queue.Empty:
                pass
            # Flush when the batch is full or the flush interval has elapsed
            if batch and (len(batch) >= self.batch_size or
                          (time.time() - last_flush) > self.flush_interval):
                self._flush_batch(batch)
                batch = []
                last_flush = time.time()
        # On stop, drain the queue and flush what is left so no update is dropped
        while True:
            try:
                batch.append(self.update_queue.get_nowait())
            except queue.Empty:
                break
        self._flush_batch(batch)
    def _flush_batch(self, batch):
        """Write one batch of updates."""
        if not batch:
            return
        doc_ids = [item[0] for item in batch]
        # Fetch the existing records
        results = self.collection.query(
            expr=f"id in {doc_ids}",
            output_fields=["*"]
        )
        records_map = {r["id"]: r for r in results}
        # Apply the updates; keyed by id so a document queued twice is written once
        updated = {}
        for doc_id, title, content in batch:
            if doc_id in records_map:
                record = records_map[doc_id]
                record["title"] = title
                record["embedding"] = encode_text(title + " " + content)
                record["timestamp"] = int(time.time())
                updated[doc_id] = record
        updated_records = list(updated.values())
        if updated_records:
            # Convert rows to column-ordered lists
            field_data = {}
            for field in self.collection.schema.fields:
                if not field.auto_id:
                    field_data[field.name] = [r[field.name] for r in updated_records]
            data = [field_data[f.name] for f in self.collection.schema.fields if not f.auto_id]
            self.collection.upsert(data)
            self.collection.flush()
            print(f"Batch-updated {len(updated_records)} documents")
# 使用增量更新器
updater = IncrementalUpdater(collection, batch_size=100, flush_interval=5)
updater.start()
# 提交更新请求
for i in range(50):
updater.submit_update(i, f"更新标题{i}", f"更新内容{i}")
# 等待处理完成
time.sleep(10)
updater.stop()
---
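When updates are queued faster than they are flushed, the same document can appear several times in one batch. Coalescing the batch first means each document is re-encoded and written only once, using its latest title and content. The sketch below is pure Python; `coalesce_updates` is a hypothetical helper, and it assumes queue items are `(doc_id, title, content)` tuples as in the updater above.

```python
def coalesce_updates(batch):
    """Keep only the most recent (doc_id, title, content) tuple per doc_id.

    The result is ordered by when each doc_id was last seen, so the newest
    request for a document replaces any earlier one in the same batch.
    """
    latest = {}
    for doc_id, title, content in batch:
        # Remove any earlier entry so the re-inserted key moves to the end
        latest.pop(doc_id, None)
        latest[doc_id] = (doc_id, title, content)
    return list(latest.values())


pending = [
    (1, "title v1", "content v1"),
    (2, "other title", "other content"),
    (1, "title v2", "content v2"),  # supersedes the first request for id 1
]
coalesced = coalesce_updates(pending)
print(coalesced)
```

Calling a helper like this at the start of a flush keeps the number of encoder calls proportional to the number of distinct documents rather than the number of requests.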
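b.Metadata updates
a.Description
A metadata update changes only scalar fields and leaves the vector untouched. It is simpler than a vector update because nothing is re-encoded, but with the upsert pattern used here the full record, vector included, still has to be read and written back. Typical targets are category, tag, and status fields. A cache can reduce the read cost. Metadata usually changes more often than vectors, so batch the updates, and consider keeping very frequently updated fields in an external store.
b.Code example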
---
from pymilvus import Collection
import time
collection = Collection("documents")
collection.load()
# 更新分类
def update_category(collection, doc_ids, new_category):
"""批量更新分类"""
results = collection.query(
expr=f"id in {doc_ids}",
output_fields=["*"]
)
if not results:
return 0
# 更新分类
for record in results:
record["category"] = new_category
record["timestamp"] = int(time.time())
# Upsert
field_data = {}
for field in collection.schema.fields:
if not field.auto_id:
field_data[field.name] = [r[field.name] for r in results]
data = [field_data[f.name] for f in collection.schema.fields if not f.auto_id]
collection.upsert(data)
collection.flush()
print(f"已更新 {len(results)} 个文档的分类")
return len(results)
update_category(collection, [1, 2, 3], "AI")
# 批量添加标签
def add_tags(collection, doc_ids, new_tags):
"""批量添加标签(假设使用JSON字段存储标签)"""
results = collection.query(
expr=f"id in {doc_ids}",
output_fields=["*"]
)
for record in results:
# 获取现有标签
metadata = record.get("metadata", {})
existing_tags = metadata.get("tags", [])
# 添加新标签
updated_tags = list(set(existing_tags + new_tags))
metadata["tags"] = updated_tags
record["metadata"] = metadata
record["timestamp"] = int(time.time())
# Upsert
field_data = {}
for field in collection.schema.fields:
if not field.auto_id:
field_data[field.name] = [r[field.name] for r in results]
data = [field_data[f.name] for f in collection.schema.fields if not f.auto_id]
collection.upsert(data)
collection.flush()
print(f"已为 {len(results)} 个文档添加标签")
add_tags(collection, [1, 2, 3], ["机器学习", "深度学习"])
# 元数据缓存
class MetadataCache:
def __init__(self, collection, cache_size=1000):
self.collection = collection
self.cache = {}
self.cache_size = cache_size
self.access_order = []
def get(self, doc_id):
"""获取文档元数据"""
if doc_id in self.cache:
# 更新访问顺序
self.access_order.remove(doc_id)
self.access_order.append(doc_id)
return self.cache[doc_id]
# 从数据库查询
results = self.collection.query(
expr=f"id == {doc_id}",
output_fields=["*"]
)
if not results:
return None
record = results[0]
# 添加到缓存
if len(self.cache) >= self.cache_size:
# 移除最久未使用的
old_id = self.access_order.pop(0)
del self.cache[old_id]
self.cache[doc_id] = record
self.access_order.append(doc_id)
return record
def update(self, doc_id, updates):
"""更新文档元数据"""
record = self.get(doc_id)
if not record:
return False
# 更新字段
for key, value in updates.items():
record[key] = value
record["timestamp"] = int(time.time())
# 更新缓存
self.cache[doc_id] = record
# Upsert到数据库
data = [[record[f.name] for f in self.collection.schema.fields if not f.auto_id]]
self.collection.upsert(data)
return True
def flush(self):
"""刷新所有缓存的更新"""
self.collection.flush()
# 使用元数据缓存
cache = MetadataCache(collection, cache_size=1000)
# 更新元数据
cache.update(1, {"category": "AI", "views": 1000})
cache.update(2, {"category": "ML", "views": 500})
# 刷新
cache.flush()
---
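The cache above tracks recency in a plain list, so every hit pays for a linear `list.remove`. An `OrderedDict` gives the same least-recently-used behaviour with constant-time reordering. This is a minimal standalone sketch, not tied to Milvus; the `LRUCache` class and its method names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache backed by an OrderedDict."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark the entry as most recently used
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict the least recently used entry (front of the dict)
            self._items.popitem(last=False)

    def keys(self):
        return list(self._items)


lru = LRUCache(capacity=2)
lru.put(1, {"category": "AI"})
lru.put(2, {"category": "ML"})
lru.get(1)                      # touch id 1, so id 2 is now the oldest
lru.put(3, {"category": "DL"})  # over capacity: id 2 is evicted
print(lru.keys())
```

The same structure could back `MetadataCache.get` and `update`, with the database query used only on a cache miss.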
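4.4 Bulk operations
01.Bulk insert optimization
a.Data preprocessing
a.Description
Preprocessing data before a bulk insert reduces failed inserts and retries. It covers validation, format conversion, and deduplication, and should at least check vector dimensions, convert types, and handle missing values. Catching errors before the insert avoids batches that fail partway through. Use an efficient library such as NumPy for large datasets, and consider running preprocessing and insertion in parallel.
b.Code example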
---
from pymilvus import Collection
import numpy as np
import pandas as pd
collection = Collection("documents")
# 数据验证器
class DataValidator:
def __init__(self, schema):
self.schema = schema
self.field_map = {f.name: f for f in schema.fields}
def validate_record(self, record):
"""验证单条记录"""
errors = []
# 检查必需字段
for field in self.schema.fields:
if field.auto_id:
continue
if field.name not in record:
errors.append(f"缺少字段: {field.name}")
continue
value = record[field.name]
# Check the vector dimension. Compare the enum member's name rather than
# str(field.dtype), whose text form depends on the Python version.
if field.dtype.name == "FLOAT_VECTOR":
expected_dim = field.params.get("dim")
if len(value) != expected_dim:
errors.append(f"Wrong vector dimension: {field.name}, expected {expected_dim}, got {len(value)}")
# Check the VARCHAR length
elif field.dtype.name == "VARCHAR":
max_len = field.params.get("max_length")
if len(str(value)) > max_len:
errors.append(f"String too long: {field.name}, max {max_len}, got {len(str(value))}")
return len(errors) == 0, errors
def validate_batch(self, records):
"""验证批次数据"""
valid_records = []
invalid_records = []
for i, record in enumerate(records):
is_valid, errors = self.validate_record(record)
if is_valid:
valid_records.append(record)
else:
invalid_records.append((i, record, errors))
return valid_records, invalid_records
# 使用验证器
validator = DataValidator(collection.schema)
test_records = [
{"id": 1, "title": "文档1", "category": "AI", "timestamp": 1700000000, "embedding": [0.1]*128},
{"id": 2, "title": "文档2", "category": "ML", "timestamp": 1700000000, "embedding": [0.2]*100}, # 维度错误
{"id": 3, "title": "x"*300, "category": "DL", "timestamp": 1700000000, "embedding": [0.3]*128} # 标题过长
]
valid, invalid = validator.validate_batch(test_records)
print(f"有效记录: {len(valid)}")
print(f"无效记录: {len(invalid)}")
for i, record, errors in invalid:
print(f" 记录{i}: {errors}")
# 数据预处理管道
class DataPreprocessor:
def __init__(self, schema):
self.schema = schema
def preprocess_batch(self, records):
"""预处理批次数据"""
processed = []
for record in records:
processed_record = self.preprocess_record(record)
if processed_record:
processed.append(processed_record)
return processed
def preprocess_record(self, record):
"""预处理单条记录"""
processed = {}
for field in self.schema.fields:
if field.auto_id:
continue
if field.name not in record:
return None
value = record[field.name]
# Truncate VARCHAR values to max_length
if field.dtype.name == "VARCHAR":
max_len = field.params.get("max_length")
value = str(value)[:max_len]
# Normalize vectors
elif field.dtype.name == "FLOAT_VECTOR":
value = np.array(value, dtype=np.float32)
# L2-normalize; zero vectors are left unchanged
norm = np.linalg.norm(value)
if norm > 0:
value = (value / norm).tolist()
else:
value = value.tolist()
# Integer conversion (INT8, INT16, INT32, INT64)
elif field.dtype.name.startswith("INT"):
value = int(value)
# Floating-point conversion
elif field.dtype.name in ("FLOAT", "DOUBLE"):
value = float(value)
processed[field.name] = value
return processed
# 使用预处理器
preprocessor = DataPreprocessor(collection.schema)
raw_data = [
{"id": "100", "title": "x"*300, "category": "AI", "timestamp": "1700000000", "embedding": [1.0]*128},
{"id": "101", "title": "文档2", "category": "ML", "timestamp": "1700000001", "embedding": [2.0]*128}
]
processed_data = preprocessor.preprocess_batch(raw_data)
print(f"预处理完成: {len(processed_data)} 条记录")
# 批量插入预处理后的数据
if processed_data:
# 转换为列式格式
field_data = {}
for field in collection.schema.fields:
if not field.auto_id:
field_data[field.name] = [r[field.name] for r in processed_data]
data = [field_data[f.name] for f in collection.schema.fields if not f.auto_id]
collection.insert(data)
collection.flush()
---
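The description lists deduplication as part of preprocessing, but the validator and preprocessor above do not cover it. A minimal pure-Python sketch of deduplicating a batch by primary key follows; `dedupe_by_key` is a hypothetical helper, and keeping the last occurrence is an assumption about the desired policy (the most recent record wins).

```python
def dedupe_by_key(records, key="id"):
    """Drop records whose key repeats, keeping the last occurrence.

    A dict preserves the position of the first insertion, so the output
    keeps the original ordering of distinct keys.
    """
    latest = {}
    for record in records:
        latest[record[key]] = record
    return list(latest.values())


raw = [
    {"id": 1, "title": "first version"},
    {"id": 2, "title": "another document"},
    {"id": 1, "title": "second version"},  # duplicate primary key
]
unique = dedupe_by_key(raw)
print(unique)
```

Running this before validation keeps duplicate primary keys out of a single insert batch.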
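b.Memory management
a.Description
Large bulk inserts need deliberate memory management to avoid running out of memory. Process large files with generators or iterators instead of loading them whole, and read CSV or JSON files in chunks. NumPy arrays use less memory than Python lists for numeric data. Release data structures as soon as they are no longer needed. Monitoring memory use makes it possible to adjust the batch size dynamically. For datasets too large for memory, use memory-mapped files.
b.Code example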
---
from pymilvus import Collection
import numpy as np
import pandas as pd
import psutil
import gc
collection = Collection("documents")
def get_memory_usage():
"""获取当前内存使用(MB)"""
process = psutil.Process()
return process.memory_info().rss / 1024 / 1024
# 生成器方式读取大文件
def read_large_csv(filename, chunk_size=10000):
"""分块读取大CSV文件"""
for chunk in pd.read_csv(filename, chunksize=chunk_size):
yield chunk
# 批量插入大文件
def insert_from_large_file(collection, filename, batch_size=1000):
"""从大文件批量插入"""
total_inserted = 0
for chunk in read_large_csv(filename, chunk_size=batch_size):
# 转换为插入格式
ids = chunk["id"].tolist()
titles = chunk["title"].tolist()
categories = chunk["category"].tolist()
timestamps = chunk["timestamp"].tolist()
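# Assumes the embedding column holds list literals stored as strings.
# ast.literal_eval parses them without executing arbitrary code, unlike eval.
import ast
embeddings = chunk["embedding"].apply(ast.literal_eval).tolist()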
data = [ids, titles, categories, timestamps, embeddings]
collection.insert(data)
total_inserted += len(ids)
# 显示进度和内存使用
memory_mb = get_memory_usage()
print(f"已插入: {total_inserted}, 内存: {memory_mb:.2f}MB")
# 定期刷新
if total_inserted % 10000 == 0:
collection.flush()
gc.collect() # 强制垃圾回收
collection.flush()
print(f"插入完成: {total_inserted} 条记录")
# 使用NumPy节省内存
def efficient_batch_insert(collection, count=100000):
"""高效批量插入"""
batch_size = 1000
for i in range(0, count, batch_size):
batch_count = min(batch_size, count - i)
# 使用NumPy生成数据(更节省内存)
ids = np.arange(i, i + batch_count, dtype=np.int64)
embeddings = np.random.rand(batch_count, 128).astype(np.float32)
# 转换为list(Milvus要求)
data = [
ids.tolist(),
[f"文档{j}" for j in range(i, i + batch_count)],
["技术"] * batch_count,
[1700000000] * batch_count,
embeddings.tolist()
]
collection.insert(data)
# 清理NumPy数组
del ids, embeddings
if (i + batch_count) % 10000 == 0:
collection.flush()
gc.collect()
memory_mb = get_memory_usage()
print(f"进度: {i + batch_count}/{count}, 内存: {memory_mb:.2f}MB")
collection.flush()
efficient_batch_insert(collection, count=100000)
# 自适应批次大小
class AdaptiveBatchInserter:
def __init__(self, collection, max_memory_mb=1024):
self.collection = collection
self.max_memory_mb = max_memory_mb
self.batch_size = 1000
def insert_batch(self, data):
"""插入批次并调整批次大小"""
memory_before = get_memory_usage()
self.collection.insert(data)
memory_after = get_memory_usage()
memory_used = memory_after - memory_before
# 根据内存使用调整批次大小
if memory_after > self.max_memory_mb * 0.8:
# 内存使用过高,减小批次
self.batch_size = max(100, int(self.batch_size * 0.8))
print(f"减小批次大小: {self.batch_size}")
elif memory_used < 50 and self.batch_size < 10000:
# 内存使用较低,增大批次
self.batch_size = min(10000, int(self.batch_size * 1.2))
print(f"增大批次大小: {self.batch_size}")
return self.batch_size
inserter = AdaptiveBatchInserter(collection, max_memory_mb=1024)
# 使用自适应插入
total = 50000
current = 0
while current < total:
batch_count = min(inserter.batch_size, total - current)
# 生成批次数据
data = [
list(range(current, current + batch_count)),
[f"文档{i}" for i in range(current, current + batch_count)],  # titles follow the ids instead of restarting at 0
["技术"] * batch_count,
[1700000000] * batch_count,
[[0.1]*128 for _ in range(batch_count)]
]
inserter.insert_batch(data)
current += batch_count
collection.flush()
---
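The chunked-reading idea above does not depend on pandas. The sketch below shows the same pattern with the standard library only: a generator that yields fixed-size batches from any iterable, plus a parser for embeddings stored as JSON text. `chunked` and `parse_embedding` are hypothetical helper names, and the JSON-array string format is an assumption about how the source file stores vectors.

```python
import json
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def parse_embedding(text):
    """Parse a JSON array such as '[0.1, 0.2]' into a list of floats."""
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError(f"expected a JSON array, got {type(values).__name__}")
    return [float(v) for v in values]


# A generator stands in for lines read lazily from a large file
lines = (f"[{i}, {i + 0.5}]" for i in range(5))
batches = [[parse_embedding(line) for line in chunk] for chunk in chunked(lines, 2)]
print(batches)
```

Only one batch of parsed vectors is alive at a time when each batch is inserted as soon as it is produced, which is what keeps peak memory flat.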
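02.Bulk query optimization
a.Parallel queries
a.Description
Running queries in parallel raises throughput. Milvus executes multiple queries concurrently, so the client can send requests from a thread pool or process pool. Limit the degree of concurrency so the server is not overloaded, and tune it to the server's capacity. Parallel queries help when many independent lookups must finish within a latency budget. Combining several lookups into one batched query reduces network round trips.
b.Code example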
---
from pymilvus import Collection
import concurrent.futures
import time
collection = Collection("documents")
collection.load()
# 单个查询函数
def query_by_id(collection, doc_id):
"""按ID查询"""
results = collection.query(
expr=f"id == {doc_id}",
output_fields=["id", "title", "category"]
)
return results
# 串行查询
def serial_query(collection, doc_ids):
"""串行查询"""
start = time.time()
results = []
for doc_id in doc_ids:
result = query_by_id(collection, doc_id)
results.extend(result)
elapsed = time.time() - start
return results, elapsed
# 并行查询
def parallel_query(collection, doc_ids, max_workers=10):
"""并行查询"""
start = time.time()
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [executor.submit(query_by_id, collection, doc_id) for doc_id in doc_ids]
for future in concurrent.futures.as_completed(futures):
result = future.result()
results.extend(result)
elapsed = time.time() - start
return results, elapsed
# 性能对比
test_ids = list(range(1, 101))
results_serial, time_serial = serial_query(collection, test_ids)
print(f"串行查询: {len(results_serial)} 条, 耗时: {time_serial:.2f}s")
results_parallel, time_parallel = parallel_query(collection, test_ids, max_workers=10)
print(f"并行查询: {len(results_parallel)} 条, 耗时: {time_parallel:.2f}s")
print(f"加速比: {time_serial / time_parallel:.2f}x")
# 批量IN查询
def batch_in_query(collection, doc_ids, batch_size=100):
"""批量IN查询"""
results = []
for i in range(0, len(doc_ids), batch_size):
batch = doc_ids[i:i+batch_size]
batch_results = collection.query(
expr=f"id in {batch}",
output_fields=["id", "title", "category"]
)
results.extend(batch_results)
return results
# 批量查询(更高效)
results_batch = batch_in_query(collection, test_ids, batch_size=50)
print(f"批量查询: {len(results_batch)} 条")
# 混合策略:批量+并行
def hybrid_query(collection, doc_ids, batch_size=50, max_workers=5):
"""混合查询策略"""
# 分批
batches = [doc_ids[i:i+batch_size] for i in range(0, len(doc_ids), batch_size)]
results = []
# 并行执行批次查询
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = [
executor.submit(
collection.query,
expr=f"id in {batch}",
output_fields=["id", "title", "category"]
)
for batch in batches
]
for future in concurrent.futures.as_completed(futures):
batch_results = future.result()
results.extend(batch_results)
return results
start = time.time()
results_hybrid = hybrid_query(collection, test_ids, batch_size=20, max_workers=5)
time_hybrid = time.time() - start
print(f"混合查询: {len(results_hybrid)} 条, 耗时: {time_hybrid:.2f}s")
---
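The parallel examples above collect results with `as_completed`, which yields futures in completion order, so the output order does not match the input ids. When order matters, `ThreadPoolExecutor.map` returns results in input order while still bounding concurrency. A minimal sketch follows; `fake_query` is a hypothetical stand-in for a real database call and `parallel_map_ordered` is an illustrative helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(doc_id):
    """Hypothetical stand-in for a network query; sleeps to mimic latency."""
    time.sleep(0.01)
    return {"id": doc_id, "title": f"doc-{doc_id}"}


def parallel_map_ordered(func, items, max_workers=4):
    """Run func over items on a bounded thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


doc_ids = [5, 3, 9, 1]
rows = parallel_map_ordered(fake_query, doc_ids, max_workers=2)
print([row["id"] for row in rows])
```

`max_workers` is the concurrency cap: at most that many calls are in flight at once, whatever the length of `items`.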
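b.Result aggregation
a.Description
Results from bulk queries usually need to be aggregated: deduplicated, sorted, and paginated. Complex aggregation logic can live in the application layer. Watch memory use: process large result sets in batches, or return them through a generator. Keep the aggregation itself efficient and avoid O(n²) algorithms. Libraries such as Pandas simplify common aggregations.
b.Code example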
---
from pymilvus import Collection
import pandas as pd
from collections import defaultdict
collection = Collection("documents")
collection.load()
# 批量查询并聚合
def query_and_aggregate(collection, categories):
"""按类别查询并聚合统计"""
results_by_category = defaultdict(list)
for category in categories:
results = collection.query(
expr=f'category == "{category}"',
output_fields=["id", "title", "category", "timestamp"],
limit=1000
)
results_by_category[category].extend(results)
# 统计每个类别的数量
stats = {cat: len(results) for cat, results in results_by_category.items()}
return results_by_category, stats
categories = ["AI", "ML", "DL"]
results, stats = query_and_aggregate(collection, categories)
print("类别统计:")
for cat, count in stats.items():
print(f" {cat}: {count} 条")
# 使用Pandas聚合
def query_to_dataframe(collection, expr, limit=10000):
"""查询结果转DataFrame"""
results = collection.query(
expr=expr,
output_fields=["*"],
limit=limit
)
if not results:
return pd.DataFrame()
df = pd.DataFrame(results)
return df
# 查询并分析
df = query_to_dataframe(collection, "id > 0", limit=10000)
if not df.empty:
# 按类别统计
category_counts = df["category"].value_counts()
print("\n类别分布:")
print(category_counts)
# 时间范围
if "timestamp" in df.columns:
df["datetime"] = pd.to_datetime(df["timestamp"], unit="s")
print(f"\n时间范围: {df['datetime'].min()} 到 {df['datetime'].max()}")
# 导出结果
df.to_csv("query_results.csv", index=False)
print("\n结果已导出到 query_results.csv")
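# Paginated aggregation
def paginated_query(collection, expr, page_size=100):
    """Yield query results one page at a time (generator)."""
    # Note: Milvus limits offset + limit for a single query (16384 by default),
    # so offset paging only reaches the first part of a very large match set.
    # Newer pymilvus releases provide Collection.query_iterator for full scans.
    offset = 0
    while True:
        results = collection.query(
            expr=expr,
            output_fields=["*"],
            limit=page_size,
            offset=offset
        )
        if not results:
            break
        yield results
        offset += page_size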
# 使用分页查询
total_count = 0
for page in paginated_query(collection, "id > 0", page_size=1000):
total_count += len(page)
print(f"处理了 {len(page)} 条记录,累计: {total_count}")
# 多条件聚合
def multi_condition_aggregate(collection):
"""多条件聚合查询"""
conditions = [
('category == "AI"', "AI类别"),
('category == "ML" and timestamp > 1700000000', "ML类别且时间>阈值"),
('category == "DL" or category == "NLP"', "DL或NLP类别")
]
results = {}
for expr, desc in conditions:
query_results = collection.query(
expr=expr,
output_fields=["id", "title", "category"],
limit=1000
)
results[desc] = query_results
print(f"{desc}: {len(query_results)} 条")
return results
aggregated = multi_condition_aggregate(collection)
---
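Conditions such as those in `multi_condition_aggregate` can overlap, so the same record may come back from more than one query. The sketch below shows one way to merge several result sets, deduplicate by id, sort, and page through them in linear time plus one sort. It is pure Python; `merge_results` and `paginate` are hypothetical helpers, and rows are assumed to be dicts with `id` and `timestamp` keys.

```python
def merge_results(result_sets, key="id"):
    """Merge several result lists, keeping the first row seen for each key."""
    merged = {}
    for rows in result_sets:
        for row in rows:
            merged.setdefault(row[key], row)
    return list(merged.values())


def paginate(rows, page, page_size):
    """Return one 1-based page of an already sorted list."""
    start = (page - 1) * page_size
    return rows[start:start + page_size]


ai_rows = [{"id": 1, "timestamp": 30}, {"id": 2, "timestamp": 10}]
recent_rows = [{"id": 2, "timestamp": 10}, {"id": 3, "timestamp": 20}]

merged = merge_results([ai_rows, recent_rows])
newest_first = sorted(merged, key=lambda row: row["timestamp"], reverse=True)
first_page = paginate(newest_first, page=1, page_size=2)
print([row["id"] for row in first_page])
```

The dict lookup makes deduplication O(n) overall, which avoids the quadratic cost of checking each row against a growing list.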
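5 Index system
5.1 Vector index types
01.Index categories
a.Exact index
a.Description
The exact index (FLAT) guarantees 100% recall through brute-force computation. It suits small datasets (below the million-vector range) or workloads with very strict recall requirements. It needs no training, so it builds quickly. Every query computes the distance to all stored vectors, so query time grows linearly with data size, and memory use is proportional to the data volume. FLAT serves as the baseline against which other indexes are compared, and fits prototypes and small applications.
b.Code example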
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
# 创建Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection("flat_index_demo", schema=schema)
# 插入测试数据
ids = list(range(10000))
embeddings = [[np.random.random() for _ in range(128)] for _ in range(10000)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# 创建FLAT索引
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(
field_name="embedding",
index_params=index_params
)
print("FLAT索引创建完成")
# 加载并搜索
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=10
)
print(f"搜索结果: {len(results[0])} 条")
for hit in results[0]:
print(f" ID: {hit.id}, 距离: {hit.distance:.4f}")
---
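What FLAT does at query time can be reproduced in a few lines: compute the distance from the query to every stored vector and keep the k smallest. This is a pure-Python sketch of the brute-force idea, not Milvus's implementation; `flat_search` and `squared_l2` are hypothetical names, and the squared L2 distance is used because it ranks vectors in the same order as the true L2 distance.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Exact top-k search: score every vector, return the k nearest (id, distance)."""
    scored = ((squared_l2(vec, query), vec_id) for vec_id, vec in vectors.items())
    return [(vec_id, dist) for dist, vec_id in heapq.nsmallest(k, scored)]


vectors = {
    1: [0.0, 0.0],
    2: [1.0, 0.0],
    3: [0.0, 2.0],
    4: [3.0, 3.0],
}
hits = flat_search(vectors, query=[0.9, 0.1], k=2)
print(hits)
```

Every query touches all N vectors, which is why recall is exact and why the cost grows linearly with the dataset.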
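b.Approximate indexes
a.Description
Approximate indexes trade a small amount of recall for faster queries. The family includes IVF, HNSW, ANNOY, and other algorithms. They take longer to build than FLAT, and IVF-based indexes need a training (clustering) step. Query cost grows sublinearly with data size, which makes them suitable for large datasets. Memory use can be tuned through index parameters. Recall is typically 95%-99%, enough for most applications. Each algorithm has a different performance profile, so choose according to the workload.
b.Code example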
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
# 创建Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection("approx_index_demo", schema=schema)
# 插入大规模数据
batch_size = 10000
total_count = 100000
for i in range(0, total_count, batch_size):
ids = list(range(i, i + batch_size))
embeddings = [[np.random.random() for _ in range(128)] for _ in range(batch_size)]
data = [ids, embeddings]
collection.insert(data)
print(f"已插入: {i + batch_size}/{total_count}")
collection.flush()
# 创建IVF_FLAT索引(近似索引)
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024} # 聚类中心数量
}
print("开始构建索引...")
start = time.time()
collection.create_index(
field_name="embedding",
index_params=index_params
)
elapsed = time.time() - start
print(f"索引构建完成,耗时: {elapsed:.2f}s")
# 加载并搜索
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
# 搜索参数(控制召回率和性能)
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16} # 搜索的聚类数量
}
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
elapsed = time.time() - start
print(f"搜索完成,耗时: {elapsed*1000:.2f}ms")
print(f"结果数量: {len(results[0])}")
---
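Recall figures such as "95%-99%" are measured by comparing an approximate index's results with exact results for the same queries. The sketch below computes recall@k for one query and the mean over several; it is pure Python, the function names are hypothetical, and the id lists are made-up illustration data rather than measurements.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(pairs, k):
    """Average recall@k over (exact_ids, approx_ids) pairs, one pair per query."""
    scores = [recall_at_k(exact, approx, k) for exact, approx in pairs]
    return sum(scores) / len(scores)


pairs = [
    ([1, 2, 3, 4], [1, 2, 3, 9]),  # one true neighbor missed
    ([5, 6, 7, 8], [5, 6, 7, 8]),  # all true neighbors found
]
print(mean_recall(pairs, k=4))
```

Averaging over many queries matters because recall for a single query can swing widely depending on where the query falls relative to cluster boundaries.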
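02.Index algorithms
a.IVF family
a.Description
IVF (Inverted File Index) is a clustering-based index. It partitions the vector space into clusters (Voronoi cells) and, at query time, searches only the few clusters nearest the query. IVF_FLAT stores the raw vectors, IVF_SQ8 compresses them with scalar quantization, and IVF_PQ compresses them with product quantization. nlist sets the number of clusters and is usually chosen between sqrt(N) and 4*sqrt(N). nprobe sets how many clusters are searched: larger values raise recall and lower query speed. IVF suits medium to large datasets (millions to hundreds of millions of vectors).
b.Code example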
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# IVF_FLAT: 精确距离计算
ivf_flat_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {
"nlist": 1024 # 聚类中心数量
}
}
# IVF_SQ8: 标量量化(节省75%内存)
ivf_sq8_params = {
"index_type": "IVF_SQ8",
"metric_type": "L2",
"params": {
"nlist": 1024
}
}
# IVF_PQ: 乘积量化(节省90%+内存)
ivf_pq_params = {
"index_type": "IVF_PQ",
"metric_type": "L2",
"params": {
"nlist": 1024,
"m": 8, # 子向量数量(必须能整除dim)
"nbits": 8 # 每个子向量的编码位数
}
}
# 创建索引
collection.create_index(
field_name="embedding",
index_params=ivf_flat_params
)
collection.load()
# 搜索参数
search_params_low = {"metric_type": "L2", "params": {"nprobe": 8}} # 低召回率,高性能
search_params_mid = {"metric_type": "L2", "params": {"nprobe": 16}} # 平衡
search_params_high = {"metric_type": "L2", "params": {"nprobe": 32}} # 高召回率,低性能
query_vector = [[np.random.random() for _ in range(128)]]
# 对比不同nprobe的性能
import time
for params in [search_params_low, search_params_mid, search_params_high]:
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=params,
limit=10
)
elapsed = time.time() - start
nprobe = params["params"]["nprobe"]
print(f"nprobe={nprobe}: 耗时 {elapsed*1000:.2f}ms")
---
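The roles of the inverted lists and `nprobe` can be seen in a toy version of IVF. The sketch below is pure Python and deliberately simplified: the centroids are fixed by hand (a real IVF index learns them with k-means), and `build_ivf` and `ivf_search` are hypothetical names, not Milvus APIs.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: squared_l2(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(vectors, centroids, lists, query, k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    probed = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda i: squared_l2(query, centroids[i])
    )
    candidates = [vec_id for i in probed for vec_id in lists[i]]
    return heapq.nsmallest(k, candidates, key=lambda vec_id: squared_l2(vectors[vec_id], query))


# Hand-picked centroids (assumption for illustration only)
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {1: [1.0, 1.0], 2: [4.0, 4.0], 3: [6.0, 6.0], 4: [9.0, 9.0]}
lists = build_ivf(vectors, centroids)

query = [5.2, 5.2]
top_nprobe_1 = ivf_search(vectors, centroids, lists, query, k=2, nprobe=1)
top_nprobe_2 = ivf_search(vectors, centroids, lists, query, k=2, nprobe=2)
print(lists)
print(top_nprobe_1, top_nprobe_2)
```

With `nprobe=1` the search misses vector 2, a true neighbor that sits just across the cluster boundary; probing both clusters recovers it at the cost of scanning more candidates. That is the recall versus speed trade-off `nprobe` controls.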
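b.Graph index
a.Description
A graph index (HNSW, Hierarchical Navigable Small World) builds a multi-layer navigation graph and finds neighbors by traversing it. HNSW is among the best-performing approximate indexes available. Query latency is generally stable and relatively insensitive to how the data is distributed. Memory use is higher than for IVF, in exchange for fast queries. M controls the connectivity of the graph, efConstruction controls build quality, and ef controls search quality. HNSW suits workloads with strict query latency requirements: it takes longer to build but queries quickly.
b.Code example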
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# HNSW索引参数
hnsw_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16, # 每层的最大连接数(4-64)
"efConstruction": 200 # 构建时的搜索深度(100-500)
}
}
print("开始构建HNSW索引...")
start = time.time()
collection.create_index(
field_name="embedding",
index_params=hnsw_params
)
elapsed = time.time() - start
print(f"索引构建完成,耗时: {elapsed:.2f}s")
collection.load()
# 搜索参数
search_params_fast = {"metric_type": "L2", "params": {"ef": 64}} # 快速搜索
search_params_balanced = {"metric_type": "L2", "params": {"ef": 128}} # 平衡
search_params_accurate = {"metric_type": "L2", "params": {"ef": 256}} # 高精度
query_vector = [[np.random.random() for _ in range(128)]]
# 对比不同ef的性能
for params in [search_params_fast, search_params_balanced, search_params_accurate]:
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=params,
limit=10
)
elapsed = time.time() - start
ef = params["params"]["ef"]
print(f"ef={ef}: 耗时 {elapsed*1000:.2f}ms")
# HNSW vs IVF性能对比
# 重建为IVF索引
collection.release()
collection.drop_index()
ivf_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=ivf_params)
collection.load()
# IVF搜索
start = time.time()
results_ivf = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10
)
time_ivf = time.time() - start
print(f"\nIVF_FLAT: {time_ivf*1000:.2f}ms")
print("HNSW is often faster than IVF at the same recall but uses more memory; measure both on your own data")
---
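The higher memory use of HNSW comes from the graph links stored next to each vector. A commonly cited rule of thumb, used here as an assumption rather than a Milvus guarantee, is roughly `dim * 4` bytes for a float32 vector plus about `M * 2 * 4` bytes for its bottom-layer links. The sketch below turns that into arithmetic; `hnsw_memory_bytes` is a hypothetical helper, and the estimate ignores upper layers and any engine overhead.

```python
def hnsw_memory_bytes(num_vectors, dim, m):
    """Rough HNSW size estimate: float32 vector data plus bottom-layer links.

    Assumption: each vector stores about 2 * m neighbor ids of 4 bytes each.
    Upper graph layers and implementation overhead are not counted.
    """
    vector_bytes = dim * 4
    link_bytes = m * 2 * 4
    return num_vectors * (vector_bytes + link_bytes)


estimate = hnsw_memory_bytes(num_vectors=1_000_000, dim=128, m=16)
raw_vectors = 1_000_000 * 128 * 4
print(f"HNSW estimate: {estimate / 1024 ** 2:.1f} MiB")
print(f"Raw vectors:   {raw_vectors / 1024 ** 2:.1f} MiB")
```

Under these assumptions, M=16 adds about a quarter on top of the raw 128-dimensional vectors, and the overhead grows linearly with M.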
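03.Distance metrics
a.Euclidean distance
a.Description
Euclidean distance (L2) is the most widely used vector distance metric. It measures the straight-line distance between two vectors, and a smaller distance means greater similarity. It works with both normalized and unnormalized vectors and fits most similarity workloads. Computing it costs O(d), where d is the vector dimension. Milvus applies hardware acceleration to L2 computation. L2 suits continuous features such as image and audio embeddings.
b.Code example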
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# 创建L2索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2", # 欧氏距离
"params": {"nlist": 1024}
}
collection.create_index(
field_name="embedding",
index_params=index_params
)
collection.load()
# L2搜索
query_vector = [[np.random.random() for _ in range(128)]]
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=10
)
print("L2距离搜索结果:")
for hit in results[0]:
print(f" ID: {hit.id}, L2距离: {hit.distance:.4f}")
# Verify the distance by hand
def l2_distance(vec1, vec2):
    """Euclidean (L2) distance between two vectors."""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.sqrt(np.sum((vec1 - vec2) ** 2))
# Check the first hit
first_id = results[0][0].id
result_vec = collection.query(
    expr=f"id == {first_id}",
    output_fields=["embedding"]
)[0]["embedding"]
manual_distance = l2_distance(query_vector[0], result_vec)
milvus_distance = results[0][0].distance
# For the L2 metric Milvus reports the squared Euclidean distance (no square root),
# so compare it with the square of the manual value.
print("\nVerification:")
print(f"  Milvus distance (squared L2): {milvus_distance:.4f}")
print(f"  Manual squared L2: {manual_distance ** 2:.4f}")
print(f"  Difference: {abs(milvus_distance - manual_distance ** 2):.6f}")
---
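b.Inner product and cosine
a.Description
Inner product (IP) is the dot product of two vectors; a larger value means greater similarity. Cosine similarity (COSINE) is the cosine of the angle between two vectors and lies in [-1, 1]. For normalized vectors, IP and COSINE are equivalent. Both suit text embeddings and recommendation systems. With the COSINE metric Milvus normalizes the vectors itself, while IP is the better fit for vectors that are already normalized because it avoids normalizing them again. Inner product is slightly cheaper to compute than L2.
b.Code example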
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# 创建IP索引
index_params_ip = {
"index_type": "IVF_FLAT",
"metric_type": "IP", # 内积
"params": {"nlist": 1024}
}
collection.create_index(
field_name="embedding",
index_params=index_params_ip
)
collection.load()
# 归一化查询向量
query_vector = np.random.random(128)
query_vector = query_vector / np.linalg.norm(query_vector) # L2归一化
query_vector = [query_vector.tolist()]
# IP搜索
results_ip = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "IP"},
limit=10
)
print("内积搜索结果:")
for hit in results_ip[0]:
print(f" ID: {hit.id}, 内积: {hit.distance:.4f}")
# 使用COSINE
collection.release()
collection.drop_index()
index_params_cosine = {
"index_type": "IVF_FLAT",
"metric_type": "COSINE", # 余弦相似度
"params": {"nlist": 1024}
}
collection.create_index(
field_name="embedding",
index_params=index_params_cosine
)
collection.load()
# COSINE搜索(自动归一化)
query_vector_raw = [[np.random.random() for _ in range(128)]] # 未归一化
results_cosine = collection.search(
data=query_vector_raw,
anns_field="embedding",
param={"metric_type": "COSINE"},
limit=10
)
print("\n余弦相似度搜索结果:")
for hit in results_cosine[0]:
print(f" ID: {hit.id}, 余弦相似度: {hit.distance:.4f}")
# 手动计算余弦相似度
def cosine_similarity(vec1, vec2):
"""计算余弦相似度"""
vec1 = np.array(vec1)
vec2 = np.array(vec2)
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
# 验证
first_id = results_cosine[0][0].id
result_vec = collection.query(
expr=f"id == {first_id}",
output_fields=["embedding"]
)[0]["embedding"]
manual_cosine = cosine_similarity(query_vector_raw[0], result_vec)
milvus_cosine = results_cosine[0][0].distance
print(f"\n验证:")
print(f" Milvus余弦: {milvus_cosine:.4f}")
print(f" 手动计算: {manual_cosine:.4f}")
# IP vs COSINE对比
print("\nIP vs COSINE:")
print(" 归一化向量: IP == COSINE")
print(" 未归一化向量: COSINE会自动归一化,IP不会")
print(" 性能: IP略快(避免归一化开销)")
print(" 适用场景: 文本向量通常使用COSINE,图像向量可以使用L2或IP")
---
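The claim that IP and COSINE agree on normalized vectors can be checked numerically, together with the related identity for unit vectors, ||a - b||² = 2 - 2·cos(a, b), which is why L2, IP, and COSINE rank normalized vectors identically. The sketch below uses only the standard library; the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in a]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

cos_ab = cosine(a, b)                    # cosine of the raw vectors
ip_unit = dot(a_unit, b_unit)            # inner product after normalization
l2_sq_unit = squared_l2(a_unit, b_unit)  # squared L2 after normalization
print(cos_ab, ip_unit, l2_sq_unit, 2 - 2 * cos_ab)
```

On the raw vectors the inner product is 24 while the cosine is 0.96, which shows why IP should only be used as a cosine substitute once the vectors have unit length.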
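5.2 FLAT index
01.Basic characteristics
a.Exact search
a.Description
The FLAT index guarantees 100% recall through brute-force computation and is the only exact index type for float vectors. A search computes the distance between the query and every stored vector, then returns the top-K results. No training is needed, so index creation is almost instantaneous. Memory use equals the size of the raw vector data. Query time complexity is O(N*d), where N is the number of vectors and d is the dimension. FLAT suits datasets of fewer than one million vectors and is commonly used as the performance and recall baseline for other indexes.
b.Code example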
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
# 创建测试Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields, description="FLAT索引测试")
collection = Collection("flat_test", schema=schema)
# 插入测试数据
data_sizes = [1000, 10000, 100000]
for size in data_sizes:
# 清空collection
collection.drop()
collection = Collection("flat_test", schema=schema)
# 插入数据
ids = list(range(size))
titles = [f"文档{i}" for i in range(size)]
embeddings = [[np.random.random() for _ in range(128)] for _ in range(size)]
data = [ids, titles, embeddings]
collection.insert(data)
collection.flush()
# 创建FLAT索引
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
index_time = time.time() - start
collection.load()
# 测试查询性能
query_vector = [[np.random.random() for _ in range(128)]]
# 预热
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=10
)
# 正式测试
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=10
)
query_time = time.time() - start
print(f"\n数据量: {size:,}")
print(f" 索引构建时间: {index_time*1000:.2f}ms")
print(f" 查询时间: {query_time*1000:.2f}ms")
print(f" 召回率: 100% (精确搜索)")
---
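The two cost statements above (memory equal to the raw vector data, query cost O(N*d)) translate directly into arithmetic. The sketch below estimates both for a float32 collection; the function names are hypothetical, and the figures are raw counts that ignore any engine overhead.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for num_vectors float32 vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value


def flat_distance_terms(num_vectors, dim):
    """Per-query count of per-dimension distance terms: N * d."""
    return num_vectors * dim


for n in (100_000, 1_000_000, 10_000_000):
    size_mib = flat_index_bytes(n, 128) / 1024 ** 2
    terms = flat_distance_terms(n, 128)
    print(f"N={n:>10,}  memory={size_mib:8.1f} MiB  terms per query={terms:,}")
```

Both quantities scale linearly with N, so a tenfold increase in data means roughly ten times the memory and ten times the work per query.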
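b.Use cases
a.Description
The FLAT index fits small datasets, prototype development, exact-search requirements, and recall benchmarking. Performance is acceptable below roughly 100,000 vectors. It suits applications with strict recall requirements, such as medical or financial systems. It can act as the control group for other indexes, verifying the recall of an approximate index. Using FLAT early in development makes it quick to validate functionality. It is not suited to large-scale production unless the dataset really is small.
b.Code example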
---
from pymilvus import Collection
import numpy as np
# 场景1: 小规模精确搜索
def small_scale_exact_search():
"""小规模数据的精确搜索"""
collection = Collection("medical_images") # 假设医疗图像库
# FLAT索引保证精确结果
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 查询最相似的病例
query_vector = [[0.1] * 128] # 患者图像向量
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=5,
output_fields=["id", "title"]
)
print("5 most similar cases (exact search):")
for hit in results[0]:
print(f" Case ID: {hit.id}, L2 distance: {hit.distance:.4f}")
# 场景2: 召回率基准测试
def recall_benchmark():
"""使用FLAT作为召回率基准"""
collection = Collection("documents")
query_vector = [[np.random.random() for _ in range(128)]]
# FLAT索引(精确结果)
collection.release()
collection.drop_index()
flat_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
flat_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_ids = set([hit.id for hit in flat_results[0]])
# IVF索引(近似结果)
collection.release()
collection.drop_index()
ivf_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=ivf_params)
collection.load()
ivf_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=100
)
ivf_ids = set([hit.id for hit in ivf_results[0]])
# 计算召回率
recall = len(flat_ids & ivf_ids) / len(flat_ids)
print(f"IVF索引召回率: {recall*100:.2f}%")
# 场景3: 原型开发
def prototype_development():
"""原型开发阶段使用FLAT索引"""
collection = Collection("prototype_collection")
# 快速创建索引,无需调参
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
print("原型开发建议:")
print(" 1. 使用FLAT索引快速验证功能")
print(" 2. 数据量控制在10万以内")
print(" 3. 功能稳定后再切换到近似索引")
print(" 4. 保留FLAT索引作为召回率基准")
small_scale_exact_search()
recall_benchmark()
prototype_development()
---
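02.Performance characteristics
a.Time complexity
a.Description
Building a FLAT index involves no training, so build time is effectively O(1) and completes almost instantly. Query time complexity is O(N*d), where N is the number of vectors and d is the dimension, so query time grows linearly with data size. Batched queries can take advantage of SIMD instructions, and GPU acceleration can improve performance significantly. For a fixed data size, query time is fairly stable: it does not depend on the data distribution, which makes performance predictable.
b.Code example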
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
import matplotlib.pyplot as plt
# 测试不同数据量的查询时间
def test_query_time_scaling():
"""测试查询时间随数据量的变化"""
data_sizes = [1000, 5000, 10000, 50000, 100000]
query_times = []
for size in data_sizes:
# 创建collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection(f"flat_scale_test_{size}", schema=schema)
# 插入数据
ids = list(range(size))
embeddings = [[np.random.random() for _ in range(128)] for _ in range(size)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# 创建索引
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试查询时间
query_vector = [[np.random.random() for _ in range(128)]]
# 多次查询取平均
times = []
for _ in range(10):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=10
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000 # 转换为ms
query_times.append(avg_time)
print(f"数据量: {size:6d}, 平均查询时间: {avg_time:.2f}ms")
# 清理
collection.drop()
# 绘制曲线
plt.figure(figsize=(10, 6))
plt.plot(data_sizes, query_times, marker='o')
plt.xlabel('数据量')
plt.ylabel('查询时间 (ms)')
plt.title('FLAT索引查询时间随数据量的变化')
plt.grid(True)
plt.savefig('flat_scaling.png')
print("\n性能曲线已保存到 flat_scaling.png")
test_query_time_scaling()
# 测试不同维度的影响
def test_dimension_impact():
"""测试向量维度对查询时间的影响"""
dimensions = [64, 128, 256, 512, 1024]
query_times = []
data_size = 10000
for dim in dimensions:
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
schema = CollectionSchema(fields=fields)
collection = Collection(f"flat_dim_test_{dim}", schema=schema)
# 插入数据
ids = list(range(data_size))
embeddings = [[np.random.random() for _ in range(dim)] for _ in range(data_size)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# 创建索引
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试查询时间
query_vector = [[np.random.random() for _ in range(dim)]]
times = []
for _ in range(10):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=10
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
query_times.append(avg_time)
print(f"维度: {dim:4d}, 平均查询时间: {avg_time:.2f}ms")
collection.drop()
print("\nExpected trend: query time grows roughly in proportion to the dimension")
test_dimension_impact()
---
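To make the O(N*d) cost concrete, here is a minimal pure-Python sketch of what an exhaustive (FLAT-style) search does. It is an illustration only, not how Milvus implements FLAT; the helper names `l2_squared` and `flat_search` are made up for this example.

```python
import heapq
import random


def l2_squared(a, b):
    """Squared Euclidean distance; costs O(d) for d-dimensional vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Exhaustive top-k search: one distance per stored vector, so O(N * d)."""
    scored = ((l2_squared(vec, query), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


random.seed(0)
dim = 8
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]
query = vectors[42]  # reuse a stored vector so the best hit is known

top = flat_search(vectors, query, k=3)
print(top[0])  # the query is stored at index 42, so this is (0.0, 42)
```

Doubling either the number of vectors or the dimension doubles the number of multiply-adds, which is the linear scaling the benchmark above is meant to show.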
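b.Space complexity
a.Description
A FLAT index stores N*d*4 bytes for N float32 vectors of dimension d, i.e. O(N*d) space. Nothing is compressed; the original vectors are kept in full. A 128-dimensional float32 vector takes 512 bytes, so one million vectors take 512,000,000 bytes, which is 512 MB in decimal units or about 488 MiB (the figure the code below prints, since it divides by 1024). Memory use is predictable and does not depend on index parameters. Compared with compressed indexes such as IVF_SQ8 and IVF_PQ, the footprint is much larger. It suits deployments with ample memory.
b.Code example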
---
from pymilvus import Collection, utility
import numpy as np
# 计算内存占用
def calculate_memory_usage(num_vectors, dim):
"""计算FLAT索引的内存占用"""
bytes_per_vector = dim * 4 # float32
total_bytes = num_vectors * bytes_per_vector
total_mb = total_bytes / 1024 / 1024
total_gb = total_mb / 1024
return {
"vectors": num_vectors,
"dimension": dim,
"bytes_per_vector": bytes_per_vector,
"total_mb": total_mb,
"total_gb": total_gb
}
# 常见规模的内存占用
scenarios = [
(10000, 128, "小规模应用"),
(100000, 128, "中等规模应用"),
(1000000, 128, "大规模应用"),
(1000000, 768, "大模型embedding"),
(10000000, 128, "超大规模应用")
]
print("FLAT索引内存占用估算:\n")
for num_vectors, dim, desc in scenarios:
usage = calculate_memory_usage(num_vectors, dim)
print(f"{desc}:")
print(f" 向量数量: {usage['vectors']:,}")
print(f" 向量维度: {usage['dimension']}")
print(f" 单向量大小: {usage['bytes_per_vector']} 字节")
print(f" 总内存: {usage['total_mb']:.2f} MB ({usage['total_gb']:.2f} GB)")
print()
# 实际测量内存占用
def measure_actual_memory():
"""实际测量FLAT索引的内存占用"""
collection = Collection("memory_test")
# 插入数据
size = 100000
dim = 128
ids = list(range(size))
embeddings = [[np.random.random() for _ in range(dim)] for _ in range(size)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# 创建索引
index_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
# Report the entity count (the ORM Collection class exposes num_entities; to my knowledge it has no get_stats() method)
print(f"Entities in collection: {collection.num_entities}")
# 理论内存占用
theoretical_mb = calculate_memory_usage(size, dim)["total_mb"]
print(f"\n理论内存占用: {theoretical_mb:.2f} MB")
print("实际占用略高于理论值(包含元数据和索引结构)")
measure_actual_memory()
# 内存占用对比
def compare_index_memory():
"""对比不同索引的内存占用"""
print("\n不同索引类型的内存占用对比(100万向量,128维):\n")
comparisons = [
("FLAT", 1.0, "512 MB", "无压缩,精确搜索"),
("IVF_FLAT", 1.0, "512 MB", "无压缩,近似搜索"),
("IVF_SQ8", 0.25, "128 MB", "标量量化,节省75%"),
("IVF_PQ", 0.05, "26 MB", "乘积量化,节省95%"),
("HNSW", 1.5, "768 MB", "图索引,额外图结构")
]
for index_type, ratio, memory, description in comparisons:
print(f"{index_type:12s}: {memory:8s} (相对FLAT: {ratio*100:5.1f}%) - {description}")
print("\n建议:")
print(" - 内存充足: 使用FLAT或HNSW")
print(" - 内存紧张: 使用IVF_SQ8或IVF_PQ")
print(" - 平衡选择: 使用IVF_FLAT")
compare_index_memory()
---
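The memory figures in this subsection mix two unit conventions, so a short arithmetic sketch may help. The function name `flat_memory_bytes` is hypothetical; the only assumption is 4 bytes per float32 component with no per-vector overhead.

```python
def flat_memory_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed vectors, ignoring metadata."""
    return num_vectors * dim * bytes_per_value


raw = flat_memory_bytes(1_000_000, 128)
print(raw)               # 512000000 bytes
print(raw / 1000 ** 2)   # 512.0 (decimal megabytes, MB)
print(raw / 1024 ** 2)   # 488.28125 (binary mebibytes, MiB)
```

The "512 MB" quoted for one million 128-dimensional vectors is the decimal figure; code that divides by 1024 twice reports the same data as roughly 488.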
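5.3 IVF index family
01.How IVF works
a.Cluster partitioning
a.Description
IVF (Inverted File Index) uses k-means clustering to partition the vector space into Voronoi cells. Each cell is represented by a centroid, and every vector is assigned to its nearest centroid. A query first finds the few nearest centroids and then searches only the vectors in those cells. The nlist parameter sets the number of clusters; a common starting range is sqrt(N) to 4*sqrt(N), where N is the total number of vectors. The clustering has to be trained, typically by running k-means iterations on a sample of the data, and training time grows with both nlist and the sample size.
b.Code example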
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
# 创建测试Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection("ivf_demo", schema=schema)
# 插入数据
data_size = 100000
ids = list(range(data_size))
embeddings = [[np.random.random() for _ in range(128)] for _ in range(data_size)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# 测试不同nlist值
nlist_values = [128, 256, 512, 1024, 2048]
for nlist in nlist_values:
# 创建IVF索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": nlist}
}
print(f"\nnlist = {nlist}")
# 测量构建时间
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
print(f" 构建时间: {build_time:.2f}s")
collection.load()
# 测试查询性能
query_vector = [[np.random.random() for _ in range(128)]]
# 不同nprobe值
for nprobe in [1, 8, 16, 32]:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
query_time = time.time() - start
print(f" nprobe={nprobe:2d}: {query_time*1000:.2f}ms")
# 清理索引
collection.release()
collection.drop_index()
# nlist选择建议
def recommend_nlist(num_vectors):
"""推荐nlist值"""
sqrt_n = int(np.sqrt(num_vectors))
recommendations = {
"conservative": sqrt_n,
"balanced": 2 * sqrt_n,
"aggressive": 4 * sqrt_n
}
return recommendations
print(f"\n对于 {data_size:,} 个向量:")
recs = recommend_nlist(data_size)
for strategy, value in recs.items():
print(f" {strategy}: nlist = {value}")
---
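The partitioning step can be sketched without any database. The following toy version runs a few Lloyd's k-means iterations and then fills one inverted list per centroid. It is a simplified illustration under stated assumptions (random initial centroids, fixed iteration count, hypothetical names `train_ivf` and `nearest_centroid`), not the training procedure Milvus uses.

```python
import random


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: l2_squared(vec, centroids[c]))


def train_ivf(vectors, nlist, iterations=5, seed=0):
    """Toy Lloyd's k-means followed by assignment into inverted lists."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iterations):
        buckets = [[] for _ in range(nlist)]
        for vec in vectors:
            buckets[nearest_centroid(vec, centroids)].append(vec)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cell ends up empty
                dim = len(members[0])
                centroids[c] = [sum(m[i] for m in members) / len(members)
                                for i in range(dim)]
    # Each inverted list holds the ids of the vectors in one Voronoi cell
    inverted_lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        inverted_lists[nearest_centroid(vec, centroids)].append(idx)
    return centroids, inverted_lists


rng = random.Random(1)
vectors = [[rng.random() for _ in range(4)] for _ in range(500)]
centroids, inverted_lists = train_ivf(vectors, nlist=8)
print([len(lst) for lst in inverted_lists])  # cell sizes, summing to 500
```

Every vector id lands in exactly one list, which is why a query that probes only a few lists touches only a fraction of the data.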
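b.Search strategy
a.Description
An IVF search has two stages: coarse and fine. The coarse stage computes the distance from the query vector to every centroid and selects the nprobe nearest clusters. The fine stage computes distances to the vectors inside the selected clusters and returns the top-K results. nprobe therefore sets the trade-off between recall and speed: a larger nprobe raises recall and lowers throughput. With nprobe = nlist every cluster is scanned, so IVF_FLAT returns the same results as FLAT (at a slightly higher cost, because of the coarse stage). The best value is found empirically.
b.Code example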
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("ivf_demo")
# 创建IVF索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试不同nprobe的召回率和性能
query_vector = [[np.random.random() for _ in range(128)]]
# 先用FLAT获取精确结果作为基准
collection.release()
collection.drop_index()
flat_params = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
flat_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_ids = set([hit.id for hit in flat_results[0]])
# 切换回IVF索引
collection.release()
collection.drop_index()
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试不同nprobe
print("nprobe性能和召回率对比:\n")
print(f"{'nprobe':>8s} {'查询时间':>10s} {'召回率':>8s}")
print("-" * 30)
nprobe_values = [1, 2, 4, 8, 16, 32, 64, 128]
for nprobe in nprobe_values:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
# 测量查询时间
times = []
for _ in range(10):
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
# 计算召回率
ivf_ids = set([hit.id for hit in results[0]])
recall = len(flat_ids & ivf_ids) / len(flat_ids)
print(f"{nprobe:8d} {avg_time:9.2f}ms {recall*100:7.2f}%")
# 自动选择nprobe
def auto_select_nprobe(collection, query_vector, target_recall=0.95, max_nprobe=128):
"""自动选择满足目标召回率的最小nprobe"""
# 获取精确结果
collection.release()
collection.drop_index()
flat_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
flat_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_ids = set([hit.id for hit in flat_results[0]])
# 恢复IVF索引
collection.release()
collection.drop_index()
ivf_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=ivf_params)
collection.load()
# 二分查找最优nprobe
left, right = 1, max_nprobe
best_nprobe = max_nprobe
while left <= right:
mid = (left + right) // 2
search_params = {
"metric_type": "L2",
"params": {"nprobe": mid}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
ivf_ids = set([hit.id for hit in results[0]])
recall = len(flat_ids & ivf_ids) / len(flat_ids)
if recall >= target_recall:
best_nprobe = mid
right = mid - 1
else:
left = mid + 1
return best_nprobe
optimal_nprobe = auto_select_nprobe(collection, query_vector, target_recall=0.95)
print(f"\n推荐nprobe值(95%召回率): {optimal_nprobe}")
---
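The two-stage search can also be shown in a few lines of plain Python. This sketch skips k-means and simply samples centroids from the data, which is an assumption made to keep it short; `build_lists` and `ivf_search` are illustrative names, not library functions.

```python
import heapq
import random


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_lists(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        cell = min(range(len(centroids)), key=lambda c: l2_squared(vec, centroids[c]))
        lists[cell].append(idx)
    return lists


def ivf_search(vectors, centroids, lists, query, k, nprobe):
    # Coarse stage: pick the nprobe centroids closest to the query
    probed = heapq.nsmallest(nprobe, range(len(centroids)),
                             key=lambda c: l2_squared(query, centroids[c]))
    # Fine stage: exact distances, but only inside the probed cells
    candidates = (idx for c in probed for idx in lists[c])
    return heapq.nsmallest(k, candidates, key=lambda idx: l2_squared(query, vectors[idx]))


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
centroids = rng.sample(vectors, 16)  # nlist = 16, unsampled shortcut for k-means
lists = build_lists(vectors, centroids)
query = [rng.random() for _ in range(8)]

exact = heapq.nsmallest(10, range(len(vectors)),
                        key=lambda idx: l2_squared(query, vectors[idx]))
for nprobe in (1, 4, 16):
    found = ivf_search(vectors, centroids, lists, query, k=10, nprobe=nprobe)
    recall = len(set(found) & set(exact)) / len(exact)
    print(f"nprobe={nprobe:2d} recall@10={recall:.2f}")
```

With nprobe equal to nlist (16 here) every list is scanned, so the result set matches the exhaustive search; smaller values trade recall for fewer distance computations.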
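02.IVF variants
a.IVF_FLAT
a.Description
IVF_FLAT is the most basic IVF index: it keeps the original vectors uncompressed. Distances inside the probed clusters are exact, so recall depends only on which clusters are probed, that is, on nlist and nprobe. Memory use is about the same as FLAT (plus the centroids), while queries are much faster because only a fraction of the data is scanned. It suits deployments with enough memory and high recall requirements. For the same nlist and nprobe it has the highest recall in the IVF family, since no quantization error is added, and it builds faster than the variants that also train a quantizer. It is a reasonable default within the IVF family unless memory is constrained.
b.Code example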
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# IVF_FLAT索引配置
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {
"nlist": 1024 # 聚类数量
}
}
print("开始构建IVF_FLAT索引...")
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
print(f"构建完成,耗时: {build_time:.2f}s")
collection.load()
# 性能测试
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(100)]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 单次查询
start = time.time()
results = collection.search(
data=[query_vectors[0]],
anns_field="embedding",
param=search_params,
limit=10
)
single_time = time.time() - start
print(f"单次查询: {single_time*1000:.2f}ms")
# 批量查询
start = time.time()
results = collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
batch_time = time.time() - start
print(f"批量查询(100): {batch_time*1000:.2f}ms")
print(f"平均每次: {batch_time/100*1000:.2f}ms")
# 内存占用估算
num_vectors = collection.num_entities
dim = 128
memory_mb = num_vectors * dim * 4 / 1024 / 1024
print(f"\n内存占用估算: {memory_mb:.2f} MB")
# 性能调优建议
print("\nIVF_FLAT调优建议:")
print(" 1. nlist = sqrt(N) ~ 4*sqrt(N)")
print(" 2. nprobe = 8~64 (根据召回率要求)")
print(" 3. 批量查询可提升吞吐量")
print(" 4. 适合内存充足的场景")
---
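b.IVF_SQ8
a.Description
IVF_SQ8 compresses vectors with 8-bit scalar quantization, storing each float32 component as a uint8. This cuts vector storage by about 75% at the cost of some precision. Quantization maps the values of each dimension onto the range 0-255. Distances are computed on dequantized values at query time, which adds a little computation. It suits memory-constrained deployments that can tolerate a small accuracy loss. Recall is slightly lower than IVF_FLAT; how much lower depends on the data, so measure it against a FLAT baseline. It is commonly used to reduce memory on large datasets.
b.Code example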
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# IVF_SQ8索引配置
index_params = {
"index_type": "IVF_SQ8",
"metric_type": "L2",
"params": {
"nlist": 1024
}
}
print("开始构建IVF_SQ8索引...")
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
print(f"构建完成,耗时: {build_time:.2f}s")
collection.load()
# 性能测试
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
query_time = time.time() - start
print(f"查询时间: {query_time*1000:.2f}ms")
# 内存节省
num_vectors = collection.num_entities
dim = 128
flat_memory = num_vectors * dim * 4 / 1024 / 1024 # float32
sq8_memory = num_vectors * dim * 1 / 1024 / 1024 # uint8
savings = (1 - sq8_memory / flat_memory) * 100
print(f"\n内存对比:")
print(f" FLAT: {flat_memory:.2f} MB")
print(f" SQ8: {sq8_memory:.2f} MB")
print(f" 节省: {savings:.1f}%")
# 精度对比
print("\nIVF_SQ8特点:")
print(" 优点: 节省75%内存,查询速度接近IVF_FLAT")
print(" 缺点: 精度略有损失(通常<2%)")
print(" 适用: 大规模数据集,内存受限场景")
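# Scalar quantization walkthrough
def quantize_vector(vector):
    """Demonstrate 8-bit scalar quantization on a single vector."""
    vector = np.array(vector)
    # Find the value range
    vmin, vmax = vector.min(), vector.max()
    # Guard against a constant vector, which would divide by zero
    scale = (vmax - vmin) if vmax > vmin else 1.0
    # Map to 0-255, rounding to the nearest code instead of truncating
    quantized = np.round((vector - vmin) / scale * 255).astype(np.uint8)
    # Dequantize
    dequantized = quantized.astype(np.float32) / 255 * scale + vmin
    # Mean absolute reconstruction error
    error = np.abs(vector - dequantized).mean()
    return quantized, dequantized, error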
test_vector = [np.random.random() for _ in range(128)]
quantized, dequantized, error = quantize_vector(test_vector)
print(f"\n量化示例:")
print(f" 原始范围: [{min(test_vector):.4f}, {max(test_vector):.4f}]")
print(f" 量化范围: [0, 255]")
print(f" 平均误差: {error:.6f}")
---
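The description above says each dimension is mapped onto 0-255. A minimal sketch of that per-dimension scheme, with ranges learned from a dataset, could look like the following. The functions `train_sq8`, `encode` and `decode` are hypothetical names for illustration; the codec an actual engine uses may differ in detail.

```python
import random


def train_sq8(vectors):
    """Learn one (min, max) range per dimension from the data."""
    dim = len(vectors[0])
    mins = [min(v[i] for v in vectors) for i in range(dim)]
    maxs = [max(v[i] for v in vectors) for i in range(dim)]
    return mins, maxs


def encode(vec, mins, maxs):
    """Store each component as one byte (0-255) within its dimension's range."""
    codes = []
    for x, lo, hi in zip(vec, mins, maxs):
        scale = (hi - lo) or 1.0  # guard against a constant dimension
        code = round((x - lo) / scale * 255)
        codes.append(min(255, max(0, code)))  # clamp out-of-range inputs
    return bytes(codes)


def decode(codes, mins, maxs):
    """Approximate reconstruction of the original components."""
    return [lo + c / 255 * (hi - lo) for c, lo, hi in zip(codes, mins, maxs)]


rng = random.Random(0)
data = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(100)]
mins, maxs = train_sq8(data)
codes = encode(data[0], mins, maxs)
restored = decode(codes, mins, maxs)
print(len(codes), "bytes instead of", 4 * len(data[0]))  # 4 bytes instead of 16
```

For values inside the learned range, rounding keeps the reconstruction error in each dimension within half a quantization step, i.e. (max - min) / 510.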
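03.Parameter tuning
a.Choosing nlist
a.Description
nlist is the main build-time parameter of an IVF index; it sets the number of clusters. If nlist is too small, each cluster holds too many vectors and the fine stage becomes slow. If nlist is too large, clusters become very small, the coarse stage becomes expensive, and more clusters must be probed to keep recall. A common starting range is sqrt(N) to 4*sqrt(N), where N is the number of vectors; for one million vectors that is nlist = 1000 to 4000. Powers of two such as 1024 or 4096 are a common convention but not a requirement. The value should be tuned to the data distribution and query pattern. Build time grows with nlist.
b.Code example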
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 测试不同nlist值的性能
num_vectors = collection.num_entities
sqrt_n = int(np.sqrt(num_vectors))
nlist_candidates = [
sqrt_n,
2 * sqrt_n,
4 * sqrt_n,
1024, # 常用值
2048,
4096
]
print(f"向量数量: {num_vectors:,}")
print(f"sqrt(N): {sqrt_n}\n")
results_summary = []
for nlist in nlist_candidates:
# 创建索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": nlist}
}
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
collection.load()
# 测试查询性能(nprobe=16)
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
times = []
for _ in range(10):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
times.append(time.time() - start)
avg_query_time = np.mean(times) * 1000
results_summary.append({
"nlist": nlist,
"build_time": build_time,
"query_time": avg_query_time
})
print(f"nlist={nlist:5d}: 构建 {build_time:5.2f}s, 查询 {avg_query_time:6.2f}ms")
# 清理
collection.release()
collection.drop_index()
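# Fastest configuration at nprobe=16; recall is not measured here, so this is a latency ranking only
best = min(results_summary, key=lambda x: x["query_time"])
print(f"\nFastest nlist at nprobe=16: {best['nlist']} (check recall before adopting it)")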
# nlist选择策略
def recommend_nlist_strategy(num_vectors):
"""推荐nlist选择策略"""
sqrt_n = int(np.sqrt(num_vectors))
strategies = {
"快速构建": sqrt_n,
"平衡性能": 2 * sqrt_n,
"高性能": 4 * sqrt_n
}
# 限制在合理范围
for key in strategies:
strategies[key] = max(64, min(65536, strategies[key]))
# Round up to a power of two (a common convention, not a requirement)
strategies[key] = 2 ** int(np.ceil(np.log2(strategies[key])))
return strategies
strategies = recommend_nlist_strategy(num_vectors)
print("\nnlist选择策略:")
for strategy, value in strategies.items():
print(f" {strategy}: {value}")
---
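A rough cost model helps explain why the recommended nlist sits near sqrt(N). Assuming evenly sized clusters, a query computes about nlist distances in the coarse stage and nprobe * N / nlist in the fine stage. The function below is a back-of-the-envelope sketch under that assumption, not a measurement.

```python
def ivf_distance_count(num_vectors, nlist, nprobe):
    """Approximate distance computations per query, assuming evenly sized lists."""
    coarse = nlist                          # query vs. every centroid
    fine = nprobe * num_vectors / nlist     # query vs. vectors in probed lists
    return coarse + fine


n = 1_000_000
for nlist in (256, 1000, 4000, 16384):
    print(nlist, round(ivf_distance_count(n, nlist, nprobe=16)))
# 256   -> 62756
# 1000  -> 17000
# 4000  -> 8000
# 16384 -> 17361
```

For fixed nprobe the total is smallest at nlist = sqrt(nprobe * N), which is 4000 in this example. Note that holding nprobe fixed while raising nlist also lowers recall, because a smaller share of the data is probed, so the model describes cost only.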
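b.Tuning nprobe
a.Description
nprobe sets how many clusters are probed per query and is the balance point between recall and speed. nprobe = 1 is the fastest setting with the lowest recall. nprobe = nlist scans every cluster, which for IVF_FLAT gives 100% recall but is no faster than FLAT. A typical range is 8 to 64, adjusted to the recall target. nprobe should be much smaller than nlist, usually 1% to 10% of it. The best value can be determined by A/B testing or by benchmarking against a FLAT baseline. nprobe is a search-time parameter, so different queries can use different values: small for real-time queries, large for offline analysis.
b.Code example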
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 创建IVF索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 获取精确结果作为基准
collection.release()
collection.drop_index()
flat_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
flat_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_ids = set([hit.id for hit in flat_results[0]])
# 恢复IVF索引
collection.release()
collection.drop_index()
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试不同nprobe
print("nprobe调优分析:\n")
print(f"{'nprobe':>8s} {'查询时间':>12s} {'召回率':>10s} {'性价比':>10s}")
print("-" * 45)
nprobe_range = [1, 2, 4, 8, 16, 32, 64, 128, 256]
for nprobe in nprobe_range:
if nprobe > 1024: # 不超过nlist
continue
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
# 测量查询时间
times = []
for _ in range(10):
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
# 计算召回率
ivf_ids = set([hit.id for hit in results[0]])
recall = len(flat_ids & ivf_ids) / len(flat_ids)
# 性价比 = 召回率 / 查询时间
efficiency = recall / avg_time if avg_time > 0 else 0
print(f"{nprobe:8d} {avg_time:10.2f}ms {recall*100:9.2f}% {efficiency:10.4f}")
# 自动推荐nprobe
def recommend_nprobe(target_recall=0.95, max_latency_ms=10):
"""根据召回率和延迟要求推荐nprobe"""
recommendations = []
for nprobe in [1, 2, 4, 8, 16, 32, 64, 128]:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
# 测试
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
query_time = (time.time() - start) * 1000
ivf_ids = set([hit.id for hit in results[0]])
recall = len(flat_ids & ivf_ids) / len(flat_ids)
if recall >= target_recall and query_time <= max_latency_ms:
recommendations.append({
"nprobe": nprobe,
"recall": recall,
"latency": query_time
})
return recommendations
print("\n推荐配置(召回率≥95%, 延迟≤10ms):")
recs = recommend_nprobe(target_recall=0.95, max_latency_ms=10)
if recs:
best = min(recs, key=lambda x: x["nprobe"])
print(f" 推荐nprobe: {best['nprobe']}")
print(f" 召回率: {best['recall']*100:.2f}%")
print(f" 延迟: {best['latency']:.2f}ms")
else:
print(" 无满足条件的配置,建议放宽要求或增加nlist")
---
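The benchmark above measures recall on a single random query, which can be noisy. A small sketch of averaging recall over several queries and then choosing the smallest acceptable nprobe is shown below. The helper names are hypothetical, and the numbers in `measurements` are made-up placeholders, not benchmark results.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


def mean_recall(exact_results, approx_results):
    """Average recall over a batch of queries (one id list per query)."""
    pairs = list(zip(exact_results, approx_results))
    return sum(recall_at_k(e, a) for e, a in pairs) / len(pairs)


def pick_nprobe(measurements, target_recall, max_latency_ms):
    """Smallest nprobe meeting both constraints, or None if none qualifies."""
    ok = [m for m in measurements
          if m["recall"] >= target_recall and m["latency_ms"] <= max_latency_ms]
    return min(ok, key=lambda m: m["nprobe"]) if ok else None


print(mean_recall([[1, 2, 3, 4], [5, 6, 7, 8]],
                  [[1, 2, 3, 9], [5, 6, 7, 8]]))  # 0.875

# Placeholder numbers for illustration only
measurements = [
    {"nprobe": 4, "recall": 0.81, "latency_ms": 1.2},
    {"nprobe": 16, "recall": 0.96, "latency_ms": 3.5},
    {"nprobe": 64, "recall": 0.99, "latency_ms": 11.0},
]
choice = pick_nprobe(measurements, target_recall=0.95, max_latency_ms=10)
print(choice["nprobe"])  # 16
```

In practice the per-query id lists would come from FLAT and IVF searches over the same batch of query vectors.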
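5.4 HNSW index
01.How HNSW works
a.Layered graph structure
a.Description
HNSW (Hierarchical Navigable Small World) builds a multi-layer navigation graph in which each layer is a small-world graph. The bottom layer contains every vector, and the layers above it are progressively sparser. A query starts at the top layer and descends layer by layer, moving to the next layer once it has reached a local optimum in the current one. Nodes are connected by edges that link nearby vectors. The M parameter bounds the number of connections per node in each layer (implementations commonly allow up to 2*M on the bottom layer) and affects both graph connectivity and memory use. efConstruction sets the search width used during construction and affects graph quality. HNSW query performance is generally stable across datasets, although recall at a given ef still depends on the data.
b.Code example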
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
# 创建测试Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection("hnsw_demo", schema=schema)
# 插入数据
data_size = 100000
ids = list(range(data_size))
embeddings = [[np.random.random() for _ in range(128)] for _ in range(data_size)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# 测试不同M值
m_values = [4, 8, 16, 32, 64]
print("HNSW参数M的影响:\n")
print(f"{'M':>4s} {'构建时间':>12s} {'查询时间':>12s} {'内存估算':>12s}")
print("-" * 45)
for m in m_values:
# 创建HNSW索引
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": m,
"efConstruction": 200
}
}
# 构建时间
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
collection.load()
# 查询时间
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"ef": 128}
}
times = []
for _ in range(10):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
# 内存估算(每个节点约M*2条边)
memory_per_vector = 128 * 4 + m * 2 * 8 # 向量 + 边
total_memory_mb = data_size * memory_per_vector / 1024 / 1024
print(f"{m:4d} {build_time:10.2f}s {avg_time:10.2f}ms {total_memory_mb:10.2f}MB")
collection.release()
collection.drop_index()
print("\nM参数选择建议:")
print(" M=4-8: 低内存,适合大规模数据")
print(" M=16: 平衡选择(推荐)")
print(" M=32-64: 高精度,内存占用高")
---
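The "progressively sparser" upper layers come from how each node's top layer is drawn at insert time. The HNSW paper assigns a level as floor(-ln(U) * mL) with U uniform on (0, 1] and mL = 1 / ln(M). The sketch below simulates only that rule; whether a given engine uses exactly these constants is an assumption.

```python
import math
import random
from collections import Counter


def random_level(m, rng):
    """Draw a node's top layer as floor(-ln(U) * mL) with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


rng = random.Random(0)
levels = Counter(random_level(16, rng) for _ in range(100_000))
for level in sorted(levels):
    print(level, levels[level])
```

Under this rule the probability that a node reaches layer l or higher is M^(-l), so with M = 16 about 1 node in 16 appears on layer 1 and about 1 in 256 on layer 2. That geometric thinning is what gives the search its short top-down path.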
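b.Search process
a.Description
An HNSW search starts from the entry node on the top layer and greedily moves to the local optimum of that layer. It then drops to the next layer and continues from that node. On the bottom layer it runs a wider search that maintains a candidate set. The ef parameter sets the size of that set: a larger ef makes the search more thorough and slower. ef must be greater than or equal to limit (the number of results returned). A typical range is ef = 64 to 512, adjusted to the accuracy target. HNSW query time grows roughly logarithmically with data size.
b.Code example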
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("hnsw_demo")
# 创建HNSW索引
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16,
"efConstruction": 200
}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试不同ef值
query_vector = [[np.random.random() for _ in range(128)]]
# 获取FLAT基准
collection.release()
collection.drop_index()
flat_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
flat_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_ids = set([hit.id for hit in flat_results[0]])
# 恢复HNSW
collection.release()
collection.drop_index()
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试ef参数
print("HNSW ef参数影响:\n")
print(f"{'ef':>6s} {'查询时间':>12s} {'召回率':>10s}")
print("-" * 32)
ef_values = [100, 128, 256, 512]  # ef must be >= limit (100 in the searches below)
for ef in ef_values:
search_params = {
"metric_type": "L2",
"params": {"ef": ef}
}
times = []
for _ in range(10):
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
hnsw_ids = set([hit.id for hit in results[0]])
recall = len(flat_ids & hnsw_ids) / len(flat_ids)
print(f"{ef:6d} {avg_time:10.2f}ms {recall*100:9.2f}%")
# 搜索过程可视化(概念)
print("\nHNSW搜索过程:")
print(" 1. 从顶层入口节点开始")
print(" 2. 在当前层贪心搜索局部最优")
print(" 3. 进入下一层,以上层最优为起点")
print(" 4. 重复直到底层")
print(" 5. 在底层维护ef大小的候选集")
print(" 6. 返回Top-K结果")
# Suggest an ef value
def recommend_ef(target_recall=0.95):
    """Return the smallest tested ef that reaches target_recall, with its recall."""
    recall = 0.0
    # Candidates start at 100 because ef must be >= limit
    for ef in [100, 128, 256, 512]:
        search_params = {
            "metric_type": "L2",
            "params": {"ef": ef}
        }
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=100
        )
        hnsw_ids = set(hit.id for hit in results[0])
        recall = len(flat_ids & hnsw_ids) / len(flat_ids)
        if recall >= target_recall:
            return ef, recall
    # No candidate reached the target; report the largest one tried
    return 512, recall
recommended_ef, recall = recommend_ef(0.95)
print(f"\n推荐ef值(召回率≥95%): {recommended_ef}")
print(f"实际召回率: {recall*100:.2f}%")
---
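The role of ef on the bottom layer can be illustrated with a single-layer best-first search. The sketch below is a simplification: it builds a brute-force k-nearest-neighbor graph as a stand-in for the base layer (real HNSW construction is different) and omits the upper layers entirely. `build_knn_graph` and `search_layer` are illustrative names.

```python
import heapq
import random


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_knn_graph(vectors, m):
    """Brute-force stand-in for the base layer: link each node to its m nearest."""
    graph = {i: set() for i in range(len(vectors))}
    for i, vec in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        for j in heapq.nsmallest(m, others, key=lambda j: l2_squared(vec, vectors[j])):
            graph[i].add(j)
            graph[j].add(i)  # keep links symmetric
    return graph


def search_layer(vectors, graph, query, entry, ef):
    """Best-first search that keeps at most ef results."""
    d0 = l2_squared(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negated distance
    while candidates:
        dist, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result
        if len(results) >= ef and dist > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2_squared(query, vectors[nb])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    return sorted((-nd, idx) for nd, idx in results)


rng = random.Random(0)
vectors = [[rng.random() for _ in range(4)] for _ in range(300)]
graph = build_knn_graph(vectors, m=8)
query = [rng.random() for _ in range(4)]
exact = heapq.nsmallest(10, range(len(vectors)),
                        key=lambda i: l2_squared(query, vectors[i]))

for ef in (10, 50):
    found = search_layer(vectors, graph, query, entry=0, ef=ef)
    ids = [idx for _, idx in found[:10]]
    recall = len(set(ids) & set(exact)) / len(exact)
    print(f"ef={ef}: recall@10 = {recall:.2f}")
```

Because the result heap never holds more than ef entries, a search with ef below the requested limit cannot return enough neighbors, which is the reason for the ef >= limit rule.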
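02.Performance optimization
a.Build optimization
a.Description
Index construction is the main cost of HNSW. efConstruction controls build quality: larger values build more slowly and produce a better graph. Typical values are 100 to 500, with 200 a common choice. Construction can be parallelized across CPU cores. HNSW inserts vectors one at a time and has no separate training phase, so the graph can in principle grow incrementally; in practice, inserting all data first and building once is usually cheaper than indexing many small batches, and workloads with frequent deletes or updates are the harder case. Construction also needs a substantial amount of memory.
b.Code example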
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 测试不同efConstruction值
ef_construction_values = [100, 200, 400]
print("efConstruction参数影响:\n")
print(f"{'efConstruction':>16s} {'构建时间':>12s} {'查询时间':>12s} {'召回率':>10s}")
print("-" * 55)
for ef_const in ef_construction_values:
# 创建索引
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16,
"efConstruction": ef_const
}
}
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
collection.load()
# 测试查询性能
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"ef": 128}
}
times = []
for _ in range(10):
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
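# Recall is not measured in this loop; computing it requires a FLAT baseline
recall = float("nan")  # placeholder so the table does not show an invented number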
print(f"{ef_const:16d} {build_time:10.2f}s {avg_time:10.2f}ms {recall*100:9.2f}%")
collection.release()
collection.drop_index()
print("\nefConstruction选择建议:")
print(" 100-200: 快速构建,适合原型开发")
print(" 200-400: 平衡选择(推荐)")
print(" 400+: 高质量图,构建时间长")
# 批量构建策略
def batch_build_hnsw(data_batches):
"""批量构建HNSW索引"""
# 先插入所有数据
for batch in data_batches:
collection.insert(batch)
collection.flush()
# 一次性构建索引
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16,
"efConstruction": 200
}
}
print("开始批量构建HNSW索引...")
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
print(f"构建完成,耗时: {build_time:.2f}s")
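# Notes on adding data after the index is built
print("\nNotes on incremental updates:")
print(" - Bulk loading before indexing is usually cheaper than indexing many small batches")
print(" - In Milvus, newly inserted data is indexed per segment rather than by rebuilding everything")
print(" - Prefer batch inserts followed by a single create_index call")
print(" - For heavy update or delete traffic, consider an IVF index, which is cheaper to rebuild")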
---
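b.Query optimization
a.Description
Query speed is the main strength of HNSW. Query time grows roughly logarithmically with data size, so it scales well. Batching queries into one search call raises throughput by amortizing per-request overhead. ef is the key search-time parameter and can be adjusted per request to meet a latency budget, so different query scenarios can use different ef values. Searches are independent of one another and can run in parallel across CPU cores. Graph traversal accesses memory in an irregular pattern, so the index should fit in RAM. HNSW suits low-latency, high-throughput query workloads.
b.Code example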
---
from pymilvus import Collection
import numpy as np
import time
import concurrent.futures
collection = Collection("documents")
# 创建HNSW索引
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16,
"efConstruction": 200
}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 单次查询性能
def test_single_query():
"""测试单次查询性能"""
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"ef": 128}
}
times = []
for _ in range(100):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
p50 = np.percentile(times, 50) * 1000
p95 = np.percentile(times, 95) * 1000
p99 = np.percentile(times, 99) * 1000
print("单次查询性能:")
print(f" 平均: {avg_time:.2f}ms")
print(f" P50: {p50:.2f}ms")
print(f" P95: {p95:.2f}ms")
print(f" P99: {p99:.2f}ms")
test_single_query()
# 批量查询性能
def test_batch_query():
"""测试批量查询性能"""
batch_sizes = [1, 10, 50, 100]
print("\n批量查询性能:")
print(f"{'批量大小':>8s} {'总时间':>10s} {'平均每次':>12s} {'QPS':>10s}")
print("-" * 45)
for batch_size in batch_sizes:
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(batch_size)]
search_params = {
"metric_type": "L2",
"params": {"ef": 128}
}
start = time.time()
collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
total_time = time.time() - start
avg_time = total_time / batch_size * 1000
qps = batch_size / total_time
print(f"{batch_size:8d} {total_time*1000:9.2f}ms {avg_time:10.2f}ms {qps:9.2f}")
test_batch_query()
# 并发查询性能
def test_concurrent_query():
"""测试并发查询性能"""
def single_query():
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"ef": 128}
}
collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
print("\n并发查询性能:")
print(f"{'并发数':>8s} {'总时间':>10s} {'QPS':>10s}")
print("-" * 32)
for num_workers in [1, 2, 4, 8, 16]:
num_queries = 100
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = [executor.submit(single_query) for _ in range(num_queries)]
for future in concurrent.futures.as_completed(futures):
future.result()
total_time = time.time() - start
qps = num_queries / total_time
print(f"{num_workers:8d} {total_time:9.2f}s {qps:9.2f}")
test_concurrent_query()
# 动态ef调整
class AdaptiveHNSWSearch:
def __init__(self, collection):
self.collection = collection
self.ef_map = {
"fast": 64,
"balanced": 128,
"accurate": 256
}
def search(self, query_vector, mode="balanced", limit=10):
"""根据模式动态调整ef"""
ef = self.ef_map.get(mode, 128)
search_params = {
"metric_type": "L2",
"params": {"ef": ef}
}
return self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit
)
adaptive_search = AdaptiveHNSWSearch(collection)
# 不同模式的查询
query_vector = [np.random.random() for _ in range(128)]
print("\n自适应查询:")
for mode in ["fast", "balanced", "accurate"]:
start = time.time()
results = adaptive_search.search(query_vector, mode=mode)
elapsed = time.time() - start
print(f" {mode:10s}: {elapsed*1000:.2f}ms")
---
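03.Usage recommendations
a.Suitable scenarios
a.Description
HNSW suits workloads with strict query-latency requirements, such as real-time recommendation and online search. It works well for large datasets that are updated infrequently, and it is a strong choice when memory is plentiful. It is less suited to workloads with heavy update or delete traffic, since keeping the graph in good shape under churn is costly. It is primarily a CPU-oriented index, and GPU acceleration typically brings little benefit. It is often used with high-dimensional embeddings (512 dimensions and above). It is a common default for production deployments, subject to the memory budget.
b.Code example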
---
from pymilvus import Collection
# 场景1: 实时推荐系统
def realtime_recommendation():
"""实时推荐场景"""
collection = Collection("product_embeddings")
# HNSW配置(低延迟)
index_params = {
"index_type": "HNSW",
"metric_type": "IP", # 内积,适合推荐
"params": {
"M": 16,
"efConstruction": 200
}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 快速查询(ef=64)
user_vector = [[0.1] * 128]
search_params = {
"metric_type": "IP",
"params": {"ef": 64}
}
results = collection.search(
data=user_vector,
anns_field="embedding",
param=search_params,
limit=20,
output_fields=["id", "title"]
)
print("推荐商品:")
for hit in results[0]:
print(f" {hit.entity.get('title')}: {hit.distance:.4f}")
# 场景2: 图像搜索
def image_search():
"""图像搜索场景"""
collection = Collection("image_vectors")
# HNSW配置(高维向量)
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 32, # 高维向量用更大的M
"efConstruction": 400
}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 精确查询(ef=256)
query_image_vector = [[0.1] * 512] # 512维
search_params = {
"metric_type": "L2",
"params": {"ef": 256}
}
results = collection.search(
data=query_image_vector,
anns_field="embedding",
param=search_params,
limit=10
)
print("相似图像:")
for hit in results[0]:
print(f" ID: {hit.id}, 距离: {hit.distance:.4f}")
# 场景3: 文本语义搜索
def semantic_search():
"""文本语义搜索"""
collection = Collection("document_embeddings")
# HNSW配置(平衡)
index_params = {
"index_type": "HNSW",
"metric_type": "COSINE",
"params": {
"M": 16,
"efConstruction": 200
}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 语义查询
query_text_vector = [[0.1] * 768] # BERT embedding
search_params = {
"metric_type": "COSINE",
"params": {"ef": 128}
}
results = collection.search(
data=query_text_vector,
anns_field="embedding",
param=search_params,
limit=10,
output_fields=["title", "content"]
)
print("相关文档:")
for hit in results[0]:
print(f" {hit.entity.get('title')}: {hit.distance:.4f}")
realtime_recommendation()
image_search()
semantic_search()
---
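b.Comparison summary
a.Description
HNSW vs IVF: HNSW usually answers queries faster but uses more memory; IVF uses less memory (especially the quantized variants) but is usually slower at the same recall. HNSW builds slowly, IVF builds quickly. IVF is cheaper to rebuild, which makes it the easier choice for frequently changing data, while HNSW fits mostly static data. HNSW vs FLAT: HNSW is an approximate index, FLAT is exact. HNSW is far faster than FLAT on large datasets, with slightly lower recall. Rule of thumb: HNSW for low latency, IVF for low memory, FLAT for full recall.
b.Code example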
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 性能对比测试
def compare_indexes():
"""对比不同索引的性能"""
indexes = [
("FLAT", {"index_type": "FLAT", "metric_type": "L2", "params": {}}, {"metric_type": "L2"}),
("IVF_FLAT", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}}, {"metric_type": "L2", "params": {"nprobe": 16}}),
("HNSW", {"index_type": "HNSW", "metric_type": "L2", "params": {"M": 16, "efConstruction": 200}}, {"metric_type": "L2", "params": {"ef": 128}})
]
print("索引性能对比:\n")
print(f"{'索引类型':>12s} {'构建时间':>12s} {'查询时间':>12s} {'内存占用':>12s}")
print("-" * 52)
query_vector = [[np.random.random() for _ in range(128)]]
for index_name, index_params, search_params in indexes:
# 构建索引
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
collection.load()
# 查询性能
times = []
for _ in range(10):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
# 内存估算
num_vectors = collection.num_entities
dim = 128
if index_name == "FLAT":
memory_mb = num_vectors * dim * 4 / 1024 / 1024
elif index_name == "IVF_FLAT":
memory_mb = num_vectors * dim * 4 / 1024 / 1024
else: # HNSW
memory_mb = num_vectors * (dim * 4 + 16 * 2 * 8) / 1024 / 1024
print(f"{index_name:>12s} {build_time:10.2f}s {avg_time:10.2f}ms {memory_mb:10.2f}MB")
collection.release()
collection.drop_index()
print("\n选择建议:")
print(" FLAT: 数据量<10万,需要100%召回率")
print(" IVF_FLAT: 数据量10万-1000万,内存受限")
print(" HNSW: 数据量>10万,低延迟要求,内存充足")
compare_indexes()
# Decision tree
def recommend_index(num_vectors, memory_limit_gb, latency_requirement_ms, update_frequency):
    """Recommend an index type from data size, memory, latency and update frequency."""
    print("\nIndex recommendation inputs:")
    print(f"  vectors: {num_vectors:,}")
    print(f"  memory limit: {memory_limit_gb}GB")
    print(f"  latency requirement: {latency_requirement_ms}ms")
    print(f"  update frequency: {update_frequency}")
    if num_vectors < 100000:
        return "FLAT"
    dim = 128
    # Rough HNSW footprint: float32 vectors plus 2*M links of 8 bytes each (M=16)
    hnsw_memory_gb = num_vectors * (dim * 4 + 16 * 2 * 8) / 1024 / 1024 / 1024
    if hnsw_memory_gb <= memory_limit_gb and latency_requirement_ms < 10:
        if update_frequency == "low":
            return "HNSW"
        else:
            return "IVF_FLAT (frequent updates make HNSW graph maintenance costly)"
    else:
        return "IVF_FLAT"
recommendation = recommend_index(
    num_vectors=1000000,
    memory_limit_gb=4,
    latency_requirement_ms=5,
    update_frequency="low"
)
print(f"\nRecommended index: {recommendation}")
---
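The memory trade-off in the comparison above can be made concrete with a back-of-the-envelope estimate. The sketch below is pure Python and reuses the rough per-vector formulas from the example: 4 bytes per float32 dimension, plus an assumed 2*M neighbor links of 8 bytes each for HNSW. The function name `estimate_index_memory_mb`, the link size, and the centroid term for IVF_FLAT are illustrative assumptions, not Milvus internals or measurements.

```python
def estimate_index_memory_mb(num_vectors, dim, index_type, m=16, nlist=1024):
    """Rough memory estimate in MiB; an approximation, not a Milvus measurement."""
    raw_bytes = num_vectors * dim * 4  # float32 vectors
    if index_type == "FLAT":
        total = raw_bytes
    elif index_type == "IVF_FLAT":
        # Raw vectors plus nlist centroids of the same dimension
        total = raw_bytes + nlist * dim * 4
    elif index_type == "HNSW":
        # Raw vectors plus an assumed 2*M links of 8 bytes per vector
        total = raw_bytes + num_vectors * 2 * m * 8
    else:
        raise ValueError(f"unsupported index type: {index_type}")
    return total / 1024 / 1024


for name in ("FLAT", "IVF_FLAT", "HNSW"):
    size = estimate_index_memory_mb(1_000_000, 128, name)
    print(f"{name:>8s}: {size:.1f} MiB")
```

Under these assumptions, the HNSW overhead depends only on M and the vector count, so it weighs more heavily on low-dimensional data than on high-dimensional data.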
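5.5 Scalar indexes
01.Scalar index types
a.INVERTED index
a.Description
An inverted index maps each distinct field value to the IDs of the entities that contain it, which speeds up filtering on scalar fields. It applies to VARCHAR and numeric fields and supports equality filters, numeric range filters and string prefix matching. It pays off most on high-cardinality fields (many distinct values), such as user IDs or product IDs. On low-cardinality fields (such as gender or category) each value matches a large share of the data, so the gain is smaller. It can be combined with a vector index for hybrid queries. Compared with vector indexes, scalar indexes are small and quick to build.
b.Code example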
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
# 创建带标量字段的Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=50),
FieldSchema(name="price", dtype=DataType.FLOAT),
FieldSchema(name="timestamp", dtype=DataType.INT64),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection("scalar_index_demo", schema=schema)
# 插入测试数据
data_size = 100000
ids = list(range(data_size))
titles = [f"商品{i}" for i in range(data_size)]
categories = ["电子", "服装", "食品", "图书"] * (data_size // 4)
prices = [np.random.uniform(10, 1000) for _ in range(data_size)]
timestamps = [1700000000 + i for i in range(data_size)]
embeddings = [[np.random.random() for _ in range(128)] for _ in range(data_size)]
data = [ids, titles, categories, prices, timestamps, embeddings]
collection.insert(data)
collection.flush()
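# Create scalar indexes. The index type is named explicitly so that the example
# matches this section; INVERTED requires a Milvus version that supports it.
collection.create_index(
    field_name="category",
    index_params={"index_type": "INVERTED"},
    index_name="category_index"
)
collection.create_index(
    field_name="price",
    index_params={"index_type": "INVERTED"},
    index_name="price_index"
)
collection.create_index(
    field_name="timestamp",
    index_params={"index_type": "INVERTED"},
    index_name="timestamp_index"
)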
print("标量索引创建完成")
# 创建向量索引
vector_index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=vector_index_params)
collection.load()
# 测试标量过滤性能
expr = 'category == "电子" and price > 500'
start = time.time()
results = collection.query(
expr=expr,
output_fields=["id", "title", "category", "price"],
limit=100
)
elapsed = time.time() - start
print(f"\n标量查询: {len(results)} 条结果,耗时 {elapsed*1000:.2f}ms")
# 混合查询(向量+标量)
query_vector = [[np.random.random() for _ in range(128)]]
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
expr='category == "电子" and price > 500',
output_fields=["id", "title", "category", "price"]
)
elapsed = time.time() - start
print(f"混合查询: {len(results[0])} 条结果,耗时 {elapsed*1000:.2f}ms")
---
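b.AUTO_INDEX
a.Description
With AUTO_INDEX, Milvus chooses the scalar index type from the field's data type, so you do not have to name one. This simplifies index creation and suits cases where the best index type is not known in advance. It performs well for most scalar fields and is a reasonable default. The structure it picks is an implementation detail that varies by Milvus version (for example, a sorted array for numeric fields, or a trie or inverted index for VARCHAR), so do not depend on a particular one.
b.Code example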
---
from pymilvus import Collection
collection = Collection("documents")
# 使用AUTO_INDEX
collection.create_index(
field_name="category",
index_params={"index_type": "AUTO_INDEX"}
)
collection.create_index(
field_name="timestamp",
index_params={"index_type": "AUTO_INDEX"}
)
print("AUTO_INDEX创建完成")
collection.load()
# 测试查询
results = collection.query(
expr='category == "技术" and timestamp > 1700000000',
output_fields=["id", "title"],
limit=100
)
print(f"查询结果: {len(results)} 条")
# AUTO_INDEX建议
print("\nAUTO_INDEX使用建议:")
print(" 优点: 自动优化,无需调参")
print(" 缺点: 缺乏控制,可能不是最优")
print(" 适用: 快速开发,不确定最佳索引类型")
---
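02.Scalar filter optimization
a.Filter expressions
a.Description
Filter expressions support equality, range and logical operators. Filtering on indexed fields can be much faster than filtering on unindexed ones, so filter on indexed fields where possible. Complex expressions may not make full use of an index. Placing the most selective condition first is a common recommendation, although the effect depends on how the engine plans the expression. The filter is applied before the vector search, which reduces the number of distance computations. A smaller filtered set usually means a faster vector search, but this depends on the index type and on how selective the filter is, so confirm it by measurement.
b.Code example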
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
# 测试不同过滤条件的性能
test_cases = [
('category == "技术"', "单条件等值"),
('price > 100 and price < 500', "范围查询"),
('category == "技术" and price > 100', "组合条件"),
('category in ["技术", "新闻", "博客"]', "IN查询"),
('category == "技术" or category == "新闻"', "OR条件")
]
print("过滤表达式性能测试:\n")
for expr, desc in test_cases:
start = time.time()
results = collection.query(
expr=expr,
output_fields=["id"],
limit=1000
)
elapsed = time.time() - start
print(f"{desc:15s}: {len(results):5d} 条结果, {elapsed*1000:6.2f}ms")
# 混合查询优化
query_vector = [[np.random.random() for _ in range(128)]]
# 策略1: 宽松过滤(过滤后数据多)
expr_loose = 'category == "技术"'
start = time.time()
results_loose = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
expr=expr_loose
)
time_loose = time.time() - start
# 策略2: 严格过滤(过滤后数据少)
expr_strict = 'category == "技术" and price > 500 and timestamp > 1700000000'
start = time.time()
results_strict = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
expr=expr_strict
)
time_strict = time.time() - start
print(f"\n混合查询优化:")
print(f" 宽松过滤: {time_loose*1000:.2f}ms")
print(f" 严格过滤: {time_strict*1000:.2f}ms")
print(f" 建议: 过滤条件越严格,向量搜索越快")
# 表达式优化建议
print("\n表达式优化建议:")
print(" 1. 使用索引字段")
print(" 2. 高选择性条件在前")
print(" 3. 避免复杂嵌套")
print(" 4. 使用IN代替多个OR")
print(" 5. 范围查询使用索引")
---
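The advice above to prefer `in` over a chain of `or` conditions is easier to follow when expressions are built by a small helper rather than by hand. The sketch below is pure Python string building; `build_in_filter` and `and_all` are hypothetical helper names, and quoting via `json.dumps` is an assumption that holds for simple strings and numbers but has not been checked against every escaping rule of the Milvus expression grammar.

```python
import json


def build_in_filter(field, values):
    """Build a `field in [...]` filter expression from a list of values."""
    if not values:
        raise ValueError("values must not be empty")
    # json.dumps wraps strings in double quotes and leaves numbers bare
    literals = ", ".join(json.dumps(v, ensure_ascii=False) for v in values)
    return f"{field} in [{literals}]"


def and_all(*conditions):
    """Join conditions with `and`, parenthesizing each one."""
    return " and ".join(f"({c})" for c in conditions)


expr = and_all(
    build_in_filter("category", ["tech", "news", "blog"]),
    "price > 100",
)
print(expr)
```

The resulting string can be passed as the `expr` argument of `collection.query` or `collection.search` in the examples above.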
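b.Choosing which fields to index
a.Description
Not every scalar field needs an index. High-cardinality fields (many distinct values), such as IDs or email addresses, benefit most. Low-cardinality fields (few distinct values), such as gender or status, benefit little. Fields that appear in filters frequently are good candidates. Each index adds memory usage and insert overhead, so balance query performance against resource cost. Analyzing the actual query workload shows which fields are worth indexing.
b.Code example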
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
collection.load()  # query() requires a loaded collection
# Estimate field cardinality
def analyze_cardinality(collection, field_name):
    """Estimate a field's cardinality (number of distinct values) from a sample."""
    # A single query returns at most 16384 rows, so this is a sample,
    # not the full data set
    results = collection.query(
        expr="id >= 0",
        output_fields=[field_name],
        limit=16384
    )
    # Count distinct values in the sample
    unique_values = {r[field_name] for r in results}
    cardinality = len(unique_values)
    total_count = len(results)
    cardinality_ratio = cardinality / total_count if total_count > 0 else 0
    return {
        "field": field_name,
        "total": total_count,
        "unique": cardinality,
        "ratio": cardinality_ratio
    }
# 分析多个字段
fields_to_analyze = ["category", "timestamp", "id"]
print("字段基数分析:\n")
print(f"{'字段':>12s} {'总数':>8s} {'唯一值':>8s} {'基数比':>8s} {'建议':>12s}")
print("-" * 55)
for field in fields_to_analyze:
stats = analyze_cardinality(collection, field)
# 索引建议
if stats["ratio"] > 0.5:
recommendation = "建议索引"
elif stats["ratio"] > 0.1:
recommendation = "可选索引"
else:
recommendation = "不建议"
print(f"{stats['field']:>12s} {stats['total']:>8d} {stats['unique']:>8d} {stats['ratio']:>8.2%} {recommendation:>12s}")
# 索引决策树
def should_create_index(field_name, cardinality_ratio, query_frequency):
"""决定是否创建索引"""
if cardinality_ratio > 0.5 and query_frequency == "high":
return True, "高基数+高频查询"
elif cardinality_ratio > 0.1 and query_frequency == "high":
return True, "中基数+高频查询"
elif cardinality_ratio > 0.5 and query_frequency == "medium":
return True, "高基数+中频查询"
else:
return False, "不建议索引"
# 示例决策
decisions = [
("user_id", 0.9, "high"),
("category", 0.01, "high"),
("timestamp", 0.8, "medium"),
("status", 0.001, "low")
]
print("\n索引决策示例:")
for field, ratio, freq in decisions:
should_index, reason = should_create_index(field, ratio, freq)
print(f" {field:12s}: {'创建' if should_index else '跳过':4s} ({reason})")
# 索引成本分析
print("\n索引成本分析:")
print(" 内存成本: 每个索引约占原字段大小的10%-50%")
print(" 插入成本: 索引字段插入速度降低10%-30%")
print(" 查询收益: 索引查询速度提升10x-100x")
print(" 建议: 只为高频查询的高基数字段建索引")
---
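5.6 Index parameters
01.Parameter configuration
a.Build parameters
a.Description
Build parameters determine index quality and build time, and each index type has its own. For the IVF family, nlist sets the number of clusters. For HNSW, M sets the number of links per node and efConstruction sets the size of the candidate list used while building the graph. Build parameters cannot be changed on an existing index; changing them means dropping and rebuilding it. Choose them according to data size and performance targets, and confirm the choice with a small-scale test, because they affect both index size and query performance.
b.Code example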
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
import time
# 创建测试Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields=fields)
collection = Collection("index_params_test", schema=schema)
# 插入数据
data_size = 100000
ids = list(range(data_size))
embeddings = [[np.random.random() for _ in range(128)] for _ in range(data_size)]
data = [ids, embeddings]
collection.insert(data)
collection.flush()
# IVF_FLAT参数配置
ivf_configs = [
{"nlist": 512},
{"nlist": 1024},
{"nlist": 2048}
]
print("IVF_FLAT构建参数测试:\n")
print(f"{'nlist':>8s} {'构建时间':>12s} {'索引大小':>12s}")
print("-" * 36)
for params in ivf_configs:
    index_params = {
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": params
    }
    start = time.time()
    collection.create_index(field_name="embedding", index_params=index_params)
    build_time = time.time() - start
    # Rough index size: raw float32 vectors plus nlist centroids
    index_size_mb = (data_size + params["nlist"]) * 128 * 4 / 1024 / 1024
    print(f"{params['nlist']:8d} {build_time:10.2f}s {index_size_mb:10.2f}MB")
    collection.drop_index()
# HNSW参数配置
hnsw_configs = [
{"M": 8, "efConstruction": 100},
{"M": 16, "efConstruction": 200},
{"M": 32, "efConstruction": 400}
]
print("\nHNSW构建参数测试:\n")
print(f"{'M':>4s} {'efConstruction':>16s} {'构建时间':>12s}")
print("-" * 36)
for params in hnsw_configs:
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": params
}
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
print(f"{params['M']:4d} {params['efConstruction']:16d} {build_time:10.2f}s")
collection.drop_index()
# 参数推荐函数
def recommend_build_params(num_vectors, index_type):
"""推荐构建参数"""
if index_type == "IVF_FLAT":
sqrt_n = int(np.sqrt(num_vectors))
return {
"conservative": {"nlist": sqrt_n},
"balanced": {"nlist": 2 * sqrt_n},
"aggressive": {"nlist": 4 * sqrt_n}
}
elif index_type == "HNSW":
return {
"fast_build": {"M": 8, "efConstruction": 100},
"balanced": {"M": 16, "efConstruction": 200},
"high_quality": {"M": 32, "efConstruction": 400}
}
else:
return {}
print(f"\n推荐参数({data_size:,}个向量):")
for index_type in ["IVF_FLAT", "HNSW"]:
print(f"\n{index_type}:")
recs = recommend_build_params(data_size, index_type)
for strategy, params in recs.items():
print(f" {strategy:15s}: {params}")
---
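b.Search parameters
a.Description
Search parameters control the trade-off between query speed and recall. They are passed with each request, so they can be changed at run time without rebuilding the index. For IVF, nprobe is the number of clusters scanned. For HNSW, ef is the size of the candidate list explored. Larger values raise recall and increase latency. Pick values that fit the application, and use different values for different queries where that helps. A/B testing is a practical way to settle on them.
b.Code example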
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("index_params_test")
# 创建IVF索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 测试不同搜索参数
query_vector = [[np.random.random() for _ in range(128)]]
# 获取FLAT基准
collection.release()
collection.drop_index()
flat_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
flat_results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_ids = set([hit.id for hit in flat_results[0]])
# 恢复IVF索引
collection.release()
collection.drop_index()
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 搜索参数测试
print("IVF搜索参数测试:\n")
print(f"{'nprobe':>8s} {'查询时间':>12s} {'召回率':>10s} {'QPS':>10s}")
print("-" * 45)
nprobe_values = [1, 4, 8, 16, 32, 64]
for nprobe in nprobe_values:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
# 测量性能
times = []
for _ in range(10):
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=100
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
qps = 1000 / avg_time if avg_time > 0 else 0
# 计算召回率
ivf_ids = set([hit.id for hit in results[0]])
recall = len(flat_ids & ivf_ids) / len(flat_ids)
print(f"{nprobe:8d} {avg_time:10.2f}ms {recall*100:9.2f}% {qps:9.2f}")
# 动态参数调整
class DynamicSearchParams:
def __init__(self):
self.params_map = {
"fast": {"nprobe": 4},
"balanced": {"nprobe": 16},
"accurate": {"nprobe": 64}
}
def get_params(self, mode="balanced"):
"""根据模式获取搜索参数"""
return {
"metric_type": "L2",
"params": self.params_map.get(mode, self.params_map["balanced"])
}
def auto_adjust(self, latency_ms, target_latency_ms=10):
"""根据延迟自动调整参数"""
if latency_ms > target_latency_ms * 1.5:
return "fast"
elif latency_ms < target_latency_ms * 0.5:
return "accurate"
else:
return "balanced"
dynamic_params = DynamicSearchParams()
# 自适应查询
print("\n自适应搜索参数:")
for mode in ["fast", "balanced", "accurate"]:
params = dynamic_params.get_params(mode)
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=params,
limit=10
)
latency = (time.time() - start) * 1000
print(f" {mode:10s}: {latency:.2f}ms (nprobe={params['params']['nprobe']})")
---
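The recall figures in the example above are computed by comparing approximate results with exact FLAT results. A minimal standalone version of that calculation is sketched below, so it can be checked without a running Milvus instance. The function name `recall_at_k` and the sample IDs are illustrative.

```python
def recall_at_k(ground_truth_ids, candidate_ids, k=None):
    """Fraction of exact (ground-truth) IDs that the approximate search returned.

    If k is given, only the first k IDs of each list are compared.
    """
    truth = list(ground_truth_ids)
    candidates = list(candidate_ids)
    if k is not None:
        truth, candidates = truth[:k], candidates[:k]
    truth_set = set(truth)
    if not truth_set:
        return 0.0
    return len(truth_set & set(candidates)) / len(truth_set)


exact_ids = [11, 12, 13, 14]   # e.g. IDs returned by a FLAT index
approx_ids = [11, 12, 27, 31]  # e.g. IDs returned by IVF with a small nprobe
print(f"recall: {recall_at_k(exact_ids, approx_ids):.2f}")
```

Averaging this value over many query vectors gives a more stable estimate than the single random query used in the example.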
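02.Parameter tuning
a.Performance testing
a.Description
Tuning relies on performance tests. Cover different data sizes and query patterns, and track build time, query latency, recall and memory usage. Test on real data and real queries, since parameters tuned against a synthetic benchmark may overfit to it and not transfer. Grid search or Bayesian optimization can be used to explore the parameter space. The metrics trade off against one another, so there is no single best configuration. A repeatable tuning workflow with supporting tools is worth setting up.
b.Code example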
---
from pymilvus import Collection
import numpy as np
import time
from itertools import product
collection = Collection("documents")
# 网格搜索最优参数
def grid_search_ivf_params(collection, query_vectors, target_recall=0.95):
"""网格搜索IVF最优参数"""
# 参数网格
nlist_values = [512, 1024, 2048]
nprobe_values = [8, 16, 32, 64]
# 获取FLAT基准
collection.release()
collection.drop_index()
flat_params = {"index_type": "FLAT", "metric_type": "L2", "params": {}}
collection.create_index(field_name="embedding", index_params=flat_params)
collection.load()
flat_results_list = []
for qv in query_vectors:
results = collection.search(
data=[qv],
anns_field="embedding",
param={"metric_type": "L2"},
limit=100
)
flat_results_list.append(set([hit.id for hit in results[0]]))
# 测试所有参数组合
best_config = None
best_score = float('inf')
results_table = []
for nlist in nlist_values:
# 构建索引
collection.release()
collection.drop_index()
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": nlist}
}
start = time.time()
collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
collection.load()
for nprobe in nprobe_values:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
# 测试查询
total_time = 0
total_recall = 0
for i, qv in enumerate(query_vectors):
start = time.time()
results = collection.search(
data=[qv],
anns_field="embedding",
param=search_params,
limit=100
)
total_time += time.time() - start
ivf_ids = set([hit.id for hit in results[0]])
recall = len(flat_results_list[i] & ivf_ids) / len(flat_results_list[i])
total_recall += recall
avg_time = total_time / len(query_vectors) * 1000
avg_recall = total_recall / len(query_vectors)
# 评分:满足召回率要求的最快配置
if avg_recall >= target_recall:
score = avg_time
if score < best_score:
best_score = score
best_config = {
"nlist": nlist,
"nprobe": nprobe,
"build_time": build_time,
"query_time": avg_time,
"recall": avg_recall
}
results_table.append({
"nlist": nlist,
"nprobe": nprobe,
"build_time": build_time,
"query_time": avg_time,
"recall": avg_recall
})
# 打印结果
print("参数网格搜索结果:\n")
print(f"{'nlist':>8s} {'nprobe':>8s} {'构建时间':>12s} {'查询时间':>12s} {'召回率':>10s}")
print("-" * 55)
for r in results_table:
print(f"{r['nlist']:8d} {r['nprobe']:8d} {r['build_time']:10.2f}s {r['query_time']:10.2f}ms {r['recall']*100:9.2f}%")
if best_config:
print(f"\n最优配置(召回率≥{target_recall*100:.0f}%):")
print(f" nlist: {best_config['nlist']}")
print(f" nprobe: {best_config['nprobe']}")
print(f" 查询时间: {best_config['query_time']:.2f}ms")
print(f" 召回率: {best_config['recall']*100:.2f}%")
return best_config
# 生成测试查询
test_queries = [[np.random.random() for _ in range(128)] for _ in range(10)]
# 执行网格搜索
best_config = grid_search_ivf_params(collection, test_queries, target_recall=0.95)
---
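b.Tuning strategy
a.Description
Tuning should follow a systematic order. First fix the performance targets (latency, recall, throughput). Then choose the index type. Next, determine the build parameters by testing. Finally, adjust the search parameters until the targets are met. Test under realistic load, including concurrent queries. Keep monitoring production performance and adjust over time. Managing parameter configurations in one place makes this easier.
b.Code example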
---
from pymilvus import Collection
import numpy as np
import time
# 参数调优流程
class IndexTuner:
def __init__(self, collection):
self.collection = collection
self.test_queries = [[np.random.random() for _ in range(128)] for _ in range(20)]
def step1_select_index_type(self, num_vectors, memory_limit_gb, latency_requirement_ms):
"""步骤1: 选择索引类型"""
print("步骤1: 选择索引类型\n")
if num_vectors < 100000:
recommendation = "FLAT"
reason = "数据量小,使用精确索引"
else:
dim = 128
hnsw_memory = num_vectors * (dim * 4 + 16 * 2 * 8) / 1024 / 1024 / 1024
if hnsw_memory <= memory_limit_gb and latency_requirement_ms < 10:
recommendation = "HNSW"
reason = "低延迟要求,内存充足"
else:
recommendation = "IVF_FLAT"
reason = "平衡性能和内存"
print(f"推荐索引: {recommendation}")
print(f"原因: {reason}\n")
return recommendation
def step2_tune_build_params(self, index_type):
"""步骤2: 调优构建参数"""
print("步骤2: 调优构建参数\n")
num_vectors = self.collection.num_entities
if index_type == "IVF_FLAT":
sqrt_n = int(np.sqrt(num_vectors))
candidates = [sqrt_n, 2*sqrt_n, 4*sqrt_n]
print(f"测试nlist值: {candidates}")
best_nlist = 2 * sqrt_n # 简化,实际应测试
build_params = {"nlist": best_nlist}
elif index_type == "HNSW":
candidates = [
{"M": 8, "efConstruction": 100},
{"M": 16, "efConstruction": 200},
{"M": 32, "efConstruction": 400}
]
print(f"测试M和efConstruction组合")
build_params = {"M": 16, "efConstruction": 200} # 简化
else:
build_params = {}
print(f"选择构建参数: {build_params}\n")
return build_params
def step3_tune_search_params(self, index_type, target_recall=0.95, target_latency_ms=10):
"""步骤3: 调优搜索参数"""
print("步骤3: 调优搜索参数\n")
print(f"目标召回率: {target_recall*100:.0f}%")
print(f"目标延迟: {target_latency_ms}ms\n")
if index_type == "IVF_FLAT":
# 二分查找最优nprobe
left, right = 1, 128
best_nprobe = 16
print(f"搜索最优nprobe...")
search_params = {"nprobe": best_nprobe}
elif index_type == "HNSW":
# 测试不同ef值
best_ef = 128
print(f"搜索最优ef...")
search_params = {"ef": best_ef}
else:
search_params = {}
print(f"选择搜索参数: {search_params}\n")
return search_params
def step4_validate(self, index_type, build_params, search_params):
"""步骤4: 验证配置"""
print("步骤4: 验证配置\n")
# 创建索引
index_params = {
"index_type": index_type,
"metric_type": "L2",
"params": build_params
}
start = time.time()
self.collection.create_index(field_name="embedding", index_params=index_params)
build_time = time.time() - start
self.collection.load()
# 测试查询
full_search_params = {
"metric_type": "L2",
"params": search_params
}
times = []
for qv in self.test_queries:
start = time.time()
self.collection.search(
data=[qv],
anns_field="embedding",
param=full_search_params,
limit=10
)
times.append(time.time() - start)
avg_time = np.mean(times) * 1000
p95_time = np.percentile(times, 95) * 1000
print(f"构建时间: {build_time:.2f}s")
print(f"平均查询时间: {avg_time:.2f}ms")
print(f"P95查询时间: {p95_time:.2f}ms")
return {
"build_time": build_time,
"avg_latency": avg_time,
"p95_latency": p95_time
}
def tune(self, num_vectors, memory_limit_gb, latency_requirement_ms, target_recall=0.95):
"""完整调优流程"""
print("=" * 60)
print("索引参数调优流程")
print("=" * 60 + "\n")
# 步骤1: 选择索引类型
index_type = self.step1_select_index_type(num_vectors, memory_limit_gb, latency_requirement_ms)
# 步骤2: 调优构建参数
build_params = self.step2_tune_build_params(index_type)
# 步骤3: 调优搜索参数
search_params = self.step3_tune_search_params(index_type, target_recall, latency_requirement_ms)
# 步骤4: 验证配置
metrics = self.step4_validate(index_type, build_params, search_params)
print("\n" + "=" * 60)
print("调优完成")
print("=" * 60)
return {
"index_type": index_type,
"build_params": build_params,
"search_params": search_params,
"metrics": metrics
}
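# Use the tuner. `collection` was never defined in this example; "documents" is
# the collection name used in the neighboring examples.
collection = Collection("documents")
tuner = IndexTuner(collection)
optimal_config = tuner.tune(
    num_vectors=100000,
    memory_limit_gb=4,
    latency_requirement_ms=10,
    target_recall=0.95
)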
print(f"\n最优配置:")
print(f" 索引类型: {optimal_config['index_type']}")
print(f" 构建参数: {optimal_config['build_params']}")
print(f" 搜索参数: {optimal_config['search_params']}")
---
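The validation step above reports mean and P95 latency. When the samples are collected outside NumPy, for example from application logs, the same summary can be produced with the standard library alone. The sketch below uses the nearest-rank percentile definition, which is an assumption and can differ slightly from the interpolated value that `np.percentile` returns. The helper names and the sample values are illustrative.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize_latencies(samples_ms):
    """Summarize query latencies given in milliseconds."""
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }


samples = [4.0, 5.0, 5.5, 6.0, 30.0]  # made-up latencies in ms
print(summarize_latencies(samples))
```

A single slow query dominates both the mean and the tail here, which is why the tuning targets are better stated against a percentile than against the average.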
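6 Search and query
6.1 Similarity search
01.Basic search
a.Vector search
a.Description
Vector search is the core Milvus operation: it computes the similarity between the query vector and the stored vectors and returns the Top-K results. Supported metrics include L2 (Euclidean distance), IP (inner product) and COSINE (cosine similarity). A search specifies anns_field (the vector field name), limit (the number of results) and the search parameters. output_fields selects scalar fields to return with each hit. Results are ordered from most to least similar: for L2 a smaller value means more similar, while for IP and COSINE a larger value means more similar. Several query vectors can be submitted in one batch call.
b.Code example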
---
from pymilvus import Collection, connections
import numpy as np
# 连接Milvus
connections.connect(host="localhost", port="19530")
# 获取Collection
collection = Collection("documents")
collection.load()
# 单个向量搜索
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
output_fields=["id", "title", "content"]
)
print("搜索结果:")
for hit in results[0]:
print(f" ID: {hit.id}")
print(f" 标题: {hit.entity.get('title')}")
print(f" 距离: {hit.distance:.4f}")
print()
# 批量向量搜索
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(5)]
results = collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
print(f"批量搜索: {len(results)} 个查询")
for i, hits in enumerate(results):
print(f"\n查询 {i+1}:")
for hit in hits[:3]: # 只显示前3个结果
print(f" ID: {hit.id}, 距离: {hit.distance:.4f}")
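# Comparing metrics: a search must use the metric the index was built with,
# so a mismatched metric_type is expected to raise an error and is skipped here.
from pymilvus import MilvusException
metrics = ["L2", "IP", "COSINE"]
print("\nMetric comparison:")
for metric in metrics:
    search_params = {
        "metric_type": metric,
        "params": {"nprobe": 16}
    }
    try:
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=5
        )
    except MilvusException as e:
        print(f"\n{metric}: skipped ({e})")
        continue
    print(f"\n{metric}:")
    for hit in results[0]:
        print(f"  ID: {hit.id}, distance: {hit.distance:.4f}")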
---
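b.Distance metrics
a.Description
Milvus supports several distance metrics for different scenarios. L2 (Euclidean distance) suits general vector search; smaller is more similar. IP (inner product) suits recommendation; larger is more similar. COSINE (cosine similarity) suits text semantic search; on normalized vectors it ranks results the same way as IP. JACCARD and HAMMING apply to binary vectors. Choosing a metric that matches how the embeddings were trained improves search quality. The metric is fixed when the index is created, and searches must use the same one.
b.Code example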
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# L2距离(欧氏距离)
def l2_search(query_vector):
"""L2距离搜索,值越小越相似"""
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
results = collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=10
)
print("L2距离搜索:")
for hit in results[0]:
print(f" ID: {hit.id}, L2距离: {hit.distance:.4f}")
return results
# IP距离(内积)
def ip_search(query_vector):
"""内积搜索,值越大越相似"""
search_params = {
"metric_type": "IP",
"params": {"nprobe": 16}
}
results = collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=10
)
print("\nIP内积搜索:")
for hit in results[0]:
print(f" ID: {hit.id}, 内积: {hit.distance:.4f}")
return results
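# COSINE (cosine similarity)
def cosine_search(query_vector):
    """Cosine similarity search; larger values mean more similar."""
    # Normalize the query vector. Cosine similarity is scale-invariant, so this
    # is optional; it is shown for illustration. Convert to an array first:
    # dividing a plain Python list by a float raises TypeError.
    vec = np.asarray(query_vector, dtype=np.float32)
    normalized_vector = (vec / np.linalg.norm(vec)).tolist()
    search_params = {
        "metric_type": "COSINE",
        "params": {"nprobe": 16}
    }
    results = collection.search(
        data=[normalized_vector],
        anns_field="embedding",
        param=search_params,
        limit=10
    )
    print("\nCOSINE search:")
    for hit in results[0]:
        print(f"  ID: {hit.id}, cosine similarity: {hit.distance:.4f}")
    return results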
# 测试不同距离度量
query_vector = [np.random.random() for _ in range(128)]
l2_results = l2_search(query_vector)
ip_results = ip_search(query_vector)
cosine_results = cosine_search(query_vector)
# 距离度量选择建议
print("\n距离度量选择建议:")
print(" L2: 通用向量搜索,适合图像、音频等")
print(" IP: 推荐系统,用户-物品匹配")
print(" COSINE: 文本语义搜索,归一化向量")
print(" JACCARD: 集合相似度,标签匹配")
print(" HAMMING: 二值向量,哈希检索")
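# Distance conversion
def convert_distance(distance, from_metric, to_metric):
    """Convert a distance value between metrics (valid for unit-length vectors only)."""
    if from_metric == "L2" and to_metric == "COSINE":
        # For unit vectors, squared L2 = 2 - 2 * cosine, so cosine = 1 - d / 2.
        # This assumes `distance` is the squared L2 distance, which per the Milvus
        # metric documentation is what the L2 metric reports; square it first otherwise.
        return 1 - distance / 2
    elif from_metric == "IP" and to_metric == "COSINE":
        # For unit vectors, the inner product equals the cosine similarity
        return distance
    else:
        # No conversion defined; return the value unchanged
        return distance
print("\nDistance conversion example:")
print(f"  L2 distance 0.5 ≈ cosine similarity {convert_distance(0.5, 'L2', 'COSINE'):.4f}")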
---
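The statement that COSINE and IP agree on normalized vectors, and the L2-to-cosine conversion used above, both follow from the identity ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors. The sketch below checks this numerically in pure Python. The helper names are illustrative, and the sample vectors are the two from the introductory example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = normalize([0.1, 0.2, 0.3])
b = normalize([0.2, 0.3, 0.4])

cos_ab = cosine(a, b)
ip_ab = dot(a, b)                     # equals cosine for unit vectors
from_l2 = 1 - squared_l2(a, b) / 2    # cosine recovered from squared L2

print(math.isclose(cos_ab, ip_ab), math.isclose(cos_ab, from_l2))
```

The conversion only holds after normalization. On raw embeddings the three metrics can rank results differently.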
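02.Search parameters
a.The limit parameter
a.Description
The limit parameter sets the number of results returned, the K in Top-K. It must be greater than 0; values between 1 and 1000 are typical. Query time grows with limit, though not linearly. For pagination, combine limit with offset. limit does not change recall, only how many results come back. Fewer than limit results may be returned when there are not enough matches. Set limit according to what the application needs.
b.Code example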
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 测试不同limit值的性能
limit_values = [1, 10, 50, 100, 500, 1000]
print("limit参数性能测试:\n")
print(f"{'limit':>8s} {'查询时间':>12s} {'结果数':>8s}")
print("-" * 32)
for limit in limit_values:
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=limit
)
elapsed = time.time() - start
actual_count = len(results[0])
print(f"{limit:8d} {elapsed*1000:10.2f}ms {actual_count:8d}")
# Paginated search
def paginated_search(query_vector, page_size=10, page_num=1):
    """Return one page of search results (page_num starts at 1)."""
    offset = (page_num - 1) * page_size
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=page_size,
        offset=offset,
        output_fields=["id", "title"]
    )
    return results[0]
# Fetch the first three pages
query_vector = [np.random.random() for _ in range(128)]
print("\nPagination example:")
for page in range(1, 4):
    results = paginated_search(query_vector, page_size=10, page_num=page)
    print(f"\nPage {page}:")
    for hit in results:
        print(f"  ID: {hit.id}, distance: {hit.distance:.4f}")
# limit选择建议
print("\nlimit选择建议:")
print(" 实时推荐: limit=10-20")
print(" 搜索结果: limit=20-50")
print(" 批量处理: limit=100-1000")
print(" 注意: limit过大会影响性能和内存")
---
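b.The offset parameter
a.Description
The offset parameter skips the first N results and is used for pagination. It starts at 0, where offset=0 skips nothing. offset + limit must not exceed 16384 (a Milvus limit). Larger offsets cost more, so deep pagination with a large offset is not recommended; use a cursor or a timestamp filter instead. offset is applied after ranking and does not affect which candidates are recalled.
b.Code example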
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 测试offset性能
offset_values = [0, 10, 50, 100, 500, 1000]
print("offset参数性能测试:\n")
print(f"{'offset':>8s} {'查询时间':>12s}")
print("-" * 24)
for offset in offset_values:
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
offset=offset
)
elapsed = time.time() - start
print(f"{offset:8d} {elapsed*1000:10.2f}ms")
# 分页实现
class Paginator:
def __init__(self, collection, query_vector, page_size=10):
self.collection = collection
self.query_vector = query_vector
self.page_size = page_size
self.search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
def get_page(self, page_num):
"""获取指定页"""
if page_num < 1:
raise ValueError("page_num must be >= 1")
offset = (page_num - 1) * self.page_size
# 检查offset限制
if offset + self.page_size > 16384:
raise ValueError("offset + limit exceeds 16384")
results = self.collection.search(
data=[self.query_vector],
anns_field="embedding",
param=self.search_params,
limit=self.page_size,
offset=offset,
output_fields=["id", "title"]
)
return results[0]
def iterate_pages(self, max_pages=10):
"""迭代多页"""
for page_num in range(1, max_pages + 1):
try:
results = self.get_page(page_num)
if len(results) == 0:
break
yield page_num, results
except ValueError as e:
print(f"停止迭代: {e}")
break
# 使用分页器
query_vector = [np.random.random() for _ in range(128)]
paginator = Paginator(collection, query_vector, page_size=10)
print("\n分页迭代示例:")
for page_num, results in paginator.iterate_pages(max_pages=3):
print(f"\n第{page_num}页: {len(results)}条结果")
for hit in results[:3]: # 只显示前3条
print(f" ID: {hit.id}, 距离: {hit.distance:.4f}")
# 深度分页替代方案
print("\n深度分页替代方案:")
print(" 1. 使用游标(基于上次结果的最后ID)")
print(" 2. 使用时间戳范围过滤")
print(" 3. 限制最大页数(如只允许前100页)")
print(" 4. 使用Elasticsearch等专门的分页工具")
---
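6.2 Range queries
01.Range search
a.Distance range
a.Description
Range search returns vectors whose distance to the query falls within a given range, rather than a fixed Top-K. For L2, radius is the maximum distance and the optional range_filter is the minimum distance, so together they define a distance interval. For metrics where larger means more similar, such as IP and COSINE, the roles are reversed: radius is the lower bound and range_filter the upper bound. Range search suits cases that need every sufficiently similar result, such as finding all similar products. The number of results is not fixed: it can be zero or large, but never more than limit. Choose radius carefully to avoid returning too many results, because range search cost grows with the number of results returned.
b.Code example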
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
# 基本范围搜索
search_params = {
"metric_type": "L2",
"params": {
"nprobe": 16,
"radius": 0.5 # 最大距离
}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=1000, # 最大返回数量
output_fields=["id", "title"]
)
print(f"范围搜索结果: {len(results[0])} 条")
for hit in results[0][:10]: # 只显示前10条
print(f" ID: {hit.id}, 距离: {hit.distance:.4f}")
# 距离区间搜索
search_params_range = {
"metric_type": "L2",
"params": {
"nprobe": 16,
"radius": 1.0, # 最大距离
"range_filter": 0.3 # 最小距离
}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params_range,
limit=1000,
output_fields=["id", "title"]
)
print(f"\n距离区间 [0.3, 1.0] 搜索结果: {len(results[0])} 条")
# 不同距离范围对比
radius_values = [0.3, 0.5, 1.0, 2.0]
print("\n不同距离范围对比:")
print(f"{'radius':>8s} {'结果数':>8s}")
print("-" * 20)
for radius in radius_values:
search_params = {
"metric_type": "L2",
"params": {
"nprobe": 16,
"radius": radius
}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10000
)
print(f"{radius:8.1f} {len(results[0]):8d}")
# 范围搜索应用场景
def find_similar_products(product_vector, max_distance=0.5):
"""查找所有相似商品"""
search_params = {
"metric_type": "L2",
"params": {
"nprobe": 16,
"radius": max_distance
}
}
results = collection.search(
data=[product_vector],
anns_field="embedding",
param=search_params,
limit=1000,
output_fields=["id", "title", "price"]
)
return results[0]
product_vector = [np.random.random() for _ in range(128)]
similar_products = find_similar_products(product_vector, max_distance=0.5)
print(f"\n相似商品查找: {len(similar_products)} 个商品")
---
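The radius and range_filter semantics described above can be checked against a brute-force reference on a handful of vectors. The sketch below is pure Python and covers the L2 case only. It assumes squared L2 distances and a half-open interval, `range_filter <= distance < radius`. The boundary handling is an assumption about how the interval is closed, and `range_search_l2` is a hypothetical helper, not a Milvus API.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def range_search_l2(query, vectors, radius, range_filter=0.0, limit=None):
    """Brute-force reference: keep IDs with range_filter <= distance < radius."""
    hits = []
    for vec_id, vec in vectors.items():
        distance = squared_l2(query, vec)
        if range_filter <= distance < radius:
            hits.append((vec_id, distance))
    hits.sort(key=lambda hit: hit[1])  # closest first
    return hits if limit is None else hits[:limit]


# Four 2-D points at squared distances 0.0, 0.25, 1.0 and 4.0 from the origin
vectors = {1: [0.0, 0.0], 2: [0.5, 0.0], 3: [1.0, 0.0], 4: [2.0, 0.0]}
query = [0.0, 0.0]

print(range_search_l2(query, vectors, radius=1.0))
print(range_search_l2(query, vectors, radius=1.5, range_filter=0.2))
print(range_search_l2(query, vectors, radius=1.5, range_filter=0.2, limit=1))
```

Comparing such a reference with the results of an approximate index on a small sample is one way to sanity-check a chosen radius before running it against the full collection.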
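b.Range filtering
a.Description
Range filtering combines range conditions on scalar fields with vector range search, so a query can constrain both the distance and the scalar values. The scalar condition is given through the expr parameter and supports numeric ranges, date ranges and similar conditions. The scalar filter runs first and the vector range search runs on what remains, which improves performance. This suits compound queries such as finding similar products within a price band. Create an index on the fields used in range conditions.
b.Code example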
---
from pymilvus import Collection
import numpy as np
collection = Collection("products")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
# 价格范围 + 向量范围
search_params = {
"metric_type": "L2",
"params": {
"nprobe": 16,
"radius": 0.8
}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=1000,
expr='price >= 100 and price <= 500',
output_fields=["id", "title", "price"]
)
print(f"价格范围 [100, 500] + 向量范围: {len(results[0])} 条结果")
for hit in results[0][:5]:
print(f" {hit.entity.get('title')}: ¥{hit.entity.get('price'):.2f}, 距离: {hit.distance:.4f}")
# 时间范围 + 向量范围
import time
current_time = int(time.time())
one_week_ago = current_time - 7 * 24 * 3600
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=1000,
expr=f'timestamp >= {one_week_ago} and timestamp <= {current_time}',
output_fields=["id", "title", "timestamp"]
)
print(f"\n最近7天 + 向量范围: {len(results[0])} 条结果")
# 多条件范围过滤
complex_expr = '''
category == "电子产品" and
price >= 100 and price <= 1000 and
rating >= 4.0 and
stock > 0
'''
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=1000,
expr=complex_expr,
output_fields=["id", "title", "price", "rating"]
)
print(f"\n多条件范围过滤: {len(results[0])} 条结果")
# Range query with bounded cost
def optimized_range_search(query_vector, price_min, price_max, max_distance):
    """Range search with a scalar pre-filter, a bounded radius, and a capped limit."""
    # Strategy 1: narrow the candidate set with a strict scalar filter
    expr = f'price >= {price_min} and price <= {price_max}'
    # Strategy 2: keep radius reasonable so the result set stays small
    search_params = {
        "metric_type": "L2",
        "params": {
            "nprobe": 16,
            "radius": max_distance
        }
    }
    # Strategy 3: cap the number of returned hits
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=500,  # upper bound on returned hits
        expr=expr,
        output_fields=["id", "title", "price"]
    )
    return results[0]
results = optimized_range_search(
query_vector=[np.random.random() for _ in range(128)],
price_min=200,
price_max=800,
max_distance=0.6
)
print(f"\n优化范围查询: {len(results)} 条结果")
---
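The "scalar filter first, then range search" semantics described above can be sketched without a Milvus server. The following is a brute-force NumPy reference on synthetic data, not how Milvus implements it: the function name `filtered_range_search` and the data are hypothetical, and it uses plain Euclidean distance, whereas Milvus may report L2 values on a different scale (for example squared), so thresholds are not directly transferable.

```python
import numpy as np

def filtered_range_search(vectors, prices, query, price_min, price_max, radius, limit):
    """Brute-force reference: scalar filter first, then keep hits within radius."""
    # Step 1: the scalar predicate selects candidate row indices
    candidates = np.flatnonzero((prices >= price_min) & (prices <= price_max))
    # Step 2: Euclidean distance from the query to each candidate
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    # Step 3: keep candidates strictly inside the radius, nearest first, capped at limit
    inside = dists < radius
    order = np.argsort(dists[inside])[:limit]
    return candidates[inside][order], dists[inside][order]

# Synthetic data (assumption): 1000 random 8-dimensional vectors with random prices
rng = np.random.default_rng(0)
vectors = rng.random((1000, 8))
prices = rng.uniform(0, 1000, size=1000)
query = rng.random(8)

ids, dists = filtered_range_search(vectors, prices, query, 100, 500, radius=0.8, limit=50)
print(f"{len(ids)} hits, nearest distances: {dists[:3]}")
```

Every returned row satisfies both the price range and the distance bound, and `limit` caps the result count in the same way the `limit` argument does in the searches above.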
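02.Range query optimization
a.Performance tuning
a.Description
The cost of a range query is closely tied to the number of results it returns. Choose radius so that the result set stays bounded, use scalar filters to shrink the candidate set, and consider scalar indexes on the filtered fields to speed up filtering. For large result sets, page through the results or consume them in batches. Measure query latency while tuning these parameters. A range query typically costs more than a Top-K query with a small limit because it may return many more hits, so completeness has to be weighed against latency.
b.Code example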
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
# 性能对比: Top-K vs 范围查询
print("性能对比: Top-K vs 范围查询\n")
# Top-K查询
search_params_topk = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
start = time.time()
results_topk = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params_topk,
limit=100
)
time_topk = time.time() - start
print(f"Top-K查询 (limit=100):")
print(f" 查询时间: {time_topk*1000:.2f}ms")
print(f" 结果数: {len(results_topk[0])}")
# 范围查询
search_params_range = {
"metric_type": "L2",
"params": {
"nprobe": 16,
"radius": 1.0
}
}
start = time.time()
results_range = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params_range,
limit=10000
)
time_range = time.time() - start
print(f"\n范围查询 (radius=1.0):")
print(f" 查询时间: {time_range*1000:.2f}ms")
print(f" 结果数: {len(results_range[0])}")
print(f" 性能比: {time_range/time_topk:.2f}x")
# 优化策略1: 使用标量过滤
start = time.time()
results_filtered = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params_range,
limit=10000,
expr='id % 10 == 0' # 过滤90%数据
)
time_filtered = time.time() - start
print(f"\n范围查询 + 标量过滤:")
print(f" 查询时间: {time_filtered*1000:.2f}ms")
print(f" 结果数: {len(results_filtered[0])}")
print(f" 加速比: {time_range/time_filtered:.2f}x")
# Optimization 2: tune radius
radius_values = [0.3, 0.5, 0.8, 1.0, 1.5]
print("\nLatency for different radius values:")
print(f"{'radius':>8s} {'latency':>12s} {'hits':>8s}")
print("-" * 32)
for radius in radius_values:
    search_params = {
        "metric_type": "L2",
        "params": {
            "nprobe": 16,
            "radius": radius
        }
    }
    start = time.time()
    results = collection.search(
        data=query_vector,
        anns_field="embedding",
        param=search_params,
        limit=10000
    )
    elapsed = time.time() - start
    print(f"{radius:8.1f} {elapsed*1000:10.2f}ms {len(results[0]):8d}")
# Optimization 3: fetch results in batches
def batch_range_search(query_vector, radius, batch_size=1000):
    """Page through range query results with limit and offset."""
    search_params = {
        "metric_type": "L2",
        "params": {
            "nprobe": 16,
            "radius": radius
        }
    }
    offset = 0
    all_results = []
    while True:
        results = collection.search(
            data=[query_vector],
            anns_field="embedding",
            param=search_params,
            limit=batch_size,
            offset=offset
        )
        if len(results[0]) == 0:
            break
        all_results.extend(results[0])
        if len(results[0]) < batch_size:  # a short page means no further results
            break
        offset += batch_size
        if offset >= 10000:  # hard cap on the total number of hits fetched
            break
    return all_results
print("\n分批处理范围查询:")
query_vec = [np.random.random() for _ in range(128)]
batch_results = batch_range_search(query_vec, radius=0.8, batch_size=500)
print(f" 总结果数: {len(batch_results)}")
---
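b.Usage recommendations
a.Description
Range queries suit workloads that need every sufficiently similar result, such as offline analysis, deduplication, or clustering. They are a poor fit for latency-critical real-time queries, where Top-K with a small limit is usually the better choice. Calibrate radius on a small dataset first, and monitor how many results come back to avoid overloading the client or the server. Combining a range query with scalar filtering generally works better than a range query alone. In every case, recall has to be traded against performance.
b.Code example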
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# Scenario 1: find all similar documents
def find_all_similar_docs(query_vector, similarity_threshold=0.7):
    """Find all similar documents (suited to offline analysis).

    With the L2 metric, similarity_threshold is a distance bound: smaller is more similar.
    """
    search_params = {
        "metric_type": "L2",
        "params": {
            "nprobe": 32,  # a higher nprobe improves recall
            "radius": similarity_threshold
        }
    }
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=5000,
        output_fields=["id", "title"]
    )
    print(f"Found {len(results[0])} similar documents")
    return results[0]
# Scenario 2: duplicate detection
def detect_duplicates(query_vector, duplicate_threshold=0.1):
    """Detect near-duplicate documents (very small distance)."""
    search_params = {
        "metric_type": "L2",
        "params": {
            "nprobe": 16,
            "radius": duplicate_threshold
        }
    }
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=100
    )
    # The radius already bounds the distance; this check is a client-side safeguard
    duplicates = [hit for hit in results[0] if hit.distance < duplicate_threshold]
    print(f"Detected {len(duplicates)} possible duplicates")
    return duplicates
# Scenario 3: cluster analysis
def cluster_analysis(center_vector, cluster_radius=0.5):
    """Collect the members of a cluster around a center point."""
    search_params = {
        "metric_type": "L2",
        "params": {
            "nprobe": 16,
            "radius": cluster_radius
        }
    }
    results = collection.search(
        data=[center_vector],
        anns_field="embedding",
        param=search_params,
        limit=10000
    )
    cluster_members = results[0]
    # Cluster statistics
    distances = [hit.distance for hit in cluster_members]
    avg_distance = sum(distances) / len(distances) if distances else 0
    print(f"Cluster members: {len(cluster_members)}")
    print(f"Average distance: {avg_distance:.4f}")
    return cluster_members
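# Decision table: Top-K vs range query (the radius values are illustrative starting points)
def choose_search_method(scenario):
    """Suggest a search method for a scenario."""
    recommendations = {
        "real-time recommendation": "Top-K (limit=10-20)",
        "search results": "Top-K (limit=20-50)",
        "similar content": "range query (radius=0.5-0.8)",
        "duplicate detection": "range query (radius=0.1-0.3)",
        "cluster analysis": "range query (radius=0.5-1.0)",
        "bulk processing": "range query + batching"
    }
    return recommendations.get(scenario, "Top-K (default)")
print("\nSuggested search method per scenario:")
scenarios = ["real-time recommendation", "search results", "similar content",
             "duplicate detection", "cluster analysis", "bulk processing"]
for scenario in scenarios:
    method = choose_search_method(scenario)
    print(f"  {scenario:26s}: {method}")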
# 使用示例
query_vector = [np.random.random() for _ in range(128)]
print("\n实际应用示例:")
similar_docs = find_all_similar_docs(query_vector, similarity_threshold=0.7)
duplicates = detect_duplicates(query_vector, duplicate_threshold=0.1)
cluster = cluster_analysis(query_vector, cluster_radius=0.5)
---
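The advice to calibrate radius on a small dataset can be made concrete by reading the radius off a quantile of observed distances. The sketch below is one possible approach under stated assumptions: `suggest_radius` is a hypothetical helper, the data is synthetic, and distances are plain Euclidean, so the resulting number must be re-derived for the metric and scale the actual collection uses.

```python
import numpy as np

def suggest_radius(sample_vectors, queries, quantile=0.01):
    """Return the distance below which roughly `quantile` of (query, sample) pairs fall."""
    # Pairwise Euclidean distances between every query and every sample vector
    diffs = queries[:, None, :] - sample_vectors[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return float(np.quantile(dists, quantile))

# Synthetic sample (assumption): 2000 stored vectors and 20 representative queries
rng = np.random.default_rng(1)
sample = rng.random((2000, 16))
queries = rng.random((20, 16))

radius = suggest_radius(sample, queries, quantile=0.01)
first_query_dists = np.linalg.norm(sample - queries[0], axis=1)
print(f"radius={radius:.3f}, hits for first query={(first_query_dists < radius).sum()}")
```

Picking the quantile fixes the expected share of the collection that a query returns, which is often easier to reason about than a raw distance value.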
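6.3 Hybrid retrieval
01.Vector + scalar hybrid
a.Basic hybrid query
a.Description
Hybrid retrieval combines vector similarity search with scalar field filtering to make queries more precise. The scalar condition is passed through the expr parameter; Milvus evaluates it first and restricts the vector search to the entities that pass. A selective filter can cut the number of vectors compared, but evaluating the filter has its own cost, so the net effect on latency should be measured rather than assumed. Expressions support equality, range, and logical conditions. This suits queries that must satisfy both semantic similarity and business constraints. Scalar indexes on the filtered fields are optional but usually help.
b.Code example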
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("products")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 纯向量搜索(基准)
start = time.time()
results_vector_only = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
time_vector_only = time.time() - start
print(f"纯向量搜索:")
print(f" 查询时间: {time_vector_only*1000:.2f}ms")
print(f" 结果数: {len(results_vector_only[0])}")
# 混合查询: 向量 + 类别过滤
start = time.time()
results_hybrid = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr='category == "电子产品"',
output_fields=["id", "title", "category", "price"]
)
time_hybrid = time.time() - start
print(f"\n混合查询(向量 + 类别):")
print(f" 查询时间: {time_hybrid*1000:.2f}ms")
print(f" 结果数: {len(results_hybrid[0])}")
for hit in results_hybrid[0][:5]:
    print(f"  {hit.entity.get('title')}: {hit.entity.get('category')}, ¥{hit.entity.get('price'):.2f}")
# 混合查询: 向量 + 价格范围
results_price = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr='price >= 100 and price <= 500',
output_fields=["id", "title", "price"]
)
print(f"\n混合查询(向量 + 价格范围):")
print(f" 结果数: {len(results_price[0])}")
for hit in results_price[0][:5]:
    print(f"  {hit.entity.get('title')}: ¥{hit.entity.get('price'):.2f}, distance: {hit.distance:.4f}")
# 混合查询: 向量 + 多条件
complex_expr = '''
category == "电子产品" and
price >= 100 and price <= 1000 and
rating >= 4.0 and
stock > 0
'''
results_complex = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=complex_expr,
output_fields=["id", "title", "category", "price", "rating", "stock"]
)
print(f"\n混合查询(向量 + 多条件):")
print(f" 结果数: {len(results_complex[0])}")
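# Latency comparison (single measurements are noisy; repeat them for reliable numbers)
print(f"\nLatency comparison:")
print(f"  Vector only: {time_vector_only*1000:.2f}ms")
print(f"  Hybrid query: {time_hybrid*1000:.2f}ms")
print(f"  Ratio: {time_hybrid/time_vector_only:.2f}x")
print(f"  Note: a selective filter can reduce vector work, but filtering adds cost, so hybrid queries may be faster or slower")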
---
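Why it matters that the scalar filter runs before the vector search can be shown with a brute-force comparison of pre-filtering and post-filtering. This is an illustrative NumPy sketch on synthetic data (the names `top_k` and `in_category` are hypothetical), not a description of Milvus internals.

```python
import numpy as np

def top_k(dists, ids, k):
    """Return the ids of the k smallest distances, nearest first."""
    order = np.argsort(dists)[:k]
    return ids[order]

rng = np.random.default_rng(2)
n, k = 5000, 10
vectors = rng.random((n, 8))
in_category = rng.random(n) < 0.05  # about 5% of rows pass the scalar filter
query = rng.random(8)
dists = np.linalg.norm(vectors - query, axis=1)
all_ids = np.arange(n)

# Pre-filter: restrict to matching rows, then take the k nearest among them
pre = top_k(dists[in_category], all_ids[in_category], k)

# Post-filter: take the k nearest overall, then drop rows that fail the filter
nearest = top_k(dists, all_ids, k)
post = nearest[in_category[nearest]]

print(f"pre-filter returned {len(pre)} hits, post-filter returned {len(post)} hits")
```

With a selective filter, post-filtering usually returns far fewer than k hits, whereas pre-filtering returns k hits as long as enough rows match.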
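b.Filter strategy
a.Description
The filter strategy affects both the performance and the results of a hybrid query. A highly selective filter (one that removes most of the data) can reduce the vector work substantially, while a filter that removes little adds evaluation cost for little gain. The order of conditions inside an AND expression generally matters little; the code example compares both orders, and measuring is the only reliable check. Complex expressions may not make full use of scalar indexes, so simple AND combinations are preferable. A very strict filter can leave fewer candidates than limit, or none at all, so filter strictness has to be balanced against the number of results needed.
b.Code example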
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("products")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# Filters with different selectivity (the "keeps" percentages are assumptions about the data)
filters = [
    ('id >= 0', "no effective filter (keeps ~100%)"),
    ('price > 500', "low selectivity (keeps ~50%)"),
    ('category == "电子产品"', "medium selectivity (keeps ~25%)"),
    ('category == "电子产品" and price > 500', "high selectivity (keeps ~10%)"),
    ('category == "电子产品" and price > 800 and rating >= 4.5', "very high selectivity (keeps ~2%)")
]
print("Latency for filters of different selectivity:\n")
print(f"{'filter':>50s} {'latency':>12s} {'hits':>8s}")
print("-" * 75)
for expr, desc in filters:
    start = time.time()
    results = collection.search(
        data=query_vector,
        anns_field="embedding",
        param=search_params,
        limit=10,
        expr=expr,
        output_fields=["id"]
    )
    elapsed = time.time() - start
    print(f"{desc:>50s} {elapsed*1000:10.2f}ms {len(results[0]):8d}")
# 过滤顺序优化
print("\n过滤顺序优化:")
# 策略1: 低选择性在前
expr1 = 'category == "电子产品" and price > 800'
start = time.time()
results1 = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=expr1
)
time1 = time.time() - start
print(f" 低选择性在前: {time1*1000:.2f}ms")
# 策略2: 高选择性在前
expr2 = 'price > 800 and category == "电子产品"'
start = time.time()
results2 = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=expr2
)
time2 = time.time() - start
print(f" 高选择性在前: {time2*1000:.2f}ms")
print("  Note: condition order usually has little effect; confirm with repeated measurements")
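# Filter strategy decision helper
def recommend_filter_strategy(data_size, filter_selectivity):
    """Suggest a strategy; filter_selectivity is the fraction of rows that pass the filter."""
    if filter_selectivity < 0.1:
        return "very high selectivity: consider a scalar query first"
    elif filter_selectivity < 0.3:
        return "high selectivity: hybrid query works well"
    elif filter_selectivity < 0.7:
        return "medium selectivity: hybrid query helps somewhat"
    else:
        return "low selectivity: consider pure vector search"
print("\nFilter strategy suggestions:")
selectivities = [0.05, 0.2, 0.5, 0.8]
for sel in selectivities:
    strategy = recommend_filter_strategy(1000000, sel)
    print(f"  pass rate {sel*100:4.1f}%: {strategy}")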
---
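The pass rate that the strategy helper takes as input has to come from somewhere. One simple option is to estimate it from a random sample of rows before deciding how to query. The sketch below assumes the rows are available client-side as dictionaries; `estimate_pass_rate` and the generated rows are hypothetical stand-ins.

```python
import random

def estimate_pass_rate(rows, predicate, sample_size=1000, seed=0):
    """Estimate the fraction of rows that satisfy `predicate` from a random sample."""
    if not rows:
        return 0.0
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    return sum(1 for row in sample if predicate(row)) / len(sample)

# Hypothetical rows mirroring the fields used in the filters above
gen = random.Random(42)
rows = [
    {"category": gen.choice(["电子产品", "图书", "服装", "食品"]),
     "price": gen.uniform(0, 1000)}
    for _ in range(20000)
]

rate_category = estimate_pass_rate(rows, lambda r: r["category"] == "电子产品")
rate_both = estimate_pass_rate(
    rows, lambda r: r["category"] == "电子产品" and r["price"] > 500
)
print(f"category only: {rate_category:.1%}, category and price: {rate_both:.1%}")
```

Adding a condition with AND can only lower the pass rate, which is why the combined filter in the example comes out more selective than either condition alone.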
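02.Multi-vector hybrid
a.Multi-field search
a.Description
Multi-vector search queries more than one vector field within a single Collection (multiple vector fields per collection require Milvus 2.4 or later). Each vector field can have its own index and search parameters, which suits multimodal scenarios such as combined text and image search. Fields can be weighted differently, and the per-field results have to be merged; the merge strategy determines the final ranking. Recent pymilvus versions also provide Collection.hybrid_search, which takes several AnnSearchRequest objects plus a reranker such as RRFRanker or WeightedRanker and merges on the server. The code example here searches each field separately and merges on the client.
b.Code example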
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
# 创建多向量字段Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="text_embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
FieldSchema(name="image_embedding", dtype=DataType.FLOAT_VECTOR, dim=512)
]
schema = CollectionSchema(fields=fields, description="多模态搜索")
collection = Collection("multimodal_search", schema=schema)
# 插入数据
data_size = 10000
ids = list(range(data_size))
titles = [f"文档{i}" for i in range(data_size)]
text_embeddings = [[np.random.random() for _ in range(768)] for _ in range(data_size)]
image_embeddings = [[np.random.random() for _ in range(512)] for _ in range(data_size)]
data = [ids, titles, text_embeddings, image_embeddings]
collection.insert(data)
collection.flush()
# 为每个向量字段创建索引
text_index_params = {
"index_type": "IVF_FLAT",
"metric_type": "COSINE",
"params": {"nlist": 128}
}
image_index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128}
}
collection.create_index(field_name="text_embedding", index_params=text_index_params)
collection.create_index(field_name="image_embedding", index_params=image_index_params)
collection.load()
# 文本向量搜索
text_query = [[np.random.random() for _ in range(768)]]
text_results = collection.search(
data=text_query,
anns_field="text_embedding",
param={"metric_type": "COSINE", "params": {"nprobe": 16}},
limit=10,
output_fields=["id", "title"]
)
print("Text vector search results:")
for hit in text_results[0][:5]:
    print(f"  {hit.entity.get('title')}: {hit.distance:.4f}")
# 图像向量搜索
image_query = [[np.random.random() for _ in range(512)]]
image_results = collection.search(
data=image_query,
anns_field="image_embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
output_fields=["id", "title"]
)
print("\nImage vector search results:")
for hit in image_results[0][:5]:
    print(f"  {hit.entity.get('title')}: {hit.distance:.4f}")
# Multi-vector fusion search
def multimodal_search(text_vector, image_vector, text_weight=0.6, image_weight=0.4):
    """Search both vector fields separately and fuse the scores on the client."""
    # Search each field separately
    text_results = collection.search(
        data=[text_vector],
        anns_field="text_embedding",
        param={"metric_type": "COSINE", "params": {"nprobe": 16}},
        limit=50,
        output_fields=["id", "title"]
    )
    image_results = collection.search(
        data=[image_vector],
        anns_field="image_embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=50,
        output_fields=["id", "title"]
    )
    # Map both result sets to scores where larger means more similar
    text_scores = {}
    for hit in text_results[0]:
        # With the COSINE metric, hit.distance is already a similarity (larger is better)
        text_scores[hit.id] = hit.distance
    image_scores = {}
    # Fall back to 1.0 when there are no hits or every distance is 0 (avoids division by zero)
    max_image_dist = max((hit.distance for hit in image_results[0]), default=0.0) or 1.0
    for hit in image_results[0]:
        # L2 is a distance (smaller is better): scale by the maximum and invert
        image_scores[hit.id] = 1 - (hit.distance / max_image_dist)
    # Fuse the scores; a document missing from one result set contributes 0 for that field
    all_ids = set(text_scores.keys()) | set(image_scores.keys())
    fused_scores = {}
    for doc_id in all_ids:
        text_score = text_scores.get(doc_id, 0)
        image_score = image_scores.get(doc_id, 0)
        fused_scores[doc_id] = text_weight * text_score + image_weight * image_score
    # Sort by fused score, best first
    sorted_results = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:10]
# 执行多模态搜索
text_vec = [np.random.random() for _ in range(768)]
image_vec = [np.random.random() for _ in range(512)]
fused_results = multimodal_search(text_vec, image_vec, text_weight=0.6, image_weight=0.4)
print("\nMultimodal fusion search results:")
for doc_id, score in fused_results:
    print(f"  document ID: {doc_id}, fused score: {score:.4f}")
---
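The normalization step in client-side fusion is easy to get wrong when one field uses a similarity and the other a distance. The following stand-alone sketch works on plain dictionaries of raw metric values; `normalize_scores` is a hypothetical helper, and min-max scaling is only one of several reasonable choices.

```python
def normalize_scores(raw, higher_is_better):
    """Min-max scale raw metric values to [0, 1], where 1 is the best match."""
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All values are equal, so every document is treated as equally good
        return {doc_id: 1.0 for doc_id in raw}
    scaled = {}
    for doc_id, value in raw.items():
        unit = (value - lo) / (hi - lo)
        scaled[doc_id] = unit if higher_is_better else 1.0 - unit
    return scaled

# Made-up raw values for illustration
text_raw = {1: 0.92, 2: 0.85, 3: 0.40}    # cosine similarity: larger is better
image_raw = {2: 0.30, 3: 0.10, 4: 0.90}   # L2 distance: smaller is better

text_scores = normalize_scores(text_raw, higher_is_better=True)
image_scores = normalize_scores(image_raw, higher_is_better=False)

fused = {
    doc_id: 0.6 * text_scores.get(doc_id, 0.0) + 0.4 * image_scores.get(doc_id, 0.0)
    for doc_id in text_scores.keys() | image_scores.keys()
}
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Document 2 ranks first because it scores well in both fields, even though it is the top hit in neither.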
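b.Result fusion
a.Description
Multi-vector search has to merge results from different vector fields. Common fusion strategies include weighted averaging, RRF (Reciprocal Rank Fusion), and taking the maximum score. Weights control the relative importance of each modality. Scores from different metrics must be brought to a common scale and direction before they are combined: with COSINE a larger value means more similar, whereas with L2 a smaller value does. Rank-based methods such as RRF use only the position of each result and so avoid the scale problem. The strategy and the weights should be chosen per use case and validated experimentally.
b.Code example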
---
from pymilvus import Collection
import numpy as np
collection = Collection("multimodal_search")
collection.load()
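# Helper: map one result set to scores in [0, 1] where larger is better
def normalized_scores(results, higher_is_better):
    """Min-max normalize hit.distance; invert for distance metrics such as L2."""
    hits = list(results[0])
    if not hits:
        return {}
    values = [hit.distance for hit in hits]
    lo, hi = min(values), max(values)
    if hi == lo:
        return {hit.id: 1.0 for hit in hits}
    scores = {}
    for hit in hits:
        unit = (hit.distance - lo) / (hi - lo)
        scores[hit.id] = unit if higher_is_better else 1.0 - unit
    return scores
# Fusion strategy 1: weighted average
def weighted_average_fusion(results_list, weights, higher_is_better=(True, False)):
    """Weighted sum of normalized scores; the default flags match [COSINE, L2]."""
    all_scores = {}
    for results, weight, hib in zip(results_list, weights, higher_is_better):
        for doc_id, score in normalized_scores(results, hib).items():
            all_scores[doc_id] = all_scores.get(doc_id, 0.0) + weight * score
    sorted_results = sorted(all_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:10]
# Fusion strategy 2: RRF (Reciprocal Rank Fusion)
def rrf_fusion(results_list, k=60):
    """RRF uses only rank positions, so it is insensitive to the score scale."""
    rrf_scores = {}
    for results in results_list:
        for rank, hit in enumerate(results[0]):
            if hit.id not in rrf_scores:
                rrf_scores[hit.id] = 0
            rrf_scores[hit.id] += 1 / (k + rank + 1)
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:10]
# Fusion strategy 3: maximum
def max_fusion(results_list, higher_is_better=(True, False)):
    """Keep each document's best normalized score across the result sets."""
    max_scores = {}
    for results, hib in zip(results_list, higher_is_better):
        for doc_id, score in normalized_scores(results, hib).items():
            max_scores[doc_id] = max(max_scores.get(doc_id, 0.0), score)
    sorted_results = sorted(max_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:10]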
# 测试不同融合策略
text_query = [[np.random.random() for _ in range(768)]]
image_query = [[np.random.random() for _ in range(512)]]
text_results = collection.search(
data=text_query,
anns_field="text_embedding",
param={"metric_type": "COSINE", "params": {"nprobe": 16}},
limit=50
)
image_results = collection.search(
data=image_query,
anns_field="image_embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=50
)
results_list = [text_results, image_results]
print("Comparison of fusion strategies:\n")
# Weighted average
wa_results = weighted_average_fusion(results_list, weights=[0.6, 0.4])
print("Weighted average fusion (0.6:0.4):")
for doc_id, score in wa_results[:5]:
    print(f"  document ID: {doc_id}, score: {score:.4f}")
# RRF
rrf_results = rrf_fusion(results_list, k=60)
print("\nRRF fusion:")
for doc_id, score in rrf_results[:5]:
    print(f"  document ID: {doc_id}, RRF score: {score:.4f}")
# Maximum
max_results = max_fusion(results_list)
print("\nMaximum fusion:")
for doc_id, score in max_results[:5]:
    print(f"  document ID: {doc_id}, max score: {score:.4f}")
# Adaptive weights
class AdaptiveFusion:
    def __init__(self):
        self.history = []
    def fuse(self, results_list, initial_weights=(0.5, 0.5)):
        """Fuse with weights derived from each result set's distance spread."""
        # Estimate a quality value per modality
        qualities = []
        for results in results_list:
            if len(results[0]) > 0:
                # Heuristic only: a tighter distance spread is treated as higher quality.
                # Spreads from different metrics (COSINE vs L2) are not on the same scale.
                distances = [hit.distance for hit in results[0]]
                quality = 1 / (np.std(distances) + 0.01)
            else:
                quality = 0
            qualities.append(quality)
        # Normalize the qualities into weights
        total_quality = sum(qualities)
        if total_quality > 0:
            adaptive_weights = [q / total_quality for q in qualities]
        else:
            adaptive_weights = list(initial_weights)
        print(f"Adaptive weights: {adaptive_weights}")
        # Weighted fusion
        return weighted_average_fusion(results_list, adaptive_weights)
adaptive_fusion = AdaptiveFusion()
adaptive_results = adaptive_fusion.fuse(results_list)
print("\nAdaptive weight fusion:")
for doc_id, score in adaptive_results[:5]:
    print(f"  document ID: {doc_id}, score: {score:.4f}")
---
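Because RRF depends only on rank positions, it can be worked through by hand on two short ranked lists. The sketch below uses plain lists of document IDs instead of Milvus hits; the function name `rrf` and the sample rankings are hypothetical, and k=60 is the commonly used default.

```python
def rrf(rankings, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion; ranks start at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

text_ranking = ["a", "b", "c", "d"]   # best text match first
image_ranking = ["c", "a", "e"]       # best image match first

fused = rrf([text_ranking, image_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Document "a" gets 1/61 + 1/62 and document "c" gets 1/63 + 1/61, so "a" ranks just ahead of "c", and both beat every document that appears in only one list.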
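6.4 Scalar filtering
01.Filter expressions
a.Expression syntax
a.Description
Milvus supports a rich filter expression syntax: comparison operators (==, !=, >, >=, <, <=), logical operators (and, or, not), and membership operators (in, not in). Expressions can reference integer, floating-point, string, and boolean fields, and parentheses change precedence. String comparison is case-sensitive. Arithmetic operators are supported, as are some function-style operators and pattern matching with like; which like patterns are accepted (prefix, suffix, infix) depends on the Milvus version, which is why the example wraps them in try/except. Expressions are parsed before execution and use scalar indexes where possible. Complex expressions may affect performance.
b.Code example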
---
from pymilvus import Collection
import numpy as np
collection = Collection("products")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 比较运算符
expressions = [
('price == 99.99', "等于"),
('price != 99.99', "不等于"),
('price > 100', "大于"),
('price >= 100', "大于等于"),
('price < 500', "小于"),
('price <= 500', "小于等于")
]
print("Comparison operators:\n")
for expr, desc in expressions:
    results = collection.search(
        data=query_vector,
        anns_field="embedding",
        param=search_params,
        limit=5,
        expr=expr,
        output_fields=["id", "title", "price"]
    )
    print(f"{desc:10s} ({expr:20s}): {len(results[0])} hits")
# 逻辑运算符
logical_expressions = [
('price > 100 and price < 500', "AND运算"),
('category == "电子" or category == "图书"', "OR运算"),
('not (price > 1000)', "NOT运算"),
('(price > 100 and price < 500) or category == "特价"', "组合运算")
]
print("\nLogical operators:\n")
for expr, desc in logical_expressions:
    results = collection.search(
        data=query_vector,
        anns_field="embedding",
        param=search_params,
        limit=5,
        expr=expr
    )
    print(f"{desc:10s}: {len(results[0])} hits")
# 成员运算符
member_expressions = [
('category in ["电子", "图书", "服装"]', "IN运算"),
('category not in ["食品", "玩具"]', "NOT IN运算"),
('id in [1, 2, 3, 4, 5]', "ID列表")
]
print("\nMembership operators:\n")
for expr, desc in member_expressions:
    results = collection.search(
        data=query_vector,
        anns_field="embedding",
        param=search_params,
        limit=5,
        expr=expr,
        output_fields=["id", "category"]
    )
    print(f"{desc:15s}: {len(results[0])} hits")
# 字符串匹配
string_expressions = [
('title like "手机%"', "前缀匹配"),
('title like "%Pro"', "后缀匹配"),
('title like "%iPhone%"', "包含匹配")
]
print("\nString matching:\n")
for expr, desc in string_expressions:
    try:
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=5,
            expr=expr,
            output_fields=["id", "title"]
        )
        print(f"{desc:10s}: {len(results[0])} hits")
    except Exception as e:
        # Some Milvus versions reject suffix or infix patterns
        print(f"{desc:10s}: unsupported or failed - {str(e)}")
# 复杂表达式
complex_expr = '''
(category == "电子" and price >= 1000 and price <= 5000) or
(category == "图书" and price >= 50 and rating >= 4.5) or
(category == "服装" and discount > 0.5)
'''
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=complex_expr,
output_fields=["id", "title", "category", "price"]
)
print(f"\n复杂表达式: {len(results[0])} 条结果")
---
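Filter strings like the ones above are often assembled from Python values, and quoting mistakes are a common source of parse errors. The helper below is a minimal sketch: `in_expr` is hypothetical, and it assumes the expression parser accepts double-quoted string literals with backslash escapes, as in the examples above.

```python
import json

def in_expr(field, values):
    """Build a `field in [...]` filter string from Python ints, floats, or strings."""
    if not values:
        raise ValueError("values must not be empty")
    rendered = []
    for value in values:
        if isinstance(value, bool):
            # bool is a subclass of int; reject it explicitly to avoid surprises
            raise TypeError("bool values are not supported by this helper")
        if isinstance(value, (int, float)):
            rendered.append(repr(value))
        elif isinstance(value, str):
            # json.dumps adds double quotes and escapes embedded quotes and backslashes
            rendered.append(json.dumps(value, ensure_ascii=False))
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
    return f"{field} in [{', '.join(rendered)}]"

print(in_expr("category", ["电子", "图书", "服装"]))
print(in_expr("id", [1, 2, 3]))
```

Building the list in one place also makes it easy to replace a long chain of or conditions with a single in.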
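b.Expression optimization
a.Description
Tuning filter expressions can improve query performance. Filter on indexed fields where possible. Prefer a single in over a long chain of or conditions on the same field. A negated comparison can often be rewritten in positive form (price <= 1000 instead of not (price > 1000)), which is easier to read; the code example measures whether it changes latency. Placing the most selective condition first is a common habit, though its effect depends on how the engine evaluates the expression. Avoid function calls inside expressions where a plain comparison will do, and keep nesting shallow. Without a SQL-style execution plan to inspect, compare expression variants by timing them directly, and keep monitoring filter latency so regressions are caught early.
b.Code example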
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("products")
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 优化前: 使用多个OR
expr_before = '''
category == "电子" or
category == "图书" or
category == "服装" or
category == "食品"
'''
start = time.time()
results_before = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=expr_before
)
time_before = time.time() - start
print("优化前(多个OR):")
print(f" 查询时间: {time_before*1000:.2f}ms")
print(f" 结果数: {len(results_before[0])}")
# 优化后: 使用IN
expr_after = 'category in ["电子", "图书", "服装", "食品"]'
start = time.time()
results_after = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=expr_after
)
time_after = time.time() - start
print(f"\n优化后(IN运算):")
print(f" 查询时间: {time_after*1000:.2f}ms")
print(f" 结果数: {len(results_after[0])}")
print(f" 加速比: {time_before/time_after:.2f}x")
# 优化: 避免NOT
expr_not = 'not (price > 1000)'
expr_positive = 'price <= 1000'
start = time.time()
results_not = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=expr_not
)
time_not = time.time() - start
start = time.time()
results_positive = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr=expr_positive
)
time_positive = time.time() - start
print(f"\nNOT运算对比:")
print(f" NOT运算: {time_not*1000:.2f}ms")
print(f" 正向条件: {time_positive*1000:.2f}ms")
print(f" 加速比: {time_not/time_positive:.2f}x")
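# Expression review helper (heuristic checks only)
import re
class ExpressionOptimizer:
    @staticmethod
    def optimize(expr):
        """Return heuristic suggestions for an expression."""
        optimizations = []
        lowered = expr.lower()
        # Many OR branches (word match, so "or" at a line end is counted too)
        if len(re.findall(r'\bor\b', lowered)) >= 3:
            optimizations.append("Suggestion: use IN instead of many ORs")
        # NOT used as a negation, excluding the "not in" operator
        if re.search(r'\bnot\b(?!\s+in\b)', lowered):
            optimizations.append("Suggestion: avoid NOT, use a positive condition")
        # Deep nesting
        if expr.count('(') > 3:
            optimizations.append("Suggestion: simplify nested expressions")
        # Function call: an identifier directly followed by "("
        if re.search(r'\b(?!and\b|or\b|not\b|in\b)[a-z_]\w*\(', lowered):
            optimizations.append("Warning: possible function call, which may affect performance")
        return optimizations
    @staticmethod
    def analyze(expr, collection):
        """Time one search that uses the expression."""
        query_vector = [[np.random.random() for _ in range(128)]]
        search_params = {
            "metric_type": "L2",
            "params": {"nprobe": 16}
        }
        start = time.time()
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=10,
            expr=expr
        )
        elapsed = time.time() - start
        return {
            "query_time": elapsed * 1000,
            "result_count": len(results[0])
        }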
optimizer = ExpressionOptimizer()
# 分析复杂表达式
complex_expr = '''
not (category == "电子" or category == "图书") and
(price > 100 or discount > 0.5) and
rating >= 4.0
'''
print(f"\nExpression suggestions:")
suggestions = optimizer.optimize(complex_expr)
for suggestion in suggestions:
    print(f"  {suggestion}")
metrics = optimizer.analyze(complex_expr, collection)
print(f"\n性能分析:")
print(f" 查询时间: {metrics['query_time']:.2f}ms")
print(f" 结果数: {metrics['result_count']}")
---
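The before/after comparisons above each time a single call, which is sensitive to warm-up and network jitter. A more robust comparison repeats the call and reports the median and a high percentile. The helper below is a generic sketch: `time_call` is a hypothetical name, and the timed workload is a stand-in for a `collection.search(...)` call.

```python
import statistics
import time

def time_call(fn, repeats=20, warmup=3):
    """Run fn repeatedly and return (median_ms, p95_ms) of the wall-clock latency."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Index of the sample closest to the 95th percentile position
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return statistics.median(samples), samples[p95_index]

# Stand-in workload (assumption); in practice fn would wrap collection.search(...)
median_ms, p95_ms = time_call(lambda: sum(i * i for i in range(10000)))
print(f"median={median_ms:.3f}ms p95={p95_ms:.3f}ms")
```

Comparing medians of two expression variants gives a steadier speedup figure than dividing two single measurements.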
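02.Filter performance
a.Index usage
a.Description
Filter performance depends heavily on scalar indexes. Creating indexes on frequently filtered fields can speed up filtering noticeably, and the index type should match the field's data type and value distribution. Combined conditions may not be able to use an index for every part of the expression. Because filtering runs before the vector search, a faster filter shortens the whole query. Review which indexes exist, analyze slow queries regularly, and adjust filter conditions and index configuration accordingly.
b.Code example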
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("products")
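# Compare filtering with and without a scalar index
print("Effect of a scalar index on filter latency:\n")
# Case 1: no index on category
collection.release()
# has_index and drop_index take the index name as a keyword argument
if collection.has_index(index_name="category_index"):
    collection.drop_index(index_name="category_index")
collection.load()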
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
start = time.time()
results_no_index = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr='category == "电子"'
)
time_no_index = time.time() - start
print(f"无索引:")
print(f" 查询时间: {time_no_index*1000:.2f}ms")
# 场景2: 有索引
collection.release()
collection.create_index(
field_name="category",
index_name="category_index"
)
collection.load()
start = time.time()
results_with_index = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
expr='category == "电子"'
)
time_with_index = time.time() - start
print(f"\n有索引:")
print(f" 查询时间: {time_with_index*1000:.2f}ms")
print(f" 加速比: {time_no_index/time_with_index:.2f}x")
# 组合条件的索引利用
print("\n组合条件索引利用:")
# 为price字段创建索引
collection.release()
collection.create_index(
field_name="price",
index_name="price_index"
)
collection.load()
# 测试不同组合
test_cases = [
('category == "电子"', "单字段(有索引)"),
('price > 100', "单字段(有索引)"),
('category == "电子" and price > 100', "两字段AND(都有索引)"),
('category == "电子" or price > 100', "两字段OR(都有索引)"),
('category == "电子" and rating > 4.0', "混合(一个有索引)")
]
print(f"\n{'expression':>45s} {'latency':>12s}")
print("-" * 60)
for expr, desc in test_cases:
    start = time.time()
    try:
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=10,
            expr=expr
        )
        elapsed = time.time() - start
        print(f"{desc:>45s} {elapsed*1000:10.2f}ms")
    except Exception as e:
        print(f"{desc:>45s} error: {str(e)}")
# 索引选择建议
print("\n索引选择建议:")
print(" 1. 为高频查询字段创建索引")
print(" 2. 高基数字段(唯一值多)索引效果好")
print(" 3. 低基数字段(如性别)索引效果有限")
print(" 4. 组合查询考虑创建多个单字段索引")
print(" 5. 监控索引使用率,删除无用索引")
---
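The cardinality advice above can be checked on real field values before creating an index. The sketch below summarizes how many distinct values a field has and how dominant its most common value is; `cardinality_report` and the sample values are hypothetical, and what counts as "high" or "low" cardinality remains a judgment call.

```python
from collections import Counter

def cardinality_report(values):
    """Summarize the number of distinct values in a field and how skewed they are."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    top_value, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "distinct_ratio": len(counts) / total,
        "top_value": top_value,
        "top_share": top_count / total,
    }

# Made-up field values for illustration
categories = ["电子"] * 700 + ["图书"] * 200 + ["服装"] * 100
ids = list(range(1000))

category_report = cardinality_report(categories)
id_report = cardinality_report(ids)
print(category_report)
print(id_report)
```

An equality filter on a value that covers 70% of the rows still leaves most rows as candidates, so an index on such a field helps little for that value.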
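b.Performance monitoring
a.Description
Monitoring filter performance helps uncover bottlenecks and tuning opportunities. Useful metrics include query latency (average and tail percentiles), the pass rate of each filter, and whether the filtered fields are indexed. Analyze slow queries to identify problems, and review filter expressions regularly so that complex queries get simplified. Establish a performance baseline, monitor against it continuously, and set alert thresholds so anomalies are noticed promptly.
b.Code example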
---
from pymilvus import Collection
import numpy as np
import time
from collections import defaultdict
collection = Collection("products")
collection.load()
# Performance monitor (all times are in milliseconds)
class FilterPerformanceMonitor:
    def __init__(self):
        self.query_log = []
        self.stats = defaultdict(list)
    def log_query(self, expr, query_time, result_count):
        """Record one query."""
        self.query_log.append({
            "expr": expr,
            "time": query_time,
            "count": result_count,
            "timestamp": time.time()
        })
        self.stats[expr].append(query_time)
    def get_slow_queries(self, threshold_ms=100):
        """Return the logged queries slower than threshold_ms."""
        slow_queries = [
            q for q in self.query_log
            if q["time"] > threshold_ms
        ]
        return slow_queries
    def get_stats(self):
        """Return per-expression latency statistics."""
        stats_summary = {}
        for expr, times in self.stats.items():
            stats_summary[expr] = {
                "count": len(times),
                "avg_time": np.mean(times),
                "p95_time": np.percentile(times, 95),
                "max_time": max(times)
            }
        return stats_summary
    def recommend_optimizations(self):
        """Return tuning suggestions based on the logged queries."""
        recommendations = []
        slow_queries = self.get_slow_queries(threshold_ms=50)
        if slow_queries:
            recommendations.append(
                f"Found {len(slow_queries)} slow queries (>50ms); consider optimizing them"
            )
        stats = self.get_stats()
        for expr, stat in stats.items():
            if stat["avg_time"] > 30:
                recommendations.append(
                    f"Expression '{expr[:50]}...' averages {stat['avg_time']:.2f}ms; consider optimizing it"
                )
        return recommendations
# 使用监控器
monitor = FilterPerformanceMonitor()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 模拟多次查询
test_expressions = [
'category == "电子"',
'price > 100 and price < 500',
'category in ["电子", "图书", "服装"]',
'rating >= 4.0 and stock > 0'
]
print("Running test queries...\n")
for _ in range(10):
    for expr in test_expressions:
        start = time.time()
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=10,
            expr=expr
        )
        elapsed = (time.time() - start) * 1000
        monitor.log_query(expr, elapsed, len(results[0]))
# Analyze the results
print("Latency statistics:\n")
print(f"{'expression':>50s} {'queries':>10s} {'avg':>12s} {'P95':>12s}")
print("-" * 90)
stats = monitor.get_stats()
for expr, stat in stats.items():
    print(f"{expr:>50s} {stat['count']:>10d} {stat['avg_time']:>10.2f}ms {stat['p95_time']:>10.2f}ms")
# Slow query analysis
slow_queries = monitor.get_slow_queries(threshold_ms=20)
if slow_queries:
    print(f"\nSlow queries (>20ms): {len(slow_queries)}")
    for q in slow_queries[:5]:
        print(f"  {q['expr'][:50]}: {q['time']:.2f}ms")
# Tuning suggestions
print("\nSuggestions:")
recommendations = monitor.recommend_optimizations()
for rec in recommendations:
    print(f"  - {rec}")
# 性能报告
print("\n性能报告:")
print(f" 总查询数: {len(monitor.query_log)}")
print(f" 平均延迟: {np.mean([q['time'] for q in monitor.query_log]):.2f}ms")
print(f" P95延迟: {np.percentile([q['time'] for q in monitor.query_log], 95):.2f}ms")
print(f" P99延迟: {np.percentile([q['time'] for q in monitor.query_log], 99):.2f}ms")
---
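The alert-threshold idea can be sketched as a sliding window over recent latencies that raises a flag when the window's P95 crosses a limit. `LatencyWindow` is a hypothetical class, the window size and threshold are arbitrary, and the percentile uses the nearest-rank definition rather than NumPy's interpolation.

```python
from collections import deque

class LatencyWindow:
    """Keep the most recent latencies and flag when their P95 exceeds a threshold."""

    def __init__(self, size=100, p95_threshold_ms=50.0):
        self.samples = deque(maxlen=size)  # old samples drop out automatically
        self.p95_threshold_ms = p95_threshold_ms

    def add(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        # Nearest rank: smallest value with at least 95% of samples at or below it
        rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n)
        return ordered[rank - 1]

    def should_alert(self):
        return self.p95() > self.p95_threshold_ms

window = LatencyWindow(size=20, p95_threshold_ms=50.0)
for latency in [12.0] * 18 + [80.0, 95.0]:
    window.add(latency)
print(window.p95(), window.should_alert())
```

Because the window is bounded, a burst of slow queries stops triggering the alert once enough normal queries have pushed it out.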
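6.5 Batch queries
01.Batch search
a.Batch submission
a.Description
Batch search submits several query vectors in one call, which raises throughput. Milvus processes the queries of a batch together, so index access and network round-trip costs are shared. Batch size affects performance; 10 to 100 queries per batch is a reasonable starting point to tune from. Very large batches can cause memory pressure and higher latency. A batch search returns a list in which each element holds the results of one query, in the order the queries were submitted. This suits offline workloads such as bulk recommendation or bulk similarity computation.
b.Code example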
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 单次查询性能
print("单次查询 vs 批量查询性能对比:\n")
num_queries = 100
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(num_queries)]
# 方式1: 逐个查询
start = time.time()
results_sequential = []
for query_vector in query_vectors:
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=search_params,
        limit=10
    )
    results_sequential.append(results[0])
time_sequential = time.time() - start
print(f"逐个查询 ({num_queries}次):")
print(f" 总时间: {time_sequential:.2f}s")
print(f" 平均每次: {time_sequential/num_queries*1000:.2f}ms")
print(f" QPS: {num_queries/time_sequential:.2f}")
# 方式2: 批量查询
start = time.time()
results_batch = collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
time_batch = time.time() - start
print(f"\n批量查询 ({num_queries}次):")
print(f" 总时间: {time_batch:.2f}s")
print(f" 平均每次: {time_batch/num_queries*1000:.2f}ms")
print(f" QPS: {num_queries/time_batch:.2f}")
print(f" 加速比: {time_sequential/time_batch:.2f}x")
# Latency for different batch sizes
batch_sizes = [1, 10, 50, 100, 200]
print("\nLatency for different batch sizes:\n")
print(f"{'batch size':>10s} {'total':>10s} {'per query':>12s} {'QPS':>10s}")
print("-" * 48)
for batch_size in batch_sizes:
    test_vectors = [[np.random.random() for _ in range(128)] for _ in range(batch_size)]
    start = time.time()
    results = collection.search(
        data=test_vectors,
        anns_field="embedding",
        param=search_params,
        limit=10
    )
    elapsed = time.time() - start
    avg_time = elapsed / batch_size * 1000
    qps = batch_size / elapsed
    print(f"{batch_size:10d} {elapsed:9.3f}s {avg_time:10.2f}ms {qps:9.2f}")
# 批量查询最佳实践
print("\n批量查询最佳实践:")
print(" 1. 批量大小: 10-100(根据延迟要求)")
print(" 2. 离线处理: 使用更大批量(100-500)")
print(" 3. 实时场景: 使用小批量(10-50)")
print(" 4. 监控内存: 避免批量过大导致OOM")
print(" 5. 并发控制: 限制同时批量查询数")
---
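Splitting a large set of query vectors into fixed-size batches is a small piece of glue code that is worth getting right, since the results must line up with the original query order. The sketch below is independent of Milvus: `chunked` and `search_in_batches` are hypothetical helpers, and `fake_search` stands in for `collection.search`.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def search_in_batches(search_fn, query_vectors, batch_size=100):
    """Call search_fn once per batch and return results in the original query order."""
    results = []
    for batch in chunked(query_vectors, batch_size):
        results.extend(search_fn(batch))
    return results

# Stand-in for collection.search (assumption): returns one result per query vector
def fake_search(batch):
    return [sum(vector) for vector in batch]

queries = [[float(i), 1.0] for i in range(250)]
results = search_in_batches(fake_search, queries, batch_size=100)
print(len(results), results[:3])
```

Because each batch is processed before the next one is built, peak memory is bounded by one batch of queries plus the accumulated results.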
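b.Concurrent queries
a.Description
Concurrent querying uses multiple threads or processes to raise throughput. Milvus serves concurrent requests from multiple clients and can use the server's resources more fully that way. The concurrency level should be tuned to the server's CPU cores and measured under load; too much concurrency causes resource contention and degrades performance. Latency and throughput have to be traded off against each other. This suits high-throughput workloads such as bulk recommendation. Reuse established connections instead of reconnecting for every query.
b.Code example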
---
from pymilvus import Collection, connections
import numpy as np
import time
import concurrent.futures
from threading import Lock
# 连接Milvus
connections.connect(host="localhost", port="19530")
collection = Collection("documents")
collection.load()
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# Single-threaded queries
def single_thread_queries(num_queries=100):
    """Run the queries one after another in a single thread."""
    query_vectors = [[np.random.random() for _ in range(128)] for _ in range(num_queries)]
    start = time.time()
    for query_vector in query_vectors:
        collection.search(
            data=[query_vector],
            anns_field="embedding",
            param=search_params,
            limit=10
        )
    elapsed = time.time() - start
    return elapsed, num_queries
# Multi-threaded queries
def multi_thread_queries(num_queries=100, num_workers=4):
    """Run the queries in a thread pool."""
    query_vectors = [[np.random.random() for _ in range(128)] for _ in range(num_queries)]
    def query_worker(query_vector):
        """A single query task."""
        return collection.search(
            data=[query_vector],
            anns_field="embedding",
            param=search_params,
            limit=10
        )
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(query_worker, qv) for qv in query_vectors]
        # as_completed yields in completion order, not submission order
        results = [future.result() for future in concurrent.futures.as_completed(futures)]
    elapsed = time.time() - start
    return elapsed, num_queries
# 性能对比
print("并发查询性能测试:\n")
num_queries = 100
# 单线程
time_single, count_single = single_thread_queries(num_queries)
qps_single = count_single / time_single
print(f"单线程:")
print(f" 总时间: {time_single:.2f}s")
print(f" QPS: {qps_single:.2f}")
# Different concurrency levels
worker_counts = [2, 4, 8, 16]
print(f"\nThroughput at different concurrency levels:\n")
print(f"{'workers':>8s} {'total':>10s} {'QPS':>10s} {'speedup':>10s}")
print("-" * 42)
for num_workers in worker_counts:
    time_multi, count_multi = multi_thread_queries(num_queries, num_workers)
    qps_multi = count_multi / time_multi
    speedup = time_single / time_multi
    print(f"{num_workers:8d} {time_multi:9.2f}s {qps_multi:9.2f} {speedup:9.2f}x")
# Concurrency controller
class ConcurrentQueryController:
    def __init__(self, collection, max_workers=8):
        self.collection = collection
        self.max_workers = max_workers
        self.executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self.lock = Lock()
        self.query_count = 0
    def query(self, query_vector, search_params, limit=10):
        """Submit one query task and return its future."""
        def _query():
            # The lock protects the shared counter only, not the search itself
            with self.lock:
                self.query_count += 1
            return self.collection.search(
                data=[query_vector],
                anns_field="embedding",
                param=search_params,
                limit=limit
            )
        return self.executor.submit(_query)
    def batch_query(self, query_vectors, search_params, limit=10):
        """Submit one task per query vector."""
        futures = [self.query(qv, search_params, limit) for qv in query_vectors]
        return futures
    def wait_all(self, futures):
        """Wait for all queries; results arrive in completion order."""
        results = []
        for future in concurrent.futures.as_completed(futures):
            results.append(future.result())
        return results
    def get_stats(self):
        """Return counters for the queries run so far."""
        return {
            "total_queries": self.query_count,
            "max_workers": self.max_workers
        }
    def shutdown(self):
        """Shut down the executor."""
        self.executor.shutdown(wait=True)
# 使用并发控制器
controller = ConcurrentQueryController(collection, max_workers=8)
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(50)]
print("\n使用并发控制器:")
start = time.time()
futures = controller.batch_query(query_vectors, search_params, limit=10)
results = controller.wait_all(futures)
elapsed = time.time() - start
stats = controller.get_stats()
print(f" 查询数: {stats['total_queries']}")
print(f" 并发数: {stats['max_workers']}")
print(f" 总时间: {elapsed:.2f}s")
print(f" QPS: {stats['total_queries']/elapsed:.2f}")
controller.shutdown()
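# Concurrency tuning suggestions
print("\nConcurrency tuning suggestions:")
print(" 1. Start with concurrency of about 2 x CPU cores, then adjust based on measured QPS and latency")
print(" 2. Reuse connections (connection pool) instead of reconnecting per query")
print(" 3. Monitor resource usage to avoid overload")
print(" 4. Use low concurrency for latency-sensitive traffic and high concurrency for batch jobs")
print(" 5. Combine batched queries with concurrency to maximize throughput")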
---
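One detail in the threaded examples above: collecting futures with `as_completed` returns results in completion order, not submission order, so results can no longer be matched to their query vectors by position. The sketch below shows one way to keep them aligned. `fake_search` is a hypothetical stand-in for `collection.search` so that the example runs without a Milvus server.

```python
import concurrent.futures
import random
import time

def fake_search(query_id):
    """Hypothetical stand-in for collection.search; sleeps a random time."""
    time.sleep(random.uniform(0.001, 0.01))
    return f"result-for-{query_id}"

def ordered_concurrent_search(query_ids, num_workers=4):
    """Run searches concurrently and return results in submission order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map yields results in input order,
        # regardless of which task finishes first.
        return list(executor.map(fake_search, query_ids))

ordered = ordered_concurrent_search(list(range(20)))
print(ordered[:3])
```

If per-query error handling is needed, an alternative is to keep a `future -> index` dictionary and write each result into a pre-sized list.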
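02.Batch optimization
a.Memory management
a.Description
Batch queries need deliberate memory management to avoid out-of-memory (OOM) failures. Both the query vectors and the returned results occupy client memory, so an oversized batch can exhaust it. Cap the batch size according to the memory actually available, and process large workloads as a stream: generate or load, query, and handle one batch at a time, using a generator so the full query set is never held in memory at once. Monitor memory usage and release objects as soon as they are no longer needed. Keep limit small enough that each query returns only the results that will be used.
b.Code example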
---
from pymilvus import Collection
import numpy as np
import psutil
import gc
collection = Collection("documents")
collection.load()
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 内存监控
def get_memory_usage():
"""获取当前内存使用"""
process = psutil.Process()
memory_info = process.memory_info()
return memory_info.rss / 1024 / 1024 # MB
# 批量查询内存分析
print("批量查询内存使用分析:\n")
batch_sizes = [10, 50, 100, 500, 1000]
print(f"{'批量大小':>10s} {'查询前':>12s} {'查询后':>12s} {'增长':>12s}")
print("-" * 50)
for batch_size in batch_sizes:
# 清理内存
gc.collect()
mem_before = get_memory_usage()
# 生成查询向量
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(batch_size)]
# 执行查询
results = collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
mem_after = get_memory_usage()
mem_increase = mem_after - mem_before
print(f"{batch_size:10d} {mem_before:10.2f}MB {mem_after:10.2f}MB {mem_increase:10.2f}MB")
# 清理结果
del query_vectors
del results
gc.collect()
# 流式批量查询
def streaming_batch_query(total_queries, batch_size=100):
"""流式批量查询,避免内存溢出"""
num_batches = (total_queries + batch_size - 1) // batch_size
for batch_idx in range(num_batches):
start_idx = batch_idx * batch_size
end_idx = min(start_idx + batch_size, total_queries)
current_batch_size = end_idx - start_idx
# 生成当前批次的查询向量
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(current_batch_size)]
# 执行查询
results = collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
# 处理结果(这里只是打印)
yield batch_idx, results
# 清理内存
del query_vectors
del results
gc.collect()
print("\n流式批量查询:")
mem_start = get_memory_usage()
print(f"开始内存: {mem_start:.2f}MB")
total_queries = 1000
batch_size = 100
for batch_idx, results in streaming_batch_query(total_queries, batch_size):
mem_current = get_memory_usage()
print(f" 批次 {batch_idx+1}: {len(results)} 个结果, 内存: {mem_current:.2f}MB")
mem_end = get_memory_usage()
print(f"结束内存: {mem_end:.2f}MB")
print(f"内存增长: {mem_end - mem_start:.2f}MB")
# 自适应批量大小
class AdaptiveBatchQuery:
def __init__(self, collection, max_memory_mb=1000):
self.collection = collection
self.max_memory_mb = max_memory_mb
self.batch_size = 100
def estimate_batch_size(self, vector_dim=128, limit=10):
"""估算合适的批量大小"""
# 估算单个查询的内存占用
query_memory = vector_dim * 4 / 1024 / 1024 # MB
result_memory = limit * (vector_dim * 4 + 100) / 1024 / 1024 # MB
per_query_memory = query_memory + result_memory
# 计算批量大小
available_memory = self.max_memory_mb * 0.8 # 留20%余量
estimated_batch_size = int(available_memory / per_query_memory)
return max(10, min(estimated_batch_size, 1000))
def query(self, query_vectors, search_params, limit=10):
"""自适应批量查询"""
# 动态调整批量大小
optimal_batch_size = self.estimate_batch_size(limit=limit)
print(f"自适应批量大小: {optimal_batch_size}")
all_results = []
num_queries = len(query_vectors)
for i in range(0, num_queries, optimal_batch_size):
batch = query_vectors[i:i+optimal_batch_size]
results = self.collection.search(
data=batch,
anns_field="embedding",
param=search_params,
limit=limit
)
all_results.extend(results)
# 检查内存
current_memory = get_memory_usage()
if current_memory > self.max_memory_mb:
print(f"警告: 内存使用 {current_memory:.2f}MB 超过限制")
gc.collect()
return all_results
adaptive_query = AdaptiveBatchQuery(collection, max_memory_mb=500)
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(500)]
results = adaptive_query.query(query_vectors, search_params, limit=10)
print(f"\n自适应查询完成: {len(results)} 个结果")
---
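The batch-size estimate above is plain arithmetic and can be checked by hand. The sketch below assumes float32 vectors (4 bytes per dimension) and the same rough 100-byte per-hit overhead used in `estimate_batch_size`. The helper names `per_query_mb` and `chunked` are illustrative and not part of pymilvus.

```python
from itertools import islice

def per_query_mb(vector_dim=128, limit=10, hit_overhead_bytes=100):
    """Rough memory per query in MB: one query vector plus `limit` hits."""
    query_bytes = vector_dim * 4  # float32 query vector
    result_bytes = limit * (vector_dim * 4 + hit_overhead_bytes)
    return (query_bytes + result_bytes) / 1024 / 1024

def chunked(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

cost = per_query_mb()
budget_mb = 500 * 0.8  # keep 20% headroom, as in the class above
uncapped = int(budget_mb / cost)
print(f"per query: {cost * 1024:.2f} KB, uncapped batch size: {uncapped}")

batch_lengths = [len(batch) for batch in chunked(range(250), 100)]
print(batch_lengths)
```

With these inputs the uncapped estimate is far above 1000, so the `min(..., 1000)` cap in `estimate_batch_size` is what actually determines the batch size. The memory model only starts to matter for much larger `limit` values or much smaller budgets.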
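b.Performance tuning
a.Description
Tuning batch-query performance means balancing several factors at once. Batch size, concurrency, and search parameters all affect throughput and latency, and search parameters such as nprobe also affect recall. Determine the best configuration by experiment rather than by guesswork. Track QPS, latency, and memory, and use profiling tools to locate bottlenecks. Consider caching to avoid repeating identical queries, and reduce network cost with techniques such as compression. Establish a performance baseline so that later changes can be measured against it.
b.Code example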
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
# 性能调优实验
class BatchQueryTuner:
def __init__(self, collection):
self.collection = collection
self.results = []
def tune_batch_size(self, query_vectors, search_params):
"""调优批量大小"""
batch_sizes = [10, 20, 50, 100, 200]
print("批量大小调优:\n")
print(f"{'批量大小':>10s} {'总时间':>10s} {'QPS':>10s} {'平均延迟':>12s}")
print("-" * 48)
best_qps = 0
best_batch_size = 10
for batch_size in batch_sizes:
# 使用前N个查询
test_vectors = query_vectors[:min(batch_size * 10, len(query_vectors))]
start = time.time()
for i in range(0, len(test_vectors), batch_size):
batch = test_vectors[i:i+batch_size]
self.collection.search(
data=batch,
anns_field="embedding",
param=search_params,
limit=10
)
elapsed = time.time() - start
qps = len(test_vectors) / elapsed
avg_latency = elapsed / len(test_vectors) * 1000
print(f"{batch_size:10d} {elapsed:9.2f}s {qps:9.2f} {avg_latency:10.2f}ms")
if qps > best_qps:
best_qps = qps
best_batch_size = batch_size
print(f"\n最优批量大小: {best_batch_size} (QPS: {best_qps:.2f})")
return best_batch_size
def tune_search_params(self, query_vectors, batch_size):
"""调优搜索参数"""
nprobe_values = [8, 16, 32, 64]
print("\n搜索参数调优:\n")
print(f"{'nprobe':>8s} {'总时间':>10s} {'QPS':>10s}")
print("-" * 32)
best_qps = 0
best_nprobe = 16
for nprobe in nprobe_values:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
test_vectors = query_vectors[:min(batch_size * 10, len(query_vectors))]
start = time.time()
for i in range(0, len(test_vectors), batch_size):
batch = test_vectors[i:i+batch_size]
self.collection.search(
data=batch,
anns_field="embedding",
param=search_params,
limit=10
)
elapsed = time.time() - start
qps = len(test_vectors) / elapsed
print(f"{nprobe:8d} {elapsed:9.2f}s {qps:9.2f}")
if qps > best_qps:
best_qps = qps
best_nprobe = nprobe
print(f"\nFastest nprobe: {best_nprobe} (QPS: {best_qps:.2f}); recall is not measured here, so check accuracy before adopting it")
return best_nprobe
def full_tune(self, num_queries=1000):
"""完整调优流程"""
print("=" * 60)
print("批量查询性能调优")
print("=" * 60 + "\n")
# 生成测试查询
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(num_queries)]
# 调优批量大小
optimal_batch_size = self.tune_batch_size(
query_vectors,
{"metric_type": "L2", "params": {"nprobe": 16}}
)
# 调优搜索参数
optimal_nprobe = self.tune_search_params(query_vectors, optimal_batch_size)
# 最终配置
print("\n" + "=" * 60)
print("最优配置")
print("=" * 60)
print(f" 批量大小: {optimal_batch_size}")
print(f" nprobe: {optimal_nprobe}")
# 验证性能
optimal_search_params = {
"metric_type": "L2",
"params": {"nprobe": optimal_nprobe}
}
start = time.time()
for i in range(0, len(query_vectors), optimal_batch_size):
batch = query_vectors[i:i+optimal_batch_size]
self.collection.search(
data=batch,
anns_field="embedding",
param=optimal_search_params,
limit=10
)
elapsed = time.time() - start
final_qps = len(query_vectors) / elapsed
final_latency = elapsed / len(query_vectors) * 1000
print(f"\n最终性能:")
print(f" QPS: {final_qps:.2f}")
print(f" 平均延迟: {final_latency:.2f}ms")
print(f" 总时间: {elapsed:.2f}s")
# 执行调优
tuner = BatchQueryTuner(collection)
tuner.full_tune(num_queries=500)
---
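The nprobe sweep above ranks settings by QPS only, and the smallest nprobe will usually win on speed. To compare settings fairly, recall should be measured alongside it. The sketch below computes recall@k against an exact brute-force L2 search in NumPy. The "approximate" result is constructed by hand purely for illustration; in a real run it would be the IDs returned by `collection.search`.

```python
import numpy as np

def exact_top_k(base, query, k):
    """IDs of the k nearest base vectors to `query` under L2 distance."""
    distances = np.linalg.norm(base - query, axis=1)
    return np.argsort(distances)[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    exact = {int(i) for i in exact_ids}
    found = {int(i) for i in approx_ids}
    return len(exact & found) / len(exact)

rng = np.random.default_rng(0)
base = rng.random((1000, 128), dtype=np.float32)
query = rng.random(128, dtype=np.float32)

truth = exact_top_k(base, query, k=10)
truth_set = {int(i) for i in truth}
# Simulated ANN output: 8 true neighbours plus 2 IDs outside the true top-10.
non_neighbours = [i for i in range(1000) if i not in truth_set][:2]
approx = [int(i) for i in truth[:8]] + non_neighbours
print(f"recall@10 = {recall_at_k(approx, truth):.2f}")
```

A tuner could then pick the smallest nprobe whose average recall meets a target (for example 0.95) rather than the one with the highest QPS.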
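7 Advanced features
7.1 Partition management
01.Partition concepts
a.Purpose of partitions
a.Description
A partition is a logical grouping of data inside a Collection. Partitions can improve query performance because a search can be restricted to the relevant partitions instead of covering the whole Collection. They suit data that divides naturally by time, category, or region. A Collection can hold multiple partitions and always has a default partition named _default. Each entity belongs to exactly one partition, and partitions can be loaded, released, and dropped independently. Used well, partitions reduce both query cost and resource usage.
b.Code example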
---
from pymilvus import Collection, Partition
import numpy as np
collection = Collection("documents")
# 创建分区
partition_2024 = collection.create_partition("year_2024")
partition_2023 = collection.create_partition("year_2023")
partition_2022 = collection.create_partition("year_2022")
print("已创建分区:")
for partition in collection.partitions:
print(f" - {partition.name}")
# 向不同分区插入数据
data_2024 = [
[i for i in range(1000, 2000)], # ids
[f"文档2024_{i}" for i in range(1000)], # titles
[[np.random.random() for _ in range(128)] for _ in range(1000)] # embeddings
]
partition_2024.insert(data_2024)
data_2023 = [
[i for i in range(2000, 3000)],
[f"文档2023_{i}" for i in range(1000)],
[[np.random.random() for _ in range(128)] for _ in range(1000)]
]
partition_2023.insert(data_2023)
collection.flush()
print(f"\n分区数据量:")
print(f" year_2024: {partition_2024.num_entities} 条")
print(f" year_2023: {partition_2023.num_entities} 条")
print(f" 总计: {collection.num_entities} 条")
# 分区搜索
collection.load()
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 搜索特定分区
results_2024 = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
partition_names=["year_2024"],
output_fields=["id", "title"]
)
print(f"\n搜索year_2024分区:")
for hit in results_2024[0][:5]:
print(f" {hit.entity.get('title')}: {hit.distance:.4f}")
# 搜索多个分区
results_multi = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
partition_names=["year_2024", "year_2023"],
output_fields=["id", "title"]
)
print(f"\n搜索多个分区:")
for hit in results_multi[0][:5]:
print(f" {hit.entity.get('title')}: {hit.distance:.4f}")
# 搜索所有分区(不指定partition_names)
results_all = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
output_fields=["id", "title"]
)
print(f"\n搜索所有分区:")
for hit in results_all[0][:5]:
print(f" {hit.entity.get('title')}: {hit.distance:.4f}")
---
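b.Partitioning strategies
a.Description
The partitioning strategy affects both performance and maintainability. Common strategies are time-based partitioning (by day, month, or year), category-based partitioning (by product type or document type), and hash-based partitioning (for even distribution). Time-based partitions suit time-series data and make aging out and archiving old data straightforward. Category-based partitions suit multi-tenant or multi-type data. Hash-based partitions spread load evenly. Avoid creating too many partitions: these notes recommend roughly 10 to 100, and Milvus also enforces a per-Collection maximum. Choose the strategy that matches how the data is queried.
b.Code example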
---
from pymilvus import Collection, Partition
import hashlib
from datetime import datetime

# Strategy 1: partition by time
class TimeBasedPartitioning:
    def __init__(self, collection):
        self.collection = collection

    def create_monthly_partitions(self, start_date, num_months):
        """Create one partition per month, starting from start_date."""
        partitions = []
        current_date = start_date
        for _ in range(num_months):
            partition_name = current_date.strftime("month_%Y_%m")
            if not self.collection.has_partition(partition_name):
                partition = self.collection.create_partition(partition_name)
                partitions.append(partition)
                print(f"Created partition: {partition_name}")
            # Move to the first day of the next month
            if current_date.month == 12:
                current_date = datetime(current_date.year + 1, 1, 1)
            else:
                current_date = datetime(current_date.year, current_date.month + 1, 1)
        return partitions

    def get_partition_by_date(self, date):
        """Return the partition name for a date."""
        return date.strftime("month_%Y_%m")

    def insert_with_date(self, data, date):
        """Insert data into the partition that matches the date."""
        partition_name = self.get_partition_by_date(date)
        if not self.collection.has_partition(partition_name):
            self.collection.create_partition(partition_name)
        partition = Partition(self.collection, partition_name)
        partition.insert(data)
        print(f"Inserted data into partition: {partition_name}")

collection = Collection("time_series_docs")
time_partitioner = TimeBasedPartitioning(collection)
# Create six monthly partitions starting from January 2024
start_date = datetime(2024, 1, 1)
time_partitioner.create_monthly_partitions(start_date, 6)

# Strategy 2: partition by category
class CategoryBasedPartitioning:
    def __init__(self, collection):
        self.collection = collection
        self.categories = {}

    def create_category_partitions(self, categories):
        """Create one partition per category."""
        for category in categories:
            partition_name = f"cat_{category.lower().replace(' ', '_')}"
            if not self.collection.has_partition(partition_name):
                self.collection.create_partition(partition_name)
                print(f"Created partition: {partition_name}")
            # Register the category even if its partition already existed
            self.categories[category] = partition_name

    def insert_by_category(self, data, category):
        """Insert data into the partition for the given category."""
        if category not in self.categories:
            raise ValueError(f"Unknown category: {category}")
        partition_name = self.categories[category]
        partition = Partition(self.collection, partition_name)
        partition.insert(data)
        print(f"Inserted data into partition: {partition_name}")

category_partitioner = CategoryBasedPartitioning(collection)
# Partition names may contain only letters, digits, and underscores,
# so category labels used in names must be ASCII.
categories = ["Electronics", "Books", "Clothing", "Food"]
category_partitioner.create_category_partitions(categories)

# Strategy 3: partition by hash
class HashBasedPartitioning:
    def __init__(self, collection, num_partitions=10):
        self.collection = collection
        self.num_partitions = num_partitions
        self.create_hash_partitions()

    def create_hash_partitions(self):
        """Create the hash partitions hash_000 .. hash_NNN."""
        for i in range(self.num_partitions):
            partition_name = f"hash_{i:03d}"
            if not self.collection.has_partition(partition_name):
                self.collection.create_partition(partition_name)
                print(f"Created partition: {partition_name}")

    def get_partition_by_id(self, doc_id):
        """Map an ID to a partition name with a stable hash."""
        # Built-in hash() is salted per process for strings, so the same ID
        # could land in different partitions across runs; MD5 is deterministic.
        digest = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
        partition_idx = int(digest, 16) % self.num_partitions
        return f"hash_{partition_idx:03d}"

    def insert_by_hash(self, data):
        """Distribute column-based data across partitions by hashed ID."""
        # data[0] is assumed to be the ID column
        ids = data[0]
        # Group rows by target partition, keeping the column layout
        partition_data = {}
        for i, doc_id in enumerate(ids):
            partition_name = self.get_partition_by_id(doc_id)
            if partition_name not in partition_data:
                partition_data[partition_name] = [[] for _ in range(len(data))]
            for j, field_data in enumerate(data):
                partition_data[partition_name][j].append(field_data[i])
        # Insert each group into its partition
        for partition_name, pdata in partition_data.items():
            partition = Partition(self.collection, partition_name)
            partition.insert(pdata)
            print(f"Inserted {len(pdata[0])} rows into {partition_name}")

hash_partitioner = HashBasedPartitioning(collection, num_partitions=10)

# Choosing a strategy
print("\nChoosing a partitioning strategy:")
print(" Time-based: logs and time-series data; easy to archive")
print(" Category-based: multi-tenant or multi-type data")
print(" Hash-based: even distribution and load balancing")
print(" Hybrid: category first, then time (two-level partitioning)")
---
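The hybrid strategy mentioned at the end of the block above (category first, then time) is not implemented there. The sketch below shows one possible naming scheme for it. The helper names are hypothetical, and the validation regex encodes an assumed naming rule (letters, digits, and underscores, not starting with a digit) that should be checked against the Milvus version in use.

```python
import re
from datetime import date

# Assumed naming rule: letters, digits, underscores; must not start with a digit.
_VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def hybrid_partition_name(category_slug, day):
    """Build a two-level name such as cat_books_2024_03 (category, then month)."""
    name = f"cat_{category_slug.lower()}_{day.year:04d}_{day.month:02d}"
    if not _VALID_NAME.match(name):
        raise ValueError(f"invalid partition name: {name!r}")
    return name

def partitions_for_range(category_slug, start, end):
    """All monthly partition names for one category between two dates, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(hybrid_partition_name(category_slug, date(year, month, 1)))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

single = hybrid_partition_name("Books", date(2024, 3, 15))
span = partitions_for_range("books", date(2023, 12, 1), date(2024, 2, 10))
print(single)
print(span)
```

The list from `partitions_for_range` can be passed as `partition_names` to a search. Note that a hybrid scheme multiplies the partition count (categories times months), so it reaches the recommended ceiling much sooner than either strategy alone.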
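02.Partition operations
a.Loading and releasing
a.Description
Partitions can be loaded and released independently to save memory. Load only the partitions that queries need and leave the rest released. Loading a partition brings its index and part of its data into memory. Releasing it frees that memory while the data stays in persistent storage. Because both operations can be done at runtime, the set of loaded partitions can follow the query pattern: keep hot partitions loaded and load cold partitions on demand.
b.Code example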
---
from pymilvus import Collection, Partition, utility
from pymilvus.client.types import LoadState
import numpy as np
import time

collection = Collection("documents")

def is_partition_loaded(partition_name):
    """Check the load state through utility.load_state.

    Assumption: the installed pymilvus 2.x provides utility.load_state and
    LoadState; the ORM Partition object has no is_loaded attribute.
    """
    state = utility.load_state(collection.name, partition_names=[partition_name])
    return state == LoadState.Loaded

# Create several partitions
partitions = []
for year in [2022, 2023, 2024]:
    partition_name = f"year_{year}"
    if collection.has_partition(partition_name):
        # Assume an existing partition was populated on an earlier run
        partitions.append(Partition(collection, partition_name))
        continue
    partition = collection.create_partition(partition_name)
    partitions.append(partition)
    # Insert data
    data = [
        [i for i in range(year*1000, year*1000+1000)],
        [f"doc_{year}_{i}" for i in range(1000)],
        [[np.random.random() for _ in range(128)] for _ in range(1000)]
    ]
    partition.insert(data)
collection.flush()

# Load one partition
print("Loading a single partition:\n")
partition_2024 = Partition(collection, "year_2024")
print(f"Loaded before load(): {is_partition_loaded('year_2024')}")
partition_2024.load()
print(f"Loaded after load(): {is_partition_loaded('year_2024')}")

# Search the loaded partition
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 16}
}
results = collection.search(
    data=query_vector,
    anns_field="embedding",
    param=search_params,
    limit=5,
    partition_names=["year_2024"]
)
print(f"\nSearch in year_2024: {len(results[0])} results")

# Release the partition
partition_2024.release()
print(f"\nLoaded after release(): {is_partition_loaded('year_2024')}")

# Dynamic load management
class PartitionLoadManager:
    def __init__(self, collection):
        self.collection = collection
        # Tracks only partitions loaded through this manager
        self.loaded_partitions = set()

    def load_partition(self, partition_name):
        """Load a partition and record the load time."""
        if partition_name in self.loaded_partitions:
            print(f"Partition {partition_name} is already loaded")
            return
        partition = Partition(self.collection, partition_name)
        start = time.time()
        partition.load()
        elapsed = time.time() - start
        self.loaded_partitions.add(partition_name)
        print(f"Loaded partition {partition_name}: {elapsed:.2f}s")

    def release_partition(self, partition_name):
        """Release a partition loaded by this manager."""
        if partition_name not in self.loaded_partitions:
            print(f"Partition {partition_name} is not loaded")
            return
        partition = Partition(self.collection, partition_name)
        partition.release()
        self.loaded_partitions.remove(partition_name)
        print(f"Released partition {partition_name}")

    def load_partitions(self, partition_names):
        """Load several partitions."""
        for name in partition_names:
            self.load_partition(name)

    def release_all(self):
        """Release every partition loaded by this manager."""
        for name in list(self.loaded_partitions):
            self.release_partition(name)

    def get_loaded_partitions(self):
        """Return the names of the loaded partitions."""
        return list(self.loaded_partitions)

# Use the load manager
load_manager = PartitionLoadManager(collection)
print("\nDynamic load management:")
# Load the hot partitions
load_manager.load_partitions(["year_2024", "year_2023"])
print(f"Loaded partitions: {load_manager.get_loaded_partitions()}")
# Query the hot data
results = collection.search(
    data=query_vector,
    anns_field="embedding",
    param=search_params,
    limit=5,
    partition_names=["year_2024", "year_2023"]
)
print(f"Search in hot partitions: {len(results[0])} results")
# Switch to cold data
load_manager.release_partition("year_2023")
load_manager.load_partition("year_2022")
print(f"Loaded partitions after switch: {load_manager.get_loaded_partitions()}")
# Release everything
load_manager.release_all()
print(f"Loaded partitions after release: {load_manager.get_loaded_partitions()}")

# Memory optimization suggestions
print("\nMemory optimization suggestions:")
print(" 1. Load only recent partitions (for example the last 3 months)")
print(" 2. Load historical partitions on demand and release them after querying")
print(" 3. Monitor memory usage and avoid loading too many partitions")
print(" 4. Use an LRU policy to manage loaded partitions automatically")
print(" 5. Watch partition size and avoid oversized partitions")
---
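Suggestion 4 above (an LRU policy for loaded partitions) can be sketched independently of Milvus. In the example below the load and release actions are injected as callables, so the policy can be exercised without a server. In real use they might wrap `Partition(collection, name).load()` and `.release()`. The class name `LRUPartitionCache` is illustrative.

```python
from collections import OrderedDict

class LRUPartitionCache:
    """Keep at most `capacity` partitions loaded, evicting the least recently used."""

    def __init__(self, capacity, load_fn, release_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # called with a partition name to load it
        self.release_fn = release_fn  # called with a partition name to release it
        self._loaded = OrderedDict()  # partition name -> None, oldest first

    def touch(self, name):
        """Ensure `name` is loaded and mark it as most recently used."""
        if name in self._loaded:
            self._loaded.move_to_end(name)
            return
        # Evict the oldest entries until there is room for the new partition.
        while len(self._loaded) >= self.capacity:
            evicted, _ = self._loaded.popitem(last=False)
            self.release_fn(evicted)
        self.load_fn(name)
        self._loaded[name] = None

    def loaded(self):
        """Partition names from least to most recently used."""
        return list(self._loaded)

events = []
cache = LRUPartitionCache(
    capacity=2,
    load_fn=lambda name: events.append(("load", name)),
    release_fn=lambda name: events.append(("release", name)),
)
for name in ["year_2024", "year_2023", "year_2024", "year_2022"]:
    cache.touch(name)
print(cache.loaded())
print(events)
```

Calling `touch` for every partition a query is about to search keeps hot partitions resident. In this run `year_2023` is evicted because `year_2024` was used more recently.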
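b.Dropping partitions
a.Description
Dropping a partition permanently deletes the partition and all of its data. The partition must be released before it can be dropped. The operation cannot be undone, so use it carefully and back up important data first. A typical use is removing expired data, for example old time-based partitions, which also frees storage. Dropping one partition does not affect the data or queries of other partitions.
b.Code example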
---
from pymilvus import Collection, Partition, utility
from pymilvus.client.types import LoadState
from datetime import datetime, timedelta
import numpy as np

collection = Collection("documents")

def release_if_loaded(collection, partition_name):
    """Release a partition only if it is currently loaded.

    Assumption: the installed pymilvus 2.x provides utility.load_state and
    LoadState; the ORM Partition object has no is_loaded attribute.
    """
    state = utility.load_state(collection.name, partition_names=[partition_name])
    if state == LoadState.Loaded:
        Partition(collection, partition_name).release()

# Create a test partition
test_partition = collection.create_partition("test_partition")
# Insert test data
data = [
    [i for i in range(10000, 11000)],
    [f"test_doc_{i}" for i in range(1000)],
    [[np.random.random() for _ in range(128)] for _ in range(1000)]
]
test_partition.insert(data)
collection.flush()
print("Created test partition: test_partition")
print(f"Entities: {test_partition.num_entities}")

# List all partitions
print("\nCurrent partitions:")
for partition in collection.partitions:
    print(f" - {partition.name}: {partition.num_entities} entities")

# Drop the partition, releasing it first if it is loaded
print("\nDropping test_partition...")
release_if_loaded(collection, "test_partition")
collection.drop_partition("test_partition")
print("Drop complete")

# Verify
print("\nPartitions after drop:")
for partition in collection.partitions:
    print(f" - {partition.name}: {partition.num_entities} entities")

# Bulk cleanup of old partitions
class PartitionCleaner:
    def __init__(self, collection):
        self.collection = collection

    def delete_old_time_partitions(self, keep_months=3):
        """Drop month_YYYY_MM partitions older than roughly keep_months months."""
        # Approximation: one month is treated as 30 days
        cutoff_date = datetime.now() - timedelta(days=keep_months*30)
        deleted_partitions = []
        for partition in self.collection.partitions:
            # Skip the default partition and names in other formats
            if partition.name == "_default" or not partition.name.startswith("month_"):
                continue
            # Parse the name (expected format: month_YYYY_MM)
            try:
                parts = partition.name.split("_")
                partition_date = datetime(int(parts[1]), int(parts[2]), 1)
            except (IndexError, ValueError) as e:
                print(f"Could not parse partition name: {partition.name}, {e}")
                continue
            if partition_date < cutoff_date:
                release_if_loaded(self.collection, partition.name)
                self.collection.drop_partition(partition.name)
                deleted_partitions.append(partition.name)
                print(f"Dropped old partition: {partition.name}")
        return deleted_partitions

    def delete_empty_partitions(self):
        """Drop partitions that contain no entities."""
        deleted_partitions = []
        for partition in self.collection.partitions:
            if partition.name == "_default":
                continue
            if partition.num_entities == 0:
                release_if_loaded(self.collection, partition.name)
                self.collection.drop_partition(partition.name)
                deleted_partitions.append(partition.name)
                print(f"Dropped empty partition: {partition.name}")
        return deleted_partitions

    def safe_delete_partition(self, partition_name, backup_path=None):
        """Drop a partition, with an optional backup step first."""
        if backup_path:
            print(f"Backing up partition {partition_name} to {backup_path}")
            # Placeholder: the actual backup logic belongs here,
            # for example exporting the partition's data to a file
        release_if_loaded(self.collection, partition_name)
        self.collection.drop_partition(partition_name)
        print(f"Dropped partition: {partition_name}")

cleaner = PartitionCleaner(collection)
# Drop old partitions
print("\nCleaning up old partitions (keeping the last 3 months):")
deleted = cleaner.delete_old_time_partitions(keep_months=3)
print(f"Dropped {len(deleted)} old partitions")
# Drop empty partitions
print("\nCleaning up empty partitions:")
deleted = cleaner.delete_empty_partitions()
print(f"Dropped {len(deleted)} empty partitions")

# Notes on dropping partitions
print("\nNotes on dropping partitions:")
print(" 1. Dropping is irreversible, so proceed carefully")
print(" 2. Back up important data before dropping")
print(" 3. Release a partition before dropping it")
print(" 4. The _default partition cannot be dropped")
print(" 5. Clean up expired partitions regularly to free storage")
---
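The cleanup above approximates a month as 30 days (`timedelta(days=keep_months*30)`), which drifts near month boundaries. On 31 March 2024, for instance, the 90-day cutoff falls on 1 January, so a January partition can be dropped even though three calendar months were meant to be kept. The sketch below uses calendar-month arithmetic instead, assuming the same `month_YYYY_MM` naming. The helper names are illustrative.

```python
from datetime import date

def first_kept_month(today, keep_months):
    """First day of the oldest month to keep, counting the current month."""
    index = today.year * 12 + (today.month - 1) - (keep_months - 1)
    return date(index // 12, index % 12 + 1, 1)

def is_expired(partition_name, today, keep_months):
    """True if a month_YYYY_MM partition is older than the retention window."""
    _, year, month = partition_name.split("_")
    return date(int(year), int(month), 1) < first_kept_month(today, keep_months)

today = date(2024, 3, 31)
print(first_kept_month(today, 3))
for name in ["month_2023_12", "month_2024_01", "month_2024_03"]:
    print(name, is_expired(name, today, 3))
```

Passing `today` in explicitly, rather than calling `datetime.now()` inside the function, also makes the retention rule easy to unit-test.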
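7.2 Replica configuration
01.Replica mechanism
a.Purpose of replicas
a.Description
Replicas provide redundancy and high availability and raise query throughput. Each in-memory replica holds a full loaded copy of the Collection's data and index, so several replicas can serve queries in parallel and increase QPS under concurrent load. Replicas stay consistent with one another as data is updated. The replica count is set when the Collection is loaded (replica_number). With the load API used in these examples, changing it means releasing and reloading the Collection. Replicas suit read-heavy workloads such as search and recommendation. Each additional replica consumes additional QueryNode memory.
b.Code example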
---
from pymilvus import Collection
import numpy as np
import time

collection = Collection("documents")
collection.load()

# Inspect the current replica layout
print("Current replica configuration:")
replicas = collection.get_replicas()
print(f" Replica count: {len(replicas.groups)}")
for i, replica in enumerate(replicas.groups):
    print(f"\n Replica {i+1}:")
    print(f"  Replica ID: {replica.id}")
    print(f"  Shards: {len(replica.shards)}")
    print(f"  Resource group: {replica.resource_group}")

# Load with three replicas. Release first: loading an already-loaded
# collection with a different replica_number may be rejected.
print("\nCreating replicas...")
collection.release()
collection.load(replica_number=3)
replicas = collection.get_replicas()
print(f"Replica count after load: {len(replicas.groups)}")

# Measure how replicas affect query performance
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 16}
}

def measure_qps(num_queries=100):
    """Run searches one after another and return (elapsed seconds, QPS).

    Queries are issued sequentially from one thread, so this mostly reflects
    latency; the throughput gain from replicas shows up under concurrent load.
    """
    start = time.time()
    for _ in range(num_queries):
        collection.search(
            data=query_vector,
            anns_field="embedding",
            param=search_params,
            limit=10
        )
    elapsed = time.time() - start
    return elapsed, num_queries / elapsed

# Single replica
collection.release()
collection.load(replica_number=1)
time_single, qps_single = measure_qps(100)
print("\nSingle replica:")
print(f" Query time: {time_single:.2f}s")
print(f" QPS: {qps_single:.2f}")

# Three replicas
collection.release()
collection.load(replica_number=3)
time_multi, qps_multi = measure_qps(100)
print("\nThree replicas:")
print(f" Query time: {time_multi:.2f}s")
print(f" QPS: {qps_multi:.2f}")
print(f" Ratio: {qps_multi/qps_single:.2f}x")

# Replica configuration suggestions
print("\nReplica configuration suggestions:")
print(" 1. Read-heavy workloads: 2-3 replicas")
print(" 2. High availability: at least 2 replicas")
print(" 3. High throughput: 3-5 replicas")
print(" 4. Limited resources: 1 replica")
print(" 5. Replica count must not exceed the number of QueryNodes")
---
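Two constraints from this section can be checked before calling `load(replica_number=...)`: the replica count should not exceed the number of QueryNodes, and each replica needs memory for a full loaded copy. The sketch below is a deliberately rough planning check with a hypothetical function name and a simplified memory model (total loaded size versus total cluster memory). It is not a Milvus API, and real placement depends on how segments are distributed across nodes.

```python
def check_replica_plan(replica_number, query_nodes, loaded_size_gb, node_memory_gb):
    """Return a list of problems with a replica plan; an empty list means it passes."""
    problems = []
    if replica_number > query_nodes:
        problems.append(
            f"replica_number {replica_number} exceeds {query_nodes} QueryNodes"
        )
    # Simplified model: every replica holds one full loaded copy in memory.
    total_needed = replica_number * loaded_size_gb
    total_available = query_nodes * node_memory_gb
    if total_needed > total_available:
        problems.append(
            f"needs {total_needed} GB of memory, cluster has {total_available} GB"
        )
    return problems

print(check_replica_plan(3, 4, 10, 16))
print(check_replica_plan(5, 4, 10, 16))
print(check_replica_plan(3, 4, 30, 16))
```

A check like this only rules out plans that cannot work. Whether an allowed plan actually improves QPS still has to be measured under concurrent load.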
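b.Replica management
a.Description
Replica management covers creating, resizing, and monitoring replicas. The replica count can be changed at runtime, but with the release-and-reload approach used below the Collection is briefly unavailable for search while it reloads. The replica count affects both memory usage and query performance. Monitor replica state to confirm that every replica is serving, because a failed replica causes queries to be routed to the remaining ones. Different Collections can use different replica counts, so the count can be chosen per Collection to balance performance against cost.
b.Code example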
---
from pymilvus import Collection, utility
from pymilvus.client.types import LoadState
import numpy as np
import time

collection = Collection("documents")

# Replica management helper
class ReplicaManager:
    def __init__(self, collection):
        self.collection = collection

    def is_loaded(self):
        """Check the load state.

        Assumption: the installed pymilvus 2.x provides utility.load_state and
        LoadState; the ORM Collection object has no is_loaded attribute.
        """
        return utility.load_state(self.collection.name) == LoadState.Loaded

    def get_replica_info(self):
        """Return replica details, or {"loaded": False} if not loaded."""
        if not self.is_loaded():
            return {"loaded": False}
        replicas = self.collection.get_replicas()
        info = {
            "loaded": True,
            "replica_count": len(replicas.groups),
            "replicas": []
        }
        for replica in replicas.groups:
            replica_info = {
                "id": replica.id,
                "shard_count": len(replica.shards),
                "resource_group": replica.resource_group
            }
            info["replicas"].append(replica_info)
        return info

    def set_replica_number(self, replica_number):
        """Set the replica count by releasing and reloading the collection."""
        print(f"Setting replica count to {replica_number}...")
        if self.is_loaded():
            self.collection.release()
        # load() blocks until loading completes by default
        self.collection.load(replica_number=replica_number)
        info = self.get_replica_info()
        print(f"Current replica count: {info['replica_count']}")
        return info

    def scale_replicas(self, target_replica_number):
        """Scale the replica count up or down."""
        current_info = self.get_replica_info()
        if not current_info["loaded"]:
            print("Collection is not loaded; loading with the requested replica count")
            return self.set_replica_number(target_replica_number)
        current_count = current_info["replica_count"]
        if current_count == target_replica_number:
            print(f"Replica count is already {target_replica_number}")
            return current_info
        if current_count < target_replica_number:
            print(f"Scaling up: {current_count} -> {target_replica_number}")
        else:
            print(f"Scaling down: {current_count} -> {target_replica_number}")
        return self.set_replica_number(target_replica_number)

    def monitor_replicas(self):
        """Print the state of each replica."""
        info = self.get_replica_info()
        if not info["loaded"]:
            print("Collection is not loaded")
            return
        print("\nReplica monitor:")
        print(f" Total replicas: {info['replica_count']}")
        for i, replica in enumerate(info["replicas"]):
            print(f"\n Replica {i+1}:")
            print(f"  ID: {replica['id']}")
            print(f"  Shards: {replica['shard_count']}")
            print(f"  Resource group: {replica['resource_group']}")

    def benchmark_replicas(self, num_queries=100):
        """Measure sequential query performance for several replica counts."""
        replica_numbers = [1, 2, 3]
        results = []
        query_vector = [[np.random.random() for _ in range(128)]]
        search_params = {
            "metric_type": "L2",
            "params": {"nprobe": 16}
        }
        print(f"\nReplica benchmark ({num_queries} queries):\n")
        print(f"{'Replicas':>8s} {'Total':>10s} {'QPS':>10s} {'Avg latency':>12s}")
        print("-" * 45)
        for replica_num in replica_numbers:
            self.set_replica_number(replica_num)
            start = time.time()
            for _ in range(num_queries):
                self.collection.search(
                    data=query_vector,
                    anns_field="embedding",
                    param=search_params,
                    limit=10
                )
            elapsed = time.time() - start
            qps = num_queries / elapsed
            avg_latency = elapsed / num_queries * 1000
            results.append({
                "replica_number": replica_num,
                "total_time": elapsed,
                "qps": qps,
                "avg_latency": avg_latency
            })
            print(f"{replica_num:8d} {elapsed:9.2f}s {qps:9.2f} {avg_latency:10.2f}ms")
        return results

# Use the replica manager
manager = ReplicaManager(collection)
# Current replica information
info = manager.get_replica_info()
print(f"Current replica info: {info}")
# Set the replica count
manager.set_replica_number(2)
# Monitor replicas
manager.monitor_replicas()
# Scale up
manager.scale_replicas(3)
# Benchmark
results = manager.benchmark_replicas(num_queries=50)
# Pick the configuration with the highest QPS
best_result = max(results, key=lambda x: x["qps"])
print("\nBest configuration:")
print(f" Replicas: {best_result['replica_number']}")
print(f" QPS: {best_result['qps']:.2f}")
---
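02.High-availability configuration
a.Failover
a.Description
Replicas give the system automatic failover and so improve availability. When a node hosting one replica fails, queries are routed to another replica. This is transparent to the client and needs no manual intervention. With multiple replicas, nodes can be taken down for maintenance without interrupting service. Configure at least two replicas for high availability, place them on different nodes to avoid a single point of failure, and monitor replica health so that problems are caught early.
b.Code example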
---
from pymilvus import Collection, utility
from pymilvus.client.types import LoadState
import numpy as np
import time
from threading import Thread

collection = Collection("documents")

# High-availability helper
class HighAvailabilityConfig:
    def __init__(self, collection, min_replicas=2):
        self.collection = collection
        self.min_replicas = min_replicas
        self.query_count = 0   # successful queries
        self.error_count = 0   # failed attempts

    def is_loaded(self):
        """Assumption: utility.load_state and LoadState exist in the installed
        pymilvus 2.x; the ORM Collection object has no is_loaded attribute."""
        return utility.load_state(self.collection.name) == LoadState.Loaded

    def ensure_high_availability(self):
        """Make sure the collection is loaded with at least min_replicas."""
        if not self.is_loaded():
            print(f"Loading collection with {self.min_replicas} replicas")
            self.collection.load(replica_number=self.min_replicas)
            return
        replicas = self.collection.get_replicas()
        current_replicas = len(replicas.groups)
        if current_replicas < self.min_replicas:
            print(f"Too few replicas ({current_replicas} < {self.min_replicas}); reloading")
            self.collection.release()
            self.collection.load(replica_number=self.min_replicas)
        else:
            print(f"Replica configuration OK: {current_replicas} replicas")

    def query_with_retry(self, query_vector, search_params, limit=10, max_retries=3):
        """Search with retries and exponential backoff."""
        for attempt in range(max_retries):
            try:
                results = self.collection.search(
                    data=[query_vector],
                    anns_field="embedding",
                    param=search_params,
                    limit=limit
                )
                self.query_count += 1
                return results[0]
            except Exception as e:
                self.error_count += 1
                print(f"Query failed (attempt {attempt+1}/{max_retries}): {e}")
                if attempt < max_retries - 1:
                    time.sleep(0.1 * (2 ** attempt))  # exponential backoff: 0.1s, 0.2s, ...
                else:
                    raise

    def health_check(self):
        """Return replica and error statistics."""
        try:
            replicas = self.collection.get_replicas()
            replica_count = len(replicas.groups)
            # Error rate over all attempts (successes + failures)
            total_attempts = self.query_count + self.error_count
            health_status = {
                "healthy": replica_count >= self.min_replicas,
                "replica_count": replica_count,
                "min_replicas": self.min_replicas,
                "query_count": self.query_count,
                "error_count": self.error_count,
                "error_rate": self.error_count / total_attempts if total_attempts > 0 else 0
            }
            return health_status
        except Exception as e:
            return {
                "healthy": False,
                "error": str(e)
            }

    def start_health_monitor(self, interval=10):
        """Start a background thread that runs periodic health checks."""
        def monitor():
            while True:
                status = self.health_check()
                print("\nHealth check:")
                print(f" Status: {'healthy' if status.get('healthy') else 'unhealthy'}")
                print(f" Replicas: {status.get('replica_count', 'N/A')}")
                print(f" Queries: {status.get('query_count', 0)}")
                print(f" Errors: {status.get('error_count', 0)}")
                print(f" Error rate: {status.get('error_rate', 0)*100:.2f}%")
                if not status.get('healthy'):
                    print(" Warning: too few replicas, attempting recovery...")
                    self.ensure_high_availability()
                time.sleep(interval)
        monitor_thread = Thread(target=monitor, daemon=True)
        monitor_thread.start()
        return monitor_thread

# Use the high-availability helper
ha_config = HighAvailabilityConfig(collection, min_replicas=2)
# Ensure the replica configuration
ha_config.ensure_high_availability()
# Query with retries
query_vector = [np.random.random() for _ in range(128)]
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 16}
}
print("\nRunning query (with retry):")
results = ha_config.query_with_retry(query_vector, search_params, limit=10)
print(f"Query succeeded: {len(results)} results")
# Health check
status = ha_config.health_check()
print(f"\nHealth status: {status}")

# Failover test
print("\nFailover test:")
print(" Simulating a replica failure...")
# A real replica failure would have to be induced here;
# in production, Milvus handles failover automatically
print(" Queries continue...")
for i in range(10):
    try:
        results = ha_config.query_with_retry(query_vector, search_params)
        print(f" Query {i+1}: success")
    except Exception as e:
        print(f" Query {i+1}: failed - {e}")
final_status = ha_config.health_check()
print("\nFinal status:")
print(f" Successful queries: {final_status.get('query_count', 0)}")
print(f" Errors: {final_status.get('error_count', 0)}")
print(f" Success rate: {(1-final_status.get('error_rate', 0))*100:.2f}%")
---
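Retry logic of the kind used in `query_with_retry` is usually written with exponential backoff, often with random jitter so that many clients do not retry in lockstep after a failover. The sketch below is a generic helper, not a pymilvus feature. `flaky_search` is a hypothetical operation that fails twice before succeeding, and the `sleep` parameter is injectable so the example runs instantly.

```python
import random
import time

def retry_with_backoff(operation, max_retries=3, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call `operation`, retrying on exception with exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # The delay ceiling doubles per attempt: base, 2*base, 4*base, ...
            # capped at max_delay; the actual delay is drawn uniformly below it.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))

calls = {"count": 0}

def flaky_search():
    """Hypothetical query that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("replica unavailable")
    return ["hit-1", "hit-2"]

delays = []
result = retry_with_backoff(flaky_search, sleep=delays.append)
print(result, calls["count"], len(delays))
```

In practice it is worth retrying only on errors that are plausibly transient (connection or timeout errors) and letting schema or parameter errors fail immediately.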
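b.Load balancing
a.Description
With multiple replicas, query requests are spread across replicas automatically. These notes describe the assignment policy as round-robin, but the exact policy depends on the Milvus version. Load balancing raises overall throughput and improves response time, and query assignment can adapt to the load on each replica. Monitor per-replica load to confirm that it is evenly distributed. The replica count should match the number of QueryNodes so that cluster resources are fully used.
b.Code example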
---
from pymilvus import Collection
import numpy as np
import time
from collections import defaultdict
import concurrent.futures
collection = Collection("documents")
collection.load(replica_number=3)
# 负载均衡监控
class LoadBalancingMonitor:
def __init__(self, collection):
self.collection = collection
self.query_stats = defaultdict(int)
self.latency_stats = defaultdict(list)
def query(self, query_vector, search_params, limit=10):
"""执行查询并记录统计"""
start = time.time()
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit
)
latency = time.time() - start
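# Placeholder only: this example has no way to learn which replica served a query,
# so a synthetic ID is used and the resulting distribution is NOT a real measurement.
replica_id = hash(time.time()) % 3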
self.query_stats[replica_id] += 1
self.latency_stats[replica_id].append(latency)
return results[0]
def get_load_distribution(self):
"""获取负载分布"""
total_queries = sum(self.query_stats.values())
distribution = {}
for replica_id, count in self.query_stats.items():
avg_latency = np.mean(self.latency_stats[replica_id]) if self.latency_stats[replica_id] else 0
distribution[replica_id] = {
"query_count": count,
"percentage": count / total_queries * 100 if total_queries > 0 else 0,
"avg_latency": avg_latency * 1000 # ms
}
return distribution
def print_load_stats(self):
"""打印负载统计"""
distribution = self.get_load_distribution()
print("\n负载分布:")
print(f"{'副本ID':>10s} {'查询数':>10s} {'占比':>10s} {'平均延迟':>12s}")
print("-" * 48)
for replica_id, stats in sorted(distribution.items()):
print(f"{replica_id:10d} {stats['query_count']:10d} {stats['percentage']:9.1f}% {stats['avg_latency']:10.2f}ms")
def check_balance(self, threshold=0.2):
"""检查负载是否均衡"""
distribution = self.get_load_distribution()
if len(distribution) < 2:
return True, "副本数不足,无法判断"
percentages = [stats["percentage"] for stats in distribution.values()]
avg_percentage = np.mean(percentages)
max_deviation = max(abs(p - avg_percentage) for p in percentages)
is_balanced = max_deviation <= threshold * 100
return is_balanced, f"最大偏差: {max_deviation:.1f}%"
# 使用负载均衡监控
monitor = LoadBalancingMonitor(collection)
query_vector = [np.random.random() for _ in range(128)]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 执行大量查询
print("执行负载测试...")
for i in range(300):
monitor.query(query_vector, search_params)
if (i + 1) % 100 == 0:
print(f" 已完成 {i+1} 次查询")
# 打印负载统计
monitor.print_load_stats()
# 检查负载均衡
is_balanced, message = monitor.check_balance(threshold=0.2)
print(f"\n负载均衡检查: {'通过' if is_balanced else '不通过'}")
print(f" {message}")
# 并发负载测试
print("\n并发负载测试:")
def concurrent_query(monitor, query_vector, search_params):
"""并发查询任务"""
return monitor.query(query_vector, search_params)
concurrent_monitor = LoadBalancingMonitor(collection)
num_concurrent = 50
num_queries_per_thread = 10
with concurrent.futures.ThreadPoolExecutor(max_workers=num_concurrent) as executor:
futures = []
for _ in range(num_concurrent * num_queries_per_thread):
future = executor.submit(concurrent_query, concurrent_monitor, query_vector, search_params)
futures.append(future)
# 等待完成
concurrent.futures.wait(futures)
print(f"完成 {num_concurrent * num_queries_per_thread} 次并发查询")
concurrent_monitor.print_load_stats()
is_balanced, message = concurrent_monitor.check_balance(threshold=0.2)
print(f"\n并发负载均衡检查: {'通过' if is_balanced else '不通过'}")
print(f" {message}")
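# Load-balancing recommendations
print("\nLoad-balancing recommendations:")
print(" 1. Keep the replica count at or below the QueryNode count; matching it uses every node")
print(" 2. Monitor the load on each replica to confirm it stays balanced")
print(" 3. Spread replicas across different nodes to avoid hot spots")
print(" 4. Use resource groups to isolate different workloads")
print(" 5. Review the load distribution regularly and adjust when needed")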
---
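7.3 Dynamic Schema
01.Dynamic fields
a.Enabling dynamic schema
a.Description
Dynamic schema lets you insert fields that are not declared in the collection schema, which adds flexibility. Once it is enabled, each inserted row may carry arbitrary extra key-value pairs, and Milvus stores them in a reserved JSON field named $meta. Dynamic fields can be returned through output_fields and used in filter expressions. Whether they can be indexed depends on the Milvus version: older releases offer no index on dynamic fields, so filters on them are slower than filters on indexed schema fields. This suits data whose fields are not fixed, such as user-defined attributes and metadata. Expect a small performance cost. Dynamic schema is normally enabled when the collection is created, by setting enable_dynamic_field=True.
b.Code example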
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
# 创建启用动态Schema的Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(
fields=fields,
description="动态Schema示例",
enable_dynamic_field=True # 启用动态字段
)
collection = Collection("dynamic_collection", schema=schema)
print(f"动态Schema已启用: {schema.enable_dynamic_field}")
# Insert rows that carry dynamic fields.
# Dynamic fields need row-based input: one dict per entity, where every key that
# is not declared in the schema is stored as a dynamic field.
dynamic_rows = [
    {"title": "文档1", "category": "技术", "tags": ["AI", "ML"]},
    {"title": "文档2", "author": "张三", "rating": 4.5},
    {"title": "文档3", "category": "科学", "year": 2024},
    {"title": "文档4", "price": 99.99, "stock": 100},
    {"title": "文档5", "description": "这是一个测试文档"}
]
data = [
    {"id": i, "embedding": [np.random.random() for _ in range(128)], **extra}
    for i, extra in enumerate(dynamic_rows, start=1)
]
collection.insert(data)
collection.flush()
print(f"\nInserted {collection.num_entities} entities")
# 创建索引并加载
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# 查询动态字段
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=5,
output_fields=["id", "title", "category", "author"] # 包含动态字段
)
print("\n查询结果(包含动态字段):")
for hit in results[0]:
print(f" ID: {hit.id}")
print(f" 标题: {hit.entity.get('title')}")
print(f" 类别: {hit.entity.get('category')}")
print(f" 作者: {hit.entity.get('author')}")
print()
# 过滤动态字段
results_filtered = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=5,
expr='category == "技术"', # 过滤动态字段
output_fields=["id", "title", "category"]
)
print("过滤动态字段(category == '技术'):")
for hit in results_filtered[0]:
print(f" {hit.entity.get('title')}: {hit.entity.get('category')}")
---
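To make the storage model concrete, the sketch below mimics in plain Python how a row can be split into declared schema fields and a `$meta` JSON payload. It is a conceptual model only, not Milvus's actual implementation; `split_row` and `SCHEMA_FIELDS` are hypothetical names introduced for illustration.

```python
import json

# Hypothetical: the field names declared in the collection schema.
SCHEMA_FIELDS = {"id", "embedding"}


def split_row(row, schema_fields=SCHEMA_FIELDS):
    """Split one row into (declared fields, JSON text holding undeclared fields)."""
    declared = {k: v for k, v in row.items() if k in schema_fields}
    dynamic = {k: v for k, v in row.items() if k not in schema_fields}
    return declared, json.dumps(dynamic, ensure_ascii=False, sort_keys=True)


row = {"id": 1, "embedding": [0.1, 0.2], "title": "doc-1", "rating": 4.5}
declared, meta_json = split_row(row)
print(declared)   # the columns the schema knows about
print(meta_json)  # everything else, serialized as one JSON document
```

Because every undeclared key ends up inside one JSON document, two rows can carry entirely different dynamic fields without any schema change.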
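b.Managing dynamic fields
a.Description
Managing dynamic fields means balancing data consistency against query performance. Different records may carry different sets of dynamic fields, so any reader must handle missing keys. Filters on dynamic fields are generally slower than filters on indexed schema fields, so declare frequently queried fields in the schema and reserve dynamic fields for metadata that is queried rarely. Use output_fields to choose which dynamic fields a search returns.
b.Code example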
---
from pymilvus import Collection
import numpy as np
collection = Collection("dynamic_collection")
collection.load()
# 动态字段管理类
class DynamicFieldManager:
def __init__(self, collection):
self.collection = collection
self.field_usage = {}
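    def insert_with_dynamic_fields(self, ids, embeddings, dynamic_data):
        """Insert rows that carry dynamic fields."""
        # Track how often each dynamic field appears
        for record in dynamic_data:
            for field_name in record.keys():
                self.field_usage[field_name] = self.field_usage.get(field_name, 0) + 1
        # Dynamic fields need row-based input: merge each record into one dict per entity
        rows = [
            {"id": pk, "embedding": vec, **extra}
            for pk, vec, extra in zip(ids, embeddings, dynamic_data)
        ]
        self.collection.insert(rows)
        self.collection.flush()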
def query_dynamic_fields(self, query_vector, search_params, fields=None):
"""查询动态字段"""
# 如果未指定字段,返回所有常用字段
if fields is None:
fields = self.get_common_fields(threshold=0.5)
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=10,
output_fields=["id"] + fields
)
return results[0]
    def get_common_fields(self, threshold=0.5):
        """Return dynamic fields whose coverage is at least `threshold`."""
        # Note: field_usage only counts rows inserted through this manager,
        # while num_entities counts every row in the collection.
        total_records = self.collection.num_entities
        if total_records == 0:
            return []
        return [
            field_name
            for field_name, count in self.field_usage.items()
            if count / total_records >= threshold
        ]
def get_field_statistics(self):
"""获取字段统计信息"""
total_records = self.collection.num_entities
stats = {}
for field_name, count in self.field_usage.items():
stats[field_name] = {
"count": count,
"coverage": count / total_records * 100 if total_records > 0 else 0
}
return stats
def recommend_schema_fields(self, threshold=0.8):
"""推荐应该加入Schema的字段"""
stats = self.get_field_statistics()
recommendations = []
for field_name, stat in stats.items():
if stat["coverage"] >= threshold * 100:
recommendations.append({
"field": field_name,
"coverage": stat["coverage"],
"reason": f"字段覆盖率 {stat['coverage']:.1f}%,建议加入Schema并创建索引"
})
return recommendations
# 使用动态字段管理器
manager = DynamicFieldManager(collection)
# 插入更多数据
new_ids = [10, 11, 12, 13, 14]
new_embeddings = [[np.random.random() for _ in range(128)] for _ in range(5)]
new_dynamic_data = [
{"title": "文档10", "category": "技术", "views": 1000},
{"title": "文档11", "category": "科学", "views": 500},
{"title": "文档12", "category": "技术", "views": 800},
{"title": "文档13", "category": "艺术", "views": 300},
{"title": "文档14", "category": "技术", "views": 1200}
]
manager.insert_with_dynamic_fields(new_ids, new_embeddings, new_dynamic_data)
# 获取字段统计
stats = manager.get_field_statistics()
print("\n动态字段统计:")
for field_name, stat in sorted(stats.items(), key=lambda x: x[1]["coverage"], reverse=True):
print(f" {field_name}: {stat['count']} 次, 覆盖率 {stat['coverage']:.1f}%")
# 获取常用字段
common_fields = manager.get_common_fields(threshold=0.5)
print(f"\n常用字段 (覆盖率 > 50%): {common_fields}")
# 推荐Schema字段
recommendations = manager.recommend_schema_fields(threshold=0.8)
if recommendations:
print("\nSchema优化建议:")
for rec in recommendations:
print(f" {rec['field']}: {rec['reason']}")
# 查询动态字段
query_vector = [np.random.random() for _ in range(128)]
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
results = manager.query_dynamic_fields(query_vector, search_params, fields=["title", "category", "views"])
print("\n查询结果:")
for hit in results[:5]:
print(f" {hit.entity.get('title')}: {hit.entity.get('category')}, 浏览 {hit.entity.get('views', 'N/A')}")
---
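The two management concerns named above, handling missing fields and spotting fields worth promoting into the schema, can be illustrated without a running Milvus instance. The following is a minimal sketch over plain dicts; `field_coverage`, `with_defaults`, and the 0.8 cutoff are illustrative assumptions, not part of pymilvus.

```python
from collections import Counter


def field_coverage(rows):
    """Return {field: fraction of rows that contain it}."""
    if not rows:
        return {}
    counts = Counter(key for row in rows for key in row)
    return {field: n / len(rows) for field, n in counts.items()}


def with_defaults(row, defaults):
    """Return a copy of `row` with missing keys filled in from `defaults`."""
    return {**defaults, **row}


rows = [
    {"title": "doc-10", "category": "tech", "views": 1000},
    {"title": "doc-11", "author": "Zhang San"},
    {"title": "doc-12", "category": "science"},
    {"title": "doc-13"},
]
coverage = field_coverage(rows)
# Fields present in at least 80% of rows are candidates for the schema
promote = sorted(f for f, c in coverage.items() if c >= 0.8)
filled = [with_defaults(r, {"category": "unknown", "views": 0}) for r in rows]
print(coverage)
print(promote)
print(filled[3])
```

Filling defaults at read time keeps downstream code free of per-field existence checks, at the cost of hiding which values were actually stored.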
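7.4 Time Travel
01.Time Travel concepts
a.Timestamp mechanism
a.Description
Time Travel queries a past state of the data by timestamp. Milvus assigns a timestamp to every operation and records the change history, so a search can name a point in time and see the data as it was at that moment. This suits auditing, retrospective analysis, and version comparison. Time Travel only changes the query view; it never alters current data. How far back you can go is bounded by a configurable retention period, after which historical data is cleaned up. Note that this feature and the travel_timestamp search parameter belong to the Milvus 2.2.x line; newer releases no longer support Time Travel, so check the documentation for the version you run before relying on the examples in this section.
b.Code example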
---
from pymilvus import Collection, utility
import numpy as np
import time
collection = Collection("documents")
collection.load()
# 获取当前时间戳
current_ts = utility.mkts_from_unixtime(time.time())
print(f"当前时间戳: {current_ts}")
# 插入初始数据
initial_data = [
[1, 2, 3],
[f"文档{i}_v1" for i in [1, 2, 3]],
[[np.random.random() for _ in range(128)] for _ in range(3)]
]
collection.insert(initial_data)
collection.flush()
ts_after_insert = utility.mkts_from_unixtime(time.time())
print(f"插入后时间戳: {ts_after_insert}")
# 等待一段时间
time.sleep(2)
# 更新数据(通过删除和重新插入)
collection.delete(expr="id in [1, 2]")
update_data = [
[1, 2],
[f"文档{i}_v2" for i in [1, 2]],
[[np.random.random() for _ in range(128)] for _ in range(2)]
]
collection.insert(update_data)
collection.flush()
ts_after_update = utility.mkts_from_unixtime(time.time())
print(f"更新后时间戳: {ts_after_update}")
# 查询当前状态
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
results_current = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
output_fields=["id", "title"]
)
print("\n当前状态查询:")
for hit in results_current[0]:
print(f" ID: {hit.id}, 标题: {hit.entity.get('title')}")
# 时间旅行:查询插入后、更新前的状态
results_past = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10,
travel_timestamp=ts_after_insert, # 指定历史时间点
output_fields=["id", "title"]
)
print(f"\n历史状态查询(时间戳: {ts_after_insert}):")
for hit in results_past[0]:
print(f" ID: {hit.id}, 标题: {hit.entity.get('title')}")
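# Timestamp conversion
print("\nTimestamp conversion:")
unix_time = time.time()
milvus_ts = utility.mkts_from_unixtime(unix_time)
print(f" Unix time: {unix_time}")
print(f" Milvus timestamp: {milvus_ts}")
# Convert back to Unix time.
# A Milvus timestamp is a hybrid timestamp, not a nanosecond count: the high bits
# hold physical time in milliseconds and the low 18 bits hold a logical counter.
unix_time_back = utility.hybridts_to_unixtime(milvus_ts)
print(f" Back to Unix time: {unix_time_back}")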
---
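The relation between a Unix time and a Milvus hybrid timestamp can be reproduced in a few lines. This is a sketch of the layout as I understand it from the pymilvus helpers (physical milliseconds shifted left by 18 bits, logical counter in the low bits); `make_hybrid_ts` and `split_hybrid_ts` are hypothetical names, and in real code the `utility` helpers should be preferred.

```python
LOGICAL_BITS = 18  # low bits reserved for the logical counter (assumed layout)
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1


def make_hybrid_ts(unix_seconds, logical=0):
    """Pack Unix seconds and a logical counter into one integer timestamp."""
    physical_ms = int(unix_seconds * 1000)
    return (physical_ms << LOGICAL_BITS) | (logical & LOGICAL_MASK)


def split_hybrid_ts(ts):
    """Unpack a hybrid timestamp into (Unix seconds, logical counter)."""
    return (ts >> LOGICAL_BITS) / 1000.0, ts & LOGICAL_MASK


ts = make_hybrid_ts(1700000000.5, logical=7)
unix_seconds, logical = split_hybrid_ts(ts)
print(ts, unix_seconds, logical)
```

Dividing the raw timestamp by 10^9 therefore does not recover Unix seconds; the logical bits must be shifted away first.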
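b.Historical queries
a.Description
A historical query reads the data as it was at a specific point in time, selected with the travel_timestamp parameter. Comparing results at different timestamps shows how the data changed, which suits data auditing, error recovery, and A/B testing. Historical queries perform about as well as current ones. Mind the retention policy: data older than the retention period can no longer be queried.
b.Code example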
---
from pymilvus import Collection, utility
import numpy as np
import time
from datetime import datetime
collection = Collection("documents")
collection.load()
# 历史查询管理类
class TimeTravelManager:
def __init__(self, collection):
self.collection = collection
self.snapshots = {}
def create_snapshot(self, name):
"""创建快照"""
timestamp = utility.mkts_from_unixtime(time.time())
self.snapshots[name] = {
"timestamp": timestamp,
"unix_time": time.time(),
"datetime": datetime.now().isoformat()
}
print(f"创建快照: {name} (时间戳: {timestamp})")
return timestamp
def list_snapshots(self):
"""列出所有快照"""
print("\n快照列表:")
for name, info in self.snapshots.items():
print(f" {name}:")
print(f" 时间戳: {info['timestamp']}")
print(f" 时间: {info['datetime']}")
def query_at_snapshot(self, snapshot_name, query_vector, search_params, limit=10):
"""在指定快照时间点查询"""
if snapshot_name not in self.snapshots:
raise ValueError(f"快照不存在: {snapshot_name}")
timestamp = self.snapshots[snapshot_name]["timestamp"]
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit,
travel_timestamp=timestamp,
output_fields=["id", "title"]
)
return results[0]
def compare_snapshots(self, snapshot1, snapshot2, query_vector, search_params):
"""对比两个快照的查询结果"""
results1 = self.query_at_snapshot(snapshot1, query_vector, search_params)
results2 = self.query_at_snapshot(snapshot2, query_vector, search_params)
ids1 = set(hit.id for hit in results1)
ids2 = set(hit.id for hit in results2)
added = ids2 - ids1
removed = ids1 - ids2
common = ids1 & ids2
comparison = {
"snapshot1": snapshot1,
"snapshot2": snapshot2,
"added": list(added),
"removed": list(removed),
"common": list(common)
}
return comparison
def rollback_view(self, snapshot_name):
"""回滚到指定快照(只是查询视图,不修改数据)"""
if snapshot_name not in self.snapshots:
raise ValueError(f"快照不存在: {snapshot_name}")
timestamp = self.snapshots[snapshot_name]["timestamp"]
print(f"\n回滚视图到快照: {snapshot_name}")
print(f" 时间戳: {timestamp}")
print(f" 时间: {self.snapshots[snapshot_name]['datetime']}")
return timestamp
# 使用时间旅行管理器
tt_manager = TimeTravelManager(collection)
# 创建初始快照
tt_manager.create_snapshot("initial")
# 插入数据
data1 = [
[100, 101, 102],
["文档A", "文档B", "文档C"],
[[np.random.random() for _ in range(128)] for _ in range(3)]
]
collection.insert(data1)
collection.flush()
time.sleep(1)
tt_manager.create_snapshot("after_insert_1")
# 插入更多数据
data2 = [
[103, 104],
["文档D", "文档E"],
[[np.random.random() for _ in range(128)] for _ in range(2)]
]
collection.insert(data2)
collection.flush()
time.sleep(1)
tt_manager.create_snapshot("after_insert_2")
# 删除数据
collection.delete(expr="id in [100, 101]")
collection.flush()
time.sleep(1)
tt_manager.create_snapshot("after_delete")
# 列出快照
tt_manager.list_snapshots()
# 查询不同时间点
query_vector = [np.random.random() for _ in range(128)]
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
print("\n不同时间点的查询结果:")
for snapshot_name in ["initial", "after_insert_1", "after_insert_2", "after_delete"]:
try:
results = tt_manager.query_at_snapshot(snapshot_name, query_vector, search_params, limit=10)
print(f"\n{snapshot_name}: {len(results)} 条结果")
for hit in results[:3]:
print(f" ID: {hit.id}, 标题: {hit.entity.get('title')}")
except Exception as e:
print(f"\n{snapshot_name}: 查询失败 - {e}")
# 对比快照
comparison = tt_manager.compare_snapshots("after_insert_1", "after_delete", query_vector, search_params)
print(f"\n快照对比:")
print(f" 新增ID: {comparison['added']}")
print(f" 删除ID: {comparison['removed']}")
print(f" 保留ID: {comparison['common']}")
---
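The visibility rule behind timestamp-based history can be sketched with a toy in-memory store: a row version is visible at time t if it was inserted at or before t and not yet deleted at t. This illustrates the idea only and is not how Milvus is implemented; `VersionedStore` and its methods are hypothetical.

```python
import math


class VersionedStore:
    """Toy store where each row version has an insert and a delete timestamp."""

    def __init__(self):
        self._versions = []  # each entry: [pk, value, insert_ts, delete_ts]

    def insert(self, pk, value, ts):
        self._versions.append([pk, value, ts, math.inf])

    def delete(self, pk, ts):
        # Mark the live version of this key as deleted at `ts`
        for version in self._versions:
            if version[0] == pk and version[3] == math.inf:
                version[3] = ts

    def snapshot(self, ts):
        """Return {pk: value} as visible at time `ts`."""
        return {pk: value for pk, value, ins, dele in self._versions if ins <= ts < dele}


store = VersionedStore()
store.insert(100, "doc-A", ts=1)
store.insert(101, "doc-B", ts=1)
store.insert(103, "doc-D", ts=2)
store.delete(100, ts=3)

before, after = store.snapshot(2), store.snapshot(3)
added = sorted(set(after) - set(before))
removed = sorted(set(before) - set(after))
print(before, after, added, removed)
```

Nothing is ever overwritten in this model, which is why old snapshots stay readable until a cleanup step discards versions older than the retention period.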
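02.Use cases
a.Data auditing
a.Description
Time Travel supports data auditing by tracing the history of data changes. You can query the state at any retained point in time to verify data integrity, compare the differences between two points, and so locate anomalies or mistaken operations. This suits compliance audits and security reviews, and it informs recovery and rollback decisions. Configure a retention period long enough to cover the audit window.
b.Code example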
---
from pymilvus import Collection, utility
import numpy as np
import time
from datetime import datetime
collection = Collection("documents")
collection.load()
# 数据审计类
class DataAuditor:
def __init__(self, collection):
self.collection = collection
self.audit_log = []
def log_operation(self, operation, details):
"""记录操作日志"""
timestamp = utility.mkts_from_unixtime(time.time())
log_entry = {
"timestamp": timestamp,
"unix_time": time.time(),
"datetime": datetime.now().isoformat(),
"operation": operation,
"details": details
}
self.audit_log.append(log_entry)
print(f"[审计] {operation}: {details}")
return timestamp
def insert_with_audit(self, data):
"""带审计的插入"""
timestamp_before = self.log_operation("INSERT_START", f"{len(data[0])} 条记录")
self.collection.insert(data)
self.collection.flush()
timestamp_after = self.log_operation("INSERT_COMPLETE", f"{len(data[0])} 条记录")
return timestamp_before, timestamp_after
def delete_with_audit(self, expr):
"""带审计的删除"""
timestamp_before = self.log_operation("DELETE_START", expr)
# 先查询要删除的数据
# 这里简化,实际应该查询并记录
self.collection.delete(expr)
self.collection.flush()
timestamp_after = self.log_operation("DELETE_COMPLETE", expr)
return timestamp_before, timestamp_after
def verify_data_integrity(self, expected_count, timestamp=None):
"""验证数据完整性"""
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
search_kwargs = {
"data": query_vector,
"anns_field": "embedding",
"param": search_params,
"limit": 10000
}
if timestamp:
search_kwargs["travel_timestamp"] = timestamp
results = self.collection.search(**search_kwargs)
actual_count = len(results[0])
is_valid = actual_count >= expected_count * 0.9 # 允许10%误差
self.log_operation(
"INTEGRITY_CHECK",
f"预期: {expected_count}, 实际: {actual_count}, 结果: {'通过' if is_valid else '失败'}"
)
return is_valid, actual_count
def generate_audit_report(self):
"""生成审计报告"""
print("\n" + "="*60)
print("数据审计报告")
print("="*60)
print(f"\n总操作数: {len(self.audit_log)}")
# 按操作类型统计
operation_counts = {}
for entry in self.audit_log:
op = entry["operation"]
operation_counts[op] = operation_counts.get(op, 0) + 1
print("\n操作统计:")
for op, count in sorted(operation_counts.items()):
print(f" {op}: {count} 次")
print("\n操作时间线:")
for entry in self.audit_log:
print(f" [{entry['datetime']}] {entry['operation']}: {entry['details']}")
return {
"total_operations": len(self.audit_log),
"operation_counts": operation_counts,
"audit_log": self.audit_log
}
def rollback_analysis(self, target_timestamp):
"""回滚分析"""
print(f"\n回滚分析(目标时间戳: {target_timestamp}):")
# 找到目标时间戳之后的操作
operations_to_rollback = [
entry for entry in self.audit_log
if entry["timestamp"] > target_timestamp
]
print(f" 需要回滚的操作数: {len(operations_to_rollback)}")
for entry in operations_to_rollback:
print(f" [{entry['datetime']}] {entry['operation']}: {entry['details']}")
return operations_to_rollback
# 使用数据审计器
auditor = DataAuditor(collection)
# 执行一系列操作
print("执行审计操作:\n")
# 插入数据
data1 = [
[200, 201, 202],
["审计文档A", "审计文档B", "审计文档C"],
[[np.random.random() for _ in range(128)] for _ in range(3)]
]
ts_insert1_before, ts_insert1_after = auditor.insert_with_audit(data1)
time.sleep(1)
# 验证完整性
auditor.verify_data_integrity(expected_count=3, timestamp=ts_insert1_after)
time.sleep(1)
# 插入更多数据
data2 = [
[203, 204],
["审计文档D", "审计文档E"],
[[np.random.random() for _ in range(128)] for _ in range(2)]
]
ts_insert2_before, ts_insert2_after = auditor.insert_with_audit(data2)
time.sleep(1)
# 删除数据
ts_delete_before, ts_delete_after = auditor.delete_with_audit("id in [200, 201]")
time.sleep(1)
# 验证完整性
auditor.verify_data_integrity(expected_count=3)
# 生成审计报告
report = auditor.generate_audit_report()
# 回滚分析
auditor.rollback_analysis(target_timestamp=ts_insert1_after)
print("\n审计应用场景:")
print(" 1. 合规审计: 追踪所有数据变更")
print(" 2. 安全审查: 发现异常操作")
print(" 3. 错误恢复: 定位问题时间点")
print(" 4. 数据验证: 验证数据完整性")
print(" 5. 回滚决策: 分析回滚影响")
---
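b.Version comparison
a.Description
Time Travel also supports version comparison: examining how the data differs between two points in time. You can compare data content, query results, or summary statistics. This suits A/B testing, algorithm comparison, and data quality assessment, helps explain how the data evolved, and can back regression tests. The differences can also be visualized.
b.Code example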
---
from pymilvus import Collection, utility
import numpy as np
import time
collection = Collection("documents")
collection.load()
# 版本对比类
class VersionComparator:
def __init__(self, collection):
self.collection = collection
self.versions = {}
def create_version(self, version_name):
"""创建版本"""
timestamp = utility.mkts_from_unixtime(time.time())
self.versions[version_name] = timestamp
print(f"创建版本: {version_name} (时间戳: {timestamp})")
return timestamp
def compare_query_results(self, version1, version2, query_vector, search_params, limit=10):
"""对比两个版本的查询结果"""
if version1 not in self.versions or version2 not in self.versions:
raise ValueError("版本不存在")
# 查询版本1
results1 = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit,
travel_timestamp=self.versions[version1],
output_fields=["id", "title"]
)
# 查询版本2
results2 = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit,
travel_timestamp=self.versions[version2],
output_fields=["id", "title"]
)
# 对比结果
ids1 = [hit.id for hit in results1[0]]
ids2 = [hit.id for hit in results2[0]]
comparison = {
"version1": version1,
"version2": version2,
"results1": ids1,
"results2": ids2,
"intersection": list(set(ids1) & set(ids2)),
"only_in_v1": list(set(ids1) - set(ids2)),
"only_in_v2": list(set(ids2) - set(ids1)),
"similarity": len(set(ids1) & set(ids2)) / max(len(ids1), len(ids2)) if max(len(ids1), len(ids2)) > 0 else 0
}
return comparison
def print_comparison(self, comparison):
"""打印对比结果"""
print(f"\n版本对比: {comparison['version1']} vs {comparison['version2']}")
print(f" 相似度: {comparison['similarity']*100:.1f}%")
print(f" 共同结果: {len(comparison['intersection'])} 个")
print(f" 仅在{comparison['version1']}: {len(comparison['only_in_v1'])} 个")
print(f" 仅在{comparison['version2']}: {len(comparison['only_in_v2'])} 个")
if comparison['only_in_v1']:
print(f"\n 仅在{comparison['version1']}的ID: {comparison['only_in_v1'][:5]}")
if comparison['only_in_v2']:
print(f" 仅在{comparison['version2']}的ID: {comparison['only_in_v2'][:5]}")
# 使用版本对比器
comparator = VersionComparator(collection)
# 创建版本
comparator.create_version("v1.0")
# 修改数据...
time.sleep(1)
comparator.create_version("v1.1")
# 对比版本
query_vector = [np.random.random() for _ in range(128)]
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
comparison = comparator.compare_query_results("v1.0", "v1.1", query_vector, search_params)
comparator.print_comparison(comparison)
print("\n版本对比应用:")
print(" 1. A/B测试: 对比不同算法效果")
print(" 2. 数据质量: 评估数据变更影响")
print(" 3. 回归测试: 验证系统升级")
print(" 4. 性能分析: 对比不同配置")
---
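Beyond counting shared IDs, two result lists can be compared by set overlap and by how far each shared result moved. The sketch below computes a Jaccard index and per-ID rank shifts on plain lists of IDs; `overlap_metrics` is a hypothetical helper, not part of the comparator class.

```python
def overlap_metrics(ids_v1, ids_v2):
    """Compare two ranked ID lists: Jaccard overlap and rank shift of shared IDs."""
    s1, s2 = set(ids_v1), set(ids_v2)
    common = s1 & s2
    union = s1 | s2
    jaccard = len(common) / len(union) if union else 1.0
    # Positive shift means the ID ranks lower (further down) in the second list
    rank_shift = {i: ids_v2.index(i) - ids_v1.index(i) for i in common}
    return {"jaccard": jaccard, "common": sorted(common), "rank_shift": rank_shift}


metrics = overlap_metrics([1, 2, 3, 4], [2, 1, 5, 4])
print(metrics)
```

Jaccard penalizes results that appear in only one version, while the rank shifts show reordering that a pure set comparison would miss.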
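7.5 Hybrid Search
01.How hybrid search works
a.Multi-path recall
a.Description
Hybrid search combines several retrieval methods to improve recall. Typical recall paths are vector search, full-text search, and scalar filtering, and each path can carry its own weight. A fusion algorithm then merges the per-path results into one ranking. This suits complex queries such as semantic plus keyword search, where it can improve accuracy and user satisfaction, but the fusion strategy needs careful design.
b.Code example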
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
collection.load()
# 多路召回类
class MultiRecallSearch:
def __init__(self, collection):
self.collection = collection
def vector_recall(self, query_vector, search_params, limit=50):
"""向量召回"""
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit,
output_fields=["id", "title"]
)
# 转换为字典格式
recall_results = {}
for hit in results[0]:
recall_results[hit.id] = {
"score": 1 / (1 + hit.distance), # 距离转分数
"title": hit.entity.get("title"),
"source": "vector"
}
return recall_results
def keyword_recall(self, keywords, limit=50):
"""关键词召回(通过标量过滤模拟)"""
# 构建关键词过滤表达式
keyword_expr = " or ".join([f'title like "%{kw}%"' for kw in keywords])
# 使用随机向量进行搜索,主要依赖过滤
query_vector = [np.random.random() for _ in range(128)]
try:
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=limit,
expr=keyword_expr,
output_fields=["id", "title"]
)
recall_results = {}
for hit in results[0]:
# 计算关键词匹配分数
title = hit.entity.get("title", "")
match_count = sum(1 for kw in keywords if kw in title)
score = match_count / len(keywords) if keywords else 0
recall_results[hit.id] = {
"score": score,
"title": title,
"source": "keyword"
}
return recall_results
except Exception as e:
print(f"关键词召回失败: {e}")
return {}
def category_recall(self, category, limit=50):
"""类别召回"""
query_vector = [np.random.random() for _ in range(128)]
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=limit,
expr=f'category == "{category}"',
output_fields=["id", "title", "category"]
)
recall_results = {}
for hit in results[0]:
recall_results[hit.id] = {
"score": 1.0, # 类别匹配给固定分数
"title": hit.entity.get("title"),
"category": hit.entity.get("category"),
"source": "category"
}
return recall_results
def hybrid_recall(self, query_vector, keywords=None, category=None, weights=None):
"""混合召回"""
if weights is None:
weights = {"vector": 0.6, "keyword": 0.3, "category": 0.1}
all_results = {}
# 向量召回
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
vector_results = self.vector_recall(query_vector, search_params, limit=50)
for doc_id, info in vector_results.items():
all_results[doc_id] = {
"scores": {"vector": info["score"]},
"title": info["title"],
"sources": ["vector"]
}
# 关键词召回
if keywords:
keyword_results = self.keyword_recall(keywords, limit=50)
for doc_id, info in keyword_results.items():
if doc_id in all_results:
all_results[doc_id]["scores"]["keyword"] = info["score"]
all_results[doc_id]["sources"].append("keyword")
else:
all_results[doc_id] = {
"scores": {"keyword": info["score"]},
"title": info["title"],
"sources": ["keyword"]
}
# 类别召回
if category:
category_results = self.category_recall(category, limit=50)
for doc_id, info in category_results.items():
if doc_id in all_results:
all_results[doc_id]["scores"]["category"] = info["score"]
all_results[doc_id]["sources"].append("category")
else:
all_results[doc_id] = {
"scores": {"category": info["score"]},
"title": info["title"],
"sources": ["category"]
}
# 计算加权总分
for doc_id in all_results:
total_score = 0
for source, weight in weights.items():
if source in all_results[doc_id]["scores"]:
total_score += weight * all_results[doc_id]["scores"][source]
all_results[doc_id]["total_score"] = total_score
# 排序
sorted_results = sorted(
all_results.items(),
key=lambda x: x[1]["total_score"],
reverse=True
)
return sorted_results[:20]
# 使用多路召回
multi_recall = MultiRecallSearch(collection)
query_vector = [np.random.random() for _ in range(128)]
keywords = ["技术", "AI"]
category = "电子产品"
print("混合召回搜索:\n")
results = multi_recall.hybrid_recall(
query_vector=query_vector,
keywords=keywords,
category=category,
weights={"vector": 0.5, "keyword": 0.3, "category": 0.2}
)
print(f"{'排名':>4s} {'ID':>8s} {'总分':>8s} {'来源':>20s} {'标题':>30s}")
print("-" * 75)
for rank, (doc_id, info) in enumerate(results[:10], 1):
sources = ", ".join(info["sources"])
title = info["title"][:28] if info.get("title") else "N/A"
print(f"{rank:4d} {doc_id:8d} {info['total_score']:8.3f} {sources:>20s} {title:>30s}")
---
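Weighted fusion only behaves as intended when the recall paths produce scores on a comparable scale. One common remedy is to min-max normalize each path before weighting; the sketch below shows that step on hand-made scores. `min_max_normalize` and `fuse` are hypothetical helpers, and min-max is one choice among several.

```python
def min_max_normalize(scores):
    """Rescale {doc_id: score} to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(channels, weights):
    """Weighted sum of per-channel normalized scores, best first."""
    fused = {}
    for name, scores in channels.items():
        for doc_id, s in min_max_normalize(scores).items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weights.get(name, 0.0) * s
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


channels = {
    "vector": {1: 0.02, 2: 0.0125, 3: 0.005},  # small raw scores
    "keyword": {2: 1.0, 4: 0.5},               # much larger raw scores
}
ranking = fuse(channels, {"vector": 0.6, "keyword": 0.4})
print(ranking)
```

Without normalization the keyword path would dominate regardless of the weights, simply because its raw scores are larger.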
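b.Fusion strategies
a.Description
The fusion strategy determines how results from multiple recall paths are merged. Common choices are weighted averaging, Reciprocal Rank Fusion (RRF), and CombSUM. Weights set the relative importance of each path and should be tuned for the business scenario; they can also be optimized with machine learning. A fusion strategy should take each result's rank position into account, not only its raw score. Choose the method by experiment.
b.Code example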
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
# 融合策略类
class FusionStrategy:
@staticmethod
def weighted_sum(results_dict, weights):
"""加权求和融合"""
fused_scores = {}
for source, results in results_dict.items():
weight = weights.get(source, 0)
for doc_id, score in results.items():
if doc_id not in fused_scores:
fused_scores[doc_id] = 0
fused_scores[doc_id] += weight * score
return fused_scores
@staticmethod
def rrf(results_dict, k=60):
"""Reciprocal Rank Fusion"""
fused_scores = {}
for source, results in results_dict.items():
# 按分数排序获取排名
ranked = sorted(results.items(), key=lambda x: x[1], reverse=True)
for rank, (doc_id, score) in enumerate(ranked):
if doc_id not in fused_scores:
fused_scores[doc_id] = 0
fused_scores[doc_id] += 1 / (k + rank + 1)
return fused_scores
@staticmethod
def comb_sum(results_dict):
"""CombSUM: 简单求和"""
fused_scores = {}
for source, results in results_dict.items():
for doc_id, score in results.items():
if doc_id not in fused_scores:
fused_scores[doc_id] = 0
fused_scores[doc_id] += score
return fused_scores
@staticmethod
def comb_max(results_dict):
"""CombMAX: 取最大值"""
fused_scores = {}
for source, results in results_dict.items():
for doc_id, score in results.items():
if doc_id not in fused_scores:
fused_scores[doc_id] = score
else:
fused_scores[doc_id] = max(fused_scores[doc_id], score)
return fused_scores
@staticmethod
def adaptive_fusion(results_dict, quality_scores):
"""自适应融合:根据召回质量动态调整权重"""
# 归一化质量分数
total_quality = sum(quality_scores.values())
weights = {
source: quality / total_quality
for source, quality in quality_scores.items()
}
return FusionStrategy.weighted_sum(results_dict, weights)
# 测试不同融合策略
print("融合策略对比:\n")
# 模拟多路召回结果
vector_results = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.5}
keyword_results = {2: 0.95, 3: 0.85, 6: 0.75, 7: 0.65}
category_results = {1: 1.0, 4: 1.0, 8: 1.0}
results_dict = {
"vector": vector_results,
"keyword": keyword_results,
"category": category_results
}
# 加权求和
weights = {"vector": 0.5, "keyword": 0.3, "category": 0.2}
weighted_scores = FusionStrategy.weighted_sum(results_dict, weights)
print("加权求和融合:")
for doc_id, score in sorted(weighted_scores.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" 文档{doc_id}: {score:.3f}")
# RRF
rrf_scores = FusionStrategy.rrf(results_dict, k=60)
print("\nRRF融合:")
for doc_id, score in sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" 文档{doc_id}: {score:.3f}")
# CombSUM
combsum_scores = FusionStrategy.comb_sum(results_dict)
print("\nCombSUM融合:")
for doc_id, score in sorted(combsum_scores.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" 文档{doc_id}: {score:.3f}")
# CombMAX
combmax_scores = FusionStrategy.comb_max(results_dict)
print("\nCombMAX融合:")
for doc_id, score in sorted(combmax_scores.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" 文档{doc_id}: {score:.3f}")
# 自适应融合
quality_scores = {"vector": 0.8, "keyword": 0.6, "category": 0.9}
adaptive_scores = FusionStrategy.adaptive_fusion(results_dict, quality_scores)
print("\n自适应融合:")
for doc_id, score in sorted(adaptive_scores.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" 文档{doc_id}: {score:.3f}")
print("\n融合策略选择建议:")
print(" 加权求和: 适合权重明确的场景")
print(" RRF: 适合不同度量的结果融合")
print(" CombSUM: 简单快速,适合相同度量")
print(" CombMAX: 强调最佳匹配")
print(" 自适应: 根据召回质量动态调整")
---
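02.Applications
a.Semantic plus keyword search
a.Description
Semantic plus keyword search combines the semantic understanding of vectors with the exactness of keyword matching. Vector search captures semantic similarity, while keyword search guarantees literal matches. This suits search engines and document retrieval, where it can improve accuracy and user satisfaction. The two signals need sensible weights. Keyword matching can act either as a hard constraint (a filter) or as a soft bonus added to the score.
b.Code example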
---
from pymilvus import Collection
import numpy as np
collection = Collection("documents")
collection.load()
# 语义+关键词搜索类
class SemanticKeywordSearch:
def __init__(self, collection):
self.collection = collection
def search(self, query_text, query_vector, keywords=None, mode="soft"):
"""
混合搜索
mode: "soft" (软约束,关键词加分) 或 "hard" (硬约束,必须包含关键词)
"""
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
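        if mode == "hard" and keywords:
            # Hard constraint: the title must contain at least one keyword.
            # The clauses are joined with "or"; join with " and " to require all of them.
            # Infix patterns such as "%kw%" need a Milvus version that supports them.
            keyword_expr = " or ".join([f'title like "%{kw}%"' for kw in keywords])
            results = self.collection.search(
                data=[query_vector],
                anns_field="embedding",
                param=search_params,
                limit=20,
                expr=keyword_expr,
                output_fields=["id", "title"]
            )
            return [(hit.id, hit.entity.get("title"), hit.distance) for hit in results[0]]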
else:
# 软约束:关键词加分
# 先进行向量搜索
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=50,
output_fields=["id", "title"]
)
# 计算综合分数
scored_results = []
for hit in results[0]:
title = hit.entity.get("title", "")
# 向量分数(距离转相似度)
vector_score = 1 / (1 + hit.distance)
# 关键词匹配分数
keyword_score = 0
if keywords:
match_count = sum(1 for kw in keywords if kw in title)
keyword_score = match_count / len(keywords)
# 综合分数(可调整权重)
total_score = 0.7 * vector_score + 0.3 * keyword_score
scored_results.append((hit.id, title, total_score, vector_score, keyword_score))
# 按综合分数排序
scored_results.sort(key=lambda x: x[2], reverse=True)
return scored_results[:20]
# 使用语义+关键词搜索
sk_search = SemanticKeywordSearch(collection)
query_text = "人工智能机器学习"
query_vector = [np.random.random() for _ in range(128)]
keywords = ["AI", "机器学习"]
print("软约束模式(关键词加分):\n")
results_soft = sk_search.search(query_text, query_vector, keywords, mode="soft")
print(f"{'排名':>4s} {'ID':>8s} {'总分':>8s} {'向量分':>10s} {'关键词分':>10s} {'标题':>30s}")
print("-" * 75)
for rank, (doc_id, title, total, vector, keyword) in enumerate(results_soft[:10], 1):
title_short = title[:28] if title else "N/A"
print(f"{rank:4d} {doc_id:8d} {total:8.3f} {vector:10.3f} {keyword:10.3f} {title_short:>30s}")
print("\n硬约束模式(必须包含关键词):\n")
results_hard = sk_search.search(query_text, query_vector, keywords, mode="hard")
print(f"{'排名':>4s} {'ID':>8s} {'距离':>10s} {'标题':>40s}")
print("-" * 65)
for rank, (doc_id, title, distance) in enumerate(results_hard[:10], 1):
title_short = title[:38] if title else "N/A"
print(f"{rank:4d} {doc_id:8d} {distance:10.4f} {title_short:>40s}")
---
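b.Multimodal search
a.Description
Multimodal search combines several modalities such as text, images, and audio. Each modality is encoded by its own embedding model and stored in its own vector field; note that multiple vector fields in one collection require a Milvus version that supports them (2.4 or later). This enables cross-modal retrieval, for example finding images with a text query, and suits multimedia scenarios such as e-commerce and video platforms. The fusion strategy must weigh the modalities against each other, and the metric of each field must be taken into account when scores are combined.
b.Code example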
---
from pymilvus import Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
# 创建多模态Collection
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="text_embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
FieldSchema(name="image_embedding", dtype=DataType.FLOAT_VECTOR, dim=512)
]
schema = CollectionSchema(fields=fields, description="多模态搜索")
multimodal_collection = Collection("multimodal_search", schema=schema)
# 插入多模态数据
data = [
list(range(100)), # ids
[f"商品{i}" for i in range(100)], # titles
[[np.random.random() for _ in range(768)] for _ in range(100)], # text embeddings
[[np.random.random() for _ in range(512)] for _ in range(100)] # image embeddings
]
multimodal_collection.insert(data)
multimodal_collection.flush()
# 创建索引
text_index = {
"index_type": "IVF_FLAT",
"metric_type": "COSINE",
"params": {"nlist": 128}
}
image_index = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128}
}
multimodal_collection.create_index("text_embedding", text_index)
multimodal_collection.create_index("image_embedding", image_index)
multimodal_collection.load()
# 多模态搜索类
class MultimodalSearch:
def __init__(self, collection):
self.collection = collection
def text_search(self, text_vector, limit=50):
"""文本模态搜索"""
results = self.collection.search(
data=[text_vector],
anns_field="text_embedding",
param={"metric_type": "COSINE", "params": {"nprobe": 16}},
limit=limit,
output_fields=["id", "title"]
)
return {hit.id: hit.distance for hit in results[0]}
def image_search(self, image_vector, limit=50):
"""图像模态搜索"""
results = self.collection.search(
data=[image_vector],
anns_field="image_embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=limit,
output_fields=["id", "title"]
)
return {hit.id: hit.distance for hit in results[0]}
def multimodal_search(self, text_vector=None, image_vector=None, weights=None):
"""多模态融合搜索"""
if weights is None:
weights = {"text": 0.5, "image": 0.5}
results_dict = {}
# 文本搜索
if text_vector is not None:
text_results = self.text_search(text_vector)
for doc_id, distance in text_results.items():
score = 1 / (1 + distance) # 转换为相似度分数
results_dict[doc_id] = {"text": score}
# 图像搜索
if image_vector is not None:
image_results = self.image_search(image_vector)
# Normalize L2 distances to [0, 1]; fall back to 1.0 when there are no hits or every distance is 0
max_dist = max(image_results.values(), default=0.0) or 1.0
for doc_id, distance in image_results.items():
score = 1 - (distance / max_dist)
if doc_id in results_dict:
results_dict[doc_id]["image"] = score
else:
results_dict[doc_id] = {"image": score}
# 计算加权总分
final_scores = {}
for doc_id, scores in results_dict.items():
total = 0
for modality, weight in weights.items():
if modality in scores:
total += weight * scores[modality]
final_scores[doc_id] = total
# 排序
sorted_results = sorted(final_scores.items(), key=lambda x: x[1], reverse=True)
return sorted_results[:20]
# 使用多模态搜索
mm_search = MultimodalSearch(multimodal_collection)
text_query = [np.random.random() for _ in range(768)]
image_query = [np.random.random() for _ in range(512)]
print("多模态搜索结果:\n")
# 纯文本搜索
print("纯文本搜索:")
text_only = mm_search.multimodal_search(text_vector=text_query, weights={"text": 1.0})
for rank, (doc_id, score) in enumerate(text_only[:5], 1):
print(f" {rank}. 文档{doc_id}: {score:.3f}")
# 纯图像搜索
print("\n纯图像搜索:")
image_only = mm_search.multimodal_search(image_vector=image_query, weights={"image": 1.0})
for rank, (doc_id, score) in enumerate(image_only[:5], 1):
print(f" {rank}. 文档{doc_id}: {score:.3f}")
# 多模态融合
print("\n多模态融合搜索 (文本:图像 = 0.6:0.4):")
multimodal_results = mm_search.multimodal_search(
text_vector=text_query,
image_vector=image_query,
weights={"text": 0.6, "image": 0.4}
)
for rank, (doc_id, score) in enumerate(multimodal_results[:5], 1):
print(f" {rank}. 文档{doc_id}: {score:.3f}")
print("\n多模态搜索应用:")
print(" 1. 电商: 图文结合商品搜索")
print(" 2. 视频: 文本搜索视频内容")
print(" 3. 社交: 跨模态内容推荐")
print(" 4. 教育: 多媒体资源检索")
---
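Weighted fusion is only meaningful when both modalities are scored on the same scale and in the same direction. The sketch below assumes the text hits carry cosine similarities in [-1, 1] (higher is closer) and the image hits carry L2 distances (lower is closer). The helper names `cosine_to_score`, `l2_to_score` and `fuse` are illustrative and not part of pymilvus.

```python
def cosine_to_score(similarity):
    """Map a cosine similarity in [-1, 1] to a score in [0, 1]."""
    return (similarity + 1.0) / 2.0


def l2_to_score(distance):
    """Map a non-negative L2 distance to a score in (0, 1]; distance 0 gives 1."""
    return 1.0 / (1.0 + distance)


def fuse(text_hits, image_hits, w_text=0.6, w_image=0.4):
    """Weighted sum of per-modality scores; a missing modality contributes 0."""
    fused = {}
    for doc_id, sim in text_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_text * cosine_to_score(sim)
    for doc_id, dist in image_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_image * l2_to_score(dist)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


text_hits = {1: 0.9, 2: 0.2}    # doc_id -> cosine similarity
image_hits = {2: 0.0, 3: 4.0}   # doc_id -> L2 distance
ranking = fuse(text_hits, image_hits)
print(ranking)
```

Document 2 ranks first here because it appears in both result sets, which is the behaviour a fused ranking is meant to reward.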
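8 Performance Optimization
8.1 Index Selection Strategy
01. Index type comparison
a. FLAT index
a. Description
A FLAT index performs exhaustive (brute-force) search with no compression or approximation, so it returns exact results (100% recall). Query time grows linearly with the number of vectors, which makes FLAT a fit for small collections (roughly up to a few hundred thousand to a million vectors, depending on dimension and latency budget) and for workloads where accuracy matters more than speed. It needs no training step, so index builds and inserts are fast, and its memory footprint is about the size of the raw vectors. On large collections query performance degrades noticeably.
b. Code example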
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# FLAT索引配置
flat_index = {
"index_type": "FLAT",
"metric_type": "L2",
"params": {} # FLAT索引无需参数
}
print("创建FLAT索引...")
start = time.time()
collection.create_index(field_name="embedding", index_params=flat_index)
build_time = time.time() - start
print(f"索引构建时间: {build_time:.2f}s")
collection.load()
# 测试查询性能
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {}
}
start = time.time()
for _ in range(100):
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
query_time = time.time() - start
print(f"\n100次查询总时间: {query_time:.2f}s")
print(f"平均查询延迟: {query_time/100*1000:.2f}ms")
print(f"QPS: {100/query_time:.2f}")
print("\nFLAT索引特点:")
print(" 优点: 100%召回率,最精确")
print(" 缺点: 查询速度慢,不适合大规模数据")
print(" 适用: <100万向量,高精度要求")
---
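To make the linear cost concrete, here is a minimal NumPy sketch of what an exhaustive search does conceptually. It is not Milvus code; `flat_search` is a hypothetical helper and the data is random.

```python
import numpy as np


def flat_search(base, query, k):
    """Exact k-nearest-neighbour search by scanning every base vector."""
    # Squared L2 distance from the query to each of the n base vectors: O(n * d)
    dists = np.sum((base - query) ** 2, axis=1)
    # Indices of the k smallest distances, ordered from nearest to farthest
    top = np.argsort(dists)[:k]
    return top, dists[top]


rng = np.random.default_rng(0)
base = rng.random((1000, 16), dtype=np.float32)
query = base[42].copy()   # a vector known to be in the base set
ids, dists = flat_search(base, query, k=5)
print(ids, dists)
```

Every query touches all `n` vectors, which is why doubling the collection roughly doubles the query time.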
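b. IVF index family
a. Description
IVF indexes use an inverted-file structure: the vector space is partitioned into nlist clusters, and each vector is stored in the list of its nearest centroid. Variants include IVF_FLAT, IVF_SQ8 and IVF_PQ. At query time only the nprobe clusters closest to the query are scanned, which trades some recall for speed. IVF suits medium to large collections (roughly 1 million to 10 million vectors). It requires a training step to compute the centroids, so building takes longer than FLAT. The quantized variants (SQ8, PQ) reduce memory usage.
b. Code example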
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# IVF_FLAT索引配置
ivf_flat_index = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024} # 聚类数量
}
print("创建IVF_FLAT索引...")
start = time.time()
collection.create_index(field_name="embedding", index_params=ivf_flat_index)
build_time = time.time() - start
print(f"索引构建时间: {build_time:.2f}s")
collection.load()
# 测试不同nprobe值的性能
query_vector = [[np.random.random() for _ in range(128)]]
nprobe_values = [1, 8, 16, 32, 64]
print(f"\n{'nprobe':>8s} {'查询时间':>10s} {'QPS':>10s}")
print("-" * 32)
for nprobe in nprobe_values:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
start = time.time()
for _ in range(100):
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
query_time = time.time() - start
qps = 100 / query_time
print(f"{nprobe:8d} {query_time:9.2f}s {qps:9.2f}")
print("\nIVF索引特点:")
print(" 优点: 速度快,内存可控")
print(" 缺点: 需要训练,召回率<100%")
print(" 适用: 100万-1000万向量")
print(" 调优: nlist=4*sqrt(n), nprobe=nlist的1-10%")
---
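The effect of nprobe can be reproduced without a database. The following toy IVF in NumPy is a sketch under simplifying assumptions (a few Lloyd iterations for clustering, random data, hypothetical function names `build_ivf` and `ivf_search`); it is not how Milvus implements the index internally.

```python
import numpy as np


def build_ivf(base, nlist, iters=10, seed=0):
    """Cluster base vectors into nlist cells with a few k-means iterations."""
    rng = np.random.default_rng(seed)
    centroids = base[rng.choice(len(base), nlist, replace=False)].copy()
    for _ in range(iters):
        d = ((base[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for c in range(nlist):
            members = base[assign == c]
            if len(members):              # keep the old centroid if a cell is empty
                centroids[c] = members.mean(axis=0)
    d = ((base[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    cells = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, cells


def ivf_search(base, centroids, cells, query, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cd = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(cd)[:nprobe]
    cand = np.concatenate([cells[c] for c in probe])
    d = ((base[cand] - query) ** 2).sum(axis=1)
    return cand[np.argsort(d)[:k]]


rng = np.random.default_rng(1)
base = rng.random((2000, 8))
queries = rng.random((20, 8))
centroids, cells = build_ivf(base, nlist=16)
k = 10
recalls = {}
for nprobe in (1, 4, 16):
    hits = 0
    for q in queries:
        exact = np.argsort(((base - q) ** 2).sum(axis=1))[:k]
        approx = ivf_search(base, centroids, cells, q, k, nprobe)
        hits += len(set(exact) & set(approx))
    recalls[nprobe] = hits / (k * len(queries))
    print(f"nprobe={nprobe:2d} recall@{k}={recalls[nprobe]:.2f}")
```

With nprobe equal to nlist every cell is scanned, so the search degenerates to an exhaustive scan and recall reaches 100%.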
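02. Index selection decisions
a. Assessing data scale
a. Description
Choose the index type according to data scale. As a rough guide: FLAT for small collections (under about 100,000 vectors), the IVF family for medium collections (about 100,000 to 10 million), and HNSW or DiskANN for large collections (over 10 million), with HNSW only when memory allows. These thresholds are rules of thumb and shift with vector dimension and latency targets. Account for data growth and leave performance headroom. Check the available memory and pick a compression scheme that fits. Weigh the QPS requirement against the accuracy requirement. Re-evaluate periodically and adjust the index as the workload changes.
b. Code example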
---
from pymilvus import Collection
import numpy as np
# 索引选择决策类
class IndexSelector:
def __init__(self, collection):
self.collection = collection
def recommend_index(self, vector_count, qps_requirement, recall_requirement, memory_limit_gb):
"""
推荐索引类型
参数:
vector_count: 向量数量
qps_requirement: QPS需求
recall_requirement: 召回率要求 (0-1)
memory_limit_gb: 内存限制(GB)
"""
recommendations = []
# 计算向量维度和内存需求
# 假设128维float32向量,每个向量512字节
vector_size_bytes = 128 * 4
total_memory_gb = vector_count * vector_size_bytes / (1024**3)
print(f"\n数据规模评估:")
print(f" 向量数量: {vector_count:,}")
print(f" 原始数据大小: {total_memory_gb:.2f} GB")
print(f" QPS需求: {qps_requirement}")
print(f" 召回率要求: {recall_requirement*100:.0f}%")
print(f" 内存限制: {memory_limit_gb} GB")
# 小规模数据
if vector_count < 100000:
if recall_requirement >= 0.99:
recommendations.append({
"index_type": "FLAT",
"reason": "小规模数据,高召回率要求",
"params": {},
"expected_recall": 1.0,
"expected_qps": "100-500",
"memory_gb": total_memory_gb
})
else:
recommendations.append({
"index_type": "IVF_FLAT",
"reason": "小规模数据,可接受近似搜索",
"params": {"nlist": 128},
"expected_recall": 0.95,
"expected_qps": "500-2000",
"memory_gb": total_memory_gb * 1.1
})
# 中规模数据
elif vector_count < 10000000:
nlist = int(4 * np.sqrt(vector_count))
if memory_limit_gb >= total_memory_gb:
recommendations.append({
"index_type": "IVF_FLAT",
"reason": "中规模数据,内存充足",
"params": {"nlist": nlist},
"expected_recall": 0.95,
"expected_qps": "1000-5000",
"memory_gb": total_memory_gb * 1.1
})
else: # Raw vectors do not fit in memory: fall back to a quantized index (the original 0.5x threshold left a gap with no recommendation)
recommendations.append({
"index_type": "IVF_SQ8",
"reason": "中规模数据,内存受限",
"params": {"nlist": nlist},
"expected_recall": 0.90,
"expected_qps": "2000-8000",
"memory_gb": total_memory_gb * 0.3
})
if qps_requirement > 5000:
recommendations.append({
"index_type": "HNSW",
"reason": "高QPS需求",
"params": {"M": 16, "efConstruction": 200},
"expected_recall": 0.95,
"expected_qps": "5000-20000",
"memory_gb": total_memory_gb * 1.3
})
# 大规模数据
else:
recommendations.append({
"index_type": "HNSW",
"reason": "大规模数据,高性能需求",
"params": {"M": 16, "efConstruction": 200},
"expected_recall": 0.95,
"expected_qps": "5000-20000",
"memory_gb": total_memory_gb * 1.3
})
if memory_limit_gb < total_memory_gb:
recommendations.append({
"index_type": "IVF_PQ",
"reason": "大规模数据,内存受限",
"params": {"nlist": 4096, "m": 16},
"expected_recall": 0.85,
"expected_qps": "3000-10000",
"memory_gb": total_memory_gb * 0.1
})
return recommendations
def print_recommendations(self, recommendations):
"""打印推荐结果"""
print(f"\n索引推荐 (共{len(recommendations)}个选项):\n")
for i, rec in enumerate(recommendations, 1):
print(f"{i}. {rec['index_type']}")
print(f" 原因: {rec['reason']}")
print(f" 参数: {rec['params']}")
print(f" 预期召回率: {rec['expected_recall']*100:.0f}%")
print(f" 预期QPS: {rec['expected_qps']}")
print(f" 内存需求: {rec['memory_gb']:.2f} GB")
print()
# 使用索引选择器
collection = Collection("documents")
selector = IndexSelector(collection)
# 场景1: 小规模高精度
print("="*60)
print("场景1: 小规模高精度")
print("="*60)
recs = selector.recommend_index(
vector_count=50000,
qps_requirement=200,
recall_requirement=0.99,
memory_limit_gb=10
)
selector.print_recommendations(recs)
# 场景2: 中规模平衡
print("="*60)
print("场景2: 中规模平衡")
print("="*60)
recs = selector.recommend_index(
vector_count=5000000,
qps_requirement=3000,
recall_requirement=0.95,
memory_limit_gb=50
)
selector.print_recommendations(recs)
# 场景3: 大规模内存受限
print("="*60)
print("场景3: 大规模内存受限")
print("="*60)
recs = selector.recommend_index(
vector_count=50000000,
qps_requirement=5000,
recall_requirement=0.90,
memory_limit_gb=20
)
selector.print_recommendations(recs)
---
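Planning for growth is simple arithmetic. The sketch below assumes compound monthly growth and float32 vectors; the function names and the 5% growth rate are hypothetical, and the nlist helper only restates the `4 * sqrt(n)` rule of thumb used in this section.

```python
import math


def projected_count(current, monthly_growth, months):
    """Compound growth: vectors expected after the given number of months."""
    return math.ceil(current * (1 + monthly_growth) ** months)


def raw_size_gb(count, dim, bytes_per_value=4):
    """Size of the raw vectors in GiB, before any index overhead."""
    return count * dim * bytes_per_value / 1024 ** 3


def suggested_nlist(count):
    """The 4 * sqrt(n) starting point for IVF nlist."""
    return int(4 * math.sqrt(count))


now = 5_000_000
later = projected_count(now, monthly_growth=0.05, months=12)
print(f"vectors in 12 months: {later:,}")
print(f"raw size at 128 dims: {raw_size_gb(later, 128):.2f} GiB")
print(f"suggested nlist: {suggested_nlist(later)}")
```

Sizing the index for the projected count rather than the current one avoids an early rebuild when the collection crosses a threshold.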
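b. Performance benchmarking
a. Description
Benchmark candidate indexes to compare how they actually behave. Useful metrics are build time, query latency, QPS, recall and memory usage. Test with real data and realistic query patterns, and compare several parameter settings for each index. Use the results to guide index selection and parameter tuning. Run performance regression tests regularly, and keep a baseline so that changes in performance are visible.
b. Code example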
---
from pymilvus import Collection, utility
import numpy as np
import time
collection = Collection("documents")
# 性能测试类
class IndexBenchmark:
def __init__(self, collection):
self.collection = collection
self.results = []
def benchmark_index(self, index_config, search_params, num_queries=100):
"""测试单个索引配置"""
index_type = index_config["index_type"]
print(f"\n测试索引: {index_type}")
print(f" 参数: {index_config.get('params', {})}")
# 删除现有索引
try:
self.collection.release()
self.collection.drop_index()
except Exception:
pass # Ignore the error when no index exists or the collection is not loaded
# 构建索引
print(" 构建索引...")
start = time.time()
self.collection.create_index(field_name="embedding", index_params=index_config)
build_time = time.time() - start
# 加载Collection
self.collection.load()
# 查询测试
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(num_queries)]
latencies = []
for query_vector in query_vectors:
start = time.time()
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=10
)
latency = time.time() - start
latencies.append(latency)
# 统计结果
avg_latency = np.mean(latencies) * 1000 # ms
p95_latency = np.percentile(latencies, 95) * 1000
p99_latency = np.percentile(latencies, 99) * 1000
qps = 1 / np.mean(latencies)
# 内存占用(简化)
memory_usage = "N/A"
result = {
"index_type": index_type,
"params": index_config.get("params", {}),
"build_time": build_time,
"avg_latency": avg_latency,
"p95_latency": p95_latency,
"p99_latency": p99_latency,
"qps": qps,
"memory": memory_usage
}
self.results.append(result)
print(f" 构建时间: {build_time:.2f}s")
print(f" 平均延迟: {avg_latency:.2f}ms")
print(f" P95延迟: {p95_latency:.2f}ms")
print(f" P99延迟: {p99_latency:.2f}ms")
print(f" QPS: {qps:.2f}")
return result
def compare_indexes(self, index_configs, search_params_list, num_queries=100):
"""对比多个索引配置"""
print("="*80)
print("索引性能对比测试")
print("="*80)
for index_config, search_params in zip(index_configs, search_params_list):
self.benchmark_index(index_config, search_params, num_queries)
# 打印对比表格
print(f"\n{'索引类型':>15s} {'构建时间':>10s} {'平均延迟':>10s} {'P95延迟':>10s} {'QPS':>10s}")
print("-" * 60)
for result in self.results:
print(f"{result['index_type']:>15s} {result['build_time']:9.2f}s {result['avg_latency']:9.2f}ms {result['p95_latency']:9.2f}ms {result['qps']:9.2f}")
# 推荐最佳配置
best_qps = max(self.results, key=lambda x: x["qps"])
best_latency = min(self.results, key=lambda x: x["avg_latency"])
print(f"\n推荐:")
print(f" 最高QPS: {best_qps['index_type']} ({best_qps['qps']:.2f})")
print(f" 最低延迟: {best_latency['index_type']} ({best_latency['avg_latency']:.2f}ms)")
# 使用性能测试
benchmark = IndexBenchmark(collection)
# 定义测试配置
index_configs = [
{
"index_type": "FLAT",
"metric_type": "L2",
"params": {}
},
{
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128}
},
{
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 512}
},
{
"index_type": "HNSW",
"metric_type": "L2",
"params": {"M": 16, "efConstruction": 200}
}
]
search_params_list = [
{"metric_type": "L2", "params": {}},
{"metric_type": "L2", "params": {"nprobe": 16}},
{"metric_type": "L2", "params": {"nprobe": 64}},
{"metric_type": "L2", "params": {"ef": 64}}
]
# 执行对比测试
benchmark.compare_indexes(index_configs, search_params_list, num_queries=50)
---
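Recall is listed among the metrics above but needs a ground truth to be measured. A minimal sketch, assuming the approximate and exact result ids are available as one row per query; `recall_at_k` and `exact_topk` are hypothetical helpers, and the "approximate" result is simulated by dropping one true neighbour.

```python
import numpy as np


def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of the exact top-k that each approximate result recovered."""
    per_query = [
        len(set(a) & set(e)) / len(e)
        for a, e in zip(approx_ids, exact_ids)
    ]
    return float(np.mean(per_query))


def exact_topk(base, queries, k):
    """Ground truth by brute force (squared L2), one row of ids per query."""
    d = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(axis=2)
    return np.argsort(d, axis=1)[:, :k]


rng = np.random.default_rng(0)
base = rng.random((500, 8))
queries = rng.random((10, 8))
truth = exact_topk(base, queries, k=5)

# Simulate an ANN index that misses the last true neighbour of every query
approx = truth.copy()
approx[:, -1] = -1
print(recall_at_k(truth, truth), recall_at_k(approx, truth))
```

In a real benchmark the ground truth would come from a FLAT search over the same collection, and `approx_ids` from the index under test.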
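8.2 Query Parameter Tuning
01. Search parameter optimization
a. Tuning nprobe
a. Description
nprobe sets how many IVF clusters are scanned per query, so it directly controls the trade-off between recall and latency: a larger nprobe raises recall and slows queries. A common starting point is 1-10% of nlist; at nprobe = nlist every cluster is visited and the search becomes exhaustive. Choose the value by measuring recall and latency on your own data, for example through A/B tests. nprobe is a per-query search parameter, so different queries can use different values, such as a larger nprobe for high-priority queries.
b. Code example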
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 创建IVF索引
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# nprobe调优类
class NprobeOptimizer:
def __init__(self, collection):
self.collection = collection
def test_nprobe_values(self, query_vector, nprobe_values, num_queries=100):
"""测试不同nprobe值的性能"""
results = []
print(f"\nnprobe参数调优测试 ({num_queries}次查询):\n")
print(f"{'nprobe':>8s} {'平均延迟':>12s} {'P95延迟':>12s} {'QPS':>10s} {'召回率估计':>12s}")
print("-" * 60)
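# Baseline: results at the largest nprobe tested. This is a proxy, not ground truth; for true recall, compare against a FLAT search or nprobe=nlist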
baseline_search_params = {
"metric_type": "L2",
"params": {"nprobe": max(nprobe_values)}
}
baseline_results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=baseline_search_params,
limit=10
)
baseline_ids = set(hit.id for hit in baseline_results[0])
for nprobe in nprobe_values:
search_params = {
"metric_type": "L2",
"params": {"nprobe": nprobe}
}
latencies = []
recall_sum = 0
for _ in range(num_queries):
start = time.time()
search_results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=10
)
latency = time.time() - start
latencies.append(latency)
# 计算召回率
result_ids = set(hit.id for hit in search_results[0])
recall = len(result_ids & baseline_ids) / len(baseline_ids)
recall_sum += recall
avg_latency = np.mean(latencies) * 1000
p95_latency = np.percentile(latencies, 95) * 1000
qps = 1 / np.mean(latencies)
avg_recall = recall_sum / num_queries
results.append({
"nprobe": nprobe,
"avg_latency": avg_latency,
"p95_latency": p95_latency,
"qps": qps,
"recall": avg_recall
})
print(f"{nprobe:8d} {avg_latency:11.2f}ms {p95_latency:11.2f}ms {qps:9.2f} {avg_recall*100:11.1f}%")
return results
def recommend_nprobe(self, results, min_recall=0.95):
"""推荐最优nprobe值"""
# 找到满足召回率要求的最小nprobe
valid_results = [r for r in results if r["recall"] >= min_recall]
if not valid_results:
print(f"\n警告: 没有配置满足{min_recall*100:.0f}%召回率要求")
return None
best = min(valid_results, key=lambda x: x["avg_latency"])
print(f"\n推荐配置 (召回率≥{min_recall*100:.0f}%):")
print(f" nprobe: {best['nprobe']}")
print(f" 平均延迟: {best['avg_latency']:.2f}ms")
print(f" QPS: {best['qps']:.2f}")
print(f" 召回率: {best['recall']*100:.1f}%")
return best
def adaptive_nprobe(self, query_priority):
"""根据查询优先级自适应选择nprobe"""
# 高优先级: nprobe更大,召回率更高
# 低优先级: nprobe更小,速度更快
nprobe_map = {
"high": 64, # 高优先级
"medium": 32, # 中优先级
"low": 16 # 低优先级
}
return nprobe_map.get(query_priority, 32)
# 使用nprobe优化器
optimizer = NprobeOptimizer(collection)
query_vector = [np.random.random() for _ in range(128)]
nprobe_values = [8, 16, 32, 64, 128, 256]
# 测试不同nprobe值
results = optimizer.test_nprobe_values(query_vector, nprobe_values, num_queries=50)
# 推荐最优配置
best_config = optimizer.recommend_nprobe(results, min_recall=0.95)
# 自适应nprobe示例
print("\n自适应nprobe策略:")
for priority in ["high", "medium", "low"]:
nprobe = optimizer.adaptive_nprobe(priority)
print(f" {priority}优先级: nprobe={nprobe}")
---
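b. Tuning ef
a. Description
ef applies to HNSW indexes and sets the size of the candidate list kept during search. A larger ef raises recall and slows queries. ef must be at least limit (the number of results returned); 2-10x limit is a common starting range. efConstruction is a build-time parameter, whereas ef is a query-time parameter, so ef can be changed per query without rebuilding the index. Pick the smallest ef that meets your recall target.
b. Code example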
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 创建HNSW索引
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16,
"efConstruction": 200
}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()
# ef参数调优类
class EfOptimizer:
def __init__(self, collection):
self.collection = collection
def test_ef_values(self, query_vector, ef_values, limit=10, num_queries=100):
"""测试不同ef值的性能"""
results = []
print(f"\nef参数调优测试 (limit={limit}, {num_queries}次查询):\n")
print(f"{'ef':>6s} {'平均延迟':>12s} {'P95延迟':>12s} {'QPS':>10s} {'召回率估计':>12s}")
print("-" * 58)
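# Baseline: results at the largest ef tested (a proxy; true recall needs an exact FLAT search as ground truth)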
baseline_search_params = {
"metric_type": "L2",
"params": {"ef": max(ef_values)}
}
baseline_results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=baseline_search_params,
limit=limit
)
baseline_ids = set(hit.id for hit in baseline_results[0])
for ef in ef_values:
if ef < limit:
print(f"{ef:6d} 跳过 (ef必须≥limit={limit})")
continue
search_params = {
"metric_type": "L2",
"params": {"ef": ef}
}
latencies = []
recall_sum = 0
for _ in range(num_queries):
start = time.time()
search_results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit
)
latency = time.time() - start
latencies.append(latency)
result_ids = set(hit.id for hit in search_results[0])
recall = len(result_ids & baseline_ids) / len(baseline_ids)
recall_sum += recall
avg_latency = np.mean(latencies) * 1000
p95_latency = np.percentile(latencies, 95) * 1000
qps = 1 / np.mean(latencies)
avg_recall = recall_sum / num_queries
results.append({
"ef": ef,
"avg_latency": avg_latency,
"p95_latency": p95_latency,
"qps": qps,
"recall": avg_recall
})
print(f"{ef:6d} {avg_latency:11.2f}ms {p95_latency:11.2f}ms {qps:9.2f} {avg_recall*100:11.1f}%")
return results
def recommend_ef(self, results, limit, min_recall=0.95):
"""推荐最优ef值"""
valid_results = [r for r in results if r["recall"] >= min_recall]
if not valid_results:
print(f"\n警告: 没有配置满足{min_recall*100:.0f}%召回率要求")
return None
best = min(valid_results, key=lambda x: x["avg_latency"])
print(f"\n推荐配置 (limit={limit}, 召回率≥{min_recall*100:.0f}%):")
print(f" ef: {best['ef']} (约{best['ef']/limit:.1f}倍limit)")
print(f" 平均延迟: {best['avg_latency']:.2f}ms")
print(f" QPS: {best['qps']:.2f}")
print(f" 召回率: {best['recall']*100:.1f}%")
return best
# 使用ef优化器
ef_optimizer = EfOptimizer(collection)
query_vector = [np.random.random() for _ in range(128)]
# 测试不同limit下的ef值
for limit in [10, 50, 100]:
ef_values = [limit, limit*2, limit*4, limit*8, limit*10]
print(f"\n{'='*60}")
print(f"测试limit={limit}")
print(f"{'='*60}")
results = ef_optimizer.test_ef_values(query_vector, ef_values, limit=limit, num_queries=50)
ef_optimizer.recommend_ef(results, limit=limit, min_recall=0.95)
print("\nef参数调优建议:")
print(" 1. ef ≥ limit (必须)")
print(" 2. ef = limit * 2-4 (平衡)")
print(" 3. ef = limit * 8-10 (高召回)")
print(" 4. 根据召回率要求调整")
print(" 5. 可以动态调整,无需重建索引")
---
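The constraint ef >= limit is easy to enforce in one place. A small sketch follows; `choose_ef` and `hnsw_search_params` are hypothetical helpers, and the returned dictionary simply mirrors the search-parameter shape used in the examples above.

```python
def choose_ef(limit, factor=4):
    """Return ef = factor * limit, never below limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return max(limit, int(round(factor * limit)))


def hnsw_search_params(limit, factor=4, metric_type="L2"):
    """Build a search-parameter dict with an ef that satisfies ef >= limit."""
    return {"metric_type": metric_type, "params": {"ef": choose_ef(limit, factor)}}


for limit in (10, 50, 100):
    print(limit, hnsw_search_params(limit))

# A factor below 1 is clamped so that ef never drops under limit
print(choose_ef(10, factor=0.5))
```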
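02. Batch query optimization
a. Batch size tuning
a. Description
Sending several query vectors in one search call raises throughput by amortizing network and scheduling overhead. Batch size trades latency against throughput: a batch that is too large raises the latency of each call, and one that is too small leaves the server's parallelism unused. Batch sizes of 10-100 are a reasonable starting range. Use larger batches (up to a few hundred) for throughput-oriented workloads and smaller ones for latency-sensitive workloads, and tune against your hardware and requirements.
b. Code example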
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
collection.load()
# 批量查询优化类
class BatchQueryOptimizer:
def __init__(self, collection):
self.collection = collection
def test_batch_sizes(self, batch_sizes, total_queries=1000):
"""测试不同批量大小的性能"""
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
print(f"\n批量大小优化测试 (总查询数={total_queries}):\n")
print(f"{'批量大小':>10s} {'总时间':>10s} {'吞吐量':>12s} {'平均延迟':>12s} {'P95延迟':>12s}")
print("-" * 62)
results = []
for batch_size in batch_sizes:
num_batches = total_queries // batch_size
total_time = 0
latencies = []
for _ in range(num_batches):
# 生成批量查询向量
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(batch_size)]
start = time.time()
results_batch = self.collection.search(
data=query_vectors,
anns_field="embedding",
param=search_params,
limit=10
)
elapsed = time.time() - start
total_time += elapsed
latencies.append(elapsed)
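throughput = num_batches * batch_size / total_time # Count only the queries actually issued (total_queries may not be a multiple of batch_size)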
avg_latency = np.mean(latencies) * 1000
p95_latency = np.percentile(latencies, 95) * 1000
results.append({
"batch_size": batch_size,
"total_time": total_time,
"throughput": throughput,
"avg_latency": avg_latency,
"p95_latency": p95_latency
})
print(f"{batch_size:10d} {total_time:9.2f}s {throughput:11.2f}qps {avg_latency:11.2f}ms {p95_latency:11.2f}ms")
return results
def recommend_batch_size(self, results, max_latency_ms=None):
"""推荐最优批量大小"""
if max_latency_ms:
# 满足延迟要求的最大吞吐量
valid_results = [r for r in results if r["avg_latency"] <= max_latency_ms]
if not valid_results:
print(f"\n警告: 没有配置满足{max_latency_ms}ms延迟要求")
return None
best = max(valid_results, key=lambda x: x["throughput"])
print(f"\n推荐配置 (延迟≤{max_latency_ms}ms):")
else:
# 最大吞吐量
best = max(results, key=lambda x: x["throughput"])
print(f"\n推荐配置 (最大吞吐量):")
print(f" 批量大小: {best['batch_size']}")
print(f" 吞吐量: {best['throughput']:.2f} qps")
print(f" 平均延迟: {best['avg_latency']:.2f}ms")
print(f" P95延迟: {best['p95_latency']:.2f}ms")
return best
# 使用批量查询优化器
batch_optimizer = BatchQueryOptimizer(collection)
batch_sizes = [1, 10, 20, 50, 100, 200]
# 测试不同批量大小
results = batch_optimizer.test_batch_sizes(batch_sizes, total_queries=1000)
# 推荐最大吞吐量配置
batch_optimizer.recommend_batch_size(results)
# 推荐满足延迟要求的配置
batch_optimizer.recommend_batch_size(results, max_latency_ms=50)
print("\n批量查询优化建议:")
print(" 1. 高吞吐场景: 批量50-200")
print(" 2. 低延迟场景: 批量1-20")
print(" 3. 平衡场景: 批量20-50")
print(" 4. 监控延迟和吞吐量指标")
print(" 5. 根据硬件资源动态调整")
---
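When the number of queries is not a multiple of the batch size, integer division silently drops the remainder. A minimal batching helper that keeps the last partial batch might look like this; `iter_batches` is a hypothetical name, and the lists of numbers stand in for query vectors.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items, including the tail."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


queries = list(range(1050))          # stand-ins for 1050 query vectors
batches = list(iter_batches(queries, 200))
print(len(batches), [len(b) for b in batches])
```

Each yielded slice can be passed as the `data` argument of a single search call.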
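b. Concurrency control
a. Description
Issuing queries concurrently raises throughput by keeping the server busy, but the concurrency level affects both latency and resource usage: too much concurrency causes contention and rising latency, too little leaves hardware idle. A common starting point is 2-4x the number of CPU cores; treat it as a rule of thumb and confirm it by measurement. Monitor system load to avoid overload, manage connections with a pool, and add rate limiting and circuit breaking.
b. Code example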
---
from pymilvus import Collection
import numpy as np
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
collection = Collection("documents")
collection.load()
# 并发控制类
class ConcurrencyController:
def __init__(self, collection):
self.collection = collection
def single_query(self, query_id):
"""单个查询任务"""
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
start = time.time()
results = self.collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
latency = time.time() - start
return query_id, latency
def test_concurrency(self, concurrency_levels, num_queries=1000):
"""测试不同并发级别的性能"""
print(f"\n并发控制测试 (总查询数={num_queries}):\n")
print(f"{'并发数':>8s} {'总时间':>10s} {'吞吐量':>12s} {'平均延迟':>12s} {'P95延迟':>12s}")
print("-" * 60)
results = []
for concurrency in concurrency_levels:
latencies = []
start = time.time()
with ThreadPoolExecutor(max_workers=concurrency) as executor:
futures = [executor.submit(self.single_query, i) for i in range(num_queries)]
for future in as_completed(futures):
query_id, latency = future.result()
latencies.append(latency)
total_time = time.time() - start
throughput = num_queries / total_time
avg_latency = np.mean(latencies) * 1000
p95_latency = np.percentile(latencies, 95) * 1000
results.append({
"concurrency": concurrency,
"total_time": total_time,
"throughput": throughput,
"avg_latency": avg_latency,
"p95_latency": p95_latency
})
print(f"{concurrency:8d} {total_time:9.2f}s {throughput:11.2f}qps {avg_latency:11.2f}ms {p95_latency:11.2f}ms")
return results
def recommend_concurrency(self, results, max_latency_ms=None):
"""推荐最优并发数"""
if max_latency_ms:
valid_results = [r for r in results if r["p95_latency"] <= max_latency_ms]
if not valid_results:
print(f"\n警告: 没有配置满足P95延迟≤{max_latency_ms}ms要求")
return None
best = max(valid_results, key=lambda x: x["throughput"])
print(f"\n推荐配置 (P95延迟≤{max_latency_ms}ms):")
else:
best = max(results, key=lambda x: x["throughput"])
print(f"\n推荐配置 (最大吞吐量):")
print(f" 并发数: {best['concurrency']}")
print(f" 吞吐量: {best['throughput']:.2f} qps")
print(f" 平均延迟: {best['avg_latency']:.2f}ms")
print(f" P95延迟: {best['p95_latency']:.2f}ms")
return best
# 使用并发控制器
concurrency_controller = ConcurrencyController(collection)
concurrency_levels = [1, 2, 4, 8, 16, 32, 64]
# 测试不同并发级别
results = concurrency_controller.test_concurrency(concurrency_levels, num_queries=500)
# 推荐最大吞吐量配置
concurrency_controller.recommend_concurrency(results)
# 推荐满足延迟要求的配置
concurrency_controller.recommend_concurrency(results, max_latency_ms=100)
print("\n并发控制建议:")
print(" 1. 并发数 = CPU核心数 * 2-4")
print(" 2. 监控CPU和内存使用率")
print(" 3. 避免过度并发导致资源竞争")
print(" 4. 实现请求限流机制")
print(" 5. 使用连接池管理连接")
---
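A quick sanity check for a chosen concurrency level is Little's law: requests in flight = throughput x latency. The sketch below applies it in both directions; the function names and the example numbers are hypothetical, and real systems deviate once latency starts rising under load.

```python
import math


def required_concurrency(target_qps, mean_latency_ms):
    """In-flight requests needed to sustain target_qps at the given latency."""
    return math.ceil(target_qps * mean_latency_ms / 1000)


def max_throughput(concurrency, mean_latency_ms):
    """Upper bound on QPS when every worker is always busy."""
    return concurrency * 1000 / mean_latency_ms


print(required_concurrency(target_qps=2000, mean_latency_ms=10))
print(max_throughput(concurrency=16, mean_latency_ms=8))
```

If the measured throughput falls well below this bound, the bottleneck is likely contention rather than too few workers.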
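8.3 Memory Optimization
01. Memory usage analysis
a. Estimating memory footprint
a. Description
Memory is a key resource for Milvus performance and needs to be estimated and managed carefully. It holds vector data, index structures and query caches. The footprint varies widely by index type: FLAT and IVF_FLAT keep the full float32 vectors, HNSW adds graph links on top of them and is usually the largest, while SQ8 and PQ compress the vectors and are the smallest. Monitor memory usage to avoid OOM. Quantization and compression reduce the footprint. Configure memory limits and caching policies to match.
b. Code example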
---
from pymilvus import Collection, utility
import numpy as np
collection = Collection("documents")
# 内存分析类
class MemoryAnalyzer:
def __init__(self, collection):
self.collection = collection
def estimate_memory_usage(self, vector_count, vector_dim, index_type):
"""估算内存使用"""
# 单个向量大小(float32)
vector_size_bytes = vector_dim * 4
# 原始数据大小
raw_data_mb = vector_count * vector_size_bytes / (1024**2)
# 索引开销系数
index_overhead = {
"FLAT": 1.0, # 无额外开销
"IVF_FLAT": 1.1, # 10%开销
"IVF_SQ8": 0.3, # 压缩到30%
"IVF_PQ": 0.1, # 压缩到10%
"HNSW": 1.3, # 30%开销
}
overhead_factor = index_overhead.get(index_type, 1.0)
total_memory_mb = raw_data_mb * overhead_factor
return {
"vector_count": vector_count,
"vector_dim": vector_dim,
"index_type": index_type,
"raw_data_mb": raw_data_mb,
"overhead_factor": overhead_factor,
"total_memory_mb": total_memory_mb,
"total_memory_gb": total_memory_mb / 1024
}
def print_memory_report(self, estimates):
"""打印内存报告"""
print("\n内存使用估算:")
print(f" 向量数量: {estimates['vector_count']:,}")
print(f" 向量维度: {estimates['vector_dim']}")
print(f" 索引类型: {estimates['index_type']}")
print(f" 原始数据: {estimates['raw_data_mb']:.2f} MB ({estimates['raw_data_mb']/1024:.2f} GB)")
print(f" 开销系数: {estimates['overhead_factor']:.1f}x")
print(f" 总内存: {estimates['total_memory_mb']:.2f} MB ({estimates['total_memory_gb']:.2f} GB)")
def compare_index_memory(self, vector_count, vector_dim):
"""对比不同索引的内存占用"""
index_types = ["FLAT", "IVF_FLAT", "IVF_SQ8", "IVF_PQ", "HNSW"]
print(f"\n索引内存对比 ({vector_count:,}个{vector_dim}维向量):\n")
print(f"{'索引类型':>12s} {'原始数据':>12s} {'总内存':>12s} {'压缩率':>10s}")
print("-" * 50)
for index_type in index_types:
est = self.estimate_memory_usage(vector_count, vector_dim, index_type)
compression = est['total_memory_mb'] / est['raw_data_mb']
print(f"{index_type:>12s} {est['raw_data_mb']:11.2f}MB {est['total_memory_mb']:11.2f}MB {compression:9.1f}x")
# 使用内存分析器
analyzer = MemoryAnalyzer(collection)
# 估算不同规模的内存需求
scenarios = [
(100000, 128, "IVF_FLAT"),
(1000000, 128, "IVF_FLAT"),
(10000000, 128, "IVF_SQ8"),
(100000000, 128, "IVF_PQ")
]
for vector_count, vector_dim, index_type in scenarios:
estimates = analyzer.estimate_memory_usage(vector_count, vector_dim, index_type)
analyzer.print_memory_report(estimates)
# 对比索引内存
analyzer.compare_index_memory(10000000, 128)
print("\n内存优化建议:")
print(" 1. 使用量化索引(SQ8/PQ)降低内存")
print(" 2. 分区管理,按需加载")
print(" 3. 监控内存使用,设置限制")
print(" 4. 定期释放不用的分区")
print(" 5. 使用DiskANN处理超大规模数据")
---
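The overhead factors above can be cross-checked from first principles. The sketch below estimates bytes per vector under stated assumptions: float32 input, one byte per dimension for SQ8, `m * nbits` bits per PQ code, and `2 * M` four-byte neighbour ids per vector for HNSW layer 0. It ignores centroids, codebooks, primary keys and upper graph layers, so real usage will be higher. `bytes_per_vector` is a hypothetical helper.

```python
def bytes_per_vector(index_type, dim, m=16, nbits=8, M=16):
    """Approximate per-vector storage in bytes for a few index types."""
    raw = dim * 4                       # float32
    if index_type in ("FLAT", "IVF_FLAT"):
        return raw
    if index_type == "IVF_SQ8":
        return dim                      # one byte per dimension
    if index_type == "IVF_PQ":
        if dim % m != 0:
            raise ValueError("dim must be divisible by m")
        return (m * nbits + 7) // 8     # m codes of nbits bits each
    if index_type == "HNSW":
        return raw + 2 * M * 4          # raw vector plus layer-0 links (assumption)
    raise ValueError(f"unknown index type: {index_type}")


dim = 128
for name in ("FLAT", "IVF_FLAT", "IVF_SQ8", "IVF_PQ", "HNSW"):
    size = bytes_per_vector(name, dim)
    print(f"{name:>9s}: {size:4d} bytes/vector ({size / (dim * 4):.2f}x raw)")
```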
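b. Memory limit configuration
a. Description
Configuring memory limits prevents OOM and keeps the system stable. An upper bound can be set for each QueryNode; once it is exceeded, new loads or queries are rejected. Set the limit with care, because a limit that is too strict hurts performance. Monitor memory utilization and adjust the configuration in time. Reserve some headroom for traffic spikes, and set up memory alerts so that problems are caught early.
b. Code example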
---
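# Memory limit configuration (set through the configuration file).
# Key names and nesting differ between Milvus versions; check the milvus.yaml
# shipped with your release before copying any of these.
memory_config = """
queryNode:
  cache:
    memoryLimit: 2147483648  # 2 GB memory limit
    enabled: true
  loadMemoryUsageMaxLevel: 90  # stop loading once memory usage exceeds 90%
  gracefulStopTimeout: 30  # graceful shutdown timeout
# Monitoring thresholds (illustrative values)
monitoring:
  memory:
    warningThreshold: 0.8  # warn at 80%
    criticalThreshold: 0.9  # critical alert at 90%
"""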
print("内存限制配置示例:")
print(memory_config)
print("\n内存限制策略:")
print(" 1. 设置QueryNode内存上限")
print(" 2. 配置内存使用率阈值")
print(" 3. 实现内存告警机制")
print(" 4. 优雅降级,拒绝新请求")
print(" 5. 定期清理缓存和临时数据")
---
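02. Quantization and compression
a. Scalar quantization (SQ8)
a. Description
Scalar quantization stores each float32 component as an 8-bit integer, cutting vector memory by about 75%. SQ8 quantizes each dimension linearly, so the accuracy loss is usually small. It suits memory-constrained deployments that can tolerate a small loss of precision. Because less data is read, queries are often slightly faster than with IVF_FLAT, at slightly lower recall (often above 95%, depending on the data). Quantization is selected when the index is built. It is lossy: the original values cannot be reconstructed from the quantized codes.
b. Code example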
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 创建IVF_SQ8索引
sq8_index = {
"index_type": "IVF_SQ8",
"metric_type": "L2",
"params": {"nlist": 1024}
}
print("创建IVF_SQ8索引(标量量化)...")
start = time.time()
collection.create_index(field_name="embedding", index_params=sq8_index)
build_time = time.time() - start
print(f"索引构建时间: {build_time:.2f}s")
collection.load()
# 测试查询性能
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
start = time.time()
for _ in range(100):
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
query_time = time.time() - start
print(f"\n100次查询总时间: {query_time:.2f}s")
print(f"平均查询延迟: {query_time/100*1000:.2f}ms")
print(f"QPS: {100/query_time:.2f}")
print("\nSQ8量化特点:")
print(" 压缩率: 4x (float32 -> int8)")
print(" 内存节省: 75%")
print(" 召回率: ~95%")
print(" 查询速度: 略快于FLAT")
print(" 适用: 内存受限,可接受小幅精度损失")
---
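The quantization step itself fits in a few lines of NumPy. This is a sketch of linear per-dimension quantization to 8 bits on random data, with hypothetical helper names; it illustrates the idea and is not the exact encoding used inside Milvus.

```python
import numpy as np


def sq8_train(vectors):
    """Per-dimension offset and step so that each value maps to 0..255."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    step = (hi - lo) / 255.0
    step[step == 0] = 1.0            # constant dimensions: avoid division by zero
    return lo, step


def sq8_encode(vectors, lo, step):
    """Round each component to the nearest of 256 levels."""
    return np.clip(np.rint((vectors - lo) / step), 0, 255).astype(np.uint8)


def sq8_decode(codes, lo, step):
    """Approximate reconstruction; the rounding error is not recoverable."""
    return codes.astype(np.float32) * step + lo


rng = np.random.default_rng(0)
x = rng.random((1000, 128), dtype=np.float32)
lo, step = sq8_train(x)
codes = sq8_encode(x, lo, step)
x_hat = sq8_decode(codes, lo, step)
print("compression ratio:", x.nbytes // codes.nbytes)
print("max abs error:", float(np.abs(x - x_hat).max()))
print("half step:", float(step.max() / 2))
```

The reconstruction error is bounded by half a quantization step per dimension, which is why the recall loss tends to be small.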
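b. Product quantization (PQ)
a. Description
Product quantization splits each vector into m sub-vectors and replaces each sub-vector with the id of its nearest codebook centroid, which compresses far more than SQ8: memory can fall to 10% of the original or lower. The m parameter sets the number of sub-vectors (the vector dimension must be divisible by m) and, together with nbits, determines compression ratio and accuracy. PQ suits very large collections under severe memory constraints. Recall is lower than with SQ8, often around 85-90%. Queries are fast but the accuracy loss is larger, so memory has to be weighed against accuracy.
b. Code example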
---
from pymilvus import Collection
import numpy as np
import time
collection = Collection("documents")
# 创建IVF_PQ索引
pq_index = {
"index_type": "IVF_PQ",
"metric_type": "L2",
"params": {
"nlist": 1024,
"m": 16, # 分段数,必须能整除向量维度
"nbits": 8 # 每段的比特数
}
}
print("创建IVF_PQ索引(乘积量化)...")
start = time.time()
collection.create_index(field_name="embedding", index_params=pq_index)
build_time = time.time() - start
print(f"索引构建时间: {build_time:.2f}s")
collection.load()
# 测试查询性能
query_vector = [[np.random.random() for _ in range(128)]]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
start = time.time()
for _ in range(100):
results = collection.search(
data=query_vector,
anns_field="embedding",
param=search_params,
limit=10
)
query_time = time.time() - start
print(f"\n100次查询总时间: {query_time:.2f}s")
print(f"平均查询延迟: {query_time/100*1000:.2f}ms")
print(f"QPS: {100/query_time:.2f}")
print("\nPQ量化特点:")
print(" 压缩率: 10-40x (取决于m和nbits)")
print(" 内存节省: 90-97%")
print(" 召回率: ~85-90%")
print(" 查询速度: 较快")
print(" 适用: 超大规模数据,内存严重受限")
print(" 参数: m必须能整除向量维度")
---
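A toy product quantizer shows where the compression comes from. The sketch below trains one small codebook per sub-vector with a few Lloyd iterations and stores one byte per sub-vector. All names are hypothetical, the data is random, and production implementations differ in training and distance computation.

```python
import numpy as np


def kmeans(x, k, iters=8, seed=0):
    """A few Lloyd iterations; empty clusters keep their previous centroid."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)].copy()
    for _ in range(iters):
        d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids


def pq_train(x, m, nbits=8):
    """Train one codebook of 2**nbits centroids for each of the m sub-vectors."""
    dim = x.shape[1]
    if dim % m != 0:
        raise ValueError("dim must be divisible by m")
    sub = dim // m
    return [kmeans(x[:, i * sub:(i + 1) * sub], 2 ** nbits, seed=i) for i in range(m)]


def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    sub = x.shape[1] // len(codebooks)
    codes = np.empty((len(x), len(codebooks)), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        part = x[:, i * sub:(i + 1) * sub]
        d = ((part[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = d.argmin(axis=1)
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction by concatenating the selected centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])


rng = np.random.default_rng(0)
x = rng.random((2000, 32), dtype=np.float32)
codebooks = pq_train(x, m=4, nbits=8)
codes = pq_encode(x, codebooks)
x_hat = pq_decode(codes, codebooks)
print("compression ratio:", x.nbytes // codes.nbytes)
print("reconstruction MSE:", float(np.mean((x - x_hat) ** 2)))
```

Here 32 float32 values (128 bytes) shrink to 4 one-byte codes; the codebooks are shared by all vectors, so their cost is amortized over the collection.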
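8.4 Concurrency Control
01. Connection pool management
a. Connection pool configuration
a. Description
A connection pool reuses connections and avoids the cost of establishing a new one per request. A well-sized pool improves concurrent performance: a pool that is too small makes requests wait for a connection, and one that is too large wastes resources. A pool size of 1-2x the expected concurrency is a reasonable starting point. Configure connection and idle timeouts, add health checks with automatic reconnection, and monitor pool usage so that the size can be adjusted.
b. Code example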
---
from pymilvus import connections, Collection
import threading
import time
# 连接池配置
class ConnectionPool:
def __init__(self, alias_prefix="conn", pool_size=10):
self.alias_prefix = alias_prefix
self.pool_size = pool_size
self.connections = []
self.lock = threading.Lock()
self.init_pool()
def init_pool(self):
"""初始化连接池"""
print(f"初始化连接池,大小: {self.pool_size}")
for i in range(self.pool_size):
alias = f"{self.alias_prefix}_{i}"
connections.connect(
alias=alias,
host="localhost",
port="19530",
timeout=30
)
self.connections.append({
"alias": alias,
"in_use": False,
"last_used": time.time()
})
print(f"连接池初始化完成")
def acquire(self, timeout=10):
"""获取连接"""
start = time.time()
while time.time() - start < timeout:
with self.lock:
for conn in self.connections:
if not conn["in_use"]:
conn["in_use"] = True
conn["last_used"] = time.time()
return conn["alias"]
time.sleep(0.01)
raise TimeoutError("获取连接超时")
def release(self, alias):
"""释放连接"""
with self.lock:
for conn in self.connections:
if conn["alias"] == alias:
conn["in_use"] = False
conn["last_used"] = time.time()
break
def get_stats(self):
"""获取连接池统计"""
with self.lock:
total = len(self.connections)
in_use = sum(1 for conn in self.connections if conn["in_use"])
available = total - in_use
return {
"total": total,
"in_use": in_use,
"available": available,
"usage_rate": in_use / total if total > 0 else 0
}
def close_all(self):
"""关闭所有连接"""
print("关闭连接池...")
for conn in self.connections:
try:
connections.disconnect(conn["alias"])
except:
pass
self.connections.clear()
print("连接池已关闭")
# 使用连接池
pool = ConnectionPool(pool_size=5)
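def worker_task(task_id, pool):
    """Worker task: borrow a connection, run a query, and always return it."""
    alias = None
    try:
        # Borrow a connection
        alias = pool.acquire(timeout=5)
        print(f"Task {task_id}: acquired connection {alias}")
        # Bind the collection to the borrowed connection alias
        collection = Collection("documents", using=alias)
        # Simulated query
        time.sleep(0.1)
        print(f"Task {task_id}: query finished")
    except Exception as e:
        print(f"Task {task_id}: failed - {e}")
    finally:
        # Release in finally so that a failed query cannot leak the connection
        if alias is not None:
            pool.release(alias)
            print(f"Task {task_id}: released connection {alias}")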
# 创建多个工作线程
threads = []
for i in range(10):
thread = threading.Thread(target=worker_task, args=(i, pool))
threads.append(thread)
thread.start()
# 等待所有线程完成
for thread in threads:
thread.join()
# 打印连接池统计
stats = pool.get_stats()
print(f"\n连接池统计:")
print(f" 总连接数: {stats['total']}")
print(f" 使用中: {stats['in_use']}")
print(f" 可用: {stats['available']}")
print(f" 使用率: {stats['usage_rate']*100:.1f}%")
# 关闭连接池
pool.close_all()
print("\n连接池配置建议:")
print(" 1. 连接池大小 = 并发数 * 1-2")
print(" 2. 配置连接超时和空闲超时")
print(" 3. 实现连接健康检查")
print(" 4. 监控连接池使用率")
print(" 5. 动态调整连接池大小")
---
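An alternative to polling with `time.sleep` is a blocking queue combined with a context manager, which also guarantees that the alias is returned. This is a sketch with a hypothetical `AliasPool` class; it assumes the aliases have already been registered elsewhere (for example with `connections.connect`) and only manages who may use which alias.

```python
import queue
from contextlib import contextmanager


class AliasPool:
    """Hands out connection aliases; blocks instead of polling when none is free."""

    def __init__(self, aliases):
        self._free = queue.Queue()
        for alias in aliases:
            self._free.put(alias)

    @contextmanager
    def borrow(self, timeout=5.0):
        try:
            alias = self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no free connection alias") from None
        try:
            yield alias
        finally:
            self._free.put(alias)    # returned even if the body raises

    def available(self):
        return self._free.qsize()


pool = AliasPool([f"conn_{i}" for i in range(3)])

with pool.borrow() as alias:
    print(alias, "borrowed;", pool.available(), "left")

try:
    with pool.borrow() as alias:
        raise RuntimeError("query failed")
except RuntimeError:
    pass

print("available after failure:", pool.available())
```

Inside the `with` block the alias would be passed as `using=alias` when constructing a `Collection`.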
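b. Request rate limiting
a. Description
Rate limiting protects the system from overload and keeps the service stable. Limits can apply to QPS, concurrency or request size. Common algorithms include token bucket, leaky bucket, fixed window and sliding window. Set the thresholds according to measured system capacity. Requests over the limit are either rejected with an error or queued. Different users can be given different limits. Degrade gracefully so that core functionality stays available.
b. Code example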
---
import time
import threading
from collections import deque
# 令牌桶限流器
class TokenBucketLimiter:
def __init__(self, rate, capacity):
"""
rate: 每秒生成的令牌数
capacity: 桶容量
"""
self.rate = rate
self.capacity = capacity
self.tokens = capacity
self.last_update = time.time()
self.lock = threading.Lock()
def acquire(self, tokens=1):
"""获取令牌"""
with self.lock:
now = time.time()
# 补充令牌
elapsed = now - self.last_update
self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
self.last_update = now
# 尝试获取令牌
if self.tokens >= tokens:
self.tokens -= tokens
return True
else:
return False
def wait_acquire(self, tokens=1, timeout=10):
"""等待获取令牌"""
start = time.time()
while time.time() - start < timeout:
if self.acquire(tokens):
return True
time.sleep(0.01)
return False
# 滑动窗口限流器
class SlidingWindowLimiter:
def __init__(self, max_requests, window_seconds):
"""
max_requests: 窗口内最大请求数
window_seconds: 窗口大小(秒)
"""
self.max_requests = max_requests
self.window_seconds = window_seconds
self.requests = deque()
self.lock = threading.Lock()
def acquire(self):
"""尝试获取许可"""
with self.lock:
now = time.time()
# 移除过期请求
while self.requests and self.requests[0] < now - self.window_seconds:
self.requests.popleft()
# 检查是否超过限制
if len(self.requests) < self.max_requests:
self.requests.append(now)
return True
else:
return False
def get_current_rate(self):
"""获取当前请求率"""
with self.lock:
now = time.time()
# 移除过期请求
while self.requests and self.requests[0] < now - self.window_seconds:
self.requests.popleft()
return len(self.requests) / self.window_seconds
# 并发限流器
class ConcurrencyLimiter:
def __init__(self, max_concurrent):
"""
max_concurrent: 最大并发数
"""
self.max_concurrent = max_concurrent
self.current = 0
self.lock = threading.Lock()
def acquire(self):
"""获取并发许可"""
with self.lock:
if self.current < self.max_concurrent:
self.current += 1
return True
else:
return False
def release(self):
"""释放并发许可"""
with self.lock:
if self.current > 0:
self.current -= 1
def get_current(self):
"""获取当前并发数"""
with self.lock:
return self.current
# 测试限流器
print("测试令牌桶限流器:")
token_limiter = TokenBucketLimiter(rate=10, capacity=20)
success_count = 0
for i in range(50):
if token_limiter.acquire():
success_count += 1
print(f" 尝试50次请求,成功{success_count}次")
print("\n测试滑动窗口限流器:")
window_limiter = SlidingWindowLimiter(max_requests=100, window_seconds=1)
success_count = 0
for i in range(150):
if window_limiter.acquire():
success_count += 1
print(f" 尝试150次请求,成功{success_count}次")
print(f" 当前请求率: {window_limiter.get_current_rate():.2f} qps")
print("\n测试并发限流器:")
concurrency_limiter = ConcurrencyLimiter(max_concurrent=10)
acquired = 0
for i in range(20):
if concurrency_limiter.acquire():
acquired += 1
print(f" 尝试获取20个并发,成功{acquired}个")
print(f" 当前并发数: {concurrency_limiter.get_current()}")
print("\n限流策略建议:")
print(" 1. 令牌桶: 允许突发流量,平滑限流")
print(" 2. 滑动窗口: 精确控制时间窗口内请求数")
print(" 3. 并发限流: 控制同时执行的请求数")
print(" 4. 组合使用: QPS + 并发双重限流")
print(" 5. 分级限流: 不同用户不同限制")
---
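The description mentions giving different users different limits, which the limiters above do not show. A minimal sketch of that idea, assuming one token bucket per user sized by a tier table: `TieredLimiter`, the tier names, and the numbers are all illustrative, and the code is single-threaded (a lock around `allow` would be needed for concurrent use). The clock is injectable so the behavior can be tested without sleeping.

```python
import time


class TokenBucket:
    """Token bucket with an injectable clock so it can be tested deterministically."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def acquire(self, tokens=1):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False


class TieredLimiter:
    """One bucket per user, sized by the user's tier (tier names are illustrative)."""

    def __init__(self, tiers, default_tier, clock=time.monotonic):
        self.tiers = tiers
        self.default_tier = default_tier
        self.clock = clock
        self.user_tier = {}
        self.buckets = {}

    def set_tier(self, user_id, tier):
        self.user_tier[user_id] = tier
        self.buckets.pop(user_id, None)  # rebuild the bucket with the new limits

    def allow(self, user_id):
        bucket = self.buckets.get(user_id)
        if bucket is None:
            tier = self.user_tier.get(user_id, self.default_tier)
            rate, capacity = self.tiers[tier]
            bucket = self.buckets[user_id] = TokenBucket(rate, capacity, self.clock)
        return bucket.acquire()


# (rate per second, burst capacity) per tier; the numbers are placeholders.
limiter = TieredLimiter({"free": (1, 2), "premium": (10, 20)}, default_tier="free")
limiter.set_tier("alice", "premium")
print(sum(limiter.allow("alice") for _ in range(30)))
print(sum(limiter.allow("bob") for _ in range(30)))
```

Buckets are created lazily on a user's first request, so memory grows with the number of active users; a production version would also expire idle buckets.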
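02.Resource isolation
a.Resource group configuration
a.Description
Resource groups isolate tenants from one another in a shared cluster. A resource group is a set of QueryNodes, so each business line can be given its own nodes and its queries do not compete with other groups for memory or CPU. Capacity is expressed as a node count (the `node_num` values under `requests` and `limits` in the example), not as a per-group CPU or memory quota. Nodes can be moved between groups at runtime, which lets core workloads keep dedicated capacity. This suits multi-tenant and multi-business deployments, provided the node allocation is planned up front. The `config` argument accepted by `utility.create_resource_group` differs between pymilvus releases, so check the SDK reference for the installed version.
b.Code example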
---
from pymilvus import utility
# 资源组管理类
class ResourceGroupManager:
@staticmethod
def create_resource_group(name, config=None):
"""创建资源组"""
if config is None:
config = {
"requests": {"node_num": 1},
"limits": {"node_num": 2}
}
try:
utility.create_resource_group(name, config=config)
print(f"创建资源组: {name}")
print(f" 配置: {config}")
except Exception as e:
print(f"创建资源组失败: {e}")
@staticmethod
def list_resource_groups():
"""列出所有资源组"""
try:
groups = utility.list_resource_groups()
print("\n资源组列表:")
for group in groups:
print(f" - {group}")
return groups
except Exception as e:
print(f"列出资源组失败: {e}")
return []
@staticmethod
def describe_resource_group(name):
"""查看资源组详情"""
try:
info = utility.describe_resource_group(name)
print(f"\n资源组详情: {name}")
print(f" {info}")
return info
except Exception as e:
print(f"查看资源组失败: {e}")
return None
@staticmethod
def transfer_node(source_group, target_group, num_nodes=1):
"""在资源组间转移节点"""
try:
utility.transfer_node(source_group, target_group, num_nodes)
print(f"转移{num_nodes}个节点: {source_group} -> {target_group}")
except Exception as e:
print(f"转移节点失败: {e}")
@staticmethod
def drop_resource_group(name):
"""删除资源组"""
try:
utility.drop_resource_group(name)
print(f"删除资源组: {name}")
except Exception as e:
print(f"删除资源组失败: {e}")
# 使用资源组管理器
manager = ResourceGroupManager()
# 创建资源组
print("创建资源组:")
manager.create_resource_group("business_a", config={"requests": {"node_num": 2}})
manager.create_resource_group("business_b", config={"requests": {"node_num": 1}})
manager.create_resource_group("business_c", config={"requests": {"node_num": 1}})
# 列出资源组
groups = manager.list_resource_groups()
# 查看资源组详情
for group in groups:
manager.describe_resource_group(group)
# 资源组使用示例
print("\n资源组使用场景:")
print(" 1. 多租户隔离: 每个租户独立资源组")
print(" 2. 业务隔离: 核心业务和非核心业务分离")
print(" 3. 环境隔离: 生产、测试、开发环境分离")
print(" 4. 优先级保证: 高优先级业务独享资源")
print(" 5. 资源弹性: 动态调整资源分配")
---
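b.Query priority
a.Description
Query prioritization makes sure important queries run first. Each query is tagged with a priority level; high-priority queries are dequeued and executed ahead of the rest, while low-priority queries may be delayed or rejected when resources are tight. This helps protect the SLA of core workloads in multi-business deployments. The example implements prioritization on the client side with a priority queue and a pool of worker threads, and records per-priority latency so the policy can be monitored and tuned.
b.Code example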
---
import time
import threading
from queue import PriorityQueue
from pymilvus import Collection
import numpy as np
# 优先级查询管理器
class PriorityQueryManager:
def __init__(self, collection, max_workers=4):
self.collection = collection
self.max_workers = max_workers
self.query_queue = PriorityQueue()
self.workers = []
self.running = False
self.stats = {
"high": {"count": 0, "total_latency": 0},
"medium": {"count": 0, "total_latency": 0},
"low": {"count": 0, "total_latency": 0}
}
self.stats_lock = threading.Lock()
def start(self):
"""启动工作线程"""
self.running = True
for i in range(self.max_workers):
worker = threading.Thread(target=self._worker, args=(i,))
worker.daemon = True
worker.start()
self.workers.append(worker)
print(f"启动{self.max_workers}个工作线程")
def stop(self):
"""停止工作线程"""
self.running = False
for worker in self.workers:
worker.join()
print("所有工作线程已停止")
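    def _worker(self, worker_id):
        """Worker thread loop."""
        from queue import Empty
        while self.running:
            try:
                # Take the next task (lower number = higher priority)
                priority, query_id, query_vector, priority_name = self.query_queue.get(timeout=0.1)
            except Empty:
                continue  # nothing queued; re-check self.running
            try:
                # Run the search
                start = time.time()
                search_params = {
                    "metric_type": "L2",
                    "params": {"nprobe": 16}
                }
                results = self.collection.search(
                    data=[query_vector],
                    anns_field="embedding",
                    param=search_params,
                    limit=10
                )
                latency = time.time() - start
                # Update statistics
                with self.stats_lock:
                    self.stats[priority_name]["count"] += 1
                    self.stats[priority_name]["total_latency"] += latency
                print(f"Worker {worker_id}: finished query {query_id} (priority: {priority_name}, latency: {latency*1000:.2f}ms)")
            except Exception as e:
                print(f"Worker {worker_id}: query {query_id} failed - {e}")
            finally:
                # Always mark the task done; otherwise query_queue.join() blocks forever
                self.query_queue.task_done()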
def submit_query(self, query_vector, priority="medium"):
"""提交查询"""
# 优先级映射(数字越小优先级越高)
priority_map = {
"high": 0,
"medium": 1,
"low": 2
}
priority_value = priority_map.get(priority, 1)
query_id = f"{priority}_{int(time.time()*1000000)}"
self.query_queue.put((priority_value, query_id, query_vector, priority))
return query_id
def get_stats(self):
"""获取统计信息"""
with self.stats_lock:
stats_copy = {}
for priority, data in self.stats.items():
if data["count"] > 0:
avg_latency = data["total_latency"] / data["count"]
else:
avg_latency = 0
stats_copy[priority] = {
"count": data["count"],
"avg_latency": avg_latency * 1000 # ms
}
return stats_copy
# 使用优先级查询管理器
collection = Collection("documents")
collection.load()
manager = PriorityQueryManager(collection, max_workers=4)
manager.start()
# 提交不同优先级的查询
print("\n提交查询:")
for i in range(10):
query_vector = [np.random.random() for _ in range(128)]
if i < 3:
priority = "high"
elif i < 7:
priority = "medium"
else:
priority = "low"
query_id = manager.submit_query(query_vector, priority=priority)
print(f" 提交查询{query_id} (优先级:{priority})")
time.sleep(0.1)
# 等待所有查询完成
manager.query_queue.join()
# 打印统计
stats = manager.get_stats()
print(f"\n查询统计:")
print(f"{'优先级':>10s} {'数量':>8s} {'平均延迟':>12s}")
print("-" * 35)
for priority in ["high", "medium", "low"]:
if priority in stats:
print(f"{priority:>10s} {stats[priority]['count']:8d} {stats[priority]['avg_latency']:11.2f}ms")
# 停止管理器
manager.stop()
print("\n优先级策略建议:")
print(" 1. 核心业务: 高优先级")
print(" 2. 常规业务: 中优先级")
print(" 3. 批量任务: 低优先级")
print(" 4. 监控不同优先级的性能")
print(" 5. 动态调整优先级策略")
---
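The description says low-priority queries may be rejected when resources are tight, which the manager above does not implement. The sketch below shows one possible admission rule, assuming a bounded queue and a depth threshold beyond which "low" work is shed. `PriorityScheduler`, `max_pending`, and `shed_threshold` are hypothetical names. A sequence counter keeps FIFO order within a level and ensures payloads are never compared. The class is not thread-safe as written.

```python
import heapq
import itertools


class PriorityScheduler:
    """Priority queue that keeps FIFO order within a level and sheds low-priority work."""

    LEVELS = {"high": 0, "medium": 1, "low": 2}

    def __init__(self, max_pending, shed_threshold):
        self.max_pending = max_pending        # hard cap on queued tasks
        self.shed_threshold = shed_threshold  # at or above this depth, "low" is rejected
        self._heap = []
        self._seq = itertools.count()         # tie-breaker: payloads are never compared

    def submit(self, payload, priority="medium"):
        depth = len(self._heap)
        if depth >= self.max_pending:
            return False
        if priority == "low" and depth >= self.shed_threshold:
            return False
        heapq.heappush(self._heap, (self.LEVELS[priority], next(self._seq), payload))
        return True

    def next_task(self):
        if not self._heap:
            return None
        _, _, payload = heapq.heappop(self._heap)
        return payload


sched = PriorityScheduler(max_pending=4, shed_threshold=2)
print(sched.submit("report", "low"))       # accepted: the queue is empty
print(sched.submit("search-1", "medium"))
print(sched.submit("batch", "low"))        # rejected: depth reached shed_threshold
print(sched.submit("checkout", "high"))
print([sched.next_task() for _ in range(3)])
```

Rejecting at submit time gives the caller an immediate signal to retry later, rather than letting low-priority work wait in the queue indefinitely.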
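8.5 Caching strategies
01.Query cache
a.Caching mechanism
a.Description
A query cache stores the results of hot queries so that repeated work is avoided. When the same query vector arrives with the same search parameters, the cached result is returned directly, which can cut latency substantially. It fits workloads with repetitive query patterns, such as recommendation systems; for free-form queries whose vectors rarely repeat exactly, the hit rate is likely to be low. The cache needs a size limit and an expiry policy because it consumes extra memory, so weigh the latency gain against the memory cost. It also needs warm-up and invalidation mechanisms.
b.Code example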
---
from pymilvus import Collection
import numpy as np
import time
import hashlib
import json
collection = Collection("documents")
collection.load()
# 查询缓存类
class QueryCache:
def __init__(self, max_size=1000, ttl=300):
"""
max_size: 最大缓存条目数
ttl: 缓存过期时间(秒)
"""
self.max_size = max_size
self.ttl = ttl
self.cache = {}
self.access_count = {}
self.hit_count = 0
self.miss_count = 0
def _generate_key(self, query_vector, search_params, limit):
"""生成缓存键"""
# 将查询参数序列化为字符串
key_data = {
"vector": [round(v, 6) for v in query_vector], # 保留6位小数
"params": search_params,
"limit": limit
}
key_str = json.dumps(key_data, sort_keys=True)
key_hash = hashlib.md5(key_str.encode()).hexdigest()
return key_hash
def get(self, query_vector, search_params, limit):
"""从缓存获取结果"""
key = self._generate_key(query_vector, search_params, limit)
if key in self.cache:
entry = self.cache[key]
# 检查是否过期
if time.time() - entry["timestamp"] < self.ttl:
self.hit_count += 1
self.access_count[key] = self.access_count.get(key, 0) + 1
return entry["results"]
else:
# 过期,删除缓存
del self.cache[key]
if key in self.access_count:
del self.access_count[key]
self.miss_count += 1
return None
    def put(self, query_vector, search_params, limit, results):
        """Store a result in the cache."""
        key = self._generate_key(query_vector, search_params, limit)
        # Evict only when inserting a new key into a full cache
        if key not in self.cache and len(self.cache) >= self.max_size:
            # LFU eviction: drop the entry with the fewest hits
            # (this is least-frequently-used, not LRU)
            if self.access_count:
                lfu_key = min(self.access_count, key=self.access_count.get)
                del self.cache[lfu_key]
                del self.access_count[lfu_key]
        self.cache[key] = {
            "results": results,
            "timestamp": time.time()
        }
        self.access_count[key] = 0
def get_stats(self):
"""获取缓存统计"""
total_requests = self.hit_count + self.miss_count
hit_rate = self.hit_count / total_requests if total_requests > 0 else 0
return {
"cache_size": len(self.cache),
"max_size": self.max_size,
"hit_count": self.hit_count,
"miss_count": self.miss_count,
"hit_rate": hit_rate,
"total_requests": total_requests
}
def clear(self):
"""清空缓存"""
self.cache.clear()
self.access_count.clear()
self.hit_count = 0
self.miss_count = 0
# 带缓存的查询类
class CachedSearch:
def __init__(self, collection, cache):
self.collection = collection
self.cache = cache
def search(self, query_vector, search_params, limit=10):
"""带缓存的查询"""
# 尝试从缓存获取
cached_results = self.cache.get(query_vector, search_params, limit)
if cached_results is not None:
return cached_results, True # 缓存命中
# 缓存未命中,执行实际查询
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param=search_params,
limit=limit
)
# 将结果放入缓存
self.cache.put(query_vector, search_params, limit, results[0])
return results[0], False # 缓存未命中
# 使用查询缓存
cache = QueryCache(max_size=100, ttl=60)
cached_search = CachedSearch(collection, cache)
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
# 生成一些查询向量
query_vectors = [[np.random.random() for _ in range(128)] for _ in range(10)]
print("测试查询缓存:\n")
# 第一轮查询(缓存未命中)
print("第一轮查询(缓存未命中):")
for i, query_vector in enumerate(query_vectors):
start = time.time()
results, hit = cached_search.search(query_vector, search_params)
latency = time.time() - start
print(f" 查询{i+1}: {'命中' if hit else '未命中'}, 延迟: {latency*1000:.2f}ms")
# 第二轮查询(缓存命中)
print("\n第二轮查询(缓存命中):")
for i, query_vector in enumerate(query_vectors):
start = time.time()
results, hit = cached_search.search(query_vector, search_params)
latency = time.time() - start
print(f" 查询{i+1}: {'命中' if hit else '未命中'}, 延迟: {latency*1000:.2f}ms")
# 打印缓存统计
stats = cache.get_stats()
print(f"\n缓存统计:")
print(f" 缓存大小: {stats['cache_size']}/{stats['max_size']}")
print(f" 命中次数: {stats['hit_count']}")
print(f" 未命中次数: {stats['miss_count']}")
print(f" 命中率: {stats['hit_rate']*100:.1f}%")
print(f" 总请求数: {stats['total_requests']}")
print("\n查询缓存建议:")
print(" 1. 适合查询模式重复的场景")
print(" 2. 配置合适的缓存大小和TTL")
print(" 3. 使用LRU等淘汰策略")
print(" 4. 监控缓存命中率")
print(" 5. 数据更新时及时失效缓存")
---
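The tips above recommend an LRU eviction policy. A minimal sketch of an LRU cache with a TTL, assuming `collections.OrderedDict` as the recency list: `LRUTTLCache` is a hypothetical name, the clock is injectable for testing, and a miss is reported as `None`, so `None` itself cannot be stored as a value.

```python
import time
from collections import OrderedDict


class LRUTTLCache:
    """LRU cache with per-entry expiry; the clock is injectable for testing."""

    def __init__(self, max_size, ttl, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()  # key -> (stored_at, value), least recently used first

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self._data[key]          # expired
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
        self._data[key] = (self.clock(), value)

    def __len__(self):
        return len(self._data)


cache = LRUTTLCache(max_size=2, ttl=60)
cache.put("q1", ["doc_a"])
cache.put("q2", ["doc_b"])
cache.get("q1")              # q1 is now the most recently used entry
cache.put("q3", ["doc_c"])   # evicts q2, not q1
print(cache.get("q1"), cache.get("q2"), cache.get("q3"))
```

Compared with counting hits, recency-based eviction adapts faster when the set of hot queries shifts, because an entry that was popular an hour ago does not stay protected by its old hit count.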
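b.Cache warm-up
a.Description
Cache warm-up loads hot data into the cache at startup so that a cold start does not produce a burst of cache misses. Hot queries can be identified from historical query logs. Warm-up can noticeably improve performance right after startup, but the time it takes has to be balanced against that benefit. Run it asynchronously so that it does not block service startup, load entries incrementally, and monitor the resulting hit rate to refine the strategy.
b.Code example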
---
from pymilvus import Collection
import numpy as np
import time
import json
collection = Collection("documents")
collection.load()
# 缓存预热类
class CacheWarmer:
def __init__(self, cached_search):
self.cached_search = cached_search
def warm_from_queries(self, query_list):
"""从查询列表预热缓存"""
print(f"\n开始缓存预热,共{len(query_list)}个查询...")
start = time.time()
for i, query_info in enumerate(query_list):
query_vector = query_info["vector"]
search_params = query_info["params"]
limit = query_info.get("limit", 10)
# 执行查询,填充缓存
self.cached_search.search(query_vector, search_params, limit)
if (i + 1) % 10 == 0:
print(f" 已预热 {i+1}/{len(query_list)} 个查询")
elapsed = time.time() - start
print(f"缓存预热完成,耗时: {elapsed:.2f}s")
return elapsed
def warm_from_log(self, log_file, top_n=100):
"""从查询日志预热缓存"""
print(f"\n从查询日志预热缓存(Top {top_n})...")
# 读取查询日志
try:
with open(log_file, 'r') as f:
logs = json.load(f)
# 统计查询频率
query_freq = {}
for log in logs:
query_key = json.dumps(log, sort_keys=True)
query_freq[query_key] = query_freq.get(query_key, 0) + 1
# 选择Top N热点查询
top_queries = sorted(query_freq.items(), key=lambda x: x[1], reverse=True)[:top_n]
# 预热
query_list = [json.loads(q[0]) for q in top_queries]
elapsed = self.warm_from_queries(query_list)
print(f"预热了{len(query_list)}个热点查询")
return elapsed
except Exception as e:
print(f"从日志预热失败: {e}")
return 0
def warm_async(self, query_list, callback=None):
"""异步预热缓存"""
import threading
def warm_task():
elapsed = self.warm_from_queries(query_list)
if callback:
callback(elapsed)
thread = threading.Thread(target=warm_task, daemon=True)
thread.start()
print("异步预热已启动")
return thread
# Use cache warm-up
# QueryCache and CachedSearch are the classes defined in the query cache example above
cache = QueryCache(max_size=100, ttl=300)
cached_search = CachedSearch(collection, cache)
warmer = CacheWarmer(cached_search)
# 准备预热查询列表
warm_queries = []
search_params = {
"metric_type": "L2",
"params": {"nprobe": 16}
}
for i in range(20):
warm_queries.append({
"vector": [np.random.random() for _ in range(128)],
"params": search_params,
"limit": 10
})
# 同步预热
warmer.warm_from_queries(warm_queries)
# 验证预热效果
stats = cache.get_stats()
print(f"\n预热后缓存统计:")
print(f" 缓存大小: {stats['cache_size']}")
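# Asynchronous warm-up example
def on_warm_complete(elapsed):
    print(f"\nAsync warm-up callback: took {elapsed:.2f}s")
warm_thread = warmer.warm_async(warm_queries[:10], callback=on_warm_complete)
# The warm-up thread is a daemon; join it in this script so the process does not exit first
warm_thread.join()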
print("\n缓存预热建议:")
print(" 1. 启动时预热热点查询")
print(" 2. 从历史日志识别热点")
print(" 3. 异步预热,不阻塞启动")
print(" 4. 增量预热,逐步加载")
print(" 5. 监控预热效果,优化策略")
---
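The description also mentions incremental warm-up, which the class above does not cover. One possible shape, sketched under the assumption that each logged query can be reduced to a hashable key: pick the most frequent keys, then replay them in small batches with an optional pause so live traffic is not starved. `hottest`, `warm_in_batches`, and their parameters are hypothetical names; `run_query` stands for whatever function fills the cache.

```python
import time
from collections import Counter


def hottest(query_keys, top_n):
    """Return the top_n most frequent keys, most frequent first."""
    return [key for key, _ in Counter(query_keys).most_common(top_n)]


def warm_in_batches(keys, run_query, batch_size=10, pause=0.0, sleep=time.sleep):
    """Run `run_query` for each key in batches, yielding the running total after each batch."""
    done = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            run_query(key)
        done = min(start + batch_size, len(keys))
        yield done
        if pause and done < len(keys):
            sleep(pause)  # leave headroom for live traffic between batches


log = ["q1", "q2", "q1", "q3", "q1", "q2"]
warmed = []
for progress in warm_in_batches(hottest(log, top_n=2), warmed.append, batch_size=1):
    print(f"warmed {progress} queries")
print(warmed)
```

Because the function is a generator, the caller decides how to report progress and can stop early, for example when the cache is already full.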
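02.Data cache
a.Collection cache
a.Description
A Collection cache keeps frequently used Collections loaded in memory, avoiding the cost of repeatedly loading and releasing them. It is useful when there are many Collections and only some are hot. Cap the number of cached Collections to avoid running out of memory, evict with an LRU policy, and track access frequency so the cap can be tuned. Collections that are expected to be needed can be preloaded.
b.Code example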
---
from pymilvus import Collection
import time
from collections import OrderedDict
# Collection缓存管理器
class CollectionCacheManager:
def __init__(self, max_cached=10):
"""
max_cached: 最大缓存Collection数量
"""
self.max_cached = max_cached
self.cache = OrderedDict()
self.access_count = {}
self.hit_count = 0
self.miss_count = 0
def get_collection(self, collection_name):
"""获取Collection(带缓存)"""
if collection_name in self.cache:
# 缓存命中
self.hit_count += 1
self.access_count[collection_name] = self.access_count.get(collection_name, 0) + 1
# 移到最后(LRU)
self.cache.move_to_end(collection_name)
return self.cache[collection_name]
else:
# 缓存未命中
self.miss_count += 1
# 加载Collection
collection = Collection(collection_name)
# 检查缓存大小
if len(self.cache) >= self.max_cached:
# 淘汰最久未使用的
evicted_name, evicted_collection = self.cache.popitem(last=False)
# 释放Collection
try:
evicted_collection.release()
print(f" 淘汰Collection: {evicted_name}")
except:
pass
# 加载并缓存
collection.load()
self.cache[collection_name] = collection
self.access_count[collection_name] = 1
print(f" 加载Collection: {collection_name}")
return collection
def preload_collections(self, collection_names):
"""预加载Collection"""
print(f"\n预加载{len(collection_names)}个Collection...")
for name in collection_names:
self.get_collection(name)
print("预加载完成")
def get_stats(self):
"""获取缓存统计"""
total_requests = self.hit_count + self.miss_count
hit_rate = self.hit_count / total_requests if total_requests > 0 else 0
return {
"cached_collections": len(self.cache),
"max_cached": self.max_cached,
"hit_count": self.hit_count,
"miss_count": self.miss_count,
"hit_rate": hit_rate,
"access_count": self.access_count.copy()
}
def clear(self):
"""清空缓存"""
for collection in self.cache.values():
try:
collection.release()
except:
pass
self.cache.clear()
self.access_count.clear()
print("缓存已清空")
# 使用Collection缓存管理器
cache_manager = CollectionCacheManager(max_cached=5)
# 模拟访问多个Collection
collection_names = ["coll_1", "coll_2", "coll_3", "coll_4", "coll_5", "coll_6"]
print("测试Collection缓存:\n")
# 第一轮访问
print("第一轮访问:")
for name in collection_names:
try:
collection = cache_manager.get_collection(name)
except:
print(f" 加载{name}失败(Collection可能不存在)")
# 第二轮访问(部分命中)
print("\n第二轮访问:")
for name in collection_names[:3]:
try:
collection = cache_manager.get_collection(name)
except:
pass
# 打印统计
stats = cache_manager.get_stats()
print(f"\n缓存统计:")
print(f" 缓存Collection数: {stats['cached_collections']}/{stats['max_cached']}")
print(f" 命中次数: {stats['hit_count']}")
print(f" 未命中次数: {stats['miss_count']}")
print(f" 命中率: {stats['hit_rate']*100:.1f}%")
print(f"\n访问频率:")
for name, count in sorted(stats['access_count'].items(), key=lambda x: x[1], reverse=True):
print(f" {name}: {count}次")
# 清空缓存
cache_manager.clear()
print("\nCollection缓存建议:")
print(" 1. 缓存热点Collection")
print(" 2. 使用LRU淘汰策略")
print(" 3. 配置合适的缓存大小")
print(" 4. 预加载预期使用的Collection")
print(" 5. 监控访问频率,动态调整")
---
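b.Result cache
a.Description
A result cache stores query results to avoid recomputing them. It pays off most when results are large, for example when many vectors are returned, and it can also hold intermediate results such as recall or ranking output. Cache consistency matters: entries must be invalidated when the underlying data changes. The example uses two tiers, an in-memory L1 cache backed by an on-disk L2 cache. Monitor the hit rate and memory use, and weigh the benefit against the maintenance cost.
b.Code example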
---
from pymilvus import Collection
import numpy as np
import time
import pickle
import os
# 分层结果缓存
class TieredResultCache:
def __init__(self, l1_max_size=100, l2_cache_dir="/tmp/milvus_cache"):
"""
l1_max_size: L1内存缓存大小
l2_cache_dir: L2磁盘缓存目录
"""
self.l1_max_size = l1_max_size
self.l2_cache_dir = l2_cache_dir
self.l1_cache = {} # 内存缓存
self.l1_hit = 0
self.l2_hit = 0
self.miss = 0
# 创建L2缓存目录
os.makedirs(l2_cache_dir, exist_ok=True)
def _get_cache_path(self, key):
"""获取L2缓存文件路径"""
return os.path.join(self.l2_cache_dir, f"{key}.pkl")
def get(self, key):
"""获取缓存结果"""
# L1缓存查找
if key in self.l1_cache:
self.l1_hit += 1
return self.l1_cache[key], "L1"
# L2缓存查找
cache_path = self._get_cache_path(key)
if os.path.exists(cache_path):
try:
with open(cache_path, 'rb') as f:
results = pickle.load(f)
self.l2_hit += 1
# 提升到L1缓存
self._put_l1(key, results)
return results, "L2"
except:
pass
# 缓存未命中
self.miss += 1
return None, None
def _put_l1(self, key, results):
"""放入L1缓存"""
# 检查L1缓存大小
if len(self.l1_cache) >= self.l1_max_size:
# 淘汰一个(简单FIFO)
evicted_key = next(iter(self.l1_cache))
evicted_results = self.l1_cache.pop(evicted_key)
# 写入L2缓存
self._put_l2(evicted_key, evicted_results)
self.l1_cache[key] = results
def _put_l2(self, key, results):
"""放入L2缓存"""
cache_path = self._get_cache_path(key)
try:
with open(cache_path, 'wb') as f:
pickle.dump(results, f)
except:
pass
def put(self, key, results):
"""放入缓存"""
self._put_l1(key, results)
def get_stats(self):
"""获取统计信息"""
total = self.l1_hit + self.l2_hit + self.miss
return {
"l1_size": len(self.l1_cache),
"l1_hit": self.l1_hit,
"l2_hit": self.l2_hit,
"miss": self.miss,
"total": total,
"hit_rate": (self.l1_hit + self.l2_hit) / total if total > 0 else 0
}
def clear(self):
"""清空缓存"""
self.l1_cache.clear()
# 清空L2缓存
for filename in os.listdir(self.l2_cache_dir):
filepath = os.path.join(self.l2_cache_dir, filename)
try:
os.remove(filepath)
except:
pass
# 使用分层缓存
tiered_cache = TieredResultCache(l1_max_size=5)
print("测试分层结果缓存:\n")
# 模拟查询和缓存
for i in range(10):
key = f"query_{i}"
# 尝试从缓存获取
results, source = tiered_cache.get(key)
if results is None:
# 缓存未命中,生成结果
results = [np.random.random() for _ in range(100)]
tiered_cache.put(key, results)
print(f" {key}: 未命中,生成结果")
else:
print(f" {key}: 命中({source}缓存)")
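# Access the first few queries again
print("\nSecond access:")
for i in range(5):
    key = f"query_{i}"
    results, source = tiered_cache.get(key)
    # Compare with None: an empty result list is still a valid cached value
    if results is not None:
        print(f" {key}: hit ({source} cache)")
    else:
        print(f" {key}: miss")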
# 打印统计
stats = tiered_cache.get_stats()
print(f"\n缓存统计:")
print(f" L1缓存大小: {stats['l1_size']}")
print(f" L1命中: {stats['l1_hit']}")
print(f" L2命中: {stats['l2_hit']}")
print(f" 未命中: {stats['miss']}")
print(f" 总命中率: {stats['hit_rate']*100:.1f}%")
# 清空缓存
tiered_cache.clear()
print("\n结果缓存建议:")
print(" 1. 缓存大结果,避免重复计算")
print(" 2. 分层缓存,平衡速度和容量")
print(" 3. L1内存缓存热点,L2磁盘缓存冷数据")
print(" 4. 数据更新时及时失效缓存")
print(" 5. 监控缓存命中率和大小")
---
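The description asks for cached results to be invalidated when data changes, which the tiered cache above leaves out. One common approach, sketched here as an assumption rather than as anything Milvus provides, is to embed a per-collection version number in every cache key and bump it after each write. `VersionedCache` and its methods are hypothetical names.

```python
class VersionedCache:
    """Result cache whose keys embed a per-collection version number."""

    def __init__(self):
        self._versions = {}   # collection name -> current version
        self._store = {}      # (collection, version, query_key) -> results

    def _key(self, collection, query_key):
        return (collection, self._versions.get(collection, 0), query_key)

    def get(self, collection, query_key):
        return self._store.get(self._key(collection, query_key))

    def put(self, collection, query_key, results):
        self._store[self._key(collection, query_key)] = results

    def invalidate(self, collection):
        """Call after insert/delete/upsert: old entries become unreachable."""
        current = self._versions.get(collection, 0)
        self._versions[collection] = current + 1
        # Drop stale entries eagerly so they do not occupy memory.
        stale = [k for k in self._store if k[0] == collection and k[1] <= current]
        for k in stale:
            del self._store[k]


cache = VersionedCache()
cache.put("documents", "query_0", [1, 2, 3])
print(cache.get("documents", "query_0"))
cache.invalidate("documents")
print(cache.get("documents", "query_0"))
```

Invalidation is per collection, so a write to one collection does not discard cached results for the others.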
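9 Cluster deployment
9.1 Distributed architecture
01.Architecture components
a.Component roles
a.Description
Milvus uses a distributed architecture that separates storage from compute. The main parts are the Proxy (the access layer that receives client requests), the coordinators, the worker nodes, and the storage layer. The coordinators are Root Coord, Data Coord, Query Coord, and Index Coord. The worker nodes are Query Node, Data Node, and Index Node. The storage layer uses MinIO/S3 for vector data and index files, etcd for metadata, and Pulsar/Kafka as the message queue. Each component scales independently, which allows horizontal scale-out.
b.Code example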
---
# Milvus分布式架构组件说明
architecture = {
"coordinators": {
"root_coord": {
"role": "全局协调器",
"responsibilities": [
"DDL操作(创建/删除Collection)",
"分配时间戳",
"管理数据通道"
],
"count": 1 # 单实例
},
"data_coord": {
"role": "数据协调器",
"responsibilities": [
"管理数据分段",
"分配数据写入任务",
"触发数据持久化"
],
"count": 1
},
"query_coord": {
"role": "查询协调器",
"responsibilities": [
"管理查询节点",
"分配查询任务",
"负载均衡"
],
"count": 1
},
"index_coord": {
"role": "索引协调器",
"responsibilities": [
"管理索引构建",
"分配索引任务",
"监控索引进度"
],
"count": 1
}
},
"workers": {
"query_node": {
"role": "查询节点",
"responsibilities": [
"执行向量检索",
"加载数据到内存",
"处理查询请求"
],
"scalable": True, # 可水平扩展
"recommended_count": "2-10"
},
"data_node": {
"role": "数据节点",
"responsibilities": [
"接收数据写入",
"数据持久化",
"数据合并"
],
"scalable": True,
"recommended_count": "1-5"
},
"index_node": {
"role": "索引节点",
"responsibilities": [
"构建向量索引",
"索引优化",
"索引持久化"
],
"scalable": True,
"recommended_count": "1-5"
}
},
"storage": {
"object_storage": {
"type": "MinIO/S3",
"stores": "向量数据、索引文件",
"required": True
},
"meta_storage": {
"type": "etcd",
"stores": "元数据、配置信息",
"required": True
},
"message_queue": {
"type": "Pulsar/Kafka",
"purpose": "数据流、事件通知",
"required": True
}
}
}
print("Milvus分布式架构组件:\n")
print("协调器组件:")
for name, info in architecture["coordinators"].items():
print(f" {name}:")
print(f" 角色: {info['role']}")
print(f" 职责: {', '.join(info['responsibilities'])}")
print(f" 实例数: {info['count']}")
print("\n工作节点:")
for name, info in architecture["workers"].items():
print(f" {name}:")
print(f" 角色: {info['role']}")
print(f" 职责: {', '.join(info['responsibilities'])}")
print(f" 可扩展: {'是' if info['scalable'] else '否'}")
print(f" 推荐数量: {info['recommended_count']}")
print("\n存储层:")
for name, info in architecture["storage"].items():
print(f" {name}:")
print(f" 类型: {info['type']}")
print(f" 存储内容: {info.get('stores', info.get('purpose'))}")
print(f" 必需: {'是' if info['required'] else '否'}")
print("\n架构特点:")
print(" 1. 存储计算分离,独立扩展")
print(" 2. 无状态Worker,易于水平扩展")
print(" 3. 协调器单点,通过主备保证高可用")
print(" 4. 统一存储层,支持多种存储后端")
print(" 5. 消息队列解耦,异步处理")
---
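b.Data flow
a.Description
Data in Milvus passes through write, persistence, indexing, and query stages. Written data first enters the message queue; a Data Node consumes it and persists it to object storage. Persistence triggers index building, which an Index Node carries out. At query time a Query Node loads data and indexes from the storage layer. The message queue decouples these stages so they run asynchronously. Data is managed in segments, which supports incremental updates, and in a design similar to an LSM-tree, small segments are merged periodically.
b.Code example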
---
# Milvus数据流转流程
data_flow = {
"write_path": [
{
"step": 1,
"component": "SDK/Client",
"action": "发送insert请求",
"data": "向量数据 + 标量字段"
},
{
"step": 2,
"component": "Proxy",
"action": "路由请求到Data Coord",
"data": "分配时间戳和数据通道"
},
{
"step": 3,
"component": "Data Coord",
"action": "分配数据段和Data Node",
"data": "segment分配信息"
},
{
"step": 4,
"component": "Message Queue",
"action": "写入消息队列",
"data": "数据消息"
},
{
"step": 5,
"component": "Data Node",
"action": "消费消息,缓存数据",
"data": "内存缓冲区"
},
{
"step": 6,
"component": "Data Node",
"action": "达到阈值后持久化",
"data": "写入对象存储(S3/MinIO)"
},
{
"step": 7,
"component": "Data Coord",
"action": "触发索引构建",
"data": "索引任务"
},
{
"step": 8,
"component": "Index Node",
"action": "构建索引并持久化",
"data": "索引文件写入对象存储"
}
],
"query_path": [
{
"step": 1,
"component": "SDK/Client",
"action": "发送search请求",
"data": "查询向量 + 参数"
},
{
"step": 2,
"component": "Proxy",
"action": "路由到Query Coord",
"data": "查询请求"
},
{
"step": 3,
"component": "Query Coord",
"action": "分配Query Node",
"data": "负载均衡分配"
},
{
"step": 4,
"component": "Query Node",
"action": "检查数据是否已加载",
"data": "内存中的数据和索引"
},
{
"step": 5,
"component": "Query Node",
"action": "如未加载,从对象存储加载",
"data": "加载数据和索引到内存"
},
{
"step": 6,
"component": "Query Node",
"action": "执行向量检索",
"data": "使用索引进行ANN搜索"
},
{
"step": 7,
"component": "Query Node",
"action": "返回结果",
"data": "Top-K结果"
},
{
"step": 8,
"component": "Proxy",
"action": "合并多个Query Node结果",
"data": "全局Top-K结果"
}
]
}
print("Milvus数据流转:\n")
print("写入路径:")
for step_info in data_flow["write_path"]:
print(f" 步骤{step_info['step']}: {step_info['component']}")
print(f" 操作: {step_info['action']}")
print(f" 数据: {step_info['data']}")
print("\n查询路径:")
for step_info in data_flow["query_path"]:
print(f" 步骤{step_info['step']}: {step_info['component']}")
print(f" 操作: {step_info['action']}")
print(f" 数据: {step_info['data']}")
print("\n关键特性:")
print(" 1. 异步写入: 通过消息队列解耦")
print(" 2. 批量持久化: 提升写入吞吐量")
print(" 3. 延迟索引: 数据先可查,后建索引")
print(" 4. 按需加载: Query Node按需加载数据")
print(" 5. 结果合并: Proxy合并分布式查询结果")
---
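The last step of the query path, where the Proxy merges per-node results into a global Top-K, can be illustrated with a k-way merge. This is only a sketch of the idea, not Milvus's actual implementation: `merge_top_k` is a hypothetical helper, hits are modeled as `(distance, primary_key)` tuples, and each node's list is assumed to be sorted by ascending distance already (as for the L2 metric).

```python
import heapq
from itertools import islice


def merge_top_k(partial_results, k):
    """Merge per-node hit lists, each already sorted by ascending distance."""
    merged = heapq.merge(*partial_results, key=lambda hit: hit[0])
    return list(islice(merged, k))


# (distance, primary key) pairs; smaller L2 distance means more similar.
node_a = [(0.12, 101), (0.40, 205), (0.77, 330)]
node_b = [(0.09, 412), (0.35, 118), (0.90, 507)]
print(merge_top_k([node_a, node_b], k=3))
```

`heapq.merge` is lazy, so only about k elements are pulled from the inputs. For metrics where a larger score is better, such as inner product, the inputs would be sorted in descending order and `reverse=True` passed to `heapq.merge`.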
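02.Deployment modes
a.Standalone mode
a.Description
In standalone mode all Milvus components run in a single process, which suits development and testing. It has a small resource footprint and is simple to deploy, but it supports neither horizontal scaling nor high availability, and data volume and QPS are bounded by one machine. Typical uses are prototyping, functional testing, and small applications; production deployments should use cluster mode. Standalone mode can be brought up quickly with Docker. The capacity figures in the example are rough orders of magnitude, not measured limits.
b.Code example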
---
# 单机模式部署(Docker)
docker_standalone = """
# Pull the Milvus image (pinned to the same version as the compose file downloaded below)
docker pull milvusdb/milvus:v2.3.0
# 下载配置文件
wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
# 启动Milvus
docker-compose up -d
# 查看状态
docker-compose ps
# 查看日志
docker-compose logs -f milvus-standalone
# 停止服务
docker-compose down
"""
print("Milvus单机模式部署:\n")
print(docker_standalone)
# 单机模式配置示例
standalone_config = {
"deployment": {
"mode": "standalone",
"components": "all-in-one",
"process_count": 1
},
"resources": {
"cpu": "4 cores",
"memory": "8 GB",
"disk": "100 GB SSD"
},
"limitations": {
"max_vectors": "~10M",
"max_qps": "~1000",
"scalability": "不支持",
"high_availability": "不支持"
},
"use_cases": [
"开发测试",
"功能验证",
"小规模应用",
"原型开发"
]
}
print("\n单机模式特点:")
print(f" 部署模式: {standalone_config['deployment']['mode']}")
print(f" 组件: {standalone_config['deployment']['components']}")
print(f" 进程数: {standalone_config['deployment']['process_count']}")
print(f"\n资源需求:")
print(f" CPU: {standalone_config['resources']['cpu']}")
print(f" 内存: {standalone_config['resources']['memory']}")
print(f" 磁盘: {standalone_config['resources']['disk']}")
print(f"\n限制:")
print(f" 最大向量数: {standalone_config['limitations']['max_vectors']}")
print(f" 最大QPS: {standalone_config['limitations']['max_qps']}")
print(f" 可扩展性: {standalone_config['limitations']['scalability']}")
print(f" 高可用: {standalone_config['limitations']['high_availability']}")
print(f"\n适用场景:")
for use_case in standalone_config['use_cases']:
print(f" - {use_case}")
---
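Whether a dataset fits on a single machine depends heavily on vector dimension. A rough lower bound on memory is the size of the raw vectors alone, assuming float32 values (4 bytes each) and ignoring index overhead and scalar fields. `raw_vector_bytes` is a hypothetical helper for this arithmetic.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of the raw vectors alone (float32 by default), excluding index overhead."""
    return num_vectors * dim * bytes_per_value


for dim in (128, 768, 1536):
    gib = raw_vector_bytes(10_000_000, dim) / 2**30
    print(f"10M x {dim}-dim float32: {gib:.1f} GiB")
```

Ten million 128-dimensional vectors come to about 4.8 GiB, which is in line with the 8 GB memory figure in the example, while the same count at 768 dimensions needs about 28.6 GiB for raw data alone and would not fit.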
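b.Cluster mode
a.Description
In cluster mode each component is deployed separately. Coordinators and worker nodes are decoupled, so workers can be scaled out on their own, and coordinators can run in an active-standby configuration for high availability. This mode targets production, with large data volumes and high concurrency. It requires etcd, MinIO/S3, and Pulsar/Kafka as dependencies. Kubernetes is the recommended way to deploy and manage it, and capacity can be scaled up or down with load.
b.Code example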
---
# 集群模式架构配置
cluster_config = {
"deployment": {
"mode": "cluster",
"components": "分布式部署",
"coordinators": {
"root_coord": {"count": 1, "ha": "主备"},
"data_coord": {"count": 1, "ha": "主备"},
"query_coord": {"count": 1, "ha": "主备"},
"index_coord": {"count": 1, "ha": "主备"}
},
"workers": {
"query_node": {"count": "2-10", "scalable": True},
"data_node": {"count": "1-5", "scalable": True},
"index_node": {"count": "1-5", "scalable": True}
}
},
"dependencies": {
"etcd": {
"purpose": "元数据存储",
"ha": "3节点集群",
"required": True
},
"minio_s3": {
"purpose": "对象存储",
"ha": "分布式部署",
"required": True
},
"pulsar_kafka": {
"purpose": "消息队列",
"ha": "集群部署",
"required": True
}
},
"resources": {
"coordinator": {
"cpu": "2 cores",
"memory": "4 GB"
},
"query_node": {
"cpu": "8 cores",
"memory": "32 GB"
},
"data_node": {
"cpu": "4 cores",
"memory": "16 GB"
},
"index_node": {
"cpu": "8 cores",
"memory": "16 GB"
}
},
"capabilities": {
"max_vectors": "100M+",
"max_qps": "10000+",
"scalability": "水平扩展",
"high_availability": "支持"
},
"use_cases": [
"生产环境",
"大规模应用",
"高并发场景",
"企业级应用"
]
}
print("Milvus集群模式配置:\n")
print("协调器部署:")
for name, info in cluster_config["deployment"]["coordinators"].items():
print(f" {name}: {info['count']}个实例, 高可用: {info['ha']}")
print("\n工作节点部署:")
for name, info in cluster_config["deployment"]["workers"].items():
scalable = "支持" if info['scalable'] else "不支持"
print(f" {name}: {info['count']}个实例, 水平扩展: {scalable}")
print("\n依赖组件:")
for name, info in cluster_config["dependencies"].items():
print(f" {name}:")
print(f" 用途: {info['purpose']}")
print(f" 高可用: {info['ha']}")
print(f" 必需: {'是' if info['required'] else '否'}")
print("\n资源配置:")
for component, resources in cluster_config["resources"].items():
print(f" {component}:")
print(f" CPU: {resources['cpu']}")
print(f" 内存: {resources['memory']}")
print("\n能力:")
print(f" 最大向量数: {cluster_config['capabilities']['max_vectors']}")
print(f" 最大QPS: {cluster_config['capabilities']['max_qps']}")
print(f" 可扩展性: {cluster_config['capabilities']['scalability']}")
print(f" 高可用: {cluster_config['capabilities']['high_availability']}")
print("\n适用场景:")
for use_case in cluster_config['use_cases']:
print(f" - {use_case}")
print("\n集群模式优势:")
print(" 1. 水平扩展: Worker节点按需扩展")
print(" 2. 高可用: Coordinator主备,Worker多副本")
print(" 3. 资源隔离: 不同组件独立资源")
print(" 4. 弹性伸缩: 根据负载动态调整")
print(" 5. 故障隔离: 单个节点故障不影响整体")
---
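9.2 Docker Compose
01.Compose configuration
a.Service definitions
a.Description
Docker Compose simplifies deploying Milvus by declaring every service in one YAML file, including the etcd, MinIO, and Pulsar dependencies, together with networks, volumes, and environment variables. The whole stack starts and stops with a single command, and resource settings are easy to adjust. It suits development, testing, and small production setups. The example below runs Milvus as a single standalone container next to its dependencies.
b.Code example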
---
# Complete docker-compose.yml example
version: '3.5'
services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    networks:
      - milvus
  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    networks:
      - milvus
  pulsar:
    container_name: milvus-pulsar
    image: apachepulsar/pulsar:2.8.2
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/pulsar:/pulsar/data
    environment:
      - PULSAR_MEM=" -Xms512m -Xmx512m -XX:MaxDirectMemorySize=1g"
    command: |
      bash -c "bin/apply-config-from-env.py conf/standalone.conf && bin/pulsar standalone"
    networks:
      - milvus
  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.3.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      PULSAR_ADDRESS: pulsar://pulsar:6650
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"
      - "pulsar"
    networks:
      - milvus
networks:
  milvus:
    name: milvus
# Named volumes are declared here, but the services above use bind mounts under ./volumes
volumes:
  etcd:
  minio:
  pulsar:
  milvus:
# Usage:
# 1. Start all services: docker-compose up -d
# 2. Check service status: docker-compose ps
# 3. Follow the logs: docker-compose logs -f standalone
# 4. Stop the services: docker-compose down
# 5. Stop and delete data: docker-compose down -v, then remove the ./volumes directory
#    (the data lives in bind mounts, which -v does not delete)
---
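b.Resource configuration
a.Description
Compose can also set per-service resource limits. CPU and memory limits prevent services from competing for resources. Health checks report whether a service is responding, and the restart policy restarts a container that exits. `depends_on` declares dependencies and controls start order, although in this short form it only orders container start and does not wait for a dependency to become healthy. A replica count can be configured for simple redundancy, but a service with a fixed `container_name` cannot be scaled past one container. Environment variables can override default settings, and mounted volumes persist configuration, logs, and data.
b.Code example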
---
# docker-compose.yml with resource settings (excerpt)
# etcd and pulsar, listed under depends_on, are defined as in the complete example
version: '3.5'
services:
  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.3.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      PULSAR_ADDRESS: pulsar://pulsar:6650
      # Performance tuning parameters (confirm each name against the
      # configuration reference for your Milvus version)
      QUERY_NODE_GRACEFUL_STOP_TIMEOUT: 30
      QUERY_NODE_SEARCH_TIMEOUT: 30
      DATA_NODE_FLUSH_INSERT_BUFFER_SIZE: 16777216
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus/logs:/var/log/milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"
      - "pulsar"
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
    networks:
      - milvus
  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    networks:
      - milvus
networks:
  milvus:
    name: milvus
    driver: bridge
# Notes on the resource settings:
# - limits: the maximum resources a container may use
# - reservations: the resources a container is guaranteed
# - deploy.resources: the legacy docker-compose v1 tool applies these only in
#   Swarm mode or when run with --compatibility
# - restart: always restarts the container automatically when it exits
# - healthcheck: health check configuration
# - logging: log configuration that caps log file size
---
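A quick sanity check on the limits and reservations above can be scripted. The following is a minimal sketch, not part of any Docker tooling: `parse_memory` and `check_service` are hypothetical helper names, and the unit table assumes the binary multiples (k, m, g) that Docker uses for memory values.

```python
import re

# Binary multipliers for Docker-style memory suffixes (assumed: b, k, m, g).
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memory(value: str) -> int:
    """Convert a Compose memory string such as '8G' or '512m' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([bkmg]?)b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised memory value: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def check_service(name: str, limits: dict, reservations: dict) -> list:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    if parse_memory(reservations["memory"]) > parse_memory(limits["memory"]):
        problems.append(f"{name}: memory reservation exceeds limit")
    if float(reservations["cpus"]) > float(limits["cpus"]):
        problems.append(f"{name}: cpu reservation exceeds limit")
    return problems


if __name__ == "__main__":
    # Values copied from the standalone service above.
    print(check_service("standalone",
                        {"cpus": "4.0", "memory": "8G"},
                        {"cpus": "2.0", "memory": "4G"}))
```

A reservation larger than its limit is the kind of mistake this catches before the file is deployed.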
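02. Deployment practice
a. Quick deployment
a. Description
Use the official docker-compose.yml to deploy Milvus quickly: download the file and start every service with one command. Compose pulls the required images and creates the network and volumes. This setup suits quick trials and functional tests; the defaults cover basic needs and can be adjusted as required. Data is persisted under ./volumes, so it survives a restart.
b. Code example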
---
#!/bin/bash
# Quick deployment script for Milvus
set -e
echo "=========================================="
echo "Milvus quick deployment"
echo "=========================================="
# Check for Docker and Docker Compose
echo "Checking the environment..."
if ! command -v docker &> /dev/null; then
    echo "Error: Docker is not installed"
    exit 1
fi
if ! command -v docker-compose &> /dev/null; then
    echo "Error: Docker Compose is not installed"
    exit 1
fi
# Download the docker-compose file
echo "Downloading the docker-compose file..."
wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
# Create the data directories
echo "Creating data directories..."
mkdir -p volumes/etcd volumes/minio volumes/pulsar volumes/milvus
# Start Milvus
echo "Starting Milvus..."
docker-compose up -d
# Give the services time to start
echo "Waiting for the services to start (about 30 seconds)..."
sleep 30
# Show service status
echo ""
echo "Service status:"
docker-compose ps
# Poll the Milvus health endpoint and fail if it never reports OK
echo ""
echo "Checking Milvus health..."
healthy=false
for i in {1..10}; do
    if curl -s http://localhost:9091/healthz | grep -q "OK"; then
        echo "✓ Milvus is healthy"
        healthy=true
        break
    else
        echo "Waiting for Milvus to become ready... ($i/10)"
        sleep 5
    fi
done
if [ "$healthy" != true ]; then
    echo "Error: Milvus did not become healthy; see: docker-compose logs standalone"
    exit 1
fi
echo ""
echo "=========================================="
echo "Milvus deployment finished!"
echo "=========================================="
echo "Connection details:"
echo " - Milvus address: localhost:19530"
echo " - Milvus health/metrics endpoint: http://localhost:9091/healthz"
echo " - MinIO console: http://localhost:9001"
echo "   Username: minioadmin"
echo "   Password: minioadmin"
echo ""
echo "Common commands:"
echo " - Follow logs: docker-compose logs -f standalone"
echo " - Stop services: docker-compose down"
echo " - Restart services: docker-compose restart"
echo "=========================================="
# Test the connection (requires: pip install pymilvus)
echo ""
echo "Testing the connection..."
python3 << 'PYTHON'
from pymilvus import connections, utility
import time

max_retries = 5
for i in range(max_retries):
    try:
        connections.connect(host="localhost", port="19530")
        print(f"✓ Connected! Milvus version: {utility.get_server_version()}")
        connections.disconnect("default")
        break
    except Exception as e:
        if i < max_retries - 1:
            print(f"Connection failed, retrying... ({i+1}/{max_retries})")
            time.sleep(5)
        else:
            print(f"✗ Connection failed: {e}")
PYTHON
---
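The wait-and-retry step of the script can also be written in Python, which makes the success and failure paths easier to test. This is a sketch under assumptions: `http_ok` and `wait_until_healthy` are hypothetical helper names, and the URL is the health endpoint used by the script above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe, attempts: int = 10, delay: float = 5.0, sleep=time.sleep):
    """Call probe() up to `attempts` times; return the attempt number that succeeded, or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next probe, but not after the last one
    return None


if __name__ == "__main__":
    result = wait_until_healthy(lambda: http_ok("http://localhost:9091/healthz"))
    print("healthy" if result else "Milvus did not become healthy")
```

Passing the probe and the sleep function as arguments keeps the retry logic independent of the network, so it can be exercised without a running Milvus.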
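b. Production deployment
a. Description
A production environment needs more thorough configuration and monitoring: resource limits and health checks, log collection and retention, a backup and restore strategy, storage on dedicated volumes to avoid data loss, monitoring with alerting so that problems surface early, network security that restricts access, and regular updates and maintenance.
b. Code example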
---
# Production docker-compose.yml
version: '3.5'
services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
      - ETCD_HEARTBEAT_INTERVAL=500
      - ETCD_ELECTION_TIMEOUT=2500
    volumes:
      - /data/milvus/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
    restart: always
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
        max-file: "5"
    networks:
      - milvus
  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      # The fallback credentials are the public defaults; always override them in production
      MINIO_ACCESS_KEY: ${MINIO_ACCESS_KEY:-minioadmin}
      MINIO_SECRET_KEY: ${MINIO_SECRET_KEY:-minioadmin}
      MINIO_PROMETHEUS_AUTH_TYPE: public
    volumes:
      - /data/milvus/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
        max-file: "5"
    networks:
      - milvus
  # Pulsar is listed in depends_on below, so it must be defined here too;
  # this definition repeats the one from the basic file with a /data path.
  pulsar:
    container_name: milvus-pulsar
    image: apachepulsar/pulsar:2.8.2
    volumes:
      - /data/milvus/pulsar:/pulsar/data
    environment:
      - PULSAR_MEM=" -Xms512m -Xmx512m -XX:MaxDirectMemorySize=1g"
    command: |
      bash -c "bin/apply-config-from-env.py conf/standalone.conf && bin/pulsar standalone"
    restart: always
    networks:
      - milvus
  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.3.0
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      MINIO_ACCESS_KEY_ID: ${MINIO_ACCESS_KEY:-minioadmin}
      MINIO_SECRET_ACCESS_KEY: ${MINIO_SECRET_KEY:-minioadmin}
      PULSAR_ADDRESS: pulsar://pulsar:6650
      # Performance tuning (check these names against your version's milvus.yaml)
      QUERY_NODE_GRACEFUL_STOP_TIMEOUT: 30
      QUERY_NODE_SEARCH_TIMEOUT: 30
      DATA_NODE_FLUSH_INSERT_BUFFER_SIZE: 16777216
      # Log level
      LOG_LEVEL: info
    volumes:
      - /data/milvus/data:/var/lib/milvus
      - /data/milvus/logs:/var/log/milvus
      # This directory must already contain milvus.yaml: an empty host
      # directory hides the configuration files shipped in the image.
      - /data/milvus/config:/milvus/configs
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"
      - "pulsar"
    deploy:
      resources:
        limits:
          cpus: '8.0'
          memory: 16G
        reservations:
          cpus: '4.0'
          memory: 8G
    restart: always
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    logging:
      driver: "json-file"
      options:
        max-size: "200m"
        max-file: "10"
    networks:
      - milvus
networks:
  milvus:
    name: milvus
    driver: bridge
# Production deployment script
# #!/bin/bash
#
# # Set the credentials
# export MINIO_ACCESS_KEY="your-access-key"
# export MINIO_SECRET_KEY="your-secret-key"
#
# # Create the data directories
# mkdir -p /data/milvus/{etcd,minio,pulsar,data,logs,config}
#
# # Set permissions
# chmod 755 /data/milvus
#
# # Start the services
# docker-compose up -d
#
# # Example backup script
# # (archiving live data directories is not a consistent snapshot;
# #  stop writes or use tool-specific snapshots for strict consistency)
# cat > /opt/scripts/backup-milvus.sh << 'EOF'
# #!/bin/bash
# BACKUP_DIR="/backup/milvus/$(date +%Y%m%d)"
# mkdir -p "$BACKUP_DIR"
#
# # Back up the data
# tar -czf "$BACKUP_DIR/milvus-data.tar.gz" /data/milvus/data
# tar -czf "$BACKUP_DIR/milvus-etcd.tar.gz" /data/milvus/etcd
# tar -czf "$BACKUP_DIR/milvus-minio.tar.gz" /data/milvus/minio
#
# # Keep the last 7 days of backups (only the dated directories are considered)
# find /backup/milvus -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
# EOF
#
# chmod +x /opt/scripts/backup-milvus.sh
#
# # Schedule the backup at 02:00 daily, keeping any existing crontab entries
# (crontab -l 2>/dev/null; echo "0 2 * * * /opt/scripts/backup-milvus.sh") | crontab -
---
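The 7-day retention rule in the backup script can be expressed on the directory names rather than on modification times, which is easier to reason about when backups are copied between hosts. A minimal sketch, assuming the `YYYYMMDD` directory naming used by the script; `expired_backups` is a hypothetical helper name.

```python
from datetime import date, datetime, timedelta


def expired_backups(names, today: date, keep_days: int = 7):
    """Return the backup directory names whose date stamp is older than keep_days.

    Names that are not valid YYYYMMDD stamps are ignored rather than deleted.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        try:
            stamp = datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue  # not a dated backup directory; leave it alone
        if stamp < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    listing = ["20240101", "20240108", "20240110", "20240115", "lost+found"]
    print(expired_backups(listing, today=date(2024, 1, 15)))
```

The function only selects candidates; the deletion itself stays a separate, explicit step.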
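9.3 Kubernetes Deployment
01. Helm deployment
a. Helm Chart
a. Description
A Helm Chart simplifies deployment on Kubernetes. The official chart is complete and configurable, and deploys a Milvus cluster together with all of its dependencies in one step. It supports rolling upgrades and rollbacks, makes it easy to adjust replica counts and resources, and keeps configuration under version control. It is suited to large production deployments.
b. Code example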
---
# Deploy Milvus to Kubernetes with Helm
# 1. Add the Milvus Helm repository
#    (newer chart releases may be published under a different URL;
#    check the Milvus documentation for the current one)
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update
# 2. List the available versions
helm search repo milvus
# 3. Create the namespace
kubectl create namespace milvus
# 4. Deploy Milvus with the default settings
helm install milvus-release milvus/milvus --namespace milvus
# 5. Or deploy with custom settings (use this instead of step 4:
#    a release name can only be installed once)
cat > values-custom.yaml <<EOF
cluster:
  enabled: true
image:
  all:
    repository: milvusdb/milvus
    tag: v2.3.0
    pullPolicy: IfNotPresent
queryNode:
  replicas: 3
  resources:
    limits:
      cpu: "4"
      memory: "16Gi"
    requests:
      cpu: "2"
      memory: "8Gi"
dataNode:
  replicas: 2
  resources:
    limits:
      cpu: "2"
      memory: "8Gi"
    requests:
      cpu: "1"
      memory: "4Gi"
indexNode:
  replicas: 2
  resources:
    limits:
      cpu: "4"
      memory: "8Gi"
    requests:
      cpu: "2"
      memory: "4Gi"
minio:
  mode: distributed
  replicas: 4
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
pulsar:
  enabled: true
  broker:
    replicaCount: 3
etcd:
  replicaCount: 3
  resources:
    limits:
      cpu: "1"
      memory: "2Gi"
service:
  type: LoadBalancer
  port: 19530
EOF
helm install milvus-release milvus/milvus -f values-custom.yaml --namespace milvus
# 6. Check the deployment status
kubectl get pods -n milvus
kubectl get svc -n milvus
# 7. Inspect a pod in detail
kubectl describe pod <pod-name> -n milvus
# 8. Upgrade the deployment
helm upgrade milvus-release milvus/milvus -f values-custom.yaml --namespace milvus
# 9. Roll back to the previous revision
helm rollback milvus-release --namespace milvus
# 10. Show the release history
helm history milvus-release --namespace milvus
# 11. Uninstall
helm uninstall milvus-release --namespace milvus
# 12. Delete the namespace
kubectl delete namespace milvus
---
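b. Configuration tuning
a. Description
Tune the Kubernetes configuration to the workload: set Pod resource requests and limits, use node affinity and Pod anti-affinity to control placement, configure persistent volumes to keep data safe, enable autoscaling with the HPA, choose the quality-of-service class through requests and limits, manage configuration with ConfigMaps and Secrets, and define a rolling update strategy.
b. Code example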
---
# Advanced Kubernetes settings (values.yaml)
# Key names follow the original sketch; check them against the values.yaml
# of your chart version, since not every key is supported by every chart.
# Query Node settings
queryNode:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  affinity:
    # Pod anti-affinity: spread the Pods across different nodes.
    # With a "required" rule, every replica needs its own node;
    # extra replicas stay Pending when nodes run out.
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - milvus
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - querynode
          topologyKey: kubernetes.io/hostname
    # Node affinity: prefer high-performance nodes
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - high-performance
  # Tolerations: allow scheduling onto nodes with a matching taint
  tolerations:
    - key: "milvus"
      operator: "Equal"
      value: "querynode"
      effect: "NoSchedule"
  # Update strategy
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  # Health checks
  livenessProbe:
    httpGet:
      path: /healthz
      port: 9091
    initialDelaySeconds: 60
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /healthz
      port: 9091
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  # HPA autoscaling
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
# Persistent storage
persistence:
  enabled: true
  storageClass: "fast-ssd"
  accessMode: ReadWriteOnce
  size: 500Gi
# Monitoring
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
    interval: 30s
# Logging
log:
  level: info
  format: json
  persistence:
    enabled: true
    size: 100Gi
# Security
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
---
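02. Operations
a. Rolling updates
a. Description
Kubernetes supports rolling updates, which can upgrade a Deployment without downtime when enough replicas and working readiness probes are in place. Pods are replaced one batch at a time so that the service stays available, and the update strategy controls the pace. A rollout stalls when new Pods fail their readiness checks, but Kubernetes does not roll back by itself: the rollback is triggered with kubectl rollout undo or by Helm options such as --atomic. A rollout can be paused and resumed, and traffic splitting enables gray releases and canary deployments. Monitor the rollout so that problems are noticed early.
b. Code example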
---
# Rolling update guide
# 1. Check the current version
kubectl get deployment -n milvus
kubectl describe deployment milvus-querynode -n milvus | grep Image
# 2. Upgrade to a new version
#    (--reuse-values keeps the settings from the previous release;
#    without it, or -f with your values file, custom values are reset)
helm upgrade milvus-release milvus/milvus \
  --reuse-values \
  --set image.all.tag=v2.3.1 \
  --namespace milvus
# 3. Watch the rollout
kubectl rollout status deployment/milvus-querynode -n milvus
# 4. Show the rollout history
kubectl rollout history deployment/milvus-querynode -n milvus
# 5. Pause the rollout
kubectl rollout pause deployment/milvus-querynode -n milvus
# 6. Resume the rollout
kubectl rollout resume deployment/milvus-querynode -n milvus
# 7. Roll back to the previous revision
kubectl rollout undo deployment/milvus-querynode -n milvus
# 8. Roll back to a specific revision
kubectl rollout undo deployment/milvus-querynode -n milvus --to-revision=2
# 9. Watch Pod status
kubectl get pods -n milvus -w
# 10. List events
kubectl get events -n milvus --sort-by='.lastTimestamp'
# Gray release example (with Istio)
# A VirtualService splits the traffic; the subsets v1 and v2 must be
# defined in a matching DestinationRule, which is not shown here.
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: milvus-canary
  namespace: milvus
spec:
  hosts:
    - milvus
  http:
    # Requests carrying the header "canary: true" always go to v2
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: milvus
            subset: v2
          weight: 100
    # All other traffic is split 90/10 between v1 and v2
    - route:
        - destination:
            host: milvus
            subset: v1
          weight: 90
        - destination:
            host: milvus
            subset: v2
          weight: 10
EOF
---
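How fast a rolling update proceeds, and how many Pods stay available during it, follows from `maxSurge` and `maxUnavailable`. The sketch below computes those bounds; it assumes the rounding rules described in the Kubernetes Deployment documentation (percentages round up for `maxSurge` and down for `maxUnavailable`), and `rollout_bounds` is a hypothetical helper name.

```python
import math


def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string such as '25%' into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = int(value[:-1]) / 100 * replicas
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> dict:
    """Return the minimum available and maximum total Pods during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    return {"min_available": replicas - unavailable, "max_total": replicas + surge}


if __name__ == "__main__":
    # The strategy used in this chapter: maxSurge 1, maxUnavailable 0 on 3 replicas.
    print(rollout_bounds(3, 1, 0))
    print(rollout_bounds(10, "25%", "25%"))
```

With `maxSurge: 1` and `maxUnavailable: 0`, capacity never drops below the configured replica count, at the cost of needing room for one extra Pod during the update.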
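b. Fault recovery
a. Description
Kubernetes recovers from many failures automatically. A failed container is restarted to keep the service available, and when a node fails its Pods are rescheduled onto healthy nodes. Health probes detect problems early, and the restart back-off prevents a crashing container from restarting in a tight loop. Running several replicas raises availability. Monitor the cluster state and handle anomalies promptly.
b. Code example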
---
# Kubernetes fault recovery guide
# 1. Check Pod status
kubectl get pods -n milvus
kubectl get pods -n milvus -o wide
# 2. List Pods that are not in the Running phase
#    (a Pod in CrashLoopBackOff still has phase Running, so also check
#    the STATUS and RESTARTS columns of the plain listing above)
kubectl get pods -n milvus --field-selector=status.phase!=Running
# 3. Read Pod logs
kubectl logs <pod-name> -n milvus
kubectl logs <pod-name> -n milvus --previous # logs of the previous run
kubectl logs <pod-name> -n milvus --tail=100 -f # follow the last 100 lines
# 4. Inspect a Pod in detail
kubectl describe pod <pod-name> -n milvus
# 5. List the events of a Pod
kubectl get events -n milvus --field-selector involvedObject.name=<pod-name>
# 6. Open a shell in the Pod for debugging
kubectl exec -it <pod-name> -n milvus -- /bin/bash
# 7. Force-delete a Pod to trigger a rebuild
#    (skips graceful shutdown; use only when a normal delete hangs)
kubectl delete pod <pod-name> -n milvus --force --grace-period=0
# 8. Restart a Deployment
kubectl rollout restart deployment/milvus-querynode -n milvus
# 9. Check node status
kubectl get nodes
kubectl describe node <node-name>
# 10. Drain a node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 11. Put the node back into service
kubectl uncordon <node-name>
# 12. Show resource usage
kubectl top nodes
kubectl top pods -n milvus
# Troubleshooting script
cat > troubleshoot.sh <<'EOF'
#!/bin/bash
NAMESPACE="milvus"
echo "========== Pod status =========="
kubectl get pods -n $NAMESPACE
echo ""
echo "========== Pods not Running =========="
kubectl get pods -n $NAMESPACE --field-selector=status.phase!=Running
echo ""
echo "========== Recent events =========="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20
echo ""
echo "========== Resource usage =========="
kubectl top pods -n $NAMESPACE
echo ""
echo "========== Node status =========="
kubectl get nodes
echo ""
echo "========== PVC status =========="
kubectl get pvc -n $NAMESPACE
echo ""
echo "========== Service status =========="
kubectl get svc -n $NAMESPACE
EOF
chmod +x troubleshoot.sh
./troubleshoot.sh
---
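When a container keeps crashing, the kubelet does not restart it immediately each time. The sketch below models the exponential back-off described in the Kubernetes documentation (10s, 20s, 40s and so on, capped at five minutes); the base and cap are taken from that description and may differ by version or configuration, and `restart_delay` is a hypothetical helper name.

```python
def restart_delay(restart_count: int, base: int = 10, cap: int = 300) -> int:
    """Seconds the kubelet waits before restart number restart_count + 1.

    Assumes a delay that doubles from `base` and is capped at `cap`.
    """
    if restart_count < 0:
        raise ValueError("restart_count must not be negative")
    return min(base * 2 ** restart_count, cap)


if __name__ == "__main__":
    delays = [restart_delay(n) for n in range(7)]
    print(delays)
    print(f"time spent waiting across 7 restarts: {sum(delays)}s")
```

This explains why a Pod in CrashLoopBackOff can sit idle for minutes between attempts: after a handful of failures each retry waits the full cap.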
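9.4 High Availability Configuration
01. Component high availability
a. Coordinator high availability
a. Description
Coordinators achieve high availability in active-standby mode, with the leader elected through etcd. When the active instance fails, a standby takes over automatically, typically within seconds. This requires several instances of each Coordinator. Two instances are enough for active-standby; the example below deploys three. The odd-number rule matters for the etcd cluster itself, which needs a majority of members to stay available. Monitor which instance is the leader so that failovers are noticed promptly.
b. Code example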
---
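# Coordinator high-availability settings (Kubernetes Helm values.yaml)
# Key names follow the official chart's rootCoordinator / activeStandby layout;
# verify them against the values.yaml of your chart version.
rootCoordinator:
  replicas: 3 # deploy 3 instances
  activeStandby:
    enabled: true # required when replicas is greater than 1
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "1"
      memory: "2Gi"
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              component: rootcoord
          topologyKey: kubernetes.io/hostname
dataCoordinator:
  replicas: 3
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
queryCoordinator:
  replicas: 3
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
indexCoordinator:
  replicas: 3
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
# etcd high-availability settings
etcd:
  replicaCount: 3 # 3-member cluster
  resources:
    limits:
      cpu: "1"
      memory: "2Gi"
  persistence:
    enabled: true
    size: 10Gi
# Monitor the Coordinators
# kubectl get pods -n milvus | grep coord
# kubectl logs -f <rootcoord-pod> -n milvus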
---
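The reason for preferring an odd member count in the etcd cluster comes from majority quorum. The sketch below works through the arithmetic; it is a generic illustration of majority-based consensus (as used by Raft), and the function names are hypothetical.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, so the extra member adds cost without adding fault tolerance.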
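b. Worker high availability
a. Description
Worker nodes achieve high availability through multiple replicas: every Worker type runs several instances, so the failure of one instance does not take the service down. The Query Coordinator assigns work to healthy nodes and removes failed ones. Replica counts can be scaled up and down with the load, and load balancing avoids hot spots. Monitor the health of the Workers.
b. Code example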
---
# Worker high-availability settings
queryNode:
  replicas: 5 # several replicas
  resources:
    limits:
      cpu: "4"
      memory: "16Gi"
  # Pod anti-affinity: prefer spreading across different nodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                component: querynode
            topologyKey: kubernetes.io/hostname
  # Health checks
  livenessProbe:
    httpGet:
      path: /healthz
      port: 9091
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /healthz
      port: 9091
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
dataNode:
  replicas: 3
  resources:
    limits:
      cpu: "2"
      memory: "8Gi"
indexNode:
  replicas: 3
  resources:
    limits:
      cpu: "4"
      memory: "8Gi"
# Failover test
# 1. Delete one QueryNode Pod
# kubectl delete pod <querynode-pod> -n milvus
#
# 2. Watch it being recreated
# kubectl get pods -n milvus -w
#
# 3. Verify that the service is still available
# python3 test_connection.py
---
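02. Data high availability
a. Storage high availability
a. Description
Distributed storage keeps the data highly available. MinIO runs in distributed mode and protects objects with erasure coding across nodes. etcd runs as a 3-member cluster and uses the Raft protocol for consistency. Pulsar replicates messages across bookies so that acknowledged messages survive the loss of a node. Persistent volumes keep the data on durable storage, regular backups guard against data loss, and storage health should be monitored.
b. Code example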
---
# Storage high-availability settings
# (some keys below, such as erasureCodingParity and the backup section, are
# illustrative; confirm what your chart version actually supports)
# MinIO in distributed mode
minio:
  mode: distributed
  replicas: 4 # 4-node distributed deployment
  drivesPerNode: 1
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
  persistence:
    enabled: true
    storageClass: "fast-ssd"
    size: 500Gi
  # Erasure-coding parity
  erasureCodingParity: 2 # data stays readable with up to 2 nodes down
# etcd cluster
etcd:
  replicaCount: 3
  persistence:
    enabled: true
    storageClass: "fast-ssd"
    size: 10Gi
  resources:
    limits:
      cpu: "1"
      memory: "2Gi"
  # Automatic compaction of old revisions (this is not a backup)
  autoCompactionMode: revision
  autoCompactionRetention: "1000"
# Pulsar cluster
pulsar:
  enabled: true
  broker:
    replicaCount: 3
    resources:
      limits:
        cpu: "2"
        memory: "4Gi"
  bookkeeper:
    replicaCount: 3
    persistence:
      enabled: true
      size: 100Gi
  zookeeper:
    replicaCount: 3
# Backup settings
backup:
  enabled: true
  schedule: "0 2 * * *" # every day at 02:00
  retention: 7 # keep 7 days
  destination: "s3://backup-bucket/milvus"
# Example backup script
# (pod names such as etcd-0 and the mc alias milvus-minio depend on your setup)
cat > backup.sh <<'EOF'
#!/bin/bash
BACKUP_DIR="/backup/milvus/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Back up etcd
kubectl exec -n milvus etcd-0 -- etcdctl snapshot save /tmp/snapshot.db
kubectl cp milvus/etcd-0:/tmp/snapshot.db "$BACKUP_DIR/etcd-snapshot.db"
# Back up MinIO (with the mc tool)
mc mirror milvus-minio/milvus-bucket "$BACKUP_DIR/minio-data"
# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://backup-bucket/milvus/$(date +%Y%m%d)"
# Remove local backups older than 7 days (only the dated directories)
find /backup/milvus -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +
EOF
---
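The trade-off behind the parity setting can be made concrete with a little arithmetic. The sketch below is a generic erasure-coding calculation, not MinIO's implementation: it assumes one drive per node as in the configuration above, and `erasure_profile` is a hypothetical helper name. Write-quorum rules are product specific and can be stricter than the read tolerance shown here, so check the storage system's documentation.

```python
def erasure_profile(drives: int, parity: int) -> dict:
    """Summarise an erasure-coded set of `drives` shards with `parity` parity shards."""
    if not 0 < parity <= drives // 2:
        raise ValueError("parity must be between 1 and half the number of drives")
    data = drives - parity
    return {
        "data_shards": data,
        "usable_fraction": data / drives,   # share of raw capacity that holds data
        "readable_after_losing": parity,    # shards that may be lost while reads still work
    }


if __name__ == "__main__":
    # 4 nodes with 1 drive each and parity 2, as in the configuration above.
    print(erasure_profile(4, 2))
    print(erasure_profile(8, 2))
```

Higher parity buys more failure tolerance at the price of usable capacity: with 4 drives and parity 2, half of the raw 4 x 500Gi is available for data.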
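b. Disaster recovery
a. Description
Prepare a disaster recovery plan for extreme situations. Back up data and configuration regularly, and test the restore procedure to confirm that the backups are usable. Cross-region replication protects against the failure of a whole region. Monitoring and alerting surface problems early, documented restore steps speed up the response, and regular drills keep the team practiced.
b. Code example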
---
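# Disaster recovery guide
# 1. Data restore procedure
# Step 1: stop the Milvus service
# (if etcd and MinIO are bundled in this release, uninstalling also removes
# their pods; in that case scale the Milvus deployments to zero instead so
# that the restore steps below can still reach them)
helm uninstall milvus-release -n milvus
# Step 2: restore the etcd data
# Copy the snapshot into the pod, then restore it
# (pod names such as etcd-0 depend on your setup)
kubectl cp /backup/etcd-snapshot.db milvus/etcd-0:/tmp/etcd-snapshot.db
kubectl exec -n milvus etcd-0 -- etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restore
# etcd must then be restarted on the restored data directory;
# how to do that depends on how etcd is deployed.
# Step 3: restore the MinIO data
# Download the backup from S3, then mirror it into MinIO with mc
aws s3 sync s3://backup-bucket/milvus/20240115/minio-data ./restore/minio-data
mc mirror ./restore/minio-data milvus-minio/milvus-bucket
# Step 4: Pulsar
# Pulsar data usually does not need restoring because it is a message queue
# Step 5: redeploy Milvus
helm install milvus-release milvus/milvus -f values.yaml -n milvus
# Step 6: verify data integrity
python3 <<'PYTHON'
from pymilvus import connections, Collection, utility

connections.connect(host="milvus.example.com", port="19530")
# Check the collections
collections = utility.list_collections()
print(f"Collections: {collections}")
# Check the entity counts
for coll_name in collections:
    collection = Collection(coll_name)
    count = collection.num_entities
    print(f"{coll_name}: {count} entities")
connections.disconnect("default")
PYTHON
# 2. Cross-region disaster recovery (illustrative values; keys such as
#    bucketReplication and readOnly are placeholders, not chart options)
# Primary region (values-primary.yaml)
global:
  region: us-east-1
minio:
  mode: distributed
  replicas: 4
  # Cross-region replication
  bucketReplication:
    enabled: true
    destination: "s3://milvus-backup-us-west-1"
# Secondary region (values-secondary.yaml)
global:
  region: us-west-1
  # Read-only mode; data is synchronised from the primary region
  readOnly: true
# 3. Failover procedure
# Detect the failure of the primary region
# Switch DNS to the secondary region
# Switch the secondary region to read-write mode
# Verify service availability
# 4. Recovery checklist
cat > recovery-checklist.md <<'EOF'
# Milvus disaster recovery checklist
## Before the restore
- [ ] Confirm that a backup is available
- [ ] Assess the extent of data loss
- [ ] Notify the people involved
- [ ] Prepare the restore environment
## During the restore
- [ ] Stop the existing service
- [ ] Restore the etcd data
- [ ] Restore the MinIO data
- [ ] Redeploy Milvus
- [ ] Verify component status
## After the restore
- [ ] Verify data integrity
- [ ] Test queries
- [ ] Test writes
- [ ] Monitor system status
- [ ] Announce that the restore is complete
- [ ] Write the incident report
## RTO/RPO targets
- RTO (recovery time objective): 2 hours
- RPO (recovery point objective): 24 hours
EOF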
---
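The RPO target in the checklist can be monitored automatically: if the newest backup is older than the RPO, a failure right now would lose more data than the target allows. A minimal sketch, assuming the 24-hour RPO from the checklist; `rpo_breached` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def rpo_breached(last_backup: datetime, now: datetime,
                 rpo: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the newest backup is older than the recovery point objective."""
    return now - last_backup > rpo


if __name__ == "__main__":
    now = datetime(2024, 1, 16, 9, 0)
    nightly = datetime(2024, 1, 16, 2, 0)   # last night's 02:00 backup succeeded
    missed = datetime(2024, 1, 15, 2, 0)    # last night's backup failed
    print(rpo_breached(nightly, now))
    print(rpo_breached(missed, now))
```

With a daily 02:00 schedule, a single failed run already breaches a 24-hour RPO by the next morning, which is a reason to alert on backup age rather than only on job failures.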
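9.5 Scaling Out and In
01. Manual scaling
a. Scaling out Workers
a. Description
Add Worker nodes manually as the load grows. More Query Nodes raise query concurrency, more Data Nodes raise write throughput, and more Index Nodes speed up index building. Replica counts are changed through Helm or kubectl. New nodes join the cluster automatically, without a restart. Watch resource usage so that capacity is added in time.
b. Code example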
---
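# Manual scale-out of Worker nodes
# Method 1: helm upgrade
# (--reuse-values keeps the other settings of the current release)
helm upgrade milvus-release milvus/milvus \
  --reuse-values \
  --set queryNode.replicas=5 \
  --set dataNode.replicas=3 \
  --set indexNode.replicas=3 \
  --namespace milvus
# Method 2: kubectl scale
# (deployment names depend on the release name; the next helm upgrade
# resets replica counts changed this way to the values in the chart)
kubectl scale deployment milvus-querynode --replicas=5 -n milvus
kubectl scale deployment milvus-datanode --replicas=3 -n milvus
kubectl scale deployment milvus-indexnode --replicas=3 -n milvus
# Method 3: edit values.yaml and upgrade
cat > values-scale.yaml <<EOF
queryNode:
  replicas: 5
  resources:
    limits:
      cpu: "4"
      memory: "16Gi"
dataNode:
  replicas: 3
  resources:
    limits:
      cpu: "2"
      memory: "8Gi"
indexNode:
  replicas: 3
  resources:
    limits:
      cpu: "4"
      memory: "8Gi"
EOF
helm upgrade milvus-release milvus/milvus -f values-scale.yaml -n milvus
# Verify the result
kubectl get pods -n milvus | grep -E "querynode|datanode|indexnode"
# Watch the new nodes
kubectl get pods -n milvus -w
# Check the load distribution
kubectl top pods -n milvus
# Sizing guidance (rough figures; actual capacity depends on index type,
# vector dimension, and hardware, so measure on your own workload):
# - Query Node: scale with QPS demand, about 1000-5000 QPS per node
# - Data Node: scale with write throughput, about 10000-50000 vectors/s per node
# - Index Node: scale with index build time; builds run in parallel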
---
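The per-node figures in the sizing guidance translate into a replica count with a simple ceiling division. The sketch below is an illustration only: `required_nodes` is a hypothetical helper name, the 70% target utilisation is an assumed headroom, and the per-node capacity must come from a benchmark of your own workload.

```python
def required_nodes(target_load: int, per_node_capacity: int,
                   utilisation_percent: int = 70, minimum: int = 2) -> int:
    """Nodes needed so that each runs at or below the target utilisation.

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    if per_node_capacity <= 0 or not 0 < utilisation_percent <= 100:
        raise ValueError("capacity must be positive and utilisation within 1..100")
    effective = per_node_capacity * utilisation_percent  # capacity x percent
    nodes = -(-target_load * 100 // effective)           # ceil(load * 100 / effective)
    return max(minimum, nodes)


if __name__ == "__main__":
    # 12000 QPS at the conservative end of the guidance (1000 QPS per Query Node).
    print(required_nodes(12000, 1000))
    # The same load at the optimistic end (5000 QPS per node).
    print(required_nodes(12000, 5000))
```

The wide gap between the two results shows why the 1000-5000 QPS range has to be narrowed by measurement before it is used for planning.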
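b. Scaling in Workers
a. Description
Reduce the number of Worker nodes when the load drops, to save resources. Before scaling in, make sure the remaining nodes have enough capacity. Kubernetes terminates Pods gracefully: a Query Node first stops accepting new requests and exits after finishing the requests in flight. Monitor the load after scaling in and avoid removing so many nodes that performance suffers.
b. Code example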
---
# Manual scale-in of Worker nodes
# Check the current load first
kubectl top pods -n milvus
# Check the current replica counts
kubectl get deployment -n milvus
# Scale in the Query Nodes
kubectl scale deployment milvus-querynode --replicas=3 -n milvus
# Scale in the Data Nodes
kubectl scale deployment milvus-datanode --replicas=2 -n milvus
# Scale in the Index Nodes
kubectl scale deployment milvus-indexnode --replicas=2 -n milvus
# Or use Helm (--reuse-values keeps the other settings of the release)
helm upgrade milvus-release milvus/milvus \
  --reuse-values \
  --set queryNode.replicas=3 \
  --set dataNode.replicas=2 \
  --set indexNode.replicas=2 \
  --namespace milvus
# Watch the scale-in
kubectl get pods -n milvus -w
# Verify that the service is still available
python3 <<'PYTHON'
from pymilvus import connections, Collection
import numpy as np
import time

connections.connect(host="milvus.example.com", port="19530")
collection = Collection("test_collection")
# Run test queries and print the latency of each
query_vector = [[np.random.random() for _ in range(128)]]
for i in range(10):
    start = time.time()
    results = collection.search(
        data=query_vector,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10
    )
    latency = time.time() - start
    print(f"Query {i+1}: {latency*1000:.2f}ms")
connections.disconnect("default")
PYTHON
# Scale-in notes:
# - make sure the remaining capacity is sufficient
# - monitor performance after scaling in
# - avoid scaling in frequently
# - keep a minimum replica count (at least 2)
---
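02. Autoscaling
a. HPA configuration
a. Description
The Horizontal Pod Autoscaler scales a workload automatically based on metrics. It supports CPU, memory and custom metrics. You set a minimum and maximum replica count and a target utilisation, and the HPA adjusts the replica count without manual intervention. It suits workloads whose load fluctuates widely. It requires metrics-server, and utilisation targets are computed against the Pods' resource requests, so requests must be set.
b. Code example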
---
# HPA autoscaling configuration
# 1. Make sure metrics-server is installed
kubectl get deployment metrics-server -n kube-system
# 2. Enable the HPA in values.yaml
#    (the autoscaling keys follow the original sketch; if your chart version
#    does not support them, create the HPA by hand as in step 6)
queryNode:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
# dataNode and indexNode also need resources.requests for
# utilisation-based scaling to work
dataNode:
  replicas: 2
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 70
indexNode:
  replicas: 2
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 5
    targetCPUUtilizationPercentage: 80
# 3. Deploy or upgrade
helm upgrade milvus-release milvus/milvus -f values.yaml -n milvus
# 4. Check HPA status
kubectl get hpa -n milvus
# 5. Inspect one HPA in detail
kubectl describe hpa milvus-querynode -n milvus
# 6. Create the HPA by hand (if the Helm chart does not support it)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: milvus-querynode-hpa
  namespace: milvus
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: milvus-querynode
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Max
EOF
# 7. Watch the HPA
kubectl get hpa -n milvus -w
# 8. List scaling events
kubectl get events -n milvus | grep -i "scaled"
# HPA settings explained:
# - minReplicas: minimum number of replicas
# - maxReplicas: maximum number of replicas
# - targetCPUUtilizationPercentage: target CPU utilisation (relative to requests)
# - targetMemoryUtilizationPercentage: target memory utilisation (relative to requests)
# - stabilizationWindowSeconds: stabilisation window that prevents flapping
# - scaleDown/scaleUp policies: how fast replicas may be removed or added
---
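The replica count the HPA aims for follows a simple formula documented by Kubernetes: the current count multiplied by the ratio of the observed metric to its target, rounded up. The sketch below reproduces that calculation under simplifying assumptions (a single metric, the default 10% tolerance, no stabilisation window or rate policies); `desired_replicas` is a hypothetical helper name.

```python
import math


def desired_replicas(current: int, observed: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA formula yields for one metric, clamped to the bounds."""
    ratio = observed / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to the target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 Query Nodes at 90% average CPU against the 70% target used above.
    print(desired_replicas(3, observed=90, target=70, min_replicas=2, max_replicas=10))
    # Load collapses to 10%: the result is clamped to minReplicas.
    print(desired_replicas(3, observed=10, target=70, min_replicas=2, max_replicas=10))
```

The clamping step is why `minReplicas` and `maxReplicas` act as hard guard rails regardless of how extreme the metric becomes.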
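b. Custom metrics
a. Description
Besides CPU and memory, the HPA can scale on custom metrics such as QPS, query latency or queue length. This requires Prometheus and the Prometheus Adapter, together with rules that define how each custom metric is computed. Scaling on business metrics follows the real demand more closely and is therefore more precise.
b. Code example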
---
# HPA on custom metrics
# 1. Install Prometheus and the Prometheus Adapter
#    (both charts live in the prometheus-community repository)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
helm install prometheus-adapter prometheus-community/prometheus-adapter -n monitoring
# 2. Define the custom metrics for the Prometheus Adapter
#    (milvus_query_qps and milvus_query_latency_ms are placeholder series
#    names; replace them with the metrics your Milvus version exports)
cat > prometheus-adapter-values.yaml <<'EOF'
rules:
  custom:
    - seriesQuery: 'milvus_query_qps{namespace="milvus"}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_qps"
        as: "milvus_query_qps"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
    - seriesQuery: 'milvus_query_latency_ms{namespace="milvus"}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_latency_ms"
        as: "milvus_query_latency"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
EOF
helm upgrade prometheus-adapter prometheus-community/prometheus-adapter \
  -f prometheus-adapter-values.yaml -n monitoring
# 3. Verify that the custom metrics API is served
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
# 4. Create an HPA on the custom metrics
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: milvus-querynode-custom-hpa
  namespace: milvus
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: milvus-querynode
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Scale on QPS
    - type: Pods
      pods:
        metric:
          name: milvus_query_qps
        target:
          type: AverageValue
          averageValue: "1000" # 1000 QPS per Pod
    # Scale on query latency
    - type: Pods
      pods:
        metric:
          name: milvus_query_latency
        target:
          type: AverageValue
          averageValue: "50" # 50 ms average latency
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 60
EOF
# 5. Watch the custom-metric HPA
kubectl get hpa milvus-querynode-custom-hpa -n milvus -w
# 6. Read the metric values
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/milvus/pods/*/milvus_query_qps" | jq .
# Examples of custom metrics:
# - QPS: queries per second
# - Query latency: average query latency
# - Queue length: number of requests waiting
# - Error rate: share of failed queries
# - Resource usage: GPU utilisation and similar
# Load test that should trigger a scale-out
python3 <<'PYTHON'
from pymilvus import connections, Collection
import numpy as np
import time
import threading

connections.connect(host="milvus.example.com", port="19530")
collection = Collection("test_collection")

def query_worker():
    """Query continuously to generate load."""
    query_vector = [[np.random.random() for _ in range(128)]]
    while True:
        try:
            collection.search(
                data=query_vector,
                anns_field="embedding",
                param={"metric_type": "L2", "params": {"nprobe": 16}},
                limit=10
            )
        except Exception:
            pass  # ignore individual failures; the goal is sustained load
        time.sleep(0.001)  # high query rate

# Start several threads to simulate heavy load
threads = []
for i in range(50):
    t = threading.Thread(target=query_worker, daemon=True)
    t.start()
    threads.append(t)
print("Load test running; watch the HPA scale out...")
print("kubectl get hpa -n milvus -w")
time.sleep(300)  # run for 5 minutes
PYTHON
---
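When an HPA lists several metrics, as the QPS and latency example above does, Kubernetes computes a proposal for each metric and scales to the largest one. The sketch below illustrates that selection; it leaves out the tolerance band and the scaling behaviour policies for brevity, and `desired_from_metrics` is a hypothetical helper name. The metric names are the placeholder names from the example.

```python
import math


def desired_from_metrics(current: int, metrics: dict,
                         min_replicas: int, max_replicas: int):
    """Return (deciding metric, replica count) for several per-Pod average metrics.

    `metrics` maps a metric name to (observed average per Pod, target average per Pod).
    """
    proposals = {
        name: math.ceil(current * observed / target)
        for name, (observed, target) in metrics.items()
    }
    deciding = max(proposals, key=proposals.get)  # the largest proposal wins
    clamped = max(min_replicas, min(max_replicas, proposals[deciding]))
    return deciding, clamped


if __name__ == "__main__":
    observed = {
        "milvus_query_qps": (1500, 1000),    # 1500 QPS per Pod against a 1000 target
        "milvus_query_latency": (40, 50),    # 40 ms against a 50 ms target
    }
    print(desired_from_metrics(4, observed, min_replicas=2, max_replicas=10))
```

Because the largest proposal wins, adding a second metric can only make the HPA scale out sooner, never later.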
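10 AI Framework Integration
10.1 LangChain Integration
01. Basic integration
a. Installation and setup
a. Description
LangChain is a popular framework for building LLM applications, and Milvus integrates with it as a vector store backend. The integration covers the whole pipeline of loading, splitting, embedding and retrieving documents. The Milvus vector store class wraps the Milvus operations and supports similarity search and MMR retrieval. Combined with an LLM it forms the basis of RAG applications. Installing langchain and pymilvus is enough to get started.
b. Code example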
---
# Install the dependencies
# pip install langchain langchain-community pymilvus openai
# (import paths vary between LangChain releases; newer releases move these
# classes into separate integration packages)
from langchain_community.vectorstores import Milvus
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# 1. Load the document
loader = TextLoader("document.txt")
documents = loader.load()
# 2. Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
docs = text_splitter.split_documents(documents)
# 3. Create the embedding model
embeddings = OpenAIEmbeddings()
# 4. Create the Milvus vector store
vector_store = Milvus.from_documents(
    docs,
    embeddings,
    connection_args={
        "host": "localhost",
        "port": "19530"
    },
    collection_name="langchain_docs",
    drop_old=True  # drops an existing collection with the same name
)
# 5. Similarity search
query = "What is machine learning?"
results = vector_store.similarity_search(query, k=3)
for i, doc in enumerate(results):
    print(f"\nResult {i+1}:")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata: {doc.metadata}")
# 6. Search with scores (how to read the score depends on the metric type;
#    for a distance metric such as L2, lower means closer)
results_with_scores = vector_store.similarity_search_with_score(query, k=3)
for doc, score in results_with_scores:
    print(f"\nScore: {score}")
    print(f"Content: {doc.page_content[:200]}...")
# 7. MMR retrieval (maximal marginal relevance)
mmr_results = vector_store.max_marginal_relevance_search(
    query,
    k=3,
    fetch_k=10
)
print(f"\nMMR results: {len(mmr_results)}")
---
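MMR retrieval trades relevance against diversity: each step picks the candidate that is most similar to the query and least similar to what has already been selected. The sketch below is a small pure-Python illustration of that idea on toy 2-dimensional vectors; it is not LangChain's implementation, and `cosine` and `mmr` are names chosen for this example.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k: int, lambda_mult: float = 0.5):
    """Return the indices of k candidates chosen by maximal marginal relevance.

    lambda_mult = 1.0 ranks purely by relevance; lower values favour diversity.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    # Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
    candidates = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]
    print(mmr(query, candidates, k=2, lambda_mult=0.5))
    print(mmr(query, candidates, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the different document, while `lambda_mult=1.0` falls back to plain relevance ranking. This is the role `fetch_k` and `lambda_mult` play in the retriever settings: a larger candidate pool and a lower lambda give more varied results.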
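b. Retriever configuration
a. Description
LangChain provides the Retriever abstraction as a uniform retrieval interface, and a Milvus vector store can be turned into a Retriever. Several search modes are available: similarity, MMR and score-threshold filtering. Search parameters such as top-k and the score threshold are configurable. A Retriever can be chained with an LLM to build question answering, summarisation and similar applications, and the retrieval logic can be customised.
b. Code example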
---
from langchain_community.vectorstores import Milvus
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI

# Create the vector store
embeddings = OpenAIEmbeddings()
vector_store = Milvus(
    embeddings,
    connection_args={"host": "localhost", "port": "19530"},
    collection_name="langchain_docs"
)
# 1. Turn it into a Retriever (similarity mode)
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
# Test the retriever
docs = retriever.invoke("What is deep learning?")
print(f"Retrieved {len(docs)} documents")
# 2. Retriever in MMR mode
mmr_retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 3,
        "fetch_k": 10,
        "lambda_mult": 0.5
    }
)
# 3. Retriever with a score threshold
#    (needs a vector store that implements relevance scores;
#    verify that your installed version supports this for Milvus)
threshold_retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "score_threshold": 0.8,
        "k": 5
    }
)
# 4. Combine with an LLM
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
# Ask a question
query = "What are the main types of machine learning?"
result = qa_chain.invoke({"query": query})
print(f"\nQuestion: {query}")
print(f"\nAnswer: {result['result']}")
print(f"\nNumber of source documents: {len(result['source_documents'])}")
for i, doc in enumerate(result['source_documents']):
    print(f"\nSource {i+1}:")
    print(doc.page_content[:200])
---
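02. RAG applications
a. Question answering
a. Description
A question-answering system can be built on retrieval-augmented generation (RAG). Milvus stores the vectors of the knowledge base. When a user asks a question, the relevant documents are retrieved and passed to the LLM as context for generating the answer. Several chain types are supported: stuff, map_reduce and refine. The prompt template can be customised, source citations raise trust in the answers, and streaming output is supported.
b. Code example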
---
from langchain_community.vectorstores import Milvus
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI
from langchain.prompts import PromptTemplate

# Create the vector store
embeddings = OpenAIEmbeddings()
vector_store = Milvus(
    embeddings,
    connection_args={"host": "localhost", "port": "19530"},
    collection_name="knowledge_base"
)
# Custom prompt template
prompt_template = """Use the following context to answer the question. If you do not know the answer, say that you do not know; do not make up an answer.

Context:
{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
# Create the QA chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
# Example questions
questions = [
    "What is the capital of France?",
    "Explain quantum computing in simple terms.",
    "What are the benefits of exercise?"
]
for query in questions:
    result = qa_chain.invoke({"query": query})
    print(f"\n{'='*60}")
    print(f"Question: {query}")
    print(f"\nAnswer: {result['result']}")
    print("\nSources:")
    for i, doc in enumerate(result['source_documents']):
        print(f"\n[{i+1}] {doc.metadata.get('source', 'Unknown')}")
        print(f"    {doc.page_content[:150]}...")
# Use map_reduce for long documents
qa_chain_mr = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)
result = qa_chain_mr.invoke({"query": "Summarize the main points about AI safety."})
print(f"\nSummary: {result['result']}")
---
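What the "stuff" chain type does can be shown without any framework: the retrieved chunks are concatenated into one context block and placed in the prompt together with the question. The sketch below illustrates this under assumptions; `build_stuff_prompt` is a hypothetical helper, and the character budget is a crude stand-in for a real token limit.

```python
def build_stuff_prompt(question: str, chunks, max_chars: int = 2000) -> str:
    """Assemble a single prompt from retrieved chunks, in rank order, within a size budget."""
    parts, used = [], 0
    for number, chunk in enumerate(chunks, start=1):
        piece = f"[{number}] {chunk}"
        if used + len(piece) > max_chars:
            break  # the remaining chunks do not fit; lower-ranked ones are dropped first
        parts.append(piece)
        used += len(piece)
    context = "\n\n".join(parts)
    return (
        "Use the following context to answer the question. "
        "If you do not know the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    retrieved = [
        "Milvus is a vector database built for similarity search.",
        "HNSW and IVF are common approximate nearest neighbour indexes.",
    ]
    print(build_stuff_prompt("What is Milvus?", retrieved))
```

Numbering the chunks lets the model cite its sources as [1], [2]. When the retrieved text does not fit into one prompt, chain types such as map_reduce or refine process the chunks in several LLM calls instead.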
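b. Conversational system
a. Description
A conversational system with memory can be built with ConversationalRetrievalChain, which handles multi-turn dialogue. Milvus stores the knowledge base and the LLM generates the replies. The chain manages the conversation history and uses it to refine retrieval, so answers take the earlier turns into account. Streaming replies are supported, and the chain can be wired into a chat interface.
b. Code example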
---
from langchain_community.vectorstores import Milvus
from langchain_community.embeddings import OpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_community.llms import OpenAI
from langchain.memory import ConversationBufferMemory

# Create the vector store
embeddings = OpenAIEmbeddings()
vector_store = Milvus(
    embeddings,
    connection_args={"host": "localhost", "port": "19530"},
    collection_name="chat_knowledge"
)
# Create the conversation memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)
# Create the conversation chain
llm = OpenAI(temperature=0.7)
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    memory=memory,
    return_source_documents=True
)
# Interactive multi-turn conversation
print("Conversation started (type 'quit' to exit)\n")
while True:
    query = input("User: ")
    if query.lower() == 'quit':
        break
    result = conversation_chain.invoke({"question": query})
    print(f"\nAssistant: {result['answer']}\n")
    if result.get('source_documents'):
        print("Sources:")
        for i, doc in enumerate(result['source_documents'][:2]):
            print(f"  [{i+1}] {doc.page_content[:100]}...")
        print()
# Scripted demo (continues with the same memory as the interactive session)
demo_questions = [
    "What is machine learning?",
    "Can you give me an example?",
    "How does it differ from traditional programming?",
    "What are some applications?"
]
print("\nDemo conversation:\n")
for query in demo_questions:
    result = conversation_chain.invoke({"question": query})
    print(f"User: {query}")
    print(f"Assistant: {result['answer']}\n")
# Show the conversation history
print("\nConversation history:")
print(memory.load_memory_variables({}))
---
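10.2 LlamaIndex Integration
01. Index building
a. Vector index
a. Description
LlamaIndex (formerly GPT Index) is a data framework for LLM applications, and Milvus integrates with it as a vector store backend. It builds a vector index that stores the document embeddings, with the MilvusVectorStore class wrapping the Milvus operations. The integration covers the whole flow of loading, indexing and querying documents, works with many LLMs, and provides efficient document retrieval and question answering.
b. Code example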
---
# 安装依赖
# pip install llama-index llama-index-vector-stores-milvus pymilvus
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
# 1. 配置全局设置
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# 2. 加载文档
documents = SimpleDirectoryReader("./data").load_data()
print(f"加载了 {len(documents)} 个文档")
# 3. 创建Milvus向量存储
vector_store = MilvusVectorStore(
host="localhost",
port=19530,
dim=1536, # OpenAI embedding维度
collection_name="llamaindex_docs",
overwrite=True
)
# 4. 创建存储上下文
storage_context = StorageContext.from_defaults(
vector_store=vector_store
)
# 5. 构建索引
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
show_progress=True
)
print("索引构建完成!")
# 6. 查询索引
query_engine = index.as_query_engine(
similarity_top_k=3
)
response = query_engine.query("What is the main topic of these documents?")
print(f"\n查询: What is the main topic of these documents?")
print(f"回答: {response}")
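# 7. Streaming query (response_gen is only available when streaming=True)
streaming_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
streaming_response = streaming_engine.query("Explain the key concepts.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()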
# 8. 加载已有索引
# 后续使用时无需重新构建
vector_store_existing = MilvusVectorStore(
host="localhost",
port=19530,
collection_name="llamaindex_docs"
)
storage_context_existing = StorageContext.from_defaults(
vector_store=vector_store_existing
)
index_loaded = VectorStoreIndex.from_vector_store(
vector_store_existing,
storage_context=storage_context_existing
)
query_engine_loaded = index_loaded.as_query_engine()
response = query_engine_loaded.query("Summarize the content.")
print(f"\n从已有索引查询: {response}")
---
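b.Hybrid index
a.Description
LlamaIndex can combine several index types. Pairing a vector index with a keyword or summary index enables hybrid retrieval and improves accuracy: one index answers detail questions, the other answers broad ones, and a router chooses between them per query. Retrieval strategies and per-index weights are configurable, multimodal retrieval is supported, and graph and tree indexes are available as more advanced structures.
b.Code example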
---
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core.indices.composability import ComposableGraph
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
# 加载文档
documents = SimpleDirectoryReader("./data").load_data()
# 1. 创建向量索引
vector_store = MilvusVectorStore(
host="localhost",
port=19530,
collection_name="hybrid_index",
dim=1536
)
storage_context = StorageContext.from_defaults(
vector_store=vector_store
)
vector_index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context
)
# 2. 创建摘要索引
summary_index = SummaryIndex.from_documents(documents)
# 3. 创建查询引擎工具
vector_tool = QueryEngineTool.from_defaults(
query_engine=vector_index.as_query_engine(),
description="用于回答关于文档具体细节的问题"
)
summary_tool = QueryEngineTool.from_defaults(
query_engine=summary_index.as_query_engine(),
description="用于回答需要整体理解文档的问题"
)
# 4. 创建路由查询引擎
router_query_engine = RouterQueryEngine(
selector=LLMSingleSelector.from_defaults(),
query_engine_tools=[vector_tool, summary_tool]
)
# 5. 使用路由查询
response1 = router_query_engine.query(
"What is the specific definition of machine learning mentioned in the document?"
)
print(f"细节问题: {response1}")
response2 = router_query_engine.query(
"What is the overall theme of these documents?"
)
print(f"整体问题: {response2}")
# 6. 自定义混合检索
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
retriever = VectorIndexRetriever(
index=vector_index,
similarity_top_k=5
)
query_engine = RetrieverQueryEngine.from_args(
retriever=retriever,
response_mode="tree_summarize"
)
response = query_engine.query("Explain the main concepts.")
print(f"混合检索结果: {response}")
---
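02.Query optimization
a.Advanced queries
a.Description
LlamaIndex offers several advanced query modes: sub-question queries that decompose a complex question, multi-step reasoning that solves a problem incrementally, and hypothetical document embeddings (HyDE). Response synthesis modes are configurable, source nodes can be inspected for citation tracking, responses can be streamed, and query transformations can be customized.
b.Code example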
---
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.response.notebook_utils import display_response
# 加载索引
vector_store = MilvusVectorStore(
host="localhost",
port=19530,
collection_name="advanced_query"
)
index = VectorStoreIndex.from_vector_store(vector_store)
# 1. 子问题查询引擎
# 将复杂问题分解为多个子问题
query_engine_tools = [
QueryEngineTool(
query_engine=index.as_query_engine(),
metadata=ToolMetadata(
name="document_index",
description="包含文档的详细信息"
)
)
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=query_engine_tools
)
response = sub_question_engine.query(
"Compare and contrast the advantages and disadvantages of different machine learning approaches."
)
print(f"子问题查询: {response}")
# 2. 配置响应模式
# compact: 紧凑模式,合并文本块
query_engine_compact = index.as_query_engine(
response_mode="compact",
similarity_top_k=5
)
# tree_summarize: 树形摘要,层次化处理
query_engine_tree = index.as_query_engine(
response_mode="tree_summarize",
similarity_top_k=5
)
# refine: 精炼模式,迭代优化答案
query_engine_refine = index.as_query_engine(
response_mode="refine",
similarity_top_k=5
)
query = "What are the key principles of effective learning?"
response_compact = query_engine_compact.query(query)
response_tree = query_engine_tree.query(query)
response_refine = query_engine_refine.query(query)
print(f"\nCompact模式: {response_compact}")
print(f"\nTree模式: {response_tree}")
print(f"\nRefine模式: {response_refine}")
# 3. 流式响应
streaming_engine = index.as_query_engine(
streaming=True
)
streaming_response = streaming_engine.query("Explain neural networks.")
print("\n流式响应:")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()
# 4. 带元数据过滤的查询
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
filters = MetadataFilters(
filters=[
ExactMatchFilter(key="category", value="machine_learning")
]
)
filtered_engine = index.as_query_engine(
filters=filters,
similarity_top_k=3
)
response = filtered_engine.query("What is supervised learning?")
print(f"\n过滤查询: {response}")
# 5. 查看来源节点
response_with_sources = index.as_query_engine(
response_mode="compact"
).query("What is deep learning?")
print(f"\n回答: {response_with_sources}")
print(f"\n来源节点:")
for i, node in enumerate(response_with_sources.source_nodes):
    print(f"\n[{i+1}] Score: {node.score:.4f}")
    print(f"  Content: {node.text[:200]}...")
    print(f"  Metadata: {node.metadata}")
---
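b.Agent applications
a.Description
LlamaIndex supports building agents that use several tools to complete a task. Here Milvus backs one of those tools, a knowledge base. The agent picks a suitable tool for each question, plans multi-step reasoning, and can combine tools. Tools and strategies are customizable, which makes more complex AI applications possible.
b.Code example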
---
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata, FunctionTool
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.llms.openai import OpenAI
# 1. 创建知识库工具
vector_store = MilvusVectorStore(
host="localhost",
port=19530,
collection_name="agent_knowledge"
)
index = VectorStoreIndex.from_vector_store(vector_store)
knowledge_tool = QueryEngineTool(
query_engine=index.as_query_engine(),
metadata=ToolMetadata(
name="knowledge_base",
description="包含公司文档、产品信息、技术文档的知识库"
)
)
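# 2. Create custom function tools
def calculate(expression: str) -> str:
    """Evaluate a math expression."""
    # WARNING: eval() executes arbitrary Python; never pass it untrusted input.
    try:
        result = eval(expression)
        return f"Result: {result}"
    except Exception:
        return "Calculation error"
calc_tool = FunctionTool.from_defaults(fn=calculate)
def search_web(query: str) -> str:
    """Search the web."""
    # In a real application, call a search API here
    return f"Web search results: {query}"
web_tool = FunctionTool.from_defaults(fn=search_web)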
# 3. 创建ReAct Agent
llm = OpenAI(model="gpt-4", temperature=0)
agent = ReActAgent.from_tools(
tools=[knowledge_tool, calc_tool, web_tool],
llm=llm,
verbose=True
)
# 4. 使用Agent
response1 = agent.chat("What is our company's return policy?")
print(f"Agent回答: {response1}")
response2 = agent.chat("Calculate 15% discount on $299")
print(f"Agent回答: {response2}")
response3 = agent.chat(
"Find information about the latest AI trends and compare with our product features"
)
print(f"Agent回答: {response3}")
# 5. Multi-turn conversation
print("\nAgent chat mode (type 'quit' to exit):")
while True:
    user_input = input("\nUser: ")
    if user_input.lower() == 'quit':
        break
    response = agent.chat(user_input)
    print(f"Agent: {response}")
# 6. Inspect the conversation so far
# (the step-by-step reasoning is printed during agent.chat() because verbose=True)
response_with_reasoning = agent.chat(
    "What are the key features of our product and how much would it cost with a 20% discount?"
)
print(f"\nFinal answer: {response_with_reasoning}")
print("\nChat history:")
for message in agent.chat_history:
    print(f"  - {message}")
---
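The calculator tool above relies on eval(), which runs any Python expression the LLM passes in. A minimal sketch of a safer alternative follows, assuming only basic arithmetic is needed. The name `safe_calculate` and the operator whitelist are illustrative choices, not part of LlamaIndex.

```python
import ast
import operator

# Whitelisted operators; any other syntax (calls, names, attributes) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, /, % on numeric literals without calling eval()."""
    try:
        tree = ast.parse(expression, mode="eval")
        result = _eval_node(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError, OverflowError):
        return "calculation error"
    return f"result: {result}"
```

Such a function could be wrapped with FunctionTool.from_defaults(fn=safe_calculate) in the same way as the original tool. Exponentiation is left out on purpose, since large powers can be used to exhaust CPU and memory.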
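10.3 Haystack integration
01.Building pipelines
a.Document processing
a.Description
Haystack is an end-to-end NLP framework for building search and question-answering systems. Milvus serves as its document store backend through the MilvusDocumentStore class, covering document indexing, retrieval, and question answering. Components are combined in pipelines and can be paired with different Reader and Retriever nodes to build production-grade NLP applications. The examples below use the Haystack 1.x (farm-haystack) API.
b.Code example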
---
# 安装依赖
# pip install farm-haystack[milvus] pymilvus
from haystack.document_stores import MilvusDocumentStore
from haystack.nodes import PreProcessor, EmbeddingRetriever
from haystack.utils import convert_files_to_docs
# 1. 创建Milvus文档存储
document_store = MilvusDocumentStore(
host="localhost",
port=19530,
collection_name="haystack_docs",
embedding_dim=768,
similarity="cosine",
recreate_index=True
)
# 2. 加载文档
docs = convert_files_to_docs(
dir_path="./data",
clean_func=None,
split_paragraphs=True
)
print(f"加载了 {len(docs)} 个文档")
# 3. 预处理文档
preprocessor = PreProcessor(
clean_empty_lines=True,
clean_whitespace=True,
clean_header_footer=True,
split_by="word",
split_length=200,
split_overlap=20,
split_respect_sentence_boundary=True
)
processed_docs = preprocessor.process(docs)
print(f"预处理后: {len(processed_docs)} 个文档片段")
# 4. 写入文档存储
document_store.write_documents(processed_docs)
print("文档已写入Milvus")
# 5. 创建嵌入检索器
retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
model_format="sentence_transformers"
)
# 6. 更新文档嵌入
document_store.update_embeddings(retriever)
print("文档嵌入已更新")
# 7. 检索文档
query = "What is machine learning?"
retrieved_docs = retriever.retrieve(
query=query,
top_k=3
)
print(f"\n查询: {query}")
print(f"检索到 {len(retrieved_docs)} 个文档:\n")
for i, doc in enumerate(retrieved_docs):
    print(f"[{i+1}] Score: {doc.score:.4f}")
    print(f"  Content: {doc.content[:200]}...")
    print(f"  Metadata: {doc.meta}\n")
---
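The PreProcessor settings above (split_by="word", split_length=200, split_overlap=20) describe a sliding window over words. The sketch below illustrates that windowing in plain Python. It is a simplified stand-in written for these notes, not Haystack's implementation, and it ignores sentence boundaries.

```python
from typing import List


def split_words(text: str, split_length: int = 200, split_overlap: int = 20) -> List[str]:
    """Split text into chunks of split_length words, each sharing
    split_overlap words with the previous chunk."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window already reached the end of the text
    return chunks


demo_chunks = split_words(" ".join(str(i) for i in range(10)), split_length=4, split_overlap=1)
```

With a window of 4 words and an overlap of 1, each chunk starts with the last word of the previous chunk, so content cut at a chunk boundary still appears with some context in the next chunk.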
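b.Pipeline assembly
a.Description
Haystack assembles NLP applications as pipelines. A pipeline is made of nodes, and data flows from node to node. Retrieval, reading, and generation nodes are available, custom nodes and connections can be added to build complex flows, parallel and conditional branches are supported, and a pipeline can be saved and loaded.
b.Code example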
---
from haystack import Pipeline
from haystack.document_stores import MilvusDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader, PromptNode
from haystack.nodes import AnswerParser, PromptTemplate
# 1. 创建文档存储
document_store = MilvusDocumentStore(
host="localhost",
port=19530,
collection_name="haystack_pipeline",
embedding_dim=768
)
# 2. 创建检索器
retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)
# 3. 创建阅读器
reader = FARMReader(
model_name_or_path="deepset/roberta-base-squad2",
use_gpu=True
)
# 4. 构建检索式问答Pipeline
retrieval_qa_pipeline = Pipeline()
retrieval_qa_pipeline.add_node(
component=retriever,
name="Retriever",
inputs=["Query"]
)
retrieval_qa_pipeline.add_node(
component=reader,
name="Reader",
inputs=["Retriever"]
)
# 5. 运行Pipeline
query = "What are the main types of machine learning?"
result = retrieval_qa_pipeline.run(
query=query,
params={
"Retriever": {"top_k": 5},
"Reader": {"top_k": 3}
}
)
print(f"问题: {query}\n")
print("答案:")
for i, answer in enumerate(result["answers"]):
    print(f"\n[{i+1}] Answer: {answer.answer}")
    print(f"  Score: {answer.score:.4f}")
    print(f"  Context: {answer.context[:200]}...")
# 6. 构建生成式问答Pipeline(使用LLM)
prompt_template = PromptTemplate(
prompt="""根据以下上下文回答问题。
上下文: {join(documents)}
问题: {query}
答案:""",
output_parser=AnswerParser()
)
prompt_node = PromptNode(
model_name_or_path="gpt-3.5-turbo",
api_key="your-api-key",
default_prompt_template=prompt_template
)
generative_qa_pipeline = Pipeline()
generative_qa_pipeline.add_node(
component=retriever,
name="Retriever",
inputs=["Query"]
)
generative_qa_pipeline.add_node(
component=prompt_node,
name="PromptNode",
inputs=["Retriever"]
)
# 7. 运行生成式Pipeline
result_gen = generative_qa_pipeline.run(
query=query,
params={"Retriever": {"top_k": 3}}
)
print(f"\n生成式答案: {result_gen['answers'][0].answer}")
# 8. 保存和加载Pipeline
retrieval_qa_pipeline.save_to_yaml("qa_pipeline.yaml")
# 加载Pipeline
loaded_pipeline = Pipeline.load_from_yaml("qa_pipeline.yaml")
# 9. 批量查询
queries = [
"What is supervised learning?",
"Explain neural networks.",
"What is the difference between AI and ML?"
]
for q in queries:
    result = retrieval_qa_pipeline.run(
        query=q,
        params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 1}}
    )
    print(f"\nQuestion: {q}")
    print(f"Answer: {result['answers'][0].answer if result['answers'] else 'No answer found'}")
---
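02.Advanced applications
a.Multimodal retrieval
a.Description
Haystack can process multimodal documents, including text, tables, and images. Content is extracted from PDF, Word, and similar files, the resulting embeddings are stored in Milvus, and retrieval can cross modalities. OCR and image understanding are supported, which makes this a basis for document understanding, question answering, and enterprise document search.
b.Code example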
---
from haystack.document_stores import MilvusDocumentStore
from haystack.nodes import (
PDFToTextConverter,
PreProcessor,
EmbeddingRetriever,
TableTextRetriever
)
from haystack import Pipeline
# 1. 创建文档存储
document_store = MilvusDocumentStore(
host="localhost",
port=19530,
collection_name="multimodal_docs",
embedding_dim=768
)
# 2. 创建PDF转换器
pdf_converter = PDFToTextConverter(
remove_numeric_tables=False,
valid_languages=["en", "zh"]
)
# 3. 转换PDF文档
pdf_docs = pdf_converter.convert(
file_path="document.pdf",
meta={"source": "document.pdf"}
)
# 4. 预处理
preprocessor = PreProcessor(
split_by="word",
split_length=200,
split_overlap=20
)
processed_docs = preprocessor.process(pdf_docs)
# 5. 写入文档存储
document_store.write_documents(processed_docs)
# 6. 创建检索器
retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)
document_store.update_embeddings(retriever)
# 7. 表格检索
table_retriever = TableTextRetriever(
document_store=document_store,
embedding_model="deepset/all-mpnet-base-v2-table"
)
# 8. 构建多模态检索Pipeline
multimodal_pipeline = Pipeline()
multimodal_pipeline.add_node(
component=retriever,
name="TextRetriever",
inputs=["Query"]
)
multimodal_pipeline.add_node(
component=table_retriever,
name="TableRetriever",
inputs=["Query"]
)
# 9. 查询
query = "What are the sales figures for Q3?"
result = multimodal_pipeline.run(
query=query,
params={
"TextRetriever": {"top_k": 3},
"TableRetriever": {"top_k": 2}
}
)
print(f"查询: {query}\n")
print("Text results:")
for doc in result.get("documents", []):
    if doc.content_type == "text":
        print(f"  - {doc.content[:200]}...")
print("\nTable results:")
for doc in result.get("documents", []):
    if doc.content_type == "table":
        print(f"  - {doc.content}")
---
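b.Semantic search
a.Description
Build a semantic search system on Milvus and Haystack. Queries are written in natural language, and results are ranked by semantic relevance to the query intent, so synonyms and multilingual queries are handled. Filtering and sorting are supported, personalization can be layered on top, and the search can be exposed through an API for a website or application.
b.Code example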
---
from haystack.document_stores import MilvusDocumentStore
from haystack.nodes import EmbeddingRetriever, BM25Retriever
from haystack import Pipeline
from haystack.nodes import JoinDocuments
from flask import Flask, request, jsonify
# 1. 创建文档存储
document_store = MilvusDocumentStore(
host="localhost",
port=19530,
collection_name="semantic_search",
embedding_dim=768,
similarity="cosine"
)
# 2. 创建混合检索器
# 语义检索
embedding_retriever = EmbeddingRetriever(
document_store=document_store,
embedding_model="sentence-transformers/all-MiniLM-L6-v2"
)
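# Keyword retrieval
# NOTE: BM25Retriever expects a keyword-capable document store (for example
# Elasticsearch or OpenSearch); verify that your MilvusDocumentStore version
# supports BM25 queries before relying on this node.
bm25_retriever = BM25Retriever(document_store=document_store)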
# 3. 构建混合检索Pipeline
join_documents = JoinDocuments(
join_mode="concatenate"
)
hybrid_pipeline = Pipeline()
hybrid_pipeline.add_node(
component=embedding_retriever,
name="EmbeddingRetriever",
inputs=["Query"]
)
hybrid_pipeline.add_node(
component=bm25_retriever,
name="BM25Retriever",
inputs=["Query"]
)
hybrid_pipeline.add_node(
component=join_documents,
name="JoinDocuments",
inputs=["EmbeddingRetriever", "BM25Retriever"]
)
# 4. Create the search API
app = Flask(__name__)
@app.route("/search", methods=["POST"])
def search():
    data = request.json
    query = data.get("query", "")
    top_k = data.get("top_k", 5)
    filters = data.get("filters", {})
    result = hybrid_pipeline.run(
        query=query,
        params={
            "EmbeddingRetriever": {
                "top_k": top_k,
                "filters": filters
            },
            "BM25Retriever": {
                "top_k": top_k,
                "filters": filters
            }
        }
    )
    documents = result.get("documents", [])
    response = {
        "query": query,
        "total": len(documents),
        "results": [
            {
                "id": doc.id,
                "content": doc.content,
                "score": doc.score,
                "meta": doc.meta
            }
            for doc in documents[:top_k]
        ]
    }
    return jsonify(response)
@app.route("/index", methods=["POST"])
def index_documents():
    data = request.json
    documents = data.get("documents", [])
    document_store.write_documents(documents)
    document_store.update_embeddings(embedding_retriever)
    return jsonify({
        "status": "success",
        "indexed": len(documents)
    })
# 5. Start the API server
# app.run(host="0.0.0.0", port=8000)
# 6. 测试搜索
test_queries = [
"machine learning algorithms",
"deep neural networks",
"natural language processing"
]
for query in test_queries:
    result = hybrid_pipeline.run(
        query=query,
        params={
            "EmbeddingRetriever": {"top_k": 3},
            "BM25Retriever": {"top_k": 3}
        }
    )
    print(f"\nQuery: {query}")
    print(f"Result count: {len(result['documents'])}")
    for i, doc in enumerate(result["documents"][:3]):
        print(f"\n[{i+1}] Score: {doc.score:.4f}")
        print(f"  Content: {doc.content[:150]}...")
# 7. 带过滤的搜索
filtered_result = hybrid_pipeline.run(
query="machine learning",
params={
"EmbeddingRetriever": {
"top_k": 5,
"filters": {"category": ["AI", "ML"]}
},
"BM25Retriever": {
"top_k": 5,
"filters": {"category": ["AI", "ML"]}
}
}
)
print(f"\n过滤搜索结果: {len(filtered_result['documents'])} 个文档")
---
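The hybrid pipeline above concatenates the two result lists, but embedding scores and BM25 scores are on different scales and cannot be compared directly. One common way to merge such lists is reciprocal rank fusion (RRF), which uses only each document's rank. The sketch below is a plain-Python illustration; the function name and the constant k=60 are conventional choices assumed here, not taken from the pipeline code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=None):
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by doc_id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_k is None else fused[:top_k]


dense_ranking = ["a", "b", "c"]    # hypothetical embedding-retriever order
sparse_ranking = ["b", "d", "a"]   # hypothetical keyword-retriever order
fused_ranking = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that rank well in both lists ("b" and "a" here) rise to the top, while a document found by only one retriever is kept but ranked lower.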
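11 Operations and monitoring
11.1 Monitoring metrics
01.System metrics
a.Performance metrics
a.Description
Milvus exposes performance metrics such as QPS, latency, and throughput, along with CPU, memory, disk, and network usage, query performance, and index build progress. Metrics are exported in Prometheus format and can be visualized in Grafana to monitor system health in real time. Alert thresholds on these metrics surface problems early.
b.Code example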
---
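# Milvus performance metrics monitoring
# 1. Enable Prometheus metrics export in milvus.yaml:
#   metrics:
#     enabled: true
#     port: 9091
#     path: /metrics
# 2. Access the metrics endpoint
# curl http://localhost:9091/metrics
# 3. Main performance metrics
# NOTE: the metric names below are illustrative and may differ by Milvus
# version; check the /metrics output of your deployment for the exact names.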
performance_metrics = {
"查询性能": {
"milvus_query_qps": "每秒查询数",
"milvus_query_latency_ms": "查询延迟(毫秒)",
"milvus_query_success_rate": "查询成功率",
"milvus_query_timeout_count": "查询超时次数"
},
"写入性能": {
"milvus_insert_qps": "每秒插入数",
"milvus_insert_latency_ms": "插入延迟(毫秒)",
"milvus_insert_success_rate": "插入成功率",
"milvus_flush_duration_ms": "刷盘耗时"
},
"索引性能": {
"milvus_index_build_duration_ms": "索引构建耗时",
"milvus_index_build_progress": "索引构建进度",
"milvus_index_size_bytes": "索引大小(字节)"
},
"系统资源": {
"milvus_cpu_usage_percent": "CPU使用率",
"milvus_memory_usage_bytes": "内存使用量",
"milvus_disk_usage_bytes": "磁盘使用量",
"milvus_network_io_bytes": "网络IO"
}
}
# 4. Prometheus配置
prometheus_config = """
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'milvus'
static_configs:
- targets: ['localhost:9091']
labels:
instance: 'milvus-standalone'
cluster: 'production'
"""
# 5. Query metrics from Python
import requests
def get_milvus_metrics():
    response = requests.get("http://localhost:9091/metrics", timeout=5)
    response.raise_for_status()
    metrics = {}
    for line in response.text.split('\n'):
        # Comment lines (# HELP / # TYPE) never start with 'milvus_'
        if line.startswith('milvus_'):
            parts = line.split()
            if len(parts) >= 2:
                metric_name = parts[0].split('{')[0]
                # Series that differ only by labels overwrite each other here
                metrics[metric_name] = float(parts[-1])
    return metrics
# 获取当前指标
metrics = get_milvus_metrics()
print("Milvus性能指标:")
print(f" QPS: {metrics.get('milvus_query_qps', 0):.2f}")
print(f" 平均延迟: {metrics.get('milvus_query_latency_ms', 0):.2f}ms")
print(f" CPU使用率: {metrics.get('milvus_cpu_usage_percent', 0):.2f}%")
print(f" 内存使用: {metrics.get('milvus_memory_usage_bytes', 0) / 1024**3:.2f}GB")
# 6. PromQL query examples
promql_queries = {
    "Average QPS (5 min)": "rate(milvus_query_total[5m])",
    "P99 latency": "histogram_quantile(0.99, sum by (le) (rate(milvus_query_latency_ms_bucket[5m])))",
    "Error rate": "rate(milvus_query_errors_total[5m]) / rate(milvus_query_total[5m])",
    # rate() is meant for counters; memory usage is a gauge, so use deriv()
    "Memory growth rate": "deriv(milvus_memory_usage_bytes[5m])"
}
print("\nPromQL query examples:")
for name, query in promql_queries.items():
    print(f"  {name}: {query}")
---
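The P99 query above depends on how a quantile is estimated from histogram buckets. The sketch below mirrors the general approach: find the bucket that contains the target rank, then interpolate linearly inside it. It is an illustration written for these notes, assuming cumulative buckets and a lower bound of 0 for the first bucket, and it does not reproduce every edge case of Prometheus itself.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper_bound equal to +inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("last bucket must have an upper bound of +inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls into the +inf bucket: report the highest finite bound
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency buckets in milliseconds: (le, cumulative count)
latency_buckets = [(10, 50), (50, 90), (100, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
```

Because of the interpolation, the accuracy of the estimate depends on the bucket layout: a P99 threshold of 100 ms is only meaningful if there are bucket boundaries near 100 ms.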
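b.Business metrics
a.Description
Beyond system metrics, track business-level metrics: the number of collections and their data volume, vector dimension distribution and data growth trends, hot queries and slow queries, user behavior and usage patterns, and data quality and accuracy. Custom business metrics can be defined and exported for monitoring and analysis.
b.Code example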
---
from pymilvus import connections, utility, Collection
import time
from datetime import datetime
connections.connect(host="localhost", port="19530")
# 1. Collection-level metrics
def get_collection_metrics(collection_name):
    collection = Collection(collection_name)
    metrics = {
        "name": collection_name,
        # num_entities may lag behind recent inserts until the data is flushed
        "entity_count": collection.num_entities,
        "schema": {
            "fields": len(collection.schema.fields),
            "description": collection.schema.description
        },
        "indexes": []
    }
    # Collect index information; Collection.indexes lists the indexes on the collection
    for index in collection.indexes:
        index_info = index.params
        metrics["indexes"].append({
            "field": index.field_name,
            "type": index_info.get("index_type"),
            "params": index_info.get("params")
        })
    return metrics
# 2. Data growth monitoring
def monitor_data_growth(collection_name, interval=60):
    """Monitor the data growth trend (runs until interrupted)."""
    collection = Collection(collection_name)
    # Start from the current count so the first report is not the whole collection
    previous_count = collection.num_entities
    while True:
        time.sleep(interval)
        current_count = collection.num_entities
        growth = current_count - previous_count
        growth_rate = (growth / previous_count * 100) if previous_count > 0 else 0
        print(f"[{datetime.now()}] Entities: {current_count}, "
              f"growth: {growth:+d}, growth rate: {growth_rate:.2f}%")
        previous_count = current_count
# 3. Query performance statistics
class QueryMonitor:
    def __init__(self):
        self.query_count = 0
        self.total_latency = 0
        self.slow_queries = []
        self.error_count = 0
    def record_query(self, query, latency, success=True):
        self.query_count += 1
        if success:
            self.total_latency += latency
            # Record slow queries (>100 ms)
            if latency > 100:
                self.slow_queries.append({
                    "query": query,
                    "latency": latency,
                    "timestamp": datetime.now()
                })
        else:
            self.error_count += 1
    def get_stats(self):
        # Only successful queries add to total_latency, so average over those
        success_count = self.query_count - self.error_count
        avg_latency = self.total_latency / success_count if success_count > 0 else 0
        error_rate = self.error_count / self.query_count if self.query_count > 0 else 0
        return {
            "total_queries": self.query_count,
            "avg_latency_ms": avg_latency,
            "slow_queries": len(self.slow_queries),
            "error_rate": error_rate * 100
        }
# 4. 使用监控器
monitor = QueryMonitor()
collection = Collection("test_collection")
# Simulated queries
import numpy as np
for i in range(100):
    query_vector = [[np.random.random() for _ in range(128)]]
    start = time.time()
    try:
        results = collection.search(
            data=query_vector,
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 16}},
            limit=10
        )
        latency = (time.time() - start) * 1000
        monitor.record_query(f"query_{i}", latency, success=True)
    except Exception:
        monitor.record_query(f"query_{i}", 0, success=False)
# 5. 输出统计
stats = monitor.get_stats()
print("\n查询性能统计:")
print(f" 总查询数: {stats['total_queries']}")
print(f" 平均延迟: {stats['avg_latency_ms']:.2f}ms")
print(f" 慢查询数: {stats['slow_queries']}")
print(f" 错误率: {stats['error_rate']:.2f}%")
# 6. 导出指标到Prometheus
from prometheus_client import start_http_server, Gauge, Counter
# 定义指标
query_latency = Gauge('milvus_custom_query_latency_ms', 'Query latency in milliseconds')
query_count = Counter('milvus_custom_query_total', 'Total number of queries')
slow_query_count = Counter('milvus_custom_slow_query_total', 'Total number of slow queries')
# 启动HTTP服务器
# start_http_server(8000)
# 更新指标
# query_latency.set(stats['avg_latency_ms'])
# query_count.inc(stats['total_queries'])
# slow_query_count.inc(stats['slow_queries'])
---
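02.Alert configuration
a.Alert rules
a.Description
Alert rules catch system problems early. Prometheus evaluates the rules against thresholds, and Alertmanager sends the resulting notifications to channels such as email, DingTalk, and Slack. Assign severity levels and priorities, use grouping and inhibition to cut noise, and review the rules periodically to confirm they are still valid.
b.Code example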
---
# Prometheus alert rules
# alert_rules.yml
alert_rules = """
groups:
  - name: milvus_alerts
    interval: 30s
    rules:
      # High QPS
      - alert: HighQueryRate
        expr: rate(milvus_query_total[5m]) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Milvus query QPS is too high"
          description: "Current QPS: {{ $value }}, above the threshold of 10000"
      # High latency
      - alert: HighQueryLatency
        expr: histogram_quantile(0.99, rate(milvus_query_latency_ms_bucket[5m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Milvus query latency is too high"
          description: "P99 latency: {{ $value }}ms, above the threshold of 100ms"
      # Error rate
      - alert: HighErrorRate
        expr: rate(milvus_query_errors_total[5m]) / rate(milvus_query_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Milvus error rate is too high"
          description: "Error rate: {{ $value | humanizePercentage }}, above the threshold of 5%"
      # Memory usage
      - alert: HighMemoryUsage
        expr: milvus_memory_usage_bytes / milvus_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Milvus memory usage is too high"
          description: "Memory usage: {{ $value | humanizePercentage }}, above the threshold of 90%"
      # Disk usage
      - alert: HighDiskUsage
        expr: milvus_disk_usage_bytes / milvus_disk_limit_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Milvus disk usage is too high"
          description: "Disk usage: {{ $value | humanizePercentage }}, above the threshold of 85%"
      # Service unavailable
      - alert: MilvusDown
        expr: up{job="milvus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Milvus service is unavailable"
          description: "Milvus instance {{ $labels.instance }} is unreachable"
      # Slow index build
      - alert: SlowIndexBuilding
        expr: milvus_index_build_duration_ms > 300000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Index build is slow"
          description: "Index build time: {{ $value }}ms, more than 5 minutes"
"""
# Alertmanager configuration
# (addresses, webhook URL, and channel below are placeholders)
alertmanager_config = """
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'
receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
    # Slack incoming webhooks need slack_configs; webhook_configs would send
    # Alertmanager's own JSON payload, which Slack does not accept
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts'
  - name: 'warning'
    email_configs:
      - to: 'team@example.com'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
"""
print("告警规则配置示例:")
print(alert_rules)
print("\nAlertmanager配置示例:")
print(alertmanager_config)
---
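Every rule above carries a `for:` clause, which means the alert fires only after its condition has held continuously for that long. The sketch below simulates that pending-then-firing behavior on a list of samples. It is a simplified model written for these notes, with a hypothetical function name, and it is not Prometheus code.

```python
def firing_timestamps(samples, threshold, for_seconds):
    """Return the timestamps at which an alert with a `for:` clause is firing.

    samples: list of (timestamp_seconds, value) in ascending time order.
    The condition is value > threshold; it must hold continuously for at
    least for_seconds before the alert fires.
    """
    pending_since = None
    firing = []
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: pending
            if timestamp - pending_since >= for_seconds:
                firing.append(timestamp)
        else:
            pending_since = None  # any dip below the threshold resets the timer
    return firing


# One sample per minute; the threshold of 100 ms is exceeded from t=60 to t=360.
values = [50, 150, 150, 150, 150, 150, 150, 80, 150]
latency_samples = [(i * 60, v) for i, v in enumerate(values)]
fired_at = firing_timestamps(latency_samples, threshold=100, for_seconds=300)
```

In this run the condition becomes true at t=60 but the alert fires only at t=360, and the single low sample at t=420 resets the timer. A longer `for:` duration therefore trades detection speed for fewer alerts on short spikes.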
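b.Alert notification
a.Description
Deliver alerts over multiple channels: email, SMS, phone, and IM. Configure recipients and an on-call schedule, escalate alerts that go unhandled, and support acknowledgement and handling. Recording alert history and outcomes enables statistics and analysis, which in turn help tune the alerting strategy and reduce false positives.
b.Code example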
---
# Custom alert notifier
# (recipient addresses and webhook tokens below are placeholders)
import requests
import json
from datetime import datetime
class AlertNotifier:
    def __init__(self):
        self.alert_history = []
    def send_email(self, to, subject, body):
        """Send an email alert."""
        # In a real application, send through SMTP
        print(f"[Email] To: {to}")
        print(f"  Subject: {subject}")
        print(f"  Body: {body}")
    def send_dingtalk(self, webhook_url, message):
        """Send a DingTalk alert."""
        data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "Milvus alert",
                "text": message
            }
        }
        try:
            response = requests.post(
                webhook_url,
                headers={"Content-Type": "application/json"},
                data=json.dumps(data),
                timeout=5
            )
            print(f"[DingTalk] Sent, status: {response.status_code}")
        except Exception as e:
            print(f"[DingTalk] Send failed: {e}")
    def send_slack(self, webhook_url, message):
        """Send a Slack alert."""
        data = {
            "text": message,
            "username": "Milvus Alert",
            "icon_emoji": ":warning:"
        }
        try:
            response = requests.post(
                webhook_url,
                headers={"Content-Type": "application/json"},
                data=json.dumps(data),
                timeout=5
            )
            print(f"[Slack] Sent, status: {response.status_code}")
        except Exception as e:
            print(f"[Slack] Send failed: {e}")
    def process_alert(self, alert):
        """Record an alert and notify according to its severity."""
        alert_info = {
            "name": alert["labels"]["alertname"],
            "severity": alert["labels"]["severity"],
            "summary": alert["annotations"]["summary"],
            "description": alert["annotations"]["description"],
            "timestamp": datetime.now()
        }
        self.alert_history.append(alert_info)
        # Choose the notification channels by severity
        if alert_info["severity"] == "critical":
            # Critical: notify over several channels
            self.send_email(
                to="oncall@example.com",
                subject=f"[Critical] {alert_info['summary']}",
                body=alert_info["description"]
            )
            self.send_dingtalk(
                webhook_url="https://oapi.dingtalk.com/robot/send?access_token=xxx",
                message=f"## [Critical alert]\n\n**{alert_info['summary']}**\n\n{alert_info['description']}"
            )
        elif alert_info["severity"] == "warning":
            # Warning: email only
            self.send_email(
                to="team@example.com",
                subject=f"[Warning] {alert_info['summary']}",
                body=alert_info["description"]
            )
        return alert_info
# 使用告警通知器
notifier = AlertNotifier()
# 模拟告警
sample_alert = {
"labels": {
"alertname": "HighQueryLatency",
"severity": "warning",
"instance": "milvus-01"
},
"annotations": {
"summary": "Milvus查询延迟过高",
"description": "P99延迟: 150ms, 超过阈值100ms"
}
}
alert_info = notifier.process_alert(sample_alert)
print(f"\n告警已处理: {alert_info['name']}")
# Alert statistics
def get_alert_stats(notifier):
    stats = {
        "total": len(notifier.alert_history),
        "by_severity": {},
        "by_name": {}
    }
    for alert in notifier.alert_history:
        # Count by severity
        severity = alert["severity"]
        stats["by_severity"][severity] = stats["by_severity"].get(severity, 0) + 1
        # Count by alert name
        name = alert["name"]
        stats["by_name"][name] = stats["by_name"].get(name, 0) + 1
    return stats
stats = get_alert_stats(notifier)
print(f"\n告警统计:")
print(f" 总数: {stats['total']}")
print(f" 按严重程度: {stats['by_severity']}")
print(f" 按名称: {stats['by_name']}")
---
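Reducing noise usually means not re-sending the same alert on every evaluation. The sketch below shows one simple way to do that, loosely modeled on the idea behind Alertmanager's repeat_interval: remember when each alert was last sent and suppress repeats inside the interval. The class name and the (alertname, instance) key are assumptions made for this illustration.

```python
class AlertDeduplicator:
    """Suppress repeated notifications for the same alert within an interval."""

    def __init__(self, repeat_interval_seconds):
        self.repeat_interval = repeat_interval_seconds
        self._last_sent = {}  # (alertname, instance) -> time of last notification

    def should_notify(self, alert, now):
        """Return True if a notification should be sent at time `now` (seconds)."""
        labels = alert["labels"]
        key = (labels.get("alertname"), labels.get("instance"))
        last = self._last_sent.get(key)
        if last is not None and now - last < self.repeat_interval:
            return False  # same alert was sent recently: stay quiet
        self._last_sent[key] = now
        return True


dedup = AlertDeduplicator(repeat_interval_seconds=3600)
alert_a = {"labels": {"alertname": "HighQueryLatency", "instance": "milvus-01"}}
alert_b = {"labels": {"alertname": "HighQueryLatency", "instance": "milvus-02"}}
decisions = [
    dedup.should_notify(alert_a, now=0),      # first occurrence
    dedup.should_notify(alert_a, now=600),    # repeat inside the interval
    dedup.should_notify(alert_b, now=600),    # different instance
    dedup.should_notify(alert_a, now=3600),   # interval elapsed
]
```

A check like this could sit in front of a notifier's processing method, so that history is still recorded for every alert while notifications go out at most once per interval.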
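11.2 Log management
01.Log configuration
a.Log levels
a.Description
Milvus supports five log levels: debug, info, warn, error, and fatal. Use debug in development and info or warn in production. The level is set through the configuration file or environment variables, can be adjusted dynamically without a restart, and can differ per component. Choose the level to balance detail against performance.
b.Code example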
---
# Milvus log configuration (milvus.yaml)
log_config = """
log:
  level: info # debug, info, warn, error, fatal
  file:
    rootPath: /var/log/milvus
    maxSize: 300 # MB
    maxAge: 10 # days
    maxBackups: 20
  format: json # text or json
  stdout: true
"""
# 通过环境变量设置
# export LOG_LEVEL=debug
# export LOG_FORMAT=json
# export LOG_FILE_MAXSIZE=500
# Docker Compose配置
docker_compose_log = """
services:
milvus:
environment:
- LOG_LEVEL=info
- LOG_FORMAT=json
- LOG_FILE_MAXSIZE=300
- LOG_FILE_MAXAGE=10
- LOG_FILE_MAXBACKUPS=20
volumes:
- /var/log/milvus:/var/log/milvus
"""
# Kubernetes ConfigMap配置
k8s_log_config = """
apiVersion: v1
kind: ConfigMap
metadata:
name: milvus-log-config
namespace: milvus
data:
log.level: "info"
log.format: "json"
log.file.maxSize: "300"
log.file.maxAge: "10"
log.file.maxBackups: "20"
"""
print("日志配置示例:")
print(log_config)
print("\nDocker Compose日志配置:")
print(docker_compose_log)
print("\nKubernetes日志配置:")
print(k8s_log_config)
# 日志级别说明
log_levels = {
"debug": "详细调试信息,包含所有操作细节",
"info": "一般信息,记录重要操作和状态变化",
"warn": "警告信息,可能的问题但不影响运行",
"error": "错误信息,操作失败但服务继续运行",
"fatal": "致命错误,服务无法继续运行"
}
print("\nLog level descriptions:")
for level, desc in log_levels.items():
    print(f"  {level}: {desc}")
# 不同环境的推荐配置
env_configs = {
"开发环境": {
"level": "debug",
"format": "text",
"stdout": True
},
"测试环境": {
"level": "info",
"format": "json",
"stdout": True
},
"生产环境": {
"level": "warn",
"format": "json",
"stdout": False
}
}
print("\nRecommended configuration per environment:")
for env, config in env_configs.items():
    print(f"  {env}: {config}")
---
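The five levels form an ordering, and a configured level acts as a minimum severity. The sketch below applies that ordering to JSON-formatted log lines, for example to pull out everything at warn or above. The field names "level" and "msg" are assumptions about the log format; check them against real Milvus log output before use.

```python
import json

# Severity order of the five levels, lowest first
LEVEL_ORDER = {"debug": 0, "info": 1, "warn": 2, "error": 3, "fatal": 4}


def filter_by_level(lines, min_level="warn"):
    """Return the parsed JSON records whose level is at least min_level."""
    threshold = LEVEL_ORDER[min_level]
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).lower()
        # Unknown or missing levels map to -1 and are always dropped
        if LEVEL_ORDER.get(level, -1) >= threshold:
            kept.append(record)
    return kept


# Hypothetical log lines for illustration
sample_lines = [
    '{"level": "INFO", "msg": "collection loaded"}',
    '{"level": "WARN", "msg": "slow search"}',
    'not json at all',
    '{"level": "ERROR", "msg": "insert failed"}',
]
important = filter_by_level(sample_lines, min_level="warn")
```

This also shows why a production level of warn drops the routine info records: they fall below the threshold and are never written.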
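b.Log rotation
a.Description
Log rotation keeps log files from growing too large. Set the maximum size of a single file, the number of days to retain files, and the maximum number of backups. Rotation can be triggered by time or by size. Old files can be compressed automatically, expired files cleaned up on a schedule, and logs archived and backed up.
b.Code example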
---
# Log rotation configuration
# 1. Milvus built-in log rotation
milvus_log_rotation = """
log:
  file:
    rootPath: /var/log/milvus
    maxSize: 300 # at most 300 MB per file
    maxAge: 10 # keep for 10 days
    maxBackups: 20 # at most 20 backup files
"""
# 2. 使用logrotate(Linux)
logrotate_config = """
# /etc/logrotate.d/milvus
/var/log/milvus/*.log {
daily # 每天轮转
rotate 7 # 保留7天
compress # 压缩旧日志
delaycompress # 延迟压缩
missingok # 文件不存在不报错
notifempty # 空文件不轮转
create 0644 milvus milvus # 创建新文件权限
sharedscripts
postrotate
# 重新加载Milvus日志配置
killall -SIGUSR1 milvus || true
endscript
}
"""
# 3. Docker log rotation
docker_log_config = """
# docker-compose.yml
services:
  milvus:
    logging:
      driver: "json-file"
      options:
        max-size: "100m" # at most 100 MB per file
        max-file: "10" # at most 10 files
        compress: "true" # compress rotated logs
"""
# 4. Kubernetes日志轮转
k8s_log_rotation = """
# 使用fluentd或filebeat收集日志
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
data:
fluent.conf: |
<source>
@type tail
path /var/log/milvus/*.log
pos_file /var/log/fluentd/milvus.log.pos
tag milvus.*
<parse>
@type json
</parse>
</source>
<match milvus.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix milvus
<buffer>
@type file
path /var/log/fluentd/buffer
flush_interval 10s
</buffer>
</match>
"""
print("日志轮转配置:")
print("\n1. Milvus内置:")
print(milvus_log_rotation)
print("\n2. logrotate:")
print(logrotate_config)
print("\n3. Docker:")
print(docker_log_config)
print("\n4. Kubernetes:")
print(k8s_log_rotation)
# 5. Python script that removes old logs
import os
import time

def cleanup_old_logs(log_dir, days=7):
    """Delete *.log files in log_dir that were last modified more than `days` days ago."""
    cutoff_time = time.time() - (days * 86400)
    cleaned_count = 0
    cleaned_size = 0
    for filename in os.listdir(log_dir):
        filepath = os.path.join(log_dir, filename)
        # Only plain *.log files match; compressed files such as *.log.gz are left alone
        if os.path.isfile(filepath) and filename.endswith('.log'):
            if os.path.getmtime(filepath) < cutoff_time:
                file_size = os.path.getsize(filepath)
                os.remove(filepath)
                cleaned_count += 1
                cleaned_size += file_size
                print(f"Deleted: {filename}")
    print(f"\nCleanup finished: removed {cleaned_count} files, freed {cleaned_size/1024/1024:.2f}MB")

# cleanup_old_logs("/var/log/milvus", days=7)
---
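The interaction between an age limit and a backup-count limit can be sketched as a pure function. This is a minimal illustration, assuming a rotated file is removed as soon as it violates either limit; `RotatedLog` and `select_logs_to_delete` are hypothetical names, and the sketch does not claim to reproduce Milvus's internal rotation logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RotatedLog:
    name: str
    mtime: float  # last-modified time, seconds since the epoch


def select_logs_to_delete(files, now, max_age_days=10, max_backups=20):
    """Return the names of rotated logs to delete, oldest first.

    A file is deleted when it is older than max_age_days OR when it is not
    among the max_backups most recent files.
    """
    cutoff = now - max_age_days * 86400
    newest_first = sorted(files, key=lambda f: f.mtime, reverse=True)
    doomed = [
        f for rank, f in enumerate(newest_first)
        if f.mtime < cutoff or rank >= max_backups
    ]
    return [f.name for f in sorted(doomed, key=lambda f: f.mtime)]


if __name__ == "__main__":
    now = 1_700_000_000.0
    # One rotated file per day, 1 to 14 days old
    files = [RotatedLog(f"milvus-{i}.log", now - i * 86400) for i in range(1, 15)]
    print(select_logs_to_delete(files, now, max_age_days=10, max_backups=5))
```

Keeping the decision separate from the deletion makes the policy easy to test before it is pointed at a real log directory.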
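02.Log analysis
a.Log collection
a.Description
Collect Milvus logs centrally so that they are easy to analyze and query. An ELK or EFK stack is a common choice, with Filebeat, Fluentd, or Logstash as the collector. Central collection provides log aggregation and indexing, full-text search and filtering, log visualization, and log-based alerting and monitoring.
b.Code example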
---
# 日志收集方案
# 1. 使用Filebeat收集日志到Elasticsearch
filebeat_config = """
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/milvus/*.log
    fields:
      service: milvus
      environment: production
    # Store the custom fields at the top level of each event, so that
    # queries can match on `service` rather than `fields.service`
    fields_under_root: true
    json.keys_under_root: true
    json.add_error_key: true
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "milvus-logs-%{+yyyy.MM.dd}"
  username: "elastic"
  password: "changeme"
setup.kibana:
  host: "kibana:5601"
setup.ilm.enabled: true
setup.ilm.rollover_alias: "milvus-logs"
setup.ilm.pattern: "{now/d}-000001"
"""
# 2. 使用Fluentd收集日志
fluentd_config = """
# fluent.conf
<source>
@type tail
path /var/log/milvus/*.log
pos_file /var/log/fluentd/milvus.log.pos
tag milvus.log
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter milvus.log>
@type record_transformer
<record>
hostname "#{Socket.gethostname}"
service "milvus"
environment "production"
</record>
</filter>
<match milvus.log>
@type elasticsearch
host elasticsearch
port 9200
logstash_format true
logstash_prefix milvus
<buffer>
@type file
path /var/log/fluentd/buffer
flush_interval 10s
retry_max_times 3
</buffer>
</match>
"""
# 3. Docker Compose部署ELK
elk_docker_compose = """
version: '3'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
environment:
- discovery.type=single-node
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
volumes:
- es_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
kibana:
image: docker.elastic.co/kibana/kibana:8.5.0
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
ports:
- "5601:5601"
depends_on:
- elasticsearch
filebeat:
image: docker.elastic.co/beats/filebeat:8.5.0
user: root
volumes:
- ./filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/log/milvus:/var/log/milvus:ro
- filebeat_data:/usr/share/filebeat/data
depends_on:
- elasticsearch
volumes:
es_data:
filebeat_data:
"""
print("日志收集配置:")
print("\n1. Filebeat:")
print(filebeat_config)
print("\n2. Fluentd:")
print(fluentd_config)
print("\n3. ELK Docker Compose:")
print(elk_docker_compose)
# 4. Python查询Elasticsearch日志
from elasticsearch import Elasticsearch
from datetime import datetime, timedelta
def query_milvus_logs(es_host="http://localhost:9200", hours=1):
"""查询最近N小时的Milvus日志"""
es = Elasticsearch([es_host])
# 构建查询
query = {
"query": {
"bool": {
"must": [
{"match": {"service": "milvus"}},
{"range": {
"@timestamp": {
"gte": f"now-{hours}h",
"lte": "now"
}
}}
]
}
},
"sort": [{"@timestamp": {"order": "desc"}}],
"size": 100
}
# 执行查询
result = es.search(index="milvus-logs-*", body=query)
print(f"查询到 {result['hits']['total']['value']} 条日志:\n")
for hit in result['hits']['hits']:
log = hit['_source']
print(f"[{log.get('@timestamp')}] {log.get('level', 'INFO')}: {log.get('message', '')}")
return result
# query_milvus_logs(hours=1)
# 5. 查询错误日志
def query_error_logs(es_host="http://localhost:9200", hours=24):
"""查询错误日志"""
es = Elasticsearch([es_host])
query = {
"query": {
"bool": {
"must": [
{"match": {"service": "milvus"}},
{"terms": {"level": ["error", "fatal"]}},
{"range": {
"@timestamp": {
"gte": f"now-{hours}h"
}
}}
]
}
},
"aggs": {
"error_types": {
"terms": {
"field": "message.keyword",
"size": 10
}
}
}
}
result = es.search(index="milvus-logs-*", body=query)
print(f"错误日志统计:")
for bucket in result['aggregations']['error_types']['buckets']:
print(f" {bucket['key']}: {bucket['doc_count']}次")
# query_error_logs(hours=24)
---
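The two query functions above build nearly identical request bodies. A small builder keeps that structure in one place and can be checked without a running cluster. This is a sketch under assumptions: `build_log_query` is a hypothetical helper, and the field names `service`, `level`, and `@timestamp` are taken from the examples above rather than from any guaranteed Milvus log schema.

```python
def build_log_query(service, hours, levels=None, size=100):
    """Build an Elasticsearch search body for the recent logs of one service.

    levels, when given, restricts the result to those log levels.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    must = [
        {"match": {"service": service}},
        {"range": {"@timestamp": {"gte": f"now-{hours}h", "lte": "now"}}},
    ]
    if levels:
        # Levels are lowercased here on the assumption that the field is
        # indexed with the default analyzer, which lowercases tokens
        must.append({"terms": {"level": [lv.lower() for lv in levels]}})
    return {
        "query": {"bool": {"must": must}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": size,
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_log_query("milvus", 24, levels=["ERROR", "fatal"]), indent=2))
```

The returned dictionary has the same shape as the `query` variables used above, so it could be passed to the same search call.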
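b.Log analysis
a.Description
Analyze Milvus logs to find problems and optimization opportunities: count error types and their frequency, analyze slow queries and performance bottlenecks, and identify abnormal patterns and trends. The results can feed log reports and visualizations, log-based alerts and notifications, custom analysis rules, and a log query API.
b.Code example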
---
# Log analysis tool
import json
from collections import Counter

class LogAnalyzer:
    def __init__(self, log_file):
        self.log_file = log_file
        self.logs = []
        self.load_logs()

    def load_logs(self):
        """Load a log file that holds one JSON object per line."""
        with open(self.log_file, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    log = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip lines that are not valid JSON
                if isinstance(log, dict):
                    self.logs.append(log)

    def count_by_level(self):
        """Count log entries per level."""
        levels = [log.get('level', 'UNKNOWN') for log in self.logs]
        return Counter(levels)

    def find_errors(self):
        """Return error and fatal entries (level compared case-insensitively)."""
        return [log for log in self.logs
                if str(log.get('level', '')).lower() in ('error', 'fatal')]

    def find_slow_queries(self, threshold_ms=100):
        """Return query entries slower than threshold_ms, slowest first.

        Assumes each query entry carries a numeric `latency_ms` field.
        """
        slow_queries = []
        for log in self.logs:
            if 'query' in log.get('message', '').lower():
                latency = log.get('latency_ms', 0)
                if latency > threshold_ms:
                    slow_queries.append({
                        'time': log.get('time'),
                        'latency': latency,
                        'message': log.get('message')
                    })
        return sorted(slow_queries, key=lambda x: x['latency'], reverse=True)

    def analyze_patterns(self):
        """Return the 10 most frequent log messages."""
        messages = [log.get('message', '') for log in self.logs]
        return Counter(messages).most_common(10)

    def generate_report(self):
        """Build a summary report."""
        return {
            'total_logs': len(self.logs),
            'by_level': dict(self.count_by_level()),
            'error_count': len(self.find_errors()),
            'slow_query_count': len(self.find_slow_queries()),
            'top_messages': self.analyze_patterns()
        }
# 使用日志分析器
# analyzer = LogAnalyzer('/var/log/milvus/milvus.log')
# report = analyzer.generate_report()
# print("日志分析报告:")
# print(f" 总日志数: {report['total_logs']}")
# print(f" 按级别: {report['by_level']}")
# print(f" 错误数: {report['error_count']}")
# print(f" 慢查询数: {report['slow_query_count']}")
# Kibana查询示例
kibana_queries = {
"错误日志": {
"query": 'level:"error" OR level:"fatal"',
"time_range": "Last 24 hours"
},
"慢查询": {
"query": 'message:"query" AND latency_ms:>100',
"time_range": "Last 1 hour"
},
"高QPS": {
"query": 'message:"query"',
"aggregation": "count by 1 minute",
"threshold": "> 1000"
},
"内存告警": {
"query": 'message:"memory" AND level:"warn"',
"time_range": "Last 6 hours"
}
}
print("\nKibana查询示例:")
for name, query in kibana_queries.items():
print(f"\n{name}:")
for key, value in query.items():
print(f" {key}: {value}")
---
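Spotting abnormal trends usually starts with bucketing errors over time. The sketch below counts error and fatal entries per minute and flags the minutes that exceed a threshold. It assumes JSON log lines with a `level` field and an ISO-8601 `time` field such as `2024-01-15T10:00:03Z`; both function names are hypothetical.

```python
import json
from collections import Counter


def errors_per_minute(lines):
    """Count error/fatal entries per minute, keyed by 'YYYY-MM-DDTHH:MM'."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if str(entry.get("level", "")).lower() not in ("error", "fatal"):
            continue
        timestamp = str(entry.get("time", ""))
        if len(timestamp) >= 16:
            counts[timestamp[:16]] += 1  # truncate to minute precision
    return counts


def minutes_over_threshold(counts, threshold):
    """Return the minutes whose error count is strictly above threshold."""
    return sorted(minute for minute, n in counts.items() if n > threshold)


if __name__ == "__main__":
    sample = [
        '{"time": "2024-01-15T10:00:03Z", "level": "error", "message": "a"}',
        '{"time": "2024-01-15T10:00:41Z", "level": "ERROR", "message": "b"}',
        '{"time": "2024-01-15T10:01:02Z", "level": "info", "message": "c"}',
    ]
    print(errors_per_minute(sample))
```

The flagged minutes could then drive an alert or narrow down a Kibana time range.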
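11.3 Backup and recovery
01.Backup strategy
a.Full backup
a.Description
Run full backups on a regular schedule to protect your data. A backup covers vector data, metadata, and configuration files. Use the Milvus Backup tool or back up manually, writing to local disk or object storage. Set a retention policy, verify the integrity of each backup, record backup history and status, and automate the whole process.
b.Code example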
---
# Milvus全量备份
# 1. 使用Milvus Backup工具
backup_commands = """
# 安装Milvus Backup
wget https://github.com/zilliztech/milvus-backup/releases/download/v0.3.0/milvus-backup
chmod +x milvus-backup
# 配置backup.yaml
cat > backup.yaml <<EOF
milvus:
address: localhost
port: 19530
username: ""
password: ""
minio:
address: localhost
port: 9000
accessKeyID: minioadmin
secretAccessKey: minioadmin
useSSL: false
bucketName: milvus-bucket
backup:
backupPath: /backup/milvus
maxBackupNum: 7
EOF
# 创建备份
./milvus-backup create -n backup_20240115
# 列出备份
./milvus-backup list
# 查看备份详情
./milvus-backup get -n backup_20240115
# 删除备份
./milvus-backup delete -n backup_20240115
"""
# 2. 手动备份脚本
backup_script = """
#!/bin/bash
# Milvus手动备份脚本
BACKUP_DIR="/backup/milvus/$(date +%Y%m%d_%H%M%S)"
mkdir -p $BACKUP_DIR
echo "开始备份Milvus数据..."
# 备份MinIO数据(向量数据)
echo "备份MinIO数据..."
mc mirror milvus-minio/milvus-bucket $BACKUP_DIR/minio-data
# 备份etcd数据(元数据)
echo "备份etcd数据..."
kubectl exec -n milvus etcd-0 -- etcdctl snapshot save /tmp/snapshot.db
kubectl cp milvus/etcd-0:/tmp/snapshot.db $BACKUP_DIR/etcd-snapshot.db
# 备份配置文件
echo "备份配置文件..."
kubectl get configmap -n milvus -o yaml > $BACKUP_DIR/configmaps.yaml
kubectl get secret -n milvus -o yaml > $BACKUP_DIR/secrets.yaml
# 压缩备份
echo "压缩备份文件..."
tar -czf $BACKUP_DIR.tar.gz -C $(dirname $BACKUP_DIR) $(basename $BACKUP_DIR)
rm -rf $BACKUP_DIR
# 上传到S3
echo "上传到S3..."
aws s3 cp $BACKUP_DIR.tar.gz s3://milvus-backups/
# 清理本地备份(保留最近7天)
find /backup/milvus -name "*.tar.gz" -mtime +7 -delete
echo "备份完成: $BACKUP_DIR.tar.gz"
"""
# 3. Python backup script
import os
import shutil
import subprocess
from datetime import datetime

def backup_milvus(backup_dir="/backup/milvus"):
    """Run a Milvus backup. Raises CalledProcessError if any step fails."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = os.path.join(backup_dir, f"backup_{timestamp}")
    os.makedirs(backup_path, exist_ok=True)
    print(f"Starting backup to: {backup_path}")
    # check=True aborts on the first failed command, so a partial backup
    # is never reported as a success
    # Back up MinIO (vector data)
    print("Backing up MinIO data...")
    subprocess.run([
        "mc", "mirror",
        "milvus-minio/milvus-bucket",
        f"{backup_path}/minio-data"
    ], check=True)
    # Back up etcd (metadata)
    print("Backing up etcd data...")
    subprocess.run([
        "kubectl", "exec", "-n", "milvus", "etcd-0", "--",
        "etcdctl", "snapshot", "save", "/tmp/snapshot.db"
    ], check=True)
    subprocess.run([
        "kubectl", "cp",
        "milvus/etcd-0:/tmp/snapshot.db",
        f"{backup_path}/etcd-snapshot.db"
    ], check=True)
    # Compress the backup
    print("Compressing backup...")
    subprocess.run([
        "tar", "-czf", f"{backup_path}.tar.gz",
        "-C", backup_dir,
        f"backup_{timestamp}"
    ], check=True)
    # Remove the temporary directory
    shutil.rmtree(backup_path)
    print(f"Backup finished: {backup_path}.tar.gz")
    return f"{backup_path}.tar.gz"

# backup_milvus()
# 4. 定时备份(crontab)
crontab_config = """
# 每天凌晨2点执行备份
0 2 * * * /opt/scripts/backup-milvus.sh >> /var/log/milvus-backup.log 2>&1
# 每周日凌晨3点执行全量备份
0 3 * * 0 /opt/scripts/backup-milvus-full.sh >> /var/log/milvus-backup.log 2>&1
"""
print("备份命令:")
print(backup_commands)
print("\n备份脚本:")
print(backup_script)
print("\n定时备份配置:")
print(crontab_config)
---
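The scripts above create backups but do not verify them. One way to check integrity is a checksum manifest written at backup time and re-checked before a restore. This is a minimal sketch using only the standard library; the manifest file name `MANIFEST.json` and the function names are assumptions for illustration, not part of Milvus Backup.

```python
import hashlib
import json
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir, manifest_name="MANIFEST.json"):
    """Record the SHA-256 of every file under backup_dir."""
    manifest = {}
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            if name == manifest_name:
                continue
            path = os.path.join(root, name)
            manifest[os.path.relpath(path, backup_dir)] = sha256_of(path)
    with open(os.path.join(backup_dir, manifest_name), "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest


def verify_manifest(backup_dir, manifest_name="MANIFEST.json"):
    """Return a sorted list of files that are missing or whose hash changed."""
    with open(os.path.join(backup_dir, manifest_name)) as f:
        manifest = json.load(f)
    damaged = []
    for rel_path, expected in manifest.items():
        path = os.path.join(backup_dir, rel_path)
        if not os.path.isfile(path) or sha256_of(path) != expected:
            damaged.append(rel_path)
    return sorted(damaged)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "etcd-snapshot.db"), "wb") as f:
            f.write(b"snapshot")
        write_manifest(d)
        print("damaged files:", verify_manifest(d))
```

In the manual backup flow, `write_manifest` would run just before the `tar` step and `verify_manifest` right after extraction on the restore side.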
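b.Incremental backup
a.Description
An incremental backup copies only the data that has changed, which saves storage space. Changes are identified by timestamp or version number. This suits workloads with frequent updates and is used together with full backups: a backup baseline must be recorded, and a restore needs the full backup plus the incrementals taken after it. In exchange, each individual backup run stays fast.
b.Code example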
---
# Milvus增量备份实现
import json
import os
from datetime import datetime
from pymilvus import connections, Collection, utility
class IncrementalBackup:
    def __init__(self, backup_dir="/backup/milvus/incremental"):
        self.backup_dir = backup_dir
        self.metadata_file = f"{backup_dir}/metadata.json"
        # Create the directory up front so the first backup file can be written
        os.makedirs(self.backup_dir, exist_ok=True)
        self.load_metadata()

    def load_metadata(self):
        """Load backup metadata, starting fresh if none exists yet."""
        try:
            with open(self.metadata_file, 'r') as f:
                self.metadata = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            self.metadata = {
                "last_backup_time": None,
                "collections": {}
            }

    def save_metadata(self):
        """Save backup metadata."""
        with open(self.metadata_file, 'w') as f:
            json.dump(self.metadata, f, indent=2)

    def backup_collection(self, collection_name):
        """Incrementally back up one collection.

        Assumes the collection has an INT64 `timestamp` field that holds
        epoch seconds.
        """
        collection = Collection(collection_name)
        collection.load()  # query requires a loaded collection
        # Fix the upper bound before querying: rows written while the backup
        # runs fall into the next window instead of being skipped
        cutoff = int(datetime.now().timestamp())
        last_backup = self.metadata["collections"].get(collection_name, {}).get("last_backup_time")
        if last_backup is not None:
            expr = f"timestamp >= {int(last_backup)} and timestamp < {cutoff}"
        else:
            # First run: export everything up to the cutoff
            expr = f"timestamp < {cutoff}"
        # Large result sets may exceed the server's per-query limits and
        # then need to be read in pages
        results = collection.query(expr=expr, output_fields=["*"])
        if not results:
            print(f"{collection_name}: no new data")
            return
        # Save the incremental data
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_file = f"{self.backup_dir}/{collection_name}_{timestamp}.json"
        with open(backup_file, 'w') as f:
            # default=float converts NumPy scalar types that json cannot encode
            json.dump(results, f, default=float)
        # Update metadata
        self.metadata["collections"][collection_name] = {
            "last_backup_time": cutoff,
            "last_backup_file": backup_file,
            "record_count": len(results)
        }
        self.save_metadata()
        print(f"{collection_name}: backed up {len(results)} records to {backup_file}")

    def backup_all(self):
        """Incrementally back up every collection."""
        for coll_name in utility.list_collections():
            self.backup_collection(coll_name)
        self.metadata["last_backup_time"] = datetime.now().isoformat()
        self.save_metadata()
# 使用增量备份
# connections.connect(host="localhost", port="19530")
# backup = IncrementalBackup()
# backup.backup_all()
# 增量备份脚本
incremental_backup_script = """
#!/bin/bash
# 增量备份脚本
BACKUP_DIR="/backup/milvus/incremental"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
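# Read the time of the previous backup (epoch seconds; 0 on the first run)
LAST_BACKUP=$(cat $BACKUP_DIR/last_backup_time.txt 2>/dev/null || echo "0")
CURRENT_TIME=$(date +%s)
# --newer-than expects an age (a duration), not an epoch timestamp,
# so pass the number of seconds elapsed since the previous backup
ELAPSED=$((CURRENT_TIME - LAST_BACKUP))
# Back up the objects in MinIO that changed since the previous backup
mc mirror --newer-than ${ELAPSED}s milvus-minio/milvus-bucket $BACKUP_DIR/$TIMESTAMP/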
# 记录本次备份时间
echo $CURRENT_TIME > $BACKUP_DIR/last_backup_time.txt
# 压缩备份
tar -czf $BACKUP_DIR/incremental_$TIMESTAMP.tar.gz -C $BACKUP_DIR $TIMESTAMP
rm -rf $BACKUP_DIR/$TIMESTAMP
echo "增量备份完成: incremental_$TIMESTAMP.tar.gz"
"""
print("增量备份脚本:")
print(incremental_backup_script)
---
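Because a restore needs the full backup plus the incrementals taken after it, it helps to compute that chain explicitly before touching any data. The following sketch picks the latest full backup at or before a target time and then every later incremental up to that time. The `Backup` record and `plan_restore_chain` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    name: str
    kind: str        # "full" or "incremental"
    taken_at: float  # epoch seconds


def plan_restore_chain(backups, target_time):
    """Return backup names in the order they must be applied.

    The chain starts with the latest full backup taken at or before
    target_time, followed by every incremental taken after that full
    backup and at or before target_time.
    """
    eligible = sorted(
        (b for b in backups if b.taken_at <= target_time),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in eligible if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]
    incrementals = [
        b for b in eligible
        if b.kind == "incremental" and b.taken_at > base.taken_at
    ]
    return [b.name for b in [base] + incrementals]


if __name__ == "__main__":
    history = [
        Backup("full_sun", "full", 0),
        Backup("inc_mon", "incremental", 1),
        Backup("inc_tue", "incremental", 2),
        Backup("full_wed", "full", 3),
        Backup("inc_thu", "incremental", 4),
    ]
    print(plan_restore_chain(history, target_time=4.5))
```

A missing link in the chain (for example a deleted incremental) cannot be detected from timestamps alone, so retention rules should never prune an incremental while its base full backup is still kept as a restore point.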
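02.Recovery procedure
a.Data restore
a.Description
Restore Milvus data from a backup. Both full and incremental restores are supported. Stop the Milvus service before restoring, then restore vector data, metadata, and configuration. Afterwards, verify data integrity and test service availability. Record the restore procedure and its outcome, and maintain a recovery runbook that is rehearsed regularly.
b.Code example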
---
# Milvus数据恢复
# 1. 使用Milvus Backup恢复
restore_commands = """
# 列出可用备份
./milvus-backup list
# 恢复指定备份
./milvus-backup restore -n backup_20240115
# 恢复到指定Collection
./milvus-backup restore -n backup_20240115 -c collection_name
# 恢复并重命名Collection
./milvus-backup restore -n backup_20240115 -c old_name -t new_name
"""
# 2. 手动恢复脚本
restore_script = """
#!/bin/bash
# Milvus手动恢复脚本
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "用法: $0 <backup_file.tar.gz>"
exit 1
fi
echo "开始恢复Milvus数据..."
# 停止Milvus服务
echo "停止Milvus服务..."
kubectl scale deployment milvus-standalone --replicas=0 -n milvus
sleep 10
# 解压备份
echo "解压备份文件..."
RESTORE_DIR="/tmp/milvus_restore"
mkdir -p $RESTORE_DIR
tar -xzf $BACKUP_FILE -C $RESTORE_DIR
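# Restore etcd data
# Note: `snapshot restore` only writes a new data directory; etcd must then
# be restarted with its data directory pointing at /var/lib/etcd-restore
echo "Restoring etcd data..."
kubectl cp $RESTORE_DIR/etcd-snapshot.db milvus/etcd-0:/tmp/snapshot.db
kubectl exec -n milvus etcd-0 -- etcdctl snapshot restore /tmp/snapshot.db \\
--data-dir=/var/lib/etcd-restore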
# 恢复MinIO数据
echo "恢复MinIO数据..."
mc mirror $RESTORE_DIR/minio-data milvus-minio/milvus-bucket
# 恢复配置
echo "恢复配置..."
kubectl apply -f $RESTORE_DIR/configmaps.yaml
kubectl apply -f $RESTORE_DIR/secrets.yaml
# 启动Milvus服务
echo "启动Milvus服务..."
kubectl scale deployment milvus-standalone --replicas=1 -n milvus
# 等待服务就绪
echo "等待服务就绪..."
kubectl wait --for=condition=ready pod -l app=milvus -n milvus --timeout=300s
# 清理临时文件
rm -rf $RESTORE_DIR
echo "恢复完成!"
"""
# 3. Python恢复脚本
import subprocess
import os
import time
def restore_milvus(backup_file):
"""恢复Milvus数据"""
print(f"开始恢复: {backup_file}")
# 停止服务
print("停止Milvus服务...")
subprocess.run([
"kubectl", "scale", "deployment", "milvus-standalone",
"--replicas=0", "-n", "milvus"
])
time.sleep(10)
# 解压备份
print("解压备份...")
restore_dir = "/tmp/milvus_restore"
os.makedirs(restore_dir, exist_ok=True)
subprocess.run([
"tar", "-xzf", backup_file,
"-C", restore_dir
])
# 恢复数据
print("恢复数据...")
# ... 恢复逻辑 ...
# 启动服务
print("启动服务...")
subprocess.run([
"kubectl", "scale", "deployment", "milvus-standalone",
"--replicas=1", "-n", "milvus"
])
# 等待就绪
print("等待服务就绪...")
subprocess.run([
"kubectl", "wait", "--for=condition=ready",
"pod", "-l", "app=milvus",
"-n", "milvus", "--timeout=300s"
])
print("恢复完成!")
# restore_milvus("/backup/milvus/backup_20240115.tar.gz")
# 4. 验证恢复
from pymilvus import connections, utility, Collection
def verify_restore():
"""验证恢复后的数据"""
connections.connect(host="localhost", port="19530")
print("验证恢复结果:\n")
# 检查Collections
collections = utility.list_collections()
print(f"Collections数量: {len(collections)}")
for coll_name in collections:
collection = Collection(coll_name)
count = collection.num_entities
print(f" {coll_name}: {count} entities")
# 测试查询
if collections:
collection = Collection(collections[0])
collection.load()
import numpy as np
query_vector = [[np.random.random() for _ in range(128)]]
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10
)
print(f"\n测试查询成功: 返回{len(results[0])}个结果")
connections.disconnect("default")
# verify_restore()
print("恢复命令:")
print(restore_commands)
print("\n恢复脚本:")
print(restore_script)
---
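Printing entity counts after a restore only helps if there is something to compare them with. A simple approach, sketched below, is to record per-collection counts at backup time and diff them against the counts read after the restore. The function name and the idea of storing the expected counts alongside the backup are assumptions for illustration.

```python
def compare_entity_counts(expected, actual):
    """Return {collection: (expected_count, actual_count)} for every mismatch.

    A count of None means the collection is absent on that side.
    """
    mismatches = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            mismatches[name] = (want, got)
    for name in actual.keys() - expected.keys():
        mismatches[name] = (None, actual[name])
    return mismatches


if __name__ == "__main__":
    # Placeholder numbers for illustration only
    recorded_at_backup = {"docs": 1000, "images": 250}
    seen_after_restore = {"docs": 1000, "images": 240}
    print(compare_entity_counts(recorded_at_backup, seen_after_restore))
```

In practice the `actual` dictionary could be filled from `collection.num_entities` for each name returned by `utility.list_collections()`, as `verify_restore` already does when printing.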
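b.Disaster recovery
a.Description
Prepare a disaster recovery plan for extreme failures. Define RTO (recovery time objective) and RPO (recovery point objective) targets. Provision a standby environment and resources, rehearse the recovery procedure regularly, document the recovery steps, establish an incident response team, set up cross-region disaster recovery, and monitor recovery progress and status.
b.Code example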
---
# 灾难恢复计划
disaster_recovery_plan = """
# Milvus灾难恢复计划
## 1. 恢复目标
- RTO (恢复时间目标): 2小时
- RPO (恢复点目标): 24小时
## 2. 恢复流程
### 2.1 评估阶段(15分钟)
- [ ] 确认灾难类型和影响范围
- [ ] 评估数据丢失程度
- [ ] 确定恢复策略
- [ ] 通知相关人员
### 2.2 准备阶段(30分钟)
- [ ] 准备备用环境
- [ ] 下载最新备份
- [ ] 验证备份完整性
- [ ] 准备恢复工具
### 2.3 恢复阶段(60分钟)
- [ ] 部署Milvus集群
- [ ] 恢复etcd数据
- [ ] 恢复MinIO数据
- [ ] 恢复配置文件
- [ ] 启动服务
### 2.4 验证阶段(15分钟)
- [ ] 验证数据完整性
- [ ] 测试查询功能
- [ ] 测试写入功能
- [ ] 性能测试
## 3. 联系人
- 技术负责人: xxx (电话: xxx)
- 运维负责人: xxx (电话: xxx)
- 业务负责人: xxx (电话: xxx)
## 4. 备用资源
- 备用集群: xxx
- 备份存储: s3://milvus-backups/
- 监控地址: https://monitoring.example.com
"""
# 灾难恢复脚本
dr_script = """
#!/bin/bash
# 灾难恢复自动化脚本
set -e
echo "=========================================="
echo "Milvus灾难恢复脚本"
echo "=========================================="
# 1. 评估阶段
echo "1. 评估灾难影响..."
BACKUP_LOCATION="s3://milvus-backups/"
LATEST_BACKUP=$(aws s3 ls $BACKUP_LOCATION | sort | tail -n 1 | awk '{print $4}')
echo "最新备份: $LATEST_BACKUP"
# 2. 准备阶段
echo "2. 准备恢复环境..."
# 创建新的Kubernetes命名空间
kubectl create namespace milvus-dr
# 部署依赖服务
helm install etcd bitnami/etcd -n milvus-dr
helm install minio bitnami/minio -n milvus-dr
helm install pulsar apache/pulsar -n milvus-dr
# 3. 恢复阶段
echo "3. 恢复数据..."
# 下载备份
aws s3 cp $BACKUP_LOCATION$LATEST_BACKUP /tmp/backup.tar.gz
# 解压备份
tar -xzf /tmp/backup.tar.gz -C /tmp/
# 恢复数据
# ... 恢复逻辑 ...
# 部署Milvus
helm install milvus-dr milvus/milvus -n milvus-dr
# 4. 验证阶段
echo "4. 验证恢复结果..."
# 等待服务就绪
kubectl wait --for=condition=ready pod -l app=milvus -n milvus-dr --timeout=300s
# 运行验证脚本
python3 verify_restore.py
echo "=========================================="
echo "灾难恢复完成!"
echo "=========================================="
"""
print("灾难恢复计划:")
print(disaster_recovery_plan)
print("\n灾难恢复脚本:")
print(dr_script)
---
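The plan above budgets 15 + 30 + 60 + 15 minutes across its four stages, which adds up exactly to the 2-hour RTO, and a daily backup schedule bounds data loss at roughly 24 hours. The sketch below makes both checks explicit. The stage names and function names are hypothetical; the durations and targets are the ones stated in the plan.

```python
from datetime import datetime, timedelta

# Stage budgets from the recovery plan, in minutes
STAGE_MINUTES = {"assess": 15, "prepare": 30, "restore": 60, "verify": 15}


def planned_rto(stage_minutes):
    """Total planned recovery time across all stages."""
    return timedelta(minutes=sum(stage_minutes.values()))


def achieved_rpo(backup_times, incident_time):
    """Data-loss window: time from the last usable backup to the incident."""
    earlier = [t for t in backup_times if t <= incident_time]
    if not earlier:
        raise ValueError("no backup precedes the incident")
    return incident_time - max(earlier)


def meets_targets(stage_minutes, backup_times, incident_time,
                  rto_target=timedelta(hours=2),
                  rpo_target=timedelta(hours=24)):
    """Return (rto_ok, rpo_ok) for the given plan and backup history."""
    return (
        planned_rto(stage_minutes) <= rto_target,
        achieved_rpo(backup_times, incident_time) <= rpo_target,
    )


if __name__ == "__main__":
    backups = [datetime(2024, 1, 14, 2, 0), datetime(2024, 1, 15, 2, 0)]
    incident = datetime(2024, 1, 15, 23, 30)
    print(planned_rto(STAGE_MINUTES), achieved_rpo(backups, incident))
```

Note that the plan leaves no slack: any stage overrunning its budget breaks the RTO, and a single skipped nightly backup breaks the RPO.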
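11.4 Troubleshooting
01.Common failures
a.Connection failure
a.Description
Connection failure is one of the most common problems. Possible causes include network issues, a service that is not running, a wrong port, or a firewall blocking traffic. Check the Milvus service status and network connectivity, verify the connection parameters, review firewall and security-group rules, and check DNS resolution. Test the port with telnet or curl, and read the Milvus logs for detailed error messages.
b.Code example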
---
# 连接失败故障排查
from pymilvus import connections
import socket
import subprocess
def diagnose_connection(host="localhost", port="19530"):
"""诊断连接问题"""
print(f"诊断Milvus连接: {host}:{port}\n")
# 1. 检查网络连通性
print("1. 检查网络连通性...")
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
result = sock.connect_ex((host, int(port)))
sock.close()
if result == 0:
print(" ✓ 端口可达")
else:
print(f" ✗ 端口不可达 (错误码: {result})")
return
except Exception as e:
print(f" ✗ 网络错误: {e}")
return
# 2. 检查DNS解析
print("\n2. 检查DNS解析...")
try:
ip = socket.gethostbyname(host)
print(f" ✓ DNS解析成功: {host} -> {ip}")
except Exception as e:
print(f" ✗ DNS解析失败: {e}")
# 3. 测试Milvus连接
print("\n3. 测试Milvus连接...")
try:
connections.connect(
alias="test",
host=host,
port=port,
timeout=10
)
print(" ✓ Milvus连接成功")
connections.disconnect("test")
except Exception as e:
print(f" ✗ Milvus连接失败: {e}")
# 4. 检查服务状态
print("\n4. 检查服务状态...")
try:
result = subprocess.run(
["kubectl", "get", "pods", "-n", "milvus"],
capture_output=True,
text=True
)
print(result.stdout)
except:
print(" 无法检查Kubernetes状态")
# 5. 检查防火墙
print("\n5. 防火墙检查建议:")
print(" - 检查iptables规则: sudo iptables -L")
print(" - 检查firewalld: sudo firewall-cmd --list-all")
print(" - 检查云安全组配置")
# 6. 检查日志
print("\n6. 查看日志:")
print(f" kubectl logs -n milvus <pod-name>")
print(f" 或: docker logs milvus-standalone")
# diagnose_connection("localhost", "19530")
# 常见连接错误及解决方案
connection_errors = {
"connection refused": {
"原因": "服务未启动或端口未监听",
"解决方案": [
"检查Milvus服务状态",
"验证端口配置",
"查看服务日志"
]
},
"timeout": {
"原因": "网络不通或服务响应慢",
"解决方案": [
"检查网络连通性",
"增加超时时间",
"检查服务负载"
]
},
"authentication failed": {
"原因": "用户名或密码错误",
"解决方案": [
"验证认证信息",
"检查用户权限",
"重置密码"
]
},
"DNS resolution failed": {
"原因": "域名无法解析",
"解决方案": [
"检查DNS配置",
"使用IP地址连接",
"检查hosts文件"
]
}
}
print("\n常见连接错误及解决方案:")
for error, info in connection_errors.items():
print(f"\n{error}:")
print(f" 原因: {info['原因']}")
print(f" 解决方案:")
for solution in info['解决方案']:
print(f" - {solution}")
---
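Transient failures such as a service that is still starting often clear up on their own, so clients commonly retry with exponential backoff before running a full diagnosis. This is a generic sketch, not a pymilvus feature; `retry_with_backoff` and its parameters are hypothetical, and `sleep` is injectable so the behaviour can be tested without waiting.

```python
import time


def retry_with_backoff(func, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call func() until it succeeds or attempts are exhausted.

    The delay doubles after each failure (base_delay, 2x, 4x, ...) and is
    capped at max_delay. The last exception is re-raised on final failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(min(base_delay * 2 ** attempt, max_delay))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("not ready yet")
        return "connected"

    print(retry_with_backoff(flaky, sleep=lambda s: None))
```

A connection attempt could be wrapped as `retry_with_backoff(lambda: connections.connect(host=host, port=port))`. Retrying does not help with permanent errors such as bad credentials, so those are better surfaced immediately.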
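b.Query timeout
a.Description
Query timeouts are usually caused by performance problems. Possible causes include a large data volume, an untuned index, insufficient resources, or high concurrency. Check the query parameters, tune the index type and its parameters, add Query Node resources, adjust the timeout, analyze slow-query logs, rate-limit queries, and revisit the data model.
b.Code example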
---
# Query-timeout troubleshooting
from pymilvus import connections, Collection
import time
import numpy as np

def diagnose_query_timeout(collection_name):
    """Diagnose query timeouts for one collection."""
    connections.connect(host="localhost", port="19530")
    collection = Collection(collection_name)
    collection.load()
    print(f"Diagnosing collection: {collection_name}\n")
    # 1. Collection info
    print("1. Collection info:")
    print(f"   entities: {collection.num_entities}")
    print(f"   fields: {len(collection.schema.fields)}")
    # 2. Index info (collection.indexes lists every index on the collection)
    print("\n2. Index info:")
    for index in collection.indexes:
        print(f"   {index.field_name}:")
        print(f"     type: {index.params.get('index_type')}")
        print(f"     params: {index.params.get('params')}")
    # 3. Query latency test
    # Assumes a 128-dimensional vector field named "embedding" with an IVF index
    print("\n3. Query latency test:")
    test_cases = [
        {"nprobe": 8, "limit": 10},
        {"nprobe": 16, "limit": 10},
        {"nprobe": 32, "limit": 10},
        {"nprobe": 16, "limit": 100}
    ]
    query_vector = [[np.random.random() for _ in range(128)]]
    for case in test_cases:
        start = time.time()
        try:
            collection.search(
                data=query_vector,
                anns_field="embedding",
                # Only index search parameters belong in "params";
                # limit is passed as its own argument
                param={"metric_type": "L2", "params": {"nprobe": case["nprobe"]}},
                limit=case["limit"],
                timeout=30
            )
            latency = (time.time() - start) * 1000
            print(f"   nprobe={case['nprobe']}, limit={case['limit']}: {latency:.2f}ms")
        except Exception as e:
            print(f"   nprobe={case['nprobe']}, limit={case['limit']}: timed out or failed ({e})")
    # 4. Resource usage
    print("\n4. Resource checks:")
    print("   - Check Query Node CPU and memory usage")
    print("   - Check whether more Query Nodes are needed")
    print("   - Check whether the index is loaded into memory")
    # 5. Optimization suggestions
    print("\n5. Optimization suggestions:")
    if collection.num_entities > 10000000:
        print("   - Large data volume: consider sharding or partitioning")
    print("   - Tune index parameters (lower nprobe)")
    print("   - Add Query Node resources")
    print("   - Use a more efficient index type (such as HNSW)")
    print("   - Add a query cache")
    connections.disconnect("default")

# diagnose_query_timeout("test_collection")
# 查询超时优化方案
optimization_strategies = {
"索引优化": {
"FLAT -> IVF_FLAT": "适合中等规模数据",
"IVF_FLAT -> IVF_PQ": "牺牲精度换取速度",
"IVF -> HNSW": "更好的查询性能"
},
"参数调优": {
"降低nprobe": "减少搜索的聚类中心数量",
"降低limit": "减少返回结果数量",
"增加timeout": "给予更多查询时间"
},
"资源扩展": {
"增加Query Node": "提升并发查询能力",
"增加内存": "缓存更多索引数据",
"使用SSD": "加快数据加载速度"
},
"架构优化": {
"数据分区": "按业务逻辑分区数据",
"查询缓存": "缓存热门查询结果",
"异步查询": "使用异步API"
}
}
print("\n查询超时优化方案:")
for category, strategies in optimization_strategies.items():
print(f"\n{category}:")
for strategy, desc in strategies.items():
print(f" {strategy}: {desc}")
---
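Query rate limiting is listed as a remedy but not shown. A token bucket is one common way to do it: each query consumes a token, tokens refill at a fixed rate, and bursts are bounded by the bucket capacity. The class below is a minimal single-threaded sketch with hypothetical names; it is not part of pymilvus, and a production version would need a lock around `allow`.

```python
import time


class TokenBucket:
    """Allow up to `rate` operations per second with bursts of `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable for testing
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Consume one token if available; return whether the call may proceed."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=100, capacity=20)
    accepted = sum(bucket.allow() for _ in range(50))
    print(f"accepted {accepted} of 50 immediate requests")
```

A search wrapper would call `bucket.allow()` first and either reject the request or wait briefly when it returns `False`, which keeps overload from turning into timeouts for every caller.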
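02.Performance problems
a.Performance analysis
a.Description
A drop in system performance calls for a systematic analysis. Monitor QPS, latency, and resource usage. Analyze slow queries and hot data. Check index efficiency and data distribution, and assess whether hardware resources are sufficient. Locate the bottleneck, draw up an optimization plan, and run performance tests to verify the effect.
b.Code example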
# 性能分析工具
from pymilvus import connections, Collection, utility, DataType
import time
import numpy as np
from collections import defaultdict
class PerformanceAnalyzer:
def __init__(self, host="localhost", port="19530"):
connections.connect(host=host, port=port)
self.metrics = defaultdict(list)
def analyze_collection(self, collection_name):
"""分析Collection性能"""
collection = Collection(collection_name)
collection.load()
print(f"性能分析: {collection_name}\n")
# 1. 基本信息
print("1. 基本信息:")
print(f" 数据量: {collection.num_entities:,}")
print(f" 字段数: {len(collection.schema.fields)}")
# 2. 索引分析
print("\n2. 索引分析:")
for field in collection.schema.fields:
if field.dtype in [DataType.FLOAT_VECTOR, DataType.BINARY_VECTOR]:
index = next(i for i in collection.indexes if i.field_name == field.name)  # look up the index built on this field
print(f" {field.name}:")
print(f" 类型: {index.params.get('index_type')}")
print(f" 参数: {index.params.get('params')}")
# 3. 查询性能测试
print("\n3. 查询性能测试:")
query_vector = [[np.random.random() for _ in range(128)]]
# 测试不同参数组合
test_params = [
{"nprobe": 8, "limit": 10},
{"nprobe": 16, "limit": 10},
{"nprobe": 32, "limit": 10},
]
for params in test_params:
latencies = []
# 多次测试取平均
for _ in range(10):
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": params["nprobe"]}},
limit=params["limit"]
)
latency = (time.time() - start) * 1000
latencies.append(latency)
avg_latency = sum(latencies) / len(latencies)
p99_latency = sorted(latencies)[int(len(latencies) * 0.99)]
print(f" nprobe={params['nprobe']}:")
print(f" 平均延迟: {avg_latency:.2f}ms")
print(f" P99延迟: {p99_latency:.2f}ms")
self.metrics[f"nprobe_{params['nprobe']}"] = {
"avg": avg_latency,
"p99": p99_latency
}
# 4. 并发性能测试
print("\n4. 并发性能测试:")
self.test_concurrent_queries(collection, threads=10)
# 5. 性能评分
print("\n5. 性能评分:")
score = self.calculate_performance_score()
print(f" 总分: {score}/100")
# 6. 优化建议
print("\n6. 优化建议:")
self.generate_recommendations(collection)
def test_concurrent_queries(self, collection, threads=10):
"""测试并发查询性能"""
import threading
query_vector = [[np.random.random() for _ in range(128)]]
results = []
def query_worker():
start = time.time()
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10
)
latency = (time.time() - start) * 1000
results.append(latency)
# 启动并发查询
thread_list = []
start = time.time()
for _ in range(threads):
t = threading.Thread(target=query_worker)
t.start()
thread_list.append(t)
for t in thread_list:
t.join()
total_time = (time.time() - start) * 1000
avg_latency = sum(results) / len(results)
print(f" 并发数: {threads}")
print(f" 总耗时: {total_time:.2f}ms")
print(f" 平均延迟: {avg_latency:.2f}ms")
print(f" QPS: {threads / (total_time / 1000):.2f}")
def calculate_performance_score(self):
"""计算性能评分"""
score = 100
# 根据延迟扣分
avg_latency = self.metrics.get("nprobe_16", {}).get("avg", 0)
if avg_latency > 100:
score -= 20
elif avg_latency > 50:
score -= 10
# 根据P99延迟扣分
p99_latency = self.metrics.get("nprobe_16", {}).get("p99", 0)
if p99_latency > 200:
score -= 20
elif p99_latency > 100:
score -= 10
return max(score, 0)
def generate_recommendations(self, collection):
"""生成优化建议"""
recommendations = []
# 检查数据量
if collection.num_entities > 10000000:
recommendations.append("数据量较大,建议使用分区")
# 检查延迟
avg_latency = self.metrics.get("nprobe_16", {}).get("avg", 0)
if avg_latency > 100:
recommendations.append("查询延迟较高,建议优化索引或增加资源")
# 检查索引
for field in collection.schema.fields:
if field.dtype in [DataType.FLOAT_VECTOR, DataType.BINARY_VECTOR]:
index = next(i for i in collection.indexes if i.field_name == field.name)  # look up the index built on this field
index_type = index.params.get('index_type')
if index_type == 'FLAT' and collection.num_entities > 100000:
recommendations.append(f"字段{field.name}使用FLAT索引,建议切换到IVF或HNSW")
if not recommendations:
recommendations.append("性能良好,无需优化")
for i, rec in enumerate(recommendations, 1):
print(f" {i}. {rec}")
# 使用性能分析器
# analyzer = PerformanceAnalyzer()
# analyzer.analyze_collection("test_collection")
---
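With only 10 samples per setting, the P99 value computed above is simply the slowest sample. A nearest-rank percentile makes that explicit and behaves sensibly once more samples are collected. This is a small standard-library sketch; `percentile` is a hypothetical helper, and nearest-rank is one of several accepted percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it. p must be in (0, 100].
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
    print("P50:", percentile(latencies_ms, 50))
    print("P99:", percentile(latencies_ms, 99))  # equals the maximum for 10 samples
```

For P99 to differ from the maximum, at least 100 samples are needed, so the inner test loop would have to run considerably more than 10 iterations per parameter set.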
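b.Performance optimization
a.Description
Apply optimizations based on the analysis results: tune the index type and its parameters, adjust query parameters, add hardware resources, partition data and balance load, refine the data model, add caching, and adjust system configuration. Then verify the effect of each change.
b.Code example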
---
# 性能优化实施
import time
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType
class PerformanceOptimizer:
def __init__(self, host="localhost", port="19530"):
connections.connect(host=host, port=port)
def optimize_index(self, collection_name, field_name):
"""优化索引"""
collection = Collection(collection_name)
print(f"优化索引: {collection_name}.{field_name}\n")
# 1. 删除旧索引
print("1. 删除旧索引...")
collection.release()
collection.drop_index(field_name)
# 2. 创建优化后的索引
print("2. 创建优化索引...")
# 根据数据量选择索引类型
num_entities = collection.num_entities
if num_entities < 100000:
# 小数据量使用FLAT
index_params = {
"index_type": "FLAT",
"metric_type": "L2"
}
elif num_entities < 1000000:
# 中等数据量使用IVF_FLAT
index_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
else:
# 大数据量使用HNSW
index_params = {
"index_type": "HNSW",
"metric_type": "L2",
"params": {
"M": 16,
"efConstruction": 256
}
}
collection.create_index(
field_name=field_name,
index_params=index_params
)
print(f" 索引类型: {index_params['index_type']}")
print(f" 索引参数: {index_params.get('params', {})}")
# 3. 加载索引
print("\n3. 加载索引...")
collection.load()
print("索引优化完成!")
def optimize_query_params(self, collection_name):
"""优化查询参数"""
collection = Collection(collection_name)
collection.load()
print(f"优化查询参数: {collection_name}\n")
# 测试不同参数组合
import numpy as np
query_vector = [[np.random.random() for _ in range(128)]]
best_params = None
best_score = 0
for nprobe in [8, 16, 32, 64]:
latencies = []
for _ in range(5):
start = time.time()
results = collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": nprobe}},
limit=10
)
latency = (time.time() - start) * 1000
latencies.append(latency)
avg_latency = sum(latencies) / len(latencies)
# 计算得分(延迟越低越好)
score = 1000 / avg_latency
print(f"nprobe={nprobe}: 平均延迟={avg_latency:.2f}ms, 得分={score:.2f}")
if score > best_score:
best_score = score
best_params = {"nprobe": nprobe}
print(f"\n推荐参数: {best_params}")
return best_params
def implement_partitioning(self, collection_name, partition_field):
"""实现数据分区"""
print(f"实现数据分区: {collection_name}\n")
collection = Collection(collection_name)
# 创建分区
partitions = ["partition_2023", "partition_2024", "partition_2025"]
for partition_name in partitions:
if not collection.has_partition(partition_name):
collection.create_partition(partition_name)
print(f"创建分区: {partition_name}")
print("\n分区创建完成!")
print("使用方法:")
print(" # 插入到指定分区")
print(" collection.insert(data, partition_name='partition_2024')")
print(" # 查询指定分区")
print(" collection.search(data, partition_names=['partition_2024'])")
# 使用优化器
# optimizer = PerformanceOptimizer()
# optimizer.optimize_index("test_collection", "embedding")
# optimizer.optimize_query_params("test_collection")
# optimizer.implement_partitioning("test_collection", "year")
print("性能优化工具使用示例已生成")
---
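Scoring search parameters by latency alone, as `optimize_query_params` does, will always favour the smallest `nprobe`, because a smaller `nprobe` scans fewer clusters. A fairer selection adds a recall constraint: measure recall against exact results, then pick the fastest setting that still meets the target. The sketch below assumes the ground-truth IDs come from an exact search (for example a FLAT index); the function names are hypothetical and the numbers in the demo are placeholders, not measurements.

```python
def recall_at_k(retrieved_ids, true_ids):
    """Fraction of the true nearest neighbours that were retrieved."""
    truth = set(true_ids)
    if not truth:
        raise ValueError("true_ids must not be empty")
    return len(set(retrieved_ids) & truth) / len(truth)


def pick_nprobe(measurements, min_recall=0.95):
    """Return the nprobe with the lowest latency among those meeting min_recall.

    measurements: list of dicts with keys 'nprobe', 'avg_latency_ms', 'recall'.
    Returns None when no setting reaches the recall target.
    """
    good_enough = [m for m in measurements if m["recall"] >= min_recall]
    if not good_enough:
        return None
    return min(good_enough, key=lambda m: m["avg_latency_ms"])["nprobe"]


if __name__ == "__main__":
    # Placeholder values for illustration only
    measurements = [
        {"nprobe": 8, "avg_latency_ms": 2.1, "recall": 0.88},
        {"nprobe": 16, "avg_latency_ms": 3.0, "recall": 0.96},
        {"nprobe": 32, "avg_latency_ms": 5.2, "recall": 0.99},
    ]
    print("recommended nprobe:", pick_nprobe(measurements))
```

The same selection works for other index parameters, such as `ef` for HNSW, as long as each candidate is measured for both latency and recall.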
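12 Best practices
12.1 Data modeling
01.Schema design
a.Field planning
a.Description
A well-designed schema is the foundation for using Milvus efficiently. Plan field types and counts and avoid redundancy. Give the vector field the dimension that the embedding model produces. Use scalar fields for filtering and metadata. The primary key must be unique. Design the schema around the expected query patterns, leave room for extension, and keep the schema minimal.
b.Code example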
---
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection
# 1. 基础Schema设计
def create_basic_schema():
"""创建基础Schema"""
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
FieldSchema(name="timestamp", dtype=DataType.INT64),
FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100)
]
schema = CollectionSchema(
fields=fields,
description="基础文档检索Schema"
)
return schema
# 2. 多向量Schema设计
def create_multimodal_schema():
"""创建多模态Schema"""
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
# 文本嵌入
FieldSchema(name="text_embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
# 图像嵌入
FieldSchema(name="image_embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
# 元数据
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=1000),
FieldSchema(name="tags", dtype=DataType.VARCHAR, max_length=500),
FieldSchema(name="created_at", dtype=DataType.INT64)
]
schema = CollectionSchema(
fields=fields,
description="多模态检索Schema"
)
return schema
# 3. 电商推荐Schema
def create_ecommerce_schema():
"""创建电商推荐Schema"""
fields = [
FieldSchema(name="product_id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="product_embedding", dtype=DataType.FLOAT_VECTOR, dim=256),
FieldSchema(name="product_name", dtype=DataType.VARCHAR, max_length=200),
FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100),
FieldSchema(name="price", dtype=DataType.FLOAT),
FieldSchema(name="rating", dtype=DataType.FLOAT),
FieldSchema(name="stock", dtype=DataType.INT64),
FieldSchema(name="brand", dtype=DataType.VARCHAR, max_length=100),
FieldSchema(name="is_active", dtype=DataType.BOOL)
]
schema = CollectionSchema(
fields=fields,
description="电商商品推荐Schema"
)
return schema
# 4. Schema设计最佳实践
schema_best_practices = {
"字段数量": "保持在20个以内,避免过多字段影响性能",
"向量维度": "根据模型选择,常见768/512/256/128",
"VARCHAR长度": "根据实际需求设置,不要过大",
"主键设计": "使用auto_id或业务ID,确保唯一性",
"索引字段": "常用于过滤的字段建立标量索引",
"数据类型": "选择合适的数据类型,节省存储空间"
}
print("Schema设计最佳实践:")
for key, value in schema_best_practices.items():
print(f" {key}: {value}")
# 5. Schema验证
def validate_schema(schema):
"""验证Schema设计"""
issues = []
# 检查主键
primary_fields = [f for f in schema.fields if f.is_primary]
if len(primary_fields) == 0:
issues.append("缺少主键字段")
elif len(primary_fields) > 1:
issues.append("存在多个主键字段")
# 检查向量字段
vector_fields = [f for f in schema.fields if f.dtype in [DataType.FLOAT_VECTOR, DataType.BINARY_VECTOR]]
if len(vector_fields) == 0:
issues.append("缺少向量字段")
# 检查字段数量
if len(schema.fields) > 20:
issues.append(f"字段数量过多({len(schema.fields)}),建议少于20个")
# 检查VARCHAR长度
for field in schema.fields:
if field.dtype == DataType.VARCHAR:
if field.params.get("max_length", 0) > 65535:
issues.append(f"字段{field.name}的max_length过大")
if issues:
print("Schema验证失败:")
for issue in issues:
print(f" - {issue}")
return False
else:
print("Schema验证通过")
return True
# 测试Schema
schema = create_basic_schema()
validate_schema(schema)
---
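b.Partitioning strategy
a.Description
Well-chosen partitions improve query performance. Partition by time, category, region, or a similar dimension; each partition is managed and queried independently. Keep the partition count within 4096 (the actual ceiling depends on the Milvus version and configuration) and avoid large numbers of small partitions. Partitions can be created and dropped dynamically, a search can name specific partitions to shrink the scan range, and dropping old partitions gives a simple form of data lifecycle management.
b.Code example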
---
from pymilvus import Collection, connections
from datetime import datetime
connections.connect(host="localhost", port="19530")
# 1. 按时间分区
def create_time_based_partitions(collection_name):
"""按时间创建分区"""
collection = Collection(collection_name)
# 按年份分区
years = ["2023", "2024", "2025"]
for year in years:
partition_name = f"year_{year}"
if not collection.has_partition(partition_name):
collection.create_partition(partition_name)
print(f"创建分区: {partition_name}")
# 按月份分区(更细粒度)
months = ["202401", "202402", "202403"]
for month in months:
partition_name = f"month_{month}"
if not collection.has_partition(partition_name):
collection.create_partition(partition_name)
print(f"创建分区: {partition_name}")
# 2. 按类别分区
def create_category_partitions(collection_name, categories):
"""按类别创建分区"""
collection = Collection(collection_name)
for category in categories:
partition_name = f"cat_{category}"
if not collection.has_partition(partition_name):
collection.create_partition(partition_name)
print(f"创建分区: {partition_name}")
# 使用示例
# create_category_partitions("products", ["electronics", "clothing", "books"])
# 3. 分区数据插入
def insert_with_partition(collection, data, partition_key_field, partition_mapping):
"""根据字段值插入到对应分区"""
# 按分区键分组数据
partition_data = {}
for i, value in enumerate(data[partition_key_field]):
partition_name = partition_mapping.get(value, "_default")
if partition_name not in partition_data:
partition_data[partition_name] = {field: [] for field in data.keys()}
for field, values in data.items():
partition_data[partition_name][field].append(values[i])
# 插入到各分区
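# Pass the data as a list of columns in schema order; recent pymilvus versions
# treat a dict argument as a single row rather than as a column mapping
for partition_name, pdata in partition_data.items():
columns = [pdata[f.name] for f in collection.schema.fields if f.name in pdata]
collection.insert(columns, partition_name=partition_name)
print(f"Inserted {len(pdata[partition_key_field])} rows into partition: {partition_name}")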
# 4. 分区查询
def search_with_partitions(collection, query_vector, partition_names=None):
"""在指定分区中查询"""
results = collection.search(
data=[query_vector],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
partition_names=partition_names # 指定分区
)
return results
# 查询示例
# import numpy as np
# query_vec = [np.random.random() for _ in range(128)]
# results = search_with_partitions(collection, query_vec, partition_names=["year_2024"])
# 5. 分区管理
class PartitionManager:
def __init__(self, collection):
self.collection = collection
def list_partitions(self):
"""列出所有分区"""
partitions = self.collection.partitions
print(f"分区数量: {len(partitions)}")
for partition in partitions:
print(f" {partition.name}: {partition.num_entities} entities")
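def drop_old_partitions(self, keep_count=12, prefix="month_"):
"""Drop the oldest partitions with the given prefix, keeping the newest keep_count"""
# Sort only partitions that share one naming scheme: a plain sort over every name
# would mix year_*, month_* and cat_* partitions and could drop the wrong ones
partitions = sorted(
[p for p in self.collection.partitions if p.name.startswith(prefix)],
key=lambda p: p.name
)
if len(partitions) > keep_count:
to_drop = partitions[:-keep_count]
for partition in to_drop:
partition.release()  # a loaded partition must be released before it is dropped
self.collection.drop_partition(partition.name)
print(f"Dropped partition: {partition.name}")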
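def merge_partitions(self, source_partitions, target_partition):
"""Merge several partitions into one"""
# Read all rows from the source partitions. query() does not accept an empty
# expression without a limit, so filter on the primary key instead
# (assumes a non-negative INT64 primary key; large partitions should be read in pages)
pk_name = self.collection.schema.primary_field.name
all_data = []
for partition_name in source_partitions:
data = self.collection.query(
expr=f"{pk_name} >= 0",
partition_names=[partition_name],
output_fields=["*"]
)
all_data.extend(data)
# Insert into the target partition
if not self.collection.has_partition(target_partition):
self.collection.create_partition(target_partition)
# Convert row dicts to a list of columns in schema order;
# an auto-generated primary key must not be re-inserted
columns = []
for field in self.collection.schema.fields:
if field.is_primary and field.auto_id:
continue
columns.append([item[field.name] for item in all_data])
self.collection.insert(columns, partition_name=target_partition)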
# 删除源分区
for partition_name in source_partitions:
self.collection.drop_partition(partition_name)
print(f"合并{len(source_partitions)}个分区到: {target_partition}")
# 使用分区管理器
# collection = Collection("test_collection")
# manager = PartitionManager(collection)
# manager.list_partitions()
# manager.drop_old_partitions(keep_count=12)
# 6. 分区策略建议
partition_strategies = {
"时间分区": {
"适用场景": "日志、事件、时序数据",
"优点": "便于数据归档和清理",
"缺点": "可能导致热点分区",
"建议": "按月或季度分区,避免过细粒度"
},
"类别分区": {
"适用场景": "电商、内容分类",
"优点": "查询时可精确定位分区",
"缺点": "类别变化时需要调整",
"建议": "使用稳定的一级分类"
},
"哈希分区": {
"适用场景": "数据均匀分布",
"优点": "负载均衡",
"缺点": "无法按业务逻辑查询",
"建议": "结合其他策略使用"
}
}
print("\n分区策略建议:")
for strategy, info in partition_strategies.items():
print(f"\n{strategy}:")
for key, value in info.items():
print(f" {key}: {value}")
---
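Searching only the partitions that overlap a time window needs a mapping from dates to partition names. The sketch below assumes the `month_YYYYMM` naming used in the example above; the helper name `month_partition_names` is hypothetical and not part of pymilvus. Its result could be passed as `partition_names` to a search call.

```python
from datetime import date

def month_partition_names(start, end, prefix="month_"):
    """Return one partition name per calendar month touched by [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    # Walk month by month until the month of `end` has been emitted
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

names = month_partition_names(date(2023, 11, 15), date(2024, 2, 3))
print(names)
# ['month_202311', 'month_202312', 'month_202401', 'month_202402']
```

Names produced this way should still be checked with `has_partition` before use, since a month without data may have no partition.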
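02.Data quality
a.Data cleaning
a.Description
High-quality data is a precondition for accurate retrieval. Remove duplicates and outliers, normalize the vector format, check that every vector has the same dimension, handle missing and empty values, and filter out low-quality records. Wrap these steps in a repeatable validation pipeline and record data-quality metrics.
b.Code example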
---
import numpy as np
from pymilvus import DataType  # needed by validate_data below
class DataCleaner:
def __init__(self):
self.stats = {
"total": 0,
"duplicates": 0,
"invalid_vectors": 0,
"missing_fields": 0,
"cleaned": 0
}
def clean_vectors(self, vectors, dim=768):
"""清洗向量数据"""
cleaned = []
for vec in vectors:
# 检查维度
if len(vec) != dim:
self.stats["invalid_vectors"] += 1
continue
# 检查NaN和Inf
if np.isnan(vec).any() or np.isinf(vec).any():
self.stats["invalid_vectors"] += 1
continue
# 标准化
vec = np.array(vec, dtype=np.float32)
# L2归一化
norm = np.linalg.norm(vec)
if norm > 0:
vec = vec / norm
cleaned.append(vec.tolist())
self.stats["cleaned"] += 1
return cleaned
def remove_duplicates(self, data, id_field="id"):
"""去除重复数据"""
seen_ids = set()
cleaned_data = {field: [] for field in data.keys()}
for i in range(len(data[id_field])):
item_id = data[id_field][i]
if item_id in seen_ids:
self.stats["duplicates"] += 1
continue
seen_ids.add(item_id)
for field, values in data.items():
cleaned_data[field].append(values[i])
return cleaned_data
def validate_data(self, data, schema):
"""验证数据完整性"""
self.stats["total"] = len(data[list(data.keys())[0]])
# 检查必填字段
for field in schema.fields:
if field.name not in data:
print(f"缺少字段: {field.name}")
return False
# 检查数据长度一致性
if len(data[field.name]) != self.stats["total"]:
print(f"字段{field.name}数据长度不一致")
return False
# 检查空值
if field.dtype == DataType.VARCHAR:
empty_count = sum(1 for v in data[field.name] if not v)
if empty_count > 0:
print(f"字段{field.name}有{empty_count}个空值")
self.stats["missing_fields"] += empty_count
return True
def get_stats(self):
"""获取清洗统计"""
return self.stats
# 使用数据清洗器
cleaner = DataCleaner()
# 示例数据
raw_data = {
"id": [1, 2, 2, 3, 4], # 包含重复
"embedding": [
[0.1] * 768,
[0.2] * 768,
[0.2] * 768,
[float('nan')] * 768, # 包含NaN
[0.4] * 768
],
"text": ["doc1", "doc2", "doc2", "", "doc4"]
}
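# Record the input size (validate_data normally sets this, but it is not called here)
cleaner.stats["total"] = len(raw_data["id"])
# Deduplicate first, while every field still has the same length
cleaned_data = cleaner.remove_duplicates(raw_data)
# Clean vectors one row at a time so that an invalid vector removes its whole row;
# replacing only the embedding column would leave the fields misaligned
row_vectors = [cleaner.clean_vectors([vec]) for vec in cleaned_data["embedding"]]
keep = [i for i, vecs in enumerate(row_vectors) if vecs]
cleaned_data = {field: [values[i] for i in keep] for field, values in cleaned_data.items()}
cleaned_data["embedding"] = [row_vectors[i][0] for i in keep]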
# 输出统计
stats = cleaner.get_stats()
print("数据清洗统计:")
print(f" 总数: {stats['total']}")
print(f" 重复: {stats['duplicates']}")
print(f" 无效向量: {stats['invalid_vectors']}")
print(f" 缺失字段: {stats['missing_fields']}")
print(f" 清洗后: {stats['cleaned']}")
---
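Cleaning has to keep all columns aligned: when a vector is rejected, its id and metadata must go with it. A minimal row-oriented sketch of the same steps (deduplication, dimension and NaN/Inf checks, L2 normalization) using only the standard library follows. The function name `clean_rows` and the `"id"` / `"embedding"` keys are assumptions for illustration. Unlike the class above, it also drops zero vectors, since they cannot be normalized.

```python
import math

def clean_rows(rows, dim):
    """Drop duplicate ids and invalid vectors, then L2-normalize the rest."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        vec = row["embedding"]
        if row["id"] in seen_ids:
            continue  # duplicate primary key
        if len(vec) != dim or not all(math.isfinite(x) for x in vec):
            continue  # wrong dimension, NaN or Inf
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            continue  # a zero vector cannot be normalized
        seen_ids.add(row["id"])
        # Copy the row so the input is left untouched
        cleaned.append({**row, "embedding": [x / norm for x in vec]})
    return cleaned

rows = [
    {"id": 1, "embedding": [3.0, 4.0]},
    {"id": 2, "embedding": [1.0, 0.0]},
    {"id": 2, "embedding": [1.0, 0.0]},            # duplicate id
    {"id": 3, "embedding": [float("nan"), 1.0]},   # invalid vector
]
cleaned = clean_rows(rows, dim=2)
print([r["id"] for r in cleaned])   # [1, 2]
print(cleaned[0]["embedding"])      # [0.6, 0.8]
```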
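b.Data validation
a.Description
Put a validation step in front of every insert to protect data quality. Check data formats and types, vector dimensions and value ranges, primary-key uniqueness, and the validity of scalar fields. Automate the checks, log the results and any anomalies, and produce a data-quality report.
b.Code example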
---
from pymilvus import Collection, DataType
import numpy as np
class DataValidator:
def __init__(self, schema):
self.schema = schema
self.errors = []
def validate_batch(self, data):
"""验证批量数据"""
self.errors = []
# 1. 验证字段完整性
if not self._validate_fields(data):
return False
# 2. 验证数据类型
if not self._validate_types(data):
return False
# 3. 验证向量数据
if not self._validate_vectors(data):
return False
# 4. 验证主键唯一性
if not self._validate_primary_key(data):
return False
# 5. 验证VARCHAR长度
if not self._validate_varchar_length(data):
return False
return len(self.errors) == 0
def _validate_fields(self, data):
"""验证字段完整性"""
for field in self.schema.fields:
if field.name not in data:
self.errors.append(f"缺少字段: {field.name}")
return False
# 检查数据长度一致性
lengths = [len(values) for values in data.values()]
if len(set(lengths)) > 1:
self.errors.append(f"字段数据长度不一致: {lengths}")
return False
return True
def _validate_types(self, data):
"""验证数据类型"""
for field in self.schema.fields:
values = data[field.name]
if field.dtype == DataType.INT64:
if not all(isinstance(v, (int, np.integer)) for v in values):
self.errors.append(f"字段{field.name}类型错误,期望INT64")
return False
elif field.dtype == DataType.FLOAT:
if not all(isinstance(v, (float, np.floating, int)) for v in values):
self.errors.append(f"字段{field.name}类型错误,期望FLOAT")
return False
elif field.dtype == DataType.VARCHAR:
if not all(isinstance(v, str) for v in values):
self.errors.append(f"字段{field.name}类型错误,期望VARCHAR")
return False
elif field.dtype == DataType.BOOL:
if not all(isinstance(v, bool) for v in values):
self.errors.append(f"字段{field.name}类型错误,期望BOOL")
return False
return True
def _validate_vectors(self, data):
"""验证向量数据"""
for field in self.schema.fields:
# Only float vectors are checked here: a BINARY_VECTOR is packed bytes of length dim / 8 and needs its own check
if field.dtype == DataType.FLOAT_VECTOR:
vectors = data[field.name]
expected_dim = field.params["dim"]
for i, vec in enumerate(vectors):
# 检查维度
if len(vec) != expected_dim:
self.errors.append(
f"字段{field.name}第{i}个向量维度错误: "
f"期望{expected_dim}, 实际{len(vec)}"
)
return False
# 检查NaN和Inf
vec_array = np.array(vec)
if np.isnan(vec_array).any():
self.errors.append(f"字段{field.name}第{i}个向量包含NaN")
return False
if np.isinf(vec_array).any():
self.errors.append(f"字段{field.name}第{i}个向量包含Inf")
return False
return True
def _validate_primary_key(self, data):
"""验证主键唯一性"""
for field in self.schema.fields:
if field.is_primary:
ids = data[field.name]
if len(ids) != len(set(ids)):
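seen = set()
# seen.add() returns None, so the condition is true only for ids already encountered
duplicates = {pk for pk in ids if pk in seen or seen.add(pk)}
self.errors.append(f"Primary key {field.name} has duplicate values: {duplicates}")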
return False
return True
def _validate_varchar_length(self, data):
"""验证VARCHAR长度"""
for field in self.schema.fields:
if field.dtype == DataType.VARCHAR:
max_length = field.params.get("max_length", 65535)
values = data[field.name]
for i, value in enumerate(values):
if len(value) > max_length:
self.errors.append(
f"字段{field.name}第{i}个值超长: "
f"{len(value)} > {max_length}"
)
return False
return True
def get_errors(self):
"""获取验证错误"""
return self.errors
# 使用数据验证器
from pymilvus import FieldSchema, CollectionSchema
# 创建Schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=1000)
]
schema = CollectionSchema(fields=fields)
# 验证数据
validator = DataValidator(schema)
test_data = {
"id": [1, 2, 3],
"embedding": [
[0.1] * 128,
[0.2] * 128,
[0.3] * 128
],
"text": ["doc1", "doc2", "doc3"]
}
if validator.validate_batch(test_data):
print("数据验证通过")
else:
print("数据验证失败:")
for error in validator.get_errors():
print(f" - {error}")
---
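12.2 Index selection
01.Index types
a.FLAT index
a.Description
FLAT is the simplest index type: it applies no compression or approximation. It suits small datasets (fewer than 100,000 vectors) and returns exact results, that is, 100% recall. Query time grows linearly with the number of vectors. There is no training step, so the index builds quickly, and memory use equals the size of the raw vectors. Use it where accuracy requirements are very strict, or as the baseline against which other indexes are measured.
b.Code example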
---
from pymilvus import Collection, connections
connections.connect(host="localhost", port="19530")
collection = Collection("test_collection")
# 创建FLAT索引
index_params = {
"index_type": "FLAT",
"metric_type": "L2"
}
collection.create_index(
field_name="embedding",
index_params=index_params
)
print("FLAT索引特点:")
print(" 适用场景: 小规模数据(<10万)")
print(" 召回率: 100%")
print(" 查询速度: 慢(线性扫描)")
print(" 内存占用: 高(等于原始数据)")
print(" 构建时间: 快(无需训练)")
---
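What a FLAT index computes can be written in a few lines: compare the query with every stored vector and keep the k closest. The sketch below is a pure-Python illustration of that exhaustive scan, not the Milvus implementation; `flat_search` is a hypothetical name and the distance is plain Euclidean.

```python
import heapq
import math

def flat_search(vectors, query, k):
    """Exact k-nearest-neighbour search by scanning every vector (L2 distance)."""
    scored = (
        (math.dist(vec, query), idx)   # (distance, position in the list)
        for idx, vec in enumerate(vectors)
    )
    # nsmallest keeps only k candidates in memory while scanning all n vectors
    return heapq.nsmallest(k, scored)

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
hits = flat_search(vectors, query=[0.9, 0.1], k=2)
print([idx for _, idx in hits])   # [1, 0]
```

Every query touches all n vectors, which is why latency grows linearly with the collection size, and why the result can serve as ground truth when measuring the recall of approximate indexes.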
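b.IVF index
a.Description
An IVF (Inverted File) index speeds up retrieval through clustering. It partitions the vector space into nlist clusters, and a query searches only the nprobe clusters whose centroids are closest to the query vector. It suits medium to large datasets (100,000 to 10 million vectors) and needs a training step to compute the centroids. Variants include IVF_FLAT, IVF_SQ8, and IVF_PQ, which trade accuracy against memory and speed. IVF is one of the most widely used index families.
b.Code example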
---
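# IVF_FLAT index (reuses `collection` from the FLAT example)
# A vector field holds one index at a time, so release the collection and drop the FLAT index first
collection.release()
collection.drop_index()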
ivf_flat_params = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 1024}
}
collection.create_index(
field_name="embedding",
index_params=ivf_flat_params
)
# 查询参数
search_params = {"metric_type": "L2", "params": {"nprobe": 16}}
# IVF_SQ8索引(标量量化)
ivf_sq8_params = {
"index_type": "IVF_SQ8",
"metric_type": "L2",
"params": {"nlist": 1024}
}
# IVF_PQ索引(乘积量化)
ivf_pq_params = {
"index_type": "IVF_PQ",
"metric_type": "L2",
"params": {
"nlist": 1024,
"m": 8,
"nbits": 8
}
}
print("IVF索引对比:")
print(" IVF_FLAT: 准确度高,内存占用大")
print(" IVF_SQ8: 内存占用减少75%,准确度略降")
print(" IVF_PQ: 内存占用最小,准确度进一步降低")
---
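The memory differences between the IVF variants follow from how many bytes each one stores per vector. The sketch below is a rough estimate that ignores centroids, codebooks, and ID storage; `bytes_per_vector` is a hypothetical helper, and the defaults `m=8`, `nbits=8` mirror the IVF_PQ parameters in the example above.

```python
def bytes_per_vector(dim, index_type, m=8, nbits=8):
    """Approximate bytes stored per vector, ignoring centroids and codebooks."""
    if index_type in ("FLAT", "IVF_FLAT"):
        return dim * 4            # one float32 per component
    if index_type == "IVF_SQ8":
        return dim * 1            # one 8-bit code per component
    if index_type == "IVF_PQ":
        return m * nbits // 8     # m sub-vector codes of nbits each
    raise ValueError(f"unsupported index type: {index_type}")

dim = 768
baseline = bytes_per_vector(dim, "IVF_FLAT")
for index_type in ("IVF_FLAT", "IVF_SQ8", "IVF_PQ"):
    size = bytes_per_vector(dim, index_type)
    saving = 1 - size / baseline
    print(f"{index_type}: {size} bytes/vector, saving {saving:.1%}")
# IVF_FLAT: 3072 bytes/vector, saving 0.0%
# IVF_SQ8: 768 bytes/vector, saving 75.0%
# IVF_PQ: 8 bytes/vector, saving 99.7%
```

The 75% figure for IVF_SQ8 is the step from 4 bytes to 1 byte per component. For IVF_PQ the vector dimension must be divisible by `m`.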
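02.Parameter tuning
a.The nlist parameter
a.Description
nlist is the number of cluster centroids in an IVF index; it affects both build time and query performance. At a fixed nprobe, a larger nlist gives finer clusters and faster queries but a slower build. A recommended range is sqrt(N) to 4*sqrt(N), where N is the number of vectors; common values are 128, 256, 512, 1024, and 2048. Tune it to the data size and query requirements: too large a value increases memory use, too small a value slows queries down.
b.Code example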
---
import math
def recommend_nlist(num_vectors):
"""推荐nlist参数"""
sqrt_n = int(math.sqrt(num_vectors))
recommendations = {
"保守": sqrt_n,
"推荐": 2 * sqrt_n,
"激进": 4 * sqrt_n
}
for key in recommendations:
recommendations[key] = min(max(recommendations[key], 128), 65536)
return recommendations
test_sizes = [10000, 100000, 1000000, 10000000]
print("nlist参数推荐:")
for size in test_sizes:
recs = recommend_nlist(size)
print(f"\n数据量: {size:,}")
for level, value in recs.items():
print(f" {level}: {value}")
---
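b.The nprobe parameter
a.Description
nprobe is the number of clusters searched per query; it controls the trade-off between accuracy and speed. A larger nprobe raises recall but slows the query. A recommended range is 1%-10% of nlist; common values are 8, 16, 32, and 64. It can be set per query, so it can be adjusted to business needs at run time. Determine the best value by experiment.
b.Code example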
---
def recommend_nprobe(nlist, accuracy_requirement="medium"):
"""推荐nprobe参数"""
recommendations = {
"low": max(int(nlist * 0.01), 8),
"medium": max(int(nlist * 0.05), 16),
"high": max(int(nlist * 0.10), 32)
}
return recommendations.get(accuracy_requirement, 16)
nlist_values = [128, 512, 1024, 2048]
print("nprobe参数推荐:")
for nlist in nlist_values:
print(f"\nnlist={nlist}:")
for level in ["low", "medium", "high"]:
nprobe = recommend_nprobe(nlist, level)
print(f" {level}: {nprobe}")
import time
import numpy as np
def benchmark_nprobe(collection, nprobe_values):
"""测试不同nprobe的性能"""
query_vector = [[np.random.random() for _ in range(128)]]
results = {}
for nprobe in nprobe_values:
latencies = []
for _ in range(10):
start = time.perf_counter()  # monotonic, high-resolution timer
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": nprobe}},
limit=10
)
latency = (time.perf_counter() - start) * 1000
latencies.append(latency)
# With only 10 samples the p99 index resolves to the slowest run; raise the repeat count for a meaningful tail
results[nprobe] = {
"avg": sum(latencies) / len(latencies),
"p99": sorted(latencies)[int(len(latencies) * 0.99)]
}
return results
---
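Latency alone does not settle the choice of nprobe; the experiment also needs an accuracy measure. A common one is recall@k: the share of the exact top-k results (for example from a FLAT search) that the approximate search also returns. The sketch below pairs it with a rough cost model that assumes equally sized clusters. Both function names are hypothetical and the ID lists are made-up sample data.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def expected_scanned(num_vectors, nlist, nprobe):
    """Vectors compared per query, assuming equally sized clusters."""
    return num_vectors * min(nprobe, nlist) / nlist

exact = [11, 42, 7, 93, 5]     # ground truth from an exhaustive search
approx = [42, 11, 8, 5, 60]    # result of an IVF search at some nprobe
print(recall_at_k(approx, exact))              # 0.6
print(expected_scanned(1_000_000, 1024, 16))   # 15625.0
```

Sweeping nprobe and recording both recall and latency for each value gives the curve from which a value can be picked.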
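12.3 Query optimization
01.Query strategies
a.Batch queries
a.Description
Batching queries can raise throughput considerably. Sending several vectors in one search call reduces network overhead, and Milvus processes the vectors of a batch in parallel. Batching suits offline and bulk workloads. Gains of 10-100x are possible, but the actual figure depends on the deployment and workload, so measure it on your own system. Batch size trades throughput against per-request latency; 10-1000 is a reasonable range to start from. Issuing batches asynchronously can improve throughput further.
b.Code example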
---
from pymilvus import Collection, connections
import numpy as np
import time
connections.connect(host="localhost", port="19530")
collection = Collection("test_collection")
collection.load()
# 1. 单个查询基准测试
def single_query_benchmark(collection, num_queries=100):
"""单个查询基准测试"""
start = time.time()
for _ in range(num_queries):
query_vector = [[np.random.random() for _ in range(128)]]
collection.search(
data=query_vector,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10
)
elapsed = time.time() - start
qps = num_queries / elapsed
print(f"单个查询:")
print(f" 总耗时: {elapsed:.2f}s")
print(f" QPS: {qps:.2f}")
return qps
# 2. 批量查询基准测试
def batch_query_benchmark(collection, num_queries=100, batch_size=10):
"""批量查询基准测试"""
start = time.time()
for i in range(0, num_queries, batch_size):
batch_vectors = [
[np.random.random() for _ in range(128)]
for _ in range(min(batch_size, num_queries - i))
]
collection.search(
data=batch_vectors,
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10
)
elapsed = time.time() - start
qps = num_queries / elapsed
print(f"\n批量查询(batch_size={batch_size}):")
print(f" 总耗时: {elapsed:.2f}s")
print(f" QPS: {qps:.2f}")
return qps
# 3. 对比测试
print("查询性能对比:\n")
single_qps = single_query_benchmark(collection, 100)
for batch_size in [10, 50, 100]:
batch_qps = batch_query_benchmark(collection, 100, batch_size)
speedup = batch_qps / single_qps
print(f" 加速比: {speedup:.2f}x")
---
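When the query vectors already sit in a list, a small generator is enough to cut them into batches of bounded size, each of which can be passed as the `data` argument of one search call. This is a generic sketch; `iter_batches` is a hypothetical helper name.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

queries = [[float(i), float(i + 1)] for i in range(7)]
sizes = [len(batch) for batch in iter_batches(queries, batch_size=3)]
print(sizes)   # [3, 3, 1]
```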
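b.Filter optimization
a.Description
Well-designed filter conditions make queries more efficient. Filtering before the vector search shrinks the search space. Build scalar indexes on the filtered fields, keep filter expressions simple, and prefer equality and range conditions. When combining several conditions, pay attention to their order. Where a filter value maps cleanly onto partitions, searching the partition instead of filtering is faster.
b.Code example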
---
import numpy as np
# 1. Basic filtering (reuses `collection` from the batch-query example)
def search_with_filter(collection, query_vector, filter_expr):
"""带过滤的查询"""
results = collection.search(
data=[query_vector],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
expr=filter_expr
)
return results
# 等值过滤
results = search_with_filter(
collection,
[np.random.random() for _ in range(128)],
'category == "electronics"'
)
# 范围过滤
results = search_with_filter(
collection,
[np.random.random() for _ in range(128)],
'price >= 100 and price <= 500'
)
# 2. Scalar indexes
# STL_SORT supports numeric fields only; a VARCHAR field needs a string index such as Trie
collection.create_index(
field_name="category",
index_params={"index_type": "Trie"}
)
collection.create_index(
field_name="price",
index_params={"index_type": "STL_SORT"}
)
# 3. 分区代替过滤
categories = ["electronics", "clothing", "books"]
for cat in categories:
if not collection.has_partition(f"cat_{cat}"):
collection.create_partition(f"cat_{cat}")
results = collection.search(
data=[[np.random.random() for _ in range(128)]],
anns_field="embedding",
param={"metric_type": "L2", "params": {"nprobe": 16}},
limit=10,
partition_names=["cat_electronics"]
)
print("过滤优化建议:")
print(" 1. 使用标量索引加速过滤")
print(" 2. 优化过滤条件顺序")
print(" 3. 使用分区代替过滤")
print(" 4. 避免复杂的表达式")
---
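02.Caching strategies
a.Result cache
a.Description
Caching the results of popular queries improves response time. It suits workloads with a high query repetition rate. Use Redis or an in-process cache, set a sensible expiry time, and define warm-up and refresh policies. Monitor the hit rate and balance it against cache size. A multi-level cache can improve performance further.
b.Code example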
---
import redis
import json
import hashlib
class QueryCache:
def __init__(self, redis_host="localhost", redis_port=6379, ttl=3600):
self.redis_client = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)
self.ttl = ttl
self.stats = {"hits": 0, "misses": 0}
def _generate_key(self, query_vector, params):
"""生成缓存键"""
data = {
"vector": query_vector,
"params": params
}
data_str = json.dumps(data, sort_keys=True)
key = hashlib.md5(data_str.encode()).hexdigest()
return f"milvus:query:{key}"
def get(self, query_vector, params):
"""获取缓存结果"""
key = self._generate_key(query_vector, params)
cached = self.redis_client.get(key)
if cached:
self.stats["hits"] += 1
return json.loads(cached)
else:
self.stats["misses"] += 1
return None
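def set(self, query_vector, params, results):
"""Store results in the cache and return them in their cached (plain dict) form"""
key = self._generate_key(query_vector, params)
# Keep only JSON-serializable values; the Hit entity object cannot be passed to json.dumps
results_data = [
{"id": r.id, "distance": r.distance}
for r in results[0]
]
self.redis_client.setex(
key,
self.ttl,
json.dumps(results_data)
)
return results_data
def search_with_cache(self, collection, query_vector, params):
"""Search through the cache; hits and misses return the same list-of-dicts shape"""
cached_results = self.get(query_vector, params)
# Compare with None so that a cached empty result still counts as a hit
if cached_results is not None:
return cached_results
results = collection.search(
data=[query_vector],
anns_field="embedding",
param=params,
limit=10
)
return self.set(query_vector, params, results)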
def get_stats(self):
"""获取缓存统计"""
total = self.stats["hits"] + self.stats["misses"]
hit_rate = self.stats["hits"] / total if total > 0 else 0
return {
"hits": self.stats["hits"],
"misses": self.stats["misses"],
"hit_rate": hit_rate
}
cache = QueryCache(ttl=3600)
---
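The keying and expiry logic can be tried without a Redis server. The sketch below is an in-memory stand-in and not a replacement for the class above. `TTLQueryCache`, its `decimals` parameter, and the injectable `clock` are assumptions made for illustration. Rounding the vector components before hashing is a design choice: it makes vectors that differ only by floating-point noise share one cache entry, at the cost of treating vectors closer than the rounding step as identical.

```python
import hashlib
import json
import time

class TTLQueryCache:
    """In-memory stand-in for a Redis result cache with per-entry expiry."""

    def __init__(self, ttl_seconds=3600, decimals=6, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.decimals = decimals
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}

    def make_key(self, query_vector, params):
        # Round components so float noise does not produce different keys
        rounded = [round(float(x), self.decimals) for x in query_vector]
        payload = json.dumps({"vector": rounded, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, query_vector, params):
        key = self.make_key(query_vector, params)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]    # lazily evict the expired entry
            return None
        return value

    def set(self, query_vector, params, value):
        key = self.make_key(query_vector, params)
        self._store[key] = (self.clock() + self.ttl, value)

now = [0.0]                          # fake clock driven by the example
demo_cache = TTLQueryCache(ttl_seconds=10, clock=lambda: now[0])
params = {"metric_type": "L2", "params": {"nprobe": 16}}
demo_cache.set([0.1, 0.2], params, [{"id": 1, "distance": 0.5}])
print(demo_cache.get([0.1000000001, 0.2], params))   # hit: same key after rounding
now[0] = 11.0
print(demo_cache.get([0.1, 0.2], params))            # None: the entry has expired
```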
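b.Vector cache
a.Description
Caching frequently used vectors cuts load time. Keep hot vectors in memory, manage the cache with an LRU policy, and preload commonly used data as a warm-up step. Monitor cache usage and balance cache size against performance.
b.Code example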
---
from collections import OrderedDict
import numpy as np
class VectorCache:
def __init__(self, max_size=10000):
self.cache = OrderedDict()
self.max_size = max_size
self.stats = {"hits": 0, "misses": 0}
def get(self, vector_id):
"""获取向量"""
if vector_id in self.cache:
self.cache.move_to_end(vector_id)
self.stats["hits"] += 1
return self.cache[vector_id]
else:
self.stats["misses"] += 1
return None
def put(self, vector_id, vector):
"""存入向量"""
if vector_id in self.cache:
self.cache.move_to_end(vector_id)
else:
if len(self.cache) >= self.max_size:
self.cache.popitem(last=False)
self.cache[vector_id] = vector
def batch_put(self, vectors_dict):
"""批量存入"""
for vid, vec in vectors_dict.items():
self.put(vid, vec)
def preload(self, collection, vector_ids):
"""预加载向量"""
results = collection.query(
expr=f"id in {list(vector_ids)}",  # list() so a tuple or set still renders as [1, 2, 3]
output_fields=["id", "embedding"]
)
for result in results:
self.put(result["id"], result["embedding"])
print(f"预加载{len(results)}个向量到缓存")
def get_stats(self):
"""获取统计信息"""
total = self.stats["hits"] + self.stats["misses"]
hit_rate = self.stats["hits"] / total if total > 0 else 0
return {
"size": len(self.cache),
"max_size": self.max_size,
"hits": self.stats["hits"],
"misses": self.stats["misses"],
"hit_rate": hit_rate
}
vector_cache = VectorCache(max_size=10000)
---
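12.4 Production deployment
01.Deployment architecture
a.Standalone deployment
a.Description
A standalone deployment suits development, testing, and small applications. All components run on one server and can be brought up quickly with Docker Compose. Resource requirements start at 8 cores and 16 GB of memory, which supports vector counts in the millions. Deployment is simple and maintenance cost is low, but there is no high availability and no horizontal scaling. It fits proofs of concept and small projects.
b.Code example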
---
# Docker Compose单机部署配置
print("单机部署步骤:")
print("1. 创建docker-compose.yml文件")
print("2. 配置etcd、minio、milvus服务")
print("3. 执行: docker-compose up -d")
print("4. 验证: docker-compose ps")
print("5. 查看日志: docker-compose logs -f")
# 资源需求
resource_requirements = {
"CPU": "8核以上",
"内存": "16GB以上",
"存储": "SSD 100GB以上",
"网络": "千兆网卡",
"适用规模": "< 500万向量"
}
print("\n资源需求:")
for key, value in resource_requirements.items():
print(f" {key}: {value}")
# 单机部署优缺点
pros_cons = {
"优点": [
"部署简单快速",
"维护成本低",
"适合开发测试",
"无需复杂配置"
],
"缺点": [
"不支持高可用",
"无法水平扩展",
"性能受限于单机",
"存在单点故障"
]
}
print("\n优缺点分析:")
for category, items in pros_cons.items():
print(f"{category}:")
for item in items:
print(f" - {item}")
---
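b.Cluster deployment
a.Description
A cluster deployment suits production environments and large applications. Components are deployed in a distributed fashion and scale horizontally, with Kubernetes handling orchestration. It supports high availability and failover and can scale to billions of vectors. It requires an experienced operations team and fits enterprise applications.
b.Code example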
---
# Kubernetes集群部署
print("Kubernetes集群部署步骤:")
print("1. 添加Milvus Helm仓库")
print(" helm repo add milvus https://milvus-io.github.io/milvus-helm/")
print("2. 创建命名空间")
print(" kubectl create namespace milvus")
print("3. 准备values.yaml配置文件")
print("4. 安装Milvus")
print(" helm install milvus milvus/milvus -n milvus -f values.yaml")
print("5. 验证部署")
print(" kubectl get pods -n milvus")
# 集群组件说明
cluster_components = {
"Proxy": "接收客户端请求,路由到相应节点",
"Query Node": "执行向量检索,可水平扩展",
"Data Node": "处理数据写入和持久化",
"Index Node": "构建和管理索引",
"Root Coord": "集群协调和元数据管理",
"Query Coord": "查询任务调度和负载均衡",
"Data Coord": "数据分片和副本管理",
"Index Coord": "索引构建任务调度"
}
print("\n集群组件:")
for component, desc in cluster_components.items():
print(f" {component}: {desc}")
# 集群配置建议
cluster_config = {
"Query Node": {
"副本数": "2-4",
"CPU": "4核/节点",
"内存": "8GB/节点"
},
"Data Node": {
"副本数": "2-3",
"CPU": "2核/节点",
"内存": "4GB/节点"
},
"Index Node": {
"副本数": "1-2",
"CPU": "4核/节点",
"内存": "8GB/节点"
},
"Proxy": {
"副本数": "2-3",
"CPU": "2核/节点",
"内存": "4GB/节点"
}
}
print("\n集群配置建议:")
for component, config in cluster_config.items():
print(f"{component}:")
for key, value in config.items():
print(f" {key}: {value}")
---
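02.Operations
a.Monitoring and alerting
a.Description
Set up thorough monitoring and alerting. Track service health and performance metrics, visualize them with Prometheus and Grafana, and configure alert rules and notification channels. Monitor resource usage, query performance, and error rates. Automate routine operations, and review and tune the system regularly.
b.Code example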
---
# 监控指标说明
monitoring_metrics = {
"性能指标": {
"QPS": "每秒查询数",
"查询延迟": "P50/P99延迟",
"吞吐量": "数据写入速率",
"索引构建速度": "向量/秒"
},
"资源指标": {
"CPU使用率": "各组件CPU占用",
"内存使用率": "各组件内存占用",
"磁盘使用率": "存储空间占用",
"网络流量": "入站/出站流量"
},
"业务指标": {
"向量数量": "Collection中的向量总数",
"查询成功率": "成功查询/总查询",
"错误率": "错误查询/总查询",
"缓存命中率": "缓存命中/总查询"
}
}
print("监控指标体系:")
for category, metrics in monitoring_metrics.items():
print(f"\n{category}:")
for metric, desc in metrics.items():
print(f" {metric}: {desc}")
# 告警规则
alert_rules = [
{
"名称": "查询延迟过高",
"条件": "P99延迟 > 100ms",
"级别": "Warning",
"持续时间": "5分钟"
},
{
"名称": "错误率过高",
"条件": "错误率 > 5%",
"级别": "Critical",
"持续时间": "5分钟"
},
{
"名称": "内存使用率过高",
"条件": "内存使用率 > 90%",
"级别": "Warning",
"持续时间": "5分钟"
},
{
"名称": "服务不可用",
"条件": "服务健康检查失败",
"级别": "Critical",
"持续时间": "1分钟"
}
]
print("\n告警规则:")
for rule in alert_rules:
print(f"\n{rule['名称']}:")
print(f" 条件: {rule['条件']}")
print(f" 级别: {rule['级别']}")
print(f" 持续时间: {rule['持续时间']}")
# Grafana仪表板
dashboard_panels = [
"QPS趋势图",
"查询延迟分布",
"CPU使用率",
"内存使用率",
"磁盘IO",
"网络流量",
"错误率",
"向量数量"
]
print("\nGrafana仪表板面板:")
for i, panel in enumerate(dashboard_panels, 1):
print(f" {i}. {panel}")
---
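Two pieces of the alert rules above are easy to get subtly wrong: the percentile itself and the "sustained for N minutes" condition. The sketch below uses the nearest-rank percentile definition and treats the duration as a count of consecutive breaching observations. In a real setup Prometheus would evaluate this. The function names and sample numbers are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def should_alert(window, threshold, min_breaches):
    """Fire only when the last `min_breaches` observations all exceed the threshold."""
    recent = window[-min_breaches:]
    return len(recent) == min_breaches and all(v > threshold for v in recent)

latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 14, 15]
print(percentile(latencies_ms, 50))   # 14
print(percentile(latencies_ms, 99))   # 250

# One P99 value per minute; the rule is "P99 > 100 ms for 5 minutes"
p99_per_minute = [80, 120, 130, 125, 140, 150]
print(should_alert(p99_per_minute, threshold=100, min_breaches=5))   # True
```

Requiring consecutive breaches keeps a single slow minute from paging anyone, which is the purpose of the duration field in each rule.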
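b.Capacity planning
a.Description
Plan resource capacity carefully to keep the system stable. Estimate the data volume and its growth trend, then compute storage, memory, and CPU requirements. Reserve 30%-50% headroom and account for peak load and traffic bursts. Define a scale-out strategy and timeline, monitor resource usage trends, and reassess regularly.
b.Code example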
---
# 容量规划计算器
class CapacityPlanner:
def __init__(self):
self.index_overhead = 1.2
self.redundancy = 1.5
def calculate_storage(self, num_vectors, vector_dim, dtype="float32"):
"""计算存储需求"""
bytes_per_element = {
"float32": 4,
"float16": 2,
"int8": 1
}
vector_size = num_vectors * vector_dim * bytes_per_element[dtype]
total_size = vector_size * self.index_overhead
required_size = total_size * self.redundancy
return {
"vector_size_gb": vector_size / (1024**3),
"with_index_gb": total_size / (1024**3),
"required_gb": required_size / (1024**3)
}
def calculate_memory(self, num_vectors, vector_dim, index_type="IVF_FLAT"):
"""计算内存需求"""
vector_memory = num_vectors * vector_dim * 4
index_overhead = {
"FLAT": 1.0,
"IVF_FLAT": 1.1,
"IVF_SQ8": 0.35,
"IVF_PQ": 0.15,
"HNSW": 1.5
}
total_memory = vector_memory * index_overhead.get(index_type, 1.0)
required_memory = total_memory * 1.5
return {
"vector_memory_gb": vector_memory / (1024**3),
"total_memory_gb": total_memory / (1024**3),
"required_gb": required_memory / (1024**3)
}
def calculate_qps_capacity(self, num_query_nodes, cpu_per_node, latency_target_ms=50):
"""计算QPS容量"""
qps_per_core = 1000 / latency_target_ms
total_qps = num_query_nodes * cpu_per_node * qps_per_core
safe_qps = total_qps * 0.7
return {
"theoretical_qps": total_qps,
"safe_qps": safe_qps
}
def generate_plan(self, num_vectors, vector_dim, qps_requirement, index_type="IVF_FLAT"):
"""生成容量规划方案"""
storage = self.calculate_storage(num_vectors, vector_dim)
memory = self.calculate_memory(num_vectors, vector_dim, index_type)
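# Size the Query Node count with the same per-core model as calculate_qps_capacity,
# so the reported safe QPS actually covers the requirement
cpu_per_node = 4
safe_qps_per_node = self.calculate_qps_capacity(1, cpu_per_node)["safe_qps"]
num_query_nodes = max(2, int(-(-qps_requirement // safe_qps_per_node)))  # ceiling division
qps_capacity = self.calculate_qps_capacity(num_query_nodes, cpu_per_node=cpu_per_node)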
plan = {
"数据规模": {
"向量数量": f"{num_vectors:,}",
"向量维度": vector_dim,
"索引类型": index_type
},
"存储需求": {
"原始数据": f"{storage['vector_size_gb']:.2f} GB",
"含索引": f"{storage['with_index_gb']:.2f} GB",
"推荐容量": f"{storage['required_gb']:.2f} GB"
},
"内存需求": {
"向量数据": f"{memory['vector_memory_gb']:.2f} GB",
"含索引": f"{memory['total_memory_gb']:.2f} GB",
"推荐容量": f"{memory['required_gb']:.2f} GB"
},
"计算资源": {
"Query Node数量": num_query_nodes,
"每节点CPU": "4核",
"每节点内存": f"{memory['required_gb'] / num_query_nodes:.0f} GB"
},
"QPS容量": {
"理论QPS": f"{qps_capacity['theoretical_qps']:.0f}",
"安全QPS": f"{qps_capacity['safe_qps']:.0f}",
"需求QPS": qps_requirement
}
}
return plan
# 使用容量规划器
planner = CapacityPlanner()
# 场景1: 1000万向量,768维,1000 QPS
plan1 = planner.generate_plan(
num_vectors=10000000,
vector_dim=768,
qps_requirement=1000,
index_type="IVF_FLAT"
)
print("容量规划方案:")
import json
print(json.dumps(plan1, indent=2, ensure_ascii=False))
# 场景2: 1亿向量,512维,5000 QPS
plan2 = planner.generate_plan(
num_vectors=100000000,
vector_dim=512,
qps_requirement=5000,
index_type="HNSW"
)
print("\n大规模场景:")
print(json.dumps(plan2, indent=2, ensure_ascii=False))
# 容量规划建议
planning_tips = [
"预留30%-50%冗余空间",
"考虑数据增长趋势",
"评估峰值负载需求",
"制定扩容策略",
"定期审查和调整",
"监控资源使用趋势",
"建立容量告警机制"
]
print("\n容量规划建议:")
for i, tip in enumerate(planning_tips, 1):
print(f" {i}. {tip}")
---
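The memory estimate for the first scenario can be checked by hand. The sketch below repeats the arithmetic for 10 million 768-dimensional float32 vectors, using the same factors as the planner above (1.1 for IVF_FLAT overhead, 1.5 for headroom). These factors are the planner's rules of thumb, not measured values.

```python
GIB = 1024 ** 3

num_vectors = 10_000_000
dim = 768
bytes_per_float32 = 4

raw_bytes = num_vectors * dim * bytes_per_float32   # 30,720,000,000 bytes
raw_gib = raw_bytes / GIB
with_index_gib = raw_gib * 1.1     # IVF_FLAT overhead factor from the planner
required_gib = with_index_gib * 1.5   # 50% headroom

print(f"raw vectors: {raw_gib:.2f} GiB")          # raw vectors: 28.61 GiB
print(f"with index:  {with_index_gib:.2f} GiB")   # with index:  31.47 GiB
print(f"provision:   {required_gib:.2f} GiB")     # provision:   47.21 GiB
```

Note that the planner labels these values "GB" while dividing by 1024**3, so they are binary gibibytes.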