13.chroma

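1 Core Concepts

1.1 Introduction to Chroma

01.Product positioning
    a.Embedded database
        a.Description
            Chroma is an embedded vector database designed for AI applications. It can run inside a Python process or in client-server mode. Its main strength is ease of use, which makes it a good fit for LLM applications.
        b.Code example
            ---
            import chromadb

            # Embedded mode (in-memory)
            client = chromadb.Client()

            # Persistent mode: data is written to the given directory
            persistent_client = chromadb.PersistentClient(path="./chroma_db")

            # Client mode: connects to a running Chroma server
            # http_client = chromadb.HttpClient(host="localhost", port=8000)

            print("Chroma clients created")
            ---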
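    b.Developer friendly
        a.Description
            Chroma's API is small and intuitive, the documentation is thorough, and the community is active. Together these keep the learning curve short.
        b.Code example
            ---
            chroma_features = [
                "Concise Python API",
                "Automatic embedding of documents",
                "Built-in embedding functions",
                "Flexible metadata filtering",
                "Works out of the box"
            ]

            print("Chroma core features:")
            for i, feature in enumerate(chroma_features, 1):
                print(f"  {i}. {feature}")
            ---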

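02.Use cases
    a.RAG applications
        a.Description
            Chroma fits retrieval-augmented generation (RAG) well. It stores document embeddings and retrieves the passages most relevant to a question, and an LLM then generates the answer from them. Grounding the answer in retrieved text improves its quality.
        b.Code example
            ---
            rag_workflow = [
                "1. Chunk and embed the documents",
                "2. Store them in Chroma",
                "3. The user asks a question",
                "4. Retrieve the relevant documents",
                "5. The LLM generates the answer"
            ]

            print("RAG workflow:")
            for step in rag_workflow:
                print(f"  {step}")
            ---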
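Step 1 of the workflow (chunking) can be sketched without any library. The helper name `chunk_text` and its default sizes are assumptions made for illustration; they are not part of Chroma.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: overlapping windows keep a sentence that straddles
    a boundary visible in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk would then be passed as one entry of `documents` when adding to a collection.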
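    b.Semantic search
        a.Description
            Semantic search matches on the meaning of a query rather than on exact keywords, so results stay relevant even when the wording differs. It suits knowledge bases, document retrieval, and similar scenarios.
        b.Code example
            ---
            semantic_search_use_cases = {
                "Knowledge base": "Internal enterprise knowledge retrieval",
                "Document search": "Semantic search over papers and reports",
                "Code search": "Semantic retrieval of code snippets",
                "Q&A systems": "Intelligent Q&A and customer service"
            }

            print("Semantic search use cases:")
            for use_case, desc in semantic_search_use_cases.items():
                print(f"  {use_case}: {desc}")
            ---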

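1.2 Embedded Design

01.Deployment modes
    a.Embedded mode
        a.Description
            In embedded mode Chroma runs inside the application process, so no separate service is needed. This suits development and small-scale applications and keeps deployment and maintenance simple.
        b.Code example
            ---
            import chromadb

            # In-memory mode (nothing is persisted)
            memory_client = chromadb.Client()

            # Persistent mode
            persistent_client = chromadb.PersistentClient(path="./my_chroma_data")

            print("Embedded clients created")
            ---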
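    b.Server mode
        a.Description
            In server mode several clients share one Chroma instance over the network. This suits production deployments that need remote access and room to scale.
        b.Code example
            ---
            import chromadb

            # Start the Chroma server (shell command)
            # chroma run --path ./chroma_data --port 8000

            # Connect from a client
            client = chromadb.HttpClient(host="localhost", port=8000)

            print("Connected to the Chroma server")
            ---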

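02.Data persistence
    a.Local storage
        a.Description
            Chroma can persist to the local file system. Writes are saved automatically and the data survives a restart. This suits single-machine deployments.
        b.Code example
            ---
            import chromadb

            # Create a persistent client
            client = chromadb.PersistentClient(path="./chroma_storage")

            # Data is persisted automatically
            collection = client.create_collection(name="my_collection")
            collection.add(
                documents=["This is a document"],
                ids=["id1"]
            )

            print("Data persisted to local storage")
            ---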
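    b.Cloud storage
        a.Description
            For cloud-native deployments the data can live on cloud infrastructure. Note that the settings shown here (`chroma_db_impl="duckdb+parquet"`) belong to releases before 0.4 and are rejected by newer ones, and pointing `persist_directory` at an object-store URL such as S3 or GCS is unverified. A safer assumption is to persist to a mounted volume and back that volume up.
        b.Code example
            ---
            # Legacy configuration (pre-0.4 releases only, shown for reference)
            # import chromadb
            # from chromadb.config import Settings

            # client = chromadb.Client(Settings(
            #     chroma_db_impl="duckdb+parquet",
            #     persist_directory="s3://my-bucket/chroma"
            # ))

            print("Cloud storage configuration example")
            ---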

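1.3 Core Features

01.Automation
    a.Automatic embedding
        a.Description
            Chroma can embed documents for you. It ships several embedding functions, so vectors do not have to be generated by hand. The OpenAI and Sentence Transformers functions need their respective Python packages installed.
        b.Code example
            ---
            import chromadb
            from chromadb.utils import embedding_functions

            # Default embedding function
            default_ef = embedding_functions.DefaultEmbeddingFunction()

            # OpenAI embeddings
            openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key="your-api-key",
                model_name="text-embedding-ada-002"
            )

            # Sentence Transformers
            sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
                model_name="all-MiniLM-L6-v2"
            )

            print("Embedding functions configured")
            ---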
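    b.ID generation
        a.Description
            In the Python client releases this note is aware of, `ids` is a required argument of `add`, so Chroma does not generate IDs on its own. Generate them in application code, for example with `uuid`, to keep insertion simple and avoid ID collisions.
        b.Code example
            ---
            import uuid
            import chromadb

            client = chromadb.Client()
            collection = client.create_collection(name="auto_id_collection")

            documents = ["Document 1", "Document 2", "Document 3"]
            # Generate one random UUID per document
            ids = [str(uuid.uuid4()) for _ in documents]
            collection.add(documents=documents, ids=ids)

            # Inspect the stored IDs
            results = collection.get()
            print(f"Generated IDs: {results['ids']}")
            ---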
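When the application supplies the IDs, deriving them from the document content makes re-ingestion repeatable: the same text always maps to the same ID. The sketch below uses only the standard library; the helper name `content_ids` and the `doc` prefix are assumptions.

```python
import hashlib


def content_ids(documents: list[str], prefix: str = "doc") -> list[str]:
    """Return one deterministic ID per document, derived from its text.

    Hypothetical helper: identical documents get identical IDs, so
    deduplicate the input first if the target requires unique IDs.
    """
    ids = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()[:16]
        ids.append(f"{prefix}-{digest}")
    return ids


docs = ["Document 1", "Document 2", "Document 3"]
print(content_ids(docs))
```

Paired with `collection.upsert(documents=docs, ids=content_ids(docs))`, running the same ingestion twice would overwrite records instead of duplicating them.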

02.Query features
    a.Metadata filtering
        a.Description
            Vector search can be combined with metadata filters. The `where` argument restricts the candidates, which gives precise retrieval for more complex queries.
        b.Code example
            ---
            # Add documents with metadata
            collection.add(
                documents=["Doc 1", "Doc 2", "Doc 3"],
                metadatas=[
                    {"category": "tech", "year": 2023},
                    {"category": "business", "year": 2023},
                    {"category": "tech", "year": 2024}
                ],
                ids=["id1", "id2", "id3"]
            )

            # Query with a metadata filter
            results = collection.query(
                query_texts=["technology"],
                where={"category": "tech"},
                n_results=2
            )

            print(f"Found {len(results['ids'][0])} results after filtering")
            ---
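    b.Text and vector queries
        a.Description
            A collection can be queried with text or with vectors. Text is embedded automatically; a query vector is used as is and must have the same dimension as the stored embeddings (384 for the default model).
        b.Code example
            ---
            import numpy as np

            # Text query (embedded automatically)
            text_results = collection.query(
                query_texts=["machine learning"],
                n_results=5
            )

            # Vector query (random vector for illustration only)
            query_vector = np.random.random(384).tolist()

            vector_results = collection.query(
                query_embeddings=[query_vector],
                n_results=5
            )

            print("Text and vector queries completed")
            ---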

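2 Quick Start

2.1 Installation and Configuration

01.Installation
    a.Installing with pip
        a.Description
            Install Chroma with pip. Recent releases need Python 3.8 or newer; check the requirement of the release you install. pip pulls in the required dependencies.
        b.Code example
            ---
            # Install Chroma (shell command)
            # pip install chromadb

            # Verify the installation
            import chromadb
            print(f"Chroma version: {chromadb.__version__}")
            ---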
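    b.Optional dependencies
        a.Description
            Install extra packages only when you need them, for example the library behind a specific embedding function. This keeps the dependency set small. The `chromadb[openai]` and `chromadb[all]` extras could not be confirmed, so the example installs the provider packages directly.
        b.Code example
            ---
            # OpenAI embedding support
            # pip install openai

            # Sentence Transformers embedding support
            # pip install sentence-transformers

            print("Optional dependency installation notes")
            ---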

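02.Environment setup
    a.Python environment
        a.Description
            Check that the Python version is compatible, and use a virtual environment to isolate the project's dependencies and avoid version conflicts.
        b.Code example
            ---
            # Create and activate a virtual environment (shell commands)
            # python -m venv chroma_env
            # source chroma_env/bin/activate  # Linux/Mac
            # chroma_env\Scripts\activate  # Windows

            # Install Chroma
            # pip install chromadb

            print("Virtual environment setup")
            ---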
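    b.Server configuration
        a.Description
            Configure the Chroma server by setting its data path, host, and port. Authentication, security, and performance settings are configured separately. The dictionary in the example only illustrates the values involved; it is not a file format that Chroma reads.
        b.Code example
            ---
            # Start the server (shell command)
            # chroma run --path ./chroma_data --port 8000 --host 0.0.0.0

            # Illustrative configuration values
            server_config = {
                "path": "./chroma_data",
                "port": 8000,
                "host": "0.0.0.0",
                "log_level": "INFO"
            }

            print("Server configuration example")
            ---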
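A configuration dictionary like this one can be turned into the start command shown in the comment. The sketch below is an assumption-laden helper (`build_run_command` is a hypothetical name): it emits only the `--path`, `--host`, and `--port` flags that appear in that command and ignores other keys such as `log_level`.

```python
import shlex


def build_run_command(config: dict) -> str:
    """Build a `chroma run` command line from a config dictionary.

    Hypothetical helper: only the flags used elsewhere in this guide
    are emitted; unknown keys are skipped rather than guessed at.
    """
    args = ["chroma", "run"]
    for name in ("path", "host", "port"):
        if name in config:
            args += [f"--{name}", str(config[name])]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


server_config = {
    "path": "./chroma_data",
    "port": 8000,
    "host": "0.0.0.0",
    "log_level": "INFO",
}
print(build_run_command(server_config))
```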

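2.2 Creating a Client

01.Client types
    a.In-memory client
        a.Description
            The in-memory client does not persist data, so everything is lost on restart. It avoids disk I/O, which makes it the fastest option and a good fit for tests and development.
        b.Code example
            ---
            import chromadb

            # Create an in-memory client
            client = chromadb.Client()

            # Create a collection
            collection = client.create_collection(name="test_collection")

            print("In-memory client created")
            ---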
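    b.Persistent client
        a.Description
            The persistent client saves data to disk, so it survives a restart. It suits production use, at slightly lower speed than in-memory mode.
        b.Code example
            ---
            # Create a persistent client
            persistent_client = chromadb.PersistentClient(path="./chroma_db")

            # Data is persisted automatically
            collection = persistent_client.get_or_create_collection(name="my_collection")

            print("Persistent client created")
            ---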

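02.HTTP client
    a.Remote connection
        a.Description
            The HTTP client connects to a remote Chroma server. Several clients can share the same server, which suits distributed deployments and scales better than an embedded client.
        b.Code example
            ---
            # Connect to a remote server
            http_client = chromadb.HttpClient(
                host="localhost",
                port=8000
            )

            # Usage is the same as with a local client
            collection = http_client.get_or_create_collection(name="remote_collection")

            print("HTTP client connected")
            ---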
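    b.Authentication
        a.Description
            Configure client authentication to protect the data in production. Chroma supports more than one authentication method. The provider is given as a class path, and that path differs between releases, so treat the value in the example as an assumption and check the documentation of your version.
        b.Code example
            ---
            # HTTP client with basic authentication (example; provider path varies by release)
            # from chromadb.config import Settings

            # auth_client = chromadb.HttpClient(
            #     host="localhost",
            #     port=8000,
            #     settings=Settings(
            #         chroma_client_auth_provider="chromadb.auth.basic_authn.BasicAuthClientProvider",
            #         chroma_client_auth_credentials="username:password"
            #     )
            # )

            print("Authentication configuration example")
            ---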

2.3 Basic Operations

01.Collection operations
    a.Creating a collection
        a.Description
            A collection is a container for data, similar to a table in a database. It needs a unique name and can be given an embedding function and metadata.
        b.Code example
            ---
            import chromadb

            client = chromadb.Client()

            # Create a collection
            collection = client.create_collection(
                name="my_collection",
                metadata={"description": "My first collection"}
            )

            print(f"Collection '{collection.name}' created")
            ---
    b.Getting a collection
        a.Description
            `get_collection` returns an existing collection and raises an error if it does not exist. Use `get_or_create_collection` to avoid that error.
        b.Code example
            ---
            # Get an existing collection
            existing_collection = client.get_collection(name="my_collection")

            # Get the collection, creating it if needed
            collection = client.get_or_create_collection(name="my_collection")

            # List all collections
            collections = client.list_collections()
            print(f"There are {len(collections)} collections")
            ---

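02.Data operations
    a.Adding data
        a.Description
            Add documents to a collection together with their metadata and IDs. Documents are embedded automatically, and adding in batches is more efficient than adding one record at a time.
        b.Code example
            ---
            # Add documents
            collection.add(
                documents=["This is document 1", "This is document 2"],
                metadatas=[{"source": "web"}, {"source": "api"}],
                ids=["doc1", "doc2"]
            )

            print("Documents added")
            ---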
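    b.Querying data
        a.Description
            Query for similar documents with text or with vectors, optionally with a metadata filter. The top-K results are returned, grouped per query, so each field is a list of lists.
        b.Code example
            ---
            # Query for similar documents
            results = collection.query(
                query_texts=["document"],
                n_results=2
            )

            print("Query results:")
            print(f"  IDs: {results['ids']}")
            print(f"  Documents: {results['documents']}")
            print(f"  Distances: {results['distances']}")
            ---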
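Query results are column-oriented: `ids`, `documents`, and `distances` each hold one inner list per query. A small sketch, assuming that layout, regroups them into one record per hit. The helper name `flatten_query_results` and the sample dictionary are illustrative, not Chroma output.

```python
def flatten_query_results(results: dict) -> list[list[dict]]:
    """Turn column-oriented query results into per-query lists of hit records."""
    columns = (("documents", "document"),
               ("metadatas", "metadata"),
               ("distances", "distance"))
    hits_per_query = []
    for q, ids in enumerate(results["ids"]):
        hits = []
        for rank, doc_id in enumerate(ids):
            hit = {"id": doc_id}
            for key, name in columns:
                column = results.get(key)
                # A column that was not requested may be missing or None.
                if column is not None:
                    hit[name] = column[q][rank]
            hits.append(hit)
        hits_per_query.append(hits)
    return hits_per_query


# Hand-written sample in the same shape as a single-query result.
sample = {
    "ids": [["doc1", "doc2"]],
    "documents": [["This is document 1", "This is document 2"]],
    "metadatas": None,
    "distances": [[0.12, 0.34]],
}
for hit in flatten_query_results(sample)[0]:
    print(hit)
```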

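3 Collection Management

3.1 Creating a Collection

01.Basic creation
    a.Simple creation
        a.Description
            Only a name is needed to create a collection; everything else uses defaults. This is enough to get started, and the name and metadata can be changed later with `collection.modify`.
        b.Code example
            ---
            import chromadb

            client = chromadb.Client()
            collection = client.create_collection(name="simple_collection")

            print(f"Collection '{collection.name}' created")
            ---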
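    b.Configuring the embedding function
        a.Description
            An embedding function can be specified at creation time. Several built-in functions are available and custom ones are supported. The choice determines how vectors are generated, so use the same function when reopening the collection.
        b.Code example
            ---
            from chromadb.utils import embedding_functions

            sentence_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
                model_name="all-MiniLM-L6-v2"
            )

            collection = client.create_collection(
                name="custom_embedding_collection",
                embedding_function=sentence_ef
            )

            print("Collection with a custom embedding function created")
            ---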

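02.Advanced configuration
    a.Distance metric
        a.Description
            The distance metric determines how similarity is computed. The `hnsw:space` key accepts `cosine`, `l2` (squared Euclidean distance, the default), and `ip` (inner product). Choose the one that matches your embedding model and use case.
        b.Code example
            ---
            collection = client.create_collection(
                name="cosine_collection",
                metadata={"hnsw:space": "cosine"}
            )

            print("Collection with cosine distance created")
            ---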
    b.Collection metadata
        a.Description
            A collection can carry its own metadata, for example a description of its purpose. This extra information helps with management and maintenance.
        b.Code example
            ---
            collection = client.create_collection(
                name="metadata_collection",
                metadata={
                    "description": "Product embeddings",
                    "version": "1.0",
                    "created_by": "data_team"
                }
            )

            print(f"Collection metadata: {collection.metadata}")
            ---

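3.2 Metadata Configuration

01.Metadata types
    a.Supported types
        a.Description
            Metadata values can be strings, numbers, and booleans. As far as this note can confirm, nested objects and lists are not accepted as values, so flatten or serialize them first (the example stores tags as a comma-separated string).
        b.Code example
            ---
            collection.add(
                documents=["Document 1", "Document 2"],
                metadatas=[
                    {"category": "tech", "year": 2023, "published": True, "tags": "AI,ML"},
                    {"category": "business", "year": 2024, "published": False, "tags": "finance"}
                ],
                ids=["doc1", "doc2"]
            )

            print("Documents with metadata added")
            ---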
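Source records often arrive with nested fields. A minimal sketch, assuming metadata values must be flat scalars, converts a nested dictionary before it is stored. `flatten_metadata`, the dotted key convention, and the JSON encoding of lists are all choices made for this illustration.

```python
import json

SCALAR_TYPES = (str, int, float, bool)


def flatten_metadata(metadata: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested metadata into a single level of scalar values.

    Hypothetical helper: nested dicts become dotted keys, other
    non-scalar values (lists, tuples) are stored as JSON strings,
    and None values are dropped.
    """
    flat = {}
    for key, value in metadata.items():
        name = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten_metadata(value, name, sep))
        elif isinstance(value, SCALAR_TYPES):
            flat[name] = value
        elif value is None:
            continue
        else:
            flat[name] = json.dumps(value, ensure_ascii=False)
    return flat


record = {
    "category": "tech",
    "tags": ["AI", "ML"],
    "author": {"name": "Ann", "age": 30},
    "note": None,
}
print(flatten_metadata(record))
```

A filter such as `where={"author.name": "Ann"}` would then address the flattened key; JSON-encoded lists can only be matched as whole strings.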
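    b.Metadata indexing
        a.Description
            Metadata is indexed automatically when it is stored, which speeds up filtered queries. There is no index to create by hand.
        b.Code example
            ---
            results = collection.query(
                query_texts=["technology"],
                where={"category": "tech"},
                n_results=10
            )

            print(f"Filtered query found {len(results['ids'][0])} results")
            ---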

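02.Metadata operations
    a.Updating metadata
        a.Description
            A document's metadata can be updated without touching its vector. Only the fields you pass are changed, which keeps data maintenance flexible.
        b.Code example
            ---
            collection.update(
                ids=["doc1"],
                metadatas=[{"category": "tech", "year": 2024, "updated": True}]
            )

            print("Metadata updated")
            ---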
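    b.Deleting metadata fields
        a.Description
            A field can be removed during an update by setting it to None. This clears metadata that is no longer needed. Behavior may differ between releases, so confirm the result with `get`.
        b.Code example
            ---
            print("To delete a metadata field, set it to None in update()")
            ---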

3.3 Persistent Storage

01.Persistence configuration
    a.Local persistence
        a.Description
            Use `PersistentClient` for persistence. Data is saved to the local file system and survives a restart. This suits single-machine deployments.
        b.Code example
            ---
            import chromadb

            client = chromadb.PersistentClient(path="./chroma_persistent_db")
            collection = client.get_or_create_collection(name="persistent_collection")

            collection.add(documents=["Persistent document"], ids=["persistent_1"])

            print("Data persisted to ./chroma_persistent_db")
            ---
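    b.Storage path
        a.Description
            The storage path is configurable. Prefer an absolute path, make sure the process can write to it, and back the directory up regularly.
        b.Code example
            ---
            import os
            import chromadb

            db_path = os.path.abspath("./my_chroma_data")
            client = chromadb.PersistentClient(path=db_path)

            print(f"Data storage path: {db_path}")
            ---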

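02.Data management
    a.Backup and restore
        a.Description
            Back up the data directory regularly with ordinary file-system tools. This is simple and reliable; to restore, copy the directory back.
        b.Code example
            ---
            backup_strategy = [
                "Back up the data directory regularly",
                "Keep versioned backups",
                "Keep an off-site copy",
                "Test the restore procedure"
            ]

            print("Backup strategy:")
            for i, strategy in enumerate(backup_strategy, 1):
                print(f"  {i}. {strategy}")
            ---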
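The first two items of the strategy (regular, versioned copies of the data directory) can be sketched with the standard library. `backup_directory` is a hypothetical helper, and the sketch assumes no process is writing to the directory while it is being archived.

```python
import shutil
import time
from pathlib import Path


def backup_directory(data_dir: str, backup_root: str) -> Path:
    """Zip a data directory into backup_root under a timestamped name."""
    source = Path(data_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"data directory not found: {source}")

    target_root = Path(backup_root).resolve()
    target_root.mkdir(parents=True, exist_ok=True)

    # The timestamp keeps earlier backups instead of overwriting them.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_base = target_root / f"{source.name}-{stamp}"
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=str(source))
    return Path(archive)
```

Calling `backup_directory("./chroma_persistent_db", "./backups")` would produce a file such as `backups/chroma_persistent_db-<timestamp>.zip`. Restoring means unpacking the archive into an empty directory and pointing `PersistentClient` at it.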
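    b.Data migration
        a.Description
            To move data to a new environment, export it from the source and import it into the target, either through the API or by copying files. Request the embeddings explicitly: `get` does not return them unless they are listed in `include`. Verify the record count afterwards.
        b.Code example
            ---
            import chromadb

            source_client = chromadb.PersistentClient(path="./source_db")
            source_collection = source_client.get_collection(name="my_collection")
            # Embeddings are only returned when requested through include
            all_data = source_collection.get(include=["embeddings", "documents", "metadatas"])

            target_client = chromadb.PersistentClient(path="./target_db")
            target_collection = target_client.get_or_create_collection(name="my_collection")
            target_collection.add(
                ids=all_data['ids'],
                documents=all_data['documents'],
                metadatas=all_data['metadatas'],
                embeddings=all_data['embeddings']
            )

            print("Data migration completed")
            ---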
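For collections too large to load in one call, the copy can be done page by page. The sketch below is written against the `get(limit=..., offset=..., include=...)` and `upsert(...)` methods of a collection object; `copy_collection` and the page size are assumptions for illustration.

```python
def copy_collection(source, target, page_size: int = 1000) -> int:
    """Copy all records from source to target in pages; return the count.

    `source` and `target` are expected to behave like Chroma collections.
    upsert is used so that re-running the copy overwrites instead of failing.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")

    copied = 0
    offset = 0
    while True:
        page = source.get(
            limit=page_size,
            offset=offset,
            include=["embeddings", "documents", "metadatas"],
        )
        ids = page["ids"]
        if len(ids) == 0:
            break
        target.upsert(
            ids=ids,
            embeddings=page["embeddings"],
            documents=page["documents"],
            metadatas=page["metadatas"],
        )
        copied += len(ids)
        offset += len(ids)
    return copied
```

Comparing the returned count with `source.count()` gives a quick completeness check after the migration.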

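4 Data Operations

4.1 Adding Data

01.Basic add
    a.Adding documents
        a.Description
            Adding documents is the most basic operation. Pass the document text with an ID for each document; vectors are generated automatically. Several documents can be added in one call.
        b.Code example
            ---
            collection.add(documents=["Doc 1", "Doc 2"], ids=["id1", "id2"])
            print("Documents added")
            ---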
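    b.Adding vectors
        a.Description
            Vectors can be supplied directly, which skips the embedding step. This suits cases where the vectors already exist. All vectors in a collection must have the same dimension.
        b.Code example
            ---
            import numpy as np

            # Random vectors for illustration only
            embeddings = [np.random.random(384).tolist() for _ in range(2)]
            collection.add(embeddings=embeddings, documents=["Doc 1", "Doc 2"], ids=["v1", "v2"])
            ---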

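02.Batch add
    a.Large volumes
        a.Description
            When adding a large amount of data, split it into batches of roughly 1,000 to 10,000 records. This avoids running out of memory.
        b.Code example
            ---
            # Add 10,000 documents in batches of 1,000
            for i in range(0, 10000, 1000):
                batch_docs = [f"Doc {j}" for j in range(i, min(i+1000, 10000))]
                batch_ids = [f"id_{j}" for j in range(i, min(i+1000, 10000))]
                collection.add(documents=batch_docs, ids=batch_ids)
            ---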
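The same batching idea can be factored into a reusable generator that works for any list of records. `iter_batches` is a hypothetical helper name and the default batch size is an assumption.

```python
from typing import Iterator, Sequence


def iter_batches(
    ids: Sequence[str],
    documents: Sequence[str],
    batch_size: int = 1000,
) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (ids, documents) pairs of at most batch_size records each."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(ids[start:end]), list(documents[start:end])


all_ids = [f"id_{j}" for j in range(2500)]
all_docs = [f"Doc {j}" for j in range(2500)]
print([len(batch_ids) for batch_ids, _ in iter_batches(all_ids, all_docs)])
```

Each yielded pair can be passed straight to `collection.add(ids=batch_ids, documents=batch_docs)`.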
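    b.Error handling
        a.Description
            Adding data can fail, so catch exceptions, and catch specific ones instead of using a bare `except`. When the intent is "insert or overwrite", call `upsert` directly: depending on the release, `add` with an ID that already exists may be ignored rather than raise.
        b.Code example
            ---
            try:
                collection.add(documents=["New doc"], ids=["id1"])
            except Exception as exc:  # narrow this to the errors you expect
                print(f"add failed: {exc}")
                collection.upsert(documents=["Updated doc"], ids=["id1"])
            ---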

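4.2 Querying Data

01.Basic queries
    a.The get method
        a.Description
            `get` fetches records by ID without running a vector search. It returns documents and metadata; embeddings are returned only when requested through `include`.
        b.Code example
            ---
            results = collection.get(ids=["id1", "id2"])
            print(f"Fetched {len(results['ids'])} records")
            ---
    b.The query method
        a.Description
            `query` runs a vector search and returns the most similar records. It is the core feature.
        b.Code example
            ---
            results = collection.query(query_texts=["document"], n_results=5)
            print(f"Query results: {results['ids']}")
            ---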

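02.Advanced queries
    a.Metadata filtering
        a.Description
            Vector search can be combined with a metadata filter. The filter restricts the candidate set and the search ranks what remains, which gives precise retrieval.
        b.Code example
            ---
            results = collection.query(
                query_texts=["technology"],
                where={"category": "tech"},
                n_results=10
            )
            ---
    b.Complex filters
        a.Description
            Filters can be combined with `$and` and `$or`, and range operators such as `$gte` are supported. Negation is expressed with `$ne` or `$nin`.
        b.Code example
            ---
            results = collection.query(
                query_texts=["ML"],
                where={"$and": [{"category": "tech"}, {"year": {"$gte": 2023}}]},
                n_results=10
            )
            ---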

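4.3 Update and Delete

01.Update operations
    a.The update method
        a.Description
            `update` changes existing records. Documents, metadata, and vectors can all be updated.
        b.Code example
            ---
            collection.update(
                ids=["id1"],
                documents=["Updated document"],
                metadatas=[{"updated": True}]
            )
            ---
    b.The upsert method
        a.Description
            `upsert` combines update and insert: if the ID exists the record is updated, otherwise it is inserted.
        b.Code example
            ---
            collection.upsert(
                ids=["id1", "id_new"],
                documents=["Updated doc", "New doc"]
            )
            ---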

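02.Delete operations
    a.Delete by ID
        a.Description
            Delete records by ID. Several IDs can be passed at once. Deleted data cannot be recovered.
        b.Code example
            ---
            collection.delete(ids=["id1", "id2"])
            ---
    b.Delete by condition
        a.Description
            Delete by metadata condition. Every record that matches the condition is removed.
        b.Code example
            ---
            collection.delete(where={"category": "obsolete"})
            ---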

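5 Search and Query

5.1 Similarity Search

01.Text search
    a.Single-text query
        a.Description
            Query with text to find the most similar documents. The query text is embedded automatically and the top-K results are returned.
        b.Code example
            ---
            results = collection.query(query_texts=["machine learning"], n_results=10)
            print(f"Found {len(results['ids'][0])} similar documents")
            ---
    b.Batch query
        a.Description
            Several texts can be queried in one call, which is more efficient than separate calls. Results come back as one list per query text.
        b.Code example
            ---
            results = collection.query(
                query_texts=["AI", "ML", "DL"],
                n_results=5
            )
            ---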

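02.Vector search
    a.Direct vector query
        a.Description
            Query with a vector directly, which skips the embedding step. This suits cases where the vectors already exist. The query vector must match the dimension of the stored embeddings.
        b.Code example
            ---
            import numpy as np

            # Random vector for illustration only
            query_vector = np.random.random(384).tolist()
            results = collection.query(query_embeddings=[query_vector], n_results=10)
            ---
    b.Distance metric
        a.Description
            Results include a distance value for each hit. A smaller distance means a closer match. The value is computed with the metric configured on the collection.
        b.Code example
            ---
            results = collection.query(query_texts=["document"], n_results=5)
            print(f"Distances: {results['distances']}")
            ---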
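To make the returned distances easier to interpret, the three metrics can be written out in plain Python. The formulas follow the usual definitions for the `l2`, `ip`, and `cosine` spaces (squared Euclidean distance, one minus the dot product, one minus the cosine similarity); treat them as a sketch of those definitions, not as Chroma's internal code.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: list[float], b: list[float]) -> float:
    """One minus the dot product; can be negative for unnormalized vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """One minus the cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


u, v = [1.0, 0.0], [0.0, 1.0]
print(squared_l2(u, v), inner_product_distance(u, v), cosine_distance(u, v))
```

In all three cases a smaller value means a closer match, which is why results are sorted in ascending order of distance.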

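5.2 Metadata Filtering

01.Basic filters
    a.Equality filter
        a.Description
            Filter on the exact value of a field. String, numeric, and boolean values are supported.
        b.Code example
            ---
            results = collection.query(
                query_texts=["tech"],
                where={"category": "technology"},
                n_results=10
            )
            ---
    b.Range filter
        a.Description
            Filter on a numeric range with operators such as greater than, less than, and their inclusive variants.
        b.Code example
            ---
            results = collection.query(
                query_texts=["recent"],
                where={"year": {"$gte": 2023}},
                n_results=10
            )
            ---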

02.Combined filters
    a.AND conditions
        a.Description
            All conditions must hold. Use the `$and` operator.
        b.Code example
            ---
            results = collection.query(
                query_texts=["AI"],
                where={"$and": [{"category": "tech"}, {"published": True}]},
                n_results=10
            )
            ---
    b.OR conditions
        a.Description
            At least one condition must hold. Use the `$or` operator.
        b.Code example
            ---
            results = collection.query(
                query_texts=["content"],
                where={"$or": [{"category": "tech"}, {"category": "science"}]},
                n_results=10
            )
            ---

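5.3 Where Conditions

01.Condition operators
    a.Comparison operators
        a.Description
            Comparison operators cover equal, not equal, greater than, less than, and their inclusive variants.
        b.Code example
            ---
            where_conditions = {
                "$eq": "equal to",
                "$ne": "not equal to",
                "$gt": "greater than",
                "$gte": "greater than or equal to",
                "$lt": "less than",
                "$lte": "less than or equal to"
            }
            ---
    b.Membership operators
        a.Description
            Check whether a value is in a list with `$in`, or not in it with `$nin`.
        b.Code example
            ---
            results = collection.query(
                query_texts=["content"],
                where={"category": {"$in": ["tech", "science", "business"]}},
                n_results=10
            )
            ---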

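02.Logical operators
    a.AND and OR
        a.Description
            Combine several conditions. `$and` requires all of them to hold, `$or` requires at least one. The two can be nested.
        b.Code example
            ---
            complex_where = {
                "$and": [
                    {"category": "tech"},
                    {"$or": [{"year": 2023}, {"year": 2024}]}
                ]
            }
            ---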
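Nested filters are easier to read when built from small helpers. The sketch below assumes that `$and` and `$or` expect at least two clauses, so a single clause is returned unwrapped. The names `where_and` and `where_or` are hypothetical.

```python
def _combine(operator: str, clauses: tuple) -> dict:
    """Wrap clauses in a logical operator, unwrapping the single-clause case."""
    if not clauses:
        raise ValueError(f"{operator} needs at least one clause")
    if len(clauses) == 1:
        return clauses[0]
    return {operator: list(clauses)}


def where_and(*clauses: dict) -> dict:
    """All clauses must match."""
    return _combine("$and", clauses)


def where_or(*clauses: dict) -> dict:
    """At least one clause must match."""
    return _combine("$or", clauses)


complex_where = where_and(
    {"category": "tech"},
    where_or({"year": 2023}, {"year": 2024}),
)
print(complex_where)
```

The resulting dictionary can be passed as the `where` argument of `query`, `get`, or `delete`.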
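    b.Negation
        a.Description
            To exclude a value, use `$ne` on the field, or `$nin` to exclude several values. The example does not rely on a separate `$not` operator.
        b.Code example
            ---
            results = collection.query(
                query_texts=["content"],
                where={"category": {"$ne": "spam"}},
                n_results=10
            )
            ---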

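6 Embedding Functions

6.1 Default Model

01.Built-in model
    a.Default embedding function
        a.Description
            By default Chroma uses the all-MiniLM-L6-v2 model. It is lightweight and efficient and suits most scenarios.
        b.Code example
            ---
            import chromadb

            client = chromadb.Client()
            # No embedding_function given, so the default model is used
            collection = client.create_collection(name="default_embedding")
            collection.add(documents=["Sample text"], ids=["id1"])
            ---
    b.Model characteristics
        a.Description
            The default model produces 384-dimensional vectors and uses few resources. It was trained mainly on English text, so choose a multilingual model for other languages.
        b.Code example
            ---
            print("Default model: all-MiniLM-L6-v2")
            print("Vector dimension: 384")
            ---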

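02.Model configuration
    a.Viewing model information
        a.Description
            Inspect the collection to see how it is configured. The example prints only the collection name.
        b.Code example
            ---
            collection = client.get_collection(name="my_collection")
            print(f"Collection: {collection.name}")
            ---
    b.Model performance
        a.Description
            The default model balances speed and quality, which suits small and medium-sized applications.
        b.Code example
            ---
            print("Performance: fast, medium quality, low resource usage")
            ---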

6.2 Custom Embeddings

01.Using other models
    a.Sentence Transformers
        a.Description
            Any Sentence Transformers model can be used. Pick one that matches your languages and quality needs.
        b.Code example
            ---
            from chromadb.utils import embedding_functions

            ef = embedding_functions.SentenceTransformerEmbeddingFunction(
                model_name="paraphrase-multilingual-MiniLM-L12-v2"
            )

            collection = client.create_collection(name="custom_model", embedding_function=ef)
            ---
    b.OpenAI embeddings
        a.Description
            OpenAI's embedding models can be plugged in. Quality is high, but an API key is required and each call goes to a remote service.
        b.Code example
            ---
            openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key="your-api-key",
                model_name="text-embedding-ada-002"
            )

            collection = client.create_collection(name="openai_collection", embedding_function=openai_ef)
            ---

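02.Custom functions
    a.Implementing a custom embedding function
        a.Description
            You can implement your own embedding function for special requirements. It receives a list of documents and returns one vector per document.
        b.Code example
            ---
            from chromadb import Documents, EmbeddingFunction, Embeddings

            class MyEmbeddingFunction(EmbeddingFunction):
                def __call__(self, input: Documents) -> Embeddings:
                    # Placeholder: a constant 384-dimensional vector per document
                    return [[0.1] * 384 for _ in input]

            my_ef = MyEmbeddingFunction()
            collection = client.create_collection(name="custom_ef", embedding_function=my_ef)
            ---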
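A constant vector makes every document look identical, so a slightly more useful stand-in helps when testing. The sketch below is a toy feature-hashing embedder in plain Python. It is not a semantic model; the class name `HashingEmbedding` and the default dimension are assumptions. It follows the same callable shape (a list of texts in, a list of vectors out) so it could be wrapped in an `EmbeddingFunction` subclass.

```python
import hashlib
import math


class HashingEmbedding:
    """Toy embedder: hash each token into a bucket, then L2-normalize."""

    def __init__(self, dim: int = 64):
        if dim <= 0:
            raise ValueError("dim must be positive")
        self.dim = dim

    def __call__(self, input: list[str]) -> list[list[float]]:
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % self.dim
            # A hashed sign reduces the bias caused by bucket collisions.
            sign = 1.0 if digest[4] % 2 == 0 else -1.0
            vec[index] += sign
        norm = math.sqrt(sum(v * v for v in vec))
        # Empty text (or fully cancelled buckets) stays a zero vector.
        return [v / norm for v in vec] if norm else vec


ef = HashingEmbedding(dim=32)
vectors = ef(["hello world", "hello world", ""])
print(len(vectors), len(vectors[0]))
```

Documents that share tokens end up with overlapping non-zero buckets, which is enough to exercise add and query code paths without downloading a model.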
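    b.Using precomputed vectors
        a.Description
            Supply the vectors directly and the embedding function is not called. This suits cases where the vectors already exist.
        b.Code example
            ---
            import numpy as np

            # Random vectors for illustration only
            embeddings = [np.random.random(384).tolist() for _ in range(10)]
            collection.add(
                embeddings=embeddings,
                documents=[f"Doc {i}" for i in range(10)],
                ids=[f"id_{i}" for i in range(10)]
            )
            ---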

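6.3 Multilingual Support

01.Multilingual models
    a.Choosing a multilingual model
        a.Description
            Use a model trained on several languages to handle documents in different languages.
        b.Code example
            ---
            from chromadb.utils import embedding_functions

            multilingual_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
                model_name="paraphrase-multilingual-mpnet-base-v2"
            )

            collection = client.create_collection(name="multilingual", embedding_function=multilingual_ef)
            ---
    b.Mixed-language documents
        a.Description
            One collection can hold documents in several languages. A multilingual model embeds them into the same vector space.
        b.Code example
            ---
            collection.add(
                documents=["This is English", "这是中文", "これは日本語です"],
                ids=["en", "zh", "ja"]
            )
            ---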

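02.Cross-language retrieval
    a.Cross-language query
        a.Description
            With a multilingual model, a query in one language can retrieve documents written in another.
        b.Code example
            ---
            results = collection.query(query_texts=["machine learning"], n_results=5)
            print("With a multilingual model, Chinese and Japanese documents can be retrieved too")
            ---
    b.Language-specific optimization
        a.Description
            For a single target language, a model trained specifically for it can improve retrieval quality in that language.
        b.Code example
            ---
            chinese_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
                model_name="shibing624/text2vec-base-chinese"
            )

            chinese_collection = client.create_collection(name="chinese_docs", embedding_function=chinese_ef)
            ---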

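7 AI Framework Integration

7.1 LangChain Integration

01.Basic features
    a.Core features
        a.Description
            LangChain supports Chroma natively as a vector store, with API coverage intended for production use.
        b.Code example
            ---
            # Placeholder for a LangChain integration example
            print("LangChain integration")
            ---
    b.Use cases
        a.Description
            The integration applies to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: LangChain supports Chroma natively")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---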

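7.2 LlamaIndex Integration

01.Basic features
    a.Core features
        a.Description
            LlamaIndex supports Chroma as a vector store, with API coverage intended for production use.
        b.Code example
            ---
            # Placeholder for a LlamaIndex integration example
            print("LlamaIndex integration")
            ---
    b.Use cases
        a.Description
            The integration applies to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: LlamaIndex supports Chroma as a vector store")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---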

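8 Advanced Features

8.1 Local Mode

01.Basic features
    a.Core features
        a.Description
            Local mode runs Chroma in-process with in-memory or persistent storage and exposes the full collection API. In-memory data is lost on exit, so use persistent storage for anything that must survive a restart.
        b.Code example
            ---
            # Placeholder for a local mode example
            print("Local mode")
            ---
    b.Use cases
        a.Description
            Local mode applies to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: in-memory or persistent storage")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---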

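8.2 Client-Server Mode

01.Basic features
    a.Core features
        a.Description
            Client-server mode uses `HttpClient` to connect to a remote Chroma server. The same collection API is available, and the setup is intended for production use.
        b.Code example
            ---
            # Placeholder for a client-server mode example
            print("Client-server mode")
            ---
    b.Use cases
        a.Description
            Client-server mode applies to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: connect to a remote server with HttpClient")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---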

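8.3 Docker Deployment

01.Basic features
    a.Core features
        a.Description
            Docker gives a quick way to deploy the Chroma server. Clients then connect to the container as they would to any Chroma server.
        b.Code example
            ---
            # Placeholder for a Docker deployment example
            print("Docker deployment")
            ---
    b.Use cases
        a.Description
            Docker deployment applies to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: quick deployment with Docker")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---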

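9 Best Practices

9.1 Development and Debugging

01.Basic features
    a.Core features
        a.Description
            During development, inspect collection information and configure logging to see what the database is doing.
        b.Code example
            ---
            # Placeholder for a development and debugging example
            print("Development and debugging")
            ---
    b.Use cases
        a.Description
            These techniques apply to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: inspect collection information and configure logging")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---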

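9.2 Performance Optimization

01.Basic features
    a.Core features
        a.Description
            The main levers are batch operations and well-chosen metadata filters.
        b.Code example
            ---
            # Placeholder for a performance optimization example
            print("Performance optimization")
            ---
    b.Use cases
        a.Description
            These techniques apply to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: batch operations and filter optimization")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---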

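9.3 Production Deployment

01.Basic features
    a.Core features
        a.Description
            Production deployment relies on a deployment checklist plus ongoing monitoring and maintenance.
        b.Code example
            ---
            # Placeholder for a production deployment example
            print("Production deployment")
            ---
    b.Use cases
        a.Description
            These practices apply to a wide range of scenarios and can be configured flexibly.
        b.Code example
            ---
            # Placeholder for a use-case example
            print("Use case: deployment checklist, monitoring, and maintenance")
            ---

02.Advanced usage
    a.Configuration tuning
        a.Description
            Tune the configuration to the actual workload to improve performance and stability.
        b.Code example
            ---
            # Placeholder for a configuration tuning example
            print("Tune the configuration")
            ---
    b.Best practices
        a.Description
            Follow established best practices to keep the system stable and reliable.
        b.Code example
            ---
            # Placeholder for a best-practices example
            print("Follow best practices")
            ---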
