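1 Course Overview

1.1 Course Positioning

01.Course introduction
    a.Positioning
        This is an advanced, hands-on course in natural language processing (NLP) that focuses on implementing the application scenarios most common in industry. It covers the core tasks of text classification, named entity recognition, machine translation, text generation, and question answering, using widely adopted pretrained models (BERT, GPT, T5, and others) and deep learning tooling (PyTorch, Transformers). Unlike the foundational course, it emphasizes practical skills: end-to-end project implementation, model tuning techniques, and production deployment. It suits learners who already know Python and the basics of deep learning and who want to work as NLP engineers or algorithm engineers.
    b.Code example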
        ---
        # NLP进阶课程技术栈示例
        import torch
        from transformers import (
            BertTokenizer, BertForSequenceClassification,
            GPT2LMHeadModel, GPT2Tokenizer,
            T5ForConditionalGeneration, T5Tokenizer
        )
        from torch.utils.data import DataLoader, Dataset
        import numpy as np

        # 检查CUDA可用性
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print(f"使用设备: {device}")

        # 加载预训练BERT模型(文本分类)
        bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        bert_model = BertForSequenceClassification.from_pretrained(
            'bert-base-chinese',
            num_labels=10  # 假设10个分类
        ).to(device)

        # 加载GPT-2模型(文本生成)
        gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        gpt_model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)

        # 加载T5模型(序列到序列任务)
        t5_tokenizer = T5Tokenizer.from_pretrained('t5-small')
        t5_model = T5ForConditionalGeneration.from_pretrained('t5-small').to(device)

        print("所有模型加载完成!")
        print(f"BERT参数量: {sum(p.numel() for p in bert_model.parameters()):,}")
        print(f"GPT-2参数量: {sum(p.numel() for p in gpt_model.parameters()):,}")
        print(f"T5参数量: {sum(p.numel() for p in t5_model.parameters()):,}")
        ---

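02.Differences from the foundational course
    a.Comparison
        The foundational course centers on theory and simple implementations of traditional methods such as word vectors, RNNs, and LSTMs. This course focuses on industrial-grade applications built on the Transformer architecture and pretrained models. Where the foundational course demonstrates ideas on small datasets, this course works with real business data (hundreds of thousands to millions of samples) and emphasizes engineering practice: data preprocessing, model tuning, and performance optimization. It also walks through the full project lifecycle, from requirements analysis and data annotation to model training and API deployment, so that learners can take on NLP projects independently.
    b.Technical comparison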
        ---
        # 基础课程 vs 进阶课程技术对比

        # 【基础课程】传统方法示例
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import Pipeline

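        # Simple TF-IDF + naive Bayes classifier.
        # Chinese text has no spaces, so the default word-level tokenizer would treat
        # each sentence as a single token; character n-grams avoid that here.
        basic_pipeline = Pipeline([
            ('tfidf', TfidfVectorizer(max_features=5000, analyzer='char', ngram_range=(1, 2))),
            ('clf', MultinomialNB())
        ])

        # Training data (toy scale)
        basic_texts = ["这是一条正面评论", "这是负面评论"]
        basic_labels = [1, 0]
        basic_pipeline.fit(basic_texts, basic_labels)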

        # 【进阶课程】深度学习方法示例
        import torch
        import torch.nn as nn
        from transformers import BertTokenizer, BertModel

        class AdvancedTextClassifier(nn.Module):
            def __init__(self, num_labels=2, dropout=0.1):
                super().__init__()
                self.bert = BertModel.from_pretrained('bert-base-chinese')
                self.dropout = nn.Dropout(dropout)
                self.classifier = nn.Linear(768, num_labels)

            def forward(self, input_ids, attention_mask):
                outputs = self.bert(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )
                pooled_output = outputs.pooler_output
                pooled_output = self.dropout(pooled_output)
                logits = self.classifier(pooled_output)
                return logits

        # 创建进阶模型
        advanced_model = AdvancedTextClassifier(num_labels=10)
        print(f"进阶模型参数量: {sum(p.numel() for p in advanced_model.parameters()):,}")

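        # Comparison summary
        print("\nTechnical comparison:")
        print("Foundational course: TF-IDF + traditional machine learning (5000-dimensional features)")
        print("Advanced course: BERT pretrained model (roughly 100M parameters, deep learning)")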
        ---

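03.Learning value
    a.Career growth
        After completing the course, learners will have the core competencies expected in the NLP field. They will know how to fine-tune mainstream pretrained models such as BERT and GPT, and will be able to build industrial-grade applications on their own, with course targets of accuracy above 90% for text classification, F1 above 85% for named entity recognition, and BLEU above 30 for machine translation. Familiarity with PyTorch and the Transformers library lets them adapt quickly to the needs of different NLP projects. Learners will also be introduced to engineering optimizations such as model compression, inference acceleration, and distributed training, which large-scale production environments require, laying a solid foundation for promotion to senior algorithm engineer.
    b.Skill map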
        ---
        # NLP工程师能力图谱与课程覆盖
        import matplotlib.pyplot as plt
        import numpy as np

        # 定义能力维度
        categories = [
            '预训练模型',
            '模型微调',
            '数据处理',
            '模型评估',
            '工程部署',
            '项目管理'
        ]

        # 课程前后能力评分(0-10分)
        before_course = [3, 2, 4, 3, 1, 2]  # 课程前
        after_course = [9, 8, 9, 8, 7, 7]   # 课程后

        # 创建雷达图
        angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
        before_course += before_course[:1]
        after_course += after_course[:1]
        angles += angles[:1]
        categories_plot = categories + [categories[0]]

        fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(projection='polar'))
        ax.plot(angles, before_course, 'o-', linewidth=2, label='课程前', color='orange')
        ax.fill(angles, before_course, alpha=0.25, color='orange')
        ax.plot(angles, after_course, 'o-', linewidth=2, label='课程后', color='blue')
        ax.fill(angles, after_course, alpha=0.25, color='blue')

        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(categories)
        ax.set_ylim(0, 10)
        ax.set_title('NLP工程师能力图谱', size=16, pad=20)
        ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
        ax.grid(True)

        plt.tight_layout()
        # plt.savefig('nlp_skill_radar.png', dpi=300, bbox_inches='tight')
        print("能力提升统计:")
        for i, cat in enumerate(categories):
            improvement = after_course[i] - before_course[i]
            print(f"{cat}: {before_course[i]} -> {after_course[i]} (+{improvement})")
        ---

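1.2 Learning Objectives

01.Core skill objectives
    a.Skill list
        By the end of the course, learners will have five core skills. First, applying pretrained models: using BERT, RoBERTa, and ELECTRA for text classification, GPT-2 and GPT-3 for text generation, and T5 and BART for sequence-to-sequence tasks. Second, fine-tuning: understanding the principles of transfer learning and mastering full-parameter fine-tuning as well as parameter-efficient methods such as LoRA and Adapter. Third, data engineering: large-scale text cleaning, annotation tools, data augmentation, and handling imbalanced data. Fourth, model evaluation and optimization: the evaluation metrics for each class of NLP task (accuracy, F1, BLEU, ROUGE, and others), plus hyperparameter tuning, model ensembling, and error analysis. Fifth, engineering deployment: accelerating models with ONNX and TensorRT, building inference services with FastAPI or Flask, and containerizing models for cloud deployment.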
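        The skill list names LoRA as a parameter-efficient alternative to full fine-tuning. The arithmetic behind that can be sketched as follows, assuming the usual low-rank formulation in which a frozen d_out x d_in weight is adapted by two small trainable matrices of rank r. The function names, the 768-dimensional layer, and r = 8 are illustrative assumptions, not values taken from the course.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    Matrix A has shape (rank x d_in) and matrix B has shape (d_out x rank).
    """
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight itself (bias ignored)."""
    return d_in * d_out


# Illustrative numbers: one 768 x 768 projection, as in a BERT-base-sized layer.
d_in, d_out, rank = 768, 768, 8
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}): {lora:,} trainable parameters ({ratio:.2%} of full)")
```

        Under these assumptions the adapter trains about 2% of the parameters of the layer it wraps, which is why such methods fit on smaller GPUs.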
    b.Skill verification code
        ---
        # NLP核心技能验证代码集合
        import torch
        from transformers import (
            AutoTokenizer, AutoModelForSequenceClassification,
            AutoModelForTokenClassification, AutoModelForSeq2SeqLM,
            Trainer, TrainingArguments
        )
        from datasets import load_dataset
        import numpy as np
        from sklearn.metrics import accuracy_score, f1_score, classification_report

        # 技能1: 预训练模型加载与使用
        def skill_1_pretrained_models():
            """验证预训练模型使用能力"""
            # 加载不同任务的预训练模型
            models = {
                'classification': 'bert-base-uncased',
                'ner': 'distilbert-base-uncased',
                'translation': 't5-small'
            }

            for task, model_name in models.items():
                if task == 'classification':
                    model = AutoModelForSequenceClassification.from_pretrained(
                        model_name, num_labels=2
                    )
                elif task == 'ner':
                    model = AutoModelForTokenClassification.from_pretrained(
                        model_name, num_labels=9
                    )
                else:
                    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

                print(f"{task}模型参数量: {sum(p.numel() for p in model.parameters()):,}")

        # 技能2: 模型微调
        def skill_2_fine_tuning():
            """验证模型微调能力"""
            # Define the training arguments
            training_args = TrainingArguments(
                output_dir='./results',
                num_train_epochs=3,
                per_device_train_batch_size=16,
                per_device_eval_batch_size=32,
                warmup_steps=500,
                weight_decay=0.01,
                logging_dir='./logs',
                logging_steps=100,
                # Newer Transformers releases rename this argument to `eval_strategy`
                evaluation_strategy='epoch',
                save_strategy='epoch',
                load_best_model_at_end=True,
                # Requires a compute_metrics function that returns an 'f1' entry
                metric_for_best_model='f1',
            )
            print("Training arguments configured")
            return training_args

        # 技能3: 数据处理
        def skill_3_data_processing(texts, labels):
            """验证数据处理能力"""
            from collections import Counter

            # 数据统计
            label_dist = Counter(labels)
            print(f"数据量: {len(texts)}")
            print(f"标签分布: {dict(label_dist)}")

            # Data cleaning example
            cleaned_texts = []
            for text in texts:
                # Lowercase and strip leading/trailing whitespace
                cleaned = text.lower().strip()
                cleaned_texts.append(cleaned)

            return cleaned_texts

        # 技能4: 模型评估
        def skill_4_evaluation(y_true, y_pred):
            """验证模型评估能力"""
            accuracy = accuracy_score(y_true, y_pred)
            f1 = f1_score(y_true, y_pred, average='weighted')

            print(f"准确率: {accuracy:.4f}")
            print(f"F1分数: {f1:.4f}")
            print("\n分类报告:")
            print(classification_report(y_true, y_pred))

            return {'accuracy': accuracy, 'f1': f1}

        # 技能5: 模型导出
        def skill_5_model_export(model, tokenizer, save_path):
            """验证模型部署能力"""
            # 保存模型和分词器
            model.save_pretrained(save_path)
            tokenizer.save_pretrained(save_path)
            print(f"模型已保存到: {save_path}")

            # ONNX export example
            model.eval()  # disable dropout so the traced graph is deterministic
            # Keep the dummy input on the same device as the model
            dummy_input = tokenizer("示例文本", return_tensors="pt").to(model.device)
            torch.onnx.export(
                model,
                (dummy_input['input_ids'], dummy_input['attention_mask']),
                f"{save_path}/model.onnx",
                input_names=['input_ids', 'attention_mask'],
                output_names=['logits'],
                dynamic_axes={
                    'input_ids': {0: 'batch', 1: 'sequence'},
                    'attention_mask': {0: 'batch', 1: 'sequence'},
                    'logits': {0: 'batch'}
                }
            )
            print("ONNX model export complete")

        # 执行技能验证
        print("=== NLP核心技能验证 ===\n")
        skill_1_pretrained_models()
        skill_2_fine_tuning()

        # 示例数据
        sample_texts = ["positive review", "negative review", "neutral comment"]
        sample_labels = [1, 0, 2]
        skill_3_data_processing(sample_texts, sample_labels)

        # 示例评估
        y_true = [0, 1, 1, 0, 1]
        y_pred = [0, 1, 0, 0, 1]
        skill_4_evaluation(y_true, y_pred)
        ---

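02.Hands-on project objectives
    a.Project skills
        The course includes 6 complete hands-on projects that cover the mainstream NLP application scenarios. Sentiment analysis: classify the sentiment of e-commerce reviews with BERT, targeting accuracy of 92% or higher. Named entity recognition: use BiLSTM-CRF or BERT-NER to recognize entities such as person, location, and organization names, targeting F1 above 85%. Machine translation: Chinese-English translation based on Transformer or mBART, targeting BLEU above 30. Text summarization: generate news summaries with T5 or BART, targeting ROUGE-L above 0.4. Intelligent question answering: build a hybrid QA system that combines retrieval and generation, targeting accuracy above 80%. Public opinion analysis system: integrate multiple NLP modules for real-time opinion monitoring and analysis. Every project comes with a complete code implementation, detailed documentation, and visual analysis tools.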
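        The F1 target for the named entity recognition project is normally measured per entity rather than per token. A minimal sketch of that calculation follows, assuming exact-match scoring in which a predicted entity counts only if its span and its type both match the gold annotation. The `(start, end, type)` tuple format, the function name, and the sample entities are hypothetical choices made for illustration.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type); hypothetical span format


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> Dict[str, float]:
    """Exact-match entity-level precision, recall, and F1."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# The second prediction has the right span but the wrong type, so it is an error.
gold = {(0, 2, "PER"), (5, 7, "LOC"), (10, 14, "ORG")}
pred = {(0, 2, "PER"), (5, 7, "ORG"), (10, 14, "ORG")}
scores = entity_f1(gold, pred)
print(scores)
```

        Two of the three predictions match exactly, so precision, recall, and F1 all come out to 2/3 here. A token-level score on the same data would look more forgiving, which is why the two should not be compared directly.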
    b.Project framework code
        ---
        # NLP项目通用框架
        import torch
        import torch.nn as nn
        from torch.utils.data import Dataset, DataLoader
        from transformers import AutoTokenizer, AutoModel
        from typing import List, Dict, Tuple
        import json
        from pathlib import Path

        class NLPProject:
            """NLP项目基础框架"""

            def __init__(self, config: Dict):
                self.config = config
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model = None
                self.tokenizer = None
                self.setup()

            def setup(self):
                """初始化项目"""
                # 创建必要目录
                for dir_name in ['data', 'models', 'logs', 'results']:
                    Path(self.config.get(dir_name, f'./{dir_name}')).mkdir(
                        parents=True, exist_ok=True
                    )
                print("项目目录结构创建完成")

            def load_data(self, data_path: str) -> Tuple[List, List]:
                """加载数据"""
                with open(data_path, 'r', encoding='utf-8') as f:
                    data = json.load(f)

                texts = [item['text'] for item in data]
                labels = [item['label'] for item in data]
                print(f"加载数据: {len(texts)} 条")
                return texts, labels

            def build_model(self, model_name: str, num_labels: int):
                """构建模型"""
                from transformers import AutoModelForSequenceClassification

                self.tokenizer = AutoTokenizer.from_pretrained(model_name)
                self.model = AutoModelForSequenceClassification.from_pretrained(
                    model_name,
                    num_labels=num_labels
                ).to(self.device)
                print(f"模型加载完成: {model_name}")

            def train(self, train_data, val_data, epochs: int = 3):
                """训练模型"""
                from transformers import Trainer, TrainingArguments

                training_args = TrainingArguments(
                    output_dir=self.config['models'],
                    num_train_epochs=epochs,
                    per_device_train_batch_size=16,
                    per_device_eval_batch_size=32,
                    evaluation_strategy='epoch',
                    save_strategy='epoch',
                    logging_steps=100,
                    load_best_model_at_end=True,
                )

                trainer = Trainer(
                    model=self.model,
                    args=training_args,
                    train_dataset=train_data,
                    eval_dataset=val_data,
                )

                print("开始训练...")
                trainer.train()
                print("训练完成!")

            def evaluate(self, test_data) -> Dict:
                """Evaluate the model on an iterable of tensor batches."""
                self.model.eval()
                predictions = []
                labels = []

                with torch.no_grad():
                    for batch in test_data:
                        # Move the batch to the model's device before the forward pass
                        batch = {k: v.to(self.device) for k, v in batch.items()}
                        outputs = self.model(**batch)
                        preds = torch.argmax(outputs.logits, dim=-1)
                        predictions.extend(preds.cpu().numpy())
                        labels.extend(batch['labels'].cpu().numpy())

                from sklearn.metrics import accuracy_score, f1_score
                metrics = {
                    'accuracy': accuracy_score(labels, predictions),
                    'f1': f1_score(labels, predictions, average='weighted')
                }
                print(f"Evaluation results: {metrics}")
                return metrics

            def deploy(self, save_path: str):
                """部署模型"""
                self.model.save_pretrained(save_path)
                self.tokenizer.save_pretrained(save_path)
                print(f"模型已部署到: {save_path}")

        # 使用示例
        config = {
            'data': './data',
            'models': './models',
            'logs': './logs',
            'results': './results'
        }

        project = NLPProject(config)
        print("NLP项目框架初始化完成")
        ---

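03.Assessment criteria
    a.Assessment system
        The course uses a multi-dimensional assessment system to make sure learners have really mastered the NLP skills. Theory exam (20%): multiple-choice and short-answer questions that test understanding of core concepts such as how pretrained models work, the Transformer architecture, and the attention mechanism. Programming assignments (30%): each chapter has coding exercises that require implementing a specific NLP function, such as text preprocessing, model fine-tuning, or metric computation, and the code must pass automated tests. Project work (40%): complete at least 3 full projects and submit code, documentation, and a demo video; each project must reach its specified performance target (for example, classification accuracy above 90%). Final defense (10%): present the project results to the instructor and answer technical questions, which assesses problem analysis and problem solving. A total score of 85 or above earns a certificate of completion, and 95 or above earns a job referral.
    b.Automated grading system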
        ---
        # NLP课程自动评分系统
        import numpy as np
        from typing import Dict, List
        from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

        class NLPGradingSystem:
            """NLP课程评分系统"""

            def __init__(self):
                self.weights = {
                    'theory': 0.20,      # 理论考核20%
                    'programming': 0.30,  # 编程作业30%
                    'project': 0.40,      # 项目实战40%
                    'defense': 0.10       # 综合答辩10%
                }

            def grade_theory(self, answers: List[str], correct: List[str]) -> float:
                """理论考核评分"""
                score = sum(1 for a, c in zip(answers, correct) if a == c)
                percentage = (score / len(correct)) * 100
                print(f"理论考核: {score}/{len(correct)} 正确, 得分 {percentage:.1f}")
                return percentage

            def grade_programming(self, code_results: Dict) -> float:
                """编程作业评分"""
                scores = []

                # 评估各项指标
                for task, result in code_results.items():
                    if result['passed']:
                        task_score = 100
                    else:
                        # 部分分数
                        task_score = result.get('partial_score', 0)
                    scores.append(task_score)
                    print(f"{task}: {task_score:.1f}分")

                avg_score = np.mean(scores)
                print(f"编程作业平均分: {avg_score:.1f}")
                return avg_score

            def grade_project(self, metrics: Dict, requirements: Dict) -> float:
                """项目实战评分"""
                scores = []

                # 检查各项指标是否达标
                for metric_name, metric_value in metrics.items():
                    requirement = requirements.get(metric_name, 0)
                    if metric_value >= requirement:
                        score = 100
                    else:
                        # 按比例给分
                        score = (metric_value / requirement) * 100
                    scores.append(score)
                    print(f"{metric_name}: {metric_value:.4f} (要求≥{requirement}), 得分 {score:.1f}")

                avg_score = np.mean(scores)
                print(f"项目实战平均分: {avg_score:.1f}")
                return avg_score

            def calculate_final_score(self, scores: Dict) -> Dict:
                """计算最终成绩"""
                weighted_scores = {}
                for category, score in scores.items():
                    weighted = score * self.weights[category]
                    weighted_scores[category] = weighted

                final_score = sum(weighted_scores.values())

                # 判断等级
                if final_score >= 95:
                    grade = 'A+ (优秀,推荐就业)'
                elif final_score >= 85:
                    grade = 'A (良好,颁发证书)'
                elif final_score >= 75:
                    grade = 'B (及格)'
                else:
                    grade = 'C (不及格)'

                result = {
                    'scores': scores,
                    'weighted_scores': weighted_scores,
                    'final_score': final_score,
                    'grade': grade
                }

                return result

        # 使用示例
        grader = NLPGradingSystem()

        # 模拟学员成绩
        student_scores = {
            'theory': grader.grade_theory(
                ['A', 'B', 'C', 'D'],
                ['A', 'B', 'C', 'C']
            ),
            'programming': grader.grade_programming({
                'task1': {'passed': True},
                'task2': {'passed': True},
                'task3': {'passed': False, 'partial_score': 70}
            }),
            'project': grader.grade_project(
                {'accuracy': 0.92, 'f1': 0.88},
                {'accuracy': 0.90, 'f1': 0.85}
            ),
            'defense': 85.0
        }

        # 计算最终成绩
        final_result = grader.calculate_final_score(student_scores)
        print(f"\n最终成绩: {final_result['final_score']:.2f}")
        print(f"等级: {final_result['grade']}")
        ---

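1.3 Prerequisites

01.Technical prerequisites
    a.Required knowledge
        This course requires a solid foundation in Python programming, including fluent use of NumPy and Pandas for data processing. You need to understand the basic concepts of deep learning, including neural networks, backpropagation, and gradient descent, and ideally have experience with PyTorch or TensorFlow. You should know basic NLP concepts such as tokenization, word vectors, and language models. You need to be comfortable on the Linux command line and able to use Git for version control. On the mathematics side, you need the basics of linear algebra (matrix operations, vector spaces), probability (conditional probability, Bayes' theorem), and calculus (derivatives, the chain rule). It is recommended to read the Transformer paper "Attention Is All You Need" in advance to understand how self-attention works.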
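        As a warm-up for the self-attention material recommended above, the core operation can be sketched in plain Python. This is a simplified illustration of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, and deliberately leaves out the learned projections, multiple heads, and masking of a real Transformer layer. The toy vectors and the choice to reuse the same matrix for Q, K, and V are assumptions made to keep the example small.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    weights = []
    for q_row in q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights.append(softmax(scores))
    # Each output row is the attention-weighted average of the value rows
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(len(v[0]))]
        for w_row in weights
    ]
    return output, weights


# Three tokens with 2-dimensional toy vectors; Q, K, and V are set equal.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(x, x, x)
for token_weights in weights:
    print([round(w, 3) for w in token_weights])
```

        Each printed row sums to 1 and shows how much one token attends to every token in the sequence, which is the quantity the paper's attention diagrams visualize.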
    b.Knowledge check code
        ---
        # 前置知识检测脚本
        import sys

        def check_prerequisites():
            """检查前置环境和知识"""
            print("=== NLP课程前置条件检查 ===\n")

            # 检查1: Python版本
            python_version = sys.version_info
            if python_version >= (3, 8):
                print(f"✓ Python版本: {python_version.major}.{python_version.minor} (合格)")
            else:
                print(f"✗ Python版本过低: {python_version.major}.{python_version.minor} (需要≥3.8)")

            # 检查2: 必要库
            required_packages = {
                'numpy': '数值计算',
                'pandas': '数据处理',
                'torch': 'PyTorch深度学习框架',
                'transformers': 'Hugging Face Transformers',
                'sklearn': '机器学习工具',
                'matplotlib': '数据可视化'
            }

            for package, description in required_packages.items():
                try:
                    __import__(package)
                    print(f"✓ {package}: {description} (已安装)")
                except ImportError:
                    print(f"✗ {package}: {description} (未安装)")

            # Check 3: CUDA availability
            try:
                import torch
                if torch.cuda.is_available():
                    print(f"✓ CUDA: available (device: {torch.cuda.get_device_name(0)})")
                else:
                    print("⚠ CUDA: unavailable (a GPU is recommended for acceleration)")
            except ImportError:
                print("✗ PyTorch is not installed; cannot check CUDA")

            # 检查4: Python编程能力测试
            print("\n=== Python编程能力测试 ===")
            test_python_skills()

            # 检查5: 深度学习知识测试
            print("\n=== 深度学习知识测试 ===")
            test_dl_knowledge()

        def test_python_skills():
            """Python编程能力测试"""
            import numpy as np
            import pandas as pd

            # 测试NumPy
            try:
                arr = np.array([[1, 2], [3, 4]])
                result = np.dot(arr, arr.T)
                print(f"✓ NumPy矩阵运算: 通过")
            except Exception as e:
                print(f"✗ NumPy测试失败: {e}")

            # 测试Pandas
            try:
                df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
                grouped = df.groupby('A').sum()
                print(f"✓ Pandas数据处理: 通过")
            except Exception as e:
                print(f"✗ Pandas测试失败: {e}")

            # Test a list comprehension
            try:
                squares = [x**2 for x in range(10) if x % 2 == 0]
                assert squares == [0, 4, 16, 36, 64]
                print("✓ Advanced Python syntax: passed")
            except AssertionError:
                print("✗ Advanced Python syntax test failed")

        def test_dl_knowledge():
            """深度学习知识测试"""
            try:
                import torch
                import torch.nn as nn

                # 测试1: 创建简单神经网络
                class SimpleNN(nn.Module):
                    def __init__(self):
                        super().__init__()
                        self.fc1 = nn.Linear(10, 5)
                        self.fc2 = nn.Linear(5, 2)

                    def forward(self, x):
                        x = torch.relu(self.fc1(x))
                        x = self.fc2(x)
                        return x

                model = SimpleNN()
                print(f"✓ 神经网络构建: 通过")

                # 测试2: 前向传播
                x = torch.randn(1, 10)
                output = model(x)
                print(f"✓ 前向传播: 通过 (输出形状: {output.shape})")

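                # Test 3: loss computation and backpropagation
                criterion = nn.CrossEntropyLoss()
                optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

                target = torch.tensor([1])
                optimizer.zero_grad()  # clear stale gradients before backward
                loss = criterion(output, target)
                loss.backward()
                optimizer.step()
                print("✓ Backpropagation and optimization: passed")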

            except Exception as e:
                print(f"✗ 深度学习测试失败: {e}")

        # 运行检查
        check_prerequisites()

        # 知识点自测题
        print("\n=== 知识点自测题(请自行回答) ===")
        questions = [
            "1. 什么是词嵌入(Word Embedding)?",
            "2. 解释RNN的梯度消失问题",
            "3. Transformer的自注意力机制是如何工作的?",
            "4. 什么是迁移学习?为什么预训练模型有效?",
            "5. 解释precision、recall、F1-score的含义"
        ]

        for q in questions:
            print(q)
        ---

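02.Hardware requirements
    a.Recommended configuration
        The course involves fine-tuning large pretrained models, so it places some demands on hardware. Minimum configuration: an Intel i5 or AMD Ryzen 5 CPU or better, 16GB of RAM, and 100GB of storage (SSD preferred). Recommended configuration: an Intel i7/i9 or AMD Ryzen 7/9 CPU, 32GB of RAM or more, an NVIDIA RTX 3060 (12GB VRAM) or better GPU, and a 500GB SSD. In practice a GPU is required for training: fine-tuning BERT-base on a CPU can take hours or even days, compared with roughly 30 minutes to 2 hours on a GPU. If you have no local GPU, you can use Google Colab (free GPU), Kaggle Kernels, or a cloud platform (AWS, Alibaba Cloud). The course provides a detailed cloud setup tutorial so that every learner can complete the hands-on projects.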
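        To see why the recommended GPU has 12GB of memory, a back-of-the-envelope estimate helps. The sketch below assumes full FP32 fine-tuning with the Adam optimizer, which keeps weights, gradients, and two moment buffers per parameter, and it assumes roughly 110 million parameters, the commonly cited size of English BERT-base. Activation memory, which grows with batch size and sequence length and is often the larger share, is deliberately excluded. The function name is hypothetical.

```python
def training_state_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Memory for weights, gradients, and Adam's two moment buffers.

    Activations are not included; they depend on batch size and sequence length.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_moments = 2 * num_params * bytes_per_value
    return weights + gradients + adam_moments


bert_base_params = 110_000_000  # approximate, assumed for illustration
total_bytes = training_state_bytes(bert_base_params)
print(f"approx. training state: {total_bytes / 1024**3:.2f} GiB (before activations)")
```

        The fixed training state is 16 bytes per parameter under these assumptions, so a BERT-base-sized model leaves most of a 12GB card for activations, while models several times larger do not.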
    b.Environment setup code
        ---
        # 硬件环境检测和配置脚本
        import torch
        import platform
        import psutil
        import subprocess

        def check_hardware():
            """检查硬件配置"""
            print("=== 硬件环境检测 ===\n")

            # CPU信息
            print(f"操作系统: {platform.system()} {platform.release()}")
            print(f"CPU: {platform.processor()}")
            print(f"CPU核心数: {psutil.cpu_count(logical=False)} 物理核心, {psutil.cpu_count()} 逻辑核心")

            # 内存信息
            mem = psutil.virtual_memory()
            print(f"内存: {mem.total / (1024**3):.1f} GB 总量, {mem.available / (1024**3):.1f} GB 可用")

            # 磁盘信息
            disk = psutil.disk_usage('/')
            print(f"磁盘: {disk.total / (1024**3):.1f} GB 总量, {disk.free / (1024**3):.1f} GB 可用")

            # GPU信息
            print("\n=== GPU信息 ===")
            if torch.cuda.is_available():
                for i in range(torch.cuda.device_count()):
                    props = torch.cuda.get_device_properties(i)
                    print(f"GPU {i}: {props.name}")
                    print(f"  显存: {props.total_memory / (1024**3):.1f} GB")
                    print(f"  计算能力: {props.major}.{props.minor}")
                    print(f"  多处理器数量: {props.multi_processor_count}")

                # 显示当前显存使用
                print(f"\n当前显存使用:")
                print(f"  已分配: {torch.cuda.memory_allocated() / (1024**3):.2f} GB")
                print(f"  已缓存: {torch.cuda.memory_reserved() / (1024**3):.2f} GB")
            else:
                print("未检测到CUDA GPU")
                print("建议: 使用Google Colab或云平台进行训练")

            # 配置建议
            print("\n=== 配置建议 ===")
            give_recommendations(mem.total, torch.cuda.is_available())

        def give_recommendations(memory_bytes, has_gpu):
            """给出配置建议"""
            memory_gb = memory_bytes / (1024**3)

            if memory_gb < 16:
                print("⚠ 内存不足16GB,建议升级内存")
            elif memory_gb < 32:
                print("✓ 内存充足(16-32GB),可以运行大部分任务")
            else:
                print("✓ 内存充足(≥32GB),可以运行所有任务")

            if not has_gpu:
                print("⚠ 未检测到GPU,强烈建议:")
                print("  1. 使用Google Colab (免费GPU)")
                print("  2. 使用云平台 (AWS, 阿里云等)")
                print("  3. 购买云GPU实例")
            else:
                gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
                if gpu_memory < 8:
                    print("⚠ GPU显存<8GB,只能运行小模型")
                elif gpu_memory < 16:
                    print("✓ GPU显存8-16GB,可以运行中等模型")
                else:
                    print("✓ GPU显存≥16GB,可以运行大模型")

        def setup_environment():
            """配置开发环境"""
            print("\n=== 环境配置建议 ===\n")

            # PyTorch安装命令
            if torch.cuda.is_available():
                print("PyTorch (GPU版本)已安装")
            else:
                print("安装PyTorch GPU版本:")
                print("pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118")

            # 其他依赖
            print("\n安装NLP相关库:")
            packages = [
                "transformers",  # Hugging Face
                "datasets",      # 数据集库
                "tokenizers",    # 快速分词
                "sentencepiece", # 分词工具
                "sacremoses",    # 文本处理
                "evaluate",      # 评估指标
            ]

            for pkg in packages:
                print(f"pip install {pkg}")

            # Jupyter配置
            print("\n配置Jupyter环境:")
            print("pip install jupyter jupyterlab")
            print("jupyter notebook --generate-config")

        # 运行检测
        check_hardware()
        setup_environment()

        # 性能基准测试
        print("\n=== 性能基准测试 ===")
        def benchmark():
            if torch.cuda.is_available():
                import time
                # 简单的矩阵乘法测试
                size = 5000
                device = torch.device('cuda')

                a = torch.randn(size, size, device=device)
                b = torch.randn(size, size, device=device)

                # 预热
                torch.matmul(a, b)
                torch.cuda.synchronize()

                # 测试
                start = time.time()
                for _ in range(10):
                    torch.matmul(a, b)
                torch.cuda.synchronize()
                end = time.time()

                print(f"GPU性能: {size}x{size}矩阵乘法x10次耗时 {end-start:.2f}秒")
            else:
                print("无GPU,跳过性能测试")

        benchmark()
        ---

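03.Software tool requirements
    a.Development tools
        The following development tools are recommended to make studying more efficient. Code editor: VS Code (recommended, with the Python and Jupyter extensions) or PyCharm Professional. Jupyter environment: install JupyterLab locally, or code online with Google Colab. Version control: Git (required), together with GitHub or GitLab for code hosting. Virtual environments: use conda or venv to create an isolated Python environment and avoid dependency conflicts. Data annotation tools: Labelbox and doccano for text annotation, Prodigy for NER annotation. Experiment management: Weights & Biases (wandb) for experiment tracking, TensorBoard for visualizing training. Deployment tools: Docker for containerization, FastAPI for building API services. The course provides detailed setup tutorials and usage examples for all of these tools.
    b.Tool installation script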
        ---
        # 开发工具一键安装脚本
        import subprocess
        import sys
        import os

        def install_tools():
            """安装NLP开发工具"""
            print("=== NLP开发工具安装 ===\n")

            # 核心依赖
            core_packages = [
                "torch>=2.0.0",
                "transformers>=4.30.0",
                "datasets>=2.12.0",
                "tokenizers>=0.13.0",
                "accelerate>=0.20.0",  # 分布式训练
            ]

            # 数据处理
            data_packages = [
                "pandas>=2.0.0",
                "numpy>=1.24.0",
                "scikit-learn>=1.3.0",
                "nltk>=3.8.0",
                "jieba>=0.42.0",  # 中文分词
            ]

            # 可视化和监控
            viz_packages = [
                "matplotlib>=3.7.0",
                "seaborn>=0.12.0",
                "tensorboard>=2.13.0",
                "wandb>=0.15.0",  # 实验跟踪
            ]

            # Web开发
            web_packages = [
                "fastapi>=0.100.0",
                "uvicorn>=0.23.0",
                "pydantic>=2.0.0",
            ]

            # Jupyter
            jupyter_packages = [
                "jupyter>=1.0.0",
                "jupyterlab>=4.0.0",
                "ipywidgets>=8.0.0",
            ]

            all_packages = (
                core_packages + data_packages +
                viz_packages + web_packages +
                jupyter_packages
            )

            print("将安装以下包:")
            for pkg in all_packages:
                print(f"  - {pkg}")

            # 安装
            print("\n开始安装...")
            for pkg in all_packages:
                try:
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", pkg],
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL
                    )
                    print(f"✓ {pkg.split('>=')[0]}")
                except subprocess.CalledProcessError:
                    print(f"✗ {pkg.split('>=')[0]} 安装失败")

        def setup_jupyter():
            """配置Jupyter环境"""
            print("\n=== 配置Jupyter ===")

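            # Install useful JupyterLab extensions.
            # jupyter_contrib_nbextensions is left out: it targets the classic
            # Notebook (6.x and earlier) and does not work with JupyterLab 4.
            extensions = [
                "jupyterlab-git",
                "jupyterlab_code_formatter",
            ]

            for ext in extensions:
                try:
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", ext],
                        stdout=subprocess.DEVNULL
                    )
                    print(f"✓ {ext}")
                except subprocess.CalledProcessError:
                    print(f"✗ {ext} installation failed")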

        def setup_git():
            """配置Git"""
            print("\n=== Git配置建议 ===")
            print("git config --global user.name '你的名字'")
            print("git config --global user.email '你的邮箱'")
            print("git config --global core.editor 'code --wait'")

        def create_project_structure():
            """创建项目目录结构"""
            print("\n=== 创建项目目录 ===")

            dirs = [
                'data/raw',
                'data/processed',
                'models/checkpoints',
                'models/final',
                'notebooks',
                'src/data',
                'src/models',
                'src/utils',
                'logs',
                'results',
            ]

            for d in dirs:
                os.makedirs(d, exist_ok=True)
                print(f"✓ {d}")

            # 创建.gitignore
            gitignore_content = """
            # Python
            __pycache__/
            *.py[cod]
            *$py.class
            *.so
            .Python
            env/
            venv/

            # Jupyter
            .ipynb_checkpoints

            # 数据和模型
            /data/
            /models/
            *.pth
            *.bin
            *.ckpt

            # 日志
            logs/
            *.log

            # IDE
            .vscode/
            .idea/
            """

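            # Strip the function-level indentation from every line: git treats
            # leading whitespace as part of the pattern, so indented rules never match
            cleaned = "\n".join(
                line.strip() for line in gitignore_content.strip().splitlines()
            )
            with open('.gitignore', 'w', encoding='utf-8') as f:
                f.write(cleaned + "\n")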
            print("✓ .gitignore")

        # 运行安装
        install_tools()
        setup_jupyter()
        setup_git()
        create_project_structure()

        print("\n=== 安装完成 ===")
        print("建议重启终端以确保所有更改生效")
        ---
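
        Because the installer above silences pip's output, a failed or partial install is easy to miss.
        The sketch below is one way to confirm what actually got installed, using only the standard
        library's `importlib.metadata`. The helper names `requirement_name` and `check_installed` are
        illustrative and not part of the course code, and the requirement strings are assumed to follow
        the simple `name>=version` form used in the package lists.

```python
import re
from importlib import metadata
from typing import Dict, Iterable, Optional


def requirement_name(requirement: str) -> str:
    """Return the distribution name from a simple requirement string."""
    # Cut at the first version operator, extras bracket, marker, or space
    return re.split(r"[<>=!~\[; ]", requirement, maxsplit=1)[0]


def check_installed(requirements: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    report: Dict[str, Optional[str]] = {}
    for requirement in requirements:
        name = requirement_name(requirement)
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Two of the requirement strings from the install lists
    for name, version in check_installed(["pydantic>=2.0.0", "jupyterlab>=4.0.0"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

        Running this after the install step gives a quick per-package report without re-invoking pip.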

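1.4 Study Duration

01.Overall time plan
    a.Description
        This course follows a step-by-step progression, with a recommended duration of 8-12 weeks. The first 4 weeks focus on
        foundational tasks such as text classification and named entity recognition; the middle 4 weeks go deeper into sequence
        tasks such as machine translation and text generation; the final 4 weeks are devoted to a comprehensive project.
        Plan for 15-20 hours per week, covering video lessons, coding practice, and project work.
        Learners with a strong background can compress the core content into 6-8 weeks.
        A three-stage "theory, practice, project" approach is recommended: practise each concept hands-on as soon as you
        have learned it, and finish each chapter with a small project to consolidate it. Jupyter Notebook is recommended
        for interactive study because it makes debugging and inspecting intermediate results easy.
        Build good code-management habits as well, and keep your work under version control with Git.
    b.Code example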
        ---
        # 学习进度跟踪系统
        import json
        import datetime
        from typing import Dict, List
        from pathlib import Path

        class LearningTracker:
            """NLP课程学习进度跟踪器"""

            def __init__(self, course_name: str = "NLP进阶课程"):
                self.course_name = course_name
                self.progress_file = Path("learning_progress.json")
                self.progress = self._load_progress()

            def _load_progress(self) -> Dict:
                """加载学习进度"""
                if self.progress_file.exists():
                    with open(self.progress_file, 'r', encoding='utf-8') as f:
                        return json.load(f)
                return {
                    "start_date": str(datetime.date.today()),
                    "chapters": {},
                    "total_hours": 0,
                    "completed_tasks": []
                }

            def save_progress(self):
                """保存学习进度"""
                with open(self.progress_file, 'w', encoding='utf-8') as f:
                    json.dump(self.progress, f, ensure_ascii=False, indent=2)

            def log_study_session(self, chapter: str, hours: float, tasks: List[str]):
                """记录学习会话"""
                # 更新章节进度
                if chapter not in self.progress["chapters"]:
                    self.progress["chapters"][chapter] = {
                        "hours": 0,
                        "tasks": [],
                        "last_study": None
                    }

                # 累计学习时长
                self.progress["chapters"][chapter]["hours"] += hours
                self.progress["chapters"][chapter]["tasks"].extend(tasks)
                self.progress["chapters"][chapter]["last_study"] = str(datetime.date.today())
                self.progress["total_hours"] += hours
                self.progress["completed_tasks"].extend(tasks)

                self.save_progress()
                print(f"✓ 已记录 {chapter} 的学习: {hours}小时, 完成{len(tasks)}个任务")

            def get_statistics(self) -> Dict:
                """获取学习统计信息"""
                start_date = datetime.datetime.strptime(
                    self.progress["start_date"], "%Y-%m-%d"
                ).date()
                days_elapsed = (datetime.date.today() - start_date).days
                weeks_elapsed = days_elapsed / 7

                stats = {
                    "学习天数": days_elapsed,
                    "学习周数": round(weeks_elapsed, 1),
                    "总学习时长": self.progress["total_hours"],
                    "周均时长": round(self.progress["total_hours"] / max(weeks_elapsed, 1), 1),
                    "章节完成度": len(self.progress["chapters"]),
                    "任务完成数": len(self.progress["completed_tasks"])
                }

                return stats

            def show_progress(self):
                """显示学习进度"""
                print(f"\n{'='*50}")
                print(f"{self.course_name} - 学习进度报告")
                print(f"{'='*50}\n")

                stats = self.get_statistics()
                for key, value in stats.items():
                    print(f"{key}: {value}")

                print(f"\n{'='*50}")
                print("各章节学习情况:")
                print(f"{'='*50}\n")

                for chapter, data in self.progress["chapters"].items():
                    print(f"📚 {chapter}")
                    print(f"   学习时长: {data['hours']}小时")
                    print(f"   完成任务: {len(data['tasks'])}个")
                    print(f"   最近学习: {data['last_study']}\n")

        # 使用示例
        if __name__ == "__main__":
            # 创建跟踪器
            tracker = LearningTracker()

            # 记录学习会话
            tracker.log_study_session(
                chapter="第2章 文本分类",
                hours=3.5,
                tasks=["情感分析数据准备", "BERT模型微调", "模型评估"]
            )

            tracker.log_study_session(
                chapter="第3章 命名实体识别",
                hours=4.0,
                tasks=["BiLSTM-CRF实现", "BERT-NER训练"]
            )

            # 显示进度
            tracker.show_progress()
        ---
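
        The time budget described in this item (8-12 weeks at 15-20 hours per week, or 6-8 weeks for
        the compressed plan) is simple arithmetic, and it can help to see the totals it implies.
        A minimal sketch follows; the function name `total_hours_range` is illustrative, and the
        compressed plan is assumed to keep the same 15-20 hours per week.

```python
from typing import Tuple


def total_hours_range(weeks: Tuple[int, int],
                      hours_per_week: Tuple[int, int]) -> Tuple[int, int]:
    """Return (minimum, maximum) total study hours for the given ranges."""
    min_weeks, max_weeks = weeks
    min_hours, max_hours = hours_per_week
    return min_weeks * min_hours, max_weeks * max_hours


if __name__ == "__main__":
    standard = total_hours_range((8, 12), (15, 20))
    compressed = total_hours_range((6, 8), (15, 20))
    print(f"Standard plan: {standard[0]}-{standard[1]} hours")
    print(f"Compressed plan: {compressed[0]}-{compressed[1]} hours")
```

        Under these assumptions the standard plan comes to 120-240 hours and the compressed plan to 90-160 hours.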

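02.Time allocation by chapter
    a.Description
        Suggested time allocation across the course's 9 chapters: Chapter 1 Course Overview, 1 day; Chapter 2 Text Classification, 10-12 days (data preparation, model training, tuning);
        Chapter 3 Named Entity Recognition, 10-12 days (both the BiLSTM-CRF and BERT-NER approaches); Chapter 4 Machine Translation, 8-10 days (focus on the Transformer architecture);
        Chapter 5 Text Generation, 8-10 days (summarization, dialogue, and other generation tasks); Chapter 6 Question Answering, 10-12 days (several question-answering types);
        Chapter 7 Information Extraction, 6-8 days (relation extraction and event extraction); Chapter 8 Capstone Project, 15-20 days (a complete end-to-end system);
        Chapter 9 Learning Path, 2-3 days. Adjust the plan to your own background and schedule, but protect the hands-on practice time in every chapter.
    b.Code example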
        ---
        # 课程时间规划工具
        import matplotlib.pyplot as plt
        import numpy as np
        from datetime import datetime, timedelta
        from typing import List, Tuple

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False

        class CourseScheduler:
            """课程学习计划生成器"""

            def __init__(self, start_date: str, hours_per_week: float = 15):
                self.start_date = datetime.strptime(start_date, "%Y-%m-%d")
                self.hours_per_week = hours_per_week

                # 定义各章节的学习时长(小时)
                self.chapters = {
                    "第1章 课程概述": 3,
                    "第2章 文本分类": 20,
                    "第3章 命名实体识别": 22,
                    "第4章 机器翻译": 18,
                    "第5章 文本生成": 18,
                    "第6章 问答系统": 20,
                    "第7章 信息抽取": 14,
                    "第8章 综合项目": 30,
                    "第9章 学习路径": 5
                }

            def generate_schedule(self) -> List[Tuple[str, datetime, datetime, float]]:
                """生成学习计划"""
                schedule = []
                current_date = self.start_date

                for chapter, hours in self.chapters.items():
                    # 计算所需天数
                    days_needed = (hours / self.hours_per_week) * 7

                    # 开始和结束日期
                    start = current_date
                    end = current_date + timedelta(days=days_needed)

                    schedule.append((chapter, start, end, hours))
                    current_date = end

                return schedule

            def print_schedule(self):
                """打印学习计划"""
                schedule = self.generate_schedule()
                total_hours = sum(self.chapters.values())
                total_weeks = total_hours / self.hours_per_week

                print(f"\n{'='*70}")
                print(f"NLP进阶课程学习计划")
                print(f"开始日期: {self.start_date.strftime('%Y-%m-%d')}")
                print(f"每周学习: {self.hours_per_week}小时")
                print(f"预计总时长: {total_hours}小时 ({total_weeks:.1f}周)")
                print(f"{'='*70}\n")

                for chapter, start, end, hours in schedule:
                    days = (end - start).days
                    print(f"{chapter:20s} | {start.strftime('%Y-%m-%d')} ~ {end.strftime('%Y-%m-%d')} "
                          f"| {days:2d}天 | {hours:3.0f}小时")

                end_date = schedule[-1][2]
                print(f"\n预计完成日期: {end_date.strftime('%Y-%m-%d')}")
                print(f"总学习天数: {(end_date - self.start_date).days}天\n")

            def plot_gantt_chart(self):
                """绘制甘特图"""
                schedule = self.generate_schedule()

                fig, ax = plt.subplots(figsize=(14, 8))

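                # Prepare the data. Use fractional days: timedelta.days truncates,
                # which would leave gaps between consecutive bars
                chapters = [item[0] for item in schedule]
                starts = [(item[1] - self.start_date).total_seconds() / 86400 for item in schedule]
                durations = [(item[2] - item[1]).total_seconds() / 86400 for item in schedule]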

                # 绘制条形图
                colors = plt.cm.Set3(np.linspace(0, 1, len(chapters)))
                y_pos = np.arange(len(chapters))

                ax.barh(y_pos, durations, left=starts, height=0.6,
                       color=colors, edgecolor='black', linewidth=1.5)

                # 添加标签
                ax.set_yticks(y_pos)
                ax.set_yticklabels(chapters)
                ax.set_xlabel('学习天数', fontsize=12, fontweight='bold')
                ax.set_title('NLP进阶课程学习甘特图', fontsize=14, fontweight='bold', pad=20)

                # 添加网格
                ax.grid(axis='x', alpha=0.3, linestyle='--')
                ax.set_axisbelow(True)

                # 在条形上添加时长标注
                for i, (start, duration, hours) in enumerate(zip(starts, durations,
                                                                  [item[3] for item in schedule])):
                    ax.text(start + duration/2, i, f'{int(hours)}h',
                           ha='center', va='center', fontweight='bold', fontsize=9)

                plt.tight_layout()
                plt.savefig('course_schedule.png', dpi=300, bbox_inches='tight')
                print("✓ 学习计划甘特图已保存为 course_schedule.png")

        # 使用示例
        if __name__ == "__main__":
            # Build a study plan (fixed start date 2025-01-01, 15 hours per week)
            scheduler = CourseScheduler(
                start_date="2025-01-01",
                hours_per_week=15
            )

            # 打印计划
            scheduler.print_schedule()

            # 生成甘特图
            scheduler.plot_gantt_chart()
        ---
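
        The per-chapter day ranges listed in this item can be summed to check them against the overall
        8-12 week recommendation. The sketch below does that arithmetic; the English chapter labels and
        the names `CHAPTER_DAYS` and `total_days` are illustrative, while the day ranges are the ones
        quoted in the description.

```python
from typing import Dict, Tuple

# (min_days, max_days) per chapter, copied from the allocation in the description
CHAPTER_DAYS: Dict[str, Tuple[int, int]] = {
    "Chapter 1 Course Overview": (1, 1),
    "Chapter 2 Text Classification": (10, 12),
    "Chapter 3 Named Entity Recognition": (10, 12),
    "Chapter 4 Machine Translation": (8, 10),
    "Chapter 5 Text Generation": (8, 10),
    "Chapter 6 Question Answering": (10, 12),
    "Chapter 7 Information Extraction": (6, 8),
    "Chapter 8 Capstone Project": (15, 20),
    "Chapter 9 Learning Path": (2, 3),
}


def total_days(chapter_days: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the lower and upper bounds across all chapters."""
    min_total = sum(low for low, _ in chapter_days.values())
    max_total = sum(high for _, high in chapter_days.values())
    return min_total, max_total


if __name__ == "__main__":
    low, high = total_days(CHAPTER_DAYS)
    print(f"Total: {low}-{high} days ({low / 7:.1f}-{high / 7:.1f} weeks)")
```

        The ranges add up to 70-88 days, about 10.0-12.6 weeks, so the upper bound runs slightly past 12 weeks.
        For comparison, the hour values in `CourseScheduler` total 150 hours, which is 10 weeks at 15 hours per week.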

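03.Tips for studying efficiently
    a.Description
        The key to studying efficiently is sensible time management combined with sound study methods. Consider the Pomodoro technique:
        25 minutes of study followed by a 5-minute break, to sustain focus. Use GPU acceleration when training models; Google Colab or a local CUDA setup are both recommended.
        When you get stuck, consult the official documentation and GitHub issues first, then ask on Stack Overflow or a relevant forum.
        Set up a note-taking system in Notion or Obsidian to record concepts and code snippets. Join an NLP study community
        to exchange experience and questions with peers. Review earlier material regularly using spaced repetition. At the end of each week,
        do a recap that organizes the week's core concepts and technical points. Keep an experiment log that records the parameter settings and results of every run.
    b.Code example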
        ---
        # 学习效率分析工具
        import pandas as pd
        import matplotlib.pyplot as plt
        from datetime import datetime, timedelta
        import seaborn as sns

        # 设置中文显示和样式
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        sns.set_style("whitegrid")

        class EfficiencyAnalyzer:
            """学习效率分析器"""

            def __init__(self):
                self.sessions = []

            def add_session(self, date: str, hours: float, tasks_completed: int,
                          focus_level: int, understanding: int):
                """
                添加学习会话记录

                Args:
                    date: 学习日期 (YYYY-MM-DD)
                    hours: 学习时长
                    tasks_completed: 完成任务数
                    focus_level: 专注度评分 (1-10)
                    understanding: 理解程度评分 (1-10)
                """
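                # Guard against division by zero in the efficiency formula
                if hours <= 0:
                    raise ValueError("hours must be positive")

                self.sessions.append({
                    'date': datetime.strptime(date, '%Y-%m-%d'),
                    'hours': hours,
                    'tasks_completed': tasks_completed,
                    'focus_level': focus_level,
                    'understanding': understanding,
                    # Efficiency = tasks per hour, scaled by the focus score (0-1)
                    'efficiency': (tasks_completed / hours) * (focus_level / 10)
                })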

            def get_dataframe(self) -> pd.DataFrame:
                """转换为DataFrame"""
                return pd.DataFrame(self.sessions)

            def analyze_efficiency(self):
                """分析学习效率"""
                df = self.get_dataframe()

                if df.empty:
                    print("暂无学习记录")
                    return

                print(f"\n{'='*60}")
                print("学习效率分析报告")
                print(f"{'='*60}\n")

                # 基本统计
                print(f"总学习天数: {len(df)}天")
                print(f"总学习时长: {df['hours'].sum():.1f}小时")
                print(f"日均学习时长: {df['hours'].mean():.1f}小时")
                print(f"完成任务总数: {df['tasks_completed'].sum()}个")
                print(f"平均效率得分: {df['efficiency'].mean():.2f}\n")

                # 找出最高效的学习时段
                best_day = df.loc[df['efficiency'].idxmax()]
                print(f"最高效学习日: {best_day['date'].strftime('%Y-%m-%d')}")
                print(f"  - 学习时长: {best_day['hours']}小时")
                print(f"  - 完成任务: {best_day['tasks_completed']}个")
                print(f"  - 效率得分: {best_day['efficiency']:.2f}\n")

                # Trend: compare the latest 7 sessions with the (up to 7) sessions
                # before them; the two windows must not overlap
                if len(df) > 7:
                    recent_week = df.tail(7)
                    prev_week = df.iloc[-14:-7]

                    efficiency_change = (
                        recent_week['efficiency'].mean() - prev_week['efficiency'].mean()
                    ) / prev_week['efficiency'].mean() * 100

                    print(f"近7次学习效率变化: {efficiency_change:+.1f}%")

                    if efficiency_change > 10:
                        print("✓ 学习效率显著提升!继续保持!")
                    elif efficiency_change < -10:
                        print("⚠ 学习效率下降,建议调整学习方法或休息")
                    else:
                        print("→ 学习效率稳定")

            def plot_efficiency_trends(self):
                """绘制效率趋势图"""
                df = self.get_dataframe()

                if df.empty:
                    print("暂无数据可视化")
                    return

                fig, axes = plt.subplots(2, 2, figsize=(15, 10))

                # 1. 学习时长趋势
                axes[0, 0].plot(df['date'], df['hours'], marker='o', linewidth=2,
                              markersize=6, color='#2E86AB')
                axes[0, 0].set_title('学习时长趋势', fontsize=12, fontweight='bold')
                axes[0, 0].set_ylabel('小时', fontsize=10)
                axes[0, 0].grid(True, alpha=0.3)

                # 2. 任务完成数
                axes[0, 1].bar(df['date'], df['tasks_completed'], color='#A23B72', alpha=0.7)
                axes[0, 1].set_title('任务完成数', fontsize=12, fontweight='bold')
                axes[0, 1].set_ylabel('任务数', fontsize=10)
                axes[0, 1].grid(True, alpha=0.3, axis='y')

                # 3. 效率得分趋势
                axes[1, 0].plot(df['date'], df['efficiency'], marker='s', linewidth=2,
                              markersize=6, color='#F18F01')
                axes[1, 0].axhline(df['efficiency'].mean(), color='red',
                                  linestyle='--', label='平均效率')
                axes[1, 0].set_title('效率得分趋势', fontsize=12, fontweight='bold')
                axes[1, 0].set_ylabel('效率得分', fontsize=10)
                axes[1, 0].legend()
                axes[1, 0].grid(True, alpha=0.3)

                # 4. 专注度与理解度对比
                x = range(len(df))
                width = 0.35
                axes[1, 1].bar([i - width/2 for i in x], df['focus_level'],
                              width, label='专注度', color='#6A994E', alpha=0.8)
                axes[1, 1].bar([i + width/2 for i in x], df['understanding'],
                              width, label='理解程度', color='#BC4749', alpha=0.8)
                axes[1, 1].set_title('专注度与理解程度', fontsize=12, fontweight='bold')
                axes[1, 1].set_ylabel('评分 (1-10)', fontsize=10)
                axes[1, 1].set_xticks(x)
                axes[1, 1].set_xticklabels([d.strftime('%m-%d') for d in df['date']],
                                          rotation=45)
                axes[1, 1].legend()
                axes[1, 1].grid(True, alpha=0.3, axis='y')

                plt.tight_layout()
                plt.savefig('efficiency_analysis.png', dpi=300, bbox_inches='tight')
                print("\n✓ 效率分析图表已保存为 efficiency_analysis.png")

        # 使用示例
        if __name__ == "__main__":
            # 创建分析器
            analyzer = EfficiencyAnalyzer()

            # 添加学习记录(模拟两周的数据)
            sample_data = [
                ("2025-01-01", 3.5, 4, 8, 7),
                ("2025-01-02", 4.0, 5, 9, 8),
                ("2025-01-03", 2.5, 2, 6, 6),
                ("2025-01-05", 5.0, 6, 9, 9),
                ("2025-01-06", 3.0, 4, 7, 7),
                ("2025-01-08", 4.5, 6, 8, 8),
                ("2025-01-09", 3.5, 4, 8, 8),
                ("2025-01-10", 4.0, 5, 9, 9),
                ("2025-01-12", 5.0, 7, 9, 9),
                ("2025-01-13", 3.5, 5, 8, 8),
            ]

            for data in sample_data:
                analyzer.add_session(*data)

            # 分析效率
            analyzer.analyze_efficiency()

            # 绘制趋势图
            analyzer.plot_efficiency_trends()
        ---
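
        The advice in this item to keep an experiment log can be put into practice with very little code.
        The sketch below appends each run's parameters and results to a JSON Lines file and looks up the
        best run by a chosen metric. The function names, the file layout, and the sample values are all
        hypothetical; this is one possible shape for such a log, not a prescribed format.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional


def log_experiment(log_path: Path, params: Dict[str, Any],
                   metrics: Dict[str, float]) -> None:
    """Append one experiment record to a JSON Lines file."""
    record = {"params": params, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_experiments(log_path: Path) -> List[Dict[str, Any]]:
    """Read every record back; a missing file means no experiments yet."""
    if not log_path.exists():
        return []
    with open(log_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_experiment(log_path: Path, metric: str) -> Optional[Dict[str, Any]]:
    """Return the record with the highest value of `metric`, or None."""
    records = [r for r in load_experiments(log_path) if metric in r["metrics"]]
    return max(records, key=lambda r: r["metrics"][metric]) if records else None


if __name__ == "__main__":
    # Hypothetical runs, written to a temporary directory for the demo
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / "experiments.jsonl"
        log_experiment(path, {"lr": 2e-5, "batch_size": 32}, {"f1": 0.88})
        log_experiment(path, {"lr": 3e-5, "batch_size": 16}, {"f1": 0.91})
        print(best_experiment(path, "f1"))
```

        An append-only text file keeps earlier runs intact and stays readable in any editor, which suits a personal log.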

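1.5 Related Job Roles

01.NLP algorithm engineer
    a.Description
        NLP algorithm engineer is currently one of the most sought-after roles in AI; the job centers on designing and optimizing natural language processing algorithms.
        Core skills include deep learning frameworks (PyTorch/TensorFlow), applying pretrained models (the BERT and GPT families),
        algorithm optimization, and model deployment. The work spans text classification, named entity recognition, machine translation, question answering, and more.
        Salaries are typically 25K-60K RMB per month, and senior engineers in first-tier cities can reach 80K+. The chapters of this course on text classification, NER,
        and machine translation map directly to the core skill requirements of this role. The role calls for a solid mathematical foundation, programming ability, and the ability to read research papers,
        as well as keeping up with recent developments in NLP such as large language models and prompt engineering.
    b.Code example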
        ---
        # NLP算法工程师技能评估系统
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        from typing import Dict, List
        import seaborn as sns

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        sns.set_style("whitegrid")

        class NLPEngineerSkillAssessment:
            """NLP算法工程师技能评估"""

            def __init__(self):
                # 定义技能树和权重
                self.skill_tree = {
                    "深度学习基础": {
                        "weight": 0.20,
                        "skills": ["神经网络", "反向传播", "优化算法", "正则化", "dropout"]
                    },
                    "NLP核心技术": {
                        "weight": 0.25,
                        "skills": ["文本分类", "NER", "机器翻译", "文本生成", "问答系统"]
                    },
                    "预训练模型": {
                        "weight": 0.20,
                        "skills": ["BERT", "GPT", "T5", "模型微调", "提示工程"]
                    },
                    "工程实践": {
                        "weight": 0.20,
                        "skills": ["PyTorch", "模型部署", "性能优化", "数据处理", "版本管理"]
                    },
                    "数学基础": {
                        "weight": 0.15,
                        "skills": ["线性代数", "概率统计", "信息论", "优化理论", "数值计算"]
                    }
                }

                self.scores = {}

            def assess_skill(self, category: str, skill_scores: Dict[str, int]):
                """
                评估某个技能类别

                Args:
                    category: 技能类别
                    skill_scores: 技能评分字典 (技能名: 分数1-10)
                """
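                if category not in self.skill_tree:
                    raise ValueError(f"未知技能类别: {category}")

                # Scores must be integers in 1-10: generate_report() multiplies
                # strings by them to draw the progress bar
                for skill, score in skill_scores.items():
                    if not isinstance(score, int) or not 1 <= score <= 10:
                        raise ValueError(f"{skill} 的评分必须是1-10的整数: {score}")

                self.scores[category] = skill_scores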

            def calculate_overall_score(self) -> float:
                """计算综合得分"""
                total_score = 0.0

                for category, data in self.skill_tree.items():
                    if category in self.scores:
                        # 计算该类别的平均分
                        category_score = np.mean(list(self.scores[category].values()))
                        # 加权求和
                        total_score += category_score * data["weight"]

                return total_score

            def get_level(self, score: float) -> str:
                """根据得分判断级别"""
                if score >= 9.0:
                    return "专家级 (Expert)"
                elif score >= 8.0:
                    return "高级 (Senior)"
                elif score >= 7.0:
                    return "中级 (Intermediate)"
                elif score >= 6.0:
                    return "初级 (Junior)"
                else:
                    return "入门 (Beginner)"

            def generate_report(self):
                """生成评估报告"""
                overall_score = self.calculate_overall_score()
                level = self.get_level(overall_score)

                print(f"\n{'='*70}")
                print(f"NLP算法工程师技能评估报告")
                print(f"{'='*70}\n")

                print(f"综合得分: {overall_score:.2f} / 10.00")
                print(f"当前级别: {level}\n")

                print(f"{'='*70}")
                print("各类别详细得分:")
                print(f"{'='*70}\n")

                for category, data in self.skill_tree.items():
                    if category in self.scores:
                        category_score = np.mean(list(self.scores[category].values()))
                        weight = data["weight"]
                        contribution = category_score * weight

                        print(f"{category} (权重: {weight:.0%})")
                        print(f"  平均分: {category_score:.2f}")
                        print(f"  贡献值: {contribution:.2f}")
                        print(f"  技能明细:")

                        for skill, score in self.scores[category].items():
                            bar = "█" * score + "░" * (10 - score)
                            print(f"    {skill:15s} [{bar}] {score}/10")
                        print()

                # 薪资预估
                salary_range = self._estimate_salary(overall_score)
                print(f"{'='*70}")
                print(f"薪资预估: {salary_range}")
                print(f"{'='*70}\n")

            def _estimate_salary(self, score: float) -> str:
                """根据得分估算薪资范围"""
                if score >= 9.0:
                    return "60-100K/月 (一线城市资深专家)"
                elif score >= 8.0:
                    return "40-60K/月 (高级算法工程师)"
                elif score >= 7.0:
                    return "25-40K/月 (中级算法工程师)"
                elif score >= 6.0:
                    return "15-25K/月 (初级算法工程师)"
                else:
                    return "10-15K/月 (实习生/助理工程师)"

            def plot_radar_chart(self):
                """绘制雷达图"""
                categories = list(self.skill_tree.keys())
                scores = []

                for category in categories:
                    if category in self.scores:
                        scores.append(np.mean(list(self.scores[category].values())))
                    else:
                        scores.append(0)

                # 闭合雷达图
                scores_plot = scores + [scores[0]]
                angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
                angles_plot = angles + [angles[0]]

                fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))

                # 绘制雷达图
                ax.plot(angles_plot, scores_plot, 'o-', linewidth=2, color='#2E86AB',
                       markersize=8, label='当前水平')
                ax.fill(angles_plot, scores_plot, alpha=0.25, color='#2E86AB')

                # 绘制参考线(高级工程师标准:8分)
                reference = [8] * (len(categories) + 1)
                ax.plot(angles_plot, reference, '--', linewidth=1.5, color='red',
                       alpha=0.7, label='高级工程师标准')

                # 设置标签
                ax.set_xticks(angles)
                ax.set_xticklabels(categories, fontsize=11)
                ax.set_ylim(0, 10)
                ax.set_yticks(range(0, 11, 2))
                ax.set_yticklabels(range(0, 11, 2), fontsize=9)
                ax.grid(True, linestyle='--', alpha=0.6)

                ax.set_title('NLP算法工程师技能雷达图', fontsize=14,
                           fontweight='bold', pad=20)
                ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))

                plt.tight_layout()
                plt.savefig('nlp_skill_radar.png', dpi=300, bbox_inches='tight')
                print("✓ 技能雷达图已保存为 nlp_skill_radar.png")

        # 使用示例
        if __name__ == "__main__":
            # 创建评估器
            assessor = NLPEngineerSkillAssessment()

            # 进行自我评估(1-10分)
            assessor.assess_skill("深度学习基础", {
                "神经网络": 8,
                "反向传播": 7,
                "优化算法": 8,
                "正则化": 7,
                "dropout": 8
            })

            assessor.assess_skill("NLP核心技术", {
                "文本分类": 8,
                "NER": 7,
                "机器翻译": 6,
                "文本生成": 7,
                "问答系统": 6
            })

            assessor.assess_skill("预训练模型", {
                "BERT": 8,
                "GPT": 7,
                "T5": 6,
                "模型微调": 8,
                "提示工程": 7
            })

            assessor.assess_skill("工程实践", {
                "PyTorch": 8,
                "模型部署": 6,
                "性能优化": 7,
                "数据处理": 8,
                "版本管理": 8
            })

            assessor.assess_skill("数学基础", {
                "线性代数": 7,
                "概率统计": 7,
                "信息论": 6,
                "优化理论": 6,
                "数值计算": 7
            })

            # 生成报告
            assessor.generate_report()

            # 绘制雷达图
            assessor.plot_radar_chart()
        ---
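
        One property of the weighted score in `calculate_overall_score` is worth spelling out: it sums
        only the categories that have been assessed, so a partial self-assessment yields a deflated
        total. The sketch below shows the same weighted sum with an optional normalization step. The
        function `weighted_score` and its `normalize` flag are illustrative additions, not part of the
        class above; the weights and category means are taken from the example.

```python
from typing import Dict

# Category weights copied from skill_tree; they sum to 1.0
WEIGHTS: Dict[str, float] = {
    "深度学习基础": 0.20,
    "NLP核心技术": 0.25,
    "预训练模型": 0.20,
    "工程实践": 0.20,
    "数学基础": 0.15,
}


def weighted_score(category_means: Dict[str, float], weights: Dict[str, float],
                   normalize: bool = True) -> float:
    """Weighted sum of category means; optionally rescaled by the assessed weight."""
    assessed = [c for c in category_means if c in weights]
    if not assessed:
        return 0.0
    total = sum(category_means[c] * weights[c] for c in assessed)
    if normalize:
        # Divide by the weight actually covered so partial assessments stay on a 0-10 scale
        total /= sum(weights[c] for c in assessed)
    return total


if __name__ == "__main__":
    # Category means of the sample self-assessment in the usage example
    full = {"深度学习基础": 7.6, "NLP核心技术": 6.8, "预训练模型": 7.2,
            "工程实践": 7.4, "数学基础": 6.6}
    partial = {"深度学习基础": 7.6, "NLP核心技术": 6.8}
    print(f"All categories: {weighted_score(full, WEIGHTS):.2f}")
    print(f"Two categories, raw sum: {weighted_score(partial, WEIGHTS, normalize=False):.2f}")
    print(f"Two categories, normalized: {weighted_score(partial, WEIGHTS):.2f}")
```

        With all five categories the sample scores give 7.13. With only the first two assessed, the raw sum is 3.22,
        which `get_level` would label as beginner, while the normalized value is about 7.16.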

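02.Dialogue system engineer
    a.Description
        Dialogue system engineers focus on products such as intelligent customer service, voice assistants, and chatbots. The role requires core techniques including dialogue management, intent recognition,
        slot filling, and multi-turn dialogue. The work covers dialogue flow design, NLU model training, dialogue policy optimization,
        system integration, and evaluation. The tech stack includes dialogue frameworks such as Rasa and Microsoft Bot Framework, along with pretrained models such as BERT and GPT.
        Salaries range from 20K to 50K RMB per month, and dialogue system experts at major tech companies can reach 60K+. Chapter 5 (Text Generation) and Chapter 6 (Question Answering) of this course map directly to this role.
        The role places particular weight on user experience and business understanding: you need to tie technical solutions closely to real business scenarios.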
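        To make the terms intent recognition and slot filling concrete, here is a deliberately tiny
        rule-based sketch. Production systems train NLU models for both steps; the keyword table, the
        regular expressions, and every name below are hypothetical and serve only to show what the two
        steps produce for one utterance.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical intents and their trigger keywords, for illustration only
INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "book_flight": ("flight", "fly"),
    "check_weather": ("weather", "forecast"),
}

# Hypothetical slot patterns: an ISO date, and a capitalized word after "to" or "in"
SLOT_PATTERNS = {
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "city": re.compile(r"\b(?:to|in)\s+([A-Z][a-z]+)\b"),
}


def recognize_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose keyword occurs in the utterance, else None."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def fill_slots(utterance: str) -> Dict[str, str]:
    """Extract every slot whose pattern matches the utterance."""
    slots: Dict[str, str] = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[slot] = match.group(1)
    return slots


if __name__ == "__main__":
    utterance = "Book a flight to Paris on 2025-03-01"
    print(recognize_intent(utterance), fill_slots(utterance))
```

        A dialogue manager would then decide the next action from the intent and the slots still missing,
        for example asking for the date when only the city was given.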
    b.Code example
        ---
        # 对话系统技能评估和学习路径规划
        import networkx as nx
        import matplotlib.pyplot as plt
        from collections import defaultdict
        from typing import List, Dict

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False

        class DialogueSystemCareerPath:
            """对话系统工程师职业路径规划"""

            def __init__(self):
                # 构建技能依赖图
                self.skill_graph = nx.DiGraph()

                # 添加技能节点和依赖关系
                self.skills = {
                    # 基础技能
                    "Python编程": {"level": 1, "hours": 100, "priority": "必需"},
                    "机器学习基础": {"level": 1, "hours": 80, "priority": "必需"},

                    # NLP基础
                    "文本预处理": {"level": 2, "hours": 40, "priority": "必需",
                                "depends": ["Python编程"]},
                    "词向量": {"level": 2, "hours": 50, "priority": "必需",
                             "depends": ["机器学习基础"]},

                    # 对话系统核心
                    "意图识别": {"level": 3, "hours": 60, "priority": "必需",
                              "depends": ["文本预处理", "词向量"]},
                    "槽位填充": {"level": 3, "hours": 60, "priority": "必需",
                              "depends": ["文本预处理", "词向量"]},
                    "对话管理": {"level": 4, "hours": 80, "priority": "必需",
                              "depends": ["意图识别", "槽位填充"]},

                    # 高级技能
                    "多轮对话": {"level": 4, "hours": 70, "priority": "重要",
                              "depends": ["对话管理"]},
                    "任务型对话": {"level": 5, "hours": 90, "priority": "重要",
                                "depends": ["多轮对话"]},
                    "闲聊对话": {"level": 4, "hours": 60, "priority": "重要",
                              "depends": ["对话管理"]},

                    # 框架和工具
                    "Rasa框架": {"level": 5, "hours": 70, "priority": "重要",
                              "depends": ["任务型对话"]},
                    "大模型应用": {"level": 5, "hours": 80, "priority": "重要",
                                "depends": ["多轮对话", "闲聊对话"]},

                    # 工程能力
                    "系统部署": {"level": 6, "hours": 50, "priority": "重要",
                              "depends": ["Rasa框架", "大模型应用"]},
                    "性能优化": {"level": 6, "hours": 60, "priority": "加分项",
                              "depends": ["系统部署"]}
                }

                self._build_graph()

            def _build_graph(self):
                """构建技能依赖图"""
                # 添加节点
                for skill, data in self.skills.items():
                    self.skill_graph.add_node(skill, **data)

                # 添加边(依赖关系)
                for skill, data in self.skills.items():
                    if "depends" in data:
                        for dep in data["depends"]:
                            self.skill_graph.add_edge(dep, skill)

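            def get_learning_path(self) -> List[List[str]]:
                """获取学习路径(按依赖关系分阶段)"""
                # Group skills by topological generation instead of the hand-written
                # "level" field: several skills share a level with their own
                # prerequisite (e.g. 多轮对话 and 对话管理 are both level 4), which
                # would put them in the same stage. Requires networkx >= 2.6.
                return [
                    list(generation)
                    for generation in nx.topological_generations(self.skill_graph)
                ]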

            def estimate_time(self, current_skills: List[str] = None) -> Dict:
                """估算学习时间"""
                if current_skills is None:
                    current_skills = []

                # 找出需要学习的技能
                to_learn = set(self.skills.keys()) - set(current_skills)

                # 计算总时长和各优先级时长
                total_hours = 0
                priority_hours = defaultdict(int)

                for skill in to_learn:
                    hours = self.skills[skill]["hours"]
                    priority = self.skills[skill]["priority"]

                    total_hours += hours
                    priority_hours[priority] += hours

                return {
                    "总学习时长": total_hours,
                    "必需技能": priority_hours["必需"],
                    "重要技能": priority_hours["重要"],
                    "加分项": priority_hours.get("加分项", 0),
                    "预计周数": total_hours / 20  # 假设每周学习20小时
                }

            def print_learning_path(self):
                """打印学习路径"""
                path = self.get_learning_path()

                print(f"\n{'='*70}")
                print("对话系统工程师学习路径")
                print(f"{'='*70}\n")

                for level_idx, level_skills in enumerate(path, 1):
                    print(f"阶段 {level_idx}:")
                    for skill in level_skills:
                        data = self.skills[skill]
                        depends = ", ".join(data.get("depends", []))
                        depends_str = f" (依赖: {depends})" if depends else ""

                        print(f"  • {skill:15s} | {data['hours']:3d}小时 | "
                              f"{data['priority']:6s}{depends_str}")
                    print()

                # 时间估算
                time_est = self.estimate_time()
                print(f"{'='*70}")
                print("学习时间估算:")
                print(f"{'='*70}\n")
                for key, value in time_est.items():
                    if key != "预计周数":
                        print(f"{key}: {value}小时")
                    else:
                        print(f"{key}: {value:.1f}周")

            def plot_skill_tree(self):
                """绘制技能树"""
                plt.figure(figsize=(16, 12))

                # Force-directed (spring) layout; the fixed seed keeps the figure reproducible
                pos = nx.spring_layout(self.skill_graph, k=2, iterations=50, seed=42)

                # 按优先级设置颜色
                color_map = {
                    "必需": "#E63946",
                    "重要": "#F4A261",
                    "加分项": "#2A9D8F"
                }

                node_colors = [
                    color_map[self.skill_graph.nodes[node]["priority"]]
                    for node in self.skill_graph.nodes()
                ]

                # 按层级设置节点大小
                node_sizes = [
                    3000 - self.skill_graph.nodes[node]["level"] * 300
                    for node in self.skill_graph.nodes()
                ]

                # 绘制图
                nx.draw_networkx_nodes(self.skill_graph, pos,
                                      node_color=node_colors,
                                      node_size=node_sizes,
                                      alpha=0.9,
                                      edgecolors='black',
                                      linewidths=2)

                nx.draw_networkx_labels(self.skill_graph, pos,
                                       font_size=9,
                                       font_weight='bold')

                nx.draw_networkx_edges(self.skill_graph, pos,
                                      edge_color='gray',
                                      arrows=True,
                                      arrowsize=20,
                                      arrowstyle='->',
                                      width=2,
                                      alpha=0.6,
                                      connectionstyle='arc3,rad=0.1')

                # 添加图例
                from matplotlib.patches import Patch
                legend_elements = [
                    Patch(facecolor=color_map["必需"], label='必需技能', edgecolor='black'),
                    Patch(facecolor=color_map["重要"], label='重要技能', edgecolor='black'),
                    Patch(facecolor=color_map["加分项"], label='加分项', edgecolor='black')
                ]
                plt.legend(handles=legend_elements, loc='upper left', fontsize=11)

                plt.title('对话系统工程师技能树', fontsize=16, fontweight='bold', pad=20)
                plt.axis('off')
                plt.tight_layout()
                plt.savefig('dialogue_skill_tree.png', dpi=300, bbox_inches='tight')
                print("\n✓ 技能树已保存为 dialogue_skill_tree.png")

        # 使用示例
        if __name__ == "__main__":
            # 创建职业路径规划器
            planner = DialogueSystemCareerPath()

            # 打印学习路径
            planner.print_learning_path()

            # 假设已掌握的技能
            current_skills = ["Python编程", "机器学习基础", "文本预处理"]

            # 估算剩余学习时间
            print(f"\n{'='*70}")
            print("基于当前技能的学习时间估算:")
            print(f"{'='*70}\n")
            time_est = planner.estimate_time(current_skills)
            for key, value in time_est.items():
                if key != "预计周数":
                    print(f"{key}: {value}小时")
                else:
                    print(f"{key}: {value:.1f}周")

            # 绘制技能树
            planner.plot_skill_tree()
        ---

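03.Knowledge Graph Engineer
    a.Description
        Knowledge graph engineers build and maintain enterprise-scale knowledge graphs that power
        intelligent search, recommendation, and risk-control systems. Core skills include information
        extraction (entity recognition, relation extraction, event extraction), knowledge fusion,
        knowledge reasoning, graph databases (Neo4j), and graph neural networks. Day-to-day work covers
        data collection, entity linking, relation mining, knowledge representation, and graph querying.
        Salaries range from 25K to 55K, and senior engineers in industries such as finance and healthcare
        can reach 70K+. Chapter 3 (NER) and Chapter 7 (information extraction) of this course map directly
        onto the requirements of this role. The position calls for solid domain knowledge and data
        engineering skills, as well as familiarity with graph databases and distributed systems.
    b.Code example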
        ---
        # 知识图谱工程师技能矩阵和职业发展路径
        import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False

        class KnowledgeGraphCareer:
            """知识图谱工程师职业发展分析"""

            def __init__(self):
                # 定义技能矩阵
                self.skill_matrix = pd.DataFrame({
                    '技能': [
                        'Python编程', 'NLP基础', '信息抽取', '实体链接',
                        '关系抽取', '事件抽取', '知识表示', '知识融合',
                        '知识推理', 'Neo4j', 'GraphDB', '图神经网络',
                        'SPARQL', '分布式系统', '领域建模', '数据质量'
                    ],
                    '初级要求': [8, 7, 6, 4, 5, 3, 5, 3, 2, 6, 4, 2, 4, 3, 4, 5],
                    '中级要求': [9, 8, 8, 7, 7, 6, 7, 6, 5, 8, 6, 5, 6, 6, 7, 7],
                    '高级要求': [9, 9, 9, 8, 9, 8, 8, 8, 7, 9, 8, 7, 8, 8, 8, 9],
                    '专家要求': [10, 10, 10, 9, 10, 9, 9, 9, 9, 10, 9, 9, 9, 9, 10, 10],
                    '类别': [
                        '基础', '基础', 'NLP', 'NLP', 'NLP', 'NLP',
                        '图谱', '图谱', '图谱', '图谱', '图谱', '图谱',
                        '工程', '工程', '业务', '工程'
                    ]
                })

                # 职业发展阶段
                self.career_stages = {
                    '初级工程师': {
                        '薪资范围': '15-25K',
                        '工作年限': '0-2年',
                        '核心能力': ['实体识别', '关系抽取', 'Neo4j基础'],
                        '主要职责': '辅助构建知识图谱,进行数据标注和基础开发'
                    },
                    '中级工程师': {
                        '薪资范围': '25-40K',
                        '工作年限': '2-4年',
                        '核心能力': ['信息抽取', '知识融合', '图数据库', '领域建模'],
                        '主要职责': '独立完成知识图谱构建,优化抽取算法'
                    },
                    '高级工程师': {
                        '薪资范围': '40-60K',
                        '工作年限': '4-7年',
                        '核心能力': ['知识推理', '图神经网络', '分布式系统', '架构设计'],
                        '主要职责': '设计图谱架构,解决复杂技术问题,指导团队'
                    },
                    '专家/架构师': {
                        '薪资范围': '60-100K',
                        '工作年限': '7年+',
                        '核心能力': ['全栈技术', '业务理解', '团队管理', '技术规划'],
                        '主要职责': '制定技术方向,设计整体架构,业务创新'
                    }
                }

            def plot_skill_heatmap(self):
                """绘制技能要求热力图"""
                fig, ax = plt.subplots(figsize=(12, 10))

                # 准备数据
                data = self.skill_matrix.set_index('技能')[
                    ['初级要求', '中级要求', '高级要求', '专家要求']
                ]

                # 绘制热力图
                sns.heatmap(data, annot=True, fmt='d', cmap='YlOrRd',
                           cbar_kws={'label': '技能要求 (1-10)'},
                           linewidths=0.5, linecolor='gray',
                           vmin=0, vmax=10, ax=ax)

                ax.set_title('知识图谱工程师各阶段技能要求热力图',
                           fontsize=14, fontweight='bold', pad=15)
                ax.set_xlabel('职业阶段', fontsize=12, fontweight='bold')
                ax.set_ylabel('技能项', fontsize=12, fontweight='bold')

                plt.tight_layout()
                plt.savefig('kg_skill_heatmap.png', dpi=300, bbox_inches='tight')
                print("✓ 技能热力图已保存为 kg_skill_heatmap.png")

            def plot_skill_gap_analysis(self, current_skills: dict):
                """
                绘制技能差距分析图

                Args:
                    current_skills: 当前技能水平字典 {技能名: 分数}
                """
                fig, axes = plt.subplots(2, 2, figsize=(16, 12))
                stages = ['初级要求', '中级要求', '高级要求', '专家要求']

                for idx, (ax, stage) in enumerate(zip(axes.flat, stages)):
                    # 计算技能差距
                    skills = self.skill_matrix['技能'].tolist()
                    required = self.skill_matrix[stage].tolist()
                    current = [current_skills.get(skill, 0) for skill in skills]
                    gap = [req - cur for req, cur in zip(required, current)]

                    # Color each bar by its skill category
                    colors = {'基础': '#E63946', 'NLP': '#F4A261',
                             '图谱': '#2A9D8F', '工程': '#264653', '业务': '#E76F51'}

                    y_pos = np.arange(len(skills))

                    # 绘制条形图
                    for i, (skill, g, cat) in enumerate(zip(skills, gap, self.skill_matrix['类别'])):
                        color = colors[cat]
                        ax.barh(i, g, color=color, alpha=0.7, edgecolor='black', linewidth=0.5)

                    ax.set_yticks(y_pos)
                    ax.set_yticklabels(skills, fontsize=8)
                    ax.set_xlabel('技能差距', fontsize=10, fontweight='bold')
                    ax.set_title(f'{stage.replace("要求", "")}技能差距',
                               fontsize=11, fontweight='bold')
                    ax.axvline(x=0, color='black', linestyle='-', linewidth=1)
                    ax.grid(axis='x', alpha=0.3)

                    # 添加图例(只在第一个子图)
                    if idx == 0:
                        from matplotlib.patches import Patch
                        legend_elements = [
                            Patch(facecolor=color, label=cat, edgecolor='black', alpha=0.7)
                            for cat, color in colors.items()
                        ]
                        ax.legend(handles=legend_elements, loc='lower right', fontsize=8)

                plt.suptitle('知识图谱工程师技能差距分析', fontsize=14, fontweight='bold', y=1.00)
                plt.tight_layout()
                plt.savefig('kg_skill_gap.png', dpi=300, bbox_inches='tight')
                print("✓ 技能差距分析图已保存为 kg_skill_gap.png")

            def print_career_path(self):
                """打印职业发展路径"""
                print(f"\n{'='*70}")
                print("知识图谱工程师职业发展路径")
                print(f"{'='*70}\n")

                for stage, info in self.career_stages.items():
                    print(f"【{stage}】")
                    print(f"  薪资范围: {info['薪资范围']}/月")
                    print(f"  工作年限: {info['工作年限']}")
                    print(f"  核心能力: {', '.join(info['核心能力'])}")
                    print(f"  主要职责: {info['主要职责']}")
                    print()

            def recommend_learning_path(self, current_skills: dict, target_stage: str):
                """推荐学习路径"""
                stage_col = target_stage.replace('工程师', '要求').replace('专家/架构师', '专家要求')

                if stage_col not in self.skill_matrix.columns:
                    print(f"错误: 未知的目标阶段 {target_stage}")
                    return

                print(f"\n{'='*70}")
                print(f"面向【{target_stage}】的学习建议")
                print(f"{'='*70}\n")

                # 计算技能差距
                gaps = []
                for _, row in self.skill_matrix.iterrows():
                    skill = row['技能']
                    required = row[stage_col]
                    current = current_skills.get(skill, 0)
                    gap = required - current

                    if gap > 0:
                        gaps.append({
                            '技能': skill,
                            '当前': current,
                            '要求': required,
                            '差距': gap,
                            '类别': row['类别']
                        })

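                # Return early when no skill is below target: sorting an empty
                # DataFrame by '差距' would raise KeyError because the column is missing
                if not gaps:
                    print(f"恭喜!您已达到{target_stage}的技能要求!")
                    return

                # Sort by gap size, largest first
                gaps_df = pd.DataFrame(gaps).sort_values('差距', ascending=False)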

                print("需要提升的技能(按优先级排序):\n")
                for idx, row in gaps_df.iterrows():
                    print(f"{row['技能']:15s} | 当前: {row['当前']}/10 | "
                          f"要求: {row['要求']}/10 | 差距: {row['差距']} | "
                          f"类别: {row['类别']}")

                # 学习建议
                print(f"\n{'='*70}")
                print("学习建议:")
                print(f"{'='*70}\n")

                top_gaps = gaps_df.head(5)
                for idx, row in top_gaps.iterrows():
                    print(f"• {row['技能']} (差距: {row['差距']})")
                    if row['技能'] in ['信息抽取', '实体链接', '关系抽取', '事件抽取']:
                        print(f"  建议: 学习本课程第3章(NER)和第7章(信息抽取)")
                    elif row['技能'] in ['Neo4j', 'GraphDB']:
                        print(f"  建议: 完成Neo4j官方认证课程,实践图数据库项目")
                    elif row['技能'] == '图神经网络':
                        print(f"  建议: 学习PyG/DGL框架,复现经典GNN论文")
                    print()

        # 使用示例
        if __name__ == "__main__":
            # 创建职业分析器
            analyzer = KnowledgeGraphCareer()

            # 打印职业路径
            analyzer.print_career_path()

            # 当前技能水平(自我评估)
            my_skills = {
                'Python编程': 8, 'NLP基础': 7, '信息抽取': 6, '实体链接': 5,
                '关系抽取': 6, '事件抽取': 4, '知识表示': 5, '知识融合': 4,
                '知识推理': 3, 'Neo4j': 6, 'GraphDB': 4, '图神经网络': 3,
                'SPARQL': 5, '分布式系统': 5, '领域建模': 5, '数据质量': 6
            }

            # 绘制技能热力图
            analyzer.plot_skill_heatmap()

            # 技能差距分析
            analyzer.plot_skill_gap_analysis(my_skills)

            # 推荐学习路径
            analyzer.recommend_learning_path(my_skills, '高级工程师')
        ---

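2 Text Classification

2.1 Task Definition

01.Basic concepts of text classification
    a.Description
        Text classification is one of the most fundamental and important NLP tasks: it assigns a text
        to one or more predefined categories. By label structure, tasks divide into binary classification
        (e.g. spam detection), multiclass classification (e.g. news categorization), and multilabel
        classification (e.g. article tagging). By the supervision available, approaches divide into
        supervised, semi-supervised, and zero-shot learning. The core challenges are the choice of
        feature representation (bag of words, TF-IDF, word embeddings, pretrained models), class
        imbalance, long-text handling, and cross-domain transfer. Modern text classification relies
        mainly on deep learning, in particular pretrained language models (BERT, RoBERTa, etc.), which
        approach or exceed reported human performance on several benchmarks. Production use also has to
        weigh engineering factors such as inference speed, model size, and interpretability.
        Common evaluation metrics are accuracy, precision, recall, F1, and the area under the ROC curve (AUC).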
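
        The three task types above differ mainly in how the target is represented. The sketch below
        is a minimal, standard-library-only illustration: binary and multiclass tasks use one integer
        class id per text, while multilabel tasks use a multi-hot vector. The helper names and the
        label sets (`ham`/`spam`, news topics, article tags) are hypothetical and chosen only for
        illustration.

```python
from typing import Dict, List


def build_label_index(label_names: List[str]) -> Dict[str, int]:
    """Map each distinct label name to an integer id (sorted for stability)."""
    return {name: idx for idx, name in enumerate(sorted(set(label_names)))}


def encode_single_label(label: str, index: Dict[str, int]) -> int:
    """Binary and multiclass tasks: exactly one class id per text."""
    return index[label]


def encode_multi_label(labels: List[str], index: Dict[str, int]) -> List[int]:
    """Multilabel tasks: a multi-hot vector with one slot per class."""
    vector = [0] * len(index)
    for label in labels:
        vector[index[label]] = 1
    return vector


# Hypothetical label sets, for illustration only
spam_index = build_label_index(["ham", "spam"])
news_index = build_label_index(["sports", "finance", "tech"])
tag_index = build_label_index(["nlp", "python", "deployment", "bert"])

binary_target = encode_single_label("spam", spam_index)          # one of 2 classes
multiclass_target = encode_single_label("tech", news_index)      # one of 3 classes
multilabel_target = encode_multi_label(["bert", "nlp"], tag_index)  # any subset of 4 tags

print(binary_target, multiclass_target, multilabel_target)
# prints: 1 2 [1, 0, 1, 0]
```

        The representation usually determines the loss: integer class ids pair with a softmax
        cross-entropy loss, and multi-hot vectors pair with an independent sigmoid per class.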
    b.Code example
        ---
        # 文本分类任务定义和数据结构
        import torch
        from torch.utils.data import Dataset, DataLoader
        from typing import List, Dict, Tuple
        import numpy as np
        from sklearn.model_selection import train_test_split
        from collections import Counter

        class TextClassificationDataset(Dataset):
            """文本分类数据集类"""

            def __init__(self, texts: List[str], labels: List[int],
                        tokenizer=None, max_length: int = 128):
                """
                初始化数据集

                Args:
                    texts: 文本列表
                    labels: 标签列表
                    tokenizer: 分词器(可选)
                    max_length: 最大序列长度
                """
                self.texts = texts
                self.labels = labels
                self.tokenizer = tokenizer
                self.max_length = max_length

                # 统计类别分布
                self.label_distribution = Counter(labels)
                self.num_classes = len(set(labels))

            def __len__(self):
                return len(self.texts)

            def __getitem__(self, idx):
                text = self.texts[idx]
                label = self.labels[idx]

                if self.tokenizer is not None:
                    # 使用tokenizer编码
                    encoding = self.tokenizer(
                        text,
                        add_special_tokens=True,
                        max_length=self.max_length,
                        padding='max_length',
                        truncation=True,
                        return_tensors='pt'
                    )

                    return {
                        'input_ids': encoding['input_ids'].flatten(),
                        'attention_mask': encoding['attention_mask'].flatten(),
                        'label': torch.tensor(label, dtype=torch.long)
                    }
                else:
                    # 返回原始文本
                    return {'text': text, 'label': label}

            def get_class_weights(self) -> torch.Tensor:
                """计算类别权重(用于处理不平衡数据)"""
                total_samples = len(self.labels)
                weights = []

                for class_id in range(self.num_classes):
                    class_count = self.label_distribution.get(class_id, 1)
                    weight = total_samples / (self.num_classes * class_count)
                    weights.append(weight)

                return torch.tensor(weights, dtype=torch.float)

            def get_statistics(self) -> Dict:
                """Collect dataset statistics."""
                # Measure length in characters: Chinese text has no spaces, so
                # text.split() would count every sentence as a single token
                text_lengths = [len(text) for text in self.texts]

                stats = {
                    'total_samples': len(self.texts),
                    'num_classes': self.num_classes,
                    'label_distribution': dict(self.label_distribution),
                    'avg_text_length': np.mean(text_lengths),
                    'max_text_length': np.max(text_lengths),
                    'min_text_length': np.min(text_lengths),
                    'std_text_length': np.std(text_lengths)
                }

                return stats

            def print_statistics(self):
                """Print dataset statistics."""
                stats = self.get_statistics()

                print(f"\n{'='*60}")
                print("数据集统计信息")
                print(f"{'='*60}\n")

                print(f"总样本数: {stats['total_samples']}")
                print(f"类别数: {stats['num_classes']}\n")

                print("类别分布:")
                for label, count in sorted(stats['label_distribution'].items()):
                    percentage = count / stats['total_samples'] * 100
                    bar = '█' * int(percentage / 2)  # one block per 2 percentage points
                    print(f"  类别 {label}: {count:5d} ({percentage:5.2f}%) {bar}")

                print("\n文本长度统计:")
                print(f"  平均长度: {stats['avg_text_length']:.2f} 字符")
                print(f"  最大长度: {stats['max_text_length']} 字符")
                print(f"  最小长度: {stats['min_text_length']} 字符")
                print(f"  标准差: {stats['std_text_length']:.2f} 字符")

        class TextClassificationTask:
            """文本分类任务管理器"""

            def __init__(self, task_type: str = "binary"):
                """
                初始化任务

                Args:
                    task_type: 任务类型 ("binary", "multiclass", "multilabel")
                """
                self.task_type = task_type
                self.label_map = {}
                self.reverse_label_map = {}

            def create_label_mapping(self, labels: List[str]) -> Dict:
                """创建标签映射"""
                unique_labels = sorted(set(labels))

                self.label_map = {label: idx for idx, label in enumerate(unique_labels)}
                self.reverse_label_map = {idx: label for label, idx in self.label_map.items()}

                return self.label_map

            def encode_labels(self, labels: List[str]) -> List[int]:
                """编码标签"""
                return [self.label_map[label] for label in labels]

            def decode_labels(self, label_ids: List[int]) -> List[str]:
                """解码标签"""
                return [self.reverse_label_map[idx] for idx in label_ids]

            def split_data(self, texts: List[str], labels: List[int],
                          test_size: float = 0.2, val_size: float = 0.1,
                          random_state: int = 42) -> Tuple:
                """分割数据集"""
                # 首先分割出测试集
                X_temp, X_test, y_temp, y_test = train_test_split(
                    texts, labels, test_size=test_size,
                    random_state=random_state, stratify=labels
                )

                # 再从剩余数据中分割出验证集
                val_ratio = val_size / (1 - test_size)
                X_train, X_val, y_train, y_val = train_test_split(
                    X_temp, y_temp, test_size=val_ratio,
                    random_state=random_state, stratify=y_temp
                )

                print(f"数据分割完成:")
                print(f"  训练集: {len(X_train)} 样本")
                print(f"  验证集: {len(X_val)} 样本")
                print(f"  测试集: {len(X_test)} 样本")

                return X_train, X_val, X_test, y_train, y_val, y_test

        # 使用示例
        if __name__ == "__main__":
            # 模拟数据
            texts = [
                "这部电影非常精彩,值得一看",
                "剧情拖沓,浪费时间",
                "演员表演出色,导演功力深厚",
                "无聊透顶,不推荐",
                "视觉效果震撼,音乐动人",
                "故事老套,毫无新意"
            ] * 100  # 扩展数据

            str_labels = ["positive", "negative", "positive",
                         "negative", "positive", "negative"] * 100

            # 创建任务管理器
            task = TextClassificationTask(task_type="binary")

            # 创建标签映射
            label_map = task.create_label_mapping(str_labels)
            print("标签映射:", label_map)

            # 编码标签
            labels = task.encode_labels(str_labels)

            # 分割数据
            X_train, X_val, X_test, y_train, y_val, y_test = task.split_data(
                texts, labels, test_size=0.2, val_size=0.1
            )

            # 创建数据集
            train_dataset = TextClassificationDataset(X_train, y_train)
            train_dataset.print_statistics()

            # 获取类别权重
            class_weights = train_dataset.get_class_weights()
            print(f"\n类别权重: {class_weights}")

            # 创建数据加载器
            train_loader = DataLoader(
                train_dataset,
                batch_size=16,
                shuffle=True,
                num_workers=0
            )

            print(f"\n训练集批次数: {len(train_loader)}")

            # 查看一个批次
            batch = next(iter(train_loader))
            print(f"\n批次样例:")
            print(f"  文本数: {len(batch['text'])}")
            print(f"  标签: {batch['label'][:5]}")
        ---

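02.Evaluation metrics
    a.Description
        Metrics should be chosen to fit the task. On balanced datasets, accuracy is the most intuitive
        metric; on imbalanced data, precision, recall, and F1 are more informative. Macro averaging
        takes the unweighted mean of the per-class scores, so every class counts equally; micro
        averaging pools true positives, false positives, and false negatives over all samples, so
        frequent classes dominate; weighted averaging takes the mean of the per-class scores weighted
        by each class's sample count. Multilabel classification additionally calls for metrics such as
        Hamming loss and subset accuracy. The ROC curve helps choose a decision threshold in binary
        classification, and AUC summarizes ranking quality across all thresholds. In practice, the
        confusion matrix, per-class performance, and model calibration also deserve attention. For
        real-time systems, inference speed and resource consumption are further evaluation dimensions.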
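
        To make the difference between the three averaging schemes concrete, here is a small worked
        example in plain Python. It is a sketch on made-up data (8 samples of class 0, 2 samples of
        class 1, one minority sample misclassified); the function names are illustrative and not part
        of any library.

```python
from typing import Dict, List, Tuple


def per_class_counts(y_true: List[int], y_pred: List[int],
                     num_classes: int) -> Tuple[List[int], List[int], List[int]]:
    """Count true positives, false positives, and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true: List[int], y_pred: List[int],
                num_classes: int) -> Dict[str, float]:
    tp, fp, fn = per_class_counts(y_true, y_pred, num_classes)
    per_class = [f1_from_counts(tp[c], fp[c], fn[c]) for c in range(num_classes)]
    support = [tp[c] + fn[c] for c in range(num_classes)]  # true samples per class
    return {
        # unweighted mean of per-class F1
        "macro": sum(per_class) / num_classes,
        # F1 computed from counts pooled over all classes
        "micro": f1_from_counts(sum(tp), sum(fp), sum(fn)),
        # per-class F1 weighted by class support
        "weighted": sum(f * s for f, s in zip(per_class, support)) / sum(support),
    }


# Made-up imbalanced data: one of the two minority samples is misclassified
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

scores = averaged_f1(y_true, y_pred, num_classes=2)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

        On this data the macro score is the lowest of the three, because the weak minority class
        counts as much as the majority class, while the micro score equals the accuracy (0.9), as it
        always does for single-label classification.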
    b.Code example
        ---
        # 文本分类评估指标系统
        import numpy as np
        from sklearn.metrics import (
            accuracy_score, precision_recall_fscore_support,
            classification_report, confusion_matrix,
            roc_auc_score, roc_curve
        )
        import matplotlib.pyplot as plt
        import seaborn as sns
        from typing import List, Dict, Optional

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False

        class ClassificationMetrics:
            """文本分类评估指标计算器"""

            def __init__(self, num_classes: int, class_names: Optional[List[str]] = None):
                """
                初始化评估器

                Args:
                    num_classes: 类别数量
                    class_names: 类别名称列表
                """
                self.num_classes = num_classes
                self.class_names = class_names or [f"类别{i}" for i in range(num_classes)]

            def compute_metrics(self, y_true: np.ndarray, y_pred: np.ndarray,
                              y_prob: Optional[np.ndarray] = None) -> Dict:
                """
                计算所有评估指标

                Args:
                    y_true: 真实标签
                    y_pred: 预测标签
                    y_prob: 预测概率(可选,用于AUC计算)

                Returns:
                    指标字典
                """
                # 基本指标
                accuracy = accuracy_score(y_true, y_pred)

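                # Per-class precision, recall, and F1. Passing labels explicitly keeps each
                # array at length num_classes even when a class is absent from y_true and y_pred
                precision, recall, f1, support = precision_recall_fscore_support(
                    y_true, y_pred, labels=list(range(self.num_classes)),
                    average=None, zero_division=0
                )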

                # 各种平均方式
                macro_precision, macro_recall, macro_f1, _ = precision_recall_fscore_support(
                    y_true, y_pred, average='macro', zero_division=0
                )

                weighted_precision, weighted_recall, weighted_f1, _ = precision_recall_fscore_support(
                    y_true, y_pred, average='weighted', zero_division=0
                )

                metrics = {
                    'accuracy': accuracy,
                    'macro_precision': macro_precision,
                    'macro_recall': macro_recall,
                    'macro_f1': macro_f1,
                    'weighted_precision': weighted_precision,
                    'weighted_recall': weighted_recall,
                    'weighted_f1': weighted_f1,
                    'per_class_precision': precision,
                    'per_class_recall': recall,
                    'per_class_f1': f1,
                    'per_class_support': support
                }

                # 计算AUC(如果提供了概率)
                if y_prob is not None:
                    try:
                        if self.num_classes == 2:
                            # 二分类
                            auc = roc_auc_score(y_true, y_prob[:, 1])
                            metrics['auc'] = auc
                        else:
                            # 多分类(one-vs-rest)
                            auc = roc_auc_score(y_true, y_prob,
                                              multi_class='ovr', average='macro')
                            metrics['auc_ovr'] = auc
                    except Exception as e:
                        print(f"计算AUC时出错: {e}")

                return metrics

            def print_report(self, y_true: np.ndarray, y_pred: np.ndarray,
                           y_prob: Optional[np.ndarray] = None):
                """打印详细评估报告"""
                metrics = self.compute_metrics(y_true, y_pred, y_prob)

                print(f"\n{'='*70}")
                print("文本分类评估报告")
                print(f"{'='*70}\n")

                # 总体指标
                print("总体性能:")
                print(f"  准确率 (Accuracy):        {metrics['accuracy']:.4f}")
                print(f"  宏平均 F1 (Macro F1):     {metrics['macro_f1']:.4f}")
                print(f"  加权平均 F1 (Weighted F1): {metrics['weighted_f1']:.4f}")

                if 'auc' in metrics:
                    print(f"  AUC:                      {metrics['auc']:.4f}")
                elif 'auc_ovr' in metrics:
                    print(f"  AUC (OvR):                {metrics['auc_ovr']:.4f}")

                # 每个类别的详细指标
                print(f"\n{'='*70}")
                print("各类别详细指标:")
                print(f"{'='*70}\n")
                print(f"{'类别':<15} {'精确率':>10} {'召回率':>10} {'F1值':>10} {'样本数':>10}")
                print("-" * 70)

                for i in range(self.num_classes):
                    print(f"{self.class_names[i]:<15} "
                          f"{metrics['per_class_precision'][i]:>10.4f} "
                          f"{metrics['per_class_recall'][i]:>10.4f} "
                          f"{metrics['per_class_f1'][i]:>10.4f} "
                          f"{int(metrics['per_class_support'][i]):>10d}")

                # Sklearn分类报告
                print(f"\n{'='*70}")
                print("Sklearn分类报告:")
                print(f"{'='*70}\n")
                print(classification_report(y_true, y_pred,
                                          labels=list(range(self.num_classes)),
                                          target_names=self.class_names,
                                          digits=4, zero_division=0))

            def plot_confusion_matrix(self, y_true: np.ndarray, y_pred: np.ndarray,
                                    normalize: bool = False):
                """绘制混淆矩阵"""
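                # Fix the label order so rows and columns always match class_names
                cm = confusion_matrix(y_true, y_pred,
                                      labels=list(range(self.num_classes)))

                if normalize:
                    # Row-normalize; rows with no true samples stay at 0 instead of NaN
                    row_sums = cm.sum(axis=1, keepdims=True)
                    cm = np.divide(cm, row_sums,
                                   out=np.zeros(cm.shape, dtype=float),
                                   where=row_sums != 0)

                plt.figure(figsize=(10, 8))
                sns.heatmap(cm, annot=True, fmt='.2f' if normalize else 'd',
                           cmap='Blues', xticklabels=self.class_names,
                           yticklabels=self.class_names,
                           cbar_kws={'label': '比例' if normalize else '数量'})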

                plt.title('混淆矩阵' + (' (归一化)' if normalize else ''),
                         fontsize=14, fontweight='bold', pad=15)
                plt.ylabel('真实标签', fontsize=12, fontweight='bold')
                plt.xlabel('预测标签', fontsize=12, fontweight='bold')

                plt.tight_layout()
                plt.savefig('confusion_matrix.png', dpi=300, bbox_inches='tight')
                print("✓ 混淆矩阵已保存为 confusion_matrix.png")

            def plot_roc_curve(self, y_true: np.ndarray, y_prob: np.ndarray):
                """绘制ROC曲线(二分类)"""
                if self.num_classes != 2:
                    print("ROC曲线仅适用于二分类问题")
                    return

                fpr, tpr, thresholds = roc_curve(y_true, y_prob[:, 1])
                auc = roc_auc_score(y_true, y_prob[:, 1])

                plt.figure(figsize=(10, 8))
                plt.plot(fpr, tpr, linewidth=2, label=f'ROC曲线 (AUC = {auc:.4f})')
                plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='随机猜测')

                plt.xlim([0.0, 1.0])
                plt.ylim([0.0, 1.05])
                plt.xlabel('假正例率 (False Positive Rate)', fontsize=12, fontweight='bold')
                plt.ylabel('真正例率 (True Positive Rate)', fontsize=12, fontweight='bold')
                plt.title('ROC曲线', fontsize=14, fontweight='bold', pad=15)
                plt.legend(loc="lower right", fontsize=11)
                plt.grid(alpha=0.3)

                plt.tight_layout()
                plt.savefig('roc_curve.png', dpi=300, bbox_inches='tight')
                print("✓ ROC曲线已保存为 roc_curve.png")

        # 使用示例
        if __name__ == "__main__":
            # 模拟预测结果
            np.random.seed(42)
            n_samples = 1000

            # 二分类示例
            y_true = np.random.randint(0, 2, n_samples)
            y_prob = np.random.rand(n_samples, 2)
            y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)  # 归一化
            y_pred = np.argmax(y_prob, axis=1)

            # 创建评估器
            evaluator = ClassificationMetrics(
                num_classes=2,
                class_names=['负面', '正面']
            )

            # 打印报告
            evaluator.print_report(y_true, y_pred, y_prob)

            # 绘制混淆矩阵
            evaluator.plot_confusion_matrix(y_true, y_pred, normalize=True)

            # 绘制ROC曲线
            evaluator.plot_roc_curve(y_true, y_prob)
        ---
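
        The evaluator above handles single-label tasks. For the multilabel metrics named in the
        description (Hamming loss and subset accuracy), the following is a minimal plain-Python
        sketch on made-up multi-hot vectors; the function names are illustrative.

```python
from typing import List


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total_slots = sum(len(row) for row in y_true)
    wrong_slots = sum(
        truth != pred
        for true_row, pred_row in zip(y_true, y_pred)
        for truth, pred in zip(true_row, pred_row)
    )
    return wrong_slots / total_slots


def subset_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of samples whose entire label vector is predicted exactly."""
    exact_matches = sum(
        true_row == pred_row for true_row, pred_row in zip(y_true, y_pred)
    )
    return exact_matches / len(y_true)


# Made-up data: 4 samples, 3 labels each (multi-hot encoding)
ml_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
ml_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]

print(f"Hamming loss:    {hamming_loss(ml_true, ml_pred):.4f}")     # 2 wrong slots out of 12
print(f"Subset accuracy: {subset_accuracy(ml_true, ml_pred):.4f}")  # 2 exact rows out of 4
```

        Subset accuracy is the stricter of the two: a sample with a single wrong label counts as
        fully wrong, whereas Hamming loss gives credit for every label slot that is correct.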

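03.Common application scenarios
    a.Description
        Text classification is widely used in production. Sentiment analysis mines user reviews and
        monitors social media sentiment, helping companies track product reputation; news classification
        supports content recommendation and aggregation and improves the reading experience; spam
        filtering protects users from unwanted messages; intent recognition is a core module of dialogue
        systems and determines what a user query is asking for; content moderation detects prohibited,
        violent, or pornographic material to keep a platform safe; document classification supports
        enterprise knowledge management and automatic archiving; question classification routes
        questions in customer-service systems. Requirements differ by scenario: real-time systems need
        low-latency inference, content moderation needs high recall to avoid missed detections, and
        recommendation systems need personalized modeling.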
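
        One common way to meet a scenario-specific requirement such as high recall for content
        moderation is to tune the decision threshold on a labeled validation set. The sketch below
        assumes a model that outputs a violation score per text; the scores, labels, and function
        names are made up for illustration.

```python
from typing import List


def recall_at(scores: List[float], labels: List[int], threshold: float) -> float:
    """Share of true positives whose score reaches the threshold."""
    positives = sum(labels)
    caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    return caught / positives


def precision_at(scores: List[float], labels: List[int], threshold: float) -> float:
    """Share of flagged items that are true positives (0 if nothing is flagged)."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else 0.0


def pick_threshold(scores: List[float], labels: List[int],
                   target_recall: float) -> float:
    """Return the highest observed score that, used as threshold, meets the recall target."""
    if sum(labels) == 0:
        raise ValueError("need at least one positive example to measure recall")
    for threshold in sorted(set(scores), reverse=True):
        if recall_at(scores, labels, threshold) >= target_recall:
            return threshold
    return min(scores)  # only reached if target_recall is above 1.0


# Made-up validation scores (higher = more likely to be a violation) and labels
val_scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
val_labels = [1, 1, 0, 1, 0, 1, 0, 0]

for target in (0.75, 1.0):
    chosen = pick_threshold(val_scores, val_labels, target)
    print(f"target recall {target:.2f} -> threshold {chosen:.2f}, "
          f"precision {precision_at(val_scores, val_labels, chosen):.2f}")
```

        Raising the recall target lowers the threshold and costs precision, which is why a spam
        filter that must avoid false alarms and a moderation system that must avoid misses end up
        with very different thresholds on the same kind of model.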
    b.Code example
        ---
        # 文本分类应用场景示例框架
        from abc import ABC, abstractmethod
        from typing import List, Dict, Tuple
        import time

        class TextClassificationApplication(ABC):
            """文本分类应用基类"""

            def __init__(self, model_name: str, threshold: float = 0.5):
                """
                初始化应用

                Args:
                    model_name: 模型名称
                    threshold: 分类阈值
                """
                self.model_name = model_name
                self.threshold = threshold
                self.prediction_count = 0
                self.total_inference_time = 0.0

            @abstractmethod
            def preprocess(self, text: str) -> str:
                """预处理文本"""
                pass

            @abstractmethod
            def predict(self, text: str) -> Dict:
                """预测分类结果"""
                pass

            @abstractmethod
            def postprocess(self, prediction: Dict) -> Dict:
                """后处理预测结果"""
                pass

            def inference(self, text: str) -> Dict:
                """完整的推理流程"""
                start_time = time.perf_counter()  # monotonic clock, suited to measuring intervals

                # Preprocess
                processed_text = self.preprocess(text)

                # Predict
                prediction = self.predict(processed_text)

                # Postprocess
                result = self.postprocess(prediction)

                # Record inference time
                inference_time = time.perf_counter() - start_time
                self.prediction_count += 1
                self.total_inference_time += inference_time

                result['inference_time'] = inference_time

                return result

            def get_statistics(self) -> Dict:
                """获取统计信息"""
                avg_time = (self.total_inference_time / self.prediction_count
                           if self.prediction_count > 0 else 0)

                return {
                    'model_name': self.model_name,
                    'total_predictions': self.prediction_count,
                    'total_time': self.total_inference_time,
                    'avg_inference_time': avg_time,
                    'predictions_per_second': 1 / avg_time if avg_time > 0 else 0
                }

        class SentimentAnalysisApp(TextClassificationApplication):
            """情感分析应用"""

            def __init__(self, model_name: str = "BERT-Sentiment"):
                super().__init__(model_name, threshold=0.5)
                self.label_map = {0: "负面", 1: "中性", 2: "正面"}

            def preprocess(self, text: str) -> str:
                """清洗和标准化文本"""
                # 移除特殊字符,统一空格
                text = text.strip()
                text = ' '.join(text.split())
                return text

            def predict(self, text: str) -> Dict:
                """模拟情感分析预测"""
                import random
                # 实际应用中这里应该调用真实的模型
                scores = [random.random() for _ in range(3)]
                total = sum(scores)
                probs = [s / total for s in scores]
                pred_label = probs.index(max(probs))

                return {
                    'label': pred_label,
                    'probabilities': probs
                }

            def postprocess(self, prediction: Dict) -> Dict:
                """格式化输出结果"""
                label = prediction['label']
                probs = prediction['probabilities']

                return {
                    'sentiment': self.label_map[label],
                    'confidence': probs[label],
                    'scores': {
                        self.label_map[i]: prob
                        for i, prob in enumerate(probs)
                    }
                }

        class SpamDetectionApp(TextClassificationApplication):
            """垃圾邮件检测应用"""

            def __init__(self, model_name: str = "SpamFilter-v2"):
                super().__init__(model_name, threshold=0.8)  # 高阈值避免误判

            def preprocess(self, text: str) -> str:
                """提取关键特征"""
                text = text.lower().strip()
                return text

            def predict(self, text: str) -> Dict:
                """检测垃圾邮件"""
                import random
                # 模拟预测
                spam_score = random.random()

                return {
                    'is_spam': spam_score > self.threshold,
                    'spam_score': spam_score
                }

            def postprocess(self, prediction: Dict) -> Dict:
                """添加处理建议"""
                is_spam = prediction['is_spam']
                score = prediction['spam_score']

                action = "拦截" if is_spam else "放行"
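                # Confidence band by distance from the neutral score 0.5
                # (not from self.threshold)
                margin = abs(score - 0.5)
                if margin > 0.3:
                    confidence = "高"
                elif margin > 0.15:
                    confidence = "中"
                else:
                    confidence = "低"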

                return {
                    'classification': '垃圾邮件' if is_spam else '正常邮件',
                    'spam_probability': score,
                    'action': action,
                    'confidence_level': confidence
                }

        class IntentRecognitionApp(TextClassificationApplication):
            """意图识别应用(对话系统)"""

            def __init__(self, model_name: str = "Intent-BERT"):
                super().__init__(model_name, threshold=0.6)
                self.intents = [
                    "查询天气", "设置闹钟", "播放音乐",
                    "查询时间", "讲个笑话", "其他"
                ]

            def preprocess(self, text: str) -> str:
                """标准化用户输入"""
                text = text.strip()
                # Strip filler particles only at the start or end of the input;
                # a global replace would also corrupt words such as "酒吧"
                filler_chars = "嗯啊呢吧哦"
                text = text.strip(filler_chars).strip()
                return text

            def predict(self, text: str) -> Dict:
                """识别用户意图"""
                import random
                # 模拟多意图预测
                scores = [random.random() for _ in range(len(self.intents))]
                total = sum(scores)
                probs = [s / total for s in scores]

                # 排序获取top-k意图
                intent_scores = list(zip(self.intents, probs))
                intent_scores.sort(key=lambda x: x[1], reverse=True)

                return {
                    'top_intents': intent_scores[:3],
                    'all_scores': dict(intent_scores)
                }

            def postprocess(self, prediction: Dict) -> Dict:
                """格式化意图识别结果"""
                top_intent, top_score = prediction['top_intents'][0]

                # 判断是否需要澄清
                need_clarification = top_score < self.threshold

                return {
                    'primary_intent': top_intent,
                    'confidence': top_score,
                    'need_clarification': need_clarification,
                    'alternative_intents': [
                        {'intent': intent, 'score': score}
                        for intent, score in prediction['top_intents'][1:3]
                    ]
                }

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("文本分类应用场景演示")
            print("="*70)

            # 1. 情感分析
            print("\n【场景1: 情感分析】")
            sentiment_app = SentimentAnalysisApp()
            text1 = "这个产品真的太好用了,强烈推荐!"
            result1 = sentiment_app.inference(text1)
            print(f"文本: {text1}")
            print(f"情感: {result1['sentiment']}")
            print(f"置信度: {result1['confidence']:.4f}")
            print(f"各类别得分: {result1['scores']}")

            # 2. 垃圾邮件检测
            print("\n【场景2: 垃圾邮件检测】")
            spam_app = SpamDetectionApp()
            text2 = "恭喜您中奖100万,请点击链接领取..."
            result2 = spam_app.inference(text2)
            print(f"文本: {text2}")
            print(f"分类: {result2['classification']}")
            print(f"垃圾邮件概率: {result2['spam_probability']:.4f}")
            print(f"处理动作: {result2['action']}")

            # 3. 意图识别
            print("\n【场景3: 意图识别】")
            intent_app = IntentRecognitionApp()
            text3 = "明天北京天气怎么样"
            result3 = intent_app.inference(text3)
            print(f"文本: {text3}")
            print(f"主要意图: {result3['primary_intent']}")
            print(f"置信度: {result3['confidence']:.4f}")
            print(f"需要澄清: {'是' if result3['need_clarification'] else '否'}")

            # 性能统计
            print("\n" + "="*70)
            print("性能统计")
            print("="*70)

            for app, name in [(sentiment_app, "情感分析"),
                             (spam_app, "垃圾邮件检测"),
                             (intent_app, "意图识别")]:
                stats = app.get_statistics()
                print(f"\n{name}:")
                print(f"  总预测次数: {stats['total_predictions']}")
                print(f"  平均推理时间: {stats['avg_inference_time']*1000:.2f}ms")
                print(f"  吞吐量: {stats['predictions_per_second']:.2f} 次/秒")
        ---
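The `predict` methods above return random scores as placeholders. As a minimal sketch of the missing step, assuming a real model returns one raw score (logit) per class, the logits could be converted into the same `{'label', 'probabilities'}` dictionary that `postprocess` consumes. `softmax` and `logits_to_prediction` are hypothetical helper names introduced here for illustration; they are not part of the course code.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits: List[float]) -> Dict:
    """Build the dictionary shape that postprocess() expects (hypothetical helper)."""
    probs = softmax(logits)
    # Index of the highest probability is the predicted class
    label = max(range(len(probs)), key=probs.__getitem__)
    return {'label': label, 'probabilities': probs}


if __name__ == "__main__":
    # Made-up logits for the classes 0 = 负面, 1 = 中性, 2 = 正面
    prediction = logits_to_prediction([-1.2, 0.3, 2.5])
    print(prediction['label'])
    print([round(p, 4) for p in prediction['probabilities']])
```

With a helper like this, only the body of `predict` would need to change; `inference`, `postprocess`, and the statistics would stay as they are.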

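2.2 Data Preparation

01.Data Collection and Annotation
    a.Description
        High-quality training data is the key to successful text classification. Collection channels include
        public datasets (such as IMDb and AG News), web crawlers, user-generated content, and business system logs.
        Annotation approaches include manual labeling, crowdsourcing, semi-automatic labeling, and weak supervision.
        Manual labeling requires a clear annotation guideline; to verify consistency, several annotators usually
        label the same samples and an agreement coefficient (Kappa) is computed. Annotation cost is the main
        challenge, and active learning can be used to select the most valuable samples for labeling.
        Data quality checks include duplicate detection, noisy-label identification, class-distribution analysis,
        and text-length statistics. A train/validation/test split of 7:1.5:1.5 (70%/15%/15%) is recommended,
        with stratified sampling to keep the class distribution consistent across the splits. For imbalanced data,
        options include oversampling (for example SMOTE, which operates on numeric feature vectors rather than
        raw text), undersampling, and class-weight adjustment.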
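The 7:1.5:1.5 stratified split described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the course's own implementation: `stratified_split` is a hypothetical helper, the ratios and seed are its assumed defaults, and rounding means very small classes may not land exactly on the target proportions.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int],
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
    seed: int = 42,
) -> Dict[str, List[int]]:
    """Return train/val/test index lists that keep each class's share roughly equal."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {'train': [], 'val': [], 'test': []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(round(len(indices) * ratios[0]))
        n_val = int(round(len(indices) * ratios[1]))
        splits['train'].extend(indices[:n_train])
        splits['val'].extend(indices[n_train:n_train + n_val])
        # Whatever remains goes to the test split
        splits['test'].extend(indices[n_train + n_val:])
    return splits


if __name__ == "__main__":
    # Imbalanced toy labels: 100 / 40 / 20 samples per class
    labels = [0] * 100 + [1] * 40 + [2] * 20
    splits = stratified_split(labels)
    for name, indices in splits.items():
        print(name, len(indices), dict(Counter(labels[i] for i in indices)))
```

With scikit-learn available, a similar result is usually obtained by calling `train_test_split` twice with its `stratify` argument, first to carve out the training set and then to divide the remainder into validation and test sets.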
    b.Code example
        ---
        # 数据采集、清洗和标注工具
        import pandas as pd
        import numpy as np
        from sklearn.model_selection import train_test_split
        from collections import Counter
        import re
        from typing import List, Tuple, Dict
        import jieba
        from tqdm import tqdm

        class TextDataPreprocessor:
            """文本数据预处理器"""

            def __init__(self):
                self.stopwords = set()
                self.load_stopwords()

            def load_stopwords(self):
                """加载停用词表"""
                # 常见中文停用词
                common_stopwords = [
                    '的', '了', '在', '是', '我', '有', '和', '就', '不', '人',
                    '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去',
                    '你', '会', '着', '没有', '看', '好', '自己', '这', '那'
                ]
                self.stopwords = set(common_stopwords)

            def clean_text(self, text: str) -> str:
                """
                清洗文本

                Args:
                    text: 原始文本

                Returns:
                    清洗后的文本
                """
                if not isinstance(text, str):
                    return ""

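                # Remove URLs (only characters allowed by RFC 3986, so CJK text after a URL is kept)
                text = re.sub(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+", '', text)

                # Remove email addresses (ASCII-only classes; \w would also match CJK characters)
                text = re.sub(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', '', text)

                # Remove HTML tags
                text = re.sub(r'<[^>]+>', '', text)

                # Remove special characters; keep word characters (incl. CJK), whitespace,
                # and basic ASCII and full-width punctuation
                text = re.sub(r'[^\w\s.,!?;:\uff0c\u3002\uff01\uff1f\uff1b\uff1a]', '', text)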

                # 统一空白字符
                text = re.sub(r'\s+', ' ', text)

                # 去除首尾空格
                text = text.strip()

                return text

            def remove_duplicates(self, texts: List[str], labels: List[int]) -> Tuple[List[str], List[int]]:
                """
                移除重复样本

                Args:
                    texts: 文本列表
                    labels: 标签列表

                Returns:
                    去重后的文本和标签
                """
                # Keep the first occurrence of each text; a later duplicate is dropped
                # even if its label differs
                seen = set()
                unique_texts = []
                unique_labels = []

                for text, label in zip(texts, labels):
                    if text not in seen:
                        seen.add(text)
                        unique_texts.append(text)
                        unique_labels.append(label)

                removed_count = len(texts) - len(unique_texts)
                print(f"移除了 {removed_count} 个重复样本")

                return unique_texts, unique_labels

            def filter_by_length(self, texts: List[str], labels: List[int],
                                min_length: int = 5, max_length: int = 512) -> Tuple[List[str], List[int]]:
                """
                根据长度过滤文本

                Args:
                    texts: 文本列表
                    labels: 标签列表
                    min_length: 最小字符数
                    max_length: 最大字符数

                Returns:
                    过滤后的文本和标签
                """
                filtered_texts = []
                filtered_labels = []

                for text, label in zip(texts, labels):
                    text_len = len(text)
                    if min_length <= text_len <= max_length:
                        filtered_texts.append(text)
                        filtered_labels.append(label)

                removed_count = len(texts) - len(filtered_texts)
                print(f"根据长度过滤,移除了 {removed_count} 个样本")

                return filtered_texts, filtered_labels

            def balance_dataset(self, texts: List[str], labels: List[int],
                              strategy: str = 'oversample') -> Tuple[List[str], List[int]]:
                """
                平衡数据集

                Args:
                    texts: 文本列表
                    labels: 标签列表
                    strategy: 平衡策略 ('oversample' 或 'undersample')

                Returns:
                    平衡后的文本和标签
                """
                # 统计类别分布
                label_counts = Counter(labels)
                print(f"原始类别分布: {dict(label_counts)}")

                if strategy == 'oversample':
                    # 过采样到最大类别数量
                    max_count = max(label_counts.values())
                    balanced_texts = []
                    balanced_labels = []

                    for label in label_counts.keys():
                        # 获取该类别的所有样本
                        class_texts = [t for t, l in zip(texts, labels) if l == label]
                        class_labels = [l for l in labels if l == label]

                        # 过采样
                        n_samples = len(class_texts)
                        n_repeats = max_count // n_samples
                        n_extra = max_count % n_samples

                        balanced_texts.extend(class_texts * n_repeats)
                        balanced_texts.extend(class_texts[:n_extra])
                        balanced_labels.extend(class_labels * n_repeats)
                        balanced_labels.extend(class_labels[:n_extra])

                elif strategy == 'undersample':
                    # 欠采样到最小类别数量
                    min_count = min(label_counts.values())
                    balanced_texts = []
                    balanced_labels = []

                    np.random.seed(42)
                    for label in label_counts.keys():
                        # 获取该类别的所有样本
                        class_texts = [t for t, l in zip(texts, labels) if l == label]
                        class_labels = [l for l in labels if l == label]

                        # 随机采样
                        indices = np.random.choice(len(class_texts), min_count, replace=False)
                        balanced_texts.extend([class_texts[i] for i in indices])
                        balanced_labels.extend([class_labels[i] for i in indices])
                else:
                    raise ValueError(f"未知的平衡策略: {strategy}")

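                # Shuffle texts and labels together with a seeded generator so the
                # result is reproducible for both strategies
                rng = np.random.default_rng(42)
                order = rng.permutation(len(balanced_texts))
                balanced_texts = [balanced_texts[i] for i in order]
                balanced_labels = [balanced_labels[i] for i in order]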

                new_label_counts = Counter(balanced_labels)
                print(f"平衡后类别分布: {dict(new_label_counts)}")

                return list(balanced_texts), list(balanced_labels)

            def tokenize(self, texts: List[str], language: str = 'zh') -> List[List[str]]:
                """
                文本分词

                Args:
                    texts: 文本列表
                    language: 语言 ('zh' 或 'en')

                Returns:
                    分词后的文本列表
                """
                tokenized_texts = []

                print("正在分词...")
                for text in tqdm(texts):
                    if language == 'zh':
                        # 中文分词
                        tokens = jieba.lcut(text)
                    else:
                        # 英文分词(简单空格切分)
                        tokens = text.lower().split()

                    # 移除停用词
                    tokens = [t for t in tokens if t not in self.stopwords and len(t) > 1]
                    tokenized_texts.append(tokens)

                return tokenized_texts

        class DataAnnotationHelper:
            """数据标注辅助工具"""

            def __init__(self, label_schema: Dict[int, str]):
                """
                初始化标注助手

                Args:
                    label_schema: 标签映射 {标签ID: 标签名称}
                """
                self.label_schema = label_schema
                self.annotations = []

            def calculate_kappa(self, annotations1: List[int],
                              annotations2: List[int]) -> float:
                """
                计算Cohen's Kappa系数(标注一致性)

                Args:
                    annotations1: 标注者1的标注结果
                    annotations2: 标注者2的标注结果

                Returns:
                    Kappa系数
                """
                from sklearn.metrics import cohen_kappa_score

                kappa = cohen_kappa_score(annotations1, annotations2)
                print(f"Cohen's Kappa: {kappa:.4f}")

                if kappa < 0:
                    print("一致性: 很差(小于0)")
                elif kappa < 0.2:
                    print("一致性: 轻微")
                elif kappa < 0.4:
                    print("一致性: 一般")
                elif kappa < 0.6:
                    print("一致性: 中等")
                elif kappa < 0.8:
                    print("一致性: 较好")
                else:
                    print("一致性: 几乎完全一致")

                return kappa

            def majority_voting(self, multi_annotations: List[List[int]]) -> List[int]:
                """
                多个标注者的投票

                Args:
                    multi_annotations: 多个标注者的标注结果列表

                Returns:
                    投票后的最终标签
                """
                n_samples = len(multi_annotations[0])
                final_labels = []

                for i in range(n_samples):
                    # 获取该样本的所有标注
                    votes = [annotations[i] for annotations in multi_annotations]

                    # 统计投票
                    vote_counts = Counter(votes)
                    most_common = vote_counts.most_common(1)[0]

                    final_labels.append(most_common[0])

                return final_labels

            def identify_difficult_samples(self, multi_annotations: List[List[int]],
                                         threshold: float = 0.5) -> List[int]:
                """
                识别难以标注的样本(标注者分歧大)

                Args:
                    multi_annotations: 多个标注者的标注结果
                    threshold: 分歧阈值

                Returns:
                    难样本的索引列表
                """
                n_samples = len(multi_annotations[0])
                n_annotators = len(multi_annotations)
                difficult_indices = []

                for i in range(n_samples):
                    votes = [annotations[i] for annotations in multi_annotations]
                    vote_counts = Counter(votes)
                    max_agree = max(vote_counts.values())

                    # 如果一致性低于阈值,标记为难样本
                    agreement_ratio = max_agree / n_annotators
                    if agreement_ratio < threshold:
                        difficult_indices.append(i)

                print(f"识别出 {len(difficult_indices)} 个难标注样本 "
                      f"({len(difficult_indices)/n_samples*100:.2f}%)")

                return difficult_indices

        # 使用示例
        if __name__ == "__main__":
            # 1. 数据清洗示例
            print("="*70)
            print("数据预处理示例")
            print("="*70 + "\n")

            preprocessor = TextDataPreprocessor()

            # 模拟原始数据
            raw_texts = [
                "这个产品真的很好用!http://example.com 推荐购买",
                "质量太差了,<b>强烈</b>不推荐!!!",
                "一般般吧,价格还可以",
                "这个产品真的很好用!http://example.com 推荐购买",  # 重复
                "非常满意,会回购的 ❤️❤️❤️",
                "差",  # 太短
                "服务态度很好,物流也快,产品质量不错,总体来说很满意" * 20  # 太长
            ]

            # Labels follow the schema used later in this script: 0 = 负面, 1 = 中性, 2 = 正面
            labels = [2, 0, 1, 2, 2, 0, 2]

            # 清洗文本
            print("清洗前后对比:")
            for i, text in enumerate(raw_texts[:3]):
                cleaned = preprocessor.clean_text(text)
                print(f"原始: {text}")
                print(f"清洗: {cleaned}\n")

            # 批量清洗
            cleaned_texts = [preprocessor.clean_text(t) for t in raw_texts]

            # 去重
            print("\n去重处理:")
            unique_texts, unique_labels = preprocessor.remove_duplicates(
                cleaned_texts, labels
            )

            # 长度过滤
            print("\n长度过滤:")
            filtered_texts, filtered_labels = preprocessor.filter_by_length(
                unique_texts, unique_labels, min_length=5, max_length=100
            )

            # 2. 数据平衡示例
            print("\n" + "="*70)
            print("数据平衡示例")
            print("="*70 + "\n")

            # 构造不平衡数据集
            imbalanced_texts = ["正面样本"] * 100 + ["负面样本"] * 20
            imbalanced_labels = [1] * 100 + [0] * 20

            # 过采样平衡
            balanced_texts, balanced_labels = preprocessor.balance_dataset(
                imbalanced_texts, imbalanced_labels, strategy='oversample'
            )

            # 3. 标注一致性检查
            print("\n" + "="*70)
            print("标注一致性检查")
            print("="*70 + "\n")

            helper = DataAnnotationHelper({0: "负面", 1: "中性", 2: "正面"})

            # 模拟两个标注者的标注结果
            annotator1 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
            annotator2 = [0, 1, 2, 0, 1, 1, 0, 2, 2, 0]  # 有些不一致

            # 计算Kappa
            kappa = helper.calculate_kappa(annotator1, annotator2)

            # 4. 多标注者投票
            print("\n多标注者投票:")
            annotator3 = [0, 1, 2, 0, 1, 2, 0, 1, 1, 0]
            multi_annotations = [annotator1, annotator2, annotator3]

            final_labels = helper.majority_voting(multi_annotations)
            print(f"投票结果: {final_labels}")

            # 识别难样本
            difficult_samples = helper.identify_difficult_samples(
                multi_annotations, threshold=0.7
            )
            print(f"难样本索引: {difficult_samples}")
        ---
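`calculate_kappa` above delegates to scikit-learn. To make the statistic less of a black box, the sketch below computes Cohen's Kappa by hand as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. `cohen_kappa` is a hypothetical helper written for this illustration, and its handling of the degenerate case (p_e = 1) is a convention chosen for this sketch.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's Kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty sample list")
    n = len(a)

    # Observed agreement: share of samples with identical labels
    p_o = sum(x == y for x, y in zip(a, b)) / n

    # Chance agreement: product of the two annotators' marginal label frequencies
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label throughout; Kappa is undefined,
        # and returning 1.0 is a convention chosen for this sketch
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Same toy annotations as in the example above
    annotator1 = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
    annotator2 = [0, 1, 2, 0, 1, 1, 0, 2, 2, 0]
    print(f"{cohen_kappa(annotator1, annotator2):.4f}")  # 0.6970
```

Here the annotators agree on 8 of 10 samples (p_o = 0.8) and the chance agreement is p_e = 0.34, so κ = 0.46 / 0.66 ≈ 0.697, which falls in the 0.6 to 0.8 band of the interpretation table used in `calculate_kappa`.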

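02.Data Augmentation Techniques
    a.Description
        Data augmentation expands the training data and can improve model generalization. Common methods include:
        back-translation (Chinese → English → Chinese), which introduces paraphrase diversity;
        synonym replacement, which looks up near-synonyms with word vectors or WordNet;
        random insertion, deletion, and swapping of words, which add perturbation;
        context-aware replacement, which uses a model such as BERT to predict masked words;
        EDA (Easy Data Augmentation), which combines several of these simple operations;
        and adversarial example generation, which improves robustness. Text generation models (such as GPT)
        can also be used to produce pseudo-samples.
        Augmentation must preserve the meaning of the text so that the original label stays valid.
        Augmented data should be mixed with the original data for training, typically at a ratio between 1:1 and 1:3.
        Augmentation is especially effective in low-resource settings. Note that over-augmentation can introduce
        noise and lower data quality, so its effect should be evaluated on the validation set.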
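One way to keep the mixing ratio under control is to cap the number of augmented samples relative to the originals before training. The sketch below is illustrative only: `mix_with_cap` is a hypothetical helper, and it assumes the ratio is read as original:augmented, so `max_ratio=1.0` corresponds to 1:1 and `max_ratio=3.0` to 1:3.

```python
import random
from typing import List, Tuple


def mix_with_cap(
    originals: List[Tuple[str, int]],
    augmented: List[Tuple[str, int]],
    max_ratio: float = 1.0,
    seed: int = 42,
) -> List[Tuple[str, int]]:
    """Combine original and augmented (text, label) pairs, keeping at most
    max_ratio augmented samples per original sample."""
    rng = random.Random(seed)
    limit = int(len(originals) * max_ratio)

    # Drop augmented texts identical to an original (they add no new information)
    original_texts = {text for text, _ in originals}
    candidates = [pair for pair in augmented if pair[0] not in original_texts]

    # Randomly keep at most `limit` augmented samples
    kept = candidates if len(candidates) <= limit else rng.sample(candidates, limit)

    mixed = originals + kept
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    originals = [("服务态度太差了", 0), ("环境不错,价格合理", 1)]
    augmented = [
        ("服务态度太差了", 0),        # identical to an original, filtered out
        ("服务态度太糟了", 0),
        ("态度服务太差了", 0),
        ("环境不错,价钱合理", 1),
        ("环境很好,价格合理", 1),
    ]
    mixed = mix_with_cap(originals, augmented, max_ratio=1.0)
    print(len(mixed))
```

Only the training split should be mixed this way; keeping the validation set free of augmented samples is what makes it usable for judging whether the augmentation helps.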
    b.Code example
        ---
        # 文本数据增强工具集
        import random
        import jieba
        import numpy as np
        from typing import List, Tuple
        import synonyms  # 需要安装: pip install synonyms

        class TextAugmenter:
            """文本数据增强器"""

            def __init__(self, language: str = 'zh'):
                """
                初始化增强器

                Args:
                    language: 语言类型 ('zh' 或 'en')
                """
                self.language = language

                # 停用词(不进行替换的词)
                self.stopwords = set([
                    '的', '了', '在', '是', '我', '有', '和', '就',
                    '不', '人', '都', '一', '一个', '上', '也'
                ])

            def synonym_replacement(self, text: str, n: int = 2) -> str:
                """
                同义词替换

                Args:
                    text: 原始文本
                    n: 替换词语数量

                Returns:
                    增强后的文本
                """
                words = jieba.lcut(text)

                # 找出可以替换的词
                replaceable_words = [w for w in words
                                    if w not in self.stopwords and len(w) > 1]

                if len(replaceable_words) == 0:
                    return text

                # 随机选择n个词进行替换
                n_replace = min(n, len(replaceable_words))
                words_to_replace = random.sample(replaceable_words, n_replace)

                new_words = words.copy()
                for word in words_to_replace:
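                    # Look up near-synonyms; nearby() returns (words, scores)
                    syns = synonyms.nearby(word)[0]

                    # Keep only candidates that differ from the original word
                    candidates = [s for s in syns[:5] if s != word]
                    if candidates:
                        synonym = random.choice(candidates)

                        # Replace every occurrence of the word
                        new_words = [synonym if w == word else w for w in new_words]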

                return ''.join(new_words)

            def random_insertion(self, text: str, n: int = 2) -> str:
                """
                随机插入

                Args:
                    text: 原始文本
                    n: 插入词语数量

                Returns:
                    增强后的文本
                """
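                words = jieba.lcut(text)

                # Nothing to draw from for empty input
                if not words:
                    return text

                for _ in range(n):
                    # Pick a random word
                    word = random.choice(words)

                    # Look up near-synonyms, excluding the word itself
                    candidates = [s for s in synonyms.nearby(word)[0][:5] if s != word]

                    if candidates:
                        synonym = random.choice(candidates)

                        # Insert at a random position
                        insert_pos = random.randint(0, len(words))
                        words.insert(insert_pos, synonym)

                return ''.join(words)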

            def random_swap(self, text: str, n: int = 2) -> str:
                """
                随机交换

                Args:
                    text: 原始文本
                    n: 交换次数

                Returns:
                    增强后的文本
                """
                words = jieba.lcut(text)

                if len(words) < 2:
                    return text

                for _ in range(n):
                    # 随机选择两个位置
                    idx1, idx2 = random.sample(range(len(words)), 2)

                    # 交换
                    words[idx1], words[idx2] = words[idx2], words[idx1]

                return ''.join(words)

            def random_deletion(self, text: str, p: float = 0.1) -> str:
                """
                随机删除

                Args:
                    text: 原始文本
                    p: 删除概率

                Returns:
                    增强后的文本
                """
                words = jieba.lcut(text)

                # Nothing to delete for empty or single-word input
                if len(words) <= 1:
                    return text

                # 随机删除词语
                new_words = []
                for word in words:
                    if random.random() > p:
                        new_words.append(word)

                # 如果全被删除了,随机返回一个词
                if len(new_words) == 0:
                    return random.choice(words)

                return ''.join(new_words)

            def eda(self, text: str, alpha: float = 0.1, num_aug: int = 4) -> List[str]:
                """
                EDA: Easy Data Augmentation
                组合使用多种增强方法

                Args:
                    text: 原始文本
                    alpha: 增强强度
                    num_aug: 生成增强样本数量

                Returns:
                    增强后的文本列表
                """
                augmented_texts = []

                # 计算操作次数
                words = jieba.lcut(text)
                n_ops = max(1, int(alpha * len(words)))

                for _ in range(num_aug):
                    # 随机选择一种增强方法
                    aug_method = random.choice([
                        lambda t: self.synonym_replacement(t, n_ops),
                        lambda t: self.random_insertion(t, n_ops),
                        lambda t: self.random_swap(t, n_ops),
                        lambda t: self.random_deletion(t, alpha)
                    ])

                    augmented_text = aug_method(text)
                    augmented_texts.append(augmented_text)

                return augmented_texts

            def back_translation(self, text: str,
                                intermediate_lang: str = 'en') -> str:
                """
                回译增强(需要翻译API)

                Args:
                    text: 原始文本
                    intermediate_lang: 中间语言

                Returns:
                    回译后的文本
                """
                # 注意:这里需要实际的翻译API
                # 这里仅作示例,实际使用需要接入翻译服务
                print("回译需要翻译API支持,这里返回原文")
                return text

        class AugmentationPipeline:
            """数据增强管道"""

            def __init__(self, augmenter: TextAugmenter):
                self.augmenter = augmenter

            def augment_dataset(self, texts: List[str], labels: List[int],
                              aug_per_sample: int = 2) -> Tuple[List[str], List[int]]:
                """
                对整个数据集进行增强

                Args:
                    texts: 文本列表
                    labels: 标签列表
                    aug_per_sample: 每个样本生成的增强样本数

                Returns:
                    增强后的文本和标签
                """
                augmented_texts = []
                augmented_labels = []

                print(f"正在增强数据集...")
                for text, label in zip(texts, labels):
                    # 保留原始样本
                    augmented_texts.append(text)
                    augmented_labels.append(label)

                    # 生成增强样本
                    try:
                        aug_texts = self.augmenter.eda(text, num_aug=aug_per_sample)

                        for aug_text in aug_texts:
                            augmented_texts.append(aug_text)
                            augmented_labels.append(label)
                    except Exception as e:
                        # 如果增强失败,跳过
                        print(f"增强失败: {text[:20]}... 错误: {e}")
                        continue

                print(f"增强完成: {len(texts)} -> {len(augmented_texts)} 样本")
                return augmented_texts, augmented_labels

            def selective_augmentation(self, texts: List[str], labels: List[int],
                                     target_class: int,
                                     aug_per_sample: int = 3) -> Tuple[List[str], List[int]]:
                """
                选择性增强(只增强特定类别)

                Args:
                    texts: 文本列表
                    labels: 标签列表
                    target_class: 目标类别
                    aug_per_sample: 每个样本生成的增强样本数

                Returns:
                    增强后的文本和标签
                """
                augmented_texts = texts.copy()
                augmented_labels = labels.copy()

                # 只对目标类别进行增强
                target_texts = [t for t, l in zip(texts, labels) if l == target_class]

                print(f"选择性增强类别 {target_class},共 {len(target_texts)} 个样本")

                for text in target_texts:
                    try:
                        aug_texts = self.augmenter.eda(text, num_aug=aug_per_sample)

                        for aug_text in aug_texts:
                            augmented_texts.append(aug_text)
                            augmented_labels.append(target_class)
                    except Exception as e:
                        # Skip samples whose augmentation fails, but report them
                        print(f"增强失败: {text[:20]}... 错误: {e}")
                        continue

                print(f"增强完成: {len(texts)} -> {len(augmented_texts)} 样本")
                return augmented_texts, augmented_labels

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("文本数据增强示例")
            print("="*70 + "\n")

            # 创建增强器
            augmenter = TextAugmenter(language='zh')

            # 原始文本
            original_text = "这部电影的剧情非常精彩,演员表演也很出色"

            print(f"原始文本: {original_text}\n")

            # 1. 同义词替换
            print("【同义词替换】")
            for i in range(3):
                aug_text = augmenter.synonym_replacement(original_text, n=2)
                print(f"  {i+1}. {aug_text}")

            # 2. 随机插入
            print("\n【随机插入】")
            for i in range(3):
                aug_text = augmenter.random_insertion(original_text, n=2)
                print(f"  {i+1}. {aug_text}")

            # 3. 随机交换
            print("\n【随机交换】")
            for i in range(3):
                aug_text = augmenter.random_swap(original_text, n=2)
                print(f"  {i+1}. {aug_text}")

            # 4. 随机删除
            print("\n【随机删除】")
            for i in range(3):
                aug_text = augmenter.random_deletion(original_text, p=0.15)
                print(f"  {i+1}. {aug_text}")

            # 5. EDA组合增强
            print("\n【EDA组合增强】")
            eda_texts = augmenter.eda(original_text, alpha=0.1, num_aug=5)
            for i, aug_text in enumerate(eda_texts):
                print(f"  {i+1}. {aug_text}")

            # 6. 批量增强
            print("\n" + "="*70)
            print("批量数据增强")
            print("="*70 + "\n")

            # 模拟数据集
            sample_texts = [
                "这个餐厅的菜品很美味",
                "服务态度太差了",
                "环境不错,价格合理"
            ]
            sample_labels = [1, 0, 1]

            # 创建增强管道
            pipeline = AugmentationPipeline(augmenter)

            # 增强数据集
            aug_texts, aug_labels = pipeline.augment_dataset(
                sample_texts, sample_labels, aug_per_sample=2
            )

            print(f"\n原始数据: {len(sample_texts)} 样本")
            print(f"增强后数据: {len(aug_texts)} 样本")
            print(f"\n增强样本示例:")
            for i in range(min(10, len(aug_texts))):
                print(f"  [{aug_labels[i]}] {aug_texts[i]}")
        ---
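The `back_translation` method above is a stub that returns its input. A minimal sketch of how it could be wired up is to pass the translation call in as a function, so no particular translation service is assumed. The `TranslateFn` signature, the `back_translate` helper, and the toy lookup table below are all hypothetical and exist only for illustration.

```python
from typing import Callable

# Hypothetical signature: translate(text, source_lang, target_lang) -> str
TranslateFn = Callable[[str, str, str], str]


def back_translate(text: str, translate: TranslateFn,
                   source_lang: str = 'zh', pivot_lang: str = 'en') -> str:
    """Translate to a pivot language and back; fall back to the input on failure."""
    try:
        pivot_text = translate(text, source_lang, pivot_lang)
        result = translate(pivot_text, pivot_lang, source_lang)
    except Exception:
        # A failed translation call should not abort an augmentation run
        return text
    # An empty result carries no information, so keep the original
    return result if result.strip() else text


if __name__ == "__main__":
    # Toy stand-in for a translation service, for demonstration only
    toy_table = {
        ('zh', 'en', '这部电影很精彩'): 'This movie is wonderful',
        ('en', 'zh', 'This movie is wonderful'): '这部影片非常棒',
    }

    def toy_translate(text: str, src: str, tgt: str) -> str:
        return toy_table[(src, tgt, text)]

    print(back_translate('这部电影很精彩', toy_translate))
    print(back_translate('未收录的句子', toy_translate))  # falls back to the input
```

Injecting the translation function keeps the augmenter testable offline and makes it easy to swap services. Because a round trip can shift the meaning, back-translated samples are worth spot-checking to confirm that the original label still applies.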

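03.Feature Engineering
    a.Description
        Traditional machine learning methods rely on hand-crafted features: the bag-of-words model counts term
        frequencies, TF-IDF up-weights terms that are frequent in a document but rare across the corpus,
        and n-grams capture local word order. With deep learning, feature engineering shifted to word embeddings
        (Word2Vec, GloVe, FastText) that learn semantic representations.
        Pre-trained language models (BERT, RoBERTa) provide context-dependent word vectors and improve performance markedly.
        Feature engineering also covers statistical features such as text length, special-character counts, and
        uppercase-letter ratio; domain features such as sentiment-lexicon matches and topic models (LDA);
        and linguistic features such as dependency parses and part-of-speech tags. In current practice,
        a pre-trained model combined with a small number of hand-crafted features is a common setup.
        Feature selection can use the chi-square test, mutual information, or recursive feature elimination.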
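To make the TF-IDF weighting concrete, the sketch below computes it by hand for pre-tokenized documents. It is an illustration under stated assumptions rather than a replacement for a library vectorizer: `tfidf` is a hypothetical helper, term frequency is taken as count divided by document length, and the inverse document frequency uses the smoothed form ln((1 + N) / (1 + df)) + 1, which is one of several definitions in use.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute TF-IDF weights for pre-tokenized documents.

    tf  = term count / document length
    idf = ln((1 + N) / (1 + df)) + 1   (a smoothed variant; other definitions exist)
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = [
        ["物流", "很", "快"],
        ["物流", "太", "慢"],
        ["质量", "很", "好"],
    ]
    for doc_weights in tfidf(docs):
        print({term: round(w, 3) for term, w in doc_weights.items()})
```

In the first document, "快" occurs in only one document and therefore receives a higher weight than "物流", which occurs in two. The numbers will differ from those of scikit-learn's `TfidfVectorizer`, which uses raw counts for the term frequency and L2-normalizes each vector by default.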
    b.Code example
        ---
        # 文本特征工程工具
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
        from typing import List, Dict, Tuple
        import jieba
        from collections import Counter

        class TextFeatureExtractor:
            """文本特征提取器"""

            def __init__(self):
                self.tfidf_vectorizer = None
                self.count_vectorizer = None
                self.vocab = None

            def extract_statistical_features(self, texts: List[str]) -> np.ndarray:
                """
                提取统计特征

                Args:
                    texts: 文本列表

                Returns:
                    特征矩阵 (n_samples, n_features)
                """
                features = []

                for text in texts:
                    feature_dict = {}

                    # 基本长度特征
                    feature_dict['char_length'] = len(text)
                    feature_dict['word_count'] = len(jieba.lcut(text))

                    # Punctuation features (count both half-width and full-width forms)
                    feature_dict['exclamation_count'] = text.count('!') + text.count('!')
                    feature_dict['question_count'] = text.count('?') + text.count('?')
                    feature_dict['comma_count'] = text.count(',') + text.count(',')

                    # Digit ratio and English-letter ratio
                    # (str.isalpha() is also True for Chinese characters, so restrict to ASCII)
                    digit_count = sum(c.isdigit() for c in text)
                    alpha_count = sum(c.isascii() and c.isalpha() for c in text)
                    feature_dict['digit_ratio'] = digit_count / len(text) if len(text) > 0 else 0
                    feature_dict['alpha_ratio'] = alpha_count / len(text) if len(text) > 0 else 0

                    # 大写字母比例(英文文本)
                    upper_count = sum(c.isupper() for c in text)
                    feature_dict['upper_ratio'] = upper_count / len(text) if len(text) > 0 else 0

                    # 平均词长
                    words = jieba.lcut(text)
                    avg_word_len = np.mean([len(w) for w in words]) if words else 0
                    feature_dict['avg_word_length'] = avg_word_len

                    # 唯一词比例
                    unique_words = len(set(words))
                    feature_dict['unique_word_ratio'] = unique_words / len(words) if words else 0

                    features.append(list(feature_dict.values()))

                return np.array(features)

            def extract_bow_features(self, texts: List[str],
                                    max_features: int = 5000) -> Tuple[np.ndarray, List[str]]:
                """
                提取词袋特征

                Args:
                    texts: 文本列表
                    max_features: 最大特征数

                Returns:
                    特征矩阵和词汇表
                """
                # 分词
                tokenized_texts = [' '.join(jieba.lcut(text)) for text in texts]

                # 使用CountVectorizer
                self.count_vectorizer = CountVectorizer(
                    max_features=max_features,
                    token_pattern=r'(?u)\b\w+\b'  # keep single-character tokens (the default pattern drops them)
                )

                bow_features = self.count_vectorizer.fit_transform(tokenized_texts)
                vocab = self.count_vectorizer.get_feature_names_out()

                print(f"词袋特征: {bow_features.shape}")
                print(f"词汇表大小: {len(vocab)}")

                return bow_features.toarray(), vocab

            def extract_tfidf_features(self, texts: List[str],
                                      max_features: int = 5000,
                                      ngram_range: Tuple[int, int] = (1, 2)) -> Tuple[np.ndarray, List[str]]:
                """
                提取TF-IDF特征

                Args:
                    texts: 文本列表
                    max_features: 最大特征数
                    ngram_range: N-gram范围

                Returns:
                    特征矩阵和词汇表
                """
                # 分词
                tokenized_texts = [' '.join(jieba.lcut(text)) for text in texts]

                # 使用TfidfVectorizer
                self.tfidf_vectorizer = TfidfVectorizer(
                    max_features=max_features,
                    ngram_range=ngram_range,
                    token_pattern=r'(?u)\b\w+\b'
                )

                tfidf_features = self.tfidf_vectorizer.fit_transform(tokenized_texts)
                vocab = self.tfidf_vectorizer.get_feature_names_out()

                print(f"TF-IDF特征: {tfidf_features.shape}")
                print(f"词汇表大小: {len(vocab)}")
                print(f"N-gram范围: {ngram_range}")

                return tfidf_features.toarray(), vocab

            def get_top_tfidf_words(self, texts: List[str], top_n: int = 20) -> List[Tuple[str, float]]:
                """
                获取TF-IDF权重最高的词

                Args:
                    texts: 文本列表
                    top_n: 返回前N个词

                Returns:
                    (词, TF-IDF值) 列表
                """
                if self.tfidf_vectorizer is None:
                    self.extract_tfidf_features(texts)

                # 计算平均TF-IDF
                tfidf_matrix = self.tfidf_vectorizer.transform(
                    [' '.join(jieba.lcut(text)) for text in texts]
                )

                avg_tfidf = np.array(tfidf_matrix.mean(axis=0)).flatten()
                vocab = self.tfidf_vectorizer.get_feature_names_out()

                # 排序
                word_scores = [(vocab[i], avg_tfidf[i]) for i in range(len(vocab))]
                word_scores.sort(key=lambda x: x[1], reverse=True)

                return word_scores[:top_n]

            def extract_combined_features(self, texts: List[str]) -> np.ndarray:
                """
                组合多种特征

                Args:
                    texts: 文本列表

                Returns:
                    组合特征矩阵
                """
                print("提取组合特征...")

                # 统计特征
                stat_features = self.extract_statistical_features(texts)
                print(f"  统计特征: {stat_features.shape}")

                # TF-IDF特征
                tfidf_features, _ = self.extract_tfidf_features(texts, max_features=1000)
                print(f"  TF-IDF特征: {tfidf_features.shape}")

                # 拼接特征
                combined_features = np.hstack([stat_features, tfidf_features])
                print(f"  组合特征: {combined_features.shape}")

                return combined_features

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("文本特征工程示例")
            print("="*70 + "\n")

            # 示例文本
            sample_texts = [
                "这部电影真的太精彩了!强烈推荐!",
                "剧情拖沓,演技尴尬,浪费时间",
                "还可以吧,有些地方不错",
                "视觉效果很震撼,但故事一般",
                "非常失望,完全不值这个票价"
            ] * 20  # 扩展数据

            # 创建特征提取器
            extractor = TextFeatureExtractor()

            # 1. 统计特征
            print("【统计特征】")
            stat_features = extractor.extract_statistical_features(sample_texts[:3])
            print(f"特征维度: {stat_features.shape}")
            print(f"前3个样本的特征:\n{stat_features}\n")

            # 2. 词袋特征
            print("【词袋特征】")
            bow_features, bow_vocab = extractor.extract_bow_features(
                sample_texts, max_features=50
            )
            print(f"前10个词: {bow_vocab[:10]}\n")

            # 3. TF-IDF特征
            print("【TF-IDF特征】")
            tfidf_features, tfidf_vocab = extractor.extract_tfidf_features(
                sample_texts, max_features=50, ngram_range=(1, 2)
            )

            # 4. Top TF-IDF词
            print("\n【TF-IDF权重最高的词】")
            top_words = extractor.get_top_tfidf_words(sample_texts, top_n=15)
            for word, score in top_words:
                print(f"  {word:15s}: {score:.4f}")

            # 5. 组合特征
            print("\n【组合特征】")
            combined_features = extractor.extract_combined_features(sample_texts)

            # 6. Feature summary statistics (printed, not plotted)
            print("\n特征统计:")
            print(f"  样本数: {combined_features.shape[0]}")
            print(f"  特征数: {combined_features.shape[1]}")
            print(f"  特征均值: {combined_features.mean():.4f}")
            print(f"  特征标准差: {combined_features.std():.4f}")
            print(f"  特征最大值: {combined_features.max():.4f}")
            print(f"  特征最小值: {combined_features.min():.4f}")
        ---
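
The description above names the chi-square test as one feature-selection option, but the extractor does not show it. The following is a minimal sketch, assuming binary labels and treating each term as present or absent in a document, so every term gets a 2x2 contingency table. It is a hand-rolled statistic for illustration; scikit-learn's `chi2` works on feature values rather than presence flags and may give different numbers. The function name and the toy data are assumptions.

```python
from typing import Dict, List, Set


def chi_square_scores(docs: List[Set[str]], labels: List[int]) -> Dict[str, float]:
    """Score each term by the chi-square statistic of (term present?) vs (label == 1)."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    vocab = set().union(*docs)

    scores = {}
    for term in vocab:
        present_pos = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
        present_neg = sum(1 for doc, y in zip(docs, labels) if term in doc and y != 1)
        absent_pos = n_pos - present_pos
        absent_neg = n_neg - present_neg
        # chi2 = N * (AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)) for the 2x2 table
        denom = ((present_pos + present_neg) * (absent_pos + absent_neg)
                 * (present_pos + absent_pos) * (present_neg + absent_neg))
        diff = present_pos * absent_neg - present_neg * absent_pos
        scores[term] = n * diff ** 2 / denom if denom else 0.0
    return scores


# Toy data: hypothetical tokenized reviews, 1 = positive, 0 = negative
docs = [{"great", "plot"}, {"great", "movie"}, {"boring", "plot"}, {"boring", "movie"}]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {score:.2f}")
```

Terms that split evenly across the classes ("plot", "movie") score 0, while class-specific terms ("great", "boring") score highest; keeping only the top-k terms is the selection step.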

2.3 Model Selection

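01.Traditional Machine Learning Models
    a.Description
        Traditional machine-learning models have been largely displaced by deep learning, but they remain
        useful in some settings. Naive Bayes applies Bayes' theorem under a feature-independence assumption;
        it trains quickly and makes a good baseline. Logistic regression is a linear model with strong
        interpretability and works well on top of engineered features. Support vector machines (SVMs)
        perform well on small datasets, and the kernel trick handles non-linear problems. Ensemble methods
        such as random forests and XGBoost generalize well; random forests in particular are fairly robust
        to hyperparameter choices, while XGBoost usually benefits from tuning. These models train quickly,
        use few resources, and are comparatively easy to interpret, which makes them suitable when data is
        scarce (<10K samples) or latency requirements are very strict. Model quality depends directly on
        feature-engineering quality, and TF-IDF + XGBoost is still widely used in industry.
        Model selection has to balance accuracy, speed, interpretability, and maintenance cost.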
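
To illustrate what "Bayes' theorem plus a feature-independence assumption" means in practice, here is a small from-scratch multinomial Naive Bayes with Laplace smoothing. It is a sketch for pre-tokenized input under stated assumptions, not the scikit-learn `MultinomialNB` implementation; the class name `TinyMultinomialNB` and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-alpha smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

    def fit(self, docs: List[List[str]], labels: List[int]) -> "TinyMultinomialNB":
        self.classes = sorted(set(labels))
        self.vocab = {t for doc in docs for t in doc}
        self.log_prior: Dict[int, float] = {}
        self.log_likelihood: Dict[int, Dict[str, float]] = {}
        for c in self.classes:
            class_docs = [doc for doc, y in zip(docs, labels) if y == c]
            # log P(c): fraction of training documents in class c
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(t for doc in class_docs for t in doc)
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            # log P(t | c) with add-alpha smoothing so unseen terms keep non-zero mass
            self.log_likelihood[c] = {
                t: math.log((counts[t] + self.alpha) / total) for t in self.vocab
            }
        return self

    def predict(self, doc: List[str]) -> int:
        def score(c: int) -> float:
            # Independence assumption: per-token log-probabilities simply add up
            return self.log_prior[c] + sum(
                self.log_likelihood[c][t] for t in doc if t in self.vocab
            )
        return max(self.classes, key=score)


train_docs = [["great", "plot"], ["great", "acting"], ["boring", "plot"], ["boring", "acting"]]
train_labels = [1, 1, 0, 0]
model = TinyMultinomialNB(alpha=1.0).fit(train_docs, train_labels)

print(model.predict(["great", "movie"]))   # "movie" is out of vocabulary and ignored
print(model.predict(["boring", "plot"]))
```

Training is a single counting pass over the data, which is why Naive Bayes is so cheap to use as a baseline.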
    b.Code example
        ---
        # 传统机器学习文本分类实现
        import numpy as np
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.linear_model import LogisticRegression
        from sklearn.svm import LinearSVC
        from sklearn.ensemble import RandomForestClassifier
        from xgboost import XGBClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics import classification_report, accuracy_score
        from sklearn.model_selection import cross_val_score
        import jieba
        from typing import List, Tuple, Dict
        import time

        class TraditionalMLClassifier:
            """传统机器学习分类器集合"""

            def __init__(self, model_type: str = 'logistic'):
                """
                初始化分类器

                Args:
                    model_type: 模型类型 ('naive_bayes', 'logistic', 'svm', 'random_forest', 'xgboost')
                """
                self.model_type = model_type
                self.model = self._create_model(model_type)
                self.vectorizer = TfidfVectorizer(
                    max_features=5000,
                    ngram_range=(1, 2),
                    token_pattern=r'(?u)\b\w+\b'
                )

            def _create_model(self, model_type: str):
                """创建模型实例"""
                models = {
                    'naive_bayes': MultinomialNB(alpha=1.0),
                    'logistic': LogisticRegression(
                        max_iter=1000,
                        C=1.0,
                        class_weight='balanced',
                        random_state=42
                    ),
                    'svm': LinearSVC(
                        C=1.0,
                        max_iter=2000,
                        class_weight='balanced',
                        random_state=42
                    ),
                    'random_forest': RandomForestClassifier(
                        n_estimators=100,
                        max_depth=20,
                        class_weight='balanced',
                        random_state=42,
                        n_jobs=-1
                    ),
                    'xgboost': XGBClassifier(
                        n_estimators=100,
                        max_depth=6,
                        learning_rate=0.1,
                        random_state=42,
                        n_jobs=-1
                    )
                }

                if model_type not in models:
                    raise ValueError(f"未知模型类型: {model_type}")

                return models[model_type]

            def preprocess_texts(self, texts: List[str]) -> List[str]:
                """文本预处理"""
                processed_texts = []
                for text in texts:
                    # 分词
                    words = jieba.lcut(text)
                    # 用空格连接(sklearn要求)
                    processed_texts.append(' '.join(words))
                return processed_texts

            def train(self, train_texts: List[str], train_labels: List[int],
                     verbose: bool = True):
                """
                训练模型

                Args:
                    train_texts: 训练文本列表
                    train_labels: 训练标签列表
                    verbose: 是否打印训练信息
                """
                # 预处理
                processed_texts = self.preprocess_texts(train_texts)

                # 特征提取
                if verbose:
                    print(f"提取TF-IDF特征...")

                start_time = time.time()
                X_train = self.vectorizer.fit_transform(processed_texts)

                if verbose:
                    print(f"特征维度: {X_train.shape}")
                    print(f"特征提取时间: {time.time() - start_time:.2f}秒")

                # 训练模型
                if verbose:
                    print(f"\n训练{self.model_type}模型...")

                start_time = time.time()
                self.model.fit(X_train, train_labels)

                train_time = time.time() - start_time

                if verbose:
                    print(f"训练时间: {train_time:.2f}秒")

                # 训练集性能
                train_pred = self.model.predict(X_train)
                train_acc = accuracy_score(train_labels, train_pred)

                if verbose:
                    print(f"训练集准确率: {train_acc:.4f}")

            def predict(self, texts: List[str]) -> np.ndarray:
                """预测"""
                processed_texts = self.preprocess_texts(texts)
                X = self.vectorizer.transform(processed_texts)
                return self.model.predict(X)

            def predict_proba(self, texts: List[str]) -> np.ndarray:
                """预测概率"""
                processed_texts = self.preprocess_texts(texts)
                X = self.vectorizer.transform(processed_texts)

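                # LinearSVC has no predict_proba method
                if hasattr(self.model, 'predict_proba'):
                    return self.model.predict_proba(X)
                elif hasattr(self.model, 'decision_function'):
                    # Fall back to decision_function and squash the margins into
                    # uncalibrated pseudo-probabilities of shape (n_samples, n_classes)
                    scores = self.model.decision_function(X)
                    if scores.ndim == 1:
                        # Binary case: sigmoid of the signed margin
                        pos = 1.0 / (1.0 + np.exp(-scores))
                        return np.column_stack([1.0 - pos, pos])
                    # Multi-class case: row-wise softmax
                    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
                    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
                else:
                    raise ValueError("Model does not support probability prediction")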

            def evaluate(self, test_texts: List[str], test_labels: List[int]) -> Dict:
                """评估模型"""
                start_time = time.time()
                predictions = self.predict(test_texts)
                inference_time = time.time() - start_time

                # 计算指标
                accuracy = accuracy_score(test_labels, predictions)

                # 打印报告
                print(f"\n{'='*70}")
                print(f"{self.model_type.upper()} 模型评估结果")
                print(f"{'='*70}\n")
                print(f"测试集准确率: {accuracy:.4f}")
                print(f"推理时间: {inference_time:.4f}秒")
                # Guard against a zero elapsed time on very small test sets
                print(f"Throughput: {len(test_texts) / max(inference_time, 1e-8):.2f} samples/s\n")

                print(classification_report(test_labels, predictions, digits=4))

                return {
                    'accuracy': accuracy,
                    'inference_time': inference_time,
                    'predictions': predictions
                }

            def cross_validate(self, texts: List[str], labels: List[int],
                             cv: int = 5) -> Dict:
                """交叉验证"""
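                # Cross-validate a vectorizer + model pipeline so that TF-IDF is
                # re-fitted on each training fold (fitting it on all data first would
                # leak validation-fold statistics and overwrite the trained vectorizer)
                from sklearn.base import clone
                from sklearn.pipeline import make_pipeline

                processed_texts = self.preprocess_texts(texts)
                pipeline = make_pipeline(clone(self.vectorizer), clone(self.model))

                print(f"\nRunning {cv}-fold cross-validation...")
                scores = cross_val_score(pipeline, processed_texts, labels,
                                         cv=cv, scoring='accuracy')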

                print(f"交叉验证结果:")
                print(f"  平均准确率: {scores.mean():.4f}")
                print(f"  标准差: {scores.std():.4f}")
                print(f"  各折得分: {scores}")

                return {
                    'mean_score': scores.mean(),
                    'std_score': scores.std(),
                    'fold_scores': scores
                }

        class ModelComparison:
            """模型对比工具"""

            def __init__(self):
                self.results = {}

            def compare_models(self, train_texts: List[str], train_labels: List[int],
                             test_texts: List[str], test_labels: List[int]):
                """对比多个模型"""
                model_types = ['naive_bayes', 'logistic', 'svm', 'random_forest', 'xgboost']

                print("="*70)
                print("模型对比实验")
                print("="*70 + "\n")

                for model_type in model_types:
                    print(f"\n{'='*70}")
                    print(f"训练和评估 {model_type.upper()} 模型")
                    print(f"{'='*70}")

                    # 创建模型
                    classifier = TraditionalMLClassifier(model_type)

                    # 训练
                    train_start = time.time()
                    classifier.train(train_texts, train_labels, verbose=True)
                    train_time = time.time() - train_start

                    # 评估
                    eval_results = classifier.evaluate(test_texts, test_labels)

                    # 记录结果
                    self.results[model_type] = {
                        'train_time': train_time,
                        'inference_time': eval_results['inference_time'],
                        'accuracy': eval_results['accuracy']
                    }

                # 打印对比结果
                self.print_comparison()

            def print_comparison(self):
                """打印对比结果"""
                print("\n" + "="*70)
                print("模型性能对比汇总")
                print("="*70 + "\n")

                print(f"{'模型':<15} {'准确率':>12} {'训练时间(秒)':>15} {'推理时间(秒)':>15}")
                print("-"*70)

                # 按准确率排序
                sorted_results = sorted(self.results.items(),
                                      key=lambda x: x[1]['accuracy'],
                                      reverse=True)

                for model_type, metrics in sorted_results:
                    print(f"{model_type:<15} "
                          f"{metrics['accuracy']:>12.4f} "
                          f"{metrics['train_time']:>15.2f} "
                          f"{metrics['inference_time']:>15.4f}")

                # 最佳模型
                best_model = sorted_results[0][0]
                best_accuracy = sorted_results[0][1]['accuracy']
                print(f"\n最佳模型: {best_model.upper()} (准确率: {best_accuracy:.4f})")

        # 使用示例
        if __name__ == "__main__":
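            # Build a small synthetic Chinese sentiment dataset
            from sklearn.model_selection import train_test_split

            print("Preparing data...")

            # Note: each sentence below is repeated 100 times, so the train and test
            # splits share identical texts and the reported accuracy is illustrative only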
            positive_samples = [
                "这部电影非常精彩,值得一看",
                "演员表演出色,剧情引人入胜",
                "视觉效果震撼,音乐动人",
                "导演功力深厚,细节处理到位",
                "强烈推荐,五星好评"
            ] * 100

            negative_samples = [
                "剧情拖沓,浪费时间",
                "演技尴尬,毫无亮点",
                "无聊透顶,不推荐观看",
                "故事老套,缺乏创意",
                "完全不值票价,失望"
            ] * 100

            texts = positive_samples + negative_samples
            labels = [1] * len(positive_samples) + [0] * len(negative_samples)

            # Shuffle the data (seeded so the run is reproducible)
            indices = np.random.default_rng(42).permutation(len(texts))
            texts = [texts[i] for i in indices]
            labels = [labels[i] for i in indices]

            # 分割数据
            X_train, X_test, y_train, y_test = train_test_split(
                texts, labels, test_size=0.2, random_state=42, stratify=labels
            )

            print(f"训练集: {len(X_train)} 样本")
            print(f"测试集: {len(X_test)} 样本\n")

            # 1. 单个模型训练和评估
            print("="*70)
            print("单模型训练示例")
            print("="*70)

            classifier = TraditionalMLClassifier('xgboost')
            classifier.train(X_train, y_train)
            classifier.evaluate(X_test, y_test)

            # 2. 交叉验证
            classifier.cross_validate(X_train, y_train, cv=5)

            # 3. 模型对比
            print("\n" + "="*70)
            print("多模型对比")
            print("="*70)

            comparator = ModelComparison()
            comparator.compare_models(X_train, y_train, X_test, y_test)
        ---
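
The `evaluate` method above prints scikit-learn's `classification_report`. As a reminder of where those per-class numbers come from, here is a minimal sketch that computes precision, recall, and F1 for one class by hand. The function name `binary_prf` and the toy predictions are assumptions made for illustration.

```python
from typing import Dict, List


def binary_prf(y_true: List[int], y_pred: List[int], positive: int = 1) -> Dict[str, float]:
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0   # how many predicted positives are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many true positives were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy predictions: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_prf(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

On imbalanced data these per-class figures are usually more informative than overall accuracy, which is the only metric `ModelComparison` ranks by.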

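02.Deep Learning Model Architectures
    a.Description
        Deep learning brought major gains to text classification. TextCNN uses convolutions to capture
        local n-gram features and is fast and effective. TextRNN (LSTM/GRU) models sequential dependencies
        and suits longer text. TextRCNN combines a recurrent layer with CNN-style max pooling and performs
        well. Attention mechanisms focus on the most informative tokens and make models easier to inspect.
        FastText is a simple architecture built on word embeddings; it trains extremely fast and suits
        large-scale classification. DPCNN (deep pyramid CNN) captures long-range dependencies through
        repeated pooling. When choosing a model, consider dataset size (CNNs suit medium-scale data),
        text length (RNNs suit longer text), training-time budget (FastText is fastest), and accuracy
        requirements (combined models tend to do better). Typical hyperparameters: embedding dimension
        (100-300), hidden size (128-512), learning rate (1e-4 to 1e-3), batch size (32-128),
        and dropout rate (0.3-0.5).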
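
The core TextCNN operation described above, a convolution over a window of token embeddings followed by max-over-time pooling, can be traced by hand on a tiny input. The sketch below uses plain Python lists instead of tensors so each step is visible; the function names and the toy embeddings are illustrative assumptions, and a real model would use `nn.Conv1d` with many filters.

```python
from typing import List

Vector = List[float]


def conv1d_valid(seq: List[Vector], kernel: List[Vector], bias: float = 0.0) -> List[float]:
    """Slide one filter of height k over the sequence (no padding).

    Output length is seq_len - k + 1, one value per window position.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = bias
        for token_vec, kernel_row in zip(seq[start:start + k], kernel):
            total += sum(x * w for x, w in zip(token_vec, kernel_row))
        out.append(total)
    return out


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def max_over_time(values: List[float]) -> float:
    """Keep only the strongest response, wherever in the text it occurred."""
    return max(values)


# 4 tokens with 2-dimensional embeddings (made-up numbers)
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning k=2 consecutive tokens
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature_map = relu(conv1d_valid(embedded, kernel))
pooled = max_over_time(feature_map)
print(feature_map, pooled)
```

With kernel sizes 3, 4, and 5 and 100 filters each, a model repeats this for 300 filters and concatenates the 300 pooled values before the final linear layer.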
    b.Code example
        ---
        # 深度学习文本分类模型实现
        import torch
        import torch.nn as nn
        import torch.nn.functional as F
        from torch.utils.data import Dataset, DataLoader
        import numpy as np
        from typing import List, Dict

        class TextCNN(nn.Module):
            """TextCNN模型"""

            def __init__(self, vocab_size: int, embed_dim: int, num_classes: int,
                        kernel_sizes: List[int] = [3, 4, 5], num_filters: int = 100,
                        dropout: float = 0.5):
                """
                初始化TextCNN

                Args:
                    vocab_size: 词汇表大小
                    embed_dim: 词嵌入维度
                    num_classes: 类别数
                    kernel_sizes: 卷积核大小列表
                    num_filters: 每种卷积核的数量
                    dropout: Dropout率
                """
                super(TextCNN, self).__init__()

                # 词嵌入层
                self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

                # 多种大小的卷积层
                self.convs = nn.ModuleList([
                    nn.Conv1d(in_channels=embed_dim,
                             out_channels=num_filters,
                             kernel_size=k)
                    for k in kernel_sizes
                ])

                # Dropout
                self.dropout = nn.Dropout(dropout)

                # 全连接层
                self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

            def forward(self, x):
                """
                前向传播

                Args:
                    x: 输入张量 (batch_size, seq_len)

                Returns:
                    输出logits (batch_size, num_classes)
                """
                # 词嵌入 (batch_size, seq_len, embed_dim)
                embedded = self.embedding(x)

                # 转换为卷积需要的格式 (batch_size, embed_dim, seq_len)
                embedded = embedded.permute(0, 2, 1)

                # 多个卷积和池化操作
                conv_outputs = []
                for conv in self.convs:
                    # 卷积 (batch_size, num_filters, conv_seq_len)
                    conv_out = F.relu(conv(embedded))

                    # 最大池化 (batch_size, num_filters)
                    pooled = F.max_pool1d(conv_out, conv_out.size(2)).squeeze(2)

                    conv_outputs.append(pooled)

                # 拼接所有卷积结果 (batch_size, num_filters * len(kernel_sizes))
                cat = torch.cat(conv_outputs, dim=1)

                # Dropout
                cat = self.dropout(cat)

                # 全连接层
                logits = self.fc(cat)

                return logits

        class TextRNN(nn.Module):
            """TextRNN模型(LSTM)"""

            def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int,
                        num_classes: int, num_layers: int = 2, bidirectional: bool = True,
                        dropout: float = 0.5):
                """
                初始化TextRNN

                Args:
                    vocab_size: 词汇表大小
                    embed_dim: 词嵌入维度
                    hidden_dim: LSTM隐藏层维度
                    num_classes: 类别数
                    num_layers: LSTM层数
                    bidirectional: 是否双向
                    dropout: Dropout率
                """
                super(TextRNN, self).__init__()

                self.hidden_dim = hidden_dim
                self.num_layers = num_layers
                self.bidirectional = bidirectional

                # 词嵌入层
                self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

                # LSTM层
                self.lstm = nn.LSTM(
                    input_size=embed_dim,
                    hidden_size=hidden_dim,
                    num_layers=num_layers,
                    bidirectional=bidirectional,
                    dropout=dropout if num_layers > 1 else 0,
                    batch_first=True
                )

                # Dropout
                self.dropout = nn.Dropout(dropout)

                # 全连接层
                fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
                self.fc = nn.Linear(fc_input_dim, num_classes)

            def forward(self, x):
                """
                前向传播

                Args:
                    x: 输入张量 (batch_size, seq_len)

                Returns:
                    输出logits (batch_size, num_classes)
                """
                # 词嵌入 (batch_size, seq_len, embed_dim)
                embedded = self.embedding(x)

                # LSTM (batch_size, seq_len, hidden_dim * 2)
                lstm_out, (hidden, cell) = self.lstm(embedded)

                # 使用最后时刻的隐藏状态
                if self.bidirectional:
                    # 拼接前向和后向的最后隐藏状态
                    hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
                else:
                    hidden = hidden[-1]

                # Dropout (batch_size, hidden_dim * 2)
                hidden = self.dropout(hidden)

                # 全连接层
                logits = self.fc(hidden)

                return logits

        class TextRCNN(nn.Module):
            """TextRCNN模型(结合RNN和CNN)"""

            def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int,
                        num_classes: int, dropout: float = 0.5):
                """
                初始化TextRCNN

                Args:
                    vocab_size: 词汇表大小
                    embed_dim: 词嵌入维度
                    hidden_dim: LSTM隐藏层维度
                    num_classes: 类别数
                    dropout: Dropout率
                """
                super(TextRCNN, self).__init__()

                # 词嵌入层
                self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

                # 双向LSTM
                self.lstm = nn.LSTM(
                    input_size=embed_dim,
                    hidden_size=hidden_dim,
                    num_layers=1,
                    bidirectional=True,
                    batch_first=True
                )

                # 线性变换
                self.W = nn.Linear(embed_dim + 2 * hidden_dim, hidden_dim)

                # Dropout
                self.dropout = nn.Dropout(dropout)

                # 全连接层
                self.fc = nn.Linear(hidden_dim, num_classes)

            def forward(self, x):
                """
                前向传播

                Args:
                    x: 输入张量 (batch_size, seq_len)

                Returns:
                    输出logits (batch_size, num_classes)
                """
                # 词嵌入 (batch_size, seq_len, embed_dim)
                embedded = self.embedding(x)

                # LSTM (batch_size, seq_len, hidden_dim * 2)
                lstm_out, _ = self.lstm(embedded)

                # 拼接词嵌入和LSTM输出 (batch_size, seq_len, embed_dim + hidden_dim * 2)
                combined = torch.cat([embedded, lstm_out], dim=2)

                # 线性变换 (batch_size, seq_len, hidden_dim)
                y = torch.tanh(self.W(combined))

                # 最大池化 (batch_size, hidden_dim)
                y = torch.max(y, dim=1)[0]

                # Dropout
                y = self.dropout(y)

                # 全连接层
                logits = self.fc(y)

                return logits

        class ModelTrainer:
            """模型训练器"""

            def __init__(self, model: nn.Module, device: str = 'cpu'):
                """
                初始化训练器

                Args:
                    model: 待训练的模型
                    device: 设备 ('cpu' 或 'cuda')
                """
                self.model = model.to(device)
                self.device = device

            def train_epoch(self, train_loader: DataLoader, optimizer, criterion):
                """训练一个epoch"""
                self.model.train()
                total_loss = 0
                correct = 0
                total = 0

                for batch in train_loader:
                    # 获取数据
                    inputs = batch['input_ids'].to(self.device)
                    labels = batch['label'].to(self.device)

                    # 前向传播
                    optimizer.zero_grad()
                    outputs = self.model(inputs)
                    loss = criterion(outputs, labels)

                    # 反向传播
                    loss.backward()
                    optimizer.step()

                    # 统计
                    total_loss += loss.item()
                    predictions = torch.argmax(outputs, dim=1)
                    correct += (predictions == labels).sum().item()
                    total += labels.size(0)

                avg_loss = total_loss / len(train_loader)
                accuracy = correct / total

                return avg_loss, accuracy

            def evaluate(self, test_loader: DataLoader, criterion):
                """评估模型"""
                self.model.eval()
                total_loss = 0
                correct = 0
                total = 0

                with torch.no_grad():
                    for batch in test_loader:
                        inputs = batch['input_ids'].to(self.device)
                        labels = batch['label'].to(self.device)

                        outputs = self.model(inputs)
                        loss = criterion(outputs, labels)

                        total_loss += loss.item()
                        predictions = torch.argmax(outputs, dim=1)
                        correct += (predictions == labels).sum().item()
                        total += labels.size(0)

                avg_loss = total_loss / len(test_loader)
                accuracy = correct / total

                return avg_loss, accuracy

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("深度学习文本分类模型示例")
            print("="*70 + "\n")

            # 模型参数
            vocab_size = 10000
            embed_dim = 100
            hidden_dim = 128
            num_classes = 2
            batch_size = 32
            seq_len = 50

            # 生成随机数据(模拟)
            num_samples = 1000
            X = torch.randint(0, vocab_size, (num_samples, seq_len))
            y = torch.randint(0, num_classes, (num_samples,))

            # 1. TextCNN
            print("【TextCNN模型】")
            textcnn = TextCNN(vocab_size, embed_dim, num_classes)
            print(textcnn)
            print(f"参数量: {sum(p.numel() for p in textcnn.parameters()):,}\n")

            # 测试前向传播
            sample_input = X[:batch_size]
            output = textcnn(sample_input)
            print(f"输入形状: {sample_input.shape}")
            print(f"输出形状: {output.shape}\n")

            # 2. TextRNN
            print("【TextRNN模型】")
            textrnn = TextRNN(vocab_size, embed_dim, hidden_dim, num_classes)
            print(textrnn)
            print(f"参数量: {sum(p.numel() for p in textrnn.parameters()):,}\n")

            # 测试前向传播
            output = textrnn(sample_input)
            print(f"输入形状: {sample_input.shape}")
            print(f"输出形状: {output.shape}\n")

            # 3. TextRCNN
            print("【TextRCNN模型】")
            textrcnn = TextRCNN(vocab_size, embed_dim, hidden_dim, num_classes)
            print(textrcnn)
            print(f"参数量: {sum(p.numel() for p in textrcnn.parameters()):,}\n")

            # 测试前向传播
            output = textrcnn(sample_input)
            print(f"输入形状: {sample_input.shape}")
            print(f"输出形状: {output.shape}")
        ---
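
The models above take integer `input_ids` with index 0 reserved for padding (`padding_idx=0`), but the usage example feeds random tensors. The following sketch shows one way tokenized text might be mapped to fixed-length ID sequences. The special tokens `<pad>` and `<unk>`, the helper names, and the toy corpus are assumptions; the resulting lists would still need to be wrapped in `torch.tensor` and a `Dataset` that yields `input_ids` and `label`.

```python
from collections import Counter
from typing import Dict, List

PAD_ID, UNK_ID = 0, 1  # assumption: 0 matches padding_idx=0 in the embedding layers


def build_vocab(tokenized: List[List[str]], min_freq: int = 1) -> Dict[str, int]:
    """Assign an integer ID to every token seen at least min_freq times."""
    counts = Counter(token for doc in tokenized for token in doc)
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for token, freq in counts.most_common():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to IDs, truncate to max_len, then right-pad with PAD_ID."""
    ids = [vocab.get(token, UNK_ID) for token in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical pre-segmented training texts
corpus = [["great", "plot"], ["boring", "plot", "bad", "acting"]]
vocab = build_vocab(corpus)

batch = [encode(doc, vocab, max_len=3) for doc in corpus]
unseen = encode(["great", "soundtrack"], vocab, max_len=3)
print(vocab)
print(batch)
print(unseen)
```

Because padding is appended on the right, the unidirectional `TextRNN` path reads pad embeddings last; packing sequences by length is a common way to avoid that, though it is outside this sketch.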

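03.Pretrained model selection
    a.Explanation
        Pretrained language models are currently the mainstream approach to text classification. BERT (Bidirectional
        Encoder Representations from Transformers) uses a bidirectional Transformer encoder that is pretrained on a
        large corpus and then fine-tuned for the downstream task. Chinese checkpoints include BERT-base-chinese
        (Google's official release), RoBERTa-wwm-ext (HIT and iFLYTEK joint lab), ERNIE (Baidu), and MacBERT.
        When choosing a model, consider: language match (Chinese/English), model size (base, about 110M parameters,
        vs. large, about 340M), the domain of the pretraining corpus (general vs. vertical), and inference-speed
        requirements (lighter variants such as DistilBERT and ALBERT; note that ALBERT mainly shrinks the parameter
        count through weight sharing, so its speedup is usually smaller than its size reduction suggests).
        Fine-tuning strategies include full-parameter fine-tuning (usually the most accurate, but the slowest),
        freezing some layers (faster training), adapters, and prompt-based learning. The learning rate is typically
        set between 2e-5 and 5e-5 with 3-5 training epochs, combined with warmup and learning-rate decay.
        Watch for overfitting; early stopping, regularization, and data augmentation can mitigate it.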
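
        The warmup-then-decay schedule mentioned above can be sketched in plain Python. This is an illustrative
        re-implementation of the linear multiplier, not the library's code: the function name `linear_warmup_decay`
        and the step counts are assumptions made for the example. In the course code the same shape is obtained
        from `get_linear_schedule_with_warmup`.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return the learning-rate multiplier in [0, 1] for a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1
        return step / max(1, warmup_steps)
    # Decay phase: ramp linearly from 1 down to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    base_lr = 2e-5       # peak learning rate (illustrative value)
    total_steps = 100    # batches per epoch x epochs (illustrative value)
    warmup_steps = 10    # 10% of the total steps
    for step in (0, 5, 10, 55, 100):
        lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
        print(f"step {step:>3}: lr = {lr:.2e}")
```

        If `warmup_steps` is larger than `total_steps`, the multiplier never reaches 1 and the peak learning rate
        is never used, so a warmup of roughly 10% of the total steps is a safer starting point.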
    b.Code example
        ---
        # BERT-based text classification
        import torch
        import torch.nn as nn
        from transformers import BertTokenizer, BertModel, BertConfig
        from transformers import get_linear_schedule_with_warmup
        from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer instead
        from torch.utils.data import Dataset, DataLoader
        from typing import List, Dict, Tuple
        import numpy as np
        from tqdm import tqdm

        class BERTClassifier(nn.Module):
            """基于BERT的文本分类器"""

            def __init__(self, model_name: str = 'bert-base-chinese',
                        num_classes: int = 2, dropout: float = 0.3,
                        freeze_bert: bool = False):
                """
                初始化BERT分类器

                Args:
                    model_name: 预训练模型名称
                    num_classes: 类别数
                    dropout: Dropout率
                    freeze_bert: 是否冻结BERT参数
                """
                super(BERTClassifier, self).__init__()

                # 加载预训练BERT
                self.bert = BertModel.from_pretrained(model_name)

                # 是否冻结BERT参数
                if freeze_bert:
                    for param in self.bert.parameters():
                        param.requires_grad = False

                # Dropout层
                self.dropout = nn.Dropout(dropout)

                # 分类头
                self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

            def forward(self, input_ids, attention_mask, token_type_ids=None):
                """
                前向传播

                Args:
                    input_ids: 输入token IDs (batch_size, seq_len)
                    attention_mask: 注意力掩码 (batch_size, seq_len)
                    token_type_ids: token类型IDs (可选)

                Returns:
                    logits (batch_size, num_classes)
                """
                # BERT编码
                outputs = self.bert(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids
                )

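                # Use the pooled [CLS] representation: the [CLS] hidden state
                # passed through BERT's dense + tanh pooler, not the raw hidden state
                pooled_output = outputs.pooler_output  # (batch_size, hidden_size)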

                # Dropout
                pooled_output = self.dropout(pooled_output)

                # 分类
                logits = self.classifier(pooled_output)

                return logits

        class BERTDataset(Dataset):
            """BERT数据集"""

            def __init__(self, texts: List[str], labels: List[int],
                        tokenizer: BertTokenizer, max_length: int = 128):
                self.texts = texts
                self.labels = labels
                self.tokenizer = tokenizer
                self.max_length = max_length

            def __len__(self):
                return len(self.texts)

            def __getitem__(self, idx):
                text = self.texts[idx]
                label = self.labels[idx]

                # 编码文本
                encoding = self.tokenizer(
                    text,
                    add_special_tokens=True,
                    max_length=self.max_length,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )

                return {
                    'input_ids': encoding['input_ids'].flatten(),
                    'attention_mask': encoding['attention_mask'].flatten(),
                    'token_type_ids': encoding['token_type_ids'].flatten(),
                    'label': torch.tensor(label, dtype=torch.long)
                }

        class BERTTrainer:
            """BERT训练器"""

            def __init__(self, model: nn.Module, device: str = 'cpu'):
                self.model = model.to(device)
                self.device = device
                self.best_accuracy = 0.0

            def train(self, train_loader: DataLoader, val_loader: DataLoader,
                     epochs: int = 3, learning_rate: float = 2e-5,
                     warmup_steps: int = 0):
                """
                训练模型

                Args:
                    train_loader: 训练数据加载器
                    val_loader: 验证数据加载器
                    epochs: 训练轮数
                    learning_rate: 学习率
                    warmup_steps: warmup步数
                """
                # 优化器
                optimizer = AdamW(self.model.parameters(), lr=learning_rate)

                # 学习率调度器
                total_steps = len(train_loader) * epochs
                scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=warmup_steps,
                    num_training_steps=total_steps
                )

                # 损失函数
                criterion = nn.CrossEntropyLoss()

                # 训练循环
                for epoch in range(epochs):
                    print(f"\n{'='*70}")
                    print(f"Epoch {epoch + 1}/{epochs}")
                    print(f"{'='*70}")

                    # 训练
                    train_loss, train_acc = self._train_epoch(
                        train_loader, optimizer, criterion, scheduler
                    )

                    print(f"\n训练 - Loss: {train_loss:.4f}, Accuracy: {train_acc:.4f}")

                    # 验证
                    val_loss, val_acc = self._evaluate(val_loader, criterion)

                    print(f"验证 - Loss: {val_loss:.4f}, Accuracy: {val_acc:.4f}")

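                    # Save the best model by validation accuracy
                    # (the checkpoint file name is illustrative)
                    if val_acc > self.best_accuracy:
                        self.best_accuracy = val_acc
                        torch.save(self.model.state_dict(), 'best_bert_classifier.pt')
                        print(f"✓ 保存最佳模型 (验证准确率: {val_acc:.4f})")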

            def _train_epoch(self, train_loader, optimizer, criterion, scheduler):
                """训练一个epoch"""
                self.model.train()
                total_loss = 0
                correct = 0
                total = 0

                progress_bar = tqdm(train_loader, desc="训练中")

                for batch in progress_bar:
                    # 获取数据
                    input_ids = batch['input_ids'].to(self.device)
                    attention_mask = batch['attention_mask'].to(self.device)
                    token_type_ids = batch['token_type_ids'].to(self.device)
                    labels = batch['label'].to(self.device)

                    # 前向传播
                    optimizer.zero_grad()
                    outputs = self.model(input_ids, attention_mask, token_type_ids)
                    loss = criterion(outputs, labels)

                    # 反向传播
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
                    optimizer.step()
                    scheduler.step()

                    # 统计
                    total_loss += loss.item()
                    predictions = torch.argmax(outputs, dim=1)
                    correct += (predictions == labels).sum().item()
                    total += labels.size(0)

                    # 更新进度条
                    progress_bar.set_postfix({
                        'loss': f'{loss.item():.4f}',
                        'acc': f'{correct/total:.4f}'
                    })

                avg_loss = total_loss / len(train_loader)
                accuracy = correct / total

                return avg_loss, accuracy

            def _evaluate(self, data_loader, criterion):
                """评估模型"""
                self.model.eval()
                total_loss = 0
                correct = 0
                total = 0

                with torch.no_grad():
                    for batch in tqdm(data_loader, desc="评估中"):
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        token_type_ids = batch['token_type_ids'].to(self.device)
                        labels = batch['label'].to(self.device)

                        outputs = self.model(input_ids, attention_mask, token_type_ids)
                        loss = criterion(outputs, labels)

                        total_loss += loss.item()
                        predictions = torch.argmax(outputs, dim=1)
                        correct += (predictions == labels).sum().item()
                        total += labels.size(0)

                avg_loss = total_loss / len(data_loader)
                accuracy = correct / total

                return avg_loss, accuracy

            def predict(self, data_loader) -> List[int]:
                """预测"""
                self.model.eval()
                predictions = []

                with torch.no_grad():
                    for batch in tqdm(data_loader, desc="预测中"):
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        token_type_ids = batch['token_type_ids'].to(self.device)

                        outputs = self.model(input_ids, attention_mask, token_type_ids)
                        preds = torch.argmax(outputs, dim=1)
                        predictions.extend(preds.cpu().numpy().tolist())

                return predictions

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("BERT文本分类示例")
            print("="*70 + "\n")

            # 检查设备
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            print(f"使用设备: {device}\n")

            # 加载tokenizer
            print("加载tokenizer...")
            tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

            # 模拟数据
            train_texts = ["这是一个正面的例子"] * 100 + ["这是一个负面的例子"] * 100
            train_labels = [1] * 100 + [0] * 100

            val_texts = ["正面样本"] * 20 + ["负面样本"] * 20
            val_labels = [1] * 20 + [0] * 20

            # 创建数据集
            print("创建数据集...")
            train_dataset = BERTDataset(train_texts, train_labels, tokenizer)
            val_dataset = BERTDataset(val_texts, val_labels, tokenizer)

            train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
            val_loader = DataLoader(val_dataset, batch_size=16)

            # 创建模型
            print("创建模型...")
            model = BERTClassifier(num_classes=2, dropout=0.3)
            print(f"模型参数量: {sum(p.numel() for p in model.parameters()):,}\n")

            # 创建训练器
            trainer = BERTTrainer(model, device=device)

            # 训练
            print("开始训练...")
            trainer.train(
                train_loader,
                val_loader,
                epochs=3,
                learning_rate=2e-5,
                warmup_steps=4  # about 10% of the 39 total steps (13 batches x 3 epochs); must not exceed total steps
            )

            print(f"\n✓ 训练完成! 最佳验证准确率: {trainer.best_accuracy:.4f}")
        ---

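2.4 Sentiment analysis in practice

01.Dataset construction and preprocessing
    a.Explanation
        Sentiment analysis is a typical application of text classification: the goal is to determine the polarity of
        a text (positive/negative/neutral). Commonly used datasets include IMDb movie reviews, Amazon product reviews,
        Weibo sentiment data, and hotel reviews. Preprocessing has to handle special cases such as emoji, internet
        slang, and repeated characters. Annotation needs an explicit rule for dividing sentiment intensity; for
        example, five-star ratings can be mapped as 1-2 stars negative, 3 stars neutral, 4-5 stars positive.
        Class imbalance is common and can be addressed with oversampling or a weighted loss function. Text cleaning
        should keep sentiment-bearing words, so avoid over-cleaning. During tokenization, pay attention to how
        negation words and degree adverbs change the sentiment. A sentiment lexicon can support feature engineering
        and may improve model performance.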
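
        To make the effect of negation words and degree adverbs concrete, here is a minimal lexicon scorer. It is a
        sketch under stated assumptions: the input is already tokenized (for example by `jieba.lcut`), the small
        word lists are illustrative subsets, and the rule "a modifier applies to the next sentiment word in the
        same clause" is a simplification rather than a complete treatment of scope.

```python
from typing import Dict, List, Set

# Illustrative mini-lexicons (assumptions for this sketch)
POSITIVE: Set[str] = {"好", "满意", "推荐", "喜欢"}
NEGATIVE: Set[str] = {"差", "失望", "糟糕", "后悔"}
NEGATIONS: Set[str] = {"不", "没", "无", "非"}
DEGREES: Dict[str, float] = {"很": 1.5, "非常": 2.0, "有点": 0.5}
PUNCT: Set[str] = {",", "。", "!", "?", ",", ".", "!", "?"}


def lexicon_score(tokens: List[str]) -> float:
    """Score a tokenized sentence; modifiers act on the next sentiment word."""
    score = 0.0
    weight = 1.0      # pending degree multiplier
    negated = False   # pending negation flag
    for tok in tokens:
        if tok in NEGATIONS:
            negated = not negated          # a double negation cancels out
        elif tok in DEGREES:
            weight *= DEGREES[tok]
        elif tok in POSITIVE or tok in NEGATIVE:
            polarity = 1.0 if tok in POSITIVE else -1.0
            if negated:
                polarity = -polarity
            score += polarity * weight
            weight, negated = 1.0, False   # modifiers are consumed by this word
        elif tok in PUNCT:
            weight, negated = 1.0, False   # a clause boundary ends the modifier scope
    return score


if __name__ == "__main__":
    # Hand-written token lists; a real tokenizer may segment differently
    for tokens in (["质量", "很", "差"], ["不", "满意"], ["非常", "不", "推荐"]):
        print("".join(tokens), lexicon_score(tokens))
```

        A plain count of positive minus negative words would score "不满意" as positive; flipping the polarity
        on negation and scaling by the degree adverb avoids that particular failure.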
    b.Code example
        ---
        # Building a sentiment analysis dataset
        import pandas as pd
        import numpy as np
        import re
        import jieba
        from typing import List, Dict, Tuple
        from collections import Counter
        import matplotlib.pyplot as plt
        import seaborn as sns

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False

        class SentimentDatasetBuilder:
            """情感分析数据集构建器"""

            def __init__(self):
                # 情感词典
                self.positive_words = set([
                    '好', '棒', '优秀', '喜欢', '满意', '推荐', '精彩', '完美',
                    '赞', '不错', '舒服', '快乐', '开心', '惊喜', '值得'
                ])

                self.negative_words = set([
                    '差', '烂', '失望', '糟糕', '难受', '后悔', '垃圾', '恶心',
                    '讨厌', '无聊', '浪费', '骗人', '坑', '慢', '贵'
                ])

                # 否定词
                self.negation_words = set(['不', '没', '别', '莫', '无', '非'])

                # 程度副词
                self.degree_words = {
                    '很': 1.5, '非常': 2.0, '特别': 2.0, '十分': 1.8,
                    '极其': 2.5, '超级': 2.2, '太': 1.8, '相当': 1.5,
                    '有点': 0.5, '稍微': 0.5, '略': 0.4
                }

            def clean_text(self, text: str) -> str:
                """
                清洗文本(保留情感信息)

                Args:
                    text: 原始文本

                Returns:
                    清洗后的文本
                """
                # 处理表情符号(替换为文字)
                emoji_dict = {
                    '😊': '开心', '😃': '开心', '😄': '开心', '😁': '开心',
                    '😢': '难过', '😭': '哭', '😞': '失望', '😔': '失望',
                    '😡': '生气', '😠': '生气', '👍': '好', '👎': '不好',
                    '❤️': '喜欢', '💔': '伤心'
                }

                for emoji, word in emoji_dict.items():
                    text = text.replace(emoji, word)

                # Collapse runs of 3+ identical characters to two
                # (e.g. "太太太好了" -> "太太好了"), which keeps some of the emphasis
                text = re.sub(r'(.)\1{2,}', r'\1\1', text)

                # 移除URL
                text = re.sub(r'http[s]?://\S+', '', text)

                # 移除@用户名
                text = re.sub(r'@\S+', '', text)

                # Normalize full-width punctuation to half-width
                # (replacing a character with itself would be a no-op)
                text = text.replace('!', '!').replace('?', '?')

                # 去除多余空格
                text = re.sub(r'\s+', ' ', text).strip()

                return text

            def label_from_rating(self, rating: float) -> int:
                """
                从评分转换为情感标签

                Args:
                    rating: 评分 (1-5)

                Returns:
                    情感标签 (0:负面, 1:中性, 2:正面)
                """
                if rating <= 2.0:
                    return 0  # 负面
                elif rating < 4.0:
                    return 1  # 中性
                else:
                    return 2  # 正面

            def analyze_sentiment_words(self, text: str) -> Dict:
                """
                分析文本中的情感词

                Args:
                    text: 输入文本

                Returns:
                    情感词统计
                """
                words = jieba.lcut(text)

                pos_count = sum(1 for w in words if w in self.positive_words)
                neg_count = sum(1 for w in words if w in self.negative_words)
                neg_flag = any(w in self.negation_words for w in words)

                return {
                    'positive_count': pos_count,
                    'negative_count': neg_count,
                    'has_negation': neg_flag,
                    'sentiment_score': pos_count - neg_count
                }

            def create_balanced_dataset(self, texts: List[str], labels: List[int],
                                      target_count: int = None) -> Tuple[List[str], List[int]]:
                """
                创建平衡数据集

                Args:
                    texts: 文本列表
                    labels: 标签列表
                    target_count: 每个类别的目标样本数

                Returns:
                    平衡后的文本和标签
                """
                # 按类别分组
                label_groups = {}
                for text, label in zip(texts, labels):
                    if label not in label_groups:
                        label_groups[label] = []
                    label_groups[label].append(text)

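                # Determine the target count; by default use the smallest class,
                # which means the larger classes are undersampled
                if target_count is None:
                    target_count = min(len(group) for group in label_groups.values())

                print("原始类别分布:")
                # Use a distinct loop variable so the `texts` argument is not shadowed
                for label, group in label_groups.items():
                    print(f"  类别 {label}: {len(group)} 样本")

                print(f"\n目标: 每个类别 {target_count} 样本")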

                # Balanced sampling with a local generator
                # (avoids reseeding NumPy's global random state as a side effect)
                balanced_texts = []
                balanced_labels = []

                rng = np.random.default_rng(42)
                for label, label_texts in label_groups.items():
                    # Undersample without replacement when the class is large enough;
                    # otherwise oversample with replacement
                    oversample = len(label_texts) < target_count
                    selected = rng.choice(label_texts, target_count, replace=oversample)

                    # tolist() converts NumPy strings back to plain Python str
                    balanced_texts.extend(selected.tolist())
                    balanced_labels.extend([label] * target_count)

                # Shuffle texts and labels with the same permutation
                indices = rng.permutation(len(balanced_texts))
                balanced_texts = [balanced_texts[i] for i in indices]
                balanced_labels = [balanced_labels[i] for i in indices]

                print(f"\n平衡后总样本数: {len(balanced_texts)}")

                return balanced_texts, balanced_labels

            def visualize_dataset(self, texts: List[str], labels: List[int]):
                """可视化数据集特征"""
                fig, axes = plt.subplots(2, 2, figsize=(14, 10))

                # 1. 类别分布
                label_counts = Counter(labels)
                labels_name = {0: '负面', 1: '中性', 2: '正面'}

                ax = axes[0, 0]
                ax.bar([labels_name[l] for l in sorted(label_counts.keys())],
                      [label_counts[l] for l in sorted(label_counts.keys())],
                      color=['#E63946', '#F4A261', '#2A9D8F'])
                ax.set_title('类别分布', fontsize=12, fontweight='bold')
                ax.set_ylabel('样本数', fontsize=10)

                # 2. 文本长度分布
                text_lengths = [len(text) for text in texts]

                ax = axes[0, 1]
                ax.hist(text_lengths, bins=30, color='#457B9D', alpha=0.7, edgecolor='black')
                ax.set_title('文本长度分布', fontsize=12, fontweight='bold')
                ax.set_xlabel('字符数', fontsize=10)
                ax.set_ylabel('样本数', fontsize=10)
                ax.axvline(np.mean(text_lengths), color='red', linestyle='--',
                          label=f'平均: {np.mean(text_lengths):.1f}')
                ax.legend()

                # 3. 各类别平均长度
                label_lengths = {l: [] for l in set(labels)}
                for text, label in zip(texts, labels):
                    label_lengths[label].append(len(text))

                avg_lengths = {labels_name[l]: np.mean(lengths)
                             for l, lengths in label_lengths.items()}

                ax = axes[1, 0]
                ax.bar(avg_lengths.keys(), avg_lengths.values(),
                      color=['#E63946', '#F4A261', '#2A9D8F'])
                ax.set_title('各类别平均文本长度', fontsize=12, fontweight='bold')
                ax.set_ylabel('平均字符数', fontsize=10)

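                # 4. Sentiment-word statistics (only for classes that are present,
                #    so that an absent class does not produce a NaN mean)
                sentiment_stats = {l: [] for l in sorted(set(labels))}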
                for text, label in zip(texts, labels):
                    stats = self.analyze_sentiment_words(text)
                    sentiment_stats[label].append(stats['sentiment_score'])

                avg_scores = {labels_name[l]: np.mean(scores)
                            for l, scores in sentiment_stats.items()}

                ax = axes[1, 1]
                colors_score = ['#E63946' if v < 0 else '#2A9D8F' if v > 0 else '#F4A261'
                               for v in avg_scores.values()]
                ax.bar(avg_scores.keys(), avg_scores.values(), color=colors_score)
                ax.set_title('各类别平均情感得分', fontsize=12, fontweight='bold')
                ax.set_ylabel('情感得分', fontsize=10)
                ax.axhline(0, color='black', linestyle='-', linewidth=0.5)

                plt.tight_layout()
                plt.savefig('sentiment_dataset_analysis.png', dpi=300, bbox_inches='tight')
                print("\n✓ 数据集分析图已保存为 sentiment_dataset_analysis.png")

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("情感分析数据集构建示例")
            print("="*70 + "\n")

            # 创建构建器
            builder = SentimentDatasetBuilder()

            # 模拟原始数据(商品评论)
            raw_data = [
                ("这个产品真的太太太好用了!!!😊👍", 5.0),
                ("质量很差,非常失望😞", 1.0),
                ("一般般吧,没什么特别的", 3.0),
                ("物流很快,包装不错,商品也很满意", 5.0),
                ("完全不值这个价格,后悔购买😡", 1.5),
                ("还可以,但有些小问题", 3.5),
                ("超级推荐!性价比很高!", 5.0),
                ("烂到家了,浪费钱💔", 1.0),
            ] * 50  # 扩展数据

            # 数据清洗
            print("1. 数据清洗")
            cleaned_texts = []
            labels = []

            for text, rating in raw_data:
                # 清洗文本
                cleaned = builder.clean_text(text)
                cleaned_texts.append(cleaned)

                # 生成标签
                label = builder.label_from_rating(rating)
                labels.append(label)

                # 显示前3个示例
                if len(cleaned_texts) <= 3:
                    print(f"原始: {text}")
                    print(f"清洗: {cleaned}")
                    print(f"评分: {rating} -> 标签: {label}\n")

            # 2. 情感词分析
            print("\n2. 情感词分析示例")
            for i in range(3):
                text = cleaned_texts[i]
                stats = builder.analyze_sentiment_words(text)
                print(f"文本: {text}")
                print(f"  正面词数: {stats['positive_count']}")
                print(f"  负面词数: {stats['negative_count']}")
                print(f"  包含否定: {stats['has_negation']}")
                print(f"  情感得分: {stats['sentiment_score']}\n")

            # 3. 数据平衡
            print("3. 数据平衡")
            balanced_texts, balanced_labels = builder.create_balanced_dataset(
                cleaned_texts, labels
            )

            # 4. 数据集可视化
            print("\n4. 数据集可视化")
            builder.visualize_dataset(balanced_texts, balanced_labels)

            # 5. 保存数据集
            print("\n5. 保存数据集")
            df = pd.DataFrame({
                'text': balanced_texts,
                'label': balanced_labels
            })
            df.to_csv('sentiment_dataset.csv', index=False, encoding='utf-8')
            print(f"✓ 数据集已保存: sentiment_dataset.csv ({len(df)} 样本)")
        ---
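
        The explanation for this subsection names a weighted loss function as an alternative to resampling. A
        minimal sketch of how such weights could be derived is shown below. The helper name
        `inverse_frequency_weights` is hypothetical, and the formula n_samples / (n_classes * class_count) is one
        common "balanced" heuristic rather than the only option.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }


if __name__ == "__main__":
    # Class sizes of the simulated reviews before balancing:
    # 150 negative (0), 100 neutral (1), 150 positive (2)
    labels = [0] * 150 + [1] * 100 + [2] * 150
    for label, weight in inverse_frequency_weights(labels).items():
        print(f"class {label}: weight {weight:.4f}")
```

        The rarer neutral class receives the largest weight. Ordered by class index, the values could be passed
        as the `weight` tensor of `torch.nn.CrossEntropyLoss`, which leaves the dataset itself unchanged.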

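02.BERT fine-tuning
    a.Explanation
        Fine-tuning BERT is a standard and strong approach to sentiment analysis. The workflow is: load the
        pretrained model, add a classification head, decide which layers to freeze or fine-tune, and tune the
        training hyperparameters. For Chinese sentiment analysis, roberta-wwm-ext or macbert are recommended, as
        they generally perform better on Chinese text.
        Fine-tuning tips: use a small learning rate (2e-5) to avoid catastrophic forgetting; use warmup to raise
        the learning rate gradually; use gradient clipping to prevent exploding gradients; use early stopping to
        avoid overfitting. Training can start with the BERT layers frozen so that only the classification head is
        trained, and then unfreeze layer by layer until all parameters are fine-tuned. Batch size is limited by
        GPU memory, typically 16-32; gradient accumulation can simulate a larger batch.
        Monitor performance on the validation set and select the best checkpoint. For deployment, model
        distillation or quantization can speed up inference.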
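
        Gradient accumulation is mentioned above but may be easier to trust once the arithmetic is visible. The
        sketch below uses a one-parameter toy model in plain Python (all names and data are assumptions for
        illustration) to show that micro-batch gradients, each scaled by 1 / accum_steps, add up to the
        full-batch gradient when the micro-batches have equal size.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def mse_grad(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: List[Sample], micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients, each scaled by 1 / accum_steps."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    accum_steps = len(micro_batches)
    grad = 0.0
    for micro in micro_batches:
        # Plays the role of (loss / accum_steps).backward() in a PyTorch loop
        grad += mse_grad(w, micro) / accum_steps
    return grad


if __name__ == "__main__":
    data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]  # 8 toy samples
    w = 0.5
    full = mse_grad(w, data)
    accum = accumulated_grad(w, data, micro_batch_size=2)  # 4 micro-batches of 2
    print(f"full-batch gradient:  {full:.6f}")
    print(f"accumulated gradient: {accum:.6f}")
```

        In a training loop this corresponds to dividing each batch loss by `accum_steps`, calling `backward()` on
        every batch, and calling `optimizer.step()`, `scheduler.step()` and `optimizer.zero_grad()` only once
        every `accum_steps` batches.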
    b.Code example
        ---
        # Fine-tuning BERT for sentiment analysis
        import torch
        import torch.nn as nn
        from torch.utils.data import Dataset, DataLoader
        from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer instead
        from transformers import (
            BertTokenizer, BertForSequenceClassification,
            get_linear_schedule_with_warmup
        )
        from sklearn.metrics import accuracy_score, classification_report
        from tqdm import tqdm
        import numpy as np
        from typing import List, Dict
        import pandas as pd

        class SentimentDataset(Dataset):
            """情感分析数据集"""

            def __init__(self, texts: List[str], labels: List[int],
                        tokenizer: BertTokenizer, max_length: int = 128):
                self.texts = texts
                self.labels = labels
                self.tokenizer = tokenizer
                self.max_length = max_length

            def __len__(self):
                return len(self.texts)

            def __getitem__(self, idx):
                text = self.texts[idx]
                label = self.labels[idx]

                encoding = self.tokenizer(
                    text,
                    add_special_tokens=True,
                    max_length=self.max_length,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )

                return {
                    'input_ids': encoding['input_ids'].flatten(),
                    'attention_mask': encoding['attention_mask'].flatten(),
                    'label': torch.tensor(label, dtype=torch.long)
                }

        class SentimentBERTTrainer:
            """BERT情感分析训练器"""

            def __init__(self, model_name: str = 'bert-base-chinese',
                        num_labels: int = 3, device: str = 'cpu'):
                """
                初始化训练器

                Args:
                    model_name: 预训练模型名称
                    num_labels: 类别数量
                    device: 设备
                """
                self.device = device

                # 加载tokenizer和模型
                self.tokenizer = BertTokenizer.from_pretrained(model_name)
                self.model = BertForSequenceClassification.from_pretrained(
                    model_name,
                    num_labels=num_labels
                ).to(device)

                self.best_val_loss = float('inf')
                self.patience_counter = 0

                # 训练历史
                self.history = {
                    'train_loss': [],
                    'train_acc': [],
                    'val_loss': [],
                    'val_acc': []
                }

            def train(self, train_loader: DataLoader, val_loader: DataLoader,
                     epochs: int = 5, learning_rate: float = 2e-5,
                     warmup_ratio: float = 0.1, patience: int = 3):
                """
                训练模型

                Args:
                    train_loader: 训练数据加载器
                    val_loader: 验证数据加载器
                    epochs: 训练轮数
                    learning_rate: 学习率
                    warmup_ratio: warmup比例
                    patience: early stopping耐心值
                """
                # 优化器
                optimizer = AdamW(self.model.parameters(), lr=learning_rate)

                # 学习率调度器
                total_steps = len(train_loader) * epochs
                warmup_steps = int(total_steps * warmup_ratio)

                scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=warmup_steps,
                    num_training_steps=total_steps
                )

                print(f"总训练步数: {total_steps}")
                print(f"Warmup步数: {warmup_steps}\n")

                # 训练循环
                for epoch in range(epochs):
                    print(f"\n{'='*70}")
                    print(f"Epoch {epoch + 1}/{epochs}")
                    print(f"{'='*70}")

                    # 训练
                    train_loss, train_acc = self._train_epoch(
                        train_loader, optimizer, scheduler
                    )

                    # 验证
                    val_loss, val_acc = self._validate(val_loader)

                    # 记录历史
                    self.history['train_loss'].append(train_loss)
                    self.history['train_acc'].append(train_acc)
                    self.history['val_loss'].append(val_loss)
                    self.history['val_acc'].append(val_acc)

                    print(f"\n训练 - Loss: {train_loss:.4f}, Accuracy: {train_acc:.4f}")
                    print(f"验证 - Loss: {val_loss:.4f}, Accuracy: {val_acc:.4f}")

                    # Early stopping
                    if val_loss < self.best_val_loss:
                        self.best_val_loss = val_loss
                        self.patience_counter = 0
                        torch.save(self.model.state_dict(), 'best_sentiment_model.pt')
                        print(f"✓ 保存最佳模型 (验证损失: {val_loss:.4f})")
                    else:
                        self.patience_counter += 1
                        print(f"验证损失未改善 ({self.patience_counter}/{patience})")

                        if self.patience_counter >= patience:
                            print(f"\nEarly stopping触发,停止训练")
                            break

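                # Reload the best checkpoint so that evaluate()/predict() use it
                # rather than the weights from the last epoch
                if self.best_val_loss < float('inf'):
                    self.model.load_state_dict(
                        torch.load('best_sentiment_model.pt', map_location=self.device)
                    )

                print(f"\n训练完成! 最佳验证损失: {self.best_val_loss:.4f}")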

            def _train_epoch(self, train_loader, optimizer, scheduler):
                """训练一个epoch"""
                self.model.train()
                total_loss = 0
                all_preds = []
                all_labels = []

                progress_bar = tqdm(train_loader, desc="训练中")

                for batch in progress_bar:
                    # 数据迁移到设备
                    input_ids = batch['input_ids'].to(self.device)
                    attention_mask = batch['attention_mask'].to(self.device)
                    labels = batch['label'].to(self.device)

                    # 前向传播
                    optimizer.zero_grad()
                    outputs = self.model(
                        input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels
                    )

                    loss = outputs.loss
                    logits = outputs.logits

                    # 反向传播
                    loss.backward()

                    # 梯度裁剪
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)

                    optimizer.step()
                    scheduler.step()

                    # 统计
                    total_loss += loss.item()
                    preds = torch.argmax(logits, dim=1)
                    all_preds.extend(preds.cpu().numpy())
                    all_labels.extend(labels.cpu().numpy())

                    # 更新进度条
                    progress_bar.set_postfix({
                        'loss': f'{loss.item():.4f}',
                        'lr': f'{scheduler.get_last_lr()[0]:.2e}'
                    })

                avg_loss = total_loss / len(train_loader)
                accuracy = accuracy_score(all_labels, all_preds)

                return avg_loss, accuracy

            def _validate(self, val_loader):
                """验证模型"""
                self.model.eval()
                total_loss = 0
                all_preds = []
                all_labels = []

                with torch.no_grad():
                    for batch in tqdm(val_loader, desc="验证中"):
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        labels = batch['label'].to(self.device)

                        outputs = self.model(
                            input_ids=input_ids,
                            attention_mask=attention_mask,
                            labels=labels
                        )

                        loss = outputs.loss
                        logits = outputs.logits

                        total_loss += loss.item()
                        preds = torch.argmax(logits, dim=1)
                        all_preds.extend(preds.cpu().numpy())
                        all_labels.extend(labels.cpu().numpy())

                avg_loss = total_loss / len(val_loader)
                accuracy = accuracy_score(all_labels, all_preds)

                return avg_loss, accuracy

            def evaluate(self, test_loader: DataLoader) -> Dict:
                """完整评估"""
                self.model.eval()
                all_preds = []
                all_labels = []
                all_probs = []

                with torch.no_grad():
                    for batch in tqdm(test_loader, desc="测试中"):
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        labels = batch['label'].to(self.device)

                        outputs = self.model(
                            input_ids=input_ids,
                            attention_mask=attention_mask
                        )

                        logits = outputs.logits
                        probs = torch.softmax(logits, dim=1)

                        preds = torch.argmax(logits, dim=1)
                        all_preds.extend(preds.cpu().numpy())
                        all_labels.extend(labels.cpu().numpy())
                        all_probs.extend(probs.cpu().numpy())

                # 打印详细报告
                print(f"\n{'='*70}")
                print("测试集评估结果")
                print(f"{'='*70}\n")

                label_names = ['负面', '中性', '正面']
                print(classification_report(
                    all_labels, all_preds,
                    target_names=label_names,
                    digits=4
                ))

                return {
                    'predictions': all_preds,
                    'probabilities': all_probs,
                    'labels': all_labels
                }

            def predict(self, texts: List[str], batch_size: int = 32) -> List[Dict]:
                """预测新文本"""
                # 创建数据集
                dummy_labels = [0] * len(texts)
                dataset = SentimentDataset(texts, dummy_labels, self.tokenizer)
                loader = DataLoader(dataset, batch_size=batch_size)

                self.model.eval()
                predictions = []

                with torch.no_grad():
                    for batch in loader:
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)

                        outputs = self.model(
                            input_ids=input_ids,
                            attention_mask=attention_mask
                        )

                        logits = outputs.logits
                        probs = torch.softmax(logits, dim=1)

                        for prob in probs:
                            label = torch.argmax(prob).item()
                            confidence = prob[label].item()

                            predictions.append({
                                'label': label,
                                'sentiment': ['负面', '中性', '正面'][label],
                                'confidence': confidence,
                                'probabilities': {
                                    '负面': prob[0].item(),
                                    '中性': prob[1].item(),
                                    '正面': prob[2].item()
                                }
                            })

                return predictions

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("BERT情感分析微调")
            print("="*70 + "\n")

            # 检查设备
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            print(f"使用设备: {device}\n")

            # 准备数据(这里使用模拟数据)
            print("准备数据...")

            # 实际应用中应从文件加载
            train_texts = ["这个产品很好用"] * 50 + ["质量太差了"] * 50 + ["一般般"] * 50
            train_labels = [2] * 50 + [0] * 50 + [1] * 50

            val_texts = ["很满意"] * 10 + ["很失望"] * 10 + ["还可以"] * 10
            val_labels = [2] * 10 + [0] * 10 + [1] * 10

            # 创建训练器
            trainer = SentimentBERTTrainer(
                model_name='bert-base-chinese',
                num_labels=3,
                device=device
            )

            # 创建数据加载器
            train_dataset = SentimentDataset(train_texts, train_labels, trainer.tokenizer)
            val_dataset = SentimentDataset(val_texts, val_labels, trainer.tokenizer)

            train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
            val_loader = DataLoader(val_dataset, batch_size=16)

            # 训练
            print("开始训练...")
            trainer.train(
                train_loader, val_loader,
                epochs=3,
                learning_rate=2e-5,
                patience=2
            )

            # 预测示例
            print("\n" + "="*70)
            print("预测示例")
            print("="*70 + "\n")

            test_texts = [
                "这个产品真的太棒了,强烈推荐!",
                "质量很差,完全不值这个价",
                "还行吧,没什么特别的"
            ]

            predictions = trainer.predict(test_texts)

            for text, pred in zip(test_texts, predictions):
                print(f"文本: {text}")
                print(f"情感: {pred['sentiment']} (置信度: {pred['confidence']:.4f})")
                print(f"概率: {pred['probabilities']}\n")
        ---

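03.Model Deployment and Online Serving
    a.Explanation
        Once training is finished, the model has to be deployed as an online service that business
        systems can call. Deployment options include a REST API built with Flask or FastAPI, model
        serving with TorchServe, containerized deployment with Docker, and cluster management with
        Kubernetes. Performance optimizations include INT8 quantization to shrink the model, ONNX
        Runtime to speed up inference, batching to raise throughput, and caching prediction results
        so that repeated inputs are not recomputed. A production environment also needs load
        balancing, autoscaling, monitoring and alerting, logging, and A/B testing. Model update
        strategies include online updates and gradual (canary) rollout. The interface design should
        cover timeouts, error handling, and rate limiting. The service can expose a synchronous API
        (real-time response) and an asynchronous API (queue-based processing). Model version
        management and a rollback mechanism should be in place before deployment.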
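        Of the interface safeguards listed above, rate limiting is the easiest to show in isolation.
        The following is a minimal token-bucket sketch and not part of the course code: the class
        name `TokenBucket` and its parameters `rate`, `capacity`, and `clock` are assumptions made
        for illustration, and a production service may well delegate this to a gateway instead.

```python
import time


class TokenBucket:
    """Token-bucket limiter: bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable clock keeps the limiter testable
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens and return True if the request may proceed."""
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo driven by a fake clock (hypothetical values).
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]       # the bucket holds 3 tokens
fake_now[0] += 1.0                               # one second later, 2 tokens are back
after_wait = [bucket.allow() for _ in range(3)]
print(burst, after_wait)
```

        In an HTTP handler, a `False` result would typically be turned into a 429 response so that
        clients know to back off and retry.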
    b.Code example
        ---
        # 情感分析API服务实现
        from fastapi import FastAPI, HTTPException
        from pydantic import BaseModel
        from typing import List, Dict, Optional
        import torch
        from transformers import BertTokenizer, BertForSequenceClassification
        import uvicorn
        import time
        from functools import lru_cache
        import logging

        # 配置日志
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        )
        logger = logging.getLogger(__name__)

        # 请求模型
        class SentimentRequest(BaseModel):
            """情感分析请求"""
            texts: List[str]
            return_probabilities: bool = False

        class SentimentResponse(BaseModel):
            """情感分析响应"""
            results: List[Dict]
            model_version: str
            inference_time: float

        # 模型服务类
        class SentimentModelService:
            """情感分析模型服务"""

            def __init__(self, model_path: str, model_name: str = 'bert-base-chinese',
                        device: str = 'cpu'):
                """
                初始化模型服务

                Args:
                    model_path: 模型权重路径
                    model_name: 预训练模型名称
                    device: 设备
                """
                self.device = device
                self.model_version = "1.0.0"

                logger.info(f"加载模型: {model_name}")
                logger.info(f"使用设备: {device}")

                # 加载tokenizer
                self.tokenizer = BertTokenizer.from_pretrained(model_name)

                # 加载模型
                self.model = BertForSequenceClassification.from_pretrained(
                    model_name,
                    num_labels=3
                )

                # 加载训练好的权重
                try:
                    self.model.load_state_dict(torch.load(model_path, map_location=device))
                    logger.info(f"成功加载模型权重: {model_path}")
                except Exception as e:
                    logger.warning(f"无法加载模型权重: {e},使用预训练权重")

                self.model.to(device)
                self.model.eval()

                # 标签映射
                self.label_names = {0: '负面', 1: '中性', 2: '正面'}

                # 统计信息
                self.total_requests = 0
                self.total_inference_time = 0.0

                logger.info("模型服务初始化完成")

            @lru_cache(maxsize=1000)
            def _cached_predict(self, text: str) -> tuple:
                """
                带缓存的预测(用于相同文本的快速响应)

                Args:
                    text: 输入文本

                Returns:
                    (label, confidence, probabilities)
                """
                return self._predict_single(text)

            def _predict_single(self, text: str) -> tuple:
                """单文本预测"""
                # 编码
                encoding = self.tokenizer(
                    text,
                    add_special_tokens=True,
                    max_length=128,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )

                input_ids = encoding['input_ids'].to(self.device)
                attention_mask = encoding['attention_mask'].to(self.device)

                # 预测
                with torch.no_grad():
                    outputs = self.model(
                        input_ids=input_ids,
                        attention_mask=attention_mask
                    )

                logits = outputs.logits
                probs = torch.softmax(logits, dim=1).squeeze()

                label = torch.argmax(probs).item()
                confidence = probs[label].item()

                return label, confidence, probs.cpu().numpy().tolist()

            def predict(self, texts: List[str],
                       return_probabilities: bool = False) -> List[Dict]:
                """
                批量预测

                Args:
                    texts: 文本列表
                    return_probabilities: 是否返回所有类别概率

                Returns:
                    预测结果列表
                """
                start_time = time.time()
                results = []

                try:
                    for text in texts:
                        # 使用缓存预测
                        label, confidence, probs = self._cached_predict(text)

                        result = {
                            'text': text,
                            'label': label,
                            'sentiment': self.label_names[label],
                            'confidence': round(confidence, 4)
                        }

                        if return_probabilities:
                            result['probabilities'] = {
                                self.label_names[i]: round(probs[i], 4)
                                for i in range(3)
                            }

                        results.append(result)

                    # 更新统计
                    self.total_requests += 1
                    inference_time = time.time() - start_time
                    self.total_inference_time += inference_time

                    logger.info(f"处理{len(texts)}个文本,耗时{inference_time:.4f}秒")

                except Exception as e:
                    logger.error(f"预测失败: {e}")
                    raise HTTPException(status_code=500, detail=f"预测失败: {str(e)}")

                return results

            def get_statistics(self) -> Dict:
                """获取服务统计信息"""
                avg_time = (self.total_inference_time / self.total_requests
                           if self.total_requests > 0 else 0)

                return {
                    'model_version': self.model_version,
                    'device': self.device,
                    'total_requests': self.total_requests,
                    'total_inference_time': round(self.total_inference_time, 2),
                    'average_inference_time': round(avg_time, 4),
                    'cache_info': self._cached_predict.cache_info()._asdict()
                }

        # 创建FastAPI应用
        app = FastAPI(
            title="情感分析API",
            description="基于BERT的中文情感分析服务",
            version="1.0.0"
        )

        # 初始化模型服务(全局单例)
        model_service = None

        @app.on_event("startup")
        async def startup_event():
            """启动时加载模型"""
            global model_service
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            model_service = SentimentModelService(
                model_path='best_sentiment_model.pt',
                model_name='bert-base-chinese',
                device=device
            )
            logger.info("API服务启动完成")

        @app.get("/")
        async def root():
            """根路径"""
            return {
                "message": "情感分析API服务",
                "version": "1.0.0",
                "endpoints": {
                    "/predict": "情感分析预测",
                    "/health": "健康检查",
                    "/stats": "服务统计"
                }
            }

        @app.post("/predict", response_model=SentimentResponse)
        async def predict_sentiment(request: SentimentRequest):
            """
            情感分析预测接口

            Args:
                request: 包含文本列表的请求

            Returns:
                预测结果
            """
            if model_service is None:
                raise HTTPException(status_code=503, detail="模型服务未初始化")

            if not request.texts:
                raise HTTPException(status_code=400, detail="文本列表不能为空")

            if len(request.texts) > 100:
                raise HTTPException(status_code=400, detail="单次请求最多100个文本")

            start_time = time.time()

            # 预测
            results = model_service.predict(
                request.texts,
                return_probabilities=request.return_probabilities
            )

            inference_time = time.time() - start_time

            return SentimentResponse(
                results=results,
                model_version=model_service.model_version,
                inference_time=round(inference_time, 4)
            )

        @app.get("/health")
        async def health_check():
            """健康检查"""
            return {
                "status": "healthy",
                "model_loaded": model_service is not None,
                "device": model_service.device if model_service else None
            }

        @app.get("/stats")
        async def get_statistics():
            """获取服务统计"""
            if model_service is None:
                raise HTTPException(status_code=503, detail="模型服务未初始化")

            return model_service.get_statistics()

        @app.post("/clear_cache")
        async def clear_cache():
            """清除预测缓存"""
            if model_service is None:
                raise HTTPException(status_code=503, detail="模型服务未初始化")

            model_service._cached_predict.cache_clear()
            logger.info("缓存已清除")

            return {"message": "缓存已清除"}

        # Run the service (the import string below assumes this file is saved as sentiment_api.py)
        if __name__ == "__main__":
            uvicorn.run(
                "sentiment_api:app",
                host="0.0.0.0",
                port=8000,
                reload=False,
                workers=1
            )

        # 客户端调用示例
        """
        import requests

        # API地址
        API_URL = "http://localhost:8000"

        # 调用预测接口
        response = requests.post(
            f"{API_URL}/predict",
            json={
                "texts": [
                    "这个产品真的太好了!",
                    "质量很差,不推荐",
                    "还可以吧"
                ],
                "return_probabilities": True
            }
        )

        if response.status_code == 200:
            result = response.json()
            print(f"预测结果: {result['results']}")
            print(f"推理时间: {result['inference_time']}秒")
        else:
            print(f"请求失败: {response.text}")

        # 获取统计信息
        stats = requests.get(f"{API_URL}/stats").json()
        print(f"服务统计: {stats}")
        """
        ---
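        The service above runs one forward pass per text. Batching, which the explanation lists as
        a throughput optimization, can be combined with result caching roughly as follows. This is
        a sketch under stated assumptions: `predict_in_batches`, `batch_fn`, and `fake_model` are
        hypothetical names, and `fake_model` only stands in for a real batched forward pass
        (tokenize the whole chunk with padding, then call the model once).

```python
from typing import Callable, Dict, List, Sequence


def predict_in_batches(
    texts: Sequence[str],
    batch_fn: Callable[[List[str]], List[int]],
    cache: Dict[str, int],
    batch_size: int = 32,
) -> List[int]:
    """Call `batch_fn` only for texts missing from `cache`, `batch_size` texts per call."""
    # Deduplicate while preserving first-seen order, then drop cached texts.
    pending = [t for t in dict.fromkeys(texts) if t not in cache]
    for start in range(0, len(pending), batch_size):
        chunk = pending[start:start + batch_size]
        for text, label in zip(chunk, batch_fn(chunk)):
            cache[text] = label
    # Answer in the original order, duplicates included.
    return [cache[t] for t in texts]


# Stand-in for a batched model call; it records every batch it receives.
calls: List[List[str]] = []


def fake_model(batch: List[str]) -> List[int]:
    calls.append(list(batch))
    return [len(t) % 3 for t in batch]


cache: Dict[str, int] = {}
labels = predict_in_batches(["a", "bb", "a", "ccc", "bb"], fake_model, cache, batch_size=2)
print(labels)   # one label per input text
print(calls)    # the batches that actually reached the model
```

        Five inputs lead to only two model calls here, because duplicates are removed before
        batching. A second request for the same texts would be served from `cache` alone.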

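2.5 News Classification in Practice

01.Handling Multi-class Tasks
    a.Explanation
        News classification is a typical multi-class task: each article is assigned to one of a set
        of predefined categories (technology, sports, finance, entertainment, and so on). Unlike the
        three-class sentiment analysis in the previous section, it has to deal with more categories,
        similarity between classes, hierarchical label schemes, and long documents. Common datasets
        include THUCNews, Sogou News (搜狗新闻), and AG News. Feature engineering centers on how to
        weight the title against the body; the title is short but information-dense. Long texts can
        be handled by truncation, segmentation, or hierarchical attention. Class imbalance can be
        mitigated with focal loss or class weights. Evaluation uses macro-averaged and
        micro-averaged F1, and the per-class scores should be inspected as well. For model choice,
        TextCNN suits title classification, BERT suits full-text understanding, and hierarchical
        models suit long documents. Real applications also have to consider the design of the label
        taxonomy, articles that span several categories, and the later addition of new categories.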
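        The difference between macro and micro averaging is easiest to see on a small imbalanced
        example. The helper below is an illustrative sketch in plain Python (the name `f1_scores`
        and the toy labels are assumptions, not course data); in practice
        `sklearn.metrics.f1_score` with `average='macro'` or `average='micro'` does the same job.

```python
from typing import Dict, List, Tuple


def f1_scores(y_true: List[int], y_pred: List[int],
              num_classes: int) -> Tuple[float, float, Dict[int, float]]:
    """Return (macro_f1, micro_f1, per_class_f1) for single-label classification."""
    per_class: Dict[int, float] = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Macro: unweighted mean of per-class F1, so rare classes count as much as frequent ones.
    macro = sum(per_class.values()) / num_classes
    # Micro: pool all counts first, so frequent classes dominate.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    return macro, micro, per_class


# Toy data: class 0 has 8 samples, class 1 has 2, and one class-1 sample is missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
macro_f1, micro_f1, class_f1 = f1_scores(y_true, y_pred, num_classes=2)
print(f"macro={macro_f1:.4f}  micro={micro_f1:.4f}  per-class={class_f1}")
```

        On this toy data micro F1 is 0.90 (equal to accuracy for single-label tasks), while macro
        F1 drops to about 0.80 because the rare class reaches an F1 of only 2/3. This gap is the
        reason per-class scores are worth checking whenever the classes are imbalanced.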
    b.Code example
        ---
        # 新闻多分类实现
        import torch
        import torch.nn as nn
        from torch.utils.data import Dataset, DataLoader
        from transformers import BertTokenizer, BertModel
        import numpy as np
        from sklearn.metrics import classification_report, confusion_matrix
        from typing import List, Dict, Tuple
        import seaborn as sns
        import matplotlib.pyplot as plt

        # 设置中文显示
        plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
        plt.rcParams['axes.unicode_minus'] = False

        class NewsDataset(Dataset):
            """新闻分类数据集"""

            def __init__(self, titles: List[str], contents: List[str],
                        labels: List[int], tokenizer: BertTokenizer,
                        max_length: int = 256):
                """
                初始化数据集

                Args:
                    titles: 新闻标题列表
                    contents: 新闻正文列表
                    labels: 类别标签
                    tokenizer: 分词器
                    max_length: 最大序列长度
                """
                self.titles = titles
                self.contents = contents
                self.labels = labels
                self.tokenizer = tokenizer
                self.max_length = max_length

            def __len__(self):
                return len(self.titles)

            def __getitem__(self, idx):
                title = self.titles[idx]
                content = self.contents[idx]
                label = self.labels[idx]

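                # Truncate the body so that the title is kept intact; clamp at 0 so that
                # a very long title cannot produce a negative slice index
                max_content_len = max(self.max_length - len(title) - 10, 0)

                # Concatenate title and body
                text = f"{title} [SEP] {content[:max_content_len]}"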

                # 编码
                encoding = self.tokenizer(
                    text,
                    add_special_tokens=True,
                    max_length=self.max_length,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )

                return {
                    'input_ids': encoding['input_ids'].flatten(),
                    'attention_mask': encoding['attention_mask'].flatten(),
                    'label': torch.tensor(label, dtype=torch.long)
                }

        class HierarchicalNewsClassifier(nn.Module):
            """层次化新闻分类器(处理长文本)"""

            def __init__(self, bert_model_name: str, num_classes: int,
                        hidden_dim: int = 256, dropout: float = 0.3):
                """
                初始化分类器

                Args:
                    bert_model_name: BERT模型名称
                    num_classes: 类别数
                    hidden_dim: 隐藏层维度
                    dropout: Dropout率
                """
                super(HierarchicalNewsClassifier, self).__init__()

                # BERT编码器
                self.bert = BertModel.from_pretrained(bert_model_name)

                # 句子级编码器
                self.sentence_encoder = nn.GRU(
                    input_size=self.bert.config.hidden_size,
                    hidden_size=hidden_dim,
                    num_layers=1,
                    bidirectional=True,
                    batch_first=True
                )

                # 注意力层
                self.attention = nn.Linear(hidden_dim * 2, 1)

                # Dropout
                self.dropout = nn.Dropout(dropout)

                # 分类层
                self.classifier = nn.Linear(hidden_dim * 2, num_classes)

            def forward(self, input_ids, attention_mask):
                """
                前向传播

                Args:
                    input_ids: 输入token IDs
                    attention_mask: 注意力掩码

                Returns:
                    logits
                """
                # BERT编码
                bert_output = self.bert(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )

                # 使用最后一层hidden states
                sequence_output = bert_output.last_hidden_state  # (batch, seq_len, hidden)

                # 句子级编码
                gru_output, _ = self.sentence_encoder(sequence_output)  # (batch, seq_len, hidden*2)

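                # Attention scores; padding positions are masked out so that they
                # receive zero weight after the softmax
                attention_scores = self.attention(gru_output).squeeze(-1)  # (batch, seq_len)
                attention_scores = attention_scores.masked_fill(
                    attention_mask == 0, float('-inf')
                )
                attention_weights = torch.softmax(attention_scores, dim=1)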

                # 加权求和
                weighted_output = torch.bmm(
                    attention_weights.unsqueeze(1),  # (batch, 1, seq_len)
                    gru_output  # (batch, seq_len, hidden*2)
                ).squeeze(1)  # (batch, hidden*2)

                # Dropout
                weighted_output = self.dropout(weighted_output)

                # 分类
                logits = self.classifier(weighted_output)

                return logits

        class FocalLoss(nn.Module):
            """Focal Loss(处理类别不平衡)"""

            def __init__(self, alpha: List[float] = None, gamma: float = 2.0):
                """
                初始化Focal Loss

                Args:
                    alpha: 各类别权重
                    gamma: 聚焦参数
                """
                super(FocalLoss, self).__init__()
                self.alpha = alpha
                self.gamma = gamma

            def forward(self, inputs, targets):
                """
                计算损失

                Args:
                    inputs: 模型输出logits (batch, num_classes)
                    targets: 目标标签 (batch,)

                Returns:
                    loss值
                """
                ce_loss = nn.functional.cross_entropy(inputs, targets, reduction='none')
                pt = torch.exp(-ce_loss)
                focal_loss = ((1 - pt) ** self.gamma) * ce_loss

                if self.alpha is not None:
                    alpha_t = torch.tensor(self.alpha, device=inputs.device)[targets]
                    focal_loss = alpha_t * focal_loss

                return focal_loss.mean()

        class NewsClassificationTrainer:
            """新闻分类训练器"""

            def __init__(self, model: nn.Module, device: str = 'cpu',
                        class_names: List[str] = None):
                """
                初始化训练器

                Args:
                    model: 分类模型
                    device: 设备
                    class_names: 类别名称列表
                """
                self.model = model.to(device)
                self.device = device
                self.class_names = class_names or []
                self.best_f1 = 0.0

            def train_epoch(self, train_loader: DataLoader,
                          optimizer, criterion, epoch: int):
                """训练一个epoch"""
                self.model.train()
                total_loss = 0
                correct = 0
                total = 0

                for batch_idx, batch in enumerate(train_loader):
                    input_ids = batch['input_ids'].to(self.device)
                    attention_mask = batch['attention_mask'].to(self.device)
                    labels = batch['label'].to(self.device)

                    # 前向传播
                    optimizer.zero_grad()
                    outputs = self.model(input_ids, attention_mask)
                    loss = criterion(outputs, labels)

                    # 反向传播
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
                    optimizer.step()

                    # 统计
                    total_loss += loss.item()
                    predictions = torch.argmax(outputs, dim=1)
                    correct += (predictions == labels).sum().item()
                    total += labels.size(0)

                    if (batch_idx + 1) % 10 == 0:
                        print(f'  Batch {batch_idx+1}/{len(train_loader)} - '
                              f'Loss: {loss.item():.4f}, Acc: {correct/total:.4f}')

                avg_loss = total_loss / len(train_loader)
                accuracy = correct / total

                return avg_loss, accuracy

            def evaluate(self, test_loader: DataLoader,
                        plot_confusion: bool = True) -> Dict:
                """完整评估"""
                self.model.eval()
                all_preds = []
                all_labels = []

                with torch.no_grad():
                    for batch in test_loader:
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        labels = batch['label'].to(self.device)

                        outputs = self.model(input_ids, attention_mask)
                        predictions = torch.argmax(outputs, dim=1)

                        all_preds.extend(predictions.cpu().numpy())
                        all_labels.extend(labels.cpu().numpy())

                # 打印分类报告
                print(f"\n{'='*70}")
                print("分类报告")
                print(f"{'='*70}\n")

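                print(classification_report(
                    all_labels, all_preds,
                    # Fix the label set so the report still works when a class is
                    # missing from the test split
                    labels=list(range(len(self.class_names))) or None,
                    target_names=self.class_names or None,
                    digits=4
                ))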

                # 绘制混淆矩阵
                if plot_confusion:
                    self.plot_confusion_matrix(all_labels, all_preds)

                return {
                    'predictions': all_preds,
                    'labels': all_labels
                }

            def plot_confusion_matrix(self, y_true, y_pred):
                """绘制混淆矩阵"""
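                # Fix the label set so the matrix always lines up with class_names,
                # even when a class is absent from y_true / y_pred
                labels = list(range(len(self.class_names))) or None
                cm = confusion_matrix(y_true, y_pred, labels=labels)

                # Row-normalize; rows without true samples stay all-zero instead of NaN
                row_sums = cm.sum(axis=1, keepdims=True)
                cm_normalized = cm.astype('float') / np.maximum(row_sums, 1)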

                plt.figure(figsize=(12, 10))
                sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues',
                           xticklabels=self.class_names,
                           yticklabels=self.class_names,
                           cbar_kws={'label': '比例'})

                plt.title('新闻分类混淆矩阵(归一化)', fontsize=14, fontweight='bold', pad=15)
                plt.ylabel('真实类别', fontsize=12, fontweight='bold')
                plt.xlabel('预测类别', fontsize=12, fontweight='bold')

                plt.tight_layout()
                plt.savefig('news_confusion_matrix.png', dpi=300, bbox_inches='tight')
                print("\n✓ 混淆矩阵已保存为 news_confusion_matrix.png")

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("新闻多分类实战示例")
            print("="*70 + "\n")

            # 模拟数据
            class_names = ['科技', '体育', '财经', '娱乐', '政治']

            # 模拟新闻数据
            titles = [
                "最新AI技术突破", "篮球比赛精彩瞬间", "股市行情分析",
                "电影票房创新高", "国际政治新动态"
            ] * 40

            contents = [
                "人工智能领域取得重大进展,新算法性能提升显著。" * 10,
                "昨晚的篮球比赛非常精彩,双方球员表现出色。" * 10,
                "今日股市震荡上行,投资者信心增强。" * 10,
                "最新上映的电影获得观众好评,票房持续攀升。" * 10,
                "国际政治局势复杂多变,各国积极应对。" * 10
            ] * 40

            labels = [0, 1, 2, 3, 4] * 40

            # 打乱数据
            indices = np.random.permutation(len(titles))
            titles = [titles[i] for i in indices]
            contents = [contents[i] for i in indices]
            labels = [labels[i] for i in indices]

            # 分割数据
            split_idx = int(len(titles) * 0.8)
            train_titles = titles[:split_idx]
            train_contents = contents[:split_idx]
            train_labels = labels[:split_idx]

            test_titles = titles[split_idx:]
            test_contents = contents[split_idx:]
            test_labels = labels[split_idx:]

            print(f"训练集: {len(train_titles)} 样本")
            print(f"测试集: {len(test_titles)} 样本\n")

            # 加载tokenizer
            tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

            # 创建数据集
            train_dataset = NewsDataset(
                train_titles, train_contents, train_labels,
                tokenizer, max_length=256
            )

            test_dataset = NewsDataset(
                test_titles, test_contents, test_labels,
                tokenizer, max_length=256
            )

            train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
            test_loader = DataLoader(test_dataset, batch_size=16)

            # 创建模型
            device = 'cuda' if torch.cuda.is_available() else 'cpu'
            print(f"使用设备: {device}\n")

            model = HierarchicalNewsClassifier(
                bert_model_name='bert-base-chinese',
                num_classes=len(class_names),
                hidden_dim=256,
                dropout=0.3
            )

            print(f"模型参数量: {sum(p.numel() for p in model.parameters()):,}\n")

            # 创建训练器
            trainer = NewsClassificationTrainer(model, device, class_names)

            # 计算类别权重(处理不平衡)
            from collections import Counter
            label_counts = Counter(train_labels)
            total = sum(label_counts.values())
            class_weights = [total / (len(label_counts) * label_counts[i])
                           for i in range(len(class_names))]

            print(f"类别权重: {class_weights}\n")

            # 创建Focal Loss
            criterion = FocalLoss(alpha=class_weights, gamma=2.0)
            optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

            # 训练
            print("开始训练...\n")
            for epoch in range(3):
                print(f"Epoch {epoch+1}/3")
                train_loss, train_acc = trainer.train_epoch(
                    train_loader, optimizer, criterion, epoch
                )
                print(f"\n训练集 - Loss: {train_loss:.4f}, Accuracy: {train_acc:.4f}\n")

            # 评估
            print("\n" + "="*70)
            print("测试集评估")
            print("="*70)

            trainer.evaluate(test_loader, plot_confusion=True)
        ---
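        The `FocalLoss` class above multiplies each sample's cross-entropy by `(1 - pt) ** gamma`.
        The arithmetic below shows what that factor does for an easy, a borderline, and a hard
        sample. It is a plain-Python illustration only: `focal_term` is a hypothetical helper, the
        probabilities are made up, and the optional `alpha` class weights are left out.

```python
import math


def focal_term(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss of one sample whose true class received probability `p_true`."""
    ce = -math.log(p_true)                  # ordinary cross-entropy
    return (1.0 - p_true) ** gamma * ce     # modulating factor shrinks easy samples


for p in (0.9, 0.5, 0.1):
    ce = -math.log(p)
    fl = focal_term(p)
    print(f"p_true={p:.1f}  CE={ce:.4f}  focal={fl:.4f}  focal/CE={fl / ce:.2f}")
```

        With `gamma=2.0` the ratio focal/CE equals `(1 - p_true) ** 2`: a confidently correct
        sample (0.9) keeps 1% of its loss, while a badly misclassified one (0.1) keeps 81%.
        Setting `gamma=0` recovers plain cross-entropy.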

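02.Hierarchical Classification
    a.Explanation
        News often follows a hierarchical taxonomy, for example the three-level path
        "科技-互联网-人工智能" (Technology > Internet > Artificial Intelligence). Hierarchical
        classification has to respect parent-child relations and can use either local classifiers
        (one independent classifier per level) or a global classifier (predicting the full path at
        once). The loss function has to balance the levels, for example through weighted
        multi-task learning. Feature-sharing strategies include sharing low-level features (a
        common BERT encoder), keeping high-level features separate (an independent classification
        head per category), and hierarchical attention. Evaluation should take hierarchical
        consistency into account: a prediction such as "科技-体育-AI" (Technology > Sports > AI) is
        not a valid path. In practice a two-stage scheme is common: predict the level-1 category
        first, then predict the level-2 category only among its children. This rules out
        inconsistent paths, although a wrong level-1 prediction still propagates to level 2, so
        level-1 accuracy matters most. Hierarchical classification can also use samples that are
        labeled only at the parent level to strengthen training of the child classes.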
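        Hierarchical consistency and two-stage prediction can be sketched without any model. In
        the example below, the small `HIERARCHY` dictionary, the score values, and the helper names
        `is_consistent` and `two_stage_decode` are all assumptions made for illustration; the
        scores stand in for classifier outputs.

```python
from typing import Dict, List, Tuple

# Hypothetical two-level taxonomy: parent -> children.
HIERARCHY: Dict[str, List[str]] = {
    "科技": ["互联网", "人工智能"],
    "体育": ["篮球", "足球"],
}


def is_consistent(parent: str, child: str, hierarchy: Dict[str, List[str]]) -> bool:
    """A (parent, child) pair is valid only if the child is listed under the parent."""
    return child in hierarchy.get(parent, [])


def two_stage_decode(
    level1_scores: Dict[str, float],
    level2_scores: Dict[str, float],
    hierarchy: Dict[str, List[str]],
) -> Tuple[str, str]:
    """Pick the best parent, then the best child among that parent's children only."""
    parent = max(level1_scores, key=level1_scores.get)
    child = max(hierarchy[parent], key=lambda c: level2_scores.get(c, float("-inf")))
    return parent, child


level1 = {"科技": 0.7, "体育": 0.3}
level2 = {"互联网": 0.2, "人工智能": 0.3, "篮球": 0.4, "足球": 0.1}

# Independent argmax per level can produce an invalid path.
flat_parent = max(level1, key=level1.get)
flat_child = max(level2, key=level2.get)
print(flat_parent, flat_child, is_consistent(flat_parent, flat_child, HIERARCHY))

# Two-stage decoding restricts level 2 to the children of the chosen parent.
print(two_stage_decode(level1, level2, HIERARCHY))
```

        Two-stage decoding always returns a valid path, but it cannot recover from a wrong parent:
        if level 1 had picked "体育" here, no "科技" child could be selected afterwards.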
    b.Code example
        ---
        # 层次化新闻分类实现
        import torch
        import torch.nn as nn
        from typing import List, Dict, Tuple
        import numpy as np

        class HierarchicalClassificationHead(nn.Module):
            """层次化分类头"""

            def __init__(self, input_dim: int, hierarchy: Dict[str, List[str]],
                        dropout: float = 0.3):
                """
                初始化层次化分类头

                Args:
                    input_dim: 输入特征维度
                    hierarchy: 层次结构字典 {父类: [子类列表]}
                    dropout: Dropout率
                """
                super(HierarchicalClassificationHead, self).__init__()

                self.hierarchy = hierarchy
                self.dropout = nn.Dropout(dropout)

                # 一级分类器(父类)
                self.level1_classes = list(hierarchy.keys())
                self.level1_classifier = nn.Linear(input_dim, len(self.level1_classes))

                # 二级分类器(子类,每个父类一个)
                self.level2_classifiers = nn.ModuleDict()
                for parent in self.level1_classes:
                    n_children = len(hierarchy[parent])
                    self.level2_classifiers[parent] = nn.Linear(input_dim, n_children)

            def forward(self, features, level1_labels=None, training=True):
                """
                前向传播

                Args:
                    features: 输入特征 (batch, input_dim)
                    level1_labels: 一级类别标签(训练时提供)
                    training: 是否训练模式

                Returns:
                    level1_logits, level2_logits字典
                """
                features = self.dropout(features)

                # 一级分类
                level1_logits = self.level1_classifier(features)

                # 二级分类
                level2_logits = {}

                if training and level1_labels is not None:
                    # 训练模式:使用真实的一级标签
                    for i, parent in enumerate(self.level1_classes):
                        # 找到属于该父类的样本
                        mask = (level1_labels == i)
                        if mask.sum() > 0:
                            parent_features = features[mask]
                            level2_logits[parent] = self.level2_classifiers[parent](parent_features)
                else:
                    # 推理模式:使用预测的一级标签
                    level1_preds = torch.argmax(level1_logits, dim=1)

                    for i, parent in enumerate(self.level1_classes):
                        mask = (level1_preds == i)
                        if mask.sum() > 0:
                            parent_features = features[mask]
                            level2_logits[parent] = self.level2_classifiers[parent](parent_features)

                return level1_logits, level2_logits

        class HierarchicalLoss(nn.Module):
            """层次化损失函数"""

            def __init__(self, level1_weight: float = 0.3, level2_weight: float = 0.7):
                """
                初始化层次化损失

                Args:
                    level1_weight: 一级分类损失权重
                    level2_weight: 二级分类损失权重
                """
                super(HierarchicalLoss, self).__init__()
                self.level1_weight = level1_weight
                self.level2_weight = level2_weight
                self.ce_loss = nn.CrossEntropyLoss()

            def forward(self, level1_logits, level2_logits_dict,
                       level1_labels, level2_labels, hierarchy, level1_classes):
                """
                计算总损失

                Args:
                    level1_logits: 一级分类logits
                    level2_logits_dict: 二级分类logits字典
                    level1_labels: 一级类别标签
                    level2_labels: 二级类别标签
                    hierarchy: 层次结构
                    level1_classes: 一级类别列表

                Returns:
                    总损失
                """
                # 一级分类损失
                loss_level1 = self.ce_loss(level1_logits, level1_labels)

                # Level-2 loss: mean of the per-parent cross-entropy losses,
                # taken over the parents that actually occur in this batch
                loss_level2 = level1_logits.new_zeros(())  # tensor, so .item() works even with no level-2 samples
                n_parents = 0

                for i, parent in enumerate(level1_classes):
                    mask = (level1_labels == i)
                    if mask.sum() > 0 and parent in level2_logits_dict:
                        parent_level2_labels = level2_labels[mask]
                        parent_level2_logits = level2_logits_dict[parent]

                        loss_level2 = loss_level2 + self.ce_loss(parent_level2_logits, parent_level2_labels)
                        n_parents += 1

                if n_parents > 0:
                    loss_level2 = loss_level2 / n_parents

                # 总损失
                total_loss = (self.level1_weight * loss_level1 +
                            self.level2_weight * loss_level2)

                return total_loss, loss_level1, loss_level2

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("层次化新闻分类示例")
            print("="*70 + "\n")

            # 定义层次结构
            hierarchy = {
                '科技': ['人工智能', '互联网', '硬件'],
                '体育': ['足球', '篮球', '网球'],
                '财经': ['股票', '基金', '外汇'],
                '娱乐': ['电影', '音乐', '综艺']
            }

            print("层次结构:")
            for parent, children in hierarchy.items():
                print(f"  {parent}: {children}")

            # 创建层次化分类头
            input_dim = 768  # BERT hidden size
            classifier = HierarchicalClassificationHead(
                input_dim=input_dim,
                hierarchy=hierarchy,
                dropout=0.3
            )

            print(f"\n模型参数量: {sum(p.numel() for p in classifier.parameters()):,}")

            # 模拟输入
            batch_size = 8
            features = torch.randn(batch_size, input_dim)

            # 模拟标签
            level1_labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # 一级标签
            level2_labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])  # 二级标签

            # 前向传播(训练模式)
            print("\n训练模式前向传播:")
            level1_logits, level2_logits = classifier(
                features, level1_labels, training=True
            )

            print(f"  一级分类 logits shape: {level1_logits.shape}")
            print(f"  二级分类 logits:")
            for parent, logits in level2_logits.items():
                print(f"    {parent}: {logits.shape}")

            # 计算损失
            loss_fn = HierarchicalLoss(level1_weight=0.3, level2_weight=0.7)

            total_loss, loss_l1, loss_l2 = loss_fn(
                level1_logits, level2_logits,
                level1_labels, level2_labels,
                hierarchy, list(hierarchy.keys())
            )

            print(f"\n损失:")
            print(f"  一级损失: {loss_l1.item():.4f}")
            print(f"  二级损失: {loss_l2.item():.4f}")
            print(f"  总损失: {total_loss.item():.4f}")

            # 推理模式
            print("\n推理模式前向传播:")
            classifier.eval()
            with torch.no_grad():
                level1_logits, level2_logits = classifier(
                    features, training=False
                )

                level1_preds = torch.argmax(level1_logits, dim=1)
                print(f"  一级预测: {level1_preds}")

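                # Decode the full predicted path (level 1 > level 2)
                level1_classes = list(hierarchy.keys())

                # level2_logits[parent] only holds the rows routed to that parent,
                # in batch order, so map them back to their sample indices
                level2_names = {}
                for i, parent in enumerate(level1_classes):
                    if parent in level2_logits:
                        sample_ids = (level1_preds == i).nonzero(as_tuple=True)[0].tolist()
                        child_ids = level2_logits[parent].argmax(dim=1).tolist()
                        for s, c in zip(sample_ids, child_ids):
                            level2_names[s] = hierarchy[parent][c]

                print("\n  完整预测路径:")
                for i, (l1_pred, l1_label) in enumerate(zip(level1_preds, level1_labels)):
                    l1_class = level1_classes[l1_pred.item()]
                    l1_true = level1_classes[l1_label.item()]

                    print(f"    样本{i}: {l1_class} > {level2_names[i]} (真实: {l1_true})")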
        ---
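        The inference branch of `HierarchicalClassificationHead.forward` commits to the arg-max parent before any child head runs, so a level-1 mistake cannot be recovered at level 2. One possible alternative, sketched here in plain Python, scores every (parent, child) path by P(parent) * P(child | parent) and keeps the best one. `softmax` and `best_path` are hypothetical helpers that are not part of the code above, the logits are made-up numbers, and the sketch assumes that every level-2 head is evaluated for every sample.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a plain list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_path(
    level1_logits: List[float],
    level2_logits: Dict[str, List[float]],
    hierarchy: Dict[str, List[str]],
) -> Tuple[str, str, float]:
    """Return the (parent, child, probability) path with the highest joint probability."""
    parents = list(hierarchy.keys())
    p_parent = softmax(level1_logits)
    best = ("", "", -1.0)
    for i, parent_name in enumerate(parents):
        p_child = softmax(level2_logits[parent_name])
        for j, child_name in enumerate(hierarchy[parent_name]):
            joint = p_parent[i] * p_child[j]  # P(parent) * P(child | parent)
            if joint > best[2]:
                best = (parent_name, child_name, joint)
    return best


hierarchy = {"科技": ["人工智能", "互联网"], "体育": ["足球", "篮球"]}

# Made-up logits for one sample: level 1 slightly prefers 科技,
# but only the 体育 child head is confident.
parent, child, prob = best_path(
    level1_logits=[0.2, 0.0],
    level2_logits={"科技": [0.0, 0.0], "体育": [4.0, 0.0]},
    hierarchy=hierarchy,
)
print(parent, child, round(prob, 3))
```

        With these numbers the greedy route would pick 科技 at level 1, while the joint score prefers 体育 > 足球. The cost is one forward pass through every child head instead of one.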

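03.New-class extension strategies
    a.Explanation
        In production, new classes have to be added regularly, and retraining on the full dataset each time is expensive. Extension strategies include:
        incremental learning (fine-tune only on new-class data), few-shot learning (learn a new class from a handful of samples),
        prototypical networks (classify by distance to class centroids), and meta-learning (adapt quickly to new tasks). Knowledge distillation helps retain old-class knowledge and reduces forgetting.
        Data augmentation and transfer learning ease the shortage of new-class samples. A practical option is a "base model + plug-in" architecture:
        the BERT encoder stays unchanged and only a new classification head is added. Zero-shot learning classifies from textual class descriptions and needs no labeled data.
        Continual learning has to balance performance on old and new classes, for example with experience replay or elastic weight consolidation (EWC). Evaluate all classes regularly
        to catch catastrophic forgetting early. Model versioning and A/B testing help keep the rollout smooth.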
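        Experience replay, one of the continual-learning techniques named above, can be illustrated without any framework. The sketch below keeps a fixed-size memory of old-class samples via reservoir sampling and mixes a few of them into each new-class batch. `ReplayBuffer`, its capacity, and the sample texts are hypothetical and chosen only for illustration.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label)


class ReplayBuffer:
    """Fixed-size memory of old-class samples, filled by reservoir sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.memory: List[Sample] = []
        self.seen = 0  # number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample: Sample) -> None:
        """Offer one sample; each offered sample ends up stored with equal probability."""
        self.seen += 1
        if len(self.memory) < self.capacity:
            self.memory.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.memory[slot] = sample

    def mix(self, new_batch: List[Sample], n_replay: int) -> List[Sample]:
        """Append up to n_replay stored old samples to a batch of new-class samples."""
        k = min(n_replay, len(self.memory))
        return new_batch + self.rng.sample(self.memory, k)


buffer = ReplayBuffer(capacity=4)
for i in range(20):
    buffer.add((f"old sample {i}", i % 3))  # labels 0-2 stand for the old classes

new_batch = [("new sample a", 3), ("new sample b", 4)]
batch = buffer.mix(new_batch, n_replay=2)
print(len(buffer.memory), len(batch))
```

        Training on `batch` instead of `new_batch` exposes the model to old classes at every step; how many replayed samples are needed per batch is a tuning question this sketch does not answer.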
    b.Code example
        ---
        # 新类别增量学习实现
        import torch
        import torch.nn as nn
        from transformers import BertModel, BertTokenizer
        from typing import List, Dict
        import numpy as np

        class IncrementalNewsClassifier(nn.Module):
            """支持增量学习的新闻分类器"""

            def __init__(self, bert_model_name: str, initial_classes: int,
                        hidden_dim: int = 256, dropout: float = 0.3):
                """
                初始化分类器

                Args:
                    bert_model_name: BERT模型名称
                    initial_classes: 初始类别数
                    hidden_dim: 隐藏层维度
                    dropout: Dropout率
                """
                super(IncrementalNewsClassifier, self).__init__()

                # BERT encoder (trainable by default; call freeze_bert() to keep it fixed)
                self.bert = BertModel.from_pretrained(bert_model_name)

                # 投影层
                self.projection = nn.Sequential(
                    nn.Linear(self.bert.config.hidden_size, hidden_dim),
                    nn.ReLU(),
                    nn.Dropout(dropout)
                )

                # 分类头(可扩展)
                self.classifiers = nn.ModuleList([
                    nn.Linear(hidden_dim, initial_classes)
                ])

                self.num_classes = [initial_classes]

            def add_new_class_head(self, num_new_classes: int):
                """
                添加新类别分类头

                Args:
                    num_new_classes: 新增类别数
                """
                hidden_dim = self.projection[0].out_features

                # 添加新分类头
                new_classifier = nn.Linear(hidden_dim, num_new_classes)

                # 初始化权重
                nn.init.xavier_uniform_(new_classifier.weight)
                nn.init.zeros_(new_classifier.bias)

                self.classifiers.append(new_classifier)
                self.num_classes.append(num_new_classes)

                print(f"✓ 添加新分类头: {num_new_classes} 个类别")

            def freeze_bert(self):
                """冻结BERT参数"""
                for param in self.bert.parameters():
                    param.requires_grad = False
                print("✓ BERT参数已冻结")

            def unfreeze_bert(self):
                """解冻BERT参数"""
                for param in self.bert.parameters():
                    param.requires_grad = True
                print("✓ BERT参数已解冻")

            def forward(self, input_ids, attention_mask, classifier_idx=0):
                """
                前向传播

                Args:
                    input_ids: 输入token IDs
                    attention_mask: 注意力掩码
                    classifier_idx: 使用哪个分类头

                Returns:
                    logits
                """
                # BERT编码
                outputs = self.bert(
                    input_ids=input_ids,
                    attention_mask=attention_mask
                )

                # Pooled sentence vector: the [CLS] hidden state passed through BERT's pooler (Linear + Tanh)
                pooled_output = outputs.pooler_output

                # 投影
                features = self.projection(pooled_output)

                # 分类
                logits = self.classifiers[classifier_idx](features)

                return logits

        class ProtoypicalClassifier:
            """基于原型的few-shot分类器"""

            def __init__(self, feature_extractor: nn.Module):
                """
                初始化原型分类器

                Args:
                    feature_extractor: 特征提取器
                """
                self.feature_extractor = feature_extractor
                self.prototypes = {}  # {类别: 原型向量}

            def compute_prototype(self, texts: List[str], label: int,
                                tokenizer: BertTokenizer, device: str = 'cpu'):
                """
                计算类别原型

                Args:
                    texts: 该类别的样本文本
                    label: 类别标签
                    tokenizer: 分词器
                    device: 设备
                """
                features = []

                self.feature_extractor.eval()
                with torch.no_grad():
                    for text in texts:
                        # 编码
                        encoding = tokenizer(
                            text,
                            add_special_tokens=True,
                            max_length=128,
                            padding='max_length',
                            truncation=True,
                            return_tensors='pt'
                        )

                        input_ids = encoding['input_ids'].to(device)
                        attention_mask = encoding['attention_mask'].to(device)

                        # 提取特征(使用投影后的特征)
                        outputs = self.feature_extractor.bert(
                            input_ids=input_ids,
                            attention_mask=attention_mask
                        )

                        feature = self.feature_extractor.projection(
                            outputs.pooler_output
                        )

                        features.append(feature.cpu().numpy())

                # 计算原型(类别中心)
                prototype = np.mean(features, axis=0)
                self.prototypes[label] = prototype

                print(f"✓ 计算类别 {label} 的原型 (基于 {len(texts)} 个样本)")

            def predict(self, text: str, tokenizer: BertTokenizer,
                       device: str = 'cpu') -> Dict:
                """
                基于原型距离预测

                Args:
                    text: 输入文本
                    tokenizer: 分词器
                    device: 设备

                Returns:
                    预测结果字典
                """
                # 编码
                encoding = tokenizer(
                    text,
                    add_special_tokens=True,
                    max_length=128,
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt'
                )

                input_ids = encoding['input_ids'].to(device)
                attention_mask = encoding['attention_mask'].to(device)

                # 提取特征
                self.feature_extractor.eval()
                with torch.no_grad():
                    outputs = self.feature_extractor.bert(
                        input_ids=input_ids,
                        attention_mask=attention_mask
                    )

                    feature = self.feature_extractor.projection(
                        outputs.pooler_output
                    ).cpu().numpy()

                # 计算与各原型的距离
                distances = {}
                for label, prototype in self.prototypes.items():
                    distance = np.linalg.norm(feature - prototype)
                    distances[label] = distance

                # 最近的原型
                pred_label = min(distances, key=distances.get)

                return {
                    'label': pred_label,
                    'distances': distances
                }

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("新类别增量学习示例")
            print("="*70 + "\n")

            # 1. 创建初始模型
            print("1. 创建初始模型(3个类别)")
            model = IncrementalNewsClassifier(
                bert_model_name='bert-base-chinese',
                initial_classes=3,
                hidden_dim=256
            )

            print(f"   初始参数量: {sum(p.numel() for p in model.parameters()):,}")
            print(f"   分类头数量: {len(model.classifiers)}")
            print(f"   类别数量: {model.num_classes}\n")

            # 2. 冻结BERT
            print("2. 冻结BERT参数")
            model.freeze_bert()

            trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
            print(f"   可训练参数: {trainable_params:,}\n")

            # 3. 添加新类别
            print("3. 添加新类别分类头(2个新类别)")
            model.add_new_class_head(num_new_classes=2)

            print(f"   更新后分类头数量: {len(model.classifiers)}")
            print(f"   更新后类别数量: {model.num_classes}\n")

            # 4. Few-shot学习示例
            print("4. Few-shot原型学习")

            tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

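            # Build the prototype classifier. The projection layer is still randomly
            # initialized at this point, so the distances printed below are only illustrative.
            proto_classifier = ProtoypicalClassifier(model)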

            # 模拟few-shot数据(每类5个样本)
            few_shot_data = {
                0: ["科技新闻样本1", "科技新闻样本2", "科技新闻样本3", "科技新闻样本4", "科技新闻样本5"],
                1: ["体育新闻样本1", "体育新闻样本2", "体育新闻样本3", "体育新闻样本4", "体育新闻样本5"],
                2: ["财经新闻样本1", "财经新闻样本2", "财经新闻样本3", "财经新闻样本4", "财经新闻样本5"],
            }

            # 计算各类别原型
            for label, texts in few_shot_data.items():
                proto_classifier.compute_prototype(texts, label, tokenizer)

            # 预测
            print("\n5. 基于原型预测")
            test_texts = ["科技发展迅速", "篮球比赛激烈", "股市波动"]

            for text in test_texts:
                result = proto_classifier.predict(text, tokenizer)
                print(f"   文本: {text}")
                print(f"   预测类别: {result['label']}")
                print(f"   距离: {result['distances']}\n")

            print("✓ 增量学习演示完成")
        ---
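        `IncrementalNewsClassifier` keeps one head per extension round and selects it with `classifier_idx`, which leaves open how to report a single class id across all heads. One simple convention, sketched here in plain Python, numbers the classes head by head. The helpers `to_global`, `to_local`, and `predict_global` are hypothetical. Note that taking the arg-max over concatenated logits assumes the heads produce comparable scores, which separately trained heads do not guarantee.

```python
from typing import List, Tuple


def to_global(head_idx: int, local_idx: int, num_classes: List[int]) -> int:
    """Map (head index, class index inside that head) to one global class id."""
    if not 0 <= local_idx < num_classes[head_idx]:
        raise ValueError("local_idx out of range for this head")
    return sum(num_classes[:head_idx]) + local_idx


def to_local(global_idx: int, num_classes: List[int]) -> Tuple[int, int]:
    """Inverse mapping: global class id -> (head index, local class index)."""
    if global_idx < 0:
        raise ValueError("global_idx must be non-negative")
    offset = 0
    for head_idx, n in enumerate(num_classes):
        if global_idx < offset + n:
            return head_idx, global_idx - offset
        offset += n
    raise ValueError("global_idx out of range")


def predict_global(per_head_logits: List[List[float]]) -> int:
    """Concatenate the logits of every head and take the arg-max over all classes."""
    flat = [x for head in per_head_logits for x in head]
    return max(range(len(flat)), key=flat.__getitem__)


num_classes = [3, 2]  # same layout as model.num_classes after add_new_class_head(2)
pred = predict_global([[0.1, 0.5, -0.2], [1.2, 0.3]])  # made-up logits
head, local = to_local(pred, num_classes)
print(pred, head, local)
```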

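2.6 Optimization techniques

01.Data augmentation and regularization
    a.Explanation
        Text classification models overfit easily, especially on small datasets. Data augmentation techniques include back-translation (Chinese-English round trips to add diversity),
        synonym replacement (word substitutions that preserve meaning), random insertion/deletion/swap (the EDA operations), adversarial training (perturbations that improve robustness),
        and Mixup (interpolating pairs of samples). Regularization methods include Dropout (randomly dropping units), weight decay (an L2-style penalty, applied in decoupled form by AdamW),
        label smoothing (softening the one-hot softmax target), and early stopping. For BERT-style models,
        more advanced options are R-Drop (consistency regularization) and adversarial perturbations on the embeddings (FGM/PGD).
        For hyperparameter tuning, Bayesian optimization or grid search is recommended; the key hyperparameters are learning rate, batch size, dropout rate, and weight decay coefficient.
        Useful training tricks include learning-rate warmup, gradient accumulation to simulate large batches, and mixed-precision training for speed.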
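        Two of the EDA operations listed above, random swap and random deletion, fit in a few lines of plain Python. This is a minimal sketch, not the reference EDA implementation: `random_swap` and `random_delete` are hypothetical helper names, and the input is assumed to be already tokenized (for Chinese text a word segmenter would have to run first).

```python
import random
from typing import List


def random_swap(tokens: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Return a copy of tokens in which two random positions are swapped, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(42)
tokens = ["stock", "market", "closed", "higher", "today"]

swapped = random_swap(tokens, n_swaps=2, rng=rng)
deleted = random_delete(tokens, p=0.3, rng=rng)
print(swapped)
print(deleted)
```

        Both functions leave the input list untouched, so several augmented variants can be generated from the same sentence.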
    b.Code example
        ---
        # 文本分类优化技巧实现
        import torch
        import torch.nn as nn
        import torch.nn.functional as F
        from transformers import BertModel
        import numpy as np
        from typing import List, Dict, Tuple

        class LabelSmoothingLoss(nn.Module):
            """标签平滑损失"""

            def __init__(self, num_classes: int, smoothing: float = 0.1):
                """
                初始化标签平滑损失

                Args:
                    num_classes: 类别数
                    smoothing: 平滑系数
                """
                super(LabelSmoothingLoss, self).__init__()
                self.num_classes = num_classes
                self.smoothing = smoothing
                self.confidence = 1.0 - smoothing

            def forward(self, pred, target):
                """
                计算损失

                Args:
                    pred: 预测logits (batch, num_classes)
                    target: 目标标签 (batch,)

                Returns:
                    loss值
                """
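                log_probs = pred.log_softmax(dim=-1)

                # Smoothed target: `confidence` on the true class, `smoothing`
                # spread evenly over the remaining num_classes - 1 classes
                true_dist = torch.full_like(log_probs, self.smoothing / (self.num_classes - 1))
                true_dist.scatter_(1, target.unsqueeze(1), self.confidence)

                # Cross-entropy against the smoothed distribution. This is not a KL divergence:
                # the two differ by the entropy of true_dist, which does not depend on the model.
                loss = torch.sum(-true_dist * log_probs, dim=-1)

                return loss.mean()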

        class RDropLoss(nn.Module):
            """R-Drop一致性正则化损失"""

            def __init__(self, alpha: float = 4.0):
                """
                初始化R-Drop损失

                Args:
                    alpha: KL散度权重
                """
                super(RDropLoss, self).__init__()
                self.alpha = alpha
                self.ce_loss = nn.CrossEntropyLoss()

            def forward(self, logits1, logits2, labels):
                """
                计算R-Drop损失

                Args:
                    logits1: 第一次前向传播的logits
                    logits2: 第二次前向传播的logits
                    labels: 目标标签

                Returns:
                    总损失
                """
                # 交叉熵损失
                ce_loss = 0.5 * (self.ce_loss(logits1, labels) +
                                self.ce_loss(logits2, labels))

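                # Symmetric KL divergence between the two predictive distributions.
                # F.kl_div(input, target) expects log-probabilities as input and
                # probabilities as target, and computes KL(target || input).
                log_p1 = F.log_softmax(logits1, dim=-1)
                log_p2 = F.log_softmax(logits2, dim=-1)

                p1 = F.softmax(logits1, dim=-1)
                p2 = F.softmax(logits2, dim=-1)

                kl_loss = 0.5 * (F.kl_div(log_p1, p2, reduction='batchmean') +
                                F.kl_div(log_p2, p1, reduction='batchmean'))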

                # 总损失
                loss = ce_loss + self.alpha * kl_loss

                return loss, ce_loss, kl_loss

        class FGM:
            """Fast Gradient Method对抗训练"""

            def __init__(self, model: nn.Module, epsilon: float = 1.0):
                """
                初始化FGM

                Args:
                    model: 模型
                    epsilon: 扰动系数
                """
                self.model = model
                self.epsilon = epsilon
                self.backup = {}

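            def attack(self, emb_name: str = 'word_embeddings'):
                """
                Perturb the embedding weights along the gradient direction.
                Call after loss.backward() so that param.grad is populated.

                Args:
                    emb_name: substring identifying the embedding parameters
                """
                for name, param in self.model.named_parameters():
                    # Skip parameters that have no gradient yet instead of crashing in torch.norm
                    if param.requires_grad and emb_name in name and param.grad is not None:
                        # Back up the original weights so restore() can undo the attack
                        self.backup[name] = param.data.clone()

                        # Perturbation r = epsilon * g / ||g||_2
                        norm = torch.norm(param.grad)
                        if norm != 0 and not torch.isnan(norm):
                            r_at = self.epsilon * param.grad / norm
                            param.data.add_(r_at)

            def restore(self, emb_name: str = 'word_embeddings'):
                """
                Restore the weights saved by attack().

                Args:
                    emb_name: kept for symmetry with attack(); only backed-up parameters are restored
                """
                for name, param in self.model.named_parameters():
                    if name in self.backup:
                        param.data = self.backup[name]

                self.backup = {}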

        class PGD:
            """Projected Gradient Descent对抗训练"""

            def __init__(self, model: nn.Module, epsilon: float = 1.0,
                        alpha: float = 0.3, K: int = 3):
                """
                初始化PGD

                Args:
                    model: 模型
                    epsilon: 扰动上限
                    alpha: 步长
                    K: 迭代次数
                """
                self.model = model
                self.epsilon = epsilon
                self.alpha = alpha
                self.K = K
                self.backup = {}
                self.grad_backup = {}

            def attack(self, emb_name: str = 'word_embeddings', is_first_attack: bool = False):
                """Take one PGD step on the embedding weights, then project back into the epsilon-ball"""
                for name, param in self.model.named_parameters():
                    if param.requires_grad and emb_name in name and param.grad is not None:
                        if is_first_attack or name not in self.backup:
                            self.backup[name] = param.data.clone()

                        # One step of size alpha along the L2-normalized gradient
                        norm = torch.norm(param.grad)
                        if norm != 0 and not torch.isnan(norm):
                            r_at = self.alpha * param.grad / norm
                            param.data.add_(r_at)

                            # Project the accumulated perturbation onto the L2 ball of radius
                            # epsilon, matching the L2 norm used for the step itself
                            r = param.data - self.backup[name]
                            r_norm = torch.norm(r)
                            if r_norm > self.epsilon:
                                r = self.epsilon * r / r_norm
                            param.data = self.backup[name] + r

            def restore(self, emb_name: str = 'word_embeddings'):
                """Restore the weights saved on the first attack step"""
                for name, param in self.model.named_parameters():
                    if name in self.backup:
                        param.data = self.backup[name]

                self.backup = {}

            def backup_grad(self):
                """Save the gradients of the clean forward pass"""
                self.grad_backup = {}
                for name, param in self.model.named_parameters():
                    if param.requires_grad and param.grad is not None:
                        self.grad_backup[name] = param.grad.clone()

            def restore_grad(self):
                """Put the saved clean gradients back before the last adversarial backward pass"""
                for name, param in self.model.named_parameters():
                    if name in self.grad_backup:
                        param.grad = self.grad_backup[name]

        class MixupDataAugmentation:
            """Mixup数据增强"""

            def __init__(self, alpha: float = 0.2):
                """
                初始化Mixup

                Args:
                    alpha: Beta分布参数
                """
                self.alpha = alpha

            def mixup_data(self, x, y):
                """
                Mixup数据增强

                Args:
                    x: 输入特征
                    y: 标签

                Returns:
                    混合后的特征和标签
                """
                if self.alpha > 0:
                    lam = np.random.beta(self.alpha, self.alpha)
                else:
                    lam = 1

                batch_size = x.size(0)
                index = torch.randperm(batch_size).to(x.device)

                mixed_x = lam * x + (1 - lam) * x[index]
                y_a, y_b = y, y[index]

                return mixed_x, y_a, y_b, lam

            def mixup_criterion(self, criterion, pred, y_a, y_b, lam):
                """
                Mixup损失函数

                Args:
                    criterion: 原始损失函数
                    pred: 预测值
                    y_a: 第一个标签
                    y_b: 第二个标签
                    lam: 混合系数

                Returns:
                    混合损失
                """
                return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

        class OptimizedTrainer:
            """集成多种优化技巧的训练器"""

            def __init__(self, model: nn.Module, device: str = 'cpu',
                        use_rdrop: bool = True, use_adversarial: str = 'fgm',
                        use_mixup: bool = False, use_label_smoothing: bool = True):
                """
                初始化优化训练器

                Args:
                    model: 模型
                    device: 设备
                    use_rdrop: 是否使用R-Drop
                    use_adversarial: 对抗训练类型 ('fgm', 'pgd', None)
                    use_mixup: 是否使用Mixup
                    use_label_smoothing: 是否使用标签平滑
                """
                self.model = model.to(device)
                self.device = device
                self.use_rdrop = use_rdrop
                self.use_mixup = use_mixup

                # 损失函数
                if use_label_smoothing:
                    self.criterion = LabelSmoothingLoss(
                        num_classes=model.classifier.out_features,
                        smoothing=0.1
                    )
                    print("✓ 使用标签平滑损失")
                else:
                    self.criterion = nn.CrossEntropyLoss()

                # R-Drop
                if use_rdrop:
                    self.rdrop_loss = RDropLoss(alpha=4.0)
                    print("✓ 启用R-Drop一致性正则化")

                # 对抗训练
                if use_adversarial == 'fgm':
                    self.adversarial = FGM(model, epsilon=1.0)
                    print("✓ 启用FGM对抗训练")
                elif use_adversarial == 'pgd':
                    self.adversarial = PGD(model, epsilon=1.0, alpha=0.3, K=3)
                    print("✓ 启用PGD对抗训练")
                else:
                    self.adversarial = None

                # Mixup
                if use_mixup:
                    self.mixup = MixupDataAugmentation(alpha=0.2)
                    print("✓ 启用Mixup数据增强")

            def train_step(self, batch, optimizer):
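                """
                One training step: R-Drop or the plain criterion, followed by optional
                adversarial training (FGM/PGD) and gradient clipping.
                Mixup is not applied here; self.use_mixup is only stored.
                With R-Drop enabled, self.criterion is used only for the adversarial loss.

                Args:
                    batch: dict with 'input_ids', 'attention_mask' and 'label'
                    optimizer: optimizer

                Returns:
                    loss value of the clean (non-adversarial) forward pass
                """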
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['label'].to(self.device)

                self.model.train()

                # 1. 正常前向传播
                if self.use_rdrop:
                    # R-Drop: 两次前向传播
                    logits1 = self.model(input_ids, attention_mask)
                    logits2 = self.model(input_ids, attention_mask)

                    loss, ce_loss, kl_loss = self.rdrop_loss(logits1, logits2, labels)
                else:
                    logits = self.model(input_ids, attention_mask)
                    loss = self.criterion(logits, labels)

                # 2. 反向传播
                optimizer.zero_grad()
                loss.backward()

                # 3. 对抗训练
                if self.adversarial is not None:
                    if isinstance(self.adversarial, FGM):
                        # FGM攻击
                        self.adversarial.attack()
                        logits_adv = self.model(input_ids, attention_mask)
                        loss_adv = self.criterion(logits_adv, labels)
                        loss_adv.backward()
                        self.adversarial.restore()

                    elif isinstance(self.adversarial, PGD):
                        # PGD攻击(多步)
                        self.adversarial.backup_grad()

                        for t in range(self.adversarial.K):
                            self.adversarial.attack(is_first_attack=(t==0))

                            if t != self.adversarial.K - 1:
                                optimizer.zero_grad()
                            else:
                                self.adversarial.restore_grad()

                            logits_adv = self.model(input_ids, attention_mask)
                            loss_adv = self.criterion(logits_adv, labels)
                            loss_adv.backward()

                        self.adversarial.restore()

                # 4. 梯度裁剪
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)

                # 5. 更新参数
                optimizer.step()

                return loss.item()

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("文本分类优化技巧示例")
            print("="*70 + "\n")

            # 1. 标签平滑损失
            print("【标签平滑损失】")
            num_classes = 5
            ls_loss = LabelSmoothingLoss(num_classes, smoothing=0.1)

            pred = torch.randn(8, num_classes)
            target = torch.randint(0, num_classes, (8,))

            loss = ls_loss(pred, target)
            print(f"标签平滑损失: {loss.item():.4f}\n")

            # 2. R-Drop
            print("【R-Drop一致性正则化】")
            rdrop_loss = RDropLoss(alpha=4.0)

            logits1 = torch.randn(8, num_classes)
            logits2 = torch.randn(8, num_classes)

            total_loss, ce_loss, kl_loss = rdrop_loss(logits1, logits2, target)
            print(f"交叉熵损失: {ce_loss.item():.4f}")
            print(f"KL散度损失: {kl_loss.item():.4f}")
            print(f"总损失: {total_loss.item():.4f}\n")

            # 3. Mixup
            print("【Mixup数据增强】")
            mixup = MixupDataAugmentation(alpha=0.2)

            x = torch.randn(8, 768)
            y = torch.randint(0, num_classes, (8,))

            mixed_x, y_a, y_b, lam = mixup.mixup_data(x, y)
            print(f"混合系数 lambda: {lam:.4f}")
            print(f"混合前形状: {x.shape}")
            print(f"混合后形状: {mixed_x.shape}")
            print(f"标签A: {y_a}")
            print(f"标签B: {y_b}\n")

            # 4. 创建简单分类模型
            print("【创建优化训练器】")

            class SimpleClassifier(nn.Module):
                def __init__(self, input_dim=768, num_classes=5):
                    super().__init__()
                    self.classifier = nn.Linear(input_dim, num_classes)

                def forward(self, x, attention_mask=None):
                    return self.classifier(x)

            model = SimpleClassifier()

            # 创建优化训练器
            trainer = OptimizedTrainer(
                model=model,
                device='cpu',
                use_rdrop=True,
                use_adversarial='fgm',
                use_mixup=False,
                use_label_smoothing=True
            )

            print(f"\n模型参数量: {sum(p.numel() for p in model.parameters()):,}")

            # 5. 模拟训练步骤
            print("\n【模拟训练步骤】")

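            # Mock batch. SimpleClassifier consumes feature vectors directly, so
            # 'input_ids' holds random 768-dim features here rather than token IDs.
            # The model has no 'word_embeddings' parameter, so the FGM attack
            # perturbs nothing in this demo.
            batch = {
                'input_ids': torch.randn(8, 768),
                'attention_mask': torch.ones(8, 128),
                'label': torch.randint(0, num_classes, (8,))
            }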

            optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

            # 训练3步
            for step in range(3):
                loss = trainer.train_step(batch, optimizer)
                print(f"Step {step+1}: Loss = {loss:.4f}")

            print("\n✓ 优化技巧演示完成")
        ---
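        Early stopping is listed among the regularization methods of this section but is not part of the code above. A minimal framework-free sketch could look as follows; `EarlyStopping`, its `patience` and `min_delta` parameters, and the validation scores are all hypothetical and only illustrate the bookkeeping.

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, score: float, epoch: int) -> bool:
        """Record one validation score (higher is better); return True when training should stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_f1_per_epoch = [0.71, 0.78, 0.80, 0.79, 0.80, 0.77]  # made-up scores
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, f1 in enumerate(val_f1_per_epoch):
    if stopper.step(f1, epoch):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_score)
```

        In a real loop the checkpoint from `best_epoch` would be the one to keep, since the epochs after it did not improve the validation score.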

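02.Hyperparameter tuning
    a.Explanation
        Hyperparameters can strongly affect model performance. The key ones are: learning rate (typically 2e-5 to 5e-5 for BERT), batch size (limited by GPU memory,
        usually 16-32), number of epochs (3-5), warmup steps (about 10% of total steps), weight decay (0.01),
        dropout rate (0.1-0.3), and maximum sequence length (128-512). Tuning strategies include grid search (exhaustively try all combinations),
        random search (sample configurations at random), Bayesian optimization (use past results to guide the search), and Hyperband (stop poorly performing configurations early).
        Tools such as Optuna and Ray Tune automate the search. A practical order: fix most parameters and tune the learning rate first,
        then tune batch size and number of epochs, and fine-tune the remaining parameters last. Select hyperparameters on the validation set and never tune on the test set.
        Record every experiment, and manage runs with a tool such as MLflow.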
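        To make the "warmup for about 10% of total steps" guideline concrete, the sketch below computes the step counts for a hypothetical run and the resulting learning-rate multiplier. `linear_warmup_decay` is a hypothetical helper written for this illustration; it follows the usual linear ramp-up and linear decay shape, and the dataset size, batch size, and base learning rate are assumed values.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay from 1 to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Assumed run: 10,000 training samples, batch size 32, 3 epochs, 10% warmup.
steps_per_epoch = -(-10_000 // 32)      # ceiling division
total_steps = steps_per_epoch * 3
warmup_steps = int(total_steps * 0.1)
base_lr = 2e-5

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

        The learning rate peaks exactly at `warmup_steps` and reaches zero at `total_steps`, so changing the batch size or the number of epochs shifts both points.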
    b.Code example
        ---
        # 超参数调优实现
        import optuna
        import torch
        import torch.nn as nn
        from torch.utils.data import DataLoader, Dataset
        from sklearn.metrics import accuracy_score, f1_score
        from typing import Dict
        import json

        class HyperparameterTuner:
            """超参数调优器"""

            def __init__(self, model_class, train_data, val_data,
                        device: str = 'cpu'):
                """
                初始化调优器

                Args:
                    model_class: 模型类
                    train_data: 训练数据
                    val_data: 验证数据
                    device: 设备
                """
                self.model_class = model_class
                self.train_data = train_data
                self.val_data = val_data
                self.device = device

                # 记录最佳结果
                self.best_params = None
                self.best_score = 0.0

            def objective(self, trial: optuna.Trial) -> float:
                """
                Optuna优化目标函数

                Args:
                    trial: Optuna trial对象

                Returns:
                    验证集F1分数
                """
                # Sample hyperparameters (suggest_float replaces the deprecated
                # suggest_loguniform / suggest_uniform)
                params = {
                    'learning_rate': trial.suggest_float('learning_rate', 1e-5, 5e-5, log=True),
                    'batch_size': trial.suggest_categorical('batch_size', [16, 32, 64]),
                    'dropout': trial.suggest_float('dropout', 0.1, 0.5),
                    'weight_decay': trial.suggest_float('weight_decay', 1e-5, 1e-2, log=True),
                    'warmup_ratio': trial.suggest_float('warmup_ratio', 0.0, 0.2),
                    'num_epochs': trial.suggest_int('num_epochs', 3, 6)
                }

                print(f"\n尝试参数组合: {params}")

                # 创建数据加载器
                train_loader = DataLoader(
                    self.train_data,
                    batch_size=params['batch_size'],
                    shuffle=True
                )

                val_loader = DataLoader(
                    self.val_data,
                    batch_size=params['batch_size']
                )

                # 创建模型
                model = self.model_class(dropout=params['dropout']).to(self.device)

                # 优化器
                optimizer = torch.optim.AdamW(
                    model.parameters(),
                    lr=params['learning_rate'],
                    weight_decay=params['weight_decay']
                )

                # 学习率调度器
                total_steps = len(train_loader) * params['num_epochs']
                warmup_steps = int(total_steps * params['warmup_ratio'])

                from transformers import get_linear_schedule_with_warmup
                scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=warmup_steps,
                    num_training_steps=total_steps
                )

                # 损失函数
                criterion = nn.CrossEntropyLoss()

                # 训练
                best_val_f1 = 0.0
                for epoch in range(params['num_epochs']):
                    # 训练一个epoch
                    model.train()
                    for batch in train_loader:
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        labels = batch['label'].to(self.device)

                        optimizer.zero_grad()
                        outputs = model(input_ids, attention_mask)
                        loss = criterion(outputs, labels)
                        loss.backward()
                        optimizer.step()
                        scheduler.step()

                    # 验证
                    model.eval()
                    all_preds = []
                    all_labels = []

                    with torch.no_grad():
                        for batch in val_loader:
                            input_ids = batch['input_ids'].to(self.device)
                            attention_mask = batch['attention_mask'].to(self.device)
                            labels = batch['label'].to(self.device)

                            outputs = model(input_ids, attention_mask)
                            preds = torch.argmax(outputs, dim=1)

                            all_preds.extend(preds.cpu().numpy())
                            all_labels.extend(labels.cpu().numpy())

                    # 计算F1
                    val_f1 = f1_score(all_labels, all_preds, average='macro')

                    print(f"Epoch {epoch+1}: Val F1 = {val_f1:.4f}")

                    if val_f1 > best_val_f1:
                        best_val_f1 = val_f1

                    # Optuna剪枝
                    trial.report(val_f1, epoch)
                    if trial.should_prune():
                        raise optuna.TrialPruned()

                return best_val_f1

            def tune(self, n_trials: int = 20, timeout: int = 3600) -> Dict:
                """
                执行超参数调优

                Args:
                    n_trials: 尝试次数
                    timeout: 超时时间(秒)

                Returns:
                    最佳参数字典
                """
                print("="*70)
                print(f"开始超参数调优({n_trials} 次尝试)")
                print("="*70 + "\n")

                # 创建Optuna study
                study = optuna.create_study(
                    direction='maximize',
                    pruner=optuna.pruners.MedianPruner()
                )

                # 运行优化
                study.optimize(
                    self.objective,
                    n_trials=n_trials,
                    timeout=timeout,
                    show_progress_bar=True
                )

                # Best result (keep the study so it can be passed to plot_optimization_history)
                self.study = study
                self.best_params = study.best_params
                self.best_score = study.best_value

                print(f"\n{'='*70}")
                print("调优完成")
                print(f"{'='*70}\n")

                print(f"最佳F1分数: {self.best_score:.4f}")
                print(f"最佳参数:")
                for key, value in self.best_params.items():
                    print(f"  {key}: {value}")

                # 保存结果
                results = {
                    'best_params': self.best_params,
                    'best_score': self.best_score,
                    'n_trials': n_trials
                }

                with open('hyperparameter_tuning_results.json', 'w') as f:
                    json.dump(results, f, indent=2)

                print(f"\n✓ 结果已保存至 hyperparameter_tuning_results.json")

                return self.best_params

            def plot_optimization_history(self, study):
                """绘制优化历史"""
                import matplotlib.pyplot as plt

                # 设置中文显示
                plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']

                fig, axes = plt.subplots(1, 2, figsize=(14, 5))

                # 1. 优化历史
                trials = study.trials
                values = [t.value for t in trials if t.value is not None]

                ax = axes[0]
                ax.plot(values, marker='o', linewidth=2, markersize=6)
                ax.axhline(self.best_score, color='red', linestyle='--',
                          label=f'最佳: {self.best_score:.4f}')
                ax.set_xlabel('试验次数', fontsize=12, fontweight='bold')
                ax.set_ylabel('验证集F1分数', fontsize=12, fontweight='bold')
                ax.set_title('超参数优化历史', fontsize=14, fontweight='bold')
                ax.legend()
                ax.grid(alpha=0.3)

                # 2. 参数重要性
                try:
                    importance = optuna.importance.get_param_importances(study)
                    params = list(importance.keys())
                    values = list(importance.values())

                    ax = axes[1]
                    ax.barh(params, values, color='#2A9D8F')
                    ax.set_xlabel('重要性', fontsize=12, fontweight='bold')
                    ax.set_title('参数重要性', fontsize=14, fontweight='bold')
                    ax.grid(alpha=0.3, axis='x')
                except Exception:  # e.g. too few completed trials to estimate importance
                    ax = axes[1]
                    ax.text(0.5, 0.5, '参数重要性分析需要更多试验',
                           ha='center', va='center', fontsize=12)

                plt.tight_layout()
                plt.savefig('hyperparameter_optimization.png', dpi=300, bbox_inches='tight')
                print("✓ 优化历史图已保存为 hyperparameter_optimization.png")

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("超参数调优示例")
            print("="*70 + "\n")

            # 注意:这里使用简化的示例
            # 实际使用时需要准备真实的数据集

            print("超参数调优框架已准备就绪")
            print("使用方法:")
            print("""
            # 1. 准备数据
            train_dataset = YourDataset(train_texts, train_labels, tokenizer)
            val_dataset = YourDataset(val_texts, val_labels, tokenizer)

            # 2. 定义模型类
            class YourModel(nn.Module):
                def __init__(self, dropout=0.3):
                    super().__init__()
                    # 定义模型结构

            # 3. 创建调优器
            tuner = HyperparameterTuner(
                model_class=YourModel,
                train_data=train_dataset,
                val_data=val_dataset,
                device='cuda'
            )

            # 4. 执行调优
            best_params = tuner.tune(n_trials=50, timeout=3600)

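            # 5. Train the final model with the best parameters
            #    (only dropout is a model argument; the other values configure training)
            final_model = YourModel(dropout=best_params['dropout'])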
            """)

            print("\n建议的超参数搜索空间:")
            print("  - learning_rate: [1e-5, 5e-5] (对数均匀分布)")
            print("  - batch_size: [16, 32, 64] (分类)")
            print("  - dropout: [0.1, 0.5] (均匀分布)")
            print("  - weight_decay: [1e-5, 1e-2] (对数均匀分布)")
            print("  - warmup_ratio: [0.0, 0.2] (均匀分布)")
            print("  - num_epochs: [3, 6] (整数)")

            print("\n✓ 超参数调优示例完成")
        ---
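    c.Supplementary sketch: linear warmup and decay
        The warmup ratio above only fixes how many steps the warmup lasts. The sketch below shows the
        learning-rate multiplier such a schedule would produce at each step. It is intended to mirror the
        shape of the linear schedule with warmup used in the tuner, not to reproduce the library code.
        The function name linear_warmup_decay and the step counts are hypothetical.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Return the learning-rate multiplier (0.0 to 1.0) at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    total_steps = 1000                      # hypothetical: batches per epoch * epochs
    warmup_steps = int(total_steps * 0.1)   # 10% warmup, as suggested in the text
    base_lr = 2e-5
    for step in (0, 50, 100, 550, 1000):
        lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
        print(f"step {step:4d}: lr = {lr:.2e}")
```

        The multiplier peaks at exactly 1.0 when warmup ends, so the tuned learning rate is the peak
        rate, not the average rate over training.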

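03.Model ensembling strategies
    a.Overview
        Model ensembling combines the predictions of several models to improve accuracy and robustness.
        Common strategies are voting (majority or weighted), averaging (mean of predicted probabilities),
        stacking (a meta-model trained on base-model outputs), boosting (models trained sequentially, each
        correcting its predecessors' errors), and bagging (models trained in parallel on bootstrap
        resamples of the training data). In practice an ensemble can mix different architectures
        (TextCNN + BERT + RoBERTa), different initializations (same architecture, different random seeds),
        different data (the folds of a cross-validation run), or different hyperparameters (learning rate,
        dropout, and so on). Ensemble weights can be set from validation performance, giving better models
        larger weights. Ensembling increases inference time and resource use, so weigh the gain against
        the cost. In production, distillation can transfer the ensemble's knowledge into a single small model.
        Diversity matters: ensembling near-identical models yields little gain. Differentiated training,
        such as using different data augmentation methods, increases diversity.
    b.Code example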
        ---
        # 模型集成实现
        import torch
        import torch.nn as nn
        from typing import List, Dict
        import numpy as np
        from sklearn.metrics import accuracy_score, f1_score

        class ModelEnsemble:
            """模型集成器"""

            def __init__(self, models: List[nn.Module], weights: List[float] = None,
                        device: str = 'cpu'):
                """
                初始化集成器

                Args:
                    models: 模型列表
                    weights: 模型权重(如果为None则平均)
                    device: 设备
                """
                self.models = [m.to(device) for m in models]
                self.device = device

                # 设置权重
                if weights is None:
                    self.weights = [1.0 / len(models)] * len(models)
                else:
                    assert len(weights) == len(models), "权重数量与模型数量不匹配"
                    total = sum(weights)
                    self.weights = [w / total for w in weights]

                print(f"集成 {len(models)} 个模型")
                print(f"模型权重: {self.weights}")

            def predict_vote(self, input_ids, attention_mask) -> torch.Tensor:
                """
                投票法预测

                Args:
                    input_ids: 输入token IDs
                    attention_mask: 注意力掩码

                Returns:
                    预测标签
                """
                all_predictions = []

                for model in self.models:
                    model.eval()
                    with torch.no_grad():
                        logits = model(input_ids, attention_mask)
                        preds = torch.argmax(logits, dim=1)
                        all_predictions.append(preds.cpu().numpy())

                # 投票
                all_predictions = np.array(all_predictions)  # (n_models, batch_size)
                voted_preds = []

                for i in range(all_predictions.shape[1]):
                    votes = all_predictions[:, i]
                    # 多数投票
                    voted_preds.append(np.bincount(votes).argmax())

                return torch.tensor(voted_preds)

            def predict_average(self, input_ids, attention_mask) -> torch.Tensor:
                """
                概率平均法预测

                Args:
                    input_ids: 输入token IDs
                    attention_mask: 注意力掩码

                Returns:
                    预测标签
                """
                all_probs = []

                for model, weight in zip(self.models, self.weights):
                    model.eval()
                    with torch.no_grad():
                        logits = model(input_ids, attention_mask)
                        probs = torch.softmax(logits, dim=1)
                        all_probs.append(probs.cpu().numpy() * weight)

                # 加权平均
                avg_probs = np.sum(all_probs, axis=0)
                predictions = np.argmax(avg_probs, axis=1)

                return torch.tensor(predictions)

            def predict_stacking(self, input_ids, attention_mask,
                                meta_model: nn.Module) -> torch.Tensor:
                """
                堆叠法预测

                Args:
                    input_ids: 输入token IDs
                    attention_mask: 注意力掩码
                    meta_model: 元模型

                Returns:
                    预测标签
                """
                # 收集基模型预测概率
                base_predictions = []

                for model in self.models:
                    model.eval()
                    with torch.no_grad():
                        logits = model(input_ids, attention_mask)
                        probs = torch.softmax(logits, dim=1)
                        base_predictions.append(probs)

                # 拼接基模型输出
                stacked_features = torch.cat(base_predictions, dim=1)  # (batch, n_models * n_classes)

                # 元模型预测
                meta_model.eval()
                with torch.no_grad():
                    meta_logits = meta_model(stacked_features)
                    predictions = torch.argmax(meta_logits, dim=1)

                return predictions

            def evaluate_ensemble(self, data_loader, method: str = 'average',
                                meta_model: nn.Module = None):
                """
                评估集成模型

                Args:
                    data_loader: 数据加载器
                    method: 集成方法 ('vote', 'average', 'stacking')
                    meta_model: 元模型(stacking需要)

                Returns:
                    评估指标字典
                """
                all_preds = []
                all_labels = []

                print(f"\n使用 {method} 方法评估集成模型...")

                for batch in data_loader:
                    input_ids = batch['input_ids'].to(self.device)
                    attention_mask = batch['attention_mask'].to(self.device)
                    labels = batch['label'].to(self.device)

                    # 根据方法选择预测函数
                    if method == 'vote':
                        preds = self.predict_vote(input_ids, attention_mask)
                    elif method == 'average':
                        preds = self.predict_average(input_ids, attention_mask)
                    elif method == 'stacking':
                        assert meta_model is not None, "堆叠法需要提供元模型"
                        preds = self.predict_stacking(input_ids, attention_mask, meta_model)
                    else:
                        raise ValueError(f"未知集成方法: {method}")

                    all_preds.extend(preds.cpu().numpy())  # stacking may return a tensor on the GPU
                    all_labels.extend(labels.cpu().numpy())

                # 计算指标
                accuracy = accuracy_score(all_labels, all_preds)
                f1 = f1_score(all_labels, all_preds, average='macro')

                print(f"  准确率: {accuracy:.4f}")
                print(f"  F1分数: {f1:.4f}")

                return {
                    'accuracy': accuracy,
                    'f1_score': f1,
                    'predictions': all_preds,
                    'labels': all_labels
                }

        class StackingMetaModel(nn.Module):
            """堆叠法的元模型"""

            def __init__(self, n_models: int, n_classes_per_model: int, n_output_classes: int):
                """
                初始化元模型

                Args:
                    n_models: 基模型数量
                    n_classes_per_model: 每个基模型的输出类别数
                    n_output_classes: 最终输出类别数
                """
                super(StackingMetaModel, self).__init__()

                input_dim = n_models * n_classes_per_model

                self.meta_classifier = nn.Sequential(
                    nn.Linear(input_dim, 128),
                    nn.ReLU(),
                    nn.Dropout(0.3),
                    nn.Linear(128, n_output_classes)
                )

            def forward(self, x):
                """
                前向传播

                Args:
                    x: 拼接的基模型输出 (batch, n_models * n_classes)

                Returns:
                    logits
                """
                return self.meta_classifier(x)

        # 使用示例
        if __name__ == "__main__":
            print("="*70)
            print("模型集成示例")
            print("="*70 + "\n")

            # 创建多个简单模型(实际应该是训练好的不同模型)
            class SimpleClassifier(nn.Module):
                def __init__(self, input_dim=768, num_classes=3):
                    super().__init__()
                    self.fc = nn.Linear(input_dim, num_classes)

                def forward(self, x, attention_mask=None):
                    return self.fc(x)

            # 创建3个模型
            models = [SimpleClassifier() for _ in range(3)]

            # 创建集成器
            ensemble = ModelEnsemble(
                models=models,
                weights=[0.4, 0.35, 0.25],  # 根据验证集性能设定
                device='cpu'
            )

            # 模拟数据
            batch_size = 8
            input_ids = torch.randn(batch_size, 768)
            attention_mask = torch.ones(batch_size, 128)

            # 1. 投票法
            print("\n【投票法预测】")
            preds_vote = ensemble.predict_vote(input_ids, attention_mask)
            print(f"预测结果: {preds_vote}")

            # 2. 平均法
            print("\n【平均法预测】")
            preds_avg = ensemble.predict_average(input_ids, attention_mask)
            print(f"预测结果: {preds_avg}")

            # 3. 堆叠法
            print("\n【堆叠法预测】")
            meta_model = StackingMetaModel(
                n_models=3,
                n_classes_per_model=3,
                n_output_classes=3
            )
            preds_stack = ensemble.predict_stacking(input_ids, attention_mask, meta_model)
            print(f"预测结果: {preds_stack}")

            print("\n✓ 模型集成演示完成")

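            print("\nEnsembling strategies compared:")
            print("  - Voting: simple and cheap; suits models of similar quality")
            print("  - Averaging: uses prediction confidence; usually more robust")
            print("  - Stacking: often the strongest, but the meta-model needs extra training")

            print("\nPractical advice:")
            print("  1. Make sure the base models are diverse (different architectures/hyperparameters)")
            print("  2. Set model weights from validation performance")
            print("  3. Guard against overfitting when stacking")
            print("  4. Consider distillation in production to cut inference cost")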
        ---
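    c.Supplementary sketch: weighted voting
        The overview mentions weighted voting and weights derived from validation performance. A minimal
        sketch of that idea follows, using plain Python lists. The helper names weights_from_scores and
        weighted_vote and the validation scores are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weights_from_scores(val_scores: Sequence[float]) -> List[float]:
    """Normalize validation scores so they sum to 1 and can serve as vote weights."""
    total = sum(val_scores)
    if total <= 0:
        raise ValueError("validation scores must have a positive sum")
    return [score / total for score in val_scores]


def weighted_vote(predictions: Sequence[Sequence[int]],
                  weights: Sequence[float]) -> List[int]:
    """predictions[m][i] is the label that model m assigns to sample i."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    voted = []
    for i in range(len(predictions[0])):
        tally: Dict[int, float] = defaultdict(float)
        for model_preds, weight in zip(predictions, weights):
            tally[model_preds[i]] += weight
        # Pick the label with the largest total weight; ties go to the smaller label id.
        voted.append(max(sorted(tally), key=lambda label: tally[label]))
    return voted


if __name__ == "__main__":
    preds = [[0, 1, 2],   # model A (assumed strongest on validation)
             [1, 1, 0],   # model B
             [1, 2, 0]]   # model C
    weights = weights_from_scores([0.9, 0.4, 0.4])  # hypothetical validation F1 scores
    print(weighted_vote(preds, weights))
    print(weighted_vote(preds, [1 / 3] * 3))        # equal weights: plain majority vote
```

        With these scores the strongest model outvotes the other two on the first and third samples,
        which a plain majority vote would not allow.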

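3 Named Entity Recognition

3.1 Introduction to the NER task

01.Overview of named entity recognition
    a.Overview
        Named Entity Recognition (NER) identifies and classifies named entities in text, such as person names, locations, organizations, times, and numbers.
        NER is a foundational information-extraction task, widely used in knowledge graph construction, intelligent search, and question answering.
        Traditional approaches are rule-based systems and feature-based CRFs; modern approaches mainly use deep learning models such as BiLSTM-CRF and BERT-based taggers.
        The main challenges are ambiguous entity boundaries, nested entities, and poor transfer across domains.
        The main metrics are precision, recall, and F1, computed at the entity level.
        Annotated datasets such as CoNLL-2003, OntoNotes, MSRA, and the People's Daily corpus are the basis for training and evaluation.
        Entity types can be customized per application; common general types are PER (person), LOC (location), ORG (organization), and TIME (time).
    b.Code example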
        ---
        # NER任务数据加载与预处理示例
        import pandas as pd
        import numpy as np
        from typing import List, Tuple, Dict

        class NERDataLoader:
            """命名实体识别数据加载器"""

            def __init__(self, data_path: str):
                self.data_path = data_path
                self.entity_types = set()

            def load_conll_format(self) -> List[Tuple[List[str], List[str]]]:
                """
                加载CoNLL格式的NER数据
                格式: token label (每行一个词和对应标签,句子之间空行分隔)
                """
                sentences = []
                labels = []
                current_tokens = []
                current_labels = []

                with open(self.data_path, 'r', encoding='utf-8') as f:
                    for line in f:
                        line = line.strip()
                        if not line:  # 空行表示句子结束
                            if current_tokens:
                                sentences.append(current_tokens)
                                labels.append(current_labels)
                                current_tokens = []
                                current_labels = []
                        else:
                            parts = line.split()
                            if len(parts) >= 2:
                                token, label = parts[0], parts[-1]
                                current_tokens.append(token)
                                current_labels.append(label)
                                if label != 'O':
                                    # Store the bare entity type, e.g. 'PER' from 'B-PER'
                                    self.entity_types.add(label.split('-', 1)[-1])

                # 添加最后一个句子
                if current_tokens:
                    sentences.append(current_tokens)
                    labels.append(current_labels)

                print(f"加载了 {len(sentences)} 个句子")
                print(f"实体类型: {sorted(self.entity_types)}")
                return list(zip(sentences, labels))

            def analyze_dataset(self, dataset: List[Tuple[List[str], List[str]]]):
                """分析数据集统计信息"""
                total_tokens = sum(len(tokens) for tokens, _ in dataset)
                total_entities = 0
                entity_counts = {}

                for tokens, labels in dataset:
                    for label in labels:
                        if label != 'O':  # O表示非实体
                            total_entities += 1
                            entity_type = label.split('-', 1)[-1]  # strip the B-/I- prefix, keep the full type
                            entity_counts[entity_type] = entity_counts.get(entity_type, 0) + 1

                print(f"\n数据集统计:")
                print(f"总句子数: {len(dataset)}")
                print(f"总词数: {total_tokens}")
                print(f"平均句长: {total_tokens / len(dataset):.2f}")
                print(f"总实体标签数: {total_entities}")
                print(f"实体分布: {entity_counts}")

                return entity_counts

        # 使用示例
        # loader = NERDataLoader('train.conll')
        # dataset = loader.load_conll_format()
        # loader.analyze_dataset(dataset)
        ---
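    c.Supplementary sketch: counting mentions instead of tokens
        analyze_dataset above counts every non-O label, so a four-character organization contributes four
        to its type. When the number of entity mentions is wanted, counting B- tags is usually enough,
        assuming IOB2 data where every entity starts with B-. The function name count_entity_mentions is
        hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def count_entity_mentions(label_sequences: Iterable[List[str]]) -> Counter:
    """Count entity mentions per type, assuming IOB2 tags (each entity starts with B-)."""
    counts: Counter = Counter()
    for labels in label_sequences:
        for label in labels:
            if label.startswith("B-"):
                counts[label[2:]] += 1
    return counts


if __name__ == "__main__":
    labels = [
        ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"],
        ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-PER"],
    ]
    print(count_entity_mentions(labels))
```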

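02.NER annotation examples and entity types
    a.Overview
        Entity annotation needs clearly defined entity types and annotation guidelines; types fall into general-purpose and domain-specific entities.
        General types include person (PER), location (LOC), organization (ORG), time (TIME), money (MONEY), and percentage (PERCENT).
        Medical entities include disease (DIS), drug (DRUG), symptom (SYM), test (TEST), and operation (OPER).
        Financial entities include company (COMP), product (PROD), metric (METRIC), and event (EVENT).
        Keep entity boundaries consistent: "中国人民银行" should be annotated as a single ORG, not split into parts.
        Nested entities are a common difficulty: "北京大学医学院" contains the LOC "北京" inside the ORG "北京大学医学院".
    b.Code example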
        ---
        # 实体类型定义与标注示例生成
        from dataclasses import dataclass
        from typing import List, Tuple
        import random

        @dataclass
        class Entity:
            """实体数据类"""
            text: str
            entity_type: str
            start: int
            end: int

        class NERAnnotationExample:
            """NER标注示例生成器"""

            def __init__(self):
                # 定义实体类型和示例
                self.entity_examples = {
                    'PER': ['李明', '王小华', '张三', '刘德华', '马云'],
                    'LOC': ['北京', '上海', '杭州', '西湖', '长城'],
                    'ORG': ['阿里巴巴', '腾讯', '华为', '北京大学', '清华大学'],
                    'TIME': ['2024年', '上个月', '明天', '下午3点', '春节'],
                    'PRODUCT': ['iPhone', 'ChatGPT', '微信', '支付宝', 'Tesla']
                }

            def generate_sample_sentence(self) -> Tuple[str, List[Entity]]:
                """生成带标注的示例句子"""
                templates = [
                    "{PER}在{TIME}加入了{ORG}担任CEO",
                    "{ORG}在{LOC}发布了新产品{PRODUCT}",
                    "{PER}和{PER}计划{TIME}去{LOC}旅游",
                    "{ORG}的{PRODUCT}在{LOC}很受欢迎"
                ]

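                template = random.choice(templates)
                entities = []
                parts = []    # output fragments, joined at the end
                cursor = 0    # read position in the template
                length = 0    # number of characters emitted so far

                # Fill placeholders left to right so every offset is final when it is recorded
                while True:
                    open_idx = template.find('{', cursor)
                    if open_idx == -1:
                        break
                    close_idx = template.index('}', open_idx)
                    literal = template[cursor:open_idx]
                    entity_type = template[open_idx + 1:close_idx]
                    entity_text = random.choice(self.entity_examples.get(entity_type, ['未知']))
                    start = length + len(literal)

                    entities.append(Entity(
                        text=entity_text,
                        entity_type=entity_type,
                        start=start,
                        end=start + len(entity_text)
                    ))
                    parts.extend([literal, entity_text])
                    length = start + len(entity_text)
                    cursor = close_idx + 1

                sentence = ''.join(parts) + template[cursor:]
                return sentence, entities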

            def convert_to_bio(self, sentence: str, entities: List[Entity]) -> List[Tuple[str, str]]:
                """将实体标注转换为BIO格式"""
                # 简单分词(实际应用需使用专业分词工具)
                tokens = list(sentence)  # 字符级别
                labels = ['O'] * len(tokens)

                # 标注实体
                for entity in entities:
                    for i in range(entity.start, entity.end):
                        if i == entity.start:
                            labels[i] = f'B-{entity.entity_type}'
                        else:
                            labels[i] = f'I-{entity.entity_type}'

                return list(zip(tokens, labels))

            def print_annotation_example(self):
                """打印标注示例"""
                sentence, entities = self.generate_sample_sentence()
                bio_format = self.convert_to_bio(sentence, entities)

                print("原始句子:", sentence)
                print("\n实体列表:")
                for entity in entities:
                    print(f"  {entity.text} -> {entity.entity_type} [{entity.start}:{entity.end}]")

                print("\nBIO标注:")
                for token, label in bio_format:
                    print(f"  {token}\t{label}")

        # 使用示例
        annotator = NERAnnotationExample()
        annotator.print_annotation_example()
        ---
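    c.Supplementary sketch: flattening nested entities
        A flat BIO sequence gives each token one label, so it cannot encode "北京" (LOC) and
        "北京大学医学院" (ORG) at the same time. One common workaround, shown here only as a sketch, keeps
        the outermost (longest) span and drops the spans nested inside it. The helper names keep_outermost
        and spans_to_bio are hypothetical, and other policies (keeping the innermost span, or using a
        span-based model) are equally valid.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def keep_outermost(spans: List[Span]) -> List[Span]:
    """Keep the longest spans and drop any span that overlaps an already kept one."""
    kept: List[Span] = []
    # Visit the longest spans first; ties are broken by the earlier start.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append(span)
    return sorted(kept)


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping spans to character-level BIO labels."""
    labels = ["O"] * n_tokens
    for start, end, entity_type in spans:
        labels[start] = f"B-{entity_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"
    return labels


if __name__ == "__main__":
    text = "北京大学医学院"
    nested = [(0, 2, "LOC"), (0, 7, "ORG")]
    flat = keep_outermost(nested)
    for char, label in zip(text, spans_to_bio(len(text), flat)):
        print(f"{char}\t{label}")
```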

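03.NER evaluation metrics
    a.Overview
        NER is evaluated at the entity level, not the token level: a prediction counts as correct only when the whole entity is recognized correctly.
        Precision is the fraction of predicted entities that are correct; recall is the fraction of gold entities that are found.
        F1 is the harmonic mean of precision and recall and is the main NER metric; macro-F1 and micro-F1 are commonly reported.
        Strict matching requires both the entity type and the boundaries to be exactly right, while relaxed matching only requires overlapping boundaries; strict matching is the usual choice in practice.
        seqeval is the de facto standard library for NER evaluation; it supports BIO, BIOES, and other tagging schemes and handles entity-level evaluation automatically.
        A confusion matrix shows which entity types are confused with one another and helps locate a model's weaknesses.
    b.Code example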
        ---
        # NER评估指标计算
        from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
        from seqeval.scheme import IOB2
        from typing import Dict, List
        import numpy as np

        class NEREvaluator:
            """NER模型评估器"""

            def __init__(self, scheme=IOB2):
                self.scheme = scheme

            def evaluate(self,
                        y_true: List[List[str]],
                        y_pred: List[List[str]]) -> Dict[str, float]:
                """
                评估NER预测结果
                y_true: 真实标签序列列表
                y_pred: 预测标签序列列表
                """
                # 确保长度一致
                assert len(y_true) == len(y_pred), "标签序列数量不匹配"

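                # Compute metrics; seqeval only honors `scheme` when mode='strict'
                precision = precision_score(y_true, y_pred, mode='strict', scheme=self.scheme)
                recall = recall_score(y_true, y_pred, mode='strict', scheme=self.scheme)
                f1 = f1_score(y_true, y_pred, mode='strict', scheme=self.scheme)

                # Detailed per-type report
                report = classification_report(y_true, y_pred, mode='strict',
                                               scheme=self.scheme, digits=4)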

                results = {
                    'precision': precision,
                    'recall': recall,
                    'f1': f1,
                    'report': report
                }

                return results

            def print_evaluation_results(self, results: Dict):
                """打印评估结果"""
                print(f"总体指标:")
                print(f"  Precision: {results['precision']:.4f}")
                print(f"  Recall:    {results['recall']:.4f}")
                print(f"  F1 Score:  {results['f1']:.4f}")
                print(f"\n详细报告:")
                print(results['report'])

        # 使用示例
        evaluator = NEREvaluator()

        # 示例数据(真实场景中来自模型预测)
        y_true = [
            ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O'],
            ['B-ORG', 'I-ORG', 'O', 'O', 'B-TIME', 'O'],
            ['O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
        ]

        y_pred = [
            ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'O'],  # LOC边界错误
            ['B-ORG', 'I-ORG', 'O', 'O', 'B-TIME', 'O'],  # 完全正确
            ['O', 'B-ORG', 'O', 'O', 'B-LOC', 'O']  # PER识别为ORG
        ]

        results = evaluator.evaluate(y_true, y_pred)
        evaluator.print_evaluation_results(results)

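        # Expected output (6 predicted entities, 6 gold entities, 4 exact matches):
        # 总体指标:
        #   Precision: 0.6667  (4 of the 6 predicted entities are correct)
        #   Recall:    0.6667  (4 of the 6 gold entities are found)
        #   F1 Score:  0.6667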
        ---
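    c.Supplementary sketch: entity-level scores by hand
        To make strict entity-level matching concrete, the sketch below recomputes precision, recall, and
        F1 for the toy y_true / y_pred sequences above without any library. It is a hand-rolled
        illustration, not seqeval's implementation; it assumes IOB2 input and ignores stray I- tags. The
        names extract_spans and prf1 are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence_index, start, end_exclusive, entity_type)


def extract_spans(sequences: List[List[str]]) -> Set[Span]:
    """Collect entity spans from IOB2 tag sequences (stray I- tags are ignored)."""
    spans: Set[Span] = set()
    for sent_idx, tags in enumerate(sequences):
        start, etype = None, None
        for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
            inside = tag.startswith("I-") and etype == tag[2:]
            if start is not None and not inside:
                spans.add((sent_idx, start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans


def prf1(y_true: List[List[str]], y_pred: List[List[str]]) -> Tuple[float, float, float]:
    """Strict matching: a span is correct only if position and type both match."""
    true_spans, pred_spans = extract_spans(y_true), extract_spans(y_pred)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [
        ["O", "O", "B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"],
        ["B-ORG", "I-ORG", "O", "O", "B-TIME", "O"],
        ["O", "B-PER", "O", "O", "B-LOC", "O"],
    ]
    y_pred = [
        ["O", "O", "B-PER", "I-PER", "O", "B-LOC", "O", "O"],
        ["B-ORG", "I-ORG", "O", "O", "B-TIME", "O"],
        ["O", "B-ORG", "O", "O", "B-LOC", "O"],
    ]
    p, r, f = prf1(y_true, y_pred)
    print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")  # 4 of 6 spans match on each side
```

        The truncated LOC and the PER predicted as ORG each cost one match, so 4 of the 6 spans agree
        and all three scores come out at 0.6667.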

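3.2 Tagging schemes

01.BIO tagging scheme
    a.Overview
        BIO (Begin-Inside-Outside) is the most widely used sequence tagging scheme: B marks the first token of an entity, I marks a token inside an entity, and O marks a non-entity token.
        Because every entity starts with B, BIO can separate adjacent entities of the same type, for example when "北京大学" and "清华大学" appear back to back.
        The rules are: tag the first character of each entity with B, the following characters with I, and everything else with O, e.g. "李/B-PER 明/I-PER".
        IOB2 (also called BIO2) requires every entity to begin with B, whereas IOB1 starts entities with I and uses B only when an entity directly follows another entity of the same type.
        IOB2 is more common in practice because it marks entity boundaries explicitly, which simplifies post-processing and error analysis.
        Annotation consistency is critical: write detailed guidelines that cover ambiguous cases such as abbreviations and nested entities.
    b.Code example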
        ---
        # BIO标注体系实现与转换
        from typing import List, Tuple
        from collections import defaultdict

        class BIOTagger:
            """BIO标注工具类"""

            @staticmethod
            def entities_to_bio(tokens: List[str],
                               entities: List[Tuple[int, int, str]]) -> List[str]:
                """
                将实体标注转换为BIO格式
                tokens: 分词后的词列表
                entities: 实体列表 [(start, end, entity_type), ...]
                """
                # 初始化为O标签
                bio_tags = ['O'] * len(tokens)

                # 按开始位置排序,确保处理顺序
                sorted_entities = sorted(entities, key=lambda x: x[0])

                for start, end, entity_type in sorted_entities:
                    if start < len(bio_tags) and end <= len(bio_tags):
                        # 标注B(Begin)
                        bio_tags[start] = f'B-{entity_type}'
                        # 标注I(Inside)
                        for i in range(start + 1, end):
                            bio_tags[i] = f'I-{entity_type}'

                return bio_tags

            @staticmethod
            def bio_to_entities(tokens: List[str],
                               bio_tags: List[str]) -> List[Tuple[str, int, int, str]]:
                """
                将BIO标注转换回实体列表
                返回: [(entity_text, start, end, entity_type), ...]
                """
                entities = []
                current_entity = None
                current_start = None

                for i, (token, tag) in enumerate(zip(tokens, bio_tags)):
                    if tag.startswith('B-'):
                        # 保存前一个实体
                        if current_entity is not None:
                            entity_text = ''.join(tokens[current_start:i])
                            entities.append((entity_text, current_start, i, current_entity))

                        # 开始新实体
                        current_entity = tag[2:]  # 去掉'B-'前缀
                        current_start = i

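                    elif tag.startswith('I-'):
                        tag_type = tag[2:]
                        if current_entity is None:
                            # Stray I tag with no preceding B: leniently start a new entity
                            current_entity = tag_type
                            current_start = i
                        elif tag_type != current_entity:
                            # Type changed mid-entity: close the previous entity, start a new one
                            entity_text = ''.join(tokens[current_start:i])
                            entities.append((entity_text, current_start, i, current_entity))
                            current_entity = tag_type
                            current_start = i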

                    else:  # 'O'标签
                        # 结束当前实体
                        if current_entity is not None:
                            entity_text = ''.join(tokens[current_start:i])
                            entities.append((entity_text, current_start, i, current_entity))
                            current_entity = None
                            current_start = None

                # 处理句尾实体
                if current_entity is not None:
                    entity_text = ''.join(tokens[current_start:])
                    entities.append((entity_text, current_start, len(tokens), current_entity))

                return entities

            @staticmethod
            def validate_bio_tags(bio_tags: List[str]) -> Tuple[bool, List[str]]:
                """
                验证BIO标注的合法性
                返回: (是否合法, 错误信息列表)
                """
                errors = []
                prev_tag = 'O'

                for i, tag in enumerate(bio_tags):
                    if tag == 'O':
                        prev_tag = 'O'
                        continue

                    if not (tag.startswith('B-') or tag.startswith('I-')):
                        errors.append(f"位置{i}: 非法标签格式 '{tag}'")
                        continue

                    prefix, entity_type = tag.split('-', 1)

                    if prefix == 'I':
                        # I标签前必须是B或同类型I
                        if prev_tag == 'O':
                            errors.append(f"位置{i}: I-{entity_type}前没有对应的B标签")
                        elif prev_tag.startswith('I-') or prev_tag.startswith('B-'):
                            prev_entity = prev_tag.split('-', 1)[1]
                            if prev_entity != entity_type:
                                errors.append(f"位置{i}: 实体类型不一致 {prev_entity} -> {entity_type}")

                    prev_tag = tag

                return len(errors) == 0, errors

        # 使用示例
        tagger = BIOTagger()

        # 示例1: 实体到BIO
        tokens = list("李明在北京工作")
        entities = [(0, 2, 'PER'), (3, 5, 'LOC')]  # 李明:PER, 北京:LOC
        bio_tags = tagger.entities_to_bio(tokens, entities)
        print("BIO标注:", list(zip(tokens, bio_tags)))
        # 输出: [('李', 'B-PER'), ('明', 'I-PER'), ('在', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), ...]

        # 示例2: BIO到实体
        extracted_entities = tagger.bio_to_entities(tokens, bio_tags)
        print("\n提取的实体:", extracted_entities)
        # 输出: [('李明', 0, 2, 'PER'), ('北京', 3, 5, 'LOC')]

        # 示例3: 验证BIO标注
        invalid_tags = ['B-PER', 'I-LOC', 'O', 'I-ORG']  # 错误:类型不一致,孤立I标签
        is_valid, errors = tagger.validate_bio_tags(invalid_tags)
        print(f"\n标注合法性: {is_valid}")
        if not is_valid:
            print("错误:")
            for error in errors:
                print(f"  {error}")
        ---
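Sequences that fail validation can also be repaired before entities are extracted. The sketch below is one possible policy, not part of the original tagger: `repair_bio_tags` is a hypothetical helper that rewrites an orphan or type-mismatched I- tag as a B- tag and maps any tag that is not B-/I- to O.

```python
from typing import List, Optional


def repair_bio_tags(bio_tags: List[str]) -> List[str]:
    """Rewrite orphan or type-mismatched I- tags as B- tags (hypothetical helper)."""
    repaired: List[str] = []
    prev_type: Optional[str] = None  # entity type of the previous tag, None after 'O'

    for tag in bio_tags:
        if not (tag.startswith('B-') or tag.startswith('I-')):
            # 'O' and malformed tags both end the current entity (assumed policy)
            repaired.append('O')
            prev_type = None
            continue

        prefix, entity_type = tag.split('-', 1)
        if prefix == 'I' and prev_type != entity_type:
            # An I- tag may only continue an entity of the same type
            tag = f'B-{entity_type}'

        repaired.append(tag)
        prev_type = entity_type

    return repaired


print(repair_bio_tags(['B-PER', 'I-LOC', 'O', 'I-ORG']))
# Output: ['B-PER', 'B-LOC', 'O', 'B-ORG']
```

Promoting I- to B- keeps every labelled token; dropping the offending tag to O would be the more conservative alternative.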

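02.BIOES tagging scheme
    a.Description
        BIOES (Begin-Inside-Outside-End-Single) extends BIO with two extra tags, E (End) and S (Single), so that entity boundaries are marked explicitly.
        E marks the last token of a multi-token entity and S marks a single-token entity, which separates the start, middle, and end of an entity from standalone entities.
        Because the boundaries are explicit, BIOES often helps boundary detection, particularly for short and single-token entities.
        Tagging examples: "李/S-PER" (a single-character person name); "中/B-ORG 国/I-ORG 银/I-ORG 行/E-ORG" (a multi-character organization name).
        BIOES is often reported to improve F1 by roughly 1-2 points over BIO, although the gain depends on the dataset and model, and the larger tag set enlarges the output layer and the transition matrix.
        When converting between schemes, handle single-token entities carefully and make sure every B- tag is closed by a matching E- tag.
    b.Code example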
        ---
        # BIOES标注体系实现
        from typing import List, Tuple

        class BIOESTagger:
            """BIOES标注工具类"""

            @staticmethod
            def bio_to_bioes(bio_tags: List[str]) -> List[str]:
                """将BIO标注转换为BIOES标注"""
                bioes_tags = []

                for i, tag in enumerate(bio_tags):
                    if tag == 'O':
                        bioes_tags.append('O')
                        continue

                    prefix, entity_type = tag.split('-', 1)

                    if prefix == 'B':
                        # 检查下一个标签
                        if i < len(bio_tags) - 1 and bio_tags[i+1] == f'I-{entity_type}':
                            bioes_tags.append(f'B-{entity_type}')
                        else:
                            # 单字实体
                            bioes_tags.append(f'S-{entity_type}')

                    elif prefix == 'I':
                        # 检查是否是结尾
                        if i < len(bio_tags) - 1 and bio_tags[i+1] == f'I-{entity_type}':
                            bioes_tags.append(f'I-{entity_type}')
                        else:
                            bioes_tags.append(f'E-{entity_type}')

                return bioes_tags

            @staticmethod
            def bioes_to_bio(bioes_tags: List[str]) -> List[str]:
                """将BIOES标注转换为BIO标注"""
                bio_tags = []

                for tag in bioes_tags:
                    if tag == 'O':
                        bio_tags.append('O')
                    elif tag.startswith('S-'):
                        # 单字实体转为B
                        entity_type = tag[2:]
                        bio_tags.append(f'B-{entity_type}')
                    elif tag.startswith('E-'):
                        # E转为I
                        entity_type = tag[2:]
                        bio_tags.append(f'I-{entity_type}')
                    else:
                        # B和I保持不变
                        bio_tags.append(tag)

                return bio_tags

            @staticmethod
            def bioes_to_entities(tokens: List[str],
                                  bioes_tags: List[str]) -> List[Tuple[str, int, int, str]]:
                """将BIOES标注转换为实体列表"""
                entities = []

                i = 0
                while i < len(tokens):
                    tag = bioes_tags[i]

                    if tag.startswith('S-'):
                        # 单字实体
                        entity_type = tag[2:]
                        entities.append((tokens[i], i, i+1, entity_type))
                        i += 1

                    elif tag.startswith('B-'):
                        # 多字实体开始
                        entity_type = tag[2:]
                        start = i
                        i += 1

                        # 寻找I和E标签
                        while i < len(bioes_tags):
                            if bioes_tags[i] == f'I-{entity_type}':
                                i += 1
                            elif bioes_tags[i] == f'E-{entity_type}':
                                i += 1
                                break
                            else:
                                # 错误:没有找到E标签
                                break

                        entity_text = ''.join(tokens[start:i])
                        entities.append((entity_text, start, i, entity_type))

                    else:
                        i += 1

                return entities

        # 使用示例
        bioes_tagger = BIOESTagger()

        # 示例1: BIO转BIOES
        bio_tags = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'B-ORG']
        bioes_tags = bioes_tagger.bio_to_bioes(bio_tags)
        print("BIO -> BIOES:", bioes_tags)
        # 输出: ['B-PER', 'E-PER', 'O', 'S-LOC', 'O', 'S-ORG']

        # 示例2: BIOES转BIO
        converted_bio = bioes_tagger.bioes_to_bio(bioes_tags)
        print("BIOES -> BIO:", converted_bio)

        # Example 3: extract entities from BIOES tags
        tokens = list("李明在北京工作于阿里巴巴")  # 12 characters
        bioes_tags = ['B-PER', 'E-PER', 'O', 'B-LOC', 'E-LOC', 'O', 'O', 'O',
                      'B-ORG', 'I-ORG', 'I-ORG', 'E-ORG']  # one tag per character
        entities = bioes_tagger.bioes_to_entities(tokens, bioes_tags)
        print("\nExtracted entities:")
        for entity in entities:
            print(f"  {entity}")
        # Output: ('李明', 0, 2, 'PER'), ('北京', 3, 5, 'LOC'), ('阿里巴巴', 8, 12, 'ORG')
        ---
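The requirement that every B- tag be closed by a matching E- tag can be checked mechanically. The following is a minimal sketch under that reading of the scheme; `validate_bioes_tags` is a hypothetical helper and is not part of the `BIOESTagger` class above.

```python
from typing import List, Optional, Tuple


def validate_bioes_tags(tags: List[str]) -> Tuple[bool, List[str]]:
    """Check B/I/E pairing in a BIOES sequence (hypothetical helper)."""
    errors: List[str] = []
    open_type: Optional[str] = None  # type of the entity still waiting for its E- tag

    for i, tag in enumerate(tags):
        if tag == 'O':
            prefix, entity_type = 'O', None
        elif len(tag) > 2 and tag[1] == '-' and tag[0] in 'BIES':
            prefix, entity_type = tag[0], tag[2:]
        else:
            errors.append(f"position {i}: malformed tag '{tag}'")
            open_type = None
            continue

        if prefix in ('I', 'E'):
            if open_type is None:
                errors.append(f"position {i}: {tag} has no preceding B- tag")
            elif open_type != entity_type:
                errors.append(f"position {i}: type changes from {open_type} to {entity_type}")
        elif open_type is not None:
            # O, B- or S- arrived while an entity was still open
            errors.append(f"position {i}: {open_type} entity not closed by E- before {tag}")

        # Only B- and I- leave an entity open
        open_type = entity_type if prefix in ('B', 'I') else None

    if open_type is not None:
        errors.append(f"end of sequence: {open_type} entity not closed by E-")

    return len(errors) == 0, errors


print(validate_bioes_tags(['B-PER', 'E-PER', 'O', 'S-LOC']))
print(validate_bioes_tags(['B-PER', 'O', 'E-PER']))
```

Running such a check on converted data is a cheap way to catch conversion mistakes before training.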

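03.Handling nested entities
    a.Description
        A nested entity is an entity contained inside another one: in "北京大学" (Peking University), "北京" is a LOC while the whole string is an ORG.
        A single BIO/BIOES sequence assigns one tag per token and so cannot represent nesting directly; layered tagging or span-based tagging is needed instead.
        Layered tagging keeps an independent tag sequence per nesting level, for example layer 1 for outer entities and layer 2 for inner entities.
        Span-based tagging recasts NER as span classification: enumerate candidate text spans and classify each one as an entity type or as a non-entity.
        In practice, multi-task learning can be used to train several NER models jointly, each recognizing a different nesting level or entity type.
        Post-processing strategies include keeping the longest entity or choosing the nesting level according to business rules.
    b.Code example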
        ---
        # 嵌套实体处理实现
        from typing import List, Tuple, Set
        from dataclasses import dataclass

        @dataclass
        class NestedEntity:
            """嵌套实体数据类"""
            text: str
            start: int
            end: int
            entity_type: str
            level: int  # 嵌套层级,0为最外层

        class NestedNERHandler:
            """嵌套实体处理器"""

            def __init__(self):
                self.entities = []

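            def add_entity(self, text: str, start: int, end: int, entity_type: str):
                """Add an entity and refresh the nesting level of every entity"""
                self.entities.append(NestedEntity(text, start, end, entity_type, 0))
                self._recompute_levels()

            def _recompute_levels(self):
                """Recompute levels so they do not depend on insertion order"""
                # Visit longer spans first: any enclosing entity is at least as long
                ordered = sorted(self.entities, key=lambda e: e.end - e.start, reverse=True)
                for idx, entity in enumerate(ordered):
                    level = 0
                    for outer in ordered[:idx]:
                        # The entity lies inside an already-processed (longer) entity
                        if outer.start <= entity.start and entity.end <= outer.end:
                            level = max(level, outer.level + 1)
                    entity.level = level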

            def get_entities_by_level(self, level: int) -> List[NestedEntity]:
                """获取指定层级的实体"""
                return [e for e in self.entities if e.level == level]

            def get_flat_entities(self, strategy: str = 'longest') -> List[NestedEntity]:
                """
                将嵌套实体扁平化
                strategy: 'longest'(保留最长), 'shortest'(保留最短), 'outermost'(保留最外层)
                """
                if strategy == 'longest':
                    # 按长度排序,保留不重叠的最长实体
                    sorted_entities = sorted(self.entities,
                                            key=lambda e: e.end - e.start,
                                            reverse=True)
                elif strategy == 'shortest':
                    sorted_entities = sorted(self.entities,
                                            key=lambda e: e.end - e.start)
                elif strategy == 'outermost':
                    sorted_entities = sorted(self.entities,
                                            key=lambda e: e.level)
                else:
                    sorted_entities = self.entities

                # 贪心选择不重叠的实体
                selected = []
                occupied = set()

                for entity in sorted_entities:
                    span = set(range(entity.start, entity.end))
                    if not span & occupied:  # 没有重叠
                        selected.append(entity)
                        occupied.update(span)

                return sorted(selected, key=lambda e: e.start)

            def to_layered_bio(self, tokens: List[str], max_level: int = 2) -> dict:
                """
                转换为分层BIO标注
                返回: {level: bio_tags}
                """
                layered_tags = {}

                for level in range(max_level + 1):
                    bio_tags = ['O'] * len(tokens)
                    level_entities = self.get_entities_by_level(level)

                    for entity in level_entities:
                        if entity.start < len(bio_tags):
                            bio_tags[entity.start] = f'B-{entity.entity_type}'
                            for i in range(entity.start + 1, min(entity.end, len(bio_tags))):
                                bio_tags[i] = f'I-{entity.entity_type}'

                    layered_tags[level] = bio_tags

                return layered_tags

        # 使用示例
        handler = NestedNERHandler()

        text = "北京大学医学院"
        tokens = list(text)

        # 添加嵌套实体
        handler.add_entity("北京", 0, 2, "LOC")  # 内层
        handler.add_entity("北京大学", 0, 4, "ORG")  # 中层
        handler.add_entity("北京大学医学院", 0, 7, "ORG")  # 外层

        print("所有实体:")
        for entity in handler.entities:
            print(f"  {entity.text} ({entity.entity_type}) - Level {entity.level}")

        print("\n分层标注:")
        layered_bio = handler.to_layered_bio(tokens, max_level=2)
        for level, tags in layered_bio.items():
            print(f"Level {level}:", list(zip(tokens, tags)))

        print("\n扁平化(保留最长):")
        flat_entities = handler.get_flat_entities(strategy='longest')
        for entity in flat_entities:
            print(f"  {entity.text} ({entity.entity_type})")
        ---
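The span-based approach described in this section can be sketched as follows: enumerate every candidate span up to a maximum length and attach a gold label to each. The helper names (`enumerate_spans`, `label_spans`), the `max_span_len` parameter, and the use of 'O' as the negative label are assumptions made for illustration; a real system would feed these labelled spans to a span classifier.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # [start, end) token offsets


def enumerate_spans(num_tokens: int, max_span_len: int) -> List[Span]:
    """All spans with 1 <= end - start <= max_span_len."""
    return [(start, end)
            for start in range(num_tokens)
            for end in range(start + 1, min(start + max_span_len, num_tokens) + 1)]


def label_spans(num_tokens: int, gold: Dict[Span, str],
                max_span_len: int, negative_label: str = 'O') -> Dict[Span, str]:
    """Assign each candidate span its gold type, or the negative label."""
    return {span: gold.get(span, negative_label)
            for span in enumerate_spans(num_tokens, max_span_len)}


tokens = list("北京大学医学院")
gold = {(0, 2): 'LOC', (0, 4): 'ORG', (0, 7): 'ORG'}  # nested entities coexist

span_labels = label_spans(len(tokens), gold, max_span_len=7)
positives = {span: label for span, label in span_labels.items() if label != 'O'}

print(len(span_labels))  # Output: 28
print(positives)
```

Because each span is classified independently, overlapping and nested entities need no special handling; the cost is that the number of candidates grows with sentence length times `max_span_len`, and most of them are negatives.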

3.3 BiLSTM-CRF

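01.BiLSTM-CRF model architecture
    a.Description
        BiLSTM-CRF is a classic sequence labeling model that combines the context modeling of a bidirectional LSTM with the global label constraints of a CRF.
        The BiLSTM layer runs a forward and a backward LSTM to capture context on both sides of each token and outputs a hidden state per position.
        The CRF layer learns tag transition scores on top of the BiLSTM outputs, which favors globally consistent tag sequences; for example, it learns that B-PER should not be followed directly by I-LOC.
        Training minimizes the negative log-likelihood, with the partition function computed by the forward algorithm; decoding uses the Viterbi algorithm to find the best tag sequence.
        Compared with classifying each BiLSTM output independently, the CRF layer suppresses illegal tag transitions and is often reported to improve F1 by roughly 1-2 points, depending on the dataset.
        The input layer usually uses word embeddings or character embeddings, optionally initialized from pretrained vectors such as Word2Vec or GloVe.
    b.Code example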
        ---
        # BiLSTM-CRF模型实现
        import torch
        import torch.nn as nn
        from typing import List, Tuple
        import numpy as np

        class BiLSTM_CRF(nn.Module):
            """BiLSTM-CRF命名实体识别模型"""

            def __init__(self,
                        vocab_size: int,
                        tag_to_ix: dict,
                        embedding_dim: int = 100,
                        hidden_dim: int = 128):
                super(BiLSTM_CRF, self).__init__()

                self.embedding_dim = embedding_dim
                self.hidden_dim = hidden_dim
                self.vocab_size = vocab_size
                self.tag_to_ix = tag_to_ix
                self.tagset_size = len(tag_to_ix)

                # 词嵌入层
                self.word_embeds = nn.Embedding(vocab_size, embedding_dim)

                # BiLSTM层
                self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                                   num_layers=1, bidirectional=True, batch_first=True)

                # 线性层:LSTM输出到标签空间
                self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

                # CRF转移矩阵: transitions[i][j]表示从标签j转移到标签i的分数
                self.transitions = nn.Parameter(
                    torch.randn(self.tagset_size, self.tagset_size))

                # 约束:不能转移到START标签,不能从STOP标签转移
                self.transitions.data[tag_to_ix['<START>'], :] = -10000
                self.transitions.data[:, tag_to_ix['<STOP>']] = -10000

            def _forward_alg(self, feats: torch.Tensor) -> torch.Tensor:
                """
                前向算法计算配分函数(所有路径的分数和)
                feats: [seq_len, tagset_size] BiLSTM输出特征
                """
                # Initialize the forward variables on the same device as the features
                init_alphas = torch.full((1, self.tagset_size), -10000., device=feats.device)
                init_alphas[0][self.tag_to_ix['<START>']] = 0.

                forward_var = init_alphas

                # Iterate over the sentence
                for feat in feats:
                    alphas_t = []
                    for next_tag in range(self.tagset_size):
                        # Emission score
                        emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
                        # 转移分数
                        trans_score = self.transitions[next_tag].view(1, -1)
                        # 当前标签的分数
                        next_tag_var = forward_var + trans_score + emit_score
                        # log-sum-exp
                        alphas_t.append(torch.logsumexp(next_tag_var, dim=1).view(1))

                    forward_var = torch.cat(alphas_t).view(1, -1)

                # 加上转移到STOP的分数
                terminal_var = forward_var + self.transitions[self.tag_to_ix['<STOP>']]
                alpha = torch.logsumexp(terminal_var, dim=1)

                return alpha

            def _score_sentence(self, feats: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
                """
                计算给定标签序列的分数
                feats: [seq_len, tagset_size]
                tags: [seq_len] 真实标签序列
                """
                score = torch.zeros(1, device=feats.device)
                start = torch.tensor([self.tag_to_ix['<START>']], dtype=torch.long, device=tags.device)
                tags = torch.cat([start, tags])

                for i, feat in enumerate(feats):
                    # 发射分数 + 转移分数
                    score = score + self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]

                # 加上转移到STOP的分数
                score = score + self.transitions[self.tag_to_ix['<STOP>'], tags[-1]]

                return score

            def _viterbi_decode(self, feats: torch.Tensor) -> Tuple[float, List[int]]:
                """
                Viterbi算法解码最优标签序列
                feats: [seq_len, tagset_size]
                返回: (最优路径分数, 最优标签序列)
                """
                backpointers = []

                # 初始化
                init_vvars = torch.full((1, self.tagset_size), -10000., device=feats.device)
                init_vvars[0][self.tag_to_ix['<START>']] = 0

                forward_var = init_vvars

                # 前向传播
                for feat in feats:
                    bptrs_t = []
                    viterbivars_t = []

                    for next_tag in range(self.tagset_size):
                        next_tag_var = forward_var + self.transitions[next_tag]
                        best_tag_id = torch.argmax(next_tag_var)
                        bptrs_t.append(best_tag_id)
                        viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))

                    forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
                    backpointers.append(bptrs_t)

                # 转移到STOP
                terminal_var = forward_var + self.transitions[self.tag_to_ix['<STOP>']]
                best_tag_id = torch.argmax(terminal_var)
                path_score = terminal_var[0][best_tag_id]

                # 回溯最优路径
                best_path = [best_tag_id]
                for bptrs_t in reversed(backpointers):
                    best_tag_id = bptrs_t[best_tag_id]
                    best_path.append(best_tag_id)

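                # Drop the START tag and return plain ints (usable as dict keys, e.g. in ix_to_tag)
                start = best_path.pop()
                assert start == self.tag_to_ix['<START>']
                best_path.reverse()

                return path_score, [int(tag_id) for tag_id in best_path]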

            def neg_log_likelihood(self, sentence: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
                """
                计算负对数似然损失
                sentence: [seq_len] 输入句子(词索引)
                tags: [seq_len] 真实标签序列
                """
                # BiLSTM前向传播
                embeds = self.word_embeds(sentence).view(1, len(sentence), -1)
                lstm_out, _ = self.lstm(embeds)
                lstm_feats = self.hidden2tag(lstm_out.view(len(sentence), -1))

                # CRF损失
                forward_score = self._forward_alg(lstm_feats)
                gold_score = self._score_sentence(lstm_feats, tags)

                return forward_score - gold_score

            def forward(self, sentence: torch.Tensor) -> Tuple[float, List[int]]:
                """
                预测标签序列
                sentence: [seq_len] 输入句子(词索引)
                """
                # BiLSTM前向传播
                embeds = self.word_embeds(sentence).view(1, len(sentence), -1)
                lstm_out, _ = self.lstm(embeds)
                lstm_feats = self.hidden2tag(lstm_out.view(len(sentence), -1))

                # Viterbi解码
                score, tag_seq = self._viterbi_decode(lstm_feats)

                return score, tag_seq

        # 使用示例
        # 构建词汇表和标签映射
        word_to_ix = {"李": 0, "明": 1, "在": 2, "北": 3, "京": 4, "工": 5, "作": 6}
        tag_to_ix = {"B-PER": 0, "I-PER": 1, "B-LOC": 2, "I-LOC": 3, "O": 4, "<START>": 5, "<STOP>": 6}

        # 创建模型
        model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, embedding_dim=50, hidden_dim=64)

        # 示例数据
        sentence = torch.tensor([0, 1, 2, 3, 4, 5, 6], dtype=torch.long)  # 李明在北京工作
        tags = torch.tensor([0, 1, 4, 2, 3, 4, 4], dtype=torch.long)  # B-PER I-PER O B-LOC I-LOC O O

        # 计算损失
        loss = model.neg_log_likelihood(sentence, tags)
        print(f"训练损失: {loss.item():.4f}")

        # 预测
        with torch.no_grad():
            score, predicted_tags = model(sentence)
            print(f"预测标签: {predicted_tags}")
        ---
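The model above learns its transition scores and only hard-codes the START/STOP constraints. If illegal BIO transitions such as B-PER followed by I-LOC should be ruled out explicitly, one option is to precompute a legality mask. The sketch below is an assumption-labelled illustration: `is_legal_bio_transition` and `build_transition_mask` are hypothetical helpers, and the mask follows the same `[to][from]` layout as `transitions[i][j]` in the model.

```python
from typing import List


def is_legal_bio_transition(prev_tag: str, next_tag: str) -> bool:
    """An I-X tag may only follow B-X or I-X; every other move is allowed."""
    if not next_tag.startswith('I-'):
        return True
    entity_type = next_tag[2:]
    return prev_tag in (f'B-{entity_type}', f'I-{entity_type}')


def build_transition_mask(tags: List[str]) -> List[List[bool]]:
    """mask[i][j] is True when moving from tags[j] to tags[i] is legal."""
    return [[is_legal_bio_transition(prev, nxt) for prev in tags] for nxt in tags]


tags = ['B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'O']
mask = build_transition_mask(tags)

# List the transitions that the BIO scheme forbids
for i, nxt in enumerate(tags):
    for j, prev in enumerate(tags):
        if not mask[i][j]:
            print(f"{prev} -> {nxt} is illegal")
```

Entries where the mask is False could then be set to a large negative value (for example -10000, as the model already does for START and STOP) in `transitions.data`. The helpers only cover the BIO rule; START/STOP handling stays as in the model.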

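02.Model training workflow
    a.Description
        Training BiLSTM-CRF requires annotated data, a vocabulary, and a tag mapping, with the text converted to index sequences.
        Adam or SGD is the usual optimizer, with a learning rate around 0.001-0.01; learning-rate decay can help convergence.
        Monitor validation F1 during training, use early stopping to limit overfitting, and save the best checkpoint.
        Batching requires padding sentences of different lengths; pack_padded_sequence and pad_packed_sequence handle the variable-length sequences inside the LSTM.
        Data augmentation such as synonym replacement, back-translation, and entity replacement can enlarge the training set and improve generalization.
        Hyperparameters to tune include the embedding dimension, LSTM hidden size, number of layers, and dropout rate, typically via grid search or Bayesian optimization.
    b.Code example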
        ---
        # BiLSTM-CRF训练流程
        import torch
        import torch.optim as optim
        from torch.utils.data import Dataset, DataLoader
        from typing import List, Tuple
        from tqdm import tqdm

        class NERDataset(Dataset):
            """NER数据集类"""

            def __init__(self, sentences: List[List[str]], labels: List[List[str]],
                        word_to_ix: dict, tag_to_ix: dict):
                self.sentences = sentences
                self.labels = labels
                self.word_to_ix = word_to_ix
                self.tag_to_ix = tag_to_ix

            def __len__(self):
                return len(self.sentences)

            def __getitem__(self, idx):
                sentence = self.sentences[idx]
                label = self.labels[idx]

                # 转换为索引
                sentence_idx = [self.word_to_ix.get(w, self.word_to_ix['<UNK>']) for w in sentence]
                label_idx = [self.tag_to_ix[t] for t in label]

                return torch.tensor(sentence_idx, dtype=torch.long), \
                       torch.tensor(label_idx, dtype=torch.long)

        class BiLSTMCRFTrainer:
            """BiLSTM-CRF训练器"""

            def __init__(self, model, optimizer, device='cpu'):
                self.model = model.to(device)
                self.optimizer = optimizer
                self.device = device
                self.best_f1 = 0.0

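            def train_epoch(self, train_loader):
                """Train for one epoch and return the mean loss per sentence"""
                self.model.train()
                total_loss = 0.0
                num_sentences = 0

                for sentences, tags in tqdm(train_loader, desc="Training"):
                    # Simplification: sentences are processed one at a time;
                    # a production version would batch them with padding
                    for sentence, tag in zip(sentences, tags):
                        sentence = sentence.to(self.device)
                        tag = tag.to(self.device)

                        # Reset gradients
                        self.model.zero_grad()

                        # Compute the loss
                        loss = self.model.neg_log_likelihood(sentence, tag)

                        # Backpropagate and update the parameters
                        loss.backward()
                        self.optimizer.step()

                        total_loss += loss.item()
                        num_sentences += 1

                # Average over sentences, since the loss is accumulated per sentence
                return total_loss / max(num_sentences, 1)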

            def evaluate(self, val_loader, ix_to_tag: dict):
                """评估模型"""
                self.model.eval()
                all_predictions = []
                all_labels = []

                with torch.no_grad():
                    for sentences, tags in val_loader:
                        for sentence, tag in zip(sentences, tags):
                            sentence = sentence.to(self.device)

                            # 预测
                            _, predicted_tags = self.model(sentence)

                            # 转换为标签名称
                            pred_labels = [ix_to_tag[t] for t in predicted_tags]
                            true_labels = [ix_to_tag[t.item()] for t in tag]

                            all_predictions.append(pred_labels)
                            all_labels.append(true_labels)

                # Entity-level F1 computed with seqeval (third-party package)
                from seqeval.metrics import f1_score
                f1 = f1_score(all_labels, all_predictions)

                return f1

            def train(self, train_loader, val_loader, ix_to_tag: dict,
                     epochs: int = 10, patience: int = 3):
                """完整训练流程"""
                no_improve = 0

                for epoch in range(epochs):
                    print(f"\nEpoch {epoch + 1}/{epochs}")

                    # 训练
                    train_loss = self.train_epoch(train_loader)
                    print(f"训练损失: {train_loss:.4f}")

                    # 验证
                    val_f1 = self.evaluate(val_loader, ix_to_tag)
                    print(f"验证F1: {val_f1:.4f}")

                    # 早停检查
                    if val_f1 > self.best_f1:
                        self.best_f1 = val_f1
                        torch.save(self.model.state_dict(), 'best_model.pt')
                        print(f"保存最佳模型 (F1: {val_f1:.4f})")
                        no_improve = 0
                    else:
                        no_improve += 1
                        if no_improve >= patience:
                            print(f"早停: {patience}个epoch无改善")
                            break

                print(f"\n训练完成! 最佳F1: {self.best_f1:.4f}")

        # 使用示例
        # 准备数据
        train_sentences = [["李", "明", "在", "北", "京"], ["王", "华", "去", "上", "海"]]
        train_labels = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"],
                       ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]

        # 构建词汇表
        word_to_ix = {"<PAD>": 0, "<UNK>": 1}
        for sentence in train_sentences:
            for word in sentence:
                if word not in word_to_ix:
                    word_to_ix[word] = len(word_to_ix)

        tag_to_ix = {"B-PER": 0, "I-PER": 1, "B-LOC": 2, "I-LOC": 3, "O": 4,
                    "<START>": 5, "<STOP>": 6, "<PAD>": 7}
        ix_to_tag = {v: k for k, v in tag_to_ix.items()}

        # 创建数据集
        train_dataset = NERDataset(train_sentences, train_labels, word_to_ix, tag_to_ix)
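        # Keep each batch as tuples of variable-length tensors instead of stacking them,
        # so sentences of different lengths do not break the default collate function
        train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True,
                                  collate_fn=lambda batch: tuple(zip(*batch)))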

        # 创建模型和优化器
        model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, embedding_dim=50, hidden_dim=64)
        optimizer = optim.Adam(model.parameters(), lr=0.01)

        # 训练
        trainer = BiLSTMCRFTrainer(model, optimizer)
        # trainer.train(train_loader, train_loader, ix_to_tag, epochs=10)
        ---
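The validation metric monitored during training is entity-level F1: a predicted entity counts as correct only when its span and type both match the gold annotation. The pure-Python sketch below illustrates the idea; it is not seqeval's implementation, and `extract_spans` and `entity_f1` are hypothetical helpers. Treating an orphan I- tag as the start of an entity is an assumed policy.

```python
from typing import List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, type) spans from one BIO sequence; end is exclusive."""
    spans: Set[Tuple[int, int, str]] = set()
    start, current = None, None

    for i, tag in enumerate(tags + ['O']):  # the sentinel closes a trailing entity
        continues = tag.startswith('I-') and tag[2:] == current
        if current is not None and not continues:
            spans.add((start, i, current))
            start, current = None, None
        if tag.startswith('B-') or (tag.startswith('I-') and current is None):
            start, current = i, tag[2:]

    return spans


def entity_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)

    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']]
pred = [['B-PER', 'I-PER', 'O', 'B-LOC', 'O']]  # LOC boundary is wrong
print(f"{entity_f1(gold, pred):.4f}")  # Output: 0.5000
```

In the example the prediction gets four of five tokens right, yet only one of two entities, which is why entity-level F1 is a stricter signal than token accuracy.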

3.4 BERT-NER

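01.BERT for NER
    a.Description
        BERT learns general-purpose language representations through Masked Language Modeling and Next Sentence Prediction pretraining and can be fine-tuned directly for NER.
        BERT-NER adds a classification layer on top of the BERT outputs and predicts a BIO tag for each token, relying on BERT's contextual representations.
        Compared with BiLSTM-CRF, BERT-NER needs no hand-crafted features and usually reaches a higher F1 on Chinese NER; gains of 5-10 points are sometimes cited, but the actual gain depends on the dataset and the baseline.
        The architecture consists of the BERT encoder, a dropout layer, and a linear classification layer; a CRF layer can optionally be added to improve tag consistency.
        During training, one option is to freeze the lower BERT layers and train only the upper ones; another is discriminative learning rates, where different layers use different learning rates.
        Chinese NER typically uses bert-base-chinese and English NER a model such as bert-base-uncased (a cased model is often preferred, since capitalization is a useful cue for entities); for domain-specific data, BERT can be further pretrained on in-domain text.
    b.Code example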
        ---
        # BERT-NER模型实现
        import torch
        import torch.nn as nn
        from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
        from transformers import BertModel, BertTokenizer, get_linear_schedule_with_warmup
        from typing import List, Tuple
        import numpy as np

        class BERT_NER(nn.Module):
            """BERT命名实体识别模型"""

            def __init__(self, bert_model_name: str, num_labels: int, dropout: float = 0.1):
                super(BERT_NER, self).__init__()

                # 加载预训练BERT模型
                self.bert = BertModel.from_pretrained(bert_model_name)
                self.dropout = nn.Dropout(dropout)

                # 分类层
                self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

                # 损失函数
                self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # 忽略padding

            def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
                """
                前向传播
                input_ids: [batch_size, seq_len] 输入token ID
                attention_mask: [batch_size, seq_len] 注意力掩码
                labels: [batch_size, seq_len] 标签序列(可选)
                """
                # BERT编码
                outputs = self.bert(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids
                )

                # 获取序列输出 [batch_size, seq_len, hidden_size]
                sequence_output = outputs.last_hidden_state

                # Dropout
                sequence_output = self.dropout(sequence_output)

                # 分类 [batch_size, seq_len, num_labels]
                logits = self.classifier(sequence_output)

                # 计算损失
                loss = None
                if labels is not None:
                    # 展平计算损失
                    loss = self.loss_fn(logits.view(-1, logits.shape[-1]), labels.view(-1))

                return {'loss': loss, 'logits': logits}

            def predict(self, input_ids, attention_mask, token_type_ids=None):
                """预测标签序列"""
                self.eval()
                with torch.no_grad():
                    outputs = self.forward(input_ids, attention_mask, token_type_ids)
                    logits = outputs['logits']

                    # 获取最大概率的标签
                    predictions = torch.argmax(logits, dim=-1)

                return predictions

        class BERT_CRF_NER(nn.Module):
            """BERT-CRF命名实体识别模型"""

            def __init__(self, bert_model_name: str, num_labels: int, dropout: float = 0.1):
                super(BERT_CRF_NER, self).__init__()

                self.num_labels = num_labels

                # BERT编码器
                self.bert = BertModel.from_pretrained(bert_model_name)
                self.dropout = nn.Dropout(dropout)

                # 发射层
                self.hidden2tag = nn.Linear(self.bert.config.hidden_size, num_labels)

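                # CRF transition matrix: transitions[i][j] is the score of moving from tag i to tag j.
                # Label 0 is assumed to be START and the last label STOP, so num_labels must include both.
                self.transitions = nn.Parameter(torch.randn(num_labels, num_labels))

                # Constraints on the START and STOP tags
                self.transitions.data[:, 0] = -10000  # never transition into START
                self.transitions.data[-1, :] = -10000  # never transition out of STOP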

            def _forward_alg(self, feats, mask):
                """CRF前向算法"""
                batch_size, seq_len, num_labels = feats.size()

                # 初始化
                alphas = torch.full((batch_size, num_labels), -10000., device=feats.device)
                alphas[:, 0] = 0.  # START标签

                for t in range(seq_len):
                    # 当前时刻的发射分数
                    emit_score = feats[:, t, :].unsqueeze(1)  # [batch, 1, num_labels]

                    # 转移分数
                    trans_score = self.transitions.unsqueeze(0)  # [1, num_labels, num_labels]

                    # 计算下一时刻的分数
                    next_alphas = alphas.unsqueeze(2) + trans_score + emit_score

                    # log-sum-exp
                    next_alphas = torch.logsumexp(next_alphas, dim=1)

                    # 根据mask更新
                    alphas = torch.where(mask[:, t].unsqueeze(1).bool(), next_alphas, alphas)

                # Transition into STOP: transitions[i, -1] scores moving from tag i to STOP
                alphas = alphas + self.transitions[:, -1].unsqueeze(0)

                return torch.logsumexp(alphas, dim=1)

            def _score_sentence(self, feats, tags, mask):
                """Score the given tag sequence (tags must hold valid label ids at every position, including padding)"""
                batch_size, seq_len = tags.size()
                score = torch.zeros(batch_size, device=feats.device)

                # Prepend the START tag (label 0)
                tags = torch.cat([torch.zeros(batch_size, 1, dtype=torch.long, device=tags.device), tags], dim=1)

                for t in range(seq_len):
                    # Emission score of the gold tag at step t
                    emit_score = feats[:, t, :].gather(1, tags[:, t+1].unsqueeze(1)).squeeze(1)

                    # Transition score, indexed [from, to] to match _forward_alg
                    trans_score = self.transitions[tags[:, t], tags[:, t+1]]

                    # Accumulate, skipping padded positions
                    score = score + (emit_score + trans_score) * mask[:, t]

                # Transition from the last real tag into STOP
                last_tag_index = mask.sum(dim=1).long() - 1
                last_tags = tags.gather(1, last_tag_index.unsqueeze(1) + 1).squeeze(1)
                score = score + self.transitions[last_tags, -1]

                return score

            def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
                """前向传播"""
                # BERT编码
                outputs = self.bert(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids
                )

                sequence_output = self.dropout(outputs.last_hidden_state)

                # 发射分数
                feats = self.hidden2tag(sequence_output)

                if labels is not None:
                    # 计算CRF损失
                    forward_score = self._forward_alg(feats, attention_mask)
                    gold_score = self._score_sentence(feats, labels, attention_mask)
                    loss = (forward_score - gold_score).mean()

                    return {'loss': loss, 'logits': feats}
                else:
                    return {'logits': feats}

        # 使用示例
        # 初始化模型
        model = BERT_NER(
            bert_model_name='bert-base-chinese',
            num_labels=7,  # B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O
            dropout=0.1
        )

        # 准备输入数据
        tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        text = "李明在北京工作"
        encoding = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

        # 前向传播
        outputs = model(
            input_ids=encoding['input_ids'],
            attention_mask=encoding['attention_mask']
        )

        print(f"输出logits形状: {outputs['logits'].shape}")
        # 输出: [batch_size, seq_len, num_labels]
        ---
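    c.Supplementary sketch
        The CRF loss used in forward() is log Z minus the score of the gold path. As a minimal, framework-free sketch of that identity, the code below checks a forward-algorithm computation of log Z against brute-force enumeration on a toy problem. The helper names (logsumexp, path_score, log_partition_forward, log_partition_brute_force) and the toy numbers are illustrative assumptions, and START/STOP transitions are left out for brevity, so this is not the BERT_NER implementation itself.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, transitions, path):
    """Score of one tag path: emission scores plus tag-to-tag transitions."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        # transitions[cur][prev] mirrors the [to, from] indexing used above
        score += transitions[path[t]][path[t - 1]] + emissions[t][path[t]]
    return score


def log_partition_forward(emissions, transitions):
    """Forward algorithm: log-sum of the scores of all paths."""
    num_tags = len(emissions[0])
    alphas = list(emissions[0])
    for t in range(1, len(emissions)):
        alphas = [
            emissions[t][cur] + logsumexp(
                [alphas[prev] + transitions[cur][prev] for prev in range(num_tags)]
            )
            for cur in range(num_tags)
        ]
    return logsumexp(alphas)


def log_partition_brute_force(emissions, transitions):
    """Reference value: enumerate every possible path explicitly."""
    num_tags = len(emissions[0])
    return logsumexp([
        path_score(emissions, transitions, path)
        for path in product(range(num_tags), repeat=len(emissions))
    ])


# Toy problem: 3 time steps, 2 tags (hypothetical scores)
emissions = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
transitions = [[0.5, -0.2], [-1.0, 0.8]]
gold_path = [0, 1, 1]

log_z = log_partition_forward(emissions, transitions)
nll = log_z - path_score(emissions, transitions, gold_path)
print(f"log Z = {log_z:.4f}, NLL of gold path = {nll:.4f}")
```

        Enumeration visits K^T paths for K tags and T steps, while the forward recursion needs only O(T * K^2) work, which is why CRF training relies on the latter.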

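02.BERT-NER Training and Optimization
    a.Description
        Learning rates need care when training BERT-NER: the BERT layers use a small rate (e.g. 2e-5), while the task-specific layers (linear head, CRF) use a larger one (1e-3 is a common choice; the example below uses 10x the BERT rate).
        A warmup schedule raises the learning rate gradually at the start of training, typically over the first 10% of total steps, which helps stabilize training.
        Batch size is limited by GPU memory and is usually set to 8-32; gradient accumulation can simulate a larger batch.
        Maximum sequence length is usually 128-512: longer sequences cost more compute, shorter ones truncate important context, so choose it from the length distribution of the data.
        Preprocessing must align BERT's tokenization with the original labels, because subword tokenization can split one labeled word into several tokens.
        Evaluate with the seqeval library, which computes entity-level F1, and exclude the predictions for special tokens such as [CLS] and [SEP].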
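        To make the warmup rule concrete, here is a small sketch of a linear warmup followed by linear decay, written without any framework. The function name linear_warmup_decay and the step counts are illustrative assumptions; the formula follows the usual shape of a linear schedule with warmup, and the actual training code should use the scheduler provided by the library.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay phase: shrink linearly from 1 to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
total_steps = 1000
warmup_steps = int(total_steps * 0.1)  # 10% warmup, as described above

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

        The learning rate peaks exactly at the end of warmup (step 100 here) and reaches zero at the final step.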
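    b.Code example
        ---
        # BERT-NER training pipeline
        import torch
        from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer
        from torch.utils.data import Dataset, DataLoader
        from transformers import BertTokenizer, BertTokenizerFast, get_linear_schedule_with_warmup
        from typing import List, Dict
        from tqdm import tqdm
        import numpy as np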

        class NERDatasetForBERT(Dataset):
            """BERT-NER数据集"""

            def __init__(self, texts: List[List[str]], labels: List[List[str]],
                        tokenizer, label2id: Dict[str, int], max_len: int = 128):
                self.texts = texts
                self.labels = labels
                self.tokenizer = tokenizer
                self.label2id = label2id
                self.max_len = max_len

            def __len__(self):
                return len(self.texts)

            def __getitem__(self, idx):
                words = self.texts[idx]
                labels = self.labels[idx]

                # Tokenize并对齐标签
                encoding = self.tokenizer(
                    words,
                    is_split_into_words=True,
                    padding='max_length',
                    truncation=True,
                    max_length=self.max_len,
                    return_tensors='pt'
                )

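                # Align labels with tokens (handles subword tokenization).
                # word_ids() is only available on fast tokenizers (e.g. BertTokenizerFast).
                word_ids = encoding.word_ids(batch_index=0)
                aligned_labels = []

                previous_word_idx = None
                for word_idx in word_ids:
                    if word_idx is None:
                        # Special tokens ([CLS], [SEP], [PAD]) are ignored
                        aligned_labels.append(-100)
                    elif word_idx != previous_word_idx:
                        # First subword of a word keeps the word's label
                        aligned_labels.append(self.label2id[labels[word_idx]])
                    else:
                        # Later subwords continue the entity: B-X becomes I-X
                        # (an alternative is to ignore them with -100)
                        label = labels[word_idx]
                        if label.startswith('B-'):
                            label = 'I-' + label[2:]
                        aligned_labels.append(self.label2id[label])

                    previous_word_idx = word_idx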

                return {
                    'input_ids': encoding['input_ids'].squeeze(0),
                    'attention_mask': encoding['attention_mask'].squeeze(0),
                    'labels': torch.tensor(aligned_labels, dtype=torch.long)
                }

        class BERTNERTrainer:
            """BERT-NER训练器"""

            def __init__(self, model, device='cuda' if torch.cuda.is_available() else 'cpu'):
                self.model = model.to(device)
                self.device = device

            def train(self, train_loader, val_loader, epochs: int = 3,
                     learning_rate: float = 2e-5, warmup_ratio: float = 0.1):
                """训练模型"""

                # Optimizer: small learning rate for the BERT layers, larger one for the task-specific layers
                optimizer_grouped_parameters = [
                    {
                        'params': [p for n, p in self.model.named_parameters()
                                  if 'bert' in n],
                        'lr': learning_rate
                    },
                    {
                        'params': [p for n, p in self.model.named_parameters()
                                  if 'bert' not in n],
                        'lr': learning_rate * 10
                    }
                ]
                optimizer = AdamW(optimizer_grouped_parameters)

                # 学习率调度器(带warmup)
                total_steps = len(train_loader) * epochs
                warmup_steps = int(total_steps * warmup_ratio)
                scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=warmup_steps,
                    num_training_steps=total_steps
                )

                best_f1 = 0.0

                for epoch in range(epochs):
                    print(f"\nEpoch {epoch + 1}/{epochs}")

                    # 训练阶段
                    self.model.train()
                    train_loss = 0.0

                    for batch in tqdm(train_loader, desc="Training"):
                        # 移动到设备
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        labels = batch['labels'].to(self.device)

                        # 前向传播
                        outputs = self.model(
                            input_ids=input_ids,
                            attention_mask=attention_mask,
                            labels=labels
                        )

                        loss = outputs['loss']
                        train_loss += loss.item()

                        # 反向传播
                        optimizer.zero_grad()
                        loss.backward()

                        # 梯度裁剪
                        torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)

                        optimizer.step()
                        scheduler.step()

                    avg_train_loss = train_loss / len(train_loader)
                    print(f"平均训练损失: {avg_train_loss:.4f}")

                    # 验证阶段
                    val_f1 = self.evaluate(val_loader)
                    print(f"验证F1: {val_f1:.4f}")

                    # 保存最佳模型
                    if val_f1 > best_f1:
                        best_f1 = val_f1
                        torch.save(self.model.state_dict(), 'best_bert_ner.pt')
                        print(f"保存最佳模型 (F1: {val_f1:.4f})")

                print(f"\n训练完成! 最佳F1: {best_f1:.4f}")

            def evaluate(self, val_loader):
                """评估模型"""
                self.model.eval()
                all_predictions = []
                all_labels = []

                with torch.no_grad():
                    for batch in val_loader:
                        input_ids = batch['input_ids'].to(self.device)
                        attention_mask = batch['attention_mask'].to(self.device)
                        labels = batch['labels'].to(self.device)

                        # 预测
                        predictions = self.model.predict(input_ids, attention_mask)

                        # 收集有效标签(排除-100)
                        for pred, label, mask in zip(predictions, labels, attention_mask):
                            valid_indices = (label != -100) & (mask == 1)
                            all_predictions.append(pred[valid_indices].cpu().numpy())
                            all_labels.append(label[valid_indices].cpu().numpy())

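                # Simplified metric: token-level accuracy over the valid positions.
                # For entity-level F1, convert ids to label strings and use seqeval.
                correct = sum((p == l).sum() for p, l in zip(all_predictions, all_labels))
                total = sum(len(l) for l in all_labels)
                accuracy = correct / total if total > 0 else 0.0

                return accuracy  # NOTE: token accuracy, not F1, despite the caller's variable name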

        # 使用示例
        label2id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4, 'B-ORG': 5, 'I-ORG': 6}
        id2label = {v: k for k, v in label2id.items()}

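        # NERDatasetForBERT calls word_ids(), which requires a fast tokenizer
        from transformers import BertTokenizerFast
        tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')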

        # 准备数据
        train_texts = [['李', '明', '在', '北', '京', '工', '作']]
        train_labels = [['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']]

        train_dataset = NERDatasetForBERT(train_texts, train_labels, tokenizer, label2id)
        train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

        # 创建模型和训练器
        model = BERT_NER('bert-base-chinese', num_labels=len(label2id))
        trainer = BERTNERTrainer(model)

        # 开始训练
        # trainer.train(train_loader, train_loader, epochs=3)
        ---
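    c.Supplementary sketch
        The training loop above steps the optimizer after every batch. Gradient accumulation, mentioned in the description, instead sums scaled gradients over several small batches before one optimizer step. The sketch below illustrates why this matches a larger batch, using a one-parameter least-squares model in plain Python; the data, the function name mse_gradient, and the batch sizes are made up for illustration and stand in for loss.backward() on a real model.

```python
def mse_gradient(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical (x, y) samples and an arbitrary current weight
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

micro_batch_size = 2
accumulation_steps = len(data) // micro_batch_size  # 4 micro-batches

# Gradient of one large batch of 8 samples
full_grad = mse_gradient(w, data)

# Same gradient, accumulated from 4 micro-batches of 2 samples each
accumulated = 0.0
for i in range(accumulation_steps):
    micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    # Scaling by 1/accumulation_steps plays the role of `loss / accumulation_steps`
    accumulated += mse_gradient(w, micro) / accumulation_steps
# A real loop would call optimizer.step() and optimizer.zero_grad() here, once per cycle

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
```

        The two values agree because the micro-batches have equal size; with a learning-rate scheduler, the scheduler should also advance once per optimizer step rather than once per micro-batch.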

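3.5 Hands-on Projects

01.Chinese Medical NER Project
    a.Description
        Medical NER identifies entities such as diseases, symptoms, drugs, tests, and operations in clinical records; it is the foundation of medical information extraction.
        Entity types include disease (DIS), symptom (SYM), drug (DRUG), test (TEST), operation (OPER), and body part (BODY).
        Data sources include the CCKS medical NER datasets, text from the cMedQA question-answering corpus (which has to be annotated before it can be used for NER), and in-house annotated data; annotation requires medical professionals.
        The model uses a BERT-CRF architecture, with a medical-domain pretrained model such as MC-BERT or PCL-MedBERT to improve domain adaptation.
        Post-processing covers entity normalization (for example, mapping "II型糖尿病" and "二型糖尿病" to the standard term "2型糖尿病"), entity disambiguation, and knowledge-base linking.
        Evaluation must account for fuzzy entity boundaries: in "左侧肺部感染", for instance, the annotation guideline has to state explicitly whether "左侧" belongs to the entity.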
    b.Code example
        ---
        # Complete medical NER project implementation
        import torch
        import torch.nn as nn
        from transformers import BertTokenizer, BertModel
        from typing import List, Dict, Tuple
        import json
        import re

        class MedicalNERSystem:
            """医疗命名实体识别系统"""

            def __init__(self, model_path: str, config_path: str):
                # 加载配置
                with open(config_path, 'r', encoding='utf-8') as f:
                    self.config = json.load(f)

                self.label2id = self.config['label2id']
                self.id2label = {v: k for k, v in self.label2id.items()}

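                # Load the model. A fast tokenizer is required because predict()
                # requests return_offsets_mapping, which slow tokenizers do not support.
                from transformers import BertTokenizerFast
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.tokenizer = BertTokenizerFast.from_pretrained(self.config['bert_model'])
                self.model = self._load_model(model_path)
                self.model.to(self.device)
                self.model.eval()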

                # 医疗实体词典(用于规范化)
                self.entity_dict = self._load_entity_dict()

            def _load_model(self, model_path: str):
                """加载训练好的模型"""
                from transformers import BertForTokenClassification

                model = BertForTokenClassification.from_pretrained(
                    self.config['bert_model'],
                    num_labels=len(self.label2id)
                )

                # 加载权重
                if model_path:
                    model.load_state_dict(torch.load(model_path, map_location=self.device))

                return model

            def _load_entity_dict(self) -> Dict[str, str]:
                """加载医疗实体标准词典"""
                # 示例词典:变体 -> 标准名称
                return {
                    '糖尿病': '糖尿病',
                    'II型糖尿病': '2型糖尿病',
                    '二型糖尿病': '2型糖尿病',
                    '高血压': '高血压',
                    '高血压病': '高血压',
                    '感冒': '上呼吸道感染',
                    '发烧': '发热',
                    '发热': '发热'
                }

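            def preprocess_text(self, text: str) -> str:
                """Clean the text. Entity offsets returned by predict() refer to this cleaned text."""
                # Remove punctuation and symbols (\w already covers CJK characters in Python 3)
                text = re.sub(r'[^\w\s\u4e00-\u9fff]', '', text)
                # Collapse runs of whitespace
                text = re.sub(r'\s+', ' ', text).strip()
                return text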

            def predict(self, text: str) -> List[Dict]:
                """
                预测文本中的医疗实体
                返回: [{'text': '糖尿病', 'type': 'DIS', 'start': 5, 'end': 8}, ...]
                """
                # 预处理
                text = self.preprocess_text(text)

                # Tokenize
                encoding = self.tokenizer(
                    text,
                    return_tensors='pt',
                    padding=True,
                    truncation=True,
                    max_length=512,
                    return_offsets_mapping=True
                )

                input_ids = encoding['input_ids'].to(self.device)
                attention_mask = encoding['attention_mask'].to(self.device)
                # Convert to plain Python ints so start/end can be sliced with and serialized
                offset_mapping = encoding['offset_mapping'][0].tolist()

                # 预测
                with torch.no_grad():
                    outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                    predictions = torch.argmax(outputs.logits, dim=-1)[0]

                # 解析实体
                entities = []
                current_entity = None

                for idx, (pred_id, (start, end)) in enumerate(zip(predictions, offset_mapping)):
                    if start == end:  # 特殊token
                        continue

                    label = self.id2label[pred_id.item()]

                    if label.startswith('B-'):
                        # 保存前一个实体
                        if current_entity:
                            entities.append(current_entity)

                        # 开始新实体
                        entity_type = label[2:]
                        current_entity = {
                            'text': text[start:end],
                            'type': entity_type,
                            'start': start,
                            'end': end
                        }

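                    elif (label.startswith('I-') and current_entity
                          and current_entity['type'] == label[2:]):
                        # Continue the current entity (the types must match)
                        current_entity['text'] = text[current_entity['start']:end]
                        current_entity['end'] = end

                    else:  # 'O', or an I- tag that does not continue the open entity
                        if current_entity:
                            entities.append(current_entity)
                            current_entity = None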

                # 添加最后一个实体
                if current_entity:
                    entities.append(current_entity)

                # 实体规范化
                entities = self.normalize_entities(entities)

                return entities

            def normalize_entities(self, entities: List[Dict]) -> List[Dict]:
                """实体规范化"""
                normalized = []

                for entity in entities:
                    # 查找标准名称
                    standard_name = self.entity_dict.get(entity['text'], entity['text'])

                    normalized.append({
                        'text': entity['text'],
                        'standard_name': standard_name,
                        'type': entity['type'],
                        'start': entity['start'],
                        'end': entity['end']
                    })

                return normalized

            def batch_predict(self, texts: List[str]) -> List[List[Dict]]:
                """批量预测"""
                return [self.predict(text) for text in texts]

            def extract_relations(self, text: str, entities: List[Dict]) -> List[Dict]:
                """
                提取实体间关系(简化版)
                如: 症状-疾病关系, 药物-疾病关系
                """
                relations = []

                # 简单规则:症状在疾病前,且距离较近
                for i, entity1 in enumerate(entities):
                    for entity2 in entities[i+1:]:
                        if entity1['type'] == 'SYM' and entity2['type'] == 'DIS':
                            if entity2['start'] - entity1['end'] < 10:  # 距离阈值
                                relations.append({
                                    'subject': entity1['text'],
                                    'relation': '症状-疾病',
                                    'object': entity2['text']
                                })

                        elif entity1['type'] == 'DRUG' and entity2['type'] == 'DIS':
                            if entity2['start'] - entity1['end'] < 15:
                                relations.append({
                                    'subject': entity1['text'],
                                    'relation': '治疗',
                                    'object': entity2['text']
                                })

                return relations

        # 使用示例
        # 配置文件示例
        config = {
            'bert_model': 'bert-base-chinese',
            'label2id': {
                'O': 0, 'B-DIS': 1, 'I-DIS': 2, 'B-SYM': 3, 'I-SYM': 4,
                'B-DRUG': 5, 'I-DRUG': 6, 'B-TEST': 7, 'I-TEST': 8
            }
        }

        # 保存配置
        with open('medical_ner_config.json', 'w', encoding='utf-8') as f:
            json.dump(config, f, ensure_ascii=False, indent=2)

        # 初始化系统
        # ner_system = MedicalNERSystem(
        #     model_path='best_medical_ner.pt',
        #     config_path='medical_ner_config.json'
        # )

        # 预测示例
        # text = "患者主诉头痛、发热,诊断为上呼吸道感染,给予阿莫西林治疗"
        # entities = ner_system.predict(text)
        # print("识别的实体:")
        # for entity in entities:
        #     print(f"  {entity['text']} ({entity['type']}) -> {entity['standard_name']}")

        # 提取关系
        # relations = ner_system.extract_relations(text, entities)
        # print("\n实体关系:")
        # for rel in relations:
        #     print(f"  {rel['subject']} --{rel['relation']}--> {rel['object']}")
        ---
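    c.Supplementary sketch
        preprocess_text() removes punctuation, so the start/end offsets returned by predict() index the cleaned text rather than the text the caller passed in. One possible way to report offsets in the original text is to record, for every kept character, where it came from. The sketch below is an assumption-laden illustration (the names clean_with_offsets and to_original_span are hypothetical, and whitespace collapsing is omitted), not part of MedicalNERSystem.

```python
import re
from typing import List, Tuple

# Same idea as preprocess_text: drop anything that is neither a word character nor whitespace
_DROP = re.compile(r'[^\w\s]')


def clean_with_offsets(text: str) -> Tuple[str, List[int]]:
    """Return the cleaned text and, for each cleaned character, its index in `text`."""
    kept_chars, origin = [], []
    for i, ch in enumerate(text):
        if _DROP.match(ch):
            continue
        kept_chars.append(ch)
        origin.append(i)
    return ''.join(kept_chars), origin


def to_original_span(start: int, end: int, origin: List[int]) -> Tuple[int, int]:
    """Map a non-empty [start, end) span in the cleaned text back to the original text."""
    return origin[start], origin[end - 1] + 1


original = "患者主诉头痛、发热,诊断为上呼吸道感染"
cleaned, origin = clean_with_offsets(original)

# Suppose the model tagged "发热" in the cleaned text
start = cleaned.index("发热")
end = start + len("发热")
o_start, o_end = to_original_span(start, end, origin)

print(cleaned[start:end], (start, end), '->', original[o_start:o_end], (o_start, o_end))
```

        An alternative that avoids the mapping altogether is to keep punctuation in the model input, since punctuation often marks entity boundaries in clinical text.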

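02.NER Model Deployment and API Service
    a.Description
        Deployment has to balance inference speed, concurrency, and resource usage; the model is usually exposed as a REST API built with FastAPI or Flask.
        Model optimization techniques such as quantization, distillation, and pruning reduce model size and speed up inference.
        A PyTorch model can be exported to the ONNX format (for example with torch.onnx.export) and served with ONNX Runtime for cross-platform deployment and faster inference.
        Batched inference raises throughput at the cost of latency; latency-sensitive real-time services typically use batch_size=1.
        A cache can store results for frequent queries to avoid repeated computation, typically in an in-memory store such as Redis.
        Metrics to monitor include QPS (queries per second), mean response time, P99 latency, and GPU utilization, commonly collected with Prometheus and visualized in Grafana.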
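        As a rough illustration of the monitoring metrics, the sketch below computes QPS, mean latency, and P99 latency from a list of recorded request durations. The nearest-rank percentile definition, the function names, and the sample latencies are assumptions chosen for the example; a production setup would normally rely on the histogram types of the monitoring system instead.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: List[float], window_s: float) -> dict:
    """Summarize request latencies (in seconds) observed during a time window."""
    if not latencies_s:
        raise ValueError("no samples")
    return {
        'qps': len(latencies_s) / window_s,
        'mean_ms': 1000 * sum(latencies_s) / len(latencies_s),
        'p99_ms': 1000 * percentile(latencies_s, 99),
    }


# Hypothetical window of 10 s: 98 fast requests and 2 slow outliers
latencies = [0.020] * 98 + [0.500] * 2
stats = summarize(latencies, window_s=10.0)
print(stats)
```

        In this sample the mean stays under 30 ms while P99 is 500 ms, which is why tail latency is tracked separately from the average.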
    b.Code example
        ---
        # NER model API service deployment
        from fastapi import FastAPI, HTTPException
        from pydantic import BaseModel
        from typing import List, Dict, Optional
        import torch
        import uvicorn
        from functools import lru_cache
        import time
        import logging

        # 配置日志
        logging.basicConfig(level=logging.INFO)
        logger = logging.getLogger(__name__)

        # 创建FastAPI应用
        app = FastAPI(title="Medical NER API", version="1.0.0")

        # 请求模型
        class NERRequest(BaseModel):
            text: str
            normalize: bool = True  # 是否进行实体规范化

        class BatchNERRequest(BaseModel):
            texts: List[str]
            normalize: bool = True

        # 响应模型
        class Entity(BaseModel):
            text: str
            type: str
            start: int
            end: int
            standard_name: Optional[str] = None
            confidence: Optional[float] = None

        class NERResponse(BaseModel):
            entities: List[Entity]
            processing_time: float

        class BatchNERResponse(BaseModel):
            results: List[List[Entity]]
            processing_time: float

        # 全局模型实例(启动时加载)
        ner_model = None

        @app.on_event("startup")
        async def load_model():
            """启动时加载模型"""
            global ner_model
            logger.info("加载NER模型...")

            try:
                # 这里使用前面定义的MedicalNERSystem
                # ner_model = MedicalNERSystem(
                #     model_path='best_medical_ner.pt',
                #     config_path='medical_ner_config.json'
                # )
                logger.info("模型加载成功!")
            except Exception as e:
                logger.error(f"模型加载失败: {e}")
                raise

        @app.get("/")
        async def root():
            """健康检查"""
            return {"status": "healthy", "service": "Medical NER API"}

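        @app.post("/predict", response_model=NERResponse)
        def predict_entities(request: NERRequest):
            """
            Single-text entity recognition.
            Declared with plain `def` so FastAPI runs the blocking model call
            in its thread pool instead of stalling the event loop.
            """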
            if not ner_model:
                raise HTTPException(status_code=503, detail="模型未加载")

            if not request.text or len(request.text) > 5000:
                raise HTTPException(status_code=400, detail="文本长度必须在1-5000字符之间")

            try:
                start_time = time.time()

                # 预测
                entities = ner_model.predict(request.text)

                # 转换为响应格式
                entity_list = [
                    Entity(
                        text=e['text'],
                        type=e['type'],
                        start=e['start'],
                        end=e['end'],
                        standard_name=e.get('standard_name') if request.normalize else None
                    )
                    for e in entities
                ]

                processing_time = time.time() - start_time

                logger.info(f"处理文本长度: {len(request.text)}, "
                           f"识别实体数: {len(entity_list)}, "
                           f"耗时: {processing_time:.3f}s")

                return NERResponse(
                    entities=entity_list,
                    processing_time=processing_time
                )

            except Exception as e:
                logger.error(f"预测失败: {e}")
                raise HTTPException(status_code=500, detail=str(e))

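        @app.post("/batch_predict", response_model=BatchNERResponse)
        def batch_predict_entities(request: BatchNERRequest):
            """
            Batch entity recognition.
            Declared with plain `def` so FastAPI runs the blocking model call
            in its thread pool instead of stalling the event loop.
            """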
            if not ner_model:
                raise HTTPException(status_code=503, detail="模型未加载")

            if not request.texts or len(request.texts) > 100:
                raise HTTPException(status_code=400, detail="批量大小必须在1-100之间")

            try:
                start_time = time.time()

                # 批量预测
                all_entities = ner_model.batch_predict(request.texts)

                # 转换为响应格式
                results = []
                for entities in all_entities:
                    entity_list = [
                        Entity(
                            text=e['text'],
                            type=e['type'],
                            start=e['start'],
                            end=e['end'],
                            standard_name=e.get('standard_name') if request.normalize else None
                        )
                        for e in entities
                    ]
                    results.append(entity_list)

                processing_time = time.time() - start_time

                logger.info(f"批量处理: {len(request.texts)}个文本, "
                           f"耗时: {processing_time:.3f}s")

                return BatchNERResponse(
                    results=results,
                    processing_time=processing_time
                )

            except Exception as e:
                logger.error(f"批量预测失败: {e}")
                raise HTTPException(status_code=500, detail=str(e))

        @app.get("/stats")
        async def get_stats():
            """获取服务统计信息"""
            return {
                "model_loaded": ner_model is not None,
                "device": "cuda" if torch.cuda.is_available() else "cpu",
                "gpu_available": torch.cuda.is_available()
            }

        # 启动服务
        if __name__ == "__main__":
            uvicorn.run(
                app,
                host="0.0.0.0",
                port=8000,
                workers=1,  # 单worker避免多次加载模型
                log_level="info"
            )

        # 客户端调用示例
        """
        import requests

        # 单文本预测
        response = requests.post(
            "http://localhost:8000/predict",
            json={"text": "患者主诉头痛、发热", "normalize": True}
        )
        print(response.json())

        # 批量预测
        response = requests.post(
            "http://localhost:8000/batch_predict",
            json={
                "texts": ["患者主诉头痛", "诊断为糖尿病"],
                "normalize": True
            }
        )
        print(response.json())
        """
        ---
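    c.Supplementary sketch
        The description mentions caching frequent queries, and the listing imports lru_cache without using it. As a minimal in-process stand-in for a Redis cache, the sketch below wraps any text-to-entities function with functools.lru_cache. The wrapper name make_cached_predictor and the fake_predict stub are hypothetical; in the service, ner_model.predict would take the place of the stub.

```python
from functools import lru_cache
from typing import Callable, Dict, List


def make_cached_predictor(predict_fn: Callable[[str], List[Dict]], maxsize: int = 1024):
    """Wrap a text -> entities function with an in-process LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(text: str):
        # Store an immutable snapshot so callers cannot mutate cached results
        return tuple(tuple(sorted(e.items())) for e in predict_fn(text))

    def predict(text: str) -> List[Dict]:
        # Rebuild fresh dicts on every call
        return [dict(items) for items in _cached(text)]

    predict.cache_info = _cached.cache_info
    return predict


calls = []


def fake_predict(text: str) -> List[Dict]:
    """Stub standing in for the real model; records each real invocation."""
    calls.append(text)
    return [{'text': text[:2], 'type': 'SYM', 'start': 0, 'end': 2}]


cached_predict = make_cached_predictor(fake_predict, maxsize=2)
first = cached_predict("头痛三天")
second = cached_predict("头痛三天")  # served from the cache

print(f"model calls: {len(calls)}, results equal: {first == second}")
```

        An in-process cache is private to one worker; a shared store such as Redis is the usual choice once several workers or machines serve the same model.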

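3.6 Evaluation and Optimization

01.NER Model Evaluation Methods
    a.Description
        Entity-level evaluation counts a prediction as correct only if both the entity type and the boundaries match exactly; the seqeval library computes precision, recall, and F1 on this basis.
        Per-type evaluation reports results for each entity type and shows which types the model handles poorly.
        A confusion matrix shows how often one entity type is mistaken for another, such as PER predicted as ORG, which helps reveal error patterns.
        Error analysis distinguishes four categories: boundary errors (entity found, boundaries wrong), type errors (boundaries right, type wrong), missed entities (false negatives), and spurious entities (false positives).
        Cross-domain evaluation measures generalization to other domains, for example how a general-purpose NER model performs on medical or financial text.
        Manual evaluation samples model predictions for inspection, with particular attention to high-confidence errors and low-confidence correct predictions.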
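        To show what exact-match entity-level scoring means, here is a small dependency-free sketch that extracts spans from BIO sequences and computes precision, recall, and F1. The function names extract_spans and entity_prf are illustrative, and stray I- tags without a preceding B- are simply ignored here, which may differ from how seqeval treats ill-formed sequences; for reported results, use seqeval.

```python
from typing import List, Set, Tuple


def extract_spans(labels: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, type) spans from one BIO label sequence (end is exclusive)."""
    spans, start, etype = set(), None, None
    for i, label in enumerate(labels + ['O']):  # the sentinel flushes the last span
        inside = label.startswith('I-') and etype == label[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if label.startswith('B-'):
            start, etype = i, label[2:]
    return spans


def entity_prf(y_true: List[List[str]], y_pred: List[List[str]]):
    """Micro-averaged precision, recall, and F1 over exactly matching spans."""
    tp = n_true = n_pred = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        true_spans, pred_spans = extract_spans(true_seq), extract_spans(pred_seq)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC']]
y_pred = [['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O']]  # LOC boundary is too short

print(entity_prf(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

        The truncated LOC span counts as both a false positive and a false negative, so five of six tokens are correct while entity-level F1 is only 0.5.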
    b.Code example
        ---
        # NER model evaluation utilities
        from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
        from seqeval.metrics import accuracy_score
        from seqeval.scheme import IOB2
        from typing import List, Dict, Tuple
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns
        from collections import defaultdict, Counter

        class NERModelEvaluator:
            """NER模型评估器"""

            def __init__(self, id2label: Dict[int, str]):
                self.id2label = id2label
                self.label2id = {v: k for k, v in id2label.items()}

            def evaluate_predictions(self,
                                    y_true: List[List[str]],
                                    y_pred: List[List[str]]) -> Dict:
                """
                全面评估NER预测结果
                y_true: 真实标签序列列表
                y_pred: 预测标签序列列表
                """
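                # Core metrics. seqeval applies `scheme` only when mode='strict'.
                precision = precision_score(y_true, y_pred, mode='strict', scheme=IOB2)
                recall = recall_score(y_true, y_pred, mode='strict', scheme=IOB2)
                f1 = f1_score(y_true, y_pred, mode='strict', scheme=IOB2)
                accuracy = accuracy_score(y_true, y_pred)  # token-level accuracy

                # Per-type report
                report = classification_report(y_true, y_pred, mode='strict', scheme=IOB2, digits=4)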

                # 错误分析
                error_analysis = self.analyze_errors(y_true, y_pred)

                # 混淆矩阵
                confusion = self.compute_confusion_matrix(y_true, y_pred)

                results = {
                    'precision': precision,
                    'recall': recall,
                    'f1': f1,
                    'accuracy': accuracy,
                    'report': report,
                    'error_analysis': error_analysis,
                    'confusion_matrix': confusion
                }

                return results

            def analyze_errors(self,
                              y_true: List[List[str]],
                              y_pred: List[List[str]]) -> Dict:
                """
                错误分析:分类错误类型
                """
                error_types = {
                    'boundary_error': 0,  # 边界错误
                    'type_error': 0,      # 类型错误
                    'false_positive': 0,  # 误检
                    'false_negative': 0   # 漏检
                }

                entity_errors = defaultdict(int)  # 每种实体类型的错误数

                for true_seq, pred_seq in zip(y_true, y_pred):
                    # 提取实体
                    true_entities = self._extract_entities(true_seq)
                    pred_entities = self._extract_entities(pred_seq)

                    # 转换为集合便于比较
                    true_set = set((e[0], e[1], e[2]) for e in true_entities)  # (start, end, type)
                    pred_set = set((e[0], e[1], e[2]) for e in pred_entities)

                    # 完全匹配的实体
                    correct = true_set & pred_set

                    # 漏检
                    false_negatives = true_set - pred_set
                    error_types['false_negative'] += len(false_negatives)
                    for _, _, entity_type in false_negatives:
                        entity_errors[f'{entity_type}_FN'] += 1

                    # 误检
                    false_positives = pred_set - true_set
                    error_types['false_positive'] += len(false_positives)
                    for _, _, entity_type in false_positives:
                        entity_errors[f'{entity_type}_FP'] += 1

                    # 分析边界错误和类型错误
                    for pred_entity in pred_entities:
                        pred_start, pred_end, pred_type = pred_entity

                        # 查找重叠的真实实体
                        for true_entity in true_entities:
                            true_start, true_end, true_type = true_entity

                            # 有重叠
                            if not (pred_end <= true_start or pred_start >= true_end):
                                if pred_type == true_type and (pred_start != true_start or pred_end != true_end):
                                    # 类型正确但边界错误
                                    error_types['boundary_error'] += 1
                                    entity_errors[f'{pred_type}_boundary'] += 1
                                elif pred_start == true_start and pred_end == true_end and pred_type != true_type:
                                    # 边界正确但类型错误
                                    error_types['type_error'] += 1
                                    entity_errors[f'{true_type}->{pred_type}'] += 1

                return {
                    'error_types': error_types,
                    'entity_errors': dict(entity_errors)
                }

            def _extract_entities(self, labels: List[str]) -> List[Tuple[int, int, str]]:
                """从标签序列提取实体"""
                entities = []
                current_entity = None

                for i, label in enumerate(labels):
                    if label.startswith('B-'):
                        if current_entity:
                            entities.append(current_entity)
                        entity_type = label[2:]
                        current_entity = (i, i + 1, entity_type)

                    elif label.startswith('I-') and current_entity:
                        current_entity = (current_entity[0], i + 1, current_entity[2])

                    else:  # 'O'
                        if current_entity:
                            entities.append(current_entity)
                            current_entity = None

                if current_entity:
                    entities.append(current_entity)

                return entities

            def compute_confusion_matrix(self,
                                        y_true: List[List[str]],
                                        y_pred: List[List[str]]) -> np.ndarray:
                """计算实体类型混淆矩阵"""
                # 获取所有实体类型
                entity_types = set()
                for seq in y_true + y_pred:
                    for label in seq:
                        if label != 'O':
                            # Split on the first hyphen only, so type names containing '-' survive
                            entity_type = label.split('-', 1)[1]
                            entity_types.add(entity_type)

                entity_types = sorted(list(entity_types))
                type_to_idx = {t: i for i, t in enumerate(entity_types)}

                # 初始化混淆矩阵
                n = len(entity_types)
                confusion = np.zeros((n, n), dtype=int)

                # 填充混淆矩阵
                for true_seq, pred_seq in zip(y_true, y_pred):
                    true_entities = self._extract_entities(true_seq)
                    pred_entities = self._extract_entities(pred_seq)

                    # 匹配实体(基于位置重叠)
                    for true_entity in true_entities:
                        true_start, true_end, true_type = true_entity
                        matched = False

                        for pred_entity in pred_entities:
                            pred_start, pred_end, pred_type = pred_entity

                            # 检查重叠
                            if not (pred_end <= true_start or pred_start >= true_end):
                                confusion[type_to_idx[true_type], type_to_idx[pred_type]] += 1
                                matched = True
                                break

                        if not matched:
                            # 漏检(可以添加一个"MISS"类别)
                            pass

                return confusion

            def print_evaluation_results(self, results: Dict):
                """打印评估结果"""
                print("=" * 60)
                print("NER模型评估结果")
                print("=" * 60)

                print(f"\n总体指标:")
                print(f"  Precision: {results['precision']:.4f}")
                print(f"  Recall:    {results['recall']:.4f}")
                print(f"  F1 Score:  {results['f1']:.4f}")
                print(f"  Accuracy:  {results['accuracy']:.4f}")

                print(f"\n详细报告:")
                print(results['report'])

                print(f"\n错误分析:")
                error_types = results['error_analysis']['error_types']
                print(f"  边界错误: {error_types['boundary_error']}")
                print(f"  类型错误: {error_types['type_error']}")
                print(f"  误检(FP): {error_types['false_positive']}")
                print(f"  漏检(FN): {error_types['false_negative']}")

                print(f"\n实体级错误分布:")
                entity_errors = results['error_analysis']['entity_errors']
                for error_type, count in sorted(entity_errors.items(), key=lambda x: x[1], reverse=True)[:10]:
                    print(f"  {error_type}: {count}")

        # 使用示例
        evaluator = NERModelEvaluator(id2label={0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-LOC', 4: 'I-LOC'})

        # 示例数据
        y_true = [
            ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC'],
            ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O']
        ]

        y_pred = [
            ['O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'O'],  # LOC边界错误
            ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O']  # ORG识别为LOC
        ]

        # 评估
        results = evaluator.evaluate_predictions(y_true, y_pred)
        evaluator.print_evaluation_results(results)
        ---

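02.Model optimization strategies
    a.Description
        Data augmentation expands the training set with methods such as synonym replacement, back translation, entity replacement, and context expansion.
        Active learning selects the samples the model is least certain about for human annotation, so comparable quality can be reached with fewer labeled examples.
        Adversarial training adds perturbations at the input embedding layer, which improves robustness to small input changes, including adversarial examples.
        Multi-task learning trains NER jointly with related tasks such as part-of-speech tagging and dependency parsing, sharing the lower-level representation.
        Model ensembling combines several models by voting or weighted averaging; it can raise F1 by roughly 1-2 points, but inference cost grows with the number of models.
        Domain adaptation pre-trains on source-domain data and fine-tunes on target-domain data, or uses adversarial domain adaptation techniques.
    b.Code example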
        ---
        # NER模型优化实现
        import torch
        import torch.nn as nn
        import random
        from typing import List, Tuple
        import numpy as np

        class NERDataAugmenter:
            """NER数据增强器"""

            def __init__(self, entity_dict: dict):
                """
                entity_dict: {entity_type: [entity_examples]}
                如 {'PER': ['李明', '王华'], 'LOC': ['北京', '上海']}
                """
                self.entity_dict = entity_dict

            def entity_replacement(self,
                                  tokens: List[str],
                                  labels: List[str],
                                  replace_prob: float = 0.3) -> Tuple[List[str], List[str]]:
                """
                Entity replacement augmentation: replace an entity with another entity of the same type
                """
                # Build the output incrementally so that indices into the original
                # sequence stay valid even when a replacement has a different length
                new_tokens = []
                new_labels = []

                i = 0
                while i < len(labels):
                    if labels[i].startswith('B-'):
                        entity_type = labels[i][2:]

                        # Find the end of the full entity span
                        entity_end = i + 1
                        while entity_end < len(labels) and labels[entity_end] == f'I-{entity_type}':
                            entity_end += 1

                        # Replace with probability replace_prob
                        if random.random() < replace_prob and entity_type in self.entity_dict:
                            # Randomly pick an entity of the same type
                            replacement = random.choice(self.entity_dict[entity_type])

                            # Character-level tokens and BIO labels for the replacement
                            new_entity_tokens = list(replacement)
                            new_tokens.extend(new_entity_tokens)
                            new_labels.extend([f'B-{entity_type}'] + [f'I-{entity_type}'] * (len(new_entity_tokens) - 1))
                        else:
                            # Keep the original entity unchanged
                            new_tokens.extend(tokens[i:entity_end])
                            new_labels.extend(labels[i:entity_end])

                        i = entity_end
                    else:
                        new_tokens.append(tokens[i])
                        new_labels.append(labels[i])
                        i += 1

                return new_tokens, new_labels

            def context_augmentation(self,
                                    tokens: List[str],
                                    labels: List[str],
                                    context_templates: List[str]) -> List[Tuple[List[str], List[str]]]:
                """
                上下文增强:将实体放入不同的上下文模板
                """
                augmented_samples = []

                # 提取实体
                entities = []
                i = 0
                while i < len(labels):
                    if labels[i].startswith('B-'):
                        entity_type = labels[i][2:]
                        entity_start = i
                        entity_end = i + 1
                        while entity_end < len(labels) and labels[entity_end] == f'I-{entity_type}':
                            entity_end += 1

                        entity_text = ''.join(tokens[entity_start:entity_end])
                        entities.append((entity_text, entity_type))
                        i = entity_end
                    else:
                        i += 1

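                # Apply templates (each template is assumed to contain one placeholder)
                placeholder = '{ENTITY}'
                for template in context_templates:
                    if placeholder not in template:
                        continue  # nothing to substitute, skip this template

                    # Locate the placeholder before substitution, so an entity string that
                    # also occurs elsewhere in the template cannot shift the labels
                    entity_start = template.index(placeholder)

                    for entity_text, entity_type in entities:
                        new_text = template.replace(placeholder, entity_text)
                        new_tokens = list(new_text)

                        # Generate character-level BIO labels
                        new_labels = ['O'] * len(new_tokens)
                        new_labels[entity_start] = f'B-{entity_type}'
                        for j in range(entity_start + 1, entity_start + len(entity_text)):
                            new_labels[j] = f'I-{entity_type}'

                        augmented_samples.append((new_tokens, new_labels))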

                return augmented_samples

        class AdversarialTraining:
            """对抗训练"""

            @staticmethod
            def fgm_attack(model, embeddings, epsilon=1.0):
                """
                Fast Gradient Method对抗训练
                在嵌入层添加扰动
                """
                # 计算嵌入的梯度
                embeddings_grad = embeddings.grad

                if embeddings_grad is not None:
                    # 计算扰动
                    perturbation = epsilon * embeddings_grad / (torch.norm(embeddings_grad) + 1e-8)

                    # 添加扰动
                    adversarial_embeddings = embeddings + perturbation

                    return adversarial_embeddings
                else:
                    return embeddings

            @staticmethod
            def pgd_attack(model, embeddings, epsilon=1.0, alpha=0.3, num_steps=3, loss_fn=None):
                """
                Projected Gradient Descent (PGD) adversarial training:
                adds the perturbation over several iterations.
                loss_fn: callable that maps perturbed embeddings to a scalar loss;
                         when None, model(adversarial_embeddings) is assumed to return that loss
                """
                if loss_fn is None:
                    loss_fn = model

                original = embeddings.detach()
                adversarial_embeddings = original.clone()

                for _ in range(num_steps):
                    adversarial_embeddings.requires_grad_(True)

                    # Forward pass and gradient of the loss w.r.t. the perturbed embeddings
                    loss = loss_fn(adversarial_embeddings)
                    grad, = torch.autograd.grad(loss, adversarial_embeddings)

                    # Signed-gradient ascent step
                    adversarial_embeddings = adversarial_embeddings.detach() + alpha * grad.sign()

                    # Project back into the L-infinity ball of radius epsilon
                    perturbation = torch.clamp(adversarial_embeddings - original, -epsilon, epsilon)
                    adversarial_embeddings = original + perturbation

                return adversarial_embeddings

        class ModelEnsemble:
            """模型集成"""

            def __init__(self, models: List[nn.Module]):
                self.models = models

            def predict_voting(self, input_ids, attention_mask):
                """
                投票集成:多数投票
                """
                all_predictions = []

                for model in self.models:
                    model.eval()
                    with torch.no_grad():
                        outputs = model(input_ids, attention_mask)
                        predictions = torch.argmax(outputs['logits'], dim=-1)
                        all_predictions.append(predictions)

                # 投票
                stacked_predictions = torch.stack(all_predictions, dim=0)
                final_predictions = torch.mode(stacked_predictions, dim=0).values

                return final_predictions

            def predict_weighted(self, input_ids, attention_mask, weights: List[float]):
                """
                加权集成:根据模型权重加权平均logits
                """
                weighted_logits = None

                for model, weight in zip(self.models, weights):
                    model.eval()
                    with torch.no_grad():
                        outputs = model(input_ids, attention_mask)
                        logits = outputs['logits']

                        if weighted_logits is None:
                            weighted_logits = weight * logits
                        else:
                            weighted_logits += weight * logits

                # 归一化
                weighted_logits = weighted_logits / sum(weights)

                # 预测
                final_predictions = torch.argmax(weighted_logits, dim=-1)

                return final_predictions

        # 使用示例
        # 数据增强
        entity_dict = {
            'PER': ['李明', '王华', '张三', '刘德华'],
            'LOC': ['北京', '上海', '杭州', '深圳'],
            'ORG': ['阿里巴巴', '腾讯', '华为', '百度']
        }

        augmenter = NERDataAugmenter(entity_dict)

        tokens = ['李', '明', '在', '北', '京', '工', '作']
        labels = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O', 'O']

        # 实体替换
        new_tokens, new_labels = augmenter.entity_replacement(tokens, labels, replace_prob=0.5)
        print("实体替换后:", ''.join(new_tokens), new_labels)

        # 上下文增强
        templates = ['{ENTITY}是一个好地方', '我喜欢{ENTITY}', '{ENTITY}很有名']
        augmented = augmenter.context_augmentation(tokens, labels, templates)
        print(f"\n生成了{len(augmented)}个增强样本")
        ---
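        Active learning is described above but does not appear in the code. The following is a minimal sketch of entropy-based uncertainty sampling, assuming the NER model can return a probability distribution over labels for every token. The names `mean_token_entropy` and `select_uncertain_samples` are hypothetical helpers, not part of the course code, and mean token entropy is only one of several possible uncertainty scores.

```python
import math
from typing import List


def mean_token_entropy(token_probs: List[List[float]]) -> float:
    """Average Shannon entropy over the tokens of one sentence.

    token_probs[t][c] is the predicted probability of label c at token t.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_probs
    ]
    return sum(entropies) / max(len(entropies), 1)


def select_uncertain_samples(pool_probs: List[List[List[float]]], k: int) -> List[int]:
    """Return the indices of the k sentences with the highest mean token entropy."""
    scores = [mean_token_entropy(sentence) for sentence in pool_probs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy unlabeled pool: three sentences, two tokens each, three labels
pool_probs = [
    [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],   # confident predictions
    [[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]],   # near-uniform, very uncertain
    [[0.70, 0.20, 0.10], [0.90, 0.05, 0.05]],   # in between
]
picked = select_uncertain_samples(pool_probs, k=2)
print(picked)  # [1, 2]
```

        The selected sentences would be sent to annotators, added to the training set, and the model retrained before the next selection round.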

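4 Machine Translation

4.1 Translation Task Overview

01.Machine translation basics
    a.Description
        Machine translation (MT) automatically converts text in a source language into a target language and is one of the most challenging tasks in NLP.
        Translation quality is measured with automatic metrics such as BLEU, METEOR, and TER, and with human evaluation along dimensions such as fluency, adequacy, and fidelity.
        Translation types include text translation, speech translation, image translation, and simultaneous interpretation; application scenarios include cross-border e-commerce, international conferences, and document translation.
        Mainstream methods have evolved from statistical machine translation (SMT) to neural machine translation (NMT); NMT uses end-to-end deep learning models and significantly improves translation quality.
        Challenges include polysemy, word-order differences, cultural differences, domain terminology, long sentences, and low-resource languages.
        Datasets such as WMT, IWSLT, and the UN Parallel Corpus provide multilingual parallel corpora; Chinese datasets include AI Challenger and CWMT.
    b.Code example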
        ---
        # 机器翻译数据处理与评估
        import torch
        from typing import List, Tuple
        from collections import Counter
        import re
        from sacrebleu import corpus_bleu
        import jieba

        class TranslationDataProcessor:
            """翻译数据处理器"""

            def __init__(self, src_lang: str = 'zh', tgt_lang: str = 'en'):
                self.src_lang = src_lang
                self.tgt_lang = tgt_lang
                self.src_vocab = {}
                self.tgt_vocab = {}

            def load_parallel_corpus(self, src_file: str, tgt_file: str) -> List[Tuple[str, str]]:
                """
                加载平行语料
                src_file: 源语言文件
                tgt_file: 目标语言文件
                """
                parallel_corpus = []

                with open(src_file, 'r', encoding='utf-8') as f_src, \
                     open(tgt_file, 'r', encoding='utf-8') as f_tgt:

                    for src_line, tgt_line in zip(f_src, f_tgt):
                        src_line = src_line.strip()
                        tgt_line = tgt_line.strip()

                        if src_line and tgt_line:
                            parallel_corpus.append((src_line, tgt_line))

                print(f"加载了 {len(parallel_corpus)} 对平行句子")
                return parallel_corpus

            def tokenize(self, text: str, lang: str) -> List[str]:
                """分词"""
                if lang == 'zh':
                    # 中文分词
                    return list(jieba.cut(text))
                else:
                    # 英文分词(简单空格分割)
                    return text.lower().split()

            def build_vocab(self, corpus: List[Tuple[str, str]], min_freq: int = 2):
                """构建词汇表"""
                src_counter = Counter()
                tgt_counter = Counter()

                for src_text, tgt_text in corpus:
                    src_tokens = self.tokenize(src_text, self.src_lang)
                    tgt_tokens = self.tokenize(tgt_text, self.tgt_lang)

                    src_counter.update(src_tokens)
                    tgt_counter.update(tgt_tokens)

                # 构建源语言词汇表
                self.src_vocab = {'<PAD>': 0, '<UNK>': 1, '<SOS>': 2, '<EOS>': 3}
                for word, freq in src_counter.items():
                    if freq >= min_freq:
                        self.src_vocab[word] = len(self.src_vocab)

                # 构建目标语言词汇表
                self.tgt_vocab = {'<PAD>': 0, '<UNK>': 1, '<SOS>': 2, '<EOS>': 3}
                for word, freq in tgt_counter.items():
                    if freq >= min_freq:
                        self.tgt_vocab[word] = len(self.tgt_vocab)

                print(f"源语言词汇量: {len(self.src_vocab)}")
                print(f"目标语言词汇量: {len(self.tgt_vocab)}")

            def encode_sentence(self, text: str, lang: str, max_len: int = 50) -> List[int]:
                """将句子编码为索引序列"""
                vocab = self.src_vocab if lang == self.src_lang else self.tgt_vocab
                tokens = self.tokenize(text, lang)

                # 添加特殊标记
                indices = [vocab['<SOS>']]
                for token in tokens[:max_len-2]:
                    indices.append(vocab.get(token, vocab['<UNK>']))
                indices.append(vocab['<EOS>'])

                return indices

            def decode_sentence(self, indices: List[int], lang: str) -> str:
                """将索引序列解码为句子"""
                vocab = self.src_vocab if lang == self.src_lang else self.tgt_vocab
                id2word = {v: k for k, v in vocab.items()}

                tokens = []
                for idx in indices:
                    word = id2word.get(idx, '<UNK>')
                    if word in ['<SOS>', '<EOS>', '<PAD>']:
                        continue
                    tokens.append(word)

                # 中文不需要空格连接
                if lang == 'zh':
                    return ''.join(tokens)
                else:
                    return ' '.join(tokens)

        class BLEUEvaluator:
            """BLEU评估器"""

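            @staticmethod
            def compute_bleu(predictions: List[str],
                           references: List[List[str]],
                           max_order: int = 4) -> dict:
                """
                Compute corpus-level BLEU
                predictions: list of predicted translations
                references: per-prediction lists of reference translations (each prediction may have several)
                """
                from itertools import zip_longest

                # sacrebleu expects reference *streams*: ref_streams[k][i] is the k-th
                # reference for sentence i. Transpose the per-sentence lists and pad
                # missing references with None (variable reference counts need sacrebleu >= 2.0).
                ref_streams = [list(stream) for stream in zip_longest(*references)]
                bleu = corpus_bleu(predictions, ref_streams)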

                return {
                    'bleu': bleu.score,
                    'precisions': bleu.precisions,
                    'bp': bleu.bp,  # brevity penalty
                    'sys_len': bleu.sys_len,
                    'ref_len': bleu.ref_len
                }

            @staticmethod
            def compute_sentence_bleu(prediction: str, references: List[str]) -> float:
                """计算单句BLEU"""
                from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

                # 分词
                pred_tokens = prediction.split()
                ref_tokens_list = [ref.split() for ref in references]

                # 使用平滑函数避免0分
                smoothing = SmoothingFunction().method1

                bleu = sentence_bleu(ref_tokens_list, pred_tokens, smoothing_function=smoothing)

                return bleu

        # 使用示例
        processor = TranslationDataProcessor(src_lang='zh', tgt_lang='en')

        # 示例平行语料
        corpus = [
            ("我爱北京天安门", "I love Tiananmen Square in Beijing"),
            ("机器翻译是人工智能的重要应用", "Machine translation is an important application of AI"),
            ("今天天气很好", "The weather is nice today")
        ]

        # 构建词汇表
        processor.build_vocab(corpus, min_freq=1)

        # 编码句子
        src_text = "我爱北京天安门"
        src_indices = processor.encode_sentence(src_text, 'zh')
        print(f"编码: {src_text} -> {src_indices}")

        # 解码句子
        decoded_text = processor.decode_sentence(src_indices, 'zh')
        print(f"解码: {src_indices} -> {decoded_text}")

        # BLEU评估
        evaluator = BLEUEvaluator()
        predictions = ["I love Beijing", "Machine translation is important"]
        references = [
            ["I love Tiananmen Square in Beijing", "I love Beijing Tiananmen"],
            ["Machine translation is an important application of AI"]
        ]

        bleu_scores = evaluator.compute_bleu(predictions, references)
        print(f"\nBLEU分数: {bleu_scores['bleu']:.2f}")
        print(f"精确度: {bleu_scores['precisions']}")
        ---
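        `encode_sentence` returns index lists of different lengths, which have to be padded to a common length before they can be batched for a model. A minimal sketch, assuming the PAD token has index 0 as in the vocabulary built above; `pad_batch` is a hypothetical helper, not part of the processor class.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_idx: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad index sequences to the longest one and build a 1/0 padding mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Two encoded sentences: 2 = SOS, 3 = EOS, other indices are ordinary words
padded, mask = pad_batch([[2, 5, 6, 3], [2, 7, 3]])
print(padded)  # [[2, 5, 6, 3], [2, 7, 3, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

        Both results can be passed to `torch.tensor(...)` to obtain the `[batch_size, seq_len]` tensors that the models in the following sections expect; the mask lets attention and the loss ignore padded positions.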

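02.Translation quality metrics
    a.Description
        BLEU (Bilingual Evaluation Understudy) is the most widely used automatic metric; it combines n-gram precision with a brevity penalty.
        BLEU-4 is the geometric mean of the modified 1-gram to 4-gram precisions multiplied by the brevity penalty, reported on a 0-100 scale. Scores are only comparable on the same test set with the same tokenization and references; as a rough rule of thumb, 40+ is considered good and 50+ excellent.
        METEOR takes synonyms, stems, and paraphrases into account and generally correlates better with human judgment than BLEU.
        TER (Translation Edit Rate) is the minimum number of edits needed to turn the prediction into the reference, normalized by the reference length; lower is better.
        chrF is based on character-level n-grams, which makes it better suited to morphologically rich languages and to languages such as Chinese.
        Human evaluation covers three dimensions, fluency, adequacy, and fidelity, each scored on a 5-point scale.
    b.Code example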
        ---
        # 多种翻译评估指标实现
        from sacrebleu.metrics import BLEU, CHRF, TER
        from typing import List, Dict
        import numpy as np

        class TranslationEvaluator:
            """翻译评估器"""

            def __init__(self):
                self.bleu = BLEU()
                self.chrf = CHRF()
                self.ter = TER()

            def evaluate_all(self,
                           predictions: List[str],
                           references: List[List[str]]) -> Dict[str, float]:
                """
                计算所有评估指标
                predictions: 预测翻译列表
                references: 参考翻译列表(每个预测可以有多个参考)
                """
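                from itertools import zip_longest

                # sacrebleu expects reference streams: ref_streams[k][i] is the k-th reference
                # for sentence i. Transpose the per-sentence lists and pad missing references
                # with None (variable reference counts need sacrebleu >= 2.0).
                ref_streams = [list(stream) for stream in zip_longest(*references)]

                # BLEU
                bleu_score = self.bleu.corpus_score(predictions, ref_streams)

                # chrF
                chrf_score = self.chrf.corpus_score(predictions, ref_streams)

                # TER
                ter_score = self.ter.corpus_score(predictions, ref_streams)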

                results = {
                    'BLEU': bleu_score.score,
                    'chrF': chrf_score.score,
                    'TER': ter_score.score,
                    'BLEU_precisions': bleu_score.precisions,
                    'brevity_penalty': bleu_score.bp
                }

                return results

            def evaluate_by_length(self,
                                  predictions: List[str],
                                  references: List[List[str]],
                                  length_bins: List[int] = [10, 20, 30, 50]) -> Dict:
                """
                按句子长度分组评估
                """
                # 按长度分组
                length_groups = {f'0-{length_bins[0]}': ([], [])}
                for i in range(len(length_bins) - 1):
                    length_groups[f'{length_bins[i]}-{length_bins[i+1]}'] = ([], [])
                length_groups[f'{length_bins[-1]}+'] = ([], [])

                # 分组
                for pred, refs in zip(predictions, references):
                    pred_len = len(pred.split())

                    if pred_len < length_bins[0]:
                        group_key = f'0-{length_bins[0]}'
                    elif pred_len >= length_bins[-1]:
                        group_key = f'{length_bins[-1]}+'
                    else:
                        for i in range(len(length_bins) - 1):
                            if length_bins[i] <= pred_len < length_bins[i+1]:
                                group_key = f'{length_bins[i]}-{length_bins[i+1]}'
                                break

                    length_groups[group_key][0].append(pred)
                    length_groups[group_key][1].append(refs)

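                # Evaluate each group
                from itertools import zip_longest

                results = {}
                for group_name, (group_preds, group_refs) in length_groups.items():
                    if group_preds:
                        # Transpose per-sentence references into sacrebleu reference streams,
                        # padding missing references with None (sacrebleu >= 2.0)
                        ref_streams = [list(stream) for stream in zip_longest(*group_refs)]
                        bleu = self.bleu.corpus_score(group_preds, ref_streams)
                        results[group_name] = {
                            'count': len(group_preds),
                            'BLEU': bleu.score
                        }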

                return results

            def print_evaluation_results(self, results: Dict):
                """打印评估结果"""
                print("=" * 60)
                print("翻译质量评估结果")
                print("=" * 60)

                print(f"\n总体指标:")
                print(f"  BLEU:  {results['BLEU']:.2f}")
                print(f"  chrF:  {results['chrF']:.2f}")
                print(f"  TER:   {results['TER']:.2f}")

                print(f"\nBLEU详细:")
                print(f"  1-gram: {results['BLEU_precisions'][0]:.2f}")
                print(f"  2-gram: {results['BLEU_precisions'][1]:.2f}")
                print(f"  3-gram: {results['BLEU_precisions'][2]:.2f}")
                print(f"  4-gram: {results['BLEU_precisions'][3]:.2f}")
                print(f"  BP:     {results['brevity_penalty']:.4f}")

        # 使用示例
        evaluator = TranslationEvaluator()

        # 示例数据
        predictions = [
            "I love Beijing Tiananmen Square",
            "Machine translation is very important",
            "Today the weather is good"
        ]

        references = [
            ["I love Tiananmen Square in Beijing", "I love Beijing Tiananmen"],
            ["Machine translation is an important application of AI"],
            ["The weather is nice today", "Today's weather is very nice"]
        ]

        # 评估
        results = evaluator.evaluate_all(predictions, references)
        evaluator.print_evaluation_results(results)

        # 按长度评估
        length_results = evaluator.evaluate_by_length(predictions, references)
        print("\n按句子长度评估:")
        for group, metrics in length_results.items():
            print(f"  {group}: {metrics['count']}句, BLEU={metrics['BLEU']:.2f}")
        ---
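        To make the BLEU definition concrete, the sketch below computes an unsmoothed sentence-level BLEU by hand: clipped n-gram precision for orders 1 to 4, their geometric mean, and the brevity penalty. It is an illustration only, assuming whitespace tokenization and a score in [0, 1]; `simple_bleu` is a hypothetical function, and reported scores should still come from sacrebleu.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction: str, references: List[str], max_order: int = 4) -> float:
    """Unsmoothed sentence-level BLEU in [0, 1] on whitespace tokens."""
    pred = prediction.split()
    refs = [ref.split() for ref in references]
    if not pred:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        pred_counts = ngrams(pred, n)
        # Clip each n-gram count by its maximum count in any single reference
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the whole score
        log_precision_sum += math.log(clipped / total) / max_order

    # Brevity penalty against the reference length closest to the prediction length
    ref_len = min((len(ref) for ref in refs), key=lambda length: (abs(length - len(pred)), length))
    bp = 1.0 if len(pred) > ref_len else math.exp(1 - ref_len / len(pred))
    return bp * math.exp(log_precision_sum)


reference = ["the cat sat on the mat"]
print(simple_bleu("the cat sat on the mat", reference))  # exact match
print(simple_bleu("the cat sat on", reference))          # all n-grams match, but too short
print(simple_bleu("a dog ran away", reference))          # no overlap
```

        The second call shows why the brevity penalty exists: every n-gram of the truncated prediction matches the reference, so precision alone would give a perfect score, and only the length penalty exp(1 - 6/4) lowers it.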

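4.2 Seq2Seq Models

01.Seq2Seq architecture
    a.Description
        A Seq2Seq (Sequence-to-Sequence) model consists of an encoder and a decoder and is the basic architecture of neural machine translation.
        The encoder compresses the source sentence into a fixed-length context vector, and the decoder generates the target sentence from that vector.
        The encoder and decoder are usually LSTM or GRU recurrent networks, which handle variable-length sequences and retain long-range dependencies better than vanilla RNNs.
        Training uses teacher forcing: the ground-truth target token is fed as the next decoder input. This speeds up convergence but can cause exposure bias, because at inference time the decoder sees its own predictions instead.
        Inference generates the translation with greedy search, beam search, or sampling; beam search usually yields better translations than greedy search.
        Limitations of Seq2Seq include the information bottleneck of the fixed-length context vector, poor quality on long sentences, and the lack of an explicit alignment between source and target words.
    b.Code example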
        ---
        # Seq2Seq模型实现
        import torch
        import torch.nn as nn
        import torch.nn.functional as F
        from typing import Tuple
        import random

        class Encoder(nn.Module):
            """编码器"""

            def __init__(self, input_dim: int, emb_dim: int, hidden_dim: int,
                        n_layers: int = 1, dropout: float = 0.5):
                super().__init__()

                self.hidden_dim = hidden_dim
                self.n_layers = n_layers

                # 嵌入层
                self.embedding = nn.Embedding(input_dim, emb_dim)

                # LSTM层
                self.lstm = nn.LSTM(emb_dim, hidden_dim, n_layers,
                                   dropout=dropout if n_layers > 1 else 0,
                                   batch_first=True)

                self.dropout = nn.Dropout(dropout)

            def forward(self, src: torch.Tensor) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
                """
                前向传播
                src: [batch_size, src_len] 源语言句子
                返回: (outputs, (hidden, cell))
                """
                # 嵌入 [batch_size, src_len, emb_dim]
                embedded = self.dropout(self.embedding(src))

                # LSTM编码
                outputs, (hidden, cell) = self.lstm(embedded)

                # outputs: [batch_size, src_len, hidden_dim]
                # hidden: [n_layers, batch_size, hidden_dim]
                # cell: [n_layers, batch_size, hidden_dim]

                return outputs, (hidden, cell)

        class Decoder(nn.Module):
            """解码器"""

            def __init__(self, output_dim: int, emb_dim: int, hidden_dim: int,
                        n_layers: int = 1, dropout: float = 0.5):
                super().__init__()

                self.output_dim = output_dim
                self.hidden_dim = hidden_dim
                self.n_layers = n_layers

                # 嵌入层
                self.embedding = nn.Embedding(output_dim, emb_dim)

                # LSTM层
                self.lstm = nn.LSTM(emb_dim, hidden_dim, n_layers,
                                   dropout=dropout if n_layers > 1 else 0,
                                   batch_first=True)

                # 输出层
                self.fc_out = nn.Linear(hidden_dim, output_dim)

                self.dropout = nn.Dropout(dropout)

            def forward(self, input: torch.Tensor, hidden: torch.Tensor,
                       cell: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
                """
                前向传播(单步)
                input: [batch_size] 当前输入词
                hidden: [n_layers, batch_size, hidden_dim]
                cell: [n_layers, batch_size, hidden_dim]
                """
                # 添加序列维度 [batch_size, 1]
                input = input.unsqueeze(1)

                # 嵌入 [batch_size, 1, emb_dim]
                embedded = self.dropout(self.embedding(input))

                # LSTM解码
                output, (hidden, cell) = self.lstm(embedded, (hidden, cell))

                # output: [batch_size, 1, hidden_dim]

                # 预测 [batch_size, output_dim]
                prediction = self.fc_out(output.squeeze(1))

                return prediction, hidden, cell

        class Seq2Seq(nn.Module):
            """Seq2Seq模型"""

            def __init__(self, encoder: Encoder, decoder: Decoder, device):
                super().__init__()

                self.encoder = encoder
                self.decoder = decoder
                self.device = device

            def forward(self, src: torch.Tensor, trg: torch.Tensor,
                       teacher_forcing_ratio: float = 0.5) -> torch.Tensor:
                """
                前向传播
                src: [batch_size, src_len] 源语言
                trg: [batch_size, trg_len] 目标语言
                teacher_forcing_ratio: Teacher Forcing概率
                """
                batch_size = src.shape[0]
                trg_len = trg.shape[1]
                trg_vocab_size = self.decoder.output_dim

                # 存储输出
                outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

                # 编码
                encoder_outputs, (hidden, cell) = self.encoder(src)

                # 解码器第一个输入是<SOS>
                input = trg[:, 0]

                for t in range(1, trg_len):
                    # 解码一步
                    output, hidden, cell = self.decoder(input, hidden, cell)

                    # 存储输出
                    outputs[:, t, :] = output

                    # Teacher Forcing
                    teacher_force = random.random() < teacher_forcing_ratio

                    # 获取预测词
                    top1 = output.argmax(1)

                    # 下一个输入
                    input = trg[:, t] if teacher_force else top1

                return outputs

            def translate(self, src: torch.Tensor, max_len: int = 50,
                         sos_idx: int = 2, eos_idx: int = 3) -> torch.Tensor:
                """
                翻译(贪心搜索)
                src: [batch_size, src_len]
                """
                self.eval()
                batch_size = src.shape[0]

                with torch.no_grad():
                    # 编码
                    encoder_outputs, (hidden, cell) = self.encoder(src)

                    # 解码器输入从<SOS>开始
                    input = torch.tensor([sos_idx] * batch_size).to(self.device)

                    # 存储翻译结果
                    translations = []

                    for _ in range(max_len):
                        # 解码一步
                        output, hidden, cell = self.decoder(input, hidden, cell)

                        # 贪心选择
                        top1 = output.argmax(1)

                        translations.append(top1.unsqueeze(1))

                        # 检查是否所有句子都生成了<EOS>
                        if (top1 == eos_idx).all():
                            break

                        # 下一个输入
                        input = top1

                    # 拼接结果
                    translations = torch.cat(translations, dim=1)

                return translations

        # 使用示例
        # 超参数
        INPUT_DIM = 5000   # 源语言词汇量
        OUTPUT_DIM = 5000  # 目标语言词汇量
        ENC_EMB_DIM = 256
        DEC_EMB_DIM = 256
        HIDDEN_DIM = 512
        N_LAYERS = 2
        DROPOUT = 0.5

        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # 创建模型
        encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HIDDEN_DIM, N_LAYERS, DROPOUT)
        decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HIDDEN_DIM, N_LAYERS, DROPOUT)
        model = Seq2Seq(encoder, decoder, device).to(device)

        print(f"模型参数量: {sum(p.numel() for p in model.parameters()):,}")

        # 示例输入
        src = torch.randint(0, INPUT_DIM, (32, 20)).to(device)  # [batch_size, src_len]
        trg = torch.randint(0, OUTPUT_DIM, (32, 25)).to(device)  # [batch_size, trg_len]

        # 前向传播
        output = model(src, trg, teacher_forcing_ratio=0.5)
        print(f"输出形状: {output.shape}")  # [batch_size, trg_len, output_dim]

        # 翻译
        translations = model.translate(src[:1], max_len=30)
        print(f"翻译结果形状: {translations.shape}")
        ---
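        The `translate` method above uses greedy search only. Beam search, mentioned in the description, can be sketched independently of the model as follows. `step_fn` is a hypothetical stand-in for one decoder step (in practice, `Decoder.forward` followed by a log-softmax) that returns log-probabilities for the next token given the prefix. The sketch omits length normalization and batching.

```python
import math
from typing import Callable, Dict, List, Tuple


def beam_search(step_fn: Callable[[List[int]], Dict[int, float]],
                sos_idx: int, eos_idx: int,
                beam_size: int = 3, max_len: int = 10) -> Tuple[List[int], float]:
    """Return the highest-scoring sequence and its total log-probability.

    step_fn(prefix) returns {next_token: log_prob} for the given prefix.
    """
    beams = [([sos_idx], 0.0)]   # (tokens, cumulative log-probability)
    finished = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + log_prob))

        # Keep only the beam_size best expansions
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_idx:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    # Fall back to unfinished hypotheses if nothing reached the end token
    pool = finished if finished else beams
    return max(pool, key=lambda item: item[1])


# Toy model: 2 = SOS, 3 = EOS, 4 and 5 are ordinary tokens.
# Token 4 looks best at the first step but leads to a worse complete sequence.
table = {
    (2,): {4: math.log(0.6), 5: math.log(0.4)},
    (2, 4): {3: math.log(0.4), 4: math.log(0.3), 5: math.log(0.3)},
    (2, 5): {3: math.log(0.9), 4: math.log(0.05), 5: math.log(0.05)},
}


def toy_step(prefix: List[int]) -> Dict[int, float]:
    return table.get(tuple(prefix), {3: 0.0})


greedy_tokens, greedy_score = beam_search(toy_step, 2, 3, beam_size=1)
beam_tokens, beam_score = beam_search(toy_step, 2, 3, beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # [2, 4, 3] 0.24
print(beam_tokens, round(math.exp(beam_score), 2))      # [2, 5, 3] 0.36
```

        With a beam of 1 the search reduces to greedy decoding and commits to the locally best first token; a beam of 2 keeps the alternative alive and finds the sequence with the higher overall probability.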

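02.Attention mechanism
    a.Description
        Attention removes the fixed-length context-vector bottleneck of Seq2Seq by letting the decoder focus on different parts of the source sentence at every step.
        Computation: 1) score the similarity (energy) between the decoder hidden state and every encoder hidden state; 2) normalize the scores with softmax to get attention weights; 3) take the weighted sum of the encoder states to get the context vector.
        Common variants are additive attention (Bahdanau), multiplicative attention (Luong), and scaled dot-product attention.
        Visualizing attention weights shows the alignment between source and target words, which helps in understanding the translation process and debugging the model.
        Multi-head attention runs several attention heads in parallel to capture information from different subspaces and is the core component of the Transformer.
        Self-attention lets every position in a sequence attend to all other positions, capturing long-range dependencies at O(n²) time and memory cost in the sequence length n.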
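        The three computation steps can be traced on a tiny example. The sketch below implements scaled dot-product attention for a single query in plain Python, so every intermediate value is easy to inspect. `dot_product_attention` and the toy vectors are illustrative assumptions; a real model would use batched tensor operations.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query: List[float],
                          keys: List[List[float]],
                          values: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # 1) energy scores: similarity between the query and every key
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # 2) softmax turns the scores into attention weights that sum to 1
    weights = softmax(scores)
    # 3) context vector: attention-weighted sum of the values
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


# Decoder state (query) and three encoder states (used as both keys and values)
query = [1.0, 0.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = dot_product_attention(query, encoder_states, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

        The first and third encoder states have the same dot product with the query, so they receive equal weight, and both outweigh the second state, which is orthogonal to the query.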
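    b.Code example
        ---
        # Attention mechanism implementation
        import torch
        import torch.nn as nn
        import torch.nn.functional as F
        from typing import Tuple  # used in the forward() return annotations below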

        class BahdanauAttention(nn.Module):
            """Bahdanau加性注意力"""

            def __init__(self, hidden_dim: int):
                super().__init__()

                self.hidden_dim = hidden_dim

                # 注意力权重矩阵
                self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
                self.v = nn.Linear(hidden_dim, 1, bias=False)

            def forward(self, hidden: torch.Tensor, encoder_outputs: torch.Tensor,
                       mask: torch.Tensor = None) -> Tuple[torch.Tensor, torch.Tensor]:
                """
                计算注意力
                hidden: [batch_size, hidden_dim] 解码器当前隐藏状态
                encoder_outputs: [batch_size, src_len, hidden_dim] 编码器输出
                mask: [batch_size, src_len] 掩码(可选)
                """
                batch_size = encoder_outputs.shape[0]
                src_len = encoder_outputs.shape[1]

                # 重复hidden以匹配src_len
                # [batch_size, src_len, hidden_dim]
                hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)

                # 拼接
                # [batch_size, src_len, hidden_dim * 2]
                energy = torch.cat((hidden, encoder_outputs), dim=2)

                # 计算能量分数
                # [batch_size, src_len, hidden_dim]
                energy = torch.tanh(self.attn(energy))

                # [batch_size, src_len, 1]
                attention = self.v(energy)

                # [batch_size, src_len]
                attention = attention.squeeze(2)

                # 应用掩码(可选)
                if mask is not None:
                    attention = attention.masked_fill(mask == 0, -1e10)

                # Softmax归一化
                attention_weights = F.softmax(attention, dim=1)

                # 加权求和得到上下文向量
                # [batch_size, 1, src_len] x [batch_size, src_len, hidden_dim]
                # = [batch_size, 1, hidden_dim]
                context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)

                # [batch_size, hidden_dim]
                context = context.squeeze(1)

                return context, attention_weights

        class LuongAttention(nn.Module):
            """Luong乘性注意力"""

            def __init__(self, hidden_dim: int, method: str = 'dot'):
                super().__init__()

                self.hidden_dim = hidden_dim
                self.method = method

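                # Fail early on an unsupported scoring method
                if method not in ('dot', 'general', 'concat'):
                    raise ValueError(f"unknown attention method: {method}")

                if method == 'general':
                    self.attn = nn.Linear(hidden_dim, hidden_dim)
                elif method == 'concat':
                    self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
                    # torch.FloatTensor(n) returns uninitialized memory, so initialize explicitly
                    self.v = nn.Parameter(torch.rand(hidden_dim))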

            def forward(self, hidden: torch.Tensor, encoder_outputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
                """
                计算注意力
                hidden: [batch_size, hidden_dim]
                encoder_outputs: [batch_size, src_len, hidden_dim]
                """
                batch_size = encoder_outputs.shape[0]
                src_len = encoder_outputs.shape[1]

                if self.method == 'dot':
                    # 点积注意力
                    # [batch_size, 1, hidden_dim] x [batch_size, hidden_dim, src_len]
                    # = [batch_size, 1, src_len]
                    energy = torch.bmm(hidden.unsqueeze(1), encoder_outputs.transpose(1, 2))
                    energy = energy.squeeze(1)

                elif self.method == 'general':
                    # 一般注意力
                    energy = self.attn(encoder_outputs)
                    energy = torch.bmm(hidden.unsqueeze(1), energy.transpose(1, 2))
                    energy = energy.squeeze(1)

                elif self.method == 'concat':
                    # 拼接注意力
                    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
                    energy = torch.cat((hidden, encoder_outputs), dim=2)
                    energy = torch.tanh(self.attn(energy))
                    energy = torch.sum(self.v * energy, dim=2)

                # Softmax
                attention_weights = F.softmax(energy, dim=1)

                # 上下文向量
                context = torch.bmm(attention_weights.unsqueeze(1), encoder_outputs)
                context = context.squeeze(1)

                return context, attention_weights

        class AttentionDecoder(nn.Module):
            """带注意力的解码器"""

            def __init__(self, output_dim: int, emb_dim: int, hidden_dim: int,
                        n_layers: int = 1, dropout: float = 0.5):
                super().__init__()

                self.output_dim = output_dim
                self.hidden_dim = hidden_dim

                self.embedding = nn.Embedding(output_dim, emb_dim)
                self.attention = BahdanauAttention(hidden_dim)

                # LSTM输入包括嵌入和上下文向量
                self.lstm = nn.LSTM(emb_dim + hidden_dim, hidden_dim, n_layers,
                                   dropout=dropout if n_layers > 1 else 0,
                                   batch_first=True)

                self.fc_out = nn.Linear(hidden_dim * 2 + emb_dim, output_dim)
                self.dropout = nn.Dropout(dropout)

            def forward(self, input: torch.Tensor, hidden: torch.Tensor,
                       cell: torch.Tensor, encoder_outputs: torch.Tensor):
                """
                前向传播
                input: [batch_size]
                hidden: [n_layers, batch_size, hidden_dim]
                cell: [n_layers, batch_size, hidden_dim]
                encoder_outputs: [batch_size, src_len, hidden_dim]
                """
                # 嵌入
                input = input.unsqueeze(1)  # [batch_size, 1]
                embedded = self.dropout(self.embedding(input))  # [batch_size, 1, emb_dim]

                # 计算注意力(使用最顶层hidden)
                context, attention_weights = self.attention(hidden[-1], encoder_outputs)

                # 拼接嵌入和上下文
                # [batch_size, 1, emb_dim + hidden_dim]
                lstm_input = torch.cat((embedded, context.unsqueeze(1)), dim=2)

                # LSTM
                output, (hidden, cell) = self.lstm(lstm_input, (hidden, cell))

                # 拼接输出、上下文、嵌入
                # [batch_size, hidden_dim * 2 + emb_dim]
                prediction = torch.cat((output.squeeze(1), context, embedded.squeeze(1)), dim=1)

                # 预测
                prediction = self.fc_out(prediction)

                return prediction, hidden, cell, attention_weights

        # 使用示例
        # 创建带注意力的解码器
        attn_decoder = AttentionDecoder(
            output_dim=5000,
            emb_dim=256,
            hidden_dim=512,
            n_layers=2,
            dropout=0.5
        )

        # 示例输入
        batch_size = 32
        src_len = 20
        input = torch.randint(0, 5000, (batch_size,))
        hidden = torch.randn(2, batch_size, 512)
        cell = torch.randn(2, batch_size, 512)
        encoder_outputs = torch.randn(batch_size, src_len, 512)

        # 前向传播
        output, hidden, cell, attn_weights = attn_decoder(input, hidden, cell, encoder_outputs)

        print(f"输出形状: {output.shape}")  # [batch_size, output_dim]
        print(f"注意力权重形状: {attn_weights.shape}")  # [batch_size, src_len]
        ---
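        The optional mask in BahdanauAttention fills padded positions with a large negative score before the softmax. The effect can be checked on a toy example in plain Python. This is a sketch under stated assumptions: the function name `masked_softmax` and the sample scores are hypothetical, and the fill value -1e10 simply mirrors the constant used in the code above.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int],
                   fill: float = -1e10) -> List[float]:
    """Softmax in which positions with mask == 0 get (numerically) zero weight."""
    # Replace the scores of masked positions with a very large negative number
    filled = [s if m else fill for s, m in zip(scores, mask)]
    peak = max(filled)
    # exp() of a hugely negative number underflows to 0.0
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical energy scores for a source sentence of length 4,
# where the last two positions are padding
scores = [1.2, 0.3, 2.0, 0.7]
mask = [1, 1, 0, 0]

weights = masked_softmax(scores, mask)
print([round(w, 4) for w in weights])
```

        Although the third position has the highest raw score, it is padding and therefore ends up with zero weight; the remaining probability mass is shared by the two real tokens.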

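4.3 Transformer Translation

01.Transformer Architecture
    a.Overview
        The Transformer is built entirely on attention and drops recurrence. Self-Attention processes all positions of a sequence in parallel, so training is much faster than with RNNs.
        The encoder is a stack of N layers (6 in the base configuration). Each layer has two sub-layers, Multi-Head Self-Attention and a position-wise FFN, each wrapped in a residual connection followed by Layer Normalization.
        The decoder is also a stack of N layers. Each layer has three sub-layers: Masked Self-Attention, Encoder-Decoder Attention, and an FFN.
        Positional Encoding adds position information to each token embedding using sine and cosine functions, because Self-Attention by itself is insensitive to token order.
        Multi-Head Attention uses 8 heads in the base configuration; each head attends in a different representation subspace, which increases the model's expressive power.
        On the WMT translation benchmarks the Transformer set a new state of the art when it was published, improving BLEU by roughly 2-3 points over RNN-based models while cutting training time severalfold.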
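        The sinusoidal positional encoding mentioned above follows PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal plain-Python sketch of that formula is shown here; the function name `positional_encoding` and the tiny sizes are chosen only for illustration.

```python
import math
from typing import List


def positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Build a [max_len][d_model] table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the formula
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)          # even dimensions use sine
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe


table = positional_encoding(max_len=4, d_model=4)
for pos, row in enumerate(table):
    print(pos, [round(v, 4) for v in row])
```

        Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and higher dimensions change more slowly with position because their wavelength is longer.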
    b.Code example
        ---
        # Transformer translation model implementation
        import torch
        import torch.nn as nn
        import math

        class MultiHeadAttention(nn.Module):
            """多头注意力"""

            def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
                super().__init__()

                assert d_model % n_heads == 0

                self.d_model = d_model
                self.n_heads = n_heads
                self.d_k = d_model // n_heads

                # Q, K, V线性变换
                self.W_q = nn.Linear(d_model, d_model)
                self.W_k = nn.Linear(d_model, d_model)
                self.W_v = nn.Linear(d_model, d_model)

                # 输出线性变换
                self.W_o = nn.Linear(d_model, d_model)

                self.dropout = nn.Dropout(dropout)
                self.scale = math.sqrt(self.d_k)

            def forward(self, query, key, value, mask=None):
                """
                query, key, value: [batch_size, seq_len, d_model]
                mask: [batch_size, 1, 1, seq_len] 或 [batch_size, 1, seq_len, seq_len]
                """
                batch_size = query.shape[0]

                # 线性变换并分割为多头
                # [batch_size, seq_len, d_model] -> [batch_size, seq_len, n_heads, d_k]
                Q = self.W_q(query).view(batch_size, -1, self.n_heads, self.d_k)
                K = self.W_k(key).view(batch_size, -1, self.n_heads, self.d_k)
                V = self.W_v(value).view(batch_size, -1, self.n_heads, self.d_k)

                # 转置 [batch_size, n_heads, seq_len, d_k]
                Q = Q.transpose(1, 2)
                K = K.transpose(1, 2)
                V = V.transpose(1, 2)

                # 计算注意力分数
                # [batch_size, n_heads, seq_len, seq_len]
                scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

                # 应用掩码
                if mask is not None:
                    scores = scores.masked_fill(mask == 0, -1e10)

                # Softmax
                attention = torch.softmax(scores, dim=-1)
                attention = self.dropout(attention)

                # 加权求和
                # [batch_size, n_heads, seq_len, d_k]
                x = torch.matmul(attention, V)

                # 拼接多头
                # [batch_size, seq_len, d_model]
                x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

                # 输出线性变换
                x = self.W_o(x)

                return x, attention

        class PositionwiseFeedForward(nn.Module):
            """位置前馈网络"""

            def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
                super().__init__()

                self.fc1 = nn.Linear(d_model, d_ff)
                self.fc2 = nn.Linear(d_ff, d_model)
                self.dropout = nn.Dropout(dropout)

            def forward(self, x):
                """
                x: [batch_size, seq_len, d_model]
                """
                # [batch_size, seq_len, d_ff]
                x = self.dropout(torch.relu(self.fc1(x)))

                # [batch_size, seq_len, d_model]
                x = self.fc2(x)

                return x

        class PositionalEncoding(nn.Module):
            """位置编码"""

            def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
                super().__init__()

                self.dropout = nn.Dropout(dropout)

                # 创建位置编码矩阵
                pe = torch.zeros(max_len, d_model)
                position = torch.arange(0, max_len).unsqueeze(1).float()
                div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                                    -(math.log(10000.0) / d_model))

                # PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
                pe[:, 0::2] = torch.sin(position * div_term)
                # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
                pe[:, 1::2] = torch.cos(position * div_term)

                # [1, max_len, d_model]
                pe = pe.unsqueeze(0)

                # 注册为buffer(不参与训练)
                self.register_buffer('pe', pe)

            def forward(self, x):
                """
                x: [batch_size, seq_len, d_model]
                """
                # 添加位置编码
                x = x + self.pe[:, :x.size(1), :]
                return self.dropout(x)

        class EncoderLayer(nn.Module):
            """Transformer编码器层"""

            def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
                super().__init__()

                self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
                self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)

                self.norm1 = nn.LayerNorm(d_model)
                self.norm2 = nn.LayerNorm(d_model)

                self.dropout1 = nn.Dropout(dropout)
                self.dropout2 = nn.Dropout(dropout)

            def forward(self, x, mask=None):
                """
                x: [batch_size, seq_len, d_model]
                mask: [batch_size, 1, 1, seq_len]
                """
                # Self-Attention + 残差连接 + Layer Norm
                attn_output, _ = self.self_attn(x, x, x, mask)
                x = self.norm1(x + self.dropout1(attn_output))

                # Feed Forward + 残差连接 + Layer Norm
                ff_output = self.feed_forward(x)
                x = self.norm2(x + self.dropout2(ff_output))

                return x

        class DecoderLayer(nn.Module):
            """Transformer解码器层"""

            def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
                super().__init__()

                self.self_attn = MultiHeadAttention(d_model, n_heads, dropout)
                self.cross_attn = MultiHeadAttention(d_model, n_heads, dropout)
                self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)

                self.norm1 = nn.LayerNorm(d_model)
                self.norm2 = nn.LayerNorm(d_model)
                self.norm3 = nn.LayerNorm(d_model)

                self.dropout1 = nn.Dropout(dropout)
                self.dropout2 = nn.Dropout(dropout)
                self.dropout3 = nn.Dropout(dropout)

            def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
                """
                x: [batch_size, tgt_len, d_model] 解码器输入
                encoder_output: [batch_size, src_len, d_model] 编码器输出
                src_mask: [batch_size, 1, 1, src_len] 源序列掩码
                tgt_mask: [batch_size, 1, tgt_len, tgt_len] 目标序列掩码(因果掩码)
                """
                # Masked Self-Attention
                attn_output, _ = self.self_attn(x, x, x, tgt_mask)
                x = self.norm1(x + self.dropout1(attn_output))

                # Encoder-Decoder Attention
                attn_output, _ = self.cross_attn(x, encoder_output, encoder_output, src_mask)
                x = self.norm2(x + self.dropout2(attn_output))

                # Feed Forward
                ff_output = self.feed_forward(x)
                x = self.norm3(x + self.dropout3(ff_output))

                return x

        class Transformer(nn.Module):
            """Transformer翻译模型"""

            def __init__(self, src_vocab_size: int, tgt_vocab_size: int,
                        d_model: int = 512, n_heads: int = 8, n_layers: int = 6,
                        d_ff: int = 2048, dropout: float = 0.1, max_len: int = 5000):
                super().__init__()

                self.d_model = d_model

                # 嵌入层
                self.src_embedding = nn.Embedding(src_vocab_size, d_model)
                self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)

                # 位置编码
                self.pos_encoding = PositionalEncoding(d_model, max_len, dropout)

                # 编码器和解码器层
                self.encoder_layers = nn.ModuleList([
                    EncoderLayer(d_model, n_heads, d_ff, dropout)
                    for _ in range(n_layers)
                ])

                self.decoder_layers = nn.ModuleList([
                    DecoderLayer(d_model, n_heads, d_ff, dropout)
                    for _ in range(n_layers)
                ])

                # 输出层
                self.fc_out = nn.Linear(d_model, tgt_vocab_size)

                self.dropout = nn.Dropout(dropout)

                # 初始化参数
                self._init_parameters()

            def _init_parameters(self):
                """参数初始化"""
                for p in self.parameters():
                    if p.dim() > 1:
                        nn.init.xavier_uniform_(p)

            def make_src_mask(self, src):
                """创建源序列掩码"""
                # src: [batch_size, src_len]
                # mask: [batch_size, 1, 1, src_len]
                src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
                return src_mask

            def make_tgt_mask(self, tgt):
                """创建目标序列掩码(因果掩码)"""
                # tgt: [batch_size, tgt_len]
                batch_size, tgt_len = tgt.shape

                # Padding掩码
                tgt_pad_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)  # [batch_size, 1, 1, tgt_len]

                # 因果掩码(下三角矩阵)
                tgt_sub_mask = torch.tril(torch.ones((tgt_len, tgt_len))).bool().to(tgt.device)
                tgt_sub_mask = tgt_sub_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, tgt_len, tgt_len]

                # 组合掩码
                tgt_mask = tgt_pad_mask & tgt_sub_mask

                return tgt_mask

            def encode(self, src, src_mask):
                """编码"""
                # 嵌入 + 位置编码
                x = self.src_embedding(src) * math.sqrt(self.d_model)
                x = self.pos_encoding(x)

                # 编码器层
                for layer in self.encoder_layers:
                    x = layer(x, src_mask)

                return x

            def decode(self, tgt, encoder_output, src_mask, tgt_mask):
                """解码"""
                # 嵌入 + 位置编码
                x = self.tgt_embedding(tgt) * math.sqrt(self.d_model)
                x = self.pos_encoding(x)

                # 解码器层
                for layer in self.decoder_layers:
                    x = layer(x, encoder_output, src_mask, tgt_mask)

                return x

            def forward(self, src, tgt):
                """
                前向传播
                src: [batch_size, src_len]
                tgt: [batch_size, tgt_len]
                """
                # 创建掩码
                src_mask = self.make_src_mask(src)
                tgt_mask = self.make_tgt_mask(tgt)

                # 编码
                encoder_output = self.encode(src, src_mask)

                # 解码
                decoder_output = self.decode(tgt, encoder_output, src_mask, tgt_mask)

                # 输出
                output = self.fc_out(decoder_output)

                return output

        # 使用示例
        model = Transformer(
            src_vocab_size=10000,
            tgt_vocab_size=10000,
            d_model=512,
            n_heads=8,
            n_layers=6,
            d_ff=2048,
            dropout=0.1
        )

        print(f"模型参数量: {sum(p.numel() for p in model.parameters()):,}")

        # 示例输入
        src = torch.randint(1, 10000, (32, 20))  # [batch_size, src_len]
        tgt = torch.randint(1, 10000, (32, 25))  # [batch_size, tgt_len]

        # 前向传播
        output = model(src, tgt)
        print(f"输出形状: {output.shape}")  # [batch_size, tgt_len, tgt_vocab_size]
        ---
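        The usage example above feeds the full target sequence to the model to check shapes. During training with teacher forcing, the decoder input is usually the target without its last token and the labels are the target without its first token. The plain-Python sketch below shows that shift together with the combined causal and padding mask that make_tgt_mask builds. The helper names and the token indices (0 for padding, 2 for the start token, 3 for the end token) are assumptions made for this example.

```python
from typing import List, Tuple

PAD = 0  # assumed padding index, matching the `!= 0` test in make_tgt_mask


def shift_for_teacher_forcing(tgt: List[int]) -> Tuple[List[int], List[int]]:
    """Decoder input drops the last token; labels drop the first token."""
    return tgt[:-1], tgt[1:]


def causal_padding_mask(tokens: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when step i may attend to position j."""
    n = len(tokens)
    # j <= i enforces causality; tokens[j] != PAD hides padding positions
    return [[(j <= i) and (tokens[j] != PAD) for j in range(n)] for i in range(n)]


# Hypothetical target: <sos>=2, two words, <eos>=3, then two padding tokens
tgt = [2, 15, 8, 3, 0, 0]

dec_in, labels = shift_for_teacher_forcing(tgt)
mask = causal_padding_mask(dec_in)

print("decoder input:", dec_in)
print("labels:       ", labels)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

        Each printed row is one decoding step: the lower-triangular pattern shows that a step can only see earlier positions, and the padding column stays hidden at every step.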

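02.Beam Search Decoding
    a.Overview
        Beam Search is a heuristic search algorithm that keeps the K highest-scoring partial hypotheses (the beam) at every step, trading search quality against compute.
        Beam size is usually set between 4 and 10. If it is too small the search can miss the best hypothesis; if it is too large, compute grows and translations tend to become overly conservative.
        A length penalty offsets the bias toward short outputs that comes from summing log-probabilities. A common choice divides the score by lp(len) = (5+len)^α / (5+1)^α with α between 0.6 and 1.0.
        Compared with greedy search, Beam Search typically gains 1-2 BLEU points, at roughly K times the decoding compute.
        A coverage mechanism penalizes attending to the same source words repeatedly, which reduces repeated output and is particularly useful for tasks such as summarization.
        Diverse Beam Search penalizes similar candidates to increase diversity, producing several distinct translations for the user to choose from.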
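        The length penalty formula above can be worked through numerically. The sketch below uses two made-up hypotheses (their log-probabilities and lengths are illustrative, not measured values) to show how dividing by the penalty can change which hypothesis ranks first.

```python
from typing import Tuple


def length_penalty(length: int, alpha: float = 0.6) -> float:
    """lp(len) = (5 + len)^alpha / (5 + 1)^alpha"""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)


def normalized(hyp: Tuple[float, int], alpha: float = 0.6) -> float:
    """Summed log-probability divided by the length penalty."""
    log_prob, length = hyp
    return log_prob / length_penalty(length, alpha)


# Hypothetical hypotheses: (summed log-probability, length in tokens)
short = (-2.0, 3)
long = (-2.3, 6)

print("raw scores:       ", short[0], long[0])
print("normalized scores:", round(normalized(short), 3), round(normalized(long), 3))
```

        On raw log-probability the short hypothesis wins, since every extra token can only lower the sum. After normalization the longer hypothesis ranks higher, which is the bias correction the penalty is meant to provide. With α = 0 the penalty is 1 for every length and the ranking falls back to raw scores.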
    b.Code example
        ---
        # Beam Search implementation
        import torch
        import torch.nn.functional as F
        from typing import List, Tuple

        class BeamSearchDecoder:
            """Beam Search解码器"""

            def __init__(self, model, beam_size: int = 5, max_len: int = 50,
                        length_penalty: float = 0.6, sos_idx: int = 2, eos_idx: int = 3):
                self.model = model
                self.beam_size = beam_size
                self.max_len = max_len
                self.length_penalty = length_penalty
                self.sos_idx = sos_idx
                self.eos_idx = eos_idx

            def length_penalty_fn(self, length: int) -> float:
                """长度惩罚函数"""
                return ((5 + length) ** self.length_penalty) / ((5 + 1) ** self.length_penalty)

            def decode(self, src: torch.Tensor) -> List[Tuple[List[int], float]]:
                """
                Beam Search解码
                src: [1, src_len] 源序列
                返回: [(翻译序列, 分数), ...] 按分数降序排列
                """
                self.model.eval()
                device = src.device

                with torch.no_grad():
                    # 编码
                    src_mask = self.model.make_src_mask(src)
                    encoder_output = self.model.encode(src, src_mask)

                    # 初始化beam
                    # 每个beam: (序列, 累积对数概率)
                    beams = [([self.sos_idx], 0.0)]

                    for step in range(self.max_len):
                        candidates = []

                        for seq, score in beams:
                            # 如果已经生成<EOS>,直接加入候选
                            if seq[-1] == self.eos_idx:
                                candidates.append((seq, score))
                                continue

                            # 准备解码器输入
                            tgt = torch.tensor([seq], device=device)  # [1, len]
                            tgt_mask = self.model.make_tgt_mask(tgt)

                            # 解码
                            decoder_output = self.model.decode(tgt, encoder_output, src_mask, tgt_mask)

                            # 获取最后一个位置的输出
                            logits = self.model.fc_out(decoder_output[:, -1, :])  # [1, vocab_size]

                            # 计算对数概率
                            log_probs = F.log_softmax(logits, dim=-1)

                            # 获取top-k
                            top_log_probs, top_indices = torch.topk(log_probs, self.beam_size)

                            # 扩展beam
                            for log_prob, idx in zip(top_log_probs[0], top_indices[0]):
                                new_seq = seq + [idx.item()]
                                new_score = score + log_prob.item()

                                candidates.append((new_seq, new_score))

                        # 按分数排序(考虑长度惩罚)
                        candidates = sorted(candidates,
                                          key=lambda x: x[1] / self.length_penalty_fn(len(x[0])),
                                          reverse=True)

                        # 保留top beam_size个候选
                        beams = candidates[:self.beam_size]

                        # 检查是否所有beam都生成了<EOS>
                        if all(seq[-1] == self.eos_idx for seq, _ in beams):
                            break

                    # 返回结果(应用长度惩罚)
                    results = [(seq, score / self.length_penalty_fn(len(seq)))
                              for seq, score in beams]

                    return results

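            def translate(self, src: torch.Tensor, id2word: dict) -> List[Tuple[str, float]]:
                """
                Translate and convert the result to text
                src: [1, src_len]
                id2word: mapping from token index to word
                Returns: list of (translation, score) pairs, sorted by score in descending order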
                """
                results = self.decode(src)

                translations = []
                for seq, score in results:
                    # 移除<SOS>和<EOS>
                    seq = [idx for idx in seq if idx not in [self.sos_idx, self.eos_idx]]

                    # 转换为文本
                    words = [id2word.get(idx, '<UNK>') for idx in seq]
                    translation = ' '.join(words)

                    translations.append((translation, score))

                return translations

        # 使用示例
        # 假设已有训练好的Transformer模型
        # model = Transformer(...)
        # model.load_state_dict(torch.load('model.pt'))

        # 创建Beam Search解码器
        # beam_decoder = BeamSearchDecoder(
        #     model=model,
        #     beam_size=5,
        #     max_len=50,
        #     length_penalty=0.6
        # )

        # 翻译
        # src = torch.tensor([[2, 45, 123, 67, 89, 3]])  # [1, src_len]
        # translations = beam_decoder.translate(src, id2word)

        # 打印结果
        # for i, (translation, score) in enumerate(translations):
        #     print(f"候选{i+1} (分数: {score:.4f}): {translation}")
        ---
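        Since the usage example above needs a trained model, the core loop of Beam Search can also be exercised on a toy distribution. The sketch below is a simplified stand-in, not the BeamSearchDecoder class: the lookup table `TABLE` replaces the model, all probabilities are invented, and no length penalty is applied because every finished sequence has the same length.

```python
import math
from typing import Dict, List, Tuple

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, summed log-probability)

# Toy next-token distribution keyed by prefix (hypothetical numbers).
# A prefix that is missing from the table counts as finished.
TABLE: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def beam_search(beam_size: int, max_len: int = 5) -> List[Hypothesis]:
    beams: List[Hypothesis] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for seq, score in beams:
            options = TABLE.get(seq)
            if not options:
                # Finished hypothesis: carry it over unchanged
                candidates.append((seq, score))
                continue
            # Expand the hypothesis with every possible next token
            for token, prob in options.items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the beam_size best candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        new_beams = candidates[:beam_size]
        if new_beams == beams:  # nothing was expanded, so all beams are finished
            break
        beams = new_beams
    return beams


greedy = beam_search(beam_size=1)
beam = beam_search(beam_size=2)
print("greedy:", greedy[0][0], round(math.exp(greedy[0][1]), 2))
print("beam-2:", beam[0][0], round(math.exp(beam[0][1]), 2))
```

        Greedy search (beam size 1) commits to "a" because it is the most likely first token, and ends with probability 0.30. With beam size 2 the search keeps "b" alive and finds "b z" with probability 0.36, which illustrates why a wider beam can beat greedy decoding.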

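4.4 Applying Pretrained Models

01.mBART Multilingual Translation
    a.Overview
        mBART is a multilingual pretrained model from Facebook. It is pretrained as a denoising autoencoder on monolingual text in 25 languages and can then be fine-tuned for translation, including many-to-many translation.
        The pretraining noise consists of Text Infilling and Sentence Permutation; reconstructing the original text from the noised input teaches the model cross-lingual representations.
        Fine-tuning needs only a small amount of parallel data to reach good quality, which makes mBART a good fit for low-resource language pairs.
        mBART-50 covers 50 languages and uses a language token to select the target language, for example en_XX for English.
        The model also has some zero-shot ability on language pairs it was not fine-tuned on: after fine-tuning on Chinese-English and English-French, for example, it can attempt Chinese-French directly, usually at lower quality than for supervised pairs.
        Compared with a Transformer trained from scratch, mBART improves BLEU by 3-5 points on WMT datasets and converges faster; the size of the gain depends on the language pair and the amount of parallel data.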
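        The two noise functions named above, Text Infilling and Sentence Permutation, can be illustrated in a few lines of plain Python. This is a simplified sketch, not the actual pretraining pipeline: the span position and length are fixed here for readability (the real procedure samples them randomly), and the function names and the `<mask>` string are assumptions for the example.

```python
import random
from typing import List

MASK = "<mask>"  # placeholder mask token, assumed for illustration


def infill_span(tokens: List[str], start: int, span_len: int) -> List[str]:
    """Text Infilling: replace a whole span of tokens with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + span_len:]


def permute_sentences(sentences: List[str], rng: random.Random) -> List[str]:
    """Sentence Permutation: shuffle the order of sentences in a document."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled


tokens = "machine translation is an important application".split()
noised = infill_span(tokens, start=1, span_len=3)
print(noised)

sents = ["First sentence.", "Second sentence.", "Third sentence."]
print(permute_sentences(sents, random.Random(0)))
```

        Because a span of several tokens collapses into one mask, the model has to predict both the missing words and how many there were. The training target in both cases is the original, un-noised text.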
    b.Code example
        ---
        # mBART translation application
        from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
        import torch
        from typing import List

        class mBARTTranslator:
            """mBART多语言翻译器"""

            def __init__(self, model_name: str = 'facebook/mbart-large-50-many-to-many-mmt'):
                # 加载模型和分词器
                self.model = MBartForConditionalGeneration.from_pretrained(model_name)
                self.tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

                # 语言代码映射
                self.lang_codes = {
                    'zh': 'zh_CN',  # 中文
                    'en': 'en_XX',  # 英文
                    'fr': 'fr_XX',  # 法语
                    'de': 'de_DE',  # 德语
                    'ja': 'ja_XX',  # 日语
                    'ko': 'ko_KR',  # 韩语
                    'es': 'es_XX',  # 西班牙语
                    'ru': 'ru_RU',  # 俄语
                }

            def translate(self,
                         text: str,
                         src_lang: str,
                         tgt_lang: str,
                         max_length: int = 512,
                         num_beams: int = 5,
                         length_penalty: float = 1.0) -> str:
                """
                翻译单个文本
                text: 源文本
                src_lang: 源语言代码(如'zh', 'en')
                tgt_lang: 目标语言代码
                """
                # 获取完整语言代码
                src_lang_code = self.lang_codes.get(src_lang, src_lang)
                tgt_lang_code = self.lang_codes.get(tgt_lang, tgt_lang)

                # 设置源语言
                self.tokenizer.src_lang = src_lang_code

                # 编码
                encoded = self.tokenizer(
                    text,
                    return_tensors='pt',
                    padding=True,
                    truncation=True,
                    max_length=max_length
                ).to(self.device)

                # 生成翻译
                with torch.no_grad():
                    generated_tokens = self.model.generate(
                        **encoded,
                        forced_bos_token_id=self.tokenizer.lang_code_to_id[tgt_lang_code],
                        max_length=max_length,
                        num_beams=num_beams,
                        length_penalty=length_penalty,
                        early_stopping=True
                    )

                # 解码
                translation = self.tokenizer.batch_decode(
                    generated_tokens,
                    skip_special_tokens=True
                )[0]

                return translation

            def batch_translate(self,
                               texts: List[str],
                               src_lang: str,
                               tgt_lang: str,
                               batch_size: int = 8,
                               **kwargs) -> List[str]:
                """批量翻译"""
                translations = []

                for i in range(0, len(texts), batch_size):
                    batch_texts = texts[i:i + batch_size]

                    # 批量编码
                    src_lang_code = self.lang_codes.get(src_lang, src_lang)
                    tgt_lang_code = self.lang_codes.get(tgt_lang, tgt_lang)

                    self.tokenizer.src_lang = src_lang_code

                    encoded = self.tokenizer(
                        batch_texts,
                        return_tensors='pt',
                        padding=True,
                        truncation=True,
                        max_length=kwargs.get('max_length', 512)
                    ).to(self.device)

                    # 生成
                    with torch.no_grad():
                        generated_tokens = self.model.generate(
                            **encoded,
                            forced_bos_token_id=self.tokenizer.lang_code_to_id[tgt_lang_code],
                            max_length=kwargs.get('max_length', 512),
                            num_beams=kwargs.get('num_beams', 5),
                            length_penalty=kwargs.get('length_penalty', 1.0),
                            early_stopping=True
                        )

                    # 解码
                    batch_translations = self.tokenizer.batch_decode(
                        generated_tokens,
                        skip_special_tokens=True
                    )

                    translations.extend(batch_translations)

                return translations

            def pivot_translate(self,
                               text: str,
                               src_lang: str,
                               tgt_lang: str,
                               pivot_lang: str = 'en') -> str:
                """
                枢轴翻译:通过中间语言翻译
                如中文->英文->法语
                """
                # 第一步:源语言->枢轴语言
                pivot_text = self.translate(text, src_lang, pivot_lang)

                # 第二步:枢轴语言->目标语言
                final_text = self.translate(pivot_text, pivot_lang, tgt_lang)

                return final_text

        # 使用示例
        translator = mBARTTranslator()

        # 中英翻译
        zh_text = "机器翻译是自然语言处理的重要应用"
        en_translation = translator.translate(zh_text, src_lang='zh', tgt_lang='en')
        print(f"中文: {zh_text}")
        print(f"英文: {en_translation}")

        # 英中翻译
        en_text = "Machine translation is an important application of NLP"
        zh_translation = translator.translate(en_text, src_lang='en', tgt_lang='zh')
        print(f"\n英文: {en_text}")
        print(f"中文: {zh_translation}")

        # 批量翻译
        texts = [
            "今天天气很好",
            "我喜欢学习人工智能",
            "深度学习改变了世界"
        ]
        translations = translator.batch_translate(texts, src_lang='zh', tgt_lang='en')
        print("\n批量翻译:")
        for src, tgt in zip(texts, translations):
            print(f"  {src} -> {tgt}")

        # 枢轴翻译(中文->法语,通过英语)
        # fr_translation = translator.pivot_translate(zh_text, 'zh', 'fr', pivot_lang='en')
        # print(f"\n中文->法语: {fr_translation}")
        ---

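02.M2M-100 Direct Multilingual Translation
    a.Overview
        M2M-100 is a many-to-many translation model from Facebook that translates directly between any pair of its 100 languages without using English as a pivot.
        The training data contains 7.5B sentence pairs covering 2,200 language directions, which made it the largest many-to-many translation model at the time of its release.
        Direct translation avoids the error accumulation of pivot translation; on non-English language pairs it improves BLEU by more than 10 points over English-centric systems.
        To better handle the characteristics of individual languages, the original work combines a shared dense model with language-specific parameters.
        It supports low-resource languages, such as smaller African and Asian languages, filling a gap left by traditional translation systems.
        At inference time the source and target languages must be specified explicitly: the source language is set on the tokenizer, and the target language token is forced as the first generated token.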
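        The figure of 2,200 directions is easier to judge next to the number of directions that are possible at all. The short calculation below is plain arithmetic; the function names are made up for the example.

```python
def directed_pairs(n_languages: int) -> int:
    """Every ordered (source, target) pair with source != target."""
    return n_languages * (n_languages - 1)


def english_centric_pairs(n_languages: int) -> int:
    """Directions into and out of English only."""
    return 2 * (n_languages - 1)


n = 100
trained = 2200  # number of directions with training data, as stated above

print("all possible directions:", directed_pairs(n))
print("English-centric directions:", english_centric_pairs(n))
print("share with training data:", f"{trained / directed_pairs(n):.1%}")
```

        An English-centric system over 100 languages covers only 198 directions, whereas 9,900 are possible. Training on 2,200 directions therefore goes well beyond English-centric data while still covering under a quarter of all ordered pairs.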
    b.Code example
        ---
        # M2M-100 translation application
        from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
        import torch
        from typing import List, Dict
        import time

        class M2M100Translator:
            """M2M-100多语言翻译器"""

            def __init__(self, model_name: str = 'facebook/m2m100_418M'):
                """
                model_name: 'facebook/m2m100_418M' 或 'facebook/m2m100_1.2B'
                """
                print(f"加载模型: {model_name}...")
                self.model = M2M100ForConditionalGeneration.from_pretrained(model_name)
                self.tokenizer = M2M100Tokenizer.from_pretrained(model_name)

                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

                print(f"模型加载完成,设备: {self.device}")

            def translate(self,
                         text: str,
                         src_lang: str,
                         tgt_lang: str,
                         max_length: int = 512,
                         num_beams: int = 5) -> Dict:
                """
                Translate a text
                Returns: {'translation': str, 'time': float, 'src_lang': str, 'tgt_lang': str}
                """
                start_time = time.time()

                # 设置源语言
                self.tokenizer.src_lang = src_lang

                # 编码
                encoded = self.tokenizer(
                    text,
                    return_tensors='pt',
                    padding=True,
                    truncation=True,
                    max_length=max_length
                ).to(self.device)

                # 生成翻译
                with torch.no_grad():
                    generated_tokens = self.model.generate(
                        **encoded,
                        forced_bos_token_id=self.tokenizer.get_lang_id(tgt_lang),
                        max_length=max_length,
                        num_beams=num_beams,
                        early_stopping=True
                    )

                # 解码
                translation = self.tokenizer.batch_decode(
                    generated_tokens,
                    skip_special_tokens=True
                )[0]

                elapsed_time = time.time() - start_time

                return {
                    'translation': translation,
                    'time': elapsed_time,
                    'src_lang': src_lang,
                    'tgt_lang': tgt_lang
                }

            def multi_target_translate(self,
                                      text: str,
                                      src_lang: str,
                                      tgt_langs: List[str]) -> Dict[str, str]:
                """
                将一个文本翻译成多种目标语言
                """
                results = {}

                for tgt_lang in tgt_langs:
                    result = self.translate(text, src_lang, tgt_lang)
                    results[tgt_lang] = result['translation']

                return results

            def compare_translations(self,
                                    text: str,
                                    src_lang: str,
                                    tgt_lang: str,
                                    methods: List[str] = ['direct', 'pivot']) -> Dict:
                """
                比较直接翻译和枢轴翻译
                """
                results = {}

                # 直接翻译
                if 'direct' in methods:
                    direct_result = self.translate(text, src_lang, tgt_lang)
                    results['direct'] = direct_result

                # 枢轴翻译(通过英语)
                if 'pivot' in methods and src_lang != 'en' and tgt_lang != 'en':
                    # 源语言->英语
                    to_en = self.translate(text, src_lang, 'en')
                    # 英语->目标语言
                    from_en = self.translate(to_en['translation'], 'en', tgt_lang)

                    results['pivot'] = {
                        'translation': from_en['translation'],
                        'time': to_en['time'] + from_en['time'],
                        'intermediate': to_en['translation']
                    }

                return results

        # 使用示例
        translator = M2M100Translator(model_name='facebook/m2m100_418M')

        # 中文->英语
        result = translator.translate(
            "人工智能正在改变世界",
            src_lang='zh',
            tgt_lang='en'
        )
        print(f"翻译: {result['translation']}")
        print(f"耗时: {result['time']:.3f}秒")

        # 一对多翻译
        multi_results = translator.multi_target_translate(
            "你好,世界",
            src_lang='zh',
            tgt_langs=['en', 'fr', 'de', 'ja', 'ko']
        )
        print("\n一对多翻译:")
        for lang, translation in multi_results.items():
            print(f"  {lang}: {translation}")

        # 比较直接翻译和枢轴翻译
        comparison = translator.compare_translations(
            "机器学习是人工智能的核心",
            src_lang='zh',
            tgt_lang='fr',
            methods=['direct', 'pivot']
        )
        print("\n翻译方法比较:")
        print(f"直接翻译: {comparison['direct']['translation']}")
        print(f"  耗时: {comparison['direct']['time']:.3f}秒")
        if 'pivot' in comparison:
            print(f"枢轴翻译: {comparison['pivot']['translation']}")
            print(f"  中间结果: {comparison['pivot']['intermediate']}")
            print(f"  耗时: {comparison['pivot']['time']:.3f}秒")
        ---

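4.5 Hands-on projects

01.Building a Chinese-English translation system
    a.Description
        A production-grade translation system has to address model selection, data processing, inference optimization, API design, and monitoring and alerting.
        Data preprocessing covers tokenization, BPE encoding, length filtering, deduplication, and quality filtering, so that only clean sentence pairs reach training.
        Training uses mixed precision for speed, gradient accumulation to simulate large batches, and learning-rate warmup to stabilize the early steps.
        Inference optimization combines model quantization, ONNX export, batching, and caching; the goal is to bring latency below 100 ms, although the achievable figure depends on hardware and sentence length.
        The API service is built with FastAPI and is meant to expose single-sentence, batch, and streaming translation endpoints; the example below implements the first two.
        Monitoring tracks QPS, latency, BLEU, and error rate, visualized with Prometheus and Grafana.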
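        The preprocessing steps listed above (length filtering, deduplication, quality filtering) can be sketched in a few lines of standard-library Python. The function name and all thresholds are hypothetical; character counts are only a crude proxy, and a real pipeline would more likely compare subword token counts after BPE.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def clean_parallel_corpus(pairs: Iterable[Pair],
                          min_chars: int = 1,
                          max_chars: int = 200,
                          max_ratio: float = 8.0) -> List[Pair]:
    """Apply a length filter, a length-ratio filter, and exact deduplication."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Length filter: drop pairs with an empty or overly long side.
        if not (min_chars <= len(src) <= max_chars):
            continue
        if not (min_chars <= len(tgt) <= max_chars):
            continue
        # Quality filter: very unbalanced lengths usually indicate misalignment.
        ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
        if ratio > max_ratio:
            continue
        # Deduplication, ignoring case on the English side.
        key = (src, tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


raw = [
    ("人工智能正在改变世界", "Artificial intelligence is changing the world"),
    ("人工智能正在改变世界", "artificial intelligence is changing the world"),
    ("你好", ""),
    ("好", "This sentence is far too long to be a translation of one character"),
]
cleaned = clean_parallel_corpus(raw)
print(f"kept {len(cleaned)} of {len(raw)} pairs")
```

        The ratio threshold is deliberately loose here because a Chinese sentence is usually several times shorter in characters than its English translation.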
    b.Code example
        ---
        # 中英翻译系统完整实现
        from transformers import MarianMTModel, MarianTokenizer
        import torch
        from fastapi import FastAPI, HTTPException, BackgroundTasks
        from pydantic import BaseModel
        from typing import List, Optional
        import time
        import logging
        from collections import deque
        from threading import Lock
        import redis
        import hashlib
        import json

        # 配置日志
        logging.basicConfig(level=logging.INFO)
        logger = logging.getLogger(__name__)

        class TranslationRequest(BaseModel):
            text: str
            src_lang: str = 'zh'
            tgt_lang: str = 'en'
            use_cache: bool = True

        class BatchTranslationRequest(BaseModel):
            texts: List[str]
            src_lang: str = 'zh'
            tgt_lang: str = 'en'

        class TranslationResponse(BaseModel):
            translation: str
            src_text: str
            processing_time: float
            from_cache: bool = False

        class TranslationSystem:
            """生产级翻译系统"""

            def __init__(self,
                        model_name: str = 'Helsinki-NLP/opus-mt-zh-en',
                        cache_enabled: bool = True,
                        redis_host: str = 'localhost',
                        redis_port: int = 6379):

                # 加载模型
                logger.info(f"加载翻译模型: {model_name}")
                self.model = MarianMTModel.from_pretrained(model_name)
                self.tokenizer = MarianTokenizer.from_pretrained(model_name)

                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

                # 缓存
                self.cache_enabled = cache_enabled
                if cache_enabled:
                    try:
                        self.redis_client = redis.Redis(
                            host=redis_host,
                            port=redis_port,
                            decode_responses=True
                        )
                        self.redis_client.ping()
                        logger.info("Redis缓存已连接")
                    except redis.exceptions.RedisError as e:
                        logger.warning(f"Redis connection failed, cache disabled: {e}")
                        self.cache_enabled = False

                # 统计信息
                self.stats = {
                    'total_requests': 0,
                    'cache_hits': 0,
                    'total_time': 0.0,
                    'recent_latencies': deque(maxlen=100)
                }
                self.stats_lock = Lock()

                logger.info(f"翻译系统初始化完成,设备: {self.device}")

            def _get_cache_key(self, text: str, src_lang: str, tgt_lang: str) -> str:
                """生成缓存键"""
                content = f"{src_lang}:{tgt_lang}:{text}"
                return hashlib.md5(content.encode()).hexdigest()

            def _get_from_cache(self, cache_key: str) -> Optional[str]:
                """从缓存获取翻译"""
                if not self.cache_enabled:
                    return None

                try:
                    cached = self.redis_client.get(cache_key)
                    if cached:
                        with self.stats_lock:
                            self.stats['cache_hits'] += 1
                        return cached
                except Exception as e:
                    logger.error(f"缓存读取失败: {e}")

                return None

            def _save_to_cache(self, cache_key: str, translation: str, ttl: int = 3600):
                """保存翻译到缓存"""
                if not self.cache_enabled:
                    return

                try:
                    self.redis_client.setex(cache_key, ttl, translation)
                except Exception as e:
                    logger.error(f"缓存写入失败: {e}")

            def translate(self,
                         text: str,
                         src_lang: str = 'zh',
                         tgt_lang: str = 'en',
                         use_cache: bool = True,
                         max_length: int = 512,
                         num_beams: int = 4) -> dict:
                """
                翻译单个文本
                """
                start_time = time.time()

                # 更新统计
                with self.stats_lock:
                    self.stats['total_requests'] += 1

                # 检查缓存
                from_cache = False
                if use_cache:
                    cache_key = self._get_cache_key(text, src_lang, tgt_lang)
                    cached_translation = self._get_from_cache(cache_key)

                    if cached_translation:
                        elapsed_time = time.time() - start_time
                        with self.stats_lock:
                            self.stats['total_time'] += elapsed_time
                            self.stats['recent_latencies'].append(elapsed_time)

                        return {
                            'translation': cached_translation,
                            'time': elapsed_time,
                            'from_cache': True
                        }

                # 模型推理
                try:
                    # 编码
                    encoded = self.tokenizer(
                        text,
                        return_tensors='pt',
                        padding=True,
                        truncation=True,
                        max_length=max_length
                    ).to(self.device)

                    # 生成
                    with torch.no_grad():
                        generated_tokens = self.model.generate(
                            **encoded,
                            max_length=max_length,
                            num_beams=num_beams,
                            early_stopping=True
                        )

                    # 解码
                    translation = self.tokenizer.decode(
                        generated_tokens[0],
                        skip_special_tokens=True
                    )

                    # 保存到缓存
                    if use_cache:
                        self._save_to_cache(cache_key, translation)

                    elapsed_time = time.time() - start_time

                    # 更新统计
                    with self.stats_lock:
                        self.stats['total_time'] += elapsed_time
                        self.stats['recent_latencies'].append(elapsed_time)

                    return {
                        'translation': translation,
                        'time': elapsed_time,
                        'from_cache': False
                    }

                except Exception as e:
                    logger.error(f"翻译失败: {e}")
                    raise

            def batch_translate(self,
                               texts: List[str],
                               batch_size: int = 8,
                               **kwargs) -> List[dict]:
                """批量翻译"""
                results = []

                for i in range(0, len(texts), batch_size):
                    batch_texts = texts[i:i + batch_size]

                    # 批量编码
                    encoded = self.tokenizer(
                        batch_texts,
                        return_tensors='pt',
                        padding=True,
                        truncation=True,
                        max_length=kwargs.get('max_length', 512)
                    ).to(self.device)

                    # 生成
                    start_time = time.time()
                    with torch.no_grad():
                        generated_tokens = self.model.generate(
                            **encoded,
                            max_length=kwargs.get('max_length', 512),
                            num_beams=kwargs.get('num_beams', 4),
                            early_stopping=True
                        )

                    # 解码
                    translations = self.tokenizer.batch_decode(
                        generated_tokens,
                        skip_special_tokens=True
                    )

                    elapsed_time = time.time() - start_time

                    # 添加结果
                    for text, translation in zip(batch_texts, translations):
                        results.append({
                            'src_text': text,
                            'translation': translation,
                            'time': elapsed_time / len(batch_texts)
                        })

                return results

            def get_stats(self) -> dict:
                """获取统计信息"""
                with self.stats_lock:
                    avg_latency = (self.stats['total_time'] / self.stats['total_requests']
                                  if self.stats['total_requests'] > 0 else 0)

                    # Sort the recent-latency window once and index into it
                    recent_latencies = sorted(self.stats['recent_latencies'])
                    n = len(recent_latencies)
                    p50 = recent_latencies[n // 2] if n else 0
                    p95 = recent_latencies[int(n * 0.95)] if n else 0
                    p99 = recent_latencies[int(n * 0.99)] if n else 0

                    cache_hit_rate = (self.stats['cache_hits'] / self.stats['total_requests']
                                     if self.stats['total_requests'] > 0 else 0)

                    return {
                        'total_requests': self.stats['total_requests'],
                        'cache_hits': self.stats['cache_hits'],
                        'cache_hit_rate': f"{cache_hit_rate:.2%}",
                        'avg_latency': f"{avg_latency:.3f}s",
                        'p50_latency': f"{p50:.3f}s",
                        'p95_latency': f"{p95:.3f}s",
                        'p99_latency': f"{p99:.3f}s"
                    }

        # FastAPI应用
        app = FastAPI(title="Translation API", version="1.0.0")

        # 全局翻译系统实例
        translation_system = None

        @app.on_event("startup")
        async def startup_event():
            """启动时初始化"""
            global translation_system
            translation_system = TranslationSystem(
                model_name='Helsinki-NLP/opus-mt-zh-en',
                cache_enabled=True
            )

        @app.get("/")
        async def root():
            return {"status": "healthy", "service": "Translation API"}

        @app.post("/translate", response_model=TranslationResponse)
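        def translate(request: TranslationRequest):
            """Single-sentence translation.

            Declared as a plain `def` so that FastAPI runs the blocking model
            call in its thread pool instead of stalling the event loop.
            """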
            if not translation_system:
                raise HTTPException(status_code=503, detail="系统未初始化")

            try:
                result = translation_system.translate(
                    request.text,
                    request.src_lang,
                    request.tgt_lang,
                    request.use_cache
                )

                return TranslationResponse(
                    translation=result['translation'],
                    src_text=request.text,
                    processing_time=result['time'],
                    from_cache=result['from_cache']
                )

            except Exception as e:
                logger.error(f"翻译请求失败: {e}")
                raise HTTPException(status_code=500, detail=str(e))

        @app.post("/batch_translate")
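        def batch_translate(request: BatchTranslationRequest):
            """Batch translation.

            Declared as a plain `def` so that FastAPI runs the blocking model
            call in its thread pool instead of stalling the event loop.
            """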
            if not translation_system:
                raise HTTPException(status_code=503, detail="系统未初始化")

            try:
                results = translation_system.batch_translate(
                    request.texts,
                    batch_size=8
                )

                return {"results": results}

            except Exception as e:
                logger.error(f"批量翻译失败: {e}")
                raise HTTPException(status_code=500, detail=str(e))

        @app.get("/stats")
        async def get_stats():
            """获取统计信息"""
            if not translation_system:
                raise HTTPException(status_code=503, detail="系统未初始化")

            return translation_system.get_stats()

        # 启动命令:
        # uvicorn translation_api:app --host 0.0.0.0 --port 8000 --workers 1
        ---
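        The get_stats method above estimates percentiles by indexing into the sorted latency window. As a point of comparison, here is a minimal sketch of the nearest-rank definition, which makes explicit which sample is returned. The helper name and the sample latencies are hypothetical, and the result can differ by one position from the indexing used in get_stats.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], q: float) -> float:
    """Smallest sample such that at least q percent of all samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in seconds, including two slow outliers.
latencies = [0.020, 0.025, 0.030, 0.031, 0.035, 0.040, 0.045, 0.050, 0.120, 0.900]
for q in (50, 95, 99):
    print(f"p{q}: {nearest_rank_percentile(latencies, q):.3f}s")
```

        With only ten samples the tail percentiles collapse onto the slowest request, which is one reason to keep a reasonably large window before alerting on p95 or p99.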

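02.Translation quality monitoring and optimization
    a.Description
        Quality monitoring combines automatic metrics (BLEU, METEOR) with human evaluation (sampled review) to establish a quality baseline and alert thresholds.
        A/B tests compare models or parameter settings, with a statistical test to decide whether the difference is significant.
        Error analysis collects bad cases and classifies them by error type (omission, mistranslation, word-order errors, and so on) so that fixes can be targeted.
        Continual learning fine-tunes the model on user feedback, closing the loop of human annotation, model training, and deployment.
        Model ensembling (voting or weighted averaging) can improve quality, but the gain has to be weighed against the extra inference cost.
        Domain adaptation trains dedicated models for specific domains (medical, legal, financial) or post-processes output with a domain glossary.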
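        One common way to run the significance test mentioned for A/B comparisons is paired bootstrap resampling. The sketch below assumes per-sentence quality scores for two systems on the same test sentences (for example human ratings or a sentence-level metric); the function name and the scores are hypothetical. For corpus-level BLEU the score would have to be recomputed on each resample instead of averaged.

```python
import random
from typing import List


def paired_bootstrap(scores_a: List[float],
                     scores_b: List[float],
                     n_resamples: int = 2000,
                     seed: int = 0) -> float:
    """Fraction of resamples in which system B's mean score beats system A's."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement; use the same indices
        # for both systems so that the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_b > mean_a:
            wins += 1
    return wins / n_resamples


# Hypothetical per-sentence scores for the current model (A) and a candidate (B).
system_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51]
system_b = [0.65, 0.60, 0.70, 0.55, 0.69, 0.61, 0.75, 0.50]
win_rate = paired_bootstrap(system_a, system_b)
print(f"B beats A in {win_rate:.1%} of resamples")
```

        A win rate above roughly 95% is often read as a significant improvement; with a test set this small the estimate is noisy, so a real evaluation would use far more sentences.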
    b.Code example
        ---
        # 翻译质量监控系统
        from sacrebleu.metrics import BLEU
        from typing import List, Dict
        from collections import defaultdict
        import matplotlib.pyplot as plt
        from datetime import datetime, timedelta
        import logging

        # This block is standalone, so _trigger_alert needs its own logger
        logger = logging.getLogger(__name__)

        class TranslationQualityMonitor:
            """翻译质量监控器"""

            def __init__(self, alert_threshold: float = 30.0):
                self.bleu = BLEU()
                self.alert_threshold = alert_threshold  # BLEU告警阈值

                # 存储历史数据
                self.quality_history = []
                self.error_cases = []

            def evaluate_batch(self,
                             predictions: List[str],
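                             references: List[List[str]]) -> Dict:
                """Evaluate a batch of translations.

                references[i] lists the acceptable references for predictions[i].
                """
                # sacrebleu expects reference *streams*: one list per reference
                # position, each as long as `predictions`. Transpose the
                # per-sentence lists, padding short ones by repeating their first
                # reference (a duplicated reference does not change BLEU).
                n_refs = max(len(refs) for refs in references)
                ref_streams = [
                    [refs[k] if k < len(refs) else refs[0] for refs in references]
                    for k in range(n_refs)
                ]
                bleu_score = self.bleu.corpus_score(predictions, ref_streams)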

                # 记录
                record = {
                    'timestamp': datetime.now().isoformat(),
                    'bleu': bleu_score.score,
                    'count': len(predictions)
                }
                self.quality_history.append(record)

                # 检查告警
                if bleu_score.score < self.alert_threshold:
                    self._trigger_alert(bleu_score.score)

                return {
                    'bleu': bleu_score.score,
                    'precisions': bleu_score.precisions,
                    'bp': bleu_score.bp
                }

            def _trigger_alert(self, bleu_score: float):
                """触发质量告警"""
                logger.warning(f"翻译质量告警! BLEU={bleu_score:.2f} < {self.alert_threshold}")
                # 实际应用中可以发送邮件、钉钉消息等

            def collect_error_case(self,
                                  src_text: str,
                                  prediction: str,
                                  reference: str,
                                  error_type: str):
                """收集错误案例"""
                error_case = {
                    'timestamp': datetime.now().isoformat(),
                    'src_text': src_text,
                    'prediction': prediction,
                    'reference': reference,
                    'error_type': error_type
                }
                self.error_cases.append(error_case)

            def analyze_errors(self) -> Dict:
                """分析错误分布"""
                error_types = defaultdict(int)

                for case in self.error_cases:
                    error_types[case['error_type']] += 1

                return dict(error_types)

            def get_quality_trend(self, days: int = 7) -> List[Dict]:
                """获取质量趋势"""
                cutoff_time = datetime.now() - timedelta(days=days)

                recent_records = [
                    r for r in self.quality_history
                    if datetime.fromisoformat(r['timestamp']) > cutoff_time
                ]

                return recent_records

            def plot_quality_trend(self, days: int = 7):
                """绘制质量趋势图"""
                trend = self.get_quality_trend(days)

                if not trend:
                    print("没有足够的历史数据")
                    return

                timestamps = [datetime.fromisoformat(r['timestamp']) for r in trend]
                bleu_scores = [r['bleu'] for r in trend]

                plt.figure(figsize=(12, 6))
                plt.plot(timestamps, bleu_scores, marker='o')
                plt.axhline(y=self.alert_threshold, color='r', linestyle='--',
                           label=f'告警阈值 ({self.alert_threshold})')
                plt.xlabel('时间')
                plt.ylabel('BLEU分数')
                plt.title(f'翻译质量趋势 (最近{days}天)')
                plt.legend()
                plt.grid(True)
                plt.xticks(rotation=45)
                plt.tight_layout()
                plt.savefig('quality_trend.png')
                print("质量趋势图已保存: quality_trend.png")

        # 使用示例
        monitor = TranslationQualityMonitor(alert_threshold=30.0)

        # 评估翻译质量
        predictions = [
            "Machine translation is an important application",
            "Today the weather is very good"
        ]
        references = [
            ["Machine translation is an important application of AI"],
            ["The weather is nice today", "Today's weather is very nice"]
        ]

        results = monitor.evaluate_batch(predictions, references)
        print(f"BLEU分数: {results['bleu']:.2f}")

        # 收集错误案例
        monitor.collect_error_case(
            src_text="我爱北京天安门",
            prediction="I love Beijing",
            reference="I love Tiananmen Square in Beijing",
            error_type="漏译"
        )

        # 分析错误
        error_analysis = monitor.analyze_errors()
        print(f"\n错误分布: {error_analysis}")

        # 绘制趋势图
        # monitor.plot_quality_trend(days=7)
        ---
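        To show what the BLEU score in the monitor is built from, the sketch below computes its two ingredients by hand: clipped (modified) n-gram precision and the brevity penalty. It uses whitespace tokens and no smoothing, so its numbers will not match sacrebleu exactly; the helper names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp: List[str], refs: List[List[str]], n: int) -> float:
    """Each hypothesis n-gram counts at most as often as in any single reference."""
    hyp_counts = ngrams(hyp, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """1.0 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)


# The omission case collected above: every word is right, but half is missing.
hyp = "I love Beijing".split()
ref = "I love Tiananmen Square in Beijing".split()
p1 = clipped_precision(hyp, [ref], 1)
p2 = clipped_precision(hyp, [ref], 2)
bp = brevity_penalty(len(hyp), len(ref))
print(f"unigram precision: {p1:.2f}, bigram precision: {p2:.2f}, brevity penalty: {bp:.3f}")
```

        Precision alone would rate the omission as perfect at the unigram level; the brevity penalty is what pulls the score down for translations that drop content.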

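5 Text generation

5.1 Types of generation tasks

01.Taxonomy of text generation tasks
    a.Description
        Text generation covers conditional generation (produce an output for a given input) and unconditional generation (free-form writing); applications include summarization, dialogue, writing assistance, and code generation.
        Summarization is either extractive (select sentences from the source) or abstractive (rephrase the content); abstractive summarization is more flexible but harder.
        Dialogue generation spans chit-chat, task-oriented, and knowledge-grounded dialogue, and must maintain contextual consistency, persona, and factual accuracy.
        Creative writing, such as poetry, stories, and advertising copy, calls for creativity, style transfer, and emotional expression.
        Code generation produces program code from natural-language descriptions, as in GitHub Copilot and OpenAI Codex, and is a major application of AI-assisted programming.
        Data-to-text generation turns structured data (tables, charts) into natural-language descriptions, for example in report generation and data analysis.
    b.Code example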
        ---
        # 文本生成任务框架
        import torch
        from transformers import GPT2LMHeadModel, GPT2Tokenizer
        from typing import List

        class TextGenerator:
            """通用文本生成器"""

            def __init__(self, model_name: str = 'gpt2'):
                self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
                self.model = GPT2LMHeadModel.from_pretrained(model_name)

                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

                # 设置pad_token
                if self.tokenizer.pad_token is None:
                    self.tokenizer.pad_token = self.tokenizer.eos_token

            def generate_greedy(self,
                              prompt: str,
                              max_length: int = 50) -> str:
                """贪心解码生成"""
                # 编码输入
                input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)

                # 生成
                with torch.no_grad():
                    output_ids = self.model.generate(
                        input_ids,
                        max_length=max_length,
                        do_sample=False,  # 贪心
                        pad_token_id=self.tokenizer.pad_token_id
                    )

                # 解码
                generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

                return generated_text

            def generate_beam_search(self,
                                    prompt: str,
                                    max_length: int = 50,
                                    num_beams: int = 5,
                                    num_return_sequences: int = 1) -> List[str]:
                """Beam Search生成"""
                input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)

                with torch.no_grad():
                    output_ids = self.model.generate(
                        input_ids,
                        max_length=max_length,
                        num_beams=num_beams,
                        num_return_sequences=num_return_sequences,
                        early_stopping=True,
                        pad_token_id=self.tokenizer.pad_token_id
                    )

                # 解码所有候选
                generated_texts = [
                    self.tokenizer.decode(ids, skip_special_tokens=True)
                    for ids in output_ids
                ]

                return generated_texts

            def generate_sampling(self,
                                prompt: str,
                                max_length: int = 50,
                                temperature: float = 1.0,
                                top_k: int = 50,
                                top_p: float = 0.95,
                                num_return_sequences: int = 1) -> List[str]:
                """
                采样生成(Top-k + Top-p)
                temperature: 温度参数,越高越随机
                top_k: 只从概率最高的k个词中采样
                top_p: 核采样,累积概率达到p时停止
                """
                input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)

                with torch.no_grad():
                    output_ids = self.model.generate(
                        input_ids,
                        max_length=max_length,
                        do_sample=True,
                        temperature=temperature,
                        top_k=top_k,
                        top_p=top_p,
                        num_return_sequences=num_return_sequences,
                        pad_token_id=self.tokenizer.pad_token_id
                    )

                generated_texts = [
                    self.tokenizer.decode(ids, skip_special_tokens=True)
                    for ids in output_ids
                ]

                return generated_texts

            def generate_with_constraints(self,
                                         prompt: str,
                                         must_include: List[str],
                                         max_length: int = 50) -> str:
                """
                带约束的生成(必须包含特定词)
                """
                # 简化实现:生成多个候选,选择包含约束词的
                candidates = self.generate_sampling(
                    prompt,
                    max_length=max_length,
                    num_return_sequences=10
                )

                # 筛选满足约束的候选
                valid_candidates = []
                for candidate in candidates:
                    if all(word in candidate for word in must_include):
                        valid_candidates.append(candidate)

                if valid_candidates:
                    return valid_candidates[0]
                else:
                    # 如果没有满足约束的,返回第一个候选
                    return candidates[0]

        # 使用示例
        generator = TextGenerator(model_name='gpt2')

        # 贪心生成
        prompt = "Artificial intelligence is"
        greedy_result = generator.generate_greedy(prompt, max_length=30)
        print(f"贪心生成:\n{greedy_result}\n")

        # Beam Search生成
        beam_results = generator.generate_beam_search(prompt, max_length=30, num_beams=3, num_return_sequences=3)
        print("Beam Search生成:")
        for i, result in enumerate(beam_results, 1):
            print(f"{i}. {result}")
        print()

        # 采样生成(多样性)
        sampling_results = generator.generate_sampling(
            prompt,
            max_length=30,
            temperature=0.8,
            top_k=50,
            top_p=0.95,
            num_return_sequences=3
        )
        print("采样生成:")
        for i, result in enumerate(sampling_results, 1):
            print(f"{i}. {result}")
        print()

        # 带约束生成
        constrained_result = generator.generate_with_constraints(
            "The future of technology",
            must_include=["innovation", "society"],
            max_length=40
        )
        print(f"带约束生成:\n{constrained_result}")
        ---

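02.Comparing decoding strategies
    a.Description
        Greedy decoding picks the highest-probability token at every step; it is fast but easily gets stuck in local optima and produces repetitive text.
        Beam search keeps several candidate paths; quality is usually better than with greedy decoding, but the output still tends to be generic and conservative.
        Sampling introduces randomness to increase diversity; temperature controls how random the draw is, while top-k and top-p restrict the set of tokens to sample from.
        Temperature=1.0 keeps the original distribution, values below 1.0 sharpen it (more deterministic), and values above 1.0 flatten it (more random); 0.7-1.0 is a common range.
        Top-k sampling draws only from the k most probable tokens; k=50 is a common choice, a k that is too small limits diversity, and one that is too large lets in low-probability noise.
        Top-p (nucleus) sampling keeps the smallest set of tokens whose cumulative probability reaches p; p=0.9-0.95 is common, and because the set size adapts to the distribution it is more flexible than a fixed k.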
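        The three controls described above can be written out on a toy next-token distribution. This is a minimal standard-library sketch, not how the transformers library implements them internally; the function names and the logits are hypothetical.

```python
import math
from typing import Dict


def softmax_with_temperature(logits: Dict[str, float],
                             temperature: float = 1.0) -> Dict[str, float]:
    """Divide the logits by the temperature, then normalize with softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    """Rescale the kept probabilities so that they sum to 1."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


# Hypothetical logits for the next token.
logits = {"the": 2.0, "a": 1.0, "cat": 0.5, "zebra": -1.0}
probs = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
print("top-k (k=3):", sorted(top_k_filter(probs, 3)))
print("top-p (p=0.8):", sorted(top_p_filter(probs, 0.8)))
print(f"P(the) at T=0.5 / 1.0 / 1.5: {sharp['the']:.3f} / {probs['the']:.3f} / {flat['the']:.3f}")
```

        Sampling then draws from the filtered distribution; note how top-p keeps two tokens here because the head of the distribution already carries most of the mass, whereas top-k keeps three regardless of shape.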
    b.Code example
        ---
        # 生成策略对比实验
        import torch
        from transformers import GPT2LMHeadModel, GPT2Tokenizer
        from typing import Dict, List
        import numpy as np

        class GenerationStrategyComparison:
            """生成策略对比工具"""

            def __init__(self, model_name: str = 'gpt2'):
                self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
                self.model = GPT2LMHeadModel.from_pretrained(model_name)
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

            def compare_strategies(self, prompt: str, max_length: int = 50) -> Dict[str, List[str]]:
                """对比不同生成策略"""
                results = {}

                # 1. 贪心解码
                results['greedy'] = self._generate(
                    prompt, max_length, do_sample=False
                )

                # 2. Beam Search
                results['beam_search'] = self._generate(
                    prompt, max_length, num_beams=5, num_return_sequences=3
                )

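                # 3. Low-temperature sampling (more deterministic).
                # top_k=0 turns off the library's default top-k filter (top_k=50)
                # so that only the temperature varies.
                results['low_temp'] = self._generate(
                    prompt, max_length, do_sample=True, temperature=0.5, top_k=0, num_return_sequences=3
                )

                # 4. High-temperature sampling (more random)
                results['high_temp'] = self._generate(
                    prompt, max_length, do_sample=True, temperature=1.5, top_k=0, num_return_sequences=3
                )

                # 5. Top-k sampling
                results['top_k'] = self._generate(
                    prompt, max_length, do_sample=True, top_k=50, num_return_sequences=3
                )

                # 6. Top-p (nucleus) sampling; top_k=0 so that top-p alone limits the candidates
                results['top_p'] = self._generate(
                    prompt, max_length, do_sample=True, top_p=0.95, top_k=0, num_return_sequences=3
                )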

                return results

            def _generate(self, prompt: str, max_length: int, **kwargs) -> List[str]:
                """生成文本"""
                input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)

                # 设置默认参数
                if 'pad_token_id' not in kwargs:
                    kwargs['pad_token_id'] = self.tokenizer.eos_token_id

                if 'num_return_sequences' not in kwargs:
                    kwargs['num_return_sequences'] = 1

                with torch.no_grad():
                    output_ids = self.model.generate(
                        input_ids,
                        max_length=max_length,
                        **kwargs
                    )

                texts = [
                    self.tokenizer.decode(ids, skip_special_tokens=True)
                    for ids in output_ids
                ]

                return texts

            def analyze_diversity(self, texts: List[str]) -> Dict[str, float]:
                """分析生成文本的多样性"""
                # 计算不同指标
                unique_words = set()
                total_words = 0

                for text in texts:
                    words = text.split()
                    unique_words.update(words)
                    total_words += len(words)

                # 词汇多样性
                vocab_diversity = len(unique_words) / total_words if total_words > 0 else 0

                # 句子间相似度(简化:计算重叠词比例)
                similarities = []
                for i in range(len(texts)):
                    for j in range(i+1, len(texts)):
                        words_i = set(texts[i].split())
                        words_j = set(texts[j].split())
                        overlap = len(words_i & words_j)
                        union = len(words_i | words_j)
                        similarity = overlap / union if union > 0 else 0
                        similarities.append(similarity)

                avg_similarity = np.mean(similarities) if similarities else 0

                return {
                    'vocab_diversity': vocab_diversity,
                    'avg_similarity': avg_similarity,
                    'unique_words': len(unique_words),
                    'total_words': total_words
                }

            def print_comparison(self, results: Dict[str, List[str]]):
                """打印对比结果"""
                print("=" * 80)
                print("生成策略对比")
                print("=" * 80)

                for strategy, texts in results.items():
                    print(f"\n【{strategy.upper()}】")
                    for i, text in enumerate(texts, 1):
                        print(f"{i}. {text}")

                    # 分析多样性
                    diversity = self.analyze_diversity(texts)
                    print(f"\n多样性指标:")
                    print(f"  词汇多样性: {diversity['vocab_diversity']:.3f}")
                    print(f"  平均相似度: {diversity['avg_similarity']:.3f}")
                    print(f"  独特词数: {diversity['unique_words']}")

        # 使用示例
        comparator = GenerationStrategyComparison()

        prompt = "The future of artificial intelligence"
        results = comparator.compare_strategies(prompt, max_length=40)
        comparator.print_comparison(results)
        ---
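        The comparison above goes through `model.generate`, which hides what temperature, top-k, and top-p do to the next-token distribution. The following is a minimal pure-Python sketch on a toy four-token distribution; the helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the logits are illustrative assumptions, not part of the transformers API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for a 4-token vocabulary
cold = softmax(toy_logits, temperature=0.5)
hot = softmax(toy_logits, temperature=1.5)
print("max prob at T=0.5:", round(max(cold), 3), "| at T=1.5:", round(max(hot), 3))

rng = random.Random(0)
base = softmax(toy_logits)
print("top-k=2 keeps:", sorted(top_k_filter(base, 2)))
print("top-p=0.8 keeps:", sorted(top_p_filter(base, 0.8)))
print("sampled id:", sample(top_p_filter(base, 0.8), rng))
```

        Low temperature concentrates mass on the best token, which is why the low-temperature outputs above tend to resemble each other, while top-k and top-p cut off the unlikely tail before sampling.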

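5.2 Text Summarization

01.Extractive summarization
    a.Description
        Extractive summarization builds the summary from important sentences selected from the source, preserving the original wording; it suits formal text such as news and papers.
        TextRank applies the PageRank idea: it builds a sentence graph, scores each sentence by importance, and keeps the top-k sentences.
        BERT-based methods encode sentences with BERT and train a classifier to decide whether each sentence belongs in the summary.
        The Lead-3 baseline takes the first three sentences as the summary and is a surprisingly strong baseline on news, where key facts tend to appear first.
        Evaluation uses ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which measures n-gram overlap with reference summaries.
        Strengths are fluency and factual fidelity; weaknesses are limited flexibility and possible redundancy.
    b.Code example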
        ---
        # 抽取式摘要实现
        import torch
        from transformers import BertTokenizer, BertModel
        import numpy as np
        from sklearn.metrics.pairwise import cosine_similarity
        import networkx as nx

        class ExtractiveSummarizer:
            """抽取式摘要器"""

            def __init__(self, model_name: str = 'bert-base-uncased'):
                self.tokenizer = BertTokenizer.from_pretrained(model_name)
                self.model = BertModel.from_pretrained(model_name)
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

            def sentence_embedding(self, sentence: str) -> np.ndarray:
                """获取句子嵌入"""
                encoded = self.tokenizer(sentence, return_tensors='pt', padding=True,
                                        truncation=True, max_length=512).to(self.device)

                with torch.no_grad():
                    outputs = self.model(**encoded)
                    # 使用[CLS]向量作为句子表示
                    embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()

                return embedding[0]

            def textrank_summarize(self, text: str, num_sentences: int = 3) -> str:
                """TextRank摘要"""
                # 分句
                sentences = [s.strip() for s in text.split('.') if s.strip()]

                if len(sentences) <= num_sentences:
                    return text

                # 计算句子嵌入
                embeddings = [self.sentence_embedding(s) for s in sentences]

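                # Cosine similarity between every pair of sentence embeddings
                similarity_matrix = cosine_similarity(np.vstack(embeddings))
                # Drop self-loops and negative weights so PageRank runs on a valid graph
                np.fill_diagonal(similarity_matrix, 0.0)
                similarity_matrix = np.clip(similarity_matrix, 0.0, None)

                # Build the sentence graph and run PageRank
                nx_graph = nx.from_numpy_array(similarity_matrix)
                scores = nx.pagerank(nx_graph)

                # Rank sentence indices (not strings) so duplicate sentences are handled correctly
                top_indices = sorted(range(len(sentences)), key=lambda i: scores[i],
                                     reverse=True)[:num_sentences]

                # Restore the original document order
                summary_sentences = [sentences[i] for i in sorted(top_indices)]

                return '. '.join(summary_sentences) + '.'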

            def lead_baseline(self, text: str, num_sentences: int = 3) -> str:
                """Lead-N基线"""
                sentences = [s.strip() for s in text.split('.') if s.strip()]
                return '. '.join(sentences[:num_sentences]) + '.'

        # 使用示例
        summarizer = ExtractiveSummarizer()

        text = """Artificial intelligence is transforming the world. Machine learning algorithms can now 
        perform tasks that were once thought to require human intelligence. Deep learning has achieved 
        remarkable success in computer vision and natural language processing. However, there are still 
        many challenges to overcome. AI systems need to be more robust, interpretable, and fair."""

        # TextRank摘要
        summary = summarizer.textrank_summarize(text, num_sentences=2)
        print(f"TextRank摘要:\n{summary}\n")

        # Lead基线
        lead_summary = summarizer.lead_baseline(text, num_sentences=2)
        print(f"Lead基线:\n{lead_summary}")
        ---
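        The class above delegates ranking to `networkx`. To show what TextRank computes, here is a minimal dependency-free sketch that uses the word-overlap similarity from the original TextRank formulation instead of BERT embeddings, and a plain power iteration for PageRank. The function names and the toy text are illustrative assumptions; sentences with no outgoing edges simply keep the base score in this simplified version.

```python
import math
import re

def split_sentences(text):
    """Split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def overlap_similarity(a, b):
    """Shared words divided by the sum of log sentence lengths."""
    wa, wb = re.findall(r'\w+', a.lower()), re.findall(r'\w+', b.lower())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    if denom <= 0:
        return 0.0
    return len(set(wa) & set(wb)) / denom

def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over an adjacency matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores

def textrank_summary(text, num_sentences=2):
    """Return the top-ranked sentences in their original order."""
    sentences = split_sentences(text)
    if len(sentences) <= num_sentences:
        return sentences
    n = len(sentences)
    weights = [[0.0 if i == j else overlap_similarity(sentences[i], sentences[j])
                for j in range(n)] for i in range(n)]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]

toy_text = "Cats chase mice. Dogs chase cats. Mice eat cheese. The weather is sunny today."
print(textrank_summary(toy_text, num_sentences=2))
```

        The unrelated weather sentence shares no words with the others, so it receives no incoming weight and is ranked last.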

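02.Abstractive summarization
    a.Description
        Abstractive summarization uses a Seq2Seq model to generate new summary text, so it can paraphrase, compress, and reorganize the source.
        BART and T5 are widely used abstractive models and reported leading ROUGE scores on datasets such as CNN/DailyMail and XSum when they were released.
        Pointer-Generator Networks combine generation with a copy mechanism, so rare words can be copied from the source, which improves factual accuracy.
        The coverage mechanism penalizes attending repeatedly to the same content, which reduces repetition and is especially useful for long inputs.
        Evaluation uses ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence).
        Strengths are flexibility and strong compression; the weakness is the risk of factual errors and hallucination.
    b.Code example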
        ---
        # 生成式摘要实现
        from transformers import BartForConditionalGeneration, BartTokenizer
        from transformers import T5ForConditionalGeneration, T5Tokenizer
        import torch

        class AbstractiveSummarizer:
            """生成式摘要器"""

            def __init__(self, model_type: str = 'bart'):
                if model_type == 'bart':
                    self.model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
                    self.tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
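                elif model_type == 't5':
                    self.model = T5ForConditionalGeneration.from_pretrained('t5-base')
                    self.tokenizer = T5Tokenizer.from_pretrained('t5-base')
                else:
                    # Fail early instead of hitting an AttributeError on self.model below
                    raise ValueError(f"Unsupported model_type: {model_type}")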

                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()
                self.model_type = model_type

            def summarize(self, text: str, max_length: int = 130,
                         min_length: int = 30, num_beams: int = 4) -> str:
                """生成摘要"""
                # T5需要添加前缀
                if self.model_type == 't5':
                    text = f"summarize: {text}"

                # 编码
                inputs = self.tokenizer(text, return_tensors='pt', max_length=1024,
                                       truncation=True).to(self.device)

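                # Generate with beam search; pass the attention mask explicitly
                with torch.no_grad():
                    summary_ids = self.model.generate(
                        inputs['input_ids'],
                        attention_mask=inputs['attention_mask'],
                        max_length=max_length,
                        min_length=min_length,
                        num_beams=num_beams,
                        length_penalty=2.0,
                        early_stopping=True
                    )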

                # 解码
                summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)

                return summary

        # 使用示例
        summarizer = AbstractiveSummarizer(model_type='bart')

        article = """The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. 
        It was the first structure to reach a height of 300 metres. It is now taller than the Chrysler Building 
        in New York City by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest 
        free-standing structure in France after the Millau Viaduct."""

        summary = summarizer.summarize(article, max_length=60, min_length=20)
        print(f"生成式摘要:\n{summary}")
        ---
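        The ROUGE variants named in the description can be computed in a few lines. This is a minimal sketch with whitespace tokenization and no stemming, so its numbers will differ slightly from standard ROUGE packages; the function names and example sentences are illustrative assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def prf(overlap, cand_total, ref_total):
    """Precision, recall, and F1 from an overlap count."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {'precision': precision, 'recall': recall, 'f1': f1}

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L: based on the longest common subsequence of tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    return prf(lcs_length(c, r), len(c), len(r))

candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
print("ROUGE-1:", rouge_n(candidate, reference, 1))
print("ROUGE-2:", rouge_n(candidate, reference, 2))
print("ROUGE-L:", rouge_l(candidate, reference))
```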

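5.3 Dialogue Generation

01.Chit-chat dialogue systems
    a.Description
        Chit-chat systems hold open-domain conversations with users, which requires commonsense knowledge, emotional understanding, and personalization.
        Seq2Seq is the basic architecture, but it tends to produce generic, uninformative replies such as "I don't know" or "haha".
        DialoGPT is GPT-2 fine-tuned on Reddit conversation data and produces more natural and varied replies.
        Persona-based dialogue gives the model a persona (for example, "I am someone who loves traveling") so that replies stay consistent and show personality.
        Evaluation combines perplexity and BLEU with human ratings of fluency, relevance, informativeness, and consistency.
        Open challenges include producing safe, unbiased replies, avoiding harmful content, and handling context dependencies across turns.
    b.Code example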
        ---
        # 闲聊对话系统
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        class ChatbotSystem:
            """闲聊对话系统"""

            def __init__(self, model_name: str = 'microsoft/DialoGPT-medium'):
                self.tokenizer = AutoTokenizer.from_pretrained(model_name)
                self.model = AutoModelForCausalLM.from_pretrained(model_name)
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)
                self.model.eval()

                # 对话历史
                self.chat_history_ids = None

            def chat(self, user_input: str, max_length: int = 1000) -> str:
                """Reply to one user turn; earlier turns are kept in self.chat_history_ids."""
                # 编码用户输入
                new_input_ids = self.tokenizer.encode(
                    user_input + self.tokenizer.eos_token,
                    return_tensors='pt'
                ).to(self.device)

                # 拼接历史
                if self.chat_history_ids is not None:
                    bot_input_ids = torch.cat([self.chat_history_ids, new_input_ids], dim=-1)
                else:
                    bot_input_ids = new_input_ids

                # Generate the reply; the returned ids contain history + new reply
                with torch.no_grad():
                    self.chat_history_ids = self.model.generate(
                        bot_input_ids,
                        max_length=max_length,
                        pad_token_id=self.tokenizer.eos_token_id,
                        do_sample=True,
                        top_k=50,
                        top_p=0.95,
                        temperature=0.7
                    )

                # 解码回复
                response = self.tokenizer.decode(
                    self.chat_history_ids[:, bot_input_ids.shape[-1]:][0],
                    skip_special_tokens=True
                )

                return response

            def reset(self):
                """重置对话历史"""
                self.chat_history_ids = None

        # 使用示例
        chatbot = ChatbotSystem()

        print("Chatbot: Hello! How can I help you today?")
        user_inputs = [
            "Hi, how are you?",
            "What do you like to do?",
            "That sounds interesting!"
        ]

        for user_input in user_inputs:
            print(f"User: {user_input}")
            response = chatbot.chat(user_input)
            print(f"Bot: {response}\n")
        ---
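        Two of the automatic signals discussed for chit-chat can be sketched without a model. Perplexity is the exponential of the average negative log-likelihood per token. Distinct-n (the share of unique n-grams across replies) is a commonly used diversity measure that is not listed in the description above and is added here as an assumption, because it exposes the generic-reply problem directly. The log-probabilities and replies below are made up for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp(-mean log-probability); expects natural-log probabilities per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def distinct_n(responses, n=1):
    """Unique n-grams divided by total n-grams over a set of responses."""
    total, unique = 0, set()
    for response in responses:
        tokens = response.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# A model that assigns probability 0.25 to every token has perplexity 4
uniform_log_probs = [math.log(0.25)] * 10
print(f"perplexity: {perplexity(uniform_log_probs):.2f}")

generic = ["i do not know", "i do not know"]
varied = ["i love hiking", "pizza sounds great"]
print("distinct-1 generic:", distinct_n(generic), "| varied:", distinct_n(varied))
```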

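02.Task-oriented dialogue
    a.Description
        Task-oriented dialogue helps users complete a specific task such as booking tickets, looking up information, or making appointments.
        A pipeline system has four modules: natural language understanding (NLU), dialogue state tracking (DST), dialogue policy, and natural language generation (NLG); DST and policy together form dialogue management.
        Intent detection determines what the user wants (for example "book", "query", "cancel") and is usually a classification model.
        Slot filling extracts key details such as time, place, and party size, and is usually a sequence labeling model.
        End-to-end models such as SimpleTOD and PPTOD generate the system reply directly from the dialogue history, which simplifies the pipeline.
        Evaluation covers task success rate, number of turns, and user satisfaction, plus per-module accuracy for NLU, DST, and NLG.
    b.Code example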
        ---
        # 任务型对话系统
        from transformers import pipeline
        import torch

        class TaskOrientedDialogue:
            """任务型对话系统"""

            def __init__(self):
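                # Intent detection. NOTE: this checkpoint is an SST-2 sentiment model, so it
                # only outputs POSITIVE/NEGATIVE; it is a stand-in to show the wiring and
                # should be replaced by a classifier fine-tuned on real intent labels.
                self.intent_classifier = pipeline(
                    'text-classification',
                    model='distilbert-base-uncased-finetuned-sst-2-english'
                )

                # Slot filling (simplified: reuse a generic NER model).
                # aggregation_strategy='simple' merges word pieces into whole entities.
                self.slot_filler = pipeline('ner', model='dslim/bert-base-NER',
                                            aggregation_strategy='simple')

                # Dialogue state
                self.dialogue_state = {
                    'intent': None,
                    'slots': {},
                    'history': []
                }

            def process_user_input(self, user_input: str) -> dict:
                """Run NLU on one user turn and update the dialogue state."""
                # Intent detection
                intent_result = self.intent_classifier(user_input)[0]
                intent = intent_result['label']

                # Slot filling: one slot per entity type (PER, LOC, ORG, MISC)
                entities = self.slot_filler(user_input)
                slots = {}
                for entity in entities:
                    slot_name = entity['entity_group']
                    slot_value = entity['word']
                    slots[slot_name] = slot_value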

                # 更新对话状态
                self.dialogue_state['intent'] = intent
                self.dialogue_state['slots'].update(slots)
                self.dialogue_state['history'].append({
                    'user': user_input,
                    'intent': intent,
                    'slots': slots
                })

                return {
                    'intent': intent,
                    'slots': slots,
                    'confidence': intent_result['score']
                }

            def generate_response(self, nlu_result: dict) -> str:
                """生成系统回复"""
                intent = nlu_result['intent']
                slots = nlu_result['slots']

                # 简单的模板回复
                if intent == 'POSITIVE':
                    if slots:
                        return f"Great! I found: {', '.join(f'{k}={v}' for k, v in slots.items())}"
                    else:
                        return "That's wonderful! How can I assist you further?"
                else:
                    return "I understand. What would you like to do?"

        # 使用示例
        dialogue_system = TaskOrientedDialogue()

        user_input = "I want to book a flight to New York tomorrow"
        nlu_result = dialogue_system.process_user_input(user_input)
        print(f"Intent: {nlu_result['intent']}")
        print(f"Slots: {nlu_result['slots']}")

        response = dialogue_system.generate_response(nlu_result)
        print(f"System: {response}")
        ---
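        The class above covers NLU and a template reply, but not the dialogue state tracking and policy steps named in the description. The following is a minimal rule-based sketch of those two steps; the `REQUIRED_SLOTS` schema, the slot names, and the action tuples are hypothetical and only illustrate the control flow.

```python
# Hypothetical task schema: which slots each intent needs before it can be executed
REQUIRED_SLOTS = {
    'book_flight': ['destination', 'date'],
}

def update_state(state, intent, slots):
    """Dialogue state tracking: carry the intent forward and merge new slot values."""
    return {
        'intent': intent or state.get('intent'),
        'slots': {**state.get('slots', {}), **slots},
    }

def next_action(state):
    """Dialogue policy: ask for the first missing slot, otherwise confirm the task."""
    intent = state.get('intent')
    if intent is None:
        return ('request', 'intent')
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in state['slots']]
    if missing:
        return ('request', missing[0])
    return ('confirm', intent)

# Turn 1: the user names a destination but no date
state = update_state({}, 'book_flight', {'destination': 'New York'})
first_action = next_action(state)
print(first_action)

# Turn 2: the user only supplies the date; the intent is carried over
state = update_state(state, None, {'date': 'tomorrow'})
second_action = next_action(state)
print(second_action)
```

        Keeping the state update and the policy as separate pure functions makes each easy to test and to swap for a learned component later.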

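5.4 Creative Writing

01.Story generation
    a.Description
        Story generation requires plot construction, character development, and emotional expression, and is often used to showcase the creative ability of language models.
        Large models such as GPT-3 can continue a story from an opening or write one from a set of keywords.
        Plot planning uses knowledge graphs or script templates to guide generation so that the story stays logically coherent.
        Character consistency requires the model to remember character traits and keep behavior and personality stable across long texts.
        Evaluation covers fluency, coherence, creativity, and emotional expression, and relies mainly on human judgment.
        Applications include writing assistance for novels, game narrative generation, and children's stories.
    b.Code example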
        ---
        # 故事生成系统
        from transformers import GPT2LMHeadModel, GPT2Tokenizer
        import torch

        class StoryGenerator:
            def __init__(self):
                self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
                self.model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)

            def generate_story(self, prompt: str, max_length: int = 500) -> str:
                input_ids = self.tokenizer.encode(prompt, return_tensors='pt').to(self.device)
                # GPT-2 has no pad token, so reuse EOS as the padding id during generation
                with torch.no_grad():
                    output = self.model.generate(input_ids, max_length=max_length,
                                                temperature=0.9, top_p=0.95, do_sample=True,
                                                pad_token_id=self.tokenizer.eos_token_id)
                return self.tokenizer.decode(output[0], skip_special_tokens=True)

        generator = StoryGenerator()
        story = generator.generate_story("Once upon a time in a magical forest,", max_length=200)
        print(story)
        ---

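5.5 Hands-on Project

01.News summarization system
    a.Description
        A news summarization system extracts the key points of an article so users can get information quickly; it is widely used on news aggregation platforms.
        Datasets include CNN/DailyMail, XSum, and LCSTS (Chinese), each pairing source text with human-written summaries.
        Candidate models are pretrained models such as BART, T5, and Pegasus, fine-tuned on news data.
        Evaluation uses ROUGE together with human review of factual accuracy and readability.
        Deployment must meet latency requirements, so inference is optimized with techniques such as model quantization and batching.
        In production, monitor summary quality, collect user feedback, and keep improving the model.
    b.Code example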
        ---
        # 新闻摘要系统
        from transformers import PegasusForConditionalGeneration, PegasusTokenizer
        import torch

        class NewsSummarizer:
            def __init__(self):
                model_name = 'google/pegasus-xsum'
                self.tokenizer = PegasusTokenizer.from_pretrained(model_name)
                self.model = PegasusForConditionalGeneration.from_pretrained(model_name)
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)

            def summarize(self, article: str) -> str:
                # pegasus-xsum has 512 input positions, so truncate to 512 tokens
                inputs = self.tokenizer(article, max_length=512, return_tensors='pt',
                                       truncation=True).to(self.device)
                with torch.no_grad():
                    summary_ids = self.model.generate(inputs['input_ids'],
                                                      attention_mask=inputs['attention_mask'],
                                                      max_length=128, min_length=32, num_beams=4)
                return self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)

        summarizer = NewsSummarizer()
        article = """Scientists have discovered a new species of dinosaur in Argentina. 
        The dinosaur lived approximately 90 million years ago during the Cretaceous period."""
        summary = summarizer.summarize(article)
        print(f"Summary: {summary}")
        ---

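6 Question Answering Systems

6.1 Types of Question Answering Systems

01.Types of QA systems
    a.Description
        By knowledge source, QA systems divide into knowledge-base QA (KBQA), document QA (machine reading comprehension, MRC), and open-domain QA.
        KBQA works over a structured knowledge graph: the natural-language question is converted into a query such as SPARQL, and the system returns an exact answer.
        Document QA (reading comprehension) extracts the answer from a given document; in SQuAD, for example, the answer is a contiguous span of the passage.
        Open-domain QA combines retrieval and reading comprehension: it first retrieves relevant documents and then extracts the answer from them, for example DPR plus a BERT reader.
        Generative QA generates the answer text directly and is not limited to document spans; it is more flexible but can hallucinate.
        Metrics include EM (Exact Match) and F1 for answers, and MRR (Mean Reciprocal Rank) and NDCG for ranking.
    b.Code example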
        ---
        # 问答系统框架
        from transformers import pipeline
        import torch

        class QASystem:
            def __init__(self, qa_type: str = 'extractive'):
                self.qa_type = qa_type
                if qa_type == 'extractive':
                    self.qa_pipeline = pipeline('question-answering',
                                               model='distilbert-base-cased-distilled-squad')
                elif qa_type == 'generative':
                    self.qa_pipeline = pipeline('text2text-generation',
                                               model='google/flan-t5-base')
                else:
                    raise ValueError(f"Unsupported qa_type: {qa_type}")

            def answer(self, question: str, context: str) -> dict:
                # Dispatch on the configured type instead of inspecting the model class name
                if self.qa_type == 'extractive':
                    result = self.qa_pipeline(question=question, context=context)
                    return {'answer': result['answer'], 'score': result['score']}

                prompt = f"question: {question} context: {context}"
                result = self.qa_pipeline(prompt, max_length=100)
                return {'answer': result[0]['generated_text']}

        qa_system = QASystem(qa_type='extractive')
        context = "The Eiffel Tower is located in Paris, France. It was built in 1889."
        question = "Where is the Eiffel Tower?"
        answer = qa_system.answer(question, context)
        print(f"Answer: {answer['answer']}")
        ---
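        EM and F1, the answer-level metrics listed above, are usually computed after a light normalization step. This is a minimal sketch in the style of the SQuAD evaluation script (lowercasing, stripping punctuation and English articles); the function names and example strings are illustrative assumptions, and the official script should be used for reported numbers.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def token_f1(prediction, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower."))
print(token_f1("in Paris, France", "Paris"))
```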

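6.2 Reading Comprehension

01.SQuAD reading comprehension
    a.Description
        SQuAD (Stanford Question Answering Dataset) is one of the best-known reading comprehension datasets, with more than 100,000 question-answer pairs.
        The task is to extract the answer from a given paragraph; the answer must be a contiguous span of that paragraph.
        BERT-based models exceed the reported human baseline on SQuAD 1.1, with F1 above 90%.
        The model predicts the start and end positions of the answer, using a softmax over token positions for each.
        SQuAD 2.0 adds unanswerable questions, so the model must also decide whether a question can be answered.
        Chinese reading comprehension datasets include CMRC, DRCD, and DuReader.
    b.Code example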
        ---
        # SQuAD阅读理解模型
        from transformers import BertForQuestionAnswering, BertTokenizer
        import torch

        class SQuADModel:
            def __init__(self):
                self.tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
                self.model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)

            def answer_question(self, question: str, context: str) -> dict:
                inputs = self.tokenizer(question, context, return_tensors='pt', 
                                       max_length=512, truncation=True).to(self.device)

                with torch.no_grad():
                    outputs = self.model(**inputs)

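                # Most likely start and end token positions (end is made exclusive)
                answer_start = torch.argmax(outputs.start_logits).item()
                answer_end = torch.argmax(outputs.end_logits).item() + 1

                # Independent argmax can give end <= start; treat that as "no answer"
                if answer_end <= answer_start:
                    return {'answer': '', 'start': answer_start, 'end': answer_end}

                answer = self.tokenizer.decode(
                    inputs['input_ids'][0][answer_start:answer_end],
                    skip_special_tokens=True
                )

                return {'answer': answer, 'start': answer_start, 'end': answer_end}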

        squad_model = SQuADModel()
        context = "BERT is a transformer-based model developed by Google in 2018."
        question = "Who developed BERT?"
        result = squad_model.answer_question(question, context)
        print(f"Answer: {result['answer']}")
        ---
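        The model above picks the start and end positions with two independent argmax calls, which can produce an end before the start. A common alternative is to search for the best valid pair jointly. Here is a minimal sketch over plain Python lists; the function name, the `max_answer_len` default, and the toy logits are illustrative assumptions.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit
    subject to start <= end < start + max_answer_len (end is inclusive)."""
    best_start, best_end, best_score = 0, 0, float('-inf')
    for start, start_logit in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_start, best_end, best_score = start, end, score
    return best_start, best_end, best_score

# Toy logits where independent argmax gives start=2 but end=1 (an invalid span)
toy_start_logits = [0.1, 0.2, 5.0, 0.3]
toy_end_logits = [0.0, 4.0, 0.5, 3.0]

start, end, score = best_span(toy_start_logits, toy_end_logits)
print(f"best valid span: tokens {start}..{end} with score {score}")
```

        The double loop is quadratic in sequence length only up to the `max_answer_len` window, which is usually small enough in practice; production code typically restricts the search to the top few start and end candidates.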

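6.3 Knowledge Graph Question Answering

01.KBQA systems
    a.Description
        KBQA maps a natural-language question to a knowledge graph query and returns a structured answer.
        Entity linking finds entities in the question and links them to graph nodes, for example "姚明" (Yao Ming) to the Yao_Ming node.
        Relation extraction identifies the relation asked about, such as "birthplace" or "team played for", and maps it to a relation in the graph.
        Query generation converts the question into a SPARQL or Cypher query and executes it to get the answer.
        Common knowledge graphs include Freebase, DBpedia, Wikidata, and CN-DBpedia (Chinese).
        Metrics include precision, recall, and F1, as well as the accuracy of the generated queries.
    b.Code example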
        ---
        # 知识图谱问答
        from SPARQLWrapper import SPARQLWrapper, JSON

        class KBQASystem:
            def __init__(self, endpoint: str = "http://dbpedia.org/sparql"):
                self.sparql = SPARQLWrapper(endpoint)
                self.sparql.setReturnFormat(JSON)

            def query(self, sparql_query: str) -> list:
                self.sparql.setQuery(sparql_query)
                results = self.sparql.query().convert()
                return results['results']['bindings']

            def answer_question(self, entity: str, relation: str) -> str:
                query = f"""
                SELECT ?answer WHERE {{
                    <http://dbpedia.org/resource/{entity}> 
                    <http://dbpedia.org/ontology/{relation}> 
                    ?answer .
                }} LIMIT 1
                """
                results = self.query(query)
                if results:
                    return results[0]['answer']['value']
                return "No answer found"

        kbqa = KBQASystem()
        answer = kbqa.answer_question("Albert_Einstein", "birthPlace")
        print(f"Answer: {answer}")
        ---

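6.4 Retrieval-based Question Answering

01.DPR retrieval system
    a.Description
        Dense Passage Retrieval (DPR) uses a dual-encoder architecture: questions and passages are encoded separately and retrieved by vector similarity.
        The question encoder and passage encoder are usually BERT models, trained to maximize similarity for positive pairs and minimize it for negatives.
        Vector search libraries such as FAISS provide efficient nearest-neighbor search, with millisecond-level retrieval over millions of passages.
        After retrieval, a reading comprehension model extracts the answer from the top-k passages, forming a retrieve-then-read pipeline.
        Metrics include Recall@k (the fraction of questions whose top-k results contain the answer), MRR, and retrieval latency.
        Applications include search engines, customer service bots, and knowledge-base QA.
    b.Code example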
        ---
        # DPR检索系统
        from transformers import (
            DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
            DPRContextEncoder, DPRContextEncoderTokenizer,
        )
        import torch
        import faiss
        import numpy as np

        class DPRRetriever:
            def __init__(self):
                q_name = 'facebook/dpr-question_encoder-single-nq-base'
                ctx_name = 'facebook/dpr-ctx_encoder-single-nq-base'
                self.q_encoder = DPRQuestionEncoder.from_pretrained(q_name)
                self.ctx_encoder = DPRContextEncoder.from_pretrained(ctx_name)
                # Each encoder is paired with its own tokenizer
                self.q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
                self.ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.q_encoder.to(self.device)
                self.ctx_encoder.to(self.device)

            def encode_question(self, question: str) -> np.ndarray:
                inputs = self.q_tokenizer(question, return_tensors='pt').to(self.device)
                with torch.no_grad():
                    embeddings = self.q_encoder(**inputs).pooler_output
                return embeddings.cpu().numpy()

            def encode_contexts(self, contexts: list) -> np.ndarray:
                all_embeddings = []
                for ctx in contexts:
                    inputs = self.ctx_tokenizer(ctx, return_tensors='pt', truncation=True).to(self.device)
                    with torch.no_grad():
                        embeddings = self.ctx_encoder(**inputs).pooler_output
                    all_embeddings.append(embeddings.cpu().numpy())
                return np.vstack(all_embeddings)

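            def retrieve(self, question: str, contexts: list, top_k: int = 5) -> list:
                q_emb = self.encode_question(question)
                ctx_embs = self.encode_contexts(contexts)

                # Exact inner-product index; DPR scores passages by dot product
                index = faiss.IndexFlatIP(ctx_embs.shape[1])
                index.add(ctx_embs)

                # Never ask for more neighbours than there are passages (FAISS pads with -1)
                top_k = min(top_k, len(contexts))
                scores, indices = index.search(q_emb, top_k)
                return [(contexts[i], float(scores[0][j])) for j, i in enumerate(indices[0])]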

        retriever = DPRRetriever()
        contexts = ["Paris is the capital of France.", "London is the capital of UK."]
        results = retriever.retrieve("What is the capital of France?", contexts, top_k=1)
        print(f"Top result: {results[0][0]}")
        ---
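        The training objective in the description (raise the similarity of positive pairs, lower it for negatives) is commonly implemented with in-batch negatives: within a batch, passage i is the positive for question i and every other passage acts as a negative. This is a minimal pure-Python sketch of that loss on toy 2-dimensional vectors; the function names and vectors are illustrative assumptions, and real training would use tensors and an optimizer.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_negative_loss(question_vecs, passage_vecs):
    """Mean negative log-likelihood of the aligned passage under a softmax
    over dot-product scores against every passage in the batch."""
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]
        peak = max(scores)  # log-sum-exp with max subtraction for stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)

questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[1.0, 0.0], [0.0, 1.0]]   # passage i matches question i
swapped_passages = [[0.0, 1.0], [1.0, 0.0]]   # positives point the wrong way

aligned_loss = in_batch_negative_loss(questions, aligned_passages)
swapped_loss = in_batch_negative_loss(questions, swapped_passages)
print(f"aligned: {aligned_loss:.4f}  swapped: {swapped_loss:.4f}")
```

        The loss is lower when each question is closest to its own passage, which is the geometry the inner-product index relies on at retrieval time.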

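6.5 Generative Question Answering

01.RAG architecture
    a.Description
        Retrieval-Augmented Generation (RAG) combines retrieval and generation: it first retrieves relevant documents and then generates the answer conditioned on them.
        A RAG model consists of a retriever (DPR) and a generator (BART in the original RAG; T5 in related models), trained end to end or separately.
        Retrieval grounds the answer in evidence and reduces hallucination, which makes it well suited to knowledge-intensive tasks.
        FiD (Fusion-in-Decoder) encodes each retrieved passage separately together with the question, concatenates the encoder outputs, and lets the decoder attend over all of them to fuse evidence from multiple sources.
        Evaluation uses EM, F1, and ROUGE, and assesses retrieval quality and generation quality separately.
        Applications include open-domain QA, fact checking, and knowledge-intensive dialogue.
    b.Code example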
        ---
        # RAG问答系统
        from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration
        import torch

        class RAGSystem:
            def __init__(self):
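                self.tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
                # use_dummy_dataset=True loads a small sample index for demonstration only,
                # so answers will not be reliable; a real deployment needs the full index
                self.retriever = RagRetriever.from_pretrained(
                    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
                )
                self.model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=self.retriever)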
                self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
                self.model.to(self.device)

            def answer(self, question: str) -> str:
                inputs = self.tokenizer(question, return_tensors="pt").to(self.device)
                with torch.no_grad():
                    generated = self.model.generate(input_ids=inputs["input_ids"])
                return self.tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

        rag_system = RAGSystem()
        answer = rag_system.answer("Who won the Nobel Prize in Physics in 2020?")
        print(f"Answer: {answer}")
        ---
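        The `RagTokenForGeneration` class used above combines evidence per token: each retrieved document yields its own next-token distribution, and these are averaged using the retriever's document probabilities as weights. Here is a minimal numeric sketch of that marginalization; the function names, retrieval scores, and token probabilities are made-up values for illustration.

```python
import math

def softmax(scores):
    """Normalize retrieval scores into document probabilities p(z | x)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_next_token(doc_scores, per_doc_token_probs):
    """p(y_i | x) = sum over documents z of p(z | x) * p(y_i | x, z)."""
    doc_probs = softmax(doc_scores)
    vocab = set().union(*per_doc_token_probs)
    return {
        token: sum(p_z * probs.get(token, 0.0)
                   for p_z, probs in zip(doc_probs, per_doc_token_probs))
        for token in vocab
    }

# Two retrieved documents with equal retrieval scores (hypothetical values)
toy_doc_scores = [0.0, 0.0]
toy_token_probs = [
    {'Paris': 0.9, 'London': 0.1},   # generator conditioned on document 1
    {'Paris': 0.5, 'London': 0.5},   # generator conditioned on document 2
]

mixed = marginal_next_token(toy_doc_scores, toy_token_probs)
print({token: round(p, 3) for token, p in sorted(mixed.items())})
```

        A document with a higher retrieval score contributes more to the mixture, so better retrieval shifts the generated answer toward better evidence.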

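6.6 Hands-on Project

01.Customer service QA system
    a.Description
        A customer service bot combines FAQ matching, knowledge-base retrieval, and dialogue generation to provide 24/7 service.
        FAQ matching uses a semantic similarity model to map the user question to a predefined question-answer pair.
        Knowledge-base retrieval finds relevant information in product documentation and help-center articles, using DPR plus a reading comprehension model.
        Dialogue generation handles complex questions and produces personalized replies with a generative model such as GPT.
        Human handoff transfers the conversation to a human agent when model confidence is low or the user is dissatisfied.
        Metrics include resolution rate, user satisfaction, average handling time, and handoff rate.
    b.Code example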
        ---
        # 智能客服系统
        from transformers import pipeline
        from sentence_transformers import SentenceTransformer, util
        import torch

        class CustomerServiceBot:
            def __init__(self):
                self.qa_pipeline = pipeline('question-answering')
                self.similarity_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
                self.faq_database = {
                    "How do I reset my password?": "Go to Settings > Account > Reset Password",
                    "What are your business hours?": "We are open 24/7",
                    "How can I contact support?": "Email us at [email protected]"
                }
                self.faq_embeddings = self.similarity_model.encode(list(self.faq_database.keys()))

            def match_faq(self, question: str, threshold: float = 0.7):
                """Return (answer, similarity) for the closest FAQ entry, or None below the threshold."""
                q_embedding = self.similarity_model.encode(question)
                similarities = util.cos_sim(q_embedding, self.faq_embeddings)[0]
                best_match_idx = torch.argmax(similarities).item()
                best_score = similarities[best_match_idx].item()

                if best_score > threshold:
                    matched_question = list(self.faq_database.keys())[best_match_idx]
                    return self.faq_database[matched_question], best_score
                return None

            def answer(self, question: str, context: str = None) -> dict:
                # Order: FAQ match, then extractive QA over the given context, then human handoff.
                # 'confidence' is always a float so callers can compare it against a threshold.
                faq_match = self.match_faq(question)
                if faq_match:
                    faq_answer, score = faq_match
                    return {'answer': faq_answer, 'source': 'FAQ', 'confidence': score}

                if context:
                    result = self.qa_pipeline(question=question, context=context)
                    return {'answer': result['answer'], 'source': 'Knowledge Base', 'confidence': result['score']}

                return {'answer': "I'm not sure. Would you like to speak with a human agent?", 'source': 'fallback', 'confidence': 0.0}

        bot = CustomerServiceBot()
        response = bot.answer("How do I reset my password?")
        print(f"Answer: {response['answer']} (Source: {response['source']})")
        ---

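7 Information Extraction

7.1 Relation Extraction

01.Relation triple extraction
    a.Description
        Relation extraction identifies entity pairs and the relation between them, producing (subject, relation, object) triples.
        Pipeline methods first recognize entities and then classify the relation of each entity pair; they are simple, but entity errors propagate to the relation step.
        Joint methods extract entities and relations together, using sequence labeling or pointer networks, and usually perform better.
        Distant supervision uses an existing knowledge graph to label training data automatically, which eases the high cost of manual annotation.
        Evaluation uses precision, recall, and F1; under strict matching, a triple counts only if both entities and the relation are correct.
        Applications include knowledge graph construction and question answering.
    b.Code example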
        ---
        # 关系抽取系统
        from transformers import pipeline
        import spacy

        class RelationExtractor:
            def __init__(self):
                self.nlp = spacy.load("en_core_web_sm")
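                # aggregation_strategy="simple" merges word pieces into whole entities
                self.ner_pipeline = pipeline("ner", model="dslim/bert-base-NER",
                                             aggregation_strategy="simple")

            def extract_entities(self, text: str) -> list:
                entities = self.ner_pipeline(text)
                return [(e['word'], e['entity_group']) for e in entities]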

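            def extract_relations(self, text: str) -> list:
                """Heuristic triples from the dependency parse (a sketch, not a trained extractor)."""
                doc = self.nlp(text)
                relations = []

                for token in doc:
                    # Only subjects (active or passive) start a triple
                    if token.dep_ not in ('nsubj', 'nsubjpass'):
                        continue
                    verb = token.head
                    subject = ' '.join(t.text for t in token.subtree)
                    for child in verb.children:
                        if child.dep_ == 'dobj':
                            # subject - verb - direct object
                            relations.append((subject, verb.text, child.text))
                        elif child.dep_ == 'prep':
                            # subject - verb + preposition - prepositional object ("born in Hawaii")
                            for pobj in child.children:
                                if pobj.dep_ == 'pobj':
                                    relations.append((subject, f"{verb.text} {child.text}", pobj.text))

                return relations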

        extractor = RelationExtractor()
        text = "Barack Obama was born in Hawaii and served as President."
        relations = extractor.extract_relations(text)
        print(f"Relations: {relations}")
        ---
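        Strict-match evaluation, as described above, treats each triple as correct only when subject, relation, and object all agree with the gold annotation. This is a minimal sketch of that scoring; the function name and the example triples are illustrative assumptions.

```python
def triple_prf(predicted, gold):
    """Precision, recall, and F1 over exact (subject, relation, object) matches."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if true_positives == 0:
        return {'precision': precision, 'recall': recall, 'f1': 0.0}
    f1 = 2 * precision * recall / (precision + recall)
    return {'precision': precision, 'recall': recall, 'f1': f1}

gold_triples = [
    ('Barack Obama', 'born in', 'Hawaii'),
    ('Barack Obama', 'served as', 'President'),
]
predicted_triples = [
    ('Barack Obama', 'born in', 'Hawaii'),     # exact match
    ('Barack Obama', 'born in', 'Honolulu'),   # wrong object, counts as an error
]

triple_scores = triple_prf(predicted_triples, gold_triples)
print(triple_scores)
```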

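7.2 Event Extraction

01.Event role labeling
    a.Description
        Event extraction identifies events in text together with their participants, time, location, and other elements.
        Trigger detection decides which words evoke an event; for example, "earthquake" triggers a disaster event.
        Argument extraction identifies the roles of an event, such as participants, time, and location.
        Datasets such as ACE and ERE define standard event types and role inventories.
        Evaluation reports F1 for trigger identification, argument identification, and full event extraction.
        Applications include public opinion monitoring, financial event analysis, and news summarization.
    b.Code example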
        ---
        # 事件抽取系统
        import spacy
        from collections import defaultdict

        class EventExtractor:
            def __init__(self):
                self.nlp = spacy.load("en_core_web_sm")
                self.event_triggers = {'attack', 'earthquake', 'election', 'merger'}

            def extract_events(self, text: str) -> list:
                doc = self.nlp(text)
                events = []

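                for token in doc:
                    if token.lemma_ in self.event_triggers:
                        event = {
                            'trigger': token.text,
                            'type': token.lemma_,
                            'arguments': {}
                        }

                        # A nominal trigger in subject position ("The earthquake struck ...")
                        # has no argument children of its own, so read them off its governing verb
                        is_nominal_subject = token.pos_ == 'NOUN' and token.dep_ in ('nsubj', 'nsubjpass')
                        anchor = token.head if is_nominal_subject else token

                        for child in anchor.children:
                            if child.i == token.i:
                                continue  # skip the trigger itself
                            if child.dep_ == 'nsubj':
                                event['arguments']['agent'] = child.text
                            elif child.dep_ in ['dobj', 'pobj']:
                                event['arguments']['patient'] = child.text

                        events.append(event)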

                return events

        extractor = EventExtractor()
        text = "The earthquake struck Japan yesterday."
        events = extractor.extract_events(text)
        print(f"Events: {events}")
        ---
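        The F1 metrics named for evaluation in this section can be made concrete with a small sketch. It assumes, purely for illustration, that triggers are compared as (sentence_id, trigger_word, event_type) tuples and arguments as (sentence_id, event_type, role, text) tuples under exact match. The f1_score helper and the sample annotations are hypothetical; official ACE-style scorers apply their own matching rules.

```python
from typing import Iterable


def f1_score(predicted: Iterable[tuple], gold: Iterable[tuple]) -> dict:
    """Precision, recall, and F1 over two collections of hashable items (exact match)."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations: (sentence_id, trigger_word, event_type)
gold_triggers = {(0, "earthquake", "Disaster"), (1, "attacked", "Conflict")}
pred_triggers = {(0, "earthquake", "Disaster"), (1, "attacked", "Election")}

# Hypothetical annotations: (sentence_id, event_type, role, argument_text)
gold_args = {(0, "Disaster", "place", "Japan"), (0, "Disaster", "time", "yesterday")}
pred_args = {(0, "Disaster", "place", "Japan")}

trigger_scores = f1_score(pred_triggers, gold_triggers)
argument_scores = f1_score(pred_args, gold_args)
print("trigger:", trigger_scores)
print("argument:", argument_scores)
```

        Because a trigger only counts as correct when its event type also matches, the mistyped "attacked" prediction lowers both precision and recall. This is why trigger identification and trigger classification are often reported separately.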

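7.3 Knowledge Graph Construction

01.Knowledge Graph Pipeline
    a.Description
        Building a knowledge graph involves entity recognition, relation extraction, entity linking, knowledge fusion, and quality control.
        Entity linking maps entity mentions in text to knowledge base entities and resolves ambiguity; for example, "Apple" may refer to the company or the fruit.
        Knowledge fusion merges knowledge from multiple sources, resolves conflicts, and fills in missing information.
        Quality control detects erroneous triples using rules, statistics, and machine learning methods.
        Storage uses graph databases such as Neo4j and JanusGraph, which support efficient graph queries.
        Applications include intelligent search, recommender systems, question answering, and decision support.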
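        As a rough illustration of the entity linking step, the sketch below disambiguates "Apple" by counting how many of a candidate's descriptive keywords appear in the surrounding sentence. The KNOWLEDGE_BASE dictionary, its entity ids, and the link_entity function are all hypothetical; production linkers typically rely on learned embeddings and popularity priors rather than raw keyword overlap.

```python
from typing import Dict, List, Optional

# Hypothetical mini knowledge base: each candidate entity has an id and
# a set of context keywords that describe it.
KNOWLEDGE_BASE: Dict[str, List[dict]] = {
    "apple": [
        {"id": "Apple_Inc", "keywords": {"iphone", "company", "stock", "ceo", "mac"}},
        {"id": "Apple_fruit", "keywords": {"fruit", "eat", "tree", "juice", "pie"}},
    ],
}


def link_entity(mention: str, context: str) -> Optional[str]:
    """Return the id of the candidate whose keywords overlap most with the context."""
    candidates = KNOWLEDGE_BASE.get(mention.lower(), [])
    if not candidates:
        return None  # the mention is not in the knowledge base
    context_words = {word.strip(".,!?").lower() for word in context.split()}
    # max() keeps the first candidate on ties, so list order acts as a crude prior
    best = max(candidates, key=lambda c: len(c["keywords"] & context_words))
    return best["id"]


company = link_entity("Apple", "Apple released a new iPhone and its stock rose.")
fruit = link_entity("Apple", "She baked an apple pie with fruit from the tree.")
print(company, fruit)
```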
    b.Code Example
        ---
        # Knowledge graph construction
        from py2neo import Graph, Node, Relationship

        class KnowledgeGraphBuilder:
            def __init__(self, uri="bolt://localhost:7687", user="neo4j", password="password"):
                self.graph = Graph(uri, auth=(user, password))

            def add_entity(self, name: str, entity_type: str, properties: dict = None):
                node = Node(entity_type, name=name, **(properties or {}))
                # merge() matches on (label, name), so re-adding an entity does not duplicate it
                self.graph.merge(node, entity_type, "name")
                return node

            def add_relation(self, subject: str, relation: str, obj: str) -> bool:
                # 'obj' rather than 'object' to avoid shadowing the Python builtin
                subj_node = self.graph.nodes.match(name=subject).first()
                obj_node = self.graph.nodes.match(name=obj).first()

                if subj_node is None or obj_node is None:
                    return False  # an endpoint is missing, so nothing was created
                rel = Relationship(subj_node, relation, obj_node)
                self.graph.create(rel)
                return True

            def query(self, cypher_query: str):
                return self.graph.run(cypher_query).data()

        kg = KnowledgeGraphBuilder()
        kg.add_entity("Albert Einstein", "Person", {"birth_year": 1879})
        kg.add_entity("Physics", "Field")
        kg.add_relation("Albert Einstein", "WORKS_IN", "Physics")
        ---
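        The knowledge fusion step described in this section (merging multi-source knowledge and resolving conflicts) can be sketched without a database. The example assumes one simple policy, majority vote across sources, applied only to relations declared single-valued. The fuse_triples function, the source names, and the deliberately noisy sample triples are hypothetical, and real systems usually also weight sources by reliability.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def fuse_triples(sources: Dict[str, List[Triple]],
                 functional_relations: Set[str]) -> List[Triple]:
    """Merge triples from several sources.

    Relations in functional_relations allow one object per subject, so
    conflicting objects are resolved by majority vote across sources.
    All other relations keep every distinct triple.
    """
    votes = defaultdict(Counter)  # (subject, relation) -> Counter of objects
    for triples in sources.values():
        for subj, rel, obj in set(triples):  # each source votes once per triple
            votes[(subj, rel)][obj] += 1

    fused = []
    for (subj, rel), counter in votes.items():
        if rel in functional_relations:
            winner, _ = counter.most_common(1)[0]
            fused.append((subj, rel, winner))
        else:
            fused.extend((subj, rel, obj) for obj in counter)
    return sorted(fused)


# Hypothetical sources; the "forum" birth year is intentionally wrong
sources = {
    "wiki": [("Albert Einstein", "BIRTH_YEAR", "1879"),
             ("Albert Einstein", "WORKS_IN", "Physics")],
    "news": [("Albert Einstein", "BIRTH_YEAR", "1879")],
    "forum": [("Albert Einstein", "BIRTH_YEAR", "1897"),
              ("Albert Einstein", "WORKS_IN", "Mathematics")],
}
fused = fuse_triples(sources, functional_relations={"BIRTH_YEAR"})
print(fused)
```

        The fused list could then be written to the graph with add_entity and add_relation. Running fusion before loading keeps conflicting values out of the database, so no clean-up pass is needed afterwards.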

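7.4 Hands-on Project

01.Enterprise Knowledge Graph System
    a.Description
        An enterprise knowledge graph integrates knowledge from internal documents, databases, and business systems to support intelligent search and decision-making.
        Data sources include product documentation, customer data, transaction records, emails, and meeting minutes.
        Entity types cover business entities such as products, customers, employees, projects, and contracts.
        Relation types include supplier relationships, customer relationships, project participation, and contract signing.
        Application scenarios include intelligent search, risk alerts, sales opportunity discovery, and decision support.
        Deployment must account for data security, access control, real-time updates, and scalability.
    b.Code Example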
        ---
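        # Enterprise knowledge graph system
        import re
        from py2neo import Graph
        import pandas as pd

        # Labels and relationship types cannot be passed as Cypher parameters,
        # so validate them before interpolating them into the query text.
        _IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

        def _safe_identifier(value) -> str:
            if not _IDENTIFIER.match(str(value)):
                raise ValueError(f"Invalid label or relationship type: {value!r}")
            return str(value)

        class EnterpriseKG:
            def __init__(self):
                self.graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

            def import_from_csv(self, entity_file: str, relation_file: str):
                # Read every column as str so ids match between the two files
                entities = pd.read_csv(entity_file, dtype=str)
                for _, row in entities.iterrows():
                    label = _safe_identifier(row['type'])
                    # Values go in as parameters: no quoting bugs, no Cypher injection
                    self.graph.run(f"CREATE (:{label} {{name: $name, id: $id}})",
                                   {"name": row['name'], "id": row['id']})

                relations = pd.read_csv(relation_file, dtype=str)
                for _, row in relations.iterrows():
                    rel_type = _safe_identifier(row['relation'])
                    query = f"""
                    MATCH (a {{id: $subject_id}}), (b {{id: $object_id}})
                    CREATE (a)-[:{rel_type}]->(b)
                    """
                    self.graph.run(query, {"subject_id": row['subject_id'],
                                           "object_id": row['object_id']})

            def search(self, keyword: str):
                query = "MATCH (n) WHERE n.name CONTAINS $keyword RETURN n LIMIT 10"
                return self.graph.run(query, {"keyword": keyword}).data()

            def find_path(self, start: str, end: str):
                query = """
                MATCH path = shortestPath((a {name: $start})-[*]-(b {name: $end}))
                RETURN path
                """
                return self.graph.run(query, {"start": start, "end": end}).data()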

        kg = EnterpriseKG()
        results = kg.search("Project")
        print(f"Search results: {results}")
        ---
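        Since the import step reads entities and relations from CSV, a pre-import check can catch bad rows before they reach the graph. The sketch below assumes the column names used in the example above (id, name, type for entities; subject_id, relation, object_id for relations). The validate_import function and the sample rows are hypothetical and cover only three checks: duplicate ids, dangling references, and type names that are not plain identifiers.

```python
import csv
import io
import re
from typing import List

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate_import(entity_csv: str, relation_csv: str) -> List[str]:
    """Return readable problems; an empty list means the data passed these checks."""
    problems = []
    seen_ids = set()
    for row in csv.DictReader(io.StringIO(entity_csv)):
        if row["id"] in seen_ids:
            problems.append(f"duplicate entity id: {row['id']}")
        seen_ids.add(row["id"])
        if not IDENTIFIER.match(row["type"]):
            problems.append(f"invalid entity type: {row['type']}")

    for row in csv.DictReader(io.StringIO(relation_csv)):
        for column in ("subject_id", "object_id"):
            if row[column] not in seen_ids:
                problems.append(f"unknown {column}: {row[column]}")
        if not IDENTIFIER.match(row["relation"]):
            problems.append(f"invalid relation type: {row['relation']}")
    return problems


# Hypothetical sample data with three deliberate errors
entities = (
    "id,name,type\n"
    "P1,Project Atlas,Project\n"
    "C1,Acme Corp,Customer\n"
    "C1,Acme Ltd,Customer\n"
)
relations = (
    "subject_id,relation,object_id\n"
    "C1,SIGNED,P1\n"
    "C2,OWNS,P1\n"
    "C1,signed with,P1\n"
)
problems = validate_import(entities, relations)
for problem in problems:
    print(problem)
```

        Rejecting a file that reports any problem is the simplest policy. A more lenient pipeline could skip only the offending rows and log them for review.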