Initialization script for the document cleaning system

2025-05-16 11:30:02 +08:00
parent a73040d739
commit 532eb2857c
29 changed files with 11568 additions and 225 deletions
--- a/cxs/README.md
+++ b/cxs/README.md
@@ -0,0 +1,120 @@
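+# Document Processing System
+
+This is a FastAPI-based document processing system that converts DOC, DOCX, and PDF files to plain text.
+
+## System Requirements
+
+### Required Components
+
+1. Python 3.8 or later
+2. LibreOffice (used for document format conversion)
+   - Download: https://www.libreoffice.org/download/download/
+   - After installation, add the install directory (on Windows, usually `C:\Program Files\LibreOffice\program`) to the system PATH environment variable
+
+3. Tesseract OCR (used to recognize text in images)
+   - Download: https://github.com/UB-Mannheim/tesseract/wiki
+   - During installation, select the "Add to system path" option
+
+### Python Dependencies
+
+All required Python packages are listed in `requirements.txt`. Install them with: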
+
+```bash
+pip install -r requirements.txt
+```
+
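+## Features
+
+- Supports DOC, DOCX, and PDF file formats
+- Provides a simple drag-and-drop upload interface
+- Automatically cleans document content and removes redundant information
+- Outputs clean plain-text files
+- Automatically extracts the document body
+- Recognizes text in images (OCR)
+- Automatically separates the body from appendices
+- Automatically downloads the processed text file
+
+## Installation
+
+1. Clone the project locally: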
+```bash
+git clone <repository_url>
+cd doc-etl
+```
+
+2. Create a virtual environment (recommended):
+```bash
+python -m venv venv
+source venv/bin/activate  # Linux/Mac
+venv\Scripts\activate     # Windows
+```
+
+3. Install the dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+## Running
+
+1. Start the server:
+```bash
+uvicorn main:app --reload
+```
+
+2. Open the following address in a browser:
+```
+http://localhost:8000
+```
+
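+3. Upload files through the web interface; the system processes them automatically and returns the converted text file.
+
+## Directory Structure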
+
+```
+doc-etl/
+├── main.py              # Main application
+├── requirements.txt     # Project dependencies
+├── README.md           # Project documentation
+├── static/             # Static files
+│   └── index.html      # Upload page
+├── temp/               # Temporary files directory
+└── cxs/               # Document processing module
+    └── cxs_doc_cleaner.py
+```
+
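+## Notes
+
+1. Make sure Python 3.8 or later is installed
+2. Processing PDF files requires additional dependencies to be installed
+3. All temporary files are cleaned up automatically in the following cases:
+   - After a file has been processed
+   - When an error occurs
+   - After the file download completes
+   - When the program exits
+4. Temporary files are stored in the `temp` directory, which is cleared automatically when the program starts
+
+## Changelog
+
+### 2024-03-21
+- Initial release
+- Basic document processing support
+- Added the file upload interface
+
+### 2024-03-xx
+- Optimized the file processing logic
+- Added more detailed error handling
+- Improved file type validation
+- Added a processing progress display
+- Strengthened automatic cleanup of temporary files
+
+### 2024-03-22
+- Optimized batch processing so that files are processed sequentially
+- Added a real-time display of file processing status
+- Improved the temporary file cleanup mechanism
+- Improved error handling and error messages
+
+### 2024-03-23
+- Changed the naming rule for image folders to the "filename_randomID" format
+- Improved file cleanup so that files are removed immediately after processing
+- Added more log output to make debugging easier
+- Optimized how and when the temporary directory is managed and cleaned up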
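The README above asks for LibreOffice and Tesseract to be reachable through PATH. A minimal sketch of how that could be checked before starting the server follows. The executable names `soffice` and `tesseract` and the helper `find_missing_tools` are assumptions made for illustration; they are not part of this commit.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ("soffice", "tesseract")


def find_missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be located on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All external tools were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.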
--- a/cxs/_optimize_for_chinese.py
+++ b/cxs/_optimize_for_chinese.py
@@ -0,0 +1,285 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+"""
+Image preprocessing optimizations for Chinese OCR
+"""
+
+import cv2
+import numpy as np
+from typing import Optional, Tuple, List, Dict, Any
+
+
+def optimize_for_chinese(image: np.ndarray) -> np.ndarray:
+    """
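+    Optimize an image containing Chinese text for OCR
+
+    Args:
+        image: Input image as a NumPy array
+
+    Returns:
+        Optimized image as a NumPy array
+    """
+    # Make sure the image is not empty
+    if image is None or image.size == 0:
+        raise ValueError("Input image is empty")
+
+    # Convert to grayscale
+    if len(image.shape) == 3:
+        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
+    else:
+        gray = image.copy()
+
+    # 1. Adaptive thresholding - works well for images of varying resolution and contrast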
+    binary = cv2.adaptiveThreshold(
+        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
+        cv2.THRESH_BINARY_INV, 25, 15
+    )
+    
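+    # 2. Apply morphological operations to the binary image to make the text clearer
+    # Create a rectangular kernel that is narrower horizontally and taller vertically
+    # This helps keep the strokes of Chinese characters connected
+    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 3))
+
+    # Closing - joins broken parts, which is especially useful for thin strokes
+    morph = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=1)
+
+    # 3. Denoising - remove small specks
+    # Find all external contours
+    contours, _ = cv2.findContours(morph, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
+
+    # Build a mask of the regions to keep
+    mask = np.zeros_like(morph)
+
+    # Filter contours - keep the larger ones (text), drop the smaller ones (noise)
+    min_contour_area = 20  # Minimum contour area; tune as needed
+    for contour in contours:
+        if cv2.contourArea(contour) > min_contour_area:
+            cv2.drawContours(mask, [contour], -1, 255, -1)
+
+    # Apply the mask to the strokes so that enclosed holes (e.g. inside 口) are not filled in
+    cleaned = cv2.bitwise_and(morph, mask)
+
+    # 4. Invert back - OCR engines generally expect dark text on a light background
+    cleaned_inverted = cv2.bitwise_not(cleaned)
+
+    # 5. Sharpen the image to make the outlines crisper
+    # Create a sharpening kernel
+    sharpen_kernel = np.array([[-1,-1,-1], 
+                               [-1, 9,-1], 
+                               [-1,-1,-1]])
+
+    sharpened = cv2.filter2D(cleaned_inverted, -1, sharpen_kernel)
+
+    # 6. Make sure the image is fully binarized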
+    _, final = cv2.threshold(sharpened, 127, 255, cv2.THRESH_BINARY)
+    
+    return final
+
+
+def optimize_for_chinese_advanced(image: np.ndarray) -> List[np.ndarray]:
+    """
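+    Apply several advanced optimizations for Chinese text and return every result
+
+    Args:
+        image: Input image as a NumPy array
+
+    Returns:
+        List of optimized images as NumPy arrays
+    """
+    # Make sure the image is not empty
+    if image is None or image.size == 0:
+        raise ValueError("Input image is empty")
+
+    # Convert to grayscale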
+    if len(image.shape) == 3:
+        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
+    else:
+        gray = image.copy()
+    
+    results = []
+    
+    # Method 1: basic adaptive thresholding
+    binary1 = cv2.adaptiveThreshold(
+        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
+        cv2.THRESH_BINARY, 25, 15
+    )
+    results.append(binary1)
+
+    # Method 2: enhanced adaptive thresholding (mean-based, larger block size)
+    binary2 = cv2.adaptiveThreshold(
+        gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, 
+        cv2.THRESH_BINARY, 35, 15
+    )
+    results.append(binary2)
+
+    # Method 3: Otsu thresholding
+    _, binary3 = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    results.append(binary3)
+
+    # Method 4: Gaussian blur followed by Otsu thresholding
+    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
+    _, binary4 = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    results.append(binary4)
+
+    # Method 5: thresholding after contrast enhancement
+    # Create a CLAHE object
+    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
+    # Apply CLAHE to enhance contrast
+    contrast_enhanced = clahe.apply(gray)
+    _, binary5 = cv2.threshold(contrast_enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    results.append(binary5)
+
+    # Method 6: the basic optimization function
+    basic_optimized = optimize_for_chinese(image)
+    results.append(basic_optimized)
+
+    # Method 7: morphological operations
+    # Binarize first
+    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    # Create an elliptical kernel
+    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
+    # Opening removes specks
+    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel, iterations=1)
+    # Closing joins broken strokes
+    morph = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel, iterations=1)
+    results.append(morph)
+
+    # Method 8: sharpening
+    sharpen_kernel = np.array([[-1,-1,-1], 
+                               [-1, 9,-1], 
+                               [-1,-1,-1]])
+    sharpened = cv2.filter2D(gray, -1, sharpen_kernel)
+    _, binary8 = cv2.threshold(sharpened, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    results.append(binary8)
+
+    # Method 9: edge enhancement
+    # Apply a Gaussian blur first
+    blurred = cv2.GaussianBlur(gray, (0, 0), 3)
+    # Then apply unsharp masking
+    edge_enhanced = cv2.addWeighted(gray, 1.5, blurred, -0.5, 0)
+    _, binary9 = cv2.threshold(edge_enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+    results.append(binary9)
+    
+    return results
+
+
+def detect_and_correct_skew(image: np.ndarray, angle_range: Tuple[int, int] = (-15, 15), angle_step: float = 0.5) -> np.ndarray:
+    """
+    Detect and correct the skew of text in an image
+
+    Args:
+        image: Input image as a NumPy array
+        angle_range: Range of skew angles to search, in degrees
+        angle_step: Step size of the angle search, in degrees
+
+    Returns:
+        The deskewed image
+    """
+    # Make sure the image is not empty
+    if image is None or image.size == 0:
+        raise ValueError("Input image is empty")
+
+    # Convert to grayscale
+    if len(image.shape) == 3:
+        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
+    else:
+        gray = image.copy()
+
+    # Binarize (text becomes white on a black background)
+    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
+
+    # Score every candidate rotation angle
+    scores = []
+    angles = np.arange(angle_range[0], angle_range[1] + angle_step, angle_step)
+
+    # Get the center point
+    center = (binary.shape[1] // 2, binary.shape[0] // 2)
+
+    for angle in angles:
+        # Rotate the image
+        rotation_matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
+        rotated = cv2.warpAffine(binary, rotation_matrix, (binary.shape[1], binary.shape[0]),
+                                flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_CONSTANT, borderValue=0)
+
+        # Sum the pixels in each row
+        row_sums = np.sum(rotated, axis=1)
+        # Use the variance of the row sums as the score: well-aligned text lines give a high variance
+        score = np.var(row_sums)
+        scores.append(score)
+
+    # Pick the best angle
+    best_angle_index = np.argmax(scores)
+    best_angle = angles[best_angle_index]
+
+    # Rotate the original image
+    rotation_matrix = cv2.getRotationMatrix2D(center, best_angle, 1.0)
+    rotated_image = cv2.warpAffine(image, rotation_matrix, (image.shape[1], image.shape[0]),
+                                 flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_CONSTANT)
+    
+    return rotated_image
+
+
+def process_image_for_chinese_ocr(image: np.ndarray, correct_skew: bool = True) -> Dict[str, Any]:
+    """
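+    Complete image preprocessing pipeline for Chinese OCR
+
+    Args:
+        image: Input image as a NumPy array
+        correct_skew: Whether to apply skew correction
+
+    Returns:
+        Dictionary containing the original image and the various processing results
+    """
+    # Validate up front so that a missing image raises ValueError rather than AttributeError
+    if image is None or image.size == 0:
+        raise ValueError("Input image is empty")
+
+    result = {
+        'original': image.copy()
+    }
+
+    # Step 1: skew correction (if requested)
+    if correct_skew:
+        corrected = detect_and_correct_skew(image)
+        result['deskewed'] = corrected
+        # Use the corrected image for the remaining steps
+        working_image = corrected
+    else:
+        working_image = image
+
+    # Step 2: apply the basic Chinese-text optimization
+    optimized = optimize_for_chinese(working_image)
+    result['optimized'] = optimized
+
+    # Step 3: apply the advanced optimizations and collect every result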
+    advanced_results = optimize_for_chinese_advanced(working_image)
+    for i, img in enumerate(advanced_results):
+        result[f'method_{i+1}'] = img
+    
+    return result
+
+
+if __name__ == "__main__":
+    # Simple command-line test
+    import os
+    import sys
+    if len(sys.argv) > 1:
+        input_image_path = sys.argv[1]
+        output_dir = sys.argv[2] if len(sys.argv) > 2 else "."
+
+        # Read the image
+        image = cv2.imread(input_image_path)
+        if image is None:
+            print(f"Could not read image: {input_image_path}")
+            sys.exit(1)
+
+        # Process the image
+        result = process_image_for_chinese_ocr(image)
+
+        # Save the results, creating the output directory if needed
+        os.makedirs(output_dir, exist_ok=True)
+        cv2.imwrite(os.path.join(output_dir, "original.png"), result['original'])
+        cv2.imwrite(os.path.join(output_dir, "optimized.png"), result['optimized'])
+
+        if 'deskewed' in result:
+            cv2.imwrite(os.path.join(output_dir, "deskewed.png"), result['deskewed'])
+
+        for i in range(1, 10):
+            key = f'method_{i}'
+            if key in result:
+                cv2.imwrite(os.path.join(output_dir, f"{key}.png"), result[key])
+
+        print(f"Done; results saved to {output_dir}")
+    else:
+        print("Usage: python _optimize_for_chinese.py <input image path> [output directory]")
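`detect_and_correct_skew` above scores each candidate angle by the variance of the per-row pixel sums. The idea can be illustrated without OpenCV on two tiny hand-made binary grids. This is only a sketch: the helper `row_profile_score` and both grids are made up for illustration and are not part of this commit.

```python
from statistics import pvariance
from typing import List


def row_profile_score(binary_rows: List[List[int]]) -> float:
    """Variance of the per-row foreground counts (illustrative helper)."""
    return pvariance([sum(row) for row in binary_rows])


# Text lines aligned with the pixel rows: ink is concentrated in a few rows.
aligned = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

# The same amount of ink, but spread evenly over the rows as skewed text would be.
skewed = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

if __name__ == "__main__":
    print(row_profile_score(aligned), row_profile_score(skewed))
```

The aligned grid scores higher because its row sums alternate between full and empty, which is why the function above takes the angle with the maximum variance as the best deskew angle.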
--- a/cxs/cxs_doc_cleaner.py
+++ b/cxs/cxs_doc_cleaner.py
--- a/cxs/cxs_pdf_cleaner.py
+++ b/cxs/cxs_pdf_cleaner.py
--- a/cxs/cxs_table_processor.py
+++ b/cxs/cxs_table_processor.py
@@ -0,0 +1,961 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+import re
+from typing import List, Dict, Any, Optional
+import os
+from docx.oxml import parse_xml
+from docx.oxml.ns import nsdecls
+
+# Custom TableData class used to store table data
+class TableData:
+    def __init__(self):
+        """
+        Initialize the table data structure
+        """
+        self.rows = []
+        self.style = None
+        self.columns = []  # Column indices
+
+    def cell(self, row_idx: int, col_idx: int) -> Dict[str, Any]:
+        """
+        Get a table cell
+
+        Args:
+            row_idx: Row index
+            col_idx: Column index
+
+        Returns:
+            Dict: Cell data
+        """
+        try:
+            # First check that the row index is valid
+            if row_idx < 0 or row_idx >= len(self.rows):
+                return {'text': '', 'gridspan': 1, 'vmerge': None}
+
+            # Then check that the column index is valid
+            if col_idx < 0 or col_idx >= len(self.rows[row_idx]):
+                return {'text': '', 'gridspan': 1, 'vmerge': None}
+
+            # Extra safety check on the cell type
+            cell = self.rows[row_idx][col_idx]
+            if not isinstance(cell, dict):
+                print(f"Warning: malformed cell data [{row_idx},{col_idx}]")
+                return {'text': str(cell) if cell is not None else '', 'gridspan': 1, 'vmerge': None}
+
+            return cell
+        except Exception as e:
+            print(f"Error getting cell [{row_idx},{col_idx}]: {str(e)}")
+            return {'text': '', 'gridspan': 1, 'vmerge': None}
+
+class TableProcessor:
+    def __init__(self):
+        """
+        Initialize the table processor
+        """
+        print("Initializing table processor")
+    
+    def _extract_table_row(self, row_element, namespace):
+        """
+        Extract the data of one table row
+
+        Args:
+            row_element: Row element
+            namespace: XML namespace map
+
+        Returns:
+            List: List of cell dictionaries for the row
+        """
+        row = []
+        try:
+            # Process the cells
+            for cell_element in row_element.findall('.//w:tc', namespaces=namespace):
+                cell_text = ''
+                # Extract all text in the cell
+                for paragraph in cell_element.findall('.//w:p', namespaces=namespace):
+                    for run in paragraph.findall('.//w:t', namespaces=namespace):
+                        if run.text:
+                            cell_text += run.text
+                    # Add a newline after each paragraph
+                    cell_text += '\n'
+
+                # Strip trailing newlines
+                cell_text = cell_text.rstrip('\n')
+
+                # Read the cell's merge attributes
+                gridspan = self._get_gridspan_value(cell_element)
+                vmerge = self._get_vmerge_value(cell_element)
+
+                # Build the cell data
+                cell = {
+                    'text': cell_text,
+                    'gridspan': gridspan,
+                    'vmerge': vmerge
+                }
+                row.append(cell)
+
+            # If the row is empty, create at least one empty cell
+            if not row:
+                row.append({'text': '', 'gridspan': 1, 'vmerge': None})
+
+            return row
+        except Exception as e:
+            print(f"Error extracting table row data: {str(e)}")
+            # Return a row with at least one cell
+            return [{'text': '', 'gridspan': 1, 'vmerge': None}]
+
+    def _preprocess_table(self, element, namespace):
+        """
+        Preprocess a table to improve recognition of unusual tables
+
+        Args:
+            element: Table element
+            namespace: XML namespace map
+
+        Returns:
+            TableData: The preprocessed table data
+        """
+        table = TableData()
+
+        # Find the table rows
+        rows_elements = element.findall('.//w:tr', namespaces=namespace)
+
+        # Special handling for an empty table
+        if not rows_elements:
+            # './/w:tr' already searches every nesting level, so there is nothing deeper to look for;
+            # create a default row so the table is never empty
+            print("No table rows found; creating a default row")
+            table.rows.append([{'text': '', 'gridspan': 1, 'vmerge': None}])
+            return table
+
+        # Process each row
+        for row_element in rows_elements:
+            row = self._extract_table_row(row_element, namespace)
+            table.rows.append(row)
+
+        # If the table is still empty, create a default row
+        if not table.rows:
+            table.rows.append([{'text': '', 'gridspan': 1, 'vmerge': None}])
+
+        # Analyze the table to determine the number of columns
+        max_cols = 0
+        for row in table.rows:
+            # Effective column count, taking gridspan into account
+            effective_cols = sum(cell.get('gridspan', 1) for cell in row)
+            max_cols = max(max_cols, effective_cols)
+
+        # Make sure every row has enough columns
+        for i, row in enumerate(table.rows):
+            current_cols = sum(cell.get('gridspan', 1) for cell in row)
+            if current_cols < max_cols:
+                # Pad the row with empty cells
+                padding_cells = max_cols - current_cols
+                for _ in range(padding_cells):
+                    row.append({'text': '', 'gridspan': 1, 'vmerge': None})
+
+        # Set the column indices
+        table.columns = [i for i in range(max_cols)]
+
+        # Improve handling of vertically merged cells
+        self._enhance_vertical_merges(table)
+
+        # Run one more pass that propagates vertically merged content, to repair merged cells in complex tables
+        self._propagate_vertical_merges(table)
+        
+        return table
+    
+    def _propagate_vertical_merges(self, table: TableData):
+        """
+        Handle vertically merged cells in complex tables by propagating content downwards
+
+        Args:
+            table: TableData object
+        """
+        rows = len(table.rows)
+        cols = len(table.columns) if table.columns else 0
+
+        if rows <= 1 or cols == 0:
+            return
+
+        # Build a matrix recording the content at each cell position
+        matrix = []
+        for i in range(rows):
+            row = []
+            for j in range(cols):
+                try:
+                    if j < len(table.rows[i]):
+                        cell = table.rows[i][j]
+                        row.append(cell.get('text', '').strip())
+                    else:
+                        row.append('')
+                except (IndexError, KeyError):
+                    row.append('')  # Guard against out-of-range indices
+            matrix.append(row)
+
+        # Check every column for vertical merges
+        for j in range(cols):
+            # Propagate non-empty content from top to bottom
+            last_non_empty = None
+            last_non_empty_idx = -1
+
+            for i in range(rows):
+                try:
+                    # Safely access the table cell
+                    current_text = ''
+                    if j < len(table.rows[i]):
+                        cell = table.rows[i][j]
+                        current_text = cell.get('text', '').strip()
+
+                    # If the current cell is empty but a cell above it is not, consider a vertical merge
+                    if not current_text and last_non_empty:
+                        # Check whether this could be a vertical merge
+                        if i - last_non_empty_idx <= 3:  # Limit the vertical range to avoid over-filling
+                            # Use context to decide whether this really is a merged cell
+                            # 1. Check whether other cells in this column show a similar pattern
+                            pattern_match = False
+                            for k in range(rows):
+                                if k != i and k != last_non_empty_idx:
+                                    # Similar pattern: an empty cell followed by a non-empty cell below it
+                                    if k > 0 and not matrix[k-1][j] and matrix[k][j]:
+                                        pattern_match = True
+                                        break
+
+                            # 2. Special case for the leading columns - possibly a classification table
+                            is_first_columns = j < 2  # The first two columns are more likely to hold category labels
+
+                            if pattern_match or is_first_columns:
+                                if j < len(table.rows[i]):
+                                    # Safely update the current cell
+                                    table.rows[i][j]['text'] = last_non_empty
+                                    table.rows[i][j]['is_inferred_merge'] = True
+                                    matrix[i][j] = last_non_empty  # Update the matrix
+                                    print(f"Propagated merged content to [{i},{j}]: {last_non_empty[:20]}...")
+
+                    # Remember the last non-empty cell
+                    if current_text:
+                        last_non_empty = current_text
+                        last_non_empty_idx = i
+                except Exception as e:
+                    print(f"Error propagating vertical merge [{i},{j}]: {str(e)}")
+
+        # Second pass: handle the common classification-table pattern (equal values in a leading column mean the same category)
+        for j in range(min(2, cols)):  # Only process the first two columns
+            # Find groups of rows that share the same value
+            groups = {}
+            for i in range(rows):
+                try:
+                    if j < len(table.rows[i]):
+                        value = table.rows[i][j].get('text', '').strip()
+                        if value:
+                            if value not in groups:
+                                groups[value] = []
+                            groups[value].append(i)
+                except Exception as e:
+                    print(f"Error while grouping [{i},{j}]: {str(e)}")
+
+            # Process each group
+            for value, indices in groups.items():
+                if len(indices) >= 2:  # At least two rows share the value
+                    # Check whether there are empty rows between them
+                    indices.sort()
+                    for idx in range(len(indices) - 1):
+                        start_row = indices[idx]
+                        end_row = indices[idx + 1]
+
+                        # If the two rows are not adjacent, check the rows in between
+                        if end_row - start_row > 1:
+                            for mid_row in range(start_row + 1, end_row):
+                                try:
+                                    # Check whether the cell in the intermediate row is empty
+                                    if j < len(table.rows[mid_row]):
+                                        mid_cell = table.rows[mid_row][j]
+                                        if not mid_cell.get('text', '').strip():
+                                            # This is probably a merged cell; fill in the content
+                                            mid_cell['text'] = value
+                                            mid_cell['is_inferred_merge'] = True
+                                            print(f"Filled merged cell in intermediate row [{mid_row},{j}]: {value[:20]}...")
+                                except Exception as e:
+                                    print(f"Error filling intermediate row [{mid_row},{j}]: {str(e)}")
+
+    def _enhance_vertical_merges(self, table: TableData):
+        """
+        Improve handling of vertically merged cells
+
+        The logic covers:
+        1. Checking and handling the special cases of the first and second columns
+        2. Identifying cells with identical content in the table
+
+        Args:
+            table: TableData object
+        """
+        rows = len(table.rows)
+        cols = len(table.columns) if table.columns else 0
+
+        if rows <= 1 or cols == 0:
+            return
+
+        # Check the special cases of the first and second columns
+        for j in range(min(2, cols)):  # Check the first two columns, where merged cells are likely to appear
+            # Look for vertically merged cells
+            for i in range(1, rows):
+                try:
+                    if j < len(table.rows[i]):  # Make sure the index is valid
+                        cell = table.rows[i][j]
+                        # If the cell is empty and not marked as merged, look at the row above
+                        if not cell.get('text', '').strip() and cell.get('vmerge') is None:
+                            # Safely access the previous row
+                            if j < len(table.rows[i-1]):
+                                prev_cell = table.rows[i-1][j]
+                                if prev_cell.get('text', '').strip():
+                                    # The row above has content, so this may be a merged cell
+                                    print(f"Possible vertically merged cell detected at [{i},{j}]")
+                                    # Copy the content into the current cell
+                                    cell['text'] = prev_cell['text']
+                                    cell['is_inferred_merge'] = True  # Mark as an inferred merged cell
+                except IndexError as e:
+                    print(f"Index error while enhancing vertical merges [{i},{j}]: {str(e)}")
+                except Exception as e:
+                    print(f"General error while enhancing vertical merges [{i},{j}]: {str(e)}")
+
+            # Special case: look for patterns typical of classification tables
+            try:
+                # In a classification table, content repeated within a column may indicate merged cells
+                content_groups = self._identify_content_groups(table, j)
+
+                # Process cells with identical content
+                for group_indices in content_groups:
+                    if len(group_indices) > 1:  # More than one cell with the same content
+                        if group_indices[0] < len(table.rows) and j < len(table.rows[group_indices[0]]):
+                            group_text = table.rows[group_indices[0]][j].get('text', '')
+                            if group_text.strip():  # The cell has content
+                                print(f"Possible content merge group found in column {j}: {group_indices}")
+                                # Mark these cells as sharing the same content
+                                for idx in group_indices:
+                                    if idx < len(table.rows) and j < len(table.rows[idx]):
+                                        table.rows[idx][j]['content_group'] = group_indices
+            except Exception as e:
+                print(f"Error processing content groups [column {j}]: {str(e)}")
+    
+    def _identify_content_groups(self, table: TableData, col_idx: int) -> List[List[int]]:
+        """
+        Identify possible merged cells by grouping rows whose cells have identical content
+
+        Args:
+            table: TableData object
+            col_idx: Index of the column to analyze
+
+        Returns:
+            List[List[int]]: Groups of row indices that may belong to merged cells
+        """
+        rows = len(table.rows)
+        # Maps each unique cell text to all row indices where it occurs
+        content_groups = {}
+
+        for i in range(rows):
+            try:
+                if col_idx < len(table.rows[i]):
+                    cell_text = table.rows[i][col_idx].get('text', '').strip()
+                    if cell_text:
+                        if cell_text not in content_groups:
+                            content_groups[cell_text] = []
+                        content_groups[cell_text].append(i)
+            except IndexError:
+                # Safely skip out-of-range indices
+                continue
+            except Exception as e:
+                print(f"Error identifying content groups [{i},{col_idx}]: {str(e)}")
+
+        # Return only the groups that contain more than one row index
+        return [indices for text, indices in content_groups.items() if len(indices) > 1]
+
+    def _is_valid_table(self, table: TableData) -> bool:
+        """
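+        Check whether a table is valid (at least one row and one column, with meaningful content)
+
+        Args:
+            table: TableData object
+
+        Returns:
+            bool: Whether the table is valid
+        """
+        try:
+            # Check the table dimensions
+            rows = len(table.rows)
+            cols = len(table.columns) if table.columns else 0
+
+            # A table without rows or columns is invalid
+            if rows < 1 or cols < 1:
+                print(f"Invalid table: no rows or columns (rows={rows}, cols={cols})")
+                return False
+
+            # Special-case tables that the generic content check below might misjudge
+            try:
+                # Decide from the text of the first cell whether this is a special table
+                # (for example a drug classification table)
+                first_cell_text = ""
+                if rows > 0 and len(table.rows[0]) > 0:
+                    first_cell_text = table.cell(0, 0).get('text', '').strip()
+
+                # Check whether the first cell matches specific text patterns (codes, category names, and so on)
+                # These patterns suggest that the table is important
+                special_patterns = [
+                    r'^\d{2}-\d{2}',        # Codes such as 01-01
+                    r'^[一二三四五六七八九十]+级',  # Chinese level names (一级, 二级, ...)
+                    r'^\d+\.\d+',           # Numbering such as 1.1
+                    r'类[别型]|分类|编码',       # Classification-related words
+                    r'表\s*\d+',             # Table numbers (such as "表1")
+                    r'产品|器械|设备|材料'        # Common medical or drug classification terms
+                ]
+
+                for pattern in special_patterns:
+                    if re.search(pattern, first_cell_text):
+                        print(f"Detected special table pattern: '{first_cell_text}'; treating the table as valid")
+                        return True
+
+            except Exception as e:
+                # Special detection failed; continue with the regular check
+                print(f"Error during special table detection: {str(e)}")
+
+            # Count the meaningful content in the table
+            total_cells = 0
+            non_empty_cells = 0
+            total_text_length = 0
+
+            for i in range(rows):
+                for j in range(min(cols, len(table.rows[i]))):  # Guard against out-of-range indices
+                    total_cells += 1
+                    cell_text = table.cell(i, j)['text'].strip()
+                    if cell_text:
+                        non_empty_cells += 1
+                        total_text_length += len(cell_text)
+
+            # Compute the ratio of non-empty cells
+            non_empty_ratio = non_empty_cells / total_cells if total_cells > 0 else 0
+
+            # Size check - a table with enough rows or columns is more likely to be valid
+            has_multiple_rows = rows >= 3
+            has_multiple_cols = cols >= 3
+
+            # Actual cell content check
+            # The criteria are deliberately loose: any content at all makes the table potentially valid
+            is_meaningful = (
+                # 1. Standard condition: at least 2 cells have content
+                non_empty_cells >= 2 or 
+                # 2. Very low threshold: at least 1 cell has content and the text length is >= 1 character
+                (non_empty_cells > 0 and total_text_length >= 1) or
+                # 3. The table is large enough: at least 3 rows and 3 columns, so it may be important
+                (has_multiple_rows and has_multiple_cols) or
+                # 4. High fill rate: even with few cells, a well-filled table may be meaningful
+                (non_empty_ratio >= 0.5 and total_text_length > 0)
+            )
+
+            if not is_meaningful:
+                print(f"Invalid table: not enough content (non-empty cells={non_empty_cells}/{total_cells}, text length={total_text_length})")
+
+            return is_meaningful
+
+        except Exception as e:
+            print(f"Warning: error while checking table validity: {str(e)}")
+            import traceback
+            traceback.print_exc()
+            # On error, treat the table as valid to avoid losing potentially useful tables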
+            return True
+
+    def _extract_plain_text_from_table(self, table: TableData) -> str:
+        """
+        Extract plain text from a table so that an invalid table can be handled as ordinary text
+
+        Args:
+            table: TableData object
+
+        Returns:
+            str: Plain-text representation of the table content
+        """
+        try:
+            text_parts = []
+            for row in table.rows:
+                for cell in row:
+                    cell_text = cell['text'].strip()
+                    if cell_text:
+                        text_parts.append(cell_text)
+
+            return " ".join(text_parts)
+
+        except Exception as e:
+            print(f"Warning: error extracting text from table: {str(e)}")
+            return "【表格文本提取失败】"
+
+    def _convert_table_to_text(self, table: TableData) -> str:
+        """
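+        Convert a table to text, using a simplified, readable table representation
+
+        Args:
+            table: TableData object
+
+        Returns:
+            str: Text representation of the table
+        """
+        try:
+            # Get the number of rows and columns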
+            rows = len(table.rows)
+            cols = len(table.columns) if table.columns else 0
+            
+            if rows == 0 or cols == 0:
+                return "【空表格】"
+                
+            # Build a complete table matrix so merged cells can be handled
+            matrix = []
+            for i in range(rows):
+                row = [""] * cols
+                matrix.append(row)
+
+            # First, safely copy the content of every known cell
+            for i in range(rows):
+                for j in range(cols):
+                    try:
+                        if j < len(table.rows[i]):
+                            cell = table.rows[i][j]
+                            text = cell.get('text', '').strip()
+                            matrix[i][j] = text
+                    except IndexError:
+                        continue  # Skip out-of-range indices
+
+            # Fill the matrix, handling merged cells
+            for i in range(rows):
+                j = 0
+                while j < cols:
+                    try:
+                        if j >= len(table.rows[i]):
+                            j += 1
+                            continue
+                            
+                        cell = table.rows[i][j]
+                        text = cell.get('text', '').strip()
+                        
+                        # Special handling: check for a content group marker
+                        content_group = cell.get('content_group', [])
+                        if content_group:
+                            # This cell is part of a content group
+                            if i in content_group and content_group[0] < len(table.rows) and j < len(table.rows[content_group[0]]):
+                                group_text = table.rows[content_group[0]][j].get('text', '').strip()
+                                if group_text:
+                                    text = group_text
+
+                        # Horizontal merge (gridspan)
+                        gridspan = cell.get('gridspan', 1)
+
+                        # Vertical merge (vmerge) and inferred merges
+                        if cell.get('vmerge') == 'continue' or cell.get('is_inferred_merge'):
+                            # For a continued or inferred merge, keep the text the cell already has
+                            if not text:
+                                # If the current cell is empty, look for text in the rows above
+                                for prev_i in range(i-1, -1, -1):
+                                    if prev_i < len(table.rows) and j < len(table.rows[prev_i]):
+                                        prev_cell = table.rows[prev_i][j]
+                                        prev_text = prev_cell.get('text', '').strip()
+                                        if prev_text:
+                                            text = prev_text
+                                            break
+
+                        # Fill the current cell
+                        matrix[i][j] = text
+
+                        # For a horizontal merge, copy the content into the spanned cells
+                        for k in range(1, gridspan):
+                            if j + k < cols:
+                                matrix[i][j+k] = text
+
+                        # If this cell starts a vertical merge, copy its content into the merged cells below
+                        if text and (cell.get('vmerge') == 'restart' or not cell.get('vmerge')):
+                            for next_i in range(i+1, rows):
+                                if next_i < len(table.rows) and j < len(table.rows[next_i]):
+                                    next_cell = table.rows[next_i][j]
+                                    if next_cell.get('vmerge') == 'continue' or not next_cell.get('text', '').strip():
+                                        # Copy into the merged cell below
+                                        matrix[next_i][j] = text
+                                        # Handle its horizontal merge as well
+                                        next_gridspan = next_cell.get('gridspan', 1)
+                                        for k in range(1, next_gridspan):
+                                            if j + k < cols:
+                                                matrix[next_i][j+k] = text
+                                    else:
+                                        break
+
+                        j += max(1, gridspan)
+                    except IndexError as e:
+                        print(f"Index error while converting table to text [{i},{j}]: {str(e)}")
+                        j += 1  # Make sure the loop advances
+                    except Exception as e:
+                        print(f"General error while converting table to text [{i},{j}]: {str(e)}")
+                        j += 1
+            
+            # Second pass over blank cells in the first three columns to reinforce vertical-merge handling
+            for j in range(min(3, cols)):
+                # 自上而下扫描
+                last_content = ""
+                for i in range(rows):
+                    if matrix[i][j]:
+                        last_content = matrix[i][j]
+                    elif last_content and i > 0 and matrix[i-1][j]:
+                        # 如果当前为空且上一行不为空，填充内容
+                        matrix[i][j] = last_content
+                
+                # 自下而上扫描，填充孤立的空单元格
+                for i in range(rows-2, 0, -1):  # 从倒数第二行开始向上
+                    if not matrix[i][j] and matrix[i-1][j] and matrix[i+1][j] and matrix[i-1][j] == matrix[i+1][j]:
+                        # 如果当前为空且上下行内容相同，填充内容
+                        matrix[i][j] = matrix[i-1][j]
+            
+            # 如果有表头，提取它们
+            headers = matrix[0] if rows > 0 else ["列" + str(j+1) for j in range(cols)]
+            # 确保表头不为空
+            for j in range(cols):
+                if not headers[j]:
+                    headers[j] = "列" + str(j+1)
+            
+            # 构建结构化输出 - 使用统一简化格式
+            result = []
+            result.append("表格内容（简化格式）:")
+            
+            # 添加表头行
+            header_line = []
+            # 计算每列最大宽度
+            col_widths = [0] * cols
+            for j in range(cols):
+                col_widths[j] = max(len(headers[j]), col_widths[j])
+            
+            # 计算数据行的宽度
+            for i in range(1, rows):
+                for j in range(cols):
+                    if matrix[i][j]:
+                        col_widths[j] = max(col_widths[j], len(matrix[i][j]))
+            
+            # 加入表头与分隔线    
+            for j in range(cols):
+                header_line.append(headers[j].ljust(col_widths[j]))
+            result.append(" | ".join(header_line))
+            
+            # 添加分隔线
+            separator = []
+            for j in range(cols):
+                separator.append("-" * col_widths[j])
+            result.append(" | ".join(separator))
+            
+            # 添加数据行
+            for i in range(1, rows):
+                row_line = []
+                has_content = False
+                
+                for j in range(cols):
+                    cell_text = matrix[i][j]
+                    if cell_text:
+                        has_content = True
+                    # 始终添加单元格内容，即使为空
+                    row_line.append(cell_text.ljust(col_widths[j]))
+                
+                if has_content:
+                    result.append(" | ".join(row_line))
+            
+            return "\n".join(result)
+            
+        except Exception as e:
+            print(f"警告：处理表格时出错: {str(e)}")
+            import traceback
+            traceback.print_exc()
+            return "【表格处理失败】"
+
+    def _convert_table_to_markdown(self, table: TableData) -> str:
+        """
+        Convert a table to Markdown, using a simplified, readable layout
+        
+        Args:
+            table: TableData object
+            
+        Returns:
+            str: Markdown representation of the table
+        """
+        try:
+            # 获取表格的行数和列数
+            rows = len(table.rows)
+            cols = len(table.columns) if table.columns else 0
+            
+            if rows == 0 or cols == 0:
+                return "| 空表格 |"
+                
+            # 构建一个完整的表格矩阵，处理合并单元格
+            matrix = []
+            for i in range(rows):
+                row = [""] * cols
+                matrix.append(row)
+
+            # 首先安全地处理所有已知的单元格内容
+            for i in range(rows):
+                for j in range(cols):
+                    try:
+                        if j < len(table.rows[i]):
+                            cell = table.rows[i][j]
+                            text = cell.get('text', '').strip()
+                            matrix[i][j] = text
+                    except IndexError:
+                        continue  # 跳过索引越界
+            
+            # 填充矩阵，处理合并单元格
+            for i in range(rows):
+                j = 0
+                while j < cols:
+                    try:
+                        if j >= len(table.rows[i]):
+                            j += 1
+                            continue
+                            
+                        cell = table.rows[i][j]
+                        text = cell.get('text', '').strip()
+                        
+                        # 特殊处理：检查是否有内容组标记
+                        content_group = cell.get('content_group', [])
+                        if content_group and i in content_group:
+                            # 如果这是内容组的一部分，保证内容的一致性
+                            if content_group[0] < len(table.rows) and j < len(table.rows[content_group[0]]):
+                                group_text = table.rows[content_group[0]][j].get('text', '').strip()
+                                if group_text:
+                                    text = group_text
+                        
+                        # 处理水平合并(gridspan)
+                        gridspan = cell.get('gridspan', 1)
+                        
+                        # 处理垂直合并(vmerge)和推断的合并
+                        if cell.get('vmerge') == 'continue' or cell.get('is_inferred_merge'):
+                            # 如果是继续合并的单元格或推断的合并，使用当前已有的文本
+                            if not text:
+                                # 如果当前单元格文本为空，尝试从上面行查找
+                                for prev_i in range(i-1, -1, -1):
+                                    if prev_i < len(table.rows) and j < len(table.rows[prev_i]):
+                                        prev_cell = table.rows[prev_i][j]
+                                        prev_text = prev_cell.get('text', '').strip()
+                                        if prev_text:
+                                            text = prev_text
+                                            break
+                        
+                        # 填充当前单元格
+                        matrix[i][j] = text
+                        
+                        # 处理水平合并，将内容复制到被合并的单元格
+                        for k in range(1, gridspan):
+                            if j + k < cols:
+                                matrix[i][j+k] = text
+                        
+                        # 如果这是垂直合并的起始单元格，复制内容到下面被合并的单元格
+                        if text and (cell.get('vmerge') == 'restart' or not cell.get('vmerge')):
+                            for next_i in range(i+1, rows):
+                                if next_i < len(table.rows) and j < len(table.rows[next_i]):
+                                    next_cell = table.rows[next_i][j]
+                                    if next_cell.get('vmerge') == 'continue' or not next_cell.get('text', '').strip():
+                                        # 复制到下面被合并的单元格
+                                        matrix[next_i][j] = text
+                                        # 处理水平合并
+                                        next_gridspan = next_cell.get('gridspan', 1)
+                                        for k in range(1, next_gridspan):
+                                            if j + k < cols:
+                                                matrix[next_i][j+k] = text
+                                    else:
+                                        break
+                        
+                        j += max(1, gridspan)
+                    except Exception as e:
+                        print(f"Markdown表格处理错误 [{i},{j}]: {str(e)}")
+                        j += 1
+            
+            # Second pass over blank cells in the first three columns to reinforce vertical-merge handling
+            for j in range(min(3, cols)):
+                # 自上而下扫描
+                last_content = ""
+                for i in range(rows):
+                    if matrix[i][j]:
+                        last_content = matrix[i][j]
+                    elif last_content and i > 0 and matrix[i-1][j]:
+                        # 如果当前为空且上一行不为空，填充内容
+                        matrix[i][j] = last_content
+            
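+            # Make sure no header is empty: replace blanks in place with a default column name
+            # (appending instead would leave the blank and make the header row longer than cols)
+            headers = list(matrix[0])
+            for j in range(cols):
+                if not headers[j]:
+                    headers[j] = "列" + str(j+1)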
+            
+            # 构建Markdown表格
+            markdown_rows = []
+            
+            # 添加表头行
+            header_row = "| " + " | ".join(headers) + " |"
+            markdown_rows.append(header_row)
+            
+            # 添加分隔行
+            separator = "| " + " | ".join(["---"] * cols) + " |"
+            markdown_rows.append(separator)
+            
+            # 添加数据行
+            for i in range(1, rows):
+                row_data = []
+                has_content = False
+                
+                for j in range(cols):
+                    cell_text = matrix[i][j]
+                    if cell_text:
+                        has_content = True
+                    row_data.append(cell_text)
+                
+                if has_content:
+                    markdown_rows.append("| " + " | ".join(row_data) + " |")
+            
+            return "\n".join(markdown_rows)
+            
+        except Exception as e:
+            print(f"警告：处理Markdown表格时出错: {str(e)}")
+            import traceback
+            traceback.print_exc()
+            return "| 表格处理失败 |"
+
+    def _extract_table_text(self, table: TableData) -> str:
+        """
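+        Extract the text of a table and return it as formatted plain text
+        
+        Args:
+            table: TableData object
+            
+        Returns:
+            str: plain-text representation of the table
+        """
+        # Delegate to _convert_table_to_text so merged cells are handled consistently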
+        return self._convert_table_to_text(table)
+
+    def _convert_small_table_to_text(self, table: TableData) -> str:
+        """
+        Convert a small table to a more compact plain-text layout
+        
+        Args:
+            table: TableData object
+            
+        Returns:
+            str: plain-text representation of the table
+        """
+        rows = len(table.rows)
+        cols = len(table.columns) if table.columns else 0
+        
+        if rows == 0 or cols == 0:
+            return "【空表格】"
+        
+        # 提取所有单元格文本
+        cell_texts = []
+        for i in range(rows):
+            row_texts = []
+            for j in range(min(cols, len(table.rows[i]))):
+                cell_text = table.cell(i, j)['text'].strip().replace('\n', ' ')
+                row_texts.append(cell_text)
+            cell_texts.append(row_texts)
+        
+        # 计算每列的最大宽度
+        col_widths = [0] * cols
+        for i in range(rows):
+            for j in range(len(cell_texts[i])):
+                col_widths[j] = max(col_widths[j], len(cell_texts[i][j]))
+        
+        # 生成表格文本
+        result = []
+        
+        # 添加表头
+        header_row = cell_texts[0]
+        header_line = []
+        for j, text in enumerate(header_row):
+            width = min(col_widths[j], 30)  # 限制最大宽度
+            header_line.append(text.ljust(width))
+        result.append(" | ".join(header_line))
+        
+        # 添加分隔线
+        separator = []
+        for j in range(cols):
+            width = min(col_widths[j], 30)
+            separator.append("-" * width)
+        result.append(" | ".join(separator))
+        
+        # 添加数据行
+        for i in range(1, rows):
+            row_line = []
+            for j, text in enumerate(cell_texts[i]):
+                width = min(col_widths[j], 30)  # 限制最大宽度
+                row_line.append(text.ljust(width))
+            result.append(" | ".join(row_line))
+        
+        return "\n".join(result)
+
+    def _get_vmerge_value(self, cell_element) -> str:
+        """
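+        Get the vertical-merge (w:vMerge) attribute of a cell
+        
+        Args:
+            cell_element: cell XML element
+            
+        Returns:
+            str: the w:val value ('restart' or 'continue'); 'continue' when w:val is omitted; None when the cell has no w:vMerge element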
+        """
+        vmerge = cell_element.xpath('.//w:vMerge')
+        if vmerge:
+            return vmerge[0].get(self._qn('w:val'), 'continue')
+        return None
+
+    def _get_gridspan_value(self, cell_element) -> int:
+        """
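+        Get the horizontal span (w:gridSpan) of a cell
+        
+        Args:
+            cell_element: cell XML element
+            
+        Returns:
+            int: number of grid columns the cell spans (1 when it is not merged)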
+        """
+        try:
+            gridspan = cell_element.xpath('.//w:gridSpan')
+            if gridspan and gridspan[0].get(self._qn('w:val')):
+                return int(gridspan[0].get(self._qn('w:val')))
+        except (ValueError, TypeError, AttributeError) as e:
+            print(f"警告：获取gridspan值时出错: {str(e)}")
+        return 1  # 默认返回1，表示没有合并
+
+    def _get_vertical_span(self, table: TableData, start_row: int, col: int) -> int:
+        """
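+        Count the rows covered by a vertical merge
+        
+        Args:
+            table: table object
+            start_row: index of the row where the merge starts
+            col: column index
+            
+        Returns:
+            int: number of rows in the vertical merge, including the starting row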
+        """
+        span = 1
+        for i in range(start_row + 1, len(table.rows)):
+            cell = table.cell(i, col)
+            if cell.get('vmerge') == 'continue':
+                span += 1
+            else:
+                break
+        return span
+    
+    def _qn(self, tag: str) -> str:
+        """
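+        Convert a prefixed tag such as 'w:val' to its namespace-qualified (Clark notation) form
+        
+        Args:
+            tag: tag name, with or without the 'w:' prefix
+            
+        Returns:
+            str: the tag in '{namespace}localname' form
+        """
+        prefix = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
+        # Drop the 'w:' prefix: '{ns}w:val' is not a valid qualified name, so attribute lookups would never match
+        return prefix + tag.split(":", 1)[-1]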
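Review note: `_convert_table_to_text` and `_convert_table_to_markdown` repeat the same merged-cell expansion. Below is a minimal sketch of that shared step, assuming cells are plain dicts with the `text`, `gridspan` and `vmerge` keys used above. The name `expand_merged_cells` is hypothetical, and unlike the code above it tracks the grid column separately from the cell index, which assumes each row lists only its physical cells.

```python
from typing import Dict, List


def expand_merged_cells(rows: List[List[Dict]], cols: int) -> List[List[str]]:
    """Build a dense rows x cols text matrix from cell dicts (hypothetical helper)."""
    matrix = [[""] * cols for _ in rows]
    for i, row in enumerate(rows):
        col = 0  # position in the dense grid, advanced by each cell's span
        for cell in row:
            if col >= cols:
                break
            text = (cell.get("text") or "").strip()
            # A vertical-merge continuation inherits the text from the row above.
            if cell.get("vmerge") == "continue" and not text and i > 0:
                text = matrix[i - 1][col]
            span = max(1, int(cell.get("gridspan", 1)))
            # A horizontal merge repeats the text across every spanned column.
            for k in range(span):
                if col + k < cols:
                    matrix[i][col + k] = text
            col += span
    return matrix
```

Both converters could then format the returned matrix in their own way, which would keep the two output formats from drifting apart.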
--- a/cxs/cxs_text_paragraph_splitter.py
+++ b/cxs/cxs_text_paragraph_splitter.py
@@ -0,0 +1,183 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+import re
+import json
+import argparse
+
+def split_text_into_paragraphs(text):
+    """
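+    Split continuous text into paragraphs
+    
+    Strategy:
+    1. Detect table markers and treat each table body as its own paragraph
+    2. Split ordinary text by meaning and length (about 500 characters per paragraph)
+    3. Split only at sentence boundaries so that meaning stays intact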
+    """
+    # 清理文本中可能存在的多余空格
+    text = re.sub(r'\s+', ' ', text).strip()
+    
+    # 识别表格范围，表格以"表格 N 开始"和"表格 N 结束"标记
+    table_pattern = re.compile(r'表格\s*\d+\s*开始(.*?)表格\s*\d+\s*结束', re.DOTALL)
+    
+    # 使用表格标记分割文本
+    parts = []
+    last_end = 0
+    
+    for match in table_pattern.finditer(text):
+        # 添加表格前的文本
+        if match.start() > last_end:
+            parts.append(("text", text[last_end:match.start()]))
+        
+        # 获取表格内容（去掉表格标记）
+        table_content = match.group(1).strip()
+        parts.append(("table", table_content))
+        
+        last_end = match.end()
+    
+    # 添加最后一个表格之后的文本
+    if last_end < len(text):
+        parts.append(("text", text[last_end:]))
+    
+    # Fallback: parts is only empty here when the text itself is empty
+    if not parts:
+        parts = [("text", text)]
+    
+    # 对文本段落进行处理
+    final_paragraphs = []
+    
+    # 可能表示段落边界或重要语义分割点的标记
+    paragraph_markers = [
+        r'^第.{1,3}章',
+        r'^第.{1,3}节',
+        r'^[一二三四五六七八九十][、.\s]',
+        r'^\d+[、.\s]',
+        r'^[IVX]+[、.\s]',
+        r'^附录',
+        r'^前言',
+        r'^目录',
+        r'^摘要',
+        r'^引言',
+        r'^结论',
+        r'^参考文献'
+    ]
+    marker_pattern = re.compile('|'.join(paragraph_markers))
+    
+    # 按句子分割的分隔符
+    sentence_separators = r'([。！？\!\?])'
+    
+    # 目标段落长度（字符数）
+    target_length = 500
+    # 最小段落长度阈值
+    min_length = 100
+    # 最大段落长度阈值
+    max_length = 800
+    
+    for part_type, content in parts:
+        # 如果是表格内容，直接添加为独立段落
+        if part_type == "table":
+            final_paragraphs.append(content)
+            continue
+        
+        # 处理普通文本
+        # 按句子分割文本
+        sentences = re.split(sentence_separators, content)
+        # re.split with a capturing group yields [body, end, body, end, ..., tail];
+        # re-attach each terminator to the sentence it ends
+        sentence_list = [body + end for body, end in zip(sentences[0::2], sentences[1::2])]
+        
+        # Keep the trailing text after the last terminator, if any
+        if sentences[-1]:
+            sentence_list.append(sentences[-1])
+        
+        # 构建段落
+        current_para = ""
+        for sentence in sentence_list:
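+            # Check whether the sentence starts with a section marker; strip leading whitespace
+            # first, because the "^" anchors would otherwise miss a sentence such as " 第一章 ..."
+            is_marker = marker_pattern.search(sentence.lstrip())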
+            
+            # 如果当前段落已经足够长，或者遇到段落标记，则开始新段落
+            if ((len(current_para) >= target_length and len(current_para) + len(sentence) > max_length) or 
+                (is_marker and current_para)):
+                if current_para.strip():
+                    final_paragraphs.append(current_para.strip())
+                current_para = sentence
+            else:
+                current_para += sentence
+        
+        # 添加最后一个段落
+        if current_para.strip():
+            final_paragraphs.append(current_para.strip())
+    
+    # 对段落进行后处理，合并过短的段落
+    processed_paragraphs = []
+    temp_para = ""
+    
+    for para in final_paragraphs:
+        if len(para) < min_length:
+            # 如果段落太短，尝试与临时段落合并
+            if temp_para:
+                temp_para += " " + para
+            else:
+                temp_para = para
+        else:
+            # 如果有临时段落，先处理它
+            if temp_para:
+                # 如果临时段落也很短，合并到当前段落
+                if len(temp_para) < min_length:
+                    para = temp_para + " " + para
+                else:
+                    processed_paragraphs.append(temp_para)
+                temp_para = ""
+            
+            processed_paragraphs.append(para)
+    
+    # 处理最后可能剩余的临时段落
+    if temp_para:
+        if processed_paragraphs and len(temp_para) < min_length:
+            processed_paragraphs[-1] += " " + temp_para
+        else:
+            processed_paragraphs.append(temp_para)
+    
+    return processed_paragraphs
+
+def save_to_json(paragraphs, output_file):
+    """将段落保存为JSON格式"""
+    data = {
+        "total_paragraphs": len(paragraphs),
+        "paragraphs": paragraphs
+    }
+    
+    with open(output_file, 'w', encoding='utf-8') as f:
+        json.dump(data, f, ensure_ascii=False, indent=2)
+    
+    print(f"成功将文本分成 {len(paragraphs)} 个段落并保存到 {output_file}")
+
+def main():
+    parser = argparse.ArgumentParser(description="将连续文本智能分段并保存为JSON")
+    parser.add_argument("input_file", help="输入文本文件路径")
+    parser.add_argument("--output", "-o", default="paragraphs.json", help="输出JSON文件路径")
+    
+    args = parser.parse_args()
+    
+    # 读取输入文件
+    try:
+        with open(args.input_file, 'r', encoding='utf-8') as f:
+            text = f.read()
+    except Exception as e:
+        print(f"读取文件出错: {e}")
+        return
+    
+    # 分段
+    paragraphs = split_text_into_paragraphs(text)
+    
+    # 保存为JSON
+    save_to_json(paragraphs, args.output)
+
+if __name__ == "__main__":
+    main()
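Review note: the table-splitting step in `split_text_into_paragraphs` can be read in isolation. The sketch below reproduces it with the same marker pattern. The names `TABLE_BLOCK` and `split_table_parts` are hypothetical, and the empty-text fallback is left out for brevity.

```python
import re
from typing import List, Tuple

# Same marker pattern as above: "表格 N 开始" ... "表格 N 结束"
TABLE_BLOCK = re.compile(r"表格\s*\d+\s*开始(.*?)表格\s*\d+\s*结束", re.DOTALL)


def split_table_parts(text: str) -> List[Tuple[str, str]]:
    """Split text into ("text", ...) and ("table", ...) parts (hypothetical helper)."""
    parts: List[Tuple[str, str]] = []
    last_end = 0
    for match in TABLE_BLOCK.finditer(text):
        # Plain text between the previous table (or the start) and this one.
        if match.start() > last_end:
            parts.append(("text", text[last_end:match.start()]))
        # Group 1 is the table body without the start/end markers.
        parts.append(("table", match.group(1).strip()))
        last_end = match.end()
    # Whatever follows the last table is plain text again.
    if last_end < len(text):
        parts.append(("text", text[last_end:]))
    return parts
```

Note that the caller collapses all whitespace before this step, so a table body arrives as a single line and its row breaks cannot be recovered afterwards.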
--- a/cxs/main.py
+++ b/cxs/main.py
@@ -0,0 +1,500 @@
+from fastapi import FastAPI, File, UploadFile, Form, HTTPException, Request
+from fastapi.responses import FileResponse, JSONResponse
+from fastapi.staticfiles import StaticFiles
+from fastapi.middleware.cors import CORSMiddleware
+import os
+import tempfile
+from pathlib import Path
+import uuid
+import sys
+import shutil
+import glob
+import asyncio
+from typing import List
+import json
+import atexit
+import re
+import time
+
+# 获取当前文件所在目录的绝对路径
+CURRENT_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
+if str(CURRENT_DIR) not in sys.path:
+    sys.path.append(str(CURRENT_DIR))
+
+# 定义目录
+TEMP_DIR = CURRENT_DIR / "temp"
+STATIC_DIR = CURRENT_DIR / "static"
+UPLOAD_DIR = TEMP_DIR / "uploads"
+OUTPUT_DIR = TEMP_DIR / "outputs"
+IMAGES_DIR = TEMP_DIR / "images"  # 添加图片目录
+
+# 确保所有必要的目录都存在
+def ensure_directories():
+    """确保所有必要的目录都存在且具有正确的权限"""
+    directories = [TEMP_DIR, STATIC_DIR, UPLOAD_DIR, OUTPUT_DIR, IMAGES_DIR]
+    for directory in directories:
+        try:
+            # 只在目录不存在时创建
+            if not directory.exists():
+                directory.mkdir(parents=True, exist_ok=True)
+                print(f"创建目录: {directory}")
+            # 在 Windows 上设置目录权限
+            if os.name == 'nt':
+                os.system(f'icacls "{directory}" /grant Everyone:(OI)(CI)F /T')
+                print(f"设置目录权限: {directory}")
+        except Exception as e:
+            print(f"创建目录失败 {directory}: {e}")
+            raise
+
+def clean_temp_directories():
+    """清理临时目录中的内容，但保留目录结构"""
+    try:
+        # 只清理临时目录中的内容
+        for directory in [UPLOAD_DIR, OUTPUT_DIR, IMAGES_DIR]:
+            if directory.exists():
+                print(f"清理目录: {directory}")
+                # 删除目录中的所有文件和子目录
+                for item in directory.glob("*"):
+                    try:
+                        if item.is_file():
+                            item.unlink()
+                            print(f"删除文件: {item}")
+                        elif item.is_dir():
+                            shutil.rmtree(str(item))
+                            print(f"删除目录: {item}")
+                    except Exception as e:
+                        print(f"清理项目失败 {item}: {e}")
+    except Exception as e:
+        print(f"清理临时目录失败: {e}")
+
+# 初始化目录
+ensure_directories()
+
+try:
+    from cxs_doc_cleaner import DocCleaner
+except ImportError as e:
+    print(f"导入错误: {e}")
+    print(f"当前目录: {CURRENT_DIR}")
+    print(f"Python路径: {sys.path}")
+    raise
+
+app = FastAPI(debug=True)
+
+# 配置CORS
+origins = ["*"]
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=origins,
+    allow_credentials=True,
+    allow_methods=["GET", "POST", "OPTIONS"],
+    allow_headers=["*"],
+    expose_headers=["*"]
+)
+
+# API 路由
+@app.options("/api/upload/")
+async def upload_options():
+    return {}
+
+@app.post("/api/upload/")
+async def upload_files(request: Request, files: List[UploadFile] = File(...)):
+    """处理文件上传"""
+    print(f"收到上传请求: {request.method} {request.url}")
+    print(f"请求头: {request.headers}")
+    print(f"收到的文件数量: {len(files)}")
+    
+    # 确保目录存在
+    ensure_directories()
+    
+    # 检查是否有文件上传
+    if not files:
+        return {
+            "results": [],
+            "error": "没有上传文件"
+        }
+
+    results = []
+    cleaner = None
+    
+    try:
+        # 创建文档处理器
+        cleaner = DocCleaner()
+        print("成功创建DocCleaner实例")
+        
+        # 一次只处理一个文件
+        for index, file in enumerate(files):
+            print(f"\n开始处理第 {index + 1}/{len(files)} 个文件: {file.filename}")
+            temp_file = None
+            output_file = None
+            
+            try:
+                # 保存上传的文件
+                temp_file, save_error = await save_uploaded_file(file)
+                if save_error or not temp_file:
+                    print(f"保存文件失败: {save_error}")
+                    results.append({
+                        "filename": file.filename,
+                        "status": "error",
+                        "error": save_error or "保存文件失败",
+                        "output_file": None,
+                        "markdown_file": None,
+                        "content": None
+                    })
+                    continue
+                
+                print(f"文件已保存到临时位置: {temp_file}")
+                
+                # 检查文件类型
+                file_ext = Path(file.filename).suffix.lower()
+                supported_formats = {
+                    '.doc': 'word',
+                    '.docx': 'word',
+                    '.pdf': 'pdf',
+                    '.html': 'html',
+                    '.htm': 'html',
+                    '.xls': 'excel',
+                    '.xlsx': 'excel'
+                }
+                
+                if file_ext not in supported_formats:
+                    print(f"不支持的文件类型: {file_ext}")
+                    results.append({
+                        "filename": file.filename,
+                        "status": "error",
+                        "error": f"不支持的文件类型: {file_ext}",
+                        "output_file": None,
+                        "markdown_file": None,
+                        "content": None
+                    })
+                    if temp_file.exists():
+                        temp_file.unlink()
+                    continue
+                
+                # 确保文件存在
+                if not temp_file.exists():
+                    print(f"错误：临时文件不存在: {temp_file}")
+                    results.append({
+                        "filename": file.filename,
+                        "status": "error",
+                        "error": "临时文件不存在",
+                        "output_file": None,
+                        "markdown_file": None,
+                        "content": None
+                    })
+                    continue
+                
+                print(f"开始处理文件内容: {temp_file}")
+                # 处理文件
+                output_file, text_content, markdown_file, error = await process_single_file(str(temp_file), cleaner)
+                
+                # Delete the temporary upload once processing is done
+                if temp_file and temp_file.exists():
+                    # safe_delete_file retries and logs a successful deletion itself
+                    if not safe_delete_file(temp_file):
+                        print(f"警告：无法完全删除临时文件，但处理已成功完成: {temp_file}")
+                
+                if error:
+                    print(f"处理文件时出错: {error}")
+                    results.append({
+                        "filename": file.filename,
+                        "status": "error",
+                        "error": str(error),
+                        "output_file": None,
+                        "markdown_file": None,
+                        "content": None
+                    })
+                    continue
+                
+                # 创建响应文件
+                response_file = OUTPUT_DIR / f"response_{Path(file.filename).stem}_output.txt"
+                response_markdown = OUTPUT_DIR / f"response_{Path(file.filename).stem}_output.md"
+                print(f"创建响应文件: {response_file}")
+                print(f"创建Markdown响应文件: {response_markdown}")
+                
+                if output_file and Path(output_file).exists():
+                    shutil.copy2(output_file, str(response_file))
+                    print(f"复制输出文件到响应文件: {output_file} -> {response_file}")
+                    
+                    # 复制Markdown文件
+                    if markdown_file and Path(markdown_file).exists():
+                        shutil.copy2(markdown_file, str(response_markdown))
+                        print(f"复制Markdown文件到响应文件: {markdown_file} -> {response_markdown}")
+                    
+                    # 删除原始输出文件
+                    Path(output_file).unlink()
+                    print(f"删除原始输出文件: {output_file}")
+                    
+                    # 删除原始Markdown文件
+                    if markdown_file and Path(markdown_file).exists():
+                        Path(markdown_file).unlink()
+                        print(f"删除原始Markdown文件: {markdown_file}")
+                else:
+                    print(f"警告：输出文件不存在: {output_file}")
+                    results.append({
+                        "filename": file.filename,
+                        "status": "error",
+                        "error": "处理后的文件不存在",
+                        "output_file": None,
+                        "markdown_file": None,
+                        "content": None
+                    })
+                    continue
+                
+                # 添加成功结果
+                results.append({
+                    "filename": file.filename,
+                    "status": "success",
+                    "error": None,
+                    "output_file": response_file.name,
+                    "markdown_file": response_markdown.name,
+                    "content": text_content or ""
+                })
+                
+                print(f"文件处理完成: {file.filename}")
+                
+            except Exception as e:
+                print(f"处理文件时出错: {file.filename}, 错误: {str(e)}")
+                results.append({
+                    "filename": file.filename,
+                    "status": "error",
+                    "error": f"处理文件时发生错误: {str(e)}",
+                    "output_file": None,
+                    "markdown_file": None,
+                    "content": None
+                })
+                # 确保清理临时文件
+                if temp_file and temp_file.exists():
+                    try:
+                        # 修改为使用安全删除函数
+                        safe_delete_file(temp_file)
+                    except Exception as cleanup_error:
+                        print(f"清理临时文件失败: {cleanup_error}")
+    
+    except Exception as e:
+        print(f"处理过程发生错误: {str(e)}")
+        return {
+            "results": results,
+            "error": f"处理过程发生错误: {str(e)}"
+        }
+    
+    # 返回处理结果
+    return {
+        "results": results,
+        "error": None if results else "没有成功处理任何文件"
+    }
+
+@app.get("/api/download/{filename:path}")
+async def download_file(filename: str):
+    """下载处理后的文件"""
+    # 确保输出目录存在
+    ensure_directories()
+    
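+    # Resolve the path and refuse anything outside OUTPUT_DIR (e.g. "../main.py")
+    output_root = OUTPUT_DIR.resolve()
+    file_path = (output_root / filename).resolve()
+    if output_root not in file_path.parents or not file_path.is_file():
+        raise HTTPException(status_code=404, detail="文件不存在")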
+    
+    # 根据文件扩展名设置正确的MIME类型
+    file_extension = Path(filename).suffix.lower()
+    if file_extension == '.md':
+        media_type = 'text/markdown'
+    elif file_extension == '.txt':
+        media_type = 'text/plain'
+    elif file_extension == '.docx':
+        media_type = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
+    else:
+        media_type = 'application/octet-stream'
+    
+    return FileResponse(
+        path=str(file_path),
+        filename=filename,
+        media_type=media_type
+    )
+
+# 在应用启动时清理所有临时目录的内容
+@app.on_event("startup")
+async def startup_event():
+    """应用启动时的初始化操作"""
+    ensure_directories()
+    clean_temp_directories()
+
+# 在应用关闭时清理所有临时目录的内容
+@app.on_event("shutdown")
+async def shutdown_event():
+    """应用关闭时的清理操作"""
+    clean_temp_directories()
+
+# 挂载静态文件目录 - 放在所有API路由之后
+app.mount("/", StaticFiles(directory=str(STATIC_DIR), html=True), name="static")
+
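+# tuple[...] is only subscriptable on Python 3.9+; the README promises Python 3.8 support
+from typing import Optional, Tuple
+
+async def save_uploaded_file(file: UploadFile) -> Tuple[Optional[Path], Optional[str]]:
+    """Save the uploaded file and return (temp file path, error message); one of the two is None"""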
+    try:
+        if not file or not file.filename:
+            return None, "无效的文件"
+            
+        # 确保上传目录存在
+        if not UPLOAD_DIR.exists():
+            UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
+            print(f"创建上传目录: {UPLOAD_DIR}")
+            
+        # 生成唯一的文件名
+        unique_id = str(uuid.uuid4())
+        # 处理文件名，移除.tmp部分
+        original_name = Path(file.filename).name
+        if '.tmp.' in original_name:
+            # 如果文件名中包含.tmp.，则移除它
+            name_parts = original_name.split('.tmp.')
+            safe_filename = name_parts[-1]  # 取.tmp.后面的部分
+        else:
+            safe_filename = original_name
+            
+        # 确保文件名只包含安全字符
+        safe_filename = re.sub(r'[^\w\-_\.]', '_', safe_filename)
+        temp_file = UPLOAD_DIR / f"temp_{unique_id}_{safe_filename}"
+        print(f"准备保存文件到: {temp_file}")
+        
+        # 读取文件内容
+        content = await file.read()
+        if not content:
+            return None, "文件内容为空"
+            
+        # 保存文件
+        with open(temp_file, "wb") as buffer:
+            buffer.write(content)
+            
+        # 验证文件是否成功保存
+        if not temp_file.exists():
+            return None, "文件保存失败"
+            
+        print(f"文件成功保存到: {temp_file}")
+        return temp_file, None
+    except Exception as e:
+        print(f"保存文件时出错: {str(e)}")
+        return None, f"保存文件时发生错误: {str(e)}"
+
+def safe_delete_file(file_path, max_retries=3, retry_delay=1.0):
+    """
+    安全删除文件，带有重试机制
+    
+    Args:
+        file_path: 要删除的文件路径
+        max_retries: 最大重试次数
+        retry_delay: 重试之间的延迟（秒）
+    
+    Returns:
+        bool: 是否成功删除文件
+    """
+    path = Path(file_path)
+    if not path.exists():
+        return True
+        
+    for attempt in range(max_retries):
+        try:
+            path.unlink()
+            print(f"删除临时文件: {file_path}")
+            return True
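+        except PermissionError as e:
+            # The file is locked by another process (WinError 32 on Windows): wait, then retry
+            print(f"Attempt {attempt+1}/{max_retries} to delete the file failed: {e}")
+            if attempt < max_retries - 1:
+                print(f"File is locked, retrying in {retry_delay} seconds...")
+                time.sleep(retry_delay)
+        except OSError as e:
+            # Any other OS error is not retried
+            print(f"Error while deleting the file: {e}")
+            return False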
+    
+    print(f"无法删除文件 {file_path}，已尝试 {max_retries} 次")
+    return False
+
+async def process_single_file(file_path: str, cleaner: DocCleaner) -> tuple[str, str, str, str]:
+    """Process one file and return (TXT output path, text content, Markdown path, error message); on failure the first three are None."""
+    image_dir = None
+    output_file = None
+    temp_docx = None
+    
+    try:
+        # 确保输入文件存在
+        file_path = Path(file_path)
+        if not file_path.exists():
+            print(f"错误：输入文件不存在: {file_path}")
+            raise FileNotFoundError(f"找不到输入文件: {file_path}")
+            
+        # 规范化文件路径
+        file_path = str(file_path.resolve())
+        print(f"规范化后的文件路径: {file_path}")
+        
+        # 处理文件名，移除.tmp部分
+        file_stem = Path(file_path).stem
+        if '.tmp.' in file_stem:
+            # 如果文件名中包含.tmp.，则移除它
+            name_parts = file_stem.split('.tmp.')
+            file_stem = name_parts[-1]  # 取.tmp.后面的部分
+        
+        # 生成唯一的图片目录名
+        unique_id = str(uuid.uuid4())[:8]
+        # 确保文件名只包含安全字符
+        safe_file_stem = re.sub(r'[^\w\-_\.]', '_', file_stem)
+        image_dir = IMAGES_DIR / f"{safe_file_stem}_{unique_id}"
+        
+        # 确保图片目录存在
+        image_dir.mkdir(parents=True, exist_ok=True)
+        print(f"创建图片目录: {image_dir}")
+        
+        # 生成输出文件路径
+        output_file = OUTPUT_DIR / f"{safe_file_stem}_output.txt"
+        markdown_file = OUTPUT_DIR / f"{safe_file_stem}_output.md"
+        docx_file = OUTPUT_DIR / f"{safe_file_stem}_output.docx"
+        print(f"输出文件路径: {output_file}")
+        print(f"Markdown文件路径: {markdown_file}")
+        print(f"Word文件路径: {docx_file}")
+        
+        # 处理文档
+        print(f"开始处理文件: {file_path}")
+        print(f"图片将保存到: {image_dir}")
+        
+        # 处理文档并保存所有格式
+        main_content, appendix, tables = cleaner.clean_doc(file_path)
+        print(f"文档处理完成: {file_path}")
+        
+        # 保存为docx格式（这个函数会同时生成txt和md文件）
+        cleaner.save_as_docx(main_content, appendix, tables, str(docx_file))
+        
+        # Merge the main text and the appendix for the returned content
+        all_content = (main_content + ["附录"] + appendix) if appendix else main_content
+        text_content = " ".join([t.replace("\n", " ").strip() for t in all_content if t.strip()])
+        
+        # 验证所有文件是否成功创建
+        if not output_file.exists():
+            raise FileNotFoundError(f"TXT文件未能成功创建: {output_file}")
+        if not markdown_file.exists():
+            raise FileNotFoundError(f"Markdown文件未能成功创建: {markdown_file}")
+            
+        return str(output_file), text_content, str(markdown_file), None
+        
+    except Exception as e:
+        print(f"处理文件时出错: {str(e)}")
+        return None, None, None, str(e)
+    finally:
+        # 清理临时文件和目录
+        try:
+            if image_dir and image_dir.exists():
+                print(f"清理图片目录: {image_dir}")
+                shutil.rmtree(str(image_dir))
+        except Exception as cleanup_error:
+            print(f"清理图片目录时出错: {str(cleanup_error)}")
+            
+        try:
+            if temp_docx and os.path.exists(temp_docx):
+                print(f"清理临时DOCX文件: {temp_docx}")
+                safe_delete_file(temp_docx)  # 使用安全删除函数
+                temp_dir = os.path.dirname(temp_docx)
+                if os.path.exists(temp_dir):
+                    try:
+                        os.rmdir(temp_dir)
+                    except Exception as dir_error:
+                        print(f"清理临时目录时出错: {str(dir_error)}")
+        except Exception as cleanup_error:
+            print(f"清理临时DOCX文件时出错: {str(cleanup_error)}") 
--- a/cxs/ocr_api.py
+++ b/cxs/ocr_api.py
@@ -0,0 +1,285 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+from fastapi import FastAPI, File, UploadFile, Form, HTTPException
+from fastapi.responses import JSONResponse
+from fastapi.staticfiles import StaticFiles
+from fastapi.middleware.cors import CORSMiddleware
+import os
+import tempfile
+from pathlib import Path
+import uuid
+import time
+import base64
+import io
+
+# 导入PDF处理器
+try:
+    from cxs_pdf_cleaner import PdfProcessor
+except ImportError:
+    try:
+        from cxs.cxs_pdf_cleaner import PdfProcessor
+    except ImportError:
+        # 如果导入失败，添加当前目录到Python路径
+        import sys
+        current_dir = os.path.dirname(os.path.abspath(__file__))
+        sys.path.append(current_dir)
+        from cxs_pdf_cleaner import PdfProcessor
+
+# 获取当前文件所在目录
+CURRENT_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
+
+# 定义目录
+TEMP_DIR = CURRENT_DIR / "temp"
+STATIC_DIR = CURRENT_DIR / "static"
+DEBUG_DIR = TEMP_DIR / "debug"
+
+# 确保所有必要的目录都存在
+def ensure_directories():
+    """确保所有必要的目录都存在"""
+    directories = [TEMP_DIR, STATIC_DIR, DEBUG_DIR]
+    for directory in directories:
+        directory.mkdir(parents=True, exist_ok=True)
+        print(f"确保目录存在: {directory}")
+
+# 初始化目录
+ensure_directories()
+
+# 创建FastAPI应用
+app = FastAPI(debug=True, title="OCR图像识别API", 
+              description="提供高级图像OCR识别服务")
+
+# 配置CORS
+origins = ["*"]
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=origins,
+    allow_credentials=False,  # a wildcard origin should not be combined with credentials
+    allow_methods=["GET", "POST", "OPTIONS"],
+    allow_headers=["*"],
+    expose_headers=["*"]
+)
+
+# 初始化PDF处理器
+pdf_processor = PdfProcessor()
+
+# 设置静态文件
+app.mount("/static", StaticFiles(directory=str(STATIC_DIR)), name="static")
+app.mount("/debug", StaticFiles(directory=str(DEBUG_DIR)), name="debug")
+
+
+@app.get("/")
+async def root():
+    """Return a welcome message and the path of the OCR test page (no redirect is issued)"""
+    return {"message": "欢迎使用OCR图像识别API", "test_page": "/static/ocr_test.html"}
+
+
+@app.post("/api/ocr")
+async def ocr_image(
+    image: UploadFile = File(...),
+    lang: str = Form("chi_sim+eng"),
+    mode: str = Form("auto")
+):
+    """
+    对上传的图片进行OCR识别
+    
+    - **image**: 要进行OCR识别的图片文件
+    - **lang**: OCR语言，默认为中文简体+英文 (chi_sim+eng)
+    - **mode**: 处理模式，auto=自动，standard=标准，advanced=高级，chinese=中文优化
+    """
+    print(f"接收到OCR请求: 文件名={image.filename}, 语言={lang}, 模式={mode}")
+    
+    # 检查文件类型
+    valid_types = ["image/jpeg", "image/png", "image/bmp", "image/tiff", "image/gif"]
+    if image.content_type not in valid_types:
+        raise HTTPException(status_code=400, detail="不支持的文件类型，请上传图片文件")
+    
+    # 创建一个唯一的ID用于此次处理
+    process_id = str(uuid.uuid4())[:8]
+    
+    # 保存上传的图片
+    temp_dir = tempfile.mkdtemp(dir=TEMP_DIR)
+    temp_path = Path(temp_dir) / f"image_{process_id}{Path(image.filename).suffix}"
+    
+    try:
+        # 保存上传的图片
+        content = await image.read()
+        with open(temp_path, "wb") as f:
+            f.write(content)
+        
+        print(f"图片已保存到临时路径: {temp_path}")
+        
+        # 记录开始时间
+        start_time = time.time()
+        
+        # 执行OCR处理
+        ocr_results = []
+        best_result = ""
+        
+        # 根据不同模式选择不同的处理参数
+        if mode == "standard":
+            # 标准模式 - 使用基本的OCR处理
+            ocr_text = pdf_processor.perform_ocr(str(temp_path), lang, retry_count=0)
+            best_result = ocr_text
+            ocr_results.append({
+                "name": "标准处理",
+                "text": ocr_text,
+                "length": len(ocr_text),
+                "confidence": 90.0,  # fixed placeholder, not a measured confidence
+                "blocks": 1
+            })
+        elif mode == "chinese":
+            # 中文优化模式 - 使用中文专项处理
+            source_image = pdf_processor._read_image(str(temp_path))  # do not shadow the `image` upload parameter
+            if source_image is not None:
+                # Apply the Chinese-specific optimization
+                processed = pdf_processor._optimize_for_chinese(source_image)
+                # 保存处理后的图像以供显示
+                debug_path = DEBUG_DIR / f"chinese_{process_id}.png"
+                pdf_processor._save_debug_image(processed, str(debug_path))
+                # 执行OCR
+                ocr_text = pdf_processor.perform_ocr(str(debug_path), lang, retry_count=1)
+                best_result = ocr_text
+                ocr_results.append({
+                    "name": "中文优化",
+                    "text": ocr_text,
+                    "length": len(ocr_text),
+                    "confidence": 90.0,  # fixed placeholder, not a measured confidence
+                    "blocks": 1
+                })
+        elif mode == "advanced":
+            # 高级模式 - 使用多种处理方法并比较结果
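+            # Read the original image (named so it does not shadow the `image` upload parameter)
+            source_image = pdf_processor._read_image(str(temp_path))
+            if source_image is not None:
+                # Apply several preprocessing methods
+                preprocessed_images = pdf_processor._apply_multiple_preprocessing(source_image)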
+                
+                # 对每个预处理后的图像执行OCR并比较结果
+                best_length = 0
+                best_confidence = 0
+                
+                for method_name, processed_image in preprocessed_images:
+                    # 保存处理后的图像以供显示
+                    debug_path = DEBUG_DIR / f"{method_name.replace(' ', '_').lower()}_{process_id}.png"
+                    pdf_processor._save_debug_image(processed_image, str(debug_path))
+                    
+                    # 执行OCR
+                    try:
+                        import pytesseract
+                        ocr_result = pytesseract.image_to_data(processed_image, lang=lang, output_type=pytesseract.Output.DICT)
+                        
+                        # 提取文本
+                        extracted_text = []
+                        total_confidence = 0
+                        valid_blocks = 0
+                        
+                        for i in range(len(ocr_result['text'])):
+                            confidence = float(ocr_result['conf'][i])  # conf can be a str in some pytesseract versions
+                            text = ocr_result['text'][i].strip()
+                            
+                            if confidence > pdf_processor.min_text_confidence and text:
+                                extracted_text.append(text)
+                                total_confidence += confidence
+                                valid_blocks += 1
+                        
+                        # 合并结果
+                        result_text = " ".join(extracted_text)
+                        result_length = len(result_text)
+                        avg_confidence = total_confidence / valid_blocks if valid_blocks > 0 else 0
+                        
+                        ocr_results.append({
+                            "name": method_name,
+                            "text": result_text,
+                            "length": result_length,
+                            "confidence": avg_confidence,
+                            "blocks": valid_blocks
+                        })
+                        
+                        # 更新最佳结果
+                        if result_length > 0:
+                            if (result_length > best_length * 1.5) or \
+                               (result_length >= best_length * 0.8 and avg_confidence > best_confidence):
+                                best_result = result_text
+                                best_length = result_length
+                                best_confidence = avg_confidence
+                    
+                    except Exception as e:
+                        print(f"处理方法 {method_name} 失败: {str(e)}")
+        else:
+            # 自动模式 - 使用完整的OCR处理流程
+            best_result = pdf_processor.perform_ocr(str(temp_path), lang, retry_count=3)
+            
+            # Record the result
+            ocr_results.append({
+                "name": "自动处理",
+                "text": best_result,
+                "length": len(best_result),
+                "confidence": 90.0,  # fixed placeholder, not a measured confidence
+                "blocks": 1
+            })
+        
+        # 计算处理时间
+        processing_time = time.time() - start_time
+        print(f"OCR处理完成，耗时: {processing_time:.2f}秒")
+        
+        # 收集处理后的图像列表
+        processed_images = []
+        try:
+            # 查找调试目录中的图像
+            debug_files = list(DEBUG_DIR.glob(f"*_{process_id}.png"))
+            for debug_file in debug_files:
+                # Recover the method name by dropping the trailing "_<process_id>" suffix
+                method_name = debug_file.stem.rsplit('_', 1)[0].replace('_', ' ').title()
+                
+                # 创建图像URL
+                image_url = f"/debug/{debug_file.name}"
+                
+                processed_images.append({
+                    "name": method_name,
+                    "url": image_url
+                })
+        except Exception as e:
+            print(f"收集处理图像时出错: {str(e)}")
+        
+        # 根据OCR结果长度排序
+        ocr_results.sort(key=lambda x: x['length'], reverse=True)
+        
+        # 返回OCR结果
+        response = {
+            "text": best_result,
+            "processing_time": processing_time,
+            "lang": lang,
+            "mode": mode,
+            "methods": ocr_results,
+            "processed_images": processed_images
+        }
+        
+        return JSONResponse(content=response)
+        
+    except Exception as e:
+        import traceback
+        traceback.print_exc()
+        raise HTTPException(status_code=500, detail=f"OCR处理失败: {str(e)}")
+    finally:
+        # 清理临时文件
+        try:
+            if temp_path.exists():
+                temp_path.unlink()
+            
+            if Path(temp_dir).exists():
+                os.rmdir(temp_dir)
+                
+            print("Temporary files cleaned up")
+        except Exception as e:
+            print(f"清理临时文件时出错: {str(e)}")
+
+
+if __name__ == "__main__":
+    import uvicorn
+    print("启动OCR API服务...")
+    print(f"当前工作目录: {os.getcwd()}")
+    print(f"静态文件目录: {STATIC_DIR}")
+    print(f"调试文件目录: {DEBUG_DIR}")
+    # 启动服务器
+    uvicorn.run(app, host="0.0.0.0", port=8001) 
--- a/cxs/requirements.txt
+++ b/cxs/requirements.txt
@@ -0,0 +1,20 @@
+fastapi==0.104.1
+python-multipart==0.0.6
+uvicorn==0.24.0
+python-docx==1.0.1
+numpy>=1.26.2
+scikit-learn>=1.3.2
+requests>=2.32.2
+reportlab==4.0.4
+python-Levenshtein>=0.22.0
+regex>=2023.0.0
+pdf2docx>=0.5.6
+pytesseract>=0.3.10
+opencv-python>=4.8.0
+Pillow>=10.0.0
+beautifulsoup4>=4.12.0
+html2text>=2020.1.16
+pandas>=2.0.0
+aiofiles>=23.1.0
+openpyxl>=3.1.2
+# uuid is part of the Python standard library; the PyPI "uuid" package must not be installed
--- a/cxs/static/index.html
+++ b/cxs/static/index.html
@@ -0,0 +1,468 @@
+<!DOCTYPE html>
+<html lang="zh">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>文档处理系统</title>
+    <style>
+        body {
+            font-family: 'Microsoft YaHei', sans-serif;
+            max-width: 1000px;
+            margin: 0 auto;
+            padding: 20px;
+            background-color: #f5f5f5;
+        }
+        .container {
+            background-color: white;
+            padding: 30px;
+            border-radius: 8px;
+            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+        }
+        h1 {
+            color: #333;
+            text-align: center;
+            margin-bottom: 30px;
+        }
+        .upload-area {
+            border: 2px dashed #ccc;
+            padding: 20px;
+            text-align: center;
+            margin-bottom: 20px;
+            border-radius: 4px;
+            cursor: pointer;
+            transition: all 0.3s ease;
+        }
+        .upload-area:hover {
+            border-color: #666;
+        }
+        .upload-area.dragover {
+            border-color: #4CAF50;
+            background-color: #E8F5E9;
+        }
+        #file-input {
+            display: none;
+        }
+        .btn {
+            background-color: #4CAF50;
+            color: white;
+            padding: 10px 20px;
+            border: none;
+            border-radius: 4px;
+            cursor: pointer;
+            font-size: 16px;
+            transition: background-color 0.3s ease;
+            margin: 0 5px;
+        }
+        .btn:hover {
+            background-color: #45a049;
+        }
+        .btn:disabled {
+            background-color: #cccccc;
+            cursor: not-allowed;
+        }
+        #status {
+            margin-top: 20px;
+            padding: 10px;
+            border-radius: 4px;
+            display: none;
+        }
+        .success {
+            background-color: #E8F5E9;
+            color: #2E7D32;
+        }
+        .error {
+            background-color: #FFEBEE;
+            color: #C62828;
+        }
+        .file-list {
+            margin: 20px 0;
+            max-height: 300px;
+            overflow-y: auto;
+        }
+        .file-item {
+            display: flex;
+            align-items: center;
+            justify-content: space-between;
+            padding: 10px;
+            border: 1px solid #ddd;
+            margin-bottom: 5px;
+            border-radius: 4px;
+        }
+        .file-item .progress-container {
+            flex: 1;
+            margin: 0 20px;
+            background-color: #f0f0f0;
+            border-radius: 10px;
+            overflow: hidden;
+        }
+        .file-item .progress-bar {
+            height: 20px;
+            background-color: #4CAF50;
+            width: 0%;
+            transition: width 0.3s ease;
+            border-radius: 10px;
+            position: relative;
+        }
+        .progress-text {
+            position: absolute;
+            width: 100%;
+            text-align: center;
+            color: white;
+            font-size: 12px;
+            line-height: 20px;
+        }
+        .file-item .remove-btn {
+            background-color: #f44336;
+            color: white;
+            border: none;
+            padding: 5px 10px;
+            border-radius: 3px;
+            cursor: pointer;
+        }
+        .result-container {
+            margin-top: 20px;
+            border-top: 1px solid #ddd;
+            padding-top: 20px;
+        }
+        .result-item {
+            display: flex;
+            justify-content: space-between;
+            align-items: center;
+            padding: 10px;
+            border: 1px solid #ddd;
+            margin-bottom: 5px;
+            border-radius: 4px;
+            background-color: #fff;
+        }
+        .result-item.error {
+            background-color: #FFEBEE;
+        }
+        .result-item.success {
+            background-color: #E8F5E9;
+        }
+        .result-info {
+            flex: 1;
+            margin-right: 10px;
+        }
+        .button-group {
+            text-align: center;
+            margin: 20px 0;
+        }
+        .result-text {
+            max-height: 300px;
+            overflow-y: auto;
+            border: 1px solid #ddd;
+            padding: 10px;
+            margin-top: 10px;
+            background-color: #fff;
+            border-radius: 4px;
+            white-space: pre-wrap;
+            display: none;
+        }
+        .result-buttons {
+            display: flex;
+            gap: 10px;
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+        <h1>文档处理系统</h1>
+        <div class="upload-area" id="drop-area">
+            <p>点击或拖拽文件到此处上传</p>
+            <p>支持的格式：.doc, .docx, .pdf, .html, .htm, .xls, .xlsx</p>
+            <p>可以同时选择多个文件</p>
+            <input type="file" id="file-input" accept=".doc,.docx,.pdf,.html,.htm,.xls,.xlsx" multiple>
+        </div>
+        <div class="file-list" id="file-list"></div>
+        <div class="button-group">
+            <button id="upload-btn" class="btn" disabled>开始处理</button>
+            <button id="clear-btn" class="btn" style="background-color: #f44336;">清空列表</button>
+        </div>
+        <div id="status"></div>
+        <div class="result-container">
+            <h2>处理结果</h2>
+            <div id="result-list"></div>
+        </div>
+    </div>
+
+    <script>
+        const dropArea = document.getElementById('drop-area');
+        const fileInput = document.getElementById('file-input');
+        const uploadBtn = document.getElementById('upload-btn');
+        const clearBtn = document.getElementById('clear-btn');
+        const status = document.getElementById('status');
+        const fileList = document.getElementById('file-list');
+        const resultList = document.getElementById('result-list');
+        
+        let files = new Map(); // 存储待处理的文件
+        let processing = false; // 是否正在处理文件
+
+        // 处理拖拽事件
+        ['dragenter', 'dragover', 'dragleave', 'drop'].forEach(eventName => {
+            dropArea.addEventListener(eventName, preventDefaults, false);
+        });
+
+        function preventDefaults(e) {
+            e.preventDefault();
+            e.stopPropagation();
+        }
+
+        ['dragenter', 'dragover'].forEach(eventName => {
+            dropArea.addEventListener(eventName, highlight, false);
+        });
+
+        ['dragleave', 'drop'].forEach(eventName => {
+            dropArea.addEventListener(eventName, unhighlight, false);
+        });
+
+        function highlight(e) {
+            dropArea.classList.add('dragover');
+        }
+
+        function unhighlight(e) {
+            dropArea.classList.remove('dragover');
+        }
+
+        // 处理文件拖放
+        dropArea.addEventListener('drop', handleDrop, false);
+
+        function handleDrop(e) {
+            const dt = e.dataTransfer;
+            handleFiles(Array.from(dt.files));
+        }
+
+        // 点击上传区域触发文件选择
+        dropArea.addEventListener('click', () => {
+            fileInput.click();
+        });
+
+        fileInput.addEventListener('change', function() {
+            handleFiles(Array.from(this.files));
+            this.value = ''; // 清空input，允许重复选择相同文件
+        });
+
+        // 清空按钮事件
+        clearBtn.addEventListener('click', () => {
+            if (!processing) {
+                files.clear();
+                updateFileList();
+                uploadBtn.disabled = true;
+            }
+        });
+
+        function handleFiles(newFiles) {
+            const validTypes = ['.doc', '.docx', '.pdf', '.html', '.htm', '.xls', '.xlsx'];
+            
+            newFiles.forEach(file => {
+                const fileExtension = file.name.toLowerCase().slice(file.name.lastIndexOf('.'));
+                if (validTypes.includes(fileExtension)) {
+                    files.set(file.name, {
+                        file: file,
+                        progress: 0,
+                        status: 'pending' // pending, processing, completed, error
+                    });
+                }
+            });
+            
+            updateFileList();
+            uploadBtn.disabled = files.size === 0;
+        }
+
+        function updateFileList() {
+            fileList.innerHTML = '';
+            files.forEach((fileData, fileName) => {
+                const fileItem = document.createElement('div');
+                fileItem.className = 'file-item';
+                
+                const nameSpan = document.createElement('span');
+                nameSpan.textContent = fileName;
+                
+                const progressContainer = document.createElement('div');
+                progressContainer.className = 'progress-container';
+                
+                const progressBar = document.createElement('div');
+                progressBar.className = 'progress-bar';
+                progressBar.style.width = fileData.progress + '%';
+                
+                const progressText = document.createElement('div');
+                progressText.className = 'progress-text';
+                progressText.textContent = fileData.progress + '%';
+                
+                const removeBtn = document.createElement('button');
+                removeBtn.className = 'remove-btn';
+                removeBtn.textContent = '删除';
+                removeBtn.onclick = () => {
+                    if (!processing) {
+                        files.delete(fileName);
+                        updateFileList();
+                        uploadBtn.disabled = files.size === 0;
+                    }
+                };
+                
+                progressBar.appendChild(progressText);
+                progressContainer.appendChild(progressBar);
+                fileItem.appendChild(nameSpan);
+                fileItem.appendChild(progressContainer);
+                fileItem.appendChild(removeBtn);
+                fileList.appendChild(fileItem);
+            });
+        }
+
+        // 处理文件上传
+        uploadBtn.addEventListener('click', async () => {
+            if (processing || files.size === 0) return;
+            
+            processing = true;
+            uploadBtn.disabled = true;
+            status.style.display = 'none';
+            resultList.innerHTML = '';
+            
+            try {
+                const results = [];
+                // 一个一个处理文件
+                for (const [fileName, fileData] of files.entries()) {
+                    const formData = new FormData();
+                    formData.append('files', fileData.file);
+                    
+                    // 更新进度显示
+                    fileData.status = 'processing';
+                    updateFileList();
+                    
+                    try {
+                        const response = await fetch('/api/upload/', {
+                            method: 'POST',
+                            body: formData,
+                            credentials: 'same-origin'
+                        });
+                        
+                        if (!response.ok) {
+                            throw new Error(`HTTP error! status: ${response.status}`);
+                        }
+                        
+                        const result = await response.json();
+                        console.log(`文件 ${fileName} 处理结果:`, result);  // 调试日志
+                        
+                        if (result.error) {
+                            fileData.status = 'error';
+                            showMessage(`文件 ${fileName} 处理失败: ${result.error}`);
+                        } else if (result.results && result.results.length > 0) {
+                            fileData.status = 'completed';
+                            results.push(...result.results);
+                        }
+                    } catch (error) {
+                        console.error(`文件 ${fileName} 处理错误:`, error);
+                        fileData.status = 'error';
+                        showMessage(`文件 ${fileName} 处理失败: ${error.message}`);
+                    }
+                    
+                    // 更新进度显示
+                    fileData.progress = 100;
+                    updateFileList();
+                    
+                    // 等待一小段时间，确保文件处理完成
+                    await new Promise(resolve => setTimeout(resolve, 500));
+                }
+                
+                // 显示所有处理结果
+                displayResults(results);
+                
+            } catch (error) {
+                console.error('处理错误:', error);
+                showMessage(`处理失败: ${error.message}`);
+            } finally {
+                processing = false;
+                files.clear();
+                updateFileList();
+                uploadBtn.disabled = files.size === 0;
+            }
+        });
+
+        async function displayResults(results) {
+            if (results.length === 0) {
+                showMessage('没有文件被处理');
+                return;
+            }
+            
+            results.forEach(result => {
+                const resultItem = document.createElement('div');
+                resultItem.className = `result-item ${result.status}`;
+                
+                const resultInfo = document.createElement('div');
+                resultInfo.className = 'result-info';
+                
+                if (result.status === 'success') {
+                    resultInfo.append(Object.assign(document.createElement('strong'), { textContent: result.filename }), ' 处理成功');
+                    
+                    const buttonsDiv = document.createElement('div');
+                    buttonsDiv.className = 'result-buttons';
+                    
+                    // 下载TXT按钮
+                    if (result.output_file) {
+                        const downloadBtn = document.createElement('button');
+                        downloadBtn.className = 'btn';
+                        downloadBtn.textContent = '下载TXT';
+                        downloadBtn.onclick = () => {
+                            window.location.href = `/api/download/${result.output_file}`;
+                        };
+                        buttonsDiv.appendChild(downloadBtn);
+                    }
+                    
+                    // 下载Markdown按钮
+                    if (result.markdown_file) {
+                        const downloadMarkdownBtn = document.createElement('button');
+                        downloadMarkdownBtn.className = 'btn';
+                        downloadMarkdownBtn.style.backgroundColor = '#2196F3'; // 使用不同的颜色区分
+                        downloadMarkdownBtn.textContent = '下载MD';
+                        downloadMarkdownBtn.onclick = () => {
+                            window.location.href = `/api/download/${result.markdown_file}`;
+                        };
+                        buttonsDiv.appendChild(downloadMarkdownBtn);
+                    }
+                    
+                    // 查看内容按钮
+                    if (result.content) {
+                        const showTextBtn = document.createElement('button');
+                        showTextBtn.className = 'btn';
+                        showTextBtn.textContent = '查看内容';
+                        
+                        const textDiv = document.createElement('div');
+                        textDiv.className = 'result-text';
+                        textDiv.textContent = result.content;
+                        textDiv.style.display = 'none';
+                        
+                        showTextBtn.onclick = () => {
+                            const isVisible = textDiv.style.display === 'block';
+                            textDiv.style.display = isVisible ? 'none' : 'block';
+                            showTextBtn.textContent = isVisible ? '查看内容' : '隐藏内容';
+                        };
+                        
+                        buttonsDiv.appendChild(showTextBtn);
+                        resultItem.appendChild(textDiv);
+                    }
+                    
+                    resultItem.appendChild(resultInfo);
+                    resultItem.appendChild(buttonsDiv);
+                } else {
+                    resultInfo.replaceChildren(Object.assign(document.createElement('strong'), { textContent: result.filename }), ` 处理失败: ${result.error || '未知错误'}`);
+                    resultItem.appendChild(resultInfo);
+                }
+                
+                resultList.appendChild(resultItem);
+            });
+        }
+
+        function showMessage(message) {
+            const statusDiv = document.getElementById('status');
+            statusDiv.textContent = message;
+            statusDiv.className = 'error';
+            statusDiv.style.display = 'block';
+            setTimeout(() => {
+                statusDiv.style.display = 'none';
+                statusDiv.textContent = '';
+                statusDiv.className = '';
+            }, 3000);
+        }
+    </script>
+</body>
+</html> 
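The result list in `index.html` interpolates server-supplied names such as `result.output_file` and `result.filename` into URLs and markup. Below is a minimal sketch of two helpers that would make that interpolation safe. `buildDownloadUrl` and `escapeHtml` are hypothetical names that do not appear in this commit; the `/api/download/` prefix is taken from the page's own click handlers.

```javascript
// Hypothetical helpers (not part of the commit), shown as a sketch only.

// Percent-encode the file name so characters such as "#", "?" or spaces
// cannot change the meaning of the download URL.
function buildDownloadUrl(fileName) {
    return `/api/download/${encodeURIComponent(String(fileName))}`;
}

// Escape the five characters that are significant in HTML text and attributes.
function escapeHtml(value) {
    const replacements = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
    return String(value).replace(/[&<>"']/g, ch => replacements[ch]);
}

console.log(buildDownloadUrl('report 1#draft.txt')); // → /api/download/report%201%23draft.txt
console.log(escapeHtml('<b>report & notes</b>'));    // → &lt;b&gt;report &amp; notes&lt;/b&gt;
```

Where an element is built with `document.createElement`, assigning `textContent` avoids the need for `escapeHtml` altogether; the helper only matters when a template string is assigned to `innerHTML`.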
--- a/cxs/static/ocr_test.html
+++ b/cxs/static/ocr_test.html
@@ -0,0 +1,526 @@
+<!DOCTYPE html>
+<html lang="zh-CN">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>OCR图像识别测试</title>
+    <style>
+        body {
+            font-family: 'Microsoft YaHei', Arial, sans-serif;
+            background-color: #f5f7fa;
+            margin: 0;
+            padding: 20px;
+            color: #333;
+            line-height: 1.6;
+        }
+        .container {
+            max-width: 1200px;
+            margin: 0 auto;
+            background: white;
+            padding: 25px;
+            border-radius: 8px;
+            box-shadow: 0 2px 10px rgba(0,0,0,0.1);
+        }
+        h1 {
+            text-align: center;
+            color: #2c3e50;
+            margin-bottom: 30px;
+            border-bottom: 2px solid #eee;
+            padding-bottom: 15px;
+        }
+        .subtitle {
+            color: #7f8c8d;
+            text-align: center;
+            margin-top: -20px;
+            margin-bottom: 30px;
+        }
+        .upload-container {
+            border: 2px dashed #3498db;
+            border-radius: 8px;
+            padding: 40px;
+            text-align: center;
+            margin-bottom: 20px;
+            background-color: #f8fafc;
+            transition: background-color 0.3s;
+        }
+        .upload-container.dragover {
+            background-color: #e1f0fa;
+        }
+        .upload-container p {
+            margin: 0;
+            color: #7f8c8d;
+        }
+        .upload-icon {
+            font-size: 50px;
+            color: #3498db;
+            margin-bottom: 15px;
+        }
+        .file-input {
+            display: none;
+        }
+        .upload-btn, .ocr-btn {
+            background-color: #3498db;
+            color: white;
+            padding: 10px 20px;
+            border: none;
+            border-radius: 4px;
+            cursor: pointer;
+            font-size: 16px;
+            transition: background-color 0.3s;
+            margin: 10px 5px;
+        }
+        .upload-btn:hover, .ocr-btn:hover {
+            background-color: #2980b9;
+        }
+        .ocr-btn {
+            background-color: #2ecc71;
+            display: none;
+        }
+        .ocr-btn:hover {
+            background-color: #27ae60;
+        }
+        .ocr-btn:disabled {
+            background-color: #95a5a6;
+            cursor: not-allowed;
+        }
+        .preview-container {
+            margin-top: 20px;
+            text-align: center;
+        }
+        .image-preview {
+            max-width: 100%;
+            max-height: 400px;
+            border-radius: 4px;
+            box-shadow: 0 2px 5px rgba(0,0,0,0.1);
+            display: none;
+        }
+        .settings {
+            background-color: #f8fafc;
+            padding: 15px;
+            border-radius: 8px;
+            margin: 20px 0;
+            display: none;
+        }
+        .settings h3 {
+            margin-top: 0;
+            color: #2c3e50;
+        }
+        .form-group {
+            margin-bottom: 15px;
+        }
+        .form-group label {
+            display: block;
+            margin-bottom: 5px;
+            font-weight: bold;
+            color: #34495e;
+        }
+        .form-control {
+            width: 100%;
+            padding: 10px;
+            border: 1px solid #ddd;
+            border-radius: 4px;
+            box-sizing: border-box;
+            font-family: inherit;
+            font-size: 16px;
+        }
+        .results {
+            margin-top: 30px;
+            display: none;
+        }
+        .tabs {
+            display: flex;
+            border-bottom: 1px solid #ddd;
+            margin-bottom: 20px;
+        }
+        .tab {
+            padding: 10px 20px;
+            cursor: pointer;
+            border: 1px solid transparent;
+            border-radius: 4px 4px 0 0;
+            margin-right: 5px;
+            background-color: #f8f9fa;
+        }
+        .tab.active {
+            border: 1px solid #ddd;
+            border-bottom-color: white;
+            background-color: white;
+            font-weight: bold;
+        }
+        .tab-content {
+            display: none;
+            padding: 20px;
+            border: 1px solid #ddd;
+            border-top: none;
+            border-radius: 0 0 4px 4px;
+        }
+        .tab-content.active {
+            display: block;
+        }
+        .ocr-text {
+            background-color: #f8f9fa;
+            padding: 15px;
+            border-radius: 4px;
+            white-space: pre-wrap;
+            font-family: 'Courier New', monospace;
+            line-height: 1.5;
+            max-height: 300px;
+            overflow-y: auto;
+            border: 1px solid #ddd;
+        }
+        .processing-info {
+            margin-top: 20px;
+            padding: 15px;
+            background-color: #f0f7fb;
+            border-radius: 4px;
+            border-left: 5px solid #3498db;
+        }
+        .method-result {
+            margin: 10px 0;
+            padding: 15px;
+            background-color: #f8fafc;
+            border-radius: 4px;
+            border: 1px solid #ddd;
+        }
+        .method-result h4 {
+            margin-top: 0;
+            color: #2c3e50;
+        }
+        .confidence-bar {
+            height: 10px;
+            background-color: #ecf0f1;
+            border-radius: 5px;
+            margin: 5px 0;
+            position: relative;
+        }
+        .confidence-value {
+            height: 100%;
+            background-color: #2ecc71;
+            border-radius: 5px;
+            position: absolute;
+            left: 0;
+            top: 0;
+        }
+        .processed-images {
+            display: flex;
+            flex-wrap: wrap;
+            gap: 15px;
+            margin-top: 20px;
+        }
+        .processed-image {
+            max-width: calc(50% - 15px);
+            border: 1px solid #ddd;
+            border-radius: 4px;
+            padding: 10px;
+            box-shadow: 0 1px 3px rgba(0,0,0,0.1);
+            background-color: white;
+        }
+        .processed-image h4 {
+            margin-top: 0;
+            text-align: center;
+            color: #2c3e50;
+            font-size: 14px;
+        }
+        .processed-image img {
+            max-width: 100%;
+            border-radius: 4px;
+        }
+        .loader {
+            border: 5px solid #f3f3f3;
+            border-top: 5px solid #3498db;
+            border-radius: 50%;
+            width: 30px;
+            height: 30px;
+            animation: spin 2s linear infinite;
+            margin: 20px auto;
+            display: none;
+        }
+        @keyframes spin {
+            0% { transform: rotate(0deg); }
+            100% { transform: rotate(360deg); }
+        }
+        .error-message {
+            color: #e74c3c;
+            padding: 10px;
+            background-color: #fadbd8;
+            border-radius: 4px;
+            margin: 20px 0;
+            display: none;
+        }
+    </style>
+</head>
+<body>
+    <div class="container">
+        <h1>OCR图像识别测试</h1>
+        <p class="subtitle">上传图片并测试文字识别效果</p>
+        
+        <div class="upload-container" id="uploadContainer">
+            <div class="upload-icon">📁</div>
+            <p>拖放图片到这里，或点击上传</p>
+            <input type="file" id="fileInput" class="file-input" accept="image/*">
+            <button class="upload-btn" id="uploadBtn">选择图片</button>
+        </div>
+        
+        <div class="preview-container">
+            <img id="imagePreview" class="image-preview">
+        </div>
+        
+        <div class="settings" id="settings">
+            <h3>OCR设置</h3>
+            <div class="form-group">
+                <label for="langSelect">识别语言</label>
+                <select id="langSelect" class="form-control">
+                    <option value="chi_sim+eng" selected>中文简体+英文</option>
+                    <option value="chi_sim">中文简体</option>
+                    <option value="eng">英文</option>
+                    <option value="chi_tra">中文繁体</option>
+                    <option value="jpn">日语</option>
+                    <option value="kor">韩语</option>
+                    <option value="rus">俄语</option>
+                </select>
+            </div>
+            <div class="form-group">
+                <label for="modeSelect">处理模式</label>
+                <select id="modeSelect" class="form-control">
+                    <option value="auto" selected>自动模式</option>
+                    <option value="standard">标准模式</option>
+                    <option value="chinese">中文优化</option>
+                    <option value="advanced">高级模式</option>
+                </select>
+            </div>
+            <button class="ocr-btn" id="ocrBtn">执行OCR</button>
+        </div>
+        
+        <div class="loader" id="loader"></div>
+        <div class="error-message" id="errorMessage"></div>
+        
+        <div class="results" id="results">
+            <div class="tabs">
+                <div class="tab active" data-tab="text">识别文本</div>
+                <div class="tab" data-tab="details">处理详情</div>
+                <div class="tab" data-tab="images">处理图像</div>
+            </div>
+            
+            <div class="tab-content active" id="textContent">
+                <h3>OCR识别结果</h3>
+                <div class="ocr-text" id="ocrText"></div>
+                <div class="processing-info" id="processingInfo"></div>
+            </div>
+            
+            <div class="tab-content" id="detailsContent">
+                <h3>处理方法详情</h3>
+                <div id="methodsList"></div>
+            </div>
+            
+            <div class="tab-content" id="imagesContent">
+                <h3>处理后的图像</h3>
+                <div class="processed-images" id="processedImages"></div>
+            </div>
+        </div>
+    </div>
+    
+    <script>
+        document.addEventListener('DOMContentLoaded', function() {
+            const uploadContainer = document.getElementById('uploadContainer');
+            const fileInput = document.getElementById('fileInput');
+            const uploadBtn = document.getElementById('uploadBtn');
+            const imagePreview = document.getElementById('imagePreview');
+            const settings = document.getElementById('settings');
+            const ocrBtn = document.getElementById('ocrBtn');
+            const results = document.getElementById('results');
+            const ocrText = document.getElementById('ocrText');
+            const processingInfo = document.getElementById('processingInfo');
+            const methodsList = document.getElementById('methodsList');
+            const processedImages = document.getElementById('processedImages');
+            const loader = document.getElementById('loader');
+            const errorMessage = document.getElementById('errorMessage');
+            const tabs = document.querySelectorAll('.tab');
+            const tabContents = document.querySelectorAll('.tab-content');
+            
+            // Handle file selection
+            fileInput.addEventListener('change', handleFileSelect);
+            uploadBtn.addEventListener('click', () => fileInput.click());
+            
+            // Drag-and-drop support
+            uploadContainer.addEventListener('dragover', (e) => {
+                e.preventDefault();
+                uploadContainer.classList.add('dragover');
+            });
+            
+            uploadContainer.addEventListener('dragleave', () => {
+                uploadContainer.classList.remove('dragover');
+            });
+            
+            uploadContainer.addEventListener('drop', (e) => {
+                e.preventDefault();
+                uploadContainer.classList.remove('dragover');
+                if (e.dataTransfer.files.length > 0) {
+                    fileInput.files = e.dataTransfer.files;
+                    handleFileSelect(e);
+                }
+            });
+            
+            // Run OCR when the button is clicked
+            ocrBtn.addEventListener('click', performOCR);
+            
+            // Tab switching
+            tabs.forEach(tab => {
+                tab.addEventListener('click', () => {
+                    tabs.forEach(t => t.classList.remove('active'));
+                    tabContents.forEach(c => c.classList.remove('active'));
+                    
+                    tab.classList.add('active');
+                    const tabId = tab.getAttribute('data-tab');
+                    document.getElementById(`${tabId}Content`).classList.add('active');
+                });
+            });
+            
+            function handleFileSelect(e) {
+                const file = fileInput.files[0];
+                if (!file) return;
+                
+                // Accept only files whose MIME type is image/*
+                if (!file.type.startsWith('image/')) {
+                    showError('请选择图片文件');
+                    return;
+                }
+                
+                // Hide any previous error message and results
+                errorMessage.style.display = 'none';
+                results.style.display = 'none';
+                
+                // Update the preview
+                const reader = new FileReader();
+                reader.onload = function(e) {
+                    imagePreview.src = e.target.result;
+                    imagePreview.style.display = 'block';
+                    settings.style.display = 'block';
+                    ocrBtn.style.display = 'block';
+                    ocrBtn.disabled = false;
+                };
+                reader.readAsDataURL(file);
+            }
+            
+            function performOCR() {
+                const file = fileInput.files[0];
+                if (!file) {
+                    showError('请先选择图片文件');
+                    return;
+                }
+                
+                const lang = document.getElementById('langSelect').value;
+                const mode = document.getElementById('modeSelect').value;
+                
+                // Show the loading state
+                loader.style.display = 'block';
+                ocrBtn.disabled = true;
+                errorMessage.style.display = 'none';
+                results.style.display = 'none';
+                
+                // Build the FormData payload
+                const formData = new FormData();
+                formData.append('image', file);
+                formData.append('lang', lang);
+                formData.append('mode', mode);
+                
+                // Send the OCR request
+                fetch('/api/ocr', {
+                    method: 'POST',
+                    body: formData
+                })
+                .then(response => {
+                    if (!response.ok) {
+                        return response.json().catch(() => ({})).then(err => {
+                            throw new Error(err.detail || '处理图片时出错');
+                        });
+                    }
+                    return response.json();
+                })
+                .then(data => {
+                    // Hide the loading state
+                    loader.style.display = 'none';
+                    
+                    // Show the OCR result
+                    ocrText.textContent = data.text || '未识别到文本';
+                    
+                    // Show processing info (guards against a missing processing_time)
+                    processingInfo.innerHTML = `
+                        <p><strong>处理时间:</strong> ${Number(data.processing_time || 0).toFixed(2)}秒</p>
+                        <p><strong>识别语言:</strong> ${data.lang}</p>
+                        <p><strong>处理模式:</strong> ${getModeLabel(data.mode)}</p>
+                        <p><strong>识别文本长度:</strong> ${data.text ? data.text.length : 0}个字符</p>
+                    `;
+                    
+                    // Show per-method details
+                    methodsList.innerHTML = '';
+                    if (data.methods && data.methods.length > 0) {
+                        data.methods.forEach(method => {
+                            const methodDiv = document.createElement('div');
+                            methodDiv.className = 'method-result';
+                            
+                            const confidencePercent = Number(method.confidence) || 0;
+                            
+                            methodDiv.innerHTML = `
+                                <h4></h4>
+                                <p><strong>文本长度:</strong> ${method.length} 字符</p>
+                                <p><strong>置信度:</strong> ${confidencePercent.toFixed(2)}%</p>
+                                <div class="confidence-bar"><div class="confidence-value" style="width: ${Math.min(100, confidencePercent)}%"></div></div>
+                                <p><strong>文本块数:</strong> ${method.blocks}</p>
+                                <div class="ocr-text"></div>
+                            `;
+                            // Set the name and OCR text via textContent so "<" or "&" in them is not parsed as HTML
+                            methodDiv.querySelector('h4').textContent = method.name;
+                            methodDiv.querySelector('.ocr-text').textContent = method.text || '未识别到文本';
+                            methodsList.appendChild(methodDiv);
+                        });
+                    } else {
+                        methodsList.innerHTML = '<p>没有可用的处理方法详情</p>';
+                    }
+                    
+                    // Show the processed images
+                    processedImages.innerHTML = '';
+                    if (data.processed_images && data.processed_images.length > 0) {
+                        data.processed_images.forEach(img => {
+                            const imgDiv = document.createElement('div');
+                            imgDiv.className = 'processed-image';
+                            imgDiv.innerHTML = '<h4></h4><img>';
+                            imgDiv.querySelector('h4').textContent = img.name;
+                            imgDiv.querySelector('img').src = img.url;
+                            imgDiv.querySelector('img').alt = img.name;
+                            processedImages.appendChild(imgDiv);
+                        });
+                    } else {
+                        processedImages.innerHTML = '<p>没有处理后的图像可供显示</p>';
+                    }
+                    
+                    // Show the results area
+                    results.style.display = 'block';
+                    
+                    // Re-enable the OCR button
+                    ocrBtn.disabled = false;
+                })
+                .catch(error => {
+                    console.error('OCR处理失败:', error);
+                    loader.style.display = 'none';
+                    ocrBtn.disabled = false;
+                    showError(error.message || '处理图片时出错，请重试');
+                });
+            }
+            
+            function showError(message) {
+                errorMessage.textContent = message;
+                errorMessage.style.display = 'block';
+            }
+            
+            function getModeLabel(mode) {
+                const modes = {
+                    'auto': '自动模式',
+                    'standard': '标准模式',
+                    'chinese': '中文优化',
+                    'advanced': '高级模式'
+                };
+                return modes[mode] || mode;
+            }
+        });
+    </script>
+</body>
+</html> 
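`performOCR` in `ocr_test.html` reads `text`, `lang`, `mode`, `processing_time`, `methods[]` (with `name`, `length`, `confidence`, `blocks`, `text`) and `processed_images[]` (with `name`, `url`) from the `/api/ocr` response. The server side is not shown in this part of the diff, so the sketch below only assumes those field names. `normalizeOcrResponse` is a hypothetical helper, not part of the commit, that fills in defaults so the rendering code cannot throw on a missing or malformed field.

```javascript
// Hypothetical helper (not part of the commit): coerce an /api/ocr response
// into the shape the page's rendering code expects.
function normalizeOcrResponse(data) {
    const source = data && typeof data === 'object' ? data : {};
    // Anything that is not a finite number becomes 0.
    const toNumber = value => (Number.isFinite(Number(value)) ? Number(value) : 0);
    const isObject = item => item !== null && typeof item === 'object';
    const methods = Array.isArray(source.methods) ? source.methods.filter(isObject) : [];
    const images = Array.isArray(source.processed_images) ? source.processed_images.filter(isObject) : [];

    return {
        text: typeof source.text === 'string' ? source.text : '',
        lang: String(source.lang ?? ''),
        mode: String(source.mode ?? ''),
        processing_time: toNumber(source.processing_time),
        methods: methods.map(method => ({
            name: String(method.name ?? ''),
            text: typeof method.text === 'string' ? method.text : '',
            length: toNumber(method.length),
            blocks: toNumber(method.blocks),
            // Clamp to 0..100 so the value can be used directly as a CSS width percentage.
            confidence: Math.min(100, Math.max(0, toNumber(method.confidence))),
        })),
        processed_images: images.map(img => ({
            name: String(img.name ?? ''),
            url: String(img.url ?? ''),
        })),
    };
}

const sample = normalizeOcrResponse({ text: 'abc', methods: [{ name: 'standard', confidence: 140 }] });
console.log(sample.processing_time, sample.methods[0].confidence); // → 0 100
```

Calling it as the first step of the `.then(data => ...)` handler would keep the rest of the rendering code unchanged.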
--- a/cxs/temp/outputs/images/image_1.png
+++ b/cxs/temp/outputs/images/image_1.png
--- a/cxs/temp/outputs/response_医疗器械分类目录2017年第104号_output.md
+++ b/cxs/temp/outputs/response_医疗器械分类目录2017年第104号_output.md
--- a/cxs/temp/outputs/response_医疗器械分类目录2017年第104号_output.txt
+++ b/cxs/temp/outputs/response_医疗器械分类目录2017年第104号_output.txt
--- a/cxs/temp/outputs/response_图片_output.md
+++ b/cxs/temp/outputs/response_图片_output.md
@@ -0,0 +1,69 @@
+# 文档内容
+
+
+【文档信息】
+
+作者: Lenovo
+
+创建时间: 2025-05-15 08:30:10
+
+修改时间: 2025-05-15 08:30:25
+
+1111
+
+【图片识别文本】
+“ 完 善 了 异 常 处 理 , 防 止 惑 时 目 录 券 除 失 败 导 致 程 序 崖
+澎
+澎
+
+4. 更 新 README.md
+* 在 暨 近 更 新 部 分 记 录 了 临 时 文 件 处 理 机 制 的 改 进
+。 添 加 了 Excel 文 件 句 柄 管 理
+
+使 用 说 明
+这 东 改 进 不 需 要 您 做 任 何 额 外 操 作 , 系 统 会 自 动 -
+1. 在 处 理 Excel 文 件 时 正 球 关 闭 文 件 句 柄
+
+2 当 尝 试 删 除 文 件 通 刨 “ 文 件 被 占 用 “ 错 误 时 , 自 动 等 待
+并 重 试
+
+3 即 使 无 法 券 除 临 时 文 件 , 也 不 影 响 处 #
+如 果 仍 然 通 到 惧 时 文 件 问 题 , 系 统 会 在 下 次 启 动 时 自 动
+清 理 所 有 临 时 文 件 , 不 会 影 响 系 统 功 能 。
+
+以 上 优 化 星 觞 失 了 临 时 文 伟 删 除 问 题 , 又 保 持 了 系 统 的
+稳 定 性 , 让 您 能 雪 顺 畅 地 处 理 Bxcel 文 件 。
+
+
+## 图片内容
+
+
+### 图片 1
+
+![图片 1](images/image_1.png)
+
+
+**OCR文本内容:**
+
+“ 完 善 了 异 常 处 理 , 防 止 惑 时 目 录 券 除 失 败 导 致 程 序 崖
+澎
+澎
+
+4. 更 新 README.md
+* 在 暨 近 更 新 部 分 记 录 了 临 时 文 件 处 理 机 制 的 改 进
+。 添 加 了 Excel 文 件 句 柄 管 理
+
+使 用 说 明
+这 东 改 进 不 需 要 您 做 任 何 额 外 操 作 , 系 统 会 自 动 -
+1. 在 处 理 Excel 文 件 时 正 球 关 闭 文 件 句 柄
+
+2 当 尝 试 删 除 文 件 通 刨 “ 文 件 被 占 用 “ 错 误 时 , 自 动 等 待
+并 重 试
+
+3 即 使 无 法 券 除 临 时 文 件 , 也 不 影 响 处 #
+如 果 仍 然 通 到 惧 时 文 件 问 题 , 系 统 会 在 下 次 启 动 时 自 动
+清 理 所 有 临 时 文 件 , 不 会 影 响 系 统 功 能 。
+
+以 上 优 化 星 觞 失 了 临 时 文 伟 删 除 问 题 , 又 保 持 了 系 统 的
+稳 定 性 , 让 您 能 雪 顺 畅 地 处 理 Bxcel 文 件 。
+
--- a/cxs/temp/outputs/response_图片_output.txt
+++ b/cxs/temp/outputs/response_图片_output.txt
@@ -0,0 +1 @@
+【文档信息】 作者: Lenovo 创建时间: 2025-05-15 08:30:10 修改时间: 2025-05-15 08:30:25 1111 【图片识别文本】 “ 完 善 了 异 常 处 理 , 防 止 惑 时 目 录 券 除 失 败 导 致 程 序 崖 澎 澎  4. 更 新 README.md * 在 暨 近 更 新 部 分 记 录 了 临 时 文 件 处 理 机 制 的 改 进 。 添 加 了 Excel 文 件 句 柄 管 理  使 用 说 明 这 东 改 进 不 需 要 您 做 任 何 额 外 操 作 , 系 统 会 自 动 - 1. 在 处 理 Excel 文 件 时 正 球 关 闭 文 件 句 柄  2 当 尝 试 删 除 文 件 通 刨 “ 文 件 被 占 用 “ 错 误 时 , 自 动 等 待 并 重 试  3 即 使 无 法 券 除 临 时 文 件 , 也 不 影 响 处 # 如 果 仍 然 通 到 惧 时 文 件 问 题 , 系 统 会 在 下 次 启 动 时 自 动 清 理 所 有 临 时 文 件 , 不 会 影 响 系 统 功 能 。  以 上 优 化 星 觞 失 了 临 时 文 伟 删 除 问 题 , 又 保 持 了 系 统 的 稳 定 性 , 让 您 能 雪 顺 畅 地 处 理 Bxcel 文 件 。
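The recognized text in the two `response_图片_output` files has a space between every pair of Chinese characters, a pattern often seen in Tesseract output for Chinese. The sketch below shows one possible post-processing step that removes those spaces while keeping the spaces around Latin words such as `Excel`. `collapseCjkSpaces` is a hypothetical name that is not part of this commit; it covers only the basic CJK Unified Ideographs block and does not correct misrecognized characters.

```javascript
// Hypothetical post-processing step (not part of the commit).
// Removes runs of spaces/tabs that sit between two CJK ideographs.
// The lookahead leaves the second ideograph unconsumed, so it can start the next match.
function collapseCjkSpaces(text) {
    return text.replace(/([\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])/g, '$1');
}

console.log(collapseCjkSpaces('在 处 理 Excel 文 件 时')); // → 在处理 Excel 文件时
```

Spaces next to ASCII punctuation are left alone, so a fragment such as `理 , 防` would need a separate rule if those should be tightened as well.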
--- a/cxs/temp/outputs/temp_8c424573-5517-4da0-a812-73527077e0c8_医疗器械分类目录2017年第104号_output.docx
+++ b/cxs/temp/outputs/temp_8c424573-5517-4da0-a812-73527077e0c8_医疗器械分类目录2017年第104号_output.docx
--- a/cxs/temp/outputs/temp_aa098435-5fe6-4409-b5e0-692c36ac493c_图片_output.docx
+++ b/cxs/temp/outputs/temp_aa098435-5fe6-4409-b5e0-692c36ac493c_图片_output.docx
--- a/cxs/temp/uploads/images_0c53e440/image1.png
+++ b/cxs/temp/uploads/images_0c53e440/image1.png