Back to skills
extension
Category: Data & AnalyticsNo API key required

统计数据格式转换器

将 28+ 统计软件格式读入 pandas DataFrame,完整保留全部元数据;支持任意双向转换。Read 28+ stats formats into pandas with full metadata; bidirectional conversion. See README.md / README_ZH.md for details.

personAuthor: user_ff7413f5hubcommunity

统计数据格式转换器 | statdata-transfer

致敬 Stat/Transfer — 业界格式转换标杆 | Honoring Stat/Transfer — the industry standard

核心能力 | Core Capabilities

| 能力 | 说明 | Ability | Description | |------|------|---------|-------------| | 读入 | SPSS/Stata/SAS 100% 保留变量标签、值标签、缺失值等全部元数据 | Read | SPSS/Stata/SAS: 100% preserve all metadata | | 转存 | 任意格式互转(.sav ↔ .dta ↔ .rda ↔ .xlsx ↔ .parquet 等) | Convert | Bidirectional conversion between any formats | | 元数据嵌入 | Parquet/Feather/HDF5/JSON 通过 schema.metadata 保留标签 | Embed | Arrow/HDF5/JSON embed labels in schema.metadata | | 丢失警告 | 转换时自动检测并报告元数据损失 | Warn | Automatic metadata-loss warnings on conversion | | P0 修复 | R 对象读入、CSV 重复读取、值标签键还原等 | P0 Fixes | R object read, CSV dedup, value-label key restore | | EpiInfo/ARFF/Gretl | 新增 3 种流行病学/ML/计量格式读入 | EpiInfo/ARFF/Gretl | New epidemiology/ML/econometrics format readers |

推荐场景 | Recommended Use Cases

| 场景 Scenario | 推荐 Recommendation | 说明 Notes | |--------------|-------------------|------------| | 保存中间数据 Intermediate data | JSON | 全部元数据双向 100% 还原 / All metadata, 100% bidirectional | | 统计软件交换 Stats software swap | SPSS/Stata/R | 无损转换 / Lossless conversion | | 高性能处理 High performance | Parquet | 高效 + 标签保留 / Fast + label preservation | | 大数据集 Large datasets | HDF5 | 支持分块 / Chunked storage | | 临床数据 Clinical trials | SAS XPT / CDISC ODM | 监管提交 compliant submission |

触发条件 | Triggers

  • 读入 .dta .sav .sas7bdat .rda .xlsx .mat .h5 .parquet .json .xml 等 28+ 格式 / Read files: .dta .sav .sas7bdat .rda .xlsx .mat .h5 .parquet .json .xml + 20 more
  • 需要格式转换并保留元数据 / Format conversion with metadata preservation
  • 需要不同统计软件间迁移数据 / Migrate data between statistical software
  • 用户询问各格式读入能力边界 / User asks about format capability limits

支持格式与能力对照表 | Supported Formats & Capability Matrix

按字母排序 | Sorted alphabetically.

| 格式 Format | 扩展名 Extension | 依赖 Dependency | 变量标签 Var Label | 值标签 Val Label | 特殊缺失 Special Missing | 公式 Formula | 元数据保留 Meta Preserve | |-------------|-----------------|:---:|:---:|:---:|:---:|:---:|:---:| | CDISC ODM | .odm | lxml | ✗ | ✗ | ✗ | ✗ | ⚠️ 仅临床数据 Clinical data only | | EpiData | .rec | R foreign | ✗ | ✗ | ✗ | ✗ | ⚠️ 通过R读入 Via R | | EpiInfo | .prj .xml | xml/etree | ✅ | ✅(codes) | ✗ | ✗ | ✅ XML结构 XML structure | | Excel | .xlsx .xls .xlsm | openpyxl / xlrd | ✗ | ✗ | ✗ | ⚠️ 仅结果 result only | ⚠️ 写出用额外工作表 Extra sheet for labels | | EViews | .wf1 .wf2 | 内置 built-in | ✗ | ✗ | ✗ | ✗ | ⚠️ JSON结构 JSON structure | | Feather | .feather .arrow | pyarrow | ✅(schema) | ✅(schema) | ✗ | ✗ | ⚠️ 版本差异 Version diff | | FST | .fst | fst (R) | ✅(schema) | ✅(schema) | ✗ | ✗ | ⚠️ 版本差异 Version diff | | GraphPad Prism | .pzfx .pz | pzfx | ✗ | ✗ | ✗ | ✗ | ⚠️ 多表 Multi-table | | Gretl | .gdt .gdtb | 内置 built-in | ✅ | ✅(tables) | ✗ | ✗ | ✅ string-tables | | HDF5 | .h5 .hdf5 | h5py | ✗ | ✗ | ✗ | ✗ | ⚠️ 层级结构 Hierarchy,写出用文件属性 attrs on write | | HTML | .html | lxml | ✗ | ✗ | ✗ | ✗ | ⚠️ 仅表格 Tables only | | jamovi | .omv | ZIP内置 ZIP built-in | ✅ | ✅ | ✗ | ✗ | ✅ JSON分析 JSON analysis | | JMP | .jmp | jmpio-python | ⚠️ | ⚠️ | ✗ | ✗ | ⚠️ 多表 Multi-table | | JSON | .json | 内置 built-in | ✅ | ✅ | ✗ | ✗ | ✅ 写出嵌入stat-full-meta stat-full-meta on write | | MATLAB | .mat | scipy | ✗ | ✗ | ✗ | ✗ | ⚠️ 变量名 Variable names | | Minitab | .mtw .mpj | mtbpy / R | ✗ | ✗ | ✗ | ✗ | ⚠️ 通过R读入 Via R | | ODS | .ods | odfpy | ✗ | ✗ | ✗ | ✗ | ⚠️ 仅数据 Data only | | ORC | .orc | pyarrow | ✅(schema) | ✅(schema) | ✗ | ✗ | ⚠️ 版本差异 Version diff | | Parquet | .parquet | pyarrow | ✅(schema) | ✅(schema) | ✗ | ✗ | ⚠️ 嵌套类型 Nested types,写出用schema.metadata schema.metadata on write | | R | .rda .rds .rdata | pyreadr + R | ✅ | ✅ | ✅ | ✗ | ✅ statdata_meta,写出通过R桥接 R bridge on write | | SAS | .sas7bdat .xpt .sas7bcat | pyreadstat | ✅ | ✅(需catalog need catalog) | ⚠️ | ✗ | ✅ | | SPSS | .sav .zsav .por | pyreadstat | ✅ | ✅ | ✅ | ✗ | ✅ | | Stata | .dta | pyreadstat | ✅ | ✅ | ⚠️ | ✗ | ✅ | | Weka ARFF | .arff | 内置 built-in | ✅ | ✅(nominal) | ✗ | ✗ | ✅ 名义映射 nominal mapping | | XML | .xml | lxml | ✗ | ✗ | ✗ | ✗ | ⚠️ 结构保留 Structure preserved |

✅=完整保留 Full preservation · ⚠️=部分保留或依赖条件 Partial/conditional · ✗=无法保留 Not preserved

返回结构 | Return Structure

{
    "dataframe": pd.DataFrame,
    "metadata": {
        "file_format": "spss_sav",
        "collected_at": "2026-07-02T21:00:00",
        "row_count": 100, "column_count": 10,
        "total_missing_pct": 5.2,
        "variable_labels": {"q1": "问题1"},
        "value_labels": {"q1": {1: "是", 2: "否"}},
        # ... 全部元数据 | all metadata types
    },
    "warnings": [],
    "column_report": {"q1": {"source_type": "int", "pandas_dtype": "int64"}},
}

元数据保留能力 | Metadata Preservation

读入降损规则 | Read Fallback Rules

  1. 统计软件格式(SPSS/Stata/SAS):全部元数据完整读入 | Stats formats: 100% metadata
  2. Arrow 生态(Parquet/Feather/ORC/FST):仅还原 write_stat_file 写入的标签 | Only restores labels from write_stat_file
  3. 非统计格式:仅保留数据值,可通过 apply_value_labels() 手动附加 | Data only; use apply_value_labels() manually
  4. R 格式:v1.6.0+ 通过 statdata_meta 属性嵌入全部元数据 | v1.6.0+ embeds all metadata fields via statdata_meta

元数据丢失警告 | Metadata-Loss Warnings

运行时 warnings 列表自动包含以下提示 | Runtime warnings include:

  • 特殊缺失值变为 NaN | Special missing → NaN
  • 测量级别/显示宽度/对齐方式丢失 | Measurement level/width/alignment lost
  • 多重响应集(mr_sets)无法保留 | MR sets cannot be preserved
  • 文件标签/注释不保留 | File label/notes not preserved
  • 日期基准不保存 | Date origin not saved

预检代码见 references/usage_examples.py 中「元数据丢失检查」部分 | Pre-check code: see "Metadata loss check" in references/usage_examples.py

依赖 | Dependencies

requires:
  bins: [python3]
  packages:
    core: [pyreadstat>=1.3.5, pyreadr, pandas]
    extended: [openpyxl, xlrd, scipy, h5py, pyarrow, lxml, odfpy]
    optional: [jmpio-python, mtbpy, pzfx]

完整依赖列表见 requirements.txt | Full list: requirements.txt

参考文档 | References

  • references/usage_examples.py — 完整代码示例 | Complete code examples
  • references/new_formats_architecture_analysis.json — 架构分析 | Architecture analysis