Third-party Data Normalizer
Use this skill as a normalization workflow, not as a universal migration program. The skill should understand the current batch, follow the standard template order, make mapping decisions explicit, confirm each step with the user in text, trial-run sample output for that step, and only then write a batch-specific migration script that outputs strict standard import templates.
Batch Workflow
- Locate the third-party source directory and choose a customer output directory. All outputs must be under an isolated English customer-name directory, for example
outputs/anjuleyu/. - Use
references/standard-data-template/as the default standard template directory. Read templates in filename order, skipping Excel lock files beginning with~$. - Parse and profile every source document before designing mappings. Flag sparse/empty sheets, wrong worksheet dimensions, suspicious header rows, duplicates, and legacy
.xlsfiles. - Follow the template-driven confirmation sequence below. For each step, list target fields, propose source mapping and uniqueness rules, confirm with the user in conversation, then trial-run a small sample output before moving to the next step.
- Produce temporary mapping checklists and trial output files inside the customer directory. Audit files such as
mapping-review.xlsx,issues.xlsx, andmapping.jsonare supporting evidence only; the user should not need to open them to understand or approve the plan. - After all steps are confirmed, dynamically write a batch-specific migration script for the confirmed mapping and run it to generate final import files.
- Verify final files against the standard template headers, row counts, split counts, required fields, cross-template references, coordinate source confirmation, coordinate conversion/precision, building-floor preservation, legacy
.xlscompatibility, and known source totals.
Template Order
Always organize mapping and confirmation by this standard template order:
01项目主数据导入模板.xls02门店信息导入模板.xlsx03项目补充信息导入模板.xlsx04户型导入模板.xlsx05房源导入模板.xlsx
1. Project Master Data
Confirm project master data first. Determine:
- Project uniqueness and project grain.
- Project name and project code rules.
- Company, parent project, project type, province/city/county, address, longitude, and latitude. Explicitly ask the user whether provided coordinates are Baidu BD-09, Gaode/GCJ-02, WGS-84, or unknown, and record the answer before generating final files.
- If the target template fields name a provider such as
经度(高德)/纬度(高德), final generation is blocked until the source coordinate system is confirmed. When the source is Baidu BD-09 and the target is Gaode/GCJ-02, convert BD-09 to GCJ-02 and强制保留 6 位小数;never pass Baidu coordinates through to Gaode fields. - Building/floor data under the project master template, including project reference, region, building name, unit, floor numbers, floor counts, and areas. Preserve user-confirmed non-numeric original floor labels when floor parsing cannot derive a numeric floor; only compute above/below-ground floor counts from numeric floors.
Trial-run requirement: output a small sample of project rows and building/floor rows, including examples with non-numeric floor labels and coordinates, then ask the user to confirm before continuing.
2. Store Data
Store data is optional. Check whether the source files contain store-level information.
- If no store data is present, explain the evidence and ask the user to confirm that store import can be skipped.
- If store data is present, determine store uniqueness, store name, and store address mapping.
Trial-run requirement: output a small sample of store rows or a clear "skip store import" decision, then ask the user to confirm.
3. Project Extra Information
Confirm fields in 03项目补充信息导入模板.xlsx, including project code, completion year, manager phones, associated/linked store, elevator/stairs, parking, deposit rules, property fee, water/electricity/gas/network/parking/other public service fees, and late-fee rules.
店长手机号和管家手机号是必填。If the source cannot identify valid values, ask the user for default values before generation; each provided default must already exist in the target system or an explicitly confirmed system user directory. Do not invent, fabricate, or silently reuse unrelated phone numbers.
For associated/linked store fields, determine whether the source has a project-to-store relationship in room pricing, room configuration, store configuration, or project metadata. Confirm the grain for conflicts (for example one project linked to multiple stores) and whether to join multiple stores or split project grain before generation.
Trial-run requirement: output a small sample of project-extra rows and call out blank or inferred non-required fields, including any project-to-store conflicts.
4. Layout Data
Confirm whether the source contains independent layout data.
- If yes, map the independent layout source and validate uniqueness.
- If no, derive layouts from room-related documents by deduplicating on
项目 + 户型名称. - Ensure layout names are unique under each project because room import must reference the layout name.
- For
04户型导入模板.xlsx, 室、厅、卫不能同时为 0.厨and阳台cannot be blank; default them to0unless the user confirms another valid value.
Trial-run requirement: output a small sample of layout rows and duplicate/conflict checks before continuing.
5. Room Data
Confirm room extraction last, after project and layout references are stable. Determine room uniqueness, project code reference, building/floor/room number parsing, layout reference, orientation, decoration, facilities/supporting amenities, area, lease status, tenant fields, rent, base price, VR URL, and video URL.
For room facilities/supporting amenities fields, inspect grouped configuration columns such as appliance, furniture, smart-device, utility, and public/private supporting fields. Confirm which truthy source values mean an amenity is present, normalize amenity names to the target wording, deduplicate them, and keep the output delimiter consistent across all room rows.
For room dictionary fields such as orientation, decoration, and lease status, read the target template's dictionary sheet or data-validation ranges and map source values into the system-allowed values instead of passing source labels through directly. For room numeric fields such as area and rent, confirm the target precision/format and write deterministic defaults in that same format.
建筑面积和租赁面积不能为 0. If the user-provided source data contains zero, blank, missing, non-positive, or unparseable area values, report the affected counts/examples and ask the user how to avoid zero values before final generation; do not silently output 0.
If 出租状态为已出租, tenant fields are mandatory: fill both 租客名称 and 租客手机号 from confirmed source data or a user-confirmed mapping. If either tenant field is missing, block final generation for those rows and ask the user how to handle them.
Trial-run requirement: output a small sample of room rows and required-field missing counts before final generation. Include examples that exercise facilities extraction, dictionary-value mapping, area zero/non-positive handling, tenant-required checks for rented rooms, and numeric precision/default formatting.
Helper Scripts
python scripts/third_party_data_normalizer.py analyze --source-dir "<source_dir>" --output-dir "<customer_output_dir>"
Pass --template-dir only when overriding the bundled templates. The bundled script is a helper/example for profiling, template inspection, AJLY mapping review, and regression testing. Treat its generate path as a sample implementation for a known batch, not as the default answer for new third-party systems. For a new or uncertain batch, write a fresh script from the confirmed mapping instead of forcing the data through a generic migrator.
Mapping Rules
Always show these decisions directly in the conversation, then point to the Excel review files for details:
- Source files and detected sheet/header rows.
- Candidate join keys, duplicate counts, and overlap counts.
- Project grain, project code rule, and conflict fallback.
- Store presence or explicit store-skip decision.
- Standard template fields for project master data, store data, project extra information, layout data, and room data.
- For every target field: source field, transform rule, default value, inferred value, blank rule, confidence, and user decision status.
- Target workbook row counts, split counts, and source-to-target coverage.
- Coordinate source system, target coordinate system, conversion rule, precision, and user confirmation status. Coordinate outputs must强制保留 6 位小数. If the source system is unknown while target fields require Gaode/GCJ-02, stop and ask; do not generate final coordinate values.
- Area zero/non-positive counts, rented-room tenant missing counts, invalid layout room counts (
室/厅/卫all zero, blank厨/阳台), and missing manager-phone defaults. - Required fields and business-important fields that are empty, inferred, or cannot be mapped.
- Trial-run sample rows for the current step before advancing.
Use conversation as the primary confirmation surface. Keep human-facing details in mapping-review.xlsx with Chinese sheet names and columns, but do not require the user to open it before they can understand or approve the mapping. Keep mapping.json only for the script.
Default behavior is conservative: fill deterministic fields, record inferred fields, and send ambiguous values to the issue list instead of inventing silent values.
Template Rules
Strictly reuse the files under references/standard-data-template/ unless the user explicitly provides a replacement template directory. For .xlsx templates, write only the 数据 sheet and keep the header row order unchanged.
When the source row count exceeds the template data capacity, split the output into numbered files instead of overfilling the template. The default split size comes from the smallest detected standard template data row capacity.
Legacy .xls templates are treated as fragile. This skill vendors xlrd/xlwt/xlutils for the project master workbook and writes the 项目 and 楼栋楼层 sheets from the copied standard template. If those dependencies fail in the local runtime, do not silently rewrite the file; generate a clear blocking note and ask the user to provide a writable template or handle that workbook manually.
For legacy .xls outputs, Python-written BIFF files may be readable but still fail when downstream Java/Apache POI writes an error workbook (for example HSSFWorkbook.write() record-cast errors). If Excel or WPS is available, add an optional post-generation native resave step for .xls files (open with Excel/WPS and SaveAs the same .xls format), and verify the resaved file, not just the Python-written file. When users report that manual Excel save fixes import, treat that as evidence of .xls writer compatibility, not data mapping.
Do not diagnose legacy .xls import failures only from file size. Check actual sheet nrows, non-empty rows, and data rows. A smaller file after Excel save can indicate native BIFF normalization, not necessarily removed blank rows.
Final Verification Checklist
Before calling final outputs ready, run automated checks for:
- Headers exactly matching the standard templates.
- Row counts, split counts, and source-to-target coverage.
- Required fields in every generated workbook.
- Cross-template references (project codes, project names, layout names, and room references).
- Coordinate source was explicitly confirmed by the user before final generation. If target fields specify Gaode/GCJ-02, Baidu BD-09 source coordinates were converted to GCJ-02 and coordinate outputs强制保留 6 位小数; unknown source coordinates block final generation.
- Template dictionary/data-validation fields, especially room orientation, decoration, and lease status, containing only system-allowed values.
- Numeric formatting rules, especially room area/rent precision and default formatting, matching the confirmed target requirements; 建筑面积和租赁面积不能为 0, and any source zero/non-positive area values have a user-confirmed avoidance rule.
- If 出租状态为已出租, every affected room has both
租客名称and租客手机号populated. - Layout rows satisfy 室、厅、卫不能同时为 0, and
厨/阳台are non-blank with confirmed defaults such as0. - Project extra rows have
店长手机号and管家手机号; user-provided defaults were verified as existing system values and were not fabricated. - Building/floor rows preserving confirmed original labels when parsing cannot derive numeric floors.
- Numeric floor counts (
地上层数/地下层数) computed only from numeric floor labels. - Legacy
.xlscompatibility after any native Excel/WPS resave step.
AJLY Reference
For Anjuleyu/AJLY batches, read references/ajly.md before mapping. The current known sample uses:
房间IDas the room merge key.房源定价.xlsxplus房源配置.xlsxas primary room sources.物业地址/楼栋as the default project grain.门店+物业地址only as a conflict fallback.- Layout inference from room number, area, and configuration fields, with low-confidence results reviewed by the user.
Treat these as batch observations and previously confirmed rules, not universal rules for other third-party systems.
Bundled Resources
references/standard-data-template/: bundled standard import templates used as the default target schema and confirmation order.scripts/third_party_data_normalizer.py: profiling, template inspection, AJLY mapping review, and sample workbook generation helper.scripts/test_third_party_data_normalizer.py: regression tests for template inspection, AJLY key analysis, confirmation gate, and split workbook generation.
Use scripts as deterministic helpers for evidence and verification, not as a replacement for field-by-field user confirmation.
微信扫一扫