返回 Skill 列表
extension
分类: 开发与工程无需 API Key

clean-data-xls

清理杂乱的电子表格数据——修剪空白、修正大小写不一致、转换文本格式的数字、标准化日期、删除重复项,并标记混合类型列。当数据杂乱、不一致或在分析前需要准备时使用。触发词包括“清理这些数据”、“清理这个表格”、“规范化这些数据”、“修正格式”、“去重”、“标准化这列”、“这些数据很乱”。

person作者: jakexiaohubgithub

Clean Data

Clean messy data in the active sheet or a specified range.

Environment

  • If running inside Excel (Office Add-in / Office JS): Use Office JS directly (Excel.run(async (context) => {...})). Read via range.values, write helper-column formulas via range.formulas = [["=TRIM(A2)"]]. The in-place vs helper-column decision still applies.
  • If operating on a standalone .xlsx file: Use Python/openpyxl.

Workflow

Step 1: Scope

  • If a range is given (e.g. A1:F200), use it
  • Otherwise use the full used range of the active sheet
  • Profile each column: detect its dominant type (text / number / date) and identify outliers

Step 2: Detect issues

| Issue | What to look for | |---|---| | Whitespace | leading/trailing spaces, double spaces | | Casing | inconsistent casing in categorical columns (usa / USA / Usa) | | Number-as-text | numeric values stored as text; stray $, ,, % in number cells | | Dates | mixed formats in the same column (3/8/26, 2026-03-08, March 8 2026) | | Duplicates | exact-duplicate rows and near-duplicates (case/whitespace differences) | | Blanks | empty cells in otherwise-populated columns | | Mixed types | a column that's 98% numbers but has 3 text entries | | Encoding | mojibake (é, ’), non-printing characters | | Errors | #REF!, #N/A, #VALUE!, #DIV/0! |

Step 3: Propose fixes

Show a summary table before changing anything:

| Column | Issue | Count | Proposed Fix | |---|---|---|---|

Step 4: Apply

  • Prefer formulas over hardcoded cleaned values — where the cleaned output can be expressed as a formula (e.g. =TRIM(A2), =VALUE(SUBSTITUTE(B2,"$","")), =UPPER(C2), =DATEVALUE(D2)), write the formula in an adjacent helper column rather than computing the result in Python and overwriting the original. This keeps the transformation transparent and auditable.
  • Only overwrite in place with computed values when the user explicitly asks for it, or when no sensible formula equivalent exists (e.g. encoding/mojibake repair)
  • For destructive operations (removing duplicates, filling blanks, overwriting originals), confirm with the user first
  • After each category of fix (whitespace → casing → number conversion → dates → dedup), show the user a sample of what changed and get confirmation before moving to the next category
  • Report a before/after summary of what changed