Merging Columns by Prefix in Polars and Repeating Non-Pattern Columns to Align to Long Format


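This article shows how, in Polars, to stack several columns that share a prefix (such as `a_0`, `a_1`, `a_2`) vertically into a single column (such as `a`), while repeating the remaining non-pattern columns (such as `words`, `groups`) so that they match the expanded row count. The result is a combined "wide to long plus reorder" operation that extends to any number of prefix groups.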

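The transformation in the question is: concatenate `a_0`, `a_1`, `a_2` vertically, in column order, into one column `a`; do the same for `b_0`, `b_1`, `b_2` into `b`; and repeat the identifier columns `words` and `groups` as needed (each original value appears m = 3 times, once per column in each `a_*`/`b_*` group). The core idea has four steps: use `unpivot` to flatten the structure, strip the suffix so the prefix becomes the new grouping key, build an index that is consistent across prefixes, and use `pivot` to rebuild the long format.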

A complete solution follows:

    import polars as pl
    import numpy as np
    import string

    # Build the example data (same as in the question)
    rng = np.random.default_rng(42)
    nr = 3
    letters = list(string.ascii_letters)
    uppercase = list(string.ascii_uppercase)
    words, groups = [], []
    for i in range(nr):
        word = ''.join([rng.choice(letters) for _ in range(rng.integers(3, 20))])
        words.append(word)
        group = rng.choice(uppercase)
        groups.append(group)

    df = pl.DataFrame({
        "a_0": np.linspace(0, 1, nr),
        "a_1": np.linspace(1, 2, nr),
        "a_2": np.linspace(2, 3, nr),
        # Note: the b_* columns use the unseeded global NumPy RNG,
        # so their values change from run to run.
        "b_0": np.random.rand(nr),
        "b_1": 2 * np.random.rand(nr),
        "b_2": 3 * np.random.rand(nr),
        "words": words,
        "groups": groups,
    })

    # Core transformation: wide -> long, merge by prefix, repeat identifier columns
    result = (
        df
        .unpivot(
            index=["words", "groups"],  # identifier columns, kept as-is and repeated
            on=[col for col in df.columns if "_" in col],  # columns to unpivot, listed explicitly
        )
        .with_columns(
            pl.col("variable").str.replace(r"_\d+$", "")  # strip the suffix: a_0 -> "a", b_1 -> "b"
        )
        .with_columns(
            # Number the rows within each prefix: 0 .. nr*m - 1 (0..8 here)
            index=pl.int_range(0, pl.len()).over("variable")
        )
        .pivot(
            on="variable",
            index=["index", "words", "groups"],
            values="value",
            aggregate_function=None,  # no aggregation: each cell maps to exactly one value
        )
        .drop("index")  # remove the temporary index column
    )

    print(result)

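The result matches the expected layout (shape: (9, 4)). Because the `b_*` columns are drawn with the unseeded `np.random.rand`, their values will differ between runs: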

    shape: (9, 4)
    ┌─────────────────┬────────┬─────┬──────────┐
    │ words           ┆ groups ┆ a   ┆ b        │
    │ ---             ┆ ---    ┆ --- ┆ ---      │
    │ str             ┆ str    ┆ f64 ┆ f64      │
    ╞═════════════════╪════════╪═════╪══════════╡
    │ OIww            ┆ W      ┆ 0.0 ┆ 0.653892 │
    │ KkeB            ┆ Z      ┆ 0.5 ┆ 0.408888 │
    │ NLOAgRxAtjWOHuQ ┆ O      ┆ 1.0 ┆ 0.423949 │
    │ OIww            ┆ W      ┆ 1.0 ┆ 0.234362 │
    │ KkeB            ┆ Z      ┆ 1.5 ┆ 0.213767 │
    │ NLOAgRxAtjWOHuQ ┆ O      ┆ 2.0 ┆ 0.646378 │
    │ OIww            ┆ W      ┆ 2.0 ┆ 0.880558 │
    │ KkeB            ┆ Z      ┆ 2.5 ┆ 1.833025 │
    │ NLOAgRxAtjWOHuQ ┆ O      ┆ 3.0 ┆ 0.116173 │
    └─────────────────┴────────┴─────┴──────────┘

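Key points:

`unpivot(index=...)` is the starting point: it turns every non-index column (`a_0`, `a_1`, ...) into two columns, `variable` (the original column name) and `value` (the original value), and repeats `words`/`groups` on every resulting row. That repetition is what later supplies the duplicated identifier values.
Stripping the suffix with a regex: `str.replace(r"_\d+$", "")` removes only a trailing `_<digits>` (so `_10`, `_99`, and so on also work). This is more precise than `"_.*"`, which would also cut into a legitimate prefix that itself contains an underscore.
`over("variable")` builds a per-prefix row number: after the suffix is stripped, all `a_*` rows share `variable == "a"`, so their index runs from 0 to n·m − 1 (0 to 8 here), and likewise for `b`. Because `unpivot` emits all of `a_0`, then `a_1`, then `a_2`, and the `b_*` columns in the same order, the k-th `a` value and the k-th `b` value get the same index. That shared index is what pairs them in the pivot and produces the "all of column 0 first, then all of column 1, ..." ordering.
`pivot(..., aggregate_function=None)`: no aggregation is applied. In Polars 1.x `None` is already the default, so spelling it out mainly documents the intent. Without aggregation, a duplicated `index` + `words` + `groups` combination raises an error rather than silently dropping values.
Explicit column selection: passing `on=[...]` keeps the set of unpivoted columns under your control. Note that the `"_" in col` filter used above is loose: it would also pick up a non-pattern column such as `meta_info`, so a stricter test (for example, requiring a trailing `_<digits>`) is safer if such columns may be added later.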

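⚠️ Note: if the column-name pattern is more complex (for example `a_x`, `a_y`, `a_z`), adjust the regular expression (for example `r"_\w+$"`) and make sure the columns passed to `unpivot(on=...)` are exactly the pattern columns. Be aware that `\w` also matches underscores, so for prefixes that contain an underscore a narrower pattern such as `r"_[^_]+$"` strips only the last token. The approach handles any number of prefix groups (`a_*`, `b_*`, `c_*`, ...) and any number of columns per group, provided every group has the same number of columns; otherwise the shorter groups are padded with nulls. The cost grows roughly linearly with the number of values being reshaped, so it remains practical for large data.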

Published in: web3.0

2026-01-30
