How to Correctly Extract Hyperlinks in a Web Scraper and Avoid NaN Values


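This article explains why the Player_URL column keeps coming back as NaN when scraping the EliteProspects player statistics table with BeautifulSoup, and offers a reusable fix: locate the nested tag, pre-clean the fields, and match names safely so that each link lands in the right DataFrame row.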

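Take the EliteProspects NHL 2023–2024 statistics page as an example. Many developers try to pull each player's profile link with `span.get("href")` and always end up with NaN. The root cause is that the `<span>` element itself has no href attribute; the real link sits on the `<a>` tag nested inside it.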
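The difference is easy to see on a tiny fragment. The sketch below is only an illustration: the markup is a hypothetical stand-in that mirrors the nesting described above, and the example.com URL is a placeholder, not a real EliteProspects link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the nesting described above; the href is a placeholder.
html = (
    '<td><span class="txt-blue">'
    '<a href="https://example.com/player/1">John Doe</a>'
    '</span></td>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="txt-blue")

# The <span> only carries a class attribute, so there is no href to read.
span_href = span.get("href")

# The link lives on the nested <a>; guard against cells that have no anchor.
a_tag = span.find("a")
a_href = a_tag.get("href") if a_tag else None
name = a_tag.get_text(strip=True) if a_tag else None

print(span_href)  # None
print(name, a_href)
```

Because `Tag.get()` returns None for a missing attribute instead of raising, the mistake produces no error, only empty values that later show up as NaN in the DataFrame.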

Below is a lean, more robust version of the fix, a cleaned-up rewrite of the original code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

start_url = 'https://www.php.cn/link/8641afa4db7421c9eeaf01260d8afefe'
r = requests.get(start_url, timeout=10)
r.raise_for_status()  # surface HTTP errors explicitly

soup = BeautifulSoup(r.content, "html.parser")
table = soup.find("table", class_="table table-striped table-sortable player-stats highlight-stats season")
if table is None:
    raise SystemExit("Stats table not found; the page structure may have changed.")

# Extract the header text (whitespace stripped)
headers = [th.get_text(strip=True) for th in table.find_all("th")]
df = pd.DataFrame(columns=headers)

# Build the data rows (skip the header row)
for row in table.find_all("tr")[1:]:
    cells = row.find_all(["td", "th"])
    if len(cells) != len(headers):  # skip invalid rows (e.g. group titles)
        continue
    row_data = [cell.get_text(strip=True).replace('\n', ' ') for cell in cells]
    df.loc[len(df)] = row_data

# ✅ Key fix: read the URL from the nested <a>, not from the <span>
df["Player_URL"] = None  # create the column explicitly before filling it

# Normalize the names once: df.Player may contain stray whitespace
cleaned_names = df["Player"].str.strip().str.replace(r'\s+', ' ', regex=True)

for span in table.find_all("span", class_="txt-blue"):
    a_tag = span.find("a")
    if not a_tag or not a_tag.get("href") or not a_tag.get_text(strip=True):
        continue
    player_name = a_tag.get_text(strip=True)
    player_url = a_tag["href"]

    match_mask = cleaned_names == player_name
    if match_mask.any():
        df.loc[match_mask, "Player_URL"] = player_url

# Post-processing: global cleanup (run it after the URLs are filled in,
# so it cannot interfere with the string matching above)
df = df.replace(r'\s+', ' ', regex=True).apply(
    lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x)
)

print(df[["Player", "Team", "GP", "G", "A", "TP", "Player_URL"]].head())

⚠️ Key Caveats and Best Practices

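Don't call `span.get("href")` directly: markup such as `<span class="txt-blue"><a href="...">John Doe</a></span>` is a typical nested structure, and href belongs to the `<a>`, not the `<span>`.
Clean names before matching: the raw Player column often contains newlines (`\n`), extra spaces, or parentheses (e.g. "Connor McDavid (C)"), while the link text is usually just "Connor McDavid". Apply strip() and a regex cleanup to both sides before evaluating `df.Player == name`.
Avoid undefined variables: the original code used `name` before defining it, which raises a NameError at runtime; derive it from the link text (`a_tag.text`) instead.
Program defensively: add an `if not a_tag:` check, `r.raise_for_status()`, and a `timeout`, so that network hiccups or DOM changes do not make the script fail silently.
Initialize the column explicitly: use `df["Player_URL"] = None` rather than relying on `.loc[]` to create it implicitly. This makes the intent easier to read and the behavior more predictable.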
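The name-cleaning advice can be sketched with the standard library alone. Everything here is illustrative: `normalize_name` is a hypothetical helper, the sample names and the example.com URL are made up, and dropping a trailing parenthetical assumes it holds only a position marker such as "(C)".

```python
import re

def normalize_name(raw: str) -> str:
    """Collapse whitespace and drop a trailing parenthetical such as '(C)'."""
    text = re.sub(r"\s+", " ", raw).strip()
    return re.sub(r"\s*\([^)]*\)$", "", text)

# Hypothetical link data, as it might be collected from the <a> tags.
links = {"Connor McDavid": "https://example.com/player/connor-mcdavid"}
url_by_name = {normalize_name(name): url for name, url in links.items()}

# Hypothetical raw cell values: a newline, extra spaces, and a position suffix.
raw_cells = ["Connor\nMcDavid (C)", "  Unknown   Player (D) "]

# Both sides go through the same normalization before the lookup.
urls = [url_by_name.get(normalize_name(cell)) for cell in raw_cells]
print(urls)
```

With pandas, the same lookup could be written as `df["Player"].map(normalize_name).map(url_by_name)`, which replaces the per-link boolean mask with a single dictionary lookup per row. Names without a matching link simply stay empty.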

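Extension tip: to scrape several seasons (2007–2023) or several leagues in bulk, wrap the logic above in a function such as scrape_season(league, season), throttle requests with time.sleep(1), respect the site's robots.txt, and reuse connections with requests.Session() for better efficiency. Deeper fields such as a player's height and weight require a second request to each Player_URL and parsing the corresponding structure on the profile page. The core principles of this tutorial apply there as well: locate tags layer by layer, clean before matching, and always handle exceptions.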
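One possible shape for such a multi-season loop is sketched below. It is a sketch under stated assumptions, not the site's API: `scrape_seasons`, `build_url`, `fetch`, and `parse` are hypothetical names, and the fetching and parsing steps are passed in as callables so the throttling and error handling can be exercised without touching the network.

```python
import time
from typing import Callable, Iterable

def scrape_seasons(
    seasons: Iterable[str],
    build_url: Callable[[str], str],   # hypothetical: maps a season to its page URL
    fetch: Callable[[str], str],       # hypothetical: returns the HTML for a URL
    parse: Callable[[str], list],      # hypothetical: returns a list of row dicts
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Fetch each season in turn, pausing between requests and skipping failures."""
    rows = []
    for i, season in enumerate(seasons):
        if i > 0:
            sleep(delay)  # throttle: pause before every request except the first
        try:
            html = fetch(build_url(season))
        except Exception as exc:  # one failed season should not abort the whole run
            print(f"skipping {season}: {exc}")
            continue
        for row in parse(html):
            rows.append({"season": season, **row})
    return rows

# With requests, `fetch` could wrap a shared session, for example:
#   session = requests.Session()
#   def fetch(url):
#       resp = session.get(url, timeout=10)
#       resp.raise_for_status()
#       return resp.text
```

Injecting `fetch` and `sleep` is a design choice made for testability; a version that calls `requests` and `time.sleep` directly would be shorter but harder to check offline.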

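By understanding how the HTML is actually structured instead of applying selectors blindly, you not only fix the NaN problem at hand but also end up with a scraping pipeline that is robust and maintainable.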
