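Recommended: Pandas 2.0, a Game Changer for Data Scientists (with link)
This article is about 4,800 words; suggested reading time is 12 minutes.
This article introduces the main advantages of the new pandas 2.0 release, along with the code to use them.
The 2.0 release appears to have made quite an impact in the data science community, with many users praising the improvements in the new version.
Fun fact: did you realize this release was an astonishing 3 years in the making? That is what I call "commitment to the community"!
So what does pandas 2.0 bring? Let's dive right in!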
%timeit df = pd.read_csv("data/hn.csv")
12 s ± 304 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow', dtype_backend='pyarrow')
329 ms ± 65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
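As a quick sanity check, the speed-up implied by the two timings above can be worked out directly. The sketch below is plain arithmetic on the reported means; the helper name `speedup` is illustrative and not part of pandas.

```python
def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new timing is than the baseline."""
    return baseline_seconds / new_seconds


# Mean timings reported above, converted to seconds.
numpy_read = 12.0    # default reader: 12 s
arrow_read = 329e-3  # pyarrow engine and dtype backend: 329 ms

read_speedup = speedup(numpy_read, arrow_read)
print(f"read_csv speed-up: {read_speedup:.1f}x")
```

Both figures are means over 7 runs, so the ratio is only an estimate and will vary with the machine and the file.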
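As you can see, using the new backend makes reading the data about 36 times faster (12 s versus 329 ms). Other aspects worth pointing out:
Without the pyarrow backend, each column/feature is stored with its own NumPy data type: numeric features are stored as int64 or float64, while string values are stored as object;
With pyarrow, all features use Arrow dtypes: note the [pyarrow] annotation and the different data types (int64, string, timestamp, and double):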
df = pd.read_csv("data/hn.csv")
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
# # Column Dtype
# --- ------ -----
# 0 Object ID int64
# 1 Title object
# 2 Post Type object
# 3 Author object
# 4 Created At object
# 5 URL object
# 6 Points int64
# 7 Number of Comments float64
# dtypes: float64(1), int64(2), object(5)
# memory usage: 237.2+ MB
df_arrow = pd.read_csv("data/hn.csv", dtype_backend='pyarrow', engine='pyarrow')
df_arrow.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3885799 entries, 0 to 3885798
# Data columns (total 8 columns):
# # Column Dtype
# --- ------ -----
# 0 Object ID int64[pyarrow]
# 1 Title string[pyarrow]
# 2 Post Type string[pyarrow]
# 3 Author string[pyarrow]
# 4 Created At timestamp[s][pyarrow]
# 5 URL string[pyarrow]
# 6 Points int64[pyarrow]
# 7 Number of Comments double[pyarrow]
# dtypes: double[pyarrow](1), int64[pyarrow](2), string[pyarrow](4), timestamp[s][pyarrow](1)
# memory usage: 660.2 MB
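One detail in the two `info()` outputs is easy to misread: the `+` in `237.2+ MB` means pandas is reporting a lower bound, because by default it does not measure the Python strings that object columns point to. The two memory figures are therefore not directly comparable. A minimal sketch on a hypothetical toy frame (the column values are made up) showing how deep accounting changes the number:

```python
import pandas as pd

# Hypothetical toy frame: one integer column and one object (string) column.
df = pd.DataFrame(
    {
        "Points": pd.Series([61, 16, 7], dtype="int64"),
        "Author": pd.Series(["alice", "bob", "carol"], dtype=object),
    }
)

# Shallow accounting counts only the pointers held in the object column.
shallow_bytes = int(df.memory_usage(deep=False).sum())

# Deep accounting also measures the Python string objects behind those pointers.
deep_bytes = int(df.memory_usage(deep=True).sum())

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")

# Prints the summary with a measured memory figure instead of a "+" lower bound.
df.info(memory_usage="deep")
```

On the real dataset, `df.info(memory_usage="deep")` is the call to use before comparing memory between the two backends.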
%timeit df["Author"].str.startswith('phy')
851 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_arrow["Author"].str.startswith('phy')
27.9 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
pd.Index([1, 2, 3])
# Index([1, 2, 3], dtype='int64')
pd.Index([1, 2, 3], dtype=np.int32)
# Index([1, 2, 3], dtype='int32')
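The snippet above shows that, in pandas 2.0, an index keeps the NumPy dtype it is given. A small sketch, assuming pandas 2.0 or later, of what that means for the size of the index (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Assumes pandas >= 2.0; earlier versions would store both as a 64-bit index.
idx64 = pd.Index([1, 2, 3])                  # default integer dtype
idx32 = pd.Index([1, 2, 3], dtype=np.int32)  # explicit 32-bit dtype is kept

# nbytes reports the size of the underlying data buffer.
print(idx64.dtype, idx64.nbytes)
print(idx32.dtype, idx32.nbytes)
```

For a three-element index the saving is negligible, but the same halving applies to an index with millions of entries.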
df = pd.read_csv("data/hn.csv")
points = df["Points"]
points.isna().sum()
# 0
points[0:5]
# 0 61
# 1 16
# 2 7
# 3 5
# 4 7
# Name: Points, dtype: int64
# Setting first position to None
points.iloc[0] = None
points[0:5]
# 0 NaN
# 1 16.0
# 2 7.0
# 3 5.0
# 4 7.0
# Name: Points, dtype: float64
df_null = pd.read_csv("data/hn.csv", dtype_backend='numpy_nullable')
points_null = df_null["Points"]
points_null.isna().sum()
# 0
points_null[0:5]
# 0 61
# 1 16
# 2 7
# 3 5
# 4 7
# Name: Points, dtype: Int64
points_null.iloc[0] = None
points_null[0:5]
# 0 <NA>
# 1 16
# 2 7
# 3 5
# 4 7
# Name: Points, dtype: Int64
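The behaviour above can be reproduced without the CSV file. The sketch below uses a few made-up values and the nullable `Int64` dtype directly, which is the dtype that `dtype_backend='numpy_nullable'` produced for the `Points` column above:

```python
import pandas as pd

# Hypothetical sample values standing in for the "Points" column.
points_null = pd.Series([61, 16, 7], dtype="Int64")  # capital "I": nullable integer

points_null.iloc[0] = None  # stored as pd.NA; the dtype is unchanged

print(points_null.dtype)         # Int64
print(points_null.isna().sum())  # 1

# For contrast, building a NumPy-backed Series with a missing value
# forces the integers to become floats so that NaN can be represented.
points_numpy = pd.Series([None, 16, 7])
print(points_numpy.dtype)        # float64
```

Keeping the integer dtype matters when the values are identifiers or counts that should not silently turn into floats.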
pd.options.mode.copy_on_write = False # Copy-on-Write disabled (this is the default in pandas 2.0)
df = pd.read_csv("data/hn.csv")
df.head()
# Throws a 'SettingWithCopy' warning
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
df["Points"][0] = 2000
df.head() # <---- df changes
pd.options.mode.copy_on_write = True
df = pd.read_csv("data/hn.csv")
df.head()
# Throws a ChainedAssignmentError
df["Points"][0] = 2000
# ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame
# or Series through chained assignment. When using the Copy-on-Write mode,
# such chained assignment never works to update the original DataFrame
# or Series, because the intermediate object on which we are setting
# values always behaves as a copy.
# Try using '.loc[row_indexer, col_indexer] = value' instead,
# to perform the assignment in a single step.
df.head() # <---- df does not change
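As the error message suggests, the assignment should be done in a single `.loc` step. A minimal sketch on a hypothetical three-row frame, assuming pandas 2.x, where `pd.option_context` is used so that the setting applies only inside the `with` block:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Hacker News data.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"Points": [61, 16, 7]})

    # Writing to a Series taken from the frame modifies only that Series,
    # because Copy-on-Write copies the data at the first write.
    points = df["Points"]
    points.iloc[0] = 2000
    untouched = int(df.loc[0, "Points"])  # the frame still holds the original value

    # A single .loc assignment updates the frame itself.
    df.loc[0, "Points"] = 2000
    updated = int(df.loc[0, "Points"])

print(untouched, updated)
```

The first write leaves `df` alone and the second changes it, which is the predictable behaviour that Copy-on-Write is meant to guarantee.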
pip install "pandas[postgresql, aws, spss]>=2.0.0"
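After installing extras like this, it can be handy to check which of the underlying packages are actually importable in the current environment. The sketch below uses only the standard library; the module names in `EXTRA_MODULES` are assumptions about what each extra pulls in, not something stated in this article, and `is_installed` and `report` are hypothetical helpers:

```python
from importlib.util import find_spec

# Assumed import names for the packages behind the three extras; these are
# illustrative, so check the pandas installation docs for your version.
EXTRA_MODULES = {
    "postgresql": ["sqlalchemy", "psycopg2"],
    "aws": ["s3fs"],
    "spss": ["pyreadstat"],
}


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def report(extras: dict[str, list[str]]) -> dict[str, bool]:
    """Map each extra to True only if all of its modules can be found."""
    return {
        extra: all(is_installed(name) for name in modules)
        for extra, modules in extras.items()
    }


for extra, ok in report(EXTRA_MODULES).items():
    print(f"{extra}: {'available' if ok else 'missing'}")
```

pandas also ships `pd.show_versions()`, which prints the installed versions of its dependencies and can serve the same purpose interactively.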
import pandas as pd
from ydata_profiling import ProfileReport
# Using pandas 1.5.3 and ydata-profiling 4.2.0
%timeit df = pd.read_csv("data/hn.csv")
10.1 s ± 215 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit profile = ProfileReport(df, title="Pandas Profiling Report")
4.85 ms ± 77.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit profile.to_file("report.html")
18.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Using pandas 2.0.2 and ydata-profiling 4.3.1
%timeit df_arrow = pd.read_csv("data/hn.csv", engine='pyarrow')
3.27 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit profile_arrow = ProfileReport(df_arrow, title="Pandas Profiling Report")
5.24 ms ± 448 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit profile_arrow.to_file("report.html")
19 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
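Performance optimization: through the introduction of the Apache Arrow backend, support for more NumPy dtypes in indexes, and the Copy-on-Write mode;
Added flexibility and customization: users can control optional dependencies and take advantage of Apache Arrow data types (including nullability from the start!);
Interoperability: perhaps a less "acclaimed" advantage of the new version, but one with a huge impact. Because Arrow is language-independent, in-memory data can be transferred not only between programs built on Python, but also between R, Spark, and other programs that use the Apache Arrow backend!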
https://medium.com/towards-data-science/pandas-2-0-a-game-changer-for-data-scientists-3cd281fcc4b4?source=topic_portal_recommended_stories---------2-85----------machine_learning----------30a1af14_d40c_416a_bc92_b752b8fd806c-------