2.7. Pivoting Data#
pivot() and melt() are a pair of inverse operations that provide functionality similar to Excel's pivot tables.
pivot#
When working with spreadsheets such as Excel, a common advanced operation is to build a "pivot table" that groups and summarizes the data. pandas' pivot() provides the corresponding functionality: it rearranges and summarizes data according to the values of one or more columns, making the structure of and relationships in the data easier to understand. It can also aggregate with custom functions, which makes it even more powerful.
Figure 2.10 illustrates a pivot operation; concrete examples follow below.
import urllib.request
import os
import pandas as pd
import numpy as np

# Download the sales dataset into ../data/sales/ if it is not cached locally yet
download_url_prefix = "https://py-ds.godaai.org/datasets/sales"
folder_path = os.path.join(os.getcwd(), "../data/sales")
download_url = f"{download_url_prefix}/sales.csv"
file_name = download_url.split("/")[-1]
file_path = os.path.join(folder_path, file_name)
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
if not os.path.exists(file_path):
    urllib.request.urlretrieve(download_url, file_path)

sales_df = pd.read_csv(file_path)
sales_df
Account | Name | Rep | Manager | Product | Quantity | Price | Status | |
---|---|---|---|---|---|---|---|---|
0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
5 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | CPU | 2 | 40000 | pending |
6 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | Software | 1 | 10000 | presented |
7 | 412290 | Jerde-Hilpert | John Smith | Debra Henley | Maintenance | 2 | 5000 | pending |
8 | 740150 | Barton LLC | John Smith | Debra Henley | CPU | 1 | 35000 | declined |
9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |
10 | 163416 | Purdy-Kunde | Cedric Moss | Fred Anderson | CPU | 1 | 30000 | presented |
11 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Maintenance | 1 | 5000 | pending |
12 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Software | 1 | 10000 | presented |
13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |
15 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | CPU | 2 | 65000 | declined |
16 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | Monitor | 2 | 5000 | presented |
Specifically, pandas provides two functions, pivot() and pivot_table(); of the two, pivot_table() behaves more like Excel's pivot table.
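The key difference is that pivot() only reshapes the data and performs no aggregation, so every (index, columns) combination must appear at most once, whereas pivot_table() aggregates duplicate combinations. Below is a minimal sketch of plain pivot(); the small temperature DataFrame is made up here purely for illustration and is not part of the sales data used later.

import pandas as pd

# A tiny long-format table: one row per (date, city) pair
temperature_df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-02"],
    "city": ["Beijing", "Shanghai", "Beijing", "Shanghai"],
    "temp": [-3, 5, -1, 6],
})

# pivot() only rearranges values: dates become the row index and cities
# become the columns; it raises an error if a (date, city) pair repeats,
# because no aggregation is performed.
temperature_df.pivot(index="date", columns="city", values="temp")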
The full signature of pivot_table() is pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False). Of its many parameters only data is required; all the others have default values. The ones you will most often set are values, index, columns, and aggfunc. Let's start with the simple cases.
The index and columns parameters#
index and columns both take field names, and both are used for grouping: index produces the row index of the pivot table (it may be one or more fields), while columns produces the column index.
Example 1: group by Name, i.e. make Name the row index. This works much like groupby(); an equivalent groupby() sketch follows the output below. In this example the aggfunc parameter is not set, so the default aggregation, the mean, is used.
pd.pivot_table(sales_df, index=['Name'], values=['Quantity', 'Price'])
Price | Quantity | |
---|---|---|
Name | ||
Barton LLC | 35000.0 | 1.000000 |
Fritsch, Russel and Anderson | 35000.0 | 1.000000 |
Herman LLC | 65000.0 | 2.000000 |
Jerde-Hilpert | 5000.0 | 2.000000 |
Kassulke, Ondricka and Metz | 7000.0 | 3.000000 |
Keeling LLC | 100000.0 | 5.000000 |
Kiehn-Spinka | 65000.0 | 2.000000 |
Koepp Ltd | 35000.0 | 2.000000 |
Kulas Inc | 25000.0 | 1.500000 |
Purdy-Kunde | 30000.0 | 1.000000 |
Stokes LLC | 7500.0 | 1.000000 |
Trantow-Barrows | 15000.0 | 1.333333 |
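For comparison, essentially the same result can be computed with groupby(); the following one-line sketch assumes the sales_df loaded above.

# Group by Name and take the mean of the two numeric columns
sales_df.groupby("Name")[["Price", "Quantity"]].mean()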
Example 2: group by Manager and Rep, similar to calling groupby() on the two fields Manager and Rep.
pd.pivot_table(sales_df, index=['Manager', 'Rep'], values=['Quantity', 'Price'])
Price | Quantity | ||
---|---|---|---|
Manager | Rep | ||
Debra Henley | Craig Booker | 20000.000000 | 1.250000 |
Daniel Hilton | 38333.333333 | 1.666667 | |
John Smith | 20000.000000 | 1.500000 | |
Fred Anderson | Cedric Moss | 27500.000000 | 1.250000 |
Wendy Yule | 44250.000000 | 3.000000 |
The first two examples look much like groupby. In the third example we add the columns parameter, using Product as the pivot table's columns: the values that used to sit in the Product column are promoted to column labels.
pd.pivot_table(sales_df, index=['Manager','Rep'], values=['Quantity', 'Price'], columns=["Product"])
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||||||
Debra Henley | Craig Booker | 32500.0 | 5000.0 | NaN | 10000.0 | 1.0 | 2.0 | NaN | 1.0 |
Daniel Hilton | 52500.0 | NaN | NaN | 10000.0 | 2.0 | NaN | NaN | 1.0 | |
John Smith | 35000.0 | 5000.0 | NaN | NaN | 1.0 | 2.0 | NaN | NaN | |
Fred Anderson | Cedric Moss | 47500.0 | 5000.0 | NaN | 10000.0 | 1.5 | 1.0 | NaN | 1.0 |
Wendy Yule | 82500.0 | 7000.0 | 5000.0 | NaN | 3.5 | 3.0 | 2.0 | NaN |
The resulting pivot table has two levels of column labels: the first level holds the original Price and Quantity fields, and the second level holds the values from columns=["Product"], i.e. CPU, Monitor, and so on.
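A table with two column levels can be indexed level by level. The sketch below stores the pivot result in a variable (the name pivoted is introduced here only for illustration) and assumes the sales_df loaded above.

pivoted = pd.pivot_table(sales_df, index=["Manager", "Rep"],
                         values=["Quantity", "Price"], columns=["Product"])
pivoted["Price"]            # all Price columns, one per Product
pivoted[("Price", "CPU")]   # the single Price/CPU column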
The values parameter#
The values parameter names the column, or list of columns, to aggregate; these are usually the numeric fields you want to analyze in the pivot table.
Example 4: show only the mean of the Price column.
pd.pivot_table(sales_df, index=['Manager','Rep'], values='Price')
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 20000.000000 |
Daniel Hilton | 38333.333333 | |
John Smith | 20000.000000 | |
Fred Anderson | Cedric Moss | 27500.000000 |
Wendy Yule | 44250.000000 |
The aggfunc parameter#
aggfunc is the aggregation function used to summarize the data. It can be a string naming a built-in function (such as 'sum', 'mean', or 'count') or a custom aggregation function; a sketch with a custom function follows the example below.
Example: report both the mean and the len (count) of Price.
pd.pivot_table(sales_df, index=["Manager","Rep"], values=["Price"], aggfunc=["mean", len])
mean | len | ||
---|---|---|---|
Price | Price | ||
Manager | Rep | ||
Debra Henley | Craig Booker | 20000.000000 | 4 |
Daniel Hilton | 38333.333333 | 3 | |
John Smith | 20000.000000 | 2 | |
Fred Anderson | Cedric Moss | 27500.000000 | 4 |
Wendy Yule | 44250.000000 | 4 |
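aggfunc also accepts custom functions. The sketch below, which assumes the sales_df loaded above, uses an anonymous function to compute the price range (maximum minus minimum) within each group.

# Each group's Price values are passed to the lambda as a Series
pd.pivot_table(sales_df, index=["Manager", "Rep"], values=["Price"],
               aggfunc=lambda s: s.max() - s.min())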
The fill_value and margins parameters#
fill_value gives the value used to fill in missing entries. margins is a boolean; when it is True, row and column totals are added to the pivot table.
pd.pivot_table(sales_df, index=["Manager", "Rep", "Product"],
               values=["Price", "Quantity"],
               aggfunc=["sum", "mean"], fill_value=0, margins=True)
sum | mean | |||||
---|---|---|---|---|---|---|
Price | Quantity | Price | Quantity | |||
Manager | Rep | Product | ||||
Debra Henley | Craig Booker | CPU | 65000 | 2 | 32500.000000 | 1.000000 |
Maintenance | 5000 | 2 | 5000.000000 | 2.000000 | ||
Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
Daniel Hilton | CPU | 105000 | 4 | 52500.000000 | 2.000000 | |
Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
John Smith | CPU | 35000 | 1 | 35000.000000 | 1.000000 | |
Maintenance | 5000 | 2 | 5000.000000 | 2.000000 | ||
Fred Anderson | Cedric Moss | CPU | 95000 | 3 | 47500.000000 | 1.500000 |
Maintenance | 5000 | 1 | 5000.000000 | 1.000000 | ||
Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
Wendy Yule | CPU | 165000 | 7 | 82500.000000 | 3.500000 | |
Maintenance | 7000 | 3 | 7000.000000 | 3.000000 | ||
Monitor | 5000 | 2 | 5000.000000 | 2.000000 | ||
All | 522000 | 30 | 30705.882353 | 1.764706 |
Crossing dimensions#
Combining the parameters above, we apply different aggregation functions to Quantity and Price.
table = pd.pivot_table(sales_df,
                       index=["Manager", "Status"],
                       columns=["Product"],
                       values=["Quantity", "Price"],
                       aggfunc={"Quantity": len, "Price": ["mean"]},
                       fill_value=0)
table
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
mean | len | ||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||
Debra Henley | declined | 35000.0 | 0.0 | 0.0 | 0.0 | 2 | 0 | 0 | 0 |
pending | 40000.0 | 5000.0 | 0.0 | 0.0 | 1 | 2 | 0 | 0 | |
presented | 30000.0 | 0.0 | 0.0 | 10000.0 | 1 | 0 | 0 | 2 | |
won | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | |
Fred Anderson | declined | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 |
pending | 0.0 | 5000.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | |
presented | 30000.0 | 0.0 | 5000.0 | 10000.0 | 1 | 0 | 1 | 1 | |
won | 82500.0 | 7000.0 | 0.0 | 0.0 | 2 | 1 | 0 | 0 |
Filtering data#
Once a pivot table has been built, you can filter it further to meet specific conditions, much like the conditional indexing used when slicing a DataFrame. The main tool introduced here is the query() method, which filters data using a query string.
Example 1: keep only the rows whose Manager is Debra Henley.
table.query('Manager == ["Debra Henley"]')
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
mean | len | ||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||
Debra Henley | declined | 35000.0 | 0.0 | 0.0 | 0.0 | 2 | 0 | 0 | 0 |
pending | 40000.0 | 5000.0 | 0.0 | 0.0 | 1 | 2 | 0 | 0 | |
presented | 30000.0 | 0.0 | 0.0 | 10000.0 | 1 | 0 | 0 | 2 | |
won | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 |
Example 2: keep only the rows whose Status is pending or won.
table.query('Status == ["pending","won"]')
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
mean | len | ||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||
Debra Henley | pending | 40000.0 | 5000.0 | 0.0 | 0.0 | 1 | 2 | 0 | 0 |
won | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | |
Fred Anderson | pending | 0.0 | 5000.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 |
won | 82500.0 | 7000.0 | 0.0 | 0.0 | 2 | 1 | 0 | 0 |
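The same filter can be written with ordinary boolean indexing on the index levels; the sketch below assumes the table built above and reproduces the result of Example 2.

# Build a boolean mask from the Status level of the row MultiIndex
mask = table.index.get_level_values("Status").isin(["pending", "won"])
table[mask]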
melt#
melt() converts a wide table into a long one and is typically used for data reshaping, which makes data easier to analyze and visualize. As its English name suggests, it "melts" a wide table into a long table: several columns of a wide-format table are consolidated into a single column. It can be thought of as the inverse of the pivot_table() operation.
d1 = {
'Student_id': pd.Series([1, 2, 3]),
'Name': pd.Series(['Amy', 'Bob', 'John']),
'Math': pd.Series([90, 78, 88]),
'English':pd.Series([85, 92, 79]),
'History':pd.Series([88, 76, 90])
}
student_df = pd.DataFrame(d1)
student_df
Student_id | Name | Math | English | History | |
---|---|---|---|---|---|
0 | 1 | Amy | 90 | 85 | 88 |
1 | 2 | Bob | 78 | 92 | 76 |
2 | 3 | John | 88 | 79 | 90 |
melted_df = pd.melt(student_df,
id_vars=['Student_id', 'Name'],
value_vars=['Math', 'English', 'History'],
var_name='Subject',
value_name='Score')
melted_df
Student_id | Name | Subject | Score | |
---|---|---|---|---|
0 | 1 | Amy | Math | 90 |
1 | 2 | Bob | Math | 78 |
2 | 3 | John | Math | 88 |
3 | 1 | Amy | English | 85 |
4 | 2 | Bob | English | 92 |
5 | 3 | John | English | 79 |
6 | 1 | Amy | History | 88 |
7 | 2 | Bob | History | 76 |
8 | 3 | John | History | 90 |
When reshaping, you usually use value_vars to specify the columns to be "melted" so that they are consolidated into one column. In this example the Math, English, and History columns are melted: their names go into the new Subject column and their scores into the Score column.
Back to the parameters of melt(). Its full signature is pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None). frame is the DataFrame to reshape. id_vars lists the columns to keep as they are; these identifier variables are not melted, and their values are repeated on each row of the long table. value_vars lists the columns to melt; they are consolidated into a single column. var_name names the new column that stores the former column names, and value_name names the new column that stores their values.
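To see that melt() and pivot() really are inverses, the long table can be pivoted back into the original wide layout. The sketch below assumes the melted_df produced above; note that the recovered subject columns come back in alphabetical order (English, History, Math).

# Pivot the long table back to one column per subject
wide_again = melted_df.pivot(index=["Student_id", "Name"],
                             columns="Subject",
                             values="Score").reset_index()
wide_again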