2.7. 数据透视#
pivot() 和 melt() 是一对互逆的操作,能提供类似 Excel 数据透视表这样的功能。
pivot#
在使用 Excel 这样的表格时,一个高阶的操作是使用“数据透视表”对数据进行分类汇总。pandas 的 pivot() 提供与 Excel 对应的数据透视的功能,它根据一个或多个列的值对数据进行重新排列和汇总,以便更好地理解数据的结构和关系。pivot() 能够可自定义函数进行聚合,因此功能更强大。
图 2.10 是一个数据透视的示意图,下文还将展示具体的例子。
图 2.10 数据透视#
Show code cell content
import urllib.request
import os
import pandas as pd
import numpy as np
download_url_prefix = "https://py-ds.godaai.org/datasets/sales"
folder_path = os.path.join(os.getcwd(), "../data/sales")
download_url = f"{download_url_prefix}/sales.csv"
file_name = download_url.split("/")[-1]
file_path = os.path.join(folder_path, file_name)
if not os.path.exists(folder_path):
os.makedirs(folder_path)
if not os.path.exists(file_path):
urllib.request.urlretrieve(download_url, file_path)
sales_df = pd.read_csv(file_path)
sales_df
| Account | Name | Rep | Manager | Product | Quantity | Price | Status | |
|---|---|---|---|---|---|---|---|---|
| 0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented |
| 1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented |
| 2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending |
| 3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined |
| 4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won |
| 5 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | CPU | 2 | 40000 | pending |
| 6 | 218895 | Kulas Inc | Daniel Hilton | Debra Henley | Software | 1 | 10000 | presented |
| 7 | 412290 | Jerde-Hilpert | John Smith | Debra Henley | Maintenance | 2 | 5000 | pending |
| 8 | 740150 | Barton LLC | John Smith | Debra Henley | CPU | 1 | 35000 | declined |
| 9 | 141962 | Herman LLC | Cedric Moss | Fred Anderson | CPU | 2 | 65000 | won |
| 10 | 163416 | Purdy-Kunde | Cedric Moss | Fred Anderson | CPU | 1 | 30000 | presented |
| 11 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Maintenance | 1 | 5000 | pending |
| 12 | 239344 | Stokes LLC | Cedric Moss | Fred Anderson | Software | 1 | 10000 | presented |
| 13 | 307599 | Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | Maintenance | 3 | 7000 | won |
| 14 | 688981 | Keeling LLC | Wendy Yule | Fred Anderson | CPU | 5 | 100000 | won |
| 15 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | CPU | 2 | 65000 | declined |
| 16 | 729833 | Koepp Ltd | Wendy Yule | Fred Anderson | Monitor | 2 | 5000 | presented |
具体而言,pandas 有两个函数:pivot() 和 pivot_table(),其中 pivot_table() 更像 Excel 的数据透视表。
pivot_table() 函数的完整形式为 pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) 。参数很多,只有 data 是唯一必需的参数,其他均有默认值,常需要设置的参数有 values, index, columns,aggfunc 等。我们先从简单的开始。
index 和 columns 参数#
index 和 columns 接收的都是字段名,它们都可用于分组。
index生成透视表的行索引,可以为多个属性。columns生成透视表的列索引。
例 1: 按照 Name 分组,即添加 Name 的行索引。这个功能与 groupby() 很像。这个例子中,aggfunc 参数没有设置,默认求平均值。
pd.pivot_table(sales_df, index=['Name'], values=['Quantity', 'Price'])
| Price | Quantity | |
|---|---|---|
| Name | ||
| Barton LLC | 35000.0 | 1.000000 |
| Fritsch, Russel and Anderson | 35000.0 | 1.000000 |
| Herman LLC | 65000.0 | 2.000000 |
| Jerde-Hilpert | 5000.0 | 2.000000 |
| Kassulke, Ondricka and Metz | 7000.0 | 3.000000 |
| Keeling LLC | 100000.0 | 5.000000 |
| Kiehn-Spinka | 65000.0 | 2.000000 |
| Koepp Ltd | 35000.0 | 2.000000 |
| Kulas Inc | 25000.0 | 1.500000 |
| Purdy-Kunde | 30000.0 | 1.000000 |
| Stokes LLC | 7500.0 | 1.000000 |
| Trantow-Barrows | 15000.0 | 1.333333 |
例 2: 按 Manager 和 Rep 分组,类似于使用 groupby() 对两个字段 Manager 和 Rep 进行分组。
pd.pivot_table(sales_df, index=['Manager', 'Rep'], values=['Quantity', 'Price'])
| Price | Quantity | ||
|---|---|---|---|
| Manager | Rep | ||
| Debra Henley | Craig Booker | 20000.000000 | 1.250000 |
| Daniel Hilton | 38333.333333 | 1.666667 | |
| John Smith | 20000.000000 | 1.500000 | |
| Fred Anderson | Cedric Moss | 27500.000000 | 1.250000 |
| Wendy Yule | 44250.000000 | 3.000000 |
前面两个例子与 groupby 很像,第 3 个例子,增加 columns 这个参数,添加透视表的列 Product,把原来存在于 Product 中的数据作为列名提出来。
pd.pivot_table(sales_df, index=['Manager','Rep'], values=['Quantity', 'Price'], columns=["Product"])
| Price | Quantity | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
| Manager | Rep | ||||||||
| Debra Henley | Craig Booker | 32500.0 | 5000.0 | NaN | 10000.0 | 1.0 | 2.0 | NaN | 1.0 |
| Daniel Hilton | 52500.0 | NaN | NaN | 10000.0 | 2.0 | NaN | NaN | 1.0 | |
| John Smith | 35000.0 | 5000.0 | NaN | NaN | 1.0 | 2.0 | NaN | NaN | |
| Fred Anderson | Cedric Moss | 47500.0 | 5000.0 | NaN | 10000.0 | 1.5 | 1.0 | NaN | 1.0 |
| Wendy Yule | 82500.0 | 7000.0 | 5000.0 | NaN | 3.5 | 3.0 | 2.0 | NaN | |
生成的数据透视表有两层列索引,第一层是原有的 Price、Quantity 这些字段,第二层是 columns=["Product"] 里面的值:CPU、Monitor 等。
values 参数#
values 参数指明要聚合的列或列的列表,通常是你想要在透视表中分析的数值数据。
例 4:只显示 Price 列对应的平均值。
pd.pivot_table(sales_df, index=['Manager','Rep'], values='Price')
| Price | ||
|---|---|---|
| Manager | Rep | |
| Debra Henley | Craig Booker | 20000.000000 |
| Daniel Hilton | 38333.333333 | |
| John Smith | 20000.000000 | |
| Fred Anderson | Cedric Moss | 27500.000000 |
| Wendy Yule | 44250.000000 |
aggfunc 参数#
用于汇总数据的聚合函数,可以是字符串(如’sum’、’mean’、’count’ 等预置函数)或自定义聚合函数。
例:对 Price 显示 mean 和 len 两列结果。
pd.pivot_table(sales_df, index=["Manager","Rep"], values=["Price"], aggfunc=["mean", len])
| mean | len | ||
|---|---|---|---|
| Price | Price | ||
| Manager | Rep | ||
| Debra Henley | Craig Booker | 20000.000000 | 4 |
| Daniel Hilton | 38333.333333 | 3 | |
| John Smith | 20000.000000 | 2 | |
| Fred Anderson | Cedric Moss | 27500.000000 | 4 |
| Wendy Yule | 44250.000000 | 4 |
fill_value 和 margins 参数#
fill_value 参数用于填充缺失值的值。margins 参数为布尔值,如果为 True, 则在透视表中包含行和列的总计。
pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
values=["Price","Quantity"],
aggfunc=[np.sum,np.mean], fill_value=0, margins=True)
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/3052684454.py:1: FutureWarning: The provided callable <function sum at 0x1041ee340> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/3052684454.py:1: FutureWarning: The provided callable <function mean at 0x1041ef420> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
| sum | mean | |||||
|---|---|---|---|---|---|---|
| Price | Quantity | Price | Quantity | |||
| Manager | Rep | Product | ||||
| Debra Henley | Craig Booker | CPU | 65000 | 2 | 32500.000000 | 1.000000 |
| Maintenance | 5000 | 2 | 5000.000000 | 2.000000 | ||
| Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
| Daniel Hilton | CPU | 105000 | 4 | 52500.000000 | 2.000000 | |
| Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
| John Smith | CPU | 35000 | 1 | 35000.000000 | 1.000000 | |
| Maintenance | 5000 | 2 | 5000.000000 | 2.000000 | ||
| Fred Anderson | Cedric Moss | CPU | 95000 | 3 | 47500.000000 | 1.500000 |
| Maintenance | 5000 | 1 | 5000.000000 | 1.000000 | ||
| Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
| Wendy Yule | CPU | 165000 | 7 | 82500.000000 | 3.500000 | |
| Maintenance | 7000 | 3 | 7000.000000 | 3.000000 | ||
| Monitor | 5000 | 2 | 5000.000000 | 2.000000 | ||
| All | 522000 | 30 | 30705.882353 | 1.764706 | ||
维度交叉#
将以上参数结合起来,对 Quantity 和 Price 应用不同的聚合函数。
table = pd.pivot_table(sales_df,
index=["Manager","Status"],
columns=["Product"],
values=["Quantity","Price"],
aggfunc={"Quantity":len, "Price":[np.mean]},
fill_value=0)
table
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/1669582107.py:1: FutureWarning: The provided callable <function mean at 0x1041ef420> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
table = pd.pivot_table(sales_df,
| Price | Quantity | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| mean | len | ||||||||
| Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
| Manager | Status | ||||||||
| Debra Henley | declined | 35000.0 | 0.0 | 0.0 | 0.0 | 2 | 0 | 0 | 0 |
| pending | 40000.0 | 5000.0 | 0.0 | 0.0 | 1 | 2 | 0 | 0 | |
| presented | 30000.0 | 0.0 | 0.0 | 10000.0 | 1 | 0 | 0 | 2 | |
| won | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | |
| Fred Anderson | declined | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 |
| pending | 0.0 | 5000.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | |
| presented | 30000.0 | 0.0 | 5000.0 | 10000.0 | 1 | 0 | 1 | 1 | |
| won | 82500.0 | 7000.0 | 0.0 | 0.0 | 2 | 1 | 0 | 0 | |
数据筛选#
建立数据透视表(pivot table)后,也可以进一步筛选数据以满足特定的条件,类似于 DataFrame 的数据切片方法中的条件索引。这里主要介绍 query() 函数,通过传递查询字符串来筛选数据。
例 1: 筛选出 Manager 为 Debra Henley 的结果。
table.query('Manager == ["Debra Henley"]')
| Price | Quantity | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| mean | len | ||||||||
| Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
| Manager | Status | ||||||||
| Debra Henley | declined | 35000.0 | 0.0 | 0.0 | 0.0 | 2 | 0 | 0 | 0 |
| pending | 40000.0 | 5000.0 | 0.0 | 0.0 | 1 | 2 | 0 | 0 | |
| presented | 30000.0 | 0.0 | 0.0 | 10000.0 | 1 | 0 | 0 | 2 | |
| won | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | |
例 2: 筛选出 Status 为 pending 或 won 的结果。
table.query('Status == ["pending","won"]')
| Price | Quantity | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| mean | len | ||||||||
| Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
| Manager | Status | ||||||||
| Debra Henley | pending | 40000.0 | 5000.0 | 0.0 | 0.0 | 1 | 2 | 0 | 0 |
| won | 65000.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | |
| Fred Anderson | pending | 0.0 | 5000.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 |
| won | 82500.0 | 7000.0 | 0.0 | 0.0 | 2 | 1 | 0 | 0 | |
melt#
melt() 是用于将一个很宽的的表转换为长。这个函数通常用于数据重塑(data reshaping)操作,以便更容易进行数据分析和可视化。根据其英文名,又可被理解成,将一个宽表,“融化”成一个长表。通过这种方式,可以将宽格式数据表格中的多列数据整合到一个列中,可以理解为透视表 pivot_table() 的反操作。
d1 = {
'Student_id': pd.Series([1, 2, 3]),
'Name': pd.Series(['Amy', 'Bob', 'John']),
'Math': pd.Series([90, 78, 88]),
'English':pd.Series([85, 92, 79]),
'History':pd.Series([88, 76, 90])
}
student_df = pd.DataFrame(d1)
student_df
| Student_id | Name | Math | English | History | |
|---|---|---|---|---|---|
| 0 | 1 | Amy | 90 | 85 | 88 |
| 1 | 2 | Bob | 78 | 92 | 76 |
| 2 | 3 | John | 88 | 79 | 90 |
melted_df = pd.melt(student_df,
id_vars=['Student_id', 'Name'],
value_vars=['Math', 'English', 'History'],
var_name='Subject',
value_name='Score')
melted_df
| Student_id | Name | Subject | Score | |
|---|---|---|---|---|
| 0 | 1 | Amy | Math | 90 |
| 1 | 2 | Bob | Math | 78 |
| 2 | 3 | John | Math | 88 |
| 3 | 1 | Amy | English | 85 |
| 4 | 2 | Bob | English | 92 |
| 5 | 3 | John | English | 79 |
| 6 | 1 | Amy | History | 88 |
| 7 | 2 | Bob | History | 76 |
| 8 | 3 | John | History | 90 |
数据重塑时,通常需要使用 value_vars 来指定需要 “融化” 的列,使它们被整合为一列。比如本例中,将 Math, English,History 整合为一列: Subject 列。
图 2.11 melt(): 数据融化#
我们回到 melt() 函数的参数。完整的参数形式为:pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)。
frame指的是要进行重塑操作的 DataFrame 。id_vars表示需要保留的列,它们将成为长表中的行;这些列又被称为标识变量(identifier variable),不被融化。value_vars是需要融化的列,它们将被整合成一列,并用新的列名表示。var_name用于存储融化后的列名的新列的名称。value_name用于存储融化后的值的新列的名称。