数据透视

本节目录

2.7. 数据透视#

pivot() 和 melt() 是一对互逆的操作，能提供类似 Excel 数据透视表这样的功能。

pivot#

在使用 Excel 这样的表格时，一个高阶的操作是使用“数据透视表”对数据进行分类汇总。pandas 的 pivot() 提供与 Excel 对应的数据透视的功能，它根据一个或多个列的值对数据进行重新排列和汇总，以便更好地理解数据的结构和关系。pivot() 能够可自定义函数进行聚合，因此功能更强大。

图 2.10 是一个数据透视的示意图，下文还将展示具体的例子。

sales_df = pd.read_csv(file_path)
sales_df

	Account	Name	Rep	Manager	Product	Quantity	Price	Status
0	714466	Trantow-Barrows	Craig Booker	Debra Henley	CPU	1	30000	presented
1	714466	Trantow-Barrows	Craig Booker	Debra Henley	Software	1	10000	presented
2	714466	Trantow-Barrows	Craig Booker	Debra Henley	Maintenance	2	5000	pending
3	737550	Fritsch, Russel and Anderson	Craig Booker	Debra Henley	CPU	1	35000	declined
4	146832	Kiehn-Spinka	Daniel Hilton	Debra Henley	CPU	2	65000	won
5	218895	Kulas Inc	Daniel Hilton	Debra Henley	CPU	2	40000	pending
6	218895	Kulas Inc	Daniel Hilton	Debra Henley	Software	1	10000	presented
7	412290	Jerde-Hilpert	John Smith	Debra Henley	Maintenance	2	5000	pending
8	740150	Barton LLC	John Smith	Debra Henley	CPU	1	35000	declined
9	141962	Herman LLC	Cedric Moss	Fred Anderson	CPU	2	65000	won
10	163416	Purdy-Kunde	Cedric Moss	Fred Anderson	CPU	1	30000	presented
11	239344	Stokes LLC	Cedric Moss	Fred Anderson	Maintenance	1	5000	pending
12	239344	Stokes LLC	Cedric Moss	Fred Anderson	Software	1	10000	presented
13	307599	Kassulke, Ondricka and Metz	Wendy Yule	Fred Anderson	Maintenance	3	7000	won
14	688981	Keeling LLC	Wendy Yule	Fred Anderson	CPU	5	100000	won
15	729833	Koepp Ltd	Wendy Yule	Fred Anderson	CPU	2	65000	declined
16	729833	Koepp Ltd	Wendy Yule	Fred Anderson	Monitor	2	5000	presented

具体而言，pandas 有两个函数：pivot() 和 pivot_table()，其中 pivot_table() 更像 Excel 的数据透视表。

pivot_table() 函数的完整形式为 pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) 。参数很多，只有 data 是唯一必需的参数，其他均有默认值，常需要设置的参数有 values, index, columns,aggfunc 等。我们先从简单的开始。

index 和 columns 参数#

index 和 columns 接收的都是字段名，它们都可用于分组。

index 生成透视表的行索引，可以为多个属性。
columns 生成透视表的列索引。

例 1: 按照 Name 分组，即添加 Name 的行索引。这个功能与 groupby() 很像。这个例子中，aggfunc 参数没有设置，默认求平均值。

pd.pivot_table(sales_df, index=['Name'], values=['Quantity', 'Price'])

	Price	Quantity
Name
Barton LLC	35000.0	1.000000
Fritsch, Russel and Anderson	35000.0	1.000000
Herman LLC	65000.0	2.000000
Jerde-Hilpert	5000.0	2.000000
Kassulke, Ondricka and Metz	7000.0	3.000000
Keeling LLC	100000.0	5.000000
Kiehn-Spinka	65000.0	2.000000
Koepp Ltd	35000.0	2.000000
Kulas Inc	25000.0	1.500000
Purdy-Kunde	30000.0	1.000000
Stokes LLC	7500.0	1.000000
Trantow-Barrows	15000.0	1.333333

例 2: 按 Manager 和 Rep 分组，类似于使用 groupby() 对两个字段 Manager 和 Rep 进行分组。

pd.pivot_table(sales_df, index=['Manager', 'Rep'], values=['Quantity', 'Price'])

		Price	Quantity
Manager	Rep
Debra Henley	Craig Booker	20000.000000	1.250000
	Daniel Hilton	38333.333333	1.666667
	John Smith	20000.000000	1.500000
Fred Anderson	Cedric Moss	27500.000000	1.250000
Fred Anderson	Wendy Yule	44250.000000	3.000000

前面两个例子与 groupby 很像，第 3 个例子，增加 columns 这个参数，添加透视表的列 Product，把原来存在于 Product 中的数据作为列名提出来。

pd.pivot_table(sales_df, index=['Manager','Rep'], values=['Quantity', 'Price'], columns=["Product"])

		Price				Quantity
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Rep
Debra Henley	Craig Booker	32500.0	5000.0	NaN	10000.0	1.0	2.0	NaN	1.0
	Daniel Hilton	52500.0	NaN	NaN	10000.0	2.0	NaN	NaN	1.0
	John Smith	35000.0	5000.0	NaN	NaN	1.0	2.0	NaN	NaN
Fred Anderson	Cedric Moss	47500.0	5000.0	NaN	10000.0	1.5	1.0	NaN	1.0
Fred Anderson	Wendy Yule	82500.0	7000.0	5000.0	NaN	3.5	3.0	2.0	NaN

生成的数据透视表有两层列索引，第一层是原有的 Price、Quantity 这些字段，第二层是 columns=["Product"] 里面的值：CPU、Monitor 等。

values 参数#

values 参数指明要聚合的列或列的列表，通常是你想要在透视表中分析的数值数据。

例 4：只显示 Price 列对应的平均值。

pd.pivot_table(sales_df, index=['Manager','Rep'], values='Price')

		Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000
	Daniel Hilton	38333.333333
	John Smith	20000.000000
Fred Anderson	Cedric Moss	27500.000000
Fred Anderson	Wendy Yule	44250.000000

aggfunc 参数#

用于汇总数据的聚合函数，可以是字符串（如’sum’、’mean’、’count’ 等预置函数）或自定义聚合函数。

例：对 Price 显示 mean 和 len 两列结果。

pd.pivot_table(sales_df, index=["Manager","Rep"], values=["Price"], aggfunc=["mean", len])

		mean	len
		Price	Price
Manager	Rep
Debra Henley	Craig Booker	20000.000000	4
	Daniel Hilton	38333.333333	3
	John Smith	20000.000000	2
Fred Anderson	Cedric Moss	27500.000000	4
Fred Anderson	Wendy Yule	44250.000000	4

fill_value 和 margins 参数#

fill_value 参数用于填充缺失值的值。margins 参数为布尔值，如果为 True, 则在透视表中包含行和列的总计。

pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean], fill_value=0, margins=True)

/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/3052684454.py:1: FutureWarning: The provided callable <function sum at 0x1041ee340> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/3052684454.py:1: FutureWarning: The provided callable <function mean at 0x1041ef420> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  pd.pivot_table(sales_df,index=["Manager","Rep","Product"],

			sum		mean
			Price	Quantity	Price	Quantity
Manager	Rep	Product
Debra Henley	Craig Booker	CPU	65000	2	32500.000000	1.000000
		Maintenance	5000	2	5000.000000	2.000000
		Software	10000	1	10000.000000	1.000000
	Daniel Hilton	CPU	105000	4	52500.000000	2.000000
	Daniel Hilton	Software	10000	1	10000.000000	1.000000
	John Smith	CPU	35000	1	35000.000000	1.000000
	John Smith	Maintenance	5000	2	5000.000000	2.000000
Fred Anderson	Cedric Moss	CPU	95000	3	47500.000000	1.500000
		Maintenance	5000	1	5000.000000	1.000000
		Software	10000	1	10000.000000	1.000000
	Wendy Yule	CPU	165000	7	82500.000000	3.500000
		Maintenance	7000	3	7000.000000	3.000000
		Monitor	5000	2	5000.000000	2.000000
All			522000	30	30705.882353	1.764706

维度交叉#

将以上参数结合起来，对 Quantity 和 Price 应用不同的聚合函数。

table = pd.pivot_table(sales_df,
        index=["Manager","Status"],
        columns=["Product"],
        values=["Quantity","Price"],
        aggfunc={"Quantity":len, "Price":[np.mean]},
        fill_value=0)
table

/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/1669582107.py:1: FutureWarning: The provided callable <function mean at 0x1041ef420> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  table = pd.pivot_table(sales_df,

		Price				Quantity
		mean				len
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	declined	35000.0	0.0	0.0	0.0	2	0	0	0
	pending	40000.0	5000.0	0.0	0.0	1	2	0	0
	presented	30000.0	0.0	0.0	10000.0	1	0	0	2
	won	65000.0	0.0	0.0	0.0	1	0	0	0
Fred Anderson	declined	65000.0	0.0	0.0	0.0	1	0	0	0
	pending	0.0	5000.0	0.0	0.0	0	1	0	0
	presented	30000.0	0.0	5000.0	10000.0	1	0	1	1
	won	82500.0	7000.0	0.0	0.0	2	1	0	0

数据筛选#

建立数据透视表（pivot table）后，也可以进一步筛选数据以满足特定的条件，类似于 DataFrame 的数据切片方法中的条件索引。这里主要介绍 query() 函数，通过传递查询字符串来筛选数据。

例 1: 筛选出 Manager 为 Debra Henley 的结果。

table.query('Manager == ["Debra Henley"]')

		Price				Quantity
		mean				len
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	declined	35000.0	0.0	0.0	0.0	2	0	0	0
	pending	40000.0	5000.0	0.0	0.0	1	2	0	0
	presented	30000.0	0.0	0.0	10000.0	1	0	0	2
	won	65000.0	0.0	0.0	0.0	1	0	0	0

例 2: 筛选出 Status 为 pending 或 won 的结果。

table.query('Status == ["pending","won"]')

		Price				Quantity
		mean				len
	Product	CPU	Maintenance	Monitor	Software	CPU	Maintenance	Monitor	Software
Manager	Status
Debra Henley	pending	40000.0	5000.0	0.0	0.0	1	2	0	0
Debra Henley	won	65000.0	0.0	0.0	0.0	1	0	0	0
Fred Anderson	pending	0.0	5000.0	0.0	0.0	0	1	0	0
Fred Anderson	won	82500.0	7000.0	0.0	0.0	2	1	0	0

melt#

melt() 是用于将一个很宽的的表转换为长。这个函数通常用于数据重塑（data reshaping）操作，以便更容易进行数据分析和可视化。根据其英文名，又可被理解成，将一个宽表，“融化”成一个长表。通过这种方式，可以将宽格式数据表格中的多列数据整合到一个列中，可以理解为透视表 pivot_table() 的反操作。

d1 = {
    'Student_id': pd.Series([1, 2, 3]), 
    'Name': pd.Series(['Amy', 'Bob', 'John']), 
    'Math': pd.Series([90, 78, 88]),
    'English':pd.Series([85, 92, 79]),
    'History':pd.Series([88, 76, 90])
}
student_df = pd.DataFrame(d1)
student_df

	Student_id	Name	Math	English	History
0	1	Amy	90	85	88
1	2	Bob	78	92	76
2	3	John	88	79	90

melted_df = pd.melt(student_df, 
                    id_vars=['Student_id', 'Name'], 
                    value_vars=['Math', 'English', 'History'], 
                    var_name='Subject', 
                    value_name='Score')
melted_df

	Student_id	Name	Subject	Score
0	1	Amy	Math	90
1	2	Bob	Math	78
2	3	John	Math	88
3	1	Amy	English	85
4	2	Bob	English	92
5	3	John	English	79
6	1	Amy	History	88
7	2	Bob	History	76
8	3	John	History	90

数据重塑时，通常需要使用 value_vars 来指定需要 “融化” 的列，使它们被整合为一列。比如本例中，将 Math, English,History 整合为一列： Subject 列。

../_images/melt.svg — 图 2.11 melt(): 数据融化#

我们回到 melt() 函数的参数。完整的参数形式为:pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)。

frame 指的是要进行重塑操作的 DataFrame 。
id_vars 表示需要保留的列，它们将成为长表中的行；这些列又被称为标识变量（identifier variable），不被融化。
value_vars 是需要融化的列，它们将被整合成一列，并用新的列名表示。
var_name 用于存储融化后的列名的新列的名称。
value_name 用于存储融化后的值的新列的名称。