如果你觉得内容对你有帮助,请在 GitHub 上点个 star 吧!

2.7. 数据透视#

pivot()melt() 是一对互逆的操作,能提供类似 Excel 数据透视表这样的功能。

pivot#

在使用 Excel 这样的表格时,一个高阶的操作是使用“数据透视表”对数据进行分类汇总。pandas 的 pivot() 提供与 Excel 对应的数据透视的功能,它根据一个或多个列的值对数据进行重新排列和汇总,以便更好地理解数据的结构和关系。pivot() 能够可自定义函数进行聚合,因此功能更强大。

图 2.10 是一个数据透视的示意图,下文还将展示具体的例子。

../_images/pivot.svg

图 2.10 数据透视#

Hide code cell content
import urllib.request
import os
import pandas as pd
import numpy as np

download_url_prefix = "https://py-ds.godaai.org/datasets/sales"
folder_path = os.path.join(os.getcwd(), "../data/sales")
download_url = f"{download_url_prefix}/sales.csv"

file_name = download_url.split("/")[-1]
file_path = os.path.join(folder_path, file_name)

if not os.path.exists(folder_path):
    os.makedirs(folder_path)

if not os.path.exists(file_path):
    urllib.request.urlretrieve(download_url, file_path)
sales_df = pd.read_csv(file_path)
sales_df
Account Name Rep Manager Product Quantity Price Status
0 714466 Trantow-Barrows Craig Booker Debra Henley CPU 1 30000 presented
1 714466 Trantow-Barrows Craig Booker Debra Henley Software 1 10000 presented
2 714466 Trantow-Barrows Craig Booker Debra Henley Maintenance 2 5000 pending
3 737550 Fritsch, Russel and Anderson Craig Booker Debra Henley CPU 1 35000 declined
4 146832 Kiehn-Spinka Daniel Hilton Debra Henley CPU 2 65000 won
5 218895 Kulas Inc Daniel Hilton Debra Henley CPU 2 40000 pending
6 218895 Kulas Inc Daniel Hilton Debra Henley Software 1 10000 presented
7 412290 Jerde-Hilpert John Smith Debra Henley Maintenance 2 5000 pending
8 740150 Barton LLC John Smith Debra Henley CPU 1 35000 declined
9 141962 Herman LLC Cedric Moss Fred Anderson CPU 2 65000 won
10 163416 Purdy-Kunde Cedric Moss Fred Anderson CPU 1 30000 presented
11 239344 Stokes LLC Cedric Moss Fred Anderson Maintenance 1 5000 pending
12 239344 Stokes LLC Cedric Moss Fred Anderson Software 1 10000 presented
13 307599 Kassulke, Ondricka and Metz Wendy Yule Fred Anderson Maintenance 3 7000 won
14 688981 Keeling LLC Wendy Yule Fred Anderson CPU 5 100000 won
15 729833 Koepp Ltd Wendy Yule Fred Anderson CPU 2 65000 declined
16 729833 Koepp Ltd Wendy Yule Fred Anderson Monitor 2 5000 presented

具体而言,pandas 有两个函数:pivot()pivot_table(),其中 pivot_table() 更像 Excel 的数据透视表。

pivot_table() 函数的完整形式为 pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False) 。参数很多,只有 data 是唯一必需的参数,其他均有默认值,常需要设置的参数有 values, index, columns,aggfunc 等。我们先从简单的开始。

index 和 columns 参数#

indexcolumns 接收的都是字段名,它们都可用于分组。

  • index 生成透视表的行索引,可以为多个属性。

  • columns 生成透视表的列索引。

例 1: 按照 Name 分组,即添加 Name 的行索引。这个功能与 groupby() 很像。这个例子中,aggfunc 参数没有设置,默认求平均值。

pd.pivot_table(sales_df, index=['Name'], values=['Quantity', 'Price'])
Price Quantity
Name
Barton LLC 35000.0 1.000000
Fritsch, Russel and Anderson 35000.0 1.000000
Herman LLC 65000.0 2.000000
Jerde-Hilpert 5000.0 2.000000
Kassulke, Ondricka and Metz 7000.0 3.000000
Keeling LLC 100000.0 5.000000
Kiehn-Spinka 65000.0 2.000000
Koepp Ltd 35000.0 2.000000
Kulas Inc 25000.0 1.500000
Purdy-Kunde 30000.0 1.000000
Stokes LLC 7500.0 1.000000
Trantow-Barrows 15000.0 1.333333

例 2: 按 Manager 和 Rep 分组,类似于使用 groupby() 对两个字段 Manager 和 Rep 进行分组。

pd.pivot_table(sales_df, index=['Manager', 'Rep'], values=['Quantity', 'Price'])
Price Quantity
Manager Rep
Debra Henley Craig Booker 20000.000000 1.250000
Daniel Hilton 38333.333333 1.666667
John Smith 20000.000000 1.500000
Fred Anderson Cedric Moss 27500.000000 1.250000
Wendy Yule 44250.000000 3.000000

前面两个例子与 groupby 很像,第 3 个例子,增加 columns 这个参数,添加透视表的列 Product,把原来存在于 Product 中的数据作为列名提出来。

pd.pivot_table(sales_df, index=['Manager','Rep'], values=['Quantity', 'Price'], columns=["Product"])
Price Quantity
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Rep
Debra Henley Craig Booker 32500.0 5000.0 NaN 10000.0 1.0 2.0 NaN 1.0
Daniel Hilton 52500.0 NaN NaN 10000.0 2.0 NaN NaN 1.0
John Smith 35000.0 5000.0 NaN NaN 1.0 2.0 NaN NaN
Fred Anderson Cedric Moss 47500.0 5000.0 NaN 10000.0 1.5 1.0 NaN 1.0
Wendy Yule 82500.0 7000.0 5000.0 NaN 3.5 3.0 2.0 NaN

生成的数据透视表有两层列索引,第一层是原有的 Price、Quantity 这些字段,第二层是 columns=["Product"] 里面的值:CPU、Monitor 等。

values 参数#

values 参数指明要聚合的列或列的列表,通常是你想要在透视表中分析的数值数据。

例 4:只显示 Price 列对应的平均值。

pd.pivot_table(sales_df, index=['Manager','Rep'], values='Price')
Price
Manager Rep
Debra Henley Craig Booker 20000.000000
Daniel Hilton 38333.333333
John Smith 20000.000000
Fred Anderson Cedric Moss 27500.000000
Wendy Yule 44250.000000

aggfunc 参数#

用于汇总数据的聚合函数,可以是字符串(如’sum’、’mean’、’count’ 等预置函数)或自定义聚合函数。

例:对 Price 显示 meanlen 两列结果。

pd.pivot_table(sales_df, index=["Manager","Rep"], values=["Price"], aggfunc=["mean", len])
mean len
Price Price
Manager Rep
Debra Henley Craig Booker 20000.000000 4
Daniel Hilton 38333.333333 3
John Smith 20000.000000 2
Fred Anderson Cedric Moss 27500.000000 4
Wendy Yule 44250.000000 4

fill_value 和 margins 参数#

fill_value 参数用于填充缺失值的值。margins 参数为布尔值,如果为 True, 则在透视表中包含行和列的总计。

pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
               values=["Price","Quantity"],
               aggfunc=[np.sum,np.mean], fill_value=0, margins=True)
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/3052684454.py:1: FutureWarning: The provided callable <function sum at 0x1041ee340> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/3052684454.py:1: FutureWarning: The provided callable <function mean at 0x1041ef420> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  pd.pivot_table(sales_df,index=["Manager","Rep","Product"],
sum mean
Price Quantity Price Quantity
Manager Rep Product
Debra Henley Craig Booker CPU 65000 2 32500.000000 1.000000
Maintenance 5000 2 5000.000000 2.000000
Software 10000 1 10000.000000 1.000000
Daniel Hilton CPU 105000 4 52500.000000 2.000000
Software 10000 1 10000.000000 1.000000
John Smith CPU 35000 1 35000.000000 1.000000
Maintenance 5000 2 5000.000000 2.000000
Fred Anderson Cedric Moss CPU 95000 3 47500.000000 1.500000
Maintenance 5000 1 5000.000000 1.000000
Software 10000 1 10000.000000 1.000000
Wendy Yule CPU 165000 7 82500.000000 3.500000
Maintenance 7000 3 7000.000000 3.000000
Monitor 5000 2 5000.000000 2.000000
All 522000 30 30705.882353 1.764706

维度交叉#

将以上参数结合起来,对 Quantity 和 Price 应用不同的聚合函数。

table = pd.pivot_table(sales_df,
        index=["Manager","Status"],
        columns=["Product"],
        values=["Quantity","Price"],
        aggfunc={"Quantity":len, "Price":[np.mean]},
        fill_value=0)
table
/var/folders/4n/v40br47s46ggrjm9bdm64lwh0000gn/T/ipykernel_19556/1669582107.py:1: FutureWarning: The provided callable <function mean at 0x1041ef420> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  table = pd.pivot_table(sales_df,
Price Quantity
mean len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley declined 35000.0 0.0 0.0 0.0 2 0 0 0
pending 40000.0 5000.0 0.0 0.0 1 2 0 0
presented 30000.0 0.0 0.0 10000.0 1 0 0 2
won 65000.0 0.0 0.0 0.0 1 0 0 0
Fred Anderson declined 65000.0 0.0 0.0 0.0 1 0 0 0
pending 0.0 5000.0 0.0 0.0 0 1 0 0
presented 30000.0 0.0 5000.0 10000.0 1 0 1 1
won 82500.0 7000.0 0.0 0.0 2 1 0 0

数据筛选#

建立数据透视表(pivot table)后,也可以进一步筛选数据以满足特定的条件,类似于 DataFrame 的数据切片方法中的条件索引。这里主要介绍 query() 函数,通过传递查询字符串来筛选数据。

例 1: 筛选出 Manager 为 Debra Henley 的结果。

table.query('Manager == ["Debra Henley"]')
Price Quantity
mean len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley declined 35000.0 0.0 0.0 0.0 2 0 0 0
pending 40000.0 5000.0 0.0 0.0 1 2 0 0
presented 30000.0 0.0 0.0 10000.0 1 0 0 2
won 65000.0 0.0 0.0 0.0 1 0 0 0

例 2: 筛选出 Status 为 pending 或 won 的结果。

table.query('Status == ["pending","won"]')
Price Quantity
mean len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley pending 40000.0 5000.0 0.0 0.0 1 2 0 0
won 65000.0 0.0 0.0 0.0 1 0 0 0
Fred Anderson pending 0.0 5000.0 0.0 0.0 0 1 0 0
won 82500.0 7000.0 0.0 0.0 2 1 0 0

melt#

melt() 是用于将一个很宽的的表转换为长。这个函数通常用于数据重塑(data reshaping)操作,以便更容易进行数据分析和可视化。根据其英文名,又可被理解成,将一个宽表,“融化”成一个长表。通过这种方式,可以将宽格式数据表格中的多列数据整合到一个列中,可以理解为透视表 pivot_table() 的反操作。

d1 = {
    'Student_id': pd.Series([1, 2, 3]), 
    'Name': pd.Series(['Amy', 'Bob', 'John']), 
    'Math': pd.Series([90, 78, 88]),
    'English':pd.Series([85, 92, 79]),
    'History':pd.Series([88, 76, 90])
}
student_df = pd.DataFrame(d1)
student_df
Student_id Name Math English History
0 1 Amy 90 85 88
1 2 Bob 78 92 76
2 3 John 88 79 90
melted_df = pd.melt(student_df, 
                    id_vars=['Student_id', 'Name'], 
                    value_vars=['Math', 'English', 'History'], 
                    var_name='Subject', 
                    value_name='Score')
melted_df
Student_id Name Subject Score
0 1 Amy Math 90
1 2 Bob Math 78
2 3 John Math 88
3 1 Amy English 85
4 2 Bob English 92
5 3 John English 79
6 1 Amy History 88
7 2 Bob History 76
8 3 John History 90

数据重塑时,通常需要使用 value_vars 来指定需要 “融化” 的列,使它们被整合为一列。比如本例中,将 Math, English,History 整合为一列: Subject 列。

../_images/melt.svg

图 2.11 melt(): 数据融化#

我们回到 melt() 函数的参数。完整的参数形式为:pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)

  • frame 指的是要进行重塑操作的 DataFrame 。

  • id_vars 表示需要保留的列,它们将成为长表中的行;这些列又被称为标识变量(identifier variable),不被融化。

  • value_vars 是需要融化的列,它们将被整合成一列,并用新的列名表示。

  • var_name 用于存储融化后的列名的新列的名称。

  • value_name 用于存储融化后的值的新列的名称。