2.2. Series and DataFrame

pandas has two core data structures: Series and DataFrame.

import pandas as pd

Series

In pandas, a Series is a one-dimensional, labeled, array-like data structure.

Let's create a Series that holds four numbers and name it my_series.

s = pd.Series([1, 2, 3, 4], name = 'my_series')

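A Series is array-like: its values are usually stored in an ndarray, the structure introduced in Section 1.2. The most important part of this structure is the index (Index), which labels each position so that you know which data is stored where. If the index argument of pd.Series() is not specified, the labels default to integers that start at 0 and increase by one: 0, 1, ...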
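To make the default index concrete, here is a minimal sketch. The labels 'a' to 'd' and the variable names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Without an explicit index, pandas assigns the labels 0, 1, 2, ...
s_default = pd.Series([1, 2, 3, 4], name='my_series')
print(s_default.index)

# The same values with explicit (hypothetical) string labels.
s_labeled = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name='my_series')
print(s_labeled.index)

# The values can be exported as a NumPy ndarray.
values = s_default.to_numpy()
print(type(values))
```

In both cases the values are identical; only the labels attached to each position differ.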

A Series supports arithmetic operations.

s * 100

0    100
1    200
2    300
3    400
Name: my_series, dtype: int64
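Arithmetic between two Series is matched by label rather than by position. The following sketch uses two hypothetical Series, a and b, whose labels only partly overlap.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are added where the labels match; labels present on only
# one side produce NaN in the result.
total = a + b
print(total)
```

Because of this alignment, it is worth checking the index of both operands before combining Series from different sources.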

A Series supports descriptive statistics. For example, describe() returns a summary of the common statistics in one call.

s.describe()

count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
Name: my_series, dtype: float64

Compute the mean, median, and standard deviation.

s.mean()

2.5

s.median()

2.5

s.std()

1.2909944487358056
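The value 1.29... above is the sample standard deviation: pandas divides by n - 1 by default (ddof=1), whereas NumPy's np.std divides by n by default. A small sketch of the difference, with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], name='my_series')

sample_std = s.std()               # pandas default: ddof=1, divide by n - 1
population_std = s.std(ddof=0)     # divide by n instead
numpy_std = np.std(s.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

If results from pandas and NumPy disagree slightly, the ddof setting is a likely cause.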

The index of a Series is flexible and can be replaced with custom labels.

s.index = ['number1','number2','number3','number4']

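The Series now behaves much like a Python dict: its elements can be accessed with dict-like syntax, and the index plays the role of the dict keys. For example, use the [] operator to access the value stored under number1.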

s['number1']

As another example, use an in expression to test whether a label is present in the Series.

'number1' in s

True
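The dict analogy can be pushed a little further. The sketch below rebuilds the same Series and tries a few dict-style operations; the label 'number9' is a hypothetical missing key.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=['number1', 'number2', 'number3', 'number4'],
              name='my_series')

first = s['number1']               # label lookup, like dict[key]
fallback = s.get('number9', -1)    # like dict.get: default when the label is absent

# `in` tests the labels, not the stored values.
label_hit = 'number1' in s
value_as_label = 1 in s            # 1 is a value here, not a label
value_hit = 1 in s.values          # test the values explicitly

as_dict = s.to_dict()              # convert to a plain Python dict
print(first, fallback, label_hit, value_as_label, value_hit)
```

As with a dict, membership tests look at the keys, so values have to be checked through s.values.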

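DataFrame

A DataFrame can be thought of as an Excel sheet with many columns and many rows, as shown in Figure 2.2. Each column of a DataFrame represents a field, and each row represents one record. DataFrames are commonly used to analyze tabular data with rows and columns, such as Excel data. Excel is also adding DataFrame compatibility, so that users can process and analyze data with pandas inside Excel.

Creating a DataFrame

There are many ways to create a DataFrame, for example from lists, from a dictionary, or by reading data from a file.

Creating from lists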

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 22]
cities = ['New York', 'San Francisco', 'Los Angeles']
data = {'Name': names, 'Age': ages, 'City': cities}
df = pd.DataFrame(data)

Creating from a dictionary

data = {'Column1': [1, 2], 'Column2': [3, 4]}
df = pd.DataFrame(data)
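Both examples above end up passing a dict of columns to pd.DataFrame. As a complementary sketch, the same table can be built row by row from a list of lists, with the column names supplied through the columns argument. The names rows, df_rows and df_cols are illustrative.

```python
import pandas as pd

# Row-oriented input: each inner list is one record.
rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'San Francisco'],
    ['Charlie', 22, 'Los Angeles'],
]
df_rows = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])

# Column-oriented input: each dict value is one column.
df_cols = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles'],
})

print(df_rows.equals(df_cols))

# Selecting one column by name gives back a Series.
ages = df_cols['Age']
print(type(ages))
```

Which orientation is more convenient depends on how the raw data arrives: record by record, or field by field.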

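Creating from a file

As Figure 2.3 shows, pandas can read files of different types, process the data, and finally persist the result to files of different types.

Each file type has its own function; for example, read_csv reads CSV data. df = pd.read_csv('/path/file.csv') reads a CSV file, and df = pd.read_excel('/path/file.xlsx') reads an Excel file.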

Note

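A CSV file usually consists of many columns. pd.read_csv assumes by default that columns are separated by commas (,), while pd.read_table defaults to a tab (\t) as the separator. These functions accept many other parameters, which you can inspect with help(), for example help(pd.read_csv).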

Likewise, a processed DataFrame can be written back to a file. For example, df.to_excel('/path/file.xlsx') writes it out as an Excel file.
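The read and write functions can be tried without touching the file system. This sketch round-trips a small DataFrame through in-memory text buffers (io.StringIO) instead of the '/path/file.csv' placeholders used above; the buffer-based setup is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4]})

# Write comma-separated text into an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)     # index=False: do not write the row labels
csv_text = buffer.getvalue()

# Read it back; read_csv expects commas by default.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text: read_table expects tabs by default,
# which is the same as read_csv with sep='\t'.
tsv_text = df.to_csv(sep='\t', index=False)
df_from_tsv = pd.read_table(io.StringIO(tsv_text))

print(df_from_csv.equals(df), df_from_tsv.equals(df))
```

With real files, the buffers are simply replaced by paths.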

Computing statistics

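As shown earlier, statistics can be computed on a Series. Since a DataFrame is made up of columns that are each a Series, the same computations apply to it.

Computing a statistic on a single column is essentially computing it on a Series, as illustrated in Figure 2.4.

Figure 2.4 Computing statistics on a single column of a DataFrame

Example: compute the mean of the first column of df. (Slicing a DataFrame is covered in detail later.)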

print("------- df is -------\n{}".format(df))
print("------- Column1.mean() -------\n")
print(df[['Column1']].mean())

------- df is -------
   Column1  Column2
0        1        3
1        2        4
------- Column1.mean() -------

Column1    1.5
dtype: float64
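The output above is a Series rather than a single number because df[['Column1']], with double brackets, selects a one-column DataFrame. A short sketch of the difference between single and double brackets:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4]})

col_series = df['Column1']       # single brackets: a Series
col_frame = df[['Column1']]      # double brackets: a one-column DataFrame

mean_scalar = col_series.mean()  # a single number
mean_series = col_frame.mean()   # a Series with one entry per column

print(type(col_series).__name__, type(col_frame).__name__)
print(mean_scalar)
print(mean_series)
```

Use single brackets when a plain number is wanted, and double brackets when the result should keep the column label.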

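Computing statistics on several columns.

Figure 2.5 Computing statistics on several columns of a DataFrame

Example: compute descriptive statistics for the first and second columns of df.

Because df here has only two columns, the same result can be obtained from the whole DataFrame; that is, the call below is equivalent to df.describe().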

df[['Column1','Column2']].describe()

	Column1	Column2
count	2.000000	2.000000
mean	1.500000	3.500000
std	0.707107	0.707107
min	1.000000	3.000000
25%	1.250000	3.250000
50%	1.500000	3.500000
75%	1.750000	3.750000
max	2.000000	4.000000

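Use the DataFrame.agg() method to compute a specific combination of statistics.

Sometimes one column needs a particular set of statistics and another column needs a different set. In that case, call DataFrame.agg() with a dictionary whose keys are column names and whose values are lists of the statistics wanted.

Example: for the first column of df, get the minimum / maximum / median / skewness; for the second column, get the minimum / maximum / median / mean.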

df.agg({
    'Column1':['min','max','median','skew'],
    'Column2':['min','max','median','mean']   
})

	Column1	Column2
min	1.0	3.0
max	2.0	4.0
median	1.5	3.5
skew	NaN	NaN
mean	NaN	3.5
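Two different things produce NaN in the table above. A statistic that was not requested for a column is left as NaN (mean for Column1, skew for Column2), and the skewness of Column1 is NaN because pandas needs at least three observations to compute it. The sketch below repeats the call on a hypothetical three-row DataFrame to separate the two cases.

```python
import pandas as pd

# Hypothetical three-row data, so that skewness is defined.
df3 = pd.DataFrame({'Column1': [1, 2, 4], 'Column2': [3, 5, 7]})

stats = df3.agg({
    'Column1': ['min', 'max', 'median', 'skew'],
    'Column2': ['min', 'max', 'median', 'mean'],
})
print(stats)

# With only two observations the skewness is undefined (NaN).
two_point_skew = pd.Series([1, 2]).skew()
print(two_point_skew)
```

Here the skew entry for Column1 is a number, while the cells for statistics that were never requested remain NaN.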

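Case study: PWT

PWT (Penn World Table) is an economics database for comparing macroeconomic data across countries and regions. The dataset contains a range of macroeconomic indicators, such as gross domestic product (GDP), per capita income, labor, and capital, along with price levels, exchange rates, and more. We first download the dataset and then explore it briefly with pandas.

Inspecting the data

Use read_csv() to read the data.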

import os
import pandas as pd

# folder_path: download directory, assumed to be defined in the download step
df = pd.read_csv(os.path.join(folder_path, "pwt70_w_country_names.csv"))

head() shows the first n rows.

n = 5
df.head(n)

	country	isocode	year	POP	XRAT	Currency_Unit	ppp	tcgdp	cgdp	cgdp2	...	kg	ki	openk	rgdpeqa	rgdpwok	rgdpl2wok	rgdpl2pe	rgdpl2te	rgdpl2th	rgdptt
0	Afghanistan	AFG	1950	8150.368	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	AFG	1951	8284.473	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Afghanistan	AFG	1952	8425.333	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Afghanistan	AFG	1953	8573.217	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Afghanistan	AFG	1954	8728.408	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 37 columns

tail() shows the last n rows.

df.tail(n)

	country	isocode	year	POP	XRAT	Currency_Unit	ppp	tcgdp	cgdp	cgdp2	...	kg	ki	openk	rgdpeqa	rgdpwok	rgdpl2wok	rgdpl2pe	rgdpl2te	rgdpl2th	rgdptt
11395	Zimbabwe	ZWE	2005	11639.470	2.236364e+01	Zimbabwe Dollar	39.482829	1968.205961	169.097559	184.183929	...	6.995770	9.376272	89.399427	214.739197	418.970867	418.970867	NaN	390.907086	NaN	169.097559
11396	Zimbabwe	ZWE	2006	11544.326	1.643606e+02	Zimbabwe Dollar	384.899651	2132.305773	184.705956	192.953943	...	7.648020	14.986823	81.697014	217.543648	424.754259	407.262097	NaN	377.352394	NaN	179.368685
11397	Zimbabwe	ZWE	2007	11443.187	9.675781e+03	Zimbabwe Dollar	38583.323960	2107.937100	184.208918	198.215361	...	8.387106	15.787322	84.483374	202.707080	396.486201	376.163064	NaN	345.764991	NaN	173.113448
11398	Zimbabwe	ZWE	2008	11350.000	6.715424e+09	Zimbabwe Dollar	38723.957740	1772.209867	156.141839	162.112294	...	7.685312	13.444449	85.117130	174.178806	343.159758	332.649861	NaN	302.945712	NaN	142.329054
11399	Zimbabwe	ZWE	2009	11383.000	1.400000e+17	Zimbabwe Dollar	40289.958990	1906.049843	167.447056	174.419700	...	7.905525	14.743667	83.749534	182.613004	NaN	NaN	NaN	314.171069	NaN	151.435285

5 rows × 37 columns

info() shows basic information about the data, including the type of each field and its count of non-null values.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11400 entries, 0 to 11399
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   country        11400 non-null  object 
 1   isocode        11400 non-null  object 
 2   year           11400 non-null  int64  
 3   POP            11398 non-null  float64
 4   XRAT           10163 non-null  float64
 5   Currency_Unit  10163 non-null  object 
 6   ppp            8745 non-null   float64
 7   tcgdp          8745 non-null   float64
 8   cgdp           8745 non-null   float64
 9   cgdp2          8745 non-null   float64
 10  cda2           8745 non-null   float64
 11  cc             8745 non-null   float64
 12  cg             8745 non-null   float64
 13  ci             8745 non-null   float64
 14  p              8745 non-null   float64
 15  p2             8745 non-null   float64
 16  pc             8745 non-null   float64
 17  pg             8745 non-null   float64
 18  pi             8745 non-null   float64
 19  openc          8745 non-null   float64
 20  cgnp           8305 non-null   float64
 21  y              8745 non-null   float64
 22  y2             8745 non-null   float64
 23  rgdpl          8725 non-null   float64
 24  rgdpl2         8725 non-null   float64
 25  rgdpch         8725 non-null   float64
 26  kc             8725 non-null   float64
 27  kg             8725 non-null   float64
 28  ki             8725 non-null   float64
 29  openk          8725 non-null   float64
 30  rgdpeqa        8555 non-null   float64
 31  rgdpwok        8177 non-null   float64
 32  rgdpl2wok      8177 non-null   float64
 33  rgdpl2pe       845 non-null    float64
 34  rgdpl2te       5399 non-null   float64
 35  rgdpl2th       2274 non-null   float64
 36  rgdptt         8745 non-null   float64
dtypes: float64(33), int64(1), object(3)
memory usage: 3.2+ MB
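The Non-Null Count column hints at how much data is missing. As the PWT file is not bundled with this text, the following sketch uses a small hypothetical DataFrame named toy to show how the same counts can be obtained programmatically.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a dataset with missing values.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B'],
    'POP': [1.0, np.nan, 3.0],
    'XRAT': [np.nan, np.nan, 2.5],
})

non_null = toy.count()             # matches the Non-Null Count shown by info()
missing = toy.isna().sum()         # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

print(non_null)
print(missing)
print(missing_share)
```

The same three calls can be applied to the PWT DataFrame to rank its columns by completeness.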

dtypes shows the data type of each variable.

df.dtypes

country           object
isocode           object
year               int64
POP              float64
XRAT             float64
Currency_Unit     object
ppp              float64
tcgdp            float64
cgdp             float64
cgdp2            float64
cda2             float64
cc               float64
cg               float64
ci               float64
p                float64
p2               float64
pc               float64
pg               float64
pi               float64
openc            float64
cgnp             float64
y                float64
y2               float64
rgdpl            float64
rgdpl2           float64
rgdpch           float64
kc               float64
kg               float64
ki               float64
openk            float64
rgdpeqa          float64
rgdpwok          float64
rgdpl2wok        float64
rgdpl2pe         float64
rgdpl2te         float64
rgdpl2th         float64
rgdptt           float64
dtype: object

.columns shows the column names (variable names) of the DataFrame.

df.columns

Index(['country', 'isocode', 'year', 'POP', 'XRAT', 'Currency_Unit', 'ppp',
       'tcgdp', 'cgdp', 'cgdp2', 'cda2', 'cc', 'cg', 'ci', 'p', 'p2', 'pc',
       'pg', 'pi', 'openc', 'cgnp', 'y', 'y2', 'rgdpl', 'rgdpl2', 'rgdpch',
       'kc', 'kg', 'ki', 'openk', 'rgdpeqa', 'rgdpwok', 'rgdpl2wok',
       'rgdpl2pe', 'rgdpl2te', 'rgdpl2th', 'rgdptt'],
      dtype='object')

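rename() can change both row labels and column labels. Pass a dictionary whose keys are the current names and whose values are the new names, and the matching labels are updated.

Examples:

Rename year to Year and country to Country: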

df_renamed = df.rename(columns={'year': 'Year', 'country': 'Country'})

Convert all column names to lowercase:

df_renamed = df.rename(columns=str.lower)
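Note that rename() returns a new DataFrame and leaves the original untouched, which is why the results above are assigned to df_renamed. A minimal sketch on a hypothetical two-row DataFrame, which also renames row labels through the index argument:

```python
import pandas as pd

# Hypothetical miniature of the PWT data.
df_small = pd.DataFrame({'year': [1950, 1951], 'POP': [8150.368, 8284.473]})

renamed_cols = df_small.rename(columns={'year': 'Year'})              # dict mapping
lower_cols = df_small.rename(columns=str.lower)                       # function applied to each name
renamed_rows = df_small.rename(index={0: 'first', 1: 'second'})       # row labels

print(list(df_small.columns))      # the original is unchanged
print(list(renamed_cols.columns))
print(list(lower_cols.columns))
print(list(renamed_rows.index))
```

Names that do not appear in the mapping are kept as they are.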

.index shows the row labels of the DataFrame.

df.index

RangeIndex(start=0, stop=11400, step=1)

.shape gives the dimensions of the DataFrame as a tuple (number of rows, number of columns), so each value can be read by indexing into the tuple.

# Number of rows in the DataFrame
print(df.shape[0])

# Number of columns in the DataFrame
print(df.shape[1])

11400
37
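Since .shape is an ordinary tuple, it can also be unpacked in one step. The sketch below uses a hypothetical 3 x 2 DataFrame and shows a few equivalent ways to get the same numbers.

```python
import pandas as pd

df_small = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]})

n_rows, n_cols = df_small.shape    # tuple unpacking

print(n_rows, n_cols)
print(len(df_small))               # number of rows
print(len(df_small.columns))       # number of columns
print(df_small.size)               # total number of cells: rows * columns
```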