如果你觉得内容对你有帮助,请在 GitHub 上点个 star 吧!

2.3. 数据切片#

实际中,我们常常不是分析整个数据,而是数据中的部分子集。即 图 2.6 中的转化。本节主要介绍如何根据特定的条件获得所需要的数据。

../_images/slicing.svg

图 2.6 数据切片#

Hide code cell content
import os
import urllib.request
import zipfile
import pandas as pd

folder_path = os.path.join(os.getcwd(), "../data/pwt")
download_url = "https://www.rug.nl/ggdc/docs/pwt70_06032011version.zip"
file_name = download_url.split("/")[-1]
zip_file_path = os.path.join(folder_path, file_name)
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

if not os.path.exists(zip_file_path):
    urllib.request.urlretrieve(download_url, zip_file_path)
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(folder_path)

df = pd.read_csv(os.path.join(folder_path, "pwt70_w_country_names.csv"))
文件夹已存在,无需操作。

主要方法#

DataFrame 选择行和列,主要有几种方式:

  • 使用 [] 选择

  • 使用 .iloc 或者 .loc 函数

  • 使用 .query 函数

使用 [] 进行选择#

  • 选择行

直接使用数字索引即可,df[a,b]表示选择 DataFrame 的第a行到第b-1行。

Note

Python中的索引区间都是左闭右开区间,这意味着左边端点可以取到,而右边端点取不到。

例:对上一章节的PWT案例数据 df 选择第 2 行到第 5 行(不包括第 5 行)的数据。

df[2:5]
country isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
2 Afghanistan AFG 1952 8425.333 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Afghanistan AFG 1953 8573.217 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Afghanistan AFG 1954 8728.408 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 37 columns

  • 选择列

我们可以传递一个列表,其中包含所需列的列名,为字符串形式。

例:选择 country 和 tcgdp 两列。

df[['country', 'tcgdp']]
country tcgdp
0 Afghanistan NaN
1 Afghanistan NaN
2 Afghanistan NaN
3 Afghanistan NaN
4 Afghanistan NaN
... ... ...
11395 Zimbabwe 1968.205961
11396 Zimbabwe 2132.305773
11397 Zimbabwe 2107.937100
11398 Zimbabwe 1772.209867
11399 Zimbabwe 1906.049843

11400 rows × 2 columns

如果只选取一列,df['country'] 等价于 df.country

  • [] 选择符合特定条件的数据。

例如,选取 POP 大于 20000 的行。判断语句 df.POP> 20000 会返回一系列布尔值,符合 POP 大于 20000 条件的会返回为 True。如果想要选择这些符合条件的数据,则需要:

df[df.POP>= 20000]
country isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
45 Afghanistan AFG 1995 20881.480 0.833333 Afghani 0.352981 7640.404208 365.893807 380.608406 ... 11.969276 11.435963 64.779503 575.749694 1427.160971 1432.074990 NaN NaN NaN 450.097562
46 Afghanistan AFG 1996 21559.923 2.500023 Afghani 1.035618 7541.782643 349.805639 363.579817 ... 12.403287 11.368530 72.681765 539.935370 1339.773062 1343.634290 NaN NaN NaN 422.025786
47 Afghanistan AFG 1997 22227.415 3.000030 Afghani 1.230919 7420.757993 333.856096 342.188209 ... 2.098612 22.681018 57.214244 569.246414 1414.179670 1508.635000 NaN NaN NaN 388.407429
48 Afghanistan AFG 1998 22912.814 3.000053 Afghani 1.225851 7203.038357 314.367251 321.797566 ... 2.155247 22.589964 62.186846 539.258652 1340.627874 1430.919956 NaN NaN NaN 363.011141
49 Afghanistan AFG 1999 23646.128 45.106375 Afghani 18.193448 7053.728453 298.303741 305.275708 ... 2.169332 22.483448 67.001174 510.554309 1270.043559 1356.481758 NaN NaN NaN 339.159330
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11275 Yemen YEM 2005 20345.000 195.080000 Yemeni Rial 69.501112 49247.125610 2420.600915 2445.644456 ... 6.630541 30.908319 70.762075 3152.987812 9689.452913 9689.452913 NaN 9376.655609 NaN 2420.600915
11276 Yemen YEM 2006 20965.000 198.500000 Yemeni Rial 79.718914 51669.180970 2464.544764 2509.971963 ... 7.627067 28.001314 77.513306 3098.297843 9398.235529 9540.051028 NaN 9243.472310 NaN 2382.182154
11277 Yemen YEM 2007 21591.000 199.540000 Yemeni Rial 84.656475 55761.428540 2582.623711 2607.372597 ... 8.616352 29.295411 74.538105 3149.613533 9432.763102 9606.653374 NaN 9314.041767 NaN 2417.153583
11278 Yemen YEM 2008 22223.000 200.080000 Yemeni Rial 95.889955 59804.720830 2691.118248 2708.301212 ... 7.578654 26.635831 77.739227 3154.238236 9329.826256 9586.031067 NaN 9295.538992 NaN 2437.614583
11279 Yemen YEM 2009 22858.000 207.320000 Yemeni Rial 87.617700 61288.814890 2681.285103 2699.266570 ... 8.432549 23.873591 66.853623 3116.525801 NaN NaN NaN 9276.310000 NaN 2415.879535

2223 rows × 37 columns

例:选择 cc 列和 cg 列的和大于 80 并且 POP 小于 20000 的行。

df[(df.cc + df.cg>= 80) & (df.POP <= 20000)]
country isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
20 Afghanistan AFG 1970 12430.623 4.499843e-02 Afghani 0.030932 2544.158761 204.668645 217.070672 ... 6.874972 7.282942 11.400791 1081.424221 2573.757504 2628.306751 NaN NaN NaN 855.028440
21 Afghanistan AFG 1971 12749.385 4.499843e-02 Afghani 0.029717 2772.708160 217.477797 233.174215 ... 7.391439 8.136900 16.542793 1107.678841 2638.926007 2690.518366 NaN NaN NaN 864.055243
22 Afghanistan AFG 1972 13079.460 4.499843e-02 Afghani 0.028351 2532.448551 193.620268 203.815514 ... 7.894959 7.223611 17.784399 945.298951 2254.372064 2287.650884 NaN NaN NaN 737.915549
23 Afghanistan AFG 1973 13421.301 4.499843e-02 Afghani 0.026877 2901.972071 216.221369 225.405444 ... 7.451048 6.500619 18.505092 999.400460 2385.834304 2424.683471 NaN NaN NaN 779.620585
24 Afghanistan AFG 1974 13772.076 4.499843e-02 Afghani 0.028327 3424.228187 248.635586 257.753750 ... 7.098609 7.785019 24.503937 1025.208685 2449.957675 2497.585718 NaN NaN NaN 812.015225
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11395 Zimbabwe ZWE 2005 11639.470 2.236364e+01 Zimbabwe Dollar 39.482829 1968.205961 169.097559 184.183929 ... 6.995770 9.376272 89.399427 214.739197 418.970867 418.970867 NaN 390.907086 NaN 169.097559
11396 Zimbabwe ZWE 2006 11544.326 1.643606e+02 Zimbabwe Dollar 384.899651 2132.305773 184.705956 192.953943 ... 7.648020 14.986823 81.697014 217.543648 424.754259 407.262097 NaN 377.352394 NaN 179.368685
11397 Zimbabwe ZWE 2007 11443.187 9.675781e+03 Zimbabwe Dollar 38583.323960 2107.937100 184.208918 198.215361 ... 8.387106 15.787322 84.483374 202.707080 396.486201 376.163064 NaN 345.764991 NaN 173.113448
11398 Zimbabwe ZWE 2008 11350.000 6.715424e+09 Zimbabwe Dollar 38723.957740 1772.209867 156.141839 162.112294 ... 7.685312 13.444449 85.117130 174.178806 343.159758 332.649861 NaN 302.945712 NaN 142.329054
11399 Zimbabwe ZWE 2009 11383.000 1.400000e+17 Zimbabwe Dollar 40289.958990 1906.049843 167.447056 174.419700 ... 7.905525 14.743667 83.749534 182.613004 NaN NaN NaN 314.171069 NaN 151.435285

3827 rows × 37 columns

iloc 或者 loc#

使用 iloc 函数进行选择,形式应为 .iloc[rows, columns]

可以将 i 理解为 integer,即 i 是整数,表示行或者列位置(位置由被称为 index),iloc 即用整数来选择行或者选择列。i 从 0 开始,至 length-1

例如:选择第 2 行到第 5 行(不包括第 5 行),第 0 列到第 4 列(不包括第 4 列)。

df.iloc[2:5, 0:4]
country isocode year POP
2 Afghanistan AFG 1952 8425.333
3 Afghanistan AFG 1953 8573.217
4 Afghanistan AFG 1954 8728.408

使用 loc 函数进行选择,与 iloc 的区别在于,loc 除了接受整数外,还可以接受标签(ab 这样的列名)、表示整数位置的 index、boolean

例:选择第 2 行到第 5 行(不包括第 5 行),country 和 tcgdp 列。

df.loc[df.index[2:5], ['country', 'tcgdp']]
country tcgdp
2 Afghanistan NaN
3 Afghanistan NaN
4 Afghanistan NaN

例:使用 loc 函数选择 POP 列最大值的行。

df.loc[df.POP == max(df.POP)]
country isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
2159 China Version 1 CHN 2009 1323592.0 6.83 Yuan Renminbi 3.342447 10739325.58 8113.773410 8296.575690 ... 15.779553 46.707842 58.579480 7782.385295 NaN NaN NaN 11914.32395 NaN 7314.930606
2219 China Version 2 CH2 2009 1323592.0 6.83 Yuan Renminbi 3.164712 11342464.26 8569.456643 9227.187296 ... 15.655224 39.044203 61.240342 8247.646160 NaN NaN NaN 12658.65025 NaN 7747.331618

2 rows × 37 columns

还可以使用这种形式:.loc[,],两个参数用逗号隔开,第一个参数接受条件,第二个参数接受我们想要返回的列名,得到的是符合条件的特定的列。

例:选择满足 cc 列加 cg 列大于等于80,POP小于等于20000条件的 country, year, POP 三列。

df.loc[(df.cc + df.cg>= 80) & (df.POP <= 20000), ['country', 'year', 'POP']]
country year POP
20 Afghanistan 1970 12430.623
21 Afghanistan 1971 12749.385
22 Afghanistan 1972 13079.460
23 Afghanistan 1973 13421.301
24 Afghanistan 1974 13772.076
... ... ... ...
11395 Zimbabwe 2005 11639.470
11396 Zimbabwe 2006 11544.326
11397 Zimbabwe 2007 11443.187
11398 Zimbabwe 2008 11350.000
11399 Zimbabwe 2009 11383.000

3827 rows × 3 columns

由于 iloc 函数只接受整数,所以不允许使用这种条件进行筛选。

query#

.query() 函数的用法与 [] 有相似之处。值得注意的是,.query() 函数的性能更好,在处理大规模数据时更快。

df.query("POP>= 20000")
country isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
45 Afghanistan AFG 1995 20881.480 0.833333 Afghani 0.352981 7640.404208 365.893807 380.608406 ... 11.969276 11.435963 64.779503 575.749694 1427.160971 1432.074990 NaN NaN NaN 450.097562
46 Afghanistan AFG 1996 21559.923 2.500023 Afghani 1.035618 7541.782643 349.805639 363.579817 ... 12.403287 11.368530 72.681765 539.935370 1339.773062 1343.634290 NaN NaN NaN 422.025786
47 Afghanistan AFG 1997 22227.415 3.000030 Afghani 1.230919 7420.757993 333.856096 342.188209 ... 2.098612 22.681018 57.214244 569.246414 1414.179670 1508.635000 NaN NaN NaN 388.407429
48 Afghanistan AFG 1998 22912.814 3.000053 Afghani 1.225851 7203.038357 314.367251 321.797566 ... 2.155247 22.589964 62.186846 539.258652 1340.627874 1430.919956 NaN NaN NaN 363.011141
49 Afghanistan AFG 1999 23646.128 45.106375 Afghani 18.193448 7053.728453 298.303741 305.275708 ... 2.169332 22.483448 67.001174 510.554309 1270.043559 1356.481758 NaN NaN NaN 339.159330
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11275 Yemen YEM 2005 20345.000 195.080000 Yemeni Rial 69.501112 49247.125610 2420.600915 2445.644456 ... 6.630541 30.908319 70.762075 3152.987812 9689.452913 9689.452913 NaN 9376.655609 NaN 2420.600915
11276 Yemen YEM 2006 20965.000 198.500000 Yemeni Rial 79.718914 51669.180970 2464.544764 2509.971963 ... 7.627067 28.001314 77.513306 3098.297843 9398.235529 9540.051028 NaN 9243.472310 NaN 2382.182154
11277 Yemen YEM 2007 21591.000 199.540000 Yemeni Rial 84.656475 55761.428540 2582.623711 2607.372597 ... 8.616352 29.295411 74.538105 3149.613533 9432.763102 9606.653374 NaN 9314.041767 NaN 2417.153583
11278 Yemen YEM 2008 22223.000 200.080000 Yemeni Rial 95.889955 59804.720830 2691.118248 2708.301212 ... 7.578654 26.635831 77.739227 3154.238236 9329.826256 9586.031067 NaN 9295.538992 NaN 2437.614583
11279 Yemen YEM 2009 22858.000 207.320000 Yemeni Rial 87.617700 61288.814890 2681.285103 2699.266570 ... 8.432549 23.873591 66.853623 3116.525801 NaN NaN NaN 9276.310000 NaN 2415.879535

2223 rows × 37 columns

df.query("cc + cg>= 80 & POP <= 20000")
country isocode year POP XRAT Currency_Unit ppp tcgdp cgdp cgdp2 ... kg ki openk rgdpeqa rgdpwok rgdpl2wok rgdpl2pe rgdpl2te rgdpl2th rgdptt
20 Afghanistan AFG 1970 12430.623 4.499843e-02 Afghani 0.030932 2544.158761 204.668645 217.070672 ... 6.874972 7.282942 11.400791 1081.424221 2573.757504 2628.306751 NaN NaN NaN 855.028440
21 Afghanistan AFG 1971 12749.385 4.499843e-02 Afghani 0.029717 2772.708160 217.477797 233.174215 ... 7.391439 8.136900 16.542793 1107.678841 2638.926007 2690.518366 NaN NaN NaN 864.055243
22 Afghanistan AFG 1972 13079.460 4.499843e-02 Afghani 0.028351 2532.448551 193.620268 203.815514 ... 7.894959 7.223611 17.784399 945.298951 2254.372064 2287.650884 NaN NaN NaN 737.915549
23 Afghanistan AFG 1973 13421.301 4.499843e-02 Afghani 0.026877 2901.972071 216.221369 225.405444 ... 7.451048 6.500619 18.505092 999.400460 2385.834304 2424.683471 NaN NaN NaN 779.620585
24 Afghanistan AFG 1974 13772.076 4.499843e-02 Afghani 0.028327 3424.228187 248.635586 257.753750 ... 7.098609 7.785019 24.503937 1025.208685 2449.957675 2497.585718 NaN NaN NaN 812.015225
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11395 Zimbabwe ZWE 2005 11639.470 2.236364e+01 Zimbabwe Dollar 39.482829 1968.205961 169.097559 184.183929 ... 6.995770 9.376272 89.399427 214.739197 418.970867 418.970867 NaN 390.907086 NaN 169.097559
11396 Zimbabwe ZWE 2006 11544.326 1.643606e+02 Zimbabwe Dollar 384.899651 2132.305773 184.705956 192.953943 ... 7.648020 14.986823 81.697014 217.543648 424.754259 407.262097 NaN 377.352394 NaN 179.368685
11397 Zimbabwe ZWE 2007 11443.187 9.675781e+03 Zimbabwe Dollar 38583.323960 2107.937100 184.208918 198.215361 ... 8.387106 15.787322 84.483374 202.707080 396.486201 376.163064 NaN 345.764991 NaN 173.113448
11398 Zimbabwe ZWE 2008 11350.000 6.715424e+09 Zimbabwe Dollar 38723.957740 1772.209867 156.141839 162.112294 ... 7.685312 13.444449 85.117130 174.178806 343.159758 332.649861 NaN 302.945712 NaN 142.329054
11399 Zimbabwe ZWE 2009 11383.000 1.400000e+17 Zimbabwe Dollar 40289.958990 1906.049843 167.447056 174.419700 ... 7.905525 14.743667 83.749534 182.613004 NaN NaN NaN 314.171069 NaN 151.435285

3827 rows × 37 columns