2.3. 数据切片#
实际中,我们常常不是分析整个数据,而是数据中的部分子集。即 图 2.6 中的转化。本节主要介绍如何根据特定的条件获得所需要的数据。
Show code cell content
import os
import urllib.request
import zipfile
import pandas as pd
folder_path = os.path.join(os.getcwd(), "../data/pwt")
download_url = "https://www.rug.nl/ggdc/docs/pwt70_06032011version.zip"
file_name = download_url.split("/")[-1]
zip_file_path = os.path.join(folder_path, file_name)
if not os.path.exists(folder_path):
os.makedirs(folder_path)
if not os.path.exists(zip_file_path):
urllib.request.urlretrieve(download_url, zip_file_path)
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
zip_ref.extractall(folder_path)
df = pd.read_csv(os.path.join(folder_path, "pwt70_w_country_names.csv"))
文件夹已存在,无需操作。
主要方法#
对 DataFrame
选择行和列,主要有几种方式:
使用
[]
选择使用
.iloc
或者.loc
函数使用
.query
函数
使用 []
进行选择#
选择行
直接使用数字索引即可,df[a,b]
表示选择 DataFrame
的第a
行到第b-1
行。
Note
Python中的索引区间都是左闭右开区间,这意味着左边端点可以取到,而右边端点取不到。
例:对上一章节的PWT案例数据 df 选择第 2 行到第 5 行(不包括第 5 行)的数据。
df[2:5]
country | isocode | year | POP | XRAT | Currency_Unit | ppp | tcgdp | cgdp | cgdp2 | ... | kg | ki | openk | rgdpeqa | rgdpwok | rgdpl2wok | rgdpl2pe | rgdpl2te | rgdpl2th | rgdptt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | Afghanistan | AFG | 1952 | 8425.333 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Afghanistan | AFG | 1953 | 8573.217 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Afghanistan | AFG | 1954 | 8728.408 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 37 columns
选择列
我们可以传递一个列表,其中包含所需列的列名,为字符串形式。
例:选择 country 和 tcgdp 两列。
df[['country', 'tcgdp']]
country | tcgdp | |
---|---|---|
0 | Afghanistan | NaN |
1 | Afghanistan | NaN |
2 | Afghanistan | NaN |
3 | Afghanistan | NaN |
4 | Afghanistan | NaN |
... | ... | ... |
11395 | Zimbabwe | 1968.205961 |
11396 | Zimbabwe | 2132.305773 |
11397 | Zimbabwe | 2107.937100 |
11398 | Zimbabwe | 1772.209867 |
11399 | Zimbabwe | 1906.049843 |
11400 rows × 2 columns
如果只选取一列,df['country']
等价于 df.country
。
[]
选择符合特定条件的数据。
例如,选取 POP 大于 20000 的行。判断语句 df.POP> 20000
会返回一系列布尔值,符合 POP 大于 20000 条件的会返回为 True
。如果想要选择这些符合条件的数据,则需要:
df[df.POP>= 20000]
country | isocode | year | POP | XRAT | Currency_Unit | ppp | tcgdp | cgdp | cgdp2 | ... | kg | ki | openk | rgdpeqa | rgdpwok | rgdpl2wok | rgdpl2pe | rgdpl2te | rgdpl2th | rgdptt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | Afghanistan | AFG | 1995 | 20881.480 | 0.833333 | Afghani | 0.352981 | 7640.404208 | 365.893807 | 380.608406 | ... | 11.969276 | 11.435963 | 64.779503 | 575.749694 | 1427.160971 | 1432.074990 | NaN | NaN | NaN | 450.097562 |
46 | Afghanistan | AFG | 1996 | 21559.923 | 2.500023 | Afghani | 1.035618 | 7541.782643 | 349.805639 | 363.579817 | ... | 12.403287 | 11.368530 | 72.681765 | 539.935370 | 1339.773062 | 1343.634290 | NaN | NaN | NaN | 422.025786 |
47 | Afghanistan | AFG | 1997 | 22227.415 | 3.000030 | Afghani | 1.230919 | 7420.757993 | 333.856096 | 342.188209 | ... | 2.098612 | 22.681018 | 57.214244 | 569.246414 | 1414.179670 | 1508.635000 | NaN | NaN | NaN | 388.407429 |
48 | Afghanistan | AFG | 1998 | 22912.814 | 3.000053 | Afghani | 1.225851 | 7203.038357 | 314.367251 | 321.797566 | ... | 2.155247 | 22.589964 | 62.186846 | 539.258652 | 1340.627874 | 1430.919956 | NaN | NaN | NaN | 363.011141 |
49 | Afghanistan | AFG | 1999 | 23646.128 | 45.106375 | Afghani | 18.193448 | 7053.728453 | 298.303741 | 305.275708 | ... | 2.169332 | 22.483448 | 67.001174 | 510.554309 | 1270.043559 | 1356.481758 | NaN | NaN | NaN | 339.159330 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11275 | Yemen | YEM | 2005 | 20345.000 | 195.080000 | Yemeni Rial | 69.501112 | 49247.125610 | 2420.600915 | 2445.644456 | ... | 6.630541 | 30.908319 | 70.762075 | 3152.987812 | 9689.452913 | 9689.452913 | NaN | 9376.655609 | NaN | 2420.600915 |
11276 | Yemen | YEM | 2006 | 20965.000 | 198.500000 | Yemeni Rial | 79.718914 | 51669.180970 | 2464.544764 | 2509.971963 | ... | 7.627067 | 28.001314 | 77.513306 | 3098.297843 | 9398.235529 | 9540.051028 | NaN | 9243.472310 | NaN | 2382.182154 |
11277 | Yemen | YEM | 2007 | 21591.000 | 199.540000 | Yemeni Rial | 84.656475 | 55761.428540 | 2582.623711 | 2607.372597 | ... | 8.616352 | 29.295411 | 74.538105 | 3149.613533 | 9432.763102 | 9606.653374 | NaN | 9314.041767 | NaN | 2417.153583 |
11278 | Yemen | YEM | 2008 | 22223.000 | 200.080000 | Yemeni Rial | 95.889955 | 59804.720830 | 2691.118248 | 2708.301212 | ... | 7.578654 | 26.635831 | 77.739227 | 3154.238236 | 9329.826256 | 9586.031067 | NaN | 9295.538992 | NaN | 2437.614583 |
11279 | Yemen | YEM | 2009 | 22858.000 | 207.320000 | Yemeni Rial | 87.617700 | 61288.814890 | 2681.285103 | 2699.266570 | ... | 8.432549 | 23.873591 | 66.853623 | 3116.525801 | NaN | NaN | NaN | 9276.310000 | NaN | 2415.879535 |
2223 rows × 37 columns
例:选择 cc 列和 cg 列的和大于 80 并且 POP 小于 20000 的行。
df[(df.cc + df.cg>= 80) & (df.POP <= 20000)]
country | isocode | year | POP | XRAT | Currency_Unit | ppp | tcgdp | cgdp | cgdp2 | ... | kg | ki | openk | rgdpeqa | rgdpwok | rgdpl2wok | rgdpl2pe | rgdpl2te | rgdpl2th | rgdptt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | Afghanistan | AFG | 1970 | 12430.623 | 4.499843e-02 | Afghani | 0.030932 | 2544.158761 | 204.668645 | 217.070672 | ... | 6.874972 | 7.282942 | 11.400791 | 1081.424221 | 2573.757504 | 2628.306751 | NaN | NaN | NaN | 855.028440 |
21 | Afghanistan | AFG | 1971 | 12749.385 | 4.499843e-02 | Afghani | 0.029717 | 2772.708160 | 217.477797 | 233.174215 | ... | 7.391439 | 8.136900 | 16.542793 | 1107.678841 | 2638.926007 | 2690.518366 | NaN | NaN | NaN | 864.055243 |
22 | Afghanistan | AFG | 1972 | 13079.460 | 4.499843e-02 | Afghani | 0.028351 | 2532.448551 | 193.620268 | 203.815514 | ... | 7.894959 | 7.223611 | 17.784399 | 945.298951 | 2254.372064 | 2287.650884 | NaN | NaN | NaN | 737.915549 |
23 | Afghanistan | AFG | 1973 | 13421.301 | 4.499843e-02 | Afghani | 0.026877 | 2901.972071 | 216.221369 | 225.405444 | ... | 7.451048 | 6.500619 | 18.505092 | 999.400460 | 2385.834304 | 2424.683471 | NaN | NaN | NaN | 779.620585 |
24 | Afghanistan | AFG | 1974 | 13772.076 | 4.499843e-02 | Afghani | 0.028327 | 3424.228187 | 248.635586 | 257.753750 | ... | 7.098609 | 7.785019 | 24.503937 | 1025.208685 | 2449.957675 | 2497.585718 | NaN | NaN | NaN | 812.015225 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11395 | Zimbabwe | ZWE | 2005 | 11639.470 | 2.236364e+01 | Zimbabwe Dollar | 39.482829 | 1968.205961 | 169.097559 | 184.183929 | ... | 6.995770 | 9.376272 | 89.399427 | 214.739197 | 418.970867 | 418.970867 | NaN | 390.907086 | NaN | 169.097559 |
11396 | Zimbabwe | ZWE | 2006 | 11544.326 | 1.643606e+02 | Zimbabwe Dollar | 384.899651 | 2132.305773 | 184.705956 | 192.953943 | ... | 7.648020 | 14.986823 | 81.697014 | 217.543648 | 424.754259 | 407.262097 | NaN | 377.352394 | NaN | 179.368685 |
11397 | Zimbabwe | ZWE | 2007 | 11443.187 | 9.675781e+03 | Zimbabwe Dollar | 38583.323960 | 2107.937100 | 184.208918 | 198.215361 | ... | 8.387106 | 15.787322 | 84.483374 | 202.707080 | 396.486201 | 376.163064 | NaN | 345.764991 | NaN | 173.113448 |
11398 | Zimbabwe | ZWE | 2008 | 11350.000 | 6.715424e+09 | Zimbabwe Dollar | 38723.957740 | 1772.209867 | 156.141839 | 162.112294 | ... | 7.685312 | 13.444449 | 85.117130 | 174.178806 | 343.159758 | 332.649861 | NaN | 302.945712 | NaN | 142.329054 |
11399 | Zimbabwe | ZWE | 2009 | 11383.000 | 1.400000e+17 | Zimbabwe Dollar | 40289.958990 | 1906.049843 | 167.447056 | 174.419700 | ... | 7.905525 | 14.743667 | 83.749534 | 182.613004 | NaN | NaN | NaN | 314.171069 | NaN | 151.435285 |
3827 rows × 37 columns
iloc 或者 loc#
使用 iloc
函数进行选择,形式应为 .iloc[rows, columns]
。
可以将 i 理解为 integer,即 i 是整数,表示行或者列位置(位置由被称为 index),iloc
即用整数来选择行或者选择列。i 从 0 开始,至 length-1
。
例如:选择第 2 行到第 5 行(不包括第 5 行),第 0 列到第 4 列(不包括第 4 列)。
df.iloc[2:5, 0:4]
country | isocode | year | POP | |
---|---|---|---|---|
2 | Afghanistan | AFG | 1952 | 8425.333 |
3 | Afghanistan | AFG | 1953 | 8573.217 |
4 | Afghanistan | AFG | 1954 | 8728.408 |
使用 loc
函数进行选择,与 iloc
的区别在于,loc
除了接受整数外,还可以接受标签(a
、b
这样的列名)、表示整数位置的 index、boolean
。
例:选择第 2 行到第 5 行(不包括第 5 行),country 和 tcgdp 列。
df.loc[df.index[2:5], ['country', 'tcgdp']]
country | tcgdp | |
---|---|---|
2 | Afghanistan | NaN |
3 | Afghanistan | NaN |
4 | Afghanistan | NaN |
例:使用 loc
函数选择 POP 列最大值的行。
df.loc[df.POP == max(df.POP)]
country | isocode | year | POP | XRAT | Currency_Unit | ppp | tcgdp | cgdp | cgdp2 | ... | kg | ki | openk | rgdpeqa | rgdpwok | rgdpl2wok | rgdpl2pe | rgdpl2te | rgdpl2th | rgdptt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2159 | China Version 1 | CHN | 2009 | 1323592.0 | 6.83 | Yuan Renminbi | 3.342447 | 10739325.58 | 8113.773410 | 8296.575690 | ... | 15.779553 | 46.707842 | 58.579480 | 7782.385295 | NaN | NaN | NaN | 11914.32395 | NaN | 7314.930606 |
2219 | China Version 2 | CH2 | 2009 | 1323592.0 | 6.83 | Yuan Renminbi | 3.164712 | 11342464.26 | 8569.456643 | 9227.187296 | ... | 15.655224 | 39.044203 | 61.240342 | 8247.646160 | NaN | NaN | NaN | 12658.65025 | NaN | 7747.331618 |
2 rows × 37 columns
还可以使用这种形式:.loc[,]
,两个参数用逗号隔开,第一个参数接受条件,第二个参数接受我们想要返回的列名,得到的是符合条件的特定的列。
例:选择满足 cc 列加 cg 列大于等于80,POP小于等于20000条件的 country, year, POP 三列。
df.loc[(df.cc + df.cg>= 80) & (df.POP <= 20000), ['country', 'year', 'POP']]
country | year | POP | |
---|---|---|---|
20 | Afghanistan | 1970 | 12430.623 |
21 | Afghanistan | 1971 | 12749.385 |
22 | Afghanistan | 1972 | 13079.460 |
23 | Afghanistan | 1973 | 13421.301 |
24 | Afghanistan | 1974 | 13772.076 |
... | ... | ... | ... |
11395 | Zimbabwe | 2005 | 11639.470 |
11396 | Zimbabwe | 2006 | 11544.326 |
11397 | Zimbabwe | 2007 | 11443.187 |
11398 | Zimbabwe | 2008 | 11350.000 |
11399 | Zimbabwe | 2009 | 11383.000 |
3827 rows × 3 columns
由于 iloc
函数只接受整数,所以不允许使用这种条件进行筛选。
query#
.query()
函数的用法与 []
有相似之处。值得注意的是,.query()
函数的性能更好,在处理大规模数据时更快。
df.query("POP>= 20000")
country | isocode | year | POP | XRAT | Currency_Unit | ppp | tcgdp | cgdp | cgdp2 | ... | kg | ki | openk | rgdpeqa | rgdpwok | rgdpl2wok | rgdpl2pe | rgdpl2te | rgdpl2th | rgdptt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
45 | Afghanistan | AFG | 1995 | 20881.480 | 0.833333 | Afghani | 0.352981 | 7640.404208 | 365.893807 | 380.608406 | ... | 11.969276 | 11.435963 | 64.779503 | 575.749694 | 1427.160971 | 1432.074990 | NaN | NaN | NaN | 450.097562 |
46 | Afghanistan | AFG | 1996 | 21559.923 | 2.500023 | Afghani | 1.035618 | 7541.782643 | 349.805639 | 363.579817 | ... | 12.403287 | 11.368530 | 72.681765 | 539.935370 | 1339.773062 | 1343.634290 | NaN | NaN | NaN | 422.025786 |
47 | Afghanistan | AFG | 1997 | 22227.415 | 3.000030 | Afghani | 1.230919 | 7420.757993 | 333.856096 | 342.188209 | ... | 2.098612 | 22.681018 | 57.214244 | 569.246414 | 1414.179670 | 1508.635000 | NaN | NaN | NaN | 388.407429 |
48 | Afghanistan | AFG | 1998 | 22912.814 | 3.000053 | Afghani | 1.225851 | 7203.038357 | 314.367251 | 321.797566 | ... | 2.155247 | 22.589964 | 62.186846 | 539.258652 | 1340.627874 | 1430.919956 | NaN | NaN | NaN | 363.011141 |
49 | Afghanistan | AFG | 1999 | 23646.128 | 45.106375 | Afghani | 18.193448 | 7053.728453 | 298.303741 | 305.275708 | ... | 2.169332 | 22.483448 | 67.001174 | 510.554309 | 1270.043559 | 1356.481758 | NaN | NaN | NaN | 339.159330 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11275 | Yemen | YEM | 2005 | 20345.000 | 195.080000 | Yemeni Rial | 69.501112 | 49247.125610 | 2420.600915 | 2445.644456 | ... | 6.630541 | 30.908319 | 70.762075 | 3152.987812 | 9689.452913 | 9689.452913 | NaN | 9376.655609 | NaN | 2420.600915 |
11276 | Yemen | YEM | 2006 | 20965.000 | 198.500000 | Yemeni Rial | 79.718914 | 51669.180970 | 2464.544764 | 2509.971963 | ... | 7.627067 | 28.001314 | 77.513306 | 3098.297843 | 9398.235529 | 9540.051028 | NaN | 9243.472310 | NaN | 2382.182154 |
11277 | Yemen | YEM | 2007 | 21591.000 | 199.540000 | Yemeni Rial | 84.656475 | 55761.428540 | 2582.623711 | 2607.372597 | ... | 8.616352 | 29.295411 | 74.538105 | 3149.613533 | 9432.763102 | 9606.653374 | NaN | 9314.041767 | NaN | 2417.153583 |
11278 | Yemen | YEM | 2008 | 22223.000 | 200.080000 | Yemeni Rial | 95.889955 | 59804.720830 | 2691.118248 | 2708.301212 | ... | 7.578654 | 26.635831 | 77.739227 | 3154.238236 | 9329.826256 | 9586.031067 | NaN | 9295.538992 | NaN | 2437.614583 |
11279 | Yemen | YEM | 2009 | 22858.000 | 207.320000 | Yemeni Rial | 87.617700 | 61288.814890 | 2681.285103 | 2699.266570 | ... | 8.432549 | 23.873591 | 66.853623 | 3116.525801 | NaN | NaN | NaN | 9276.310000 | NaN | 2415.879535 |
2223 rows × 37 columns
df.query("cc + cg>= 80 & POP <= 20000")
country | isocode | year | POP | XRAT | Currency_Unit | ppp | tcgdp | cgdp | cgdp2 | ... | kg | ki | openk | rgdpeqa | rgdpwok | rgdpl2wok | rgdpl2pe | rgdpl2te | rgdpl2th | rgdptt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | Afghanistan | AFG | 1970 | 12430.623 | 4.499843e-02 | Afghani | 0.030932 | 2544.158761 | 204.668645 | 217.070672 | ... | 6.874972 | 7.282942 | 11.400791 | 1081.424221 | 2573.757504 | 2628.306751 | NaN | NaN | NaN | 855.028440 |
21 | Afghanistan | AFG | 1971 | 12749.385 | 4.499843e-02 | Afghani | 0.029717 | 2772.708160 | 217.477797 | 233.174215 | ... | 7.391439 | 8.136900 | 16.542793 | 1107.678841 | 2638.926007 | 2690.518366 | NaN | NaN | NaN | 864.055243 |
22 | Afghanistan | AFG | 1972 | 13079.460 | 4.499843e-02 | Afghani | 0.028351 | 2532.448551 | 193.620268 | 203.815514 | ... | 7.894959 | 7.223611 | 17.784399 | 945.298951 | 2254.372064 | 2287.650884 | NaN | NaN | NaN | 737.915549 |
23 | Afghanistan | AFG | 1973 | 13421.301 | 4.499843e-02 | Afghani | 0.026877 | 2901.972071 | 216.221369 | 225.405444 | ... | 7.451048 | 6.500619 | 18.505092 | 999.400460 | 2385.834304 | 2424.683471 | NaN | NaN | NaN | 779.620585 |
24 | Afghanistan | AFG | 1974 | 13772.076 | 4.499843e-02 | Afghani | 0.028327 | 3424.228187 | 248.635586 | 257.753750 | ... | 7.098609 | 7.785019 | 24.503937 | 1025.208685 | 2449.957675 | 2497.585718 | NaN | NaN | NaN | 812.015225 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11395 | Zimbabwe | ZWE | 2005 | 11639.470 | 2.236364e+01 | Zimbabwe Dollar | 39.482829 | 1968.205961 | 169.097559 | 184.183929 | ... | 6.995770 | 9.376272 | 89.399427 | 214.739197 | 418.970867 | 418.970867 | NaN | 390.907086 | NaN | 169.097559 |
11396 | Zimbabwe | ZWE | 2006 | 11544.326 | 1.643606e+02 | Zimbabwe Dollar | 384.899651 | 2132.305773 | 184.705956 | 192.953943 | ... | 7.648020 | 14.986823 | 81.697014 | 217.543648 | 424.754259 | 407.262097 | NaN | 377.352394 | NaN | 179.368685 |
11397 | Zimbabwe | ZWE | 2007 | 11443.187 | 9.675781e+03 | Zimbabwe Dollar | 38583.323960 | 2107.937100 | 184.208918 | 198.215361 | ... | 8.387106 | 15.787322 | 84.483374 | 202.707080 | 396.486201 | 376.163064 | NaN | 345.764991 | NaN | 173.113448 |
11398 | Zimbabwe | ZWE | 2008 | 11350.000 | 6.715424e+09 | Zimbabwe Dollar | 38723.957740 | 1772.209867 | 156.141839 | 162.112294 | ... | 7.685312 | 13.444449 | 85.117130 | 174.178806 | 343.159758 | 332.649861 | NaN | 302.945712 | NaN | 142.329054 |
11399 | Zimbabwe | ZWE | 2009 | 11383.000 | 1.400000e+17 | Zimbabwe Dollar | 40289.958990 | 1906.049843 | 167.447056 | 174.419700 | ... | 7.905525 | 14.743667 | 83.749534 | 182.613004 | NaN | NaN | NaN | 314.171069 | NaN | 151.435285 |
3827 rows × 37 columns