In [15]: means = df['data1'].groupby([df['key1'], df['key2']]).mean()

In [16]: means
Out[16]: 
key1  key2
a     one     0.880536
      two     0.478943
b     one    -0.519439
      two    -0.555730
Name: data1, dtype: float64
Here I grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:
In [17]: means.unstack()
Out[17]: 
key2       one       two
key1                    
a     0.880536  0.478943
b    -0.519439 -0.555730
In this example, the group keys are all Series, though they could be any arrays of the right length:
In [18]: states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])

In [19]: years = np.array([2005, 2005, 2006, 2005, 2006])

In [20]: df['data1'].groupby([states, years]).mean()
Out[20]: 
California  2005    0.478943
            2006   -0.519439
Ohio        2005   -0.380219
            2006    1.965781
Name: data1, dtype: float64
In [21]: df.groupby('key1').mean()
Out[21]: 
         data1     data2
key1                    
a     0.746672  0.910916
b    -0.537585  0.525384

In [22]: df.groupby(['key1', 'key2']).mean()
Out[22]: 
              data1     data2
key1 key2                    
a    one   0.880536  1.319920
     two   0.478943  0.092908
b    one  -0.519439  0.281746
     two  -0.555730  0.769023
In [23]: df.groupby(['key1', 'key2']).size()
Out[23]: 
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
Take note that any missing values in a group key will be excluded from the result.
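A minimal sketch of that exclusion behavior, using a small made-up Series (the values and keys here are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
key = pd.Series(['a', 'b', np.nan, 'a'])

# The third value has a missing group key, so it is silently
# dropped from the aggregation
result = s.groupby(key).sum()
```

Only the groups 'a' (summing the first and last values) and 'b' appear in the result; the value 3.0 contributes to neither.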
Iterating Over Groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:
In [24]: for name, group in df.groupby('key1'):
   ....:     print(name)
   ....:     print(group)
   ....:
a
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
4  1.965781  1.246435    a  one
b
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
In the case of multiple keys, the first element in the tuple will be a tuple of key values:
In [25]: for (k1, k2), group in df.groupby(['key1', 'key2']):
   ....:     print((k1, k2))
   ....:     print(group)
   ....:
('a', 'one')
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
4  1.965781  1.246435    a  one
('a', 'two')
      data1     data2 key1 key2
1  0.478943  0.092908    a  two
('b', 'one')
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
('b', 'two')
     data1     data2 key1 key2
3 -0.55573  0.769023    b  two
Of course, you can do whatever you want with the pieces of data. A recipe you may find useful is computing a dict of the data pieces:
In [26]: pieces = dict(list(df.groupby('key1')))

In [27]: pieces['b']
Out[27]: 
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
groupby can also group along the columns rather than the rows; here, df's columns are grouped by their dtype:

In [29]: grouped = df.groupby(df.dtypes, axis=1)

In [30]: for dtype, group in grouped:
   ....:     print(dtype)
   ....:     print(group)
   ....:
float64
      data1     data2
0 -0.204708  1.393406
1  0.478943  0.092908
2 -0.519439  0.281746
3 -0.555730  0.769023
4  1.965781  1.246435
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
In [32]: s_grouped = df.groupby(['key1', 'key2'])['data2']

In [33]: s_grouped
Out[33]: <pandas.core.groupby.SeriesGroupBy object at 0x7faa30c78da0>

In [34]: s_grouped.mean()
Out[34]: 
key1  key2
a     one     1.319920
      two     0.092908
b     one     0.281746
      two     0.769023
Name: data2, dtype: float64
Grouping with Dicts and Series
Grouping information may exist in a form other than an array. Let's consider another example DataFrame:
In [35]: people = pd.DataFrame(np.random.randn(5, 5),
   ....:                       columns=['a', 'b', 'c', 'd', 'e'],
   ....:                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])

In [36]: people.iloc[2:3, [1, 2]] = np.nan  # Add a few NA values

In [37]: people
Out[37]: 
               a         b         c         d         e
Joe     1.007189 -1.296221  0.274992  0.228913  1.352917
Steve   0.886429 -2.001637 -0.371843  1.669025 -0.438570
Wes    -0.539741       NaN       NaN -1.021228 -0.577087
Jim     0.124121  0.302614  0.523772  0.000940  1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Now, suppose I have a group correspondence for the columns and want to sum the columns by group:
In [38]: mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
   ....:            'd': 'blue', 'e': 'red', 'f': 'orange'}
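This dict can be passed directly to groupby to group the columns (the book's original passes axis=1). Since the axis argument to groupby has been deprecated in recent pandas versions, an equivalent sketch is to transpose, group the former columns by the dict, and transpose back; note that the unused 'f' key is simply ignored:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Group the columns by color: transpose so the column labels become
# the index, group that index via the dict, sum, and transpose back
by_column = people.T.groupby(mapping).sum().T
```

The result has one column per color, where 'blue' holds the sum of columns 'c' and 'd' and 'red' the sum of 'a', 'b', and 'e'.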
Any function passed as a group key will be called once per index value, with the return values being used as the group names. For example, grouping by the length of the index names:

In [44]: people.groupby(len).sum()
Out[44]: 
          a         b         c         d         e
3  0.591569 -0.993608  0.798764 -0.791374  2.119639
5  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
Mixing functions with arrays, lists, dicts, or Series is not a problem, as everything gets converted to arrays internally:
In [45]: key_list = ['one', 'one', 'one', 'two', 'two']

In [46]: people.groupby([len, key_list]).min()
Out[46]: 
              a         b         c         d         e
3 one -0.539741 -1.296221  0.274992 -1.021228 -0.577087
  two  0.124121  0.302614  0.523772  0.000940  1.343810
5 one  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
In [51]: df
Out[51]: 
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
4  1.965781  1.246435    a  one

In [52]: grouped = df.groupby('key1')

In [53]: grouped['data1'].quantile(0.9)
Out[53]: 
key1
a    1.668413
b   -0.523068
Name: data1, dtype: float64
To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:
In [54]: def peak_to_peak(arr):
   ....:     return arr.max() - arr.min()

In [55]: grouped.agg(peak_to_peak)
Out[55]: 
         data1     data2
key1                    
a     2.170488  1.300498
b     0.036292  0.487276
In [56]: grouped.describe()
Out[56]: 
     data1                                                              \
     count      mean       std       min       25%       50%       75%   
key1                                                                     
a      3.0  0.746672  1.109736 -0.204708  0.137118  0.478943  1.222362   
b      2.0 -0.537585  0.025662 -0.555730 -0.546657 -0.537585 -0.528512   

               data2                                                    \
          max count      mean       std       min       25%       50%   
key1                                                                     
a     1.965781   3.0  0.910916  0.712217  0.092908  0.669671  1.246435   
b    -0.519439   2.0  0.525384  0.344556  0.281746  0.403565  0.525384   

           75%       max  
key1                      
a     1.319920  1.393406  
b     0.647203  0.769023  
In [57]: tips = pd.read_csv('examples/tips.csv')

# Add tip percentage of total bill
In [58]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [59]: tips[:6]
Out[59]: 
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240
In [60]: grouped = tips.groupby(['day', 'smoker'])
Note that for descriptive statistics like those in Table 10-1, you can pass the name of the function as a string:
In [61]: grouped_pct = grouped['tip_pct']

In [62]: grouped_pct.agg('mean')
Out[62]: 
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
In [63]: grouped_pct.agg(['mean', 'std', peak_to_peak])
Out[63]: 
                 mean       std  peak_to_peak
day  smoker                                  
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
In [65]: functions = ['count', 'mean', 'max']

In [66]: result = grouped['tip_pct', 'total_bill'].agg(functions)

In [67]: result
Out[67]: 
            tip_pct                     total_bill                  
              count      mean       max      count       mean    max
day  smoker                                                         
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
In [68]: result['tip_pct']
Out[68]: 
             count      mean       max
day  smoker                           
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
As before, a list of tuples with custom names can be passed:
In [69]: ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]

In [70]: grouped['tip_pct', 'total_bill'].agg(ftuples)
Out[70]: 
                  tip_pct              total_bill            
             Durchschnitt Abweichung Durchschnitt  Abweichung
day  smoker                                                  
Fri  No          0.151650   0.000791    18.420000   25.596333
     Yes         0.174783   0.002631    16.813333   82.562438
Sat  No          0.158048   0.001581    19.661778   79.908965
     Yes         0.147906   0.003767    21.276667  101.387535
Sun  No          0.160113   0.001793    20.506667   66.099980
     Yes         0.187250   0.023757    24.120000  109.046044
Thur No          0.160298   0.001503    17.113111   59.625081
     Yes         0.163863   0.001551    19.190588   69.808518
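Different aggregations can also be applied to different columns by passing a dict mapping column names to functions. A small sketch on a made-up frame (the demo data is illustrative, standing in for the tips dataset):

```python
import pandas as pd

demo = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'tip': [1.0, 2.0, 3.0, 4.0],
    'size': [2, 3, 2, 4],
})

# Column -> aggregation function (a list of functions per
# column also works, yielding hierarchical result columns)
result = demo.groupby('day').agg({'tip': 'max', 'size': 'sum'})
```

Each output column is aggregated with its own function: 'tip' with max, 'size' with sum.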
In [73]: tips.groupby(['day', 'smoker'], as_index=False).mean()
Out[73]: 
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
In [74]: def top(df, n=5, column='tip_pct'):
   ....:     return df.sort_values(by=column)[-n:]

In [75]: top(tips, n=6)
Out[75]: 
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
Now, if we group by smoker and call apply with this function, we get the following:
In [76]: tips.groupby('smoker').apply(top)
Out[76]: 
            total_bill   tip smoker   day    time  size   tip_pct
smoker                                                           
No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       232       11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
In [77]: tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
Out[77]: 
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day                                                             
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982
In [78]: result = tips.groupby('smoker')['tip_pct'].describe()

In [79]: result
Out[79]: 
        count      mean       std       min       25%       50%       75%  \
smoker                                                                      
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014   
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059   

             max  
smoker            
No      0.291990  
Yes     0.710345  

In [80]: result.unstack('smoker')
Out[80]: 
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64
In [81]: tips.groupby('smoker', group_keys=False).apply(top)
Out[81]: 
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
In [99]: data[['Vermont', 'Nevada', 'Idaho']] = np.nan
In [100]: data
Out[100]:
Ohio 0.922264
New York -2.153545
Vermont NaN
Florida -0.375842
Oregon 0.329939
Nevada NaN
California 1.105913
Idaho NaN
dtype: float64
In [101]: data.groupby(group_key).mean()
Out[101]:
East -0.535707
West 0.717926
dtype: float64
We can fill the NA values using the group means, like so:
In [102]: fill_mean = lambda g: g.fillna(g.mean())
In [103]: data.groupby(group_key).apply(fill_mean)
Out[103]:
Ohio 0.922264
New York -2.153545
Vermont -0.535707
Florida -0.375842
Oregon 0.329939
Nevada 0.717926
California 1.105913
Idaho 0.717926
dtype: float64
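The same fill can be expressed without apply by broadcasting the group means back with transform. A self-contained sketch, reconstructing the data and the group_key list from earlier in this example so the snippet stands alone:

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan

# transform('mean') returns a Series aligned with data, holding each
# group's mean at every position; fillna then patches only the NA slots
filled = data.fillna(data.groupby(group_key).transform('mean'))
```

This is equivalent to the apply-based fill_mean above, but keeps the original index without any group-wise function call.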
In another case, you might have predefined fill values in your code that vary by group. Since the groups have a name attribute set internally, we can use that:
In [104]: fill_values = {'East': 0.5, 'West': -1}
In [105]: fill_func = lambda g: g.fillna(fill_values[g.name])
In [106]: data.groupby(group_key).apply(fill_func)
Out[106]:
Ohio 0.922264
New York -2.153545
Vermont 0.500000
Florida -0.375842
Oregon 0.329939
Nevada -1.000000
California 1.105913
Idaho -1.000000
dtype: float64
Example: Random Sampling and Permutation
Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the "draws"; here we use the sample method for Series:
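The deck and draw used in the console session below come from earlier in this example; reconstructing them here so the session can be followed: a Series of 52 cards whose values follow blackjack conventions (aces counted as 1, face cards as 10), plus a small helper built on sample:

```python
import pandas as pd

# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
# Per suit: 1-10, then three face cards worth 10 each
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

# Index is the card name (e.g. 'AH', '10C'); value is the card's worth
deck = pd.Series(card_val, index=cards)

def draw(deck, n=5):
    # Draw n cards at random without replacement
    return deck.sample(n)
```

With this in place, grouping by a function that extracts the suit (the last character of each card name) draws two random cards from every suit, as shown next.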
In [111]: get_suit = lambda card: card[-1] # last letter is suit
In [112]: deck.groupby(get_suit).apply(draw, n=2)
Out[112]:
C 2C 2
3C 3
D KD 10
8D 8
H KH 10
3H 3
S 2S 2
4S 4
dtype: int64
In [114]: df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
.....: 'b', 'b', 'b', 'b'],
.....: 'data': np.random.randn(8),
.....: 'weights': np.random.rand(8)})
In [115]: df
Out[115]:
category data weights
0 a 1.561587 0.957515
1 a 1.219984 0.347267
2 a -0.482239 0.581362
3 a 0.315667 0.217091
4 b -0.047852 0.894406
5 b -0.454145 0.918564
6 b -0.556774 0.277825
7 b 0.253321 0.955905
The group weighted average by category would then be:
In [116]: grouped = df.groupby('category')
In [117]: get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
In [118]: grouped.apply(get_wavg)
Out[118]:
category
a 0.811643
b -0.122262
dtype: float64
Continuing on the preceding example, you can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. For example, I can define the following regress function (using the statsmodels econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:
import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
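To see the per-group regression idea end to end without statsmodels or the book's stock data, here is a numpy-based sketch applied to made-up data (the frame, the true slope of 0.5, and the regress_np helper are all hypothetical, introduced only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'key': ['a'] * 50 + ['b'] * 50,
    'x': rng.randn(100),
})
# y depends linearly on x with a true slope of 0.5 plus small noise
df['y'] = 0.5 * df['x'] + 0.1 * rng.randn(100)

def regress_np(data, yvar, xvars):
    # Plain least squares via numpy, mirroring the statsmodels version
    # above: returns the slope(s) followed by the intercept
    X = np.column_stack([data[xvars].to_numpy(), np.ones(len(data))])
    coefs, *_ = np.linalg.lstsq(X, data[yvar].to_numpy(), rcond=None)
    return pd.Series(coefs, index=list(xvars) + ['intercept'])

# One regression per group; extra arguments are forwarded by apply
result = df.groupby('key').apply(regress_np, 'y', ['x'])
```

Each row of result holds the fitted slope and intercept for one group; both slopes should land near the true value of 0.5.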
In [133]: tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',
.....: aggfunc=len, margins=True)
Out[133]:
day Fri Sat Sun Thur All
time smoker
Dinner No 3.0 45.0 57.0 1.0 106.0
Yes 9.0 42.0 19.0 NaN 70.0
Lunch No 1.0 NaN NaN 44.0 45.0
Yes 6.0 NaN NaN 17.0 23.0
All 19.0 87.0 76.0 62.0 244.0
If some combinations are empty (or otherwise NA), you may wish to pass a fill_value:
In [134]: tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
.....: columns='day', aggfunc='mean', fill_value=0)
Out[134]:
day Fri Sat Sun Thur
time size smoker
Dinner 1 No 0.000000 0.137931 0.000000 0.000000
Yes 0.000000 0.325733 0.000000 0.000000
2 No 0.139622 0.162705 0.168859 0.159744
Yes 0.171297 0.148668 0.207893 0.000000
3 No 0.000000 0.154661 0.152663 0.000000
Yes 0.000000 0.144995 0.152660 0.000000
4 No 0.000000 0.150096 0.148143 0.000000
Yes 0.117750 0.124515 0.193370 0.000000
5 No 0.000000 0.000000 0.206928 0.000000
Yes 0.000000 0.106572 0.065660 0.000000
... ... ... ... ...
Lunch 1 No 0.000000 0.000000 0.000000 0.181728
Yes 0.223776 0.000000 0.000000 0.000000
2 No 0.000000 0.000000 0.000000 0.166005
Yes 0.181969 0.000000 0.000000 0.158843
3 No 0.187735 0.000000 0.000000 0.084246
Yes 0.000000 0.000000 0.000000 0.204952
4 No 0.000000 0.000000 0.000000 0.138919
Yes 0.000000 0.000000 0.000000 0.155410
5 No 0.000000 0.000000 0.000000 0.121389
6 No 0.000000 0.000000 0.000000 0.173706
[21 rows x 4 columns]
In [138]: data
Out[138]:
Sample Nationality Handedness
0 1 USA Right-handed
1 2 Japan Left-handed
2 3 USA Right-handed
3 4 Japan Right-handed
4 5 Japan Left-handed
5 6 Japan Right-handed
6 7 USA Right-handed
7 8 USA Left-handed
8 9 Japan Right-handed
9 10 USA Right-handed
In [139]: pd.crosstab(data.Nationality, data.Handedness, margins=True)
Out[139]:
Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10
The first two arguments to crosstab can each be an array, a Series, or a list of arrays. As in the tips data:
In [140]: pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[140]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
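Besides raw counts, crosstab can express the table as proportions via its normalize parameter. A small sketch on made-up data (the two Series here are illustrative only):

```python
import pandas as pd

nationality = pd.Series(['USA', 'Japan', 'USA', 'Japan', 'USA'])
handedness = pd.Series(['Right', 'Left', 'Right', 'Right', 'Left'])

# Raw frequency table
counts = pd.crosstab(nationality, handedness)

# normalize='index' divides each row by its row total, so every
# row of the result sums to 1
shares = pd.crosstab(nationality, handedness, normalize='index')
```

normalize also accepts 'columns' (column proportions) or 'all' (proportions of the grand total).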