In [15]: means = df['data1'].groupby([df['key1'], df['key2']]).mean()In [16]: meansOut[16]:key1 key2a one 0.880536 two 0.478943b one -0.519439 two -0.555730Name: data1, dtype: float64
这里,我通过两个键对数据进行了分组,得到的Series具有一个层次化索引(由唯一的键对组成):
In [17]: means.unstack()Out[17]:key2 one twokey1 a 0.8805360.478943b -0.519439-0.555730
在这个例子中,分组键均为Series。实际上,分组键可以是任何长度适当的数组:
In [18]: states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])In [19]: years = np.array([2005, 2005, 2006, 2005, 2006])In [20]: df['data1'].groupby([states, years]).mean()Out[20]:California 20050.4789432006-0.519439Ohio 2005-0.38021920061.965781Name: data1, dtype: float64
In [21]: df.groupby('key1').mean()Out[21]: data1 data2key1a 0.7466720.910916b -0.5375850.525384In [22]: df.groupby(['key1', 'key2']).mean()Out[22]: data1 data2key1 key2 a one 0.8805361.319920 two 0.4789430.092908b one -0.5194390.281746 two -0.5557300.769023
In [23]: df.groupby(['key1', 'key2']).size()Out[23]:key1 key2a one 2 two 1b one 1 two 1dtype: int64
注意,任何分组关键词中的缺失值,都会被从结果中除去。
对分组进行迭代
GroupBy对象支持迭代,可以产生一组二元元组(由分组名和数据块组成)。看下面的例子:
In [24]:for name, group in df.groupby('key1'): ....:print(name) ....:print(group) ....:a data1 data2 key1 key20-0.2047081.393406 a one10.4789430.092908 a two41.9657811.246435 a oneb data1 data2 key1 key22-0.5194390.281746 b one3-0.5557300.769023 b two
对于多重键的情况,元组的第一个元素将会是由键值组成的元组:
In [25]:for (k1, k2), group in df.groupby(['key1', 'key2']): ....:print((k1, k2)) ....:print(group) ....:('a','one') data1 data2 key1 key20-0.2047081.393406 a one41.9657811.246435 a one('a','two') data1 data2 key1 key210.4789430.092908 a two('b','one') data1 data2 key1 key22-0.5194390.281746 b one('b','two') data1 data2 key1 key23-0.555730.769023 b two
当然,你可以对这些数据片段做任何操作。有一个你可能会觉得有用的运算:将这些数据片段做成一个字典:
In [26]: pieces =dict(list(df.groupby('key1')))In [27]: pieces['b']Out[27]: data1 data2 key1 key22-0.5194390.281746 b one3-0.5557300.769023 b two
In [30]:for dtype, group in grouped: ....:print(dtype) ....:print(group) ....:float64 data1 data20-0.2047081.39340610.4789430.0929082-0.5194390.2817463-0.5557300.76902341.9657811.246435object key1 key20 a one1 a two2 b one3 b two4 a one
In [32]: s_grouped = df.groupby(['key1', 'key2'])['data2']In [33]: s_groupedOut[33]:<pandas.core.groupby.SeriesGroupBy object at 0x7faa30c78da0>In [34]: s_grouped.mean()Out[34]:key1 key2a one 1.319920 two 0.092908b one 0.281746 two 0.769023Name: data2, dtype: float64
通过字典或Series进行分组
除数组以外,分组信息还可以其他形式存在。来看另一个示例DataFrame:
In [35]: people = pd.DataFrame(np.random.randn(5, 5), ....: columns=['a', 'b', 'c', 'd', 'e'], ....: index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])In [36]: people.iloc[2:3, [1,2]]= np.nan # Add a few NA valuesIn [37]: peopleOut[37]: a b c d eJoe 1.007189-1.2962210.2749920.2289131.352917Steve 0.886429-2.001637-0.3718431.669025-0.438570Wes -0.539741 NaN NaN -1.021228-0.577087Jim 0.1241210.3026140.5237720.0009401.343810Travis -0.713544-0.831154-2.370232-1.860761-0.860757
现在,假设已知列的分组关系,并希望根据分组计算列的和:
In [38]: mapping ={'a':'red','b':'red','c':'blue', ....:'d':'blue','e':'red','f':'orange'}
In [44]: people.groupby(len).sum()Out[44]: a b c d e30.591569-0.9936080.798764-0.7913742.11963950.886429-2.001637-0.3718431.669025-0.4385706-0.713544-0.831154-2.370232-1.860761-0.860757
将函数跟数组、列表、字典、Series混合使用也不是问题,因为任何东西在内部都会被转换为数组:
In [45]: key_list = ['one','one','one','two','two']In [46]: people.groupby([len, key_list]).min()Out[46]: a b c d e3 one -0.539741-1.2962210.274992-1.021228-0.577087 two 0.1241210.3026140.5237720.0009401.3438105 one 0.886429-2.001637-0.3718431.669025-0.4385706 two -0.713544-0.831154-2.370232-1.860761-0.860757
In [51]: dfOut[51]: data1 data2 key1 key20-0.2047081.393406 a one10.4789430.092908 a two2-0.5194390.281746 b one3-0.5557300.769023 b two41.9657811.246435 a oneIn [52]: grouped = df.groupby('key1')In [53]: grouped['data1'].quantile(0.9)Out[53]:key1a 1.668413b -0.523068Name: data1, dtype: float64
如果要使用你自己的聚合函数,只需将其传入aggregate或agg方法即可:
In [54]:defpeak_to_peak(arr): ....:return arr.max()- arr.min()In [55]: grouped.agg(peak_to_peak)Out[55]: data1 data2key1 a 2.1704881.300498b 0.0362920.487276
In [56]: grouped.describe()Out[56]: data1 \ count mean std min25%50%75%key1 a 3.00.7466721.109736-0.2047080.1371180.4789431.222362b 2.0-0.5375850.025662-0.555730-0.546657-0.537585-0.528512 data2 \max count mean std min25%50%key1 a 1.9657813.00.9109160.7122170.0929080.6696711.246435b -0.5194392.00.5253840.3445560.2817460.4035650.52538475%maxkey1 a 1.3199201.393406b 0.6472030.769023
In [57]: tips = pd.read_csv('examples/tips.csv')# Add tip percentage of total billIn [58]: tips['tip_pct']= tips['tip']/ tips['total_bill']In [59]: tips[:6]Out[59]: total_bill tip smoker day time size tip_pct016.991.01 No Sun Dinner 20.059447110.341.66 No Sun Dinner 30.160542221.013.50 No Sun Dinner 30.166587323.683.31 No Sun Dinner 20.139780424.593.61 No Sun Dinner 40.146808525.294.71 No Sun Dinner 40.186240
In [60]: grouped = tips.groupby(['day', 'smoker'])
注意,对于表10-1中的那些描述统计,可以将函数名以字符串的形式传入:
In [61]: grouped_pct = grouped['tip_pct']In [62]: grouped_pct.agg('mean')Out[62]:day smokerFri No 0.151650 Yes 0.174783Sat No 0.158048 Yes 0.147906Sun No 0.160113 Yes 0.187250Thur No 0.160298 Yes 0.163863Name: tip_pct, dtype: float64
如果传入一组函数或函数名,得到的DataFrame的列就会以相应的函数命名:
In [63]: grouped_pct.agg(['mean', 'std', peak_to_peak])Out[63]: mean std peak_to_peakday smoker Fri No 0.1516500.0281230.067349 Yes 0.1747830.0512930.159925Sat No 0.1580480.0397670.235193 Yes 0.1479060.0613750.290095Sun No 0.1601130.0423470.193226 Yes 0.1872500.1541340.644685Thur No 0.1602980.0387740.193350 Yes 0.1638630.0393890.151240
In [65]: functions = ['count','mean','max']In [66]: result = grouped['tip_pct','total_bill'].agg(functions)In [67]: resultOut[67]: tip_pct total_bill count mean max count mean maxday smoker Fri No 40.1516500.187735418.42000022.75 Yes 150.1747830.2634801516.81333340.17Sat No 450.1580480.2919904519.66177848.33 Yes 420.1479060.3257334221.27666750.81Sun No 570.1601130.2526725720.50666748.17 Yes 190.1872500.7103451924.12000045.35Thur No 450.1602980.2663124517.11311141.19 Yes 170.1638630.2412551719.19058843.11
In [68]: result['tip_pct']Out[68]: count mean maxday smoker Fri No 40.1516500.187735 Yes 150.1747830.263480Sat No 450.1580480.291990 Yes 420.1479060.325733Sun No 570.1601130.252672 Yes 190.1872500.710345Thur No 450.1602980.266312 Yes 170.1638630.241255
跟前面一样,这里也可以传入带有自定义名称的一组元组:
In [69]: ftuples = [('Durchschnitt','mean'),('Abweichung', np.var)]In [70]: grouped['tip_pct','total_bill'].agg(ftuples)Out[70]: tip_pct total_bill Durchschnitt Abweichung Durchschnitt Abweichungday smoker Fri No 0.1516500.00079118.42000025.596333 Yes 0.1747830.00263116.81333382.562438Sat No 0.1580480.00158119.66177879.908965 Yes 0.1479060.00376721.276667101.387535Sun No 0.1601130.00179320.50666766.099980 Yes 0.1872500.02375724.120000109.046044Thur No 0.1602980.00150317.11311159.625081 Yes 0.1638630.00155119.19058869.808518
In [73]: tips.groupby(['day', 'smoker'], as_index=False).mean()Out[73]: day smoker total_bill tip size tip_pct0 Fri No 18.4200002.8125002.2500000.1516501 Fri Yes 16.8133332.7140002.0666670.1747832 Sat No 19.6617783.1028892.5555560.1580483 Sat Yes 21.2766672.8754762.4761900.1479064 Sun No 20.5066673.1678952.9298250.1601135 Sun Yes 24.1200003.5168422.5789470.1872506 Thur No 17.1131112.6737782.4888890.1602987 Thur Yes 19.1905883.0300002.3529410.163863
In [74]:deftop(df,n=5,column='tip_pct'): ....:return df.sort_values(by=column)[-n:]In [75]:top(tips, n=6)Out[75]: total_bill tip smoker day time size tip_pct10914.314.00 Yes Sat Dinner 20.27952518323.176.50 Yes Sun Dinner 40.28053523211.613.39 No Sat Dinner 20.291990673.071.00 Yes Sat Dinner 10.3257331789.604.00 Yes Sun Dinner 20.4166671727.255.15 Yes Sun Dinner 20.710345
现在,如果对smoker分组并用该函数调用apply,就会得到:
In [76]: tips.groupby('smoker').apply(top)Out[76]: total_bill tip smoker day time size tip_pctsmoker No 8824.715.85 No Thur Lunch 20.23674618520.695.00 No Sun Dinner 50.2416635110.292.60 No Sun Dinner 20.2526721497.512.00 No Thur Lunch 20.26631223211.613.39 No Sat Dinner 20.291990Yes 10914.314.00 Yes Sat Dinner 20.27952518323.176.50 Yes Sun Dinner 40.280535673.071.00 Yes Sat Dinner 10.3257331789.604.00 Yes Sun Dinner 20.4166671727.255.15 Yes Sun Dinner 20.710345
In [77]: tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')Out[77]: total_bill tip smoker day time size tip_pctsmoker day No Fri 9422.753.25 No Fri Dinner 20.142857 Sat 21248.339.00 No Sat Dinner 40.186220 Sun 15648.175.00 No Sun Dinner 60.103799 Thur 14241.195.00 No Thur Lunch 50.121389Yes Fri 9540.174.73 Yes Fri Dinner 40.117750 Sat 17050.8110.00 Yes Sat Dinner 30.196812 Sun 18245.353.50 Yes Sun Dinner 30.077178 Thur 19743.115.00 Yes Thur Lunch 40.115982
In [78]: result = tips.groupby('smoker')['tip_pct'].describe()In [79]: resultOut[79]: count mean std min25%50%75%\smoker No 151.00.1593280.0399100.0567970.1369060.1556250.185014Yes 93.00.1631960.0851190.0356380.1067710.1538460.195059maxsmokerNo 0.291990Yes 0.710345In [80]: result.unstack('smoker')Out[80]: smokercount No 151.000000 Yes 93.000000mean No 0.159328 Yes 0.163196std No 0.039910 Yes 0.085119min No 0.056797 Yes 0.03563825% No 0.136906 Yes 0.10677150% No 0.155625 Yes 0.15384675% No 0.185014 Yes 0.195059max No 0.291990 Yes 0.710345dtype: float64
In [81]: tips.groupby('smoker', group_keys=False).apply(top)Out[81]: total_bill tip smoker day time size tip_pct8824.715.85 No Thur Lunch 20.23674618520.695.00 No Sun Dinner 50.2416635110.292.60 No Sun Dinner 20.2526721497.512.00 No Thur Lunch 20.26631223211.613.39 No Sat Dinner 20.29199010914.314.00 Yes Sat Dinner 20.27952518323.176.50 Yes Sun Dinner 40.280535673.071.00 Yes Sat Dinner 10.3257331789.604.00 Yes Sun Dinner 20.4166671727.255.15 Yes Sun Dinner 20.710345
In [114]: df = pd.DataFrame({'category': ['a', 'a', 'a', 'a', .....: 'b', 'b', 'b', 'b'], .....: 'data': np.random.randn(8), .....: 'weights': np.random.rand(8)})In [115]: dfOut[115]: category data weights0 a 1.5615870.9575151 a 1.2199840.3472672 a -0.4822390.5813623 a 0.3156670.2170914 b -0.0478520.8944065 b -0.4541450.9185646 b -0.5567740.2778257 b 0.2533210.955905
In [124]: get_year =lambdax: x.yearIn [125]: by_year = rets.groupby(get_year)In [126]: by_year.apply(spx_corr)Out[126]: AAPL MSFT XOM SPX20030.5411240.7451740.6612651.020040.3742830.5885310.5577421.020050.4675400.5623740.6310101.020060.4282670.4061260.5185141.020070.5081180.6587700.7862641.020080.6814340.8046260.8283031.020090.7071030.6549020.7979211.020100.7101050.7301180.8390571.020110.6919310.8009960.8599751.0
当然,你还可以计算列与列之间的相关系数。这里,我们计算Apple和Microsoft的年相关系数:
In [127]: by_year.apply(lambdag: g['AAPL'].corr(g['MSFT']))Out[127]:20030.48086820040.25902420050.30009320060.16173520070.41773820080.61190120090.43273820100.57194620110.581987dtype: float64
示例:组级别的线性回归
顺着上一个例子继续,你可以用groupby执行更为复杂的分组统计分析,只要函数返回的是pandas对象或标量值即可。例如,我可以定义下面这个regress函数(利用statsmodels计量经济学库)对各数据块执行普通最小二乘法(Ordinary Least Squares,OLS)回归:
import statsmodels.api as smdefregress(data,yvar,xvars): Y = data[yvar] X = data[xvars] X['intercept']=1. result = sm.OLS(Y, X).fit()return result.params
现在,为了按年计算AAPL对SPX收益率的线性回归,执行:
In [129]: by_year.apply(regress, 'AAPL', ['SPX'])Out[129]: SPX intercept20031.1954060.00071020041.3634630.00420120051.7664150.00324620061.6454960.00008020071.1987610.00343820080.968016-0.00111020090.8791030.00295420101.0526080.00126120110.8066050.001514
In [130]: tips.pivot_table(index=['day', 'smoker'])Out[130]: size tip tip_pct total_billday smoker Fri No 2.2500002.8125000.15165018.420000 Yes 2.0666672.7140000.17478316.813333Sat No 2.5555563.1028890.15804819.661778 Yes 2.4761902.8754760.14790621.276667Sun No 2.9298253.1678950.16011320.506667 Yes 2.5789473.5168420.18725024.120000Thur No 2.4888892.6737780.16029817.113111 Yes 2.3529413.0300000.16386319.190588
In [131]: tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], .....: columns='smoker')Out[131]: size tip_pct smoker No Yes No Yestime day Dinner Fri 2.0000002.2222220.1396220.165347 Sat 2.5555562.4761900.1580480.147906 Sun 2.9298252.5789470.1601130.187250 Thur 2.000000 NaN 0.159744 NaNLunch Fri 3.0000001.8333330.1877350.188937 Thur 2.5000002.3529410.1603110.163863
In [132]: tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], .....: columns='smoker', margins=True)Out[132]: size tip_pct smoker No Yes All No Yes Alltime day Dinner Fri 2.0000002.2222222.1666670.1396220.1653470.158916 Sat 2.5555562.4761902.5172410.1580480.1479060.153152 Sun 2.9298252.5789472.8421050.1601130.1872500.166897 Thur 2.000000 NaN 2.0000000.159744 NaN 0.159744Lunch Fri 3.0000001.8333332.0000000.1877350.1889370.188765 Thur 2.5000002.3529412.4590160.1603110.1638630.161301All 2.6688742.4086022.5696720.1593280.1631960.160803
In [133]: tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', .....: aggfunc=len, margins=True)Out[133]:day Fri Sat Sun Thur Alltime smoker Dinner No 3.045.057.01.0106.0 Yes 9.042.019.0 NaN 70.0Lunch No 1.0 NaN NaN 44.045.0 Yes 6.0 NaN NaN 17.023.0All 19.087.076.062.0244.0
如果存在空的组合(也就是NA),你可能会希望设置一个fill_value:
In [134]: tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], .....: columns='day', aggfunc='mean', fill_value=0)Out[134]:day Fri Sat Sun Thurtime size smoker Dinner 1 No 0.0000000.1379310.0000000.000000 Yes 0.0000000.3257330.0000000.0000002 No 0.1396220.1627050.1688590.159744 Yes 0.1712970.1486680.2078930.0000003 No 0.0000000.1546610.1526630.000000 Yes 0.0000000.1449950.1526600.0000004 No 0.0000000.1500960.1481430.000000 Yes 0.1177500.1245150.1933700.0000005 No 0.0000000.0000000.2069280.000000Yes 0.0000000.1065720.0656600.000000... ... ... ... ...Lunch 1 No 0.0000000.0000000.0000000.181728 Yes 0.2237760.0000000.0000000.0000002 No 0.0000000.0000000.0000000.166005 Yes 0.1819690.0000000.0000000.1588433 No 0.1877350.0000000.0000000.084246 Yes 0.0000000.0000000.0000000.2049524 No 0.0000000.0000000.0000000.138919 Yes 0.0000000.0000000.0000000.1554105 No 0.0000000.0000000.0000000.1213896 No 0.0000000.0000000.0000000.173706[21 rows x 4 columns]
In [138]: dataOut[138]: Sample Nationality Handedness01 USA Right-handed12 Japan Left-handed23 USA Right-handed34 Japan Right-handed45 Japan Left-handed56 Japan Right-handed67 USA Right-handed78 USA Left-handed89 Japan Right-handed910 USA Right-handed
In [140]: pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)Out[140]:smoker No Yes Alltime day Dinner Fri 3912 Sat 454287 Sun 571976 Thur 101Lunch Fri 167 Thur 441761All 15193244