In [20]: frame.index.names = ['key1','key2']In [21]: frame.columns.names = ['state','color']In [22]: frameOut[22]:state Ohio Coloradocolor Green Red Greenkey1 key2 a 10122345b 1678291011
注意:小心区分索引名state、color与行标签。
有了部分列索引,因此可以轻松选取列分组:
In [23]: frame['Ohio']Out[23]:color Green Redkey1 key2 a 101234b 1672910
In [25]: frame.sort_index(level=1)Out[25]:state Ohio Coloradocolor Green Red Greenkey1 key2 a 1012b 1678a 2345b 291011In [26]: frame.swaplevel(0, 1).sort_index(level=0)Out[26]:state Ohio Coloradocolor Green Red Greenkey2 key1 1 a 012 b 6782 a 345 b 91011
In [27]: frame.sum(level='key2')Out[27]:state Ohio Coloradocolor Green Red Greenkey2 168102121416In [28]: frame.sum(level='color', axis=1)Out[28]:color Green Redkey1 key2 a 121284b 114722010
In [29]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), ....: 'c': ['one', 'one', 'one', 'two', 'two', ....: 'two', 'two'], ....: 'd': [0, 1, 2, 0, 1, 2, 3]})In [30]: frameOut[30]: a b c d007 one 0116 one 1225 one 2334 two 0443 two 1552 two 2661 two 3
In [54]: pd.merge(left, right, on='key1')Out[54]: key1 key2_x lval key2_y rval0 foo one 1 one 41 foo one 1 one 52 foo two 2 one 43 foo two 2 one 54 bar one 3 one 65 bar one 3 two 7In [55]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))Out[55]: key1 key2_left lval key2_right rval0 foo one 1 one 41 foo one 1 one 52 foo two 2 one 43 foo two 2 one 54 bar one 3 one 65 bar one 3 two 7
In [56]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], ....: 'value': range(6)})In [57]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])In [58]: left1Out[58]: key value0 a 01 b 12 a 23 a 34 b 45 c 5In [59]: right1Out[59]: group_vala 3.5b 7.0In [60]: pd.merge(left1, right1, left_on='key', right_index=True)Out[60]: key value group_val0 a 03.52 a 23.53 a 33.51 b 17.04 b 47.0
由于默认的merge方法是求取连接键的交集,因此你可以通过外连接的方式得到它们的并集:
In [61]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')Out[61]: key value group_val0 a 03.52 a 23.53 a 33.51 b 17.04 b 47.05 c 5 NaN
In [73]: left2.join(right2, how='outer')Out[73]: Ohio Nevada Missouri Alabamaa 1.02.0 NaN NaNb NaN NaN 7.08.0c 3.04.09.010.0d NaN NaN 11.012.0e 5.06.013.014.0
In [75]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]], ....: index=['a', 'c', 'e', 'f'], ....: columns=['New York','Oregon'])In [76]: anotherOut[76]: New York Oregona 7.08.0c 9.010.0e 11.012.0f 16.017.0In [77]: left2.join([right2, another])Out[77]: Ohio Nevada Missouri Alabama New York Oregona 1.02.0 NaN NaN 7.08.0c 3.04.09.010.09.010.0e 5.06.013.014.011.012.0In [78]: left2.join([right2, another], how='outer')Out[78]: Ohio Nevada Missouri Alabama New York Oregona 1.02.0 NaN NaN 7.08.0b NaN NaN 7.08.0 NaN NaNc 3.04.09.010.09.010.0d NaN NaN 11.012.0 NaN NaNe 5.06.013.014.011.012.0f NaN NaN NaN NaN 16.017.0
In [92]: result = pd.concat([s1, s1, s3], keys=['one','two', 'three'])In [93]: resultOut[93]:one a 0 b 1two a 0 b 1three f 5 g 6dtype: int64In [94]: result.unstack()Out[94]: a b f gone 0.01.0 NaN NaNtwo 0.01.0 NaN NaNthree NaN NaN 5.06.0
如果沿着axis=1对Series进行合并,则keys就会成为DataFrame的列头:
In [95]: pd.concat([s1, s2, s3], axis=1, keys=['one','two', 'three'])Out[95]: one two threea 0.0 NaN NaNb 1.0 NaN NaNc NaN 2.0 NaNd NaN 3.0 NaNe NaN 4.0 NaNf NaN NaN 5.0g NaN NaN 6.0
同样的逻辑也适用于DataFrame对象:
In [96]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], ....: columns=['one', 'two'])In [97]: df2 = pd.DataFrame(5+ np.arange(4).reshape(2, 2), index=['a', 'c'], ....: columns=['three', 'four'])In [98]: df1Out[98]: one twoa 01b 23c 45In [99]: df2Out[99]: three foura 56c 78In [100]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])Out[100]: level1 level2 one two three foura 015.06.0b 23 NaN NaNc 457.08.0
如果传入的不是列表而是一个字典,则字典的键就会被当做keys选项的值:
In [101]: pd.concat({'level1': df1, 'level2': df2}, axis=1)Out[101]: level1 level2 one two three foura 015.06.0b 23 NaN NaNc 457.08.0
In [102]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], .....: names=['upper', 'lower'])Out[102]:upper level1 level2 lower one two three foura 015.06.0b 23 NaN NaNc 457.08.0
最后一个关于DataFrame的问题是,DataFrame的行索引不包含任何相关数据:
In [103]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])In [104]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])In [105]: df1Out[105]: a b c d01.2464351.007189-1.2962210.27499210.2289131.3529170.886429-2.0016372-0.3718431.669025-0.438570-0.539741In [106]: df2Out[106]: b d a00.4769853.248944-1.0212281-0.5770870.1241210.302614
在这种情况下,传入ignore_index=True即可:
In [107]: pd.concat([df1, df2], ignore_index=True)Out[107]: a b c d01.2464351.007189-1.2962210.27499210.2289131.3529170.886429-2.0016372-0.3718431.669025-0.438570-0.5397413-1.0212280.476985 NaN 3.24894440.302614-0.577087 NaN 0.124121
In [115]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan], .....: 'b': [np.nan, 2., np.nan, 6.], .....: 'c': range(2, 18, 4)})In [116]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.], .....: 'b': [np.nan, 3., 4., 6., 8.]})In [117]: df1Out[117]: a b c01.0 NaN 21 NaN 2.0625.0 NaN 103 NaN 6.014In [118]: df2Out[118]: a b05.0 NaN14.03.02 NaN 4.033.06.047.08.0In [119]: df1.combine_first(df2)Out[119]: a b c01.0 NaN 2.014.02.06.025.04.010.033.06.014.047.08.0 NaN
In [125]: result.unstack(0)Out[125]:state Ohio Coloradonumber one 03two 14three 25In [126]: result.unstack('state')Out[126]:state Ohio Coloradonumber one 03two 14three 25
如果不是所有的级别值都能在各分组中找到的话,则unstack操作可能会引入缺失数据:
In [127]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])In [128]: s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])In [129]: data2 = pd.concat([s1, s2], keys=['one', 'two'])In [130]: data2Out[130]:one a 0 b 1 c 2 d 3two c 4 d 5 e 6dtype: int64In [131]: data2.unstack()Out[131]: a b c d eone 0.01.02.03.0 NaNtwo NaN NaN 4.05.06.0
stack默认会滤除缺失数据,因此该运算是可逆的:
In [132]: data2.unstack()Out[132]: a b c d eone 0.01.02.03.0 NaNtwo NaN NaN 4.05.06.0In [133]: data2.unstack().stack()Out[133]:one a 0.0 b 1.0 c 2.0 d 3.0two c 4.0 d 5.0 e 6.0dtype: float64In [134]: data2.unstack().stack(dropna=False)Out[134]:one a 0.0 b 1.0 c 2.0 d 3.0 e NaNtwo a NaN b NaN c 4.0 d 5.0 e 6.0dtype: float64
在对DataFrame进行unstack操作时,作为旋转轴的级别将会成为结果中的最低级别:
In [135]: df = pd.DataFrame({'left': result, 'right': result +5}, .....: columns=pd.Index(['left', 'right'], name='side'))In [136]: dfOut[136]:side left rightstate number Ohio one 05 two 16 three 27Colorado one 38 two 49 three 510In [137]: df.unstack('state')Out[137]:side left rightstate Ohio Colorado Ohio Coloradonumber one 0358two 1469three 25710
当调用stack,我们可以指明轴的名字:
In [138]: df.unstack('state').stack('side')Out[138]:state Colorado Ohionumber side one left 30 right 85two left 41 right 96three left 52 right 107
In [159]: melted = pd.melt(df, ['key'])In [160]: meltedOut[160]: key variable value0 foo A 11 bar A 22 baz A 33 foo B 44 bar B 55 baz B 66 foo C 77 bar C 88 baz C 9
使用pivot,可以重塑回原来的样子:
In [161]: reshaped = melted.pivot('key', 'variable', 'value')In [162]: reshapedOut[162]:variable A B Ckey bar 258baz 369foo 147
In [163]: reshaped.reset_index()Out[163]:variable key A B C0 bar 2581 baz 3692 foo 147
你还可以指定列的子集,作为值的列:
In [164]: pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])Out[164]: key variable value0 foo A 11 bar A 22 baz A 33 foo B 44 bar B 55 baz B 6
pandas.melt也可以不用分组指标:
In [165]: pd.melt(df, value_vars=['A', 'B', 'C'])Out[165]: variable value0 A 11 A 22 A 33 B 44 B 55 B 66 C 77 C 88 C 9In [166]: pd.melt(df, value_vars=['key', 'A', 'B'])Out[166]: variable value0 key foo1 key bar2 key baz3 A 14 A 25 A 36 B 47 B 58 B 6