In [20]: frame.index.names = ['key1','key2']In [21]: frame.columns.names = ['state','color']In [22]: frameOut[22]:state Ohio Coloradocolor Green Red Greenkey1 key2 a 10122345b 1678291011
注意:小心区分索引名state、color与行标签。
有了部分列索引,因此可以轻松选取列分组:
In [23]: frame['Ohio']Out[23]:color Green Redkey1 key2 a 101234b 1672910
In [25]: frame.sort_index(level=1)Out[25]:state Ohio Coloradocolor Green Red Greenkey1 key2 a 1012b 1678a 2345b 291011In [26]: frame.swaplevel(0, 1).sort_index(level=0)Out[26]:state Ohio Coloradocolor Green Red Greenkey2 key1 1 a 012 b 6782 a 345 b 91011
In [27]: frame.sum(level='key2')Out[27]:state Ohio Coloradocolor Green Red Greenkey2 168102121416In [28]: frame.sum(level='color', axis=1)Out[28]:color Green Redkey1 key2 a 121284b 114722010
In [29]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), ....: 'c': ['one', 'one', 'one', 'two', 'two', ....: 'two', 'two'], ....: 'd': [0, 1, 2, 0, 1, 2, 3]})In [30]: frameOut[30]: a b c d007 one 0116 one 1225 one 2334 two 0443 two 1552 two 2661 two 3
In [54]: pd.merge(left, right, on='key1')Out[54]: key1 key2_x lval key2_y rval0 foo one 1 one 41 foo one 1 one 52 foo two 2 one 43 foo two 2 one 54 bar one 3 one 65 bar one 3 two 7In [55]: pd.merge(left, right, on='key1', suffixes=('_left', '_right'))Out[55]: key1 key2_left lval key2_right rval0 foo one 1 one 41 foo one 1 one 52 foo two 2 one 43 foo two 2 one 54 bar one 3 one 65 bar one 3 two 7
In [56]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], ....: 'value': range(6)})In [57]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])In [58]: left1Out[58]: key value0 a 01 b 12 a 23 a 34 b 45 c 5In [59]: right1Out[59]: group_vala 3.5b 7.0In [60]: pd.merge(left1, right1, left_on='key', right_index=True)Out[60]: key value group_val0 a 03.52 a 23.53 a 33.51 b 17.04 b 47.0
由于默认的merge方法是求取连接键的交集,因此你可以通过外连接的方式得到它们的并集:
In [61]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')Out[61]: key value group_val0 a 03.52 a 23.53 a 33.51 b 17.04 b 47.05 c 5 NaN
In [73]: left2.join(right2, how='outer')Out[73]: Ohio Nevada Missouri Alabamaa 1.02.0 NaN NaNb NaN NaN 7.08.0c 3.04.09.010.0d NaN NaN 11.012.0e 5.06.013.014.0
In [75]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]], ....: index=['a', 'c', 'e', 'f'], ....: columns=['New York','Oregon'])In [76]: anotherOut[76]: New York Oregona 7.08.0c 9.010.0e 11.012.0f 16.017.0In [77]: left2.join([right2, another])Out[77]: Ohio Nevada Missouri Alabama New York Oregona 1.02.0 NaN NaN 7.08.0c 3.04.09.010.09.010.0e 5.06.013.014.011.012.0In [78]: left2.join([right2, another], how='outer')Out[78]: Ohio Nevada Missouri Alabama New York Oregona 1.02.0 NaN NaN 7.08.0b NaN NaN 7.08.0 NaN NaNc 3.04.09.010.09.010.0d NaN NaN 11.012.0 NaN NaNe 5.06.013.014.011.012.0f NaN NaN NaN NaN 16.017.0
In [92]: result = pd.concat([s1, s1, s3], keys=['one','two', 'three'])In [93]: resultOut[93]:one a 0 b 1two a 0 b 1three f 5 g 6dtype: int64In [94]: result.unstack()Out[94]: a b f gone 0.01.0 NaN NaNtwo 0.01.0 NaN NaNthree NaN NaN 5.06.0
如果沿着axis=1对Series进行合并,则keys就会成为DataFrame的列头:
In [95]: pd.concat([s1, s2, s3], axis=1, keys=['one','two', 'three'])Out[95]: one two threea 0.0 NaN NaNb 1.0 NaN NaNc NaN 2.0 NaNd NaN 3.0 NaNe NaN 4.0 NaNf NaN NaN 5.0g NaN NaN 6.0
同样的逻辑也适用于DataFrame对象:
In [96]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], ....: columns=['one', 'two'])In [97]: df2 = pd.DataFrame(5+ np.arange(4).reshape(2, 2), index=['a', 'c'], ....: columns=['three', 'four'])In [98]: df1Out[98]: one twoa 01b 23c 45In [99]: df2Out[99]: three foura 56c 78In [100]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])Out[100]: level1 level2 one two three foura 015.06.0b 23 NaN NaNc 457.08.0
如果传入的不是列表而是一个字典,则字典的键就会被当做keys选项的值:
In [101]: pd.concat({'level1': df1, 'level2': df2}, axis=1)Out[101]: level1 level2 one two three foura 015.06.0b 23 NaN NaNc 457.08.0
In [102]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], .....: names=['upper', 'lower'])Out[102]:upper level1 level2 lower one two three foura 015.06.0b 23 NaN NaNc 457.08.0
最后一个关于DataFrame的问题是,DataFrame的行索引不包含任何相关数据:
In [103]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])In [104]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])In [105]: df1Out[105]: a b c d01.2464351.007189-1.2962210.27499210.2289131.3529170.886429-2.0016372-0.3718431.669025-0.438570-0.539741In [106]: df2Out[106]: b d a00.4769853.248944-1.0212281-0.5770870.1241210.302614
在这种情况下,传入ignore_index=True即可:
In [107]: pd.concat([df1, df2], ignore_index=True)Out[107]: a b c d01.2464351.007189-1.2962210.27499210.2289131.3529170.886429-2.0016372-0.3718431.669025-0.438570-0.5397413-1.0212280.476985 NaN 3.24894440.302614-0.577087 NaN 0.124121
In [120]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
.....: index=pd.Index(['Ohio','Colorado'], name='state'),
.....: columns=pd.Index(['one', 'two', 'three'],
.....: name='number'))
In [121]: data
Out[121]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
对该数据使用stack方法即可将列转换为行,得到一个Series:
In [122]: result = data.stack()
In [123]: result
Out[123]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int64
对于一个层次化索引的Series,你可以用unstack将其重排为一个DataFrame:
In [124]: result.unstack()
Out[124]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
In [125]: result.unstack(0)
Out[125]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
In [126]: result.unstack('state')
Out[126]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
如果不是所有的级别值都能在各分组中找到的话,则unstack操作可能会引入缺失数据:
In [127]: s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
In [128]: s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
In [129]: data2 = pd.concat([s1, s2], keys=['one', 'two'])
In [130]: data2
Out[130]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
In [131]: data2.unstack()
Out[131]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
stack默认会滤除缺失数据,因此该运算是可逆的:
In [132]: data2.unstack()
Out[132]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
In [133]: data2.unstack().stack()
Out[133]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
In [134]: data2.unstack().stack(dropna=False)
Out[134]:
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
在对DataFrame进行unstack操作时,作为旋转轴的级别将会成为结果中的最低级别:
In [135]: df = pd.DataFrame({'left': result, 'right': result + 5},
.....: columns=pd.Index(['left', 'right'], name='side'))
In [136]: df
Out[136]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
In [137]: df.unstack('state')
Out[137]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
当调用stack,我们可以指明轴的名字:
In [138]: df.unstack('state').stack('side')
Out[138]:
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7
In [159]: melted = pd.melt(df, ['key'])
In [160]: melted
Out[160]:
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
6 foo C 7
7 bar C 8
8 baz C 9
使用pivot,可以重塑回原来的样子:
In [161]: reshaped = melted.pivot('key', 'variable', 'value')
In [162]: reshaped
Out[162]:
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7
In [163]: reshaped.reset_index()
Out[163]:
variable key A B C
0 bar 2 5 8
1 baz 3 6 9
2 foo 1 4 7
你还可以指定列的子集,作为值的列:
In [164]: pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
Out[164]:
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
pandas.melt也可以不用分组指标:
In [165]: pd.melt(df, value_vars=['A', 'B', 'C'])
Out[165]:
variable value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 C 8
8 C 9
In [166]: pd.melt(df, value_vars=['key', 'A', 'B'])
Out[166]:
variable value
0 key foo
1 key bar
2 key baz
3 A 1
4 A 2
5 A 3
6 B 4
7 B 5
8 B 6