In [19]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                      [NA, NA, NA], [NA, 6.5, 3.]])

In [20]: cleaned = data.dropna()

In [21]: data
Out[21]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [22]: cleaned
Out[22]: 
     0    1    2
0  1.0  6.5  3.0
Passing how='all' will only drop rows that are all NA:
In [23]: data.dropna(how='all')
Out[23]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
To drop columns in the same way, pass axis=1:
In [24]: data[4] = NA

In [25]: data
Out[25]: 
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [26]: data.dropna(axis=1, how='all')
Out[26]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
In [27]: df = pd.DataFrame(np.random.randn(7, 3))

In [28]: df.iloc[:4, 1] = NA

In [29]: df.iloc[:2, 2] = NA

In [30]: df
Out[30]: 
          0         1         2
0 -0.204708       NaN       NaN
1 -0.555730       NaN       NaN
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In [31]: df.dropna()
Out[31]: 
          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In [32]: df.dropna(thresh=2)
Out[32]: 
          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
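The thresh argument keeps only rows containing at least that many non-NA values; rows with fewer observations are dropped. A minimal, self-contained sketch of the semantics:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan],
                   [2.0, 5.0, np.nan],
                   [3.0, 6.0, 9.0]])

# thresh=2 keeps rows with at least two non-NA values,
# so only the first row (a single valid value) is dropped
kept = df.dropna(thresh=2)
print(kept.index.tolist())  # [1, 2]
```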
In [33]: df.fillna(0)
Out[33]: 
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
Calling fillna with a dict, you can use a different fill value for each column:
In [34]: df.fillna({1: 0.5, 2: 0})
Out[34]: 
          0         1         2
0 -0.204708  0.500000  0.000000
1 -0.555730  0.500000  0.000000
2  0.092908  0.500000  0.769023
3  1.246435  0.500000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
fillna returns a new object by default, but you can modify the existing object in place:
In [35]: _ = df.fillna(0, inplace=True)

In [36]: df
Out[36]: 
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
The same interpolation methods available for reindexing can be used with fillna:
In [37]: df = pd.DataFrame(np.random.randn(6, 3))

In [38]: df.iloc[2:, 1] = NA

In [39]: df.iloc[4:, 2] = NA

In [40]: df
Out[40]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN

In [41]: df.fillna(method='ffill')
Out[41]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761  0.124121 -2.370232
5 -1.265934  0.124121 -2.370232

In [42]: df.fillna(method='ffill', limit=2)
Out[42]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232
In [43]: data = pd.Series([1., NA, 3.5, NA, 7])

In [44]: data.fillna(data.mean())
Out[44]: 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
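The same idea extends to a DataFrame. Because fillna aligns a passed Series on the column labels, passing df.mean() fills each column's holes with that column's own mean; a small sketch (the frame here is illustrative, not from the book):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, 10.0, 20.0]})

# df.mean() is a Series indexed by column name; fillna aligns on it,
# so column 'a' is filled with 2.0 and column 'b' with 15.0
filled = df.fillna(df.mean())
print(filled)
```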
See Table 7-2 for a reference on fillna arguments.
7.2 Data Transformation
So far in this chapter we have been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicates
Duplicate rows may appear in a DataFrame for any number of reasons. Here is an example:
In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                      'k2': [1, 1, 2, 3, 3, 4, 4]})

In [46]: data
Out[46]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
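Duplicates like these are handled by the duplicated and drop_duplicates methods: duplicated returns a boolean Series marking each row that repeats an earlier one, and drop_duplicates returns the frame with those rows removed. A self-contained sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# duplicated() marks a row True if an identical row appeared earlier;
# here only the last row repeats row 5
mask = data.duplicated()

# drop_duplicates() keeps the first occurrence of each distinct row
deduped = data.drop_duplicates()
print(deduped)
```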
In [92]: data = pd.DataFrame(np.random.randn(1000, 4))

In [93]: data.describe()
Out[93]: 
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.049091     0.026112    -0.002544    -0.051827
std       0.996947     1.007458     0.995232     0.998311
min      -3.645860    -3.184377    -3.745356    -3.428254
25%      -0.599807    -0.612162    -0.687373    -0.747478
50%       0.047101    -0.013609    -0.022158    -0.088274
75%       0.756646     0.695298     0.699046     0.623331
max       2.653656     3.525865     2.735527     3.366626
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
In [94]: col = data[2]

In [95]: col[np.abs(col) > 3]
Out[95]: 
41    -3.399312
136   -3.745356
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3, you can use the any method on a boolean DataFrame:
In [96]: data[(np.abs(data) > 3).any(1)]
Out[96]: 
            0         1         2         3
41   0.457246 -0.025907 -3.399312 -0.974657
60   1.951312  3.260383  0.963301  1.201206
136  0.508391 -0.196713 -3.745356 -1.520113
235 -0.242459 -3.056990  1.918403 -0.578828
258  0.682841  0.326045  0.425384 -3.428254
322  1.179227 -3.184377  1.369891 -1.074833
544 -3.548824  1.553205 -2.186301  1.277104
635 -0.578093  0.193299  1.397822  3.366626
782 -0.207434  3.525865  0.283070  0.544635
803 -3.645860  0.255475 -0.549574 -1.907459
Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3:
In [97]: data[np.abs(data) > 3] = np.sign(data) * 3

In [98]: data.describe()
Out[98]: 
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.050286     0.025567    -0.001399    -0.051765
std       0.992920     1.004214     0.991414     0.995761
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.599807    -0.612162    -0.687373    -0.747478
50%       0.047101    -0.013609    -0.022158    -0.088274
75%       0.756646     0.695298     0.699046     0.623331
max       2.653656     3.000000     2.735527     3.000000
The statement np.sign(data) produces 1 and -1 values based on whether the values in data are positive or negative:
In [99]: np.sign(data).head()
Out[99]: 
     0    1    2    3
0 -1.0  1.0 -1.0  1.0
1  1.0 -1.0  1.0 -1.0
2  1.0  1.0  1.0 -1.0
3 -1.0 -1.0  1.0 -1.0
4 -1.0  1.0 -1.0 -1.0
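As a hedged alternative sketch, the same capping can be done with the built-in clip method, which truncates values outside the given bounds in one call:

```python
import numpy as np
import pandas as pd

# Illustrative data (not the book's exact frame): wide-tailed normals
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(scale=2.0, size=(1000, 4)))

# clip(lower, upper) truncates everything outside [-3, 3],
# equivalent in effect to data[np.abs(data) > 3] = np.sign(data) * 3
capped = data.clip(-3, 3)
print(capped.describe())
```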
In [114]: mnames = ['movie_id', 'title', 'genres']
In [115]: movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
.....: header=None, names=mnames)
In [116]: movies[:10]
Out[116]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children's
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
In [123]: gen = movies.genres[0]
In [124]: gen.split('|')
Out[124]: ['Animation', "Children's", 'Comedy']
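The dummies frame used next is assumed to have been constructed earlier (outside this excerpt) as an all-zeros DataFrame with one indicator column per distinct genre. A self-contained sketch of that construction, using a tiny stand-in for the MovieLens frame:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the movies frame read from movies.dat above;
# only the genres column matters here
movies = pd.DataFrame({'genres': ["Animation|Children's|Comedy",
                                  'Adventure|Fantasy',
                                  'Comedy|Romance']})

# Collect every distinct genre across all rows, preserving first-seen order
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)

# Start from an all-zeros frame: one row per movie, one column per genre
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))),
                       columns=genres)
```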
In [125]: dummies.columns.get_indexer(gen.split('|'))
Out[125]: array([0, 1, 2])
Then, using .iloc, set values based on these indices:
In [126]: for i, gen in enumerate(movies.genres):
.....: indices = dummies.columns.get_indexer(gen.split('|'))
.....: dummies.iloc[i, indices] = 1
.....:
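The string-method examples that follow operate on a val variable defined earlier in the chapter, outside this excerpt. Judging from the outputs below, it is a short comma-separated string; the exact value here is inferred, not quoted:

```python
# Assumed definition (inferred from the outputs below) of the string
# the following examples operate on
val = 'a,b, guido'

assert 'guido' in val          # substring membership
assert val.index(',') == 1     # position of first comma
assert val.find(':') == -1     # find returns -1 when not found
```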
In [141]: 'guido' in val
Out[141]: True
In [142]: val.index(',')
Out[142]: 1
In [143]: val.find(':')
Out[143]: -1
Note the difference between find and index: index raises an exception if the string isn't found (rather than returning -1):
In [144]: val.index(':')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-144-280f8b2856ce> in <module>()
----> 1 val.index(':')
ValueError: substring not found
Relatedly, count returns the number of occurrences of a particular substring:
In [145]: val.count(',')
Out[145]: 2
replace substitutes occurrences of one pattern for another. It is also commonly used to delete patterns by passing an empty string:
In [146]: val.replace(',', '::')
Out[146]: 'a::b:: guido'
In [147]: val.replace(',', '')
Out[147]: 'ab guido'
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)
Using findall on the text produces a list of the email addresses:
In [155]: regex.findall(text)
Out[155]:
['dave@google.com',
'steve@gmail.com',
'rob@gmail.com',
'ryan@yahoo.com']
In [156]: m = regex.search(text)
In [157]: m
Out[157]: <_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
In [158]: text[m.start():m.end()]
Out[158]: 'dave@google.com'
regex.match returns None, as it will only match if the pattern occurs at the start of the string:
In [159]: print(regex.match(text))
None
Relatedly, sub will return a new string with occurrences of the pattern replaced by the given string:
In [160]: print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
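The substitution in the next example refers to capture groups \1, \2, and \3, so it assumes the pattern was recompiled (earlier in the book, outside this excerpt) with parentheses around the username, domain, and domain-suffix segments; a sketch of that grouped pattern:

```python
import re

# Same email pattern as before, but with capture groups around the
# username, domain, and domain suffix
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

# A match object's groups() returns a tuple of the captured segments
m = regex.match('wesm@bright.net')
print(m.groups())  # ('wesm', 'bright', 'net')
```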
In [166]: print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
In [167]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
.....: 'Rob': 'rob@gmail.com', 'Wes': np.nan}
In [168]: data = pd.Series(data)
In [169]: data
Out[169]:
Dave dave@google.com
Rob rob@gmail.com
Steve steve@gmail.com
Wes NaN
dtype: object
In [170]: data.isnull()
Out[170]:
Dave False
Rob False
Steve False
Wes True
dtype: bool
In [171]: data.str.contains('gmail')
Out[171]:
Dave False
Rob True
Steve True
Wes NaN
dtype: object
Regular expressions can be used, too, along with any re options like IGNORECASE:
In [172]: pattern
Out[172]: '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
In [173]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[173]:
Dave [(dave, google, com)]
Rob [(rob, gmail, com)]
Steve [(steve, gmail, com)]
Wes NaN
dtype: object
There are a couple of ways to do vectorized element retrieval: either use str.get or index into the str attribute:
In [174]: matches = data.str.match(pattern, flags=re.IGNORECASE)
In [175]: matches
Out[175]:
Dave True
Rob True
Steve True
Wes NaN
dtype: object
To access elements in the embedded lists, we can pass an index to either of these functions:
In [176]: matches.str.get(1)
Out[176]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64
In [177]: matches.str[0]
Out[177]:
Dave NaN
Rob NaN
Steve NaN
Wes NaN
dtype: float64
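As a hedged aside, a cleaner way to pull the capture groups out in vectorized form is str.extract, which returns a DataFrame with one column per group and NaN for rows that are missing or fail to match:

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# One column per capture group: username, domain, suffix
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
parts = data.str.extract(pattern, flags=re.IGNORECASE)
print(parts)
```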
You can similarly slice strings using this syntax:
In [178]: data.str[:5]
Out[178]:
Dave dave@
Rob rob@g
Steve steve
Wes NaN
dtype: object