In [9]: df = pd.read_csv('examples/ex1.csv')In [10]: dfOut[10]: a b c d message01234 hello15678 world29101112 foo
我们还可以使用read_table,并指定分隔符:
In [11]: pd.read_table('examples/ex1.csv', sep=',')Out[11]: a b c d message01234 hello15678 world29101112 foo
并不是所有文件都有标题行。看看下面这个文件:
In [12]: !cat examples/ex2.csv1,2,3,4,hello5,6,7,8,world9,10,11,12,foo
读入该文件的办法有两个。你可以让pandas为其分配默认的列名,也可以自己定义列名:
In [13]: pd.read_csv('examples/ex2.csv', header=None)Out[13]:0123401234 hello15678 world29101112 fooIn [14]: pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])Out[14]: a b c d message01234 hello15678 world29101112 foo
In [15]: names = ['a','b','c','d','message']In [16]: pd.read_csv('examples/ex2.csv', names=names, index_col='message')Out[16]: a b c dmessage hello 1234world 5678foo 9101112
如果希望将多个列做成一个层次化索引,只需传入由列编号或列名组成的列表即可:
In [17]: !cat examples/csv_mindex.csvkey1,key2,value1,value2one,a,1,2one,b,3,4one,c,5,6one,d,7,8two,a,9,10two,b,11,12two,c,13,14two,d,15,16In [18]: parsed = pd.read_csv('examples/csv_mindex.csv', ....: index_col=['key1', 'key2'])In [19]: parsedOut[19]: value1 value2key1 key2 one a 12 b 34 c 56 d 78two a 910 b 1112 c 1314 d 1516
In [21]: result = pd.read_table('examples/ex3.txt', sep='\s+')In [22]: resultOut[22]: A B Caaa -0.264438-1.026059-0.619500bbb 0.9272720.302904-0.032399ccc -0.264273-0.386314-0.217601ddd -0.871858-0.3483821.100491
In [23]: !cat examples/ex4.csv# hey!a,b,c,d,message# just wanted to make things more difficult for you# who reads CSV files with computers, anyway?1,2,3,4,hello5,6,7,8,world9,10,11,12,fooIn [24]: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])Out[24]: a b c d message01234 hello15678 world29101112 foo
In [25]: !cat examples/ex5.csvsomething,a,b,c,d,messageone,1,2,3,4,NAtwo,5,6,,8,worldthree,9,10,11,12,fooIn [26]: result = pd.read_csv('examples/ex5.csv')In [27]: resultOut[27]: something a b c d message0 one 123.04 NaN1 two 56 NaN 8 world2 three 91011.012 fooIn [28]: pd.isnull(result)Out[28]: something a b c d message0FalseFalseFalseFalseFalseTrue1FalseFalseFalseTrueFalseFalse2FalseFalseFalseFalseFalseFalse
na_values可以用一个列表或集合的字符串表示缺失值:
In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])In [30]: resultOut[30]: something a b c d message0 one 123.04 NaN1 two 56 NaN 8 world2 three 91011.012 foo
字典的各列可以使用不同的NA标记值:
In [31]: sentinels ={'message': ['foo','NA'],'something': ['two']}In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)Out[32]:something a b c d message0 one 123.04 NaN1 NaN 56 NaN 8 world2 three 91011.012 NaN
In [34]: result = pd.read_csv('examples/ex6.csv')In [35]: resultOut[35]: one two three four key00.467976-0.038649-0.295344-1.824726 L1-0.3588931.4044530.704965-0.200638 B2-0.5018400.659254-0.421691-0.057688 G30.2048861.0741341.388361-0.982404 R40.354628-0.1331160.283763-0.837063 Q... ... ... ... ... ..99952.311896-0.417070-1.409599-0.515821 L9996-0.479893-0.6504190.745152-0.646038 E99970.5233310.7871120.4860661.093156 K9998-0.3625590.598894-1.8432010.887292 G9999-0.096376-1.012999-0.657431-0.5733150[10000 rows x 5 columns]If you want to only read a small
如果只想读取几行(避免读取整个文件),通过nrows进行指定即可:
In [36]: pd.read_csv('examples/ex6.csv', nrows=5)Out[36]: one two three four key00.467976-0.038649-0.295344-1.824726 L1-0.3588931.4044530.704965-0.200638 B2-0.5018400.659254-0.421691-0.057688 G30.2048861.0741341.388361-0.982404 R40.354628-0.1331160.283763-0.837063 Q
要逐块读取文件,可以指定chunksize(行数):
In [874]: chunker = pd.read_csv('ch06/ex6.csv', chunksize=1000)In [875]: chunkerOut[875]:<pandas.io.parsers.TextParser at 0x8398150>
In [41]: data = pd.read_csv('examples/ex5.csv')In [42]: dataOut[42]: something a b c d message0 one 123.04 NaN1 two 56 NaN 8 world2 three 91011.012 foo
利用DataFrame的to_csv方法,我们可以将数据写到一个以逗号分隔的文件中:
In [43]: data.to_csv('examples/out.csv')In [44]: !cat examples/out.csv,something,a,b,c,d,message0,one,1,2,3.0,4,1,two,5,6,,8,world2,three,9,10,11.0,12,foo
In [71]:print(data.to_json()){"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}In [72]:print(data.to_json(orient='records'))[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
In [73]: tables = pd.read_html('examples/fdic_failed_bank_list.html')In [74]:len(tables)Out[74]:1In [75]: failures = tables[0]In [76]: failures.head()Out[76]: Bank Name City ST CERT \0 Allied Bank Mulberry AR 911 The Woodbury Banking Company Woodbury GA 112972 First CornerStone Bank King of Prussia PA 353123 Trust Company Bank Memphis TN 99564 North Milwaukee State Bank Milwaukee WI 20364 Acquiring Institution Closing Date Updated Date 0 Today's Bank September 23, 2016 November 17, 2016 1 United Bank August 19,2016 November 17,20162 First-Citizens Bank & Trust Company May 6,2016 September 6,20163 The Bank of Fayette County April 29,2016 September 6,20164 First-Citizens Bank & Trust Company March 11,2016 June 16,2016
<INDICATOR> <INDICATOR_SEQ>373889</INDICATOR_SEQ> <PARENT_SEQ></PARENT_SEQ> <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME> <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME> <DESCRIPTION>Percent of the time that escalators are operational systemwide. The availability rate is based on physical observations performed the morning of regular business days only. This is a new indicator the agency began reporting in 2009.</DESCRIPTION> <PERIOD_YEAR>2011</PERIOD_YEAR> <PERIOD_MONTH>12</PERIOD_MONTH> <CATEGORY>Service Indicators</CATEGORY> <FREQUENCY>M</FREQUENCY> <DESIRED_CHANGE>U</DESIRED_CHANGE> <INDICATOR_UNIT>%</INDICATOR_UNIT> <DECIMAL_PLACES>1</DECIMAL_PLACES> <YTD_TARGET>97.00</YTD_TARGET> <YTD_ACTUAL></YTD_ACTUAL> <MONTHLY_TARGET>97.00</MONTHLY_TARGET> <MONTHLY_ACTUAL></MONTHLY_ACTUAL></INDICATOR>
data = []skip_fields = ['PARENT_SEQ','INDICATOR_SEQ','DESIRED_CHANGE','DECIMAL_PLACES']for elt in root.INDICATOR: el_data ={}for child in elt.getchildren():if child.tag in skip_fields:continue el_data[child.tag]= child.pyval data.append(el_data)
最后,将这组字典转换为一个DataFrame:
In [81]: perf = pd.DataFrame(data)In [82]: perf.head()Out[82]:Empty DataFrameColumns: []Index: []
In [87]: frame = pd.read_csv('examples/ex1.csv')In [88]: frameOut[88]: a b c d message01234 hello15678 world29101112 fooIn [89]: frame.to_pickle('examples/frame_pickle')
HDF5是一种存储大规模科学数组数据的非常好的文件格式。它可以被作为C标准库,带有许多语言的接口,如Java、Python和MATLAB等。HDF5中的HDF指的是层次型数据格式(hierarchical data format)。每个HDF5文件都含有一个文件系统式的节点结构,它使你能够存储多个数据集并支持元数据。与其他简单格式相比,HDF5支持多种压缩器的即时压缩,还能更高效地存储重复模式数据。对于那些非常大的无法直接放入内存的数据集,HDF5就是不错的选择,因为它可以高效地分块读写。
In [92]: frame = pd.DataFrame({'a': np.random.randn(100)})In [93]: store = pd.HDFStore('mydata.h5')In [94]: store['obj1']= frameIn [95]: store['obj1_col']= frame['a']In [96]: storeOut[96]:<class'pandas.io.pytables.HDFStore'>File path: mydata.h5/obj1 frame (shape->[100,1])/obj1_col series (shape->[100])/obj2 frame_table (typ->appendable,nrows->100,ncols->1,indexers->[index])/obj3 frame_table (typ->appendable,nrows->100,ncols->1,indexers->[index])
HDF5文件中的对象可以通过与字典一样的API进行获取:
In [97]: store['obj1']Out[97]: a0-0.20470810.4789432-0.5194393-0.55573041.965781.. ...950.795253960.11811097-0.748532980.584970990.152677[100 rows x 1 columns]
In [119]: issues = pd.DataFrame(data, columns=['number', 'title', .....: 'labels', 'state'])In [120]: issuesOut[120]: number title \017666 Period does notround down for frequencies les... 117665 DOC: improve docstring of function where 217664 COMPAT: skip 32-bit test on intrepr317662 implement Delegator class417654 BUG: Fix series rename called withstr alterin... .. ... ... 2517603 BUG: Correctly localize naive datetime strings... 2617599 core.dtypes.generic --> cython 2717596 Merge cdate_range functionality into bdate_range 2817587 Time Grouper bug fix when applied forlist gro... 2917583 BUG: fix tz-aware DatetimeIndex + TimedeltaInd... labels state 0 [] open1 [{'id':134699,'url':'https://api.github.com... open 2 [{'id':563047854,'url':'https://api.github.... open 3 [] open4 [{'id':76811,'url':'https://api.github.com/... open .. ... ... 25 [{'id':76811,'url':'https://api.github.com/... open 26 [{'id':49094459,'url':'https://api.github.c... open 27 [{'id':35818298,'url':'https://api.github.c... open 28 [{'id':233160,'url':'https://api.github.com... open 29 [{'id':76811,'url':'https://api.github.com/... open [30 rows x 4 columns]
In [121]:import sqlite3In [122]: query =""" .....: CREATE TABLE test .....: (a VARCHAR(20), b VARCHAR(20), .....: c REAL, d INTEGER .....: );"""In [123]: con = sqlite3.connect('mydata.sqlite')In [124]: con.execute(query)Out[124]:<sqlite3.Cursor at 0x7f6b12a50f10>In [125]: con.commit()
然后插入几行数据:
In [126]: data = [('Atlanta','Georgia',1.25,6), .....: ('Tallahassee','Florida',2.6,3), .....: ('Sacramento','California',1.7,5)]In [127]: stmt ="INSERT INTO test VALUES(?, ?, ?, ?)"In [128]: con.executemany(stmt, data)Out[128]:<sqlite3.Cursor at 0x7f6b15c66ce0>
In [133]: cursor.descriptionOut[133]:(('a',None,None,None,None,None,None), ('b',None,None,None,None,None,None), ('c',None,None,None,None,None,None), ('d',None,None,None,None,None,None))In [134]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description])Out[134]: a b c d0 Atlanta Georgia 1.2561 Tallahassee Florida 2.6032 Sacramento California 1.705
In [135]:import sqlalchemy as sqlaIn [136]: db = sqla.create_engine('sqlite:///mydata.sqlite')In [137]: pd.read_sql('select * from test', db)Out[137]: a b c d0 Atlanta Georgia 1.2561 Tallahassee Florida 2.6032 Sacramento California 1.705