Question
Given a large series of DataFrames with a small variety of dtypes, what is the optimal design for Pandas DataFrame persistence/serialization if I care about compression ratio first, decompression speed second, and initial compression speed third?
Background:
I have roughly 200k dataframes of shape [2900,8] that I need to store in logical blocks of ~50 data frames per file. Each data frame contains columns of type np.int8 and np.float64. Most data frames are good candidates for sparse types, but sparse is not supported in HDF 'table' format stores (not that it would even help - see the size below for a sparse gzipped pickle). Data is generated daily and currently adds up to over 20GB. While I'm not bound to HDF, I have yet to find a better solution that allows reads of individual dataframes within the persistent store, combined with top-quality compression. Again, I'm willing to sacrifice a little speed for better compression ratios, especially since I will need to be sending all of this over the wire.
There are a couple of other SO threads and other links that might be relevant for those who are in a similar position. However, most of what I've found doesn't treat minimizing storage size as the priority:
“Large data” work flows using pandas
HDF5 and SQLite. Concurrency, compression & I/O performance [closed]
Environment:
OSX 10.9.5
Pandas 0.14.1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 3.1.1
HDF5 version: 1.8.13
NumPy version: 1.8.1
Numexpr version: 2.4 (not using Intel's VML/MKL)
Zlib version: 1.2.5 (in Python interpreter)
LZO version: 2.06 (Aug 12 2011)
BZIP2 version: 1.0.6 (6-Sept-2010)
Blosc version: 1.3.5 (2014-03-22)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Cython version: 0.20.2
Python version: 2.7.8 (default, Jul 2 2014, 10:14:46)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)]
Platform: Darwin-13.4.0-x86_64-i386-64bit
Byte-ordering: little
Detected cores: 8
Default encoding: ascii
Default locale: (en_US, UTF-8)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Example:
import pandas as pd
import numpy as np
import random
import cPickle as pickle
import gzip
def generate_data():
    alldfs = {}
    n = 2800
    m = 8
    loops = 50
    idx = pd.date_range('1/1/1980', periods=n, freq='D')
    for x in xrange(loops):
        id = "id_%s" % x
        df = pd.DataFrame(np.random.randn(n, m) * 100, index=idx)
        # adjust data a bit..
        df.ix[:, 0] = 0
        df.ix[:, 1] = 0
        for y in xrange(100):
            i = random.randrange(n - 1)
            j = random.randrange(n - 1)
            df.ix[i, 0] = 1
            df.ix[j, 1] = 1
        df.ix[:, 0] = df.ix[:, 0].astype(np.int8)  # adjust datatype
        df.ix[:, 1] = df.ix[:, 1].astype(np.int8)
        alldfs[id] = df
    return alldfs

def store_all_hdf(x, format='table', complevel=9, complib='blosc'):
    fn = "test_%s_%s-%s.hdf" % (format, complib, complevel)
    hdfs = pd.HDFStore(fn, mode='w', format=format, complevel=complevel, complib=complib)
    for key in x.keys():
        df = x[key]
        hdfs.put(key, df, format=format, append=False)
    hdfs.close()

alldfs = generate_data()

for format in ['table', 'fixed']:
    for complib in ['blosc', 'zlib', 'bzip2', 'lzo', None]:
        store_all_hdf(alldfs, format=format, complib=complib, complevel=9)

# pickle, for comparison
with open('test_pickle.pkl', 'wb') as f:
    pickle.dump(alldfs, f)

with gzip.open('test_pickle_gzip.pklz', 'wb') as f:
    pickle.dump(alldfs, f)

with gzip.open('test_pickle_gzip_sparse.pklz', 'wb') as f:
    sparsedfs = {}
    for key in alldfs.keys():
        sdf = alldfs[key].to_sparse(fill_value=0)
        sparsedfs[key] = sdf
    pickle.dump(sparsedfs, f)
Results
-rw-r--r-- 1 bazel staff 10292760 Oct 17 14:31 test_fixed_None-9.hdf
-rw-r--r-- 1 bazel staff 9531607 Oct 17 14:31 test_fixed_blosc-9.hdf
-rw-r--r-- 1 bazel staff 7867786 Oct 17 14:31 test_fixed_bzip2-9.hdf
-rw-r--r-- 1 bazel staff 9506483 Oct 17 14:31 test_fixed_lzo-9.hdf
-rw-r--r-- 1 bazel staff 8036845 Oct 17 14:31 test_fixed_zlib-9.hdf
-rw-r--r-- 1 bazel staff 26627915 Oct 17 14:31 test_pickle.pkl
-rw-r--r-- 1 bazel staff 8752370 Oct 17 14:32 test_pickle_gzip.pklz
-rw-r--r-- 1 bazel staff 8407704 Oct 17 14:32 test_pickle_gzip_sparse.pklz
-rw-r--r-- 1 bazel staff 14464924 Oct 17 14:31 test_table_None-9.hdf
-rw-r--r-- 1 bazel staff 8619016 Oct 17 14:31 test_table_blosc-9.hdf
-rw-r--r-- 1 bazel staff 8154716 Oct 17 14:31 test_table_bzip2-9.hdf
-rw-r--r-- 1 bazel staff 8481631 Oct 17 14:31 test_table_lzo-9.hdf
-rw-r--r-- 1 bazel staff 8047125 Oct 17 14:31 test_table_zlib-9.hdf
Given the results above, the best 'compression-first' solution appears to be to store the data in HDF fixed format, with bzip2. Is there a better way of organising the data, perhaps without HDF, that would allow me to save even more space?
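On the "reads on individual dataframes" requirement: any of the stores produced above supports pulling a single frame back out by key. A minimal sketch against the fixed-format bzip2 file (key names follow the id_<n> scheme used in generate_data()):

store = pd.HDFStore('test_fixed_bzip2-9.hdf', mode='r')
df_7 = store['id_7']   # equivalent to store.get('id_7')
store.close()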
Update 1
Per the comment below from Jeff, I used ptrepack on the table-format HDF store that had no initial compression, and then recompressed it. Results are below:
-rw-r--r-- 1 bazel staff 8627220 Oct 18 08:40 test_table_repack-blocsc-9.hdf
-rw-r--r-- 1 bazel staff 8627620 Oct 18 09:07 test_table_repack-blocsc-blosclz-9.hdf
-rw-r--r-- 1 bazel staff 8409221 Oct 18 08:41 test_table_repack-blocsc-lz4-9.hdf
-rw-r--r-- 1 bazel staff 8104142 Oct 18 08:42 test_table_repack-blocsc-lz4hc-9.hdf
-rw-r--r-- 1 bazel staff 14475444 Oct 18 09:05 test_table_repack-blocsc-snappy-9.hdf
-rw-r--r-- 1 bazel staff 8059586 Oct 18 08:43 test_table_repack-blocsc-zlib-9.hdf
-rw-r--r-- 1 bazel staff 8161985 Oct 18 09:08 test_table_repack-bzip2-9.hdf
Oddly, recompressing with ptrepack seems to increase total file size (at least in this case using table format with similar compressors).
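For reference, a ptrepack invocation along these lines (PyTables 3.1; output filename illustrative) is one way to produce such a recompressed copy of the uncompressed table store:

ptrepack --complevel=9 --complib=blosc:lz4hc test_table_None-9.hdf test_table_repack-lz4hc-9.hdf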
Related
I'm trying to read a specific file from a bz2-compressed tar archive using Python.
import tarfile

tar = tarfile.open(filename, "r|bz2", bufsize=57860311)
for tarinfo in tar:
    print tarinfo.name, "is", tarinfo.size, "bytes in size and is",
    if tarinfo.isreg():
        print "a regular file."
        # read the file
        f = tar.extractfile(tarinfo)
        #print f.read()
    elif tarinfo.isdir():
        print "a directory."
    else:
        print "something else."
tar.close()
But at the end I got the error:
/usr/local/Cellar/python#2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tarfile.pyc in read(self, size)
577 buf = "".join(t)
578 else:
--> 579 buf = self._read(size)
580 self.pos += len(buf)
581 return buf
/usr/local/Cellar/python#2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tarfile.pyc in _read(self, size)
594 break
595 try:
--> 596 buf = self.cmp.decompress(buf)
597 except IOError:
598 raise ReadError("invalid compressed data")
EOFError: end of stream was already found
I also tried to list the files within the tar via tar.list(), and again:
-rwxr-xr-x lindauer/or3uunp 0 2013-05-21 00:58:36 r3.2/
-rw-r--r-- lindauer/or3uunp 6057 2012-01-05 14:41:00 r3.2/readme.txt
-rw-r--r-- lindauer/or3uunp 44732 2012-01-04 10:08:54 r3.2/psychometric.csv
-rw-r--r-- lindauer/or3uunp 57860309 2012-01-04 09:58:20 r3.2/logon.csv
/usr/local/Cellar/python#2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tarfile.pyc in _read(self, size)
594 break
595 try:
--> 596 buf = self.cmp.decompress(buf)
597 except IOError:
598 raise ReadError("invalid compressed data")
EOFError: end of stream was already found
I listed the files inside the archive using the tar command. Here is the result:
tar -tvf r3.2.tar.bz2
drwxr-xr-x 0 lindauer or3uunp 0 May 21 2013 r3.2/
-rw-r--r-- 0 lindauer or3uunp 6057 Jan 5 2012 r3.2/readme.txt
-rw-r--r-- 0 lindauer or3uunp 44732 Jan 4 2012 r3.2/psychometric.csv
-rw-r--r-- 0 lindauer or3uunp 57860309 Jan 4 2012 r3.2/logon.csv
-rw-r--r-- 0 lindauer or3uunp 12494829865 Jan 5 2012 r3.2/http.csv
-rw-r--r-- 0 lindauer or3uunp 1066622500 Jan 5 2012 r3.2/email.csv
-rw-r--r-- 0 lindauer or3uunp 218962503 Jan 5 2012 r3.2/file.csv
-rw-r--r-- 0 lindauer or3uunp 29156988 Jan 4 2012 r3.2/device.csv
drwxr-xr-x 0 lindauer or3uunp 0 May 20 2013 r3.2/LDAP/
-rw-r--r-- 0 lindauer or3uunp 140956 Jan 4 2012 r3.2/LDAP/2011-01.csv
-rw-r--r-- 0 lindauer or3uunp 147370 Jan 4 2012 r3.2/LDAP/2010-05.csv
-rw-r--r-- 0 lindauer or3uunp 149221 Jan 4 2012 r3.2/LDAP/2010-02.csv
-rw-r--r-- 0 lindauer or3uunp 141717 Jan 4 2012 r3.2/LDAP/2010-12.csv
-rw-r--r-- 0 lindauer or3uunp 148931 Jan 4 2012 r3.2/LDAP/2010-03.csv
-rw-r--r-- 0 lindauer or3uunp 147370 Jan 4 2012 r3.2/LDAP/2010-04.csv
-rw-r--r-- 0 lindauer or3uunp 149793 Jan 4 2012 r3.2/LDAP/2009-12.csv
-rw-r--r-- 0 lindauer or3uunp 143979 Jan 4 2012 r3.2/LDAP/2010-09.csv
-rw-r--r-- 0 lindauer or3uunp 145591 Jan 4 2012 r3.2/LDAP/2010-07.csv
-rw-r--r-- 0 lindauer or3uunp 139444 Jan 4 2012 r3.2/LDAP/2011-03.csv
-rw-r--r-- 0 lindauer or3uunp 142347 Jan 4 2012 r3.2/LDAP/2010-11.csv
-rw-r--r-- 0 lindauer or3uunp 138285 Jan 4 2012 r3.2/LDAP/2011-04.csv
-rw-r--r-- 0 lindauer or3uunp 149793 Jan 4 2012 r3.2/LDAP/2010-01.csv
-rw-r--r-- 0 lindauer or3uunp 146008 Jan 4 2012 r3.2/LDAP/2010-06.csv
-rw-r--r-- 0 lindauer or3uunp 144711 Jan 4 2012 r3.2/LDAP/2010-08.csv
-rw-r--r-- 0 lindauer or3uunp 137967 Jan 4 2012 r3.2/LDAP/2011-05.csv
-rw-r--r-- 0 lindauer or3uunp 140085 Jan 4 2012 r3.2/LDAP/2011-02.csv
-rw-r--r-- 0 lindauer or3uunp 143420 Jan 4 2012 r3.2/LDAP/2010-10.csv
-r--r--r-- 0 lindauer or3uunp 3923 Jan 4 2012 r3.2/license.txt
Could this be because the archive has subfolders, and for some reason the Python library has problems extracting from subfolders?
I also tried to open the tar file manually with the tar command and had no problems, so I don't think the file is corrupted. Any help appreciated.
Comment: I tried debug=3 and I get: ReadError: bad checksum
Found the following related info:
tar: directory checksum error
Cause
This error message from tar(1) indicates that the checksum of the directory and the files it has read from tape does not match the checksum advertised in the header block. Usually this message indicates the wrong blocking factor, although it could indicate corrupt data on tape.
Action
To resolve this problem, make certain that the blocking factor you specify on the command line (after -b) matches the blocking factor originally specified. If in doubt, leave out the block size and let tar(1) determine it automatically. If that remedy does not help, the tape data could be corrupted.
SE:tar-ignore-or-fix-checksum
I'd try the -i switch to see if you can just ignore any messages regarding EOF.
-i, --ignore-zeros ignore zeroed blocks in archive (means EOF)
Example
$ tar xivf backup.tar
bugs.python.org:tarfile-headererror
The comment in tarfile.py reads (Don't know the date of the file!):
# We shouldn't rely on this checksum, because some tar programs
# calculate it differently and it is merely validating the
# header block.
ReadError: unexpected end of data
From the tarfile Documentation
The tarfile module defines the following exceptions:
exception tarfile.ReadError
Is raised when a tar archive is opened, that either cannot be handled by the tarfile module or is somehow invalid.
First, try with another tar archive file to verify your Python environment.
Second, check whether your tar archive file matches the following format:
tarfile.DEFAULT_FORMAT
The default format for creating archives. This is currently GNU_FORMAT.
Third, pass debug=3 when creating the TarFile instance, to get more verbose diagnostics (tarfile.open() forwards extra keyword arguments to TarFile):
tar = tarfile.open(name=filename, mode="r|bz2", debug=3)
...
class tarfile.TarFile(name=None, mode='r', fileobj=None, format=DEFAULT_FORMAT, tarinfo=TarInfo, dereference=False, ignore_zeros=False, encoding=ENCODING, errors='surrogateescape', pax_headers=None, debug=0, errorlevel=0)
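Since the goal is to read one specific member, it may also be worth trying random-access mode ("r:bz2") rather than the streaming "r|bz2" mode, and pulling the member out by name. A minimal sketch, with the member path taken from the listing above:

import tarfile

# "r:bz2" opens the archive for random access, so a single member can be
# located and extracted without walking the whole stream.
tar = tarfile.open("r3.2.tar.bz2", "r:bz2")
member = tar.getmember("r3.2/logon.csv")
f = tar.extractfile(member)
data = f.read()
tar.close()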
I'm trying to parse some log files in Python, but my reads always return only null bytes.
I've confirmed that the file in question does contain data:
$ zcat Events.log.gz | wc -c
188371128
$ zcat Events.log.gz | head
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
But attempting to read it in Python gives only null bytes:
$ python
Python 2.6.9 (unknown, Sep 14 2016, 17:46:59)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = 'Events.log.gz'
>>> import gzip
>>> content = gzip.open(filename).read()
>>> len(content)
188371128
>>> for i in range(10):
... content[i*10000:(i*10000)+10]
...
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
I've tried explicitly setting 'mode' to either 'r' or 'rb', with no difference in result.
I've also tried subprocess.Popen(['zcat', filename], stdout=subprocess.PIPE).stdout.read(), with the same response.
Perhaps relevantly, when I tried to zcat the file to another file, the output was a binary file:
$ zcat Events.log.gz > /tmp/logoutput
$ less /tmp/logoutput
"/tmp/logoutput" may be a binary file. See it anyway?
[y]
^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#...
$ head /tmp/logoutput
17 Jan 2018 08:10:35,863: {"deviceType":"A16ZV8BU3SN1N3",[REDACTED]}
17 Jan 2018 08:10:35,878: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:35,886: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:35,911: {"deviceType":"A2CZFJ2RKY7SE2",[REDACTED]}
17 Jan 2018 08:10:35,937: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:35,963: {"appOtaState":"ota",[REDACTED]}
17 Jan 2018 08:10:35,971: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
17 Jan 2018 08:10:36,006: {"deviceType":"A2JTEGS8GUPDOF",[REDACTED]}
17 Jan 2018 08:10:36,013: {"deviceType":"A1CTGXB4BA274T",[REDACTED]}
17 Jan 2018 08:10:36,041: {"deviceType":"A1DL2DVDQVK3Q",[REDACTED]}
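One more low-level check that might help narrow this down is to confirm what Python is actually opening: the absolute path, the on-disk size, and the first raw bytes (a real gzip stream starts with the magic bytes \x1f\x8b). A minimal sketch:

import os

filename = 'Events.log.gz'
print os.path.abspath(filename), os.path.getsize(filename)
with open(filename, 'rb') as f:
    print repr(f.read(4))  # a gzip file starts with '\x1f\x8b'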
So I currently have a directory, we'll call it /mydir, that contains 36 CSV files, each 2.1 GB and with the same dimensions. They are all the same size, and I want to read them into pandas, concatenate them side-by-side (so the number of rows stays the same), and then write the resulting dataframe out as one large CSV. The code I have works for combining a few of them, but hits a memory error after a certain point. I was wondering if there is a more efficient way to do this than what I have.
import os
import pandas as pd

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv('/mydir/' + file, dtype='float')], axis=1)
df.to_csv('mydir/file.csv')
It was suggested to me to break it up into smaller pieces, combining the files in groups of 6 and then combining those results in turn, but I don't know whether that is a valid solution that will avoid the memory error problem.
EDIT: view of the directory:
-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv
Chunk Them All!
import os
from glob import glob

import pandas as pd

# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')

# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)

# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)

# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)

# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))
Assuming the answer to MaxU's question is that all the files have the same number of rows, and assuming further that minor CSV details like quoting are handled the same way in all the files, you don't need Pandas for this at all. Plain file readline calls will give you strings that you can concatenate and write out, provided you can supply the number of rows. Something like this:
numrows = 999  # whatever. Probably pass as argument to function or on cmdline

outfile = open('myout.csv', 'w')

infile_names = ['file01.csv',
                'file02.csv',
                # ...
                'file36.csv']

# open all the input files
infiles = []
for fname in infile_names:
    infiles.append(open(fname))

for i in range(numrows):
    # read a line from each input file and add it to the output string
    out_csv = ''
    for infile2read in infiles:
        out_csv += infile2read.readline().strip() + ','
    out_csv = out_csv[:-1] + '\n'  # replace final comma with newline
    # write this row's data out to the output file
    outfile.write(out_csv)

# close the files
for f in infiles:
    f.close()
outfile.close()
I'm trying to execute command in Python like this:
os.system('ls')
What's interesting is that the output length is limited by the terminal window size where I'm running this python console.
>>>os.system('ls -l')
total 0
-rw-r--r-- 1 hy hy 0 Apr 29 22:30 a.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 b.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 c.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 d.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 e.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 f.txt
-rw-r--r-- 1 hy hy 0 Apr 29 22:31 g.txt
>>>
I did that in a directory containing hundreds of files, and intentionally resized the terminal window to be very small; it only output a few lines, exactly filling the window. With an even smaller terminal window, it output even fewer lines. Every time, the output stops at the bottom boundary of my terminal window.
It's not that the Python console hides some of the output when displaying it: I tried using subprocess.Popen() to send the output to a pipe and readlines() on the pipe, and got the same result.
But it seems Python doesn't do this all the time; I don't get this problem on all machines.
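For comparison, a minimal sketch that captures the listing through a pipe, so ls never writes to the terminal at all; this only sidesteps the terminal, it is not a diagnosis of the underlying cause:

import subprocess

# Capture the output of ls -l through a pipe instead of the terminal.
out, _ = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE).communicate()
print len(out.splitlines())  # number of lines actually captured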
I am trying to install the PostgreSQL Python adapter (psycopg2).
I downloaded the tarball and installed it using:
python setup.py build
python setup.py install
That left me with /usr/lib64/python2.4/site-packages/psycopg2/ containing:
> total 836
> -rw-r--r-- 1 root root 12759 Dec 11 18:18 errorcodes.py
> -rw-r--r-- 1 root root 14584 Dec 12 13:49 errorcodes.pyc
> -rw-r--r-- 1 root root 14584 Dec 12 13:49 errorcodes.pyo
> -rw-r--r-- 1 root root 5807 Dec 11 18:18 extensions.py
> -rw-r--r-- 1 root root 7298 Dec 12 13:49 extensions.pyc
> -rw-r--r-- 1 root root 7298 Dec 12 13:49 extensions.pyo
> -rw-r--r-- 1 root root 31495 Dec 11 18:18 extras.py
> -rw-r--r-- 1 root root 35124 Dec 12 13:49 extras.pyc
> -rw-r--r-- 1 root root 35124 Dec 12 13:49 extras.pyo
> -rw-r--r-- 1 root root 6177 Dec 11 18:18 __init__.py
> -rw-r--r-- 1 root root 5740 Dec 12 13:49 __init__.pyc
> -rw-r--r-- 1 root root 5740 Dec 12 13:49 __init__.pyo
> -rw-r--r-- 1 root root 8855 Dec 11 18:18 pool.py
> -rw-r--r-- 1 root root 8343 Dec 12 13:49 pool.pyc
> -rw-r--r-- 1 root root 8343 Dec 12 13:49 pool.pyo
> -rw-r--r-- 1 root root 3389 Dec 21 11:17 psycopg1.py
> -rw-r--r-- 1 root root 3182 Dec 21 11:22 psycopg1.pyc
> -rw-r--r-- 1 root root 3167 Dec 12 13:49 psycopg1.pyo
> -rwxr-xr-x 1 root root 572648 Dec 21 11:22 _psycopg.so
> drwxr-xr-x 2 root root   4096 Dec 21 10:38 tests
> -rw-r--r-- 1 root root 4427 Dec 11 18:18 tz.py
> -rw-r--r-- 1 root root 4325 Dec 12 13:49 tz.pyc
> -rw-r--r-- 1 root root 4325 Dec 12 13:49 tz.pyo
But in python shell when I am trying to import library, I got error:
>>> import psycopg2
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib64/python2.4/site-packages/psycopg2/__init__.py", line 76, in ?
from psycopg2._psycopg import _connect, apilevel, threadsafety, paramstyle
ImportError: cannot import name _connect
I am running with Postgresql 9.2.
What am I missing here?
Please let me know.
Thanks.
You most likely have to remove some existing packages related to psycopg2 within your root. Some common locations:
rm -r /usr/lib/python2.4/site-packages/psycopg2*
rm -r /usr/local/lib/python2.6/dist-packages/psycopg2*
However, I recommend setting up a virtualenv to house the packages for your Python app.
Check out virtualenv. It's easy to use once installed:
virtualenv myapp
. myapp/bin/activate
cd ~/your/postgres_lib/download
python setup.py install
This will install the postgres libraries into your virtualenv (located under the myapp folder). Then, whenever you want to run your app, you just need to activate the environment via
. myapp/bin/activate
Adjust the path to myapp as necessary. There are helpers, like virtualenvwrapper, to streamline this process.