Can you store data with a pandas HDFStore and then open it / perform I/O on it using PyTables? The reason this question comes up is that I am currently storing data as
store = pd.HDFStore('Filename', mode='a')
store.append('key', data)
However, as I understand it, pandas doesn't really support updating records. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on this. PyTables has a lot of documentation, but I am not sure whether I can open the file and update it with PyTables when I didn't use PyTables to save the file initially.
Here is a demonstration of @flyingmeatball's answer:
Let's generate a test DF:
In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))
In [57]: df
Out[57]:
a b c
0 0.022079 0.901965 0.282529
1 0.596452 0.096204 0.197186
2 0.034127 0.992500 0.523114
3 0.659184 0.447355 0.246932
4 0.441517 0.853434 0.119602
5 0.779707 0.429574 0.744452
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
8 0.690729 0.052097 0.146705
9 0.828667 0.439608 0.091007
10 0.988435 0.326589 0.536904
11 0.687250 0.661912 0.318209
12 0.829129 0.758737 0.519068
13 0.500462 0.723528 0.026962
14 0.464162 0.364536 0.843899
and save it to an HDFStore (NOTE: don't forget to use data_columns=True (or data_columns=[list_of_columns_to_index]) in order to index all columns that we want to use in the where clause):
In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
In [59]: store.append('test', df, format='t', data_columns=True)
In [60]: store.close()
Solution:
In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
The .remove() method should return the number of removed rows:
In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9
Let's append the changed (multiplied by 100) rows:
In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)
Test:
In [64]: store.select('test')
Out[64]:
a b c
0 0.022079 0.901965 0.282529
2 0.034127 0.992500 0.523114
4 0.441517 0.853434 0.119602
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
14 0.464162 0.364536 0.843899
1 59.645151 9.620415 19.718557
3 65.918421 44.735482 24.693160
5 77.970749 42.957446 74.445185
8 69.072948 5.209725 14.670545
9 82.866731 43.960848 9.100682
10 98.843540 32.658931 53.690360
11 68.725002 66.191215 31.820942
12 82.912937 75.873689 51.906795
13 50.046189 72.352794 2.696243
Finalize:
In [65]: store.close()
Here are the docs I think you're after:
http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables
See this thread as well:
Update pandas DataFrame in stored in a Pytable with another pandas DataFrame
Looks like you can load the 5% of records into memory, remove them from the store, then append the updated ones back, instead of replacing the whole table:
store.remove(key, where = ...)
store.append(.....)
You can also do it outside of pandas - see the PyTables tutorial on removal:
http://www.pytables.org/usersguide/tutorials.html
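For what it's worth, here is a rough sketch of doing the removal with PyTables directly on a file that pandas wrote, assuming the frame was appended under the key 'test' with data_columns=True as in the demonstration above (node and column names would differ for your file):

import tables

# sketch only: open the HDF5 file pandas wrote and delete rows matching a condition
with tables.open_file('test_removal.h5', mode='a') as h5:
    table = h5.root.test.table                    # pandas stores the data in a 'table' node under the key
    rows = table.get_where_list('a > 0.5')        # works because 'a' is a real (data) column
    for n in sorted(int(r) for r in rows)[::-1]:  # delete from the end so earlier indices stay valid
        table.remove_rows(n, n + 1)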
Related
I have a few functions that make new columns in a pandas dataframe, as a function of existing columns in the dataframe. I have two different scenarios that occur here: (1) the dataframe is NOT MultiIndex and has a set of columns, say [a,b] and (2) the dataframe is MultiIndex and now has the same set of column headers repeated N times, say [(a,1),(b,1),(a,2),(b,2)....(a,N),(b,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
    if isinstance(df.columns, pd.MultiIndex):
        for s in df['a'].columns:
            df['c', s] = someFunction(df['a', s], df['b', s])
    else:
        df['c'] = someFunction(df['a'], df['b'])
Is there another way to do this, without having these if-multi-index/else statements everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi-indexed frame into N smaller dataframes (I often need to filter data or do things and keep the rows consistent across all the 1,2,...,N frames, and keeping them together in one frame seems to be the best way to do that).
You may still have to test whether the columns are a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics of a column, for example if someFunction divides by the average of column 'a' (see the small sketch just below).
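A tiny made-up illustration of that caveat: on the stacked frame, the mean of column 'a' pools all of its sub-columns together, which is not the same as the per-sub-column means the loop-based version would use:

import pandas as pd
import numpy as np

cols = pd.MultiIndex.from_product([['a', 'b'], ['one', 'two']])
df = pd.DataFrame(np.arange(8.).reshape(2, 4), columns=cols)

print(df['a'].mean())          # per-sub-column means: one -> 2.0, two -> 3.0
print(df.stack()['a'].mean())  # pooled mean over the stacked column: 2.5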
Solution
def someFunction(a, b):
    return a + b

def f(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
    df['c'] = someFunction(df['a'], df['a'])
    if ismi:
        df = df.unstack()
    return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print f(df)
will give you what you want
a b c \
one three two one three two one
0 0.282834 0.201300 0.490313 0.140157 0.352555 0.467710 0.565667
1 0.838527 0.763369 0.707131 0.265170 0.968125 0.452397 1.677055
2 0.822786 0.434637 0.785226 0.146397 0.003197 0.056220 1.645572
3 0.314795 0.230474 0.414096 0.595133 0.900934 0.060608 0.629591
4 0.334733 0.054299 0.118689 0.237786 0.057256 0.658538 0.669465
5 0.993753 0.665615 0.552942 0.336948 0.320329 0.788817 1.987507
6 0.310809 0.158675 0.199921 0.059406 0.134779 0.801491 0.621618
7 0.971043 0.723950 0.183953 0.909778 0.695661 0.103679 1.942086
8 0.755384 0.029720 0.728327 0.408389 0.677195 0.808295 1.510767
9 0.276158 0.623972 0.978232 0.897015 0.093772 0.253178 0.552317
three two
0 0.402600 0.980626
1 1.526739 1.414262
2 0.869273 1.570453
3 0.460948 0.828193
4 0.108599 0.237377
5 1.331230 1.105884
6 0.317349 0.399843
7 1.447900 0.367907
8 0.059439 1.456654
9 1.247944 1.956464
Goal
My goal is to merge two DataFrames by their common column (gene names) so I can take a product of each gene score across each gene row. I'd then perform a groupby on patients and cells and sum all scores from each. The ultimate data frame should look like this:
patient cell
Pat_1 22RV1 12
DU145 15
LN18 9
Pat_2 22RV1 12
DU145 15
LN18 9
Pat_3 22RV1 12
DU145 15
LN18 9
That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.
Data
cell_s =
Description Name level_2 0
0 LOC100009676 100009676_at LN18_CENTRAL_NERVOUS_SYSTEM 1
1 LOC100009676 100009676_at 22RV1_PROSTATE 2
2 LOC100009676 100009676_at DU145_PROSTATE 3
3 AKT3 10000_at LN18_CENTRAL_NERVOUS_SYSTEM 4
4 AKT3 10000_at 22RV1_PROSTATE 5
5 AKT3 10000_at DU145_PROSTATE 6
6 MED6 10001_at LN18_CENTRAL_NERVOUS_SYSTEM 7
7 MED6 10001_at 22RV1_PROSTATE 8
8 MED6 10001_at DU145_PROSTATE 9
cell_s is about 10,000,000 rows
patient_s =
id level_1 0
0 MED6 Pat_1 1
1 MED6 Pat_2 1
2 MED6 Pat_3 1
3 LOC100009676 Pat_1 2
4 LOC100009676 Pat_2 2
5 LOC100009676 Pat_3 2
6 ABCD Pat_1 3
7 ABCD Pat_2 3
8 ABCD Pat_3 3
....
patient_s is about 1,200,000 rows
Code
def get_score(cell, patient):
    cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
    cell_s.columns = ['Description', 'Name', 'cell', 's1']
    patient_s = patient.set_index('id').stack().reset_index()
    patient_s.columns = ['id', 'patient', 's2']
    # fails here:
    merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2
    scores = merged.groupby(['patient', 'cell'])['score'].sum()
    return scores
I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that issue. Confirming that my Python is 64-bit did not fix this issue either. I haven't reached the limits of pandas, have I?
Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2
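(For reference, the dtype fix mentioned above looked roughly like this; the file names and dtypes below are placeholders, not the exact ones used:)

import pandas as pd

# hypothetical column names/dtypes -- declaring them up front avoids pandas'
# type inference and lets you choose smaller numeric types during the read
cell = pd.read_csv('cells.csv', dtype={'Description': str, 'Name': str, '0': 'int32'})
patient = pd.read_csv('patients.csv', dtype={'id': str, '0': 'int32'})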
Consider two workarounds:
CSV BY CHUNKS
Apparently, read_csv can suffer performance issues with very large files, so they should be loaded in iterated chunks.
cellsfilepath = 'C:\\Path\To\Cells\CSVFile.csv'
tp = pd.io.parsers.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)
patientsfilepath = 'C:\\Path\To\Patients\CSVFile.csv'
tp = pd.io.parsers.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
CSV VIA SQL
As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for such processes. I have written many a comment on dataframe merge Q&As to this effect, even in R. You can use any SQL database, including file-based DBs (Access, SQLite) or client-server DBs (MySQL, MSSQL, or others), even the one your dfs derive from. Python maintains a built-in library for SQLite (otherwise use ODBC), and dataframes can be pushed into databases as tables using pandas' to_sql:
import sqlite3
dbfile = 'C:\\Path\To\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)
c = cxn.cursor()
cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con = cxn, if_exists='replace')
strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION
merged = pd.read_sql(strSQL, cxn)
You may have to do it in pieces, or look into blaze. http://blaze.pydata.org
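Here is one rough sketch of the "in pieces" idea, using the frames from the question: merge one slice of patient_s at a time and aggregate as you go, so the full merged frame never has to sit in memory (the number of slices is arbitrary):

import numpy as np
import pandas as pd

pieces = []
for chunk in np.array_split(patient_s, 20):        # 20 slices is an arbitrary choice
    merged = cell_s.merge(chunk, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2
    pieces.append(merged.groupby(['patient', 'cell'])['score'].sum())

# combine the partial sums into the final per-(patient, cell) scores
scores = pd.concat(pieces).groupby(level=['patient', 'cell']).sum()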
I have multiple huge tsv files that I'm trying to process using pandas. I want to group by 'col3' and 'col5'. I've tried this:
import pandas as pd
df = pd.read_csv('filename.txt', sep = "\t")
g2 = df.drop_duplicates(['col3', 'col5'])
g3 = g2.groupby(['col3', 'col5']).size().sum(level=0)
print g3
It works fine so far and prints an output like this:
yes 2
no 2
I'd like to be able to aggregate the output from multiple files, i.e., to group by these two columns across all the files at once and print one combined output with the total number of occurrences of 'yes', 'no', or whatever that attribute could be. In other words, I'd now like to use groupby on multiple files at once. And if a file doesn't have one of these columns, it should be skipped and the code should move on to the next file.
This is a nice use case for blaze.
Here's an example using a couple of reduced files from the nyctaxi dataset. I've purposely split a single large file into two files of 1,000,000 lines each:
In [16]: from blaze import Data, compute, by
In [17]: ls
trip10.csv trip11.csv
In [18]: d = Data('*.csv')
In [19]: expr = by(d[['passenger_count', 'medallion']], avg_time=d.trip_time_in_secs.mean())
In [20]: %time result = compute(expr)
CPU times: user 3.22 s, sys: 393 ms, total: 3.61 s
Wall time: 3.6 s
In [21]: !du -h *
194M trip10.csv
192M trip11.csv
In [22]: len(d)
Out[22]: 2000000
In [23]: result.head()
Out[23]:
passenger_count medallion avg_time
0 0 08538606A68B9A44756733917323CE4B 0
1 0 0BB9A21E40969D85C11E68A12FAD8DDA 15
2 0 9280082BB6EC79247F47EB181181D1A4 0
3 0 9F4C63E44A6C97DE0EF88E537954FC33 0
4 0 B9182BF4BE3E50250D3EAB3FD790D1C9 14
Note: This will perform the computation with pandas, using pandas' own chunked CSV reader. If your files are in the GB range you're better off converting to a format such as bcolz or PyTables, as these are binary formats designed for data analysis on huge files. CSVs are just blobs of text with conventions.
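If you do decide to convert, a minimal sketch of streaming a big CSV into a PyTables-backed HDF5 store with pandas might look like the following (file and column names follow the taxi example above; adjust to your data):

import pandas as pd

# write once, in chunks, so the CSV never has to fit in memory
with pd.HDFStore('trips.h5', mode='w') as store:
    for chunk in pd.read_csv('trip10.csv', chunksize=500000):
        # note: wide string columns may need min_itemsize=... on the first append
        store.append('trips', chunk, format='table', data_columns=['passenger_count'])

# later queries can read just the rows they need
subset = pd.read_hdf('trips.h5', 'trips', where='passenger_count > 2')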
One way to do it is to concatenate the dfs. It can eat up a lot of memory. How huge are the files?
filelist = ['file1.txt', 'file2.txt']
df = pd.concat([pd.read_csv(x, sep="\t") for x in filelist], axis=0)
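If you also need to skip files that are missing one of the columns (as asked), you could filter before concatenating; a sketch along the lines of the answer above:

import pandas as pd

filelist = ['file1.txt', 'file2.txt', 'file3.txt']
frames = []
for path in filelist:
    df = pd.read_csv(path, sep="\t")
    if {'col3', 'col5'}.issubset(df.columns):  # skip files missing either column
        frames.append(df[['col3', 'col5']])

combined = pd.concat(frames, ignore_index=True)
# equivalent to the question's .size().sum(level=0)
g = combined.drop_duplicates(['col3', 'col5']).groupby(['col3', 'col5']).size().groupby(level=0).sum()
print(g)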
I am writing a pandas df to a csv. When I write it to a csv file, some of the elements in one of the columns are being incorrectly converted to scientific notation/numbers. For example, col_1 has strings such as '104D59' in it. These strings are mostly represented as strings in the csv file, as they should be. However, the occasional string, such as '104E59', is being converted into scientific notation (e.g., 1.04E+61) and represented as a number in the resulting csv file.
I am trying to export the csv file into a software package (i.e., pandas -> csv -> software_new) and this change in data type is causing problems with that export.
Is there a way to write the df to a csv while ensuring that all elements in df['problem_col'] are represented as strings in the resulting csv and not converted to scientific notation?
Here is the code I have used to write the pandas df to a csv:
df.to_csv('df.csv', encoding='utf-8')
I also check the dtype of the problem column:
Per df.dtypes, df['problem_column'] is an object.
For Python 3.x (Python 3.7.2) and pandas 0.23.4:
In [2]: pd.__version__
Out[2]: '0.23.4'
Options and Settings
Use pandas.set_option to control how the dataframe is displayed:
import pandas as pd  # import the pandas package
import numpy as np   # needed for np.random.randn below

# display options so we can see how the float data looks once we read it:
pd.set_option('display.html.table_schema', True)  # show the dataframe/table as html
pd.set_option('display.precision', 5)             # set the display precision, here 5 digits
df = pd.DataFrame(np.random.randn(20, 4) * 10 ** -12)  # create a random dataframe
Output of the data:
df.dtypes # check datatype for columns
[output]:
0 float64
1 float64
2 float64
3 float64
dtype: object
Dataframe:
df # output of the dataframe
[output]:
0 1 2 3
0 -2.01082e-12 1.25911e-12 1.05556e-12 -5.68623e-13
1 -6.87126e-13 1.91950e-12 5.25925e-13 3.72696e-13
2 -1.48068e-12 6.34885e-14 -1.72694e-12 1.72906e-12
3 -5.78192e-14 2.08755e-13 6.80525e-13 1.49018e-12
4 -9.52408e-13 1.61118e-13 2.09459e-13 2.10940e-13
5 -2.30242e-13 -1.41352e-13 2.32575e-12 -5.08936e-13
6 1.16233e-12 6.17744e-13 1.63237e-12 1.59142e-12
7 1.76679e-13 -1.65943e-12 2.18727e-12 -8.45242e-13
8 7.66469e-13 1.29017e-13 -1.61229e-13 -3.00188e-13
9 9.61518e-13 9.71320e-13 8.36845e-14 -6.46556e-13
10 -6.28390e-13 -1.17645e-12 -3.59564e-13 8.68497e-13
11 3.12497e-13 2.00065e-13 -1.10691e-12 -2.94455e-12
12 -1.08365e-14 5.36770e-13 1.60003e-12 9.19737e-13
13 -1.85586e-13 1.27034e-12 -1.04802e-12 -3.08296e-12
14 1.67438e-12 7.40403e-14 3.28035e-13 5.64615e-14
15 -5.31804e-13 -6.68421e-13 2.68096e-13 8.37085e-13
16 -6.25984e-13 1.81094e-13 -2.68336e-13 1.15757e-12
17 7.38247e-13 -1.76528e-12 -4.72171e-13 -3.04658e-13
18 -1.06099e-12 -1.31789e-12 -2.93676e-13 -2.40465e-13
19 1.38537e-12 9.18101e-13 5.96147e-13 -2.41401e-12
And now write to_csv using the float_format='%.15f' parameter
df.to_csv('estc.csv',sep=',', float_format='%.15f') # write with precision .15
file output:
,0,1,2,3
0,-0.000000000002011,0.000000000001259,0.000000000001056,-0.000000000000569
1,-0.000000000000687,0.000000000001919,0.000000000000526,0.000000000000373
2,-0.000000000001481,0.000000000000063,-0.000000000001727,0.000000000001729
3,-0.000000000000058,0.000000000000209,0.000000000000681,0.000000000001490
4,-0.000000000000952,0.000000000000161,0.000000000000209,0.000000000000211
5,-0.000000000000230,-0.000000000000141,0.000000000002326,-0.000000000000509
6,0.000000000001162,0.000000000000618,0.000000000001632,0.000000000001591
7,0.000000000000177,-0.000000000001659,0.000000000002187,-0.000000000000845
8,0.000000000000766,0.000000000000129,-0.000000000000161,-0.000000000000300
9,0.000000000000962,0.000000000000971,0.000000000000084,-0.000000000000647
10,-0.000000000000628,-0.000000000001176,-0.000000000000360,0.000000000000868
11,0.000000000000312,0.000000000000200,-0.000000000001107,-0.000000000002945
12,-0.000000000000011,0.000000000000537,0.000000000001600,0.000000000000920
13,-0.000000000000186,0.000000000001270,-0.000000000001048,-0.000000000003083
14,0.000000000001674,0.000000000000074,0.000000000000328,0.000000000000056
15,-0.000000000000532,-0.000000000000668,0.000000000000268,0.000000000000837
16,-0.000000000000626,0.000000000000181,-0.000000000000268,0.000000000001158
17,0.000000000000738,-0.000000000001765,-0.000000000000472,-0.000000000000305
18,-0.000000000001061,-0.000000000001318,-0.000000000000294,-0.000000000000240
19,0.000000000001385,0.000000000000918,0.000000000000596,-0.000000000002414
And now write to_csv using the float_format='%f' parameter
df.to_csv('estc.csv',sep=',', float_format='%f') # this will remove the extra zeros after the '.'
For more details check pandas.DataFrame.to_csv
Use the float_format argument:
In [11]: df = pd.DataFrame(np.random.randn(3, 3) * 10 ** 12)
In [12]: df
Out[12]:
0 1 2
0 1.757189e+12 -1.083016e+12 5.812695e+11
1 7.889034e+11 5.984651e+11 2.138096e+11
2 -8.291878e+11 1.034696e+12 8.640301e+08
In [13]: print(df.to_string(float_format='{:f}'.format))
0 1 2
0 1757188536437.788086 -1083016404775.687134 581269533538.170288
1 788903446803.216797 598465111695.240601 213809584103.112457
2 -829187757358.493286 1034695767987.889160 864030095.691202
Which works similarly for to_csv:
df.to_csv('df.csv', float_format='{:f}'.format, encoding='utf-8')
If you would like to use the values as formatted strings in a list, say as part of a csv file written with csv.writer, the numbers can be formatted before creating the list:
import csv

with open('results_actout_file', 'w', newline='') as csvfile:
    resultwriter = csv.writer(csvfile, delimiter=',')
    resultwriter.writerow(header_row_list)
    resultwriter.writerow(df['label'].apply(lambda x: '%.17f' % x).values.tolist())
Is it possible to work with pandas DataFrame as with an Excel spreadsheet: say, by entering a formula in a column so that when variables in other columns change, the values in this column change automatically? Something like:
a b c
2 3 =a+b
And so when I update 2 or 3, the column c also updates automatically.
PS: It's clearly possible to write a function to return a+b, but is there any built-in functionality in pandas or in other Python libraries to work with matrices this way?
This will work in 0.13 (still in development)
In [19]: df = DataFrame(randn(10,2),columns=list('ab'))
In [20]: df
Out[20]:
a b
0 0.958465 0.679193
1 -0.769077 0.497436
2 0.598059 0.457555
3 0.290926 -1.617927
4 -0.248910 -0.947835
5 -1.352096 -0.568631
6 0.009125 0.711511
7 -0.993082 -1.440405
8 -0.593704 0.352468
9 0.523332 -1.544849
This will be possible as 'a + b' (soon)
In [21]: formulas = { 'c' : 'df.a + df.b' }
In [22]: def update(df, formulas):
   ....:     for k, v in formulas.items():
   ....:         df[k] = pd.eval(v)
In [23]: update(df,formulas)
In [24]: df
Out[24]:
a b c
0 0.958465 0.679193 1.637658
1 -0.769077 0.497436 -0.271642
2 0.598059 0.457555 1.055614
3 0.290926 -1.617927 -1.327001
4 -0.248910 -0.947835 -1.196745
5 -1.352096 -0.568631 -1.920726
6 0.009125 0.711511 0.720636
7 -0.993082 -1.440405 -2.433487
8 -0.593704 0.352468 -0.241236
9 0.523332 -1.544849 -1.021517
You could implement a hook into __setitem__ on the DataFrame to have this type of function called automatically, but that is pretty tricky. You didn't specify how the frame is updated in the first place; it would probably be easiest to simply call the update function after you change the values. A rough sketch of the hook idea follows.
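This is just one possible shape for such a hook (FormulaFrame and the formulas attribute are made-up names, and subclassing DataFrame comes with its own caveats):

import pandas as pd

class FormulaFrame(pd.DataFrame):
    """DataFrame that re-evaluates stored formula strings after each assignment."""
    _metadata = ['formulas']  # let pandas carry the attribute across operations

    @property
    def _constructor(self):
        return FormulaFrame

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # recompute every formula column whenever some other column is assigned
        for col, expr in (getattr(self, 'formulas', None) or {}).items():
            if key != col:
                super().__setitem__(col, self.eval(expr))

df = FormulaFrame({'a': [2], 'b': [3]})
df.formulas = {'c': 'a + b'}
df['a'] = [10]   # 'c' is (re)computed automatically -> 13
print(df)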
I don't know if it is what you want, but I accidentally discovered that you can store xlwt.Formula objects in DataFrame cells, and then, using the DataFrame.to_excel method, export the DataFrame to Excel and have your formulas in it:
import pandas
import xlwt
formulae=[]
formulae.append(xlwt.Formula('SUM(F1:F5)'))
formulae.append(xlwt.Formula('SUM(G1:G5)'))
formulae.append(xlwt.Formula('SUM(H1:I5)'))
formulae.append(xlwt.Formula('SUM(I1:I5)'))
df=pandas.DataFrame(formulae)
df.to_excel('FormulaTest.xls')
Try it...
There's currently no way to do this exactly in the way that you describe.
In pandas 0.13 there will be a new DataFrame.eval method that will allow you to evaluate an expression in the "context" of a DataFrame. For example, you'll be able to write df['c'] = df.eval('a + b').
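A minimal sketch of what that looks like in any reasonably recent pandas (the inplace assignment form of eval arrived in later versions):

import pandas as pd

df = pd.DataFrame({'a': [2, 5], 'b': [3, 7]})
df['c'] = df.eval('a + b')           # c -> [5, 12]
# later versions also accept the assignment inside the expression:
df.eval('c = a + b', inplace=True)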