Pandas DataFrame.merge MemoryError - python

Goal
My goal is to merge two DataFrames on their common column (gene names) so I can take the product of the gene scores in each row. I'd then group by patient and cell and sum all the scores for each. The final DataFrame should look like this:
patient  cell
Pat_1    22RV1    12
         DU145    15
         LN18      9
Pat_2    22RV1    12
         DU145    15
         LN18      9
Pat_3    22RV1    12
         DU145    15
         LN18      9
That last part should work fine, but I have not been able to perform the first merge on gene names due to a MemoryError. Below are snippets of each DataFrame.
Data
cell_s =
Description Name level_2 0
0 LOC100009676 100009676_at LN18_CENTRAL_NERVOUS_SYSTEM 1
1 LOC100009676 100009676_at 22RV1_PROSTATE 2
2 LOC100009676 100009676_at DU145_PROSTATE 3
3 AKT3 10000_at LN18_CENTRAL_NERVOUS_SYSTEM 4
4 AKT3 10000_at 22RV1_PROSTATE 5
5 AKT3 10000_at DU145_PROSTATE 6
6 MED6 10001_at LN18_CENTRAL_NERVOUS_SYSTEM 7
7 MED6 10001_at 22RV1_PROSTATE 8
8 MED6 10001_at DU145_PROSTATE 9
cell_s is about 10,000,000 rows
patient_s =
id level_1 0
0 MED6 Pat_1 1
1 MED6 Pat_2 1
2 MED6 Pat_3 1
3 LOC100009676 Pat_1 2
4 LOC100009676 Pat_2 2
5 LOC100009676 Pat_3 2
6 ABCD Pat_1 3
7 ABCD Pat_2 3
8 ABCD Pat_3 3
....
patient_s is about 1,200,000 rows
Code
def get_score(cell, patient):
    cell_s = cell.set_index(['Description', 'Name']).stack().reset_index()
    cell_s.columns = ['Description', 'Name', 'cell', 's1']
    patient_s = patient.set_index('id').stack().reset_index()
    patient_s.columns = ['id', 'patient', 's2']
    # fails here:
    merged = cell_s.merge(patient_s, left_on='Description', right_on='id')
    merged['score'] = merged.s1 * merged.s2
    scores = merged.groupby(['patient', 'cell'])['score'].sum()
    return scores
I was getting a MemoryError when initially reading these files with read_csv, but specifying the dtypes resolved that. Confirming that my Python is 64-bit did not fix the merge problem either. I haven't hit a limitation of pandas, have I?
Python 3.4.3 |Anaconda 2.3.0 (64-bit)| Pandas 0.16.2

Consider two workarounds:
CSV BY CHUNKS
read_csv can run into memory and performance issues with large files, so load them in iterated chunks:
cellsfilepath = r'C:\Path\To\Cells\CSVFile.csv'
tp = pd.io.parsers.read_csv(cellsfilepath, sep=',', iterator=True, chunksize=1000)
cell_s = pd.concat(tp, ignore_index=True)

patientsfilepath = r'C:\Path\To\Patients\CSVFile.csv'
tp = pd.io.parsers.read_csv(patientsfilepath, sep=',', iterator=True, chunksize=1000)
patient_s = pd.concat(tp, ignore_index=True)
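Since specifying dtypes already helped you at the read_csv stage, it can also pay to combine the chunked read with an explicit dtype map (and usecols if you don't need every column) so each chunk stays small. A minimal sketch; the column names and types here are hypothetical, so adjust them to the real file layout:
import pandas as pd

cellsfilepath = r'C:\Path\To\Cells\CSVFile.csv'

# Hypothetical dtype map, replace with the real column names/types.
cell_dtypes = {
    'Description': str,
    'Name': str,
    'LN18_CENTRAL_NERVOUS_SYSTEM': 'float32',
    '22RV1_PROSTATE': 'float32',
    'DU145_PROSTATE': 'float32',
}

chunks = pd.read_csv(cellsfilepath, sep=',', dtype=cell_dtypes,
                     iterator=True, chunksize=100000)
cell_s = pd.concat(chunks, ignore_index=True)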
CSV VIA SQL
As a database guy, I always recommend handling large data loads and merges/joins with a SQL relational engine, which scales well for this kind of processing (I have made this point in many DataFrame-merge Q&As, even for R). You can use any SQL database, from file-based engines (Access, SQLite) to client-server ones (MySQL, MSSQL, and others), wherever your dataframes originate. Python ships with a built-in library for SQLite (for other engines you would use ODBC), and dataframes can be pushed into a database as tables with pandas' to_sql:
import sqlite3

dbfile = r'C:\Path\To\SQlitedb.sqlite'
cxn = sqlite3.connect(dbfile)
c = cxn.cursor()

cell_s.to_sql(name='cell_s', con=cxn, if_exists='replace')
patient_s.to_sql(name='patient_s', con=cxn, if_exists='replace')

strSQL = 'SELECT * FROM cell_s c INNER JOIN patient_s p ON c.Description = p.id;'
# MIGHT HAVE TO ADJUST ABOVE FOR CELL AND PATIENT PARAMS IN DEFINED FUNCTION
merged = pd.read_sql(strSQL, cxn)
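If memory for the merged result is still tight, you can also push the product and the groupby/sum into SQLite itself so only the small summary table comes back into pandas. A sketch, assuming the stacked frames from get_score (with columns s1, s2, patient, cell) are what was written with to_sql above:
# Aggregate inside SQLite; only (patient, cell, score) rows come back.
strSQL = """
    SELECT p.patient, c.cell, SUM(c.s1 * p.s2) AS score
    FROM cell_s c
    INNER JOIN patient_s p ON c.Description = p.id
    GROUP BY p.patient, c.cell;
"""
scores = pd.read_sql(strSQL, cxn)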

You may have to do it in pieces, or look into blaze. http://blaze.pydata.org
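"In pieces" could mean, for example, merging one slice of cell_s at a time and aggregating per slice, so the full merged frame never has to exist in memory at once. A minimal sketch, assuming cell_s and patient_s have already been built as in the question's get_score:
import pandas as pd

def get_score_chunked(cell_s, patient_s, chunksize=100000):
    partials = []
    for start in range(0, len(cell_s), chunksize):
        piece = cell_s.iloc[start:start + chunksize]
        merged = piece.merge(patient_s, left_on='Description', right_on='id')
        merged['score'] = merged.s1 * merged.s2
        # aggregate this slice before moving on, so it can be freed
        partials.append(merged.groupby(['patient', 'cell'])['score'].sum())
    # combine the per-slice partial sums into the final per-(patient, cell) totals
    return pd.concat(partials).groupby(level=['patient', 'cell']).sum()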

Related

Efficiently concatenate/append dataframe in a for loop to get a single big dataframe using python pandas

I am reading multiple PDF files that have certain highlighted portions (presume these are tables).
After pushing the highlights to a list, I am saving them to a dataframe.
Here's the logic:
import glob
import fitz  # PyMuPDF
import pandas as pd

try:
    filepath = [file for file in glob.glob("Folder/*.pdf")]
    for file in filepath:
        doc = fitz.open(file)
        print(file)
        highlights = []
        for page in doc:
            highlights += handle_page(page)
        #print(highlights)
        highlights_alt = highlights[0].split(',')
        df = pd.DataFrame(highlights_alt, columns=['Security Name'])
        #print(df.columns.tolist())
        df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
        df.drop_duplicates(keep='first', inplace=True)
        print(df.head())
        print(df.shape)
except IndexError:
    print('file {} is not highlighted'.format(file))
Using this logic I get the dataframes; however, if the folder has 5 PDFs, it creates 5 different dataframes, something like this:
Folder\z.pdf
Security Name Weights
0 UTILITIES (5.96
1 %*) None
(2, 2)
Folder\y.pdf
Security Name Weights
0 Quantity/ Market Value % of Net Investments Cu... 1.125
1 % 01
2 /07 None
3 /2027 None
4 EUR 230
(192, 2)
Folder\x.pdf
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
(526, 2)
However, I want a single dataframe containing all of the above records, giving a shape of (720, 2), something like:
Security Name Weights
0 Holding £740
1 000 None
2 Leeds Building Society 3.75
3 % variable 25
4 /4 None
.
.
720 xyz 3.33
(720, 2)
I tried using pandas' concat and append but have been unsuccessful so far. Please let me know an efficient way of doing this, since in future there could be well over 1000 PDFs.
Please help!
A quick way is to use pd.concat:
big_df = pd.concat(list_of_dfs, axis=0)
If this creates an error it would be helpful to know what the error is.
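In your loop that means collecting each per-file frame in a list and concatenating once at the end (moving the try/except inside the loop also keeps one bad file from aborting the rest). A sketch, reusing your existing handle_page helper:
import glob
import fitz  # PyMuPDF
import pandas as pd

all_dfs = []
for file in glob.glob("Folder/*.pdf"):
    try:
        doc = fitz.open(file)
        highlights = []
        for page in doc:
            highlights += handle_page(page)
        df = pd.DataFrame(highlights[0].split(','), columns=['Security Name'])
        df[['Security Name', 'Weights']] = df['Security Name'].str.rsplit(n=1, expand=True)
        df.drop_duplicates(keep='first', inplace=True)
        all_dfs.append(df)          # keep every per-file frame
    except IndexError:
        print('file {} is not highlighted'.format(file))

big_df = pd.concat(all_dfs, axis=0, ignore_index=True)   # one combined frame
print(big_df.shape)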

Run functions over many dataframes, add results to another dataframe, and dynamically name the resulting column with the name of the original df

I have many different tables that all have different column names, and each refers to an outcome like glucose, insulin, leptin, etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example (ignore that the function makes little sense). The code below works, but instead of copy-pasting final_report["outcome"] = over and over again, I would like to just run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result", and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]

glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})

ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})

def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()

final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}

for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
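If you'd rather not mutate final_report column by column, a variant of the same loop builds all of the result columns at once and joins them on in one step (a sketch using the same input_dfs dict and find_result as above):
# Build every "<name>_result" column in one pass, then attach them together.
results = pd.DataFrame({
    f"{name}_result": final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df), axis=1)
    for name, df in input_dfs.items()
})
final_report = final_report.join(results)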

Create category rows from count of list items in pandas column

I have a set of job vacancy data in a pandas dataframe that I have 'tagged' by the strings it contains. E.g., if C# is mentioned in the title or description, C# is added to that row under the languages column.
I now want to summarise the count of each skill in the entire dataframe. Original dataframe:
languages frameworks platforms databases other
0 [SQL, C] [] [AWS] [] []
1 [SQL] [] [] [SQL Server] []
2 [SQL, C#] [ASP.NET, ASP] [] [SQL Server] []
3 [JavaScript, HTML, CSS,] [] [] [] []
4 [JavaScript, Python, Java] [React] [] [] []
...
Desired:
skill_category skill count
0 languages SQL 3
1 languages C 1
2 languages C# 1
3 languages JavaScript 2
...
9 frameworks ASP.NET 1
10 frameworks ASP 1
...
12 platforms AWS 1
...
14 databases SQL Server 2
...
15 other Hadoop 1
etc.
I have tried:
Outputting the relevant parts of the df into a python list of dictionaries and using for loops and counters to count the skills in each category. I can then create a new dataframe from this, but this is a lot of code for something I feel should be possible with pandas.
Looked into pandas' .pivot() method, though I can't work out a way to have the column names (languages, frameworks, etc.) become rows for each skill.
Used pandas' .explode() and .value_counts() methods to count the skills in each column, e.g.:
In[12]: df['languages'].explode().value_counts()
Out[12]:
JavaScript 39
SQL 32
C# 28
HTML 24
Java 24
...
But this only works column-by-column. I need a dataframe with a category row for creating faceted visualisations in plotly.
Please help?
You are close to the answer! But stack all columns into one tall series before counting:
counts = df.stack().explode() \
           .reset_index().groupby(["level_1", 0]).count()
counts = counts.reset_index()
counts.columns = ["skill_category", "skill", "count"]
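An equivalent route, if you find the long format easier to read, is to melt the frame first and then explode and count each (category, skill) pair; a sketch assuming every skill column holds Python lists:
# Long format: one row per (skill_category, skill), then count the pairs.
long = df.melt(var_name="skill_category", value_name="skill").explode("skill")
counts = (long.dropna(subset=["skill"])          # drop rows from empty lists
              .groupby(["skill_category", "skill"])
              .size()
              .reset_index(name="count"))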

Storing as Pandas DataFrames and Updating as Pytables

Can you store data with pandas' HDFStore and then open it / perform I/O on it using PyTables? The question comes up because I am currently storing data as
store = pd.HDFStore('Filename', mode='a')
store.append('data', data)
However, as I understand it, pandas doesn't really support updating records. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on this. PyTables has a lot of documentation, but I am not sure whether I can open the file and update it with PyTables when I didn't use PyTables to save the file initially.
Here is a demonstration of @flyingmeatball's answer:
Let's generate a test DF:
In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))
In [57]: df
Out[57]:
a b c
0 0.022079 0.901965 0.282529
1 0.596452 0.096204 0.197186
2 0.034127 0.992500 0.523114
3 0.659184 0.447355 0.246932
4 0.441517 0.853434 0.119602
5 0.779707 0.429574 0.744452
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
8 0.690729 0.052097 0.146705
9 0.828667 0.439608 0.091007
10 0.988435 0.326589 0.536904
11 0.687250 0.661912 0.318209
12 0.829129 0.758737 0.519068
13 0.500462 0.723528 0.026962
14 0.464162 0.364536 0.843899
and save it to an HDFStore (NOTE: don't forget to use data_columns=True (or data_columns=[list_of_columns_to_index]) in order to index all columns that we want to use in the where clause):
In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
In [59]: store.append('test', df, format='t', data_columns=True)
In [60]: store.close()
Solution:
In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
The .remove() method should return # of removed rows:
In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9
Let's append the changed (multiplied by 100) rows:
In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)
Test:
In [64]: store.select('test')
Out[64]:
a b c
0 0.022079 0.901965 0.282529
2 0.034127 0.992500 0.523114
4 0.441517 0.853434 0.119602
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
14 0.464162 0.364536 0.843899
1 59.645151 9.620415 19.718557
3 65.918421 44.735482 24.693160
5 77.970749 42.957446 74.445185
8 69.072948 5.209725 14.670545
9 82.866731 43.960848 9.100682
10 98.843540 32.658931 53.690360
11 68.725002 66.191215 31.820942
12 82.912937 75.873689 51.906795
13 50.046189 72.352794 2.696243
finalize:
In [65]: store.close()
Here are the docs I think you're after:
http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables
See this thread as well:
Update pandas DataFrame in stored in a Pytable with another pandas DataFrame
Looks like you can load the 5% of records into memory, remove them from the store (or replace the whole table), then append the updated ones back:
store.remove(key, where=...)
store.append(.....)
You can also do this outside of pandas; see the PyTables tutorial on removal:
http://www.pytables.org/usersguide/tutorials.html
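Putting the two steps together, the daily 5% update could be wrapped like this (a sketch, assuming the table was appended with format='t' and data_columns=True as in the demonstration above, and that updated_df holds the corrected rows):
import pandas as pd

def update_rows(path, key, updated_df, where):
    # Replace the rows matched by `where` with the rows in `updated_df`.
    # Assumes the table is in table format with indexed data columns,
    # so `where` can reference them (e.g. "a > 0.5").
    with pd.HDFStore(path) as store:
        store.remove(key, where=where)
        store.append(key, updated_df, format='t', data_columns=True)

# e.g. update_rows('test_removal.h5', 'test', df.loc[df.a > 0.5] * 100, "a > 0.5")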

How to divide a dbf table to two or more dbf tables by using python

I have a dbf table. I want to automatically divide this table into two or more tables using Python. The main problem is that the table consists of several groups of lines, and each group is separated from the previous one by an empty line, so I need to save each group to a new dbf table. I think this could be solved with some function from the Arcpy package together with FOR and WHILE loops, but I can't work it out. My source dbf table is more complex, but I attach a simple example for better understanding. Sorry for my poor English.
Source dbf table:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
4
5 D 2
6 E 3
I want get dbf1:
ID NAME TEAM
1 A 1
2 B 2
3 C 1
I want get dbf2:
ID NAME TEAM
1 D 2
2 E 3
Using my dbf package it could look something like this (untested):
import dbf

source_dbf = '/path/to/big/dbf_file.dbf'
base_name = '/path/to/smaller/dbf_%03d'

sdbf = dbf.Table(source_dbf)
i = 1
ddbf = sdbf.new(base_name % i)
sdbf.open()
ddbf.open()
for record in sdbf:
    if not record.name:  # assuming if 'name' is empty, all are empty
        ddbf.close()
        i += 1
        ddbf = sdbf.new(base_name % i)
        ddbf.open()  # open the new part before appending to it
        continue
    ddbf.append(record)
ddbf.close()
sdbf.close()
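If loading the table into pandas is an option (for example via a DBF reader), the split can also be expressed as a groupby on a running count of the blank separator rows; a sketch assuming the data is already in a DataFrame df whose NAME column is empty/NaN on the separator rows:
import pandas as pd

# True on the blank separator rows (NAME missing or empty).
is_sep = df['NAME'].isna() | (df['NAME'] == '')
group_id = is_sep.cumsum()              # increments at every separator row
parts = [g.reset_index(drop=True)       # one DataFrame per block
         for _, g in df[~is_sep].groupby(group_id[~is_sep])]
# parts[0], parts[1], ... can then be written back out as separate tables.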
