MemoryError when I merge two Pandas data frames - python

I have searched almost everywhere and somehow none of the approaches seem to work in my case.
I have two large CSV files (each with a million+ rows and about 300-400 MB in size). They load fine into data frames using the read_csv function, without having to use the chunksize parameter.
I even performed certain minor operations on this data like new column generation, filtering, etc.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain. The operation takes forever.
Mine is a Windows 7 PC with 8 GB RAM. The Python version is 2.7.
Thank you.
Edit: I tried chunking methods too. When I do this, I don't get MemoryError, but the RAM usage explodes and my system crashes.

When you merge data using pandas.merge, it needs df1's memory, df2's memory, and the merged result's memory all at once. I believe that is why you get a MemoryError. You should stream df2 from its CSV file with the chunksize option and merge the data chunk by chunk, appending each piece to an output file.
There might be a better way, but you can try this.
For large datasets you can use the chunksize option in pandas.read_csv.
import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# create an empty bucket (header only) to collect the result
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv", index=False)

# delete df2 to save memory; it is re-read in chunks below
del df2

def preprocess(chunk):
    # merge one chunk of the second file against the full df1 and append it to the output file
    merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
    merged.to_csv("df3.csv", mode="a", header=False, index=False)

reader = pd.read_csv("yourdata2.csv", chunksize=1000)  # chunksize depends on your row and column sizes
for chunk in reader:
    preprocess(chunk)
This will save the merged data as df3.csv.

The reason you might be getting MemoryError: Unable to allocate... could be duplicates or blanks in your dataframe. Check the column you are joining on (when using merge) and see whether it has duplicates or blanks. If so, get rid of them with this command:
df.drop_duplicates(subset='column_name', keep=False, inplace=True)
Then re-run your Python/pandas code. This worked for me.
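Before dropping anything, it may help to confirm that the join key really is the problem. A minimal diagnostic sketch, assuming the key column is called 'column_name' as in the command above and the file name is a placeholder:
import pandas as pd

df = pd.read_csv("yourdata.csv")   # placeholder file name
key = "column_name"                # placeholder join column

print("duplicate keys:", df[key].duplicated().sum())
print("missing keys:", df[key].isna().sum())

# drop rows whose key is missing or duplicated before merging
df = df[df[key].notna()].drop_duplicates(subset=key, keep=False)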

In general, the chunked version suggested by @T_cat works great.
However, the memory explosion might be caused by joining on columns that contain NaN values.
So you may want to exclude those rows from the join.
see: https://github.com/pandas-dev/pandas/issues/24698#issuecomment-614347153
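If NaN join keys are indeed the cause, dropping those rows before the merge is a cheap fix. A minimal sketch, assuming the key columns are named Colname1 and Colname2 as in the earlier answer:
import pandas as pd

df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# keep only rows whose join key is present; NaN keys match each other and can blow up the result
left = df1[df1["Colname1"].notna()]
right = df2[df2["Colname2"].notna()]

merged = pd.merge(left, right, left_on="Colname1", right_on="Colname2")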

Maybe the left data frame has NaN in the merging column, which causes the final merged dataframe to bloat.
If it is acceptable, fill the merging column in the left data frame with zeros:
df['left_column'] = df['left_column'].fillna(0)
Then do the merge and see what you get.

Related

Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that the iteration is the problem. However, I haven't had much luck with various pandas/numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, add it to our dataset
    if df_new[new_col_name][i] not in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, we aren't going to do anything with it
    else:
        continue
# create an xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns of two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
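Spelled out with small frames built from the example tables in the question (trimmed to the key columns, values invented only to mirror the example), the whole loop collapses to one vectorised anti-join; a sketch under those assumptions:
import pandas as pd

# example frames from the question, reduced to the relevant columns
df_current = pd.DataFrame({"partno": [123, 456]})
df_new = pd.DataFrame({"mfgr_num": [456, 789], "desc": ["Knitted Shirt", "Logo Vest"]})

new_col_name = "mfgr_num"        # key column in df_new
current_col_name = "partno"      # key column in df_current

# keep only the df_new rows whose key does not appear anywhere in df_current
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)   # only the 789 / Logo Vest row remains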

Combining two dataframes with pandas/numpy

I'm trying to write a script that gets data from two different databases and produces a CSV file with the combined data.
I managed to get the data using psycopg2 and pandas read_sql_query, turned the results into two different dataframes, and all of that works great. I wrote it with only a little information about those databases, so I used databases I had and some simple queries.
All of that is on my github:
https://github.com/tomasz-urban/sql-db-get
With more detailed info about what needs to be done, I'm stuck...
In the first database there are user limits: lim_val_1 and lim_val_2, together with user_id (a couple of thousand rows). The second one holds usage, with val_1 and val_2 gathered every so often (a couple of hundred thousand rows).
I need to get the rows where users reach their limits (it doesn't matter whether it is lim_val_1 or lim_val_2 or both; I need all of them).
To visualize it better there are some simple tables in the link:
Databases info with output
My last approach:
result_query = df2.loc[(df2['val_1'] == df1['lim_val_1']) & (df2['val_2'] == df1['lim_val_2'])]
output_data = pd.DataFrame(result_query)
and I'm getting an error:
"ValueError: Can only compare identically-labeled Series objects"
I cannot label those columns the same, so I think this solution will not work for me. I also tried merging, with no result.
Could anyone help me with this?
Since df1 and df2 have different numbers of rows, you cannot compare the values directly; try joining the data frames and then selecting the rows where the conditions are met.
If user_id is the index of both dataframes:
df_3 = df1.join(df2)
df_3[
    (df_3['val_1'] == df_3['lim_val_1']) |
    (df_3['val_2'] == df_3['lim_val_2'])
]
You might want to replace == with >= in case the limits can be exceeded.
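If user_id is an ordinary column rather than the index, pd.merge does the same job, and the differently named value columns are no obstacle. A minimal sketch, assuming both frames carry a user_id column as described in the question; the output file name is a placeholder:
import pandas as pd

# attach each user's limits to every usage row
df_3 = pd.merge(df2, df1, on="user_id", how="inner")

# rows where a user reaches either limit (swap == for >= if limits can be exceeded)
output_data = df_3[
    (df_3["val_1"] == df_3["lim_val_1"]) |
    (df_3["val_2"] == df_3["lim_val_2"])
]
output_data.to_csv("output.csv", index=False)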

Why is `df.columns` an empty list while I can see the column names if I print out the dataframe? Python Pandas

import pandas as pd
DATA = pd.read_csv(url)
DATA.head()
I have a large dataset that has dozens of columns. After loading it like above into Colab, I can see the name of each column. But running DATA.columns just returns Index([], dtype='object'). What is happening here?
Now I find it impossible to pick out a few columns without column names. One way is to specify names=[...] when I load it, but I'm reluctant to do that since there are too many columns. So I'm looking for a way to index columns by integers; in R, df[:,[1,2,3]] would simply give me the first three columns of a dataframe. Somehow Pandas seems to focus on column names and makes integer indexing very inconvenient, though.
So what I'm asking is: (1) What did I do wrong? Can I obtain those column names as well when I load the dataframe? (2) If not, how can I pick out the [0, 1, 10]th columns by a list of integers?
It seems that the problem is in the loading, as DATA.shape returns (10000, 0). I reran the loading code a few times and, all of a sudden, things went back to normal. Maybe Colab was taking a nap or something?
You can do that with df.iloc[:, [1, 2, 3]], but I would suggest using the column names, because if the columns ever change order or new columns are inserted, the code can break.
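For selecting columns by position specifically, iloc takes a list of integer positions; a minimal sketch, reusing the url variable from the question:
import pandas as pd

DATA = pd.read_csv(url)            # url as in the question
print(DATA.shape)                  # sanity check: should not be (n, 0)

subset = DATA.iloc[:, [0, 1, 10]]  # 1st, 2nd and 11th columns by position
print(subset.head())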

How to merge Python Dask Dataframes into one on columns?

got a little problem. I have two dask dataframes with following format:
#DF1.csv
DATE|EVENTNAME|VALUE
#DF2.csv
DATE|EVENTNAME0|EVENTNAME1|...|EVENTNAMEX
I want to merge the value from DF1.csv into DF2.csv, at time t (DATE) and column (EVENTNAME). I am using Dask at the moment because I'm working with huge datasets (~50 GB). I noticed that you can't directly assign values in Dask, so I tried dd.Series.where:
df[nodeid].where(time, value)   # results in an error
# (pandas equivalent:
#  for row in df.iterrows():
#      df2.loc[row[0], row[1][0]] = row[1][1])
I also tried a merge, but the resulting Dask dataframe had no partitions, which leads to a MemoryError because everything is loaded into memory when I use the .to_csv('data-*.csv') method. It should be easy to merge the dataframes, but I have no clue at the moment. Is there a Dask pro who could help me out?
Edit://
This works well in pandas but not with dask:
for row in df.iterrows():
    df2.loc[row[0], row[1][0]] = row[1][1]
I tried something like this:
for row in df.iterrows():
    df2[row[1][0]] = df2[row[1][0]].where(row[0], row[1][1])
# results in: ValueError: Array conditional must be same shape as ...
Any ideas?
For everyone who is interested, you can use:
# DF1 -> DF2, memory efficient
df.pivot(index='date', columns='event', values='value')
see also: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
Before, it took a huge amount of time, was horribly memory hungry, and did not produce the results I was looking for. Just use the pandas pivot if you are trying to reshape your dataframe schema.
Edit: And there is no reason to use Dask anymore, which speeds the whole process up even further ;)
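To make that concrete, here is a minimal sketch of the reshape from the long DF1 layout (DATE|EVENTNAME|VALUE) to the wide DF2 layout; the sample values are invented purely for illustration:
import pandas as pd

df1 = pd.DataFrame({
    "DATE": ["2019-01-01", "2019-01-01", "2019-01-02"],
    "EVENTNAME": ["EVENTNAME0", "EVENTNAME1", "EVENTNAME0"],
    "VALUE": [1.0, 2.0, 3.0],
})

# one row per DATE, one column per EVENTNAME, VALUE in the cells
df2 = df1.pivot(index="DATE", columns="EVENTNAME", values="VALUE")
print(df2)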

Building a dataframe from many InfluxDB result series

I am using the influxdb Python library to get a list of series in my database; there are about 20k series. I am then trying to build a Pandas dataframe out of those series. I am rounding to 15S (I'd like to get rid of the nanoseconds; I also wonder why the InfluxDB Python library's documented get_list_series() call is missing in every version I've tried, but those are other questions). I'd like to end up with one big dataframe.
Here's the code:
from influxdb import DataFrameClient
import pandas as pd

# .... get series list ...

temp_df = pd.DataFrame()
for series in series_list:
    df = dfclient.query('select time,temp from {} where "series" = \'{}\''.format(location, series))[location].asfreq('15S')
    df.columns = [series]
    if temp_df.empty:
        temp_df = df
    else:
        temp_df = temp_df.join(df, how='outer')
This starts out fine, but after a few hundred series it slows down quickly, grinding nearly to a halt. I am sure I'm not using Pandas the right way, and I'm hoping you can tell me how to do it properly.
For what it's worth, I'm running this on relatively powerful hardware (which is why I believe I'm doing this the wrong way).
One more thing: the time index of each series I pull from InfluxDB may differ from all the others, which is why I'm using join. I'd like to end up with a DF that has a column for each series, with the datetimes appropriately sorted in the index; join does that.
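One likely culprit is the repeated join inside the loop, which re-copies the growing frame on every iteration. A common pattern is to collect the per-series frames in a list and align them once at the end; a minimal sketch under that assumption, keeping the query from the question unchanged:
import pandas as pd

frames = []
for series in series_list:
    df = dfclient.query(
        'select time,temp from {} where "series" = \'{}\''.format(location, series)
    )[location].asfreq('15S')
    df.columns = [series]
    frames.append(df)

# outer-align all series on their datetime index in one pass
temp_df = pd.concat(frames, axis=1).sort_index()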
