I've got a little problem. I have two Dask dataframes with the following format:
#DF1.csv
DATE|EVENTNAME|VALUE
#DF2.csv
DATE|EVENTNAME0|EVENTNAME1|...|EVENTNAMEX
I want to merge the value from DF1.csv into DF2.csv at time t (DATE) and column (EVENTNAME). I am using Dask at the moment because I'm working with huge datasets (~50 GB). I noticed that you can't directly assign values in Dask, so I tried dd.Series.where:
df[nodeid].where(time, value)  # results in an error
I also tried a merge, but the resulting Dask dataframe had no partitions, which led to a MemoryError because the whole dataset gets loaded into memory when I use the .to_csv('data-*.csv') method. It should be easy to merge the dataframes, but I have no clue at the moment. Is there a Dask pro who could help me out?
Edit://
This works well in pandas but not with Dask:
for row in df.iterrows():
    df2.loc[row[0], row[1][0]] = row[1][1]
I tried something like this:
for row in df.iterrows():
    df2[row[1][0]] = df2[row[1][0]].where(row[0], row[1][1])
# Raises: ValueError: Array conditional must be same shape as self
Any ideas?
For everyone who is interested, you can use:
#DF1
df.pivot(index='date', columns='event', values='value')  # creates DF2 memory-efficiently
see also: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
Before, it took a huge amount of time, was horribly memory-hungry, and did not produce the results I was looking for. Just use pandas pivot if you need to alter your dataframe schema.
Edit:// And there is no reason to use Dask anymore, which speeds up the whole process even further ;)
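As a tiny illustration of the reshape above (hypothetical column and event names, following the DF1 schema from the question):

```python
import pandas as pd

# Hypothetical long-format DF1: one row per (date, event) pair
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'event': ['E0', 'E1', 'E0'],
    'value': [1.0, 2.0, 3.0],
})

# Pivot into the wide DF2 layout: one column per event name,
# missing (date, event) combinations become NaN
df2 = df.pivot(index='date', columns='event', values='value')
print(df2)
#              E0   E1
# date
# 2020-01-01  1.0  2.0
# 2020-01-02  3.0  NaN
```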
Related
For the life of me, I can't figure out how to combine these two dataframes. I am using the newest versions of all the software involved, including Python, Pandas, and Dask.
#pandasframe has 10k rows and 3 columns:
['monkey', 'banana', 'furry']
#daskframe has 1.5M rows, 1 column, 135 partitions:
row.index: 'monkey_banana_furry'
row.mycolumn: 'happy flappy tuna'
My Dask dataframe has a string as its index for accessing, but when I do daskframe.loc[index_str] it returns a Dask dataframe, whereas I thought it was supposed to return one single specific row. And I don't know how to access the row/value that I need from that dataframe. What I want is to input the index and get one specific value as output.
What am I doing wrong?
Even pandas.DataFrame.loc doesn't return a scalar if you don't specify a label for the columns.
Anyway, to get a scalar in your case, you first need to call dask.dataframe.DataFrame.compute so you get a pandas dataframe (since dask.dataframe.DataFrame.loc returns a Dask dataframe). Only then can you use the pandas .loc.
Assuming dfd is your Dask dataframe, try this:
dfd.loc[index_str].compute().loc[index_str, "mycolumn"]
Or this:
dfd.loc[index_str, "mycolumn"].compute().iloc[0]
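For reference, this mirrors the plain pandas behavior the answer relies on, using a tiny hypothetical frame shaped like the question's; in Dask you additionally need .compute() to materialize the result:

```python
import pandas as pd

# Hypothetical pandas version of the string-indexed frame
df = pd.DataFrame(
    {'mycolumn': ['happy flappy tuna']},
    index=['monkey_banana_furry'],
)

row = df.loc['monkey_banana_furry']                # row label only -> Series
value = df.loc['monkey_banana_furry', 'mycolumn']  # row + column label -> scalar
print(type(row).__name__, repr(value))
```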
I have two big dataframes: one contains 3M rows and the other contains 2M rows.
1st dataframe :
sacc_id$ id$ creation_date
0 0011200001LheyyAAB 5001200000gxTeGAAU 2017-05-30 13:25:07
2nd dataframe :
sacc_id$ opp_line_id$ oppline_creation_date
0 001A000000hAUn8IAG a0WA000000BYKoWMAX 2013-10-26
I need to merge them:
case = pd.merge(limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
But I get a memory problem:
pandas/_libs/join.pyx in pandas._libs.join.inner_join()
MemoryError:
Is there another way to do this efficiently? I read in some discussions here that Dask can help, but I do not understand how to use it in this context.
Any help please?
Thank you.
I suggest using Dask when you are dealing with large dataframes. Dask supports the pandas dataframe and NumPy array data structures, and it can either run on your local computer or scale up to run on a cluster.
You can easily convert your pandas dataframe to Dask, which splits it into smaller pandas dataframes and therefore supports a subset of the pandas query syntax.
Here is an example of how you can do it:
import dask.dataframe as dd
limdata = dd.read_csv(path_to_file_1)
df_case = dd.read_csv(path_to_file_2)
case = dd.merge(limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
There are tips on best practices for partitioning your dataframes to get better performance; I recommend reading up on them. It is also good practice not to have special characters like $ in your column names.
I tried to make the code below usable in Dask.
daily_register_users = df.groupby(['gid','timestamp'])['uid'].apply(set).apply(list)
daily_register_users = daily_register_users.unstack().T
I tried unstack(), but it isn't implemented in Dask, and Dask's pivot_table() only supports 'mean', 'sum', and 'count'.
thanks for any help
My input is: (input dataframe shown as an image)
My expected output is: (expected output shown as an image)
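In plain pandas (small data), the intended reshape looks like the sketch below; one possible Dask workaround, assuming the grouped result fits in memory, is to do the aggregation in Dask, call .compute(), and then unstack the resulting pandas object. Column names here follow the question; the data is hypothetical:

```python
import pandas as pd

# Hypothetical tiny input: one row per (gid, timestamp, uid) event
df = pd.DataFrame({
    'gid': [1, 1, 2],
    'timestamp': ['2021-01-01', '2021-01-01', '2021-01-01'],
    'uid': [10, 10, 12],
})

# Unique uids per (gid, timestamp), as sorted lists
daily = df.groupby(['gid', 'timestamp'])['uid'].apply(lambda s: sorted(set(s)))

# Reshape: timestamps as rows, gids as columns
daily = daily.unstack().T
print(daily)
```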
I searched almost all over the internet, and somehow none of the approaches seem to work in my case.
I have two large CSV files (each with a million+ rows and about 300-400 MB in size). They load fine into dataframes using the read_csv function without having to use the chunksize parameter.
I even performed certain minor operations on this data, like new-column generation, filtering, etc.
However, when I try to merge these two frames, I get a MemoryError. I have even tried to use SQLite to accomplish the merge, but in vain: the operation takes forever.
Mine is a Windows 7 PC with 8 GB RAM. The Python version is 2.7.
Thank you.
Edit: I tried chunking methods too. When I do this, I don't get a MemoryError, but the RAM usage explodes and my system crashes.
When you merge data using pandas.merge, it needs memory for df1, for df2, and for the merged result all at once. I believe that is why you get a MemoryError. Instead, you should read df2 in chunks with the chunksize option and merge each chunk, appending the result to a CSV file.
There might be a better way, but you can try this.
*For large datasets you can use the chunksize option in pandas.read_csv.
df1 = pd.read_csv("yourdata.csv")
df2 = pd.read_csv("yourdata2.csv")

# create an empty bucket (header only) to receive the result
df_result = pd.DataFrame(columns=(df1.columns.append(df2.columns)).unique())
df_result.to_csv("df3.csv", index_label=False)

# delete df2 to save memory; it will be re-read in chunks below
del df2

def preprocess(chunk):
    merged = pd.merge(df1, chunk, left_on="Colname1", right_on="Colname2")
    merged.to_csv("df3.csv", mode="a", header=False, index=False)

# chunksize depends on your column count and row width
reader = pd.read_csv("yourdata2.csv", chunksize=1000)
for chunk in reader:
    preprocess(chunk)

This will save the merged data as df3.csv.
The reason you might be getting MemoryError: Unable to allocate... could be duplicates or blanks in your dataframe. Check the column you are joining on (when using merge) and see whether it has duplicates or blanks. If so, get rid of them using this command:
df.drop_duplicates(subset='column_name', keep=False, inplace=True)
Then re-run your python/pandas code. This worked for me.
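A quick sketch of what that call does on hypothetical data. Note that keep=False removes every copy of a duplicated key, not just the extras, so rows are lost from the result:

```python
import pandas as pd

df = pd.DataFrame({'column_name': [1, 1, 2], 'v': ['a', 'b', 'c']})

# keep=False drops ALL rows whose key is duplicated, including the first one
df.drop_duplicates(subset='column_name', keep=False, inplace=True)
print(df)  # only the row with column_name == 2 survives
```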
In general, the chunked version suggested by @T_cat works great.
However, the memory explosion might be caused by joining on columns that have NaN values, so you may want to exclude those rows from the join.
see: https://github.com/pandas-dev/pandas/issues/24698#issuecomment-614347153
Maybe the left dataframe has NaN in the merging column, which causes the final merged dataframe to bloat.
If acceptable, fill the merging column in the left dataframe with zeros:
df['left_column'] = df['left_column'].fillna(0)
Then do the merge and see what you get.
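The blow-up is easy to reproduce: pandas treats NaN join keys as equal to each other, so every NaN row on the left pairs with every NaN row on the right (tiny hypothetical frames):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': [np.nan, np.nan, 1.0], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': [np.nan, np.nan, 2.0], 'y': [4, 5, 6]})

merged = pd.merge(left, right, on='key')
# The two NaN keys on each side pair up 2 x 2 = 4 ways;
# keys 1.0 and 2.0 have no partner, so nothing else matches.
print(len(merged))  # 4
```

With millions of NaN keys on each side, this quadratic pairing alone can exhaust memory.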
I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to repeat that value within that column for that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method='ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to pandas.
Thank you.
My progress so far:
First, set the index:
df1 = df.set_index(['A','B'])
Then group by:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears Dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago and gave up quite quickly.
Also, Dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround could be to use fillna before grouping, like so:
df['C'] = df['C'].fillna(0)
although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
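For comparison, here is the grouped forward-fill semantics the question is after, demonstrated in plain pandas on a tiny hypothetical frame (the fill never crosses a group boundary, which is what distinguishes it from a plain ffill over the whole column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': [1, 1, 1, 1],
    'C': [5.0, np.nan, np.nan, 7.0],
})

# Forward-fill C within each (A, B) group; the NaN at the start
# of group (2, 1) stays NaN because there is nothing before it.
df['C'] = df.groupby(['A', 'B'])['C'].ffill()
print(df['C'].tolist())  # [5.0, 5.0, nan, 7.0]
```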