I'm trying to make the code below usable in dask.
daily_register_users = df.groupby(['gid','timestamp'])['uid'].apply(set).apply(list)
daily_register_users = daily_register_users.unstack().T
I tried unstack(), but it isn't implemented in dask, and dask's pivot_table() only supports mean, sum and count.
Thanks for any help.
My input is:
[input dataframe]
Expected output is:
[expected output]
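One possible workaround (just a sketch, assuming the deduplicated (gid, timestamp, uid) triples fit in memory, where df is the dask dataframe above) is to reduce the data in dask and do the reshaping in pandas:

# Deduplicate in dask so only unique (gid, timestamp, uid) rows remain,
# then pull the much smaller result into pandas and reshape there.
unique_triples = df[['gid', 'timestamp', 'uid']].drop_duplicates().compute()

daily_register_users = (
    unique_triples
    .groupby(['gid', 'timestamp'])['uid']
    .apply(list)      # lists of unique uids per (gid, timestamp)
    .unstack()
    .T
)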
For the life of me, I can't figure out how to combine these two dataframes. I am using the newest versions of all the software involved, including Python, Pandas and Dask.
#pandasframe has 10k rows and 3 columns -
['monkey','banana','furry']
#daskframe has 1.5m rows, 1 column, 135 partitions -
row.index: 'monkey_banana_furry'
row.mycolumn = 'happy flappy tuna'
My dask dataframe has a string as its index for accessing rows,
but when I do daskframe.loc[index_str] it returns a dask dataframe; I thought it was supposed to return one single specific row. I don't know how to access the row/value that I need from that dataframe. What I want is to input the index and get back one specific value.
What am I doing wrong?
Even pandas.DataFrame.loc doesn't return a scalar if you don't specify a label for the columns.
Anyway, to get a scalar in your case, you first need to call dask.dataframe.DataFrame.compute to get a pandas dataframe (since dask.dataframe.DataFrame.loc returns a dask dataframe), and only then can you use the pandas .loc.
Assuming dfd is your dask dataframe, try this:
dfd.loc[index_str].compute().loc[index_str, "happy flappy tuna"]
Or this:
dfd.loc[index_str, "happy flappy tuna"].compute().iloc[0]
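For illustration, here is a small self-contained sketch of both options; the data and the column name (taken from the snippets above) are made up:

import pandas as pd
import dask.dataframe as dd

# Toy stand-in for the 1.5m-row daskframe
pdf = pd.DataFrame(
    {"happy flappy tuna": ["value_a", "value_b"]},
    index=["monkey_banana_furry", "other_index"],
)
dfd = dd.from_pandas(pdf, npartitions=2)

index_str = "monkey_banana_furry"

# Option 1: compute the row selection, then use pandas .loc for the scalar
val1 = dfd.loc[index_str].compute().loc[index_str, "happy flappy tuna"]

# Option 2: select row and column lazily, compute, then grab the scalar
val2 = dfd.loc[index_str, "happy flappy tuna"].compute().iloc[0]

print(val1, val2)  # both print "value_a"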
I'm trying to organise data using a pandas dataframe.
Given the structure of the data, it seems logical to use a composite index: 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs; however, I am unable to access the data using the index.
My code can be found here:
https://repl.it/repls/OldCorruptRadius
I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks!
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before multi-indexing, filter on the fixture_id and league_id columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still effectively targeting the same rows you would have targeted had you gone through with multi-indexing the two columns.
If you have to use multi-indexing, try transposing the pandas DataFrame (.T). This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_dates`
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of event_dates
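For illustration, here is a small self-contained sketch of both approaches; the fixture data below is made up, the real data comes from the linked repl:

import pandas as pd

# Tiny hypothetical stand-in for the fixture data
features = ['league_id', 'fixture_id', 'event_date']
fixture = [
    {'league_id': 524, 'fixture_id': 592, 'event_date': '2019-08-09'},
    {'league_id': 524, 'fixture_id': 593, 'event_date': '2019-08-10'},
]

# Column-filtering approach (no MultiIndex needed)
df = pd.DataFrame(fixture, columns=features)
print(df[(df['fixture_id'] == 592) & (df['league_id'] == 524)])

# MultiIndex + transpose approach
dfm = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
print(dfm.T[524][592].loc['event_date'])  # '2019-08-09'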
I've got a little problem. I have two dask dataframes with the following format:
#DF1.csv
DATE|EVENTNAME|VALUE
#DF2.csv
DATE|EVENTNAME0|EVENTNAME1|...|EVENTNAMEX
I want to merge the values from DF1.csv into DF2.csv, at time t (DATE) and column (EVENTNAME). I'm using Dask at the moment because I'm working with huge datasets (~50 GB). I noticed that you can't directly assign values in Dask, so I tried dd.Series.where:
df[nodeid].where(time, value)  # => results in an error (trying to replicate the pandas loop shown below)
I also tried a merge, but the resulting Dask dataframe had no partitions, which results in a MemoryError because all the data gets loaded into memory when I use the .to_csv('data-*.csv') method. It should be easy to merge the dataframes, but I have no clue at the moment. Is there a Dask pro who could help me out?
Edit://
This works well in pandas but not with dask:
for row in df.iterrows():
    df2.loc[row[0], row[1][0]] = row[1][1]
I tried something like this:
for row in df.iterrows():
    df2[row[1][0]] = df2[row[1][0]].where(row[0], row[1][1])
# Results in an error => raise ValueError('Array conditional must be same shape as '
Any ideas?
For everyone who is interested, you can use:
#DF1
df.pivot(index='date', columns='event', values='value')  # to create DF2, memory-efficient
See also: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
Before, it took a huge amount of time, was horribly memory hungry, and did not produce the results I was looking for. Just use Pandas pivot if you are trying to alter your dataframe schema.
Edit:// And there is no reason to use Dask anymore, which speeds up the whole process even further ;)
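For reference, a minimal pandas sketch of that pivot; the dates, event names and values below are made up:

import pandas as pd

# Long format, as in DF1.csv: DATE | EVENTNAME | VALUE
df1 = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-01', '2019-01-02'],
    'event': ['EVENTNAME0', 'EVENTNAME1', 'EVENTNAME0'],
    'value': [1.0, 2.0, 3.0],
})

# Wide format, as in DF2.csv: one column per event name, indexed by date
df2 = df1.pivot(index='date', columns='event', values='value')
print(df2)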
I have two big dataframes: one contains 3M rows and the other contains 2M rows.
1st dataframe:
sacc_id$ id$ creation_date
0 0011200001LheyyAAB 5001200000gxTeGAAU 2017-05-30 13:25:07
2nd dataframe:
sacc_id$ opp_line_id$ oppline_creation_date
0 001A000000hAUn8IAG a0WA000000BYKoWMAX 2013-10-26
I need to merge them:
case = pd.merge(limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
But I get a memory problem:
pandas/_libs/join.pyx in pandas._libs.join.inner_join()
MemoryError:
Is there another way to do it efficiently? I read in some discussions here that Dask can help, but I do not understand how to use it in this context.
Any help please?
Thank you.
I suggest using Dask when you are dealing with large dataframes. Dask supports the Pandas dataframe and Numpy array data structures and is able to either be run on your local computer or be scaled up to run on a cluster.
You can easily convert your Pandas dataframe to a Dask dataframe, which is made up of smaller, split-up Pandas dataframes and therefore supports a subset of the Pandas query syntax.
Here is an example of how you can do it:
import dask.dataframe as dd
limdata = dd.read_csv(path_to_file_1)
df_case = dd.read_csv(path_to_file_2)
case = dd.merge(limdata, df_case, left_on='sacc_id$', right_on='sacc_id$')
There are tips on best practices for partitioning your dataframes to get better performance; I recommend reading up on them. Also, it is good practice not to have special characters like $ in your column names.
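If memory is still an issue, one option is sketched below; the blocksize value and the index-based merge are assumptions beyond the snippet above, not tested on the real data. Setting the join key as the index before merging lets Dask align sorted partitions instead of joining everything at once, and a globbed filename keeps the output partitioned:

import dask.dataframe as dd

# Smaller partitions reduce peak memory per task (blocksize is a guess)
limdata = dd.read_csv(path_to_file_1, blocksize='64MB')
df_case = dd.read_csv(path_to_file_2, blocksize='64MB')

# Index both frames on the join key; set_index does its own shuffle,
# but afterwards the merge can work partition by partition.
limdata = limdata.set_index('sacc_id$')
df_case = df_case.set_index('sacc_id$')

case = dd.merge(limdata, df_case, left_index=True, right_index=True)
case.to_csv('merged-*.csv')  # writes one output file per partition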
I am attempting to use Dask to handle a large file (50 GB). Typically, I would load it into memory and use Pandas. I want to group by two columns, "A" and "B", and whenever column "C" starts with a value, I want to repeat that value down that column for that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to Pandas.
Thank you.
My progress so far:
First, set the index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago and gave up quite quickly.
Also, dask doesn't support fillna's method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
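For what it's worth, here is one possible workaround as a sketch; the frame below is made up and this is not the approach from the linked branch. The idea is to shuffle so that each (A, B) group lands in a single partition, then forward-fill per group with plain pandas inside map_partitions:

import pandas as pd
import dask.dataframe as dd

# Hypothetical small frame in the question's layout
pdf = pd.DataFrame({
    'A': [1, 1, 1, 2, 2],
    'B': ['x', 'x', 'x', 'y', 'y'],
    'C': [10.0, None, None, 20.0, None],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Shuffle so that all rows of a given (A, B) group end up in one partition,
# then forward-fill within each group using plain pandas per partition.
shuffled = ddf.shuffle(on=['A', 'B'])
filled = shuffled.map_partitions(
    lambda part: part.assign(C=part.groupby(['A', 'B'])['C'].ffill())
)
print(filled.compute())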