I have a DataFrame with a multi-index ['timestamp', 'symbol'] that contains timeseries data. I am merging this data with other samples, and my apply function that uses asof looks similar to:
df.apply(lambda x: df2.xs(x['symbol'], level='symbol').index.asof(x['timestamp']), axis=1)
I think the xs call to filter on symbol is what is making it so slow, so instead I am building a dict of 'symbol' -> DataFrame whose values are already filtered, so that I can call index.asof directly. Am I approaching this the wrong way?
Example:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("ts,symbol,bid,ask\n2014-03-03T09:30:00,A,54.00,55.00\n2014-03-03T09:30:05,B,34.00,35.00"), parse_dates=['ts'], index_col=['ts', 'symbol'])
df2 = pd.read_csv(StringIO("ts,eventId,symbol\n2014-03-03T09:32:00,1,A\n2014-03-03T09:33:05,2,B"), parse_dates=['ts'])
# find ts to join with and use xs so we can use indexof
df2['event_ts'] = df2.apply(lambda x: df.xs(x['symbol'], level='symbol').index.asof(x['ts']), axis=1)
# merge in fields
df2 = pd.merge(df2, df, left_on=['event_ts', 'symbol'], right_index=True)
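Not part of the original post, but if the pandas version in use has it (0.19 or later), pd.merge_asof with by='symbol' does the per-symbol as-of lookup in one vectorized call and avoids the row-by-row xs/asof entirely. A minimal sketch, starting again from the freshly loaded df and df2 above (both inputs have to be sorted on the 'ts' key):
quotes = df.reset_index().sort_values('ts')   # ts, symbol, bid, ask as plain columns
events = df2.sort_values('ts')                # ts, eventId, symbol
merged = pd.merge_asof(events, quotes, on='ts', by='symbol')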
I have two dataframes, df1 and df2, and I know that df2 is a subset of df1. What I am trying to do is find the set difference between df1 and df2, such that df1 has only entries that are different from those in df2. To accomplish this, I first used pandas.util.hash_pandas_object on each of the dataframes, and then found the set difference between the two hashed columns.
df1['hash'] = pd.util.hash_pandas_object(df1, index=False)
df2['hash'] = pd.util.hash_pandas_object(df2, index=False)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
This results in df1 remaining the same size; that is, none of the hash values matched. However, when I use a lambda function, df1 is reduced by the expected amount.
df1['hash'] = df1.apply(lambda x: hash(tuple(x)), axis=1)
df2['hash'] = df2.apply(lambda x: hash(tuple(x)), axis=1)
df1 = df1.loc[~df1['hash'].isin(df2['hash'])]
The problem with the second approach is that it takes an extremely long time to execute (df1 has about 3 million rows). Am I just misunderstanding how to use pandas.util.hash_pandas_object?
The difference is that in the first case you are hashing the complete dataframe, while in the second case you are hashing each individual row.
If your objective is to remove the duplicate rows, you can achieve this faster with a left merge using the indicator option, and then keep only the rows that are unique to the original dataframe.
df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)  # list_columns: the columns shared by both frames
df_merged = df_merged[df_merged['_merge'] == "left_only"]  # indicator=True adds a '_merge' column; keep only rows with no match in df2
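A minimal, self-contained illustration of the indicator pattern (the frames and list_columns here are made up for the example):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
df2 = pd.DataFrame({'a': [2, 3], 'b': ['y', 'z']})   # a subset of df1
list_columns = ['a', 'b']                            # the columns shared by both frames

df_merged = df1.merge(df2, how='left', on=list_columns, indicator=True)
result = df_merged[df_merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)   # only the row (1, 'x') remains, since it is unique to df1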
I am trying to explode a list column in my dataframe and merge it back into the df, but I get a memory error while merging the flattened column with the initial dataframe. I would like to know if I can merge it in chunks so that I can overcome the memory issue.
def flatten_colum_with_list(df, column, reset_index=False):
    # build one (original index, element) row per list item, then join it back
    column_to_flatten = pd.DataFrame([[i, x] for i, y in df[column].apply(list).items() for x in y], columns=['I', column])
    column_to_flatten = column_to_flatten.set_index('I')
    df = df.drop(column, axis=1)
    df = df.merge(column_to_flatten, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    return df
I would appreciate any support.
Regarding this, you can simply use the built-in explode instead:
df = df.explode(column, ignore_index=True)
Here column is the name of the list column to flatten, and ignore_index=True resets the resulting index to 0, 1, 2, ...
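As a concrete illustration (the frame and column names are made up; ignore_index needs pandas 1.1 or later):
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'tags': [['a', 'b'], ['c']]})
flat = df.explode('tags', ignore_index=True)
print(flat)
#    id tags
# 0   1    a
# 1   1    b
# 2   2    c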
What seems to be a simple function returns NaNs instead of the actual numbers. What am I missing here?
#Concatenate the dataframes:
dfcal = dfcal.astype(float)
dfmag = dfmag.astype(float)
print('dfcal\n-----',dfcal)
print('dfmag\n-----',dfmag)
df = pd.concat([dfcal,dfmag])
print('concatresult\n-----',df)
Cheers!
I guess you need axis=1 to append new columns, and to select only the caliper column to avoid duplicated depth columns:
df = pd.concat([dfcal['caliper'],dfmag], axis=1)
Or:
df = pd.concat([dfcal.drop('depth', axis=1),dfmag], axis=1)
Also check the join and axis parameters, or use merge. From the pd.concat documentation:
join : {'inner', 'outer'}, default 'outer'
df = pd.concat([dfcal,dfmag], join='inner')
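If the two frames are not aligned on the same index but do share a depth column (which the drop('depth', axis=1) variant above suggests), the merge route would look something like this sketch, which is not from the original answer:
df = dfcal.merge(dfmag, on='depth', how='inner')  # join on the shared depth column; 'inner' keeps only depths present in both frames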
I concatenated a df (6000, 13) with a sampleDF (6000, 1) and found that the column index of the resulting pandas df ranges from 0 to 12 for the original df, as expected, and then shows 0 again for the concatenated sampleDF column.
df = pd.concat([df, sampleDF], axis=1)
I am trying to reset this and have tried the following, but nothing seems to have any effect. Are there any other methods I can try, or any thoughts on why this may be happening?
df = df.reset_index(drop=True)
df = df.reindex()
df.index = range(len(df.index))
df.index = pd.RangeIndex(len(df.index))
I have also tried to append .reset_index(drop=True) to my original concat.
The only thing I can think of is that my data frame is one-dimensional after processing and should perhaps be a pandas Series instead?
Edit
I found a workaround if I transpose and then transpose again. There has to be a better way than this.
df = pd.concat([df, sampleDF], axis=1)
df = df.transpose()
df.index = range(len(df.index))
df = df.transpose()
You can simply rename your columns directly:
df = pd.concat([df, sampleDF], axis=1)
df.columns = range(len(df.columns))
This will be more efficient than repeatedly transposing df.
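A small self-contained illustration of what is happening to the column labels (shapes shrunk down from the 6000 x 13 and 6000 x 1 in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 4)))        # columns are 0, 1, 2, 3
sampleDF = pd.DataFrame(np.ones((3, 1)))   # its single column is also labelled 0
out = pd.concat([df, sampleDF], axis=1)
print(list(out.columns))                   # [0, 1, 2, 3, 0]  <- duplicate column label, not a row index problem
out.columns = range(len(out.columns))
print(list(out.columns))                   # [0, 1, 2, 3, 4]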
How would I go about creating a new column that is the result of a groupby and apply on another column, while keeping the order of the dataframe (or at least being able to sort it back)?
example:
I want to normalize a signal column by group
import dask
import numpy as np
import pandas as pd
from dask import dataframe
def normalize(x):
    return (x - x.mean()) / x.std()
data = np.vstack([np.arange(2000), np.random.random(2000), np.round(np.linspace(0, 10, 2000))]).T
df = dataframe.from_array(data, columns=['index', 'signal', 'id_group'], chunksize=100)
df = df.set_index('index')
normalized_signal = df.groupby('id_group').signal.apply(normalize, meta=pd.Series(name='normalized_signal_by_group'))
normalized_signal.compute()
I do get the right series, but the index is shuffled.
How do I get this series back into the dataframe?
I tried
df['normalized_signal'] = normalized_signal
df.compute()
but I get
ValueError: Not all divisions are known, can't align partitions. Please use set_index to set the index.
I also tried a merge, but my final dataframe ends up shuffled, with no easy way to re-sort it along the index:
df2 = df.merge(normalized_signal.to_frame(), left_index=True, right_index=True, how='left')
df2.compute()
It works when I compute the series and then sort_index() in pandas, but that doesn't seem efficient:
df3 = df.merge(normalized_signal.to_frame().compute().sort_index(), left_index=True, right_index=True, how='left')
df3.compute()
The equivalent pandas way is:
df4 = df.compute()
df4['normalized_signal_by_group'] = df4.groupby('id_group').signal.transform(normalize)
df4
Unfortunately transform is not implemented in dask yet. My (ugly) workaround is:
import numpy as np
import pandas as pd
import dask.dataframe as dd
pd.options.mode.chained_assignment = None
def normalize(x):
    return (x - x.mean()) / x.std()

def dask_norm(gp):
    # normalize within the group and hand the whole group back as a plain
    # array, so it can be unpacked into a frame after compute()
    gp["norm_signal"] = normalize(gp["signal"].values)
    return gp.to_numpy()  # as_matrix() was removed in pandas 1.0
data = np.vstack([np.arange(2000), np.random.random(2000), np.round(np.linspace(0, 10, 2000))]).T
df = dd.from_array(data, columns=['index', 'signal', 'id_group'], chunksize=100)
df1 = df.groupby("id_group").apply(dask_norm, meta=pd.Series(name="a"))
df2 = df1.to_frame().compute()
df3 = pd.concat([pd.DataFrame(a) for a in df2.a.values])
df3.columns = ["index", "signal", "id_group", "normalized_signal_by_group"]
df3.sort_values("index", inplace=True)
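A possible follow-up, not part of the original answer: if the original integer labels are wanted back as the actual index, one extra step after the sort restores them:
df3 = df3.set_index("index")  # restore the original row labels so df3 lines up with the source data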