Most efficient way of mapping column in pandas DataFrame - python

I was wondering whether the map method is the best option when a simple mapping is needed for a column, since using map or apply is usually considered a bad idea.

I compared the following functions for the simple case below. Please share if you have better alternatives.
# Case - Map the random number to its string
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,7,size=(5000,1)), columns=['A'])
dikt = {1:'1',2:'2',3:'3',4:'4',5:'5',6:'6'}
First function - using map method:
def f1():
    df1 = df.copy()
    df1['B'] = df['A'].map(dikt)
    return df1
Results:
Second function - using to_list method in column:
def f2():
    df2 = df.copy()
    column_list = df2['A'].tolist()
    df2['B'] = [dikt[i] for i in column_list]
    return df2
Results:
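A third option (my sketch, not from the original post): since the keys are small consecutive integers, the dict can be replaced with a NumPy lookup array and fancy indexing, which avoids per-element dict lookups entirely:
def f3():
    df3 = df.copy()
    # Lookup table indexed by value; position 0 is unused padding.
    lookup = np.array(['0', '1', '2', '3', '4', '5', '6'])
    df3['B'] = lookup[df3['A'].to_numpy()]
    return df3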

Related

column filter and multiplication in dask dataframe

I am trying to replicate the following operation on a dask dataframe, where I have to filter the dataframe based on a column value and multiply another column by a factor accordingly.
The pandas equivalent is:
import dask.dataframe as dd
df['adjusted_revenue'] = 0
df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
I am trying to do this on a dask dataframe, but it doesn't support .loc item assignment:
TypeError: '_LocIndexer' object does not support item assignment
This is working for me -
df['adjusted_revenue'] = 0
df1 = df.loc[df['tracked'] == 1]
df1['adjusted_revenue'] = 0.7 * df1['gross_revenue']
df2 = df.loc[df['tracked'] == 0]
df2['adjusted_revenue'] = 0.3 * df2['gross_revenue']
df = dd.concat([df1, df2])
However, I was hoping there is a simpler way to do this.
Thanks!
You should use .apply, which is probably the right thing to do with pandas too; or perhaps where. However, to keep things as similar to your original as possible, here it is with map_partitions, in which you act on each piece of the dataframe independently, and those pieces really are pandas dataframes.
def make_col(df):
    df['adjusted_revenue'] = 0
    df.loc[df.tracked == 1, 'adjusted_revenue'] = 0.7 * df['gross_revenue']
    df.loc[df.tracked == 0, 'adjusted_revenue'] = 0.3 * df['gross_revenue']
    return df
new_df = df.map_partitions(make_col)
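For reference, here is a sketch of the where-based variant the answer alludes to, assuming tracked only takes the values 0 and 1; the same expression works on both pandas and dask dataframes:
# Vectorized alternative: take 0.7x where tracked == 1, else 0.3x.
df['adjusted_revenue'] = (0.7 * df['gross_revenue']).where(
    df['tracked'] == 1, 0.3 * df['gross_revenue'])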

Efficient way to empty a pandas dataframe using its mutable alias property

So I want to create a function in which part of the code modifies an existing pandas dataframe df, and under some conditions the df will be modified to empty. The challenge is that this function is not allowed to return the dataframe itself; it can only modify the df by handling the alias. An example of this is the following function:
import pandas as pd
import random
def random_df_modifier(df):
    letter_lst = list('abc')
    message_lst = [f'random {i}' for i in range(len(letter_lst) - 1)] + ['BOOM']
    chosen_tup = random.choice(list(zip(letter_lst, message_lst)))
    df[chosen_tup[0]] = chosen_tup[1]
    if chosen_tup[0] == letter_lst[-1]:
        print('Game over')
        df = pd.DataFrame()  # <-- this line won't work as intended
    return chosen_tup
testing_df = pd.DataFrame({'col1': [True, False]})
print(random_df_modifier(testing_df))
I am aware that the reason df = pd.DataFrame() won't work is that the local name df is now bound to the new pd.DataFrame() object instead of to the mutable alias of the input dataframe. So is there any way to change the df in place to an empty dataframe?
Thank you in advance.
EDIT1: df.drop(df.index, inplace=True) seems to work as intended, but I am not sure about its efficiency, because df.drop() may suffer from performance issues when the dataframe is big enough (by big enough I mean 1M+ total entries).
df = pd.DataFrame(columns=df.columns)
will empty a dataframe in pandas (and be way faster than using the drop method).
I believe that is what you're asking.
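Note, however, that inside the function this assignment rebinds the local name just like df = pd.DataFrame() did, so it will not propagate through the alias. A minimal sketch of an in-place variant, building on the drop approach from EDIT1:
def empty_in_place(df):
    # Mutate the caller's object instead of rebinding the local name.
    df.drop(df.index, inplace=True)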

Applying functions declared as strings to a pandas dataframe

I have a pandas dataframe. I want to create new columns in the dataframe whose values are mathematical functions of the existing columns.
I know how to do it for simple cases:
import pandas as pd
import numpy as np
# Basic dataframe
df = pd.DataFrame(data={'col1': [1,2], 'col2':[3,5]})
for i in df.columns:
    df[f'{i}_sqrt'] = df[i].apply(lambda x: np.sqrt(x))
produces

   col1  col2  col1_sqrt  col2_sqrt
0     1     3   1.000000   1.732051
1     2     5   1.414214   2.236068
Now I want to extend it to the cases where the functions are written as strings like:
import itertools

one_func = ['(x)', '(np.sqrt(x))']
two_func = ['*'.join(i) for i in itertools.product(one_func, one_func)]
so that two_func = ['(x)*(x)','(x)*(np.sqrt(x))','(np.sqrt(x))*(x)', '(np.sqrt(x))*(np.sqrt(x))']. Is there any way I can create columns like the first example with these new functions?
That looks like a bad design, but I won't go down that road.
Answering your question, you can use df.eval
First of all, set
one_func = ['{x}', '(sqrt({x}))']
with {} instead of (), so that you can substitute {x} with your actual column name.
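Then rebuild the product list with the new placeholders (this step is implied by the answer):
import itertools

one_func = ['{x}', '(sqrt({x}))']
two_func = ['*'.join(i) for i in itertools.product(one_func, one_func)]
# two_func == ['{x}*{x}', '{x}*(sqrt({x}))', '(sqrt({x}))*{x}', '(sqrt({x}))*(sqrt({x}))']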
Then, for instance,
expr = two_func[0].format(x='col1')
df.eval(expr)
The full loop would look like:
for col in df.columns:
    for func in two_func:
        expr = func.format(x=col)
        df[expr] = df.eval(expr)
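As a quick sanity check (my addition, not from the thread): for col1 the first expression formats to 'col1*col1', and evaluating it on the basic dataframe gives the expected squares:
print(df.eval('col1*col1'))
# 0    1
# 1    4
# dtype: int64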

pandas conditional selection - returning a view not a copy

I have an original pandas DataFrame and a chain of objects derived from it by conditional selection. Each time I do a conditional selection, pandas creates a new dataframe. In other words:
import pandas as pd
df = pd.DataFrame(dict(A=range(3,23), B=range(5,25)))
print(id(df))
df2 = df[df['A'] > 15]
print(id(df2))
df = pd.DataFrame(dict(A=range(3,43), B=range(5,45)))
print(id(df))
# output:
139963862409288
139963862409456
139963862275296
In the above example, I want df2 to change when I update df. I know that because I rebind the variable df to a new pandas DataFrame (a new object), its id changes and df2 is no longer connected to the new df. Is there any way to do it the way I want? Is there any method/attribute in pandas to keep the connection between the original DataFrame and my conditional selection, or any Pythonic way I'm not aware of?
What are you trying to accomplish? Maybe it can be accomplished in a different way?
Regarding having views instead of copies -- when you select a single row or column, you have a view. The code below demonstrates this:
import pandas as pd

df = pd.DataFrame(dict(A=range(8, 13), B=range(10, 15), C=range(-3, 2)))
print(df)
print('-----------')
dfa = df['A']      # single column
df2 = df.loc[2]    # single row, selected by label
dfi = df.iloc[2]   # single row, selected by position
dfa[2] = 42        # these writes show up in df wherever the selection is a view
df2['B'] = 99
dfi['C'] = -1
print(df)
print(dfa)
print(df2)
print(dfi)
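If the goal is a conditional selection that stays current as df is rebound, one option (a sketch, not from the thread) is to recompute the selection on demand instead of storing a copy:
def current_selection():
    # Re-evaluate against whatever df is bound to right now.
    return df[df['A'] > 15]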

Efficiently reconstruct DataFrame using oversampled index

I have two DataFrames: df1 and df2
both df1 and df2 are derived from the same original data set, which has a DatetimeIndex.
df2 still has a DatetimeIndex.
Whereas, df1 has been oversampled and now has an int index with the prior DatetimeIndex as a 'Date' column within it.
I need to reconstruct df2 so that it aligns with df1, i.e. I'll need to duplicate the rows that were oversampled, order them, and set them onto the same int index that df1 has.
Currently I'm using the two functions below, but they are painfully slow. Is there any way to speed this up? I haven't been able to find a built-in function that does this. Is there one?
def align_data(idx_col, data):
    new_data = pd.DataFrame(index=idx_col.index, columns=data.columns)
    for label, group in idx_col.groupby(idx_col):
        if len(group.index) > 1:
            slice = expanded(data.loc[label], len(group.index)).values
        else:
            slice = data.loc[label]
        new_data.loc[group.index] = slice
    return new_data

def expanded(row, l):
    return pd.DataFrame(data=[row for i in np.arange(l)], index=np.arange(l), columns=row.index)
A test can be generated using the code below:
import pandas as pd
import numpy as np
import datetime as dt
dt_idx = pd.date_range(start='1990-01-01', end='2018-07-02', freq='B')
df1 = pd.DataFrame(data=np.zeros((len(dt_idx),20)),index=dt_idx)
df1.index.name = 'Date'
df2 = df1.copy()
df1 = pd.concat([df1, df1.sample(len(dt_idx) // 2)], axis=0)
df1.reset_index(drop=False,inplace=True)
t = dt.datetime.now()
df2_aligned = align_data(df1['Date'],df2)
print(dt.datetime.now()-t)
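For what it's worth, a vectorized sketch (not from the original thread): .loc repeats rows for repeated labels, so indexing df2 by the oversampled 'Date' column reproduces the alignment in one step:
# Duplicate dates yield duplicate rows; then adopt df1's integer index.
df2_fast = df2.loc[df1['Date']]
df2_fast.index = df1.index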
