Can I apply a vectorized function to a pandas dataframe? - python

I am pretty new to pandas and numpy, and I'm trying to figure out the best way to do some things.
Right now I am trying to call a function on every row of a dataframe. If I pass in three numpy arrays to this function, it's very fast, but using apply on the dataframe is very slow.
My guess is that numpy is using vectorized functions in the first case and not in the second. Is there a way to get pandas to use that optimization? Basically, in pseudocode, I think apply is doing something like for row in frame: func(row['a'], row['b'], row['c']), whereas I want it to call func once on whole columns: func(frame['a'], frame['b'], frame['c']).
Here is an example of what I am trying to do.
import numpy as np
import pandas as pd
from scipy.stats import beta
count = 100000
# If I start with a given dataframe and use apply, it's very slow
df = pd.DataFrame(np.random.uniform(0, 1, size=(count, 3)), columns=['a', 'b', 'c'])
df.apply(lambda frame: beta.cdf(frame['a'], frame['b'], frame['c']), axis=1)
# However, if I split out each column into a numpy array, this is very fast.
a = df['a'].to_numpy()  # .as_matrix() was removed in recent pandas; .to_numpy() (or .values) is the replacement
b = df['b'].to_numpy()
c = df['c'].to_numpy()
beta.cdf(a, b, c)
# But at this point I've lost the context of the dataframe.
# I would like to keep the results in a new column for further processing

It's not clear why you're trying to use apply. You can just do beta.cdf(df.a, df.b, df.c).
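To keep the result as a new column, as the question asks, the vectorized call can be assigned directly. A minimal sketch, reusing the df from the question (the column name cdf is chosen arbitrarily):
import numpy as np
import pandas as pd
from scipy.stats import beta

df = pd.DataFrame(np.random.uniform(0, 1, size=(100000, 3)), columns=['a', 'b', 'c'])
# beta.cdf accepts whole columns (Series), so no row-wise apply is needed
df['cdf'] = beta.cdf(df['a'], df['b'], df['c'])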

Related

Calling a Python function/class that takes an entire pandas dataframe or series as input, for all rows in another dataframe

I have a Python class that takes a geopandas Series or DataFrame to initialize (I'm specifically working with geopandas, but I imagine the solution is the same as for pandas). This class has attributes/methods that use the various columns in the series/dataframe. Separately, I have a dataframe with many rows. I would like to iterate through this dataframe (ideally in an efficient/parallel manner, since each row is independent of the others), call a method of the class for each row (i.e. a Series), and append the results as a new column to the dataframe. But I am having trouble with this. With the standard list comprehension/pandas apply() methods, I can call it like this, e.g.:
gdf1['function_return_col'] = list(map((lambda f: my_function(f)), gdf2['date']))
But if said function (or in my case, class) needs the entire gdf, and I call like this:
gdf1['function_return_col'] = list(map((lambda f: my_function(f)), gdf2))
It does not work because my_function() takes a dataframe or series, while what actually gets passed to it are the column names (strings) of gdf2, since iterating over a dataframe yields its column labels.
How can I apply a function to all rows in a dataframe if said function takes an entire dataframe/series and not just select column(s)? In my specific case, since it's a method in a class, I would like to do this, or something similar to call this method on all rows in a dataframe:
gdf1['function_return_col'] = list(map((lambda f: my_class(f).my_method()), gdf2))
Or am I just thinking of this in the entirely wrong way?
Have you tried the pandas DataFrame method apply?
Here is an example of using it along both the column axis and the row axis.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
df1 = df.apply(np.sum, axis=0)  # axis=0: np.sum receives each column as a Series
print(df1)
df1 = df.apply(np.sum, axis=1)  # axis=1: np.sum receives each row as a Series
print(df1)
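Since axis=1 hands each row to the function as a Series, the same idea covers the class case from the question. A minimal sketch, where my_class and my_method are stand-ins for the (hypothetical here) class and method:
import pandas as pd

class my_class:  # stand-in for the class described in the question
    def __init__(self, row):  # receives one row as a pandas Series
        self.row = row
    def my_method(self):      # hypothetical method that uses the row's columns
        return self.row.sum()

gdf2 = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
gdf1 = pd.DataFrame(index=gdf2.index)
# axis=1 passes each row (a Series) to the lambda, not the column names
gdf1['function_return_col'] = gdf2.apply(lambda row: my_class(row).my_method(), axis=1)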

Efficiently apply function to every row of a dataframe depending on another dataframe

Introduction
I have two dataframes. I would like to apply a function to each row of the first one. This function depends on the row and the entire second dataframe. I would like to do this efficiently.
Reproducible Example
Setting up the dataframes
import pandas as pd
import numpy as np
Let the two dataframes be:
df0 = pd.DataFrame.from_dict({'a':np.random.normal(0,1,5),'b':np.random.normal(0,1,5)})
df1 = pd.DataFrame.from_dict({'c':np.random.normal(0,1,10),'d':np.random.normal(0,1,10)})
(In the real application, they are much bigger.)
I would like to find which row from df1 is closest to each row in df0, where closest is defined as having the least squared_dist between them:
def squared_dist(x, y):
    return np.sum(np.square(x - y))
What I have tried
What I do is create two numpy arrays from the dataframes:
df0np=df0.to_numpy()
df1np=df1.to_numpy()
Iterate through these arrays:
res = []
for row in df0np:
    distances = [squared_dist(row, df1np[i, ]) for i in range(len(df1np))]
    index = np.argmin(distances)
    res.append(index)
Add the result to df0 as a new column:
df0['res']=res
How fast is it?
The whole code in one piece, including timings for the method described above:
import time
import pandas as pd
import numpy as np

def squared_dist(x, y):
    return np.sum(np.square(x - y))

df0 = pd.DataFrame.from_dict({'a': np.random.normal(0, 1, 5), 'b': np.random.normal(0, 1, 5)})
df1 = pd.DataFrame.from_dict({'c': np.random.normal(0, 1, 10), 'd': np.random.normal(0, 1, 10)})
start = time.time()
df0np = df0.to_numpy()
df1np = df1.to_numpy()
res = []
for row in df0np:
    distances = [squared_dist(row, df1np[i, ]) for i in range(len(df1np))]
    index = np.argmin(distances)
    res.append(index)
df0['res'] = res
end = time.time()
print(end - start)  # prints 0.0014030933380126953
Question
How could I make this more efficient, i.e. how could I achieve lower execution times? This method works fine for the example above, but in my real-world application, where the dataframes are much bigger, it is unusably slow.
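Not part of the original thread, but a fully vectorized sketch of the same nearest-row search using NumPy broadcasting (it produces the same res column as the loop, assuming df0 and df1 are built as above):
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'a': np.random.normal(0, 1, 5), 'b': np.random.normal(0, 1, 5)})
df1 = pd.DataFrame({'c': np.random.normal(0, 1, 10), 'd': np.random.normal(0, 1, 10)})

a = df0.to_numpy()                    # shape (n0, 2)
b = df1.to_numpy()                    # shape (n1, 2)
diff = a[:, None, :] - b[None, :, :]  # shape (n0, n1, 2): all pairwise differences
sq_dist = (diff ** 2).sum(axis=2)     # squared distances, shape (n0, n1)
df0['res'] = sq_dist.argmin(axis=1)   # index of the closest row of df1 for each row of df0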

Python Dataframe Create a rolling aggregate of list column with a window

I have a df that has a column of lists, built as in the SO question "Python Pandas rolling aggregate a column of lists":
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
I am wondering if there is a way to create a rolling aggregate of the 'single_input_vector' column for a given window. The SO question linked above does not provide a way to include a window. In my case, the desired output column for a window of 3 would be:
Row1: [[24.68, 164.93]]
Row2: [[24.68, 164.93], [24.18, 164.89]]
Row3: [[24.68, 164.93], [24.18, 164.89], [23.99, 164.63]]
Row4: [[24.18, 164.89], [23.99, 164.63], [24.14, 163.92]]
and so on.
I can't think of a more efficient way to do this, so while it does work, there may be performance constraints on massive data sets.
The idea is to use a rolling count to build start:stop slicing indices for each row.
import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
input_cols = ['A', 'B']
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
window = 3
# rolling count = how many rows fall inside the current window (fewer near the top of the frame)
df['len'] = df['A'].rolling(window=window).count()
# for each row, slice out the list values of the previous `window` rows up to and including this one
df['vector_list'] = df.apply(
    lambda x: df['single_input_vector'][max(0, x.name - (window - 1)):int(x.name) + 1].values,
    axis=1)
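An equivalent sketch without apply, assuming the default RangeIndex that read_csv produces here, is a plain list comprehension over positional slices:
# each entry collects the list values of the previous `window` rows up to the current one
df['vector_list'] = [df['single_input_vector'].iloc[max(0, i - window + 1):i + 1].tolist()
                     for i in range(len(df))]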

Subset dask dataframe by column position

Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the column selection is arbitrary?
from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)
What I would like to do:
in_memory = ddf.iloc[:,2:4].compute()
What I have been able to do:
ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()
map_partitions works but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.
Although iloc is not implemented for dask-dataframes, you can achieve the indexing easily enough as follows:
cols = list(ddf.columns[2:4])
ddf[cols].compute()
This has the additional benefit that dask immediately knows the types of the selected columns and needs to do no extra work. For the map_partitions variant, dask at the very least needs to check the data types produced, since the function you call is completely arbitrary.
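Applied to the iris ddf built in the question, a brief usage sketch:
# select columns 2:4 by position, then pull just those into memory as a pandas DataFrame
cols = list(ddf.columns[2:4])
in_memory = ddf[cols].compute()
print(in_memory.shape)  # (150, 2)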

Data Frame Indexing

Using Python 3, I wrote the following code to calculate some data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def data(symbols):
    dates = pd.date_range('2016/01/01', '2016/12/23')
    df = pd.DataFrame(index=dates)
    for symbol in symbols:
        df_temp = pd.read_csv("/home/furqan/Desktop/Data/{}.csv".format(symbol),
                              index_col='Date', parse_dates=True, usecols=['Date', 'Close'],
                              na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    df = df / df.iloc[0, :]  # normalise by the first row (.ix is removed in recent pandas; .iloc is the positional equivalent)
    return df
symbols = ['FABL','HINOON']
df=data(symbols)
print(df)
p_value=(np.zeros((2,2),dtype="float"))
p_value[0,0]=0.5
p_value[1,1]=0.5
print(df.shape[1])
print(p_value.shape[0])
df=np.dot(df,p_value)
print(df.shape[1])
print(df.shape[0])
print(df)
When I print df the second time, the index has vanished. I think the issue is due to the matrix multiplication. How can I get the index and column headings back into df?
The issue is that you are using a numpy function: np.dot returns a plain numpy array, which is why the existing column and index labels are lost.
So instead of
df=np.dot(df,p_value)
you can do
df=df.dot(p_value)
Additionally, because p_value is a plain numpy array, it carries no labels, so you can either build a DataFrame for it (its index must match df's columns for df.dot to align):
p_value = pd.DataFrame(np.zeros((2, 2), dtype="float"), index=df.columns, columns=df.columns)
or just overwrite the column names directly after calculating the dot product like so:
df.columns = ['FABL', 'HINOON']
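A minimal sketch of the difference, using a toy frame rather than the CSV data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'FABL': [1.0, 2.0], 'HINOON': [3.0, 4.0]},
                  index=pd.date_range('2016-01-01', periods=2))
p_value = pd.DataFrame(np.diag([0.5, 0.5]), index=df.columns, columns=df.columns)

print(type(np.dot(df, p_value)))  # numpy.ndarray: index and columns are gone
print(df.dot(p_value))            # DataFrame: date index and column names preserved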
