Speeding up complex functions on pandas

I am filling in NaN values in one column of my dataframe using the following code:
for i in tqdm(range(nadf.shape[0])):
    a = nadf["primary"][i]
    nadf["count"][i] = np.ceil(d[a]*a)
This code replaces the NaN values in the "count" by multiplying the corresponding value of the "primary" in a dictionary d with the value of "primary". The nadf has 16 million rows. I understand that the execution will be slow, but is there a method to speed this up?

If I understood your question and the values in your dataframe correctly, the problem can be solved with pandas' built-in functionality, as follows.
Please follow the comments in the code, and feel free to ask questions.
import math
import numpy as np
import pandas as pd

def fill_nan(row, _d):
    """Fill NaN values in the "count" column based on the "primary" column value and the dictionary _d."""
    if math.isnan(row["count"]):
        # same formula as in the question: ceil(d[a] * a)
        return np.ceil(_d[row["primary"]] * row["primary"])
    return row["count"]  # not NaN: keep the existing value

if __name__ == "__main__":
    d = {1: 10, 2: 20, 3: 30}
    df = pd.DataFrame({
        "primary": [1, 2, 3, 1, 2, 1, 2],
        "count": [10.1, 4, 5, np.nan, np.nan, 4, np.nan]
    })
    df["count"] = df.apply(lambda row: fill_nan(row, d), axis=1)  # replaces the NaN values here
    print(df)
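
A row-wise apply still calls a Python function once per row, so on 16 million rows a vectorized variant may be noticeably faster. The following is only a sketch, assuming every value of "primary" is a key of d and that only the NaN entries of "count" should be replaced; it reuses the sample data from the answer, and the names mask and primary are introduced here for illustration.

```python
import numpy as np
import pandas as pd

d = {1: 10, 2: 20, 3: 30}
df = pd.DataFrame({
    "primary": [1, 2, 3, 1, 2, 1, 2],
    "count": [10.1, 4, 5, np.nan, np.nan, 4, np.nan],
})

# Boolean mask of the rows whose "count" is missing
mask = df["count"].isna()

# Look up d[primary] for those rows in one vectorized step
primary = df.loc[mask, "primary"]

# Same formula as in the question, ceil(d[a] * a), applied to whole columns
df.loc[mask, "count"] = np.ceil(primary.map(d) * primary)

print(df)
```

Series.map returns NaN for keys that are missing from the dictionary, so unknown "primary" values would stay NaN instead of raising a KeyError.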

Related

How to apply rolling mean function while keeping all the observations with duplicated indices in time

I have a dataframe with duplicated time indices, and I would like to get the mean across all observations for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as "Apply rolling mean function on data frames with duplicated indices in pandas"), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t    v2
1    -
2    -
3    4.167
4    5
5    6.667

A rough proposal: concatenate 2 copies of the input frame in which the values in 't' are replaced by 't+1' and 't+2' respectively. This way, the column 't' comes to mean "the target day", and each observation is counted towards the two days that follow it.
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n = df.shape[0]  # avoid shadowing the built-in len
incr = pd.DataFrame({'id': [0]*n, 't': [1]*n, 'v1': [0]*n})  # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # Drop the days that have no full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
         v2
t
3  4.166667
4  5.000000
5  6.666667
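
The same numbers can also be reached without concatenating copies of the frame, by aggregating per day first and then summing over the two previous days. This is a sketch under the assumption that 't' holds integer day numbers; the names daily, prev and v2 are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Per-day sum and number of observations (duplicates are all kept)
daily = df.groupby('t')['v1'].agg(['sum', 'count'])

# Make every day explicit, plus one extra target day after the last one
days = range(daily.index.min(), daily.index.max() + 2)
daily = daily.reindex(days, fill_value=0)

# For each target day, add up the sums and counts of the 2 previous days
prev = daily.shift(1).rolling(2).sum()

# Mean over all observations of the 2 previous days
v2 = (prev['sum'] / prev['count']).dropna().rename('v2')
print(v2)
```

Dividing the summed values by the summed counts weights every observation equally, which is what a mean "across all observations" requires; averaging the two daily means would not.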

Thank you for all the help. I ended up using groupby + rolling (2 Day), and then drop duplicates (keep the last observation).

pandas largest value per group with multi columns / why does it only work when flattening?

For a pandas dataframe of:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'id': [1, 1, 2, 1], 'anomaly_score':[5, 10, 8, 100], 'match_level_0':[np.nan, 1, 1, 1], 'match_level_1':[np.nan, np.nan, 1, 1], 'match_level_2':[np.nan, 1, 1, 1]
})
display(df)
df = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])
I want to calculate the largest rows per group.
df.columns = ['__'.join(col).strip() for col in df.columns.values]
df.groupby(['id'])['anomaly_score__mean'].nlargest(2)
This works, but it requires flattening the MultiIndex of the columns first.
Instead, I want to use the tuple key directly:
df.groupby(['id'])[('anomaly_score', 'mean')].nlargest(2)
But this fails because the key is not found.
Interestingly, it works just fine when not grouping:
df[('anomaly_score', 'mean')].nlargest(2)

For me it works to select the Series first and then group it by the first level of the MultiIndex. That your version does not work looks like a bug:
print (df[('anomaly_score', 'mean')].groupby(level=0).nlargest(2))
id match_level_0
1 1.0 55
2 1.0 8
Name: (anomaly_score, mean), dtype: int64
print (df[('anomaly_score', 'mean')].groupby(level='id').nlargest(2))
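
For reference, here is a self-contained version of that workaround built from the frame in the question. It is a sketch only; the names agg, score_mean and top2 are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 1],
    'anomaly_score': [5, 10, 8, 100],
    'match_level_0': [np.nan, 1, 1, 1],
    'match_level_1': [np.nan, np.nan, 1, 1],
    'match_level_2': [np.nan, 1, 1, 1],
})
agg = df.groupby(['id', 'match_level_0']).agg(['mean', 'sum'])

# Select the column first: a tuple key works on the frame itself ...
score_mean = agg[('anomaly_score', 'mean')]

# ... then group the resulting Series by the 'id' level of its row index
top2 = score_mean.groupby(level='id').nlargest(2)
print(top2)
```

Selecting before grouping avoids flattening the column names, because the tuple lookup happens on the DataFrame rather than on the GroupBy object.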

How to load scipy.stats.describe output into a pandas dataframe?

Is there an easy and straightforward way to load the output from sp.stats.describe() into a DataFrame, including the value names? It doesn't seem to be a dictionary or a related format. Of course I can attach the relevant column names manually (see below), but I was wondering whether it is possible to load it directly into a DataFrame with named columns.
import pandas as pd
import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded
data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
a = sp.stats.describe(data['a'])
pd.DataFrame(a)
pd.DataFrame(a).transpose().rename(columns={0: 'N', 1: 'Min,Max',
2: 'Mean', 3: 'Var',
4: 'Skewness',
5: 'Kurtosis'})

The result is a named tuple, so you can use its _fields attribute for the column names:
a = sp.stats.describe(data['a'])
df = pd.DataFrame([a], columns=a._fields)
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
It is also possible to create a dictionary from the named tuple with _asdict():
d = sp.stats.describe(data['a'])._asdict()
df = pd.DataFrame([d], columns=d.keys())
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
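
In both results the minmax column holds a tuple. If separate min and max columns are preferred, the dictionary can be adjusted before building the frame. The sketch below uses a hypothetical stand-in named tuple, DescribeLike, with the field names and values printed above, so that it runs without SciPy; with the real result, the same two lines operate on sp.stats.describe(data['a'])._asdict().

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in mirroring the fields and values printed above
DescribeLike = namedtuple(
    "DescribeLike",
    ["nobs", "minmax", "mean", "variance", "skewness", "kurtosis"],
)
res = DescribeLike(5, (1, 5), 3.0, 2.5, 0.0, -1.3)

# Turn the named tuple into a dict and split the (min, max) tuple
row = dict(res._asdict())
row["min"], row["max"] = row.pop("minmax")

flat = pd.DataFrame([row])
print(flat)
```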

numpy.argmax() not working on my sorted pandas.Series

Can someone please explain to me why the argmax() function does not work after using sort_values() on my pandas Series?
Below is the example of my code. The indices in the output is based on the original DataFrame, and not on the sorted Series.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.idxmax()
df.apply(sec_largest)
Then the output is
a 1
b 3
c 0
dtype: int64
And when I checked the Series using xsorted.iloc[0] function, it gives me the maximum values in the series.
Can someone explain to me how this works? Thank you very much.

The problem is that sort_values() sorts a pandas Series together with its index labels, and idxmax() returns the original index label of the highest value, not its position in the sorted Series.
def sec_largest(x):
    xsorted = x.sort_values(ascending=False)
    return xsorted.values.argmax()
By using xsorted.values we work on the underlying NumPy array instead of the pandas Series, so argmax() returns a position in the sorted data and everything works as expected.
If you print xsorted in the function you can see that the indices also get sorted along:
1 5
0 4
2 3
4 2
3 1
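
The difference between a label and a position can be checked directly on column 'a' from the question. A minimal sketch; the names s, label and pos are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([4, 5, 3, 1, 2])            # column 'a' from the question
xsorted = s.sort_values(ascending=False)  # index labels travel with the values

label = xsorted.idxmax()                  # index label of the maximum
pos = xsorted.to_numpy().argmax()         # position of the maximum in the sorted data

print(list(xsorted.index))
print(label, pos)
```

Here label is the original index label of the value 5, while pos is its position after sorting, which is the first slot.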

creating a column based on missing value in pandas

I have a data frame for which I want to create a column that represents the pattern of missing values in each row.
For example, for this CSV file:
A,B,C,D
1,NaN,NaN,NaN
NaN,2,3,NaN
3,2,2,3
3,2,NaN,3
3,2,1,NaN
I want to create a column E whose value is assigned as follows:
If A, B, C, D are all missing, E = 4.
If A, B, C, D are all present, E = 0.
If only A and B are missing, E = 1, and so on. The encoding of E need not be exactly as I described; it only has to indicate the pattern. How can I approach this problem in pandas?

Use isnull() in combination with sum(axis=1) to count the missing values in each row.
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3, 3, 3],
                   'B': [None, None, 1, 1, 1]})
df['C'] = df.isnull().sum(axis=1)
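
A count cannot tell apart two rows that miss the same number of values in different columns. If the column E should identify which columns are missing, one option is to give each column its own power of two and add up the weights of the missing ones. This is a sketch, assuming the data from the question; the bitmask encoding and the names mask and weights are illustrative choices, not something the question prescribes.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({'A': [1, nan, 3, 3, 3],
                   'B': [nan, 2, 2, 2, 2],
                   'C': [nan, 3, 2, nan, 1],
                   'D': [nan, nan, 3, 3, nan]})

# True where a value is missing, as a plain boolean array
mask = df.isna().to_numpy()

# One bit per column: A -> 1, B -> 2, C -> 4, D -> 8
weights = 2 ** np.arange(mask.shape[1])

# Each distinct combination of missing columns gets its own integer code
df['E'] = (mask * weights).sum(axis=1)
print(df)
```

A code of 0 means nothing is missing, and the set of missing columns can be recovered from the bits of E.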
