I need to rank each column of the dataframe. I'm currently using the below code:
for x in range(1, len(cols)):
    data[cols[x]] = data[cols[x]].rank(ascending=0)
This works for a small dataset, but I have more than 50,000 columns and 20,000 rows. Is there a way I can do this faster with a thread pool? I tried the code below, but it didn't work: it returns an empty result.
cols = rankDset.columns.tolist()
def rank_columns(c):
    rankDset[c] = rankDset[c].rank(ascending=0)

def parallelDataframe(df, func):
    pool = Pool(8)
    pool.map(func, cols)
    pool.close()
    pool.join()
parallelDataframe(rankDset, rank_columns)
You should be able to rank every column by using pd.DataFrame.rank:
df.rank()
From the docs:
Compute numerical data ranks (1 through n) along axis.
axis: {0 or ‘index’, 1 or ‘columns’}, default 0
index to direct ranking
Consider the dataframe df:
np.random.seed([3,1415])
df = pd.DataFrame(dict(
    A=np.random.choice(np.arange(10), 5, False),
    B=np.random.choice(np.arange(10), 5, False),
    C=np.random.choice(np.arange(10), 5, False),
    D=np.random.choice(np.arange(10), 5, False),
))
df
A B C D
0 9 1 6 0
1 4 3 8 2
2 5 5 9 6
3 1 9 7 1
4 7 4 3 9
Then ranking produces
df.rank()
A B C D
0 5.0 1.0 2.0 1.0
1 2.0 2.0 4.0 3.0
2 3.0 4.0 5.0 4.0
3 1.0 5.0 3.0 2.0
4 4.0 3.0 1.0 5.0
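If you need descending ranks and only want to rank a subset of the columns (as in the loop from the question, which skips the first column), this can still be done in one vectorized call instead of a loop or a thread pool. A minimal sketch, assuming data and cols are defined as in the question:
# cols and data come from the question; cols[1:] skips the first column
data[cols[1:]] = data[cols[1:]].rank(ascending=False)
Ranking all of the columns in a single pandas call is typically far faster than looping over them in Python.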
I have a list of about 20 dataframes, all with the same structure (same rows and columns).
I want to create a new df, where each cell is equal to the average of the corresponding (same row/column) cells of the listed dfs.
So, for example, if we have just 2 dfs (A and B), I need the following:
A=
A B C D
0 7 6 8 7
1 7 0 7 6
2 9 2 7 0
B=
A B C D
0 6 9 2 7
1 4 4 5 7
2 6 8 5 4
Average=
A B C D
0 6.5 7.5 5.0 7.0
1 5.5 2.0 6.0 6.5
2 7.5 5.0 6.0 2.0
I tried this code, but it's pretty slow (the real dfs are quite large) and messes up the order of columns:
dfs = [A,B]
Average = pd.concat([each.stack() for each in dfs],axis=1)\
.apply(lambda x:x.mean(),axis=1)\
.unstack()
Is there a better alternative? Thanks
Use -
(A+B) / 2
Output
A B C D
0 6.5 7.5 5.0 7.0
1 5.5 2.0 6.0 6.5
2 7.5 5.0 6.0 2.0
For scaling up to more dfs, put all of them in a list and just use sum(list). Edit: based on @younggoti's recommendation:
list_of_df = [A,B]
sum(list_of_df)/len(list_of_df)
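As a quick check, here is a minimal sketch that rebuilds the two example frames from the question and averages them with the list approach:
import pandas as pd

A = pd.DataFrame({'A': [7, 7, 9], 'B': [6, 0, 2], 'C': [8, 7, 7], 'D': [7, 6, 0]})
B = pd.DataFrame({'A': [6, 4, 6], 'B': [9, 4, 8], 'C': [2, 5, 5], 'D': [7, 7, 4]})

list_of_df = [A, B]
Average = sum(list_of_df) / len(list_of_df)  # element-wise mean, column order preserved
print(Average)
This reproduces the Average table above and keeps the original column order.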
I'm trying to rename a single row of a pandas dataframe by its tuple.
For example:
import pandas as pd
df = pd.DataFrame(data={'i1':[0,0,0,0,1,1,1,1],
                        'i2':[0,1,2,3,0,1,2,3],
                        'x':[1.,2.,3.,4.,5.,6.,7.,8.],
                        'y':[9,10,11,12,13,14,15,16]})
df.set_index(['i1','i2'], inplace=True)
Creates df:
x y
i1 i2
0 0 1.0 9
1 2.0 10
2 3.0 11
3 4.0 12
1 0 5.0 13
1 6.0 14
2 7.0 15
3 8.0 16
I'd like to be able to use something like: df.rename(index={(0,1):(0,9)},inplace=True) to get:
x y
i1 i2
0 0 1.0 9
9 2.0 10 <-- new key
2 3.0 11
3 4.0 12
1 0 5.0 13
1 6.0 14
2 7.0 15
3 8.0 16
The command executes without raising an error but returns the same df unchanged.
This also returns the same df: df.rename(index={pd.IndexSlice[0,1]:pd.IndexSlice[0,9]},inplace=True)
This will have close to the desired effect:
df.loc[(0,9),:] = df.loc[(0,1),:]
df.drop(index=(0,1),inplace=True)
but if row ordering matters, it'll be a pain to get it into the right order, and possibly quite slow if the df gets big.
I'm using Pandas 1.0.1, python 3.7. Any suggestions? Thank you in advance.
A possible solution with a list comprehension and MultiIndex.from_tuples:
L = [(0,9) if x == (0,1) else x for x in df.index]
df.index = pd.MultiIndex.from_tuples(L, names=df.index.names)
print (df)
x y
i1 i2
0 0 1.0 9
9 2.0 10
2 3.0 11
3 4.0 12
1 0 5.0 13
1 6.0 14
2 7.0 15
3 8.0 16
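If this comes up repeatedly, the same idea can be wrapped in a small helper; the function name rename_index_tuple below is just illustrative, not a pandas API:
def rename_index_tuple(df, old, new):
    # Return a copy of df with the MultiIndex entry `old` replaced by `new`
    out = df.copy()
    out.index = pd.MultiIndex.from_tuples(
        [new if tup == old else tup for tup in df.index],
        names=df.index.names,
    )
    return out

df2 = rename_index_tuple(df, (0, 1), (0, 9))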
Consider the following dataframe:
df = pd.DataFrame({
    'a': np.arange(1, 5),
    'b': np.arange(1, 5) * 2,
    'c': np.arange(1, 5) * 3
})
a b c
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
I want to calculate the cumulative sum for each row across the columns:
def expanding_func(s):
    return s.sum()
df.expanding(1, axis=1).apply(expanding_func, raw=True)
# As expected:
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
However, if I set raw=False, expanding_func no longer works:
df.expanding(1, axis=1).apply(expanding_func, raw=False)
ValueError: Length of passed values is 3, index implies 4
The documentation says expanding_func
Must produce a single value from an ndarray input if raw=True or a single value from a Series if raw=False.
And that is exactly what I was doing. Why did expanding_func fail when raw=False?
Note: this is only a contrived example. I want to know how to write a custom rolling function, not how to calculate the cumulative sum across columns.
It seems this is a bug in pandas.
If you do:
df.iloc[:3].expanding(1, axis=1).apply(expanding_func, raw=False)
It actually works. It seems that when the values are passed as a Series, pandas checks the number of returned columns against the number of rows of the dataframe for some reason (it should compare against the number of columns of the df).
A workaround is to transpose the df, apply your function, and transpose back, which seems to work. The bug only seems to appear when axis is set to 1.
df.T.expanding(1, axis=0).apply(expanding_func, raw=False).T
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
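The same transpose trick should work for other custom window functions, not just the sum. A sketch with an expanding mean across columns, under the same assumption that the bug only affects axis=1:
def expanding_mean(s):
    # s is a Series containing the expanding window when raw=False
    return s.mean()

# transpose, expand down the rows of the transposed frame, then transpose back
result = df.T.expanding(1, axis=0).apply(expanding_mean, raw=False).T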
You don't need to set raw to False or True; just do it the simple way:
df.expanding(0, axis=1).apply(expanding_func)
a b c
0 1.0 3.0 6.0
1 2.0 6.0 12.0
2 3.0 9.0 18.0
3 4.0 12.0 24.0
I have a pandas dataframe with two dimensions. I want to calculate the rolling standard deviation along axis 1 while also including datapoints in the rows above and below.
So say I have this df:
data = {'A': [1, 2, 3, 4],
        'B': [5, 6, 7, 8],
        'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want a rectangular window 3 rows high and 2 columns across, moving from left to right. So, for example,
std_df.loc[1, 'C']
would be equal to
np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
But I have no idea how to achieve this without very slow iteration.
It looks like what you want is pandas.DataFrame.shift.
import pandas as pd
import numpy as np
data = {'A': [1,2,3,4], 'B': [5,6,7,8], 'C': [9,10,11,12]}
df = pd.DataFrame(data)
print(df)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Shifting the dataframe you provided by 1 yields the row above
print(df.shift(1))
A B C
0 NaN NaN NaN
1 1.0 5.0 9.0
2 2.0 6.0 10.0
3 3.0 7.0 11.0
Similarly, shifting the dataframe you provided by -1 yields the row below
print(df.shift(-1))
A B C
0 2.0 6.0 10.0
1 3.0 7.0 11.0
2 4.0 8.0 12.0
3 NaN NaN NaN
So the code below should do what you're looking for (add_prefix prefixes the column names to keep them unique):
above_df = df.shift(1).add_prefix('above_')
below_df = df.shift(-1).add_prefix('below_')
lagged = pd.concat([df, above_df, below_df], axis=1)
lagged['std'] = lagged.apply(np.std, axis=1)
print(lagged)
A B C above_A above_B above_C below_A below_B below_C std
0 1 5 9 NaN NaN NaN 2.0 6.0 10.0 3.304038
1 2 6 10 1.0 5.0 9.0 3.0 7.0 11.0 3.366502
2 3 7 11 2.0 6.0 10.0 4.0 8.0 12.0 3.366502
3 4 8 12 3.0 7.0 11.0 NaN NaN NaN 3.304038
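As a quick sanity check against the value asked for in the question (np.std uses the population standard deviation, ddof=0, and the std column above suggests the NaNs in the partial windows at the top and bottom are skipped):
expected = np.std([1, 5, 9, 2, 6, 10, 3, 7, 11])
print(expected)              # ~3.3665
print(lagged.loc[1, 'std'])  # matches the expected value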
I am interested in whether we can use the pandas.core.groupby.DataFrameGroupBy.agg function to perform arithmetic operations on multiple columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]
print(df.groupby('C').mean()[0] - df.groupby('C').mean()[1])
print(df.groupby('C').agg({0: 'mean', 1: 'sum', 2: 'nunique', 'C': 'mean0-mean1'}))
Is it somehow possible to get a result like in this example: the difference between the means of column 0 and column 1, grouped by column 'C'?
df
0 1 2 C
0 0 1 2 0
1 3 4 5 0
2 6 7 8 2
3 9 10 11 2
4 12 13 14 5
Grouped difference
C
0 -1.0
2 -1.0
5 -1.0
dtype: float64
I am not interested in solutions that do not use the agg method. I am only curious whether the agg method can take multiple columns as an argument and then perform some operation on them to return one column once the job is done.
IIUC:
In [12]: df.groupby('C').mean().diff(axis=1)
Out[12]:
0 1 2
C
0 NaN 1.0 1.0
2 NaN 1.0 1.0
5 NaN 1.0 1.0
or
In [13]: df.groupby('C').mean().diff(-1, axis=1)
Out[13]:
0 1 2
C
0 -1.0 -1.0 NaN
2 -1.0 -1.0 NaN
5 -1.0 -1.0 NaN
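If you want exactly the single Series from the question (mean of column 0 minus mean of column 1 per group), one option is to select that column from the second diff result above:
# column 0 of diff(-1, axis=1) holds mean(col 0) - mean(col 1) for each group
grouped_difference = df.groupby('C').mean().diff(-1, axis=1)[0]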