More efficient way to create dataframe of top n values - python

I have a dataframe of categories that I need to clean up by limiting the values to the top n categories. Any value that isn't in the top n categories should be binned under 0 (or "other").
I tried the code below, which loops through each row of each column and checks whether the value in that position is among that column's top n value_counts. If yes, it keeps the value; if not, it replaces it with 0.
This implementation technically works, but when the number of rows is large, it takes too long to run. What is a quicker way of accomplishing this in pandas/numpy?
import numpy as np
import pandas as pd

z = pd.DataFrame(np.random.randint(1, 4, size=(100000, 4)))
x = pd.DataFrame()
n = 10
for j in z:
    for i in z[j].index:
        # check membership in this column's top-n most frequent values
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)
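Note that value_counts() is recomputed for every single cell above; even before vectorizing, hoisting it out of the inner loop avoids that repeated work. A sketch of that intermediate fix, keeping the same loop structure:

x = pd.DataFrame(index=z.index, columns=z.columns)
n = 10
for j in z:
    top = set(z[j].value_counts().head(n).index)  # computed once per column
    for i in z.index:
        x.at[i, j] = z.at[i, j] if z.at[i, j] in top else 0
print(x)

The answers below avoid the Python-level loops entirely.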

I think you can use apply to loop over the columns with a custom function: value_counts selects the top values, then where with isin builds a boolean mask for the replacement:
def f(x):
    y = x.value_counts().head(n).index
    return x.where(x.isin(y), 0)

print (z.apply(f))
Which is the same as:
print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
Sample:
#N = 100000
N = 10
np.random.seed(123)
z = pd.DataFrame(np.random.randint(1, 4, size=(N, 4)))
print (z)
0 1 2 3
0 3 2 3 3
1 1 3 3 2
2 3 2 3 2
3 1 2 3 2
4 1 3 1 2
5 3 2 1 1
6 1 1 2 3
7 1 3 1 1
8 2 1 2 1
9 1 1 3 2
x = pd.DataFrame()
n = 2
for j in z:
    for i in z[j].index:
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)
0 1 2 3
0 3.0 2.0 3.0 0.0
1 1.0 3.0 3.0 2.0
2 3.0 2.0 3.0 2.0
3 1.0 2.0 3.0 2.0
4 1.0 3.0 1.0 2.0
5 3.0 2.0 1.0 1.0
6 1.0 0.0 0.0 0.0
7 1.0 3.0 1.0 1.0
8 0.0 0.0 0.0 1.0
9 1.0 0.0 3.0 2.0
print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
0 1 2 3
0 3 2 3 0
1 1 3 3 2
2 3 2 3 2
3 1 2 3 2
4 1 3 1 2
5 3 2 1 1
6 1 0 0 0
7 1 3 1 1
8 0 0 0 1
9 1 0 3 2
Similar solution with numpy.where:
print (z.apply(lambda x: np.where(x.isin(x.value_counts().head(n).index), x, 0)))
0 1 2 3
0 3 2 3 0
1 1 3 3 2
2 3 2 3 2
3 1 2 3 2
4 1 3 1 2
5 3 2 1 1
6 1 0 0 0
7 1 3 1 1
8 0 0 0 1
9 1 0 3 2
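For reuse, the same idea can be wrapped in a small helper. This is a sketch; the function name keep_top_n and the other parameter are my own additions, not from the answer above:

def keep_top_n(df, n, other=0):
    # Keep only each column's top-n most frequent values; replace the rest
    return df.apply(lambda s: s.where(s.isin(s.value_counts().head(n).index), other))

print (keep_top_n(z, n=2))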

Related

How can I calculate the sum of 3 values from each number in a pandas dataframe including the first number?

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate, for each row, the sum of the last 3 values of Fn (the rolling window ending at that row) and put the value in Sum.
My expected result is this:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 0
2 18747.392361 7050.0 0 1
3 18747.395833 8240.0 1 2
4 18747.399306 5158.0 1 2
5 18747.402778 3926.0 0 2
6 18747.406250 4043.0 0 1
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
df['Sum'] = df.Fn.rolling(3).sum().fillna(0)
Output:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0.0
1 18747.388889 8872.0 1 0.0
2 18747.392361 7050.0 0 1.0
3 18747.395833 8240.0 1 2.0
4 18747.399306 5158.0 1 2.0
5 18747.402778 3926.0 0 2.0
6 18747.406250 4043.0 0 1.0
7 18747.409722 2752.0 1 1.0
8 18747.420139 3502.0 1 2.0
9 18747.423611 4026.0 1 3.0
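If you want the Sum column to stay integer like the expected output (rolling sums return floats), one hedged variant is to cast after filling, assuming no other NaNs remain in Fn:

df['Sum'] = df.Fn.rolling(3).sum().fillna(0).astype(int)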

Difference in score to next rank

I have a dataframe
Group Score Rank
1 0 3
1 4 1
1 2 2
2 3 2
2 1 3
2 7 1
I have to take the difference between the score at each rank and the score at the next rank within each group. For example, in group 1, rank 1 - rank 2 = 4 - 2.
Expected output:
Group Score Rank Difference
1 0 3 0
1 4 1 2
1 2 2 2
2 3 2 2
2 1 3 0
2 7 1 4
you can try:
df = df.sort_values(['Group', 'Rank'], ascending=[True, False])
df['Difference'] = df.groupby('Group', as_index=False)['Score'].transform('diff').fillna(0).astype(int)
OUTPUT:
Group Score Rank Difference
0 1 0 3 0
2 1 2 2 2
1 1 4 1 2
4 2 1 3 0
3 2 3 2 2
5 2 7 1 4
NOTE: The result is sorted based on the rank column.
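If you need to keep the original row order, one variant (a sketch building on the same diff idea) is to sort only for the computation and let index alignment put the results back:

df['Difference'] = (df.sort_values('Rank', ascending=False)
                      .groupby('Group')['Score'].diff()
                      .fillna(0).astype(int))

Because the grouped diff returns a Series labelled with the original index, assigning it back restores the original order.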
I think you can create a new column holding the value at the next rank by using shift(), and then calculate the difference. See the code below:
# Sort the dataframe
df = df.sort_values(['Group','Rank']).reset_index(drop=True)
# Shift Score up by one row within each group to get the score at the next rank
df['Score_next'] = df.groupby('Group')['Score'].shift(-1)
# Calculate the difference; the lowest rank has no next rank, so fill with 0
df['Difference'] = (df['Score'] - df['Score_next']).fillna(0)
Here is the result:
print(df)
Group Score Rank Score_next Difference
0 1 4 1 2.0 2.0
1 1 2 2 0.0 2.0
2 1 0 3 NaN 0.0
3 2 7 1 3.0 4.0
4 2 3 2 1.0 2.0
5 2 1 3 NaN 0.0
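The helper column can also be skipped by taking the grouped difference directly against the next row (a compact variant of the same idea, assuming the frame is already sorted by Group and ascending Rank as above):

df['Difference'] = df.groupby('Group')['Score'].diff(-1).fillna(0)

diff(-1) subtracts the next row's Score within each group, so the lowest rank naturally gets NaN and is filled with 0.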

Row-wise replace operation in pandas dataframe

In the given data frame, I am trying to perform a row-wise replace operation where 1 should be replaced by the value in Values.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1,1,1,2,3,3,4,5,6,7],
                   'A': [0,1,0,1,0,0,1,0,np.nan,0],
                   'B': [0,0,0,0,1,1,0,0,0,0],
                   'C': [1,0,1,0,0,0,0,0,1,1],
                   'Values': [10, 2, 3,4,9,3,4,5,2,3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: the data is very large.
Use df.where:
df[['A','B','C']] = df[['A','B','C']].where(df[['A','B','C']].ne(1), df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or
df[['A','B','C']]=df[['A','B','C']].mask(df[['A','B','C']].eq(1),df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns hold only 1s, 0s, or NaNs), you simply have to multiply df['Values'] with each column independently. This should be super fast as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
If you want to explicitly check the condition that the values of A, B, C are 1 (maybe because those columns could hold values other than NaNs or 0s), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This replaces the columns A, B, C in the original data, but it also replaces NaNs with 0.
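If the NaNs must survive while still checking the condition explicitly, a hedged alternative is a single numpy.where over the block of columns, broadcasting Values across them:

cols = ['A', 'B', 'C']
df[cols] = np.where(df[cols].to_numpy() == 1,
                    df[['Values']].to_numpy(),  # shape (n, 1) broadcasts across the 3 columns
                    df[cols].to_numpy())

NaN == 1 evaluates to False, so NaN cells keep their original NaN value.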

itertools.product() of multiple rows

My df looks like
a b c
0 1 nan
0 2 3
0 3 4
1 1 nan
I need an itertools.product()-like combination of the entries in rows within groups of 'a'. Here there are 4 possible combinations for a=0, since the second and third rows each have 2 different values:
a b
1 0 1
0 2
0 3
2 0 1
0 3
0 3
3 0 1
0 2
0 4
4 0 1
0 3
0 4
5 1 1
Any ideas?
In your case
df = pd.concat([y.dropna(axis=1, thresh=1).ffill(axis=1).melt('a')
                for x, y in df.groupby('a')])
a variable value
0 0.0 b 1.0
1 0.0 b 2.0
2 0.0 b 3.0
3 0.0 c 1.0
4 0.0 c 3.0
5 0.0 c 3.0
0 1.0 b 1.0
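For comparison, a more literal sketch with itertools.product itself, under my reading of the goal (each row of the original df contributes its non-NaN b/c values as candidates, and we enumerate one value per row within each 'a' group):

from itertools import product

for key, g in df.groupby('a'):
    candidates = [row.dropna().tolist() for _, row in g[['b', 'c']].iterrows()]
    for combo in product(*candidates):
        print(key, combo)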

How to group data using one column, perform some operation on another column, and assign new groups (pandas)

I have a dataframe as below :
distance_along_path ID
0 0 1
1 2.2 1
2 4.5 1
3 7.0 1
4 0 2
5 0 3
6 3.0 2
7 5.0 3
8 0 4
9 2.0 4
10 5.0 4
11 0 5
12 3.0 5
11 7.0 4
I want to be able to group these by ID first and then by the distance_along_path values: every time a 0 is seen in distance_along_path for an ID, a new group is created, and all rows for that ID until its next 0 belong to that group, as indicated below
distance_along_path ID group
0 0 1 1
1 2.2 1 1
2 4.5 1 1
3 7.0 1 1
4 0 1 2
5 0 2 3
6 3.0 1 2
7 5.0 2 3
8 0 2 4
9 2.0 2 4
10 5.0 2 4
11 0 1 5
12 3.0 1 5
13 7.0 1 5
14 0 1 6
15 0 2 7
16 3.0 1 6
17 5.0 2 7
18 1.0 2 7
Thank you
try the following, a sketch that counts the zero markers cumulatively within each ID and then numbers the (ID, segment) pairs in order of first appearance:

# A new segment starts each time distance_along_path is 0 within an ID
seg = df.groupby('ID')['distance_along_path'].transform(lambda s: s.eq(0).cumsum())
# Number the (ID, segment) pairs in the order they first appear
df['group'] = df.groupby(['ID', seg], sort=False).ngroup() + 1
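A quick check of the idea on a small interleaved sample (my own construction, not from the question):

import pandas as pd

df = pd.DataFrame({'distance_along_path': [0, 2.2, 4.5, 7.0, 0, 0, 3.0, 5.0],
                   'ID': [1, 1, 1, 1, 1, 2, 1, 2]})
seg = df.groupby('ID')['distance_along_path'].transform(lambda s: s.eq(0).cumsum())
df['group'] = df.groupby(['ID', seg], sort=False).ngroup() + 1
print(df)
# Rows 0-3 (ID 1, first segment) get group 1, rows 4 and 6 (ID 1, second
# segment) get group 2, and rows 5 and 7 (ID 2) get group 3.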
