itertools.product() of multiple rows - python

My df looks like:
   a  b    c
0  0  1  NaN
1  0  2  3.0
2  0  3  4.0
3  1  1  NaN
I need an itertools.product()-like combination of the row entries within each group of 'a'. In group 0 the second row offers two candidate values (b=2 or c=3) and so does the third (b=3 or c=4), which gives four possible combinations; group 1 gives just one:
     a  b
1    0  1
     0  2
     0  3
2    0  1
     0  3
     0  3
3    0  1
     0  2
     0  4
4    0  1
     0  3
     0  4
5    1  1
Any ideas?

In your case:
df = pd.concat([y.dropna(axis=1, thresh=1).ffill(axis=1).melt('a')
                for x, y in df.groupby('a')])
     a variable  value
0  0.0        b    1.0
1  0.0        b    2.0
2  0.0        b    3.0
3  0.0        c    1.0
4  0.0        c    3.0
5  0.0        c    4.0
0  1.0        b    1.0
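The melt above lists each group's candidate values row by row. If you need the literal combinations from the question, a small sketch using itertools.product over each group's rows could look like this (my own illustration, not part of the original answer; it assumes NaN entries are skipped):

import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 1],
                   'b': [1, 2, 3, 1],
                   'c': [np.nan, 3, 4, np.nan]})

for key, g in df.groupby('a'):
    # each row contributes its non-NaN entries as the candidates to pick from
    candidates = [row.dropna().tolist() for _, row in g[['b', 'c']].iterrows()]
    # itertools.product chooses one candidate per row
    for combo in itertools.product(*candidates):
        print(key, combo)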


Row-wise replace operation in pandas dataframe

In the given data frame, I am trying to perform a row-wise replace operation where 1 should be replaced by the value in the Values column.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 3, 4, 5, 6, 7],
                   'A': [0, 1, 0, 1, 0, 0, 1, 0, np.nan, 0],
                   'B': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
                   'C': [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
                   'Values': [10, 2, 3, 4, 9, 3, 4, 5, 2, 3]})
Expected Output:
   ID    A  B   C  Values
0   1  0.0  0  10      10
1   1  2.0  0   0       2
2   1  0.0  0   3       3
3   2  4.0  0   0       4
4   3  0.0  9   0       9
5   3  0.0  3   0       3
6   4  4.0  0   0       4
7   5  0.0  0   0       5
8   6  NaN  0   2       2
9   7  0.0  0   3       3
Note: The data is very large.

Use df.where:
df[['A','B','C']]=df[['A','B','C']].where(df[['A','B','C']].ne(1),df['Values'], axis=0)
   ID    A  B   C  Values
0   1  0.0  0  10      10
1   1  2.0  0   0       2
2   1  0.0  0   3       3
3   2  4.0  0   0       4
4   3  0.0  9   0       9
5   3  0.0  3   0       3
6   4  4.0  0   0       4
7   5  0.0  0   0       5
8   6  NaN  0   2       2
9   7  0.0  0   3       3
Or
df[['A','B','C']]=df[['A','B','C']].mask(df[['A','B','C']].eq(1),df['Values'], axis=0)
My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns hold 1s, 0s, or NaNs), you simply have to multiply df['Values'] with each column independently. This should be super fast, as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
   ID    A  B   C  Values
0   1  0.0  0  10      10
1   1  2.0  0   0       2
2   1  0.0  0   3       3
3   2  4.0  0   0       4
4   3  0.0  9   0       9
5   3  0.0  3   0       3
6   4  4.0  0   0       4
7   5  0.0  0   0       5
8   6  NaN  0   2       2
9   7  0.0  0   3       3
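As a compact equivalent of the three assignments above (my own variant, not from the original answer), DataFrame.mul with axis=0 multiplies all three columns by Values in one vectorized call; NaN stays NaN because NaN times anything is NaN:

cols = ['A', 'B', 'C']
# multiply every indicator column by Values row-wise in one shot
df[cols] = df[cols].mul(df['Values'], axis=0)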
If you want to explicitly check the condition where the values of A, B, C are 1 (maybe because those columns could hold values other than NaNs or 0s), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This will replace the columns A, B, C in the original data, but it also replaces NaNs with 0.

More efficient way to create dataframe of top n values - python

I have a dataframe of categories that I need to clean up by limiting the values to the top n categories. Any value that isn't in the top n categories should be binned under 0 (or "other").
I tried the code below, which loops through each column of the dataframe and then through each row of that column, checking whether the value at that position is found among the column's top n value_counts. If yes, it keeps the value; if not, it replaces it with 0.
This implementation technically works, but when the number of rows is large, it takes too long to run. What is a quicker way of accomplishing this in pandas/numpy?
import numpy as np
import pandas as pd

z = pd.DataFrame(np.random.randint(1, 4, size=(100000, 4)))
x = pd.DataFrame()
n = 10
for j in z:
    for i in z[j].index:
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)
I think you can use apply to loop over the columns with a custom function: value_counts selects each column's top values, and where with isin builds the boolean mask for replacing:
def f(x):
    y = x.value_counts().head(n).index
    return x.where(x.isin(y), 0)

print (z.apply(f))
which is the same as:
print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
Sample:
#N = 100000
N = 10
np.random.seed(123)
z = pd.DataFrame(np.random.randint(1,4,size=(N, 4)))
print (z)
   0  1  2  3
0  3  2  3  3
1  1  3  3  2
2  3  2  3  2
3  1  2  3  2
4  1  3  1  2
5  3  2  1  1
6  1  1  2  3
7  1  3  1  1
8  2  1  2  1
9  1  1  3  2
x = pd.DataFrame()
n = 2
for j in z:
    for i in z[j].index:
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)
     0    1    2    3
0  3.0  2.0  3.0  0.0
1  1.0  3.0  3.0  2.0
2  3.0  2.0  3.0  2.0
3  1.0  2.0  3.0  2.0
4  1.0  3.0  1.0  2.0
5  3.0  2.0  1.0  1.0
6  1.0  0.0  0.0  0.0
7  1.0  3.0  1.0  1.0
8  0.0  0.0  0.0  1.0
9  1.0  0.0  3.0  2.0
print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
   0  1  2  3
0  3  2  3  0
1  1  3  3  2
2  3  2  3  2
3  1  2  3  2
4  1  3  1  2
5  3  2  1  1
6  1  0  0  0
7  1  3  1  1
8  0  0  0  1
9  1  0  3  2
A similar solution with numpy.where:
print (z.apply(lambda x: np.where(x.isin(x.value_counts().head(n).index), x, 0)))
   0  1  2  3
0  3  2  3  0
1  1  3  3  2
2  3  2  3  2
3  1  2  3  2
4  1  3  1  2
5  3  2  1  1
6  1  0  0  0
7  1  3  1  1
8  0  0  0  1
9  1  0  3  2
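An equivalent sketch without apply (my addition, under the same assumptions as the answers above): collect each column's top n values once, build a boolean mask, and replace everything else with a single where:

# top-n values per column, computed once
top = {c: z[c].value_counts().head(n).index for c in z.columns}
# boolean mask: True where a value is among its column's top n
mask = pd.DataFrame({c: z[c].isin(top[c]) for c in z.columns})
print (z.where(mask, 0))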

Fast way to get the number of NaNs in a column counted from the last valid value in a DataFrame

Say I have a DataFrame like
         A      B
0   0.1880  0.345
1   0.2510  0.585
2      NaN    NaN
3      NaN    NaN
4      NaN  1.150
5   0.2300  1.210
6   0.1670  1.290
7   0.0835  1.400
8   0.0418    NaN
9   0.0209    NaN
10     NaN    NaN
11     NaN    NaN
12     NaN    NaN
I want a new DataFrame of the same shape where each entry is the number of NaNs counted from the last valid value up to that position, as follows:
    A  B
0   0  0
1   0  0
2   1  1
3   2  2
4   3  0
5   0  0
6   0  0
7   0  0
8   0  1
9   0  2
10  1  3
11  2  4
12  3  5
I wonder if this can be done efficiently by utilizing some of the Pandas/Numpy functions?
You can use:
a = df.isnull()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)
    A  B
0   0  0
1   0  0
2   1  1
3   2  2
4   3  0
5   0  0
6   0  0
7   0  0
8   0  1
9   0  2
10  1  3
11  2  4
12  3  5
For better understanding:
# add NaN where True in a
a2 = b.mask(a)
# forward fill NaN
a3 = b.mask(a).ffill()
# replace NaN with 0, cast to int
a4 = b.mask(a).ffill().fillna(0).astype(int)
# subtract a4 from b
a5 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
df1 = pd.concat([a, b, a2, a3, a4, a5], axis=1,
                keys=['a', 'b', 'where', 'ffill nan', 'subtract', 'output'])
print (df1)
        a             b     where       ffill nan   subtract   output
        A      B      A  B      A    B    A    B    A  B       A  B
0   False  False      0  0    0.0  0.0  0.0  0.0    0  0       0  0
1   False  False      0  0    0.0  0.0  0.0  0.0    0  0       0  0
2    True   True      1  1    NaN  NaN  0.0  0.0    0  0       1  1
3    True   True      2  2    NaN  NaN  0.0  0.0    0  0       2  2
4    True  False      3  2    NaN  2.0  0.0  2.0    0  2       3  0
5   False  False      3  2    3.0  2.0  3.0  2.0    3  2       0  0
6   False  False      3  2    3.0  2.0  3.0  2.0    3  2       0  0
7   False  False      3  2    3.0  2.0  3.0  2.0    3  2       0  0
8   False   True      3  3    3.0  NaN  3.0  2.0    3  2       0  1
9   False   True      3  4    3.0  NaN  3.0  2.0    3  2       0  2
10   True   True      4  5    NaN  NaN  3.0  2.0    3  2       1  3
11   True   True      5  6    NaN  NaN  3.0  2.0    3  2       2  4
12   True   True      6  7    NaN  NaN  3.0  2.0    3  2       3  5
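For easy testing, here is a self-contained version with the sample data transcribed from the question (the construction of df is my addition):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [0.1880, 0.2510, np.nan, np.nan, np.nan, 0.2300, 0.1670,
          0.0835, 0.0418, 0.0209, np.nan, np.nan, np.nan],
    'B': [0.345, 0.585, np.nan, np.nan, 1.150, 1.210, 1.290,
          1.400, np.nan, np.nan, np.nan, np.nan, np.nan]})

a = df.isnull()   # True at every NaN
b = a.cumsum()    # running NaN count per column
# keep the running count only at valid positions, carry it forward,
# and subtract it so the count restarts after each valid value
df1 = b.sub(b.mask(a).ffill().fillna(0).astype(int))
print (df1)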

pandas calculate the counts and sum group by each category

I have a dataframe:
  category  num1  num2  mark
1        A     2     2     0
2        B     3     3     1
3        C     4     2     2
4        C     3     5     2
5        D     6     8     0
6        E     7     5     1
7        D     8     1     1
I want to calculate the counts for each category, grouped by the mark (as the columns), like:
the counts:
category  mark_0  mark_1  mark_2
A              1       0       0
B              0       1       0
C              0       0       2
D              1       1       0
E              0       1       0
The other is to calculate the sum of num1 for each category, grouped by the mark (as the columns), like:
the sum:
category  numsum_0  numsum_1  numsum_2
A                2         0         0
B                0         3         0
C                0         0         7
D                6         8         0
E                0         7         0
and my method is:
df_z[df_z['mark']==0]['category'].value_counts()
df_z[df_z['mark']==0].groupby(['category'], sort=False).sum()
but it is inefficient.
>>> pd.pivot_table(df,index=['category'],columns=['mark'],aggfunc=len).fillna(0)
          num
mark        0    1    2
category
A         1.0  0.0  0.0
B         0.0  1.0  0.0
C         0.0  0.0  2.0
D         1.0  1.0  0.0
E         0.0  1.0  0.0
>>> pd.pivot_table(df,index=['category'],columns=['mark'],aggfunc=np.sum).fillna(0)
          num
mark        0    1    2
category
A         2.0  0.0  0.0
B         0.0  3.0  0.0
C         0.0  0.0  7.0
D         6.0  8.0  0.0
E         0.0  7.0  0.0
Use agg with named aggregation (the old dict-of-dicts renaming was removed from pandas):
df.groupby(['category', 'mark']).agg(
    Sum=('num1', 'sum'),
    Count=('num2', 'count')).unstack(fill_value=0)
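A pd.crosstab variant may also be worth trying (my addition, not from the original answers); it produces the counts directly, and with values/aggfunc the sums (num1 here, matching the pivot output above):

# counts of rows per category/mark pair
counts = pd.crosstab(df['category'], df['mark'])
# sums of num1 per category/mark pair, missing combinations filled with 0
sums = pd.crosstab(df['category'], df['mark'],
                   values=df['num1'], aggfunc='sum').fillna(0)
print (counts)
print (sums)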

reading multiple csv files into a big pandas data frame by appending columns of different sizes

So I am creating some data frames in a loop and saving them as csv files. The data frames have the same columns but different lengths. I would like to be able to concatenate these data frames into a single data frame that has all the columns, something like
df1
   A    B  C
0  0    1  2
1  0    1  0
2  1.2  1  1
3  2    1  2
df2
   A    B  C
0  0    1  2
1  0    1  0
2  0.2  1  2
df3
   A    B  C
0  0    1  2
1  0    1  0
2  1.2  1  1
3  2    1  4
4  1    2  2
5  2.3  3  0
I would like to get something like:
df_big
   A    B  C  A    B  C  A    B  C
0  0    1  2  0    1  2  0    1  2
1  0    1  0  0    1  0  0    1  0
2  1.2  1  1  0.2  1  2  1.2  1  1
3  2    1  2            2    1  4
4                       1    2  2
5                       2.3  3  0
Is this something that can be done in pandas?
You could use pd.concat:
df_big = pd.concat([df1, df2, df3], axis=1)
yields
     A    B    C    A    B    C    A  B  C
0  0.0    1    2  0.0    1    2  0.0  1  2
1  0.0    1    0  0.0    1    0  0.0  1  0
2  1.2    1    1  0.2    1    2  1.2  1  1
3  2.0    1    2  NaN  NaN  NaN  2.0  1  4
4  NaN  NaN  NaN  NaN  NaN  NaN  1.0  2  2
5  NaN  NaN  NaN  NaN  NaN  NaN  2.3  3  0
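Since the frames are first saved as csv files, a short sketch for reading them back before concatenating (the df*.csv filename pattern is hypothetical):

import glob
import pandas as pd

# read every saved frame back in a stable order, then glue the columns side by side
frames = [pd.read_csv(path) for path in sorted(glob.glob('df*.csv'))]
df_big = pd.concat(frames, axis=1)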
