Row-wise replace operation in pandas dataframe - python

In the given data frame, I am trying to perform a row-wise replace operation where each 1 should be replaced by the value in the Values column.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 3, 3, 4, 5, 6, 7],
                   'A': [0, 1, 0, 1, 0, 0, 1, 0, np.nan, 0],
                   'B': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
                   'C': [1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
                   'Values': [10, 2, 3, 4, 9, 3, 4, 5, 2, 3]})
Expected Output:
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Note: The data is very large.

Use df.where:
df[['A','B','C']] = df[['A','B','C']].where(df[['A','B','C']].ne(1), df['Values'], axis=0)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
Or
df[['A','B','C']] = df[['A','B','C']].mask(df[['A','B','C']].eq(1), df['Values'], axis=0)

My data is really large and it is very slow.
If we exploit the nature of your dataset (the A, B, C columns contain only 1s, 0s, or NaNs), you simply have to multiply df['Values'] with each column independently. This should be super fast, as it is vectorized.
df['A'] = df['A']*df['Values']
df['B'] = df['B']*df['Values']
df['C'] = df['C']*df['Values']
print(df)
ID A B C Values
0 1 0.0 0 10 10
1 1 2.0 0 0 2
2 1 0.0 0 3 3
3 2 4.0 0 0 4
4 3 0.0 9 0 9
5 3 0.0 3 0 3
6 4 4.0 0 0 4
7 5 0.0 0 0 5
8 6 NaN 0 2 2
9 7 0.0 0 3 3
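The three assignments can also be collapsed into a single aligned multiply over the block of columns (a small sketch of the same idea):
# same idea as the three lines above, in one call aligned on the index
df[['A', 'B', 'C']] = df[['A', 'B', 'C']].mul(df['Values'], axis=0)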
If you want to explicitly check the condition that the values of A, B, C are 1 (perhaps because those columns could hold values other than NaNs or 0s), then you can use this:
df[['A','B','C']] = (df[['A','B','C']] == 1)*df[['Values']].values
This replaces the columns A, B, C in the original data, but it also replaces NaNs with 0.
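If you need the speed of the boolean multiply but want to keep the NaNs, one option is to remember where they were and restore them afterwards (a sketch, not part of the original answer):
cols = ['A', 'B', 'C']
nan_mask = df[cols].notna()                      # remember where the NaNs were
df[cols] = (df[cols] == 1) * df[['Values']].values
df[cols] = df[cols].where(nan_mask)              # put the NaNs back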

Related

Subtraction in dataframe between rows

I want an easy subtraction of two values: the value at [10, 150] should be replaced by the result of ([10, 150] - [9, 150]).
Somehow the code does not like the "rows-1":
for columns in listofcolumns:
    rows = 0
    while rows < row_count:
        column = all_columns.index(columns)
        # note: when rows == 0, rows-1 is -1, which .iloc reads as the last row
        df_merged.iloc[rows, column] = (df_merged.iloc[rows, column]
                                        - df_merged.iloc[rows-1, column])
        rows = rows + 1
It seems to be the case that df_merged.iloc[rows-1, column] takes the last value of the column.
I used the exact same line in another script before, and it worked.
This would be an example of some columns
Col1 Col2
0 2
0 3
0 4
0 4
1 5
1 7
1 8
1 8
2 8
The output dataframe i want would look like this.
Col1 Col2
NaN NaN
0 1
0 1
0 0
1 1
0 2
0 1
0 1
1 1
If I understood what you want to do, this would be the solution:
import pandas as pd

data = {'A': [5, 7, 9, 3, 2], 'B': [1, 4, 6, 1, 2]}
df = pd.DataFrame(data)
df["A"] = df["A"] - df["B"]
DataFrame at the start
A B
0 5 1
1 7 4
2 9 6
3 3 1
4 2 2
DataFrame at the end
A B
0 4 1
1 3 4
2 3 6
3 2 1
4 0 2
Use df.diff(1), which subtracts the previous row from each row, column by column:
df.diff(1)
Col1 Col2
0 NaN NaN
1 0.0 1.0
2 0.0 1.0
3 0.0 0.0
4 1.0 1.0
5 0.0 2.0
6 0.0 1.0
7 0.0 0.0
8 1.0 0.0
The above is based on the following data:
Col1 Col2
0 0 2
1 0 3
2 0 4
3 0 4
4 1 5
5 1 7
6 1 8
7 1 8
8 2 8
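Applied to the names from the question (a sketch; listofcolumns and df_merged are the asker's variables), the whole loop collapses into one vectorized call:
# row-over-row difference for every column at once; the first row becomes NaN
df_merged[listofcolumns] = df_merged[listofcolumns].diff(1)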

pd dataframe adding rows by id

I have a df with some ids, day numbers, and a running sum:
import pandas as pd

data = {'id': [0, 0, 0, 1, 1, 2, 1],
        'day': [0, 2, 1, 1, 4, 2, 2],
        'running_sum': [1, 4, 2, 1, 6, 6, 3]}
df_1 = pd.DataFrame(data)
id day running_sum
0 0 0 1
1 0 2 4
2 0 1 2
3 1 1 1
4 1 4 6
5 2 2 6
6 1 2 3
I want a dataframe with all days for each id and the correct running sum:
id day running_sum
0 0 0 1
1 0 1 2
2 0 2 4
3 0 3 4
4 0 4 4
5 1 0 0
6 1 1 1
7 1 2 3
8 1 3 3
9 1 4 6
10 2 0 0
11 2 1 0
12 2 2 6
13 2 3 6
14 2 4 6
Thanks for the help.
Let's see if this logic is what you have in mind:
Set id and day as index:
df_1 = df_1.set_index(['id', 'day'])
Build a new index to reindex df_1 while introducing the missing days; luckily the index is unique, so reindex works fine:
new_index = df_1.index.get_level_values('id').unique()
new_index = pd.MultiIndex.from_product([new_index, range(5)],
                                       names=['id', 'day'])
df_1 = df_1.reindex(new_index)
Group by id and fill down within each group; the remaining nulls are replaced with zero:
(df_1.assign(running_sum=df_1.groupby('id')
                             .running_sum
                             .ffill()
                             .fillna(0))
     .reset_index()
)
id day running_sum
0 0 0 1.0
1 0 1 2.0
2 0 2 4.0
3 0 3 4.0
4 0 4 4.0
5 1 0 0.0
6 1 1 1.0
7 1 2 3.0
8 1 3 3.0
9 1 4 6.0
10 2 0 0.0
11 2 1 0.0
12 2 2 6.0
13 2 3 6.0
14 2 4 6.0
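For reference, here are the steps above assembled into one runnable sketch:
import pandas as pd

df_1 = pd.DataFrame({'id': [0, 0, 0, 1, 1, 2, 1],
                     'day': [0, 2, 1, 1, 4, 2, 2],
                     'running_sum': [1, 4, 2, 1, 6, 6, 3]})
full = pd.MultiIndex.from_product([df_1['id'].unique(), range(5)],
                                  names=['id', 'day'])
out = (df_1.set_index(['id', 'day'])
           .reindex(full)                          # expose the missing days
           .groupby('id')['running_sum'].ffill()   # fill down within each id
           .fillna(0)                              # ids with no earlier value get 0
           .reset_index())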
If you are not averse to using an add-on library, the complete function/method from pyjanitor could help abstract the process:
# pip install pyjanitor
import janitor

df = df_1.complete('id', {'day': range(5)})  # explicitly expose the missing values
df.assign(running_sum=df.groupby('id').running_sum.ffill().fillna(0))
id day running_sum
0 0 0 1.0
1 0 1 2.0
2 0 2 4.0
3 0 3 4.0
4 0 4 4.0
5 1 0 0.0
6 1 1 1.0
7 1 2 3.0
8 1 3 3.0
9 1 4 6.0
10 2 0 0.0
11 2 1 0.0
12 2 2 6.0
13 2 3 6.0
14 2 4 6.0
All this is premised on the assumption that I got the logic right.
You can unstack/stack and ffill: unstack pivots day out to columns, creating a NaN for every missing id/day pair, and stack(dropna=False) pivots back while keeping those NaNs. The tricky part is to get the missing 'days' that no id has at all:
missing = set(range(0, df_1['day'].max()+1)).difference(df_1['day'].unique())
(pd.concat([df_1,
            pd.DataFrame({'id': 0, 'day': list(missing)})])
   .set_index(['id', 'day'])
   .unstack()
   .stack(dropna=False)  # adds the missing values
   .sort_index()
   .groupby('id')
   .ffill()
   .fillna(0, downcast='infer')
   .reset_index()
)
output:
id day running_sum
0 0 0 1
1 0 1 2
2 0 2 4
3 0 3 4
4 0 4 4
5 1 0 0
6 1 1 1
7 1 2 3
8 1 3 3
9 1 4 6
10 2 0 0
11 2 1 0
12 2 2 6
13 2 3 6
14 2 4 6

How to create a Pandas boolean column indicating if a value will have an n-fold increase in x periods ahead?

I have a single column in a DataFrame containing only numbers, and I need to create a boolean column to indicate if the value will have an n-fold increase in x periods ahead.
I developed a solution using two for loops, but it doesn't seem pythonic enough for me.
Is there a better, more efficient way of doing it? Maybe something with map() or apply()?
Find below my code with an MRE.
import pandas as pd

df = pd.DataFrame([1,2,2,1,3,2,1,3,4,1,2,3,4,4,5,1], columns=['column'])
df['double_in_5_periods_ahead'] = 'n/a'
periods_ahead = 5

for i in range(0, len(df) - periods_ahead):
    for j in range(1, periods_ahead):
        if df['column'].iloc[i+j] / df['column'].iloc[i] >= 2:
            df['double_in_5_periods_ahead'].iloc[i] = 1
            break
        else:
            df['double_in_5_periods_ahead'].iloc[i] = 0
This is the output:
column double_in_5_periods_ahead
0 1 1
1 2 0
2 2 0
3 1 1
4 3 0
5 2 1
6 1 1
7 3 0
8 4 0
9 1 1
10 2 1
11 3 n/a
12 4 n/a
13 4 n/a
14 5 n/a
15 1 n/a
Let us try rolling. Reversing the series turns the backward-looking rolling window into a forward-looking one, so at each row the window covers that row and the rows after it:
import numpy as np

n = 5
df['new'] = (df['column'].iloc[::-1].rolling(n).max() / df['column']).gt(2).astype(int)
df.iloc[-n:, 1] = np.nan  # the last n rows have no full look-ahead window
df
Out[146]:
column new
0 1 1.0
1 2 0.0
2 2 0.0
3 1 1.0
4 3 0.0
5 2 0.0
6 1 1.0
7 3 0.0
8 4 0.0
9 1 1.0
10 2 1.0
11 3 NaN
12 4 NaN
13 4 NaN
14 5 NaN
15 1 NaN
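To see why the reversal gives a look-ahead, here is a sketch that builds the same forward-looking max from explicit shifts (illustrative only):
import pandas as pd

s = pd.Series([1, 2, 2, 1, 3, 2, 1, 3, 4, 1, 2, 3, 4, 4, 5, 1], name='column')
n = 5
# via reversal: at row i the window covers rows i..i+n-1
fwd_max = s.iloc[::-1].rolling(n).max().iloc[::-1]
# the same window built from explicit shifts
check = pd.concat([s.shift(-k) for k in range(n)], axis=1).max(axis=1)
# both agree wherever a full window exists (the last n-1 rows lack one)
print(fwd_max.head(12).eq(check.head(12)).all())  # True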

itertools.product() of multiple rows

My df looks like
a b c
0 1 nan
0 2 3
0 3 4
1 1 nan
I need an itertools.product()-like combination of the entries in the rows within each group of 'a'. Here the second and third rows each have 2 different values, so there are 4 possible combinations for the first group (and one for the second):
a b
1 0 1
0 2
0 3
2 0 1
0 3
0 3
3 0 1
0 2
0 4
4 0 1
0 3
0 4
5 1 1
Any ideas?
In your case:
df = pd.concat([y.dropna(axis=1, thresh=1).ffill(axis=1).melt('a')
                for x, y in df.groupby('a')])
a variable value
0 0.0 b 1.0
1 0.0 b 2.0
2 0.0 b 3.0
3 0.0 c 1.0
4 0.0 c 3.0
5 0.0 c 3.0
0 1.0 b 1.0
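For a literal itertools.product() over the rows, a sketch along these lines might work on the original df (an assumption about the desired output: each row contributes its b value plus any non-null c value as options):
import itertools

for key, group in df.groupby('a'):
    # candidate values per row: b and, where present, c
    options = [row.dropna().unique() for _, row in group[['b', 'c']].iterrows()]
    for combo in itertools.product(*options):
        print(key, combo)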

More efficient way to create dataframe of top n values - python

I have a dataframe of categories that I need to clean up by limiting the values to the top n categories. Any value that isn't in the top n categories, should be binned under 0 (or "other").
I tried the code below, which loops through each column of the dataframe and then through each row of that column, checking whether the value in that position is found in the column's top n value_counts. If yes, it keeps the value; if not, it replaces it with 0.
This implementation technically works, but when the number of rows is large, it takes too long to run. What is the quicker way of accomplishing this in pandas/numpy?
import numpy as np
import pandas as pd

z = pd.DataFrame(np.random.randint(1, 4, size=(100000, 4)))
x = pd.DataFrame()
n = 10
for j in z:
    for i in z[j].index:
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)
I think you can use apply to loop over the columns with a custom function: value_counts finds the top values, and where with isin builds a boolean mask for the replacement:
def f(x):
    y = x.value_counts().head(n).index
    return x.where(x.isin(y), 0)

print(z.apply(f))
Which is the same as:
print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
Sample:
#N =100000
N = 10
np.random.seed(123)
z = pd.DataFrame(np.random.randint(1,4,size=(N, 4)))
print (z)
0 1 2 3
0 3 2 3 3
1 1 3 3 2
2 3 2 3 2
3 1 2 3 2
4 1 3 1 2
5 3 2 1 1
6 1 1 2 3
7 1 3 1 1
8 2 1 2 1
9 1 1 3 2
x = pd.DataFrame()
n = 2
for j in z:
    for i in z[j].index:
        if z.at[i, j] in z[j].value_counts().head(n).index.tolist():
            x.at[i, j] = z.at[i, j]
        else:
            x.at[i, j] = 0
print(x)
0 1 2 3
0 3.0 2.0 3.0 0.0
1 1.0 3.0 3.0 2.0
2 3.0 2.0 3.0 2.0
3 1.0 2.0 3.0 2.0
4 1.0 3.0 1.0 2.0
5 3.0 2.0 1.0 1.0
6 1.0 0.0 0.0 0.0
7 1.0 3.0 1.0 1.0
8 0.0 0.0 0.0 1.0
9 1.0 0.0 3.0 2.0
print (z.apply(lambda x: x.where(x.isin(x.value_counts().head(n).index), 0)))
0 1 2 3
0 3 2 3 0
1 1 3 3 2
2 3 2 3 2
3 1 2 3 2
4 1 3 1 2
5 3 2 1 1
6 1 0 0 0
7 1 3 1 1
8 0 0 0 1
9 1 0 3 2
Similar solution with numpy.where:
print (z.apply(lambda x: np.where(x.isin(x.value_counts().head(n).index), x, 0)))
0 1 2 3
0 3 2 3 0
1 1 3 3 2
2 3 2 3 2
3 1 2 3 2
4 1 3 1 2
5 3 2 1 1
6 1 0 0 0
7 1 3 1 1
8 0 0 0 1
9 1 0 3 2
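These are much faster than the loop because value_counts runs once per column instead of once per cell. The same idea also works without apply by precomputing the top-n sets (a sketch):
# precompute the top-n values per column, then mask each column in one pass
tops = {col: set(z[col].value_counts().head(n).index) for col in z.columns}
x = z.copy()
for col in z.columns:
    x[col] = z[col].where(z[col].isin(tops[col]), 0)
print(x)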
