pandas: replace values by row based on condition - python

I have a pandas dataframe as follows:
df2
amount 1 2 3 4
0 5 1 1 1 1
1 7 0 1 1 1
2 9 0 0 0 1
3 8 0 0 1 0
4 2 0 0 0 1
What I want to do is replace the 1s on every row with the value of the amount field in that row and leave the zeros as is. The output should look like this
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
I've tried applying a lambda function row-wise like this, but I'm running into errors
df2.apply(lambda x: x.loc[i].replace(0, x['amount']) for i in len(x), axis=1)
Any help would be much appreciated. Thanks

Let's use mask:
df2.mask(df2 == 1, df2['amount'], axis=0)
Output:
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2

You can also do it wit pandas.DataFrame.mul() method, like this:
>>> df2.iloc[:, 1:] = df2.iloc[:, 1:].mul(df2['amount'], axis=0)
>>> print(df2)
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2

Related

Pandas How to flag consecutive values ignoring the first occurrence

I have the following code:
data={'id':[1,2,3,4,5,6,7,8,9,10,11],
'value':[1,0,1,0,1,1,1,0,0,1,0]}
df=pd.DataFrame.from_dict(data)
df
Out[8]:
id value
0 1 1
1 2 0
2 3 1
3 4 0
4 5 1
5 6 1
6 7 1
7 8 0
8 9 0
9 10 1
10 11 0
I want to create a flag column that indicate with 1 consecutive values starting from the second occurrence and ignoring the first.
With the actual solution:
df['flag'] =
df.value.groupby([df.value,df.flag.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)
Out[8]:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 1
5 6 1 1
6 7 1 1
7 8 0 1
8 9 0 1
9 10 1 0
10 11 0 0
While I need a solution like this, where the first occurence is flagged as 0 and 1 starting from the second:
Out[8]:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
Create consecutive groups by compared Series.shifted values by not equal and Series.cumsum, create counter by GroupBy.cumcount and compare if greater values like 0 by Series.gt, last map True, False to 1, 0 by casting to integers by Series.astype:
df['flag'] = (df.groupby(df['value'].ne(df['value'].shift()).cumsum())
.cumcount()
.gt(0)
.astype(int))
print (df)
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0
How it working:
print (df.assign(g = df['value'].ne(df['value'].shift()).cumsum(),
counter = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount(),
mask = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount().gt(0)))
id value g counter mask
0 1 1 1 0 False
1 2 0 2 0 False
2 3 1 3 0 False
3 4 0 4 0 False
4 5 1 5 0 False
5 6 1 5 1 True
6 7 1 5 2 True
7 8 0 6 0 False
8 9 0 6 1 True
9 10 1 7 0 False
10 11 0 8 0 False
Use groupby.cumcount and a custom grouper:
# group by identical successive values
grp = df['value'].ne(df['value'].shift()).cumsum()
# flag all but the first one (>0)
# convert the booleans True/False to integers 1/0
df['flag'] = df.groupby(grp).cumcount().gt(0).astype(int)
Generic code to skip first N:
N = 1
grp = df['value'].ne(df['value'].shift()).cumsum()
df['flag'] = df.groupby(grp).cumcount().ge(N).astype(int)
Output:
id value flag
0 1 1 0
1 2 0 0
2 3 1 0
3 4 0 0
4 5 1 0
5 6 1 1
6 7 1 1
7 8 0 0
8 9 0 1
9 10 1 0
10 11 0 0

pandas countif negative using where()

Below is the code and output, what I'm trying to get is shown in the "exp" column, as you can see the "countif" column just counts 5 columns, but I want it to only count negative values.
So for example: index 0, df1[0] should equal 2
What am I doing wrong?
Python
import pandas as pd
import numpy as np
a = ['A','B','C','B','C','A','A','B','C','C','A','C','B','A']
b = [2,4,1,1,2,5,-1,2,2,3,4,3,3,3]
c = [-2,4,1,-1,2,5,1,2,2,3,4,3,3,3]
d = [-2,-4,1,-1,2,5,1,2,2,3,4,3,3,3]
exp = [2,1,0,2,0,0,1,0,0,0,0,0,0,0]
df1 = pd.DataFrame({'b':b,'c':c,'d':d,'exp':exp}, columns=['b','c','d','exp'])
df1['sumif'] = df1.where(df1<0,0).sum(1)
df1['countif'] = df1.where(df1<0,0).count(1)
df1
# df1.sort_values(['a','countif'], ascending=[True, True])
Output
You don't need where here, you can simply use df.lt with df.sum(axis=1):
In [1329]: df1['exp'] = df1.lt(0).sum(1)
In [1330]: df1
Out[1330]:
b c d exp
0 2 -2 -2 2
1 4 4 -4 1
2 1 1 1 0
3 1 -1 -1 2
4 2 2 2 0
5 5 5 5 0
6 -1 1 1 1
7 2 2 2 0
8 2 2 2 0
9 3 3 3 0
10 4 4 4 0
11 3 3 3 0
12 3 3 3 0
13 3 3 3 0
EDIT: As per OP's comment including solution with iloc and .lt:
In [1609]: df1['exp'] = df1.iloc[:, :3].lt(0).sum(1)
First DataFrame.where working different, it replace False values to 0 here by condition (here False are greater of equal 0), so cannot be used for count:
print (df1.iloc[:, :3].where(df1<0,0))
b c d
0 0 -2 -2
1 0 0 -4
2 0 0 0
3 0 -1 -1
4 0 0 0
5 0 0 0
6 -1 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
You need compare first 3 columns for less like 0 and sum:
df1['exp1'] = (df1.iloc[:, :3] < 0).sum(1)
#If need compare all columns
#df1['exp1'] = (df1 < 0).sum(1)
print (df1)
b c d exp exp1
0 2 -2 -2 2 2
1 4 4 -4 1 1
2 1 1 1 0 0
3 1 -1 -1 2 2
4 2 2 2 0 0
5 5 5 5 0 0
6 -1 1 1 1 1
7 2 2 2 0 0
8 2 2 2 0 0
9 3 3 3 0 0
10 4 4 4 0 0
11 3 3 3 0 0
12 3 3 3 0 0
13 3 3 3 0 0

Stacking Pandas Dataframe without dropping row

Currently, I have a dataframe like this:
0 0 0 3 0 0
0 7 8 9 1 0
0 4 5 2 4 0
My code to stack it:
dt = dataset.iloc[:,0:7].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
dt['variable'] = pandas.Categorical(dt.index).codes+1
dt.rename(columns={0:index_column_name}, inplace=True)
dt.set_index(index_column_name, inplace=True)
dt['variable'] = numpy.sort(dt['variable'])
However, it drops the first row when I'm stacking it, and I want to keep the headers / first row, how would I achieve this?
In essence, I'm losing the data from the first row (a.k.a column headers) and I want to keep it.
Desired Output:
value,variable
0 1
0 1
0 1
0 2
7 2
4 2
0 3
8 3
5 3
3 4
9 4
2 4
0 5
1 5
4 5
0 6
0 6
0 6
Current output:
value,variable
0 1
0 1
7 2
4 2
8 3
5 3
9 4
2 4
1 5
4 5
0 6
0 6
Why not use df.melt as #WeNYoBen mentioned?
print(df)
1 2 3 4 5 6
0 0 0 0 3 0 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
print(df.melt())
variable value
0 1 0
1 1 0
2 1 0
3 2 0
4 2 7
5 2 4
6 3 0
7 3 8
8 3 5
9 4 3
10 4 9
11 4 2
12 5 0
13 5 1
14 5 4
15 6 0
16 6 0
17 6 0

How to Give Sequential Names to DataFrames Using Loops?

I've succeed on splitting a DataFrame into several smaller DataFrames. I'm now working on giving these DataFrames sequential names, and can be called independently.
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 3)
for part in result:
print(part, '\n')
movie_id 1 2 5 borda rank IRAM
2 3 4 0 0 4 3 2
1 2 3 0 3 6 2 1
movie_id 1 2 5 borda rank IRAM
4 5 3 0 0 3 4 3
0 1 5 4 4 13 1 4
movie_id 1 2 5 borda rank IRAM
3 4 3 0 0 3 4 3
I want to give names in sequential order to these separated DataFrames with loop(or any helpful methods).
For instance :
df_1
movie_id 1 2 5 borda rank IRAM
2 3 4 0 0 4 3 2
1 2 3 0 3 6 2 1
df_2
movie_id 1 2 5 borda rank IRAM
4 5 3 0 0 3 4 3
0 1 5 4 4 13 1 4
df_3
movie_id 1 2 5 borda rank IRAM
3 4 3 0 0 3 4 3
I've been searching solutions for a while, but I can't find an ideally answer to my problem.
This can be done by taking a dictionary and adding all dataframes into it:
df = pd.DataFrame({'Col1': np.random.randint(10, size=10)})
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 3)
d = {}
for i, part in enumerate(result):
d['df_'+str(i)] = part # If want to start the number for df from 1 then use str(i+1)
print(d['df_0'])
Col1
7 7
6 0
4 5
2 3
print(d['df_1'])
Col1
0 0
8 1
1 5
print(d['df_2'])
Col1
5 2
3 2
9 4
df_dict = {}
for index, splited in enumerate(result):
df_name = "df_{}".format(index)
# if you want to set name of the dataframe
splited.name = df_name
# if you want to set the variable name to dataframe
df_dict[df_name] = splited
print(df_dict)
{'df_0': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
9 10 3 2 0 0 0 4 0 0 0 0 0 9
7 8 1 0 0 0 4 5 0 0 0 4 0 14
6 7 4 0 0 0 2 5 3 4 4 0 0 22
0 1 5 4 0 4 4 0 0 0 4 0 0 21,
'df_1': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
8 9 5 0 0 0 4 5 0 0 4 5 0 23
3 4 3 0 0 0 0 5 0 0 4 0 5 17
5 6 5 0 0 0 0 0 0 5 0 0 0 10,
'df_2': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
4 5 3 0 0 0 0 0 0 0 0 0 0 3
2 3 4 0 0 0 0 0 0 0 0 0 0 4
1 2 3 0 0 3 0 0 0 0 0 0 0 6}
Then you can call any splited_df by df_dict[df_name].
You can use a dictionary, like this:
d = {"df_"+str(k):v for (k,v) in [(i,result[i]) for i in range(len(result))]}

Python: Replace a cell value in Dataframe with if statement

I have a matrix with that looks like this:
com 0 1 2 3 4 5
AAA 0 5 0 4 2 1 4
ABC 0 9 8 9 1 0 3
ADE 1 4 3 5 1 0 1
BCD 1 6 7 8 3 4 1
BCF 2 3 4 2 1 3 0 ...
Where AAA, ABC ... is the dataframe index. The dataframe columns are com 0 1 3 4 5 6
I want to set the cell values in my dataframe equal to 0 when the row values of com is equal the column "number". So for instance, the above matrix will look like:
com 0 1 2 3 4 5
AAA 0 0 0 4 2 1 4
ABC 0 0 8 9 1 0 3
ADE 1 4 0 5 1 0 1
BCD 1 6 0 8 3 4 1
BCF 2 3 4 0 1 3 0 ...
I tried to iterate over rows and use both .loc and .ix but no success.
Just require some numpy trick
In [22]:
print df
0 1 2 3 4 5
0 5 0 4 2 1 4
0 9 8 9 1 0 3
1 4 3 5 1 0 1
1 6 7 8 3 4 1
2 3 4 2 1 3 0
[5 rows x 6 columns]
In [23]:
#making a masking matrix, 0 where column and index values equal, 1 elsewhere, kind of the vectorized way of doing if TURE 0, else 1
print df*np.where(df.columns.values==df.index.values[..., np.newaxis], 0,1)
0 1 2 3 4 5
0 0 0 4 2 1 4
0 0 8 9 1 0 3
1 4 0 5 1 0 1
1 6 0 8 3 4 1
2 3 4 0 1 3 0
[5 rows x 6 columns]
I think this should work.
for line in range(len(matrix)):
matrix[matrix[line][0]+1]=0
NOTE
Depending on your matrix setup you may not need the +1
Basically it takes the first digit of each line in the matrix and uses that as the index of the value to change to 0
i.e. if the row was
c 0 1 2 3 4 5
AAA 4 3 2 3 9 5 9,
it would change the 5 below the number 4 to 0
c 0 1 2 3 4 5
AAA 4 3 2 3 9 0 9

Categories