Multiply column values but exclude zero values in Pandas - python

I'm looking for a way to multiply the values of all columns in each row while excluding any value that is 0, so that the result is never 0 (from multiplying by 0). With this number of columns and rows it's easy, but what if there are 100 columns and 5000 rows?
import pandas as pd
df = pd.DataFrame({"Col1":[6,4,3,0],
"Col2":[1,0,0,3],
"Col3":[2,4,3,2]})
So result should look like this:
print(df)
# result should be multiplication of all column values, but not 0
# zeros should be excluded
6 * 1 * 2
4 * 4
3 * 3
3 * 2
df = pd.DataFrame({"Col1":[6,4,3,0],
"Col2":[1,0,0,3],
"Col3":[2,4,3,2],
"Result":[12,16,9,6]})
print(df)
I cannot change the data, so changing the zeros to 1 does not work.

You could simply replace the 0s with 1s.
df = pd.DataFrame({"Col1":[6,4,3,0],
"Col2":[1,0,0,3],
"Col3":[2,4,3,2]})
df['Result'] = df.replace(0,1).prod(axis=1)
Col1 Col2 Col3 Result
0 6 1 2 12
1 4 0 4 16
2 3 0 3 9
3 0 3 2 6
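To get technical: 1 is the identity element for multiplication, and 0 is the identity element for addition. To oversimplify, an identity element is a value that leaves the other operand unchanged (x * 1 == x, x + 0 == x), so replacing the zeros with 1 removes them from the product without changing it. Note that replace returns a new DataFrame, so the original data is not modified.
To get non-technical, I think of the quote "Think Smart, Not Hard".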
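Since the question says the data must not be changed, it may help to check that the replace-based approach leaves the original frame alone. A minimal sketch, assuming the same example data as in the question (the variable name result is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Col1": [6, 4, 3, 0],
                   "Col2": [1, 0, 0, 3],
                   "Col3": [2, 4, 3, 2]})

# replace() returns a new DataFrame; df itself still contains the zeros.
result = df.replace(0, 1).prod(axis=1)

print(result.tolist())      # row products with zeros treated as 1
print(df["Col2"].tolist())  # original column, zeros still present
```

Keeping the product in a separate Series (or assigning it to a new column afterwards) means the zeros in the source columns stay untouched.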

Maybe just replace the zero values with one and multiply the values (np is NumPy, imported with import numpy as np):
df['Result'] = df.replace(0,1).apply(np.prod,axis=1)

Simply mask the 0 values to NaN and call prod:
df['Result'] = df.where(df.ne(0)).prod(1)
Out[1748]:
Col1 Col2 Col3 Result
0 6 1 2 12.0
1 4 0 4 16.0
2 3 0 3 9.0
3 0 3 2 6.0
Or replace the 0 values with 1:
df['Result'] = df.where(df.ne(0), 1).prod(1)
Out[1754]:
Col1 Col2 Col3 Result
0 6 1 2 12
1 4 0 4 16
2 3 0 3 9
3 0 3 2 6
step by step:
ne(0) returns a boolean mask in which nonzero values are True and zeros are False:
In [1755]: df.ne(0)
Out[1755]:
Col1 Col2 Col3 Result
0 True True True True
1 True False True True
2 True False True True
3 False True True True
where checks each location of the boolean mask. On True it keeps the original value; on False it replaces the value with NaN when no second argument is given.
In [1756]: df.where(df.ne(0))
Out[1756]:
Col1 Col2 Col3 Result
0 6.0 1.0 2 12
1 4.0 NaN 4 16
2 3.0 NaN 3 9
3 NaN 3.0 2 6
prod(1) is the product along axis=1. prod ignores NaN by default (skipna=True), so it returns the product of each row without considering the NaN values.
In [1759]: df.where(df.ne(0)).prod(1)
Out[1759]:
0 12.0
1 16.0
2 9.0
3 6.0
dtype: float64
When a second argument is passed to where, it is used as the replacement wherever the mask is False.
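The two where variants can be compared side by side. This is a small sketch on the question's data; the names mask, as_nan and as_one are illustrative, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [6, 4, 3, 0],
                   "Col2": [1, 0, 0, 3],
                   "Col3": [2, 4, 3, 2]})

mask = df.ne(0)  # False wherever the value is 0

as_nan = df.where(mask).prod(axis=1)     # zeros -> NaN, skipped by prod
as_one = df.where(mask, 1).prod(axis=1)  # zeros -> 1, the multiplicative identity

print(as_nan.tolist())
print(as_one.tolist())
```

Both give the same products; the NaN variant comes back as float64 because NaN forces the masked columns to float. A row consisting only of zeros would yield 1 with either variant, since an empty product is 1.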

Related

How to replace column values based on other columns in pandas?

Assume, I have a data frame such as
import pandas as pd
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
'col1':[1,2,3,4,5],
'col2':[1,2,4,7,8],
'col3':[4,2,3,6,1]})
visitor col1 col2 col3
A 1 1 4
B 2 2 2
C 3 4 3
D 4 7 6
E 5 8 1
For each row/visitor: (1) if a row contains identical values, I would like to keep the first occurrence and replace the remaining identical values in that row with NULL, such as
visitor col1 col2 col3
A 1 NULL 4
B 2 NULL NULL
C 3 4 NULL
D 4 7 6
E 5 8 1
Then (2) keep only the rows/visitors with more than one remaining value, such as
Final Data Frame
visitor col1 col2 col3
A 1 NULL 4
C 3 4 NULL
D 4 7 6
E 5 8 1
Any suggestions? Many thanks.
We can use Series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where and keep the rows where the count of non-duplicated values is greater than 1
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
Let us try mask with pd.Series.duplicated, then dropna with thresh
out = df.mask(df.apply(pd.Series.duplicated,1)).dropna(thresh = df.shape[1]-1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
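For reference, the first approach can be run end to end as follows. This is a self-contained sketch on the question's data; first_seen and out are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"visitor": ["A", "B", "C", "D", "E"],
                   "col1": [1, 2, 3, 4, 5],
                   "col2": [1, 2, 4, 7, 8],
                   "col3": [4, 2, 3, 6, 1]})

s = df.set_index("visitor")

# True where a value is the first occurrence within its row.
first_seen = ~s.apply(pd.Series.duplicated, axis=1)

# Mask repeats with NaN, then keep rows with more than one surviving value.
out = s.where(first_seen)[first_seen.sum(axis=1).gt(1)]

print(out)
```

Moving visitor into the index first keeps the label column out of the duplicate check, so only the numeric columns are compared with each other.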

how to replace rows by row with condition

I want to replace all rows that have "A" in the name column
with a single row from another DataFrame.
I have this:
data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
This is my single row (the other DataFrame):
data2={"col1":[0]
,"col2":[1]
,"col3":[5]
,"col4":[6]
}
df2=pd.DataFrame.from_dict(data2)
df2
This is how I want it to look:
data={"col1":[0,0,4,0,7],
"col2":[1,1,4,1,4],
"col3":[5,5,9,5,2],
"col4":[6,6,22,6,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
I tried df.loc[df["name"]=="A"][df2.columns]=df2
but it did not work.
We can try mask + combine_first
df = df.mask(df['name'].eq('A'), df2.loc[0], axis=1).combine_first(df)
df
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8.0
1 0 1 5 6 A 2.0
2 4 4 9 22 V 1.0
3 0 1 5 6 A 3.0
4 7 4 2 5 B 9.0
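df.loc[df["name"]=="A"][df2.columns]=df2 is chained indexing: the first .loc[...] may return a copy, so the assignment is not guaranteed to reach df. For details, see the pandas documentation on returning a view versus a copy.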
You can also use boolean indexing like this:
df.loc[df['name']=='A', df2.columns] = df2.values
Output:
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8
1 0 1 5 6 A 2
2 4 4 9 22 V 1
3 0 1 5 6 A 3
4 7 4 2 5 B 9
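A self-contained sketch of the single .loc assignment, using the question's data. As a small variation that is my own assumption rather than the answer's exact code, the replacement row is passed as a flat list (df2.iloc[0].tolist()), which sets each listed column to one value for all matching rows:

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 3, 4, 5, 7],
                   "col2": [4, 2, 4, 6, 4],
                   "col3": [7, 6, 9, 11, 2],
                   "col4": [14, 11, 22, 8, 5],
                   "name": ["A", "A", "V", "A", "B"],
                   "n_roll": [8, 2, 1, 3, 9]})
df2 = pd.DataFrame({"col1": [0], "col2": [1], "col3": [5], "col4": [6]})

is_a = df["name"] == "A"

# One .loc call selects rows and columns together, so the assignment
# writes into df itself rather than into a temporary copy.
df.loc[is_a, df2.columns] = df2.iloc[0].tolist()

print(df)
```

The name and n_roll columns are not listed in df2.columns, so they are left as they were.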

Replace repetitive number with NAN values except the first, in pandas column

I have a data frame like this,
df
col1 col2
1 A
2 A
3 B
4 C
5 C
6 C
7 B
8 B
9 A
Now we can see that there are consecutive runs of A, B and C. I want to keep the value only in the row where each run starts; the other values of the same run should become NaN.
The final data frame I am looking for will look like,
df
col1 col2
1 A
2 NA
3 B
4 C
5 NA
6 NA
7 B
8 NA
9 A
I can do it using a for loop and comparisons, but the execution time will be longer. I am looking for a pythonic way to do it, maybe some pandas shortcuts.
Compare with the shifted values (Series.shift) and set the missing values with Series.where or numpy.where:
df['col2'] = df['col2'].where(df['col2'].ne(df['col2'].shift()))
#alternative
#df['col2'] = np.where(df['col2'].ne(df['col2'].shift()), df['col2'], np.nan)
Or by DataFrame.loc with inverted condition by ~:
df.loc[~df['col2'].ne(df['col2'].shift()), 'col2'] = np.nan
Or, thanks to @Daniel Mesejo, use eq for ==:
df.loc[df['col2'].eq(df['col2'].shift()), 'col2'] = np.nan
print (df)
col1 col2
0 1 A
1 2 NaN
2 3 B
3 4 C
4 5 NaN
5 6 NaN
6 7 B
7 8 NaN
8 9 A
Detail:
print (df['col2'].ne(df['col2'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
Name: col2, dtype: bool
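The shift comparison can be tried on its own with the question's data. In this sketch the names starts and run_id are illustrative; run_id is an extra step, not part of the answer, showing how the same mask can number the runs:

```python
import pandas as pd

df = pd.DataFrame({"col1": range(1, 10),
                   "col2": list("AABCCCBBA")})

# A row starts a new run when its value differs from the previous row's.
# The first row is compared with a missing value, so it always counts as a start.
starts = df["col2"].ne(df["col2"].shift())

# Optional extra: a running count of starts labels each run.
run_id = starts.cumsum()

df["col2"] = df["col2"].where(starts)

print(df)
```

Because the comparison is vectorised, no Python-level loop over the rows is needed.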

Group by within a groupby then averaging

Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1':[3,4,2,6,5,7,3,4,9,7,1,3],
'Col2':['B','B','B','B','A','A','A','A','C','C','C','C',],
'Col3':[1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do is several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After your first groupby, use sort_values + drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, in case the maximum mean is tied within a group:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
Do the following (I modified your code slightly,
to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
res = df2.loc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see:
idxmax gives the index label of the row with the maximal element (for each group),
and because these are labels, the result should be passed to loc. (iloc happens to give the same rows here only because df2 has the default 0..n-1 index.)
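Put together, the two steps might look like this. It is a self-contained sketch on the question's data, with means and best as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [3, 4, 2, 6, 5, 7, 3, 4, 9, 7, 1, 3],
                   "Col2": list("BBBBAAAACCCC"),
                   "Col3": [1, 1, 2, 2] * 3})

# Step 1: average Col1 for every (Col3, Col2) pair.
means = df.groupby(["Col3", "Col2"], as_index=False)["Col1"].mean()

# Step 2: idxmax returns index labels of the best row per Col3 group,
# so the rows are looked up with .loc.
best = means.loc[means.groupby("Col3")["Col1"].idxmax()]

print(best)
```

idxmax returns only the first maximal row per group, so if ties matter, the transform('max') comparison shown earlier keeps all of them.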

How to extract numeric ranges from 2 columns containing numeric sequences and print the range from both columns (different increment values)?

I'm currently learning Python and pandas (this question is based on a previous post but with an additional query). At the moment I have 2 columns containing numeric sequences (ascending and/or descending) as described below:
Col 1: (Col1 numeric increment and/or decrement = 1)
1
2
3
5
7
8
9
Col 2: (Col2 numeric increment and/or decrement = 4)
113
109
105
90
94
98
102
I need to extract the numeric ranges from both columns and print them according to where a sequence break occurs in either of the 2 columns. The result should be as follows:
1,3,105,113
5,5,90,90
7,9,94,102
I already received a very useful way to do it with the pandas library from @MaxU, which generates the numeric ranges based on the breaks detected in both columns, using the criterion that Col1 and Col2 each increase and/or decrease by 1:
How can I extract numeric ranges from 2 columns and print the range from both columns as tuples?
The only difference in this case is that the increment/decrement criterion is different for each of the two columns.
Try this:
In [42]: df
Out[42]:
Col1 Col2
0 1 113
1 2 109
2 3 105
3 5 90
4 7 94
5 8 98
6 9 102
In [43]: df.groupby(df.diff().abs().ne([1,4]).any(axis=1).cumsum()).agg(['min','max'])
Out[43]:
Col1 Col2
min max min max
1 1 3 105 113
2 5 5 90 90
3 7 9 94 102
Explanation: our goal is to group consecutive rows whose increment/decrement is [1,4] for Col1 and Col2 respectively:
In [44]: df.diff().abs()
Out[44]:
Col1 Col2
0 NaN NaN
1 1.0 4.0
2 1.0 4.0
3 2.0 15.0
4 2.0 4.0
5 1.0 4.0
6 1.0 4.0
In [45]: df.diff().abs().ne([1,4])
Out[45]:
Col1 Col2
0 True True
1 False False
2 False False
3 True True
4 True False
5 False False
6 False False
In [46]: df.diff().abs().ne([1,4]).any(axis=1)
Out[46]:
0 True
1 False
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [47]: df.diff().abs().ne([1,4]).any(axis=1).cumsum()
Out[47]:
0 1
1 1
2 1
3 2
4 3
5 3
6 3
dtype: int32
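The steps above can be collected into one runnable sketch that also prints the ranges in the comma-separated form the question asks for. The names steps, breaks and ranges are illustrative, and axis is passed by keyword because recent pandas versions no longer accept it positionally in any:

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 3, 5, 7, 8, 9],
                   "Col2": [113, 109, 105, 90, 94, 98, 102]})

steps = [1, 4]  # expected absolute step for Col1 and Col2

# A row opens a new group when either column breaks its expected step.
breaks = df.diff().abs().ne(steps).any(axis=1)
ranges = df.groupby(breaks.cumsum()).agg(["min", "max"])

# One line per group: Col1 min, Col1 max, Col2 min, Col2 max.
for lo1, hi1, lo2, hi2 in ranges.to_numpy().tolist():
    print(f"{lo1},{hi1},{lo2},{hi2}")
```

Changing the steps list is enough to adapt the grouping to other per-column increments.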
