I have a data frame like this:
df
col1 col2
1 A
2 A
3 B
4 C
5 C
6 C
7 B
8 B
9 A
As you can see, there are consecutive runs of A, B and C. I want to keep a value only in the row where its run starts; the remaining rows of the same run should become NaN.
The final data frame I am looking for will look like this:
df
col1 col2
1 A
2 NA
3 B
4 C
5 NA
6 NA
7 B
8 NA
9 A
I can do it with a for loop and comparisons, but the execution time would be high. I am looking for a pythonic way to do it, maybe some pandas shortcut.
Compare the values with the ones shifted down one row by Series.shift, and set the repeats to missing values with Series.where or numpy.where:
df['col2'] = df['col2'].where(df['col2'].ne(df['col2'].shift()))
#alternative
#df['col2'] = np.where(df['col2'].ne(df['col2'].shift()), df['col2'], np.nan)
Or use DataFrame.loc with the condition inverted by ~:
df.loc[~df['col2'].ne(df['col2'].shift()), 'col2'] = np.nan
Or, thanks to @Daniel Mesejo, use eq for ==:
df.loc[df['col2'].eq(df['col2'].shift()), 'col2'] = np.nan
print (df)
col1 col2
0 1 A
1 2 NaN
2 3 B
3 4 C
4 5 NaN
5 6 NaN
6 7 B
7 8 NaN
8 9 A
Detail:
print (df['col2'].ne(df['col2'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
Name: col2, dtype: bool
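Putting it together as a self-contained sketch (the frame here is rebuilt from the example above, so column names and values are simply copied from the question):

import pandas as pd

df = pd.DataFrame({'col1': range(1, 10),
                   'col2': list('AABCCCBBA')})

# a value starts a new run when it differs from the value directly above it
starts = df['col2'].ne(df['col2'].shift())

# keep the value only at the start of each run, NaN elsewhere
df['col2'] = df['col2'].where(starts)
print (df)
   col1 col2
0     1    A
1     2  NaN
2     3    B
3     4    C
4     5  NaN
5     6  NaN
6     7    B
7     8  NaN
8     9    A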
Assume I have a data frame such as
import pandas as pd
df = pd.DataFrame({'visitor':['A','B','C','D','E'],
'col1':[1,2,3,4,5],
'col2':[1,2,4,7,8],
'col3':[4,2,3,6,1]})
visitor  col1  col2  col3
A        1     1     4
B        2     2     2
C        3     4     3
D        4     7     6
E        5     8     1
For each row/visitor: (1) first, if there are any identical values in a row, I would like to keep the 1st occurrence and replace the rest of the identical values in the same row with NULL, such as
visitor  col1  col2  col3
A        1     NULL  4
B        2     NULL  NULL
C        3     4     NULL
D        4     7     6
E        5     8     1
Then (2) keep only the rows/visitors with more than 1 remaining value, such as
Final Data Frame
visitor  col1  col2  col3
A        1     NULL  4
C        3     4     NULL
D        4     7     6
E        5     8     1
Any suggestions? Many thanks.
We can use Series.duplicated along the columns axis to identify the duplicates, then mask the duplicates using where and keep the rows where the count of non-duplicated values is greater than 1:
s = df.set_index('visitor')
m = ~s.apply(pd.Series.duplicated, axis=1)
s.where(m)[m.sum(1).gt(1)]
col1 col2 col3
visitor
A 1 NaN 4.0
C 3 4.0 NaN
D 4 7.0 6.0
E 5 8.0 1.0
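For illustration, the intermediate mask m from the snippet above looks like this (output reconstructed by hand from the example; True marks the first occurrence of each value in a row):

print (m)
          col1   col2   col3
visitor
A         True  False   True
B         True  False  False
C         True   True  False
D         True   True   True
E         True   True   True

print (m.sum(1))
visitor
A    2
B    1
C    2
D    3
E    3
dtype: int64

m.sum(1) counts the kept values per row, and gt(1) keeps only the visitors with more than one, which is why B is dropped.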
Let us try mask with pd.Series.duplicated, then dropna with thresh:
out = df.mask(df.apply(pd.Series.duplicated, axis=1)).dropna(thresh=df.shape[1] - 1)
Out[321]:
visitor col1 col2 col3
0 A 1 NaN 4.0
2 C 3 4.0 NaN
3 D 4 7.0 6.0
4 E 5 8.0 1.0
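Here thresh = df.shape[1] - 1 keeps only the rows with at least 3 non-NaN cells, i.e. the visitor label plus at least two surviving values. Spelled out step by step (a sketch, using the frame from the question):

import pandas as pd

df = pd.DataFrame({'visitor': ['A', 'B', 'C', 'D', 'E'],
                   'col1': [1, 2, 3, 4, 5],
                   'col2': [1, 2, 4, 7, 8],
                   'col3': [4, 2, 3, 6, 1]})

# per-row duplicates become NaN; the visitor label never duplicates a number, so it always survives
masked = df.mask(df.apply(pd.Series.duplicated, axis=1))

# thresh = 4 - 1 = 3: keep rows with the visitor label plus at least 2 values
out = masked.dropna(thresh=df.shape[1] - 1)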
I'm looking for a way to multiply the values of all columns per row, but to exclude the columns that have a value of 0, so the result should not be 0 (multiplication by 0). With this number of columns and rows it's easy, but what if there are 100 columns and 5000 rows?
import pandas as pd
df = pd.DataFrame({"Col1":[6,4,3,0],
"Col2":[1,0,0,3],
"Col3":[2,4,3,2]})
So the result should be the multiplication of all column values per row, but with the zeros excluded:
6 * 1 * 2
4 * 4
3 * 3
3 * 2
df = pd.DataFrame({"Col1":[6,4,3,0],
"Col2":[1,0,0,3],
"Col3":[2,4,3,2],
"Result":[12,16,9,6]})
print(df)
I cannot change the data itself, so replacing the zeros with 1 in the original frame does not work.
You could simply replace the 0s with 1s on the fly; replace returns a copy, so the original data stays untouched.
df = pd.DataFrame({"Col1":[6,4,3,0],
"Col2":[1,0,0,3],
"Col3":[2,4,3,2]})
df['Result'] = df.replace(0,1).prod(axis=1)
Col1 Col2 Col3 Result
0 6 1 2 12
1 4 0 4 16
2 3 0 3 9
3 0 3 2 6
To get technical: in multiplication, 1 is the identity element; in addition, the identity element is 0. To way oversimplify it, an identity element is just a fancy way of saying "a value that leaves the result unchanged when you combine it with something else."
To get non-technical, I think of the saying "Think smart, not hard."
Maybe just replace the zero values by one and multiply the values:
import numpy as np

df['Result'] = df.replace(0, 1).apply(np.prod, axis=1)
Simply mask the 0 values to NaN and call prod:
df['Result'] = df.where(df.ne(0)).prod(1)
Out[1748]:
Col1 Col2 Col3 Result
0 6 1 2 12.0
1 4 0 4 16.0
2 3 0 3 9.0
3 0 3 2 6.0
Or mask 0 to 1:
df['Result'] = df.where(df.ne(0), 1).prod(1)
Out[1754]:
Col1 Col2 Col3 Result
0 6 1 2 12
1 4 0 4 16
2 3 0 3 9
3 0 3 2 6
Step by step:
ne(0) returns a boolean mask where non-zero values are True and zeros are False:
In [1755]: df.ne(0)
Out[1755]:
Col1 Col2 Col3 Result
0 True True True True
1 True False True True
2 True False True True
3 False True True True
where checks each location of the boolean mask. On True it keeps the original value; on False it turns the value into NaN when there is no 2nd parameter:
In [1756]: df.where(df.ne(0))
Out[1756]:
Col1 Col2 Col3 Result
0 6.0 1.0 2 12
1 4.0 NaN 4 16
2 3.0 NaN 3 9
3 NaN 3.0 2 6
prod(1) is the product along axis=1. prod ignores NaN by default, so it returns the product of each row without considering the NaNs:
In [1759]: df.where(df.ne(0)).prod(1)
Out[1759]:
0 12.0
1 16.0
2 9.0
3 6.0
dtype: float64
When a 2nd parameter is specified for where, it is used as the replacement wherever the mask is False.
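To round out the step-by-step walk-through, here is the 2nd-parameter form on a freshly built copy of the example frame (a sketch; the intermediate output was written out by hand):

import pandas as pd

df = pd.DataFrame({"Col1": [6, 4, 3, 0],
                   "Col2": [1, 0, 0, 3],
                   "Col3": [2, 4, 3, 2]})

# ne(0): True where the value is non-zero, False where it is 0
# where(mask, 1): keep the value where the mask is True, put 1 where it is False
replaced = df.where(df.ne(0), 1)
print (replaced)
   Col1  Col2  Col3
0     6     1     2
1     4     1     4
2     3     1     3
3     1     3     2

# the row product is then unaffected by the former zeros
print (replaced.prod(axis=1))
0    12
1    16
2     9
3     6
dtype: int64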
Say I have a big DataFrame (>10000 rows) in which some rows contain one or more nulls. How do I remove all the rows containing a null in one or more of their columns from the original DataFrame and put those rows into another DataFrame?
e.g.:
Original DataFrame:
a b c
1 "foo" 5 3
2 "bar" 9 1
3 NaN 5 4
4 "foo" NaN 1
Non-Null DataFrame:
a b c
1 "foo" 5 3
2 "bar" 9 1
Null containing DataFrame:
a b c
1 NaN 5 4
2 "foo" NaN 1
Use DataFrame.isna for checking missing values:
print (df.isna())
#print (df.isnull())
a b c
1 False False False
2 False False False
3 True False False
4 False True False
Then test whether there is at least one True per row with DataFrame.any:
mask = df.isna().any(axis=1)
# older pandas versions
mask = df.isnull().any(axis=1)
print (mask)
1 False
2 False
3 True
4 True
dtype: bool
Finally, filter by boolean indexing; ~ inverts the boolean mask:
df1 = df[~mask]
df2 = df[mask]
print (df1)
a b c
1 foo 5.0 3
2 bar 9.0 1
print (df2)
a b c
3 NaN 5.0 4
4 foo NaN 1
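A self-contained version of the whole split (the example frame is reconstructed here; b becomes float because of the NaN, which is why it prints as 5.0 and 9.0 above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', np.nan, 'foo'],
                   'b': [5, 9, 5, np.nan],
                   'c': [3, 1, 4, 1]},
                  index=[1, 2, 3, 4])

mask = df.isna().any(axis=1)   # True for rows with at least one missing value

df1 = df[~mask]   # rows without any nulls
df2 = df[mask]    # rows containing at least one null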
Consider the following dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : [1, 2, 3, 3, 4, 4, 5, 6, 7],
'B' : ['a','b','c','c','d','d','e','f','g'],
'Col_1' :[np.NaN, 'A','A', np.NaN, 'B', np.NaN, 'B', np.NaN, np.NaN],
'Col_2' :[2,2,3,3,3,3,4,4,5]})
df
Out[92]:
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
3 3 c NaN 3
4 4 d B 3
5 4 d NaN 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
I want to remove all rows which are duplicates with regard to columns 'A' and 'B', dropping the entry that has a NaN (I know that for every duplicate pair there will be one NaN and one non-NaN entry). The end result should look like this:
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
Efficient one-liners are most welcome.
If the goal is to drop only the NaN duplicates, a slightly more involved solution is needed.
First, sort on A, B, and Col_1, so the NaNs are moved to the bottom of each group (sort_values places NaN last by default). Then call df.drop_duplicates with keep='first':
out = df.sort_values(['A', 'B', 'Col_1']).drop_duplicates(['A', 'B'], keep='first')
print(out)
A B Col_1 Col_2
0 1 a NaN 2
1 2 b A 2
2 3 c A 3
4 4 d B 3
6 5 e B 4
7 6 f NaN 4
8 7 g NaN 5
Here's an alternative:
df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
# A B Col_1 Col_2
# 0 1 a NaN 2
# 1 2 b A 2
# 2 3 c A 3
# 4 4 d B 3
# 6 5 e B 4
# 7 6 f NaN 4
# 8 7 g NaN 5
This uses the bitwise "not" operator ~ to negate rows that meet the joint condition of being a duplicate row (the argument keep=False causes the method to evaluate to True for all non-unique rows) and containing at least one null value. So where the expression df[['A', 'B']].duplicated(keep=False) returns this Series:
# 0 False
# 1 False
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# 7 False
# 8 False
...and the expression df.isnull().any(axis=1) returns this Series:
# 0 True
# 1 False
# 2 False
# 3 True
# 4 False
# 5 True
# 6 False
# 7 True
# 8 True
... we wrap both in parentheses (required by pandas syntax whenever using multiple expressions in indexing operations), and then wrap them in parentheses again so that we can negate the entire expression, i.e. ~( ... ). Here an extra example condition on Col_2 is chained on to show how further filters combine:
~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
# 0 True
# 1 True
# 2 True
# 3 False
# 4 True
# 5 False
# 6 True
# 7 True
# 8 False
You can build more complex conditions with further use of the logical operators & and | (the "or" operator). As with SQL, group your conditions as necessary with additional parentheses; for instance, filter based on the logic "both condition X AND condition Y are true, or condition Z is true" with df[ ( (X) & (Y) ) | (Z) ].
Or you can just use first(); it gives back the first non-null value per group, so the order of the original input does not really matter.
df.groupby(['A','B']).first()
Out[180]:
Col_1 Col_2
A B
1 a NaN 2
2 b A 2
3 c A 3
4 d B 3
5 e B 4
6 f NaN 4
7 g NaN 5
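Note that groupby(['A', 'B']).first() leaves A and B in the index. If the original flat layout is wanted, a reset_index (or as_index=False) restores it, for example:

out = df.groupby(['A', 'B'], as_index=False).first()

first() returns the first non-null entry of each column within every group, which is exactly why the NaN halves of the duplicate pairs disappear.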
I'm currently learning Python and pandas (this question is based on a previous post but with an additional query); at the moment I have 2 columns containing numeric sequences (ascending and/or descending) as described below:
Col1: (numeric increment and/or decrement = 1)
1
2
3
5
7
8
9
Col2: (numeric increment and/or decrement = 4)
113
109
105
90
94
98
102
I need to extract the numeric ranges from both columns and print them according to where a sequence break occurs in either of the 2 columns; the result should be as follows:
1,3,105,113
5,5,90,90
7,9,94,102
I already received a very useful way to do it using pandas from @MaxU, which generates the numeric ranges based on the breaks detected in both columns using the criterion that col1 and col2 increase and/or decrease by 1:
How can I extract numeric ranges from 2 columns and print the range from both columns as tuples?
The only difference in this case is that the increment/decrement criterion applied to the two columns is different for each of them.
Try this:
In [42]: df
Out[42]:
Col1 Col2
0 1 113
1 2 109
2 3 105
3 5 90
4 7 94
5 8 98
6 9 102
In [43]: df.groupby(df.diff().abs().ne([1,4]).any(1).cumsum()).agg(['min','max'])
Out[43]:
Col1 Col2
min max min max
1 1 3 105 113
2 5 5 90 90
3 7 9 94 102
Explanation: our goal is to group the rows where the increment/decrement is [1, 4] for Col1 and Col2 respectively:
In [44]: df.diff().abs()
Out[44]:
Col1 Col2
0 NaN NaN
1 1.0 4.0
2 1.0 4.0
3 2.0 15.0
4 2.0 4.0
5 1.0 4.0
6 1.0 4.0
In [45]: df.diff().abs().ne([1,4])
Out[45]:
Col1 Col2
0 True True
1 False False
2 False False
3 True True
4 True False
5 False False
6 False False
In [46]: df.diff().abs().ne([1,4]).any(1)
Out[46]:
0 True
1 False
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [47]: df.diff().abs().ne([1,4]).any(1).cumsum()
Out[47]:
0 1
1 1
2 1
3 2
4 3
5 3
6 3
dtype: int32
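A self-contained version of the same pipeline with the intermediate steps named (a sketch; the frame is rebuilt from the example and axis=1 is spelled out):

import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 5, 7, 8, 9],
                   'Col2': [113, 109, 105, 90, 94, 98, 102]})

# a new group starts wherever the absolute step is not 1 for Col1 or not 4 for Col2
breaks = df.diff().abs().ne([1, 4]).any(axis=1)
groups = breaks.cumsum()

out = df.groupby(groups).agg(['min', 'max'])
print (out)
  Col1     Col2
   min max  min  max
1    1   3  105  113
2    5   5   90   90
3    7   9   94  102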