Pandas deleting rows in order - python

Given a particular df:
ID Text
1 abc
1 xyz
2 xyz
2 abc
3 xyz
3 abc
3 ijk
4 xyz
I want to apply a condition: grouping by ID, if abc exists in a group, delete that group's row with xyz. The outcome would be:
ID Text
1 abc
2 abc
3 abc
3 ijk
4 xyz
Usually I would group them by ID and apply np.where(...). However, I don't think that approach would work here, since the condition depends on other rows in the group.
Many thanks!
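For reference, the example frame can be rebuilt from the table above (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'ID':   [1, 1, 2, 2, 3, 3, 3, 4],
                   'Text': ['abc', 'xyz', 'xyz', 'abc', 'xyz', 'abc', 'ijk', 'xyz']})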

To the best of my knowledge, you can vectorize this with a groupby + transform:
df[~(df.Text.eq('abc').groupby(df.ID).transform('any') & df.Text.eq('xyz'))]
ID Text
0 1 abc
3 2 abc
5 3 abc
6 3 ijk
7 4 xyz
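For readability, the same boolean mask can be built in two named steps (a sketch of the identical logic):
has_abc = df.Text.eq('abc').groupby(df.ID).transform('any')  # True for every row whose ID group contains 'abc'
is_xyz = df.Text.eq('xyz')                                    # True where the row itself is 'xyz'
df[~(has_abc & is_xyz)]                                       # drop 'xyz' rows only in groups that also have 'abc'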

I am using crosstab
s=pd.crosstab(df.ID,df.Text)
s.xyz=s.xyz.mask(s.abc.eq(1)&s.xyz.eq(1))
s
Out[162]:
Text abc ijk xyz
ID
1 1 0 NaN
2 1 0 NaN
3 1 1 NaN
4 0 0 1.0
s.replace(0, np.nan).stack().reset_index().drop(columns=0)
Out[167]:
ID Text
0 1 abc
1 2 abc
2 3 abc
3 3 ijk
4 4 xyz

Related

create new column based on value column a mapping on column b get column c

I have a dataframe like this:
num  text  foreign
1    abc   4
2    bcd   1
3    efg   3
4    jkl   2
4    jkl   1
I want to make a new column by matching the 'foreign' column with the 'id' column and taking the corresponding 'text' value.
So I'm expecting:
num  text  foreign  foreign_txt
1    abc   4        jkl
2    bcd   1        abc
3    efg   3        bcd
4    jkl   2        efg
4    jkl   1        abc
What is the syntax to make 'foreign_txt'?
I can't drop any rows.
I forget how to do it. Can you help me?
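Note that the question's frame has a column named num while the answers below refer to id; renaming it first (or swapping the name in the snippets) makes everything line up:
df = df.rename(columns={'num': 'id'})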
Use Series.map with a Series created by DataFrame.set_index:
df['foreign_txt'] = df['foreign'].map(df.set_index('id')['text'])
print (df)
id text foreign foreign_txt
0 1 abc 4 jkl
1 2 bcd 1 abc
2 3 efg 3 efg
3 4 jkl 2 bcd
Or:
df = (df.merge(df.drop('foreign', axis=1)
                 .rename(columns={'id': 'foreign', 'text': 'foreign_txt'}),
               how='left'))
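Both variants build a lookup from id to text; the intermediate Series that map consumes looks like this (a sketch, assuming the 4-row frame above where id is unique):
lookup = df.set_index('id')['text']
# id
# 1    abc
# 2    bcd
# 3    efg
# 4    jkl
# Name: text, dtype: object
df['foreign_txt'] = df['foreign'].map(lookup)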
EDIT: If you need the first value of text per id, add DataFrame.drop_duplicates; to avoid losing texts that differ within an id, aggregate them with a join instead:
print (df)
id text foreign
0 1 abc 4
1 2 bcd 1
2 3 efg 3
3 4 jkl 2
4 4 jkl 1
5 3 aaa 8
#join unique duplicates
s = df.drop_duplicates(['id','text']).groupby('id')['text'].agg(','.join)
df['foreign_txt1'] = df['foreign'].map(s)
#get first duplicates
df['foreign_txt2'] = df['foreign'].map(df.drop_duplicates('id').set_index('id')['text'])
#get last duplicates
df['foreign_txt3'] = df['foreign'].map(df.drop_duplicates('id', keep='last').set_index('id')['text'])
print (df)
id text foreign foreign_txt1 foreign_txt2 foreign_txt3
0 1 abc 4 jkl jkl jkl
1 2 bcd 1 abc abc abc
2 3 efg 3 efg,aaa efg aaa
3 4 jkl 2 bcd bcd bcd
4 4 jkl 1 abc abc abc
5 3 aaa 8 NaN NaN NaN
You can use the map method.
Code:
import pandas as pd
data = {'num': [1, 2, 3, 4, 4],
        'text': ['abc', 'bcd', 'efg', 'jkl', 'jkl'],
        'foreign': [4, 1, 3, 2, 1]}
df = pd.DataFrame(data)
foreign_dict = df.set_index('num')['text'].to_dict()
#print(foreign_dict) #{1: 'abc', 2: 'bcd', 3: 'efg', 4: 'jkl'}
df['foreign_txt'] = df['foreign'].map(foreign_dict)
print(df)
Output:
num text foreign foreign_txt
0 1 abc 4 jkl
1 2 bcd 1 abc
2 3 efg 3 efg
3 4 jkl 2 bcd
4 4 jkl 1 abc

Create a column based on a condition python pandas

Here is the sample data
import pandas as pd
df = pd.DataFrame({'P_Name': ['ABC','ABC','ABC','ABC','PQR','PQR','PQR','PQR','XYZ','XYZ','XYZ','XYZ'],
                   'Date': ['11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020','11/01/2020','12/01/2020','13/01/2020','14/01/2020'],
                   'Open': ['242.584','238.179','233.727','229.441','241.375','28.965','235.96','233.193','280.032','78.472','277.592','276.71'],
                   'End': ['4.405','4.452','4.286','4.405','2.41','3.005','2.767','3.057','1.56','0.88','0.882','0.88'],
                   'Close': ['238.179','233.727','229.441','225.036','238.965','235.96','233.193','230.136','278.472','277.592','276.71','275.83']})
I'm trying to create a new column where, for every new product entry (the first row of each P_Name), the value is 1. For the other rows it should check whether the previous row's Close equals the current row's Open (e.g. df['Close'][0] == df['Open'][1]): if they are the same the value is 1, and if not (e.g. df['Close'][8] vs df['Open'][9]) it is 0.
df after these conditions
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
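Note that Open, End and Close in the sample above are strings, so the equality checks below are exact string comparisons; if numeric comparison is preferred, converting first also works (an optional step, not required for the expected output):
df[['Open', 'End', 'Close']] = df[['Open', 'End', 'Close']].apply(pd.to_numeric)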
You can compare shifted values per group using DataFrameGroupBy.shift with Series.eq, replace the missing values (the first row of each group) with the other column via Series.fillna, and cast the boolean mask to 0/1 with Series.astype:
df['Check'] = df.Open.eq(df.groupby('P_Name').Close.shift().fillna(df.Open)).astype(int)
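To see what that one-liner compares against, the intermediate shifted-and-filled Series can be named explicitly (same logic, just spelled out):
prev_close = df.groupby('P_Name').Close.shift()  # previous row's Close within each product
prev_close = prev_close.fillna(df.Open)          # first row per product compares equal to itself -> 1
df['Check'] = df.Open.eq(prev_close).astype(int)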
Another idea is to compare without groups, but chain an extra mask built with Series.duplicated to match the first row of each group:
df['Check'] = (~df.P_Name.duplicated() | df.Open.eq(df.Close.shift())).astype(int)
print (df)
P_Name Date Open End Close Check
0 ABC 11/01/2020 242.584 4.405 238.179 1
1 ABC 12/01/2020 238.179 4.452 233.727 1
2 ABC 13/01/2020 233.727 4.286 229.441 1
3 ABC 14/01/2020 229.441 4.405 225.036 1
4 PQR 11/01/2020 241.375 2.41 238.965 1
5 PQR 12/01/2020 28.965 3.005 235.96 0
6 PQR 13/01/2020 235.96 2.767 233.193 1
7 PQR 14/01/2020 233.193 3.057 230.136 1
8 XYZ 11/01/2020 280.032 1.56 278.472 1
9 XYZ 12/01/2020 78.472 0.88 277.592 0
10 XYZ 13/01/2020 277.592 0.882 276.71 1
11 XYZ 14/01/2020 276.71 0.88 275.83 1
# Loop-based alternative, fixed so it runs and handles the first row of each product:
check = [1]  # the first row is always a new product entry
for i in range(1, len(df)):
    if df['P_Name'][i] != df['P_Name'][i - 1] or df['Open'][i] == df['Close'][i - 1]:
        check.append(1)
    else:
        check.append(0)
df['Check'] = check

Dropping row in pandas if 2 columns in the same row have NAN value in it

I am new to pandas and trying to complete the following:
I have a dataframe which look like this:
row A B
1 abc abc
2 abc
3 abc
4
5 abc abc
My desired output would look like this:
row A B
1 abc abc
2 abc
3 abc
5 abc abc
I am trying to drop rows if there is no value in both A and B columns:
if finalized_export_cf[finalized_export_cf['A']].str.len()<2:
if finalized_export_cf[finalized_export_cf['B']].str.len()<2:
finalized_export_cf[finalized_export_cf['B']].drop()
But that gives me the following error:
ValueError: cannot index with vector containing NA / NaN values
How could I drop values when both columns have an empty cell?
Thank you for your suggestions.
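For reference, a frame matching the example, using NaN for the empty cells (a minimal sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'row': [1, 2, 3, 4, 5],
                   'A': ['abc', 'abc', np.nan, np.nan, 'abc'],
                   'B': ['abc', np.nan, 'abc', np.nan, 'abc']})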
You can check whether all values in a row are null by chaining .isnull() and all(). isnull() produces a dataframe of booleans, and all(axis=1) checks whether all values in a given row are True; if so, every value in that row is null:
inds = df[["A", "B"]].isnull().all(axis=1)
You can then use inds to clean up the rows that contain only nulls. First negate it using the tilde ~, or else you would keep only the all-missing rows:
df = df.loc[~inds, :]
For your use case you can create a mask of missing values and keep the rows where A and B are not both missing:
mask = df.isna()
df[~(mask.A & mask.B)]
output:
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
4 5 abc abc
If the missing values are NaNs, use DataFrame.dropna with the how='all' and subset parameters:
print (df)
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
3 4 NaN NaN
4 5 abc abc
df = df.dropna(how='all', subset=['A','B'])
print (df)
row A B
0 1 abc abc
1 2 abc NaN
2 3 NaN abc
4 5 abc abc
Or, if the empty values are empty strings, use DataFrame.any after comparing not-equal to '':
print (df)
row A B
0 1 abc abc
1 2 abc
2 3 abc
3 4
4 5 abc abc
df = df[df[['A','B']].ne('').any(axis=1)]
print (df)
row A B
0 1 abc abc
1 2 abc
2 3 abc
4 5 abc abc
If you have only two columns, you can use the how parameter of pandas.DataFrame.dropna by setting it to 'all':
df.dropna(how='all')
First of all we need to change the blank cells to NaN:
import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)
then drop NA while subsetting to the relevant columns:
df.dropna(subset=['A','B'],how='all').fillna(' ') # if you want spaces for na
print(df)
row A B
0 1 abc abc
1 2 abc
2 3 abc
4 5 abc abc

"Rank" DataFrame columns per row

Given a time-series DataFrame, is it possible to create a new DataFrame with the same dimensions where each value is replaced by its rank within its row across the columns (smallest value ranked first)?
Example:
ABC DEFG HIJK XYZ
date
2018-01-14 0.110541 0.007615 0.063217 0.002543
2018-01-21 0.007012 0.042854 0.061271 0.007988
2018-01-28 0.085946 0.177466 0.046432 0.069297
2018-02-04 0.018278 0.065254 0.038972 0.027278
2018-02-11 0.071785 0.033603 0.075826 0.073270
The first row would become:
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
as XYZ has the smallest value in that row and ABC the largest.
numpy.argsort looks like it might help; however, since it outputs the locations rather than the ranks, I have not managed to get it to work.
Many thanks
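For reference, the example frame can be rebuilt like this (a minimal sketch):
import pandas as pd

df = pd.DataFrame(
    {'ABC':  [0.110541, 0.007012, 0.085946, 0.018278, 0.071785],
     'DEFG': [0.007615, 0.042854, 0.177466, 0.065254, 0.033603],
     'HIJK': [0.063217, 0.061271, 0.046432, 0.038972, 0.075826],
     'XYZ':  [0.002543, 0.007988, 0.069297, 0.027278, 0.073270]},
    index=pd.to_datetime(['2018-01-14', '2018-01-21', '2018-01-28',
                          '2018-02-04', '2018-02-11']))
df.index.name = 'date'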
Use a double argsort to rank within each row and pass the result to the DataFrame constructor:
df1 = pd.DataFrame(df.values.argsort().argsort() + 1, index=df.index, columns=df.columns)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
Or use DataFrame.rank with method='dense':
df1 = df.rank(axis=1, method='dense').astype(int)
print (df1)
ABC DEFG HIJK XYZ
date
2018-01-14 4 2 3 1
2018-01-21 1 3 4 2
2018-01-28 3 4 1 2
2018-02-04 1 4 3 2
2018-02-11 2 1 4 3
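Both approaches rank smallest-first, as asked. If you ever need the largest value ranked 1 instead, DataFrame.rank accepts ascending=False:
df.rank(axis=1, method='dense', ascending=False).astype(int)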

How do I exclude NaN/NaT/None from a python groupby count, but include the row?

>> df
Foo Bar Number Date
0 abc None NaN NaT
1 abcdefg None NaN NaT
2 abcd this 1111222 3/8/2017
3 abcd that 1233336 3/3/2017
4 abcd what 1346554 3/3/2017
5 abcde that 8889995 3/9/2017
6 abcde this 1849552 3/8/2017
7 abcd that 7418652 3/3/2017
8 abcdef this 4865154 3/7/2017
>> df.groupby(['Foo']).size().reset_index(name='Total')
If I do it this way, each row is counted as having one value, which it does, and I understand that. What I'm not sure about is how to include the row's group in the Total but not actually count its None/NaN/NaT values.
Returns:
Foo Total
0 abc 1
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 1
Expected result:
Foo Total
0 abc 0
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 0
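For reference, a frame matching the example above (a minimal sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Foo': ['abc', 'abcdefg', 'abcd', 'abcd', 'abcd', 'abcde', 'abcde', 'abcd', 'abcdef'],
    'Bar': [None, None, 'this', 'that', 'what', 'that', 'this', 'that', 'this'],
    'Number': [np.nan, np.nan, 1111222, 1233336, 1346554, 8889995, 1849552, 7418652, 4865154],
    'Date': pd.to_datetime([None, None, '3/8/2017', '3/3/2017', '3/3/2017',
                            '3/9/2017', '3/8/2017', '3/3/2017', '3/7/2017'])})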
You could drop nulls first, then reindex with the unique values of the Foo column at the end with a fill value.
(df.dropna().groupby('Foo')
.size()
.reindex(df.Foo.unique(), fill_value=0)
.reset_index(name='total'))
Or alternatively, you could make your Foo column Categorical.
df.Foo = pd.Categorical(df.Foo)
df.dropna().groupby('Foo').size().reset_index(name='total')
Demo
>>> (df.dropna().groupby('Foo')
.size()
.reindex(df.Foo.unique(), fill_value=0)
.reset_index(name='total'))
Foo total
0 abc 0
1 abcdefg 0
2 abcd 4
3 abcde 2
4 abcdef 1
############################################################################
>>> df.Foo = pd.Categorical(df.Foo)
>>> df.dropna().groupby('Foo').size().reset_index(name='total')
Foo total
0 abc 0
1 abcd 4
2 abcde 2
3 abcdef 1
4 abcdefg 0
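One caveat: the Categorical version relies on groupby keeping unobserved categories, which is why abc and abcdefg still appear with a count of 0 after dropna. Recent pandas versions are changing the default of the observed argument for categorical groupers, so it is safer to request that behaviour explicitly:
df.Foo = pd.Categorical(df.Foo)
df.dropna().groupby('Foo', observed=False).size().reset_index(name='total')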
