Removing None values from DataFrame in Python

Having the following dataframe:
name        aaa   bbb
Mick        None  None
Ivan        A     C
Ivan-Peter  1     None
Juli        1     P
I want to get two dataframes:
One with the rows where there is None in column aaa and/or bbb, named filter_nulls in my code.
One where there is no None at all, named df_out in my code.
This is what I have tried, and it does not produce the required dataframes:
import pandas as pd
df_out = {
'name': [ 'Mick', 'Ivan', 'Ivan-Peter', 'Juli'],
'aaa': [None, 'A', '1', '1'],
'bbb': [None, 'C', None, 'P'],
}
print(df_out)
filter_nulls = df_out[df_out['aaa'].isnull()|(df_out['bbb'] is None)]
print(filter_nulls)
df_out = df_out.loc[filter_nulls].reset_index(level=0, drop=True)
print(df_out)

Use:
#DataFrame from sample data
df_out = pd.DataFrame(df_out)
#filter column names by list and test if NaN or None in at least one row
m = df_out[['aaa','bbb']].isna().any(axis=1)
#OR test both columns separately
m = df_out['aaa'].isna() | df_out['bbb'].isna()
#filter matched and not matched rows
df1 = df_out[m].reset_index(drop=True)
df2 = df_out[~m].reset_index(drop=True)
print (df1)
name aaa bbb
0 Mick None None
1 Ivan-Peter 1 None
print (df2)
name aaa bbb
0 Ivan A C
1 Juli 1 P
Another idea: use DataFrame.dropna and then filter the indices that do not exist in df2:
df2 = df_out.dropna()
df1 = df_out.loc[df_out.index.difference(df2.index)].reset_index(drop=True)
df2 = df2.reset_index(drop=True)
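If the frame ever has additional columns that may also hold NaN, dropna's subset argument limits the test to aaa and bbb; a small sketch (not part of the original answer):
df2 = df_out.dropna(subset=['aaa', 'bbb'])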

First of all, one needs to convert df_out to a DataFrame with pandas.DataFrame as follows:
df_out = pd.DataFrame(df_out)
[Out]:
name aaa bbb
0 Mick None None
1 Ivan A C
2 Ivan-Peter 1 None
3 Juli 1 P
Then one can use, for both cases, pandas.Series.notnull.
With values, where we have None in columns aaa and/or bbb, named filter_nulls in my code
df1 = df_out[~df_out['aaa'].notnull() | ~df_out['bbb'].notnull()]
[Out]:
name aaa bbb
0 Mick None None
2 Ivan-Peter 1 None
Where we do not have None at all. df_out in my code.
df2 = df_out[df_out['aaa'].notnull() & df_out['bbb'].notnull()]
[Out]:
name aaa bbb
1 Ivan A C
3 Juli 1 P
Notes:
If needed one can use pandas.DataFrame.reset_index to get the following
df_new = df_out[~df_out['aaa'].notnull() | ~df_out['bbb'].notnull()].reset_index(drop=True)
[Out]:
name aaa bbb
0 Mick None None
1 Ivan-Peter 1 None
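Equivalently, pandas.Series.isnull gives the same masks without the double negation; a minor variation on the above:
df1 = df_out[df_out['aaa'].isnull() | df_out['bbb'].isnull()]
df2 = df_out[df_out['aaa'].notnull() & df_out['bbb'].notnull()]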

Related

subset columns based on partial match and group level in python

I am trying to split my dataframe based on a partial match of the column name, using a group level stored in a separate dataframe. The dataframes are here, and the expected output is below
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a19-76': [0, 1, 2],
                        'a23pz': [0, 1, 2],
                        'a23pze': [0, 1, 2],
                        'b887': [0, 1, 2],
                        'b59lp': [0, 1, 2],
                        'c56-6u': [0, 1, 2],
                        'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame(data={'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
                         'group': ['test', 'sub', 'test', 'pass', 'fail']})
desired output
test_ids = 'a19-76', 'b887'
sub_ids = 'a23pz', 'a23pze', 'c56-6u'
pass_ids = 'b59lp'
fail_ids = 'c56-6u', 'c56-6uY'
I have written this one-liner, which assigns the group to each column name, but it doesn't create the separate lists required above:
gb = ids.groupby([[col for col in df.columns if col.startswith(tuple(i for i in ids.id))], 'group']).agg(lambda x: list(x)).reset_index()
gb.groupby('group').agg({'level_0':lambda x: list(x)})
thanks for reading
Maybe not what you are looking for, but anyway.
A pending question is what to do with unmatched columns; the answer obviously depends on what you will do after matching.
Plain Python solution
Simple collections wrangling, but there may be a simpler way.
from collections import defaultdict

groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
    for id, group in idsr:
        if col.startswith(id):
            groups[group].append(col)
            break
    # the following 'else' clause is optional, it creates a group for not matched columns
    else:  # for ... else ...
        groups['UNGROUPED'].append(col)
Groups =
{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
Then after
df.columns = pd.MultiIndex.from_tuples(sorted([(k, col) for k,id in groups.items() for col in id]))
df =
sub test
a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
pandas solution
Columns to dataframe
product of dataframes (join )
filtering of the resulting dataframe
There is surely a better way
df1 = ids.copy()
df2 = df.columns.to_frame(index=False)
df2.columns = ['col']
# Not tested enhancement:
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')
df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)
dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)
dfm =
group col
6 sub a23pz
24 sub c56-6u
0 test a19-76
18 test b59lp
12 test b887
# Note 1: The index can be removed
# Note 2: Unmatched columns are not taken into account
Then after
df.columns = pd.MultiIndex.from_frame(dfm)
df =
group sub test
col a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
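As a side note, on pandas >= 1.2 the untested enhancement mentioned in the comment above can replace the helper 'join' column; a rough sketch of the same pipeline under that assumption:
cols = df.columns.to_frame(index=False)
cols.columns = ['col']
# cross join, available from pandas 1.2 onwards
dfm = ids.merge(cols, how='cross')
dfm['match'] = dfm.apply(lambda x: x['col'].find(x['id']), axis=1).ge(0)
dfm = dfm[dfm['match']][['group', 'col']].sort_values(by=['group', 'col'])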
You can use a regex generated from the values in ids and filter.
Example with "test":
s = ids.set_index('group')['id']
regex_test = '^(%s)' % '|'.join(s.loc['test'])
# the generated regex is: '^(a19|b8|b59)'
df.filter(regex=regex_test)
output:
a19-76 b887 b59lp
0 0 0 0
1 1 1 1
2 2 2 2
To get a list of columns for each unique group in ids, apply the same process in a dictionary comprehension:
{x: list(df.filter(regex='^(%s)' % '|'.join(s.loc[x])).columns)
for x in s.index.unique()}
output:
{'test': ['a19-76', 'b887', 'b59lp'],
'sub': ['a23pz', 'c56-6u']}
NB: this should generalize to any number of groups; however, if there really are many groups, it will be preferable to loop over the column names rather than calling filter repeatedly, as sketched below.
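A possible shape for that loop, as a sketch (assumes the same df and ids as above; the helper names here are made up):
import re

# map each id prefix to its group and build one anchored, escaped pattern
prefix_to_group = dict(zip(ids['id'], ids['group']))
pattern = re.compile('^(%s)' % '|'.join(map(re.escape, prefix_to_group)))

cols_by_group = {}
for col in df.columns:
    m = pattern.match(col)
    if m:
        cols_by_group.setdefault(prefix_to_group[m.group(1)], []).append(col)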
A straightforward groupby(...).apply(...) can achieve this result:
def id_match(group, to_match):
    # join the group's id prefixes into one anchored alternation, e.g. '^(a19|b8)'
    regex = "^({})".format("|".join(group))
    matches = to_match.str.match(regex)
    return pd.Series(to_match[matches])

matched_ids = ids.groupby("group")["id"].apply(id_match, df.columns)
print(matched_ids)
group
fail   0     c56-6u
       1    c56-6uY
pass   0      b59lp
sub    0      a23pz
       1     a23pze
test   0     a19-76
       1       b887
You can treat this Series as a dictionary-like entity to access each of the groups independently:
print(matched_ids["fail"])
0     c56-6u
1    c56-6uY
Name: id, dtype: object
print(matched_ids["pass"])
0    b59lp
Name: id, dtype: object
Then you can take it a step further and subset your original DataFrame with this new Series like so:
print(df[matched_ids["fail"]])
   c56-6u  c56-6uY
0       0      NaN
1       1      NaN
2       2      NaN
print(df[matched_ids["pass"]])
   b59lp
0      0
1      1
2      2

Pandas, check if a column contains characters from another column, and mark out the character?

There are 2 DataFrames, df1 & df2, e.g.
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
'col1': ['12abc12', '12abcbla', 'abc', 'jh']})
df2 = pd.DataFrame({'col2': ['abc', 'efj']})
What I want looks like this (find all the rows which contain characters from df2, and tag them):
index col1 col2
0 1 12abc12 abc
1 2 12abcbla abc
2 3 abc abc
3 4 jh
I've found a similar question, but it is not exactly what I want. Thanks for any ideas in advance.
Use Series.str.extract if you need the first matched value:
df1['new'] = df1['col1'].str.extract(f'({"|".join(df2["col2"])})', expand=False).fillna('')
print (df1)
index col1 new
0 1 12abc12 abc
1 2 12abcbla abc
2 3 abc abc
3 4 jh
If you need all matched values, use Series.str.findall and Series.str.join:
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
'col1': ['12abc1defj2', '12abcbla', 'abc', 'jh']})
df2 = pd.DataFrame({'col2': ['abc', 'efj']})
df1['new'] = df1['col1'].str.findall("|".join(df2["col2"])).str.join(',')
print (df1)
index col1 new
0 1 12abc1defj2 abc,efj
1 2 12abcbla abc
2 3 abc abc
3 4 jh
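One caveat worth adding: if the values in df2['col2'] can contain regex metacharacters, escape them before joining so the pattern stays literal; a small sketch:
import re

pattern = "|".join(map(re.escape, df2["col2"]))
df1['new'] = df1['col1'].str.findall(pattern).str.join(',')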

replace value of a dataframe based on value from another dataframe

How can I merge one dataframe with another lookup dataframe?
This is dataframe A, where I want to replace the values:
InfoType IncidentType DangerType
0 NaN A NaN
1 NaN C NaN
2 NaN B C
3 NaN B NaN
This is the lookup table:
ID ParamCode ParamValue ParmDesc1 ParamDesc2 SortOrder ParamStatus
0 1 IncidentType A ABC DEF 1 1
1 2 IncidentType B GHI JKL 2 1
2 3 IncidentType C MNO PQR 7 1
2 3 DangerType C STU VWX 6 1
The expected output:
InfoType IncidentType DangerType
0 NaN ABC NaN
1 NaN MNO NaN
2 NaN GHI STU
3 NaN GHI NaN
Note that ParamCode holds the column names, and I need to put the corresponding ParmDesc1 into the respective columns in dataframe A. Every column in dataframe A may have NaN and I don't intend to remove them; just ignore them.
This is what I have done:
ntf_cols = ['InfoType', 'IncidentType', 'DangerType']
for c in ntf_cols:
    if (c in ntf.columns) & (c in param['ParamCode'].values):
        paramValue = param['ParamValue'].unique()
        for idx, pv in enumerate(paramValue):
            ntf['NewIncidentType'] = pd.np.where(ntf.IncidentType.str.contains(pv), param['ParmDesc1'].values, "whatever")
Error:
ValueError: operands could not be broadcast together with shapes (25,) (13,) ()
Use the lookup table to make a dict, and then replace the column values of the original dataframe. Assume the original dataframe is df1 and the lookup table is df2
...
dict_map = dict(zip(df2.ParamCode + "-" + df2.ParamValue, df2.ParmDesc1))
df1['IncidentType'] = ("IncidentType" +'-'+ df1.IncidentType).replace(dict_map)
df1['DangerType'] = ("DangerType" +'-'+ df1.DangerType).replace(dict_map)
...
EDIT: Lambda's answer gave me an idea for how you could do this for many columns that you want to apply this logical pattern to:
import pandas as pd

df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))
df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))

for col in df1.columns[1:]:
    dict_map = dict(
        df2[df2.ParamCode == col][['ParamValue', 'ParmDesc1']].to_records(index=False)
    )
    df1[col] = df1[col].replace(dict_map)
print(df1)
This assumes every column after the first column in df1 is one that needs updating and that the to-be-updated column names exist as values in the 'ParamCode' column of df2.
Python tutor link to code
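For reference, with the sample data above the loop should leave df1 looking roughly like this (derived from the mapping; not shown in the original answer):
  InfoType IncidentType DangerType
0     None          ABC       None
1     None          MNO       None
2     None          GHI        STU
3     None          GHI       None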
This problem could be solved using some custom functions and pandas.Series.apply():
import pandas as pd

def find_incident_type(x):
    if pd.isna(x):
        return x
    return df2[
        (df2['ParamCode'] == 'IncidentType') & (df2['ParamValue'] == x)
    ]["ParmDesc1"].values[0]

def find_danger_type(x):
    if pd.isna(x):
        return x
    return df2[
        (df2['ParamCode'] == 'DangerType') & (df2['ParamValue'] == x)
    ]["ParmDesc1"].values[0]

df1 = pd.DataFrame(dict(
    InfoType = [None, None, None, None],
    IncidentType = 'A C B B'.split(),
    DangerType = [None, None, 'C', None],
))
df2 = pd.DataFrame(dict(
    ParamCode = 'IncidentType IncidentType IncidentType DangerType'.split(),
    ParamValue = 'A B C C'.split(),
    ParmDesc1 = 'ABC GHI MNO STU'.split(),
))

df1['IncidentType'] = df1['IncidentType'].apply(find_incident_type)
df1['DangerType'] = df1['DangerType'].apply(find_danger_type)
print(df1)
step through the code in python tutor
It is very possible there is a more efficient way to do this. Hopefully someone who knows it will share it.
Also, the reference to df2 from the outer scope is hard-coded into the custom functions and thus will only work for that variable name from the outer scope. You'll need to pass the lookup frame through pandas.Series.apply's args param if you don't want these functions to be dependent on that ref, as sketched below.
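A sketch of that suggestion, using a single hypothetical helper that receives the lookup frame and the parameter code through args (the names here are made up, not from the original answer):
def find_desc(x, lookup, param_code):
    # leave NaN untouched, otherwise look up the matching ParmDesc1
    if pd.isna(x):
        return x
    match = lookup[(lookup['ParamCode'] == param_code) & (lookup['ParamValue'] == x)]
    return match['ParmDesc1'].values[0]

df1['IncidentType'] = df1['IncidentType'].apply(find_desc, args=(df2, 'IncidentType'))
df1['DangerType'] = df1['DangerType'].apply(find_desc, args=(df2, 'DangerType'))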

Pandas: Convert nan in a row to an empty array

My dataframes are like below
df1
id c1
1 abc
2 def
3 ghi
df2
id set1
1 [123,456]
2 [789]
When I join df1 and df2 (final_df = df1.merge(df2, how='left')), it gives me
final_df
id c1 set1
1 abc [123,456]
2 def [789]
3 ghi NaN
I'm using the below code to replace NaN with an empty array []:
for row in final_df.loc[final_df.set1.isnull(), 'set1'].index:
    final_df.at[row, 'set1'] = []
The issue is that if df2 is an empty dataframe, it gives:
ValueError: setting an array element with a sequence.
PS: I'm using pandas 0.23.4 version
Pandas is not designed to be used with series of lists. You lose all vectorised functionality and any manipulations on such series involve inefficient, Python-level loops.
One work-around is to define a series of empty lists:
res = df1.merge(df2, how='left')
empty = pd.Series([[] for _ in range(len(res.index))], index=res.index)
res['set1'] = res['set1'].fillna(empty)
print(res)
id c1 set1
0 1 abc [123, 456]
1 2 def [789]
2 3 ghi []
A better idea at this point, if viable, is to split your lists into separate series:
res = res.join(pd.DataFrame(res.pop('set1').values.tolist()))
print(res)
id c1 0 1
0 1 abc 123.0 456.0
1 2 def 789.0 NaN
2 3 ghi NaN NaN
This is not ideal but will get your work done:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([[1, 'abc'], [2, 'def'], [3, 'ghi']], columns=['id', 'c1'])
df2 = pd.DataFrame([[1, [123, 456]], [2, [789]]], columns=['id', 'set1'])
df = pd.merge(df1, df2, how='left', on='id')
df['set1'].fillna(0, inplace=True)
# replace the 0 placeholder with an empty list
df['set1'] = df['set1'].apply(lambda x: [] if x == 0 else x)
print(df)
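A slightly more direct variant of the same idea, skipping the 0 placeholder entirely (a sketch, not part of the original answer):
df = pd.merge(df1, df2, how='left', on='id')
# anything that is not already a list (i.e. the NaN fill values) becomes []
df['set1'] = df['set1'].apply(lambda x: x if isinstance(x, list) else [])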

How can I check the ID of a pandas data frame in another data frame in Python?

Hello I have the following Data Frame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd
df1 = pd.DataFrame({
'ID': ['a', 'b', 'c'],
'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
'ID': ['a', 'b', 'c', 'd', 'e'],
'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want index to be numeric you could reset_index(),
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set the index to ID, join df after also setting its index to ID, and then reset your index to return your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).
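For completeness, a Series.map lookup is another common way to add the column without touching df's index at all; a small sketch using the question's df and df1:
# map each ID in df to its ID_n from the lookup frame df1
df['ID_n'] = df['ID'].map(df1.set_index('ID')['ID_n'])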
