How can I delete a dataframe column based on a certain string in its name?
Example:
house1 house2 chair1 chair2
index
1 foo lee sam han
2 fowler smith had sid
3 cle meg mag mog
I want to drop the columns whose names contain 'chair'.
How can this be done efficiently?
Thanks.
df.drop([col for col in df.columns if 'chair' in col], axis=1, inplace=True)
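Rebuilding the example frame from the question, the one-liner can be checked end to end (values copied from the table above):

```python
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'house1': ['foo', 'fowler', 'cle'],
                   'house2': ['lee', 'smith', 'meg'],
                   'chair1': ['sam', 'had', 'mag'],
                   'chair2': ['han', 'sid', 'mog']},
                  index=[1, 2, 3])

# drop every column whose name contains 'chair'
df.drop([col for col in df.columns if 'chair' in col], axis=1, inplace=True)
print(df.columns.tolist())  # ['house1', 'house2']
```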
UPDATE2:
In [315]: df
Out[315]:
3M110% 3M80% 6M90% 6M95% 1N90% 2M110% 3M95%
1 foo lee sam han aaa aaa fff
2 fowler smith had sid aaa aaa fff
3 cle meg mag mog aaa aaa fff
In [316]: df.loc[:, ~df.columns.str.contains('90|110')]
Out[316]:
3M80% 6M95% 3M95%
1 lee han fff
2 smith sid fff
3 meg mog fff
UPDATE:
In [40]: df
Out[40]:
house1 house2 chair1 chair2 door1 window1 floor1
1 foo lee sam han aaa aaa fff
2 fowler smith had sid aaa aaa fff
3 cle meg mag mog aaa aaa fff
In [41]: df.filter(regex='^(?!(chair|door|window).*?)')
Out[41]:
house1 house2 floor1
1 foo lee fff
2 fowler smith fff
3 cle meg fff
Original answer:
Here are a few alternatives:
In [37]: df.drop(df.filter(like='chair').columns, 1)
Out[37]:
house1 house2
1 foo lee
2 fowler smith
3 cle meg
In [38]: df.filter(regex='^(?!chair.*)')
Out[38]:
house1 house2
1 foo lee
2 fowler smith
3 cle meg
This should do it:
df.drop(df.columns[df.columns.str.match(r'chair')], axis=1)
Timing (benchmark plot comparing the methods omitted)
One more alternative:
import pandas as pd
df = pd.DataFrame({'house1':['foo','fowler','cle'],
'house2':['lee','smith','meg'],
'chair1':['sam','had','mag'],
'chair2':['han','sid','mog']})
mask = ['chair' not in x for x in df]
df = df[df.columns[mask]]
Related
I have a dataframe of the following format
df = pd.DataFrame(
{"company":["McDonalds","Arbys","Wendys"],
"City":["Dallas","Austin","Chicago"],
"Datetime":[{"11/23/2016":"1","09/06/2011":"2"},
{"02/23/2012":"1","04/06/2013":"2"},
{"10/23/2017":"1","05/06/2019":"2"}]})
df
>>> Company City Datetime
>>> McDonalds Dallas {'11/23/2016': '1', '09/06/2011':'2'}
>>> Arbys Austin {'02/23/2012': '1', '04/06/2013':'2'}
>>> Wendys Chicago {'10/23/2017': '1', '05/06/2019':'2'}
The dictionary inside the "Datetime" column is a string, so I must first parse it into a Python dictionary using ast.literal_eval.
I would like to unstack the dataframe based on the values in datetime so that the output looks as follows:
df_out
>>> Company City Date Value
>>> McDonalds Dallas 11/23/2016 1
>>> McDonalds Dallas 09/06/2011 2
>>> Arbys Austin 02/23/2012 1
>>> Arbys Austin 04/06/2013 2
>>> Wendys Chicago 10/23/2017 1
>>> Wendys Chicago 05/06/2019 2
I am a bit lost on this one. I know I will need to iterate over the rows and read each dictionary, so I had the idea of using df.iterrows() and creating namedtuples of each row's values that won't change, then looping over the dictionary itself attaching the different datetime values, but I am not sure this is the most efficient way. Any tips would be appreciated.
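Since the dicts arrive as strings, the parsing step the asker mentions can be sketched with ast.literal_eval (a minimal single-row example, not the full dataset):

```python
import ast
import pandas as pd

# one row with the dict stored as a string, as described in the question
df = pd.DataFrame({"company": ["McDonalds"],
                   "City": ["Dallas"],
                   "Datetime": ["{'11/23/2016': '1', '09/06/2011': '2'}"]})

# parse each stringified dict into a real Python dict
df['Datetime'] = df['Datetime'].map(ast.literal_eval)
print(type(df.loc[0, 'Datetime']))  # <class 'dict'>
```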
My try:
(df.drop('Datetime', axis=1)
.merge(df.Datetime.agg(lambda x: pd.Series(x))
.stack().reset_index(-1),
left_index=True,
right_index=True
)
.rename(columns={'level_1':'Date', 0:'Value'})
)
Output:
company City Date Value
0 McDonalds Dallas 11/23/2016 1
0 McDonalds Dallas 09/06/2011 2
1 Arbys Austin 02/23/2012 1
1 Arbys Austin 04/06/2013 2
2 Wendys Chicago 10/23/2017 1
2 Wendys Chicago 05/06/2019 2
I would flatten dictionaries in Datetime and construct a new df from it. Finally, join back.
from itertools import chain
df1 = pd.DataFrame(chain.from_iterable(df.Datetime.map(dict.items)),
index=df.index.repeat(df.Datetime.str.len()),
columns=['Date', 'Val'])
Out[551]:
Date Val
0 11/23/2016 1
0 09/06/2011 2
1 02/23/2012 1
1 04/06/2013 2
2 10/23/2017 1
2 05/06/2019 2
df_final = df.drop('Datetime', 1).join(df1)
Out[554]:
company City Date Val
0 McDonalds Dallas 11/23/2016 1
0 McDonalds Dallas 09/06/2011 2
1 Arbys Austin 02/23/2012 1
1 Arbys Austin 04/06/2013 2
2 Wendys Chicago 10/23/2017 1
2 Wendys Chicago 05/06/2019 2
Here is a clean solution:
Solution
df = df.set_index(['company', 'City'])
df_stack = (df['Datetime'].apply(pd.Series)
.stack().reset_index()
.rename(columns= {'level_2': 'Datetime', 0: 'val'}))
Output
print(df_stack.to_string())
company City Datetime val
0 McDonalds Dallas 11/23/2016 1
1 McDonalds Dallas 09/06/2011 2
2 Arbys Austin 02/23/2012 1
3 Arbys Austin 04/06/2013 2
4 Wendys Chicago 10/23/2017 1
5 Wendys Chicago 05/06/2019 2
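On newer pandas (multi-column explode needs 1.3+), the same flattening can be sketched with explode instead of stack:

```python
import pandas as pd

df = pd.DataFrame(
    {"company": ["McDonalds", "Arbys", "Wendys"],
     "City": ["Dallas", "Austin", "Chicago"],
     "Datetime": [{"11/23/2016": "1", "09/06/2011": "2"},
                  {"02/23/2012": "1", "04/06/2013": "2"},
                  {"10/23/2017": "1", "05/06/2019": "2"}]})

# turn each dict into parallel lists of keys/values, then explode both at once
out = (df.assign(Date=df['Datetime'].map(list),  # list(dict) -> the keys
                 Value=df['Datetime'].map(lambda d: list(d.values())))
         .drop(columns='Datetime')
         .explode(['Date', 'Value'], ignore_index=True))
print(out)
```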
My input data:
df=pd.DataFrame({'A':['adam','monica','joe doe','michael mo'], 'B':['david','valenti',np.nan,np.nan]})
print(df)
A B
0 adam david
1 monica valenti
2 joe doe NaN
3 michael mo NaN
I need to extract the string after the space into a second column, but when I use my code:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')
print(df)
A B
0 adam NaN
1 monica NaN
2 joe doe doe
3 michael mo mo
...I get NaN in each cell where no value was extracted. How can I avoid that?
I tried to extract only in the rows where NaN exists, using this code:
df.loc[df.B.isna(),'B'] = df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
ValueError: Incompatible indexer with DataFrame
Expected output:
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
I think the solution can be simplified - split by spaces, take the second element of each split list, and pass it to the Series.fillna function:
df['B'] = df['B'].fillna(df['A'].str.split().str[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.split().str[1])
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
Your solution should be changed:
df['B'] = df['A'].str.extract(r'( [a-zA-Z](.*))')[0].fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
A better solution, which changes the regex and uses expand=False to return a Series:
df['B'] = df['A'].str.extract(r'( [a-zA-Z].*)', expand=False).fillna(df.B)
print (df)
A B
0 adam david
1 monica valenti
2 joe doe doe
3 michael mo mo
Detail:
print (df['A'].str.extract(r'( [a-zA-Z].*)', expand=False))
0 NaN
1 NaN
2 doe
3 mo
Name: A, dtype: object
EDIT:
To also extract the values for the first column, the simplest way is:
df1 = df['A'].str.split(expand=True)
df['A'] = df1[0]
df['B'] = df['B'].fillna(df1[1])
print (df)
A B
0 adam david
1 monica valenti
2 joe doe
3 michael mo
Your approach doesn't work because of the different shapes of the right and the left sides of your statement: the left part has shape (2,) and the right part (2, 2):
df.loc[df.B.isna(),'B']
Returns:
2 NaN
3 NaN
And you want to fill this with:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')
Returns:
0 1
2 doe oe
3 mo o
You can take column 1; it then has the same shape (2,) as the left part and will fit:
df.loc[df.B.isna(),'A'].str.extract(r'( [a-zA-Z](.*))')[1]
Returns:
2 oe
3 o
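For completeness, the original .loc assignment also works once the right-hand side is a Series of matching shape, e.g. a single-group regex with expand=False (a sketch combining the fixes above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['adam', 'monica', 'joe doe', 'michael mo'],
                   'B': ['david', 'valenti', np.nan, np.nan]})

# expand=False makes str.extract return a Series, so both sides have shape (2,)
mask = df['B'].isna()
df.loc[mask, 'B'] = df.loc[mask, 'A'].str.extract(r' ([a-zA-Z].*)', expand=False)
print(df['B'].tolist())  # ['david', 'valenti', 'doe', 'mo']
```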
There are 2 DataFrames; their datatypes are the same.
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows that are in df2 but not in df1.
For some reason the results of the different methods I've tried are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first, with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1,on=c, indicator = True, how='left', suffixes=('','_'))
.query("_merge == 'left_only'")[df1.columns])
print (df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
Try this:
print("------------------------------")
print(df1)
df2 = DataFrameFromString(s, columns)  # DataFrameFromString is the answerer's own helper; build df2 as shown above instead
print("------------------------------")
print(df2)
common = df1.merge(df2, on=["city", "name"]).rename(columns={"value_y": "value", "ID_y": "ID"}).drop("value_x", axis=1).drop("ID_x", axis=1)
print("------------------------------")
print(common)
OUTPUT:
------------------------------
ID city name value
0 ID city name value
1 1 LA John 111
2 2 NY Sam 222
3 3 SF Foo 333
4 4 Berlin Bar 444
------------------------------
ID city name value
0 1 NY Sam 223
1 2 LA John 111
2 3 SF Foo 335
3 4 London Foo1 999
4 5 Berlin Bar 444
------------------------------
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444
I have 2 dataframes, df1 and df2.
df1 contains information about some interactions between people.
df1
Name1 Name2
0 Jack John
1 Sarah Jack
2 Sarah Eva
3 Eva Tom
4 Eva John
df2 contains the status of people in general, including some of the people in df1.
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Laura 0
I would like df2 restricted to the people that appear in df1 (so Laura disappears), and for those in df1 that are not in df2, keep NaN (e.g. Eva), like this:
df2
Name Y
0 Jack 0
1 John 1
2 Sarah 0
3 Tom 1
4 Eva NaN
Create a DataFrame from the unique values of df1 and map it against df2:
df = pd.DataFrame(np.unique(df1.values),columns=['Name'])
df['Y'] = df.Name.map(df2.set_index('Name')['Y'])
print(df)
Name Y
0 Eva NaN
1 Jack 0.0
2 John 1.0
3 Sarah 0.0
4 Tom 1.0
Note : Order is not preserved.
You can create a list of the unique names in df1 and use isin:
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
df2.loc[~df2['Name'].isin(names), 'Y'] = np.nan
Name Y
0 Jack 0.0
1 John 1.0
2 Sarah 0.0
3 Tom 1.0
4 Laura NaN
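The isin approach above keeps Laura and misses Eva; to match the desired output exactly, one sketch is to reindex df2 on the unique names from df1 (note the order comes out alphabetically sorted):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name1': ['Jack', 'Sarah', 'Sarah', 'Eva', 'Eva'],
                    'Name2': ['John', 'Jack', 'Eva', 'Tom', 'John']})
df2 = pd.DataFrame({'Name': ['Jack', 'John', 'Sarah', 'Tom', 'Laura'],
                    'Y': [0, 1, 0, 1, 0]})

# unique names appearing anywhere in df1; Laura is absent, Eva is present
names = np.unique(df1[['Name1', 'Name2']].values.ravel())
# reindex drops Laura and inserts NaN for Eva
out = df2.set_index('Name').reindex(names).reset_index()
print(out)
```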
Given this data:
pd.DataFrame({'id':['aaa','aaa','abb','abb','abb','acd','acd','acd'],
'loc':['US','UK','FR','US','IN','US','CN','CN']})
id loc
0 aaa US
1 aaa UK
2 abb FR
3 abb US
4 abb IN
5 acd US
6 acd CN
7 acd CN
How do I pivot it to this:
id loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I am looking for the most idiomatic method.
I think you can create a new column cols with groupby and cumcount, convert it to string with astype, and finally use pivot:
df['cols'] = 'loc' + (df.groupby('id')['id'].cumcount() + 1).astype(str)
print df
id loc cols
0 aaa US loc1
1 aaa UK loc2
2 abb FR loc1
3 abb US loc2
4 abb IN loc3
5 acd US loc1
6 acd CN loc2
7 acd CN loc3
print df.pivot(index='id', columns='cols', values='loc')
cols loc1 loc2 loc3
id
aaa US UK None
abb FR US IN
acd US CN CN
If you want to remove the index and columns names, use rename_axis:
print df.pivot(index='id', columns='cols', values='loc').rename_axis(None)
.rename_axis(None, axis=1)
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
All together, thank you Colin:
print pd.pivot(df['id'], 'loc' + (df.groupby('id').cumcount() + 1).astype(str), df['loc'])
.rename_axis(None)
.rename_axis(None, axis=1)
loc1 loc2 loc3
aaa US UK None
abb FR US IN
acd US CN CN
I tried rank, but I get an error in version 0.18.0:
print df.groupby('id')['loc'].transform(lambda x: x.rank(method='first'))
#ValueError: first not supported for non-numeric data
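On Python 3 with newer pandas, the same cumcount-then-pivot logic can be written as one chain (a sketch of the approach above, unchanged in substance):

```python
import pandas as pd

df = pd.DataFrame({'id': ['aaa', 'aaa', 'abb', 'abb', 'abb', 'acd', 'acd', 'acd'],
                   'loc': ['US', 'UK', 'FR', 'US', 'IN', 'US', 'CN', 'CN']})

# number occurrences within each id, build 'loc1', 'loc2', ..., then pivot
out = (df.assign(cols='loc' + (df.groupby('id').cumcount() + 1).astype(str))
         .pivot(index='id', columns='cols', values='loc')
         .rename_axis(None)            # drop the 'id' index name
         .rename_axis(None, axis=1))   # drop the 'cols' columns name
print(out)
```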