map multiple columns by a single dictionary in pandas - python

I have a DataFrame with multiple columns of 'yes' and 'no' strings. I want to convert all of them to a boolean dtype. To map one column, I would use
dict_map_yn_bool={'yes':True, 'no':False}
df['nearby_subway_station'].map(dict_map_yn_bool)
This does the job for one column. How can I convert multiple columns with a single line of code?

You can use applymap:
df = pd.DataFrame({'nearby_subway_station':['yes','no'], 'Station':['no','yes']})
print (df)
  Station nearby_subway_station
0      no                   yes
1     yes                    no
dict_map_yn_bool={'yes':True, 'no':False}
df = df.applymap(dict_map_yn_bool.get)
print (df)
  Station nearby_subway_station
0   False                  True
1    True                 False
Another solution:
for x in df:
    df[x] = df[x].map(dict_map_yn_bool)
print (df)
  Station nearby_subway_station
0   False                  True
1    True                 False
Thanks to Jon Clements for the very nice idea of using replace:
df = df.replace({'yes': True, 'no': False})
print (df)
  Station nearby_subway_station
0   False                  True
1    True                 False
There are some differences if the data are not in the dict:
df = pd.DataFrame({'nearby_subway_station':['yes','no','a'], 'Station':['no','yes','no']})
print (df)
  Station nearby_subway_station
0      no                   yes
1     yes                    no
2      no                     a
applymap with dict.get creates None for boolean and string columns, and NaN for numeric ones.
df = df.applymap(dict_map_yn_bool.get)
print (df)
  Station nearby_subway_station
0   False                  True
1    True                 False
2   False                  None
map creates NaN:
for x in df:
    df[x] = df[x].map(dict_map_yn_bool)
print (df)
  Station nearby_subway_station
0   False                  True
1    True                 False
2   False                   NaN
replace doesn't create NaN or None; values not in the dict are left untouched:
df = df.replace(dict_map_yn_bool)
print (df)
  Station nearby_subway_station
0   False                  True
1    True                 False
2   False                     a
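A side note, not from the original answers: in pandas 2.1+, DataFrame.applymap is deprecated in favour of the elementwise DataFrame.map. A minimal sketch, assuming pandas 2.1 or later:
import pandas as pd

dict_map_yn_bool = {'yes': True, 'no': False}
df = pd.DataFrame({'nearby_subway_station': ['yes', 'no'], 'Station': ['no', 'yes']})
# DataFrame.map applies the function elementwise, like the old applymap
df = df.map(dict_map_yn_bool.get)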

You could use a stack/unstack idiom
df.stack().map(dict_map_yn_bool).unstack()
Using @jezrael's setup:
df = pd.DataFrame({'nearby_subway_station':['yes','no'], 'Station':['no','yes']})
dict_map_yn_bool={'yes':True, 'no':False}
Then
df.stack().map(dict_map_yn_bool).unstack()
  Station nearby_subway_station
0   False                  True
1    True                 False
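Why this works: stack collapses the frame into a single Series with one value per (row, column) pair, Series.map applies the dictionary, and unstack restores the original shape. The same idiom can be restricted to a subset of columns; a sketch, where cols is a hypothetical list of the yes/no columns:
cols = ['Station', 'nearby_subway_station']  # hypothetical: only the yes/no columns
df[cols] = df[cols].stack().map(dict_map_yn_bool).unstack()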
Timing: the original answer included benchmark plots comparing the approaches on small and bigger data.

I would work with pandas.DataFrame.replace, as I think it is the simplest approach and has built-in arguments to support this task. It is also a one-liner solution, as requested.
First case, replace all instances of 'yes' or 'no':
import pandas as pd
import numpy as np
from numpy import random
# Generating the data, 20 rows by 5 columns.
data = random.choice(['yes','no'], size=(20, 5), replace=True)
col_names = ['col_{}'.format(a) for a in range(1,6)]
df = pd.DataFrame(data, columns=col_names)
# Supplying lists of values to what they will replace. No dict needed.
df_bool = df.replace(to_replace=['yes','no'], value=[True, False])
Second case, where you only want to replace values in a subset of columns, as described in the documentation for DataFrame.replace: use a nested dictionary where the outer keys are the columns to replace in, and the values are dictionaries mapping old values to their replacements:
dict_map_yn_bool={'yes':True, 'no':False}
replace_dict = {'col_1': dict_map_yn_bool,
                'col_2': dict_map_yn_bool}
df_bool = df.replace(to_replace=replace_dict)
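A quick sanity check (my addition, not part of the original answer): inspect the dtypes afterwards; columns where every value was replaced should come back as bool.
print(df_bool.dtypes)  # col_1 and col_2 should now show bool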

Related

A faster method than "for" to scan a DataFrame - Python

I'm looking for a way (using a built-in pandas function) to scan a column of a DataFrame, comparing its values across different indices.
Here is an example using a for loop. I have a dataframe with a single column col_1. I want to create a column col_2 with TRUE/FALSE values in this way:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at each index with the next N indices.
I'd like to do the same operation without using a for loop. I've already tried using shift and df.loc, but the computational time is similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
Update: looking more carefully at your double loop, it seems you want to flag all duplicates except the last. That's done by changing the keep argument to 'last' instead of the default 'first'.
As suggested by @QuangHoang in the comments, duplicated() works nicely for this:
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
   col_1  col_2
0      2  False
1      0   True
2      1   True
3      0   True
4      0  False
5      3  False
6      1   True
7      1  False
8      4   True
9      4  False

How to determine if a dataframe column contains a particular list, independently of its order?

I have this dataframe:
df = pd.DataFrame()
df['Col1'] = [['B'],['A','D','B'],['D','C']]
df['Col2'] = [1,2,4]
df
      Col1  Col2
0      [B]     1
1  [A,D,B]     2
2    [D,C]     4
I would like to know if Col1 contains the list [B,A,D], without caring about the order of the lists (neither those inside the column nor the one to check).
I would therefore like to get True as the answer here.
How could I do this?
Thanks
If values are not duplicated you can compare sets:
L = ['B','A','D']
print (df['Col1'].map(set).eq(set(L)))
0    False
1     True
2    False
Name: Col1, dtype: bool
If you want a scalar output (True or False), test whether there is at least one True in the column with Series.any:
print (df['Col1'].map(set).eq(set(['B','A','D'])).any())
True
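If the lists can contain duplicates, sets silently discard the counts. A hedged alternative (my addition) compares multisets with collections.Counter instead:
from collections import Counter

L = ['B', 'A', 'D']
target = Counter(L)
# Counter comparison also respects how many times each element appears
print(df['Col1'].apply(lambda x: Counter(x) == target))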
Use:
l=['B','A','D']
[set(i)==set(l) for i in df['Col1']]
#[False, True, False]
IIUC, a method using get_dummies:
l=['B','A','D']
df.Col1.str.join(',').str.get_dummies(',')[l].all(1)
Out[197]:
0    False
1     True
2    False
dtype: bool
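One caveat worth noting (my observation, not from the original answer): this checks that every element of l is present, not exact set equality, so a row with extra elements also passes:
df2 = pd.DataFrame({'Col1': [['A', 'B', 'C', 'D']]})
# returns True even though the row contains the extra element 'C'
print(df2.Col1.str.join(',').str.get_dummies(',')[l].all(1))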

Compare 2 Pandas Dataframes and return all rows that are different

I have two DataFrames with the same schema but different data. I want to compare them and get all rows that have different values in any column.
"df1":
id Store is_open
1 'Walmart' true
2 'Best Buy' false
3 'Target' true
4 'Home Depot' true
"df2":
id Store is_open
1 'Walmart' false
2 'Best Buy' true
3 'Target' true
4 'Home Depot' false
I was able to get the difference, but I don't get all the columns, just the ones that changed. So I get the following output:
result_df:
id  is_open  is_open
1   true     false
2   false    true
4   true     false
Here is the code to achieve the above output:
ne_stacked = (from_aoi_df != to_aoi_df).stack()
changed = ne_stacked[ne_stacked]
changed.index.names = ['id', 'col_changed']
difference_locations = np.where(from_aoi_df != to_aoi_df)
changed_from = from_aoi_df.values[difference_locations]
changed_to = to_aoi_df.values[difference_locations]
df5=pd.DataFrame({'from': changed_from, 'to': changed_to})
df5
However, besides the above result, I also want the unchanged columns, such as Store, to be included, so my expected output is:
expected_result_df:
id  Store       is_open_df1  is_open_df2
1   Walmart     true         false
2   Best Buy    false        true
4   Home Depot  true         false
How can I achieve that?
Using the pandas merge function:
df = pd.merge(df1,df2[['id','is_open']],on='id')
Then keep only the rows where the two is_open columns differ:
df = df[df["is_open_x"]!=df["is_open_y"]]
df
To rename the columns to match your expected output:
df.rename(columns={"is_open_x":"is_open_df1","is_open_y":"is_open_df2"})
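A small variation (my suggestion, not from the original answer): merge accepts a suffixes argument, so the _df1/_df2 names can be produced up front instead of renaming afterwards:
df = pd.merge(df1, df2[['id', 'is_open']], on='id', suffixes=('_df1', '_df2'))
df = df[df['is_open_df1'] != df['is_open_df2']]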
If the data frames are of different lengths, here's something you can use.
new_df = pd.concat([df1, df2]).reset_index(drop=True)
df = new_df.drop_duplicates(subset=['col1','col2'], keep=False)
This will give you a data frame called df with just the records that were different.
where df1 and df2 are the two data frames you are trying to compare.
subset = the list of columns you want to find duplicates in.
keep=False will drop each duplicate together with its original.
keep='last' will retain the record from the second data frame.
keep='first' will retain the record from the first data frame.
If the dataframes are of the same length:
df=np.where(df1==df2,'true','false')
Hope this helps!!
This works if df1 and df2 have unique values... you can drop duplicates, if any are present, before using this.
How about this?
# assumes df1 and df2 are aligned on the same index
df1['is_open_df2'] = df2['is_open']
expected_result_df = df1[df1['is_open'] != df1['is_open_df2']]
Use:
#compare DataFrames
m = (from_aoi_df != to_aoi_df)
#check at least one True per columns
m1 = m.any(axis=0)
#check at least one True per rows
m2 = m.any(axis=1)
#filter only not equal values
df1 = from_aoi_df.loc[m2, m1].add_suffix('_df1')
df2 = to_aoi_df.loc[m2, m1].add_suffix('_df2')
#filter equal values
df3 = from_aoi_df.loc[m2, ~m1]
#join together
df = pd.concat([df3, df1, df2], axis=1)
print (df)
   id       Store is_open_df1 is_open_df2
0   1     Walmart        True       False
1   2    Best Buy       False        True
3   4  Home Depot        True       False
Verify solution with multiple changed columns:
#changed first value id column
print (from_aoi_df)
   id       Store  is_open
0  10     Walmart     True
1   2    Best Buy    False
2   3      Target     True
3   4  Home Depot     True
m = (from_aoi_df != to_aoi_df)
m1 = m.any(axis=0)
m2 = m.any(axis=1)
df1 = from_aoi_df.loc[m2, m1].add_suffix('_df1')
df2 = to_aoi_df.loc[m2, m1].add_suffix('_df2')
df3 = from_aoi_df.loc[m2, ~m1]
df = pd.concat([df3, df1, df2], axis=1)
print (df)
        Store  id_df1 is_open_df1  id_df2 is_open_df2
0     Walmart      10        True       1       False
1    Best Buy       2       False       2        True
3  Home Depot       4        True       4       False

How do I determine if the id is unique?

What code should I type in an IPython notebook to determine whether the values in the ID column of a CSV file are unique?
I have tried searching online but to no avail.
Probably the simplest is to compare the length of the df against the length of the unique values:
len(df) == len(df['ID'].unique())
will yield True or False
Also you could call drop_duplicates():
len(df) == len(df['ID'].drop_duplicates())
Also nunique:
len(df) == df['ID'].nunique()
Example:
In [6]:
df = pd.DataFrame({'a':[0,1,1,2,3,4]})
df
Out[6]:
   a
0  0
1  1
2  1
3  2
4  3
5  4
In [7]:
len(df) == df['a'].nunique()
Out[7]:
False
Another method is to invert the boolean series returned from duplicated and pass it to np.all, which returns True only if all values are True. For this sample data we get a False value (there is a duplicate), hence it yields False:
In [11]:
np.all(~df['a'].duplicated())
Out[11]:
False
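A further option (my addition, not from the original answer): pandas also exposes Series.is_unique, which is arguably the most direct spelling:
# True only when no value in the column repeats
print(df['a'].is_unique)  # False for this sample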

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/pandas and am struggling with extracting the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position  Letter  Value
1         a       TRUE
2         f       FALSE
3         c       TRUE
4         d       TRUE
5         k       FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position  Letter  Value
1         a       TRUE
3         c       TRUE
4         d       TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0     True
1    False
2     True
3     True
4    False
Name: Value, dtype: bool
and then use fancy indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
   Position Letter  Value
0         1      a   True
2         3      c   True
3         4      d   True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!), then use:
data['Value'] == 'TRUE'
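In that case, a minimal sketch (my addition) for converting the strings to real booleans first, so the rest of the answer applies unchanged:
# hypothetical: Value holds the strings 'TRUE'/'FALSE'
data['Value'] = data['Value'].map({'TRUE': True, 'FALSE': False})
answer = data[data['Value']]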
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of rows where your list items match your column name in df.
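Applied to this question's data, a usage example (assuming Value holds real booleans):
answer = data.loc[data['Value'].isin([True])]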
