How do I check multiple pandas columns against a dictionary in Python?

I have an existing pandas dataframe consisting of a country column and a market column. I want to check whether the countries are assigned to the correct markets, so I created a dictionary that maps each country (key) to the markets (values) it can fall within.
The structure of the dictionary is {'key': ['Market 1', 'Market 2', 'Market 3']}, since each country can belong to more than one market.
I would like to write a function that checks the values in the Country column and, according to the dictionary, reports whether the current mapping is correct, ideally flagging each row as a yes/no match.
Is there a way to reference a dictionary across two columns in a function? To confirm, the keys are the country names and the markets are the values.
I have included code required to make the dataframe:
import pandas as pd

data = {'Country': ['Mexico','Uruguay','Uruguay','Greece','Brazil','Brazil','Brazil','Brazil','Colombia','Colombia','Colombia','Japan','Japan','Brazil','Brazil','Spain','New Zealand'],
        'Market': ['LATAM','LATAM','LATAM','EMEA','ASIA','ASIA','LATAM BRAZIL','LATAM BRAZIL','LATAM CASA','LATAM CASA','LATAM','LATAM','LATAM','LATAM BRAZIL','LATAM BRAZIL','SOUTHEAST ASIA','SOUTHEAST ASIA']}
df = pd.DataFrame(data)
Thanks a lot.

A first idea is to create tuples and match them with Index.isin:
import numpy as np

d = {'Colombia': ['LATAM', 'LATAM CASA'], 'Brazil': ['ASIA']}
tups = [(k, x) for k, v in d.items() for x in v]
df['Market Match'] = np.where(df.set_index(['Country', 'Market']).index.isin(tups),
                              'yes', 'no')
print(df)
Country Market Market Match
0 Mexico LATAM no
1 Uruguay LATAM no
2 Uruguay LATAM no
3 Greece EMEA no
4 Brazil ASIA yes
5 Brazil ASIA yes
6 Brazil LATAM BRAZIL no
7 Brazil LATAM BRAZIL no
8 Colombia LATAM CASA yes
9 Colombia LATAM CASA yes
10 Colombia LATAM yes
11 Japan LATAM no
12 Japan LATAM no
13 Brazil LATAM BRAZIL no
14 Brazil LATAM BRAZIL no
15 Spain SOUTHEAST ASIA no
16 New Zealand SOUTHEAST ASIA no
Or use a left join with DataFrame.merge and indicator=True:
d = {'Colombia': ['LATAM', 'LATAM CASA'], 'Brazil': ['ASIA']}
df1 = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
                   columns=['Country', 'Market']).drop_duplicates()
df['Market Match'] = np.where(df.merge(df1, indicator=True, how='left')['_merge'].eq('both'),
                              'yes', 'no')
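Both approaches implement the same membership test, so the merge version produces the same Market Match column shown above.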

The following link might help you out in checking whether specific strings (e.g. market names) are included in your dataframe:
Check if string contains substring
For example:
fullstring = "StackAbuse"
substring = "tack"

if substring in fullstring:
    print("Found!")
else:
    print("Not found!")

Using apply row-wise, with .get so that countries missing from the dictionary do not raise a KeyError:
df['MATCH'] = df.apply(lambda row: row['Market'] in your_dictionary.get(row['Country'], []), axis=1)
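For example, with an illustrative dictionary (hypothetical; substitute your real country-to-market mapping):
your_dictionary = {'Colombia': ['LATAM', 'LATAM CASA'], 'Brazil': ['ASIA']}
With the sample data from the question, this marks the Brazil/ASIA and Colombia rows True and all other rows False.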

Related

How to create a dataframe using a list of dictionaries that also consist of lists

I have a list of dictionaries that also consist of lists and would like to create a dataframe using this list. For example, the data looks like this:
lst = [{'France': [[12548, 'ABC'], [45681, 'DFG'], [45684, 'HJK']]},
       {'USA': [[84921, 'HJK'], [28917, 'KLESA']]},
       {'Japan': [[38292, 'ASF'], [48902, 'DSJ']]}]
And this is the dataframe I'm trying to create
Country Amount Code
France 12548 ABC
France 45681 DFG
France 45684 HJK
USA 84921 HJK
USA 28917 KLESA
Japan 38292 ASF
Japan 48902 DSJ
As you can see, the keys became column values of the country column and the numbers and the strings became the amount and code columns. I thought I could use something like the following, but it's not working.
df = pd.DataFrame(lst)
You probably need to transform the data into a format that Pandas can read.
Original data
data = [
    {"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
    {"USA": [[84921, "HJK"], [28917, "KLESA"]]},
    {"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
Transforming the data
new_data = []
for country_data in data:
    for country, values in country_data.items():
        new_data += [{"Country": country, "Amount": amt, "Code": code} for amt, code in values]
Create the dataframe
df = pd.DataFrame(new_data)
Output
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
# Build one dataframe per dict and concatenate them side by side
df = pd.concat([pd.DataFrame(elem) for elem in lst])
# Compact each country column, then stack into one column of [Amount, Code] pairs
df = df.apply(lambda x: pd.Series(x.dropna().values)).stack()
df = df.reset_index(level=[0], drop=True).to_frame(name='vals')
# Split the pairs into separate Amount and Code columns
df = pd.DataFrame(df['vals'].to_list(), index=df.index, columns=['Amount', 'Code'])
print(df)
output:
Amount Code
France 12548 ABC
USA 84921 HJK
Japan 38292 ASF
France 45681 DFG
USA 28917 KLESA
Japan 48902 DSJ
France 45684 HJK
Use a nested list comprehension to flatten the data and pass it to the DataFrame constructor:
lst = [
{"France": [[12548, "ABC"], [45681, "DFG"], [45684, "HJK"]]},
{"USA": [[84921, "HJK"], [28917, "KLESA"]]},
{"Japan": [[38292, "ASF"], [48902, "DSJ"]]},
]
L = [(country, *x)
     for country_data in lst
     for country, values in country_data.items()
     for x in values]
df = pd.DataFrame(L, columns=['Country', 'Amount', 'Code'])
print(df)
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ
Build a new dictionary that combines the individual dicts into one, before concatenating the dataframes:
new_dict = {}
for ent in lst:
    for key, value in ent.items():
        new_dict[key] = pd.DataFrame(value, columns=['Amount', 'Code'])

pd.concat(new_dict, names=['Country']).droplevel(1).reset_index()
Country Amount Code
0 France 12548 ABC
1 France 45681 DFG
2 France 45684 HJK
3 USA 84921 HJK
4 USA 28917 KLESA
5 Japan 38292 ASF
6 Japan 48902 DSJ

Frequency plot of a Pandas Dataframe

I have a dataframe df such that:
df['user_location'].value_counts()
India 3741
United States 2455
New Delhi, India 1721
Mumbai, India 1401
Washington, DC 1354
...
SpaceCoast,Florida 1
stuck in a book. 1
Beirut , Lebanon 1
Royston Vasey - Tralfamadore 1
Langham, Colchester 1
Name: user_location, Length: 26920, dtype: int64
I want to know the frequency of specific countries like USA, India from the user_location column. Then I want to plot the frequencies as USA, India, and Others.
So, I want to apply some operation on that column such that the value_counts() will give the output as:
India (sum of all frequencies of all the locations in India including cities, states, etc.)
USA (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others (sum of all frequencies of the other locations)
It seems I should sum the frequencies of rows that refer to the same country and lump the rest together, but handling the names of cities, states, etc. makes this complex. What is the most efficient way to do it?
Adding to @Trenton_McKinney's answer in the comments: if you need to map a country's states/provinces to the country name, you will have to do a little work to make those associations. For example, for India and the USA, you can grab a list of their states from Wikipedia and map them onto your own data to relabel them with their respective country names, as follows:
# Get states of India and USA
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states
# Make a sample dataframe
df = pd.DataFrame({'Country': states})
Country
0 Andhra Pradesh
1 Arunachal Pradesh
2 Assam
3 Bihar
4 Chhattisgarh
... ...
73 Virginia[E]
74 Washington
75 West Virginia
76 Wisconsin
77 Wyoming
Map state names to country names:
# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)
Country
0 India
1 India
2 India
3 India
4 India
... ...
73 USA
74 USA
75 USA
76 USA
77 USA
But from your data sample it looks like you will have a lot of edge cases to deal with as well.
Using the idea from the previous answer, I first gathered all the locations, including cities, unions, states, districts, and territories. Then I wrote a function checkl() that checks whether a location belongs to India or the USA and converts it to the country name. Finally, the function is applied to the dataframe column df['user_location']:
# Trying to get all the locations of USA and India
import pandas as pd
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
us_cities = pd.read_html(us_url)[0].iloc[:, 1].tolist() + pd.read_html(us_url)[0].iloc[:, 2].tolist() + pd.read_html(us_url)[0].iloc[:, 3].tolist()
us_Federal_district = pd.read_html(us_url)[1].iloc[:, 0].tolist()
us_Inhabited_territories = pd.read_html(us_url)[2].iloc[:, 0].tolist()
us_Uninhabited_territories = pd.read_html(us_url)[3].iloc[:, 0].tolist()
us_Disputed_territories = pd.read_html(us_url)[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist() + pd.read_html(in_url)[3].iloc[:, 4].tolist() + pd.read_html(in_url)[3].iloc[:, 5].tolist()
in_unions = pd.read_html(in_url)[4].iloc[:, 0].tolist()
ind = in_states + in_unions
usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind])
# Country name checker function
def checkl(T):
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if ele in T]
    res_us = [ele for ele in us if ele in T]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind):
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us):
        T = 'USA'
    elif len(T.split(',')) > 1:
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr:
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr:
            T = 'USA'
        else:
            T = 'Others'
    else:
        T = 'Others'
    return T
# Applying the function to the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others 74206
USA 47840
India 20291
Name: user_location, dtype: int64
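Since the stated goal is a frequency plot, the resulting counts can be plotted directly. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

counts = df['user_location'].dropna().apply(checkl).value_counts()
counts.plot(kind='bar', rot=0)
plt.ylabel('Frequency')
plt.show()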
I am quite new to Python coding. I think this code can be written in a better and more compact form, and as mentioned in the previous answer, there are still a lot of edge cases to deal with. So I have also posted it on Code Review Stack Exchange. Any criticisms and suggestions to improve the efficiency and readability of my code would be greatly appreciated.

Which is the Pythonic way for Pandas to obtain the same result as an SQL statement: "UPDATE- LEFT JOIN - SET - WHERE"?

I would like to replicate in Pandas the following SQL structure: UPDATE dataframe1 LEFT JOIN dataframe2 SET dataframe1.column1 = dataframe2.column2 WHERE dataframe1.column3 > X
I know it is possible to merge the dataframes and then work on the merged columns with .where, but it doesn't seem to be a straightforward solution:
df = pd.merge(df1, df2, suffixes=('_a', '_b'))
df['clmn1'] = df['clmn1_b'].where(df['clmn1'] > 0, df['clmn1_b'])
Is there a better way to reach the goal?
Thanks
To use your example from the comments:
In [21]: df
Out[21]:
Name Gender country
0 Jack M USA
1 Nick M UK
2 Alphio F RU
3 Jenny F USA
In [22]: country_map = {'USA': 'United States', 'UK': 'United Kingdom', 'RU': 'Russia'}
In [23]: df.country.map(country_map)
Out[23]:
0 United States
1 United Kingdom
2 Russia
3 United States
Name: country, dtype: object
To update just the M rows you could filter on Gender and use update:
In [24]: df.country.update(df[df.Gender == 'M'].country.map(country_map))
In [25]: df
Out[25]:
Name Gender country
0 Jack M United States
1 Nick M United Kingdom
2 Alphio F RU
3 Jenny F USA
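For the general UPDATE ... LEFT JOIN ... SET ... WHERE pattern, one option is a left merge followed by a masked assignment. A minimal sketch, with hypothetical key/column names standing in for yours:
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'column1': ['a', 'b', 'c'], 'column3': [5, 15, 25]})
df2 = pd.DataFrame({'key': [1, 2], 'column2': ['x', 'y']})

X = 10
merged = df1.merge(df2, on='key', how='left')
# SET column1 = column2 only WHERE column3 > X and the join found a match
mask = (merged['column3'] > X) & merged['column2'].notna()
df1.loc[mask, 'column1'] = merged.loc[mask, 'column2']
print(df1)
Because the left merge preserves df1's row order and count (df2's keys are unique here), the boolean mask aligns row for row with df1.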

String mode aggregation with group by function

I have dataframe which looks like below
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group by country and aggregate on the most frequent city within each country.
My desired output should be like
Country City
UK London
USA Washington
Because London and Washington appear twice, whereas Manchester and Chicago appear only once.
I tried
from scipy.stats import mode

df_summary = df.groupby('Country')['City'].\
    apply(lambda x: mode(x)[0][0]).reset_index()
But it seems it won't work on strings
I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a series, using iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
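One caveat worth noting: Series.mode returns all tied values in sorted order, so iat[0] picks the alphabetically first city whenever two cities are equally frequent.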
Try the following:
>>> df.City.mode()
0 London
1 Washington
dtype: object
Or use scipy.stats with a lambda:
import pandas as pd
from scipy import stats

df.groupby('Country').agg({'City': lambda x: stats.mode(x)[0]})
City
Country
UK London
USA Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
However, it also gives a nice count if you don't want to return only the first value:
>>> df.groupby('Country').agg({'City': lambda x:stats.mode(x)})
City
Country
UK ([London], [2])
USA ([Washington], [2])

I have a CSV and want to update it with values from another CSV. What is the most efficient way to do this?

I have this CSV:
Name Species Country
0 Hobbes Tiger U.S.
1 SherKhan Tiger India
2 Rescuer Mouse Australia
3 Mickey Mouse U.S.
And I have a second CSV:
Continent Countries Unnamed: 2 Unnamed: 3 Unnamed: 4
0 North America U.S. Mexico Guatemala Honduras
1 Asia India China Nepal NaN
2 Australia Australia NaN NaN NaN
3 Africa South Africa Botswana Zimbabwe NaN
I want to use the second CSV to update the first file so that the output is:
Name Species Country
0 Hobbes Tiger North America
1 SherKhan Tiger Asia
2 Rescuer Mouse Australia
3 Mickey Mouse North America
So far this is the closest I have gotten:
import pandas as pd
# Import my data.
data = pd.read_csv('Continents.csv')
Animals = pd.read_csv('Animals.csv')
Animalsdf = pd.DataFrame(Animals)
# Transpose my data from horizontal to vertical.
data1 = data.T
# Clean my data and update my header with the first column.
data1.columns = data1.iloc[0]
# Drop now duplicated data.
data1.drop(data1.index[[0]], inplace = True)
# Build the dictionary.
data_dict = {col: list(data1[col]) for col in data1.columns}
# Update my csv.
Animals['Country'] = Animals['Country'].map(data_dict)
print(Animals)
This results in a dictionary that has lists as its values, and therefore I just get NaN out:
Name Species Country
0 Hobbes Tiger NaN
1 SherKhan Tiger NaN
2 Rescuer Mole [Australia, nan, nan, nan]
3 Mickey Mole NaN
I've tried flipping from list to tuples and this doesn't work. Have tried multiple ways to pull in the dictionary etc. I am just out of ideas.
Sorry if the code is super junky. I'm learning this as I go. Figured a project was the best way to learn a new language. Didn't think it would be this difficult.
Any suggestions would be appreciated. I need to be able to use the code so that when I get multiple reference CSVs, I can update my data with new keys. Hope this is clear.
Thanks in advance.
One intuitive solution is to use a dictionary mapping. Data from @WillMonge.
pd.DataFrame.itertuples works by producing namedtuples, but they may also be referenced using numeric indexers.
# create mapping dictionary
d = {}
for row in df.itertuples():
    d.update(dict.fromkeys(filter(None, row[2:]), row[1]))
# apply mapping dictionary
data['Continent'] = data['Country'].map(d)
print(data)
Country name Continent
0 China 2 Asia
1 China 5 Asia
2 Canada 9 America
3 Egypt 0 Africa
4 Mexico 3 America
You should use csv.DictReader and csv.DictWriter; you can learn how to use them at the link below.
https://docs.python.org/2/library/csv.html
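A minimal sketch of that approach, assuming hypothetical file names animals.csv and continents.csv, and that the reference CSV has a header naming every column:
import csv

# Build a country -> continent lookup from the reference CSV
continent_of = {}
with open('continents.csv', newline='') as f:
    for row in csv.DictReader(f):
        continent = row.pop('Continent')
        for country in row.values():
            if country:
                continent_of[country] = continent

# Rewrite the animals CSV with each Country replaced by its continent
with open('animals.csv', newline='') as f:
    rows = list(csv.DictReader(f))
for row in rows:
    row['Country'] = continent_of.get(row['Country'], row['Country'])

with open('animals_updated.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Name', 'Species', 'Country'])
    writer.writeheader()
    writer.writerows(rows)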
Here is an update of your code; I have tried to add comments to explain each step:
import pandas as pd

# Read data in (read_csv also returns a DataFrame directly)
data = pd.DataFrame({'name': [2, 5, 9, 0, 3], 'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
                   'Country1': ['China', 'Mexico', 'Egypt'],
                   'Country2': ['Japan', 'Canada', None],
                   'Country3': ['Thailand', None, None]})

# Unstack to get a row for each country, then remove the continent rows
premap_df = pd.DataFrame(df.unstack().drop('Continent')).dropna().reset_index()
premap_df.columns = ['_', 'continent_key', 'Country']

# Merge the continent back based on the continent_key (old row number)
map_df = pd.merge(premap_df, df[['Continent']], left_on='continent_key', right_index=True)[['Continent', 'Country']]

# Merge with the data now
pd.merge(data, map_df, on='Country')
For further reference, Wes McKinney's Python for Data Analysis is one of the best books out there for learning pandas.
You can always create buckets and run conditions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Hobbes','SherKhan','Rescuer','Mickey'], 'Species':['Tiger','Tiger','Mouse','Mouse'],'Country':['U.S.','India','Australia','U.S.']})
North_America = ['U.S.', 'Mexico', 'Guatemala', 'Honduras']
Asia = ['India', 'China', 'Nepal']
Australia = ['Australia']
Africa = ['South Africa', 'Botswana', 'Zimbabwe']
conditions = [
(df['Country'].isin(North_America)),
(df['Country'].isin(Asia)),
(df['Country'].isin(Australia)),
(df['Country'].isin(Africa))
]
choices = [
'North America',
'Asia',
'Australia',
'Africa'
]
df['Continent'] = np.select(conditions, choices, default = np.nan)
df
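One thing to watch for: with string choices, the np.nan default typically ends up as the string 'nan' in the result array; passing a string default such as default='Other' keeps the Continent column consistently string-typed and easier to filter.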
