Apply Zip function on Pandas Dataframe using a For Loop Argument - python

I have a function called handle_text that derives new values from a dataframe column:
def handle_text(txt):
    if txt.lower()[:6] == 'deu_ga':
        return 'Western Europe', 'Germany'
    elif txt.lower()[:6] == 'fra_ga':
        return 'Western Europe', 'France'
    return 'Other', 'Other'
I apply handle_text on various dataframes in the following way:
campaigns_df['Region'], campaigns_df['Market'] = zip(*campaigns_df['Campaign Name'].apply(handle_text))
atlas_df['Region'], atlas_df['Market'] = zip(*atlas_df['Campaign Name'].apply(handle_text))
flashtalking_df['Region'], flashtalking_df['Market'] = zip(*flashtalking_df['Campaign Name'].apply(handle_text))
I was wondering if there was a way to do a for loop to apply the function to various dfs at once:
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df]
columns_df = ['Campaign Name']
for df in dataframes:
    for column in df.columns:
        if column in columns_df:
            zip(df.column.apply(handle_text))
However the error I get is:
AttributeError: 'DataFrame' object has no attribute 'column'

I managed to solve it like this:
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df, mediaplan_df]
columns_df = 'Campaign Name'
for df in dataframes:
    df['Region'], df['Market'] = zip(*df[columns_df].apply(handle_text))

You need to change the attribute access with . to the more general access with []:
zip(df.column.apply(handle_text))
to
zip(df[column].apply(handle_text))
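Putting that fix into the original loop (a sketch using the question's own dataframes and handle_text):

dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df]
columns_df = ['Campaign Name']
for df in dataframes:
    for column in df.columns:
        if column in columns_df:
            # Bracket indexing works for any column name, including ones with spaces
            df['Region'], df['Market'] = zip(*df[column].apply(handle_text))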
EDIT:
Better solution:
atlas_df = pd.DataFrame({'Campaign Name':['deu_gathf', 'deu_gahf', 'fra_gagg'],'another_col':[1,2,3]})
flashtalking_df = pd.DataFrame({'Campaign Name':['deu_gahf','fra_ga', 'deu_gatt'],'another_col':[4,5,6]})
dataframes = [atlas_df, flashtalking_df]
columns_df = 'Campaign Name'
You can map with a dict and then create the new columns:
d = {'deu_ga': ['Western Europe','Germany'], 'fra_ga':['Western Europe','France']}
for df in dataframes:
    df[['Region','Market']] = pd.DataFrame(df[columns_df].str.lower()
                                                         .str[:6]
                                                         .map(d)
                                                         .values.tolist())
    #print (df)
print (atlas_df)
Campaign Name another_col Region Market
0 deu_gathf 1 Western Europe Germany
1 deu_gahf 2 Western Europe Germany
2 fra_gagg 3 Western Europe France
print (flashtalking_df)
Campaign Name another_col Region Market
0 deu_gahf 4 Western Europe Germany
1 fra_ga 5 Western Europe France
2 deu_gatt 6 Western Europe Germany
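One caveat worth noting (my addition, not part of the original answer): campaign prefixes missing from d map to NaN, whereas handle_text fell back to 'Other', 'Other'. A sketch of one way to restore that fallback:

for df in dataframes:
    mapped = df[columns_df].str.lower().str[:6].map(d)
    # Prefixes absent from d become NaN; replace them with the 'Other' fallback
    mapped = mapped.apply(lambda x: x if isinstance(x, list) else ['Other', 'Other'])
    df[['Region', 'Market']] = pd.DataFrame(mapped.tolist(), index=df.index)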

Related

How do I check multiple pandas columns against a dictionary in Python?

I have an existing pandas dataframe, consisting of a country column and market column. I want to check if the countries are assigned to the correct markets. As such I created a dictionary where each country (key) is mapped to the correct markets (values) it can fall within. The structure of the dataframe is below:
The structure of the dictionary is {'key':['Market 1', 'Market 2', 'Market 3']}. This is because each country has a couple of markets they could belong to.
I would like to write a function, which checks the values in the Country column and see if according to the dictionary, the current mapping is correct. So ideally, the desired output would be as follows:
Is there a way to reference a dictionary across two columns in a function? To confirm, the keys are the country names, and the markets are the values.
I have included code required to make the dataframe:
data = {'Country': ['Mexico','Uruguay','Uruguay','Greece','Brazil','Brazil','Brazil','Brazil','Colombia','Colombia','Colombia','Japan','Japan','Brazil','Brazil','Spain','New Zealand'],
'Market': ['LATAM','LATAM','LATAM','EMEA','ASIA','ASIA','LATAM BRAZIL','LATAM BRAZIL','LATAM CASA','LATAM CASA','LATAM','LATAM','LATAM','LATAM BRAZIL','LATAM BRAZIL','SOUTHEAST ASIA','SOUTHEAST ASIA']
}
df = pd.DataFrame(data)
Thanks a lot.
The first idea is to create tuples and match them with Index.isin:
import numpy as np

d = {'Colombia':['LATAM','LATAM CASA'], 'Brazil':['ASIA']}
tups = [(k, x) for k, v in d.items() for x in v]
df['Market Match'] = np.where(df.set_index(['Country','Market']).index.isin(tups),
                              'yes', 'no')
print (df)
Country Market Market Match
0 Mexico LATAM no
1 Uruguay LATAM no
2 Uruguay LATAM no
3 Greece EMEA no
4 Brazil ASIA yes
5 Brazil ASIA yes
6 Brazil LATAM BRAZIL no
7 Brazil LATAM BRAZIL no
8 Colombia LATAM CASA yes
9 Colombia LATAM CASA yes
10 Colombia LATAM yes
11 Japan LATAM no
12 Japan LATAM no
13 Brazil LATAM BRAZIL no
14 Brazil LATAM BRAZIL no
15 Spain SOUTHEAST ASIA no
16 New Zealand SOUTHEAST ASIA no
Or use a left join via DataFrame.merge with indicator=True:
d = {'Colombia':['LATAM','LATAM CASA'], 'Brazil':['ASIA']}
df1 = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
                   columns=['Country','Market']).drop_duplicates()
df['Market Match'] = np.where(df.merge(df1, indicator=True, how='left')['_merge'].eq('both'),
                              'yes', 'no')
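For reference, the helper frame df1 built from the dictionary looks like this:

print (df1)
    Country      Market
0  Colombia       LATAM
1  Colombia  LATAM CASA
2    Brazil        ASIA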
The following link might help you out in checking if specific strings (e.g. "Markets") are included in your dataframe.
Check if string contains substring
For example:
fullstring = "StackAbuse"
substring = "tack"
if substring in fullstring:
    print("Found!")
else:
    print("Not found!")
df['MATCH'] = df.apply(lambda row: row['Market'] in your_dictionary[row['Country']], axis=1)
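Note that this one-liner raises a KeyError for any country missing from the dictionary. A guarded variant (a sketch using dict.get with an empty-list default, so unmapped countries simply count as not matched):

df['MATCH'] = df.apply(lambda row: row['Market'] in your_dictionary.get(row['Country'], []),
                       axis=1)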

Renaming column in pandas with a multi level index

I would like to rename 'multi level columns' of a pandas dataframe to 'single level columns'. My code so far does not give any errors but does not rename either. Any suggestions for code improvements?
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3][[('Country', 'Country'), ('GDP[10]', 'GDP[10]')]]\
.rename(columns={('Country', 'Country'):'Country', ('GDP[10]', 'GDP[10]'): 'GDP'})
df
I prefer to use the rename method. df.columns = ['Country', 'GDP'] works but is not what I am looking for.
For a rename solution, create a dictionary by flattening the MultiIndex values with join and zipping them with the new column names:
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3]
df.columns = df.columns.map('_'.join)
old = ['No._No.', 'Country_Country', 'GNI (Atlas method)[8]_value (a)',
'GNI (Atlas method)[8]_a - GDP', 'GNI[9]_value (b)', 'GNI[9]_b - GDP',
'GDP[10]_GDP[10]']
new = ['No.','Country','GNI a','GDP a','GNI b', 'GNI b', 'GDP']
df = df.rename(columns=dict(zip(old, new)))
If you want to create the dictionary for rename explicitly:
d = {'No._No.': 'No.', 'Country_Country': 'Country', 'GNI (Atlas method)[8]_value (a)': 'GNI a', 'GNI (Atlas method)[8]_a - GDP': 'GDP a', 'GNI[9]_value (b)': 'GNI b', 'GNI[9]_b - GDP': 'GNI b', 'GDP[10]_GDP[10]': 'GDP'}
df = df.rename(columns=d)
print (df)
No. Country GNI a GDP a GNI b GNI b GDP
0 1 United States 20636317 91974 20837347 293004 20544343
1 2 China 13181372 -426779 13556853 -51298 13608151
2 3 Japan 5226599 255276 5155423 184100 4971323
3 4 Germany 3905321 -42299 4058030 110410 3947620
4 5 United Kingdom 2777405 -77891 2816805 -38491 2855296
5 6 France 2752034 -25501 2840071 62536 2777535
6 7 India 2727893 9161 2691040 -27692 2718732
7 8 Italy 2038376 -45488 2106525 22661 2083864
8 9 Brazil 1902286 16804 1832170 -53312 1885482
9 10 Canada 1665565 -47776 1694054 -19287 1713341
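A more programmatic variant (my sketch, not part of the original answer): collapse repeated level values while flattening, so ('Country', 'Country') becomes 'Country' directly and only the genuinely two-level names still need the rename dictionary:

# dict.fromkeys keeps order while dropping duplicate level values
df.columns = ['_'.join(dict.fromkeys(map(str, levels))) for levels in df.columns]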
As an alternative to rename, you can use get_level_values(). See below:
df.columns = df.columns.get_level_values(0)
>>> print(df)
Country GDP[10]
0 United States 20544343
1 China 13608151
2 Japan 4971323
3 Germany 3947620
4 United Kingdom 2855296
5 France 2777535
6 India 2718732
7 Italy 2083864
8 Brazil 1885482
9 Canada 1713341
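Note that level 0 keeps labels such as 'GDP[10]' (visible in the output above), so a final rename may still be needed:

df = df.rename(columns={'GDP[10]': 'GDP'})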

Frequency plot of a Pandas Dataframe

I have a dataframe df such that:
df['user_location'].value_counts()
India 3741
United States 2455
New Delhi, India 1721
Mumbai, India 1401
Washington, DC 1354
...
SpaceCoast,Florida 1
stuck in a book. 1
Beirut , Lebanon 1
Royston Vasey - Tralfamadore 1
Langham, Colchester 1
Name: user_location, Length: 26920, dtype: int64
I want to know the frequency of specific countries like USA, India from the user_location column. Then I want to plot the frequencies as USA, India, and Others.
So, I want to apply some operation on that column such that the value_counts() will give the output as:
India (sum of all frequencies of all the locations in India including cities, states, etc.)
USA (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others (sum of all frequencies of the other locations)
It seems I should merge the frequencies of rows containing the same country name and merge the rest together, but that looks complex when handling the names of cities, states, etc. What is the most efficient way to do it?
Adding to @Trenton_McKinney's answer in the comments, if you need to map different countries' states/provinces to the country name, you will have to do a little work to make those associations. For example, for India and the USA, you can grab a list of their states from Wikipedia and map them to your own data to relabel them with their respective country names as follows:
# Get states of India and USA
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states
# Make a sample dataframe
df = pd.DataFrame({'Country': states})
Country
0 Andhra Pradesh
1 Arunachal Pradesh
2 Assam
3 Bihar
4 Chhattisgarh
... ...
73 Virginia[E]
74 Washington
75 West Virginia
76 Wisconsin
77 Wyoming
Map state names to country names:
# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)
Country
0 India
1 India
2 India
3 India
4 India
... ...
73 USA
74 USA
75 USA
76 USA
77 USA
But from your data sample it looks like you will have a lot of edge cases to deal with as well.
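One such edge case (my note): any location absent from states_dict maps to NaN, so a fillna can keep those rows in the 'Others' bucket the question asks for:

df['Country'] = df['Country'].map(states_dict).fillna('Others')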
Using the concept of the previous answer, I first tried to get all the locations, including cities, unions, states, districts, and territories. Then I made a function checkl() that checks whether a location belongs to India or the USA and converts it to the country name. Finally, the function is applied to the dataframe column df['user_location']:
# Trying to get all the locations of USA and India
import pandas as pd

us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
# Read each Wikipedia page once and reuse the parsed tables
us_tables = pd.read_html(us_url)
us_states = us_tables[0].iloc[:, 0].tolist()
us_cities = us_tables[0].iloc[:, 1].tolist() + us_tables[0].iloc[:, 2].tolist() + us_tables[0].iloc[:, 3].tolist()
us_Federal_district = us_tables[1].iloc[:, 0].tolist()
us_Inhabited_territories = us_tables[2].iloc[:, 0].tolist()
us_Uninhabited_territories = us_tables[3].iloc[:, 0].tolist()
us_Disputed_territories = us_tables[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_tables = pd.read_html(in_url)
in_states = in_tables[3].iloc[:, 0].tolist() + in_tables[3].iloc[:, 4].tolist() + in_tables[3].iloc[:, 5].tolist()
in_unions = in_tables[4].iloc[:, 0].tolist()
ind = in_states + in_unions
usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind])
# Country name checker function
def checkl(T):
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if ele in T]
    res_us = [ele for ele in us if ele in T]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind) == True:
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us) == True:
        T = 'USA'
    elif len(T.split(',')) > 1:
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr:
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr:
            T = 'USA'
        else:
            T = "Others"
    else:
        T = "Others"
    return T
# Applying the function to the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others 74206
USA 47840
India 20291
Name: user_location, dtype: int64
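Since the goal is a frequency plot, here is a minimal sketch (assuming matplotlib is installed) to visualize these counts:

import matplotlib.pyplot as plt

counts = df['user_location'].dropna().apply(checkl).value_counts()
counts.plot(kind='bar', title='User locations by country')
plt.tight_layout()
plt.show()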
I am quite new to Python coding. I think this code can be written in a better and more compact form, and as mentioned in the previous answer, there are still a lot of edge cases to deal with, so I have added it on Code Review Stack Exchange too. Any criticisms and suggestions to improve the efficiency and readability of my code would be greatly appreciated.

Which is the Pythonic way for Pandas to obtain the same result as an SQL statement: "UPDATE- LEFT JOIN - SET - WHERE"?

I would like to replicate in Pandas the following SQL structure: "UPDATE dataframe1 LEFT JOIN dataframe2 SET dataframe1.column1 = dataframe2.column2 WHERE dataframe1.column3 > X"
I know it is possible to merge the dataframes and then work on the merged columns with .where.
However, it doesn't seem to be a straightforward solution.
df = pd.merge(df1, df2, suffixes=('_a', '_b'))
df['clmn1'] = df['clmn1_b'].where(df['clmn1_a'] > 0, df['clmn1_b'])
Is there a better way to reach the goal?
Thanks
To use your example from the comments:
In [21]: df
Out[21]:
Name Gender country
0 Jack M USA
1 Nick M UK
2 Alphio F RU
3 Jenny F USA
In [22]: country_map = {'USA': 'United States', 'UK': 'United Kingdom', 'RU': 'Russia'}
In [23]: df.country.map(country_map)
Out[23]:
0 United States
1 United Kingdom
2 Russia
3 United States
Name: country, dtype: object
To update just the M rows you could use loc and update:
In [24]: df.country.update(df[df.Gender == 'M'].country.map(country_map))
In [25]: df
Out[25]:
Name Gender country
0 Jack M United States
1 Nick M United Kingdom
2 Alphio F RU
3 Jenny F USA
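An equivalent that reads closer to the SQL "UPDATE ... WHERE" (a sketch using a boolean mask with loc):

mask = df['Gender'] == 'M'
df.loc[mask, 'country'] = df.loc[mask, 'country'].map(country_map)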

I have a CSV and want to update it with values from another CSV. What is the most efficient way to do this?

I have this CSV:
Name Species Country
0 Hobbes Tiger U.S.
1 SherKhan Tiger India
2 Rescuer Mouse Australia
3 Mickey Mouse U.S.
And I have a second CSV:
Continent Countries Unnamed: 2 Unnamed: 3 Unnamed: 4
0 North America U.S. Mexico Guatemala Honduras
1 Asia India China Nepal NaN
2 Australia Australia NaN NaN NaN
3 Africa South Africa Botswana Zimbabwe NaN
I want to use the second CSV to update the first file so that the output is:
Name Species Country
0 Hobbes Tiger North America
1 SherKhan Tiger Asia
2 Rescuer Mouse Australia
3 Mickey Mouse North America
So far this is the closest I have gotten:
import pandas as pd
# Import my data.
data = pd.read_csv('Continents.csv')
Animals = pd.read_csv('Animals.csv')
Animalsdf = pd.DataFrame(Animals)
# Transpose my data from horizontal to vertical.
data1 = data.T
# Clean my data and update my header with the first column.
data1.columns = data1.iloc[0]
# Drop now duplicated data.
data1.drop(data1.index[[0]], inplace = True)
# Build the dictionary.
data_dict = {col: list(data1[col]) for col in data1.columns}
# Update my csv.
Animals['Country'] = Animals['Country'].map(data_dict)
print(Animals)
This results in a dictionary that has lists as its values, and therefore I just get NaN out:
Name Species Country
0 Hobbes Tiger NaN
1 SherKhan Tiger NaN
2 Rescuer Mole [Australia, nan, nan, nan]
3 Mickey Mole NaN
I've tried flipping from list to tuples and this doesn't work. Have tried multiple ways to pull in the dictionary etc. I am just out of ideas.
Sorry if the code is super junky. I'm learning this as I go. Figured a project was the best way to learn a new language. Didn't think it would be this difficult.
Any suggestions would be appreciated. I need to be able to use the code so that when I get multiple reference CSVs, I can update my data with new keys. Hope this is clear.
Thanks in advance.
One intuitive solution is to use a dictionary mapping. Data from @WillMonge.
pd.DataFrame.itertuples works by producing namedtuples, but they may also be referenced using numeric indexers.
# create mapping dictionary
d = {}
for row in df.itertuples():
    d.update(dict.fromkeys(filter(None, row[2:]), row[1]))
# apply mapping dictionary
data['Continent'] = data['Country'].map(d)
print(data)
Country name Continent
0 China 2 Asia
1 China 5 Asia
2 Canada 9 America
3 Egypt 0 Africa
4 Mexico 3 America
You could also use csv.DictReader and csv.DictWriter. You can learn how to use them from the link below.
https://docs.python.org/2/library/csv.html
Here is an update of your code; I have tried to add comments to explain the steps.
import pandas as pd
# Read data in (read_csv also returns a DataFrame directly)
data = pd.DataFrame({'name': [2, 5, 9, 0, 3], 'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
                   'Country1': ['China', 'Mexico', 'Egypt'],
                   'Country2': ['Japan', 'Canada', None],
                   'Country3': ['Thailand', None, None]})
# Unstack to get a row for each country (remove the continent rows)
premap_df = pd.DataFrame(df.unstack().drop('Continent')).dropna().reset_index()
premap_df.columns = ['_', 'continent_key', 'Country']
# Merge the continent back based on the continent_key (old row number)
map_df = pd.merge(premap_df, df[['Continent']], left_on='continent_key', right_index=True)[['Continent', 'Country']]
# Merge with the data now
pd.merge(data, map_df, on='Country')
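A possibly simpler reshape for the same idea (my sketch): melt the country columns into long form, then merge:

map_df = (df.melt(id_vars='Continent', value_name='Country')
            .dropna(subset=['Country'])[['Continent', 'Country']])
pd.merge(data, map_df, on='Country', how='left')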
For further reference, Wes McKinney's Python for Data Analysis is one of the best books out there for learning pandas.
You can always create buckets and run conditions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Hobbes','SherKhan','Rescuer','Mickey'], 'Species':['Tiger','Tiger','Mouse','Mouse'],'Country':['U.S.','India','Australia','U.S.']})
North_America = ['U.S.', 'Mexico', 'Guatemala', 'Honduras']
Asia = ['India', 'China', 'Nepal']
Australia = ['Australia']
Africa = ['South Africa', 'Botswana', 'Zimbabwe']
conditions = [
    df['Country'].isin(North_America),
    df['Country'].isin(Asia),
    df['Country'].isin(Australia),
    df['Country'].isin(Africa)
]
choices = [
    'North America',
    'Asia',
    'Australia',
    'Africa'
]
df['Continent'] = np.select(conditions, choices, default = np.nan)
df
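One caveat (my note): mixing string choices with a float np.nan default can coerce missing values to the literal string 'nan' in the resulting object column; a string default sidesteps the ambiguity:

df['Continent'] = np.select(conditions, choices, default='Other')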
