I have a dataframe df such that:
df['user_location'].value_counts()
India 3741
United States 2455
New Delhi, India 1721
Mumbai, India 1401
Washington, DC 1354
...
SpaceCoast,Florida 1
stuck in a book. 1
Beirut , Lebanon 1
Royston Vasey - Tralfamadore 1
Langham, Colchester 1
Name: user_location, Length: 26920, dtype: int64
I want to know the frequency of specific countries like USA, India from the user_location column. Then I want to plot the frequencies as USA, India, and Others.
So, I want to apply some operation on that column such that the value_counts() will give the output as:
India (sum of all frequencies of all the locations in India including cities, states, etc.)
USA (sum of all frequencies of all the locations in the USA including cities, states, etc.)
Others (sum of all frequencies of the other locations)
It seems I should merge the frequencies of the rows that refer to the same country and lump the rest together, but handling the names of cities, states, etc. makes this look complex. What is the most efficient way to do it?
Adding to @Trenton_McKinney's answer in the comments: if you need to map the states/provinces of different countries to the country name, you will have to do a little work to make those associations. For example, for India and the USA, you can grab a list of their states from Wikipedia and map them onto your own data to relabel them with their respective country names, as follows:
import pandas as pd
# Get states of India and USA
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
states = in_states + us_states
# Make a sample dataframe
df = pd.DataFrame({'Country': states})
Country
0 Andhra Pradesh
1 Arunachal Pradesh
2 Assam
3 Bihar
4 Chhattisgarh
... ...
73 Virginia[E]
74 Washington
75 West Virginia
76 Wisconsin
77 Wyoming
Map state names to country names:
# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)
Country
0 India
1 India
2 India
3 India
4 India
... ...
73 USA
74 USA
75 USA
76 USA
77 USA
But from your data sample it looks like you will have a lot of edge cases to deal with as well.
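One way to approach those edge cases is to split each location on commas and compare every cleaned part against a set of known names. The following is only a minimal sketch: the alias sets INDIA and USA and the function name classify are hypothetical and would need to be extended with the scraped state and city names.

```python
from collections import Counter

# Hypothetical alias sets; extend them with state and city names as needed.
INDIA = {"india", "bharat", "hindustan", "new delhi", "mumbai"}
USA = {"usa", "us", "united states", "washington", "dc", "florida"}

def classify(location):
    """Return 'India', 'USA' or 'Others' for a free-text location."""
    # Split on commas, then compare each cleaned part against the alias sets.
    parts = {part.strip().lower() for part in location.split(",")}
    if parts & INDIA:
        return "India"
    if parts & USA:
        return "USA"
    return "Others"

locations = ["India", "New Delhi, India", "Washington, DC",
             "SpaceCoast,Florida", "stuck in a book.", "Beirut , Lebanon"]
counts = Counter(classify(loc) for loc in locations)
print(counts)
```

Comparing whole parts rather than substrings avoids false hits such as "us" inside "Australia". On the real data the same function could be applied with df['user_location'].dropna().apply(classify).value_counts().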
Using the concept from the previous answer, I first tried to collect all the locations, including cities, union territories, states, districts, and territories. Then I wrote a function checkl() that checks whether a location is in India or the USA and converts it to the country name. Finally, I applied the function to the dataframe column df['user_location']:
# Trying to get all the locations of USA and India
import pandas as pd
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:, 0].tolist()
us_cities = pd.read_html(us_url)[0].iloc[:, 1].tolist() + pd.read_html(us_url)[0].iloc[:, 2].tolist() + pd.read_html(us_url)[0].iloc[:, 3].tolist()
us_Federal_district = pd.read_html(us_url)[1].iloc[:, 0].tolist()
us_Inhabited_territories = pd.read_html(us_url)[2].iloc[:, 0].tolist()
us_Uninhabited_territories = pd.read_html(us_url)[3].iloc[:, 0].tolist()
us_Disputed_territories = pd.read_html(us_url)[4].iloc[:, 0].tolist()
us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:, 0].tolist() + pd.read_html(in_url)[3].iloc[:, 4].tolist() + pd.read_html(in_url)[3].iloc[:, 5].tolist()
in_unions = pd.read_html(in_url)[4].iloc[:, 0].tolist()
ind = in_states + in_unions
usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind])
# Country name checker function
def checkl(T):
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space, TSplit_comma))
    res_ind = [ele for ele in ind if(ele in T)]
    res_us = [ele for ele in us if(ele in T)]
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind) == True :
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us) == True:
        T = 'USA'
    elif len(T.split(','))>1 :
        if T.split(',')[0] in indToStr or T.split(',')[1] in indToStr :
            T = 'India'
        elif T.split(',')[0] in usToStr or T.split(',')[1] in usToStr :
            T = 'USA'
        else:
            T = "Others"
    else:
        T = "Others"
    return T
# Applying the function on the dataframe column
print(df['user_location'].dropna().apply(checkl).value_counts())
Others 74206
USA 47840
India 20291
Name: user_location, dtype: int64
I am quite new to Python. I think this code can be written in a better and more compact form, and, as mentioned in the previous answer, there are still a lot of edge cases to deal with. So I have also posted it on Code Review Stack Exchange. Any criticism and suggestions to improve the efficiency and readability of my code would be greatly appreciated.
Related
I have a data frame tweets_df that looks like this:
sentiment id date text
0 0 1502071360117424136 2022-03-10 23:58:14+00:00 AngelaRaeBoon1 Same Alabama Republicans charge...
1 0 1502070916318121994 2022-03-10 23:56:28+00:00 This ’ w/the sentencing JussieSmollett But mad...
2 0 1502057466267377665 2022-03-10 23:03:01+00:00 DannyClayton Not hard find takes smallest amou...
3 0 1502053718711316512 2022-03-10 22:48:08+00:00 I make fake scenarios getting fights protectin...
4 0 1502045714486022146 2022-03-10 22:16:19+00:00 WipeHomophobia Well people lands wildest thing...
.. ... ... ... ...
94 0 1501702542899691525 2022-03-09 23:32:41+00:00 There 's reason deep look things kill bad peop...
95 0 1501700281729433606 2022-03-09 23:23:42+00:00 Shame UN United Dictators Shame NATO Repeat We...
96 0 1501699859803516934 2022-03-09 23:22:01+00:00 GayleKing The difference Ukrainian refugees IL...
97 0 1501697172441550848 2022-03-09 23:11:20+00:00 hrkbenowen And includes new United States I un...
98 0 1501696149853511687 2022-03-09 23:07:16+00:00 JLaw_OTD A world women minorities POC LGBTQ÷ d...
And the second dataFrame globe_df that looks like this:
Country Region
0 Andorra Europe
1 United Arab Emirates Middle east
2 Afghanistan Asia & Pacific
3 Antigua and Barbuda South/Latin America
4 Anguilla South/Latin America
.. ... ...
243 Guernsey Europe
244 Isle of Man Europe
245 Jersey Europe
246 Saint Barthelemy South/Latin America
247 Saint Martin South/Latin America
I want to delete all rows of the dataframe tweets_df which have 'text' that does not contain a 'Country' or 'Region'.
This was my attempt:
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
for entry in globe_df['Country']:
    tweet_index = tweets_df[entry in tweets_df['text']].index # if tweets that *contain*, not equal...... entry in tweets_df['text] .... (in)or (not in)?
    tweets_df.drop(tweet_index , inplace=True)
print(tweets_df)
Edit: Also, fuzzy, case-insensitive matching with stemming would be preferred when searching the 'text' for countries and regions.
Ex) If the text contained 'Ukrainian', 'british', 'engliSH', etc... then it would not be deleted
Convert the country and region values to a list and use str.contains to keep only the rows whose text contains one of them.
# Case-insensitive match
vals = globe_df.stack().to_list()
tweets_df = tweets_df[tweets_df['text'].str.contains('|'.join(vals), regex=True, case=False)]
Or lowercase both sides and record which value matched:
vals = "({})".format('|'.join(globe_df.stack().str.lower().to_list()))  # lowercase every value
tweets_df['matched'] = tweets_df['text'].str.lower().str.extract(vals, expand=False)
tweets_df = tweets_df.dropna(subset=['matched'])  # drop only the rows without a match
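One caveat with joining raw names using '|' is that a name containing regex metacharacters (a dot or parentheses, for instance) changes the meaning of the pattern. The sketch below uses only the standard re module and hypothetical sample values in place of globe_df and tweets_df['text'] to show the escaping step; the resulting pattern string could presumably be passed to str.contains in the same way.

```python
import re

# Hypothetical sample values standing in for globe_df and tweets_df['text'].
names = ["Ukraine", "United States", "Europe", "U.S. Virgin Islands"]
texts = [
    "Shame UN United Dictators Shame NATO",
    "And includes new United States I understand",
    "refugees arriving from ukraine today",
    "nothing geographic here",
]

# Escape each name so characters such as '.' are matched literally,
# then join the escaped names into one case-insensitive alternation.
pattern = re.compile("|".join(re.escape(n) for n in names),
                     flags=re.IGNORECASE)

# Keep only the texts that mention at least one of the names.
kept = [t for t in texts if pattern.search(t)]
print(kept)
```

This is plain substring matching, so it does not cover the stemming mentioned in the question's edit; a text containing "Ukrainian" would still need the stem "Ukrain" (or a stemming library) to match.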
# Import data
globe_df = pd.read_csv('countriesAndRegions.csv')
tweets_df = pd.read_csv('tweetSheet.csv')
# Get country and region column as list
globe_df_country = globe_df['Country'].values.tolist()
globe_df_region = globe_df['Region'].values.tolist()
# Merge the lists, since a match on either a country or a region is enough
merged_list = globe_df_country + globe_df_region
# Drop from a copy so that the frame being iterated over is not modified
df_tweets2 = tweets_df.copy()
for index, row in tweets_df.iterrows():
    # Drop the row if none of the words in its text appears in merged_list
    if [i for i in merged_list if i in row['text'].split()] == []:
        df_tweets2 = df_tweets2.drop(index)
tweets_df_new = df_tweets2.copy()
print(tweets_df_new)
You can try using pandas.Series.str.contains to find the values, where entry is one row of globe_df:
tweets_df[tweets_df['text'].str.contains('{}|{}'.format(entry['Country'], entry['Region']))]
After storing the boolean result in a new column, you can remove the rows where the value is False.
I have an existing pandas dataframe, consisting of a country column and market column. I want to check if the countries are assigned to the correct markets. As such I created a dictionary where each country (key) is mapped to the correct markets (values) it can fall within. The structure of the dataframe is below:
The structure of the dictionary is {'key':['Market 1', 'Market 2', 'Market 3']}. This is because each country has a couple of markets they could belong to.
I would like to write a function, which checks the values in the Country column and see if according to the dictionary, the current mapping is correct. So ideally, the desired output would be as follows:
Is there a way to reference a dictionary across two columns in a function? To confirm, the keys are the country names, and the markets are the values.
I have included code required to make the dataframe:
data = {'Country': ['Mexico','Uruguay','Uruguay','Greece','Brazil','Brazil','Brazil','Brazil','Colombia','Colombia','Colombia','Japan','Japan','Brazil','Brazil','Spain','New Zealand'],
'Market': ['LATAM','LATAM','LATAM','EMEA','ASIA','ASIA','LATAM BRAZIL','LATAM BRAZIL','LATAM CASA','LATAM CASA','LATAM','LATAM','LATAM','LATAM BRAZIL','LATAM BRAZIL','SOUTHEAST ASIA','SOUTHEAST ASIA']
}
df = pd.DataFrame(data)
Thanks a lot.
The first idea is to create tuples and match them with Index.isin:
d = {'Colombia':['LATAM','LATAM CASA'], 'Brazil':['ASIA']}
tups = [(k, x) for k, v in d.items() for x in v]
df['Market Match'] = np.where(df.set_index(['Country','Market']).index.isin(tups),
'yes', 'no')
print (df)
Country Market Market Match
0 Mexico LATAM no
1 Uruguay LATAM no
2 Uruguay LATAM no
3 Greece EMEA no
4 Brazil ASIA yes
5 Brazil ASIA yes
6 Brazil LATAM BRAZIL no
7 Brazil LATAM BRAZIL no
8 Colombia LATAM CASA yes
9 Colombia LATAM CASA yes
10 Colombia LATAM yes
11 Japan LATAM no
12 Japan LATAM no
13 Brazil LATAM BRAZIL no
14 Brazil LATAM BRAZIL no
15 Spain SOUTHEAST ASIA no
16 New Zealand SOUTHEAST ASIA no
Or by left join in DataFrame.merge with indicator=True:
d = {'Colombia':['LATAM','LATAM CASA'], 'Brazil':['ASIA']}
df1 = pd.DataFrame([(k, x) for k, v in d.items() for x in v],
columns=['Country','Market']).drop_duplicates()
df['Market Match'] = np.where(df.merge(df1,indicator=True,how='left')['_merge'].eq('both'),
'yes', 'no')
The following link might help you check whether specific strings (e.g. "Markets") are included in your dataframe:
Check if string contains substring
For example:
fullstring = "StackAbuse"
substring = "tack"
if substring in fullstring:
    print("Found!")
else:
    print("Not found!")
df['MATCH'] = df.apply(lambda row: row['Market'] in your_dictionary[row['Country']], axis=1)
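A direct dictionary lookup like the one above raises KeyError as soon as a country is missing from the dictionary. A minimal sketch of a more forgiving check follows; the mapping allowed and the helper market_ok are hypothetical names, and the sample markets are made up for illustration.

```python
# Hypothetical mapping of each country to the markets it may belong to.
allowed = {
    "Colombia": ["LATAM", "LATAM CASA"],
    "Brazil": ["LATAM", "LATAM BRAZIL"],
}

rows = [("Colombia", "LATAM CASA"), ("Brazil", "ASIA"), ("Japan", "LATAM")]

def market_ok(country, market):
    # .get with an empty default treats unknown countries as a mismatch
    # instead of raising KeyError.
    return market in allowed.get(country, [])

flags = [market_ok(c, m) for c, m in rows]
print(flags)
```

The same helper could be used row-wise, e.g. df.apply(lambda row: market_ok(row['Country'], row['Market']), axis=1).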
I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df
def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country' : ['USA', 'GE', 'Russia', 'BR', 'France'],
'ID' : ['11', '22', '33','44', '55'],
'City' : ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
'short_name' : ['NY', 'Ber', 'Mosc','Lon', 'Pa']
})
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
This is my dataframe:
Now, apply function for the first time. Move row with index 0 to bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply function again. This time move row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when I apply the function a second time. The problem is analogous for the function shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a list, and idx.pop(index_to_shift) removes the item at position index_to_shift of idx, which is not necessarily the item whose value is index_to_shift, as the second call shows.
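The difference can be reproduced with a plain list holding the index labels from the question after the first shift. This is just an illustration of list.pop (by position) versus list.remove (by value); the variable names are made up.

```python
# Index labels after the first shift, as in the question.
idx = [1, 2, 3, 4, 0]

by_position = idx.copy()
by_position.pop(2)        # removes the item at position 2, i.e. label 3
print(by_position + [2])  # [1, 2, 4, 0, 2]

by_value = idx.copy()
by_value.remove(2)        # removes the first item equal to 2
print(by_value + [2])     # [1, 3, 4, 0, 2]
```

The first result is exactly the index of the unexpected output: label 3 is gone and label 2 appears twice.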
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    idx = [i for i in df.index if i!=index_to_shift]
    return df.loc[idx+[index_to_shift]]
# call the function twice
for i in range(2):
    df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
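The counterpart for moving a row to the top can presumably be written the same way. The sketch below assumes unique index labels and uses a reduced version of the example dataframe.

```python
import pandas as pd

def shift_row_to_top(df, index_to_shift):
    """Move the row labelled index_to_shift to the top, keeping labels."""
    # Select by label, not by position, so repeated calls stay correct.
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]

df = pd.DataFrame({"Country": ["USA", "GE", "Russia", "BR", "France"]})
df = shift_row_to_top(df, 3)
df = shift_row_to_top(df, 1)
print(df.index.tolist())  # [1, 3, 0, 2, 4]
```

Because the rows are selected with df.loc, the original index labels are kept and no reset_index is needed.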
I am trying to adapt the following code from print statement to dataframe output.
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
def on_occurence(pos,location):
    print (i,':',location)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
the print output for the above code is
England UK : UK
Paris FRANCE : FRANCE
ITALY,gh ROME : ITALY
I would like the df to look like:
message        country
England UK     UK
Paris FRANCE   FRANCE
ITALY,gh ROME  ITALY
I have tried the following with no luck
places = ['England UK','Paris FRANCE','ITALY,gh ROME','New']
location=['UK','FRANCE','ITALY']
df = pd.DataFrame(columns=["message","location"])
def on_occurence(pos,location):
    print (i,':',location)
    df = df.append({"message":i,"location":location},ignore_index=True)
root = aho_create_statemachine(location)
for i in places:
    aho_find_all(i, root, on_occurence)
However, the df looks like the following:
message  country
NEW      UK FRANCE ITALY
df = pd.DataFrame(list(zip(places, location)), columns = ["Message", "Country"])
print(df)
My output:
Message Country
0 England UK UK
1 Paris FRANCE FRANCE
2 ITALY,gh ROME ITALY
If you want to print it without Row Index:
print(df.to_string(index=False))
Output in this case is:
Message Country
England UK UK
Paris FRANCE FRANCE
ITALY,gh ROME ITALY
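Note that zipping the two lists pairs them purely by position, so it only works here because the places happen to be in the same order as the locations. A sketch that keeps the callback style of the question and collects one row per actual match is shown below. The function find_all is a hypothetical stand-in for aho_find_all (a plain substring scan, not Aho-Corasick), assumed to call its callback with a position and the matched keyword.

```python
places = ["England UK", "Paris FRANCE", "ITALY,gh ROME", "New"]
location = ["UK", "FRANCE", "ITALY"]

def find_all(text, keywords, callback):
    # Hypothetical stand-in for aho_find_all: a plain substring scan that
    # calls callback(position, keyword) for every keyword found in text.
    for keyword in keywords:
        pos = text.find(keyword)
        if pos != -1:
            callback(pos, keyword)

rows = []
for place in places:
    # Bind the current place explicitly instead of relying on a global.
    find_all(place, location,
             lambda pos, kw, place=place: rows.append(
                 {"message": place, "country": kw}))

print(rows)
```

Passing rows to pd.DataFrame(rows) should then give the two-column frame from the question; appending to a list and building the frame once also avoids reassigning df inside the callback.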
I would recommend using a dictionary instead of two separate lists, e.g.:
placeAndLocation = {
    "england UK" : "UK",
    "Paris France" : "france"
}
and so on.
Then, to loop through it, use:
for place, location in placeAndLocation.items():
    print("place: " + place)
    print("location: " + location)
I find this easier because you can see at a glance which value lines up with which key, and the data is held in one variable, which makes it easier to read later on.
I have a function called handle_text that renames values in dataframe columns:
def handle_text(txt):
    if txt.lower()[:6] == 'deu_ga':
        return 'Western Europe', 'Germany'
    elif txt.lower()[:6] == 'fra_ga':
        return 'Western Europe', 'France'
    return 'Other', 'Other'
I apply handle_text on various dataframes in the following way:
campaigns_df['Region'], campaigns_df['Market'] = zip(*campaigns_df['Campaign Name'].apply(handle_text))
atlas_df['Region'], atlas_df['Market'] = zip(*atlas_df['Campaign Name'].apply(handle_text))
flashtalking_df['Region'], flashtalking_df['Market'] = zip(*flashtalking_df['Campaign Name'].apply(handle_text))
I was wondering if there was a way to do a for loop to apply the function to various dfs at once:
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df]
columns_df = ['Campaign Name']
for df in dataframes:
    for column in df.columns:
        if column in columns_df:
            zip(df.column.apply(handle_text))
However the error I get is:
AttributeError: 'DataFrame' object has no attribute 'column'
I managed to solve it like this:
dataframes = [atlas_df, flashtalking_df, innovid_df, ias_viewability_df, ias_fraud_df, mediaplan_df]
columns_df = 'Campaign Name'
for df in dataframes:
    df['Region'], df['Market'] = zip(*df[columns_df].apply(handle_text))
You need to change attribute access with . to the more general access with []:
zip(df.column.apply(handle_text))
to
zip(df[column].apply(handle_text))
EDIT:
Better solution:
atlas_df = pd.DataFrame({'Campaign Name':['deu_gathf', 'deu_gahf', 'fra_gagg'],'another_col':[1,2,3]})
flashtalking_df = pd.DataFrame({'Campaign Name':['deu_gahf','fra_ga', 'deu_gatt'],'another_col':[4,5,6]})
dataframes = [atlas_df, flashtalking_df]
columns_df = 'Campaign Name'
You can map by dict and then create new columns:
d = {'deu_ga': ['Western Europe','Germany'], 'fra_ga':['Western Europe','France']}
for df in dataframes:
    df[['Region','Market']] = pd.DataFrame(df[columns_df].str.lower()
                                                          .str[:6]
                                                          .map(d)
                                                          .values.tolist())
    #print (df)
print (atlas_df)
Campaign Name another_col Region Market
0 deu_gathf 1 Western Europe Germany
1 deu_gahf 2 Western Europe Germany
2 fra_gagg 3 Western Europe France
print (flashtalking_df)
Campaign Name another_col Region Market
0 deu_gahf 4 Western Europe Germany
1 fra_ga 5 Western Europe France
2 deu_gatt 6 Western Europe Germany
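One difference from the original handle_text is worth noting: mapping with a dict yields NaN for any prefix that is not a key, whereas the original function returned 'Other', 'Other'. A minimal sketch that keeps the dictionary lookup but restores that fallback is shown below; the constant name PREFIXES and the sample campaign names are made up for illustration.

```python
# Prefix table taken from handle_text in the question.
PREFIXES = {
    "deu_ga": ("Western Europe", "Germany"),
    "fra_ga": ("Western Europe", "France"),
}

def handle_text(txt):
    # Unknown prefixes fall back to ('Other', 'Other'), as in the original.
    return PREFIXES.get(txt.lower()[:6], ("Other", "Other"))

names = ["DEU_GAthf", "fra_gagg", "esp_ga01"]
regions, markets = zip(*(handle_text(n) for n in names))
print(regions)
print(markets)
```

This version can be dropped into the loop from the accepted approach unchanged, i.e. df['Region'], df['Market'] = zip(*df[columns_df].apply(handle_text)).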