pandas create a column and assign values to it from a dictionary - python

I have a dictionary that looks like this:
{"regions":[
{"name": "South America", "code": "SA01,SA02,SA03"},
{"name": "Asia Pacific", "code": "AP01,AP02,AP03"}
]}
I have a df that looks like this:
id code
1 SA01
2 SA02
3 SA03
4 AP01
5 AP02
6 AP03
I'd like to create a column region in df whose values are based on the code values in regions, so the result will look like this:
id code region
1 SA01 South America
2 SA02 South America
3 SA03 South America
4 AP01 Asia Pacific
5 AP02 Asia Pacific
6 AP03 Asia Pacific
I am wondering what's the best way to do this.

You could redefine your dictionary (d here) to have an individual code:region entry for each code that appears in the strings and use it to map the values in the code column:
d_ = {code:sd['name'] for sd in d['regions'] for code in sd['code'].split(',')}
# {'SA01': 'South America', 'SA02': 'South America', 'SA03': 'South America',...
df['region'] = df.code.map(d_)
print(df)
id code region
0 1 SA01 South America
1 2 SA02 South America
2 3 SA03 South America
3 4 AP01 Asia Pacific
4 5 AP02 Asia Pacific
5 6 AP03 Asia Pacific
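For completeness, here is the whole flatten-and-map flow as a self-contained script (the df literal is my reconstruction of the question's frame):

```python
import pandas as pd

d = {"regions": [
    {"name": "South America", "code": "SA01,SA02,SA03"},
    {"name": "Asia Pacific", "code": "AP01,AP02,AP03"},
]}
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "code": ["SA01", "SA02", "SA03", "AP01", "AP02", "AP03"]})

# Flatten the {name, "A,B,C"} records into one code -> name entry per code
d_ = {code: sd["name"] for sd in d["regions"] for code in sd["code"].split(",")}
df["region"] = df["code"].map(d_)
```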

This example takes your current dataset without any modifications. I'm sure that this can be refined by someone with more pandas experience.
import pandas as pd
mydict = {"regions": [
    {"name": "South America", "code": "SA01,SA02,SA03"},
    {"name": "Asia Pacific", "code": "AP01,AP02,AP03"}
]}
col_names_regions = ['code', 'region name']
# DataFrame.append was removed in pandas 2.0, so collect the rows
# in a list and build the frame once at the end
rows = []
for value in mydict['regions']:
    codes = value.get('code')
    name = value.get('name')
    for code in codes.split(','):
        rows.append({'code': code, 'region name': name})
df_regions = pd.DataFrame(rows, columns=col_names_regions)
print(df_regions)
# output
code region name
0 SA01 South America
1 SA02 South America
2 SA03 South America
3 AP01 Asia Pacific
4 AP02 Asia Pacific
5 AP03 Asia Pacific
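A follow-up sketch (my addition, not part of the answer above): once df_regions exists, the region can be attached to the original df with a merge, renaming 'region name' to 'region' to match the desired output. The three-row df here is my own abbreviation of the question's frame:

```python
import pandas as pd

mydict = {"regions": [
    {"name": "South America", "code": "SA01,SA02,SA03"},
    {"name": "Asia Pacific", "code": "AP01,AP02,AP03"},
]}
# Build the lookup frame: one row per individual code
rows = [{'code': code, 'region name': entry['name']}
        for entry in mydict['regions']
        for code in entry['code'].split(',')]
df_regions = pd.DataFrame(rows)

df = pd.DataFrame({'id': [1, 2, 3], 'code': ['SA01', 'AP02', 'SA03']})
# Left-merge so codes without a region simply get NaN
df = df.merge(df_regions.rename(columns={'region name': 'region'}),
              on='code', how='left')
```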

Related

How to pull value from a column when several columns match in two data frames?

I am trying to write a script which will search a database similar to that in Table 1 based on a product/region/year specification outlined in table 2. The plan is to search for a match in Table 1 to a specification outlined in Table 2 and then pull the observation value, as seen in Table 2 - with results.
I need this code to run several loops, where the year criteria is relaxed. For example, loop 1 would search for a match in Product_L1, Geography_L1 and Year and loop 2 would search for a match in Product_L1, Geography_L1 and Year-1 and so on.
Table 1
Product level 1   Product level 2   Region level 1   Region level 2   Year   Obs. value
Portland cement   Cement            Peru             South America    2021   1
Portland cement   Cement            Switzerland      Europe           2021   2
Portland cement   Cement            USA              North America    2021   3
Portland cement   Cement            Brazil           South America    2021   4
Portland cement   Cement            South Africa     Africa           2021   5
Portland cement   Cement            India            Asia             2021   6
Portland cement   Cement            Brazil           South America    2020   7
Table 2
Product level 1   Product level 2   Region level 1   Region level 2   Year
Portland cement   Cement            Brazil           South America    2021
Portland cement   Cement            Switzerland      Europe           2021
Table 2 - with results
Product level 1   Product level 2   Region level 1   Region level 2   Year   Loop 1   Loop 2
Portland cement   Cement            Brazil           South America    2021   4        7
I have tried using the following code, but it comes up with the error 'Can only compare identically-labeled Series objects'. Does anyone have any suggestions on how to prevent this error?
Table_2['Loop_1'] = np.where((Table_1.Product_L1 == Table_2.Product_L1)
                             & (Table_1.Geography_L1 == Table_2.Geography_L1)
                             & (Table_1.Year == Table_2.Year),
                             Table_1['obs_value'], '')
You can perform a merge operation and provide a list of columns that you want from Table_1.
import pandas as pd
Table_1 = pd.DataFrame({
    "Product_L1": ["Portland cement", "Portland cement", "Portland cement", "Portland cement", "Portland cement", "Portland cement", "Portland cement"],
    "Product_L2": ["Cement", "Cement", "Cement", "Cement", "Cement", "Cement", "Cement"],
    "Geography_L1": ["Peru", "Switzerland", "USA", "Brazil", "South Africa", "India", "Brazil"],
    "Geography_L2": ["South America", "Europe", "North America", "South America", "Africa", "Asia", "South America"],
    "Year": [2021, 2021, 2021, 2021, 2021, 2021, 2020],
    "obs_value": [1, 2, 3, 4, 5, 6, 7]
})
Table_2 = pd.DataFrame({
    "Product_L1": ["Portland cement", "Portland cement"],
    "Product_L2": ["Cement", "Cement"],
    "Geography_L1": ["Brazil", "Switzerland"],
    "Geography_L2": ["South America", "Europe"],
    "Year": [2021, 2021]
})
columns_list = ['Product_L1','Product_L2','Geography_L1','Geography_L2','Year','obs_value']
result = pd.merge(Table_2, Table_1[columns_list], how='left')
result is a new dataframe:
Product_L1 Product_L2 Geography_L1 Geography_L2 Year obs_value
0 Portland cement Cement Brazil South America 2021 4
1 Portland cement Cement Switzerland Europe 2021 2
EDIT: Based upon the update to the question, I think what you are trying to do is achievable using set_index and unstack. This will create a new dataframe with the observed values listed in columns 'Year_2020', 'Year_2021' etc.
index_columns = ['Product_L1','Product_L2','Geography_L1','Geography_L2', 'Year']
edit_df = Table_1.set_index(index_columns)['obs_value'].unstack().add_prefix('Year_').reset_index()
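Building on that EDIT, one hedged sketch of the year-relaxation loops (the getattr lookups and Loop column names are my choices, not from the answer): merge Table_2 onto the widened frame, then read Year_{Year} for Loop 1 and Year_{Year - 1} for Loop 2:

```python
import pandas as pd

# Rebuild the example frames from the answer above
Table_1 = pd.DataFrame({
    "Product_L1": ["Portland cement"] * 7,
    "Product_L2": ["Cement"] * 7,
    "Geography_L1": ["Peru", "Switzerland", "USA", "Brazil", "South Africa", "India", "Brazil"],
    "Geography_L2": ["South America", "Europe", "North America", "South America", "Africa", "Asia", "South America"],
    "Year": [2021, 2021, 2021, 2021, 2021, 2021, 2020],
    "obs_value": [1, 2, 3, 4, 5, 6, 7],
})
Table_2 = pd.DataFrame({
    "Product_L1": ["Portland cement"] * 2,
    "Product_L2": ["Cement"] * 2,
    "Geography_L1": ["Brazil", "Switzerland"],
    "Geography_L2": ["South America", "Europe"],
    "Year": [2021, 2021],
})

keys = ["Product_L1", "Product_L2", "Geography_L1", "Geography_L2"]
# Pivot the years into Year_2020 / Year_2021 columns
wide = (Table_1.set_index(keys + ["Year"])["obs_value"]
        .unstack().add_prefix("Year_").reset_index())
out = Table_2.merge(wide, on=keys, how="left")

# Loop 1 reads the requested year, Loop 2 the year before (NaN if absent)
out["Loop_1"] = [getattr(r, f"Year_{r.Year}") for r in out.itertuples()]
out["Loop_2"] = [getattr(r, f"Year_{r.Year - 1}", float("nan")) for r in out.itertuples()]
```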

How to loop to consecutively go through a list of strings, assign value to each string and return it to a new list

Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City        Continent
New York    America
Vancouver   America
London      Europe
Berlin      Europe
Tokyo       Asia
Bangkok     Asia
Note: this is a minimal reproducible example to keep it simple, but the real dataset is more like city -> country -> continent
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with conditions such as "if Europe in cities", but that doesn't do anything, and I think that's because it evaluates to False, since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loops would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your code:
First of all, let's take a look at the snippet you used: if Europe in cities: returned nothing. Correct!
That's because you are comparing the whole Europe list against cities, instead of its individual elements ['London', 'Berlin'].
Solution:
Initially, I have imported all the important modules and regenerated a List of Sample Data provided by you.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, As you can see in your Expected Output we have 2 Columns mentioned below:
City [Which is already available in the form of cities (List)]
Continent [Which we have to generate based on other Lists. In our case: Europe, America, Asia]
For Generating a proper Continent List follow the Code mentioned below:
# Make the continent list
continent = []
# Compare the Europe, America and Asia lists with cities
for city in cities:
    if city in Europe:
        continent.append('Europe')
    elif city in America:
        continent.append('America')
    elif city in Asia:
        continent.append('Asia')
# Print the continent list
continent
# Output of the above code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see we have received the expected Continent List. Now let's generate the pd.DataFrame() from the same:
# Make dataframe from the 'City' and 'Continent' lists
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this Solution helps you. But if you are still facing Errors then feel free to start a thread below.
1: Counting elements
You just count the number of cities in each continent and create a list with it:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    continent += [name for _ in range(len(cont))]
    cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent})
print(df)
And this gives you the following result:
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
I think this is the best solution.
2: With a dictionary
You can create an intermediate dictionary.
Starting from your code
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this:
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    for city in cont_cities:
        continent[city] = cont_name
This gives you the following result:
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame:
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution also doesn't override your cities tuple.
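A small optional tweak (mine, not part of the answer): the numeric 0/1 headers can be avoided by naming the columns when the frame is built:

```python
import pandas as pd

continent = {'London': 'Europe', 'Berlin': 'Europe',
             'New York': 'America', 'Vancouver': 'America',
             'Tokyo': 'Asia', 'Bangkok': 'Asia'}
# Pass explicit column names instead of the default 0 / 1
df = pd.DataFrame(list(continent.items()), columns=['City', 'Continent'])
```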
In the long run you might want to eliminate loops for large datasets. Also, you might need to include more continents depending on the content of your data.
import pandas as pd
continent = {
    '0': 'Europe',
    '1': 'America',
    '2': 'Asia'
}
df = pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent'] = df['level_0'].astype(str).map(continent)
df.drop(['level_0', 'level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
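A self-contained variant of the stack approach (my sketch: integer keys make the astype(str) step unnecessary, and I rename the value column to City):

```python
import pandas as pd

Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']

continent = {0: 'Europe', 1: 'America', 2: 'Asia'}
# One row per continent, one column per city; stack() flattens it
df = pd.DataFrame([Europe, America, Asia]).stack().reset_index()
# level_0 is the original row number, i.e. the continent index
df['continent'] = df['level_0'].map(continent)
df = df.drop(['level_0', 'level_1'], axis=1).rename(columns={0: 'City'})
```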

create new column of dataframe base on value of another dataframe run fast?

I want to create a new column df_cau2['continent']. First, here are my two dataframes:
country_continent
Continent
Country
Afghanistan Asia
Albania Europe
Algeria Africa
American Samoa Oceania
and
df_cau2
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
To create the new continent column I use apply on df_cau2 like this:
def same_continent(home, away):
    if country_continent.loc[home].Continent == country_continent.loc[away].Continent:
        return country_continent.loc[home].Continent
    return 'None'

df_cau2['continent'] = df_cau2.apply(lambda x: same_continent(x['home_team'], x['away_team']), axis=1)
df_cau2.head()
With 39480 rows in df_cau2 this code runs too slowly. How can I change my code to make it faster? I am thinking about using np.select but I don't know how to use it in this case.
This is result that i want:
date home_team away_team home_score away_score tournament city country neutral continent
7611 1970-09-11 Iran Turkey 1 1 Friendly Teheran Iran False None
31221 2009-03-11 Nepal Pakistan 1 0 Friendly Kathmandu Nepal False Asia
32716 2010-11-17 Colombia Peru 1 1 Friendly Bogotá Colombia False South America
Thanks
IIUC, you want to set continent column only if home_team and away_team columns are in the same continent:
home_continent = df1['home_team'].map(df2.squeeze())
away_continent = df1['away_team'].map(df2.squeeze())
m = home_continent == away_continent
df1.loc[m, 'continent'] = home_continent.loc[m]
print(df1)
# Output
home_team away_team continent
0 Canada England NaN
1 France Spain Europe
2 China Japan Asia
Setup a MRE
df1 = pd.DataFrame({'home_team': ['Canada', 'France', 'China'],
                    'away_team': ['England', 'Spain', 'Japan']})
print(df1)
df2 = pd.DataFrame({'Country': ['Canada', 'China', 'England',
                                'France', 'Japan', 'Spain'],
                    'Continent': ['North America', 'Asia', 'Europe',
                                  'Europe', 'Asia', 'Europe']}).set_index('Country')
print(df2)
print(df2)
# Output df1
home_team away_team
0 Canada England
1 France Spain
2 China Japan
# Output df2
Continent
Country
Canada North America
China Asia
England Europe
France Europe
Japan Asia
Spain Europe
Consider merge of the continent lookup data frame twice to create home and away continent columns. And since you will have both continents, assign new shared continent column conditionally with numpy.where:
df_cau2 = (
    df_cau2.merge(
        country_continent.reset_index(),
        left_on="home_team",
        right_on="Country",
        how="left"
    ).merge(
        country_continent.reset_index(),
        left_on="away_team",
        right_on="Country",
        how="left",
        suffixes=["_home", "_away"]
    )
)
df_cau2["shared_continent"] = np.where(
    df_cau2["Continent_home"].eq(df_cau2["Continent_away"]),
    df_cau2["Continent_home"],
    np.nan
)

Assign values from a dictionary to a new column based on condition

This is my data frame:
City         sales
San Diego    500
Texas        400
Nebraska     300
Macau        200
Rome         100
London       50
Manchester   70
I want to add the country at the end, which will look like this:
City         sales   Country
San Diego    500     US
Texas        400     US
Nebraska     300     US
Macau        200     Hong Kong
Rome         100     Italy
London       50      England
Manchester   70      England
The countries are stored in the dictionary below:
country = {'US': ['San Diego', 'Texas', 'Nebraska'], 'Hong Kong': 'Macau', 'England': ['London', 'Manchester'], 'Italy': 'Rome'}
It's a little complicated because you have lists and strings as the values and strings are technically iterable, so distinguishing is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # Check if it's a non-string iterable; if so, iterate through it
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
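An equivalent inline inversion without the helper function (a sketch; the small df literal is my reconstruction of the question's frame):

```python
import pandas as pd

country = {'US': ['San Diego', 'Texas', 'Nebraska'], 'Hong Kong': 'Macau',
           'England': ['London', 'Manchester'], 'Italy': 'Rome'}
# Wrap bare strings in a one-element list, then invert in one comprehension
d = {city: ctry for ctry, v in country.items()
     for city in ([v] if isinstance(v, str) else v)}

df = pd.DataFrame({'City': ['San Diego', 'Texas', 'Nebraska', 'Macau',
                            'Rome', 'London', 'Manchester'],
                   'sales': [500, 400, 300, 200, 100, 50, 70]})
df['Country'] = df['City'].map(d)
```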
You can implement this using geopy
You can install geopy by pip install geopy
Here is the documentation : https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom

I have a CSV and want to update it with values from another CSV. What is the most efficient way to do this?

I have this CSV:
Name Species Country
0 Hobbes Tiger U.S.
1 SherKhan Tiger India
2 Rescuer Mouse Australia
3 Mickey Mouse U.S.
And I have a second CSV:
Continent Countries Unnamed: 2 Unnamed: 3 Unnamed: 4
0 North America U.S. Mexico Guatemala Honduras
1 Asia India China Nepal NaN
2 Australia Australia NaN NaN NaN
3 Africa South Africa Botswana Zimbabwe NaN
I want to use the second CSV to update the first file so that the output is:
Name Species Country
0 Hobbes Tiger North America
1 SherKhan Tiger Asia
2 Rescuer Mouse Australia
3 Mickey Mouse North America
So far this is the closest I have gotten:
import pandas as pd
# Import my data.
data = pd.read_csv('Continents.csv')
Animals = pd.read_csv('Animals.csv')
Animalsdf = pd.DataFrame(Animals)
# Transpose my data from horizontal to vertical.
data1 = data.T
# Clean my data and update my header with the first column.
data1.columns = data1.iloc[0]
# Drop now duplicated data.
data1.drop(data1.index[[0]], inplace = True)
# Build the dictionary.
data_dict = {col: list(data1[col]) for col in data1.columns}
# Update my csv.
Animals['Country'] = Animals['Country'].map(data_dict)
print(Animals)
This results in a dictionary that has lists as its values, and therefore I just get NaN out:
Name Species Country
0 Hobbes Tiger NaN
1 SherKhan Tiger NaN
2 Rescuer Mole [Australia, nan, nan, nan]
3 Mickey Mole NaN
I've tried flipping from list to tuples and this doesn't work. Have tried multiple ways to pull in the dictionary etc. I am just out of ideas.
Sorry if the code is super junky. I'm learning this as I go. Figured a project was the best way to learn a new language. Didn't think it would be this difficult.
Any suggestions would be appreciated. I need to be able to use the code so that when I get multiple reference CSVs, I can update my data with new keys. Hope this is clear.
Thanks in advance.
One intuitive solution is to use a dictionary mapping. Data from #WillMonge.
pd.DataFrame.itertuples works by producing namedtuples, but they may also be referenced using numeric indexers.
# create mapping dictionary
d = {}
for row in df.itertuples():
    d.update(dict.fromkeys(filter(None, row[2:]), row[1]))
# apply mapping dictionary
data['Continent'] = data['Country'].map(d)
print(data)
Country name Continent
0 China 2 Asia
1 China 5 Asia
2 Canada 9 America
3 Egypt 0 Africa
4 Mexico 3 America
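For a self-contained run, the two frames credited to #WillMonge (defined in the next answer) can be inlined (a sketch):

```python
import pandas as pd

data = pd.DataFrame({'name': [2, 5, 9, 0, 3],
                     'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
                   'Country1': ['China', 'Mexico', 'Egypt'],
                   'Country2': ['Japan', 'Canada', None],
                   'Country3': ['Thailand', None, None]})

d = {}
for row in df.itertuples():
    # row[1] is the continent; row[2:] are its countries (skip the None fillers)
    d.update(dict.fromkeys(filter(None, row[2:]), row[1]))

data['Continent'] = data['Country'].map(d)
```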
You should use DictReader and DictWriter. You can learn how to use them from the link below:
https://docs.python.org/2/library/csv.html
Here is an update of your code; I have tried to add comments to explain.
import pandas as pd
# Read data in (read_csv also returns a DataFrame directly)
data = pd.DataFrame({'name': [2, 5, 9, 0, 3], 'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
                   'Country1': ['China', 'Mexico', 'Egypt'],
                   'Country2': ['Japan', 'Canada', None],
                   'Country3': ['Thailand', None, None]})
# Unstack to get a row for each country (remove the continent rows)
premap_df = pd.DataFrame(df.unstack('Continent').drop('Continent')).dropna().reset_index()
premap_df.columns = ['_', 'continent_key', 'Country']
# Merge the continent back based on the continent_key (old row number)
map_df = pd.merge(premap_df, df[['Continent']], left_on='continent_key', right_index=True)[['Continent', 'Country']]
# Merge with the data now
pd.merge(data, map_df, on='Country')
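A simpler route to the same long-format mapping table, not in the original answer, is melt (a sketch using the same sample data):

```python
import pandas as pd

data = pd.DataFrame({'name': [2, 5, 9, 0, 3],
                     'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
                   'Country1': ['China', 'Mexico', 'Egypt'],
                   'Country2': ['Japan', 'Canada', None],
                   'Country3': ['Thailand', None, None]})

# Melt the CountryN columns into rows, drop the empty fillers
map_df = (df.melt(id_vars='Continent', value_name='Country')
            .dropna(subset=['Country'])[['Continent', 'Country']])
merged = data.merge(map_df, on='Country', how='left')
```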
For further reference, Wes McKinney's Python for Data Analysis (here is a pdf version I found online) is one of the best books out there for learning pandas
You can always create buckets and run conditions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['Hobbes', 'SherKhan', 'Rescuer', 'Mickey'],
                   'Species': ['Tiger', 'Tiger', 'Mouse', 'Mouse'],
                   'Country': ['U.S.', 'India', 'Australia', 'U.S.']})
North_America = ['U.S.', 'Mexico', 'Guatemala', 'Honduras']
Asia = ['India', 'China', 'Nepal']
Australia = ['Australia']
Africa = ['South Africa', 'Botswana', 'Zimbabwe']
conditions = [
    df['Country'].isin(North_America),
    df['Country'].isin(Asia),
    df['Country'].isin(Australia),
    df['Country'].isin(Africa)
]
choices = [
    'North America',
    'Asia',
    'Australia',
    'Africa'
]
df['Continent'] = np.select(conditions, choices, default=np.nan)
df
