I have a .csv and dataframe which has 2 columns (country, continent). I want to create a dictionary, carrying the continent as key and a list of all countries as values.
The .csv has the following format:
country
continent
Algeria
Africa
Angola
Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
Which is the right format but only added the last value it found for the respective continent.
Anyone an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
Related
I have a df where I've used pycountry to get full names for country column and continent column and make select box in streamlit like so,
country continent
Hong Kong Asia
Montenegro Europe
Rwanda Africa
United States North America
Germany Europe
Myanmar Asia
Saudi Arabia Asia
etc.. etc..
Streamlit code:
continent_select = df['continent'].drop_duplicates()
country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
country_sidebar = st.sidebar.selectbox('Select a country:', country_select)
Desired output: I would like country names for specific continent to show in select box ie: Select 'Asia' from continent select box. Then in country select box, Hong Kong, China, India, etc... show up.
I've tried to group the rows into a list with continent
df2 = df.groupby('continent')['country'].apply(list)
continent_sidebar = st.sidebar.selectbox('Select a continent:', df2)
But this results in having a full list in the select box ie: [Rwanda, Morocco, Sudan, etc.]
Is there an easier way besides making a dictionary and grouping them manually?
See the comments in the code.
Code
import streamlit as st
import pandas as pd
data = {
'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia']
}
df = pd.DataFrame(data)
continent_select = df['continent'].drop_duplicates()
# country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
# Get dataframe where continent is the continent_sidebar.
df1 = df.loc[df.continent == continent_sidebar]
# Get the country column from df1.
df2 = df1.country
# Show df2 in the side bar.
country_sidebar = st.sidebar.selectbox('Select a country:', df2)
Output
Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City
Continent
New York
America
Vancouver
America
London
Europe
Berlin
Europe
Tokyo
Asia
Bangkok
Asia
Note: this is the minimum reproductible example to keep it simple, but the real dataset is more like city -> country -> continent
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with arguments such as "if Europe in cities" but that doesn't do anything and I think that's because it's "false" since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loops would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your Code:
First of all, let's take a look at a Code Snippet used by you: if Europe in cities: was returned nothing Correct!
It is because you are comparing the whole list [Europe] instead of individual list element ['London', 'Berlin']
Solution:
Initially, I have imported all the important modules and regenerated a List of Sample Data provided by you.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, As you can see in your Expected Output we have 2 Columns mentioned below:
City [Which is already available in the form of cities (List)]
Continent [Which we have to generate based on other Lists. In our case: Europe, America, Asia]
For Generating a proper Continent List follow the Code mentioned below:
# Make Continent list
continent = []
# Compare the list of Europe, America and Asia with cities
for city in cities:
if city in Europe:
continent.append('Europe')
elif city in America:
continent.append('America')
elif city in Asia:
continent.append('Asia')
else:
pass
# Print the continent list
continent
# Output of Above Code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see we have received the expected Continent List. Now let's generate the pd.DataFrame() from the same:
# Make dataframe from 'City' and 'Continent List`
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this Solution helps you. But if you are still facing Errors then feel free to start a thread below.
1 : Counting elements
You just count the number of cities in each continent and create a list with it :
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
continent += [name for _ in range(len(cont))]
cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent}
print(df)
And this gives you the following result :
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This is I think the best solution.
2: With dictionnary
You can create an intermediate dictionnary.
Starting from your code
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this :
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
for city in cont_cities:
continent[city] = cont_name
This give you the following result :
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame :
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution allows you not to override your cities tuple
I think on the long run you might want to elimninate loops for large datasets. Also, you might need to include more continent depending on the content of your data.
import pandas as pd
continent = {
'0': 'Europe',
'1': 'America',
'2': 'Asia'
}
df= pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent']= df['level_0'].astype(str).map(continent)
df.drop(['level_0','level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
I got situation where I need to transpose a dataframe like below.
input dataframe is as below:
input_data = [
['Asia', 'China', 'Beijing'],
['Asia', 'China', 'Shenzhen'],
['America', 'United States', 'New York'],
['America', 'Canada', 'Toronto']
]
input_df = pd.DataFrame(input_data)
input_df.columns = ['continents', 'countries', 'cities']
input_df
continents
countries
cities
0
Asia
China
Beijing
1
Asia
China
Shenzhen
2
America
United States
New York
3
America
Canada
Toronto
The output data I want to get is
# only the unique values are allowed in the output list.
continents = ['Asia', 'America']
countries = [['China'], ['United States', 'Canada']]
cities = [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]
For this case, the input data has three levels Continents -> Countries -> Cities, but what I ultimately want is to take a multiple-level hierarchical dataframe (no matters how deep it is horizontally), then I get the output like the example, and then I will put them on a pyqt5 column view.
pandas.Series.tolist() can convert series value to list.
print(input_df['continents'].unique().tolist())
print(input_df.groupby('continents', sort=False)['countries'].apply(lambda x: x.unique().tolist()).tolist())
print(input_df.groupby(['continents', 'countries'], sort=False)['cities'].apply(lambda x: [x.unique().tolist()]).tolist())
['Asia', 'America']
[['China'], ['United States', 'Canada']]
[[['Beijing', 'Shenzhen']], [['New York']], [['Toronto']]]
As for a general approach, the first approach occurred to me is to loop through the columns of df.
def list_wrapper(alist, times):
for _ in range(times):
alist = [alist]
return alist
columns_name = input_df.columns.values.tolist()
for i in range(len(columns_name)):
if i == 0:
print(input_df[columns_name[i]].unique().tolist())
else:
print(input_df.groupby(columns_name[0:i], sort=False)[columns_name[i]].apply(lambda x: list_wrapper(x.unique().tolist(), i-1)).tolist())
I have a data frame with country and traffic columns:
Country | Traffic
US 8687
Italy 902834
Germany 2343
Brazil 4254
France 23453
I want to add a third column called "Region" to this data frame. It would look like this:
Country | Traffic | Region
US 8687 US
Italy 902834 EU
Germany 2343 EU
Brazil 4254 LA
France 23453 EU
The following code works if I have only two Regions. I am looking more for an if/else, map, or lambda statement:
df['Region'] = np.where(df['Country'] == 'US', 'US', 'EU')
Thank You.
One simple approach is this:
dict ={'US':'US','Italy':'EU','Germany':'EU','Brazil':'LA','France':'EU'}
df['Region']=df['Country'].apply(lambda x : dict[x])
You could use a dictionary:
region_from_country = {
'US': 'US',
'Italy': 'EU',
'Germany': 'EU',
'Brazil': 'LA',
'France': 'EU',
}
df['Region'] = df['Country'].replace(region_from_country)
The keys in the dictionary are the countries and the values are the corresponding regions.
I have a following pandas dataframe df:
% Renewable Energy Supply
Country
China 19.754910 1.271910e+11
United States 11.570980 9.083800e+10
Japan 10.232820 1.898400e+10
United Kingdom 10.600470 7.920000e+09
Russian Federation 17.288680 3.070900e+10
Canada 61.945430 1.043100e+10
Germany 17.901530 1.326100e+10
India 14.969080 3.319500e+10
France 17.020280 1.059700e+10
South Korea 2.279353 1.100700e+10
Italy 33.667230 6.530000e+09
Spain 37.968590 4.923000e+09
Iran 5.707721 9.172000e+09
Australia 11.810810 5.386000e+09
Brazil 69.648030 1.214900e+10
I am grouping this dataframe using the Continents each country belongs to and also using the bins obtained by using pd.cut on the column % Renewable :
out, bins = pd.cut(Top15['% Renewable'].values, bins = 5, retbins = True)
grp = Top15.groupby(by = [ContinentDict, out])
where,
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
Now, I want to create a new dataframe with the same columns as df and another column given by 'Country'. The indexes of this new dataframe should be given by groupby objects keys hierarchically ('Continent', 'out'). After hours of trial, I see no way to do this. Any ideas?
You can create a multi-index from continent and cut and assign it back to your data frame:
out, bins = pd.cut(Top15['% Renewable'].values, bins = 5, retbins = True)
con = Top15.index.to_series().map(ContinentDict).values
Top15.reset_index(inplace=True)
Top15.index = pd.MultiIndex.from_arrays([con, out])
Top15