I have a df where I've used pycountry to get full names for country column and continent column and make select box in streamlit like so,
country continent
Hong Kong Asia
Montenegro Europe
Rwanda Africa
United States North America
Germany Europe
Myanmar Asia
Saudi Arabia Asia
etc.. etc..
Streamlit code:
continent_select = df['continent'].drop_duplicates()
country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
country_sidebar = st.sidebar.selectbox('Select a country:', country_select)
Desired output: I would like country names for specific continent to show in select box ie: Select 'Asia' from continent select box. Then in country select box, Hong Kong, China, India, etc... show up.
I've tried to group the rows into a list with continent
df2 = df.groupby('continent')['country'].apply(list)
continent_sidebar = st.sidebar.selectbox('Select a continent:', df2)
But this results in having a full list in the select box ie: [Rwanda, Morocco, Sudan, etc.]
Is there an easier way besides making a dictionary and grouping them manually?
See the comments in the code.
Code
import streamlit as st
import pandas as pd
data = {
'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia']
}
df = pd.DataFrame(data)
continent_select = df['continent'].drop_duplicates()
# country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
# Get dataframe where continent is the continent_sidebar.
df1 = df.loc[df.continent == continent_sidebar]
# Get the country column from df1.
df2 = df1.country
# Show df2 in the side bar.
country_sidebar = st.sidebar.selectbox('Select a country:', df2)
Output
Related
I have a .csv and dataframe which has 2 columns (country, continent). I want to create a dictionary, carrying the continent as key and a list of all countries as values.
The .csv has the following format:
country
continent
Algeria
Africa
Angola
Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
Which is the right format but only added the last value it found for the respective continent.
Anyone an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City
Continent
New York
America
Vancouver
America
London
Europe
Berlin
Europe
Tokyo
Asia
Bangkok
Asia
Note: this is the minimum reproductible example to keep it simple, but the real dataset is more like city -> country -> continent
I understand with such a small sample it would be possible to manually create a dictionary, but in the real example there are many more data-points. So I need to automate it.
I've tried a for loop and a while loop with arguments such as "if Europe in cities" but that doesn't do anything and I think that's because it's "false" since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loops would go through every city in the cities list and return (city + continent) for each. I just don't know how to um... actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your Code:
First of all, let's take a look at a Code Snippet used by you: if Europe in cities: was returned nothing Correct!
It is because you are comparing the whole list [Europe] instead of individual list element ['London', 'Berlin']
Solution:
Initially, I have imported all the important modules and regenerated a List of Sample Data provided by you.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, As you can see in your Expected Output we have 2 Columns mentioned below:
City [Which is already available in the form of cities (List)]
Continent [Which we have to generate based on other Lists. In our case: Europe, America, Asia]
For Generating a proper Continent List follow the Code mentioned below:
# Make Continent list
continent = []
# Compare the list of Europe, America and Asia with cities
for city in cities:
if city in Europe:
continent.append('Europe')
elif city in America:
continent.append('America')
elif city in Asia:
continent.append('Asia')
else:
pass
# Print the continent list
continent
# Output of Above Code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see we have received the expected Continent List. Now let's generate the pd.DataFrame() from the same:
# Make dataframe from 'City' and 'Continent List`
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
Hope this Solution helps you. But if you are still facing Errors then feel free to start a thread below.
1 : Counting elements
You just count the number of cities in each continent and create a list with it :
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
continent = []
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
continent += [name for _ in range(len(cont))]
cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent}
print(df)
And this gives you the following result :
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This is I think the best solution.
2: With dictionnary
You can create an intermediate dictionnary.
Starting from your code
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this :
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
for city in cont_cities:
continent[city] = cont_name
This give you the following result :
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame :
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution allows you not to override your cities tuple
I think on the long run you might want to elimninate loops for large datasets. Also, you might need to include more continent depending on the content of your data.
import pandas as pd
continent = {
'0': 'Europe',
'1': 'America',
'2': 'Asia'
}
df= pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent']= df['level_0'].astype(str).map(continent)
df.drop(['level_0','level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
I got situation where I need to transpose a dataframe like below.
input dataframe is as below:
input_data = [
['Asia', 'China', 'Beijing'],
['Asia', 'China', 'Shenzhen'],
['America', 'United States', 'New York'],
['America', 'Canada', 'Toronto']
]
input_df = pd.DataFrame(input_data)
input_df.columns = ['continents', 'countries', 'cities']
input_df
continents
countries
cities
0
Asia
China
Beijing
1
Asia
China
Shenzhen
2
America
United States
New York
3
America
Canada
Toronto
The output data I want to get is
# only the unique values are allowed in the output list.
continents = ['Asia', 'America']
countries = [['China'], ['United States', 'Canada']]
cities = [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]
For this case, the input data has three levels Continents -> Countries -> Cities, but what I ultimately want is to take a multiple-level hierarchical dataframe (no matters how deep it is horizontally), then I get the output like the example, and then I will put them on a pyqt5 column view.
pandas.Series.tolist() can convert series value to list.
print(input_df['continents'].unique().tolist())
print(input_df.groupby('continents', sort=False)['countries'].apply(lambda x: x.unique().tolist()).tolist())
print(input_df.groupby(['continents', 'countries'], sort=False)['cities'].apply(lambda x: [x.unique().tolist()]).tolist())
['Asia', 'America']
[['China'], ['United States', 'Canada']]
[[['Beijing', 'Shenzhen']], [['New York']], [['Toronto']]]
As for a general approach, the first approach occurred to me is to loop through the columns of df.
def list_wrapper(alist, times):
for _ in range(times):
alist = [alist]
return alist
columns_name = input_df.columns.values.tolist()
for i in range(len(columns_name)):
if i == 0:
print(input_df[columns_name[i]].unique().tolist())
else:
print(input_df.groupby(columns_name[0:i], sort=False)[columns_name[i]].apply(lambda x: list_wrapper(x.unique().tolist(), i-1)).tolist())
I have this CSV:
Name Species Country
0 Hobbes Tiger U.S.
1 SherKhan Tiger India
2 Rescuer Mouse Australia
3 Mickey Mouse U.S.
And I have a second CSV:
Continent Countries Unnamed: 2 Unnamed: 3 Unnamed: 4
0 North America U.S. Mexico Guatemala Honduras
1 Asia India China Nepal NaN
2 Australia Australia NaN NaN NaN
3 Africa South Africa Botswana Zimbabwe NaN
I want to use the second CSV to update the first file so that the output is:
Name Species Country
0 Hobbes Tiger North America
1 SherKhan Tiger Asia
2 Rescuer Mouse Australia
3 Mickey Mouse North America
So far this the closest I have gotten:
import pandas as pd
# Import my data.
data = pd.read_csv('Continents.csv')
Animals = pd.read_csv('Animals.csv')
Animalsdf = pd.DataFrame(Animals)
# Transpose my data from horizontal to vertical.
data1 = data.T
# Clean my data and update my header with the first column.
data1.columns = data1.iloc[0]
# Drop now duplicated data.
data1.drop(data1.index[[0]], inplace = True)
# Build the dictionary.
data_dict = {col: list(data1[col]) for col in data1.columns}
# Update my csv.
Animals['Country'] = Animals['Country'].map(data_dict)
print ('Animals')
This results in a dictionary that has lists as its values and therefore i just get NaN out:
Name Species Country
0 Hobbes Tiger NaN
1 SherKhan Tiger NaN
2 Rescuer Mole [Australia, nan, nan, nan]
3 Mickey Mole NaN
I've tried flipping from list to tuples and this doesn't work. Have tried multiple ways to pull in the dictionary etc. I am just out of ideas.
Sorry if the code is super junky. I'm learning this as I go. Figured a project was the best way to learn a new language. Didn't think it would be this difficult.
Any suggestions would be appreciated. I need to be able to use the code so that when I get multiple reference CSVs, I can update my data with new keys. Hope this is clear.
Thanks in advance.
One intuitive solution is to use a dictionary mapping. Data from #WillMonge.
pd.DataFrame.itertuples works by producing namedtuples, but they may also be referenced using numeric indexers.
# create mapping dictionary
d = {}
for row in df.itertuples():
d.update(dict.fromkeys(filter(None, row[2:]), row[1]))
# apply mapping dictionary
data['Continent'] = data['Country'].map(d)
print(data)
Country name Continent
0 China 2 Asia
1 China 5 Asia
2 Canada 9 America
3 Egypt 0 Africa
4 Mexico 3 America
You should use DictReader and DictWriter. You can learn how to use them by below link.
https://docs.python.org/2/library/csv.html
Here is an update of your code, I have tried to add comments to explain
import pandas as pd
# Read data in (read_csv also returns a DataFrame directly)
data = pd.DataFrame({'name': [2, 5, 9, 0, 3], 'Country': ['China', 'China', 'Canada', 'Egypt', 'Mexico']})
df = pd.DataFrame({'Continent': ['Asia', 'America', 'Africa'],
'Country1': ['China', 'Mexico', 'Egypt'],
'Country2': ['Japan', 'Canada', None],
'Country3': ['Thailand', None, None ]})
# Unstack to get a row for each country (remove the continent rows)
premap_df = pd.DataFrame(df.unstack('Continent').drop('Continent')).dropna().reset_index()
premap_df.columns = ['_', 'continent_key', 'Country']
# Merge the continent back based on the continent_key (old row number)
map_df = pd.merge(premap_df, df[['Continent']], left_on='continent_key', right_index=True)[['Continent', 'Country']]
# Merge with the data now
pd.merge(data, map_df, on='Country')
For further reference, Wes McKinney's Python for Data Analysis (here is a pdf version I found online) is one of the best books out there for learning pandas
You can always create buckets and run conditions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name':['Hobbes','SherKhan','Rescuer','Mickey'], 'Species':['Tiger','Tiger','Mouse','Mouse'],'Country':['U.S.','India','Australia','U.S.']})
North_America = ['U.S.', 'Mexico', 'Guatemala', 'Honduras']
Asia = ['India', 'China', 'Nepal']
Australia = ['Australia']
Africa = ['South Africa', 'Botswana', 'Zimbabwe']
conditions = [
(df['Country'].isin(North_America)),
(df['Country'].isin(Asia)),
(df['Country'].isin(Australia)),
(df['Country'].isin(Africa))
]
choices = [
'North America',
'Asia',
'Australia',
'Africa'
]
df['Continent'] = np.select(conditions, choices, default = np.nan)
df
I have a following pandas dataframe df:
% Renewable Energy Supply
Country
China 19.754910 1.271910e+11
United States 11.570980 9.083800e+10
Japan 10.232820 1.898400e+10
United Kingdom 10.600470 7.920000e+09
Russian Federation 17.288680 3.070900e+10
Canada 61.945430 1.043100e+10
Germany 17.901530 1.326100e+10
India 14.969080 3.319500e+10
France 17.020280 1.059700e+10
South Korea 2.279353 1.100700e+10
Italy 33.667230 6.530000e+09
Spain 37.968590 4.923000e+09
Iran 5.707721 9.172000e+09
Australia 11.810810 5.386000e+09
Brazil 69.648030 1.214900e+10
I am grouping this dataframe using the Continents each country belongs to and also using the bins obtained by using pd.cut on the column % Renewable :
out, bins = pd.cut(Top15['% Renewable'].values, bins = 5, retbins = True)
grp = Top15.groupby(by = [ContinentDict, out])
where,
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
Now, I want to create a new dataframe with the same columns as df and another column given by 'Country'. The indexes of this new dataframe should be given by groupby objects keys hierarchically ('Continent', 'out'). After hours of trial, I see no way to do this. Any ideas?
You can create a multi-index from continent and cut and assign it back to your data frame:
out, bins = pd.cut(Top15['% Renewable'].values, bins = 5, retbins = True)
con = Top15.index.to_series().map(ContinentDict).values
Top15.reset_index(inplace=True)
Top15.index = pd.MultiIndex.from_arrays([con, out])
Top15