Transpose a dataframe to a nested list of list - python

I have a situation where I need to transpose a dataframe as shown below.
The input dataframe is:
import pandas as pd
input_data = [
    ['Asia', 'China', 'Beijing'],
    ['Asia', 'China', 'Shenzhen'],
    ['America', 'United States', 'New York'],
    ['America', 'Canada', 'Toronto']
]
input_df = pd.DataFrame(input_data)
input_df.columns = ['continents', 'countries', 'cities']
input_df
  continents      countries    cities
0       Asia          China   Beijing
1       Asia          China  Shenzhen
2    America  United States  New York
3    America         Canada   Toronto
The output I want to get is:
# only the unique values are allowed in the output list.
continents = ['Asia', 'America']
countries = [['China'], ['United States', 'Canada']]
cities = [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]
In this case the input data has three levels (Continents -> Countries -> Cities), but what I ultimately want is to take a hierarchical dataframe with any number of levels (no matter how many columns deep it is), produce output like the example above, and then show it in a PyQt5 column view.

pandas.Series.tolist() converts the values of a Series to a list.
print(input_df['continents'].unique().tolist())
print(input_df.groupby('continents', sort=False)['countries'].apply(lambda x: x.unique().tolist()).tolist())
print(input_df.groupby(['continents', 'countries'], sort=False)['cities'].apply(lambda x: [x.unique().tolist()]).tolist())
['Asia', 'America']
[['China'], ['United States', 'Canada']]
[[['Beijing', 'Shenzhen']], [['New York']], [['Toronto']]]
As for a general approach, the first one that occurred to me is to loop through the columns of the dataframe.
def list_wrapper(alist, times):
    # Wrap alist in `times` extra levels of list nesting.
    for _ in range(times):
        alist = [alist]
    return alist
columns_name = input_df.columns.values.tolist()
for i in range(len(columns_name)):
    if i == 0:
        print(input_df[columns_name[i]].unique().tolist())
    else:
        print(input_df.groupby(columns_name[0:i], sort=False)[columns_name[i]].apply(lambda x: list_wrapper(x.unique().tolist(), i-1)).tolist())
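Note that the groupby output above puts New York and Toronto in separate top-level entries, while the question asks for them to stay nested under America. A minimal sketch of one way to get the requested nesting at any depth is to recurse one parent column at a time. The helper name nested_level is hypothetical, and the sketch assumes the columns are ordered from the outermost level to the innermost and contain no missing values:

```python
import pandas as pd

def nested_level(df, columns, depth):
    """Unique values of columns[depth], nested once per parent column."""
    if depth == 0:
        # Innermost step: unique values in first-seen order.
        return df[columns[0]].unique().tolist()
    # Split on the outermost column, then recurse into each group.
    return [
        nested_level(group, columns[1:], depth - 1)
        for _, group in df.groupby(columns[0], sort=False)
    ]

input_df = pd.DataFrame(
    [
        ['Asia', 'China', 'Beijing'],
        ['Asia', 'China', 'Shenzhen'],
        ['America', 'United States', 'New York'],
        ['America', 'Canada', 'Toronto'],
    ],
    columns=['continents', 'countries', 'cities'],
)

columns = input_df.columns.tolist()
levels = [nested_level(input_df, columns, d) for d in range(len(columns))]
for name, level in zip(columns, levels):
    print(name, '=', level)
```

Each extra column adds one recursion step, so the same function should cover deeper hierarchies without changes.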

Related

Converting dataframe to dictionary with country by continent

I have a .csv and dataframe which has 2 columns (country, continent). I want to create a dictionary, carrying the continent as key and a list of all countries as values.
The .csv has the following format:
country  continent
Algeria  Africa
Angola   Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
This is the right format, but it only kept the last value found for each continent.
Does anyone have an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
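To see why the original dict(zip(...)) attempt kept only one country per continent, here is a small sketch without pandas. A dict holds a single value per key, so each later pair overwrites the earlier one. The three sample pairs are illustrative, not the asker's full data:

```python
from collections import defaultdict

# (country, continent) pairs; illustrative sample only.
pairs = [('Algeria', 'Africa'), ('Angola', 'Africa'), ('Yemen', 'Asia')]

# One value per key: 'Angola' overwrites 'Algeria' for 'Africa'.
last_only = {continent: country for country, continent in pairs}

# Appending to a list per key keeps every country.
grouped = defaultdict(list)
for country, continent in pairs:
    grouped[continent].append(country)
grouped = dict(grouped)

print(last_only)
print(grouped)
```

The groupby(...).apply(list).to_dict() answers above do the same list-per-key grouping inside pandas.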

Match countries with continents in streamlit selectbox

I have a df where I've used pycountry to get full names for the country and continent columns, and I want to build select boxes in Streamlit from it. The data looks like this:
country continent
Hong Kong Asia
Montenegro Europe
Rwanda Africa
United States North America
Germany Europe
Myanmar Asia
Saudi Arabia Asia
etc.. etc..
Streamlit code:
continent_select = df['continent'].drop_duplicates()
country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
country_sidebar = st.sidebar.selectbox('Select a country:', country_select)
Desired output: I would like only the countries of the selected continent to show in the country select box. For example, after selecting 'Asia' in the continent select box, the country select box should list Hong Kong, China, India, etc.
I've tried to group the rows into one list per continent:
df2 = df.groupby('continent')['country'].apply(list)
continent_sidebar = st.sidebar.selectbox('Select a continent:', df2)
But this puts each full list into the select box as a single option, i.e. [Rwanda, Morocco, Sudan, etc.]
Is there an easier way besides making a dictionary and grouping them manually?
See the comments in the code.
Code
import streamlit as st
import pandas as pd
data = {
    'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
    'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia']
}
df = pd.DataFrame(data)
continent_select = df['continent'].drop_duplicates()
# country_select = df['country'].drop_duplicates()
continent_sidebar = st.sidebar.selectbox('Select a continent:', continent_select)
# Get dataframe where continent is the continent_sidebar.
df1 = df.loc[df.continent == continent_sidebar]
# Get the country column from df1.
df2 = df1.country
# Show df2 in the side bar.
country_sidebar = st.sidebar.selectbox('Select a country:', df2)
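The filtering step can also be checked on its own, without running Streamlit. The sketch below wraps it in a small helper. The name countries_in is hypothetical, and it assumes the column names 'country' and 'continent' used above:

```python
import pandas as pd

def countries_in(df, continent):
    """Unique countries of one continent, in first-seen order."""
    selected = df.loc[df['continent'] == continent, 'country']
    return selected.drop_duplicates().tolist()

df = pd.DataFrame({
    'country': ['Hong Kong', 'Montenegro', 'Rwanda', 'Myanmar', 'Saudi Arabia'],
    'continent': ['Asia', 'Europe', 'Africa', 'Asia', 'Asia'],
})

asia = countries_in(df, 'Asia')
print(asia)
```

In the app, the returned list could be passed as the options of the second select box, for example st.sidebar.selectbox('Select a country:', countries_in(df, continent_sidebar)).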

How to loop to consecutively go through a list of strings, assign value to each string and return it to a new list

Say instead of a dictionary I have these lists:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
I want to create a pd.DataFrame from this such as:
City       Continent
New York   America
Vancouver  America
London     Europe
Berlin     Europe
Tokyo      Asia
Bangkok    Asia
Note: this is a minimal reproducible example to keep things simple; the real dataset is more like city -> country -> continent.
I understand that with such a small sample it would be possible to create a dictionary manually, but the real data has many more data points, so I need to automate it.
I've tried a for loop and a while loop with conditions such as "if Europe in cities", but that doesn't do anything. I think that's because it evaluates to False, since it compares the whole list "Europe" against the whole list "cities".
Either way, my idea was that the loop would go through every city in the cities list and return (city + continent) for each. I just don't know how to actually make that work.
I am very new and I wasn't able to figure anything out from looking at similar questions.
Thank you for any direction!
Problem in your code:
First, look at the snippet you used: if Europe in cities: did nothing, which is the correct behavior.
That is because it checks whether the whole Europe list is itself an element of cities, instead of checking each individual element ('London', 'Berlin').
Solution:
First, import the required module and recreate the sample data you provided.
# Import all the Important Modules
import pandas as pd
# Read Data
cities = ['New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok']
Europe = ['London', 'Berlin']
America = ['New York', 'Vancouver']
Asia = ['Tokyo', 'Bangkok']
Now, as you can see in your expected output, we need the 2 columns listed below:
City (already available as the cities list)
Continent (which we have to generate from the other lists; in our case Europe, America and Asia)
To generate the Continent list, use the code below:
# Make Continent list
continent = []
# Compare the list of Europe, America and Asia with cities
for city in cities:
    if city in Europe:
        continent.append('Europe')
    elif city in America:
        continent.append('America')
    elif city in Asia:
        continent.append('Asia')
    else:
        pass
# Print the continent list
continent
# Output of Above Code:
['America', 'America', 'Europe', 'Europe', 'Asia', 'Asia']
As you can see, we have the expected Continent list. Now let's generate the pd.DataFrame() from it:
# Make dataframe from the 'City' and 'Continent' lists
data_df = pd.DataFrame({'City': cities, 'Continent': continent})
# Print Results
data_df
# Output of the above Code:
City Continent
0 New York America
1 Vancouver America
2 London Europe
3 Berlin Europe
4 Tokyo Asia
5 Bangkok Asia
I hope this solution helps. If you still run into errors, feel free to start a thread below.
1: Counting elements
You just count the number of cities in each continent and build a list from it:
import pandas as pd
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
continent = []
# Note: this rebuilds cities in continent order, replacing the tuple above.
cities = []
for name, cont in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    continent += [name for _ in range(len(cont))]
    cities += [city for city in cont]
df = pd.DataFrame({'City': cities, 'Continent': continent})
print(df)
And this gives you the following result :
City Continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
I think this is the best solution.
2: With a dictionary
You can create an intermediate dictionary.
Starting from your code:
cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')
Europe = ('London', 'Berlin')
America = ('New York', 'Vancouver')
Asia = ('Tokyo', 'Bangkok')
You would do this:
continent = dict()
for cont_name, cont_cities in zip(['Europe', 'America', 'Asia'], [Europe, America, Asia]):
    for city in cont_cities:
        continent[city] = cont_name
This gives you the following result:
{
'London': 'Europe', 'Berlin': 'Europe',
'New York': 'America', 'Vancouver': 'America',
'Tokyo': 'Asia', 'Bangkok': 'Asia'
}
Then, you can create your DataFrame :
df = pd.DataFrame(continent.items())
print(df)
0 1
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
This solution lets you avoid overwriting your cities tuple.
I think in the long run you might want to eliminate loops for large datasets. Also, you might need to include more continents, depending on the content of your data.
import pandas as pd
continent = {
    '0': 'Europe',
    '1': 'America',
    '2': 'Asia'
}
df = pd.DataFrame([Europe, America, Asia]).stack().reset_index()
df['continent'] = df['level_0'].astype(str).map(continent)
df.drop(['level_0', 'level_1'], inplace=True, axis=1)
You should get this output
0 continent
0 London Europe
1 Berlin Europe
2 New York America
3 Vancouver America
4 Tokyo Asia
5 Bangkok Asia
Feel free to adjust to suit your use case
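As a further sketch along the same lines, the continent tuples can be kept in one dictionary and inverted into a city-to-continent lookup, which Series.map then applies while keeping the original city order. The names groups and city_to_continent are assumptions made for this example:

```python
import pandas as pd

cities = ('New York', 'Vancouver', 'London', 'Berlin', 'Tokyo', 'Bangkok')

# Continent name -> member cities (same data as in the question).
groups = {
    'Europe': ('London', 'Berlin'),
    'America': ('New York', 'Vancouver'),
    'Asia': ('Tokyo', 'Bangkok'),
}

# Invert to city -> continent.
city_to_continent = {
    city: name for name, members in groups.items() for city in members
}

df = pd.DataFrame({'City': list(cities)})
# Cities missing from the lookup would get NaN here.
df['Continent'] = df['City'].map(city_to_continent)
print(df)
```

For the real city -> country -> continent data, a second lookup from country to continent could be applied in the same way.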

Is there a way to check if a list item is the ONLY item in the list?

I have a list of dictionaries. The whole list represents different countries, and each dictionary holds basic data about one country, as follows:
example a
df.countries[3]
"[{'iso_3166_1': 'DE', 'name': 'Germany'}, {'iso_3166_1': 'US', 'name': 'United States of America'}, {'iso_3166_1': 'IN', 'name': 'India'}]"
Of course, there are other cells where countries list has only one dictionary like this:
example b
df.countries[0]
"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"
Or empty list like this:
example c
df.countries[505]
'[]'
What I want to do is:
Delete rows where country name is United States of America BUT
ONLY WHEN it's the only country in the list, not when there are
other countries like example a.
I tried to brainstorm and came up with something like this:
countryToRemove = "United States of America"
for index, row in df.iterrows():
    if countryToRemove in row['countries']:
        # row to be removed
But it deletes any row with the US in it even if other countries were there.
Edit: My dataframe is as follows:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
1 [{'iso_3166_1': 'US', 'name': 'United States o...
2 []
If you have a dataframe like this:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
1 [{'iso_3166_1': 'US', 'name': 'United States o...
2 []
Then you can use boolean indexing to filter your dataframe:
mask = df.countries.apply(
    lambda x: len(s := set(d["name"] for d in x)) == 1
    and s.pop() == "United States of America"
)
print(df[~mask])
Prints:
countries
0 [{'iso_3166_1': 'DE', 'name': 'Germany'}, {'is...
2 []
EDIT: Version without := operator:
def fn(x):
    s = set(d["name"] for d in x)
    return len(s) == 1 and s.pop() == "United States of America"
mask = df.countries.apply(fn)
print(df[~mask])
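The examples in the question are shown in quotes, which suggests the cells may hold strings rather than real lists. If that is the case (an assumption, not something the question confirms), the mask above would fail when indexing d["name"]. A minimal sketch that parses each cell first with ast.literal_eval, using a hypothetical helper only_country:

```python
import ast
import pandas as pd

df = pd.DataFrame({'countries': [
    "[{'iso_3166_1': 'DE', 'name': 'Germany'}, "
    "{'iso_3166_1': 'US', 'name': 'United States of America'}, "
    "{'iso_3166_1': 'IN', 'name': 'India'}]",
    "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    "[]",
]})

def only_country(cell, name):
    """True when the cell lists exactly one distinct country, `name`."""
    # Parse string cells; pass real lists through unchanged.
    items = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return {d['name'] for d in items} == {name}

mask = df['countries'].apply(
    lambda cell: only_country(cell, 'United States of America')
)
filtered = df[~mask]
print(filtered.index.tolist())
```

Rows with several countries and rows with an empty list are kept, because their set of names is not exactly the single target name.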

Duplicate rows by list len in dataframe and add them as index

I have a dataframe where one of the columns has 2 or more elements inside a list format, like the following:
Email Country
0 john#gmail.com [Czech Republic, Singapore, United Kingdom]
1 Davies2#gmeail.com [Singapore, United Kingdom]
2 SooEng#gmail.com [United Kingdom, Czech Republic]
I need to do the following:
- Duplicate each row according to the length of its list in "Country" (so, for example, the first row would be duplicated twice)
- For each resulting row, use one of the list elements as the index (so, for example, one row would have Czech Republic as its index, another Singapore and another United Kingdom).
Does anyone know how I could do this?
Thank you!
You can use .explode() to 'duplicate' the rows:
import pandas as pd
df = pd.DataFrame([['john#gmail.com', ['Czech Republic', 'Singapore', 'United Kingdom']],
['Davies2#gmeail.com', ['Singapore', 'United Kingdom']],
['SooEng#gmail.com', ['United Kingdom', 'Czech Republic']]
], columns = ['Email', 'Country'])
df.explode('Country')
Result:
Email Country
0 john#gmail.com Czech Republic
0 john#gmail.com Singapore
0 john#gmail.com United Kingdom
1 Davies2#gmeail.com Singapore
1 Davies2#gmeail.com United Kingdom
2 SooEng#gmail.com United Kingdom
2 SooEng#gmail.com Czech Republic
To set the index use:
df.explode('Country').set_index('Country')
