I have a dataframe with the following information:

Country        Region         Population
France         Pas de Calais  500000
France         Provence       200000
Switzerland    Geneva         400000
United States  Florida        1200000
Could you please tell me how to get one CSV file per country, containing all the data for that country?
Many thanks in advance for your guidance.
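# Use .iloc[0] rather than [0]: the index label 0 only exists in the first group, so Country[0] raises a KeyError for the others.
df.groupby("Country").apply(lambda df_country: df_country.to_csv(df_country.Country.iloc[0] + ".csv"))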
My recommendation is:
import pandas as pd
data = {'Country': ['France', 'France', 'Switzerland', 'United States'], 'Region': ['Pas de Calais', 'Provence', 'Geneva', 'Florida'], 'Population': [500000, 200000, 400000, 1200000]}
df = pd.DataFrame.from_dict(data)
grouped_df = df.groupby(by='Country')
for country, subdf in grouped_df:
    # save each sub-dataframe to its own file, named after the country
    subdf.to_csv(f"{country}.csv", index=False)
I have a .csv file, loaded into a dataframe, with 2 columns (country, continent). I want to create a dictionary with each continent as a key and a list of all its countries as the value.
The .csv has the following format:
country  continent
Algeria  Africa
Angola   Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
This is the right format, but it only keeps the last country found for each continent.
Does anyone have an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
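For comparison, here is a minimal plain-Python sketch of the same grouping. It also shows why the original attempt lost values: building a dict from (key, value) pairs overwrites earlier values for a repeated key, whereas appending to a list keeps them all. The sample lists and the names `countries`, `continents_col` and `continents_dict` are made up for illustration and stand in for the two dataframe columns.

```python
from collections import defaultdict

# Hypothetical sample values standing in for the country and continent columns.
countries = ["Algeria", "Angola", "Yemen"]
continents_col = ["Africa", "Africa", "Asia"]

# Append each country to the list of its continent instead of overwriting it.
grouped = defaultdict(list)
for con, cou in zip(continents_col, countries):
    grouped[con].append(cou)

continents_dict = dict(grouped)
print(continents_dict)
```

With a real dataframe, the two lists would be replaced by `continents.continent` and `continents.country`.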
I have a situation where I need to transpose a dataframe as shown below.
The input dataframe is:
input_data = [
['Asia', 'China', 'Beijing'],
['Asia', 'China', 'Shenzhen'],
['America', 'United States', 'New York'],
['America', 'Canada', 'Toronto']
]
input_df = pd.DataFrame(input_data)
input_df.columns = ['continents', 'countries', 'cities']
input_df
  continents      countries    cities
0       Asia          China   Beijing
1       Asia          China  Shenzhen
2    America  United States  New York
3    America         Canada   Toronto
The output data I want to get is
# only the unique values are allowed in the output list.
continents = ['Asia', 'America']
countries = [['China'], ['United States', 'Canada']]
cities = [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]
In this case the input data has three levels (Continents -> Countries -> Cities), but what I ultimately want is to take a multi-level hierarchical dataframe, no matter how many columns deep it is, produce output like the example above, and then show it in a PyQt5 column view.
pandas.Series.tolist() converts the values of a Series to a list.
print(input_df['continents'].unique().tolist())
print(input_df.groupby('continents', sort=False)['countries'].apply(lambda x: x.unique().tolist()).tolist())
print(input_df.groupby(['continents', 'countries'], sort=False)['cities'].apply(lambda x: [x.unique().tolist()]).tolist())
['Asia', 'America']
[['China'], ['United States', 'Canada']]
[[['Beijing', 'Shenzhen']], [['New York']], [['Toronto']]]
As for a general approach, the first one that occurred to me is to loop through the columns of the dataframe.
def list_wrapper(alist, times):
    for _ in range(times):
        alist = [alist]
    return alist
columns_name = input_df.columns.values.tolist()
for i in range(len(columns_name)):
    if i == 0:
        print(input_df[columns_name[i]].unique().tolist())
    else:
        print(input_df.groupby(columns_name[0:i], sort=False)[columns_name[i]].apply(lambda x: list_wrapper(x.unique().tolist(), i-1)).tolist())
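Note that the printed cities list above has one entry per (continent, country) pair, while the requested output nests the cities under their continent. If that nesting matters, one possible approach is to recurse one column at a time. The following is only a sketch under that assumption; the helper name `nested_level` is hypothetical and not part of pandas.

```python
import pandas as pd

input_df = pd.DataFrame(
    [['Asia', 'China', 'Beijing'],
     ['Asia', 'China', 'Shenzhen'],
     ['America', 'United States', 'New York'],
     ['America', 'Canada', 'Toronto']],
    columns=['continents', 'countries', 'cities'])

def nested_level(df, columns, depth):
    """Unique values of columns[depth], nested once per parent column."""
    if depth == 0:
        return df[columns[0]].unique().tolist()
    # Split on the outermost column, then recurse into the remaining ones.
    return [nested_level(sub, columns[1:], depth - 1)
            for _, sub in df.groupby(columns[0], sort=False)]

cols = input_df.columns.tolist()
levels = [nested_level(input_df, cols, d) for d in range(len(cols))]
for level in levels:
    print(level)
```

Because the function only looks at the list of column names, it should work for any number of hierarchy columns, as long as they are ordered from the outermost level to the innermost.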
This is starting to bug me: In plotly express when using animation_frame, I know it's important to set ranges so data can be displayed consistently, otherwise data may vanish across frames. But for a column with categorical values (say 'US', 'Russia', 'Germany'), I cannot find any way to avoid disappearing data when not every frame contains all categories if I want that column to appear with different colors (in the code below, that column would be 'AnotherColumn'). Plotly documentation points out
Animations are designed to work well when each row of input is present across all animation frames, and when categorical values mapped to symbol, color and facet are constant across frames. Animations may be misleading or inconsistent if these constraints are not met.
but while I can easily set a range_color when I have a continuous color range, nothing of the sort seems to work for categorical data. I can somewhat work around this by making my data numerical (e.g. 'US' -> 1, 'Russia' -> 2), but that is both fiddly and visually unappealing.
import plotly.express as px
...
fig = px.bar(data, x="NameColumn",
y="SomeColumn",
color="AnotherColumn",
animation_frame="AnimationColumn",
range_y=[0, max_y]
)
Here is a simple reproducible example:
import pandas as pd
import plotly.express as px
data_dict = {'ColorColumn': ['p', 'p', 'p', 'q'],
'xColumn': ['someName', 'someOtherName', 'someName', 'someOtherName'],
'yColumn': [10, 20, 30, 40],
'animationColumn': [1, 1, 2, 2]}
data = pd.DataFrame(data=data_dict)
fig = px.bar(data, x="xColumn",
y="yColumn",
color="ColorColumn",
animation_frame="animationColumn",
range_y=[0, 40]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
If you try it out, you'll notice the second frame is missing a bar. If the ColorColumn had numeric data, you could fix this by specifying range_color (similar to the specification of range_y in the code above); my question would be, how to handle this with categorical data?
Second edit: Some people requested additional data or a more reasonable example. This might be more appropriate:
import pandas as pd
import plotly.express as px
data_dict = {'Region': ['North America', 'Asia', 'Asia',
'North America', 'Asia', 'Europe',
'North America', 'Europe', 'Asia'],
'Country': ['US', 'China', 'Korea',
'US', 'Phillipines', 'France',
'Canada', 'Germany', 'Thailand'],
'GDP': [10, 20, 30,
40, 50, 60,
70, 80, 90],
'Year': [2017, 2017, 2017,
2018, 2018, 2018,
2019, 2019, 2019]}
data = pd.DataFrame(data=data_dict)
fig = px.bar(data, x="Country",
y="GDP",
color="Region",
animation_frame="Year",
range_y=[0, 80]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
A similar question has been asked and answered under Plotly: How to specify categorical x-axis elements in a plotly express animation?. The necessary adjustments for your use case aren't exactly straightforward though, so I might as well set it up for you.
It all boils down to this setup using, among other things:
df['key']=df.groupby(['Year','Country']).cumcount()
df1 = pd.pivot_table(df,index='Year',columns=['key', 'Country'],values='GDP')
And:
df1 = pd.merge(df1, data[['Country', 'Region']], how='left', on='Country').drop_duplicates()
Using some neat properties of pd.pivot_table, this gives you a dataset that contains every year and every country for every region, even where no GDP value has been specified.
The first two animation frames will look like this: [figures not included]
Complete code:
import pandas as pd
import plotly.express as px
data_dict = {'Region': ['North America', 'Asia', 'Asia',
'North America', 'Asia', 'Europe',
'North America', 'Europe', 'Asia'],
'Country': ['US', 'China', 'Korea',
'US', 'Phillipines', 'France',
'Canada', 'Germany', 'Thailand'],
'GDP': [10, 20, 30,
40, 50, 60,
70, 80, 90],
'Year': [2017, 2017, 2017,
2018, 2018, 2018,
2019, 2019, 2019]}
data = pd.DataFrame(data=data_dict)
# data munging
df = data.copy()
df['key']=df.groupby(['Year','Country']).cumcount()
df1 = pd.pivot_table(df,index='Year',columns=['key', 'Country'],values='GDP')
df1 = df1.stack(level=[0,1],dropna=False).reset_index()
df1 = pd.merge(df1, data[['Country', 'Region']], how='left', on='Country').drop_duplicates()
df1.columns=['Year', 'Key', 'Country', 'GDP', 'Region']
fig = px.bar(df1, x="Country",
y="GDP",
color="Region",
animation_frame="Year",
range_y=[0, 80]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
The following is not a direct answer to your question (as in, what you need to change in plotly), but rather focuses on making the data in your DataFrame consistent.
The basic idea is that the "Primary Key" of each row in your second example is ["Year", "Country"]. plotly expects a value for "GDP" as well as "Region" for each combination of those. The following creates a DataFrame that looks exactly like that (by reindexing with a MultiIndex).
unique_years = data["Year"].unique()
unique_countries = data["Country"].unique()
# Let's first separate the region of a country
region_per_country = data[["Country", "Region"]].drop_duplicates().set_index("Country")
# Removing the region
data = data[["Year", "Country", "GDP"]].set_index(["Year", "Country"])
# Creating all possible "Year" "Country" combinations
data = data.reindex(pd.MultiIndex.from_product([unique_years, unique_countries]))
# Cleanup
data = data.reset_index().rename(columns={"level_0": "Year", "level_1": "Country"})
# Re-adding the region
data = data.merge(region_per_country, left_on="Country", right_index=True)
Running this gives us the following DataFrame (shown without the .reset_index()):
                   GDP         Region
Year Country
2017 Canada        NaN  North America
     China        20.0           Asia
     France        NaN         Europe
     Germany       NaN         Europe
     Korea        30.0           Asia
     Phillipines   NaN           Asia
     Thailand      NaN           Asia
     US           10.0  North America
2018 Canada        NaN  North America
     China         NaN           Asia
     France       60.0         Europe
     Germany       NaN         Europe
     Korea         NaN           Asia
     Phillipines  50.0           Asia
     Thailand      NaN           Asia
     US           40.0  North America
2019 Canada       70.0  North America
     China         NaN           Asia
     France        NaN         Europe
     Germany      80.0         Europe
     Korea         NaN           Asia
     Phillipines   NaN           Asia
     Thailand     90.0           Asia
     US            NaN  North America
which plotly will then correctly plot.
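Before plotting, it can help to check whether every animation frame really contains every category. The following is a minimal sketch of such a check using pandas only; the helper name `frames_are_complete` and the small sample frame are made up for illustration.

```python
import pandas as pd

def frames_are_complete(df, frame_col, category_col):
    """Return True if every frame holds every value of category_col."""
    per_frame = df.groupby(frame_col)[category_col].nunique()
    return bool((per_frame == df[category_col].nunique()).all())

# Hypothetical sample: 2018 has no row for China.
raw = pd.DataFrame({'Year': [2017, 2017, 2018],
                    'Country': ['US', 'China', 'US'],
                    'GDP': [10, 20, 40]})

# Complete the (Year, Country) grid with a MultiIndex reindex; missing GDP becomes NaN.
full_index = pd.MultiIndex.from_product(
    [raw['Year'].unique(), raw['Country'].unique()], names=['Year', 'Country'])
completed = raw.set_index(['Year', 'Country']).reindex(full_index).reset_index()

print(frames_are_complete(raw, 'Year', 'Country'))        # False
print(frames_are_complete(completed, 'Year', 'Country'))  # True
```

The check only tells you whether the rows are there; the color column (here it would be "Region") still has to be filled in for the added rows, as done in the merge step above.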
I have this dataframe and I want to calculate the mean and sum of the 'Score' column for each continent. I do not want to use the .groupby().agg() method.
df = pd.DataFrame({
'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
'Score': [8, 4, 35, 50],
'Continent': ['Europe', 'Europe', 'North America', 'North America']},
columns=['Country','Score','Continent'])
print (df)
Dataframe becomes:
Country Score Continent
0 Germany 8 Europe
1 Germany 4 Europe
2 Canada 35 North America
3 Canada 50 North America
The easiest method I have found is:
# Named aggregation; passing a renaming dict to SeriesGroupBy.agg raises an error since pandas 1.0.
new_df = df.groupby('Continent')['Score'].agg(sum='sum', avg='mean')
               sum   avg
Continent
Europe          12   6.0
North America   85  42.5
I now have two values per group, the average and the total. How do I turn them into a new dataframe that uses the index from .groupby('Continent')?
I'm trying to use the "for group, frame" loop here:
for group, frame in df.groupby('Continent'):
    avg = np.average(frame['Score'])
    total = np.sum(frame['Score'])
    df['avg'] = avg
    df['sum'] = total
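The loop above assigns a single scalar to the whole df['avg'] and df['sum'] columns, so each iteration overwrites the previous one. One possible way to keep the loop and still get a dataframe indexed by continent is to collect the per-group values in a dict and build the dataframe at the end. This is only a sketch; the names `rows` and `new_df` are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
    'Score': [8, 4, 35, 50],
    'Continent': ['Europe', 'Europe', 'North America', 'North America']})

# Collect one entry of statistics per group instead of writing scalars back into df.
rows = {}
for group, frame in df.groupby('Continent'):
    rows[group] = {'sum': np.sum(frame['Score']),
                   'avg': np.average(frame['Score'])}

# orient='index' turns the dict keys (the continent names) into the index.
new_df = pd.DataFrame.from_dict(rows, orient='index')
new_df.index.name = 'Continent'
print(new_df)
```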