How to convert columns into Numerical data? - python

All,
My pandas DataFrame currently looks like the table below, and I would like to convert it as described under "After Conversion". My requirements are below.
Current Dataframe:
df.tail()
age country females males total year
96 United States 72700 22700 95400 2010
97 United States 50300 14500 64800 2010
98 United States 35000 8730 43700 2010
99 United States 25200 4920 30100 2010
100 United State 51200 9570 60800 2010
After Conversion:
Note: I recognize that my required output is in JSON format, but basically I would like to reshape my females and males columns so that I can create a Gender column in my dataset with values 1 and 2, while also keeping the male and female counts in the dataset. I am a newbie to Python, so an explanation along with the code would be great!

You can do a little reshaping with melt, use map to encode the genders, and call to_dict to get a list of dictionaries.
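v = (df.melt(['age', 'country', 'total', 'year'])
       .rename({'variable': 'sex', 'total': 'people'}, axis=1))
v['sex'] = v['sex'].map({'males': 1, 'females': 2})
# drop() needs the columns= keyword and to_dict() the full 'records' name in recent pandas
data = v.drop(columns='value').to_dict('records')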
print(data)
[{'age': 96,
'country': 'United States',
'people': 95400,
'sex': 2,
'year': 2010},
{'age': 97,
'country': 'United States',
'people': 64800,
'sex': 2,
'year': 2010},
...
]
You may instead want JSON, so use
json_data = v.drop(columns='value').to_json(orient='records')
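Since the question also asks to keep the male and female counts, here is a minimal sketch of a variant that keeps the melted value column instead of dropping it. The two-row sample frame and the column name `count` are assumptions made for illustration; the values are copied from the question.

```python
import pandas as pd

# Small stand-in for the asker's frame (values copied from the question).
df = pd.DataFrame({
    "age": [96, 97],
    "country": ["United States", "United States"],
    "females": [72700, 50300],
    "males": [22700, 14500],
    "total": [95400, 64800],
    "year": [2010, 2010],
})

# Wide -> long: one row per (age, sex) pair; "count" is a hypothetical name.
long_df = df.melt(
    id_vars=["age", "country", "total", "year"],
    value_vars=["females", "males"],
    var_name="sex",
    value_name="count",
)

# Encode the gender labels as numbers, as requested in the question.
long_df["sex"] = long_df["sex"].map({"males": 1, "females": 2})

records = long_df.to_dict("records")
print(records[0])
```

Each record then carries the numeric `sex` code together with the count for that gender, so nothing from the original columns is lost.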

Related

Panda - how to produce several csv files registering group's data?

I have a dataframe with the following information :
Country        Region         Population
France         Pas de Calais  500000
France         Provence       200000
Switzerland    Geneva         400000
United States  Florida        1200000
Could you please indicate to me how I can process to get one CSV-file per country, with all data for that country ?
Many thanks in advance for your guideline
for country, df_country in df.groupby("Country"):
    df_country.to_csv(country + ".csv")
My recommendation is:
import pandas as pd
data = {'Country': ['France', 'France', 'Switzerland', 'United States'], 'Region': ['Pas de Calais', 'Provence', 'Geneva', 'Florida'], 'Population': [500000, 200000, 400000, 1200000]}
df = pd.DataFrame.from_dict(data)
grouped_df = df.groupby(by='Country')
for country, subdf in grouped_df:
    # save each sub-dataframe to its own file, named after the country
    subdf.to_csv(f"{country}.csv", index=False)
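A self-contained way to try the one-file-per-country idea without cluttering the working directory is sketched below. Writing into a temporary directory and the names `out_dir` and `written` are assumptions for the demo, not part of the answers above.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "Country": ["France", "France", "Switzerland", "United States"],
    "Region": ["Pas de Calais", "Provence", "Geneva", "Florida"],
    "Population": [500000, 200000, 400000, 1200000],
})

# Hypothetical output location: a fresh temporary directory.
out_dir = Path(tempfile.mkdtemp())

written = []
for country, subdf in df.groupby("Country"):
    # The group key is the country name, so it can serve as the file name.
    path = out_dir / f"{country}.csv"
    subdf.to_csv(path, index=False)
    written.append(path.name)

print(written)
# Reading one file back shows it holds only that country's rows.
print(pd.read_csv(out_dir / "France.csv"))
```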

Converting dataframe to dictionary with country by continent

I have a .csv and dataframe which has 2 columns (country, continent). I want to create a dictionary, carrying the continent as key and a list of all countries as values.
The .csv has the following format:
country  continent
Algeria  Africa
Angola   Africa
and so on.
I tried using:
continentsDict = dict([(con, cou) for con, cou in zip(continents.continent, continents.country)])
But this gave me the following output:
{'Africa': 'Zimbabwe', 'Asia': 'Yemen', 'Europe': 'Vatican City', 'North America': 'United States Virgin Islands', 'Oceania': 'Wallis and Futuna', 'South America': 'Venezuela'}
This is the right format, but it only kept the last value it found for each continent.
Does anyone have an idea?
Thank you!
Assuming continents is the instance of your pandas df, you could do:
continentsDict = continents.groupby("continent")["country"].apply(list).to_dict()
Given:
country continent
0 Algeria Africa
1 Angola Africa
Doing:
out = df.groupby('continent')['country'].agg(list).to_dict()
print(out)
Output:
{'Africa': ['Algeria', 'Angola']}
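To see why the dict-of-zip attempt keeps only one country per continent, here is a minimal sketch in plain Python. The three sample pairs are made up for illustration. A dict holds one value per key, so each repeated continent overwrites the previous country, whereas appending to a per-key list keeps them all.

```python
from collections import defaultdict

# Hypothetical sample of (country, continent) pairs.
pairs = [("Algeria", "Africa"), ("Angola", "Africa"), ("Yemen", "Asia")]

# dict() keeps only the last value seen for a repeated key.
last_only = dict((continent, country) for country, continent in pairs)

# Appending to a list per key keeps every country.
by_continent = defaultdict(list)
for country, continent in pairs:
    by_continent[continent].append(country)

print(last_only)
print(dict(by_continent))
```

The groupby(...).apply(list) answers above do the same collection step inside pandas.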

Transpose a dataframe to a nested list of list

I have a situation where I need to transpose a dataframe as shown below.
The input dataframe is:
input_data = [
['Asia', 'China', 'Beijing'],
['Asia', 'China', 'Shenzhen'],
['America', 'United States', 'New York'],
['America', 'Canada', 'Toronto']
]
input_df = pd.DataFrame(input_data)
input_df.columns = ['continents', 'countries', 'cities']
input_df
   continents  countries      cities
0  Asia        China          Beijing
1  Asia        China          Shenzhen
2  America     United States  New York
3  America     Canada         Toronto
The output data I want to get is
# only the unique values are allowed in the output list.
continents = ['Asia', 'America']
countries = [['China'], ['United States', 'Canada']]
cities = [[['Beijing', 'Shenzhen']], [['New York'], ['Toronto']]]
In this case the input data has three levels (Continents -> Countries -> Cities), but what I ultimately want is to take a hierarchical dataframe with any number of levels (no matter how many columns deep), produce output like the example, and then put it in a PyQt5 column view.
pandas.Series.tolist() converts the values of a Series to a list.
print(input_df['continents'].unique().tolist())
print(input_df.groupby('continents', sort=False)['countries'].apply(lambda x: x.unique().tolist()).tolist())
print(input_df.groupby(['continents', 'countries'], sort=False)['cities'].apply(lambda x: [x.unique().tolist()]).tolist())
['Asia', 'America']
[['China'], ['United States', 'Canada']]
[[['Beijing', 'Shenzhen']], [['New York']], [['Toronto']]]
As for a general approach, the first one that occurred to me is to loop through the columns of the df.
def list_wrapper(alist, times):
    for _ in range(times):
        alist = [alist]
    return alist
columns_name = input_df.columns.values.tolist()
for i in range(len(columns_name)):
    if i == 0:
        print(input_df[columns_name[i]].unique().tolist())
    else:
        print(input_df.groupby(columns_name[0:i], sort=False)[columns_name[i]].apply(lambda x: list_wrapper(x.unique().tolist(), i-1)).tolist())
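The groupby output above is grouped per (continent, country) pair, which is slightly flatter than the nesting shown in the question. As an alternative, here is a minimal recursive sketch that nests each column by all the columns before it. The helper name `nest` is hypothetical, and the approach assumes the columns are ordered from the outermost level to the innermost.

```python
import pandas as pd

input_df = pd.DataFrame(
    [
        ["Asia", "China", "Beijing"],
        ["Asia", "China", "Shenzhen"],
        ["America", "United States", "New York"],
        ["America", "Canada", "Toronto"],
    ],
    columns=["continents", "countries", "cities"],
)


def nest(df, cols, level, depth):
    """Return the unique values of cols[depth], nested by the columns before it."""
    if level == depth:
        return df[cols[depth]].unique().tolist()
    # One sub-list per group of the current level, in order of first appearance.
    return [
        nest(sub, cols, level + 1, depth)
        for _, sub in df.groupby(cols[level], sort=False)
    ]


cols = input_df.columns.tolist()
levels = [nest(input_df, cols, 0, depth) for depth in range(len(cols))]
for name, values in zip(cols, levels):
    print(name, "=", values)
```

Because the recursion depth follows the number of columns, the same function should handle deeper hierarchies without changes.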

Plotly: How to make data consistent across animation frames (i.e. avoiding vanishing data) in plotly express

This is starting to bug me: In plotly express when using animation_frame, I know it's important to set ranges so data can be displayed consistently, otherwise data may vanish across frames. But for a column with categorical values (say 'US', 'Russia', 'Germany'), I cannot find any way to avoid disappearing data when not every frame contains all categories if I want that column to appear with different colors (in the code below, that column would be 'AnotherColumn'). Plotly documentation points out
Animations are designed to work well when each row of input is present across all animation frames, and when categorical values mapped to symbol, color and facet are constant across frames. Animations may be misleading or inconsistent if these constraints are not met.
but while I can easily set a range_color when I have a continuous color range, nothing of the sort seems to work for categorical data. I can somewhat work around this by making my data numerical (e.g. 'US' -> 1, 'Russia' -> 2), but that is both fiddly and the result is visually unappealing.
import plotly.express as px
...
fig = px.bar(data, x="NameColumn",
y="SomeColumn",
color="AnotherColumn",
animation_frame="AnimationColumn",
range_y=[0, max_y]
)
Here is a simple reproducible example:
import pandas as pd
import plotly.express as px
data_dict = {'ColorColumn': ['p', 'p', 'p', 'q'],
'xColumn': ['someName', 'someOtherName', 'someName', 'someOtherName'],
'yColumn': [10, 20, 30, 40],
'animationColumn': [1, 1, 2, 2]}
data = pd.DataFrame(data=data_dict)
fig = px.bar(data, x="xColumn",
y="yColumn",
color="ColorColumn",
animation_frame="animationColumn",
range_y=[0, 40]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
If you try it out, you'll notice the second frame is missing a bar. If the ColorColumn had numeric data, you could fix this by specifying range_color (similar to the specification of range_y in the code above); my question would be, how to handle this with categorical data?
Second edit: Some commenters requested additional data or a more reasonable example. This might be more appropriate:
import pandas as pd
import plotly.express as px
data_dict = {'Region': ['North America', 'Asia', 'Asia',
'North America', 'Asia', 'Europe',
'North America', 'Europe', 'Asia'],
'Country': ['US', 'China', 'Korea',
'US', 'Phillipines', 'France',
'Canada', 'Germany', 'Thailand'],
'GDP': [10, 20, 30,
40, 50, 60,
70, 80, 90],
'Year': [2017, 2017, 2017,
2018, 2018, 2018,
2019, 2019, 2019]}
data = pd.DataFrame(data=data_dict)
fig = px.bar(data, x="Country",
y="GDP",
color="Region",
animation_frame="Year",
range_y=[0, 80]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
A similar question has been asked and answered under Plotly: How to specify categorical x-axis elements in a plotly express animation?. The necessary adjustments for your use case aren't exactly straightforward though, so I might as well set it up for you.
It all boils down to this setup using, among other things:
df['key']=df.groupby(['Year','Country']).cumcount()
df1 = pd.pivot_table(df,index='Year',columns=['key', 'Country'],values='GDP')
And:
df1 = pd.merge(df1, data[['Country', 'Region']], how='left', on='Country').drop_duplicates()
Using some neat properties of pd.pivot_table, this will give you a dataset that has all years and all countries for all regions even though GDP from these have not been specified.
The first two animation frames will look like this:
Complete code:
import pandas as pd
import plotly.express as px
data_dict = {'Region': ['North America', 'Asia', 'Asia',
'North America', 'Asia', 'Europe',
'North America', 'Europe', 'Asia'],
'Country': ['US', 'China', 'Korea',
'US', 'Phillipines', 'France',
'Canada', 'Germany', 'Thailand'],
'GDP': [10, 20, 30,
40, 50, 60,
70, 80, 90],
'Year': [2017, 2017, 2017,
2018, 2018, 2018,
2019, 2019, 2019]}
data = pd.DataFrame(data=data_dict)
# data munging
df = data.copy()
df['key']=df.groupby(['Year','Country']).cumcount()
df1 = pd.pivot_table(df,index='Year',columns=['key', 'Country'],values='GDP')
df1 = df1.stack(level=[0,1],dropna=False).reset_index()
df1 = pd.merge(df1, data[['Country', 'Region']], how='left', on='Country').drop_duplicates()
df1.columns=['Year', 'Key', 'Country', 'GDP', 'Region']
fig = px.bar(df1, x="Country",
y="GDP",
color="Region",
animation_frame="Year",
range_y=[0, 80]
)
fig.update_layout(xaxis={'title': '',
'visible': True,
'showticklabels': True})
fig.show()
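To see what the reshaping achieves without plotting anything, here is a minimal sketch on a reduced, made-up sample. It uses pivot followed by melt, which is a swapped-in alternative to the pivot_table and stack calls above rather than the same code. The point is only that the wide table has a cell for every year and country, so melting it back yields one row per pair with NaN where no GDP was given.

```python
import pandas as pd

# Reduced, hypothetical sample: France is missing in 2017, China in 2018.
data = pd.DataFrame({
    "Country": ["US", "China", "US", "France"],
    "GDP": [10, 20, 40, 60],
    "Year": [2017, 2017, 2018, 2018],
})

# Wide table: one row per year, one column per country, NaN where missing.
wide = data.pivot(index="Year", columns="Country", values="GDP")

# Back to long form: every (Year, Country) pair now has a row.
full = wide.reset_index().melt(id_vars="Year", var_name="Country", value_name="GDP")

print(full.sort_values(["Year", "Country"]).to_string(index=False))
```

A Region column would still have to be merged back in per country, as the complete code above does.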
The following is not a direct answer to your question (as in, what do I need to change in plotly), but rather focuses on making the data in your DataFrame consistent.
The basic idea is that the "Primary Key" of each of your rows in the second example is ["Year", "Country"]. plotly will now expect a value for "GDP" as well as "Region" for each combination of those. The following creates a DataFrame that looks just like that (by reindexing with a MultiIndex).
unique_years = data["Year"].unique()
unique_countries = data["Country"].unique()
# Let's first separate the region of a country
region_per_country = data[["Country", "Region"]].drop_duplicates().set_index("Country")
# Removing the region
data = data[["Year", "Country", "GDP"]].set_index(["Year", "Country"])
# Creating all possible "Year" "Country" combinations
data = data.reindex(pd.MultiIndex.from_product([unique_years, unique_countries]))
# Cleanup
data = data.reset_index().rename(columns={"level_0": "Year", "level_1": "Country"})
# Re-adding the region
data = data.merge(region_per_country, left_on="Country", right_index=True)
Running this gives us the following DataFrame (shown without the .reset_index()):
                   GDP         Region
Year Country
2017 Canada        NaN  North America
     China        20.0           Asia
     France        NaN         Europe
     Germany       NaN         Europe
     Korea        30.0           Asia
     Phillipines   NaN           Asia
     Thailand      NaN           Asia
     US           10.0  North America
2018 Canada        NaN  North America
     China         NaN           Asia
     France       60.0         Europe
     Germany       NaN         Europe
     Korea         NaN           Asia
     Phillipines  50.0           Asia
     Thailand      NaN           Asia
     US           40.0  North America
2019 Canada       70.0  North America
     China         NaN           Asia
     France        NaN         Europe
     Germany      80.0         Europe
     Korea         NaN           Asia
     Phillipines   NaN           Asia
     Thailand     90.0           Asia
     US            NaN  North America
which plotly will then correctly plot.

Creating mean and sum dataframe using groupby

I have this dataframe whereby I wish to calculate the mean and sum from the column 'Score'. I do not want to use the .groupby().agg() method.
df = pd.DataFrame({
'Country': ['Germany', 'Germany', 'Canada', 'Canada'],
'Score': [8, 4, 35, 50],
'Continent': ['Europe', 'Europe', 'North America', 'North America']},
columns=['Country','Score','Continent'])
print (df)
Dataframe becomes:
Country Score Continent
0 Germany 8 Europe
1 Germany 4 Europe
2 Canada 35 North America
3 Canada 50 North America
The easiest method I have found is:
new_df = df.groupby('Continent')['Score'].agg(sum='sum', avg='mean')
               sum   avg
Continent
Europe          12   6.0
North America   85  42.5
I now have 2 series average and total. How do I make that into a new dataframe using the index from .groupby('Continent')?
I'm trying to use the group, frame method here:
for group, frame in df.groupby('Continent'):
    avg = np.average(frame['Score'])
    total = np.sum(frame['Score'])
    df['avg'] = avg
    df['sum'] = total
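The loop above assigns a single scalar to a whole column of df on every pass, so only the values from the last group survive. One possible way to keep the group, frame loop and still end up with a new dataframe indexed by continent is sketched below. The intermediate dict `rows` is a hypothetical name, and this is only one of several ways to do it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["Germany", "Germany", "Canada", "Canada"],
    "Score": [8, 4, 35, 50],
    "Continent": ["Europe", "Europe", "North America", "North America"],
})

rows = {}
for group, frame in df.groupby("Continent"):
    # One entry per continent, keyed by the group label.
    rows[group] = {
        "sum": np.sum(frame["Score"]),
        "avg": np.average(frame["Score"]),
    }

# The dict keys become the index of the new frame.
new_df = pd.DataFrame.from_dict(rows, orient="index")
new_df.index.name = "Continent"
print(new_df)
```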
