Creating single chart from three categoric values using python

I am fairly new to Python and its terminology, so I may be clumsy at describing the problem. Sorry for that.
I have three cities that produced three fruits over two years, and I need to draw a single static chart that best summarizes the data.
The dataframe has three categorical variables (city, fruits and year) and one measure, which confuses me.
At first I tried a stacked bar chart, but if I put the fruits in the bars and the cities on the x-axis, I cannot find a place for the year.
I also tried the pivot method to turn the year values into measures, but then I could not get any further with two measures.
I mainly use Matplotlib.
Any help is appreciated.
import pandas as pd

data = {
'city':['amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','paris','paris','paris','paris','paris','paris','berlin','berlin','berlin','berlin','berlin','berlin'],
'fruits':['apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas'],
'year':[2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001],
'amount':[384,289,347,242,390,274,175,334,245,116,252,366,255,400,300,240,600,180]
}
df=pd.DataFrame(data)
df.head()
        city   fruits  year  amount
0  amsterdam   apples  2000     384
1  amsterdam  oranges  2000     289
2  amsterdam  bananas  2000     347
3  amsterdam   apples  2001     242
4  amsterdam  oranges  2001     390
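One possible way to fit all three categories into a single static chart is to treat each (city, year) pair as one bar and stack the fruits inside it. The sketch below assumes that layout is acceptable. The name `wide` and the figure size are my own choices, not something from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same data as in the question, written more compactly.
data = {
    'city': ['amsterdam'] * 6 + ['paris'] * 6 + ['berlin'] * 6,
    'fruits': ['apples', 'oranges', 'bananas'] * 6,
    'year': ([2000] * 3 + [2001] * 3) * 3,
    'amount': [384, 289, 347, 242, 390, 274, 175, 334, 245,
               116, 252, 366, 255, 400, 300, 240, 600, 180],
}
df = pd.DataFrame(data)

# One row per (city, year) pair, one column per fruit.
wide = df.pivot_table(index=['city', 'year'], columns='fruits',
                      values='amount', aggfunc='sum')

# Each bar is a city/year pair; the stacked segments are the fruits.
ax = wide.plot(kind='bar', stacked=True, figsize=(8, 4))
ax.set_xlabel('city, year')
ax.set_ylabel('amount')
ax.legend(title='fruits')
plt.tight_layout()
# Call plt.show() or ax.figure.savefig(...) to view the chart.
```

Dropping `stacked=True` gives grouped bars instead, which may be easier to read if comparing individual fruits matters more than comparing totals.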

Related

Creating ID for every row based on the observations in variable

I want to create a system in Python where each observation in a variable maps to a number. The numbers from the (in this case) 5 variables together form a unique code. The first digit corresponds to the first variable. When an observation in a later row is the same as one in an earlier row, the same number applies. As illustrated in the example, if Apple appears in rows 1 and 3, both IDs get a '1' as their first digit.
The output should be a new column with the ID. If all the observations in two rows are the same, their IDs will be the same. In the picture below you see 5 variables leading to the unique ID on the right, which should be the output.

You can use pd.factorize:
df['UniqueID'] = (df.apply(lambda x: (1+pd.factorize(x)[0]).astype(str))
.agg(''.join, axis=1))
print(df)
# Output
        Fruit     Toy Letter      Car Country UniqueID
0       Apple    Bear      A  Ferrari  Brazil    11111
1  Strawberry  Blocks      B  Peugeot   Chile    22222
2       Apple  Blocks      C  Renault   China    12333
3      Orange    Bear      D     Saab   China    31443
4      Orange    Bear      D  Ferrari   India    31414
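For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The input frame is reconstructed from the output table above. Instead of `df.apply`, it factorizes column by column in a dict comprehension, which is my own variation. The intermediate name `codes` is hypothetical.

```python
import pandas as pd

# Input reconstructed from the output table shown in the answer.
df = pd.DataFrame({
    'Fruit':   ['Apple', 'Strawberry', 'Apple', 'Orange', 'Orange'],
    'Toy':     ['Bear', 'Blocks', 'Blocks', 'Bear', 'Bear'],
    'Letter':  ['A', 'B', 'C', 'D', 'D'],
    'Car':     ['Ferrari', 'Peugeot', 'Renault', 'Saab', 'Ferrari'],
    'Country': ['Brazil', 'Chile', 'China', 'China', 'India'],
})

# pd.factorize returns (codes, uniques); codes start at 0 and follow
# the order of first appearance, so adding 1 makes them start at 1.
codes = pd.DataFrame(
    {col: pd.factorize(df[col])[0] + 1 for col in df.columns},
    index=df.index,
)

# Concatenate the per-column codes of each row into one string.
df['UniqueID'] = codes.astype(str).agg(''.join, axis=1)
print(df)
```

One caveat: once a column has more than 9 distinct values, plain concatenation becomes ambiguous (for example "1" + "12" and "11" + "2" both give "112"), so a separator or zero-padding may be needed.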

In Python, how do I prevent a name from repeating in a dataset and sum or average the information in the data?

I am analyzing COVID-19 (coronavirus) data published by Our World in Data on GitHub. The results will be presented in an organized table format.
Link to data
https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
Project Goals:
List every country in alphabetical order
List the total number of tests conducted by each country
List the total number of vaccinated people in each country
List the total number of deaths in each country
Show the average age of death from COVID-19 in each country
Here is an example of the data
iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.122,0.122,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
Overall question:
How do I aggregate the rows for each country so that the country does not repeat? I want each country to appear once, with the total number of vaccinations from the data next to it. Example of how I want the data displayed:
Country Vaccinations
Afghanistan 30235
Albania 15032
Andorra 2352
I have tried to import the data with pandas and sum it, but I am not sure how to get the result for one specific country. I want to write it so I can display a single country by itself, but I end up creating a new table and confusing myself. I am a beginner and this data set is very large.

import pandas as pd

df = pd.read_csv(r'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df[['location', 'total_vaccinations']].groupby('location').sum().reset_index()
           location  total_vaccinations
0       Afghanistan        4.013568e+08
1            Africa        1.853572e+11
2           Albania        3.587615e+08
3           Algeria        2.983973e+08
4           Andorra        4.593697e+06
..              ...                 ...
243  Western Sahara        0.000000e+00
244           World        4.857404e+12
245           Yemen        2.126927e+07
246          Zambia        5.435039e+08
247        Zimbabwe        3.211422e+09

[248 rows x 2 columns]
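One thing to watch: the `total_*` columns in this file are running totals per date, so summing them over all dates counts the same doses many times. A hedged alternative is to take the largest value per country instead. The sketch below uses a few made-up rows rather than the real file so that it runs offline. The numbers are illustrative only and `summary` is a name I chose.

```python
import pandas as pd

# Made-up rows that mimic the layout of the OWID file (values are illustrative only).
df = pd.DataFrame({
    'location': ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'date': ['2021-03-01', '2021-03-02', '2021-03-03', '2021-03-01', '2021-03-02'],
    'total_vaccinations': [100.0, None, 250.0, 40.0, 90.0],
})

# For a cumulative column, the largest value per country is the total so far.
# max() skips missing values, which are common in this data set.
summary = (df.groupby('location', as_index=False)['total_vaccinations']
             .max()
             .sort_values('location'))
print(summary)

# Selecting a single country afterwards is a plain boolean filter.
print(summary[summary['location'] == 'Afghanistan'])
```

For daily columns such as `new_vaccinations`, `.sum()` would be the natural aggregate instead of `.max()`.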

Merging different sized data frames and plotting the difference of a column

I have two dataframes, Region_education_0 and Region_education_1.
Region_education_0
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
5      Melanesia              605
Region_education_1
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
Index 5, Melanesia, is not present in Region_education_1 because of a filter condition. I want to compare the two dataframes and plot the difference, so I tried this:
from matplotlib.pyplot import *
Region_education_combined=Region_education_0.merge(Region_education_1,left_on="Region",right_on="Region")
Region_education_combined.columns=["Region","Max of Bachelors Higher Ed","Higher Formal Education"]
Region_education_combined['Diff_HigherEd_Vals'] = Region_education_combined['Higher Formal Education'] - Region_education_combined['Max of Bachelors Higher Ed']
print(Region_education_combined)
comp_df.style.bar(subset=['Diff_HigherEd_Vals'], align='mid', color=['#d65f5f', '#5fba7d'])
index  Max of Bachelors Higher Ed  Higher Formal Education  Diff_HigherEd_Vals
1      151698.500659               122573.834171            -29124.666488
2      28413.753425                53562.111111             53562.111111
3      3944.750000                 5883.000000              1938.250000
4      45091.041667                27052.384615             -18038.657051
The Region column is missing from the output. To include it, I tried
comp_df.style.bar(subset=['Diff_HigherEd_Vals','Region'], align='mid', color=['#d65f5f', '#5fba7d'])
and
comp_df.style.bar(Region_education_combined, align='mid', color=['#d65f5f', '#5fba7d'])
Is there any way to include Region in the final output?
Also, "Index 5, Melanesia" from the Region_education_0 dataframe is left out. Is there any way to include it in the output too?

You can keep the missing Region by passing how="outer" when you call merge, like this:
Region_education_combined = Region_education_0.merge(Region_education_1, left_on="Region", right_on="Region", how="outer")
Note that the resulting table will contain NaN wherever a row has no match. In your case, Melanesia will have NaN in the Higher Formal Education column. To avoid problems, you can set a default value like this:
Region_education_combined["Higher Formal Education"] = Region_education_combined["Higher Formal Education"].fillna(0)
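A minimal runnable sketch of the outer merge, using the values from the question's tables. The lowercase variable names, the `suffixes` values and the `Diff` column name are my own choices for illustration. Because both input tables in the question list the same numbers, the difference is zero everywhere except Melanesia.

```python
import pandas as pd

region_education_0 = pd.DataFrame({
    'Region': ['Australia/New Zealand', 'Caribbean', 'Central Asia',
               'East Asia', 'Melanesia'],
    'ConvertedComp': [122573.834171, 53562.111111, 134422.0,
                      112492.507042, 605.0],
})
# Second frame: same rows without Melanesia, as in the question.
region_education_1 = (
    region_education_0[region_education_0['Region'] != 'Melanesia'].copy()
)

# how='outer' keeps regions that appear in only one of the two frames.
combined = region_education_0.merge(
    region_education_1, on='Region', how='outer', suffixes=('_0', '_1')
)

# Melanesia has no match in the second frame, so ConvertedComp_1 is NaN there.
combined['ConvertedComp_1'] = combined['ConvertedComp_1'].fillna(0)
combined['Diff'] = combined['ConvertedComp_1'] - combined['ConvertedComp_0']
print(combined)
```

Since `Region` stays an ordinary column of `combined`, it is still there when the frame is printed or styled. Whether 0 is a sensible default for a missing region depends on what the plot is meant to show, and leaving the NaN in place is also an option.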

Plot against dummy variables and grouped values

These are some values from the table I have:
country colour ...
1 Spain red
2 USA blue
3 Greece green
4 Italy white
5 USA red
6 USA blue
7 Spain red
I want to group the countries together and plot them with the country on the x-axis and the number of each colour counted per country. For example, USA has 2 blues and 1 red, Spain has 2 reds, etc. I want this as a bar chart, using either matplotlib or seaborn.
I would assume I would have to use dummy variables for the 'colours' column but I'm not sure how to plot against a grouped column and dummy variables.
Much appreciated if you could show and explain the process. Thank you.

Try with crosstab:
pd.crosstab(df['country'], df['colour']).plot.bar()
Output: (bar chart image not included)
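A self-contained sketch of the crosstab approach, using the seven rows from the question. It also prints the intermediate count table, which shows that no dummy variables are needed. The name `counts` is my own.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'country': ['Spain', 'USA', 'Greece', 'Italy', 'USA', 'USA', 'Spain'],
    'colour':  ['red', 'blue', 'green', 'white', 'red', 'blue', 'red'],
})

# Rows are countries, columns are colours, cells are occurrence counts.
counts = pd.crosstab(df['country'], df['colour'])
print(counts)

# One group of bars per country, one bar per colour.
ax = counts.plot.bar()
ax.set_ylabel('count')
plt.tight_layout()
# Call plt.show() to display the figure.
```

Passing `stacked=True` to `plot.bar` would stack the colours into one bar per country instead of placing them side by side.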

Grouping by multiple years in a single column and plotting the result stacked

I have a dataframe that looks like this, with the default pandas index starting at 0:
index Year Count Name
0 2005 70000 Apple
1 2005 60000 Banana
2 2006 20000 Pineapple
3 2007 70000 Cherry
4 2007 60000 Coconut
5 2007 40000 Pear
6 2008 90000 Grape
7 2008 10000 Apricot
I would like to create a stacked bar plot of this data.
However, using the df.groupby() function will only allow me to call a function such as .mean() or .count() on this data in order to plot the data by year. I am getting the following result which separates each data point and does not group them by the shared year.
I have seen the matplotlib example for stacked bar charts, but they are grouped by a common index, in this case I do not have a common index I want to plot by. Is there a way to group and plot this data without rearranging the entire dataframe?

If I understood you correctly, you could do this using pivot first:
df1 = pd.pivot_table(df, values='Count', index='Year', columns='Name')
df1.plot(kind='bar')
Output: (bar chart image not included)
Or with the argument stacked=True:
df1.plot(kind='bar', stacked=True)
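Here is a runnable sketch of the pivot approach with the data from the question. Two details are my own assumptions rather than part of the answer above: `aggfunc='sum'` (the default for `pivot_table` is the mean, which only matters if a Year/Name pair appears more than once) and `fill_value=0` (so year/name combinations with no data become 0 instead of NaN).

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Year':  [2005, 2005, 2006, 2007, 2007, 2007, 2008, 2008],
    'Count': [70000, 60000, 20000, 70000, 60000, 40000, 90000, 10000],
    'Name':  ['Apple', 'Banana', 'Pineapple', 'Cherry',
              'Coconut', 'Pear', 'Grape', 'Apricot'],
})

# One row per year, one column per name; missing combinations become 0.
df1 = pd.pivot_table(df, values='Count', index='Year', columns='Name',
                     aggfunc='sum', fill_value=0)
print(df1)

# One bar per year, with each name's count stacked as a segment.
ax = df1.plot(kind='bar', stacked=True)
ax.set_ylabel('Count')
plt.tight_layout()
# Call plt.show() to display the figure.
```

The original dataframe is left untouched. Only the temporary pivoted copy `df1` is reshaped for plotting.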
