Creating a single chart from three categorical values using Python - python
I am fairly new to Python and its terminology, so I may be clumsy at describing the problem. Sorry for that.
I have three cities that produced three fruits over two years, and I need to draw a single static chart that summarizes the data best.
What confuses me is that the dataframe has three categorical columns (city, fruits and year) and one measure.
At first I tried a stacked bar chart, but if I put the fruits in the bars and the cities on the x axis, I cannot find a place for the year.
I also tried the pivot method to turn the year values into measures, but then I did not know how to proceed with two measures.
I mainly use Matplotlib.
Any help is appreciated.
import pandas as pd

data = {
    'city': ['amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','paris','paris','paris','paris','paris','paris','berlin','berlin','berlin','berlin','berlin','berlin'],
    'fruits': ['apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas'],
    'year': [2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001],
    'amount': [384,289,347,242,390,274,175,334,245,116,252,366,255,400,300,240,600,180]
}
df = pd.DataFrame(data)
df.head()
        city   fruits  year  amount
0  amsterdam   apples  2000     384
1  amsterdam  oranges  2000     289
2  amsterdam  bananas  2000     347
3  amsterdam   apples  2001     242
4  amsterdam  oranges  2001     390
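One possible way to fit all three categorical columns into a single static figure is to draw one stacked-bar panel per year on a shared y axis: city on the x axis, fruits as the stacked segments, and year as the panel. This is only a sketch and not an answer from the original thread; the helper name plot_fruit_panels and the layout choices are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

cities = ['amsterdam', 'paris', 'berlin']
fruits = ['apples', 'oranges', 'bananas']
years = [2000, 2001]
amounts = [384, 289, 347, 242, 390, 274,
           175, 334, 245, 116, 252, 366,
           255, 400, 300, 240, 600, 180]

# Rebuild the question's dataframe: city varies slowest, fruit fastest.
rows = [(c, f, y) for c in cities for y in years for f in fruits]
df = pd.DataFrame(rows, columns=['city', 'fruits', 'year'])
df['amount'] = amounts


def plot_fruit_panels(frame):
    """Hypothetical helper: one stacked-bar panel per year."""
    year_values = sorted(frame['year'].unique())
    fig, grid = plt.subplots(1, len(year_values), sharey=True,
                             squeeze=False, figsize=(9, 4))
    axes = grid[0]
    for ax, year in zip(axes, year_values):
        # Rows become cities, columns become fruits, cells hold the amount.
        wide = frame[frame['year'] == year].pivot(
            index='city', columns='fruits', values='amount')
        # Draw the legend only once, on the last panel.
        wide.plot(kind='bar', stacked=True, ax=ax, rot=0,
                  legend=(year == year_values[-1]))
        ax.set_title(str(year))
        ax.set_xlabel('city')
    axes[0].set_ylabel('amount')
    fig.tight_layout()
    return fig, axes


fig, axes = plot_fruit_panels(df)
```

In a script, calling plt.show() or fig.savefig(...) afterwards displays or stores the figure. The shared y axis is what lets the two years be compared at a glance; grouped bars with year as the hue would be another reasonable choice.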
Related
Creating ID for every row based on the observations in variable
I want to create a system in Python where each observation in a variable maps to a number. The numbers from the (in this case) 5 variables together form a unique code. The first digit corresponds to the first variable. When an observation in a later row is the same as one in an earlier row, the same number applies. For example, if Apple appears in rows 1 and 3, both IDs get a '1' as their first digit. The output should be a new column with the ID. If all the observations in two rows are the same, their IDs will be the same. In the picture below you see 5 variables leading to the unique ID on the right, which should be the output.
You can use pd.factorize:
df['UniqueID'] = (df.apply(lambda x: (1 + pd.factorize(x)[0]).astype(str))
                    .agg(''.join, axis=1))
print(df)

# Output
        Fruit     Toy Letter      Car Country UniqueID
0       Apple    Bear      A  Ferrari  Brazil    11111
1  Strawberry  Blocks      B  Peugeot   Chile    22222
2       Apple  Blocks      C  Renault   China    12333
3      Orange    Bear      D     Saab   China    31443
4      Orange    Bear      D  Ferrari   India    31414
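To see why this works, it may help to look at what pd.factorize returns for a single column: integer codes assigned in order of first appearance, starting at 0. A minimal sketch, using the Fruit values from the output above (the variable names are mine):

```python
import pandas as pd

fruit = pd.Series(['Apple', 'Strawberry', 'Apple', 'Orange', 'Orange'])

# factorize returns (codes, uniques); each distinct value gets the next
# integer code the first time it appears, starting at 0.
codes, uniques = pd.factorize(fruit)

# Shift to 1-based digits, as the answer does with `1 + ...`.
digits = (1 + codes).astype(str)
print(digits.tolist())   # ['1', '2', '1', '3', '3']
print(uniques.tolist())  # ['Apple', 'Strawberry', 'Orange']
```

Note that once a column has more than 9 distinct values, its code takes two characters, so IDs built by plain concatenation may no longer be unambiguous; joining with a separator would avoid that.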
In Python, how do I prevent a name from repeating in a dataset and sum or average the data given?
I am analyzing COVID-19 (coronavirus) data published by Our World in Data on GitHub. The results will be presented in an organized table format.
Link to data: https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
Project Goals:
- List every country in alphabetical order
- List the total number of tests conducted by each country
- List the total number of vaccinated people in each country
- List the total number of deaths in each country
- Show the average age of death from COVID-19 in each country
Here is an example of the data:
iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.122,0.122,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
Overall Question: How do I sum the rows for each country so the information does not repeat? I just want each country to be displayed once, with the sum of its vaccination numbers next to it.
Example of the way I want the data to display:
Country      Vaccinations
Afghanistan  30235
Albania      15032
Andorra      2352
I have tried to import the data with pandas and sum it, but I am not sure how to get the total for one specific country. I want to write it so that I can display a single country by itself, but I keep running into the issue of creating a new table and confusing myself. I am a beginner here and this data set is very large.
import pandas as pd

df = pd.read_csv(r'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df[['location', 'total_vaccinations']].groupby('location').sum().reset_index()

           location  total_vaccinations
0       Afghanistan        4.013568e+08
1            Africa        1.853572e+11
2           Albania        3.587615e+08
3           Algeria        2.983973e+08
4           Andorra        4.593697e+06
..              ...                 ...
243  Western Sahara        0.000000e+00
244           World        4.857404e+12
245           Yemen        2.126927e+07
246          Zambia        5.435039e+08
247        Zimbabwe        3.211422e+09

[248 rows x 2 columns]
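One caveat worth checking: as far as I understand the dataset, total_vaccinations is a running total reported per date, so summing it over all dates counts the same doses many times. Taking the per-country maximum may be closer to what the question asks for. The sketch below uses a tiny made-up frame with the same column names; the numbers are illustrative only and do not come from the OWID file.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the OWID file;
# the values are hypothetical.
sample = pd.DataFrame({
    'location': ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'date': ['2021-03-01', '2021-03-02', '2021-03-03', '2021-03-01', '2021-03-02'],
    'total_vaccinations': [100.0, None, 250.0, 40.0, 90.0],
})

# max() skips missing values, so each country keeps its highest running total.
per_country = (sample.groupby('location', as_index=False)['total_vaccinations']
                     .max()
                     .sort_values('location'))
print(per_country)

# Look up a single country by filtering on the location column.
albania = per_country[per_country['location'] == 'Albania']
print(albania)
```

The location column also contains aggregates such as Africa and World (they appear in the output above), so those rows would need to be filtered out for a country-only table.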
Merging different sized data frames and plotting the difference of a column
I have two dataframes, Region_education_0 and Region_education_1.
Region_education_0
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
5      Melanesia              605
Region_education_1
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
Index 5, Melanesia, is not present in Region_education_1 because of a condition. I want to compare the two and plot the difference, so I tried this:
from matplotlib.pyplot import *
Region_education_combined=Region_education_0.merge(Region_education_1,left_on="Region",right_on="Region")
Region_education_combined.columns=["Region","Max of Bachelors Higher Ed","Higher Formal Education"]
Region_education_combined['Diff_HigherEd_Vals'] = Region_education_combined['Higher Formal Education'] - Region_education_combined['Max of Bachelors Higher Ed']
print(Region_education_combined)
comp_df.style.bar(subset=['Diff_HigherEd_Vals'], align='mid', color=['#d65f5f', '#5fba7d'])
index  Max of Bachelors Higher Ed  Higher Formal Education  Diff_HigherEd_Vals
1      151698.500659               122573.834171            -29124.666488
2      28413.753425                53562.111111             53562.111111
3      3944.750000                 5883.000000              1938.250000
4      45091.041667                27052.384615             -18038.657051
The Region column is missing from the output. To include it I tried
comp_df.style.bar(subset=['Diff_HigherEd_Vals','Region'], align='mid', color=['#d65f5f', '#5fba7d'])
and
comp_df.style.bar(Region_education_combined, align='mid', color=['#d65f5f', '#5fba7d'])
Is there any way to include Region in the final output? Also, "Index 5, Melanesia" from the Region_education_0 dataframe is left out; is there any way to include that in the output too?
You can keep the missing Region by passing how="outer" when you call merge, like this:
Region_education_combined=Region_education_0.merge(Region_education_1,left_on="Region",right_on="Region",how="outer")
Note that in this case the table will contain NaN wherever a row has no match; in your case Melanesia will have NaN in the Higher Formal Education column. To avoid problems you can set a default value like this:
Region_education_combined["Higher Formal Education"] = Region_education_combined["Higher Formal Education"].fillna(0)
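The effect of an outer merge can be checked on the sample rows from the question. In this sketch the lowercase frame names, the suffixes and the indicator=True flag are my additions for illustration; indicator adds a _merge column that shows which side each row came from.

```python
import pandas as pd

region_education_0 = pd.DataFrame({
    'Region': ['Australia/New Zealand', 'Caribbean', 'Central Asia',
               'East Asia', 'Melanesia'],
    'ConvertedComp': [122573.834171, 53562.111111, 134422.0,
                      112492.507042, 605.0],
})
# Same rows without Melanesia, mirroring the question's second table.
region_education_1 = region_education_0[
    region_education_0['Region'] != 'Melanesia']

# how='outer' keeps regions that appear in only one of the two frames.
combined = region_education_0.merge(
    region_education_1, on='Region', how='outer',
    suffixes=('_0', '_1'), indicator=True)

# The difference is NaN where one side is missing.
combined['diff'] = combined['ConvertedComp_1'] - combined['ConvertedComp_0']
print(combined)
```

Whether to fill the missing side with 0 or leave it as NaN depends on the comparison: filling with 0 turns a missing region into a large apparent difference, while NaN simply leaves that bar out of a plot.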
Plot against dummy variables and grouped values
Here are some values from the table I have:
   country  colour  ...
1  Spain    red
2  USA      blue
3  Greece   green
4  Italy    white
5  USA      red
6  USA      blue
7  Spain    red
I want to group the countries together and plot the result with the country on the x axis and the number of each colour counted per country. For example, USA has 2 blues and 1 red, Spain has 2 reds, etc. I want this as a bar chart, using either matplotlib or seaborn. I assume I would have to use dummy variables for the 'colour' column, but I'm not sure how to plot against a grouped column and dummy variables. Much appreciated if you could show and explain the process. Thank you.
Try with crosstab:
pd.crosstab(df['country'], df['colour']).plot.bar()
Output: [bar chart image not included]
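A runnable version of the crosstab idea, built from the sample rows in the question, might look like this. The intermediate name counts and the axis label are my additions; printing the table first makes it easier to see what the bars represent.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Spain', 'USA', 'Greece', 'Italy', 'USA', 'USA', 'Spain'],
    'colour': ['red', 'blue', 'green', 'white', 'red', 'blue', 'red'],
})

# Rows are countries, columns are colours, and each cell counts how often
# the pair occurs, so no explicit dummy variables are needed.
counts = pd.crosstab(df['country'], df['colour'])
print(counts)

# One group of bars per country, one bar per colour.
ax = counts.plot.bar(rot=0)
ax.set_ylabel('count')
```

Passing stacked=True to plot.bar would stack the colours into one bar per country instead of placing them side by side.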
Grouping by multiple years in a single column and plotting the result stacked
I have a dataframe that looks like this, with the default pandas index starting at 0:
index  Year  Count  Name
0      2005  70000  Apple
1      2005  60000  Banana
2      2006  20000  Pineapple
3      2007  70000  Cherry
4      2007  60000  Coconut
5      2007  40000  Pear
6      2008  90000  Grape
7      2008  10000  Apricot
I would like to create a stacked bar plot of this data. However, using the df.groupby() function will only allow me to call a function such as .mean() or .count() on this data in order to plot the data by year. I am getting the following result, which separates each data point and does not group them by the shared year. I have seen the matplotlib example for stacked bar charts, but there the bars are grouped by a common index; in this case I do not have a common index I want to plot by. Is there a way to group and plot this data without rearranging the entire dataframe?
If I understood you correctly, you could do this using pivot first:
df1 = pd.pivot_table(df, values='Count', index='Year', columns='Name')
df1.plot(kind='bar')
Output: [bar chart image not included]
Or with the argument stacked=True:
df1.plot(kind='bar', stacked=True)
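For reference, here is a self-contained sketch of the pivot-then-plot approach using the rows from the question. The aggfunc='sum' and fill_value=0 arguments are my assumptions (the answer relies on the defaults); with one row per Year and Name pair the aggregation function makes no difference, and filling with 0 just avoids NaN cells for names that do not occur in a given year.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2005, 2005, 2006, 2007, 2007, 2007, 2008, 2008],
    'Count': [70000, 60000, 20000, 70000, 60000, 40000, 90000, 10000],
    'Name': ['Apple', 'Banana', 'Pineapple', 'Cherry', 'Coconut',
             'Pear', 'Grape', 'Apricot'],
})

# One row per year, one column per name; absent combinations become 0.
wide = pd.pivot_table(df, values='Count', index='Year', columns='Name',
                      aggfunc='sum', fill_value=0)

# The height of each stacked bar is the row total for that year.
totals = wide.sum(axis=1)
print(totals)

ax = wide.plot(kind='bar', stacked=True)
ax.set_ylabel('Count')
```

The original dataframe is left untouched; the wide table is only an intermediate used for plotting.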