Using matplotlib to obtain an overlaid histogram - python

I am new to Python and I'm trying to plot an overlaid histogram for a manipulated data set from Kaggle, using matplotlib. The dataset shows the history of gun violence in the USA in recent years. I have selected only a few columns for EDA.
import pandas as pd

data_set = pd.read_csv("C:/Users/Lenovo/Documents/R related Topics/Assignment/Assignment_day2/04 Assignment/GunViolence.csv")
# .copy() so that adding columns below does not modify a view of data_set
state_wise_crime = data_set[['date', 'state', 'n_killed', 'n_injured']].copy()
date_value = pd.to_datetime(state_wise_crime['date'])
state_wise_crime['Month'] = date_value.dt.month
# 'Year' is used in the groupby below, so it has to be created first
state_wise_crime['Year'] = date_value.dt.year
# drop() returns a new DataFrame, so assign the result back
state_wise_crime = state_wise_crime.drop('date', axis=1)
no_of_killed = state_wise_crime.groupby(['state', 'Year'])[['n_killed', 'n_injured']].sum()
I want an overlaid histogram that shows the number of people killed and the number of people injured, with the different states on the x-axis.

Welcome to Stack Overflow! Next time, please post your data in a format like the one below (not a link or an image) so that it is easier for us to work on the problem. Also, if you are asking about a graph, showing what the desired graph should look like (even as a hand drawing) would be very helpful :)
df
     state  Year  n_killed  n_injured
0  Alabama  2013         9          3
1  Alabama  2014       591        325
2  Alabama  2015       562        385
3  Alabama  2016       761        488
4  Alabama  2017       856        544
5  Alabama  2018       219        135
6   Alaska  2014        49         29
7   Alaska  2015        84         70
8   Alaska  2016       103         88
9   Alaska  2017        70         69
As I commented on your original post, a bar plot would be more appropriate than a histogram in this case, since your purpose appears to be visualizing summary statistics (the sum) for each year with a state-wise comparison. As far as I know, the easiest option is to use Seaborn. It depends on how you want to show the data, but here is one example. The code is as simple as this:
import seaborn as sns
sns.barplot(x='Year', y='n_killed', hue='state', data=df)
Output:
Hope this helps.
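If you want both counts per state on the same axes, as the question describes, one option is to sum over the years and draw two semi-transparent bar series on top of each other. This is only a sketch built on the sample rows above; the name totals and the alpha value are my own choices and are not from the original post.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows copied from the table above
df = pd.DataFrame({
    "state": ["Alabama"] * 6 + ["Alaska"] * 4,
    "Year": [2013, 2014, 2015, 2016, 2017, 2018, 2014, 2015, 2016, 2017],
    "n_killed": [9, 591, 562, 761, 856, 219, 49, 84, 103, 70],
    "n_injured": [3, 325, 385, 488, 544, 135, 29, 70, 88, 69],
})

# Sum both counts over all years, giving one row per state
totals = df.groupby("state")[["n_killed", "n_injured"]].sum()

fig, ax = plt.subplots()
# Two bar series at the same x positions; alpha keeps the shorter bars visible
ax.bar(totals.index, totals["n_killed"], alpha=0.6, label="n_killed")
ax.bar(totals.index, totals["n_injured"], alpha=0.6, label="n_injured")
ax.set_xlabel("state")
ax.set_ylabel("count")
ax.legend()
```

In a plain script you would call plt.show() at the end to display the figure.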

Related

python Stacked area chart Bokeh

I am trying to create a stacked area chart, which shows the number of customers by country.
So my data frame is:
date people country
2021-11-18 509 USA
2021-11-18 289 France
2021-11-18 234 Germany
2021-11-18 148 Poland
2021-11-18 101 China
I don't understand how to edit the graphic design (color).
table.groupby(['date','country'])['people'].sum().unstack().plot(
kind='area',
figsize=(10,4))
I also tried to use the Bokeh library for a nicer visualization, but I don't know how to write the code.
Thanks for your help. It's my first post. Sorry if I missed something.
I think you are looking for the varea_stack() function in Bokeh.
My solution is based on the varea_stack example, which is part of the official documentation.
Let's assume this is your data (I added one day):
text = """date people country
2021-11-18 509 USA
2021-11-18 289 France
2021-11-18 234 Germany
2021-11-18 148 Poland
2021-11-18 101 China
2021-11-19 409 USA
2021-11-19 389 France
2021-11-19 134 Germany
2021-11-19 158 Poland
2021-11-19 191 China"""
First I bring the data into the same form as the example:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=True, index_col=0)
df = df.groupby(['date','country']).sum().unstack()
df.columns = df.columns.droplevel(0)
df.index.name=None
df.columns.name=None
Now the DataFrame looks like this:
            China  France  Germany  Poland  USA
2021-11-18    101     289      234     148  509
2021-11-19    191     389      134     158  409
Now the rest is straightforward. If your index is a DatetimeIndex, you have to set the x_axis_type of the Bokeh figure to 'datetime'. I did this for the plot below.
from bokeh.palettes import brewer
from bokeh.plotting import figure, show, output_notebook
output_notebook()
n = df.shape[1]
p = figure(x_axis_type='datetime')
p.varea_stack(stackers=df.columns, x='index', source=df, color=brewer['Spectral'][n],)
show(p)
The output looks like this:
You can redefine the colors using the color keyword if you like.
You should add colors to your source, or you could use the color palettes in Bokeh. See the documentation for bokeh.palettes.
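For the pandas area plot in the question, the colors can also be set directly through the color argument of DataFrame.plot. A minimal sketch, assuming the two days of data used in the answer above; the hex color values are an arbitrary choice of mine.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Two days of data, as in the answer above
table = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-18"] * 5 + ["2021-11-19"] * 5),
    "people": [509, 289, 234, 148, 101, 409, 389, 134, 158, 191],
    "country": ["USA", "France", "Germany", "Poland", "China"] * 2,
})

# One column per country, one row per date
wide = table.groupby(["date", "country"])["people"].sum().unstack()

# One color per column, in column order (arbitrary example colors)
colors = ["#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"]
ax = wide.plot(kind="area", figsize=(10, 4), color=colors)
```

Passing colormap="tab10" (or another matplotlib colormap name) instead of color is an alternative if you do not want to list the colors by hand.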

Display all values on a matplotlib barplot

I have a data frame with 20 values, and I am trying to draw a bar plot of it using matplotlib. When I do, I see only 10 bars instead of 20. I have 5 NaN values in it and 4 of them.
Here is a sample of dataframe:
Name Bonus
Jack Carpenter 890
John Clegg 653
Mike Holiday 367
Rene Moukad 900
........... ...
my code is standard:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(16, 6))
ax.bar(df.Name, df.Bonus)
fig.autofmt_xdate(rotation=45)
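One possible explanation, offered as a guess since the full data is not shown: matplotlib treats string x values as categories, so rows that share the same Name are drawn at the same x position and overlap, and rows whose Bonus is NaN have no visible bar. The sketch below gives every row its own position. The first four rows come from the question; the duplicate name and the NaN row are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# First four rows are from the question; the last two are hypothetical
df = pd.DataFrame({
    "Name": ["Jack Carpenter", "John Clegg", "Mike Holiday", "Rene Moukad",
             "John Clegg", "Ann Example"],
    "Bonus": [890, 653, 367, 900, 410, np.nan],
})

# Plot against row positions so duplicate names do not share one x slot
positions = np.arange(len(df))
fig, ax = plt.subplots(figsize=(16, 6))
# NaN values are drawn as zero-height bars so that every row keeps its slot
ax.bar(positions, df["Bonus"].fillna(0))
ax.set_xticks(positions)
ax.set_xticklabels(df["Name"], rotation=45, ha="right")
```

Checking df["Name"].nunique() and df["Bonus"].isna().sum() against len(df) is a quick way to see whether duplicates or NaN values account for the missing bars.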

How to draw plots on Specific pandas columns

I have the df.head() displayed below. I want to display the progression of salaries over time. As you can see, the teams are repeated across the years, and the idea is to display how their salaries changed over time. So for teamID='ATL' I will have a graph that starts at 1985 and goes all the way to the present.
I think I will need to select teams by their team ID and have the x axis display time (year) and the y axis display the salary (payroll_total). I don't know how to do that in pandas for each team in my data frame.
  teamID  yearID lgID  payroll_total franchID  Rank   W    G  win_percentage
0    ATL    1985   NL     14807000.0      ATL     5  66  162       40.740741
1    BAL    1985   AL     11560712.0      BAL     4  83  161       51.552795
2    BOS    1985   AL     10897560.0      BOS     5  81  163       49.693252
3    CAL    1985   AL     14427894.0      ANA     2  90  162       55.555556
4    CHA    1985   AL      9846178.0      CHW     3  85  163       52.147239
5    ATL    1986   NL     17800000.0      ATL     4  55  181       41.000000
You can use seaborn for this:
import seaborn as sns
sns.lineplot(data=df, x='yearID', y='payroll_total', hue='teamID')
To get a different plot for each team:
for team, d in df.groupby('teamID'):
    # label=team uses the loop variable, so each legend shows the team ID
    d.plot(x='yearID', y='payroll_total', label=team)
import pandas as pd
import matplotlib.pyplot as plt
# Display the line plots on 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)
# Generate a plot for each team
df[df['teamID'] == 'ATL'].plot(ax=axes[0], x='yearID', y='payroll_total')
df[df['teamID'] == 'BAL'].plot(ax=axes[1], x='yearID', y='payroll_total')
df[df['teamID'] == 'BOS'].plot(ax=axes[2], x='yearID', y='payroll_total')
# Display the plot
plt.show()
Depending on how many teams you want to show, you should adjust the nrows value in
fig, axes = plt.subplots(nrows=3, ncols=1)
Finally, you could write a loop that creates the visualization for every team.
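The loop mentioned above could look roughly like this. It is a sketch that uses only the rows shown in the question, so most teams have a single point; the figsize scaling and the marker are my own choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rows copied from the df.head() in the question (only the columns needed)
df = pd.DataFrame({
    "teamID": ["ATL", "BAL", "BOS", "CAL", "CHA", "ATL"],
    "yearID": [1985, 1985, 1985, 1985, 1985, 1986],
    "payroll_total": [14807000.0, 11560712.0, 10897560.0,
                      14427894.0, 9846178.0, 17800000.0],
})

teams = sorted(df["teamID"].unique())

# One row of axes per team; squeeze=False keeps axes 2-D even for one team
fig, axes = plt.subplots(nrows=len(teams), ncols=1, sharex=True,
                         squeeze=False, figsize=(6, 2 * len(teams)))

for ax, team in zip(axes[:, 0], teams):
    df[df["teamID"] == team].plot(ax=ax, x="yearID", y="payroll_total",
                                  marker="o", label=team)
```

With many teams a single tall column becomes hard to read, so a grid (for example ncols=4) may be a better layout.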

Plotting on a large number of facets

I want to make a plot similar to the one that appears on the seaborn page: https://seaborn.pydata.org/examples/many_facets.html.
   index    state_name    totalprod
0     28  North Dakota  475085000.0
1      3    California  347535000.0
2     34  South Dakota  266141000.0
3      5       Florida  247048000.0
4     21       Montana  156562000.0
I want to see the variation over the period in a graph first, and then get the percentages. I'm trying the following:
grid = sns.FacetGrid(df_bystate, col="state_name", palette="tab20c", col_wrap=4)
grid.map(plt.axhline, y=0, ls=":")
grid.map(plt.plot, [("year", "totalprod")], marker="o")
grid.set(xticks=np.arange(5), yticks=[-5, 5], xlim=(-10, 10), ylim=(-5.5, 5.5))
I get the following error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
So the plots are not generated from the data as intended. I'm new to this and I do not know what I'm doing wrong, so I ask for your patience.
Thanks!!

Cannot remove trend components and seasonal components

I am trying to build a model for predicting energy production using an ARMA model.

The data I can use for training is as follows:
(https://github.com/soma11soma11/EnergyDataSimulationChallenge/blob/master/challenge1/data/training_dataset_500.csv)
    ID  Label  House  Year  Month  Temperature  Daylight  EnergyProduction
     0      0      1  2011      7         26.2     178.9               740
     1      1      1  2011      8         25.8     169.7               731
     2      2      1  2011      9         22.8     170.2               694
     3      3      1  2011     10         16.4     169.1               688
     4      4      1  2011     11         11.4     169.1               650
     5      5      1  2011     12          4.2     199.5               763
...............
 11995     19    500  2013      2          4.2     201.8               638
 11996     20    500  2013      3         11.2       234               778
 11997     21    500  2013      4         13.6     237.1               758
 11998     22    500  2013      5         19.2     258.4               838
 11999     23    500  2013      6         22.7     122.9               586
As shown above, I can use data from July 2011 to May 2013 for training.
Using this training data, I want to predict the energy production in June 2013 for each of the 500 houses.
The problem is that the time series is not stationary and has trend and seasonal components (I checked this as follows).
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data_train = pd.read_csv('../../data/training_dataset_500.csv')
rng=pd.date_range('7/1/2011', '6/1/2013', freq='M')
house1 = data_train[data_train.House==1][['EnergyProduction','Daylight','Temperature']].set_index(rng)
fig, axes = plt.subplots(nrows=1, ncols=3)
for i, column in enumerate(house1.columns):
    house1[column].plot(ax=axes[i], figsize=(14,3), title=column)
plt.show()
With this data, I cannot fit an ARMA model that gives good predictions. So I want to remove the trend and seasonal components and make the time series stationary. I tried, but I could not remove these components and make it stationary.
I would recommend the Hodrick-Prescott (HP) filter, which is widely used in macroeconometrics to separate the long-term trend component from short-term fluctuations. It is implemented as statsmodels.api.tsa.filters.hpfilter.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv('/home/Jian/Downloads/data.csv', index_col=[0])
# get part of the data
x = df.loc[df.House==1, 'Daylight']
# hp-filter; lamb=129600 follows the usual suggestion for monthly data
# hpfilter returns the cyclical (detrended) component first, then the trend
x_cycle, x_trend = sm.tsa.filters.hpfilter(x, lamb=129600)
fig, axes = plt.subplots(figsize=(12,4), ncols=3)
axes[0].plot(x)
axes[0].set_title('raw x')
axes[1].plot(x_trend)
axes[1].set_title('trend')
axes[2].plot(x_cycle)
axes[2].set_title('cycle (detrended x)')
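As an alternative to the HP filter, and not part of the answer above, differencing is a common way to remove a trend and a seasonal pattern before fitting an ARMA model. The sketch below uses a synthetic monthly series (made-up values: a linear trend plus a 12-month sine cycle) rather than the real data, just to show the mechanics with pandas.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (made-up): linear trend + 12-month seasonal cycle
idx = pd.date_range("2011-07-01", periods=36, freq="MS")
t = np.arange(36)
series = pd.Series(100 + 2.0 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

# Seasonal differencing (lag 12) cancels the repeating 12-month pattern.
# For this series the linear trend leaves a constant 2.0 * 12 = 24 behind.
seasonal_diff = series.diff(12).dropna()

# A further first difference removes that remaining constant level
stationary = seasonal_diff.diff().dropna()
```

Note that lag-12 differencing costs 12 observations, so with only about two years of monthly data per house very few points remain; in that case a first difference alone, or the HP filter above, may be more practical.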
