I have a DataFrame with 20 values, and I am trying to draw a bar plot of it using matplotlib. When I do, I see only 10 bars instead of 20. I have 5 NaN values in it and 4 of them.
Here is a sample of the DataFrame:
Name Bonus
Jack Carpenter 890
John Clegg 653
Mike Holiday 367
Rene Moukad 900
........... ...
My code is standard:
fig,ax = plt.subplots(figsize=(16,6))
plt.bar(df.Name, df.Bonus)
fig.autofmt_xdate(rotation=45)
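The question does not say why bars go missing, so the following is only a diagnostic sketch under two assumptions: that some names are repeated (matplotlib puts equal string labels at the same x position, so their bars overlap) and that NaN bonuses produce no visible bar. The sample frame, including the name "Ann Example" and the repeated "John Clegg" row, is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 6 rows, one missing bonus and one repeated name.
df = pd.DataFrame({
    "Name": ["Jack Carpenter", "John Clegg", "Mike Holiday",
             "Rene Moukad", "John Clegg", "Ann Example"],
    "Bonus": [890, 653, 367, 900, 120, np.nan],
})

rows = len(df)
with_value = int(df["Bonus"].notna().sum())  # rows that can produce a visible bar
unique_names = df["Name"].nunique()          # distinct x positions on a categorical axis

print(rows, with_value, unique_names)

# One way to get one bar slot per row: plot against the row position
# and use the names only as tick labels.
positions = list(range(rows))
labels = df["Name"].tolist()
```

If either count is lower than the number of rows, that would explain the missing bars. Passing `positions` to `plt.bar` and then setting `labels` with `ax.set_xticks(positions)` and `ax.set_xticklabels(labels)` keeps one slot per row even when names repeat.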
I'm new to Python and working on data manipulation.
I have a DataFrame:
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you observe above, some of the lifespans are in a range like 14--16. The datatype of [Lifespan] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
Using split and expand=True
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1)
                  )
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5
The df.head() output is displayed below. I want to show the progression of salaries over time. As you can see, the teams are repeated across the years, and the idea is to
display how their salaries changed over time. So for teamID='ATL' I will have a graph that starts at 1985 and goes all the way to the present.
I think I will need to select teams by their team ID and have the x axis display time (year) and the y axis display the payroll. I don't know how to do that in pandas for each team in my DataFrame.
teamID yearID lgID payroll_total franchID Rank W G win_percentage
0 ATL 1985 NL 14807000.0 ATL 5 66 162 40.740741
1 BAL 1985 AL 11560712.0 BAL 4 83 161 51.552795
2 BOS 1985 AL 10897560.0 BOS 5 81 163 49.693252
3 CAL 1985 AL 14427894.0 ANA 2 90 162 55.555556
4 CHA 1985 AL 9846178.0 CHW 3 85 163 52.147239
5 ATL 1986 NL 17800000.0 ATL 4 55 181 41.000000
You can use seaborn for this:
import seaborn as sns
sns.lineplot(data=df, x='yearID', y='payroll_total', hue='teamID')
To get a separate plot for each team:
for team, d in df.groupby('teamID'):
    d.plot(x='yearID', y='payroll_total', label=team)
import pandas as pd
import matplotlib.pyplot as plt
# Display the line plots in 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)
# Generate a plot for each team
df[df['teamID'] == 'ATL'].plot(ax=axes[0], x='yearID', y='payroll_total')
df[df['teamID'] == 'BAL'].plot(ax=axes[1], x='yearID', y='payroll_total')
df[df['teamID'] == 'BOS'].plot(ax=axes[2], x='yearID', y='payroll_total')
# Display the plot
plt.show()
Depending on how many teams you want to show, adjust nrows in
fig, axes = plt.subplots(nrows=3, ncols=1)
Finally, you could write a loop that creates the plot for every team.
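A minimal sketch of such a loop, assuming the column names from the question. The small frame is a stand-in for the real data (the 1986 BAL and BOS payrolls are made up), and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data; the 1986 BAL and BOS values are invented for illustration.
df = pd.DataFrame({
    "teamID": ["ATL", "BAL", "BOS", "ATL", "BAL", "BOS"],
    "yearID": [1985, 1985, 1985, 1986, 1986, 1986],
    "payroll_total": [14807000.0, 11560712.0, 10897560.0,
                      17800000.0, 13000000.0, 14400000.0],
})

teams = sorted(df["teamID"].unique())

# One row of axes per team; squeeze=False keeps `axes` 2-D even for a single team.
fig, axes = plt.subplots(nrows=len(teams), ncols=1, sharex=True,
                         figsize=(8, 3 * len(teams)), squeeze=False)

for ax, team in zip(axes[:, 0], teams):
    subset = df[df["teamID"] == team].sort_values("yearID")
    subset.plot(ax=ax, x="yearID", y="payroll_total", label=team)
    ax.set_ylabel("payroll_total")

fig.tight_layout()
```

Because nrows is derived from the number of teams, the figure grows with the data. With an interactive backend, a final plt.show() displays it.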
I am new to Python and I'm trying to plot an overlaid histogram for a manipulated data set from Kaggle. I tried doing it with matplotlib. The data set shows the history of gun violence in the USA in recent years. I have selected only a few columns for EDA.
import pandas as pd
data_set = pd.read_csv("C:/Users/Lenovo/Documents/R related Topics/Assignment/Assignment_day2/04 Assignment/GunViolence.csv")
# Copy the selected columns so new columns can be added without warnings
state_wise_crime = data_set[['date', 'state', 'n_killed', 'n_injured']].copy()
date_value = pd.to_datetime(state_wise_crime['date'])
# 'Year' is needed by the groupby below
state_wise_crime['Year'] = date_value.dt.year
state_wise_crime['Month'] = date_value.dt.month
# drop() returns a new frame, so assign the result
state_wise_crime = state_wise_crime.drop('date', axis=1)
no_of_killed = state_wise_crime.groupby(['state', 'Year'])[['n_killed', 'n_injured']].sum()
I want an overlaid histogram that shows the number of people killed and the number of people injured, with the different states on the x-axis.
Welcome to Stack Overflow! Next time, please post your data in the format below (not as a link or an image) to make it easier for us to work on the problem. Also, if you ask about a graph, showing what the desired graph should look like (even as a hand drawing) would be very helpful :)
df
state Year n_killed n_injured
0 Alabama 2013 9 3
1 Alabama 2014 591 325
2 Alabama 2015 562 385
3 Alabama 2016 761 488
4 Alabama 2017 856 544
5 Alabama 2018 219 135
6 Alaska 2014 49 29
7 Alaska 2015 84 70
8 Alaska 2016 103 88
9 Alaska 2017 70 69
As I commented in your original post, a bar plot would be more appropriate than histogram in this case since your purpose appears to be visualizing the summary statistics (sum) of each year with state-wise comparison. As far as I know, the easiest option is to use Seaborn. It depends on how you want to show the data, but below is one example. The code is as simple as below.
import seaborn as sns
sns.barplot(x='Year', y='n_killed', hue='state', data=df)
Output:
Hope this helps.
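If the goal is to show both n_killed and n_injured side by side with the states on the x-axis, one possible sketch is to sum both columns per state first. This uses only the sample rows shown above; the names totals and long_df are introduced here for illustration.

```python
import pandas as pd

# The sample rows shown above.
df = pd.DataFrame({
    "state": ["Alabama"] * 6 + ["Alaska"] * 4,
    "Year": [2013, 2014, 2015, 2016, 2017, 2018, 2014, 2015, 2016, 2017],
    "n_killed": [9, 591, 562, 761, 856, 219, 49, 84, 103, 70],
    "n_injured": [3, 325, 385, 488, 544, 135, 29, 70, 88, 69],
})

# Sum both measures per state: one row per state, one column per measure.
totals = df.groupby("state")[["n_killed", "n_injured"]].sum()
print(totals)

# Long form (one row per state/measure pair), convenient for seaborn's hue argument.
long_df = totals.reset_index().melt(id_vars="state",
                                    var_name="measure",
                                    value_name="count")
```

From here, totals.plot(kind='bar') draws two bars per state, and sns.barplot(x='state', y='count', hue='measure', data=long_df) should give a similar grouped chart in seaborn.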
I have written code to show my data set as a bar chart. This is my code.
I read my data from a .csv file in this way:
names = ["Clinic Number","Question Text","Answer Text","Answer Date","Class"]
data = pd.read_csv('ADLCI.csv', names = names)
And then
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
import matplotlib.pyplot as plt
plt.figure()
grouped.plot(kind='bar', title ="Functional Status Count", figsize=(15, 10), legend=True, fontsize=12)
plt.show()
This is the resulting DataFrame, which I want to show as a bar chart.
Question Text Answer Text counts
0 CI function No 513
1 CI function Yes 373
2 bathing? No 2827
3 bathing? Yes 408
4 dressing? No 2824
5 dressing? Yes 423
6 feeding No 2851
7 feeding Yes 160
8 housekeeping No 2803
9 housekeeping Yes 717
10 preparing food No 2604
11 preparing food Yes 593
12 responsibility for own medications No 2793
13 responsibility for own medications Yes 625
14 shopping No 35
15 shopping Yes 49
16 toileting No 2843
17 toileting Yes 239
18 transferring No 2834
19 transferring Yes 904
20 using transportation No 2816
21 using transportation Yes 483
The first column, the one with the numbers, was added automatically; I do not have it in my data set.
Here is the bar chart created by this code.
As you can see in the bar chart, all bars have the same color. Also, the x axis shows the index numbers I mentioned, but I don't want it in this shape.
What I want looks like the chart at this link:
I'm going to explain what changes I want to the picture I have uploaded here.
Instead of 0, 1, ... on the x axis, it should show the Question Text column. In detail: as we see in the DataFrame, there are two CI function rows, one for Yes and one for No. I want CI function on the x axis instead of 0 and 1, with two bars in different colors, one for the count of No (1596) and one for the count of Yes (1376).
The next item will be bathing?, again with one bar pointing to 17965 and another one to 702.
This way I should have roughly ten groups, each containing two bars side by side, like in the link I put above.
I tried various approaches like the one in the link above, but my chart does not look like that, or I get an error.
Thanks :)
Update 1
When I applied your code:
import matplotlib.pyplot as plt
data.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
plt.show()
I got this error:
Traceback (most recent call last):
File "C:/Users/M193053/PycharmProjects/ADL-distribution/test.py", line 52, in <module>
data.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 2941, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 1977, in plot_frame
**kwds)
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 1804, in _plot
plot_obj.generate()
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._compute_plot_data()
File "C:\Users\M193053\Documents\Anaconda3\envs\conda3\lib\site-packages\pandas\plotting\_core.py", line 373, in _compute_plot_data
'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'DataFrame': no numeric data to plot
But when I use this code:
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
import matplotlib.pyplot as plt
grouped.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
plt.show()
the result looks OK to me, like this:
But it does not seem logical to apply groupby twice. Because of that, I'm still not sure what I should do.
Thanks for taking the time :)
Update two
This is my DataFrame, obtained with this code:
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
0 CI function No 513
1 CI function Yes 373
2 bathing? No 2827
3 bathing? Yes 408
4 dressing? No 2824
5 dressing? Yes 423
6 feeding No 2851
7 feeding Yes 160
8 housekeeping No 2803
9 housekeeping Yes 717
10 preparing food No 2604
11 preparing food Yes 593
12 responsibility for own medications No 2793
13 responsibility for own medications Yes 625
14 shopping No 35
15 shopping Yes 49
16 toileting No 2843
17 toileting Yes 239
18 transferring No 2834
19 transferring Yes 904
20 using transportation No 2816
21 using transportation Yes 483
And this is the DataFrame obtained from the combination of your code and mine:
grouped = data.groupby(['Question Text','Answer Text']).size().reset_index(name='counts')
print(grouped)
import matplotlib.pyplot as plt
final = grouped.groupby(['Question Text','Answer Text']).sum()
print(final)
Question Text Answer Text
CI function No 513
Yes 373
bathing? No 2827
Yes 408
dressing? No 2824
Yes 423
feeding No 2851
Yes 160
housekeeping No 2803
Yes 717
preparing food No 2604
Yes 593
responsibility for own medications No 2793
Yes 625
shopping No 35
Yes 49
toileting No 2843
Yes 239
transferring No 2834
Yes 904
using transportation No 2816
Yes 483
Update 3
The original DataFrame has 200000 rows like this:
1 bathing? No 3529933
2 dressing? No 3529933
3 feeding No 3529933
4 housekeeping No 3529933
5 responsibility for own medications No 3529933
6 using transportation No 3529933
7 toileting No 3529933
8 transferring No 3529933
10 preparing food No 3529933
11 bathing? NaN 2864155
12 dressing? NaN 2864155
13 feeding NaN 2864155
14 housekeeping NaN 2864155
15 responsibility for own medications NaN 2864155
16 toileting NaN 2864155
17 transferring NaN 2864155
19 preparing food NaN 2864155
20 using transportation Yes 2864155
21 bathing? NaN 2921299
22 dressing? NaN 2921299
You can do it like this (df is the DataFrame you posted):
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
df.groupby(['Question Text','Answer Text']).sum().unstack().plot(kind='bar')
plt.show()
Output:
You can also rotate the xlabel in this way:
plt.xticks(rotation=45)
But I suggest making the labels shorter so the chart is clearer.
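Regarding the double groupby from the question's updates: a sketch of a one-step alternative is to count rows with size() and unstack the result directly. The error in Update 1 presumably arose because sum() on the raw rows left no numeric columns to plot; that is an assumption, since the raw column types are not shown. The tiny frame below is made up and only mimics the two relevant columns.

```python
import pandas as pd

# Tiny stand-in for the raw survey rows (one row per answer); values are made up.
data = pd.DataFrame({
    "Question Text": ["bathing?", "bathing?", "bathing?", "feeding", "feeding"],
    "Answer Text": ["No", "No", "Yes", "No", "Yes"],
})

# size() counts rows per (question, answer) pair;
# unstack() moves the answers into columns, one row per question.
counts = (data.groupby(["Question Text", "Answer Text"])
              .size()
              .unstack(fill_value=0))
print(counts)
```

Calling counts.plot(kind='bar') on this wide table draws one group per question with a differently colored bar for each answer, without a second groupby.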
This simple code draws a line chart as expected:
james_f=names[(names.name=='James') & (names.sex=='F')]
plt.plot(james_f['year'],james_f['births'])
plt.show()
But when I change the condition by deleting one of the two filters, it starts to draw what looks like a bar chart. Why, and how can I force it to draw a line chart?
james_f=names[(names.name=='James')]
plt.plot(james_f['year'],james_f['births'])
plt.show()
Adding a 1 == 1 rule in its place changes nothing:
james_f=names[(names.name=='James') & ( 1 == 1)]
plt.plot(james_f['year'],james_f['births'])
plt.show()
Even this code draws a bar chart:
james_f=names[(names.name=='James') | (names.name=='John') | (names.name=='Robert') ]
plt.plot(james_f['year'],james_f['births'])
james_f['births'] output (pandas.core.series.Series):
228 46
343 22
538 11
942 9655
944 5927
2312 26
2329 24
2617 9
2938 8769
....
Name: births, dtype: int64
james_f['births'].min() returns 7. There are no zero or NaN values:
>>> print(james_f[james_f['births'].isnull()])
Empty DataFrame
Columns: [name, sex, births, year]
Index: []
>>> james_f.head(10)
name sex births year
343 James F 22 1880
944 James M 5927 1880
2329 James F 24 1881
2940 James M 5441 1881
4372 James F 18 1882
4965 James M 5892 1882
6428 James F 25 1883
7118 James M 5223 1883
8488 James F 33 1884
9320 James M 5693 1884
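Not filtering on gender yields two observations per year: one for women and one for men. The numbers of men and women named 'James' are vastly different, so the line jumps between a small and a large value for every year. That zigzag makes the plot appear very noisy and look like a bar chart, even though it is still a line. You have (at least) two options: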
(1) Sum up the number of men and women like so.
# The name is capitalized in the data, so match 'James' exactly
james = names[names.name == 'James']
years = []
births = []
for year, subset in james.groupby('year'):
    years.append(year)
    births.append(subset.births.sum())
plt.plot(years, births)
Someone with more pandas skills can probably write this as one line.
(2) Plot two separate lines for men and women like so.
# The name is capitalized in the data, so match 'James' exactly
james = names[names.name == 'James']
for sex, subset in james.groupby('sex'):
    plt.plot(subset.year, subset.births, label=sex)
plt.legend()
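For reference, the summation in option (1) can be sketched as a single groupby call, and option (2) as a pivot table. The frame below reuses the first ten 'James' rows from the question; births_per_year and births_by_sex are names introduced here.

```python
import pandas as pd

# The first ten 'James' rows shown in the question.
names = pd.DataFrame({
    "name": ["James"] * 10,
    "sex": ["F", "M"] * 5,
    "births": [22, 5927, 24, 5441, 18, 5892, 25, 5223, 33, 5693],
    "year": [1880, 1880, 1881, 1881, 1882, 1882, 1883, 1883, 1884, 1884],
})

james = names[names.name == "James"]

# Option (1) in one line: total births per year, both sexes combined.
births_per_year = james.groupby("year")["births"].sum()

# Option (2) as a table: one row per year, one column per sex.
births_by_sex = james.pivot_table(index="year", columns="sex",
                                  values="births", aggfunc="sum")
print(births_per_year)
```

plt.plot(births_per_year.index, births_per_year.values) then draws a single smooth line, and births_by_sex.plot() draws one line per sex with a legend.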