Matplotlib Plot confusion - python

I'm not sure what is going on here but I have two seemingly similar bits of code designed to produce graphs in the same format:
apple_fcount = apple_f1.groupby("Year")["Domain Category"].nunique()
plt.figure(1); apple_fcount.plot(figsize=(12,6))
plt.xlabel('Year')
plt.ylabel('Number of Fungicides used')
plt.title('Number of fungicides used on apples in the US')
plt.savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/Apple/apple fcount')
This one produces the graph how I would like it to be seen; y axis shows number of fungicides, x axis shows the respective years. However, the following code, on a different dataset prints a usable graph but the x axis shows years as '1, 2, 3, ...' instead of the actual years.
apple_yplot = apple_y1.groupby('Year')['Value'].sum()
plt.figure(3); apple_yplot.plot(figsize=(12,6))
plt.xlabel('Year')
plt.ylabel('Yield / lb per acre')
plt.title('Graph of apple yield in the US over time')
plt.savefig('C:/Users/User/Documents/Work/Year 3/Project/Plots/Apple/Yield.png')
The only discernible difference I see in the code is that the first counts .nunique() datapoints, whilst the second is a .sum() of all data in the year. I can't imagine that's the reason behind this issue though. Both .groupby() lines print in the same format, with the year being correctly displayed there.
apple_fcount:
Year
1991 19
1993 19
1995 21
1997 26
1999 28
2001 27
2003 31
2005 37
2007 30
2009 35
2011 32
Name: Domain Category, dtype: int64
apple_yplot:
Year
2007 405399
2008 541180
2009 483130
2010 473150
2011 468120
2012 417710
2013 529470
2014 510700
Name: Value, dtype: float64
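No answer to this question is included here. One cause worth ruling out, offered as an assumption rather than a confirmed diagnosis, is that Matplotlib picked its own tick positions and abbreviated the labels with an offset, which can make years read like small numbers. The sketch below rebuilds apple_yplot from the values printed above and pins one tick per year with explicit labels. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Rebuilt from the printed apple_yplot output in the question
apple_yplot = pd.Series(
    [405399.0, 541180.0, 483130.0, 473150.0,
     468120.0, 417710.0, 529470.0, 510700.0],
    index=pd.Index(range(2007, 2015), name="Year"),
    name="Value",
)

ax = apple_yplot.plot(figsize=(12, 6))

# Put one tick on every year and label it with the full year,
# so the axis cannot fall back to an abbreviated offset form.
years = list(apple_yplot.index)
ax.set_xticks(years)
ax.set_xticklabels([str(y) for y in years])

ax.set_xlabel("Year")
ax.set_ylabel("Yield / lb per acre")
ax.set_title("Graph of apple yield in the US over time")
```

If the labels still come out wrong after this, the index dtype of the grouped result (apple_yplot.index.dtype) would be the next thing to inspect.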

Related

Why doesn't this grouped data frame show the expected plot?

I have this pandas data frame, and I want to make a line plot with one line per year:
year month canasta
0 2011 1 239.816531
1 2011 2 239.092353
2 2011 3 239.332308
3 2011 4 237.591538
4 2011 5 238.384231
... ... ... ...
59 2015 12 295.578605
60 2016 1 296.918861
61 2016 2 296.398701
62 2016 3 296.488780
63 2016 4 300.922927
And I tried this code:
dca.groupby(['year', 'month'])['canasta'].mean().reset_index().plot()
But I get this result:
I must be doing something wrong. Please, could you help me with this plot? The x axis is the months, and there should be a line per each year.
Why: After reset_index, year and month become ordinary columns, and some_df.plot() simply plots every column of the dataframe against the row index in one plot, which gives the result you posted.
Fix: Try unstack instead of reset_index:
(dca.groupby(['year', 'month'])
['canasta'].mean()
.unstack('year').plot()
)
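To see what unstack changes, here is a minimal sketch on a small made-up frame (the canasta values are illustrative, not the asker's data). It shows that the months end up in the index and each year becomes its own column, which is the shape plot() needs to draw one line per year.

```python
import pandas as pd

# Illustrative data only: two years, three months each
dca = pd.DataFrame({
    "year": [2011, 2011, 2011, 2012, 2012, 2012],
    "month": [1, 2, 3, 1, 2, 3],
    "canasta": [239.8, 239.1, 239.3, 250.0, 251.5, 252.0],
})

# Mean per (year, month), then pivot the year level into columns
wide = dca.groupby(["year", "month"])["canasta"].mean().unstack("year")
print(wide)

# wide.plot() would now draw months on the x axis and one line per year
```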

Multiline plot for each id

I would like to plot multiple lines in Python for this dataset: (x = year, y = freq)
Student_ID Year Freq
A 2012 6
B 2008 22
C 2009 18
A 2010 7
B 2012 13
D 2012 31
D 2013 1
where each student_id has data for different years. count is the result of a groupby.
I would like to have one line for each student_id.
I have tried with this:
df.groupby(['year'])['freq'].count().plot()
but it does not plot one line for each student_id.
Any suggestions are more than welcome. Thank you for your help
I wasn't sure from your question if you wanted count (which in your example is all ones) or sum, so I solved this with sum. If you'd like count, just replace sum with count in the first line.
df_ = df.groupby(['Student_ID', 'Year'])['Freq'].sum()
print(df_)
Student_ID  Year
A           2010     7
            2012     6
B           2008    22
            2012    13
C           2009    18
D           2012    31
            2013     1
Name: Freq, dtype: int64
fig, ax = plt.subplots()
for student in set(a[0] for a in df_.index):
    df_[student].plot(ax=ax, label=student)
plt.legend()
plt.show()
Which gives you:
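A loop-free variant of the same idea, as a sketch: group by Year and Student_ID, then unstack the student level so each student becomes a column. The frame is rebuilt from the table in the question. Years with no record for a student become NaN, so a line plot will leave gaps there.

```python
import pandas as pd

# Rebuilt from the table in the question
df = pd.DataFrame({
    "Student_ID": ["A", "B", "C", "A", "B", "D", "D"],
    "Year": [2012, 2008, 2009, 2010, 2012, 2012, 2013],
    "Freq": [6, 22, 18, 7, 13, 31, 1],
})

# One row per year, one column per student
wide = df.groupby(["Year", "Student_ID"])["Freq"].sum().unstack("Student_ID")
print(wide)

# wide.plot(marker="o") would draw one line per student;
# the marker keeps isolated points visible where neighbours are NaN
```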

Multiple column plotting Python

I've got data in the form:
Year Month State Value
2001 Jan AK 80
2001 Feb AK 40
2001 Mar AK 60
2001 Jan LA 70
2001 Feb LA 79
2001 Mar LA 69
2001 Jan KS 65
.
.
This data is only for Year 2001 and Months repeat on each State.
I want a basic graph with this data together in one based off the State with X-Axis being Month and Y-Axis being the Value.
When I plot with:
g = df.groupby('State')
for state, data in g:
    plt.plot(df['Month'], df['Value'], label=state)
plt.show()
I get a very wonky looking graph.
I know based off plotting these individually they aren't extremely different in their behaviour but they are not even close to being this much overlapped.
Is there a way of building more of a continuous plot?
Your problem is that inside your for loop you're referencing df, which still has the data for all the states. Try:
for state, data in g:
    plt.plot(data['Month'], data['Value'], label=state)
plt.legend()
plt.show()
Hopefully this helps!
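The difference between df and data inside the loop can be checked without plotting. This sketch uses the AK and LA rows from the question and collects what each iteration would hand to plt.plot.

```python
import pandas as pd

# AK and LA rows taken from the question
df = pd.DataFrame({
    "Year": [2001] * 6,
    "Month": ["Jan", "Feb", "Mar"] * 2,
    "State": ["AK"] * 3 + ["LA"] * 3,
    "Value": [80, 40, 60, 70, 79, 69],
})

per_state = {}
for state, data in df.groupby("State"):
    # data holds only this state's rows; df still holds every state
    per_state[state] = list(zip(data["Month"].tolist(), data["Value"].tolist()))
    print(state, len(data), "rows in data vs", len(df), "rows in df")
```

Plotting df['Value'] in every iteration draws all six points each time, which is why the original lines overlapped.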

pandas histogram: extracting column and group by from data

I have a dataframe for which I'm looking at histograms of subsets of the data using column and by of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data: the number of records with that particular value of 'activity_month' as well as the value of 'activity_month' when I loop over the axes:
for i, x in enumerate(ax):
    print("the value of a is", a)
    print("the number of rows with value of a", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have a given value of activity_month equal to that,
a="January"
len(df[df["activity_month"] == a])
but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
['November',4],['February',98],['January',44],['October',47],['January',4],
['April',8],['March',21],['April',41],['June',34],['March',63]],
columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
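As a sketch of that recommendation, iterating over the groupby object gives both the month and its row count on each pass, with no need to inspect the axes. It uses the example dataframe from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [['January', 19], ['March', 6], ['January', 24], ['November', 83],
     ['February', 23], ['November', 4], ['February', 98], ['January', 44],
     ['October', 47], ['January', 4], ['April', 8], ['March', 21],
     ['April', 41], ['June', 34], ['March', 63]],
    columns=['activity_month', 'activity_count'])

counts = {}
for month, group in df.groupby('activity_month'):
    # month is the group key, group is the sub-frame for that month
    counts[month] = len(group)
    print(month, len(group))
```

The groups come out in sorted key order, so months are listed alphabetically rather than by calendar position.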

Calculate average for each quarter given month columns

Suppose I have a Python Pandas dataframe with 10 rows and 16 columns. Each row stands for one product. The first column is the product ID. The other 15 columns are selling prices for
2010/01,2010/02,2010/03,2010/05,2010/06,2010/07,2010/08,2010/10,2010/11,2010/12,2011/01,2011/02,2011/03,2011/04,2011/05.
(The column names are strings, not dates.) Now I want to calculate the mean selling price for each quarter (1Q2010, 2Q2010, ..., 2Q2011), but I don't know how to approach it. (Note that the months 2010/04, 2010/09 and 2011/06 are missing.)
The description above is just an example. Because this data set is quite small, it is possible to loop over it manually. However, the real data set I work on is 10730*202, so I cannot check by hand which months are missing or map months to quarters manually. Is there an efficient way to do this?
Thanks for the help!
This should help.
import pandas as pd
import numpy as np
rng = pd.DataFrame({'date': pd.date_range('1/1/2011', periods=72, freq='M'), 'value': np.arange(72)})
df = rng.groupby([rng.date.dt.quarter, rng.date.dt.year])[['value']].mean()
df.index.names = ['quarter', 'year']
df.columns = ['mean']
print(df)
              mean
quarter year
1       2011     1
        2012    13
        2013    25
        2014    37
        2015    49
        2016    61
2       2011     4
        2012    16
        2013    28
        2014    40
        2015    52
        2016    64
3       2011     7
        2012    19
        2013    31
        2014    43
        2015    55
        2016    67
4       2011    10
        2012    22
        2013    34
        2014    46
        2015    58
        2016    70
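The example above works on a long frame with a date column, whereas the question has months as string column names. A possible adaptation, as a sketch under assumptions: the product IDs sit in the index, the prices are made up, and only five of the month columns are shown. The column names are parsed into quarter labels and the transposed frame is grouped by them, so missing months simply contribute nothing to their quarter's mean.

```python
import pandas as pd

# Hypothetical prices; column names follow the question's YYYY/MM strings
prices = pd.DataFrame(
    {
        "2010/01": [10.0, 20.0],
        "2010/02": [11.0, 22.0],
        "2010/03": [12.0, 24.0],
        "2010/05": [13.0, 26.0],  # 2010/04 is missing, as in the question
        "2010/06": [15.0, 28.0],
    },
    index=pd.Index(["P1", "P2"], name="product_id"),
)

# Parse each column name and map it to a quarter label such as 2010Q1
quarters = pd.to_datetime(prices.columns, format="%Y/%m").to_period("Q")
quarter_labels = quarters.astype(str).to_numpy()

# Group the month rows of the transposed frame by quarter, then flip back
quarterly = prices.T.groupby(quarter_labels).mean().T
print(quarterly)
```

Because the quarter is derived from the column name itself, nothing has to be mapped by hand, which is what matters for the 10730*202 case.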
