Plotting bar graph by month - matplotlib - python

I have a dataset that is in the following form:
Date A B C
01/04/2012 2 5 Y
05/04/2012 3 4 Y
06/05/2012 7 6 Y
09/05/2012 8 2 N
11/05/2012 1 4 Y
15/06/2012 5 4 Y
That continues on with more rows.
I want to plot a bar chart with the date on the x-axis, converted to show just the month (i.e. April, May, June). On the y-axis I want the average of the per-row sum of the A and B columns for that month: for April that is 7 (14 total over two rows) and for May it is 9.33 (28 total over three rows).
I'm really struggling with how to do this and I'd prefer not to create another column that sums up A and B.

You can group on the month name, then chain mean and eval to add the A and B averages:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# numeric_only=True ignores the non-numeric C column (required on newer
# pandas); sort=False keeps months in the order they appear in the data.
monthly = (df.groupby(df['Date'].dt.month_name(), sort=False)
             .mean(numeric_only=True)
             .eval('A + B'))
monthly.plot(kind='bar')
print(monthly)
Date
April 7.000000
May 9.333333
June 9.000000
dtype: float64
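If you would rather avoid eval, a minimal alternative under the same assumptions is to sum A and B row-wise into a temporary Series and average that per month, so no extra column is ever stored on df:

# Per-row A+B lives only in the temporary Series ab, never on df.
ab = df['A'] + df['B']
ab.groupby(df['Date'].dt.month_name(), sort=False).mean().plot(kind='bar')

Because the mean is linear, the mean of the row sums equals the sum of the column means, so this produces the same numbers as above.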

Related

Use np.where on a Mixed data type column

I have a column with data type 'O' (object). It has numbers as well as strings. For example:
Days
5
10
15
7
No Sales Data available
9
I am trying to make a separate column using np.where, for which I have written:
np.where(df['Days'] == 'No Sales Data available', 'No Sales',
         np.where(df['Days'] <= 10, 'Less than 10 days Sales', 'More than 10 Days Sales'))
Naturally, the code runs into problems due to the mixed data types. Any idea how to get around such cases?
You could rewrite your statement this way, which preserves the data type of your 'Days' column:
import numpy as np
import pandas as pd

# Coerce once: non-numeric entries become NaN and mark the 'No Sale' rows.
days = pd.to_numeric(df['Days'], errors='coerce')
df['new'] = np.where(days.isna(), 'No Sale',
                     np.where(days <= 10, 'Less than 10 days Sales',
                              'More than 10 Days Sales'))
print(df)
Days new
0 5 Less than 10 days Sales
1 10 Less than 10 days Sales
2 15 More than 10 Days Sales
3 7 Less than 10 days Sales
4 No Sales Data available No Sale
5 9 Less than 10 days Sales
If you don't mind changing the type of your column, you could first convert to numeric and then follow similar logic:
df['Days'] = pd.to_numeric(df['Days'], errors='coerce')
df['new'] = np.where(df['Days'].isna(), 'No Sale',
                     np.where(df['Days'] <= 10, 'Less than 10 days Sales',
                              'More than 10 Days Sales'))
print(df)
Days new
0 5.0 Less than 10 days Sales
1 10.0 Less than 10 days Sales
2 15.0 More than 10 Days Sales
3 7.0 Less than 10 days Sales
4 NaN No Sale
5 9.0 Less than 10 days Sales
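If more labels are ever needed, np.select reads more cleanly than nested np.where calls. A sketch under the same assumptions as above (coercing 'Days' to numeric first):

days = pd.to_numeric(df['Days'], errors='coerce')
conditions = [days.isna(), days <= 10]
choices = ['No Sale', 'Less than 10 days Sales']
# Rows matching none of the conditions fall through to the default label.
df['new'] = np.select(conditions, choices, default='More than 10 Days Sales')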

How to group my time by month / week in pd.DataFrame

I have this DataFrame from my Facebook data that records the events I was interested in, the events I joined, and the respective timestamps for each. I am having trouble grouping the times by month or week, because there are two datetime columns:
joined_time interested_time
0 2019-04-01 2019-04-21
1 2019-03-15 2019-04-06
2 2019-03-13 2019-03-26
Both times indicate when I clicked the 'Going' or 'Interested' button when an event popped up on Facebook. Sorry for the very small sample size, but this is what I have simplified it down to at the moment. What I am trying to achieve is this:
Year Month Total_Events_No Events_Joined Events_Interested
2019 3 3 2 1
4 3 1 2
where the year and month form a MultiIndex and the other columns hold the counts for the respective situations.
I am using melt before groupby, then unstack:
# melt stacks both datetime columns into one long frame: 'variable' names
# the source column and 'value' holds the timestamp.
s = df.melt()
s['value'] = pd.to_datetime(s['value'])

# Count rows per (year, month, source column), pivot the source column
# back out, and add a row total.
s = s.groupby([s['value'].dt.year, s['value'].dt.month, s['variable']]).size().unstack()
s['Total'] = s.sum(axis=1)
s
variable interested_time joined_time Total
value value
2019 3 1 2 3
4 2 1 3
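The question also mentions weeks; a sketch of the same idea grouped by ISO year and week instead (this assumes pandas >= 1.1, where .dt.isocalendar() is available):

s = df.melt()
s['value'] = pd.to_datetime(s['value'])
# isocalendar() returns year/week/day columns; group on year and week.
iso = s['value'].dt.isocalendar()
weekly = s.groupby([iso['year'], iso['week'], s['variable']]).size().unstack(fill_value=0)
weekly['Total'] = weekly.sum(axis=1)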

Python: Pandas dataframe re-arrange rows based on last three digits of Integer in Column

I have the following dataframe:
YearMonth Total Cost
2015009 $11,209,041
2015010 $20,581,043
2015011 $37,079,415
2015012 $36,831,335
2016008 $57,428,630
2016009 $66,754,405
2016010 $45,021,707
2016011 $34,783,970
2016012 $66,215,044
YearMonth is an int64 column. A value such as 2015009 stands for September 2015. I want to re-order the rows so that rows whose last three digits match sit next to each other, sorted by year within each group.
Below is my desired output:
YearMonth Total Cost
2015009 $11,209,041
2016009 $66,754,405
2015010 $20,581,043
2016010 $45,021,707
2015011 $37,079,415
2016011 $34,783,970
2015012 $36,831,335
2016012 $66,215,044
2016008 $57,428,630
I have scoured Google trying to find how to do this, but to no avail.
# '%Y0%m' matches the literal 0 between the 4-digit year and 2-digit month.
df['YearMonth'] = pd.to_datetime(df['YearMonth'], format='%Y0%m')
df['Year'] = df['YearMonth'].dt.year
df['Month'] = df['YearMonth'].dt.month
df.sort_values(['Month', 'Year'])
YearMonth Total Year Month
8 2016-08-01 $57,428,630 2016 8
0 2015-09-01 $11,209,041 2015 9
1 2016-09-01 $66,754,405 2016 9
2 2015-10-01 $20,581,043 2015 10
3 2016-10-01 $45,021,707 2016 10
4 2015-11-01 $37,079,415 2015 11
5 2016-11-01 $34,783,970 2016 11
6 2015-12-01 $36,831,335 2015 12
7 2016-12-01 $66,215,044 2016 12
That is one way of doing it. There may be a quicker way with fewer steps that doesn't involve converting YearMonth to datetime, but if you have a date, it makes more sense to work with it as one.
Another way is to cast your int column to string and use the string accessor with indexing:
# Sort on the last three digits (the month part) as a throwaway key.
df.assign(sortkey=df.YearMonth.astype(str).str[-3:])\
  .sort_values('sortkey')\
  .drop('sortkey', axis=1)
Output:
YearMonth Total Cost
4 2016008 $57,428,630
0 2015009 $11,209,041
5 2016009 $66,754,405
1 2015010 $20,581,043
6 2016010 $45,021,707
2 2015011 $37,079,415
7 2016011 $34,783,970
3 2015012 $36,831,335
8 2016012 $66,215,044
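Since YearMonth is int64, the last three digits are simply YearMonth % 1000, so the string round-trip is avoidable. A sketch of that idea, sorting by (month, full value) so years stay ordered inside each month group:

# month = last three digits; sorting on ['month', 'YearMonth'] groups equal
# months together and orders by year within each group.
out = (df.assign(month=df['YearMonth'] % 1000)
         .sort_values(['month', 'YearMonth'])
         .drop(columns='month'))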

Compute operations within subgroups in pandas

I have a table that has multiple subgroups. For example, person A has a total of three visits and person B has a total of two visits. I also have the time of each visit:
id visit time_of_visit
A 1 2002-01-15
A 2 2003-01-15
A 3 2003-02-15
B 1 1996-08-09
B 2 1998-08-09
I want to compute how long apart each visit is in terms of years for each person. So I want something like this:
id visit time_of_visit difference_in_time
A 1 2002-01-15 na
A 2 2003-01-15 1
A 3 2003-02-15 0.0833
B 1 1996-08-09 na
B 2 1998-08-09 2
Any ideas how to do this in python pandas? Thanks!
groupby.diff on a datetime column will give you the gap between consecutive visits as timedeltas:
df['time_of_visit'] = pd.to_datetime(df['time_of_visit'])
df.groupby('id')['time_of_visit'].diff()
Out:
0 NaT
1 365 days
2 31 days
3 NaT
4 730 days
Name: time_of_visit, dtype: timedelta64[ns]
However, a timedelta cannot be expressed in years, since a year is not a fixed-length unit. You can always convert by your own rules, of course (for example, divide the day count by 365):
df.groupby('id')['time_of_visit'].diff().dt.days / 365
Out:
0 NaN
1 1.000000
2 0.084932
3 NaN
4 2.000000
Name: time_of_visit, dtype: float64
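To produce the difference_in_time column from the desired output, assign the converted result back to the frame (a sketch; dividing by 365.25 instead would average over leap years):

# Per-person gap between consecutive visits, in approximate years.
df['difference_in_time'] = df.groupby('id')['time_of_visit'].diff().dt.days / 365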

Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
the median value for each organisation in each month of the year
X for each organisation, where X is the percentage difference between the lowest and the highest median monthly value.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?
For your first error: groupby accepts a list of column names or the columns themselves. You passed a list of names, and 'date.month' doesn't exist as a column, so you want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max at a specific index level, so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the max and min values for each organisation, divides by the max at that level, and turns the result into a percentage by multiplying by 100.
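Note that the level keyword on max/min was removed in pandas 2.0; on current versions the equivalent (a sketch, using the same ave as above) is:

# Group the per-month medians by the organisation level of the index.
g = ave.groupby(level=0)
pct = (g.max() - g.min()) / g.max() * 100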
