Creating a box plot from value_counts() [number of events that occurred] - python

I have the following dataframe. Each entry is an event that occurred [550624 events]. Suppose we are interested in a box-plot of the number of events occurring per day each month.
print(df)
        Month  Day
0           4    1
1           4    1
2           4    1
3           4    1
4           4    1
...       ...  ...
550619     10   31
550620     10   31
550621     10   31
550622     10   31
550623     10   31

[550624 rows x 2 columns]
df2 = df.groupby('Month')['Day'].value_counts().sort_index()
Month  Day
4      1     2162
       2     1564
       3     1973
       4     1620
       5     1860
             ...
10     27    2022
       28    1606
       29    1316
       30    1674
       31    1726
sns.boxplot(x = df2.index.get_level_values('Month'), y = df2)
[figure: output of sns.boxplot]
My question is whether this is the most efficient/direct way to create this visual, or whether I am taking a roundabout route.
Is there a more direct way to achieve it?
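For what it's worth, one equivalent and arguably more direct spelling (a sketch; 'events' is just an illustrative column name) is to build a tidy counts frame with groupby(...).size() and let seaborn pick up the grouping by column name:
import seaborn as sns

# One row per (Month, Day) pair with its event count, in tidy form
counts = df.groupby(['Month', 'Day']).size().rename('events').reset_index()
sns.boxplot(x='Month', y='events', data=counts)
This draws the same boxes; the gain is readability rather than speed, since groupby(...).size() and value_counts() do essentially the same work.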

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column consisting of dates, and the other one holds quantities. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns: one is Month/Year and the other is Till Highest. I basically want the highest quantity value seen up to and including that month, grouped by month/year. Precisely, what I want is:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast, and I have readings for almost every day of each month and each year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and take the cumulative max quantity
 .assign(**{
     'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
     'Till highest': lambda d: d['Till highest'].cummax()
 })
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
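An equivalent, slightly shorter sketch of the same idea (monthly max, then a running max; it assumes only that 'Date' is parseable by to_datetime):
import pandas as pd

# Monthly maximum first, then a running maximum across the sorted months
month = pd.to_datetime(df['Date']).dt.to_period('M')
out = (df.groupby(month)['Quantity'].max()
         .cummax()
         .rename('Till highest')
         .reset_index())
out['Month/Year'] = out['Date'].dt.strftime('%b/%Y')
out = out[['Month/Year', 'Till highest']]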
In R you can use cummax:
df = data.frame(
  Date = c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11",
           "2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10",
           "2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05",
           "2020-06-16","2020-06-22"),
  Quantity = c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14)
)
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = FALSE, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

groupby week - pandas dataframe

I would like to group a pandas dataframe by week, counted back from the last entry day, and compute the sum of each column per week.
(One week runs Monday -> Sunday; if the last entry is a Tuesday, that week consists of Monday and Tuesday data only, not today minus 7 days.)
df:
a b c d e
2019-01-01 1 2 5 0 1
...
2020-01-25 2 3 6 1 0
2020-01-26 1 2 3 4 5
expected output:
week a b c d e
104 9 8 8 8 7
...
1 7 8 8 8 9
code:
df = df.rename_axis('date').reset_index()
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df.groupby(df.date.dt.strftime('%W')).sum()
Problem: this is not the week numbering I want, and week n of every year gets grouped into the same row.
Try extracting the ISO calendar (year-week-day) from the index, then groupby:
s = df.index.isocalendar()  # requires a DatetimeIndex (pandas >= 1.1)
df.groupby([s.year, s.week]).sum()
You would get something like this:
a b c d e
year week
2019 1 18 33 31 26 25
2 36 31 25 28 31
3 33 22 44 22 29
4 36 36 35 33 31
5 27 30 26 31 36
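If you specifically need weeks counted back from the last entry (104 ... 1, Monday-anchored, as in the expected output), here is a minimal sketch, assuming df keeps its DatetimeIndex:
import pandas as pd

# Monday of each row's week (dayofweek: Monday == 0) as the week anchor
week_start = df.index.normalize() - pd.to_timedelta(df.index.dayofweek, unit='D')
# Whole weeks back from the latest anchor; the newest (possibly partial) week is 1
weeks_back = (week_start.max() - week_start).days // 7 + 1
df.groupby(weeks_back).sum().rename_axis('week').sort_index(ascending=False)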

Appending DataFrame columns to another DataFrame at an index/location that meets conditions [duplicate]

This question already has answers here: Pandas Merging 101 (8 answers)
Closed 2 years ago.
I have a one_sec_flt DataFrame that has 300,000+ points and a flasks DataFrame that has 230 points. Both DataFrames have hour, minute, and second columns. I want to append each flasks row to the one_sec_flt data at the same time it was taken.
Flasks DataFrame
year month day hour minute second... gas1 gas2 gas3
0 2018 4 8 16 27 48... 10 25 191
1 2018 4 8 16 40 20... 45 34 257
...
229 2018 5 12 14 10 05... 3 72 108
one_sec_flt DataFrame
Year Month Day Hour Min Second... temp wind
0 2018 4 8 14 30 20... 300 10
1 2018 4 8 14 45 15... 310 8
...
305,212 2018 5 12 14 10 05... 308 24
Here is the code I started with, but I don't know how to append one DataFrame to the other at the exact matching timestamp.
for i in range(len(flasks)):
    for j in range(len(one_sec_flt)):
        if flasks.hour.iloc[i] == one_sec_flt.Hour.iloc[j]:
            if flasks.minute.iloc[i] == one_sec_flt.Min.iloc[j]:
                if flasks.second.iloc[i] == one_sec_flt.Sec.iloc[j]:
                    print('match')
My output goal would look like:
Year Month Day Hour Min Second... temp wind gas1 gas2 gas3
0 2018 4 8 14 30 20... 300 10 nan nan nan
1 2018 4 8 14 45 15... 310 8 nan nan nan
2 2018 4 8 15 15 47... ... ... nan nan nan
3 2018 4 8 16 27 48... ... ... 10 25 191
4 2018 4 8 16 30 11... ... ... nan nan nan
5 2018 4 8 16 40 20... ... ... 45 34 257
... ... ... ... ... ... ... ... ... ... ... ...
305,212 2018 5 12 14 10 05... 308 24 3 72 108
If you can concatenate both dataframes, Flasks & one_sec, and then sort by the times, it might achieve what you are looking for (at least, if I understood the problem statement correctly).
Flasks
Out[13]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
one_sec
Out[14]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res = pd.concat([Flasks,one_sec])
df_res
Out[16]:
year month day hour minute second
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
df_res.sort_values(by=['year','month','day','hour','minute','second'])
Out[17]:
year month day hour minute second
0 2018 4 8 14 30 20
1 2018 4 8 14 45 15
0 2018 4 8 16 27 48
1 2018 4 8 16 40 20
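For the goal layout above (gas columns attached on matching timestamps, NaN elsewhere), a left merge on the time columns is another option. A sketch, assuming the column names shown in the two tables (the question's loop hints some may actually be 'Min'/'Sec', so adjust the mapping to your real names):
import pandas as pd

# Rename flask time columns to match one_sec_flt, then left-merge so every
# one_sec_flt row is kept; gas columns are NaN where no flask sample matches
flasks_renamed = flasks.rename(columns={
    'year': 'Year', 'month': 'Month', 'day': 'Day',
    'hour': 'Hour', 'minute': 'Min', 'second': 'Second',
})
merged = one_sec_flt.merge(
    flasks_renamed,
    on=['Year', 'Month', 'Day', 'Hour', 'Min', 'Second'],
    how='left',
)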

Grouping data series by day intervals with Pandas

I have to perform some data analysis on a seasonal basis.
I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. What I want to do is sort this data into seasons.
Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see, I have data from three different years.
What I was thinking of doing is converting the first column with pd.to_datetime(), and then grouping the rows by day/month in dd/mm intervals regardless of the year (if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in that interval, whatever the year). But I couldn't manage to ignore the year, which makes things more complicated.
EDIT:
A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.
Define each season by filtering the relevant rows using Day and Month columns as presented for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12))
                   | (df['Month'] == 1)
                   | (df['Month'] == 2)
                   | ((df['Day'] <= 21) & (df['Month'] == 3))]
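The same pattern extends to the other seasons; a sketch using a year-agnostic month*100 + day key (the non-winter boundary dates below are assumptions, adjust them to your season definition):
# month*100 + day encodes e.g. 21 Dec as 1221, so interval tests ignore the year
key = df['Month'] * 100 + df['Day']
df_winter = df[(key >= 1221) | (key <= 321)]    # wraps the year boundary
df_spring = df[(key >= 322) & (key <= 620)]
df_summer = df[(key >= 621) & (key <= 922)]
df_autumn = df[(key >= 923) & (key <= 1220)]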
You can also simply filter your dataframe with Month.isin():
# spring
df[df['Month'].isin([3,4])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
2 19/04/2016 2016 4 19 3 3 0 3 1348 809 14.4
3 19/04/2016 2016 4 19 3 4 0 3 1353 812 14.1
10 07/03/2017 2017 3 7 3 14 0 3 3668 2201 14.2
11 07/03/2017 2017 3 7 3 15 0 3 3666 2200 14.0
12 24/04/2017 2017 4 24 2 5 0 2 1347 808 11.4
13 24/04/2017 2017 4 24 2 6 0 2 1816 1090 11.5
14 24/04/2017 2017 4 24 2 7 0 2 2918 1751 12.4
# autumn
df[df['Month'].isin([11,12])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
0 04/12/2015 2015 12 4 6 18 0 6 2968 1781 16.2
1 04/12/2015 2015 12 4 6 19 0 6 2437 1462 16.2
8 04/12/2016 2016 12 4 1 17 0 1 1425 855 14.6
9 04/12/2016 2016 12 4 1 18 0 1 1466 880 14.4
18 15/11/2017 2017 11 15 4 13 0 4 3765 2259 15.6
19 15/11/2017 2017 11 15 4 14 0 4 3873 2324 15.9
20 15/11/2017 2017 11 15 4 15 0 4 3905 2343 15.8
21 15/11/2017 2017 11 15 4 16 0 4 3861 2317 15.3

calculating mean and sum in pivot_table in pandas sorted by two separate desired col values

I have a data set from 2015-2018 which has month and day as the 2nd and 3rd columns, like below:
Year Month Day rain temp humidity snow
2015 1 1 0 20 60 0
2015 1 2 2 18 58 0
2015 1 3 0 20 62 2
2015 1 4 5 15 62 0
2015 1 5 2 18 61 1
2015 1 6 0 19 60 2
2015 1 7 3 20 59 0
2015 1 8 2 17 65 0
2015 1 9 1 17 61 0
I wanted to use pivot_table to calculate something like the mean of temperature for year 2016 and months (1, 2, 3).
I was wondering if anyone could help me with this?
You can do it with pd.cut, then groupby. One catch: with bins [0,3,6,9,12] and right=False, December (12) falls outside the last bin, so map it to 0 first:
df.temp.groupby([df.Year,
                 pd.cut(df.Month % 12, [-1, 2, 5, 8, 11],
                        labels=['Winter', 'Spring', 'Summer', 'Autumn'])]).mean()
Out[93]:
Year Month
2015 Winter 18.222222
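Since the question asked about pivot_table specifically, here is a rough equivalent sketch using the same season labels (aggfunc takes 'mean', 'sum', or both):
import pandas as pd

# Season label per row; December maps to 0 so it lands in the winter bin
season = pd.cut(df.Month % 12, [-1, 2, 5, 8, 11],
                labels=['Winter', 'Spring', 'Summer', 'Autumn'])
df.pivot_table(values='temp', index='Year', columns=season,
               aggfunc=['mean', 'sum'])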
