Iterating over groups in a dataframe [duplicate] - python

The issue I am having is that I want to group the dataframe and then use functions to manipulate the data after it has been grouped. For example, I want to group the data by Date and then iterate through each row in each date group to pass it to a function.
The issue is that groupby seems to create a tuple of the key and then one massive string consisting of all of the rows in that group, making iterating through each row impossible.

When you apply groupby to a dataframe, you don't get rows back; you get groups, each of which is itself a dataframe. For example, consider:
df
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
for i, g in df.groupby('ID'):
    print(g, '\n')
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
ID Date Days Volume/Day
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
For your case, you should probably look into dfGroupby.apply if you want to apply some function to your groups, dfGroupby.transform to produce a like-indexed dataframe (see the docs for an explanation), or dfGroupby.agg if you want to produce aggregated results.
You'd do something like:
r = df.groupby('Date').apply(your_function)
You'd define your function as:
def your_function(df):
    ...  # operation on df
    return result
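For example, here is a minimal, self-contained sketch; the column names and the per-group operation are made up for illustration, since the original data isn't shown:
import pandas as pd

# made-up data; 'Date' and 'Volume/Day' are assumed column names
df = pd.DataFrame({
    'Date': ['2016-01-01', '2016-01-01', '2016-02-01'],
    'Volume/Day': [50, 55, 40],
})

def your_function(group):
    # 'group' is an ordinary DataFrame holding all rows for one Date,
    # so you can iterate over its rows (or, better, vectorise)
    total = 0
    for _, row in group.iterrows():
        total += row['Volume/Day']
    return total

r = df.groupby('Date').apply(your_function)  # one result per Date
print(r)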
If you have problems with the implementation, please open a new question and post your data, your code, and any associated errors/tracebacks. Happy coding.

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column consisting of dates, and the other one consists of quantities. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns: one is Month/Year and the other is Till Highest. I basically want the highest quantity value seen up to and including each month, grouped by month/year. Precisely, what I want looks like this:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast: I have readings for almost every day of each month and each year in the specified timeline. The table above is a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and get cumulated max quantity
 .assign(**{'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
            'Till highest': lambda d: d['Till highest'].cummax()})
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In R you can use cummax:
df=data.frame(Date=c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11","2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10","2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05","2020-06-16","2020-06-22"),Quantity=c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14))
data.frame(`Month/Year`=unique(format(as.Date(df$Date),"%b/%Y")),
`Till Highest`=cummax(tapply(df$Quantity,sub("-..$","",df$Date),max)),
check.names=F,row.names=NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

merging tables and combining columns together in python [duplicate]

I have 2 csv files that I merged together based on their code number. Now there are 2 columns for dates, because one column is for dates in 2013 and the other is for dates in 2014. I'm not sure if it is a thing, but is there a way in pandas or python where I can "append" them into one single Date column?
Csv1
countyCode Date AQI
1 2013-01-14 122
6 2013-06-10 60
8 2013-10-20 82
Csv 2
countyCode Date AQI
1 2014-02-29 22
6 2014-08-11 41
8 2014-11-06 87
Here is my attempt in merging:
air2013=pd.read_csv("aqi_2013.csv", index_col=0)
air2014=pd.read_csv("aqi_2014.csv", index_col=0)
air2013.merge(air2014,on=['countyCode'])
Output (so far)
countyCode Date_x AQI Date_y AQI
1 2013-01-14 122 2014-02-29 22
6 2013-06-10 60 2014-08-11 41
8 2013-10-20 82 2014-11-06 87
Overall, is there a way where I can add the values from Date_y to Date_x so there is just one Date column?
When you call read_csv with index_col=0, the countyCode column becomes the index, i.e. it is treated as a unique identifier for each row. But those values are not unique, as they are repeated in both dataframes, so I don't think you want them to be your index in a merged dataframe. Try this instead:
import pandas as pd
air2013=pd.read_csv("aqi_2013.csv")
air2014=pd.read_csv("aqi_2014.csv")
air = pd.concat([air2013, air2014])
This gives:
countyCode Date AQI
0 1 2013-01-14 122
1 6 2013-06-10 60
2 8 2013-10-20 82
0 1 2014-02-29 22
1 6 2014-08-11 41
2 8 2014-11-06 87
Csv1.append(Csv2, ignore_index=True)
countyCode Date AQI
0 1 2013-01-14 122
1 6 2013-06-10 60
2 8 2013-10-20 82
3 1 2014-02-29 22
4 6 2014-08-11 41
5 8 2014-11-06 87
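Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the equivalent of the call above is pd.concat with ignore_index=True (a sketch assuming the same CSV files as in the question):
import pandas as pd

air2013 = pd.read_csv("aqi_2013.csv")
air2014 = pd.read_csv("aqi_2014.csv")
# same result as append(..., ignore_index=True): stack the rows and renumber the index
air = pd.concat([air2013, air2014], ignore_index=True)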

Compare Relative Start Dates in Pandas

I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas; I'm curious if there are any suggestions for how to develop the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values. Then you can run the following to calculate the sum of actions for each date and then group those values by month and calculate the corresponding monthly sum:
>>>df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>>df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> import datetime
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(x),'%m'),'%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
The trick is to use .apply() combined with dropna().
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
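As a rough, self-contained sketch of how that fits together (the toy data and the relative-month column names are assumptions based on the example in the question):
import numpy as np
import pandas as pd

# toy pivot table from the question: months across the columns, accounts down the rows
pivot = pd.DataFrame(
    [[1, 1, 2, 3],
     [np.nan, 1, 2, 2]],
    columns=['Date1', 'Date2', 'Date3', 'Date4'],
)

# shift each row left so its first non-NaN value lands in the first column
relative = pivot.T.apply(lambda x: pd.Series(x.dropna().values)).T

# relabel the positional columns as months relative to each account's first action
labels = ['FirstMonth', 'SecondMonth', 'ThirdMonth', 'FourthMonth']
relative.columns = labels[:relative.shape[1]]
print(relative)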

Sum and average after datetime pandas

I have a problem: I am trying to compute a sum grouped by the date within each hour, and also the average of column 2, again grouped by the date within each hour. I tried something like this, but I get errors.
df[6] = df.groupby(df[5].dt.hour).sum()
df
I know how to do it for the whole column, but I do not know how to get the average within each hour. I would like to get an output like this:
2 3 4 5 Sum Average
0 29 12 296 2017-01-01 01:00:07.500 4 47,7
1 29 12 296 2017-01-01 01:00:07.500 4 47,7
2 66 5 646 2017-01-01 01:00:31.410 4 47,7
3 66 5 646 2017-01-01 01:00:31.410 4 47,7
4 63 5 596 2017-01-01 02:00:32.670 2 63
5 63 5 596 2017-01-01 02:00:32.670 2 63
6 43 8 655 2017-01-01 03:00:36.720 2 43
7 43 8 655 2017-01-01 03:00:36.720 2 43
You can use:
df['Avg'] = df.groupby(pd.Grouper(key=5, freq='H'))[2].transform('mean')
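To also get the per-hour row count the question calls "Sum", here is a minimal sketch with made-up data; it keeps the integer column labels 2 and 5 from the question and assumes column 5 already holds datetimes:
import pandas as pd

# made-up data; column 5 is the datetime column, column 2 holds the values to average
df = pd.DataFrame({
    2: [29, 29, 66, 63],
    5: pd.to_datetime(['2017-01-01 01:00:07', '2017-01-01 01:00:31',
                       '2017-01-01 01:05:00', '2017-01-01 02:00:32']),
})

df['Sum'] = df.groupby(pd.Grouper(key=5, freq='H'))[2].transform('count')    # rows per hour
df['Average'] = df.groupby(pd.Grouper(key=5, freq='H'))[2].transform('mean')  # mean of column 2 per hour
print(df)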

create new columns with info of other column on python pandas DataFrame

I have a grouped dataframe
id num week
101 23 7 3
8 1
9 2
102 34 8 4
9 1
10 2
...
And I need to create new columns and have a dataFrame like this
id num 7 8 9 10
101 23 3 1 2 0
102 34 0 4 1 2
...
As you may see, the values of the week column turned into several columns.
I may also have the input dataFrame not grouped, or with reset_index, like this:
id num week
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
...
but I don't know which form would be easier to start from.
Note that id and num are both keys.
Use unstack() and fillna(0) to not have NaNs.
Let's load the data:
id num week val
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
s = pd.read_clipboard(index_col=[0,1,2], squeeze=True)
Notice I have set the index to be id, num and week. If you haven't yet, use set_index.
Now we can unstack, i.e. move a level from the index (rows) to the columns. By default it unstacks the last index level, which is week here, but you could specify it explicitly with level=-1 or level='week'.
s.unstack().fillna(0)
Note that, as pointed out by @piRsquared, you can do s.unstack(fill_value=0) to do it in one go.
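A self-contained sketch of the same idea, starting from the flat (reset_index) form of the data instead of read_clipboard (whose squeeze argument was removed in pandas 2.0); the unnamed value column is called 'val' here purely as an assumption:
import pandas as pd

# the flat form of the data from the question; 'val' is the unnamed value column
df = pd.DataFrame({
    'id':   [101, 101, 101, 102, 102, 102],
    'num':  [23, 23, 23, 34, 34, 34],
    'week': [7, 8, 9, 8, 9, 10],
    'val':  [3, 1, 2, 4, 1, 2],
})

# move 'week' from the row index to the columns, filling missing combinations with 0
wide = df.set_index(['id', 'num', 'week'])['val'].unstack(fill_value=0)
print(wide.reset_index())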
