Merging tables and combining columns together in Python [duplicate]

I have 2 CSV files that I merged on their code number. The result has 2 date columns, because one holds the 2013 dates and the other holds the 2014 dates. I'm not sure if this is possible, but is there a way in pandas or Python to "append" them into one single column of dates?
Csv1
countyCode Date AQI
1 2013-01-14 122
6 2013-06-10 60
8 2013-10-20 82
Csv 2
countyCode Date AQI
1 2014-02-29 22
6 2014-08-11 41
8 2014-11-06 87
Here is my attempt in merging:
air2013=pd.read_csv("aqi_2013.csv", index_col=0)
air2014=pd.read_csv("aqi_2014.csv", index_col=0)
air2013.merge(air2014,on=['countyCode'])
Output (so far)
countyCode Date_x AQI Date_y AQI
1 2013-01-14 122 2014-02-29 22
6 2013-06-10 60 2014-08-11 41
8 2013-10-20 82 2014-11-06 87
Overall, is there a way where I can add the values from Date_y to Date_x so there is just one Date column?

When you call read_csv with index_col=0, pandas uses the countyCode column as the index, i.e. as the row labels. Those values are repeated in both dataframes, so I don't think you want them as the index of the combined dataframe. Because both files have the same columns, you also want to stack the rows with concat rather than join them side by side with merge. Try this instead:
import pandas as pd
air2013=pd.read_csv("aqi_2013.csv")
air2014=pd.read_csv("aqi_2014.csv")
air = pd.concat([air2013, air2014])
This gives:
countyCode Date AQI
0 1 2013-01-14 122
1 6 2013-06-10 60
2 8 2013-10-20 82
0 1 2014-02-29 22
1 6 2014-08-11 41
2 8 2014-11-06 87

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
pd.concat([Csv1, Csv2], ignore_index=True)
countyCode Date AQI
0 1 2013-01-14 122
1 6 2013-06-10 60
2 8 2013-10-20 82
3 1 2014-02-29 22
4 6 2014-08-11 41
5 8 2014-11-06 87
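A self-contained sketch of the same idea, with the two CSV files replaced by inline frames so it runs without the files. The inline data mirrors the tables in the question; the final sort step and the name air_sorted are illustrative additions, not part of the original answers.

```python
import pandas as pd

# Stand-ins for aqi_2013.csv and aqi_2014.csv (dates kept as plain strings)
air2013 = pd.DataFrame({
    "countyCode": [1, 6, 8],
    "Date": ["2013-01-14", "2013-06-10", "2013-10-20"],
    "AQI": [122, 60, 82],
})
air2014 = pd.DataFrame({
    "countyCode": [1, 6, 8],
    "Date": ["2014-02-29", "2014-08-11", "2014-11-06"],
    "AQI": [22, 41, 87],
})

# Stack the rows; ignore_index=True renumbers the combined index from 0
air = pd.concat([air2013, air2014], ignore_index=True)

# Optional: keep each county's readings together, oldest first
air_sorted = air.sort_values(["countyCode", "Date"]).reset_index(drop=True)
print(air_sorted)
```

The result has a single Date column because concat aligns the two frames on their column names instead of pairing rows by key.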

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns: one is a datetime column of dates and the other holds quantities. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe with two columns: Month/Year and Till Highest. For each month I want the highest quantity seen up to and including that month, grouped by month/year. This is exactly what I want:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case, the dataset is vast, and I've readings of almost every day of each month and each year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # period and max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and get cumulated max quantity
 .assign(**{
     'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
     'Till highest': lambda d: d['Till highest'].cummax()
 })
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In R you can use cummax:
df=data.frame(Date=c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11","2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10","2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05","2020-06-16","2020-06-22"),Quantity=c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14))
data.frame(`Month/Year`=unique(format(as.Date(df$Date),"%b/%Y")),
`Till Highest`=cummax(tapply(df$Quantity,sub("-..$","",df$Date),max)),
check.names=F,row.names=NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

Pandas expanding mean after a certain date

I need some help with groupby and expanding mean in Python pandas.
In the table below, I want to group by the id column and compute an expanding mean by date. The catch is that for January I am not using an expanding mean. Think of January as a past month: for those rows I take the overall mean of the value column, grouped by id.
For February and March I want an expanding mean of the value column on top of January. So for 7 Feb and id 1, the 44.75 in the expanding mean column is the January mean, computed before that day's value of 89 is counted. The next row for id 1 is on 7-Mar, and its mean includes the previous value of 89 from 7 Feb.
So my idea is to take the overall mean up to Feb 1, and from then on use an expanding mean over everything seen before each date.
id date value count(prior) expanding mean (after feb)
1 1-Jan 28 4 44.75
2 1-Jan 43 3 37.33
3 1-Jan 69 3 57.00
1 2-Jan 31 4 44.75
2 2-Jan 22 3 37.33
1 7-Jan 82 4 44.75
2 7-Jan 47 3 37.33
3 7-Jan 79 3 57.00
1 8-Jan 38 4 44.75
3 8-Jan 23 3 57.00
1 7-Feb 89 4 44.75
2 7-Feb 22 3 37.33
3 7-Feb 80 3 57.00
2 19-Feb 91 4 33.50
3 19-Feb 97 4 62.75
1 7-Mar 48 5 53.60
2 7-Mar 98 5 45.00
3 7-Mar 35 5 69.60
I've given the count columns as a reference to how the count is increasing. It basically means everything prior to that date.
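One way the rule described above could be sketched, assuming it means "per-id mean of all rows before the cutoff for the January rows, and per-id mean of all strictly earlier rows afterwards". The year 2023, the cutoff variable, and the column name expanding_mean are assumptions made for illustration; the question does not give them.

```python
import pandas as pd

rows = [
    (1, "01-01", 28), (2, "01-01", 43), (3, "01-01", 69),
    (1, "01-02", 31), (2, "01-02", 22),
    (1, "01-07", 82), (2, "01-07", 47), (3, "01-07", 79),
    (1, "01-08", 38), (3, "01-08", 23),
    (1, "02-07", 89), (2, "02-07", 22), (3, "02-07", 80),
    (2, "02-19", 91), (3, "02-19", 97),
    (1, "03-07", 48), (2, "03-07", 98), (3, "03-07", 35),
]
df = pd.DataFrame(rows, columns=["id", "date", "value"])

# The question gives no year; 2023 is assumed only to build real timestamps
df["date"] = pd.to_datetime("2023-" + df["date"])
cutoff = pd.Timestamp("2023-02-01")

# The expanding mean relies on rows being in date order within each id
df = df.sort_values(["date", "id"]).reset_index(drop=True)

# Mean of all strictly earlier values for the same id (shift drops the current row)
prior_mean = df.groupby("id")["value"].transform(
    lambda s: s.expanding().mean().shift()
)

# Per-id mean of everything before the cutoff, used for the January rows
baseline = df.loc[df["date"] < cutoff].groupby("id")["value"].mean()

# Keep prior_mean from the cutoff onwards, and the fixed baseline before it
before_cutoff = df["date"] < cutoff
df["expanding_mean"] = prior_mean.where(~before_cutoff, df["id"].map(baseline))
print(df)
```

On the sample data this gives 44.75 for every id 1 row through 7-Feb and 53.60 for id 1 on 7-Mar, matching the table in the question.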

Compare Relative Start Dates in Pandas

I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas. I'm curious whether there are any suggestions for how to build the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values. Then you can run the following to calculate the sum of actions for each date and then group those values by month and calculate the corresponding monthly sum:
>>>df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>>df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> import datetime
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(x),'%m'),'%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
The trick is to use .apply() combined with dropna().
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
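A small sketch of that one-liner applied to the two-row example from the question. The frame name pivot and the Month1, Month2, ... column labels are illustrative choices, not from the original posts.

```python
import numpy as np
import pandas as pd

# The pivot table from the question; NaN marks months before the first action
pivot = pd.DataFrame(
    [[1, 1, 2, 3], [np.nan, 1, 2, 2]],
    columns=["Date1", "Date2", "Date3", "Date4"],
)

# After transposing, each column is one account. Dropping its NaNs and
# rebuilding a Series with a fresh 0..n index shifts the first action to 0.
relative = pivot.T.apply(lambda col: pd.Series(col.dropna().to_numpy())).T

# Relabel positions 0, 1, 2, ... as months relative to the first action
relative.columns = [f"Month{i + 1}" for i in relative.columns]
print(relative)
```

Rows that start later end up shorter, so they are padded with NaN on the right instead of the left.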

Iterating over groups in a dataframe [duplicate]

I want to group the dataframe and then use functions to manipulate the data after it has been grouped. For example, I want to group the data by Date and then iterate through each row in the date groups to pass it to a function.
The problem is that groupby seems to create a tuple of the key and then a massive string consisting of all of the rows in the data, which makes iterating through each row impossible.
When you apply groupby to a dataframe, you don't get rows, you get one sub-dataframe per group. Iterating over the groupby yields (key, sub-dataframe) pairs. For example, consider:
df
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
for i, g in df.groupby('ID'):
    print(g, '\n')
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
ID Date Days Volume/Day
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
For your case, look into DataFrameGroupBy.apply if you want to apply a function to each group, DataFrameGroupBy.transform if you want a result indexed like the original dataframe (see the docs for an explanation), or DataFrameGroupBy.agg if you want aggregated results.
You'd do something like:
r = df.groupby('Date').apply(your_function)
You'd define your function as:
def your_function(df):
    ... # operation on df
    return result
If you have problems with the implementation, please open a new question, post your data and your code, and any associated errors/tracebacks. Happy coding.
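A runnable sketch of the three options on the example frame above. The helper total_volume and the DaysPerID column are hypothetical names chosen for illustration. The value columns are selected explicitly before apply, which keeps the grouping column out of the function's input.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [111] * 5 + [112] * 5,
    "Date": ["2016-01-01", "2016-02-01", "2016-03-01", "2016-04-01", "2016-05-01",
             "2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-01-05"],
    "Days": [20, 25, 31, 30, 31, 31, 26, 31, 30, 31],
    "Volume/Day": [50, 40, 35, 30, 25, 55, 45, 40, 35, 30],
})

def total_volume(group):
    # group is the sub-dataframe for one ID
    return (group["Days"] * group["Volume/Day"]).sum()

# apply: one result per group
totals = df.groupby("ID")[["Days", "Volume/Day"]].apply(total_volume)

# transform: result is indexed like the original rows
df["DaysPerID"] = df.groupby("ID")["Days"].transform("sum")

# agg: one aggregated value per group
mean_days = df.groupby("ID")["Days"].agg("mean")

# Plain iteration also works: each item is (key, sub-dataframe),
# and the rows of the sub-dataframe can be walked with itertuples
row_counts = {}
for key, group in df.groupby("ID"):
    for row in group.itertuples(index=False):
        row_counts[key] = row_counts.get(key, 0) + 1

print(totals)
```

Explicit loops over rows are usually much slower than apply, transform, or agg, so they are best kept for logic that cannot be vectorised.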

create new columns with info of other column on python pandas DataFrame

I have a grouped dataframe
id num week
101 23 7 3
8 1
9 2
102 34 8 4
9 1
10 2
...
And I need to create new columns and have a dataFrame like this
id num 7 8 9 10
101 23 3 1 2 0
102 34 0 4 1 2
...
As you may see, the values of the week column turned into several columns.
I may also have the input dataFrame not grouped, or with reset_index, like this:
id num week
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
...
but I don't know which of the two would be easier to start from.
Notice that id and num are both keys.
Use unstack() and fillna(0) to not have NaNs.
Let's load the data:
id num week val
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
# the squeeze=True keyword was removed in pandas 2.0; call .squeeze("columns") instead
s = pd.read_clipboard(index_col=[0,1,2]).squeeze("columns")
Notice I have set the index to be id, num and week. If you haven't yet, use set_index.
Now we can unstack, which moves a level of the index (rows) to the columns. By default it moves the last index level, which is week here, but you can name it explicitly with level=-1 or level='week'.
s.unstack().fillna(0)
Note that, as pointed out by @piRsquared, you can do s.unstack(fill_value=0) to do it in one go.
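The same steps as a self-contained sketch that builds the data inline instead of reading the clipboard. The column name val follows the answer's table; the names wide and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [101, 101, 101, 102, 102, 102],
    "num":  [23, 23, 23, 34, 34, 34],
    "week": [7, 8, 9, 8, 9, 10],
    "val":  [3, 1, 2, 4, 1, 2],
})

# Put the keys and week in the index, then move week (the last level) to the columns.
# fill_value=0 fills the weeks that are missing for a given (id, num) pair.
wide = df.set_index(["id", "num", "week"])["val"].unstack(fill_value=0)

# Turn id and num back into ordinary columns if a flat table is preferred
flat = wide.reset_index()
print(flat)
```

Starting from the ungrouped form is the simpler route here, since set_index followed by unstack covers both input shapes shown in the question.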
