Select data, resample and join horizontally in Python

I have two dataframes df1 and df2.
df1:
Id Date Remark
0 28 2010-04-08 xx
1 29 2010-10-10 yy
2 30 2012-12-03 zz
3 31 2010-03-16 aa
df2:
Id Timestamp Value Site
0 28 2010-04-08 13:20:15.120 125.0 93
1 28 2010-04-08 13:20:16.020 120.0 94
2 28 2010-04-08 13:20:18.020 135.0 95
3 28 2010-04-08 13:20:18.360 140.0 96
...
1000 29 2010-06-15 05:04:15.120 16.0 101
1001 29 2010-06-15 05:05:16.320 14.0 101
...
I would like to select all Value data 10 days before/including the Date in df1 from df2 for the same Id. For example, for Id 28, Date is 2010-04-08, so select Value where Timestamp is between 2010-03-30 00:00:00 and 2010-04-08 23:59:59 (inclusive).
Then, I want to resample the Value data at 1-minute frequency, using forward fill (ffill) and backward fill (bfill), so that there are exactly 10 x 24 x 60 = 14400 values for each Id.
Lastly, I'd like to rearrange the dataframe horizontally with transpose.
Expected output looks like this:
Id Date value1 value2 ... value14399 value14400 Remark
0 28 2010-04-08 125.0 125.0 ... ... xx (value1, value2, and the following values before "2010-04-08 13:20:15.120" are 125.0 as a result of backward fill, since the first value for Id 28 is 125.0)
1 29 2010-10-10 16.0 16.0 ... yy
...
I am not sure what the best way to approach this problem is, since I'm adding "another time series dimension" to the dataframe. Any ideas are appreciated.
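One possible sketch (not a definitive implementation; it assumes Date and Timestamp parse as datetimes and that the wide layout in the expected output is what you want): loop over df1, slice the matching Id's 10-day window out of df2, resample to 1-minute bins with ffill/bfill, and collect the results into a wide frame:

import pandas as pd

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])

rows = []
for _, r in df1.iterrows():
    start = r['Date'] - pd.Timedelta(days=9)  # window opens 9 days before Date, at 00:00
    end = r['Date'] + pd.Timedelta(days=1)    # exclusive upper bound, i.e. end of Date
    grid = pd.date_range(start, end, freq='1min', inclusive='left')  # exactly 14400 stamps
    s = (df2[(df2['Id'] == r['Id'])
             & (df2['Timestamp'] >= start)
             & (df2['Timestamp'] < end)]
         .set_index('Timestamp')['Value']
         .resample('1min').mean()  # collapse multiple readings within a minute
         .reindex(grid)
         .ffill().bfill())
    rows.append(s.to_numpy())

wide = pd.DataFrame(rows, columns=[f'value{i+1}' for i in range(14400)])
out = pd.concat([df1[['Id', 'Date']], wide, df1['Remark']], axis=1)

Note that inclusive='left' needs pandas 1.4+; on older versions use closed='left'.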

Related

How to merge rows in pandas dataframe [duplicate]

I have a dataframe that looks like this:
productID  units sold  units in inventory
101        32          NaN
102        45          NaN
103        15          NaN
104        27          NaN
101        NaN         18
102        NaN         12
103        NaN         30
104        NaN         23
As you can see, the first column contains duplicates, where each instance has data in one of the two data columns but not the other.
Is there a way to merge the rows, so the dataframe looks like this?
productID  units sold  units in inventory
101        32          18
102        45          12
103        15          30
104        27          23
Try groupby.first, which returns the first non-null value in each column for every group:
>>> df.groupby('productID', as_index=False).first()
productID units sold units in inventory
0 101 32.0 18.0
1 102 45.0 12.0
2 103 15.0 30.0
3 104 27.0 23.0
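For reference, here's a quick way to reproduce the example frame and confirm that first() skips the NaNs per column (a sketch; column names taken from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'productID': [101, 102, 103, 104, 101, 102, 103, 104],
    'units sold': [32, 45, 15, 27, np.nan, np.nan, np.nan, np.nan],
    'units in inventory': [np.nan, np.nan, np.nan, np.nan, 18, 12, 30, 23],
})
print(df.groupby('productID', as_index=False).first())  # NaNs are skipped per column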

calculate the difference between pandas rows in pairs

I have a dataframe of orders as below, where the column 'Value' represents cash in/out and the 'Date' column reflects when the transaction occurred.
Each transaction is paired, so that the 'Qty' out is always followed by the 'Qty' back in, as reflected by the sign in the 'Qty' column:
Date Qty Price Value
0 2014-11-18 58 495.775716 -2875499
1 2014-11-24 -58 484.280147 2808824
2 2014-11-26 63 474.138699 -2987073
3 2014-12-31 -63 507.931247 3199966
4 2015-01-05 59 495.923771 -2925950
5 2015-02-05 -59 456.224370 2691723
How can I create two columns, 'n_days' and 'price_diff', holding the difference in days between the two dates of each transaction pair and the corresponding difference in 'Value'?
I have tried:
df['price_diff'] = df['Value'].rolling(2).apply(lambda x: x[0] + x[1])
but I receive a KeyError for the first observation (0).
Many thanks
Why don't you just use sum:
df['price_diff'] = df['Value'].rolling(2).sum()
Although, judging by the column name, you may actually want:
df['price_diff'] = df['Price'].diff()
And, for the two columns:
df[['Date_diff','Price_diff']] = df[['Date','Price']].diff()
Output:
Date Qty Price Value Date_diff Price_diff
0 2014-11-18 58 495.775716 -2875499 NaT NaN
1 2014-11-24 -58 484.280147 2808824 6 days -11.495569
2 2014-11-26 63 474.138699 -2987073 2 days -10.141448
3 2014-12-31 -63 507.931247 3199966 35 days 33.792548
4 2015-01-05 59 495.923771 -2925950 5 days -12.007476
5 2015-02-05 -59 456.224370 2691723 31 days -39.699401
Update: per the comment, you can try keeping only every second rolling sum, so that the value lands on the closing row of each pair:
df['Val_sum'] = df['Value'].rolling(2).sum()[1::2]
Output:
Date Qty Price Value Val_sum
0 2014-11-18 58 495.775716 -2875499 NaN
1 2014-11-24 -58 484.280147 2808824 -66675.0
2 2014-11-26 63 474.138699 -2987073 NaN
3 2014-12-31 -63 507.931247 3199966 212893.0
4 2015-01-05 59 495.923771 -2925950 NaN
5 2015-02-05 -59 456.224370 2691723 -234227.0
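If you instead want the per-pair quantities the question asks for ('n_days' and 'price_diff' between the two rows of each transaction), one possible sketch is to label consecutive row pairs and broadcast the within-pair differences back. This assumes each pair really does occupy two consecutive rows, as in the sample:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
pair = df.index // 2  # 0,0,1,1,2,2, ... one label per buy/sell pair
df['n_days'] = df.groupby(pair)['Date'].transform(lambda s: (s.iloc[-1] - s.iloc[0]).days)
df['price_diff'] = df.groupby(pair)['Price'].transform(lambda s: s.iloc[-1] - s.iloc[0])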

Pandas expanding mean after a certain date

I need some help with groupby and expanding mean in python pandas.
I am trying to use pandas expanding mean together with groupby. In the table below, I want to group by the id column and take an expanding mean by date. The catch is that for January I don't use an expanding mean: you can think of January as a past month, so I take the overall mean of the value column, grouped by id.
For February and March I want to use the expanding mean of the value column on top of January. So for 7-Feb and id 1, the 44.75 in the expanding mean column is the mean of the January values, before the value of 89 occurs that day. The next value for id 1, on 7-Mar, is inclusive of the previous value of 89 from 7-Feb.
So basically my idea is to take the overall mean up to Feb 1, and then apply an expanding mean on top of whatever mean has been calculated up to that date.
id date value count(prior) expanding mean (after feb)
1 1-Jan 28 4 44.75
2 1-Jan 43 3 37.33
3 1-Jan 69 3 57.00
1 2-Jan 31 4 44.75
2 2-Jan 22 3 37.33
1 7-Jan 82 4 44.75
2 7-Jan 47 3 37.33
3 7-Jan 79 3 57.00
1 8-Jan 38 4 44.75
3 8-Jan 23 3 57.00
1 7-Feb 89 4 44.75
2 7-Feb 22 3 37.33
3 7-Feb 80 3 57.00
2 19-Feb 91 4 33.50
3 19-Feb 97 4 62.75
1 7-Mar 48 5 53.60
2 7-Mar 98 5 45.00
3 7-Mar 35 5 69.60
I've included the count column as a reference for how the count increases; it counts everything prior to that date.
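A possible sketch (assuming date is a real datetime column and a hypothetical cutoff of Feb 1; the year below is made up): rows before the cutoff get the overall pre-cutoff mean per id, and later rows get the expanding mean of everything seen before the current row:

import pandas as pd

cutoff = pd.Timestamp('2021-02-01')  # hypothetical year; adjust to your data

def mean_with_cutoff(g):
    prior_mean = g.loc[g['date'] < cutoff, 'value'].mean()  # overall January mean
    running = g['value'].expanding().mean().shift()  # mean of all earlier values, excluding the current row
    return running.where(g['date'] >= cutoff, prior_mean)

df = df.sort_values(['id', 'date'])
df['expanding mean'] = df.groupby('id', group_keys=False).apply(mean_with_cutoff)

For id 1 on 7-Feb this yields (28 + 31 + 82 + 38) / 4 = 44.75, and on 7-Mar (28 + 31 + 82 + 38 + 89) / 5 = 53.6, matching the table.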

Compare Relative Start Dates in Pandas

I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas; I'm curious if there are any suggestions for how to develop the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values. Then you can run the following to calculate the sum of actions for each date and then group those values by month and calculate the corresponding monthly sum:
>>> df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>> df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> import datetime
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strptime(str(x), '%m').strftime('%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
Another approach: to align each row to its own first non-null month, the trick is to use .apply() combined with dropna():
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
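Applied to the example table from the question (a sketch; the FirstMonth/SecondMonth renaming is an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Date1': [1, np.nan], 'Date2': [1, 1], 'Date3': [2, 2], 'Date4': [3, 2]})
out = df.T.apply(lambda x: pd.Series(x.dropna().values)).T  # shift each row left past its leading NaNs
out.columns = ['FirstMonth', 'SecondMonth', 'ThirdMonth', 'FourthMonth']
print(out)
#    FirstMonth  SecondMonth  ThirdMonth  FourthMonth
# 0         1.0          1.0         2.0          3.0
# 1         1.0          2.0         2.0          NaN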

Iterating over groups in a dataframe [duplicate]

The issue I am having is that I want to group the dataframe and then use functions to manipulate the data after it has been grouped. For example, I want to group the data by Date and then iterate through each row in the date groups, passing each row to a function.
The problem is that groupby seems to create a tuple of the key and then a massive string consisting of all of the rows in the group, which makes iterating through each row seem impossible.
When you apply groupby on a dataframe, you don't get rows back, you get groups: one sub-DataFrame per key. For example, consider:
df
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
for i, g in df.groupby('ID'):
print(g, '\n')
ID Date Days Volume/Day
0 111 2016-01-01 20 50
1 111 2016-02-01 25 40
2 111 2016-03-01 31 35
3 111 2016-04-01 30 30
4 111 2016-05-01 31 25
ID Date Days Volume/Day
5 112 2016-01-01 31 55
6 112 2016-01-02 26 45
7 112 2016-01-03 31 40
8 112 2016-01-04 30 35
9 112 2016-01-05 31 30
For your case, you should probably look into DataFrameGroupBy.apply if you want to apply some function to your groups, DataFrameGroupBy.transform to produce a like-indexed dataframe (see the docs for an explanation), or DataFrameGroupBy.agg if you want aggregated results.
You'd do something like:
r = df.groupby('Date').apply(your_function)
You'd define your function as:
def your_function(df):
    ...  # operation on df
    return result
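And if you really do need row-by-row access, you can iterate the rows within each group directly. A sketch, where process_row is a hypothetical per-row function:

for date, group in df.groupby('Date'):
    for idx, row in group.iterrows():
        process_row(row)  # hypothetical: handle a single row as a Series

Keep in mind that iterrows is slow on large frames, so the apply/transform/agg route above is usually preferable.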
If you have problems with the implementation, please open a new question, post your data and your code, and any associated errors/tracebacks. Happy coding.
