Moving average for months over years - python

I am new to pandas and would appreciate guidance with the following problem. I have a dataframe that looks like the following:
In [88]: df.head()
Out[88]:
Jan Feb Mar Apr May Jun ... Dec
Year ...
1758 13 15 14 5 5 5 ... 12
1759 11 10 7 4 3 6 ... 11
1760 19 15 18 5 13 6 ... 11
1761 14 16 14 9 9 11 ... 10
1762 13 12 12 8 5 3 ... 11
I need to compute moving average per month in the following way:
Moving_average of Mar_1761 = (value_of_Mar_1761)/(sum of values from Sep_1760 to Aug_1761)
If I use the rolling-average functionality of pandas, how do I code the logic to inspect the predecessor or successor rows of a particular point?

The easiest approach is to reshape the data to a long format using .stack, which can be passed straight into a rolling mean.
In [34]: pd.rolling_mean(df.stack(), window=12)
Out[34]:
Year
1758 Jan NaN
Feb NaN
Mar NaN
Apr NaN
May NaN
Jun NaN
Jul NaN
Aug NaN
Sep NaN
Oct NaN
Nov NaN
Dec 0.035038
1759 Jan -0.076660
Feb -0.153907
Mar -0.286818
Apr -0.306684
May -0.159371
Jun -0.230627
Jul -0.175845
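Two caveats worth noting. The Out[34] values above evidently come from different (randomly generated) data than the head shown in the question, since the shown values are all positive integers, and pd.rolling_mean was removed in later pandas releases in favour of the .rolling accessor. Below is a sketch of both the modern call and the exact ratio the question asks for, assuming df is a year-by-month frame like the one shown (the sample values here are made up):
import numpy as np
import pandas as pd

# Hypothetical stand-in for the year-by-month frame in the question.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(3, 20, size=(5, 12)),
                  index=pd.Index(range(1758, 1763), name='Year'),
                  columns=months)

# Stack to one long series, chronological by construction.
s = df.stack()

# Modern replacement for the removed pd.rolling_mean(s, window=12):
rolling_mean = s.rolling(window=12).mean()

# The ratio the question actually asks for: each month's value divided
# by the 12-month sum from September of the previous year through August
# of the current year. That window ends five positions after the current
# month (Mar 1761 pairs with the window ending Aug 1761), and pandas
# labels a rolling sum at the window's end, so shift it back by five.
ratio = s / s.rolling(window=12).sum().shift(-5)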

Related

Combine multiple Pandas series with identical column names, but different indices

I have many pandas series structured more or less as follows.
s1 s2 s3 s4
Date val1 Date val1 Date val2 Date val2
Jan 10 Apr 25 Jan 14 Apr 11
Feb 11 May 18 Feb 17 May 7
Mar 8 Jun 15 Mar 16 Jun 21
I would like to combine these series into a single data frame, with structure as follows:
Date val1 val2
Jan 10 14
Feb 11 17
Mar 8 16
Apr 25 11
May 18 7
Jun 15 21
In an attempt to combine them, I have tried using pd.concat to create this single data frame. However, I have not been able to do so. The result of pd.concat(series, axis=1) (where series is the list [s1, s2, s3, s4]) is:
Date val1 val1 val2 val2
Jan 10 nan 14 nan
Feb 11 nan 17 nan
Mar 8 nan 16 nan
Apr nan 25 nan 11
May nan 18 nan 7
Jun nan 15 nan 21
And pd.concat(series, axis=0) simply creates a single series, ignoring the column names.
Is there a parameter in concat that will yield my desired result? Or is there some other function that can collapse the incorrect, nan-filled data frame into a frame with non-repeated columns and no nans?
One way to do this is to group by Date and take the first non-null value in each group:
(pd.concat([s1, s2, s3, s4])
 .groupby('Date', as_index=False, sort=False)
 .first()
)
Output:
Date val1 val2
0 Jan 10 14
1 Feb 11 17
2 Mar 8 16
3 Apr 25 11
4 May 18 7
5 Jun 15 21
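For reference, a self-contained sketch of that pipeline, reconstructing the inputs from the question (each input taken to be a small frame with a Date column, values copied from the tables above):
import pandas as pd

# Hypothetical reconstructions of the four inputs shown above.
s1 = pd.DataFrame({'Date': ['Jan', 'Feb', 'Mar'], 'val1': [10, 11, 8]})
s2 = pd.DataFrame({'Date': ['Apr', 'May', 'Jun'], 'val1': [25, 18, 15]})
s3 = pd.DataFrame({'Date': ['Jan', 'Feb', 'Mar'], 'val2': [14, 17, 16]})
s4 = pd.DataFrame({'Date': ['Apr', 'May', 'Jun'], 'val2': [11, 7, 21]})

# Stack everything lengthwise, then collapse rows that share a Date.
# .first() takes the first non-null value per column in each group, so
# the val1 half and the val2 half of each month merge without NaNs.
out = (pd.concat([s1, s2, s3, s4])
         .groupby('Date', as_index=False, sort=False)
         .first())
print(out)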

Pandas MultiIndex: iterate rows and add specific values to create a new variable

I have a pandas data frame with a MultiIndex (id and datetime) and one column named X1.
X1
id datetime
a1ssjdldf 2019 Jul 10 2
2019 Jul 11 22
2019 Jul 12 21
r2dffs 2019 Jul 10 14
2019 Jul 11 13
2019 Jul 12 11
I want to create a new variable X2 whose value is the difference between the X1 value of the same row and the X1 value of the previous row. But every time a new id starts, the value has to restart from zero.
For example:
X1 X2
id datetime
a1ssjdldf 2019 Jul 10 2 0
2019 Jul 11 22 20
2019 Jul 12 21 -1
r2dffs 2019 Jul 10 14 0
2019 Jul 11 13 -1
2019 Jul 12 11 -2
Use DataFrameGroupBy.diff grouped by the first index level, then replace the resulting missing values with Series.fillna:
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0, downcast='int')
print (df)
X1 X2
id datetime
a1ssjdldf 2019 Jul 10 2 0
2019 Jul 11 22 20
2019 Jul 12 21 -1
r2dffs 2019 Jul 10 14 0
2019 Jul 11 13 -1
2019 Jul 12 11 -2
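One caveat: the downcast= argument of fillna is deprecated in pandas 2.x. Here is a self-contained sketch of the same answer with an explicit cast instead (the frame is reconstructed from the question):
import pandas as pd

# Hypothetical reconstruction of the MultiIndex frame from the question.
idx = pd.MultiIndex.from_product(
    [['a1ssjdldf', 'r2dffs'], ['2019 Jul 10', '2019 Jul 11', '2019 Jul 12']],
    names=['id', 'datetime'])
df = pd.DataFrame({'X1': [2, 22, 21, 14, 13, 11]}, index=idx)

# diff within each id (level 0 of the index); the first row of each group
# has no predecessor, so it comes back NaN and is filled with 0.
df['X2'] = df.groupby(level=0)['X1'].diff().fillna(0).astype(int)
print(df)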

How to add a column with the time to a pandas dataframe (created from a JSON)?

I retrieve data (JSON format) from a software API and transform it into a dataframe to write it to a CSV (pandas library). I would like to add a column with the time: the header "time" on the first row and, for example, "Fri Mar 29 09:16:02 2019" on the rows below it. Any idea how to achieve this?
So far I have only managed to get the time onto the first row of my dataframe.
import json
import pandas as pd
import time
import urllib.request
url='http://localhost:47800/api/v1/bacnet/devices/0/objects?properties=present-value&properties=object-name'
req = urllib.request.Request(url)
r = urllib.request.urlopen(req).read()
data = json.loads(r.decode('utf-8'))
time=time.asctime(time.localtime(time.time()))
result = pd.io.json.json_normalize(data['objects'])
result_tri = result.reindex(columns=[time,'object-name','present-value'])
Current result
Fri Mar 29 09:47:36 2019 object-name present-value
0 NaN Température_1 0 660.0
1 NaN Humidité_1 1 497.0
2 NaN Pression_1 2 497.0
3 NaN Vitesse_Vent 3 497.0
4 NaN Luminosité 4 497.0
5 NaN Etat_Pompe 3 0.0
6 NaN Greisch_Simulator NaN
7 NaN networkPort 30800 NaN
Desired result
Time object-name present-value
0 Fri Mar 29 09:47:36 2019 Température_1 0 660.0
1 Fri Mar 29 09:47:36 2019 Humidité_1 1 497.0
2 Fri Mar 29 09:47:36 2019 Pression_1 2 497.0
3 Fri Mar 29 09:47:36 2019 Vitesse_Vent 3 497.0
4 Fri Mar 29 09:47:36 2019 Luminosité 4 497.0
5 Fri Mar 29 09:47:36 2019 Etat_Pompe 3 0.0
6 Fri Mar 29 09:47:36 2019 Greisch_Simulator NaN
7 Fri Mar 29 09:47:36 2019 networkPort 30800 NaN
Use:
result_tri = result.reindex(columns=['Time','object-name','present-value'])
result_tri['Time'] = time
You can add the new column to your df directly.
When you are doing
result_tri = result.reindex(columns=[time,'object-name','present-value'])
you are actually doing
result_tri = result.reindex(columns=["Fri Mar 29 09:47:36 2019", 'object-name', 'present-value'])
time is a variable in your code, so it gets replaced by the string value you assigned to it.
You just need to do:
result = pd.io.json.json_normalize(data['objects'])
result['Time'] = time.asctime(time.localtime(time.time()))
result = result.reindex(columns=['Time', 'object-name', 'present-value'])
Note the consistent 'Time' spelling: creating the column as "time" and then reindexing on 'Time' would leave it full of NaN. This also only works if you have not already rebound the name time to a string, as your original code does.
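As a side note, pd.io.json.json_normalize has since become a top-level function. A sketch of the whole pipeline on pandas 1.0+, keeping the question's local URL and field names (the output filename is hypothetical):
import json
import time
import urllib.request

import pandas as pd

url = 'http://localhost:47800/api/v1/bacnet/devices/0/objects?properties=present-value&properties=object-name'
data = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))

result = pd.json_normalize(data['objects'])        # top-level since pandas 1.0
result['Time'] = time.asctime(time.localtime())    # same timestamp on every row
result = result.reindex(columns=['Time', 'object-name', 'present-value'])
result.to_csv('output.csv', index=False)           # hypothetical output path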

How to split one row into multiple and apply datetime on dataframe column?

I have one dataframe which looks like below:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 21 Dec 2017 18 Dec 2017 21 Dec 2017
4 22 Dec 2017 22 Dec 2017
Conditions to be checked:
Want to check if any row contains two dates or not like 3rd row. If present split them into two separate rows.
Apply the datetime on both columns.
I am trying to do the same operation like below:
df['Date_1'] = pd.to_datetime(df['Date_1'], format='%d %b %Y')
But getting below error:
ValueError: unconverted data remains:
Expected Output:
Date_1 Date_2
0 5 Dec 2017 5 Dec 2017
1 14 Dec 2017 14 Dec 2017
2 15 Dec 2017 15 Dec 2017
3 18 Dec 2017 18 Dec 2017
4 21 Dec 2017 21 Dec 2017
5 22 Dec 2017 22 Dec 2017
After using a regex with findall to extract the dates, your problem becomes an unnesting problem:
s = df.apply(lambda x: x.str.findall(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{,4})'))
unnesting(s,['Date_1','Date_2']).apply(pd.to_datetime)
Out[82]:
Date_1 Date_2
0 2017-12-05 2017-12-05
1 2017-12-14 2017-12-14
2 2017-12-15 2017-12-15
3 2017-12-18 2017-12-18
3 2017-12-21 2017-12-21
4 2017-12-22 2017-12-22
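Note that unnesting is not a pandas built-in; it is a helper the answer assumes is already defined. A common community sketch of it, together with the modern alternative:
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each row's index once per list element, flatten the
    # list-valued columns side by side, then re-attach any remaining columns.
    idx = df.index.repeat(df[explode[0]].str.len())
    flat = pd.concat(
        [pd.DataFrame({col: np.concatenate(df[col].values)}) for col in explode],
        axis=1)
    flat.index = idx
    return flat.join(df.drop(columns=explode), how='left')

# On pandas >= 1.3 the helper is unnecessary: DataFrame.explode accepts a
# list of columns, provided the lists in each row have matching lengths:
# s.explode(['Date_1', 'Date_2']).apply(pd.to_datetime)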

Python dataframe group by column and create new column with percentage [duplicate]

This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 5 years ago.
I have a scenario simulating to a dataframe which looks something like below:
Month Amount
1 Jan 260
2 Feb 179
3 Mar 153
4 Apr 142
5 May 128
6 Jun 116
7 Jul 71
8 Aug 56
9 Sep 49
10 Oct 17
11 Nov 0
12 Dec 0
I'm trying to get a new column by calculating the percentage for each row, using a dataframe groupby and a lambda function as below:
df = pd.DataFrame(mylistofdict)
df = df.groupby('Month')["Amount"].apply(lambda x: x / x.sum()*100)
But I'm not getting the expected result below, which has only two columns:
Month Percentage
1 Jan 22%
2 Feb 15%
3 Mar 13%
4 Apr 12%
5 May 11%
6 Jun 10%
7 Jul 6%
8 Aug 5%
9 Sep 4%
10 Oct 1%
11 Nov 0
12 Dec 0
How do I modify my code, or is there a better approach than using a dataframe?
If the values of Month are unique, use:
df['perc'] = df["Amount"] / df["Amount"].sum() * 100
print (df)
Month Amount perc
1 Jan 260 22.203245
2 Feb 179 15.286080
3 Mar 153 13.065756
4 Apr 142 12.126388
5 May 128 10.930828
6 Jun 116 9.906063
7 Jul 71 6.063194
8 Aug 56 4.782237
9 Sep 49 4.184458
10 Oct 17 1.451751
11 Nov 0 0.000000
12 Dec 0 0.000000
If values of Month are duplicated, I believe it is possible to use:
print (df)
Month Amount
1 Jan 260
1 Jan 100
3 Mar 153
4 Apr 142
5 May 128
6 Jun 116
7 Jul 71
8 Aug 56
9 Sep 49
10 Oct 17
11 Nov 0
12 Dec 0
df = df.groupby('Month', as_index=False, sort=False)["Amount"].sum()
df['perc'] = df["Amount"] / df["Amount"].sum() * 100
print (df)
Month Amount perc
0 Jan 360 32.967033
1 Mar 153 14.010989
2 Apr 142 13.003663
3 May 128 11.721612
4 Jun 116 10.622711
5 Jul 71 6.501832
6 Aug 56 5.128205
7 Sep 49 4.487179
8 Oct 17 1.556777
9 Nov 0 0.000000
10 Dec 0 0.000000
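If the '22%'-style strings from the expected output are literally wanted, here is a short sketch on top of the unique-Month case (note that formatting turns the column into strings):
import pandas as pd

df = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Amount': [260, 179, 153, 142, 128, 116, 71, 56, 49, 17, 0, 0],
})

# Percentage of the overall total as a number...
df['perc'] = df['Amount'] / df['Amount'].sum() * 100

# ...then format as whole-percent strings like the expected output.
df['Percentage'] = df['perc'].round().astype(int).astype(str) + '%'
print(df[['Month', 'Percentage']])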
