Pandas values back in time when plotting them - python

I have a dataframe with IDs, and for each ID a timestamp and an associated value. I need to make all the IDs start from 0 and plot them together to see the differences, so I used this code:
df['Date']=df.groupby('ID').cumcount()
This replaces the Date column with an integer index running from 0 to the last row number of each ID. Now, my problem is that the resulting plot has lines going back in time, and I can't understand why.
Another image:
As you can see, that is not possible, but I don't understand how to fix it. Basically, I'm plotting all the values in both dataframes for each ID against the newly created time.
After that, I need to perform statistical analysis on them, like create rolling mean or variance, and a gaussian distribution over the dataframes.
How can I fix this?
Edit:
here's my dataframe:
ID Date value
12345 2017-02-09 14:35:00 60.0
12345 2017-02-09 14:40:00 62.0
12345 2017-02-09 14:45:00 58.0
12345 2017-02-09 14:50:00 60.0
54321 2017-03-09 13:35:00 50.0
54321 2017-03-09 13:40:00 58.0
54321 2017-03-09 13:45:00 59.0
54321 2017-03-09 13:50:00 61.0
I would need to reshape the Date column so everything starts from 0, using the command above, to get the result below:
ID timestamp value
12345 0 60.0
12345 1 62.0
12345 2 58.0
12345 3 60.0
54321 0 50.0
54321 1 58.0
54321 2 59.0
54321 3 61.0
edit2: if I try the following code
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('ID'):
    df.vals.plot(kind="line", ax=ax, label=label)
plt.legend()
the script keeps running but doesn't display anything.
And how do I compare two dataframes in this way?


How to group by column and a fixed time window/frequency

EDIT: My main goal is to avoid a for loop and to find a way of grouping the data efficiently/fast.
I am trying to solve a problem which is about grouping together different rows of data based on an ID and a time window of 30 days.
I have the following example data:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
And I would like to have the following data:
ID     Time                 Group
12345  2021-01-01 14:00:00  1
12345  2021-01-15 14:00:00  1
12345  2021-01-29 14:00:00  1
12345  2021-02-15 14:00:00  2
12345  2021-02-16 14:00:00  2
12345  2021-03-15 14:00:00  3
12345  2021-04-24 14:00:00  4
12344  2021-01-24 14:00:00  5
12344  2021-01-25 14:00:00  5
12344  2021-04-24 14:00:00  6
(4 can also be 1 as it is in a new group based on the ID 12344; 5 can also be 2)
I could differentiate then based on the ID column. So the Group does not need to be unique but can be.
The most important thing is to separate the rows by ID and then, within each ID, check all the rows and assign a group to each 30-day time window. By a 30-day time window I mean that, e.g., the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and would go to 2021-03-02 (for 30 days).
The problem I have faced with using the following code is that it uses the first date it finds in the dataframe:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the above code I have just tried to count the rows (which wouldn't give me the Group, but I have tried to group it with my logic).
I hope someone can help me with this, because I have tried so many different things and nothing worked. I have already tried the following (but maybe incorrectly):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop, as I have 1.5 million rows.
I have tried to vectorize the for loop, but I am not really familiar with vectorization and was struggling to translate it.
Please let me know if I can use pd.Grouper differently to get these results. Thanks in advance.
For arbitrary windows you can use pandas.cut.
E.g., for 30-day bins starting at 2021-01-01 00:00:00 and covering the entirety of 2021 you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with the interval it falls in, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, etc., you can map the values with:
dict(zip(group.unique(), range(group.nunique())))
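A minimal runnable sketch of this approach, using a few of the question's timestamps (the relabelling dict is the trick described just above):

```python
import pandas as pd

df = pd.DataFrame({"Time": pd.to_datetime([
    "2021-01-01 14:00:00", "2021-01-15 14:00:00",
    "2021-02-15 14:00:00", "2021-03-15 14:00:00"])})

# 30-day bins anchored at 2021-01-01, covering all of 2021
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)

# Relabel each interval as 0, 1, 2, ... in order of first appearance
labels = group.map(dict(zip(group.unique(), range(group.nunique()))))
print(labels.tolist())  # [0, 0, 1, 2]
```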
EDIT: approach where the windows are 30 day intervals, disjoint, and starting at a time in the Time column:
times = df["Time"].sort_values()
# closed="left" so that each window contains its own starting timestamp
# (with the default right-closed intervals, the start itself would fall outside)
ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"), closed="left")
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach. Struggling to think if one could exist.
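For reference, a self-contained run of this idea on sample data (using closed="left" intervals, an assumption, so each window includes its own starting timestamp):

```python
import pandas as pd

df = pd.DataFrame({"Time": pd.to_datetime([
    "2021-01-01 14:00:00", "2021-01-15 14:00:00",
    "2021-02-15 14:00:00", "2021-03-15 14:00:00"])})

times = df["Time"].sort_values()
# One candidate 30-day window starting at every timestamp
ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"),
                                  closed="left")

keep = []
prev = None
for i, interval in enumerate(ii):
    if prev is None or interval.left >= prev.right:  # keep only non-overlapping windows
        prev = interval
        keep.append(i)

bins = ii[keep]
group = pd.cut(df["Time"], bins)
print(group.cat.codes.tolist())  # [0, 0, 1, 1]
```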
SOLUTION:
The solution which worked for me is the following:
I have imported the sampleData from excel into a dataframe. The data looks like this:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I have also reset the index and dropped the old one, because the original index interfered with my calculations later on.
Create a column with the time difference to the previous row (only where the previous row has the same ID):
mask = df_test_ordered["ID"] == df_test_ordered["ID"].shift(1)
df_test_ordered.loc[mask, "time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Convert the timedelta64[ns] differences to whole days (.astype("timedelta64[D]") no longer performs this conversion in recent pandas; .dt.days is the portable way):
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].dt.days
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Backfill the dataframe (exchange the NaN values with the next value):
df_final = df_test_ordered.ffill().bfill()
Create the window by dividing by 30 (30 days time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe because the time difference is only computed when the previous row has the same ID; otherwise it is left as NaN. Backfilling sets each NaN to the next time difference, which makes no difference mathematically and assigns the correct value.
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3
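The steps above can also be condensed into a short, loop-free sketch that anchors each ID's windows at its first timestamp (a variant of the solution, not the exact cumsum code above; exact boundary cases, such as a gap of exactly 90 days, can land in a different window than in the solution table):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [12345]*7 + [12344]*3,
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00", "2021-01-15 14:00:00", "2021-01-29 14:00:00",
        "2021-02-15 14:00:00", "2021-02-16 14:00:00", "2021-03-15 14:00:00",
        "2021-04-24 14:00:00",
        "2021-01-24 14:00:00", "2021-01-25 14:00:00", "2021-04-24 14:00:00"]),
})

df = df.sort_values(["ID", "Time"]).reset_index(drop=True)
# Days elapsed since each ID's first timestamp, then the integer 30-day window
offset = df["Time"] - df.groupby("ID")["Time"].transform("min")
df["Window_int"] = offset.dt.days // 30
print(df["Window_int"].tolist())
```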

How to automatically pivot data in pandas

I am used to working with Excel and am trying to learn Python, especially Pandas. My goal is to plot a large dataset with Plotly/Dash. My dataset looks very much like the dataset in the Pandas tutorial, but I have more parameters and, with 20 locations, more locations as well.
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no 23.0
2068 2019-05-07 01:00:00+00:00 London Westminster no2 45.0
2069 2019-05-07 01:00:00+00:00 London Westminster pm25 11.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster co 8.0
I import the file with pd.read_csv and then manually create a pivot for every location and every parameter, each with a separate variable, which is quite a lot of work.
Is there a way to pivot this data automatically? I want the locations grouped and a column for every parameter. My goal is to have this data in Dash: at the top a dropdown with the location, and on the right side a choice of no, no2, pm ... with individual axis labels for each parameter.
I found this code here on Stack Overflow and am trying to adapt it for my case, but it doesn't work.
df = pd.read_csv('https://api.statbank.dk/v1/data/mpk100/CSV?valuePresentation=Value&timeOrder=Ascending&LAND=*&Tid=*', sep=';')
df = df[df['INDHOLD'] != '..']
df['rate'] = df['INDHOLD'].str.replace(',', '.').astype(float)
available_countries = df['LAND'].unique()
df.groupby('LAND')
Many thanks in advance.:)
If I understand you correctly:
x = df.pivot(index=["date.utc", "location"], columns="parameter", values="value")
print(x)
Prints:
parameter co no no2 pm25
date.utc location
2019-05-07 01:00:00+00:00 BETR801 NaN NaN 50.5 12.5
FR04014 NaN NaN 25.0 NaN
London Westminster 8.0 23.0 45.0 11.0
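A self-contained version of the above (keyword arguments, since the positional form of DataFrame.pivot was removed in pandas 2.0), plus a pivot_table fallback for data where a (date, location, parameter) combination can repeat:

```python
import pandas as pd

df = pd.DataFrame({
    "date.utc": ["2019-05-07 01:00:00+00:00"] * 4,
    "location": ["London Westminster", "London Westminster", "FR04014", "BETR801"],
    "parameter": ["no", "no2", "no2", "pm25"],
    "value": [23.0, 45.0, 25.0, 12.5],
})

# One column per parameter, rows grouped by date and location
x = df.pivot(index=["date.utc", "location"], columns="parameter", values="value")
print(x)

# If duplicate rows exist, pivot raises; pivot_table aggregates
# the duplicates instead (mean by default)
x2 = df.pivot_table(index=["date.utc", "location"], columns="parameter",
                    values="value", aggfunc="mean")
```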

Incomplete filling when upsampling with `agg` for multiple columns (pandas resample)

I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column. For user and total I want to do a forward fill, while for value I want to fill in zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column. I try to do the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this to be more clear and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and if there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, but if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (but is undesired here as it would do so for the value column as well). The closest I have come is to individually run resampling for each column and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred']*len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
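As to the "why": agg applies 'ffill' only within each 5-minute bin, and the empty bins contain nothing to fill from, hence the NaNs. A sketch of a single-pass workaround on the example data, swapping 'ffill' for 'first' (the within-bin value) and forward-filling across bins afterwards:

```python
import pandas as pd

dates = pd.to_datetime(['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
                        '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00'])
df = pd.DataFrame({'user': ['fred']*len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)

# 'first' leaves empty bins as NaN and 'sum' of an empty bin is 0;
# the forward fill then propagates user/total across the empty bins.
resampled = df.resample('5min').agg({'user': 'first', 'value': 'sum', 'total': 'first'})
resampled[['user', 'total']] = resampled[['user', 'total']].ffill()
print(resampled)
```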

Filling 0's with the average of the previous 3 months' values using Python

My data set has values like
date quantity
01/04/2018 35
01/05/2018 33
01/06/2018 75
01/07/2018 0
01/08/2018 70
01/09/2018 0
01/10/2018 66
Code I tried:
df['rollmean3'] = df['quantity'].rolling(3).mean()
output:
2018-04-01 35.0 NaN
2018-05-01 33.0 NaN
2018-06-01 75.0 47.666667
2018-07-01 0.0 36.000000
2018-08-01 70.0 48.333333
2018-09-01 0.0 23.333333
2018-10-01 66.0 45.333333
EXPECTED OUTPUT:
But I need the output to take the AVERAGE of 35, 33 and 75 and fill it in for the 0.0 value,
and for the next zero it should calculate the average of the previous three values (including previously filled ones) and fill that in.
2018-04-01 35.0
2018-05-01 33.0
2018-06-01 75.0
2018-07-01 0.0 47.666667
2018-08-01 70.0
2018-09-01 0.0 64.22222 # average of (75, 47.6667 and 70)
2018-10-01 66.0
like this output should be displayed
Unfortunately there does not seem to be a vectorized solution for this in Pandas. You'll need to iterate the rows and fill in the missing values one by one. This will be slow; if you need to speed it up you can JIT compile your code using Numba.
Like John Zwinck said, there's no vectorized solution in pandas for this.
You'll have to use something like .iterrows(), like this:
# assumes a default RangeIndex, so the label i is also a position
for i, row in df.iterrows():
    if row['quantity'] == 0:
        df.loc[i, 'quantity'] = df['quantity'].iloc[max(0, i - 3):i].mean()
Or even with recursion, if you prefer:
def fill_recursively(column: pd.Series, window_size: int = 3):
    if 0 in column.values:
        idx = column.tolist().index(0)
        column.iloc[idx] = column.iloc[(idx - window_size):idx].mean()
        column = fill_recursively(column)
    return column
You can verify that fill_recursively(df['quantity']) returns the desired result (just make sure that it has the dtype float, otherwise it will be rounded to the nearest integer).
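A runnable sketch of the sequential-fill idea on the question's data (working on a NumPy copy to sidestep index-alignment issues; max(0, i - 3) is a guard in case a zero appears in the first three rows):

```python
import pandas as pd

df = pd.DataFrame({'quantity': [35.0, 33.0, 75.0, 0.0, 70.0, 0.0, 66.0]},
                  index=pd.date_range('2018-04-01', periods=7, freq='MS'))

# Fill each zero with the mean of the three preceding values; because the
# fills happen in order, later zeros see the earlier filled values.
vals = df['quantity'].to_numpy().copy()
for i in range(len(vals)):
    if vals[i] == 0:
        vals[i] = vals[max(0, i - 3):i].mean()
df['filled'] = vals
print(df['filled'].tolist())
```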

Convert timestamps from temporal series to the same index

I have a data frame containing a timestamp every 5 minutes with a value for each ID. Now, I need to perform some analysis and I would like to plot all the time series on the same temporal time window.
My data frame is similar to this one:
ID timestamp value
12345 2017-02-09 14:35:00 60.0
12345 2017-02-09 14:40:00 62.0
12345 2017-02-09 14:45:00 58.0
12345 2017-02-09 14:50:00 60.0
54321 2017-03-09 13:35:00 50.0
54321 2017-03-09 13:40:00 58.0
54321 2017-03-09 13:45:00 59.0
54321 2017-03-09 13:50:00 61.0
For instance, on the x axis, I need x=0 to be the first timestamp for each ID, x=1 the second one 5 minutes later, and so on.
Until now, I correctly resampled every 5 minutes with this code:
df = df.set_index('timestamp').resample('5T').mean().reset_index()
But, given that every ID starts at a different timestamp, I don't know how to modify the timestamps so that the first measured date of each ID becomes timestamp 0 and each following 5-minute step becomes timestamp 1, timestamp 2, timestamp 3, etc., in order to plot the series of each ID and compare them graphically. A sample final df may be:
ID timestamp value
12345 0 60.0
12345 1 62.0
12345 2 58.0
12345 3 60.0
54321 0 50.0
54321 1 58.0
54321 2 59.0
54321 3 61.0
Using this data frame, is it possible to plot all the series starting and finishing at the same point? Start at 0 and finish after 3 days.
How do I create such timestamps and plot every series for each ID on the same figure?
Thank you very much.
First create a new column with the timestamp number in 5 minutes intervals.
df['ts_number'] = df.groupby(['ID']).timestamp.apply(lambda x: (x - x.min())/pd.Timedelta(minutes=5))
If you know in advance that all your timestamps are in 5 minute intervalls and they are sorted, then you can also use
df['ts_number'] = df.groupby(['ID']).cumcount()
Then plot the pivoted data:
df.pivot(index='ts_number', columns='ID', values='value').plot()
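Putting the pieces together, a minimal end-to-end sketch with the question's sample data (wide.plot() would then draw both series on the same axes):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [12345]*4 + [54321]*4,
    'timestamp': pd.to_datetime([
        '2017-02-09 14:35:00', '2017-02-09 14:40:00',
        '2017-02-09 14:45:00', '2017-02-09 14:50:00',
        '2017-03-09 13:35:00', '2017-03-09 13:40:00',
        '2017-03-09 13:45:00', '2017-03-09 13:50:00']),
    'value': [60.0, 62.0, 58.0, 60.0, 50.0, 58.0, 59.0, 61.0],
})

# Number of elapsed 5-minute steps since each ID's first timestamp
df['ts_number'] = df.groupby('ID')['timestamp'].transform(
    lambda x: (x - x.min()) / pd.Timedelta(minutes=5))

# One column per ID, aligned on the shared step index
wide = df.pivot(index='ts_number', columns='ID', values='value')
print(wide)
```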
