I have a data frame containing a timestamp every 5 minutes with a value for each ID. Now I need to perform some analysis, and I would like to plot all the time series over the same time window.
My data frame is similar to this one:
ID timestamp value
12345 2017-02-09 14:35:00 60.0
12345 2017-02-09 14:40:00 62.0
12345 2017-02-09 14:45:00 58.0
12345 2017-02-09 14:50:00 60.0
54321 2017-03-09 13:35:00 50.0
54321 2017-03-09 13:40:00 58.0
54321 2017-03-09 13:45:00 59.0
54321 2017-03-09 13:50:00 61.0
For instance, on the x axis, x=0 should be the first timestamp of each ID, x=1 the second one (5 minutes later), and so on.
So far, I have correctly resampled every 5 minutes with this code:
df = df.set_index('Date').resample('5T').mean().reset_index()
But, given that every ID starts at a different timestamp, I don't know how to modify the timestamps so that the first measured date of each ID becomes timestamp 0, and each following timestamp (every 5 minutes) becomes timestamp 1, timestamp 2, timestamp 3, etc., in order to plot the series of each ID and compare them graphically. A sample final df may be:
ID timestamp value
12345 0 60.0
12345 1 62.0
12345 2 58.0
12345 3 60.0
54321 0 50.0
54321 1 58.0
54321 2 59.0
54321 3 61.0
Using this data frame, is it possible to plot all the series so that they start and finish at the same point, i.e. start at 0 and finish after 3 days?
How do I create such different timestamps and plot every series for each ID on the same figure?
Thank you very much.
First create a new column with the timestamp number, counted in 5-minute intervals.
df['ts_number'] = df.groupby(['ID']).timestamp.apply(lambda x: (x - x.min())/pd.Timedelta(minutes=5))
If you know in advance that all your timestamps are in 5-minute intervals and they are sorted, then you can also use
df['ts_number'] = df.groupby(['ID']).cumcount()
Then plot the pivoted data:
df.pivot(index='ts_number', columns='ID', values='value').plot()
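For completeness, a minimal self-contained sketch of the two steps above (using transform here so the result keeps the original row index; data taken from the question):

import pandas as pd
import matplotlib.pyplot as plt

# Sample data as in the question.
df = pd.DataFrame({
    "ID": [12345] * 4 + [54321] * 4,
    "timestamp": pd.to_datetime([
        "2017-02-09 14:35", "2017-02-09 14:40", "2017-02-09 14:45", "2017-02-09 14:50",
        "2017-03-09 13:35", "2017-03-09 13:40", "2017-03-09 13:45", "2017-03-09 13:50",
    ]),
    "value": [60.0, 62.0, 58.0, 60.0, 50.0, 58.0, 59.0, 61.0],
})

# Number of 5-minute steps since each ID's first timestamp.
df["ts_number"] = df.groupby("ID")["timestamp"].transform(
    lambda x: (x - x.min()) / pd.Timedelta(minutes=5)
)

# One line per ID, all starting at x=0.
df.pivot(index="ts_number", columns="ID", values="value").plot()
plt.show()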
EDIT: My main goal is to avoid a for loop and find a way of grouping the data efficiently/fast.
I am trying to solve a problem, which is about grouping together different rows of data based on an ID and a time window of 30 Days.
I have the following example data:
ID      Time
12345   2021-01-01 14:00:00
12345   2021-01-15 14:00:00
12345   2021-01-29 14:00:00
12345   2021-02-15 14:00:00
12345   2021-02-16 14:00:00
12345   2021-03-15 14:00:00
12345   2021-04-24 14:00:00
12344   2021-01-24 14:00:00
12344   2021-01-25 14:00:00
12344   2021-04-24 14:00:00
And I would like to have the following data:
ID      Time                  Group
12345   2021-01-01 14:00:00   1
12345   2021-01-15 14:00:00   1
12345   2021-01-29 14:00:00   1
12345   2021-02-15 14:00:00   2
12345   2021-02-16 14:00:00   2
12345   2021-03-15 14:00:00   3
12345   2021-04-24 14:00:00   4
12344   2021-01-24 14:00:00   5
12344   2021-01-25 14:00:00   5
12344   2021-04-24 14:00:00   6
(4 can also be 1 as it is in a new group based on the ID 12344; 5 can also be 2)
I can then differentiate based on the ID column, so the Group does not need to be unique (but it can be).
The most important part is to separate the rows by ID and then, within each ID, assign a group for each 30-day time window. By a 30-day time window I mean that, e.g., the first time frame for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second time frame for ID 12345 starts at 2021-02-01 and goes up to 2021-03-02 (another 30 days).
The problem I have faced with using the following code is that it uses the first date it finds in the dataframe:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the above code I have just tried to count the rows (which wouldn't give me the Group, but I have tried to group it with my logic).
I hope someone can help me with this, because I have tried so many different things and nothing has worked. I have already tried the following (but maybe incorrectly):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop as I have 1.5 million rows.
I have tried to vectorize the for loop, but I am not really familiar with vectorization and struggled to translate the loop into a vectorized form.
Please let me know if I can use pd.Grouper differently to get this result. Thanks in advance.
For arbitrary windows you can use pandas.cut.
For example, for 30-day bins starting at 2021-01-01 00:00:00 and covering the entirety of 2021 you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with an interval, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, etc., you can map the values with:
dict(zip(group.unique(), range(group.nunique())))
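For example, a minimal sketch putting the two pieces together (a few rows taken from the sample data; the integer labels come from the dict(zip(...)) mapping shown above):

import pandas as pd

# A few rows from the sample data in the question.
df = pd.DataFrame({
    "ID": [12345, 12345, 12345, 12344, 12344],
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00", "2021-01-15 14:00:00", "2021-02-15 14:00:00",
        "2021-01-24 14:00:00", "2021-04-24 14:00:00",
    ]),
})

# 30-day bins anchored at 2021-01-01, covering all of 2021.
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)

# Map each interval to a simple integer label 0, 1, 2, ...
df["Group"] = group.map(dict(zip(group.unique(), range(group.nunique()))))
print(df)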
EDIT: an approach where the windows are disjoint 30-day intervals, each starting at a time that appears in the Time column:
times = df["Time"].sort_values()
# Left-closed intervals so each window includes its own start time.
ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"), closed="left")
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach; I'm struggling to think whether one could exist.
SOLUTION:
The solution which worked for me is the following:
I imported the sample data from Excel into a dataframe. The data looks like this:
ID      Time
12345   2021-01-01 14:00:00
12345   2021-01-15 14:00:00
12345   2021-01-29 14:00:00
12345   2021-02-15 14:00:00
12345   2021-02-16 14:00:00
12345   2021-03-15 14:00:00
12345   2021-04-24 14:00:00
12344   2021-01-24 14:00:00
12344   2021-01-25 14:00:00
12344   2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I also reset the index and dropped the old one, as it interfered with my calculations later on.
Create column with time difference between the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1),"time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Convert the time difference from timedelta64[ns] to a number of days:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Backfill the dataframe (replace each NaN with the next value; since the rows are sorted by ID, the NaN in the first row of each ID is filled from the following row of the same ID):
df_final = df_test_ordered.bfill()
Create the window by dividing by 30 (30 days time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe as there were NaN values due to the calculation of time difference only if the previous ID equals the ID. If not then NaN is set as time difference. Backfilling will just set the NaN value to the next time difference which makes no difference mathematically and assign the correct value.
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3
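Since the cumulative sum of the consecutive per-ID differences is just the elapsed time since that ID's first timestamp (ignoring the per-step rounding to whole days), the steps above can be collapsed into a shorter, self-contained sketch (column names as in the sample data):

import pandas as pd

# Sample data as in the question.
df = pd.DataFrame({
    "ID": [12345] * 7 + [12344] * 3,
    "Time": pd.to_datetime([
        "2021-01-01 14:00:00", "2021-01-15 14:00:00", "2021-01-29 14:00:00",
        "2021-02-15 14:00:00", "2021-02-16 14:00:00", "2021-03-15 14:00:00",
        "2021-04-24 14:00:00", "2021-01-24 14:00:00", "2021-01-25 14:00:00",
        "2021-04-24 14:00:00",
    ]),
})

df = df.sort_values(["ID", "Time"]).reset_index(drop=True)

# Elapsed time since each ID's first timestamp, then the 30-day window index.
elapsed = df["Time"] - df.groupby("ID")["Time"].transform("min")
df["Window_int"] = (elapsed / pd.Timedelta(days=30)).astype(int)
print(df)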
I have a sensor that measures data every ~60 seconds. There is a little bit of delay between calls, so the data might look like this:
timestamp, value
12:01:45, 100
12:02:50, 90
12:03:55, 87
# 12:04 missing
12:05:00, 91
I only need precision to the minute, not seconds. Since this gathers data all day long, there should be 1440 entries (1440 minutes per day); however, some timestamps are missing.
I'm loading this into a pd.DataFrame, and I'd like to have 1440 rows no matter what. How can I squeeze in None values for any missing timestamps?
timestamp, value
12:01:45, 100
12:02:50, 90
12:03:55, 87
12:04:00, None # Squeezed in a None value
12:05:00, 91
Additionally, some data is missing for several HOURS, but I'd still like to fill those with None.
Ultimately, I wish to plot the data using matplotlib, with the x-axis ranging between (0, 1440), and the y-axis ranging between (0, 100).
Use Resampler.first with Series.fillna if you only need to fill values between the first and last timestamp:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.resample('1min', on='timestamp').first()
df['timestamp'] = df['timestamp'].fillna(df.index.to_series())
df = df.reset_index(drop=True)
print (df)
timestamp value
0 2021-09-20 12:01:45 100.0
1 2021-09-20 12:02:50 90.0
2 2021-09-20 12:03:55 87.0
3 2021-09-20 12:04:00 NaN
4 2021-09-20 12:05:00 91.0
If you need all the datetimes of the day, add DataFrame.reindex:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.resample('1min', on='timestamp').first()
rng = pd.date_range('00:00:00','23:59:00', freq='Min')
df = df.reindex(rng)
df['timestamp'] = df['timestamp'].fillna(df.index.to_series())
df = df.reset_index(drop=True)
print (df)
timestamp value
0 2021-09-20 00:00:00 NaN
1 2021-09-20 00:01:00 NaN
2 2021-09-20 00:02:00 NaN
3 2021-09-20 00:03:00 NaN
4 2021-09-20 00:04:00 NaN
... ...
1435 2021-09-20 23:55:00 NaN
1436 2021-09-20 23:56:00 NaN
1437 2021-09-20 23:57:00 NaN
1438 2021-09-20 23:58:00 NaN
1439 2021-09-20 23:59:00 NaN
[1440 rows x 2 columns]
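To get the plot the question asks for, a minimal follow-up sketch (assuming df is the reindexed 1440-row frame from above and the column is named value): the x-axis is simply the minute of the day, and the NaN rows leave gaps in the line.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(len(df)), df['value'])  # x = minute of day (0..1439)
ax.set_xlim(0, 1440)
ax.set_ylim(0, 100)
ax.set_xlabel('minute of day')
ax.set_ylabel('value')
plt.show()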
I have a dataframe with some IDs, and for each ID I have an associated timestamp and value. I need to make all the IDs start from 0 and plot them together to see the differences, so I used this code:
df['Date']=df.groupby('ID').cumcount()
This substitutes the Date column with indexes going from 0 to the last row number of that specific ID. Now, my problem is that the resulting plot has lines going back in time, and I can't understand why.
That should not be possible, but I don't understand how to fix it. Basically, I'm plotting all the values in both dataframes for each ID against the newly created time.
After that, I need to perform statistical analysis on them, like create rolling mean or variance, and a gaussian distribution over the dataframes.
How can I fix this?
Edit:
here's my dataframe:
ID Date value
12345 2017-02-09 14:35:00 60.0
12345 2017-02-09 14:40:00 62.0
12345 2017-02-09 14:45:00 58.0
12345 2017-02-09 14:50:00 60.0
54321 2017-03-09 13:35:00 50.0
54321 2017-03-09 13:40:00 58.0
54321 2017-03-09 13:45:00 59.0
54321 2017-03-09 13:50:00 61.0
I need to reshape the Date column so that everything starts from 0, using the command above, with the result below:
ID timestamp value
12345 0 60.0
12345 1 62.0
12345 2 58.0
12345 3 60.0
54321 0 50.0
54321 1 58.0
54321 2 59.0
54321 3 61.0
Edit 2: if I try the following code
fig, ax = plt.subplots(figsize=(8,6))
for label, df in p_df.groupby('ID'):
    df.vals.plot(kind="line", ax=ax, label=label)
plt.legend()
the script keeps running but never shows anything.
And how do I compare two dataframes in this way?
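For reference, a minimal sketch of plotting one line per ID against the per-ID counter created by cumcount (column names assumed from the sample: ID, Date, value), so that each series starts at x=0 and cannot run backwards:

import matplotlib.pyplot as plt

# df is assumed to already have Date replaced by the cumcount values (0, 1, 2, ...).
fig, ax = plt.subplots(figsize=(8, 6))
for label, g in df.groupby("ID"):
    ax.plot(g["Date"], g["value"], label=label)
ax.legend()
plt.show()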
What I'm doing: I have generated a DataFrame with pandas:
df_output = pd.DataFrame(columns={"id", "Payout date", "Amount"})
The 'Payout date' column holds a datetime and 'Amount' a float. I'm taking the values for each row from a csv:
df=pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False)
but when I assign the values:
df_output.loc[df_output['id'] == index, 'Payout date'].iloc[0]=(parsed_date)
pay=payments.get()
ref=refunds.get()
df_output.loc[df_output['id'] == index, 'Amount'].iloc[0]=(pay+ref-for_next_day)
and when I print the columns 'Payout date' and 'Amount', only the id is printed correctly; I get NaT for the payout dates and NaN for the amounts, even when casting them to floats or using
df_output['Amount']=pd.to_numeric(df_output['Amount'])
df_output['Payout date'] = pd.to_datetime(df_output['Payout date'])
I've also tried casting the values before passing them to the DataFrame, with no luck, so what I'm getting is this:
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
Instead, I'm looking for something like this:
id Payout date Amount
1 2019-03-11 3.2
2 2019-03-11 3.2
3 2019-03-11 3.2
4 2019-03-11 3.2
5 2019-03-11 3.2
EDIT
print(df_output.head(5))
print(df.head(5))
id Payout date Amount
1 NaT NaN
2 NaT NaN
3 NaT NaN
4 NaT NaN
5 NaT NaN
id Created (UTC) Type Currency Amount Fee Net
1 2016-07-27 13:28:00 charge mxn 672.0 31.54 640.46
2 2016-07-27 15:21:00 charge mxn 146.0 9.58 136.42
3 2016-07-27 16:18:00 charge mxn 200.0 11.83 188.17
4 2016-07-27 17:18:00 charge mxn 146.0 9.58 136.42
5 2016-07-27 18:11:00 charge mxn 286.0 15.43 270.57
Probably the easiest thing to do would be just to rename the columns of the dataframe you're loading:
df = pd.read_csv("file.csv", encoding = "ISO-8859-1", low_memory=False, index_col='id')
df.rename(columns={"Created (UTC)": "Payout Date"}, inplace=True)
df_output = df[['Payout Date', 'Amount']]
EDIT:
If you're trying to assign a column of one dataframe to a column of another, just do this:
output_df['Amount'] = df['Amount']
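One caveat worth noting (an assumption about the setup, not shown in the original answer): plain column assignment aligns on the index, so if output_df and df do not share the same index labels the result is NaN. Assigning the underlying array sidesteps the alignment, provided the rows are already in matching order:

# Bypass index alignment; assumes both frames have the same number of rows
# and that the rows are already in the desired order.
output_df['Amount'] = df['Amount'].to_numpy()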
This is best explained through an example.
I have the following dataframe (each row can be thought of as a transaction):
DATE AMOUNT
2017-01-29 10
2017-01-30 20
2017-01-31 30
2017-02-01 40
2017-02-02 50
2017-02-03 60
I would like to compute a 2-day rolling sum but only for rows in February.
Code snippet I have currently:
df.set_index('DATE',inplace=True)
res=df.rolling('2d')['AMOUNT'].sum()
which gives:
AMOUNT
2017-01-29 10
2017-01-30 30
2017-01-31 50
2017-02-01 70
2017-02-02 90
2017-02-03 110
but I really only need the output for the last 3 rows; the operations on the first 3 rows are unnecessary. When the dataframe is huge, this wastes a lot of computation. How do I compute the rolling sum only for the last 3 rows (other than computing the rolling sum for all rows and then filtering the rows afterwards)?
*I cannot simply pre-filter the dataframe either, because then the 'lookback' period in January needed to obtain the correct rolling sum would be missing.
You can use a timedelta to filter your df while keeping the last day of January as lookback.
import datetime
dateStart = datetime.date(2017, 2, 1) - datetime.timedelta(days=1)
dateEnd = datetime.date(2017, 2, 3)
df.loc[dateStart:dateEnd]
Then you can do your rolling operation and drop the first line (which is 2017-01-31)
You can compute the rolling sum only for the last rows by using tail(4):
res = df.tail(4).rolling('2d')['AMOUNT'].sum()
Output:
DATE
2017-01-31 NaN
2017-02-01 70.0
2017-02-02 90.0
2017-02-03 110.0
Name: AMOUNT, dtype: float64
If you want to merge those values back, excluding 2017-01-31, you can do the following:
df.loc[res.index[1:]] = res.tail(3)
Output:
AMOUNT
DATE
2017-01-29 10.0
2017-01-30 20.0
2017-01-31 30.0
2017-02-01 70.0
2017-02-02 90.0
2017-02-03 110.0
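Putting the two suggestions together, a minimal self-contained sketch (names taken from the example data) that keeps one lookback day, computes the rolling sum, and then trims back to February:

import pandas as pd

df = pd.DataFrame(
    {"AMOUNT": [10, 20, 30, 40, 50, 60]},
    index=pd.to_datetime(
        ["2017-01-29", "2017-01-30", "2017-01-31",
         "2017-02-01", "2017-02-02", "2017-02-03"]
    ),
)

# Keep the last pre-February day as lookback, compute the 2-day rolling sum,
# then drop the lookback row again.
lookback = df.loc["2017-01-31":]
feb_sums = lookback["AMOUNT"].rolling("2d").sum().loc["2017-02-01":]
print(feb_sums)  # 2017-02-01 -> 70.0, 2017-02-02 -> 90.0, 2017-02-03 -> 110.0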