How to upsample a pandas data frame - python

I have a comma separated data file as follows:
ID | StartTimeStamp | EndTimeStamp | Duration (in seconds) | AssetName
1233 | 2017-01-01 00:00:02 | 2017-01-01 00:10:01 | 601 | Car1
1233 | 2017-01-01 00:10:01 | 2017-01-01 00:10:12 | 11 | Car1
...
1235 | 2017-01-01 00:00:02 | 2017-01-01 00:10:01 | 601 | CarN
etc.
Now I would like to create the following, using the start timestamp and duration to upsample the data:
ID | StartTimeStamp | AssetName
1233 | 2017-01-01 00:00:02 | Car1
1233 | 2017-01-01 00:00:03 | Car1
1233 | 2017-01-01 00:00:04 | Car1
...
1233 | 2017-01-01 00:10:01 | Car1
...
1235 | 2017-01-01 00:00:02 | CarN
1235 | 2017-01-01 00:00:03 | CarN
1235 | 2017-01-01 00:00:04 | CarN
... (i.e. 601 rows of data one per second)
1235 | 2017-01-01 00:10:01 | CarN
But I am at odds on how to do this, as upsampling seems to only work with time series. I was thinking of using a for loop over the StartTimeStamp and the number of seconds in the file, but I am at a loss on how to go about it.

You can resample within each ID group and then fill the gaps in the character columns:
import pandas as pd

# Index by the parsed timestamp, then resample each ID group to one row per second
df_resampled = df.set_index(pd.to_datetime(df.StartTimeStamp)).groupby('ID')
df_resampled = df_resampled.resample('1S').asfreq()
# Forward-fill AssetName (and backward-fill any leading gap) to label the new rows
df_resampled['AssetName'] = df_resampled['AssetName'].ffill().bfill()
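If you would rather build the per-second rows directly from StartTimeStamp and Duration (the for loop you had in mind, done with pandas primitives), here is a minimal sketch; the sample frame below is hypothetical and simply mirrors the layout in the question:
import pandas as pd

sample = pd.DataFrame({
    "ID": [1233, 1233],
    "StartTimeStamp": ["2017-01-01 00:00:02", "2017-01-01 00:10:01"],
    "Duration": [601, 11],
    "AssetName": ["Car1", "Car1"],
})
sample["StartTimeStamp"] = pd.to_datetime(sample["StartTimeStamp"])

# One timestamp per second for each row, then explode to long format
per_second = (
    sample.assign(StartTimeStamp=[
        list(pd.date_range(start, periods=n, freq="S"))
        for start, n in zip(sample["StartTimeStamp"], sample["Duration"])
    ])
    .explode("StartTimeStamp")[["ID", "StartTimeStamp", "AssetName"]]
    .reset_index(drop=True)
)
print(per_second.head())
This keeps Duration in control of how many rows each record expands to, so it also works when consecutive records are not back to back.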

Related

How to group data with similar dates in pandas

I have two csv files. Both files contain Date, Stock, Open, High, Low, Close columns for a single day. I made one dataframe from these two files, so in this single dataframe the data of Stock 1 is printed first, from day open to day close, and then the data of Stock 2 from day open to day close. The data is at 15 minute intervals, and a day starts at 2019-01-01 09:15:00 and ends at 2019-01-01 15:15:00.
What I want is to create a dataframe where the data of stock1 at 2019-01-01 09:15:00 is printed, then the data of stock2 at the same time, and so on for 2019-01-01 09:30:00, 2019-01-01 09:45:00....
Check the image:
New Answer:
After reading your response, I figured the best course of action for your issue would be to move your data to a two-level index using a pandas MultiIndex:
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
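Applied to your stock data, a sketch could look like the following (the column names Date, Stock and Close are assumptions based on your description; adjust them to your actual frame). Sorting the two-level index by Date first and Stock second gives stock1 then stock2 for each 15-minute bar:
import pandas as pd

stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01 09:15:00", "2019-01-01 09:30:00",
                            "2019-01-01 09:15:00", "2019-01-01 09:30:00"]),
    "Stock": ["Stock1", "Stock1", "Stock2", "Stock2"],
    "Close": [100.0, 101.0, 200.0, 199.5],
})

# (Date, Stock) MultiIndex: rows interleave the two stocks per timestamp
stocks = stocks.set_index(["Date", "Stock"]).sort_index()
print(stocks)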
Old Answer
You can use the pandas concat method. If their index format matches, the Pandas API will take care of the rest.
import pandas as pd
import datetime
idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.DataFrame(range(len(idx)), index=idx)
| | 0 |
|:--------------------|----:|
| 2018-01-01 00:00:00 | 0 |
| 2018-01-01 01:00:00 | 1 |
| 2018-01-01 02:00:00 | 2 |
| 2018-01-01 03:00:00 | 3 |
| 2018-01-01 04:00:00 | 4 |
idy = pd.date_range("2018-01-02", periods=10, freq="H")
tsy = pd.DataFrame(range(len(idy)), index=idy)
| | 0 |
|:--------------------|----:|
| 2018-01-02 00:00:00 | 0 |
| 2018-01-02 01:00:00 | 1 |
| 2018-01-02 02:00:00 | 2 |
| 2018-01-02 03:00:00 | 3 |
| 2018-01-02 04:00:00 | 4 |
| 2018-01-02 05:00:00 | 5 |
| 2018-01-02 06:00:00 | 6 |
| 2018-01-02 07:00:00 | 7 |
| 2018-01-02 08:00:00 | 8 |
| 2018-01-02 09:00:00 | 9 |
Result:
pd.concat([ts, tsy])
| | 0 |
|:--------------------|----:|
| 2018-01-01 00:00:00 | 0 |
| 2018-01-01 01:00:00 | 1 |
| 2018-01-01 02:00:00 | 2 |
| 2018-01-01 03:00:00 | 3 |
| 2018-01-01 04:00:00 | 4 |
| 2018-01-02 00:00:00 | 0 |
| 2018-01-02 01:00:00 | 1 |
| 2018-01-02 02:00:00 | 2 |
| 2018-01-02 03:00:00 | 3 |
| 2018-01-02 04:00:00 | 4 |
| 2018-01-02 05:00:00 | 5 |
| 2018-01-02 06:00:00 | 6 |
| 2018-01-02 07:00:00 | 7 |
| 2018-01-02 08:00:00 | 8 |
| 2018-01-02 09:00:00 | 9 |
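If the two frames cover the same timestamps and you want the rows interleaved by time (stock1 at 09:15, then stock2 at 09:15, and so on), sorting the concatenated result should do it. A one-line sketch, assuming both frames are indexed by timestamp:
combined = pd.concat([ts, tsy]).sort_index()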

How can I set last rows of a dataframe based on condition in Python?

I have one dataframe, df1, with 2 different columns. The first column 'col1' is a datetime column, and the second one is an int column with only 2 possible values (0 or 1). Here is an example of the dataframe:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 1 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
As you can see, the datetimes are sorted in ascending order. What I would like is: for each different date (in this example there are 2 different dates, 2020-01-01 and 2020-01-02, with different times) I would like to keep the first 1 value and set the previous and following values on that date to 0. So, the resulting dataframe would be:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
How can I do it in Python?
Use:
df['col1'] = pd.to_datetime(df.col1)
mask = df.groupby(df.col1.dt.date)['col2'].cumsum().eq(1)
df.col2.where(mask, 0, inplace = True)
Output:
>>> df
col1 col2
0 2020-01-01 10:00:00 0
1 2020-01-01 11:00:00 1
2 2020-01-01 12:00:00 0
3 2020-01-02 11:00:00 0
4 2020-01-02 12:00:00 1
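Why this works, as a small reproducible check (the frame below is rebuilt by hand from the sample above): within each date, the cumulative sum of col2 equals 1 on the first 1 and on any zeros that follow it, so where(mask, 0) keeps that first 1 and turns every later 1 into 0.
import pandas as pd

df = pd.DataFrame({
    "col1": ["2020-01-01 10:00:00", "2020-01-01 11:00:00", "2020-01-01 12:00:00",
             "2020-01-02 11:00:00", "2020-01-02 12:00:00"],
    "col2": [0, 1, 1, 0, 1],
})
df["col1"] = pd.to_datetime(df.col1)
mask = df.groupby(df.col1.dt.date)["col2"].cumsum().eq(1)
print(mask.tolist())  # [False, True, False, False, True]
df["col2"] = df["col2"].where(mask, 0)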

How to create rows that fill the time between events in Python

I am building a data frame for survival analysis starting from 2018-01-01 00:00:00 and ending TODAY. I have two columns with start and end times, only for the events that occurred, associated with an ID.
However, I need to add rows with the times between which the event was not observed.
Here I show what I have:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
And what I need is:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2018-01-01 00:00:00 | 2019-12-04 04:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 19:30:00 | 2019-12-08 06:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-20 10:00:00 | 2019-12-22 11:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 23:00:00 | 2019-12-26 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-29 16:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
| State1 | 112 | AA1 | 2018-01-01 00:00:00 | 2018-09-19 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA1 | 2018-09-20 04:30:00 | 2018-09-25 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA1 | 2018-09-26 23:00:00 | 2018-09-27 01:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 10:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
I have tried this code (borrowed from: How to find the start time and end time of an event in python?) and the answer provided by @Fredy Montaño (below), but it gives me only the sequence of events, not the desired rows:
fill_date = []
for item in range(1, df.shape[0], 1):
    if (df['End_Time'][item-1] - df['Start_Time'][item]) == 0:
        ""
    else:
        fill_date.append([df["State"][item-1], df["ID1"][item-1], df["ID2"][item-1], df['End_Time'][item-1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ["State", "ID1", "ID2", 'Start_Time', 'End_Time']
df_output = pd.concat([df[["State", "ID1", "ID2", "Start_Time", "End_Time"]], df_add], axis=0)
df_output = df_output.sort_values(["State", "ID2", "Start_Time"], ascending=True)
I think I have to put a condition over the State, ID1 and ID2 variables in order not to take times from the previous groups.
Any suggestion?
Maybe this solution works for you.
I slice the dataframe to take only the dates, but you can repeat it taking the states and IDs into account:
df = df[['Start_Time', 'End_Time']]
fill_date = []
for item in range(1, df.shape[0], 1):
    if df['Start_Time'][item] - df['End_Time'][item-1] == 0:
        ""
    else:
        fill_date.append([df['End_Time'][item-1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ['Start_Time', 'End_Time']
And finally, I do a concat to join your original dataframe with the new dataframe of dates for the not-observed events:
df_final = pd.concat([df, df_add], axis=0)
df_final = df_final.sort_index()
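Building on that idea, a group-aware variant could wrap the same logic in a groupby so gaps never cross State/ID1/ID2 boundaries, and also add the leading gap from 2018-01-01 and the trailing gap up to today. This is only a sketch; it assumes Start_Time and End_Time are already datetime columns and uses the column names from your table:
import pandas as pd

STUDY_START = pd.Timestamp("2018-01-01 00:00:00")
TODAY = pd.Timestamp.now().floor("min")

def fill_gaps(group):
    # Gap rows run from the previous End_Time (or the study start) to the next Start_Time (or today)
    g = group.sort_values("Start_Time")
    gaps = pd.DataFrame({
        "Start_Time": [STUDY_START] + list(g["End_Time"]),
        "End_Time": list(g["Start_Time"]) + [TODAY],
    })
    gaps = gaps[gaps["Start_Time"] < gaps["End_Time"]]  # drop zero-length gaps
    for col in ("State", "ID1", "ID2"):
        gaps[col] = g[col].iloc[0]
    return pd.concat([g, gaps]).sort_values("Start_Time")

df_output = (
    df.groupby(["State", "ID1", "ID2"], group_keys=False)
      .apply(fill_gaps)
      .reset_index(drop=True)
)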

Iterate over each day, calculate average price of first and last 3 hours and take difference of those averages in Python

I have a data frame that looks like this,
+---------------------+------------+----------+-------+
| Timestamp | Date | Time | Price |
+---------------------+------------+----------+-------+
| 2017-01-01 00:00:00 | 2017-01-01 | 00:00:00 | 20 |
| 2017-01-01 00:01:00 | 2017-01-01 | 00:01:00 | 25 |
| 2017-01-01 00:02:00 | 2017-01-01 | 00:02:00 | 15 |
| 2017-01-01 00:03:00 | 2017-01-01 | 00:03:00 | 20 |
| ... | | | |
| 2017-01-01 00:20:00 | 2017-01-01 | 00:20:00 | 25 |
| 2017-01-01 00:21:00 | 2017-01-01 | 00:21:00 | 15 |
| 2017-01-01 00:22:00 | 2017-01-01 | 00:22:00 | 10 |
| 2017-01-01 00:23:00 | 2017-01-01 | 00:23:00 | 25 |
| 2017-02-01 00:00:00 | 2017-02-01 | 00:00:00 | 10 |
| 2017-02-01 00:01:00 | 2017-02-01 | 00:01:00 | 25 |
| 2017-02-01 00:02:00 | 2017-02-01 | 00:02:00 | 10 |
| 2017-02-01 00:03:00 | 2017-02-01 | 00:03:00 | 25 |
| ... | | | |
| 2017-02-01 00:20:00 | 2017-02-01 | 00:20:00 | 15 |
| 2017-02-01 00:21:00 | 2017-02-01 | 00:21:00 | 10 |
| 2017-02-01 00:22:00 | 2017-02-01 | 00:22:00 | 25 |
| 2017-02-01 00:23:00 | 2017-02-01 | 00:23:00 | 10 |
+---------------------+------------+----------+-------+
Timestamp datetime64[ns]
Date datetime64[ns]
Time object
Price float64
and I'm trying to calculate difference between the average price of the first 3 hours and the last 3 hours of a day.
Design in my mind is to do something like this:
For every unique date in Date:
    a = avg(price.first(3))
    b = avg(price.last(3))
    dif = a - b
    append dif to another dataset
---------EDIT----------
and the expected result is;
+------------+---------+
| Date | Diff |
+------------+---------+
| 2017-01-01 | 3.33334 |
| 2017-01-02 | 0 |
+------------+---------+
My real query will be in seconds rather than hours. (I didn't want to put 120 rows in here to show 2 minutes of the data.) So hours are stand-ins for seconds.
And there can be some missing rows in the dataset, so if I just do price.first(3600) it can overshoot for some days, right? If I can solve this using df.Timestamp.dt.hour, that would be more precise, I think.
I really can't get my head around how to get the first and last 3 Price values for every day. Any help will be much appreciated! Thank you so much in advance!
As you showed, the hours are ordered, so you can group by day, then get the list of prices for the 24 hours of the day, and then apply a function to compute the difference. You could try something like this:
import pandas as pd
from statistics import mean

def getavg(ls):
    mean3first = mean(ls[:3])
    mean3last = mean(ls[len(ls)-3:])
    return mean3first - mean3last

diff_means = df.groupby(['Date']).agg(list)['Price'].apply(getavg).reset_index()
diff_means.columns = ['Date', 'Diff']
print(diff_means)
I'm not entirely sure what format you want the result in, but I found a solution that I find pretty elegant:
import numpy as np
import pandas as pd

unique_dates = df.Date.unique()
new_df = pd.DataFrame()
for u_date in unique_dates:
    first_3 = np.mean(df[df.Date == u_date].reset_index().head(3).Price)
    last_3 = np.mean(df[df.Date == u_date].reset_index().tail(3).Price)
    # Note: DataFrame.append was removed in pandas 2.x; use pd.concat there instead
    new_df = new_df.append(
        pd.DataFrame([[u_date, last_3 - first_3]], columns=['Date', 'PriceDiff']))
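To address the worry about missing rows, one option is to select by clock time instead of by row count. A sketch along those lines, assuming the Timestamp column from your table and taking "first/last 3 hours of the day" literally:
import pandas as pd

df["Timestamp"] = pd.to_datetime(df["Timestamp"])
prices = df.set_index("Timestamp").sort_index()["Price"]

# Rows in the first and last 3 hours of each day, regardless of how many there are
first3 = prices.between_time("00:00:00", "02:59:59")
last3 = prices.between_time("21:00:00", "23:59:59")

first3_avg = first3.groupby(first3.index.date).mean()
last3_avg = last3.groupby(last3.index.date).mean()

diff = (first3_avg - last3_avg).rename("Diff").rename_axis("Date").reset_index()
print(diff)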

Reshape long form panel data to wide stacked time series

I have panel data of the form:
+--------+----------+------------+----------+
| | user_id | order_date | values |
+--------+----------+------------+----------+
| 0 | 11039591 | 2017-01-01 | 3277.466 |
| 1 | 25717549 | 2017-01-01 | 587.553 |
| 2 | 13629086 | 2017-01-01 | 501.882 |
| 3 | 3022981 | 2017-01-01 | 1352.546 |
| 4 | 6084613 | 2017-01-01 | 441.151 |
| ... | ... | ... | ... |
| 186415 | 17955698 | 2020-05-01 | 146.868 |
| 186416 | 17384133 | 2020-05-01 | 191.461 |
| 186417 | 28593228 | 2020-05-01 | 207.201 |
| 186418 | 29065953 | 2020-05-01 | 430.401 |
| 186419 | 4470378 | 2020-05-01 | 87.086 |
+--------+----------+------------+----------+
as a Pandas DataFrame in Python.
The data is basically stacked time series data; the table contains numerous time series corresponding to observations for unique users within a certain period (2017/01 - 2020/05 above). The level of coverage for the period is likely to be very low amongst individual users, meaning that if you isolate the individual time series they're all of varying lengths.
I want to take this long-format panel data and convert it to wide format, such that each column is a day and each row corresponds to a unique user:
+----------+------------+------------+------------+------------+------------+
| | 2017-01-01 | 2017-01-02 | 2017-01-03 | 2017-01-04 | 2017-01-05 |
+----------+------------+------------+------------+------------+------------+
| 11039591 | 3277.466 | 6482.722 | NaN | NaN | NaN |
| 25717549 | 587.553 | NaN | NaN | NaN | NaN |
| 13629086 | 501.882 | NaN | NaN | NaN | NaN |
| 3022981 | 1352.546 | NaN | NaN | 557.728 | NaN |
| 6084613 | 441.151 | NaN | NaN | NaN | NaN |
+----------+------------+------------+------------+------------+------------+
I'm struggling to get this using unstack/pivot or other Pandas built-ins as I keep running into:
ValueError: Index contains duplicate entries, cannot reshape
due to the repeated user IDs.
My solution at the moment uses a loop to index the individual timeseries and concatenates them together so it's not scalable - it's already really slow with just 180k rows:
def time_series_stacker(df):
    ts = list()
    for user in df['user_id'].unique():
        values = df.loc[df['user_id']==user].drop('user_id', axis=1).T.values
        instance = pd.DataFrame(
            values[1,:].reshape(1,-1),
            index=[user],
            columns=values[0,:].astype('datetime64[ns]')
        )
        ts.append(instance)
    return pd.concat(ts, axis=0)
Can anyone help out with reshaping this more efficiently please?
This is a perfect time to try out pivot_table. Unlike pivot or unstack, it aggregates duplicate (user_id, order_date) pairs (by the mean, by default), so the duplicate-index ValueError goes away:
user_id order_date values
0 11039591 2017-01-01 3277.466
1 11039591 2017-01-02 587.553
2 13629086 2017-01-03 501.882
3 13629086 2017-01-02 1352.546
4 6084613 2017-01-01 441.151
df.pivot_table(index='user_id',columns='order_date',values='values')
Output
order_date 2017-01-01 2017-01-02 2017-01-03
user_id
6084613 441.151 NaN NaN
11039591 3277.466 587.553 NaN
13629086 NaN 1352.546 501.882
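If you need a column for every calendar day in the study window (2017-01 through 2020-05), not only the observed dates, you could reindex the pivoted columns. A sketch, assuming order_date is converted to a datetime column first:
df["order_date"] = pd.to_datetime(df["order_date"])
wide = df.pivot_table(index="user_id", columns="order_date", values="values")
wide = wide.reindex(columns=pd.date_range("2017-01-01", "2020-05-01", freq="D"))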
