How can I set the last rows of a dataframe based on a condition in Python?

I have one dataframe, df1, with two columns. The first column, 'col1', is a datetime column, and the second, 'col2', is an int column with only two possible values (0 or 1). Here is an example of the dataframe:
+---------------------+------+
| col1                | col2 |
+---------------------+------+
| 2020-01-01 10:00:00 | 0    |
| 2020-01-01 11:00:00 | 1    |
| 2020-01-01 12:00:00 | 1    |
| 2020-01-02 11:00:00 | 0    |
| 2020-01-02 12:00:00 | 1    |
| ...                 | ...  |
+---------------------+------+
As you can see, the datetimes are sorted in ascending order. What I would like is: for each different date (in this example there are two different dates, 2020-01-01 and 2020-01-02, each with different times), I would like to keep the first 1 value and set the previous and following values on that date to 0. So the resulting dataframe would be:
+---------------------+------+
| col1                | col2 |
+---------------------+------+
| 2020-01-01 10:00:00 | 0    |
| 2020-01-01 11:00:00 | 1    |
| 2020-01-01 12:00:00 | 0    |
| 2020-01-02 11:00:00 | 0    |
| 2020-01-02 12:00:00 | 1    |
| ...                 | ...  |
+---------------------+------+
How can I do it in Python?

Use:
df['col1'] = pd.to_datetime(df['col1'])
# within each date, the running sum of col2 equals 1 from the first 1
# up to (but not including) the next 1; keep col2 there, zero it elsewhere
mask = df.groupby(df['col1'].dt.date)['col2'].cumsum().eq(1)
df['col2'] = df['col2'].where(mask, 0)
Output:
>>> df
                 col1  col2
0 2020-01-01 10:00:00     0
1 2020-01-01 11:00:00     1
2 2020-01-01 12:00:00     0
3 2020-01-02 11:00:00     0
4 2020-01-02 12:00:00     1
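An equivalent sketch, in case row labels are easier to reason about: find the index of the first 1 within each date, zero the whole column, then write the 1s back. This assumes df['col1'] is already datetime, as above:
import pandas as pd

# rows holding a 1, then the first such row per calendar date
ones = df[df['col2'].eq(1)]
idx = ones.groupby(ones['col1'].dt.date).head(1).index

df['col2'] = 0           # reset everything to 0 ...
df.loc[idx, 'col2'] = 1  # ... and restore only the first 1 of each date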


Add a new record for each missing second in a DataFrame with TimeStamp [duplicate]

Given the following Pandas DataFrame:
| date                | counter |
|---------------------|---------|
| 2022-01-01 10:00:01 | 1       |
| 2022-01-01 10:00:04 | 1       |
| 2022-01-01 10:00:06 | 1       |
I want to create a function that, given the previous DataFrame, returns a similar DataFrame with a new row, with counter 0, added for each missing second in that time interval.
| date                | counter |
|---------------------|---------|
| 2022-01-01 10:00:01 | 1       |
| 2022-01-01 10:00:02 | 0       |
| 2022-01-01 10:00:03 | 0       |
| 2022-01-01 10:00:04 | 1       |
| 2022-01-01 10:00:05 | 0       |
| 2022-01-01 10:00:06 | 1       |
If the initial DataFrame contains more than one day, the function should do the same, filling in every missing second interval across all the days included.
Thank you for your help.
Use DataFrame.asfreq, which works on a DatetimeIndex:
df = df.set_index('date').asfreq('1S', fill_value=0).reset_index()
print(df)
                 date  counter
0 2022-01-01 10:00:01        1
1 2022-01-01 10:00:02        0
2 2022-01-01 10:00:03        0
3 2022-01-01 10:00:04        1
4 2022-01-01 10:00:05        0
5 2022-01-01 10:00:06        1
You can also use df.resample:
In [314]: df = df.set_index('date').resample('1S').sum().fillna(0).reset_index()
In [315]: df
Out[315]:
                 date  counter
0 2022-01-01 10:00:01        1
1 2022-01-01 10:00:02        0
2 2022-01-01 10:00:03        0
3 2022-01-01 10:00:04        1
4 2022-01-01 10:00:05        0
5 2022-01-01 10:00:06        1
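As a usage sketch, either approach can be wrapped in the requested function. This is a minimal version assuming the date column is named 'date', already has datetime64 dtype, and is sorted ascending:
import pandas as pd

def fill_missing_seconds(df: pd.DataFrame) -> pd.DataFrame:
    # Insert a row with counter 0 for every missing second between
    # the first and last timestamp, across all days in the frame.
    return (df.set_index('date')
              .asfreq('1s', fill_value=0)
              .reset_index())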

How to group data with similar dates in pandas

I have two csv files. Both files contain Date, Stock, Open, High, Low, Close columns for a single day. I made one dataframe from these two files, so in this single dataframe the data of Stock 1 comes first, from day open to day close, followed by the data of Stock 2 from day open to day close. The data is at 15-minute intervals; a day starts at 2019-01-01 09:15:00 and ends at 2019-01-01 15:15:00.
What I want is to create a dataframe where the data of stock1 at 2019-01-01 09:15:00 appears first, then the data of stock2 at the same time, and so on for 2019-01-01 09:30:00, 2019-01-01 09:45:00, ...
New Answer:
After reading your response, I figured the best course of action for your issue would be moving your data to a two-level index using a pandas MultiIndex:
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
Old Answer:
You can use the pandas concat method. If the index formats match, the pandas API will take care of the rest.
import pandas as pd

idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.DataFrame(range(len(idx)), index=idx)
| | 0 |
|:--------------------|----:|
| 2018-01-01 00:00:00 | 0 |
| 2018-01-01 01:00:00 | 1 |
| 2018-01-01 02:00:00 | 2 |
| 2018-01-01 03:00:00 | 3 |
| 2018-01-01 04:00:00 | 4 |
idy = pd.date_range("2018-01-02", periods=10, freq="H")
tsy = pd.DataFrame(range(len(idy)), index=idy)
| | 0 |
|:--------------------|----:|
| 2018-01-02 00:00:00 | 0 |
| 2018-01-02 01:00:00 | 1 |
| 2018-01-02 02:00:00 | 2 |
| 2018-01-02 03:00:00 | 3 |
| 2018-01-02 04:00:00 | 4 |
| 2018-01-02 05:00:00 | 5 |
| 2018-01-02 06:00:00 | 6 |
| 2018-01-02 07:00:00 | 7 |
| 2018-01-02 08:00:00 | 8 |
| 2018-01-02 09:00:00 | 9 |
Result:
pd.concat([ts, tsy])
| | 0 |
|:--------------------|----:|
| 2018-01-01 00:00:00 | 0 |
| 2018-01-01 01:00:00 | 1 |
| 2018-01-01 02:00:00 | 2 |
| 2018-01-01 03:00:00 | 3 |
| 2018-01-01 04:00:00 | 4 |
| 2018-01-02 00:00:00 | 0 |
| 2018-01-02 01:00:00 | 1 |
| 2018-01-02 02:00:00 | 2 |
| 2018-01-02 03:00:00 | 3 |
| 2018-01-02 04:00:00 | 4 |
| 2018-01-02 05:00:00 | 5 |
| 2018-01-02 06:00:00 | 6 |
| 2018-01-02 07:00:00 | 7 |
| 2018-01-02 08:00:00 | 8 |
| 2018-01-02 09:00:00 | 9 |
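Applied to the question's stock data, a minimal sketch of the MultiIndex idea could look like the following. The file names and the 'Date' index column are assumptions about your CSVs:
import pandas as pd

# Hypothetical inputs: one frame per stock, indexed by the 15-minute timestamp
df1 = pd.read_csv('stock1.csv', parse_dates=['Date'], index_col='Date')
df2 = pd.read_csv('stock2.csv', parse_dates=['Date'], index_col='Date')

# Tag each frame with its stock, put the timestamp level first, then sort so
# rows interleave: stock1 09:15, stock2 09:15, stock1 09:30, stock2 09:30, ...
combined = (pd.concat([df1, df2], keys=['stock1', 'stock2'],
                      names=['Stock', 'Date'])
              .swaplevel()
              .sort_index())
print(combined.head())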

Iterate over each day, calculate average price of first and last 3 hours and take difference of those averages in Python

I have a data frame that looks like this:
+---------------------+------------+----------+-------+
| Timestamp           | Date       | Time     | Price |
+---------------------+------------+----------+-------+
| 2017-01-01 00:00:00 | 2017-01-01 | 00:00:00 | 20    |
| 2017-01-01 00:01:00 | 2017-01-01 | 00:01:00 | 25    |
| 2017-01-01 00:02:00 | 2017-01-01 | 00:02:00 | 15    |
| 2017-01-01 00:03:00 | 2017-01-01 | 00:03:00 | 20    |
| ...                 |            |          |       |
| 2017-01-01 00:20:00 | 2017-01-01 | 00:20:00 | 25    |
| 2017-01-01 00:21:00 | 2017-01-01 | 00:21:00 | 15    |
| 2017-01-01 00:22:00 | 2017-01-01 | 00:22:00 | 10    |
| 2017-01-01 00:23:00 | 2017-01-01 | 00:23:00 | 25    |
| 2017-02-01 00:00:00 | 2017-02-01 | 00:00:00 | 10    |
| 2017-02-01 00:01:00 | 2017-02-01 | 00:01:00 | 25    |
| 2017-02-01 00:02:00 | 2017-02-01 | 00:02:00 | 10    |
| 2017-02-01 00:03:00 | 2017-02-01 | 00:03:00 | 25    |
| ...                 |            |          |       |
| 2017-02-01 00:20:00 | 2017-02-01 | 00:20:00 | 15    |
| 2017-02-01 00:21:00 | 2017-02-01 | 00:21:00 | 10    |
| 2017-02-01 00:22:00 | 2017-02-01 | 00:22:00 | 25    |
| 2017-02-01 00:23:00 | 2017-02-01 | 00:23:00 | 10    |
+---------------------+------------+----------+-------+
Timestamp    datetime64[ns]
Date         datetime64[ns]
Time         object
Price        float64
and I'm trying to calculate the difference between the average price of the first 3 hours and the last 3 hours of each day.
The design in my mind is something like this:
for every unique date in Date:
    a = avg(price.first(3))
    b = avg(price.last(3))
    diff = a - b
    append diff to another dataset
---------EDIT----------
and the expected result is:
+------------+---------+
| Date       | Diff    |
+------------+---------+
| 2017-01-01 | 3.33334 |
| 2017-02-01 | 0       |
+------------+---------+
My real query will be in seconds rather than hours (I didn't want to put 120 rows in here to show 2 minutes of the data, so hours stand in for seconds).
And there can be some missing rows in the dataset, so if I just take price.first(3600) it can overshoot for some days, right? If I can solve this using df.Timestamp.dt.hour, that would be more precise, I think.
I really can't get my head around how to get the first and last 3 Price values for each day. Any help will be much appreciated! Thank you so much in advance!
As you showed, the rows within each day are ordered, so you can group by day, get the list of prices for that day, and then apply a function that computes the difference. You could try something like this:
import pandas as pd
from statistics import mean

def getavg(ls):
    # difference between the means of the first 3 and last 3 prices of the day
    mean3first = mean(ls[:3])
    mean3last = mean(ls[-3:])
    return mean3first - mean3last

diff_means = df.groupby(['Date']).agg(list)['Price'].apply(getavg).reset_index()
diff_means.columns = ['Date', 'Diff']
print(diff_means)
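The same idea also works without materializing lists. A short groupby-apply sketch on the question's df, assuming rows within each date are already time-ordered:
# per date: mean of the first 3 prices minus mean of the last 3
diff_means = (df.groupby('Date')['Price']
                .apply(lambda s: s.head(3).mean() - s.tail(3).mean())
                .rename('Diff')
                .reset_index())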
I'm not entirely sure what format you want the result in, but here's a solution that I find pretty elegant:
import numpy as np
import pandas as pd

unique_dates = df.Date.unique()
rows = []
for u_date in unique_dates:
    day = df[df.Date == u_date].reset_index()
    first_3 = np.mean(day.head(3).Price)
    last_3 = np.mean(day.tail(3).Price)
    rows.append([u_date, last_3 - first_3])

# DataFrame.append was removed in pandas 2.0, so collect rows and build once
new_df = pd.DataFrame(rows, columns=['Date', 'PriceDiff'])
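Regarding the worry about missing rows: selecting by clock hour instead of row position sidesteps overshooting. A sketch under the assumption that "first/last 3 hours" literally means hours 00-02 and 21-23 of each day, and that Timestamp is datetime64:
import pandas as pd

hours = df['Timestamp'].dt.hour
first3 = df[hours < 3].groupby('Date')['Price'].mean()    # hours 00-02
last3 = df[hours >= 21].groupby('Date')['Price'].mean()   # hours 21-23
diff = (first3 - last3).rename('Diff').reset_index()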

Interpolate time series and resample/pivot. How to get the expected output

I have a df that looks like this:
| Video | Start               | End                 | Duration |
|-------|---------------------|---------------------|----------|
| vid1  | 2018-10-02 16:00:29 | 2018-10-02 20:07:05 | 246      |
| vid2  | 2018-10-04 16:03:08 | 2018-10-04 16:10:11 | 7        |
| vid3  | 2018-10-04 10:13:40 | 2018-10-06 12:07:38 | 113      |
What I want to do is resample the dataframe into 10-minute bins based on the start column and assign 1 if the video was playing during that timestamp and 0 if not.
The desired output is:
| Start               | vid1 | vid2 | vid3 |
|---------------------|------|------|------|
| 2018-10-02 16:00:00 | 1    | 0    | 0    |
| 2018-10-02 16:10:00 | 1    | 0    | 0    |
| ...                 |      |      |      |
| 2018-10-04 16:10:00 | 0    | 1    | 0    |
| 2018-10-04 16:20:00 | 0    | 0    | 1    |
The output is only shown to visualize the goal, so it may contain errors.
The problem is that I cannot resample the dataframe in a way that produces the desired crosstab output.
Try this:
df.apply(lambda x: pd.Series(x['Video'],
                             index=pd.date_range(x['Start'].floor('10min'),
                                                 x['End'].ceil('10min'),
                                                 freq='10min')),
         axis=1)\
  .stack().str.get_dummies().reset_index(level=0, drop=True)
Output:
                     vid1  vid2  vid3
2018-10-02 16:00:00     1     0     0
2018-10-02 16:10:00     1     0     0
2018-10-02 16:20:00     1     0     0
2018-10-02 16:30:00     1     0     0
2018-10-02 16:40:00     1     0     0
...                   ...   ...   ...
2018-10-06 11:30:00     0     0     1
2018-10-06 11:40:00     0     0     1
2018-10-06 11:50:00     0     0     1
2018-10-06 12:00:00     0     0     1
2018-10-06 12:10:00     0     0     1
[330 rows x 3 columns]
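A similar table can also be reached with explode plus pd.crosstab. A sketch under the same assumptions (Start/End are datetime64; explode needs pandas >= 0.25):
import pandas as pd

# one 10-minute slot range per video
slots = [pd.date_range(s.floor('10min'), e.ceil('10min'), freq='10min')
         for s, e in zip(df['Start'], df['End'])]

# one row per (video, slot), then count occurrences into a 0/1 table
expanded = df.assign(slot=slots).explode('slot')
table = pd.crosstab(expanded['slot'], expanded['Video'])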

Pandas: get observations by timestamp

I got a list of dynamic values (e.g. observations). It records all value changes of an entity (e.g. display).
df
+----+---------------------+---------------+-------+
|    | time                | display_index | value |
|----+---------------------+---------------+-------|
|  0 | 2017-11-06 13:00:00 |             1 | val1  |
|  1 | 2017-11-06 14:00:00 |             1 | val2  |
|  2 | 2017-11-06 15:00:00 |             1 | val1  |
|  3 | 2017-11-06 13:30:00 |             2 | val3  |
|  4 | 2017-11-06 14:05:00 |             2 | val4  |
|  5 | 2017-11-06 15:30:00 |             2 | val1  |
+----+---------------------+---------------+-------+
Now I got a second list of timestamps and I'm interested in the values that each display has shown at that time. Note that the first timestamp (13:00) for display_index 2 is before any value is even known for that one (first record is 13:30).
df_times
+----+---------------------+---------------+
|    | time                | display_index |
|----+---------------------+---------------|
|  0 | 2017-11-06 13:20:00 |             1 |
|  1 | 2017-11-06 13:40:00 |             1 |
|  2 | 2017-11-06 13:00:00 |             2 |
|  3 | 2017-11-06 14:00:00 |             2 |
+----+---------------------+---------------+
I tried calculating the period between both timestamps and chose the observation with the minimum value for that period:
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
+----+---------------------+---------------+---------------------+-------+---------+
|    | time                | display_index | time_measured       | value | seconds |
|----+---------------------+---------------+---------------------+-------+---------|
|  0 | 2017-11-06 13:00:00 |             2 | 2017-11-06 13:30:00 | val3  |    1800 |
|  1 | 2017-11-06 13:20:00 |             1 | 2017-11-06 13:00:00 | val1  |    1200 |
|  2 | 2017-11-06 13:40:00 |             1 | 2017-11-06 14:00:00 | val2  |    1200 |
|  3 | 2017-11-06 14:00:00 |             2 | 2017-11-06 14:05:00 | val4  |     300 |
+----+---------------------+---------------+---------------------+-------+---------+
The problem is that the last values for displays 1 and 2 are wrong, since those displays were still showing an earlier value at that time. It should be val1 for display 1 and val3 for display 2. What I'm actually looking for is the observation that was last seen before the timestamp. How can I do this?
Here's the code that I used:
import pandas as pd
from tabulate import tabulate
import math
values = [("2017-11-06 13:00", 1, 'val1'),
("2017-11-06 14:00", 1, 'val2'),
("2017-11-06 15:00", 1, 'val1'),
("2017-11-06 13:30", 2, 'val3'),
("2017-11-06 14:05", 2, 'val4'),
("2017-11-06 15:30", 2, 'val1'),
]
labels = ['time', 'display_index', 'value']
df = pd.DataFrame.from_records(values, columns=labels)
df['time'] = pd.to_datetime(df['time'])
print(tabulate(df, headers='keys', tablefmt='psql'))
values = [("2017-11-06 13:20", 1),
("2017-11-06 13:40", 1),
("2017-11-06 13:00", 2),
("2017-11-06 14:00", 2),
]
labels = ['time', 'display_index']
df_times = pd.DataFrame.from_records(values, columns=labels)
df_times['time'] = pd.to_datetime(df_times['time'])
print(tabulate(df_times, headers='keys', tablefmt='psql'))
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
This is a perfect use case for pd.merge_asof.
Note: I think you got the second row wrong.
# dataframes need to be sorted
df_times = df_times.sort_values(['time', 'display_index'])
df = df.sort_values(['time', 'display_index'])
pd.merge_asof(
df_times, df.assign(time_measured=df.time),
on='time', by='display_index', direction='forward'
).assign(seconds=lambda d: d.time_measured.sub(d.time).dt.total_seconds())
                 time  display_index value       time_measured  seconds
0 2017-11-06 13:00:00              2  val3 2017-11-06 13:30:00   1800.0
1 2017-11-06 13:20:00              1  val2 2017-11-06 14:00:00   2400.0
2 2017-11-06 13:40:00              1  val2 2017-11-06 14:00:00   1200.0
3 2017-11-06 14:00:00              2  val4 2017-11-06 14:05:00    300.0
Explanation
pd.merge_asof: for every row in the left argument, it attempts to locate a matching row in the right argument.
Since we passed direction='forward', it looks forward from each row in the left argument and finds the next value.
I needed a way to capture the time_measured column. Since merge_asof consumes the time column, I assigned a duplicate of it under a different name; df.assign(time_measured=df.time) just dups the column for later use.
I use assign again, this time to add a new column, seconds. When using assign, you can pass an array of the same length as the dataframe, a series whose values will align on the index, or a callable that receives the calling dataframe. That is what I did here: the lambda takes the calling dataframe, finds the difference between the two date columns, and converts the resulting series of timedeltas to seconds.
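If, as the question itself states, the value last seen before each timestamp is wanted instead, the same call with direction='backward' (the default) should do it. A sketch; note that timestamps with no earlier observation (13:00 for display 2) come back as NaN:
result = pd.merge_asof(
    df_times.sort_values(['time', 'display_index']),
    df.sort_values(['time', 'display_index']).assign(time_measured=df['time']),
    on='time', by='display_index', direction='backward',
)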
