I have a pandas dataframe as shown below
import pandas as pd
import numpy as np

df = pd.DataFrame({'login_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM', '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM', '', '10/11/1990'],
                   'DURATION': [21, 30, 200, 34, 45, np.nan]})
I would like to add the DURATION values to the login_date column.
The DURATION values represent days.
If there is NA in the DURATION column, just replace it with 0.
So, I tried the below
df['DURATION'] = df['DURATION'].fillna(0)
df['login_date'] = pd.to_datetime(df['login_date'])
df['DURATION'] = df['DURATION'].astype('Int64')
df['logout_Date'] = df['login_date'] + pd.offsets.DateOffset(days=df['DURATION'])
However, this results in an error as shown below
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
But I have already converted my DURATION column to Int64 type.
How can I add the DURATION values to login_date to create the logout_date column?
Try:
df["logout_date"] = pd.to_datetime(df["login_date"]) + df["DURATION"].fillna(0).apply(lambda x: pd.Timedelta(days=x))
print(df)
Prints:
login_date DURATION logout_date
0 5/7/2013 09:27:00 AM 21.0 2013-05-28 09:27:00
1 09/08/2013 11:21:00 AM 30.0 2013-10-08 11:21:00
2 06/06/2014 08:00:00 AM 200.0 2014-12-23 08:00:00
3 06/06/2014 05:00:00 AM 34.0 2014-07-10 05:00:00
4 45.0 NaT
5 10/11/1990 NaN 1990-10-11 00:00:00
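If performance matters on a larger frame, a vectorized variant should give the same result; this is a sketch assuming the same df as above:
# Build the whole day-offset column at once instead of applying pd.Timedelta row by row
df["logout_date"] = (
    pd.to_datetime(df["login_date"])
    + pd.to_timedelta(df["DURATION"].fillna(0), unit="D")
)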
I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
For every month in which the value -9999 is repeated more than 175 times, I want those values changed to NaN.
Imagine that we have this other dataframe with the number of times the value is repeated per month:
date values
0 2013-01 200
1 2013-02 0
2 2013-03 2
3 2013-04 181
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
In this case, the months of January and April exceed the stipulated value, so the first dataframe should become:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
I imagined using tolist() to create a list of the months in which the value appears more than 175 times, and then applying a condition like df["values"] == -9999 and df["date"] in list_with_months to change the values.
You can do this using a transform call that counts the number of missing values per month within the same dataframe. Then you create a new column conditionally on that count:
import numpy as np
MISSING = -9999
THRESHOLD = 175
# Create a month column
df['month'] = df['date'].dt.to_period('M')
# Count number of MISSING per month and assign to dataframe
df['n_missing'] = (
    df.groupby('month')['values']
      .transform(lambda d: (d == MISSING).sum())
)
# If value is MISSING and the number of missing values is above THRESHOLD, replace with NaN, otherwise keep the original value
df['new_value'] = np.where(
    (df['values'] == MISSING) & (df['n_missing'] > THRESHOLD),
    np.nan,
    df['values']
)
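Two small follow-ups, both assumptions on my side: if df['date'] still holds strings, convert it before the groupby above, and the helper columns can be dropped once new_value is computed:
# Assumption: run this before the groupby above if 'date' is still a string column
df['date'] = pd.to_datetime(df['date'])
# Optional cleanup once 'new_value' is computed
df['values'] = df['new_value']
df = df.drop(columns=['month', 'n_missing', 'new_value'])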
I have a dataframe as shown below
import pandas as pd
import numpy as np

df = pd.DataFrame({'person_id': [101, 101, 101, 101, 202, 202, 202],
                   'start_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM', '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM', '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM', '13/12/2012 11:45:00 AM'],
                   'end_date': ['5/12/2013 09:27:00 AM', np.nan, '06/11/2014 08:00:00 AM', np.nan, '12/16/2011 10:00:00', '10/18/2012 00:00:00', np.nan],
                   'type': ['O', 'I', 'O', 'O', 'I', 'O', 'I']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
I would like to fillna() the end_date column based on the two approaches below:
a) If NA is found in any row except the last row for that person, fill it by copying the value from the next row.
b) If NA is found in the last row for that person, fill it by adding 10 days to that person's start_date (because there is no next row to copy from, so we use an arbitrary offset of 10 days).
Rules a and b apply only to persons with type=I.
For persons with type=O, just fillna by copying the value from start_date.
This is what I tried. You can see I am writing nearly the same line of code twice.
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['end_date'].bfill()),pd.DatetimeIndex(df.start_date.dt.date))
df['end_date'] = np.where(df['type'].str.contains('I'),pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),pd.DatetimeIndex(df.start_date.dt.date))
Is there an elegant and efficient way to write this, as I have to apply it to a big dataset with 15 million rows?
I expect my output to be as shown below.
Solution
s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)
s2 = df['end_date'].fillna(df['start_date'])
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
Explanations
Group the dataframe on person_id and shift the column start_date one unit upwards.
>>> df.groupby('person_id')['start_date'].shift(-1)
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 NaT
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 NaT
Name: start_date, dtype: datetime64[ns]
Fill the NaN values in the shifted column with the values from the start_date column after adding an offset of 10 days.
>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 2014-06-16 05:00:00
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
Now fill the NaN values in the end_date column with the above series s1.
>>> df['end_date'].fillna(s1)
0 2013-05-12 09:27:00
1 2014-06-06 08:00:00
2 2014-06-11 08:00:00
3 2014-06-16 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
Similarly, fill the NaN values in the end_date column with the values from the start_date column to create a series s2.
>>> df['end_date'].fillna(df['start_date'])
0 2013-05-12 09:27:00
1 2013-09-08 11:21:00
2 2014-06-11 08:00:00
3 2014-06-06 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]
Then use np.where to select the values from s1 or s2 based on whether the type is I or O.
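>>> df['end_date'] = np.where(df['type'].eq('I'), s1, s2)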
>>> df
person_id start_date end_date type
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 O
1 101 2013-09-08 11:21:00 2014-06-06 08:00:00 I
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 O
3 101 2014-06-06 05:00:00 2014-06-06 05:00:00 O
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 I
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 O
6 202 2012-12-13 11:45:00 2012-12-23 11:45:00 I
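For the 15-million-row case, the same logic can be wrapped in a small helper so it runs in one pass; a sketch, assuming the frame is already sorted by person_id and start_date:
def fill_end_date(df):
    # Next start_date within each person (NaT for the last row of a person)
    nxt = df.groupby('person_id')['start_date'].shift(-1)
    # Candidate for type I rows: next start_date, or start_date + 10 days for the last row
    s1 = df['end_date'].fillna(nxt.fillna(df['start_date'] + pd.DateOffset(days=10)))
    # Candidate for type O rows: start_date itself
    s2 = df['end_date'].fillna(df['start_date'])
    df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
    return df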
Background: In mplfinance, I want to be able to plot multiple trade markers in the same bar. Currently, to my understanding, you can add only 1 (or 1 buy and 1 sell) to the same bar. I cannot have 2 or more trades on the same side in the same bar unless I create another series.
Here is an example:
d = {'TradeDate': ['2018-10-15 06:00:00',
'2018-10-29 03:00:00',
'2018-10-29 03:00:00',
'2018-10-29 06:00:00',
'2018-11-15 05:00:00',
'2018-11-15 05:00:00',
'2018-11-15 05:00:00'],
'Price': [1.1596,
1.1433,
1.13926,
1.14015,
1.1413,
1.1400,
1.1403]}
df = pd.DataFrame(data=d)
df
TradeDate Price
0 2018-10-15 06:00:00 1.15960
1 2018-10-29 03:00:00 1.14330
2 2018-10-29 03:00:00 1.13926
3 2018-10-29 06:00:00 1.14015
4 2018-11-15 05:00:00 1.14130
5 2018-11-15 05:00:00 1.14000
6 2018-11-15 05:00:00 1.14030
As you can see, there are multiple trades for 2 datetimes. Now I would like to apply a rule that says "If there is more than 1 trade (here: Price) per date, create a new column for the additional price; keep doing so until all prices for the same TradeDate (datetime) have been distributed across columns and all datetimes are unique". So the more prices there are for the same date, the more extra columns are needed.
The end result would look like this (I finagled this data manually):
TradeDate Price Price2 Price3
0 2018-10-15 06:00:00 1.15960 NaN NaN
1 2018-10-29 03:00:00 1.14330 1.13926 NaN
3 2018-10-29 06:00:00 1.14015 NaN NaN
4 2018-11-15 05:00:00 1.14130 1.14000 1.1403
The trick is to add an incremental counter to each unique datetime, such that if a datetime is encountered more than once, the counter increases.
To do this, we group by TradeDate and take a cumulative count of the duplicate TradeDates for a given TradeDate. I then add 1 to this value so our counting starts at 1 instead of 0.
df["TradeDate_count"] = df.groupby("TradeDate").cumcount() + 1
print(df)
TradeDate Price TradeDate_count
0 2018-10-15 06:00:00 1.15960 1
1 2018-10-29 03:00:00 1.14330 1
2 2018-10-29 03:00:00 1.13926 2
3 2018-10-29 06:00:00 1.14015 1
4 2018-11-15 05:00:00 1.14130 1
5 2018-11-15 05:00:00 1.14000 2
6 2018-11-15 05:00:00 1.14030 3
Now that we've added that column, we can simply pivot to achieve your desired result. Note that I added a rename(...) call simply to prefix our column names with "price". I also used the rename_axis method because the pivot returns a named column axis, which some users find hard to look at, so I figured it would be best to remove it.
new_df = (df.pivot(index="TradeDate", columns="TradeDate_count", values="Price")
            .rename(columns="price{}".format)
            .rename_axis(columns=None))
price1 price2 price3
TradeDate
2018-10-15 06:00:00 1.15960 NaN NaN
2018-10-29 03:00:00 1.14330 1.13926 NaN
2018-10-29 06:00:00 1.14015 NaN NaN
2018-11-15 05:00:00 1.14130 1.14000 1.1403
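If you want TradeDate back as a regular column (as in the desired output above), a reset_index() call should do it:
# Optional: turn the TradeDate index back into a regular column
new_df = new_df.reset_index()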
A slightly different approach is to group the data by TradeDate and collect all the values into a list. This can then be pulled out into separate columns and assigned to a new dataframe.
reduced = df.groupby('TradeDate').agg(list)
new_df = pd.DataFrame(reduced['Price'].to_list(), index=reduced.index)
As per the other answer, if you want nicer column names you could do the following (matching the Price, Price2, Price3 naming from the question):
new_df.rename(columns=lambda x: f'Price{x + 1 if x > 0 else ""}', inplace=True)
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then go in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column. Since Hora_Retiro is of timedelta64[ns] type, the hour component comes from the .dt.components accessor:
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group by that hour column:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str), format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
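If you need every hour from 00:00:00 to 23:00:00 to appear even when an hour has no rows, a reindex over the 24 hours is one option; a sketch, again assuming a timedelta64[ns] column:
# Optional: guarantee all 24 hour buckets, filling missing ones with 0
hourly = df.groupby(df['Hora_Retiro'].dt.components.hours)['count_uses'].sum()
hourly = hourly.reindex(range(24), fill_value=0)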
I assume that the Hora_Retiro column in your DataFrame is of Timedelta type. It is not datetime, otherwise the date part would also be printed.
Indeed, your code creates groups starting at the minute / second taken from the first row.
To group by "full hours":
round each element in this column to hour,
then group (just by this rounded value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
lambda tt: tt.round('H'))).count_uses.count()
However, I advise you to decide what you actually want to count: rows, or the values in the count_uses column. In the second case, replace the count function with sum.
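A vectorized variant of the same idea (a sketch, assuming hora_pico.Hora_Retiro is timedelta64[ns] and that you want the per-hour sum of count_uses):
# Floor every timedelta to the hour in one vectorized call, then aggregate
hourly = (hora_pico
          .groupby(hora_pico['Hora_Retiro'].dt.floor('H'))['count_uses']
          .sum())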
There is a pandas dataframe like this:
index
2018-06-01 02:50:00 R 45.48 -2.8
2018-06-01 07:13:00 R 45.85 -2.0
...
2018-06-01 08:37:00 R 45.87 -2.7
I would like to round the index to the hour like this:
index
2018-06-01 02:00:00 R 45.48 -2.8
2018-06-01 07:00:00 R 45.85 -2.0
...
2018-06-01 08:00:00 R 45.87 -2.7
I am trying the following code:
df = df.date_time.apply ( lambda x : x.round('H'))
but it returns a Series instead of a dataframe with the modified index column.
Try using floor:
df.index.floor('H')
Setup:
df = pd.DataFrame(np.arange(25),index=pd.date_range('2018-01-01 01:12:50','2018-01-02 01:12:50',freq='H'),columns=['Value'])
df.head()
Value
2018-01-01 01:12:50 0
2018-01-01 02:12:50 1
2018-01-01 03:12:50 2
2018-01-01 04:12:50 3
2018-01-01 05:12:50 4
df.index = df.index.floor('H')
df.head()
Value
2018-01-01 01:00:00 0
2018-01-01 02:00:00 1
2018-01-01 03:00:00 2
2018-01-01 04:00:00 3
2018-01-01 05:00:00 4
Try my method:
Add a new column holding the index rounded to the hour:
df['E'] = df.index.round('H')
Set it as the index:
df1 = df.set_index('E')
Remove the index name you set ('E' here):
df1.index.name = None
And now df1 is a new DataFrame whose index is the hour-rounded index of df.
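The three steps can also be combined into a single expression; a sketch (use floor('H') instead of round('H') if 02:50 should map to 02:00 rather than 03:00):
# One-liner equivalent: build the new frame directly from the rounded index
df1 = df.set_index(df.index.round('H'))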
Try this
import datetime

df['index'].apply(lambda dt: datetime.datetime(dt.year, dt.month, dt.day, dt.hour))  # minutes and seconds truncated to zero
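If the column holds pandas Timestamps, a vectorized equivalent of the apply above should be (a sketch, keeping the question's 'index' column name):
# Floor each timestamp to the start of its hour without a Python-level loop
df['index'] = pd.to_datetime(df['index']).dt.floor('H')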