`set_index()` does not sort index?

Concatenating this dask DataFrame with this pandas DataFrame and using set_index to sort the index does not result in a sorted index. Is this normal?
from dask import dataframe as dd
import pandas as pd

a = list('aabbccddeeffgghhi')
df = pd.DataFrame(dict(a=a),
                  index=pd.date_range(start='2010/01/01', end='2010/02/01', periods=len(a))).reset_index()
ddf = dd.from_pandas(df, npartitions=5)

a2 = list('aabbccddeef')
df2 = pd.DataFrame(dict(a=a2),
                   index=pd.date_range(start='2020/01/01', end='2020/01/06', periods=len(a2))).reset_index()
ddf2 = dd.concat([ddf, df2]).set_index('index')
ddf2.compute()
a
index
2010-01-01 00:00:00 a
2010-01-02 22:30:00 a
2010-01-04 21:00:00 b
2010-01-06 19:30:00 b
2010-01-08 18:00:00 c
2010-01-10 16:30:00 c
2010-01-12 15:00:00 d
2010-01-14 13:30:00 d
2010-01-16 12:00:00 e
2010-01-18 10:30:00 e
2010-01-20 09:00:00 f
2010-01-22 07:30:00 f
2010-01-24 06:00:00 g
2010-01-26 04:30:00 g
2010-01-28 03:00:00 h
2010-01-30 01:30:00 h
2010-02-01 00:00:00 i
2020-01-01 00:00:00 a
2020-01-01 12:00:00 a
2020-01-02 00:00:00 b
2020-01-02 12:00:00 b
2020-01-03 00:00:00 c
2020-01-03 12:00:00 c
2020-01-04 00:00:00 d
2020-01-04 12:00:00 d
2020-01-05 00:00:00 e
2020-01-05 12:00:00 e
2020-01-06 00:00:00 f
Am I doing something the wrong way?

Yes, it is completely normal, because most pandas operations do not assume a sorted index (though some do).
In dask DataFrames you must apply
ddf2 = dd.concat([ddf, df2]).set_index('index', sorted=True)
By the way, your data is already properly sorted by index; note the years (2010, 2020).

Try sort_index to sort. set_index just sets the index; it doesn't force a re-sort.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html
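A minimal plain-pandas sketch of that idea (on made-up data, not the dask frame above): set_index keeps the existing row order, while sort_index reorders rows by the index values.
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]},
                  index=pd.to_datetime(['2020-01-02', '2010-01-01', '2015-06-01']))
print(df.sort_index())  # rows come back in chronological index order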

Related

Pandas - Replace Duplicates with Nan and Keep Row

How do I replace duplicates within each group with NaNs while keeping the rows?
I need to keep all rows without removing any, keeping only the original value where it first shows up.
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({
    'date': ['2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:00', '2019-01-01 03:00:00',
             '2019-09-01 02:00:00', '2019-09-01 03:00:00', '2019-09-01 04:00:00', '2019-09-01 05:00:00'],
    'value': [10, 10, 10, 10, 12, 12, 12, 12],
    'ID': ['Jackie', 'Jackie', 'Jackie', 'Jackie', 'Zoop', 'Zoop', 'Zoop', 'Zoop'],
})
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 10 Jackie
2 2019-01-01 02:00:00 10 Jackie
3 2019-01-01 03:00:00 10 Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 12 Zoop
6 2019-09-01 04:00:00 12 Zoop
7 2019-09-01 05:00:00 12 Zoop
Desired Dataframe:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
Edit:
Duplicated values should only be dropped within the same date, regardless of the frequency. So if value 10 shows up twice on Jan-1 and three times on Jan-2, the value 10 should only show up once on Jan-1 and once on Jan-2.
I assume you check duplicates on columns value and ID, and additionally on the date part of column date:
import numpy as np

df.loc[df.assign(d=df.date.dt.date).duplicated(['value', 'ID', 'd']), 'value'] = np.nan
Out[269]:
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
As @Trenton suggests, you may use pd.NA to avoid importing numpy.
(Note: as @rafaelc suggests, here is the link explaining the differences between pd.NA and np.nan in detail: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)
df.loc[df.assign(d=df.date.dt.date).duplicated(['value','ID', 'd']), 'value'] = pd.NA
Out[273]:
date value ID
0 2019-01-01 00:00:00 10 Jackie
1 2019-01-01 01:00:00 <NA> Jackie
2 2019-01-01 02:00:00 <NA> Jackie
3 2019-01-01 03:00:00 <NA> Jackie
4 2019-09-01 02:00:00 12 Zoop
5 2019-09-01 03:00:00 <NA> Zoop
6 2019-09-01 04:00:00 <NA> Zoop
7 2019-09-01 05:00:00 <NA> Zoop
This works if the dataframe is sorted, as in your example:
import numpy as np # to be used for np.nan
df['duplicate'] = df['value'].shift(1) # create a duplicate column
df['value'] = df.apply(lambda x: np.nan if x['value'] == x['duplicate'] \
else x['value'], axis=1) # conditional replace
df = df.drop('duplicate', axis=1) # drop helper column
Group on the dates and take the first observed value (not necessarily the first when sorted by time), then merge the result back to the original dataframe.
df2 = df.groupby([df['date'].dt.date, 'ID'], as_index=False).first()
>>> df.drop(columns='value').merge(df2, on=['date', 'ID'], how='left')[df.columns]
date value ID
0 2019-01-01 00:00:00 10.0 Jackie
1 2019-01-01 01:00:00 NaN Jackie
2 2019-01-01 02:00:00 NaN Jackie
3 2019-01-01 03:00:00 NaN Jackie
4 2019-09-01 02:00:00 12.0 Zoop
5 2019-09-01 03:00:00 NaN Zoop
6 2019-09-01 04:00:00 NaN Zoop
7 2019-09-01 05:00:00 NaN Zoop
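A hedged one-liner along the same lines (assuming, per the edit, that duplicates are defined by calendar date, ID and value): groupby plus cumcount flags every repeat after the first occurrence, and mask blanks those values out.
dup = df.groupby([df['date'].dt.date, 'ID', 'value']).cumcount() > 0
df['value'] = df['value'].mask(dup)  # repeats become NaN, first occurrences are kept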

Applying start and end time as filters to a dataframe

I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set all values that do not fall within any of the start and end time windows to zero, and retain the values within the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all masks with Series.between in a list comprehension, join them with logical or via np.logical_or.reduce, and finally pass the combined mask to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np

L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time', 'end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
A solution using an outer join (via merge) and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
df1.loc[df1.index.difference(
    df1.assign(key=0).merge(df2.assign(key=0), how='outer')
       .query("timestamp >= start_time and timestamp < end_time").index
), "Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The dummy key added via assign(key=0) to both dataframes produces the cartesian product.
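A hedged variant of the same idea (assumes pandas >= 1.2, which added how='cross'): carrying df1's original index through the cartesian product as an explicit column makes the final lookup map back to df1 unambiguously.
prod = df1.reset_index().merge(df2, how='cross')   # cartesian product, no dummy key needed
keep = prod.query("timestamp >= start_time and timestamp < end_time")['index'].unique()
df1.loc[~df1.index.isin(keep), 'Value'] = 0        # zero everything outside the windows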

Reverse position of entries in pandas dataframe based on condition

Here I have an extract from my pandas dataframe, which is survey data with two datetime fields. It appears that some of the start times and end times were filled in the wrong position in the survey. Here is an example from my dataframe. The start and end time in the 8th row, I suspect, were entered the wrong way round.
Just to give context, I generated the third column like this:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
The three columns are in timedelta64 format.
Here is the top of my dataframe:
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 -1 days +22:15:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
What I am trying to do is, loop through these two columns, and for each time 'tripEnd_time' is less than 'tripStart_time' swap the positions of these two entries. So in the case of row 8 above, I would make tripStart_time = tripEnd_time and tripEnd_time = tripStart_time.
I am not quite sure of the best way to approach this. Should I use a nested for loop where I compare each entry in the two columns?
Thanks
Use Series.abs:
df_time['trip_duration'] = (df_time['tripEnd_time'] - df_time['tripStart_time']).abs()
print (df_time)
  tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
Which is the same as:
import numpy as np

a = df_time['tripEnd_time'] - df_time['tripStart_time']
b = df_time['tripStart_time'] - df_time['tripEnd_time']
mask = df_time['tripEnd_time'] > df_time['tripStart_time']
df_time['trip_duration'] = np.where(mask, a, b)
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
You can switch column values on selected rows:
mask = df_time['tripEnd_time'] < df_time['tripStart_time']
df_time.loc[mask, ['tripStart_time', 'tripEnd_time']] = (
    df_time.loc[mask, ['tripEnd_time', 'tripStart_time']].values)
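After the swap, the duration can be recomputed with the formula from the question, and every value now comes out non-negative:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']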

After groupby, evaluate value in column against column values in all rows in the group

I am looking for the following functionality in python:
I have a Pandas DataFrame with 4 columns: ID, StartDate, EndDate, Moment.
I want to group by ID and evaluate, per row in the group, whether the Moment variable falls within the interval between StartDate and EndDate. The problem is that I want to evaluate this against every row in the group. For example, in the following DataFrame there are two groups (ID=1 and ID=2) and both groups consist of 5 rows. For each row in both groups, I want a boolean indicating whether the moment variable in that row falls in ANY of the time windows in the group, a window being [date1, date2].
import pandas as pd
i = pd.date_range('2018-04-11', periods=10, freq='2D20min')
i2 = pd.date_range('2018-04-12', periods=10, freq='2D20min')
i3 = pd.date_range('2018-04-9', periods=10, freq='1D6H')
id = ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2']
ts = pd.DataFrame({'date1': i, 'date2': i2, 'moment': i3}, index=id)
ID date1 date2 moment
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00
In this case, the value for moment in the first row of the first group does not fall in any of the five time intervals, and neither does the second. The third value, 2018-04-11 12:00:00, does fall in the interval in the first row, and I would thus want True returned.
The desired result would look as follows:
ID date1 date2 moment result
1 2018-04-11 00:00:00 2018-04-12 00:00:00 2018-04-09 00:00:00 False
1 2018-04-13 00:20:00 2018-04-14 00:20:00 2018-04-10 06:00:00 False
1 2018-04-15 00:40:00 2018-04-16 00:40:00 2018-04-11 12:00:00 True
1 2018-04-17 01:00:00 2018-04-18 01:00:00 2018-04-12 18:00:00 False
1 2018-04-19 01:20:00 2018-04-20 01:20:00 2018-04-14 00:00:00 True
2 2018-04-21 01:40:00 2018-04-22 01:40:00 2018-04-15 06:00:00 False
2 2018-04-23 02:00:00 2018-04-24 02:00:00 2018-04-16 12:00:00 False
2 2018-04-25 02:20:00 2018-04-26 02:20:00 2018-04-17 18:00:00 False
2 2018-04-27 02:40:00 2018-04-28 02:40:00 2018-04-19 00:00:00 False
2 2018-04-29 03:00:00 2018-04-30 03:00:00 2018-04-20 06:00:00 False
EDIT
I already 'solved' this problem with the following approach but am looking for a more pythonic and perhaps faster way...
boolean_result = []
for c in ts.index.unique():
    temp = ts.loc[ts.index == c]
    for row in temp.index:
        current_date = temp['moment'][row]
        boolean_result.append(max((temp['date1'] <= current_date)
                                  & (current_date <= temp['date2'])))
ts['Result'] = boolean_result
This may actually be very slow if your dataframe is too big, and there might be an optimal solution other than this one:
def time_in_range(start, end, x):
    """Return true if x is in the range [start, end]"""
    if start <= x and x <= end:
        return True
    else:
        return False

# empty lists to be appended to
result = []
test_list = []

for i in ts.index.unique():
    temp_df = ts[ts.index == i]
    for j in range(0, len(temp_df)):
        for k in range(0, len(temp_df)):
            test_list.append(time_in_range(temp_df.date1.iloc[k], temp_df.date2.iloc[k], temp_df.moment.iloc[j]))
        result.append(any(test_list))
        # reset the list
        test_list = []

ts['result'] = result
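A hedged, more vectorized sketch (assuming the rows of each ID are contiguous, as in the example): broadcast each group's moment column against all of its [date1, date2] windows at once with NumPy, then take any() per row.
import numpy as np

def in_any_window(g):
    m = g['moment'].to_numpy()[:, None]         # shape (n, 1): one moment per row
    lo = g['date1'].to_numpy()                  # shape (n,): all window starts in the group
    hi = g['date2'].to_numpy()                  # shape (n,): all window ends in the group
    return ((m >= lo) & (m <= hi)).any(axis=1)  # True if the moment falls in any window

# per-group results concatenate in the original row order because groups are contiguous
ts['result'] = np.concatenate([in_any_window(g) for _, g in ts.groupby(level=0, sort=False)])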

Pandas : merge on date and hour from datetime index

I have two data frames like the following: data frame A has datetimes down to the minute, data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A gets all of its rows filled on the basis of a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
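A merge-based variant in the same spirit (a hedged sketch, not tied to the index): floor both frames to the hour and merge on that single key instead of separate date and hour columns.
A2 = A.assign(hour_key=A['dataDate'].dt.floor('H'))
B2 = B.assign(hour_key=B['dataDate'].dt.floor('H'))[['hour_key', 'count']]
merge_df = A2.merge(B2, on='hour_key', how='left').drop(columns='hour_key')
This keeps A's minute-level dataDate and attaches count from the matching hour of B, with NaN where B has no row for that hour.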
