So I have a series of dates and I want to split it into chunks based on continuity. The series looks like the following:
2019-01-01 36.581647
2019-01-02 35.988585
2019-01-03 35.781111
2019-01-04 35.126273
2019-01-05 34.401451
2019-01-06 34.351714
2019-01-07 34.175517
2019-01-08 33.622116
2019-01-09 32.861861
2019-01-10 32.915251
2019-01-11 32.866832
2019-01-12 32.214259
2019-01-13 31.707626
2019-01-14 32.556175
2019-01-15 32.674965
2019-01-16 32.391766
2019-01-17 32.463836
2019-01-18 32.151290
2019-01-19 31.952946
2019-01-20 31.739855
2019-01-21 31.355354
2019-01-22 31.271243
2019-01-23 31.273255
2019-01-24 31.442803
2019-01-25 32.034161
2019-01-26 31.455956
2019-01-27 31.408881
2019-01-28 31.066477
2019-01-29 30.489070
2019-01-30 30.356210
2019-01-31 30.470496
2019-02-01 29.949312
2019-02-02 29.916971
2019-02-03 29.865447
2019-02-04 29.512595
2019-02-05 29.297967
2019-02-06 28.743329
2019-02-07 28.509800
2019-02-08 27.681294
2019-02-10 26.441899
2019-02-11 26.787360
2019-02-12 27.368621
2019-02-13 27.085167
2019-02-14 26.856398
2019-02-15 26.793370
2019-02-16 26.334788
2019-02-17 25.906381
2019-02-18 25.367705
2019-02-19 24.939880
2019-02-20 25.021575
2019-02-21 25.006527
2019-02-22 24.984512
2019-02-23 24.372664
2019-02-24 24.183728
2019-10-10 23.970567
2019-10-11 24.755944
2019-10-12 25.155136
2019-10-13 25.273033
2019-10-14 25.490775
2019-10-15 25.864637
2019-10-16 26.168158
2019-10-17 26.600422
2019-10-18 26.959990
2019-10-19 26.965104
2019-10-20 27.128877
2019-10-21 26.908657
2019-10-22 26.979930
2019-10-23 26.816817
2019-10-24 27.058753
2019-10-25 27.453882
2019-10-26 27.358057
2019-10-27 27.374445
2019-10-28 27.418648
2019-10-29 27.458521
2019-10-30 27.859687
2019-10-31 28.093942
2019-11-01 28.494706
2019-11-02 28.517255
2019-11-03 28.492476
2019-11-04 28.723757
2019-11-05 28.835151
2019-11-06 29.367227
2019-11-07 29.920598
2019-11-08 29.746370
2019-11-09 29.498023
2019-11-10 29.745044
2019-11-11 30.935084
2019-11-12 31.710737
2019-11-13 32.890792
2019-11-14 33.011911
2019-11-15 33.121803
2019-11-16 32.805403
2019-11-17 32.887447
2019-11-18 33.350492
2019-11-19 33.525344
2019-11-20 33.791458
2019-11-21 33.674697
2019-11-22 33.642584
2019-11-23 33.704386
2019-11-24 33.472346
2019-11-25 33.317035
2019-11-26 32.934307
2019-11-27 33.573193
2019-11-28 32.840514
2019-11-29 33.085686
2019-11-30 33.138131
2019-12-01 33.344264
2019-12-02 33.524948
2019-12-03 33.694687
2019-12-04 33.836534
2019-12-05 34.343416
2019-12-06 34.321793
2019-12-07 34.156796
2019-12-08 34.399591
2019-12-09 34.931185
2019-12-10 35.294034
2019-12-11 35.021331
2019-12-12 34.292483
2019-12-13 34.330898
2019-12-14 34.354278
2019-12-15 34.436500
2019-12-16 34.869841
2019-12-17 34.932567
2019-12-18 34.855816
2019-12-19 35.226241
2019-12-20 35.184222
2019-12-21 35.456716
2019-12-22 35.730350
2019-12-23 35.739911
2019-12-24 35.800030
2019-12-25 35.896615
2019-12-26 35.871280
2019-12-27 35.509646
2019-12-28 35.235416
2019-12-29 34.848605
2019-12-30 34.926700
2019-12-31 34.787211
And I want to split it like:
chunk,start,end,value
0,2019-01-01,2019-02-24,35.235416
1,2019-10-10,2019-12-31,34.787211
The values are arbitrary and can come from any aggregation function; I don't care about that. The important thing is the chunks I get, but I still cannot find a way to produce them.
I assume that your DataFrame:
has columns named Date and Amount,
Date column is of datetime type (not string).
To generate your result, define the following function, to be applied
to each group of rows:
def grpRes(grp):
    return pd.Series([grp.Date.min(), grp.Date.max(), grp.Amount.mean()],
                     index=['start', 'end', 'value'])
Then apply it to each group and rename the index:
res = df.groupby(df.Date.diff().dt.days.fillna(1, downcast='infer')
.gt(1).cumsum()).apply(grpRes)
res.index.name = 'chunk'
I noticed that your data sample has no row for 2019-02-09, yet you don't treat such a single missing day as a violation of the "continuity rule".
If you really want that behaviour, change gt(1) to e.g. gt(2).
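For reference, a sketch of that tolerant variant, reusing grpRes from above:
# a 2-day step (one missing day) stays in the same chunk; only longer gaps
# start a new one
grp = df.Date.diff().dt.days.fillna(1).gt(2).cumsum()
res = df.groupby(grp).apply(grpRes)
res.index.name = 'chunk'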
One way is boolean indexing, which assumes your data is already sorted. I also assumed your columns were named ['Date', 'Val'].
# reset the index so you have a DataFrame
data = s.reset_index()
# boolean indexing: rows where the next date is not exactly 1 day ahead (chunk ends)
end = data[((data['Date'] - data['Date'].shift(-1)).dt.days.abs() != 1)].reset_index(drop=True).rename(columns={'Date': 'End', 'Val': 'End_val'})
# boolean indexing: rows where the previous date is not exactly 1 day behind (chunk starts)
start = data[(data['Date'] - data['Date'].shift()).dt.days != 1].reset_index(drop=True).rename(columns={'Date': 'Start', 'Val': 'Start_val'})
# concat your data
pd.concat([start, end], axis=1)
Start Start_val End End_val
0 2019-01-01 36.581647 2019-02-08 27.681294
1 2019-02-10 26.441899 2019-02-24 24.183728
2 2019-10-10 23.970567 2019-12-31 34.787211
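If you also need the aggregated value column from the question, the same gap test can serve as a group key; a hedged sketch, assuming the strict one-day rule and a mean aggregate:
# reuse the gap test as a group key and aggregate per chunk
grp = (data['Date'].diff().dt.days != 1).cumsum()
chunks = (data.groupby(grp)
              .agg(start=('Date', 'min'), end=('Date', 'max'),
                   value=('Val', 'mean'))
              .reset_index(drop=True)
              .rename_axis('chunk'))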
So I'm taking a pandas and NumPy course and ran into a problem: the course instructor performed the solution and it worked, but I followed every step and it didn't work for me.
Pardon the length; I included the actual datasets for clarity.
I assigned the following items in a list to the variable "dates" as instructed, see below:
dates = [
"2016-12-22",
"2017-05-03",
"2017-01-06",
"2017-03-05",
"2017-02-12",
"2017-03-21",
"2017-04-14",
"2017-04-15",
]
Then I have a series I'm working against, named oil_series, with the following data ("date" is the index name):
date
2016-12-20
2016-12-21
2016-12-22
2016-12-23
2016-12-27
2016-12-28
2016-12-29
2016-12-30
2017-01-03
2017-01-04
2017-01-05
2017-01-06
2017-01-09
2017-01-10
2017-01-11
2017-01-12
2017-01-13
2017-01-17
2017-01-18
2017-01-19
2017-01-20
2017-01-23
2017-01-24
2017-01-25
2017-01-26
2017-01-27
2017-01-30
2017-01-31
2017-02-01
2017-02-02
2017-02-03
2017-02-06
2017-02-07
2017-02-08
2017-02-09
2017-02-10
2017-02-13
2017-02-14
2017-02-15
2017-02-16
2017-02-17
2017-02-21
2017-02-22
2017-02-23
2017-02-24
2017-02-27
2017-02-28
2017-03-01
2017-03-02
2017-03-03
2017-03-06
2017-03-07
2017-03-08
2017-03-09
2017-03-10
2017-03-13
2017-03-14
2017-03-15
2017-03-16
2017-03-17
2017-03-20
2017-03-21
2017-03-22
2017-03-23
2017-03-24
2017-03-27
2017-03-28
2017-03-29
2017-03-30
2017-03-31
2017-04-03
2017-04-04
2017-04-05
2017-04-06
2017-04-07
2017-04-10
2017-04-11
2017-04-12
2017-04-13
2017-04-17
2017-04-18
2017-04-19
2017-04-20
2017-04-21
2017-04-24
2017-04-25
2017-04-26
2017-04-27
2017-04-28
2017-05-01
2017-05-02
2017-05-03
2017-05-04
2017-05-05
2017-05-08
2017-05-09
2017-05-10
2017-05-11
2017-05-12
2017-05-15
Values
52.22
51.44
51.98
52.01
52.82
54.01
53.8
53.75
52.36
53.26
53.77
53.98
51.95
50.82
52.19
53.01
52.36
52.45
51.12
51.39
52.33
52.77
52.38
52.14
53.24
53.18
52.63
52.75
53.9
53.55
53.81
53.01
52.19
52.37
52.99
53.84
52.96
53.21
53.11
53.41
53.41
54.02
53.61
54.48
53.99
54.04
54
53.82
52.63
53.33
53.19
52.68
49.83
48.75
48.05
47.95
47.24
48.34
48.3
48.34
47.79
47.02
47.29
47
47.3
47.02
48.36
49.47
50.3
50.54
50.25
50.99
51.14
51.69
52.25
53.06
53.38
53.12
53.19
52.62
52.46
50.49
50.26
49.64
48.9
49.22
49.22
48.96
49.31
48.83
47.65
47.79
45.55
46.23
46.46
45.84
47.28
47.81
47.83
48.86
So when I write the following code to filter oil_series, whose index holds the dates, against the dates list I created (see below):
mask = (oil_series.index.isin(dates)) & (oil_series <= 50)
oil_series.loc[mask]
the following error occurs (screenshot of the error omitted).
Please help me understand the problem.
According to your comment, you have a MultiIndex with only one level. Convert it to a regular Index and your code should work:
oil_series.index = oil_series.index.get_level_values('date')
# Your code here.
mask = (oil_series.index.isin(dates)) & (oil_series <= 50)
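A minimal sketch of the failure mode and the fix, with made-up values (your real numbers will differ):
import pandas as pd

idx = pd.MultiIndex.from_arrays([["2016-12-22", "2017-01-06"]], names=["date"])
oil_series = pd.Series([51.98, 53.98], index=idx)

# MultiIndex.isin expects tuples, so matching plain strings fails
# (all-False matches or a TypeError, depending on the pandas version).
# Flattening to a regular Index restores string matching:
oil_series.index = oil_series.index.get_level_values("date")
mask = oil_series.index.isin(["2016-12-22"]) & (oil_series <= 52)
print(oil_series.loc[mask])  # 2016-12-22    51.98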
I have a pandas dataframe with over 100 timestamps that defines the non-working-time of a machine:
>>> off_time
date (index) start end
2020-07-04 18:00:00 23:50:00
2020-08-24 00:00:00 08:00:00
2020-08-24 14:00:00 16:00:00
2020-09-04 00:00:00 23:59:59
2020-10-05 18:00:00 22:00:00
I also have a second dataframe (called data) with over 1000 timestamps defining the duration of some processes:
>>> data
process-name start-time end-time duration
name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 day 14:00:00
name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 14:00:00
name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 12:00:00
name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 02:50:00
name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 day 09:00:00
In order to get the effective working time for each process in data, I now have to subtract the non-working time from the duration. For example, I have to subtract the time between 18:00 and 20:00 for the process name5, since this time is planned as non-working time.
I wrote code with many if-else conditions, which I see as a potential source of errors! Is there a clean way to calculate the effective time without using too many if-else branches? Any help would be greatly appreciated.
Set up sample data (I added a couple of rows to your samples to include some edge cases):
######### OFF TIMES
off = pd.DataFrame([
["2020-07-04", dt.time(18), dt.time(23,50)],
["2020-08-24", dt.time(0), dt.time(8)],
["2020-08-24", dt.time(14), dt.time(16)],
["2020-09-04", dt.time(0), dt.time(23,59,59)],
["2020-10-04", dt.time(15), dt.time(18)],
["2020-10-05", dt.time(18), dt.time(22)]], columns= ["date", "start", "end"])
off["date"] = pd.to_datetime(off["date"])
off = off.set_index("date")
### Convert start and end times to datetimes in UTC timezone, since that is much
### easier to handle and fits the other data
off["start"] = pd.to_datetime(off.index.astype("string") + " " + off.start.astype("string")+"+00:00")
off["end"] = pd.to_datetime(off.index.astype("string") + " " + off.end.astype("string")+"+00:00")
off
>>
start end
date
2020-07-04 2020-07-04 18:00:00+00:00 2020-07-04 23:50:00+00:00
2020-08-24 2020-08-24 00:00:00+00:00 2020-08-24 08:00:00+00:00
2020-08-24 2020-08-24 14:00:00+00:00 2020-08-24 16:00:00+00:00
2020-09-04 2020-09-04 00:00:00+00:00 2020-09-04 23:59:59+00:00
2020-10-04 2020-10-04 15:00:00+00:00 2020-10-04 18:00:00+00:00
2020-10-05 2020-10-05 18:00:00+00:00 2020-10-05 22:00:00+00:00
######### PROCESS TIMES
data = pd.DataFrame([
["name1","2020-07-17 08:00:00+00:00","2020-07-18 22:00:00+00:00"],
["name2","2020-08-24 01:00:00+00:00","2020-08-24 12:00:00+00:00"],
["name3","2020-09-20 07:00:00+00:00","2020-09-20 19:00:00+00:00"],
["name4","2020-09-04 16:00:00+00:00","2020-09-04 18:50:00+00:00"],
["name5","2020-10-04 11:00:00+00:00","2020-10-05 20:00:00+00:00"],
["name6","2020-09-03 10:00:00+00:00","2020-09-06 05:00:00+00:00"]
], columns = ["process", "start", "end"])
data["start"] = pd.to_datetime(data["start"])
data["end"] = pd.to_datetime(data["end"])
data["duration"] = data.end -data.start
data
>>
process start end duration
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00
As you can see, I added a row to off on 2020-10-04, so that name5 has 2 off times, which could happen in your data and would need to be handled correctly. (This means that, in the example in your question, you need to subtract 5 hours instead of 2.)
I also added the process name6 which is multiple days long.
This is my solution, which will be applied to each row in data
def get_relevant_off(pr):
    relevant = off[off.end.gt(pr["start"]) & off.start.lt(pr["end"])].copy()
    if not relevant.empty:
        relevant.loc[relevant["start"].lt(pr["start"]), "start"] = pr["start"]
        relevant.loc[relevant["end"].gt(pr["end"]), "end"] = pr["end"]
        to_subtract = (relevant.end - relevant.start).sum()
        return pr["duration"] - to_subtract
    else:
        return pr.duration
Explanation:
The first line in the function subsets the relevant rows of off, based on the row pr.
Replace off starts that are earlier than the process start with the process start, and do the same with the ends, since we don't want to sum the whole off time, only the part that actually overlaps the process.
Get the duration of the off times by subtracting off starts from off ends, and sum those.
Then subtract that sum from the total duration.
data["effective"] = data.apply(get_relevant_off, axis= 1)
data
>>
process start end duration effective
0 name1 2020-07-17 08:00:00+00:00 2020-07-18 22:00:00+00:00 1 days 14:00:00 1 days 14:00:00
1 name2 2020-08-24 01:00:00+00:00 2020-08-24 12:00:00+00:00 0 days 11:00:00 0 days 04:00:00
2 name3 2020-09-20 07:00:00+00:00 2020-09-20 19:00:00+00:00 0 days 12:00:00 0 days 12:00:00
3 name4 2020-09-04 16:00:00+00:00 2020-09-04 18:50:00+00:00 0 days 02:50:00 0 days 00:00:00
4 name5 2020-10-04 11:00:00+00:00 2020-10-05 20:00:00+00:00 1 days 09:00:00 1 days 04:00:00
5 name6 2020-09-03 10:00:00+00:00 2020-09-06 05:00:00+00:00 2 days 19:00:00 1 days 19:00:01
Caveat: I am assuming that off times never overlap. Also, I liked this problem, but don't have any more time to spend on testing this, so let me know if I overlooked some edge cases that break it and I will try to find the time to fix it.
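If your off times can overlap, one option is to merge them into disjoint intervals first; a hedged sketch, assuming off has the tz-aware start/end columns built above:
# merge overlapping off intervals so no span is subtracted twice
off_sorted = off.sort_values("start")
merged = []
for s, e in zip(off_sorted["start"], off_sorted["end"]):
    if merged and s <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], e)  # extend the current interval
    else:
        merged.append([s, e])
off = pd.DataFrame(merged, columns=["start", "end"])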
In the terminal, I have pd.options.display.max_rows set to 60. But for a series that goes over 60 rows, the display is truncated down to show only 10 rows. How do I increase the number of rows shown in the truncated display?
For example, the following (which is within the max_rows setting) shows 60 rows of data:
s = pd.date_range('2019-01-01', '2019-06-01').to_series()
s[:60]
But if I ask for 61 rows, it gets severely truncated:
In [44]: s[:61]
Out[44]:
2019-01-01 2019-01-01
2019-01-02 2019-01-02
2019-01-03 2019-01-03
2019-01-04 2019-01-04
2019-01-05 2019-01-05
...
2019-02-26 2019-02-26
2019-02-27 2019-02-27
2019-02-28 2019-02-28
2019-03-01 2019-03-01
2019-03-02 2019-03-02
Freq: D, Length: 61, dtype: datetime64[ns]
How can I set it so that I see, for example, 20 rows, every time it goes beyond the max_rows limit?
From the docs, you can use pd.options.display.min_rows.
Once the display.max_rows is exceeded, the display.min_rows options determines how many rows are shown in the truncated repr.
Example:
>>> pd.set_option('max_rows', 59)
>>> pd.set_option('min_rows', 20)
>>> s = pd.date_range('2019-01-01', '2019-06-01').to_series()
>>> s[:60]
2019-01-01 2019-01-01
2019-01-02 2019-01-02
2019-01-03 2019-01-03
2019-01-04 2019-01-04
2019-01-05 2019-01-05
2019-01-06 2019-01-06
2019-01-07 2019-01-07
2019-01-08 2019-01-08
2019-01-09 2019-01-09
2019-01-10 2019-01-10
...
2019-02-20 2019-02-20
2019-02-21 2019-02-21
2019-02-22 2019-02-22
2019-02-23 2019-02-23
2019-02-24 2019-02-24
2019-02-25 2019-02-25
2019-02-26 2019-02-26
2019-02-27 2019-02-27
2019-02-28 2019-02-28
2019-03-01 2019-03-01
Freq: D, Length: 60, dtype: datetime64[ns]
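If you only want the wider display for a single statement, pd.option_context scopes the settings to a block; a small usage sketch:
with pd.option_context('display.max_rows', 59, 'display.min_rows', 20):
    print(s[:61])  # the options revert automatically after the block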
I have a dataframe which has two date columns:
Date1 Date2
2018-10-02 2018-12-21
2019-01-20 2019-04-30
and so on
I want to create a third column which basically is a column containing all the months between the two dates, something like this:
Date1 Date2 months
2018-10-02 2018-12-21 201810
2018-10-02 2018-12-21 201811
2018-10-02 2018-12-21 201812
2019-01-20 2019-04-30 201901
2019-01-20 2019-04-30 201902
2019-01-20 2019-04-30 201903
2019-01-20 2019-04-30 201904
How can I do this? I tried using this formula:
df['months']=df.apply(lambda x: pd.date_range(x.Date1,x.Date2, freq='MS').strftime("%Y%m"))
but I am not getting the desired result. Kindly help. Thanks.
Using merge
final = df.merge(df.apply(lambda s: pd.date_range(s.Date1, s.Date2, freq='30D'), 1)\
.explode()\
.rename('Months')\
.dt.strftime('%Y%m'),
left_index=True,
right_index=True)
Months Date1 Date2
0 201810 2018-10-02 2018-12-21
0 201811 2018-10-02 2018-12-21
0 201812 2018-10-02 2018-12-21
1 201901 2019-01-20 2019-04-30
1 201902 2019-01-20 2019-04-30
1 201903 2019-01-20 2019-04-30
1 201904 2019-01-20 2019-04-30
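Note that freq='30D' only approximates calendar months and can drift over longer spans. For reference, the attempt in the question was close: DataFrame.apply defaults to axis=0 (column-wise), so it needs axis=1, and pd.period_range includes the months containing both endpoints. A hedged sketch:
months = (df.apply(lambda x: pd.period_range(x.Date1, x.Date2, freq='M')
                               .strftime('%Y%m').tolist(), axis=1)
            .explode()
            .rename('months'))
final = df.join(months)  # duplicates each row once per month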
Using groupby with pd.date_range and join.
Notice: I used replace(day=1) to be sure we catch every month.
months = df.groupby(level=0).apply(lambda x: pd.date_range(x['Date1'].iat[0].replace(day=1),
x['Date2'].iat[0],
freq='MS')).explode().to_frame(name='Months')
df2 = months.join(df).reset_index(drop=True)
Output
Months Date1 Date2
0 2018-10-01 2018-10-02 2018-12-21
1 2018-11-01 2018-10-02 2018-12-21
2 2018-12-01 2018-10-02 2018-12-21
3 2019-01-01 2019-01-20 2019-04-30
4 2019-02-01 2019-01-20 2019-04-30
5 2019-03-01 2019-01-20 2019-04-30
6 2019-04-01 2019-01-20 2019-04-30
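If you want the YYYYMM strings from the question instead of month-start timestamps, a hedged follow-up sketch (the exploded column is object dtype, hence the to_datetime):
# format the month starts as YYYYMM strings before joining
months['Months'] = pd.to_datetime(months['Months']).dt.strftime('%Y%m')
df2 = months.join(df).reset_index(drop=True)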
You can melt the data, resample, and merge back:
df.merge(df.reset_index() # reset index as column
.melt(id_vars='index', value_name='months') # usual melt
.resample('M', on='months') # resample to get the month list
.first().ffill() # interpolate the index
.drop(['variable', 'months'], axis=1) # remove unnecessary columns
.reset_index(), # make months a column
left_index=True,
right_on='index'
)
Output:
Date1 Date2 months index
0 2018-10-02 2018-12-21 2018-10-31 0.0
1 2018-10-02 2018-12-21 2018-11-30 0.0
2 2018-10-02 2018-12-21 2018-12-31 0.0
3 2019-01-20 2019-04-30 2019-01-31 1.0
4 2019-01-20 2019-04-30 2019-02-28 1.0
5 2019-01-20 2019-04-30 2019-03-31 1.0
6 2019-01-20 2019-04-30 2019-04-30 1.0
I have two data frames like the following: data frame A has datetimes down to the minute, data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I'd like to merge the two on the basis of date and hour, so that dataframe A has all its rows filled on the basis of a merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A,B, how='left', left_on=['date', 'hour'],
right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas' time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set any minutes and seconds to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
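An alternative, hedged sketch if you prefer a regular merge over map/join, assuming the original A and B frames:
# merge on a single floored-hour key instead of separate date and hour columns
A['hour'] = A['dataDate'].dt.floor('H')
B['hour'] = B['dataDate'].dt.floor('H')
merged = A.merge(B, on='hour', how='left', suffixes=('_A', '_B'))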