This is a bit complex, and I greatly appreciate any help! I am trying to randomly sample rows from a .csv file. Essentially, I want a resulting file of unique locations (locations are specified by the Easting and Northing columns of the data file, below). I want to randomly pull 1 location per 12-hour period per SessionDate in this file (12-hour periods divided into: between 0631 and 1829 hours, and between 1830 and 0630 hours; given as Start: and End: in the data file, below); BUT if any 2 locations are within 6 hours of each other (based on their Start: time), that location should be tossed and a new location randomly drawn, with the sampling continuing until no new locations can be drawn (i.e., sampling WITHOUT replacement). I have been trying to do this with Python, but my experience is very limited. I first tried putting each row into a dictionary, and more recently each row into a list, as follows:
import random
import csv

# read the file with the csv module (it handles quoting and newlines),
# collecting each data row as a list of column values
rows = []
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)   # skip the column names
    for row in reader:
        rows.append(row)
I'm unsure where to go from here - how to sample from these lists the way I need to, then write them to an output file with my 'unique' locations.
Here are the first few lines of my data file:
SessionDate Start: End: Easting Northing
27-Apr-07 18:00 21:45 174739 9785206
28-Apr-07 18:00 21:30 171984 9784738
28-Apr-07 18:00 21:30 171984 9784738
28-Apr-07 18:00 21:30 171984 9784738
28-Apr-07 18:00 21:30 171984 9784738
It gets a bit complicated as some of the observations span midnight, so they may be on different dates, but can be within 6 hours of each other (which is why I have this criterion), for example:
SessionDate Start: End: Easting Northing
27-Apr-07 22:30 23:25 171984 9784738
28-Apr-07 0:25 1:30 174739 9785206
Here's my solution - I made a few changes to your data (the locations, to make it easier to eyeball the results). I basically create a dict of dates pointing to another dict of locations, which points to a list of selected rows.
data = """SessionDate Start: End: Easting Northing
27-Apr-07 18:00 21:45 A 1
27-Apr-07 18:00 21:30 G 2
28-Apr-07 18:00 21:30 B 2
28-Apr-07 18:00 21:30 B 2
28-Apr-07 18:00 21:30 B 2
29-Apr-07 8:00 11:30 C 3
29-Apr-07 20:00 21:30 C 3
29-Apr-07 20:00 21:30 C 3
30-Apr-07 8:00 10:30 D 4
30-Apr-07 16:00 17:30 E 5
30-Apr-07 14:00 21:30 F 6
30-Apr-07 18:00 21:30 F 6
"""
selected = {}
for line in data.split("\n"):
if "Session" in line:
continue
if not line:
continue
tmp = [x for x in line.split() if x]
raw_dt = " ".join([tmp[0], tmp[1]]).strip()
curr_dt = datetime.strptime(raw_dt, "%d-%b-%y %H:%M")
loc = (tmp[-2], tmp[-1])
found = False
for dt in selected:
diff = dt - curr_dt
if dt < curr_dt:
diff = curr_dt - dt
# print dt, curr_dt, diff, diff <= timedelta(hours=12), loc, loc in selected[dt]
if diff <= timedelta(hours=12):
if loc not in selected[dt]:
selected[dt].setdefault(loc, []).append(tmp)
found = True
else:
found = True
if not found:
if curr_dt not in selected:
selected[curr_dt] = {}
if loc not in selected[curr_dt]:
selected[curr_dt][loc] = [tmp,]
# if output needs to be sorted
rows = sorted(x for k in selected for l in selected[k] for x in selected[k][l])
for row in rows:
print " ".join(row)
This is not a complete answer, but something to point you in the right direction.
As I said in a comment, handling datetime objects in python is done with the datetime module. Here's a little example related to your problem:
from datetime import datetime
d1 = datetime.strptime("27-Apr-07 18:00", "%d-%b-%y %H:%M")
d2 = datetime.strptime("28-Apr-07 01:00", "%d-%b-%y %H:%M")
difference = d2 - d1
#Difference in hours
dH = difference.days*24 + difference.seconds/3600
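# the same value in one call: dH = difference.total_seconds() / 3600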
Other than that, simply loop through the sorted file: after reading a whole 12-hour block, sample randomly, make sure your uniqueness condition is met (if not, repeat), and move on.
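Here is a minimal sketch of that approach, building on the rows list read above (block_key, the 6-hour rejection test, and the output file name are my own illustrative choices, not from the original post):

import csv
import random
from datetime import datetime, timedelta

def block_key(dt):
    # assign a datetime to its 12-hour block: the "day" block runs
    # 06:31-18:29; the "night" block runs 18:30-06:30 and is anchored
    # to the date of its evening half
    minutes = dt.hour * 60 + dt.minute
    if 391 <= minutes <= 1109:                # 06:31 .. 18:29
        return (dt.date(), 'day')
    if minutes > 1109:                        # 18:30 .. 23:59
        return (dt.date(), 'night')
    return ((dt - timedelta(days=1)).date(), 'night')   # 00:00 .. 06:30

blocks = {}
for row in rows:                              # 'rows' as read from file.csv above
    dt = datetime.strptime(row[0] + ' ' + row[1], '%d-%b-%y %H:%M')
    blocks.setdefault(block_key(dt), []).append((dt, row))

chosen = []
for key in sorted(blocks):
    candidates = blocks[key][:]
    random.shuffle(candidates)                # draw without replacement
    for dt, row in candidates:
        # toss any draw within 6 hours of an already-chosen location
        if all(abs(dt - prev) > timedelta(hours=6) for prev, _ in chosen):
            chosen.append((dt, row))
            break

with open('unique_locations.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(r for _, r in chosen)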
Related
I am trying to compare or merge two different data sets and I am using pandas for that.
The challenge that I am facing is that data is spread across rows in the first data set (Data1) and the other data set (Data2) has the same data spread across columns, below are the screenshots.
Screenshot 1 shows Data1 and Screenshot 2 shows Data2 (the screenshots, and the Excel workbook attached to the original post, are not reproduced here).
What I am trying to do is convert one of them to another format to match the dataset and perform the merge.
Note: Transpose is not helping me, since I need to do it for each department and transpose does put everything either in rows or columns including department, whereas I only want to transpose weekly data.
What is the best way to achieve this in Python?
One option to transform the second dataframe is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df = pd.read_excel('Test_Data_Set.xlsx', sheet_name=None)
df1 = df['Data1']
df2 = df['Data2']
df3 = df2.pivot_longer(index=['code', 'name'], names_to='day_of_week', names_pattern=r'(.+)\s.+')
df1.merge(df3, on=['code', 'name', 'day_of_week'])
code name day_of_week start_time end_time value
0 test2 Test_Department2 Monday 900 1900 08:00 - 20:00
1 test2 Test_Department2 Tuesday 900 1900 08:00 - 20:00
2 test2 Test_Department2 Wednesday 900 1900 08:00 - 20:00
3 test2 Test_Department2 Thursday 900 1900 08:00 - 20:00
4 test2 Test_Department2 Friday 900 1900 10:00 - 19:00
5 test2 Test_Department2 Saturday 900 1900 10:00 - 19:00
6 test2 Test_Department2 Sunday 900 1900 12:00 - 17:00
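If you would rather stay within pandas itself, a roughly equivalent reshape can be sketched with melt; this assumes Data2's weekly columns carry the day name before a space (e.g. 'Monday ...'), mirroring the names_pattern above:

melted = df2.melt(id_vars=['code', 'name'], var_name='day_of_week')
# keep only the day name, mirroring names_pattern=r'(.+)\s.+'
melted['day_of_week'] = melted['day_of_week'].str.extract(r'(.+)\s.+', expand=False)
df1.merge(melted, on=['code', 'name', 'day_of_week'])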
I am reading some data from a CSV file where two of the columns hold times in hh:mm format. Here is an example:
Start End
11:15 15:00
22:30 2:00
In the above example, the End in the 2nd row happens on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there a good pythonic way of doing this? Also, since there is no date and some Ends happen on the next day, I get wrong results when I calculate the diff:
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End']) - pd.to_datetime(df['Start'])
0      0 days 03:45:00
1   -1 days +03:30:00
dtype: timedelta64[ns]
You can use the technique (a + x) % x with a timedelta x of 24 hours (or 1 day, which is the same):
the + timedelta(hours=24) makes all the values positive
the % timedelta(hours=24) wraps the values of 24 hours or more back down by 24 hours
from datetime import timedelta

df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start']) + timedelta(hours=24)) \
                 % timedelta(hours=24)
Gives
Start End duration
0 11:15 15:00 0 days 03:45:00
1 22:30 2:00 0 days 03:30:00
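The same arithmetic can be checked in isolation on plain timedelta objects, which makes the wrap-around easy to see:

from datetime import timedelta

x = timedelta(hours=24)
start = timedelta(hours=22, minutes=30)   # 22:30
end = timedelta(hours=2)                  # 2:00 on the next day
print((end - start + x) % x)              # prints 3:30:00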
I have a df with DateTimeIndex (hourly readings) and light intensity.
Time Light
1/2/2017 18:00 31
1/2/2017 19:00 -5
1/2/2017 20:00 NA
......
......
2/2/2017 05:00 NA
2/2/2017 06:00 20
The issue is that between sunset (6 pm) and sunrise (6 am) the sensor doesn't work and produces bad readings. I would like to set any readings in this period to 0.
You can create a mask with these conditions and set the value based on it.
hours = df.index.to_series().dt.hour   # extract the hour of each reading
mask = (hours >= 6) & (hours <= 18)    # keep daytime readings (6 am-6 pm inclusive,
                                       # matching the valid 18:00 and 06:00 rows above)
df.loc[~mask, 'Light'] = 0
You need to convert the DatetimeIndex to a Series to access the datetime methods through the .dt accessor.
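A DatetimeIndex also exposes the hour attribute directly, so the same masking can be written without the conversion; a minimal equivalent sketch:

df.loc[(df.index.hour < 6) | (df.index.hour > 18), 'Light'] = 0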
I am looking at shift data of a factory that works 24 hours a day. I want to group the data at each shift change, which is at 6:00 and 18:00. Up till now I have been trying to do it with:
Data_Frame.groupby([pd.Grouper(freq='12H')]).count()
However, I have realised that since freq is set to 12H, it will always take a period of exactly 12 hours, even across daylight-saving changes.
Unfortunately the shift change is always at 6:00 and 18:00, even when the clocks change. That means in reality there is one shift in the year that is 11 hours long and another that is 13 hours long, so for the middle of the year the grouping is off by 1 hour.
I feel that this is such a fundamental thing (daylight savings) that there should be some way of telling pandas that it needs to take account of daylight savings.
I have tried changing it from UTC to Europe/London, however it still takes 12-hour periods.
Many Thanks
edit:
The only way I have found to do this is to split my data into 3 parts before using groupby (before the first clock change, during the clock change, and after the second clock change), use groupby on each individually, then put them back together. But this is irritating and tedious, so anything better than this is hugely appreciated.
Hourly and 10-minute time-zone-aware time series spanning the spring DST change:
import numpy as np
import pandas as pd

ts_hrly = pd.date_range('03-10-2018', '3-13-2018', freq='H', tz='US/Eastern')
ts_10m = pd.date_range('03-10-2018', '3-13-2018', freq='10T', tz='US/Eastern')
Use the hourly data
ts = ts_hrly
df = pd.DataFrame({'tstamp':ts,'period':range(len(ts))})
The dst transition looks like this:
>>> df[18:23]
period tstamp
18 18 2018-03-11 00:00:00-05:00
19 19 2018-03-11 01:00:00-05:00
20 20 2018-03-11 03:00:00-04:00
21 21 2018-03-11 04:00:00-04:00
22 22 2018-03-11 05:00:00-04:00
>>>
To group into twelve-hour increments on the 06:00 and 18:00 boundaries, I assigned each observation a shift number and then grouped by that number.
My data conveniently starts at a shift change, so calculate the elapsed time since that first shift change:
nanosec = df['tstamp'].values - df['tstamp'].iloc[0].value   # avoids relying on column order
Find the shift changes and use np.cumsum() to assign shift numbers
shift_change = nanosec.astype(np.int64) % (3600 * 1e9 * 12) == 0
df['shift_nbr'] = shift_change.cumsum()
gb = df.groupby(df['shift_nbr'])
for k, g in gb:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 12 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
I haven't found a way to compensate for data starting in the middle of a shift.
If you want the groups for shifts affected by DST changes to have 11 or 13 items, change the timezone-aware series to a timezone-naive series:
df2 = pd.DataFrame({'tstamp':pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')),'period':range(len(ts))})
Use the same process to assign and group by shift numbers
nanosec = df2['tstamp'].values - df2['tstamp'].iloc[0].value
shift_change = nanosec.astype(np.int64) % (3600 * 1e9 * 12) == 0
df2['shift_nbr'] = shift_change.cumsum()
gb2 = df2.groupby(df2['shift_nbr'])
for k, g in gb2:
    print(f'{k} has {len(g)} items')
>>>
1 has 12 items
2 has 11 items
3 has 12 items
4 has 12 items
5 has 12 items
6 has 12 items
7 has 1 items
Unfortunately, pd.to_datetime(ts.strftime('%m-%d-%y %H:%M')) takes some time. Here is a faster/better way, using the hour attribute of the timestamps to calculate elapsed hours - there is no need to create a separate timezone-naive series, because the hour attribute reflects local wall-clock time. It also works for data starting in the middle of a shift.
ts = pd.date_range('01-01-2018 03:00', '01-01-2019 06:00', freq='H', tz='US/Eastern')
df3 = pd.DataFrame({'tstamp': ts, 'period': range(len(ts))})
shift_change = ((df3['tstamp'].dt.hour - 6) % 12) == 0
shift_nbr = shift_change.cumsum()
gb3 = df3.groupby(shift_nbr)
for k, g in gb3:
    if len(g) != 12:
        print(f"shift starting {g['tstamp'].iloc[0]} has {len(g)} items")
>>>
shift starting 2018-01-01 03:00:00-05:00 has 3 items
shift starting 2018-03-10 18:00:00-05:00 has 11 items
shift starting 2018-11-03 18:00:00-04:00 has 13 items
shift starting 2019-01-01 06:00:00-05:00 has 1 items
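As a footnote, newer pandas can express this wall-clock bucketing directly; a sketch assuming pandas >= 1.1 (where pd.Grouper gained the offset argument):

# strip the timezone so grouping follows wall-clock time, then bin into
# 12-hour periods anchored on the 06:00/18:00 shift boundaries
df4 = df3.assign(wall=df3['tstamp'].dt.tz_localize(None))
gb4 = df4.groupby(pd.Grouper(key='wall', freq='12H', offset='6H'))
for k, g in gb4:
    if len(g) != 12:
        print(f'shift starting {k} has {len(g)} items')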
I have a DataFrame containing events like this:
location start_time end_time some_value1 some_value2
LECP 00:00 01:30 25 nice info
LECP 02:00 04:00 10 other info
LECS 02:00 03:00 5 lorem
LIPM 02:55 03:15 9 ipsum
and I want to split the rows so that I get maximum intervals of 1 hour, e.g. if an event has a duration of 01:30, I want to get a row of length 01:00 and another of 00:30. If an event has a length of 02:30, I want to get three rows. And if an event has a duration of an hour or less, it should just remain one row. Like so:
location start_time end_time some_value1 some_value2
LECP 00:00 01:00 25 nice info
LECP 01:00 01:30 25 nice info
LECP 02:00 03:00 10 other info
LECP 03:00 04:00 10 other info
LECS 02:00 03:00 5 lorem
LIPM 02:55 03:15 9 ipsum
It does not matter whether the remainder is at the beginning or the end. It would not even matter if the duration were distributed equally across the rows, as long as no row has a duration of more than 1 hour.
What I tried:
- reading through Time Series / Date functionality and not understanding anything
- searching StackOverflow.
I adapted this answer to implement hourly rather than daily splits. This code works in a while loop, so it keeps iterating as long as there are rows with durations still greater than 1 hour.
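The loop below assumes dfob already has datetime start and end columns and a precomputed duration; a hypothetical setup matching the sample table (the column names are my assumption) might look like:

import pandas as pd

dfob = pd.DataFrame({
    'location': ['LECP', 'LECP', 'LECS', 'LIPM'],
    'start': pd.to_datetime(['00:00', '02:00', '02:00', '02:55']),
    'end': pd.to_datetime(['01:30', '04:00', '03:00', '03:15']),
    'some_value1': [25, 10, 5, 9],
    'some_value2': ['nice info', 'other info', 'lorem', 'ipsum'],
})
dfob['duration'] = dfob['end'] - dfob['start']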
mytimedelta = pd.Timedelta('1 hour')

# create boolean mask of rows that are too long
split_rows = (dfob['duration'] > mytimedelta)

while split_rows.any():
    # get new rows to append and push their start time 1 hour later
    new_rows = dfob[split_rows].copy()
    new_rows['start'] = new_rows['start'] + mytimedelta
    # update the end time of the old rows
    dfob.loc[split_rows, 'end'] = dfob.loc[split_rows, 'start'] + \
        pd.DateOffset(hours=1, seconds=-1)
    # DataFrame.append is deprecated; use pd.concat instead
    dfob = pd.concat([dfob, new_rows])
    # update the duration of all rows
    dfob['duration'] = dfob['end'] - dfob['start']
    # create an updated boolean mask
    split_rows = (dfob['duration'] > mytimedelta)

# when the job is done:
dfob = dfob.sort_index().reset_index(drop=True)
dfob['duration'] = dfob['end'] - dfob['start']
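Note the pd.DateOffset(hours=1, seconds=-1): it ends each split-off piece one second before the next piece starts, so the pieces never overlap. If you want touching boundaries like the desired output above (00:00-01:00 followed by 01:00-01:30), drop the seconds=-1 so each piece ends exactly where the next one begins.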