"Parallel" indexing in pandas (not hierarchical) - python

Short version: I have two TimeSeries (recording start and recording end) I would like to use as indices for data in a Panel (or DataFrame). Not hierarchical, but parallel. I am uncertain how to do this.
Long version:
I am constructing a pandas Panel with some data akin to temperature and density at certain distances from an antenna. As I see it, the most natural structure is having e.g. temp and dens as items (i.e. sub-DataFrames of the Panel), recording time as major axis (index), and thus distance from the antenna as minor axis (columns).
My problem is this: For each recording, the instrument averages/integrates over some amount of time. Thus, for each data dump, two timestamps are saved: start recording and end recording. I need both of those. Thus, I would need something which might be called "parallel indexing", where two different TimeSeries (startRec and endRec) work as indices, and I can get whichever I prefer for a certain data point. Of course, I don't really need to index by both, but both need to be naturally available in the data structure. For example, for any given temperature or density recording, I need to be able to get both the start and end time of the recording.
I could of course keep the two TimeSeries in a separate DataFrame, but with the main point of pandas being automatic data alignment, this is not really ideal.
How can I best achieve this?
Example data
Sample Panel with three recordings at two distances from the antenna:
import pandas as pd
import numpy as np
data = pd.Panel(data={'temp': np.array([[21, 20],
[19, 17],
[15, 14]]),
'dens': np.array([[1001, 1002],
[1000, 998],
[997, 995]])},
minor_axis=['1m', '3m'])
Output of data:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: dens to temp
Major_axis axis: 0 to 2
Minor_axis axis: 1m to 3m
Here, the major axis is currently only an integer-based index (0 to 2). The minor axis is the two measurement distances from the antenna.
I have two TimeSeries I'd like to use as indices:
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 00),
datetime(2013, 11, 12, 15, 00, 00),
datetime(2013, 11, 13, 15, 00, 00)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 10),
datetime(2013, 11, 12, 15, 00, 10),
datetime(2013, 11, 13, 15, 00, 10)])
Output of startRec:
0 2013-11-11 15:00:00
1 2013-11-12 15:00:00
2 2013-11-13 15:00:00
dtype: datetime64[ns]

Being in a Panel makes this a little trickier. I typically stick with DataFrames.
But how does this look:
import numpy as np
import pandas as pd
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 0),
datetime(2013, 11, 12, 15, 0, 0),
datetime(2013, 11, 13, 15, 0, 0)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 10),
datetime(2013, 11, 12, 15, 0, 10),
datetime(2013, 11, 13, 15, 0, 10)])
_data1m = pd.DataFrame(data={
'temp': np.array([21, 19, 15]),
'dens': np.array([1001, 1000, 997]),
'start': startRec,
'end': endRec
}
)
_data3m = pd.DataFrame(data={
'temp': np.array([20, 17, 14]),
'dens': np.array([1002, 998, 995]),
'start': startRec,
'end': endRec
}
)
_data1m.set_index(['start', 'end'], inplace=True)
_data3m.set_index(['start', 'end'], inplace=True)
data = pd.Panel(data={'1m': _data1m, '3m': _data3m})
data.loc['3m'].select(lambda row: row[0] < pd.Timestamp('2013-11-12') or
row[1] < pd.Timestamp('2013-11-13'))
and that outputs:
dens temp
start end
2013-11-11 15:00:00 2013-11-11 15:00:10 1002 20
2013-11-12 15:00:00 2013-11-12 15:00:10 998 17
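pd.Panel and pd.TimeSeries have since been removed from pandas, so the code above only runs on old releases. The same idea (start and end carried together as a two-level row index) can be sketched with a plain DataFrame. This is only an illustration, not the answerer's code: the level names "quantity" and "distance" and the variable names are assumptions made here.

```python
import numpy as np
import pandas as pd

# Start and end of each recording; end is 10 seconds after start, as in the question.
start_rec = pd.to_datetime(["2013-11-11 15:00:00",
                            "2013-11-12 15:00:00",
                            "2013-11-13 15:00:00"])
end_rec = start_rec + pd.Timedelta(seconds=10)

# Both timestamps travel together as levels of the row index.
index = pd.MultiIndex.from_arrays([start_rec, end_rec], names=["start", "end"])

# Columns: (quantity, distance) pairs replace the Panel's items and minor axis.
columns = pd.MultiIndex.from_product([["temp", "dens"], ["1m", "3m"]],
                                     names=["quantity", "distance"])

values = np.array([[21, 20, 1001, 1002],
                   [19, 17, 1000, 998],
                   [15, 14, 997, 995]])
data = pd.DataFrame(values, index=index, columns=columns)

# Either timestamp is available for every row.
starts = data.index.get_level_values("start")
ends = data.index.get_level_values("end")

# Select rows by start time, then look at one quantity.
early = data.loc[starts < pd.Timestamp("2013-11-13")]
temp_3m = data[("temp", "3m")]
```

Selecting data["temp"] returns a sub-DataFrame with the distances as columns, which is close to what a Panel item used to be.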


Filter datetime column by every 15 minute interval

The problem:
I have a dataframe with a datetime column (formatted in the datetime format of Python), which contains a reading for example, 2020-01-03T00:00:00.000Z, 2020-01-03T00:05:00.000Z and so on, until 2020-01-03T00:23:55.000Z, for different dates.
I want to filter the entire dataframe based on this column but only keep readings at every 0th, 15th, 30th, 45th minute.
I saw another question which did something similar with pd.date_range(start, freq='15T', periods=len(df)), but the question is not the same. Thank you.
I was able to do this in an easy and elegant way.
Let us assume the dataframe is called df and the column in question is called 'datetime'; here is the solution:
# .dt is the pandas datetime accessor, so the datetime module does not need to be imported
df['minute'] = df['datetime'].dt.minute
df = df[df['minute'].isin([0, 15, 30, 45])]
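Note that a filter on the minute alone also keeps readings such as 00:15:30, because the seconds are never checked. If only readings exactly on a quarter-hour boundary are wanted, one possible variant is to compare each timestamp with itself floored to 15 minutes. The sample frame and the "value" column below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one reading every 5 minutes for an hour (UTC).
df = pd.DataFrame({
    "datetime": pd.date_range("2020-01-03", periods=12, freq="5min", tz="UTC"),
})
df["value"] = range(len(df))

# A timestamp lies on a 15-minute boundary exactly when flooring leaves it unchanged.
on_quarter = df["datetime"] == df["datetime"].dt.floor("15min")
filtered = df[on_quarter]
print(filtered)
```

This avoids adding a helper column and treats seconds and microseconds correctly.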
What about grouping your data in intervals and applying whatever aggregation/transform you need on top ?
from datetime import datetime
import pandas as pd
df = pd.DataFrame(
{
"dt": [
datetime(2022, 12, 29, 15, 23, 0),
datetime(2022, 12, 29, 15, 38, 0),
datetime(2022, 12, 29, 15, 43, 0),
datetime(2022, 12, 29, 16, 11, 0),
],
"dat": [1, 2, 3, 4],
}
)
groups = df.groupby(
pd.Grouper(key="dt", freq="15min", origin="start_day", label="left")
)
groups.first()
gives
dat
dt
2022-12-29 15:15:00 1.0
2022-12-29 15:30:00 2.0
2022-12-29 15:45:00 NaN
2022-12-29 16:00:00 4.0
You can resample the date index to 15-minute intervals and bfill the data.
import pandas as pd
start_date = "12/1/2022"
end_date = "12/3/2022"
df = pd.DataFrame(pd.date_range(start_date, end_date, freq="D"), columns=["Date"])
df['Date'] = df['Date'].astype('datetime64[ns]')
df.set_index('Date', inplace=True)
# Reindex to a 15-minute frequency, back-filling any data columns
df = df.asfreq('15min', method='bfill')
for item in df.index:
    print(item)

Converting integer Seconds values into datetime. Python

I'm trying to figure out the easiest way to automate the conversion of an array of seconds into datetime. I'm very familiar with converting the seconds from 1970 into datetime, but the values that I have here are for the seconds elapsed in a given day. For example, 14084 is the number of seconds that has passed on 2011,11,11, and I was able to generate the datetime below.
str(dt.timedelta(seconds = 14084))
Out[245]: '3:54:44'
dt.datetime.combine(date(2011,11,11),time(3,54,44))
Out[250]: datetime.datetime(2011, 11, 11, 3, 54, 44)
Is there a faster way to do this conversion for an array?
numpy has support for arrays of datetimes with a timedelta type for manipulating them:
https://numpy.org/doc/stable/reference/arrays.datetime.html
e.g. you can do this:
import numpy as np
date_array = np.arange('2005-02', '2005-03', dtype='datetime64[D]')
date_array += np.timedelta64(4, 's') # Add 4 seconds
If you have an array of seconds, you could convert it into an array of timedeltas and add that to a fixed datetime
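As a rough sketch of that NumPy route, using the seconds values from the question (the variable names are chosen here for illustration):

```python
import numpy as np

seconds = np.array([14084, 14085, 15003])

# Reinterpret the integers as second-resolution timedeltas,
# then add them to the fixed calendar day.
day = np.datetime64("2011-11-11")
stamps = day + seconds.astype("timedelta64[s]")

print(stamps)
```

The result is a datetime64[s] array, so the whole conversion happens without a Python-level loop.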
Say you have
seconds = [14084, 14085, 15003]
You can use pandas
import pandas as pd
series = pd.to_timedelta(seconds, unit='s') + pd.to_datetime('2011-11-11')
series = series.to_series().reset_index(drop=True)
print(series)
0 2011-11-11 03:54:44
1 2011-11-11 03:54:45
2 2011-11-11 04:10:03
dtype: datetime64[ns]
Or a list comprehension
list_comp = [datetime.datetime(2011, 11, 11) +
datetime.timedelta(seconds=s) for s in seconds]
print(list_comp)
[datetime.datetime(2011, 11, 11, 3, 54, 44), datetime.datetime(2011, 11, 11, 3, 54, 45), datetime.datetime(2011, 11, 11, 4, 10, 3)]

Np.interp doesn't work correctly on timestamps during American daylight savings transitions

These times are from Queensland, Australia where Daylight savings is not observed.
I have a program that interpolates time using this strategy, but it interpolates with respect to daylight savings time.
For example, this script interpolates between two time points with 100 intervals total.
import numpy as np
from datetime import datetime
import datetime as dt
import time
x_dec = np.linspace(0, np.pi, num=100)
time1 = dt.datetime(2017, 11, 4, 20, 47, 0)
time2 = dt.datetime(2017, 11, 5, 3, 1, 0)
this_time = time.mktime(time1.timetuple())
next_time = time.mktime(time2.timetuple())
this_x_temp = np.interp(x_dec, (x_dec.min(), x_dec.max()), (this_time, next_time))
this_x = np.vectorize(datetime.fromtimestamp)(this_x_temp)
print(this_x)
Instead of cleanly producing interpolated times, it cycles through times around 1AM twice in concordance with American Daylight Savings time. See the example output.
...
datetime.datetime(2017, 11, 5, 1, 49, 29, 90909)
datetime.datetime(2017, 11, 5, 1, 53, 52, 121212)
datetime.datetime(2017, 11, 5, 1, 58, 15, 151515)
datetime.datetime(2017, 11, 5, 1, 2, 38, 181818)
datetime.datetime(2017, 11, 5, 1, 7, 1, 212121)
datetime.datetime(2017, 11, 5, 1, 11, 24, 242424)
datetime.datetime(2017, 11, 5, 1, 15, 47, 272727)
...
I don't think that I can use a time series for this application given that all of my observations are at random times throughout the day and I need around 100 datapoints of interpolated times between the observations. How can I make np.interp ignore American daylight savings time and just interpolate the time as if there is no daylight savings switch?
What you're seeing is US daylight savings time kicking in. At 2AM on 5 Nov 2017, it actually became 1AM again.
If you don't want to deal with US DST, I recommend using pytz to explicitly set your time zone. It also might help to check what time zone your computer is set to, as datetime draws its location and DST data from that. You can set it with
os.environ['TZ'] = 'Australia/Brisbane'
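I think; I haven't tested it because I don't want to play around with my system too much. On Unix, time.tzset() has to be called after changing TZ for the new zone to take effect.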
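Another option, sketched here as an alternative rather than as part of the answer above, is to avoid time.mktime and datetime.fromtimestamp altogether. Both go through the machine's local time zone, which is where the DST jump comes from. Interpolating with plain datetime arithmetic on naive datetimes never consults the time zone. The names fractions and span are introduced here for illustration.

```python
import datetime as dt

import numpy as np

time1 = dt.datetime(2017, 11, 4, 20, 47, 0)
time2 = dt.datetime(2017, 11, 5, 3, 1, 0)

# 100 evenly spaced fractions of the interval, from 0.0 to 1.0 inclusive.
fractions = np.linspace(0.0, 1.0, num=100)

# timedelta supports multiplication by a float, so no epoch conversion is needed.
span = time2 - time1
this_x = [time1 + span * float(f) for f in fractions]

print(this_x[:3])
```

Because the datetimes stay naive throughout, the result is the same regardless of the system time zone setting.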

Fill in missing dates in dataframe using the mean

I have dates that I'm pulling into a dataframe at regular intervals.
The data is generally well-formed, but sometimes there are bad data in an otherwise date column.
I would always expect to have a date in the parsed 9 digit form:
(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)
How should I check and fix this?
What I would like to do is replace whatever is not a date, with a date based on a variable that represents the last_update + 1/2 the update interval, so the items are not filtered out by later functions.
Data as shown is published_parsed from feedparser.
import pandas as pd
import datetime
# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
# date is fine
df_date = pd.DataFrame({'date': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
Pseudocode
if the original_date is valid
    return original_date
else
    return substitute_date
import calendar
import numpy as np
import pandas as pd
def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x) # 1
    except (TypeError, ValueError):
        return np.nan
df = pd.DataFrame({'orig': [
(2015, 12, 29, 0, 30, 50, 1, 363, 0),
(2015, 12, 28, 23, 59, 12, 0, 362, 0),
'None', '',
(2015, 12, 30, 23, 59, 12, 0, 362, 0)]})
ts = df['orig'].apply(tuple_to_timestamp) # 2
# 0 1451349050
# 1 1451347152
# 2 NaN
# 3 NaN
# 4 1451519952
# Name: orig, dtype: float64
ts = ts.interpolate() # 3
# 0 1451349050
# 1 1451347152
# 2 1451404752
# 3 1451462352
# 4 1451519952
# Name: orig, dtype: float64
df['fixed'] = pd.to_datetime(ts, unit='s') # 4
print(df)
yields
orig fixed
0 (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1 (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2 None 2015-12-29 15:59:12
3 2015-12-30 07:59:12
4 (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12
Explanation:
1. calendar.timegm converts each time-tuple to a timestamp. Unlike time.mktime, it interprets the time-tuple as being in UTC, not local time.
2. apply calls tuple_to_timestamp for each row of df['orig'].
3. The nice thing about timestamps is that they are numeric, so you can then use numerical methods such as Series.interpolate to fill in NaNs with interpolated values. Note that the two NaNs do not get filled with the same interpolated value; their values are linearly interpolated based on their position as given by ts.index.
4. pd.to_datetime converts the timestamps to dates.
When working with dates and times in pandas, convert them to a pandas timestamp using pandas.to_datetime. To use this function, we will convert each tuple into a string with just the date and time elements. For your case, values that are not sequences of length 9 will be considered bad and are replaced with an empty string ''.
#convert list into string with date & time
#only elements with lists of length 9 will be parsed
dates_df = df_date_ugly.applymap(lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0],x[1],x[2], x[3], x[4], x[5]) if len(x)==9 else '')
#convert to a pandas timestamp
dates_df['date'] = pd.to_datetime(dates_df['date'], errors='coerce')
date
0 2015-12-29 00:30:50
1 2015-12-28 23:59:12
2 NaT
3 NaT
4 2015-12-28 23:59:12
To find the indices where the dates are missing, use pd.isnull() as a boolean mask:
>>> missing = dates_df[pd.isnull(dates_df['date'])].index
>>> missing
Int64Index([2, 3], dtype='int64')
To set the missing date as the midpoint between 2 dates:
start_date = dates_df.iloc[0,:]
end_date = dates_df.iloc[4,:]
missing_date = start_date + (end_date - start_date)/2
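The question asked for the bad entries to be replaced by last_update plus half the update interval. A minimal sketch of that substitution with Series.fillna follows. The values chosen for last_update and update_interval are hypothetical, since the question does not give them, and the series is rebuilt here so that the example is self-contained.

```python
import pandas as pd

# Parsed dates, with NaT where the original entry was not a valid date.
dates = pd.Series(pd.to_datetime([
    "2015-12-29 00:30:50",
    "2015-12-28 23:59:12",
    None,
    None,
    "2015-12-28 23:59:12",
]))

# Hypothetical values; in the real program these come from the feed poller.
last_update = pd.Timestamp("2015-12-28 23:59:12")
update_interval = pd.Timedelta(hours=1)

# Every missing date gets the same substitute timestamp.
substitute_date = last_update + update_interval / 2
fixed = dates.fillna(substitute_date)
print(fixed)
```

Unlike interpolation, this gives every bad row the same timestamp, which matches the pseudocode in the question.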

How do I join integers, Decimal() and datetime.datetime() in a list?

I am using pyodbc to return rows from a SQL database, and I can connect and get all of the rows just fine. But now I am confused how to deal with the data returned. I would like to join all of the values in the returned list, into a string that I can then write to a file. But I am unsure how to deal with multiple data types in the list.
Here is the base code leading to the list:
key = "03001001"
cursor.execute('''select * from table_name where key='{}' '''.format(key))
rows = cursor.fetchall()
for x in rows:
print(x)
When I print(x) it returns the following line:
('03001001', 2, datetime.datetime(2014, 11, 13, 4, 30), 0, Decimal('-0.1221'), 5, 0, 0, 0, datetime.datetime(2014, 11, 13, 14, 30), datetime.datetime(2014, 11, 13, 4, 30, 12), 0)
I would like for it to simply be a tab separated string.
print('\t'.join(map(repr, x)))
will lead to
'03001001' 2 datetime.datetime(2014, 11, 13, 4, 30) 0 Decimal('-0.1221') 5 0 0 0 datetime.datetime(2014, 11, 13, 14, 30) datetime.datetime(2014, 11, 13, 4, 30, 12) 0
If you want a human readable date and decimal entry use str instead of repr (like Matti John's answer):
print('\t'.join(map(str, x)))
will print
03001001 2 2014-11-13 04:30:00 0 -0.1221 5 0 0 0 2014-11-13 14:30:00 2014-11-13 04:30:12 0
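If some types need a specific text format in the output file, a small per-value formatter can be used in place of str. The sketch below is only one way to do it: to_field is a hypothetical helper name, and the row is a shortened copy of the one printed in the question.

```python
import datetime
from decimal import Decimal


def to_field(value):
    """Return the text for one column of the output line."""
    if isinstance(value, datetime.datetime):
        # Explicit, stable timestamp format: YYYY-MM-DD HH:MM:SS
        return value.isoformat(sep=" ")
    # Integers, Decimal and strings already have a readable str() form.
    return str(value)


row = ('03001001', 2, datetime.datetime(2014, 11, 13, 4, 30), 0, Decimal('-0.1221'))
line = '\t'.join(to_field(v) for v in row)
print(line)
```

Keeping the formatting in one function makes it easy to add further cases later, for example writing None as an empty field.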
