I have dates that I'm pulling into a dataframe at regular intervals.
The data is generally well-formed, but sometimes there is bad data in an otherwise clean date column.
I would always expect to have a date in the parsed 9-element form:
(tm_year=2000, tm_mon=11, tm_mday=30, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=335, tm_isdst=-1)
(2015, 12, 29, 0, 30, 50, 1, 363, 0)
How should I check and fix this?
What I would like to do is replace whatever is not a date with a date based on a variable that represents last_update + 1/2 the update interval, so the items are not filtered out by later functions.
Data as shown is published_parsed from feedparser.
import pandas as pd
import datetime
# date with ugly data
df_date_ugly = pd.DataFrame({'date': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
# date is fine
df_date = pd.DataFrame({'date': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0)
]})
Pseudocode
if the original_date is valid
return original_date
else
return substitute_date
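The pseudocode above can be sketched directly in Python. This is a hedged sketch, not the accepted answer's approach: the names last_update and update_interval are hypothetical stand-ins for the variables described in the question, and validity is checked by attempting to build a time.struct_time.

```python
import datetime
import time

def fix_date(original, last_update, update_interval):
    """Return original if it is a valid 9-element time tuple,
    else substitute last_update + half the update interval."""
    try:
        # time.struct_time raises TypeError unless given a 9-sequence
        time.struct_time(original)
        return original
    except TypeError:
        # substitute date: last update plus half the update interval
        return (last_update + update_interval / 2).timetuple()
```

This keeps good tuples untouched and converts the substitute back to a time tuple so downstream code sees a uniform type.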
import calendar
import numpy as np
import pandas as pd
def tuple_to_timestamp(x):
    try:
        return calendar.timegm(x)  # 1
    except (TypeError, ValueError):
        return np.nan
df = pd.DataFrame({'orig': [
    (2015, 12, 29, 0, 30, 50, 1, 363, 0),
    (2015, 12, 28, 23, 59, 12, 0, 362, 0),
    'None', '',
    (2015, 12, 30, 23, 59, 12, 0, 362, 0)]})
ts = df['orig'].apply(tuple_to_timestamp) # 2
# 0 1451349050
# 1 1451347152
# 2 NaN
# 3 NaN
# 4 1451519952
# Name: orig, dtype: float64
ts = ts.interpolate() # 3
# 0 1451349050
# 1 1451347152
# 2 1451404752
# 3 1451462352
# 4 1451519952
# Name: orig, dtype: float64
df['fixed'] = pd.to_datetime(ts, unit='s') # 4
print(df)
yields
orig fixed
0 (2015, 12, 29, 0, 30, 50, 1, 363, 0) 2015-12-29 00:30:50
1 (2015, 12, 28, 23, 59, 12, 0, 362, 0) 2015-12-28 23:59:12
2 None 2015-12-29 15:59:12
3 2015-12-30 07:59:12
4 (2015, 12, 30, 23, 59, 12, 0, 362, 0) 2015-12-30 23:59:12
Explanation:
calendar.timegm converts each time-tuple to a timestamp. Unlike
time.mktime, it interprets the time-tuple as being in UTC, not local time.
apply calls tuple_to_timestamp for each row of df['orig'].
The nice thing about timestamps is that they are numeric, so you can then use
numerical methods such as Series.interpolate to fill in NaNs with interpolated
values. Note that the two NaNs do not get filled with the same interpolated value; their values are linearly interpolated based on their position as given by ts.index.
pd.to_datetime converts the timestamps to dates.
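The UTC behaviour of calendar.timegm mentioned above can be confirmed on the first tuple from the example; the expected value matches row 0 of the output shown earlier.

```python
import calendar

t = (2015, 12, 29, 0, 30, 50, 1, 363, 0)

# calendar.timegm interprets the tuple as UTC, so the result is the same
# on any machine; time.mktime would instead use the local timezone and
# give a machine-dependent answer
print(calendar.timegm(t))  # 1451349050
```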
When working with dates and times in pandas, convert them to a pandas timestamp using pandas.to_datetime. To use this function, we first convert each tuple into a string containing just the date and time elements. For your case, values that are not sequences of length 9 are considered bad and are replaced with an empty string ''.
# convert tuple into a string with date & time;
# only elements of length 9 will be parsed
dates_df = df_date_ugly.applymap(
    lambda x: "{0}/{1}/{2} {3}:{4}:{5}".format(x[0], x[1], x[2], x[3], x[4], x[5])
    if len(x) == 9 else '')
#convert to a pandas timestamp
dates_df = pd.to_datetime(dates_df['date'], errors='coerce')
date
0 2015-12-29 00:30:50
1 2015-12-28 23:59:12
2 NaT
3 NaT
4 2015-12-28 23:59:12
Find the indices where the dates are missing using pd.isnull():
>>> missing = dates_df.index[pd.isnull(dates_df)]
>>> missing
Int64Index([2, 3], dtype='int64')
To set a missing date as the midpoint between two dates (dates_df is now a Series, so iloc takes a single position):
start_date = dates_df.iloc[0]
end_date = dates_df.iloc[4]
missing_date = start_date + (end_date - start_date)/2
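Putting the steps above together, a sketch that rebuilds the parsed series and fills every NaT with that midpoint (Series.fillna does the assignment in one call):

```python
import pandas as pd

# rebuild the parsed series from the steps above
s = pd.to_datetime(pd.Series([
    '2015-12-29 00:30:50', '2015-12-28 23:59:12',
    None, None, '2015-12-28 23:59:12']), errors='coerce')

start_date = s.iloc[1]
end_date = s.iloc[4]
# fill each NaT with the midpoint between the surrounding good readings
s = s.fillna(start_date + (end_date - start_date) / 2)
```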
Related
The problem:
I have a dataframe with a datetime column (in Python's datetime format), which contains readings such as 2020-01-03T00:00:00.000Z, 2020-01-03T00:05:00.000Z, and so on, until 2020-01-03T00:23:55.000Z, for different dates.
I want to filter the entire dataframe based on this column but only keep readings at every 0th, 15th, 30th, 45th minute.
I saw another question which did something similar with pd.date_range(start, freq='15T', periods=len(df)), but the question is not the same. Thank you.
I was able to do this in an easy and elegant way.
Let us assume the dataframe is called df and the column in question is called 'datetime'; here is the solution:
import datetime as dt # in case not already imported!
df['minute'] = df['datetime'].dt.minute
df = df[df['minute'].isin([0, 15, 30, 45])]
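The helper column is not strictly needed; the same filter can be written in one step. A small sketch with made-up timestamps:

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime([
    '2020-01-03 00:00:00', '2020-01-03 00:05:00',
    '2020-01-03 00:15:00', '2020-01-03 00:20:00',
    '2020-01-03 00:30:00'])})

# same filter without the throwaway 'minute' column
out = df[df['datetime'].dt.minute.isin([0, 15, 30, 45])]
print(len(out))  # 3
```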
What about grouping your data in intervals and applying whatever aggregation/transform you need on top?
from datetime import datetime
import pandas as pd
df = pd.DataFrame(
    {
        "dt": [
            datetime(2022, 12, 29, 15, 23, 0),
            datetime(2022, 12, 29, 15, 38, 0),
            datetime(2022, 12, 29, 15, 43, 0),
            datetime(2022, 12, 29, 16, 11, 0),
        ],
        "dat": [1, 2, 3, 4],
    }
)
groups = df.groupby(
    pd.Grouper(key="dt", freq="15min", origin="start_day", label="left")
)
groups.first()
gives
dat
dt
2022-12-29 15:15:00 1.0
2022-12-29 15:30:00 2.0
2022-12-29 15:45:00 NaN
2022-12-29 16:00:00 4.0
You can resample the date index to 15-minute intervals and bfill the data.
start_date = "12/1/2022"
end_date = "12/3/2022"
df = pd.DataFrame(pd.date_range(start_date, end_date, freq="D"), columns=["Date"])
df['Date'] = df['Date'].astype('datetime64[ns]')
df.set_index('Date', inplace=True)
df = df.asfreq('15min', method='bfill')
for item in df.index:
print(item)
Suppose I have the list data:
import numpy as np
import datetime
np.random.seed(0)
aux = [10,30,50,60,70,110,120]
base = datetime.datetime(2018, 1, 1, 22, 34, 20)
data = [[base + datetime.timedelta(seconds=s),
round(np.random.rand(),3)] for s in aux]
This returns:
data ==
[[datetime.datetime(2018, 1, 1, 22, 34, 30), 0.549],
[datetime.datetime(2018, 1, 1, 22, 34, 50), 0.715],
[datetime.datetime(2018, 1, 1, 22, 35, 10), 0.603],
[datetime.datetime(2018, 1, 1, 22, 35, 20), 0.545],
[datetime.datetime(2018, 1, 1, 22, 35, 30), 0.424],
[datetime.datetime(2018, 1, 1, 22, 36, 10), 0.646],
[datetime.datetime(2018, 1, 1, 22, 36, 20), 0.438]]
What I want to do is fill the spaces where the gaps in the dates are greater than 10 seconds, using the last previous value. For this example, the output should be:
desired_output ==
[[datetime.datetime(2018, 1, 1, 22, 34, 30), 0.549],
[datetime.datetime(2018, 1, 1, 22, 34, 40), 0.549],
[datetime.datetime(2018, 1, 1, 22, 34, 50), 0.715],
[datetime.datetime(2018, 1, 1, 22, 35), 0.715],
[datetime.datetime(2018, 1, 1, 22, 35, 10), 0.603],
[datetime.datetime(2018, 1, 1, 22, 35, 20), 0.545],
[datetime.datetime(2018, 1, 1, 22, 35, 30), 0.424],
[datetime.datetime(2018, 1, 1, 22, 35, 40), 0.424],
[datetime.datetime(2018, 1, 1, 22, 35, 50), 0.424],
[datetime.datetime(2018, 1, 1, 22, 36), 0.424],
[datetime.datetime(2018, 1, 1, 22, 36, 10), 0.646],
[datetime.datetime(2018, 1, 1, 22, 36, 20), 0.438]]
I can't think of any smart way to do this. All dates are separated by multiples of 10 seconds. Any ideas?
Option 1: with Pandas
If you're open to using Pandas, it makes reindexing operations like this easy:
>>> import pandas as pd
>>> df = pd.DataFrame(data, columns=['date', 'value'])
>>> ridx = df.set_index('date').asfreq('10s').ffill().reset_index()
>>> ridx
date value
0 2018-01-01 22:34:30 0.549
1 2018-01-01 22:34:40 0.549
2 2018-01-01 22:34:50 0.715
3 2018-01-01 22:35:00 0.715
4 2018-01-01 22:35:10 0.603
5 2018-01-01 22:35:20 0.545
6 2018-01-01 22:35:30 0.424
7 2018-01-01 22:35:40 0.424
8 2018-01-01 22:35:50 0.424
9 2018-01-01 22:36:00 0.424
10 2018-01-01 22:36:10 0.646
11 2018-01-01 22:36:20 0.438
.asfreq('10s') will fill the missing 10-second intervals. .ffill() means "forward-fill" missing values with the last-seen valid value.
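For reference, resample offers an equivalent route to the same forward-filled grid; a sketch on a tiny two-point series:

```python
import pandas as pd

s = pd.Series([0.549, 0.715],
              index=pd.to_datetime(['2018-01-01 22:34:30',
                                    '2018-01-01 22:34:50']))

# resample to a 10-second grid, forward-filling the gap at 22:34:40
filled = s.resample('10s').ffill()
print(filled.tolist())  # [0.549, 0.549, 0.715]
```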
To get back to the data structure that you have now (though note that the elements will be 2-tuples, rather than lists of length 2):
>>> native_ridx = list(zip(ridx['date'].dt.to_pydatetime().tolist(), ridx['value']))
>>> from pprint import pprint
>>> pprint(native_ridx[:5])
[(datetime.datetime(2018, 1, 1, 22, 34, 30), 0.549),
(datetime.datetime(2018, 1, 1, 22, 34, 40), 0.549),
(datetime.datetime(2018, 1, 1, 22, 34, 50), 0.715),
(datetime.datetime(2018, 1, 1, 22, 35), 0.715),
(datetime.datetime(2018, 1, 1, 22, 35, 10), 0.603)]
To confirm:
>>> assert all(tuple(i) == j for i, j in zip(desired_output, native_ridx))
Option 2: Native Python
import datetime
def make_daterange(
        start: datetime.datetime,
        end: datetime.datetime,
        incr=datetime.timedelta(seconds=10)):
    yield start
    while start < end:
        start += incr
        yield start
def reindex_ffill(data: list, incr=datetime.timedelta(seconds=10)):
    dates, _ = zip(*data)
    data = dict(data)
    start, end = min(dates), max(dates)
    daterng = make_daterange(start, end, incr)
    # If the initial value is missing, the first yielded value will be None
    lastvalid = None
    get = data.get
    for date in daterng:
        value = get(date)
        if value:
            yield date, value
            lastvalid = value
        else:
            yield date, lastvalid
Example:
>>> pynative_ridx = list(reindex_ffill(data))
>>> assert all(tuple(i) == j for i, j in zip(desired_output, pynative_ridx))
I want to create a mapping of every day of the week to its datetime object. So my dictionary should have keys "Monday", "Tuesday", ... (and so on), so that I can get a datetime object for every day of the next(!) week.
At the moment I have a dictionary with these values:
DAYS_DATETIME_RELATIONS = {
"today": datetime.datetime.now(),
"tomorrow": datetime.datetime.now() + datetime.timedelta(days=1),
"after_tomorrow": datetime.datetime.now() + datetime.timedelta(days=2)
}
Unfortunately I cannot find any algorithmic solution for this and hope one of you can help me.
This can be achieved by using two dictionaries in the following manner:
import calendar
import datetime
today = datetime.datetime.now()
# map the offset-from-today to a weekday name; using i % 7 means an
# arbitrary range can be used, e.g. range(7, 14) for the week after next
days = {i: calendar.day_name[(today.weekday() + i) % 7] for i in range(7)}
next_week = {days[i % 7]: (today + datetime.timedelta(days=i)).date()
             for i in range(7)}
print(next_week)
# {'Tuesday': datetime.date(2018, 1, 9), 'Sunday': datetime.date(2018, 1, 7),
# 'Monday': datetime.date(2018, 1, 8), 'Thursday': datetime.date(2018, 1, 11),
# 'Wednesday': datetime.date(2018, 1, 10), 'Friday': datetime.date(2018, 1, 12),
# 'Saturday': datetime.date(2018, 1, 13)}
print(next_week['Saturday'])
# 2018-01-13
Here is another way to solve your question, using datetime and timedelta from the datetime module:
from datetime import datetime, timedelta
def generate_dict_relation(_time, _days=0):
    keys = {'Yesterday': -1, 'Today': 0, 'Tomorrow': 1, 'After_tomorrow': 2}
    if not _days:
        return {key: _time + timedelta(days=keys.get(key, 0)) for key in keys}
    else:
        return {(_time + timedelta(days=_days + k)).strftime('%A'):
                _time + timedelta(days=_days + k)
                for k in range(0, 7)}
_date_now = datetime.now()
DAYS_DATETIME_RELATIONS = {}
# get dates: yesterday, today, tomorrow and after tomorrow
DAYS_DATETIME_RELATIONS.update(generate_dict_relation(_date_now, 0))
# get dates after 7 days = 1 week
DAYS_DATETIME_RELATIONS.update(generate_dict_relation(_date_now, 7))
next_tuesday = DAYS_DATETIME_RELATIONS.get('Tuesday')
next_monday = DAYS_DATETIME_RELATIONS.get('Monday')
yesterday = DAYS_DATETIME_RELATIONS.get('Yesterday')
print('{0:%d/%m/%Y %H:%M:%S:%s} \t {1}'.format(next_tuesday, repr(next_tuesday)))
print('{0:%d/%m/%Y %H:%M:%S:%s} \t {1}'.format(next_monday, repr(next_monday)))
print('{0:%d/%m/%Y %H:%M:%S:%s} \t {1}'.format(yesterday, repr(yesterday)))
Output:
16/01/2018 10:56:26:1516096586 datetime.datetime(2018, 1, 16, 10, 56, 26, 659949)
15/01/2018 10:56:26:1516010186 datetime.datetime(2018, 1, 15, 10, 56, 26, 659949)
06/01/2018 10:56:26:1515232586 datetime.datetime(2018, 1, 6, 10, 56, 26, 659949)
One very generic way is to create a custom iterator that returns consecutive datetime objects:
from datetime import datetime, timedelta
class RepetetiveDate(object):
    def __init__(self, day_range=7, datetime_obj=None, jump_days=1):
        self.day_range = day_range
        self.day_counter = 0
        # resolve the default at call time, not when the class is defined
        self.datetime_obj = datetime_obj if datetime_obj else datetime.now()
        self.jump_days = jump_days
        self.time_deltadiff = timedelta(days=self.jump_days)

    def __iter__(self):
        return self

    # If you are on Python 2.7, define this method as `next(self)`
    def __next__(self):
        if self.day_counter >= self.day_range:
            raise StopIteration
        if self.day_counter != 0:  # don't update for the first iteration
            self.datetime_obj += self.time_deltadiff
        self.day_counter += 1
        return self.datetime_obj
This iterator yields consecutive datetime objects, starting from the datetime object you initially pass (by default, the current date).
It takes three optional parameters which you may customize as needed:
day_range: Maximum number of iterations for the RepetetiveDate iterator. Default is 7.
jump_days: Number of days to jump between consecutive iterations. For example, if jump_days is 2, the iterator returns a datetime object for every other date. Pass a negative value to step backwards into the past. Default is 1.
datetime_obj: The datetime from which you want to start iterating. Default is the current date.
If you are new to iterators, take a look at:
What exactly are Python's iterator, iterable, and iteration protocols?
Difference between Python's Generators and Iterators
Sample Run for upcoming dates:
>>> x = RepetetiveDate()
>>> next(x)
datetime.datetime(2018, 1, 8, 15, 55, 39, 124654)
>>> next(x)
datetime.datetime(2018, 1, 9, 15, 55, 39, 124654)
>>> next(x)
datetime.datetime(2018, 1, 10, 15, 55, 39, 124654)
Sample Run for previous dates:
>>> x = RepetetiveDate(jump_days=-1)
>>> next(x)
datetime.datetime(2018, 1, 6, 15, 55, 39, 124654)
>>> next(x)
datetime.datetime(2018, 1, 5, 15, 55, 39, 124654)
>>> next(x)
datetime.datetime(2018, 1, 4, 15, 55, 39, 124654)
How to get your desired dictionary?
Using this, you may create your dictionary using the dict comprehension as:
Dictionary of all days of week
>>> {d.strftime("%A"): d for d in RepetetiveDate(day_range=7)}
{
'Monday': datetime.datetime(2018, 1, 8, 15, 23, 16, 926364),
'Tuesday': datetime.datetime(2018, 1, 9, 15, 23, 16, 926364),
'Wednesday': datetime.datetime(2018, 1, 10, 15, 23, 16, 926364),
'Thursday': datetime.datetime(2018, 1, 11, 15, 23, 16, 926364),
'Friday': datetime.datetime(2018, 1, 12, 15, 23, 16, 926364),
'Saturday': datetime.datetime(2018, 1, 13, 15, 23, 16, 926364),
'Sunday': datetime.datetime(2018, 1, 14, 15, 23, 16, 926364)
}
Here I am using d.strftime("%A") to extract day name from the datetime object.
List of current days for next 4 weeks
>>> [d for d in RepetetiveDate(jump_days=7, day_range=4)]
[
datetime.datetime(2018, 1, 7, 16, 17, 45, 45005),
datetime.datetime(2018, 1, 14, 16, 17, 45, 45005),
datetime.datetime(2018, 1, 21, 16, 17, 45, 45005),
datetime.datetime(2018, 1, 28, 16, 17, 45, 45005)
]
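The same behaviour can also be sketched as a plain generator function, which avoids the iterator-class boilerplate (the function name and defaults here loosely mirror the class above and are my own):

```python
from datetime import datetime, timedelta

def repetitive_date(day_range=7, start=None, jump_days=1):
    """Yield day_range datetimes, jump_days apart, starting at start."""
    current = start if start is not None else datetime.now()
    for _ in range(day_range):
        yield current
        current += timedelta(days=jump_days)

# e.g. {day name: datetime} for a week starting on a fixed Monday
week = {d.strftime("%A"): d for d in repetitive_date(7, datetime(2018, 1, 8))}
print(list(week)[0])  # Monday
```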
One very clean way to implement this is using rrule from the dateutil library. For example:
>>> from dateutil.rrule import rrule, DAILY
>>> from datetime import datetime
>>> start_date = datetime.now()
>>> {d.strftime("%A"): d for d in rrule(freq=DAILY, count=7, dtstart=start_date)}
which will return your desired dict object:
{
'Sunday': datetime.datetime(2018, 1, 7, 17, 2, 30),
'Monday': datetime.datetime(2018, 1, 8, 17, 2, 30),
'Tuesday': datetime.datetime(2018, 1, 9, 17, 2, 30),
'Wednesday': datetime.datetime(2018, 1, 10, 17, 2, 30),
'Thursday': datetime.datetime(2018, 1, 11, 17, 2, 30),
'Friday': datetime.datetime(2018, 1, 12, 17, 2, 30),
'Saturday': datetime.datetime(2018, 1, 13, 17, 2, 30)
}
(Special thanks to Jon Clements for telling me about rrule)
I am using pyodbc to return rows from a SQL database, and I can connect and get all of the rows just fine. But now I am unsure how to deal with the returned data. I would like to join all of the values in the returned list into a string that I can then write to a file, but I am not sure how to handle the multiple data types in the list.
Here is the base code leading to the list:
key = "03001001"
cursor.execute('''select * from table_name where key='{}' '''.format(key))
rows = cursor.fetchall()
for x in rows:
print(x)
When I print(x) it returns the following line:
('03001001', 2, datetime.datetime(2014, 11, 13, 4, 30), 0, Decimal('-0.1221'), 5, 0, 0, 0, datetime.datetime(2014, 11, 13, 14, 30), datetime.datetime(2014, 11, 13, 4, 30, 12), 0)
I would like for it to simply be a tab separated string.
print('\t'.join(map(repr, x)))
will lead to
'03001001' 2 datetime.datetime(2014, 11, 13, 4, 30) 0 Decimal('-0.1221') 5 0 0 0 datetime.datetime(2014, 11, 13, 14, 30) datetime.datetime(2014, 11, 13, 4, 30, 12) 0
If you want a human readable date and decimal entry use str instead of repr (like Matti John's answer):
print('\t'.join(map(str, x)))
will print
03001001 2 2014-11-13 04:30:00 0 -0.1221 5 0 0 0 2014-11-13 14:30:00 2014-11-13 04:30:12 0
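To write such rows straight to a file, the standard-library csv module with a tab delimiter performs the str() conversion for you. A sketch using io.StringIO as a stand-in for a real file handle, with a shortened version of the row:

```python
import csv
import datetime
import io
from decimal import Decimal

row = ('03001001', 2, datetime.datetime(2014, 11, 13, 4, 30), 0,
       Decimal('-0.1221'))

buf = io.StringIO()  # stand-in for open('out.tsv', 'w', newline='')
writer = csv.writer(buf, delimiter='\t')
writer.writerow(row)  # each non-string field is converted with str()
print(buf.getvalue().strip())
```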
Short version: I have two TimeSeries (recording start and recording end) I would like to use as indices for data in a Panel (or DataFrame). Not hierarchical, but parallel. I am uncertain how to do this.
Long version:
I am constructing a pandas Panel with some data akin to temperature and density at certain distances from an antenna. As I see it, the most natural structure is having e.g. temp and dens as items (i.e. sub-DataFrames of the Panel), recording time as the major axis (index), and thus distance from the antenna as the minor axis (columns).
My problem is this: For each recording, the instrument averages/integrates over some amount of time. Thus, for each data dump, two timestamps are saved: start recording and end recording. I need both of those. Thus, I would need something which might be called "parallel indexing", where two different TimeSeries (startRec and endRec) work as indices, and I can get whichever I prefer for a certain data point. Of course, I don't really need to index by both, but both need to be naturally available in the data structure. For example, for any given temperature or density recording, I need to be able to get both the start and end time of the recording.
I could of course keep the two TimeSeries in a separate DataFrame, but with the main point of pandas being automatic data alignment, this is not really ideal.
How can I best achieve this?
Example data
Sample Panel with three recordings at two distances from the antenna:
import pandas as pd
import numpy as np
data = pd.Panel(data={'temp': np.array([[21, 20],
[19, 17],
[15, 14]]),
'dens': np.array([[1001, 1002],
[1000, 998],
[997, 995]])},
minor_axis=['1m', '3m'])
Output of data:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: dens to temp
Major_axis axis: 0 to 2
Minor_axis axis: 1m to 3m
Here, the major axis is currently only an integer-based index (0 to 2). The minor axis is the two measurement distances from the antenna.
I have two TimeSeries I'd like to use as indices:
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 00),
datetime(2013, 11, 12, 15, 00, 00),
datetime(2013, 11, 13, 15, 00, 00)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 00, 10),
datetime(2013, 11, 12, 15, 00, 10),
datetime(2013, 11, 13, 15, 00, 10)])
Output of startRec:
0 2013-11-11 15:00:00
1 2013-11-12 15:00:00
2 2013-11-13 15:00:00
dtype: datetime64[ns]
Being in a Panel makes this a little trickier. I typically stick with DataFrames.
But how does this look:
import numpy as np
import pandas as pd
from datetime import datetime
startRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 0),
datetime(2013, 11, 12, 15, 0, 0),
datetime(2013, 11, 13, 15, 0, 0)])
endRec = pd.TimeSeries([datetime(2013, 11, 11, 15, 0, 10),
datetime(2013, 11, 12, 15, 0, 10),
datetime(2013, 11, 13, 15, 0, 10)])
_data1m = pd.DataFrame(data={
'temp': np.array([21, 19, 15]),
'dens': np.array([1001, 1000, 997]),
'start': startRec,
'end': endRec
}
)
_data3m = pd.DataFrame(data={
'temp': np.array([20, 17, 14]),
'dens': np.array([1002, 998, 995]),
'start': startRec,
'end': endRec
}
)
_data1m.set_index(['start', 'end'], inplace=True)
_data3m.set_index(['start', 'end'], inplace=True)
data = pd.Panel(data={'1m': _data1m, '3m': _data3m})
data.loc['3m'].select(lambda row: row[0] < pd.Timestamp('2013-11-12') or
row[1] < pd.Timestamp('2013-11-13'))
and that outputs:
dens temp
start end
2013-11-11 15:00:00 2013-11-11 15:00:10 1002 20
2013-11-12 15:00:00 2013-11-12 15:00:10 998 17
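Note that Panel was removed in pandas 1.0, so the same layout is expressed today as a single DataFrame with a (start, end) MultiIndex and one column per measurement distance. A sketch with the temperature data only (the temp_1m/temp_3m column names are my own):

```python
from datetime import datetime

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(datetime(2013, 11, 11, 15, 0, 0), datetime(2013, 11, 11, 15, 0, 10)),
     (datetime(2013, 11, 12, 15, 0, 0), datetime(2013, 11, 12, 15, 0, 10)),
     (datetime(2013, 11, 13, 15, 0, 0), datetime(2013, 11, 13, 15, 0, 10))],
    names=['start', 'end'])

df = pd.DataFrame({'temp_1m': [21, 19, 15], 'temp_3m': [20, 17, 14]},
                  index=idx)

# filter on either timestamp via the named index levels
early = df[df.index.get_level_values('start') < pd.Timestamp('2013-11-13')]
print(len(early))  # 2
```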