Using pandas to reformat dates with inconsistent user inputs - python

I am trying to clean a spreadsheet of user-inputted data that includes a "birth_date" column. The issue I am having is that the date formatting ranges widely between users, including inputs without separators between the day, month, and year. I am having a hard time developing a formula that is intelligent enough to interpret such a wide range of inputs. Here is a sample:
1/6/46
7/28/99
11272000
11/28/78
Here is where I started:
df['birth_date'] = pd.to_datetime(df.birth_date)
This does not seem to make it past the first example, as it expects a two-digit month. Can anyone help with this?

Your best bet is to check each input and give a consistent output. Assuming month-day-year formats, you can use this function:
import pandas as pd
import re

def fix_dates(dates):
    new = []
    for date in dates:
        # split on the common separators /, ., -
        chunks = re.split(r"[\/\.\-]", date)
        if len(chunks) == 3:
            # separated input: zero-pad month and day to two digits
            m, d, y = map(lambda x: x.zfill(2), chunks)
            y = y[2:] if len(y) == 4 else y  # keep two-digit years
            new.append(f"{m}/{d}/{y}")
        else:
            # unseparated input such as 11272000: fixed-width MMDDYYYY
            m = date[:2]
            d = date[2:4]
            y = date[4:]
            y = y[2:] if len(y) == 4 else y
            new.append(f"{m}/{d}/{y}")
    return new
inconsistent_dates = '1/6/46 7/28/99 11272000 11/28/78'.split(' ')
pd.to_datetime(pd.Series(fix_dates(inconsistent_dates)))
0 2046-01-06
1 1999-07-28
2 2000-11-27
3 1978-11-28
dtype: datetime64[ns]
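Note the first result: pandas' two-digit-year pivot maps 46 to 2046, which is almost certainly wrong for a birth date. Here is a minimal sketch of a variant that expands to four-digit years instead of truncating; the century_cutoff pivot and the fixed-width MMDDYYYY rule for unseparated input are assumptions, not part of the original answer:
import pandas as pd
import re

def fix_dates_4y(dates, century_cutoff=25):
    # expand two-digit years: <= cutoff -> 20xx, otherwise 19xx (assumed pivot)
    def expand(y):
        return y if len(y) == 4 else ('20' if int(y) <= century_cutoff else '19') + y
    new = []
    for date in dates:
        chunks = re.split(r"[\/\.\-]", date)
        if len(chunks) != 3:
            chunks = [date[:2], date[2:4], date[4:]]  # assumed fixed-width MMDDYYYY
        m, d, y = chunks[0].zfill(2), chunks[1].zfill(2), expand(chunks[2])
        new.append(f"{m}/{d}/{y}")
    return new

pd.to_datetime(pd.Series(fix_dates_4y(['1/6/46', '7/28/99', '11272000', '11/28/78'])),
               format="%m/%d/%Y")
# 0   1946-01-06   (now in the 20th century)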

Related

Resample a data frame into n-month periods with arbitrary end-of-period months

I want to resample() my daily data into six-month chunks. However, I want the ends of the six-month chunks to be the ends of April and October. If I use df.resample('6M').sum() (or df.groupby(pd.Grouper(freq='6M')).sum()), the end of the first six-month chunk is the end of the first month in the data. I know about anchored offsets, but I do not know how to create a custom anchored offset (e.g., '6M-APR' does not work).
Here is some example code:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(
    data={'logret': np.random.randn(1000)},
    index=pd.date_range(start='2001-05-25', periods=1000, freq='B')
)
df.resample('6M').sum()
Which yields the following output:
logret
2001-05-31 2.2950148716254297
2001-11-30 -12.536360930670858
2002-05-31 5.468848462868161
2002-11-30 13.027927629740189
2003-05-31 -10.37282118563155
2003-11-30 -0.156275418330286
2004-05-31 -3.0768727498370905
2004-11-30 28.328856464071546
2005-05-31 -3.6462613215100546
I have not achieved my goal (six-month resampling that ends in April and October) with the start, offset, and loffset arguments to .resample().
I have achieved my goal with the hack below. However, it loses the date index, and I would like a more robust/repeatable approach.
def sixmonth(d, b=4):
    # label periods as year + .1 (Nov-Apr, ending in April) or year + .2 (May-Oct, ending in October)
    y, m, h = d.year, d.month, 1
    if (m > (b + 6)): y += 1
    elif (m > b): h += 1
    return y + h/10

df.groupby(sixmonth).sum()
Which yields the following output without a date:
logret
2001.2 -10.300839024148
2002.1 9.321994034984547
2002.2 8.855517878860585
2003.1 -2.4576797445001493
2003.2 -7.002919570231796
2004.1 -9.36895555474087
2004.2 27.13038641177464
2005.1 3.154551390326532
Of course, I could improve this hack. But is there a better/robust/repeatable solution for n-period resampling that ends in arbitrary months?
Another workaround, keeping the datetime index:
def custom_6M(df, month=4):
    df = df.resample("M").sum()
    df = df.rolling(6).sum()
    return df[df.index.month.isin([month, month + 6])]
>>> custom_6M(df)
logret
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
It's a pain. When I needed something similar, I ended up with the following approach:
anchor_month = 4
non_months = (anchor_month + 3) % 12, (anchor_month + 9) % 12
df = df.resample('Q-APR').sum()
df = (df.reset_index()
        .groupby(df.index.month.isin(non_months).cumsum())
        .agg({'index': 'last', 'logret': 'sum'})
        .set_index('index'))
Result here:
logret
index
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
2005-04-30 3.154551
But the problem is that sometimes the last index doesn't fit (it is okay here). That can be fixed by another '6M' resample. Overall: not pretty.
Thanks for the answers.
I have two more options.
Append a time-stamped series to df to anchor the six-month resampling periods
I hoped that .resample()'s origin argument would let me manually anchor my six-month resampling periods. It doesn't, but the following code does.
df.append(pd.Series(name=pd.to_datetime('2001-04-30'), dtype='float')).resample('6M').sum()
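A hedged aside: DataFrame.append was deprecated and removed in pandas 2.0, so on newer versions the same anchoring trick should work via pd.concat with an all-NaN anchor row (assuming the same '2001-04-30' anchor date as above):
anchor = pd.DataFrame(index=[pd.Timestamp('2001-04-30')], columns=['logret'], dtype='float')
pd.concat([df, anchor]).resample('6M').sum()  # sum() skips the anchor row's NaN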
Improve my sixmonth() function to use timestamps
def sixmonth(d, m=6, n=4):
    # months to roll forward so the date lands on the next anchor month-end (n=4 -> April/October)
    o = (m - (d.month - n)) % m
    return d + pd.offsets.MonthEnd(o)
I first .resample('M') to make sure that I have end-of-month dates.
I could modify sixmonth() to check for end-of-month dates, but I'm more afraid of finding some new edge case than a little inefficiency.
df.resample('M').sum().groupby(sixmonth).sum()
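For intuition, a quick sanity check of the mapping (assuming the sixmonth() defaults above; note it relies on month-end inputs, which is exactly why the .resample('M') step matters):
sixmonth(pd.Timestamp('2001-07-31'))  # Timestamp('2001-10-31'): July joins the period ending in October
sixmonth(pd.Timestamp('2001-11-30'))  # Timestamp('2002-04-30'): November rolls into the next April period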

Adding Leading Zeros to a field with MM:SS time data

I have the following data, showing race finish times and pace:
As you can see, the data doesn't show the hour for people who finished before the one-hour mark. In order to do some analysis, I need to convert the values into a time format, but pandas doesn't recognize the bare MM:SS format. How can I pad '0:' in front of the rows where the hour is missing?
I'm sorry, this is my first time posting.
Assuming your data is in CSV format:
import pandas as pd

# read in the data file
df = pd.read_csv('data_file.csv')
# replace spaces with '_' in column names
df.columns = [c.replace(' ', '_') for c in df.columns]

for i, row in df.iterrows():
    val_initial = str(row.Gun_time)
    val_stripped = val_initial.replace(':', '')
    # fewer than 5 digits means there is no hour component, so prepend '0:'
    if len(val_stripped) < 5:
        df.at[i, 'Gun_time'] = "0:" + val_initial

# save the newly edited csv file
df.to_csv('new_data_file.csv')
Before:
Gun time
0 28:48
1 29:11
2 1:01:51
3 55:01
4 2:08:11
After:
Gun_time
0 0:28:48
1 0:29:11
2 1:01:51
3 0:55:01
4 2:08:11
You can try applying the following function to the columns you want to change, then convert them to timedelta:
df['Gun time'] = df['Gun time'].apply(lambda x: '0:' + x if len(x) == 5
                                      else ('0:0' + x if len(x) == 4 else x))
df['Gun time'] = pd.to_timedelta(df['Gun time'])
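For larger frames, a vectorized sketch of the same padding (assuming a 'Gun time' column of strings) avoids the row-wise apply; any value with only one ':' is missing its hour:
s = df['Gun time'].astype(str)
padded = s.where(s.str.count(':') == 2, '0:' + s)  # prepend '0:' only where the hour is absent
df['Gun time'] = pd.to_timedelta(padded)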

How to get overlap between two date ranges that have a start and end time from a csv?

I already asked a similar question (Determining how one date/time range overlaps with the second date/time range?) and was able to piece some more of it together, but I need more help.
I want to check when two date ranges, each with a start and end date/time, overlap. My type2 has about 50 rows while type1 has over 500. I want to take the start and end of each type2 range and see if it falls within a type1 range. Here is a snip of the data; the dates change down the list from 2019-04-01 to the following days.
type1 type1_start type1_end
a 2019-04-01T00:43:18.046Z 2019-04-01T00:51:35.013Z
b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z
c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z
d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z
type2 type2_start type2_end
1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z
2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z
3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z
4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z
I have been googling the best way to do this and have read through Determine Whether Two Date Ranges Overlap, and I understand how it should be done, but I don't know enough about how to reference the variables and make them work.
Here is what I have, but I am stuck and have no clue where to go from here:
import pandas as pd
from pandas import Timestamp
import numpy as np
from collections import namedtuple

colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
data = pd.read_csv('test.csv', names=colnames, parse_dates=['type1_start', 'type1_end', 'type2_start', 'type2_end'])

A_start = data['type1_start']
A_end = data['type1_end']
B_start = data['type2_start']
B_end = data['type2_end']
t1 = data['type1']
t2 = data['type2']

r1 = (B_start, B_end)
r2 = (A_start, A_end)

def doesOverlap(r1, r2):
    if B_start > A_start:
        swap(r1, r2)
    if A_start > B_end:
        return false
    return true
It would be nice to have a csv with a true or false overlap result. I was also able to make my data run using Efficiently find overlap of date-time ranges from 2 dataframes, but the results weren't correct: I added a couple of rows that I know should overlap to the data, and it didn't catch them. I'd need each type2 start/end to be checked against each type1 range.
Any help would be greatly appreciated.
Here is one way to do it:
import pandas as pd

def overlaps(row):
    if ((row['type1_start'] < row['type2_start'] and row['type2_start'] < row['type1_end'])
            or (row['type1_start'] < row['type2_end'] and row['type2_end'] < row['type1_end'])):
        return True
    else:
        return False

colnames = ['type1', 'type1_start', 'type1_end', 'type2', 'type2_start', 'type2_end']
df = pd.read_csv('test.csv', names=colnames, parse_dates=[
    'type1_start', 'type1_end', 'type2_start', 'type2_end'])
df['overlap'] = df.apply(overlaps, axis=1)
print('\n', df)
gives:
type1 type1_start type1_end type2 type2_start type2_end overlap
0 type1 type1_start type1_end type2 type2_start type2_end False
1 a 2019-03-01T00:43:18.046Z 2019-04-02T00:51:35.013Z 1 2019-04-01T00:35:12.061Z 2019-04-01T00:37:00.783Z True
2 b 2019-04-01T02:16:46.490Z 2019-04-01T02:23:23.887Z 2 2019-04-02T00:37:15.077Z 2019-04-02T00:39:01.393Z False
3 c 2019-04-01T03:49:31.981Z 2019-04-01T03:55:16.153Z 3 2019-04-03T00:39:18.268Z 2019-04-03T00:41:01.844Z False
4 d 2019-04-01T05:21:22.131Z 2019-04-01T05:28:05.469Z 4 2019-04-04T00:41:21.576Z 2019-04-04T00:43:02.071Z False
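One caveat, as a hedged note: the strict conditions above miss the case where the type2 range fully contains the type1 range. The standard interval test (two ranges overlap iff each starts before the other ends) covers containment in both directions:
def overlaps(row):
    # overlap iff each range starts before the other ends;
    # this also catches one range fully containing the other
    return (row['type1_start'] < row['type2_end']
            and row['type2_start'] < row['type1_end'])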
Below df1 contains type1 records and df2 contains type2 records:
df_new = df1.assign(key=1)\
.merge(df2.assign(key=1), on='key')\
.assign(has_overlap=lambda x: ~((x.type2_start > x.type1_end) | (x.type2_end < x.type1_start)))
REF: Performant cartesian product (CROSS JOIN) with pandas

Compare each pair of dates in two columns in python efficiently

I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And I have begun to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works; it is just taking a really long time (over 15 minutes). I need more efficient code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
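If you also want the labels the question writes into the Correct column, a small sketch with np.select does it in one shot (assuming the same dates frame):
import numpy as np

dates['Correct'] = np.select(
    [dates['Start'] < dates['End'], dates['Start'] == dates['End']],
    ['correct', 'same'],
    default='incorrect',  # the only remaining case is Start > End
)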
Since the rows don't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparisons simultaneously. Take a look at the multiprocessing module for help; see the sketch below.
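A minimal sketch of that idea, assuming the dates frame from the question and an arbitrary four-way split (in practice the vectorized comparison above will almost always be faster for a task this simple):
import numpy as np
from multiprocessing import Pool

def check_chunk(chunk):
    # label every row of one chunk
    return np.select(
        [chunk['Start'] < chunk['End'], chunk['Start'] == chunk['End']],
        ['correct', 'same'],
        default='incorrect',
    )

if __name__ == '__main__':
    chunks = np.array_split(dates, 4)  # four worker chunks
    with Pool(4) as pool:
        dates['Correct'] = np.concatenate(pool.map(check_chunk, chunks))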
Something like the following may be quicker:
import pandas as pd
import datetime

df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447 µs / 3 rows) × 14,000 = 149 µs × 14,000 ≈ 2.09 s, a fair bit shorter than 15 minutes :)

Timestamp from date, and formatting in panda or csv

I have a function that outputs a dataframe generated from a RINEX (GPS) file. At present, the dataframe is output into separate satellite (1-32) files. I'd like to access the first column (either while it's still a dataframe or in these new files) in order to convert the date to a timestamp in seconds, like below:
Epochs Epochs
2014-04-27 00:00:00 -> 00000
2014-04-27 00:00:30 -> 00030
2014-04-27 00:01:00 -> 00060
This requires stripping the date away, then converting hh:mm:ss to seconds. I've hit a wall trying to figure out how best to access this first column (Epochs) and then make the conversion on the entire column. The code I have been working on is:
def read_data(self, RINEXfile):
    obs_data_chunks = []
    while True:
        obss, _, _, epochs, _ = self.read_data_chunk(RINEXfile)
        if obss.shape[0] == 0:
            break
        obs_data_chunks.append(pd.Panel(
            np.rollaxis(obss, 1, 0),
            items=['G%02d' % d for d in range(1, 33)],
            major_axis=epochs,
            minor_axis=self.obs_types
        ).dropna(axis=0, how='all').dropna(axis=2, how='all'))
    obs_data_chunks_dataframe = obs_data_chunks[0]
    for sv in range(32):
        sat = obs_data_chunks_dataframe[sv, :]
        print "sat_columns: {0}".format(sat.columns[0])  # list header of first column: L1
        sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
Do I perform this conversion within the dataframe, i.e. on "sat", or on the files after using "to_csv"? I'm a bit lost here. Same question for formatting the columns; see the not-so-nicely formatted columns below:
Epochs L1 L2 P1 P2 C1 S1 S2
2014-04-27 00:00:00 669486.833 530073.33 24568752.516 24568762.572 24568751.442 43.0 38.0
2014-04-27 00:00:30 786184.519 621006.551 24590960.634 24590970.218 24590958.374 43.0 38.0
2014-04-27 00:01:00 902916.181 711966.252 24613174.234 24613180.219 24613173.065 42.0 38.0
2014-04-27 00:01:30 1019689.006 802958.016 24635396.428 24635402.41 24635395.627 42.0 37.0
2014-04-27 00:02:00 1136478.43 893962.705 24657620.079 24657627.11 24657621.828 42.0 37.0
UPDATE:
By saying that I've hit a wall trying to figure out how best to access this first column (Epochs), I mean that the "sat" dataframe originally had no "Epochs" in its header. It simply had the signals:
L1 L2 P1 P2 C1 S1 S2
The index (date&time) was missing from the header. In order to overcome this in my csv output files, I "forced" the name with:
sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
I would expect that before generating the csv files, I should (but don't know how) be able to access this index (date&time) column and simply convert all the dates/times in one swoop, so that the timestamps are output.
UPDATE:
The epochs are generated in the dataframe in another function as so:
epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
UPDATE:
def read_data_chunk(self, RINEXfile, CHUNK_SIZE=10000):
    obss = np.empty((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.float64) * np.NaN
    llis = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    signal_strengths = np.zeros((CHUNK_SIZE, TOTAL_SATS, len(self.obs_types)), dtype=np.uint8)
    epochs = np.zeros(CHUNK_SIZE, dtype='datetime64[us]')
    flags = np.zeros(CHUNK_SIZE, dtype=np.uint8)
    i = 0
    while True:
        hdr = self.read_epoch_header(RINEXfile)
        #print hdr
        if hdr is None:
            break
        epoch, flags[i], sats = hdr
        epochs[i] = np.datetime64(epoch)
        sat_map = np.ones(len(sats)) * -1
        for n, sat in enumerate(sats):
            if sat[0] == 'G':
                sat_map[n] = int(sat[1:]) - 1
        obss[i], llis[i], signal_strengths[i] = self.read_obs(RINEXfile, len(sats), sat_map)
        i += 1
        if i >= CHUNK_SIZE:
            break
    return obss[:i], llis[:i], signal_strengths[:i], epochs[:i], flags[:i]
UPDATE:
My apologies if my description was somewhat vague. Actually I'm modifying code already developed, and I'm not a SW developer, so it's a steep learning curve for me too. Let me explain further: the "Epochs" are read from another function:
def read_epoch_header(self, RINEXfile):
    epoch_hdr = RINEXfile.readline()
    if epoch_hdr == '':
        return None
    year = int(epoch_hdr[1:3])
    if year >= 80:
        year += 1900
    else:
        year += 2000
    month = int(epoch_hdr[4:6])
    day = int(epoch_hdr[7:9])
    hour = int(epoch_hdr[10:12])
    minute = int(epoch_hdr[13:15])
    second = int(epoch_hdr[15:18])
    microsecond = int(epoch_hdr[19:25])  # Discard the least significant digits (use microseconds only).
    epoch = datetime.datetime(year, month, day, hour, minute, second, microsecond)
    flag = int(epoch_hdr[28])
    if flag != 0:
        raise ValueError("Don't know how to handle epoch flag %d in epoch header:\n%s", (flag, epoch_hdr))
    n_sats = int(epoch_hdr[29:32])
    sats = []
    for i in range(0, n_sats):
        if ((i % 12) == 0) and (i > 0):
            epoch_hdr = RINEXfile.readline()
        sats.append(epoch_hdr[(32+(i%12)*3):(35+(i%12)*3)])
    return epoch, flag, sats
In the above read_data function, these are appended into a dataframe. I basically want this dataframe separated by its satellite axis, so that each satellite file has the epochs in the first column, followed by the 7 signals. The last bit of code in read_data (below) shows this:
for sv in range(32):
    sat = obs_data_chunks_dataframe[sv, :]
    print "sat_columns: {0}".format(sat.columns[0])  # list header of first column: L1
    sat.to_csv(('SV_{0}').format(sv+1), index_label="Epochs", sep='\t')
The problems here are: (1) I want the first column to hold timestamps (so, strip the date and convert so that midnight = 00000 s and 23:59:59 = 86399 s), not the values as they are now; and (2) I want the columns aligned, so I can eventually manipulate them further using a different class to perform other calculations, i.e. L1 minus L2 plotted against time, etc.
It will be much quicker to do this while it's still a df. If the dtype is datetime64, just convert to int64 and then integer-divide by 10**9 to go from nanoseconds to seconds:
In [241]:
df['Epochs'].astype(np.int64) // 10**9
Out[241]:
0 1398556800
1 1398556830
2 1398556860
3 1398556890
4 1398556920
Name: Epochs, dtype: int64
If it's a string then convert using to_datetime and then perform the above:
df['Epochs'] = pd.to_datetime(df['Epochs']).astype(np.int64) // 10**9
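The above gives Unix epoch seconds. Since the question asks for seconds since midnight (00000-86399), a small hedged follow-up (assuming an 'Epochs' column of datetimes) subtracts each day's midnight first:
epochs = pd.to_datetime(df['Epochs'])
# seconds since that day's midnight: 00:00:00 -> 0, 00:00:30 -> 30, ...
df['Epochs'] = (epochs - epochs.dt.normalize()).dt.total_seconds().astype(int)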
see related
I resolved part of this myself in the end: in the read_epoch_header function, I simply manipulated a variable that converted just hh:mm:ss to seconds, and used this as the epoch. It doesn't look that elegant, but it works. I just need to format the header so that it aligns with the columns (and that the columns are aligned too). Cheers, pymat
