How can I subset based on smaller than time XYZ? - python

I am trying to create a new data frame that excludes all timestamps later than 12:00:00. I tried multiple approaches but keep running into an issue. The time was previously a datetime column that I split into two columns, a date column and a time column (of type datetime.time).
Code:
Issue thrown out:
Do you have any suggestions to be able to do this properly?

Alright, the solution:
Change the data type of the time column from datetime.time back to datetime.
Set the datetime column as the index and use the between_time method with the desired range.
The result is a subset of the dataframe restricted to the needed time frame.
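The steps above can be sketched roughly as follows. The column names `timestamp` and `value` and the sample rows are made up for illustration, since the original code was not posted. Note that `between_time` includes both endpoints by default, so a row at exactly 12:00:00 is kept.

```python
import pandas as pd

# Hypothetical sample data; the real column names were not shown in the question.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-01-01 08:30:00",
        "2021-01-01 11:59:59",
        "2021-01-01 12:00:00",
        "2021-01-01 15:45:00",
    ]),
    "value": [1, 2, 3, 4],
})

# between_time needs a DatetimeIndex, so the datetime column becomes the index.
indexed = df.set_index("timestamp")

# Keep only rows whose time of day lies between midnight and noon (inclusive).
morning = indexed.between_time("00:00:00", "12:00:00")
print(morning)
```

If the datetime column should stay a regular column afterwards, `morning.reset_index()` turns the index back into a column.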

Related

ValueError: time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex When using Pandas Interpolate using Time method

I am using a dataset found on the Kaggle website (https://www.kaggle.com/claytonmiller/lbnl-automated-fault-detection-for-buildings-data) specifically the 'RTU.CSV'.
I have converted the timestamp to DateTime using following code:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
yet when I try to use the Pandas Interpolate using Time method
df.interpolate(method = "time")
The Error I get is
ValueError: time-weighted interpolation only works on Series or
DataFrames with a DatetimeIndex
Can anyone explain what this means?
The error means that method="time" weights the interpolation by the time values in the index, so the Series or DataFrame must have a DatetimeIndex. Converting the Timestamp column with pd.to_datetime is not enough, because it is still an ordinary column and the index is still the default integer index.
Note that calling it on the column alone raises the same error, because that Series still has the default integer index:
df['Timestamp'].interpolate(method = "time")
If you wish to turn your timestamp column into the index:
df.set_index(df['Timestamp'], inplace=True)
Edit after seeing the dataset: my guess is that you might need something more powerful than interpolate if you want to, basically, predict all column values based on the timestamp and historical data. interpolate is meant more for filling gaps in a column, for example. As your timestamps are pretty regular, you can also choose to assume the rest of the data is partially independent of them and call interpolate on each column one by one (the method might need to be changed). But since big chunks of data are missing at the start, I am not sure how good interpolate's guesses would be.
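A minimal sketch of the DatetimeIndex requirement, using a tiny made-up frame rather than the Kaggle RTU.CSV data (the `temp` column and its values are assumptions for illustration). Once the timestamps are the index, `method="time"` weights the fill by elapsed time instead of by row position.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap and unevenly spaced timestamps.
df = pd.DataFrame({
    "Timestamp": ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 04:00"],
    "temp": [10.0, np.nan, 40.0],
})

# Convert to datetime and move the column into the index.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df = df.set_index("Timestamp")

# With a DatetimeIndex in place, time-weighted interpolation works.
filled = df.interpolate(method="time")
print(filled)
```

The missing value sits one hour into a four-hour gap, so the time-weighted result is a quarter of the way from 10 to 40 (17.5), whereas plain linear interpolation by position would give 25.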

Remove date from datetime in csv

For a project in Python we need to use a CSV file with several columns and create an ML model. My problem is that one column is a datetime, and the date is useless for the predictions, but I don't know how to remove it, as it is in the same column as the time (so I can't just drop the column), like this:
26.03.2018 00:00:00
Can you help me remove the date somehow? I tried different methods for handling 'datetime' but none worked so far.
data = pd.read_csv("TotalTrafo.csv")
dir(data)
type(data.Trafo1)
pandas.core.series.Series
Just do:
df['DateTime column']=df['DateTime column'].dt.time
to get only the time.
For a single datetime object foo you can call foo.time() to get only the time (foo.date() for the date, and so on).
If your pandas series does not contain datetime objects you can convert it to datetime by doing something like this
data['Trafo1'] = pd.to_datetime(data['Trafo1'])
#or
data.Trafo1 = pd.to_datetime(data.Trafo1)
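Putting the two answers together, a rough sketch might look like this. The column name `DateTime` and the second sample row are assumptions; only the `26.03.2018 00:00:00` value comes from the question. Because the dates are day-first, passing an explicit `format` avoids any ambiguity for values such as `01.04.2018`.

```python
import datetime
import pandas as pd

# Hypothetical frame; the real file is TotalTrafo.csv with an unnamed datetime column.
data = pd.DataFrame({
    "DateTime": ["26.03.2018 00:00:00", "01.04.2018 13:30:00"],
})

# Parse the strings with an explicit day-first format.
parsed = pd.to_datetime(data["DateTime"], format="%d.%m.%Y %H:%M:%S")

# Keep only the time of day as datetime.time objects.
data["time"] = parsed.dt.time

# Possible numeric alternative for a model: seconds since midnight.
data["seconds"] = parsed.dt.hour * 3600 + parsed.dt.minute * 60 + parsed.dt.second
print(data)
```

Whether a model handles `datetime.time` objects depends on the library, which is why the numeric column is included here as one possible option.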

How can I eliminate timezone awareness from timestamps in python on elements in a pandas series?

I have a few different data frames with a similar data structure, but the contents of those data frames are slightly different. Specifically, the datetime format of the datetime fields is different- in some cases, the timestamps are timezone aware, in other cases, they are not. I need to find the minimum range of timestamps that overlap all three dataframes, such that the data in the final dataframes exclusively overlaps the same time periods.
The approach I wanted to take was to take the minimum start time from each of the starttime timestamps in each dataframe, and then take the max of those, and then repeat (but invert) the process for the endtimes. However, when I do this I get an error indicating I cannot compare timestamps with different timezone awareness. I've taken a few different approaches- using tz_convert on the timestamp series, as below:
model_output_dataframes['workRequestSplitEndTime']= pd.to_datetime(model_output_dataframes['workRequestSplitEndTime'], infer_datetime_format=True).tz_convert(None)
this generates the error
TypeError: index is not a valid DatetimeIndex or PeriodIndex
So I tried converting it into a datetimeindex, and then converting it:
model_output_dataframes['workRequestSplitEndTime']= pd.DatetimeIndex(pd.to_datetime(model_output_dataframes['workRequestSplitEndTime'], infer_datetime_format=True)).tz_convert(None)
and this generates a separate error:
ValueError: cannot reindex from a duplicate axis
So at this point, I'm somewhat stuck - I feel like after my conversions I'm back at the place I started.
I would appreciate any help you can give me.
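One possible approach, sketched under assumptions: `Series.tz_convert` acts on the index, while the `.dt` accessor acts on the values, so `.dt.tz_convert(None)` is the call that converts an aware column to UTC and drops the timezone. The frames and the column names `start` and `end` below are made up, and the sketch assumes the naive timestamps are already in UTC.

```python
import pandas as pd

def to_naive_utc(series):
    """Return the series as timezone-naive datetimes (aware values become UTC)."""
    series = pd.to_datetime(series)
    if series.dt.tz is not None:
        # .dt works on the values; plain Series.tz_convert would target the index.
        series = series.dt.tz_convert(None)
    return series

# Hypothetical frames: one timezone-aware, two naive (assumed to be UTC already).
frame_a = pd.DataFrame({
    "start": pd.to_datetime(["2021-01-01 09:00", "2021-01-01 10:00"], utc=True),
    "end": pd.to_datetime(["2021-01-01 11:00", "2021-01-01 12:00"], utc=True),
})
frame_b = pd.DataFrame({
    "start": pd.to_datetime(["2021-01-01 10:00", "2021-01-01 11:00"]),
    "end": pd.to_datetime(["2021-01-01 12:00", "2021-01-01 13:00"]),
})
frame_c = pd.DataFrame({
    "start": pd.to_datetime(["2021-01-01 08:00", "2021-01-01 09:30"]),
    "end": pd.to_datetime(["2021-01-01 10:30", "2021-01-01 11:30"]),
})
frames = [frame_a, frame_b, frame_c]

# Latest of the earliest starts, earliest of the latest ends.
overlap_start = max(to_naive_utc(f["start"]).min() for f in frames)
overlap_end = min(to_naive_utc(f["end"]).max() for f in frames)
print(overlap_start, overlap_end)
```

If the naive timestamps are actually local times, they would first need `.dt.tz_localize(...)` with the right zone before being converted.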

How can I select multiple date columns in a dataframe in pandas, then format them all? (python)

I have a large dataset with multiple date columns that I need to clean up, mostly by removing the time stamp since it is all 00:00:00. I want to write a function that collects all columns whose type is datetime, then formats all of them, instead of having to handle each one individually.
I figured it out. This is what I came up with and it works for me:
def tidy_dates(df):
    for col in df.select_dtypes(include="datetime64[ns, UTC]"):
        df[col] = df[col].dt.strftime("%Y-%m-%d")
    return df
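The function above only picks up UTC-aware columns. A slightly more general variant, offered as a sketch, selects both naive and timezone-aware datetime columns through the `"datetime"` and `"datetimetz"` selectors and works on a copy. The function name `tidy_all_dates` and the sample frame are made up for illustration.

```python
import pandas as pd

def tidy_all_dates(df):
    """Return a copy with every datetime column formatted as YYYY-MM-DD strings."""
    out = df.copy()
    # "datetime" matches naive columns, "datetimetz" matches timezone-aware ones.
    for col in out.select_dtypes(include=["datetime", "datetimetz"]).columns:
        out[col] = out[col].dt.strftime("%Y-%m-%d")
    return out

# Hypothetical frame with one naive, one aware, and one non-date column.
df = pd.DataFrame({
    "created": pd.to_datetime(["2021-01-01", "2021-02-03"]),
    "updated": pd.to_datetime(["2021-03-04", "2021-05-06"]).tz_localize("UTC"),
    "name": ["a", "b"],
})

tidy = tidy_all_dates(df)
print(tidy)
```

Note that `strftime` turns the columns into strings; if they should remain dates, `.dt.normalize()` or `.dt.date` may be a better fit.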

Python Pandas: Overwriting an Index with a list of datetime objects

I have an input CSV with timestamps in the header like this (the number of timestamps forming columns is several thousand):
header1;header2;header3;header4;header5;2013-12-30CET00:00:00;2013-12-30CET00:01:00;...;2014-00-01CET00:00:00
In Pandas 0.12 I was able to do this, to convert string timestamps into datetime objects. The following code strips out the 'CEST' in the timestamp string (translate()), reads it in as a datetime (strptime()) and then localizes it to the correct timezone (localize()) [The reason for this approach was because, with the versions I had at least, CEST wasn't being recognised as a timezone].
DF = pd.read_csv('some_csv.csv',sep=';')
transtable = string.maketrans(string.uppercase,' '*len(string.uppercase))
tz = pytz.country_timezones('nl')[0]
timestamps = DF.columns[5:]
timestamps = map(lambda x:x.translate(transtable), timestamps)
timestamps = map(lambda x:datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'), timestamps)
timestamps = map(lambda x: pytz.timezone(tz).localize(x), timestamps)
DF.columns[5:] = timestamps
However, my downstream code required that I run off of pandas 0.16
While running on 0.16, I get this error with the above code at the last line of the above snippet:
*** TypeError: Indexes does not support mutable operations
I'm looking for a way to overwrite my index with the datetime object. Using the method to_datetime() doesn't work for me, returning:
*** ValueError: Unknown string format
I have some subsequent code that copies, then drops, the first few columns of data in this dataframe (all the 'header1; header2, header3'), leaving just the timestamps. The purpose is to then transpose, and index by the timestamp.
So, my question:
Either:
how can I overwrite a series of column names with a datetime, such that I can pass in a pre-arranged set of timestamps that pandas will be able to recognise as a timestamp in subsequent code (in pandas v0.16)
Or:
Any other suggestions that achieve the same effect.
I've explored set_index(), replace(), to_datetime() and reindex() and possibly some others, but none seem to be able to achieve this overwrite. Hopefully this is simple to do, and I'm just missing something.
TIA
I ended up solving this by the following:
The issue was that I had several thousand column headers with timestamps, that I couldn't directly parse into datetime objects.
So, in order to get these timestamp objects incorporated, I added a new column called 'Time', put the datetime objects in there, and then set the index to the new column (I'm omitting code where I purged the rows of other header data through drop() methods):
DF = DF.transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')
Summary: If you have a CSV with a set of timestamps in your headers that you cannot parse, a way around this is to parse them separately, put the resulting datetime objects in a new 'Time' column, then call set_index() on the new column.
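The workaround in the summary can be sketched end to end with current pandas and Python 3. This is not the original code: the inline CSV, the two metadata columns, and the `Europe/Amsterdam` zone are assumptions for illustration, and the header cleanup uses string methods on the Index instead of `string.maketrans`.

```python
import io
import pandas as pd

# Hypothetical miniature of the input: two metadata columns, two timestamp columns.
csv_text = (
    "header1;header2;2013-12-30CET00:00:00;2013-12-30CET00:01:00\n"
    "a;b;1.0;2.0\n"
    "c;d;3.0;4.0\n"
)
df = pd.read_csv(io.StringIO(csv_text), sep=";")

n_meta = 2  # number of leading non-timestamp columns (assumed)

# Parse the timestamp headers separately: drop the zone label, then localize.
cleaned = df.columns[n_meta:].str.replace("CET", " ", regex=False)
timestamps = pd.to_datetime(cleaned, format="%Y-%m-%d %H:%M:%S")
timestamps = timestamps.tz_localize("Europe/Amsterdam")

# Drop the metadata columns, transpose, and index by the parsed timestamps.
values = df.drop(columns=df.columns[:n_meta]).transpose()
values["Time"] = timestamps
values = values.set_index("Time")
print(values)
```

The original slice assignment fails because an Index is immutable; replacing the whole Index at once (for example by assigning a complete new list to `DF.columns`) is allowed, which is another route to the same result.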
