Python Pandas: Converting Dataframes and setting Indexes - python

I have to do the tasks below, but some parts do not work out properly. Here are the steps:
Converting a dataframe with unix timestamps to a dataframe with datetime values, works with the following code:
datetime_df = pd.to_datetime(unix_df, unit='s')
Resampling the datetime with smpl = datetime_df[0].resample('10min')
Converting it back to unix timestamp format with: unix_df = datetime_df.astype(np.int64) // 10 ** 9
Steps 1 and 3 work, but step 2 fails because pandas tells me it needs a DatetimeIndex and I don't know how to set one. The strange thing is that the index disappears completely after to_datetime, so I tried creating a list and building a dataframe from it again, but it still didn't work. Can somebody help me?
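A minimal sketch of the fix, with hypothetical example data: resample needs the datetimes on the index, not in the values, so set the converted timestamps as the index before calling it.

```python
import pandas as pd

# Hypothetical example data: a Series of unix timestamps (seconds).
unix_df = pd.Series([1568926746, 1568926806, 1568927346, 1568930346])

# Step 1: convert epoch seconds to datetime values.
datetime_vals = pd.to_datetime(unix_df, unit='s')

# Step 2: resample requires a DatetimeIndex, so move the datetimes
# onto the index first (here we count samples per 10-minute bin).
df = pd.DataFrame({'ts': datetime_vals}).set_index('ts')
counts = df.resample('10min').size()

# Step 3: convert the DatetimeIndex back to unix timestamps.
unix_again = df.index.astype('int64') // 10 ** 9
print(unix_again.tolist())
```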

Related

How to create a pandas dataframe using a list of 'epoch dates' into '%Y-%m-%d %s:%m:%f%z' format?

My objective is to create the following pandas dataframe (with the 'date_time' column in '%Y-%m-%d %s:%m:%f%z' format):
batt_no date_time
3 4 2019-09-19 20:59:06+00:00
4 5 2019-09-19 23:44:07+00:00
5 6 2019-09-20 00:44:06+00:00
6 7 2019-09-20 01:14:06+00:00
But the constraint is that I don't want to first create a dataframe as follows and then convert the 'date_time' column into the above format.
batt_no date_time
3 4 1568926746
4 5 1568936647
5 6 1568940246
6 7 1568942046
I need to directly create it by converting two lists of values into the desired dataframe.
The following is what I've tried but I get an error
(please note: the 'date_time' values are in epoch format which I need to specify but have them converted into the '%Y-%m-%d %s:%m:%f%z' format):
pd.DataFrame({'batt_volt': [4,5,6,7],
              'date_time': [1568926746,1568936647,1568940246,1568942046].dt.strftime('%Y-%m-%d %s:%m:%f%z')},
             index=[3,4,5,6])
Can anyone please help?
Edit Note: My question is different from the one asked here.
The question there deals with conversion of a single value of pandas datetime to unix timestamp. Mine's different because:
My timestamp values are slightly different from any of the types mentioned there
I don't need to convert any timestamp value, rather create a full-fledged dataframe having values of the desired timestamp - in a particular manner using lists that I've clearly mentioned in my question.
I've clearly stated the way I've attempted the process, but it requires some modifications in order to run without error, which in no way is similar to the question asked in the aforementioned link.
Hence, my question is definitely different. I'd request to kindly reopen it.
As suggested, I'm posting the solution from the comments as an answer here.
pd.DataFrame({'batt_volt':[4,5,6,7], 'date_time': pd.to_datetime([1568926746,1568936647,1568940246,1568942046], unit='s', utc=True).strftime('%Y-%m-%d %s:%m:%f%z')}, index=[3,4,5,6])
pd.to_datetime works with a date, or a list of dates, and the input can be in many formats, including epoch ints. The unit keyword ensures those ints are interpreted as a number of seconds since 1970-01-01, rather than as ms, μs, ns, ...
So it is quite easy to use it to create the list of dates directly while building the DataFrame.
Since a list of strings in a specific format was wanted, we can use .strftime on the result. When to_datetime is called with a list it returns a DatetimeIndex, and .strftime works on those too, applying the format to every datetime in the list; so we get a list of strings in the wanted format. (By the way, outside any specific context, I maintain that it is probably preferable to store datetimes and convert to strings only for I/O operations. But I don't know the specific context.)
The last remaining problem was the timezone, and here there is no perfect solution, because a simple int like those we had at the beginning does not carry a timezone. By default, to_datetime creates naive datetimes, without a timezone (like those ints), so they are relative to whatever timezone the user decides they are.
to_datetime can create timezone-aware datetimes, but only in UTC, which is done with the keyword argument utc=True.
With that we get timezone-aware datetimes, assuming the ints we provided were a number of seconds since 1970-01-01 00:00:00+00:00.
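The one-liner can be checked step by step; a minimal sketch using the epoch values from the question (note I use the conventional `%H:%M:%S%z` format codes here rather than the question's literal `%s:%m:%f%z` string, which mixes up the directives):

```python
import pandas as pd

# Epoch seconds from the question, converted in one step to
# timezone-aware (UTC) timestamps.
epochs = [1568926746, 1568936647, 1568940246, 1568942046]
dt_index = pd.to_datetime(epochs, unit='s', utc=True)

# strftime on a DatetimeIndex formats every element at once.
as_strings = dt_index.strftime('%Y-%m-%d %H:%M:%S%z')

df = pd.DataFrame({'batt_volt': [4, 5, 6, 7],
                   'date_time': as_strings},
                  index=[3, 4, 5, 6])
print(df)
```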

How can I subset based on smaller than time XYZ?

I am trying to create a new data frame that excludes all timestamps bigger/later than 12:00:00. I tried multiple approaches but I keep having an issue. The time was previously a datetime column that I changed into 2 columns, a date column and a time column (format datetime.time).
Code:
Issue thrown out:
Do you have any suggestions to be able to do this properly?
Alright, the solution:
Change the dtype from datetime.time back to datetime.
Use the between method, by setting the datetime column as the index and setting the range.
The result is a subset of the dataframe based on the needed time frame.
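A minimal sketch of those steps on hypothetical data (column names invented for illustration; between_time is used here as the concrete form of the "between" step, since the datetimes end up on the index):

```python
import datetime
import pandas as pd

# Hypothetical data: times stored as datetime.time objects.
df = pd.DataFrame({'time': [datetime.time(9, 30), datetime.time(12, 30),
                            datetime.time(15, 0)],
                   'value': [1, 2, 3]})

# 1) Convert datetime.time back to full datetimes (the date part is a
#    dummy; only the time of day matters for the filter).
df['dt'] = pd.to_datetime(df['time'].astype(str))

# 2) Set the datetime column as the index and keep only rows at or
#    before 12:00:00.
subset = df.set_index('dt').between_time('00:00:00', '12:00:00')
print(subset)
```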

Trying to convert a column with strings to float via Pandas

Hi, I have looked on Stack Overflow but have not found a solution for my problem. Any help highly appreciated.
After importing a csv I noticed that all the types of the columns are object and not float.
My goal is to convert all the columns except the YEAR column to float. I have read that you first have to strip whitespace from the columns, then convert NaNs to 0, and then try to convert the strings to floats. But with the code below I'm getting an error.
My code in Jupyter notes is:
And I get the following error.
How do I have to change the code?
All the columns except the YEAR column have to be set to float.
If you can help me set the column Year to datetime that would be also very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so when you do df['BBPWN'] you get a dataframe with those two columns, and df['BBPWN'].str will then fail.
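A slightly more defensive sketch of the stripping-and-converting step on hypothetical data (the column names here are invented): pd.to_numeric with errors='coerce' turns anything unparseable into NaN, which is then filled with 0.

```python
import pandas as pd

# Hypothetical data: numbers read from CSV as strings, with stray
# whitespace and a missing value.
df = pd.DataFrame({'YEAR': ['2019', '2020'],
                   'PRICE': [' 1.5', '2.0 '],
                   'QTY': ['3', None]})

num_cols = [c for c in df.columns if c != 'YEAR']
for c in num_cols:
    # Strip blanks, coerce bad values to NaN, then NaN -> 0.
    df[c] = pd.to_numeric(df[c].str.strip(), errors='coerce').fillna(0)

df['YEAR'] = df['YEAR'].astype(int)
print(df.dtypes)
```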

Reading Date times from Excel to Python using Pandas

I'm trying to read an Excel file into Python and then split the data into numbers (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (so 12:30:00) they expect for it to be recognized as a time. However python (currently) treats it as dtype object.
If I specify the column with parse_dates then it works; however, since I don't know what the data is in advance, I ideally want this to be done automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, however, I would want this to be done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
datetime_col = df.select_dtypes(object)
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(datetime_col.loc[0]).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Eventually, you can convert it back to datetime.
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I guess this is a good way to start tackling your particular case.
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column into a timedelta format. If it isn't capable of transforming it, it returns a value error and moves onto the next column.
After being run any columns that could be recognized as a timedelta format are transformed.
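A self-contained sketch of that loop on hypothetical data (a time-of-day column next to an ordinary text column, names invented for illustration):

```python
import pandas as pd

# Hypothetical upload: one column holds times as text, another is
# ordinary text that cannot be parsed as a timedelta.
df = pd.DataFrame({'start': ['12:30:00', '08:15:00'],
                   'name': ['alpha', 'beta']})

for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass  # not a time-like column; leave it untouched

print(df)
```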

Python Pandas: Overwriting an Index with a list of datetime objects

I have an input CSV with timestamps in the header like this (the number of timestamps forming columns is several thousand):
header1;header2;header3;header4;header5;2013-12-30CET00:00:00;2013-12-30CET00:01:00;...;2014-00-01CET00:00:00
In pandas 0.12 I was able to do the following to convert string timestamps into datetime objects. The code strips out the 'CEST' in the timestamp string (translate()), reads it in as a datetime (strptime()), and then localizes it to the correct timezone (localize()). [The reason for this approach was that, with the versions I had at least, CEST wasn't being recognised as a timezone.]
DF = pd.read_csv('some_csv.csv',sep=';')
transtable = string.maketrans(string.uppercase,' '*len(string.uppercase))
tz = pytz.country_timezones('nl')[0]
timestamps = DF.columns[5:]
timestamps = map(lambda x:x.translate(transtable), timestamps)
timestamps = map(lambda x:datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'), timestamps)
timestamps = map(lambda x: pytz.timezone(tz).localize(x), timestamps)
DF.columns[5:] = timestamps
However, my downstream code required that I run off of pandas 0.16
While running on 0.16, I get this error with the above code at the last line of the above snippet:
*** TypeError: Indexes does not support mutable operations
I'm looking for a way to overwrite my index with the datetime object. Using the method to_datetime() doesn't work for me, returning:
*** ValueError: Unknown string format
I have some subsequent code that copies, then drops, the first few columns of data in this dataframe (all the 'header1; header2; header3' columns), leaving just the timestamps. The purpose is to then transpose and index by the timestamp.
So, my question:
Either:
how can I overwrite a series of column names with a datetime, such that I can pass in a pre-arranged set of timestamps that pandas will be able to recognise as a timestamp in subsequent code (in pandas v0.16)
Or:
Any other suggestions that achieve the same effect.
I've explored set_index(), replace(), to_datetime() and reindex(), and possibly some others, but none seem to be able to achieve this overwrite. Hopefully this is simple to do, and I'm just missing something.
TIA
I ended up solving this by the following:
The issue was that I had several thousand column headers with timestamps, that I couldn't directly parse into datetime objects.
So, in order to get these timestamp objects incorporated, I added a new column called 'Time', put the datetime objects in there, and then set the index to the new column (I'm omitting the code where I purged the rows of other header data through drop() methods):
DF = DF.transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')
Summary: If you have a CSV with a set of timestamps in your headers that you cannot parse; a way around this is to parse them separately, include in a new column of Time with the correct datetime objects, then set_index() based on the new column.
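A compact sketch of that workaround on hypothetical data (column names and the 'Europe/Amsterdam' zone, i.e. pytz.country_timezones('nl')[0] from the question, are illustrative assumptions):

```python
import pandas as pd
import pytz

# Hypothetical CSV layout: a header column, then timestamp columns.
DF = pd.DataFrame([['a', 1, 2], ['b', 3, 4]],
                  columns=['header1', '2013-12-30CET00:00:00',
                           '2013-12-30CET00:01:00'])

# Parse the timestamp headers separately: strip the 'CET' marker,
# parse, then localize to the intended timezone.
tz = pytz.timezone('Europe/Amsterdam')
timestamps = [tz.localize(pd.Timestamp(c.replace('CET', ' ')))
              for c in DF.columns[1:]]

# Drop the header column, transpose, attach the parsed timestamps as
# a new 'Time' column, and make it the index.
DF = DF.drop(columns=['header1']).transpose()
DF['Time'] = timestamps
DF = DF.set_index('Time')
print(DF.index[0])
```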
