From a source I retrieve some data in JSON format. I want to save this data (measurements over time) as a text file. I then want to repeatedly go back to the same source and check whether new measurements are available; if so, I want to append them to the existing measurements.
The data I get looks like this:
{"xyz":[{"unixtime":"1458255600","time":"00:00","day":"18\/03","value":"11","paramlabel":"30-500 mHz","popupcorr":"550","iconnr":"7","paramname":"30-500 mHz"},{"unixtime":"1458256200","time":"00:10","day":"18\/03","value":"14","paramlabel":"30-500 mHz","popupcorr":"550","iconnr":"7","paramname":"30-500 mHz"},etc.]}
I load this data into a pandas DataFrame to be able to work with it more easily. However, when I load it into a DataFrame, all columns are treated as strings. How can I make sure that the unixtime column is treated as a timestamp (such that I can convert it to a datetime)?
Use to_datetime and pass unit='s' to treat the value as epoch time, after converting the dtype to int using astype:
df['unixtime'] = pd.to_datetime(df['unixtime'].astype(int), unit='s')
Example:
In [162]:
pd.to_datetime(1458255600, unit='s')
Out[162]:
Timestamp('2016-03-17 23:00:00')
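In your case you'd first load the JSON into a DataFrame. A minimal sketch, assuming the payload has the "xyz" structure shown above (raw_text is an illustrative variable holding the JSON string):

import json
import pandas as pd

# Build a DataFrame from the list of records under the "xyz" key
payload = json.loads(raw_text)
df = pd.DataFrame(payload['xyz'])

# The values arrive as strings, so cast to int before treating them as epoch seconds
df['unixtime'] = pd.to_datetime(df['unixtime'].astype(int), unit='s')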
I am trying to convert some data from a .txt file to a dataframe to use for some analysis.
The form of the data in the .txt file is as follows:
DATE_TIME VELOC MEASURE
[m/s] [l/h]
A 09.01.2023 12:45:20 ??? ???
A 09.01.2023 12:46:20 0,048 52,67
A 09.01.2023 12:47:20 0,049 53,77
A 09.01.2023 12:48:20 0,050 54,86
I load the data into a dataframe with no problem; I convert the string values of the measurements to float, etc., and everything is good, as shown in the image.
The problem comes when I try to convert the DATE_TIME column, which is a string, to pandas datetime format using this line of code:
volume_flow['DATE_TIME'] = pd.to_datetime(volume_flow['DATE_TIME'], format = '%d.%m.%Y %H:%M:S')
and I get the following error:
ValueError: time data '09.01.2023 12:46:20' does not match format '%d.%m.%Y %H:%M:S' (match)
but I don't see how the format is off.
I am really lost as to why this happens, as I have used the same code with different datetime formats before with no problem.
Furthermore, I tried format = '%dd.%mm.%yyyy %H:%M:S' as well, with the same result, and when I let pandas.to_datetime infer the format automatically it confuses the day and the month. The data runs from 09.01 to 12.01, so you can't tell which value is the month and which is the day just by looking at them.
I think you should go from this
(..., format='%d.%m.%Y %H:%M:S')
to this
(..., format='%d.%m.%Y %H:%M:%S')
You forgot the percentage character!
Check the documentation for the correct time format directives. You will note that the directive %S represents the seconds:
Second as a decimal number [00,61].
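With the percent sign in place, the failing example from the question parses cleanly (a quick check):

import pandas as pd

# '%S' (with the leading %) matches the seconds field; a plain 'S' is a literal character
pd.to_datetime('09.01.2023 12:46:20', format='%d.%m.%Y %H:%M:%S')
# Timestamp('2023-01-09 12:46:20')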
I have a vaex dataframe that reads from an hdf5 file. It has a date column which is read as a string. I converted it into a datetime. However, I am not able to do any date comparisons. I can extract the day, month, year, etc. from the date, so the conversion is correct. But how do I perform operations like "date is between x and y"?
import vaex
import datetime
vaex_df=vaex.open('filename.hdf5')
vaex_df['pDate']=vaex_df.Date.values.astype('datetime64[ns]')
The datatypes are as expected:
print(vaex_df.dtypes)
## Date <class 'str'>
## pDate datetime64[ns]
Now I need to filter out rows based on some date
start_date=datetime.date(2019,10,1)
vaex_df=vaex_df[(vaex_df.pDate.dt>=start_date)]
print(vaex_df) # throws SyntaxError: invalid token
I get an "invalid token" error when I try to look at the new dataframe.
I can extract the month and year separately and apply the filter, but that would give a wrong result:
vaex_df=vaex_df[(vaex_df.pDate.dt.month>int(str(start_date)[5:7]))&(vaex_df.pDate.dt.year>=int(str(start_date)[:4]))]
How do I do date range comparison operations in vaex?
datetime64 from numpy works:
import numpy as np

# Instead of
# start_date = datetime.date(2019, 10, 1)
# use
start_date = np.datetime64('2019-10-01')
On the vaex dataframe
vaex_df=vaex_df[(vaex_df.pDate>=start_date)]
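To answer the "between x and y" part: once the bounds are np.datetime64 values you can combine two comparisons. A sketch, reusing the pDate column from above (the bounds are illustrative):

import numpy as np

start_date = np.datetime64('2019-10-01')
end_date = np.datetime64('2019-12-31')

# Keep only rows whose pDate falls within [start_date, end_date]
vaex_df = vaex_df[(vaex_df.pDate >= start_date) & (vaex_df.pDate <= end_date)]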
I am using pandas to convert a column having date and time into seconds by using the following code:
df['date_time'] = pd.to_timedelta(df['date_time'])
df['date_time'] = df['date_time'].dt.total_seconds()
If I use the following code instead:
df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')
df['date_time'] = df['date_time'].dt.total_seconds()
print(df.head())
then I get the following error:
AttributeError: 'DatetimeProperties' object has no attribute 'total_seconds'
The same happens with dt.timestamp.
So my queries are:
Is it necessary to convert the time to seconds for training the model? If yes, then how, and if not, why?
This one is related to two other columns named weather_m and weather_d: weather_m has 38 different categories, of which only one is true at a time, and weather_d has 11, behaving the same way. So I am a bit confused whether to split this categorical data into 49 new columns merged into the original dataset (dropping weather_m and weather_d) to train the model, or to use LabelEncoder instead of pd.get_dummies.
Converting a datetime or timestamp into a timedelta (a duration) doesn't make sense. It would only make sense if you wanted the duration between the given timestamp and some other reference date; then you can get the timedelta just by using - to take the difference between two dates.
Since your datetime column is a string, you also need to convert it to a datetime first:
df['date_time'] = pd.to_datetime(df['date_time'], format='%m/%d/%Y %H:%M')
Then you can try something like:
ref_date = datetime.datetime(1970, 1, 1, 0, 0)
df['secs_since_epoch'] = (df['date_time'] - ref_date).dt.total_seconds()
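Putting that together, a minimal runnable sketch (the sample strings and the secs_since_epoch column name are illustrative):

import datetime
import pandas as pd

df = pd.DataFrame({'date_time': ['07/14/2017 14:30', '07/14/2017 14:31']})

# Parse the strings into real datetimes first
df['date_time'] = pd.to_datetime(df['date_time'], format='%m/%d/%Y %H:%M')

# Seconds elapsed since an explicit reference date (here, the Unix epoch)
ref_date = datetime.datetime(1970, 1, 1, 0, 0)
df['secs_since_epoch'] = (df['date_time'] - ref_date).dt.total_seconds()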
If the different categories are totally distinct from each other (and they don't, e.g., have an implicit ordering to them), then yes, you should use one-hot encoding, replacing the original columns. Since the number of categories is small, that should be fine.
(Though it also depends what exactly you're going to run on this data; some libraries might be OK with the original categorical column and do the conversion implicitly for you.)
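For reference, one-hot encoding with pd.get_dummies could look like this (a sketch; the column names come from the question):

import pandas as pd

# Expand weather_m and weather_d into one indicator column per category;
# get_dummies drops the original columns when 'columns' is given
df = pd.get_dummies(df, columns=['weather_m', 'weather_d'])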
The default format of the csv is dd/mm/yyyy. When I convert it to datetime with df['Date'] = pd.to_datetime(df['Date']), it changes the format to mm/dd/yyyy.
Then I used df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%d/%m/%Y') to convert it to dd/mm/yyyy, but then the values are strings (object dtype). However, I need them in datetime format. When I apply pd.to_datetime again (df['Date'] = pd.to_datetime(df['Date'])), it goes back to the previous format. Need your help.
You can use the parse_dates and dayfirst arguments of pd.read_csv, see: the docs for read_csv()
df = pd.read_csv('myfile.csv', parse_dates=['Date'], dayfirst=True)
This will read the Date column as datetime values, correctly taking the first part of the date input as the day. Note that in general you will want your dates to be stored as datetime objects.
Then, if you need to output the dates as a string you can call dt.strftime():
df['Date'].dt.strftime('%d/%m/%Y')
When I use again this: df['Date'] = pd.to_datetime(df['Date']), it gets back to the previous format.
No, you cannot simultaneously have the string format of your choice and keep your series of type datetime. As remarked here:
datetime series are stored internally as integers. Any
human-readable date representation is just that, a representation,
not the underlying integer. To access your custom formatting, you can
use methods available in Pandas. You can even store such a text
representation in a pd.Series variable:
formatted_dates = df['datetime'].dt.strftime('%m/%d/%Y')
The dtype of formatted_dates will be object, which indicates
that the elements of your series point to arbitrary Python objects. In
this case, those arbitrary objects happen to all be strings.
Lastly, I strongly recommend you do not convert a datetime series
to strings until the very last step in your workflow. This is because
as soon as you do so, you will no longer be able to use efficient,
vectorised operations on such a series.
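A quick sketch of the distinction (illustrative values):

import pandas as pd

s = pd.to_datetime(pd.Series(['25/12/2021', '26/12/2021']), dayfirst=True)
print(s.dtype)          # datetime64[ns] -- the real datetime series

formatted = s.dt.strftime('%d/%m/%Y')
print(formatted.dtype)  # object -- just strings, only useful for display/output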
This solution handles columns with mixed date formats. Add more conditions to the function if needed. The pandas to_datetime() function alone was not working for me, but this approach seems to work well.
import datetime
import pandas as pd

def format_date(val):
    # Normalise whatever comes in to an 'mm/dd/yyyy' string first
    a = pd.to_datetime(val, errors='coerce', cache=False).strftime('%m/%d/%Y')
    try:
        # Prefer a day-first reading when it is valid...
        date_time_obj = datetime.datetime.strptime(a, '%d/%m/%Y')
    except ValueError:
        # ...otherwise fall back to month-first
        date_time_obj = datetime.datetime.strptime(a, '%m/%d/%Y')
    return date_time_obj.date()
Saving the changes to the same column.
df['Date'] = df['Date'].apply(format_date)
Saving as CSV.
df.to_csv(f'{file_name}.csv', index=False, date_format='%s')
I was having trouble manipulating time-series data provided to me for a project. The data contains the number of flight bookings made on a website per second over a duration of 30 minutes. Here is a part of the column containing the timestamp:
>>> df['Date_time']
0 7/14/2017 2:14:14 PM
1 7/14/2017 2:14:37 PM
2 7/14/2017 2:14:38 PM
I wanted to do
>>> df = df.set_index('Date_time')
and use the datetime and timedelta methods provided by pandas to generate the timestamp to be used as index to access and modify any value in any cell.
Something like
>>> import datetime as dt
>>> td = dt.datetime(year=2017, month=7, day=14, hour=2, minute=14, second=36)
>>> td1 = dt.timedelta(minutes=1, seconds=58)
>>> ti1 = td1 + td
>>> df.at[ti1, 'column_name'] = 65000
But the timestamp generated is of the form
>>> print(ti1)
2017-07-14 02:16:34
which cannot be directly used as an index in my case, as can be clearly seen (the existing index holds strings like '7/14/2017 2:14:14 PM'). Is there a workaround for the above case without writing additional methods myself?
I want to do the above because it gives me a greater level of control over the data than looking up the default numerical index of each row I want to update, and hence will be more efficient, according to me.
Can you check the dtype of the 'Date_time' column and confirm for me that it is a string (object)?
df.dtypes
If so, you should be able to cast the values to pd.Timestamp by using the following.
df['timestamp'] = df['Date_time'].apply(pd.Timestamp)
When we call .dtypes now, we should have a 'timestamp' field of type datetime64[ns], which allows us to use builtin pandas methods more easily.
I would suggest it is prudent to index the dataframe by the timestamp too, achieved by setting the index equal to that column.
df.set_index('timestamp', inplace=True)
We should now be able to use some more useful methods such as
df.loc[timestamp_to_check, :]
df.loc[start_time_stamp : end_timestamp, : ]
df.asof(timestamp_to_check)
to lookup values from the DataFrame based upon passing a datetime.datetime / pd.Timestamp / np.datetime64 into the above. Note that you will need to cast any string (object) 'lookups' to one of the above types in order to make use of the above correctly.
I prefer to use pd.Timestamp() - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timestamp.html to handle datetime conversion from strings unless I am explicitly certain of what format the datetime string is always going to be in.
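For example, a lookup along those lines might look like this (a sketch reusing the sample timestamps and the 'column_name' placeholder from the question):

import pandas as pd

# Cast the string column, index by it, then look rows up by timestamp
df['timestamp'] = df['Date_time'].apply(pd.Timestamp)
df.set_index('timestamp', inplace=True)

df.loc[pd.Timestamp('2017-07-14 14:14:37'), :]                                        # single row
df.loc[pd.Timestamp('2017-07-14 14:14:14'):pd.Timestamp('2017-07-14 14:14:38'), :]   # range
df.at[pd.Timestamp('2017-07-14 14:14:38'), 'column_name'] = 65000                     # update a cell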