How to convert string from csv to hour(s):minute(s)? - python

This link shows my CSV file and the resulting graph.
I want to represent the AVG numbers (which are actually seconds) as hours:minutes on the y axis.
After three days on this problem I was starting to think it cannot be solved.
To be more precise: despite many conversions with datetime, timedelta and timestamp, nothing worked.
Either the data could not be shown on the y axis because it was not a number-like variable to plot, or I did not get a proper representation of the data.
I tried converting the seconds with divmod
and putting the results on top of the bars with annotate.
Later I used Timple.
I do not understand how I should create an acceptable datatype for this.

I've put together a related example using pandas DataFrame.plot:
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame()
>>> df["activity"] = ['run', 'swim', 'drive']
>>> df["avg"] = [86400, 43200, 21600]
>>> df
  activity    avg
0      run  86400
1     swim  43200
2    drive  21600
>>> df.plot.bar(x="activity")
<AxesSubplot: xlabel='activity'>
>>> plt.show()
To represent the time elapsed for a given number of seconds you can use fromtimestamp and strftime formatting, but the result is not directly compatible with matplotlib. Timple addresses exactly this: it extends matplotlib to plot timedelta data. Even so, the graph may not come out properly without first exploring the data or applying some additional processing.
>>> import datetime
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import timple
>>> tmpl = timple.Timple()
>>> tmpl.enable()
>>> timedeltas = np.array([datetime.timedelta(seconds=s) for s in df["avg"]])
>>> timedeltas
array([datetime.timedelta(days=1), datetime.timedelta(seconds=43200),
       datetime.timedelta(seconds=21600)], dtype=object)
>>> plt.plot(timedeltas, df["activity"])
[<matplotlib.lines.Line2D object at 0x0000026FAC3F5B40>]
>>> plt.show()
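The divmod idea from the question also works without Timple: keep the raw seconds for plotting and only format the tick labels. A minimal sketch (the fmt_hm helper name and its wiring are my own, not from the question):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs headless
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import pandas as pd

df = pd.DataFrame({"activity": ["run", "swim", "drive"],
                   "avg": [86400, 43200, 21600]})

def fmt_hm(seconds, pos=None):
    # divmod splits the seconds into whole hours and a remainder;
    # the remainder is then reduced to whole minutes
    hours, rem = divmod(int(seconds), 3600)
    return f"{hours}:{rem // 60:02d}"

ax = df.plot.bar(x="activity", y="avg", legend=False)
ax.yaxis.set_major_formatter(FuncFormatter(fmt_hm))  # y ticks as h:mm
ax.set_ylabel("avg (h:mm)")
```

The same fmt_hm helper can be reused with ax.bar_label or annotate to put the h:mm values on top of the bars.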

Related

Weighted average datetime, off but only for certain months

I am calculating the weighted average of a series of datetimes (and must be doing it wrong, since I can't explain the following):
import pandas as pd
import numpy as np
foo = pd.DataFrame({'date': ['2022-06-01', '2022-06-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
bar = np.average(foo['date'].view(dtype='float64'), weights=foo['value'])
print(np.array(bar).view(dtype='datetime64[ns]'))
returns
2022-06-14T15:16:21.818181818, which is expected.
Changing the month to July:
foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
bar = np.average(foo['date'].view(dtype='float64'), weights=foo['value'])
print(np.array(bar).view(dtype='datetime64[ns]'))
returns 2022-07-14T23:59:53.766924660,
when the expected result is
2022-07-14T15:16:21.818181818.
The expected result above is what Excel calculates as well.
What am I overlooking?
EDIT: Additional Detail
My real dataset is much larger and I'd like to use numpy if possible.
foo['date'] can be assumed to be dates with no time component, but the weighted average will have a time component.
I strongly suspect this is a resolution/rounding issue.
I'm assuming that for averaging, the dates are converted to timestamps and the result is then converted back to a datetime object. But pandas works in nanoseconds, so the timestamp values, multiplied by the weights 1000 and 10000 respectively, exceed 2**52 - i.e. they exceed the mantissa capacity of 64-bit floats.
Excel, by contrast, works in milliseconds, so there is no problem there; Python's datetime.datetime works in microseconds, so still no problem:
from datetime import datetime

dt01 = datetime(2022, 7, 1)
dt16 = datetime(2022, 7, 16)
datetime.fromtimestamp((dt01.timestamp()*1000 + dt16.timestamp()*10000)/11000)
# datetime.datetime(2022, 7, 14, 15, 16, 21, 818182)
So if you need to use numpy/pandas I suppose your best option is to convert dates to timedeltas from a "starting" date (i.e. define a "custom epoch") and compute the weighted average of those values.
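That custom-epoch suggestion can be sketched like this (using the earliest date as an assumed epoch, so the offsets stay small floats well within 64-bit precision):

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])

# use the earliest date as a "custom epoch"; offsets become small floats
epoch = foo['date'].min()
offset_s = (foo['date'] - epoch).dt.total_seconds()
avg_offset = np.average(offset_s, weights=foo['value'])
result = epoch + pd.to_timedelta(avg_offset, unit='s')
print(result)  # ~2022-07-14 15:16:21.818
```

This matches the Excel/datetime result instead of the drifting nanosecond one.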
First of all, I don't think the problem is in your code; I think pandas has a problem.
When you use the view command it translates the dates to very small numbers (on the order of e-198), so I believe it accidentally loses resolution.
I found a solution (I think it works, but it gave a 3-hour difference from your answer):
from datetime import datetime
import pandas as pd
import numpy as np
foo = pd.DataFrame({'date': ['2022-07-01', '2022-07-16'],
                    'value': [1000, 10000]})
foo['date'] = pd.to_datetime(foo['date'])
# bar = np.average([x.timestamp() for x in foo['date']], weights=foo['value'])
bar = np.average(foo['date'].apply(datetime.timestamp), weights=foo['value'])  # avoids .view()
print(datetime.fromtimestamp(bar))

getting mean values of dates in pandas dataframe

I can't seem to understand the difference between <M8[ns] and datetime formats, and how that difference relates to why the operations below do or don't work.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a'] = pd.to_datetime(df['a'])
# ultimate goal is to be able to go. * df.mean() * and be able to see mean DATE
# but this doesn't seem to work so...
df['a'].mean().strftime('%Y-%m-%d') ### ok this works... I can mess around and concat stuff...
# But why won't this work?
df2 = df.select_dtypes('datetime')
df2.mean() # WONT WORK
df2['a'].mean() # WILL WORK?
What I seem to be running into unless I am missing something is the difference between 'datetime' and '<M8[ns]' and how that works when I'm trying to get the mean date.
You can try passing numeric_only parameter in mean() method:
out=df.select_dtypes('datetime').mean(numeric_only=False)
Output of out:
a   2021-06-03 04:48:00
dtype: datetime64[ns]
Note: it will throw an error if the dtype is string.
The mean function you apply is different in each case.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a']=pd.to_datetime(df['a'])
df.mean()
This is the DataFrame mean function, which works on numeric data. To see which columns are numeric, do:
df._get_numeric_data()
     b
0  100
1  200
2    0
3  400
4  500
But df['a'] is a datetime series.
df['a'].dtype, type(df)
(dtype('<M8[ns]'), pandas.core.frame.DataFrame)
So df['a'].mean() applies a different mean function, one that works on datetime values. That's why df['a'].mean() outputs the mean of the datetime values.
df['a'].mean()
Timestamp('2021-06-03 04:48:00')
read more here:
difference-between-data-type-datetime64ns-and-m8ns
DataFrame.mean() ignores datetime series
#28108

Timestamp in Numpy

I am trying to extract data from a netcdf file using wrf-python. The data is for every hour. The date is being extracted as a number, not as a calendar date-time. First I extract the data, convert it to a flat np array, then try to save the file with the format '%s':
np.savetxt((stn + 'WRF_T2_T10_WS_WD.csv'), np.transpose(arr2D), fmt='%s', delimiter=',', header=headers, comments='')
Right now the date column shows raw numbers, but it needs to show calendar dates and times. Thanks
By convention, dates are frequently stored as an offset in seconds from Jan 1, 1970
For the case of converting seconds, this answer Python Numpy Loadtxt - Convert unix timestamp suggests converting them by changing their datatype (should be as efficient as possible as it dodges by-row loops, copying data, etc.)
x = np.asarray(x, dtype='datetime64[s]')
However, the E+18 postfix implies that if you really have a date, your timestamps are in nanoseconds, so datetime64[ns] may work for you
>>> import time
>>> import numpy as np
>>> a = np.array([time.time() * 10**9]) # epoch seconds to ns
>>> a # example array
array([1.60473147e+18])
>>> a = np.asarray(a, dtype='datetime64[ns]')
>>> a
array(['2020-11-07T06:44:29.714103040'], dtype='datetime64[ns]')
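Putting this together with np.savetxt: once the values are datetime64, they can be rendered as calendar text and written with fmt='%s'. A sketch with made-up E+18 values standing in for the question's data (np.datetime_as_string gives the ISO calendar form):

```python
import io
import numpy as np

# hypothetical nanosecond epoch values, like the E+18 column in the question
raw = np.array([1.6047e18, 1.6047864e18])
dates = np.asarray(raw, dtype='datetime64[ns]').astype('datetime64[s]')

buf = io.StringIO()  # stand-in for the real CSV file
labels = np.datetime_as_string(dates)  # e.g. '2020-11-06T22:00:00'
np.savetxt(buf, labels, fmt='%s', delimiter=',', header='date', comments='')
print(buf.getvalue())
```

For the real file, replace buf with the stn + 'WRF_T2_T10_WS_WD.csv' filename and apply the same conversion to the date column of arr2D before transposing.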

Python convert from ordinal time with milliseconds [duplicate]

I just started moving from Matlab to Python 2.7 and I have some trouble reading my .mat-files. Time information is stored in Matlab's datenum format. For those who are not familiar with it:
A serial date number represents a calendar date as the number of days that has passed since a fixed base date. In MATLAB, serial date number 1 is January 1, 0000.
MATLAB also uses serial time to represent fractions of days beginning at midnight; for example, 6 p.m. equals 0.75 serial days. So the string '31-Oct-2003, 6:00 PM' in MATLAB is date number 731885.75.
(taken from the Matlab documentation)
I would like to convert this to Pythons time format and I found this tutorial. In short, the author states that
If you parse this using python's datetime.fromordinal(731965.04835648148) then the result might look reasonable [...]
(before any further conversions), which doesn't work for me, since datetime.fromordinal expects an integer:
>>> datetime.fromordinal(731965.04835648148)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: integer argument expected, got float
While I could just round them down for daily data, I actually need to import minutely time series. Does anyone have a solution for this problem? I would like to avoid reformatting my .mat files since there's a lot of them and my colleagues need to work with them as well.
If it helps, someone else asked for the other way round. Sadly, I'm too new to Python to really understand what is happening there.
/edit (2012-11-01): This has been fixed in the tutorial posted above.
The solution you link to has a small issue; the corrected conversion is this:
python_datetime = datetime.fromordinal(int(matlab_datenum)) + timedelta(days=matlab_datenum%1) - timedelta(days = 366)
a longer explanation can be found here
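As a quick check, wrapping that one-liner in a helper reproduces the Matlab example quoted in the question ('31-Oct-2003, 6:00 PM' is datenum 731885.75):

```python
from datetime import datetime, timedelta

def matlab2datetime(matlab_datenum):
    # integer part -> proleptic ordinal day, fractional part -> time of day,
    # minus the 366-day offset between the Matlab and Python day counts
    return (datetime.fromordinal(int(matlab_datenum))
            + timedelta(days=matlab_datenum % 1)
            - timedelta(days=366))

print(matlab2datetime(731885.75))  # 2003-10-31 18:00:00
```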
Using pandas, you can convert a whole array of datenum values with fractional parts:
import numpy as np
import pandas as pd
datenums = np.array([737125, 737124.8, 737124.6, 737124.4, 737124.2, 737124])
timestamps = pd.to_datetime(datenums-719529, unit='D')
The value 719529 is the datenum value of the Unix epoch start (1970-01-01), which is the default origin for pd.to_datetime().
I used the following Matlab code to set this up:
datenum('1970-01-01') % gives 719529
datenums = datenum('06-Mar-2018') - linspace(0,1,6) % test data
datestr(datenums) % human readable format
Just in case it's useful to others, here is a full example of loading time series data from a Matlab mat file, converting a vector of Matlab datenums to a list of datetime objects using carlosdc's answer (defined as a function), and then plotting as time series with Pandas:
from scipy.io import loadmat
import pandas as pd
import datetime as dt
import urllib
# In Matlab, I created this sample 20-day time series:
# t = datenum(2013,8,15,17,11,31) + [0:0.1:20];
# x = sin(t)
# y = cos(t)
# plot(t,x)
# datetick
# save sine.mat
urllib.urlretrieve('http://geoport.whoi.edu/data/sine.mat','sine.mat');
# If you don't use squeeze_me = True, then Pandas doesn't like
# the arrays in the dictionary, because they look like arrays
# of 1-element arrays. squeeze_me=True fixes that.
mat_dict = loadmat('sine.mat',squeeze_me=True)
# make a new dictionary with just dependent variables we want
# (we handle the time variable separately, below)
my_dict = { k: mat_dict[k] for k in ['x','y']}
def matlab2datetime(matlab_datenum):
    day = dt.datetime.fromordinal(int(matlab_datenum))
    dayfrac = dt.timedelta(days=matlab_datenum % 1) - dt.timedelta(days=366)
    return day + dayfrac
# convert Matlab variable "t" into list of python datetime objects
my_dict['date_time'] = [matlab2datetime(tval) for tval in mat_dict['t']]
# plot with Pandas
df = pd.DataFrame(my_dict)
df = df.set_index('date_time')
df.plot()
# print df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 201 entries, 2013-08-15 17:11:30.999997 to 2013-09-04 17:11:30.999997
Data columns (total 2 columns):
x    201 non-null values
y    201 non-null values
dtypes: float64(2)
Here's a way to convert these using numpy.datetime64, rather than datetime.
origin = np.datetime64('0000-01-01', 'D') - np.timedelta64(1, 'D')
date = serdate * np.timedelta64(1, 'D') + origin
This works for serdate either a single integer or an integer array.
Just building on and adding to the previous comments. The key is in the day counting carried out by the method toordinal and the constructor fromordinal in the class datetime and related subclasses. For example, the Python Library Reference for 2.7 says that fromordinal
Return the date corresponding to the proleptic Gregorian ordinal, where January 1 of year 1 has ordinal 1. ValueError is raised unless 1 <= ordinal <= date.max.toordinal().
However, year 0 AD is still one (leap) year to count in, so there are still 366 days that need to be taken into account. (A leap year it was, like 2016, which lies exactly 504 four-year cycles later.)
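The two day-counting conventions can be checked directly; the Matlab value of 367 for 1-Jan-0001 follows from the documentation quoted earlier (datenum 1 is 1-Jan-0000, plus the 366 days of leap year 0):

```python
import datetime

# Python: proleptic Gregorian ordinal, 1 Jan 0001 has ordinal 1
assert datetime.date(1, 1, 1).toordinal() == 1

# Matlab: datenum 1 is 1 Jan 0000, so 1 Jan 0001 is datenum 367,
# hence the constant offset of 366 used throughout this thread
print(datetime.date(1, 1, 1).toordinal() + 366)  # 367
```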
These are two functions that I have been using for similar purposes:
import datetime

def datetime_pytom(d, t):
    '''
    Input
        d   Date as an instance of type datetime.date
        t   Time as an instance of type datetime.time
    Output
        The fractional day count since 0-Jan-0000 (proleptic ISO calendar)
        This is the 'datenum' datatype in matlab
    Notes on day counting
        matlab: day one is 1 Jan 0000
        python: day one is 1 Jan 0001
        hence an increase of 366 days, for year 0 AD was a leap year
    '''
    dd = d.toordinal() + 366
    tt = datetime.timedelta(hours=t.hour, minutes=t.minute,
                            seconds=t.second)
    tt = datetime.timedelta.total_seconds(tt) / 86400
    return dd + tt

def datetime_mtopy(datenum):
    '''
    Input
        The fractional day count according to the datenum datatype in matlab
    Output
        The date and time as an instance of type datetime in python
    Notes on day counting
        matlab: day one is 1 Jan 0000
        python: day one is 1 Jan 0001
        hence a reduction of 366 days, for year 0 AD was a leap year
    '''
    ii = datetime.datetime.fromordinal(int(datenum) - 366)
    ff = datetime.timedelta(days=datenum % 1)
    return ii + ff
Hope this helps and happy to be corrected.

Convert the following time info to something that pyplot can recognise

I have a DataFrame with two columns of time information. The first is the epoch time in seconds, and the second is the corresponding formatted str time like "2015-06-01T09:00:00+08:00" where "+08:00" denotes the timezone.
I'm aware that time formats are in a horrible mess in Python, and that matplotlib.pyplot seems to only recognise the datetime format. I tried several ways to convert the str time to datetime but none of them worked. When I use pd.to_datetime it converts to datetime64; pd.Timestamp converts to Timestamp; and even combinations of the two always output datetime64 or Timestamp, but never datetime. I also tried the method suggested in this answer. It didn't work. It's kind of driving me up the wall now.
Could anybody kindly figure out a quick way for this? Thanks!
I post a minimal example below:
import matplotlib.pyplot as plt
import time
import pandas as pd
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"]], columns=["epoch", "strtime"])
# didn't work
df["usable_time"] = pd.to_datetime(df["strtime"])
# didn't work either
df["usable_time"] = pd.to_datetime(df["strtime"].apply(lambda s: pd.Timestamp(s)))
# produced a strange type called "struct_time". Don't think it'd be compatible with pyplot
df["usable_time"] = df["epoch"].apply(lambda x: time.localtime(x))
# attempted to plot with pyplot
df["usable_time"] = pd.to_datetime(df["strtime"])
plt.plot(x=df["usable_time"], y=[0.123])
plt.show()
UPDATE (per comments)
It seems like the confusion here is stemming from the fact that the call to plt.plot() takes positional x/y arguments instead of keyword arguments. In other words, the appropriate signature is:
plt.plot(x, y)
Or, alternately:
plt.plot('x_label', 'y_label', data=obj)
But not:
plt.plot(x=x, y=y)
There's a separate discussion of why this quirk of Pyplot exists here, also see ImportanceOfBeingErnest's comments below.
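A minimal illustration of the accepted call signatures (run with an off-screen backend here; the data values are the ones from the question):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"usable_time": pd.to_datetime(["2015-06-01T09:00:00",
                                                  "2015-07-01T09:00:00"]),
                   "epoch": [1433120400, 1435712400]})

plt.plot(df["usable_time"], df["epoch"])   # positional x, y: works
plt.plot("usable_time", "epoch", data=df)  # column names + data=: works
# plt.plot(x=..., y=...) is NOT a supported signature
print(len(plt.gca().get_lines()))  # 2
```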
Original
This isn't really an answer, more of a demonstration that Pyplot doesn't have an issue with Pandas datetime data. I've added an extra row to df to make the plot clearer:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"],
                   [1433130400, "2015-07-01T09:00:00+08:00"]],
                  columns=["epoch", "strtime"])
df["usable_time"] = pd.to_datetime(df["strtime"])
df.dtypes
epoch                   int64
strtime                object
usable_time    datetime64[ns]
dtype: object
plt.plot(df.usable_time, df.epoch)
pd.__version__ # '0.23.3'
matplotlib.__version__ # '2.2.2'
You can use to_pydatetime (from the dt accessor or Timestamp) to get back native datetime objects if you really want to, e.g.:
pd.to_datetime(df["strtime"]).dt.to_pydatetime()
This will return an array of native datetime objects:
array([datetime.datetime(2015, 6, 1, 1, 0)], dtype=object)
However, pyplot seems to be able to work with pandas datetime series.
