I ran into a behaviour in pandas that I did not expect. Here is the code, running on Python 3.10.8.
In [1]: from datetime import datetime, timezone
In [2]: import pandas
In [3]: pandas.__version__
Out[3]: '1.4.4'
In [4]: df = pandas.DataFrame(data={"end_date": [datetime(2022, 1, 20, tzinfo=timezone.utc)]})
In [5]: df.end_date.dt.tz
Out[5]: datetime.timezone.utc
In [6]: df.fillna(value={"end_date": datetime.now(tz=timezone.utc)}).end_date.dt.tz
In [7]: df.assign(end_date=lambda df: df["end_date"].fillna(datetime.now(tz=timezone.utc))).end_date.dt.tz
Out[7]: datetime.timezone.utc
As you can see, when using .fillna(value={...}), the timezone information is lost even when there is no value to fill. It is kept when fillna is called on the column itself, without a dictionary.
Is this expected?
Thanks in advance.
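For reference, here is a minimal sketch of the variant from the session above that kept the timezone: filling the column as a Series instead of passing a dictionary to DataFrame.fillna. The fixed fill timestamp and the extra missing row are assumptions added so the result is reproducible; the original uses datetime.now and has nothing to fill. Behaviour may differ between pandas versions.

```python
from datetime import datetime, timezone

import pandas as pd

# One real value and one missing value, both in a timezone-aware column.
df = pd.DataFrame({"end_date": pd.to_datetime(["2022-01-20", None], utc=True)})

# Fixed fill value (an assumption for reproducibility; the question uses datetime.now).
fill_value = datetime(2022, 2, 1, tzinfo=timezone.utc)

# Fill at the Series level, then write the column back with assign.
filled = df.assign(end_date=df["end_date"].fillna(fill_value))

print(filled["end_date"].dt.tz)
print(filled)
```

Checking .dt.tz right after the fill is a cheap way to notice if a later pandas version changes this behaviour.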
I can't work out what the difference is between <M8[ns] and datetime dtypes, or how it relates to why the operations below do or don't work.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a'] = pd.to_datetime(df['a'])
# ultimate goal is to be able to go. * df.mean() * and be able to see mean DATE
# but this doesn't seem to work so...
df['a'].mean().strftime('%Y-%m-%d') ### ok this works... I can mess around and concat stuff...
# But why won't this work?
df2 = df.select_dtypes('datetime')
df2.mean() # WONT WORK
df2['a'].mean() # WILL WORK?
What I seem to be running into, unless I am missing something, is the difference between 'datetime' and '<M8[ns]' and how that matters when I'm trying to get the mean date.
You can try passing the numeric_only parameter to the mean() method:
out=df.select_dtypes('datetime').mean(numeric_only=False)
output of out:
a 2021-06-03 04:48:00
dtype: datetime64[ns]
Note: this raises an error if the column's dtype is string.
The mean function you apply is different in each case.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a']=pd.to_datetime(df['a'])
df.mean()
This is the DataFrame mean function, which works on numeric data. To see which columns count as numeric, run:
df._get_numeric_data()
b
0 100
1 200
2 0
3 400
4 500
But df['a'] is a datetime series.
df['a'].dtype, type(df)
(dtype('<M8[ns]'), pandas.core.frame.DataFrame)
So df['a'].mean() applies a different mean function, one that works on datetime values. That's why df['a'].mean() outputs the mean of the datetime values.
df['a'].mean()
Timestamp('2021-06-03 04:48:00')
Read more here:
difference-between-data-type-datetime64ns-and-m8ns
DataFrame.mean() ignores datetime series #28108
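As a minimal sketch of the point above (the Series mean handles datetimes, so it can be applied column by column), the following collects the mean of every datetime column into a plain dict. The variable names datetime_cols and means are illustrative, not from the original posts.

```python
import pandas as pd

my_dates = ['2021-02-03', '2021-02-05', '2020-12-25', '2021-12-27', '2021-12-12']
my_numbers = [100, 200, 0, 400, 500]
df = pd.DataFrame({'a': pd.to_datetime(my_dates), 'b': my_numbers})

# Keep only the datetime columns, then call the Series-level mean on each one.
datetime_cols = df.select_dtypes(include='datetime')
means = {name: col.mean() for name, col in datetime_cols.items()}

print(means['a'].strftime('%Y-%m-%d'))  # 2021-06-03
```

This avoids relying on how DataFrame.mean treats datetime columns, which has changed between pandas versions.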
I've always been confused about the interaction between Python's standard library datetime objects and Numpy's datetime objects. The following code gives an error, which baffles me.
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype=np.datetime64)
now = datetime.now()
b[0] = np.datetime64(now)
This gives the following error:
TypeError: Cannot cast NumPy timedelta64 scalar from metadata [us] to according to the rule 'same_kind'
What am I doing wrong here?
np.datetime64 is a class, whereas np.dtype('datetime64[us]') is a NumPy dtype:
import numpy as np
print(type(np.datetime64))
# <class 'type'>
print(type(np.dtype('datetime64[us]')))
# <class 'numpy.dtype'>
Specify the dtype of b using the NumPy dtype, not the class:
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype='datetime64[us]')
# b = np.empty((1,), dtype=np.dtype('datetime64[us]')) # also works
now = datetime.now()
b[0] = np.datetime64(now)
print(b)
# ['2019-05-30T08:55:43.111008']
Note that datetime64[us] is just one of a number of possible dtypes. For
instance, there are datetime64[ns], datetime64[ms], datetime64[s],
datetime64[D], datetime64[Y] dtypes, depending on the desired time
resolution.
datetime.datetime.now() returns a datetime with microsecond resolution,
so I chose datetime64[us] to match.
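To illustrate how the resolution is part of the dtype, here is a small sketch that inspects the unit of an array and converts it to other resolutions with astype. The fixed timestamp stands in for datetime.now() so the result is reproducible; that substitution is an assumption for the example.

```python
from datetime import datetime

import numpy as np

# Fixed value instead of datetime.now(), so the output does not change between runs.
stamp = datetime(2019, 5, 30, 8, 55, 43, 111008)

b = np.empty((1,), dtype='datetime64[us]')
b[0] = np.datetime64(stamp)

# np.datetime_data reports the unit and step count stored in the dtype.
unit, count = np.datetime_data(b.dtype)

# Converting to a coarser unit truncates; converting to a finer unit keeps the value.
as_days = b.astype('datetime64[D]')
as_ns = b.astype('datetime64[ns]')

print(unit, count)
print(b, as_days, as_ns.dtype)
```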
Is there a way to use NumPy's vectorization capabilities for a boolean operation on an array of datetime objects, where you want to compare attributes of the datetime objects?
My naive first attempt was:
import datetime as dtm
import numpy as np
dt = np.array([dtm.datetime(2014,1,4,12,2,1), dtm.datetime(2014,1,4,12,2,1), dtm.datetime(2014,1,6,12,2,1), dtm.datetime(2014,1,5,12,2,1), dtm.datetime(2014,1,4,12,2,1), dtm.datetime(2013,1,4,13,3,1), dtm.datetime(2013,1,5,22,2,1)])
bool = (dt.year == 2014)
That gave me the error:
AttributeError: 'numpy.ndarray' object has no attribute 'year'
which was obvious in retrospect.
I don't think that my second naive attempt was vectorizable, but thought it would get the job done:
bool = np.array([dts.year == 2014 for dts in dt])
However, I get the error:
SyntaxError: invalid syntax
I don't understand what I am doing wrong in this statement.
I would prefer a vectorizable solution, and I can do this using a for loop, but I think I should at least be able to do this in one line similar to my second attempt.
Is it possible to vectorize this statement? If not, what am I doing wrong in my second attempt? Thanks.
Or you could, as you said, vectorize.
import datetime as dtm
import numpy as np
dt = np.array([dtm.datetime(2014,1,4,12,2,1), dtm.datetime(2014,1,4,12,2,1), dtm.datetime(2014,1,6,12,2,1), dtm.datetime(2014,1,5,12,2,1), dtm.datetime(2014,1,4,12,2,1), dtm.datetime(2013,1,4,13,3,1), dtm.datetime(2013,1,5,22,2,1)])
is_2014 = np.vectorize(lambda d: d.year == 2014)
bool_ = is_2014(dt)
Note that np.vectorize does not necessarily provide better performance than a pure Python loop and primarily serves as syntactic sugar.
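To make the note above concrete, this sketch runs np.vectorize next to a plain list comprehension on the same data and checks that they agree. The comprehension is the one-liner from the question with renamed variables; on its own it is valid Python, so the SyntaxError reported there presumably came from elsewhere in the session (an assumption, since the question does not show enough to tell).

```python
import datetime as dtm

import numpy as np

dt = np.array([dtm.datetime(2014, 1, 4, 12, 2, 1), dtm.datetime(2014, 1, 4, 12, 2, 1),
               dtm.datetime(2014, 1, 6, 12, 2, 1), dtm.datetime(2014, 1, 5, 12, 2, 1),
               dtm.datetime(2014, 1, 4, 12, 2, 1), dtm.datetime(2013, 1, 4, 13, 3, 1),
               dtm.datetime(2013, 1, 5, 22, 2, 1)])

# np.vectorize wraps a Python-level loop; it calls the lambda once per element.
is_2014 = np.vectorize(lambda d: d.year == 2014)
via_vectorize = is_2014(dt)

# The same loop written as a list comprehension.
via_comprehension = np.array([d.year == 2014 for d in dt])

print(via_vectorize)
print(np.array_equal(via_vectorize, via_comprehension))
```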
You can use pandas:
>>> import datetime as dtm
>>> import pandas as pd
>>> ser = pd.Series([dtm.datetime(2014,1,4,12,2,1),
...                  dtm.datetime(2014,1,4,12,2,1),
...                  dtm.datetime(2014,1,6,12,2,1),
...                  dtm.datetime(2014,1,5,12,2,1),
...                  dtm.datetime(2014,1,4,12,2,1),
...                  dtm.datetime(2013,1,4,13,3,1),
...                  dtm.datetime(2013,1,5,22,2,1)])
>>> ser[ser.dt.year==2014]
0 2014-01-04 12:02:01
1 2014-01-04 12:02:01
2 2014-01-06 12:02:01
3 2014-01-05 12:02:01
4 2014-01-04 12:02:01
dtype: datetime64[ns]
Or the bools as NumPy array:
>>> (ser.dt.year==2014).values
array([ True, True, True, True, True, False, False], dtype=bool)
Try numpy's own datetime64 dtype. You may need to do some arithmetic to get out the years. Alternatively, you could use an array with Unix timestamp integers.
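A minimal sketch of the datetime64 suggestion, assuming the same seven datetimes as in the question. It shows two ways to get the mask: truncating to year resolution and adding the 1970 epoch offset (the "arithmetic" mentioned above), or comparing against the year boundaries directly. The names dt64, years, mask_from_years and mask_from_range are illustrative.

```python
import datetime as dtm

import numpy as np

# Store the values as datetime64[us], matching the microsecond resolution of datetime.
dt64 = np.array([dtm.datetime(2014, 1, 4, 12, 2, 1), dtm.datetime(2014, 1, 4, 12, 2, 1),
                 dtm.datetime(2014, 1, 6, 12, 2, 1), dtm.datetime(2014, 1, 5, 12, 2, 1),
                 dtm.datetime(2014, 1, 4, 12, 2, 1), dtm.datetime(2013, 1, 4, 13, 3, 1),
                 dtm.datetime(2013, 1, 5, 22, 2, 1)], dtype='datetime64[us]')

# Option 1: truncate to years, view as integers (years since 1970), add the epoch.
years = dt64.astype('datetime64[Y]').astype(int) + 1970
mask_from_years = years == 2014

# Option 2: compare against the start of 2014 and the start of 2015.
mask_from_range = (dt64 >= np.datetime64('2014')) & (dt64 < np.datetime64('2015'))

print(years)
print(mask_from_years)
```

Both options run as array operations inside NumPy, without a Python-level loop over the elements.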
I'm using matplotlib.dates to convert my string dates into date objects, thinking they would be easier to manipulate later.
import matplotlib.dates as md
def ConvertDate(datestr):
    '''
    Convert string date into matplotlib date object
    '''
    datefloat = md.datestr2num(datestr)
    return md.num2date(datefloat)
What I was trying to do was filter my structured array to find the index numbers of the rows that belong to a certain month and/or year:
import numpy as np
np.where( data['date'] == 2008 )
I can probably use a lambda function to convert each object into a string value, like so:
lambda x: x.strftime('%Y')
to compare each item, but I don't know where to put this lambda function in np.where, or if it's even possible.
Any ideas? Or is there some better way to do this?
After a lot of error messages, I think I found an answer to my own question.
[ x for x in range(len(data)) if data['date'][x].year == 2008 ]
I used a list comprehension to return the indexes of the structured array that matched a query. I also followed @hayden's suggestion to use .year instead of strftime(). Maybe numpy.where() is still faster, but this suits my needs right now.
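For comparison, here is a sketch of how the same query could be fed to np.where: extract the years (and months) into ordinary integer arrays first, then compare those. The structured array below, with a 'date' field of datetime objects and a 'value' field, is a made-up stand-in for the data array in the question.

```python
import datetime as dtm

import numpy as np

# Hypothetical structured array; the real one in the question is not shown.
data = np.array(
    [(dtm.datetime(2008, 3, 1), 1.0),
     (dtm.datetime(2009, 7, 15), 2.0),
     (dtm.datetime(2008, 11, 30), 3.0)],
    dtype=[('date', 'O'), ('value', 'f8')],
)

# Pull the attributes out once, then let NumPy do the comparisons.
years = np.array([d.year for d in data['date']])
months = np.array([d.month for d in data['date']])

rows_2008 = np.where(years == 2008)[0]
rows_nov_2008 = np.where((years == 2008) & (months == 11))[0]

print(rows_2008)
print(rows_nov_2008)
```

The attribute extraction is still a Python loop, but the integer arrays can be reused for several queries.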
Note: you might as well use datetime's datetime.strptime function:
import datetime
import numpy as np
dt1 = datetime.datetime.strptime('1/2/2012', '%d/%m/%Y')
dt2 = datetime.datetime.strptime('1/2/2011', '%d/%m/%Y')
In [5]: dt1
Out[5]: datetime.datetime(2012, 2, 1, 0, 0)
You can then use numpy.nonzero to filter your array to the indices of those datetimes where, for example, the year is 2012:
a = np.array([dt1, dt2])
b = np.array([x.year for x in a])  # list comprehension; on Python 3, np.array(map(...)) would wrap the map iterator instead
In [8]: b
Out[8]: array([2012, 2011])
In [9]: np.nonzero(b==2012)
Out[9]: (array([0]),)
Also, I would suggest looking into pandas which has this functionality built-in (on top of numpy), many more convenience functions (e.g. to_datetime), as well as efficient datetime storage...
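A minimal sketch of the pandas route mentioned above, using the same two date strings and day-first format as the strptime example. The names dates, is_2012 and idx are illustrative.

```python
import numpy as np
import pandas as pd

# Parse the strings with an explicit day/month/year format.
dates = pd.Series(pd.to_datetime(['1/2/2012', '1/2/2011'], format='%d/%m/%Y'))

# The .dt accessor exposes the year of every element without a Python loop.
is_2012 = (dates.dt.year == 2012).to_numpy()
idx = np.flatnonzero(is_2012)

print(dates)
print(idx)
```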