I'm using matplotlib.dates to convert my string dates into date objects, thinking it would be easier to manipulate them later.
import matplotlib.dates as md
def ConvertDate(datestr):
    '''
    Convert a string date into a matplotlib date object.
    '''
    datefloat = md.datestr2num(datestr)
    return md.num2date(datefloat)
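For reference, the function returns a timezone-aware datetime (num2date attaches UTC by default), so a quick check looks something like:
>>> ConvertDate('2008-06-15')
datetime.datetime(2008, 6, 15, 0, 0, tzinfo=tzutc())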
What I was trying to do was filter my structured array to find the index numbers of rows belonging to a certain month and/or year:
import numpy as np
np.where( data['date'] == 2008 )
I can probably use a lambda function to convert each object into a string value, like so:
lambda x: x.strftime('%Y')
to compare each item, but I don't know how to plug this lambda function into np.where, or whether that's even possible.
Any ideas? Or is there some better way to do this?
After a lot of error messages, I think I found an answer to my own question.
[ x for x in range(len(data)) if data['date'][x].year == 2008 ]
I used a list comprehension to return the indices of the structured array that match the query. I also included @hayden's suggestion to use .year instead of strftime(). Maybe numpy.where() is still faster, but this suits my needs right now.
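For the record, a vectorized version should also work by pulling the years into an integer array first (a sketch; years is a name introduced here):
import numpy as np
# build an integer array of years from the datetime objects
years = np.array([d.year for d in data['date']])
# np.where returns the matching indices as a tuple of arrays
indices = np.where(years == 2008)[0]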
Note: you might as well use datetime's datetime.strptime function:
import datetime
import numpy as np
dt1 = datetime.datetime.strptime('1/2/2012', '%d/%m/%Y')
dt2 = datetime.datetime.strptime('1/2/2011', '%d/%m/%Y')
In [5]: dt1
Out[5]: datetime.datetime(2012, 2, 1, 0, 0)
You can then use numpy.nonzero (to filter your array to the indices of those datetimes where, for example, the year is 2012):
a = np.array([dt1, dt2])
b = np.array([x.year for x in a])  # extract the year from each datetime
In [8]: b
Out[8]: array([2012, 2011])
In [9]: np.nonzero(b==2012)
Out[9]: (array([0]),)
Also, I would suggest looking into pandas which has this functionality built-in (on top of numpy), many more convenience functions (e.g. to_datetime), as well as efficient datetime storage...
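For instance, a minimal pandas sketch of the same year filter (assuming the strings parse with the given format):
import pandas as pd
s = pd.to_datetime(pd.Series(['1/2/2012', '1/2/2011']), format='%d/%m/%Y')
# boolean mask selecting the rows whose year is 2012
mask = s.dt.year == 2012
s.index[mask]  # the matching index labels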
I want to convert the date in a column in a dataframe to a different format. Currently, it has this format: '2019-11-20T01:04:18'. I want it to have this format: 20-11-19 1:04.
I think I need to develop a loop and generate a new column for the new date format. So essentially, in the loop, I would refer to the initial column and then generate the variable for the new column in the format I want.
Can someone help me out to complete this task?
The following code works for a single value:
import datetime
d = datetime.datetime.strptime('2019-11-20T01:04:18', '%Y-%m-%dT%H:%M:%S')
print(d.strftime('%d-%m-%y %H:%M'))
A previous answer on this site should be able to help you; the comments give an explanation.
You can read your data into pandas from a CSV file or a database, or create some test data as shown below.
>>> import pandas as pd
>>> df = pd.DataFrame({'column': {0: '26/1/2016', 1: '26/1/2016'}})
>>> # First convert the column to datetime datatype
>>> df['column'] = pd.to_datetime(df.column)
>>> # Then format it back to a string with strftime via the .dt accessor
>>> df['column'] = df['column'].dt.strftime('%Y-%m-%dT%H:%M:%S')
>>> df
column
0 2016-01-26T00:00:00
1 2016-01-26T00:00:00
NB: check that all of the column's date strings share the same format.
You can also achieve it like this:
from datetime import datetime
df['your_column'] = df['your_column'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S').strftime('%d-%m-%y %H:%M'))
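A vectorized equivalent (a sketch, avoiding the per-row lambda) would be to parse the whole column once and then format it:
import pandas as pd
# parse the ISO-style strings, then format them as dd-mm-yy HH:MM
df['your_column'] = pd.to_datetime(df['your_column'], format='%Y-%m-%dT%H:%M:%S')
df['your_column'] = df['your_column'].dt.strftime('%d-%m-%y %H:%M')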
I have a dataset which includes a column for dates, in the format dd.mm.yyyy.
I tried using the recommended methods for sorting the dates to restrict the range to December 2014, but none of them seem to work properly. I am considering rearranging the dates into the format yyyy.mm.dd, but I'm not sure how to go about doing this. Can someone help?
Code such as
(df['date'] > '1-12-2014') & (df['date'] <= '31-12-2014') doesn't seem to work.
The problem is that your dates are strings that pandas isn't recognizing as dates. You want to convert them to datetime objects first. There are a couple of ways to do this:
from datetime import datetime
df['date'] = df['date'].apply(lambda d: datetime.strptime(d, '%d.%m.%Y'))
or
df['date'] = pd.to_datetime(df['date'], format = '%d.%m.%Y')
In both cases, the key is using a format string that matches your data. Then, you can filter how you want:
from datetime import datetime
df[(df['date'] >= datetime(2014, 12, 1)) & (df['date'] <= datetime(2014, 12, 31))]
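Once the column has a datetime dtype, you can also filter by components directly, which sidesteps building date literals (a sketch):
# select all rows from December 2014 using the .dt accessor
df[(df['date'].dt.year == 2014) & (df['date'].dt.month == 12)]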
I'm trying to figure out a way to convert a normal date in the format "dd/mm/yyyy" (for example: 31/12/2016).
I want to find a way to convert this date into a unique number, so I can re-convert it back.
For example, I thought of sum = day + month*12 + year*365 as the number, and then:
(sum % 365) / 12...
but it doesn't work for every case. Any ideas?
It is far better not to handle the strings yourself. Instead, use the datetime module. Here is a very good answer which should cover what you need: Converting string into datetime
For example, in your case you would need the following
import datetime
your_date = datetime.datetime.strptime("31/12/2016", "%d/%m/%Y")
Then this article How to convert datetime to integer in python explains how you can turn it into an integer, but as stated in the answer, this is usually a bad idea.
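That said, if you do want a single reversible integer, the datetime module already provides one through toordinal() and fromordinal(), so no hand-rolled arithmetic is needed (a sketch):
import datetime
d = datetime.datetime.strptime("31/12/2016", "%d/%m/%Y").date()
n = d.toordinal()                      # unique integer for this date
back = datetime.date.fromordinal(n)    # round-trips to 2016-12-31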
You can use the datetime module to extract the day, month, and year, and then use them as you want.
In [9]: (y, m, d) = str(datetime.date.today()).split('-')[:3]
In [10]: y, m, d
Out[10]: ('2016', '11', '10')
The output is in string format, which can be converted to integers.
In [11]: int(y), int(m), int(d)
Out[11]: (2016, 11, 10)
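Alternatively, date objects expose these as integer attributes directly, so the string splitting isn't needed:
import datetime
today = datetime.date.today()
today.year, today.month, today.day  # e.g. (2016, 11, 10)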
I'm a bit confused about how numpy handles timezones. If I create a datetime object with just a date, it seems to use the Zulu timezone. If I also give it a time, it uses my current timezone. If I then manipulate these objects, e.g. add a timedelta, the results differ:
import numpy as np
a = np.datetime64('2015-04-22')
b = np.datetime64('2015-04-22T00:00')
delta = np.timedelta64(1,'h')
print(a+delta,b+delta)
I must ensure that all values are in the same timezone, so my question is: how can I ensure that a user who initializes these values doesn't mix plain dates with dates that include a time?
If you specify Zulu in the datetime that includes a time, you'll get uniform data.
In [30]: b = np.datetime64('2015-04-22T00:00Z')
In [31]: b + delta
Out[31]: numpy.datetime64('2015-04-22T03:00+0200')
In [32]: a + delta
Out[32]: numpy.datetime64('2015-04-22T03:00+0200','h')
http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#basic-datetimes
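One way to guard against mixed input (given the numpy version used here, which still parses timezone designators) is to normalize every string before handing it to datetime64; a sketch, where ensure_zulu is a name introduced for illustration, and inputs are assumed to carry either no designator or a trailing 'Z':
import numpy as np

def ensure_zulu(datestr):
    # treat date-only strings as midnight, then pin everything to Zulu/UTC
    if 'T' not in datestr:
        datestr += 'T00:00'
    if not datestr.endswith('Z'):
        datestr += 'Z'
    return np.datetime64(datestr)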
I have a pandas Series which can be constructed like the following:
import numpy as np
import pandas as pd
import psycopg2
from datetime import datetime

given_time = datetime(2013, 10, 8, 0, 0, 33, 945109,
                      tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))
given_times = np.array([given_time] * 3, dtype='datetime64[ns]')
column = pd.Series(given_times)
The dtype of my Series is datetime64[ns]
However, when I access it with column[1], it somehow becomes of type pandas.tslib.Timestamp, while column.values[1] stays np.datetime64. Does pandas auto-cast my datetime into a Timestamp when accessing an item? Is that slow?
Do I need to worry about the difference in types? As far as I can see, Timestamp seems not to have a timezone (numpy.datetime64('2013-10-08T00:00:33.945109000+0100') -> Timestamp('2013-10-07 23:00:33.945109', tz=None)).
In practice, I would do datetime arithmetic such as taking differences and comparing to a timedelta. Does the possible type inconsistency around my operators affect my use case at all?
Besides, am I encouraged to use pd.to_datetime instead of astype(dtype='datetime64') when converting datetime objects?
Pandas time types are built on top of numpy's datetime64.
In order to continue using the pandas operators, you should keep using pd.to_datetime rather than astype(dtype='datetime64'). This is especially true since you'll be taking datetime deltas, which pandas handles admirably, for example with resampling and period definitions:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period
Though I haven't measured it, since the pandas times are wrapping numpy times, I suspect the conversion is quite fast. Alternatively, you can just use pandas' built-in time series definitions and avoid the conversion altogether.
As a rule of thumb, though, it's good to use the types from the package whose functions you'll be using. So if you're really only going to use numpy to manipulate the arrays, stick with numpy datetimes; for pandas methods, pandas datetimes.
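As a minimal sketch of that rule of thumb (the values here are illustrative):
import pandas as pd
s = pd.to_datetime(pd.Series(['2013-10-08', '2013-10-09']))
# timedelta arithmetic stays inside pandas types
delta = s - pd.Timestamp('2013-10-01')
print(delta > pd.Timedelta(days=7))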
I read in the documentation somewhere (apologies, I can't find the link) that scalar values will be converted to Timestamps, while arrays will keep their data type. For example:
from datetime import date
import pandas as pd
time_series = pd.Series([date(2010 + x, 1, 1) for x in range(5)])
time_series = time_series.apply(pd.to_datetime)
so that:
In[1]:time_series
Out[1]:
0 2010-01-01
1 2011-01-01
2 2012-01-01
3 2013-01-01
4 2014-01-01
dtype: datetime64[ns]
and yet:
In[2]:time_series.iloc[0]
Out[2]:Timestamp('2010-01-01 00:00:00')
while:
In[3]:time_series.values[0]
Out[3]: numpy.datetime64('2009-12-31T19:00:00.000000000-0500')
because iloc requests a scalar from pandas (type conversion to Timestamp) while values requests the full numpy array (no type conversion).
There is similar behavior for a series of length one. Additionally, referencing more than one element in a slice (i.e. iloc[1:10]) will return a series, which always keeps its datatype, as the quick check below illustrates.
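Continuing the example above:
type(time_series.iloc[0])      # pandas Timestamp: scalar access converts
type(time_series.values[0])    # numpy.datetime64: raw array, no conversion
time_series.iloc[0:2].dtype    # dtype('<M8[ns]'): slices keep the dtype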
I'm unsure as to why pandas behaves this way.
In[4]: pd.__version__
Out[4]: '0.15.2'