How to compare np.datetime64 up to month only? - python

I stumbled upon a "problem" while working with my data some time ago, when I started messing with pandas. That problem is that, when you compare np.datetime64 objects with strings, numpy will fill out the rest of the information to fit datetime with the lowest value possible (01 for months, 01 for days and so on).
The same happens if you create an np.datetime64 object and specify only up to the month; the rest of the information is still filled with the lowest possible value:
np.datetime64('2019-07','M')
>>numpy.datetime64('2019-07')
The problem for me is that, many times, my only concern is with what happens between time periods, like months.
For example, if I want to filter every row where payments were made within the last month, it would be ideal to use:
month = '2019-07'
df[df['pay_day']==month]
But when doing something like that, it will compare up to the day and fail for every date that isn't the first day of the month. I have tried transforming datetime to str, slicing and putting it back together, but for filtering purposes it gets messy. Another thing I have tried is:
df['pay_day'].days=1
The idea was to bring all days to 01, so there would be no problem when comparing and filtering, but it just fills the whole column with int64 1's.
Any ideas on how to do that?

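You can use the pandas datetime accessor .dt and compare the corresponding property (month here). Note that .dt.month is an integer from 1 to 12, so compare it with a number rather than the string '2019-07', and check the year as well if the data spans several years:
df[(df['pay_day'].dt.year == 2019) & (df['pay_day'].dt.month == 7)]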
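A variation on the same idea, sketched under the assumption that pay_day is already a datetime64 column: convert each date to a monthly pandas Period and compare whole months. The sample frame and the amount column are made up for illustration; only the pay_day name comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the 'pay_day' column name comes from the question.
df = pd.DataFrame({
    "pay_day": pd.to_datetime(["2019-07-01", "2019-07-15", "2019-08-02", "2018-07-20"]),
    "amount": [10, 20, 30, 40],
})

# Convert each date to a monthly Period, then compare year and month in one step.
target = pd.Period("2019-07", freq="M")
in_month = df[df["pay_day"].dt.to_period("M") == target]

print(in_month)
```

Because a Period carries the year as well as the month, July 2018 is not mistaken for July 2019.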

I have found a way that works for this specific problem: if we set all days to 01, comparison is no longer a problem, but np.datetime64 values are hard to manipulate. There is a way, though:
df['pay_day'] = df['pay_day'].astype('datetime64[M]')
This sets all days to 01, so comparison based on month becomes easy. Setting the days to any other value is probably harder, but for this purpose it works.
I got the idea from: https://stackoverflow.com/a/52810147/8424939
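One caveat, stated as my understanding rather than a guarantee: recent pandas releases (2.x, as far as I know) may reject the astype('datetime64[M]') cast on a Series, since only second-to-nanosecond units are accepted there. A sketch of an alternative that should give the same "all days set to 01" result is to go through a monthly period and back to a timestamp. The pay_month column name and the sample dates are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"pay_day": pd.to_datetime(["2019-07-01", "2019-07-15", "2019-08-31"])})

# Truncate every date to the first day of its month.
df["pay_month"] = df["pay_day"].dt.to_period("M").dt.to_timestamp()

# Month-level filtering is now a plain equality test.
july = df[df["pay_month"] == pd.Timestamp("2019-07-01")]
print(july)
```

Writing the result to a separate column keeps the original days available in case they are needed later.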

Related

How to create a pandas dataframe using a list of 'epoch dates' into '%Y-%m-%d %s:%m:%f%z' format?

My objective is to create the following pandas dataframe (with the 'date_time' column in '%Y-%m-%d %s:%m:%f%z' format):
batt_no date_time
3 4 2019-09-19 20:59:06+00:00
4 5 2019-09-19 23:44:07+00:00
5 6 2019-09-20 00:44:06+00:00
6 7 2019-09-20 01:14:06+00:00
But the constraint is that I don't want to first create a dataframe as follows and then convert the 'date_time' column into the above format.
batt_no date_time
3 4 1568926746
4 5 1568936647
5 6 1568940246
6 7 1568942046
I need to directly create it by converting two lists of values into the desired dataframe.
The following is what I've tried but I get an error
(please note: the 'date_time' values are in epoch format which I need to specify but have them converted into the '%Y-%m-%d %s:%m:%f%z' format):
pd.DataFrame({'batt_volt':[4,5,6,7],
'date_time':[1568926746,1568936647,1568940246,1568942046].dt.strftime('%Y-%m-%d %s:%m:%f%z')}, index=[3,4,5,6])
Can anyone please help?
Edit Note: My question is different from the one asked here.
The question there deals with conversion of a single value of pandas datetime to unix timestamp. Mine's different because:
My timestamp values are slightly different from any of the types mentioned there
I don't need to convert any timestamp value, rather create a full-fledged dataframe having values of the desired timestamp - in a particular manner using lists that I've clearly mentioned in my question.
I've clearly stated the way I've attempted the process but requires some modifications in order to run without error, which in no way is similar to the question asked in the aforementioned link.
Hence, my question is definitely different. I'd request to kindly reopen it.
As suggested, I am posting the solution from my comment as an answer here.
pd.DataFrame({'batt_volt':[4,5,6,7], 'date_time': pd.to_datetime([1568926746,1568936647,1568940246,1568942046], unit='s', utc=True).strftime('%Y-%m-%d %H:%M:%S%z')}, index=[3,4,5,6])
Note that the format string uses %H:%M:%S for hours, minutes and seconds; the %s:%m:%f from the question does not mean that (%m is the month and %f is microseconds).
pd.to_datetime works with a single date or a list of dates, and the input can be in many formats, including epoch integers. The unit keyword ensures that those integers are interpreted as a number of seconds since 1970-01-01, not ms, μs, ns, ...
So it is easy to use it directly inside the DataFrame constructor to build the list of dates.
Since a list of strings in a specific format was wanted, we can call .strftime on the result. (Outside any specific context, I maintain that it is probably preferable to store datetimes and convert to strings only for I/O operations, but I don't know the specific context here.) When to_datetime is called with a list, the result is a DatetimeIndex, and .strftime works on it too, applied to every datetime in the list. So we get a list of strings in the wanted format.
The last remaining problem was the timezone, and here there is no perfect solution, because plain integers like the ones we started with do not carry a timezone. By default, to_datetime creates datetimes without a timezone (like those integers), so they are relative to whatever timezone the user decides they are in.
to_datetime can create timezone-aware datetimes, but only in UTC. That is done with the keyword argument utc=True.
With that we get timezone-aware datetimes, assuming that the integers we provided were numbers of seconds since 1970-01-01 00:00:00+00:00.
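For comparison, here is a minimal sketch of the "store datetimes, format only for I/O" variant mentioned above. It uses the same epoch values and index as the question and simply skips the strftime step, so the column keeps a timezone-aware datetime dtype.

```python
import pandas as pd

epochs = [1568926746, 1568936647, 1568940246, 1568942046]

# unit='s' reads the integers as seconds since the epoch; utc=True attaches the UTC timezone.
df = pd.DataFrame(
    {"batt_volt": [4, 5, 6, 7],
     "date_time": pd.to_datetime(epochs, unit="s", utc=True)},
    index=[3, 4, 5, 6],
)

print(df)
print(df["date_time"].dtype)
```

Printing such a column already shows each value with its +00:00 offset, which matches the layout requested in the question.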

How do I get missing data and how do I remove holidays from a dataframe?

I have the following problem:
I have fetched the data of different asset classes over a period of 5 years in a dataframe. Since I have to work with returns, I have converted them into returns with pct_change(). Furthermore, I have removed the weekend from the period. For this I used resample('B').asfreq() so that I only have values from Monday to Friday. I was also missing data, which I then interpolated with interpolate().
My problem now is that I still have holidays in my dataframe where the stock exchanges were closed, so there was no change in the returns. Therefore I have some 0's in my dataframe.
Does anyone know how I can best fix this problem?
I want to calculate the correlation between different asset classes.
So you want to remove all rows that contain a 0?
If so, you can use df = df.loc[(df != 0).all(axis=1)]
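To make the behaviour of that filter concrete, here is a small sketch with made-up returns (the column names, dates and values are assumptions). It also shows a looser variant that drops a row only when every asset is flat, which may be closer to "the exchange was closed" than dropping any row containing a single zero.

```python
import pandas as pd

# Hypothetical daily returns; the all-zero row stands in for an exchange holiday.
returns = pd.DataFrame(
    {"stocks": [0.01, 0.0, -0.02, 0.0], "bonds": [0.002, 0.0, 0.001, 0.003]},
    index=pd.to_datetime(["2021-12-23", "2021-12-24", "2021-12-27", "2021-12-28"]),
)

# Strict variant from the answer: keep rows where every column is non-zero.
no_zeros = returns.loc[(returns != 0).all(axis=1)]

# Looser variant: drop a row only when all columns are zero.
no_flat_days = returns.loc[(returns != 0).any(axis=1)]

print(len(returns), len(no_zeros), len(no_flat_days))
```

Which variant is appropriate depends on whether a genuine zero return on a single asset should count as data.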

How to convert pandas data frame datetime column to int?

I am facing an issue while converting one of my datetime columns in pandas dataframe to int. My code is:
df['datetime_column'].astype(np.int64)
The error which I am getting is:
invalid literal for int() with base 10: '2018-02-25 09:31:15'
I am quite clueless about what is happening as the conversion for some of my other datetime columns are working fine. Is there some issue with the range of the date which can be converted to int?
You could use
df['datetime_column'].apply(lambda x: x.toordinal())
If it fails, the cause could be that your column has dtype object rather than datetime. In that case you need:
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
before converting to ordinals.
If you are working on feature engineering, you can try creating the number of days between date1 and date2, booleans for whether it is winter, summer, autumn or spring (by looking at the month) and, if you have the time of day, booleans for whether it is morning, noon or night, all depending on your machine learning problem.
It seems you solved the problem yourself, judging from your comment. My guess is that you created the data frame without specifying that the column should be read as anything other than a string, so it is a string. If I'm right and you check the column type, it should show as object. If you check an individual entry in the column, it should show as a string.
If the issue is something else, please follow up.
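Putting the two answers together, a minimal sketch might look like the following. It assumes the column holds strings like the one in the error message; the epoch_seconds and ordinal column names are made up for illustration. Dividing the offset from 1970-01-01 by a one-second Timedelta is used here instead of astype so the result does not depend on the column's internal time resolution.

```python
import pandas as pd

# Hypothetical frame whose column holds strings, as in the question's error message.
df = pd.DataFrame({"datetime_column": ["2018-02-25 09:31:15", "1970-01-02 00:00:00"]})

# Parse the strings first; integer conversion only makes sense on real datetimes.
parsed = pd.to_datetime(df["datetime_column"])

# Whole seconds since 1970-01-01, independent of the column's internal resolution.
df["epoch_seconds"] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Proleptic Gregorian ordinal (a day number), as in the answer above.
df["ordinal"] = parsed.apply(lambda ts: ts.toordinal())

print(df)
```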

"Uncertain" datetime objects in python?

I have a bunch of field data, some of which has a well-known acquisition day, while for the rest the acquisition date is only known within an uncertainty margin, say +/- 1.5 months as an example.
Is there something such as an "uncertain datetime object" that could handle these uncertainties?
I was thinking to insert just "99" and "99" for day and month as a zeroth order approach and then for example create an Enum object that labels the date as uncertain. But first of all inserting nines doesn't work, because datetime takes good care that you insert valid month and day when instantiating a datetime object.
Is there a cleverer approach to this? Is there maybe an already existing package that can deal with uncertain dates?
An entry like +/- 1.5 is a difference from some measured time. One way you can codify this is with a timedelta object, which represents the difference in two datetime objects.
Here is how you would represent an interval of minus five minutes:
import datetime
i = datetime.timedelta(minutes=-5)
Now, to calculate a time you just add that to an actual date time value.
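Building on the timedelta idea, one possible sketch is to bundle a best-guess datetime with its margin. The UncertainDate class and its attribute names are hypothetical, not part of any existing package, and since timedelta has no month unit the +/- 1.5 months from the question is approximated as 45 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class UncertainDate:
    """Hypothetical helper: a best-guess date plus a symmetric error margin."""
    best_guess: datetime
    margin: timedelta = timedelta(0)

    @property
    def earliest(self) -> datetime:
        return self.best_guess - self.margin

    @property
    def latest(self) -> datetime:
        return self.best_guess + self.margin

    def could_be(self, when: datetime) -> bool:
        # True if `when` lies inside the uncertainty window (inclusive).
        return self.earliest <= when <= self.latest


# timedelta has no month unit, so +/- 1.5 months is approximated as 45 days here.
sample = UncertainDate(datetime(2020, 6, 15), timedelta(days=45))
print(sample.earliest, sample.latest)
print(sample.could_be(datetime(2020, 7, 20)))
```

A well-known acquisition day is then just the special case with a zero margin.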

How to deal with "partial" dates (2010-00-00) from MySQL in Django?

In one of my Django projects that uses MySQL as the database, I need a date field that also accepts "partial" dates such as year only (YYYY) or year and month (YYYY-MM), in addition to normal dates (YYYY-MM-DD).
The date field in MySQL can deal with that by accepting 00 for the month and the day. So 2010-00-00 is valid in MySQL and represents 2010. The same goes for 2010-05-00, which represents May 2010.
So I started to create a PartialDateField to support this feature. But I hit a wall because, by default (and Django uses the default), MySQLdb, the Python driver for MySQL, returns a datetime.date object for a date field, and datetime.date supports only real dates. It is possible to modify the converter that MySQLdb uses for date fields so that it returns a string in the format 'YYYY-MM-DD' instead. Unfortunately, the converter used by MySQLdb is set at the connection level, so it is used for all MySQL date fields. But Django's DateField relies on the database returning a datetime.date object, so if I change the converter to return a string, Django is not happy at all.
Does anyone have an idea or advice for solving this problem? How can I create a PartialDateField in Django?
EDIT
I should also add that I already thought of two solutions: create three integer fields for year, month and day (as mentioned by Alison R.), or use a varchar field to keep the date as a string in the format YYYY-MM-DD.
But in both solutions, if I'm not wrong, I will lose the special properties of a date field, such as being able to run queries like "get all entries after this date". I could probably re-implement this functionality on the client side, but that is not a valid solution in my case because the database can be queried from other systems (mysql client, MS Access, etc.).
First, thanks for all your answers. None of them, as is, was a good solution for my problem, but, in your defense, I should add that I didn't give all the requirements. Each one helped me think about my problem, though, and some of your ideas are part of my final solution.
So my final solution, on the DB side, is to use a varchar field (limited to 10 chars) and store the date in it as a string in ISO format (YYYY-MM-DD), with 00 for the month and day when there is no month and/or day (like a date field in MySQL). This way, the field can work with any database, and the data can be read, understood and edited directly and easily by a human using a simple client (mysql client, phpMyAdmin, etc.). That was a requirement. It can also be exported to Excel/CSV without any conversion. The disadvantage is that the format is not enforced (except in Django). Someone could write 'not a date' or make a mistake in the format and the DB will accept it (if you have an idea about this problem...).
This way it is also possible to do all of the special queries of a date field relatively easily. For queries with WHERE, the operators <, >, <=, >= and = work directly. IN and BETWEEN queries also work directly. For querying by day or month, you just have to use EXTRACT (DAY|MONTH ...). Ordering also works directly. So I think it covers all the query needs with almost no complication.
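The reason comparison and ordering work directly is that fixed-width, zero-padded ISO strings sort in the same order as the dates they encode. A small Python illustration of that property follows; the sample values are made up, and the database performs the equivalent comparison on the varchar column (subject to its collation, which I am assuming is a plain one).

```python
# Partial dates stored as fixed-width ISO strings, with 00 for unknown parts.
dates = ["2010-05-17", "2010-00-00", "2009-12-31", "2010-05-00", "2011-00-00"]

# Plain string ordering matches chronological ordering for this format,
# with a partial date sorting before every full date it covers.
print(sorted(dates))

# Rough equivalent of: WHERE d >= '2010-05-00' AND d < '2011-00-00'
selected = [d for d in dates if "2010-05-00" <= d < "2011-00-00"]
print(selected)
```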
On the Django side, I did two things. First, I created a PartialDate object that looks mostly like datetime.date but supports dates without a month and/or day. Inside this object I use a datetime.datetime object to keep the date. I use the hours and minutes as flags that tell whether the month and day are valid (they are valid when the flag is set to 1). It is the same idea that steveha proposed, but with a different implementation (and only on the client side). Using a datetime.datetime object gives me a lot of nice features for working with dates (validation, comparison, etc.).
Second, I created a PartialDateField that mostly deals with the conversion between the PartialDate object and the database.
So far, it works pretty well (I have mostly finished my extensive unit tests).
You could store the partial date as an integer (preferably in a field named for the portion of the date you are storing, such as year, month or day) and do validation and conversion to a date object in the model.
EDIT
If you need real date functionality, you probably need real, not partial, dates. For instance, does "get everything after 2010-0-0" return dates inclusive of 2010 or only dates in 2011 and beyond? The same goes for your other example of May 2010. The ways in which different languages/clients deal with partial dates (if they support them at all) are likely to be highly idiosyncratic, and they are unlikely to match MySQL's implementation.
On the other hand, if you store a year integer such as 2010, it is easy to ask the database for "all records with year > 2010" and understand exactly what the result should be, from any client, on any platform. You can even combine this approach for more complicated dates/queries, such as "all records with year > 2010 AND month > 5".
SECOND EDIT
Your only other (and perhaps best) option is to store truly valid dates and come up with a convention in your application for what they mean. A DATETIME field named like date_month could have a value of 2010-05-01, but you would treat that as representing all dates in May, 2010. You would need to accommodate this when programming. If you had date_month in Python as a datetime object, you would need to call a function like date_month.end_of_month() to query dates following that month. (That is pseudocode, but could be easily implemented with something like the calendar module.)
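The end_of_month pseudocode above could be sketched with the calendar module roughly as follows. The function name mirrors the pseudocode but is written as a standalone helper rather than a method, which is my own choice for illustration.

```python
import calendar
from datetime import date


def end_of_month(month_start: date) -> date:
    """Return the last day of the month containing `month_start`."""
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(month_start.year, month_start.month)[1]
    return month_start.replace(day=last_day)


date_month = date(2010, 5, 1)  # stands for "May 2010" by convention
print(end_of_month(date_month))        # 2010-05-31
print(end_of_month(date(2012, 2, 1)))  # 2012-02-29 (leap year)
```

A query for "everything after May 2010" would then compare against the returned date instead of the stored first-of-month value.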
It sounds like you want to store a date interval. In Python this would (to my still-somewhat-noob understanding) most readily be implemented by storing two datetime.datetime objects, one specifying the start of the date range and the other specifying the end. In a manner similar to that used to specify list slices, the endpoint would not itself be included in the date range.
For example, this code would implement a date range as a named tuple:
>>> from datetime import datetime
>>> from collections import namedtuple
>>> DateRange = namedtuple('DateRange', 'start end')
>>> the_year_2010 = DateRange(datetime(2010, 1, 1), datetime(2011, 1, 1))
>>> the_year_2010.start <= datetime(2010, 4, 20) < the_year_2010.end
True
>>> the_year_2010.start <= datetime(2009, 12, 31) < the_year_2010.end
False
>>> the_year_2010.start <= datetime(2011, 1, 1) < the_year_2010.end
False
Or even add some magic:
>>> DateRange.__contains__ = lambda self, x: self.start <= x < self.end
>>> datetime(2010, 4, 20) in the_year_2010
True
>>> datetime(2011, 4, 20) in the_year_2010
False
This is such a useful concept that I'm pretty sure that somebody has already made an implementation available. For example, a quick glance suggests that the relativedelta class from the dateutil package will do this, and more expressively, by allowing a 'years' keyword argument to be passed to the constructor.
However, mapping such an object into database fields is somewhat more complicated, so you might be better off implementing it simply by just pulling both fields separately and then combining them. I guess this depends on the DB framework; I'm not very familiar with that aspect of Python yet.
In any case, I think the key is to think of a "partial date" as a range rather than as a simple value.
edit
It's tempting, but I think inappropriate, to add more magic methods that will handle uses of the > and < operators. There's a bit of ambiguity there: does a date that's "greater than" a given range occur after the range's end, or after its beginning? It initially seems appropriate to use <= to indicate that the date on the right-hand side of the equation is after the start of the range, and < to indicate that it's after the end.
However, this implies equality between the range and a date within the range, which is incorrect, since it implies that the month of May 2010 is equal to the year 2010, because May the 4th, 2010 equates to both of them. That is, you would end up with false statements like 2010-04-20 == 2010 == 2010-05-04 being true.
So probably it would be better to implement a method like isafterstart to explicitly check if a date is after the beginning of the range. But again, somebody's probably already done it, so it's probably worth a look on pypi to see what's considered production-ready. This is indicated by the presence of "Development Status :: 5 - Production/Stable" in the "Categories" section of a given module's pypi page. Note that not all modules have been given a development status.
Or you could just keep it simple, and using the basic namedtuple implementation, explicitly check
>>> datetime(2012, 12, 21) >= the_year_2010.start
True
Can you store the date together with a flag that tells how much of the date is valid?
Something like this:
YEAR_VALID = 0x04
MONTH_VALID = 0x02
DAY_VALID = 0x01
Y_VALID = YEAR_VALID
YM_VALID = YEAR_VALID | MONTH_VALID
YMD_VALID = YEAR_VALID | MONTH_VALID | DAY_VALID
Then, if you have a date like 2010-00-00, convert that to 2010-01-01 and set the flag to Y_VALID. If you have a date like 2010-06-00, convert that to 2010-06-01 and set the flag to YM_VALID.
So, then, PartialDateField would be a class that bundles together a date and the date-valid flag described above.
P.S. You don't actually need to use the flags the way I showed it; that's the old C programmer in me coming to the surface. You could use Y_VALID, YM_VALID, YMD_VALID = range(3) and it would work about as well. The key is to have some kind of flag that tells you how much of the date to trust.
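As a rough sketch of how the date-plus-flag idea could look in Python: the PartialDate class below, its parse method and its string format are assumptions made for illustration, not the asker's actual PartialDateField. It reuses the flag values shown above.

```python
from dataclasses import dataclass
from datetime import date

YEAR_VALID = 0x04
MONTH_VALID = 0x02
DAY_VALID = 0x01


@dataclass(frozen=True)
class PartialDate:
    """Hypothetical container: a real date plus a flag saying which parts to trust."""
    value: date
    valid: int

    @classmethod
    def parse(cls, text: str) -> "PartialDate":
        # Accepts 'YYYY-MM-DD' where MM and/or DD may be '00'.
        year, month, day = (int(part) for part in text.split("-"))
        if not month:
            day = 0  # a day without a month is not meaningful
        valid = YEAR_VALID | (MONTH_VALID if month else 0) | (DAY_VALID if day else 0)
        # Unknown parts are replaced by 1 so that a real date can be built.
        return cls(date(year, month or 1, day or 1), valid)

    def __str__(self) -> str:
        month = self.value.month if self.valid & MONTH_VALID else 0
        day = self.value.day if self.valid & DAY_VALID else 0
        return f"{self.value.year:04d}-{month:02d}-{day:02d}"


p = PartialDate.parse("2010-06-00")
print(p.value, p.valid, p)
```

The stored value stays a valid date, so ordinary date comparisons keep working, while the flag records how much of it is real.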
Although not in Python - here's an example of how the same problem was solved in Ruby - using a single Integer value - and bitwise operators to store year, month and day - with month and day optional.
https://github.com/58bits/partial-date
Look at the source in lib for date.rb and bits.rb.
I'm sure a similar solution could be written in Python.
To persist the date (sortable) you just save the Integer to the database.
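A Python sketch of the single-integer idea might look like this. The bit layout is my own assumption and is not taken from the Ruby gem: the year sits in the high bits, followed by 4 bits for the month and 5 bits for the day, with 0 meaning "unknown".

```python
# Assumed layout (not taken from the Ruby gem): year in the high bits,
# then 4 bits for month (0-12) and 5 bits for day (0-31); 0 means "unknown".
MONTH_BITS = 4
DAY_BITS = 5


def pack(year: int, month: int = 0, day: int = 0) -> int:
    return (year << (MONTH_BITS + DAY_BITS)) | (month << DAY_BITS) | day


def unpack(value: int) -> tuple:
    day = value & ((1 << DAY_BITS) - 1)
    month = (value >> DAY_BITS) & ((1 << MONTH_BITS) - 1)
    year = value >> (MONTH_BITS + DAY_BITS)
    return year, month, day


values = [pack(2010, 5, 17), pack(2010), pack(2009, 12, 31), pack(2010, 5)]

# Integer order matches chronological order, partial dates included.
print(sorted(values) == [pack(2009, 12, 31), pack(2010), pack(2010, 5), pack(2010, 5, 17)])
print(unpack(pack(2010, 5)))
```

Because the year occupies the most significant bits, sorting the integers sorts the dates, which is what makes the stored value directly sortable in the database.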
