I am importing data into a pandas dataframe from Google BigQuery and I'd like to sort the results by date. My code is as follows:
import sys, getopt
import pandas as pd
from datetime import datetime
# set your BigQuery service account private key
pkey ='#REMOVED#'
destination_table = 'test.test_table_2'
project_id = '#REMOVED#'
# write your query
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
"""
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey, parse_dates=True, index_col='date')
date = data.sort_index()
data.info()
data.describe()
print(data.head())
My output is shown below; as you can see, the dates are not sorted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
date 3 non-null object
Visits 3 non-null int32
dtypes: int32(1), object(1)
memory usage: 116.0+ bytes
date Visits
0 20180312 207440
1 20180310 178155
2 20180311 207452
I have read several questions and so far tried the below, which resulted in no change to my output:
Removing index_col='date' and adding date = data.sort_values(by='date')
Setting the date column as the index, then sorting the index (shown above).
Setting headers (headers = ['Date', 'Visits']) and dtypes (dtypes = [datetime, int]) in my read_gbq line (parse_dates=True, names=headers)
What am I missing?
I managed to solve this by transforming my date field into a datetime object. I assumed this would be done automatically by parse_dates=True, but it seems that will only parse an existing datetime object.
I added the following after my query to create a new datetime column from my date string, then I was able to use data.sort_index() and it worked as expected:
time_format = '%Y-%m-%d'
data = pd.read_gbq(query, project_id, dialect='standard', private_key=pkey)
data['n_date'] = pd.to_datetime(data['date'], format=time_format)
data.index = data['n_date']
del data['date']
del data['n_date']
data.index.names = ['Date']
data = data.sort_index()
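For reference, the same fix can be written more compactly by parsing the string column and setting the index in one pass. This is a minimal sketch with hypothetical stand-in data, assuming the query returns date as a '%Y-%m-%d' string as above:

```python
import pandas as pd

# hypothetical stand-in for the read_gbq result
data = pd.DataFrame({'date': ['2018-03-12', '2018-03-10', '2018-03-11'],
                     'Visits': [207440, 178155, 207452]})

# parse the string column, set it as the index, rename, and sort
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date').rename_axis('Date').sort_index()
print(data)
```

After this, the index is a proper DatetimeIndex and sort_index orders the rows chronologically.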
As most of the work is done on the Google BigQuery side, I'd do sorting there as well:
query = """
SELECT date, SUM(totals.visits) AS Visits
FROM `#REMOVED#.#REMOVED#.ga_sessions_20*`
WHERE parse_date('%y%m%d', _table_suffix) between
DATE_sub(current_date(), interval 3 day) and
DATE_sub(current_date(), interval 1 day)
GROUP BY Date
ORDER BY Date
"""
This should work:
data.sort_values('date', inplace=True)
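Incidentally, sort_values works here even while date is still a string, because zero-padded YYYYMMDD strings sort lexicographically in the same order as the dates they represent. A quick sketch with the sample values from the question:

```python
import pandas as pd

# string dates in YYYYMMDD form, as returned by the query
df = pd.DataFrame({'date': ['20180312', '20180310', '20180311'],
                   'Visits': [207440, 178155, 207452]})
df.sort_values('date', inplace=True)
print(df['date'].tolist())  # ['20180310', '20180311', '20180312']
```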
Related
I have a dataframe that contains arrival dates for vessels, and I want Python to recognize the current year and month and remove all entries that are prior to the current month and year.
I have a column with the date itself in the format '%d/%b/%Y', and separate columns for month and year if needed.
For instance, if today is 01/01/2022, I'd like to remove everything from Dec/2021 and prior.
Using pandas periods and boolean indexing:
# set up example
df = pd.DataFrame({'date': ['01/01/2022', '08/02/2022', '09/03/2022'], 'other_col': list('ABC')})
# find dates equal or greater to this month
keep = (pd.to_datetime(df['date'], dayfirst=True)
.dt.to_period('M')
.ge(pd.Timestamp('today').to_period('M'))
)
# filter
out = df[keep]
Output:
date other_col
1 08/02/2022 B
2 09/03/2022 C
from datetime import datetime
import pandas as pd
df = ...
# assuming your date column is named 'date'
t = datetime.utcnow()
df = df[pd.to_datetime(df.date) >= datetime(t.year, t.month, 1)]
Let us consider this example dataframe:
import pandas as pd
import datetime
df = pd.DataFrame()
data = [['nao victoria', '21/Feb/2012'], ['argo', '6/Jun/2022'], ['kon tiki', '23/Aug/2022']]
df = pd.DataFrame(data, columns=['Vessel', 'Date'])
You can convert your dates to datetimes, by using pandas' to_datetime method; for instance, you may save the output into a new Series (column):
df['Datetime'] = pd.to_datetime(df['Date'], format='%d/%b/%Y')
You end up with the following dataframe:
Vessel Date Datetime
0 nao victoria 21/Feb/2012 2012-02-21
1 argo 6/Jun/2022 2022-06-06
2 kon tiki 23/Aug/2022 2022-08-23
You can then reject rows containing datetime values that are smaller than today's date, defined using datetime's now method:
df = df[df.Datetime > datetime.datetime.now()]
This returns:
Vessel Date Datetime
2 kon tiki 23/Aug/2022 2022-08-23
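If the cutoff should be the start of the current month, as the question asks, rather than the current moment, the comparison value can be built from the first day of the month. A sketch reusing the same toy data:

```python
import datetime
import pandas as pd

data = [['nao victoria', '21/Feb/2012'], ['argo', '6/Jun/2022'], ['kon tiki', '23/Aug/2022']]
df = pd.DataFrame(data, columns=['Vessel', 'Date'])
df['Datetime'] = pd.to_datetime(df['Date'], format='%d/%b/%Y')

# first day of the current month as the cutoff
today = datetime.date.today()
month_start = pd.Timestamp(today.year, today.month, 1)
df = df[df['Datetime'] >= month_start]
```

Rows dated before the current month (e.g. the 2012 entry) are dropped; entries within the current month are kept even if their day has already passed.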
I am reading from an Excel file that has a column with times. Since I can't upload the actual file, I created the variable timeIntervals to illustrate.
When I run this code...
import pandas as pd
import datetime
from pyPython import *
def main():
    timeIntervals = pd.date_range("11:00", "21:30", freq="30min").time
    df = pd.DataFrame({"Times": timeIntervals})
    grp = pd.Grouper(key="Times", freq="3H")
    value = df.groupby(grp).count()
    print(value)

if __name__ == '__main__':
    main()
I get the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'
How can I use pandas.Grouper in combination with DataFrame.groupby to "group" dataframe df into discrete time ranges (3 hours) ? Are there other alternatives?
A few issues:
A date_range cannot be reduced to time-of-day only without losing the datatype required for resampling on a time window.
count counts the non-NaN values in a column, so a column must be provided; after grouping on Times there are no remaining columns in the sample frame.
We can fix the first issue by turning the time column into a datetime:
timeIntervals = pd.date_range("11:00", "21:30", freq="30min") # remove time here
df = pd.DataFrame({"Times": timeIntervals})
If we are not creating these values from a date_range we can simply convert the column to_datetime:
df['Times'] = pd.to_datetime(df['Times'], format='%H:%M:%S')
Then we can groupby and count:
value = df.groupby(pd.Grouper(key="Times", freq="3H"))['Times'].count()
If needed we can update the index to only reflect the time after grouping:
value.index = value.index.time
As a result value becomes:
09:00:00 2
12:00:00 6
15:00:00 6
18:00:00 6
21:00:00 2
Name: Times, dtype: int64
All together with to_datetime:
def main():
    time_intervals = pd.date_range("11:00", "21:30", freq="30min").time
    df = pd.DataFrame({"Times": time_intervals})
    # Convert to DateTime
    df['Times'] = pd.to_datetime(df['Times'], format='%H:%M:%S')
    # Group and count specific column
    value = df.groupby(pd.Grouper(key="Times", freq="3H"))['Times'].count()
    # Retrieve only Time information
    value.index = value.index.time
    print(value)
Or without retrieving time before DataFrame creation:
def main():
    time_intervals = pd.date_range("11:00", "21:30", freq="30min")
    df = pd.DataFrame({"Times": time_intervals})
    value = df.groupby(pd.Grouper(key="Times", freq="3H"))['Times'].count()
    value.index = value.index.time
    print(value)
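As for other alternatives: the same counts can be obtained without Grouper by putting the timestamps on the index and using resample. A sketch under the same assumptions as above:

```python
import pandas as pd

# a Series of ones indexed by the half-hourly timestamps
s = pd.Series(1, index=pd.date_range("11:00", "21:30", freq="30min"))

# resample into 3-hour bins and count entries per bin
value = s.resample("3H").count()
value.index = value.index.time
print(value)
```

resample only works on a DatetimeIndex (or Timedelta/Period index), which is exactly why the original time-only column failed with Grouper.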
From an online API I gather a series of data points, each with a value and an ISO timestamp. Unfortunately I need to loop over them, so I store them in a temporary list of dicts and then create a pandas dataframe from that and set the index to the timestamp column (simplified example):
from datetime import datetime
import pandas
input_data = [
'2019-09-16T06:44:01+02:00',
'2019-11-11T09:13:01+01:00',
]
data = []
for timestamp in input_data:
    _date = datetime.fromisoformat(timestamp)
    data.append({'time': _date})

pd_data = pandas.DataFrame(data).set_index('time')
As long as all timestamps are in the same timezone and DST/non-DST, everything works fine and I get a DataFrame with a DatetimeIndex, which I can work on later.
However, once two different time offsets appear in one dataset (above example), I only get a plain Index in my dataframe, which does not support any time-based methods.
Is there any way to make pandas accept timezone-aware dates with differing offsets as an index?
A minor correction of the question's wording, which I think is important. What you have are UTC offsets - DST/no-DST would require more information than that, i.e. a time zone. Here, this matters since you can parse timestamps with UTC offsets (even different ones) to UTC easily:
import pandas as pd
input_data = [
'2019-09-16T06:44:01+02:00',
'2019-11-11T09:13:01+01:00',
]
dti = pd.to_datetime(input_data, utc=True)
# dti
# DatetimeIndex(['2019-09-16 04:44:01+00:00', '2019-11-11 08:13:01+00:00'], dtype='datetime64[ns, UTC]', freq=None)
I prefer to work with UTC so I'd be fine with that. If however you need date/time in a certain time zone, you can convert e.g. like
dti = dti.tz_convert('Europe/Berlin')
# dti
# DatetimeIndex(['2019-09-16 06:44:01+02:00', '2019-11-11 09:13:01+01:00'], dtype='datetime64[ns, Europe/Berlin]', freq=None)
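With that DatetimeIndex in hand, the original loop is unnecessary; the frame can be built directly on the parsed index. A sketch with the same two timestamps (the column name and values are hypothetical):

```python
import pandas as pd

input_data = [
    '2019-09-16T06:44:01+02:00',
    '2019-11-11T09:13:01+01:00',
]

# hypothetical values, indexed by the UTC-normalized timestamps
pd_data = pd.DataFrame({'value': [1.0, 2.0]},
                       index=pd.to_datetime(input_data, utc=True))
pd_data.index.name = 'time'

# time-based methods work again, e.g. partial-string indexing by month
print(pd_data.loc['2019-09'])
```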
A pandas datetime column also requires the offset to be the same; a column with different offsets will not be converted to a datetime dtype.
I suggest not converting the data to a datetime until it's in pandas.
Separate the time offset and treat it as a timedelta. to_timedelta requires a format of 'hh:mm:ss', so add ':00' to the end of the offset.
See Pandas: Time deltas for all the available timedelta operations, as well as pandas.Series.dt.tz_convert and pandas.Series.dt.tz_localize.
Convert to a specific timezone as follows:
If a datetime is not datetime64[ns, UTC] dtype, first use .dt.tz_localize('UTC') before .dt.tz_convert('US/Pacific').
Otherwise, df.datetime_utc.dt.tz_convert('US/Pacific') works directly.
import pandas as pd
# sample data
input_data = ['2019-09-16T06:44:01+02:00', '2019-11-11T09:13:01+01:00']
# dataframe
df = pd.DataFrame(input_data, columns=['datetime'])
# separate the offset from the datetime and convert it to a timedelta
df['offset'] = pd.to_timedelta(df.datetime.str[-6:] + ':00')
# if desired, create a str with the separated datetime
# converting this to a datetime will lead to AmbiguousTimeError because of overlapping datetimes at 2AM, per the OP
df['datetime_str'] = df.datetime.str[:-6]
# convert the datetime column to a datetime format without the offset
df['datetime_utc'] = pd.to_datetime(df.datetime, utc=True)
# display(df)
datetime offset datetime_str datetime_utc
0 2019-09-16T06:44:01+02:00 0 days 02:00:00 2019-09-16 06:44:01 2019-09-16 04:44:01+00:00
1 2019-11-11T09:13:01+01:00 0 days 01:00:00 2019-11-11 09:13:01 2019-11-11 08:13:01+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 2 non-null object
1 offset 2 non-null timedelta64[ns]
2 datetime_str 2 non-null object
3 datetime_utc 2 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(2), timedelta64[ns](1)
memory usage: 192.0+ bytes
# convert to local timezone
df.datetime_utc.dt.tz_convert('US/Pacific')
[out]:
0 2019-09-15 21:44:01-07:00
1 2019-11-11 00:13:01-08:00
Name: datetime_utc, dtype: datetime64[ns, US/Pacific]
Other Resources
Calculate Pandas DataFrame Time Difference Between Two Columns in Hours and Minutes.
Talk Python to Me: Episode #271: Unlock the mysteries of time, Python's datetime that is!
Real Python: Using Python datetime to Work With Dates and Times
The dateutil module provides powerful extensions to the standard datetime module.
I have the below pandas dataframe that I am trying to filter, to provide me with an updated dataframe based on the last date & time in my database.
This is a sample of the dataframe I am trying to filter:
>>> df
# The time is in '%H:%M:%S' format, and the date is in '%d-%b-%Y'
Time Date Items
00:05:00 29-May-2018 foo
00:06:00 30-May-2018 barr
00:07:00 31-May-2018 gaaa
00:11:00 31-May-2018 raaa
... ... ...
What I am trying to do, is to filter this dataframe based on the last entry in my sql database. For example, the last entry is: ['20:05:00','30-May-2018']. The below code is what I used to filter out from df:
last_entry = ['20:05:00','30-May-2018']
# Putting time into a datetime format to work within the dataframe.
last_entry_time = datetime.strptime(last_entry[0], '%H:%M:%S').time()
new_entry = df[(df['Date'] >= last_entry[1]) & (df['Time'] > last_entry_time)]
If I were to just have the filter as new_entry = df[df['Date'] >= last_entry[1]] instead, this works well to return the current date and newer based on the last date, i.e. 30-May-2018 and 31-May-2018.
However, regarding the time portion, because my last_entry time is 20:05:00, it starts to filter out the rest of the data that I'm trying to collect...
Question:
How can I perform the filter of the dataframe, such that it returns me the new entries in the dataframe, that is based off the old date and time in the database?
Ideal result
last_entry = ['20:05:00','30-May-2018']
>>> new_entry
Time Date Items
00:07:00 31-May-2018 gaaa
00:11:00 31-May-2018 raaa
... ... ...
One option is to create a datetime column in your DataFrame, and then filter on this column, for example:
df["real_date"] = pd.to_datetime(df["Date"], format="%d-%b-%Y")
df["real_time"] = pd.to_timedelta(df["Time"])
df["real_datetime"] = df["real_date"] + df["real_time"]
You also need to convert your last_entry variable to a proper datetime, for example like this:
from dateutil.parser import parse
from datetime import datetime
date = parse(last_entry[1], dayfirst=True)
time_elements = [int(t) for t in last_entry[0].split(":")]
last_entry_dt = datetime(date.year, date.month, date.day, time_elements[0], time_elements[1], time_elements[2])
Then you can filter the new DataFrame column like so:
df[df["real_datetime"] >= last_entry_dt]
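End to end, combining the date and time into one column reduces the filter to a single comparison. A sketch using the sample rows from the question:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Time': ['00:05:00', '00:06:00', '00:07:00', '00:11:00'],
                   'Date': ['29-May-2018', '30-May-2018', '31-May-2018', '31-May-2018'],
                   'Items': ['foo', 'barr', 'gaaa', 'raaa']})

# combine date and time into a single comparable datetime column
df['real_datetime'] = (pd.to_datetime(df['Date'], format='%d-%b-%Y')
                       + pd.to_timedelta(df['Time']))

# last entry from the database: 30-May-2018 20:05:00
last_entry_dt = datetime(2018, 5, 30, 20, 5, 0)
new_entry = df[df['real_datetime'] > last_entry_dt]
print(new_entry['Items'].tolist())  # ['gaaa', 'raaa']
```

This reproduces the ideal result from the question: only the two 31-May-2018 rows survive the filter.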
I have a dataframe that contains the columns company_id, seniority, join_date and quit_date. I am trying to extract the number of days between join date and quit date. However, I get NaNs.
If I drop all the columns in the dataframe except for quit_date and join_date and run the same code again, I get what I expect. However, with all the columns present, I get NaNs.
Here's my code:
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['days'] = df['days'].astype(str)
df1 = pd.DataFrame(df.days.str.split(' ').tolist(), columns = ['days', 'unwanted', 'stamp'])
df['numberdays'] = df1['days']
This is what I get:
days numberdays
585 days 00:00:00 NaN
340 days 00:00:00 NaN
I want 585 from the 'days' column in the 'numberdays' column. Similarly for every such row.
Can someone help me with this?
Thank you!
Instead of converting to string, extract the number of days from the timedelta value using the dt accessor.
import pandas as pd
df = pd.DataFrame({'join_date': ['2014-03-24', '2013-04-29', '2014-10-13'],
'quit_date':['2015-10-30', '2014-04-04', '']})
df['join_date'] = pd.to_datetime(df['join_date'])
df['quit_date'] = pd.to_datetime(df['quit_date'])
df['days'] = df['quit_date'] - df['join_date']
df['number_of_days'] = df['days'].dt.days
Note: Mohammad Yusuf Ghazi points out that dt.day is necessary instead of dt.days when working with datetime data rather than timedelta data.