Selecting Data from Last Week in Python

I have a large database and I am looking to read only the last week of data in my Python code.
My first problem is that the column with the received date and time is not in a datetime format that pandas recognizes. My input (column 15) looks like this:
recvd_dttm
1/1/2015 5:18:32 AM
1/1/2015 6:48:23 AM
1/1/2015 13:49:12 PM
From the Time Series / Date functionality in the pandas library, I am looking at basing my code on the "Week()" function shown in the example below:
In [87]: d
Out[87]: datetime.datetime(2008, 8, 18, 9, 0)
In [88]: d - Week()
Out[88]: Timestamp('2008-08-11 09:00:00')
I have tried ordering the date this way:
df = pd.read_csv('MYDATA.csv')
orderdate = datetime.datetime.strptime(df['recvd_dttm'], '%m/%d/%Y').strftime('%Y %m %d')
however I am getting this error
TypeError: must be string, not Series
Does anyone know a simpler way to do this, or how to fix this error?
Edit: The dates are not necessarily in order, and sometimes there is a faulty entry in the database, like a mistyped date that lies in the future (e.g. 9/03/2015). I need to be able to ignore those.

import datetime as dt
import pandas as pd

df = pd.read_csv('MYDATA.csv')
# convert strings to datetimes
df['recvd_dttm'] = pd.to_datetime(df['recvd_dttm'])
# get first and last datetime for final week of data
range_max = df['recvd_dttm'].max()
range_min = range_max - dt.timedelta(days=7)
# take slice with final week of data
sliced_df = df[(df['recvd_dttm'] >= range_min) &
               (df['recvd_dttm'] <= range_max)]
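The edit mentions mistyped future dates; one of those would become range_max and drag the whole window forward. A minimal sketch of one way to guard against that (my own addition, not part of the original answer), dropping anything stamped later than the current time before computing the window:
# discard rows whose timestamp lies in the future before taking the max
now = pd.Timestamp('now')
df = df[df['recvd_dttm'] <= now]
range_max = df['recvd_dttm'].max()
range_min = range_max - dt.timedelta(days=7)
sliced_df = df[(df['recvd_dttm'] >= range_min) &
               (df['recvd_dttm'] <= range_max)]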

You may iterate over the dates and convert them with a list comprehension (note that the format string has to cover the time portion too, or strptime will raise ValueError on input like '1/1/2015 5:18:32 AM'):
orderdate = [datetime.datetime.strptime(ttm, '%m/%d/%Y %I:%M:%S %p').strftime('%Y %m %d')
             for ttm in df['recvd_dttm']]
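As a side note, pd.to_datetime parses this format without a loop, so the comprehension can usually be replaced by a single vectorized call; a one-line sketch of the same conversion:
orderdate = pd.to_datetime(df['recvd_dttm']).dt.strftime('%Y %m %d')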

Related

Not all dates are captured when filtering by dates. Python Pandas

I am filtering a dataframe by dates to produce two separate versions:
Data from only today's date
Data from the last two years
However, when I try to filter on the date, it seems to miss dates that are within the last two years.
date_format = '%m-%d-%Y' # desired date format
today = dt.now().strftime(date_format) # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date() # converting 'today' into a datetime object
today = today.strftime(date_format)
two_years = today - relativedelta(years=2) # date is today's date minus two years.
two_years = two_years.strftime(date_format)
# normalizing the format of the date column to the desired format
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)
df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]
Which results in:
all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']
The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.
You don't need to convert anything to string; simply work with the datetime dtype. Ex:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})
today = pd.Timestamp('now')
print(df[df['date'].dt.date == today.date()])
# date
# 0 2020-07-17
print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
# date
# 1 2020-07-15
# 2 2019-08-01
What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.
Your datatype conversions are the problem here. You could do this:
today = dt.now() # today's date. Will always result in today's date
two_years = today - relativedelta(years=2) # date is today's date minus two years.
Printing two_years gives something like '2018-07-17 18:40:42.704395'. You can then convert it to the date-only format:
two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()
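Putting the two fixes together, here is a minimal sketch of the corrected filter that keeps everything in datetime types until the comparison (df_data and the 'date' column are the names from the question):
from datetime import datetime as dt
from dateutil.relativedelta import relativedelta
import pandas as pd

today = dt.now()
two_years = today - relativedelta(years=2)
# parse once, then compare datetimes to datetimes
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce')
df_today = df_data[df_data['date'].dt.date == today.date()]
df_two_year = df_data[df_data['date'] >= two_years]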

pandas Groupby MonthStart with two business days offset

I have a DataFrame that is indexed by date and has daily data.
As described, I wish to group and aggregate this data by calendar month start minus 2 business days. My idea is to use groupby and MonthBegin with a 2-business-day BDay offset to do this.
When I try to run the code
import pandas as pd
import pandas.tseries.offsets as of
days = of.MonthBegin() - of.BDay(2)
g = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
I get an error
TypeError: Argument 'other' has incorrect type (expected
datetime.datetime, got BusinessDay)
Perhaps I need to use the rollback method on MonthBegin but when I try
days = of.MonthBegin()
days.rollback(of.BDay(2))
g_df = df.groupby(pd.Grouper(freq=days, level='Date')).sum()
TypeError: Cannot convert input [<2 * BusinessDays>] of type to Timestamp
Does anyone have any ideas on how to correctly use the offsets to groupby MonthBegin - 2BDay?
It is hard to tell what you want to achieve without any of your data, but here is how you could do it:
import pandas as pd
import pandas.tseries.offsets as of

df = pd.DataFrame({"dates": ["2018-01-02", "2018-01-03", "2018-02-02", "2018-01-04"],
                   "vals": [10, 20, 10, 5]})
df.groupby((pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)).dt.month).vals.sum()
Output:
dates
1 10
12 35
Name: vals, dtype: int64
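One caveat worth flagging as my own note rather than part of the answer: grouping on .dt.month alone will merge the same month from different years. If the data spans more than one year, grouping on the shifted dates' monthly period keeps the year; a sketch using the same df:
# group on the year-month period of the shifted dates instead of the bare month
shifted = pd.to_datetime(df.dates) - of.MonthBegin() - of.BDay(2)
df.groupby(shifted.dt.to_period("M")).vals.sum()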

np.where: can't compare datetime.date to unicode

I am trying to use np.where on a dataframe to separate the dates that are from 2017 onwards from those before 2017.
I need to compare a column "Creation_Date" (date format "%d/%m/%Y") with the value '01/01/2017'.
I keep getting the same error: can't compare datetime.date to unicode
I have converted the "Creation_Date" column to a date format using the strftime function, then converted the value '01/01/2017' to a date to compare it with the values in the "Creation_Date" column.
Here is the actual code :
my_df['temp_date'] = pd.to_datetime(my_df['Creation_Date'], dayfirst=True).dt.strftime('%Y-%m-%d')
t1 = my_df['temp_date'] >= dt.date(2017, 1, 1)
my_df['Final_Date'] = np.where(t1,'2017 or more','Below 2017')
Also tried :
my_df['temp_date'] = pd.to_datetime(my_df['Creation_Date'], dayfirst=True).dt.strftime('%Y-%m-%d')
t1 = my_df['temp_date'] >= dt.datetime.strptime('01/01/2017','%d/%m/%Y')
my_df['Final_Date'] = np.where(t1,'2017 or more','Below 2017')
I still can't manage to get the types to match for this comparison: can't compare datetime.date to unicode.
I need a Final_Date column that distinguishes Creation_Date values from 2017 onwards from those before 2017.
Can you help me, please?
Best regards.
From your code,
my_df['temp_date'] = pd.to_datetime(my_df['Creation_Date'], dayfirst=True).dt.strftime('%Y-%m-%d')
then my_df['temp_date'] contains strings, so you can't compare them to dt.date(2017, 1, 1) or dt.datetime.strptime('01/01/2017','%d/%m/%Y'), which are both datetime objects.
On the other hand, pandas allows comparison between a datetime64 column and a date string, so you can drop the dt.strftime call and compare directly:
my_df['temp_date']=pd.to_datetime(my_df['Creation_Date'], dayfirst=True)
my_df['temp_date'] >= '2017-01-01'
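For completeness, a minimal sketch of the full np.where step built on that comparison (column and label names are the ones from the question):
import numpy as np
import pandas as pd

my_df['temp_date'] = pd.to_datetime(my_df['Creation_Date'], dayfirst=True)
my_df['Final_Date'] = np.where(my_df['temp_date'] >= '2017-01-01',
                               '2017 or more', 'Below 2017')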

Convert Twitter Time into Datetime in Specific Format to Count Frequency of Tweets on a Day

So I have twitter data and I'm trying to count how many tweets I have on different days. So for example, in a list of 10 tweets, they may each have been created on a bunch of different days so I just want to figure out how many tweets there are for a given day (in the set of tweets).
Each object is in JSON format and the fields can be accessed as a dictionary key. In this case, to figure out when it was created, I use the 'date' field from below:
{'location': [Decimal('-118.3851587'), Decimal('34.0843881')], 'text': "random sample text", 'user': 'random user i cant show', 'id': Decimal('NaN'), 'date': 'Thu Oct 20 02:40:55 +0000 2016'}]
i.e. the date is formatted in the raw data as such:
Thu Oct 20 02:40:55 +0000 2016
I need to get that into this format:
2016-10-20
I was planning to make a pandas dataframe that would create a new row for each date as it came across one, but I'm worried that having to go through and dynamically add rows each time is expensive.
Since I know the specific range of days the tweets were in, I was going to just create a dataframe with pre-determined rows containing those dates.
To do that, I used the following code:
from datetime import date, timedelta as td
d1 = date(2016, 9, 17)
d2 = date(2016, 11, 7)
delta = d2 - d1
listOfDates = []
for i in range(delta.days + 1):
    print(d1 + td(days=i))
    listOfDates.append(d1 + td(days=i))
This would output the following dates:
2016-09-17
2016-09-18
2016-09-19
2016-09-20
2016-09-21
...
2016-11-04
2016-11-05
2016-11-06
2016-11-07
This created a list of dates from start to finish with which I created a dataframe (using DataFrame.set_index where the values in the list of dates became the row values).
But now when I go through my twitter data I need to dynamically check the date from the date field in the same format as it is in the columns (i.e. 2016-10-20 from the raw data example above). I'm a little lost as to how I go about formatting it on the fly to the specific format above.
EDIT
New question (slightly unrelated but still pertinent).
So in my code, I have a list of dates but these are all datetime objects (i.e. they were generated in the block of code I have in my post and stored in "listOfDates").
I have a dataframe where the rows are dates, so I used df.set_index(listOfDates) but it says error: "KeyError: datetime.date(2016, 9, 17)".
How do I make the list show the objects in the right format instead of saying datetime.date? Might be a dumb question...
Well, actually, I used strftime to get it to the right format but it still says KeyError: '2016-09-17'
NVM I'm dumb. It was df.index not df.set_index
First make some lambda functions for formatting an individual string.
from datetime import datetime
import re
unformatted = "Thu Oct 20 02:40:55 +0000 2016"
# Use re to strip the UTC offset ("+0000"), which this strptime format doesn't handle.
remove_ms = lambda x: re.sub(r"\+\d+\s", "", x)
# Make the string into a datetime object.
mk_dt = lambda x:datetime.strptime(remove_ms(x), "%a %b %d %H:%M:%S %Y")
# Format your datetime object.
my_form = lambda x:"{:%Y-%m-%d}".format(mk_dt(x))
my_form(unformatted)
>>>'2016-10-20'
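As a side note (an alternative, not what the answer above uses): on Python 3, strptime understands the offset directly via the %z directive, so the regex step can be skipped:
from datetime import datetime
dt_obj = datetime.strptime("Thu Oct 20 02:40:55 +0000 2016", "%a %b %d %H:%M:%S %z %Y")
dt_obj.strftime("%Y-%m-%d")  # '2016-10-20'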
Now, assuming you have a pandas DataFrame with a column of strings in the same format, you can apply your new function to all of the elements in that column like so:
my_df.dates_column.apply(my_form)
Or you could make a lambda function to convert each item as you append it to the list in your for loop:
from datetime import date, timedelta as td
# Make a lambda function to directly format your datetime objects.
dt_form = lambda x: "{:%Y-%m-%d}".format(x)
d1 = date(2016, 9, 17)
d2 = date(2016, 11, 7)
delta = d2 - d1
listOfDates = []
for i in range(delta.days + 1):
    # print(d1 + td(days=i))
    listOfDates.append(dt_form(d1 + td(days=i)))
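Once the dates are normalized, the per-day counting the question asks about reduces to value_counts; a minimal sketch, assuming the tweet dicts are collected in a list called tweets (a name invented here for illustration):
import pandas as pd

# parse the raw Twitter timestamps, then count tweets per calendar day
s = pd.to_datetime(pd.Series([t['date'] for t in tweets]),
                   format="%a %b %d %H:%M:%S %z %Y")
tweets_per_day = s.dt.date.value_counts().sort_index()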

Mapping Values in a pandas Dataframe column?

I am trying to filter out some data and seem to be running into some errors.
Below is a replica of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
    convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max within the Start Date column of election_data. I would like to filter the data to keep the rows where the difference between the max and the row's Start Date is less than or equal to 5 days.
I have tried using for - loops, and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work; in Python 3, however, map returns an iterator, so instead of a filtered frame I get the following:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
A timedelta Series does not expose days directly; you reach it through the .dt accessor (the bare attribute exists on TimedeltaIndex and scalar Timedelta objects). A fully working example is below.
import pandas as pd

url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
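To see the mask mechanics without downloading the CSV, here is a tiny self-contained sketch on made-up dates (the values are invented for illustration):
import pandas as pd

df = pd.DataFrame({'Start Date': pd.to_datetime(
    ['2012-11-01', '2012-10-30', '2012-09-15'])})
mask = (df['Start Date'].max() - df['Start Date']).dt.days <= 5
print(df.loc[mask])
#   Start Date
# 0 2012-11-01
# 1 2012-10-30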
If I understand your question correctly, you just want to filter your data to the rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing can handle easily, using .loc.
If you want an entirely new DataFrame object with the filtered data:
last_day = election_data["Start Date"].max()
cutoff = pd.Timedelta(days=5)  # the subtraction below yields timedeltas, so compare to a timedelta
new_df = election_data.loc[last_day - election_data["Start Date"] <= cutoff]
Or if you just want the Start Date column post-filtering:
filtered_dates = election_data.loc[last_day - election_data["Start Date"] <= cutoff, "Start Date"]
Note that the cutoff has to be a timedelta rather than a date string, because last_day minus the Start Date column produces a Series of timedeltas. If you want to see where the window lands, just print(last_day) and count 5 days back.
