pandas read_sql: How to query with a WHERE clause on a date field - python

I have a field month-year which is in datetime64[ns] format.
How do I use this field in a WHERE clause to get rolling 12 months of data (the past 12 months)?
The query below does not work, but I would like something that filters the data to the last 12 months:
select * from ABCD.DEFG_TABLE where monthyear > '2019-01-01'
FYI - it is an Oracle database. If I can avoid hard-coding the value 2019-01-01, that would be great!

You need to compute the cutoff date with datetime, as below.
Get your relative date with relativedelta; if you need it formatted as YYYYMMDD, use strftime from datetime with the format string "%Y%m%d".
import datetime
import pandas as pd
from dateutil.relativedelta import relativedelta

query = "SELECT * FROM ng_scott.Emp"
between_first = datetime.datetime.today()                  # today
between_second = between_first - relativedelta(years=1)    # one year ago

# GET THE DATASET (engine is your existing database connection or SQLAlchemy engine)
dataset = pd.read_sql(query, con=engine)

# KEEP ONLY ROWS FROM THE PAST 12 MONTHS
filtered_dataset = dataset[(dataset['DOJ'] > between_second) & (dataset['DOJ'] <= between_first)]
print(filtered_dataset)
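If you would rather push the cutoff into the SQL itself instead of filtering in pandas, here is a minimal sketch, assuming a cx_Oracle-style connection where named binds such as :cutoff are supported (the bind name and the TO_DATE format are illustrative):
cutoff = (datetime.date.today() - relativedelta(years=1)).strftime("%Y-%m-%d")
query = "SELECT * FROM ABCD.DEFG_TABLE WHERE monthyear > TO_DATE(:cutoff, 'YYYY-MM-DD')"
dataset = pd.read_sql(query, con=engine, params={"cutoff": cutoff})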

You can do this with pure SQL.
The following expression dynamically computes the first day of the month one year ago:
add_months(trunc(sysdate, 'month'), -12)
This reads as: take the date at the first day of the current month, and subtract 12 months from it.
You can use it directly as a filter condition:
select * from ABCD.DEFG_TABLE where monthyear > add_months(trunc(sysdate, 'month'), -12)
NB: this assumes that monthyear is of datatype date.
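Because the database computes the cutoff, nothing needs to be hard-coded on the Python side. A minimal sketch of running it through pandas (engine is assumed to be your existing Oracle connection):
query = "select * from ABCD.DEFG_TABLE where monthyear > add_months(trunc(sysdate, 'month'), -12)"
df = pd.read_sql(query, con=engine)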

Related

MongoEngine filter for date column

I am trying to use MongoEngine to apply a filter on a mongodb collection called Employees. The filter is based on country, city and join_date.
The filter condition is that the number of months obtained by subtracting join_date from today's date should be a minimum of "x" months, where x is a setting value. So, for example, if x is 18 months, I need to find all employees whose join_date was a minimum of 18 months prior to today's date.
I am trying to achieve this by calling the filter() method, but I'm unable to figure out how to do that.
matching_records = Employees.objects(
    country=rule.country,
    city=rule.city
).filter(relativedelta.relativedelta(datetime.datetime.now, join_date).months > 18)
I get an error, "name join_date is not defined". I am unable to figure out how to get the filter to work. Please help.
You need to use the lte (less than or equal) or gte (greater than or equal) operators, like this:
from datetime import datetime
import dateutil.relativedelta
from mongoengine import *

connect()  # connect to MongoDB (database/host as appropriate)

now = datetime.utcnow()
recent = now - dateutil.relativedelta.relativedelta(days=5)
past = now - dateutil.relativedelta.relativedelta(months=20)

class TestDate(Document):
    dt = DateTimeField()

# Saving 3 objects to verify the query works
TestDate(dt=now).save()
TestDate(dt=recent).save()
TestDate(dt=past).save()

TestDate.objects(dt__lte=now - dateutil.relativedelta.relativedelta(months=18))  # returns the document saved with `past`
TestDate.objects(dt__gte=now - dateutil.relativedelta.relativedelta(months=18))  # returns the documents saved with `now` and `recent`
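Applied to the model from the question, a sketch along the same lines (Employees, rule, and join_date are taken from the question; the model definition itself is assumed):
cutoff = datetime.utcnow() - dateutil.relativedelta.relativedelta(months=18)
matching_records = Employees.objects(
    country=rule.country,
    city=rule.city,
    join_date__lte=cutoff)  # joined at least 18 months before today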

In Python I have embedded SQL code with 2 sets of between dates. How to loop the python code to run multiple sets of between dates

I am new to this, so sorry if my question is odd or confusing. In Python I have an embedded SQL query with two BETWEEN date clauses. I have several sets of between-dates for the entire month, and I want to loop the code over each set. I feel like I am missing a package that would help with this, and I have not found a tutorial covering something like this.
Say, for the sake of example:
List of between dates
2020-02-01 AND 2020-02-05,
2020-02-02 AND 2020-02-06,
2020-02-03 AND 2020-02-07,
... all the way to ...
2020-02-28 AND 2020-03-04
Where I am so far is below, and I can't figure out how to set up an array for this.
import getpass

import pandas as pd
import psycopg2

con = psycopg2.connect(host="blah", database="blah", user=getpass.getpass())
cur = con.cursor()
cur.execute("""
SELECT
    Address
    ,Create_Date
    ,Event_Date
FROM
    table.a
WHERE
    Create_Date between '2020-03-20' AND '2020-03-25' --(want to insert a set of dates from the list)
AND
    Event_Date between '2020-03-20' AND '2020-03-25' --(want to insert the same between dates used above)
""")
output = cur.fetchall()
data = pd.DataFrame(output)
cur.close()
con.close()
Use datetime and timedelta:
from datetime import datetime, timedelta

start_date = "2020-02-01"
stop_date = "2020-02-28"
start = datetime.strptime(start_date, "%Y-%m-%d")
stop = datetime.strptime(stop_date, "%Y-%m-%d")

while start < stop:
    first_date = start                       # first date for BETWEEN
    second_date = start + timedelta(days=4)  # second date for BETWEEN
    # use first_date and second_date in the SQL query here
    start = start + timedelta(days=1)        # advance the window one day
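To feed each window into the query, pass the dates as bind parameters rather than formatting them into the SQL string. A sketch under the question's psycopg2 setup (table and column names are taken from the question):
while start < stop:
    first_date = start
    second_date = start + timedelta(days=4)
    cur.execute("""
        SELECT Address, Create_Date, Event_Date
        FROM table.a
        WHERE Create_Date BETWEEN %s AND %s
          AND Event_Date BETWEEN %s AND %s
        """, (first_date, second_date, first_date, second_date))
    output = cur.fetchall()
    # process output for this window here
    start = start + timedelta(days=1)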

PySpark: filtering a DataFrame by date field in range where date is string

My dataframe contains one field which is a date, and it appears in string format, for example:
'2015-07-02T11:22:21.050Z'
I need to filter the DataFrame on the date to get only the records in the last week.
So, I was trying a map approach where I transformed the string dates to datetime objects with strptime:
def map_to_datetime(row):
    format_string = '%Y-%m-%dT%H:%M:%S.%fZ'
    row.date = datetime.strptime(row.date, format_string)

df = df.map(map_to_datetime)
and then I would apply a filter as
df.filter(lambda row: row.date >= (datetime.today() - timedelta(days=7)))
I managed to get the mapping working, but the filter fails with
TypeError: condition should be string or Column
Is there a way to use a filter that works, or should I change the approach, and how?
I figured out a way to solve my problem by using the SparkSQL API with dates in string format.
Here is an example:
from datetime import datetime, timedelta

last_week = (datetime.today() - timedelta(days=7)).strftime('%Y-%m-%d')
new_df = df.where(df.date >= last_week)
Spark >= 1.5
You can use INTERVAL
from pyspark.sql.functions import col, current_date, expr

df_casted.where(col("dt") >= current_date() - expr("INTERVAL 7 days"))
Spark < 1.5
You can solve this without using worker-side Python code or switching to RDDs. First of all, since you use ISO 8601 strings, your data can be directly cast to date or timestamp:
from pyspark.sql.functions import col

df = sc.parallelize([
    ('2015-07-02T11:22:21.050Z', ),
    ('2016-03-20T21:00:00.000Z', )
]).toDF(("d_str", ))

df_casted = df.select(
    "*",
    col("d_str").cast("date").alias("dt"),
    col("d_str").cast("timestamp").alias("ts"))
This will save one roundtrip between the JVM and Python. There are also a few ways you can approach the second part. Date only:
from pyspark.sql.functions import current_date, datediff, unix_timestamp
df_casted.where(datediff(current_date(), col("dt")) < 7)
Timestamps:
def days(i: int) -> int:
    return 60 * 60 * 24 * i

df_casted.where(unix_timestamp() - col("ts").cast("long") < days(7))
You can also take a look at current_timestamp and date_sub.
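For instance, a minimal sketch of the date-only filter using date_sub (continuing with df_casted from above):
from pyspark.sql.functions import col, current_date, date_sub

df_casted.where(col("dt") >= date_sub(current_date(), 7))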
Note: I would avoid using DataFrame.map; it is better to use DataFrame.rdd.map instead. It will save you some work when switching to 2.0+.
from datetime import datetime, timedelta

last_7_days = (datetime.today() - timedelta(days=7)).strftime('%Y-%m-%d')
new_df = signal1.where(signal1.publication_date >= last_7_days)
You need to import datetime and timedelta for this, as shown above.

Pandas - Python, deleting rows based on Date column

I'm trying to delete rows of a dataframe based on one date column: [Delivery Date].
I need to delete rows which are more than 6 months old, except those from the year 1970.
I've created 2 variables:
from datetime import date, timedelta
sixmonthago = date.today() - timedelta(188)
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
but I don't know how to delete rows based on these two variables, using the [Delivery Date] column.
Could anyone provide the correct solution?
You can just filter them out:
df[(df['Delivery Date'].dt.year == 1970) | (df['Delivery Date'] >= sixmonthago)]
This returns all rows where the year is 1970 or the date is within the last 6 months.
You can use boolean indexing and pass multiple conditions to filter the df; for multiple conditions you need the array operators, so | instead of or, and parentheses around each condition due to operator precedence.
Check the docs for an explanation of boolean indexing
Be sure the calculation itself is accurate for "6 months" prior. You may not want to hardcode 188 days; not all months are made equally.
from datetime import date
from dateutil.relativedelta import relativedelta
#http://stackoverflow.com/questions/546321/how-do-i-calculate-the-date-six-months-from-the-current-date-using-the-datetime
six_months = date.today() - relativedelta(months=+6)
Then you can apply the following logic.
import time
nineteen_seventy = time.strptime('01-01-70', '%d-%m-%y')
df = df[(df['Delivery Date'].dt.year == nineteen_seventy.tm_year) | (df['Delivery Date'] >= six_months)]
If you truly want to drop sections of the dataframe, drop the complementary rows by index (note that the complement of an or-filter is an and-filter on the negated conditions):
df = df.drop(df[(df['Delivery Date'].dt.year != nineteen_seventy.tm_year) & (df['Delivery Date'] < six_months)].index)

Break-up year, months & days in Pandas

I have an input parameter dictionary as below:
InparamDict = {'DataInputDate': '2014-10-25'}
Using the field InparamDict['DataInputDate'], I want to pull up data from 2013-10-01 till 2013-10-25. What would be the best way to achieve this using Pandas?
The SQL equivalent is:
DATEFROMPARTS(DATEPART(year,GETDATE())-1,DATEPART(month,GETDATE()),'01')
You forgot to mention whether you're trying to pull the data from a DataFrame, a Series, or something else. If you just want the date parts, get the corresponding attributes from the Timestamp object.
from pandas import Timestamp
dt = Timestamp(InparamDict['DataInputDate'])
dt.year, dt.month, dt.day
If the dates are in a DataFrame (df) and you convert them to dates instead of strings, you can select the data by ranges as well, for instance:
from datetime import datetime

df[df['DataInputDate'] > datetime(2013, 10, 1)]
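For the exact range in the question, a sketch of the DATEFROMPARTS equivalent using Timestamp.replace (assuming df holds a parsed DataInputDate column):
dt = Timestamp(InparamDict['DataInputDate'])
start = dt.replace(year=dt.year - 1, day=1)  # first day of the month, one year back
end = dt.replace(year=dt.year - 1)           # same day, one year back
df[(df['DataInputDate'] >= start) & (df['DataInputDate'] <= end)]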
