I am writing an Oracle SQL query inside a Python script. The query is as follows:
query_dict = {
    'df_fire': '''
        SELECT INSURED_ID AS CUST_ID, COUNT(*) AS CNT
        FROM POLICY
        WHERE POLICY_EXPDATE >= TO_DATE('2018/01/01', 'YYYY/MM/DD')
          AND POLICY_EFFDATE <= TO_DATE('2018/01/31', 'YYYY/MM/DD')
        GROUP BY INSURED_ID
    '''
}
"""
#Note: The duration for this kind of insurance policy is one-year.
#Note: It only shows each policy's effective date(POLICY_EFFDATE) and expire date(POLICY_EXPDAT) in the database.
I then save the result to a pickle file and load it back like this:
df_fire = {}
account, pwd = 'E', 'I!'
for var, query in query_dict.items():
    df_fire[var] = get_SQL_raw_data(account, pwd, var, query)
pickle.dump(df_fire, open('./input/df_fire.pkl', 'wb'))

df_fire_dict = pickle.load(open('./input/df_fire.pkl', 'rb'))
df_fire = df_fire_dict['df_fire']
However, this gives results only for 201801, with no snapshot date. My goal is to build a dataframe with yyyymm running from 201801 to 202004 (as shown below). That is, I want to count how many insurance policies a person has in each month. Maybe I need a for loop, but I couldn't figure out where and how to use it.
My goal:

yyyymm  cust_id  cnt
------  -------  ---
201801  A12345   1
201802  A12345   1
201803  A12345   2
...     ...      ...
202004  A12345   5
I'm new to Python and have been googling how to do this for hours but still can't get it done. I hope someone can help. Thank you very much.
Consider an extended aggregate query that also groups on YYYYMM. No loop is needed:
SELECT TO_CHAR(POLICY_EFFDATE, 'YYYYMM') AS YYYYMM,
INSURED_ID AS CUST_ID,
COUNT(*) AS CNT
FROM POLICY
WHERE POLICY_EXPDATE >= TO_DATE('2018/01/01', 'YYYY/MM/DD')
AND POLICY_EFFDATE <= TO_DATE('2020/04/30', 'YYYY/MM/DD')
GROUP BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM'),
INSURED_ID
ORDER BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM')
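If it helps, here is a minimal sketch of dropping that query into the pickle pipeline from your question; get_SQL_raw_data, account and pwd are the helper and credentials you already have, and I'm assuming the helper returns a pandas DataFrame:

import pickle

query_dict = {
    'df_fire': '''
        SELECT TO_CHAR(POLICY_EFFDATE, 'YYYYMM') AS YYYYMM,
               INSURED_ID AS CUST_ID,
               COUNT(*) AS CNT
        FROM POLICY
        WHERE POLICY_EXPDATE >= TO_DATE('2018/01/01', 'YYYY/MM/DD')
          AND POLICY_EFFDATE <= TO_DATE('2020/04/30', 'YYYY/MM/DD')
        GROUP BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM'), INSURED_ID
        ORDER BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM')
    '''
}

df_fire = {}
for var, query in query_dict.items():
    # get_SQL_raw_data is the helper from the question; assumed to return a DataFrame
    df_fire[var] = get_SQL_raw_data(account, pwd, var, query)

pickle.dump(df_fire, open('./input/df_fire.pkl', 'wb'))
df_fire = pickle.load(open('./input/df_fire.pkl', 'rb'))['df_fire']
# df_fire should now have one row per (YYYYMM, CUST_ID) with the policy count in CNT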
Related
I'm accessing a Microsoft SQL Server database with pyodbc in Python, and I have many tables split by state and year. I'm trying to build a single pandas.DataFrame from all of them, but I don't know how to write a function (or loop) that still adds columns specifying YEAR and STATE for each of these states and years (I'm using NY2000 as an example). How should I build that function or loop? Sorry for the lack of clarity, it's my first post here :/
tables = tuple([NY2000DX,NY2001DX,NY2002DX,AL2000DX,AL2001DX,AL2002DX,MA2000DX,MA2001DX,MA2002DX])
jobs = tuple([55,120])
query = """ SELECT
ID,
Job_ID,
FROM {}
WHERE Job_ID IN {}
""".format(tables,jobs)
NY2000 = pd.read_sql(query, server)
NY2000["State"] = NY
NY2000["Year"] = 2000
My desired result would be a DataFrame with the information from all tables, with columns specifying State and Year. Like:
Year  State  ID  Job_ID
----  -----  --  ------
2000  NY     13  55
2001  NY     20  55
2002  NY     25  55
2000  AL     15  120
2001  AL     60  120
2002  AL     45  120
...   ...    ..  ...
Thanks for the support :)
I agree with the comments about a normalised database, and you haven't posted the table structures either. I'm assuming the only way to know the year and state is from the table name; if so, you can do something along these lines:
df = pd.DataFrame({"Year": [], "State": [], "ID": [], "JOB_ID": []})
tables = ["NY2000DX", "NY2001DX", "NY2002DX", "AL2000DX", "AL2001DX", "AL2002DX", "MA2000DX", "MA2001DX", "MA2002DX"]
jobs = tuple([55, 120])

def readtables(tablename, jobsincluded):
    query = """ SELECT
                    {} AS [YEAR],
                    '{}' AS [STATE],
                    ID,
                    Job_ID
                FROM {}
                WHERE Job_ID IN {}
            """.format(tablename[2:6], tablename[:2], tablename, jobsincluded)
    return query

for table in tables:
    print(readtables(table, jobs))
    # dftable = pd.read_sql(readtables(table, jobs), conn)
    # df = pd.concat([df, dftable])
Please note that I commented out the actual table reading and concatenation into the final dataframe, as I don't have a connection to test against; I just print the resulting queries as a proof of concept.
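With a live connection, the commented-out part could look roughly like this (an untested sketch; conn stands for whatever pyodbc connection you already have):

import pandas as pd

frames = [pd.read_sql(readtables(table, jobs), conn) for table in tables]
df = pd.concat(frames, ignore_index=True)  # YEAR and STATE already come from the query itself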
General:
I need to create a statistics tool for a given DB with many hundreds of thousands of entries. I never need to write to the DB, only read data.
Problem:
I have a user table; in my case I select 20k users (between two dates). Now I need to select only the users (out of these 20k) who spent money at least once.
To do so I have 3 different tables that record whether a user spent money. (So we are working with 4 tables in total):
User, Transaction_1, Transaction_2, Transaction_3
What I did so far:
In the model of the User class I have created a property which checks whether the user appears at least once in one of the Transaction tables:
@property
def spent_money_once(self):
    spent_money_atleast_once = False

    in_transactions = Transaction_1.query.filter(Transaction_1.user_id == self.id).first()
    if in_transactions:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    in_transactionsVK = Transaction_2.query.filter(Transaction_2.user_id == self.id).first()
    if in_transactionsVK:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    in_transactionsStripe = Transaction_3.query.filter(Transaction_3.user_id == self.id).first()
    if in_transactionsStripe:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    return spent_money_atleast_once
Then I created two counters for male and female users, so I can count how many of these 20k users spent money at least once:
males_payed_atleast_once = 0
females_payed_atleast_once = 0

for male_user in male_users.all():
    if male_user.spent_money_once is True:
        males_payed_atleast_once += 1

for female_user in female_users.all():
    if female_user.spent_money_once is True:
        females_payed_atleast_once += 1
But this takes really long to calculate, around 40-60 minutes. I have never worked with such large amounts of data; maybe this is normal?
Additional info:
In case you are wondering what male_users and female_users look like:
# Note: is this even efficient? If .all() completes the query, then I need to store the
# result of .all() in variables; otherwise every time I call .all() it takes time again.
global all_users
global male_users
global female_users

all_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date)
male_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "1")
female_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "2")
I am trying to save certain queries in global variables to improve performance.
I am using Python 3 | Flask | Sqlalchemy for this task. The DB is MySQL.
I have now tried a completely different approach using join, and it is way faster: it completes in about 10 seconds instead of roughly 60 minutes:
# males
paying_males_1 = male_users.join(Transaction_1, Transaction_1.user_id == Users.id).all()
paying_males_2 = male_users.join(Transaction_2, Transaction_2.user_id == Users.id).all()
paying_males_3 = male_users.join(Transaction_3, Transaction_3.user_id == Users.id).all()
males_payed_all = paying_males_1 + paying_males_2 + paying_males_3
males_payed_atleast_once = len(set(males_payed_all))
I simply join each table and call .all(); the results are plain lists. After that I merge all the lists and convert them to a set, which leaves only the unique users. The last step is to count them with len() on the set.
Assuming you need to aggregate the info of the 3 tables together before counting, this will be a bit faster:
SELECT userid, SUM(ct) AS total
FROM (
    ( SELECT userid, COUNT(*) AS ct FROM trans1 GROUP BY userid )
    UNION ALL
    ( SELECT userid, COUNT(*) AS ct FROM trans2 GROUP BY userid )
    UNION ALL
    ( SELECT userid, COUNT(*) AS ct FROM trans3 GROUP BY userid )
) AS t
GROUP BY userid
HAVING total >= 1
I recommend testing this in the mysql command-line tool first, then figuring out how to run it from Python 3 | Flask | Sqlalchemy.
Funny thing about packages that "hide the database": you still need to understand how the database works if you are going to do anything non-trivial.
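For the conversion step, one option is to run the raw SQL through the session. A minimal sketch, assuming a Flask-SQLAlchemy db object and keeping the placeholder table/column names from the query above:

from sqlalchemy import text

raw_sql = text("""
    SELECT userid, SUM(ct) AS total
    FROM (
        ( SELECT userid, COUNT(*) AS ct FROM trans1 GROUP BY userid )
        UNION ALL
        ( SELECT userid, COUNT(*) AS ct FROM trans2 GROUP BY userid )
        UNION ALL
        ( SELECT userid, COUNT(*) AS ct FROM trans3 GROUP BY userid )
    ) AS t
    GROUP BY userid
    HAVING total >= 1
""")

rows = db.session.execute(raw_sql).fetchall()
paying_user_ids = {row[0] for row in rows}   # unique user ids that spent money at least once
print(len(paying_user_ids))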
The following is an SQL query for Google's BigQuery that counts the number of times my PyPI package has been downloaded in the last 30 days.
#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pycotools'
-- Only query the last 30 days of history
AND _TABLE_SUFFIX
BETWEEN FORMAT_DATE(
'%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
Is it possible to modify this query so that I get the number of downloads every 30 days since the package was uploaded? The output would be a .csv that looks something like this:
date count
01-01-2016 10
01-02-2016 20
.. ..
01-05-2018 100
I recommend using EXTRACT (or MONTH() in legacy SQL) and counting only the file.project field, as it will let the query run faster. The query you could use is:
#standardSQL
SELECT
EXTRACT(MONTH FROM _PARTITIONDATE) AS month_,
EXTRACT(YEAR FROM _PARTITIONDATE) AS year_,
count(file.project) as count
FROM
`the-psf.pypi.downloads*`
WHERE
file.project= 'pycotools'
GROUP BY 1, 2
ORDER by 1 ASC
I tried it with the public dataset:
#standardSQL
SELECT
EXTRACT(MONTH FROM pickup_datetime) AS month_,
EXTRACT(YEAR FROM pickup_datetime) AS year_,
count(rate_code) as count
FROM
`nyc-tlc.green.trips_2015`
WHERE
rate_code=5
GROUP BY 1, 2
ORDER by 1 ASC
Or using legacy SQL:
SELECT
MONTH(pickup_datetime) AS month_,
YEAR(pickup_datetime) AS year_,
count(rate_code) as count
FROM
[nyc-tlc:green.trips_2015]
WHERE
rate_code=5
GROUP BY 1, 2
ORDER by 1 ASC
The result is:
month_ year_ count
1 2015 34228
2 2015 36366
3 2015 42221
4 2015 41159
5 2015 41934
6 2015 39506
I see you are using _TABLE_SUFFIX; if you are querying a partitioned table, you can use the _PARTITIONDATE pseudo-column instead of formatting the date and using DATE_SUB. This will use less compute time as well.
To query from one partition:
SELECT
[COLUMN]
FROM
[DATASET].[TABLE]
WHERE
_PARTITIONDATE BETWEEN '2016-01-01'
AND '2016-01-02'
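To get the monthly counts out as a .csv, here is a minimal sketch using the google-cloud-bigquery client, mirroring the _PARTITIONDATE query above (it assumes the library and credentials are already set up, and the output file name is just an example; to_dataframe() needs pandas installed):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
#standardSQL
SELECT
  EXTRACT(YEAR FROM _PARTITIONDATE) AS year_,
  EXTRACT(MONTH FROM _PARTITIONDATE) AS month_,
  COUNT(file.project) AS count
FROM `the-psf.pypi.downloads*`
WHERE file.project = 'pycotools'
GROUP BY 1, 2
ORDER BY 1, 2
"""

df = client.query(sql).to_dataframe()   # run the query and pull the result into pandas
df.to_csv('pycotools_downloads_by_month.csv', index=False)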
I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is to make 440 queries and do the averaging afterward. But this is very time consuming, since for every query the whole database is searched for matching entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
                 WHERE EXTRACT(year from datetime_)=%s \
                 AND EXTRACT(month from datetime_)=%s \
                 AND EXTRACT(day from datetime_)=%s",
                (time_cur.year, time_cur.month, time_cur.day))
    Y = numpy.array([0]*(NoFields-1))
    n = 0.0
    while True:
        n = n + 1
        row = cur.fetchone()
        if row == None:
            break
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y/n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy functions, so I don't understand what you are averaging. If you show your table and the logic behind the average...
But this is how to get a daily average for individual columns:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (time_cur, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print day[0], day[1], day[2]
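If you want the result back in the same shape as your X and Data arrays, a small follow-on sketch (it reuses the rs list from the code above; the number of columns is just whatever you put in the select):

import numpy

X = [r[0] for r in rs]                                  # the dates, like X in your code
Data = numpy.array([r[1:] for r in rs], dtype=float)    # one row of column averages per day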
This answer uses SQL Server syntax. I am not sure how different PostgreSQL is, but it should be fairly similar; you may find that things like the DATEADD, DATEDIFF and CONVERT statements differ (almost certainly the CONVERT statement: just convert the date to a varchar instead; I am only using it as a report name, so it is not vital). You should be able to follow the theory of this even if the code doesn't run in PostgreSQL without tweaking.
First, create a reports table (you will use this to link to the actual table you want to report on):
CREATE TABLE Report_Periods (
report_name VARCHAR(30) NOT NULL PRIMARY KEY,
report_start_date DATETIME NOT NULL,
report_end_date DATETIME NOT NULL,
CONSTRAINT date_ordering
CHECK (report_start_date <= report_end_date)
)
Next, populate the reports table with the dates you need to report on. There are many ways to do this; the method I've chosen here will only use the days you need, but you could populate it with all the dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
[DatePartOnly] AS StartDate,
DATEADD(ms, -3, DATEADD(dd,1,[DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
FROM os_table ) AS M
Note that in SQL Server the smallest time increment allowed is 3 milliseconds, so the statement above adds 1 day and then subtracts 3 milliseconds to create start and end datetimes for each day. Again, PostgreSQL may use different values.
This means you can simply link the reports table back to your os_table to get averages, counts, etc. very easily:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
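For what it's worth, here is a rough, untested PostgreSQL translation of the same idea driven from psycopg2; it sidesteps the 3-millisecond trick by using half-open day ranges, and the connection string is simply borrowed from the other answer:

import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cur = conn.cursor()

# Build the per-day report periods from the distinct dates present in os_table.
cur.execute('''
    CREATE TEMP TABLE report_periods AS
    SELECT d::text               AS report_name,
           d                     AS report_start_date,
           d + INTERVAL '1 day'  AS report_end_date    -- exclusive upper bound
    FROM (SELECT DISTINCT datetime_::date AS d FROM os_table) AS days
''')

# Join the periods back to os_table and aggregate per day.
cur.execute('''
    SELECT R.report_name, AVG(T.c1) AS avg_c1, COUNT(*) AS num_rows
    FROM os_table AS T
    JOIN report_periods AS R
      ON T.datetime_ >= R.report_start_date
     AND T.datetime_ <  R.report_end_date
    GROUP BY R.report_name
    ORDER BY R.report_name
''')
for report_name, avg_c1, num_rows in cur.fetchall():
    print report_name, avg_c1, num_rows   # Python 2 print, matching the code in the question
conn.close()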
I have an MS Access database which contains 2 tables. The first contains rows with "dd/mm/yyyy", "hh:mm" and "price", with each row showing data for the next incremental minute; basically, intraday minute price data for a stock index.
The second table contains "dd/mm/yyyy", "Closing Price" and "Volatility", with volatility expressed in a format such as "15.55". Each additional row shows information for the next incremental day.
So the first table has rows going up in minutes, the second has rows going up in days.
I would like to form a pyodbc query that pulls data from the second table based on the date and references it to the first table, so that I can get rows of data containing:
"dd/mm/yy", "hh:mm", "price", "closing price", "volatility".
The "closing price" and "volatility" values will then hopefully repeat themselves until the minutes tick over into the next day, at which point they will update to show the next day's values.
I am completely new to programming and have managed to do the following so far. I can pull each query down individually (and print them so I know what I am getting).
I know that the output below contains an additional "00:00:00" after the date. I am not 100% sure how to get rid of that, and I don't think it is vital anyway, as the dates should still be able to be referenced (one problem at a time! ;)).
Output from table one looks like this:
2005-01-03 00:00:00 17:00 1213.25
2005-01-03 00:00:00 17:01 1213.25
2005-01-03 00:00:00 17:02 1213.75
2005-01-03 00:00:00 17:03 1213.75
Output from table 2 looks like this:
2005-01-03 00:00:00 1206.25 14.08
2005-01-04 00:00:00 1191.00 13.98
2005-01-05 00:00:00 1183.25 14.09
2005-01-06 00:00:00 1188.25 13.58
Here is my code so far:
from math import sqrt
import time
import pyodbc
"""establishes connection to the database and creates "cursors" with which to query data"""
ACCESS_DATABASE_FILE = "C:\\Python27\\Lib\\site-packages\\xy\\Apache Python\\SandP.accdb"
ODBC_CONN_STR = 'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=%s;' % ACCESS_DATABASE_FILE
cnxn = pyodbc.connect(ODBC_CONN_STR)
cursor1 = cnxn.cursor()
cursor2 = cnxn.cursor()
"""uses cursor1 to pull out distinct rows of data, ordered by distinct date/time"""
cursor1.execute("select distinct [Intraday_values].[Date_], Time_, Close from [Intraday_values] order by [Intraday_values].[Date_]")
row1 = cursor1.fetchall()
for row1 in row1:
    print row1[0], row1[1], row1[2]
time.sleep(2)
"""uses cursor2 to pull out settlement prices and volatility data and order by distinct date"""
cursor2.execute("select distinct [Closing_prices].[Date_], Last_price, Volatility from [Closing_prices] order by [Closing_prices].[Date_]")
row2 = cursor2.fetchall()
for row2 in row2:
    print row2[0], row2[1], row2[2]
time.sleep(2)
Any help or suggestions would be very much appreciated and may just save me from pulling the rest of my hair out….
It looks to me like you should be able to do that with a single query that JOINs both tables:
cursor1.execute(
    "select distinct iv.[Date_], iv.Time_, iv.Close, cp.Last_price, cp.Volatility " +
    "from [Intraday_values] AS iv INNER JOIN [Closing_prices] AS cp" +
    " ON cp.[Date_] = iv.[Date_] " +
    "order by iv.[Date_], iv.Time_")
row1 = cursor1.fetchall()
for row1 in row1:
    print row1[0], row1[1], row1[2], row1[3], row1[4]
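On the stray "00:00:00": Access date/time values usually come back from pyodbc as Python datetime objects, so formatting them on output is normally enough. A small sketch, reusing the row1 list from the snippet above and assuming you only want the date part:

for r in row1:
    # strftime drops the time component when printing the Date_ value
    print r[0].strftime('%d/%m/%Y'), r[1], r[2], r[3], r[4]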