How to create a function with SQL in Python and create columns? - python

I'm accessing a Microsoft SQL Server database with pyodbc in Python, and I have many tables organized by state and year. I'm trying to build a single pandas.DataFrame from all of them, but I don't know how to write a function that also adds YEAR and STATE columns for each of these states and years (I'm using NY2000 as an example). How should I build that function or loop? Sorry for the lack of clarity, it's my first post here :/
tables = ("NY2000DX", "NY2001DX", "NY2002DX", "AL2000DX", "AL2001DX", "AL2002DX", "MA2000DX", "MA2001DX", "MA2002DX")
jobs = (55, 120)
query = """ SELECT
ID,
Job_ID
FROM {}
WHERE Job_ID IN {}
""".format(tables, jobs)
NY2000 = pd.read_sql(query, server)
NY2000["State"] = "NY"
NY2000["Year"] = 2000
My desired result would be a DF with the information from all tables, plus columns specifying State and Year. Like:
Year  State  ID  Job_ID
----  -----  --  ------
2000  NY     13  55
2001  NY     20  55
2002  NY     25  55
2000  AL     15  120
2001  AL     60  120
2002  AL     45  120
Thanks for the support :)

I agree with the comments about a normalised database, and you haven't posted the table structures either. I'm assuming the only way to know year and state is from the table name; if so, you can do something along these lines:
df = pd.DataFrame({"Year": [], "State": [], "ID": [], "Job_ID": []})
tables = ["NY2000DX", "NY2001DX", "NY2002DX", "AL2000DX", "AL2001DX", "AL2002DX", "MA2000DX", "MA2001DX", "MA2002DX"]
jobs = tuple([55, 120])

def readtables(tablename, jobsincluded):
    # Derive the year and state from the table name, e.g. "NY2000DX" -> "2000", "NY"
    query = """ SELECT
    {} AS [Year],
    '{}' AS [State],
    ID,
    Job_ID
    FROM {}
    WHERE Job_ID IN {}
    """.format(tablename[2:6], tablename[:2], tablename, jobsincluded)
    return query

for table in tables:
    print(readtables(table, jobs))
    # dftable = pd.read_sql(readtables(table, jobs), conn)
    # df = pd.concat([df, dftable])
Please note that I commented out the actual table read and the concatenation into the final dataframe, as I don't have a connection to test against; I just printed the resulting queries as a proof of concept.
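For completeness, here is a minimal sketch of what the uncommented loop might look like, assuming a working pyodbc connection object conn and that the generated queries run as-is; collecting the per-table frames in a list and concatenating once is a small variation on the commented-out lines above:

import pandas as pd
# conn = pyodbc.connect(...)  # your own connection string goes here

frames = []
for table in tables:
    # Build the per-table query, which already embeds the Year/State literals
    sql = readtables(table, jobs)
    # Read one table's rows into a DataFrame
    frames.append(pd.read_sql(sql, conn))

# Stack all per-table results into the single DataFrame the question asks for
df = pd.concat(frames, ignore_index=True)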

Related

How can I efficiently filter large amounts of data from an sqlite DB into numpy arrays?

I use a program that outputs data into an sqlite database. For my needs, the data in the existing database is not formatted well, so I am doing some preprocessing which will be inserted into a new table. See below:
Input (there are more columns not shown, and this table is actually a join of three, if that matters)
First Name  Last Name  State  Start Time  Stop Time
----------  ---------  -----  ----------  ---------
Bill        Smith      NV     0           5
Bill        Smith      NV     12          15
Bill        Smith      NV     7           8
Bill        Smith      NV     45          47
Maggie      Tangerine  MI     3           7
Maggie      Tangerine  MI     68          90
Bill        Smith      NV     60          66
Desired output
First Name  Last Name  Times
----------  ---------  ------------------------------------
Bill        Smith      np.array(0,5,12,15,7,8,45,47,60,66)
Maggie      Tangerine  np.array(3,7,68,90)
Right now what I tried first was a query to pull the data for a specific name before inserting into the new table:
df = pd.read_sql_query('''
    SELECT StartTime, StopTime
    FROM Input
    INNER JOIN Input1
        ON ...
    INNER JOIN Input2
        ON ...
    WHERE FirstName = ?
      AND LastName = ?
      AND State = ?
    ''', conn, params=(first, last, state))
np.concatenate((np.unique(df['StartTime'].values), np.unique(df['StopTime'].values)))
This is going to be really terrible for efficiency, since the query for one set of names takes just slightly less than a query for all the names. Would it be better to try and pull the times for a specific name from the pandas dataframe? Is there some other way to do this I am not thinking of?
Yeah, I would look into querying all the data without a WHERE clause that pins down the name and state, and adding a GROUP BY clause at the end:
GROUP BY FirstName, LastName, State

For loop date variable to SQL in python

I am writing an Oracle SQL query inside a Python script. The query is as follows:
query_dict = {
    'df_fire':
    '''
    SELECT INSURED_ID AS CUST_ID, COUNT(*) AS CNT
    FROM POLICY
    WHERE POLICY_EXPDATE >= TO_DATE('2018/01/01', 'YYYY/MM/DD')
      AND POLICY_EFFDATE <= TO_DATE('2018/01/31', 'YYYY/MM/DD')
    GROUP BY INSURED_ID
    '''
}
# Note: the duration of this kind of insurance policy is one year.
# Note: the database only stores each policy's effective date (POLICY_EFFDATE) and expiry date (POLICY_EXPDATE).
Then I dump the result into a pickle file and load it back as follows:
df_fire ={}
account, pwd = 'E', 'I!'
for var, query in query_dict.items():
df_fire[var] = get_SQL_raw_data(account, pwd, var, query)
pickle.dump(df_fire, open('./input/df_fire.pkl', 'wb'))
df_fire_dict = pickle.load(open('./input/df_fire.pkl', 'rb'))
df_fire = df_fire_dict['df_fire']
However, this result only covers 201801, with no snapshot date. My goal is a dataframe with every yyyymm from 201801 to 202004 (as shown below), i.e. a count of how many insurance policies a person holds in each month. Maybe I need to use a for loop, but I couldn't figure out where and how to use it.
My goal:
yyyymm icust_d cnt
-------------------
201801 A12345 1
201802 A12345 1
201803 A12345 2
.... .... ....
202004 A12345 5
I'm new to Python and have been googling how to do this for hours but still can't get it done. Hope someone can help. Thank you very much.
Consider an extended aggregate query to group on YYYYMM. No loop needed:
SELECT TO_CHAR(POLICY_EFFDATE, 'YYYYMM') AS YYYYMM,
INSURED_ID AS CUST_ID,
COUNT(*) AS CNT
FROM POLICY
WHERE POLICY_EXPDATE >= TO_DATE('2018/01/01', 'YYYY/MM/DD')
AND POLICY_EFFDATE <= TO_DATE('2020/04/30', 'YYYY/MM/DD')
GROUP BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM'),
INSURED_ID
ORDER BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM')
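If it helps to see where that lands in the question's Python code, here is a hedged sketch; it reuses the question's own get_SQL_raw_data helper and the account and pwd variables unchanged, which is an assumption on my part:

# Replace the single-month query with the extended one; no loop over months is needed
query_dict = {
    'df_fire': '''
        SELECT TO_CHAR(POLICY_EFFDATE, 'YYYYMM') AS YYYYMM,
               INSURED_ID AS CUST_ID,
               COUNT(*) AS CNT
        FROM POLICY
        WHERE POLICY_EXPDATE >= TO_DATE('2018/01/01', 'YYYY/MM/DD')
          AND POLICY_EFFDATE <= TO_DATE('2020/04/30', 'YYYY/MM/DD')
        GROUP BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM'), INSURED_ID
        ORDER BY TO_CHAR(POLICY_EFFDATE, 'YYYYMM')
    '''
}

df_fire = {}
for var, query in query_dict.items():
    # Same helper call as in the question; it returns one DataFrame per query
    df_fire[var] = get_SQL_raw_data(account, pwd, var, query)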

How to map multiple joined tables into new tables with different column names

So my organization is supplying data to another organization which requires the data to match a particular schema, but the required data is spread across multiple tables with different field names.
I've made a crosswalk of the corresponding fields (e.g. LOCATION_ID -> STATE_ID) where they apply, and some fields don't exist in our data. My question is conceptual, as I'm not sure what the best approach is.
Any pointers in the right direction would be helpful.
I'm most familiar with Python and was thinking of using a Pandas script or an R script to rework the data and export to a new table, but I'm sure there's a more elegant solution in standard SQL or t-SQL.
Edit:
An example per suggestions:
Source Table   New Table
------------   -----------------------------------
SITE_USE       Well Type
UNKNOWN        Well Water Level Recorder Indicator
ELEVATION      Land Surface Elevation Value (ft)
ELEV_METHOD    Land Surface Elevation Method
ELEV_DATUM     Land Surface Elevation Datum
I'm thinking of a view.
In Oracle, there's Scott's sample schema which contains tables about employees and departments. A simple query, with no mapping for any columns, looks like this:
SQL> select e.deptno, d.dname, e.empno, e.ename, e.job, e.sal
2 from emp e join dept d on e.deptno = d.deptno
3 where d.deptno = 10;
DEPTNO DNAME EMPNO ENAME JOB SAL
---------- -------------- ---------- ---------- --------- ----------
10 ACCOUNTING 7782 CLARK MANAGER 2450
10 ACCOUNTING 7839 KING PRESIDENT 5000
10 ACCOUNTING 7934 MILLER CLERK 1300
SQL>
Now, if you can (why couldn't you?) create such a query for YOUR tables, a simple option is to create a view and do the column mapping. For example, the same column set as above:
SQL> create or replace view v_emps as
2 select e.deptno as dept_number,
3 d.dname as dept_name,
4 e.empno as employee_id,
5 e.ename as employee_name,
6 e.job,
7 e.sal as salary
8 from emp e join dept d on e.deptno = d.deptno;
View created.
SQL>
Finally, it is a matter of a simple query to fetch data you need and provide it to whoever needs it, with brand new column names:
SQL> select * from v_emps
2 where dept_number = 10;
DEPT_NUMBER DEPT_NAME EMPLOYEE_ID EMPLOYEE_N JOB SALARY
----------- -------------- ----------- ---------- --------- ----------
10 ACCOUNTING 7782 CLARK MANAGER 2450
10 ACCOUNTING 7839 KING PRESIDENT 5000
10 ACCOUNTING 7934 MILLER CLERK 1300
SQL>
No additional mapping is required; perform any kind of query, it'll do:
SQL> select * from v_emps
2 where job = 'CLERK';
DEPT_NUMBER DEPT_NAME EMPLOYEE_ID EMPLOYEE_N JOB SALARY
----------- -------------- ----------- ---------- --------- ----------
10 ACCOUNTING 7934 MILLER CLERK 1300
20 RESEARCH 7876 ADAMS CLERK 1100
20 RESEARCH 7369 SMITH CLERK 800
30 SALES 7900 JAMES CLERK 950
SQL>
If it is MS SQL Server you use (regarding T-SQL you mentioned), no problem either; there are views there as well.
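Since the question mentions Python and pandas, here is a quick sketch of what the consuming side could look like once such a view exists; the view name v_sites, the driver choice and the connection string are illustrative assumptions only:

import pandas as pd
import pyodbc  # any DB-API connection works the same way with pandas

# Placeholder connection details
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")

# The view already exposes the receiving organisation's column names,
# so no renaming is needed on the Python side
sites = pd.read_sql("SELECT * FROM v_sites", conn)
print(sites.columns.tolist())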
A SQL query to join tables with aliased column names is straightforward:
SELECT C.Id AS Identifier, C.LastName + ', ' + C.FirstName AS CustomerName
FROM [Order] O JOIN Customer C ON O.CustomerId = C.Id
Here Order is one table and Customer is another; each gets an alias (O and C respectively). CustomerName is a new column built from two existing columns, and Identifier is an alias given to the Customer table's Id column (C.Id). You can join further tables the same way. ON O.CustomerId = C.Id joins the rows where the two columns hold the same value.

Peewee counting all objects with specific value

My struggle is not with creating a table; I can create a table. The problem is populating columns based on calculations over other tables.
I have looked at How to create all tables defined in models using peewee, and it does not help me do summations, counts, etc.
I have a hypothetical database (database.db) and created these two tables:
Table 1 (from class User)
id name
1 Jamie
2 Sam
3 Mary
Table 2 (from class Sessions)
id SessionId
1 4121
1 4333
1 4333
3 5432
I simply want to create a new table using peewee:
id name sessionCount TopSession # <- (Session that appears most for the given user)
1 Jamie 3 4333
2 Sam 0 NaN
3 Mary 1 5432
4 ...
Each entry in Table1 and Table2 was created using User.create(...) or Sessions.create(...)
The new table should look at the data that is in the database.db (ie Table1 and Table2) and perform the calculations.
This would be simple in Pandas, but I can't seem to find a query that can do this. Please help.
I found it...
from peewee import fn

query = Sessions.select(fn.COUNT(Sessions.id)).where(Sessions.id == 1)
count = query.scalar()
print(count)  # 3

# Or:
count = Sessions.select().where(Sessions.id == 1).count()  # 3
For anyone out there : )
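To get from that count to the full table in the question (sessionCount plus the most frequent SessionId per user), here is a rough, untested sketch using the same peewee primitives; the model and field names User, Sessions.id and Sessions.SessionId are taken from the question:

from peewee import fn

rows = []
for user in User.select():
    # How many session rows reference this user
    session_count = Sessions.select().where(Sessions.id == user.id).count()

    # Most frequent SessionId for this user (None when the user has no sessions)
    top = (Sessions
           .select(Sessions.SessionId, fn.COUNT(Sessions.SessionId).alias('cnt'))
           .where(Sessions.id == user.id)
           .group_by(Sessions.SessionId)
           .order_by(fn.COUNT(Sessions.SessionId).desc())
           .first())

    rows.append((user.id, user.name, session_count, top.SessionId if top else None))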

select data from table and compare with dataframe

I have a dataframe like this
Name age city
John 31 London
Pierre 35 Paris
...
Kasparov 40 NYC
I would like to select data from a Redshift city table, using SQL, where the city is one of the cities in the dataframe.
query = select * from city where ....
Can you help me to accomplish this query?
Thank you
Jeril's answer is going in the right direction but is not complete: df['city'].unique() does not return a string, it returns an array of values, and you need a string in your WHERE clause.
# create a string for cities to use in sql, the way sql expects the string
unique_cities = ','.join("'{0}'".format(c) for c in list(df['city'].unique()))
# output
'London','Paris'
#sql query would be
query = f"select * from city where name in ({unique_cities})"
The code above assumes you are using Python 3.x (for the f-string).
Please let me know if this solves your issue.
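As a quick usage sketch, assuming a working Redshift connection object conn (psycopg2, SQLAlchemy, etc.) and that the column really is called name:

import pandas as pd

# Build the IN (...) list from the dataframe, exactly as above
unique_cities = ','.join("'{0}'".format(c) for c in df['city'].unique())
query = f"select * from city where name in ({unique_cities})"

# Run the query and get the matching rows back as a DataFrame
city_rows = pd.read_sql(query, conn)

If the city values can contain quotes or come from untrusted input, a parameterized query is a safer choice than string formatting.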
You can try the following:
unique_cities = df['city'].unique()
# sql query
select * from city where name in unique_cities
