Download AWS RDS Data using Python After a certain Timestamp

I have an RDS database with a single SQL table, and new time-series data shows up in it every 3 hours.
I am trying to write a Python script that pulls all rows of data that came after a certain timestamp (for example t = 04/03/2022 21:45:54).
I tried to look for resources online, but I am confused: which Boto3 functions do I need to use for this, and what should my example query look like?

Here is how I solved the main thing in this question. This code pulls all rows from the RDS SQL database that come after a certain timestamp (oldTimestamp). My first search turned up pyodbc for the job, but it took me some time to get it working. One needs to be careful with the string formatting in the pyodbc.connect() call and with the string format of the SQL query. With these two things handled well, this should work for you very smoothly. Cheers!
import pyodbc
import pandas as pd

# RDS connection details
server = 'write your server endpoint in here'  # e.g. mydb.xxxxxx.us-east-1.rds.amazonaws.com
username = 'yourusername'
password = 'yourpassword'
database = 'nameofdatabase'

cnxn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server +
    ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password
)

# Pull every row newer than oldTimestamp; the '?' placeholder lets
# pyodbc handle the quoting instead of fragile string concatenation.
oldTimestamp = '2022-04-22 23:30:00'
sql = ("SELECT * FROM dbo.eq_min "
       "WHERE dbo.eq_min.Timestamp > ? "
       "ORDER BY dbo.eq_min.Timestamp ASC")
df = pd.read_sql(sql, cnxn, params=[oldTimestamp])
print(df.head())
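To address the Boto3 part of the question: Boto3 manages RDS instances but does not execute SQL against an ordinary RDS database, which is why a driver like pyodbc is used above. The exception is Aurora with the Data API enabled, where the rds-data client can run SQL directly (Aurora only, so it would not apply to a SQL Server instance like the one above). A minimal sketch, assuming a hypothetical cluster ARN and Secrets Manager secret ARN:
import boto3

# Hypothetical ARNs -- replace with your Aurora cluster and secret.
CLUSTER_ARN = 'arn:aws:rds:us-east-1:123456789012:cluster:my-cluster'
SECRET_ARN = 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret'

client = boto3.client('rds-data')
response = client.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database='nameofdatabase',
    sql="SELECT * FROM eq_min WHERE Timestamp > :ts ORDER BY Timestamp ASC",
    parameters=[{'name': 'ts', 'value': {'stringValue': '2022-04-22 23:30:00'}}],
)
for record in response['records']:
    print(record)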

Related

Python Pandas SQLAlchemy: how to make a connection to a local SQL Server

I am trying to connect to a SQL Server on my local network using SQLAlchemy, and I don't know how to do it. The other examples I have seen don't use the more modern Python (3.6+) f-strings. I need the data in a Pandas dataframe "df". I'm not 100% sure, but this local server does not seem to have a username and password requirement...
So this is what is working for me right now.
import pandas as pd
import pyodbc
import sqlalchemy as sql
server = 'NetworkServer' # this is the server name that IT said my data is on.
database = 'Database_name' # The name of the database and this database has multiple tables.
table_name = 't_lake_data' # name of the table that I want.
# I'm not sure but this local server does not have a username and password requirement.
engine = sql.create_engine(f'mssql+pyodbc://{server}/{database}?trusted_connection=yes&driver=SQL+Server')
# I don't know all the column names so I use * to represent all column names.
sql_str = f"SELECT * FROM dbo.{table_name}"
df = pd.read_sql_query(sql_str, engine, parse_dates="DATE_TIME")
If there are concerns with how this looks, leave a comment. Thank you.
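As an aside, if the server or driver name ever contains spaces or special characters, SQLAlchemy (1.4+) can assemble the connection URL for you instead of relying on manual f-string escaping. A small sketch of the same trusted connection:
import sqlalchemy as sql

# Same trusted connection, but the URL object handles any escaping.
url = sql.engine.URL.create(
    drivername='mssql+pyodbc',
    host='NetworkServer',
    database='Database_name',
    query={'trusted_connection': 'yes', 'driver': 'SQL Server'},
)
engine = sql.create_engine(url)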

Querying SQL table column of type JSON from Python takes forever?

I am using the mysql.connector package in Python in order to query some data from my database.
import mysql.connector

connection = mysql.connector.connect(
    host="host",
    user="usr",
    password="pw"
)
mycursor = connection.cursor(buffered=True)
command = "SELECT data FROM schema.`table`"
mycursor.execute(command)
fetchy = mycursor.fetchone()
print(fetchy)
connection.close()
The column "data" is of type JSON. This command takes multiple minutes to run for a single row, and I have several thousand rows. If I select from a different, non-JSON column, the data comes back instantly, but with this column it takes absurdly long. The same query does not take this long in MySQL Workbench.
I'm wondering if there is an issue with my machine or if this is normal.
Is there a reason for this, or anything I can do to fix it?
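One thing worth trying (an assumption, not a confirmed diagnosis): mysql.connector falls back to a pure-Python protocol implementation when its C extension is not available, and decoding large JSON values in pure Python can be extremely slow. Requesting the C extension explicitly looks like this:
import mysql.connector

# use_pure=False requests the C extension implementation; it raises
# an error if the extension is not installed instead of silently
# falling back to the slower pure-Python decoder.
connection = mysql.connector.connect(
    host="host",
    user="usr",
    password="pw",
    use_pure=False,
)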

From SQL Server to a pandas dataframe with pyodbc - works with small tables, but gives an error on complex SQL queries

Step 1: Create a temporary table in SQL Server with pyodbc for the objects
Step 2: Select the objects from the temporary table and load them into a pandas dataframe
Step 3: Print the dataframe
For creating the temporary table I work with a pyodbc cursor, as pandas.read_sql throws errors for that step; conversely, it throws an error if I try to convert the cursor into a pandas dataframe, even with the special line for handling tuples into dataframes.
Here is my program to connect, create, read and print, which works as long as the query stays as simple as it is now (my actual approach has a few hundred lines of SQL):
import pandas as pd
import pyodbc as po

server = 'sql_server'
database = 'sql_database'
connection = po.connect('DRIVER={SQL Server};SERVER=' + server +
                        ';DATABASE=' + database + ';Trusted_Connection=yes;')
cursor = connection.cursor()
query1 = """
CREATE TABLE #ttobject (object_nr varchar(6), change_date datetime)
INSERT INTO #ttobject (object_nr)
VALUES
('112211'),
('113311'),
('114411');
"""
query2 = """
SELECT *
FROM #ttobject
Drop table if exists #ttobject
"""
cursor.execute(query1)
df = pd.read_sql_query(query2, connection)
print(df)
Because of the length of the actual query I'll spare you the details, but here is the error code:
('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
This error gets thrown at query2, which is a multi-statement SELECT with some joins and pivot functions.
When I try to put everything into one cursor, I get issues converting it from a cursor to a DataFrame (I tried several methods; maybe someone knows one that isn't on SO already, or one with a special title so I couldn't find it).
The same problem occurs if I try to use only pd.read_sql: then the creation of the temporary table does not work.
I don't know where to go on from here.
Please let me know if I can assist you with further details which I may have overlooked :S
23.5.19, further investigation:
Following Gord's advice I tried setting autocommit to True, which works for simple SQL statements but not for my really long and time-consuming one.
Secondly, I tried adding cursor.execute('SET NOCOUNT ON; EXEC schema.proc @muted = 1').
At the moment I guess that the first query takes longer, so Python already starts the second one and therefore the connection is blocked. Or the first query returns some feedback, so Python thinks it is finished before it actually is.
I added a time.sleep(100) after execution of the first query but am still getting the hstmt-is-busy error. I wonder why, because it should have had enough time to process the first query.
Fun fact: the query runs smoothly as long as I'm not trying to output any result from it.
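For reference, two changes commonly clear this particular error (a sketch, under the assumption that the "rows affected" messages from the INSERTs are what keep the statement handle busy): prefix the first batch with SET NOCOUNT ON and drain any pending result sets before issuing the second query on the same connection:
# Suppress the "n rows affected" messages that keep the handle busy.
cursor.execute("SET NOCOUNT ON; " + query1)
while cursor.nextset():  # consume any result sets the batch produced
    pass
df = pd.read_sql_query(query2, connection)
print(df)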

cx_Oracle output different from Oracle SQL Developer output for the same SQL query

I am facing a cx_Oracle/Oracle SQL Developer issue.
The Python/Oracle connection has been set up using cx_Oracle.
import cx_Oracle
import pandas as pd
db = cx_Oracle.connect('username/password@hostname:port/SID')
cursor = db.cursor()
cursor.execute("My SQL Query Here")
df = pd.DataFrame(cursor.fetchall())
Then I get a dataframe with 97 rows. But if I copy the same SQL query into Oracle SQL Developer and run it, I get an output with far fewer rows. The SQL query basically selects all deals made before sysdate.
select table1.a, table2.b, table3.date
from table1, table2, table3
where table1.id = table2.id
and table1.id = table3.id
and table2.Action = 'Y'
and table3.Date < 'sysdate'-1
I am guessing this issue is related to the 'sysdate' in the SQL query: the quoted 'sysdate' may not be processed correctly by Oracle SQL, which leads to the different outputs.
Has anyone had this issue before, and any suggestions?
Thank you!!!
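No accepted fix is recorded here, but the likely culprit (an assumption based on the quotes) is that 'sysdate' in quotes is a string literal, not the SYSDATE function, so Oracle never compares against the current date. A sketch of the corrected query, with the same hypothetical table and column names:
import cx_Oracle
import pandas as pd

db = cx_Oracle.connect('username/password@hostname:port/SID')
cursor = db.cursor()
# sysdate without quotes is evaluated by Oracle as the current date
cursor.execute("""
    select table1.a, table2.b, table3.Date
    from table1, table2, table3
    where table1.id = table2.id
      and table1.id = table3.id
      and table2.Action = 'Y'
      and table3.Date < sysdate - 1
""")
df = pd.DataFrame(cursor.fetchall())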

Fastest way to fetch table from MySQL into Pandas

I am trying to determine the fastest way to fetch data from MySQL into Pandas. So far, I have tried three different approaches:
Approach 1: Using pymysql and modifying field type (inspired by Fastest way to load numeric data into python/pandas/numpy array from MySQL)
import pymysql
from pymysql.converters import conversions
from pymysql.constants import FIELD_TYPE
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = pymysql.connect(host = host, port = port, user= user, passwd= passwd, db= db)
Approach 2: Using MySQLdb
import MySQLdb
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host = host, port = port, user= user, passwd= passwd, db= db)
Approach 3: Using sqlalchemy
import sqlalchemy as SQL
engine = SQL.create_engine('mysql+mysqldb://{0}:{1}@{2}:{3}/{4}'.format(user, passwd, host, port, db))
Approach 2 is the best of the three and takes an average of 4 seconds to fetch my table. However, fetching the table takes only 2 seconds in MySQL Workbench. How can I shave off these 2 extra seconds? Does anyone know of alternative ways to accomplish this?
You can use the ConnectorX library, which is written in Rust and is about 10 times faster than pandas.
The library fetches the data from the database and fills the dataframe directly.
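A minimal sketch of that approach (assuming the package is installed with pip install connectorx, and with a hypothetical table name):
import connectorx as cx

# connectorx parses the wire protocol in Rust and builds the
# dataframe directly, skipping the Python row-conversion step.
df = cx.read_sql(
    "mysql://user:passwd@host:port/db",
    "SELECT * FROM my_table",
)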
I think you may find answers using a specific library such as peewee, or the function pd.read_sql_query from the pandas library. To use pd.read_sql_query:
MyEngine = create_engine('[YourDatabase]://[User]:[Pass]@[Host]/[DatabaseName]', echo=True)
df = pd.read_sql_query('select * from [TableName]', con=MyEngine)
Also, for uploading data from a dataframe to SQL:
df.to_sql('[TableName]', MyEngine, if_exists='append', index=False)
You must pass if_exists='append' if the table already exists, or it defaults to fail. You can also pass 'replace' if you want to replace the table with a new one.
For data integrity's sake it's nice to use dataframes for uploads and downloads, due to their ability to handle data well. Depending on the size of your upload, they should be pretty efficient on upload time too.
If you want to go an extra step, peewee queries may help make upload time faster, although I have not personally tested the speed. Peewee is an ORM library, like SQLAlchemy, that I found very easy and expressive to develop with.
You could use dataframes with it as well. Just skim the documentation: you construct and assign a query, then convert it to a dataframe like this:
MyQuery = [TableName].select().where([TableName.column] == "value")
df = pd.DataFrame(list(MyQuery.dicts()))
Hope this helps.
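For completeness, a self-contained sketch of that pattern with a hypothetical model and column names (assuming peewee 3.x):
import pandas as pd
from peewee import CharField, Model, MySQLDatabase

# Hypothetical database and model -- adjust names to your schema.
db = MySQLDatabase('db', user='user', password='passwd', host='host')

class TLakeData(Model):
    station = CharField()

    class Meta:
        database = db
        table_name = 't_lake_data'

# Build the query, then materialize it as a dataframe.
query = TLakeData.select().where(TLakeData.station == "value")
df = pd.DataFrame(list(query.dicts()))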
