Python3 Loop Through Array of DBs Then Connect - python

I am trying to loop through a number of MySQL hosts which have the same connection info, execute the same query on each of them, and fetch the results.
I'm still learning Python and am stuck on the following:
import pymysql

ENDPOINTS = ['endpoint01', 'endpoint02', 'endpoint03', 'endpoint04']
USER = "SOME_USER"
PASS = "SOME_PASSWORD"

print("Testing")
for x in ENDPOINTS:
    # Open database connection
    DATAB = pymysql.connect(x, USER, PASS)
    cursor = DATAB.cursor()
    cursor.execute("show databases like 'THIS%'")
    data = cursor.fetchall()
    print(data)
    DATAB.close()
And this is the error I receive:
DATAB = pymysql.connect(x,USER,PASS)
TypeError: __init__() takes 1 positional argument but 4 were given

You're passing the parameters incorrectly. Try
DATAB = pymysql.connect(host=x, user=USER, password=PASS)
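Applied to the loop from the question, a minimal sketch (endpoints and credentials are the placeholders from the question; the cursor is used as a context manager, which pymysql supports):

import pymysql

ENDPOINTS = ['endpoint01', 'endpoint02', 'endpoint03', 'endpoint04']
USER = "SOME_USER"
PASS = "SOME_PASSWORD"

for endpoint in ENDPOINTS:
    # connect with keyword arguments instead of positional ones
    conn = pymysql.connect(host=endpoint, user=USER, password=PASS)
    with conn.cursor() as cursor:
        cursor.execute("show databases like 'THIS%'")
        print(cursor.fetchall())
    conn.close()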

Why does any row I insert into my AWS Postgres database (with psycopg2) immediately become a dead tuple? [duplicate]

This question already has an answer here: Why is my postgresql function insert rolling back? (1 answer). Closed 3 years ago.
I'm having a problem connecting psycopg2 to an AWS Postgres server and inserting a row.
Below is a test script that attempts to connect to the server and insert one row. The test query works when I use it in pgAdmin. That is, it runs successfully and the row can be selected.
When I run the python script, the server shows that a connection is made. No exceptions are thrown. I can even try to insert like a hundred rows and there's a big spike in traffic. And yet nothing can be found in the table.
import psycopg2
from getpass import getpass

# connect to database
try:
    connection = psycopg2.connect(
        dbname="postgres",
        user="username",
        password=getpass(),
        host="blahblah.us-east-1.rds.amazonaws.com",
        port='5432'
    )
    print("connected!")
except:
    print("oops")

# cursor object
cursor_boi = connection.cursor()

# simple test query
test_query = """INSERT INTO reviews (review_id, username, movie_id, review_date, review_text, review_title, user_rating, helpful_num, helpful_denom)
VALUES (1, 'myname', 12345678, '2016-06-23', 'I love this movie!', 'Me happy', 5, 6, 12 )"""

# execute query
try:
    cursor_boi.execute(test_query)
    print(test_query)
except:
    print("oopsie!")

# close connection
if connection:
    cursor_boi.close()
    connection.close()
The database statistics report the following for my "reviews" table:
Tuples inserted: 257
Tuples deleted: 1
Dead Tuples: 8
Last autovacuum: 2019-12-13 15:49:20.369715+00
And the Dead Tuples field increments every time I run the Python script. So it seems that every record I insert immediately becomes a dead tuple. Why is this, and how can I stop it? I imagine the records are being overwritten, but if so, they're not being replaced with anything.
Solved. I forgot to commit the connection with connection.commit(). Thanks @roganjosh.
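In the script above, the fix amounts to committing after the execute and before closing:

# execute query and persist it
try:
    cursor_boi.execute(test_query)
    connection.commit()  # without this, psycopg2 rolls the transaction back when the connection closes
    print(test_query)
except:
    print("oopsie!")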

Pandas read_sql inconsistent behaviour dependent on driver?

When I run a query from my local machine to a SQL server db, data is returned. If I run the same query from a JupyterHub server (using ssh), the following is returned:
TypeError: 'NoneType' object is not iterable
Implying it isn't getting any data.
The connection string is OK on both systems (albeit different), because running the same stored procedure works fine on both systems using the connection string:
Local= "Driver={SQL Server};Server=DNS-based-address;Database=name;uid=user;pwd=pwd"
Hub = "DRIVER=FreeTDS;SERVER=IP.add.re.ss;PORT=1433;DATABASE=name;UID=dbuser;PWD=pwd;TDS_Version=8.0"
Is there something in the FreeTDS driver that affects chunksize, or means a SET NOCOUNT is required in the original query, as per NoneType object is not iterable error in pandas? I tried that fix, by the way, and got nowhere.
Are you using pymssql, which is built on top of FreeTDS?
For SQL Server you could also try the Microsoft JDBC Driver with the Python package jaydebeapi: https://github.com/microsoft/mssql-jdbc.
import pandas as pd
import pymssql

conn = pymssql.connect(
    host=r'192.168.254.254',
    port='1433',
    user=r'user',
    password=r'password',
    database='DB_NAME'
)
query = """SELECT * FROM db_table"""
df = pd.read_sql(con=conn, sql=query)
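For reference, a minimal jaydebeapi sketch against the Microsoft JDBC driver; the driver class, URL, credentials, and jar path below are assumptions to adapt to your setup:

import jaydebeapi
import pandas as pd

# hypothetical host, database, credentials, and jar location
conn = jaydebeapi.connect(
    'com.microsoft.sqlserver.jdbc.SQLServerDriver',
    'jdbc:sqlserver://192.168.254.254:1433;databaseName=DB_NAME',
    ['dbuser', 'pwd'],
    '/path/to/mssql-jdbc.jar'
)
curs = conn.cursor()
curs.execute("SELECT * FROM db_table")
# build the DataFrame from the DB-API cursor, taking column names from the description
df = pd.DataFrame(curs.fetchall(), columns=[c[0] for c in curs.description])
curs.close()
conn.close()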

Python - pool.map_async pass params to the function

I'm getting data into a list of pandas.DataFrame objects.
Then I need to send this data to a DB. It took ~2 min per iteration.
So I wanted to know if parallel processing is a good idea (no lock issues or anything like that?).
So I wanted to go from:
for df in df_list:
    # Send a DF's batch to the DB
    print('Sending DF\'s data to DB')
    df.to_sql('ga_canaux', engine, if_exists='append', index=False)
    db.check_data()
To something I found about multiprocessing:
with multiprocessing.Pool(processes=4) as pool:
    results = pool.map_async(df.to_sql(???), df_list)
    results.wait()
How can I pass the params I need in df.to_sql with map_async ?
EDIT:
I tried something to pass N arguments, like:
pool = multiprocessing.Pool()
args = ((df, engine, db) for df in df_list)
results = pool.map(multiproc, args)
results.wait()
but I get the error TypeError: can't pickle _thread._local objects
EDIT2:
I changed the way I'm doing multiprocessing a bit and it kind of works (179 s vs 732 s with the same sample dataset). But I'm facing an error when I try to read from the DB inside the pool.
# Connect to the remote DB
global DB
DB = Database()
global ENGINE
ENGINE = DB.connect()

pool = mp.Pool(mp.cpu_count() - 1)
pool.map(multiproc, df_list)
pool.close()
pool.join()

def multiproc(df):
    print('Sending DF\'s data to DB')
    df.to_sql('ga_canaux', ENGINE, if_exists='append', index=False)
    DB.check_data()  # HERE
Error:
(psycopg2.OperationalError) SSL SYSCALL error: EOF detected
[SQL: 'SELECT COUNT(*) FROM ga_canaux'] (Background on this error at: http://sqlalche.me/e/e3q8)
EDIT 3
When I try it on a bigger sample I get a DB timeout: psycopg2.DatabaseError: SSL SYSCALL error: Operation timed out
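There is no accepted answer in this thread, but a common pattern for the original question (passing extra arguments through map_async) is to fix them with functools.partial and build the engine inside each worker, since engines and connections hold sockets and locks and cannot be pickled. A minimal sketch, assuming a SQLAlchemy URI; the URI, function name send_df, and sample frames are placeholders, not the asker's code:

import multiprocessing as mp
from functools import partial

import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection URI; replace with the real one
DB_URI = "postgresql://user:password@localhost:5432/dbname"

def send_df(df, table_name):
    # each worker creates its own engine, so nothing unpicklable crosses process boundaries
    engine = create_engine(DB_URI)
    df.to_sql(table_name, engine, if_exists="append", index=False)
    engine.dispose()

if __name__ == "__main__":
    df_list = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]
    with mp.Pool(processes=4) as pool:
        # partial() fixes the extra argument; the pool only pickles the DataFrames
        result = pool.map_async(partial(send_df, table_name="ga_canaux"), df_list)
        result.wait()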

MySQLdb Connection Returns a Truncated Output

I'm trying to connect remotely to a SQL server that runs a stored procedure and returns a huge result set as output.
When I run it locally on the SQL box it's fine and returns ~800,000 rows as expected, but when I try to run it using the MySQLdb library from Python, I receive a truncated output of only ~6,000 rows.
It runs fine for smaller data, so I'm guessing there's some result limit that's coming into play.
I'm sure there's some property that needs to be changed somewhere, but there doesn't seem to be any documentation on the PyPI page regarding it.
For explanatory purposes, I've included my code below:
import MySQLdb
import pandas as pd

connection = MySQLdb.connect(sql_server, sql_admin, sql_pw, sql_db)
sql_command = """call function(4)"""
return pd.read_sql(sql_command, connection)
I was able to solve this using cursors. The approach I took is shown below and hopefully it will help anyone else.
connection = MySQLdb.connect(host=sql_server, user=sql_admin, passwd=sql_pw, db=sql_db)
cursor = connection.cursor()
cursor.execute("""call function(4)""")
data = cursor.fetchall()

frame = []
for row in data:
    frame.append(row)

cursor.close()
# close the connection before returning the DataFrame
connection.close()
return pd.DataFrame(frame)
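As a side note, if holding ~800,000 rows in memory ever becomes a problem, MySQLdb also ships a server-side cursor (MySQLdb.cursors.SSCursor) that streams rows instead of buffering the whole result. A minimal sketch with placeholder credentials:

import MySQLdb
import MySQLdb.cursors

connection = MySQLdb.connect(
    host="db-host", user="admin", passwd="secret", db="mydb",
    cursorclass=MySQLdb.cursors.SSCursor,  # server-side cursor: rows are streamed
)
cursor = connection.cursor()
cursor.execute("call function(4)")
for row in cursor:
    pass  # process each row without loading the full result set into memory
cursor.close()
connection.close()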

How to use psycopg2 connection string with variables?

I am trying to connect to a Postgres Database with variables like this:
cs = "dbname=%s user=%s password=%s host=%s port=%s",(dn,du,dp,dh,dbp)
con = None
con = psycopg2.connect(cs)
However I get the error message:
TypeError: argument 1 must be string, not tuple
I need to be able to use variables in the connection string. Anyone know how to accomplish this?
Your code currently creates a tuple with your string and the tuple you are trying to sub. You need:
cs = "dbname=%s user=%s password=%s host=%s port=%s" % (dn,du,dp,dh,dbp)
You could also pass the parameters directly without building a connection string:
con = psycopg2.connect(
    dbname=dn,
    user=du,
    password=dp,
    host=dh,
    port=dbp,
)
See the docs for more details on psycopg2.connect.
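A third variant along the same lines, if the settings live in a dict (the values below are illustrative only):

import psycopg2

params = {
    "dbname": "postgres",
    "user": "username",
    "password": "secret",
    "host": "localhost",
    "port": "5432",
}
con = psycopg2.connect(**params)  # the dict is unpacked into keyword arguments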
