I'm executing a long Python function that reads data from two MySQL tables, appends it, transforms it, and writes the outputs back to SQL. For some reason the first write, which uses 'append', works fine, but the second one, which uses 'replace', freezes: the script keeps running but nothing happens, and I can't execute SQL commands against the server from the terminal. The write itself is very small; when I disconnect (restart the MySQL service) and run the write as a separate line of code, there's no problem and it takes less than a second. Here's the code I'm using:
from sqlalchemy import create_engine
import pandas as pd
def local_connect(login_local, password_local, database_local):
    # build a MySQL connection URL and return an open connection
    engine_input = "mysql://" + login_local + ":" + password_local + "@localhost/" + database_local
    engine = create_engine(engine_input)
    con = engine.connect()
    return con

con_ = local_connect(login_local, password_local, database_local)
...
aaa.to_sql(name='aaa',con=con_,if_exists='append',index=True)
bbb.to_sql(name='bbb',con=con_,if_exists='replace',index=True)
aaa and bbb are pandas dataframes.
EDIT: I solved it by reconnecting after the first to_sql:
con_.close()
con_=connect.local_connect(login_local,password_local,database_local)
What is the reason and a better way to do this?
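One pattern often recommended here (a sketch only, reusing aaa, bbb, and the login variables from the question; it is not a confirmed explanation of this particular hang) is to return the Engine instead of a single long-lived Connection and let each to_sql call check out, commit, and release its own connection:

from sqlalchemy import create_engine

def local_engine(login_local, password_local, database_local):
    # hand back the Engine itself instead of keeping one Connection open for the whole run
    url = "mysql://" + login_local + ":" + password_local + "@localhost/" + database_local
    return create_engine(url)

engine = local_engine(login_local, password_local, database_local)

# pandas takes a connection from the pool for each call and returns it afterwards,
# so the DROP/CREATE behind if_exists='replace' is less likely to wait on state
# left over from the previous write
aaa.to_sql(name='aaa', con=engine, if_exists='append', index=True)
bbb.to_sql(name='bbb', con=engine, if_exists='replace', index=True)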
I'm writing Python code that creates a SQLite database and does some calculations on massive tables. To begin with, the reason I'm doing it in SQLite through Python is memory: my data is huge and would hit a memory error if run in, say, pandas, and if chunked it would take ages, mainly because pandas is slow with merges, group-bys, etc.
So my issue is that at some point I want to calculate the exponential of one column in a table (sample code below), but it seems that SQLite doesn't have an EXP function.
I could write the data to a DataFrame and then use NumPy to calculate the EXP, but that defeats the whole point that pushed me towards databases, and adds the extra time of reading/writing back and forth between the DB and Python.
So my question is this: is there a way to calculate the exponential within the database? I've read that I can create the function within sqlite3 in Python, but I have no idea how. If you know how, or can point me to relevant info, I would be thankful. Thanks.
Here is a sample of my code where I'm trying to do the calculation. Note that here the table comes straight from a CSV, but in my actual process it's created within the DB after lots of merges and group-bys:
import sqlite3
import pandas as pd

#set path and file names
folderPath = 'C:\\SCP\\'
inputDemandFile = 'demandFile.csv'
dataBaseName = 'demand.db'  # assumed file name; not shown in the original snippet

#set connection to database
conn = sqlite3.connect(folderPath + dataBaseName)
cur = conn.cursor()

#read demand file into db
inputDemand = pd.read_csv(folderPath + inputDemandFile)
inputDemand.to_sql('inputDemand', conn, if_exists='replace', index=False)

#create new table and calculate EXP
cur.execute('CREATE TABLE demand_exp AS SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand;')
I've read that I can create the function within sqlite3 in Python, but I have no idea how.
That's conn.create_function()
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
>>> import math
>>> conn.create_function('EXP', 1, math.exp)
>>> cur.execute('select EXP(1)')
>>> cur.fetchone()
(2.718281828459045,)
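Applied to the sample above, a minimal sketch (reusing the conn, cur, and inputDemand table from the question; the path and file name are assumed) would register the function on the connection before running the CREATE TABLE statement:

import math
import sqlite3

conn = sqlite3.connect('C:\\SCP\\demand.db')  # assumed path/file name, as in the question's sample
cur = conn.cursor()

# register Python's math.exp under the SQL name EXP, taking one argument
conn.create_function('EXP', 1, math.exp)

# EXP() can now be used in any SQL run on this connection
cur.execute('CREATE TABLE demand_exp AS '
            'SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand '
            'FROM inputDemand;')
conn.commit()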
I have the following Python code which runs multiple SQL queries against an Oracle database and combines the results into one DataFrame.
The queries live in a txt file, one query per row, and the loop runs them sequentially. I want to cancel any SQL query that runs for more than 10 seconds so as not to create overhead in the database.
The following code doesn't actually give me the results I want. More specifically, this bit of the code doesn't really do what I need:
if (time.time() - start) > 10:
    connection.cancel()
The full Python code is below. Perhaps there is an Oracle function that can be called to cancel the query.
import pandas as pd
import cx_Oracle
import time

ip = 'XX.XX.XX.XX'
port = XXXX
svc = 'XXXXXX'
dsn_tns = cx_Oracle.makedsn(ip, port, service_name=svc)
connection = cx_Oracle.connect(user='XXXXXX',
                               password='XXXXXX',
                               dsn=dsn_tns,
                               encoding="UTF-8",
                               nencoding="UTF-8")

filepath = 'C:/XXXXX'
appended_data = []

with open(filepath + 'sql_queries.txt') as fp:
    line = fp.readline()
    while line:
        start = time.time()
        df = pd.read_sql(line, con=connection)
        if (time.time() - start) > 10:
            connection.cancel()
            print("Cancel")
        appended_data.append(df)
        df_combined = pd.concat(appended_data, axis=0)
        line = fp.readline()
        print(time.time() - start)

fp.close()
A better approach would be to spend some time tuning the queries to make them as efficient as necessary. As @Andrew points out, we can't easily kill a database query from outside the database, or even from another session inside the database (it requires DBA-level privileges).
Indeed, most DBAs would rather you ran a query for 20 seconds than attempt to kill every query which runs for more than 10. Apart from anything else, having a process which polls your query to see how long it has been running is itself a waste of database resources.
I suggest you discuss this with your DBA. You may find you're worrying about nothing.
Look at cx_Oracle 7's Connection.callTimeout setting. You'll need to be using Oracle client libraries 18+ (these can connect to Oracle DB 11.2 and later). The documentation for the equivalent node-oracledb parameter explains the fine print behind the Oracle behaviour and round trips.
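A rough sketch of how that could look for the loop in the question (assuming the same connection and filepath as above; callTimeout is in milliseconds, and a call that exceeds it raises a DatabaseError that can be caught so the loop moves on):

import pandas as pd
import cx_Oracle

# connection built exactly as in the question, then capped at 10 seconds per round trip
connection.callTimeout = 10000

appended_data = []
with open(filepath + 'sql_queries.txt') as fp:
    for line in fp:
        try:
            df = pd.read_sql(line, con=connection)
            appended_data.append(df)
        except cx_Oracle.DatabaseError as exc:
            # the query was cancelled by the database after exceeding callTimeout
            print("Query skipped after timeout:", exc)

df_combined = pd.concat(appended_data, axis=0)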
I'm trying to upload a pandas DataFrame directly to Redshift using the to_sql function.
connstr = 'redshift+psycopg2://%s:%s@%s.redshift.amazonaws.com:%s/%s' % \
          (username, password, cluster, port, db_name)
from sqlalchemy import create_engine

def send_data(df, block_size=10000):
    engine = create_engine(connstr)
    with engine.connect() as conn, conn.begin():
        df.to_sql(name='my_table_clean', schema='my_schema', con=conn, index=False,
                  if_exists='replace', chunksize=block_size)
    del engine
The table my_schema.my_table_clean exists (but is empty), and the connection built from connstr is valid (verified by a corresponding retrieve_data method). The retrieve function pulls data from my_table, and my script cleans it up with pandas to output to my_table_clean.
The problem is, I keep getting the following error:
TypeError: _get_column_info() takes exactly 9 arguments (8 given)
during the to_sql function.
I can't seem to figure out what is causing this error. Is anyone familiar with it?
Using
python 2.7.13
pandas 0.20.2
sqlalchemy 1.2.0.
Note: I'm trying to circumvent S3 -> Redshift for this script since I don't want to create a folder in my bucket just for one file, and this single script doesn't conform to my overall ETL structure. I'm hoping to just run this one script after the ETL that creates the original my_table.
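For what it's worth, errors of this exact shape often point to a mismatch between the Redshift dialect package and SQLAlchemy itself (SQLAlchemy 1.2 changed the signature of the internal _get_column_info reflection hook that older sqlalchemy-redshift releases override). A quick way to see which versions are actually loaded before deciding whether to upgrade the dialect or pin SQLAlchemy back to 1.1.x (package names here are assumptions about your environment):

import pkg_resources

# print the versions of the packages involved in this write path
for pkg in ('pandas', 'SQLAlchemy', 'sqlalchemy-redshift', 'psycopg2'):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')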
I am trying to retrieve data from the National Stock Exchange (NSE) for a given scrip name.
I have already created a database named "NSE" in MySQL, but I did not create any table.
I am using the following script to retrieve per-minute data from the NSE website (say I want to retrieve data for the scrip (stock) 'CYIENT'):
from alpha_vantage.timeseries import TimeSeries
import matplotlib.pyplot as plt
import sys
import pymysql

#database connection
conn = pymysql.connect(host="localhost", user="root", passwd="pwd123", database="NSE")
c = conn.cursor()

your_key = "WLLS3TVOG22C6P9J"

def stockchart(symbol):
    ts = TimeSeries(key=your_key, output_format='pandas')
    data, meta_data = ts.get_intraday(symbol=symbol, interval='1min', outputsize='full')
    sql.write_frame(data, con=conn, name='NSE', if_exists='replace', flavor='mysql')
    print(data.head())
    data['close'].plot()
    plt.title('Stock chart')
    plt.show()

symbol = input("Enter symbol name:")
stockchart(symbol)

#committing the connection then closing it.
conn.commit()
conn.close()
On running the above script I get the following error:
'sql' is not defined.
Also, I am not sure whether the above script will create a table in NSE for the (user-input) stock 'CYIENT'.
Before answering, I hope the code is a mock, not the real thing; otherwise I'd suggest changing your credentials.
Now, I believe you are trying to use pandas.io.sql.write_frame (available in pandas <= 0.13.1). However, you forgot to import the module, so the interpreter doesn't recognize the name sql. To fix it, just add
from pandas.io import sql
to the beginning of the script.
Notice the parameters you use in the function call. You use if_exists='replace', so the table NSE will be dropped and recreated every time you run the function, and it will contain whatever the data DataFrame holds at that point.
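On newer pandas versions write_frame is gone entirely; a rough sketch of the modern equivalent (assuming a SQLAlchemy engine built from the same credentials, and choosing the symbol itself as the table name) is DataFrame.to_sql, which also creates the table if it does not exist:

from sqlalchemy import create_engine
from alpha_vantage.timeseries import TimeSeries

your_key = "WLLS3TVOG22C6P9J"  # API key from the question

# SQLAlchemy engine for the existing NSE database, using the pymysql driver
engine = create_engine('mysql+pymysql://root:pwd123@localhost/NSE')

ts = TimeSeries(key=your_key, output_format='pandas')
data, meta_data = ts.get_intraday(symbol='CYIENT', interval='1min', outputsize='full')

# creates the CYIENT table on first run, replaces it on later runs
data.to_sql(name='CYIENT', con=engine, if_exists='replace', index=True)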
I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each one, populates a second table with data processed from the result of the query.
To develop the script, I created a duplicate database with a small subset of the data (4,126 rows).
On the small database, the following code works:
import os
import sys
import random
import mysql.connector

cnx = mysql.connector.connect(user='dbuser', password='thePassword',
                              host='127.0.0.1',
                              database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
                                  host='127.0.0.1',
                                  database='db')

ins_curs = cnx_out.cursor()

curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True) #fail

with open('sql\\getRawData.sql') as fh:
    sql = fh.read()

curs.execute(sql, params=None, multi=False)
result = curs.fetchall()  #<=== script stops at this point
print len(result)  #<=== this line never executes
print curs.column_names

curs.close()
cnx.close()
cnx_out.close()

sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
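For reference, a minimal sketch of that fetchone() workaround (same curs as above):

# incremental fetch instead of the failing fetchall()
rows = []
row = curs.fetchone()
while row is not None:
    rows.append(row)   # or process/insert each row immediately via ins_curs
    row = curs.fetchone()
print(len(rows))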
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?
I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and then was able to fetchall() on large queries with no problem. Both packages seem to follow the python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific about mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector the solution to this is just to switch to a package that works better!
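A rough sketch of what that swap looks like against the code in the question (assuming MySQLdb/mysqlclient is installed; apart from the connection line and the dictionary-cursor detail, the rest of the script stays the same):

import MySQLdb
import MySQLdb.cursors

# same credentials as the mysql.connector version; DictCursor gives dict rows
cnx = MySQLdb.connect(user='dbuser', passwd='thePassword',
                      host='127.0.0.1', db='db',
                      cursorclass=MySQLdb.cursors.DictCursor)
curs = cnx.cursor()

with open('sql\\getRawData.sql') as fh:
    sql = fh.read()

curs.execute(sql)           # MySQLdb has no multi= keyword; a plain execute is enough here
result = curs.fetchall()    # returns all rows without hanging
print(len(result))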