Exit loop in python if SQL query doesn't bring any data - python

I am new to Python and have been given a task to download data from different databases (MS SQL and Teradata). The logic behind my code is as follows:
1: The code picks up the vendor data from an Excel file.
2: It loops through all the vendors in that list and gives out a list of documents.
3: It then uses the list downloaded in step 2 to download data from Teradata and append it to a final dataset.
My question is: if the data in the second step is blank, the while loop becomes infinite. Is there any way to exit it and still execute the rest of the iterations?
import pyodbc
import pandas as pd

VendNum = pd.ExcelFile(r"C:\desktop\VendorNumber.xlsx").parse('Sheet3', dtype=str)
VendNum['Vend_Num'] = VendNum['Vend_Num'].astype(str).str.pad(10, side='left', fillchar='0')
fDataSet = pd.DataFrame()
MSSQLconn = pyodbc.connect(r'Driver={SQL Server Native Client 11.0};Server=Servername;Database=DBName;Trusted_Connection=yes;')
TDconn = pyodbc.connect(r"DSN=Teradata;DBCNAME=DBname;UID=User;PWD=password;", autocommit=True)

for index, row in VendNum.iterrows():
    DocNum = pd.DataFrame()
    if index > len(VendNum["Vend_Num"]):
        break
    while DocNum.size == 0:
        print("Read SQL " + row["Vend_Num"])
        DocNum = pd.read_sql_query("select Col1 from Table11 where Col2 = " + "'" + row["Vend_Num"] + "'" + " and Col3 = 'ABC'", MSSQLconn)
        print("Execute SQL " + row["Vend_Num"])
        if DocNum.size > 0:
            print(row["Vend_Num"])
            dataList = ""
            dfToList = DocNum['Col1'].tolist()
            for i in dfToList:
                dataList += "'" + i + "'" + ","
            dataList = dataList[0:-1]
            DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)
            fDataSet = fDataSet.append(DataSet)

MSSQLconn.close()
TDconn.close()
The expected output is for fDataSet to grow with each iteration of the code, but when the query returns a blank DataFrame the while loop doesn't exit.

When you are using system resources such as file handles or database connections, you should use a context manager, for example:
with open(...):
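For illustration, a minimal sketch of the same idea applied to the Teradata connection from the question, using contextlib.closing so the connection is closed even if the query fails (the query is still just the question's placeholder):

import pandas as pd
import pyodbc
from contextlib import closing

# closing() guarantees TDconn.close() is called even if an exception is raised
with closing(pyodbc.connect(r"DSN=Teradata;DBCNAME=DBname;UID=User;PWD=password;", autocommit=True)) as TDconn:
    DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)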

As mentioned by Chris, a while loop runs until its condition is met, so it can easily become infinite. Perhaps create a for loop instead, which makes a few attempts and then passes.
I changed the WHILE to an IF and it's working fine.
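A rough sketch of that change, reusing the question's variable names and setup (the Teradata query is still a placeholder); vendors with no documents are simply skipped instead of being retried forever:

for index, row in VendNum.iterrows():
    print("Read SQL " + row["Vend_Num"])
    DocNum = pd.read_sql_query(
        "select Col1 from Table11 where Col2 = '" + row["Vend_Num"] + "' and Col3 = 'ABC'",
        MSSQLconn,
    )
    if DocNum.empty:
        continue  # nothing for this vendor -- move on to the next one
    dataList = ",".join("'" + i + "'" for i in DocNum['Col1'].tolist())
    DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)
    fDataSet = fDataSet.append(DataSet)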


Python - execute multiple SQL queries in an array

I am going to execute multiple SQL queries in Python, but I think that since they are stored in an array there are some extra characters like [" "] which the read_sql_query function cannot execute, or maybe there is another problem. Does anyone know how I can solve this problem?
My array:
array([['F_TABLE1'],
       ['F_TABLE2'],
       ['F_TABLE3'],
       ['F_TABLE4'],
       ['F_TABLE5'],
       ['F_TABLE6'],
       ['F_TABLE1'],
       ['F_TABLE8']], dtype=object)
My python code:
SQL_Query = []
for row in range(len(array)):
    SQL_Query.append('SELECT ' + "'" + array[row] + "'" + ', COUNT(*) FROM ' + array[row])

SQL = []
for row in range(len(SQL_Query)):
    SQL = pd.read_sql_query(SQL_Query[row], conn)
PS: I separated them into two for loops to see what is wrong with my code.
I also printed one of the queries to see what my array produces:
print(SQL_Query[0])
The output:
["SELECT 'F_CLINICCPARTY_HIDDEN', COUNT(*) FROM F_TABLE1"]
Because of the above output I think the problem is extra characters.
It gives me this error:
Execution failed on sql '["SELECT 'F_TABLE1', COUNT(*) FROM F_TABLE1"]': expecting string or bytes object
The reason for this kind of issue is that indexing a NumPy array like array[row] returns a NumPy object rather than a plain Python string, even though it looks the same in basic array terms. Use the item method to overcome the issue.
Based on your array I have written a sample below; in it I refer to the same array object for both the column and the table, instead of separate columns and tables.
import numpy as np

array1 = np.array([['F_TABLE1'],
                   ['F_TABLE2'],
                   ['F_TABLE3'],
                   ['F_TABLE4'],
                   ['F_TABLE5'],
                   ['F_TABLE6'],
                   ['F_TABLE1'],
                   ['F_TABLE8']], dtype=object)

SQL_Query = []
for row in range(len(array1)):
    SQL_Query.append("SELECT \'{0}\', COUNT(*) FROM {1}".format(str(array1.item(row)), str(array1.item(row))))
print(SQL_Query)
Feel free to use two separate array objects for selecting SQL columns and tables.
Also, wrapping a column name in '' in a query is not recommended; I have included it in this answer only because I have no idea about the destination database type.
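For completeness, a small sketch of running those generated queries and stacking the results with pandas (assuming conn is an open connection, as in the question):

import pandas as pd

results = []
for q in SQL_Query:                               # the plain strings built with array1.item(row)
    results.append(pd.read_sql_query(q, conn))    # each query returns a one-row DataFrame
counts = pd.concat(results, ignore_index=True)
print(counts)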
I changed my query to this (unpacking with * helped me to omit the extra characters):
SQL_Query = []
for row in range(len(array)):
    SQL_Query.append('SELECT ' + "'" + array[row] + "'" + ' AS tableName, COUNT(*) AS Count FROM ' + array[row])

SQL = []
for row in range(len(SQL_Query)):
    SQL.append(pd.read_sql_query(*SQL_Query[row], conn))

result = []
for item in SQL:
    result.append(item)
result = pd.concat(result)
result
My output:
tableName    Count
F_TABLE1     20
F_TABLE2     30
F_TABLE3     220
F_TABLE4     50
F_TABLE5     10
F_TABLE6     2130
F_TABLE7     250

python Multiprocessing with ibm_db2?

I'm reading around 15 million rows from DB2 LUW using the Python ibm_db2 module. I read the data one million rows at a time, and then read that data again in smaller chunks to avoid memory issues. The problem is that each loop of reading one million rows takes 4 minutes. How can I use multiprocessing to avoid the delay? Below is my code.
start = 0
count = 15000000
check_point = 1000000
chunk_size = 100000
connection = get_db_conn_cur(secrets_db2)

for i in range(start, count, check_point):
    query_str = "SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.timestamp) row_num, * from table a) where row_num between " + str(i + 1) + " and " + str(i + check_point) + ""
    number_of_batches = check_point // chunk_size
    last_chunk = check_point - (number_of_batches * chunk_size)
    counter = 0
    cur = connection.cursor()
    cur.execute(query_str)
    chunk_size_l = chunk_size
    while True:
        counter = counter + 1
        columns = [desc[0] for desc in cur.description]
        print('counter', counter)
        if counter > number_of_batches:
            chunk_size_l = last_chunk
        results = cur.fetchmany(chunk_size_l)
        if not results:
            break
        df = pd.DataFrame(results)
        # further processing
The problem here is not multiprocessing; it is your approach to reading the data. As far as I can see, you're using ROW_NUMBER() just to number the rows and then fetch them one million rows per loop.
Because you're not using any WHERE condition on table a, every loop results in a FULL TABLE SCAN. You are wasting a lot of CPU and I/O on the Db2 server just to compute a row number, and that's why each loop takes 4 minutes or more.
Also, this is far from an efficient way to fetch data: you can get duplicate reads or phantom reads if the data changes while your program runs. But that is a subject for another thread.
You should be using one of the 4 fetch methods described here, reading row by row from a SQL cursor. That way you'll use a very small amount of RAM and can read the data efficiently:
sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)
dictionary = ibm_db.fetch_both(stmt)
while dictionary != False:
dictionary = ibm_db.fetch_both(stmt)
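If DataFrame chunks are still wanted, here is a rough sketch under the same assumptions (conn is an open ibm_db connection; the chunk size of 100000 comes from the question). It reads the table once through a single cursor and builds a DataFrame every chunk_size rows:

import ibm_db
import pandas as pd

sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)
columns = [ibm_db.field_name(stmt, i) for i in range(ibm_db.num_fields(stmt))]

chunk_size = 100000
rows = []
row = ibm_db.fetch_tuple(stmt)
while row != False:
    rows.append(row)
    if len(rows) == chunk_size:
        df = pd.DataFrame(rows, columns=columns)
        # further processing of this chunk goes here
        rows = []
    row = ibm_db.fetch_tuple(stmt)
if rows:
    df = pd.DataFrame(rows, columns=columns)
    # process the final, smaller chunk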

Load data from snowflake to pandas dataframe (python) in batches

I am having trouble loading 4.6M rows (11 vars) from Snowflake into Python. I generally use R, and it handles the data with no problem, but I am struggling with Python (which I have rarely used, but need to on this occasion).
Attempts so far:
- Use the new Python connector - obtained an error message (as documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all)
- Amend my previous code to work in batches - this is what I am hoping for help with here.
The example code on the Snowflake webpage https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those familiar with Python would already know this.
This is where I am at:
import snowflake.connector
import pandas as pd
from itertools import chain

SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'
query = '''
select *
from table(%s)
;
'''

def create_snowflake_connection():
    conn = snowflake.connector.connect(
        user='MYUSERNAME',
        account='MYACCOUNT',
        authenticator='externalbrowser',
        warehouse='<WH>',
        database='<DB>',
        role='<ROLE>',
        schema='<SCHEMA>'
    )
    return conn

def fetch_pandas_concat_df(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        colstring = ','.join([col[0] for col in cur.description])
        df = pd.DataFrame(dat, columns=colstring.split(","))
        grow.append(df)
        rows += df.shape[0]
        print(rows)
    return pd.concat(grow)

def fetch_pandas_concat_list(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        grow.append(dat)
        colstring = ','.join([col[0] for col in cur.description])
        rows += len(dat)
        print(rows)
    # note that grow is a list of lists of tuples(?) [[(),()]]
    return pd.DataFrame(list(chain(*grow)), columns=colstring.split(","))

con = create_snowflake_connection()
cur = con.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))
df1 = fetch_pandas_concat_df(cur)   # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur) # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes....
df3.head()
df3.shape
cur.close()
The string manipulation you're doing is extremely expensive computationally. Besides, why would you want to combine everything into a single string just to then break it back out?
Take a look at this section of the Snowflake documentation. Essentially, you can go straight from the cursor object to the dataframe, which should speed things up immensely.
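A rough sketch of what that looks like, reusing create_snowflake_connection, query and SNOWFLAKE_DATA_SOURCE from the question (fetch_pandas_all and fetch_pandas_batches are part of the connector's pandas support, which needs the pandas/pyarrow extra installed):

import pandas as pd

con = create_snowflake_connection()
cur = con.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))

df = cur.fetch_pandas_all()          # everything in one shot, no manual concatenation
# Alternatively, if memory is tight, stream DataFrame batches and concatenate once:
# df = pd.concat(cur.fetch_pandas_batches(), ignore_index=True)

cur.close()
con.close()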

Fast loading and querying data in Python

I am doing some data analysis in Python. I have ~15k financial products identified by ISIN code and ~15 columns of daily data for each of them. I would like to easily and quickly access the data given an ISIN code.
The data is in a MySQL DB. On the Python side, so far I have been working with a Pandas DataFrame.
The first thing I did was to use pd.read_sql to load the DF directly from the database. However, this is relatively slow. Then I tried loading the full database into a single DF and serializing it to a pickle file. Loading the pickle file is fast, a few seconds. However, when querying for an individual product, the performance is the same as when querying the SQL DB. Here is some code:
import pandas as pd
from sqlalchemy import create_engine, engine
from src.Database import Database
import time
import src.bonds.database.BondDynamicDataETL as BondsETL

database_instance = Database(Database.get_db_instance_risk_analytics_prod())

engine = create_engine(
    "mysql+pymysql://"
    + database_instance.get_db_user()
    + ":"
    + database_instance.get_db_pass()
    + "@"
    + database_instance.get_db_host()
    + "/"
    + database_instance.get_db_name()
)
con = engine.connect()


class DataBase:
    def __init__(self):
        print("made a DatBase instance")

    def get_individual_bond_dynamic_data(self, isin):
        return self.get_individual_bond_dynamic_data_from_db(isin, con)

    @staticmethod
    def get_individual_bond_dynamic_data_from_db(isin, connection):
        df = pd.read_sql(
            "SELECT * FROM BondDynamicDataClean WHERE isin = '"
            + isin
            + "' ORDER BY date ASC",
            con=connection,
        )
        df.set_index("date", inplace=True)
        return df


class PickleFile:
    def __init__(self):
        print("made a PickleFile instance")
        df = pd.read_pickle("bonds_pickle.pickle")
        # df.set_index(['isin', 'date'], inplace=True)
        self.data = df
        print("loaded file")

    def get_individual_bond_dynamic_data(self, isin):
        result = self.data.query("isin == @isin")
        return result


fromPickle = PickleFile()
fromDB = DataBase()

isins = BondsETL.get_all_isins_with_dynamic_data_from_db(
    connection=con,
    table_name=database_instance.get_bonds_dynamic_data_clean_table_name(),
)
isins = isins[0:50]

start_pickle = time.time()
for i, isin in enumerate(isins):
    x = fromPickle.get_individual_bond_dynamic_data(isin)
    print("pickle: " + str(i))
stop_pickle = time.time()

for i, isin in enumerate(isins):
    x = fromDB.get_individual_bond_dynamic_data(isin)
    print("db: " + str(i))
stop_db = time.time()

pickle_t = stop_pickle - start_pickle
db_t = stop_db - stop_pickle
print("pickle: " + str(pickle_t))
print("db: " + str(db_t))
print("ratio: " + str(pickle_t / db_t))
This results in:
pickle: 7.636280059814453
db: 6.167926073074341
ratio: 1.23806283819615
Also, curiously enough setting the index on the DF (uncommenting the line in the constructor) slows down everything!
I thought of trying https://www.pytables.org/index.html as an alternative to Pandas. Any other ideas or comments?
Greetings,
Georgi
So, collating some thoughts from the comments:
- Use mysqlclient instead of PyMySQL if you want more speed on the SQL side of the fence.
- Ensure the columns you're querying on are indexed in your SQL table (isin for querying and date for ordering).
- You can set index_col="date" directly in read_sql() according to the docs; it might be faster.
- I'm no Pandas expert, but I think self.data[self.data.isin == isin] would be more performant than self.data.query("isin == @isin").
- If you don't need to query things cross-isin and want to use pickles, you could store the data for each isin in a separate pickle file.
- Also, for the sake of Lil Bobby Tables, the patron saint of SQL injection attacks, use parameters in your SQL statements instead of concatenating strings (see the sketch below).
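A minimal sketch of that last point, assuming the same SQLAlchemy connection and table as in the question:

from sqlalchemy import text
import pandas as pd

def get_individual_bond_dynamic_data_from_db(isin, connection):
    # Bind the ISIN as a parameter instead of concatenating it into the SQL string.
    df = pd.read_sql(
        text("SELECT * FROM BondDynamicDataClean WHERE isin = :isin ORDER BY date ASC"),
        con=connection,
        params={"isin": isin},
    )
    df.set_index("date", inplace=True)
    return df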
It helped a lot to transform the large data frame into a dictionary {isin -> DF} of smaller data frames keyed by ISIN code. Data retrieval is much more efficient from a dictionary than from a DF. Also, it is very natural to be able to request a single DF given an ISIN code. Hope this helps someone else.
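A hedged sketch of that dictionary idea, assuming the full frame loaded from the pickle file has the isin and date columns used in the question:

import pandas as pd

df = pd.read_pickle("bonds_pickle.pickle")

# Split the big frame once into a dictionary keyed by ISIN; each value holds one bond's daily data.
data_by_isin = {isin: group.set_index("date") for isin, group in df.groupby("isin")}

def get_individual_bond_dynamic_data(isin):
    # Dictionary lookup avoids scanning the whole frame for every request.
    return data_by_isin[isin]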

SQL - Update all records of a single column

I'm trying to write a python script that parses through a file and updates a database with the new values obtained from the parsed file. My code looks like this:
startTime = datetime.now()
db = <Get DB Handle>
counter = 0
with open('CSV_FILE.csv') as csv_file:
    data = csv_file.read().splitlines()
    for line in data:
        data1 = line.split(',')
        execute_string = "update table1 set col1=" + data1[1] + " where col0 is '" + data1[0] + "'"
        db.execute(execute_string)
        counter = counter + 1
        if(counter % 1000 == 0 and counter != 0):
            print ".",
print ""
print datetime.now() - startTime
But that operation took about 10 mins to finish. Any way I can tweak my SQL query to quicken it?
Read the lines in batches (maybe of size 1000) and try a bulk update query for MySQL. I think that will be faster than the current process, since with fewer queries you'll be updating more data per round trip.
I agree you need batching. Have a look at this question and best answer How can I do a batch insert into an Oracle database using Python?
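A rough sketch of the batched approach both answers describe, assuming db is a DB-API connection whose driver uses the %s paramstyle (as MySQL drivers do) and the same CSV layout as the question:

import csv
from datetime import datetime

startTime = datetime.now()
cur = db.cursor()
batch = []
with open('CSV_FILE.csv') as csv_file:
    for fields in csv.reader(csv_file):
        batch.append((fields[1], fields[0]))   # (new col1 value, col0 key)
        if len(batch) >= 1000:
            # one round trip updates 1000 rows instead of issuing 1000 single updates
            cur.executemany("update table1 set col1 = %s where col0 = %s", batch)
            batch = []
    if batch:
        cur.executemany("update table1 set col1 = %s where col0 = %s", batch)
db.commit()
print(datetime.now() - startTime)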
