I'm reading around 15 million rows from Db2 LUW using the Python ibm_db module. I read the data one million rows at a time, and then read each of those again in smaller chunks to avoid memory issues. The problem is that one loop over one million rows takes about 4 minutes. How can I use multiprocessing to avoid the delay? Below is my code.
import pandas as pd

start = 0
count = 15000000
check_point = 1000000
chunk_size = 100000

connection = get_db_conn_cur(secrets_db2)  # helper that returns a Db2 connection

for i in range(start, count, check_point):
    query_str = ("SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.timestamp) row_num, a.* FROM table a) "
                 "WHERE row_num BETWEEN " + str(i + 1) + " AND " + str(i + check_point))
    number_of_batches = check_point // chunk_size
    last_chunk = check_point - (number_of_batches * chunk_size)
    counter = 0
    cur = connection.cursor()
    cur.execute(query_str)
    chunk_size_l = chunk_size
    while True:
        counter = counter + 1
        columns = [desc[0] for desc in cur.description]
        print('counter', counter)
        if counter > number_of_batches:
            chunk_size_l = last_chunk
        results = cur.fetchmany(chunk_size_l)
        if not results:
            break
        df = pd.DataFrame(results)
        # further processing
The problem here is not multiprocessing. It's your approach to reading the data. As far as I can see, you're using ROW_NUMBER() just to number the rows and then fetch them one million rows per loop.
As you're not using any WHERE condition on table a, this results in a FULL TABLE SCAN on every loop you run. You are wasting a lot of CPU and I/O on the Db2 server just to get a row number, and that's why each loop takes 4 minutes or more.
Also, this is far from an efficient way to fetch data. You can get duplicate reads or phantom reads if the data changes while your program runs. But that is a subject for another thread.
You should be using one of the four fetch methods described here, reading row by row from a SQL cursor. That way you'll use a very small amount of RAM and can read the data efficiently:
sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)
dictionary = ibm_db.fetch_both(stmt)
while dictionary != False:
dictionary = ibm_db.fetch_both(stmt)
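If you still want to hand chunks to pandas the way the question does, here is a minimal sketch of that idea under the same single-cursor approach: it keeps only one chunk of rows in memory at a time, and it assumes conn is an already-open ibm_db connection (the processing step is only a placeholder).

import ibm_db
import pandas as pd

sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)  # conn: an open ibm_db connection (assumed)

chunk_size = 100000
rows = []
row = ibm_db.fetch_tuple(stmt)
while row != False:
    rows.append(row)
    if len(rows) == chunk_size:
        df = pd.DataFrame(rows)
        # further processing of df goes here
        rows = []
    row = ibm_db.fetch_tuple(stmt)

if rows:  # last, possibly smaller, chunk
    df = pd.DataFrame(rows)
    # further processing of df goes here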
Related
I know there are a few ways to retrieve data from an RDB table.
One is with pandas read_sql, the other with cursor.fetchall().
What are the main differences between the two in terms of:
memory usage - is the DataFrame less recommended?
performance - selecting data from a table (e.g. a large set of data)
performance - inserting data with a loop for cursor vs df.to_sql.
Thanks!
That's an interesting question.
For a ~10GB SQLite database, I get the following results for your second question.
pandas.read_sql_query seems comparable in speed to cursor.fetchall.
The rest I leave as an exercise. :D
import sqlite3
import time
from contextlib import contextmanager

import pandas as pd

@contextmanager
def db_operation(db_name):
    connection = sqlite3.connect(db_name)
    c = connection.cursor()
    yield c
    connection.commit()
    connection.close()

start = time.perf_counter()
for i in range(0, 10):
    with db_operation('components.db') as c:
        c.execute('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''')
        fetchall = c.fetchall()
time.perf_counter() - start
2.967 # fractional seconds
start = time.perf_counter()
for i in range(0, 10):
    connection = sqlite3.connect('components.db')
    sql_query = pd.read_sql_query('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''', con=connection)
    connection.commit()
    connection.close()
time.perf_counter() - start
2.983 # fractional seconds
The difference is that cursor.fetchall() is a bit more spartan (=plain).
pandas.read_sql_query returns a <class 'pandas.core.frame.DataFrame'>, so you can use all the methods of pandas.DataFrame, like pandas.DataFrame.to_latex, pandas.DataFrame.to_csv, pandas.DataFrame.to_excel, etc. (documentation link)
You can accomplish exactly the same goals with cursor.fetchall, but you will need to press a few or a lot of extra keys.
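As a rough illustration of that difference (reusing the components.db file and hashes table from the benchmark above), the two approaches side by side might look like this:

import sqlite3
import pandas as pd

connection = sqlite3.connect('components.db')

# cursor.fetchall(): a plain list of tuples, columns addressed by position
rows = connection.cursor().execute('SELECT * FROM hashes LIMIT 5').fetchall()
print(rows[0])

# pandas.read_sql_query(): a DataFrame with named columns and export helpers
df = pd.read_sql_query('SELECT * FROM hashes LIMIT 5', con=connection)
print(df.head())
df.to_csv('hashes_sample.csv', index=False)  # the "extra methods" come for free

connection.close()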
I am trying to read the number of rows in a large Access database and I am trying to find the most efficient method. Here is my code:
import pyodbc

driver = 'access driver as string'
DatabaseLink = 'access database link as string'
Name = 'access table name as string'

conn = pyodbc.connect(r'Driver={' + driver + '};DBQ=' + DatabaseLink + ';')
cursor = conn.cursor()
AccessSize = cursor.execute('SELECT COUNT(1) FROM ' + Name).fetchone()[0]
conn.close()
This works and AccessSize does give me an integer with the number of rows in the table; however, it takes far too long to compute (my database has over 2 million rows and 15 columns).
I have attempted to read the data through pd.read_sql with the chunksize option, looping through and counting the length of each chunk, but this also takes long. I have also tried .fetchall instead of .fetchone, but the speed is similar.
I would have thought there would be a faster way to get the length of the table, since I don't need the entire table to be read. My thought is to find the index value of the last row, as this is essentially the number of rows, but I am unsure how to do that.
Thanks
From a comment on the question:
Unfortunately the database doesn't have suitable keys or indexes in any of its columns.
Then you can't expect good performance from the database, because every SELECT will be a table scan.
I have an Access database on a network share. It contains a single table with 1 million rows and absolutely no indexes. The Access database file itself is 42 MiB. When I do
t0 = time()
df = pd.read_sql_query("SELECT COUNT(*) AS n FROM Table1", cnxn)
print(f'{time() - t0} seconds')
it takes 75 seconds and generates 45 MiB of network traffic. Simply adding a primary key to the table increases the file size to 48 MiB, but the same code takes 10 seconds and generates 7 MiB of network traffic.
TL;DR: Add a primary key to the table or continue to suffer from poor performance.
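For completeness, a hedged sketch of adding that primary key through pyodbc and re-running the count; the driver name, file path and ID column below are assumptions, so adjust them to your own schema:

import pyodbc

conn = pyodbc.connect(r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
                      r'DBQ=C:\path\to\database.accdb;')  # hypothetical path
cursor = conn.cursor()

# Access DDL: promote an existing unique column (assumed here to be ID) to the primary key.
cursor.execute("ALTER TABLE Table1 ADD CONSTRAINT PK_Table1 PRIMARY KEY (ID)")
conn.commit()

# With the index in place, the count should no longer need a full table scan over the network.
n = cursor.execute("SELECT COUNT(*) FROM Table1").fetchone()[0]
print(n)
conn.close()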
2 million rows should not take that long. I have used pd.read_sql(sql, con) like this:
con = connection
sql = """ my sql statement
here"""
table = pd.read_sql(sql=sql, con=con)
Are you doing something different?
In my case I am using a Db2 database; maybe that is why it is faster.
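For reference, a sketch of what that looks like against Db2 through the ibm_db_dbi DB-API wrapper; every value in the connection string and the table name are placeholders:

import ibm_db_dbi
import pandas as pd

# Placeholder credentials: replace with your own connection details.
con = ibm_db_dbi.connect(
    "DATABASE=mydb;HOSTNAME=myhost;PORT=50000;PROTOCOL=TCPIP;UID=user;PWD=password;")

sql = "SELECT COUNT(*) AS n FROM my_table"  # hypothetical table
table = pd.read_sql(sql=sql, con=con)
print(table)

con.close()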
My Python code writes a nested dictionary from a table in SQLite. The table has around 40 million rows, and the code processes 1 million rows in about 30-60 seconds.
After it reached ~90% (36 million rows), it slowed down and stopped printing anything, without raising any errors.
The code:
from datetime import datetime

# conn is an existing sqlite3.Connection (not shown in the question)
selection_query = "SELECT * FROM my_table"
cursor = conn.cursor()
cursor.execute(selection_query)

dictionary = {}
counter_1 = 0
row_nr = 0

for row in cursor:
    dict_key_1 = str(row[0])
    dict_key_2 = str(row[1])
    value = row[5]
    counter_1 += 1
    row_nr += 1
    if dict_key_1 not in dictionary:
        dictionary[dict_key_1] = {}
    dictionary[dict_key_1].update({dict_key_2: value})
    if counter_1 > 1000000:
        print(str("{0:.3%}".format(row_nr / 40000000)) + str(datetime.now()))
        counter_1 = 0
Why did it suddenly slow down so drastically?
Well, I think you've got a memory problem here. Printing that many lines to the console uses a lot of RAM (at, say, 100 bytes per line that would be about 400 MB ~ 500 MB). Your PC may decide the app is using too much memory and stop it.
I am new to Python and have been given a task to download data from different databases (MS SQL and Teradata). The logic behind my code is as follows:
1: The code picks up vendor numbers from an Excel file.
2: It loops through all those vendors and produces a list of documents.
3: I then use the list downloaded in step 2 to download data from Teradata and append it to a final dataset.
My question is: if the data in the second step is blank, the while loop becomes infinite. Is there any way to exit it and still execute the rest of the iterations?
import pyodbc
import pandas as pd

VendNum = pd.ExcelFile(r"C:\desktop\VendorNumber.xlsx").parse('Sheet3', dtype=str)
VendNum['Vend_Num'] = VendNum['Vend_Num'].astype(str).str.pad(10, side='left', fillchar='0')

fDataSet = pd.DataFrame()

MSSQLconn = pyodbc.connect(r'Driver={SQL Server Native Client 11.0};'
                           r'Server=Servername;Database=DBName;Trusted_Connection=yes;')
TDconn = pyodbc.connect(r"DSN=Teradata;DBCNAME=DBname;UID=User;PWD=password;", autocommit=True)

for index, row in VendNum.iterrows():
    DocNum = pd.DataFrame()
    if index > len(VendNum["Vend_Num"]):
        break
    while DocNum.size == 0:
        print("Read SQL " + row["Vend_Num"])
        DocNum = pd.read_sql_query("select Col1 from Table11 where Col2 = '"
                                   + row["Vend_Num"] + "' and Col3 = 'ABC'", MSSQLconn)
        print("Execute SQL " + row["Vend_Num"])
        if DocNum.size > 0:
            print(row["Vend_Num"])
            dataList = ""
            dfToList = DocNum['Col1'].tolist()
            for i in dfToList:
                dataList += "'" + i + "'" + ","
            dataList = dataList[0:-1]
            DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)
            fDataSet = fDataSet.append(DataSet)

MSSQLconn.close()
TDconn.close()
The expected output is to append fDataSet on each iteration of the code, but when the query returns a blank DataFrame (DocNum) the while loop doesn't exit.
When you are using system resources you should use a context manager, e.g.:
with open(...):
As mentioned by Chris,
A while loop keeps running until its condition is met, so it can easily become infinite. Perhaps create a for loop instead, which makes a few attempts and then passes.
I changed the WHILE to IF and it's working fine.
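For illustration, a minimal sketch of that change, reusing the connections and names from the question: the while becomes an if-with-continue, so a vendor with no documents is simply skipped instead of being retried forever.

for index, row in VendNum.iterrows():
    print("Read SQL " + row["Vend_Num"])
    DocNum = pd.read_sql_query("select Col1 from Table11 where Col2 = '"
                               + row["Vend_Num"] + "' and Col3 = 'ABC'", MSSQLconn)
    if DocNum.size == 0:
        # No documents for this vendor: skip it and move on to the next one.
        continue
    dfToList = DocNum['Col1'].tolist()
    dataList = ",".join("'" + i + "'" for i in dfToList)
    DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)
    fDataSet = pd.concat([fDataSet, DataSet])  # pandas 2.x removed DataFrame.append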
I'm trying to write a python script that parses through a file and updates a database with the new values obtained from the parsed file. My code looks like this:
from datetime import datetime

startTime = datetime.now()
db = <Get DB Handle>  # placeholder from the question
counter = 0

with open('CSV_FILE.csv') as csv_file:
    data = csv_file.read().splitlines()

for line in data:
    data1 = line.split(',')
    execute_string = ("update table1 set col1=" + data1[1]
                      + " where col0 is '" + data1[0] + "'")
    db.execute(execute_string)
    counter = counter + 1
    if counter % 1000 == 0 and counter != 0:
        print(".", end="")

print("")
print(datetime.now() - startTime)
But that operation takes about 10 minutes to finish. Is there any way I can tweak my SQL query to speed it up?
Read the lines in batches (say, a batch size of 1000) and try a bulk update query for MySQL. I think that will be faster than the current approach, because you'll be updating more data with fewer queries.
I agree that you need batching. Have a look at this question and its best answer: How can I do a batch insert into an Oracle database using Python?
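As a rough sketch of the batching idea (assuming db_conn is an open DB-API connection and reusing the table and column names from the question; the ? placeholder style depends on your driver, e.g. MySQL drivers use %s):

import csv
from datetime import datetime

BATCH_SIZE = 1000
startTime = datetime.now()

cursor = db_conn.cursor()  # db_conn: an open DB-API connection (assumed)

with open('CSV_FILE.csv') as csv_file:
    reader = csv.reader(csv_file)
    batch = []
    for row in reader:
        # row[0] -> col0 (key), row[1] -> col1 (new value)
        batch.append((row[1], row[0]))
        if len(batch) >= BATCH_SIZE:
            cursor.executemany("UPDATE table1 SET col1 = ? WHERE col0 = ?", batch)
            batch = []
    if batch:  # flush the remaining rows
        cursor.executemany("UPDATE table1 SET col1 = ? WHERE col0 = ?", batch)

db_conn.commit()
print(datetime.now() - startTime)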