Writing nested dictionary slows down drastically

Writing nested dictionary slows down drastically - python

My Python code writes a nested dictionary from a table in SQLite. The table has around 40 Million rows. It processes 1 Million rows in about 30-60 seconds.
After it reached ~90% (36 Million rows), it slowed down and is not printing anything anymore without raising any errors.
The code :
selection_query = "Select * From my_table"
cursor = conn.cursor()
Cursor.execute(tbl)
dictionary = {}
Counter_1 = 0
row_nr = 0
for row in Cursor:
dict_key_1 = str(row[0])
dict_key_2 = str(row[1])
value = row[5]
Counter_1 += 1
row_nr += 1
if dict_key_1 not in dictionary:
dictionary[dict_key_1]={}
dictionary[dict_key_1].update({dict_key_2 : value})
if Counter_1>1000000:
print(str("{00:.3%}".format(row_nr/4000000)) + str(datetime.now()))
counter=0
Why did it suddenly slow down so drastically?

Well i think you've got a problem with the memory here. Printing that many lines on the console uses a lot of memory of your rams (if it takes 100 Bytes per line then it would take about 400 MB ~ 500 MB). Your PC may say that this app is using up too much memory and stops it.

Related

SQLITE - unable to combine thousands of delete statements into single transaction

I'm facing an issue with deleting thousands of rows from a single table takes a long time (over 2 minutes for 14k records) while inserting the same records is nearly instant (200ms). Both the insert and delete statements are handled identically - a loop generates the statements and appends them to a list, then the list is passed to a separate function that opens a transaction, executes all the statements and then finishes with commit. At least this was my impression before I started testing this with pseudocode - but it looks like I misunderstood the need for opening the transaction manually.
I've read about transactions (https://www.sqlite.org/faq.html#q19) but since the inserts are pretty much instant then I am not sure whether this is the case here.
My understanding is that transaction == commit, and if this is correct then it looks like all the delete statements are in a single transaction - mid-processing I can see all the deleted rows until the final commit, after which they are actually deleted. Ie the situation in the FAQ link below should be different - since no commit takes place. But the slow speed indicates that it is still doing something else that is slowing things down as if each delete statement were a separate transaction.
After running the pseudocode it appears that while the changes are not committed until explicit commit is sent (via conn.commit()) but the "begin" or "begin transaction" in front of the loop does not have any effect. I think this is because sqlite3 sends the "begin" automatically in the background ( Merge SQLite files into one db file, and 'begin/commit' question )
Pseudocode to test this out:
import sqlite3
from datetime import datetime
insert_queries = []
delete_queries = []
rows = 30000
for i in range(rows):
insert_queries.append(f'''INSERT INTO test_table ("column1") VALUES ("{i}");''')
for i in range(rows):
delete_queries.append(f'''DELETE from test_table where column1 ="{i}";''')
conn = sqlite3.connect('/data/test.db', check_same_thread=False)
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print('*'*50)
print(f'Starting inserts: {timestamp}')
# conn.execute('BEGIN TRANSACTION')
for query in insert_queries:
conn.execute(query)
conn.commit()
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Finished inserts: {timestamp}')
print('*'*50)
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Starting deletes: {timestamp}')
# conn.isolation_level = None
# conn.execute('BEGIN;')
# conn.execute('BEGIN TRANSACTION;')
for query in delete_queries:
conn.execute(query)
conn.commit()
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Finished deletes: {timestamp}')
One weird thing is that the row count exponentially increases the delete time (2s to delete 10k rows, 7s to delete 20k rows, 43s to delete 50k rows) while the insert time is instant regardless of row count.
EDIT:
The original question was - why does the delete statement take so much more time compared to insert statement and how to speed it up so the speeds for both inserting and deleting rows would be similar.
As per snakecharmerb's suggestion one workaround around this would be to do it like this:
rows = 100000
delete_ids = ''
for i in range(rows):
if delete_ids:
delete_ids += f',"{i}"'
else:
delete_ids += f'"{i}"'
delete_str = f'''DELETE from test_table where column1 IN ({delete_ids});'''
conn.execute(delete_str)
conn.commit()
While this is most likely against all best-practices, it does seem to work - it takes around 2 seconds to delete 1mil rows.

I tried batching the deletes in sets of 50:
...
batches = []
batch = []
for i in range(rows):
batch.append(str(i))
if len(batch) == 50:
batches.append(batch)
batch = []
if batch:
batches.append(batch)
...
base = 'DELETE FROM test_table WHERE column1 IN ({})'
for batch in batches:
placeholders = ','.join(['?'] * len(batch))
sql = base.format(placeholders)
conn.execute(sql, batch)
conn.commit()
...
and this reduced the duration to 1 - 2 seconds (from 6 - 8 originally).
Combining this approach with executemany resulted in a 1 second duration.
Using a query to define the deleted columns was almost instant
DELETE FROM test_table WHERE column1 IN (SELECT column1 FROM test_table)
but it's possible Sqlite recognises that this query is the same as a bare DELETE FROM test_table and optimises.
Switching off the secure_delete PRAGMA seemed to make performance even worse.

python Multiprocessing with ibm_db2?

I'm reading around 15 Million data from db2 LUW using python ibm_db2 module. I'm reading the data for every one million once and again reading this data in chunks to avoid memory issues. The problem here is to complete one loop of reading one million data its taking 4 Mins . How can i use multiprocessing to avoid delay . Blow is my code.
start =0
count = 15000000
check_point = 1000000
chunk_size = 100000
connection = get_db_conn_cur(secrets_db2)
for i in range(start, count, check_point):
query_str = "SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.timestamp) row_num, * from table a) where row_num between " + str( i + 1) + " and " + str(i + check_point) + ""
number_of_batches = check_point // chunk_size
last_chunk = check_point - (number_of_batches * chunk_size)
counter = 0
cur = connection.cursor()
cur.execute(query_str) s
chunk_size_l = chunk_size
while True:
counter = counter + 1
columns = [desc[0] for desc in cur.description]
print('counter', counter)
if counter > number_of_batches:
chunk_size_l = last_chunk
results = cur.fetchmany(chunk_size_l)
if not results:
break
df = pd.DataFrame(results)
#further processing

The problem here is not multiprocessing. Is your approach to read data. As far as I can see, you're using ROW_NUMBER() just to number the rows and then fetch them 1 million rows per loop.
As you're not using any WHERE condition on table a this will result in a FULL TABLE SCAN every loop you're running. You are wasting too much CPU and I/O on Db2 server just to have a row number, and that's why it's taking 4 minutes or more each loop.
Also, this is far from a efficient way to fetch data. You can have duplicate reads or phantom reads as long as data changes during your program. But that is subject for another thread.
You should be using one of the 4 fetch methods described here reading row by row from a SQL CURSOR. That way you'll be using a very small amount of RAM and can read data efficiently:
sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)
dictionary = ibm_db.fetch_both(stmt)
while dictionary != False:
dictionary = ibm_db.fetch_both(stmt)

pd.read_sql method to count number of rows in a large Access database

I am trying to read the number of rows in a large access database and I am trying to find the most efficient method. Here is my code:
driver = 'access driver as string'
DatabaseLink = 'access database link as string'
Name = 'access table name as string'
conn = pyodbc.connect(r'Driver={' + driver + '};DBQ=' + DatabaseLink +';')
cursor = conn.cursor()
AccessSize = cursor.execute('SELECT count(1) FROM '+ Name).fetchone()[0]
connection.close()
This works and AccessSize does give me an integer with the number of rows in the database, however it takes far too long to compute (my database has over 2 million rows and 15 columns).
I have attempted to read the data through pd.read_sql and used the chunksize functionality to loop through and keep counting the length of each chunk but this also takes long. I have also attempted .fetchall in the cursor execute section but the speed is similar to .fetchone
I would have thought there would be a faster method to quickly calculate the length of the table as I don't require the entire table to be read. My thought is to find the index value of the last row as this essentially is the number of rows but I am unsure how to do this.
Thanks

From comment to the question:
Unfortunately the database doesn't have a suitable keys or indexes in any of its columns.
Then you can't expect good performance from the database because every SELECT will be a table scan.
I have an Access database on a network share. It contains a single table with 1 million rows and absolutely no indexes. The Access database file itself is 42 MiB. When I do
t0 = time()
df = pd.read_sql_query("SELECT COUNT(*) AS n FROM Table1", cnxn)
print(f'{time() - t0} seconds')
it takes 75 seconds and generates 45 MiB of network traffic. Simply adding a primary key to the table increases the file size to 48 MiB, but the same code takes 10 seconds and generates 7 MiB of network traffic.
TL;DR: Add a primary key to the table or continue to suffer from poor performance.

2 million should not take that long. I have Use pd.read_sql(con, sql) like this:
con = connection
sql = """ my sql statement
here"""
table = pd.read_sql(sql=sql, con=con)
Are you doing something different?
In my case I am using a db2 database, maybe that is why is faster.

How to reduce the size of a txt file created by Python?

I have about 2M rows x 70 columns worth of numerical and categorical data in a table on a Netezza server and want to dump that into a .txt file using Python.
I have done this in the past with SAS and on my test case I get a txt file worth 450MB.
I used Python and tried a couple of things.
# One line at a time
startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')
cursor = cnxn.cursor()
c = cursor.execute("""SELECT * FROM MYTABLE""")
with open('dump_test_pyodbc.csv','wb') as csv:
csv.write(','.join([g[0] for g in c.description])+'\n')
while 1:
a=c.fetchone()
if not a:
break
csv.write(','.join([str(g) for g in a])+'\n')
cnxn.close()
endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PYODBC:", endTime - startTime
>>Time elapsed PYODBC: 0:18:20
# Use Pandas chunksize
startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')
sql = ("""SELECT * FROM MYTABLE""")
df = psql.read_sql(sql, cnxn, chunksize=1000)
for k, chunk in enumerate(df):
if k == 0:
chunk.to_csv('dump_chunk.csv',index=False,mode='w')
else:
chunk.to_csv('dump_chunk.csv',index=False,mode='a',header=False)
endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PANDAS:", endTime - startTime
cnxn.close()
>>Time elapsed PANDAS: 0:29:29
Now to the size:
The Pandas approach created a file worth 690MB, the other approach created a file worth 630MB.
Speed and size seem to favor the former approach, however, size wise, this is still much larger than the original SAS approach.
Any ideas on how to improve the Python approach(es) to reduce the output size?
EDIT: ADDING EXAMPLES--------------------
Ok, it seems like SAS is doing a better job at managing integers, where it makes sense. I think that's what is making up most of the difference in size.
SAS:
xxxxxx,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.49,40.65,63.31,1249.92...
Pandas:
xxxxxx,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.49,40.65,63.31,1249.92...
fetchone():
xxxxxx,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.49,40.65,63.31,1249.92...
EDIT 2: SOLUTION------------------------------------
I ended up removing unnecessary decimals with:
csv.write(','.join([str(g.strip()) if type(g)==str else '%g'%(g) for g in a])+'\n')
This brought the file size down to SAS level.

I was going to make this a comment, but text formatting will help.
My guess is you're running into an issue of quoted vs unquoted CSV files. SAS has an option to create unquoted CSV files. Here's an example:
This Value,That Value,3,Other Value,423,985.32
The file I think you're getting is more accurate, and won't create problems for fields with commas in them. The same row, quoted:
"This Value","That Value","3","Other Value","423,985.32"
As you can see, in the first (SAS) example, if read into a Spreadsheet it would read as two different values, "423" and "985.32". In the second example, it becomes clear it is actually one value, "423,985.32". This is why the quoted format you are now getting (if I'm right) is more accurate and safe.

MySql: among 20000 rows of records only shows 5, why?

I want to test my MySQL database and how it handles my future data. it is only a table with two columns, one of the column is only one word and another one is 30.000 characters in. So I copied and inserted into the same table 20.000 times that shows the size is 2.0 GB. Now I want to browse them through phpMyAdmin it shows nothing and every destroyed that table to show anything. I output it through python it only shows 5 rows that were inserted before this copy. I used a script to delete rows from IDs between 5000 - 10.000 it works. That means that data is there but doesn't come out. Any explanation?
import MySQLdb as mdb
con = mdb.connect('127.0.0.1', 'root','password', 'database')
title = []
entry = []
y = 0
with con:
conn = con.cursor()
conn.execute("SELECT * FROM mydatabase WHERE id='2' AND myword = 'jungleboy'")
rows = conn.fetchall()
for i in rows:
title.append(i[1])
entry.append(i[2])
for x in range(20000):
cur.execute("INSERT INTO mydatabase(myword,explanation) VALUES (%s,%s)",(str(title[0]),str(entry[0])))
if x > y+50:
print str(x)
y = x

I'm not sure I understand your question, but here are some tips with the code you have pasted.
After any INSERT or other query that adds, removes or changes data in a table, you need to commit the transaction with con.commit().
There is a limit on how many records can be fetched with fetchall(). You can see and adjust this limit by printing the arraysize attribute of the cursor:
print 'I can only fetch {0.arraysize} rows at a time.'.format(cur)
To guarantee that you are fetching every row, loop through the results, like this:
q = "SELECT .... " # some query that returns a lot of results
conn.execute(q)
rows = con.fetchone() # or fetchall()
while rows:
print rows
rows = con.fetchone() # again, or fetchall()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Writing nested dictionary slows down drastically - python

Well i think you've got a problem with the memory here. Printing that many lines on the console uses a lot of memory of your rams (if it takes 100 Bytes per line then it would take about 400 MB ~ 500 MB). Your PC may say that this app is using up too much memory and stops it.

Related

SQLITE - unable to combine thousands of delete statements into single transaction

python Multiprocessing with ibm_db2?

pd.read_sql method to count number of rows in a large Access database

How to reduce the size of a txt file created by Python?

MySql: among 20000 rows of records only shows 5, why?

Categories

Resources