SQL - Update all records of a single column - python

I'm trying to write a python script that parses through a file and updates a database with the new values obtained from the parsed file. My code looks like this:
startTime = datetime.now()
db = <Get DB Handle>
counter = 0
with open('CSV_FILE.csv') as csv_file:
    data = csv_file.read().splitlines()
    for line in data:
        data1 = line.split(',')
        execute_string = "update table1 set col1=" + data1[1] + \
                         " where col0 is '" + data1[0] + "'"
        db.execute(execute_string)
        counter = counter + 1
        if(counter % 1000 == 0 and counter != 0):
            print ".",
print ""
print datetime.now() - startTime
But that operation took about 10 minutes to finish. Is there any way I can tweak my SQL query to speed it up?

Read the lines in batches (say, 1000 at a time) and try a bulk update query for MySQL. That should be faster than the current approach, since you'll be updating more data with fewer queries.
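As a rough illustration of the idea, here is a minimal sketch, assuming db is a DB-API connection to MySQL (e.g. from MySQLdb or mysql.connector) and using the table and column names from the question; it collects parameter tuples and hands each batch to executemany:
import csv

BATCH_SIZE = 1000
cur = db.cursor()
batch = []
with open('CSV_FILE.csv') as csv_file:
    for row in csv.reader(csv_file):
        # (new col1 value, col0 key) pairs for the parameterized UPDATE
        batch.append((row[1], row[0]))
        if len(batch) >= BATCH_SIZE:
            cur.executemany("UPDATE table1 SET col1 = %s WHERE col0 = %s", batch)
            db.commit()
            batch = []
if batch:  # flush the final partial batch
    cur.executemany("UPDATE table1 SET col1 = %s WHERE col0 = %s", batch)
    db.commit()
Parameterized queries also avoid the string-concatenation quoting issues in the original loop.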

I agree you need batching. Have a look at this question and its best answer: How can I do a batch insert into an Oracle database using Python?


python Multiprocessing with ibm_db2?

I'm reading around 15 million rows from Db2 LUW using the Python ibm_db2 module. I read one million rows at a time, and then read that data again in smaller chunks to avoid memory issues. The problem is that completing one loop of reading one million rows takes about 4 minutes. How can I use multiprocessing to avoid the delay? Below is my code.
start = 0
count = 15000000
check_point = 1000000
chunk_size = 100000
connection = get_db_conn_cur(secrets_db2)
for i in range(start, count, check_point):
    query_str = "SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.timestamp) row_num, * from table a) where row_num between " + str(i + 1) + " and " + str(i + check_point)
    number_of_batches = check_point // chunk_size
    last_chunk = check_point - (number_of_batches * chunk_size)
    counter = 0
    cur = connection.cursor()
    cur.execute(query_str)
    chunk_size_l = chunk_size
    while True:
        counter = counter + 1
        columns = [desc[0] for desc in cur.description]
        print('counter', counter)
        if counter > number_of_batches:
            chunk_size_l = last_chunk
        results = cur.fetchmany(chunk_size_l)
        if not results:
            break
        df = pd.DataFrame(results)
        # further processing
The problem here is not multiprocessing; it is your approach to reading the data. As far as I can see, you're using ROW_NUMBER() just to number the rows and then fetch them one million rows per loop.
As you're not using any WHERE condition on table a, every loop you run results in a FULL TABLE SCAN. You are wasting a lot of CPU and I/O on the Db2 server just to get a row number, and that's why each loop takes 4 minutes or more.
Also, this is far from an efficient way to fetch data: you can get duplicate reads or phantom reads if the data changes while your program runs. But that is a subject for another thread.
You should be using one of the 4 fetch methods described here, reading row by row from an SQL cursor. That way you'll use a very small amount of RAM and can read the data efficiently:
sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)
dictionary = ibm_db.fetch_both(stmt)
while dictionary != False:
dictionary = ibm_db.fetch_both(stmt)
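If you prefer to keep working with pandas chunks, here is a minimal sketch based on the question's own code (assuming get_db_conn_cur returns a DB-API connection, as in the question): run the query once, without ROW_NUMBER(), and keep calling fetchmany on the same open cursor.
import pandas as pd

chunk_size = 100000
connection = get_db_conn_cur(secrets_db2)  # connection helper from the question
cur = connection.cursor()
cur.execute("SELECT * FROM table a ORDER BY a.timestamp")  # single pass, no ROW_NUMBER()
columns = [desc[0] for desc in cur.description]

while True:
    rows = cur.fetchmany(chunk_size)
    if not rows:
        break
    df = pd.DataFrame(rows, columns=columns)
    # further processing of df goes here
This issues one scan of the table instead of one scan per million-row window.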

Array Outputting result set with the same amount of rows in a sql database

I have a query that reaches into a MySQL database and grabs the rows whose column "cab" matches a variable passed on from a previous HTML page. That variable is cabwrite.
The SQL side is working just fine: it queries and returns all the rows whose 'cab' column matches that cab id.
Once that happens I remove the data I don't need (the line identifier and cab).
The output from that is result_set.
However, when I print the data to verify it's what I expect, I'm met with the same data for every row.
Example data:
The query finds 4 matching rows.
This is currently what I'm getting:
> data =
> ["(g11,none,tech11)","(g2,none,tech13)","(g3,none,tech15)","(g4,none,tech31)"]
> ["(g11,none,tech11)","(g2,none,tech13)","(g3,none,tech15)","(g4,none,tech31)"]
> ["(g11,none,tech11)","(g2,none,tech13)","(g3,none,tech15)","(g4,none,tech31)"]
> ["(g11,none,tech11)","(g2,none,tech13)","(g3,none,tech15)","(g4,none,tech31)"]
Code:
cursor = connection1.cursor(MySQLdb.cursors.DictCursor)
cursor.execute("SELECT * FROM devices WHERE cab=%s ", [cabwrite])
result_set = cursor.fetchall()
data = []
for row in result_set:
    localint = "('%s','%s','%s')" % (row["localint"], row["devicename"], row["hostname"])
    l = str(localint)
    data.append(l)
print (data)
This is what I want it to look like:
data = [(g11,none,tech11),(g2,none,tech13),(g3,none,tech15),(g4,none,tech31)]
["('Gi3/0/13','None','TECH2_HELP')", "('Gi3/0/7','None','TECH2_1507')", "('Gi1/0/11','None','TECH2_1189')", "('Gi3/0/35','None','TECH2_4081')", "('Gi3/0/41','None','TECH2_5625')", "('Gi3/0/25','None','TECH2_4598')", "('Gi3/0/43','None','TECH2_1966')", "('Gi3/0/23','None','TECH2_2573')", "('Gi3/0/19','None','TECH2_1800')", "('Gi3/0/39','None','TECH2_1529')"]
Thanks Tripleee, I did what you recommended and found my issue... a legacy FOR clause upstream in my code was causing the issue.
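As a side note, if you want data to end up as a list of plain tuples rather than pre-formatted strings (closer to the desired output above), here is a minimal sketch based on the question's code:
cursor = connection1.cursor(MySQLdb.cursors.DictCursor)
cursor.execute("SELECT * FROM devices WHERE cab=%s", [cabwrite])
# build a list of tuples instead of formatted strings
data = [(row["localint"], row["devicename"], row["hostname"]) for row in cursor.fetchall()]
print(data)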

Exit loop in python if SQL query doesn't bring any data

I am new to Python and have been given a task to download data from different databases (MS SQL and Teradata). The logic behind my code is as follows:
1: The code picks up the vendor data from an Excel file.
2: From that list it loops through all the vendors and gives out a list of documents.
3: Then I use the list downloaded in step 2 to download data from Teradata and append it to a final dataset.
My question is: if the data in the second step is blank, the while loop becomes infinite. Is there any way to exit it and still execute the rest of the iterations?
import pyodbc
import pandas as pd

VendNum = pd.ExcelFile(r"C:\desktop\VendorNumber.xlsx").parse('Sheet3', dtype=str)
VendNum['Vend_Num'] = VendNum['Vend_Num'].astype(str).str.pad(10, side='left', fillchar='0')
fDataSet = pd.DataFrame()
MSSQLconn = pyodbc.connect(r'Driver={SQL Server Native Client 11.0};Server=Servername;Database=DBName;Trusted_Connection=yes;')
TDconn = pyodbc.connect(r"DSN=Teradata;DBCNAME=DBname;UID=User;PWD=password;", autocommit=True)
for index, row in VendNum.iterrows():
    DocNum = pd.DataFrame()
    if index > len(VendNum["Vend_Num"]):
        break
    while DocNum.size == 0:
        print("Read SQL " + row["Vend_Num"])
        DocNum = pd.read_sql_query("select Col1 from Table11 where Col2 = '" + row["Vend_Num"] + "' and Col3 = 'ABC'", MSSQLconn)
        print("Execute SQL " + row["Vend_Num"])
        if DocNum.size > 0:
            print(row["Vend_Num"])
            dataList = ""
            dfToList = DocNum['Col1'].tolist()
            for i in dfToList:
                dataList += "'" + i + "'" + ","
            dataList = dataList[0:-1]
            DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)
            fDataSet = fDataSet.append(DataSet)
MSSQLconn.close()
TDconn.close()
The expected output is that fDataSet gets appended to on each iteration, but when the query returns a blank DataFrame the while loop never exits.
When you are using system resources you should use a context manager:
with open(...):
As mentioned by Chris, a while loop is supposed to run indefinitely until its condition is met. Perhaps create a for loop instead, which makes a few attempts and then passes.
I changed the WHILE to IF and it's working fine.
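For reference, a minimal sketch of that change using the names from the question: with an if instead of a while, an empty DocNum simply skips to the next vendor rather than re-running the same query forever.
for index, row in VendNum.iterrows():
    DocNum = pd.read_sql_query(
        "select Col1 from Table11 where Col2 = '" + row["Vend_Num"] + "' and Col3 = 'ABC'",
        MSSQLconn)
    if DocNum.size == 0:
        print("No documents for " + row["Vend_Num"] + ", skipping")
        continue  # move on to the next vendor instead of looping forever
    dfToList = DocNum['Col1'].tolist()
    dataList = ",".join("'" + i + "'" for i in dfToList)
    DataSet = pd.read_sql("Some SQL statement which works fine", TDconn)
    fDataSet = fDataSet.append(DataSet)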

Conflict accessing twice in azure storage table simultaneously

I've been running a script that retrieves data from an Azure storage table (such as this one as a reference) and copies it into another table in the same storage account, without any problem.
The issue came when I tried to access this second table to run some calculations and copy the results into yet another table in the same storage account. This script returned the following error:
AzureConflictHttpError: Conflict
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:57d9b721-6002-012d-3d0c-b88bef000000\nTime:2019-01-29T19:55:53.5984026Z"}}}
At the same time, however, the script I was running previously also stopped, printing the same error, and it won't start again even when nothing else is running; it keeps returning the same error over and over.
Is there any way to access the same Azure storage table API from more than one script at the same time?
UPDATE
Adding the source code, sorry for not having done that before. Basically the two scripts I'm running in parallel are the same but with different filters: in this one I take the data from Table 1 (which has one row per second) and average those numbers per minute to add a row to Table 2, and in the other script I take the data from Table 2 and average those rows into a 5-minute-average row in another Table 3. So a few parameters change, but the code is basically the same.
There will be a third script, slightly different from these two, that will take Table 2 as its input source, run other calculations and paste the results as a new row per minute into a future Table 4. In general the idea is to have multiple writers into multiple tables at the same time to build new, more specific tables.
import datetime
import time
from azure.storage.table import TableService, Entity

delta_time = '00:01:00'
retrieve_time = '00:10:00'
start_time = '08:02:00'
utc_diff = 3

table_service = TableService(account_name='xxx', account_key='yyy')

while True:
    now_time = datetime.datetime.now().strftime("%H:%M:%S")
    now_date = datetime.datetime.now().strftime("%d-%m-%Y")
    hour = datetime.datetime.now().hour
    if hour >= 21:
        now_date = (datetime.datetime.now() + datetime.timedelta(days=1)).strftime("%d-%m-%Y")
    retrieve_max = (datetime.datetime.now() + datetime.timedelta(hours=utc_diff) + datetime.timedelta(minutes=-10)).strftime("%H:%M:%S")
    start_diff = datetime.datetime.strptime(now_time, '%H:%M:%S') - datetime.datetime.strptime(start_time, '%H:%M:%S') + datetime.timedelta(hours=utc_diff)
    if start_diff.total_seconds() > 0:
        query = "PartitionKey eq '" + str(now_date) + "' and RowKey ge '" + str(retrieve_max) + "'"
        tasks = table_service.query_entities('Table1', query)
        iqf_0 = []
        for task in tasks:
            if task.Name == "IQF_0":
                iqf_0.append([task.RowKey, task.Area])
        last_time = iqf_0[len(iqf_0) - 1][0]
        time_max = datetime.datetime.strptime(last_time, '%H:%M:%S') - datetime.datetime.strptime(delta_time, '%H:%M:%S')  # + datetime.timedelta(hours=utc_diff)
        area = 0.0
        count = 0
        for i in range(len(iqf_0) - 1, -1, -1):
            diff = datetime.datetime.strptime(last_time, '%H:%M:%S') - datetime.datetime.strptime(iqf_0[i][0], '%H:%M:%S')
            if diff.total_seconds() < 60:
                area += iqf_0[i][1]
                count += 1
            else:
                break
        area_average = area / count
        output_row = Entity()
        output_row.PartitionKey = now_date
        output_row.RowKey = last_time
        output_row.Name = task.Name
        output_row.Area = area_average
        table_service.insert_entity('Table2', output_row)
        date_max = datetime.datetime.now() + datetime.timedelta(days=-1)
        date_max = date_max.strftime("%d-%m-%Y")
        query = "PartitionKey eq '" + str(date_max) + "' and RowKey ge '" + str(retrieve_max) + "'"
        tasks = table_service.query_entities('Table2', query)
        for task in tasks:
            diff = datetime.datetime.strptime(now_time, '%H:%M:%S') - datetime.datetime.strptime(task.RowKey, '%H:%M:%S') + datetime.timedelta(hours=utc_diff)
            print(i, datetime.datetime.strptime(now_time, '%H:%M:%S'), datetime.datetime.strptime(task.RowKey, '%H:%M:%S'), diff.total_seconds())
            if task.PartitionKey == date_max and diff.total_seconds() > 0:
                table_service.delete_entity('Table2', task.PartitionKey, task.RowKey)
    time.sleep(60 - time.time() % 60)
It sounds like you were running two scripts that copy data within the same Azure Storage account, from Table 1 to Table 2 to Table 3, at the same time. In my experience, this issue is normally caused by writing the same data record (a Table Entity) concurrently, or by using the wrong operation for an entity that already exists; in other words, it's a write-contention issue.
It's a common Table Service error; you can find it listed here.
There is also a document, Inserting and Updating Entities, which explains the differences between the Insert Entity, Update Entity, Merge Entity, Insert Or Merge Entity, and Insert Or Replace Entity operations.
Since your code was not shared initially, and considering all possible cases, there are three ways to fix the issue.
Run your two scripts one after another, copying data between the different tables in order rather than concurrently.
Use the correct function to update the data of an existing entity (see the sketch after this list); you can refer to the document above and the similar SO thread Add or replace entity in Azure Table Storage.
Use a global lock keyed on the Table Entity's primary key, to avoid operating on the same Table Entity from the two scripts at the same time.
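For instance, if the conflict comes from inserting an entity whose PartitionKey/RowKey already exists, switching the write in the code above from insert_entity to an upsert avoids the EntityAlreadyExists error. A minimal sketch with the same azure.storage.table SDK used in the question (the key values here are placeholders):
from azure.storage.table import TableService, Entity

table_service = TableService(account_name='xxx', account_key='yyy')

row = Entity()
row.PartitionKey = '30-01-2019'   # placeholder date key
row.RowKey = '08:15:00'           # placeholder time key
row.Name = 'IQF_0'
row.Area = 12.3

# insert_or_replace_entity overwrites an existing entity with the same keys;
# insert_or_merge_entity would merge the new properties into it instead.
table_service.insert_or_replace_entity('Table2', row)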

Creating tables in MySQL based on the names of the columns in another table

I have a table with ~133M rows and 16 columns. I want to create 14 tables on another database on the same server for each of columns 3-16 (columns 1 and 2 are `id` and `timestamp` which will be in the final 14 tables as well but won't have their own table), where each table will have the name of the original column. Is this possible to do exclusively with an SQL script? It seems logical to me that this would be the preferred, and fastest way to do it.
Currently, I have a Python script that "works" by parsing the CSV dump of the original table (testing with 50 rows), creating new tables, and adding the associated values, but it is very slow (I estimated almost 1 year to transfer all 133M rows, which is obviously not acceptable). This is my first time using SQL in any capacity, and I'm certain that my code can be sped up, but I'm not sure how because of my unfamiliarity with SQL. The big SQL string command in the middle was copied from some other code in our codebase. I've tried using transactions as seen below, but it didn't seem to have any significant effect on the speed.
import re
import mysql.connector
import time

# option flags
debug = False   # prints out information during runtime
timing = True   # times the execution time of the program

# save start time for timing. won't be used later if timing is false
start_time = time.time()

# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')

# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str)  # parse columns with regex to remove commas and vaisala_
if debug:
    print(columns)

# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='measurements')
cursor = cnx.cursor()

# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
    table_name = 'vaisala2_' + columns[i]
    sql_command = "CREATE TABLE IF NOT EXISTS " + \
                  table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
                  "`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
                  "`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
                  "`value` varchar(255) DEFAULT NULL, " \
                  "PRIMARY KEY (`id`), " \
                  "UNIQUE KEY `milliseconds` (`milliseconds`) " \
                  "COMMENT 'Eliminates duplicate millisecond values', " \
                  "KEY `timestamp` (`timestamp`)) " \
                  "ENGINE=InnoDB DEFAULT CHARSET=utf8;"
    if debug:
        print("Creating table", table_name, "in database")
    cursor.execute(sql_command)

# read in rest of lines in CSV file
for line in file.readlines():
    cursor.execute("START TRANSACTION;")
    line = line.strip()
    values = re.split(',"|",|,', line)  # regex split along commas, or commas and quotes
    if debug:
        print(values)
    # iterate over each data column. Starts at 2 to eliminate `id` and `timestamp`
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        timestamp = values[1]
        # translate timestamp back to epoch time
        try:
            pattern = '%Y-%m-%d %H:%M:%S'
            epoch = int(time.mktime(time.strptime(timestamp, pattern)))
            milliseconds = epoch * 1000  # convert seconds to ms
        except ValueError:  # errors default to 0
            milliseconds = 0
        value = values[i]
        # generate SQL command to insert data into destination table
        sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
                                                                                 milliseconds, value)
        if debug:
            print(sql_command)
        cursor.execute(sql_command)
    cnx.commit()  # commits changes in destination MySQL server

# print total execution time
if timing:
    print("Completed in %s seconds" % (time.time() - start_time))
This doesn't need to be incredibly optimized; it's perfectly acceptable if the machine has to run for a few days in order to do it. But 1 year is far too long.
You can create a table from a SELECT like:
CREATE TABLE <other database name>.<column name>
AS
SELECT <column name>
FROM <original database name>.<table name>;
(Replace the <...> with your actual object names or extend it with other columns or a WHERE clause or ...)
That will also insert the data from the query into the new table. And it's probably the fastest way.
You could use dynamic SQL and information from the catalog (namely information_schema.columns) to create the CREATE statements or create them manually, which is annoying but acceptable for 14 columns I guess.
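For instance, here is a minimal Python sketch of that dynamic approach, using mysql.connector as in the question; the source table name vaisala and the target database new_db are placeholders. It reads the column names from information_schema.columns and issues one CREATE TABLE ... AS SELECT per data column:
import mysql.connector

cnx = mysql.connector.connect(user='root', password='<redacted>', host='127.0.0.1')
cursor = cnx.cursor()

# fetch the data column names of the source table, skipping id and timestamp
cursor.execute(
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_schema = 'measurements' AND table_name = 'vaisala' "
    "AND column_name NOT IN ('id', 'timestamp')")
data_columns = [row[0] for row in cursor.fetchall()]

for col in data_columns:
    # one new table per column, carrying id and timestamp along with the value
    cursor.execute(
        "CREATE TABLE new_db.`{0}` AS "
        "SELECT `id`, `timestamp`, `{0}` AS `value` "
        "FROM measurements.`vaisala`".format(col))

cnx.close()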
When using scripts to talk to databases you want to minimise the number of messages that are sent, as each message adds a round-trip delay to your execution time. Currently, it looks as if you are sending (by your own approximation) around 133 million individual messages, and paying that delay 133 million times. A simple optimisation would be to parse your spreadsheet, split the data into the per-column tables (either in memory or saved to disk), and only then send the data to the new DB in batches.
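A sketch of that idea against the question's script (cursor, cnx and the parsing loop are the ones from the question; the grouping container is the only new piece): accumulate the rows per table while parsing, then send each table's rows with a single executemany call.
from collections import defaultdict

rows_by_table = defaultdict(list)

# inside the CSV parsing loop from the question, instead of executing an
# INSERT per column, collect the values:
#     rows_by_table[table_name].append((timestamp, milliseconds, value))

# after parsing, send each table's rows in one batched round trip
insert_sql = "INSERT IGNORE INTO {} VALUES (NULL, %s, %s, %s)"
for table_name, rows in rows_by_table.items():
    cursor.executemany(insert_sql.format(table_name), rows)
cnx.commit()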
As you hinted, it's much quicker to write an SQL script to redistribute the data.
