Importing database takes a lot of time - python

I am trying to import a table that contains 81462 rows into a dataframe using the following code:
import pyodbc
import pandas as pd

sql_conn = pyodbc.connect('DRIVER={SQL Server}; SERVER=server.database.windows.net; DATABASE=server_dev; uid=user; pwd=pw')
query = "select * from product inner join brand on Product.BrandId = Brand.BrandId"
df = pd.read_sql(query, sql_conn)
The whole process takes a very long time; I am already about 30 minutes in and it is still running. I assume this is not normal, so how else should I import the table so that the processing time is quicker?

Thanks to @RomanPerekhrest. OFFSET ... FETCH NEXT imported everything within 1-2 minutes:
SELECT product.Name, brand.Name as BrandName, description, size FROM Product inner join brand on product.brandid=brand.brandid ORDER BY Name OFFSET 1 ROWS FETCH NEXT 80000 ROWS ONLY
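If the whole table still has to come across, a related approach is to read it in pages so that no single query has to materialize everything at once. The sketch below is only an illustration under the question's assumptions (same connection string and join, a hypothetical page size of 10,000); note that OFFSET 1 in the query above skips the first row, so OFFSET 0 is used here.
import pandas as pd
import pyodbc

sql_conn = pyodbc.connect('DRIVER={SQL Server}; SERVER=server.database.windows.net; DATABASE=server_dev; uid=user; pwd=pw')

page_size = 10000  # hypothetical page size
pages = []
offset = 0
while True:
    # page through the joined result with OFFSET/FETCH (requires an ORDER BY)
    query = (
        "SELECT product.Name, brand.Name AS BrandName, description, size "
        "FROM Product INNER JOIN brand ON product.BrandId = brand.BrandId "
        f"ORDER BY product.Name OFFSET {offset} ROWS FETCH NEXT {page_size} ROWS ONLY"
    )
    page = pd.read_sql(query, sql_conn)
    if page.empty:
        break
    pages.append(page)
    offset += page_size

df = pd.concat(pages, ignore_index=True)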

Related

Inserting Records To Delta Table Through Databricks

I want to insert 100,000 records into a Delta table using Databricks. I am trying to insert the data with a simple for loop, something like this:
revision_date = '01/04/2022'
for i in range(0, 100000):
    # intended: insert 'Class1' together with revision_date shifted by i days
    spark.sql(f"insert into db.delta_table_name values ('Class1', date_add(to_date('{revision_date}', 'dd/MM/yyyy'), {i}))")
The problem is that it takes an awfully long time to insert data with INSERT statements in Databricks; it took more than 5 hours to complete. Can anyone suggest an alternative or a solution to this problem in Databricks?
My cluster configuration is 168 GB, 24 cores, DBR 9.1 LTS, Spark 3.1.2.
Looping over that many individual INSERT operations on a Delta table costs a lot because every single INSERT command creates its own transaction log commit. You can read more about this in the Delta Lake transaction log documentation.
Instead, it is better to build the whole Spark dataframe first and then execute just one WRITE operation to load the data into the Delta table. The example code below should finish in less than a minute.
from pyspark.sql.functions import expr, row_number, lit, to_date, date_add
from pyspark.sql.window import Window
columns = ['col1']
rows = [['Class1']]
revision_date = '01/04/2022'
# just create a one record dataframe
df = spark.createDataFrame(rows, columns)
# duplicate to 100,000 records
df = df.withColumn('col1', expr('explode(array_repeat(col1,100000))'))
# create date column
df = df.withColumn('revision_date', lit(revision_date))
df = df.withColumn('revision_date', to_date('revision_date', 'dd/MM/yyyy'))
# create sequence column
w = Window().orderBy(lit('X'))
df = df.withColumn("col2", row_number().over(w))
# use + operation to add date
df = df.withColumn("revision_date", df.revision_date + df.col2)
# drop unused column
df = df.drop("col2")
# write to the delta table location
df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')
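A slightly simpler way to generate the same 100,000 rows, sketched here as an alternative and not part of the original answer, is spark.range, which produces the id sequence directly and avoids the array_repeat/Window trick; the table location is the same placeholder as above.
from pyspark.sql.functions import expr, lit

revision_date = '01/04/2022'

df = (
    spark.range(100000)  # one row per id: 0 .. 99999
         .withColumn('col1', lit('Class1'))
         .withColumn('revision_date',
                     expr(f"date_add(to_date('{revision_date}', 'dd/MM/yyyy'), cast(id as int))"))
         .drop('id')
)

df.write.format('delta').mode('overwrite').save('/location/of/your/delta/table')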

Is there a faster way to move millions of rows from Excel to a SQL database using Python?

I am a financial analyst with about two months' experience with Python, and I am working on a project using Python and SQL to automate the compilation of a report. The process involves accessing a changing number of Excel files saved in a share drive, pulling two tabs from each (summary and quote), and combining the datasets into two large "Quote" and "Summary" tables. The next step is to pull various columns from each, combine, calculate, etc.
The problem is that the dataset ends up being about 3.4 million rows and around 30 columns. The program I wrote below works, but it took 40 minutes to work through the first part (creating the list of dataframes) and another 4.5 hours to create the database and export the data, not to mention using a LOT of memory.
I know there must be a better way to accomplish this, but I don't have a CS background. Any help would be appreciated.
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)
c = 0
tables = list()
quote_combined = list()
summary_combined = list()
# Step through files in synced Sharepoint directory, select the files with the specific
# name format. For each file, parse the file name and add to 'tables' list, then load
# two specific tabs as pandas dataframes. Add two columns, format column headers, then
# add each dataframe to the list of dataframes.
for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        quote_sheet.insert(0, 'reportmonth', reportmonth)
        summary_sheet.insert(0, 'reportmonth', reportmonth)
        quote_sheet.insert(0, 'source_file', table_name)
        summary_sheet.insert(0, 'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        quote_combined.append(quote_sheet)
        summary_combined.append(summary_sheet)
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)
# Concatenate the list of dataframes to append one to another.
# Totals about 3.4mm rows for August
totalQuotes = pd.concat(quote_combined)
totalSummary = pd.concat(summary_combined)
# Change directory, create Sqlite database, and send the combined dataframes to database
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
cur = conn.cursor()
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'
totalQuotes.to_sql(sqlite_table, sqlite_connection, if_exists = 'replace')
totalSummary.to_sql(sqlite_table2, sqlite_connection, if_exists = 'replace')
print('Finished. It took: ', datetime.now() - starttime)
I see a few things you could do. Firstly, since your first step is just to transfer the data to your SQL DB, you don't necessarily need to append all the files to each other. You can just attack the problem one file at a time (which means you can multiprocess!), and whatever computations need to be completed can come later. This will also cut down your RAM usage, since if you have 10 files in your folder you aren't loading all 10 up at the same time.
I would recommend the following:
Construct an array of filenames that you need to access
Write a wrapper function that can take a filename, open + parse the file, and write the contents to your MySQL DB
Use the Python multiprocessing.Pool class to process the files in parallel. With 4 worker processes, for example, this step can run roughly 4 times faster. If you need to derive computations from this data and hence need to aggregate it, do that once the data is in the MySQL DB; it will be much faster there.
If you need to define some computations based on the aggregate data, do it now, in the MySQL DB. SQL is an incredibly powerful language, and there's a command out there for practically everything!
I've added in a short code snippet to show you what I'm talking about :)
from multiprocessing import Pool

PROCESSES = 4
FILES = []  # fill with the Excel filenames to process

def _process_file(filename):
    # open + parse the workbook here, then write its contents to the DB
    print("Processing: " + filename)

if __name__ == '__main__':
    with Pool(PROCESSES) as pool:
        pool.map(_process_file, FILES)
SQL clarification: You don't need an independent table for every file you move to SQL. You can create one table based on a given schema and then add the data from ALL your files to that one table, row by row. This is essentially what the DataFrame-to-table function does, except that it creates 10 different tables. You can look at some examples of inserting rows into a table here. However, in your specific use case, setting the if_exists parameter to "append" should work, as you mentioned in your comment. I only added the earlier references because you mentioned that you are fairly new to Python, and a lot of my friends in the finance industry have found that a slightly more nuanced understanding of SQL is extremely useful.
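Tying the two ideas together, here is a minimal sketch of such a wrapper function; it is my own illustration, not part of the answer. It parses one workbook and appends both tabs to the shared tables with if_exists='append'; the database URL, sheet names, and filename slicing are assumptions carried over from the question. Note that if several worker processes write to a single SQLite file at once they can hit locking, so with SQLite it may be safer to parse in parallel but write from the main process (or target MySQL as suggested above).
import pandas as pd
from sqlalchemy import create_engine

DB_URL = 'sqlite:///AMZN-Quote-files_2020-08.sqlite'  # hypothetical target database
REPORTMONTH = '2020-08'

def process_file(filename):
    # parse one workbook and append its two tabs to the shared tables
    engine = create_engine(DB_URL)
    source_file = str(filename[11:-5])
    for sheet, table in [('-Amazon-Quote', 'totalQuotes'),
                         ('-Amazon-Summary', 'totalSummary')]:
        df = pd.read_excel(filename, sheet_name=sheet)
        df.insert(0, 'reportmonth', REPORTMONTH)
        df.insert(0, 'source_file', source_file)
        df.columns = df.columns.str.strip().str.replace(' ', '_')
        df.to_sql(table, engine, if_exists='append', index=False)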
Try this. Most of the time is spent loading the data from Excel into a DataFrame. I am not sure the following script will reduce the time to within seconds, but it will reduce the RAM footprint, which in turn could speed up the process, potentially by at least 5-10 minutes. Since I have no access to the data I cannot be sure, but you should try this:
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound
# define reportmonth and month_folder before they are used in the connection strings
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'
os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)
c = 0
tables = list()
for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        quote_sheet.insert(0, 'reportmonth', reportmonth)
        summary_sheet.insert(0, 'reportmonth', reportmonth)
        quote_sheet.insert(0, 'source_file', table_name)
        summary_sheet.insert(0, 'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        quote_sheet.to_sql(sqlite_table, sqlite_connection, if_exists='append')
        summary_sheet.to_sql(sqlite_table2, sqlite_connection, if_exists='append')
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

pd.read_sql method to count number of rows in a large Access database

I am trying to read the number of rows in a large access database and I am trying to find the most efficient method. Here is my code:
import pyodbc

driver = 'access driver as string'
DatabaseLink = 'access database link as string'
Name = 'access table name as string'
conn = pyodbc.connect(r'Driver={' + driver + '};DBQ=' + DatabaseLink + ';')
cursor = conn.cursor()
AccessSize = cursor.execute('SELECT COUNT(1) FROM ' + Name).fetchone()[0]
conn.close()
This works, and AccessSize does give me an integer with the number of rows in the database; however, it takes far too long to compute (my database has over 2 million rows and 15 columns).
I have attempted to read the data through pd.read_sql and used the chunksize functionality to loop through and keep counting the length of each chunk, but this also takes long. I have also attempted .fetchall in the cursor execute section, but the speed is similar to .fetchone.
I would have thought there would be a faster method to quickly calculate the length of the table, as I don't require the entire table to be read. My thought is to find the index value of the last row, as this is essentially the number of rows, but I am unsure how to do this.
Thanks
From comment to the question:
Unfortunately the database doesn't have a suitable keys or indexes in any of its columns.
Then you can't expect good performance from the database because every SELECT will be a table scan.
I have an Access database on a network share. It contains a single table with 1 million rows and absolutely no indexes. The Access database file itself is 42 MiB. When I do
from time import time

t0 = time()
# cnxn is an open pyodbc connection to the Access file
df = pd.read_sql_query("SELECT COUNT(*) AS n FROM Table1", cnxn)
print(f'{time() - t0} seconds')
it takes 75 seconds and generates 45 MiB of network traffic. Simply adding a primary key to the table increases the file size to 48 MiB, but the same code takes 10 seconds and generates 7 MiB of network traffic.
TL;DR: Add a primary key to the table or continue to suffer from poor performance.
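For reference, a primary key can also be added from Python. The following is only a hedged sketch with hypothetical driver, table, and column names, using Access DDL through pyodbc; adjust it to your own schema, or simply add the key from the Access UI.
import pyodbc

# hypothetical connection string and table name
cnxn = pyodbc.connect(r'Driver={Microsoft Access Driver (*.mdb, *.accdb)};'
                      r'DBQ=C:\path\to\database.accdb;')
cur = cnxn.cursor()

# add an AutoNumber column and declare it the primary key (Access DDL)
cur.execute("ALTER TABLE Table1 ADD COLUMN ID COUNTER CONSTRAINT PK_Table1 PRIMARY KEY")
cnxn.commit()
cnxn.close()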
2 million rows should not take that long. I use pd.read_sql(sql, con) like this:
con = connection
sql = """ my sql statement
here"""
table = pd.read_sql(sql=sql, con=con)
Are you doing something different?
In my case I am using a DB2 database; maybe that is why it is faster.

Getting no such table error using pandas and sqldf

I am getting a sqlite3 error:
OperationalError: no such table: Bills
I first create my dataframes using pandas and then reference those dataframes in my query, which works fine:
import pandas as pd
from pandasql import sqldf
Bills = pd.read_csv("Bills.csv")
Accessorials = pd.read_csv("Accessorials.csv")
q = """
Select
CityStateLane,
Count(BillID) as BillsCount,
Sum(BilledAmount) as BillsSum,
Count(Distinct CarrierName) as NumberOfCarriers,
Avg(BilledAmount) as BillsAverage,
Avg(BilledWeight) as WeightAverage
From
Bills
Where
Direction = 'THIRD PARTY'
Group by
CityStateLane
Order by
BillsCount DESC
"""
topCityStateLane = sqldf(q)
I then create another dataframe using another query, but this raises the error saying Bills is not there, even though I successfully used it in the previous query.
q = """
SELECT
Bills.BillID as BillID,
A2.TotalAcc as TotalAcc
FROM
(SELECT
BillID_Value,
SUM(PaidAmount_Value) as "TotalAcc"
FROM
Accessorials
GROUP BY
BillID_Value
) AS A2,
Bills
WHERE
A2.BillID_Value = Bills.BillID
"""
temp = sqldf(q)
Thank you for taking the time to read this.
Are you trying to join Bills with the A2 subquery? Try an explicit join instead of listing both sources in the FROM clause:
q = """
SELECT
Bills.BillID as BillID,
A2.TotalAcc as TotalAcc
FROM
(SELECT
BillID_Value,
SUM(PaidAmount_Value) as "TotalAcc"
FROM
Accessorials
GROUP BY
BillID_Value
) AS A2
join Bills
on A2.BillID_Value = Bills.BillID
"""
temp = sqldf(q)
) AS A2,
Bills
I think this is where your issue is. Your FROM clause mixes the subquery aliased as A2 with the Bills table using an implicit comma join, so the query is effectively reading from the A2 'table' alongside Bills. As Qianbo Wang mentioned, if you want to return output from these two sources, you will have to join them together explicitly.
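If pandasql still fails to see the Bills dataframe, the same result can also be computed in plain pandas; this is a sketch of my own, not from either answer, with column names taken from the question: groupby replaces the subquery and merge replaces the join.
# aggregate accessorial amounts per bill, then join back to Bills
total_acc = (
    Accessorials.groupby('BillID_Value', as_index=False)['PaidAmount_Value']
                .sum()
                .rename(columns={'PaidAmount_Value': 'TotalAcc'})
)
temp = Bills.merge(total_acc, left_on='BillID', right_on='BillID_Value')[['BillID', 'TotalAcc']]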

Put retrieved data from MySQL query into DataFrame pandas by a for loop

I have one database with two tables, both of which have a column called barcode. The aim is to retrieve barcodes from one table and then look up the entries in the other table where extra information for each barcode is stored. I would like both sets of retrieved data to be saved in a DataFrame. The problem is that when I insert the data retrieved by the second query into a DataFrame, it stores only the last entry:
import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(user=user, password=password, host=host, database=database)
query_barcode = "SELECT barcode FROM barcode_store"
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly, and here is the part with problem:
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    cursor.execute(query_info % each_barcode)
    pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the last matching barcode's information, while I want to retrieve the information for every match in data_barcode.
That's because you are overwriting the existing pro_info with new data on each loop iteration. You should rather do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant since you can get all records in one SELECT and instantly insert them to your DataFrame.
#edit: However, if you need the WHERE clause to fetch only specific products, you need to store the records in a list until you insert them into a DataFrame. So your code will eventually look like this:
pro_list = []
query_info = "SELECT product_code FROM product_info WHERE barcode=%s"
for each_barcode in data_barcode:
    # each_barcode is a one-element tuple, so it can be passed directly as the parameter set
    cursor.execute(query_info, each_barcode)
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
Cheers!
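Another option, sketched here under the assumption that both tables can be joined on their barcode columns and reusing the cnx connection from above, is to let MySQL do the matching in a single query and read the result straight into pandas:
import pandas as pd

query = """
    SELECT b.barcode, p.product_code
    FROM barcode_store AS b
    JOIN product_info AS p ON p.barcode = b.barcode
"""
pro_info = pd.read_sql(query, cnx)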
