Thanks for reading my post.
I need to deal with big files, let me give you more context, I extract some tables from a database convert those tables to CSV and after that, I convert them to JSON.
All that is to send the information to BigQuery.
Now my script works fine but I have a problem, some tables I extract are so so big one of them has 14 Gb, my problem is my server memory just has 8 Gb, exist any way to integrate some to my script to split or append the information ???
My script:
import pyodbc
import fileinput
import csv
import pandas as pd
import json
import os
import sys
conn = pyodbc.connect("Driver={SQL Server};"
"Server=TEST;"
"username=test;"
"password=12345;"
"Database=TEST;"
"Trusted_Connection=no;")
cursor = conn.cursor()
query = "SELECT * FROM placeholder where "
with open(r"D:\Test.txt") as file:
lines = file.readlines()
print(lines)
for user_input in lines:
result = query.replace("placeholder", user_input)
print(result)
sql_query = pd.read_sql(result,conn)
df = pd.DataFrame(sql_query)
user_inputs = user_input.strip("\n")
filename = os.path.join('D:\\', user_inputs + '.csv')
df.to_csv (filename, index = False, encoding='utf-8', sep = '~', quotechar = "`", quoting=csv.QUOTE_ALL)
print(filename)
filename_json = os.path.join('D:\\', user_inputs + '.jsonl')
csvFilePath = (filename)
jsonFilePath = (filename_json)
print(filename_json)
df_o = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
df_o.to_json(filename_json, orient = "records", lines = True, date_format = "iso", double_precision = 15, force_ascii = False, date_unit = 'ms', default_handler = str)
dir_name = "D:\\"
test = os.listdir(dir_name)
for item in test:
if item.endswith(".csv"):
os.remove(os.path.join(dir_name, item))
cursor.close()
conn.close()
I'm really new to python, I hope you can help me to integrate some into my script.
Really thanks so many guys !!!
Kind regards.
For large data sets you should avoid reading all of it at once and then writing it all at once. You should do partial reads and partial writes.
Since you are using BigQuery you should use paritions to limit the query output. Have some logic to update the partition offsets. For each partition you can generate one file per parition. In this case your output would be like output-1.csv, output-2.csv etc.
An example of using parition:
SELECT * FROM placeholder
WHERE transaction_date >= '2016-01-01'
As a bonus tip, avoid doing Select * as BigQuery is columnar storage system mentioning the columns you would want to read will significatnly improve the peformance.
I am making a program that fetches column names and dumps the data into csv format.
Now everything is working just fine and data is being dumped into csv, the problem is,
I am not able to fetch headers into csv. If I open the exported csv file into excel, only data shows up not the column headers. How do I do that?
Here's my code:
import cx_Oracle
import csv
dsn_tns = cx_Oracle.makedsn(--Details--)
conn = cx_Oracle.connect(--Details--)
d = conn.cursor()
csv_file = open("profile.csv", "w")
writer = csv.writer(csv_file, delimiter=',', lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
d.execute("""
select * from all_tab_columns where OWNER = 'ABBAS'
""")
tables_tu = d.fetchall()
for row in tables_tu:
writer.writerow(row)
conn.close()
csv_file.close()
What code do I use to export headers too in csv?
Place this just above your for loop:
writer.writerow(i[0] for i in d.description)
Because d.description is a read-only attribute containing 7-tuples that look like:
(name,
type_code,
display_size,
internal_size,
precision,
scale,
null_ok)
I am using pandas library to store excel into bytesIO memory. Later, I am storing this bytesIO object into SQL Server as below-
df = pandas.DataFrame(data1, columns=['col1', 'col2', 'col3'])
output = BytesIO()
writer = pandas.ExcelWriter(output,engine='xlsxwriter')
df.to_excel(writer)
writer.save()
output.seek(0)
workbook = output.read()
#store into table
Query = '''
INSERT INTO [TABLE]([file]) VALUES(?)
'''
values = (workbook)
cursor = conn.cursor()
cursor.execute(Query, values)
cursor.close()
conn.commit()
#Create excel file.
Query1 = "select [file] from [TABLE] where [id] = 1"
result = conn.cursor().execute(Query1).fetchall()
print(result[0])
Now, I want to pull the BytesIO object back from table and create an excel file and store it locally. How Do I do it?
Finally, I got solution.Below are the steps performed:
Takes Dataframe and convert it to excel and store it in memory in BytesIO format.
Store BytesIO object in Database column having varbinary(max)
Pull the stored BytesIO object and create an excel file locally.
Python Code:
#Get Required data in DataFrame:
df = pandas.DataFrame(data1, columns=['col1', 'col2', 'col3'])
#Convert the data frame to Excel and store it in BytesIO object `workbook`:
output = BytesIO()
writer = pandas.ExcelWriter(output,engine='xlsxwriter')
df.to_excel(writer)
writer.save()
output.seek(0)
workbook = output.read()
#store into Database table
Query = '''
INSERT INTO [TABLE]([file]) VALUES(?)
'''
values = (workbook)
cursor = conn.cursor()
cursor.execute(Query, values)
cursor.close()
conn.commit()
#Retrieve the BytesIO object from Database
Query1 = "select [file] from [TABLE] where [id] = 1"
result = conn.cursor().execute(Query1).fetchall()
WriteObj = BytesIO()
WriteObj.write(result[0][0])
WriteObj.seek(0)
df = pandas.read_excel(WriteObj)
df.to_excel("outputFile.xlsx")
I have downloaded some datas as a sqlite database (data.db) and I want to open this database in python and then convert it into pandas dataframe.
This is so far I have done
import sqlite3
import pandas
dat = sqlite3.connect('data.db') #connected to database with out error
pandas.DataFrame.from_records(dat, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)
But its throwing this error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 980, in from_records
coerce_float=coerce_float)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 5353, in _to_arrays
if not len(data):
TypeError: object of type 'sqlite3.Connection' has no len()
How to convert sqlite database to pandas dataframe
Despite sqlite being part of the Python Standard Library and is a nice and easy interface to SQLite databases, the Pandas tutorial states:
Note In order to use read_sql_table(), you must have the SQLAlchemy
optional dependency installed.
But Pandas still supports sqlite3 access if you want to avoid installing SQLAlchemy:
import sqlite3
import pandas as pd
# Create your connection.
cnx = sqlite3.connect('file.db')
df = pd.read_sql_query("SELECT * FROM table_name", cnx)
As stated here, but you need to know the name of the used table in advance.
The line
data = sqlite3.connect('data.db')
opens a connection to the database. There are no records queried up to this. So you have to execute a query afterward and provide this to the pandas DataFrame constructor.
It should look similar to this
import sqlite3
import pandas as pd
dat = sqlite3.connect('data.db')
query = dat.execute("SELECT * From <TABLENAME>")
cols = [column[0] for column in query.description]
results= pd.DataFrame.from_records(data = query.fetchall(), columns = cols)
I am not really firm with SQL commands, so you should check the correctness of the query. should be the name of the table in your database.
Parsing a sqlite .db into a dictionary of dataframes without knowing the table names:
def read_sqlite(dbfile):
import sqlite3
from pandas import read_sql_query, read_sql_table
with sqlite3.connect(dbfile) as dbcon:
tables = list(read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", dbcon)['name'])
out = {tbl : read_sql_query(f"SELECT * from {tbl}", dbcon) for tbl in tables}
return out
Search sqlalchemy, engine and database name in google (sqlite in this case):
import pandas as pd
import sqlalchemy
db_name = "data.db"
table_name = "LITTLE_BOBBY_TABLES"
engine = sqlalchemy.create_engine("sqlite:///%s" % db_name, execution_options={"sqlite_raw_colnames": True})
df = pd.read_sql_table(table_name, engine)
I wrote a piece of code up that saves tables in a database file such as .sqlite or .db and creates an excel file out of it with each table as a sheet or makes individual tables into csvs.
Note: You don't need to know the table names in advance!
import os, fnmatch
import sqlite3
import pandas as pd
#creates a directory without throwing an error
def create_dir(dir):
if not os.path.exists(dir):
os.makedirs(dir)
print("Created Directory : ", dir)
else:
print("Directory already existed : ", dir)
return dir
#finds files in a directory corresponding to a regex query
def find(pattern, path):
result = []
for root, dirs, files in os.walk(path):
for name in files:
if fnmatch.fnmatch(name, pattern):
result.append(os.path.join(root, name))
return result
#convert sqlite databases(.db,.sqlite) to pandas dataframe(excel with each table as a different sheet or individual csv sheets)
def save_db(dbpath=None,excel_path=None,csv_path=None,extension="*.sqlite",csvs=True,excels=True):
if (excels==False and csvs==False):
print("Atleast one of the parameters need to be true: csvs or excels")
return -1
#little code to find files by extension
if dbpath==None:
files=find(extension,os.getcwd())
if len(files)>1:
print("Multiple files found! Selecting the first one found!")
print("To locate your file, set dbpath=<yourpath>")
dbpath = find(extension,os.getcwd())[0] if dbpath==None else dbpath
print("Reading database file from location :",dbpath)
#path handling
external_folder,base_name=os.path.split(os.path.abspath(dbpath))
file_name=os.path.splitext(base_name)[0] #firstname without .
exten=os.path.splitext(base_name)[-1] #.file_extension
internal_folder="Saved_Dataframes_"+file_name
main_path=os.path.join(external_folder,internal_folder)
create_dir(main_path)
excel_path=os.path.join(main_path,"Excel_Multiple_Sheets.xlsx") if excel_path==None else excel_path
csv_path=main_path if csv_path==None else csv_path
db = sqlite3.connect(dbpath)
cursor = db.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()
print(len(tables),"Tables found :")
if excels==True:
#for writing to excel(xlsx) we will be needing this!
try:
import XlsxWriter
except ModuleNotFoundError:
!pip install XlsxWriter
if (excels==True and csvs==True):
writer = pd.ExcelWriter(excel_path, engine='xlsxwriter')
i=0
for table_name in tables:
table_name = table_name[0]
table = pd.read_sql_query("SELECT * from %s" % table_name, db)
i+=1
print("Parsing Excel Sheet ",i," : ",table_name)
table.to_excel(writer, sheet_name=table_name, index=False)
print("Parsing CSV File ",i," : ",table_name)
table.to_csv(os.path.join(csv_path,table_name + '.csv'), index_label='index')
writer.save()
elif excels==True:
writer = pd.ExcelWriter(excel_path, engine='xlsxwriter')
i=0
for table_name in tables:
table_name = table_name[0]
table = pd.read_sql_query("SELECT * from %s" % table_name, db)
i+=1
print("Parsing Excel Sheet ",i," : ",table_name)
table.to_excel(writer, sheet_name=table_name, index=False)
writer.save()
elif csvs==True:
i=0
for table_name in tables:
table_name = table_name[0]
table = pd.read_sql_query("SELECT * from %s" % table_name, db)
i+=1
print("Parsing CSV File ",i," : ",table_name)
table.to_csv(os.path.join(csv_path,table_name + '.csv'), index_label='index')
cursor.close()
db.close()
return 0
save_db();
If data.db is your SQLite database and table_name is one of its tables, then you can do:
import pandas as pd
df = pd.read_sql_table('table_name', 'sqlite:///data.db')
No other imports needed.
i have stored my data in database.sqlite table name is Reviews
import sqlite3
con=sqlite3.connect("database.sqlite")
data=pd.read_sql_query("SELECT * FROM Reviews",con)
print(data)
here is what I try to achieve my current code is working fine I get the query to run on my sql server but I will need to gather information from several servers. How would I add a column with the dbserver listed in that column?
import pyodbc
import csv
f = open("dblist.ini")
dbserver,UID,PWD = [ variable[variable.find("=")+1 :] for variable in f.readline().split("~")]
connectstring = "DRIVER={SQL server};SERVER=" + dbserver + ";DATABASE=master;UID="+UID+";PWD="+PWD
cnxn = pyodbc.connect(connectstring)
cursor = cnxn.cursor()
fd = open('mssql1.txt', 'r')
sqlFile = fd.read()
fd.close()
cursor.execute(sqlFile)
with open("out.csv", "wb") as csv_file:
csv_writer = csv.writer(csv_file, delimiter = '!')
csv_writer.writerow([i[0] for i in cursor.description]) # write headers
csv_writer.writerows(cursor)
You could add the extra information in your sql query. For example:
select "dbServerName", * from table;
Your cursor will return with an extra column in front of your real data that has the db Server name. The downside to this method is you're transferring a little more extra data.