SQL Query to read from csv file - python

I have a csv that I want to query to get some data and use that in another Python program. The .csv file has a name, but not the user id. The sql table has the user id. I would like to read the name from the csv, query the sql table for the user id, and then write that to another .csv (or just get the data to use). Here is an example of doing this with a prompt:
ACCEPT askone CHAR PROMPT 'First Name: ';
ACCEPT asktwo CHAR PROMPT 'Last Name: ';
select user_id from test.sy_users
where sy_first_nm = '&&askone' and sy_last_nm = '&&asktwo';
This works, but I'm trying to do it from a csv file with around 40 or 50 users whose ids I need to get. I just want askone and asktwo to come from the csv file. It seems like it should be simple, but I have not found a solution that actually works.

You have two options:
Use UTL_FILE where you can read data from your file and then tokenize it to extract the data you need (keep in mind that this requires the file to be on the DB server)
You can try using SQL*Loader, which is usually my preferred choice, because it lets you define a control file, which configures how to parse the file and load it into a table. After that you can do your processing by querying the data from the table you loaded it into.

There is a Jupyter notebook showing how to read and write CSV files with cx_Oracle at https://github.com/cjbj/cx-oracle-notebooks. The python-oracledb documentation has the same load-from-CSV example. (python-oracledb is the new name for cx_Oracle.)
Here are some generic CSV reading and writing examples that move data to and from database tables. Of course, once you have the data in Python, you can do whatever you like with it, e.g. use it in your query instead of writing it directly to a table (a sketch tailored to your question follows these examples).
If your CSV file looks like:
101,Abel
154,Baker
132,Charlie
199,Delta
. . .
and you have a table created with:
create table test (id number, name varchar2(25));
then you can load data into the table with an example like:
import oracledb
import csv

# CSV file
FILE_NAME = 'data.csv'

# Adjust the number of rows to be inserted in each iteration
# to meet your memory and performance requirements
BATCH_SIZE = 10000

connection = oracledb.connect(user="hr", password=userpwd,
                              dsn="dbhost.example.com/orclpdb")

with connection.cursor() as cursor:

    # Predefine the memory areas to match the table definition.
    # This can improve performance by avoiding memory reallocations.
    # Here, one parameter is passed for each of the columns.
    # "None" is used for the ID column, since the size of NUMBER isn't
    # variable. The "25" matches the maximum expected data size for the
    # NAME column
    cursor.setinputsizes(None, 25)

    with open(FILE_NAME, 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into test (id, name) values (:1, :2)"
        data = []
        for line in csv_reader:
            data.append((line[0], line[1]))
            if len(data) % BATCH_SIZE == 0:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        connection.commit()
One example of writing query data to a CSV file is:
sql = """select * from all_objects where rownum <= 10000"""
with connection.cursor() as cursor:
cursor.arraysize = 1000 # tune for performance
with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
writer = csv.writer(outputfile, lineterminator="\n")
results = cursor.execute(sql)
writer.writerows(results)
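Applied back to the original question, a minimal sketch could look like the following. The file names here are placeholders, the connection call mirrors the example above, and the table and column names come from the question's query:
import csv
import oracledb

connection = oracledb.connect(user="hr", password=userpwd,
                              dsn="dbhost.example.com/orclpdb")

with connection.cursor() as cursor, \
        open("names.csv", "r") as infile, \
        open("user_ids.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)    # rows like: John,Smith
    writer = csv.writer(outfile)
    for first_nm, last_nm in reader:
        cursor.execute(
            "select user_id from test.sy_users "
            "where sy_first_nm = :1 and sy_last_nm = :2",
            [first_nm, last_nm])
        row = cursor.fetchone()
        # write the names plus the id (or blank if no match was found)
        writer.writerow([first_nm, last_nm, row[0] if row else ""])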

Related

How to avoid 'database disk image malformed' error while loading large json file in python / pandas?

I am trying to read a table from an sqlite database (of size 4 GB). Each cell of the table contains JSON (a few cells hold large JSON documents).
The query works fine when I execute it within DB Browser, but in Python it gives an error: 'Database disk image is malformed'.
I have tried different tables and the problem persists. The query fetches about 5000 rows; however, each cell may hold a long JSON-structured string (of about 10000 lines).
I have already tried working with replicas of the database and with other databases. I also tried the following in the db:
Pragma integrity_check;
Pragma temp_store = 2; -- to force data into RAM
The problem seems to be linked to Pandas / Python rather than the actual DB:
import sqlite3
import pandas as pd

conn = sqlite3.connect(db)

sql = """
select a.Topic, a.Timestamp, a.SessionId, a.ContextMask, b.BuildUUID, a.BuildId, a.LayerId,
       a.Payload
from MessageTable a
inner join BuildTable b
    on a.BuildId = b.BuildId
where a.Topic = ('Engine/Sensors/SensorData')
    and b.BuildUUID = :job
"""

cur = conn.cursor()
cur.execute(sql, {"job": '06c95a97-40c7-49b7-ad1b-0d439d412464'})
sensordf = pd.DataFrame(data=cur.fetchall(),
                        columns=['Topic', 'Timestamp_epoch', 'SessionId', 'ContextMask',
                                 'BuildUUID', 'BuildId', 'LayerId', 'Payload'])
I expect the output to be a pandas dataframe with the last column containing JSON values in each cell. I can then write a further script to parse the JSON and extract more data.

How to convert sqlite3 database to tsv from python code?

I am trying to get the tables inside an sqlite3 database and save them into a tsv file.
Does pandas have a tool to do that?
I know how to do it from sqlite:
sqlite> .mode tabs
sqlite> .output test1.tsv
sqlite> Select * from <table_name>;
But how do I do the same thing in a Python environment?
TSV == tab separated values, so the built-in csv module is more than enough to export your data. Something as simple as:
import csv
import sqlite3

connection = sqlite3.connect("your_database.db")  # open connection to your database
cursor = connection.cursor()                      # get a cursor for it
cursor.execute("SELECT * FROM your_table_name")   # execute the query
rows = cursor.fetchall()                          # collect the data

with open("test1.tsv", "wb") as f:                # on Python 3.x use "w" mode and newline=""
    writer = csv.writer(f, delimiter="\t")        # create a CSV writer, tab delimited
    writer.writerows(rows)                        # write your SQLite data
should do the trick.
The above answer did not really work for me. I couldn't find a good answer that didn't need many lines of code or cause formatting issues in the output file, so I came up with my own version that works without issues in a couple of lines.
import sqlite3
import pandas as pd

conn = sqlite3.connect(self.err_db_file, isolation_level=None,
                       detect_types=sqlite3.PARSE_COLNAMES)
db_df = pd.read_sql_query("SELECT * FROM error_log", conn)
db_df.to_csv('database.tsv', index=False, sep='\t')

Fastest way to load .xlsx file into MySQL database

I'm trying to import data from a .xlsx file into a SQL database.
Right now, I have a python script which uses the openpyxl and MySQLdb modules to
establish a connection to the database
open the workbook
grab the worksheet
loop through the rows of the worksheet, extracting the columns I need
and insert each record into the database, one by one
Unfortunately, this is painfully slow. I'm working with a huge data set, so I need to find a faster way to do this (preferably with Python). Any ideas?
wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

conn = MySQLdb.connect()
cursor = conn.cursor()
cursor.execute("SET autocommit = 0")

for row in ws.iter_rows(row_offset=1):
    sql_row = # data i need
    cursor.execute("INSERT sql_row")

conn.commit()
Disable autocommit if it is on! Autocommit is a function which causes MySQL to immediately try to push your data to disk. This is good if you only have one insert, but this is what causes each individual insert to take a long time. Instead, you can turn it off and try to insert the data all at once, committing only once you've run all of your insert statements.
Something like this might work:
import MySQLdb

con = MySQLdb.connect(
    host="your db host",
    user="your username",
    passwd="your password",
    db="your db name"
)
cursor = con.cursor()
cursor.execute("SET autocommit = 0")

data = # some code to get data from excel
for datum in data:
    cursor.execute("your insert statement".format(datum))

con.commit()
con.close()
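As a further speed-up, the per-row execute calls can be replaced with a single executemany using bind parameters. Below is a sketch assuming a three-column target table; my_table and its column names are placeholders, and min_row=2 skips the header row (newer openpyxl versions use min_row instead of row_offset):
import MySQLdb
import openpyxl

wb = openpyxl.load_workbook(filename="file.xlsx", read_only=True)
ws = wb["My Worksheet"]

con = MySQLdb.connect(host="your db host", user="your username",
                      passwd="your password", db="your db name")
cursor = con.cursor()

# Collect the rows first (or batch them if memory is a concern),
# then send them to MySQL in one executemany round trip.
rows = [(r[0].value, r[1].value, r[2].value) for r in ws.iter_rows(min_row=2)]
cursor.executemany(
    "INSERT INTO my_table (col_a, col_b, col_c) VALUES (%s, %s, %s)", rows)
con.commit()
con.close()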
Consider saving the workbook's worksheet as a CSV, then use MySQL's LOAD DATA INFILE. This is often a very fast load.
sql = """LOAD DATA INFILE '/path/to/data.csv'
INTO TABLE myTable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
LINES TERMINATED BY '\n'"""
cursor.execute(sql)
con.commit()
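A minimal sketch of the "save the worksheet as CSV" step, using openpyxl and the csv module (the file paths and worksheet name are placeholders, and values_only requires a reasonably recent openpyxl):
import csv
import openpyxl

wb = openpyxl.load_workbook(filename="file.xlsx", read_only=True)
ws = wb["My Worksheet"]

with open("/path/to/data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)  # each row is a tuple of cell values
Note that plain LOAD DATA INFILE reads the file on the MySQL server; if the CSV sits on the client machine, LOAD DATA LOCAL INFILE can be used instead, provided the server allows it.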

Fast data moving from CSV to SQLite by Python

I have a problem. There are hundreds of CSV files, ca. 1,000,000 lines each.
I need to move that data in a specific way, but the script works very slowly (it processes only a few tens of thousands of lines per hour).
My code:
import sqlite3 as lite
import csv
import os

my_file = open('file.csv', 'r')
reader = csv.reader(my_file, delimiter=',')

date = '2014-09-29'

con = lite.connect('test.db', isolation_level='exclusive')

for row in reader:
    position = row[0]
    item_name = row[1]

    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % item_name)
    cur.execute("INSERT INTO [%s] VALUES(?, ?)" % item_name, (date, position))

con.commit()
I found information about isolation_level and exclusive access to the database, but it didn't help much.
Lines in the CSV files have the structure: 1,item1 or 2,item2
Could anyone help me? Thanks!
Don't do SQL inserts. Prepare the CSV file first, then do:
.separator <separator>
.import <loadFile> <tableName>
See here: http://cs.stanford.edu/people/widom/cs145/sqlite/SQLiteLoad.html
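If this needs to be driven from Python rather than typed into the sqlite shell, one option (a sketch, assuming the sqlite3 command-line tool is installed, the CSV has no header row, and "items" is a placeholder table name) is to pipe the dot-commands to it with subprocess:
import subprocess

commands = """.mode csv
.import file.csv items
"""
# Feed the dot-commands to the sqlite3 shell against test.db
subprocess.run(["sqlite3", "test.db"], input=commands, text=True, check=True)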
You certainly don't want to create a new cursor object for each row you insert, and checking for table creation at each line will certainly slow you down as well.
I'd suggest doing this in two passes: on the first pass you create the needed tables, and on the second pass you record the data. If it is still slow, you could build a more sophisticated in-memory collection of the data to be inserted and perform "executemany" - but this requires grouping the data by table name in memory prior to committing (a sketch of that variant follows the code below).
import sqlite3 as lite
import csv
import os

my_file = open('file.csv', 'r')
reader = csv.reader(my_file, delimiter=',')

date = '2014-09-29'

con = lite.connect('test.db', isolation_level='exclusive')
cur = con.cursor()

# First pass: create all the needed tables
table_names = set(row[1] for row in reader)
my_file.seek(0)
for name in table_names:
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % name)

# Second pass: insert the data
for row in reader:
    position = row[0]
    item_name = row[1]
    cur.execute("INSERT INTO [%s] VALUES(?, ?)" % item_name, (date, position))

con.commit()
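A sketch of the executemany variant mentioned above, grouping the rows by item name in memory first (it assumes the same two-column CSV layout as in the question):
import csv
import sqlite3 as lite
from collections import defaultdict

date = '2014-09-29'

# Group all rows by item name so each table gets a single executemany call
rows_by_table = defaultdict(list)
with open('file.csv', 'r') as my_file:
    for position, item_name in csv.reader(my_file, delimiter=','):
        rows_by_table[item_name].append((date, position))

con = lite.connect('test.db', isolation_level='exclusive')
cur = con.cursor()
for item_name, rows in rows_by_table.items():
    cur.execute("CREATE TABLE IF NOT EXISTS [%s] (Date TEXT, Position INT)" % item_name)
    cur.executemany("INSERT INTO [%s] VALUES(?, ?)" % item_name, rows)
con.commit()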
The code is inefficient in that it performs two SQL statements for each row of the CSV. Try to optimize:
Is there a way to process the CSV first and convert it to SQL statements?
Are the rows in the CSV grouped by table (item name)? If so, you can accumulate the rows to be inserted into the same table (generate a set of INSERT statements for the same table) and prefix the resulting set of statements with CREATE TABLE IF NOT EXISTS only once, not for every row.
If possible, use bulk (multi-row) inserts; multi-row INSERT ... VALUES was introduced in SQLite 3.7.11. More on this: Is it possible to insert multiple rows at a time in an SQLite database?
If needed, bulk insert in chunks. More on this: Bulk insert huge data into SQLite using Python
I had the same problem. Now it is solved! I would like to share the method with everyone who is facing the same problem.
I use an sqlite3 database as an example; other databases may also work, but I am not sure. I use the pandas and sqlite3 modules in Python.
This can quickly convert a list of csv files [file1, file2, ...] into tables [table1, table2, ...].
import pandas as pd
import sqlite3 as sql

DataBasePath = "C:\\Users\\...\\database.sqlite"
conn = sql.connect(DataBasePath)

filePath = "C:\\Users\\...\\filefolder\\"
datafiles = ["file1", "file2", "file3", ...]
for f in datafiles:
    df = pd.read_csv(filePath + f + ".csv")
    df.to_sql(name=f, con=conn, if_exists='append', index=False)

conn.close()
What's more, this code creates the database file if it doesn't exist. The if_exists argument of to_sql() is important. Its default value is "fail", which raises an error if the table already exists; "replace" drops the table first if it exists, then creates a new table and imports the data; "append" inserts into the table if it exists, otherwise creates a new one and imports the data.
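For files with around a million lines each, the same idea can be combined with chunked reading so a whole CSV never has to sit in memory at once. A sketch, with placeholder file and table names:
import pandas as pd
import sqlite3 as sql

conn = sql.connect("database.sqlite")
# Read the CSV in slices of 100,000 rows and append each slice to the table
for chunk in pd.read_csv("file1.csv", chunksize=100000):
    chunk.to_sql(name="file1", con=conn, if_exists='append', index=False)
conn.close()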

create table using python objects

I'm using sqlite3 in a Python script to extract data from a client's spreadsheet. My client is planning to add on to the spreadsheet, so my sqlite code should generate its columns based on the headers I extract from the first line. How do I do this? This is my naive attempt:
import sqlite3

conn = sqlite3.connect('./foo.sql')
c = conn.cursor()

for line in file:
    if line[0] == 'firstline':
        # Below is the line in question
        c.execute(""" create table if not exists bar(?, ? ,?); """, lineTuple)
    else:
        c.execute(""" insert into bar values (?, ?, ?); """, lineTuple)
I think Python's csv module can help you extract the file data.
First, convert your spreadsheet to csv format ("save as csv") with an appropriate delimiter.
Then try the code snippet below:
import csv

file_ptr = open('filename.csv', 'r')
fields = range(0, number_of_columns)  # number of columns in the file header
file_data = csv.DictReader(file_ptr, fields, delimiter=',')
for data in file_data:
    print(data)
    # data will be in dict format; the first row holds all your headers,
    # the rest are column data
    # here, do your database query and further processing
Hope it will be helpful.
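For the dynamic table creation the question asks about, here is a minimal sketch. Column names cannot be bound as parameters, so they are interpolated into the SQL text and only the values go through placeholders; this assumes the header names are trusted (they come from your own spreadsheet) and that the delimiter is a comma:
import csv
import sqlite3

conn = sqlite3.connect('./foo.sql')
c = conn.cursor()

with open('filename.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    headers = next(reader)  # first line holds the column names
    cols = ", ".join('"%s"' % h for h in headers)
    placeholders = ", ".join("?" for _ in headers)
    c.execute('create table if not exists bar (%s)' % cols)
    c.executemany('insert into bar values (%s)' % placeholders, reader)

conn.commit()
conn.close()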
