How to import XML data straight to SQL Server using Python?

I have an XML file and I want to import its data into a SQL Server table using Python. I know that if we want to run a Python script we can use the
sp_execute_external_script stored procedure. I have also developed a stored procedure which converts the XML file to a CSV file and then loads it into SQL Server using BULK INSERT. But is it possible to load it directly, without converting to a CSV file?
My XML-to-CSV and CSV-loading code is below:
CREATE PROCEDURE dbo.XMLParser
(
    @XMLFilePath VARCHAR(MAX),
    @CSVFilePath VARCHAR(MAX)
)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @a VARCHAR(MAX) = @XMLFilePath,
            @b VARCHAR(MAX) = @CSVFilePath;
    EXECUTE sp_execute_external_script @language = N'Python',
        @script = N'import xml.etree.ElementTree as ET
import csv
tree = ET.parse(a)
root = tree.getroot()
employee_data = open(b, "w")
csvwriter = csv.writer(employee_data)
employees_head = []
count = 0
for member in root.findall("Employee"):
    employee = []
    address_list = []
    if count == 0:
        name = member.find("Name").tag
        employees_head.append(name)
        PhoneNumber = member.find("PhoneNumber").tag
        employees_head.append(PhoneNumber)
        EmailAddress = member.find("EmailAddress").tag
        employees_head.append(EmailAddress)
        Address = member[3].tag
        employees_head.append(Address)
        csvwriter.writerow(employees_head)
        count = count + 1
    name = member.find("Name").text
    employee.append(name)
    PhoneNumber = member.find("PhoneNumber").text
    employee.append(PhoneNumber)
    EmailAddress = member.find("EmailAddress").text
    employee.append(EmailAddress)
    Address = member[3][0].text
    address_list.append(Address)
    City = member[3][1].text
    address_list.append(City)
    StateCode = member[3][2].text
    address_list.append(StateCode)
    PostalCode = member[3][3].text
    address_list.append(PostalCode)
    employee.append(address_list)
    csvwriter.writerow(employee)
employee_data.close()',
        @params = N'@a varchar(max), @b varchar(max)',
        @a = @a,
        @b = @b;
    BULK INSERT dbo.Employee
    FROM 'E:\EmployeeData.csv'
    WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n'
    );
END;

You could convert your XML to a Python list of dicts, then loop over it and insert row by row into your database. You could also sleep for one second after inserting each row.
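A minimal sketch of that approach, assuming the XML layout implied by the question (each <Employee> has Name, PhoneNumber, EmailAddress and a fourth child holding the address parts) and a dbo.Employee table with matching columns; the pyodbc driver, connection string, file path and column names are all assumptions you would adjust:
import xml.etree.ElementTree as ET
import time
import pyodbc  # assumption: any SQL Server DB-API driver works the same way

tree = ET.parse(r"E:\EmployeeData.xml")   # illustrative path
root = tree.getroot()

# Convert the XML into a list of dicts, one per <Employee>
rows = []
for member in root.findall("Employee"):
    rows.append({
        "Name": member.find("Name").text,
        "PhoneNumber": member.find("PhoneNumber").text,
        "EmailAddress": member.find("EmailAddress").text,
        "Address": ", ".join(child.text for child in member[3]),
    })

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=MyDb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
for r in rows:
    cursor.execute(
        "INSERT INTO dbo.Employee (Name, PhoneNumber, EmailAddress, Address) VALUES (?, ?, ?, ?)",
        r["Name"], r["PhoneNumber"], r["EmailAddress"], r["Address"],
    )
    # time.sleep(1)  # optional one-second pause per row, as suggested above
conn.commit()
conn.close()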

Related

how to ingest a table specification file in .txt form and create a table in sqlite?

I tried a few different ways (below) but am having trouble a) removing the width and b) replacing the \n with a comma. I have a txt file like the one below and I want to take that information and create a table in sqlite (all using Python).
"field",width, type
name, 15, string
revenue, 10, decimal
invoice_date, 10, string
amount, 2, integer
Current Python code - trying to read in the file and get the values to pass into the SQL statement below:
import os
import pandas as pd
dir_path = os.path.dirname(os.path.realpath(__file__))
file = open(str(dir_path) + '/revenue/revenue_table_specifications.txt','r')
lines = file.readlines()
table = lines[2::]
s = ''.join(str(table).split(','))
x = s.replace("\n", ",").strip()
print(x)
sql I want to pass in
c = sqlite3.connect('rev.db') # connect to DB
try:
    c.execute('''CREATE TABLE
        revenue_table (information from txt file,
        information from txt file,
        ....)''')
except sqlite3.OperationalError: # i.e. table exists already
    pass
This produces something that will work.
def makesql(filename):
    s = []
    for row in open(filename):
        if row[0] == '"':
            continue
        parts = row.strip().split(", ")
        s.append( f"{parts[0]} {parts[2]}" )
    return "CREATE TABLE revenue_table (\n" + ",\n".join(s) + ");"

sql = makesql( 'x.csv' )
print(sql)
c.execute( sql )
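With the specification file shown above, the statement printed by print(sql) should come out looking roughly like this (SQLite will accept these type names as-is thanks to its flexible type affinity):
CREATE TABLE revenue_table (
name string,
revenue decimal,
invoice_date string,
amount integer);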

Python read mysql to csv

I would like to read a mysql database in chunks and write its contents to a bunch of csv files.
While this can be done easily with pandas, as below:
df_chunks = pd.read_sql_table(table_name, con, chunksize=CHUNK_SIZE)
for i, df in enumerate(df_chunks):
    df.to_csv("file_{}.csv".format(i))
Assuming I cannot use pandas, what other alternative can I use? I tried using
import sqlalchemy as sqldb
import csv

CHUNK_SIZE = 100000
table_name = "XXXXX"
host = "XXXXXX"
user = "XXXX"
password = "XXXXX"
database = "XXXXX"
port = "XXXX"

engine = sqldb.create_engine('mysql+pymysql://{}:{}@{}:{}/{}'.format(user, password, host, port, database))
con = engine.connect()
metadata = sqldb.MetaData()
table = sqldb.Table(table_name, metadata, autoload=True, autoload_with=engine)
query = table.select()
proxy = con.execution_options(stream_results=True).execute(query)

cols = [""] + [column.name for column in table.c]
file_num = 0
while True:
    batch = proxy.fetchmany(CHUNK_SIZE)
    if not batch:
        break
    csv_writer = csv.writer("file_{}.csv".format(file_num), delimiter=',')
    csv_writer.writerow(cols)
    # csv_writer.writerows(batch) # while this works, it does not have the index similar to df.to_csv()
    for i, row in enumerate(batch):
        csv_writer.writerow(i + row) # will error here
    file_num += 1
proxy.close()
While using .writerows(batch) works fine, it does not include the index like the result you get from df.to_csv(). I would like to add the row number equivalent as well, but can't seem to add it to the row, which is a sqlalchemy.engine.result.RowProxy. How can I do it? Or what other, faster alternative can I use?
Look up SELECT ... INTO OUTFILE ...
It will do the task in 1 SQL statement; 0 lines of Python (other than invoking that SQL).
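For example, issuing it from Python through the SQLAlchemy connection already opened above might look roughly like this (a sketch only: the output path is illustrative, the file is written on the MySQL server, and the account needs the FILE privilege plus a location permitted by secure_file_priv):
import sqlalchemy as sqldb

export_sql = sqldb.text(r"""
    SELECT *
    FROM {table}
    INTO OUTFILE '/var/lib/mysql-files/file_0.csv'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
""".format(table=table_name))  # table name comes from the question's variable

con.execute(export_sql)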

How to automatically create table and its columns based on CSV using python

This is a snippet of the CSV:
Column Header Values
LGA_CODE_2016 LGA10050
Median_age_persons 39
Median_mortgage_repay_monthly 1421
Median_tot_prsnl_inc_weekly 642
Median_rent_weekly 231
Median_tot_fam_inc_weekly 1532
Average_num_psns_per_bedroom 0.8
Median_tot_hhd_inc_weekly 1185
Average_household_size 2.3
I have 200+ CSVs which have a combination of datatypes such as VARCHAR, INTEGER, and FLOAT.
The first column of every table must be the primary key (i.e. LGA_CODE_2016 in the CSV above).
Here is the code I tried
import csv
import psycopg2
import os
import glob
import re

conn = psycopg2.connect("host= hostnamexx dbname=dbnamexx user= usernamexx password= pwdxx")
print("Connecting to Database")
csvPath = "./TestDataLGA/"

# Loop through each CSV
for filename in glob.glob(csvPath + "*.csv"):
    # Create a table name
    tablename = filename.replace("./TestDataLGA\\", "").replace(".csv", "")
    print tablename
    # Open file
    fileInput = open(filename, "r")
    # Extract first line of file
    firstLine = fileInput.readline().strip()
    # Extract second line of file
    secondLine = fileInput.readline()
    # Split columns into an array [...]
    columns = firstLine.split(",")
    colvals = secondLine.split(",")
    # Build SQL code to drop table if exists and create table
    sqlQueryCreate = 'DROP TABLE IF EXISTS ' + " abs.ABS_" + tablename + ";\n"
    sqlQueryCreate += 'CREATE TABLE' + " abs.ABS_" + tablename + "("
    # Define columns for table
    for column in columns:
        for dtype in colvals:
            dt = bool(re.match(r"^\d+?\.\d+?$", dtype))
            if dtype.isdigit():
                dtype = "INTEGER"
            elif dt == True:
                dtype = "FLOAT(2)"
            else:
                dtype = "VARCHAR(64)"
        sqlQueryCreate += column + " " + dtype + ",\n"
    sqlQueryCreate = sqlQueryCreate[:-2]
    sqlQueryCreate += ");"
    print sqlQueryCreate
    #cur = conn.cursor()
    #cur.execute(sqlQueryCreate)
    #conn.commit()
    #cur.close()
This is the output that I am getting
DROP TABLE IF EXISTS abs.ABS_G02_AUS_LGA;
CREATE TABLE abs.ABS_G02_AUS_LGA(LGA_CODE_2016 FLOAT(2),
Median_age_persons FLOAT(2),
Median_mortgage_repay_monthly FLOAT(2),
Median_tot_prsnl_inc_weekly FLOAT(2),
Median_rent_weekly FLOAT(2),
Median_tot_fam_inc_weekly FLOAT(2),
Average_num_psns_per_bedroom FLOAT(2),
Median_tot_hhd_inc_weekly FLOAT(2),
Average_household_size FLOAT(2));
PS C:\Python27\Scripts>
If I run my inner for loop by itself I get the correct set of datatypes based on the CSV, but when I run it inside the outer for loop, it only prints the last generated datatype, FLOAT(2), for every column header.
I am also confused about where to put the code for the primary key.
Can anyone help me fix this issue?
I have tried several permutations and combinations of looping and using the break statement, but nothing seems to work.
PS: I am working on test data, hence only one CSV file's output is shown here.
This is a continuation of my earlier question, how to automatically create table based on CSV into postgres using python.
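For reference, one way to restructure this is sketched below: instead of nesting the two loops, pair each header with its own sample value using zip(), and append PRIMARY KEY to the first column's definition. This only follows the regex/type logic already in the question; the helper name is illustrative.
import re

def build_columns(columns, colvals):
    # Pair each header with its sample value so the type test runs once per column
    col_defs = []
    for i, (column, value) in enumerate(zip(columns, colvals)):
        value = value.strip()
        if value.isdigit():
            dtype = "INTEGER"
        elif re.match(r"^\d+?\.\d+?$", value):
            dtype = "FLOAT(2)"
        else:
            dtype = "VARCHAR(64)"
        if i == 0:
            dtype += " PRIMARY KEY"   # first column of every table is the PK
        col_defs.append(column + " " + dtype)
    return col_defs

# inside the existing per-file loop:
# sqlQueryCreate = 'DROP TABLE IF EXISTS ' + " abs.ABS_" + tablename + ";\n"
# sqlQueryCreate += 'CREATE TABLE' + " abs.ABS_" + tablename + "(" + ",\n".join(build_columns(columns, colvals)) + ");"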

Parsing just a column of a CSV file with multiple columns

I am learning Python and am currently working with it to parse a CSV file.
The CSV file has 3 columns:
Full_name, University, and Birth_Year.
I have successfully loaded, read, and printed the content of a given CSV file into Python, but here's where I am stuck:
I want to parse ONLY the Full_name column into 3 columns: first, middle, and last. If there are only 2 words in the name, then the middle name should be null.
The resulting parsed output should then be inserted into a SQL db through Python.
Here’s my code so far:
import csv
import sys

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Please enter the csv file too: python name_parsing.py student_info.csv")
        sys.exit()
    else:
        with open(sys.argv[1], "r") as file:
            reader = csv.DictReader(file) # I use DictReader because the csv file has > 1 col
            # Print the names on the cmd
            for row in reader:
                name = row["Full_name"]
                for name in reader:
                    if len(name) == 2:
                        print(first_name = name[0])
                        print(middle_name = None)
                        print(last_name = name[2])
                    if len(name) == 3: # The assumption is that name is either 2 or 3 words only.
                        print(first_name = name[0])
                        print(middle_name = name[1])
                        print(last_name = name[2])
                db.execute("INSERT INTO name (first, middle, last) VALUES(?,?,?)",
                           row["first_name"], row["middle_name"], row["last_name"])
Running the program above gives me no output whatsoever. How can I parse this the right way? Thank you.
I created a sample file based on your description. The content looks as below:
Full_name,University,Birth_Year
Prakash Ranjan Gupta,BPUT,1920
Hari Shankar,NIT,1980
John Andrews,MIT,1950
Arbaaz Aslam Khan,REC,2005
And then I executed the code below. It runs fine in my Jupyter notebook. You can add the argument-check lines (len(sys.argv) != 2, etc.) to this as you need. I have used a sqlite3 database; I hope this works. In case you want the if/main block added to this, let me know and I can edit.
This goes by your code. (Otherwise you can do this using pandas in an easier way, I believe; a rough pandas sketch is included further below.)
import csv
import sqlite3

con = sqlite3.connect('name_data.sql') ## Make DB connection and create a table if it does not exist
cur = con.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS UNIV_DATA
               (FIRSTNAME TEXT,
                MIDDLE_NAME TEXT,
                LASTNAME TEXT,
                UNIVERSITY TEXT,
                YEAR TEXT)''')

with open('names_data.csv') as fh:
    read_data = csv.DictReader(fh)
    for uniData in read_data:
        lst_nm = uniData['Full_name'].split()
        if len(lst_nm) == 2:
            fn, ln = lst_nm
            mn = None
        else:
            fn, mn, ln = lst_nm
        # print(fn, mn, ln, uniData['University'], uniData['Birth_Year'])
        cur.execute('''
            INSERT INTO UNIV_DATA
            (FIRSTNAME, MIDDLE_NAME, LASTNAME, UNIVERSITY, YEAR)
            VALUES(?,?,?,?,?)''',
            (fn, mn, ln, uniData['University'], uniData['Birth_Year'])
        )
con.commit()
cur.close()
con.close()
If you want to read the data in the table UNIV_DATA:
Option 1: (prints the rows in the form of tuple)
import sqlite3
con = sqlite3.connect('name_data.sql') #Make connection to DB and create a connection object
cur = con.cursor() #Create a cursor object
results = cur.execute('SELECT * FROM UNIV_DATA') # Execute the query and store the rows retrieved in 'result'
[print(result) for result in results] #Traverse through 'result' in a loop to print the rows retrieved
cur.close() #close the cursor
con.close() #close the connection
Option 2: (prints all the rows in the form of a pandas data frame - execute in jupyter ...preferably )
import sqlite3
import pandas as pd
con = sqlite3.connect('name_data.sql') #Make connection to DB and create a connection object
df = pd.read_sql('SELECT * FROM UNIV_DATA', con) #Query the table and store the result in a dataframe : df
df
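As mentioned above, pandas can shorten the parsing and inserting. A rough sketch, reusing the same file and table names and assuming names are only ever 2 or 3 words:
import sqlite3
import pandas as pd

def split_name(full_name):
    # "First Last" -> (First, None, Last); "First Middle Last" -> all three
    parts = full_name.split()
    if len(parts) == 2:
        return pd.Series([parts[0], None, parts[1]])
    return pd.Series(parts[:3])

df = pd.read_csv('names_data.csv')
name_parts = df['Full_name'].apply(split_name)
name_parts.columns = ['FIRSTNAME', 'MIDDLE_NAME', 'LASTNAME']

out = pd.concat([name_parts, df[['University', 'Birth_Year']]], axis=1)
out.columns = ['FIRSTNAME', 'MIDDLE_NAME', 'LASTNAME', 'UNIVERSITY', 'YEAR']

con = sqlite3.connect('name_data.sql')
out.to_sql('UNIV_DATA', con, if_exists='append', index=False)   # appends to the table created earlier
con.close()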
When you call name = row["Full_name"] it is going to return a string representing the name, e.g. "John Smith".
In Python, strings can be indexed like lists, so in this case if you called len(name) it would return 10, as "John Smith" has 10 characters. As this doesn't equal 2 or 3, nothing happens inside your loop.
What you need is some way to turn the string into a list containing the first, middle and last names. You can do this using the split method. If you call name.split(" ") it will split the string wherever there is a space; continuing the above example, this returns ["John", "Smith"], which should make your code work.
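For example:
name = "John Smith"
parts = name.split(" ")
print(parts)        # ['John', 'Smith']
print(len(parts))   # 2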

Python- iterate through table of addresses, geolocate with function, copy lat/long to row?

I have a table of addresses, and I need my Python script to use my Google API geolocate function to return lat/long coordinates for each address and add them to a new field in the same row. The geocode function works fine; I just can't get the script to iterate through each row of the table, pass the address to the function, and then copy the output lat/long to the field in the same row. Here's what I have:
import urllib, json, time, arcpy

arcpy.env.workspace = "D:/GIS/addr.dbf"

# sets variables, adds new field to hold lat/long coordinates
fc = 'addr.dbf'
field1 = 'address'
field2 = 'loc'
arcpy.AddField_management(fc, field2, "TEXT")

# function that uses Google API to geolocate - this part works consistently
def geolocate(address, api="key_here", delay=4):
    base = r"https://maps.googleapis.com/maps/api/geocode/json?"
    addP = "address=" + address.replace(" ", "+")
    gUrl = base + addP + "&key=" + api
    response = urllib.urlopen(gUrl)
    jres = response.read()
    jData = json.loads(jres)
    if jData['status'] == 'OK':
        resu = jData['results'][0]
        finList = [resu['formatted_address'], resu['geometry']['location']['lat'], resu['geometry']['location']['lng']]
    else:
        finList = [None, None, None]
    time.sleep(delay)
    return finList

# adds address field as text to geolocate in function, adds output lat/long
# (indexed locations 0 and 1 from the finList output)
## this is the part that doesn't work!
geo = geolocate(address = field1)
cursor = arcpy.UpdateCursor(fc, [field1, field2])
for row in cursor:
    field2 = geo[0], geo[1]
    cursor.updateRow(row);
You're calling the geolocate function with the string 'address'
field1 = 'address'
geo = geolocate(address = field1)
You want to call geolocate with the actual address, which means you need to do it within your loop that is iterating through the cursor. So it should be something like:
fields = [field1, field2]
with arcpy.da.UpdateCursor(fc, fields) as cursor:
    for row in cursor:
        address = row[0]
        geo = geolocate(address = address)
        # finList is [formatted_address, lat, lng], so store lat/lng (geo[1], geo[2]) as text
        row[1] = "{}, {}".format(geo[1], geo[2])
        cursor.updateRow(row)
Note: I used the cursor from the data access module which was introduced in ArcGIS 10.1. I also used the 'with' syntax for the cursor so that it automatically handles deleting the cursor.
