I would like to read a MySQL database in chunks and write its contents to a bunch of CSV files.
While this can be done easily with pandas using the code below:
df_chunks = pd.read_sql_table(table_name, con, chunksize=CHUNK_SIZE)
for i, df in enumerate(df_chunks):
    df.to_csv("file_{}.csv".format(i))
Assuming I cannot use pandas, what other alternative can I use? I tried using:
import sqlalchemy as sqldb
import csv

CHUNK_SIZE = 100000
table_name = "XXXXX"
host = "XXXXXX"
user = "XXXX"
password = "XXXXX"
database = "XXXXX"
port = "XXXX"

engine = sqldb.create_engine('mysql+pymysql://{}:{}@{}:{}/{}'.format(user, password, host, port, database))
con = engine.connect()
metadata = sqldb.MetaData()
table = sqldb.Table(table_name, metadata, autoload=True, autoload_with=engine)
query = table.select()
proxy = con.execution_options(stream_results=True).execute(query)

cols = [""] + [column.name for column in table.c]

file_num = 0
while True:
    batch = proxy.fetchmany(CHUNK_SIZE)
    if not batch:
        break
    with open("file_{}.csv".format(file_num), "w", newline="") as f:
        csv_writer = csv.writer(f, delimiter=',')
        csv_writer.writerow(cols)
        #csv_writer.writerows(batch) # while this works, it does not have the index similar to df.to_csv()
        for i, row in enumerate(batch):
            csv_writer.writerow(i + row) # will error here
    file_num += 1
proxy.close()
While using .writerows(batch) works fine, it does not have the index like the result you get from df.to_csv(). I would like to add the row number equivalent as well, but I can't seem to add it to the row, which is a sqlalchemy.engine.result.RowProxy. How can I do it? Or what other faster alternative can I use?
Look up SELECT ... INTO OUTFILE ...
It will do the task in 1 SQL statement; 0 lines of Python (other than invoking that SQL).
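For reference, a minimal sketch of what that might look like, reusing con and table_name from the question's code. The output path is illustrative only: MySQL writes the file on the database server itself, the directory must be permitted by secure_file_priv, and the account needs the FILE privilege.
import sqlalchemy as sqldb

# SELECT ... INTO OUTFILE is executed server-side; the path below is purely illustrative.
outfile_sql = sqldb.text(
    "SELECT * FROM {} "
    "INTO OUTFILE '/var/lib/mysql-files/{}.csv' "
    "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
    "LINES TERMINATED BY '\n'".format(table_name, table_name)
)
con.execute(outfile_sql)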
I was trying to read some data from a text file and write it to a SQL Server table using the pandas module and a for loop. Below is my code:
import pandas as pd
import pyodbc

driver = '{SQL Server Native Client 11.0}'
conn = pyodbc.connect(
    Trusted_Connection = 'Yes',
    Driver = driver,
    Server = '***********',
    Database = 'Sullins_Data'
)

def createdata():
    cursor = conn.cursor()
    cursor.execute(
        'insert into Sullins_Datasheet(Part_Number,Web_Link) values(?,?);',
        (a, j))
    conn.commit()

a = pd.read_csv('check9.txt', header=None, names=['Part_Number','Web_Links'])  # 2 columns, 8 rows
b = pd.DataFrame(a)
p_no = (b['Part_Number'])
w_link = (b['Web_Links'])
# print(p_no)
for i in p_no:
    a = i
    for l in w_link:
        j = l
createdata()
As you can see from the code, I have created two variables, a and j, to hold the values of the two columns of the text file one by one and write them to the SQL table.
But after running the code I got only the last row's value in the table out of 8 rows.
When I used the createdata function inside the w_link for loop, it wrote duplicate values to the table.
Please suggest where I am going wrong.
Here is a sample of how your code is working:
a = 0
b = 0
ptr = ['s', 'd', 'f', 'e']
pt = ['a', 'b', 'c', 'd']
for i in ptr:
    a = i
    print(a, end='')
    for j in pt:
        b = j
        print(b, end='')
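If the aim is to insert each Part_Number together with its own Web_Link, a minimal sketch of one possible fix (reusing conn and check9.txt from the question) is to walk the two columns in step with zip instead of nesting the loops:
import pandas as pd

# Read both columns once, as in the question.
b = pd.read_csv('check9.txt', header=None, names=['Part_Number', 'Web_Links'])

# Iterate the two columns together so each part number stays paired with its own link,
# then insert one row per pair.
for part_number, web_link in zip(b['Part_Number'], b['Web_Links']):
    cursor = conn.cursor()
    cursor.execute(
        'insert into Sullins_Datasheet(Part_Number,Web_Link) values(?,?);',
        (part_number, web_link))
    conn.commit()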
I'm new to Python so I'm reaching out for help. I have a CSV file in an S3 bucket, and I would like to use Python pyodbc to import this CSV file into a table in SQL Server. The file is 50 MB (400k records). My code is below. As the code shows, my CSV data is in a DataFrame; how can I use BULK INSERT to insert the DataFrame data into the SQL Server table? If my approach does not work, please advise me of a different approach.
import boto3
import pandas as pd
import pyodbc

# Connection to S3
s3 = boto3.client(
    service_name = 's3',
    region_name = 'us-gov-west-1',
    aws_access_key_id = 'ZZZZZZZZZZZZZZZZZZ',
    aws_secret_access_key = 'AAAAAAAAAAAAAAAAA')

# Connection to SQL Server
server = 'myserver.amazonaws.com'
path = 'folder1/folder2/folder3/myCSVFile.csv'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE=DB-staging;UID=User132;PWD=XXXXXX')
cursor = cnxn.cursor()

obj_sum = s3.get_object(Bucket = 'my_bucket', Key = path)
csv_data = pd.read_csv(obj_sum['Body'])
df = pd.DataFrame(csv_data, columns = ['SYSTEM_NAME', 'BUCKET_NAME', 'LOCATION', 'FILE_NAME', 'LAST_MOD_DATE', 'FILE_SIZE'])
#print(df.head(n=15).to_string(index=False))

# Insert DataFrame to table
cursor.execute("""truncate table dbo.table1""")
cursor.execute("""BULK INSERT dbo.table1 FROM """ + .....# what do I put here since data is in dataframe??)
I tried to loop through the DataFrame and it took 20 minutes to insert 5k records; that code is below. Looping through each record is an option, but a poor one, which is why I'm moving towards a bulk insert if possible.
for i in df.itertuples(index = False):
    if i.FILE_SIZE != 0:
        cursor.execute("""insert into dbo.table1 (SYSTEM_NAME, BUCKET_NAME, X_LOCATION, FILE_NAME, LAST_MOD_DATE, FILE_SIZE)
                          values (?,?,?,?,?,?)""", i.SYSTEM_NAME, i.BUCKET_NAME, i.LOCATION, i.FILE_NAME, i.LAST_MOD_DATE, i.FILE_SIZE)
Lastly, a bonus question: I would like to check if the FILE_SIZE column in my DataFrame equals 0, and if it does, skip over that record and move on to the next one.
Thank you in advance.
Thanks for the help.
Using fast_executemany = True did the job for me.
import sqlalchemy as sal

engine = sal.create_engine("mssql+pyodbc://username:password@" + server + ":1433/db-name?driver=ODBC+Driver+17+for+SQL+Server?Trusted_Connection=yes",
                           fast_executemany = True)
conn = engine.connect()
I had to change my code around to use sqlalchemy, but it's working great now.
The call to upload the data to SQL Server is below:
df.to_sql('table1', con = engine, index = False, if_exists = 'replace')
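As a side note on the bonus question about skipping zero-size files, one option (an assumption on my part, not from the original answer) is to filter the DataFrame before calling to_sql:
# Keep only rows with a non-zero FILE_SIZE; 'table1' mirrors dbo.table1 from the question.
df = df[df['FILE_SIZE'] != 0]
df.to_sql('table1', con = engine, index = False, if_exists = 'replace')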
I need to download a large table from an Oracle database to a Python server, using cx_Oracle to do so. However, RAM is limited on the Python server, so I need to do it in batches.
I already know how to fetch a whole table in general:
import cx_Oracle
import pandas as pd

usr = ''
pwd = ''
tns = '(Description = ...'
orcl = cx_Oracle.connect(usr, pwd, tns)
curs = orcl.cursor()
printHeader = True
tabletoget = 'BIGTABLE'
sql = "SELECT * FROM " + "SCHEMA." + tabletoget
curs.execute(sql)
data = pd.read_sql(sql, orcl)
data.to_csv(tabletoget + '.csv')
I'm not sure what to do, though, to load, say, a batch of 10,000 rows at a time, save each batch off to a CSV, and then rejoin them.
You can use cx_Oracle directly to perform this sort of batch fetch:
curs.arraysize = 10000
curs.execute(sql)
while True:
    rows = curs.fetchmany()
    if rows:
        write_to_csv(rows)
    if len(rows) < curs.arraysize:
        break
If you are using Oracle Database 12c or higher, you can also use the OFFSET and FETCH NEXT ROWS options, like this:
offset = 0
numRowsInBatch = 10000
while True:
    curs.execute("select * from tabletoget offset :offset fetch next :nrows only",
                 offset=offset, nrows=numRowsInBatch)
    rows = curs.fetchall()
    if rows:
        write_to_csv(rows)
    if len(rows) < numRowsInBatch:
        break
    offset += len(rows)
This option isn't as efficient as the first one and involves giving the database more work to do but it may be better for you depending on your circumstances.
None of these examples use pandas directly. I am not particularly familiar with that package, but if you (or someone else) can adapt this appropriately, hopefully this will help!
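Since write_to_csv is left as a placeholder above, here is one possible sketch of it, reusing curs and tabletoget from the question and writing each batch to its own numbered file (the naming scheme is an assumption):
import csv

file_num = 0

def write_to_csv(rows):
    """Write one fetched batch to its own numbered CSV file."""
    global file_num
    header = [col[0] for col in curs.description]  # column names from the open cursor
    with open("{}_{}.csv".format(tabletoget, file_num), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    file_num += 1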
You can achieve your result like this. Here I am loading the data into a DataFrame:
import cx_Oracle
import time
import pandas

user = "test"
pw = "test"
dsn = "localhost:port/TEST"
con = cx_Oracle.connect(user, pw, dsn)

start = time.time()
cur = con.cursor()
cur.arraysize = 10000
try:
    cur.execute("select * from test_table")
    names = [x[0] for x in cur.description]
    rows = cur.fetchall()
    df = pandas.DataFrame(rows, columns=names)
    print(df.shape)
    print(df.head())
finally:
    if cur is not None:
        cur.close()

elapsed = (time.time() - start)
print(elapsed, "seconds")
I have an XML file and I want to import the XML data into a SQL Server table using Python. I know that if we want to run a Python script we can use the
sp_execute_external_script stored procedure. I have also developed a stored procedure which converts the XML file to a CSV file and then loads it into SQL Server using BULK INSERT. But is it possible to load it directly, without converting to a CSV file?
My XML-to-CSV and CSV-loading code is below:
CREATE PROCEDURE dbo.XMLParser
(
    @XMLFilePath VARCHAR(MAX),
    @CSVFilePath VARCHAR(MAX)
)
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @a VARCHAR(MAX) = @XMLFilePath,
            @b VARCHAR(MAX) = @CSVFilePath;
    EXECUTE sp_execute_external_script @language = N'Python',
        @script = N'import xml.etree.ElementTree as ET
import csv
tree = ET.parse(a)
root = tree.getroot()
employee_data = open(b, "w")
csvwriter = csv.writer(employee_data)
employees_head = []
count = 0
for member in root.findall("Employee"):
    employee = []
    address_list = []
    if count == 0:
        name = member.find("Name").tag
        employees_head.append(name)
        PhoneNumber = member.find("PhoneNumber").tag
        employees_head.append(PhoneNumber)
        EmailAddress = member.find("EmailAddress").tag
        employees_head.append(EmailAddress)
        Address = member[3].tag
        employees_head.append(Address)
        csvwriter.writerow(employees_head)
        count = count + 1
    name = member.find("Name").text
    employee.append(name)
    PhoneNumber = member.find("PhoneNumber").text
    employee.append(PhoneNumber)
    EmailAddress = member.find("EmailAddress").text
    employee.append(EmailAddress)
    Address = member[3][0].text
    address_list.append(Address)
    City = member[3][1].text
    address_list.append(City)
    StateCode = member[3][2].text
    address_list.append(StateCode)
    PostalCode = member[3][3].text
    address_list.append(PostalCode)
    employee.append(address_list)
    csvwriter.writerow(employee)
employee_data.close()',
        @params = N'@a varchar(max),@b varchar(max)',
        @a = @a,
        @b = @b;
    BULK INSERT dbo.Employee
    FROM 'E:\EmployeeData.csv'
    WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n'
    );
END;
You could convert your XML to a Python list of dicts, then loop and insert row by row into your database. You could also sleep for one second after inserting each row.
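A rough sketch of that idea, assuming the same Employee/Name/PhoneNumber/EmailAddress layout as the stored procedure above; the connection string, XML path, and target column names are illustrative only:
import time
import xml.etree.ElementTree as ET

import pyodbc

# Illustrative connection; adjust server, database and authentication to your environment.
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes')
cursor = conn.cursor()

# Parse the XML into a list of dicts (path assumed).
tree = ET.parse('E:\\EmployeeData.xml')
rows = []
for member in tree.getroot().findall('Employee'):
    rows.append({
        'Name': member.find('Name').text,
        'PhoneNumber': member.find('PhoneNumber').text,
        'EmailAddress': member.find('EmailAddress').text,
    })

# Insert row by row, with the optional one-second pause mentioned above.
for row in rows:
    cursor.execute(
        'insert into dbo.Employee (Name, PhoneNumber, EmailAddress) values (?, ?, ?)',
        row['Name'], row['PhoneNumber'], row['EmailAddress'])
    conn.commit()
    time.sleep(1)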
I have a large JSON file (400k lines). I am trying to isolate the following:
Policies- "description"
policy items - "users" and "database values"
JSON FILE - https://pastebin.com/hv8mLfgx
Expected Output from Pandas: https://imgur.com/a/FVcNGsZ
Everything after "Policy Items" is repeated in exactly the same way throughout the file. I have tried the code below to isolate "users", but it doesn't seem to work; I'm trying to dump all of this into a CSV.
Edit: here is a solution I was attempting, but I could not get it to work correctly - Deeply nested JSON response to pandas dataframe
from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)

for item in jsonDF['policies'][0]['policyItems'][0]:
    print('{} - {} - {}'.format(jsonDF['users']))
EDIT 2: I have some working code which is able to grab some of the USERS, but it does not grab all of them. Only 11 out of 25.
from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)

pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
print(pNode.head(500))
EDIT 3: This is the final working copy, however I am still not copying over all my TABLE data. I set the loop to simply capture everything, and I'd sort it in Excel. Does anyone have any ideas why I cannot capture all the TABLE values?
import csv
import json

with open("Ranger_Policies_20190204_195010.json") as file:
    json_data = json.load(file)

with open("test.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    wr.writerow(('Database name', 'Users', 'Description', 'Table'))
    for policy in json_data['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        db_tables = policy['resources']['table']['values']
        for item in policy['policyItems']:
            users = item['users']
            for dbT in db_tables:
                for user in users:
                    for db in db_values:
                        _ = wr.writerow((db, user, desc, dbT))
Pandas is overkill here: the csv standard module is enough. You just have to iterate over the policies to extract the description and database values, then over the policyItems to extract the users:
with open("Ranger_Policies_20190204_195010.json") as file:
jsonDF = json.load(file)
with open("outputfile.csv", newline='') as fd:
wr = csv.writer(fd)
_ = wr.writerow(('Database name', 'Users', 'Description'))
for policy in js['policies']:
desc = policy['description']
db_values = policy['resources']['database']['values']
for item in policy['policyItems']:
users = item['users']
for user in users:
for db in db_values:
if db != '*':
_ = wr.writerow((db, user, desc))
Here is one way to do it; let's assume your JSON data is in a variable called json_data.
from itertools import product

import pandas as pd

def make_dfs(data):
    cols = ['db_name', 'user', 'description']
    for item in data.get('policies'):
        description = item.get('description')
        users = item.get('policyItems', [{}])[0].get('users', [None])
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        db_name = [name for name in db_name if name != '*']
        prods = product(db_name, users, [description])
        yield pd.DataFrame.from_records(prods, columns=cols)

df = pd.concat(make_dfs(json_data), ignore_index=True)
print(df)
db_name user description
0 m2_db hive Policy for all - database, table, column
1 m2_db rangerlookup Policy for all - database, table, column
2 m2_db ambari-qa Policy for all - database, table, column
3 m2_db af34 Policy for all - database, table, column
4 m2_db g748 Policy for all - database, table, column
5 m2_db hdfs Policy for all - database, table, column
6 m2_db dh10 Policy for all - database, table, column
7 m2_db gs22 Policy for all - database, table, column
8 m2_db dh27 Policy for all - database, table, column
9 m2_db ct52 Policy for all - database, table, column
10 m2_db livy_pyspark Policy for all - database, table, column
Tested on Python 3.5.1 and pandas==0.23.4