I have data stored in CSV files in multiple folders that I want to load into multiple SQL tables using MySQL on an Ubuntu system. Each table and file follows this schema (the files don't have the id field):
+ ------ + -------- + -------- + --------- + ---------- +
| SPO_Id | SPO_Name | SPO_Date | SPO_Price | SPO_Amount |
+ ------ + -------- + -------- + --------- + ---------- +
Each file contains pricing and sales data for a single day. Unfortunately, the files are not named after their date; they are stored in folders that are named after the date. Here's an example diagram of what the directory structure looks like:
------> 20170102 ------> prices.csv
/
/
Exmpl ------> 20170213 ------> prices.csv
\
\
------> 20170308 ------> prices.csv
Here is a query I've written that pulls data from a file and stores it into a table:
use pricing ; # the database I want the tables in
drop table if exists SP_2017_01_02 ;
create table SP_2017_01_02 (
SPO_Id int not null primary key auto_increment,
SPO_Name varchar(32),
SPO_Date date,
SPO_Price float,
SPO_Amount int
);
load data local infile '/Exmpl/20170102/prices.csv'
into table SP_2017_01_02
fields terminated by ','
lines terminated by '\n'
ignore 1 lines # First line contains field name information
(SPO_Name, SPO_Date, SPO_Price, SPO_Amount) ;
select * from SP_2017_01_02 ;
show tables ;
This query works fine for loading one table at a time; however, because I have hundreds of tables, I need to automate this process. I've looked around on SO, and here are a few things I've found:
Here is a question similar to mine, only this question references SQL Server. The answer gives a suggestion of what to do without any real substance.
This question is also very similar to mine, only it is specifically about SSIS, to which I don't have access (and the question is left unanswered).
This post suggests using a control file reference, but that approach is for SQL*Loader and Oracle.
Using Python may be the way to go, but I've never used it before, and this seems like too complicated a problem to start with.
This one and this one also use python, but they're just updating one table with data from one file.
I've worked a lot in SQL Server, but I'm fairly new to MySQL. Any help is greatly appreciated!
Update
I have attempted to do this using dynamic SQL in MySQL. Unfortunately, MySQL requires dynamic SQL to be run inside a stored procedure, and stored procedures don't allow LOAD DATA. As @RandomSeed pointed out, this cannot be done with MySQL alone. I'm going to take his advice and attempt to write a shell/Python script to handle this.
I'll leave this question open until I (or someone else) can come up with a solid answer.
Once you have a SQL query/script that loads a single table, which it looks like you do (or you can build an equivalent one in Python fairly simply), using Python to loop through the directory structure and collect the filenames is straightforward. If you can pass a new csv path to infile '/Exmpl/20170102/prices.csv' each time and call your SQL script from within Python, you should be good.
I don't have much time right now, but I wanted to show you how you could get those filename strings using python.
import os

prices_csvs = []
for root, dirs, files in os.walk(os.path.join('insert_path_here', 'Exmpl')):
    for f in files:
        if f == 'prices.csv':
            prices_csvs.append(os.path.join(root, f))
            break  # optional, use if there is only one prices.csv in each subfolder

for csv_file in prices_csvs:
    # csv_file is a string of the path for each prices.csv
    # if you can insert it as the `infile` parameter and run the sql, you are done
    # admittedly, i don't know how to do this at the moment
    pass
os.walk goes down through each subdirectory, giving the name root to the path of that folder and listing its subdirectories as dirs and its files as files. From there it's a simple check to see whether the filename matches what you're looking for, and if it does, its path is stored in a list. Looping over the list yields strings containing the path to each prices.csv under Exmpl.
Hope that sheds a little light on how Python could help.
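If you prefer, glob can collect the same paths in one line. This is just a sketch assuming the same Exmpl/<date>/prices.csv layout as above:

import glob
import os

# one prices.csv per date folder, matched by the wildcard
prices_csvs = glob.glob(os.path.join('insert_path_here', 'Exmpl', '*', 'prices.csv'))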
I've marked Charlie's answer as the correct answer because, although he does not fully answer the question, he gave me a great start. Below is the code for anyone who might want to see how to load csv files into MySQL. The basic idea is to dynamically construct a string in Python and then execute that string in MySQL.
#!/usr/bin/python
import os
import MySQLdb  # Use this module in order to interact with MySQL

# Find all the file names located in this directory
prices_csvs = []
for root, dirs, files in os.walk(os.path.join('insert_path_here', 'Exmpl')):
    for f in files:
        if f == 'prices.csv':
            prices_csvs.append(os.path.join(root, f))
            break

# Connect to the MySQL database
db = MySQLdb.connect(host="<Enter Host Here>", user="<Enter User Here>", passwd="<Enter Password Here>", db="<Enter Database Name Here>")

# Must create a cursor object
cur = db.cursor()

for csv_file in prices_csvs:
    directory = "'" + csv_file + "'"
    table = csv_file[56:64]  # This extracts the date folder name (the table suffix) from the path
    sql_string1 = "drop table if exists SD" + table + " ;\n"
    sql_string2 = "create table SD" + table + " ( \n\
        <Enter your fields here> \n\
        ); \n"
    sql_string3 = "load data local infile " + directory + " \n\
        into table TempPrices \n\
        fields terminated by ',' \n\
        lines terminated by " + repr('\n') + " \n\
        ignore 1 lines ;\n"
    # sql_string4 and sql_string5 are additional statements whose definitions are not shown here

    # Print out the strings for debugging
    print sql_string1
    print sql_string2
    print sql_string3
    print sql_string4
    print sql_string5

    # Execute your SQL statements
    cur.execute(sql_string1)
    cur.execute(sql_string2)
    cur.execute(sql_string3)
    cur.execute(sql_string4)
    cur.execute(sql_string5)
    db.commit()

db.close()
While debugging, I found it very helpful to copy the printed SQL statement and paste it into MySQL to confirm that the strings were being constructed successfully.
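One fragile spot is the hardcoded slice csv_file[56:64]: if the base path ever changes length, the table names break. A small sketch of a path-based alternative (not what I originally used; the helper name is just illustrative):

import os

def table_suffix(csv_path):
    # '/Exmpl/20170102/prices.csv' -> '20170102'
    return os.path.basename(os.path.dirname(csv_path))

print table_suffix('/Exmpl/20170102/prices.csv')  # 20170102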
Related
I am trying to copy the contents of a CSV file into a table in a Postgres database. The error I am getting is psycopg2.errors.SyntaxError: syntax error at or near "C". I have looked at other people who have had the same error, but the problem is that I have already tried what everyone suggests. I am printing out my path to make sure it is correct, and when I paste that path into the file manager it opens the CSV, so I am confused as to the issue.
path1 = r'C:\Users\Hank\Documents'
tb = 'test29'
path = os.path.join(path1, "Testing.csv")
print(f"This is the output path for the csv {path}")

def csv_to_postgres():
    sql = f"""COPY {tb} FROM {path} DELIMITER ',' CSV HEADER;"""
    cur.execute(sql)
    conn.commit()
    print(f"Printing to {path} was successful.")

csv_to_postgres()
I have also tried building the string with "sql query {}".format(path) and with "sql query %s" % path, and none of the three are working. The path that is printed out opens the CSV when put into the Windows 10 search function.
Thanks to a comment, I was also shown that I could use os.path.exists(path) to check whether the path exists; it returns True, so I am lost.
To further add to this, I was able to make the command work in pgAdmin by changing the file permissions for Everyone, but the same exact code still isn't running in Python.
The code below is the solution to my problem. Basically, the issue was that I first had to open the CSV with Python and then use copy_expert to copy it in. tb is the table name, and I believe you need STDIN in the SQL query.
sql = f"""COPY {tb} FROM STDIN DELIMITER ',' CSV HEADER;"""
with open(path) as f:
cur.copy_expert(sql,f)
conn.commit()
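For completeness, here is roughly what the whole thing looks like put together; the connection parameters below are only placeholders:

import os
import psycopg2

# placeholder connection settings
conn = psycopg2.connect(host="localhost", dbname="mydb", user="hank", password="secret")
cur = conn.cursor()

tb = 'test29'
path = os.path.join(r'C:\Users\Hank\Documents', "Testing.csv")

# copy_expert streams the locally opened file to the server via STDIN
sql = f"""COPY {tb} FROM STDIN DELIMITER ',' CSV HEADER;"""
with open(path) as f:
    cur.copy_expert(sql, f)
conn.commit()
conn.close()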
I have a table that has 10 million plus records (rows) in it. I am trying to do a one-time load into S3 by selecting * from the table and then writing it to a gzip file in my local file system. Currently, I can run my script to collect 800,000 records into the gzip file, but then I receive an error, and the remaining records are obviously not written.
There is no simple way to continue in SQL; for example, if you run ten queries with LIMIT 800,000, the rows won't come back in a consistent order.
So, is there a way to write a Python/Airflow function that can load the 10 million+ row table in batches? Perhaps there's a way in Python where I can do a SELECT * statement and continue it after x number of records, writing into separate gzip files?
Here is my Python/Airflow script so far; when run, it only writes 800,000 records to the path variable:
def gzip_postgres_table(table_name, **kwargs):
    path = '/usr/local/airflow/{}.gz'.format(table_name)
    server_postgres = create_tunnel_postgres()
    server_postgres.start()
    etl_conn = conn_postgres_internal(server_postgres)
    record = get_etl_record(kwargs['master_table'],
                            kwargs['table_name'])
    cur = etl_conn.cursor()
    unload_sql = '''SELECT *
                    FROM schema1.database1.{0} '''.format(record['table_name'])
    cur.execute(unload_sql)
    result = cur.fetchall()
    column_names = [i[0] for i in cur.description]
    fp = gzip.open(path, 'wt')
    myFile = csv.writer(fp, delimiter=',')
    myFile.writerow(column_names)
    myFile.writerows(result)
    fp.close()
    etl_conn.close()
    server_postgres.stop()
The best, I mean THE BEST, approach to insert that many records into PostgreSQL, or to get them out of PostgreSQL, is to use PostgreSQL's COPY. This means you would have to change your approach drastically, but there's no better way that I know of in PostgreSQL. COPY manual
COPY either writes the result of the query you are executing to a file, or loads a table from a file.
COPY moves data between PostgreSQL tables and standard file-system
files.
The reason this is the best solution is that you're using PostgreSQL's native method for handling external data, with no intermediaries, so it's fast and secure.
COPY works like a charm with CSV files. You should change your approach to file handling and use COPY.
Since COPY runs with SQL, you can divide your data using LIMIT and OFFSET in the query. For example:
COPY (SELECT * FROM country LIMIT 10 OFFSET 10) TO '/usr1/proj/bray/sql/a_list_of_10_countries.copy';
-- This exports 10 rows from country, skipping the first 10
COPY only works with files that are accessible to the PostgreSQL user on the server.
PL Function (edited):
If you want COPY to be dynamic, you can wrap the COPY in a PL/pgSQL function. For example:
CREATE OR REPLACE FUNCTION copy_table(
    table_name text,
    file_name text,
    vlimit text,
    voffset text
) RETURNS VOID AS $$
DECLARE
    query text;
BEGIN
    query := 'COPY (SELECT * FROM '||table_name||' LIMIT '||vlimit||' OFFSET '||voffset||') TO '''||file_name||''' DELIMITER '','' CSV';
    -- NOTE that file_name has to include its directory too.
    EXECUTE query;
END;
$$ LANGUAGE plpgsql
SECURITY DEFINER;
To execute the function you just have to do:
SELECT copy_table('test','/usr/sql/test.csv','10','10')
Notes:
If the PL will be public, you have to check for SQL injection attacks.
You can program the PL to suit your needs, this is just an example.
The function returns VOID, so it just does the COPY; if you need some feedback, you should return something else.
The function has to be owned by the postgres user on the server, because it needs file access; that is why it uses SECURITY DEFINER, so that any database user can run it.
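If you'd rather drive the batching from Python/Airflow instead of a PL function, here is a rough sketch using psycopg2's copy_expert with COPY ... TO STDOUT, which streams each batch straight into a gzip file on the client. The connection settings, table name, key column, and batch size are only examples, and you should ORDER BY a key column so the LIMIT/OFFSET batches don't overlap or skip rows:

import gzip
import psycopg2

# placeholder connection settings and table name
conn = psycopg2.connect(host="localhost", dbname="mydb", user="etl", password="secret")
cur = conn.cursor()
table = "schema1.my_table"
batch_size = 1000000

# work out how many batches are needed
cur.execute("SELECT count(*) FROM " + table)
total_rows = cur.fetchone()[0]
num_batches = (total_rows + batch_size - 1) // batch_size  # ceiling division

for batch in range(num_batches):
    sql = ("COPY (SELECT * FROM {t} ORDER BY id LIMIT {l} OFFSET {o}) "
           "TO STDOUT WITH CSV HEADER").format(t=table, l=batch_size, o=batch * batch_size)
    out_path = "/usr/local/airflow/my_table_{}.gz".format(batch)
    with gzip.open(out_path, "wt") as f:
        cur.copy_expert(sql, f)  # streams the batch straight into the gzip file

conn.close()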
Does anyone know a simple way to use python to load several CSV files into one given access table?
For example, my directory could have 100 files named import_*.csv (import_1.csv, import_2.csv, etc)
There is one destination table in MS Access that should receive all of these csv's.
I know I could use pyodbc and build up statements line by line to do this, but that's a lot of coding, and you then have to keep your SQL up to date as fields get added or removed. MS Access has its own bulk load functionality, and I'm hoping either that this is accessible via Python or that Python has a library that will do the same.
It would be fantastic if there were a library out there that could do it as easily as:
dbobj.connectOdbc( dsn )
dbobj.bulkLoad( "MyTable" , "c:/temp/test.csv" )
Internally it takes some work to figure out the schema and to make it work. But hopefully someone out there has already done the heavy lifting?
Is there a way to do a bulk import? Reading into pandas is trivial enough - but then you have to get it into MS Access from there.
This is an old post, but I'll take a crack at it. So, you have 100+ CSV files in a directory and you want to push everything into MS Access. OK, I would combine all the CSV files into one single DataFrame in Python, save that out as a CSV, and import it into MS Access.
#1 Use Python to merge all CSV files into one single dataframe:
# Something like...
import pandas as pd
import csv
import glob
import os

#os.chdir("C:\\your_path_here\\")
results = pd.DataFrame([])
filelist = glob.glob("C:\\your_path_here\\*.csv")
#dfList=[]
for filename in filelist:
    print(filename)
    namedf = pd.read_csv(filename, skiprows=0, index_col=0)
    results = results.append(namedf)

results.to_csv('C:\\your_path_here\\Combinefile.csv')
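If you then want to push Combinefile.csv into Access from Python rather than importing it by hand, a minimal sketch with pyodbc might look like this. The database path, table name, and column names are only placeholders, and note this does plain row INSERTs rather than Access's native bulk load:

import csv
import pyodbc

# placeholder database path, table name, and column names
conn = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\your_path_here\YourDatabase.accdb;"
)
cur = conn.cursor()

with open(r"C:\your_path_here\Combinefile.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = [row for row in reader]

# the column list must match the CSV's columns
cur.executemany("INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()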
Alternatively, and this is how I would do it... use VBA in Access to consolidate all the CSV files into one single table (no need, whatsoever, for Python).
Private Sub Command1_Click()
    Dim strPathFile As String, strFile As String, strPath As String
    Dim strTable As String
    Dim blnHasFieldNames As Boolean

    ' Change this next line to True if the first row in the CSV files
    ' has field names
    blnHasFieldNames = False

    ' Replace C:\Documents\ with the real path to the folder that
    ' contains the CSV files
    strPath = "C:\your_path_here\"

    ' Replace tablename with the real name of the table into which
    ' the data are to be imported
    strTable = "tablename"

    strFile = Dir(strPath & "*.csv")
    Do While Len(strFile) > 0
        strPathFile = strPath & strFile
        DoCmd.TransferText acImportDelim, , strTable, strPathFile, blnHasFieldNames
        ' Uncomment the next code step if you want to delete the
        ' CSV file after it's been imported
        ' Kill strPathFile
        strFile = Dir()
    Loop
End Sub
I am writing data to a file in Python from MySQL database tables, with hardcoded headers and footer, using the following code:
for record in cur.fetchall():
    filteredrecord = (record[0] + "\t" + record[1])
    print(filteredrecord)
    feed_file = open("c:\\test\\test.txt", "w")
    feed_file.write("Name" + "\t" + "Age")
    feed_file.write("\n" + (filteredrecord))
    feed_file.write("\n" + "ENDOFFILE")
    feed_file.close()
This works fine when there are records present in the database table. However, when there are no records in the table I select from, nothing gets written to my file, not even my hardcoded headers and footer.
I get the following output when a record is present:
[screenshot: output when records are present in the db table]
I would like to get the following written to my file when there are no records present:
[screenshot: output needed when no records are present in the db table]
How can I get the above to write to file when there are no records within my database table?
You have your entire code for opening the file, writing the header/footer and closing the file again all within the for loop that iterates over the records returned from the query. In fact, if you have more than one record, it keeps reopening the file, overwriting the contents with each new record (plus header/footer), and closing the file again.
What you want is to open the file once, write the header, then loop over the records and write each, then finally write the footer and close the file. The code might look something like this:
with open("c:\\test\\test.txt", "w") as feed_file:
feed_file.write("Name" + "\t" + "Age" )
for record in cur.fetchall():
filteredrecord = (record[0] + "\t" + record[1])
print(filteredrecord)
feed_file.write("\n" + (filteredrecord))
feed_file.write("\n" + "ENDOFFILE")
Note that you don't need to close the file explicitly when using the with structure.
I have this script
SELECT = """
select
coalesce (p.ID,'') as id,
coalesce (p.name,'') as name
from TABLE as p
"""
self.cur.execute(SELECT)
for row in self.cur.itermap():
    id = '%(id)s' % row
    name = '%(name)s' % row

    xml += "  <item>\n"
    xml += "    <id>" + id + "</id>\n"
    xml += "    <name>" + name + "</name>\n"
    xml += "  </item>\n\n"

# save xml to file here
f = open...
and I need to save data from a huge database to a file. There are tens of thousands of items (up to 40,000) in my database, and when the script runs it takes a very long time (an hour or more) to finish.
How can I take the data I need from the database and save it to a file "at once", as quickly as possible? I don't need XML output, because I can process the data on my server later; I just need to do it as quickly as possible. Any idea?
Many thanks!
P.S.
I found out something interesting: when I use this code to "erase" the xml variable every 2000 records and save it into another variable, it works pretty fast! So there must be something "wrong" with the way my former code keeps filling the xml variable.
result = float(id)/2000
if result == int(result):
    xml_whole += xml
    xml = ""
Wow, after testing with this code:
result = float(id)/2000
if result == int(result):
    xml_whole += xml
    xml = ""
my script is up to 50x faster!
I would like to know why Python is so slow with xml += ...?
You're doing a lot of unnecessary work (and note that if you erase the xml variable, you're no longer writing the same data as before...).
Why don't you just write the XML as you go? You could also avoid the two COALESCEs and do that check in Python instead (if ID is null, make id '', etc.).
SELECT = """
select
coalesce (p.ID,'') as id,
coalesce (p.name,'') as name,
from TABLE as p
"""
self.cur.execute(SELECT)
# Open XML file
f = open("file.xml", ...)
f.write("<?xml version... (what encoding?)
for row in self.cur.itermap():
f.write("<item>\n <id>%(id)s</id>\n <name>%(name)s</name>\n</item>\n"
# Other f.writes() if necessary
f.close()
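As for why xml += ... is so slow: each += builds a brand-new string and copies everything accumulated so far, so the total work grows roughly quadratically with the number of items. Writing straight to the file (as above), or collecting the pieces in a list and joining once at the end, keeps it linear. A tiny sketch of the list approach (the sample rows below are made up, standing in for self.cur.itermap()):

# made-up sample rows standing in for self.cur.itermap()
rows = [{"id": 1, "name": "first"}, {"id": 2, "name": "second"}]

parts = []
for row in rows:
    parts.append("  <item>\n    <id>%(id)s</id>\n    <name>%(name)s</name>\n  </item>\n\n" % row)

xml = "".join(parts)  # one final join instead of thousands of string copies
print(xml)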