I have a series of .csv files with some data, and I want a Python script to open them all, do some preprocessing, and upload the processed data to my postgres database.
I have it mostly complete, but my upload step isn't working. I'm sure it's something simple that I'm missing, but I just can't find it. I'd appreciate any help you can provide.
Here's the code:
import psycopg2
import sys
from os import listdir
from os.path import isfile, join
import csv
import re
import io

try:
    con = db_connect("dbname = '[redacted]' user = '[redacted]' password = '[redacted]' host = '[redacted]'")
except:
    print("Can't connect to database.")
    sys.exit(1)
cur = con.cursor()
upload_file = io.StringIO()
file_list = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in file_list:
    id_match = re.search(r'.*-(\d+)\.csv', file)
    if id_match:
        id = id_match.group(1)
        file_name = format(id_match.group())
        with open(mypath+file_name) as fh:
            id_reader = csv.reader(fh)
            next(id_reader, None) # Skip the header row
            for row in id_reader:
                [stuff goes here to get desired values from file]
                if upload_file.getvalue() != '': upload_file.write('\n')
                upload_file.write('{0}\t{1}\t{2}'.format(id, [val1], [val2]))
print(upload_file.getvalue()) # prints output that looks like I expect it to
                              # with thousands of rows that seem to have the right values in the right fields
cur.copy_from(upload_file, '[my_table]', sep='\t', columns=('id', 'col_1', 'col_2'))
con.commit()
if con:
    con.close()
This runs without error, but a select query in psql still shows no records in the table. What am I missing?
Edit: I ended up giving up and writing it to a temporary file, and then uploading the file. This worked without any trouble...I'd obviously rather not have the temporary file though, so I'm happy to have suggestions if someone sees the problem.
When you write to an io.StringIO (or any other file) object, the file pointer remains at the position of the last character written. So, when you do
f = io.StringIO()
f.write('1\t2\t3\n')
s = f.readline()
the file pointer stays at the end of the file and s contains an empty string.
To read the contents (as opposed to calling getvalue), you must first reposition the file pointer to the beginning, e.g. with seek(0):
upload_file.seek(0)
cur.copy_from(upload_file, '[my_table]', columns = ('id', 'col_1', 'col_2'))
This allows copy_from to read from the beginning and import all the lines in your upload_file.
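The same effect can be seen without a database. The following is a minimal sketch using only io.StringIO; the variable names are just for illustration:

```python
import io

buffer = io.StringIO()
buffer.write('1\t2\t3\n')

# The pointer sits at the end after writing, so a read returns nothing.
before_seek = buffer.readline()

# Rewind to the start, after which the written line can be read back.
buffer.seek(0)
after_seek = buffer.readline()

print(repr(before_seek), repr(after_seek))
```

Note that getvalue() returns the whole buffer regardless of the pointer position, which is why the print in the question looked correct even though copy_from had nothing left to read.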
Keep in mind that this approach holds the contents of all the files in memory, which may be fine for a single small import but can become a problem for large imports or for several imports running in parallel.
I want to create a general program for monitoring purposes, to see which input data is being used by the various models in our company.
Therefore, I want to loop through our (production) model folder, find all the .py or .ipynb files, open them, and read each one as a string using glob (and os). For now, I made a loop that looks for all scripts containing 'csv' (as a start):
import glob
import os

path = directory
search_word = 'csv'
# list to store files that contain the matching word
final_files = []
for folder_path, folders, files in os.walk(path):
    # IPYNB files
    path = folder_path+'\\*.IPYNB'
    for filepath in glob.glob(path, recursive=True):
        try:
            with open(filepath) as fp:
                # read the file as a string
                data = fp.read()
                if search_word in data:
                    final_files.append(filepath)
        except:
            print('Exception while reading file')
print(final_files)
This gives back all IPYNB files containing the word csv in the script, so I'm able to read within the files.
What I want is that, in the part where I'm now searching for 'csv', the program reads the file (as it does right now) and determines which input data (and, eventually, output) is being used.
For example, one file (.IPYNB) contains this script part (input used for a model):
#Dataset 1
df1 = pd.read_csv('Data.csv', sep=';')
#dataset 2
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes')
query = "SELECT * FROM database.schema.data2"
df2 = pd.read_sql_query(query, sql_conn)
#dataset 3
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=X;DATABASE=X;Trusted_Connection=yes')
query = "SELECT element1, element2 FROM database.schema.data3"
df3 = pd.read_sql_query(query, sql_conn)
How can I make the program extract the following facts:
Data.csv
database.schema.data2
database.schema.data3
Does anyone have a good idea?
Thanks in advance!
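One possible starting point is a pair of regular expressions, shown here as a rough sketch rather than a complete solution. It assumes that CSV inputs are passed as a quoted first argument to read_csv and that SQL strings write FROM in upper case (which also keeps Python's own lower-case "from ... import" lines from matching). The names READ_CSV_PATTERN, FROM_TABLE_PATTERN and find_inputs are made up for this example.

```python
import re

# Hypothetical patterns: the quoted first argument of read_csv(...),
# and the identifier that follows an upper-case FROM in a SQL string.
READ_CSV_PATTERN = re.compile(r"read_csv\(\s*['\"]([^'\"]+)['\"]")
FROM_TABLE_PATTERN = re.compile(r"\bFROM\s+([\w.\[\]]+)")


def find_inputs(source):
    """Return CSV file names followed by SQL table names found in source."""
    csv_files = READ_CSV_PATTERN.findall(source)
    tables = FROM_TABLE_PATTERN.findall(source)
    return csv_files + tables


sample = '''
df1 = pd.read_csv('Data.csv', sep=';')
query = "SELECT * FROM database.schema.data2"
query = "SELECT element1, element2 FROM database.schema.data3"
'''
inputs = find_inputs(sample)
print(inputs)
```

In the loop from the question, find_inputs(data) could be called where the 'csv' check currently is. Queries built dynamically or file names held in variables would not be caught by patterns like these.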
I am trying to create some temporary files and perform some operations on them inside a loop. Then I will access the information in all of the temporary files and do some operations with it. For simplicity, here is code that reproduces my issue:
import tempfile

tmp_files = []
for i in range(40):
    tmp = tempfile.NamedTemporaryFile(suffix=".txt")
    with open(tmp.name, "w") as f:
        f.write(str(i))
    tmp_files.append(tmp.name)
string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        data = f.read()
        string += data
print(string)
ERROR:
with open(tmp_file, "r") as f: FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpynh0kbnw.txt'
When I look at the /tmp directory (with a time.sleep(2) in the loop), I see that each file is deleted and only one is preserved, hence the error.
Of course I could keep all the files with tempfile.NamedTemporaryFile(suffix=".txt", delete=False), but that is not the idea: I would like the temporary files to exist only while the script runs. I could also delete the files with os.remove. My question is rather why this happens, because I expected the files to persist until the end of the run, since I don't close them during execution (or do I?).
Thanks a lot in advance.
tdelaney has already answered your actual question.
I would just like to offer an alternative to NamedTemporaryFile. Why not create a temporary folder that is removed (with all the files in it) at the end of the script?
Instead of using a NamedTemporaryFile, you could use tempfile.TemporaryDirectory. The directory will be deleted when closed.
The example below uses the with statement which closes the file handle automatically when the block ends (see John Gordon's comment).
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    tmp_files = []
    for i in range(40):
        tmp_file = os.path.join(temp_folder, f"{i}.txt")
        with open(tmp_file, "w") as f:
            f.write(str(i))
        tmp_files.append(tmp_file)
    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            data = f.read()
            string += data
    print(string)
By default, a NamedTemporaryFile deletes its file when closed. It's a bit subtle, but tmp = tempfile.NamedTemporaryFile(suffix=".txt") in the loop causes the previous file to be deleted when tmp is reassigned. One option is to use the delete=False parameter. Another is to keep the file open and seek to the beginning after the write.
NamedTemporaryFile is already a file object, so you can write to it directly without reopening. Just make sure the mode is "write plus" and text, not binary. Put the code in a try/finally block to make sure the files are really deleted at the end.
import tempfile

tmp_files = []
try:
    for i in range(40):
        tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w+")
        tmp.write(str(i))
        tmp.seek(0)
        tmp_files.append(tmp)
    string = ""
    for tmp_file in tmp_files:
        data = tmp_file.read()
        string += data
finally:
    for tmp_file in tmp_files:
        tmp_file.close()
print(string)
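The delete-on-close behaviour can be checked in isolation. This is a small sketch that closes the file explicitly so the result does not depend on when the interpreter happens to discard the object; in the question, reassigning tmp drops the last reference to the previous object, which (in CPython, at least) closes it right away with the same effect.

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(suffix=".txt")
name = tmp.name

# The file exists on disk for as long as the object is open.
exists_while_open = os.path.exists(name)

# Closing the object removes the file, since delete=True is the default.
tmp.close()
exists_after_close = os.path.exists(name)

print(exists_while_open, exists_after_close)
```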
Looking at the zipfile module, I'm trying to figure out why the content of a zip file changes when I recreate a file with the same content.
Here's the sample code I'm working on:
import os
import hashlib
import zipfile
from io import BytesIO

FILE_PATH = './'
SAMPLE_FILE = "zip_test123.txt"
# create an empty file
new_file = FILE_PATH+"/"+SAMPLE_FILE
try:
    open(new_file, 'x')
except FileExistsError:
    os.remove(new_file)
    open(new_file, 'x')
full_path = os.path.expanduser(FILE_PATH)
# zip it
data = BytesIO()
with zipfile.ZipFile(data, mode='w') as zf:
    zf.write(os.path.join(full_path, SAMPLE_FILE), SAMPLE_FILE)
zip_cntn = data.getvalue()
data.close()
print(zip_cntn)
print(hashlib.md5(zip_cntn).hexdigest())
This first creates an empty file, then zips it and prints the hash of the zipped data.
Running this multiple times results in different contents/hashes, which I think is caused by the modification date (my assumption is based on this, which shows the Modified date as well).
I'm only interested in zipping the actual contents and nothing else (i.e. the hash should stay the same if I recreate the same content for a given file).
Any suggestions on how to achieve this, i.e. how to ignore the extra info while archiving a file?
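If the modification time is indeed what varies, one possible approach is to build the archive member from a zipfile.ZipInfo with a fixed date_time and add the bytes with writestr, instead of letting write read the metadata from disk. The following is a sketch under that assumption; zip_bytes and FIXED_DATE_TIME are names invented for this example.

```python
import hashlib
import zipfile
from io import BytesIO

# Assumption: any fixed timestamp will do; 1980-01-01 is the earliest
# date the ZIP format can represent.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def zip_bytes(name, content):
    """Zip one member with a fixed timestamp and return the archive bytes."""
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, mode="w") as zf:
        info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE_TIME)
        zf.writestr(info, content)
    return buffer.getvalue()


first = hashlib.md5(zip_bytes("zip_test123.txt", b"")).hexdigest()
second = hashlib.md5(zip_bytes("zip_test123.txt", b"")).hexdigest()
print(first == second)
```

The content would be read from the real file (for example with open(path, "rb").read()) and passed in as bytes. File permissions and other on-disk metadata are not carried over with this approach.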
I have a list containing the names of some files.
I want to append the content of all the files to the first file, and then copy that file (the first file, with everything appended) to a new path.
This is what I have done so far:
This is the part of the code that does the appending (I have put a reproducible program at the end of my question, please have a look at it):
if (len(appended) == 1):
    shutil.copy(os.path.join(path, appended[0]), out_path_tempappendedfiles)
else:
    with open(appended[0],'a+') as myappendedfile:
        for file in appended:
            myappendedfile.write(file)
    shutil.copy(os.path.join(path, myappendedfile.name), out_path_tempappendedfiles)
This runs and copies successfully, but it does not append the files; it just keeps the content of the first file.
I have also tried this link. It did not raise an error, but it did not append the files either. The code is the same, except that instead of write I used shutil.copyfileobj:
with open(file,'rb') as fd:
    shutil.copyfileobj(fd, myappendedfile)
The same thing happened.
Update1
This is the whole code:
Even with the update it still does not append:
import os
import pandas as pd
d = {'Clinic Number':[1,1,1,2,2,3],'date':['2015-05-05','2015-05-05','2015-05-05','2015-05-05','2016-05-05','2017-05-05'],'file':['1a.txt','1b.txt','1c.txt','2.txt','4.txt','5.txt']}
df = pd.DataFrame(data=d)
df.sort_values(['Clinic Number', 'date'], inplace=True)
df['row_number'] = (df.date.ne(df.date.shift()) | df['Clinic Number'].ne(df['Clinic Number'].shift())).cumsum()
import shutil
path= 'C:/Users/sari/Documents/fldr'
out_path_tempappendedfiles='C:/Users/sari/Documents/fldr/temp'
for rownumber in df['row_number'].unique():
    appended = df[df['row_number']==rownumber]['file'].tolist()
    if (len(appended) == 1):
        shutil.copy(os.path.join(path, appended[0]), out_path_tempappendedfiles)
    else:
        with open(appended[0],'a') as myappendedfile:
            for file in appended:
                fd=open(file,'r')
                myappendedfile.write('\n'+fd.read())
                fd.close()
        shutil.copy(os.path.join(path, myappendedfile.name), out_path_tempappendedfiles)
Would you please let me know what the problem is?
You can do it like this. If the files are too large to load into memory, you can use readlines, as described in "Python append multiple files in given order to one big file":
import os,shutil
file_list=['a.txt', 'a1.txt', 'a2.txt', 'a3.txt']
new_path=
with open(file_list[0], "a") as content_0:
    for file_i in file_list[1:]:
        f_i=open(file_i,'r')
        content_0.write('\n'+f_i.read())
        f_i.close()
shutil.copy(file_list[0],new_path)
So this is how I resolved it.
It was a very silly mistake: I was not joining the base path to the file name.
I changed the code to use shutil.copyfileobj for performance, but the problem was only resolved by this:
os.path.join(path,file)
Before adding this, I was reading from the bare file name in the list instead of joining the base path to read from the actual file:
for rownumber in df['row_number'].unique():
    appended = df[df['row_number']==rownumber]['file'].tolist()
    print(appended)
    if (len(appended) == 1):
        shutil.copy(os.path.join(path, appended[0]), new_path)
    else:
        with open(appended[0], "w+") as myappendedfile:
            for file in appended:
                with open(os.path.join(path,file),'r+') as fd:
                    shutil.copyfileobj(fd, myappendedfile, 1024*1024*10)
                    myappendedfile.write('\n')
        shutil.copy(appended[0],new_path)
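The path-joining fix can also be shown as a small standalone sketch. Note that this is a variation, not the code above: it writes the combined output straight to the destination instead of modifying the first file and copying it afterwards. The function name concatenate_files and the demo file names are invented for this example.

```python
import os
import shutil
import tempfile


def concatenate_files(src_dir, names, dest_path):
    """Write the contents of the named files in src_dir, in order, to dest_path."""
    with open(dest_path, "w") as out:
        for name in names:
            # Join the base directory so each file is read from where it actually lives.
            with open(os.path.join(src_dir, name), "r") as src:
                shutil.copyfileobj(src, out)
            out.write("\n")


# Small self-contained demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as src_dir:
    for name, text in [("1a.txt", "first"), ("1b.txt", "second")]:
        with open(os.path.join(src_dir, name), "w") as f:
            f.write(text)
    dest_path = os.path.join(src_dir, "combined.txt")
    concatenate_files(src_dir, ["1a.txt", "1b.txt"], dest_path)
    with open(dest_path) as f:
        combined = f.read()
print(repr(combined))
```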
I'm automating a long task that involves vulnerabilities within a spreadsheet. However, I'm noticing that the "recommendation" text for these vulnerabilities is sometimes pretty long.
The CSV module for Python seems to be truncating some of this text when writing new rows. Is there any way to prevent this from happening? I simply see "NOTE: THIS FIELD WAS TRUNCATED" in places where the recommendation (which is a lot of text) would be.
The whole objective is to do this:
1. Import a master spreadsheet which has confirmation statuses and everything up-to-date.
2. Take a new spreadsheet containing vulnerabilities which doesn't have conf status/severity up-to-date.
3. Compare the second spreadsheet to the first. It'll update the severity levels from the second spreadsheet, and then write to a new file.
4. The newly created CSV file can be copied and pasted into the master spreadsheet. All vulnerabilities which match the first spreadsheet now have the same severity level/confirmation status.
What I'm noticing though, even in Ruby, is that some of the recommendations for these vulnerabilities contain long text, and that text ends up truncated when the CSV file is created. Here's a sample piece of the code that I've quickly written for demonstration:
#!/usr/bin/python
from sys import argv
import getopt, csv

master_vulns = {}
criticality = {}
############################ Extracting unique vulnerabilities from master file
contents = csv.reader(open(argv[1], 'rb'), delimiter=',')
for row in contents:
    if "Confirmation_Status" in row:
        continue
    try:
        if row[7] in master_vulns:
            continue
        master_vulns[row[7]] = row[3]
        criticality[row[7]] = row[2]
    except Exception:
        pass
############################ Updating confirmation status of newly created file
new_contents = csv.reader(open(argv[1], 'rb'), delimiter=',')
new_data = []
results = open('results.csv','wb')
writer = csv.writer(results, delimiter=',')
for nrow in new_contents:
    if "Confirmation_Status" in nrow:
        continue
    try:
        if nrow[1] == "DELETE":
            continue
        vuln_name = nrow[7]
        vuln_status = nrow[3]
        # use a separate name so the criticality dict is not overwritten
        vuln_criticality = criticality[vuln_name]
        vuln_status = master_vulns[vuln_name]
        nrow[3] = vuln_status
        nrow[2] = vuln_criticality
        writer.writerow(nrow)
    except Exception:
        writer.writerow(nrow)
        pass
results.close()
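One way to narrow this down is to check whether the csv module itself shortens a long field. As far as I know it does not insert a marker like "NOTE: THIS FIELD WAS TRUNCATED" on its own, so that text may already be present in the exported input file. The following sketch (written for Python 3, with made-up sample text) round-trips a long multi-line field through csv.writer and csv.reader in memory:

```python
import csv
import io

# Hypothetical long recommendation text: many lines containing commas.
long_text = "Apply the vendor patch, then restart the service.\n" * 1000

# Write one row with the long field into an in-memory buffer.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(["vuln-1", long_text])

# Rewind and read the row back.
buffer.seek(0)
row = next(csv.reader(buffer))
print(len(row[1]) == len(long_text))
```

If the field survives this round trip unchanged, searching the original input files for the truncation note would be a reasonable next step.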