I'm trying to import 5,000 .txt files into a PostgreSQL database. My script runs fine as long as it doesn't reach a line that doesn't fit the format. For example, every file has an empty line at the end, which also causes the script to crash.
I've tried to handle exceptions, but without success.
My script:
import csv
import os
import sys
import psycopg2
conn = psycopg2.connect(
    host="localhost",
    database="demo",
    user="demo",
    password="123",
    port="5432"
)
cur = conn.cursor()
maxInt = sys.maxsize
while True:
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)
def searchFiles(directory='', extension=''):
    print('SEARCHING IN: ', directory)
    filelist = []
    extension = extension.lower()
    for dirpath, dirnames, files in os.walk(directory):
        for name in files:
            if extension and name.lower().endswith(extension):
                filelist.append(os.path.join(dirpath, name))
            elif not extension:
                print('FAILED TO READ: ', (os.path.join(dirpath, name)))
    print('FINISHED FILE SEARCH AND FOUND ', str(len(filelist)), ' FILES')
    return filelist
def importData(fileToImport):
    with open(fileToImport, 'r') as f:
        reader = csv.reader(f, delimiter=':')
        for line in reader:
            try:
                cur.execute("""INSERT INTO demo VALUES (%s, %s)""", (line[0], line[1]))
                conn.commit()
            except:
                pass
                print('FAILED AT LINE: ', line)
print(conn.get_dsn_parameters())
cur.execute("SELECT version();")
record = cur.fetchone()
print("You are connected to - ", record)
fileList = searchFiles('output', '.txt')
counter = 0
length = len(fileList)
for file in fileList:
    # if counter % 10 == 0:
    print('Processing File: ', str(file), ', COMPLETED: ', str(counter), '/', str(length))
    importData(str(file))
    counter += 1
print('FINISHED IMPORT OF ', str(length), ' FILES')
A few lines of the data I'm trying to import:
example1#example.com:123456
example2#example.com:password!1
The error I'm getting:
File "import.py", line 66, in <module>
importData(str(file))
File "import.py", line 45, in importData
for line in reader:
_csv.Error: line contains NULL byte
How should I handle lines which can not get imported?
Thanks for any help
Your traceback shows the source of the exception in for line in reader:
File "import.py", line 45, in importData
for line in reader:
_csv.Error: line contains NULL byte
and you do not handle exceptions at that point. As the exception suggests, it is raised by your csv reader instance, not by the cur.execute call inside your try block. While you certainly could wrap your for loop in a try-except block, the loop will still end as soon as the exception is raised.
This exception may be caused by the file having a different encoding than your locale's, which is assumed by open() if no encoding is explicitly provided:
In text mode, if encoding is not specified the encoding used is
platform dependent: locale.getpreferredencoding(False) is called to
get the current locale encoding.
The accepted answer in this Q&A outlines a solution to deal with that, provided that you can identify the correct encoding to open the file with. The Q&A also shows some approaches on how to get rid of NULL bytes in the file, prior to handing it over to a reader.
You might also want to simply skip empty lines instead of sending them to your DB and handling the resulting exception, e.g.
for line in reader:
    if not line:
        continue
    try:
        [...]
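One way to combine both suggestions is to clean each line before it reaches the csv reader and to skip rows that do not have the expected two fields. The sketch below is only an illustration: the helper name iter_valid_rows is made up for this example, and it assumes that simply dropping NULL bytes is acceptable for your data. If the files are really in another encoding, opening them with the right encoding is the better fix.

```python
import csv
import io


def iter_valid_rows(lines, delimiter=":"):
    """Yield only rows with at least two fields (hypothetical helper)."""
    # Strip NULL bytes before the csv reader sees the line.
    cleaned = (line.replace("\0", "") for line in lines)
    for row in csv.reader(cleaned, delimiter=delimiter):
        if len(row) < 2:
            # Covers empty lines (row == []) and lines without a delimiter.
            continue
        yield row


# In-memory stand-in for one of the .txt files.
sample = io.StringIO(
    "example1#example.com:123456\n"
    "exam\0ple2#example.com:password!1\n"
    "no delimiter here\n"
    "\n"
)
rows = list(iter_valid_rows(sample))
print(rows)
```

In importData, the loop would then read for line in iter_valid_rows(f):, keeping the existing try/except around cur.execute for database errors.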
I need to write a function which, given a text file object open in read and write mode and a string, inserts the text of the string in the file at the current read/write position. In other words, the function writes the string in the file without overwriting the rest of it. When exiting the function, the new read/write position has to be exactly at the end of the newly inserted string.
The algorithm is simple; the function needs to:
read the content of the file starting at the current read/write position
write the given string at the same position step 1 started
write the content read at step 1. at the position where step 2. ended
reposition the read/write cursor at the same position step2. ended (and step 3. started)
If the argument file object is not readable or writable, the function should print a message and return immediately without changing anything.
This can be checked by using the file object methods readable() and writable().
In the main script:
1- prompt the user for a filename
2- open the file in read-write mode. If the file is not found, print a message and exit the program
3- insert the filename as the first line of the file followed by an empty line
4- insert a line number and a space, at the beginning of each line of the original text.
I'm very confused on how to write the function and main body.
so far I only have
def openFile(fileToread):
    print(file.read())
givefile = input("enter a file name: ")
try:
    file = open(givefile, "r+")
    readWriteFile = openFile(file)
except FileNotFoundError:
    print("File does not exist")
    exit(1)
print(givefile, "\n")
which is not a lot.
I need an output like this:
twinkle.txt
1 Twinkle, twinkle, little bat!
2 How I wonder what you're at!
3 Up above the world you fly,
4 Like a teatray in the sky.
the file used is a simple .txt file with the twinkle twinkle song
How can I do this?
Basic solution
give_file = input("enter a file name: ")
def open_file(file):
    return file.read()
def save_file(file, content):
    file.write(content)
try:
    # Use this to get the data of the file
    with open(give_file, "r") as fd:
        file_content = open_file(fd)
except FileNotFoundError:
    print("File does not exist")
    exit(1)
# change the data
new_content = f'{give_file}\n\n{file_content}'
try:
    # save the data
    with open(give_file, "w") as fd:
        save_file(fd, new_content)
except FileNotFoundError:
    print("File does not exist")
    exit(1)
This should give you the expected result.
I asked about the r+ and how to use it in this case. I got this answer:
Resetting the cursor to 0 should do the trick:
my_fabulous_useless_string = 'POUET'
with open(path, 'r+') as fd:
    content = fd.read()
    fd.seek(0)
    fd.write(f'{my_fabulous_useless_string}\n{content}')
so with your code it's:
give_file = input("enter a file name: ")
def open_file(file):
    return file.read()
def save_file(file, content):
    file.write(content)
try:
    # Read the file, then rewind and write the new content over it
    with open(give_file, "r+") as fd:
        file_content = open_file(fd)
        new_content = f'{give_file}\n\n{file_content}'
        fd.seek(0)
        save_file(fd, new_content)
except FileNotFoundError:
    print("File does not exist")
    exit(1)
A suggestion
Don't wrap these calls in functions: doing so hides the fact that the methods have side effects (they move the cursor).
Calling the methods directly is clearer:
give_file = input("enter a file name: ")
try:
    # Read the file, then rewind and write the new content over it
    with open(give_file, "r+") as fd:
        file_content = fd.read()
        new_content = f'{give_file}\n\n{file_content}'
        fd.seek(0)
        fd.write(new_content)
except FileNotFoundError:
    print("File does not exist")
    exit(1)
Or, with the basic solution and functions
def open_file(path):
    with open(path, "r") as fd:
        return fd.read()
def save_file(path, content):
    with open(path, 'w') as fd:
        fd.write(content)
# get file_name
file_name = input("enter a file name: ")
try:
    # Use this to get the data of the file
    file_content = open_file(file_name)
except FileNotFoundError:
    print("File does not exist")
    exit(1)
# change the data
new_content = f'{file_name}\n\n{file_content}'
# save the data
save_file(file_name, new_content)
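The four-step algorithm from the question (read the tail, write the string, write the tail back, reposition the cursor) can also be written as a single function. This is a minimal sketch rather than a full solution to the exercise: the name insert_text and the wording of the message are assumptions, and the demonstration uses an in-memory io.StringIO instead of a real file, on the assumption that a file opened with open(path, "r+") offers the same methods.

```python
import io


def insert_text(f, text):
    """Insert text at the current position without overwriting what follows."""
    if not (f.readable() and f.writable()):
        print("File must be open for both reading and writing")
        return
    start = f.tell()   # remember where the insertion begins
    rest = f.read()    # step 1: read everything after the cursor
    f.seek(start)
    f.write(text)      # step 2: write the new string at the start position
    end = f.tell()     # position right after the inserted string
    f.write(rest)      # step 3: write back the original tail
    f.seek(end)        # step 4: leave the cursor just after the insert


# In-memory demonstration with the first line of the sample file.
buf = io.StringIO("Twinkle, twinkle, little bat!\n")
insert_text(buf, "twinkle.txt\n\n")
position = buf.tell()
content = buf.getvalue()
print(repr(content), position)
```

Because the cursor ends up right after the inserted string, the main script could call the function once for the filename header and then once per line for the line numbers, reading one line between calls.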
I wrote this code to scan my samples_vsdt.txt, extract certain values, and write them to a CSV file. It fails with a StopIteration error and doesn't even read the text file. I've been trying to solve this for hours; any idea what's causing the problem?
Here is how my code works. For example, this line:
Scanning samples_extracted\82e5b144cb5f1c10629e72fc1291f535db7b0b40->(Word 2003 XML Document 1003-1)
will be written to the CSV as:
82e5b144cb5f1c10629e72fc1291f535db7b0b40,Word 2003 XML Document 1003-1
Here is my code. It works for all my other text files, but not for samples_vsdt.txt:
import csv,re
out_vsdt = "samples_vsdt.txt"
out_sha1_vsdt = "sha1_vsdt.csv"
def read_text_file(out_vsdt):
    with open(out_vsdt) as f:
        data = []
        for line in f:
            if "Scanning " + new in line and "(" in line:
                try:
                    sha = re.search('\\\(.*)->', line).group(1)
                    desc= re.search('->\((.*)\)', line).group(1)
                except AttributeError:
                    desc = None
                    sha = None
                mix = sha,desc
                data.append(mix)
                continue
            if "Scanning " + new in line:
                try:
                    sha= re.search('\\\(.*)$', line).group(1)
                    while True:
                        i = next(f)
                        if "(" in i:
                            try:
                                desc = re.search('->\((.*)\)', i).group(1)
                                break
                            except AttributeError:
                                desc = None
                                sha = None
                    mix = sha,desc
                    data.append(mix)
                except AttributeError:
                    sha = None
        return data
def write_csv_file(data,out_sha1_vsdt):
    with open(out_sha1_vsdt, 'wb') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
        csvwriter.writerow(['SHA-1','VSDT','DESC'])
        for row in data:
            csvwriter.writerow(row)
def main():
    data = read_text_file(out_vsdt)
    write_csv_file(data, out_sha1_vsdt)
if __name__ == '__main__':
    main()
    print "Parsing Successful"
Gives me error:
Traceback (most recent call last):
File "C:\Users\trendMICRO\Desktop\ojt\scanner\parser.py", line 65, in <module>
main()
File "C:\Users\trendMICRO\Desktop\ojt\scanner\parser.py", line 61, in main
data = read_text_file(out_vsdt)
File "C:\Users\trendMICRO\Desktop\ojt\scanner\parser.py", line 37, in read_text_file
i = next(f)
StopIteration
An alternative approach could be to just use a regular expression to extract whole blocks:
import csv
import re
import string  # needed for string.printable below
out_vsdt = "samples_vsdt.txt"
out_sha1_vsdt = "sha1_vsdt.csv"
with open(out_vsdt) as f_input:
    vscan32 = f_input.read()
with open(out_sha1_vsdt, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['SHA-1', 'VSDT', 'DESC'])
    # groups: SHA-1, any intermediate lines, and the type in the final ->(...)
    for sha, desc, vsdt in re.findall(r'Scanning.*?\\([0-9a-f]+)(.*?)->\((.*?)\)$', vscan32, re.S | re.M):
        desc = '|'.join(line.strip() for line in desc.splitlines() if len(line.strip()))
        desc = ''.join(filter(lambda x: x in string.printable, desc)) # remove non-printable characters
        csv_output.writerow([sha, vsdt, desc])
This uses a multi-line expression that looks for blocks starting with Scanning. Where there are multiple lines, the lines are stripped and joined together using a |. Finally any non-printable characters are removed from the description.
This would give you an output starting something like:
SHA-1,VSDT,DESC
004d44eeecae27314f8bd3825eb82d2f40182b51,WIN32 EXE 7-2,
07eab9ea58d4669febf001d52c5182ecf579c407,WIN32 EXE 7-2,
0d558bb5e0a5b544621af0ffde1940615ac39deb,WIN32 EXE 7-2,
5172c70c1977bbddc2a163f6ede46595109c7835,WIN32 EXE 7-2,- $R0\NsCpuCNMiner32.exe->Found Virus [WORM_CO.331300D2]|- $R0\NsCpuCNMiner64.exe->Found Virus [WORM_CO.331300D2]|- $R0\NsGpuCNMiner.exe->Found Virus [TROJ64_.743CC567]
This assumes you are using Python 3.x
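To see what the three groups of the expression capture, here is a small check against the single-line example from the question. It is only a sketch: the sample string is the one line quoted in the question, so it does not exercise the multi-line case, where the middle group would hold the intermediate lines.

```python
import re

# Same pattern as in the answer above.
pattern = re.compile(r'Scanning.*?\\([0-9a-f]+)(.*?)->\((.*?)\)$', re.S | re.M)

# The example line from the question (the backslash is escaped for Python).
sample = (
    "Scanning samples_extracted\\82e5b144cb5f1c10629e72fc1291f535db7b0b40"
    "->(Word 2003 XML Document 1003-1)\n"
)

matches = pattern.findall(sample)
for sha, desc, vsdt in matches:
    # desc is empty here because nothing sits between the hash and "->(".
    print(sha, vsdt, repr(desc))
```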
In chapter 4 of the book Head First Python, they use the syntax
print(list_name, file= output_file_name)
It works fine for them, but for me it gives a syntax error at file = output_file_name. The Python version is the same, i.e. 3.
code:
import os
man = []
other = []
try:
    data = open('sketch.txt')
    for each_line in data:
        try:
            (role, line_spoken) = each_line.split(':', 1)
            line_spoken = line_spoken.strip()
            if role == 'Man':
                man.append(line_spoken)
            elif role == 'Other Man':
                other.append(line_spoken)
        except ValueError:
            pass
    data.close()
except IOError:
    print('The datafile is missing!')
try:
    man_file = open('man_data.txt', 'w')
    other_file = open('other_data.txt', 'w')
    print(man, file=man_file)
    print(other, file=other_file)
except IOError:
    print('File error.')
finally:
    man_file.close()
    other_file.close()
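As the help for the print function indicates:
file: a file-like object (stream); defaults to the current
sys.stdout.
So the argument is not supposed to be a file name but rather a file-like object. If you want to write into (say) a text file, you first need to open it for writing and pass the file handle. Note also that a syntax error reported at file= usually means the script is actually being run by Python 2, where print is a statement unless you add from __future__ import print_function.
with open("output.txt", 'w') as f:
    print(list_name, file=f)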
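To illustrate what "file-like object" means here, the following sketch passes an in-memory io.StringIO buffer to print() instead of a real file. The list contents are made-up sample data, and the variable names are chosen only for this example.

```python
import io

lines = ["first line", "second line"]  # made-up sample data

# io.StringIO is an in-memory file-like object; the object returned by
# open("output.txt", "w") can be passed to print() in the same way.
buffer = io.StringIO()
print(lines, file=buffer)       # writes the list's repr plus a newline
for item in lines:
    print(item, file=buffer)    # writes one item per line

written = buffer.getvalue()
print(written, end="")
```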
I am writing a CSV file named OUT_FILE. The file is not created on disk immediately, so I want to wait until it has been created.
Below is the code that writes the CSV file:
with open(OUT_FILE, 'a') as outputfile:
    with open(INTER_FILE, 'rb') as feed:
        writer = csv.writer(outputfile, delimiter=',', quotechar='"')
        reader = csv.reader(feed, delimiter=',', quotechar='"')
        for row in reader:
            reportable_jurisdiction=row[7]
            if '|' in reportable_jurisdiction:
                row[7]="|".join(sorted(list(row[7].split('|'))))
                print " reportable Jurisdiction with comma "+reportable_jurisdiction
            else:
                print "reportable Jurisdiction if single "+reportable_jurisdiction
            writer.writerow(row)
        feed.close()
    outputfile.close()
I also have a file called FEED_FILE, which is the input for OUT_FILE; after the data has been written to OUT_FILE, the sizes of OUT_FILE and FEED_FILE should be the same.
To check this I wrote the code below:
while True:
    try:
        print'sleeping for 5 seconds'
        time.sleep(5)
        outputfileSize=os.path.getsize(OUT_FILE)
        if( outputfileSize ==FeedFileSize ):
            break
    except OSError:
        print " file not created "
        continue
print " file created !!"
I don't know whether this is executing, since there are no errors and none of the print output appears.
Any help?
You can check whether the file exists using Python's os module:
import os
import time
def wait_for_file(path):
    timeout = 300  # seconds
    while timeout > 0:
        if os.path.exists(path):
            # check if your condition of file size having
            # same as feed file size is met
            return
        timeout = timeout - 5
        time.sleep(5)
    raise Exception("File was not created")
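The size condition left as a comment above could be filled in roughly as follows. This is a sketch under assumptions: the parameters expected_size, timeout and interval are names chosen for this example, and the function returns True or False instead of raising. Note also that a file written inside a with block is flushed and closed when the block exits, so in many cases no waiting is needed at all.

```python
import os
import tempfile
import time


def wait_for_file(path, expected_size, timeout=300, interval=5):
    """Return True once path exists with expected_size bytes, else False."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.path.getsize(path) == expected_size:
                return True
        except OSError:
            pass  # the file does not exist yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demonstration with a temporary file and short timeouts.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.csv")
    with open(target, "wb") as fh:
        fh.write(b"a,b\n")  # 4 bytes
    found = wait_for_file(target, expected_size=4, timeout=1, interval=0.1)
    missing = wait_for_file(os.path.join(tmp, "nope.csv"), 4,
                            timeout=0.3, interval=0.1)
print(found, missing)
```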
There are many StackOverflow questions about this error when reading from a CSV file. My problem is occurring while reading from STDIN.
[Most SO solutions talk about tweaking the open() command which works for opening CSV files - not for reading them through STDIN]. My problem is with reading through STDIN. So please don't mark this as a duplicate.
My python code is:
import sys , csv
def main(argv):
    reader = csv.reader(sys.stdin, delimiter=',')
    for line in reader:
        print line
and the returned error is:
Traceback (most recent call last):
File "mapper.py", line 19, in <module>
main(sys.argv)
File "mapper.py", line 4, in main
for line in reader:
_csv.Error: line contains NULL byte
It would suffice me to simply ignore that line where the NULL byte occurs (if that is possible) in the for loop.
I solved it by handling the csv.Error exception:
import sys , csv
def main(argv):
    reader = csv.reader(sys.stdin, delimiter=',')
    lineCount = 0
    errorCount = 0
    while True:
        # keep iterating indefinitely until exception is raised for end of the reader (an iterator)
        try:
            lineCount += 1
            line = next(reader)
            print("%d - %s" % (lineCount , line))
        except csv.Error:
            # this exception is raised when a malformed CSV is encountered... ignore it and continue
            errorCount += 1
            continue
        except StopIteration:
            # this exception is raised when next() reaches the end of the iterator
            lineCount -= 1
            break
    print("total line: %d" % lineCount)
    print("total error: %d" % errorCount)
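The same pattern can be packaged as a small reusable function. The sketch below is an illustration only: the name read_rows is made up, and the demonstration provokes csv.Error with a badly quoted field in strict mode rather than with a NULL byte, since whether a NULL byte raises may depend on the Python version.

```python
import csv


def read_rows(reader):
    """Collect rows from a csv reader, counting records that raise csv.Error."""
    rows, errors = [], 0
    while True:
        try:
            rows.append(next(reader))
        except csv.Error:
            errors += 1   # malformed record: count it and keep going
        except StopIteration:
            break         # end of input
    return rows, errors


# The second line has a stray character after a closing quote, which
# raises csv.Error when the reader is created with strict=True.
lines = ['a,b\n', '1,"x"y\n', 'c,d\n']
rows, errors = read_rows(csv.reader(lines, strict=True))
print(rows, errors)
```

For the original use case the call would be read_rows(csv.reader(sys.stdin, delimiter=',')), at the cost of holding all rows in memory.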