Delete a Line from a Big CSV File in Python

I have an 11 GB CSV file with some corrupt lines I have to delete. I have identified the corrupted line numbers from an ETL interface.
My program runs with small datasets; however, when I run it on the main file I get a MemoryError. Below is the code I'm using. Do you have any suggestions to make it work?
row_to_delete = 101068
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as file:
    data = file.readlines()
print(data[row_to_delete - 1])
data[row_to_delete - 1] = ''
with open(filename, 'wb', encoding="utf8", errors='ignore') as file:
    file.writelines(data)
Error:
Traceback (most recent call last):
  File "/.PyCharmCE2018.2/config/scratches/scratch_7.py", line 7, in <module>
    data = file.readlines()
MemoryError

Rather than read the whole file into memory, loop over the input file and write all lines except the one you need to delete to a new file. Use enumerate() to keep a counter if you need to delete by index:
row_to_delete = 101068
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as inputfile, \
        open(filename + '.fixed', 'w', encoding='utf8') as outputfile:
    for index, line in enumerate(inputfile, 1):  # start counting at 1 to match the 1-based line number
        if index == row_to_delete:
            continue  # don't write the line that matches
        outputfile.write(line)
Rather than use an index, you could even detect a bad line directly in code this way (see the sketch after the rename step below).
Note that this writes to a new file, with the same name but with .fixed added.
You can move that file back to replace the old file if you want to, with os.rename(), once you are done copying all but the bad line:
import os

os.rename(filename + '.fixed', filename)
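As mentioned above, if the corruption is recognizable from a line's content rather than its number, the same loop works without an index. A minimal sketch, where is_corrupt() is a hypothetical placeholder test, not something from the question:

def is_corrupt(line):
    # hypothetical test: flag lines that contain no field separator at all
    return ',' not in line

with open(filename, 'r', encoding='utf8', errors='ignore') as inputfile, \
        open(filename + '.fixed', 'w', encoding='utf8') as outputfile:
    for line in inputfile:
        if is_corrupt(line):
            continue  # skip lines the test flags as corrupt
        outputfile.write(line)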


Python adding a string leaves extra characters

I have a Python script that adds a string after each line of a CSV file. The line file_lines = [''.join([x.strip(), string_to_add, '\n']) for x in f.readlines()] is the troublemaker: for each file line it adds the string and then appends a newline. If you need any more info, just let me know.
Here is the script:
# Adding .JPG string to the end of each line for the Part Numbers
string_to_add = ".JPG"

# Open the file and join the .JPG to the current lines
with open("PartNums.csv", 'r') as f:
    file_lines = [''.join([x.strip(), string_to_add, '\n']) for x in f.readlines()]

# Writes to the file until it's done
with open("PartNums.csv", 'w') as f:
    f.writelines(file_lines)
The script works and does what it is supposed to; however, my issue shows up later in this larger script. This script outputs into a CSV file that looks like this:
X00TB0001.JPG
X01BJ0003.JPG
X01BJ0004.JPG
X01BJ0005.JPG
X01BJ0006.JPG
X01BJ0007.JPG
X01BJ0008.JPG
X01BJ0026.JPG
X01BJ0038.JPG
X01BJ0039.JPG
X01BJ0040.JPG
X01BJ0041.JPG
...
X01BJ0050.JPG
X01BJ0058.JPG
X01BJ0059.JPG
X01BJ0060.JPG
X01BJ0061.JPG
X01BJ0170.JPG
X01BJ0178.JPG
Without the \n in that line, i.e. file_lines = [''.join([x.strip(), string_to_add]) for x in f.readlines()], the CSV file output looks like this:
X00TB0001.JPGX01BJ0003.JPGX01BJ0004.JPGX01BJ0005.JPGX01BJ0006.JPG
The issue is when I go to read this file later and move files with it using this script:
# If the string matches a file name, move it to a new directory
dst = r"xxx"
with open('PicsWeHave.txt') as my_file:
    for filename in my_file:
        src = os.path.join("XXX")  # .strip() to avoid unwanted white spaces
        #shutil.copy(src, os.path.join(dst, filename.strip()))
        shutil.copy(os.path.join(src, filename), os.path.join(dst, filename))
When I run this whole script, it works until it has to move the files, at which point I get this error:
FileNotFoundError: [Errno 2] No such file or directory: 'XXX\\X15SL0447.JPG\n'
I know the file exists, but the '\n' should not be there. How can I keep every name on its own line in the file, yet not have the \n attached to each name, so that the strings match when I move the files?
Thank You For Your Help!
As they said above, you should use .strip():
shutil.copy(os.path.join(src, filename.strip()), os.path.join(dst, filename.strip()))
This way you get the file name you need, with the trailing newline (and any other surrounding whitespace) removed.
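Putting it together, the move loop could look like the following sketch; src and dst stand in for the real directories, which are elided ("XXX"/"xxx") in the question:

import os
import shutil

src = r"XXX"  # source directory (elided in the question)
dst = r"xxx"  # destination directory (elided in the question)

with open('PicsWeHave.txt') as my_file:
    for filename in my_file:
        name = filename.strip()  # drop the trailing newline before building paths
        shutil.copy(os.path.join(src, name), os.path.join(dst, name))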

Get AttributeError when copying string from a file into a new file

I am writing a simple program that reads a file, copies its contents, and writes the copied content into a new file. I thought I had done it correctly, because when I open "copyFile" the copied contents of the original file are written there as a string. I've written:
copy = open('TestFile').read() #Open 'TestFile', read it into variable
print("Copy of textfile:\t", copy)
copyFile = open('copyText.txt', 'w').write(copy) #Create new file, write in the copied text
copyText = copyFile.read()
print("New file :\t", copyText)
And I am able to print the contents of the file, but when I try to print the copy, I get this error:
Traceback (most recent call last):
  File "PATH/TO/THE/FILE/CALLED/copyText.py", line 14, in <module>
    copyText = copyFile.read()
AttributeError: 'int' object has no attribute 'read'
The file only has one sentence in it, so I don't understand the error I'm getting.
file.write() does not return an IO object; it returns the number of characters written. I also suggest using a with statement to read from and write to files.
The following code is the right way to do it for your case:
copy = open('TestFile').read() #Open 'TestFile', read it into variable
print("Copy of textfile:\t", copy)
length = open('copyText.txt', 'w').write(copy) #Create new file, write in the copied text
copyText = open('copyText.txt', 'r').read()
print("New file :\t", copyText)
This is the solution you should use to read and write.
with open('TestFile', 'r') as readfile:
    copy = readfile.read()
print("Copy of textfile:\t", copy)

with open("copyTest.txt", 'w') as writefile:
    length = writefile.write(copy)
print("Length written to file", length)

with open("copyTest.txt", 'r') as readfile:
    copyText = readfile.read()
print("New file:\t", copyText)
Output:
Copy of textfile: this is a sentence
Length written to file 19
New file: this is a sentence
TestFile:
this is a sentence
It looks like the write function returns the number of characters that were written to the file, which means you are trying to call read on an int.
You'll want to store the file object in a variable before writing to it if you want to read the file text back afterwards. This can be achieved like the following:
copy = open('TestFile').read() # Open 'TestFile', read it into variable
print("Copy of textfile:\t", copy)
copyFile = open('copyText.txt', 'w+') # Create new file; 'w+' also allows reading it back
copyFile.write(copy) # write in the copied text
copyFile.seek(0) # rewind to the start of the file before reading
copyText = copyFile.read()
copyFile.close()
print("New file :\t", copyText)

ValueError: must have exactly one of create/read/write/append mode

I have a file that I open and I want to search through till I find a specific text phrase at the beginning of a line. I then want to overwrite that line with 'sentence':
sentence = "new text"
with open(main_path, 'rw') as file:  # Use file to refer to the file object
    for line in file.readlines():
        if line.startswith('text to replace'):
            file.write(sentence)
I'm getting:
Traceback (most recent call last):
  File "setup_main.py", line 37, in <module>
    with open(main_path,'rw') as file: # Use file to refer to the file object
ValueError: must have exactly one of create/read/write/append mode
How can I get this working?
You can open a file for simultaneous reading and writing but it won't work the way you expect:
with open('file.txt', 'w') as f:
    f.write('abcd')

with open('file.txt', 'r+') as f:  # The mode is r+ instead of r
    print(f.read())  # prints "abcd"
    f.seek(0)  # Go back to the beginning of the file
    f.write('xyz')
    f.seek(0)
    print(f.read())  # prints "xyzd", not "xyzabcd"!
You can overwrite bytes or extend a file but you cannot insert or delete bytes without rewriting everything past your current position.
Since lines aren't all the same length, it's easiest to do it in two separate steps:
lines = []

# Parse the file into lines
with open('file.txt', 'r') as f:
    for line in f:
        if line.startswith('text to replace'):
            line = 'new text\n'
        lines.append(line)

# Write them back to the file
with open('file.txt', 'w') as f:
    f.writelines(lines)
    # Or: f.write(''.join(lines))
You can't read and write to the same file like that. You'd have to read from main_path and write to another file, e.g.:
sentence = "new text"
with open(main_path,'rt') as file: # Use file to refer to the file object
with open('out.txt','wt') as outfile:
for line in file.readlines():
if line.startswith('text to replace'):
outfile.write(sentence)
else:
outfile.write(line)
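Alternatively, the standard-library fileinput module can rewrite a file in place; a minimal sketch under the same assumptions (main_path and sentence as above):

import fileinput

# With inplace=True, stdout is redirected into the file, so whatever
# is printed for each line becomes that line's replacement.
for line in fileinput.input(main_path, inplace=True):
    if line.startswith('text to replace'):
        print(sentence)  # print() supplies the newline the replaced line needs
    else:
        print(line, end='')  # keep other lines unchanged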
This is not the problem with the example code, but I wanted to share it, as this is where I wound up when searching for the error.
I was getting this error due to the chosen file name (con.txt, for example) when appending to a file on Windows. Changing the extension to other possibilities resulted in the same error, but changing the file name solved the problem. It turns out the file name caused a redirect to the console, which resulted in the error (must have exactly one of read or write mode): Why does naming a file 'con.txt' in windows make Python write to console, not file?

I am confused with my Python. ValueError: I/O operation on closed file

Kinda lost here. I am trying to get a consolidated CSV and I keep getting this error:
File "consolidate.py", line 26, in <module>
    csv_merge.write(line)
ValueError: I/O operation on closed file.
I tried moving indentation:
csv_header = 'name,location,age,phonenumber'
csv_out = 'consolidated.csv'
csv_dir = os.getcwd()

dir_tree = os.walk(csv_dir)
for dirpath, dirnames, filenames in dir_tree:
    pass

csv_list = []
for file in filenames:
    if file.endswith('.csv'):
        csv_list.append(file)

csv_merge = open(csv_out, 'w')
csv_merge.write(csv_header)
csv_merge.write('\n')

for file in csv_list:
    csv_in = open(file)
    for line in csv_in:
        if line.startswith(csv_header):
            continue
        csv_merge.write(line)
    csv_in.close()
    csv_merge.close()

print('Verify consolidated CSV file : ' + csv_out)
But this didn't work. How can I resolve this error?
You never open csv_merge properly, and even if you do, you still close csv_merge after the first item in csv_list is written, so the second iteration writes to a closed file.
Why are you calling csv_merge.close() by hand at all?
The convention is to use with open(csv_out, 'w') as csv_merge:; that way the file always gets closed, even if the loop or script fails to execute properly.
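For reference, here is a sketch of the same merge written with with blocks throughout, assuming (as the question's code does) that the input CSVs live in the current working directory:

import os

csv_header = 'name,location,age,phonenumber'
csv_out = 'consolidated.csv'

# collect the .csv files in the current directory, skipping the output file itself
csv_list = [name for name in os.listdir(os.getcwd())
            if name.endswith('.csv') and name != csv_out]

with open(csv_out, 'w') as csv_merge:
    csv_merge.write(csv_header + '\n')
    for name in csv_list:
        with open(name) as csv_in:
            for line in csv_in:
                if line.startswith(csv_header):
                    continue  # skip the header repeated in each input file
                csv_merge.write(line)

print('Verify consolidated CSV file : ' + csv_out)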

Slow Python file I/O; Ruby runs better than this; got the wrong language?

Please advise - I'm going to use this as a learning point. I'm a beginner.
I'm splitting a 25 MB file into several smaller files.
A kindly guru here gave me a Ruby script. It works beautifully fast. So, in order to learn, I mimicked it with a Python script. This runs like a three-legged cat (slow). I wonder if anyone can tell me why?
My python script
## split a file into smaller files
###########################################
def splitlines (file) :
    fileNo=0001
    outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a') ## open file to append
    fh = open(file, "r") ## open the file for reading
    mylines = fh.readlines() ### read in lines
    for line in mylines: ## for each line
        if re.search("Copyright ", line): # if the line is equal to the regex
            outFile.close() ## close the file
            fileNo +=1 #and add one to the filename, starting to read lines in again
        else: # otherwise
            outFile=open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a') ## open file to append
            outFile.write(line) ## then append it to the open outFile
    fh.close()
The guru's Ruby 1.9 script
g = 0001
f = File.open(g.to_s + ".txt", "w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f = File.open(g.to_s + ".txt", "w")
    g += 1
  end
  f.print line
end
There are many reasons why your script is slow -- the main one being that you reopen the output file for almost every line you write. Since the old file object gets implicitly closed on opening a new one (due to Python garbage collection), the write buffer is flushed for every single line you write, which is quite expensive.
A cleaned up and corrected version of your script would be
def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
    out_file.close()
I guess the reason your script is so slow is that you open a new file descriptor for each line. If you look at your guru's Ruby script, it closes and reopens the output file only when the separator matches.
In contrast, your Python script opens a new file descriptor for every line you read (and, by the way, does not close them). Opening a file requires talking to the kernel, so this is relatively slow.
Another change I would suggest is to change
fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line
to
fh = open(file, "r")
for line in fh:
With this change, you do not read the whole file into memory, but only block after block. Although it should not matter with a 25 MiB file, it will hurt you with big files and is good practice (and less code ;)).
The Python code might be slow due to the regex and not the I/O. Try
import re

def splitlines(file):
    fileNo = 1
    outFile = open("newdocs/%s.txt" % fileNo, 'a') ## open file to append
    reg = re.compile("Copyright ")
    for line in open(file, "r"):
        if reg.search(line): # if the line matches the compiled regex
            outFile.close() ## close the file
            fileNo += 1 # and add one to the filename
            outFile = open("newdocs/%s.txt" % fileNo, 'a') ## open the next file to append
        outFile.write(line) ## then append the line to the open outFile
Several notes:
Always use / instead of \ in path names
If a regex is used repeatedly, compile it
Do you need re.search, or re.match?
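To illustrate that last note: re.match() only anchors at the start of the string, while re.search() scans the whole line, so for a phrase that may appear mid-line you want search():

import re

line = "    Copyright 2011 Example Corp."   # leading whitespace, as in a real document line
print(re.match("Copyright ", line))   # None: match() only looks at the start of the string
print(re.search("Copyright ", line))  # a match object: search() scans the whole string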
UPDATE:
#Ed. S: point taken
#Winston Ewert: code updated to be closer to the original Ruby code
rosser,
Don't use the names of built-in objects as identifiers in code (file, splitlines).
The following code respects the effect of your own code: an out_file is closed without writing the line containing 'Copyright ' that constitutes the closing signal.
The use of writelines() is intended to obtain faster execution than a repetition of out_file.write(line).
The if li: block is there to trigger the closing of out_file in case the last line of the read file doesn't contain 'Copyright '.
def splitfile(filename, wordstop, destrep, file_no=1, li=None):
    li = [] if li is None else li  # avoid sharing a mutable default list between calls
    with open(filename) as in_file:
        for line in in_file:
            if wordstop in line:
                with open(destrep + str(file_no) + '.txt', 'w') as f:
                    f.writelines(li)
                file_no += 1
                li = []
            else:
                li.append(line)
        if li:
            with open(destrep + str(file_no) + '.txt', 'w') as f:
                f.writelines(li)
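A hypothetical call, using the asker's separator and a relative output directory prefix:

splitfile('corpus1.txt', 'Copyright ', 'newdocs/')  # writes newdocs/1.txt, newdocs/2.txt, ...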
