Strange characters in the beginning of the file after writing in Python - python

I want to do a lot of boring C# code replacements automatically through a python script. I read all lines of the file, transform them, truncate the whole file, write new strings and close it.
f = open(file, 'r+')
text = f.readlines()
# some changes
f.truncate(0)
for line in text:
    f.write(line)
f.close()
All my changes are written, but some strange characters appear at the beginning of the file. I don't know how to avoid them. Opening the file with encoding='utf-8-sig' doesn't help either.
I tried truncating the whole file except the first line, like this:
import sys
f.truncate(sys.getsizeof(text[0]))
for index in range(1, len(text), 1):
    f.write(text[index])
But in this case more than just the first line is left in the file.
EDIT
I tried this:
f.truncate(len(text[0]))
for index in range(1, len(text), 1):
    f.write(text[index])
Now the first line is written correctly, but the next one has the same issue. So I think these characters come from the end of the file, and I am writing after them.

f = open(file, 'r+')
text = f.readlines()  # After reading all the lines, the pointer is at the end of the file.
# some changes
f.seek(0)  # Bring the pointer back to the start of the file.
f.truncate()  # With no argument, truncate() cuts the file at the current position (now 0).
for line in text:
    f.write(line)
f.close()
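The reason the original code produced stray characters is that truncate(0) shrinks the file but does not move the file position, which is still at the old end of file after readlines(). The next write lands at that old offset, and on most platforms the gap before it is filled with zero bytes. A minimal sketch of the fixed pattern as a reusable function follows; the name rewrite_lines and its transform parameter are illustrative, not part of the original question:

```python
def rewrite_lines(path, transform, encoding="utf-8"):
    """Rewrite the file at `path` in place, applying `transform` to each line.

    `rewrite_lines` and `transform` are hypothetical names for this sketch.
    """
    with open(path, "r+", encoding=encoding) as f:
        lines = f.readlines()   # reading leaves the position at end of file
        f.seek(0)               # move back to the start first
        f.truncate()            # then cut the file at the current position (0)
        f.writelines(transform(line) for line in lines)
```

For example, rewrite_lines("Program.cs", lambda s: s.replace("int", "long")) would apply one replacement to every line of a (hypothetical) Program.cs file.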

Related

Replace an arrow character, repeating headers and blank lines in text file and paste the data cleanly in Excel sheet

My attempt to remove arrow character, blank lines and headers from this text file is as below -
I am trying to ignore arrow character and blank lines and write in the new file MICnew.txt but my code doesn't do it. Nothing changes in the new file.
Please help, Thanks so much
I have attached sample file as well.
import re
with open('MIC.txt') as oldfile, open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(re.sub(r'[^\x00-\x7f]',r' ',line))
with open('MICnew.txt','r+') as file:
    for line in file:
        if not line.isspace():
            file.write(line)
You can't read from and write to the same file simultaneously. When you open a file with mode r+, the I/O pointer is initially at the beginning but reading will push it to the end (as explained in this answer). So in your case, you read the first line of the file, which moves the pointer to the end of the file. Then you write out that line (unless it's all whitespace) but crucially, the pointer stays at the end. That means on the next iteration of the loop you will have reached the end of the file and your program stops.
To avoid this, read in all the contents of the file first, then loop over them and write out what you want:
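from pathlib import Path

file_data = Path('MICnew.txt').read_text()
with open('MICnew.txt', 'w') as out_handle: # THIS WILL OVERWRITE THE FILE!
    for line in file_data.splitlines():
        if line.strip():  # skip empty and whitespace-only lines
            out_handle.write(line + '\n')  # splitlines() dropped the newline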
But that double loop is a bit clumsy and you can instead combine the two steps into one:
import re

with open('MIC.txt', errors='ignore') as oldfile, \
        open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        clean_line = re.sub(r'[^\x00-\x7f]', ' ', line.strip('\x0c'))
        if not clean_line.isspace():
            newfile.write(clean_line)
To drop bytes that cannot be decoded, the file is opened with errors='ignore', which silently omits the improperly encoded characters. The sample file also contains a number of rogue form feed characters (ASCII code 12, or \x0c in hex), so the code explicitly strips them from the ends of each line.
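The cleaning step can also be pulled out into a small generator so it can be checked without touching any files. This is only a sketch: the name clean_lines is made up for illustration, and unlike the answer above it removes form feeds anywhere in the line rather than only at its ends.

```python
import re

NON_ASCII = re.compile(r"[^\x00-\x7f]")


def clean_lines(lines):
    """Yield cleaned lines, skipping those that end up blank.

    Non-ASCII characters (such as the arrow) become spaces and form feeds
    are removed. `clean_lines` is a hypothetical helper for this sketch.
    """
    for line in lines:
        cleaned = NON_ASCII.sub(" ", line.replace("\x0c", ""))
        if cleaned.strip():  # false for empty and whitespace-only lines
            yield cleaned
```

Since a file object is an iterable of lines, newfile.writelines(clean_lines(oldfile)) would plug this into the with block shown above.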

Using a for loop to add a new line to a table: python

I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
    for line in file_object:
        print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
    file_object.close()
But this doesn't seem to work (creates empty file). I have also tried to use file_object.write, but again this creates an empty file too.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
for sequence in SeqIO.parse(infile, "fasta"):
    hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
    mimp_length = 400
    for h in hit:
        h_start = h.start()
        hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
        for h_rc in hit_rc:
            h_rc_end = h_rc.end()
            length = h_rc_end - h_start
            if length > 0:
                if length < mimp_length:
                    with open("Mimp_hits.bed", "a+") as file_object:
                        for line in file_object:
                            print(sequence.description, h.start(), h_rc.end())
                        file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
To write a line to a file you would do something like this:
with open("file.txt", "a") as f:
    print("new line", file=f)
If you want it tab-separated you can also add sep="\t". This is why Python 3 made print a function: so you can use the sep, end, file, and flush keyword arguments. :)
Opening a file for appending means the file pointer starts at the end of the file, so writing to it doesn't overwrite any data (it gets appended to the end of the file), and iterating over it (or otherwise reading from it) gives nothing, because you have already reached the end of the file.
So instead of iterating over the lines of the file you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
    print(sequence.description, h.start(), h_rc.end(), file=file_object)
You can also consider opening the file just once, before the loop, since opening it once and writing multiple times is more efficient than opening it multiple times. Also, the with block closes the file automatically, so there is no need to do that explicitly.
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # ... etc per original code ...
                    if length < mimp_length:
                        file_object.write("{}\t{}\t{}\n".format(
                            sequence.description, h.start(), h_rc.end()))
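Putting both answers together, the whole search-and-write loop might look like the sketch below. To keep it runnable without Biopython it takes plain (description, sequence) string pairs instead of SeqIO records; the function name write_hits and that input shape are assumptions made for illustration.

```python
import re

FORWARD = re.compile(r"CAGTGGG..GCAA[TA]AA")
REVERSE = re.compile(r"TT[TA]TTGC..CCCACTG")


def write_hits(records, out, max_length=400):
    """Write one tab-separated line per forward/reverse hit pair to `out`.

    `records` is an iterable of (description, sequence) string pairs, a
    stand-in for the SeqIO records in the question. `out` is any writable
    text file object, opened once by the caller.
    """
    for description, seq in records:
        for h in FORWARD.finditer(seq):
            for h_rc in REVERSE.finditer(seq):
                length = h_rc.end() - h.start()
                if 0 < length < max_length:
                    # sep="\t" gives the tab-separated columns of a .bed file
                    print(description, h.start(), h_rc.end(), sep="\t", file=out)
```

With Biopython, the caller could open "Mimp_hits.bed" once in a with block and pass ((s.description, str(s.seq)) for s in SeqIO.parse(infile, "fasta")) as records.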

parse a file, appending each line at the end and removing the line from the top

I am trying to move each line down to the bottom of the file; this is what the file looks like:
daodaos 12391039
idiejda 94093420
jfijdsf 10903213
....
#completed
So at the end of the parsing, I want all the entries that are currently at the top to sit below the line that says #completed.
The problem is that I am not sure how to do this in one pass. I know that I can read the whole file, line by line, close it, and then re-open it in write mode, search for each line, remove it from the file and add it to the end; but that feels incredibly inefficient.
Is there a way, in one pass, to process the current line and then, in the same for loop, delete the line and append it to the end of the file?
file = open('myfile.txt', 'a')
for items in file:
    #process items line
    #append items line to the end of the file
    #remove items line from the file
I suggest keeping it simple: read everything, then write it back.
with open('myfile.txt') as f:
    lines = f.readlines()
with open('myfile.txt', 'w') as f:
    newlines = lines  # keep the original order if the marker is missing
    for i, line in enumerate(lines):
        # do your stuff, check if completed, rearrange the list
        if line.startswith('#completed'):
            idx = i
            newlines = lines[idx:] + lines[:idx]
            break
    f.write(''.join(newlines)) # write back new lines
Below is another version I could think of, if you insist on modifying the file while reading it:
with open('myfile.txt', 'r+') as f:
    newlines = ''
    line = True
    while line:
        line = f.readline()
        if line.startswith('#completed'):
            # line += f.read() # uncomment this line if you are interested in the lines after #completed
            f.truncate()
            f.seek(0)
            f.write(line + newlines)
            break
        else:
            newlines += line
Not really.
Your main problem here is that you're iterating on the file at the same time you want to change it. This will Do Bad Things (tm) to your processing, unless you plan to micro-manage the file position pointer.
You do have that power: the seek method lets you move to a given file location, expressed in bytes. seek(0) moves to the start of the file; seek(0, 2), that is seek(0, os.SEEK_END), moves to the end. The problem you face is that your for loop trusts that this pointer indicates the next line to read.
One distinct problem is that you can't just remove a line from the middle of the file; something exists in those bytes. Think of it as lines of text on a page, written in pencil. You can erase line 4, but this does not cause lines 5-end to magically float up half a centimeter; they're still in the same physical location.
How to Do It ... sort of
Read all of the lines into a list. You can easily change a list the way you want. When you hit the end, then write the list back to the file -- or use your magic seek and append powers to alter only a little of it.
I recommend doing this the simple way: read the whole file into a variable, move the completed lines to another variable, and then rewrite your file.
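The read-everything-then-rewrite approach recommended in these answers can be sketched as two small functions. The names move_lines_below_marker and rewrite_file are illustrative only, and the sketch assumes the file is small enough to hold in memory.

```python
def move_lines_below_marker(lines, marker="#completed"):
    """Return a new list with every line above the marker moved below it.

    If no line starts with `marker`, the lines come back in their original
    order. Both function names here are hypothetical.
    """
    # Make sure every line ends with a newline so rotated lines never merge.
    lines = [l if l.endswith("\n") else l + "\n" for l in lines]
    for idx, line in enumerate(lines):
        if line.startswith(marker):
            return lines[idx:] + lines[:idx]
    return lines


def rewrite_file(path, marker="#completed"):
    """Read the whole file, rearrange the lines in memory, write them back."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(move_lines_below_marker(lines, marker))
```

Keeping the list manipulation separate from the file handling makes the rearranging step easy to test on its own.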

Python: Open a file, search then append, if not exist

I am trying to append a string to a file if the string doesn't exist in the file. However, opening the file with the a+ option doesn't let me do this in one go, because a+ puts the pointer at the end of the file, meaning that my search will always fail. Is there any good way to do this other than opening the file to read first, closing it, and opening it again to append?
In code, apparently, below doesn't work.
file = open("fileName", "a+")
I need to do following to achieve it.
file = open("fileName", "r")
... check if a string exist in the file
file.close()
... if the string doesn't exist in the file
file = open("fileName", "a")
file.write("a string")
file.close()
To leave the input file unchanged if needle is on any line or to append the needle at the end of the file if it is missing:
with open("filename", "r+") as file:
    for line in file:
        if needle in line:
            break
    else: # not found, we are at the eof
        file.write(needle) # append missing data
I've tested it and it works on both Python 2 (stdio-based I/O) and Python 3 (POSIX read/write-based I/O).
The code uses obscure else after a loop Python syntax. See Why does python use 'else' after for and while loops?
You can set the current position of the file object using file.seek(). To jump to the beginning of a file, use
f.seek(0, os.SEEK_SET)
To jump to a file's end, use
f.seek(0, os.SEEK_END)
In your case, to check if a file contains something, and then maybe append to the file, I'd do something like this:
import os
with open("file.txt", "r+") as f:
    line_found = any("foo" in line for line in f)
    if not line_found:
        f.seek(0, os.SEEK_END)
        f.write("yay, a new line!\n")
There is a minor bug in the previous answers: often, the last line in a text file is missing an ending newline. If you do not take that into account and blindly append some text, your text will be appended to the last line.
For safety:
needle = "Add this line if missing"
with open("filename", "r+") as file:
    ends_with_newline = True
    for line in file:
        ends_with_newline = line.endswith("\n")
        if line.rstrip("\n\r") == needle:
            break
    else: # not found, we are at the eof
        if not ends_with_newline:
            file.write("\n")
        file.write(needle + "\n") # append missing data
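The question asked whether a+ can be made to work. It can, as long as the position is rewound with seek(0) before searching. A sketch under that assumption follows; the function name append_if_missing is made up for illustration. Unlike r+, a+ also creates the file if it does not exist, and writes in append mode go to the end of the file.

```python
def append_if_missing(path, needle):
    """Append `needle` as its own line unless a line already equals it.

    Returns True if the line was appended. `append_if_missing` is a
    hypothetical helper name for this sketch.
    """
    with open(path, "a+", encoding="utf-8") as f:
        f.seek(0)  # "a+" starts at the end of the file; rewind to search
        last = ""
        for line in f:
            if line.rstrip("\r\n") == needle:
                return False  # already present, leave the file unchanged
            last = line
        if last and not last.endswith("\n"):
            f.write("\n")  # the file did not end with a newline
        f.write(needle + "\n")
        return True
```

Calling append_if_missing("filename", "Add this line if missing") twice in a row appends the line only the first time.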

Combined effect of reading lines twice?

As practice, I am learning to read a file.
As is hopefully obvious from the code, I have a file in the working directory. I need to read it and print it.
my_file=open("new.txt","r")
lengt=sum(1 for line in my_file)
for i in range(0,lengt-1):
    myline=my_file.readlines(1)[0]
    print(myline)
my_file.close()
This raises an error saying the index is out of range.
The text file simply contains statements like
line one
line two
line three
.
.
.
With everything else the same, I tried myline=my_file.readline(). I get 7 empty lines.
My guess is that by using for line in my_file, I have already read all the lines and reached the end of the document. How do I overcome this and get the result I want?
P.S. If it matters, it's Python 3.3.
No need to count along. Python does it for you:
my_file = open("new.txt","r")
for myline in my_file:
    print(myline)
Details:
my_file is an iterator. This is a special object that you can iterate over.
You can also access a single line:
line_1 = next(my_file)
gives you the first line assuming you just opened the file. Doing it again:
line_2 = next(my_file)
you get the second line. If you now iterate over it:
for myline in my_file:
    # do something
it will start at line 3.
Strange extra lines?
print(myline)
will likely print an extra empty line. This is due to a newline read from the file and a newline added by print(). Solution:
Python 3:
print(myline, end='')
Python 2:
print myline, # note the trailing comma.
Playing it safe
Using the with statement like this:
with open("new.txt", "r") as my_file:
    for myline in my_file:
        print(myline)
    # my_file is open here
# my_file is closed here
you don't need to close the file yourself: it is closed as soon as you leave the context, i.e. as soon as your code continues at the same indentation level as the with statement.
You can actually take care of all of this at once by iterating over the file contents:
my_file = open("new.txt", "r")
length = 0
for line in my_file:
    length += 1
    print(line)
my_file.close()
At the end, you will have printed all of the lines, and length will contain the number of lines in the file. (If you don't specifically need to know length, there's really no need for it!)
Another way to do it, which will close the file for you (and, in fact, will even close the file if an exception is raised):
length = 0
with open("new.txt", "r") as my_file:
    for line in my_file:
        length += 1
        print(line)
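If two separate passes over the file really are needed, as in the question, the asker's guess is right: the first pass leaves the position at the end of the file, and seek(0) rewinds it. A minimal sketch, with count_then_collect as a made-up name:

```python
def count_then_collect(path):
    """Count the lines in a first pass, then read them in a second pass.

    `count_then_collect` is a hypothetical name used for illustration.
    """
    with open(path, "r", encoding="utf-8") as my_file:
        count = sum(1 for _ in my_file)  # first pass reaches the end of file
        my_file.seek(0)                  # rewind so the second pass starts over
        lines = [line.rstrip("\n") for line in my_file]
    return count, lines
```

Without the seek(0) call, the second loop would produce an empty list, which is the same effect the question ran into.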
