Python error in processing lines from a file - python

I wrote a Python script on Windows 8.1 using the Sublime Text editor, and I just tried to run it from the terminal on OS X Yosemite, but I get an error.
My error occurs when parsing the first line of a .CSV file. This is the relevant slice of the code:
# lines is an array where each element is a line of the file, read in as a string
# we split the string by the desired delimiter
# we skip the first line because that is the header information (else condition)
# For the last index in the for loop i = numlines - 1 = the number of lines in the file - 2
# We only add one to the value of i because the last line is blank in the file
for i in range(numlines):
    if i == numlines-1:
        dataF = lines[i+1].split(',')
    else:
        dataF = lines[i+1].split(',')
    dataF1 = list(dataF[3])
    del(dataF1[len(dataF1)-1])
    del(dataF1[len(dataF1)-1])
    del(dataF1[0])
    f[i] = ''.join(dataF1)
return f
All the lines in the CSV file look like this (with the exception of the header line):
"08/06/2015","19:00:00","1","410"
So the code splits a single line into an array where each element corresponds to one of the 4 comma-separated values in that line of the CSV file. Then we take the element at index 3 of the array, "410", and create a list that should look like
['"','4','1','0','"','\n']
(and it does when run from windows)
but it instead looks like
['"','4','1','0','"','\r','\n']
and so when I join the list back into a string with the above code I get 410" instead of 410.
My question is: where did the '\r' come from? It does not appear when the same file is processed on a Windows machine. At first I thought it was the text format, so I saved the CSV file as UTF-8; that didn't work. I tried changing the tab size from 4 to 8 spaces; that didn't work either. I'm running out of ideas now. Any help would be greatly appreciated.
Thanks

The "\r" is part of the line separator. Different platforms use different line separators: Windows uses "\r\n", while OS X and Linux use "\n". Your file has Windows line endings. On Windows, Python translates "\r\n" to "\n" when it reads a text file, so you never saw the "\r"; on OS X (with Python 2's default text mode) the "\r" is left in place.
A simple fix: if you read a line from a file yourself, line.rstrip() will remove the whitespace, including "\r" and "\n", from the end of the line.
A proper fix: use Python's standard CSV reader. It copes with either line ending and handles quoted strings properly.
Also, when working with long lists, it helps to stop thinking of them as index-addressed 'arrays' and to use the 'stream' or 'sequential reading' metaphor instead.
So the typical way of handling a CSV file is something like:
import csv
with open('myfile.csv') as f:
    reader = csv.reader(f)
    # We assume that the file has 3 columns; adjust to taste
    for (first_field, second_field, third_field) in reader:
        # do something with field values of the current line here
        print(first_field, second_field, third_field)
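Both fixes can be tried on a single line from the question without touching a file. The following is only a minimal sketch: the variable names are made up for illustration, and io.StringIO stands in for the real CSV file.

```python
import csv
import io

# One data line from the question, with a Windows-style line ending.
raw_line = '"08/06/2015","19:00:00","1","410"\r\n'

# Simple fix: strip the line ending, whichever it is, before splitting.
fields = raw_line.rstrip("\r\n").split(",")
value = fields[3].strip('"')  # drop the surrounding double quotes

# Proper fix: let the csv module deal with quotes and line endings.
# newline="" passes the raw line ending through to the csv module.
rows = list(csv.reader(io.StringIO(raw_line, newline="")))

print(value)
print(rows)
```

Either way the result no longer depends on how many characters the line ending happens to contain.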

Related

Using a for loop to add a new line to a table: python

I am trying to create a .bed file after searching through DNA sequences for two regular expressions. Ideally, I'd like to generate a tab-separated file which contains the sequence description, the start location of the first regex and the end location of the second regex. I know that the regex section works, it's just creating the \t separated file I am struggling with.
I was hoping that I could open/create a file and simply print a new line for each iteration of the for loop that contains this information, like so:
with open("Mimp_hits.bed", "a+") as file_object:
    for line in file_object:
        print(f'{sequence.description}\t{h.start()}\t{h_rc.end()}')
    file_object.close()
But this doesn't seem to work (it creates an empty file). I have also tried to use file_object.write, but that creates an empty file too.
This is all of the code I have including searching for the regexes:
import re, sys
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
infile = sys.argv[1]
for sequence in SeqIO.parse(infile, "fasta"):
    hit = re.finditer(r"CAGTGGG..GCAA[TA]AA", str(sequence.seq))
    mimp_length = 400
    for h in hit:
        h_start = h.start()
        hit_rc = re.finditer(r"TT[TA]TTGC..CCCACTG", str(sequence.seq))
        for h_rc in hit_rc:
            h_rc_end = h_rc.end()
            length = h_rc_end - h_start
            if length > 0:
                if length < mimp_length:
                    with open("Mimp_hits.bed", "a+") as file_object:
                        for line in file_object:
                            print(sequence.description, h.start(), h_rc.end())
                        file_object.close()
This is the desired output:
Focub_II5_mimp_1__contig_1.16(656599:656809) 2 208
Focub_II5_mimp_2__contig_1.47(41315:41540) 2 223
Focub_II5_mimp_3__contig_1.65(13656:13882) 2 224
Focub_II5_mimp_4__contig_1.70(61591:61809) 2 216
This is example input:
>Focub_II5_mimp_1__contig_1.16(656599:656809)
TACAGTGGGATGCAAAAAGTATTCGCAGGTGTGTAGAGAGATTTGTTGCTCGGAAGCTAGTTAGGTGTAGCTTGTCAGGTTCTCAGTACCCTATATTACACCGAGATCAGCGGGATAATCTAGTCTCGAGTACATAAGCTAAGTTAAGCTACTAACTAGCGCAGCTGACACAACTTACACACCTGCAAATACTTTTTGCATCCCACTGTA
>Focub_II5_mimp_2__contig_1.47(41315:41540)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTCTGCCGCTAGCCCATTTTAACAGCTAGAGTGTGTATATTAACCTCACACATAGCTATCTCTTATACTAATTGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTGTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_3__contig_1.65(13656:13882)
TACAGTGGGAGGCAATAAGTATGAATACCGGGCGTGTATTGTTTTTCTGCCGCTAGCCTATTTTAATAGTTAGAGTGTGCATATTAACCTCACACATAGCTATCTTATATACTAATCGGTTAGGGAAAACCTCTAACCAGGATTAGGAGTCAACATAGCTTCTTTTAGGCTAAGAGGTGTGTGTCAGTACACCAAAGGGTATTCATACTTATTGCCCCCCACTGTA
>Focub_II5_mimp_4__contig_1.70(61591:61809)
TACAGTGGGATGCAATAAGTTTGAATGCAGGCTGAAGTACCAGCTGTTGTAATCTAGCTCCTGTATACAACGCTTTAGCTTGATAAAGTAAGCGCTAAGCTGTATCAGGCAAAAGGCTATCCCGATTGGGGTATTGCTACGTAGGGAACTGGTCTTACCTTGGTTAGTCAGTGAATGTGTACTTGAGTTTGGATTCAAACTTATTGCATCCCACTGTA
Is anybody able to help?
Thank you :)
To write a line to a file, you would do something like this:
with open("file.txt", "a") as f:
    print("new line", file=f)
If you want it tab-separated, you can also add sep="\t". This is why Python 3 made print a function: so you can use the sep, end, file, and flush keyword arguments. :)
Opening a file for appending means the file pointer starts at the end of the file. Writing to it therefore doesn't overwrite any data (it gets appended to the end of the file), and iterating over it (or otherwise reading from it) gives nothing, because you have already reached the end of the file.
So instead of iterating over the lines of the file, you would just write the single line to it:
with open("Mimp_hits.bed", "a") as file_object:
    print(sequence.description, h.start(), h_rc.end(), file=file_object)
You can also consider opening the file once, before the outer loop, since opening it once and writing multiple times is more efficient than opening it multiple times. Also, the with block automatically closes the file, so there is no need to do that explicitly.
You are trying to open the file in "a+" mode, and loop over lines from it (which will not find anything because the file is positioned at the end when you do that). In any case, if this is an output file only, then you would open it in "a" mode to append to it.
Probably you just want to open the file once for appending, and inside the with statement, do your main loop, using file_object.write(...) when you want to actually append strings to the file. Note that there is no need for file_object.close() when using this with construct.
with open("Mimp_hits.bed", "a") as file_object:
    for sequence in SeqIO.parse(infile, "fasta"):
        # ... etc per original code ...
                    if length < mimp_length:
                        file_object.write("{}\t{}\t{}\n".format(
                            sequence.description, h.start(), h_rc.end()))
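The open-once, write-many pattern described in these answers can be tried without Biopython. This is only a sketch: the hits list and the output path are hypothetical stand-ins for the regex matches and for Mimp_hits.bed.

```python
import os
import tempfile

# Hypothetical stand-ins for (sequence.description, h.start(), h_rc.end()).
hits = [("seq_1", 2, 208), ("seq_2", 2, 223)]

# Write into a scratch directory so the sketch leaves the working directory alone.
out_path = os.path.join(tempfile.mkdtemp(), "hits.bed")

# Open the file once, then write one tab-separated line per hit.
with open(out_path, "a") as file_object:
    for description, start, end in hits:
        print(description, start, end, sep="\t", file=file_object)

# Read the file back to confirm that it is no longer empty.
with open(out_path) as f:
    content = f.read()
print(content)
```

Using print with sep="\t" and file= gives the same result as building the string by hand and calling file_object.write.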

File has two parts - 1st is text 2nd is CSV. How to parse only the CSV part with python

I have a text file which contains text in the first 20 or so lines, followed by CSV data. Some of the text in the text section contains commas, so trying csv.reader or csv.DictReader doesn't work well.
I want to skip past the text section and only then start to parse the CSV data.
Searches don't yield much other than instructions to either use csv.reader/csv.DictReader and iterate through the rows that are returned (which doesn't work because of the commas in the text), or to read the file line by line and split the lines using ',' as the delimiter.
The latter works up to a point, but it produces strings, not numbers. I could convert the strings to numbers but I'm hoping that there's a simple way to do this either with the csv or numpy libraries.
As requested - Sample data:
This is the first line. This is all just text to be skipped.
The first line doesn't always have a comma - maybe it's in the third line
Still no commas, or was there?
Yes, there was. And there it is again.
and so on
There are more lines but they finally stop when you get to
EndOfHeader
1,2,3,4,5
8,9,10,11,12
3, 6, 9, 12, 15
Thanks for the help.
Edit#2
A suggested answer gave the following link entitled Read file from line 2...
That's kind of what I'm looking for, but I want to be able to read through the lines until I find the "EndOfHeader" and then call on the CSV library to handle the remainder of the file.
The reply by saimadhu.polamuri is part of what I've tried, specifically
with open(filename , 'r') as f:
    first_line = f.readline()
    for line in f:
        #test if line equals EndOfHeader. If true then parse as CSV
But that's where it comes apart - I can't see how to have CSV work with the data from this point forward.
With thanks to @Mike for the suggestion, the code is actually reasonably straightforward.
import csv
with open('data.csv') as f:            # open the file
    for i in range(7):                 # loop over the first 7 lines
        f.readline()                   # just read and discard them
    r = csv.reader(f, delimiter=',')   # now pass the file handle to a csv reader
    for row in r:                      # and loop over the resulting rows
        print(row)                     # print the row, or do something else
In my actual code, it will search for the EndOfHeader line and use that to decide where to start parsing the CSV
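The EndOfHeader search mentioned above might look roughly like the sketch below. It is not the poster's actual code: the function name read_rows_after_marker is made up, io.StringIO stands in for the real file, and every CSV cell is assumed to be an integer, as in the sample data.

```python
import csv
import io

# A shortened version of the sample data from the question.
SAMPLE = """This is the first line. This is all just text to be skipped.
Yes, there was. And there it is again.
EndOfHeader
1,2,3,4,5
8,9,10,11,12
3, 6, 9, 12, 15
"""

def read_rows_after_marker(f, marker="EndOfHeader"):
    """Skip lines up to and including the marker, then parse the rest as CSV."""
    for line in f:
        if line.strip() == marker:
            break
    # The file object is now positioned just after the marker line.
    reader = csv.reader(f, skipinitialspace=True)
    # Convert every cell to int and ignore blank lines.
    return [[int(cell) for cell in row] for row in reader if row]

rows = read_rows_after_marker(io.StringIO(SAMPLE))
print(rows)
```

Because the csv reader simply continues from wherever the file object is positioned, no line counting is needed.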
I'm posting this as an answer, as the question that this one supposedly duplicates doesn't explicitly consider this issue of the file handle and how it can be passed to a CSV reader, and so it may help someone else.
Thanks to all who took time to help.

In place replacement of text in a file in Python

I am using the following code to upload a file on server using FTP after editing it:
import fileinput
file = open('example.php','rb+')
for line in fileinput.input('example.php'):
    if 'Original' in line :
        file.write( line.replace('Original', 'Replacement'))
file.close()
There is one thing, instead of replacing the text in its original place, the code adds the replaced text at the end and the text in original place is unchanged.
Also, instead of just the replaced text, it prints out the whole line. Could anyone please tell me how to resolve these two errors?
1) The code adds the replaced text at the end and the text in original place is unchanged.
You can't replace text in the body of the file this way. Opening it with 'rb+' allows both reading and writing, but write() simply puts the bytes at the file object's current position; it never locates the original line and substitutes it.
file = open('example.php','rb+')
So the original text is left untouched.
To bypass this you may use seek() to navigate to the position of the specific line and overwrite it. Or use 2 files: an input_file and an output_file.
2) Also, instead of just the replaced text, it prints out the whole line.
It's because you're using:
file.write( line.replace('Original', 'Replacement'))
Free Code:
I've segregated into 2 files, an inputfile and an outputfile.
First it'll open the ifile and save all lines in a list called lines.
Second, it'll read all these lines, and if 'Original' is present, it'll replace it.
After replacement, it'll save into ofile.
ifile = 'example.php'
ofile = 'example_edited.php'
with open(ifile, 'r') as f:
    lines = f.readlines()
with open(ofile, 'w') as g:
    for line in lines:
        # replace() returns the line unchanged when 'Original' is absent,
        # so every line is written and only matching lines are altered
        g.write(line.replace('Original', 'Replacement'))
Then, if you want to, you may delete the non-edited file with os.remove().
More Info: Tutorials Point: Python Files I/O
The second error is how the replace() method works.
It returns the entire input string, with only the specified substring replaced. See example here.
To write to a specific place in the file, you should seek() to the right position first.
I think this issue has been asked before in several places, I would do a quick search of StackOverflow.
Maybe this would help?
Replacing text in a file in place only works well if the original and the replacement have the same size (in bytes). Then you can do
with open('example.php','rb+') as f:
    pos=f.tell()
    line=f.readline()
    if b'Original' in line:
        f.seek(pos)
        f.write(line.replace(b'Original',b'Replacement'))
(In this case b'Original' and b'Replacement' do not have the same size so your file will look funny after this)
Edit:
If original and replacement are not the same size, there are different possibilities like adding bytes to fill the hole or moving everything after the line.
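When the sizes differ and the file is small enough to fit in memory, one simple option is to read the whole file, replace the text, and write everything back. This is a sketch under that assumption: replace_in_file is a made-up helper name, and the demonstration runs on a temporary file rather than on example.php.

```python
import os
import tempfile

def replace_in_file(path, old, new):
    """Rewrite the file at path with every occurrence of old replaced by new."""
    with open(path, "r") as f:
        text = f.read()
    # Opening with "w" truncates the file, so a replacement of a
    # different length leaves no leftover bytes behind.
    with open(path, "w") as f:
        f.write(text.replace(old, new))

# Demonstration on a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix=".php")
with os.fdopen(fd, "w") as f:
    f.write("<?php\necho 'Original';\n?>\n")

replace_in_file(demo_path, "Original", "Replacement")

with open(demo_path) as f:
    result = f.read()
os.remove(demo_path)
print(result)
```

The file is briefly empty between the truncation and the write, so this is not safe against a crash at that moment; writing to a second file, as suggested earlier in the thread, avoids that risk.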

Not understanding read command in Python

I'm trying to understand what's going on with my read function. I'm simply doing a readline of a text document I created in Canopy. For some reason it only gives me w, whatever value I put in. I'm new to the world of Python, so I'm sure it's an easy answer! Thanks for your help!
import os
my_file = open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"),'r')
print my_file.readline(3)
my_file.close()
My Text document is below
w
o
r
d
s
my_file.readline(3) reads up to 3 bytes from the first line.
The first line contains a w and an end-of-line character.
If you want to read up to the first 3 bytes regardless of the line, use my_file.read(3). Note that end-of-line characters are included in the count.
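The difference between the two calls can be seen on an in-memory copy of the file. This is a small sketch in which io.StringIO stands in for the text document from the question.

```python
import io

# Same content as the text document: one letter per line.
text = "w\no\nr\nd\ns\n"
f = io.StringIO(text)

# readline(3) stops at the end of the first line, even though
# up to 3 characters were allowed.
first = f.readline(3)

# read(3) ignores line boundaries and counts the newline as a character.
f.seek(0)
chunk = f.read(3)

print(repr(first))
print(repr(chunk))
```

Here first holds only the w and its newline, while chunk continues into the second line.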
If you want to print the first 3 lines, you could use
import os
with open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"),'r') as my_file:
    for i, line in enumerate(my_file):
        if i >= 3: break
        print(line)
or
import itertools as IT
with open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"),'r') as my_file:
    for line in IT.islice(my_file, 3):
        print(line)
For short files you could instead use
with open(os.path.expanduser("~/Desktop/Python Files/Test Text.txt"),'r') as my_file:
    lines = my_file.readlines()
for line in lines[:3]:
    print(line)
but note that my_file.readlines() returns a list of all the lines in the
file. Since this can be very memory-intensive if the file is huge, and since it
is usually possible to process a file line-by-line (which is much less
memory-intensive), generally the first two methods of reading a file are
preferred over the third.
"readline([size]) -> next line from the file, as a string. Retain newline. A non-negative size argument limits the maximum number of bytes to return (an incomplete line may be returned then). Return an empty string at EOF."
readline reads the next line each time it is called. The size argument is the maximum number of bytes it should read from that line.
Using f.readline does not give random access to the file. I think you want to read the third (or maybe fourth if you're zero-indexing) line. The argument that you're passing to f.readline is a maximum byte count to read, rather than a specific line to read.

Reading in line from txt file in python

I seem to be having trouble reading in lines from a text file. When I call f.readline(), I can save the result to a string and print the correct text. However, when I print, say, the first or second character of that string, it prints a strange dotted, checkerboard-pattern character instead of the correct letter.
Edit: OK, so when I try alfasin's method I seem to get the correct length for each line except the first line that is read in. If I'm, say, reading in 5 lines and looking for a space, the first line will find the first space at position 13 when it should find it at position 8. However, the next lines read in all produce the correct length and location of the space.
Edit2: Also, the text file I am reading in is UTF-8.
Edit3: It definitely was an issue with the encoding of the text file. I changed it to ANSI and everything started working as it should.
Try the following:
with open('filename.txt') as file:
    for line in file:
        print line
        # and if you want to break it down to characters:
        print list(line)
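The thread does not say which bytes caused the offset in the first line. One possible cause, offered here purely as an assumption, is a byte order mark (BOM) at the start of a UTF-8 file. If so, decoding the file explicitly is an alternative to re-saving it as ANSI. The sample text and the temporary file below are made up for illustration.

```python
import io
import os
import tempfile

# Create a throwaway UTF-8 file that starts with a BOM (assumed cause).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfcaf\xc3\xa9 au lait\n")

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
with io.open(path, encoding="utf-8-sig") as f:
    line = f.readline()
os.remove(path)

# Positions now count characters, not raw bytes.
space_index = line.find(" ")
print(space_index)
```

With explicit decoding, the accented letter counts as one character and the BOM no longer shifts positions in the first line.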
