I have data in a .txt file, in the form a of a comma separated list. For example:
N12345678,B,A,D,D,C,B,D,A,C,C,D,B,A,B,A,C,B,D,A,C,A,A,B,D,D
N12345678,B,A,D,D,C,B,D,A,C,C,D,B,A,B,A,C,B,D,A,C,A,A,B,D,D
I want to be able to split it up, first by line, then by comma, so I'm able to process the data and validate it. I am getting "invalid" for all of lines in my code, even though some of them should be valid because there should be 26 characters per line. Here is my code so far:
(filename+".txt").split("\n")
(filename+".txt").split(",")
with open(filename+".txt") as f:
for line in f:
if len(line) != 26:
print ("invalid")
else:
print ("valid")
This code is quite far from working; it's syntactically valid Python, but it doesn't mean anything sensible.
# These two lines add two strings together, returning a string
# then they split the string into pieces into a list
# because the /filename/ has no newlines in it, and probably no commas
# that changes nothing
# then the return value isn't saved anywhere, so it gets thrown away
(filename+".txt").split("\n")
(filename+".txt").split(",")
# This opens the file and reads from it line by line,
# which means "line" is a string of text for each line in the file.
with open(filename+".txt") as f:
for line in f:
# This checks if the line in the file is not the /number/ 26
# since the file contains /strings/ it never will be the number 26
if line != 26:
print ("invalid")
# so this is never hit
else:
print ("valid")
[Edit: even in your updated code, the line is the whole text "N12345678,B,A,D..." and because of the commas, len(line) will be longer than 26 characters.]
It seems you want something more like: Drop the first two lines of your code completely, read through the file line by line (meaning you normally don't have to care about "\n" in your code). Then split each line by commas.
with open(filename+".txt") as f:
for line in f:
line_parts = line.split(",")
if len(line_parts) != 26:
print ("invalid")
else:
print ("valid")
# line_parts is now a list of strings
# ["N12345678" ,"B", "A", ...]
I think an easier way to do this would be to use csv module.
import csv
with open("C:/text.csv") as input:
reader = csv.reader(input)
for row in reader:
if len(row) == 26:
print("Valid")
else:
print("Invalid")
As far as I understand your question, you want this.
with open(filename, 'r') as f:
for line in f:
if len(line.split(',')) !=26:
print("Invalid")
else:
print("Valid")
All it does is,
Open the file.
Read the file line by line.
For each line, split the line by ,
As str.split() returns a list, check if length of the list is 26 or not.
If length is 26, count as valid; otherwise not.
Related
I have a CSV file that has errors. The most common one is a too early linebreak.
But now I don't know how to remove it ideally. If I read the line by line
with open("test.csv", "r") as reader:
test = reader.read().splitlines()
the wrong structure is already in my variable. Is this still the right approach and do I use a for loop over test and create a copy or can I manipulate directly in the test variable while iterating over it?
I can identify the corrupt lines by the semikolon, some rows end with a ; others start with it. So maybe counting would be an alternative way to solve it?
EDIT:
I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;
for line in lines:
if("Foobar" in line):
line = line.replace("Foobar", "")
if(";\n" in line):
line = line.replace(";\n", ";")
The only thing that remains are rows that beginn with a ;
Since I need to go back one entry in the list
Example:
Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub
Blub belongs in the row above.
Here's a simple Python script to merge lines until you have the desired number of fields.
import sys
sep = ';'
fields = 4
collected = []
for line in sys.stdin:
new = line.rstrip('\n').split(sep)
if collected:
collected[-1] += new[0]
collected.extend(new[1:])
else:
collected = new
if len(collected) < fields:
continue
print(';'.join(collected))
collected = []
This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost.
The separator and the number of fields can be edited into the variables at the top; exposing these as command-line parameters left as an exercise.
If you wanted to keep the newlines, it would not be too hard to only strip a newline from the last fields, and use csv.writer to write the fields back out as properly quoted CSV.
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle.
Parameters of the function are:
message - content of the file - reader.read() in your case
columns - number of expected columns
filename - filename (I use it for logging)
def pre_parse(message, columns, filename):
parsed_message=[]
i =0
temp_line =''
for line in message.splitlines():
#print(line)
split = line.split(',')
if len(split) == columns:
parsed_message.append(line)
elif len(split) > columns:
print(f'Line {i} has been truncated in file {filename} - too much columns'))
split = split[:columns]
line = ','.join(split)
parsed_message.append(line)
elif len(split) < columns and temp_line =='':
temp_line = line.replace('\n','')
print(temp_line)
elif temp_line !='':
line = temp_line+line
if line.count(',') == columns-1:
print((f'Line {i} has been fixed in file {filename} - extra line feed'))
parsed_message.append(line)
temp_line =''
else:
temp_line=line.replace('\n', '')
i+=1
return parsed_message
make sure you use proper split character and proper line feed characer.
I have the following problem. I am supposed to open a CSV file (its an excel table) and read it without using any library.
I tried already a lot and have now the first row in a tuple and this in a list. But only the first line. The header. But no other row.
This is what I have so far.
with open(path, 'r+') as file:
results=[]
text = file.readline()
while text != '':
for line in text.split('\n'):
a=line.split(',')
b=tuple(a)
results.append(b)
return results
The output should: be every line in a tuple and all the tuples in a list.
My question is now, how can I read the other lines in python?
I am really sorry, I am new to programming all together and so I have a real hard time finding my mistake.
Thank you very much in advance for helping me out!
This problem was many times on Stackoverflow so you should find working code.
But much better is to use module csv for this.
You have wrong indentation and you use return results after reading first line so it exits function and it never try read other lines.
But after changing this there are still other problems so it still will not read next lines.
You use readline() so you read only first line and your loop will works all time with the same line - and maybe it will never ends because you never set text = ''
You should use read() to get all text which later you split to lines using split("\n") or you could use readlines() to get all lines as list and then you don't need split(). OR you can use for line in file: In all situations you don't need while
def read_csv(path):
with open(path, 'r+') as file:
results = []
text = file.read()
for line in text.split('\n'):
items = line.split(',')
results.append(tuple(items))
# after for-loop
return results
def read_csv(path):
with open(path, 'r+') as file:
results = []
lines = file.readlines()
for line in lines:
line = line.rstrip('\n') # remove `\n` at the end of line
items = line.split(',')
results.append(tuple(items))
# after for-loop
return results
def read_csv(path):
with open(path, 'r+') as file:
results = []
for line in file:
line = line.rstrip('\n') # remove `\n` at the end of line
items = line.split(',')
results.append(tuple(items))
# after for-loop
return results
All this version will not work correctly if you will '\n' or , inside item which shouldn't be treated as end of row or as separtor between items. These items will be in " " which also can make problem to remove them. All these problem you can resolve using standard module csv.
Your code is pretty well and you are near goal:
with open(path, 'r+') as file:
results=[]
text = file.read()
#while text != '':
for line in text.split('\n'):
a=line.split(',')
b=tuple(a)
results.append(b)
return results
Your Code:
with open(path, 'r+') as file:
results=[]
text = file.readline()
while text != '':
for line in text.split('\n'):
a=line.split(',')
b=tuple(a)
results.append(b)
return results
So enjoy learning :)
One caveat is that the csv may not end with a blank line as this would result in an ugly tuple at the end of the list like ('',) (Which looks like a smiley)
To prevent this you have to check for empty lines: if line != '': after the for will do the trick.
Please see following attached image showing the format of the text file. I need to extract the dimensions of data matrix indicated by the first line in the file, here 49 * 70 * 1 for the case shown by the image. Note that the length of name "gd_fac" can be varying. How can I extract these numbers as integers? I am using Python 3.6.
Specification is not very clear. I am assuming that the information you want will always be in the first line, and always be in parenthesis. After that:
with open(filename) as infile:
line = infile.readline()
string = line[line.find('(')+1:line.find(')')]
lst = string.split('x')
This will create the list lst = [49, 70, 1].
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string. The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that. I select only the parts of that line that fall after the open paren (, and before the close paren ). Finally, I break the string into parts, with the character x as the separator. This creates a list that contains the values in the first line of the file, which fall between parenthesis, and are separated by x.
Since you have mentioned that length of 'gd_fac' van be variable, best solution will be using Regular Expression.
import re
with open("a.txt") as fh:
for line in fh:
if '(' in line and ')' in line:
dimension = re.findall(r'.*\((.*)\)',line)[0]
break
print dimension
Output:
'49x70x1'
What this does is it looks for "gd_fac"
then if it's there is removes all the unneeded stuff and replaces it with just what you want.
with open('test.txt', 'r') as infile:
for line in infile:
if("gd_fac" in line):
line = line.replace("gd_fac", "")
line = line.replace("x", "*")
line = line.replace("(","")
line = line.replace(")","")
print (line)
break
OUTPUT: "49x70x1"
I'm trying to format a tab delimited txt file that has rows and columns. I'm trying to simply ignore the rows that have any empty values in it when I write to the output file. I'm doing this by len(list) method where if the length of the list equals the number of columns, then that line gets written to output file. But when I check the length of the lines, they are all the same, even though I removed the empty strings! Very frustrating...
Here's my code:
import sys, os
inputFileName = sys.argv[1]
outputFileName = os.path.splitext(inputFileName)[0]+"_edited.txt"
try:
infile = open(inputFileName,'r')
outfile = open(outputFileName, 'w')
line = infile.readline()
outfile.write(line)
for line in infile:
lineList = line.split('\t')
#print lineList
if '' in lineList:
lineList.remove('')
#if len(lineList) < 9:
#print len(lineList)
#outfile.write(line)
infile.close()
#outfile.close()
except IOError:
print inputFileName, "does not exist."
Thanks for any help. When I create an experimental list in the interactive window and use the if '' in list: then it removes it. When I run the code, the ' ' is still there!
I dont know any python but i can mention you dont seem to be checking for whitespace characters. What about \r, \n on top of the \t's. Why dont you try trimming the line and checking if its == ''
I think that one of your problems is that list.remove only removes the first occurrence of the element. There could still be more empty strings in your list. From the documentation:
Remove the first item from the list whose value is x. It is an error if there is no such item.
To remove all the empty strings from your list you could use a list comprehension instead.
lineList = [x for x in lineList if x]
or filter with the identity function (by passing None as the first argument):
lineList = filter(None, lineList)
The following does what you're asking with fewer lines of code and removes empty lines of any kind of whitespace thanks to the strip() call.
#!/usr/bin/env python
import sys, os
inputFileName = sys.argv[1]
outputFileName = os.path.splitext(inputFileName)[0]+"_edited.txt"
try:
infile = open(inputFileName,'r')
outfile = open(outputFileName, 'w')
for line in infile.readlines():
if line.strip():
outfile.write(line)
infile.close()
outfile.close()
except IOError:
print inputFileName, "does not exist."
EDIT:
For clarity, this reads each line of the input file then strips the line of leading and trailing whitespace (tabs, spaces, etc.) and writes the non-empty lines to the output file.
I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
for line in FASTA:
line = line.strip()
print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
_unused = FASTA.next() # skip heading record
for line in FASTA:
line = line.strip()
print line
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
you should show your script. To read from second line, something like this
f=open("file")
f.readline()
for line in f:
print line
f.close()
You might be interested in checking BioPythons handling of Fasta files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
"""Generator function to iterate over Fasta records (as SeqRecord objects).
handle - input file
alphabet - optional alphabet
title2ids - A function that, when given the title of the FASTA
file (without the beginning >), will return the id, name and
description (in that order) for the record as a tuple of strings.
If this is not given, then the entire title line will be used
as the description, and the first word as the id and name.
Note that use of title2ids matches that of Bio.Fasta.SequenceParser
but the defaults are slightly different.
"""
#Skip any text before the first record (e.g. blank lines, comments)
while True:
line = handle.readline()
if line == "" : return #Premature end of file, or just empty?
if line[0] == ">":
break
while True:
if line[0]!=">":
raise ValueError("Records in Fasta files should start with '>' character")
if title2ids:
id, name, descr = title2ids(line[1:].rstrip())
else:
descr = line[1:].rstrip()
id = descr.split()[0]
name = id
lines = []
line = handle.readline()
while True:
if not line : break
if line[0] == ">": break
#Remove trailing whitespace, and any internal spaces
#(and any embedded \r which are possible in mangled files
#when not opened in universal read lines mode)
lines.append(line.rstrip().replace(" ","").replace("\r",""))
line = handle.readline()
#Return the record and then continue...
yield SeqRecord(Seq("".join(lines), alphabet),
id = id, name = name, description = descr)
if not line : return #StopIteration
assert False, "Should not reach this line"
good to see another bioinformatician :)
just include an if clause within your for loop above the line.strip() call
def readSeq(FASTA):
for line in FASTA:
if line.startswith('>'):
continue
line = line.strip()
print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTG
TTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end.
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
print line
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.