I have a file that contains blocks of information, each beginning and ending with the same phrases:
# Info block
Info line 1
Info line 2
Internal problem
ENDOFPARAMETERPOINT
I am trying to write a Python script that deletes the entire block beginning with # Info block and ending with ENDOFPARAMETERPOINT whenever it detects the phrase Internal problem.
import re

finds = '# Info block\nInfo line 1\nInfo line 2\nInternal problem\nENDOFPARAMETERPOINT'

with open(filename, "r+") as fp:
    pattern = re.compile(r'[,\s]+' + re.escape(finds) + r'[\s]+')
    textdata = fp.read()
    line = re.sub(pattern, '', textdata)
    fp.seek(0)
    fp.write(line)
This code only works for one line but not the entire paragraph. Any suggestions are appreciated.
EDIT:
The code that works now is:
with open(filename, "r+") as fp:
    pattern = re.compile(re.escape(finds))
    textdata = fp.read()
    line = re.sub(pattern, '', textdata)
    fp.seek(0)
    fp.write(line)
    fp.truncate()
Why can't you just use pattern = re.compile(re.escape(finds))?
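If the lines inside a block can vary and the goal is to drop only the blocks that mention Internal problem, one way to sketch it is a single re.DOTALL pattern. This is only an illustration, not the poster's code, and the filename is a placeholder:

import re

# A sketch: delete every block that starts with "# Info block", ends with
# "ENDOFPARAMETERPOINT", and contains "Internal problem" somewhere in between.
# re.DOTALL lets the dot span newlines; the (?!ENDOFPARAMETERPOINT) guard stops
# a match from running past the end of one block into the next one.
block = re.compile(
    r'# Info block(?:(?!ENDOFPARAMETERPOINT).)*?Internal problem.*?ENDOFPARAMETERPOINT\n?',
    re.DOTALL,
)

with open("parameters.txt", "r+") as fp:   # hypothetical filename
    text = fp.read()
    fp.seek(0)
    fp.write(block.sub('', text))
    fp.truncate()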
You can use two lists, start_indexes and stop_indexes, which hold respectively the indexes where a block starts and the indexes where it ends. You can then pair the two lists with the built-in zip function so that each pair gives the start and end index of one block to remove. Finally, delete each of those ranges from the original list, working from the last pair to the first so the deletions don't shift the indexes of the earlier blocks.
In this example, the text to be processed, already split into lines, is stored in vals.
vals = ['string', '#blabla', 'ciao', 'miao', 'bau', 'ENDOFPARAMETERPOINT', 'as']

start_indexes = []
stop_indexes = []
for index, line in enumerate(vals):
    if line.startswith('#'):
        start_indexes.append(index)
    elif line == 'ENDOFPARAMETERPOINT':
        stop_indexes.append(index)

# Delete the ranges from last to first so the earlier indexes stay valid.
for start, stop in reversed(list(zip(start_indexes, stop_indexes))):
    del vals[start:stop + 1]

# vals is now ['string', 'as']
I have a CSV file that has errors. The most common one is a too early linebreak.
But now I don't know the best way to remove it. If I read the file line by line with
with open("test.csv", "r") as reader:
test = reader.read().splitlines()
the wrong structure is already in my variable. Is this still the right approach? Should I loop over test and build a corrected copy, or can I manipulate the test variable directly while iterating over it?
I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with one. So maybe counting the separators would be an alternative way to solve it?
EDIT:
I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows which end with a ;
for line in lines:
    if "Foobar" in line:
        line = line.replace("Foobar", "")
    if ";\n" in line:
        line = line.replace(";\n", ";")
The only thing that remains are the rows that begin with a ;, since for those I need to go back one entry in the list.
Example:
Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub
Blub belongs in the row above.
Here's a simple Python script to merge lines until you have the desired number of fields.
import sys

sep = ';'
fields = 4

collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(';'.join(collected))
    collected = []
This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost.
The separator and the number of fields can be edited in the variables at the top; exposing these as command-line parameters is left as an exercise.
If you wanted to keep the newlines, it would not be too hard to only strip a newline from the last fields, and use csv.writer to write the fields back out as properly quoted CSV.
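For instance, a rough sketch of that csv.writer variant, reusing the sep and fields settings from above (the details here are assumptions, not part of the original script):

import csv
import sys

sep = ';'
fields = 4

# Same merging idea as above, but keep the premature newline inside the
# merged field and let csv.writer quote it instead of discarding it.
writer = csv.writer(sys.stdout, delimiter=sep)
collected = []
for line in sys.stdin:
    new = line.split(sep)            # newline stays attached to the last field
    if collected:
        collected[-1] += new[0]      # the stray newline ends up inside this field
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    collected[-1] = collected[-1].rstrip('\n')   # only strip the row's final newline
    writer.writerow(collected)
    collected = []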
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle.
Parameters of the function are:
message - content of the file - reader.read() in your case
columns - number of expected columns
filename - filename (I use it for logging)
def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''
    for line in message.splitlines():
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
            print(temp_line)
        elif temp_line != '':
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message
Make sure you use the proper split character and the proper line-feed character.
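A hypothetical usage sketch (the filenames and column count are just examples; per the note above, you would switch the split character from ',' to ';' for the semicolon-separated file in the question):

# Hypothetical usage: read the raw file, repair the rows in memory,
# then write the cleaned rows back out to a new file.
with open("test.csv", "r") as reader:
    content = reader.read()

fixed_rows = pre_parse(content, 4, "test.csv")

with open("test_fixed.csv", "w") as writer:
    writer.write("\n".join(fixed_rows) + "\n")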
I have a log file which shows data sent in the below format -
2019-10-17T00:00:02|Connection(10.0.0.89 :0 ) r=0 s=1024
d=0 t=0 q=0 # connected
2019-10-17T00:00:02|McSend (229.0.0.70 :20001) b=1635807
f=2104 d=0 t=0
There will be multiple lines per file
How can I graph the b= value against the time (near the beginning of the line), but only from the McSend lines?
Thanks
If you're not familiar with regular expressions - python regex documentation is a good place to start.
The simplest regex you probably need is r"^(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)\|.+McSend.+b=(\d+)"
The first group will allow you to compare the timestamps and the second will give you the b value.
import re

pattern = r"^(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)\|.+McSend.+b=(\d+)"
# result is a list of tuples containing the time stamp and the value for b;
# re.MULTILINE makes ^ match at the start of every line of the input
result = re.findall(pattern, some_input, re.MULTILINE)
You should read your file line by line, then check whether each line contains 'McSend'. If it does, retrieve the desired data.
You could do something like this:
b_values = []
dates = []

## Let's open the file and read it line by line
with open(filepath) as f:
    for line in f:
        ## If the line contains McSend
        if 'McSend' in line:
            ## Split the line on whitespace (split() with no arguments does so)
            split_line = line.split()
            ## The first chunk contains the header where the date is located
            header = split_line[0]
            ## Then retrieve the b value
            for val in split_line:
                if val.startswith('b='):
                    b_value = val.split("=", 1)[1]
            ## Now add the values to the lists so you can plot what you need
            b_values.append(b_value)
            dates.append(header.split("|", 1)[0])

## Do your plot
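To finish the job, a minimal plotting sketch, assuming matplotlib is available (not part of the original answer):

# Minimal plotting sketch (assumes matplotlib is installed).
import matplotlib.pyplot as plt
from datetime import datetime

# Parse the ISO-style timestamps and turn the b values into integers.
times = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%S") for d in dates]
values = [int(b) for b in b_values]

plt.plot(times, values, marker='o')
plt.xlabel("time")
plt.ylabel("b value (McSend)")
plt.gcf().autofmt_xdate()   # slant the timestamp labels so they stay readable
plt.show()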
I have to compress a file into a list of words and a list of positions so that the original file can be recreated. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation: using the map function, my program can't convert my list of positions into floats because of the '[' character, since the positions were written out as a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
    line = text.readline()
    if not line:
        break
    TwoList = line.split()
    for word in TwoList:
        if word not in CharactersUnique:
            CharactersUnique.append(word)
        ListOfPositions.append(CharactersUnique.index(word))
    if not DownLine:
        CharactersUnique.append("\n")
        DownLine = True
    ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
    w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
    NewWordsUnique = f.readline()
f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats? They are list indices, and those must be integers. I suspect this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed the PEP 8 - Style Guide for Python Code. In particular, many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition, some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variable names into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role: for example, CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they would all be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. It caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in (NewListOfPositions))
because it adds a space before and after every "\n" character, spaces that aren't there in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a generator expression, because doing so allows the newline "words" to be handled properly.
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"

# Two lists to represent contents of input file.
unique_words = ["\n"]  # preload with newline "word"
word_positions = []

with open(input_filename, "r") as input_file:
    for line in input_file:
        for word in line.split():
            if word not in unique_words:
                unique_words.append(word)
            word_positions.append(unique_words.index(word))
        word_positions.append(unique_words.index("\n"))  # add newline at end of each line

# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
    words_repr = " ".join(repr(word) for word in unique_words)
    compr_file.write(words_repr + "\n")
    positions_repr = " ".join(repr(posn) for posn in word_positions)
    compr_file.write(positions_repr + "\n")

def strip_quotes(word):
    """Strip the first and last characters from the string (assumed to be quotes)."""
    tmp = word[1:-1]
    return tmp if tmp != "\\n" else "\n"  # newline "words" are special case

# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
    line = compr_file.readline()
    new_unique_words = list(map(strip_quotes, line.split()))
    line = compr_file.readline()
    new_word_positions = map(int, line.split())  # using int, not float here

words = []
lines = []
for posn in new_word_positions:
    word = new_unique_words[posn]
    if word != "\n":
        words.append(word)
    else:
        lines.append(" ".join(words))
        words = []

print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}

with open("speech.txt") as f:
    data = f.read()

# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")

for i, line in enumerate(lines):
    words = line.split(" ")
    for j, word in enumerate(words):
        if word in all_words:
            all_words[word].append((i, j))  # line and pos
        else:
            all_words[word] = [(i, j)]
Note that this does not yield maximum compression as foo and foo. count as separate words. If you want more compression, you'll have to go character by character. Hopefully now you can use a similar approach to do so if desired.
Using the following in Python 2.7:
dfile = 'new_data.txt' # Depth file no. 1
d_row = [line.strip() for line in open(dfile)]
I have loaded a data file into a list without the newline characters. Now I want to find the index of every element in d_row whose string does not begin with a number or is empty. Next, I need to:
remove all of the non-numeric instances identified above, and
save those strings and their indexes for later insertion into an updated file.
Example of data:
Thu Mar 14 18:17:05 2013
Fri Mar 15 01:40:25 2013
FT
DepthChange: 0.000000,2895.336,0.000
1363285025.250000,9498.970
1363285025.300000,9498.970
1363285026.050000,9498.970
1363287840.450042,9458.010
1363287840.500042,9458.010
1363287840.850042,9458.010
1363287840.900042,9458.010
DepthChange: 0.000000,2882.810,9457.200
1363287840.950042,9458.010
DepthChange: 0.000000,2882.810,0.000
1363287841.000042,9457.170
1363287841.050042,9457.170
1363287841.100042,9457.170
1363287841.150042,9457.170
1363287841.200042,9457.170
1363287841.250042,9457.170
1363287841.300042,9457.170
1363291902.750102,9149.937
1363291902.800102,9149.822
1363291902.850102,9149.822
1363291902.900102,9149.822
1363291902.950102,9149.822
1363291903.000102,9149.822
1363291903.050102,9149.708
1363291903.100102,9149.708
1363291903.150102,9149.708
1363291903.200102,9149.708
1363291903.250102,9149.708
1363291903.300102,9149.592
1363291903.350102,9149.592
1363291903.400102,9149.592
1363291903.450102,9149.592
1363291903.500102,9149.592
DepthChange: 0.000000,2788.770,2788.709
1363291903.550102,9149.479
1363291903.600102,9149.379
I have been doing the removal step manually, which is time consuming because the file contains over half a million rows. Currently I am unable to rewrite the file with all of the original elements and these modifications applied.
Any tips would be much appreciated.
dfile = 'new_data.txt'

with open(dfile) as infile:
    numericLines = set()  # line numbers of lines that start with digits
    emptyLines = set()    # line numbers of lines that are empty
    charLines = []        # lines that start with a letter
    for lineno, line in enumerate(infile):
        if line[0].isalpha():
            charLines.append(line.strip())
        elif line[0].isdigit():
            numericLines.add(lineno)
        elif not line.strip():
            emptyLines.add(lineno)
The easiest way to do this is in two passes: First get the lines and line numbers of the non-matching lines, and then get the lines of the matching lines.
d_rows = [line.strip() for line in open(dfile)]
good_rows = [(i, row) for i, row in enumerate(d_rows) if is_good_row(row)]
bad_rows = [(i, row) for i, row in enumerate(d_rows) if not is_good_row(row)]
This does mean making two passes over the list, but who cares? If the list is small enough to read the whole thing into memory as you're already doing, the extra cost is probably negligible.
Alternatively, if you need to avoid the cost of building two lists in two passes, you probably also need to avoid reading the whole file at once in the first place, so you'll have to do things a little more cleverly:
d_rows = (line.strip() for line in open(dfile))  # notice genexp, not list comp
good_rows, bad_rows = [], []
for i, row in enumerate(d_rows):
    if is_good_row(row):
        good_rows.append((i, row))
    else:
        bad_rows.append((i, row))
If you can push things even farther back to the point where you don't even need explicit good_rows and bad_rows lists, you can keep everything in an iterator all the way through, and waste no memory or up-front reading time at all:
d_rows = (line.strip() for line in open(dfile))  # notice genexp, not list comp
with open(outfile, 'w') as f:
    for i, row in enumerate(d_rows):
        if is_good_row(row):
            f.write(row + '\n')
        else:
            whatever_you_wanted_to_do_with(i, row)
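In all of these snippets, is_good_row is a placeholder for whatever test you need. A minimal sketch, assuming a "good" row is one that starts with a digit as in your data, might be:

def is_good_row(row):
    # A "good" row is non-empty and begins with a digit,
    # e.g. "1363285025.250000,9498.970".
    return bool(row) and row[0].isdigit()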
Thanks to all who replied to my question. Using a part of each reply I was able to attain the desired result. What finally worked is as follows:
import numpy as np

goodrow_ind, badrow_ind, badrows = [], [], []
with open(ifile) as d_rows, open(ofile, 'w') as f:
    for i, row in enumerate(d_rows):
        if row[0].isdigit():
            f.write(row)
            goodrow_ind.append(i)
        else:
            badrow_ind.append(i)
            badrows.append(row)

data = np.loadtxt(open(ofile, 'rb'), delimiter=',')
The result is "good" and "bad" rows separated with an index for each.
I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (aka the header) of this FASTA file, and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
So far I have this:
FASTA = open("test.txt", "r")

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print line

readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = FASTA.next()  # skip heading record
    for line in FASTA:
        line = line.strip()
        print line
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
You should show your script. To read from the second line, do something like this:
f = open("file")
f.readline()   # consume the header line
for line in f:
    print line
f.close()
You might be interested in checking BioPython's handling of Fasta files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).

    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
        file (without the beginning >), will return the id, name and
        description (in that order) for the record as a tuple of strings.
        If this is not given, then the entire title line will be used
        as the description, and the first word as the id and name.

    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
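If you only need the sequences, a short sketch using BioPython's higher-level Bio.SeqIO interface (assuming BioPython is installed) does the header-skipping for you:

from Bio import SeqIO

# Each record bundles the header and the concatenated sequence lines,
# so the ">" line never shows up in record.seq.
handle = open("test.txt")
for record in SeqIO.parse(handle, "fasta"):
    print record.seq
handle.close()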
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end.
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools

f = open("filename")
# start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print line
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.