Python Regex: User Inputs Multiple Search Terms - python

What my code is supposed to do is take in user input search terms, then iterate through a tcp dump file and find every instance of each term by packet. The src IP acts as the header to each packet in my output.
So I am having an issue with the fileIn being seemingly erased when it iterates through the first term. So when the program goes to look at the second user input search term it obviously can't find anything. Here is what I have:
import re
searchTerms = []
fileIn = open('ascii_dump.txt', 'r')
while True:
    userTerm = input("Enter the search terms (End to stop): ")
    if userTerm == 'End':
        break
    else:
        searchTerms.append(userTerm)
ipPattern = re.compile(r'((?:\d{1,3}\.){3}\d{1,3})')
x = 0
while True:
    print("Search Term is:", searchTerms[x])
    for line in fileIn:
        ipMatch = ipPattern.search(line)
        userPattern = re.compile(searchTerms[x])
        userMatch = userPattern.search(line)
        if ipMatch is not None:
            print(ipMatch.group())
        if userMatch is not None:
            print(userMatch.group())
    x += 1
    if x >= len(searchTerms):
        break

This happens because you opened the file object as an iterator, which is consumed in the first pass through the for loop.
During the second time through the loop, the for line in fileIn will not be evaluated since the iterator fileIn has already been consumed.
A quick fix is to do this:
lines = open('ascii_dump.txt', 'r').readlines()
then in your for loop, change the for line in fileIn to:
for line in lines:
Having said this, you should rewrite your code to do all regex matches in a single pass using the regex or operator.
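As a rough sketch of that single-pass idea (the function name, the return format and the sample lines are illustrative assumptions, not taken from the question), the terms can be joined with `|` into one compiled pattern so the file is read only once. `re.escape` is used here on the assumption that the terms are plain text; drop it if users are expected to type real regular expressions.

```python
import re

IP_PATTERN = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

def search_terms_single_pass(lines, search_terms):
    """Return (ip, matched_text) pairs found in one pass over lines.

    lines is any iterable of strings; an open file works too.
    """
    if not search_terms:
        return []
    # Join the terms with the regex "or" operator.
    combined = re.compile('|'.join(re.escape(term) for term in search_terms))
    current_ip = None
    results = []
    for line in lines:
        ip_match = IP_PATTERN.search(line)
        if ip_match is not None:
            # Remember the most recent source IP as the packet header.
            current_ip = ip_match.group()
        for term_match in combined.finditer(line):
            results.append((current_ip, term_match.group()))
    return results

# Made-up lines standing in for the dump file.
sample = [
    "10.0.0.1 > 10.0.0.2: GET /index.html",
    "Host: example.com",
    "192.168.1.5 > 10.0.0.2: POST /login",
]
found = search_terms_single_pass(sample, ["GET", "Host", "login"])
print(found)
```

Because the file is only iterated once, the exhausted-iterator problem does not come up at all.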

You need to "rewind" the file after the for line in fileIn loop:
    ...
    fileIn.seek(0)
    x += 1

Related

While loop runs differently after the first loop

Doing this exercise from ThinkPython and wanting to do a little extra, trying to modify the exercise function (avoid) to prompt the user repeatedly and perform the calculation to find how many words in a text file (fin) contain the user inputted letters (avoidprompt). It works the first time but after it prompts the user for input again it always returns an answer of 0 words.
Feel like the most likely issue is I'm misunderstanding how to use the while loop in this context since it works the first time but doesn't after that. Is there a better way?
fin = open('[location of text file here]')
line = fin.readline()
word = line.strip()
def avoid(word, forbidden):
    for letter in word:
        if letter in forbidden:
            return False
    return True
def avoidprompt():
    while(True):
        n = 0
        forbidden = input('gimmie some letters n Ill tell u how many words have em. \n')
        for line in fin:
            if avoid(line, forbidden) == False:
                n = n+1
        print('\n There are ' + str(n) + " words with those letters. \n")
When you open a file and do for line in file, you've consumed the entire file.
There are two easy solutions:
1) Go back to the start of the file in each iteration of your while(True) loop, by doing fin.seek(0)
2) Just store the file contents in a list, by replacing the first line of your script with fin = open('file.txt').readlines()
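Here is a small, self-contained sketch of both options. The helper `count_matches` and the three-word temporary file are assumptions made up for the demonstration; only `seek(0)` and `readlines()` come from the options above.

```python
import os
import tempfile

def count_matches(lines, letters):
    """Count the lines that contain at least one of the given letters."""
    return sum(1 for line in lines if any(ch in line for ch in letters))

# Build a small throwaway word list so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("apple\nsky\nplum\n")
    path = tmp.name

with open(path) as fin:
    first = count_matches(fin, "ae")    # reads the file to the end
    second = count_matches(fin, "ae")   # nothing left to read, so 0
    fin.seek(0)                         # option 1: rewind to the start
    third = count_matches(fin, "ae")

with open(path) as fin:
    words = fin.readlines()             # option 2: keep the lines in a list
fourth = count_matches(words, "u")
fifth = count_matches(words, "u")       # a list can be looped over repeatedly

os.remove(path)
print(first, second, third, fourth, fifth)
```

Option 2 keeps the whole file in memory, which is fine for a word list but may not be for very large files.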
I believe you need to do something along these lines:
def avoidprompt():
    while(True):
        n = 0
        fin.seek(0)
        forbidden = input('gimmie some letters n Ill tell u how many words have em. \n')
        for line in fin:
            if avoid(line, forbidden) == False:
                n = n+1
        print('\n There are ' + str(n) + " words with those letters. \n")
seek() moves the file position to a given offset in an open file (here 0, the start of the file). Since you already iterated through the file once, the position is at the end and needs to be brought back to the top of the file in order to reread the words.
You can see this other stack overflow for more details here
Hope this helps! You used the loop just fine

Find the line number a string is on in an external text file

I am trying to create a program where it gets input from a string entered by the user and searches for that string in a text file and prints out the line number. If the string is not in the text file, it will print that out. How would I do this? Also I am not sure if even the for loop that I have so far would work for this so any suggestions / help would be great :).
What I have so far:
file = open('test.txt', 'r')
string = input("Enter string to search")
for string in file:
    print("") #print the line number
You can implement this algorithm:
Initialize a counter
Read lines one by one
If the line matches the target, return the current count
Increment the count
If reached the end without returning, the line is not in the file
For example:
def find_line(path, target):
    with open(path) as fh:
        count = 1
        for line in fh:
            if line.strip() == target:
                return count
            count += 1
    return 0
A text file differs from the in-memory structures used in programs (such as dictionaries and arrays) in that it is sequential. Much like the old tapes used for storage a long, long time ago, there's no way to grab/find a specific line without combing through all prior lines (or somehow guessing the exact position in the file). Your best option is just to create a for loop that iterates through each line until it finds the one it's looking for, returning the number of lines traversed until that point.
file = open('test.txt', 'r')
string = input("Enter string to search")
lineCount = 0
for line in file:
    lineCount += 1
    if string == line.rstrip(): # remove trailing newline
        print(lineCount)
        break
filepath = 'test.txt'
substring = "aaa"
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    flag = False
    while line:
        if substring in line:
            print("string found in line {}".format(cnt))
            flag = True
            break
        line = fp.readline()
        cnt += 1
    if not flag:
        print("string not found in file")
If the string will match a line exactly, we can do this in one line:
print(open('test.txt').read().split("\n").index(input("Enter string to search")))
Well, the above kind of works, except it won't print "no match" if there isn't one. For that, we can just add a little try:
try:
    print(open('test.txt').read().split("\n").index(input("Enter string to search")))
except ValueError:
    print("no match")
Otherwise, if the string is just somewhere in one of the lines, we can do:
string = input("Enter string to search")
for i, l in enumerate(open('test.txt').read().split("\n")):
    if string in l:
        print("Line number", i)
        break
else:
    print("no match")
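Note that `list.index` and a plain `enumerate` both count from zero, so the snippets above report a position one less than the line number most editors would show. A minimal sketch of a 1-based variant follows; the function name and the `io.StringIO` sample are assumptions for illustration, and an open file could be passed in the same way.

```python
import io

def find_line_number(lines, target):
    """Return the 1-based number of the first line containing target, or None."""
    # start=1 makes enumerate count lines the way editors do.
    for number, line in enumerate(lines, start=1):
        if target in line:
            return number
    return None

sample = io.StringIO("alpha\nbeta\ngamma\n")   # stands in for an open file
hit = find_line_number(sample, "gam")
sample.seek(0)                                  # rewind before searching again
miss = find_line_number(sample, "delta")
print(hit, miss)
```

Iterating over the file object directly also avoids reading the whole file into memory first.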

What is the most efficient way to find the particular strings on huge text files (>1GB) in python?

I am developing a string filter for huge process log files in a distributed system.
These log files are larger than 1 GB and contain millions of lines. They contain a special type of message block that starts with "SMsg{" and ends with "}". My program reads the whole file line by line and adds the numbers of the lines that contain "SMsg{" to a list. Here is my Python method to do that.
def FindNMsgStart(self,logfile):
    self.logfile = logfile
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    NMsgBlockStart = list()
    for num, line in enumerate(infile.readlines()):
        if re.search('SMsg{', line):
            NMsgBlockStart.append(num)
    return NMsgBlockStart
This is my lookup function to search any kind of word in the text file.
def Lookup(self,infile,regex,start,end):
    self.infile = infile
    self.regex = regex
    self.start = start
    self.end = end
    result = 0
    for num, line in enumerate(itertools.islice(infile,start,end)):
        if re.search(regex, line):
            result = num + start
            break
    return result
Then I will get that list and find the end for each starting block through the whole file. Following is my code for find the end.
def FindNmlMsgEnd(self,logfile,NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    length = len(NMsgBlockStart)
    if length > 0:
        for i in range (0,length):
            start=NMsgBlockStart[i]
            infile = lf.OpenFile(logfile, 'Input')
            lines = lf.LineCount(logfile, 'Input')
            end = lf.Lookup(infile, '}', start, lines+1)
            NMsgBlockEnd.append(end)
        return NMsgBlockEnd
    else:
        print("There is no Normal Message blocks.")
But these methods are not efficient enough to handle huge files. The program runs for a long time without producing a result.
Is there an efficient way to do this?
If yes, how could I do this?
I am writing other filters too, but first I need to find a solution for this basic problem. I am really new to Python. Please help me.
I see a couple of issues that are slowing your code down.
The first seems to be a pretty basic error. You're calling readlines on your file in the FindNMsgStart method, which is going to read the whole file into memory and return a list of its lines.
You should just iterate over the lines directly by using enumerate(infile). You do this properly in the other functions that read the file, so I suspect this is a typo or just a simple oversight.
The second issue is a bit more complicated. It involves the general architecture of your search.
You're first scanning the file for message start lines, then searching for the end line after each start. Each end-line search requires re-reading much of the file, since you need to skip all the lines that occur before the start line. It would be a lot more efficient if you could combine both searches into a single pass over the data file.
Here's a really crude generator function that does that:
def find_message_bounds(filename):
    with open(filename) as f:
        iterator = enumerate(f)
        for start_line_no, start_line in iterator:
            if 'SMsg{' in start_line:
                for end_line_no, end_line in iterator:
                    if '}' in end_line:
                        yield start_line_no, end_line_no
                        break
This function yields start, end line number tuples, and only makes a single pass over the file.
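To see the single-pass behaviour on a small input, here is a variant of the same generator that accepts any iterable of lines instead of a filename. That change, and the made-up log text, are assumptions for the sake of a runnable sketch.

```python
import io

def find_message_bounds(lines):
    """Yield (start_line_no, end_line_no) pairs, zero-based, in a single pass."""
    iterator = enumerate(lines)
    for start_line_no, start_line in iterator:
        if 'SMsg{' in start_line:
            # The inner loop continues from the same iterator, so no line
            # is ever read twice.
            for end_line_no, end_line in iterator:
                if '}' in end_line:
                    yield start_line_no, end_line_no
                    break

# Made-up log content standing in for the real file.
log = io.StringIO(
    "startup\n"
    "SMsg{\n"
    "  id=1\n"
    "}\n"
    "noise\n"
    "SMsg{\n"
    "}\n"
)
bounds = list(find_message_bounds(log))
print(bounds)
```

One limitation of this crude version: the search for "}" starts on the line after "SMsg{", so a block that opens and closes on the same line would be paired with a later "}".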
I think you can actually implement a one-pass search using your Lookup method, if you're careful with the boundary variables you pass in to it.
def FindNmlMsgEnd(self,logfile,NMsgBlockStart):
    self.logfile = logfile
    self.NMsgBlockStart = NMsgBlockStart
    NMsgBlockEnd = list()
    lf = LogFilter()
    infile = lf.OpenFile(logfile, 'Input')
    total_lines = lf.LineCount(logfile, 'Input')
    start = NMsgBlockStart[0]
    prev_end = -1
    # Walk over the remaining start lines; the first one is already in start.
    for next_start in NMsgBlockStart[1:]:
        end = lf.Lookup(infile, '}', start-prev_end-1, next_start-prev_end-1) + prev_end + 1
        NMsgBlockEnd.append(end)
        start = next_start
        prev_end = end
    last_end = lf.Lookup(infile, '}', start-prev_end-1, total_lines-prev_end-1) + prev_end + 1
    NMsgBlockEnd.append(last_end)
    return NMsgBlockEnd
It's possible I have an off-by-one error in there somewhere; the design of the Lookup function makes it difficult to call repeatedly.

Loop within a loop not re-looping with reading a file Python3

Trying to write a code that will find all of a certain type of character in a text file
For vowels it'll find all of the number of a's but won't reloop through text to read e's. help?
def finder_character(file_name,character):
    in_file = open(file_name, "r")
    if character=='vowel':
        brain_rat='aeiou'
    elif character=='consonant':
        brain_rat='bcdfghjklmnpqrstvwxyz'
    elif character=='space':
        brain_rat=''
    else:
        brain_rat='!##$%^&*()_+=-123456789{}|":?><,./;[]\''
    found=0
    for line in in_file:
        for i in range (len(brain_rat)):
            found += finder(file_name,brain_rat[i+1,i+2])
    in_file.close()
    return found
def finder(file_name,character):
    in_file = open(file_name, "r")
    line_number = 1
    found=0
    for line in in_file:
        line=line.lower()
        found +=line.count(character)
    return found
If you want to use your original code, you have to pass the filename to the finder() function, and open the file there, for each char you are testing for.
The reason for this is that the file object (in_file) is an iterator, not a list. An iterator returns the next item each time next() is called on it. When you say
for line in in_file:
The for ... in statement calls next(in_file) repeatedly until the iterator signals, by raising StopIteration, that it has no more values. At that point we say the iterator is exhausted. You can't re-use an exhausted iterator. If you want to start over again, you have to make a new one (for a file, by reopening it).
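The protocol can be watched step by step. In this sketch an `io.StringIO` object stands in for the open file (an assumption made so that the example runs without a file on disk); it follows the same iteration rules as a text file opened with `open()`.

```python
import io

handle = io.StringIO("first\nsecond\n")   # stands in for an open text file

is_own_iterator = iter(handle) is handle  # iterating does not create a copy
first_line = next(handle)                 # what the for loop does each round
second_line = next(handle)
try:
    next(handle)
    exhausted = False
except StopIteration:                     # this is how the for loop knows to stop
    exhausted = True
leftover = list(handle)                   # a second loop finds nothing left

print(is_own_iterator, repr(first_line), repr(second_line), exhausted, leftover)
```

Since the file object is its own iterator, every loop over it shares the same position, which is why the second loop in the question never runs its body.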
I allowed myself to rewrite your code. This should give you the desired result. If anything is unclear, please ask!
def finder_character(file_name,character):
    with open(file_name, "r") as ifile:
        if character=='vowel':
            brain_rat='aeiou'
        elif character=='consonant':
            brain_rat='bcdfghjklmnpqrstvwxyz'
        elif character=='space':
            brain_rat=' '
        else:
            brain_rat='!##$%^&*()_+=-123456789{}|":?><,./;[]\''
        return sum(1 if c.lower() in brain_rat else 0 for c in ifile.read())
test.txt:
eeehhh
iii!#
kk ="k
oo o
Output:
>>>print(finder_character('test.txt', 'vowel'))
9
>>>print(finder_character('test.txt', 'consonant'))
6
>>>print(finder_character('test.txt', 'space'))
2
>>>print(finder_character('test.txt', ''))
4
If you are having problems understanding the return line, it should be read backwards, like this:
Sum this generator:
Make a generator with values as v in:
for c in ifile.read():
    if c.lower() in brain_rat:
        v = 1
    else:
        v = 0
If you want to know more about generators, I recommend the Python Wiki page concerning it.
This seems to be what you are trying to do in finder_character. I'm not sure why you need finder at all.
In python you can loop over iterables (like strings), so you don't need to do range(len(string)).
for line in in_file:
    for i in brain_rat:
        if i in line: found += 1
There appear to be a few other oddities in your code too:
You open (and iterate through) the file twice, but only closed once.
line_number is never used
You get the total of a character in a file for each line in the file, so the total will be vastly inflated.
This is probably a much safer version. with open... is generally better than open()... file.close(), as you don't need to worry as much about error handling and closing. I've added some comments to help explain what the code does.
def finder_character(file_name,character):
    found=0 # Initialise the counter
    with open(file_name, "r") as in_file:
        # Open the file and split its contents into lines
        lines = in_file.read().split('\n')
    opts = { 'vowel':'aeiou',
             'consonant':'bcdfghjklmnpqrstvwxyz',
             'space':' ' }
    default= '!##$%^&*()_+=-123456789{}|":?><,./;[]\''
    for line in lines:
        # Iterate through each line in the file
        for c in opts.get(character,default):
            # With each line, also iterate through the set of chars to check.
            if c in line.lower():
                # If the current character is in the line
                found += 1 # increment the counter.
    return found # return the counter

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    # skip heading record; next(FASTA) works in Python 2.6+ and 3,
    # while FASTA.next() exists only in Python 2
    _unused = next(FASTA)
    for line in FASTA:
        line = line.strip()
        print(line)
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
You should show your script. To read from the second line, try something like this:
f=open("file")
f.readline()
for line in f:
    print(line)
f.close()
You might be interested in checking Biopython's handling of FASTA files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                         id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
good to see another bioinformatician :)
just include an if clause within your for loop above the line.strip() call
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
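If the goal is to return the sequence rather than print it (as the docstring in the question suggests), the same header check can feed a join. This is only a sketch for a single-record file; the name `read_sequence` and the shortened sample data are assumptions for illustration.

```python
import io

def read_sequence(handle):
    """Return the sequence lines of a single-record FASTA file as one string."""
    # Skip header lines (those starting with '>') and join the rest,
    # stripping the newline from each line.
    return "".join(line.strip() for line in handle if not line.startswith(">"))

# Shortened, made-up record; an open file can be passed the same way.
fasta = io.StringIO(
    ">example header\n"
    "GACGGCTT\n"
    "TTAGATTT\n"
)
sequence = read_sequence(fasta)
print(sequence)
```

For files with several records, a parser such as the Biopython one is the safer choice, since this sketch would concatenate all records into one string.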
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
