Read until an only partially known line - Python - python

I need to get (parse) from a device its whole output.
My solution was: 1) Determine how the last line of its output looks like
2) Use the code below to read the output until the last line (which is a way around of saying - read the whole output)
last_line = "text of the last line"
read_until(last_line)
3) Technical detail: make it to a return value of the get_output() as means of passing it further to a parse_result() function.
The problem is: The last line might take various forms and only its rough format is known. For example it might say: {"diag":"hdd_id", "status":"0"}. However, both "diag" and "status" might take other values than "hdd_id" and "0".
What can I do to make the "text of the last line" more universal so that the read_until() stops for every value of "diag" and "status"? (given that the output always includes words "diag" and "status")
What I tried: Using regular expressions. Defining last_line = re('"status":"."}') making use of the fact that . in regular expression means any value. What I get though is TypeError: 'module' object is not callable.
It also wouldn't make much sense to convert that regular expression to a string by str(re('"status":"."}')) since, as far as I understand regular expressions, it wouldn't mean any particular string (due to .).

You should read (again) the re chapter from the Python standard library manual.
The correct usage is:
import re
...
eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}') # compile the regex
...
The above expression uses \s* to allow for optional white spaces in the line. You can remove them if you know that they cannot occur.
You can then use it with the telnetlib Python module, but with expect instead of read_until, because the latter searches for a string and not a regex:
index, match, text = tn.expect([eof])
Here, index will the the index of the matched regex (here 0), match the match object, and text the full text including the last line

Related

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for custom HTML page that finds a word within a string/line and highlights just that word with use of following tags where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
Problem with my approach, is that I am attempting to itterate over each of instances found with findall (exact match),
while multiple same matches can also be found within the string.
for instance in find_all_instances:
second_pattern = re.compile(instance)
string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone have a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches(case insensitive), then I would be grateful.
Maybe I'm misinterpretting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay so two ways I did quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). It replaces with the lowercase search term bear in mind.
import re
FILE = open("testing.txt","r")
word="port"
#THIS LOOP IS CASE SENSITIVE
for line in FILE:
newline=line.replace(word,"<b><font color=\"red\">"+word+"</font></b>")
print newline
#THIS LOOP IS INCASESENSITIVE
for line in FILE:
pattern=re.compile(word,re.IGNORECASE)
newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>",line)
print newline

Python Regex Problems with Whitespace

I'm trying to do a python regular expression that looks for lines formatted as such ([edit:] without new lines; the original is all on one line):
<MediaLine Label="main-video" xmlns="ms-rtcp-metrics">
<OtherTags...></OtherTags>
</MediaLine>
I wish to create a capture group of the body of this XML element (so the OtherTags...) for later processing.
Now the problem lies in the first line, where Label="main-video", and I would like to not capture Label="main-audio"
My initial solution is as such:
m = re.search(r'<MediaLine(.*?)</MediaLine>', line)
This works, in that it filters out all other non-MediaLine elements, but doesn't account for video vs audio. So to build on it, I try simply adding
m = re.search(r'<MediaLine Label(.*?)</MediaLine>', line)
but this won't create a single match, let alone being specific enough to filter audio/video. My problem seems to come down to the space between line and Label. The two variations I can think of trying both fail:
m = re.search(r'<MediaLine L(.*?)</MediaLine>', line)
m = re.search(r'<MediaLine\sL(.*?)</MediaLine>', line)
However, the following works, without being able to distinguish audio/video:
m = re.search(r'<MediaLine\s(.*?)</MediaLine>', line)
Why is the 'L' the point of failure? Where am I going wrong? Thanks for any help.
And to add to this preemptively, my goal is an expression like this:
m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*?)</MediaLine>", line)
result = m.group('payload')
By default, . doesn’t match a newline, so your initial solution didn't work either. To make . match a newline, you need to use the re.DOTALL flag (aka re.S):
>>> m = re.search("<MediaLine Label=\"main-video\"(?:.*?)>(?P<payload>.*)</MediaLine>", line, re.DOTALL)
>>> m.group('payload')
'\n <OtherTags...></OtherTags>\n'
Notice there’s also an extra ? in the first group, so that it’s not greedy.
As another comment observes, the best thing to parse XML is an XML parser. But if your particular XML is sufficiently strict in the tags and attributes that it has, then a regular expression can get the job done. It will just be messier.

Pyparsing delimited list only returns first element

Here is my code :
l = "1.3E-2 2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser,delim='\t ')
print(grammar.parseString(l))
It returns :
['1.3E-2']
Obiously, I want all both values, not a single one, any idea what is going on ?
As #dawg explains, delimitedList is intended for cases where you have an expression with separating non-whitespace delimiters, typically commas. Pyparsing implicitly skips over whitespace, so in the pyparsing world, what you are really seeing is not a delimitedList, but OneOrMore(realnumber). Also, parseString internally calls str.expandtabs on the provided input string, unless you use the parseWithTabs=True argument. Expanding tabs to spaces helps preserve columnar alignment of data when it is in tabular form, and when I originally wrote pyparsing, this was a prevalent use case.
If you have control over this data, then you might want to use a different delimiter than <TAB>, perhaps commas or semicolons. If you are stuck with this format, but determined to use pyparsing, then use OneOrMore.
As you move forward, you will also want to be more precise about the expressions you define and the variable names that you use. The name "parser" is not very informative, and the pattern of Word(alphanums+'+-.') will match a lot of things besides valid real values in scientific notation. I understand if you are just trying to get anything working, this is a reasonable first cut, and you can come back and tune it once you get something going. If in fact you are going to be parsing real numbers, here is an expression that might be useful:
realnum = Regex(r'[+-]?\d+\.\d*([eE][+-]?\d+)?').setParseAction(lambda t: float(t[0]))
Then you can define your grammar as "OneOrMore(realnum)", which is also a lot more self-explanatory. And the parse action will convert your strings to floats at parse time, which will save you step later when actually working with the parsed values.
Good luck!
Works if you switch to raw strings:
l = r"1.3E-2\t2.5E+1"
parser = Word(alphanums + '+-.')
grammar = delimitedList(parser, delim=r'\t')
print(grammar.parseString(l))
Prints:
['1.3E-2', '2.5E+1']
In general, delimitedList works with something like PDPDP where P is the parse target and D is the delimter or delimiting sequence.
You have delim='\t '. That specifically is a delimiter of 1 tab followed by 1 space; it is not either tab or space.

Pyparsing finds first occurence in file

I'm parsing file via
output=wilcard.parseFile(myfile)
print output
And I do get only first match of string.
I have a big config file to parse, with "entries" which are surrounded by braces.
I expect to see all the matches that are in file or exception for not matching.
How do I achieve that?
By default, pyparsing will find the longest match, starting at the first character. So, if your parse is given by num = Word('0123456789'), parsing either "462" or "462-780" will both return the same value. However, if the parseAll=True option is passed, the parse will attempt to parse the entire string. In this case, "462" would be matched, but parsing "462-780" would raise a ParseException, because the parser doens't know how to deal with the dash.
I would recommend constructing something that will match the entirety of the file, then using the parseAll=True flag in parseFile(). If I understand your description of each entry being separated by braces correctly, one could do the following.
entire_file = OneOrMore('[' + wildcard + ']')
output = wildcard.parseFile(myfile,parseAll=True)
print output

How to search for string in Python by removing line breaks but return the exact line where the string was found?

I have a bunch of PDF files that I have to search for a set of keywords against. I have to extract the exact line where the keyword was found. I first used xpdf's pdf2text to convert the file to PDF. (Tried solr but had a tough time tailoring the output/schema to suit my requirement).
import sys
file_name = sys.argv[1]
searched_string = sys.argv[2]
result = [(line_number+1, line) for line_number, line in enumerate(open(file_name)) if searched_string.lower() in line.lower()]
#print result
for each in result:
print each[0], each[1]
ThinkCode:~$ python find_string.py sample.txt "String Extraction"
The problem I have with this is that for cases where search string is broken towards the end of the line :
If you are going to index large binary files, remember to change the
size limits. String
Extraction is a common problem
If I am searching for 'String Extraction', I will miss this keyword if I use the code presented above. What is the most efficient way of achieving this without making 2 copies of text file (one for searching the keyword to extract the line (number) and the other for removing line breaks and finding the keyword to eliminate the case where the keyword spans across 2 lines).
Much appreciated guys!
Note: Some considerations without any code, but I think they belong to an answer rather than to a comment.
My idea would be to search only for the first keyword; if a match is found, search for the second. This allows you to, if the match is found at the end of the line, take into consideration the next line and do line concatenation only if a match is found in first place*.
Edit:
Coded a simple example and ended up using a different algorithm; the basic idea behind it is this code snippet:
def iterwords(fh):
for number, line in enumerate(fh):
for word in re.split(r'\s+', line.strip()):
yield number, word
It iterates over the file handler and produces a (line_number, word) tuple for each word in the file.
The matching afterwards becomes pretty easy; you can find my implementation as a gist on github. It can be run as follows:
python search.py 'multi word search string' file.txt
There is one main concern with the linked code, I didn't code a workaround both for performance and complexity reasons. Can you figure it out? (Spoiler: try to search for a sentence whose first word appears two times in a row in the file)
* I didn't perform any testing on my own, but this article and the python wiki suggest that string concatenation is not that efficient in python (don't know how actual the information is).
There may be a better way of doing it, but my suggestion would be to start by taking in two lines (let's call them line1 and line2), concatenating them into line3 or something similar, and then search that resultant line.
Then you'd assign line2 to line1, get a new line2, and repeat the process.
Use the flag re.MULTILINE when compiling your expressions: http://docs.python.org/library/re.html#re.MULTILINE
Then use \s to represent all white space (including new lines).

Categories