Pyparsing finds first occurrence in file

I'm parsing a file via
output = wildcard.parseFile(myfile)
print(output)
and I get only the first match of the string.
I have a big config file to parse, with "entries" that are surrounded by braces.
I expect to see all the matches that are in the file, or an exception if something does not match.
How do I achieve that?

By default, pyparsing matches as much as it can starting at the first character, then stops without complaining about any text left over. So, if your parser is num = Word('0123456789'), parsing either "462" or "462-780" returns the same value. However, if the parseAll=True option is passed, the parser must consume the entire string. In that case "462" would still be matched, but parsing "462-780" would raise a ParseException, because the parser doesn't know how to deal with the dash.
I would recommend constructing an expression that matches the entirety of the file, then using the parseAll=True flag in parseFile(). If I understand your description of each entry being surrounded by braces correctly, one could do the following.
from pyparsing import OneOrMore
# use '{' and '}' here instead if the entries are delimited by curly braces
entire_file = OneOrMore('[' + wildcard + ']')
output = entire_file.parseFile(myfile, parseAll=True)
print(output)
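The difference between the default behaviour and parseAll=True has a close analogue in the standard re module. The sketch below is only an analogy that runs without pyparsing installed, not pyparsing's own API: re.match accepts a matching prefix and ignores the rest, while re.fullmatch insists on consuming the whole string.

```python
import re

num = re.compile(r"[0-9]+")

# Like the default parse: match a prefix and silently ignore the rest.
prefix_a = num.match("462").group()
prefix_b = num.match("462-780").group()
print(prefix_a, prefix_b)

# Like parseAll=True: the whole input must match, otherwise the match fails.
whole_a = num.fullmatch("462")
whole_b = num.fullmatch("462-780")
print(whole_a is not None, whole_b is not None)
```

One difference to keep in mind: re.fullmatch signals failure by returning None, whereas pyparsing with parseAll=True raises a ParseException.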

Related

Python Regex with Paramiko

I am using the Python Paramiko module to SFTP into one of my servers. I called listdir() to get all of the files in the folder. From that listing, I'd like to use a regex to find the names matching a pattern and then print out the entire string.
listdir() returns a list of XML file names in this format:
LOG_MMDDYYYY_HHMM.XML
LOG_07202018_2018 --> this is for the date 07/20/2018 at the time 20:18
I'd like to use a regex to find all the XML files for that particular date and store them in a list or a variable. I can then pass this variable to Paramiko to get the files.
for log in file_list:
    regex_pattern = 'POSLog_' + date + '*'
    if (re.search(regex_pattern, log) != None):
        matchObject = re.findall(regex_pattern, log)
        print(matchObject)
The code above just prints:
['Log_07202018']
I want it to store the entire string Log_07202018_20:18.XML to a variable.
How would I go about doing this?
Thank you

If you are looking for a fixed string, don't use regex.
search_str = 'POSLog_' + date
for line in file_list:
    if search_str in line:
        print(line)
Alternatively, a list comprehension can build the list of matching lines in one go:
log_lines = [line for line in file_list if search_str in line]
for line in log_lines:
    print(line)
If you must use regex, there are a few things to change:
Any variable part that you put into the regex pattern must either be guaranteed to be a regex itself, or it must be escaped.
"The rest of the line" is not *, it's .*.
The start-of-line anchor ^ should be used to speed up the search - this way the regex fails faster when there is no match on a given line.
To support the ^ on multiple lines instead of only at the start of the entire string, the MULTILINE flag is needed.
There are several ways of getting all matches. One could do "for each line, if there is a match, print line", same as above. Here I'm using .finditer() and a search over the whole input block (i.e. not split into lines).
log_pattern = '^POSLog_' + re.escape(date) + '.*'
for match in re.finditer(log_pattern, whole_file, re.MULTILINE):
    # match.group() is the matched line; match.string would be the whole input
    print(match.group())
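A minimal sketch of the regex approach, assuming a hypothetical file_list and date (the sample names below are invented for illustration). It joins the names into one block of text so the MULTILINE search applies, and collects match.group(), which holds the matched line:

```python
import re

# Hypothetical directory listing, standing in for the result of listdir().
file_list = [
    "POSLog_07202018_2018.XML",
    "POSLog_07202018_2155.XML",
    "POSLog_07212018_0930.XML",
]
date = "07202018"

# Escape the variable part, anchor at the line start, take the rest of the line.
log_pattern = "^POSLog_" + re.escape(date) + ".*"

# One name per line, so that MULTILINE lets ^ match at the start of each name.
whole_listing = "\n".join(file_list)
matches = [m.group() for m in re.finditer(log_pattern, whole_listing, re.MULTILINE)]
print(matches)
```

The resulting list holds complete file names, so each entry can be handed on to the download step unchanged.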

Because you only print the matched part, just do print(log) instead and it'll print the whole filename.

How to split a string on multiple pattern using pythonic way (one liner)?

I am trying to extract the file name, without its extension, from a file pointer. My file names look like this:
this site:time.list, this.list, this site:time_sec.list, that site:time_sec.list and so on. The required part of the file name always precedes either a whitespace or a dot.
Currently I do the following to get the part of the file name preceding the whitespace or the dot.
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected file name output: this, this, this, that.
How can I combine the above two into a Pythonic one-liner?
Thanks in advance.

Use a regex as below.
The character class [ .] splits on either a space or a dot character:
re.split('[ .]', os.path.basename(f.name))[0]

If you split on one and splitting on the other still returns something smaller, that's the one you want. If not, what you get is what you got from the first split. You don't need regex for this.
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]
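As a quick check, the chained split and the regex split can be compared on the sample names from the question. This is only a sketch: the names list stands in for os.path.basename(f.name) on each file.

```python
import re

# Sample base names taken from the question.
names = [
    "this site:time.list",
    "this.list",
    "this site:time_sec.list",
    "that site:time_sec.list",
]

# Split on the space first, then on the dot.
chained = [n.split(" ")[0].split(".")[0] for n in names]

# Split on either character in a single step.
by_regex = [re.split(r"[ .]", n)[0] for n in names]

print(chained)
print(by_regex)
```

Both approaches give the same result here, so the choice between them is mostly a matter of taste.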

Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
Applied to your file name:
pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your filenames don't have whitespace inside when you rely on the assumption that a whitespace separates what you want to extract from the rest. It's much more likely to get unexpected results you didn't think up in advance if you rely on implicit rules like that instead of actually looking at the strings you want to extract and tailor explicit expressions to fit the content.

Read until an only partially known line - Python

I need to get (parse) from a device its whole output.
My solution was: 1) Determine what the last line of its output looks like
2) Use the code below to read the output until the last line (which is a roundabout way of saying: read the whole output)
last_line = "text of the last line"
read_until(last_line)
3) Technical detail: make it the return value of get_output() so that it can be passed on to a parse_result() function.
The problem is: The last line might take various forms and only its rough format is known. For example it might say: {"diag":"hdd_id", "status":"0"}. However, both "diag" and "status" might take other values than "hdd_id" and "0".
What can I do to make the "text of the last line" more universal so that the read_until() stops for every value of "diag" and "status"? (given that the output always includes words "diag" and "status")
What I tried: Using regular expressions. Defining last_line = re('"status":"."}') making use of the fact that . in regular expression means any value. What I get though is TypeError: 'module' object is not callable.
It also wouldn't make much sense to convert that regular expression to a string by str(re('"status":"."}')) since, as far as I understand regular expressions, it wouldn't mean any particular string (due to .).

You should read (again) the re chapter from the Python standard library manual.
The correct usage is:
import re
...
eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}') # compile the regex
...
The above expression uses \s* to allow for optional white spaces in the line. You can remove them if you know that they cannot occur.
You can then use it with the telnetlib Python module, but with expect instead of read_until, because the latter searches for a string and not a regex:
index, match, text = tn.expect([eof])
Here, index will be the index of the matched regex (here 0), match the match object, and text the full text up to and including the last line.
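To check that the pattern accepts any values for "diag" and "status", it can be tried against a few sample lines without a device attached. The sample lines below are made up for illustration:

```python
import re

# Same pattern as above: optional whitespace, then the two known keys.
eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}')

# Made-up sample lines; only the first two have the shape of a last line.
lines = [
    '{"diag":"hdd_id", "status":"0"}',
    '{"diag":"fan_speed", "status":"17"}',
    'some ordinary output line',
]

results = [eof.match(line) is not None for line in lines]
print(results)
```

Note that under Python 3 telnetlib works with bytes, so the pattern passed to expect() would have to be a bytes pattern (for example rb'...') there.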

Django: Bad group name

I faced an error on "bad group name".
Here is the code:
for qitem in q['display']:
    if qitem['type'] == 1:
        for keyword in keywordTags.split('|'):
            p = re.compile('^' + keyword + '$')
            newstring=''
            for word in qitem['value'].split():
                if word[-1:] == ',':
                    word = word[0:len(word)-1]
                    newstring += (p.sub('<b>'+word+'</b>', word) + ', ')
                else:
                    newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
            qitem['value']=newstring
And here's the error:
error at /result/1/
bad group name
Request Method: GET
Django Version: 1.4.1
Exception Type: error
Exception Value: bad group name
Exception Location: C:\Python27\lib\re.py in _compile_repl, line 257
Python Executable: C:\Python27\python.exe
Python Version: 2.7.3
Python Path: ['D:\ExamPapers', 'C:\Windows\SYSTEM32\python27.zip',
'C:\Python27\DLLs', 'C:\Python27\lib',
'C:\Python27\lib\plat-win', 'C:\Python27\lib\lib-tk',
'C:\Python27', 'C:\Python27\lib\site-packages']
Server time: Sun,3 Mar 2013 15:31:05 +0800
Traceback
C:\Python27\lib\site-packages\django\core\handlers\base.py in get_response
response = callback(request, *callback_args, **callback_kwargs)
D:\ExamPapers\views.py in result
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')
In summary, the error is at:
newstring += (p.sub('<b>'+word+'</b>', word) + ' ')

So you're trying to highlight in bold each occurrence of a set of keywords. Right now this code is broken in quite a lot of ways. You're using the re module to match the keywords, but you're also breaking the keywords and the strings down into individual words. You don't need to do both, and the interaction between these two approaches to solving the problem is what is causing your issues.
You can use regular expressions to match multiple possible strings at the same time, that's what they're good for! So instead of "^keyword$" to match just "keyword", you could use "^(keyword|hello)$" to match either "keyword" or "hello"; the brackets group the alternatives so that ^ and $ apply to both of them. The ^ and $ characters only match the beginning or end of the entire string, but what you probably wanted originally was to match the beginning or end of words. For this you can use \b, like this: r"\b(keyword|hello)\b". Note that in the last example I added an r character before the string. This stands for "raw" and turns off Python's usual handling of backslash characters, which conflicts with regular expressions. It's good practice to always use the r before the string when the string contains a regular expression.
The regular expression sub method allows you to substitute things matched by a regular expression with another string. It also allows you to make "back references" in the replacement string that include parts of the original string that matched. The parts that it includes are called "groups" and are indicated with brackets in the original regular expression. In the example above there is only one set of brackets, and since it is the first, it is referred to by the back reference \1. The cause of the actual error message you asked about is that your replacement string contained a backslash sequence that looked like a group reference, but it did not correspond to any group in your regular expression.
Using that you do something like this:
keywordMatcher = re.compile(r"\b(keyword|hello)\b")
value = keywordMatcher.sub(r"<b>\1</b>", value)
Another thing that isn't directly related to what you're asking, but is incredibly important, is that you are taking plain text source strings (I assume) and making them into HTML. This gives a lot of chance for script injection vulnerabilities which, if you don't take the time to understand and avoid them, will allow bad guys to hack the applications you build. They can do this in an automated way, so even if you think your app will be too small for anyone to notice, it can still get hacked and used for all sorts of bad things. Don't let this happen! The basic rule is that it's OK to convert text to HTML, but you need to "escape" it first. This is very simple:
from django.utils import html
html_safe = html.escape(my_text)
All this does is convert characters like < to &lt;, which the browser will show as < but won't interpret as the beginning of a tag. So if a bad guy types <script> into one of your forms and it gets processed by your code, the browser will display it as the literal text <script> and not execute it as a script.
Likewise, if you use any text in a regular expression that you don't intend to have special regular expression characters, then you must escape that too! You can do this using re.escape:
import re
my_regexp = re.compile(r"\b%s\b" % (re.escape(my_word),))
Ok, so now we've got that out of the way here is a method you could use to do what you wanted:
value = "this is my super duper testing thingy"
keywords = "super|my|test"
from django.utils import html
import re
# first we must split up the keywords
keywords = keywords.split("|")
# Next we must make each keyword safe for use in a regular expression,
# this is similar to the HTML escaping we discussed above but not to
# be confused with it.
keywords = [re.escape(k) for k in keywords]
# Now we reform the keywordTags string, but this time we know each keyword is regexp-safe
keywords = "|".join(keywords)
# Finally we create a regular expression that matches *any* of the keywords
keywordMatcher = re.compile(r'\b(%s)\b' % (keywords,))
# We are going to make the value into HTML (by adding <b> tags) so must first escape it
value = html.escape(value)
# We can then apply the regular expression to the value. We use a "back reference" `\1` to say
# that each keyword found should be replaced with itself wrapped in a <b> tag
value = keywordMatcher.sub(r"<b>\1</b>", value)
print(value)
I urge you to take the time to understand what this does, otherwise you're just going to get yourself into a mess! It's always easier to just cut and paste and move on, but this leads to crappy broken code and, worst of all, means you yourself don't improve and don't learn. All great coders started off as beginner coders who took the time to understand things :)
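The same steps can be wrapped in a small function. This is a sketch only: highlight_keywords is a hypothetical name, and it uses html.escape from the Python 3 standard library as a stand-in for django.utils.html.escape so that it runs outside a Django project.

```python
import html
import re


def highlight_keywords(value, keyword_tags):
    """Wrap each whole-word keyword in <b> tags; keyword_tags is '|'-separated."""
    # Make every keyword safe for use inside a regular expression.
    keywords = [re.escape(k) for k in keyword_tags.split("|") if k]
    # Escape the text for HTML before adding our own markup.
    escaped = html.escape(value)
    if not keywords:
        return escaped
    matcher = re.compile(r"\b(%s)\b" % "|".join(keywords))
    # \1 refers to the single group, i.e. the keyword that was found.
    return matcher.sub(r"<b>\1</b>", escaped)


result = highlight_keywords("this is my <super> duper testing thingy", "super|my|test")
print(result)
```

Because of the \b word boundaries, "test" does not match inside "testing", and the angle brackets around "super" reach the browser as harmless entities.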

Python replace with re-using unknown strings

I have an XML in which I'd like to rename one of the tag groups like this:
<string>ABC</string>
<string>unknown string</string>
should be
<xyz>ABC</xyz>
<xyz>unknown string</xyz>
ABC is always the same, so that's no issue. However, "unknown string" is always different, but since I need this information extracted, I also want to keep the same string in the replacement.
Here's what I got so far:
import re
#open the xml file for reading:
file = open('path/file','r+')
#convert to string:
data = file.read()
file.write(re.sub("<string>ABC</string>(\s+)<string>(.*)</string>","<xyz>ABC</xyz>[\1]<xyz>[\2]</xyz>",data))
print (data)
file.close()
I tried to use capture groups, but didn't do it correctly. The string is replaced with weird symbols in my XML. Plus, it's printed twice. I have both the unchanged and the changed version in my XML, which I don't want.

The problem you're experiencing is not due to your regex pattern. The backslashes (\) in the strings are escaping the characters that follow them, which results in the weird symbols that you see.
>>> print("hello\1world")
helloworld
>>> print(r"hello\1world")
hello\1world
Always use the raw string notation to define your re patterns and replacement strings.
>>> data = """
... <string>ABC</string>
... <string>unknown string</string>
... """
>>> print(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
<xyz>ABC</xyz>
<xyz>unknown string</xyz>

Why are you including the content in your replacement operation? All you need to do is:
Replace <string> by <xyz>.
Replace </string> by </xyz>.
It would take two operations but the intent of your code would be clear and you don't need to know what unknown string is.
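Neither answer covers the other symptom in the question, the file ending up with both the old and the new version. With mode 'r+', write() starts at the current file position, which is the end of the file after read(), so the new text is appended. A minimal sketch of one way around that, assuming a throwaway sample file created in a temporary directory for illustration:

```python
import os
import re
import tempfile

# Hypothetical sample file standing in for the XML from the question.
path = os.path.join(tempfile.mkdtemp(), "sample.xml")
with open(path, "w") as f:
    f.write("<string>ABC</string>\n<string>unknown string</string>\n")

# Read the whole file first.
with open(path) as f:
    data = f.read()

# Raw strings keep \1 and \2 as back references instead of control characters.
new_data = re.sub(
    r"<string>ABC</string>(\s+)<string>(.*)</string>",
    r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",
    data,
)

# Reopening in "w" mode truncates the file, so only the changed version remains.
with open(path, "w") as f:
    f.write(new_data)

with open(path) as f:
    result = f.read()
print(result)
```

Printing the value returned by re.sub, rather than the original data, also shows the changed text instead of the unchanged input.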
