How to make the re.search() try a best attempt approach - python

This is the text file sb.txt
JOHN:ENGINEER:35?:
This is the piece of code that tries to perform a regex search on the above line.
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:',line)
Now I get a proper output for biodata1.group(1), biodata1.group(2) and biodata1.group(3).
If however, I modify the file by removing ":" from the end
JOHN:ENGINEER:35?
and run the script again, I get the following error which makes sense since group(3) didn't match successfully
Traceback (most recent call last):
File "dictionary.py", line 26, in <module>
print('re.search(r([\w\W])+?:([\w\W])+?:([\w\W])+? '+biodata1.group(1)+' '+biodata1.group(2)+' '+biodata1.group(3)) # STMT1
AttributeError: 'NoneType' object has no attribute 'group'
But group(1) and group(2) should've still matched "N" and "R" respectively. Is there any way to avoid this error and take a best-attempt approach to the regex, so that it doesn't fail and at least prints biodata1.group(1) and biodata1.group(2)?
I tried editing the output statement so that it doesn't print biodata1.group(3), but that didn't work.

I think you misunderstand what has happened. Your entire regular expression has failed to match and therefore there is no match object.
Where it says AttributeError: 'NoneType' object has no attribute 'group' it's trying to tell you that biodata1 is None. None is the return you get from re.search when it fails to match.
To be clear, there's no way to get a "best match". What you're asking for is that re should make a decision as to what you really want. If you want groups to be optional, you need to make them optional.
Depending on what you actually want you can try the regexes:
r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?'
or
r'([\w\W])+?:([\w\W])+?:(([\w\W])+?:)?'
Which respectively make the last : and the entire last group optional.
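A quick check of what the two variants do on the line without the trailing colon. This is only a sketch: the sample line comes from the question and the variable names are made up for illustration. Note that in the first variant the lazy +? now stops after a single character, because :? is allowed to match nothing.

```python
import re

line = "JOHN:ENGINEER:35?"  # sample line from the question, no trailing ':'

# Variant 1: only the final ':' is optional.
optional_colon = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?', line)

# Variant 2: the whole third field (together with its ':') is optional.
optional_group = re.search(r'([\w\W])+?:([\w\W])+?:(([\w\W])+?:)?', line)

# Always test for None before calling .group() on the result.
if optional_colon is not None:
    # The lazy +? needs only one character once ':?' may match nothing.
    print(optional_colon.group(1), optional_colon.group(2), optional_colon.group(3))

if optional_group is not None:
    # Group 3 took no part in the match, so it is None.
    print(optional_group.group(1), optional_group.group(2), optional_group.group(3))
```

Which of the two is appropriate depends on whether a third field without its closing colon should count as a field at all.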

You'll have to modify the regex to instruct it on what exactly is optional and what isn't. Python regexes don't have this concept of partial matches. One possibility is to change it to
biodata1 = re.search(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?',line)
Where you allow 1, 2 or 3 groups to match. In this case, any group that doesn't take part in the match will return None (not the empty string) when you call match.group(X), so check for None before concatenating.
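A small sketch of how the nested optional groups behave, using the pattern above on a few sample lines (the lines other than the ones from the question, and the helper name fields_of, are assumptions for illustration). Groups that did not take part in the match are turned into empty strings by hand.

```python
import re

PATTERN = re.compile(r'([\w\W])+?:(?:([\w\W])+?:(?:([\w\W])+?:)?)?')

def fields_of(line):
    """Return the three groups, with '' standing in for groups that did not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return [g if g is not None else '' for g in m.groups()]

for sample in ["JOHN:ENGINEER:35?:", "JOHN:ENGINEER:35?", "JOHN:"]:
    print(sample, fields_of(sample))
```

Each group still holds only the last character of its field, exactly as in the question, because the repetition sits outside the parentheses.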

What a regex does is it matches exactly what you provided. There is no best try or anything like that.
If you want some part of your match to be optional you need to declare it using the ? operator. So in your case your regex would need to look like this:
biodata1 = re.search(r'([\w\W])+?:([\w\W])+?:([\w\W])+?:?',line)
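Note that +? does not mean "at least once, or not at all": it is the lazy form of +, so it still needs at least one repetition and simply prefers as few as possible. It is therefore not the same as * (zero or more, greedy). If empty fields are acceptable you could use * instead, keeping in mind that a greedy [\w\W]* can also run across colons when the line contains more of them than the pattern needs: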
biodata1 = re.search(r'([\w\W])*:([\w\W])*:([\w\W])*:?',line)
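As an alternative sketch (not taken from the answers above): ([\w\W])+? repeats the group, so only the last character of each field is captured, which is why the question sees "N" and "R". Putting the repetition inside the parentheses and excluding the colon with [^:]+ captures whole fields. The function name parse_biodata and the decision to make the third field optional are assumptions for illustration.

```python
import re

# Two mandatory fields, an optional third field, and an optional trailing ':'.
FIELD_RE = re.compile(r'([^:]+):([^:]+)(?::([^:]+))?:?')

def parse_biodata(line):
    """Return a tuple of three fields (the third may be None), or None if nothing matches."""
    m = FIELD_RE.search(line)
    if m is None:
        return None
    return m.groups()

print(parse_biodata("JOHN:ENGINEER:35?:"))
print(parse_biodata("JOHN:ENGINEER:35?"))
print(parse_biodata("JOHN:ENGINEER"))
```

The None check replaces the AttributeError from the question with an explicit "no match" result that the caller can handle.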

Related

Python regex. Get the last word from a sequence

I have a line like this:
jsdata="l7Bhpb;_;CJWKh4 cECq7c;_;CJWKiA" data-ved="2ahUKEwjxq7L29Yr7AhWM7qQKHRABDVEQ2esEegQIGxAE">
I need to get the word CJWKiA.
But I don't understand how to write it in the regex language.
My failed attempt:
jsdata=\".+?;.+?\"
This returns the entire string, including the word I need :(
I don't understand how to get only CJWKiA word, I need something pattern like this:
jsdata=\"l7Bhpb;_;CJWKh4 cECq7c;_;(CJWKiA)\"
There may be different words, I only need to get the last one
/jsdata="[^"]*;([^;"]*)"/gm
This assumes the attribute value itself contains no double quotes. The last word ends up in capture group 1.
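In Python the same pattern can be used without the /.../gm delimiters. A minimal sketch on the line from the question (the variable names are illustrative):

```python
import re

line = 'jsdata="l7Bhpb;_;CJWKh4 cECq7c;_;CJWKiA" data-ved="2ahUKEwjxq7L29Yr7AhWM7qQKHRABDVEQ2esEegQIGxAE">'

# [^"]*; greedily runs to the last ';' inside the attribute value,
# then ([^;"]*) captures whatever follows it up to the closing quote.
m = re.search(r'jsdata="[^"]*;([^;"]*)"', line)
last_word = m.group(1) if m else None
print(last_word)
```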

Regex trouble with list comprehension

I am working in Python 3.7.6 on Windows and am attempting to use regex to transform one list of foo.csv.gz filenames into a list of the corresponding foo.csv filenames. A code snippet is below.
zippedFileNames = [re.search('[^/]*\\.gz', link).group(0) for link in linksList]
unzippedFileNames = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
In the above code, zippedFileNames is a list created by isolating the .gz filenames from a list of download links. This line works as I expect, and taking zippedFileNames[0] returns a string. The type of zippedFileNames[0] is str and the type of zippedFileNames is list.
However, the code throws an error on the second line:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'group'
File "H:\foo\bar\foobar.py", line 133, in <listcomp>
x = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
File "H:\foo\bar\foobar.py", line 133, in <module>
x = [re.search('.*\\.csv', name).group(0) for name in zippedFileNames]
This code was working yesterday but stopped working today, and I am not sure what I changed to break it. I believe it broke after I tried amending the second line's pattern to omit the first digits and underscore using the pattern '[^0-9\\_].*\\.csv' (the filenames all follow a pattern 0000_foo_bar_foobar.csv.gz). However, even reverting the pattern to the old one before the omission does not solve my problem.
Is there something I'm not seeing?
Thank you!
EDIT:
Thank you for your answers.
I checked whether there is a None in my list both by printing all list items and by using print(None in zippedFileNames). The latter test returned False and the former returned all the items as I'm expecting. I also did not find None in my linksList.
When I run the regex re.search on just one of the elements of linksList, linksList[0], I get the correct string output.
Are there other things I can try?
EDIT 2:
I tried re-using the original regex pattern '[^/]*\\.gz' in a separate call and it worked. Then I also tried using the pattern '[^/]*\\.csv\\.gz' in hopes of getting the same result as I got with the former pattern, but this pattern returned an error as well. I'm suspecting that the errors have something to do with \\.csv.
RESOLUTION
I was matching on .csv, but it turned out that I had a .report file as well, and that one was throwing off the entire script. Iterating through matches helped to isolate the issue. To solve the regex, I matched the pattern '.*\\[^.gz]' to keep all file extensions, not just .csv. Thank you very much!!
You can use
import re
zippedFileNames=['0001_foo1.csv', 'def.bz', '0000_foo2.csv.gz']
unzippedFileNames = []
for name in zippedFileNames:
    m = re.match(r"\d+_(.*\.csv)", name)
    if m:
        unzippedFileNames.append(m.group(1))
print(unzippedFileNames)
# => ['foo1.csv', 'foo2.csv']
Here, the unzippedFileNames is declared as an empty list. Then, iterating over the zippedFileNames, each name is checked against a \d+_(.*\.csv) regex (note that re.match only searches for a match at the start of the string), and if there is a match (if m:) the Group 1 contents are appended to the unzippedFileNames list.
Check whether there is a None or an empty value in the lists you are using, zippedFileNames or linksList.
re.search returns None if the string doesn't match. Your second regex seems wrong. I think it should be '.*\.csv'. You can test it with regex101.
[EDIT]: Your regex is correct; you probably have a file name in zippedFileNames that doesn't match it.
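One way to find the offending entry is to collect the names that fail to match instead of calling .group() on every result. The sketch below uses made-up file names; the .report entry mimics the file mentioned in the resolution.

```python
import re

# Hypothetical sample data, following the 0000_foo_bar.csv.gz naming from the question.
zipped_names = ['0000_foo.csv.gz', '0001_bar.csv.gz', '0002_summary.report.gz']
csv_re = re.compile(r'.*\.csv')

unzipped = []   # names with the .gz suffix stripped
unmatched = []  # names the pattern does not cover

for name in zipped_names:
    m = csv_re.search(name)
    if m:
        unzipped.append(m.group(0))
    else:
        unmatched.append(name)

print(unzipped)
print(unmatched)
```

Anything that lands in unmatched is a name for which re.search returned None, which is exactly what raised the AttributeError in the list comprehension.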

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for custom HTML page that finds a word within a string/line and highlights just that word with use of following tags where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
The problem with my approach is that I am attempting to iterate over each of the instances found with findall (exact match),
while multiple same matches can also be found within the string.
for instance in find_all_instances:
second_pattern = re.compile(instance)
string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> around every match (case insensitive) without changing the case of the matched text, I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay, so here are two quick ways. The second loop is definitely the way to go: it uses re.sub (as someone else commented too). Bear in mind that it replaces every hit with the lowercase search term, so the original case is lost.
import re
FILE = open("testing.txt", "r")
word = "port"
# THIS LOOP IS CASE SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print(newline)
FILE.seek(0)  # rewind, otherwise the second loop sees an already exhausted file
# THIS LOOP IS CASE INSENSITIVE
pattern = re.compile(word, re.IGNORECASE)  # compile once, outside the loop
for line in FILE:
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print(newline)
FILE.close()
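To keep the original case of each hit, the replacement can refer back to the matched text with \g<0> instead of inserting the search term. This is a minimal sketch; the function name highlight and the sample sentence are made up for illustration. A single re.sub call also avoids the nested tags from the question, because every match is wrapped exactly once.

```python
import re

def highlight(text, word):
    """Wrap every case-insensitive occurrence of word in red bold tags, keeping its case."""
    # re.escape guards against regex metacharacters in the search term.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # \g<0> stands for the whole match, exactly as it appeared in the text.
    return pattern.sub(r'<b><font color="red">\g<0></font></b>', text)

highlighted = highlight("Support for HTTP port 80 and http Port 8080", "port")
print(highlighted)
```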

Read until an only partially known line - Python

I need to get (parse) from a device its whole output.
My solution was: 1) Determine what the last line of its output looks like
2) Use the code below to read the output until the last line (which is a roundabout way of saying: read the whole output)
last_line = "text of the last line"
read_until(last_line)
3) Technical detail: make it the return value of get_output() so that it can be passed on to a parse_result() function.
The problem is: The last line might take various forms and only its rough format is known. For example it might say: {"diag":"hdd_id", "status":"0"}. However, both "diag" and "status" might take other values than "hdd_id" and "0".
What can I do to make the "text of the last line" more universal so that the read_until() stops for every value of "diag" and "status"? (given that the output always includes words "diag" and "status")
What I tried: Using regular expressions. Defining last_line = re('"status":"."}') making use of the fact that . in regular expression means any value. What I get though is TypeError: 'module' object is not callable.
It also wouldn't make much sense to convert that regular expression to a string by str(re('"status":"."}')) since, as far as I understand regular expressions, it wouldn't mean any particular string (due to .).
You should read (again) the re chapter from the Python standard library manual.
The correct usage is:
import re
...
eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}') # compile the regex
...
The above expression uses \s* to allow for optional white spaces in the line. You can remove them if you know that they cannot occur.
You can then use it with the telnetlib Python module, but with expect instead of read_until, because the latter searches for a string and not a regex:
index, match, text = tn.expect([eof])
Here, index will be the index of the matched regex (here 0), match the match object, and text the full text read so far, including the last line.
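The pattern itself can be tried out without a device. The sketch below does not use telnetlib; it runs the compiled regex over a made-up list of output lines, and read_until_match is a hypothetical stand-in for the reading loop.

```python
import re

eof = re.compile(r'\s*\{\s*"diag":.*,\s*"status":.*\}')

# Hypothetical device output; only the shape of the last relevant line matters.
sample_output = [
    'starting diagnostics',
    'checking disk',
    '{"diag":"hdd_id", "status":"0"}',
    'this line should never be read',
]

def read_until_match(lines, pattern):
    """Collect lines up to and including the first one that matches pattern."""
    collected = []
    for line in lines:
        collected.append(line)
        if pattern.match(line):
            break
    return collected

result = read_until_match(sample_output, eof)
print(result)
```

Because the values after "diag": and "status": are matched with .*, the loop stops on any values, not just "hdd_id" and "0".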

Having some issues with re.sub

In my program I'm parsing Japanese definitions, and I need to take a few things out. There are three things I need to take things out between. 「text」 (text) 《text》
To take out things between 「」 I've been doing sentence = re.sub('「[^)]*」','', sentence) The problem with this is, for some reason if there are parentheses within 「」 it will not replace anything. Also, I've tried using the same code for the other two things like sentence = re.sub('([^)]*)','', sentence)
sentence = re.sub('《[^)]*》','', sentence) but it doesn't work for some reason. There isn't an error or anything, it just doesn't replace anything.
How can I make this work, or is there some better way of doing this?
EDIT:
I'm having a slight problem with another part of this though. Before I replace anything I check the length to make sure it's over a certain length.
parse = re.findall(r'「[^」]*」','', match.text)
if len(str(parse)) > 8:
sentence = re.sub(r'「[^」]*」','', match.text)
This seems to be causing an error now:
Traceback (most recent call last):
File "C:/Users/Dominic/PycharmProjects/untitled9/main.py", line 48, in <module>
parse = re.findall(r'「[^」]*」','', match.text)
File "C:\Python34\lib\re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
File "C:\Python34\lib\re.py", line 275, in _compile
bypass_cache = flags & DEBUG
TypeError: unsupported operand type(s) for &: 'str' and 'int'
I sort of understand what's causing this, but I don't understand why it's not working just from that slight change. I know the re.sub part is fine; it's just the first two lines that are causing the problems.
You should read a tutorial on regular expressions so you understand what your regexps do.
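The regexp '「[^)]*」' matches a run of characters between the corner brackets that contains no closing parenthesis, so it fails as soon as there is a ) inside 「」. The character class has to exclude the closing bracket instead. You need this: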
sentence = re.sub(r'「[^」]*」','', sentence)
The second regexp has an additional problem: Parentheses have a special meaning (when they are not inside square brackets), so to match parentheses you need to write \( and \). So you need this:
'\([^)]*\)'
Finally: You should always use raw strings for your python regexps. It doesn't happen to make a difference in this case, but it very often does, and the bugs are maddening to spot. E.g., use:
r'\([^)]*\)'
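sentence = re.sub(r'「[^」]*」','', sentence)
You need to change the negated character class [^...] so that it stops at 」 instead of ).
The pattern must be a Unicode string when it contains such characters: on Python 2 that means ur'...', while on Python 3 a plain r'...' is already Unicode and the ur prefix is a syntax error. If there is a ) inside the brackets, your version fails because you used 「[^)]*」: you have instructed the regex to stop when it finds ).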
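Putting the pieces together, here is a sketch that removes all three kinds of bracketed text in one pass and counts matches with findall. The sample sentence and the name strip_bracketed are made up, and the parentheses are assumed to be the ASCII ( and ) shown in the question. The TypeError in the question's EDIT comes from the extra '' argument: re.findall takes (pattern, string, flags), so match.text ended up being used as the flags.

```python
import re

# One alternative per bracket pair; each class excludes its own closing bracket.
BRACKETED = re.compile(r'「[^」]*」|\([^)]*\)|《[^》]*》')

def strip_bracketed(sentence):
    """Remove 「...」, (...) and 《...》 together with their contents."""
    return BRACKETED.sub('', sentence)

sample = 'あ「い(う)」え(お)か《き》く'
cleaned = strip_bracketed(sample)
print(cleaned)

# findall takes only the pattern and the string to search.
quoted = re.findall(r'「[^」]*」', sample)
print(quoted)
```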
