I have a Python program and am trying to use re.search to find a specific pattern in text. The issue I am facing is that the middle part of the pattern, [a-zA-Z0-9/" ]+, does not match every number, symbol, or letter, so I have to list each type of symbol I want it to pick up.
re.search(r'[0-9] [a-zA-Z0-9/" ]+ [0-9]', text)
I am trying to detect strings in text.
I guess you are looking for any run of non-space characters. \S matches any non-whitespace character, so you don't have to list every type of symbol in the character class.
x = re.search(r'[0-9] \S+ [0-9]', text)
Samples are provided in the below link. Try this, if it helps you.
https://www.w3schools.com/python/python_regex.asp
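To see the difference between the two patterns, here is a small sketch. The sample text is made up for illustration, since the question did not include one.

```python
import re

# Hypothetical sample text; the original question did not show any.
text = 'size 4 3.5-inch 7 in stock'

# Original pattern: the middle part may only contain letters, digits, '/', '"' and spaces.
explicit = re.search(r'[0-9] [a-zA-Z0-9/" ]+ [0-9]', text)

# Suggested pattern: \S+ matches any run of non-whitespace characters,
# so '.' and '-' are accepted without being listed.
non_space = re.search(r'[0-9] \S+ [0-9]', text)

print(explicit.group(0) if explicit else None)
print(non_space.group(0) if non_space else None)
```

Keep in mind that \S+ does not match spaces, while the original character class did. If the middle part can span several words, something like .+? in place of \S+ may be a better fit.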
I am writing a function in python using regex that should return text when an element of that text is matched but the outputs I'm getting aren't as expected and I'm not sure what is going wrong.
My function is as below:
def latin_ish_words(text):
    latin = re.findall('tion|ex|ph|ost', text, re.I)
    return latin
When I pass latin_ish_words("This functions as expected") it returns the elements 'tion' and 'ex' rather than 'functions' and 'expected'.
If someone could tell me where I've gone wrong, I'd be most appreciative!
Many thanks,
Andrew
The function returns the matching text, and that's what you saw. If you want to find whole words that contain a specific string, your pattern has to say so.
I think \w*(?:tion|ex|ph|ost)\w* should help you find what you're expecting (the * quantifiers are greedy by default, so no extra flag is needed).
Let's look at the modifications:
\w - matches a "word-character" (letters in upper- or lowercase, digits or underscore)
* - previous pattern needs to match between zero and unlimited times
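(?:...) - a non-capturing group around the alternatives; because it does not capture, re.findall returns the whole match rather than just the group
So basically we're just allowing word characters before and after. If you wanted to be more strict and only accept letters, use [A-Za-z]* instead of \w* (note that [A-z] would also match the characters [, \, ], ^, _ and `).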
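Put together, a minimal sketch of the function with the modified pattern could look like this:

```python
import re

def latin_ish_words(text):
    # \w* on both sides extends each match to the whole surrounding word.
    # (?:...) is non-capturing, so findall returns the full matches.
    return re.findall(r'\w*(?:tion|ex|ph|ost)\w*', text, re.I)

print(latin_ish_words("This functions as expected"))
# ['functions', 'expected']
```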
Okay so in Python, I'm trying to search for the pattern "comma, space, any lowercase character", but I can't get a regular expression that seems to work. The whole regular expressions thing is pretty new to me and I have no idea what I'm doing. I was able to search for "number, space, any character" using "[1-9]+ [a-zA-z]", but I'm not sure how to search for the pattern mentioned above. The picture included is an example of what pattern I am trying to search for in the text file.
Thanks,
Schulzy
A regular expression that would work is
, [a-z]
The comma and space are matched literally. The '[]' is a character class, which matches any single character listed inside it. You want any lowercase character, so we put [a-z] for any character from lowercase a to z.
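A quick sketch of that pattern in Python. The sample line is invented here, because the text file from the question is not available.

```python
import re

# Hypothetical sample line; the original text file was not shown.
line = "Smith, john and Doe, Jane, mary"

# A literal comma, a literal space, then one lowercase ASCII letter.
matches = re.findall(r', [a-z]', line)
print(matches)
# [', j', ', m']
```

", Jane" is skipped because the letter after the space is uppercase.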
text to capture looks like this..
Policy Number ABCD000012345 other text follows in same line....
My regex looks like this
regex value='(?i)(?:[P|p]olicy\s[N|n]o[|:|;|,][\n\r\s\t]*[\na-z\sA-Z:,;\r\d\t]*[S|s]e\s*[H|h]abla\s*[^\n]*[\n\s\r\t]*|(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)(?P<policy_number>[^\n]*)'
This particular case matches the second alternative of the "or" (|). However, it also captures everything after the policy number. What stopping condition would make it grab just the number? I know something is wrong but can't find a way out.
(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)
current output
ABCD000012345othertextfollowsinsameline....
expected output
ABCD000012345
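You can use a simpler regex that matches from the beginning of the line and relies on the word boundary \b to stop after the number, for example "[Pp]olicy\s*[Nn]umber\s*\b([A-Z]{4}\d+)\b.*". Note that [P|p] is a character class that also matches a literal |, so [Pp] is what you want. The code below uses the looser [A-Z0-9]+ for the number:
import re
pattern = re.compile(r"[Pp]olicy\s*[Nn]umber\s*\b([A-Z0-9]+)\b.*")
line = "Policy Number ABCD000012345 other text follows in same line...."
matches = pattern.match(line)
id_res = matches.group(1)  # first capture group: the policy number only
print(id_res) # ABCD000012345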
And if there are always two words before the number, you can use (?:\w+\s+){2}\b([A-Z0-9]+)\b.*
Also \s is for [\r\n\t\f\v ] so no need to repeat them, your [\n\r\s\t] is just \s
You don't need to specify both upper and lower case for p and n, since (?i) already makes the match case-insensitive.
Also \s already covers \n, \t and \r.
(?i)policy\s+number\s+([A-Z]{4}\d+)\b
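As a rough sketch, this is how that pattern could be used from Python with the sample line from the question. The variable names are only for illustration.

```python
import re

line = "Policy Number ABCD000012345 other text follows in same line...."

# (?i) at the start makes the whole pattern case-insensitive,
# so [A-Z] here also accepts lowercase letters.
pattern = re.compile(r'(?i)policy\s+number\s+([A-Z]{4}\d+)\b')

match = pattern.search(line)
policy_number = match.group(1) if match else None
print(policy_number)
# ABCD000012345
```

The capture group stops at the first character that is not a digit, so the rest of the line is left out.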
Another Solution:
^[\s\w]+\b([A-Z]{4}\d+)\b
I like this better, in case your text changes from policy number
So currently I am trying to find out how many times a specific word appears on a page.
My Python code has this:
print(len(re.findall(secondAnswer, page)))
0
Upon careful analysis, I noticed that print(secondAnswer) displays "Pacific", whereas print(ascii(secondAnswer)) gives 'Paci\ufb01c'.
I have a feeling that my secondAnswer value in len(re.findall(secondAnswer, page)) is using 'Paci\ufb01c' instead and thus not finding any matches on the page.
Can someone give me any tips on how to solve this?
Thanks, Nick
Unicode character fb01 is the fi ligature. That is, it's a single character as far as Python is concerned, but appears as two (tied) characters when displayed.
To decompose ligatures into their separate characters, you can use unicodedata.normalize. For example:
page = unicodedata.normalize("NFKD", page)
Or in this specific case, you could write your regex to accept the ligature as an alternative to the two-character sequence fi, for example by using alternation inside a non-capturing group: paci(?:\ufb01|fi)c.
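Here is a minimal sketch of the normalization approach. It assumes the ligature can show up in either the search term or the page, so both are normalized; the page text and the helper name normalize are made up for the example.

```python
import re
import unicodedata

second_answer = "Paci\ufb01c"  # contains U+FB01, the "fi" ligature
page = "The Pacific Ocean. Pacific time."  # hypothetical page text

def normalize(s):
    # NFKD applies compatibility decomposition, which turns the
    # single ligature character into the two characters "f" and "i".
    return unicodedata.normalize("NFKD", s)

# re.escape guards against regex metacharacters in the search term.
needle = re.escape(normalize(second_answer))
count = len(re.findall(needle, normalize(page)))
print(count)
# 2
```

Without the normalization step, the same findall call on this page returns no matches, which is the behaviour described in the question.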
UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even NEED a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests though, using regex is not recommended for parsing HTML/XML style documents but you do you.
EDIT - accommodate your edit:
Now that I understand your problem more, I can see why you would want to use a look-around. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing token does not include spaces:
pattern = r'src="img (\w+?\.png)"'
Note the space after img in the pattern; it is needed because of how your text is laid out.
\w - \w is equivalent to [a-zA-Z0-9_] (any letters, numbers or underscore)
This stops the match from starting at the first 'img ' that pops up and ensures your capture group doesn't contain any spaces.
By using \w, I am assuming you only expect letters, digits, and underscores. To include anything else, make your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
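For reference, a short sketch of that last pattern in Python, using the sample text from the question:

```python
import re

text = 'img good img two_apple.txt'

# A space, then a run of non-space characters ending in "_apple.txt".
# The parentheses capture only the file name.
match = re.search(r' ([^ ]+_apple\.txt)', text)
file_name = match.group(1) if match else None
print(file_name)
# two_apple.txt
```

Because [^ ] cannot cross a space, the match cannot start at an earlier "img" and swallow "good img two".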