Stripping non printable characters from a string in python? - python

So currently I am trying to find out how many times a specific word appears on a page.
My Python code has this:
print(len(re.findall(secondAnswer, page)))
0
Upon careful analysis, I noticed that
print(secondAnswer) is giving me a different answer "Pacific"
from print(ascii(secondAnswer)) 'Paci\ufb01c'
I have a feeling that my secondAnswer value in len(re.findall(secondAnswer, page)) is using 'Paci\ufb01c' instead and thus not finding any matches on the page.
Can someone give me any tips on how to solve this?
Thanks, Nick

Unicode character fb01 is the fi ligature. That is, it's a single character as far as Python is concerned, but appears as two (tied) characters when displayed.
To decompose ligatures into their separate characters, you can use unicodedata.normalize. For example:
page = unicodedata.normalize("NFKD", page)
Or in this specific case, you could write your regex to accept the ligature as an alternative to the fi character sequence, for example by using alternation with a non-capturing group: paci(?:\ufb01|fi)c.
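A minimal sketch of both suggestions, assuming the ligature sits in the search term and the page uses plain letters. The sample page text and the variable names are made up for illustration.

```python
import re
import unicodedata

second_answer = "Paci\ufb01c"  # contains U+FB01, the "fi" ligature
page = "The Pacific Ocean is large. Pacific salmon live there."  # made-up sample

# NFKD (compatibility decomposition) splits the ligature into "f" + "i".
normalized = unicodedata.normalize("NFKD", second_answer)

# re.escape keeps the search term literal in case it contains regex metacharacters.
before = len(re.findall(re.escape(second_answer), page))
after = len(re.findall(re.escape(normalized), page))
print(before, after)  # 0 2

# Alternative: a pattern that accepts either spelling.
either = re.compile(r"Paci(?:\ufb01|fi)c")
both = either.findall("Pacific and Paci\ufb01c")
print(len(both))  # 2
```

If the page itself may contain ligatures, normalizing both the page and the search term the same way is probably the safer choice.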

Related

Use re to detect string with any numbers, symbols, or letters

I have a python program and am trying to do a re.search to find a specific pattern in text. The issue I am facing is that the middle search for "[a-zA-Z0-9/" ]+" does not find any number/symbol/or letter and I have to specify each type of symbol I want it to pick up on.
re.search(r'[0-9] [a-zA-Z0-9/" ]+ [0-9]', text)
I am trying to detect strings in text.
I guess you are looking for any run of non-whitespace characters. In that case you do not need to list every type of symbol in the character class; \S matches all of them.
x = re.search(r'[0-9] \S+ [0-9]', text)
Samples are provided in the below link. Try this, if it helps you.
https://www.w3schools.com/python/python_regex.asp
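As a rough illustration of the difference, the sketch below runs both patterns on a made-up sample string. The original character class misses symbols it does not list, while \S+ accepts any non-whitespace run.

```python
import re

text = "order 3 abc-def! 7 shipped"  # made-up sample text

# Original class: '-' and '!' are not listed, so nothing matches.
narrow = re.search(r'[0-9] [a-zA-Z0-9/" ]+ [0-9]', text)

# \S+ matches any run of non-whitespace characters.
broad = re.search(r'[0-9] \S+ [0-9]', text)

print(narrow)         # None
print(broad.group())  # 3 abc-def! 7
```

Note that \S+ does not match spaces, so a middle part made of several words would need a different pattern, such as a lazy .+? between the digits.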

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regex: (?:...), (941){1} and the make regex literal character \ like this \(941\) with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to be done using regex. Though it might be useful for others too or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
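This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next. Note that the character class contains \s, which also matches \n, so a single match can span several lines; drop \s from the class if each match should stay on one line.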
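To see why escaping matters here, a small sketch comparing the unescaped and escaped variable on the string from the question:

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped, the parentheses form a capture group around the digits 941.
unescaped = re.findall(num, s)

# Escaped, the parentheses are matched as literal characters.
escaped = re.findall(re.escape(num), s)

print(unescaped)  # ['941', '941']
print(escaped)    # ['(941)', '(941)']
```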

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
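The difference between the character set and the group can be checked with a short sketch. It is written in Python rather than JavaScript, on the assumption that these constructs behave the same way in both; the URLs are made up, and the dot in baidu.com is escaped as suggested above.

```python
import re

# [wd|word|qw] is a set of single characters: w, d, |, o, r, q.
char_class = re.compile(r'.*baidu\.com.*[/?].*[wd|word|qw]=')

# (?:wd|word|qw) is an alternation of whole parameter names.
group = re.compile(r'.*baidu\.com.*[/?].*(?:wd|word|qw)=')

good = 'http://www.baidu.com/s?word=python'  # made-up URL with a wanted parameter
bad = 'http://www.baidu.com/s?id=5'          # made-up URL with an unrelated parameter

print(bool(char_class.match(bad)))  # True: the 'd' before '=' is in the set
print(bool(group.match(bad)))       # False
print(bool(group.match(good)))      # True
```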

How does this code work to extract URL's from a string with regex

I'm using a snippet i found on stackexchange that finds all url's in a string, using re.findall(). It works perfectly, however to further my knowledge I would like to know how exactly it works. The code is as follows-
re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', site)
As far as i understand, its finding all strings starting with http or https (is that why the [s] is in square brackets?) but I'm not really sure about all the stuff after- the (?:[etc etc etc]))+. I think the stuff in the square brackets eg. [a-zA-Z] is meaning all letters from a to z caps or not, but what about the rest of the stuff? And how is it working to only get the url and not random string at the end of the url?
Thanks in advance :)
Using this link you can get your regex explained:
Your regex explained
To add a bit more:
[s]? means "an optional 's' character", but that is because of the ?, not because of the brackets (I think they are superfluous).
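Space isn't one of the accepted characters so it would stop there indeed. '/' is accepted, however: it is not literally mentioned, but it is part of the character range $-_, which runs from 0x24 to 0x5F and so also covers the digits, the uppercase letters and punctuation such as ':', '=' and '?' (see http://www.asciitable.com/index/asciifull.gif).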
(?:%[0-9a-fA-F][0-9a-fA-F]) this matches hexadecimal character codes in URLs e.g. %2f for the '/' character.
A non-capturing group means that the group is matched but that the resulting match is not stored in the regex return value, i.e. you cannot extract that matching bit of the string after the regex has been run against your string.
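A small sketch of what that means in practice with re.findall, followed by the URL pattern from the question run on a made-up sample string (written as a raw string here to avoid invalid-escape warnings):

```python
import re

# With a capturing group, findall returns the group's last capture, not the whole match.
capturing = re.findall(r'(ab)+', 'abab xx ab')
# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'(?:ab)+', 'abab xx ab')

print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['abab', 'ab']

site = 'See http://example.com/a%2Fb and https://example.org/path for details.'  # made-up
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
urls = re.findall(pattern, site)
print(urls)  # ['http://example.com/a%2Fb', 'https://example.org/path']
```

Each match ends at the first whitespace character, because a space is not among the accepted characters.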

finding the occurrence of strings in python [closed]

I have a long string which I have parsed through beautifulsoup and I need advice on the best way to extract data from this soup object.
The number I want is contained inside the soup object, inside () after this text.
View All (8)
What is the most efficient way to locate this, and get the number out of it.
In VBA I would have done this.
(1) Find where this text string starts; for example, if the soup is 1000 characters long, the text might start at position 200.
Then I would loop until I found the ending ), grab that text, store it in a variable, and process each character removing everything which is not a number.
So if I have >View All (8) I would end up with 8. The number inside is not known in advance; it could be 100, 110, or 2000.
I have just started learning python, don't yet know how to use regular expression but that seems the way to go?
Sample String
">View All (90)</a>
Expected Result - hopeful
90
Sample String
">View All (8)</a>
Expected Result - hopeful
8
Seeing how my comment provoked some more questions, let me expand it a bit. First, welcome to the wonderful world of regular expressions. Regular expressions can be quite a headache, but mastering them is a very useful skill. A very clear tutorial was written by A.M. Kuchling, one of Python's old hackers from the early days. If memory serves me he wrote the re library, with (as an additional bonus) an undocumented implementation of lex in some 15 odd lines of python. But I digress. You can find the tutorial here. https://docs.python.org/2/howto/regex.html
Let me go over the expression bit by bit:
import re
m = re.compile(r'View All \((\d*?)\)').search(soupstring)
print(m.group(1))
The r in front of the quotation marks marks it as a raw string in Python. Python will preprocess normal string literals, so that a backslash is interpreted as a special character. E.g. a '\t' in a string will be replaced by the tab character. Try print '\' to see what I mean. To include a '\' in a string you have to escape it like this '\\'. This can be a problem as a backslash is also an escape character for the regular expression engine. If you have to match patterns that contain backslashes, you will soon be writing patterns like this '\\\\'. Which can be fun . . . If you like 50 shades of grey, give it a try.
Inside the regular expression language, '(' characters are special. They are used to group parts of the match together. Since you are only interested in the digits between the parentheses, I used a group to extract this data. Other special characters are '{', '[', '*', '?', '\' and their matching counterparts. I am sure I have forgotten a few, but you can look them up.
With that information, the '\(' will make more sense. Since I have escaped the '(' it tells the regular expression parser to ignore the special meaning of '(' and instead match it against a literal '(' character.
The sequence '\d' is again special. An escaped '\d' means, do not interpret this as a literal 'd', but interpret it as "any digit character".
The '*' means take the last pattern and match it zero or more times.
The '*?' variant means use "non-greedy matching": return the shortest possible match instead of the longest possible match. In the context of regular expressions greed is usually good. As Sebastian has noted, the '?' is not needed here. However, if you ever need to find html elements or quoted strings, then you can use '<.*?>' or '".*?"'.
Please note that '.' is again special. It means match "any character (except the newline (well most of the time anyway))".
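A short sketch contrasting '*' with '*?' on a sample string modelled on the question, followed by the number extraction itself. The href value is made up, and \d+ is used here instead of \d*? as an assumption that at least one digit is always present.

```python
import re

html = '<a href="#">View All (90)</a>'  # sample modelled on the question

# '*' is greedy: it takes as much as possible, so the whole line is one match.
greedy = re.findall(r'<.*>', html)

# '*?' is lazy (non-greedy): it stops at the first '>' it can.
lazy = re.findall(r'<.*?>', html)

print(greedy)  # ['<a href="#">View All (90)</a>']
print(lazy)    # ['<a href="#">', '</a>']

# Escaped parentheses match literally; the inner group captures the digits.
count = int(re.search(r'View All \((\d+)\)', html).group(1))
print(count)  # 90
```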
Have fun . . .
