using Python to search for keywords in pdf [duplicate] - python

This question already has answers here:
Searching text in a PDF using Python? [duplicate]
(11 answers)
Closed 8 years ago.
I'm searching for keywords in a PDF file, so I'm trying to search for /AA or /Acroform like this:
import re
l = "/Acroform "
s = "/Acroform is what I'm looking for"
if re.search(r"\b" + l.rstrip() + r"\b", s):
    print("yes")
Why don't I get "yes"? I want the "/" to be part of the keyword I'm looking for, if it exists.
Can anyone help me with this?

\b only matches in between a \w (word) and a \W (non-word) character, or vice versa, or when a \w character is at the edge of a string (start or end).
Your string starts with a forward slash (/), a non-word character (\W), so \b can never match between the start of the string and the /. Don't use \b here; use an explicit negative look-behind for a word character instead:
re.search(r'(?<!\w){}\b'.format(re.escape(l)), s)
The (?<!...) syntax defines a negative look-behind; like \b it matches a position in the string. Here it'll only match if the preceding character (if there is any) is not a word character.
I used string formatting instead of concatenation here, and used re.escape() to make sure that any regular expression meta characters in the string you are searching for are properly escaped.
Demo:
>>> import re
>>> l = "/Acroform "
>>> s = "/Acroform is what I'm looking for"
>>> if re.search(r'(?<!\w){}\b'.format(re.escape(l)), s):
...     print('Found')
...
Found
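As a rough sketch of how this could be wrapped up for checking several PDF keywords (the helper name contains_keyword and the sample text are made up for illustration; the trailing \b is swapped here for a negative look-ahead (?!\w), so that keywords ending in a non-word character would also work):

```python
import re

def contains_keyword(keyword, text):
    """Return True if keyword occurs in text and is not glued to a word character."""
    # strip() drops the stray trailing space from keywords such as "/Acroform "
    escaped = re.escape(keyword.strip())
    # (?<!\w): no word character right before; (?!\w): none right after
    pattern = r'(?<!\w){}(?!\w)'.format(escaped)
    return re.search(pattern, text) is not None

# Hypothetical sample text, not taken from a real PDF
sample = "<< /Type /Catalog /AA 5 0 R >>"
for kw in ("/AA", "/Acroform "):
    print(kw.strip(), contains_keyword(kw, sample))
```

Stripping the keyword before escaping it means the check no longer depends on what happens to follow the trailing space in the text.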

Related

Isn't the 'r' letter making the regex pattern string literal? [duplicate]

This question already has answers here:
What exactly is a "raw string regex" and how can you use it?
(7 answers)
Closed 7 months ago.
I had thought the 'r' prefix in the pattern makes sure that everything in the pattern is interpreted as a string literal, so that I don't have to escape anything. But in the case below, I still have to use '\.' for a literal match. So what's the purpose of the 'r' at the beginning of the regex?
import re
pattern = r'\.'
text = "this is. test"
text = re.sub(pattern, ' ', text)
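The r prefix stands for "raw". It means that backslash escape sequences inside a raw string are not processed by Python and are kept as literal characters. Consider:
print('Hello\b World') # \b becomes a single backspace character
print(r'Hello\b World') # Hello\b World
In the first, non-raw example, Python interprets \b as a control character (backspace), which is not printed as visible text. In the second example, the raw string keeps the backslash and the b as two separate characters; when that string is passed to the regex engine, the sequence means a word boundary.
Another example is comparing '\1' to r'\1'. The former is a control character (the character with code 1), while the latter is what the regex engine reads as a reference to the first capture group. To refer to the first capture group without a raw string, double the backslash, i.e. use '\\1'. The r prefix only changes how Python reads backslashes; regex metacharacters such as . keep their special meaning and still need a backslash to match literally.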
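A short sketch to make the difference visible (the sample strings are arbitrary and only for illustration):

```python
import re

# Python processes '\b' into one backspace character; r'\b' stays as two characters
print(len('\b'), len(r'\b'))

# Doubling the backslash gives the same string as the raw literal
print('\\1' == r'\1')

# The r prefix does not escape regex metacharacters: the dot still needs a backslash
print(re.sub(r'\.', ' ', "this is. test"))

# In a replacement string, r'\1' and r'\2' refer to the capture groups
print(re.sub(r'(\w+) (\w+)', r'\2 \1', "hello world"))
```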

How to check string start with $ symbol and can have alphabet and positive digit [duplicate]

This question already has answers here:
Regex currency validation
(2 answers)
Closed 10 months ago.
I am trying to check that a string:
Must start with the $ symbol
After the $ symbol, contains only letters and digits (no sign)
Contains no special characters or spaces (except the $ symbol at the beginning)
is_match = re.search("^\$[a-zA-Z0-9]", word)
The problem I am facing:
It accepts special characters and spaces in my string.
I modified your regex to this:
^\$[a-zA-Z0-9]+$
^ asserts position at the start of the string
\$ matches the character $ at the beginning
[a-zA-Z0-9]+ matches these characters one or more times
$ asserts position at the end of the string
Explanation:
Your regex only checked that the string starts with $ followed by one letter or digit, so it didn't matter how the string ended. I added + and a $ at the end of your regex; the $ asserts position at the end of the string. This makes sure that everything after the leading $ consists only of letters and digits and nothing else.
Source:
import re
regex = r"^\$[a-zA-Z0-9]+$"
test_str = "$abc123 "  # the trailing space makes the match fail
is_match = re.search(regex, test_str)
if is_match:
    print("Yes")
else:
    print("No")
Backslashes in regular strings are processed by Python before the regex engine gets to see them. Use a raw string around regular expressions, generally (or double all your backslashes).
Also, your regex simply checks if there is (at least) one alphanumeric character after the dollar sign. If you want to examine the whole string, you need to create a regular expression which examines the whole string.
is_match = re.search(r"^\$[a-zA-Z0-9]+$", word)
or
is_match = re.search("^\\$[a-zA-Z0-9]+$", word)
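One detail worth knowing: in Python, $ also matches just before a trailing newline, so the patterns above accept "$abc123\n". If that matters, re.fullmatch (available since Python 3.4) is one way around it. A minimal sketch, with is_valid as a made-up helper name:

```python
import re

def is_valid(word):
    # fullmatch requires the whole string to match, so no ^ or $ anchors are needed
    return re.fullmatch(r"\$[a-zA-Z0-9]+", word) is not None

for word in ("$abc123", "$abc 123", "$abc!", "$", "$abc123\n"):
    print(repr(word), is_valid(word))
```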

How to find all every element between text Python [duplicate]

This question already has answers here:
Find string between two substrings [duplicate]
(20 answers)
Closed last year.
I'd like to know how to find characters in between texts in python. What I mean is that you have for example:
cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
and I want to save everything between something: and that's it to another string.
So the result would be:
soultion_to_cool_string = ' not cool 8+8 '
You can use str.find():
start = "something:"
end = "that's it"
cool_string[cool_string.find(start) + len(start):cool_string.find(end)]
If you need to remove the surrounding whitespace, call str.strip() on the result.
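Keep in mind that str.find() returns -1 when the substring is missing, which silently produces a wrong slice. A small sketch that handles this with str.partition (the function name between is just for illustration, and unlike the snippet above it only looks for the end marker after the start marker):

```python
def between(text, start, end):
    # partition returns (before, separator, after); separator is '' if not found
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None
    return middle

cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
print(repr(between(cool_string, "something:", "that's it")))
```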
You should look into regular expressions; they will do the job. https://docs.python.org/3/howto/regex.html
For your question, we first need the lookahead and lookbehind expressions.
The lookahead:
Asserts that what immediately follows the current position in the string is foo.
Syntax: (?=foo)
The lookbehind:
Asserts that what immediately precedes the current position in the string is foo.
Syntax: (?<=foo)
We need to look behind for something: and look ahead for that's it:
import re
regex = r"(?<=something:).*?(?=that's it)"  # .*? lazily matches everything in between, except line terminators
re.findall(regex, cool_string)
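Note that Python's re module only allows fixed-width patterns inside a lookbehind. If the delimiters come from variables, a capturing group avoids that limit. A sketch, assuming the delimiters should be treated as plain text (extract_between is a hypothetical name):

```python
import re

def extract_between(text, start, end):
    # re.escape makes the delimiters literal; (.*?) lazily captures what lies between
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    # re.DOTALL lets the captured part span several lines
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

cool_string = "I am a very cool string here is something: not cool 8+8 that's it"
print(repr(extract_between(cool_string, "something:", "that's it")))
```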

Python regex doesnt match when string contains the special character '+' [duplicate]

This question already has answers here:
Escape special characters in a Python string
(7 answers)
Escaping regex string
(4 answers)
Closed 2 years ago.
import re
response = 'string contains+ as special character'
match = re.match(response, response)
print(match)
The match is not successful because the string contains the special character '+'. With other special characters, the match is successful.
Even with a backslash before the special character, it doesn't match.
Neither of these matches:
response = r'string contains\+ as special character'
response = 'string contains\\+ as special character'
How can I match it when the string is a variable and contains this special character?
If you want to use an arbitrary string in a regex but treat it as plain text (so the special regex characters don't take effect), you can escape the whole string with re.escape.
>>> import re
>>> response = 'string contains+ as special character'
>>> re.match(re.escape(response), response)
<re.Match object; span=(0, 37), match='string contains+ as special character'>
In the general case, an arbitrary string used as a regex does not match itself; it only does so when it contains no regex metacharacters.
There are several characters which are regex metacharacters and which do not match themselves. A corner case is . which matches any character (except newline, by default), and so of course it also matches a literal ., but not exclusively. The quantifiers *, +, and ? as well as the generalized repetition operator {m,n} modify the preceding regular expression, round parentheses are reserved for grouping, | for alternation, square brackets define character classes, and finally of course the backslash \ is used to escape any of the preceding metacharacters, or itself.
Depending on what you want to accomplish, you can convert a string to a regex which matches exactly that literal string with re.escape(); but perhaps you simply need to have an incorrect assumption corrected.
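To see the difference side by side, here is a small sketch using the string from the question:

```python
import re

response = 'string contains+ as special character'

# Unescaped, "s+" means "one or more s", so the literal "+" in the text is never matched
print(re.match(response, response))

# Escaped, every metacharacter is matched literally
print(re.match(re.escape(response), response) is not None)

# For a plain substring test, no regex is needed at all
print('contains+' in response)
```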

Regular Expression fails if newline is included [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 2 years ago.
I'm trying to extract a simple sentence from a string delimited with a # character.
str = "#text text text \n text#"
with this pattern
pattern = '#(.+)#'
Now, the funny thing is that the regular expression doesn't match when the string contains a newline character:
out = re.findall(pattern, str) # out is an empty list []
But if I remove \n from the string, it works fine. Any idea how to fix this?
Also pass the re.DOTALL flag, which makes the . match truly everything.
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Use re.DOTALL if you want . to match newlines as well:
>>> out = re.findall('#(.+)#', my_str, re.DOTALL)
>>> out
['text text text \n text']
Also, it's not a good idea to use built-in names as your variable names. Use my_str instead of str.
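Try this regex: "#([^#]+)#"
It matches everything between the delimiters, newlines included, because a negated character class is not affected by the newline restriction on the dot.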
Add the DOTALL flag to your compile or match.
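The suggestions above can be compared in one short sketch (my_str and two_parts are just sample inputs):

```python
import re

my_str = "#text text text \n text#"

# Default: . stops at the newline, so nothing is found
print(re.findall(r'#(.+)#', my_str))

# re.DOTALL: . also matches the newline
print(re.findall(r'#(.+)#', my_str, re.DOTALL))

# The inline flag (?s) is equivalent to passing re.DOTALL
print(re.findall(r'(?s)#(.+)#', my_str))

# Negated character class: matches anything except '#', newlines included
print(re.findall(r'#([^#]+)#', my_str))

# With several delimited parts, greedy .+ and [^#]+ behave differently
two_parts = "#a# and #b#"
print(re.findall(r'#(.+)#', two_parts, re.DOTALL))
print(re.findall(r'#([^#]+)#', two_parts))
```

Which variant fits depends on the input: the greedy .+ runs from the first # to the last one, while [^#]+ stops at the next #.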
