How to use re.search to find multiple strings? - python

I'm trying to check if a certain line in many different text documents equals one of many different strings. The goal here is to classify those documents and then parse them according to that classification.
In my text editor I can use regex to search for:
(?:^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)
But if I try to use this in an if statement in python I keep getting syntax errors for any way I tried it.
My current version is this:
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower()replace('´','')):
    no_author = True
I saw that a possible solution is using a for loop and putting the different strings into a list, but as this would require some extra steps I'd prefer to do it as I tried if possible.

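You should include the error message in your question. Your problem is probably just a typo: the dot between .lower() and replace('´','') is missing. With the dot added it parses: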
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower().replace('´','')):
    no_author = True
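If the long inline pattern gets hard to maintain, the alternation can also be built from a list once and reused. The following is only a sketch: LOCATIONS holds a shortened, hypothetical subset of the names, is_location_line is a made-up helper name, and the match is simplified so that every name must fill the whole line (the original pattern lets steiermark and ilztal match as prefixes).

```python
import re

# Hypothetical subset of the names from the question; extend as needed.
LOCATIONS = ["kärnten", "steiermark", "graz", "bad radkersburg", "wien"]

# Escape each name and join them into one alternation.
# The optional \n allows for a trailing newline on the line.
LOCATION_RE = re.compile("(?:" + "|".join(re.escape(n) for n in LOCATIONS) + ")\n?")


def is_location_line(line):
    """Return True if the whole line is one of the known names."""
    cleaned = line.lower().replace("´", "")
    return LOCATION_RE.fullmatch(cleaned) is not None
```

It would be used as: if is_location_line(article_lines[5]): no_author = True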


Python - Injecting html tags into strings based on regex match

I wrote a Python script for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
The problem with my approach is that I iterate over each instance found by findall (exact match),
while the same match can occur several times within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> around each match (case insensitive) without changing the case of the matched text, I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay, here are two ways I put together quickly. The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces every match with the lowercase search term.
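import re

word = "port"

# Read the lines once so that both loops can iterate over them
with open("testing.txt", "r") as f:
    lines = f.readlines()

# THIS LOOP IS CASE SENSITIVE
for line in lines:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print(newline)

# THIS LOOP IS CASE INSENSITIVE
# re.escape keeps special characters in the search word literal
pattern = re.compile(re.escape(word), re.IGNORECASE)
for line in lines:
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print(newline)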
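To keep the original case of each match, as the question asks, re.sub can take a function instead of a replacement string; the function receives each match object and can reuse the matched text. A minimal sketch, where highlight is a hypothetical helper name:

```python
import re


def highlight(text, word):
    """Wrap every case-insensitive occurrence of word in red bold tags,
    keeping the matched text exactly as it appears in the input."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return pattern.sub(
        lambda m: '<b><font color="red">' + m.group(0) + "</font></b>",
        text,
    )
```

Because the whole string is processed in a single pass, already inserted tags are never matched again, which avoids the nested tags shown in the question.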

python - complex boolean search for words in files

I have a bunch of files in a folder. Let's assume I convert all into plain text files.
I want to use python to perform searches like this:
query = '(word1 and word2) or (word3 and not word4)'
The actual logic varies, and multiple words can be used together. Another example:
query = '(shiny and glass and "blue car")'
Also the words are provided by the users so they are variables.
I want to display the sentences that matched and the filenames.
This really does not need a complex search engine like whoosh or haystack which need to index files with fields.
Also, those tools do not seem to have a boolean query as I explained above.
I've come across pdfquery library which does exactly what I want for pdfs, but now I need that for text files and xml files.
Any suggestions?
There's no easy way to say this, but this is not easy. You're trying to translate unsafe strings into executable code, so you can't take the easy way out and use eval. These aren't literals, so you can't use ast.literal_eval either. You need to write a lexer that recognizes things like AND, NOT, OR, (, and ) and treats them as something other than strings. On top of that you apparently need to handle compound booleans, so this becomes quite a bit more difficult than you might think.
Your question asks about searching by sentence, but Python reads files by line, not by sentence. You'd have to write another lexer to get the data by sentence instead of by line. You'll need to read heavily into the io module to do this effectively. I don't know how to do it off-hand, but essentially you'll be looping while there is data left, reading one buffer's worth each iteration, and yielding when you reach a "\.(?=\s+)"
Then you'll have to run your first query lexer results through a set of list comprehensions, each one running across the results of the file lexer.
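The approach described above can be sketched with a small recursive-descent evaluator. This is only an illustration under several assumptions: the operators are the lowercase-insensitive words and, or and not; quoted phrases are matched as substrings; sentences are split on whitespace that follows ., ! or ?; and the names matches and matching_sentences are made up for this example.

```python
import re

# A token is a parenthesis, a quoted phrase, or a bare word.
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def matches(query, text):
    """Evaluate a boolean query against one piece of text (case-insensitive)."""
    tokens = TOKEN_RE.findall(query)
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))
    pos = 0

    def peek():
        return tokens[pos].lower() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():  # lowest precedence
        result = parse_and()
        while peek() == "or":
            advance()
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        result = parse_not()
        while peek() == "and":
            advance()
            right = parse_not()
            result = result and right
        return result

    def parse_not():
        if peek() == "not":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        if pos >= len(tokens) or peek() == ")":
            raise ValueError("expected a word, phrase or '('")
        tok = advance()
        if tok == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        if tok.startswith('"'):
            return tok.strip('"').lower() in lowered  # phrase: substring test
        return tok.lower() in words  # bare word: whole-word test

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


def matching_sentences(query, text):
    """Return the sentences of text that satisfy the query."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s and matches(query, s)]
```

Since the query is parsed rather than passed to eval, user-supplied words are only ever treated as data. Looping over the files in a folder and calling matching_sentences on each file's contents would give the sentence and filename pairs the question asks for.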
I really needed to have such a solution so I made a python package called toned
I hope it will be useful to others as well.
Maybe I've answered this question too late, but I think the best way to handle complex boolean search expressions is this implementation based on Pyparsing.
As you can see in its description, all these cases are covered:
SAMPLE USAGE:
from booleansearchparser import BooleanSearchParser

bsp = BooleanSearchParser()
text = u"wildcards at the begining of a search term "
exprs = [
    u"*cards and term",  #True
    u"wild* and term",   #True
    u"not terms",        #True
    u"terms or begin",   #False
]
for expr in exprs:
    print(bsp.match(text, expr))

#non-western samples
text = u"안녕하세요, 당신은 어떠세요?"
exprs = [
    u"*신은 and 어떠세요",  #True
    u"not 당신은",          #False
    u"당신 or 당",          #False
]
for expr in exprs:
    print(bsp.match(text, expr))
It allows wildcard, literal and not searches nested in as many parentheses as you need.

Replacing strings in a text and ignoring certain parts

I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are ensconced within say % signs (%Please ignore this%). Better still, how do I get it to ignore the text within but remove the % sign at the end of the run.
Thank you.
This could very easily be done with regular expressions, although they may not be supported by the programs you find online. You will probably need to write something yourself and then use regexes as your dict's search keys.
Good place to start playing around with regex is: http://regexr.com
In the replacement dictionary, add a second entry for every word you want protected: for example, teh is replaced with the, but %teh% is replaced with teh. For the program in the link you could have:
wordDic = {
    'booster': 'rooster',
    '%booster%': 'booster'
}
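The regular-expression route mentioned above can be sketched as follows. It assumes that protected text never contains a % itself, that word_dic is non-empty, and that the dictionary keys are whole words; replace_outside_percent is a hypothetical name.

```python
import re


def replace_outside_percent(text, word_dic):
    """Replace dictionary words, but leave anything between % signs
    untouched and drop the % signs themselves."""
    # Longest keys first so that a longer word wins over its own prefix.
    keys = sorted(word_dic, key=len, reverse=True)
    words = "|".join(re.escape(k) for k in keys)
    # Alternative 1 captures a protected %...% span,
    # alternative 2 captures a dictionary word.
    pattern = re.compile(r"%([^%]*)%|\b(" + words + r")\b")

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # protected text, % signs stripped
        return word_dic[match.group(2)]

    return pattern.sub(repl, text)
```

Because the protected span is tried first at every position, words inside %...% are consumed before the dictionary alternative can see them, so whole phrases can be protected rather than single words.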

PyYAML variables in multiline

I'm trying to get a multi-line string to use variables in PyYAML, but I'm not sure if this is even possible.
So, in YAML, you can assign a variable like:
current_host: &hostname myhost
But it doesn't seem to expand in the following:
test: |
  Hello, this is my string
  which is running on *hostname
Is this at all possible or am I going to have to use Python to parse it?
The anchor (&some_id) and reference (*some_id) mechanism is meant for sharing complete nodes between parts of the tree representation of a YAML document. This is necessary, for example, so that a complex item (a sequence/list or a mapping/dict) that occurs twice in a list loads as one and the same object instead of two copies with the same values. References are not expanded inside a scalar such as your block string.
So yes, you need to do the parsing in Python. You could start with the mechanism I provided in this answer and change the test
if node.value and node.value.startswith(self.d['escape'])
to find the escape character in any place in the scalar and take appropriate action.
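A simpler alternative to a custom loader, not the mechanism described above, is to post-process the loaded data in Python. The sketch below assumes a different placeholder convention, {key}, referring to other top-level keys, and it uses a plain dict as a stand-in for the result of yaml.safe_load so that it runs without PyYAML. The name expand_placeholders is hypothetical.

```python
def expand_placeholders(data):
    """Fill {key} placeholders in string values from the mapping's own
    top-level scalar values. Literal braces in values are not supported."""
    scalars = {k: v for k, v in data.items() if isinstance(v, (str, int, float))}
    return {
        k: v.format_map(scalars) if isinstance(v, str) else v
        for k, v in data.items()
    }


# Stand-in for yaml.safe_load(...) applied to:
#   current_host: myhost
#   test: |
#     Hello, this is my string
#     which is running on {current_host}
loaded = {
    "current_host": "myhost",
    "test": "Hello, this is my string\nwhich is running on {current_host}\n",
}
expanded = expand_placeholders(loaded)
```

An unknown placeholder raises KeyError, which is usually preferable to silently leaving it in the text.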
You can find the answer here.
Just use a + between lines, and enclose your strings in single quotes.

diff for single lines

All diff tools I've found are just comparing line by line instead of char by char. Is there any library that gives details on single line strings? Maybe also a percentage difference, though I guess there are separate functions for that?
This algorithm diffs word-by-word:
http://github.com/paulgb/simplediff
available in Python and PHP. It can even spit out HTML formatted output using the <ins> and <del> tags.
I was looking for something similar recently, and came across wdiff. It operates on words, not characters, but is this close to what you're looking for?
What you could try is to split both strings up character by character into lines and then use diff on that. It's a dirty hack, but at least it should work and is quite easy to implement.
Alternatively, you can split the string into a list of chars in Python and use difflib. See the Python difflib reference.
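As a sketch of the difflib route: SequenceMatcher works on strings directly, so no explicit split into characters is needed, and its ratio() gives a similarity score that can be read as a percentage. The helper name char_diff is made up for this example.

```python
import difflib


def char_diff(a, b):
    """Return the character-level changes between a and b and their
    similarity as a percentage."""
    matcher = difflib.SequenceMatcher(None, a, b)
    changes = [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return changes, matcher.ratio() * 100
```

For example, char_diff("kitten", "mitten") reports a single replace of k with m. Note that ratio() measures similarity, so the percentage difference is 100 minus that value.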
You can implement a simple Needleman–Wunsch algorithm. The pseudo code is available on Wikipedia: http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
