I have a bunch of files in a folder. Let's assume I convert them all into plain text files.
I want to use python to perform searches like this:
query = '(word1 and word2) or (word3 and not word4)'
The actual logic varies, and multiple words can be used together. Another example:
query = '(shiny and glass and "blue car")'
Also, the words are provided by users, so they are variables.
I want to display the sentences that matched and the filenames.
This really doesn't call for a complex search engine like Whoosh or Haystack, which need to index files with fields.
Also, those tools do not seem to support boolean queries like the ones above.
I've come across the pdfquery library, which does exactly what I want for PDFs, but now I need the same for text files and XML files.
Any suggestions?
There's no easy way to say this, but this is not easy. You're trying to translate unsafe strings into executable code, so you can't take the easy way out and use eval. These aren't literals, so you can't use ast.literal_eval either. You need to write a lexer that recognizes things like AND, NOT, OR, (, and ) and treats them as something other than plain strings. On top of that you apparently need to handle compound booleans, so this becomes quite a bit more difficult than you might think.
Your question also asks about searching by sentence, which is not how Python reads files. You'd have to write another lexer to get the data sentence by sentence instead of line by line, and read heavily into the io module to do it efficiently. Essentially you'll be looping while there is data left, reading a buffer's worth each iteration, and yielding whenever you reach a sentence boundary such as the regex \.(?=\s+) (a rough sketch follows below).
Then you'll have to run your query lexer's results through a set of list comprehensions, each one running across the results of the file lexer.
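To make the by-sentence idea concrete, here is a minimal sketch, assuming the files fit in memory (a truly buffered, io-based version would loop and yield as described above):
import re

def sentences(path):
    # Whole-file read is a simplifying assumption; for very large files
    # use the buffered io-based approach described above instead.
    with open(path, encoding='utf-8') as f:
        text = f.read()
    # Split on a period followed by whitespace, per the "\.(?=\s+)" idea.
    for sentence in re.split(r'\.(?=\s+)', text):
        sentence = sentence.strip()
        if sentence:
            yield sentence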
I really needed such a solution, so I made a Python package called toned.
I hope it will be useful to others as well.
Maybe I'm answering this question too late, but I think the best way to handle complex boolean search expressions is this BooleanSearchParser implementation built on Pyparsing.
As you can see in its description, all of these cases are covered:
SAMPLE USAGE:
from booleansearchparser import BooleanSearchParser

bsp = BooleanSearchParser()

text = u"wildcards at the begining of a search term "
exprs = [
    u"*cards and term",  # True
    u"wild* and term",   # True
    u"not terms",        # True
    u"terms or begin",   # False
]
for expr in exprs:
    print(bsp.match(text, expr))

# non-western samples
text = u"안녕하세요, 당신은 어떠세요?"
exprs = [
    u"*신은 and 어떠세요",  # True
    u"not 당신은",  # False
    u"당신 or 당",  # False
]
for expr in exprs:
    print(bsp.match(text, expr))
It allows wildcard, literal and not searches nested in as many parentheses as you need.
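To tie this back to the original question (matching sentences and filenames), a hypothetical glue loop might look like the following; the folder/*.txt glob and the naive split on periods are assumptions, and quoted-phrase support should be verified against the parser's description:
import glob
from booleansearchparser import BooleanSearchParser

bsp = BooleanSearchParser()
query = u'(word1 and word2) or (word3 and not word4)'

for filename in glob.glob('folder/*.txt'):  # hypothetical folder layout
    with open(filename, encoding='utf-8') as f:
        for sentence in f.read().split('.'):  # naive sentence split
            if bsp.match(sentence, query):
                print(filename, sentence.strip())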
I would like to convert a string like:
s = "[2-1,2,3]"
into a list, like:
a = [1,2,3]
I tried it with json:
s = "[2-1,2,3]"
a = json.loads(s)
but it can't handle 2-1.
Is there an easy method to convert strings into any kind of datatype?
Yes. And as much as it pains me to say it, eval is your friend here, as even ast.literal_eval cannot parse this.
Please read this first: eval: bad practice?, and please ensure you have complete control over the expressions being evaluated.
To limit the expressions being evaluated, I've wrapped this solution in a regex that extracts only digits and (in this case) the minus sign.
Obviously this might need tweaking for your specific use case, but it should give you a boilerplate (or at least an idea) from which to start.
Example code:
import re

s = "[2-1,2,3]"
rexp = re.compile(r'[\d-]+')  # raw string: digits and the minus sign only

out = []
for exp in rexp.findall(s):
    out.append(eval(exp))
Or, if you prefer a one-liner:
out = [eval(exp) for exp in rexp.findall(s)]
Output:
[1, 2, 3]
This is a common problem tackled while writing compilers; it usually comes under lexing.
A lexer has a list of token types, scans the input for those tokens, and then passes the token stream to a parser and on to the rest of the compiler.
Your problem cannot be completely solved with a lexer alone, though, since you also need the 2-1 to evaluate to 1. In this case I would suggest using eval as @Daniel Hao suggested, since it is a simple and clean way of achieving your goal. Just remember the caveats (both security and otherwise) while using it, especially in production. (A lexing-style sketch follows the links below.)
If you are interested in the parsing side of things, check these out:
https://craftinginterpreters.com/contents.html
https://tomassetti.me/parsing-in-python/
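To make the lexing idea concrete for this particular input, here is a minimal sketch that tokenizes the bracketed string into digit/minus runs and folds each run left to right, avoiding eval entirely (it assumes no leading negative numbers):
import re

def parse_list(s):
    # Tokenize into runs of digits and minus signs: '2-1', '2', '3'.
    out = []
    for token in re.findall(r'[\d-]+', s):
        parts = [int(p) for p in token.split('-')]
        value = parts[0]
        for p in parts[1:]:
            value -= p  # fold '2-1' left to right: 2 - 1 = 1
        out.append(value)
    return out

print(parse_list("[2-1,2,3]"))  # [1, 2, 3]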
I'm trying to check whether a certain line in many different text documents equals one of many different strings. The goal here is to classify those documents and then parse them according to that classification.
In my text editor I can use regex to search for:
(?:^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)
But if I try to use this in an if statement in Python, I keep getting syntax errors no matter how I write it.
My current version is this:
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower()replace('´','')):
no_author = True
I saw that a possible solution is using a for loop and putting the different strings into a list, but as this would require some extra steps I'd prefer to do it as I tried if possible.
You should include what the error is. Your problem is probably just a typo: you're missing the . between lower() and replace():
if re.search('(^kärnten\n|^steiermark|^graz\n|^madrid\n|^oststeirer\n|^weiz\n|^berlin\n|^lavanttal\n|^villach\n|^osttirol\n|^oberkärnten\n|^klagenfurt\n|^weststeiermark\n|^südsteiermark\n|^südoststeiermark\n|^murtal\n|^mürztal\n|^graz\n|^ennstal\n|^frankreich\n|^österreich\n|^dänemark\n|^polen\n|^großbritannien\n|^italien\n|^hitzendorf\n|^osttirol\n|^slowenien\n|^feldkirchen\n|^völkermarkt\n|^wien\n|^warschau\n|^mailand\n|^mainz\n|^leoben\n|^bleiburg\n|^brüssel\n|^bad radkersburg\n|^london\n|^lienz\n|^liezen\n|^hartberg\n|^ilztal|^pöllau\n|^lobmingtal\n)', article_lines[5].lower().replace('´','')):
no_author = True
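As an aside, since that alternation is long, it may be easier to build the same pattern from a list with re.escape; the short places list here is just illustrative:
import re

places = ['kärnten', 'steiermark', 'graz', 'madrid', 'wien']  # ...and the rest
pattern = re.compile('|'.join('^' + re.escape(p) + r'\n' for p in places))

line = 'graz\n'  # stand-in for article_lines[5].lower().replace('´', '')
no_author = bool(pattern.search(line))
print(no_author)  # True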
I would like to generate string matching my regexes using Python 3. For this I am using handy library called rstr.
My regexes:
^[abc]+.
[a-z]+
My task:
I must find a generic way to create a string that matches both my regexes.
What I cannot do:
Modify both regexes or join them in any way. I consider that an ineffective solution, especially in the case of incompatible regexes:
import re
import rstr

regex1 = re.compile(r'^[abc]+.')
regex2 = re.compile(r'[a-z]+')

for index in range(0, 1000):
    generated_string = rstr.xeger(regex1)
    if re.fullmatch(regex2, generated_string):
        break
else:
    # for-else: runs only if the loop never hit break
    raise Exception('Regexes are probably incompatible.')

print('String matching both regexes is: {}'.format(generated_string))
Is there any workaround or any magical library that can handle this? Any insights appreciated.
Questions which are seemingly similar, but not helpful in any way:
Match a line with multiple regex using Python
The asker already has the string, which they just want to check against multiple regexes in the most elegant way. In my case I need to generate, in a smart way, a string that matches the regexes.
If you want a really generic way, you can't really use a brute-force approach.
What you're looking for is building some kind of representation of the regexp (as rstr does through its call to sre_parse.py) and then handing it to an SMT solver to satisfy both criteria.
For Haskell there is https://github.com/audreyt/regex-genex, which uses the Yices SMT solver to do just that, but I doubt there is anything like it for Python. If I were you, I'd bite the bullet and call it as an external program from your Python program.
I don't know if there is anything that can fulfill your needs more smoothly, but I would do it something like this (as you've done already):
1. Create a regex object with the re.compile() function.
2. Generate a string based on the 1st regex.
3. Pass the string you've got into the 2nd regex object using its search() method.
4. If that passes, you're done: the string matched both regexes.
Maybe you can create a function that takes both regexes as parameters and tests them "2 by 2" using the same logic (sketched after the calls below).
And then if you have 8 regexes to match...
Just do:
call (regex1, regex2)
call (regex2, regex3)
call (regex4, regex5)
...
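A minimal sketch of that pairwise helper, built from the question's own code (the 1000-attempt cap is arbitrary):
import re
import rstr

def generate_matching(pattern_a, pattern_b, attempts=1000):
    # Generate from pattern_a until the result also fullmatches pattern_b.
    regex_b = re.compile(pattern_b)
    for _ in range(attempts):
        candidate = rstr.xeger(pattern_a)
        if regex_b.fullmatch(candidate):
            return candidate
    return None  # probably incompatible

print(generate_matching(r'^[abc]+.', r'[a-z]+'))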
I solved this using a slightly different approach. Notice that the second regex is basically insurance that only lowercase letters end up in the new string.
I used Google's Python package sre_yield, which allows limiting the charset. The package is also available on PyPI. My code:
import sre_yield
import string
sre_yield.AllStrings(r'^[abc]+.', charset=string.ascii_lowercase)[0]
# returns `aa`
I'm fairly new to Python and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() method seems to be pretty efficient (it ranks fairly low when I examine profiling output), but is there a better way to do this? I've seen re.sub() with a function as an argument, but I'm unsure whether that would be better; I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall(r'(<[^>]+>)', line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall(r'(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're doing basic search-and-replace, and re.sub() is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to call str.replace() several times).
I'd recommend a combination of both. If you have several patterns that all get replaced by the same thing, use re.sub(). If you just need to replace one specific tag with another, use str.replace().
You can also improve efficiency by operating on larger strings (calling re.sub() once per file instead of once per line). That increases memory use, which shouldn't be a problem unless the file is huge, and also improves execution time.
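For instance, the four example replacements from the question could be collapsed into one pass with a lookup table; this is a sketch, not a benchmark:
import re

table = {
    '-': '<EMDASH>',
    '<\\#>': '#',
    '<\\n>': '',
    '\xe1': '•',
}
# Longest keys first so multi-character tags win over any shorter overlap.
keys = sorted(table, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in keys))
line = pattern.sub(lambda m: table[m.group(0)], 'a-b<\\n>c')
print(line)  # a<EMDASH>bc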
If you don't actually need regexes and are just doing literal replacements, str.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution, though, would probably be to use cStringIO.
Depending on the ratio of relevant to irrelevant portions of the text you're operating on (and whether or not the parts each substitution touches overlap), it might be more efficient to break the input down into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow.
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply check whether a given token is a tag, replace it in the list, and ''.join() the result at the end (a sketch follows).
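A sketch of that token approach using re.split with a capturing group; clean_tag here is a hypothetical stand-in for the ~100 per-tag substitutions:
import re

def clean_tag(tag):
    # Stand-in for the many per-tag substitutions from the question.
    tag = re.sub(r'\*t\([^)]*\)', '', tag)
    return re.sub(r'\*p\([^)]*\)', '', tag)

line = 'plain text<*t(x)tag><tag>more text'
tokens = re.split(r'(<[^>]+>)', line)  # the capturing group keeps the tags
line = ''.join(clean_tag(tok) if tok.startswith('<') else tok for tok in tokens)
print(line)  # plain text<tag><tag>more text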
You can pass a function object to re.sub instead of a substitution string; it takes the match object and returns the substitution. For example:
>>> import re
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object; this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes each substitution may be significant.
All the diff tools I've found compare line by line instead of char by char. Is there any library that gives details on differences within single-line strings? Maybe also a percentage difference, though I guess there are separate functions for that?
This algorithm diffs word by word:
http://github.com/paulgb/simplediff
It's available in Python and PHP, and can even emit HTML-formatted output using the <ins> and <del> tags.
I was looking for something similar recently and came across wdiff. It operates on words, not characters, but is this close to what you're looking for?
What you could try is to split both strings up character by character into lines and then run a line-based diff on that. It's a dirty hack, but at least it should work and is quite easy to implement.
Alternatively, you can split each string into a list of characters in Python and use difflib; check the Python difflib reference. A sketch:
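A quick sketch with difflib directly on the strings (a string is already a sequence of characters, so no splitting is needed); ratio() also gives the percentage-style similarity asked about:
import difflib

a = 'the quick brown fox'
b = 'the quick brown fix'

sm = difflib.SequenceMatcher(None, a, b)
print(sm.ratio())  # similarity in [0, 1], here about 0.947
for op, i1, i2, j1, j2 in sm.get_opcodes():
    print(op, repr(a[i1:i2]), '->', repr(b[j1:j2]))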
You can implement a simple Needleman–Wunsch algorithm. The pseudo code is available on Wikipedia: http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
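A minimal sketch of the score-matrix part of Needleman–Wunsch, following the Wikipedia pseudocode (traceback to recover the actual character alignment is omitted, and the scoring constants are arbitrary):
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # Fill the dynamic-programming score matrix row by row.
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]  # higher score means more similar strings

print(needleman_wunsch('kitten', 'sitting'))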