Fuzzy match string with 1 million rows - python

I have a database with 1 million rows, and based on a user's input I need to find the most relevant matches.
The way the code was written in the past was by using the library fuzzywuzzy. A ratio between 2 strings was calculated in order to show how similar the strings were.
The problem with that is that we had to run the ratio function for each row in the database, meaning 1 million function calls, and the performance is really bad. We never thought we'd get to the point of having this much data.
I am looking for a better algorithm or solution for handling the search in this case. I've stumbled upon something called TF-IDF (Term Frequency-Inverse Document Frequency). It was described as a solution for "fuzzy matching at scale", way faster.
Unfortunately I couldn't wrap my mind around it and completely understand how it works, and the more I read about it, the more I think that this is not what I need, since all the examples I've seen try to find similar matches between 2 lists, not between 1 string and 1 list.
So, am I on the wrong path? And if so, could you please give me some ideas on how I can handle this scenario? Unfortunately, Full Text Search works only with exact matches, so in our case fuzzy matching is definitely the way we want to go.
And if you're going to propose the idea of using a separate search engine, we don't want to add a new tool to our infrastructure just for this.
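For reference, a minimal sketch of the TF-IDF idea using scikit-learn (my assumption; any TF-IDF/cosine-similarity library would do). The answer to the "1 string and 1 list" concern is to fit the vectorizer on the database strings once, then transform the single user query with that same vectorizer:

    # Minimal sketch, assuming scikit-learn; names and parameters are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    names = ["John Smith", "Jon Smyth", "Jane Doe"]   # the 1 million rows in practice

    # Character n-grams make the comparison robust to typos; fit once, offline.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    matrix = vectorizer.fit_transform(names)

    def top_matches(query, k=5):
        query_vec = vectorizer.transform([query])     # reuse the fitted vocabulary
        scores = cosine_similarity(query_vec, matrix).ravel()
        best = scores.argsort()[::-1][:k]
        return [(names[i], float(scores[i])) for i in best]

    print(top_matches("Jon Smith"))

The million-row matrix is computed once up front, so each user query costs one sparse matrix-vector product instead of a million ratio calls.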

Related

Handling Multiple if/else and Special Cases

So I'm fairly new to coding, having only written relatively simple scripts here and there when I need them for work. I have a document that has an ID column formatted as:
"Number Word Number" and some values under a spec, lower tol, and upper tol column.
Sometimes the number under ID is an integer or a float, and the word can be one of, say, 30 different possibilities. Ultimately these need to be read and then organized, depending on the spec and lower/upper tol columns, into something like below:
I'm using Pandas to read the data and do the manipulations I need, so my question isn't so much how to do it, but rather how it should best be done.
The way my code is written is basically a series of if statements that handle each of the scenarios I've come across so far, but based on other people's code I've seen, this is generally not done and, as I understand it, is considered poor practice. It's very basic if statements like:
if (the ID column has "Note" in it) then it's a basic dimension
if (the ID column has "Roughness" in it) then it's an Ra value
if (the ID column has "Position" in it) then it's a true position, etc.
The problem is I'm not really sure what the "correct" way to do it would be, in terms of making it more efficient and simpler. I currently have a series of 30+ if statements and ways to handle the different situations I've run into so far. Virtually all the code I've written follows this overly specific, not very general methodology; even though it works, I personally find it overcomplicated, but I'm not really sure what capabilities of Python/pandas I'm missing and not utilizing to simplify my code.
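As an aside, a minimal sketch of one common refactor for such an if-chain: a keyword-to-type lookup table. The keywords and labels below are just the three from the examples above, and the column names and sample data are assumptions:

    import pandas as pd

    KEYWORD_TO_TYPE = {
        "Note": "basic dimension",
        "Roughness": "Ra value",
        "Position": "true position",
        # ... the remaining ~27 keywords would go here
    }

    def classify(id_value):
        # The first keyword found in the ID string decides the type.
        for keyword, dim_type in KEYWORD_TO_TYPE.items():
            if keyword in id_value:
                return dim_type
        return "unknown"

    df = pd.DataFrame({"ID": ["1 Note 2", "3 Roughness 4", "5 Position 6"]})
    df["Type"] = df["ID"].apply(classify)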
Since you need to test what the value in ID is and do some stuff accordingly, you most probably can't avoid the if statements. What I suggest you do, since you have already written the code, is to reform the database. If there isn't a very specific reason to have a database with a structure like this, you should change it asap.
To be specific: give each row an (auto)increment unique number as its ID, and break the 3 data points of the ID column into 3 separate columns.
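A minimal sketch of that restructuring with pandas, splitting the "Number Word Number" ID into three columns with a regex (the sample data and new column names are assumptions):

    import pandas as pd

    df = pd.DataFrame({"ID": ["1.5 Roughness 2", "3 Position 0.01"]})
    parts = df["ID"].str.extract(
        r"^\s*(?P<num1>[\d.]+)\s+(?P<word>.+?)\s+(?P<num2>[\d.]+)\s*$"
    )
    df = df.join(parts)   # ID plus separate num1 / word / num2 columns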

Split string using Regular Expression that includes lowercase, camelcase, numbers

I'm working on getting Twitter trends using tweepy in Python, and I'm able to find the world's top 50 trends, so as a sample I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore the non-English words.)
So here I need to parse every hashtag and convert it into proper English words. I also checked how people write hashtags and found the ways below:
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags have numbers as well)
So keeping all these in mind, I figured that if I'm able to split the string below, then all the above cases will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub but it is not giving me the desired results.
Regex is the wrong tool for the job. You need a clearly defined pattern in order to write a good regex, and in this case you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and differentiate between "THATS and" and "THAT Sand".
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above: how do you differentiate between two (or more) perfectly valid ways to parse the same input? Now you'd need a trie of common sentences, build one for each language you plan to parse, and still worry about properly parsing the nonsensical tags Twitter often comes up with.
The question becomes: why do you need to split the string at all? I would recommend finding a way to drop this requirement, because it's almost certainly going to be easier to change the problem than to develop this particular solution.
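That said, the subset of tags that do contain case or digit boundaries can be split mechanically; a minimal sketch (it cannot help with all-lowercase tags like #thisisawesome, which is exactly the ambiguity described above):

    import re

    def split_hashtag(tag):
        # Split on case/digit boundaries: ALL-CAPS runs, Capitalized words,
        # lowercase runs, and digit runs. All-lowercase tags stay as one token.
        return re.findall(
            r"[A-Z]{2,}(?=[A-Z][a-z])|[A-Z][a-z]+|[A-Z]+|[a-z]+|\d+",
            tag.lstrip("#"),
        )

    print(split_hashtag("pleaseHelpMeSPLITThisString8989"))
    # ['please', 'Help', 'Me', 'SPLIT', 'This', 'String', '8989']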

In Python, is there a way to check how alike two files are and get the percentage of differences they have?

I'm trying to compare a lot of scripts at once, and most of them have small differences, like a different variable name and such.
For the most part, the scripts should be identical in function, and I'd like to be able to test how different they actually are.
What I'm thinking of doing is taking in all of the input from both files and comparing them against each other, character by character, and increasing a count of some sort when a difference arises. I'm not sure what I would compare this count against to make a percentage, or whether this is even the best way to go about it.
If you have an idea or advice to give me I would greatly appreciate it!
Two suggestions:
1) Check out Python's difflib and this SO question, which specifically asks about difflib.
Also, a guy named Doug Hellmann has an excellent series of blog posts called Python Module of the Week (PyMOTW). Here is his post about difflib.
2) If those don't work for you, try searching for language-independent file-comparison algorithms first, and think about which ones could be most easily implemented in Python. A simple Google search for "file comparison algorithms" turned up several decent-looking possibilities that you could try to implement in Python:
Here is a published PDF with a diff algorithm
This site has a discussion of several different algorithms with links
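For suggestion 1), a minimal sketch of the difflib route, assuming both scripts fit in memory (the file names are placeholders):

    import difflib

    def similarity_percent(path_a, path_b):
        """Return how alike two text files are, as a percentage."""
        with open(path_a) as fa, open(path_b) as fb:
            a, b = fa.read(), fb.read()
        # SequenceMatcher.ratio() returns a float in [0, 1]
        return difflib.SequenceMatcher(None, a, b).ratio() * 100

    print(f"{similarity_percent('script_v1.py', 'script_v2.py'):.1f}% similar")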

Syntax recognizer in python

I need a module or strategy for detecting that a piece of data is written in a programming language, as opposed to syntax highlighting, where the user specifically chooses a syntax to highlight. My question has two levels, and I would greatly appreciate any help:
Is there any package in Python that receives a string (a piece of data) and returns whether it belongs to any programming language's syntax?
I don't necessarily need to recognize which syntax it is, just whether the string is source code or not.
Any clues are deeply appreciated.
Maybe you can use existing multi-language syntax highlighters. Many of them can detect the language a file is written in.
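For example, a minimal sketch using Pygments' lexer guessing (Pygments is my example of such a highlighter library, not one named above; its guess is a best-effort heuristic):

    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound

    def guess_language(snippet):
        try:
            return guess_lexer(snippet).name   # e.g. "Python"
        except ClassNotFound:
            return None                        # nothing looked like source code

    print(guess_language("def f(x):\n    return x + 1"))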
You could have a look at methods around Bayesian filtering.
My answer somewhat depends on the amount of code you're going to be given. If you're going to be given 30+ lines of code, it should be fairly easy to identify some unique features of each language that are fairly common. For example, tell the program that if anything matches an expression like from * import * then it's Python (I'm not 100% sure that phrasing is unique to Python, but you get the gist). Other things you could look at that usually differ slightly are class definitions (e.g. Python class definitions always start with 'class', while C function definitions start with the return type, so you could check whether a line starts with a data type and is formatted like a method declaration), conditionals are usually formatted slightly differently, and so on.
If you wanted to make it more accurate, you could introduce some sort of weighting system: features that are more unique and less likely to be the result of a mismatched regexp get a higher weight, things that are commonly mismatched get a lower weight for that language, and you calculate which language has the highest composite score at the end. You could also define features that you feel are 100% unique, and tell it that as soon as it hits one of those, to stop parsing because it knows the answer (things like the shebang line).
This would, of course, involve you knowing enough about the languages you want to identify to find unique features to look for, or being able to find people who know of unique structures that would help.
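A minimal sketch of that weighted-feature scoring, with purely illustrative patterns and weights:

    import re

    # Hand-picked regex features per language; patterns and weights are made up.
    FEATURES = {
        "Python": [(r"^\s*from\s+\w+\s+import\s+", 3), (r"^\s*def\s+\w+\(.*\):", 3)],
        "C":      [(r"#include\s*<\w+\.h>", 5), (r"\bint\s+main\s*\(", 4)],
    }

    def identify_language(code):
        scores = {lang: 0 for lang in FEATURES}
        for lang, patterns in FEATURES.items():
            for pattern, weight in patterns:
                if re.search(pattern, code, re.MULTILINE):
                    scores[lang] += weight
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None   # None: probably not source code

    print(identify_language("#include <stdio.h>\nint main(void) { return 0; }"))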
If you're given fewer than 30 or so lines of code, your answers from parsing like that are going to be far less accurate. In that case, the easiest way to do it would probably be to use a setup similar to Travis and just run the code in each language (in a VM, of course). If the code runs successfully in a language, you have your answer. If not, you would need a list of errors that are "acceptable" (as in, errors in the way the code was written, not in the interpreter). It's not a great solution, but at some point your code sample will just be too short to give an accurate answer.

counting (large number of) strings within (very large) text

I've seen a couple of variations of the "efficiently search for strings within file(s)" question on Stackoverflow, but not quite like my situation.
I've got one text file which contains a relatively large number (>300K) of strings. The vast majority of these strings are multiple words (for ex., "Plessy v. Ferguson", "John Smith", etc.).
From there, I need to search through a very large set of text files (a set of legal docs totaling >10GB) and tally the instances of those strings.
Because of the number of search strings, the fact that they're multi-word, and the size of the search target, a lot of the "standard" solutions seem to fall by the wayside.
Some things simplify the problem a little:
I don't need sophisticated tokenizing / stemming / etc. (e.g. the only instances I care about are "Plessy v. Ferguson"; I don't need to worry about "Plessy", "Plessy et al.", etc.)
there will be some duplicates (for ex., multiple people named "John Smith"); however, this isn't a very statistically significant issue for this dataset, so if multiple John Smiths get conflated into a single tally, that's ok for now.
I only need to count these specific instances; I don't need to return search results
10 instances in 1 file count the same as 1 instance in each of 10 files
Any suggestions for quick-and-dirty ways to solve this problem?
I've investigated NLTK, Lucene & others, but they appear to be overkill for the problem I'm trying to solve. Should I suck it up and import everything into a DB? Brute-force grep it 300k times? ;)
My preferred dev tool is Python.
The docs to be searched are primarily legal docs like this - http://www.lawnix.com/cases/plessy-ferguson.html
The intended result is a tally of how often each case is referenced across those docs, e.g.:
"Plessy v. Ferguson: 15"
An easy way to solve this is to build a trie from your queries (simply a prefix tree: a list of nodes, each holding a single character), and as you search through your 10 GB of files you walk the trie recursively as the text matches.
This way you prune a lot of options really early on in your search for each character position in the big file, while still searching your whole solution space.
Time performance will be very good (as good as a lot of other, more complicated solutions), and you'll only need enough space to store the tree (a lot less than the whole array of strings) plus a small buffer into the large file. Definitely a lot better than grepping a DB 300k times...
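A minimal sketch of that trie scan in plain Python, assuming the text (or a chunk of it) is in memory:

    from collections import defaultdict

    def build_trie(phrases):
        root = {}
        for phrase in phrases:
            node = root
            for ch in phrase:
                node = node.setdefault(ch, {})
            node["$"] = phrase              # marks the end of a complete phrase
        return root

    def count_matches(text, root):
        counts = defaultdict(int)
        for i in range(len(text)):          # try to match starting at every position
            node = root
            j = i
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if "$" in node:
                    counts[node["$"]] += 1
        return counts

    trie = build_trie(["Plessy v. Ferguson", "John Smith"])   # the 300K strings in practice
    print(dict(count_matches("... Plessy v. Ferguson held that ...", trie)))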
You have several constraints you must deal with, which makes this a complex problem:
Hard drive IO
Memory space
Processing time
I would suggest writing a multithreaded/multiprocess Python app. The libraries for spawning subprocesses are painless to use. Have each process read in a file and parse it with the trie, as suggested by Blindy. When it finishes, it returns the results to the parent, which writes them to a file.
This will use up as many resources as you can throw at it, while allowing for expansion. If you stick it on a Beowulf cluster, it will transparently share the processes across your CPUs for you.
The only sticking point is the hard drive IO. Break the data into chunks on different hard drives, and as each process finishes, start a new one and load a file. If you're on Linux, all of the files can coexist in the same filesystem namespace, and your program won't know the difference.
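A minimal sketch of that fan-out with multiprocessing; the per-file matcher here is a naive str.count placeholder you would swap for the trie scan sketched above, and the file paths are assumptions:

    from collections import Counter
    from multiprocessing import Pool
    import glob

    PHRASES = ["Plessy v. Ferguson", "John Smith"]   # the 300K strings in practice

    def count_in_file(path):
        # Placeholder matcher; swap in the trie / Aho-Corasick scan for real use.
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        return Counter({p: text.count(p) for p in PHRASES})

    if __name__ == "__main__":
        totals = Counter()
        with Pool() as pool:                 # one worker per CPU by default
            for partial in pool.imap_unordered(count_in_file, glob.glob("docs/*.txt")):
                totals += partial
        for phrase, n in totals.most_common():
            print(f"{phrase}: {n}")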
The ugly brute-force solution won't work.
Time one grep through your documents and extrapolate the time it takes for 300k greps (and possibly try parallelizing it if you have many machines available); is that feasible? My guess is that 300k searches won't be. For example, grepping one search through ~50 MB of files took me about ~5 s, so for 10 GB you'd expect ~1000 s, and repeating that 300k times means you'd be done in about 10 years with one computer. You can parallelize to get some improvement (limited by disk IO on one computer), but it will still be quite limited. I assume you want the task finished in hours rather than months, so this isn't likely to be a solution.
So you are going to need to index the documents somehow. Lucene (say through pythonsolr) or Xapian should be suitable for your purpose. Index the documents, then search the indexed documents.
You should use group pattern matching algorithms, which reuse evaluation across patterns, e.g. Aho–Corasick. Implementations:
http://code.google.com/p/graph-expression/wiki/RegexpOptimization
http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html
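A minimal sketch using the pyahocorasick package (my choice of implementation, not one of the links above; the file name is a placeholder):

    from collections import Counter
    import ahocorasick                                # the pyahocorasick package

    phrases = ["Plessy v. Ferguson", "John Smith"]    # the 300K strings in practice

    automaton = ahocorasick.Automaton()
    for phrase in phrases:
        automaton.add_word(phrase, phrase)
    automaton.make_automaton()

    counts = Counter()
    with open("some_legal_doc.txt", encoding="utf-8", errors="ignore") as f:
        for _end_index, phrase in automaton.iter(f.read()):
            counts[phrase] += 1
    print(counts)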
I don't know if this idea is extremely stupid or not; please let me know...
Divide the files to be searched into reasonably sized batches of 10/100/1000... and for each "chunk" use whatever indexing software is available. Here I'm thinking about ctags, GNU GLOBAL, or perhaps the ptx utility, or the technique described in this SO post.
Using this technique, you "only" need to search through the index files for the target strings.
