How to find the prefix of a word for nlp

How to find the prefix of a word for nlp - python

I want to find the prefix of a word for nlp purposes(interested in morphological negation).
For example, I want to know "unable" is negative, but "university" does not have any sort of negation. I have been using the startswith python function so far, but obviously there can be some issues.
Does anyone have any experience with finding prefixes of words? I feel like there should be some library or api, but I'm not sure.
Thanks!

Short of a full morphological analyser, you can work around this with exception lists and longest matching.
For example: you assume un- expresses negation. First, find longer prefixes (such as uni-), and match for that first, before looking at un-. There will be a handful of exceptions, such as uninteresting, which you can check for separately. This will be a fairly smallish list. Then, once all the uni- words have been dealt with, anything starting with un- is a candidate, though there will also be exceptions, such as under.
A slightly better solution is possible if you have a basic word list: cut of un- from the beginning of the string, and check whether the remainder is in your word list. University will become iversity, which is not in your list, and thus it's not the un- prefix. However, uninteresting will become interesting, which is, so here you have found a valid prefix. All you need for this is a list of non-negated words. You can of course also use this for other prefixes, such as the alpha privative, as in atypical the remainder typical will be in your list.
If you don't have such a list, simply split your text into tokens, sort and unique them, and then scan down the line of words beginning with your candidate prefixes. It's a bit tedious, but the numbers of relevant words are not that big. It's what we all did in NLP 30 years ago... :)

Related

how do i make python find words that look similar to a bad word, but not necessarily a proper word in english?

I'm making a cyberbullying detection discord bot in python, but sadly there are some people who may find their way around conventional English and spell a bad word in a different manner, like the n-word with 3 g's or the f word without the c. There are just too many variants of bad words some people may use. How can I make python find them all?
I've tried pyenchant but it doesn't do what I want it to do. If I put suggest("racist slur"), "sucker" is in the array. I can't seem to find anything that works.
Will I have to consider every possibility separately and add all the possibilities into a single dictionary? (I hope not.)

It's not necessarily python's job to do the heavy lifting but rather its ecosystem. You may want to look into Natural Language Understanding algorithms and find a way that suits your specific needs. This takes some time and further expertise to figure out.
You may want to start with pytorch, it has helped my learning curve a lot. Their docs regarding text: https://pytorch.org/text/stable/index.html
Also, I'd suggest, you have a look around at kaggle, several datascience challenges have a prize on them to tackle the same task you are aiming to solve.
https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
These competitions usually have public starter notebooks to get you started with your own implementation.

You could try looping through the string that you are moderating and putting it into an array.
For example, if you wanted to blacklist "foo"
x=[["f","o","o"],[" "], ["f","o","o","o"]]
then count the letters in each word to count how many of each letter is in each word:
y = [["f":"1", "o":"2"], [" ":"1"], ["f":"1", "o":"3"]]
then see that y[2] is very similar to y[0] (the banned word).
While this method is not perfect, it is a start.
Another thing to look in to is using a neural language interpreter that detects if a word is being used in a derogatory way. A while back, Google designed one of these.
The other answer is just that no bot is perfect.
You might just have to put these common misspellings in the blacklist.
However, the automatic approach would be awesome if you got it working with 100% accuracy.

Unfortunately, spell checking (for different languages) alone is still an open problem that people do research on, so there is no perfect solution for this, let alone for the case when the user intentionally tries to insert some "errors".
Fortunately, there is a conceptually limited number of ways people can intentionally change the input word in order to obtain a new word that resembles the initial one enough to be understood by other people. For example, bad actors could try to:
duplicate some letters multiple times
add some separators (e.g. "-", ".") between characters
delete some characters (e.g. the f word without "c")
reverse the word
potentially others
My suggestion is to initially keep it simple, if you don't want to delve into machine learning. As a possible approach you could try to:
manually create a set of lower-case bad words with their duplicated letters removed (e.g. "killer" -> "kiler").
manually/automatically add to this set variants of these words with one or multiple letters missing that can still be easily understood (e.g. "kiler" +-> "kilr").
extract the words in the message (e.g. by message_str.split())
for each word and its reversed version:
a. remove possible separators (e.g. "-", ".")
b. convert it to lower case and remove consecutive, duplicate letters
c. check if this new form of the word is present in the set, if so, censor it or the entire message
This solution lacks the protection against words with characters separated by one or multiple white spaces / newlines (e.g. "killer" -> "k i l l e r").
Depending on how long the messages are (I believe they are generally short in chat rooms), you can try to consider each substring of the initial message with removed whitespaces, instead of each word detected by the white space separator in step 3. This will take more time, as generating each substring will take alone O(message_length^2) time.

Prefix search against half a billion strings

I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500mil list which start with 'hi' (for eg. 'hithere', 'hihowareyou' etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown, if he types "hey", all strings starting with "hey" will show etc.
I've tried with the Tries algo, but the memory footprint to store 300 mil strings is just huge. It should require me 100GB+ ram for that. And I'm pretty sure the list will grow up to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?

You want a Directed Acyclic Word Graph or DAWG. This generalizes #greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.

If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.

If you want to force the user to digit at least 4 letters, for example, you can keep a key-value map, memory or disk, where the keys are all combinations of 4 letters (they are not too many if it is case insensitive, otherwise you can limit to three), and the values are list of positions of all strings that begin with the combination.
After the user has typed the three (or four) letters you have at once all the possible strings. From this point on you just loop on this subset.
On average this subset is small enough, i.e. 500M divided by 26^4...just as example. Actually bigger because probably not all sets of 4 letters can be prefix for your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to the key in the map.

If you doesn't want to use some database, you should create some data related routines pre-existing in all database engines:
Doesn't try to load all data in memory.
Use fixed length for all string. It increase storage memory consumption but significantly decrease seeking time (i-th string can be found at position L*i bytes in file, where L - fixed length). Create additional mechanism to work with extremely long strings: store it in different place and use special pointers.
Sort all of strings. You can use merge sort to do it without load all strings in memory in one time.
Create indexes (address of first line starts with 'a','b',... ) also indexes can be created for 2-grams, 3-grams, etc. Indexes can be placed in memory to increase search speed.
Use advanced strategies to avoid full indexes regeneration on data update: split a data to a number of files by first letters and update only affected indexes, create an empty spaces in data to decrease affect of read-modify-write procedures, create a cache for a new lines before they will be added to main storage and search in this cache.
Use query cache to fast processing a popular requests.

In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addressess allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk-storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be peformed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.

First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version if you are comfortable with python. But to use Apache lucent to solve your problem you should first know about indexing your data(which is the part that your data will be compressed and saved in a more efficient manner).
Also looking to indexing and wildcard query section of IR book will give you a better vision.

Split string using Regular Expression that includes lowercase, camelcase, numbers

I'm working on to get twitter trends using tweepy in python and I'm able find out world top 50 trends so for sample I'm getting results like these
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore non English words)
So here I need to parse every hashtag and convert them into proper English words, Also I checked how people write hashtag and found below ways -
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(some time hashtags have numbers as well)
So keeping all these in mind I thought if I'm able to split below string then all above cases will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub but it is not giving me desired results.

Regex is the wrong tool for the job. You need a clearly-defined pattern in order to write a good regex, and in this case, you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and differentiate between THATS and or THAT Sand.
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above - how do you differentiate between two (or more) perfectly valid ways to parse the same inputs? Now you'd need to get a trie of common sentences, build one for each language you plan to parse, and still need to worry about properly parsing the nonsensical tags twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.

How can I remove large number of phrases in one pass from a large text file?

I was wondering - is there someway I can remove large number (100s of thousands) of text phrases in one pass from a big (18 GB) text file?

You could construct a suffix tree from your list of phrases and walk your file using it. It will allow you to identify all the strings. This is often used for tagging stuff but you should be able to adapt it to remove strings as well.

Rabin-Karp is good for multiple substring searching, but I think your phrases would have to be the same length.
If they're of similar lengths, you might be able to search for subphrases of length (minimum length across all phrases) and then extend when you've found something.
And another thought I'm having is that you could extend this to use a small set of say q subphrase lengths, as appropriate given your search phrases. And you could modify Rabin-Karp to have q rolling hashes instead of one, with q sets of hashes. This would help if you can partition your phrases in q subsets which do have similar lengths.

I'm going to go out on a limb here and suggest you use AWK, because it is very fast for this kind of task.

Are those phrases the same? Like is it the same word you want to remove? Then maybe you can remove it using the 'in' keyword. checking each line using a while loop and removing all instances of the word from that line. Need more information on the problem though.

efficient algorithm to perform spell check on HTML document

I have a HTML document, a list of common spelling mistakes, and the correct spelling for each case.
The HTML documents will be up to ~50 pages and there are ~30K spelling correction entries.
What is an efficient way to correct all spelling mistakes in this HTML document?
(Note: my implementation will be in Python, in case you know of any relevant libraries.)
I have thought of 2 possibles approaches:
build hashtable of the spelling data
parse text from HTML
split text by whitespace into tokens
if token in spelling hashtable replace with correction
build new HTML document with updated text
This approach will fail for multi-word spelling corrections, which will exist. The following is a simpler though seemingly less efficient approach that will work for multi-words:
iterate spelling data
search for word in HTML document
if word exists replace with correction

You are correct that the first approach will be MUCH faster than the second (additionally, I would recommend looking into Tries instead of a straight hash, the space savings will be quite dramatic for 30k words).
To still be able to handle the multi-word cases, you could either keep track of the previous token and thereby check your hash for a combined string such as "prev cur".
Or else you could leave the multi-word corrections out of the hash and combine your two approaches, first using the hash for single words and then doing a scan for the multi-word combos (or vice versa). This could still be relatively fast if the number of multi-word corrections is relatively small.
Be careful tho, pulling out word tokens is trickier than just splitting on whitespace. You don't want to fail to correct an error simply because you didn't find 'instence,' with a comma in your hash.

I agree with Rob's suggestion of using a trie, based on characters, because I programmed a spelling correction algorithm ages ago based on having a dictionary of valid words stored as a trie. By using branch-and-bound I was able to suggest possibly correct spellings of misspelled words (by Levenshtein distance). In addition, since a trie is just a big finite-state-machine, it is fairly easy to add common prefixes and suffixes, so it could handle "words" like "postnationalizationalism's".

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.