Segment words into their sub-words/sub-concepts - python

What are some popular methods to find and replace concatenated words such as:
brokenleg -> (broken, leg)
The method should run on thousands of lines without knowing in advance whether they contain any concatenated words.
I'm using the spaCy library for most of my string handling, so ideally the method would be one that works well alongside spaCy.
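One common approach (not mentioned in the question) is a dictionary-based dynamic-programming split: walk the string left to right and record, for each prefix that can be covered by known words, one way to cover it. Below is a minimal sketch, assuming you can supply a set of known words (packages such as wordninja do something similar using word frequencies); the resulting pieces can be re-joined with spaces and fed to spaCy as usual.

def segment(word, vocabulary):
    """Return one segmentation of `word` into vocabulary words, or None."""
    pieces = {0: []}                      # pieces[i] holds a segmentation of word[:i]
    for end in range(1, len(word) + 1):
        for start in range(end):
            if start in pieces and word[start:end] in vocabulary:
                pieces[end] = pieces[start] + [word[start:end]]
                break                     # one covering of word[:end] is enough here
    return pieces.get(len(word))

vocab = {"broken", "leg", "broke"}        # hypothetical word list
print(segment("brokenleg", vocab))        # ['broken', 'leg']

Note that this returns an arbitrary valid split; a production segmenter would score candidates by word frequency to prefer the most likely reading.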

Related

How to check if a token is present in a document with spaCy?

I have a huge list of larger spaCy documents and a list of words which I want to look up in the document.
An example: I want to look up the word "Aspirin" in a website text, which was parsed with spaCy.
The list of keywords I want to look up is quite long.
Naive approach
Don't use spaCy and just use if keyword in website_text: as a simple matcher. Of course this has the downside that token boundaries are ignored, so a search for test will yield false positives on words like tested, attested, etc.
Use spaCy's matchers
Matchers are an option, but I would need to automatically build a lot of matchers based on my list of keywords (see the PhraseMatcher sketch after this question).
Is there a recommended way to achieve this task?
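Regarding the matchers option above: spaCy's PhraseMatcher accepts the whole keyword list in a single matcher, so you don't need to build one matcher per keyword. A minimal sketch, assuming a recent spaCy (v3 API) and that the en_core_web_sm model is installed:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
keywords = ["aspirin", "test"]                    # hypothetical keyword list

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # attr="LOWER" makes matching case-insensitive
matcher.add("KEYWORDS", [nlp.make_doc(k) for k in keywords])

doc = nlp("Aspirin was tested in an attested trial.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)                    # prints "Aspirin" only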
I'd go with your naive approach, but you can use regular expressions to get a smarter match that won't pick up false positives.
For example, \b(test|aspirin)\b picks up on the words "test" and "aspirin", but not on "aspiring", "attested", or "testing". You could add other words inside the brackets, separated by pipes, to pick up more keywords.
To apply that in Python, you can use the re module.
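A minimal sketch of that regex approach, with a hypothetical keyword list; re.escape guards against keywords that contain regex metacharacters.

import re

keywords = ["test", "aspirin"]     # hypothetical keyword list
pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)

website_text = "Aspirin was tested in an attested trial."
print(pattern.findall(website_text))   # ['Aspirin'] -- no hits on 'tested' or 'attested'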

Split string using Regular Expression that includes lowercase, camelcase, numbers

I'm working on getting Twitter trends using Tweepy in Python, and I'm able to find the world's top 50 trends, so as a sample I'm getting results like these:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore the non-English words.)
So I need to parse every hashtag and convert it into proper English words. I also checked how people write hashtags and found the forms below:
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags have numbers as well)
Keeping all of these in mind, I figured that if I can split the string below, then all of the cases above will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub but it is not giving me the desired results.
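For what it's worth, the cased and digit-delimited examples above can be split with a single re.findall pattern; as the answer below points out, this cannot help with all-lowercase hashtags like #thisisawesome. A sketch:

import re

def split_hashtag(tag):
    # Split on uppercase/lowercase transitions and digit runs.
    return re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", tag.lstrip("#"))

print(split_hashtag("pleaseHelpMeSPLITThisString8989"))
# ['please', 'Help', 'Me', 'SPLIT', 'This', 'String', '8989']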
Regex is the wrong tool for the job. You need a clearly-defined pattern in order to write a good regex, and in this case, you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and decide between "THATS and" and "THAT Sand".
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above: how do you differentiate between two (or more) perfectly valid ways to parse the same input? Now you'd need a dictionary (or trie) of common words, build one for each language you plan to parse, and still worry about properly parsing the nonsensical tags Twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.

SPSS and Python integration: Splitting on a String Variable

I have my SPSS dataset with customer comments stored as a string variable. I'm trying to come up with a piece of syntax that will pull out the word either before or after a specific keyword (one example might be "you were out of organic milk"). If you want to know what type of milk they are talking about, you would want to pull out the word directly before it ("organic").
Through a series of string searches/manipulations, I have it looking for the first space before the keyword and pulling out the characters in between. However, I feel like there should be an easier way if I were using Python alone (split on spaces, identify where the keyword is, and return the word X positions before or after it). The trouble is, I don't know how to accomplish this in SPSS using Python. Any ideas on how to approach this problem?
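The pure-Python part can be as simple as the sketch below (the function name and the punctuation stripping are assumptions). Inside SPSS this could run in a BEGIN PROGRAM ... END PROGRAM block, with the comment values read through the spss module.

def word_before(comment, keyword):
    """Return the word immediately before `keyword` in `comment`, or None."""
    words = comment.split()
    for i, word in enumerate(words):
        if word.strip('.,!?').lower() == keyword.lower():
            return words[i - 1] if i > 0 else None
    return None

print(word_before("you were out of organic milk", "milk"))   # organic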

How can I remove large number of phrases in one pass from a large text file?

I was wondering: is there some way I can remove a large number (hundreds of thousands) of text phrases in one pass from a big (18 GB) text file?
You could construct a suffix tree from your list of phrases and walk your file using it. It will allow you to identify all the strings. This is often used for tagging stuff but you should be able to adapt it to remove strings as well.
Rabin-Karp is good for multiple substring searching, but I think your phrases would have to be the same length.
If they're of similar lengths, you might be able to search for subphrases of length (minimum length across all phrases) and then extend when you've found something.
Another thought: you could extend this to use a small set of, say, q subphrase lengths, as appropriate for your search phrases, and modify Rabin-Karp to keep q rolling hashes instead of one, with q sets of hashes. This would help if you can partition your phrases into q subsets that do have similar lengths.
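A minimal sketch of the single-length Rabin-Karp idea described above: one rolling hash over the text, a set of target hashes for the (equal-length) phrases, and an exact string comparison to confirm each candidate. Removal then amounts to rewriting the file while skipping the matched spans, processing it in chunks rather than loading 18 GB at once.

def rabin_karp_find(text, phrases, base=256, mod=(1 << 61) - 1):
    """Return (start, phrase) matches; assumes all phrases share one length."""
    length = len(next(iter(phrases)))
    assert all(len(p) == length for p in phrases)

    def poly_hash(s):
        value = 0
        for ch in s:
            value = (value * base + ord(ch)) % mod
        return value

    targets = {}                              # hash -> phrases with that hash
    for p in phrases:
        targets.setdefault(poly_hash(p), set()).add(p)

    matches = []
    if len(text) < length:
        return matches
    lead = pow(base, length - 1, mod)         # weight of the outgoing character
    rolling = poly_hash(text[:length])
    for i in range(len(text) - length + 1):
        window = text[i:i + length]
        if rolling in targets and window in targets[rolling]:
            matches.append((i, window))
        if i + length < len(text):            # roll the hash one character to the right
            rolling = (rolling - ord(text[i]) * lead) % mod
            rolling = (rolling * base + ord(text[i + length])) % mod
    return matches

print(rabin_karp_find("remove this phrase and this phrase too",
                      {"phrase and", "phrase too"}))   # both phrases are 10 characters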
I'm going to go out on a limb here and suggest you use AWK, because it is very fast for this kind of task.
Are those phrases all the same? That is, is it the same word you want to remove? If so, maybe you can remove it using the 'in' keyword, checking each line in a loop and removing all instances of the word from that line. More information on the problem is needed, though.

most efficient way to find partial string matches in large file of strings (python)

I downloaded the Wikipedia article titles file, which contains the name of every Wikipedia article. I need to search for all the article titles that may be a possible match. For example, I might have the word "hockey", but the Wikipedia article I would want is "Ice_hockey". It should be a case-insensitive search too.
I'm using Python; is there a more efficient way than just doing a line-by-line search? Ideally I'll be performing this search 500 to 1000 times per minute. If line by line is my only option, are there some optimizations I can do within it?
I think there are several million lines in the file.
Any ideas?
Thanks.
If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.
This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".
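A minimal sketch of that index, with a small hypothetical titles list (in practice you would read the titles from the downloaded file once at startup and reuse the dictionary for every query):

from collections import defaultdict
import re

titles = ["Ice_hockey", "Field_hockey", "Hockey_stick"]   # hypothetical sample

index = defaultdict(list)                                 # word -> titles containing it
for title in titles:
    for word in re.split(r"[_\W]+", title.lower()):
        if word:
            index[word].append(title)

def lookup(query):
    return index.get(query.lower(), [])

print(lookup("hockey"))   # ['Ice_hockey', 'Field_hockey', 'Hockey_stick']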
Greg's answer is good if you want to match on individual words. If you want to match on substrings you'll need something a bit more complicated, like a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_hockey" when someone searched for "hock".
I'd suggest you put your data into an SQLite database and use the SQL LIKE operator for your searches.
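A minimal sketch of that, assuming the titles file is a plain one-title-per-line list at a hypothetical path. Note that a leading % wildcard prevents SQLite from using an ordinary index, so each query still scans the table, but the scan happens inside SQLite rather than in a Python loop.

import sqlite3

conn = sqlite3.connect("titles.db")
conn.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT)")

# One-time load; "enwiki-all-titles" is a hypothetical path to the downloaded file.
with open("enwiki-all-titles", encoding="utf-8") as f:
    conn.executemany("INSERT INTO titles VALUES (?)", ((line.strip(),) for line in f))
conn.commit()

# LIKE is case-insensitive for ASCII by default, so this finds "Ice_hockey" for "hockey".
rows = conn.execute("SELECT title FROM titles WHERE title LIKE ?", ("%hockey%",))
print([row[0] for row in rows][:10])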
