Split string using Regular Expression that includes lowercase, camelcase, numbers - python

I'm fetching Twitter trends with tweepy in Python, and I can get the top 50 worldwide trends. A sample of the results looks like this:
#BrazilianFansAreTheBest, #PSYPagtuklas, Federer, Corcuera, Ouvindo ANTI,
艦これ改, 영혼의 나이, #TodoDiaéDiaDe, #TronoChicas, #이사람은_분위기상_군주_장수_책사,
#OTWOLKnowYourLimits, #BaeIn3Words, #NoEntiendoPorque
(Please ignore non English words)
I need to parse every hashtag and convert it into proper English words. I also looked at how people write hashtags and found the following styles:
#thisisawesome
#ThisIsAwesome
#thisIsAwesome
#ThisIsAWESOME
#ThisISAwesome
#ThisisAwesome123
(sometimes hashtags contain numbers as well)
Keeping all of these in mind, I think that if I can split the string below, all of the cases above will be covered.
string ="pleaseHelpMeSPLITThisString8989"
Result = please, Help, Me, SPLIT, This, String, 8989
I tried something using re.sub, but it does not give me the desired results.

Regex is the wrong tool for the job. You need a clearly defined pattern in order to write a good regex, and in this case you don't have one. Given that you can have Capitalized Words, CAPITAL WORDS, lowercase words, and numbers, there's no real way to look at, say, THATSand and decide between "THATS and" and "THAT Sand".
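To make the ambiguity concrete, here is a minimal sketch of a heuristic regex. The pattern and the split_hashtag name are my own assumptions, not something from the question. It handles the sample string, but it always resolves THATSand one way, and it has nothing to split on when a tag is all lowercase. It only looks at ASCII letters and digits; any other character is skipped.

```python
import re

# Heuristic only: a run of capitals not followed by a lowercase letter,
# or an optionally capitalized lowercase word, or a run of digits.
TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_hashtag(tag):
    """Split a hashtag on case and digit boundaries (hypothetical helper)."""
    return TOKEN.findall(tag.lstrip("#"))

print(split_hashtag("pleaseHelpMeSPLITThisString8989"))
# ['please', 'Help', 'Me', 'SPLIT', 'This', 'String', '8989']
print(split_hashtag("THATSand"))        # ['THAT', 'Sand'], never ['THATS', 'and']
print(split_hashtag("#thisisawesome"))  # ['thisisawesome'], no case hints to use
```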
A natural-language approach would be a better solution, but again, it's inevitably going to run into the same problem as above - how do you differentiate between two (or more) perfectly valid ways to parse the same inputs? Now you'd need to get a trie of common sentences, build one for each language you plan to parse, and still need to worry about properly parsing the nonsensical tags twitter often comes up with.
The question becomes, why do you need to split the string at all? I would recommend finding a way to omit this requirement, because it's almost certainly going to be easier to change the problem than it is to develop this particular solution.


How do I make Python find words that look similar to a bad word, but are not necessarily a proper word in English?

I'm making a cyberbullying detection discord bot in python, but sadly there are some people who may find their way around conventional English and spell a bad word in a different manner, like the n-word with 3 g's or the f word without the c. There are just too many variants of bad words some people may use. How can I make python find them all?
I've tried pyenchant but it doesn't do what I want it to do. If I put suggest("racist slur"), "sucker" is in the array. I can't seem to find anything that works.
Will I have to consider every possibility separately and add all the possibilities into a single dictionary? (I hope not.)
It's not necessarily python's job to do the heavy lifting but rather its ecosystem. You may want to look into Natural Language Understanding algorithms and find a way that suits your specific needs. This takes some time and further expertise to figure out.
You may want to start with pytorch, it has helped my learning curve a lot. Their docs regarding text: https://pytorch.org/text/stable/index.html
I'd also suggest having a look around Kaggle: several data science challenges there offer prizes for tackling the same task you are trying to solve.
https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification
These competitions usually have public starter notebooks to get you started with your own implementation.
You could try looping through the string you are moderating and splitting each word into a list of its characters.
For example, if you wanted to blacklist "foo":
x=[["f","o","o"],[" "], ["f","o","o","o"]]
then count how many times each letter occurs in each word:
y = [{"f": 1, "o": 2}, {" ": 1}, {"f": 1, "o": 3}]
then see that y[2] is very similar to y[0] (the banned word).
While this method is not perfect, it is a start.
Another thing to look into is using a neural language interpreter that detects whether a word is being used in a derogatory way. A while back, Google designed one of these.
The other answer is just that no bot is perfect.
You might just have to put these common misspellings in the blacklist.
However, the automatic approach would be awesome if you got it working with 100% accuracy.
Unfortunately, spell checking (for different languages) alone is still an open problem that people do research on, so there is no perfect solution for this, let alone for the case when the user intentionally tries to insert some "errors".
Fortunately, there is a conceptually limited number of ways people can intentionally change the input word in order to obtain a new word that resembles the initial one enough to be understood by other people. For example, bad actors could try to:
duplicate some letters multiple times
add some separators (e.g. "-", ".") between characters
delete some characters (e.g. the f word without "c")
reverse the word
potentially others
My suggestion is to initially keep it simple, if you don't want to delve into machine learning. As a possible approach you could try to:
1. Manually create a set of lower-case bad words with their duplicated letters removed (e.g. "killer" -> "kiler").
2. Manually or automatically add to this set variants of these words with one or more letters missing that can still be easily understood (e.g. "kiler" -> "kilr").
3. Extract the words in the message (e.g. by message_str.split()).
4. For each word and its reversed version:
   a. remove possible separators (e.g. "-", ".")
   b. convert it to lower case and remove consecutive duplicate letters
   c. check whether this new form of the word is present in the set; if so, censor it or the entire message
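The steps above can be sketched roughly as follows. BAD_WORDS holds only the placeholder entries from the example, and the names normalize and is_flagged are assumptions for illustration. Only "-" and "." are treated as separators, so a word with trailing punctuation such as "!" would slip through as written.

```python
import re

# Placeholder set: already lower-case with duplicate letters collapsed
BAD_WORDS = {"kiler", "kilr"}

SEPARATORS = re.compile(r"[-.]")
REPEATS = re.compile(r"(.)\1+")   # any character repeated consecutively

def normalize(word):
    """Strip separators, lower-case, and collapse repeated letters."""
    word = SEPARATORS.sub("", word).lower()
    return REPEATS.sub(r"\1", word)

def is_flagged(message, bad_words=BAD_WORDS):
    """Check each whitespace-separated word and its reverse against the set."""
    for word in message.split():
        for candidate in (word, word[::-1]):
            if normalize(candidate) in bad_words:
                return True
    return False

print(normalize("K-i.llleeR"))      # kiler
print(is_flagged("what a relliK"))  # True (reversed form)
print(is_flagged("hello world"))    # False
```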
This solution lacks the protection against words with characters separated by one or multiple white spaces / newlines (e.g. "killer" -> "k i l l e r").
Depending on how long the messages are (I believe they are generally short in chat rooms), you can instead remove all whitespace from the message and consider every substring of the result, rather than each whitespace-separated word from step 3. This takes more time, since generating every substring alone takes O(message_length^2) time.
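A minimal sketch of the substring variant, under the same assumptions as before (the bad-word set is a placeholder and the function names are hypothetical). Capping the substring length at the longest listed word is my own shortcut to keep the scan cheaper. Substring matching will also flag innocent words that happen to contain a listed word.

```python
import re

def collapse(text):
    """Lower-case and collapse consecutive duplicate characters."""
    return re.sub(r"(.)\1+", r"\1", text.lower())

def contains_bad_substring(message, bad_words):
    """Scan every substring of the message after removing spaces and separators."""
    squeezed = collapse(re.sub(r"[\s\-.]+", "", message))
    max_len = max(map(len, bad_words), default=0)
    n = len(squeezed)
    for start in range(n):
        # no substring longer than the longest bad word can match
        for end in range(start + 1, min(n, start + max_len) + 1):
            if squeezed[start:end] in bad_words:
                return True
    return False

print(contains_bad_substring("k i l l e r", {"kiler"}))  # True
print(contains_bad_substring("nice day", {"kiler"}))     # False
```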

Python, speed up regex expression for extracting sub strings

I have the following text
text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"
And I would like to extract this list of substrings
['C1234567', 'CM123456', 'F1234567']
This is what I came up with
new_string = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')
new_string.findall(text)
However, I was wondering if there's a way to do this faster since I'm interested in performing this operation tens of thousands of times.
I thought I could use ^ to match the beginning of string, but the regex expression I came up with
new_string = re.compile(r'\b(^C[M0-9]\d{6}|^[FM]\d{7})\b')
Doesn't return anything anymore. I know this is a very basic question, but I'm not sure how to use the ^ properly.
Good and bad news. Bad news, regex looks pretty good, going to be hard to improve. Good news, I have some ideas :) I would try to do a little outside the box thinking if you are looking for performance. I do Extract Transform Load work, and a lot with Python.
You are already doing the re.compile (big help)
The regex engine is left to right, so short circuit where you can. Doesn't seem to apply here
If you have a big chunk of data that you are going to loop over multiple times, clean it up front ONCE of stuff you KNOW won't match. Think of an HTML page: if you only want things from the HEAD section and need to run many regexes over it, extract that section and run the regexes on it alone, not on the whole page. Seems obvious, isn't always :)
Use some metrics, give cProfile a try. Maybe there is some logic around where you are regexing that you can speed up. At least you can find your bottleneck, maybe the regex isn't the problem at all.
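As a starting point for measuring, here is a small sketch using timeit, which I'm using as a lighter stand-in for cProfile for a single expression. It also shows why the ^ version returns nothing: ^ only matches at the very start of the string (or of each line with re.MULTILINE), and this text starts with "This". The non-capturing group (?:...) is my own tweak so findall returns whole matches; I make no claim about how much it changes the timing.

```python
import re
import timeit

text = ("This is a string with C1234567 and CM123456, CM123, "
        "F1234567 and also M1234, M123456")

pattern = re.compile(r"\b(?:C[M0-9]\d{6}|[FM]\d{7})\b")
anchored = re.compile(r"\b(?:^C[M0-9]\d{6}|^[FM]\d{7})\b")

matches = pattern.findall(text)
anchored_matches = anchored.findall(text)  # empty: nothing sits at position 0
anchored_at_start = anchored.findall("C1234567 and F1234567")  # first token only

# Time 10,000 calls so before/after comparisons have a baseline
elapsed = timeit.timeit(lambda: pattern.findall(text), number=10_000)

print(matches)            # ['C1234567', 'CM123456', 'F1234567']
print(anchored_matches)   # []
print(anchored_at_start)  # ['C1234567']
print(f"{elapsed:.3f}s for 10,000 calls")
```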

Building an index of term usage in python code

Brief version
I have a collection of python code (part of instructional materials). I'd like to build an index of when the various python keywords, builtins and operators are used (and especially: when they are first used). Does it make sense to use ast to get proper tokenization? Is it overkill? Is there a better tool? If not, what would the ast code look like? (I've read the docs but I've never used ast).
Clarification: This is about indexing python source code, not English text that talks about python.
Background
My materials are in the form of ipython Notebooks, so even if I had an indexing tool I'd need to do some coding anyway to get the source code out. And I don't have an indexing tool; googling "python index" doesn't turn up anything with the intended sense of "index".
I thought, "it's a simple task, let's write a script for the whole thing so I can do it exactly the way I want". But of course I want to do it right.
The dumb solution: Read the file, tokenize on whitespaces and word boundaries, index. But this gets confused by the contents of strings (when does for really get introduced for the first time?), and of course attached operators are not separated: text+="this" will be tokenized as ["text", '+="', "this"]
Next idea: I could use ast to parse the file, then walk the tree and index what I see. This looks like it would involve ast.parse() and ast.walk(). Something like this:
import ast

for source in source_files:
    with open(source) as fp:
        code = fp.read()
    tree = ast.parse(code)
    for node in ast.walk(tree):  # ast.walk is a module-level function, not a method
        ...  # Get node's keyword, identifier etc., and line number-- how?
        print(term, source, line)  # I do know how to make an index
So, is this a reasonable approach? Is there a better one? How should this be done?
Did you search on "index" alone, or for "indexing tool"? I would think that your main problem would be to differentiate a language concept from its natural language use.
I expect that your major difficulty here is not traversing the text, but in the pattern-matching to find these things. For instance, how do you recognize introducing for loops? This would be the word for "near" the word loop, with a for command "soon" after. That command would be a line beginning with for and ending with a colon.
That is just one pattern, albeit one with many variations. However, consider what it takes to differentiate that from a list comprehension, and that from a generator expression (both explicit and built-in).
Will you have directed input? I'm thinking that a list of topics and keywords is essential, not all of which are in the language's terminal tokens -- although a full BNF grammar would likely include them.
Would you consider a mark-up indexing tool? Sometimes, it's easier to place a mark at each critical spot, doing it by hand, and then have the mark-up tool extract an index from that. Such tools have been around for at least 30 years. These are also found with a search for "indexing tools", adding "mark-up" or "tagging" to the search.
Got it. I thought you wanted to parse both, using the code as the primary key for introduction. My mistake. Too much contact with the Society for Technical Communication. :-)
Yes, AST is overkill -- internally. Externally -- it works, it gives you a tree including those critical non-terminal tokens (such as "list comprehension"), and it's easy to get given a BNF and the input text.
This would give you a sequential list of parse trees. Your coding would consist of traversing the trees to make an index of each new concept from your input list. Once you find each concept, you index the instance, remove it from the input list, and continue until you run out of sample code or input items.
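A minimal sketch of that traversal with the standard ast module, recording where each construct first appears. The index_source name and the choice to label constructs by their AST class name (For, ListComp, AugAssign) are assumptions for illustration. Builtins are recognised by name only, so a variable that shadows a builtin would be miscounted. Operator nodes such as ast.Add carry no line number, so they are not indexed here; they would have to be read from their parent node.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))

def index_source(code, source="<string>"):
    """Map each term to the (source, line) where it first appears."""
    first_seen = {}
    tree = ast.parse(code)
    # ast.walk promises no particular order, so sort by position
    nodes = [n for n in ast.walk(tree) if hasattr(n, "lineno")]
    nodes.sort(key=lambda n: (n.lineno, getattr(n, "col_offset", 0)))
    for node in nodes:
        if isinstance(node, ast.Name):
            if node.id not in BUILTIN_NAMES:
                continue                 # skip ordinary variable names
            term = node.id               # e.g. "print", "range"
        else:
            term = type(node).__name__   # e.g. "For", "AugAssign"
        first_seen.setdefault(term, (source, node.lineno))
    return first_seen

sample = 'msg = "for x in y"\nfor i in range(3):\n    msg += str(i)\nprint(msg)\n'
index = index_source(sample, "lesson1.py")
print(index["For"])    # ('lesson1.py', 2): the "for" inside the string is ignored
print(index["print"])  # ('lesson1.py', 4)
```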

SPSS and Python integration: Splitting on a String Variable

I have my SPSS dataset with customer comments housed as a string variable. I'm trying to come up with a piece of syntax that will pull out a word either before or after a specific keyword (One example might be "you were out of organic milk"). If you are wanting to know what type of milk they are talking about, you would want to pull out the word directly before it ("organic").
Through a series of string searches and manipulations, I have it look for the first space before the word and pull out the characters in between. However, I feel like there should be an easier way if I were using Python alone (split on spaces, find the position of the keyword, and return the word X places before or after it). I don't know how to accomplish this in SPSS using Python, though. Any ideas for how to approach this problem?
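The pure-Python part described above (split on spaces, locate the keyword, return a neighbouring word) could look roughly like this. The neighbor_word name and its offset parameter are hypothetical, and wiring it to the SPSS dataset is left out because that depends on the SPSS Python integration in use.

```python
PUNCTUATION = ".,!?;:\"'"

def neighbor_word(comment, keyword, offset=-1):
    """Return the word `offset` places from the first occurrence of keyword.

    offset=-1 is the word directly before, offset=1 the word directly after.
    Returns an empty string if the keyword or the neighbour is missing.
    """
    words = [w.strip(PUNCTUATION) for w in comment.split()]
    lowered = [w.lower() for w in words]
    try:
        pos = lowered.index(keyword.lower())
    except ValueError:
        return ""
    target = pos + offset
    return words[target] if 0 <= target < len(words) else ""

print(neighbor_word("you were out of organic milk", "milk"))    # organic
print(neighbor_word("you were out of organic milk", "out", 1))  # of
```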

Extracting a regex from a set of strings

I have a set of strings. I would like to extract a regular expression that matches all these strings. Further, it should match preferably only these and not many others.
Is there an existing python module that does this?
www.google.com
www.googlemail.com/hello/hey
www.google.com/hello/hey
Then, the extracted regex could be www\.google(mail)?\.com(/hello/hey)?
(This also matches www.googlemail.com but I guess I need to live with it)
My motivation for this is in a machine learning setting. I would like to extract a regular expression that "best" represents all these strings.
I understand that regexes like
(www.google.com)|(www.googlemail.com/hello/hey)|(www.google.com/hello/hey) or
www.google(mail.com/hello/hey)|(.com)|(/hello/hey) would be right given my specification, because they match no other urls other than the given ones. But such a regex will become very large if there are large number of strings in the set.
There's a little perl library that was designed to do this. I know you're using python, but if it's a very large list of strings, you can fork off a perl subprocess now and then. (Or copy the algorithm if you're sufficiently motivated).
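If you'd rather stay in Python, here is a rough sketch of one simple approach: build a prefix trie of the strings and turn it into a regex. This is my own illustration and not necessarily what the Perl library does. It only factors out shared prefixes, so it will not discover the (mail)? form from the question, but the result matches exactly the input strings and nothing else. The function names are hypothetical, and the recursion depth grows with string length.

```python
import re

def build_trie(strings):
    """Nested dicts keyed by character; the empty key marks a string end."""
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment for everything below it."""
    ends_here = "" in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    return body + "?" if ends_here else body

urls = ["www.google.com",
        "www.googlemail.com/hello/hey",
        "www.google.com/hello/hey"]
pattern = re.compile(trie_to_regex(build_trie(urls)))

print(pattern.pattern)
print(all(pattern.fullmatch(u) for u in urls))         # True
print(bool(pattern.fullmatch("www.googlemail.com")))   # False
```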
