I am trying to remove anything starting with \ud
My text:
onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons
The answer I am looking for:
onceuponadollhouse: "Iconic apart and better together â€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code We stand for one another by sharing our lessons
So the ideal way would be to take a step back, work out where in the process the encoding is getting mangled, then fix it. Somehow you're getting (a) surrogate pairs, which are the pairs of characters starting with \ud; and (b) UTF-8 interpreted as Latin-1 or some similar encoding, like the â„¢ after "Barbie".
Taking a step back and making sure that your input text is interpreted correctly would be ideal; here you're losing the emojis "woman with bunny ears" and "ribbon"; another time it might be somebody's name or other piece of important information.
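If you can intervene at that earlier stage, you can often undo the damage rather than throw it away. Here's a minimal sketch of recombining the surrogate pairs, assuming they really are UTF-16 code units that got decoded one at a time (the sample string is shortened); the â„¢-style mojibake is a separate problem, for which the ftfy package is one commonly used option:
text = 'together \ud83d\udc6f and \ud83c\udf80 ribbon'

# re-encode the lone surrogates as UTF-16 and decode again, which pairs
# them back up into the real characters (here, the two emojis)
repaired = text.encode('utf-16', 'surrogatepass').decode('utf-16')
print(repaired)  # 'together 👯 and 🎀 ribbon'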
If you're in a situation where you can't do it properly, and you need to strip the surrogate pairs, you can use re.sub:
import re
text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'
stripped = re.sub('[\ud800-\udfff]+', '', text)
print(stripped)
Depending on your purpose, it might be useful to replace those characters with a placeholder; since they always come in pairs, you might do something like this:
import re
text = 'onceuponadollhouse: "Iconic apart and better together \ud83d\udc6fâ€â™€ï¸The Caboodles® x Barbieâ„¢ collection has us thinking about our Doll Code \ud83c\udf80 We stand for one another by sharing our lessons'
stripped = re.sub('[\ud800-\udfff]{2}', '<unknown character>', text)
print(stripped)
Check out the emot python package. I discovered it this morning in this article: https://towardsdatascience.com/5-python-libraries-that-you-dont-know-but-you-should-fd6f810773a7
The examples given in the documentation only interpret the emojis, but it also gives their location, so it wouldn't be too much of a stretch to replace them.
Given a very long string -
"Given the large category of plants, the split ratio was determined to be 88.4. However, we're not sure if the split ratio was consistent across all subcategories or just a calculated average. If however, it deviated, it would be nonetheless, quite strange.
The words - split ratio. In the output, I want them to appear as split-ratio (as a single word) and I also only want to retain sentences where these words occur. So in this case, only the first two sentences.
Is this possible?
You can use replace in a list comprehension:
s = """Given the large category of plants, the split ratio was
determined to be 88.4. However, we're not sure
if the split ratio was consistent across all subcategories
or just a calculated average. If however, it deviated,
it would be nonetheless, quite strange."""
print('. '.join([x.replace('split ratio', 'split-ratio') for x in s.split('. ') if 'split ratio' in x]) + '.')
will print out only lines that contain 'split ratio' with each of them converted to 'split-ratio'.
Since python is in the tag line, I expect you want it in that language, right? And to be clear, a simple find-and-replace in a normal text editor isn't going to solve this; you need actual logic to apply to the text.
I would have to stop and look up the Python specifics, but in any language the easiest way I can think of is to parse the file/stream and make the changes as you go. Read in the stream and look for the pattern you want to match ("split ratio"), and as you read, write out a new stream that contains your changes. Do the comparison in blocks (or string lengths) the size of the pattern you are matching.
When the pattern you are constantly comparing matches, stop. Don't output that string; instead, output the one you want to replace it with into the new target stream/file.
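Here's a rough sketch of that stream-based replace (file names and chunk size are made up for illustration; if the whole text fits in memory, a plain str.replace is simpler):
def replace_in_stream(src_path, dst_path, pattern, replacement, chunk_size=4096):
    tail_len = len(pattern) - 1  # enough to catch a match straddling two chunks
    window = ''
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for chunk in iter(lambda: src.read(chunk_size), ''):
            window = (window + chunk).replace(pattern, replacement)
            if tail_len:
                dst.write(window[:-tail_len])  # write everything except a short tail
                window = window[-tail_len:]    # keep the tail for the next round
            else:
                dst.write(window)
                window = ''
        dst.write(window)  # flush whatever is left at end of file

# replace_in_stream('input.txt', 'output.txt', 'split ratio', 'split-ratio')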
However, a search for python search and replace algorithm gives me this:
https://www.geeksforgeeks.org/python-string-replace/
Someone did the hard work for you already. Love that super high level programming language that leaves folks in the dark as to what is actually happening. Oh well.
Enjoy.
atomkey.
I am working on a python project in which I need to filter profane words, and I already have a filter in place. The only problem is that if a user switches a character with a visually similar character (e.g. hello and h311o), the filter does not pick it up. Is there some way that I could detect these words without hard-coding every combination?
What about translating l331sp33ch to leetspeech and applying a simple Levenshtein distance? (You need to pip install editdistance first.)
import editdistance

try:
    from string import maketrans  # Python 2
except ImportError:
    maketrans = str.maketrans  # Python 3

t = maketrans("01345", "oleas")
editdistance.eval("h3110".translate(t), 'hello')
results in 0
Maybe build a relationship between the visually similar characters and what they can represent i.e.
dict = {'3': 'e', '1': 'l', '0': 'o'} #etc....
and then you can use this to test against your database of forbidden words.
e.g.
input:he11
if any of the characters have an entry in dict,
dict['h'] #not exist
dict['e'] #not exist
dict['1'] = 'l'
dict['1'] = 'l'
Put this together to form a word and then search your forbidden list. I don't know if this is the fastest way of doing it, but it is "a" way.
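For example, something along these lines (the substitution map and forbidden list are just stand-ins):
subs = {'3': 'e', '1': 'l', '0': 'o', '4': 'a', '5': 's'}  # etc....
forbidden = {'hello'}  # stand-in for your real list of banned words

def normalize(word):
    # replace each character that has an entry in the map, keep the rest as-is
    return ''.join(subs.get(ch, ch) for ch in word.lower())

print(normalize('he11o') in forbidden)  # True
print(normalize('h3110') in forbidden)  # True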
I'm interested to see what others come up with.
*disclaimer: I've done a year or so of Perl and am starting out learning Python right now. When I get the time. Which is very hard to come by.
Linear Replacement
You will want something adaptable to innovative orthographers. For a start, pattern-match the alphabetic characters to your lexicon of banned words, using other characters as wild cards. For instance, your example would get translated to "h...o", which you would match to your proposed taboo word, "hello".
Next, you would compare the non-alpha characters to a dictionary of substitutions, allowing common wild-card chars to stand for anything. For instance, asterisk, hyphen, and period could stand for anything; '4' and '#' could stand for 'A', and so on. However, you'll do this checking from the strength of the taboo word, not from generating all possibilities: the translation goes the other way.
You will have a little ambiguity, as some characters stand for multiple letters. "#" can be used in place of 'O' if you're getting crafty. Also note that not all the letters will be in your usual set: you'll want to deal with monetary symbols (Euro, Yen, and Pound are all derived from letters), as well as foreign letters that happen to resemble Latin letters.
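As a rough sketch of that wildcard step (the taboo list here is a stand-in, and this only handles candidates the same length as the banned word):
import re

taboo = ['hello']  # stand-in list of banned words

def wildcard_match(candidate):
    # keep the letters, turn every other character into a one-character wildcard
    pattern = ''.join(ch if ch.isalpha() else '.' for ch in candidate)
    return any(re.fullmatch(pattern, word, re.IGNORECASE) for word in taboo)

print(wildcard_match('h311o'))  # True: "h...o" matches "hello"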
Multi-character replacements
That handles only the words that have the same length as the taboo word. Can you also handle abbreviations? There are a lot of combinations of the form "h-bomb", where the banned word appears as the first letter only: the effect is profane, but the match is more difficult, especially where the 'b's are replaced with a scharfes S (German), the 'm' with a Hebrew or Cyrillic character, and the 'o' with anything round from the entire font.
Context
There is also the problem that some words are perfectly legitimate in one context, but profane in a slang context. Are you also planning to match phrases, perhaps parsing a sentence for trigger words?
Training a solution
If you need a comprehensive solution, consider training a neural network with phrases and words you label as "okay" and "taboo", and let it run for a day. This can take a lot of the adaptation work off your shoulders, and enhancing the model isn't a difficult problem: add your new differentiating text and continue the training from the point where you left off.
Thank you to all who posted an answer to this question. More answers are welcome, as they may help others. I ended up going off of David Zemens' comment on the question.
I'd use a dictionary or list of common variants ("sh1t", etc.) which you could persist as a plain text file or json etc., and read in to memory. This would allow you to add new entries as needed, independently of the code itself. If you're only concerned about profanities, then the list should be reasonably small to maintain, and new variations unlikely. I've used a hard-coded dict to represent statistical t-table (with 1500 key/value pairs) in the past, seems like your problem would not require nearly that many keys.
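A minimal sketch of that approach (the file name and its contents are made up):
import json

# profanity_variants.json might contain: ["sh1t", "h311o", ...]
with open('profanity_variants.json') as f:
    variants = set(json.load(f))

def is_banned(word):
    return word.lower() in variants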
While this still means that all the words will be hard-coded, it will allow me to update the list more easily.
I'm trying to write code that parses a large text file. However, in order to get said text file, I run the original PDF file through pdfminer. While this works, it also returns my text file with many random spaces (see below)
SM ITH , JO HN , PHD
1234 S N O RT H AV E
Is there any easy way in Python to remove only certain spaces so words aren't separated? For the sample above, I want it to look like
SMITH, JOHN, PHD
1234 S NORTH AVE
Thanks.
Most likely what you're trying to do is impossible to do perfectly, and very hard to do well enough to satisfy you. I'll explain below.
But there's a good chance you shouldn't be doing it in the first place. pdfminer is highly configurable, and something like just specifying a smaller -M value will give you the text you wanted in the first place. You'll need to do a bit of trial and error, but if this works, it'll be far easier than trying to post-process things after the fact.
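For instance, with pdfminer.six's high-level API the same layout knobs can be set programmatically (the file name and the values below are just a starting point for the trial and error mentioned above):
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# -M on the pdf2txt.py command line corresponds to char_margin here;
# word_margin is the other knob that influences where spaces are inserted
text = extract_text('scanned.pdf',
                    laparams=LAParams(char_margin=1.0, word_margin=0.2))
print(text)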
If you want to do this, you need to come up with a rule that determines which spaces are "random extra spaces" and which are real spaces before you can code that in Python. And I don't know that there is any such rule.
In your example, you can handle most of them by just turning multiple spaces into single spaces, and single spaces into nothing. It should be obvious how to do that. Even if you can't think of a clever solution, a triple replace works fine:
s = re.sub(r'\s\s+', r'<space>', s)
s = re.sub(r'\s', r'', s)
s = re.sub(r'<space>', r' ', s)
However, this rule isn't quite right, because in JO HN , PHD, the space after the comma isn't a random extra space, but it's not showing up as two or more spaces. And the same for the space in "1234 S". And, most likely, the same thing is true in lots of other cases for your real data.
A different somewhat close rule is that you only remove single spaces between letters. Again, if that works, it's easy to code. For example:
s = re.sub(r'(\w)\s(\w)', r'\1\2', s)
s = re.sub(r'\s+', r' ', s)
But now that leaves a space before the comma after SMITH and JOHN.
Maybe you need to put in a little information about English punctuation—strip the spaces around punctuation, then add back in the spaces after a comma or period, around quotes, etc.
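One possible reading of that punctuation rule, as a sketch:
import re

s = 'SMITH , JOHN , PHD'
s = re.sub(r'\s+([,.])', r'\1', s)      # drop spaces before a comma or period
s = re.sub(r'([,.])(?=\S)', r'\1 ', s)  # make sure a single space follows them
print(s)  # SMITH, JOHN, PHD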
Or… well, nobody but you can know what your data look like and figure it out.
If you can't come up with a good rule, the only option is to build some complicated heuristics around looking up possible words in a dictionary and guessing which one is more likely—which still won't get everything right (e.g., how do you know whether "B OO K M AR K" is "BOOK MARK" or "BOOKMARK"?), but it's the best you could possibly do.
What you are trying to do is impossible, e.g., should "DESK TOP" be "DESK TOP" or "DESKTOP"?
I want to search for strings that occur between certain strings. For example,
\start
\problem{number}
\subproblem{number}
/* strings that I want to get */
\subproblem{number}
/* strings that I want to get */
\problem{number}
\subproblem{number}
...
...
\end
More specifically, I want to get the problem number, the subproblem number, and the strings in between, which are the answers.
I came up with an expression like
'(\\problem{(.*?)}\n)? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
but it seems like it doesn't work as I expect. What is wrong with this expression?
This one:
(?:\\problem\{(.*?)\}\n)?\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
returns three matches for me:
Match 1:
group 1: "number"
group 2: "number"
group 3: "/* strings that I want to get */"
Match 2:
group 1: null
group 2: "number"
group 3: "/* strings that I want to get */"
Match 3:
group 1: "number"
group 2: "number"
group 3: " ...\n ..."
However I'd rather parse it in two steps.
First find the problem's number (group 1) and content (group 2) using:
\\problem\{(.*?)\}\n(.+?)\\end
Then find the subproblem's numbers (group 1) and contents (group 2) inside that content using:
\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end)
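In Python, those two steps might look roughly like this (with re.DOTALL so the dot crosses newlines, and the patterns adjusted slightly so they also stop at the next \problem or at the end of the content; the sample document is made up):
import re

doc = ('\\start\n'
       '\\problem{1}\n\\subproblem{a}\nanswer a\n\\subproblem{b}\nanswer b\n'
       '\\problem{2}\n\\subproblem{a}\nanswer c\n'
       '\\end\n')

for number, content in re.findall(
        r'\\problem\{(.*?)\}\n(.+?)(?=\\problem\{|\\end)', doc, re.DOTALL):
    subproblems = re.findall(
        r'\\subproblem\{(.*?)\}\n+(.*?)\n+(?=\\problem|\\subproblem|\\end|$)',
        content, re.DOTALL)
    print(number, subproblems)

# 1 [('a', 'answer a'), ('b', 'answer b')]
# 2 [('a', 'answer c')]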
TeX is pretty complicated and I'm not sure how I feel about parsing it using regular expressions.
That said, your regular expression has two issues:
You're using a space character where you should just consume all whitespace
You need to use a lookahead assertion for your final group so that it doesn't get eaten up (because you need to match it at the beginning of the regex the next time around)
Give this a try:
>>> v
'\\start\n\n\\problem{number}\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\subproblem{number}\n\n/* strings that I want to get */\n\n\\problem{number}\n\\subproblem{number}\n ...\n ...\n\\end\n'
>>> re.findall(r'(?:\\problem{(.*?)})?\s*\\subproblem{(.*?)}\s*(.*?)\s*(?=\\problem{|\\subproblem{|\\end)', v, re.DOTALL)
[('number', 'number', '/* strings that I want to get */'), ('', 'number', '/* strings that I want to get */'), ('number', 'number', '...\n ...')]
If the question really is "What is wrong with this expression?", here's the answer:
You're trying to match newlines with a .*?. You need (?s) for that to work.
You have explicit spaces and newlines in the middle of the regex that don't have any corresponding characters in the source text. You need (?x) for that to work.
That may not be all that's wrong with the expression. But just adding (?sx), turning it into a raw string (because I don't trust myself to mix Python quoting and regex quoting properly), and removing the \n gives me this:
r'(?sx)(\\problem{(.*?)})? \\subproblem{(.*?)} (.*?) (\\problem|\\subproblem|\\end)'
That returns 2 matches instead of 0, and it's probably the smallest change to your regex that works.
However, if the question is "How can I parse this?", rather than "What's wrong with my existing attempt?", I think impl's solution makes more sense (and I also agree with the point about using regex to parse TeX being usually a bad idea), or, even better, doing it in two steps as Regexident does.
If using regex to parse TeX is not a good idea, then what method would you suggest to parse TeX?
First of all, as a general rule of thumb, if I can't write the regex to solve a problem by myself, I don't want to solve it with a regex, because I'll have a hard time figuring it out a few months from now. Sometimes I break it down into subexpressions, or use (?x) and load it up with comments, but usually I look for another way.
More importantly, if you have a real parser that can consume your language and give you a tree (or whatever's appropriate) that you can walk and search—as with, e.g. etree for XML—then you've got 90% of a solution for every problem you're going to come up with in dealing with that language. A quick&dirty regex (especially one you can't write on your own) only gets you 10% of the way to solving the next problem. And more often than not, if I've got a problem today, I'm going to have more of them in the next few months.
So, what's a good parser for TeX in Python? Honestly, I don't know. I know scipy/matplotlib has something that does it, so I'd probably look there first. Beyond that, check Google, PyPI, and maybe tex.stackexchange.com. The first things that turn up in a search are Texcaller and plasTeX. I have no idea how good they are, or if they're appropriate for your use case, but it shouldn't take long to skim the tutorials and find out.
If it turns out that there's nothing out there, and it comes down to writing something myself with, e.g., pyparsing vs. regexes, then it's a tougher choice. Some languages, it's very easy to define just the subset you care about and leave the rest as giant uninterpreted tokens, in which case a real parser will be just as easy as a regex, so you might as well go that way. Other languages, you have to handle half the syntax before you can do anything useful, so I wouldn't even try. I'd have to put a bit of time into thinking about it and experimenting both ways before deciding which way to go.
I wondered how you would go about tokenizing strings in English (or other western languages) if whitespaces were removed?
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, python-wise, how would you structure a (forgiving) translation flow? Not asking for a completed answer, just more how your thought process would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter, matching whole words. It works well, but there are numerous ambiguities ("asshit" can be "ass hit" or "as shit"). To resolve those ambiguities would require much more sophisticated grammar analysis.
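A minimal version of that idea looks something like this (with a tiny stand-in word set):
words = {'like', 'we', 'said', 'as', 'ass', 'hit', 'shit'}  # stand-in for a real word list

def segment(text):
    pieces, current = [], ''
    for ch in text.lower():
        current += ch
        if current in words:  # greedy: accept the first complete word found
            pieces.append(current)
            current = ''
    return pieces

print(segment('likewesaid'))  # ['like', 'we', 'said']
print(segment('asshit'))      # ['as', 'shit'] -- the greedy reading, not 'ass hit'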
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute relative probabilities of certain words being next to each other -- getting a list of pair and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave... It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged type of solution.
textstring = "likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
indiv_characters = list(textstring) #splits string into individual characters
teststring = ''
sequential_indiv_word_list = []
for cur_char in indiv_characters:
teststring = teststring + cur_char
# do some action here to test the testsring against an English dictionary where you can API into it to get True / False if it exists as an entry
if in_english_dict == True:
sequential_indiv_word_list.append(teststring)
teststring = ''
#at the end just assemble a sentence from the pieces of sequential_indiv_word_list by putting a space between each word
There are some more issues to be worked out, such as what happens if it never finds a match: as written, it would just keep adding more characters and never match. However, since your demo string has some spaces, you could have it recognize those too and automatically start over at each of them.
Also, you need to account for punctuation; write conditionals like
if cur_char == ',' or cur_char == '.':
    # do action to start a new "word" automatically