I want Python to remove only some punctuation from a string, let's say I want to remove all the punctuation except '#'
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Here the output is
The quick brown fox like totally jumped man
But what I want is something like this
The quick brown fox like totally jumped #man
Is there a way to selectively remove punctuation from a text leaving out the punctuation that we want in the text intact?
string.punctuation contains all the punctuation characters. Remove '#' from it, then substitute '' wherever any of the remaining punctuation characters occurs.
>>> import re
>>> import string
>>> a = string.punctuation.replace('#','')
>>> re.sub(r'[{}]'.format(a),'','The quick brown fox, like, totally jumped, #man!')
'The quick brown fox like totally jumped #man'
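The raw punctuation string happens to form a valid character class here, but characters such as ']' and '-' are special inside [...]; passing the string through re.escape makes the pattern robust regardless of which characters you keep. A sketch:

```python
import re
import string

# '#' is kept; every other character in string.punctuation will be removed
keep = '#'
to_remove = string.punctuation.replace(keep, '')

# re.escape neutralizes characters such as ']' and '-' that are
# otherwise special inside a character class
pattern = '[{}]'.format(re.escape(to_remove))

sample = 'The quick brown fox, like, totally jumped, #man!'
print(re.sub(pattern, '', sample))
```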
Just remove the character you don't want to touch from the string of characters being removed:
import string
remove = dict.fromkeys(map(ord, '\n' + string.punctuation.replace('#','')))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Also note that I changed '\n ' to '\n', as the former will remove spaces from your string.
Result:
The quick brown fox like totally jumped #man
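An equivalent way to build the deletion table, assuming Python 3, is str.maketrans's three-argument form, which maps every character of its third argument to None:

```python
import string

# str.maketrans's three-argument form maps every character of the third
# argument to None, i.e. marks it for deletion by translate()
table = str.maketrans('', '', '\n' + string.punctuation.replace('#', ''))

sample = 'The quick brown fox, like, totally jumped, #man!'
print(sample.translate(table))
```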
We want to split a multi-line string, for example:
|---------------------------------------------Title1(a)---------------------------------------------
Content goes here, the quick brown fox jumps over the lazy dog
|---------------------------------------------Title1(b)----------------------------------------------
Content goes here, the quick brown fox jumps over the lazy dog
Here's our Python code that splits using a regex:
import re
str1 = "|---------------------------------------------Title1(a)---------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"" \
"|---------------------------------------------Title1(b)----------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"|"
print(str1)
str2 = re.split(r"\|---------------------------------------------", str1)
print(str2)
We want the output to include only
str2[0]:
Content goes here, the quick brown fox jumps over the lazy dog
str2[1]:
Content goes here, the quick brown fox jumps over the lazy dog
what's the proper regex to use, or is there any other way to split using the format above
Instead of using split, you can match the lines and capture the part that you want in a group.
\|-{2,}[^-]+-{2,}([^-].*?)(?=\|)
Explanation
\| Match |
-{2,} Match 2 or more -
[^-]+ Match 1+ times any char except -
-{2,} Match 2 or more -
( Capture group 1
[^-].*? Match any char except -, then any char, as few times as possible
) Close group 1
(?=\|) Positive lookahead, assert a | to the right
Example
import re
regex = r"\|-{2,}[^-]+-{2,}([^-].*?)(?=\|)"
str1 = "|---------------------------------------------Title1(a)---------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"" \
"|---------------------------------------------Title1(b)----------------------------------------------" \
"" \
"Content goes here, the quick brown fox jumps over the lazy dog" \
"|"
str2 = re.findall(regex, str1)
print(str2[0])
print(str2[1])
Output
Content goes here, the quick brown fox jumps over the lazy dog
Content goes here, the quick brown fox jumps over the lazy dog
If Title should be part of the line, another option is to make the match a bit more precise.
\|-+Title\d+\([a-z]\)-+(.+?)(?=\||$)
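For reference, a quick check of that more precise pattern against the same sample string (a sketch, not part of the original answer):

```python
import re

str1 = ("|---------------------------------------------Title1(a)---------------------------------------------"
        "Content goes here, the quick brown fox jumps over the lazy dog"
        "|---------------------------------------------Title1(b)----------------------------------------------"
        "Content goes here, the quick brown fox jumps over the lazy dog"
        "|")

# Match the literal word Title, a digit, and a letter in parentheses,
# then lazily capture the text up to the next | (or end of string)
matches = re.findall(r"\|-+Title\d+\([a-z]\)-+(.+?)(?=\||$)", str1)
print(matches)
```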
I'm trying to compare the output of a speech-to-text API with a ground truth transcription. What I'd like to do is capitalize the words in the ground truth which the speech-to-text API either missed or misinterpreted.
For Example:
Truth:
The quick brown fox jumps over the lazy dog.
Speech-to-text Output:
the quick brown box jumps over the dog
Desired Result:
The quick brown FOX jumps over the LAZY dog.
My initial instinct was to remove the capitalization and punctuation from the ground truth and use difflib. This gets me an accurate diff, but I'm having trouble mapping the output back to positions in the original text. I would like to keep the ground truth capitalization and punctuation to display the results, even if I'm only interested in word errors.
Is there any way to express difflib output as word-level changes on an original text?
I would also like to suggest a solution using difflib, but I'd prefer using a regex for word detection, since it is more precise and more tolerant of weird characters and other issues.
I've added some weird text to your original strings to show what I mean:
import re
import difflib
truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'
truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())
for d in difflib.ndiff(truth, speech):
    print(d)
Output
  the
  quick
  brown
- fox
+ box
  jumps
  over
  the
- lazy
  dog
Another possible output:
diff = difflib.unified_diff(truth, speech)
print(''.join(diff))
Output
---
+++
@@ -1,9 +1,8 @@
 the quick brown-fox+box jumps over the-lazy dog
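If the underlying goal is to map changes back to word positions (as the question asks), difflib.SequenceMatcher.get_opcodes() reports index ranges directly; a minimal sketch using the same normalized word lists:

```python
import re
import difflib

truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'

truth_words = re.findall(r"[\w']+", truth.lower())
speech_words = re.findall(r"[\w']+", speech.lower())

changed = []
matcher = difflib.SequenceMatcher(None, truth_words, speech_words)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':  # 'replace' or 'delete' on the truth side
        changed.extend(range(i1, i2))

# indices of ground-truth words the speech output got wrong
print([truth_words[i] for i in changed])
```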
Why not just split the sentence into words then use difflib on those?
import difflib
truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()
for d in difflib.ndiff(truth, speech):
    print(d)
So I think I've solved the problem. I realised that difflib's context_diff provides indices of lines that have changes in them. To get the indices for the "ground truth" text, I remove the capitalization/punctuation, split the text into individual words, and then do the following:
altered_word_indices = []
diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
    if line.startswith('*** ') and line.endswith(' ****\n'):
        line = line.replace(' ', '').replace('\n', '').replace('*', '')
        if ',' in line:
            split_line = line.split(',')
            for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
                altered_word_indices.append((int(split_line[0]) + i) - 1)
        else:
            altered_word_indices.append(int(line) - 1)
Following this, I print it out with the changed words capitalized:
split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
    if i in altered_word_indices:
        print(split_ground_truth[i].upper(), end=' ')
    else:
        print(split_ground_truth[i], end=' ')
This allows me to print out "The quick brown FOX jumps over the LAZY dog." (capitalization / punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".
This is...not a super elegant solution, and it's subject to testing, cleanup, error handling, etc. But it seems like a decent start and is potentially useful for someone else running into the same problem. I'll leave this question open for a few days in case someone comes up with a less gross way of getting the same result.
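For comparison, a sketch of the same end result using SequenceMatcher.get_opcodes(), which avoids parsing context_diff's header lines. The variable names below are illustrative, and it assumes one display token per normalized word:

```python
import re
import difflib

ground_truth = 'The quick brown fox jumps over the lazy dog.'
hypothesis = 'the quick brown box jumps over the dog'

# Normalize for comparison only; display uses the original ground truth
truth_words = re.findall(r"[\w']+", ground_truth.lower())
hyp_words = re.findall(r"[\w']+", hypothesis.lower())

changed = set()
for tag, i1, i2, _, _ in difflib.SequenceMatcher(None, truth_words, hyp_words).get_opcodes():
    if tag != 'equal':
        changed.update(range(i1, i2))

# Capitalize changed words while keeping original casing and punctuation
display = ground_truth.split(' ')
result = ' '.join(w.upper() if i in changed else w for i, w in enumerate(display))
print(result)
```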
I have a txt file that I open in Python, and I'm trying to remove the symbols and sort the remaining words alphabetically. Removing the periods, the commas, etc. isn't a problem. However, I can't seem to remove the dash symbol with surrounding whitespace when I add it to a list together with the rest of the symbols.
This is an example of what I open:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
This is what I want (periods removed, and dash symbols which aren't attached to a word removed):
content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"
But I either get this (all dash symbols removed):
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Or this (dash symbol unremoved):
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"
This is my entire code. Adding a content.replace() works. But that isn't what I want:
f = open("article.txt", "r")
# Create variable (Like this removing " - " works)
content = f.read()
content = content.replace(" - ", " ")
# Create list
wordlist = content.split()
# Which symbols (If I remove the line "content = content.replace(" - ", " ")", the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]
# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)
It works like this. But when I remove content = content.replace(" - ", " "), the "whitespace + dash symbol + whitespace" in chars doesn't get removed.
And if I replace it with "-" (no whitespaces), I get this which I don't want:
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Is it possible at all to do this with a list like chars, or is my only option to use .replace()?
And is there a particular reason why Python orders capitalized words alphabetically first, and uncapitalized words later separately?
Like this (The letters ABC are just added to emphasize what I'm trying to say):
7-year
A
B
C
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
You can use re.sub like this:
>>> import re
>>> strip_chars = re.compile('(?:[,.\'()‘’])|(?:[-,]\s)')
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>> print(*sorted(strip_chars.sub("", content).split()), sep='\n')
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
Summarizing my comments and putting it all together:
from pathlib import Path
from collections import Counter
import re
strip_chars = re.compile('(?:[,.\'()‘’])|(?:[-,]\s)')
article = Path('/path/to/your/article.txt')
content = article.read_text()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words, key=lambda x: x.lower()):
    print(word)
If The and the, for example, count as duplicate words then you just need to convert content to lower case letters. The code would be this one instead:
from pathlib import Path
from collections import Counter
import re
strip_chars = re.compile('(?:[,.\'()‘’])|(?:[-,]\s)')
article = Path('/path/to/your/article.txt')
content = article.read_text().lower()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words):
    print(word)
Finally, as a good side effect of using collections.Counter, you also get a words counter in words and you can answer questions like "what are the top ten most common words?" with something like:
words.most_common(10)
After
wordlist = content.split()
your list no longer contains anything with leading or trailing whitespace.
str.split()
treats runs of consecutive whitespace as a single separator, so there is no ' - ' element in your split list; the dash survives only as a bare '-' token.
Docs: https://docs.python.org/3/library/stdtypes.html#str.split
str.split(sep=None, maxsplit=-1)
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
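A quick illustration of that behavior:

```python
content = "fox - who"

# Whitespace splitting consumes the surrounding spaces as separators,
# leaving a bare '-' token, so ' - ' can never match a list element
print(content.split())
```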
Replacing ' - ' seems right - the other way to keep close to your code would be to remove exactly '-' from your split list:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()
print(wordlist)
chars = [",", ".", "'", "(", ")", "‘", "’"] # modified
words = []
for element in wordlist:
    temp = ""
    if element == '-':  # skip pure -
        continue
    for ch in element:  # handle characters to be removed
        if ch not in chars:
            temp += ch
    words.append(temp)
Output:
['The', 'quick', 'brown', 'fox', '-', 'who', 'was', 'hungry', '-',
'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog.']
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
I have the following sample text:
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''
What I'm trying to accomplish is to create a regex expression which will extract the following lines from mystr:
['This is introduction paragraph','This is non-introduction paragraph',' This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']
If for any reason you need to use regex (perhaps the splitting string is more involved than just "a"), the re module has a split function too:
import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"
print(re.split(r'\s?\ba\b\s?',str_))
# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
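If the leading empty string is unwanted, it can simply be filtered out (a small sketch):

```python
import re

str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"

# re.split can yield an empty string when the separator appears at the
# start of the input; a comprehension drops the empty pieces
parts = [p for p in re.split(r'\s?\ba\b\s?', str_) if p]
print(parts)
```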
EDIT: expanded answer with the new information you provided...
After your edit, in which you give a better description of your problem and include text that looks like LaTeX, I think you need to extract the lines that do not start with a \, since those starting with \ are LaTeX commands. In other words, you need the lines containing only text. Try the following, again using regular expressions:
import re
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''
pattern = r"^[^\\]*\n"
matches = re.findall(pattern, mystr, flags=re.M)
print(matches)
# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
You can use the split method from str:
my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)
Results in:
['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']
Note: don't use str as a variable name; it's the name of Python's built-in string type, and assigning to it shadows the built-in.
I have a sentence, let's say:
The quick brown fox jumps over the lazy dog
I want to create a function that takes 2 arguments, a sentence and a list of things to ignore, and returns that sentence with the words reversed; however, it should leave the items I pass in the second argument intact. This is what I have at the moment:
def main(sentence, ignores):
return ' '.join(word[::-1] if word not in ignores else word for word in sentence.split())
But this will only work if I pass a second list like so:
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
However, I want to pass a list like this:
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
expected result:
ehT quick brown xof spmuj revo eht lazy dog
So basically the second argument (the list) will have parts of the sentence that should be ignored. Not just single words.
Do I have to use regexp for this? I was trying to avoid it...
I'm the first person to recommend avoiding regular expressions, but in this case, the complexity of doing without is greater than the complexity added by using them:
import re
def main(sentence, ignores):
    # Dedup and allow fast lookup for determining whether to reverse a component
    ignores = frozenset(ignores)
    # Make a pattern that will prefer matching the ignore phrases, but
    # otherwise matches each space and non-space run (so nothing is dropped).
    # Alternations match the first pattern by preference, so you'll match
    # the ignore phrases if possible, and generic space/non-space runs
    # otherwise
    pat = r'|'.join(map(re.escape, ignores)) + r'|\S+|\s+'
    # Return the chopped-up pieces (space and non-space runs, but ignore phrases stay together)
    parts = re.findall(pat, sentence)
    # Reverse everything not found in ignores and then put it all back together
    return ''.join(p if p in ignores else p[::-1] for p in parts)
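A self-contained usage sketch of the approach above (same function, condensed comments):

```python
import re

def main(sentence, ignores):
    ignores = frozenset(ignores)
    # Prefer the ignore phrases; otherwise match space/non-space runs
    pat = r'|'.join(map(re.escape, ignores)) + r'|\S+|\s+'
    parts = re.findall(pat, sentence)
    return ''.join(p if p in ignores else p[::-1] for p in parts)

result = main('The quick brown fox jumps over the lazy dog',
              ['quick brown', 'lazy dog'])
print(result)
```

The frozenset iteration order doesn't matter here, since both ignore phrases precede the generic \S+/\s+ alternatives in the pattern.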
Just another idea, reverse every word and then reverse the ignores right back:
>>> from functools import reduce
>>> def main(sentence, ignores):
...     def r(s):
...         return ' '.join(w[::-1] for w in s.split())
...     return reduce(lambda s, i: s.replace(r(i), i), ignores, r(sentence))
>>> main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog'])
'ehT quick brown xof spmuj revo eht lazy dog'
Instead of placeholders, why not just initially reverse any phrase that you want to end up the right way around, then reverse the whole string:
def main(sentence, ignores):
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())
print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
print(main('The quick brown fox jumps over the lazy dog', ['quick brown', 'lazy dog']))
returns:
ehT quick nworb xof spmuj revo eht lazy god
ehT quick brown xof spmuj revo eht lazy dog
I have attempted to solve the issue of overlapping ignore phrases (e.g. ['brown fox', 'quick brown']) raised by @PadraicCunningham.
There's obviously a lot more looping, and the code feels less Pythonic, so I'd be interested in feedback on how to improve it.
import re
def _span_combiner(spans):
    """Replace overlapping spans with an encompassing single span."""
    for i, s in enumerate(spans):
        start = s[0]
        end = s[1]
        for x in spans[i:]:
            if x[0] < end:
                end = x[1]
        yield (start, end)

def main(sentence, ignores):
    # spans holds the start and end indices of each ignore phrase, in order of occurrence
    spans = sorted(
        [[m.span() for m in re.finditer(p, sentence)][0] for p in ignores if p in sentence]
    )
    # replace overlapping spans with a single span encompassing the overlapped range
    spans = [s for s in _span_combiner(spans)]
    # recreate the ignore list by slicing the sentence with the combined spans
    ignores = [sentence[s[0]:s[1]] for s in spans]
    for phrase in ignores:
        reversed_phrase = ' '.join([word[::-1] for word in phrase.split()])
        sentence = sentence.replace(phrase, reversed_phrase)
    return ' '.join(word[::-1] for word in sentence.split())

if __name__ == "__main__":
    print(main('The quick brown fox jumps over the lazy dog', ['quick', 'lazy']))
    print(main('The quick brown fox jumps over the lazy dog', ['brown fox', 'lazy dog']))
    print(main('The quick brown fox jumps over the lazy dog', ['nonexistent', 'brown fox', 'quick brown']))
    print(main('The quick brown fox jumps over the brown fox', ['brown fox', 'quick brown']))
results:
ehT quick nworb xof spmuj revo eht lazy god
ehT kciuq brown fox spmuj revo eht lazy dog
ehT quick brown fox spmuj revo eht yzal god
ehT quick brown fox spmuj revo eht brown fox