Trying to remove symbol (" - ") with whitespace while keeping symbol ("-") without whitespace - python

I have a txt file that I open in Python, and I'm trying to remove the symbols and order the remaining words alphabetically. Removing the periods, commas, etc. isn't a problem. However, I can't seem to remove the dash symbol surrounded by whitespace when I add it to a list together with the rest of the symbols.
This is an example of what I open:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
This is what I want (periods removed, and dash symbols which aren't attached to a word removed):
content = "The quick brown fox who was hungry jumps over the 7-year old lazy dog"
But I either get this (all dash symbols removed):
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Or this (dash symbol unremoved):
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog"
This is my entire code. Adding a content.replace() works, but that isn't what I want:
f = open("article.txt", "r")
# Create variable (Like this removing " - " works)
content = f.read()
content = content.replace(" - ", " ")
# Create list
wordlist = content.split()
# Which symbols (If I remove the line "content = content.replace(" - ", " ")", the " - " in this list doesn't get removed here)
chars = [",", ".", "'", "(", ")", "‘", "’", " - "]
# Remove symbols
words = []
for element in wordlist:
    temp = ""
    for ch in element:
        if ch not in chars:
            temp += ch
    words.append(temp)
# Print words, sort alphabetically and do not print duplicates
for word in sorted(set(words)):
    print(word)
It works like this. But when I remove the content = content.replace(" - ", " ") line, the "whitespace + dash symbol + whitespace" in chars doesn't get removed.
And if I replace it with "-" (no whitespaces), I get this which I don't want:
content = "The quick brown fox who was hungry jumps over the 7year old lazy dog"
Is it possible at all to do this with a list like chars, or is my only option to do this with a .replace()?
And is there a particular reason why Python orders capitalized words alphabetically first, and uncapitalized words separately afterwards?
Like this (The letters ABC are just added to emphasize what I'm trying to say):
7-year
A
B
C
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who

You can use re.sub like this:
>>> import re
>>> strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
>>> content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
>>> strip_chars.sub("", content)
'The quick brown fox who was hungry jumps over the 7-year old lazy dog'
>>> strip_chars.sub("", content).split()
['The', 'quick', 'brown', 'fox', 'who', 'was', 'hungry', 'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog']
>>> print(*sorted(strip_chars.sub("", content).split()), sep='\n')
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who
Summarizing my comments and putting it all together:
from pathlib import Path
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
article = Path('/path/to/your/article.txt')
content = article.read_text()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words, key=lambda x: x.lower()):
    print(word)
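As for why capitalized words sort first: Python compares strings by Unicode code points, and the uppercase ASCII letters (65-90) all come before the lowercase ones (97-122), which is why a key like lambda x: x.lower() is needed for a case-insensitive order. A quick illustration:

```python
words = ["banana", "apple", "Cherry"]

# The default sort compares code points, so uppercase sorts first.
print(sorted(words))                 # ['Cherry', 'apple', 'banana']

# Normalizing case in the key gives the usual dictionary order.
print(sorted(words, key=str.lower))  # ['apple', 'banana', 'Cherry']

print(ord('Z'), ord('a'))            # 90 97
```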
If The and the, for example, count as duplicate words then you just need to convert content to lower case letters. The code would be this one instead:
from pathlib import Path
from collections import Counter
import re

strip_chars = re.compile(r"(?:[,.'()‘’])|(?:[-,]\s)")
article = Path('/path/to/your/article.txt')
content = article.read_text().lower()
words = Counter(strip_chars.sub('', content).split())
for word in sorted(words):
    print(word)
Finally, as a good side effect of using collections.Counter, you also get a words counter in words and you can answer questions like "what are the top ten most common words?" with something like:
words.most_common(10)
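For instance, with some made-up text:

```python
from collections import Counter

# Count word occurrences; most_common() returns (word, count) pairs,
# highest count first.
words = Counter("the quick brown fox jumps over the lazy dog the end".split())
print(words.most_common(2))  # [('the', 3), ('quick', 1)]
```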

After
wordlist = content.split()
your list no longer contains anything with leading/trailing whitespace.
str.split()
collapses consecutive whitespace. So there is no ' - ' in your split list.
Docs: https://docs.python.org/3/library/stdtypes.html#str.split
str.split(sep=None, maxsplit=-1)
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
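In other words, a quick demonstration:

```python
content = "fox - who was hungry"

# split() collapses the runs of whitespace, so the dash survives as its
# own token '-', never as ' - ' with surrounding spaces:
print(content.split())  # ['fox', '-', 'who', 'was', 'hungry']
```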
Replacing ' - ' seems right - the other way to stay close to your code would be to remove exactly '-' from your split list:
content = "The quick brown fox - who was hungry - jumps over the 7-year old lazy dog."
wordlist = content.split()
print(wordlist)
chars = [",", ".", "'", "(", ")", "‘", "’"] # modified
words = []
for element in wordlist:
    temp = ""
    if element == '-':  # skip pure -
        continue
    for ch in element:  # handle characters to be removed
        if ch not in chars:
            temp += ch
    words.append(temp)
Output:
['The', 'quick', 'brown', 'fox', '-', 'who', 'was', 'hungry', '-',
'jumps', 'over', 'the', '7-year', 'old', 'lazy', 'dog.']
7-year
The
brown
dog
fox
hungry
jumps
lazy
old
over
quick
the
was
who

Related

replace a list of words with regex

I have a list of words list_words = ["cat", "dog", "animals"]
And I have a text = "I have a lot of animals a cat and a dog"
I want a regex that is able to add a comma at the end of every word that comes before any word in the given list.
I want my text to be like this: text = "I have a lot of, animals a, cat and a, dog"
My code so far:
import re
list_words = ["cat", "dog", "animals","adam"]
text = "I have a lot of animals, cat and a dog"
for word in list_words:
    if word in text:
        word = re.search(r" (\s*{})".format(word), text).group(1)
        text = text.replace(f" {word}", f", {word}")
print(text)
But I have 2 issues here:
1: if I have a text like this: text = "I have a lot of animals cat and a dogy"
it turns it into: text = "I have a lot of, animals, cat and a, dogy"
which is not the wanted result; I wanted to match only the word itself, not words with an
addition like dogy
2: if I have a text like this: text = "I have a lot of animals, cat and a dogy"
it still adds another comma, which is not what I want
You can use
,*(\s*\b(?:cat|dog|animals|adam))\b
See the regex demo. Details:
,* - zero or more commas
(\s*\b(?:cat|dog|animals|adam)) - Group 1:
\s* - zero or more whitespaces
\b - a word boundary
(?:cat|dog|animals|adam) - one of the words
\b - word boundary
See the Python demo:
import re
list_words = ["cat", "dog", "animals", "adam"]
text = "I have a lot of animals, cat and a dog"
pattern = r",*(\s*\b(?:{}))\b".format("|".join(list_words))
print( re.sub(pattern, r",\1", text) )
# => I have a lot of, animals, cat and a, dog
All words get a comma:
import re
list_words = ["cat", "dog", "animals"]
text = "I have a lot of animals a cat and a dog"
for word in list_words:
    word = re.search(r" (\s*{})\b".format(word), text).group(1)
    text = text.replace(f" {word}", f", {word}")
print(text)
Go with a simpler method.
list_words = ["cat", "dog", "animals"]
text = "I have a lot of animals a cat and a dog"
test_list_words = []
for new in text.split(" "):
    if new in list_words:
        new = new + ","
        test_list_words.append(new)
    else:
        test_list_words.append(new)
print(' '.join(test_list_words))
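The same loop can be condensed into a comprehension (same behavior, just shorter):

```python
list_words = ["cat", "dog", "animals"]
text = "I have a lot of animals a cat and a dog"

# Append a comma to every word that appears in list_words.
result = ' '.join(w + ',' if w in list_words else w for w in text.split(' '))
print(result)  # I have a lot of animals, a cat, and a dog,
```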

How to highlight (only) word errors using difflib?

I'm trying to compare the output of a speech-to-text API with a ground truth transcription. What I'd like to do is capitalize the words in the ground truth which the speech-to-text API either missed or misinterpreted.
For Example:
Truth:
The quick brown fox jumps over the lazy dog.
Speech-to-text Output:
the quick brown box jumps over the dog
Desired Result:
The quick brown FOX jumps over the LAZY dog.
My initial instinct was to remove the capitalization and punctuation from the ground truth and use difflib. This gets me an accurate diff, but I'm having trouble mapping the output back to positions in the original text. I would like to keep the ground truth capitalization and punctuation to display the results, even if I'm only interested in word errors.
Is there any way to express difflib output as word-level changes on an original text?
I would also like to suggest a solution using difflib, but I'd prefer using RegEx for word detection since it will be more precise and more tolerant of weird characters and other issues.
I've added some weird text to your original strings to show what I mean:
import re
import difflib
truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'
truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())
for d in difflib.ndiff(truth, speech):
    print(d)
Output
  the
  quick
  brown
- fox
+ box
  jumps
  over
  the
- lazy
  dog
Another possible output:
diff = difflib.unified_diff(truth, speech)
print(''.join(diff))
Output
--- 
+++ 
@@ -1,9 +1,8 @@
 the quick brown-fox+box jumps over the-lazy dog
Why not just split the sentence into words then use difflib on those?
import difflib
truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()
for d in difflib.ndiff(truth, speech):
    print(d)
So I think I've solved the problem. I realised that difflib's context_diff provides the indices of lines that have changes in them. To get the indices for the "ground truth" text, I remove the capitalization/punctuation, split the text into individual words, and then do the following:
altered_word_indices = []
diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
    if line.startswith('*** ') and line.endswith(' ****\n'):
        line = line.replace(' ', '').replace('\n', '').replace('*', '')
        if ',' in line:
            split_line = line.split(',')
            for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
                altered_word_indices.append((int(split_line[0]) + i) - 1)
        else:
            altered_word_indices.append(int(line) - 1)
Following this, I print it out with the changed words capitalized:
split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
    if i in altered_word_indices:
        print(split_ground_truth[i].upper(), end=' ')
    else:
        print(split_ground_truth[i], end=' ')
This allows me to print out "The quick brown FOX jumps over the LAZY dog." (capitalization / punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".
This is...not a super elegant solution, and it's subject to testing, cleanup, error handling, etc. But it seems like a decent start and is potentially useful for someone else running into the same problem. I'll leave this question open for a few days in case someone comes up with a less gross way of getting the same result.
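One possibly less gross alternative, as a sketch: difflib.SequenceMatcher.get_opcodes() yields word-level index ranges directly, so you can align the normalized words while indexing back into the original, punctuated text (the strings below are this question's examples, not real API output):

```python
import difflib

ground_truth = "The quick brown fox jumps over the lazy dog."
hypothesis = "the quick brown box jumps over the dog"

truth_words = ground_truth.split()
normalized = [w.lower().strip('.,!?') for w in truth_words]
hyp_words = hypothesis.lower().split()

# Opcodes describe how to turn `normalized` into `hyp_words`; any
# non-'equal' range marks ground-truth words the engine missed or changed.
sm = difflib.SequenceMatcher(None, normalized, hyp_words)
out = []
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    for i in range(i1, i2):
        out.append(truth_words[i] if tag == 'equal' else truth_words[i].upper())

print(' '.join(out))  # The quick brown FOX jumps over the LAZY dog.
```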

Search a string for a word/sentence and print the following word

I have a string that has around 10 lines of text. What I am trying to do is find a sentence that has a specific word(s) in it, and display the word following.
Example String:
The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat
I want the script to search for 'The slow', then print the following word, so in this case, 'donkey'.
I have tried using the find method, but that just returns the location of the word(s).
Example code:
sSearch = output.find("destination-pattern")
print(sSearch)
Any help would be greatly appreciated.
output = "The slow donkey brown fox"
patt = "The slow"
sSearch = output.find(patt)
print(output[sSearch+len(patt)+1:].split(' ')[0])
output:
donkey
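A related approach with str.partition avoids the index arithmetic and handles a missing pattern gracefully (an empty sep means no match):

```python
output = "The slow donkey brown fox"
patt = "The slow"

# partition() splits the string around the first occurrence of patt.
before, sep, after = output.partition(patt)
print(after.split()[0] if sep else "not found")  # donkey
```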
You could work with regular expressions. Python has builtin library called re.
Example usage:
import re

s = "The slow donkey some more text"
finder = "The slow"
idx_finder_end = s.find(finder) + len(finder)
next_word_match = re.match(r"\s\w*\s", s[idx_finder_end:])
next_word = next_word_match.group().strip()
# donkey
I would do it using regular expressions (re module) following way:
import re
txt = '''The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat'''
words = re.findall(r'(?<=The slow) (\w*)', txt)
print(words) # prints ['donkey']
Note that words is now list of words, if you are sure that there is exactly 1 word to be found you could do then:
word = words[0]
print(word) # prints donkey
Explanation: I used a so-called lookbehind assertion in the first argument of re.findall, which means I am looking for something preceded by The slow. \w* means any substring consisting of letters, digits, or underscores (_). I enclosed it in a group (parentheses) so that the leading space is not part of the captured word.
You can do it using regular expressions:
>>> import re
>>> r=re.compile(r'The slow\s+\b(\w+)\b')
>>> r.match('The slow donkey')[1]
'donkey'
>>>

How to extract the text between a word and its next occurrence?

I have the following sample text:
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''
What I'm trying to accomplish is to create a regex expression which will extract the following lines from mystr:
['This is introduction paragraph','This is non-introduction paragraph',' This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']
If for any reason you need to use regex - perhaps the splitting string is more involved than just "a" - the re module has a split function too:
import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"
print(re.split(r'\s?\ba\b\s?',str_))
# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
EDIT: expanded answer with the new information you provided...
After your edit, in which you give a better description of your problem and include text that looks like LaTeX, I think you need to extract the lines that do not start with a \, which are the LaTeX commands. In other words, you need the lines containing only text. Try the following, again using regular expressions:
import re
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''
pattern = r"^[^\\]*\n"
matches = re.findall(pattern, mystr, flags=re.M)
print(matches)
# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
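If you also need the itemize block from the edited question (everything between one \section line and the next \section or \end{document}), a lazy dot-all pattern is one possible sketch (assuming sections only end at those two markers):

```python
import re

mystr = r'''\documentclass[12pt]{article}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\end{itemize}
\end{document}'''

# Capture lazily after each \section{...} line until the next
# \section or \end{document}; re.S lets '.' span newlines.
pattern = r'\\section\{[^}]*\}\n(.*?)(?=\\section|\\end\{document\})'
parts = re.findall(pattern, mystr, flags=re.S)
print(parts)
```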
You can use the split method from str:
my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)
Results in:
['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']
Note: Don't use str as a variable name, as it shadows the built-in str type in Python.

Python - removing some punctuation from text

I want Python to remove only some punctuation from a string, let's say I want to remove all the punctuation except '#'
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Here the output is
The quick brown fox like totally jumped man
But what I want is something like this
The quick brown fox like totally jumped #man
Is there a way to selectively remove punctuation from a text leaving out the punctuation that we want in the text intact?
string.punctuation contains all the punctuation characters. Remove # from it. Then replace every remaining punctuation character with ''.
>>> import re
>>> import string
>>> a = string.punctuation.replace('#','')
>>> re.sub(r'[{}]'.format(a),'','The quick brown fox, like, totally jumped, #man!')
'The quick brown fox like totally jumped #man'
Just remove the character you don't want to touch from the replacement string:
import string
remove = dict.fromkeys(map(ord, '\n' + string.punctuation.replace('#','')))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Also note that I changed '\n ' to '\n', as the former will remove spaces from your string.
Result:
The quick brown fox like totally jumped #man
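The same thing can also be written with str.maketrans, which builds the mapping for translate in one step:

```python
import string

sample = 'The quick brown fox, like, totally jumped, #man!'

# Build a translation table that deletes every punctuation
# character except '#'.
table = str.maketrans('', '', string.punctuation.replace('#', ''))
print(sample.translate(table))  # The quick brown fox like totally jumped #man
```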
