removing nonwords from a text without any library - python

How can I remove non-words from the text below without using any library in Python?
By a word I mean a string containing only English letters plus "'" and "-".
Hence, words like "can't", "John's", and "full-time" count as valid words.
Furthermore, a word doesn't contain any digits or symbols like ".", ",", "!", "?", etc.
The only single-letter words in English are "a" and "I".
99998 Zwizwai confirmed that there were attempts to place him, together with Mudzuri, on the
sanctions list, describing them as “a Nicodimous diploآکروجت تاج طلایی نیروی هوایی شاهنشاهی ایران ” by
some fellow MDC-T
______________________________________
| |
| The Cast: (in order of appearance) |
|______________________________________|
100000 在大會上，女神為女性安全再次發聲，代表"HeforShe"運動發表消除校園暴力，保障女性安全（calling
on universities to ensure the safety of women on campus）的演講。
This is what I tried:
for word in row.split():
    if word.isalpha() == False:
        word = word[:-1]
    print (word)
    words.append(word.lower())
return words

Try:
words = list()
with open("file.txt", errors="ignore") as infile:
    for row in infile:
        for word in row.split():
            if word.strip('"(.,!?):').replace("'","").replace("-","").isalpha():
                words.append(word.strip('"(.,!?):').lower())
>>> " ".join(words)
'zwizwai confirmed that there were attempts to place him together with mudzuri on the sanctions list describing them as nicodimous by some fellow mdc-t the cast in order of appearance on universities to ensure the safety of women on'

Try this:
for word in row.split():
    # strip these first, since they stay attached to a word when splitting on spaces
    word = word.replace("!", "").replace("?", "").replace(".", "")
    if word.replace("'", "").replace("-", "").isalpha():
        words.append(word.lower())
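One caveat with str.isalpha(): it returns True for any Unicode letter, so tokens written in other scripts can slip through. Here is a minimal sketch that checks against an explicit set of English letters instead and also applies the single-letter rule from the question. The names is_word, extract_words and STRIP_CHARS are made up for illustration, and the set of punctuation to strip is an assumption:

```python
# No imports: the question asks for a solution without any library.
LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
ALLOWED = LETTERS | {"'", "’", "-"}
STRIP_CHARS = "\"“”(),.!?:;"  # assumed set of surrounding punctuation


def is_word(token):
    # Every character must be an English letter, an apostrophe or a hyphen.
    if not token or not all(ch in ALLOWED for ch in token):
        return False
    # Reject tokens made only of apostrophes/hyphens.
    if not any(ch in LETTERS for ch in token):
        return False
    # The only single-letter words are "a" and "I" ("A" is assumed to be fine too).
    if len(token) == 1:
        return token in ("a", "A", "I")
    return True


def extract_words(text):
    words = []
    for token in text.split():
        token = token.strip(STRIP_CHARS)
        if is_word(token):
            words.append(token.lower())
    return words


print(extract_words('99998 Zwizwai confirmed, "can\'t" full-time x I a 在大會上 MDC-T'))
```

Whether trailing hyphens or apostrophes should also be stripped depends on how strictly you read the definition of a word; this sketch leaves them in place.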

Related

Finding the singular or plural form of a word with regex

Let's assume I have the sentence:
sentence = "A cow runs on the grass"
If I want to replace the word cow with "some" special token, I can do:
to_replace = "cow"
# A <SPECIAL> runs on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace)})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Additionally, if I want to replace its plural form, I could do:
sentence = "The cows run on the grass"
to_replace = "cow"
# The <SPECIAL> run on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace) + 's?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
This also does the replacement when the word appears in its singular form cow, because the s? makes the trailing s optional.
My question is how to apply the same idea more generally, i.e., find and replace words that can be singular, plural ending in s, or plural ending in es (note that I'm intentionally ignoring many edge cases that could appear; these are discussed in the comments of the question). Another way to frame the question: how can I add multiple optional suffixes to a word so that it works for the following examples:
to_replace = "cow"
sentence1 = "The cow runs on the grass"
sentence2 = "The cows run on the grass"
# --------------
to_replace = "gas"
sentence3 = "There are many natural gases"

I suggest using regular Python logic; avoid stretching regexes too far if you don't need to:
phrase = "There are many cows in the field cowes"
for word in phrase.split():
    if word == "cow" or word == "cow" + "s" or word == "cow" + "es":
        phrase = phrase.replace(word, "replacement")
print(phrase)
Output:
There are many replacement in the field replacement
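Note that phrase.replace(word, ...) replaces every occurrence of that substring, so a longer word containing it (say "cowshed") would be altered as well. A sketch that compares whole tokens instead; replace_word_forms is a hypothetical helper, and it assumes words are separated by whitespace with no punctuation attached:

```python
def replace_word_forms(phrase, base, replacement):
    # Accepted forms: the base word, base + "s", base + "es".
    forms = {base, base + "s", base + "es"}
    # Rebuild the phrase token by token, so only exact matches are replaced.
    return " ".join(replacement if word in forms else word
                    for word in phrase.split())


print(replace_word_forms("There are many cows in the cowshed", "cow", "replacement"))
```

Joining with a single space normalizes the original whitespace, which may or may not matter for your text.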

Apparently, for the use case I posted, I can make the suffix group optional, so it could go as:
re.sub(rf"(?!\B\w)({re.escape(to_replace) + '(s|es)?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Note that this will not work for the many edge cases discussed in the comments!
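For reference, a small runnable sketch of this pattern applied to the three example sentences from the question. The helper name replace_forms and the default token are assumptions for illustration; the group is written as non-capturing, which does not change what is matched:

```python
import re


def replace_forms(sentence, to_replace, token="<SPECIAL>"):
    # Same boundary checks as in the question; the optional group
    # accepts a trailing "s" or "es" after the escaped word.
    pattern = rf"(?!\B\w)({re.escape(to_replace)}(?:s|es)?)(?<!\w\B)"
    return re.sub(pattern, token, sentence, count=1)


print(replace_forms("The cow runs on the grass", "cow"))
print(replace_forms("The cows run on the grass", "cow"))
print(replace_forms("There are many natural gases", "gas"))
print(replace_forms("The cowboy sings", "cow"))  # no whole-word match, left unchanged
```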

how to remove instances and possible multiple instances of a certain word in a string and return a string (CODEWARS dubstep)

I have had a go at the CODEWARS dubstep challenge using python.
My code is below; it works and passes the kata tests. However, it took me a long time and I ended up using a brute-force approach (I'm a newbie),
basically replacing and stripping the string until it worked.
Any ideas, with comments, on how my code could be improved?
TASK SUMMARY:
Let's assume that a song consists of some number of words (that don't contain WUB). To make the dubstep remix of this song, Polycarpus inserts a certain number of words "WUB" before the first word of the song (the number may be zero), after the last word (the number may be zero), and between words (at least one between any pair of neighbouring words), and then the boy glues together all the words, including "WUB", in one string and plays the song at the club.
For example, a song with words "I AM X" can transform into a dubstep remix as "WUBWUBIWUBAMWUBWUBX" and cannot transform into "WUBWUBIAMWUBX".
song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB")
# => WE ARE THE CHAMPIONS MY FRIEND
song_decoder("AWUBBWUBC"), "A B C","WUB should be replaced by 1 space"
song_decoder("AWUBWUBWUBBWUBWUBWUBC"), "A B C","multiples WUB should be replaced by only 1 space"
song_decoder("WUBAWUBBWUBCWUB"), "A B C","heading or trailing spaces should be removed"
Thanks in advance, (I am new to stackoverflow also)
MY CODE:
def song_decoder(song):
    new_song = song.replace("WUB", " ")
    new_song2 = new_song.strip()
    new_song3 = new_song2.replace("  ", " ")
    new_song4 = new_song3.replace("  ", " ")
    return(new_song4)

I don't know if it counts as an improvement, but I would use split and join:
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
text = text.replace("WUB", " ")
print(text)
words = text.split()
print(words)
text = " ".join(words)
print(text)
Result
WE ARE THE CHAMPIONS MY FRIEND
['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
WE ARE THE CHAMPIONS MY FRIEND
EDIT:
A slightly different version: I split on "WUB", but that creates empty elements between two consecutive WUBs, which then need to be removed.
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
words = text.split("WUB")
print(words)
words = [x for x in words if x] # remove empty elements
#words = list(filter(None, words)) # remove empty elements
print(words)
text = " ".join(words)
print(text)
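Wrapped up as the function the kata asks for, the split-and-filter idea could look like this (a sketch, checked only against the examples quoted in the question):

```python
def song_decoder(song):
    # Splitting on "WUB" leaves empty strings wherever WUBs were adjacent
    # or at the ends; dropping them and joining with one space does the rest.
    return " ".join(part for part in song.split("WUB") if part)


print(song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
```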

Using .replace effectively on text

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation
path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.replace(")", " ").replace("(", " ")
                          .replace(":", " ").replace("", " ").split())
wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]
for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))
print(wordlist)
The output should be 'Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.', as sojournings is the first word that only appears once. Instead, it outputs 'GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan.' Because some of the unique words also appear inside other words, replace capitalizes those parts as well.
Any ideas?

I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string
# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."
rm_punc = sentence.translate(str.maketrans("", "", string.punctuation)) # remove punctuation (Python 3 form)
words = rm_punc.split(' ') # split spaces to get a list of words
# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # if word only occurs 1 time, append it to the list
    if words.count(word) == 1:
        single_occurrences.append(word)
# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start` which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)
        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]
print(sentence)

Text replacement by patterns calls for regex.
Your text is a bit tricky; you have to
remove digits
remove punctuation
split into words
care about capitalisation: 'It's' vs 'it's'
only replace full matches: 'remote' vs 'mote' when replacing mote
etc.
This should do it; see the comments inside for explanations.
bible.txt is from your link:
from collections import Counter, defaultdict
from string import punctuation, digits
import re
with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()
# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)
# split into words
s3 = s2.split()
# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word) # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}
# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w,n in Counter( (w.upper() for w in s3) ).most_common() if n == 1]
text = s
# now the replace part - for all upper single words
for upp in single_occurence_upper_words:
    # for all occurring capitalizations in the text
    for orig in repl[upp]:
        # use regex replace to find the original word from our repl dict with
        # space/punctuation before/after it and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])",upp, text)
print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure we replace only full words; see regex syntax.
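An alternative sketch that avoids one re.sub call per word: tokenize once with a regex and upper-case each match through a callback. The function name upper_singletons and the word pattern (letters and apostrophes only, counted case-insensitively) are assumptions for illustration, not part of the code above:

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")  # assumed definition of a word


def upper_singletons(text):
    # Count each word case-insensitively.
    counts = Counter(m.group().lower() for m in WORD.finditer(text))

    def repl(match):
        word = match.group()
        # Upper-case only words that occur exactly once in the whole text.
        return word.upper() if counts[word.lower()] == 1 else word

    # Substituting match by match never touches substrings of longer words.
    return WORD.sub(repl, text)


print(upper_singletons("this is his hat, this is"))
```

Because the substitution works on whole matches, 'his' can be upper-cased without affecting 'this'.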

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.

Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by any 2 to 4 characters and then a literal dot. You can tighten that up if required, e.g. using [a-z]: the pattern [A-Z][a-z]{2,4}\. matches an uppercase character followed by 2 to 4 lowercase characters followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
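A short sketch of the tighter pattern on a few lines taken from the question's text (the variable names are just for illustration):

```python
import re

text = (
    "King. Great happinesse\n"
    "Rosse. That now Sweno, the Norwayes King,\n"
    "King. What he hath lost, Noble Macbeth hath wonne."
)

# One capital letter, then 2 to 4 lowercase letters, then a literal dot.
names = re.findall(r"[A-Z][a-z]{2,4}\.", text)
print(names)               # ['King.', 'Rosse.', 'King.']
print(sorted(set(names)))  # ['King.', 'Rosse.']
```

"King," (with a comma) and "wonne." (no capital) are both skipped by this pattern.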

You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
for line in open('text.txt'):
    words = line.split()
    for word in words:
        # starts with a capital, ends with a dot, 3 to 5 characters before the dot
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word)-1 <= 5:
            matched_words.append(word)
print(matched_words)

Python: Auto-correct

I have two files, check.txt and orig.txt. I want to check every word in check.txt and see if it matches any word in orig.txt. If it does match, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it's not working as required. Kindly help.
check.txt looks like this:
ukrain
troop
force
and orig.txt looks like:
ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama #bostonglobe #crowleycnn #hardball
rt #cbcnews: breaking: .#vice journalist #simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou
russia 'outraged' at deadly shootout in east #ukraine - moscow:... http://t.co/nqim7uk7zg
#groundtroops #russianpresidentvladimirputin
http://pastebin.com/XJeDhY3G
f = open('check.txt','r')
orig = open('orig.txt','r')
new = open('newfile.txt','w')
for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()
            if word in word2:
                word = word2
            else:
                print('not found')
    new.write(word)

There are two problems with your code:
When you loop over the words in f, each word still has a trailing newline character, so your in check does not work.
You want to iterate over orig for each of the words from f, but files are iterators, so orig is exhausted after the first word from f.
You can fix those by doing word = word.strip() and orig = list(orig), or you can try something like this:
# get all stemmed words
stemmed = [line.strip() for line in f]
# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)
Or shorter (without that final double loop), using difflib, as suggested in the comments:
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
Also, remember to close your files, or use the with keyword to close them automatically.
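To try the difflib variant without the files, here is a self-contained sketch with small in-memory lists standing in for check.txt and orig.txt. The sample words and the autocorrect helper are made up for illustration; difflib.get_close_matches uses its default similarity cutoff of 0.6 here:

```python
import difflib

# Stand-ins for the contents of check.txt and orig.txt (hypothetical data).
stemmed = ["ukrain", "troop", "force"]
original = sorted({"ukraine", "cnn", "troops", "news"})


def autocorrect(word, candidates):
    # Return the closest candidate, or the word itself if nothing is close enough.
    matches = difflib.get_close_matches(word, candidates, n=1)
    return matches[0] if matches else word


corrected = [autocorrect(word, original) for word in stemmed]
print(corrected)  # ['ukraine', 'troops', 'force']
```

Unlike the substring check, this would not match "troop" against a much longer token such as "#groundtroops", since the similarity ratio falls below the cutoff.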
