Converting a Text File to a String in Python

I am new to Python and am trying to find the largest word in alice_in_wonderland.txt. I think I have a good system set up (see below), but my output is returning a "word" with dashes connecting multiple words. Is there some way to remove the dashes in the input of the file? For the text file visit here.
sample from text file:
That's very important,' the King said, turning to the jury. They were
just beginning to write this down on their slates, when the White
Rabbit interrupted: UNimportant, your Majesty means, of course,' he
said in a very respectful tone, but frowning and making faces at him
as he spoke. " UNimportant, of course, I meant,' the King hastily
said, and went on to himself in an undertone, important--unimportant--
unimportant--important--' as if he were trying which word sounded
best."
code:
#String input
with open("alice_in_wonderland.txt", "r") as myfile:
    string = myfile.read().replace('\n', '')
#Initialize list
my_list = []
#Split words into list
for word in string.split(' '):
    my_list.append(word)
#Initialize list
uniqueWords = []
#Fill in new list with unique words to shorten final printout
for i in my_list:
    if i not in uniqueWords:
        uniqueWords.append(i)
#Length of longest word
count = 0
#Longest word placeholder
longest = []
for word in uniqueWords:
    if len(word) > count:
        longest = word
        count = len(longest)
print(longest)

>>> import nltk # pip install nltk
>>> nltk.download('gutenberg')
>>> words = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> max(words, key=len) # find the longest word
'disappointment'
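If several words tie for the longest length, a small plain-Python follow-up (a sketch reusing the words variable from above, not part of the original answer) lists them all:

longest_len = max(len(w) for w in words)
print(sorted({w for w in words if len(w) == longest_len}))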

Here's one way using re and mmap:
import re
import mmap

with open('your alice in wonderland file', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = re.finditer(rb'\w+', mf)
    print(max((word.group() for word in words), key=len).decode())
# disappointment
Far more efficient than loading the whole file into physical memory.
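If mmap feels like overkill, a plain line-by-line scan also avoids loading the whole file at once; this is a minimal sketch (assuming the file name from the question), not part of the original answer:

import re

longest = ''
with open('alice_in_wonderland.txt') as fin:
    for line in fin:  # read lazily, one line at a time
        for match in re.finditer(r'\w+', line):
            if len(match.group()) > len(longest):
                longest = match.group()
print(longest)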

Use str.replace to replace the dashes with spaces (or whatever you want). To do this, simply add another call to replace after the first call on line 3:
string=myfile.read().replace('\n','').replace('-', ' ')
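Putting the pieces together, here is a compact sketch of the whole task (not the original poster's code; it also replaces '\n' with a space so the last word of one line doesn't get glued to the first word of the next):

with open("alice_in_wonderland.txt") as myfile:
    text = myfile.read().replace('\n', ' ').replace('-', ' ')

unique_words = set(text.split())   # deduplicate in one step
print(max(unique_words, key=len))  # longest of the unique words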

Related

how to remove instances and possible multiple instances of a certain word in a string and return a string (CODEWARS dubstep)

I have had a go at the CODEWARS dubstep challenge using Python.
My code is below. It works and I pass the kata test; however, it took me a long time and I ended up using a brute-force approach (newbie), basically replacing and stripping the string until it worked.
Any ideas, with comments, on how my code could be improved, please?
TASK SUMMARY:
Let's assume that a song consists of some number of words (that don't contain WUB). To make the dubstep remix of this song, Polycarpus inserts a certain number of words "WUB" before the first word of the song (the number may be zero), after the last word (the number may be zero), and between words (at least one between any pair of neighbouring words), and then the boy glues together all the words, including "WUB", in one string and plays the song at the club.
For example, a song with words "I AM X" can transform into a dubstep remix as "WUBWUBIWUBAMWUBWUBX" and cannot transform into "WUBWUBIAMWUBX".
song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB")
# => WE ARE THE CHAMPIONS MY FRIEND
song_decoder("AWUBBWUBC"), "A B C","WUB should be replaced by 1 space"
song_decoder("AWUBWUBWUBBWUBWUBWUBC"), "A B C","multiples WUB should be replaced by only 1 space"
song_decoder("WUBAWUBBWUBCWUB"), "A B C","heading or trailing spaces should be removed"
Thanks in advance, (I am new to stackoverflow also)
MY CODE:
def song_decoder(song):
    new_song = song.replace("WUB", " ")
    new_song2 = new_song.strip()
    new_song3 = new_song2.replace("  ", " ")  # collapse double spaces
    new_song4 = new_song3.replace("  ", " ")  # and again, for triple WUBs
    return new_song4
I don't know if this improves it, but I would use split and join:
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
text = text.replace("WUB", " ")
print(text)
words = text.split()
print(words)
text = " ".join(words)
print(text)
Result
WE ARE THE CHAMPIONS MY FRIEND
['WE', 'ARE', 'THE', 'CHAMPIONS', 'MY', 'FRIEND']
WE ARE THE CHAMPIONS MY FRIEND
EDIT:
A slightly different version: I split using "WUB", but that creates empty elements between two consecutive WUBs, which then need to be removed.
text = 'WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB'
words = text.split("WUB")
print(words)
words = [x for x in words if x] # remove empty elements
#words = list(filter(None, words)) # remove empty elements
print(words)
text = " ".join(words)
print(text)
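Since split() with no argument already drops empty strings and strips the ends, the same idea collapses into a one-liner; a small sketch:

def song_decoder(song):
    return " ".join(song.replace("WUB", " ").split())

print(song_decoder("WUBWEWUBAREWUBWUBTHEWUBCHAMPIONSWUBMYWUBFRIENDWUB"))
# WE ARE THE CHAMPIONS MY FRIEND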

If a certain word is not before the search word then add to list python

I would like the program to detect whether a certain word appears before the search word and, if it does not, add the search word to a list.
This is what I have come up with myself:
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
all = ["take", "take"]
all2= [w for w in all if not(re.search(r'not' + w + r'\b', sentence))]
print(all2)
The expected output is ["take"], but it stays the same: ["take", "take"].
Here is how it should be formulated: gather all occurrences of the word take that aren't preceded by the word not:
import re
sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
search_word = 'take'
all_takes_without_not = re.findall(fr'(?<!\bnot)\s+({search_word})\b', sentence)
print(all_takes_without_not)
The output:
['take']
It may be simpler to first convert your sentence to a list of words.
from itertools import chain
# Get individual words from the string
words = sentence.split()
# Create an iterator which yields the previous word at each position
previous = chain([None], words)
output = [word for prev, word in zip(previous, words) if word=='take' and prev != 'not']
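A quick check of this approach, reusing the sentence from the question (an illustrative sketch, not part of the original answer):

from itertools import chain

sentence = "today i will take my dog for a walk, tomorrow i will not take my dog for a walk"
words = sentence.split()
previous = chain([None], words)  # the previous word at each position
output = [word for prev, word in zip(previous, words)
          if word == 'take' and prev != 'not']
print(output)  # ['take']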

Using .replace effectively on text

I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation

path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f
                          for word in line.replace(")", " ").replace("(", " ")
                          .replace(":", " ").replace("", " ").split())
wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]
for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))
print(wordlist)
The output should be "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.", as sojournings is the first word that only appears once. Instead, it outputs "GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan." Because some of the unique words also appear as substrings inside other words, it capitalizes those occurrences as well.
Any ideas?
I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string

# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."

rm_punc = sentence.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
words = rm_punc.split(' ')  # split on spaces to get a list of words

# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # If the word only occurs once, append it to the list.
    if words.count(word) == 1:
        single_occurrences.append(word)

# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start`, which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)
        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]

print(sentence)
Text replacement by pattern calls for regex.
Your text is a bit tricky; you have to:
remove digits
remove punctuation
split into words
care about capitalization: 'It's' vs 'it's'
only replace full matches: 'remote' vs 'mote' when replacing 'mote'
etc.
This should do it (see the comments inside for explanations); bible.txt is from your link:
from collections import Counter, defaultdict
from string import punctuation, digits
import re

with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()

# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join(c for c in s if c not in ps)

# split into words
s3 = s2.split()

# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word)  # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}

# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w, n in Counter(w.upper() for w in s3).most_common() if n == 1]

text = s
# now the replace part - for all single uppercase words
for upp in single_occurence_upper_words:
    # for all capitalizations occurring in the text
    for orig in repl[upp]:
        # use a regex replace to find the original word from our repl dict with
        # space/punctuation before/after it, and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])", upp, text)

print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure we replace only full words; see regex syntax.
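A minimal demonstration of why the lookarounds matter (an illustrative sketch, not from the original answer): a plain substring replace also hits 'mote' inside 'remote', while the lookaround pattern only touches the standalone word:

import re

s = "the remote had a mote of dust."
print(s.replace("mote", "MOTE"))
# the reMOTE had a MOTE of dust.
print(re.sub(r"(?<=[ .])(mote)(?=[ .])", "MOTE", s))
# the remote had a MOTE of dust.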

python word grouping based on words before and after

I am trying to create groups of words. First I count all words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a Python pandas DataFrame structured like this:
Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
import re

answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print(result)
I have been manually creating groups of words, but is there a way of doing it in Python?
So based on the example shown above, the group with word counts would look like this:
group "data":
data: 2
important: 1
value: 1
need: 1
Then, when it goes through the whole file, there would be another group:
group "analytics":
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well, but I can do it manually if that's not possible.
Then I want to establish the 10 most-used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself here, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences.
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
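A quick hypothetical call (added for illustration, not in the original answer) shows both the behavior and a caveat: without \b word boundaries, the pattern also matches inside longer words:

print(remove_words("tomorrow we go", ['we', 'is', 'to']))
# prints 'morrow  go' -- 'to' was removed from inside 'tomorrow' as well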
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the words out of the resulting lists into one big list.
Let's put this all together with your pandas code. pandas isn't my strongest area, so if something looks weird, I may well have made an elementary mistake.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, with key_words as the list of words we want to collect information about:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like:
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
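The same one-liner unpacked, as a sketch, in case the zip(*...) idiom is unfamiliar: most_common(10) returns (word, count) pairs, and zip(*...) transposes them into a tuple of words and a tuple of counts:

word_counts = Counter(w for sentence in sentence_list for w in sentence.split())
top_pairs = word_counts.most_common(10)           # [(word, count), ...]
key_words = [word for word, count in top_pairs]   # just the words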

Python: Auto-correct

I have two files, check.txt and orig.txt. I want to check every word in check.txt and see if it matches any word in orig.txt. If it does match, the code should replace that word with its first match; otherwise it should leave the word as it is. But somehow it's not working as required. Kindly help.
check.txt looks like this:
ukrain
troop
force
and orig.txt looks like:
ukraine cnn should stop pretending & announce: we will not report news while it reflects bad on obama #bostonglobe #crowleycnn #hardball
rt #cbcnews: breaking: .#vice journalist #simonostrovsky, held in #ukraine now free and safe http://t.co/sgxbedktlu http://t.co/jduzlg6jou
russia 'outraged' at deadly shootout in east #ukraine - moscow:... http://t.co/nqim7uk7zg
#groundtroops #russianpresidentvladimirputin
http://pastebin.com/XJeDhY3G
f = open('check.txt', 'r')
orig = open('orig.txt', 'r')
new = open('newfile.txt', 'w')
for word in f:
    for line in orig:
        for word2 in line.split(" "):
            word2 = word2.lower()
            if word in word2:
                word = word2
            else:
                print('not found')
    new.write(word)
There are two problems with your code:
when you loop over the words in f, each word will still have a newline character, so your in check does not work
you want to iterate orig for each of the words from f, but files are iterators and are exhausted after the first word from f
You can fix those by doing word = word.strip() and orig = list(orig), or you can try something like this:
# get all stemmed words
stemmed = [line.strip() for line in f]
# set of lowercased original words
original = set(word.lower() for line in orig for word in line.split())
# map stemmed words to unstemmed words
unstemmed = {word: None for word in stemmed}
# find original words for word stems in map
for stem in unstemmed:
    for word in original:
        if stem in word:
            unstemmed[stem] = word
print(unstemmed)
Or shorter (without that final double loop), using difflib, as suggested in the comments:
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
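get_close_matches returns a list of up to n matches (possibly empty), so each dictionary value here is a list rather than a single word; a quick sketch with hypothetical data, added for illustration:

import difflib

original = ['ukraine', 'troops', 'force']
stemmed = ['ukrain', 'troop', 'xyz']
unstemmed = {word: difflib.get_close_matches(word, original, 1) for word in stemmed}
print(unstemmed)  # e.g. {'ukrain': ['ukraine'], 'troop': ['troops'], 'xyz': []}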
Also, remember to close your files, or use the with keyword to close them automatically.
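A minimal sketch of the with form, reusing the file names from the question (both files are closed automatically when the block exits):

with open('check.txt') as f, open('orig.txt') as orig:
    stemmed = [line.strip() for line in f]
    original = set(word.lower() for line in orig for word in line.split())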
