Automatically separating words into letters? - python

So I have this code:
import sys ## The 'sys' module lets us read command line arguments
words1 = open(sys.argv[2],'r') ##sys.argv[2] is your dictionary text file
words = str((words1.read()))
def main():
    # Get the dictionary to search
    if (len(sys.argv) != 3) :
        print("Proper format: python filename.py scrambledword filename.txt")
        exit(1) ## the non-zero return code indicates an error
    scrambled = sys.argv[1]
    print(sys.argv[1])
    unscrambled = sorted(scrambled)
    print(unscrambled)
    for line in words:
        print(line)
When I print words, it prints the words in the dictionary, one word at a time, which is great. But as soon as I try to do anything with those words, as in my last two lines, it separates them into letters and prints one letter per line. Is there any way to keep the words together? My end goal is to do ordered = sorted(line) and then, if ordered == unscrambled, print the original word from the dictionary.

Your words is an instance of str. You should use split to iterate over words:
for word in words.split():
    print(word)

A for-loop takes one element at a time from the "sequence" you pass it. You have read the contents of your file into a single string, so Python treats it as a sequence of characters. What you need to do is convert it into a list yourself, by splitting it into strings of whatever size you want:
lines = words.splitlines() # Makes a list of lines
for line in lines:
    ....
Or
wordlist = words.split() # Makes a list of "words", by splitting at whitespace
for word in wordlist:
    ....
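Applied to the goal stated in the question (compare sorted letters against the scrambled word), a minimal sketch might look like this. The function name find_anagrams and the sample text are made up for illustration; the real script would pass in the string read from the dictionary file.

```python
def find_anagrams(scrambled, text):
    """Return the words in text that use exactly the letters of scrambled."""
    target = sorted(scrambled)
    # split() yields whole words; looping over text itself would yield single characters
    return [word for word in text.split() if sorted(word) == target]


# Hypothetical dictionary contents standing in for words1.read()
sample_text = "listen\nsilent\nenlist\ngoogle\n"
print(find_anagrams("tinsel", sample_text))
```

Comparing sorted(word) with sorted(scrambled) works because sorted returns a list of characters, and two words are anagrams exactly when those lists are equal.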

Related

What's the difference between "word = line.split()" and "for word in line.split()"?

I'm new to programming and this is my first question here. I feel it might be a very silly beginner doubt, but here goes.
On multiple occasions, I've typed out the whole code right except for this one line, on which I make the same mistake every time.
Could someone please explain to me what the computer understands when I type each of the following lines, and what the difference is?
word = line.split()
for word in line.split()
The difference between the expected and my actual output is just because I typed the former instead of the latter.
word = line.split()
This will split the line variable (using the default "any amount of white space" separator) and give you back a list of words built from it. You then bind the variable word to that list.
On the other hand:
for word in line.split()
initially does the same thing the previous command did (splitting the line to get a list) but, instead of binding the word variable to that entire list, it iterates over the list, binding word to each string in the list in turn.
The following transcript hopefully makes this clearer:
>>> line = 'pax is good-looking'
>>> word = line.split() ; print(word)
['pax', 'is', 'good-looking']
>>> for word in line.split(): print(word)
...
pax
is
good-looking
split() is a string method that breaks a string into a list of substrings.
word = line.split() returns a list made by splitting line at runs of whitespace (the default separator) and binds word to that whole list.
for word in line.split() iterates over that list, binding word to one string at a time.
Here is an example for clarification.
line = "Stackoverflow is amazing"
word = line.split()
print(word)
# prints ['Stackoverflow', 'is', 'amazing']
for word in line.split():
    print(word)
# prints:
# Stackoverflow
# is
# amazing

Duplicates within a sentence of a text file in Python

Hi, I want to write code that reads a text file and identifies the sentences in which a word occurs more than once. I was thinking of putting each sentence of the file in a dictionary and finding which sentences have duplicates. Since I am new to Python, I need some help writing the code.
This is what I have so far:
def Sentences():
    def Strings():
        l = string.split('.')
        for x in range(len(l)):
            print('Sentence', x + 1, ': ', l[x])
        return
    text = open('Rand article.txt', 'r')
    string = text.read()
    Strings()
    return
The code above converts files to sentences.
Suppose you have a file where each line is a sentence, e.g. "sentences.txt":
I contain unique words.
This sentence repeats repeats a word.
The strategy could be to split the sentence into its constituent words, then use set to find the unique words in the sentence. If the resulting set is shorter than the list of all words, then you know that the sentence contains at least one duplicated word:
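sentences_with_dups = []
with open("sentences.txt") as fh:
    for sentence in fh:
        # split() with no argument also drops the trailing newline,
        # so the last word compares equal to earlier copies of itself
        words = sentence.split()
        if len(set(words)) != len(words):
            sentences_with_dups.append(sentence)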
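If it is also useful to know which words repeat, and to treat "The" and "the" or "word" and "word." as the same word, one possible variation uses collections.Counter. This is only a sketch: the function name repeated_words and the sample sentences are assumptions for illustration, not part of the original answer.

```python
import string
from collections import Counter


def repeated_words(sentence):
    """Return the words that occur more than once, ignoring case and edge punctuation."""
    cleaned = (w.strip(string.punctuation).lower() for w in sentence.split())
    counts = Counter(w for w in cleaned if w)  # skip tokens that were only punctuation
    return [word for word, n in counts.items() if n > 1]


# Hypothetical sample sentences; a real script would read them from a file
for sentence in ["I contain unique words.", "This sentence repeats repeats a word."]:
    print(sentence, "->", repeated_words(sentence))
```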

Call multiple functions inside list comprehension

I'm trying to read a text file and return its contents as a list of strings, one per word, lower-cased and with punctuation removed.
I've created the following code, but it doesn't split the text into individual words. Also, is it possible to add .lower() into the comprehension?
def read_words(words_file):
    """Turns file into a list of strings, lower case, and no punctuation"""
    return [word for line in open(words_file, 'r') for word in line.split(string.punctuation)]
Yes, you can add .lower() to the comprehension. It should be applied to word. Your code probably does not split each word because it passes string.punctuation as the separator: split then looks for that entire 32-character string, which almost never occurs in real text. If you are just trying to split on whitespace, calling .split() without arguments will suffice.
Here's a list comprehension that should do everything you want (Python 3; in Python 2 the call was word.translate(None, string.punctuation)):
[word.translate(str.maketrans('', '', string.punctuation)).lower() for line in open(words_file) for word in line.split()]
You need to split on whitespace (the default) to separate the words. Then you can transform each resulting string to remove the punctuation and make it lowercase.
import string
def read_words(words_file):
    """Turns file into a list of strings, lower case, and no punctuation"""
    with open(words_file, 'r') as f:
        lowered_text = f.read().lower()
    return ["".join(char for char in word if char not in string.punctuation) for word in lowered_text.split()]
Use a mapping to translate the words and use it in a generator function.
import string
def words(filepath):
    '''Yield words from filepath with punctuation and whitespace removed.'''
    # map uppercase to lowercase and punctuation/whitespace to an empty string
    t = str.maketrans(string.ascii_uppercase,
                      string.ascii_lowercase,
                      string.punctuation + string.whitespace)
    with open(filepath) as f:
        for line in f:
            for word in line.strip().split():
                word = word.translate(t)
                # don't yield empty strings
                if word:
                    yield word
Usage
for word in words('foo.txt'):
    print(word)
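To see what the three-argument form of str.maketrans does without needing a file, here is a small sketch on an in-memory string. The sample sentence is made up, and the table deletes only punctuation, since split() has already removed the whitespace.

```python
import string

# Three-argument form: map each character of the first string to the matching
# character of the second, and delete every character found in the third.
table = str.maketrans(string.ascii_uppercase,
                      string.ascii_lowercase,
                      string.punctuation)

sample = "Hello, World! It's 9 o'clock."  # made-up input
cleaned = [word.translate(table) for word in sample.split()]
print(cleaned)
```

Note that translate removes punctuation anywhere in the word, so "It's" becomes "its", whereas strip(string.punctuation) would only trim the ends.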

How to find length of words in a file in Python

I am fairly new to files in Python and want to find the words in a file that have, say, 8 letters in them, print them, and keep a running total of how many there are. Can you search through a file as if it were one very large string, or is there a specific way it has to be done?
You could use Python's Counter for doing this:
from collections import Counter
import re
with open('input.txt') as f_input:
    text = f_input.read().lower()
    words = re.findall(r'\b(\w+)\b', text)
    word_counts = Counter(w for w in words if len(w) == 8)
for word, count in word_counts.items():
    print(word, count)
This works as follows:
It reads in a file called input.txt, as one very long string.
It then converts it all to lowercase to make sure the same words with different case are counted as the same word.
It uses a regular expression to split all of the text into a list of words.
It uses a generator expression to feed every word that is 8 characters long into a Counter, which tallies how often each one occurs.
It displays all of the matching entries along with the counts.
Try this code, where "eight_l_words" is a list of all the eight letter words and "number_of_8lwords" is the number of eight letter words:
# defines text to be used
your_file = open("file_location", "r")
text = your_file.read()
# divides the text into lines and defines some lists
lines = text.split("\n")
words = []
eight_l_words = []
# iterating through "lines" adding each separate word to the "words" list
for each in lines:
    words += each.split(" ")
# checking to see if each word in the "words" list is 8 chars long, and if so
# appending that word to the "eight_l_words" list
for each in words:
    if len(each) == 8:
        eight_l_words.append(each)
# finding the number of eight letter words
number_of_8lwords = len(eight_l_words)
# displaying results
print(eight_l_words)
print("There are "+str(number_of_8lwords)+" eight letter words")
Running the code with
text = "boomhead shot\nshamwow slapchop"
Yields the results:
['boomhead', 'slapchop']
There are 2 eight letter words
There's a useful post from 2 years ago called "How to split a text file to its words in python?"
It describes splitting each line on whitespace. If you have punctuation such as commas and full stops in there, you'll have to be a bit more sophisticated. There's help in "Python - Split Strings with Multiple Delimiters".
You can use the function len() to get the length of each individual word.
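As a rough sketch of that combination, the snippet below splits on several delimiters at once with re.split and then filters by len(). The function name words_of_length, the chosen delimiter set, and the sample text are all assumptions for illustration.

```python
import re


def words_of_length(text, n):
    """Return the words in text that are exactly n characters long."""
    # Split on runs of whitespace, commas, full stops, semicolons, ! and ?
    pieces = re.split(r"[\s,.;!?]+", text)
    return [p for p in pieces if len(p) == n]


sample = "boomhead shot, shamwow. slapchop!"  # made-up input with punctuation
matches = words_of_length(sample, 8)
print(matches)
print("There are", len(matches), "eight letter words")
```

re.split can leave an empty string at the end when the text finishes with a delimiter; the length filter discards it here because n is greater than zero.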

Empty string need not be counted in python (repeat)

The purpose of the program is to count each word in a passage and note the frequency. Unfortunately, the program is also counting empty strings. My code is:
def build_map( in_file, word_map ):
    # Receives an input file and an empty dictionary
    for line in in_file:
        # Splits each line at blank space and turns it into
        # a list.
        word_list = line.split()
        for word in word_list:
            word= word.strip().strip(string.punctuation).lower()#program revised
            if word!='':
                # Within the word_list, we are stripping empty space
                # on both sides of each word and also stripping any
                # punctuation on both side of each word in the list.
                # Then, it turns each word to the lower case to avoid
                # counting 'THE' and 'the' as two different words.
                add_word( word_map, word)
I would really appreciate it if someone could take a look at the code and explain why it is still counting empty strings. Other than that, everything else is working fine. Thanks. (I have since modified the code and it is working fine now.)
You're checking if the word is empty and then you're stripping the whitespace and punctuation. Reverse the order of these operations.
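A self-contained sketch of the "strip first, then check" order follows. The question does not show add_word, so the version here is a hypothetical stand-in that simply increments a dictionary count, and the input lines are made up.

```python
import string


def add_word(word_map, word):
    """Hypothetical helper (not shown in the question): increment the count for word."""
    word_map[word] = word_map.get(word, 0) + 1


def build_map(lines, word_map):
    for line in lines:
        for word in line.split():
            # Clean first, then test: a token such as "--" only becomes "" after stripping
            word = word.strip(string.punctuation).lower()
            if word:
                add_word(word_map, word)


counts = {}
build_map(["The cat -- the CAT!", "  ...  "], counts)
print(counts)
```

Tokens made up entirely of punctuation, such as "--" or "...", are not empty when split() returns them, which is why an emptiness check placed before the stripping lets them through.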
