Call multiple functions inside list comprehension - python

I'm trying to read a text file and return its contents as a list of strings, one per word, lower-cased and with punctuation removed.
I've written the following code, but it doesn't split the text into individual words. Also, is it possible to add .lower() to the comprehension?
def read_words(words_file):
    """Turns file into a list of strings, lower case, and no punctuation"""
    return [word for line in open(words_file, 'r') for word in line.split(string.punctuation)]

Yes, you can add .lower() to the comprehension; apply it to word, the expression at the front. Your code does not split the text into words because str.split treats its argument as one literal delimiter, so split(string.punctuation) only splits where the entire punctuation string occurs. If you just want to split on whitespace, calling .split() without arguments will suffice.
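To see why the original split returned whole lines, here is a small sketch. The sample sentence is made up for illustration and is not from the question.

```python
import string

line = "Hello, world! How are you?"  # hypothetical sample line

# split(sep) looks for sep as one literal substring. The full punctuation
# string never occurs in the line, so nothing is split.
unsplit = line.split(string.punctuation)
print(unsplit)

# With no argument, split() breaks on any run of whitespace.
words = line.split()
print(words)
```

The punctuation is still attached to the words after split(), which is why a separate removal step is needed.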

Here's a list comprehension that should do everything you want (Python 3; in Python 2 the call was word.translate(None, string.punctuation)):
[word.translate(str.maketrans('', '', string.punctuation)).lower() for line in open(words_file) for word in line.split()]
You need to split on whitespace (the default) to separate the words. Then you can transform each resulting string to remove the punctuation and make it lowercase.

import string
def read_words(words_file):
    """Turns file into a list of strings, lower case, and no punctuation"""
    with open(words_file, 'r') as f:
        lowered_text = f.read().lower()
    return ["".join(char for char in word if char not in string.punctuation) for word in lowered_text.split()]

Use a mapping to translate the words and use it in a generator function.
import string
def words(filepath):
    '''Yield words from filepath with punctuation and whitespace removed.'''
    # map uppercase to lowercase and punctuation/whitespace to an empty string
    t = str.maketrans(string.ascii_uppercase,
                      string.ascii_lowercase,
                      string.punctuation + string.whitespace)
    with open(filepath) as f:
        for line in f:
            for word in line.strip().split():
                word = word.translate(t)
                # don't yield empty strings
                if word:
                    yield word
Usage:
for word in words('foo.txt'):
    print(word)
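As a quick check of what the translation table in the generator does, here is a minimal sketch applied to a few made-up strings (the samples are assumptions, not from the answer).

```python
import string

# Same table as in the generator: upper case maps to lower case, and the
# third argument lists characters to delete (punctuation and whitespace).
t = str.maketrans(string.ascii_uppercase,
                  string.ascii_lowercase,
                  string.punctuation + string.whitespace)

samples = ["Don't", "STOP!", "--"]  # hypothetical inputs for illustration
cleaned = [s.translate(t) for s in samples]
print(cleaned)  # ['dont', 'stop', '']
```

The last sample shows why the generator checks for empty strings before yielding: a token made only of punctuation translates to ''.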

Related

how to use strip() in python

I have a text file test.txt which contains 'a 2hello 3fox 2hen 1dog'.
I want to read the file, add all the items to a list, and then strip the integers so that the list looks like this: 'a hello fox hen dog'
I tried this, but my code is not working. The result is ['a 2hello 3fox 2hen 1dog']. Thanks.
newList = []
filename = input("Enter a file to read: ")
openfile = open(filename,'r')
for word in openfile:
    newList.append(word)
for item in newList:
    item.strip("1")
    item.strip("2")
    item.strip("3")
print(newList)
openfile.close()
From the Python docs:
str.strip([chars]): Return a copy of the string with the leading and
trailing characters removed. The chars argument is a string specifying
the set of characters to be removed. If omitted or None, the chars
argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped:
strip() does not modify the string; it returns a copy with the given characters removed from both ends.
>>> text = '132abcd13232111'
>>> text.strip('123')
'abcd'
>>> text
'132abcd13232111'
You can try:
out_put = []
for item in newList:
    out_put.append(item.strip("123"))
If you want to remove every 1, 2, and 3, not just those at the ends, use the regular expression function re.sub:
import re
newList = [re.sub('[123]', '', word) for word in openfile]
Note: This removes every 1, 2, and 3 from each line.
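The difference between the two approaches shows up when a digit sits in the middle of the string. A minimal sketch, using a made-up sample value:

```python
import re

text = "1a2b3"  # hypothetical sample with a digit in the middle

ends_only = text.strip("123")           # removes 1, 2, 3 from both ends only
everywhere = re.sub("[123]", "", text)  # removes every 1, 2, 3

print(ends_only)   # a2b
print(everywhere)  # ab
```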
Pointers:
strip returns a new string, so you need to assign that to something. (better yet, just use a list comprehension)
Iterating over a file object gives you lines, not words;
so instead you can read the whole thing then split on spaces.
The with statement saves you from having to call close manually.
strip accepts multiple characters, so you don't need to call it three times.
Code:
filename = input("Enter a file to read: ")
with open(filename, 'r') as openfile:
    new_list = [word.strip('123') for word in openfile.read().split()]
print(new_list)
This will give you a list that looks like ['a', 'hello', 'fox', 'hen', 'dog']
If you want to turn it back into a string, you can use ' '.join(new_list)
Python has three strip methods: strip, lstrip, and rstrip. They remove the specified characters from both ends, the left end, or the right end of a string. In your case you could use lstrip or just strip:
s = 'a 2hello 3fox 2hen 1dog'
' '.join([word.strip('0123456789') for word in s.split()])
Output:
'a hello fox hen dog'
A function in Python is called in this way:
result = function(arguments...)
This calls function with the arguments and stores the result in result.
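If you discard the result of the call, as you do in your case, it is lost.
Another way is to use the result directly as an argument:
l = []
for x in range(5):
    l.append(" something ".strip())
strip() with no arguments removes the leading and trailing whitespace from each string before it is appended.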

Automatically separating words into letters?

So I have this code:
import sys ## The 'sys' module lets us read command line arguments
words1 = open(sys.argv[2],'r') ##sys.argv[2] is your dictionary text file
words = str((words1.read()))
def main():
    # Get the dictionary to search
    if (len(sys.argv) != 3) :
        print("Proper format: python filename.py scrambledword filename.txt")
        exit(1) ## the non-zero return code indicates an error
    scrambled = sys.argv[1]
    print(sys.argv[1])
    unscrambled = sorted(scrambled)
    print(unscrambled)
    for line in words:
        print(line)
When I print words, it prints the words in the dictionary, one word per line, which is great. But as soon as I try to do anything with those words, as in my last two lines, it separates them into letters and prints one letter per line. Is there any way to keep the words together? My end goal is to compute ordered = sorted(line) and then, if ordered == unscrambled, print the original word from the dictionary.
Your words is an instance of str, so iterating over it yields single characters. Use split to iterate over words:
for word in words.split():
    print(word)
A for-loop takes one element at a time from the sequence you pass it. You have read the contents of your file into a single string, so Python treats it as a sequence of characters. You need to convert it into a list yourself by splitting it into strings of whatever size you like:
lines = words.splitlines() # Makes a list of lines
for line in lines:
    ....
Or
wordlist = words.split() # Makes a list of "words", by splitting at whitespace
for word in wordlist:
    ....
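Putting this together with the stated end goal (compare sorted(word) against the sorted scrambled word), a minimal sketch might look like the following. The function name unscramble and the sample dictionary text are assumptions made for illustration.

```python
def unscramble(scrambled, dictionary_text):
    """Return dictionary words that use exactly the letters of scrambled."""
    target = sorted(scrambled)
    # split() yields whole words; iterating the raw string would yield letters.
    return [word for word in dictionary_text.split() if sorted(word) == target]

# Hypothetical dictionary contents, one word per line.
sample_dictionary = "tops\nstop\nspot\npost\nstep\n"
matches = unscramble("opts", sample_dictionary)
print(matches)  # ['tops', 'stop', 'spot', 'post']
```

In the original script, dictionary_text would be the string read from the file named in sys.argv[2].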

file.replace('abcd') also replaces 'abcde' How do I only replace exact value?

def censor2(filename):
    infile = open(filename,'r')
    contents = infile.read()
    contentlist = contents.split()
    print (contents)
    print (contentlist)
    for letter in contentlist:
        if len(letter) == 4:
            print (letter)
            contents = contents.replace(letter,'xxxx')
    outfile = open('censor.txt','w')
    outfile.write(contents)
    infile.close()
    outfile.close()
This code works in Python. It accepts a file such as 'example.txt', reads it, loops through it replacing all 4-letter words with the string 'xxxx', and writes the result to a new file (keeping the original format!) called censor.txt.
I used the replace function to find and replace the words. However, when the word 'abcd' is replaced, the next word 'abcde' is turned into 'xxxxe'.
How do I prevent 'abcde' from being changed?
I could not get the examples below to work, but after working with re.sub I found that the following code replaces only 4-letter words and leaves 5-letter words alone.
contents = re.sub(r"(\b)\w{4}(\b)", r"\1xxxx\2", contents)
How about:
re.sub(r'\babcd\b','xxxx',my_text)
This requires a word boundary on either side of abcd.
This is where regular expressions can be helpful. You would want something like this:
import re
...
contents = re.sub(r'\babcd\b', 'xxxx', contents)
....
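The \b is the "word boundary" marker. It matches the empty position between a word character (a letter, digit, or underscore) and a non-word character such as whitespace or punctuation, or the start or end of the string.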
You'll need the r'' style string for the regex pattern so that the backslashes are not treated as escape characters.
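For the original goal of censoring every 4-letter word rather than one fixed word, a sketch along these lines should work. The function name and sample text are made up for illustration.

```python
import re

def censor_four_letter_words(text):
    """Replace every standalone 4-character word with 'xxxx'."""
    # \w also matches digits and underscores, so "1234" counts as a word here.
    return re.sub(r"\b\w{4}\b", "xxxx", text)

sample = "abcd abcde four\nlines stay"  # hypothetical file contents
censored = censor_four_letter_words(sample)
print(censored)
```

Because the substitution runs on the whole string, line breaks and spacing are kept, and 'abcde' is left alone since there is no word boundary after its fourth letter.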

python - remove string from words in an array

#!/usr/bin/python
#this looks for words in dictionary that begin with 'in' and the suffix is a real word
wordlist = [line.strip() for line in open('/usr/share/dict/words')]
newlist = []
for word in wordlist:
    if word.startswith("in"):
        newlist.append(word)
for word in newlist:
    word = word.split('in')
print(newlist)
How would I get the program to remove the string "in" from the start of every word that begins with it? Right now it does not work.
#!/usr/bin/env python
# Look for all words beginning with 'in'
# such that the rest of the word is also
# a valid word.
# load the dictionary:
with open('/usr/share/dict/words') as inf:
    allWords = set(word.strip() for word in inf) # one word per line
Using 'with' ensures the file is always properly closed.
I make allWords a set; this makes searching it an O(1) operation on average.
Then we can do
# get the remainder of all words beginning with 'in'
inWords = [word[2:] for word in allWords if word.startswith("in")]
# filter to get just those which are valid words
inWords = [word for word in inWords if word in allWords]
or combine it into a single statement, like
inWords = [word for word in (word[2:] for word in allWords if word.startswith("in")) if word in allWords]
Doing it the second way also lets us use a generator for the inner loop, reducing memory requirements.
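Here is the same idea run on a tiny in-memory word set instead of the system dictionary, as a sketch. The words are made up for illustration.

```python
# Hypothetical stand-in for the dictionary file.
all_words = {"in", "inside", "side", "input", "put", "inx", "into", "to", "tin"}

# Keep the remainder of each 'in' word only if the remainder is itself a word.
in_words = sorted(
    rest
    for rest in (word[2:] for word in all_words if word.startswith("in"))
    if rest in all_words
)
print(in_words)  # ['put', 'side', 'to']
```

Note that "in" itself produces an empty remainder, which is dropped because the empty string is not in the set.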
split() returns a list of the segments obtained by splitting. Furthermore,
word = word.split('in')
doesn't modify your list; it only rebinds the loop variable.
Try replacing your second loop with this:
for i in range(len(newlist)):
    word = newlist[i].split('in', 1)
    newlist[i] = word[1]
It's difficult to tell from your question what you want in newlist. If you just want the words that start with "in", with "in" removed, you can use a slice:
newlist = [word[2:] for word in wordlist if word.startswith('in')]
If you want the words that start with "in" and are still in wordlist once "in" has been removed (is that what you meant by "real" in your comment?), then you need something a little different:
newlist = [word for word in wordlist if word.startswith('in') and word[2:] in wordlist]
Note that in Python we use a list, not an "array".
Suppose that wordlist is the list of words. The following code should do the trick:
for i in range(len(wordlist)):
    if wordlist[i].startswith("in"):
        wordlist[i] = wordlist[i][2:]
If the list is very large, note that range(len(wordlist)) builds a full list of indices in Python 2; xrange or a while loop with a counter avoids that. In Python 3, range is already lazy.

manipulating list items python

line = "english: while french: pendant que spanish: mientras german: während "
words = line.split('\t')
for each in words:
    each = each.rstrip()
print(words)
The string in 'line' is tab-delimited but also has a single space after each translated word, so while split returns the list I'm after, each item annoyingly has a space at the end.
In the loop I'm trying to go through the list and remove any trailing whitespace from the strings, but it doesn't seem to work. Suggestions?
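Just line.split() gives you a list without trailing whitespace, but it splits on every run of whitespace, so a multi-word entry such as "pendant que" is broken in two.
Updating each inside the loop does not change the words list.
It should be done like this:
for i in range(len(words)):
    words[i]=words[i].rstrip()
Or
words=map(str.rstrip,words)
See the map docs for details on map. In Python 3, map returns an iterator, so wrap it in list(...) if you need a list.
Or a one-liner with a list comprehension:
words=[x.rstrip() for x in line.split("\t")]
Or with regex .findall, which needs import re and still keeps the trailing space, so apply rstrip to each result:
words=[x.rstrip() for x in re.findall("[^\t]+",line)]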
words = line.split('\t')
words = [ i.rstrip() for i in words ]
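A short sketch of why the loop in the question has no effect while the comprehension works. The exact tab placement in the sample line is an assumption, based on the description of entries separated by tabs with one trailing space each.

```python
# Assumed layout: entries separated by tabs, each followed by one space.
line = "english: while \tfrench: pendant que \tspanish: mientras \tgerman: während "

words = line.split("\t")
for each in words:
    each = each.rstrip()  # rebinds the name 'each'; the list is unchanged
unchanged = list(words)

stripped = [w.rstrip() for w in words]  # builds a new, cleaned list
print(unchanged[0] == "english: while ")  # True: the trailing space is still there
print(stripped)
```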
You can use a regular expression:
import re
words = re.split(r' *\t| +$', line)[:-1]
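With this, the delimiter is either any run of spaces followed by a tab, or one or more spaces at the end of the line; the * operator allows several spaces before the tab, or none at all. Assuming the line ends with a space, the trailing [:-1] drops the empty string left after the final delimiter.
EDIT: Fixed after Roger Pate pointed out an error.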
