Python parsing to lower case and removing punctuation not functioning properly

import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
with open('data10.txt', 'r') as f:
    for line in f:
        for word in line.split():
            w = f.read().translate(remove)
            print(word.lower())
I have this code here and for some reason, the translate(remove) is leaving a good amount of punctuation in the parsed file.

Why are you reading the whole file within the for loop?
Try this:
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
with open('data10.txt', 'r') as f:
    for line in f:
        for word in line.split():
            word = word.translate(remove)
            print(word.lower())
This will print out the lower-cased and stripped words, one per line. Not really sure if that's what you want.
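
For reference, here is a minimal Python 3 sketch of the same idea that runs without a data file. The sample sentence and the helper name clean_words are assumptions made for illustration; only the whitespace split, str.translate, and str.lower come from the answer above.

```python
import string

# Translation table mapping every ASCII punctuation character to None (deleted).
REMOVE = dict.fromkeys(map(ord, string.punctuation))


def clean_words(text):
    """Return the lower-cased words of text with ASCII punctuation removed."""
    words = []
    for word in text.split():
        cleaned = word.translate(REMOVE).lower()
        if cleaned:  # skip tokens that consisted of punctuation only
            words.append(cleaned)
    return words


sample = "Hello, World! It's a test -- really."
print(clean_words(sample))
```

Note that string.punctuation covers only ASCII punctuation, so typographic quotes or dashes in a file would pass through this table unchanged.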

Related

stripping \n from every element of a list

I am completely new to Python and trying to build a simple hangman game. I created a .txt file with all the sample words and read it into Python. When I print them, however, they all have this format: ['exampleword\n'], ['exampleword2\n']. I want to get rid of the trailing \n. I tried most of the suggestions from this thread: How to remove '\n' from end of strings inside a list?, but they didn't work.
woerter = open("woerter.txt", "r")
wortliste = [line.split("\t") for line in woerter.readlines()]
print(wortliste)
I have Python 3.8.2 installed. Any help is greatly appreciated :)

Try this:
woerter = open("woerter.txt", "r")
wortliste = [line.rstrip() for line in woerter.readlines()]
print(wortliste)

There is no reason to use readlines; you can just iterate over the file directly:
with open('woerter.txt') as f:
    wortlist = [l.strip() for l in f]

You can use str.splitlines().
For example:
string = "this\n"
string += "has\n"
string += "multiple\n"
string += "lines"
words = string.splitlines()
print(words)
# Outputs: ['this', 'has', 'multiple', 'lines']
with open("woerter.txt", 'r') as f:
    wordlist = f.read().splitlines()
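
The approaches suggested above can be compared side by side. This is a small sketch that uses io.StringIO as a stand-in for woerter.txt (the sample words are made up), so it runs without the file.

```python
import io

# Stand-in for the contents of woerter.txt; the words are invented examples.
data = "apfel\nbanane\nkirsche\n"

# 1. rstrip("\n") removes only the trailing newline of each line.
with io.StringIO(data) as f:
    by_rstrip = [line.rstrip("\n") for line in f]

# 2. strip() removes all leading and trailing whitespace, not just the newline.
with io.StringIO(data) as f:
    by_strip = [line.strip() for line in f]

# 3. splitlines() splits the whole text on line boundaries and drops them.
by_splitlines = data.splitlines()

print(by_rstrip)
```

For a plain word list all three give the same result; they differ only when a line carries other leading or trailing whitespace that should be kept.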

Stripping numbers dates until first alphabet is found from string

I am looking for an efficient way to strip numbers, dates, or any other characters from the end of a string, back to the last letter.
string - '12.abd23yahoo 04/44 231'
Output - '12.abd23yahoo'
line_inp = "12.abd23yahoo 04/44 231"
line_out = line_inp.rstrip('0123456789./')
This rstrip() call doesn't work as expected; I get '12.abd23yahoo 04/44 ' instead.
I am trying the code below, and it doesn't seem to work either.
for fname in filenames:
    with open(fname) as infile:
        for line in infile:
            outfile.write(line.rstrip('0123456789./ '))

You need to strip spaces too:
line_out = line_inp.rstrip('0123456789./ ')
Demo:
>>> line_inp = "12.abd23yahoo 04/44 231"
>>> line_inp.rstrip('0123456789./ ')
'12.abd23yahoo'

You need to strip the newlines too and add them back before you write:
for fname in filenames:
    with open(fname) as infile:
        outfile.writelines(line.rstrip('0123456789./ \n') + "\n"
                           for line in infile)
If the format is always the same you can just split:
with open(fname) as infile:
    outfile.writelines(line.split(None, 1)[0] + "\n"
                       for line in infile)

Here's a solution using a regular expression:
import re
line_inp = "12.abd23yahoo 04/44 231"
r = re.compile('^(.*[a-zA-Z])')
m = re.match(r, line_inp)
line_out = m.group(0) # 12.abd23yahoo
The regular expression matches greedily from the start of the string up to and including the last letter.
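
One caveat with the regular expression: re.match returns None for a line that contains no letter, so m.group(0) would raise AttributeError. A hedged sketch of a wrapper follows; the function name strip_after_last_letter and the choice to return an empty string in that case are assumptions, not part of the answer.

```python
import re

# Greedy match from the start of the string through the last ASCII letter.
_UP_TO_LAST_LETTER = re.compile(r"^(.*[a-zA-Z])")


def strip_after_last_letter(line):
    """Drop everything after the final ASCII letter in line."""
    m = _UP_TO_LAST_LETTER.match(line)
    # No letter at all: return an empty string (an assumed policy).
    return m.group(1) if m else ""


print(strip_after_last_letter("12.abd23yahoo 04/44 231"))
print(repr(strip_after_last_letter("04/44 231")))
```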

trying to print to a text file with words that only have two or more occurring vowels

import re
twovowels=re.compile(r".*[aeiou].*[aeiou].*", re.I)
nonword=re.compile(r"\W+", re.U)
text_file = open("twoVoweledWordList.txt", "w")
file = open("FirstMondayArticle.html","r")
for line in file:
    for word in nonword.split(line):
        if twovowels.match(word): print word
        text_file.write('\n' + word)
text_file.close()
file.close()
This is my Python code. I am trying to print only the words that have two or more vowels. When I run this code, it writes everything to my text file, including the words and numbers that do not have vowels, while the Python shell shows only the words that have two or more vowels. How do I change that?

You can remove the vowels with str.translate and compare lengths. If the length difference after removing them is > 1, you have at least two vowels:
with open("FirstMondayArticle.html") as f, open("twoVoweledWordList.txt", "w") as out:
    for line in f:
        for word in line.split():
            # Python 2 form of str.translate; in Python 3 use
            # word.lower().translate(str.maketrans("", "", "aeiou")) instead.
            if len(word) - len(word.lower().translate(None, "aeiou")) > 1:
                out.write("{}\n".format(word.rstrip()))
In your own code you always write the word, because text_file.write('\n' + word) is outside the if block. It is a good lesson in why you should not put multiple statements on one line. Your code is equivalent to:
if twovowels.match(word):
    print(word)
text_file.write('\n' + word) # <- outside the if
Your code with the write inside the if block, some changes to your naming convention, spaces around assignments, and with, which closes your files for you:
import re
with open("FirstMondayArticle.html") as f, open("twoVoweledWordList.txt", "w") as out:
    two_vowels = re.compile(r".*[aeiou].*[aeiou].*", re.I)
    non_word = re.compile(r"\W+", re.U)
    for line in f:
        for word in non_word.split(line):
            if two_vowels.match(word):
                print(word)
                out.write("{}\n".format(word.rstrip()))

Because the write call is outside the if block. This is what the code lines should look like:
for line in file:
    for word in nonword.split(line):
        if twovowels.match(word):
            print word
            text_file.write('\n' + word)
text_file.close()
file.close()
Here is a sample program on Tutorialspoint showing the code above is correct.

I would suggest an alternate, and simpler, method, not using re:
def twovowels(word):
    count = 0
    for char in word.lower():
        if char in "aeiou":
            count = count + 1
            if count > 1:
                return True
    return False

# The backslash continues the with statement onto the next line.
with open("FirstMondayArticle.html") as file, \
        open("twoVoweledWordList.txt", "w") as text_file:
    for line in file:
        for word in line.split():
            if twovowels(word):
                print(word)
                text_file.write(word + "\n")
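
The vowel count can also be written more compactly. The following Python 3 sketch is one possible variant, not the method of any answer above; the helper name has_two_vowels and the sample sentence are assumptions, and it filters an in-memory string instead of the HTML file.

```python
import re

VOWELS = set("aeiou")


def has_two_vowels(word):
    """Return True if word contains at least two vowels (a, e, i, o, u)."""
    return sum(ch in VOWELS for ch in word.lower()) >= 2


sample = "The sky is grey, but the audio book was quite good."
# Split on runs of non-word characters, as the question does, and drop empties.
words = [w for w in re.split(r"\W+", sample) if w]
print([w for w in words if has_two_vowels(w)])
```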

Reading/Writing files in Python and using Dictionary for WordCount list

I'm doing a Python exercise that has to open and read the text file of Alice in Wonderland, populate a dictionary with a word count, and then write the result out to a file. For the life of me, it won't work. Any tips?
f = open('/Users/yongcho822/Desktop/alice.txt', 'r')
count = {}
for line in f:
    for word in line.split():
        # remove punctuation
        word = word.replace('_', '').replace('"', '').replace(',', '').replace('.', '')
        word = word.replace('-', '').replace('?', '').replace('!', '').replace("'", "")
        word = word.replace('(', '').replace(')', '').replace(':', '').replace('[', '')
        word = word.replace(']', '').replace(';', '')
        # ignore case
        word = word.lower()
        # ignore numbers
        if word.isalpha():
            if word in count:
                count[word] = count[word] + 1
            else:
                count[word] = 1
keys = list(count.keys())
keys.sort()
# save the word count analysis to a file
out = open('/Users/yongcho822/Desktop/alice.txt', 'w')
for word in keys:
    out.write(word + " " + str(count[word]))
    out.write('\n')
print("The word 'alice' appears " + str(count['alice']) + " times in the book.")

Are you certain you want to be writing to a file called alice.txt, when that is also the name of your input file?
Check your input file (Desktop/alice.txt) to make sure you didn't accidentally overwrite it! Use a different name, e.g. output.txt for your output file.
The other minor issue is the indentation of your second for loop, but I think that's just formatting of your question.
Otherwise, your code worked for me. I found 167 x Alice, in the text I used.

An easier way to get rid of the punctuation would be the following (this is the Python 2 form of str.translate, and it needs import string):
word = word.translate(None, string.punctuation)
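
The two-argument translate call above works only on Python 2 strings. In Python 3, str.translate takes a single table, which str.maketrans can build. Below is a hedged sketch of the word count in that style; the function name count_words, the use of collections.Counter, and the two sample lines are assumptions for illustration, and the sketch reads from a list instead of alice.txt.

```python
import string
from collections import Counter

# Python 3: build a table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Count alphabetic words in an iterable of text lines, ignoring case."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            word = word.translate(STRIP_PUNCT).lower()
            if word.isalpha():  # skips numbers and empty strings
                counts[word] += 1
    return counts


# Invented sample lines standing in for the book text.
sample = ["Alice was beginning to get very tired.",
          '"Oh dear!" said Alice, twice in 1865.']
counts = count_words(sample)
print(counts["alice"])
```

Passing an open file object to count_words would work the same way, since iterating over a file yields its lines.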

How to save indention format of file in Python

I am saving all the words from a file like so:
sentence = " "
fileName = sys.argv[1]
fileIn = open(sys.argv[1],"r")
for line in open(sys.argv[1]):
    for word in line.split(" "):
        sentence += word
Everything works okay when outputting it, except the formatting.
I am moving source code. Is there any way I can preserve the indentation?

Since you state that you want to move source code files, why not just copy or move them?
import shutil
shutil.move(src, dest)
If you need the contents, reading the source file with
fh = open("yourfilename", "r")
content = fh.read()
should load it as it is (with indentation), shouldn't it?

When you invoke line.split(), you remove all leading spaces.
What's wrong with just reading the file into a single string?
textWithIndentation = open(sys.argv[1], "r").read()

Split removes all spaces:
>>> a="   a b   c"
>>> a.split(" ")
['', '', '', 'a', 'b', '', '', 'c']
As you can see, the resulting list doesn't contain any spaces anymore. But you can see these strange empty strings (''). They mark where leading or consecutive spaces were. To reverse the effect of split, use " ".join():
>>> l=a.split(" ")
>>> " ".join(l)
'   a b   c'
or in your code:
sentence += " " + word
Or you can use a regular expression to get all spaces at the start of the line:
>>> import re
>>> re.match(r'^\s*', "   a b   c").group(0)
'   '
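
If the words of each line still need to be processed, the leading whitespace can be split off first and put back afterwards. This is a minimal sketch of that idea; the helper name split_indent and the sample source text are assumptions made for illustration.

```python
import re


def split_indent(line):
    """Return (indent, rest) where indent is the leading spaces/tabs of line."""
    indent = re.match(r"[ \t]*", line).group(0)
    return indent, line[len(indent):]


source = "def f():\n    if x:\n        return 1\n"
rebuilt = ""
for line in source.splitlines(keepends=True):
    indent, rest = split_indent(line)
    # Process rest word by word here if needed, then put the indent back.
    rebuilt += indent + rest

print(rebuilt == source)
```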
