Preventing removal of linebreaks - python

I have a function that replaces offensive words with a star, but in running text through this, it strips out linebreaks. Any thoughts on how to prevent this?
def replace_words(text, exclude_list):
    words = text.split()
    for i in range(len(words)):
        if words[i].lower() in exclude_list:
            words[i] = "*"
    return ' '.join(words)

Don't call .split() with no argument on the entire input string. It splits on any whitespace, including line breaks, so you lose the information about where the line breaks belong in the result string.
You could first split the input into lines and then process each line separately, in the same way as you now process the whole input.

Credit to mkrieger1:
def replace_words(text, exclude_list):
    paragraphs = text.split('\n')
    new_paragraphs = []
    for p in paragraphs:
        words = p.split()
        for i in range(len(words)):
            if words[i].lower() in exclude_list:
                words[i] = "*"
        new_paragraphs.append(' '.join(words))
    # rejoin with line breaks; joining avoids a stray leading "\n"
    return '\n'.join(new_paragraphs)

You can split on '\n' to keep track of the line breaks, then use '\n' to put them back when joining.
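Another option, sketched here as an assumption rather than taken from the answers above, is to leave the whitespace alone and substitute only the words. re.sub with a callback replaces each run of non-whitespace characters, so line breaks, tabs and repeated spaces pass through unchanged. The helper name mask is made up for this illustration.

```python
import re

def replace_words(text, exclude_list):
    # Lower-case the excluded words once so lookups are case-insensitive.
    excluded = {w.lower() for w in exclude_list}

    def mask(match):
        # match.group(0) is one run of non-whitespace characters,
        # i.e. the same tokens that text.split() would produce.
        word = match.group(0)
        return "*" if word.lower() in excluded else word

    # Whitespace between the tokens is never touched by the substitution.
    return re.sub(r"\S+", mask, text)

print(replace_words("a Bad\nword  here", ["bad"]))
```

Like the original function, this only matches whole whitespace-separated tokens, so "bad," with a trailing comma would not be replaced.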

Related

Why does str.capitalize() not work as I expect?

Please, let me know if I'm not providing enough information. The goal of the program is to capitalize the first letter of every sentence.
usr_str = input()
def fix_capitalization(usr_str):
    list_of_sentences = usr_str.split(".")
    list_of_sentences.pop() #remove last element: ""
    new_str = ''
    for sentence in list_of_sentences:
        new_str += sentence.capitalize() + "."
    return new_str
print(fix_capitalization(usr_str))
For instance, if I input "hi. hello. hey." I expect it to output "Hi. Hello. Hey." but instead, it outputs "Hi. hello. hey."
An alternative would be to build a list of strings then concatenate them:
def fix_capitalization(usr_str):
    list_of_sentences = usr_str.split(".")
    output = []
    for sentence in list_of_sentences:
        new_sentence = sentence.strip().capitalize()
        # If empty, don't bother
        if new_sentence:
            output.append(new_sentence)
    # Finally, join everything
    return ". ".join(output) +"."
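You've entered the sentences with spaces between them. When you split the string at the '.' character, those spaces remain at the start of each piece. I checked what the elements of the list were after the split, and the result was this:
'''
['hi', ' hello', ' hey', '']
'''
capitalize() only uppercases the first character of a string, and for ' hello' that character is a space, so the letters stay lowercase.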
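A minimal sketch of the effect, using the input from the question. The variable names are chosen for this illustration only.

```python
parts = "hi. hello. hey.".split(".")

# capitalize() acts on the first character, which is a space for
# every piece after the first one, so those pieces are unchanged.
naive = [p.capitalize() for p in parts]

# Stripping the whitespace first exposes the first letter;
# the empty trailing piece is skipped.
fixed = [p.strip().capitalize() for p in parts if p.strip()]

print(parts)
print(naive)
print(fixed)
```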

I am trying to count the number of times a word appears in a .txt file using Python3

I am trying to count the number of times a word appears in a txt file. The program seems to work, but I cannot stop it counting what I think is white space (the 60 in my result, which makes no sense as there are more than 60 spaces). Is there a way of stripping - and -- from the middle of words?
import string
words = {}
def unique_words2(filename):
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    for line in open(filename):
        for word in line.lower().split():
            if word == " ":
                continue
            else:
                word = word.strip(strip)
                words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))
unique_words2("alice.txt")
The first 5 results show:
60
a 627
a--i'm 1
a-piece 1
abide 1
It is results like 1, 3 and 4 that I would like to eliminate.
The strip method of a Python string only removes the specified characters from the beginning and end of the string, which is the cause of outputs 3 and 4. Using the translate method instead fixes that. Output 1 is caused by a different problem: if a word consists entirely of characters in strip, it is stripped down to the empty string and counted in the words dictionary under that key.
Tweaked code:
import string
def unique_words2(filename):
    words = {}
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    translation = {ord(bad):None for bad in strip}
    for line in open(filename):
        for word in line.lower().split():
            word = word.translate(translation)
            if word:
                words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))
unique_words2("alice.txt")
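To see the difference between strip and translate on one of the words from the question, here is a small sketch. The variable names are illustrative.

```python
import string

chars = string.punctuation + string.digits
word = "a--i'm"

# strip() only looks at the two ends of the string. Both ends are
# letters here, so nothing is removed.
stripped = word.strip(chars)

# translate() with a table mapping code points to None deletes those
# characters wherever they occur, including the middle of the word.
table = {ord(c): None for c in chars}
translated = word.translate(table)

# A token made only of unwanted characters collapses to '' either way,
# which is why the "if word:" check is needed before counting.
empty = "--".strip(chars)

print(repr(stripped), repr(translated), repr(empty))
```

Note that translate also deletes the apostrophe, so "a--i'm" is counted as the single word "aim" rather than as two words.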
From https://docs.python.org/2/library/string.html:
string.split(s[, sep[, maxsplit]])
The words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed)
Replacing any other separator like '-' with a space should do the trick. No need to take care of repeated spaces since they will be treated as a single space.
import string
def unique_words2(filename):
    words = {}
    strip = string.whitespace + string.punctuation + string.digits + "\"'"
    separators = '-_|'
    for line in open(filename):
        for sep in separators:
            line = line.replace(sep, ' ')
        for word in line.lower().split():
            word = word.strip(strip)
            if word:
                words[word] = words.get(word, 0) + 1
    for word in sorted(words):
        print("{0} {1}".format(word, words[word]))
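A different technique from the two answers above, offered only as a sketch: extract the words with a regular expression instead of cleaning up the output of split(), and count them with collections.Counter. The function name count_words and the pattern are assumptions for this example; the pattern treats a word as a run of ASCII letters with an optional apostrophe part, which may be too strict for some texts.

```python
import re
from collections import Counter

def count_words(text):
    # A word is one or more letters, optionally followed by an
    # apostrophe and more letters (so "i'm" stays in one piece).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

counts = count_words("A--I'm a-piece of 60 -- Alice")
for word in sorted(counts):
    print(word, counts[word])
```

Because only letters can start a match, digits, dashes and stray punctuation never produce a key, so no empty-string entry appears.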

Python Coding for Reversing Sentences

This is the code I am using so far.
translated = []
line = input('Line: ')
while line != '':
    for word in line.split():
        letters = list(word)
        letters.reverse()
        word = ''.join(letters)
        translated.append(word)
    if line == '':
        print(' '.join(translated))
    elif line:
        line = input('Line: ')
It is supposed to read lines of input from the user. An empty line is supposed to signify the end of the input. The program should then reproduce all the lines in their original order, with each word reversed in place.
For example, if I were to input: Hello how are you
Its output should be: olleH woh era uoy
Currently it asks for the inputs and stops when there is an empty line, but it does not produce anything. No reversed words, nothing.
Can anyone tell me what I am doing wrong and help me out with my code?
The print statement needs to be outside the loop. Your loop condition ensures that line is never '' inside the loop, so the if condition is never satisfied.
For the same reason, you need to rethink the elif.
As @Flav points out, the program should read all lines until an empty line ends the input. I have edited the solution as below:
lines = [] # to store all line inputs
while True:
    line = input('Line: ') # use raw_input instead on Python 2.6/2.7
    if line == '':
        break
    lines.append(line)
for line in lines:
    print(' '.join([word[::-1] for word in line.split(' ')]))
You could probably do it like this.
' '.join( [ i[::-1] for i in line.split( ' ' ) ] )
Split the line into words
Reverse each word
Put them back together
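The three steps above can be sketched as a small function. The name reverse_words is made up for this example.

```python
def reverse_words(line):
    # 1. split the line into words, 2. reverse each word with a
    # slice, 3. join the reversed words back together with spaces.
    return ' '.join(word[::-1] for word in line.split())

print(reverse_words("Hello how are you"))
```

Using split() with no argument collapses repeated spaces; split(' ') as in the one-liner above keeps them.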
The issue is that when the line is empty, your while loop stops. You should get rid of the if / else which are useless here.
Full script:
translated = []
line = input('Line: ')
while line != '':
    for word in line.split():
        letters = list(word)
        letters.reverse()
        word = ''.join(letters)
        translated.append(word)
    #The above for loop could be done in one line with:
    #translated.extend([word[::-1] for word in line.split()])
    line = input('Line: ')
print(' '.join(translated))
This works perfectly:
import re
a = "Hello how are you"
" ".join([ "".join(reversed(x)) for x in re.findall(r'\w+',a) ])

Grab a keyword and the text between keywords in Python

First thing I'd like to say is that this place has helped me more than I could ever repay. I'd like to say thanks to all who have helped me in the past :).
I am trying to divide up some text from a specific style of message. It is formatted like this:
DATA|1|TEXT1|STUFF: some random text|||||
DATA|2|TEXT1|THINGS: some random text and|||||
DATA|3|TEXT1|some more random text and stuff|||||
DATA|4|TEXT1|JUNK: crazy randomness|||||
DATA|5|TEXT1|CRAP: such random stuff I cant believe how random|||||
I have code shown below that combines the text adding a space between words and adds it to a string named "TEXT" so it looks like this:
STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random
I need it formatted like this:
DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|CRAP: |||||
DATA|9|NEWTEXT|such random stuff I cant believe how random|||||
The line numbers are easy; I have that done, as well as the carriage returns. What I need is to grab "CRAP" and change the part that says "TEXT1" to "NEWTEXT".
My code scans the string looking for keywords, puts each keyword on its own line, adds the text below it, then puts the next keyword on its own line, and so on. Here is the code I have so far:
#this combines all text to one line and adds to a string
while current_segment.move_next('DATA'):
    TEXT = TEXT + " " + current_segment.field(4).value
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
#this splits the words up to search through
TEXT_list = TEXT.split(' ')
#this searches for the first few keywords then stops at the unwanted one
for word in TEXT_list:
    if word in KEYWORD_LIST:
        my_output = my_output + word
    elif word in KEYWORD_LIST1:
        break
    else:
        my_output = my_output + ' ' + word
#this searches for the unwanted keywords leaving the output blank until it reaches the wanted keyword
for word1 in TEXT_list:
    if word1 in KEYWORD_LIST:
        my_output1 = ''
    elif word1 in KEYWORD_LIST1:
        my_output1 = my_output1 + word1 + '\n'
    else:
        my_output1 = my_output1 + ' ' + word1
#my_output is formatted back the way I want, dividing up the text into 65 or less character lines
MAX_LENGTH = 65
my_wrapped_output = wrap(my_output,MAX_LENGTH)
my_wrapped_output1 = wrap(my_output1,MAX_LENGTH)
my_output_list = my_wrapped_output.split('\n')
my_output_list1 = my_wrapped_output1.split('\n')
for phrase in my_output_list:
    if phrase == "":
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|TEXT| |||||"
    else:
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|TEXT|" + phrase + "|||||"
for phrase2 in my_output_list1:
    if phrase2 == "":
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|NEWTEXT| |||||"
    else:
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|NEWTEXT|" + phrase2 + "|||||"
#this populates the fields I need
value = output
Then I format "my_output" and "my_output1", adding the word "NEWTEXT" where it goes. This code runs through each line looking for the keyword, then puts that keyword and a carriage return in. Once it reaches a keyword from "KEYWORD_LIST1" it stops and drops the rest of the text, then starts the next loop. My problem is that the above code gives me this:
DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|crazy randomness|||||
DATA|9|NEWTEXT|CRAP: |||||
DATA|10|NEWTEXT|such random stuff I cant believe how random|||||
It is grabbing the text from before "KEYWORD_LIST1" and adding it into the NEWTEXT section. I know there is a way to make groups from the keyword and the text after it, but I am unclear on how to implement it. Any help would be much appreciated.
Thanks.
This is what I had to do to get it to work for me:
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)
def format_messages(messages):
    title='TEXT1'
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        my_wrapped_output = wrap(message,MAX_LENGTH)
        my_output_list = my_wrapped_output.split('\n')
        for line in my_output_list:
            if line == '':
                yield title + '|'
            else:
                yield title + '|' + line
for line in format_messages(text_to_message(TEXT)):
    if line == '':
        SetID +=1
        output = "DATA|" + str(SetID) + "|"
    else:
        SetID +=1
        output = "DATA|" + str(SetID) + "|" + line
    #this is needed instead of print(line)
    value = output
General tip: Don't try to build up strings accretively like this:
my_output = my_output + ' ' + word
Instead, make my_output a list, append word to the list, and then, at the very end, do a single join: my_output = ' '.join(my_output). (See text_to_message code below for an example.)
Using join is the right way to build strings. Delaying the creation of the string is useful because processing lists of substrings is more pleasant than splitting and unsplitting strings, and having to add spaces and carriage returns here and there.
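A small sketch contrasting the two styles on a short word list. The sample words are taken from the question's data; the variable names are illustrative.

```python
words = ["STUFF:", "some", "random", "text"]

# Accretive style: every step builds a new string, and the result
# starts with a stray space that has to be cleaned up later.
accreted = ""
for w in words:
    accreted = accreted + " " + w

# List-then-join style: collect the pieces, then join once at the end.
parts = []
for w in words:
    parts.append(w)
joined = " ".join(parts)

print(repr(accreted))
print(repr(joined))
```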
Study generators. They are easy to understand, and can help you a lot when processing text like this.
import textwrap
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)
def format_messages(messages):
    title='TEXT1'
    num=1
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        for line in textwrap.wrap(message,width=65):
            yield 'DATA|{n}|{t}|{l}'.format(n=num,t=title,l=line)
            num+=1
TEXT='''STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random'''
for line in format_messages(text_to_message(TEXT)):
    print(line)
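The question also asks about making groups from each keyword and the text after it. A minimal sketch of that idea follows; the function name group_by_keyword and the single KEYWORDS set are assumptions for this illustration and are not part of the answer above.

```python
KEYWORDS = {'STUFF:', 'THINGS:', 'JUNK:', 'CRAP:'}

def group_by_keyword(text, keywords=KEYWORDS):
    groups = []
    for word in text.split():
        if word in keywords:
            # A keyword starts a new group with an empty word list.
            groups.append((word, []))
        elif groups:
            # Any other word belongs to the most recent keyword.
            # Words before the first keyword are dropped.
            groups[-1][1].append(word)
    return [(keyword, ' '.join(words)) for keyword, words in groups]

for keyword, body in group_by_keyword(
        "STUFF: some random text JUNK: crazy randomness CRAP: such random stuff"):
    print(keyword, '|', body)
```

With the text grouped this way, choosing TEXT1 or NEWTEXT becomes a per-group decision based on the keyword alone.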

small issue with whitespace/punctuation in python?

I have this function that will convert text language into English:
def translate(string):
    textDict={'y':'why', 'r':'are', "l8":'late', 'u':'you', 'gtg':'got to go',
              'lol': 'laugh out loud', 'ur': 'your',}
    translatestring = ''
    for word in string.split(' '):
        if word in textDict:
            translatestring = translatestring + textDict[word]
        else:
            translatestring = translatestring + word
    return translatestring
However, if I want to translate y u l8? it will return whyyoul8?. How would I go about separating the words when I return them, and how do I handle punctuation? Any help appreciated!
One-liner comprehension (requires import re):
''.join(textDict.get(word, word) for word in re.findall(r'\w+|\W+', string))
[Edit] Fixed regex.
You're adding words to a string without spaces. If you're going to do things this way (instead of the way suggested to you in your previous question on this topic), you'll need to manually re-add the spaces since you split on them.
"y u l8" split on " ", gives ["y", "u", "l8"]. After substitution, you get ["why", "you", "late"] - and you're concatenating these without adding spaces, so you get "whyyoulate". Both forks of the if should be inserting a space.
You can just add a + ' ' + to add a space. However, I think what you're trying to do is this:
import re
def translate_string(text):
    textDict={'y':'why', 'r':'are', "l8":'late', 'u':'you', 'gtg':'got to go', 'lol': 'laugh out loud', 'ur': 'your',}
    translatestring = ''
    # the capturing group keeps the separators (spaces, punctuation) in the result
    for word in re.split(r'(\W+)', text):
        if word in textDict:
            translatestring += textDict[word]
        else:
            translatestring += word
    return translatestring
print(translate_string('y u l8?'))
This will print:
why you late?
This code handles stuff like question marks a bit more gracefully and preserves spaces and other characters from your input string, while retaining your original intent.
I'd like to suggest replacing this loop:
for word in string.split(' '):
    if word in textDict:
        translatestring = translatestring + textDict[word]
    else:
        translatestring = translatestring + word
with:
for word in string.split(' '):
    translatestring += textDict.get(word, word)
The dict.get(foo, default) call looks up foo in the dictionary and returns default if foo isn't defined.
(Time to run, short notes now: When splitting, you could split based on punctuation as well as whitespace, save the punctuation or whitespace, and re-introduce it when joining the output string. It's a bit more work, but it'll get the job done.)
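The missing-space problem from the question can be sketched like this, assuming words are separated by single spaces. The name TEXT_DICT and the use of ' '.join are choices made for this illustration; punctuation attached to a word is still left untranslated.

```python
TEXT_DICT = {'y': 'why', 'r': 'are', 'l8': 'late', 'u': 'you',
             'gtg': 'got to go', 'lol': 'laugh out loud', 'ur': 'your'}

def translate(message):
    # split(' ') removes the spaces, so ' '.join puts one back
    # between every pair of translated words.
    return ' '.join(TEXT_DICT.get(word, word) for word in message.split(' '))

print(translate('y u l8'))
print(translate('y u l8?'))  # 'l8?' is not a key, so it is left as-is
```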
