small issue with whitespace/punctuation in python? - python

I have this function that will convert text language into English:
def translate(string):
    textDict={'y':'why', 'r':'are', "l8":'late', 'u':'you', 'gtg':'got to go',
              'lol': 'laugh out loud', 'ur': 'your',}
    translatestring = ''
    for word in string.split(' '):
        if word in textDict:
            translatestring = translatestring + textDict[word]
        else:
            translatestring = translatestring + word
    return translatestring
However, if I want to translate y u l8? it will return whyyoul8?. How would I go about separating the words when I return them, and how do I handle punctuation? Any help appreciated!

One-liner comprehension:
''.join(textDict.get(word, word) for word in re.findall(r'\w+|\W+', string))
[Edit] Fixed regex.

You're adding words to a string without spaces. If you're going to do things this way (instead of the way suggested to you in your previous question on this topic), you'll need to manually re-add the spaces, since you split on them.

"y u l8" split on " ", gives ["y", "u", "l8"]. After substitution, you get ["why", "you", "late"] - and you're concatenating these without adding spaces, so you get "whyyoulate". Both forks of the if should be inserting a space.
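One way to avoid re-adding the spaces by hand is to collect the translated words in a list and join them with a space at the end. The following is a minimal sketch of that idea, not the asker's code: the name TEXT_DICT is an illustrative choice, and punctuation attached to a word (as in l8?) is still left untranslated here.

```python
# Abbreviation table from the question (TEXT_DICT is an illustrative name).
TEXT_DICT = {
    'y': 'why', 'r': 'are', 'l8': 'late', 'u': 'you',
    'gtg': 'got to go', 'lol': 'laugh out loud', 'ur': 'your',
}


def translate(string):
    # Look up each space-separated word, falling back to the word itself.
    words = [TEXT_DICT.get(word, word) for word in string.split(' ')]
    # Joining with ' ' puts back the spaces that split(' ') removed.
    return ' '.join(words)


print(translate('y u l8'))   # why you late
print(translate('y u l8?'))  # why you l8?  (punctuation still blocks the lookup)
```

Because join only places the separator between items, there is no trailing space to strip afterwards.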

You can just add a + ' ' + to add a space. However, I think what you're trying to do is this:
import re
def translate_string(text):
    textDict={'y':'why', 'r':'are', "l8":'late', 'u':'you', 'gtg':'got to go', 'lol': 'laugh out loud', 'ur': 'your',}
    translatestring = ''
    # Split on runs of non-word characters; the capturing group keeps them in the result.
    for word in re.split(r'(\W+)', text):
        if word in textDict:
            translatestring += textDict[word]
        else:
            translatestring += word
    return translatestring
print(translate_string('y u l8?'))
This will print:
why you late?
This code handles stuff like question marks a bit more gracefully and preserves spaces and other characters from your input string, while retaining your original intent.

I'd like to suggest the following replacement for this loop:
for word in string.split(' '):
    if word in textDict:
        translatestring = translatestring + textDict[word]
    else:
        translatestring = translatestring + word
Use this instead:
for word in string.split(' '):
    translatestring += textDict.get(word, word)
The dict.get(foo, default) will look up foo in the dictionary and use default if foo isn't already defined.
(Time to run, short notes now: When splitting, you could split based on punctuation as well as whitespace, save the punctuation or whitespace, and re-introduce it when joining the output string. It's a bit more work, but it'll get the job done.)

Related

Python String adjust

Hello, I want to use some method like .isupper() in a loop, or string[i+1], to find the characters to lowercase, but I don't know how to do that.
input in function -> "ThisIsMyChar"
expected -> "This is my char"
I've done it with a regex. It could be done with less code, but I was aiming for readability.
import re
def split_by_upper(input_string):
    pattern = r'[A-Z][a-z]*'
    matches = re.findall(pattern, input_string)
    if matches:
        output = matches[0]
        for word in matches[1:]:
            output += ' ' + word[0].lower() + word[1:]
        return output
    else:
        return input_string
print(split_by_upper("ThisIsMyChar"))
>> split_by_upper() -> "This is my char"
You could use re.findall and str.lower:
>>> import re
>>> s = 'ThisIsMyChar'
>>> ' '.join(w.lower() if i >= 1 else w for i, w in enumerate(re.findall('.[^A-Z]*', s)))
'This is my char'
You should first try by yourself. If you didn't get it done, you can do something like this:
# to parse input string
def parse(text):
    result = "" + text[0]
    for i in range(1, len(text)):
        ch = text[i]
        if ch.isupper():
            result += " "
        result += ch.lower()
    return result
# input string
text = "ThisIsMyChar"
print(parse(text))
First, run a for loop over the string and check for uppercase letters. When you find one, add a space before it, lowercase it, and append it to your new string. The rest is explained in the comments in the code itself.
def AddSpaceInTitleCaseString(string):
    NewStr = ""
    # Check for uppercase letters in the input string char-by-char.
    for i in string:
        # If one is found, add it to the NewStr variable with a space, lowering its case.
        if i.isupper(): NewStr += f" {i.lower()}"
        # Else just add it as usual.
        else: NewStr += i
    # Before returning the NewStr, remove all the leading and trailing spaces from it.
    # And as shown in your question, I'm assuming that you want the first letter of your new sentence
    # to be in uppercase, so just use the 'capitalize' method for it.
    return NewStr.strip().capitalize()
# Test.
MyStr = AddSpaceInTitleCaseString("ThisIsMyChar")
print(MyStr)
# Output: "This is my char"
Hope it helped :)
Here is a concise regex solution:
import re
capital_letter_pattern = re.compile(r'(?!^)[A-Z]')
def add_spaces(string):
    return capital_letter_pattern.sub(lambda match: ' ' + match[0].lower(), string)
if __name__ == '__main__':
    print(add_spaces('ThisIsMyChar'))
The pattern searches for capital letters ([A-Z]), and the (?!^) is a negative lookahead that excludes the first character of the input ((?!foo) means "don't match foo", ^ is "start of line", so (?!^) is "don't match at the start of line").
The .sub(...) method of a pattern is usually used like pattern.sub('new text', 'my input string that I want changed'). You can also use a function in place of 'new text', in which case the function is called with the match object as an argument, and the value returned by the function is used as the replacement string.
The expression capital_letter_pattern.sub(lambda match: ' ' + match[0].lower(), string) replaces all matches (all capital letters except at the start of the line) using a lambda function to add a space before and make the letter lowercase. match[0] means "the entirety of the matched text", which in this case is the capital letter.
You can split it via Regex using r"(?<!^)(?=[A-Z])" pattern:
import re
txt = 'ThisIsMyChar'
c = re.compile(r"(?<!^)(?=[A-Z])")
first, *rest = map(str.lower, c.split(txt))
print(f'{first.title()} {" ".join(rest)}')
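Pattern explanation:
(?<!^) checks that the position is not at the beginning of the string.
(?=[A-Z]) checks that there is a capital letter right after the position.
Note: these are zero-width lookaround assertions. They match positions and consume no characters, so nothing is removed by the split. Splitting on a pattern that can match an empty string requires Python 3.7 or later.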

How to reverse the words of a string considering the punctuation?

Here is what I have so far:
def reversestring(thestring):
    words = thestring.split(' ')
    rev = ' '.join(reversed(words))
    return rev
stringing = input('enter string: ')
print(reversestring(stringing))
I know I'm missing something because I need the punctuation to also follow the logic.
So let's say the user puts in Do or do not, there is no try.. The result should come out as .try no is there , not do or Do, but I only get try. no is there not, do or Do. The idea I have in mind is a straightforward implementation that reverses all the characters in the string, then goes through the words and reverses the characters again, but only the ones with ASCII values of letters.
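The reverse-twice idea described above can be sketched as follows. This is only an illustration under stated assumptions: the function name reverse_with_punctuation is made up for the example, and a "word" is taken to be a run of ASCII letters. Note that punctuation stays glued to its neighbouring word, so the comma ends up directly in front of not.

```python
import re


def reverse_with_punctuation(text):
    # Step 1: reverse every character, so word order and punctuation flip.
    flipped = text[::-1]
    # Step 2: reverse each run of ASCII letters again to restore the spelling.
    return re.sub(r'[A-Za-z]+', lambda m: m.group(0)[::-1], flipped)


print(reverse_with_punctuation('Do or do not, there is no try.'))
# .try no is there ,not do or Do
```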
Try this (explanation in comments of code):
s = "Do or do not, there is no try."
o = []
for w in s.split(" "):
    puncts = [".", ",", "!"] # change according to needs
    for c in puncts:
        # if a punctuation mark is in the word, move it from the end of the word to the beginning
        if c in w:
            w = c + w[:-1] # w[:-1] gets everything before the last char
    o.append(w)
o = reversed(o) # reversing list to reverse sentence
print(" ".join(o)) # printing it as sentence
#output: .try no is there ,not do or Do
Your code does exactly what it should: splitting on a space doesn't separate a dot or comma from a word.
I'd suggest you use re.findall to get all the words and all the punctuation that interests you:
import re
def reversestring(thestring):
    words = re.findall(r"\w+|[.,]", thestring)
    rev = ' '.join(reversed(words))
    return rev
reversestring("Do or do not, there is no try.") # ". try no is there , not do or Do"
You can use regular expressions to parse the sentence into a list of words and a list of separators, then reverse the word list and combine them together to form the desired string. A solution to your problem would look something like this:
import re
def reverse_it(s):
    t = "" # result, empty string
    words = re.findall(r'(\w+)', s) # just the words
    not_s = re.findall(r'(\W+)', s) # everything else
    j = len(words)
    k = len(not_s)
    words.reverse() # reverse the order of word list
    if re.match(r'(\w+)', s): # begins with a word
        for i in range(k):
            t += words[i] + not_s[i]
        if j > k: # and ends with a word
            t += words[k]
    else: # begins with punctuation
        for i in range(j):
            t += not_s[i] + words[i]
        if k > j: # ends with punctuation
            t += not_s[j]
    return t #result
def check_reverse(p):
    q = reverse_it(p)
    print("\"%s\", \"%s\"" % (p, q) )
check_reverse('Do or do not, there is no try.')
Output
"Do or do not, there is no try.", "try no is there, not do or Do."
It is not a very elegant solution but sure does work!

Why does str.capitalize() not work as I expect?

Please, let me know if I'm not providing enough information. The goal of the program is to capitalize the first letter of every sentence.
usr_str = input()
def fix_capitalization(usr_str):
    list_of_sentences = usr_str.split(".")
    list_of_sentences.pop() #remove last element: ""
    new_str = ''
    for sentence in list_of_sentences:
        new_str += sentence.capitalize() + "."
    return new_str
print(fix_capitalization(usr_str))
For instance, if I input "hi. hello. hey." I expect it to output "Hi. Hello. Hey." but instead, it outputs "Hi. hello. hey."
An alternative would be to build a list of strings then concatenate them:
def fix_capitalization(usr_str):
    list_of_sentences = usr_str.split(".")
    output = []
    for sentence in list_of_sentences:
        new_sentence = sentence.strip().capitalize()
        # If empty, don't bother
        if new_sentence:
            output.append(new_sentence)
    # Finally, join everything
    return ". ".join(output) +"."
You've entered the sentences with spaces between them. When you split the string at the '.' character, those spaces remain at the start of each piece. I checked what the elements in the list were after the split, and the result was this:
'''
['hi', ' hello', ' hey', '']
'''
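The leading spaces matter because str.capitalize() upper-cases only the very first character of the string (and lowercases the rest), and here that first character is a space. One possible way around it, shown as a hedged sketch rather than the only fix, is a regular expression that upper-cases the first letter after each period while leaving the original spacing alone. The pattern is an assumption for illustration and will also capitalize after periods that do not end a sentence (for example in abbreviations).

```python
import re


def fix_capitalization(text):
    # Match the first letter of the text, or a letter that follows a period
    # and optional whitespace; group 1 keeps the prefix, group 2 is the letter.
    pattern = re.compile(r'(^\s*|\.\s*)([a-z])')
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


print(repr(' hello'.capitalize()))            # ' hello' (the space is "capitalized")
print(fix_capitalization('hi. hello. hey.'))  # Hi. Hello. Hey.
```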

How to add strings to items in list that resulted from split?

I am building a function that accepts a string as input, splits it based on certain separator and ends on period. Essentially what I need to do is add certain pig latin words onto certain words within the string if they fit the criteria.
The criteria are:
if the word starts with a non-letter or contains no characters, do nothing to it
if the word starts with a vowel, add 'way' to the end
if the word starts with a consonant, place the first letter at the end and add 'ay'
For output example:
simple_pig_latin("i like this") → 'iway ikelay histay.'
--default sep(space) and end(dot)
simple_pig_latin("i like this", sep='.') → 'i like thisway.'
--separator is dot, so whole thing is a single “word”
simple_pig_latin("i.like.this",sep='.',end='!') → 'iway.ikelay.histay!'
--sep is '.' and end is '!'
simple_pig_latin(".") → '..'
--only word is '.', so do nothing to it and add a '.' to the end
It is now:
def simple_pig_latin(input, sep='', end='.'):
    words=input.split(sep)
    new_sentence=""
    Vowels= ('a','e','i','o','u')
    Digit= (0,1,2,3,4,5,6,7,8,9)
    cons=('b','c','d','f','g','h','j','k','l','m','n','p','q','r','s','t','v','w','x','y','z')
    for word in words:
        if word[0] in Vowels:
            new_word= word+"way"
        if word[0] in Digit:
            new_word= word
        if word[0] in cons:
            new_word= word+"ay"
        else:
            new_word= word
        new_sentence= new_sentence + new_word+ sep
    new_sentence= new_sentence.strip(sep) + sentenceEndPunctuation
    return new_sentence
Example error:
ERROR: test_simple_pig_latin_8 (__main__.AllTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "testerl8.py", line 125, in test_simple_pig_latin_8
result = simple_pig_latin(input,sep='l',end='')
File "/Users/kgreenwo/Desktop/student.py", line 8, in simple_pig_latin
if word[0] in Vowels:
IndexError: string index out of range
You have the means of adding strings together correct: you use the + operator, as you have in new_string = new_string + "way".
You have two other major issues, however:
To determine whether a variable can be found in a list (in your case, a tuple), you’d probably want to use the in operator. Instead of if [i][0]==Vowels: you would use if [i][0] in Vowels:.
When you reconstruct the string with the new words, you will need to add the word to your new_string. Instead of new_string=new_string+"way" you might use new_string = new_string+word+"way". If you choose to do it this way, you’ll also need to decide when to add the sep back to each word.
Another way of joining smaller strings into larger ones with a known separator is to create a list of the new individual strings, and then join the strings back together using your known separator:
separator = ' '
words = sentence.split(separator)
newWords = []
for word in words:
    newWord = doSomething(word)
    newWords.append(newWord)
newSentence = separator.join(newWords)
In this way, you don’t have to worry about either the first or last word not needing a separator.
In your case, doSomething might look like:
def doSomething(word):
    if word[0] in Vowels:
        return word + "way"
    elif word[0] in Consonants:
        return word + "ay"
    #and so forth
How to write a function
On a more basic level, you will probably find it easier to create your functions in steps, rather than trying to write everything at once. At each step, you can be sure that the function (or script) works, and then move on to the next step. For example, your first version might be as simple as:
def simple_pig_latin(sentence, separator=' '):
    words = sentence.split(separator)
    for word in words:
        print(word)
simple_pig_latin("i like this")
This does nothing except print each word in the sentence, one per line, to show you that the function is breaking the sentence apart into words the way that you expect it to be doing. Since words are fundamental to your function, you need to be certain that you have words and that you know where they are before you can continue. Your error of trying to check [i][0] would have been caught much more easily in this version, for example.
A second version might then do nothing except return the sentence recreated, taking it apart and then putting it back together the same way it arrived:
def simple_pig_latin(sentence, separator=' '):
    words = sentence.split(separator)
    new_sentence = ""
    for word in words:
        new_sentence = new_sentence + word + separator
    return new_sentence
print(simple_pig_latin("i like this"))
Your third version might try to add the end punctuation:
def simple_pig_latin(sentence, separator=' ', sentenceEndPunctuation='.'):
    words = sentence.split(separator)
    new_sentence = ""
    for word in words:
        new_sentence = new_sentence + word + separator
    new_sentence = new_sentence + sentenceEndPunctuation
    return new_sentence
print(simple_pig_latin("i like this"))
At this point, you’ll realize that there’s an issue with the separator getting added on in front of the end punctuation, so you might fix that by stripping off the separator when done, or by using a list to construct the new_sentence, or any number of ways.
def simple_pig_latin(sentence, separator=' ', sentenceEndPunctuation='.'):
    words = sentence.split(separator)
    new_sentence = ""
    for word in words:
        new_sentence = new_sentence + word + separator
    new_sentence = new_sentence.strip(separator) + sentenceEndPunctuation
    return new_sentence
print(simple_pig_latin("i like this"))
Only when you can return the new sentence without the pig latin endings, and understand how that works, would you add the pig latin to your function. And when you add the pig latin, you would do it one rule at a time:
def simple_pig_latin(sentence, separator=' ', sentenceEndPunctuation='.'):
    vowels= ('a','e','i','o','u')
    words = sentence.split(separator)
    new_sentence = ""
    for word in words:
        if word[0] in vowels:
            new_word = word + "way"
        else:
            new_word = word
        new_sentence = new_sentence + new_word + separator
    new_sentence = new_sentence.strip(separator) + sentenceEndPunctuation
    return new_sentence
print(simple_pig_latin("i like this"))
And so on, adding each change one at a time, until the function performs the way you expect.
When you try to build the function complete all at once, you end up with competing errors that make it difficult to see where the function is going wrong. By building the function one step at a time, you should generally only have one error at a time to debug.
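Carrying the same step-by-step approach through to the end, a finished version might look like the sketch below. It is one possible reading of the rules in the question, not the expected reference solution: the helper name pig_latin_word is hypothetical, the default separator is assumed to be a space, and upper-case vowels are assumed to count as vowels. The empty-word check is what avoids the IndexError from the traceback, since split can return empty strings when the separator appears twice in a row or at either end.

```python
VOWELS = "aeiou"


def pig_latin_word(word):
    # Rule 1: empty words and words starting with a non-letter are unchanged.
    if not word or not word[0].isalpha():
        return word
    # Rule 2: words starting with a vowel get 'way' appended.
    if word[0].lower() in VOWELS:
        return word + "way"
    # Rule 3: otherwise move the first letter to the end and add 'ay'.
    return word[1:] + word[0] + "ay"


def simple_pig_latin(sentence, sep=" ", end="."):
    # join() places sep only between words, so nothing needs stripping.
    return sep.join(pig_latin_word(w) for w in sentence.split(sep)) + end


print(simple_pig_latin("i like this"))                    # iway ikelay histay.
print(simple_pig_latin("i like this", sep="."))           # i like thisway.
print(simple_pig_latin("i.like.this", sep=".", end="!"))  # iway.ikelay.histay!
print(simple_pig_latin("."))                              # ..
```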

How to separate an irregularly cased string to get the words? - Python

I have the following word list.
My words are not all delimited by a capital letter. The list also contains words such as 'USA', and I am not sure how to handle those: 'USA' should stay as one word and must not be separated.
myList=[u'USA',u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery',
u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building',
u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote',
u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'horseRidingDiscipline']
How can I edit the elements in the list?
I would like to change the elements in the list as shown below:
updatemyList=[u'USA',u'Chancellor', u'current Rank', u'geoloc Department', u'population Urban', u'apparent Magnitude', u'Train', u'artery',
u'education', u'right Child', u'fuel', u'Synagogue', u'Abbey', u'Research Project', u'language Family', u'building',
u'Snooker Player', u'production Company', u'sibling', u'oclc', u'notable Student', u'total Cargo', u'Ambassador', u'copilote',
u'code Book', u'Voice Actor', u'Nuclear Power Station', u'Chess Player', u'runway Length', u'horse Riding Discipline']
so that the words are separated.
You could use re.sub
import re
first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')
def convert(word):
    s1 = first_cap_re.sub(r'\1 \2', word)
    return all_cap_re.sub(r'\1 \2', s1)
updated_words = [convert(word) for word in myList]
Adapted from: Elegant Python function to convert CamelCase to snake_case?
Could do this using regex, but easier to comprehend with a small algorithm (ignoring corner cases like abbreviations e.g NLTK)
def split_camel_case(string):
    new_words = []
    current_word = ""
    for char in string:
        if char.isupper() and current_word:
            new_words.append(current_word)
            current_word = ""
        current_word += char
    return " ".join(new_words + [current_word])
old_words = ["HelloWorld", "MontyPython"]
new_words = [split_camel_case(string) for string in old_words]
print(new_words)
You can use a regular expression to prepend each upper-case letter that's not at the beginning of a word with a space:
re.sub(r"(?!\b)(?=[A-Z])", " ", your_string)
The bit in the first pair of parens means "not at the beginning of a word", and the bit in the second pair of parens means "followed by an uppercase letter". The regular expression matches the empty string at places where these two conditions hold, and replaces the empty string with a space, i.e. it inserts a space at these positions.
The following code snippet separates the words as you want:
myList=[u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery', u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building', u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote', u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'managerYearsEndYear', 'horseRidingDiscipline']
updatemyList = []
for word in myList:
    phrase = word[0]
    for letter in word[1:]:
        if letter.isupper():
            phrase += " "
        phrase += letter
    updatemyList.append(phrase)
print(updatemyList)
Can you simply do a check to see if all letters in word are caps, and if so, to ignore them i.e. count them as a single word?
I've used similar code in the past, and it looks a bit hard-coded but it does the job right (in my case I wanted to capture abbreviations up to 4 letters long)
def CapsSumsAbbv():
    for word in words:
        for i,l in enumerate(word):
            try:
                if word[i] == word[i].upper() and word[i+1] == word[i+1].upper() and word[i+2] == word[i+2].upper() and word[i+3] == word[i+3].upper():
                    try:
                        word = int(word)
                    except:
                        if word not in allcaps:
                            allcaps.append(word)
            except:
                pass
To further expand, if you had entries such as u'USAMilitarySpending' you can adapt the above code so that if there are more than two Caps letters in a row, but there are also lower caps, the space is added between the last and last-1 caps letter so it becomes u'USA Military Spending'
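The rule described above (keep a run of capitals together, but break before its last capital when a lower-case letter follows) can be sketched with two lookaround alternatives. This is an illustrative sketch only: the names BOUNDARY_RE and split_words are made up for the example, and only ASCII letters and digits are considered.

```python
import re

# Insert a break before a capital that follows a lower-case letter or digit,
# or before the last capital of an all-caps run when a lower-case letter
# comes next (so 'USAMilitary' breaks between 'USA' and 'Military').
BOUNDARY_RE = re.compile(r'(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')


def split_words(word):
    # The pattern matches empty positions only, so sub() just inserts spaces.
    return BOUNDARY_RE.sub(' ', word)


for sample in ['USA', 'currentRank', 'NuclearPowerStation', 'USAMilitarySpending']:
    print(split_words(sample))
# USA
# current Rank
# Nuclear Power Station
# USA Military Spending
```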
