I need to write a function for an edX Python course. The idea is to figure out the number of letters and words in a series of strings and return the average letter count, using only for/while loops and conditionals. My code comes close to being able to do this, but I cannot. For the life of me. Figure out why it doesn't work. I've been bashing my head against it for two days now, and I know it's probably something really simple that I'm too idiotic to see (sense my frustration?), but I do not know what it is.
If I'm looking at line 14, the logic makes sense: if i in the string is punctuation (not a letter) and the previous character (char, in this case) is not punctuation (therefore a letter), it should be a word. But it's still counting double punctuation as words. But not all of them.
def averageWordLength(myString):
    char = ""
    punctuation = [" ", "!", "?", ".", ","]
    letters = 0
    words = 0
    if not myString == str(myString):
        return "Not a string"
    try:
        for i in myString:
            if i not in punctuation:
                letters += 1
            elif i in punctuation:
                if char not in punctuation:
                    words += 1
                elif char in punctuation:
                    pass
            char = i
        if letters == 0:
            return "No words"
        else:
            average = letters / (words + 1)
            return letters, words + 1, average
    except TypeError:
        return "No words"
print(averageWordLength("Hi"))
print(averageWordLength("Hi, Lucy"))
print(averageWordLength(" What big spaces you have!"))
print(averageWordLength(True))
print(averageWordLength("?!?!?! ... !"))
print(averageWordLength("One space. Two spaces. Three spaces. Nine spaces. "))
Desired output:
2, 1, 2.0
6, 2, 3.0
20, 6, 4.0
Not a string
No words
38, 8, 4.75
What in blazes am I doing wrong?!
٩๏̯͡๏۶
Final correction:
for i in myString:
    if i not in punctuation:
        letters += 1
        if char in punctuation:
            words += 1
    char = i
else:
    average = letters / (words + 1)
    return letters, words + 1, average
You're adding 1 to words by default... this is not valid in all cases: "Hi!" being a good example. This is actually what is throwing off all of your strings: any time a string does not end in a word, your function will be off.
Hint: You only want to add one if there is no punctuation after the last word.
A problem happens when the string begins with a punctuation character: the previous character is still "" and is not considered punctuation, so a nonexistent word is counted.
You could add "" to the list of symbols, or do:
punctuation = " !?.,"
because testing c in s returns True if c is a substring of s, i.e. if c is a character of s, and the empty string is contained in every string.
A second problem occurs at the end: if the string terminates with a word, that word is not counted (was your words + 1 a way to fix that?), but if the string terminates with punctuation, the last word is counted.
Add this just after the for loop:
if char not in punctuation:
    words += 1
And now there will be no need to add 1, just use
average = letters / words
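Putting the two fixes above together, the whole function might look like the sketch below. It keeps the question's name averageWordLength, and it assumes that isinstance is acceptable for the type check and that "No words" should be returned whenever no word was counted.

```python
def averageWordLength(myString):
    # A string (not a list) makes `in` a substring test,
    # so the initial char = "" counts as punctuation.
    punctuation = " !?.,"
    if not isinstance(myString, str):
        return "Not a string"
    char = ""
    letters = 0
    words = 0
    for i in myString:
        if i not in punctuation:
            letters += 1
        elif char not in punctuation:
            # punctuation directly after a letter closes a word
            words += 1
        char = i
    # a trailing word has no punctuation after it, so count it here
    if char not in punctuation:
        words += 1
    if words == 0:
        return "No words"
    return letters, words, letters / words


print(averageWordLength("Hi, Lucy"))
print(averageWordLength("?!?!?! ... !"))
```

With this version the words + 1 correction disappears, because every word is counted exactly once: either by the punctuation that follows it or by the check after the loop.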
This is made more difficult since I assume you're not allowed to use inbuilt string functions like split().
The way I would approach this is to:
Split the sentence into a list of words.
Count the letters in each word.
Take the average amount of letters.
def averageWordLength(myString):
    punctuation = [" ", "!", "?", ".", ","]
    if not isinstance(myString, str):
        return "Not a string"
    split_values = []
    word = ''
    for char in myString:
        if char in punctuation:
            if word:
                split_values.append(word)
                word = ''
        else:
            word += char
    if word:
        split_values.append(word)
    letter_count = []
    for word in split_values:
        letter_count.append(len(word))
    if len(letter_count):
        return sum(letter_count)/len(letter_count)
    else:
        return "No words."
Related
A university assignment has us tasked with writing a program in Python that analyzes tweets. Part of the assignment is coding a function that identifies whether words within a string sentence are valid, and can be counted. Here's the question:
Task 8 Valid Words
We also might want to look at only valid words in our data set. A word will be a valid word if all three of the following conditions are true:
• The word contains only letters, hyphens, and/or punctuation* (no digits).
• There is at most one hyphen '-'. If present, it must be surrounded by characters ("a-b" is valid, but "-ab" and "ab-" are not valid).
• There is at most one punctuation mark. If present, it must be at the end of the word ("ab,", "cd!", and "." are valid, but "a!b" and "c.," are not valid).
NB: for this question, the 3rd condition will also apply to apostrophes despite real "valid" words
containing them.
Write a function valid_words_mask(sentence) that takes an input parameter sentence (type string)
and returns the tuple: (int, list[]), where:
• int is the number of valid words found.
• list[] contains the booleans True or False for each word in sequence depending on whether that
word is valid.
*Assume that a punctuation mark is any character that is not an alphanumeric (except for apostrophes,
and for hyphens, which are handled separately as per the instructions).
Here's the code I have written so far, after many days of struggling. It seems to only return one iteration of the loop. Keep in mind that I am a beginner programmer, and have only applied the few concepts we have learned. :)
Thanks for the feedback.
def valid_words_mask(sentence):
    """Takes a string sentence input and determines whether words are valid"""
    import string
    punctuation = list(string.punctuation)
    punctuation.remove("-")
    word_list = " ".split(sentence)
    valid_count = 0
    valid_list = []
    for word in word_list:
        hyphen_count = 0
        digit_count = 0
        punctuation_count = 0
        for i in range (0, len(word)):
            #Checks whether given character is a punctuation mark
            if word[i] == "-":
                hyphen_count += 1
        for i in range (0, len(word)):
            #Checks whether given character is a digit
            if word[i].isdigit() == True:
                digit_count += 1
        for i in range (0, (len(word) - 1)):
            if word[i] in punctuation:
                punctuation_count += 1
        if digit_count < 1 and hyphen_count < 2 and punctuation_count < 1:
            if word[0] != "-" and word[-1] != "-":
                validity = True
        else: validity = False
        if validity == True:
            valid_count += 1
        valid_list.append(validity)
    final_tuple = (valid_count, valid_list)
    return final_tuple
sentence = "these are valid words"
print(valid_words_mask(sentence))
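The problem is with the line:
word_list = " ".split(sentence)
This splits the one-character string " " using sentence as the separator, so word_list is [' '] (a single element) and the loop runs only once.
Put
word_list = sentence.split() instead.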
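A quick way to see the difference between the two calls is to print both results. The variable names wrong and right below are only for illustration.

```python
sentence = "these are valid words"

# str.split is called on the string being split; its argument is the separator.
wrong = " ".split(sentence)   # splits " " using `sentence` as the separator
right = sentence.split()      # splits `sentence` on runs of whitespace

print(wrong)   # [' ']
print(right)   # ['these', 'are', 'valid', 'words']
```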
The instructions for this task are confusing when it comes to defining what constitutes punctuation which means that the following code may not work for you.
However, you should think about breaking down the functionality into its component parts. In particular, you have 3 "rules" so write 3 complementary functions: each one succinct. Then it becomes easier to combine those rules into another "driver" function. Here's an example:
from string import ascii_lowercase, punctuation
HYPHEN = '-'
PUNCTUATION = punctuation.replace(HYPHEN, '')
VCHARS = ascii_lowercase + punctuation
def valid_chars(word):
    # only letters, hyphens and punctuation are allowed (no digits)
    return all(c in VCHARS for c in word)
def valid_hyphens(word):
    # at most one hyphen, and it may not start or end the word
    hcount = word.count(HYPHEN)
    return hcount == 0 or (hcount == 1 and word[0] != HYPHEN and word[-1] != HYPHEN)
def valid_punctuation(word):
    # at most one punctuation mark, and only at the end of the word
    pcount = sum(1 for c in word if c in PUNCTUATION)
    return pcount == 0 or (pcount == 1 and word[-1] in PUNCTUATION)
def valid_words_mask(sentence):
    valid_count = 0
    valid_list = list()
    for word in sentence.lower().split():
        if v := valid_chars(word) and valid_punctuation(word) and valid_hyphens(word):
            valid_count += 1
        valid_list.append(v)
    return valid_count, valid_list
print(valid_words_mask('Hello world??'))
Output:
(1, [True, False])
I am doing a Pig Latin code in which the following words are supposed to return the following responses:
"computer" == "omputercay"
"think" == "inkthay"
"algorithm" == "algorithmway"
"office" == "officeway"
"Computer" == "Omputercay"
"Science!" == "Iencescay!"
However, for the last word, my code does not push the '!' to the end of the string. What is the code that will make this happen?
All of them return the correct word apart from the last which returns "Ience!Scay!"
def pigLatin(word):
    vowel = ("a","e","i","o","u")
    first_letter = word[0]
    if first_letter in vowel:
        return word +'way'
    else:
        l = len(word)
        i = 0
        while i < l:
            i = i + 1
            if word[i] in vowel:
                x = i
                new_word = word[i:] + word[:i] + "ay"
                if word[0].isupper():
                    new_word = new_word.title()
                return new_word
For simplicity, how about you check whether the word ends with an exclamation point (!), and if it does, remove it and add it back when you are done? So instead of returning immediately, append the ! at the end (if you found one at the beginning).
def pigLatin(word):
    vowel = ("a","e","i","o","u")
    first_letter = word[0]
    if first_letter in vowel:
        return word +'way'
    else:
        hasExlamation = False
        if word[-1] == '!':
            word = word[:-1] # removes last letter
            hasExlamation = True
        l = len(word)
        i = 0
        while i < l:
            i = i + 1
            if word[i] in vowel:
                x = i
                new_word = word[i:] + word[:i] + "ay"
                if word[0].isupper():
                    new_word = new_word.title()
                break # do not return just break out of the `while` loop
        if hasExlamation:
            new_word += "!" # same as new_word = new_word + "!"
        return new_word
That way it does not treat ! as a normal letter and the output is Iencescay!. You can of course do this with any other character similarly
specialCharacters = ["!"] # define this outside the function
def pigLatin(word):
    # all of the code above
    if word[-1] in specialCharacters:
        hasSpecialCharacter = True
    # then you can continue the same way
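As a rough sketch of that generalisation, the version below peels off one trailing special character, translates the rest, and puts the character back. The name pig_latin, the special_characters parameter and its default value are assumptions made for this example. Unlike the code above, it also treats an upper-case first vowel as a vowel and does not fail on words without any vowel.

```python
VOWELS = "aeiou"


def pig_latin(word, special_characters="!?.,"):
    # Peel off a single trailing special character, if there is one.
    suffix = ""
    if word and word[-1] in special_characters:
        word, suffix = word[:-1], word[-1]
    if not word:
        return suffix
    if word[0].lower() in VOWELS:
        return word + "way" + suffix
    # Find the first vowel; stop at the end if the word has none.
    i = 0
    while i < len(word) and word[i].lower() not in VOWELS:
        i += 1
    new_word = word[i:] + word[:i] + "ay"
    if word[0].isupper():
        new_word = new_word.title()
    return new_word + suffix


for w in ["computer", "think", "algorithm", "Computer", "Science!"]:
    print(pig_latin(w))
```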
Regular expressions to the rescue. A regex pattern with word boundaries will make your life much easier in this case. A word boundary is exactly what it sounds like: it marks the start or end of a word, and is represented in the pattern by \b. In your case, there is such a boundary between the final e and the !. The "word" itself consists of any characters in the set a-z, A-Z, 0-9 or underscore, represented by \w in the pattern. The + means one or more \w characters.
So, if the pattern is r"\b\w+\b", this will match any word (consisting of any of a-zA-Z0-9_), with leading or succeeding word boundaries.
import re
pattern = r"\b\w+\b"
sentence = "computer think algorithm office Computer Science!"
print(re.findall(pattern, sentence))
Output:
['computer', 'think', 'algorithm', 'office', 'Computer', 'Science']
Here, we're using re.findall to get a list of all substrings that matched the pattern. Notice, no whitespace or punctuation is included.
Let's introduce re.sub, which takes a pattern to look for, a string to look through, and another string with which to replace any match it finds. Instead of a replacement-string, you can instead pass in a function. This function must take a match object as a parameter, and must return a string with which to replace the current match.
import re
pattern = r"\b\w+\b"
sentence = "computer think algorithm office Computer Science!"
def replace(match):
    return "*" * len(match.group())
print(re.sub(pattern, replace, sentence))
Output:
******** ***** ********* ****** ******** *******!
That's just for demonstration purposes.
Let's change gears for a second:
from string import ascii_letters as alphabet
print(alphabet)
Output:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
That's handy for creating a string containing only consonants:
from string import ascii_letters as alphabet
consonants = "".join(set(alphabet) ^ set("aeiouAEIOU"))
print(consonants)
Output:
nptDPbHvsxKNWdYyrTqVQRlBCZShzgGjfkJMLmFXwc
We've taken the difference between the set of all alphabetic characters and the set of vowels (^ is actually the symmetric difference, but since every vowel is in the alphabet the result is the same as with -). This yields the set of consonants. Notice that the order of the characters is not preserved in a set, but it doesn't matter in our case, since we'll effectively be treating this string as a set and testing for membership: if a character is in this string, it must be a consonant, and the order does not matter.
Let's take advantage of this, and modify our pattern from earlier. Let's add two capturing groups - the first will capture any leading consonants (if they exist), the second will capture all remaining alpha characters (consonants or vowels) before the terminating word boundary:
import re
from string import ascii_letters as alphabet
consonants = "".join(set(alphabet) ^ set("aeiouAEIOU"))
pattern = fr"\b([{consonants}]*)(\w+)\b"
word = "computer"
match = re.match(pattern, word)
if match is not None:
    print(f"Group one is \"{match.group(1)}\"")
    print(f"Group two is \"{match.group(2)}\"")
Output:
Group one is "c"
Group two is "omputer"
As you can see, the first group captured c, and the second group captured omputer. Separating the match into two groups will be useful later when we construct the pig-latin translation. We can get even cuter by naming our capturing groups. This isn't required, but it will make things a bit easier to read later on:
pattern = fr"\b(?P<prefix>[{consonants}]*)(?P<rest>\w+)\b"
Now, the first capturing group is named prefix, and can be accessed via match.group("prefix"), rather than match.group(1). The second capturing group is named rest, and can be accessed via match.group("rest") instead of match.group(2).
Putting it all together:
import re
from string import ascii_letters as alphabet
consonants = "".join(set(alphabet) ^ set("aeiouAEIOU"))
pattern = fr"\b(?P<prefix>[{consonants}]*)(?P<rest>\w+)\b"
sentence = "computer think algorithm office Computer Science!"
def to_pig_latin(match):
    rest = match.group("rest")
    prefix = match.group("prefix")
    result = rest + prefix
    if len(prefix) == 0:
        # if the 'prefix' capturing group was empty
        # the word must have started with a vowel
        # so, the suffix is 'way'
        result += "way"
        # that also means we need to check if the first character...
        # ... (which must be in 'rest') was upper-case.
        if rest[0].isupper():
            result = result.title()
    else:
        result += "ay"
        if prefix[0].isupper():
            result = result.title()
    return result
print(re.sub(pattern, to_pig_latin, sentence))
Output:
omputercay inkthay algorithmway officeway Omputercay Iencescay!
That was the verbose version. The definition of to_pig_latin can be shortened to:
def to_pig_latin(match):
    rest = match.group("rest")
    prefix = match.group("prefix")
    return (str, str.title)[(prefix or rest)[0].isupper()](rest + prefix + "way"[bool(prefix):])
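If the one-liner is hard to read, a middle ground with conditional expressions might look like the sketch below. It is meant to behave like the two versions above. Building consonants with sorted() and set difference is a choice made here so the character class is the same on every run; the order inside [...] does not affect matching.

```python
import re
from string import ascii_letters

# Set difference works here because every vowel is in ascii_letters.
consonants = "".join(sorted(set(ascii_letters) - set("aeiouAEIOU")))
pattern = fr"\b(?P<prefix>[{consonants}]*)(?P<rest>\w+)\b"


def to_pig_latin(match):
    prefix = match.group("prefix")
    rest = match.group("rest")
    # an empty prefix means the word started with a vowel
    suffix = "ay" if prefix else "way"
    result = rest + prefix + suffix
    # the word's original first letter is in prefix if present, else in rest
    first_letter = (prefix or rest)[0]
    return result.title() if first_letter.isupper() else result


sentence = "computer think algorithm office Computer Science!"
print(re.sub(pattern, to_pig_latin, sentence))
```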
I am trying to make a script that will accept a string as input in which all of the words are run together, but the first character of each word is uppercase. It should convert the string to a string in which the words are separated by spaces and only the first word starts with an uppercase letter.
For Example (The Input):
"StopWhateverYouAreDoingInterestingIDontCare"
The expected output:
"Stop whatever you are doing interesting I dont care"
Here is the one I wrote so far:
string_input = "StopWhateverYouAreDoingInterestingIDontCare"
def organize_string():
    start_sentence = string_input[0]
    index_of_i = string_input.index("I")
    for i in string_input[1:]:
        if i == "I" and string_input[index_of_i + 1].isupper():
            start_sentence += ' ' + i
        elif i.isupper():
            start_sentence += ' ' + i.lower()
        else:
            start_sentence += i
    return start_sentence
While this takes care of some parts, I am struggling with differentiating if the letter "I" is single or a whole word. Here is my output:
"Stop whatever you are doing interesting i dont care"
Single "I" needs to be uppercased, while the "I" in the word "Interesting" should be lowercased "interesting".
I will really appreciate all the help!
A regular expression will do in this example.
import re
s = "StopWhateverYouAreDoingInterestingIDontCare"
t = re.sub(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])', ' ', s)
Explained:
(?<=[a-z])(?=[A-Z]) - a lookbehind for a lowercase letter followed by a lookahead for an uppercase letter
| - (signifies or)
(?<=[A-Z])(?=[A-Z]) - a lookbehind for an uppercase letter followed by a lookahead for an uppercase letter
This regex substitutes a space when there is a lowercase letter followed by an uppercase letter, OR, when there is an uppercase letter followed by an uppercase letter.
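To make the effect of the two lookaround pairs concrete, here is a small check of what the substitution produces on its own, before any lowercasing. The variable name pattern is just for this example.

```python
import re

s = "StopWhateverYouAreDoingInterestingIDontCare"
# Both alternatives match an empty position, so nothing is consumed:
# a space is simply inserted between the two neighbouring letters.
pattern = r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])'
t = re.sub(pattern, ' ', s)
print(t)  # Stop Whatever You Are Doing Interesting I Dont Care
```

Every word still starts with a capital letter at this point. The substitution only decides where the spaces go.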
UPDATE: This doesn't correctly lowercase the words (with the exception of I and the first_word)
UPDATE2: The fix to this is:
import re
s = "StopWhateverYouAreDoingInterestingIDontCare"
first_word, *rest = re.split(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z])', s)
rest = [word.lower() if word != 'I' else word for word in rest]
print(first_word, ' '.join(rest))
Prints:
Stop whatever you are doing interesting I dont care
Update 3: I looked at why your code failed to correctly form the sentence (which I should have done in the first place instead of posting my own solution :-)).
Here is the corrected code with some remarks about the changes.
string_input = "StopWhateverYouAreDoingInterestingIDontCare"
def organize_string():
    start_sentence = string_input[0]
    #index_of_i = string_input.index("I")
    for i, char in enumerate(string_input[1:], start=1):
        if char == "I" and string_input[i + 1].isupper():
            start_sentence += ' ' + char
        elif char.isupper():
            start_sentence += ' ' + char.lower()
        else:
            start_sentence += char
    return start_sentence
print(organize_string())
1. I commented out the line index_of_i = string_input.index("I") because it doesn't do what you need: it finds the index of the first capital I (the I in Interesting), not the I that should stand alone (the one in IDont, further along in string_input).
2. In for i, char in enumerate(string_input[1:], start=1), enumerate numbers the letters starting at 1 (string_input[1:] starts at index 1, so the two stay in sync). i is the index of the letter in string_input.
3. I changed the i's to char to make it clearer that char is the character. Other than these changes, the code stands as you wrote it.
Now the program gives the correct output.
string_input = "StopWhateverYouAreDoingInterestingIDontCare"
counter = 1
def organize_string():
    global counter
    start_sentence = string_input[0]
    for i in string_input[1:]:
        if i == "I" and string_input[counter+1].isupper():
            start_sentence += ' ' + i
        elif i.isupper():
            start_sentence += ' ' + i.lower()
        else:
            start_sentence += i
        counter += 1
    print(start_sentence)
organize_string()
I made some changes to your program. I used a counter to check the index position. I get your expected output:
Stop whatever you are doing interesting I dont care
s = 'StopWhateverYouAreDoingInterestingIDontCare'
ss = ' '
res = ''.join(ss + x if x.isupper() else x for x in s).strip(ss).split(ss)
sr = ''
for w in res:
    sr = sr + w.lower() + ' '
print(sr[0].upper() + sr[1:])
output
Stop whatever you are doing interesting i dont care
I hope this will work fine :-
string_input = "StopWhateverYouAreDoingInterestingIDontCare"
def organize_string():
    i=0
    while i<len(string_input):
        if string_input[i]==string_input[i].upper() and i==0 :
            print(' ',end='')
            print(string_input[i].upper(),end='')
        elif string_input[i]==string_input[i].upper() and string_input[i+1]==string_input[i+1].upper():
            print(' ',end='')
            print(string_input[i].upper(),end='')
        elif string_input[i]==string_input[i].upper() and i!=0:
            print(' ',end='')
            print(string_input[i].lower(),end='')
        if string_input[i]!=string_input[i].upper():
            print(string_input[i],end='')
        i=i+1
organize_string()
Here is one solution utilising the re package to split the string based on the upper case characters.
import re
text = "StopWhateverYouAreDoingInterestingIDontCare"
# Split text by upper character
text_splitted = re.split('([A-Z])', text)
print(text_splitted)
As we see in the output below the separator (The upper case character) and the text before and after is kept. This means that the upper case character is always followed by the rest of the word. The empty first string originates from the first upper case character, which is the first separator.
# Output of print
[
'',
'S', 'top',
'W', 'hatever',
'Y', 'ou',
'A', 're',
'D', 'oing',
'I', 'nteresting',
'I', '',
'D', 'ont',
'C', 'are'
]
As we have seen, each upper case character is always followed by the rest of its word. By combining the two we get the split words. This also allows us to easily handle your special case with the I.
# Remove first character because it is always empty if first char is always upper
text_splitted = text_splitted[1:]
result = []
for i in range(0, len(text_splitted), 2):
    word = text_splitted[i]+text_splitted[i+1]
    if (i > 0) and (word != 'I') :
        word = word.lower()
    result.append(word)
result = ' '.join(result)
Split the sentence into individual words. If you find the word "I" in this list, leave it alone. Leave the first word alone. All of the other words, you cast to lower case.
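One possible reading of those steps is sketched below. Using re.findall with the pattern [A-Z][a-z]* to do the splitting is an assumption of this example, and the name organize_words is made up for it.

```python
import re


def organize_words(text):
    # Each word is one upper-case letter followed by zero or more lower-case letters.
    words = re.findall(r"[A-Z][a-z]*", text)
    result = []
    for index, word in enumerate(words):
        # Leave the first word and any standalone "I" alone; lower-case the rest.
        if index == 0 or word == "I":
            result.append(word)
        else:
            result.append(word.lower())
    return " ".join(result)


print(organize_words("StopWhateverYouAreDoingInterestingIDontCare"))
# Stop whatever you are doing interesting I dont care
```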
You have to use some string manipulation like this:
output=string_input[0]
for l in string_input[1:]:
    if l.islower():
        output+=l
    else:
        output+=' '+l.lower()
print(output)
So I'm trying to do a code that will shift every letter in a word back by a number of letters in the alphabet (wrapping around for the end). For example, if I want to shift by 2 and I input CBE, I should get AZC. or JOHN into HMFL. I got a code to work for only one letter, and I wonder if there's a way to do a nested for loop for python (that works?)
def move(word, shift):
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ"
    original = ""
    for letter in range(26, len(alphabet)):
        if alphabet[letter] == word: #this only works if len(word) is 1, I want to be able to iterate over the letters in word.
            original += alphabet[letter-shift]
    return original
You could start like this
def move(word, shift):
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return "".join([alphabet[alphabet.find(i)-shift] for i in word])
Basically, this list comprehension constructs a list of the single letters. Then, the index of the letter in the alphabet is found by the .find method. The (index - shift) is the desired new index, which is extracted from alphabet. The resulting list is joined again and returned.
Note that it does obviously not work on lowercase input strings (if you want that use the str.upper method). Actually, the word should only consist of letters present in alphabet. For sentences the approach needs to treat whitespaces differently.
Don't find the letter in the alphabet that way -- find it with an index operation. Let char be the letter in question:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
...
char_pos = alphabet.index(char)
new_pos = (char_pos - shift) % len(alphabet)
new_char = alphabet[new_pos]
Once you understand this, you can collapse those three lines to a single line.
Now, to make it operate on an entire word ...
new_word = ""
for char in word:
    # insert the above logic
    new_word += new_char
Can you put all those pieces together?
You'll still need your check to see that char is a letter. Also, if you're interested, you can build a list comprehension for all the translated characters and then apply ''.join() to get your new word.
For instance ...
If the letter is in the alphabet (if char in alphabet), shift the given distance and get the new letter, wrapping around the end if needed (% 26). If it's not a capital letter, then use the original character.
Make a list from all these translations, and then join them into a string. Return that string.
def move(word, shift):
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return ''.join([alphabet[(alphabet.find(char) - shift) % 26]
                    if char in alphabet else char
                    for char in word])
print(move("IBM", 1))
print(move("The 1812 OVERTURE is COOL!", 13))
Output:
HAL
Ghe 1812 BIREGHER is PBBY!
A_VAL = ord('a')
def move(word, shift):
    new_word = ""
    for letter in word:
        new_letter = ord(letter) - shift
        # wrap around past 'a' by adding 26 before converting back to a character
        new_word += chr(new_letter if new_letter >= A_VAL else 26 + new_letter)
    return new_word
Note that this will only work for lowercase words. As soon as you start mixing upper and lowercase letters you'll need to start checking for them. But this is a start. I discarded your nested loop idea because you should avoid those if at all possible.
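For mixed-case input, one way to add those checks is to pick the base letter per character and wrap with a modulo. This is only a sketch: the name shift_back is made up here, and leaving non-letters unchanged is an assumption about the desired behaviour.

```python
def shift_back(word, shift):
    shifted = []
    for letter in word:
        if "A" <= letter <= "Z":
            base = ord("A")
        elif "a" <= letter <= "z":
            base = ord("a")
        else:
            # not an ASCII letter: keep it as it is
            shifted.append(letter)
            continue
        # position in the alphabet, moved back by shift, wrapped into 0..25
        shifted.append(chr((ord(letter) - base - shift) % 26 + base))
    return "".join(shifted)


print(shift_back("CBE", 2))   # AZC
print(shift_back("JOHN", 2))  # HMFL
```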
You could use chr() and ord(): chr() gives you the character for an ASCII code, and ord() gives you the ASCII code for a character.
Here is an old Vigenere project :
def code_vigenere(ch,cle):
    text = ch.lower()
    clef = cle.lower()
    L = len(clef)
    res = ''
    for i,l in enumerate(text):
        # use the lower-cased key so the offsets from 97 ('a') stay in 0..25
        res += chr((ord(l) - 97 + ord(clef[i%L]) - 97)%26 +97)
    return res
I have the following word list.
Not all of my words are delimited by capital letters: the list contains words such as 'USA', which should stay as one word and must not be separated. I am not sure how to handle that.
myList=[u'USA',u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery',
u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building',
u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote',
u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'horseRidingDiscipline']
How can I edit the elements in the list?
I would like to change the elements as shown below:
updatemyList=[u'USA',u'Chancellor', u'current Rank', u'geoloc Department', u'population Urban', u'apparent Magnitude', u'Train', u'artery',
u'education', u'right Child', u'fuel', u'Synagogue', u'Abbey', u'Research Project', u'language Family', u'building',
u'Snooker Player', u'production Company', u'sibling', u'oclc', u'notable Student', u'total Cargo', u'Ambassador', u'copilote',
u'code Book', u'Voice Actor', u'Nuclear Power Station', u'Chess Player', u'runway Length', u'horse Riding Discipline']
That is, the words should be separated.
You could use re.sub
import re
first_cap_re = re.compile('(.)([A-Z][a-z]+)')
all_cap_re = re.compile('([a-z0-9])([A-Z])')
def convert(word):
    s1 = first_cap_re.sub(r'\1 \2', word)
    return all_cap_re.sub(r'\1 \2', s1)
updated_words = [convert(word) for word in myList]
Adapted from: Elegant Python function to convert CamelCase to snake_case?
Could do this using regex, but easier to comprehend with a small algorithm (ignoring corner cases like abbreviations e.g NLTK)
def split_camel_case(string):
    new_words = []
    current_word = ""
    for char in string:
        if char.isupper() and current_word:
            new_words.append(current_word)
            current_word = ""
        current_word += char
    return " ".join(new_words + [current_word])
old_words = ["HelloWorld", "MontyPython"]
new_words = [split_camel_case(string) for string in old_words]
print(new_words)
You can use a regular expression to prepend each upper-case letter that's not at the beginning of a word with a space:
re.sub(r"(?!\b)(?=[A-Z])", " ", your_string)
The bit in the first pair of parens means "not at the beginning of a word", and the bit in the second pair of parens means "followed by an uppercase letter". The regular expression matches the empty string at places where these two conditions hold, and replaces the empty string with a space, i.e. it inserts a space at these positions.
The following code snippet separates the words as you want:
myList=[u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery', u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building', u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote', u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'managerYearsEndYear', 'horseRidingDiscipline']
updatemyList = []
for word in myList:
    phrase = word[0]
    for letter in word[1:]:
        if letter.isupper():
            phrase += " "
        phrase += letter
    updatemyList.append(phrase)
print(updatemyList)
Can you simply do a check to see if all letters in word are caps, and if so, to ignore them i.e. count them as a single word?
I've used similar code in the past, and it looks a bit hard-coded but it does the job right (in my case I wanted to capture abbreviations up to 4 letters long)
def CapsSumsAbbv():
    for word in words:
        for i,l in enumerate(word):
            try:
                if word[i] == word[i].upper() and word[i+1] == word[i+1].upper() and word[i+2] == word[i+2].upper() and word[i+3] == word[i+3].upper():
                    try:
                        word = int(word)
                    except:
                        if word not in allcaps:
                            allcaps.append(word)
            except:
                pass
To expand further: if you had entries such as u'USAMilitarySpending', you could adapt the above code so that when there are more than two capital letters in a row followed by lowercase letters, the space is added between the second-to-last and last capital letter, so it becomes u'USA Military Spending'.
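That rule can also be written as a single regular expression rather than as an adaptation of the loop above. The sketch below swaps in that different technique. The pattern and the name split_keep_acronyms are assumptions for illustration, not code from any of the answers.

```python
import re

# Insert a space (a) between a lower-case letter or digit and an upper-case letter,
# or (b) between two upper-case letters when the second one starts a
# capitalised word, i.e. is itself followed by a lower-case letter.
ACRONYM_AWARE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_keep_acronyms(word):
    return ACRONYM_AWARE.sub(" ", word)


for w in ["USA", "currentRank", "NuclearPowerStation", "USAMilitarySpending"]:
    print(split_keep_acronyms(w))
```

Runs of capitals such as USA are left intact because neither alternative matches between two capitals unless a lower-case letter follows the second one.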