Unable to properly parse a string via character modification

Unable to properly parse a string via character modification - python

I'm running into an issue where my Python code is not correctly returning a function call designed to add an underscore character before each capital letter and I'm not sure where I'm going wrong. For an output, only the "courseID" word in the string is getting touched whereas the other two words are not.
I thought cycling thru the letters in a word, looking for capitalized letters would work, but it doesn't appear to be so. Could someone let me know where my code might be going wrong?
def parse_variables(string):
new_string=''
for letter in string:
if letter.isupper():
pos=string.index(letter)
parsed_string=string[:pos] + '_' + string[pos:]
new_string=''.join(parsed_string+letter)
else:
new_string=''.join(letter)
# new_string=''.join(letter)
return new_string.lower()
parse_variables("courseID pathID apiID")
Current output is a single letter lowercase d and the expected output should be course_id path_id api_id.

The issue with your revised code is that index only finds the first occurence of the capital letter in the string. Since you have repeated instances of the same capital letters, the function never finds the subsequent instances. You could simplify your approach and avoid this issue by simply concatenating the letters with or without underscores depending on whether they are uppercase as you iterate.
For example:
def underscore_caps(s):
result = ''
for c in s:
if c.isupper():
result += f'_{c.lower()}'
else:
result += c
return result
print(underscore_caps('courseID pathID apiID'))
# course_i_d path_i_d api_i_d
Or a bit more concisely using list comprehension and join:
def underscore_caps(s):
return ''.join([f'_{c.lower()}' if c.isupper() else c for c in s])
print(underscore_caps('courseID pathID apiID'))
# course_i_d path_i_d api_i_d

I think a regex solution would be easier to understand here. This takes words that end with capital letters and adds the underscore and makes them lowercase
import re
s = "courseID pathID apiID exampleABC DEF"
def underscore_lower(match):
return "_" + match.group(1).lower()
pat = re.compile(r'(?<=[^A-Z\s])([A-Z]+)\b')
print(pat.sub(underscore_lower, s))
# course_id path_id api_id example_abc DEF
You might have to play with that regex to get it to do exactly what you want. At the moment, it takes capital letters at the end of words that are preceded by a character that is neither a capital letter or a space. It then makes those letters lowercase and adds an underscore in front of them.

You have a number of issues with your code:
string.index(letter) gives the index of the first occurrence of letter, so if you have multiple e.g. D, pos will only update to the position of the first one.
You could correct this by iterating over both position and letter using enumerate e.g. for pos, letter in enumerate(string):
You are putting underscores before each capital letter i.e. _i_d
You are overwriting previous edits by referring to string in parsed_string=string[:pos] + '_' + string[pos:]
Correcting all these issues you would have:
def parse_variables(string):
new_string=''
for pos, letter in enumerate(string):
if letter.isupper() and pos+1 < len(string) and string[pos+1].isupper():
new_string += f'_{letter}'
else:
new_string += letter
return new_string.lower()
But a much simpler method is:
"courseID pathID apiID".replace('ID', '_id')
Update:
Given the variety of strings you want to capture, it seems regex is the tool you want to use:
import re
def parse_variables(string, pattern=r'(?<=[a-z])([A-Z]+)', prefix='_'):
"""Replace patterns in string with prefixed lowercase version.
Default pattern is any substring of consecutive
capital letters that occur after a lowercase letter."""
foo = lambda pat: f'{prefix}{pat.group(1).lower()}'
return re.sub(pattern, foo, text)
text = 'courseID pathProjects apiCode'
parse_variables(text)
>>> course_id path_projects api_code

Related

Python 3 - How to capitalize first letter of every sentence when translating from morse code

I am trying to translate morse code into words and sentences and it all works fine... except for one thing. My entire output is lowercased and I want to be able to capitalize every first letter of every sentence.
This is my current code:
text = input()
if is_morse(text):
lst = text.split(" ")
text = ""
for e in lst:
text += TO_TEXT[e].lower()
print(text)
Each element in the split list is equal to a character (but in morse) NOT a WORD. 'TO_TEXT' is a dictionary. Does anyone have a easy solution to this? I am a beginner in programming and Python btw, so I might not understand some solutions...

Maintain a flag telling you whether or not this is the first letter of a new sentence. Use that to decide whether the letter should be upper-case.
text = input()
if is_morse(text):
lst = text.split(" ")
text = ""
first_letter = True
for e in lst:
if first_letter:
this_letter = TO_TEXT[e].upper()
else:
this_letter = TO_TEXT[e].lower()
# Period heralds a new sentence.
first_letter = this_letter == "."
text += this_letter
print(text)

From what is understandable from your code, I can say that you can use the title() function of python.
For a more stringent result, you can use the capwords() function importing the string class.
This is what you get from Python docs on capwords:
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.

Find multiple list items within a string

Working on a problem in which I am trying to get a count of the number of vowels in a string. I wrote the following code:
def vowel_count(s):
count = 0
for i in s:
if i == 'a' or i == 'e' or i == 'i' or i == 'o' or i == 'u':
count += 1
print count
vowel_count(s)
While the above works, I would like to know how to do this more simply by creating a list of all vowels, then looping my If statement through that, instead of multiple boolean checks. I'm sure there's an even more elegant way to do this with import modules, but interested in this type of solution.
Relative noob...appreciate the help.

No need to create a list, you can use a string like 'aeiou' to do this:
>>> vowels = 'aeiou'
>>> s = 'fooBArSpaM'
>>> sum(c.lower() in vowels for c in s)
4

You can actually treat a string similarly to how you would a list in python (as they are both iterables), for example
vowels = 'aeiou'
sum(1 for i in s if i.lower() in vowels)
For completeness sake, others suggest vowels = set('aeiou') to allow not matching checks such as 'eio' in vowels. However note if you are iterating over your string in a for loop one character at a time, you won't run into this problem.

A weird way around this is the following:
vowels = len(s) - len(s.translate(None, 'aeiou'))
What you are doing with s.translate(None, 'aeiou') is creating a copy of the string removing all vowels. And then checking how the length differed.
Special note: the way I'm using it is even part of the official documentation
What is a vowel?
Note, though, that method presented here only replaces exactly the characters present in the second parameter of the translate string method. In particular, this means that it will not replace uppercase versions characters, let alone accented ones (like áèïôǔ).
Uppercase vowels
Solving the uppercase ones is kind of easy, just do the replacemente on a copy of the string that has been converted to lowercase:
vowels = len(s) - len(s.lower().translate(None, 'aeiou'))
Accented vowels
This one is a little bit more convoluted, but thanks to this other SO question we know the best way to do it. The resulting code would be:
from unicodedate import normalize
# translate special characters to unaccented versions
normalized_str = normalize('NFD', s).encode('ascii', 'ignore')
vowels = len(s) - len(normalized_str.lower().translate(None, 'aeiou'))

You can filter using a list comprehension, like so:
len([letter for letter in s if letter in 'aeiou'])

Python: if a word contains a digit

I'm writing a function that will take a word as a parameter and will look at each character and if there is a number in the word, it will return the word
This is my string that I will iterate through
'Let us look at pg11.'
and I want to look at each character in each word and if there is a digit in the word, I want to return the word just the way it is.
import string
def containsDigit(word):
for ch in word:
if ch == string.digits
return word

if any(ch.isdigit() for ch in word):
print word, 'contains a digit'

To make your code work use the in keyword (which will check if an item is in a sequence), add a colon after your if statement, and indent your return statement.
import string
def containsDigit(word):
for ch in word:
if ch in string.digits:
return word

Why not use Regex?
>>> import re
>>> word = "super1"
>>> if re.search("\d", word):
... print("y")
...
y
>>>
So, in your function, just do:
import re
def containsDigit(word):
if re.search("\d", word):
return word
print(containsDigit("super1"))
output:
'super1'

You are missing a colon:
for ch in word:
if ch.isdigit(): #<-- you are missing this colon
print "%s contains a digit" % word
return word

Often when you want to know if "something" contains "something_else" sets may be usefull.
digits = set('0123456789')
def containsDigit(word):
if set(word) & digits:
return word
print containsDigit('hello')

If you desperately want to use the string module. Here is the code:
import string
def search(raw_string):
for raw_array in string.digits:
for listed_digits in raw_array:
if listed_digits in raw_string:
return True
return False
If I run it in the shell here I get the wanted resuts. (True if contains. False if not)
>>> search("Give me 2 eggs")
True
>>> search("Sorry, I don't have any eggs.")
False
Code Break Down
This is how the code works
The string.digits is a string. If we loop through that string we get a list of the parent string broke down into pieces. Then we get a list containing every character in a string with'n a list. So, we have every single characters in the string! Now we loop over it again! Producing strings which we can see if the string given contains a digit because every single line of code inside the loop takes a step, changing the string we looped through. So, that means ever single line in the loop gets executed every time the variable changes. So, when we get to; for example 5. It agains execute the code but the variable in the loop is now changed to 5. It runs it agin and again and again until it finally got to the end of the string.

Python regular expression to remove space and capitalize letters where the space was?

I want to create a list of tags from a user supplied single input box, separated by comma's and I'm looking for some expression(s) that can help automate this.
What I want is to supply the input field and:
remove all double+ whitespaces, tabs, new lines (leaving just single spaces)
remove ALL (single's and double+) quotation marks, except for comma's, which there can be only one of
in between each comma, i want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
def no_whitespace(s):
"""Remove all whitespace & newlines. """
return re.sub(r"(?m)\s+", "", s)
# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))
# split into a list at comma's
tag_list = tag_list.split(',')
# remove any empty strings (since I currently don't know how to remove double comma's)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost though when it comes to modifying that regex to remove all the punctuation except comma's and I don't even know where to begin for the capitalizing.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs = desired_outputs
form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
# First, we split the string on commas.
rawTags = str.split(',')
# Then, we sanitize each of the tags. If sanitizing gives us back None,
# then the tag was invalid, so we leave those cases out of our final
# list of tags. We can use None as the predicate because sanitizeTag
# will never return '', which is the only falsy string.
return filter(None, map(sanitizeTag, rawTags))
# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
# First, we turn non-alphanumeric characters into whitespace. You could
# also use a regular expression here; see below.
str = ''.join(c if c.isalnum() else ' ' for c in str)
# Next, we split the string on spaces, ignoring leading and trailing
# whitespace.
words = str.split()
# There are now three possibilities: there are no words, there was one
# word, or there were multiple words.
numWords = len(words)
if numWords == 0:
# If there were no words, the string contained only spaces (and/or
# punctuation). This can't be made into a valid tag, so we return
# None.
return None
elif numWords == 1:
# If there was only one word, that word is the tag, no
# post-processing required.
return words[0]
else:
# Finally, if there were multiple words, we camel-case the string:
# we lowercase the first word, capitalize the first letter of all
# the other words and lowercase the rest, and finally stick all
# these words together without spaces.
return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that it's worth clarifying. First is the use of str.split() in sanitizeTags. This will turn a b c into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
words = ''.join(c if c.isalnum() else ' ' for c in str).split()
numWords = len(words)
if numWords == 0:
return None
elif numWords == 1:
return words[0]
else:
return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
words = filter(lambda c: c.isalnum() or c.isspace(), str).split()
numWords = len(words)
if numWords == 0:
return None
elif numWords == 1:
return words[0]
else:
words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']

You could use a white list of characters allowed to be in a word, everything else is ignored:
import re
def camelCase(tag_str):
words = re.findall(r'\w+', tag_str)
nwords = len(words)
if nwords == 1:
return words[0] # leave unchanged
elif nwords > 1: # make it camelCaseTag
return words[0].lower() + ''.join(map(str.title, words[1:]))
return '' # no word characters
This example uses \w word characters.
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps

I think this should work
def toCamelCase(s):
# remove all punctuation
# modify to include other characters you may want to keep
s = re.sub("[^a-zA-Z0-9\s]","",s)
# remove leading spaces
s = re.sub("^\s+","",s)
# camel case
s = re.sub("\s[a-z]", lambda m : m.group(0)[1].upper(), s)
# remove all punctuation and spaces
s = re.sub("[^a-zA-Z0-9]", "", s)
return s
tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
the key here is to make use of re.sub to make the replacements you want.
EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT : Moved "if s" after the toCamelCase call

Check if a string is a possible abbrevation for a name

I'm trying to develop a python algorithm to check if a string could be an abbrevation for another word. For example
fck is a match for fc kopenhavn because it matches the first characters of the word. fhk would not match.
fco should not match fc kopenhavn because no one irl would abbrevate FC Kopenhavn as FCO.
irl is a match for in real life.
ifk is a match for ifk goteborg.
aik is a match for allmanna idrottskluben.
aid is a match for allmanna idrottsklubben. This is not a real team name abbrevation, but I guess it is hard to exclude it unless you apply domain specific knowledge on how Swedish abbrevations are formed.
manu is a match for manchester united.
It is hard to describe the exact rules of the algorithm, but I hope my examples show what I'm after.
Update I made a mistake in showing the strings with the matching letters uppercased. In the real scenario, all letters are lowercase so it is not as easy as just checking which letters are uppercased.

This passes all the tests, including a few extra I created. It uses recursion. Here are the rules that I used:
The first letter of the abbreviation must match the first letter of
the text
The rest of the abbreviation (the abbrev minus the first letter) must be an abbreviation for:
the remaining words, or
the remaining text starting from
any position in the first word.
tests=(
('fck','fc kopenhavn',True),
('fco','fc kopenhavn',False),
('irl','in real life',True),
('irnl','in real life',False),
('ifk','ifk gotebork',True),
('ifko','ifk gotebork',False),
('aik','allmanna idrottskluben',True),
('aid','allmanna idrottskluben',True),
('manu','manchester united',True),
('fz','faz zoo',True),
('fzz','faz zoo',True),
('fzzz','faz zoo',False),
)
def is_abbrev(abbrev, text):
abbrev=abbrev.lower()
text=text.lower()
words=text.split()
if not abbrev:
return True
if abbrev and not text:
return False
if abbrev[0]!=text[0]:
return False
else:
return (is_abbrev(abbrev[1:],' '.join(words[1:])) or
any(is_abbrev(abbrev[1:],text[i+1:])
for i in range(len(words[0]))))
for abbrev,text,answer in tests:
result=is_abbrev(abbrev,text)
print(abbrev,text,result,answer)
assert result==answer

Here's a way to accomplish what you seem to want to do
import re
def is_abbrev(abbrev, text):
pattern = ".*".join(abbrev.lower())
return re.match("^" + pattern, text.lower()) is not None
The caret makes sure that the first character of the abbreviation matches the first character of the word, it should be true for most abbreviations.
Edit:
Your new update changed the rules a bit. By using "(|.*\s)" instead of ".*" the characters in the abbreviation will only match if they are next to each other, or if the next character appears at the start of a new word.
This will correctly match fck with FC Kopenhavn, but fco will not.
However, matching aik with allmanna idrottskluben will not work, as that requires knowledge of the swedish language and is not as trivial to do.
Here's the new code with the minor modification
import re
def is_abbrev(abbrev, text):
pattern = "(|.*\s)".join(abbrev.lower())
return re.match("^" + pattern, text.lower()) is not None

#Ocaso Protal said in comment how should you decide that aik is valid, but aid is not valid? and he is right.
The algo which came in my mind is to work with word threshold (number of words separated by space).
words = string.strip().split()
if len(words) > 2:
#take first letter of every word
elif len(words) == 2:
#take two letters from first word and one letter from other
else:
#we have single word, take first three letter or as you like
you have to define your logic, you can't find abbreviation blindly.

Your algorithm seems simple - the abbreviation is the Concatenation of all upper case letters.
so:
upper_case_letters = "QWERTYUIOPASDFGHJKLZXCVBNM"
abbrevation = ""
for letter in word_i_want_to_check:
if letter in letters:
abbrevation += letter
for abb in _list_of_abbrevations:
if abb=abbrevation:
great_success()

This might be good enough.
def is_abbrevation(abbrevation, word):
lowword = word.lower()
lowabbr = abbrevation.lower()
for c in lowabbr:
if c not in lowword:
return False
return True
print is_abbrevation('fck', 'FC Kopenhavn')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Unable to properly parse a string via character modification - python

Related

Python 3 - How to capitalize first letter of every sentence when translating from morse code

Find multiple list items within a string

Python: if a word contains a digit

Python regular expression to remove space and capitalize letters where the space was?

Check if a string is a possible abbrevation for a name

Categories

Resources