How to understand this code, to split an array in Python? - python

I'm a bit confused about a line in Python:
We use Python and a custom function to split a line: we want what is between quotes to be a single entry in the array.
The line is, for example:
"La Jolla Bank, FSB",La Jolla,CA,32423,19-Feb-10,24-Feb-10
So "La Jolla Bank, FSB" should be a single entry in the array.
And I'm not sure I understand this code:
The first char is a quote '"', so the variable "quote" is set to its inverse, so set to "TRUE".
Then we check the comma, AND if quote is set to its inverse, so if quote is TRUE, which is the case when we are inside the quotes.
We cut it with current="", and this is where I don't understand: we are still between the quotes, so normally we should not cut it now! edit: so and not quote means "false", and not "the opposite of", thanks !
Code:
def mysplit(string):
    quote = False
    retval = []
    current = ""
    for char in string:
        if char == '"':
            quote = not quote
        elif char == ',' and not quote:  # the first comma is still in the quotes, and quote is set to TRUE, so we should not cut current here...
            retval.append(current)
            current = ""
        else:
            current += char
    retval.append(current)
    return retval

You're viewing it as though both if char == '"' and elif char == ',' and not quote were run.
However the if statement explicitly makes it so that only one will run.
Either, quote will be inverted OR the current value will get cut.
In the case where the current char is ", then the logic will be called to invert the quote flag. But the logic to cut the string will not run.
In the case where the current char is ,, then the logic for inverting the flag will NOT run, but the logic to cut the string will if the quote flag is not set.
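A quick way to check this is to run the function on the sample line from the question. The sketch below repeats mysplit from the question unchanged apart from the added comments; the variable names line and fields are only illustrative.

```python
def mysplit(string):
    quote = False
    retval = []
    current = ""
    for char in string:
        if char == '"':
            # Only this branch runs for a quote: the flag is toggled, nothing is cut.
            quote = not quote
        elif char == ',' and not quote:
            # Only reached for a comma seen while quote is False (outside quotes).
            retval.append(current)
            current = ""
        else:
            # Everything else, including a comma inside quotes, is accumulated.
            current += char
    retval.append(current)
    return retval


line = '"La Jolla Bank, FSB",La Jolla,CA,32423,19-Feb-10,24-Feb-10'
fields = mysplit(line)
print(fields)
# ['La Jolla Bank, FSB', 'La Jolla', 'CA', '32423', '19-Feb-10', '24-Feb-10']
```

The comma inside the quoted bank name falls through to the else branch, which is why it stays in the first field.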

That is initializing current to the empty string, wiping out whatever it may have been set to before.
As long as you are not inside quotes (i.e. quote is False), a , marks the end of the field. Whatever you have accumulated into current is the content of that field, so append it to retval and reset current to the empty string, ready for the next field.
That said, this looks like you're dealing with a .csv input. There is a csv module that can deal with this for you.
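As a rough sketch of the csv suggestion, the standard-library csv.reader accepts any iterable of lines, so a single line can be wrapped in a list. The variable names here are illustrative only.

```python
import csv

line = '"La Jolla Bank, FSB",La Jolla,CA,32423,19-Feb-10,24-Feb-10'

# csv.reader takes an iterable of lines; next() pulls out the first parsed row.
fields = next(csv.reader([line]))
print(fields)
# ['La Jolla Bank, FSB', 'La Jolla', 'CA', '32423', '19-Feb-10', '24-Feb-10']
```

For a real file you would normally pass the open file object to csv.reader and loop over the rows it yields.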

current is reset to empty because in the case where you have encountered ',' and you are not under "" quotes you should interpret that as an end of a "token".
This is definitely not pythonic, for char in string makes me cringe and whoever wrote this code should have used regex.

What you're looking at is a condensed version of a Finite State Machine, used by most language parsing programs.
Let's see if I can't annotate it:
def mysplit(string):
    # We start out at the beginning of the string NOT in between quotes
    quote = False
    # Hold each element that we split out
    retval = []
    # This variable holds whatever the current item we're interested in is
    # e.g: If we're in a quote, then it's everything (including commas)
    # otherwise it's everything UP UNTIL the next comma
    current = ""
    # Scan the string character by character
    for char in string:
        # We hit a quote, so turn on QUOTE SCANNING MODE!!!
        # If we're in quote scanning mode, turn it off
        if char == '"':
            quote = not quote
        # We hit a comma, and we're not in quote scanning mode
        elif char == ',' and not quote:
            # We got what we want, let's put it in the return value
            # and then reset our current item to nothing so we can prepare for the next item.
            retval.append(current)
            current = ""
        else:
            # Nothing special, let's just keep building up our current item
            current += char
    # We're done with all the characters, let's put together whatever we were working on when we ran out of characters
    retval.append(current)
    # Return it!
    return retval

This is not the best code for splitting, but it is pretty straightforward:
def mysplit(string):
    quote = False
    retval = []
    current = ""
    # First you set current to the empty string. The following line
    # will loop through the string to be split and pull characters out of it
    # one by one... setting 'char' to be the value of the next character
    for char in string:
        # the following code toggles 'quote' whenever it sees a quote character,
        # cuts at a comma outside quotes, and otherwise adds the current
        # character to the 'current' variable
        if char == '"':
            quote = not quote
        elif char == ',' and not quote:
            retval.append(current)
            ### if we see the comma, it will append whatever is accumulated in current to the
            ### return result.
            ### then you have to reset the value in current to let the next word accumulate
            current = ""  # why do we cut current here?
        else:
            current += char
    ### after the last char is seen, we still have leftover characters in current which
    ### we can just shove into the final result
    retval.append(current)
    return retval
Here is an example run:
Let string be 'a,bbb,ccc'
Step  char  current  retval
1     a     a        {}
2     ,              {a}      ### Current is reset
3     b     b        {a}
4     b     bb       {a}
5     b     bbb      {a}
6     ,              {a,bbb}  ### Current is reset
and so on
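If you would rather generate such a trace than write it by hand, here is a small sketch. trace_split is a hypothetical helper (not part of the original code) that runs the same loop and records the state after each character.

```python
def trace_split(string):
    """Same logic as mysplit, but also records (char, current, retval) per step."""
    quote = False
    retval = []
    current = ""
    steps = []
    for char in string:
        if char == '"':
            quote = not quote
        elif char == ',' and not quote:
            retval.append(current)
            current = ""
        else:
            current += char
        # Copy retval so later appends do not change the recorded snapshot.
        steps.append((char, current, list(retval)))
    retval.append(current)
    return retval, steps


result, steps = trace_split('a,bbb,ccc')
for number, (char, current, retval) in enumerate(steps, start=1):
    print(number, repr(char), repr(current), retval)
print(result)
```

Each printed row corresponds to one row of the table above, with retval shown as a Python list.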

OK you aren't quite there!
1. the first char is a quote '"', so the variable "quote" is set to its inverse, so set to "TRUE".
Good! So quote was set to the inverse of whatever it was previously. At the beginning of the program it was False, so when " is seen, it becomes True. But vice versa, if it was True and a quote is seen, it becomes False.
In other words, this line of the program changes quote from whatever it was before that line. This is called 'toggling'.
2. then we check the comma, AND if quote is set to its inverse, so if quote is TRUE, which is the case when we are inside the quotes.
This isn't quite right. not quote means "only if quote is False". This has nothing to do with whether it is 'set to its inverse'. No variable can be equal to its own inverse! It is like saying X=True and X=False: obviously nonsense.
quote is always either True or False, and nothing else!
3. we cut it with current="", and this is where I don't understand: we are still between the quotes, so normally we should not cut it now!
So hopefully you can see now that you are not between the quotes if you reach this line. The not quote ensures that you don't cut inside a quote, because not quote really means just that: not in a quote!

Related

How to end a program when the line ends in a period character?

How do I end a program that reads an input line by line and it ends when there's a period (whitespace doesn't matter)?
For example:
input = "HI
bye
."
the program should end after it reaches the period.
I tried doing two things:
if line == ".":
    break
if "." in line:
    break
but the first one doesn't consider whitespace, and the second one doesn't consider "." in numbers like 2.1.
if line.replace(" ", "")[-1] == ".":
    break
.replace(" ", "") removes all spaces, and [-1] takes the last character of the string
You need .strip() to remove whitespace and check the ending character with .endswith():
for line in f:
    if line.strip().endswith("."):
        terminate...
There is a method for strings called endswith, but honestly I would check if the string ends with a '.' through indexing.
if my_str[-1] == '.':
    do_something()
But this also depends on how your string is received. Is it literally from an input? Is it from a file? You may need to add something additional per the circumstance
A few remarks:
you should use triple double quotation marks if you want to literally include a multiline string
I'm assuming you want to go over the input line by line, not just all at once
don't reuse the names of builtins, standard-library names, or anything else already defined as your own variable names (keywords can't be used at all); reusing an existing name is called shadowing
Here's what you might be after:
from io import StringIO
# you need triple double quotes to have a multiline string like this
# also, don't name it `input`, that shadows the `input()` function
text = """HI
bye
."""
for line in StringIO(text):
    if line.strip()[-1] == ".":
        print('found the end')
        break
Note that the StringIO stuff is only there to go over text line by line. The important part, in answering your question, is if line.strip()[-1] == ".":
This solution also works when your text looks like this, for example:
text = """HI
some words
bye. """ # note the space at the end, and the word in front of the period
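One caveat worth sketching: indexing with [-1] raises IndexError on an empty string, so a blank line in the input would crash the check, whereas endswith simply returns False. The helper name lines_until_period and the sample text below are made up for illustration.

```python
from io import StringIO


def lines_until_period(stream):
    """Collect lines up to and including the first one that ends with a period."""
    collected = []
    for line in stream:
        collected.append(line.rstrip("\n"))
        # strip() drops surrounding whitespace; endswith is safe on empty strings.
        if line.strip().endswith("."):
            break
    return collected


# Includes a blank line, a number containing a period, and trailing spaces.
sample = StringIO("HI\n\nversion 2.1 is out\nbye.  \nnot read\n")
seen = lines_until_period(sample)
print(seen)
```

The line containing 2.1 does not stop the loop because the period is not the last non-whitespace character.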
If you want to end the string at the exact dot, you can try this:
text = '''HI
bye
.
hello
bye'''
index = text.find('.') # gets the index of the first dot
print(text[:index+1])

Can't wrap my head around how to remove a list of characters from another list

I've been able to isolate the list (or string) of characters I want excluded from a user entered string. But I don't see how to then remove all these unwanted characters. After I do this, I think I can try joining the user string so it all becomes one alphabet input like the instructions say.
Instructions:
Remove all non-alpha characters
Write a program that removes all non-alpha characters from the given input.
For example, if the input is:
-Hello, 1 world$!
the output should be:
Helloworld
My code:
userEntered = input()
makeList = userEntered.split()
def split(userEntered):
    return list(userEntered)
if userEntered.isalnum() == False:
    for i in userEntered:
        if i.isalpha() == False:
            #answer = userEntered[slice(userEntered.index(i))]
            reference = split(userEntered)
            excludeThis = i
            print(excludeThis)
When I print excludeThis, I get this as my output:
-
,
1
$
!
So I think I might be on the right track. I need to figure it out how to get these characters out of the user input. Any help is appreciated.
Loop over the input string. If the character is alphabetic, add it to the result string.
userEntered = input()
result = ''
for char in userEntered:
    if char.isalpha():
        result += char
print(result)
This can also be done with a regular expression:
import re
userEntered = input()
result = re.sub(r'[^a-z]', '', userEntered, flags=re.I)
The regexp [^a-z] matches anything except an alphabetic character. The re.I flag makes it case-insensitive. These are all replaced with an empty string, which removes them.
There are basically two main parts to this: distinguish alpha from non-alpha, and get a string with only the former. If isalpha() is satisfactory for the former, then that leaves the latter. My understanding is that the solution considered most Pythonic would be to join a comprehension. This would look like this:
''.join(char for char in userEntered if char.isalpha())
BTW, there are several places in the code where you are making it more complicated than it needs to be. In Python, you can iterate over strings, so there's no need to convert userEntered to a list. isalnum() checks whether the string is all alphanumeric, so it's rather irrelevant (alphanumeric includes digits). You shouldn't ever compare a boolean to True or False, just use the boolean. So, for instance, if i.isalpha() == False: can be simplified to just if not i.isalpha():.
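To make the comparison concrete, here is a small sketch of the join approach next to the regex one. The function names keep_alpha and keep_ascii_letters are made up for this example. The two agree on the sample input from the question but differ on accented letters, because str.isalpha() accepts any Unicode letter while [^a-z] only keeps ASCII ones.

```python
import re


def keep_alpha(text):
    # Join only the characters that str.isalpha() accepts (any Unicode letter).
    return ''.join(ch for ch in text if ch.isalpha())


def keep_ascii_letters(text):
    # Remove everything that is not an ASCII letter, ignoring case.
    return re.sub(r'[^a-z]', '', text, flags=re.I)


sample = "-Hello, 1 world$!"
print(keep_alpha(sample))                   # Helloworld
print(keep_ascii_letters(sample))           # Helloworld
print(keep_alpha("naïve café 42"))          # naïvecafé
print(keep_ascii_letters("naïve café 42"))  # navecaf
```

Which behavior is right depends on whether the exercise means "alphabetic" in the ASCII sense or the Unicode sense.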

How to remove a character from a string within a list PYTHON

I currently have the following code but I am struggling to figure out the code that will remove the punctuation from the strings within the list. Please let me know if you know what I should input.
"""
This function checks to see if the last character is either punctuation or a
"\n" character and removes these unwanted characters.
Question: 1
"""
import string
result = string.punctuation
def myFun(listA):
    A = listA
    for element in A:
        for i in element:
            if i in result:
                #what do i put here
    print(A)
myFun(['str&', 'cat', 'dog\n', 'myStr.'])
In Python, a string behaves much like a list of characters, so you can access the last character of a string s with s[-1]. Note that "\n" is a single character, so s[-1] covers the newline case too.
Now, you can perform a simple check on s[-1] first with
if s[-1] in string.punctuation
and remove it with s = s[:-1]
and check for a trailing newline with:
if s[-1] == "\n"
and remove it the same way, with s = s[:-1]
Notes: After each check, remember to add a continue to move on to the next element; otherwise a string that ends with both a "\n" and a punctuation character is trimmed twice, so "test\n," becomes "test".
p.s. I intentionally didn't put them in your code and left that part for you to do.
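For readers who want to see the pieces assembled, here is one possible sketch. strip_last_unwanted is a hypothetical name, and it builds a new list rather than modifying the argument, since strings are immutable and slicing returns a new string.

```python
import string


def strip_last_unwanted(items):
    """Return a new list where one trailing punctuation or newline char is removed."""
    unwanted = string.punctuation + "\n"
    cleaned = []
    for element in items:
        # Guard against empty strings before indexing the last character.
        if element and element[-1] in unwanted:
            element = element[:-1]
        cleaned.append(element)
    return cleaned


print(strip_last_unwanted(['str&', 'cat', 'dog\n', 'myStr.']))
# ['str', 'cat', 'dog', 'myStr']
```

Only the last character is examined, so "test\n," becomes "test\n" here; if every trailing unwanted character should go, element.rstrip(unwanted) does that in one call.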

Is it possible to use find command in python in such a way that it does not look for the character inside double quotes?

I've a file and I want to find the index of some special character (\*) in it. This character might appear at several places in the file. for example:
hello \*this is a file*/
print "good\* morning"
I want to use find command to find index of \* only outside double quotes and not inside double quotes. Is there a way to implement this in python?
I know that find returns the index of first character that is found but I've a for loop that checks for this character and prints the index. But I want that whenever it encounters this character /* inside double quotes, it should skip that character and move on to find next one on the file.
# Raw strings keep the backslash literal and avoid invalid-escape warnings
str1 = r'hello \*this is a file*/'
str2 = r'print "good\* morning"'
def find_index(_str):
    is_in_quotes = 0
    idx = 0
    while idx < len(_str):
        if _str[idx] == '"':
            # toggle between 0 (outside quotes) and 1 (inside quotes)
            is_in_quotes = 1 - is_in_quotes
        elif not is_in_quotes:
            if _str[idx: idx+2] == r'\*':
                return idx
        idx += 1
    return -1
print(find_index(str1))  # 6
print(find_index(str2))  # -1
The function returns -1 if it doesn't find it.
Let me know if it meets all your needs.
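Since the question mentions looping over every occurrence, here is a sketch that collects all matching indices instead of returning the first one. find_all_outside_quotes and its target parameter are names invented for this example, and it assumes quotes are never escaped or nested.

```python
def find_all_outside_quotes(text, target):
    """Return every index where target starts outside double quotes."""
    in_quotes = False
    positions = []
    for idx, char in enumerate(text):
        if char == '"':
            in_quotes = not in_quotes
        elif not in_quotes and text.startswith(target, idx):
            # str.startswith with a start offset avoids building slices.
            positions.append(idx)
    return positions


line = r'a \* b "c \* d" e \*'
print(find_all_outside_quotes(line, r'\*'))
# [2, 18]
```

The occurrence at index 10 is skipped because it sits between the two double quotes.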

Python regular expression to remove space and capitalize letters where the space was?

I want to create a list of tags from a user supplied single input box, separated by comma's and I'm looking for some expression(s) that can help automate this.
What I want is to supply the input field and:
remove all double+ whitespaces, tabs, new lines (leaving just single spaces)
remove ALL (single's and double+) quotation marks, except for comma's, which there can be only one of
in between each comma, i want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces
Here's what I have gathered around SO so far:
def no_whitespace(s):
    """Remove all whitespace & newlines. """
    return re.sub(r"(?m)\s+", "", s)
# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051
tag_list = ''.join(no_whitespace(tags_input))
# split into a list at comma's
tag_list = tag_list.split(',')
# remove any empty strings (since I currently don't know how to remove double comma's)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
tag_list = filter(None, tag_list)
I'm lost though when it comes to modifying that regex to remove all the punctuation except comma's and I don't even know where to begin for the capitalizing.
Any thoughts to get me going in the right direction?
As suggested, here are some sample inputs = desired_outputs
form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')
    # Then, we sanitize each of the tags. If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags. We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    # (list() is needed on Python 3, where filter returns an iterator.)
    return list(filter(None, map(sanitizeTag, rawTags)))
# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace. You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)
    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()
    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation). This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that are worth clarifying. First is the use of str.split() in sanitizeTag. This will turn ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
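To see how the three patterns differ, here is a small sketch run against a made-up sample string. It assumes Python 3, where str patterns already match Unicode by default, so the flags=re.UNICODE argument is omitted.

```python
import re

sample = "café_au lait!"

# ASCII only: the accented letter, the underscore and the "!" all become spaces.
ascii_only = re.sub(r'[^A-Za-z0-9]', ' ', sample)

# \W keeps Unicode letters, digits and the underscore.
unicode_word = re.sub(r'\W', ' ', sample)

# Adding |_ also turns underscores into spaces.
no_underscore = re.sub(r'\W|_', ' ', sample)

print(repr(ascii_only))     # 'caf  au lait '
print(repr(unicode_word))   # 'café_au lait '
print(repr(no_underscore))  # 'café au lait '
```

Only the last two variants preserve the accented letter.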
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
    # list() is needed on Python 3, where filter returns an iterator
    return list(filter(None, map(sanitizeTag, str.split(','))))
def sanitizeTag(str):
    words = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
    # list() is needed on Python 3, where filter returns an iterator
    return list(filter(None, map(sanitizeTag, str.split(','))))
def sanitizeTag(str):
    # On Python 3, filter over a string yields characters, so join them back first
    words = ''.join(filter(lambda c: c.isalnum() or c.isspace(), str)).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se#%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']
You could use a white list of characters allowed to be in a word, everything else is ignored:
import re
def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0] # leave unchanged
    elif nwords > 1: # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return '' # no word characters
This example uses \w word characters.
Example
tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$,
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))
Output
thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
I think this should work
import re
def toCamelCase(s):
    # remove all punctuation
    # modify to include other characters you may want to keep
    s = re.sub(r"[^a-zA-Z0-9\s]", "", s)
    # remove leading spaces
    s = re.sub(r"^\s+", "", s)
    # camel case
    s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)
    # remove all punctuation and spaces
    s = re.sub(r"[^a-zA-Z0-9]", "", s)
    return s
tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]
the key here is to make use of re.sub to make the replacements you want.
EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces
EDIT : Moved "if s" after the toCamelCase call
