Confusion with string split method in python

Confusion with string split method in python - python

Consider the following example
a= 'Apple'
b = a.split(',')
print(b)
Output is ['Apple'].
I am not getting why is it returning a list even when there is no ',' character in Apple
There might be case when we use split method we are expecting more than one element in list but since we are splitting based on separator not present in string, there will be only one element, wouldn't it be better if this mistake is caught during this split method itself

The behaviour of a.split(',') when no commas are present in a is perfectly consistent with the way it behaves when there are a positive number of commas in a.
a.split(',') says to split string a into a list of substrings that are delimited by ',' in a; the delimiter is not preserved in the substrings.
If 1 comma is found you get 2 substrings in the list, if 2 commas are found you get 3 substrings in the list, and in general, if n commas are found you get n+1 substrings in the list. So if 0 commas are found you get 1 substring in the list.
If you want 0 substrings in the list, then you'll need to supply a string with -1 commas in it. Good luck with that. :)

The docstring of that method says:
Return a list of the words in the string S, using sep as the delimiter string.
The delimiter is used to separate multiple parts of the string; having only one part is not an error.

That's the way split() function works. If you do not want that behaviour, you can implement your my_split() function as follows:
def my_split(s, d=' '):
return s.split(d) if d in s else s

Related

How to remove a substrings from a list of strings?

I have a list of strings, all of which have a common property, they all go like this "pp:actual_string". I do not know for sure what the substring "pp:" will be, basically : acts as a delimiter; everything before : shouldn't be included in the result.
I have solved the problem using the brute force approach, but I would like to see a clever method, maybe something like regex.
Note : Some strings might not have this "pp:string" format, and could be already a perfect string, i.e. without the delimiter.
This is my current solution:
ll = ["pp17:gaurav","pp17:sauarv","pp17:there","pp17:someone"]
res=[]
for i in ll:
g=""
for j in range(len(i)):
if i[j] == ':':
index=j+1
res.append(i[index:len(i)])
print(res)
Is there a way that I can do it without creating an extra list ?

Whilst regex is an incredibly powerful tool with a lot of capabilities, using a "clever method" is not necessarily the best idea you are unfamiliar with its principles.
Your problem is one that can be solved without regex by splitting on the : character using the str.split() method, and just returning the last part by using the [-1] index value to represent the last (or only) string that results from the split. This will work even if there isn't a :.
list_with_prefixes = ["pp:actual_string", "perfect_string", "frog:actual_string"]
cleaned_list = [x.split(':')[-1] for x in list_with_prefixes]
print(cleaned_list)
This is a list comprehension that takes each of the strings in turn (x), splits the string on the : character, this returns a list containing the prefix (if it exists) and the suffix, and builds a new list with only the suffix (i.e. item [-1] in the list that results from the split. In this example, it returns:
['actual_string', 'perfect_string', 'actual_string']

Here are a few options, based upon different assumptions.
Most explicit
if s.startswith('pp:'):
s = s[len('pp:'):] # aka 3
If you want to remove anything before the first :
s = s.split(':', 1)[-1]
Regular expressions:
Same as startswith
s = re.sub('^pp:', '', s)
Same as split, but more careful with 'pp:' and slower
s = re.match('(?:^pp:)?(.*)', s).group(1)

python: what does the comma do in += s,?

I am doing a problem, the input is strings:
["abc","bcd","acef","xyz","az","ba","a","z"]
The code is listed below.
def groupStrings(self, strings):
groups = collections.defaultdict(list)
for s in strings:
tmp=[0]*len(s)
for i in range(len(s)):
tmp[i]=(ord(s[i])-ord(s[0]))%26
tmptuple=tuple(tmp)
groups[tmptuple] += s,
return groups.values()
So in groups[tmptuple]+=s,
if I remove the comma ','
I get
[["a","b","c","b","c","d","x","y","z"],["a","c","e","f"],["a","z"],["a","z","b","a"]]
instead of
[["abc","bcd","xyz"],["acef"],["a","z"],["az","ba"]]
The groups just does not add the whole string s, can anyone explain why does the comma make it different and why I could not do it without the comma?

The trailing comma makes a tuple, with a single element, s. Python doesn't require parentheses to make a tuple unless there is ambiguity (e.g. with function call parens); aside from the empty tuple (()), you can usually make tuples with just commas, no parentheses at all. In this case, a single trailing comma, s,, is equivalent to (s,).
Since groups has list values, this means that doing += s, is equivalent to .append(s) (technically, it's closer to .extend((s,)), but the end result is the same). Someone is probably trying to save a few keystrokes.
If you omitted the comma, it would be doing list += str, interpreting the str as a sequence of characters and extending the list with each of the resulting len 1 strings, as you observed.

How to Check if the substring is matching in a list of strings in Python

I have a list and I want to find if the string is present in the list of strings.
li = ['Convenience','Telecom Pharmacy']
txt = '1 convenience store'
I want to match the txt with the Convenience from the list.
I have tried
if any(txt.lower() in s.lower() for s in li):
print s
print [s for s in li if txt in s]
Both the methods didn't give the output.
How to match the substring with the list?

You could use set() and intersection:
In [19]: set.intersection(set(txt.lower().split()), set(s.lower() for s in list1))
Out[19]: {'convenience'}

I think split is your answer. Here is the description from the python documentation:
string.split(s[, sep[, maxsplit]])
Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary
strings of whitespace characters (space, tab, newline, return,
formfeed). If the second argument sep is present and not None, it
specifies a string to be used as the word separator. The returned list
will then have one more item than the number of non-overlapping
occurrences of the separator in the string. If maxsplit is given, at
most maxsplit number of splits occur, and the remainder of the string
is returned as the final element of the list (thus, the list will have
at most maxsplit+1 elements). If maxsplit is not specified or -1, then
there is no limit on the number of splits (all possible splits are
made).
The behavior of split on an empty string depends on the value of sep. If sep is not specified, or specified as None, the result will be
an empty list. If sep is specified as any string, the result will be a
list containing one element which is an empty string.
Use the split command on your txt variable. It will give you a list back. You can then do a compare on the two lists to find any matches. I personally would write the nested for loops to check the lists manually, but python provides lots of tools for the job. The following link discusses different approaches to matching two lists.
How can I compare two lists in python and return matches
Enjoy. :-)

I see two things.
Do you want to find if the pattern string matches EXACTLY an item in the list? In this case, nothing simpler:
if txt in list1:
#do something
You can also do txt.upper() or .lower() if you want list case insensitive
But If you want as I understand, to find if there is a string (in the list) which is part of txt, you have to use "for" loop:
def find(list1, txt):
#return item if found, false otherwise
for i in list1:
if i.upper() in txt.upper(): return i
return False
It should work.
Console output:
>>>print(find(['Convenience','Telecom Pharmacy'], '1 convenience store'))
Convenience
>>>

You can try this,
>> list1 = ['Convenience','Telecom Pharmacy']
>> txt = '1 convenience store'
>> filter(lambda x: txt.lower().find(x.lower()) >= 0, list1)
['Convenience']
# Or you can use this as well
>> filter(lambda x: x.lower() in txt.lower(), list1)
['Convenience']

String splitting in python by finding non-zero character

I want to do the following split:
input: 0x0000007c9226fc output: 7c9226fc
input: 0x000000007c90e8ab output: 7c90e8ab
input: 0x000000007c9220fc output: 7c9220fc
I use the following line of code to do this but it does not work!
split = element.rpartition('0')
I got these outputs which are wrong!
input: 0x000000007c90e8ab output: e8ab
input: 0x000000007c9220fc output: fc
what is the fastest way to do this kind of split?
The only idea for me right now is to make a loop and perform checking but it is a little time consuming.
I should mention that the number of zeros in input is not fixed.

Each string can be converted to an integer using int() with a base of 16. Then convert back to a string.
for s in '0x000000007c9226fc', '0x000000007c90e8ab', '0x000000007c9220fc':
print '%x' % int(s, 16)
Output
7c9226fc
7c90e8ab
7c9220fc

input[2:].lstrip('0')
That should do it. The [2:] skips over the leading 0x (which I assume is always there), then the lstrip('0') removes all the zeros from the left side.
In fact, we can use lstrip ability to remove more than one leading character to simplify:
input.lstrip('x0')

format is handy for this:
>>> print '{:x}'.format(0x000000007c90e8ab)
7c90e8ab
>>> print '{:x}'.format(0x000000007c9220fc)
7c9220fc

In this particular case you can just do
your_input[10:]
You'll most likely want to properly parse this; your idea of splitting on separation of non-zero does not seem safe at all.
Seems to be the XY problem.

If the number of characters in a string is constant then you can use
the following code.
input = "0x000000007c9226fc"
output = input[10:]
Documentation
Also, since you are using rpartitionwhich is defined as
str.rpartition(sep)
Split the string at the last occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself.
Since your input can have multiple 0's, and rpartition only splits the last occurrence this a malfunction in your code.

Regular expression for 0x00000 or its type is (0x[0]+) and than replace it with space.
import re
st="0x000007c922433434000fc"
reg='(0x[0]+)'
rep=re.sub(reg, '',st)
print rep

Remove substrings of variable length from string

I have a list of strings where all of the strings roughly follow the format 'foo\tbar\tfoo\n' in that there are three segments of variable length that are separated by two tabs (\t) and with a newline indicator at the end (\n).
I want to remove everything except for the text before the first \, so that it would return as 'foo'. Given that the first segment is of variable length, I'm not sure how I can do that.

Use str.split():
>>> string = 'foo\tbar\tfoo\n'
>>> string.split('\t', 1)[0]
'foo'
This splits the string by the first occurrence of the '\t' tab character, which returns a list with two elements. The [0] selects the first element in the list, which is the part of the string before the first '\t' occurrence.

Just search for the first \t character, and get everything before it. Slicing makes this easy.
newstr = oldstr[:oldstr.find("\t")]

Try with:
t = 'foo\tbar\tfoo\n'
t[:t.index("\t")]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.