Python - splitting a string twice - python

I have some data that looks like "string,string,string:otherstring,otherstring,otherstring".
I want to manipulate the first set of "string"s one at a time. If I split the input on the colon, I end up with a list, which I cannot split again because "'list' object has no attribute 'split'". Alternatively, if I split on the comma, that returns everything (including the items after the colon, which I don't want to manipulate). rsplit has the same problem. Even with a list, I could still manipulate the first entries by using [0], [1] etc., except that the number of "string"s is always changing, so I cannot hardcode the indices. Any ideas on how to get around this list limitation?

Try this:
import re
s = 'string,string,string:otherstring,otherstring,otherstring'
re.split(r'[,:]', s)
=> ['string', 'string', 'string', 'otherstring', 'otherstring', 'otherstring']
We're using regular expressions and the split() method for splitting a string with more than one delimiter. Or if you want to manipulate the first group of strings in a separate way from the second group, we can create two lists, each one with the strings in each group:
[x.split(',') for x in s.split(':')]
=> [['string', 'string', 'string'], ['otherstring', 'otherstring', 'otherstring']]
… Or if you just want to retrieve the strings in the first group, simply do this:
s.split(':')[0].split(',')
=> ['string', 'string', 'string']
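To change only the first group and leave the rest of the string alone, something along these lines may work. This is a minimal sketch: the helper name transform_first_group and the use of str.upper as the manipulation are placeholders, not part of the original question.

```python
def transform_first_group(data, func):
    """Apply func to each comma-separated item before the first colon."""
    # partition splits on the first ':' only; if there is no colon,
    # sep and tail are empty strings and the whole input is treated as the head.
    head, sep, tail = data.partition(':')
    items = [func(item) for item in head.split(',')]
    # Reassemble the string, leaving everything after the colon untouched.
    return ','.join(items) + sep + tail


s = 'string,string,string:otherstring,otherstring,otherstring'
print(transform_first_group(s, str.upper))
# STRING,STRING,STRING:otherstring,otherstring,otherstring
```

Because the items are processed in a loop, the number of "string"s in the first group does not need to be known in advance.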

Use a couple of join() calls to convert back to a string before splitting again:
>>> string = "string,string,string:otherstring,otherstring,otherstring"
>>> ' '.join(' '.join(string.split(':')).split(',')).split()
['string', 'string', 'string', 'otherstring', 'otherstring', 'otherstring']
>>>

text = "string,string,string:otherstring,otherstring,otherstring"
replace = text.replace(":", ",").split(",")
print(replace)
['string', 'string', 'string', 'otherstring', 'otherstring', 'otherstring']

Related

How to convert a string containing a list of values that are not comma-separated to a list?

I'm new to Python and am wondering what is the most elegant way to convert a string of the form "[1 2 3]" to a list? If the string contains a comma-separated list of values, then the solution is simple:
str = "['x', 'y', 'z']"
arr = eval(str)
print(isinstance(arr, list))  # True
However, this solution doesn't work if the list in the string is not comma separated, e.g. "['x' 'y' 'z']".
Is there a common way to solve this without having to manually parse the string? The solution should not be type dependent, e.g. both "[1 2 3]" and "['multiple words 1' 'multiple words 2']" should be converted normally.
In this case shlex might be a solution.
import shlex
s = "['x' 'y' 'z']"
# First get rid of the opening and closing brackets
s = s.strip('[]')
# Split the string using shell-like syntax
lst = shlex.split(s)
print(type(lst), lst)
# Prints: <class 'list'> ['x', 'y', 'z']
But you'll have to check if it fulfills your requirements.
import re
str = "[1 2 a 'multiple words 1' 'multiple words 2' 'x' 'y' 'z']"
print([''.join(x) for x in re.findall(r"'(.*?)'|(\S+)", re.sub(r'^\[(.*)\]', r'\1', str))])
>>> ['1', '2', 'a', 'multiple words 1', 'multiple words 2', 'x', 'y', 'z']
The first obvious step is to get rid of the [...] because they don't add anything useful to the results ...
Then it works because of the regex in findall: this will only match either anything between quotes or any sequence of non-spaces.
We don't want the quotes themselves (or do we? – it is not specified) so the regex grouping allows it to return just the inner parts.
Then we always get pairs of one element empty and one filled (('', '1'), ('', '2') and so on) so we need an additional cleaning loop.
This code cannot see the difference between [1 2 3] and ['1' '2' '3'], but that's no problem as such a variant is not specified in the question.

How to parse string to list without eval or ast.literal_eval

Is there any way to convert a list containing unicode strings to a proper list without using eval() or ast.literal_eval() in Python?
For example:
"[u'hello', u'hi']"
to
['hello', 'hi']
Could this be what you are looking for?
a = "[u'hello', u'hi']".translate(str.maketrans('', '', "[]u' "))
a = a.split(',')
print(a) #['hello', 'hi']
This fails when a string itself contains a 'u', so instead you can use:
a = "[u'hello', u'hi', u'uyou']".translate(str.maketrans('', '', "[]' "))
a = [item[1:] for item in a.split(',')]
It depends a bit on the formatting of your input. If
it only contains strings,
there is always a u in front of each string,
each string is inside single quotes ',
there is never whitespace before a , but always exactly one space after it, and
there is no whitespace before or after the [ and ],
then you could simply strip the leading [u' and the trailing '] (for convenience I just slice from the fourth character up to, but not including, the second to last), then split at ', u' and you're done:
>>> s = "[u'hello', u'hi']"
>>> s[3:-2].split("', u'")
['hello', 'hi']
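If the formatting is less regular than that, a regular expression may be more forgiving. The following is only a sketch: parse_quoted_list is a made-up helper name, and it assumes every item is single-quoted and contains no embedded single quotes.

```python
import re


def parse_quoted_list(text):
    """Return the contents of each single-quoted item in text."""
    # u?        an optional Python 2 style unicode prefix
    # '([^']*)' a single-quoted item; assumes no quotes inside the item
    return re.findall(r"u?'([^']*)'", text)


print(parse_quoted_list("[u'hello', u'hi', u'uyou']"))
# ['hello', 'hi', 'uyou']
```

Unlike the translate() approach, this keeps any u or space characters that appear inside the strings themselves.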

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]. The date is stripped of formatting and is digits only, in the order YYYYMMDD (I don't think that matters).
I am trying to use re, and I have a working version:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What re.split('(^\d+)', test) actually does is split your test string on a run of one or more digits (here anchored to the start of the string).
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result in the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string which is before first set of digits.
To avoid that you can use filter:
>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile('(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
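The documented behaviour quoted above can be checked directly. This is a small sketch; split_keep_nonempty is a hypothetical helper name used here only to show one way of dropping the empty strings.

```python
import re


def split_keep_nonempty(pattern, text):
    """Split text on pattern and drop any empty strings from the result."""
    return [part for part in re.split(pattern, text) if part]


# A capturing separator that matches at the very start yields a leading ''.
print(re.split(r'(\d+)', '20170125NBCNightlyNews'))
# ['', '20170125', 'NBCNightlyNews']

# The same holds when the separator matches at the end of the string.
print(re.split(r'(\d+)', 'NBC20170125'))
# ['NBC', '20170125', '']

print(split_keep_nonempty(r'(\d+)', '20170125NBCNightlyNews'))
# ['20170125', 'NBCNightlyNews']
```

Note that filtering the empty strings gives up the guarantee that separators sit at fixed indices in the result.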

String split with default delimiter vs user defined delimiter

I tried a simple example with string split, but get some unexpected behavior. Here is the sample code:
def split_string(source, splitlist):
    for delim in splitlist:
        source = source.replace(delim, ' ')
    return source.split(' ')

out = split_string("This is a test-of the,string separation-code!", " ,!-")
print(out)
>>> ['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code', '']
As you can see, I got an extra empty string at the end of the list when I use space as delimiter argument for split() function. However, if I don't pass in any argument for split() function, I got no empty string at the end of the output list.
From what I read in python docs, they said the default argument for split() is space. So, why when I explicitly pass in a ' ' as delimiter, it creates an empty string at the end of the output list?
The docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
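With an explicit separator, an empty string appears wherever two separators are adjacent or a separator sits at the start or end of the string. In your case the trailing '!' is replaced by a space, so split(' ') yields an empty string after it. The same happens if you have multiple spaces separating two words.
For example,
'a    b'.split(' ') will return ['a', '', '', '', 'b']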
But I would suggest using split from the re module. Check the example below:
import re
print(re.split(r'[\s,; !]+', 'a b !!!!!!! , hello ;;;;; world'))
When we run the above piece, it outputs ['a', 'b', 'hello', 'world']
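Alternatively, the original function can be kept almost as it is. The sketch below only swaps split(' ') for split() with no argument, relying on the whitespace-splitting behaviour described in the docs excerpt; the parameter name delimiters is just a renamed splitlist.

```python
def split_string(source, delimiters):
    """Split source on any of the single-character delimiters."""
    # Turn every delimiter into a space first.
    for delim in delimiters:
        source = source.replace(delim, ' ')
    # With no argument, split() treats runs of whitespace as one separator
    # and ignores leading or trailing whitespace, so no '' is produced.
    return source.split()


print(split_string("This is a test-of the,string separation-code!", " ,!-"))
# ['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']
```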

Splitting string in Python starting from the first numeric character

I need to manage string in Python in this way:
I have this kind of strings with '>=', '=', '<=', '<', '>' in front of them, for example:
'>=1_2_3'
'<2_3_2'
what I want to achieve is splitting the strings to obtain, respectively:
'>=', '1_2_3'
'<', '2_3_2'
basically I need to split them starting from the first numeric character.
Is there a way to achieve this result with regular expressions, without iterating over the string and checking whether each character is a number or a '_'?
thank you.
This will do:
re.split(r'(^[^\d]+)', string)[1:]
Example:
>>> re.split(r'(^[^\d]+)', '>=1_2_3')[1:]
['>=', '1_2_3']
>>> re.split(r'(^[^\d]+)', '<2_3_2')[1:]
['<', '2_3_2']
import re
strings = ['>=1_2_3','<2_3_2']
for s in strings:
    mat = re.match(r'([^\d]*)(\d.*)', s)
    print(mat.groups())
Outputs:
('>=', '1_2_3')
('<', '2_3_2')
This just groups everything up until the first digit in one group, then that first digit and everything after into a second.
You can access the individual groups with mat.group(1), mat.group(2)
You can split using this zero-width regex (re.split supports empty matches only from Python 3.7 onward):
(?<=[<>=])(?=\d)
There's probably a better way but you can split with a capture then join the second two elements:
values = re.split(r'(\d)', '>=1_2_3', maxsplit = 1)
values = [values[0], values[1] + values[2]]
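Another option is to match the operator explicitly instead of looking for the first digit. This is a sketch under the assumption that the only valid prefixes are the five operators listed in the question; OPERATOR_RE and split_operator are names invented for this example.

```python
import re

# Two-character operators come first so '>=' is not matched as '>' alone.
OPERATOR_RE = re.compile(r'(>=|<=|[<>=])(.+)')


def split_operator(expr):
    """Return (operator, remainder) for strings such as '>=1_2_3'."""
    match = OPERATOR_RE.fullmatch(expr)
    if match is None:
        raise ValueError('no leading comparison operator in %r' % expr)
    return match.group(1), match.group(2)


for s in ['>=1_2_3', '<2_3_2']:
    print(split_operator(s))
# ('>=', '1_2_3')
# ('<', '2_3_2')
```

Compared with splitting on the first digit, this rejects inputs whose prefix is not one of the expected operators rather than silently accepting them.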
