How can I define string delimiters for splitting in the most efficient way, without needing many if statements?
I have strings that need to be split strictly into two-element lists. The problem is that those strings use different delimiters. For example:
'Hello: test1'. This one has split delimiter ': '. The other example would be:
'Hello - test1'. So this one would be ' - '. Also split delimiter could be ' -' or '- '. So if I know all variations of delimiters, how could I define them most efficiently?
First I did something like this:
strings = ['Hello - test', 'Hello- test', 'Hello -test']
for s in strings:
    delim = ' - '
    if len(s.split('- ', 1)) == 2:
        delim = '- '
    elif len(s.split(' -', 1)) == 2:
        delim = ' -'
    print(s.split(delim, 1)[1])
But then I got new strings with other, unexpected delimiters. Done this way, I would have to add even more ifs to check for delimiters like ': '. So I wondered whether there is a better way to define them (it is not a problem if I have to add new delimiters to some kind of list later on). Maybe regex or some other tool would help?
Put all the delimiters inside re.split, joined with the regex alternation (logical OR) operator |, as shown below.
re.split(r': | - | -|- ', string)
Add maxsplit=1 if you want to split only once.
re.split(r': | - | -|- ', string, maxsplit=1)
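If the delimiter list is likely to grow, the alternation can be built from a plain list instead of being typed by hand. This is a minimal sketch, not part of the answer above: the names `DELIMITERS`, `PATTERN` and `split_once` are hypothetical. Longer delimiters are sorted first so that ' - ' is tried before ' -' or '- '.

```python
import re

# Hypothetical delimiter list; extend it as new variants turn up.
DELIMITERS = [': ', ' - ', ' -', '- ']

# Try longer delimiters first and escape any regex metacharacters.
PATTERN = re.compile('|'.join(
    re.escape(d) for d in sorted(DELIMITERS, key=len, reverse=True)))

def split_once(text):
    """Split text at the first delimiter only (hypothetical helper)."""
    return PATTERN.split(text, maxsplit=1)

for s in ['Hello: test1', 'Hello - test1', 'Hello- test1', 'Hello -test1']:
    print(split_once(s))  # ['Hello', 'test1'] each time
```

Because of maxsplit=1, a later delimiter stays in the second element: split_once('a - b - c') gives ['a', 'b - c'].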
You can use the split function of the re module:
>>> import re
>>> strings = ['Hello1 - test1', 'Hello2- test2', 'Hello3 -test3', 'Hello4 :test4', 'Hello5 : test5']
>>> for s in strings:
...     re.split(" *[:-] *", s)
...
['Hello1', 'test1']
['Hello2', 'test2']
['Hello3', 'test3']
['Hello4', 'test4']
['Hello5', 'test5']
Put all the possible delimiter characters between the square brackets. The " *" on either side allows any number of spaces before or after the delimiter.
\s*[:-]\s*
You can split by this. Use re.split(r"\s*[:-]\s*", string). See the demo:
https://regex101.com/r/nL5yL3/14
You should use this if the delimiter can appear as ' - ', '- ' or ' -', possibly with several spaces around it.
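To keep the result strictly at two elements, the same pattern can be combined with maxsplit=1. A small sketch under that assumption; `SEP` and `split_pair` are hypothetical names, not from the answers above.

```python
import re

# \s* absorbs any run of whitespace around a single ':' or '-'.
SEP = re.compile(r'\s*[:-]\s*')

def split_pair(text):
    """Return at most two parts, split at the first ':' or '-'."""
    return SEP.split(text, maxsplit=1)

print(split_pair('Hello   -   test1'))    # ['Hello', 'test1']
print(split_pair('Hello:test1 - extra'))  # ['Hello', 'test1 - extra']
```

One caveat of this pattern: because the surrounding whitespace is optional, a hyphenated word such as 'well-known' would also be split.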
This isn't the best way, but if you want to avoid using re for some (or no) reason, this is what I would do:
>>> strings = ['Hello - test', 'Hello- test', 'Hello -test', 'Hello : test']
>>> delims = [':', '-'] # all possible delimiters; don't worry about spaces.
>>>
>>> for string in strings:
...     delim = next((d for d in delims if d in string), None)  # finds the first delimiter in delims that's present in the string (if there is one)
...     if not delim:
...         continue  # No delimiter! (I don't know how you want to handle this possibility; this code will simply skip the string altogether.)
...     print([s.strip() for s in string.split(delim, 1)])  # assuming you want them in list form.
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
This uses Python's native .split() to break the string at the delimiter, and then .strip() to trim the white space off the results, if there is any. I've used next to find the appropriate delimiter, but there are plenty of things you can swap that out with (especially if you like for blocks).
If you're certain that each string will contain at least one of the delimiters (preferably exactly one), then you can shave it down to this:
## with strings and delims defined...
>>> for string in strings:
...     delim = next(d for d in delims if d in string)  # raises StopIteration at this line if there is no delimiter in the string.
...     print([s.strip() for s in string.split(delim, 1)])
I'm not sure if this is the most elegant solution, but it uses fewer if blocks, and you won't have to import anything to do it.
I am splitting a text file using this tcl proc:
proc mcsplit "str splitStr {mc {\x00}}" {
return [split [string map [list $splitStr $mc] $str] $mc] }
# mcsplit --
#   Splits a string using another string as the separator
# Arguments:
#   str       string to split into pieces
#   splitStr  substring
#   mc        magic character that must not exist in the original string.
#             Defaults to the NULL character. Must be a single character.
# Results:
#   Returns a list of strings
The split command splits a string on each character that is in splitString. This version treats splitString as one combined string and splits the input at each occurrence of it.
My objective is to do the same in Python. Has anyone here done this before?
It's not very clear from your question whether the python split behavior is what you need. If you need to split at each occurrence of a multiple-character string, Python's regular split will do the job:
>>> 'this is a test'.split('es')
['this is a t', 't']
If, however, you want to split at any occurrence of multiple individual characters, you'll need to use re.split:
>>> import re
>>> re.split(r'[es]', 'this is a test')
['thi', ' i', ' a t', '', 't']
>>>
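For comparison with the Tcl proc, here is a minimal sketch of a Python counterpart. The name `mcsplit` is reused only for illustration; no magic character is needed, because str.split already treats its argument as one combined separator.

```python
import re

def mcsplit(text, split_str):
    """Split text at every occurrence of the whole substring split_str."""
    return text.split(split_str)

# Whole-substring split: '<>' is one separator.
print(mcsplit('one<>two<>three', '<>'))      # ['one', 'two', 'three']

# Per-character split: '<' and '>' are separate separators,
# so an empty string appears between each adjacent pair.
print(re.split(r'[<>]', 'one<>two<>three'))  # ['one', '', 'two', '', 'three']
```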
Is there any way to convert a list containing unicode strings to a proper list without using eval() or ast.literal_eval() in Python?
For example:
"[u'hello', u'hi']"
to
['hello', 'hi']
Could this be what you are looking for? (Python 2 only: there, str.translate(None, chars) deletes every character in chars.)
a = "[u'hello', u'hi']".translate(None, "[]u' ")
a = a.split(',')
print(a) #['hello', 'hi']
This fails when a string itself contains the letter u, so you can instead go with:
a = "[u'hello', u'hi', u'uyou']".translate(None, "[]' ")
a = [item[1:] for item in a.split(',')]  # item[1:] drops the leading u prefix
It depends a bit on the formatting of your input. If all of the following hold:
it only contains strings,
there is always a u in front of each string,
each string is inside single quotes ('),
there is never whitespace before a comma, but always exactly one space after it,
there is no whitespace before or after the [ and ],
then you can simply strip the leading [u' and the trailing '] (for convenience I just slice from the fourth character up to, but not including, the last two), then split at ', u' and you're done:
>>> s = "[u'hello', u'hi']"
>>> s[3:-2].split("', u'")
['hello', 'hi']
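Another option that avoids both eval() and slicing by position is to pull out whatever sits between the quotes with a regular expression. This is only a sketch, and it assumes that none of the strings contains an escaped or embedded single quote.

```python
import re

s = "[u'hello', u'hi', u'uyou']"

# Capture the text between each pair of single quotes; the optional
# leading u prefix is matched but not captured.
items = re.findall(r"u?'([^']*)'", s)
print(items)  # ['hello', 'hi', 'uyou']
```

Since the u prefix sits outside the capture group, a string that itself starts with u, such as 'uyou', is left intact.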
Anyone know how I can find the character in the center that is surrounded by spaces?
1 + 1
I'd like to be able to separate the + in the middle to use in a if/else statement.
Sorry if I'm not too clear, I'm a Python beginner.
I think you are looking for something like the split() method which will split on white space by default.
Suppose we have a string s
s = "1 + 1"
chunks = s.split()
print(chunks[1]) # Will print '+'
This regular expression will detect a single character surrounded by spaces, if the character is a plus, minus, multiplication or division sign: r' ([-+*/]) '. Note the spaces inside the apostrophes, and note that the hyphen comes first inside the brackets; written as [+-*/], the +-* part would be read as an (invalid) character range. The parentheses "capture" the character in the middle. If you need to recognize a different set of characters, change the set inside the brackets.
If you haven't dealt with regular expressions before, read up on the re module. They are very useful for simple text processing. The two relevant features here are "character classes" (the square brackets in my example) and "capturing parentheses" (the round parens).
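A short sketch of how such a pattern could feed an if/else, with the hyphen placed first inside the brackets so that it is read as a literal character. The function name `middle_operator` is hypothetical.

```python
import re

def middle_operator(expr):
    """Return the operator between two spaces, or None if there is none."""
    match = re.search(r' ([-+*/]) ', expr)
    return match.group(1) if match else None

op = middle_operator('1 + 1')
if op == '+':
    print('addition')
else:
    print('something else:', op)
```

re.search returns None when nothing matches, which is why the helper checks the match object before calling group(1).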
You can use regex:
import re
s = "1 + 1"
a = re.compile(r' (?P<sym>.) ')
print(a.search(s).group('sym'))  # prints +
import re
def find_between(string, start_=' ', end_=' '):
    re_str = r'{}([-+*/%^]){}'.format(start_, end_)
    try:
        return re.search(re_str, string).group(1)
    except AttributeError:
        # re.search returned None: no operator found between the markers
        return None
print(find_between('9 * 5', ' ', ' '))
If you don't know how many spaces separate your central character, I'd use the following:
s = '1 +   1'
middle = list(filter(None, s.split(' ')))[1]
print(middle)  # +
The split works as in the solution provided by Zac, but when you split on a single space explicitly and there is more than one space in a row, the returned list will contain a bunch of '' elements, which we can get rid of with filter(None, ...). (Calling s.split() with no argument collapses runs of whitespace by itself, so it never produces those empty strings.)
Then it's just a matter of extracting the second element.
Check it in action at https://eval.in/636622
If we look at it step by step, here is how it all works in a Python console:
>>> s = '1 +   1'
>>> s.split(' ')
['1', '+', '', '', '1']
>>> list(filter(None, s.split(' ')))
['1', '+', '1']
>>> list(filter(None, s.split(' ')))[1]
'+'
I have a text file something like -
$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ ....
I want to extract text between two $ signs and put that text into a list or a dict. How can I do this in python by reading the file?
I tried regex but it gives me alternate values of the text file:
f1 = open('some.txt','r')
lines = f1.read()
x = re.findall(r'$(.*?)$', lines, re.DOTALL)
I want the output as something like below -
['abc', 'defghjik', 'am here', 'not now']
['you', 'are not', 'here but go there']
Sorry but am new to python and trying to learn, any help appreciated! Thanks!
In regular expressions $ is a character of special meaning and needs to be escaped to match a literal character. Also to match multiple parts I would use a lookahead (?=...) assertion to assert matching a literal $ character.
>>> x = re.findall(r'(?s)\$\s*(.*?)(?=\$)', lines)
>>> [i.splitlines() for i in x]
[['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
$ has a special meaning in regex, so to match it literally you need to escape it first. Note that inside a character class ([]), $ and most other metacharacters lose their special meaning, so no escaping is required there. The following regex should do it:
\$\s*([^$]+)(?=\$)
Demo:
>>> lines = '''$ abc
defghjik
am here
not now
$ you
are not
here but go there
$'''
>>> it = re.finditer(r'\$\s*([^$]+)(?=\$)', lines, re.DOTALL)
>>> [x.group(1).splitlines() for x in it]
[['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
Regex may not actually be what you want: your desired output has every line as an individual entry in a list. I'd suggest just using lines.split('\n') and then iterating over the resulting list.
I'll write this as if you just need to print the text you want as output. Adapt as necessary.
with open('some.txt', 'r') as f1:
    lines = f1.read()
lists = []
for s in lines.split('\n'):
    if s.startswith('$'):
        # A '$' line starts a new section: print the previous one first.
        if lists:
            print(lists)
        lists = []
        s = s[1:].strip()  # keep any text that follows the '$' marker
    if s:
        lists.append(s)
if lists:
    print(lists)
Happy Python-ing! Welcome to the club. :)
$ holds a special meaning in a regex: it is an anchor. It matches at the end of the string or just before the newline at the end of the string. See the "Regular Expression Operations" page (the re module) in the Python documentation.
You can escape the $ sign by prefixing it with a '\' character, so it won't be treated as an anchor.
Better yet, you don't need to use regex at all here. You can use the split method of strings in python.
>>> string = '''$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ '''
>>> string.split('$')
['', ' abc\ndefghjik\nam here\nnot now\n', ' you\nare not\nhere but go there\n', ' ']
And you get a list. To remove the empty string entries if you want, you can do this:
a = string.split('$')
while a.count('') > 0:
    a.remove('')
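To get from the split result to the list-of-lists shape asked for in the question, each chunk still has to be trimmed and broken into lines. A minimal sketch, assuming that '$' never appears inside the section text itself; the variable names are illustrative.

```python
text = """$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ """

# Split on '$', drop chunks that are empty or only whitespace,
# then break each remaining chunk into its lines.
sections = [chunk.strip().splitlines()
            for chunk in text.split('$')
            if chunk.strip()]
print(sections)
# [['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
```

Filtering on chunk.strip() also discards the whitespace-only ' ' entry that the while loop above leaves behind.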
Reading parts of files often boils down to an "iteration pattern." There are a number of generators in the itertools package that can help. Or you can craft your own generator. For example:
def take_sections(predicate, iterable, firstpost=lambda x: x):
    i = iter(iterable)
    try:
        nextone = next(i)  # next(i) works on both Python 2.6+ and Python 3
        while True:
            batch = [firstpost(nextone)]
            nextone = next(i)
            while not predicate(nextone):
                batch.append(nextone)
                nextone = next(i)
            yield batch
    except StopIteration:
        # Input exhausted: emit the section that was still being collected.
        yield batch
        return
This is similar to itertools.takewhile, except it's more of a take-until loop (i.e. the test is at the bottom, not the top). It also has a built-in clean-up/post-processing function for the first line in a section (the "section marker"). Once you've abstracted this iteration pattern, you need to read the lines in the file, define how the section markers are identified and cleaned up, and run the generator:
with open('some.txt', 'r') as f1:
    lines = [l.strip() for l in f1.readlines()]
dollar_line = lambda x: x.startswith('$')
clean_dollar_line = lambda x: x[1:].lstrip()
print(list(take_sections(dollar_line, lines, clean_dollar_line)))
Yielding:
[['abc', 'defghjik', 'am here', 'not now'],
['you', 'are not', 'here but go there'],
['....']]
I have a string that I need to split on multiple characters without the use of regular expressions. For example, I would need something like the following:
>>> string = "hello there[my]friend"
>>> string.split(' []')
['hello','there','my','friend']
Is there anything in Python like this?
If you need multiple delimiters, re.split is the way to go.
Without using a regex, it's not possible unless you write a custom function for it.
Here's such a function - it might or might not do what you want (consecutive delimiters cause empty elements):
>>> def multisplit(s, delims):
...     pos = 0
...     for i, c in enumerate(s):
...         if c in delims:
...             yield s[pos:i]
...             pos = i + 1
...     yield s[pos:]
...
>>> list(multisplit('hello there[my]friend', ' []'))
['hello', 'there', 'my', 'friend']
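The caveat about consecutive delimiters can be seen, and worked around, as follows. The function is repeated so that the snippet runs on its own; dropping the empty pieces with a list comprehension is one possible choice, not part of the original answer.

```python
def multisplit(s, delims):
    """Yield the pieces of s between any of the characters in delims."""
    pos = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[pos:i]
            pos = i + 1
    yield s[pos:]

pieces = list(multisplit('hello there[[my]]friend', ' []'))
print(pieces)                    # ['hello', 'there', '', 'my', '', 'friend']
print([p for p in pieces if p])  # ['hello', 'there', 'my', 'friend']
```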
Solution without regexp:
from itertools import groupby
sep = ' []'
s = 'hello there[my]friend'
print([''.join(g) for k, g in groupby(s, sep.__contains__) if not k])
I've just posted an explanation here https://stackoverflow.com/a/19211729/2468006
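To see why the one-liner works, it can help to print the groups that groupby produces. The sketch below only unpacks the expression step by step and uses the same sep and s as above.

```python
from itertools import groupby

sep = ' []'
s = 'hello there[my]friend'

# groupby starts a new group each time the key (is this character a
# separator?) changes, so runs of text and runs of separators alternate.
for is_sep, group in groupby(s, sep.__contains__):
    print(is_sep, repr(''.join(group)))

# Keeping only the groups whose key is False leaves the words.
words = [''.join(g) for k, g in groupby(s, sep.__contains__) if not k]
print(words)  # ['hello', 'there', 'my', 'friend']
```

Because a whole run of separators forms a single group, consecutive delimiters do not produce empty strings here.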
A recursive solution without use of regex. Uses only base python in contrast to the other answers.
def split_on_multiple_chars(string_to_split, set_of_chars_as_string):
    # Recursive splitting
    # Returns a list of strings
    s = string_to_split
    chars = set_of_chars_as_string
    # If no more characters to split on, return input
    if len(chars) == 0:
        return [s]
    # Split on the first of the delimiter characters
    ss = s.split(chars[0])
    # Recursive call without the first splitting character
    bb = []
    for e in ss:
        aa = split_on_multiple_chars(e, chars[1:])
        bb.extend(aa)
    return bb
It works very similarly to Python's regular str.split(...), but accepts several delimiters.
Example use:
print(split_on_multiple_chars('my"example_string.with:funny?delimiters', '_.:;'))
Output:
['my"example', 'string', 'with', 'funny?delimiters']
If you're not worried about long strings, you could force all delimiters to be the same using str.replace(). The following splits a string x on both - and , characters:
x.replace('-', ',').split(',')
If you have many delimiters you could do the following:
def split(x, delimiters):
    for d in delimiters:
        x = x.replace(d, delimiters[0])
    return x.split(delimiters[0])
re.split is the right tool here.
>>> string="hello there[my]friend"
>>> import re
>>> re.split('[] []', string)
['hello', 'there', 'my', 'friend']
In regex, [...] defines a character class. Any characters inside the brackets will match. The way I've spaced the brackets avoids needing to escape them, but the pattern [\[\] ] also works.
>>> re.split(r'[\[\] ]', string)
['hello', 'there', 'my', 'friend']
The re.DEBUG flag to re.compile is also useful, as it prints out what the pattern will match:
>>> re.compile('[] []', re.DEBUG)
in
literal 93
literal 32
literal 91
<_sre.SRE_Pattern object at 0x16b0850>
(Where 93, 32 and 91 are the ASCII codes of ], the space character and [, respectively.)