How to split string with 2 arguments? - python

If I have the string 'asdf foo\nHi\nBar thing', I want to split it so the output is ['asdf', 'foo', 'Hi', 'Bar', 'thing']. That's essentially x.split(' ') and x.split('\n') combined. How can I do this efficiently? I want it to be about one line long, instead of having a for loop to split again...

Omit the parameter to split(): x.split() splits on any run of whitespace, so it handles spaces, newlines, and tabs at once.
Example:
>>> x = 'asdf foo\nHi\nBar thing'
>>> x.split()
['asdf', 'foo', 'Hi', 'Bar', 'thing']
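Runs of whitespace collapse as well, so tabs and repeated spaces do not produce empty strings (a quick check with a made-up string):
>>> 'asdf\tfoo  bar\n baz'.split()
['asdf', 'foo', 'bar', 'baz']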

Related

What's the difference between these two input statements?

I've tried using these two input statements in Python. Both statements return the same output. What's the difference between using split() and split(" ")?
a=[int(i) for i in input().split(" ")]
print(a)
and
a=[int(i) for i in input().split()]
print(a)
The default behavior of the split method on a string is to split on any run of whitespace:
>>> 'foo bar'.split()
['foo', 'bar']
>>> 'foo \n \t bar'.split()
['foo', 'bar']
If you pass a literal space as the argument, however, the split is done differently: only a literal space acts as the separator, so other whitespace is kept, and adjacent literal spaces would produce empty strings:
>>> 'foo \n \t bar'.split(' ')
['foo', '\n', '\t', 'bar']
If the input has only single, ordinary spaces, there will be no observable difference.
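The difference does show up once the input contains doubled spaces; a quick illustrative sketch (the sample input is made up) where the empty string from split(" ") breaks the int() conversion while split() does not:
>>> [int(i) for i in '1  2 3'.split(" ")]
Traceback (most recent call last):
  ...
ValueError: invalid literal for int() with base 10: ''
>>> [int(i) for i in '1  2 3'.split()]
[1, 2, 3]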

Splitting a string after certain characters?

I will be given a string, and I need to split it every time it contains a "|", "/", "." or "_".
How can I do this quickly? I know how to use the split method, but is there any way to give it more than one split condition? For example, if the input given was
Hello test|multiple|36.strings/just36/testing
I want the output to be:
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Use a regex with the re module:
>>> import re
>>> s='You/can_split|multiple'
>>> re.split(r'[/_|.]', s)
['You', 'can', 'split', 'multiple']
In this case, [/_|.] will split on any of those characters.
Or, you can use a list comprehension to insert a single (perhaps multiple character) delimiter and then split on that:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s]).split('-><-')
['You', 'can', 'split', 'multiple']
With the added example:
>>> s2="Hello test|multiple|36.strings/just36/testing"
Method 1:
>>> re.split(r'[/_|.]', s2)
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Method 2:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s2]).split('-><-')
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Use groupby:
from itertools import groupby
s = 'You/can_split|multiple'
separators = set('/_|.')
result = [''.join(group) for k, group in groupby(s, key=lambda x: x not in separators) if k]
print(result)
Output
['You', 'can', 'split', 'multiple']
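One thing to keep in mind with re.split: adjacent or trailing delimiters produce empty strings, which you may want to filter out (a made-up example, in case your inputs can look like this):
>>> re.split(r'[/_|.]', 'a||b.')
['a', '', 'b', '']
>>> [part for part in re.split(r'[/_|.]', 'a||b.') if part]
['a', 'b']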

list comprehension using regex conditional

I have a list of strings. If any of these strings has a 4-digit year, I want to truncate the string at the end of the year. Otherwise I leave the strings alone.
I tried using:
for x in my_strings:
    m=re.search("\D\d\d\d\d\D",x)
    if m: x=x[:m.end()]
I also tried:
my_strings=[x[:re.search("\D\d\d\d\d\D",x).end()] if re.search("\D\d\d\d\d\D",x) for x in my_strings]
Neither of these is working.
Can you tell me what I am doing wrong?
Something like this seems to work on trivial data:
>>> regex = re.compile(r'^(.*(?<=\D)\d{4}(?=\D))(.*)')
>>> strings = ['foo', 'bar', 'baz', 'foo 1999', 'foo 1999 never see this', 'bar 2010 n 2015', 'bar 20156 see this']
>>> [regex.sub(r'\1', s) for s in strings]
['foo', 'bar', 'baz', 'foo 1999', 'foo 1999', 'bar 2010', 'bar 20156 see this']
It looks like the only boundary you need on the result string is the match's end(), so you should use re.match() instead and change your regex to:
my_expr = r".*?\D\d{4}\D"
Then, in your code, do:
regex = re.compile(my_expr)
my_new_strings = []
for string in my_strings:
    match = regex.match(string)
    if match:
        my_new_strings.append(match.group())
    else:
        my_new_strings.append(string)
Or as a list comprehension:
regex = re.compile(my_expr)
matches = ((regex.match(string), string) for string in my_strings)
my_new_strings = [match.group() if match else string for match, string in matches]
Alternatively, you could use re.sub:
regex = re.compile(r'(\D\d{4})\D.*')
new_strings = [regex.sub(r'\1', string) for string in my_strings]
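A quick check of this re.sub variant against a couple of the sample strings used earlier:
>>> regex = re.compile(r'(\D\d{4})\D.*')
>>> [regex.sub(r'\1', s) for s in ['foo 1999 never see this', 'bar 20156 see this']]
['foo 1999', 'bar 20156 see this']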
I am not entirely sure of your use case, but the following code may give you some hints:
import re
my_strings = ['abcd', 'ab12cd34', 'ab1234', 'ab1234cd', '1234cd', '123cd1234cd']
for index, string in enumerate(my_strings):
    match = re.search(r'\d{4}', string)
    if match:
        my_strings[index] = string[0:match.end()]
print(my_strings)
# ['abcd', 'ab12cd34', 'ab1234', 'ab1234', '1234', '123cd1234']
You were actually pretty close with the list comprehension, but your syntax is off - you need to make the first expression a "conditional expression" aka x if <boolean> else y:
[x[:re.search(r"\D\d\d\d\d\D", x).end()] if re.search(r"\D\d\d\d\d\D", x) else x for x in my_strings]
Obviously this is pretty ugly and hard to read. There are several better ways to split your string around a 4-digit year, such as:
[re.split(r'(?<=\D\d{4})\D', x)[0] for x in my_strings]
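For reference, a quick run of that one-liner over a few of the sample strings from further up:
>>> strings = ['foo', 'foo 1999 never see this', 'bar 2010 n 2015', 'bar 20156 see this']
>>> [re.split(r'(?<=\D\d{4})\D', x)[0] for x in strings]
['foo', 'foo 1999', 'bar 2010', 'bar 20156 see this']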

using re.findall to split a string into words in python

I'm using re.findall like this:
x=re.findall('\w+', text)
so I'm getting a list of words matching the characters [a-zA-Z0-9].
The problem is when I use this input:
!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~:
I want to get an empty list, but I'm getting ['_', '_']. How can I exclude those underscores?
Use just the [a-zA-Z0-9] pattern; \w includes underscores:
x = re.findall('[a-zA-Z0-9]+', text)
or use \W (the inverse of \w) in a negated character class with _ added:
x = re.findall(r'[^\W_]+', text)
The latter has the advantage of working correctly even when using re.UNICODE or re.LOCALE, where \w matches a wider range of characters.
Demo:
>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?#[\]^_`{|}~!"#$%&\'()*+,-./:;<=>?#[\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
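In Python 3, \w is Unicode-aware by default as well, so the same point applies there; a small made-up example:
>>> re.findall(r'\w+', 'naïve_test')
['naïve_test']
>>> re.findall(r'[^\W_]+', 'naïve_test')
['naïve', 'test']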
You can use groupby for this too
from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
eg.
>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']

How to remove empty strings from a list?

For example I have a sentence
"He is so .... cool!"
Then I remove all the punctuation and turn it into a list.
["He", "is", "so", "", "cool"]
How do I remove or ignore the empty string?
You can use filter with None as the function argument, which filters out all elements that are falsy (including empty strings):
>>> lst = ["He", "is", "so", "", "cool"]
>>> filter(None, lst)
['He', 'is', 'so', 'cool']
Note, however, that filter returns a list in Python 2 but an iterator in Python 3. You will need to convert it into a list in Python 3, or use the list comprehension solution.
Falsy values include:
False
None
0
''
[]
()
# and all other empty containers
You can filter it like this
orig = ["He", "is", "so", "", "cool"]
result = [x for x in orig if x]
Or you can use filter. In Python 3, filter returns an iterator, so list() turns it into a list; this also works in Python 2.7:
result = list(filter(None, orig))
You can use a list comprehension:
cleaned = [x for x in your_list if x]
Although I would use regex to extract the words:
>>> import re
>>> sentence = 'This is some cool sentence with, spaces'
>>> re.findall(r'(\w+)', sentence)
['This', 'is', 'some', 'cool', 'sentence', 'with', 'spaces']
I'll give you the answer to the question you should have asked -- how to avoid the empty string altogether. I assume you do something like this to get your list:
>>> "He is so .... cool!".replace(".", "").split(" ")
['He', 'is', 'so', '', 'cool!']
The point is that you use .split(" ") to split on space characters. However, if you leave out the argument to split, this happens:
>>> "He is so .... cool!".replace(".", "").split()
['He', 'is', 'so', 'cool!']
Quoth the docs:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
So you really don't need to bother with the other answers (except Blender's, which is a totally different approach), because split can do the job for you!
>>> from string import punctuation
>>> text = "He is so .... cool!"
>>> [w.strip(punctuation) for w in text.split() if w.strip(punctuation)]
['He', 'is', 'so', 'cool']
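For reference, string.punctuation is just a constant holding the ASCII punctuation characters, which is what strip(punctuation) removes from each word here:
>>> punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'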
You can filter out empty strings very easily using a list comprehension:
x = ["He", "is", "so", "", "cool"]
x = [s for s in x if s]
# ['He', 'is', 'so', 'cool']
Python 3 returns an iterator from filter, so it should be wrapped in a call to list():
str_list = list(filter(None, str_list))  # fastest
To also drop strings that are only whitespace, pass str.strip as the filter function:
lst = ["He", "is", "so", "", "cool"]
lst = list(filter(str.strip, lst))
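The difference from filter(None, ...) is that str.strip also drops strings containing only whitespace, if that is what you want:
>>> list(filter(str.strip, ["He", "is", "so", "  ", "", "cool"]))
['He', 'is', 'so', 'cool']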
You can do this with a filter:
a = ["He", "is", "so", "", "cool"]
list(filter(lambda s: len(s) > 0, a))
