What regex will emulate the default behavior of split() in Python?

Using split() I can easily turn a string into a list of the tokens that are separated by whitespace:
>>> 'this is a test 200/2002'.split()
['this', 'is', 'a', 'test', '200/2002']
How do I do the same using re.compile and re.findall? I need something similar to the following example, but without splitting the "200/2002":
>>> import re
>>> test = re.compile(r'\w+')
>>> test.findall('this is a test 200/2002')
['this', 'is', 'a', 'test', '200', '2002']

This should output the desired list:
>>> test = re.compile(r'\S+')
>>> test.findall('this is a test 200/2002')
['this', 'is', 'a', 'test', '200/2002']
\S matches anything but a whitespace character (space, tab, newline, ...).
From the str.split() documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
findall() with the above regex should have the same behaviour:
>>> test.findall(" a\nb\tc d ")
['a', 'b', 'c', 'd']
>>> " a\nb\tc d ".split()
['a', 'b', 'c', 'd']
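As a further check (my own REPL session), both also return [] for empty or whitespace-only input, matching the documented behavior quoted above:
>>> test.findall("")
[]
>>> "".split()
[]
>>> test.findall(" \t\n ")
[]
>>> " \t\n ".split()
[]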


How do I split a sentence and store each word in a list? For example, given a string like "these are words", how do I get a list like ["these", "are", "words"]?
To split on other delimiters, see Split a string by a delimiter in python.
To split into individual characters, see How do I split a string into a list of characters?.
Given a string sentence, this stores each word in a list called words:
words = sentence.split()
To split the string text on any consecutive runs of whitespace:
words = text.split()
To split the string text on a custom delimiter such as ",":
words = text.split(",")
The words variable will be a list and contain the words from text split on the delimiter.
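For instance, a quick check of both forms (the example strings are my own):
>>> "these are words".split()
['these', 'are', 'words']
>>> "these,are,words".split(",")
['these', 'are', 'words']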
Use str.split():
Return a list of the words in the string, using sep as the delimiter
... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
>>> line = "a sentence with a few words"
>>> line.split()
['a', 'sentence', 'with', 'a', 'few', 'words']
Depending on what you plan to do with your sentence-as-a-list, you may want to look at the Natural Language Toolkit (NLTK). It deals heavily with text processing and evaluation. You can also use it to solve your problem:
import nltk
words = nltk.word_tokenize(raw_sentence)
This has the added benefit of splitting out punctuation.
Example:
>>> import nltk
>>> s = "The fox's foot grazed the sleeping dog, waking it."
>>> words = nltk.word_tokenize(s)
>>> words
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', ',',
'waking', 'it', '.']
This allows you to filter out any punctuation you don't want and use only words.
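For example, a minimal filter over the words list above (note that string.punctuation only contains single ASCII punctuation characters, so multi-character tokens would need extra handling):
>>> import string
>>> [w for w in words if w not in string.punctuation]
['The', 'fox', "'s", 'foot', 'grazed', 'the', 'sleeping', 'dog', 'waking', 'it']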
Please note that the other solutions using str.split() are better if you don't plan on doing any complex manipulation of the sentence.
How about this algorithm? Split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
I want my python function to split a sentence (input) and store each word in a list
The str.split() method does this: it takes a string and splits it into a list:
>>> the_string = "this is a sentence"
>>> words = the_string.split(" ")
>>> print(words)
['this', 'is', 'a', 'sentence']
>>> type(words)
<type 'list'> # or <class 'list'> in Python 3
If you want all the chars of a word/sentence in a list, do this:
print(list("word"))
# ['w', 'o', 'r', 'd']
print(list("some sentence"))
# ['s', 'o', 'm', 'e', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e']
shlex has a .split() function. It differs from str.split() in that it does not preserve quotes and treats a quoted phrase as a single word:
>>> import shlex
>>> shlex.split("sudo echo 'foo && bar'")
['sudo', 'echo', 'foo && bar']
NB: it works well for Unix-like command line strings. It doesn't work for natural-language processing.
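If you do want to keep the quote characters, shlex.split() also takes a posix flag; a quick check (behavior as I understand the standard library, so verify on your version):
>>> shlex.split("sudo echo 'foo && bar'", posix=False)
['sudo', 'echo', "'foo && bar'"]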
Split the words without harming apostrophes inside words.
See input_1 and input_2 below, which contain apostrophes ("Moore's law", "can't", ...).
import re

def split_into_words(line):
    word_regex_improved = r"(\w[\w']*\w|\w)"
    word_matcher = re.compile(word_regex_improved)
    return word_matcher.findall(line)
#Example 1
input_1 = "computational power (see Moore's law) and "
split_into_words(input_1)
# output
['computational', 'power', 'see', "Moore's", 'law', 'and']
#Example 2
input_2 = """Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad."""
split_into_words(input_2)
#output
['Oh',
'you',
"can't",
'help',
'that',
'said',
'the',
'Cat',
"we're",
'all',
'mad',
'here',
"I'm",
'mad',
"You're",
'mad']

How to process list elements using rstrip() and lower()

I use this code:
test_list = ['A','a','word','Word','If,','As:']
for a in test_list:
    str(a)
    a.rstrip('.' or '?' or '!' or '"' or ":" or ',' or ';')
    a.lower()
    print(a)
print(test_list)
I got result like this:
A
a
word
Word
If,
As:
['A', 'a', 'word', 'Word', 'If,', 'As:']
And I was looking for something like:
a
a
word
word
if
as
['a','a','word','word','if','as']
I want to convert all the elements in the list to lower case and strip off all the punctuation marks, so that only the plain words are left for me to process.
The following should do everything you requested by using a list comprehension:
# Test List
test_list = ['A', 'a', 'word', 'Word', 'If,', 'As:']
# Remove certain characters and convert characters to lower case in test list
test_list = [str(a).strip('.?!":,;').lower() for a in test_list]
# Print test list
print(test_list)
Output:
['a', 'a', 'word', 'word', 'if', 'as']
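The original loop has no effect because str.rstrip() and str.lower() return new strings rather than modifying the list in place, and '.' or '?' or ... evaluates to just '.', so only trailing periods would have been stripped anyway. A minimal sketch of the same fix written as an explicit loop:
test_list = ['A', 'a', 'word', 'Word', 'If,', 'As:']
cleaned = []
for a in test_list:
    # strip()/lower() return new strings; collect them instead of discarding them
    cleaned.append(a.strip('.?!":,;').lower())
print(cleaned)  # ['a', 'a', 'word', 'word', 'if', 'as']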

Python - Getting contents between current and next occurrence of a pattern in a string

I want to implement the following in Python:
(1) Search for a pattern in a string
(2) Get the content up to the next occurrence of the same pattern in the same string
Repeat (1) and (2) until the end of the string.
I have searched all available answers, but to no avail.
Thanks in advance.
As mentioned by Blckknght in the comments, you can achieve this with re.split. re.split retains everything (including empty strings) between a) the beginning of the string and the first match, b) the last match and the end of the string, and c) different matches:
>>> re.split('abc', 'abcabcabcabc')
['', '', '', '', '']
>>> re.split('bca', 'abcabcabcabc')
['a', '', '', 'bc']
>>> re.split('c', 'abcabcabcabc')
['ab', 'ab', 'ab', 'ab', '']
>>> re.split('a', 'abcabcabcabc')
['', 'bc', 'bc', 'bc', 'bc']
If you want to retain only c), the strings between two matches of the pattern, just slice the resulting list with [1:-1].
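For example (my own sample string):
>>> re.split('abc', 'xxabcyyabczz')[1:-1]
['yy']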
Do note that there are two caveats with this method:
re.split doesn't split on an empty-string match (this changed in Python 3.7, where empty matches do split the string; the output below is from an earlier version):
>>> re.split('', 'abcabc')
['abcabc']
Content in capturing groups will be included in the resulting list.
>>> re.split(r'(.)(?!\1)', 'aaaaaakkkkkkbbbbbsssss')
['aaaaa', 'a', 'kkkkk', 'k', 'bbbb', 'b', 'ssss', 's', '']
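If the group is only there for alternation and you don't need the backreference (unlike the example above), a non-capturing group (?:...) avoids this; a small illustration with my own pattern:
>>> re.split(r'(,|;)', 'a,b;c')
['a', ',', 'b', ';', 'c']
>>> re.split(r'(?:,|;)', 'a,b;c')
['a', 'b', 'c']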
You have to write your own function with finditer if you need to handle those use cases.
This is the variant where only case c) is matched.
import re

def findbetween(pattern, input):
    out = []
    start = None  # end of the previous match; None until the first match is seen
    for m in re.finditer(pattern, input):
        if start is not None:
            # keep only the text between two consecutive matches (case c)
            out.append(input[start:m.start()])
        start = m.end()
    return out
Sample run:
>>> findbetween('abc', 'abcabcabcabc')
['', '', '']
>>> findbetween(r'', 'abcdef')
['a', 'b', 'c', 'd', 'e', 'f']
>>> findbetween(r'ab', 'abcabcabc')
['c', 'c']
>>> findbetween(r'b', 'abcabcabc')
['ca', 'ca']
>>> findbetween(r'(?<=(.))(?!\1)', 'aaaaaaaaaaaabbbbbbbbbbbbkkkkkkk')
['bbbbbbbbbbbb', 'kkkkkkk']
(In the last example, (?<=(.))(?!\1) matches the empty string at the end of the string, so 'kkkkkkk' is included in the list of results)
You can use something like this:
re.findall(r"pattern.*?(?=pattern|$)", test_Str)
Here we search for pattern and, using a lookahead, make sure each match captures everything up to the next occurrence of pattern or the end of the string.
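As a quick sanity check with concrete values substituted for pattern and test_Str (my own example; note that each result includes the pattern occurrence itself plus the content up to the next one):
>>> import re
>>> re.findall(r"ab.*?(?=ab|$)", 'ab1ab22ab333')
['ab1', 'ab22', 'ab333']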

Split string by an arbitrary number of whitespace characters

I'm trying to find the most Pythonic way to split a string like
"some words in a string"
into single words. string.split(' ') works OK, but it returns a bunch of empty entries in the list. Of course I could iterate over the list and remove them, but I was wondering if there is a better way?
Just use my_str.split() without ' '.
Moreover, you can limit the number of splits by specifying the second parameter (maxsplit):
>>> ' 1 2 3 4 '.split(None, 2)
['1', '2', '3 4 ']
>>> ' 1 2 3 4 '.split(None, 1)
['1', '2 3 4 ']
How about:
re.split(r'\s+', string)
\s is short for any whitespace character, so \s+ matches one or more consecutive whitespace characters.
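One difference to be aware of (easy to verify in a REPL): unlike split() with no arguments, re.split(r'\s+', ...) keeps a leading empty string when the input starts with whitespace:
>>> import re
>>> s = '   some words in a string'
>>> re.split(r'\s+', s)
['', 'some', 'words', 'in', 'a', 'string']
>>> s.split()
['some', 'words', 'in', 'a', 'string']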
Use string.split() without an argument or re.split(r'\s+', string) instead:
>>> s = 'some words in a string with spaces'
>>> s.split()
['some', 'words', 'in', 'a', 'string', 'with', 'spaces']
>>> import re; re.split(r'\s+', s)
['some', 'words', 'in', 'a', 'string', 'with', 'spaces']
From the docs:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
>>> a = "some words in a string"
>>> a.split(" ")
['some', 'words', 'in', 'a', 'string']
The split parameter is not included in the result, so I guess there's something more going on with your string; otherwise, it should work.
If you have more than one whitespace character in a row, just use split() without parameters:
>>> a = "some words in a string "
>>> a.split()
['some', 'words', 'in', 'a', 'string']
>>> a.split(" ")
['some', 'words', 'in', 'a', 'string', '', '', '', '', '']
Otherwise it will just split a on single spaces.
The most Pythonic and correct way is to just not specify any delimiter:
"some words in a string".split()
# => ['some', 'words', 'in', 'a', 'string']
Also read:
How can I split by 1 or more occurrences of a delimiter in Python?
text = "".join([w and w+" " for w in text.split(" ")])
This collapses runs of spaces into single spaces (note: it only handles spaces, and it leaves a trailing space).
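A quick check with my own input:
>>> text = "some    words  in a     string"
>>> "".join([w and w + " " for w in text.split(" ")])
'some words in a string '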

String to list in Python

I'm trying to split a string:
'QH QD JC KD JS'
into a list like:
['QH', 'QD', 'JC', 'KD', 'JS']
How would I go about doing this?
>>> 'QH QD JC KD JS'.split()
['QH', 'QD', 'JC', 'KD', 'JS']
split:
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified, then there is no limit on the number of splits (all possible splits are made).
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']). The sep argument may consist of multiple characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns [''].
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and ' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
Here is the simplest way:
a = [x for x in 'abcdefgh'] #['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Maybe like this:
list('abcdefgh') # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
Or for fun:
>>> import ast
>>> s = 'QH QD JC KD JS'
>>> ast.literal_eval('[%s]' % ','.join(map(repr, s.split())))
['QH', 'QD', 'JC', 'KD', 'JS']
See ast.literal_eval.
You can use the split() method, which returns a list, to separate them.
letters = 'QH QD JC KD JS'
letters_list = letters.split()
Printing letters_list would now format it like this:
['QH', 'QD', 'JC', 'KD', 'JS']
Now you have a list that you can work with, just like you would with any other list. For example, accessing elements by index:
print(letters_list[2])
This would print the third element of your list, which is 'JC'.
