I have a string composed of numbers and letters: string = 'this1234is5678it', and I would like the string.split() output to give me a list like ['this', '1234', 'is', '5678', 'it'], splitting wherever numbers and letters meet. Is there an easy way to do this?
You can use Regex for this.
import re
s = 'this1234is5678it'
re.split(r'(\d+)', s)
Running example http://ideone.com/JsSScE
Outputs ['this', '1234', 'is', '5678', 'it']
Update
Steve Rumbalski mentioned in the comments the importance of the parentheses in the regex. He quotes from the documentation:
"If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
Without the parentheses the result would be ['this', 'is', 'it'].
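To make the effect of the parentheses concrete, here is a minimal sketch that runs the same split with and without the capturing group. The variable names with_group and without_group are only illustrative.

```python
import re

s = 'this1234is5678it'

# With a capturing group, the matched digits are kept in the result.
with_group = re.split(r'(\d+)', s)

# Without the group, the digits act only as separators and are discarded.
without_group = re.split(r'\d+', s)

print(with_group)     # ['this', '1234', 'is', '5678', 'it']
print(without_group)  # ['this', 'is', 'it']
```

Note that if the string starts or ends with digits, re.split returns an empty string at that end (for example, '12ab' gives ['', '12', 'ab']), so you may want to filter those out.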
Related
I'm trying to write a regex that takes in a string of words containing quoted substrings like "green lizards" like to sit "in the sun", tokenizes it into words and quoted substrings (using either single or double quotes) separated by spaces, and then returns a list [['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']] where the list items are either single words or, where a quoted substring was encountered, nested lists of words.
I am new to regex, and was able to find a solution that captures the quoted parts: re.findall('"([^"]*)"', '"green lizards" like to sit "in the sun"') ... which returns: ['green lizards', 'in the sun']
But this doesn't capture the individual words, and it also doesn't tokenize them (it returns a single string instead of a list of words, which requires me to split() each one separately).
How would I make a regex that correctly returns the type of list I'm wanting? Also, I'm open to better methods/tools than regex for parsing these sorts of strings if anyone has suggestions.
Thanks!
With re.findall() function and built-in str methods:
import re
s = '"green lizards" like to sit "in the sun"'
result = [i.replace('"', "").split() if i.startswith('"') else i
for i in re.findall(r'"[^"]+"|\S+', s)]
print(result)
The output:
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
Another approach (supporting both single and double quotes):
import re
sentence = """"green lizards" like to sit "in the sun" and 'single quotes' remain alone"""
rx = re.compile(r"""(['"])(.*?)\1|\S+""")
tokens = [m.group(2).split()
if m.group(2) else m.group(0)
for m in rx.finditer(sentence)]
print(tokens)
Yielding
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun'], 'and', ['single', 'quotes'], 'remain', 'alone']
The idea here is:
(['"]) # capture a single or a double quote
(.*?) # 0+ characters lazily
\1 # up to the same type of quote previously captured
| # ...or...
\S+ # one or more non-whitespace characters
In the list comprehension we check which condition was met.
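The annotated pattern above can be written directly with re.VERBOSE, which lets the comments live inside the regex. This is a sketch, not the answer's exact code: the function name tokenize_quoted is hypothetical, and it tests group(1) rather than group(2), which is an assumption on my part so that an empty pair of quotes becomes an empty list instead of a literal token.

```python
import re

# Same pattern as above, written in verbose mode so it can carry comments.
rx = re.compile(r"""
    (['"])   # capture a single or a double quote
    (.*?)    # 0+ characters, lazily
    \1       # the same type of quote that was captured
    |        # ...or...
    \S+      # one or more non-whitespace characters
""", re.VERBOSE)

def tokenize_quoted(sentence):
    """Return words as strings and quoted phrases as lists of words."""
    tokens = []
    for m in rx.finditer(sentence):
        if m.group(1) is not None:
            # The quoted alternative matched: split its contents into words.
            tokens.append(m.group(2).split())
        else:
            # A plain run of non-whitespace characters.
            tokens.append(m.group(0))
    return tokens

print(tokenize_quoted('"green lizards" like to sit "in the sun"'))
# [['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
```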
You can use re.split and then a final str.split:
import re
s = '"green lizards" like to sit "in the sun"'
new_s = [[i[1:-1].split()] if i.startswith('"') else i.split() for i in re.split(r'(?<=")\s|\s(?=")', s)]
last_result = [i for b in new_s for i in b]
print(last_result)
Output:
[['green', 'lizards'], 'like', 'to', 'sit', ['in', 'the', 'sun']]
I have this code
import re
str1 = "These should be counted as a single-word, b**m !?"
match_pattern = re.findall(r'\w{1,15}', str1)
print(match_pattern)
I want the output to be:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
The output should exclude non-words such as "!?". What other validation should I use to match and achieve the desired output?
I would use word boundaries (\b) around one or more non-space characters:
match_pattern = re.findall(r'\b\S+\b', str1)
result:
['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
!? is skipped because it contains no word characters, so there is no word boundary in it for \b to match.
Probably you want something like [^\s.!?] instead of \w but what exactly you want is not evident from a single example. [^...] matches a single character which is not one of those between the brackets and \s matches whitespace characters (space, tab, newline, etc).
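As a rough illustration of the negated character class suggestion, the sketch below applies it to the example string. Adding the comma to the class is my assumption, not part of the suggestion above; without it, the token would come out as 'single-word,' with the trailing comma attached.

```python
import re

str1 = "These should be counted as a single-word, b**m !?"

# Match runs of characters that are neither whitespace nor the listed
# punctuation. The comma is included in the class (an assumption) so that
# "single-word," loses its trailing comma.
tokens = re.findall(r'[^\s.,!?]+', str1)
print(tokens)
# ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
```

The trade-off is that any excluded character also splits tokens it appears inside, so an abbreviation such as "e.g." would be broken into 'e' and 'g'.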
You can also achieve a similar result not using RegEx:
string = "These should be counted as a single-word, b**m !?"
replacements = ['.', ',', '?', '!']
# Remove each punctuation character, then split on whitespace.
for replacement in replacements:
    string = string.replace(replacement, "")
print(string.split())
>>> ['These', 'should', 'be', 'counted', 'as', 'a', 'single-word', 'b**m']
I wrote a tokenize function that basically reads a string representation and splits it into list of words.
My code:
import re

def tokenize(document):
    x = document.lower()
    return re.findall(r'\w+', x)
My output:
tokenize("Hi there. What's going on? first-class")
['hi', 'there', 'what', 's', 'going', 'on', 'first', 'class']
Desired Output:
['hi', 'there', "what's", 'going', 'on', 'first-class']
Basically, I want words containing apostrophes and hyphenated words to remain as single words in the list, along with double quotes. How can I change my function to get the desired output?
\w+ matches one or more word characters; this does not include apostrophes or hyphens.
You need to use a character set here to tell Python exactly what you want to match:
>>> import re
>>> def tokenize(document):
...     return re.findall("[A-Za-z'-]+", document)
...
>>> tokenize("Hi there. What's going on? first-class")
['Hi', 'there', "What's", 'going', 'on', 'first-class']
>>>
You'll notice too that I removed the x = document.lower() line. It is not needed for matching, since adding A-Z to the character set lets the pattern match uppercase characters. The tokens then keep their original case, so keep the lower() call if you want lowercase output.
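If lowercase tokens are wanted, as in the desired output from the question, one option is to lowercase the input before matching. The sketch below does that; the name tokenize_lower is hypothetical and chosen only to keep it apart from the function above.

```python
import re

def tokenize_lower(document):
    # Lowercase first, then keep runs of letters, apostrophes and hyphens.
    return re.findall(r"[a-z'-]+", document.lower())

print(tokenize_lower("Hi there. What's going on? first-class"))
# ['hi', 'there', "what's", 'going', 'on', 'first-class']
```

One caveat with this character set: a hyphen or apostrophe standing on its own (for example " - ") is also returned as a token, so extra filtering may be needed for messier text.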
This function takes in a string text, and returns a list which contains lists of strings, one list for each sentence in the string text.
Sentences are separated by one of the strings ".", "?", or "!". We ignore the possibility of other punctuation separating sentences, so 'Mr.X' will turn into two sentences, and 'don't' will be two words.
For example, the text is
Hello, Jack. How is it going? Not bad; pretty good, actually... Very very
good, in fact.
And the function returns:
[['hello', 'jack'],
['how', 'is', 'it', 'going'],
['not', 'bad', 'pretty', 'good', 'actually'],
['very', 'very', 'good', 'in', 'fact']]
The most confusing part is how to make the function detect the characters , . ! ? and how to build a list of lists containing the words of each sentence.
Thank you.
This sounds very much like a homework problem to me, so I'll provide general tips instead of exact code.
A string has the split(char) function on it. You can use this to split your string based on a specific character. However, you will have to use a loop and perform the split multiple times.
You could also use a regular expression to find matches (that would be a better solution). That would let you find all matches at once. Then you would iterate over the matches and split them based on spaces, while stripping out punctuation.
Edit: Here's an example of a regular expression you could use to get sentence groups all at once:
\s*([^.?!]+)\s*
The \s* before the parentheses keeps leading whitespace out of each captured sentence (the trailing \s* has little effect, because the greedy character class already consumes any trailing whitespace), and the parentheses form a capture group. You can use re.findall() to get a list of all captured results, and then you can loop over these items and use re.split() and some conditional logic to append all the words to a new list.
Let me know how you get along with that, and if you have any other questions please provide us the code you have so far.
You can use re.split():
>>> s="Hello, Jack. How is it going? Not bad; pretty good, actually... Very very good, in fact."
>>> import re
>>> my_s = [re.split(r'\W',i) for i in re.split(r'\.|\?|\!',s) if len(i)]
To remove the empty strings, you can do this:
>>> [[x for x in i if len(x)] for i in my_s]
[['Hello', 'Jack'], ['How', 'is', 'it', 'going'], ['Not', 'bad', 'pretty', 'good', 'actually'], ['Very', 'very', 'good', 'in', 'fact']]
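Putting the pieces together, a single function could look roughly like the sketch below. The name split_sentences is hypothetical, and lowercasing the text is an assumption taken from the expected output in the question. It swaps the second re.split for re.findall(r'\w+', ...), so no empty strings need to be filtered out of each sentence.

```python
import re

def split_sentences(text):
    """Return one list of lowercase words per sentence in text."""
    sentences = []
    # Sentences end at '.', '?' or '!'; consecutive terminators such as
    # '...' produce empty chunks, which are skipped below.
    for chunk in re.split(r'[.?!]', text.lower()):
        words = re.findall(r'\w+', chunk)
        if words:
            sentences.append(words)
    return sentences

text = ("Hello, Jack. How is it going? Not bad; pretty good, actually... "
        "Very very good, in fact.")
print(split_sentences(text))
```

Because \w does not match an apostrophe, 'don't' comes out as two words, which is what the question asks for.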
I'm using regular expressions to split a string using multiple delimiters. But if two of my delimiters occur next to each other in the string, it puts an empty string in the resulting list. For example:
re.split(',|;', "This,is;a,;string")
Results in
['This', 'is', 'a', '', 'string']
Is there any way to avoid getting '' in my list without adding ,; as a delimiter?
Try this:
import re
re.split(r'[,;]+', 'This,is;a,;string')
> ['This', 'is', 'a', 'string']
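One thing the + quantifier does not cover is a delimiter at the very start or end of the string, which still yields an empty string at that end. A small sketch of one way to handle both cases follows; the helper name split_nonempty and its delimiters parameter are hypothetical.

```python
import re

def split_nonempty(text, delimiters=',;'):
    # Build a character class from the delimiters; re.escape guards
    # against characters that are special inside a class, such as ']'.
    pattern = '[' + re.escape(delimiters) + ']+'
    # Drop the empty strings left by leading or trailing delimiters.
    return [part for part in re.split(pattern, text) if part]

print(re.split(r'[,;]+', ',This,is;'))        # ['', 'This', 'is', '']
print(split_nonempty(',This,is;a,;string;'))  # ['This', 'is', 'a', 'string']
```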