Python split a string using regex - python

I would like to split a string on ':' and ' ' characters. However, I would like to ignore double spaces '  ' and double colons '::'. For example,
text = "s:11011 i:11010 ::110011  :110010 d:11000"
should split into
[s,11011,i,11010,:,110011, ,110010,d,11000]
After following the Regular Expressions HOWTO on the Python website, I managed to come up with the following:
regx= re.compile('([\s:]|[^\s\s]|[^::])')
regx.split(text)
However, this does not work as intended: it splits on the colons and spaces, but it still includes the ':' and ' ' separators in the result.
[s,:,11011, ,i,:,11010, ,:,:,110011, , :,110010, ,d,:,11000]
How can I fix this?
EDIT: In case of a double space, I only want one space to appear.

Note this assumes that your data has format like X:101010:
>>> re.findall(r'(.+?):(.+?)\b ?',text)
[('s', '11011'), ('i', '11010'), (':', '110011'), (' ', '110010'), ('d', '11000')]
Then chain them up:
>>> list(itertools.chain(*_))
['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']

>>> text = "s:11011 i:11010 ::110011  :110010 d:11000"
>>> [x for x in re.split(r":(:)?|\s(\s)?", text) if x]
['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']

Use the regex (?<=\d) |:(?=\d) to split:
>>> text = "s:11011 i:11010 ::110011  :110010 d:11000"
>>> result = re.split(r"(?<=\d) |:(?=\d)", text)
>>> result
['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']
This will split on:
(?<=\d) a space, when there is a digit on the left. To check this I use a lookbehind assertion.
:(?=\d) a colon, when there is a digit on the right. To check this I use a lookahead assertion.

Have a look at this pattern:
([a-z\:\s])\:(\d+)
It will give you the same array you are expecting. No need to use split, just access the matches you have returned by the regex engine.
Hope it helps!
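A minimal sketch of how that pattern might be used with re.findall. It assumes the input has a double space before ':110010', as the expected output implies; flattening the tuples with itertools.chain.from_iterable is an addition for illustration, not part of the answer above.

```python
import itertools
import re

text = "s:11011 i:11010 ::110011  :110010 d:11000"  # note the double space

# Each match yields a (key, digits) tuple from the two capture groups.
pairs = re.findall(r"([a-z\:\s])\:(\d+)", text)

# Flatten the tuples into a single list.
flat = list(itertools.chain.from_iterable(pairs))
print(flat)
# ['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']
```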

Related

How to remove an interpunct and a number followed by a full stop using regular expressions

I am a beginner in NLP and I have a dataset of strings for an NLP task. I want to clean it by removing the interpunct and any number followed by a full stop, e.g. turning 'George is · working here since 2015.' into 'George is working here since'.
I want to use regular expressions, and I think the re module's compile function would work for my problem. The code that I have is
def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = ["]","[","·",".",";",":","!","?","/","\\",",","#","#","$","&",")","(","\""]
    for punc in puncList:
        for word in wordList:
            wordList=[word.replace(punc,'') for word in wordList]
    return wordList
but it returns
['G', 'e', 'o', 'r', 'g', 'e', ' ', 'i', 's', ' ', '', ' ', '', '', ' ',
 'w', 'o', 'r', 'k', 'i', 'n', 'g', ' ', 'h', 'e', 'r', 'e', ' ',
 's', 'i', 'n', 'c', 'e', ' ', '2', '0', '1', '5', '']
instead of 'George is working here since'.
Another approach is to use
import re
re_word_pattern = re.compile(r'\w+')
re_brackets = re.compile(r'(\[|\])')
re_number_to_zero = re.compile(r'\d+')
re_interpunct_to_zero = re.compile(r'')
text = 'George is · [] working here since 2015.'
text = re_brackets.sub('', text)
text = re_number_to_zero.sub('', text)
which gives
George is · working here since .
so in this case how could I remove the interpunct?
You can use
import re

def clean_text(text):
    return " ".join(re.sub(r'\s*\d+\.|[^\w\s]|_', '', text).split())
See a Python demo:
import re
s='George is · working here since 2015.'
print( " ".join(re.sub(r'\s*\d+\.|[^\w\s]|_', '', s).split()) )
# => George is working here since
Details:
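re.sub(r'\s*\d+\.|[^\w\s]|_', '', s) removes text matching any of these alternatives:
\s*\d+\. - zero or more whitespace characters, one or more digits and a dot
| - or
[^\w\s] - any character that is neither a word character nor whitespace, i.e. any punctuation other than _
|_ - or a _
" ".join(result.split()), applied to the result of re.sub, then collapses the remaining runs of whitespace into single spaces.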
In your first code, you are iterating through a list, so if you give it a string it will iterate through each character.
What you can do instead is:
def stripPunc(sentence):
    """Strips punctuation from a sentence string"""
    puncList = ["]","[","·",".",";",":","!","?","/","\\",",","#","#","$","&",")","(","\""]
    for punc in puncList:
        sentence = sentence.replace(punc,'')
    return sentence
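As a possible alternative to the replace loop (not part of the answer above), str.translate can delete the same characters in a single pass. This is only a sketch: the names PUNC_CHARS, PUNC_TABLE and strip_punc_translate are hypothetical, and like the loop version it leaves the number in place.

```python
# Characters to delete, taken from puncList in the answer above.
PUNC_CHARS = "][·.;:!?/\\,#$&)(\""

# With three arguments, str.maketrans maps every character
# in the third argument to None, i.e. deletes it.
PUNC_TABLE = str.maketrans("", "", PUNC_CHARS)

def strip_punc_translate(sentence):
    """Delete punctuation characters, then collapse runs of whitespace."""
    return " ".join(sentence.translate(PUNC_TABLE).split())

print(strip_punc_translate("George is · working here since 2015."))
# George is working here since 2015
```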

How to split "\t" in a string to two separate characters as "\" and "t"? (How to split Escape Sequence?) [duplicate]

I am trying to split a string in python into a list of characters. I know that there are a lot of ways to do this in python, but I have a case where those methods don't give me the desired results.
The problem happens when I have special characters like '\t' that is explicitly written in the string (and I don't mean the real tab).
Example:
string = "    Hello \t World."
the output I need is:
list_of_chars = [' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']
but when I use the methods that are given in this question, I get a list that contains '\t' as a whole string, not separated.
Example:
> list(string)
> ['H', 'e', 'l', 'l', 'o', 'w', ' ', '\t', ' ', 'W', 'o', 'r', 'l', 'd', '.']
I want to know why this happens and how to get what I want.
You can substitute your string accordingly:
import itertools
txt = "    Hello \t World."
specials = {
    '\a' : '\\a', # ASCII Bell (BEL)
    '\b' : '\\b', # ASCII Backspace (BS)
    '\f' : '\\f', # ASCII Formfeed (FF)
    '\n' : '\\n', # ASCII Linefeed (LF)
    '\r' : '\\r', # ASCII Carriage Return (CR)
    '\t' : '\\t', # ASCII Horizontal Tab (TAB)
    '\v' : '\\v'  # ASCII Vertical Tab (VT)
}
# edited out: # txt2 = "".join([x if x not in specials else specials[x] for x in txt])
txt2 = itertools.chain(* [(list(specials[x]) if x in specials else [x]) for x in txt])
print(list(txt2))
Output:
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W',
'o', 'r', 'l', 'd', '.']
The list comprehension looks more "positive" and uses list(itertools.chain(*[...])) instead of list("".join([...])) which should be more performant.
You should take a look at the String literals documentation, which says:
The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for backslash escape sequences.
In your example string, \t is not two characters but a single character which represents ASCII Horizontal Tab (TAB).
To tell the Python interpreter that these are two separate characters, use a raw string (put r before the opening quote):
>>> list(r"    Hello \t World.")
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']
But here also you'll see two \\ in the resultant list, which is just a Python's way of representing \.
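To make the difference concrete, here is a small sketch (the variable names normal and raw are just for illustration) comparing the two literals and showing that the doubled backslash only appears in the repr:

```python
normal = "\t"   # escape sequence: one TAB character
raw = r"\t"     # raw string: a backslash followed by the letter t

print(len(normal), len(raw))  # 1 2
print(list(raw))              # ['\\', 't']  (repr shows the backslash escaped)
print(raw[0])                 # \  (printing shows the single backslash)
print(len("\\"))              # 1
```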
For the Python interpreter, '\' is an invalid string because \' inside a string represents a single quote ('). Hence, when you write '\', it raises the error below, because as far as Python is concerned the string has no closing quote:
>>> '\'
File "<stdin>", line 1
'\'
^
SyntaxError: EOL while scanning string literal
If you can't declare your string as raw string (as it's already defined or imported from some other source), you may convert it to byte string by setting encoding as "unicode-escape":
>>> my_str = "    Hello \t World."
>>> unicode_escaped_string = my_str.encode('unicode-escape')
>>> unicode_escaped_string
b'    Hello \\t World.'
Since it is a byte-string, you need to call chr to get the corresponding character value of each byte. For example:
>>> list(map(chr, unicode_escaped_string))
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']
You could maybe convert the string to its escaped literal form and then split it character by character?
string = "    Hello \t World."
string_raw = string.encode('unicode-escape')
print([ch for ch in string_raw])
print([chr(ch) for ch in string_raw])
Outputs:
[32, 32, 32, 32, 72, 101, 108, 108, 111, 32, 92, 116, 32, 87, 111, 114, 108, 100, 46]
[' ', ' ', ' ', ' ', 'H', 'e', 'l', 'l', 'o', ' ', '\\', 't', ' ', 'W', 'o', 'r', 'l', 'd', '.']
ASCII 92 is a single backslash, even though it is shown escaped when you print the list in a terminal.
\t means tab; if you want to explicitly have a \ character, you'll need to escape it in your string:
string = "    Hello \\t World."
Or use a raw string:
string = r"    Hello \t World."

[] followed by () in regex altering the meaning of [] in python

My regex expression is re.findall("[2]*(.)","b = 2 + a*10");
Its output: ['b', ' ', '=', ' ', ' ', '+', ' ', 'a', '*', '1', '0']
From the expression, what I infer is that it should match zero or more occurrences of 2 followed by any character, which should give all characters including the 2. But there is no 2 in the output. It is actually omitting the characters inside [], which I concluded after replacing 2 with other characters, but I am unable to understand why this happens. Why does [] followed by () omit the characters inside []?
Read the docs for re.findall:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
So when you include (.) in your pattern, re.findall will return the contents of that group.
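A short sketch of that behaviour, using the string from the question (the variable names are illustrative). Dropping the group, or making it non-capturing with (?:...), makes re.findall return the whole match, so the 2 reappears:

```python
import re

s = "b = 2 + a*10"

# One capturing group: findall returns only what the group matched.
with_group = re.findall(r"[2]*(.)", s)
print(with_group)
# ['b', ' ', '=', ' ', ' ', '+', ' ', 'a', '*', '1', '0']

# No group: findall returns the whole match, so the 2 is kept.
no_group = re.findall(r"[2]*.", s)
print(no_group)
# ['b', ' ', '=', ' ', '2 ', '+', ' ', 'a', '*', '1', '0']

# A non-capturing group groups without changing what findall returns.
non_capturing = re.findall(r"[2]*(?:.)", s)
print(non_capturing == no_group)
# True
```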

Join split words and punctuation correctly

So I have this list:
list1 = ['hi', 'there', '!', 'i', 'work', 'for', 'Spencer', '&', 'Co']
I want to join the list together so that some punctuation marks attach to the preceding word, but others do not.
I am currently using:
list1 = " ".join(list1)
re.sub(r' (?=\W)', '', list1)
This makes every punctuation join to the previous element.
hi there! i work for Spencer& Co
But
I want:
hi there! i work for Spencer & Co
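I personally avoid regular expressions, since plain logical solutions are easier for me to understand. Here is a short solution you could use for your example above (the set of marks that attach to the previous word is an assumption; adjust it as needed):
list1 = ['hi', 'there', '!', 'i', 'work', 'for', 'Spencer', '&', 'Co']
attach_to_previous = {'!', '?', '.', ','}  # assumed set of marks
output = ""
for part in list1:
    if part in attach_to_previous:
        output = output.rstrip() + part + " "  # glue the mark to the previous word
    else:
        output += part + " "
output = output[:-1]
The last line removes the trailing space character.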
You could use a negated character set with your look-ahead and include your special character(s):
>>> re.sub(r' (?=[^\w&])', '', list1) # include &
'hi there! i work for Spencer & Co'
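For reference, a self-contained sketch of the look-ahead approach, starting from the original list rather than an already joined string (the names joined and result are illustrative):

```python
import re

list1 = ['hi', 'there', '!', 'i', 'work', 'for', 'Spencer', '&', 'Co']

# Join everything with single spaces first.
joined = " ".join(list1)

# Remove a space only when the next character is neither a word
# character nor '&', so '!' attaches to 'there' but '&' stays separate.
result = re.sub(r' (?=[^\w&])', '', joined)
print(result)
# hi there! i work for Spencer & Co
```

Other marks that should stay separate can be added inside the negated set next to &.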

Split a string in Python on spaces, punctuation marks, Unicode characters, etc.

I want to split string like this:
string = '[[he (∇((comesΦf→chem,'
based on spaces, punctuation marks and also Unicode characters. That is, the output I expect is:
out= ['[', '[', 'he',' ', '(','∇' , '(', '(', 'comes','Φ', 'f','→', 'chem',',']
I am using
re.findall(r"[\w\s\]+|[^\w\s]",String,re.unicode)
for this case, but it returned following output:
output=['[', '[', 'he',' ', '(', '\xe2', '\x88', '\x87', '(', '(', 'comes\xce', '\xa6', 'f\xe2', '\x86', '\x92', 'chem',',']
Please tell me how I can solve this problem.
Without using regexes and assuming words only contain ASCII characters:
from string import ascii_letters
from itertools import groupby

LETTERS = frozenset(ascii_letters)

def is_alpha(char):
    return char in LETTERS

def split_string(text):
    for key, tokens in groupby(text, key=is_alpha):
        if key: # Found letters, join them and yield a word
            yield ''.join(tokens)
        else: # not letters, just yield the single tokens
            yield from tokens
Example result:
In [2]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[2]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
If you are using a Python version older than 3.3, you can replace yield from tokens with:
for token in tokens: yield token
If you are on Python 2, keep in mind that split_string expects a unicode string.
Note that by modifying the is_alpha function you can define different kinds of grouping. For example, if you wanted to consider all Unicode letters as letters, you could use is_alpha = str.isalpha (or unicode.isalpha in Python 2):
In [3]: is_alpha = str.isalpha
In [4]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[4]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comesΦf', '→', 'chem', ',']
Note that 'comesΦf', which was split before, is now a single token.
Hope I helped.
In [33]: string = '[[he (∇((comesΦf→chem,'
In [34]: re.split('\W+', string)
Out[34]: ['', 'he', 'comes', 'f', 'chem', '']
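For comparison, a regex sketch that reproduces the output the question asks for. It assumes Python 3, where str patterns are Unicode-aware by default, and it treats only ASCII letters as word characters (the same assumption as the groupby answer); the name tokens is illustrative.

```python
import re

text = '[[he (∇((comesΦf→chem,'

# Runs of ASCII letters become one token;
# any other single character is a token of its own.
tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]", text)
print(tokens)
# ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
```

The \xe2, \x88 fragments in the question's output suggest the pattern was applied to a byte string, so each non-ASCII character was split into its UTF-8 bytes; matching against a Unicode string avoids that.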
