Splitting a string after certain characters? - python

I will be given a string, and I need to split it every time it contains a "|", "/", "." or "_".
How can I do this quickly? I know how to use split, but is there a way to give it more than one delimiter? For example, if the input were
Hello test|multiple|36.strings/just36/testing
I want the output to give:
"['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']"

Use a regex and the re module:
>>> import re
>>> s='You/can_split|multiple'
>>> re.split(r'[/_|.]', s)
['You', 'can', 'split', 'multiple']
In this case, [/_|.] will split on any of those characters.
Or, you can use a list comprehension to insert a single (perhaps multiple character) delimiter and then split on that:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s]).split('-><-')
['You', 'can', 'split', 'multiple']
With the added example:
>>> s2="Hello test|multiple|36.strings/just36/testing"
Method 1:
>>> re.split(r'[/_|.]', s2)
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Method 2:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s2]).split('-><-')
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
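If the set of delimiters is not fixed in advance, the pattern can be built from a list instead of being typed by hand. This is only a minimal sketch: split_on_any is a hypothetical helper name, and it assumes the delimiter list is non-empty. re.escape keeps characters such as "|" and "." from being read as regex syntax.

```python
import re

def split_on_any(text, delimiters):
    """Split text on any of the given delimiter strings (hypothetical helper)."""
    # Escape each delimiter so that '|' and '.' are matched literally,
    # then join them into one alternation pattern.
    pattern = "|".join(re.escape(d) for d in delimiters)
    return re.split(pattern, text)

s2 = "Hello test|multiple|36.strings/just36/testing"
print(split_on_any(s2, ["|", "/", ".", "_"]))
# ['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
```

Because the pattern is an alternation rather than a character class, the same helper also accepts multi-character delimiters.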

Use groupby:
from itertools import groupby
s = 'You/can_split|multiple'
separators = set('/_|.')
result = [''.join(group) for k, group in groupby(s, key=lambda x: x not in separators) if k]
print(result)
Output
['You', 'can', 'split', 'multiple']
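One difference worth knowing between the two approaches: when separators appear back to back, re.split keeps the empty strings between them, while the groupby version silently drops them. A small sketch with a made-up input string:

```python
import re
from itertools import groupby

s = "a//b_.c"  # made-up input with adjacent separators
separators = set("/_|.")

# re.split yields an empty string between two adjacent separators.
by_regex = re.split(r"[/_|.]", s)

# groupby collapses each run of separators, so no empty strings appear.
by_groupby = ["".join(group)
              for is_text, group in groupby(s, key=lambda ch: ch not in separators)
              if is_text]

print(by_regex)    # ['a', '', 'b', '', 'c']
print(by_groupby)  # ['a', 'b', 'c']
```

Which behaviour is preferable depends on whether empty fields carry meaning in your data.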

Related

How to find unknown character from a list of strings?

How would I delete an unknown character from a list of strings?
For example my list is ['hi', 'h#w', 'are!', 'you;', '25'] and I want to delete all the characters that are not words or numbers?
How would I do this?
Regex:
import re
s = ['hi', 'h#w', 'are!', 'you;', '25']
[re.sub(r'[^A-Za-z0-9 ]+', '', x) for x in s]
['hi', 'hw', 'are', 'you', '25']
Use re.sub:
from re import sub
lst = ['hi', 'h#w', 'are!', 'you;', '25']
lst = [sub(r'[^\w]', '', i) for i in lst]
print(lst)
Output:
['hi', 'hw', 'are', 'you', '25']
Explanation:
The call to sub replaces every substring of i that matches the pattern [^\w] with an empty string and returns the result.
The pattern [^\w] matches every character that is not a letter, a digit, or an underscore (\w includes "_").
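Since \w also matches the underscore, the two answers above behave differently on input that contains "_". A short sketch of the difference, with 'snake_case' added to the list purely for illustration:

```python
import re

# 'snake_case' is an extra item added only to show the underscore behaviour.
items = ['hi', 'h#w', 'are!', 'you;', '25', 'snake_case']

# [^\w] leaves letters, digits and underscores in place.
keep_underscore = [re.sub(r'[^\w]', '', x) for x in items]

# [\W_] removes underscores as well.
drop_underscore = [re.sub(r'[\W_]', '', x) for x in items]

print(keep_underscore)  # ['hi', 'hw', 'are', 'you', '25', 'snake_case']
print(drop_underscore)  # ['hi', 'hw', 'are', 'you', '25', 'snakecase']
```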

Python Regex Compile Split string so that words appear first

Say I was given a string like so
text = "1234 I just ? shut * the door"
I want to use a regex with re.compile() such that when I split the list all of the words are in front.
I.e. it should look like this.
text = ["I", "just", "shut", "the", "door", "1234", "?", "*"]
How can I use re.compile() to split the string this way?
import re
r = re.compile('regex to split string so that words are first').split(text)
Please let me know if you need any more information.
Thank you for the help.
IIUC, you don't need re. Just use str.split with sorted:
sorted(text.split(), key=lambda x: not x.isalpha())
Output:
['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
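This works because sorted is stable and False sorts before True, so the words (key False) move to the front while every token keeps its original relative order. A sketch that prints the keys to make this visible (the variable names are just for illustration):

```python
text = "1234 I just ? shut * the door"
tokens = text.split()

# False for purely alphabetic tokens, True for everything else.
keys = [not tok.isalpha() for tok in tokens]
print(keys)
# [True, False, False, True, False, True, False, False]

# A stable sort keeps ties in input order: words first, then the rest.
ordered = sorted(tokens, key=lambda tok: not tok.isalpha())
print(ordered)
# ['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
```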
You can use sorted with re.findall:
import re
text = "1234 I just ? shut * the door"
r = sorted(text.split(), key=lambda x: (x.isalpha(), x.isdigit(), bool(re.findall(r'^\W+$', x))), reverse=True)
Output:
['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']
You can't do that with a single regex. You can write one regex to get all words, then another regex to get everything else.
import re
text = "1234 I just ? shut * the door"
r = re.compile(r'[a-zA-Z]+')
words = r.findall(text)
r = re.compile(r'[^a-zA-Z\s]+')
other = r.findall(text)
print(words + other) # ['I', 'just', 'shut', 'the', 'door', '1234', '?', '*']

re.search with strings: difference between use cases of re.search() with strings

I have the following code:
import re
l=['fang', 'yi', 'ke', 'da', 'xue', 'xue', 'bao', '=', 'journal', 'of', 'southern', 'medical', 'university', '2015/feb']
t=[l[13]]
t2=['2015/Feb']
wl1=['2015/Feb']
for i in t:
    print(type(i))
    print(type(wl1[0]))
    r=re.search(r'^%s$' %i, wl1[0])
    if r:
        print('yes')
for i in t2:
    print(type(i))
    print(type(wl1[0]))
    r2=re.search(r'^%s$' %i, wl1[0])
    if r2:
        print('yes')
Could anyone explain to me why the two strings do not match in the first loop? In the second loop they do.
Your input value is lowercase:
>>> l=['fang', 'yi', 'ke', 'da', 'xue', 'xue', 'bao', '=', 'journal', 'of', 'southern', 'medical', 'university', '2015/feb']
>>> t=[l[13]]
>>> t[0]
'2015/feb'
while you are trying to match against a value with the F uppercased:
>>> wl1=['2015/Feb']
>>> wl1[0]
'2015/Feb'
As such the regular expression ^2015/feb$ won't match, while in your second example you generated the expression ^2015/Feb$ instead.
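If the comparison is meant to ignore case, one option is to pass re.IGNORECASE. The sketch below also runs the value through re.escape, on the assumption that it should be matched literally rather than interpreted as a pattern; the names needle and haystack are only for illustration.

```python
import re

needle = "2015/feb"    # the lowercase value taken from the list
haystack = "2015/Feb"  # the value it is compared against

# re.escape guards against regex metacharacters in the interpolated value.
pattern = r"^%s$" % re.escape(needle)

case_sensitive = re.search(pattern, haystack)
case_insensitive = re.search(pattern, haystack, re.IGNORECASE)

print(bool(case_sensitive))    # False
print(bool(case_insensitive))  # True

# For a plain equality test no regex is needed at all.
print(needle.lower() == haystack.lower())  # True
```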

using re.findall when in need of stripping a string into words in python

I'm using re.findall like this:
x=re.findall('\w+', text)
so I'm getting a list of words matching the characters [a-zA-Z0-9].
the problem is when I'm using this input:
!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~!"#$%&'()*+,-./:;<=>?#[\]^_`{|}~:
I want to get an empty list, but I'm getting ['_', '_']. How could I exclude those underscores?
Use just the [a-zA-Z0-9] pattern; \w includes underscores:
x = re.findall('[a-zA-Z0-9]+', text)
or use the inverse of \w, \W in a negative character set with _ added:
x = re.findall(r'[^\W_]+', text)
The latter has the advantage of working correctly even when using re.UNICODE or re.LOCALE, where \w matches a wider range of characters.
Demo:
>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?#[\]^_`{|}~!"#$%&\'()*+,-./:;<=>?#[\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
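To see where the two patterns diverge, try them on text with non-ASCII letters. This sketch assumes Python 3, where str patterns are Unicode-aware by default; the sample string is made up.

```python
import re

# "naïve café_au_lait", written with escapes to avoid encoding surprises.
text = "na\u00efve caf\u00e9_au_lait"

# The explicit ASCII class breaks words at accented letters.
ascii_only = re.findall(r"[a-zA-Z0-9]+", text)

# [^\W_] keeps every Unicode word character except the underscore.
unicode_aware = re.findall(r"[^\W_]+", text)

print(ascii_only)     # ['na', 've', 'caf', 'au', 'lait']
print(unicode_aware)  # ['naïve', 'café', 'au', 'lait']
```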
You can use groupby for this too
from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
E.g.:
>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']

Product code looks like abcd2343, how to split by letters and numbers?

I have a list of product codes in a text file, on each line is the product code that looks like:
abcd2343 abw34324 abc3243-23A
So it is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.
import re
s='abcd2343 abw34324 abc3243-23A'
re.split('(\d+)',s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall('\d*\D+',s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python's regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a "capturing group"), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split('\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall('\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall('\d+|\D+', s) instead of re.split('(\d+)', s):
s='abcd2343 abw34324 abc3243-23A 123'
re.split('(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall('\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']
This function handles float and negative numbers as well.
def separate_number_chars(s):
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f
For example:
utils.separate_number_chars('-12.1grams')
> ['-12.1', 'grams']
import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A: it outputs abc for the letters group and 3243-23A for the_rest.
Since you said they are all on individual lines, you'll need to pass one line at a time as input.
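Applied line by line, that could look like the sketch below. The sample lines stand in for the contents of the text file, and the variable names are only for illustration.

```python
import re

pattern = re.compile(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$")

# Stand-in for the lines read from the product-code file.
lines = "abcd2343\nabw34324\nabc3243-23A".splitlines()

pairs = []
for line in lines:
    m = pattern.match(line.strip())
    if m:  # skip lines that do not start with letters
        pairs.append((m.group("letters"), m.group("the_rest")))

print(pairs)
# [('abcd', '2343'), ('abw', '34324'), ('abc', '3243-23A')]
```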
def firstIntIndex(string):
    result = -1
    for k in range(0, len(string)):
        if (bool(re.match(r'\d', string[k]))):
            result = k
            break
    return result
To partition on the first digit
parts = re.split('(\d.*)','abcd2343') # => ['abcd', '2343', '']
parts = re.split('(\d.*)','abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split('(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is on its own line, use s.splitlines() instead of s.split(' ').
Split once, at the first space that is followed by a digit, using a lookahead:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)', text, maxsplit=1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
