Python: Splitting a string by numbers, letters and -_

Let's say that I have a string like this one:
string = 'rename_file_1122--23-_12'
Is there a way to split it like this?
parts = ['rename','_','file','_','1122','--','23','-_','12']
I tried with a regular expression, but it does not work:
import re
name_parts = re.findall('\d+|\D+|\w+|\W+', string)
The result was:
['rename_file_', '1122', '--', '23', '-_', '12']
########## Second part
If I have a string like this one:
string2 = 'Hello_-Marco5__-'
What conditions do I need to use to get ['Hello','_-','Marco','5','__-']? My goal is to split a string into groups of letters, digits and '-_'.
Thanks for your answers.

You can use
re.findall(r'[^\W_]+|[\W_]+', string)
Regex details:
[^\W_]+ - one or more chars that are neither non-word chars nor _ (that is, one or more letters or digits)
| - or
[\W_]+ - one or more non-word and/or _ chars.
See a Python demo:
import re
string = 'rename_file_1122--23-_12'
name_parts = re.findall(r'[^\W_]+|[\W_]+', string)
print(name_parts)
# => ['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']

Alternatively you could use groupby from itertools:
from itertools import groupby
string = 'rename_file_1122--23-_12'
result = [''.join(value) for key, value in groupby(string, key=str.isalnum)]
print(result)
Output:
['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']
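The same idea extends to the second part of the question if the key distinguishes letters, digits and everything else. A minimal sketch (the three-way key function below is my own, not part of the original answer):
from itertools import groupby

string2 = 'Hello_-Marco5__-'

def kind(ch):
    # Classify each character as a letter, a digit, or "other" (here _ and -).
    return 'alpha' if ch.isalpha() else 'digit' if ch.isdigit() else 'other'

result = [''.join(chunk) for _, chunk in groupby(string2, key=kind)]
print(result)
# => ['Hello', '_-', 'Marco', '5', '__-']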
Edit:
I came up with a perhaps simpler solution, using regular expressions:
import re
string = 'rename_file_1122--23-_12'
result = re.split('([_-]+)', string)
print(result)
Same output.
re.split will split the string wherever the regular expression matches. The expression I've used includes a grouping pattern, and split includes the match in the result:
([_-]+)
Means match (and remember the result of) a sequence of one or more of any of _ or -. + means one or more, [] means any of whatever's inside the square brackets.
Without the group, just using [_-]+, we'd get the following, without the matches:
string = 'rename_file_1122--23-_12'
result = re.split('[_-]+', string)
print(result)
Output:
['rename', 'file', '1122', '23', '12']

I have found the solution for the second part; it is the following:
name_parts = re.findall(r'[^\d_]+|[^\D]+|[^\W_]+|[\W_]+', string2)
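For reference, a quick check that prints the result for the second example string:
import re

string2 = 'Hello_-Marco5__-'
name_parts = re.findall(r'[^\d_]+|[^\D]+|[^\W_]+|[\W_]+', string2)
print(name_parts)
# => ['Hello', '_-', 'Marco', '5', '__-']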

Related

Splitting string with regex and re.findall

I want to match any number of digits, decimal points, and the letter e, or ONE CHARACTER in this list of characters when it occurs in a string:
+ - % ^ * / ( )
then I want to break the subject string into a list containing each individual match.
I have the following regex to attempt to accomplish this, which I'm fairly certain does it correctly: ([0-9.e]+|[\^\*\/\%\+\-\(\)]). I even went on regex101.com and tested it, and it properly matches how I want it to.
However, when I run re.findall() on the following string, (5+2)*5, it returns the following list:
['(', '5', '+', '2', u')*', '5']
What is wrong with my regex?
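For reference, a quick run of the quoted pattern on that input (nothing here beyond the question's own regex and a current Python) returns every token as its own element, which suggests the ')*' element in the quoted output came from something else in the original environment:
import re

expr = '(5+2)*5'
tokens = re.findall(r'([0-9.e]+|[\^\*\/\%\+\-\(\)])', expr)
print(tokens)
# => ['(', '5', '+', '2', ')', '*', '5']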

Escaping regex unicode string in Python

I have a user-defined string.
I want to use it in a regex with a small improvement: search by any of three apostrophes instead of one.
For example,
APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])
It works well for Latin, but for Unicode the list comprehension gives the following string:
"[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"
It looks like it finds backslashes in both strings and then substitutes APOSTROPHES.
Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].
How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"
What I understand is: you want to create a regular expression which can match a given word with any apostrophe:
The RegEx which matches any apostrophe can be defined with a character class:
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
For instance, you have this (Ukrainian?) word which contains a single quote:
word = "п'ять"
EDIT: If your word contains another kind of apostrophe, you can normalize it, like this:
word = re.sub(APOSTROPHES_REGEX, "'", word, flags=re.UNICODE)
To create a RegEx from the word, you escape it (because in some contexts it can contain special characters like punctuation). In Python versions before 3.7, re.escape() replaces the single quote "'" with an escaped single quote, r"\'"; from 3.7 on, the quote is left unescaped.
Either way, you can replace it with your apostrophe RegEx:
import re
word_regex = re.escape(word)
# re.escape() stopped escaping the apostrophe in Python 3.7, so look for
# whichever form this interpreter produced.
word_regex = word_regex.replace(re.escape("'"), APOSTROPHES_REGEX)
The new RegEx can then be used to match the same word with any apostrophe:
assert re.match(word_regex, "п'ять") # '
assert re.match(word_regex, "п’ять") # \u2019
assert re.match(word_regex, "пʼять") # \u02bc
Note: don’t forget to use the re.UNICODE flag; it will help with some RegEx character classes like r"\w".

How can I get the number of groups to vary depending on the number of lines?

I have this regex: ^:([^:]+):([^:]*), which works as shown on regex101.
Now, in Python, I have this:
import re

def get_data():
    data = read_mt_file()
    match_fields = re.compile('^:([^:]+):([^:]*)', re.MULTILINE)
    fields = re.findall(match_fields, data)
    return fields
Which, for a file containing the data from regex101, returns:
[('1', 'text\ntext\n\n'), ('20', 'text\n\n'), ('21', 'text\ntext\ntext\n\n'), ('22', ' \n\n'), ('25', 'aa\naa\naaaaa')]
Now, this is ok, but I want to change the regex, so that I can get the number of groups to vary depending on the number of lines. Meaning:
For the first match, I now get two groups:
1
text\ntext\n\n
I'd like to get instead:
1
((text\n), (text\n\n)) <-- these should somehow be in the same group but separated, each in its own subgroup. Somehow I need to know they both belong to one field, but are separate lines.
So, in Python, the desired result for that file would be:
[('1', '(text\n), (text\n\n)'), ('20', 'text\n\n'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', ' \n\n'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Is this possible with regex? Could this be achieved with some nice string manipulation instead?
To do what you want, you'd need another regex.
This is because a repeated capture group only keeps the last item it matches:
>>> re.match(r'(\d)+', '12345').groups()
('5',)
Instead of using one regex, you'll need to use two:
the one that you are using at the moment, and then one to match all the 'sub-groups', using, say, re.findall.
You can get these sub-groups by simply matching anything that isn't a \n, followed by any number of \n characters.
So you could use a regex such as [^\n]+\n*:
>>> re.findall(r'[^\n]+\n*', 'text\ntext')
['text\n', 'text']
>>> re.findall(r'[^\n]+\n*', 'text\ntext\n\n')
['text\n', 'text\n\n']
>>> re.findall(r'[^\n]+\n*', '')
[]
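Putting the two regexes together, a minimal sketch of that two-step approach (using the same sample string that also appears in the answer below):
import re

data = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"

field_re = re.compile(r'^:([^:]+):([^:]*)', re.MULTILINE)  # your original regex
line_re = re.compile(r'[^\n]+\n*')                         # splits a field body into lines

result = [(tag, line_re.findall(body)) for tag, body in field_re.findall(data)]
print(result)
# => [('1', ['text\n', 'text\n\n']), ('20', ['text\n\n']),
#     ('21', ['text\n', 'text\n', 'text\n\n']), ('22', [' \n\n']),
#     ('25', ['aa\n', 'aa\n', 'aaaaa'])]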
You may use a simple trick: after getting the matches with your regex, run a .+\n* regex over the Group 2 value:
import re
p = re.compile(r'^:([^:]+):([^:]+)', re.MULTILINE)
s = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"
print([[x.group(1)] + re.findall(r".+\n*", x.group(2)) for x in p.finditer(s)])
Here,
p.finditer(s) finds all matches in the string using your regex
[x.group(1)] - a list created from the first group contents
re.findall(r".+\n*", x.group(2)) - fetches individual lines from Group 2 contents (with trailing newlines, 0 or more)
[x.group(1)] + re.findall(...) - combines the two lists into one.
Result is
[['1', 'text\n', 'text\n\n'], ['20', 'text\n\n'], ['21', 'text\n', 'text\n', 'text\n\n'], ['22', ' \n\n'], ['25', 'aa\n', 'aa\n', 'aaaaa']]
Another approach: match all the substrings with your pattern and then use a re.sub to add ), ( between the lines ending with optional newlines:
[(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) for x, y in p.findall(s)]
Result:
[('1', '(text\n), (text\n\n)'), ('20', '(text\n\n)'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', '( \n\n)'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Here:
p.findall(s) - grabs all the matches in the form of a list of tuples containing your capture group contents using your regex
(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) - creates a tuple from Group 1 contents and Group 2 contents that are a bit modified with the re.sub the way described below
.+(?!\n*$)\n+ - a pattern that matches 1+ characters other than a newline and then 1+ newline symbols if they are not at the end of the string. If they are at the end of the string, no replacement is made (to avoid , () at the end). The \g<0> in the replacement string re-inserts the whole match back into the resulting string and appends ), ( to it.

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
for row in helper:
print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as its first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so that we don't have to write '| |.|,|(|..., there's a nicer form: you can use []s to state that everything inside should be treated as "match any one of these".
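For example, on the string from the question, the corrected character class splits like this (note the empty strings that appear where several delimiters sit next to each other, because the class is not quantified with +):
import re

row = 'feature.append(freq_and_feature(text, freq))'
print(re.split('[\' .,()_]', row))
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', '', 'freq', '', '']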
It seems you want to split the string on non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
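A short sketch of that re.sub variant, if you prefer trimming to filtering (same result as above):
import re

s = 'feature.append(freq_and_feature(text, freq))'
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)   # strip leading/trailing delimiters
print(re.split(r'[\W_]+', trimmed))
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']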
You can try this
str = re.split('[.(_,)]+', row, flags=re.IGNORECASE)
str.pop()
print str
This results in:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] is equivalent to [\W_] (for ASCII text).
Python code:
import re

s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed:
import re

p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))

python regex finditer

I have a question about re. I tried to look for an answer in the re documentation, but I think I am too much of a newbie for this.
I have a string like this:
string = "id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2"
I want to retrieve every result after '=', so I used
re.finditer("=[\w]*", string)
My result was as follows:
186
0
empty <-- there should be [cspacer0]---BlaBla--- here
2
How should my pattern look to get the channel_name as well?
The \w token only matches word characters; to allow the other characters here (brackets and hyphens), I would use \S (any non-whitespace character) instead. Also, instead of finditer, you can use findall for this task:
>>> import re
>>> s = 'id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'=(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
EDIT
The original string looks like this; I want to get everything after =, skipping =ok and idx=0:
>>> s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'(?<!idx)=(?!ok)(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
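If the list of keys to skip grows, a plain filter over key/value pairs may read more clearly than stacking lookarounds. A small sketch of that alternative (the excluded key names are taken from the sample string):
import re

s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'

pairs = re.findall(r'(\w+)=(\S+)', s)               # [('idx', '0'), ('msg', 'ok'), ...]
values = [v for k, v in pairs if k not in ('idx', 'msg')]
print(values)
# => ['186', '0', '[cspacer0]---BlaBla---', '2']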
