I'm currently trying to convert a sorted list of characters into a string. The delimiter of this string should be '-' if the character before and after are different, but should be '&' if these characters are equal.
An example:
The list ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C'] should become '1&1-2-9-A&A&A-B-C'.
This will happen for approximately 250K lists in a pandas DataFrame. I'm thinking of creating a string with '-' as every delimiter and then replacing some of them using the str.replace() function, but I'm getting stuck at the final part.
A simplification of my current code is as follows (where column 'sorted' contains a sorted list of the characters as above).
df['joined'] = df['sorted'].str.join('-')
df['correct'] = df['joined'].str.replace(r"\-(.\-)\1{1,}?", xxxx, regex=True)
Is there a replacement pattern that can stand in for xxxx and do the same job as the first pattern, with the '.' being the original character? Or is there another solution (for example, a matching positive lookbehind and lookahead)?
Thanks!
I would do it the following way:
import re
chars = ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C']
joined = '-'.join(chars)
result = re.sub(r'(.)-(?=\1)', r'\1&', joined)
print(result) # 1&1-2-9-A&A&A-B-C
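Explanation: I used a positive lookahead here to check whether the - is followed by the same character as the one before it. A lookahead is a zero-length assertion, so it does not consume the character it checks, and that character is still available as the start of the next match. This is what makes the replacement work for - characters that are only one character apart. Consider
A-A-A
This results in the matches:
(A-)(A-)A
If we used r'(.)-\1' as the pattern, it would be:
(A-A)-A
thus leaving the second - unchanged.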
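As a cross-check that does not rely on regular expressions, here is a minimal sketch that builds the same string with itertools.groupby and compares it with the lookahead version. The function name join_sorted is made up for illustration, and the sketch assumes the list is already sorted, as in the question.

```python
import re
from itertools import groupby

def join_sorted(chars):
    """Join equal neighbours with '&' and different neighbours with '-'.

    Hypothetical helper; assumes `chars` is already sorted.
    """
    # groupby yields one run per block of equal consecutive items
    return '-'.join('&'.join(run) for _, run in groupby(chars))

chars = ['1', '1', '2', '9', 'A', 'A', 'A', 'B', 'C']
by_groupby = join_sorted(chars)
by_regex = re.sub(r'(.)-(?=\1)', r'\1&', '-'.join(chars))
print(by_groupby)              # 1&1-2-9-A&A&A-B-C
print(by_groupby == by_regex)  # True
```

On a DataFrame, a function like this could presumably be applied row by row with df['sorted'].apply(join_sorted). Unlike the single-character (.) pattern, it would also cope with items longer than one character.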
Let's say that I have a string like this one
string = 'rename_file_1122--23-_12'
Is there a way to split it like this?
parts = ['rename','_','file','_','1122','--','23','-_','12']
I tried a regular expression, but it does not work:
import re
name_parts = re.findall(r'\d+|\D+|\w+|\W+', string)
The result was:
['rename_file_', '1122', '--', '23', '-_', '12']
########## Second part
If I have a string like this one :
string2 = 'Hello_-Marco5__-'
What pattern do I need to use to get ['Hello', '_-', 'Marco', '5', '__-']? My goal is to split a string into groups of letters, digits and '-_' characters.
Thanks for your answers.
You can use
re.findall(r'[^\W_]+|[\W_]+', string)
Regex details:
[^\W_]+ - one or more chars other than non-word and _ chars (so, one or more letters or digits)
| - or
[\W_]+ - one or more non-word and/or _ chars.
See a Python demo:
import re
string = 'rename_file_1122--23-_12'
name_parts = re.findall(r'[^\W_]+|[\W_]+', string)
print(name_parts)
# => ['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']
Alternatively you could use groupby from itertools:
from itertools import groupby
string = 'rename_file_1122--23-_12'
result = [''.join(value) for key, value in groupby(string, key=str.isalnum)]
print(result)
Output:
['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']
edit:
I came up with a perhaps simpler solution, using regular expressions:
import re
string = 'rename_file_1122--23-_12'
result = re.split('([_-]+)', string)
print(result)
Same output.
re.split splits the string wherever the regular expression matches. The expression I've used includes a capturing group, and when the pattern contains one, split also includes the captured text in the result:
([_-]+)
This means match (and remember the result of) a sequence of one or more of any of _ or -. + means one or more, and [] means any of the characters inside the square brackets. (With * instead of +, the pattern could also match the empty string, and since Python 3.7 re.split splits on empty matches as well, so it would no longer give this output.)
Without the group, just using [_-]+, we'd get the following, without the matches:
string = 'rename_file_1122--23-_12'
result = re.split('[_-]+', string)
print(result)
print(result)
Output:
['rename', 'file', '1122', '23', '12']
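To make the effect of the capturing group concrete, here is a short sketch comparing re.split with and without it. It uses the + quantifier so that the separator pattern can never match the empty string, and it adds one edge case that is not in the answer: a separator at the very start of the string.

```python
import re

string = 'rename_file_1122--23-_12'

# With a capturing group, the separators are kept in the result.
with_seps = re.split(r'([_-]+)', string)
print(with_seps)
# ['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']

# Without the group, only the text between separators is returned.
without_seps = re.split(r'[_-]+', string)
print(without_seps)
# ['rename', 'file', '1122', '23', '12']

# A separator at the very start leaves an empty string as the first piece;
# filtering on truthiness removes it.
leading = re.split(r'([_-]+)', '_a-b')
print(leading)                     # ['', '_', 'a', '-', 'b']
print([p for p in leading if p])   # ['_', 'a', '-', 'b']
```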
I have found a solution for the second part. It is the following:
name_parts=re.findall(r'[^\d_]+|[^\D]+|[^\W_]+|[\W_]+', string)
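The pattern above works for the sample string, but its first alternative, [^\d_]+, also accepts '-' and other punctuation, so a hyphen sitting between letters would be absorbed into the letter group. A tighter sketch, with one alternative per class (letters, digits, everything else), might look like this. The name TOKEN is only for illustration.

```python
import re

# letters | digits | anything that is neither (punctuation, '_', spaces)
TOKEN = re.compile(r'[^\W\d_]+|\d+|[\W_]+')

print(TOKEN.findall('Hello_-Marco5__-'))
# ['Hello', '_-', 'Marco', '5', '__-']
print(TOKEN.findall('rename_file_1122--23-_12'))
# ['rename', '_', 'file', '_', '1122', '--', '23', '-_', '12']

# Where the two patterns differ: a hyphen directly between letters.
posted = r'[^\d_]+|[^\D]+|[^\W_]+|[\W_]+'
print(re.findall(posted, 'a-b'))   # ['a-b']
print(TOKEN.findall('a-b'))        # ['a', '-', 'b']
```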
The line re.findall(r'(.)*', 'aabc') returns ['c', ''] instead of ['a', 'a', 'b', 'c', '']. Why is that?
Thank you
Your pattern says match one capturing group of one character (.) zero or more times *.
There are two matches with this pattern. First, it matches 'aabc' as (.) (a capturing group of a single character) four times. The result in findall for that will be the content of the captured group, which is c, since the last thing your group captures is c.
The second match found is the empty string (which is a match since * can mean "zero times"), in which case nothing is captured in the capturing group, and you get an empty string as a result.
If you want the result ['a', 'a', 'b', 'c', ''], you could use
re.findall(r'.?', 'aabc')
which is "match any single character optionally".
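The behaviour described above can be checked with a few variations of the pattern. This is a small sketch; the variable name text is just for illustration.

```python
import re

text = 'aabc'

# A repeated capturing group only remembers its last iteration.
print(re.findall(r'(.)*', text))     # ['c', '']

# Making the group non-capturing returns the whole match instead.
print(re.findall(r'(?:.)*', text))   # ['aabc', '']

# An optional single character also matches the empty string at the end.
print(re.findall(r'.?', text))       # ['a', 'a', 'b', 'c', '']

# Matching exactly one character gives one item per character.
print(re.findall(r'.', text))        # ['a', 'a', 'b', 'c']
```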
I noticed that when I did "heelo".split("e"), it returned ['h', '', 'lo']. Why is there an empty item in the list? Shouldn't it have been ['h', 'lo']?
I am confused about why I received that result instead of what I had expected, and would appreciate it if someone could explain to me how split works.
From the Python docs:
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2'])
Your string is divided between the first e and the second e, but there is no character there, so you get an empty string '' back.
The first 'e' separates off the 'h', and the character right after it is also an 'e'. Since there is nothing between the first and second 'e', the piece between them is an empty string.
If we add one more 'e':
"heeelo".split("e")
['h', '', '', 'lo']
It returns two empty strings between the three 'e's.
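If the empty strings are not wanted, there are a couple of common ways around them. The following is a minimal sketch; the variable name word is just for illustration.

```python
import re

word = "heelo"

# An explicit separator never merges neighbours: empty strings mark the gaps.
print(word.split("e"))                      # ['h', '', 'lo']

# Option 1: drop the empty pieces afterwards.
print([p for p in word.split("e") if p])    # ['h', 'lo']

# Option 2: treat a run of separators as one, via a regex.
print(re.split(r"e+", word))                # ['h', 'lo']

# Only the no-argument form groups consecutive (whitespace) separators.
print("a  b".split())                       # ['a', 'b']
print("a  b".split(" "))                    # ['a', '', 'b']
```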
I want to match any number of digits, decimal points, and the letter e, or ONE CHARACTER in this list of characters when it occurs in a string:
+ - % ^ * / ( )
then I want to break the subject string into a list containing each individual match.
I have the following regex to attempt this, which I'm fairly certain is correct: ([0-9.e]+|[\^\*\/\%\+\-\(\)]). I even tested it on regex101.com, and it matches the way I want it to.
However, when I run re.findall() on the string (5+2)*5, it returns the following list:
['(', '5', '+', '2', u')*', '5']
What is wrong with my regex?
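For reference, here is a small sketch of what the quoted pattern does when passed to re.findall as a raw string on a current Python 3. This is only a reproduction attempt, not a diagnosis: in this form the ')' and '*' come back as separate items, so the ')*' in the output above may stem from something that is not shown in the question.

```python
import re

# Pattern copied from the question, written as a raw string.
pattern = r'([0-9.e]+|[\^\*\/\%\+\-\(\)])'

tokens = re.findall(pattern, '(5+2)*5')
print(tokens)  # ['(', '5', '+', '2', ')', '*', '5']
```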
I need to extract all letters after the + sign or at the beginning of a string like this:
formula = "X+BC+DAF"
I tried the following, but I do not want to see the + sign in the result. I wish to see only ['X', 'B', 'D'].
>>> re.findall("^[A-Z]|[+][A-Z]", formula)
['X', '+B', '+D']
When I grouped with parentheses, I got this strange result:
>>> re.findall("^([A-Z])|[+]([A-Z])", formula)
[('X', ''), ('', 'B'), ('', 'D')]
Why did it create tuples when I tried to group? How can I write the regexp directly so that it returns ['X', 'B', 'D']?
If there are any capturing groups in the regular expression then re.findall returns only the values captured by the groups. If there are no groups the entire matched string is returned.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
How to write the regexp directly such that it returns ['X', 'B', 'D'] ?
Use a non-capturing group for the alternation and keep a single capturing group around the letter:
>>> re.findall(r"(?:^|\+)([A-Z])", formula)
['X', 'B', 'D']
Or for this specific case you could try a simpler solution using a word boundary:
>>> re.findall(r"\b[A-Z]", formula)
['X', 'B', 'D']
Or a solution using str.split that doesn't use regular expressions:
>>> [s[0] for s in formula.split('+')]
['X', 'B', 'D']
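To see side by side how the number of capturing groups changes what re.findall returns, here is a short sketch using the formula from the question. The variable name pairs is just for illustration.

```python
import re

formula = "X+BC+DAF"

# No groups: each item is the whole match, '+' included.
print(re.findall(r"^[A-Z]|[+][A-Z]", formula))    # ['X', '+B', '+D']

# Two groups: each item is a tuple with one slot per group.
pairs = re.findall(r"^([A-Z])|[+]([A-Z])", formula)
print(pairs)        # [('X', ''), ('', 'B'), ('', 'D')]

# The tuples can be collapsed by taking whichever slot is non-empty.
print([a or b for a, b in pairs])                 # ['X', 'B', 'D']

# One group: each item is just that group's text.
print(re.findall(r"(?:^|\+)([A-Z])", formula))    # ['X', 'B', 'D']
```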