How can I achieve something like this: "Ca(OH)2" => "Ca" and "(OH)2"
In python, it can be achieved like this:
import re
compound = "Ca(OH)2"
segments=re.split('(\([A-Za-z0-9]*\)[0-9]*)',compound)
print(segments)
Output: ['Ca', '(OH)2', '']
I am following this tutorial from https://medium.com/swlh/balancing-chemical-equations-with-python-837518c9075b (except that I wanted to do it in Java)
To break down the regex (\([A-Za-z0-9]*\)[0-9]*): the outermost parentheses (next to the single quotes) form the capture group, which is the part we want to keep. The inner parentheses with a backslash before them mean that we want to find literal parentheses (this is called escaping). [A-Za-z0-9] means we accept any letter (of either case) or digit within the parentheses, and the asterisk after the square brackets is a quantifier: it allows zero or more of those characters. The [0-9]* near the end means we want to include all digits to the right of the closing parenthesis in the split.
I tried to do it in Java but the output was not what I wanted:
String compound = "Ca(OH)2";
String[] segments = compound.split("(\\([A-Za-z0-9]*\\)[0-9]*)");
System.out.println(Arrays.toString(segments));
Output: [Ca]
In Java, unlike Python's re.split method, String#split does not keep the captured parts.
You can use the following code in Java:
String s = "Ca(OH)2";
Pattern p = Pattern.compile("\\([A-Za-z0-9]+\\)[0-9]*|[A-Za-z0-9]+");
Matcher m = p.matcher(s);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group());
}
System.out.println(res); // => [Ca, (OH)2]
See the online demo. Here, \([A-Za-z0-9]+\)[0-9]*|[A-Za-z0-9]+ regex matches
\([A-Za-z0-9]+\)[0-9]* - (, one or more ASCII letters/digits, ) and then zero or more digits
| - or
[A-Za-z0-9]+ - one or more ASCII letters/digits.
See the regex demo. It can also be written as
Pattern p = Pattern.compile("\\(\\p{Alnum}+\\)\\d*|\\p{Alnum}+");
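For comparison with the Python snippet in the question, here is a minimal sketch of the same match-instead-of-split idea using re.findall. The helper name split_compound and the second sample formula are illustrative assumptions, not part of the original posts.

```python
import re

# Same alternation as the Java pattern: a parenthesized group with an
# optional trailing count, or a plain run of ASCII letters/digits.
SEGMENT = re.compile(r"\([A-Za-z0-9]+\)[0-9]*|[A-Za-z0-9]+")


def split_compound(compound):
    """Return the segments of a formula (hypothetical helper name)."""
    # The pattern has no capture groups, so findall returns whole matches.
    return SEGMENT.findall(compound)


print(split_compound("Ca(OH)2"))    # ['Ca', '(OH)2']
print(split_compound("Al2(SO4)3"))  # ['Al2', '(SO4)3']
```

Unlike re.split with a capture group, this does not produce a trailing empty string.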
Try this, mate:
String[] segments = compound.split("([^\\w*])");
The output should be:
[Ca, OH, 2]
Note that this drops the parentheses instead of keeping (OH)2 together.
Hopefully it will help you!
Related
I would like to write a regular expression for formatting a text in which a { character is not allowed unless it is preceded by a backslash \. The problem is that a backslash can escape itself, so I don't want to match \\{ for example, but I do want to match \\\{. In other words, I want only an odd number of backslashes before a {. I can't just capture them in a group and then check how many backslashes there are, like this:
s = r"a wei\\\{rd thing\\\\\{"
matchs = re.finditer(r"([^\{]|(\\+)\{)+", s)
for match in matchs:
    if len(match.group(2)) / 2 == len(match.group(2)) // 2: # check if it's even
        continue
    do_some_things()
This fails because group 2 can be matched more than once, and I can only access its last capture (in this case, \\\\\).
It would be really nice if we could write something like "([^\{]|(\\+)(?if len(\2) / 2 == len(\2) // 2)\{)+" as a regular expression, but, as far as I know, that is impossible.
How can I do this, then?
This matches an odd number of backslashes followed by a brace:
(?<!\\)(\\\\)*(\\\{)
Breakdown:
(?<!\\) - Not preceded by a backslash, to accommodate the next bit
This is called "negative lookbehind"
(\\\\)* - Zero or more pairs of backslashes
(\\\{) - A backslash then a brace
Matches:
\{
\\\{
\\\\\{
Non-matches:
\\{
\\\\{
\\\\\\{
Try it on RegExr
This was partly inspired by Vadim Baratashvili's answer
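Here is a minimal sketch of how the pattern above could be checked in Python against the listed matches and non-matches. The helper name has_escaped_brace is an assumption for illustration only.

```python
import re

# Not preceded by a backslash, then zero or more backslash pairs,
# then one backslash and a brace: an odd number of backslashes in total.
ODD_BACKSLASH_BRACE = re.compile(r"(?<!\\)(\\\\)*(\\\{)")


def has_escaped_brace(text):
    """True if text contains a { preceded by an odd run of backslashes."""
    return ODD_BACKSLASH_BRACE.search(text) is not None


for sample in [r"\{", r"\\\{", r"\\\\\{", r"\\{", r"\\\\{", r"\\\\\\{"]:
    print(sample, has_escaped_brace(sample))
```

The first three samples print True and the last three print False, which agrees with the two lists above.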
I think you can use this as a solution:
([^\\](\\\\){0,})(\{)
This checks that, after the last character that is not a backslash, there are zero or more pairs of backslashes followed by {. If part of the text matches the pattern, we can replace it with the first group, $1 (the non-backslash character plus zero or more pairs of backslashes), so this finds and replaces an unescaped {.
If we want to find an escaped {, we can use this expression:
([^\\](\\\\){0,})(\\\{) - the last group of the match (group 3) is \{
I am trying to build a regex to match 5 digit numbers or those 5 digit numbers preceded by IND/
10223 match to return 10223
IND/10110 match to return 10110
ID is 11233 match to return 11233
Ref is:10223 match to return 10223
Ref is: th10223 not match
SBI12234 not match
MRF/10234 not match
RBI/10229 not match
I have used the following regex, which selects the 5 digits correctly using word boundaries. But I'm not sure how to allow IND and disallow anything else like MRF, etc.:
\b\d{5}\b
If I put (IND)? at the beginning of the regex, it doesn't help. Any hints?
Use a look behind:
(?<=^IND\/|^ID is |^)\d{5}\b
See live demo.
Because the look behind doesn’t consume any input, the entire match is your target number (ie there’s no need to use a group).
Variable-length lookbehind is not supported by Python's re module, so use alternation instead:
(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)
Demo
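As a rough check, the alternation pattern above can be run against the samples from the question. This is only a sketch; the helper name extract_code is hypothetical.

```python
import re

# Both lookbehind alternatives ("IND/" and " is" plus ":" or space) are
# four characters wide, so Python's re module accepts them.
FIVE_DIGITS = re.compile(r"(?:(?<=IND/| is[: ])\d{5}|^\d{5})(?!\d)")


def extract_code(text):
    """Return the 5-digit number if the text qualifies, else None."""
    m = FIVE_DIGITS.search(text)
    return m.group(0) if m else None


samples = [
    "10223", "IND/10110", "ID is 11233", "Ref is:10223",
    "Ref is: th10223", "SBI12234", "MRF/10234", "RBI/10229",
]
for sample in samples:
    print(repr(sample), "->", extract_code(sample))
```

The first four samples return their number and the last four return None.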
This should work: (?<=IND/|\s|^)(\d{5})(?=\s|$) .
Try this: (?:IND\/|ID is |^)\b(\d{5})\b
Explanation:
(?: ALLOWED TEXT): A non-capture group with all allowed segments inside. In your example, IND\/ for "IND/", ID is for "ID is ...", and ^ for the beginning of the string (in case of only the number / no text at start: 12345).
\b(\d{5})\b: Your existing pattern w/ capture group for 5-digit number
I feel like this will need some logic to it. The regex can find the 5 digits, but maybe a second regex pattern to find IND, then join them together if need be. Not sure if you are using Python, .Net, or Java, but should be doable
This is not for homework!
Hello,
Just a quick question about Regex formatting.
I have a list of different courses.
L = ['CI101', 'CS164', 'ENGL101', 'I-', 'III-', 'MATH116', 'PSY101']
I was looking for a format to find all the words that start with I, or II, or III. Here is what I did. (I used python fyi)
for course in L:
    if re.search("(I?II?III?)*", course):
        L.pop()
I learned that ? in regex means optional. So I was thinking of making I, II, and III optional and * to include whatever follows. However, it seems like it is not working as I intended. What would be a better working format?
Thanks
Here is the regex you should use:
^I{1,3}.*$
click here to see example
^ matches the start of a line. I{1,3} matches I repeated 1 to 3 times. .* matches whatever characters follow. $ matches the end of a line. So this regex matches all the words that start with I, II, or III.
Now look at your regex. First, it has no ^ anchor, so it can match I anywhere in the string. Second, ? only affects the single character before it: the first I is optional, the second is required, the third is optional, the fourth and fifth are required, and the sixth is optional. Finally, the parentheses followed by * mean the whole group can repeat any number of times, including zero. So it matches either no I at all or at least three I's, and because zero repetitions are allowed, the search succeeds on every string.
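To make the difference concrete, here is a small sketch that runs both patterns over the list from the question. The variable names are illustrative.

```python
import re

courses = ['CI101', 'CS164', 'ENGL101', 'I-', 'III-', 'MATH116', 'PSY101']

# The original pattern can match the empty string, so search() succeeds
# on every course, whether or not it contains an I.
original_hits = [c for c in courses if re.search(r"(I?II?III?)*", c)]

# The anchored pattern only succeeds when the string starts with I.
anchored_hits = [c for c in courses if re.search(r"^I{1,3}.*$", c)]

print(original_hits == courses)  # True
print(anchored_hits)             # ['I-', 'III-']
```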
Instead of search() you can use the function match() that matches the pattern at the beginning of string:
import re
l = ['CI101', 'CS164', 'ENGL101', 'I-', 'III-', 'MATH116', 'PSY101']
pattern = re.compile(r'I{1,3}')
[i for i in l if not pattern.match(i)]
# ['CI101', 'CS164', 'ENGL101', 'MATH116', 'PSY101']
The following should be matched:
AAA123
ABCDEFGH123
XXXX123
can I do: ".*123" ?
Yes, you can. That should work.
. = any char except newline
\. = the actual dot character
.? = .{0,1} = match any char except newline zero or one times
.* = .{0,} = match any char except newline zero or more times
.+ = .{1,} = match any char except newline one or more times
Yes that will work, though note that . will not match newlines unless you pass the DOTALL flag when compiling the expression:
Pattern pattern = Pattern.compile(".*123", Pattern.DOTALL);
Matcher matcher = pattern.matcher(inputStr);
boolean matchFound = matcher.matches();
Use the pattern . to match any character once, .* to match any character zero or more times, .+ to match any character one or more times.
The most common way I have seen to encode this is with a character class whose members form a partition of the set of all possible characters.
Usually people write that as [\s\S] (whitespace or non-whitespace), though [\w\W], [\d\D], etc. would all work.
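The same behaviour can be seen in Python, where the equivalent flag is re.DOTALL. This is a minimal sketch with an assumed two-line sample string.

```python
import re

# Assumed sample: two of the question's inputs joined by a newline.
text = "AAA123\nXXXX123"

plain = re.fullmatch(r".*123", text)              # '.' stops at the newline
dotall = re.fullmatch(r".*123", text, re.DOTALL)  # '.' also matches '\n'
either = re.fullmatch(r"[\s\S]*123", text)        # no flag needed

print(plain is None)       # True
print(dotall is not None)  # True
print(either is not None)  # True
```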
.* and .+ are for any chars except for new lines.
Double Escaping
In case you want to include newlines, the following expressions should also work in languages that require double escaping in string literals, such as Java or C++:
[\\s\\S]*
[\\d\\D]*
[\\w\\W]*
for zero or more times, or
[\\s\\S]+
[\\d\\D]+
[\\w\\W]+
for one or more times.
Single Escaping:
Double escaping is not required in some languages, such as C#, PHP, Ruby, Perl, Python, and JavaScript:
[\s\S]*
[\d\D]*
[\w\W]*
[\s\S]+
[\d\D]+
[\w\W]+
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression{
    public static void main(String[] args){
        final String regex_1 = "[\\s\\S]*";
        final String regex_2 = "[\\d\\D]*";
        final String regex_3 = "[\\w\\W]*";
        final String string = "AAA123\n\t"
            + "ABCDEFGH123\n\t"
            + "XXXX123\n\t";
        final Pattern pattern_1 = Pattern.compile(regex_1);
        final Pattern pattern_2 = Pattern.compile(regex_2);
        final Pattern pattern_3 = Pattern.compile(regex_3);
        final Matcher matcher_1 = pattern_1.matcher(string);
        final Matcher matcher_2 = pattern_2.matcher(string);
        final Matcher matcher_3 = pattern_3.matcher(string);
        if (matcher_1.find()) {
            System.out.println("Full Match for Expression 1: " + matcher_1.group(0));
        }
        if (matcher_2.find()) {
            System.out.println("Full Match for Expression 2: " + matcher_2.group(0));
        }
        if (matcher_3.find()) {
            System.out.println("Full Match for Expression 3: " + matcher_3.group(0));
        }
    }
}
Output
Full Match for Expression 1: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 2: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 3: AAA123
ABCDEFGH123
XXXX123
If you wish to explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
There are lots of sophisticated regex testing and development tools, but if you just want a simple test harness in Java, here's one for you to play with:
String[] tests = {
    "AAA123",
    "ABCDEFGH123",
    "XXXX123",
    "XYZ123ABC",
    "123123",
    "X123",
    "123",
};
for (String test : tests) {
    System.out.println(test + " " + test.matches(".+123"));
}
Now you can easily add new testcases and try new patterns. Have fun exploring regex.
See also
regular-expressions.info/Tutorial
No, * will match zero-or-more characters. You should use +, which matches one-or-more instead.
This expression might work better for you: [A-Z]+123
Specific Solution to the example problem:-
Try [A-Z]*123$, which will match 123, AAA123, and ASDFRRF123. In case you need at least one character before 123, use [A-Z]+123$.
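A quick sketch of the difference between the two quantifiers, in Python. It assumes the whole string must match (re.fullmatch), and the helper names and the lowercase sample are made up for illustration.

```python
import re


def matches_star(text):
    """Whole-string match for [A-Z]*123 (zero or more letters allowed)."""
    return re.fullmatch(r"[A-Z]*123", text) is not None


def matches_plus(text):
    """Whole-string match for [A-Z]+123 (at least one letter required)."""
    return re.fullmatch(r"[A-Z]+123", text) is not None


for sample in ["AAA123", "ASDFRRF123", "123", "abc123"]:
    print(sample, matches_star(sample), matches_plus(sample))
```

Only the bare "123" separates the two: it passes with * and fails with +.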
General Solution to the question (How to match "any character" in the regular expression):
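If you are looking for anything including whitespace, you can try [\w\W]{min_char_to_match,}. (Inside a character class, | is a literal pipe rather than an alternation, so it is not needed here.)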
If you are trying to match anything except whitespace you can try [\S]{min_char_to_match,}.
Try the regex .{3,}. This will match all characters except a new line.
In JavaScript, [^] matches any character, including newline. [^CHARS] matches all characters except those in CHARS; if CHARS is empty, it matches all characters. Other regex flavors may not accept an empty negated class.
JavaScript example:
/a[^]*Z/.test("abcxyz \0\r\n\t012789ABCXYZ") // Returns true.
I like the following:
[!-~]
This matches every printable ASCII character from ! (code 33) to ~ (code 126): the special characters plus the usual A-Z, a-z, and 0-9. The space character (code 32) is not included.
https://www.w3schools.com/charsets/ref_html_ascii.asp
E.g. faker.internet.password(20, false, /[!-~]/)
Will generate a password like this: 0+>8*nZ\\*-mB7Ybbx,b>
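To see exactly which characters the [!-~] range covers, here is a small Python sketch that tests every ASCII code point. The variable names are illustrative.

```python
import re
import string

PRINTABLE = re.compile(r"[!-~]")

# Collect every ASCII character that the class accepts.
matched = [chr(code) for code in range(128) if PRINTABLE.fullmatch(chr(code))]

print(len(matched))               # 94
print(matched[0], matched[-1])    # ! ~
print(" " in matched)             # False: space (code 32) is below the range
```

The accepted set is the ASCII letters, digits, and punctuation, without whitespace or control characters.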
This works for me. The dot does not always mean any character; it only does so in single-line mode. \p{all} should work instead:
String value = "|°¬<>!\"#$%&/()=?'\\¡¿/*-+_#[]^^{}";
String expression = "[a-zA-Z0-9\\p{all}]{0,50}";
if(value.matches(expression)){
System.out.println("true");
} else {
System.out.println("false");
}
I have a very large document containing section references in different formats. I want to extract these references using Python & regex.
Examples of the string formats:
1) Section 23
2) Section 45(3)
3) point (e) of Section 75
4) Sections 21(1), 54(2), 78(1)
Right now, I have the following code:
s = "This is a sample for Section 231"
m = re.search('Section\\W+(\\w+)', s)
m.group(0)
The output is: Section 231
This works perfectly, except that it does not account for the other formatting cases.
Is there any way to indicate that for 231(1), the (1) should also be extracted? Or to include the following section numbers if several others are listed?
I'm also open to using other libraries if you think Regex is not the best in this case. Thank you!
Try:
Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*
Demo
>>> s = 'Sections 21(1), 54(2), 78(1)'
>>> res = re.search(r'Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*', s)
>>> res.group(0)
# => 'Sections 21(1), 54(2), 78(1)'
Explanation:
Sections? matches "Section" with an optional s
\W+(\w+)(\(\w+\))? matches the section number/title (as you did it) and adds optional text in parentheses
(, (\w+)(\(\w+\))?)* allows repetition of the section number pattern after a comma and a space
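One caveat: a repeated group only keeps its last capture, so the groups alone will not give you every section number. A possible workaround, sketched here under that assumption, is to take the whole match and pull the individual items out with a second pattern. The names ITEM and section_numbers are hypothetical.

```python
import re

SECTION_REF = re.compile(r"Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*")
# One item: a number/title with an optional parenthesized part.
ITEM = re.compile(r"\w+(?:\(\w+\))?")


def section_numbers(text):
    """Return every section item of the first reference found in text."""
    m = SECTION_REF.search(text)
    if m is None:
        return []
    # Drop the leading "Section"/"Sections" word, then list the items.
    body = re.sub(r"^Sections?\W+", "", m.group(0))
    return ITEM.findall(body)


print(section_numbers("Sections 21(1), 54(2), 78(1)"))
print(section_numbers("This is a sample for Section 231"))
```

The first call gives ['21(1)', '54(2)', '78(1)'] and the second gives ['231'].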
EDIT
To exclude Section 1 of Other Book you can use combination of word boundary and negative lookahead:
Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*\b(?! of)
Demo
\b assures that you match until end of a word
(?! of) check that after the word boundary there is no space followed by of
There's probably never going to be a catch-all regex for this - however the following is quite close to what you want:
Sections?( *\d+((\(\d+\))*,?(?= *))*)+
Sections? = Section or Sections
( *\d+((\(\d+\))*,?(?= *))*)+ = 1 or more of: 0 or more spaces, then 1 or more digits, optionally followed by one or more groups of digits in parentheses, then an optional comma and 0 or more spaces.
The 'trailing' space uses a positive lookahead so it isn't included in the match, so you don't need to strip trailing spaces.
Try it out