RegEx for re-occurring phrase - python

I have the following phrase:
05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):\tSome_alphanum_text_Some_alphanum_text_Some_alphanum_text_Some_alphanum_text\t\t*************************************************************************************************\t05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):\tSome_other_alphanum_text_Some_other_alphanum_text_Some_other_alphanum_text_Some_other_alphanum_text\t\t
I would like to write a regex that extracts only 'Some_alphanum_text' and 'Some_other_alphanum_text'.
So far I was trying my luck with something like this:
r'(?:.+\(PID-\d{6}\):)(.+)'
But it is only giving me the 'Some_other_alphanum_text' occurrence.
There can be more than 2 unique strings I will need to scoop out from this mess of a text. Any ideas?

You need to replace .+ with something that only matches what you want to return. Since you only want to match alphanumeric text, use \w instead of .
r'(?:\(PID-\d{6}\):)\s*(\w+)'
You need \s* before the capture group because the whitespace between the colon and the alphanumeric text won't match \w+.
You also don't need .+ at the beginning. The match will just begin where it finds PID.
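As a minimal sketch of how this pattern behaves with re.findall, assuming the separators in the input are real tab characters (the text below is a shortened, made-up stand-in for the line in the question):

```python
import re

# Shortened stand-in for the question's input; real tab characters are assumed.
text = (
    "05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):\tSome_alphanum_text\t\t"
    "*****\t"
    "05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):\tSome_other_alphanum_text\t\t"
)

# With exactly one capture group, findall returns a list of that group's contents.
pattern = re.compile(r'\(PID-\d{6}\):\s*(\w+)')
matches = pattern.findall(text)
print(matches)  # ['Some_alphanum_text', 'Some_other_alphanum_text']
```

Because the pattern no longer starts with a greedy .+, every PID marker produces its own match instead of only the last one.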

I believe you need this regex:
\(PID-\d{6}\):\\t(.+?)(?:\\t){2}

I think you could use this to find all the instances of text occurring between "\t"s
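The \\t in this pattern matches a literal backslash followed by the letter t, not a tab character. A small sketch under that assumption (the input is a shortened, made-up stand-in written as a raw string so the backslashes stay literal):

```python
import re

# Raw string: every \t here is two characters, a backslash and a 't'.
text = (
    r"05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301):\tSome_alphanum_text\t\t"
    r"*****\t"
    r"05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301):\tSome_other_alphanum_text\t\t"
)

# Lazy (.+?) stops at the first pair of literal "\t" sequences.
pattern = re.compile(r'\(PID-\d{6}\):\\t(.+?)(?:\\t){2}')
matches = pattern.findall(text)
print(matches)  # ['Some_alphanum_text', 'Some_other_alphanum_text']
```

If the input contains real tab characters instead, the pattern would need \t in place of \\t.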

The regex was not formatted as a code block at first, so it did not work as posted; it works now.
One thing you should consider is that there might be no '\t' at all. However,
every matched text is followed either by a date such as 05/12/2016 02:03 or by the end of the string.
\(PID-\d{6}\)[\n\r\t\s]*:(?:.|[\n\r\t\s])*?(?=[0-9]{2}\/[0-9]{2}\/[0-9]{4}[\n\r\t\s]*[0-9]{2}:[0-9]{2}|$)
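The same idea, stopping at the next date or at the end of the string, can be sketched in Python. This is a simplified variant and not the exact regex above: it adds a capture group, uses \s for all whitespace, and relies on re.DOTALL so that . also matches newlines. The sample text is made up and contains no tabs:

```python
import re

# Made-up input without tab separators.
text = (
    "05/30/2016 07:02 AM (GMT+02:00) added by XXX YYY (PID-000301): first note "
    "05/12/2016 02:03 PM (GMT+02:00) added by ZZZ AAA (PID-000301): second note"
)

pattern = re.compile(
    r'\(PID-\d{6}\)\s*:\s*(.*?)\s*'            # lazily capture the text after the colon
    r'(?=\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}|$)',  # stop before the next date or at the end
    re.DOTALL,
)
notes = pattern.findall(text)
print(notes)  # ['first note', 'second note']
```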

Related

I don't want my regex to start with a parenthesis [duplicate]

The initial string is [image:salmon-v5-09-14-2011.jpg]
I would like to capture the text "salmon-v5-09-14-2011.jpg" and used GSkinner's RegEx Tool
The closest I can get to my desired output is using this RegEx:
:([\w+.-]+)
The problem is that this sequence includes the colon and the output becomes
:salmon-v5-09-14-2011.jpg
How can I capture the desired output without the colon? Thanks for the help!

Use a look-behind:
(?<=:)[\w+.-]+
A look-behind (coded as (?<=someregex)) is a zero-width assertion: it checks that the preceding text matches, but does not include that text in the match.
Also, your regex could probably be simplified to this:
(?<=:)[^\]]+
which simply grabs anything between (but not including) a : and a ]

If you are always looking at strings in that format, I would use this pattern:
(?<=\[image:)[^\]]+
This looks behind for [image:, then matches until the closing ]

You have the correct regex; the tool you're using is just highlighting the entire match and not only your capture group. Hover over the match and see what "group 1" actually is.
If you want a slightly more robust regex you could try :([^\]]+) which will allow for any characters other than ] to appear in the file name portion.
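The difference between the whole match and the capture group, and the look-behind alternative, can be sketched in Python (used here only for illustration; the question itself was tested in an online tool):

```python
import re

s = "[image:salmon-v5-09-14-2011.jpg]"

# Capture-group approach: group(0) is the whole match, group(1) is the file name only.
m = re.search(r':([\w+.-]+)', s)
whole_match = m.group(0)
captured = m.group(1)

# Look-behind approach: the prefix is asserted but is not part of the match.
looked_behind = re.search(r'(?<=\[image:)[^\]]+', s).group(0)

print(whole_match)    # :salmon-v5-09-14-2011.jpg
print(captured)       # salmon-v5-09-14-2011.jpg
print(looked_behind)  # salmon-v5-09-14-2011.jpg
```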

Regex: Stop when it finds the first occurrence of a character [duplicate]

I am looking for a pattern that matches everything until the first occurrence of a specific character, say a ";" - a semicolon.
I wrote this:
/^(.*);/
But it actually matches everything (including the semicolon) until the last occurrence of a semicolon.

You need
/^[^;]*/
The [^;] is a negated character class; it matches any single character except a semicolon.
^ (start of line anchor) is added to the beginning of the regex so only the first match on each line is captured. This may or may not be required, depending on whether possible subsequent matches are desired.
To cite the perlre manpage:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
This should work in most regex dialects.

Would
/^(.*?);/
work?
The ? makes the preceding * lazy, so the regex grabs as little as possible before matching the ;.

/^[^;]*/
The [^;] says match anything except a semicolon. The square brackets are a set-matching operator: essentially, match any one character in this set of characters. The ^ at the start makes it an inverse match, so it matches anything not in this set.
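A short Python sketch comparing the greedy pattern from the question with the lazy and negated-class alternatives (the sample sentence is only an illustration):

```python
import re

text = "this is a test sentence; to prove this regex; that is g;iven below"

# Greedy: .* runs to the end and backtracks to the LAST semicolon.
greedy = re.match(r'^(.*);', text).group(1)

# Lazy: .*? expands one character at a time and stops at the FIRST semicolon.
lazy = re.match(r'^(.*?);', text).group(1)

# Negated class: can never step over a semicolon in the first place.
negated = re.match(r'^[^;]*', text).group(0)

print(greedy)   # this is a test sentence; to prove this regex; that is g
print(lazy)     # this is a test sentence
print(negated)  # this is a test sentence
```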

None of the proposed answers worked for me (e.g. in Notepad++).
But
^.*?(?=\;)
did.

Try /[^;]*/
Google regex character classes for details.

sample text:
"this is a test sentence; to prove this regex; that is g;iven below"
Given the sample text above, the regex /(.*?\;)/ will give you everything up to the first occurrence of a semicolon (;), including the semicolon: "this is a test sentence;"

Try /[^;]*/
That's a negating character class.

This was very helpful for me as I was trying to figure out how to match all the characters in an xml tag including attributes. I was running into the "matches everything to the end" problem with:
/<simpleChoice.*>/
but was able to resolve the issue with:
/<simpleChoice[^>]*>/
after reading this post. Thanks all.
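The same effect can be reproduced in Python. The element content and the identifier attribute below are made up for illustration:

```python
import re

# Hypothetical XML snippet.
xml = '<simpleChoice identifier="A">Yes</simpleChoice>'

# Greedy .* runs on to the last '>' in the string.
greedy = re.search(r'<simpleChoice.*>', xml).group(0)

# [^>]* cannot cross a '>', so the match ends with the opening tag.
bounded = re.search(r'<simpleChoice[^>]*>', xml).group(0)

print(greedy)   # <simpleChoice identifier="A">Yes</simpleChoice>
print(bounded)  # <simpleChoice identifier="A">
```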

This is not a regex solution, but it is simple enough for your problem description. Just split your string and take the first item from the resulting array.
$str = "match everything until first ; blah ; blah end ";
$s = explode(";",$str,2);
print $s[0];
output
$ php test.php
match everything until first

This will match up to the first occurrence only in each string and will ignore subsequent occurrences.
/^([^;]*);*/

"/^([^\/]*)\/$/" worked for me, to get only top "folders" from an array like:
a/ <- this
a/b/
c/ <- this
c/d/
/d/e/
f/ <- this
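A Python sketch of the same filter, using the list from the example above (in Python the slash does not need escaping):

```python
import re

paths = ["a/", "a/b/", "c/", "c/d/", "/d/e/", "f/"]

# One run of non-slash characters followed by exactly one trailing slash.
top_level_pattern = re.compile(r'^([^/]*)/$')

top_level = []
for path in paths:
    match = top_level_pattern.match(path)
    if match:
        top_level.append(match.group(1))

print(top_level)  # ['a', 'c', 'f']
```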

Really kinda sad that no one has given you the correct answer....
In regex, a ? placed after a quantifier makes it non-greedy. By default a quantifier will match as much as it can (greedy).
Simply add a ? after the * and it will be non-greedy and match as little as possible!
Good luck, hope that helps.

This works for getting the content from the beginning of a line up to and including the first word (the word itself is in capture group 1):
/^.*?([^\s]+)/gm

I faced a similar problem: I needed all the characters up to the first comma after the word entity_id. The solution that worked for me in BigQuery was:
SELECT regexp_extract(line_items, r'entity_id[^,]*')

Regex replace everything except match

I would like to keep only the words that start with '#' and continue with letters or dots. So far I have done the opposite: I can match such words, but I don't know how to match everything except them so that it can be removed. So far I have this pattern:
(#[a-zA-Z0-9.]+\b)
I tried to use '?!' but it doesn't work. Thanks!

From the comments, the following regex is ok
(?:^|\s)[^#]*
exact contrary would be
(?:^|[^#A-Za-z0-9.]|#(?![A-Za-z0-9.]+\b))[^#]*

Try this regex:
(#+[a-zA-Z0-9.]+[a-zA-Z0-9]+)
I tested it online and it does what you are looking for: it matches every word that starts with # and may continue with dots, e.g. #hello.sir matches, #hello.sir.do matches, and so on.
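If the goal is only to keep the tagged words, one option is to collect the matches rather than replace everything else. A sketch in Python, which the question does not mention and is used here purely for illustration, with a made-up input string:

```python
import re

text = "hello #tag.one world #second, end"

# Collect every word that starts with '#', using the pattern from the question.
tags = re.findall(r'#[a-zA-Z0-9.]+\b', text)

# Join the matches to get the text with everything else dropped.
kept = " ".join(tags)

print(tags)  # ['#tag.one', '#second']
print(kept)  # #tag.one #second
```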

Regex is not matching in the way that I want to

Hi I'm new to regexes.
I have a string that I want to match any number of A-Z a-z 0-9 - and _
I've tried the following in python however it always matches, even the empty space. Can someone tell me why that is?
re.match(r'[A-Za-z0-9_-]+', 'gfds9 41.-=,434')

Your regex matches one or more of those characters. Your text starts with one or more of those characters, hence it matches. If you want it to only match those characters then you have to match them from the beginning to the end of the text.
re.match(r'^[A-Za-z0-9_-]+$', 'gfds9 41.-=,434')
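A short sketch of the difference. re.match only anchors at the start of the string, while re.fullmatch (available since Python 3.4) requires the whole string to match, which has the same effect as adding ^ and $ here. The second test string is made up:

```python
import re

pattern = r'[A-Za-z0-9_-]+'
text = 'gfds9 41.-=,434'

# re.match succeeds as soon as the beginning of the string matches.
prefix = re.match(pattern, text)

# re.fullmatch succeeds only if the entire string matches.
whole = re.fullmatch(pattern, text)
valid = re.fullmatch(pattern, 'gfds9_41-434')

print(prefix.group(0))    # gfds9
print(whole)              # None
print(valid is not None)  # True
```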

Try the alternative for it maybe it will work for you:
[\w-]+
EDIT:
Although the initial regex you provided also works for me.

match until a certain pattern using regex

I have string in a text file containing some text as follows:
txt = "java.awt.GridBagLayout.layoutContainer"
I am looking to get everything before the Class Name, "GridBagLayout".
I have tried something like the following, but I can't figure out how to get rid of the trailing ".":
txt = re.findall(r'java\S?[^A-Z]*', txt)
and I get the following: "java.awt."
instead of what I want: "java.awt"
Any pointers as to how I could fix this?

Without using capture groups, you can use lookahead (the (?= ... ) business).
java\s?[^A-Z]*(?=\.[A-Z]) should capture everything you're after. Here it is broken down:
java //Literal word "java"
\s? //Match for an optional space character. (can change to \s* if there can be multiple)
[^A-Z]* //Any number of non-capital-letter characters
(?=\.[A-Z]) //Look ahead for (but don't add to selection) a literal period and a capital letter.

Make your pattern match a period followed by a capital letter:
'(java\S?[^A-Z]*?)\.[A-Z]'
Everything in capture group one will be what you want.

This seems to do what you want with re.findall(): (java\S?[^A-Z]*)\.[A-Z]
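The original attempt and the two approaches above can be compared side by side with re.findall, using the string from the question:

```python
import re

txt = "java.awt.GridBagLayout.layoutContainer"

# Original attempt: the trailing '.' is part of the match.
original = re.findall(r'java\S?[^A-Z]*', txt)

# Lookahead: the '.' and the capital letter are checked but not consumed.
with_lookahead = re.findall(r'java\s?[^A-Z]*(?=\.[A-Z])', txt)

# Capture group: findall returns only the group, not the '.G' that follows it.
with_group = re.findall(r'(java\S?[^A-Z]*?)\.[A-Z]', txt)

print(original)        # ['java.awt.']
print(with_lookahead)  # ['java.awt']
print(with_group)      # ['java.awt']
```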
