Regex is not matching in the way that I want to - python

Hi, I'm new to regexes.
I have a string that should consist only of the characters A-Z, a-z, 0-9, - and _ (any number of them).
I've tried the following in Python; however, it always matches, even when the string contains a space. Can someone tell me why that is?
re.match(r'[A-Za-z0-9_-]+', 'gfds9 41.-=,434')

Your regex matches one or more of those characters, and re.match only requires the match to begin at the start of the string. Your text starts with one or more of those characters, hence it matches. If you want the whole string to consist only of those characters, you have to match from the beginning to the end of the text.
re.match(r'^[A-Za-z0-9_-]+$', 'gfds9 41.-=,434')
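To make the difference visible, here is a minimal sketch comparing the prefix match with a whole-string match. It uses re.fullmatch as an alternative to writing ^ and $ by hand; the second test string is made up for illustration.

```python
import re

text = 'gfds9 41.-=,434'
pattern = r'[A-Za-z0-9_-]+'

# re.match only anchors at the start, so the leading run 'gfds9' is enough.
prefix = re.match(pattern, text)
print(prefix.group())

# re.fullmatch requires the whole string to consist of the allowed characters.
whole = re.fullmatch(pattern, text)
print(whole)

# Illustrative string (not from the question) that contains only allowed characters.
valid = re.fullmatch(pattern, 'gfds9_41-x')
print(valid is not None)
```

One small difference worth knowing: $ also matches just before a trailing newline, while re.fullmatch does not tolerate one.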

Try this alternative; maybe it will work for you:
[\w-]+
EDIT:
Although the initial regex you provided also works for me.

Related

Regex: Stop when it finds the first occurrence of a character [duplicate]

I am looking for a pattern that matches everything until the first occurrence of a specific character, say a ";" - a semicolon.
I wrote this:
/^(.*);/
But it actually matches everything (including the semicolon) until the last occurrence of a semicolon.
You need
/^[^;]*/
The [^;] is a character class; it matches any character except a semicolon.
^ (start of line anchor) is added to the beginning of the regex so only the first match on each line is captured. This may or may not be required, depending on whether possible subsequent matches are desired.
To cite the perlre manpage:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
This should work in most regex dialects.
Would;
/^(.*?);/
work?
The ? is a lazy operator, so the regex grabs as little as possible before matching the ;.
/^[^;]*/
The [^;] says "match anything except a semicolon". The square brackets are a set-matching operator: essentially, match any one character in this set of characters. The ^ at the start makes it an inverse match, so it matches anything not in this set.
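For comparison, a small Python sketch of the three variants discussed so far (greedy, lazy, and negated character class). The input string is made up for illustration.

```python
import re

s = "first; second; third"  # illustrative input, not from the thread

# Greedy: .* runs to the end and backtracks to the LAST semicolon.
greedy = re.match(r'^(.*);', s).group(1)

# Lazy: .*? expands one character at a time and stops at the FIRST semicolon.
lazy = re.match(r'^(.*?);', s).group(1)

# Negated class: [^;]* can never cross a semicolon in the first place.
negated = re.match(r'^[^;]*', s).group()

print(repr(greedy), repr(lazy), repr(negated))
```

The lazy and negated-class versions return the same text here; the negated class simply cannot cross a semicolon, so it needs no backtracking.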
None of the proposed answers worked for me (e.g. in Notepad++).
But
^.*?(?=\;)
did.
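A quick sketch of what the lookahead changes, written in Python rather than Notepad++ (the input string is an assumption for illustration): the semicolon is checked for but not included in the match.

```python
import re

s = "key=value; rest"  # illustrative input

# Consuming the semicolon makes it part of the match.
with_semicolon = re.match(r'^.*?;', s).group()

# A lookahead (?=;) only asserts that a semicolon follows; it is not consumed.
without_semicolon = re.match(r'^.*?(?=;)', s).group()

print(repr(with_semicolon), repr(without_semicolon))
```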
Try /[^;]*/
Google regex character classes for details.
sample text:
"this is a test sentence; to prove this regex; that is g;iven below"
If, for example, we have the sample text above, the regex /(.*?\;)/ will give you everything until the first occurrence of a semicolon (;), including the semicolon: "this is a test sentence;"
Try /[^;]*/
That's a negated character class.
This was very helpful for me as I was trying to figure out how to match all the characters in an xml tag including attributes. I was running into the "matches everything to the end" problem with:
/<simpleChoice.*>/
but was able to resolve the issue with:
/<simpleChoice[^>]*>/
after reading this post. Thanks all.
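The same effect can be reproduced in Python. This is only a sketch; the attributes on the tag are invented for the example.

```python
import re

# Hypothetical tag with attributes, followed by content and a closing tag.
xml = '<simpleChoice identifier="A" fixed="true">Yes</simpleChoice>'

# Greedy .* runs on to the last '>' in the string.
greedy = re.search(r'<simpleChoice.*>', xml).group()

# [^>]* cannot cross a '>', so the match ends with the opening tag.
bounded = re.search(r'<simpleChoice[^>]*>', xml).group()

print(greedy)
print(bounded)
```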
This is not a regex solution, but it is simple enough for your problem description. Just split your string and take the first item from the resulting array.
$str = "match everything until first ; blah ; blah end ";
$s = explode(";",$str,2);
print $s[0];
output
$ php test.php
match everything until first
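For readers working in Python rather than PHP, a rough equivalent of the explode approach might look like this, using str.split with a maximum of one split.

```python
text = "match everything until first ; blah ; blah end "

# Split at most once; the first piece is everything before the first ';'.
head = text.split(";", 1)[0]
print(head)

# str.partition does the same job and also returns the separator and the rest.
before, sep, after = text.partition(";")
print(before == head)
```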
This will match up to the first occurrence only in each string and will ignore subsequent occurrences.
/^([^;]*);*/
"/^([^\/]*)\/$/" worked for me, to get only top "folders" from an array like:
a/ <- this
a/b/
c/ <- this
c/d/
/d/e/
f/ <- this
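A sketch of that filter in Python, assuming the list from the example above is held in a plain list of strings:

```python
import re

paths = ["a/", "a/b/", "c/", "c/d/", "/d/e/", "f/"]

# A name without any slash, followed by exactly one trailing slash.
top = re.compile(r'^([^/]*)/$')

top_level = []
for p in paths:
    m = top.match(p)
    if m:
        top_level.append(m.group(1))

print(top_level)
```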
Really kinda sad that no one has given you the correct answer....
In regex, a ? placed after a quantifier (for example .*?) makes it non-greedy. By default a quantifier will match as much as it can (greedy).
Simply add a ? after the * and it will be non-greedy and match as little as possible!
Good luck, hope that helps.
This works for getting the content from the beginning of a line till the first word,
/^.*?([^\s]+)/gm
I faced a similar problem: including all the characters until the first comma after the word entity_id. The solution that worked was this in BigQuery:
SELECT regexp_extract(line_items, r'entity_id[^,]*')

Trying to search for a certain pattern in text file

Okay, so in Python I'm trying to search for the pattern "comma, space, any lowercase character", but I can't get a regular expression that seems to work. The whole regular expressions thing is pretty new to me and I have no idea what I'm doing. I was able to search for "number, space, any character" using "[1-9]+ [a-zA-Z]", but I'm not sure how to search for the pattern mentioned above. The picture included is an example of what pattern I am trying to search for in the text file.
Thanks,
Schulzy
A regular expression that would work is
, [a-z]
The comma and space are matched exactly, and the [] is a character class, where any one character listed inside it can be matched. You want any lowercase character, so we put [a-z] for any character from lowercase a to z.
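In Python this could be tried out as follows. The sample line is invented, since the picture from the question is not available.

```python
import re

line = "Smith, john and Doe, Jane, mary"  # illustrative text

# Literal comma, literal space, then one lowercase ASCII letter.
hits = re.findall(r', [a-z]', line)
print(hits)
```

Note that ", J" is skipped because J is uppercase.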

Regex that does not contain a substring after some point

I want a regex that doesn't match a string if it contains the word page, and matches if it doesn't.
^https?.+/(event|news)/.+(?!page).+$ is the regex I'm currently using. I want it not to match, e.g., https://www.foosite.com/news/foopath/page/10, but it does. Where did I make a mistake?
The two .+ expressions should imply that there is some string around the page string, and (?!page) should imply that there must not be a string like page between them. What's wrong with this expression? Thanks, and sorry for poor grammar.
Your problem is that .+(?!page).+ will match foopath/page/10 because the first .+ match can end at the 1 in 10, and the second can match from there until $. Instead, just assert there is no combination of characters plus the word page after (event|news)/:
^https?.+/(event|news)/(?!.*page)
Demo on regex101
If you want more than just a match/nomatch decision, you can capture the entire matching string with this regex:
^https?.+/(event|news)/(?!.*page).*$
Demo on regex101
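A short Python check of both patterns against the URL from the question; the second URL (without page) is an assumed counterexample.

```python
import re

broken = re.compile(r'^https?.+/(event|news)/.+(?!page).+$')
fixed = re.compile(r'^https?.+/(event|news)/(?!.*page).*$')

paged = "https://www.foosite.com/news/foopath/page/10"
plain = "https://www.foosite.com/news/foopath/10"  # assumed example without "page"

# The original pattern still finds a position where "page" does not follow.
print(bool(broken.match(paged)))

# The lookahead placed right after the section rejects "page" anywhere later on.
print(bool(fixed.match(paged)))
print(bool(fixed.match(plain)))
```

Keep in mind that (?!.*page) rejects any later occurrence of the letters "page", including inside longer words.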
You might be looking for
^https?.+/(event|news)/(?:(?!page).)+$
See a demo on regex101.com.
Matching is usually way easier in regex than excluding.
I would rather match your excluded words and invert the logic on the if-clause.
if not re.match(...
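One way this inverted check could look is sketched below. The helper name is_wanted and the SECTION pattern are assumptions for illustration, not something from the question.

```python
import re

# Hypothetical pattern: the allowed prefix, then the rest of the path in group 2.
SECTION = re.compile(r'^https?.+/(event|news)/(.+)$')

def is_wanted(url):
    m = SECTION.match(url)
    if not m:
        return False
    # Inverted logic: search for the excluded word and negate the result.
    return not re.search(r'page', m.group(2))

print(is_wanted("https://www.foosite.com/news/foopath/page/10"))
print(is_wanted("https://www.foosite.com/news/foopath/10"))
```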

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n", text) only matches the single-line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
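Here is a small sketch of the difference on a made-up document fragment (the field names and values are placeholders, not taken from the linked file):

```python
import re

# Illustrative text: three top-level fields, one with an indented paragraph.
text = (
    "Date: 2000-01-01\n\n"
    "URL: http://example.org\n\n"
    "Summary:\n\n  indented paragraph\n\n"
    "Keywords: a, b"
)

# [^\s] is part of the separator, so the first character of each field is lost.
consuming = re.split(r"\n\n[^\s]", text)

# (?=\S) only looks ahead, so the character stays in the next field.
lookahead = re.split(r"\n\n(?=\S)", text)

print(consuming)
print(lookahead)
```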
You want a lookahead. You also might want it to be more flexible as far as how many newlines and which newline characters are allowed. You might try this:
import re
r = re.compile(r"(?:\r\n|\r|\n)+(?=\S)")
l = r.split(text)
The group is non-capturing, (?:...), on purpose: with a capturing group, re.split also inserts the matched \r\n characters into the list.
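The extra \r\n entries come from a documented behaviour of re.split: text matched by a capturing group in the separator is returned as part of the result. A minimal sketch on an invented string:

```python
import re

text = "a\r\n\r\nb\n\nc"  # illustrative input with mixed newline styles

# Capturing group: the last newline captured by the group is inserted into the list.
with_group = re.split(r"(\r\n|\r|\n)+(?=\S)", text)

# Non-capturing group (?:...): only the pieces between separators are returned.
without_group = re.split(r"(?:\r\n|\r|\n)+(?=\S)", text)

print(with_group)
print(without_group)
```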

regex python - using lookbehinds to find my specific text

UPDATED
I want to find a string within a big text
..."img good img two_apple.txt"
I want to extract two_apple.txt from the text, but the name can change to one_apple, three_apple, and so on.
When I try to use lookbehinds, it matches text all the way from the beginning.
You are misusing lookarounds. It looks like you don't even need a lookaround:
pattern = r'src="images/(.+?\.png)"'
should work for you. As my comment suggests though, using regex is not recommended for parsing HTML/XML-style documents, but you do you.
EDIT - accommodate your edit:
Now that I understand your problem more, I can see why you would want to use a look-around. However, since you are looking for a file name, you know there aren't going to be any spaces in the name, so you can just ensure that your capturing token does not include spaces:
pattern = r'src="img (\w+?\.png)"'
Note the space after img in the pattern; it has to be there because of how your text is laid out.
\w is equivalent to [a-zA-Z0-9_] for ASCII text (any letter, digit, or underscore).
This stops the match from starting at the first 'img ' string that pops up and ensures your capture group doesn't contain any spaces.
By using \w, I am assuming you are only expecting letters, digits, and _. To include anything else, make your own character class with [any characters you want to capture in here].
" ([^ ]+_apple\.txt)"
Starts with a space, ends with _apple.txt. The middle bit is anything-except-a-space which stops it matching "good img two". Parentheses to capture the bit you care about.
Try it here: https://regex101.com/r/wO7lG3/2
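A short Python sketch of why excluding spaces helps. The first pattern is an assumed naive attempt, since the asker's original lookbehind is not shown.

```python
import re

text = 'img good img two_apple.txt'

# Assumed naive attempt: .+ is free to cross spaces, so the match starts too early.
naive = re.search(r'img (.+_apple\.txt)', text).group(1)

# [^ ]+ cannot contain a space, so only the final word can match.
no_space = re.search(r' ([^ ]+_apple\.txt)', text).group(1)

print(repr(naive))
print(repr(no_space))
```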
