Matching data between padding - python

I'm trying to match some strings in a binary file and the strings appear to be padded. As an example, the word PROGRAM could be in the binary like this:
%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M
In that example, the word PROGRAM is there but it is split up and it's between random data, so I'm trying to use regex to find it.
Currently, this is what I came up with, but I don't think it is very effective:
(?<=P)(.*?)(?=R)(.*?)(?=O)(.*?)(?=G)(.*?)(?=R)(.*?)(?=A)(.*?)(?=M)

If you want to get PROGRAM from the string, one option might be to use re.sub with a negated character class to remove all that you don't want.
[^A-Z]+
For example:
import re
test_str = "%$###P^&#!)00000R{]]]////O.......G\"\"\"\"\"R;;$#!*%&#*A/////847M"
pattern = r'[^A-Z]+'
print(re.sub(pattern, '', test_str))
Result
PROGRAM

This should work for you and is more efficient than your current solution:
P[^R]+R[^O]+O[^G]+G[^R]+R[^A]+A[^M]+M
Explanation:
P[^R]+ - match P, then one or more characters other than R. Each following letter works the same way, and the final M is matched literally.
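As a rough sketch, the same pattern can be generated for any target word instead of being typed by hand. The helper name build_padded_pattern is made up for this illustration, and the sample is treated as a str; for a real binary file you would probably read bytes and use a bytes pattern instead.

```python
import re

def build_padded_pattern(word):
    """Hypothetical helper: build P[^R]+R[^O]+O... for an arbitrary word."""
    parts = []
    for current, following in zip(word, word[1:]):
        # Match the current letter, then one or more characters
        # that are not the next letter of the word.
        parts.append(re.escape(current) + "[^" + re.escape(following) + "]+")
    parts.append(re.escape(word[-1]))  # the last letter is matched literally
    return re.compile("".join(parts))

data = '%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M'
pattern = build_padded_pattern("PROGRAM")
match = pattern.search(data)
print(pattern.pattern)
print(match is not None)
```

Note that + requires at least one padding character between letters; if two letters can sit next to each other, * in place of + may be the safer choice.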

I'm not quite sure what the desired output might be, I'm guessing maybe this expression,
(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M).*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)
might be a start.
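A minimal sketch of how that expression could be used from Python, assuming the goal is to recover the word itself: each letter sits in its own capture group, so joining the groups rebuilds it. The variable names are only illustrative.

```python
import re

data = '%$###P^&#!)00000R{]]]////O.......G"""""R;;$#!*%&#*A/////847M'
pattern = re.compile(
    r"(?=.*?P.*?R.*?O.*?G.*?R.*?A.*?M)"            # lookahead: all letters occur in order
    r".*?(P).*?(R).*?(O).*?(G).*?(R).*?(A).*?(M)"  # capture each letter
)
match = pattern.search(data)
word = "".join(match.groups()) if match else None
print(word)  # PROGRAM
```

Keep in mind that . does not match newlines by default, so data containing line breaks would likely need the re.DOTALL flag.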

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with the regex literal character \ like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if it is possible to do this using regex. Thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
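To see why escaping matters here, this small sketch compares the unescaped and escaped variable on the string from the question. The names unescaped and escaped are just for illustration.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped: the parentheses form a capture group,
# so findall returns only the group contents.
unescaped = re.findall(num, s)

# Escaped: the parentheses are matched as literal characters.
escaped = re.findall(re.escape(num), s)

print(unescaped)  # ['941', '941']
print(escaped)    # ['(941)', '(941)']
```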

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
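The difference can be checked quickly in Python (used here as the illustration language; the same distinction applies in JavaScript). This is only a sketch with made-up sample strings:

```python
import re

# One single character from the set w, d, |, o, r, q, followed by "="
char_class = re.compile(r"[wd|word|qw]=")
# One of the three whole words, followed by "="
group = re.compile(r"(?:wd|word|qw)=")

for sample in ("wd=", "word=", "qw=", "o=", "|="):
    print(sample, bool(char_class.search(sample)), bool(group.search(sample)))
```

The character class accepts "o=" and even "|=", because inside [] the pipe is just another literal character, while the group only accepts the three intended parameter names.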
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
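A short sketch of the dot issue, shown in Python and assuming a made-up sample URL: with the dot escaped, the look-alike string from above no longer matches, while a real baidu.com query still does.

```python
import re

loose = re.compile(r".*baidu.com.*[/?].*(?:wd|word|qw)=")    # unescaped dot
strict = re.compile(r".*baidu\.com.*[/?].*(?:wd|word|qw)=")  # literal dot

real_url = "http://www.baidu.com/s?wd=python"  # hypothetical sample URL
lookalike = "hbaidu-com/wd="

print(bool(loose.match(real_url)), bool(loose.match(lookalike)))
print(bool(strict.match(real_url)), bool(strict.match(lookalike)))
```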

Trying to find the regex for this particular case? Also can I parse this without creating groups?

text to capture looks like this..
Policy Number ABCD000012345 other text follows in same line....
My regex looks like this
regex value='(?i)(?:[P|p]olicy\s[N|n]o[|:|;|,][\n\r\s\t]*[\na-z\sA-Z:,;\r\d\t]*[S|s]e\s*[H|h]abla\s*[^\n]*[\n\s\r\t]*|(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)(?P<policy_number>[^\n]*)'
this particular case matches with the second or case.. however it is also capturing everything after the policy number. What can be the stopping condition for it to just grab the number. I know something is wrong but can't find a way out.
(?i)[P|p]olicy[\s\n\t\r]*[N|n]umber[\s\n\r\t]*)
current output
ABCD000012345othertextfollowsinsameline....
expected output
ABCD000012345
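You can use a simpler regex that matches from the beginning of the line, such as "[P|p]olicy\s*[N|n]umber\s*\b([A-Z]{4}\d+)\b.*", and relies on the word boundary \b to stop after the number. The code below uses the looser ([A-Z0-9]+) for the number: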
import re
pattern = re.compile(r"[P|p]olicy\s*[N|n]umber\s*\b([A-Z0-9]+)\b.*")
line = "Policy Number ABCD000012345 other text follows in same line...."
matches = pattern.match(line)
id_res = matches.group(1)
print(id_res) # ABCD000012345
If there are always two words before the number, you can use (?:\w+\s+){2}\b([A-Z0-9]+)\b.*
Also, \s already stands for [\r\n\t\f\v ], so there is no need to repeat them: your [\n\r\s\t] is just \s
You don't need to specify both upper- and lower-case p and n, since you're already matching case-insensitively.
Also \s already covers \n, \t and \r.
(?i)policy\s+number\s+([A-Z]{4}\d+)\b
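A minimal sketch of this pattern in Python, reusing the policy_number group name from the question (the variable names are otherwise illustrative):

```python
import re

line = "Policy Number ABCD000012345 other text follows in same line...."
pattern = re.compile(r"(?i)policy\s+number\s+(?P<policy_number>[A-Z]{4}\d+)\b")

match = pattern.search(line)
policy_number = match.group("policy_number") if match else None
print(policy_number)  # ABCD000012345
```

One thing to be aware of: because (?i) applies to the whole pattern, [A-Z] will also accept lowercase letters in the policy number.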
Another Solution:
^[\s\w]+\b([A-Z]{4}\d+)\b
I like this one better, since it still works if the label text changes from "Policy Number".

Generalize regex to search for Wikipedia Categories

I have the following string of text (taken from the Wikipedia dumps)
text = "[[Category:Ethnic groups| ]]\n[[Category:Ethnic groups by region|*]]\n[[Category:Society-related lists|Ethnic groups]]\n[[Category:Lists of ethnic groups]]"
and I would like to extract all the categories in the text. So basically the ideal output should be
text = "[Ethnic groups,Ethnic groups by region,Society-related lists|Ethnic groups,Lists of ethnic groups]"
This is my attempt at getting the solution
import re
categories = re.findall(r'\b(Category:.*)\b', text)
categories = [category.replace("Category:", "") for category in categories]
which returns what I want. However, I'm not sure this is the best way to generalize the regular expression. In particular, I would like to search for "[[Category:" instead of just "Category:" because that's the actual Wikipedia definition for the category links. Do you have any suggestions on how I can improve my regular expression?
First, you don't need a search followed by a replacement; you can do it in one step using a capture group (re.findall returns only the capture groups when the pattern contains them, otherwise it returns the whole match).
Looking for [[Category: instead of \bCategory: is probably a good idea. All you have to do is to escape opening square brackets since they are special regex characters.
Instead of .*\b you should use something more restrictive like (?:\|(?!\*)[^\]|]*)*) that excludes the closing square bracket and the pipe followed by an asterisk. However using .*\b is also a good idea if you are sure that the data you want to extract ends with a word character and if there is only one [[Category:...]] per line. A good compromise will be [^\]]*\b
So in one step:
categories = re.findall(r'\[\[Category:([^\]]*\b)', text)
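Run against the sample text from the question, the one-step version can be sketched like this (the text is split across lines only for readability):

```python
import re

text = (
    "[[Category:Ethnic groups| ]]\n"
    "[[Category:Ethnic groups by region|*]]\n"
    "[[Category:Society-related lists|Ethnic groups]]\n"
    "[[Category:Lists of ethnic groups]]"
)

# [^\]]* stops at the closing bracket; \b then backtracks to the last word character.
categories = re.findall(r"\[\[Category:([^\]]*\b)", text)
print(categories)
```

The trailing \b is what trims the "| " and "|*" sort keys, since the match has to end on a word character.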
I would go with :
re.findall(r"\bCategory:(.*)\b", text)
which should return only the values needed (thanks to the parentheses)

regex, how to exclude search in match

Might be a bit messy title, but the question is simple.
I got this in Python:
string = "start;some;text;goes;here;end"
the start; and end; word is always at the same position in the string.
I want the second word which is some in this case. This is what I did:
import re
string = "start;some;text;goes;here;end"
word = re.findall("start;.+?;", string)
In this example, there might be a few things to modify to make it more appropriate, but in my actual code, this is the best way.
However, the string I get back is start;some;, where the search characters themselves are included in the output. I could index both ; characters and extract the middle part, but there has to be a way to get only the actual word, without the extra junk?
No need for regex in my opinion, but all you need is a capture group here.
word = re.findall("start;(.+?);", string)
Another improvement I'd suggest is not using the dot (.). Rather, be more specific: what you are looking for is simply anything other than ;, the delimiter.
So I'd do this:
word = re.findall("start;([^;]+);", string)
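Put together as a runnable sketch, with a plain split shown alongside since regex is arguably not needed here (the variable names words, second_word and parts are illustrative):

```python
import re

string = "start;some;text;goes;here;end"

# findall returns a list of the captured groups only.
words = re.findall(r"start;([^;]+);", string)
print(words)  # ['some']

# re.search gives direct access to the single captured word.
match = re.search(r"start;([^;]+);", string)
second_word = match.group(1) if match else None
print(second_word)  # some

# Non-regex alternative: split on the delimiter and take the second field.
parts = string.split(";")
print(parts[1])  # some
```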
