Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I have a long string which I have parsed through beautifulsoup and I need advice on the best way to extract data from this soup object.
The number I want is contained inside the soup object, inside () after this text.
View All (8)
What is the most efficient way to locate this, and get the number out of it.
In VBA I would have done this.
(1) Find where this text string starts; for example, if the soup is 1000 characters long, the text might start at position 200.
Then I would loop until I found the closing ), grab that text, store it in a variable, and process each character, removing everything that is not a digit.
So if I have > View All (8), I would end up with 8. The number inside is not known in advance; it could be 100, 110, or 2000.
I have just started learning Python and don't yet know how to use regular expressions, but that seems the way to go?
Sample String
">View All (90)</a>
Expected Result - hopeful
90
Sample String
">View All (8)</a>
Expected Result - hopeful
8
Seeing how my comment provoked some more questions, let me expand it a bit. First, welcome to the wonderful world of regular expressions. Regular expressions can be quite a headache, but mastering them is a very useful skill. A very clear tutorial was written by A.M. Kuchling, one of Python's old hackers from the early days. If memory serves me he wrote the re library, with (as an additional bonus) an undocumented implementation of lex in some 15 odd lines of python. But I digress. You can find the tutorial here. https://docs.python.org/2/howto/regex.html
Let me go over the expression bit by bit:
import re
m = re.compile(r'View All \((\d*?)\)').search(soupstring)
print(m.group(1))
The r in front of the opening quote marks the literal as a raw string in Python. Python preprocesses normal string literals, so that a backslash is interpreted as a special character. E.g. a '\t' in a string will be replaced by the tab character. Try print('\') to see what I mean (it is a syntax error, because the backslash escapes the closing quote). To include a '\' in a string you have to escape it like this '\\'. This can be a problem as a backslash is also an escape character for the regular expression engine. If you have to match patterns that contain backslashes, you will soon be writing patterns like this '\\\\'. Which can be fun . . . If you like 50 shades of grey, give it a try.
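A quick way to see the difference is to compare the lengths of the two kinds of literal. This is only a small sketch; the variable names are made up for illustration.

```python
# Compare a normal string literal with a raw string literal.
normal = '\t'   # Python turns this into a single tab character
raw = r'\t'     # raw string: a backslash followed by the letter t

print(len(normal))  # 1
print(len(raw))     # 2

# A regex that matches one literal backslash needs an escaped backslash.
# As a normal literal that takes four backslashes, as a raw literal two.
same_pattern = ('\\\\' == r'\\')
print(same_pattern)  # True
```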
Inside the regular expression language: '(' characters are special. They are used to group parts of the match together. Since you are only interested in the digits between the parentheses, I used a group to extract this data. Other special characters are '{', '[', '*', '?', '\' and their matching counterparts. I am sure I have forgotten a few, but you can look them up.
With that information, the '\(' will make more sense. Since I have escaped the '(' it tells the regular expression parser to ignore the special meaning of '(' and instead match it against a literal '(' character.
The sequence '\d' is again special. An escaped '\d' means, do not interpret this as a literal 'd', but interpret it as "any digit character".
The '*' means take the last pattern and match it zero or more times.
The '*?' variant means: use "non-greedy matching". It returns the shortest possible match instead of the longest possible one. In the context of regular expressions greed is usually good. As Sebastian has noted, the '?' is not needed here. However, if you ever need to find html elements or quoted strings, then you can use '<.*?>' or '".*?"'.
Please note that '.' is again special. It means match "any character (except the newline (well most of the time anyway))".
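To tie the pieces together, here is a minimal sketch that runs the pattern against the two sample strings from the question and then contrasts greedy and non-greedy matching. The helper name extract_count and the html sample are assumptions for illustration, and the sketch uses '\d+' (one or more digits) instead of '\d*?', which is a small variation on the pattern above.

```python
import re

# The two sample strings from the question.
samples = ['">View All (90)</a>', '">View All (8)</a>']

# \( and \) match literal parentheses; (\d+) captures the digits between them.
pattern = re.compile(r'View All \((\d+)\)')

def extract_count(text):
    """Return the number inside 'View All (...)', or None if it is absent."""
    m = pattern.search(text)
    return int(m.group(1)) if m else None

counts = [extract_count(s) for s in samples]
print(counts)  # [90, 8]

# Greedy versus non-greedy on a string with several tags.
html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.*>', html)   # one long match from first < to last >
lazy = re.findall(r'<.*?>', html)    # each tag separately
print(greedy)
print(lazy)
```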
Have fun . . .
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am currently implementing a graphical calculator in Python, in a manner where you can type a natural expression and evaluate it. This is not a problem with most functions or operators, but as the factorial function is denoted by a ! after the operand, this is more difficult.
What I have is a string which contains a function, for example: '(2x + 1)!' which needs to be replaced with: 'math.factorial((2x + 1))'
However, the string may also include other terms such as: '2*x + (2x + 1)! - math.sin(x)'
and the factorial term may not necessarily contain brackets: '2!'
I've been trying to find a solution to this problem, but to no avail, I don't think the string.replace() method can do this directly. Is what I seek too ambitious, or is there some method by which I could achieve a result as desired?
There are two levels of answer to your question: (1) solve your current problem; (2) solve your general problem.
(1) is pretty easy - the most common, versatile tool in general for doing string pattern matching and replacement is Regular Expressions (RE). You simply define the pattern you're looking for, the pattern you want out, and pass the RE engine your string. Python has a RE module built-in called re. Most languages have something similar. Some languages (eg. Perl) even have it as a core part of the language syntax.
A pattern is a series of either specific characters, or non-specific ("wildcard") characters. So in your case you want the non-specific characters before a specific '!' character. You seem to suggest that "before" in your case means either all the non-whitespace characters, or if the preceding character is a ')', then all the characters between that and the preceding '('. So let's build that pattern, starting with the version without the parentheses:
[\w]  - the set of characters which are letters, digits or underscore (we need
        a set of characters that doesn't include whitespace or ')' so I'm
        taking some liberty to keep the example simple - you could always
        build your own more complex set with the '[]' pattern)
+     - at least one of them
!     - the '!' character
And then the version with the parentheses:
\(    - the literal '(' character, as opposed to the special function that ( has
.     - any character
+     - at least one of them
?     - but don't be "greedy", i.e. only take the smallest set of characters
        that match the pattern (will work out to be the closest pair of
        parentheses)
\)    - the closing ')' character
!     - the '!' character
Then we just need to put it all together. We use | to match the first pattern OR the second pattern. And we use ( and ) to denote the part of the pattern we want to "capture" - it's the bit before the '!' and inside the parentheses that we want to use later. So your pattern becomes:
([\w]+)!|\((.+?)\)!
Don't worry, RE expressions always come out looking like someone has just mashed the keyboard. There are some great tools out there like RegExr which help break down complex RE expressions.
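Finally, you just need to take your captures and stick them in "math.factorial". \x means the xth captured group. If the first pattern matches, \2 will be blank and vice-versa, so we can just use both of them at once (this relies on Python 3.5 or later, where an unmatched group is substituted as an empty string).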
math.factorial(\1\2)
That's it! Here's how you run your RE in Python (note the r before the strings prevents Python from trying to process the \ as an escape sequence):
import re
re.sub(r'([\w]+)!|\((.+?)\)!', r'math.factorial(\1\2)', '2*x + (2x + 1)! - math.sin(x) + 2!')
re.sub takes three parameters (plus some optional ones not used here): the RE pattern, the replacement string, and the input string. This produces:
'2*x + math.factorial(2x + 1) - math.sin(x) + math.factorial(2)'
which I believe is what you're after.
Now, (2) is harder. If your intention really is to implement a calculator that takes strings as input, you're quickly going to drown in regular expressions. There will be so many exceptions and variations between what can be entered and what Python can interpret, that you'll end up with something quite fragile, that will fail on its first contact with a user. If you're not intending on having users you're pretty safe - you can just stick to using the patterns that work. If not, then you'll find the pattern matching method a bit limiting.
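To make the fragility concrete, here is a small sketch that wraps the substitution from above in a function (the name rewrite is an assumption for illustration) and feeds it one input that works and one that does not. In the second input the non-greedy group starts at the parenthesis of sin( and runs on until the first ")!" it can find.

```python
import re

# Same pattern as above: a bare word before '!', or a parenthesised operand.
FACTORIAL_RE = re.compile(r'([\w]+)!|\((.+?)\)!')

def rewrite(expr):
    # Relies on Python 3.5+, where an unmatched group in the replacement
    # template is substituted as an empty string.
    return FACTORIAL_RE.sub(r'math.factorial(\1\2)', expr)

ok = rewrite('2*x + (2x + 1)! - math.sin(x) + 2!')
print(ok)

# A function call earlier in the string confuses the pattern.
broken = rewrite('math.sin(x) + (3)!')
print(broken)
```

The second result is not valid Python, which is the kind of surprise a real parser avoids.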
In general, the problem you're tackling is known as lexical analysis (or more fully, as the three step process of lexical analysis, syntactic analysis and semantic analysis). The standard way to tackle this is with a technique called recursive descent parsing.
Intriguingly, the Python interpreter performs exactly this process in interpreting the re statement above - compilers and interpreters all undertake the same process to turn a string into tokens that can be processed by a computer.
You'll find lots of tutorials on the web. It's a bit more complex than using RE, but allows significantly more generalisation. You might want to start with the very brief intro here.
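As a rough idea of what recursive descent looks like, here is a minimal sketch for integer arithmetic with +, -, *, parentheses and a postfix '!'. Everything in it (the grammar, the names tokenize, Parser and evaluate) is an assumption made for illustration; it has no variables, no function calls, no division and no unary minus, all of which a real calculator would need.

```python
import math
import re

# One token is a run of digits or a single operator/parenthesis character.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*()!])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term := postfix ('*' postfix)*
        value = self.postfix()
        while self.peek() == "*":
            self.take()
            value *= self.postfix()
        return value

    def postfix(self):
        # postfix := atom '!'*   (factorial binds tighter than '*')
        value = self.atom()
        while self.peek() == "!":
            self.take()
            value = math.factorial(value)
        return value

    def atom(self):
        # atom := NUMBER | '(' expr ')'
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise ValueError("unexpected token %r" % (tok,))

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise ValueError("unexpected trailing token %r" % (parser.peek(),))
    return value

print(evaluate("2 * 3 + (2 * 1 + 1)! - 2!"))  # 10
```

Each grammar rule becomes one method, so the bracketed and unbracketed factorial cases fall out of the same postfix rule instead of needing two regex alternatives.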
You could remove the brackets and calculate everything in a linear form, such that when parsing the brackets it would evaluate the operand, and then apply the factorial function - in the order written.
Or, you could get the index of the factorial ! in the string, then if the character at the index before is a close bracket ) you know there is a bracketed operand that needs to be calculated prior to applying math.factorial().
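The index-based idea can be sketched roughly as follows: walk the string, and on each '!' look backwards for the operand, matching brackets by counting depth. The function names are hypothetical, and the rule that a name or number directly in front of a bracket belongs to the operand (so that a call like f(x)! stays together) is my own assumption; it would misread implicit multiplication such as 2(x)!.

```python
def _operand_start(text):
    """Index where the operand that ends at the end of `text` begins."""
    end = len(text)
    start = end
    if text.endswith(")"):
        depth = 0
        # Walk left until the bracket that opens this operand is found.
        for start in range(end - 1, -1, -1):
            if text[start] == ")":
                depth += 1
            elif text[start] == "(":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced parentheses before '!'")
    # Also take a directly preceding name or number, e.g. "2" or "math.sin".
    while start > 0 and (text[start - 1].isalnum() or text[start - 1] in "_."):
        start -= 1
    if start == end:
        raise ValueError("nothing to apply '!' to")
    return start

def rewrite_factorials(expr):
    out = ""
    for ch in expr:
        if ch != "!":
            out += ch
            continue
        start = _operand_start(out)
        out = out[:start] + "math.factorial(" + out[start:] + ")"
    return out

result = rewrite_factorials("2*x + (2x + 1)! - math.sin(x) + 2!")
print(result)
```

Unlike the single regular expression, this keeps the operand's own brackets and copes with nested ones, at the cost of more code.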
So currently I am trying to find out how many times a specific word appears on a page.
My Python code has this:
print(len(re.findall(secondAnswer, page)))
0
Upon careful analysis, I noticed that
print(secondAnswer) is giving me a different answer "Pacific"
from print(ascii(secondAnswer)) 'Paci\ufb01c'
I have a feeling that my secondAnswer value in len(re.findall(secondAnswer, page)) is using 'Paci\ufb01c' instead and thus not finding any matches on the page.
Can someone give me any tips on how to solve this?
Thanks, Nick
Unicode character fb01 is the fi ligature. That is, it's a single character as far as Python is concerned, but appears as two (tied) characters when displayed.
To decompose ligatures into their separate characters, you can use unicodedata.normalize. For example:
page = unicodedata.normalize("NFKD", page)
Or in this specific case, you could write your regex to accept the ligature as an alternate for the fi character sequence, for example by using alternation with a non-capturing group: paci(?:\ufb01|fi)c.
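Here is a small sketch of both options. The page text is invented for the example, and the helper name fold is an assumption. Since the question shows the ligature in the search term, the sketch normalizes both the term and the page before searching.

```python
import re
import unicodedata

second_answer = "Paci\ufb01c"   # contains U+FB01, the fi ligature
page = "The Pacific Ocean is large. Many Pacific islands are small."

def fold(text):
    # NFKD replaces compatibility characters such as ligatures with
    # their plain equivalents (here: U+FB01 becomes "f" + "i").
    return unicodedata.normalize("NFKD", text)

before = len(re.findall(re.escape(second_answer), page))
after = len(re.findall(re.escape(fold(second_answer)), fold(page)))
print(before, after)  # 0 2

# Alternative: accept either spelling directly in the pattern.
either = re.findall(r"Paci(?:\ufb01|fi)c", "Paci\ufb01c and Pacific")
print(len(either))  # 2
```

Note that NFKD also splits accented letters into a base letter plus combining marks, so it is safest to apply it to both sides of the comparison.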
This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 6 years ago.
For a little idea of what the project is, I'm trying to make a markup language that compiles to HTML/CSS. I plan on formatting links like this: #(link mask)[(link url)], and I want to find all occurrences of this and get both the link mask and the link url.
I tried using this code for it:
re.search("#(.*)\[(.*)\]", string)
But it started at the beginning of the first instance, and ended at the end of the last instance of a link. Any ideas how I can have it find all of them, in a list or something?
The default behavior of a regular expression is "greedy matching". This means each .* will match as many characters as it can.
You want them to instead match the minimal possible number of characters. To do that, change each .* into a .*?. The final question mark will make the pattern match the minimal number of characters. Because you anchor your pattern to a ] character, it will still match/consume the whole link correctly.
* is greedy: it matches as many characters as it can, e.g. up to the last right parenthesis in your document. (After all, . means "any character" and ) is "any character" as much as any other character.)
You need the non-greedy version of *, which is *?. (Probably actually you should use +?, as I don't think zero-length matches would be very useful).
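A minimal sketch of the difference, using an invented sample line and placeholder URLs. It also uses re.findall instead of re.search, since the question asks for every link rather than the first one.

```python
import re

text = "Visit #Home[https://example.com/] or read #Docs[https://example.com/docs]."

# Greedy: the first group swallows everything up to the last '['.
greedy = re.search(r"#(.*)\[(.*)\]", text).groups()
print(greedy)

# Non-greedy: each group stops as early as it can, so every link is
# found separately; findall returns one (mask, url) tuple per link.
links = re.findall(r"#(.*?)\[(.*?)\]", text)
print(links)
```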
I have a string like this:
<name:john student male age=23 subject=\computer\sience_{20092973}>
I am confused by the ":" and "=" separators.
I want to parse this string,
so I want to split it into a list like this:
name:john
job:student
sex:male
age:23
subject:{20092973}
That is, parsing a string into specific names (name, job, sex, etc.) in Python.
I have already searched, but I can't find anything, sorry.
How can I do this?
Thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match(r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has no parentheses around it, so it doesn't save what it matches, and it matches one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, needs to be escaped to be normal in the string, so we escape it with \, and in the end, that means "[^\{]*" matches zero or more ("*") characters that aren't "{" ("[^\{]"). Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "(\{\S+\})" does.
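Getting from the tuple of groups to the name:value layout in the question could look like the sketch below. The field names come from the question's desired output; the function name parse_line and the constant names are assumptions for illustration.

```python
import re

LINE_RE = re.compile(
    r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})"
)
# Labels for the five groups, in the order they appear in the pattern.
FIELDS = ("name", "job", "sex", "age", "subject")

def parse_line(line):
    """Return a dict of field name to value, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return dict(zip(FIELDS, m.groups()))

line = r"<name:john student male age=23 subject=\computer\sience_{20092973}>"
record = parse_line(line)
for key in FIELDS:
    print("%s:%s" % (key, record[key]))
```

Checking for None matters here, because re.match returns None on a line that does not fit the format, and calling .groups() on that would raise an AttributeError.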
I'm trying to find all instances of the keyword "public" in some Java code (with a Python script) that are not in comments or strings, a.k.a. not found following //, in between a /* and a */, and not in between double or single quotes, and which are not part of variable names-- i.e. they must be preceded by a space, tab, or newline, and must be followed by the same.
So here's what I have at the moment--
//.*\spublic\s.*\n
/\*.*\spublic\s.*\*/
".*\spublic\s.*"
'.*\spublic\s.*'
Am I messing this up at all?
But that finds exactly what I'm NOT looking for. How can I turn it around and search the inverse of the sum of those four expressions, as a single regex?
I've figured out this probably uses negative look-ahead and look-behind, but I still can't quite piece it together. Also, for the /**/ regex, I'm concerned that .* doesn't match newlines, so it would fail to recognize that this public is in a comment:
/*
public
*/
Everything below this point is me thinking on paper and can be disregarded. These thoughts are not fully accurate.
Edit:
I daresay (?<!//).*public.* would match anything not in single line comments, so I'm getting the hang of things. I think. But still unsure how to combine everything.
Edit2:
So then-- following that idea, I |ed them all to get--
(?<!//).*public.*|(?<!/\*).*public.\*/(?!\*/)|(?<!").*public.*(?!")|(?<!').*public.*(?!')
But I'm not sure about that. //public will not be matched by the first alternate, but it will be matched by the second. I need to AND the look-aheads and look-behinds, not OR the whole thing.
I'm sorry, but I'll have to break the news to you, that what you are trying to do is impossible. The reason is mostly because Java is not a regular language. As we all know by now, most regex engines provide non-regular features, but Python in particular is lacking something like recursion (PCRE) or balancing groups (.NET) which could do the trick. But let's look into that in more depth.
First of all, why are your patterns not as good as you think they are? (for the task of matching public inside those literals; similar problems will apply to reversing the logic)
As you have already recognized, you will have problems with line breaks (in the case of /*...*/). This can be solved by either using the modifier/option/flag re.S (which changes the behavior of .) or by using [\s\S] instead of . (because the former matches any character).
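A quick sketch of the two options on the three-line comment from the question; the variable names are made up for the example.

```python
import re

source = "/*\npublic\n*/"

# By default '.' stops at newlines, so the multi-line comment is missed.
default_dot = re.search(r"/\*.*\*/", source)

# With re.S (DOTALL), '.' also matches newlines.
dotall = re.search(r"/\*.*\*/", source, re.S)

# [\s\S] matches any character regardless of flags.
char_class = re.search(r"/\*[\s\S]*\*/", source)

print(default_dot)               # None
print(dotall.group(0) == source)
print(char_class.group(0) == source)
```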
But there are other problems. Your patterns only find string or comment delimiters somewhere around public. You are not actually making sure that they are specifically wrapped around the public in question. I'm not sure how much you can put onto a single line in Java, but if you had an arbitrary string, then later a public and then another string on a single line, then your regex would match the public because it can find the " before and after it. Even if that is not possible, if you have two block comments in the same input, then any public between those two block comments would cause a match. So you would need to find a way to assert that your public is really inside "..." or /*...*/ and not just that these literals can be found anywhere to the left or right of it.
Next thing: matches cannot overlap. But your match includes everything from the opening literal until the ending literal. So if you had "public public" that would cause only one match. And capturing cannot help you here. Usually the trick to avoid this is to use lookarounds (which are not included in the match). But (as we will see later) the lookbehind doesn't work as nicely as you would think, because it cannot be of arbitrary length (only in .NET that is possible).
Now the worst of all. What if you have " inside a comment? That shouldn't count, right? What if you have // or /* or */ inside a string? That shouldn't count, right? What about ' inside "-strings and " inside '-strings? Even worse, what about \" inside "-string? So for 100% robustness you would have to do a similar check for your surrounding delimiters as well. And this is usually where regular expressions reach the end of their capabilities and this is why you need a proper parser that walks the input string and builds a whole tree of your code.
But say you never have comment literals inside strings and you never have quotes inside comments (or only matched quotes, because they would constitute a string, and we don't want public inside strings anyway). So we are basically assuming that each of the literals in question is correctly matched, and they are never nested. In that case you can use a lookahead to check whether you are inside or outside one of the literals (in fact, multiple lookaheads). I'll get to that shortly.
But there is one more thing left. What does (?<!//).*public.* not work? For this to match it is enough for (?<!//) to match at any single position. e.g. if you just had input // public the engine would try out the negative lookbehind right at the start of the string, (to the left of the start of the string), would find no //, then use .* to consume // and the space and then match public. What you actually want is (?<!//.*)public. This will start the lookbehind from the starting position of public and look all the way to the left through the current line. But... this is a variable-length lookbehind, which is only supported by .NET.
But let's look into how we can make sure we are really outside of a string. We can use a lookahead to look all the way to the end of the input, and check that there is an even number of quotes on the way.
public(?=[^"]*("[^"]*"[^"]*)*$)
Now if we try really hard we can also ignore escaped quotes when inside of a string:
public(?=[^"]*("(?:[^"\\]|\\.)*"[^"]*)*$)
So once we encounter a " we will accept either non-quote, non-backslash characters, or a backslash character and whatever follows it (that allows escaping of backslash-characters as well, so that in "a string\\" we won't treat the closing " as being escaped). We can use this with multi-line mode (re.M) to avoid going all the way to the end of the input (because the end of the line is enough):
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)
(re.M is implied for all following patterns)
This is what it looks like for single-quoted strings:
public(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)
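To see what the escape handling buys, here is a sketch that runs the simple double-quote lookahead and the escape-aware, line-based one over two invented lines of Java. The constant and function names are assumptions for illustration.

```python
import re

# Even number of quotes between the match and the end of the input.
SIMPLE = re.compile(r'public(?=[^"]*("[^"]*"[^"]*)*$)')

# Same idea per line (re.M), and a backslash escapes the next character.
ESCAPE_AWARE = re.compile(
    r'public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)', re.M
)

def starts(pattern, text):
    """Start offsets of every match of `pattern` in `text`."""
    return [m.start() for m in pattern.finditer(text)]

# First 'public' is inside a string, second one is real code.
inside = 'x = "public"; public y;'
print(starts(SIMPLE, inside), starts(ESCAPE_AWARE, inside))

# The string after this 'public' contains one escaped quote, so the
# line has an odd number of quote characters in total.
escaped = r'public String s = "a \" b";'
print(starts(SIMPLE, escaped), starts(ESCAPE_AWARE, escaped))
```

The simple version misses the keyword in the second line because it counts the escaped quote as a delimiter.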
For block comments it's a bit easier, because we only need to look for /* or the end of the string (this time really the end of the entire string), without ever encountering */ on the way. That is done with a negative lookahead at every single position until the end of the search:
public(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
But as I said, we're stumped on the single-line comments for now. But anyway, we can combine the last three regular expressions into one, because lookaheads don't actually advance the position of the regex engine on the target string:
public(?=[^"\r\n]*("(?:[^"\r\n\\]|\\.)*"[^"\r\n]*)*$)(?=[^'\r\n]*('(?:[^'\r\n\\]|\\.)*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))
Now what about those single-line comments? The trick to emulate variable-length lookbehinds is usually to reverse the string and the pattern - which makes the lookbehind a lookahead:
cilbup(?!.*//)
Of course, that means we have to reverse all other patterns, too. The good news is, if we don't care about escaping, they look exactly the same (because both quotes and block comments are symmetrical). So you could run this pattern on a reversed input:
cilbup(?=[^"\r\n]*("[^"\r\n]*"[^"\r\n]*)*$)(?=[^'\r\n]*('[^'\r\n]*'[^'\r\n]*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
You can then find the match positions in your actual input by using inputLength - foundMatchPosition - foundMatchLength.
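The reversal trick and the position arithmetic can be sketched with just the single-line comment check. The function name and the sample source are assumptions for illustration, and the string and block comment lookaheads are left out to keep it short.

```python
import re

def find_outside_line_comments(source, word="public"):
    """Offsets in `source` of `word` that have no '//' earlier on their line."""
    reversed_source = source[::-1]
    # Reversed word, then "no // later on this (reversed) line".
    pattern = re.compile(re.escape(word[::-1]) + r"(?!.*//)")
    positions = []
    for m in pattern.finditer(reversed_source):
        length = m.end() - m.start()
        # inputLength - foundMatchPosition - foundMatchLength
        positions.append(len(source) - m.start() - length)
    return sorted(positions)

source = "int x; // public comment\npublic class A {}"
positions = find_outside_line_comments(source)
print(positions)
```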
Now what about escaping? That gets quite annoying now, because we have to skip quotes if they are followed by a backslash. Because of some backtracking issues we need to take care of that in five places: three times when consuming non-quote characters (because we need to allow "\ as well now), and twice when consuming quote characters (using a negative lookahead to make sure there is no backslash after them). Let's look at double quotes:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)
(It looks horrible, but if you compare it with the pattern that disregards escaping, you will notice the few differences.)
So incorporating that into the above pattern:
cilbup(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
So this might actually do it for many cases. But as you can see it's horrible, almost impossible to read, and definitely impossible to maintain.
What were the caveats? No comment literals inside strings, no string literals inside strings of the other type, no string literals inside comments. Plus, we have four independent lookaheads, which will probably take some time (at least I think I have avoided most of the backtracking).
In any case, I believe this is as close as you can get with regular expressions.
EDIT:
I just realised I forgot the condition that public must not be part of a longer literal. You included spaces, but what if it's the first thing in the input? The easiest thing would be to use \b. That matches a position (without including surrounding characters) that is between a word character and a non-word character. However, Java identifiers may contain any Unicode letter or digit, and I'm not sure whether Python's \b is Unicode-aware. Also, Java identifiers may contain $. Which would break that anyway. Lookarounds to the rescue! Instead of asserting that there is a space character on every side, let's assert that there is no non-space character. Because we need negative lookarounds for that, we will get the advantage of not including those characters in the match for free:
(?<!\S)cilbup(?!\S)(?=(?:[^"\r\n]|"\\)*(?:"(?!\\)(?:[^"\r\n]|"\\)*"(?!\\)(?:[^"\r\n]|"\\)*)*$)(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)(?=(?:(?![*]/)[\s\S])*(?:/[*]|\Z))(?!.*//)
And because just from scrolling this code snippet to the right one cannot quite grasp how ridiculously huge this regex is, here it is in freespacing mode (re.X) with some annotations:
(?<!\S)      # make sure there is no trailing non-whitespace character
cilbup       # public
(?!\S)       # make sure there is no leading non-whitespace character
(?=          # lookahead (effectively lookbehind!) to ensure we are not inside a
             # string
  (?:[^"\r\n]|"\\)*
             # consume everything except for line breaks and quotes, unless the
             # quote is followed by a backslash (preceded in the actual input)
  (?:        # subpattern that matches two (unescaped) quotes
    "(?!\\)  # a quote that is not followed by a backslash
    (?:[^"\r\n]|"\\)*
             # we've seen that before
    "(?!\\)  # a quote that is not followed by a backslash
    (?:[^"\r\n]|"\\)*
             # we've seen that before
  )*         # end of subpattern - repeat 0 or more times (ensures even no. of ")
  $          # end of line (start of line in actual input)
)            # end of double-quote lookahead
(?=(?:[^'\r\n]|'\\)*(?:'(?!\\)(?:[^'\r\n]|'\\)*'(?!\\)(?:[^'\r\n]|'\\)*)*$)
             # the same horrible bastard again for single quotes
(?=          # lookahead (effectively lookbehind) for block comments
  (?:        # subgroup to consume anything except */
    (?![*]/) # make sure there is no */ coming up
    [\s\S]   # consume an arbitrary character
  )*         # repeat
  (?:/[*]|\Z)# require to find either /* or the end of the string
)            # end of lookahead for block comments
(?!.*//)     # make sure there is no // on this line
Have you considered replacing all comments and single- and double-quoted string literals with empty strings using the re sub() method, and then doing a simple search/match/find of the resulting file for the word you're looking for?
That would at least give you the line numbers where the word is located. You may be able to use that information to edit the original file.
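A rough sketch of this idea follows. One deliberate change from the description above: instead of replacing with empty strings, it overwrites comments and literals with spaces and keeps their newlines, so that offsets and line numbers in the cleaned text still line up with the original file. The names SKIP_RE, blank_out and find_keyword_lines are assumptions for illustration, and the sketch ignores unterminated literals.

```python
import re

# Things to skip: line comments, block comments, string and char literals.
SKIP_RE = re.compile(r'''
    //[^\n]*                  # line comment, up to the end of the line
  | /\*.*?\*/                 # block comment (re.S lets . cross newlines)
  | "(?:[^"\\\n]|\\.)*"       # double-quoted string, honouring backslash escapes
  | '(?:[^'\\\n]|\\.)*'       # single-quoted literal, honouring backslash escapes
''', re.S | re.X)

def blank_out(match):
    # Keep newlines so line numbers survive; blank everything else.
    return re.sub(r"[^\n]", " ", match.group(0))

def find_keyword_lines(source, word="public"):
    """1-based line numbers of `word` outside comments and literals."""
    cleaned = SKIP_RE.sub(blank_out, source)
    # The word must not touch a non-whitespace character on either side.
    pattern = re.compile(r"(?<!\S)" + re.escape(word) + r"(?!\S)")
    return [cleaned.count("\n", 0, m.start()) + 1
            for m in pattern.finditer(cleaned)]

src = (
    'public class A {\n'
    '  // public note\n'
    '  /* public\n'
    '     public */\n'
    '  String s = "a \\" public ";\n'
    '  public int x;\n'
    '}\n'
)
lines = find_keyword_lines(src)
print(lines)
```

Because the four alternatives are tried at each position from left to right through the text, a quote inside a comment (or a // inside a string) is consumed as part of whichever construct started first.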
You could use pyparsing to find public keyword outside a comment or a double quoted string:
from pyparsing import Keyword, javaStyleComment, dblQuotedString
keyword = "public"
expr = Keyword(keyword).ignore(javaStyleComment | dblQuotedString)
Example
for [token], start, end in expr.scanString(r"""{keyword} should match
/*
{keyword} should not match "
*/
// this {keyword} also shouldn't match
"neither this \" {keyword}"
but this {keyword} will
re{keyword} is ignored
'{keyword}' - also match (only double quoted strings are ignored)
""".format(keyword=keyword)):
    assert token == keyword and len(keyword) == (end - start)
    print("Found at %d" % start)
Output
Found at 0
Found at 146
Found at 187
To ignore also single quoted string, you could use quotedString instead of dblQuotedString.
To do it with only regexes, see regex-negation tag on SO e.g., Regular expression to match string not containing a word? or using even less regex capabilities Regex: Matching by exclusion, without look-ahead - is it possible?. The simple way would be to use a positive match and skip matched comments, quoted strings. The result is the rest of the matches.
It's finding the opposite because that's just what you're asking for. :)
I don't know a way to match them all in a single regex (though it should be theoretically possible, since the regular languages are closed under complements and intersections). But you could definitely search for all instances of public, and then remove any instances that are matched by one of your "bad" regexes. Try using for example set.difference on the match.start and match.end properties from re.finditer.