parsing string with specific name in python - python

i have string like this
<name:john student male age=23 subject=\computer\sience_{20092973}>
i am confused ":","="
i want to parsing this string!
so i want to split to list like this
name:john
job:student
sex:male
age:23
subject:{20092973}
parsing string with specific name(name, job, sex.. etc) in python
i already searching... but i can't find.. sorry..
how can i this?
thank you.

It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match("<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has not parentheses around it, so it doesn't save what it matches, and it matches at one or more ("+") whitespace characters ("\s"). This suffices for most of what we want. At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters. "[^...]" is special, and matches anything that isn't in there. {, like [, (, and so on, needs to be escaped to be normal in the string, so we escape it with \, and in the end, that means "[^{]*" matches zero or more ("*") characters that aren't "{" ("[^{]"). Since "*" and "+" are "greedy", and will try to match as much as they can and still have the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "({\S+})" does.

Related

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do have so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fname = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'

Finding repetitive substrings

Having some arbitrary string such as
hello hello hello I am I am I am your string string string string of strings
Can I somehow find repetitive sub-strings delimited by spaces(EDIT)? In this case it would be 'hello', 'I am' and 'string'.
I have been wondering about this for some time but I still can not find any real solution.
I also have read some articles concerning this topic and hit up on suffix trees but can this help me even though I need to find every repetition e.g. with repetition count higher than two?
If it is so, is there some library for python, that can handle suffix trees and perform operations on them?
Edit: I am sorry I was not clear enough. So just to make it clear - I am looking for repetitive sub-strings, that means sequences in string, that, for example, in terms of regular expressions can be substituted by + or {} wildcards. So If I would have to make regular expression from listed string, I would do
(hello ){3}(I am ){3}your (string ){4}of strings
To find two or more characters that repeat two or more times, each delimited by spaces, use:
(.{2,}?)(?:\s+\1)+
Here's a working example with your test string: http://bit.ly/17cKX62
EDIT: made quantifier in capture group reluctant by adding ? to match shortest possible match (i.e. now matches "string" and not "string string")
EDIT 2: added a required space delimiter for cleaner results

Extracting parenthesis with a specific format with Python

I am fairly new to python so I apologies if this is quite a novice question, but I am trying to extract text from parentheses that has specific format from a raw text file.
I have tried this with regular expressions, but please let me know if their is a better method.
To show what I want to do by example:
s = "Testing (Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)"
From this string I want a result something like:
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
The regular expression I have tried so far is
"(\(.+[,] [0-9]{4}\))"
in conjunction with re.findall(), however this only gives me the result:
['(Stackoverflow, 2013). Testing (again) (Stackoverflow, 1999)']
So, as you may have guessed, I am trying to extract the bibliographic references from a .txt file. But I don't want to extract anything that happens to be in parentheses that is not a bibliographic reference.
Again, I apologies if this is novice, and again if there is a question like this out there already. I have searched, but no luck as yet.
Using [^()] instead of .. This will make sure there is no nested ().
>>> re.findall("(\([^()]+[,] [0-9]{4}\))", s)
['(Stackoverflow, 2013)', '(Stackoverflow, 1999)']
Assuming that you will have no nested brackets, you could use something like so: (\([^()]+?, [0-9]{4}\)). This will match any non bracket character which is within a set of parenthesis which is followed by a comma, a white space four digits and a closing parenthesis.
I would suggest something like \(\w+,\s+[0-9]{4}\). A couple changes from your original:
Match word characters (letters/numbers/underscores) instead of any character in the source name.
Match one or more space characters after the comma, instead of limiting yourself to a single literal space.

Comments in string and strings in comments

I am trying to count characters in comments included in C code using Python and Regex, but no success. I can erase strings first to get rid of comments in strings, but this will erase string in comments too and result will be bad ofc. Is there any chance to ask by using regex to not match strings in comments or vice versa?
No, not really.
Regex is not the correct tool to parse nested structures like you describe; instead you will need to parse the C syntax (or the "dumb subset" of it you're interested in, anyway), and you might find regex helpful in that. A relatively simple state machine with three states (CODE, STRING, COMMENT) would do it.
Regular expressions are not always a replacement for a real parser.
You can strip out all strings that aren't in comments by searching for the regular expression:
'[^'\r\n]+'|(//.*|/\*(?s:.*?)\*/)
and replacing with:
$1
Essentially, this searches for the regex string|(comment) which matches a string or a comment, capturing the comment. The replacement is either nothing if a string was matched or the comment if a comment was matched.
Though regular expressions are not a replacement for a real parser you can quickly build a rudimentary parser by creating a giant regex that alternates all of the tokens you're interested in (comments and strings in this case). If you're writing a bit of code to handle comments, but not those in strings, iterate over all the matches of the above regex, and count the characters in the first capturing group if it participated in the match.

Categories