Extracting content within [] embedded inside curly brackets using Regex


I have a string in the following format
test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'
My regex code below
import re
matches = re.findall(r'\{(.*?)\}', test_str)
gives the output
['"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]']
What change should I make to my regex so that it outputs only
"Director of Technology Services"

re.findall(r"\[(.*?)\]", test_str)
print(re.findall(r"\[(.*?)\]", test_str)[0])
Instead of escaping { and }, you should escape [ and ], so that the match is delimited by the square brackets.
An alternative solution uses a compiled pattern and a capturing group:
import re
regex = re.compile(r"\[(.*?)\]")
test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'
print(regex.search(test_str).group(1))
Output:
"Director of Technology Services"

Since the string is valid JSON, you can parse it with the json module instead of using a regex:
import json
test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'
test_str_json = json.loads(test_str)
output = test_str_json["keywords"]["Associate Director Information Technology Services"][0]
print(output)
Output:
Director of Technology Services
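If the inner key is not known in advance, the same JSON approach can be generalized. The following is a minimal sketch, assuming every value under "keywords" is a list of strings as in the question; the variable names are illustrative only.

```python
import json

test_str = '{"keywords": {"Associate Director Information Technology Services": ["Director of Technology Services"]}}'

data = json.loads(test_str)

# Flatten every list stored under "keywords" without hard-coding the inner key.
# Assumption: each value under "keywords" is a list of strings.
titles = [item for items in data["keywords"].values() for item in items]

print(titles)
```

Unlike the regex answers, this returns the strings without the surrounding double quotes.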

Related

How to convert a 'raw' string into a 'decoded' string in Python?

I have the following string:
raw_text = r"The Walt Disney Company, (2006\u2013present)"
print(raw_text)
#result : The Walt Disney Company, (2006\u2013present)
My question is how I can get a decoded string decoded_text from raw_text, so that I get
print(decoded_text)
#result : The Walt Disney Company, (2006-present)
without resorting to this trivial approach:
decoded_text = raw_text.replace("\u2013", "-")
In fact, I have big strings which contain a lot of \uXXXX escapes (like \u2013, \u00c9, and so forth), so I'm looking for a way to convert all of them at once, correctly.

You can use the built-in codecs module for this task as follows:
import codecs
raw_text = r"The Walt Disney Company, (2006\u2013present)"
print(codecs.unicode_escape_decode(raw_text)[0])
output:
The Walt Disney Company, (2006–present)
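If the raw text may also contain real non-ASCII characters next to the escape sequences, a commonly used variant is to encode with backslashreplace first and then decode with the unicode_escape codec. This is a sketch under that assumption, not the only way to do it; note that any other backslash sequence in the text (such as \n) is interpreted as well.

```python
raw_text = r"The Walt Disney Company, (2006\u2013present)"

# Encode to bytes first: Latin-1 characters pass through unchanged, anything
# outside Latin-1 becomes a \uXXXX escape, and existing escapes are kept.
# Decoding with "unicode_escape" then resolves all escapes in one pass.
decoded_text = raw_text.encode("latin-1", "backslashreplace").decode("unicode_escape")
print(decoded_text)

# Same approach with a real non-ASCII character already present in the input.
mixed = "\u00c9" + r"cole (2006\u2013present)"
decoded_mixed = mixed.encode("latin-1", "backslashreplace").decode("unicode_escape")
print(decoded_mixed)
```

The decoded character is an en dash (U+2013), not an ASCII hyphen; a plain hyphen would still need an explicit replace.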

Question on regex not performing as expected

I am trying to change the suffixes of companies such that they are all in a common pattern such as Limited, Limiteed all to LTD.
Here is my code:
re.sub(r"\s+?(CORPORATION|CORPORATE|CORPORATIO|CORPORATTION|CORPORATIF|CORPORATI|CORPORA|CORPORATN)", r" CORP", 'ABC CORPORATN')
I'm testing with 'ABC CORPORATN' and it is not converted to CORP. I can't see what the issue is. Any help would be great.
Edit: I have tried the other endings included in the regex and they all work, except for CORPORATN (the one mentioned above).

I see that all the patterns begin with "CORPORA", so we can simply match that prefix followed by any word characters:
import re
# \w* (rather than \w+) so that a bare "CORPORA" is replaced as well
print(re.sub(r"CORPORA\w*", "CORP", 'ABC CORPORATN'))
Output:
ABC CORP
The same works for the variants of Limited; if they all begin with "Limit", you can use:
import re
print(re.sub(r"Limit\w+", "LTD", 'Shoe Shop Limited.'))
Output:
Shoe Shop LTD.
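As for why the original pattern fails: regex alternation tries the alternatives from left to right and stops at the first one that matches. For 'ABC CORPORATN', the shorter alternative CORPORA is listed before CORPORATN, so it wins and the trailing "TN" is left behind. A minimal sketch of this, together with one possible fix (longest alternatives first plus a word boundary; the helper names are illustrative):

```python
import re

suffixes = ["CORPORATION", "CORPORATE", "CORPORATIO", "CORPORATTION",
            "CORPORATIF", "CORPORATI", "CORPORA", "CORPORATN"]

# Pattern from the question: CORPORA comes before CORPORATN in the alternation.
original = r"\s+?(" + "|".join(suffixes) + ")"
broken = re.sub(original, " CORP", "ABC CORPORATN")
print(broken)  # ABC CORPTN

# One possible fix: try longer alternatives first and require a word boundary
# after the suffix, so a short prefix cannot win on a longer word.
ordered = sorted(suffixes, key=len, reverse=True)
fixed_pattern = r"\s+(?:" + "|".join(map(re.escape, ordered)) + r")\b"
fixed = re.sub(fixed_pattern, " CORP", "ABC CORPORATN")
print(fixed)  # ABC CORP
```

Compared with the prefix-plus-\w approach, an explicit list only rewrites the spellings you have listed, which may or may not be what you want.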

regex capture text in brackets, omitting optional prefix

I'm trying to convert some documents (Wikipedia articles) that contain links written in a specific markup convention. I want to render them in a reader-friendly way, without links. The convention is:
1. Names in double brackets of the pattern [[Article Name|Display Name]] should be captured ignoring the pipe and the preceding text as well as the enclosing brackets: Display Name.
2. Names in double brackets of the pattern [[Article Name]] should be captured without the brackets: Article Name.
Nested approach (produces desired result)
I know I can handle #1 and #2 in a nested re.sub() expression. For example, this does what I want:
s = 'including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], [[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], [[Russian Academy of Sciences]], and [[National Academy of Sciences|US National Academy of Sciences]].'
re.sub(r'\[\[(.*?\|)(.*?)\]\]', r'\2',  # case 1
       re.sub(r'\[\[([^|]+)\]\]', r'\1', s)  # case 2
)
# result is correct:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.'
Single-pass approach (looking for solution here)
For efficiency and my own improvement, I would like to know whether there is a single-pass approach.
What I have tried: In an optional group 1, I want to greedy-capture everything between [[ and a | (if it exists). Then in group 2, I want to capture everything else up to the ]]. Then I want to return only group 2.
My problem is in making the greedy capture optional:
re.sub(r'\[\[([^|]*\|)?(.*?)\]\]', r'\2', s)
# does NOT return the desired result:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, US National Academy of Sciences.'
# is missing: 'Russian Academy of Sciences, and '

The following regex handles both cases in a single pass:
\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}
\[{2}  Match [[
(?:(?:(?!]{2})[^|])+\|)*  Match the following any number of times:
    (?:(?!]{2})[^|])+  Tempered greedy token matching any character one or more times, except | or a location that matches ]]
    \|  Match | literally
((?:(?!]{2})[^|])+)  Capture the following into capture group 1:
    (?:(?!]{2})[^|])+  Tempered greedy token matching any character one or more times, except | or a location that matches ]]
]{2}  Match ]]
Replacement: \1
Result:
including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.
Another alternative that may work for you is the following. It's less specific than the regex above but doesn't include any lookarounds.
\[{2}(?:[^]|]+\|)*([^]|]+)]{2}
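The attempt in the question fails because [^|]* is allowed to run across ]] into the next link until it finds a pipe. Excluding ] from the character class, as the simpler pattern above does, prevents that. A minimal Python sketch of the single-pass substitution using that pattern (brackets are escaped explicitly for readability; the variable names are illustrative):

```python
import re

s = ('including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], '
     '[[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], '
     '[[Russian Academy of Sciences]], and '
     '[[National Academy of Sciences|US National Academy of Sciences]].')

# [[ , then any number of "text|" prefixes, then the display text (group 1), then ]]
# Neither part may contain ] or |, so a match cannot spill into the next link.
wikilink = re.compile(r"\[\[(?:[^\]|]+\|)*([^\]|]+)\]\]")

plain = wikilink.sub(r"\1", s)
print(plain)
```

This sketch assumes links are not nested and that link text never contains a single ] character.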

Regular expression for extracting fields from wiki template markup

I would like to use Python to extract content formatted in MediaWiki markup following a particular string. For example, the 2012 U.S. presidential election article contains fields called "nominee1" and "nominee2". Toy example:
In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"
Using the election article above as an example, I would like to extract the information that immediately follows the "nomineeN" field but comes before the next field (demarcated by a pipe "|"). Thus, given the example above, I would ideally like to extract "Barack Obama" and "Mitt Romney", or at least the syntax in which they're embedded ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regexes have extracted links from the wiki markup, but my (failed) attempts at using a positive lookbehind assertion have been something of the flavor of:
nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)
My thinking is that it should find strings like "|nominee1=" and "|nominee2=" with some whitespace possible between "|", "nominee", "=" and then return the content following it like "Barack Obama" and "Mitt Romney".

Use mwparserfromhell! It condenses your code and captures the result more reliably than a regex. For this example:
import mwparserfromhell as mw
text = get_wikipedia_markup('United States presidential election, 2012')
code = mw.parse(text)
templates = code.filter_templates()
for template in templates:
    # matches() compares template names loosely (e.g. ignoring surrounding whitespace)
    if template.name.matches('Infobox election'):
        nominee1 = template.get('nominee1').value
        nominee2 = template.get('nominee2').value
        print(nominee1)
        print(nominee2)
That is all it takes to capture the result.

Lookbehinds aren't necessary here; it's much easier to use capturing groups to specify exactly what should be extracted from the string. (In fact, lookbehinds can't work here with Python's regular expression engine, since the optional spaces make the expression variable-width.)
Try this regex:
\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?
Results:
re.findall(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?", markup)
# => ['Barack Obama', 'Mitt Romney']
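For a self-contained check, here is a sketch that runs a slight variation of the regex above against the toy markup from the question. It additionally captures the field name, so the result maps each nomineeN field to its value; the markup string and variable names are stand-ins for the real article text.

```python
import re

# Toy markup from the question (stand-in for the real article text).
markup = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| running_mate1 = '''[[Joe Biden]]'''
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
| home_state2 = [[Massachusetts]]
| running_mate2 = [[Paul Ryan]]
}}"""

# Group 1: field name (nominee1, nominee2, ...); group 2: the linked article name.
# The optional ''' on either side covers bolded entries.
nominee_re = re.compile(r"\|\s*(nominee\d+)\s*=\s*(?:''')?\[\[([^\]]+)\]\](?:''')?")

nominees = dict(nominee_re.findall(markup))
print(nominees)
```

Fields such as running_mate1 are skipped because the pattern requires the name to start with "nominee" directly after the pipe and optional whitespace.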

For infobox data like this, it's best to use DBpedia. They've done all the extraction work for you :)
http://wiki.dbpedia.org/Downloads38
See the "Ontology Infobox Properties" file. You don't have to be an ontology expert here. Just use a simple TSV parser to find the info you need!

First of all, your pattern doesn't allow for whitespace after nominee\d. You probably want nominee\d\s*=. In addition, you really don't want to be parsing markup with regex. Try using one of the other suggestions here instead.
If you must do it with regex, why not use a slightly more readable multi-line solution?
import re
markup_string = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| running_mate1 = '''[[Joe Biden]]'''
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
| home_state2 = [[Massachusetts]]
| running_mate2 = [[Paul Ryan]]
}}"""
# Group 1 is the field name plus "="; the rest of the match runs up to the next pipe.
for match in re.finditer(r'(nominee\d\s*=)[^|]*', markup_string, re.S):
    end_nominee, end_line = match.end(1), match.end(0)
    print(end_nominee, end_line)
    print(markup_string[end_nominee:end_line])

Python, Regular Expression Postcode search

I am trying to use regular expressions to find a UK postcode within a string.
I have got the regular expression working inside RegexBuddy, see below:
\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b
I have a bunch of addresses and want to grab the postcode from them, example below:
123 Some Road Name Town, City County PA23 6NH
How would I go about this in Python? I am aware of the re module for Python but I am struggling to get it working.
Cheers
Eef

Repeating your address three times with the postcodes PA23 6NH, PA2 6NH and PA2Q 6NH as a test for your pattern, and comparing your regex against the one from Wikipedia, the code is:
import re
s = "123 Some Road Name\nTown, City\nCounty\nPA23 6NH\n123 Some Road Name\nTown, City\n"\
    "County\nPA2 6NH\n123 Some Road Name\nTown, City\nCounty\nPA2Q 6NH"
# custom
print(re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', s))
# regex from http://en.wikipedia.org/wiki/UK_postcodes#Validation
print(re.findall(r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}', s))
The result is:
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
Both regexes give the same result.

Try:
import re
# x is the string to search, e.g. one of your addresses
re.findall(r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", x)
You don't need the \b.
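Whether the \b anchors can be dropped depends on the input. For clean addresses like the one in the question both patterns agree, but on text where a postcode-like sequence is embedded in a longer token they differ. A small sketch with a made-up test string:

```python
import re

with_boundaries = r"\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b"
without_boundaries = r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}"

address = "123 Some Road Name Town, City County PA23 6NH"
# Made-up string in which a postcode-shaped sequence sits inside longer tokens.
noisy = "XYZ12 3ABC"

print(re.findall(with_boundaries, address))     # ['PA23 6NH']
print(re.findall(without_boundaries, address))  # ['PA23 6NH']
print(re.findall(with_boundaries, noisy))       # []
print(re.findall(without_boundaries, noisy))    # ['YZ12 3AB']
```

Keeping \b is the more conservative choice if the surrounding text is not guaranteed to be well formed.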

#!/usr/bin/env python
import re
ADDRESS="""123 Some Road Name
Town, City
County
PA23 6NH"""
reobj = re.compile(r'(\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b)')
matchobj = reobj.search(ADDRESS)
if matchobj:
    print(matchobj.group(1))
Example output:
[user@host]$ python uk_postcode.py
PA23 6NH
