I'm trying to find all types of references inside a text, such as "Appendix 2", "Section 17" or "Schedule 12.2", using Python. The issue is that some of the resulting matches overlap, and I would like to join them into a single string, or just keep the longest one and remove the substrings.
To keep the code readable, I've created multiple regex patterns, inserted them into a list, and called finditer over all patterns in the list.
From each match, I gather both the text and its position inside the text as start and end indices.
from re import finditer

def get_references(text):
    refs = [{'text': match.group(),
             'span': {'start': match.span()[0],
                      'end': match.span()[1]}}
            for ref in references_regex
            for match in finditer(ref, text)]
    return refs
This means a reference matched by multiple patterns is still inserted into the results multiple times, either identical or with small variants (e.g. "Section 17.4", "Section 17.4 of the book" and "17.4 of the book").
I've tried to merge overlapping matches with some ad hoc functions, but they still don't work properly.
Do you know if there's a way to remove duplicates or merge them in case they overlap?
For instance, I have:
[{"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
 {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
 {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]
I would like to get:
[{"text": "Schedule 15.1 of the Framework Agreement", "span": {"start": 756, "end": 796}},
 {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]
Thank you in advance!
Your problem is called merging intervals. You can check out the problem on LeetCode and read the solutions section.
You could try my code; it implements the solution for your specific problem. It might have bugs since I haven't tested it with a bigger dataset.
Edit: please note that your list should be sorted by start position in ascending order.
def process(match_list):
    if not match_list:
        return []
    new_list = []
    new_text = match_list[0]['text']
    start, end = match_list[0]['span']['start'], match_list[0]['span']['end']
    for i in range(1, len(match_list)):
        # If the spans overlap
        if end >= match_list[i]['span']['start']:
            # Merge the text and update the ending position
            new_text += match_list[i]['text'][end - match_list[i]['span']['start'] - 1:]
            end = max(end, match_list[i]['span']['end'])
        else:
            # If they don't overlap, append the text to the result
            new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
            # Process the next text
            new_text = match_list[i]['text']
            start, end = match_list[i]['span']['start'], match_list[i]['span']['end']
    # Append the last text in the list
    new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
    return new_list
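For example, applied to the sample matches from the question (already sorted by start position), the function merges the first two spans. A quick check, repeating the function so the snippet runs on its own:

```python
def process(match_list):
    # same merge-intervals function as above
    if not match_list:
        return []
    new_list = []
    new_text = match_list[0]['text']
    start, end = match_list[0]['span']['start'], match_list[0]['span']['end']
    for i in range(1, len(match_list)):
        if end >= match_list[i]['span']['start']:
            # overlapping: append only the non-overlapping tail of the next text
            new_text += match_list[i]['text'][end - match_list[i]['span']['start'] - 1:]
            end = max(end, match_list[i]['span']['end'])
        else:
            new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
            new_text = match_list[i]['text']
            start, end = match_list[i]['span']['start'], match_list[i]['span']['end']
    new_list.append({'text': new_text, 'span': {'start': start, 'end': end}})
    return new_list

matches = [
    {"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
    {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
    {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}},
]
print(process(matches))
# [{'text': 'Schedule 15.1 of the Framework Agreement', 'span': {'start': 756, 'end': 796}},
#  {'text': '17.14 of the book', 'span': {'start': 1883, 'end': 1900}}]
```

Note that the `- 1` in the slice assumes span ends like those in the question's data, where the end index is one past the last character plus one; with exact exclusive end indices you would drop the `- 1`.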
def get_s_e(x):
    s, e = map(x['span'].get, ['start', 'end'])
    return s, e

def concat_dict(a):
    a = sorted(a, key=lambda x: x['span']['start'], reverse=True)
    index = 0
    while index < len(a):
        cur = a[index]
        try:
            nxt = a[index + 1]
        except IndexError:
            break
        cur_st, cur_end = get_s_e(cur)
        nxt_st, nxt_end = get_s_e(nxt)
        if cur_st <= nxt_end:
            join_index = cur_st - nxt_st
            if nxt_end >= cur_end:
                text = nxt['text']
                a[index]['span']['end'] = nxt_end
            else:
                text = nxt['text'][:join_index] + cur['text']
            a[index]['text'] = text
            a[index]['span']['start'] = nxt_st
            del a[index + 1]
        else:
            index += 1
    return a
a = [{"text": "Book bf dj Schedule 15.1 of the", "span": {"start": 745, "end": 776}},
     {"text": "Schedule 15.1", "span": {"start": 756, "end": 770}},
     {"text": "15.1 of the Framework Agreement", "span": {"start": 765, "end": 796}},
     {"text": "17.14 of the book", "span": {"start": 1883, "end": 1900}}]
print(concat_dict(a))
Output:
[{'text': '17.14 of the book', 'span': {'start': 1883, 'end': 1900}},
 {'text': 'Book bf dj Schedule 15.1 of the Framework Agreement',
  'span': {'start': 745, 'end': 796}}]
I have the following list of dictionaries, and I'm trying to come up with a single connected sentence based on whether the dictionary of a child sentence has a "ref" attribute connecting it to its parent sentence.
clause_list = [
    {"id": "T1", "text": "hi"},
    {"id": "T2", "text": "I'm", "ref": "T1"},
    {"id": "T3", "text": "Simone", "ref": "T2"},
]
Expected output is
"hi I'm Simone" but avoiding sentences like "hi I'm" or "I'm Simone"
What I tried so far is the following, but no matter how I flip it, the undesired sentences always get printed.
for c in clause_list:
    for child in clause_list:
        try:
            if c["id"] == child["ref"] and "ref" not in c.keys():
                print(c["text"], child["text"])
            elif c["id"] == child["ref"] and "ref" in c.keys():
                for father in clause_list:
                    if father["id"] == c["ref"] and "ref" not in father.keys():
                        print(father["text"], c["text"], child["text"])
        except KeyError:
            pass
It's probably best to use a class rather than a dict, and you can easily convert the list of dicts to a list of class instances.
This works: convert the list of dicts to a list of Clause objects. Each clause has a search_ref method that prints the accumulated partial text if no other clause references it, or adds its text and continues with the referencing clause if there is one.
If you have two objects referencing the same clause, I don't know exactly what you want.
clause_list = [
    {"id": "T1", "text": "hi"},
    {"id": "T2", "text": "I'm", "ref": "T1"},
    {"id": "T3", "text": "Simone", "ref": "T2"},
]

class Clause:
    def __init__(self, id, text, ref=None):
        self.id = id
        self.text = text
        self.ref = ref

    def search_ref(self, clauses, text=''):
        partial_text = (text + ' ' + self.text).strip()
        for clause in clauses:
            if clause.ref == self.id:
                return clause.search_ref(clauses, partial_text)
        print(partial_text)

clauses = [Clause(id=c['id'], text=c['text'], ref=c.get('ref')) for c in clause_list]
for c in clauses:
    if c.ref is None:
        c.search_ref(clauses)
So right now you have your data set up as a list of dictionaries. I think it will be more helpful if you set it up as a linked list instead and use a bit of recursion. Maybe something like below. As a note, you'll need something to indicate the start of the sentence (I'm going to use "parent=True"):
clause_list = {
    # IDs are the keys
    'T1': {'text': "hi ", 'child': 'T2', 'parent': True},
    'T2': {'text': "I'm ", 'child': 'T3'},
    'T3': {'text': "Simone", 'child': None},
}
Then your code can look like this:
# Recursive function
def print_child(child_id, clause_list):
    print(clause_list[child_id]['text'])
    if clause_list[child_id].get('child') is not None:
        print_child(clause_list[child_id]['child'], clause_list)

# Main loop
for id in clause_list:
    if 'parent' in clause_list[id]:
        print(clause_list[id]['text'])
        # Recurse through the parent to find the others
        if clause_list[id].get('child') is not None:
            print_child(clause_list[id]['child'], clause_list)
Could definitely be cleaned up a bit and optimized, but this is the general gist. Hope it helps!
Another solution using a recursion:
clause_list = [
    {"id": "T1", "text": "hi"},
    {"id": "T2", "text": "I'm", "ref": "T1"},
    {"id": "T3", "text": "Simone", "ref": "T2"},
]

inv_dict = {d.get("ref", ""): (d["id"], d["text"]) for d in clause_list}

def get_sentence(inv_dict, key=""):
    if key in inv_dict:
        id_, text = inv_dict[key]
        return [text] + get_sentence(inv_dict, id_)
    return []

print(" ".join(get_sentence(inv_dict)))
Prints:
hi I'm Simone
I have a file with the following structure:
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
Records (i.e., blocks) are separated by an empty line. Each line in a block starts with an SE tag. The text tag always occurs in the first line of each block.
I wonder how to properly extract only blocks with a relation tag, which is not necessarily present in each block. My attempt is pasted below:
from itertools import groupby

with open('test.txt') as f:
    for nonempty, group in groupby(f, bool):
        if nonempty:
            process_block()  # ?
Desired output is a json dump:
{
    "result": [
        {
            "text": "Baz",
            "relation": ["Bla", "Foo"]
        },
        {
            "text": "Zoo",
            "relation": ["Bla", "Baz"]
        }
    ]
}
I have a proposed solution in pure Python that returns a block if it contains the value in any position. This could most likely be done more elegantly in a proper framework like pandas.
from pprint import pprint

fname = 'ex.txt'

# extract blocks
with open(fname, 'r') as f:
    blocks = [[]]
    for line in f:
        if len(line) == 1:
            blocks.append([])
        else:
            blocks[-1] += [line.strip().split('|')]

# remove blocks that don't contain 'relation'
blocks = [block for block in blocks
          if any('relation' == x[1] for x in block)]
pprint(blocks)
# [[['SE', 'text', 'Baz'],
#   ['SE', 'entity', 'Bla'],
#   ['SE', 'relation', 'Bla'],
#   ['SE', 'relation', 'Foo']],
#  [['SE', 'text', 'Zoo'], ['SE', 'relation', 'Bla'], ['SE', 'relation', 'Baz']]]

# To export to proper json format the following can be done
import pandas as pd
import json

results = []
for block in blocks:
    df = pd.DataFrame(block)
    json_dict = {}
    json_dict['text'] = list(df[2][df[1] == 'text'])
    json_dict['relation'] = list(df[2][df[1] == 'relation'])
    results.append(json_dict)
print(json.dumps(results))
# '[{"text": ["Baz"], "relation": ["Bla", "Foo"]}, {"text": ["Zoo"], "relation": ["Bla", "Baz"]}]'
Let's go through it:
Read the file into a list, dividing blocks at blank lines and splitting columns on the | character.
Go through each block in the list and filter out any that do not contain relation.
Print the output.
You cannot store the same key twice in a dictionary, as mentioned in the comments.
You can read your file, split it at '\n\n' into blocks, split blocks into lines at '\n', and split lines into data at '|'.
You can then put it into a suitable data structure and convert it to a string using the json module:
Create data file:
with open("f.txt", "w") as f:
    f.write('''SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz''')
Read data and process it:
with open("f.txt") as f:
    all_text = f.read()

as_blocks = all_text.split("\n\n")

# skip SE when splitting and keep only blocks containing |relation|
with_relation = [[k.split("|")[1:]
                  for k in b.split("\n")]
                 for b in as_blocks if "|relation|" in b]
print(with_relation)
Create a suitable data structure, grouping repeated keys into a list:
result = []
for inner in with_relation:
    result.append({})
    for k, v in inner:
        # add as simple key
        if k not in result[-1]:
            result[-1][k] = v
        # got key a 2nd time, convert it to a list
        elif k in result[-1] and not isinstance(result[-1][k], list):
            result[-1][k] = [result[-1][k], v]
        # got it a 3rd+ time, add to the list
        else:
            result[-1][k].append(v)
print(result)
Create json from data structure:
import json
print(json.dumps({"result": result}, indent=4))
Output:
# with_relation
[[['text', 'Baz'], ['entity', 'Bla'], ['relation', 'Bla'], ['relation', 'Foo']],
 [['text', 'Zoo'], ['relation', 'Bla'], ['relation', 'Baz']]]

# result
[{'text': 'Baz', 'entity': 'Bla', 'relation': ['Bla', 'Foo']},
 {'text': 'Zoo', 'relation': ['Bla', 'Baz']}]

# json string
{
    "result": [
        {
            "text": "Baz",
            "entity": "Bla",
            "relation": [
                "Bla",
                "Foo"
            ]
        },
        {
            "text": "Zoo",
            "relation": [
                "Bla",
                "Baz"
            ]
        }
    ]
}
In my opinion this is a very good case for a small parser.
This solution uses a PEG parser called parsimonious but you could totally use another one:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import json

data = """
SE|text|Baz
SE|entity|Bla
SE|relation|Bla
SE|relation|Foo

SE|text|Bla
SE|entity|Foo

SE|text|Zoo
SE|relation|Bla
SE|relation|Baz
"""

class TagVisitor(NodeVisitor):
    grammar = Grammar(r"""
        content = (ws / block)+
        block   = line+
        line    = ~".+" nl?

        nl      = ~"[\n\r]"
        ws      = ~"\s+"
    """)

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        filtered = [child[0] for child in visited_children if isinstance(child[0], dict)]
        return {"result": filtered}

    def visit_block(self, node, visited_children):
        text, relations = None, []
        for child in visited_children:
            if child[1] == "text" and not text:
                text = child[2].strip()
            elif child[1] == "relation":
                relations.append(child[2])
        if relations:
            return {"text": text, "relation": relations}

    def visit_line(self, node, visited_children):
        tag1, tag2, text = node.text.split("|")
        return tag1, tag2, text.strip()

tv = TagVisitor()
result = tv.parse(data)
print(json.dumps(result))
This yields
{"result":
[{"text": "Baz", "relation": ["Bla", "Foo"]},
{"text": "Zoo", "relation": ["Bla", "Baz"]}]
}
The idea is to write a grammar, build an abstract syntax tree from it, and return each block's content in a suitable data format.
I have a dictionary like this:
I saw this question, Find all occurrences of a key in nested python dictionaries and lists, but it returns all the values of the key.
{
    "html": {
        "head": {
            "title": {
                "text": "Your Title Here"
            }
        },
        "body": {
            "bgcolor": "FFFFFF",
            "img": {
                "src": "clouds.jpg",
                "align": "bottom",
                "text": ""
            },
            "a": [
                {
                    "href": "http://somegreatsite.com",
                    "text": "Link Name"
                },
                {
                    "href": "mailto:support@yourcompany.com",
                    "text": "support@yourcompany.com"
                }
            ],
            "p": [
                {
                    "text": "This is a new paragraph!dasda"
                },
                {
                    "h1": {
                        "text": "dasda"
                    }
                },
                {
                    "h3": {
                        "text": "hello therereere"
                    }
                }
            ],
            "h1": [
                {
                    "text": "This is a Header"
                },
                {
                    "class": "text-primary",
                    "text": "This is a Header"
                }
            ],
            "h2": {
                "text": "This is a Medium Header"
            },
            "b": {
                "text": "This is a new sentence without a paragraph break, in bold italics.",
                "i": {
                    "text": "This is a new sentence without a paragraph break, in bold italics."
                }
            }
        }
    }
}
I want the function to find just the first occurrence of a key.
For example, if I search for h1, it should return:
{"text": "dasda"}
It should be noted that your question is constructed a bit weirdly. Technically the first occurrence of h1 (in terms of hierarchy) is under html.body.h1. You seem to want the very first occurrence in terms of order, in other words html.body.p.h1. However, Python dictionaries only guarantee insertion order from version 3.7 onwards.
Here is a hierarchical solution for the time being:
def func(dictionary: dict, key: str):
    a = None
    for k, v in dictionary.items():
        if key in dictionary:
            return dictionary[key]
        elif isinstance(v, dict):
            a = func(v, key)
    if isinstance(a, list):
        return a[0]
    else:
        return a

print(func(d, "h1"))  # d is the dictionary from the question
Outputs:
{'text': 'This is a Header'}
You can do this recursively:
# Recursively go through dicts and lists to find the given target key
def findKey(target, dic):
    found = False
    # No iteration if it is a string
    if isinstance(dic, str):
        return None
    # Iterate differently if it is a list
    if isinstance(dic, list):
        for elem in dic:
            found = findKey(target, elem)
            if found:
                return found
        return found
    # Find the key in the dictionary
    for key in dic.keys():
        if key == target:
            return dic[key]
        else:
            found = findKey(target, dic[key])
            if found:
                return found

d = {'html': {'head': {'title': {'text': 'Your Title Here'}}, 'body': {'bgcolor': 'FFFFFF', 'img': {'src': 'clouds.jpg', 'align': 'bottom', 'text': ''}, 'a': [{'href': 'http://somegreatsite.com', 'text': 'Link Name'}, {'href': 'mailto:support@yourcompany.com', 'text': 'support@yourcompany.com'}], 'p': [{'text': 'This is a new paragraph!dasda'}, {'h1': {'text': 'dasda'}}, {'h3': {'text': 'hello therereere'}}], 'h1': [{'text': 'This is a Header'}, {'class': 'text-primary', 'text': 'This is a Header'}], 'h2': {'text': 'This is a Medium Header'}, 'b': {'text': 'This is a new sentence without a paragraph break, in bold italics.', 'i': {'text': 'This is a new sentence without a paragraph break, in bold italics.'}}}}}
findKey("h1", d)
Output:
{'text': 'dasda'}
I have a dictionary, taken from a JSON file, where all of the values, including the nested ones, are sequential, regardless of nesting.
Obj = [
    {"Text": 1,
     "Content": [{"Text": 2}, {"Text": 3}, {"Text": 4}, {"Text": 5}]},
    {"Text": 6,
     "Content": [{"Text": 7}, {"Text": 8}, {"Text": 9}, {"Text": 10}]}
]
I need to make {"Text": 3} a top-level element and move all the following elements in that nested list beneath it, like this:
Obj = [
    {"Text": 1,
     "Content": [{"Text": 2}]},
    {"Text": 3,
     "Content": [{"Text": 4}, {"Text": 5}]},
    {"Text": 6,
     "Content": [{"Text": 7}, {"Text": 8}, {"Text": 9}, {"Text": 10}]}
]
Assuming this list is really long, with a varying number of top-level and nested elements, how would I do this efficiently?
Here is a very simplified version of my actual code. The result is roughly the same: I either get empty 'Content' values or, with other (much longer) variations of this code, the order of the elements gets out of sync:
Obj = [
    {"Text": 1,
     "Content": [{"Text": 2}, {"Text": 3}, {"Text": 4}, {"Text": 5}]},
    {"Text": 6,
     "Content": [{"Text": 7}, {"Text": 8}, {"Text": 9}, {"Text": 10}]}
]
print(Obj)

NewObj = []
lineCounter = -1
topLevelKey = -1
ElementToMove = 3

for toplevel in Obj:
    topLevelKey += 1
    NewObj.append(toplevel)
    NewObj[len(NewObj) - 1]['Content'] = []
    for nested in toplevel['Content']:
        lineCounter += 1
        print(NewObj)
        if nested['text'] == ElementToMove:
            NewObj.insert(topLevelKey, nested)
            NewObj[topLevelKey]['Content'] = []
        else:
            NewObj[topLevelKey]['Content'].append(nested)

print(NewObj)
The end result of this code sample ends up looking like this:
[{'Text': 1, 'Content': []}, {'Text': 6, 'Content': []}]
I've tried solving this several different ways, and I know the answer is probably super simple. Python probably has a function for dealing with stuff like this that I just can't find. What am I missing?
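For what it's worth, one way to approach this is to scan each top-level element's "Content" for the target, keep everything before it under the old parent, and emit the target as a new top-level element carrying the remainder. This is only a sketch, not the asker's code; the function name promote and the use of dict unpacking are my own choices:

```python
Obj = [
    {"Text": 1,
     "Content": [{"Text": 2}, {"Text": 3}, {"Text": 4}, {"Text": 5}]},
    {"Text": 6,
     "Content": [{"Text": 7}, {"Text": 8}, {"Text": 9}, {"Text": 10}]}
]

def promote(obj, target):
    """Make the nested element whose "Text" equals target a top-level
    element, moving the following siblings into its "Content"."""
    result = []
    for top in obj:
        content = top.get("Content", [])
        for i, nested in enumerate(content):
            if nested.get("Text") == target:
                # keep the part before the target under the old parent
                result.append({**top, "Content": content[:i]})
                # the target becomes a new top-level element owning the rest
                result.append({"Text": target, "Content": content[i + 1:]})
                break
        else:
            # target not found in this element: keep it unchanged
            result.append(top)
    return result

print(promote(Obj, 3))
```

This builds a new list instead of mutating NewObj in place, which sidesteps the shared-reference problem in the code above (appending toplevel and then clearing its 'Content' empties the very list being iterated).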
Is there any way in Python 2.6 to supply a custom key or cmp function to JSON's sort_keys?
I've got a list of dicts coming from JSON like so:
[
    {
        "key": "numberpuzzles1",
        "url": "number-puzzle-i.html",
        "title": "Number Puzzle I",
        "category": "nestedloops",
        "points": "60",
        "n": "087"
    },
    {
        "key": "gettingindividualdigits",
        "url": "getting-individual-digits.html",
        "title": "Getting Individual Digits",
        "category": "nestedloops",
        "points": "80",
        "n": "088"
    }
]
...which I've stored in the list variable assigndb. I'd like to be able to load in the JSON, modify it, and serialize it back out with dumps (or whatever), keeping the order of the keys intact.
So far, I've tried something like this:
ordering = {'key': 0, 'url': 1, 'title': 2, 'category': 3,
            'flags': 4, 'points': 5, 'n': 6}

def key_func(k):
    return ordering[k]

# renumber assignments sequentially
for (i, a) in enumerate(assigndb):
    a["n"] = "%03d" % (i + 1)

s = json.dumps(assigndb, indent=2, sort_keys=True, key=key_func)
...but of course dumps doesn't support a custom key like list.sort() does. Something with a custom JSONEncoder maybe? I can't seem to get it going.
An idea (tested with 2.7):
import json
import collections

json.encoder.c_make_encoder = None
d = collections.OrderedDict([("b", 2), ("a", 1)])
json.dumps(d)
# '{"b": 2, "a": 1}'
See: OrderedDict + issue6105. The c_make_encoder hack seems only to be needed for Python 2.x. Not a direct solution because you have to change dicts for OrderedDicts, but it may still be usable. I checked the json library (encoder.py) and the ordering is hardcoded:

if _sort_keys:
    items = sorted(dct.items(), key=lambda kv: kv[0])
This is kind of ugly, but in case tokland's solution does not work for you:
import json

data = [{'category': 'nestedloops', 'title': 'Number Puzzle I', 'url': 'number-puzzle-i.html', 'n': '087', 'points': '60', 'key': 'numberpuzzles1'}, {'category': 'nestedloops', 'title': 'Getting Individual Digits', 'url': 'getting-individual-digits.html', 'n': '088', 'points': '80', 'key': 'gettingindividualdigits'}]

ordering = {'key': 0, 'url': 1, 'title': 2, 'category': 3,
            'flags': 4, 'points': 5, 'n': 6}

outlist = []
for d in data:
    outlist.append([])
    for k in sorted(d.keys(), key=lambda k: ordering[k]):
        outlist[-1].append(json.dumps({k: d[k]}))
for i, l in enumerate(outlist):
    outlist[i] = "{" + ",".join(s[1:-1] for s in outlist[i]) + "}"
s = "[" + ",".join(outlist) + "]"
Compact yet powerful recursive implementation with "prepended" and "appended" keys: https://gist.github.com/jeromerg/91f73d5867c5fa04ee7dbc0c5a03d611
def sort_recursive(node, first_keys, last_keys):
    """Sort the dictionary entries in a whole JSON object tree."""
    fixed_placements = {
        **{key: (0, idx) for idx, key in enumerate(first_keys)},
        **{key: (2, idx) for idx, key in enumerate(last_keys)},
    }
    return _sort_recursive(node, lambda key: fixed_placements.get(key, (1, key)))

def _sort_recursive(node, key_fn):
    if isinstance(node, list):
        return [_sort_recursive(val, key_fn) for val in node]
    elif isinstance(node, dict):
        sorted_keys = sorted(node.keys(), key=key_fn)
        return {k: _sort_recursive(node[k], key_fn) for k in sorted_keys}
    else:
        return node
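A quick usage sketch (the sample document and timestamp key are my own invention), pinning a key to the front at every nesting level while the remaining keys sort alphabetically:

```python
def sort_recursive(node, first_keys, last_keys):
    # same recursive sorter as above, repeated so the snippet runs on its own
    fixed_placements = {
        **{key: (0, idx) for idx, key in enumerate(first_keys)},
        **{key: (2, idx) for idx, key in enumerate(last_keys)},
    }
    return _sort_recursive(node, lambda key: fixed_placements.get(key, (1, key)))

def _sort_recursive(node, key_fn):
    if isinstance(node, list):
        return [_sort_recursive(val, key_fn) for val in node]
    elif isinstance(node, dict):
        return {k: _sort_recursive(node[k], key_fn) for k in sorted(node, key=key_fn)}
    else:
        return node

doc = {"points": "60", "timestamp": "2024-01-01",
       "meta": {"z": 1, "timestamp": "then", "a": 2}}
out = sort_recursive(doc, first_keys=["timestamp"], last_keys=[])
print(list(out))          # ['timestamp', 'meta', 'points']
print(list(out["meta"]))  # ['timestamp', 'a', 'z']
```

The trick is that every key maps to a sortable tuple: pinned keys get a (0, position) or (2, position) tuple, everything else gets (1, key), so the middle group stays alphabetical.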
I had the same problem and collections.OrderedDict was just not fit for the task because it ordered everything alphabetically. So I wrote something similar to Andrew Clark's solution:
import json

def json_dumps_sorted(data, **kwargs):
    sorted_keys = kwargs.get('sorted_keys', tuple())
    if not sorted_keys:
        return json.dumps(data)
    else:
        out_list = []
        for element in data:
            element_list = []
            for key in sorted_keys:
                if key in element:
                    element_list.append(json.dumps({key: element[key]}))
            out_list.append('{{{}}}'.format(','.join(s[1:-1] for s in element_list)))
        return '[{}]'.format(','.join(out_list))
You use it like this:
json_string = json_dumps_sorted([
    {
        "key": "numberpuzzles1",
        "url": "number-puzzle-i.html",
        "title": "Number Puzzle I",
        "category": "nestedloops",
        "points": "60",
        "n": "087"
    }, {
        "key": "gettingindividualdigits",
        "url": "getting-individual-digits.html",
        "title": "Getting Individual Digits",
        "category": "nestedloops",
        "points": "80",
        "n": "088"
    }
], sorted_keys=(
    'key',
    'url',
    'title',
    'category',
    'flags',
    'points',
    'n'
))
Thanks. I needed to put a timestamp key/value at the top of my JSON object no matter what. Obviously, sorting the keys alphabetically screwed this up, as it starts with "t".
Using something like this, while putting the timestamp key into dict_data first, worked:
d = collections.OrderedDict(dict_data)