Simple in-memory positional inverted index in Python

I'm trying to build a simple positional index but am having trouble getting the correct output.
Given a list of strings (sentences), I want to use each string's position in the list as the document id, then iterate over the words in the sentence and use each word's index in the sentence as its position. Each word's entry in a dictionary should then be updated with a tuple of the doc id and the word's position in the doc.
Code:
main func -
def doc_pos_index(alist):
    inv_index= {}
    words = [word for line in alist for word in line.split(" ")]
    for word in words:
        if word not in inv_index:
            inv_index[word]=[]
    for item, index in enumerate(alist): # find item and it's index in list
        for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index
            if item2 in inv_index:
                inv_index[i].append(tuple(index, index2)) # if word in index update it's list with tuple of doc index and position
    return inv_index
example list:
doc_list= [
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed'
]
desired output:
{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)],
'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)],
'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)],
etc...}
Current output:
{'Delivered': [],
'necessary': [],
'dejection': [],
'do': [],
'objection': [],
'prevailed': [],
'mr': [],
'hello': []}
FYI, I do know about the collections library and NLTK, but I'm mainly doing this for learning/practice.

Check this:
>>> result = {}
>>> for doc_id,doc in enumerate(doc_list):
...     for word_pos,word in enumerate(doc.split()):
...         result.setdefault(word,[]).append((doc_id,word_pos))
>>> result
{'Delivered': [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], 'necessary': [(0, 3), (1, 3), (2, 3), (3, 3), (4, 3)], 'dejection': [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)], 'do': [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)], 'objection': [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)], 'prevailed': [(0, 7), (1, 7), (2, 7), (3, 7), (4, 7)], 'mr': [(0, 6), (1, 6), (2, 6), (3, 6), (4, 6)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]}
>>>

You seem to be confused about what enumerate() does. The first item it yields is the index and the second is the value; you have them reversed.
Your second use of enumerate() has a further problem:
        for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index
First of all, you don't need alist[item]: you already have that line's value in the index variable (again, this is perhaps confusing because the variable names are backwards). Second, you seem to expect enumerate() to split a line into individual words. It won't; it simply iterates over every character in the string. (I'm not sure why you expected this, since you showed earlier that you know how to split a string on spaces.)
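To make the two points about enumerate() concrete, here is a small sketch. The sample string is shortened from the question, and the names word_pairs and char_pairs are only illustrative.

```python
line = "hello Delivered dejection"

# enumerate() yields (index, value) pairs, in that order.
word_pairs = list(enumerate(line.split()))
print(word_pairs)  # [(0, 'hello'), (1, 'Delivered'), (2, 'dejection')]

# Enumerating the unsplit string walks over characters, not words.
char_pairs = list(enumerate(line))
print(char_pairs[:3])  # [(0, 'h'), (1, 'e'), (2, 'l')]
```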
As an additional tip, you don't need to do this:
for word in words:
    if word not in inv_index:
        inv_index[word]=[]
First of all, since you're just initializing a dict you don't need the if statement. Just
for word in words:
    inv_index[word] = []
will do. If the word is already in the dictionary this will make an unnecessary assignment, true, but it's still an O(1) operation so there's no harm. However, you don't even need to do this. Instead you can use collections.defaultdict:
from collections import defaultdict
inv_index = defaultdict(list)
Then you can just do inv_index[word].append(...). If word is not already in inv_index, it will be added with an empty list as its value; otherwise the tuple is simply appended to the existing list.
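Putting these suggestions together, one possible version of the function looks like the sketch below. It assumes whitespace tokenization via str.split() and uses a shortened two-document list rather than the one from the question.

```python
from collections import defaultdict

def doc_pos_index(docs):
    """Map each word to a list of (doc_id, position) tuples."""
    inv_index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for position, word in enumerate(doc.split()):
            inv_index[word].append((doc_id, position))
    # Convert back to a plain dict so missing words raise KeyError.
    return dict(inv_index)

doc_list = ['hello Delivered dejection', 'hello necessary objection']
index = doc_pos_index(doc_list)
print(index['hello'])      # [(0, 0), (1, 0)]
print(index['objection'])  # [(1, 2)]
```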

#And the algorithm for the following: {term: [df, tf, {doc1: [tf, [offsets], doc2...}]]
InvertedIndex = {}
from TextProcessing import *  # not a standard module; assumed to provide tokenization()
for i in range(len(listaDocumentos)):
    docTokens = tokenization(listaDocumentos[i], NLTK=True)
    for token in docTokens:
        if token in InvertedIndex:
            if i in InvertedIndex[token][1]:
                # doc id already recorded for this token
                pass
            else:
                InvertedIndex[token][0] += 1
                InvertedIndex[token][1].append(i)
        else:
            DF = 1
            ListOfDOCIDs = [i]
            InvertedIndex[token] = [DF, ListOfDOCIDs]
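The snippet above records only the document frequency and the list of doc ids. A minimal sketch of the fuller layout named in its header comment, {term: [df, tf, {doc: [tf, [offsets]]}]}, might look like this. It assumes plain str.split() tokenization rather than the tokenization() helper, and build_index is a hypothetical name.

```python
def build_index(docs):
    """Return {term: [df, total_tf, {doc_id: [tf_in_doc, [offsets]]}]}."""
    index = {}
    for doc_id, text in enumerate(docs):
        for offset, token in enumerate(text.split()):
            entry = index.setdefault(token, [0, 0, {}])
            postings = entry[2]
            if doc_id not in postings:
                entry[0] += 1                 # first time this doc contains the term
                postings[doc_id] = [0, []]
            postings[doc_id][0] += 1          # term frequency within this doc
            postings[doc_id][1].append(offset)
            entry[1] += 1                     # term frequency across all docs
    return index

docs = ["to be or not to be", "to do"]
index = build_index(docs)
print(index["to"])  # [2, 3, {0: [2, [0, 4]], 1: [1, [0]]}]
```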

Related

unexpected EOF while parsing - how do i fix my code?

I've got
desc = ['(4,1);(1,4)', '(2,3);(3,2)', '(4,2);(2,4);(1,3);(3,1)', '(1,2);(2,1);(4,3);(3,4)']
and I want the output to be
[[(4, 1), (1, 4)], [(2, 3), (3, 2)], [(4, 2), (2, 4), (1, 3), (3, 1)], [(1, 2), (2, 1), (4, 3), (3, 4)]]
So far I've tried:
for x in range(len(desc)):
    desc[x] = desc[x].split(';')
        for y in range(len(desc[x])):
            desc[x][y] = eval(desc[x][y])
but I get a syntax error saying 'unexpected EOF while parsing'. How do I fix my code?
In the last two lines I was just trying to extract the tuples from the strings that contain them. Is there anything I could use other than eval()?
Unexpected EOF is caused by the indentation of the second for loop.
for x in range(len(desc)):
    desc[x] = desc[x].split(';')
        for y in range(len(desc[x])): # this has one tab too much
            desc[x][y] = eval(desc[x][y])
This is how it should look:
for x in range(len(desc)):
    desc[x] = desc[x].split(';')
    for y in range(len(desc[x])):
        desc[x][y] = eval(desc[x][y])
You want to split each item of your list on the separator ';'. Loop over the list with for element in desc and split each element on that separator: temps = element.split(';'). You can then append the list [temps[0], temps[1]] to your output list:
desc = ['(4,1);(1,4)', '(2,3);(3,2)', '(4,2);(2,4);(1,3);(3,1)', '(1,2);(2,1);(4,3);(3,4)']
output = []
for element in desc:
    temps = element.split(";")
    output.append([temps[0], temps[1]])
print(output)
# [['(4,1)', '(1,4)'], ['(2,3)', '(3,2)'], ['(4,2)', '(2,4)'], ['(1,2)', '(2,1)']]
To get rid of the quotes, you have to convert each item into an actual tuple of integers:
desc = ['(4,1);(1,4)', '(2,3);(3,2)', '(4,2);(2,4);(1,3);(3,1)', '(1,2);(2,1);(4,3);(3,4)']
output = []
for element in desc:
    temps = element.split(";")
    tuples_to_add = []
    for i in temps:
        # Taking the first and last character only works for single-digit values
        tuples_to_add.append(tuple([int(i.strip('()')[0]), int(i.strip('()')[-1])]))
    output.append(tuples_to_add)
print(output)
[[(4, 1), (1, 4)], [(2, 3), (3, 2)], [(4, 2), (2, 4), (1, 3), (3, 1)], [(1, 2), (2, 1), (4, 3), (3, 4)]]
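As for an alternative to eval(), the standard library's ast.literal_eval is one option. The sketch below assumes every ';'-separated piece is a valid Python tuple literal; it also handles multi-digit numbers, which the character-indexing approach does not.

```python
import ast

desc = ['(4,1);(1,4)', '(2,3);(3,2)', '(4,2);(2,4);(1,3);(3,1)', '(1,2);(2,1);(4,3);(3,4)']

# literal_eval only accepts Python literals (tuples, numbers, strings, ...),
# so unlike eval() it will not execute arbitrary expressions.
output = [[ast.literal_eval(item) for item in element.split(';')]
          for element in desc]
print(output)
```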

TypeError: unhashable type: 'list' in python chess program

I am writing a chess program and am currently implementing check. I need the key from the opponent moves dictionary (the entry whose moves contain the king's position) so that I can look up the coordinate of the piece placing the king in check. Right now this gives me the error:
opponentpieceposition=opponentposition.get(piece)
TypeError: unhashable type: 'list'.
Note the example below should print (1,6)
king=(5,1)
opponentmoves={'ksknight': [(8, 3), (5, 2), (6, 3)],
'ksbishop': [(3, 6), (4, 7), (5, 8), (1, 4), (1, 6), (3, 4), (4, 3), (5, 1), (6, 1)],
'king': [(6, 1), (5, 2), (4, 1)],
'queen': [(4, 5), (2, 4), (1, 3), (2, 6), (1, 7), (4, 4)],
'qsknight': [(3, 3), (1, 3)]}
opponentposition={'ksknight': (1, 3),
'ksbishop': (1, 6),
'king': (6, 1),
'queen': (4, 5),
'qsknight': (3, 3)}
if king in [z for v in opponentmoves.values() for z in v]:
    piece=[key for key in opponentmoves if king in opponentmoves[key]]
    opponentpieceposition=opponentposition.get(piece)
    print(opponentpieceposition)
Lists and objects of other mutable types cannot be used as keys in dictionaries (or as elements of sets).
These containers rely on a hash value computed from the object's content at insertion time, so if the object changed after insertion (as mutable objects can), lookups would break.
You can use a tuple instead, which is an immutable sequence.
duplicate
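A short illustration of the difference between a list and a tuple as a dictionary key. The dictionaries here are made up for the example, and the exact wording of the error message may vary between Python versions.

```python
positions = {'ksbishop': (1, 6)}

error = None
try:
    positions.get(['ksbishop'])   # the key must be hashed, and a list cannot be
except TypeError as exc:
    error = str(exc)
    print(error)

# A tuple of hashable items is itself hashable, so it works as a key.
by_square = {(1, 6): 'ksbishop'}
print(by_square[(1, 6)])  # ksbishop
```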
In your code, piece is a list, so it can't be used as a dictionary key. The comments in the code show how to work around this:
if king in [z for v in opponentmoves.values() for z in v]:
    piece = [key for key in opponentmoves if king in opponentmoves[key]]
    print(piece) # Let's show what piece is
    # result is ['ksbishop']
    # so we need the 1st element of the list piece
    opponentpieceposition=opponentposition.get(piece[0]) # take the 1st element
    print(opponentpieceposition)
Hope it helped to solve the issue.
This is what I got working.
if king in [z for v in opponent.moves.values() for z in v]:
    for key in opponent.moves:
        opponentpiece=opponent.moves[key]
        if king in opponentpiece:
            opponentposition=opponent.position[key]
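Since more than one piece can give check at once, it may be worth collecting every match rather than only one. A possible sketch follows, using a shortened, made-up variant of the dictionaries from the question in which two pieces attack the king; checkers is a hypothetical name.

```python
king = (5, 1)
# Illustrative data only: trimmed from the question and altered so that
# both the bishop and the queen attack the king's square.
opponentmoves = {
    'ksknight': [(8, 3), (5, 2), (6, 3)],
    'ksbishop': [(3, 6), (4, 7), (5, 1), (6, 1)],
    'queen': [(4, 5), (2, 4), (5, 1)],
}
opponentposition = {'ksknight': (1, 3), 'ksbishop': (1, 6), 'queen': (4, 5)}

# Map each checking piece's name (a string, hence hashable) to its position.
checkers = {name: opponentposition[name]
            for name, moves in opponentmoves.items()
            if king in moves}
print(checkers)  # {'ksbishop': (1, 6), 'queen': (4, 5)}
```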

How to remove duplicates from a list of tuples when order is important

I have seen some similar answers, but I can't find something specific for this case.
I have a list of tuples:
[(5, 0), (3, 1), (3, 2), (5, 3), (6, 4)]
What I want is to remove a tuple from this list when its first element has already occurred earlier in the list; the tuple that remains should be the one with the smallest second element.
So the output should look like this:
[(5, 0), (3, 1), (6, 4)]
Here's a linear time approach that requires two iterations over your original list.
t = [(5, 0), (3, 1), (3, 2), (5, 3), (6, 4)] # test case 1
#t = [(5, 3), (3, 1), (3, 2), (5, 0), (6, 4)] # test case 2
smallest = {}
inf = float('inf')
for first, second in t:
    if smallest.get(first, inf) > second:
        smallest[first] = second
result = []
seen = set()
for first, second in t:
    if first not in seen and second == smallest[first]:
        seen.add(first)
        result.append((first, second))
print(result) # [(5, 0), (3, 1), (6, 4)] for test case 1
              # [(3, 1), (5, 0), (6, 4)] for test case 2
Here is a compact version I came up with using OrderedDict and skipping replacement if new value is larger than old.
from collections import OrderedDict
a = [(5, 3), (3, 1), (3, 2), (5, 0), (6, 4)]
d = OrderedDict()
for item in a:
    # Get old value in dictionary if it exists
    old = d.get(item[0])
    # Skip if new item is larger than old
    if old:
        if item[1] > old[1]:
            continue
        #else:
        #    del d[item[0]]
    # Assign
    d[item[0]] = item
list(d.values())
Returns:
[(5, 0), (3, 1), (6, 4)]
Or if you use the else-statement (commented out):
[(3, 1), (5, 0), (6, 4)]
Seems to me that you need to know two things:
1. The tuple that has the smallest second element for each first element.
2. The order to index each first element in the new list.
We can get #1 by using itertools.groupby and a min function.
import itertools
import operator
lst = [(3, 1), (5, 3), (5, 0), (3, 2), (6, 4)]
# I changed this slightly to make it harder to accidentally succeed.
# correct final order should be [(3, 1), (5, 0), (6, 4)]
tmplst = sorted(lst, key=operator.itemgetter(0))
groups = itertools.groupby(tmplst, operator.itemgetter(0))
# group by first element, in this case this looks like:
# [(3, [(3, 1), (3, 2)]), (5, [(5, 3), (5, 0)]), (6, [(6, 4)])]
# note that groupby only works on sorted lists, so we need to sort this first
min_tuples = {min(v, key=operator.itemgetter(1)) for _, v in groups}
# give the best possible result for each first tuple. In this case:
# {(3, 1), (5, 0), (6, 4)}
# (note that this is a set comprehension, for faster lookups later.)
Now that we know what our result set looks like, we can re-tackle lst to get them in the right order.
seen = set()
result = []
for el in lst:
    if el not in min_tuples: # don't add to result
        continue
    elif el not in seen: # add to result and mark as seen
        result.append(el)
        seen.add(el)
This will do what you need:
# I switched (5, 3) and (5, 0) to demonstrate sorting capabilities.
list_a = [(5, 3), (3, 1), (3, 2), (5, 0), (6, 4)]
# Create a list to contain the results
list_b = []
# Create a list to check for duplicates
l = []
# Sort list_a by the second element of each tuple to ensure the smallest numbers
list_a.sort(key=lambda i: i[1])
# Iterate through every tuple in list_a
for i in list_a:
    # Check if the 0th element of the tuple is in the duplicates list; if not:
    if i[0] not in l:
        # Add the tuple the loop is currently on to the results; and
        list_b.append(i)
        # Add the 0th element of the tuple to the duplicates list
        l.append(i[0])
>>> print(list_b)
[(5, 0), (3, 1), (6, 4)]
Hope this helped!
Using enumerate() and list comprehension:
def remove_if_first_index(l):
    return [item for index, item in enumerate(l) if item[0] not in [value[0] for value in l[0:index]]]
Using enumerate() and a for loop:
def remove_if_first_index(l):
    # The list to store the return value
    ret = []
    # Get each index and item from the list passed in
    for index, item in enumerate(l):
        # Get the first number in each tuple up to the index we're currently at
        previous_values = [value[0] for value in l[0:index]]
        # If the item's first number is not in the list of previously encountered first numbers
        if item[0] not in previous_values:
            # Append it to the return list
            ret.append(item)
    return ret
Testing
some_list = [(5, 0), (3, 1), (3, 2), (5, 3), (6, 4)]
print(remove_if_first_index(some_list))
# [(5, 0), (3, 1), (6, 4)]
I had this idea without seeing @Anton vBR's answer.
import collections
inp = [(5, 0), (3, 1), (3, 2), (5, 3), (6, 4)]
od = collections.OrderedDict()
for i1, i2 in inp:
    if i2 <= od.get(i1, i2):
        od.pop(i1, None)
        od[i1] = i2
outp = list(od.items())
print(outp)
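On Python 3.7 and later, a plain dict preserves insertion order, so a single pass may be enough. The sketch below assumes the desired order is that of each first element's first appearance, which is one reading of the question; dedupe_keep_min is a hypothetical name.

```python
def dedupe_keep_min(pairs):
    """Keep one tuple per first element, with the smallest second element."""
    best = {}
    for first, second in pairs:
        # Re-assigning an existing key updates the value but keeps its position.
        if first not in best or second < best[first]:
            best[first] = second
    return list(best.items())

print(dedupe_keep_min([(5, 0), (3, 1), (3, 2), (5, 3), (6, 4)]))
# [(5, 0), (3, 1), (6, 4)]
print(dedupe_keep_min([(5, 3), (3, 1), (3, 2), (5, 0), (6, 4)]))
# [(5, 0), (3, 1), (6, 4)]
```

If the result should instead follow the position of the minimal tuple itself, the two-pass approach shown earlier on this page is the better fit.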

Python: Print a generator expression's values when those values are itertools.product objects

I'm trying to dig into some code I found online here to better understand Python.
This is the code fragment I'm trying to get a feel for:
from itertools import chain, product
def generate_groupings(word_length, glyph_sizes=(1,2)):
    cartesian_products = (
        product(glyph_sizes, repeat=r)
        for r in range(1, word_length + 1)
    )
Here, word_length is 3.
I'm trying to evaluate the contents of the cartesian_products generator. From what I can gather after reading the answer at this SO question, generators do not iterate (and thus, do not yield a value) until they are called as part of a collection, so I've placed the generator in a list:
list(cartesian_products)
Out[6]:
[<itertools.product at 0x1025d1dc0>,
<itertools.product at 0x1025d1e10>,
<itertools.product at 0x1025d1f50>]
Obviously, I now see inside the generator, but I was hoping to get more specific information than the raw details of the itertools.product objects. Is there a way to accomplish this?
If you don't care about exhausting the generator, you can use:
list(map(list,cartesian_products))
You will get the following for word_length = 3
Out[1]:
[[(1,), (2,)],
[(1, 1), (1, 2), (2, 1), (2, 2)],
[(1, 1, 1),
(1, 1, 2),
(1, 2, 1),
(1, 2, 2),
(2, 1, 1),
(2, 1, 2),
(2, 2, 1),
(2, 2, 2)]]
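The remark about exhausting the generator matters in practice: once the products have been expanded, the generator yields nothing more. A small sketch follows, in which make_products is a hypothetical helper standing in for the generator expression from the question.

```python
from itertools import product

def make_products(word_length, glyph_sizes=(1, 2)):
    # Same generator expression as in the question, returned for inspection.
    return (product(glyph_sizes, repeat=r) for r in range(1, word_length + 1))

cartesian_products = make_products(2)
expanded = [list(p) for p in cartesian_products]
print(expanded)  # [[(1,), (2,)], [(1, 1), (1, 2), (2, 1), (2, 2)]]

# The generator is now exhausted; a second pass produces nothing.
leftover = list(cartesian_products)
print(leftover)  # []
```

To inspect the values and still use them afterwards, keep the expanded list or rebuild the generator.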

How to generate list of tuples relating records

I need to generate a list from this list of tuples:
a = [(1,2), (1,3), (2,3), (2,5), (2,6), (3,4), (3,6), (4,7), (5,6), (5,9), (5,10), (6,7),
(6,10), (6,11), (7,8), (7,12), (8,12), (9,10), (10,11)]
The rules are:
- I start from any record (begin = random.choice(a)).
- Items in the new list must satisfy the following relationship:
the last item of each tuple in the list must be equal to the first item of the next tuple to be inserted.
Example of a valid output (starting from the tuple (3, 1)):
[(3, 1), (1, 2), (2, 3), (3, 4), (4, 7), (7, 8), (8, 12), (12, 7), (7, 6), (6, 2), (2, 5), (5, 6), (6, 10), (10, 5), (5, 9), (9, 10), (10, 11), (11, 6), (6, 3)]
How can I do this? Can it be done using list comprehensions?
Thanks!
Here, lisb will be populated with tuples in the order that you seek. This is, of course, if lisa provides appropriate tuples (ie, each tuple has a 1th value matching another tuple's 0th value). Your sample list will not work, regardless of the implementation, because all the values don't match up (for example, there is no 0th element with 12, so that tuple can't be connected forward to any other tuple)...so you should come up with a better sample list.
Tested, working.
import random
lisa = [(1, 2), (3, 4), (2, 3), (4, 0), (0, 9), (9, 1)]
lisb = []
current = random.choice(lisa)
while True:
    lisa.remove(current)
    lisb.append(current)
    # Next tuple whose first value matches the current tuple's last value
    current = next((y for y in lisa if y[0] == current[1]), None)
    if current is None:
        break
print(lisb)
If you don't want to delete items from lisa, just slice a new list.
As a generator function:
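import random

def chained_tuples(x):
    oldlist = x[::]
    item = random.choice(oldlist)
    oldlist.remove(item)
    yield item
    while oldlist:
        # Default of None: a bare next() would raise StopIteration inside the
        # generator, which becomes a RuntimeError on Python 3.7+
        item = next((next_item for next_item in oldlist if next_item[0] == item[1]), None)
        if item is None:
            return
        oldlist.remove(item)
        yield item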
As noted, you'll get an incomplete response if your list isn't actually chainable all the way through, like your example list.
Just to add another way of solving this problem:
import random
from collections import defaultdict
lisa = [(1, 2), (3, 4), (2, 3), (4, 0), (0, 9), (9, 1)]
current_start, current_end = lisa[random.randint(0, len(lisa) - 1)]
starts = defaultdict(list)
lisb = [(current_start, current_end)]
for start, end in lisa:
    starts[start].append(end)
while True:
    if not starts[current_end]:
        break
    current_start, current_end = current_end, starts[current_end].pop()
    lisb.append((current_start, current_end))
Note: You have to make sure lisa is not empty.
I think all of the answers so far are missing the requirement (at least based on your example output) that the longest chain be found.
My suggested solution is to recursively parse all possible chains that can be constructed, and return the longest result. The function looks like this:
def generateTuples(list, offset, value = None):
    if value == None: value = list[offset]
    list = list[:offset]+list[offset+1:]
    res = []
    for i,(a,b) in enumerate(list):
        if value[1] in (a,b):
            if value[1] == a:
                subres = generateTuples(list, i, (a,b))
            else:
                subres = generateTuples(list, i, (b,a))
            if len(subres) > len(res):
                res = subres
    return [value] + res
And you would call it like this:
results = generateTuples(a, 1, (3,1))
Producing the list:
[(3, 1), (1, 2), (2, 3), (3, 4), (4, 7), (7, 8), (8, 12), (12, 7), (7, 6),
(6, 2), (2, 5), (5, 6), (6, 10), (10, 5), (5, 9), (9, 10), (10, 11),
(11, 6), (6, 3)]
The first parameter of the function is the source list of tuples, the second parameter is the offset of the first element to use, the third parameter is optional, but allows you to override the value of the first element. The latter is useful when you want to start with a tuple in its reversed order as you have done in your example.
