Find co-occurrences of two lists in Python

I have two lists. The first contains names; the second contains names with corresponding values. The names in the first list are a subset of the names in the second list. Each value is either true or false. I want to find the names that occur in both lists and count how many of them have the value true. My code:
def return_indices_of_a(a, b):
    # return the indices in a of the items that also occur in b (the co-occurrences)
    b_set = set(b)
    return [i for i, v in enumerate(a) if v in b_set]
data1 = [line.strip() for line in open("text_files/first_list.txt", 'r')]
ins = open("text_files/second_list.txt", "r")  # the "r" is not really needed - default
parseTable = []
for line in ins:
    row = line.rstrip().split(' ')  # <- note use of rstrip()
    parseTable.append(row)
new_data = []
indexes = []
for index in range(len(parseTable)):
    new_data.append(parseTable[index][0])
    indexes.append(parseTable[index][1])
# the function has to be defined before this call
in1 = return_indices_of_a(new_data, data1)
I read both text files containing the lists and find the co-occurrences; then I want to keep only the entries of parseTable[][1] at the in1 indices. Am I doing it right? How can I keep the indices I want? My two lists:
['SITNC', 'porkpackerpete', 'teensHijab', '1DAlert', 'IsmodoFashion',....
[['SITNC', 'true'], ['1DFAMlLY', 'false'], ['tibi', 'true'], ['1Dneews', 'false'], ....

Here's a one liner to get the matches:
matches = [(name, dict(values)[name]) for name in set(names) if name in dict(values)]
and then to get the number of true matches:
len([name for (name, value) in matches if value == 'true'])
Edit
You might want to move dict(values) into a named variable:
value_map = dict(values)
matches = [(name, value_map[name]) for name in set(names) if name in value_map]

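There are two ways: one is what Andrey suggests (you may want to convert names to a set), or, alternatively, convert the second list into a dictionary:
mapping = dict(values)
sum_of_true = sum(mapping.get(n) == 'true' for n in names)
The sum works because bool is a subclass of int in Python (True == 1), so each comparison contributes 1 or 0. The comparison is needed because the values are the strings 'true' and 'false', and mapping.get(n) returns None for names that do not occur in the second list.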

If you need just the number of true values, use the in operator and a list comprehension:
In [1]: names = ['SITNC', 'porkpackerpete', 'teensHijab', '1DAlert', 'IsmodoFashion']
In [2]: values = [['SITNC', 'true'], ['1DFAMlLY', 'false'], ['tibi', 'true'], ['1Dneews', 'false']]
In [3]: sum_of_true = len([v for v in values if v[0] in names and v[1] == "true"])
In [4]: sum_of_true
Out[4]: 1
To also get the indices of the co-occurrences (as positions in names), this one-liner may come in handy:
In [6]: true_indices = [names.index(v[0]) for v in values if v[0] in names and v[1] == "true"]
In [7]: true_indices
Out[7]: [0]
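The question also asks how to keep only the matching rows of the second list. A minimal sketch of one way to do that, assuming the flags are the strings 'true' and 'false' as in the sample data; the names name_set, co_occurrences and true_count are made up for this illustration:

```python
names = ['SITNC', 'porkpackerpete', 'teensHijab', '1DAlert', 'IsmodoFashion']
values = [['SITNC', 'true'], ['1DFAMlLY', 'false'], ['tibi', 'true'], ['1Dneews', 'false']]

# A set makes each membership test O(1) instead of scanning the whole list.
name_set = set(names)

# (index in values, flag) for every row whose name also appears in names.
co_occurrences = [(i, flag) for i, (name, flag) in enumerate(values) if name in name_set]

# Each comparison is a bool, and sum() counts True as 1.
true_count = sum(flag == 'true' for _, flag in co_occurrences)

print(co_occurrences)  # [(0, 'true')]
print(true_count)      # 1
```

Unlike names.index(...), the indices here refer to positions in the second list, which is what the parseTable lookup in the question needs.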

Related

Find duplicates in a list of strings differing only in upper and lower case writing

I have a list of strings that contains 'literal duplicates' and 'pseudo-duplicates' which differ only in lower- and uppercase writing. I am looking for a function that treats all literal duplicates as one group, returns their indices, and finds all pseudo-duplicates for these elements, again returning their indices.
Here's an example list:
a = ['bar','bar','foo','Bar','Foo','Foo']
And this is the output I am looking for (a list of lists of lists):
dupe_list = [[[0,1],[3]],[[2],[4,5]]]
Explanation: 'bar' appears twice at the indexes 0 and 1 and there is one pseudo-duplicate 'Bar' at index 3. 'foo' appears once at index 2 and there are two pseudo-duplicates 'Foo' at indexes 4 and 5.
Here is one solution (you didn't specify which spelling counts as the literal one, so I assumed the lowercase form is the literal one and any other spelling is a pseudo-duplicate; let me know if it should be different):
d = {k.lower(): [[], []] for k in a}  # dict keys keep first-seen order (Python 3.7+)
for i in range(len(a)):
    if a[i] in d:
        d[a[i]][0].append(i)
    else:
        d[a[i].lower()][1].append(i)
result = list(d.values())
Output:
>>> print(result)
[[[0, 1], [3]], [[2], [4, 5]]]
Here's how I would achieve it. But you should consider using a dictionary rather than a list of lists of lists. Dictionaries are excellent data structures for problems like this.
# default argument vars
a = ['bar','bar','foo','Bar','Foo','Foo']
# initialize a dictionary keyed by the distinct strings in a
a_dict = {}
for i in a:
    a_dict[i] = None
# loop through the keys in the dictionary, which are the values from a
# loop through the items of list a
# if the item is an exact match to the key, add its index to the list of exacts
# if the item is a similar match to the key, add its index to the list of similars
# update the dictionary key's value
for k, v in a_dict.items():
    index_exact = []
    index_similar = []
    for i in range(len(a)):
        print(a[i])
        print(a[i] == k)
        if a[i] == str(k):
            index_exact.append(i)
        elif a[i].lower() == str(k):
            index_similar.append(i)
    a_dict[k] = [index_exact, index_similar]
# print out the dictionary items to check the answer
print(a_dict.items())
# collect the values from the dictionary into their own list
dup_list = []
for v in a_dict.values():
    dup_list.append(v)
print(dup_list)
Here is a solution. It handles the cases where only pseudo-duplicates or only literal duplicates are present.
a = ['bar', 'bar', 'foo', 'Bar', 'Foo', 'Foo', 'ka']
# Dictionaries to store the positions of words
literal_duplicates = dict()
pseudo_duplicates = dict()
for index, item in enumerate(a):
    # Treat words as literal duplicates if they are all lowercase
    if item.islower():
        if item in literal_duplicates:
            literal_duplicates[item].append(index)
        else:
            literal_duplicates[item] = [index]
        # Handle the case where only literal duplicates are present
        if item not in pseudo_duplicates:
            pseudo_duplicates[item] = []
    # Treat words as pseudo-duplicates if they are not all lowercase
    else:
        item_lower = item.lower()
        if item_lower in pseudo_duplicates:
            pseudo_duplicates[item_lower].append(index)
        else:
            pseudo_duplicates[item_lower] = [index]
        # Handle the case where only pseudo-duplicates are present
        # (check the lowercase key, otherwise existing literal indices get overwritten)
        if item_lower not in literal_duplicates:
            literal_duplicates[item_lower] = []
# Form the final list from the dictionaries
dupe_list = [[v, pseudo_duplicates[k]] for k, v in literal_duplicates.items()]
Here is the simple and easy to understand answer for you
a = ['bar','bar','foo','Bar','Foo','Foo']
dupe_list = []
ilist = []
ilist2 =[]
samecase = -1
dupecase = -1
for i in range(len(a)):
    if a[i] != 'Null':
        ilist = []
        ilist2 = []
        for j in range(i+1,len(a)):
            samecase = -1
            dupecase = -1
            # print(a)
            if i not in ilist:
                ilist.append(i)
            if a[i] == a[j]:
                # print(a[i],a[j])
                samecase = j
                a[j] = 'Null'
            elif a[i] == a[j].casefold():
                # print(a[i],a[j])
                dupecase = j
                a[j] = 'Null'
            # print(samecase)
            # print(ilist,ilist2)
            if samecase != -1:
                ilist.append(samecase)
            if dupecase != -1:
                ilist2.append(dupecase)
        dupe_list.append([ilist,ilist2])
        a[i]='Null'
print(dupe_list)
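For comparison, the same grouping can be done in a single pass without modifying the input list. This is only a sketch under the assumption made in the answers above, namely that the all-lowercase spelling is the literal one and every other spelling is a pseudo-duplicate; group_case_duplicates is a hypothetical helper name, and the output order relies on dicts preserving insertion order (Python 3.7+):

```python
def group_case_duplicates(items):
    """Return [[literal_indices, pseudo_indices], ...], one entry per lowercase key."""
    groups = {}
    for index, item in enumerate(items):
        # One ([literal], [pseudo]) pair per lowercase key, created on first sight.
        literal, pseudo = groups.setdefault(item.lower(), ([], []))
        if item == item.lower():
            literal.append(index)
        else:
            pseudo.append(index)
    return [[literal, pseudo] for literal, pseudo in groups.values()]


a = ['bar', 'bar', 'foo', 'Bar', 'Foo', 'Foo']
print(group_case_duplicates(a))  # [[[0, 1], [3]], [[2], [4, 5]]]
```

A word that only ever appears capitalized ends up with an empty literal list, for example ['Ka'] gives [[[], [0]]].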

How to map two filtered lists with the operator 'and' and/or 'or'

I have following two lists:
advanced_filtered_list_val1 = [row for row in cleaned_list if float(row[val1]) < wert1]
advanced_filtered_list_val2 = [row for row in cleaned_list if float(row[val2]) < wert2]
How can I combine the two filtered lists into one list with the options and and/or or?
The lists contain dictionaries, and I search and filter rows in these lists. I want to filter on two values, which works fine. But how can I now combine the two filters into one list?
I tried following things:
select = int(input())
#and operation
if select == 1:
    mapped_list = [row for row in advanced_filtered_list_val1 and advanced_filtered_list_val2]
    for x in mapped_list:
        print(x)
#or operation
if select == 2:
    mapped_list = [row for row in advanced_filtered_list_val1 or advanced_filtered_list_val2]
    for x in mapped_list:
        print(x)
I import the data as follows:
faelle = [{k: v for k, v in row.items()}
for row in csv.DictReader(csvfile, delimiter=";")]
I now want to filter on wert1 and wert2, or on wert1 or wert2. That means that with the and clause both filters must be true, and with the or clause at least one of the two must be true.
You want to keep the dictionaries in cleaned_list that satisfy either both wert conditions (AND) or at least one of them (OR). What you can do is
import operator as op
ineq_1 = 'gt'
ineq_2 = 'lt'
select = 2
andor = {
    1: lambda L: filter(
        lambda d: getattr(op, ineq_1)(float(d[val1]), wert1)
        and getattr(op, ineq_2)(float(d[val2]), wert2),
        L
    ),
    2: lambda L: filter(
        lambda d: getattr(op, ineq_1)(float(d[val1]), wert1)
        or getattr(op, ineq_2)(float(d[val2]), wert2),
        L
    ),
}
mapped_list = andor[select](cleaned_list)
for x in mapped_list:
    print(dict(x))
The possible choices are gt (greater than), lt (less than), or eq (equal).
Note that you can make things a little more "dynamic" by also using the functions and_ and or_ of the standard-library module operator. For example:
# The two ix2-like dictionaries below map names one knows
# to the corresponding function names of the module operator.
ix2conj = {
    1: 'and_',
    2: 'or_',
}
ix2ineq = {
    '<': 'lt',
    '==': 'eq',
    '>': 'gt',
}
def my_filter(conjunction, inequality1, inequality2, my_cleaned_list):
    return filter(
        lambda d: getattr(op, ix2conj[conjunction])(
            getattr(op, ix2ineq[inequality1])(float(d[val1]), wert1),
            getattr(op, ix2ineq[inequality2])(float(d[val2]), wert2)
        ),
        my_cleaned_list
    )
ineq_1 = '>'
ineq_2 = '<'
select = 2
# filter() is lazy in Python 3, so materialize it before printing
print(list(my_filter(select, ineq_1, ineq_2, cleaned_list)))
I see where you're coming from with that syntax, but that's not what the "and" and "or" keywords in Python do at all: a and b evaluates to one of its two operands rather than combining the lists. What you are looking for is essentially set intersection and union; since dictionaries are not hashable and cannot be stored in a set, you could do something like
# note that this is already the "or" one
both = list1 + [x for x in list2 if x not in list1]
# for "and"
mapped_list = [x for x in both if x in list1 and x in list2]
That is if you want the resulting lists to have only unique values; otherwise you could just do the same with
both = list1 + list2
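Another option is to test both conditions per row and let the built-ins all (for AND) or any (for OR) decide. The sketch below is only an illustration: the sample rows, the column names 'a' and 'b', the thresholds, and the helper filter_rows are all made up, standing in for cleaned_list, val1, val2, wert1 and wert2 from the question:

```python
# Hypothetical sample data; csv.DictReader yields strings, hence the float() calls.
cleaned_list = [
    {'a': '1.0', 'b': '9.0'},
    {'a': '5.0', 'b': '2.0'},
    {'a': '1.5', 'b': '1.0'},
    {'a': '7.0', 'b': '8.0'},
]
val1, val2 = 'a', 'b'
wert1, wert2 = 2.0, 3.0


def filter_rows(rows, combine):
    """Keep rows for which combine() of the two threshold checks is true.

    Pass the built-in all for AND, or any for OR.
    """
    return [
        row for row in rows
        if combine((float(row[val1]) < wert1, float(row[val2]) < wert2))
    ]


and_rows = filter_rows(cleaned_list, all)  # both conditions hold
or_rows = filter_rows(cleaned_list, any)   # at least one condition holds

print(and_rows)  # [{'a': '1.5', 'b': '1.0'}]
print(len(or_rows))  # 3
```

Because each row is tested once against the original list, there is no need to build the two intermediate filtered lists or to compare dictionaries with each other.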

List of dicts: Getting list of matching dictionary based on id

I'm trying to get the matching IDs and store the data into one list. I have a list of dictionaries:
list = [
{'id':'123','name':'Jason','location': 'McHale'},
{'id':'432','name':'Tom','location': 'Sydney'},
{'id':'123','name':'Jason','location':'Tompson Hall'}
]
Expected output would be something like
# {'id':'123','name':'Jason','location': ['McHale', 'Tompson Hall']},
# {'id':'432','name':'Tom','location': 'Sydney'},
How can I get matching data based on dict ID value? I've tried:
for item in mylist:
    list2 = []
    row = any(list['id'] == list.id for id in list)
    list2.append(row)
This doesn't work (it throws: TypeError: tuple indices must be integers or slices, not str). How can I get all items with the same ID and store into one dict?
First, you're iterating through the list of dictionaries in your for loop, but never referencing the current dictionary, which is stored in item. I think when you wrote list['id'] you meant item['id'].
Second, any() returns a boolean (True or False), which isn't what you want. Instead, maybe try row = [dic for dic in list if dic['id'] == item['id']]
Third, if you define list2 within your for loop, it will go away every iteration. Move list2 = [] before the for loop.
That should give you a good start. Remember that row is just a list of all dictionaries that have the same id.
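Put together, the three points above could look like the following sketch. It uses the data from the question, stored under the name mylist rather than list, and keeps the names item, row and list2 from the original attempt:

```python
mylist = [
    {'id': '123', 'name': 'Jason', 'location': 'McHale'},
    {'id': '432', 'name': 'Tom', 'location': 'Sydney'},
    {'id': '123', 'name': 'Jason', 'location': 'Tompson Hall'},
]

list2 = []  # created once, before the loop, so it survives every iteration
for item in mylist:
    # every dictionary that shares the current item's id (including item itself)
    row = [dic for dic in mylist if dic['id'] == item['id']]
    list2.append(row)

print([len(row) for row in list2])  # [2, 1, 2]
```

Note that ids occurring more than once produce the same group more than once, so some deduplication or merging step is still needed afterwards.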
I would use kdopen's approach along with a merging method after converting the dictionary entries I expect to become lists into lists. Of course if you want to avoid redundancy then make them sets.
mylist = [
{'id':'123','name':['Jason'],'location': ['McHale']},
{'id':'432','name':['Tom'],'location': ['Sydney']},
{'id':'123','name':['Jason'],'location':['Tompson Hall']}
]
def merge(mylist,ID):
    matches = [d for d in mylist if d['id']== ID]
    shell = {'id':ID,'name':[],'location':[]}
    for m in matches:
        shell['name']+=m['name']
        shell['location']+=m['location']
        mylist.remove(m)
    mylist.append(shell)
    return mylist
updated_list = merge(mylist,'123')
Given this input
mylist = [
{'id':'123','name':'Jason','location': 'McHale'},
{'id':'432','name':'Tom','location': 'Sydney'},
{'id':'123','name':'Jason','location':'Tompson Hall'}
]
You can just extract it with a comprehension
matched = [d for d in mylist if d['id'] == '123']
Then you want to merge the locations. Assuming matched is not empty
final = matched[0]
final['location'] = [d['location'] for d in matched]
Here it is in the interpreter
In [1]: mylist = [
...: {'id':'123','name':'Jason','location': 'McHale'},
...: {'id':'432','name':'Tom','location': 'Sydney'},
...: {'id':'123','name':'Jason','location':'Tompson Hall'}
...: ]
In [2]: matched = [d for d in mylist if d['id'] == '123']
In [3]: final=matched[0]
In [4]: final['location'] = [d['location'] for d in matched]
In [5]: final
Out[5]: {'id': '123', 'location': ['McHale', 'Tompson Hall'], 'name': 'Jason'}
Obviously, you'd want to replace '123' with a variable holding the desired id value.
Wrapping it all up in a function:
def merge_all(df):
    ids = {d['id'] for d in df}
    result = []
    for id in ids:
        matches = [d for d in df if d['id'] == id]
        combined = matches[0]
        combined['location'] = [d['location'] for d in matches]
        result.append(combined)
    return result
Also, please don't use list as a variable name. It shadows the builtin list class.
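If rescanning the list for every id or modifying the input dictionaries is a concern, the merge can also be done in one pass. This is a sketch rather than a drop-in replacement: merge_by_id is a hypothetical helper, and it always stores locations as a list, so a single location becomes a one-element list instead of staying a plain string as in the expected output of the question:

```python
def merge_by_id(records):
    """Merge dictionaries sharing an 'id', collecting their locations in a list."""
    merged = {}
    for record in records:
        # The first record seen for an id supplies the other fields;
        # the copy keeps the input dictionaries untouched.
        entry = merged.setdefault(record['id'], {**record, 'location': []})
        entry['location'].append(record['location'])
    return list(merged.values())


mylist = [
    {'id': '123', 'name': 'Jason', 'location': 'McHale'},
    {'id': '432', 'name': 'Tom', 'location': 'Sydney'},
    {'id': '123', 'name': 'Jason', 'location': 'Tompson Hall'},
]
merged = merge_by_id(mylist)
for entry in merged:
    print(entry)
```

The result keeps the order in which each id first appears, which a set of ids does not guarantee.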

Merge nested list items based on a repeating value

Although poorly written, this code:
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]
marker_array_DS = []
for i in range(len(marker_array)):
    if marker_array[i-1][1] != marker_array[i][1]:
        marker_array_DS.append(marker_array[i])
print(marker_array_DS)
Returns:
[['hard', '2', 'soft'], ['fast', '3'], ['turtle', '4', 'wet']]
It accomplishes part of the task which is to create a new list containing all nested lists except those that have duplicate values in index [1]. But what I really need is to concatenate the matching index values from the removed lists creating a list like this:
[['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]
The values in index [1] must not be concatenated. I kind of managed to do the concatenation part using a tip from another post:
newlist = [i + n for i, n in zip(list_a, list_b)]
But I am struggling with figuring out the way to produce the desired result. The "marker_array" list will be already sorted in ascending order before being passed to this code. All like-values in index [1] position will be contiguous. Some nested lists may not have any values beyond [0] and [1] as illustrated above.
Quick stab at it... use itertools.groupby to do the grouping for you, but do it over a generator that pads each 2-element list to 3 elements.
from itertools import groupby
from operator import itemgetter
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]
def my_group(iterable):
    # pad with an empty string so every entry has exactly 3 elements
    temp = ((el + [''])[:3] for el in iterable)
    for k, g in groupby(temp, key=itemgetter(1)):
        fst, snd = map(' '.join, zip(*map(itemgetter(0, 2), g)))
        # drop the empty third element, if any
        yield [x for x in (fst, k, snd) if x]
print(list(my_group(marker_array)))
from collections import defaultdict
d1 = defaultdict(list)
d2 = defaultdict(list)
for pxa in marker_array:
    d1[pxa[1]].extend(pxa[:1])
    d2[pxa[1]].extend(pxa[2:])
res = [[' '.join(d1[x]), x, ' '.join(d2[x])] for x in sorted(d1)]
If you really need 2-element entries where there is no third value (which I think is unlikely):
for p in res:
    if not p[-1]:
        p.pop()
marker_array = [['hard','2','soft'],['heavy','2','light'],['rock','2','feather'],['fast','3'], ['turtle','4','wet']]
marker_array_DS = []
marker_array_hit = []
for i in range(len(marker_array)):
    if marker_array[i][1] not in marker_array_hit:
        marker_array_hit.append(marker_array[i][1])
for i in marker_array_hit:
    lists = [item for item in marker_array if item[1] == i]
    temp = []
    first_part = ' '.join([str(item[0]) for item in lists])
    temp.append(first_part)
    temp.append(i)
    second_part = ' '.join([str(item[2]) for item in lists if len(item) > 2])
    if second_part != '':
        temp.append(second_part)
    marker_array_DS.append(temp)
print(marker_array_DS)
I learned python for this because I'm a shameless rep whore
marker_array = [
    ['hard','2','soft'],
    ['heavy','2','light'],
    ['rock','2','feather'],
    ['fast','3'],
    ['turtle','4','wet'],
]
data = {}
for arr in marker_array:
    if len(arr) == 2:
        arr.append('')
    (first, index, last) = arr
    firsts, lasts = data.setdefault(index, [[],[]])
    firsts.append(first)
    lasts.append(last)
results = []
for key in sorted(data.keys()):
    current = [
        " ".join(data[key][0]),
        key,
        " ".join(data[key][1])
    ]
    if current[-1] == '':
        current = current[:-1]
    results.append(current)
print(results)
--output:--
[['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]
A different solution based on itertools.groupby:
from itertools import groupby
# normalizes the list of markers so all markers have 3 elements
def normalized(markers):
    for marker in markers:
        yield marker + [""] * (3 - len(marker))
def concatenated(markers):
    # use groupby to iterate over lists of markers sharing the same key
    for key, markers_in_category in groupby(normalized(markers), lambda m: m[1]):
        # get separate lists of left and right words
        lefts, rights = zip(*[(m[0], m[2]) for m in markers_in_category])
        # remove empty strings from both lists
        lefts, rights = [w for w in lefts if w], [w for w in rights if w]
        # yield the concatenated entry for this key (also removing the empty string at the end, if necessary)
        yield [part for part in (" ".join(lefts), key, " ".join(rights)) if part]
The generator concatenated(markers) will yield the results. This code correctly handles the ['fast', '3'] case and doesn't return an additional third element in such cases.
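Both groupby answers rely on the guarantee from the question that equal values at index [1] are contiguous, because groupby only merges adjacent items with the same key. A small sketch of that behaviour; merge_markers is a hypothetical helper written for this illustration, not code from any of the answers:

```python
from itertools import groupby


def merge_markers(markers):
    """Concatenate the words of adjacent markers that share the value at index 1."""
    merged = []
    for key, group in groupby(markers, key=lambda m: m[1]):
        group = list(group)
        left = ' '.join(m[0] for m in group)
        # Markers with only two elements contribute nothing to the right-hand side.
        right = ' '.join(m[2] for m in group if len(m) > 2)
        merged.append([left, key, right] if right else [left, key])
    return merged


marker_array = [['hard', '2', 'soft'], ['heavy', '2', 'light'], ['rock', '2', 'feather'],
                ['fast', '3'], ['turtle', '4', 'wet']]
print(merge_markers(marker_array))
# [['hard heavy rock', '2', 'soft light feather'], ['fast', '3'], ['turtle', '4', 'wet']]

# With non-contiguous keys the two '1' entries are NOT merged:
print(merge_markers([['a', '1'], ['b', '2'], ['c', '1']]))
# [['a', '1'], ['b', '2'], ['c', '1']]
```

If the input might not be sorted, sorting it by the value at index 1 first, or using one of the dictionary-based answers, avoids this pitfall.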

Removing values from a list in python

I have a large file of names and values on a single line separated by a space:
name1 name2 name3....
Following the long list of names is a list of values corresponding to the names. The values can be 0-4 or na. What I want to do is consolidate the data file and remove every name and value where the value is na.
For instance, the final line of name in this file is like so:
namenexttolast nameonemore namethelast 0 na 2
I would like the following output:
namenexttolast namethelast 0 2
How would I do this using Python?
Let's say you read the names into one list, then the values into another. Once you have a names and values list, you can do something like:
result = [n for n, v in zip(names, values) if v != 'na']
result is now a list of all names whose value is not "na".
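To get from a raw line to the requested output, the same zip-based filter can keep the values as well. A minimal sketch, assuming the line holds exactly as many values as names so that it can be split in the middle; drop_na is a hypothetical helper name:

```python
def drop_na(line):
    """Remove every name whose value is 'na', together with that value."""
    tokens = line.split()
    half = len(tokens) // 2  # assumption: first half names, second half values
    names, values = tokens[:half], tokens[half:]
    kept = [(n, v) for n, v in zip(names, values) if v != 'na']
    return ' '.join([n for n, _ in kept] + [v for _, v in kept])


print(drop_na('namenexttolast nameonemore namethelast 0 na 2'))
# namenexttolast namethelast 0 2
```

Integer division (//) is used for the midpoint so the slice indices stay integers on Python 3 as well.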
s = "name1 name2 name3 v1 na v2"
s = s.split(' ')
names = s[:len(s)//2]
values = s[len(s)//2:]
names_and_values = zip(names, values)
names, values = [], []
[(names.append(n) or values.append(v)) for n, v in names_and_values if v != "na"]
names.extend(values)
print(' '.join(names))
Update
Minor improvement after suggestion from Paul. I'm sure the list comprehension is fairly unpythonic, as it leverages the fact that list.append returns None, so both append expressions will be evaluated and a list of None values will be constructed and immediately thrown away.
I agree with Justin that using zip is a good idea. The problem is how to get the data into two different lists. Here is a proposal that should work OK.
reader = open('input.txt')
writer = open('output.txt', 'w')
names, nums = [], []
# split() without arguments also strips the trailing newline
row = reader.read().split()
x = len(row)//2
for (a, b) in [(n, v) for n, v in zip(row[:x], row[x:]) if v!='na']:
    names.append(a)
    nums.append(b)
writer.write(' '.join(names))
writer.write(' ')
writer.write(' '.join(nums))
#writer.write(' '.join(names+nums)) is nicer but causes the lists to be concatenated
reader.close()
writer.close()
Or say you have a string which you have read from a file. Let's call this string s.
words = filter(lambda x: x!="na", s.split())
should give you all the strings except for "na".
Edit: the code above obviously doesn't do what you want, since it leaves the names of the removed values in place.
The one below should work, though:
d = s.split()
keys = d[:len(d)//2]
vals = d[len(d)//2:]
w = " ".join(k + " " + v for k, v in zip(keys, vals) if v != "na")
print(" ".join([" ".join(w.split()[::2]), " ".join(w.split()[1::2])]))
strlist = 'namenexttolast nameonemore namethelast 0 na 2'.split()
vals = ('0', '1', '2', '3', '4', 'na')
key_list = [s for s in strlist if s not in vals]
val_list = [s for s in strlist if s in vals]
#print([(key_list[i], v) for i, v in enumerate(val_list) if v != 'na'])
filtered_keys = [key_list[i] for i, v in enumerate(val_list) if v != 'na']
filtered_vals = [v for v in val_list if v != 'na']
print(filtered_keys + filtered_vals)
If you'd rather pair each key with its value, you could create a list of tuples instead (the commented-out line).
Here is a solution that uses just iterators plus a single buffer element, with no calls to len and no other intermediate lists created. (This version runs on Python 3, where map, zip and filter are already lazy; on Python 2, use imap, izip and ifilter from itertools instead.)
def iterStartingAt(cond, seq):
    # it1 looks ahead for the first item satisfying cond; it2 trails one step behind
    it1, it2 = iter(seq), iter(seq)
    while not cond(next(it1)):
        next(it2)
    for item in it2:
        yield item
dataline = "namenexttolast nameonemore namethelast 0 na 2"
datalinelist = dataline.split()
valueset = set("0 1 2 3 4 na".split())
print(" ".join(map(" ".join,
                   zip(*filter(lambda nv: nv[1] != 'na',
                               zip(iter(datalinelist),
                                   iterStartingAt(lambda s: s in valueset,
                                                  datalinelist)))))))
Prints:
namenexttolast namethelast 0 2
