Most Pythonic way to Cross-reference two lists - python

Problem:
I have a somewhat complicated cross-referencing task to perform between a long list (~600,000 entries) and a short list (~300,000 entries). I'm trying to find the entries the two lists have in common, where each unique entry is identified by three different integers (call them int1, int2, and int3). Based on the three integer identifiers in one list, I want to see whether those same three integers are in the other list, and return which ones they are.
Attempt:
First I zipped each three-integer tuple in the long list into an array called a. Similarly, I zipped each three-integer tuple in the short list into an array called b:
a = [(int1,int2,int3),...] # 600,000 entries
b = [(int1,int2,int3),...] # 300,000 entries
Then I iterated through each entry in a to see if it was in b. If it was, I appended the corresponding tuples to an array outside the loop called c:
c = []
for i in range(0, len(a), 1):
    if a[i] in b:
        c.append(a[i])
The iteration is (not surprisingly) very slow. I'm guessing Python has to check b for a[i] at each iteration (~300,000 checks!), and it's iterating 600,000 times. It has taken over an hour now and still hasn't finished, so I know I should be optimizing something.
My question is: what is the most Pythonic or fastest way to perform this cross-referencing?

You can use sets:
c = set(b).intersection(a)
I chose to convert b to a set as it is the shorter of the two lists. Using intersection() does not require that list a first be converted to a set.
You can also do this:
c = set(a) & set(b)
however, this requires both lists to be converted to sets first.
Either way you have an O(n) operation; see the Python wiki's Time Complexity page.
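For example, a minimal sketch with toy data (the real lists would hold the ~600,000 and ~300,000 three-integer tuples):
a = [(1, 2, 3), (4, 5, 6), (4, 5, 8), (1, 2, 8)]
b = [(1, 2, 3), (0, 3, 7), (4, 5, 8)]
c = set(b).intersection(a)  # hash the shorter list once, then test each tuple of a
print(c)                    # {(1, 2, 3), (4, 5, 8)} (set order may vary)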

Pandas solution:
import pandas as pd

a = [(1,2,3),(4,5,6),(4,5,8),(1,2,8)]
b = [(1,2,3),(0,3,7),(4,5,8)]
df1 = pd.DataFrame(a)
print (df1)
   0  1  2
0  1  2  3
1  4  5  6
2  4  5  8
3  1  2  8
df2 = pd.DataFrame(b)
print (df2)
   0  1  2
0  1  2  3
1  0  3  7
2  4  5  8
df = pd.merge(df1, df2)
print (df)
   0  1  2
0  1  2  3
1  4  5  8
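If you need the result back as a list of tuples rather than a DataFrame, one option (a small sketch, not part of the original answer) is itertuples:
c = list(df.itertuples(index=False, name=None))
print (c)
[(1, 2, 3), (4, 5, 8)]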
Pure python solution with sets:
c = list(set(b).intersection(set(a)))
print (c)
[(4, 5, 8), (1, 2, 3)]

Another interesting way to do it:
from itertools import compress
list(compress(b, map(lambda x: x in a, b)))
And another one:
filter(lambda x: x in a, b)
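Note that both variants above test membership against the list a, so each lookup is O(len(a)). A small tweak (not from the original answer) is to convert a to a set first, which keeps the same style but makes each lookup O(1) on average:
from itertools import compress

a_set = set(a)
list(compress(b, map(lambda x: x in a_set, b)))
list(filter(lambda x: x in a_set, b))  # note: filter returns an iterator in Python 3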

Related

Pandas: best way to remove rows where columns match any set of values in a list of tuples?

I have two columns, A and B. I also have a list of tuples. I want to remove any row whose (A, B) pair matches any of the tuples in the list. For example:
Input:
A  B
A  1
A  4
B  2
A  3
[(A,1),(C,4),(A,3)]
Output:
A  B
A  4
B  2
You can use zip + list comprehension:
tuples = [('A', 1), ('C', 4), ('A', 3)]
new_df = df[[x not in tuples for x in zip(df['A'], df['B'])]]
Output:
>>> new_df
A B
1 A 4
2 B 2
Use zip + a pandas Series to do this without a for loop (should be faster).
Note: Based upon How to filter a pandas DataFrame according to a list of tuples
tuples = [('A',1),('C',4),('A',3)]
new_df = df[~pd.Series(list(zip(df['A'], df['B']))).isin(tuples)] # no for loop
>>> new_df
A B
1 A 4
2 B 2
I think the best you can do here is to throw all the "blacklisted" tuples into a set (i.e. hash them) and perform a membership test on each row in your list. The membership test will take constant time & the overall time complexity of this algorithm will be O(n + m), with n being the number of items in your list and m being the number of items in your blacklist.
def solve(arr, blacklist):
    S = set(blacklist)              # hash the blacklist once
    result = [None] * len(arr)
    idx = 0
    for i in range(len(arr)):
        if arr[i] not in S:         # O(1) average membership test
            result[idx] = arr[i]
            idx += 1
    return result[:idx]
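For example, with the sample data from this question (a quick sketch to show the expected result):
rows = [('A', 1), ('A', 4), ('B', 2), ('A', 3)]
tuples = [('A', 1), ('C', 4), ('A', 3)]
print(solve(rows, tuples))   # [('A', 4), ('B', 2)]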
A "pure" pandas solution (whatever that means):
df[~df.set_index(['A','B']).index.isin(tuples)]
output
A B
1 A 4
2 B 2

Python Pandas. Delete cells whose value is contained in another cell in the same column

I have a dataframe like this:
A B
exa 3
example 6
exam 4
hello 4
hell 3
I want to delete the rows whose A value is a substring of another row's A value and keep the longest one (notice that B is already the length of A).
I want my table to look like this:
A B
example 6
hello 4
I thought about the following boolean filter but it does not work :(
df['Check'] = df.apply(lambda row: df.count(row['A'] in row['A'])>1, axis=1)
This is non-trivial, but we can take advantage of B to sort the data and compare each value only against the longer strings that follow it, for a solution slightly better than O(N^2).
df = df.sort_values('B')
v = df['A'].tolist()
df[[not any(b.startswith(a) for b in v[i + 1:]) for i, a in enumerate(v)]].sort_index()
A B
1 example 6
3 hello 4
Like the solution cold provided, mine is O(m*n) as well (in your case m = n):
df[np.sum(np.array([[y in x for x in df.A.values] for y in df.A.values]),1)==1]
Out[30]:
A B
1 example 6
3 hello 4
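The same idea spelled out step by step (a sketch, with the DataFrame rebuilt so it runs on its own): build an m x n boolean matrix where entry (i, j) answers "is string i contained in string j?", then keep only the strings that are contained in nothing but themselves.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['exa', 'example', 'exam', 'hello', 'hell'],
                   'B': [3, 6, 4, 4, 3]})

strings = df.A.values
contained_in = np.array([[y in x for x in strings] for y in strings])
keep = contained_in.sum(axis=1) == 1   # a string that only matches itself
print (df[keep])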

Mapping rows of a Pandas dataframe to numpy array

Sorry, I know there are so many questions relating to indexing, and it's probably staring me in the face, but I'm having a little trouble with this. I am familiar with the .loc, .iloc, and .index methods and with slicing in general. The method .reset_index may not have been (and may not be able to be) called on our dataframe, and therefore the index labels may not be in order. The dataframe and numpy array(s) are actually different-length subsets of the dataframe, but for this example I'll keep them the same size (I can handle offsetting once I have an example).
Here is a picture that shows what I'm looking for.
I can pull columns of certain rows from the dataframe based on some search criteria:
idxlbls = df.index[df['timestamp'] == dt]
stuff = df.loc[idxlbls, 'col3':'col5']
But how do I map those to row numbers (array positions, not index labels) to be used as array indices in numpy (assuming the same row length)?
stuffprime = array[?, ?]
The reason I need it is because the dataframe is much larger and more complete and contains the column searching criteria, but the numpy arrays are subsets that have been extracted and modified prior in the pipeline (and do not have the same searching criteria in them). I need to search the dataframe and pull the equivalent data from the numpy arrays. Basically I need to correlate specific rows from a dataframe to the corresponding rows of a numpy array.
I would map pandas indices to numpy indices:
keys_dict = dict(zip(idxlbls, range(len(idxlbls))))
Then you may use the dictionary keys_dict to address the array elements by a pandas index: array[keys_dict[some_df_index], :]
I believe you need get_indexer for positions of the filtered column names; for the index you can use the same approach, or numpy.where for positions from a boolean mask:
df = pd.DataFrame({'timestamp': list('abadef'),
                   'B': [4,5,4,5,5,4],
                   'C': [7,8,9,4,2,3],
                   'D': [1,3,5,7,1,0],
                   'E': [5,3,6,9,2,4]}, index=list('ABCDEF'))
print (df)
timestamp B C D E
A a 4 7 1 5
B b 5 8 3 3
C a 4 9 5 6
D d 5 4 7 9
E e 5 2 1 2
F f 4 3 0 4
idxlbls = df.index[df['timestamp'] == 'a']
stuff = df.loc[idxlbls, 'C':'E']
print (stuff)
C D E
A 7 1 5
C 9 5 6
a = df.index.get_indexer(stuff.index)
Or get positions by boolean mask:
a = np.where(df['timestamp'] == 'a')[0]
print (a)
[0 2]
b = df.columns.get_indexer(stuff.columns)
print (b)
[2 3 4]
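Those position arrays can then be used to index a plain NumPy array of the same shape, for example with numpy.ix_ (a sketch; here df.to_numpy() stands in for the separately built array):
arr = df.to_numpy()               # stand-in for the numpy array from the pipeline
stuffprime = arr[np.ix_(a, b)]    # rows [0, 2], columns [2, 3, 4]
print (stuffprime)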

Add column of empty lists to DataFrame

Similar to this question How to add an empty column to a dataframe?, I am interested in knowing the best way to add a column of empty lists to a DataFrame.
What I am trying to do is basically initialize a column and as I iterate over the rows to process some of them, then add a filled list in this new column to replace the initialized value.
For example, if below is my initial DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]}) # Sample DataFrame
>>> df
a b
0 1 5
1 2 6
2 3 7
Then I want to ultimately end up with something like this, where each row has been processed separately (sample results shown):
>>> df
a b c
0 1 5 [5, 6]
1 2 6 [9, 0]
2 3 7 [1, 2, 3]
Of course, if I try to initialize like df['e'] = [] as I would with any other constant, it thinks I am trying to add a sequence of items with length 0, and hence fails.
If I try initializing a new column as None or NaN, I run in to the following issues when trying to assign a list to a location.
df['d'] = None
>>> df
a b d
0 1 5 None
1 2 6 None
2 3 7 None
Issue 1 (it would be perfect if I can get this approach to work! Maybe something trivial I am missing):
>>> df.loc[0,'d'] = [1,3]
...
ValueError: Must have equal len keys and value when setting with an iterable
Issue 2 (this one works, but not without a warning because it is not guaranteed to work as intended):
>>> df['d'][0] = [1,3]
C:\Python27\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Hence I resort to initializing with empty lists and extending them as needed. There are a couple of methods I can think of to initialize this way, but is there a more straightforward way?
Method 1:
df['empty_lists1'] = [list() for x in range(len(df.index))]
>>> df
a b empty_lists1
0 1 5 []
1 2 6 []
2 3 7 []
Method 2:
df['empty_lists2'] = df.apply(lambda x: [], axis=1)
>>> df
a b empty_lists1 empty_lists2
0 1 5 [] []
1 2 6 [] []
2 3 7 [] []
Summary of questions:
Is there any minor syntax change that can be addressed in Issue 1 that can allow a list to be assigned to a None/NaN initialized field?
If not, then what is the best way to initialize a new column with empty lists?
One more way is to use np.empty:
import numpy as np

df['empty_list'] = np.empty((len(df), 0)).tolist()
You could also knock off .index in your "Method 1" when trying to find len of df.
df['empty_list'] = [[] for _ in range(len(df))]
Turns out, np.empty is faster...
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(pd.np.random.rand(1000000, 5))
In [3]: timeit df['empty1'] = pd.np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop
In [4]: timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop
In [5]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop
EDIT: the commenters caught the bug in my answer
s = pd.Series([[]] * 3)
s.iloc[0].append(1) #adding an item only to the first element
>s # unintended consequences:
0 [1]
1 [1]
2 [1]
So, the correct solution is
s = pd.Series([[] for i in range(3)])
s.iloc[0].append(1)
>s
0 [1]
1 []
2 []
OLD:
I timed all three methods in the accepted answer; the fastest one took 216 ms on my machine. However, this took only 28 ms:
df['empty4'] = [[]] * len(df)
Note: Similarly, df['e5'] = [set()] * len(df) also took 28ms.
Canonical solutions: List comprehension, map and apply
Obligatory disclaimer: avoid using lists in pandas columns where possible; list columns are slow to work with because they are stored as objects, which are inherently hard to vectorize.
With that out of the way, here are the canonical methods of introducing a column of empty lists:
# List comprehension
df['c'] = [[] for _ in range(df.shape[0])]
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
There's also these shorthands involving apply and map:
from collections import defaultdict
# map any column with defaultdict
df['c'] = df.iloc[:,0].map(defaultdict(list))
# same as,
df['c'] = df.iloc[:,0].map(lambda _: [])
# apply with defaultdict
df['c'] = df.apply(defaultdict(list), axis=1)
# same as,
df['c'] = df.apply(lambda _: [], axis=1)
df
a b c
0 1 5 []
1 2 6 []
2 3 7 []
Things you should NOT do
Some folks believe multiplying an empty list is the way to go; unfortunately this is wrong and will usually lead to hard-to-debug issues. Here's a minimal example:
# WRONG
df['c'] = [[]] * len(df)
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc, def]
1 2 6 [abc, def]
2 3 7 [abc, def]
# RIGHT
df['c'] = [[] for _ in range(df.shape[0])]
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df
a b c
0 1 5 [abc]
1 2 6 [def]
2 3 7 []
In the first case, a single empty list is created and its reference is replicated across all the rows, so you see updates to one reflected to all of them. In the latter case each row is assigned its own empty list, so this is not a concern.
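A quick check of that reference sharing (not from the original answer), using the is operator:
row_lists = [[]] * 3
print(row_lists[0] is row_lists[1])   # True  - one list object, three references
row_lists = [[] for _ in range(3)]
print(row_lists[0] is row_lists[1])   # False - three independent lists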

Fast python algorithm (in numpy or pandas?) to find indices of array elements that match elements in another array

I am looking for a fast method to determine the cross-matching indices of two arrays, defined as follows.
I have two very large (>1e7 elements) structured arrays, one called members, and another called groups. Both arrays have a groupID column. The groupID entries of the groups array are unique, the groupID entries of the members array are not.
The groups array has a column called mass. The members array has a (currently empty) column called groupmass. I want to assign the correct groupmass to those elements of members with a groupID that matches one of the groups. This would be accomplished via:
members['groupmass'][idx_matched_members] = groups['mass'][idx_matched_groups]
So what I need is a fast routine to compute the two index arrays idx_matched_members and idx_matched_groups. This sort of task seems so common that it seems very likely that a package like numpy or pandas would have an optimized solution. Does anyone know of a solution, professionally developed, homebrewed, or otherwise?
This can be done with pandas, using map to map the data from one column onto another. Here's an example with sample data:
import numpy as np
import pandas

members = pandas.DataFrame({
    'id': np.arange(10),
    'groupID': np.arange(10) % 3,
    'groupmass': np.zeros(10)
})
groups = pandas.DataFrame({
    'groupID': np.arange(3),
    'mass': np.random.randint(1, 10, 3)
})
This gives you this data:
>>> members
groupID groupmass id
0 0 0 0
1 1 0 1
2 2 0 2
3 0 0 3
4 1 0 4
5 2 0 5
6 0 0 6
7 1 0 7
8 2 0 8
9 0 0 9
>>> groups
groupID mass
0 0 3
1 1 7
2 2 4
Then:
>>> members['groupmass'] = members.groupID.map(groups.set_index('groupID').mass)
>>> members
groupID groupmass id
0 0 3 0
1 1 7 1
2 2 4 2
3 0 3 3
4 1 7 4
5 2 4 5
6 0 3 6
7 1 7 7
8 2 4 8
9 0 3 9
If you will often want to use the groupID as the index into groups, you can set it that way permanently so you won't have to use set_index every time you do this.
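For example (a sketch of that variant; groups_by_id is just a name introduced here):
groups_by_id = groups.set_index('groupID')
members['groupmass'] = members['groupID'].map(groups_by_id['mass'])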
Here's an example of setting the mass with just numpy. It does use iteration, so for large arrays it won't be fast.
For just 10 rows, this is much faster than the pandas equivalent. But as the data set becomes larger (e.g. M = 10000), pandas is much better. The setup time for pandas is larger, but its per-row iteration time is much lower.
Generate test arrays:
dt_members = np.dtype({'names':['groupID','groupmass'], 'formats': [int, float]})
dt_groups = np.dtype({'names':['groupID', 'mass'], 'formats': [int, float]})
N, M = 5, 10
members = np.zeros((M,), dtype=dt_members)
groups = np.zeros((N,), dtype=dt_groups)
members['groupID'] = np.random.randint(101, 101+N, M)
groups['groupID'] = np.arange(101, 101+N)
groups['mass'] = np.arange(1,N+1)
def getgroup(id):
    idx = id == groups['groupID']
    return groups[idx]
members['groupmass'][:] = [getgroup(id)['mass'] for id in members['groupID']]
In python2 the iteration could use map:
members['groupmass'] = map(lambda x: getgroup(x)['mass'], members['groupID'])
I can improve the speed by about 2x by minimizing the repeated subscripting, e.g.:
def setmass(members, groups):
    gmass = groups['mass']
    gid = groups['groupID']
    mass = [gmass[id == gid] for id in members['groupID']]
    members['groupmass'][:] = mass
But if groups['groupID'] can be mapped onto arange(N), then we can get a big jump in speed. By applying the same mapping to members['groupID'], it becomes a simple array indexing problem.
In my sample arrays, groups['groupID'] is just arange(N)+101. So the mapping just subtracts that minimum.
def setmass1(members, groups):
    members['groupmass'][:] = groups['mass'][members['groupID'] - groups['groupID'].min()]
This is 300x faster than my earlier code, and 8x better than the pandas solution (for 10000,500 arrays).
I suspect pandas does something like this. pgroups.set_index('groupID').mass is the mass Series, with an added .index attribute. (I could test this with a more general array)
In a more general case, it might help to sort groups, and if necessary, fill in some indexing gaps.
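One way to handle that more general case (a sketch not in the original answer, using numpy.searchsorted and assuming every member groupID actually appears in groups):
def setmass_sorted(members, groups):
    # sort groups by groupID once, then locate each member's group by binary search
    order = np.argsort(groups['groupID'])
    sorted_ids = groups['groupID'][order]
    pos = np.searchsorted(sorted_ids, members['groupID'])
    members['groupmass'][:] = groups['mass'][order][pos]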
Here's a 'vectorized' solution - no iteration. But it has to calculate a very large matrix (length of groups by length of members), so does not gain much speed (np.where is the slowest step).
def setmass2(members, groups):
    idx = np.where(members['groupID'] == groups['groupID'][:, None])
    members['groupmass'][idx[1]] = groups['mass'][idx[0]]
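A quick sanity check of both versions (not from the original answer), run against the small test arrays generated above:
setmass1(members, groups)
print (members['groupmass'])

members['groupmass'][:] = 0       # reset, then try the vectorized version
setmass2(members, groups)
print (members['groupmass'])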
