Split a table (or list) based on column values - python

I have an array:
arr = [['a', 'b', 'a'], [1, 2, 3]]
I need this to be split based on the values of the first sub-list, i.e. based on 'a' or 'b'. So the expected output is
arr_out_a = [1, 3]
arr_out_b = [2]
How do I do it?
Please help me correct the question if my use of words like "list" and "array" creates confusion.

Use collections.defaultdict():
In [82]: arr = [['a', 'b', 'a'], [1, 2, 3]]
In [83]: from collections import defaultdict
In [84]: d = defaultdict(list)
In [85]: for i, j in zip(*arr):
   ....:     d[i].append(j)
   ....:
In [86]: d
Out[86]: defaultdict(<class 'list'>, {'b': [2], 'a': [1, 3]})
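If you specifically want the two named lists from the question, they fall out of the dict directly (a minimal follow-up sketch using the names from the question):
arr_out_a = d['a']  # [1, 3]
arr_out_b = d['b']  # [2]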

Basically just append them conditionally to predefined empty lists:
arr_out_a = []
arr_out_b = []
for char, num in zip(*arr):
    if char == 'a':
        arr_out_a.append(num)
    else:
        arr_out_b.append(num)
or if you don't like the zip:
arr_out_a = []
arr_out_b = []
for idx in range(len(arr[0])):
    if arr[0][idx] == 'a':
        arr_out_a.append(arr[1][idx])
    else:
        arr_out_b.append(arr[1][idx])
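For more than two distinct keys, a compact variant (my own sketch, not one of the answers above) builds one list per key with a dict comprehension:
keys, values = arr
split = {k: [v for kk, v in zip(keys, values) if kk == k] for k in set(keys)}
# split == {'a': [1, 3], 'b': [2]}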

Related

How to merge multiple lists into one, keeping only the elements that were in all of the initial lists?

I need to merge 5 lists, any of which can be empty, so that only items that were in all 5 initial lists are included in the newly formed list.
for filter in filters:
    if filter == 'M':
        filtered1 = []  # imagine that this is filled
    if filter == 'V':
        filtered2 = []  # imagine that this is filled
    if filter == 'S':
        filtered3 = []  # imagine that this is filled
    if filter == 'O':
        filtered4 = []  # imagine that this is filled
    if filter == 'C':
        filtered5 = []  # imagine that this is filled
filtered = [] # merge all 5 lists from above
So now I need to make a list filtered with merged data from all filtered lists 1-5. How should I do that?
This is the most classical solution:
filtered = filtered1 + filtered2 + filtered3 + filtered4 + filtered5
What happens is that you append one list to another, and so on.
So if filtered1 was ['a', 'b'], filtered3 was ['c', 'd'] and filtered4 was ['e'],
then you would get:
filtered = ['a', 'b', 'c', 'd', 'e']
Given some lists xs1, ..., xs5:
xss = [xs1, xs2, xs3, xs4, xs5]
sets = [set(xs) for xs in xss]
merged = set.intersection(*sets)
This has the property that merged may be in any order.
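One caveat worth sketching (my addition, assuming an empty list means "no filter applied" rather than "matches nothing"): intersecting with an empty set always yields an empty result, so you may want to skip empty lists first:
non_empty = [set(xs) for xs in xss if xs]
merged = set.intersection(*non_empty) if non_empty else set()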
f1, f2, f3, f4, f5 = [1], [], [2, 5], [4, 1], [3]
only_merge = [*f1, *f2, *f3, *f4, *f5]
print("Only merge: ", only_merge)
merge_and_sort = sorted([*f1, *f2, *f3, *f4, *f5])
print("Merge and sort: ", merge_and_sort)
merge_and_unique_and_sort = sorted({*f1, *f2, *f3, *f4, *f5})  # a bare list(set) would not guarantee order
print("Merge, unique and sort: ", merge_and_unique_and_sort)
Output:
Only merge: [1, 2, 5, 4, 1, 3]
Merge and sort: [1, 1, 2, 3, 4, 5]
Merge, unique and sort: [1, 2, 3, 4, 5]

Process a list of lists, finding all lists that have matching last values?

Given a list of lists
lol = [[0,'a'], [0,'b'],
       [1,'b'], [1,'c'],
       [2,'d'], [2,'e'],
       [2,'g'], [2,'b'],
       [3,'e'], [3,'f']]
I would like to extract all sublists that have the same last element (lol[n][1]) and end up with something like below:
[0,b]
[1,b]
[2,b]
[2,e]
[3,e]
I know that given two lists we can use an intersection; what is the right way to go about a problem like this, other than incrementing an index in a for-each loop?
1. Using collections.defaultdict
You can use defaultdict to first group your items by their last element, then iterate over dict.items to keep only the groups with more than one occurrence.
from collections import defaultdict
lol = [[0,'a'], [0,'b'],
       [1,'b'], [1,'c'],
       [2,'d'], [2,'e'],
       [2,'g'], [2,'b'],
       [3,'e'], [3,'f']]

d = defaultdict(list)
for v, k in lol:
    d[k].append(v)
# d looks like -
# defaultdict(list,
#             {'a': [0],
#              'b': [0, 1, 2],
#              'c': [1],
#              'd': [2],
#              'e': [2, 3],
#              'g': [2],
#              'f': [3]})
result = [[v,k] for k,vs in d.items() for v in vs if len(vs)>1]
print(result)
[[0, 'b'], [1, 'b'], [2, 'b'], [2, 'e'], [3, 'e']]
2. Using pandas.duplicated
Here is how you can do this with Pandas -
Convert to pandas dataframe
For key column, find the duplicates and keep all of them
Convert to list of records while ignoring index
import pandas as pd
df = pd.DataFrame(lol, columns=['val','key'])
dups = df[df['key'].duplicated(keep=False)]
result = list(dups.to_records(index=False))
print(result)
[(0, 'b'), (1, 'b'), (2, 'e'), (2, 'b'), (3, 'e')]
3. Using numpy.unique
You can solve this in a vectorized manner using numpy -
Convert to a numpy array arr
Find unique elements u and their counts c
Filter list of unique elements that occur more than once dup
Use broadcasting to compare the second column of the array and take any over axis=0 to get a boolean which is True for duplicated rows
Filter the arr based on this boolean
import numpy as np
arr = np.array(lol)
u, c = np.unique(arr[:,1], return_counts=True)
dup = u[c > 1]
result = arr[(arr[:,1]==dup[:,None]).any(0)]
result
array([['0', 'b'],
       ['1', 'b'],
       ['2', 'e'],
       ['2', 'b'],
       ['3', 'e']], dtype='<U21')
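Note that np.array(lol) upcasts everything to strings, hence dtype='<U21'. If you need the original types back, a small follow-up sketch (my addition):
result_as_lists = [[int(v), k] for v, k in result]
# [[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]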

Pandas find duplicate concatenated values across selected columns

I want to find duplicates in a selection of columns of a df.
import numpy as np
from collections import defaultdict

# converts the sub df into a matrix
mat = df[['idx', 'a', 'b']].values

str_dict = defaultdict(set)
for x in np.ndindex(mat.shape[0]):
    concat = ''.join(str(v) for v in mat[x][1:])
    # take idx as the value of each key a + b
    str_dict[concat].update([mat[x][0]])

dups = {}
for key in str_dict.keys():
    dup = str_dict[key]
    if len(dup) < 2:
        continue
    dups[key] = dup
The code finds duplicates of the concatenation of a and b. It uses the concatenation as the key of a defaultdict of sets (str_dict) and updates each key with the idx values; finally it uses a dict (dups) to store every concatenation whose value set has length >= 2.
I am wondering if there is a better way to do that in terms of efficiency.
You can just concatenate and convert to set:
res = set(df['a'].astype(str) + df['b'].astype(str))
Example:
import pandas as pd

df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [4, 4, 5],
                   'b': [5, 5, 6]})
res = set(df['a'].astype(str) + df['b'].astype(str))
print(res)
# {'56', '45'}
If you need to map indices too:
df = pd.DataFrame({'idx': [1, 2, 3],
                   'a': [41, 4, 5],
                   'b': [3, 13, 6]})
df['conc'] = (df['a'].astype(str) + df['b'].astype(str))
df = df.reset_index()
res = df.groupby('conc')['index'].apply(set).to_dict()
print(res)
# {'413': {0, 1}, '56': {2}}
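Two hedged asides (my additions, not from the original answer): plain string concatenation can collide, and indeed '41' + '3' and '4' + '13' both give '413' above, so insert a separator if that grouping is unwanted; and to keep only actual duplicates, filter the dict for sets with more than one index:
df['conc'] = df['a'].astype(str) + '|' + df['b'].astype(str)  # separator prevents collisions
res = df.groupby('conc')['index'].apply(set).to_dict()  # df already has an 'index' column from reset_index()
dups = {k: v for k, v in res.items() if len(v) > 1}
# with the separator, the two rows above no longer collide and dups is empty here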
You can select just the columns you need before drop_duplicates:
df[['a','b']].drop_duplicates().astype(str).apply(np.sum,1).tolist()
Out[1027]: ['45', '56']
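A possibly clearer spelling of the same idea (my sketch, not from the answer): concatenate as strings first, then drop duplicates on the resulting Series:
res = (df['a'].astype(str) + df['b'].astype(str)).drop_duplicates().tolist()
# ['45', '56'] for the first example df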

Getting items from a list of lists if the list contains any keywords?

In my Python code, there are two objects, (x, y).
x is a numpy array from a separate function containing x, y and z coordinates. Each coordinate corresponds to an item in the list 'y',
and 'y' is a list of letters between a and j in random order.
There can be multiple instances of each letter, e.g.: a b b c d a a f b d e e f e c a, and so on. For every value of 'x', there is a corresponding letter in 'y'. Each line is different.
I want to get the x values that correspond to a list of chosen letters, say a, c and f.
How can I do this? I've tried looking into slices and indices but I'm not sure where to begin.
I'm trying to grab an item from array x that corresponds to the same line in list y, if that makes any sense.
You wanted the values corresponding to 'a', 'c', and 'f':
>>> x = [ 1, 2, 3, 4, 5, 6 ]
>>> y = 'cgadfh'
>>> d = dict(zip(y, x))
>>> d['a']
3
>>> [d[char] for char in 'acf']
[3, 1, 5]
'a' is the third character in y and 3 is the third number in x, so d['a'] returns 3. (Note that if a letter occurs more than once in y, dict(zip(y, x)) keeps only the value of its last occurrence.)
Incidentally, this approach works the same whether y is a string or a list:
>>> x = [ 1, 2, 3, 4, 5, 6 ]
>>> y = ['c', 'g', 'a', 'd', 'f', 'h']
>>> d = dict(zip(y, x))
>>> [d[char] for char in 'acf']
[3, 1, 5]
You can use collections.defaultdict and the enumerate function to achieve this:
from collections import defaultdict

X = [1, 2, 3, 4, 5, 6]
Y = ["a", "f", "c", "a", "c", "f"]

result = defaultdict(list)
for idx, y in enumerate(Y):
    result[y].append(X[idx])
print(result)
Output
defaultdict(<class 'list'>, {'a': [1, 4], 'f': [2, 6], 'c': [3, 5]})
If X is just indices starting from 1, then you can do the following:
exp = ['a', 'c', 'f']
output = [idx + 1 for idx, ch in enumerate(Y) if ch in exp]
Otherwise you can try zip (or itertools.izip / izip_longest on Python 2):
import string
import random

a = range(15)
b = random.sample(string.ascii_lowercase, 15)
exp = random.sample(b, 3)
output = [k for k, v in zip(a, b) if v in exp]
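Since the question says x is a NumPy array, a boolean-mask sketch may be a better fit (my addition; np.isin marks the positions whose letter is in the wanted set and keeps every occurrence, in order):
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array(['a', 'b', 'c', 'a', 'c', 'f'])
selected = x[np.isin(y, ['a', 'c', 'f'])]
# array([1, 3, 4, 5, 6])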

How do I delete the Nth list item from a list of lists (column delete)?

How do I delete a "column" from a list of lists?
Given:
L = [
    ["a", "b", "C", "d"],
    [ 1,   2,   3,   4 ],
    ["w", "x", "y", "z"]
]
I would like to delete "column" 2 to get:
L = [
    ["a", "b", "d"],
    [ 1,   2,   4 ],
    ["w", "x", "z"]
]
Is there a slice or del method that will do that? Something like:
del L[:][2]
You could loop.
for x in L:
    del x[2]
If you're dealing with a lot of data, you can use a library that supports sophisticated slicing like that, such as NumPy; a plain list of lists doesn't support that kind of slicing.
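For example, a minimal NumPy sketch (my addition; note that np.array upcasts the mixed rows to strings):
import numpy as np

arr = np.array(L)                    # ints become strings here
trimmed = np.delete(arr, 2, axis=1)  # drop column index 2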
Just iterate through the list and delete the index you want removed. For example:
for sublist in L:
    del sublist[index]
You can do it with a list comprehension:
>>> removed = [l.pop(2) for l in L]
>>> print(L)
[['a', 'b', 'd'], [1, 2, 4], ['w', 'x', 'z']]
>>> print(removed)
['C', 3, 'y']
It loops over the list and pops the element at position 2 from each sublist.
You get both the list of removed elements and the main list without them.
A slightly twisted version:
index = 2  # delete column 2
[x[0:index] + x[index+1:] for x in L]
Or, hard-coding the kept columns (note this yields tuples rather than lists):
[(x[0], x[1], x[3]) for x in L]
It works fine.
This is a very easy way to remove whatever column you want.
L = [
    ["a", "b", "C", "d"],
    [ 1,   2,   3,   4 ],
    ["w", "x", "y", "z"]
]
temp = [[x[0], x[1], x[3]] for x in L]  # list the column indices you want to keep
print(temp)
Output: [['a', 'b', 'd'], [1, 2, 4], ['w', 'x', 'z']]
L = [['a', 'b', 'C', 'd'], [1, 2, 3, 4], ['w', 'x', 'y', 'z']]
_ = [i.remove(i[2]) for i in L]
Be aware that remove deletes the first occurrence of the given value, so this misbehaves if an earlier column holds the same value as column 2.
If you don't mind creating a new list, then you can try the following:
filter_col = lambda lVals, iCol: [[x for i, x in enumerate(row) if i != iCol] for row in lVals]
filter_col(L, 2)
An alternative to pop():
[x.__delitem__(n) for x in L]
Here n is the index of the elements to be deleted; del x[n] is the more idiomatic spelling of the same operation.
