Searching index position in Python

cols = [2,4,6,8,10,12,14,16,18] # selected the columns i want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b= np.array(df1)
b
outcome
b = [['WV5 6NY' 'RE4 9VU' 'BU4 N90' 'TU3 5RE' 'NE5 4F']
['SA8 7TA' 'BA31 0PO' 'DE3 2FP' 'LR98 4TS' 0]
['MN0 4NU' 'RF5 5FG' 'WA3 0MN' 'EA15 8RE' 'BE1 4RE']
['SB7 0ET' 'SA7 0SB' 'BT7 6NS' 'TA9 0LP' 'BA3 1OE']]
a = np.concatenate(b) #concatenated to get a single array, this worked well
a = np.array([x for x in a if x != 'nan'])
a = a[np.where(a != '0')] #removed the nan
print(np.sort(a)) # to sort alphabetically
#Sorted array
['BA3 1OE' 'BA31 0PO' 'BE1 4RE' 'BT7 6NS' 'BU4 N90'
'DE3 2FP' 'EA15 8RE' 'LR98 4TS' 'MN0 4NU' 'NE5 4F' 'RE4 9VU'
'RF5 5FG' 'SA7 0SB' 'SA8 7TA' 'SB7 0ET' 'TA9 0LP' 'TU3 5RE'
'WA3 0MN' 'WV5 6NY']
#Find the index position of all elements of b in a(sorted array)
def findall_index(b, a )
    result = []
    for i in range(len(a)):
        for j in range(len(a[i])):
            if b[i][j] == a:
                result.append((i, j))
    return result
print(findall_index(0,result))
I am still very new to Python. I tried to find the index positions of all elements of b in a (the sorted array) above. The code block above doesn't seem to give me any result. Can someone please help me?
Thank you in advance.

One way you could approach this is by zipping (creating pairs) the index of elements in b with the actual elements and then sorting this new array based on the elements only. Now you have a mapping from indices of the original array to the new sorted array. You can then just loop over the sorted pairs to map the current index to the original index.
I highly suggest you code this yourself, since it will help you learn!
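For readers who want to check their own attempt, here is a minimal sketch of the pairing idea described above. The function name `sorted_positions` and the sample postcodes are illustrative assumptions, and the sketch expects a flat list from which placeholder values have already been removed.

```python
def sorted_positions(items):
    """Return, for each item, its index in the sorted version of the list."""
    # Pair every element with its original index, then sort by the element only.
    pairs = sorted(enumerate(items), key=lambda pair: pair[1])
    positions = [0] * len(items)
    # The rank of a pair in the sorted order is that element's new index.
    for sorted_idx, (orig_idx, _) in enumerate(pairs):
        positions[orig_idx] = sorted_idx
    return positions


# Hypothetical sample taken from the postcodes in the question.
postcodes = ["WV5 6NY", "BA3 1OE", "SA8 7TA", "BA31 0PO"]
positions = sorted_positions(postcodes)
```

Here `positions[k]` is where `postcodes[k]` ends up after sorting. Because Python's sort is stable, duplicates receive consecutive positions in their original order.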

Related

Pandas filter list of list values in a dataframe column

I have a dataframe like the one below
sample_df = pd.DataFrame({
    'single_proj_name': [['jsfk'], ['fhjk'], ['ERRW'], ['SJBAK']],
    'single_item_list': [['ABC_123'], ['DEF123'], ['FAS324'], ['HSJD123']],
    'single_id': [[1234], [5678], [91011], [121314]],
    'multi_proj_name': [['AAA', 'VVVV', 'SASD'], ['QEWWQ', 'SFA', 'JKKK', 'fhjk'], ['ERRW', 'TTTT'], ['SJBAK', 'YYYY']],
    'multi_item_list': [[['XYZAV', 'ADS23', 'ABC_123'], ['ABC_123', 'ADC_123']], ['XYZAV', 'DEF123', 'ABC_123', 'SAJKF'], ['QWER12', 'FAS324'], ['JFAJKA', 'HSJD123']],
    'multi_id': [[[2167, 2147, 29481], [5432, 1234]], [2313, 57567, 2321, 7898], [1123, 8775], [5237, 43512]],
})
I would like to do the below
a) Pick the value from single_item_list for each row
b) search that value in multi_item_list column of the same row. Please note that it could be list of lists for some of the rows
c) If match found, keep only that matched values in multi_item_list and remove all other non-matching values from multi_item_list
d) Based on the position of the match item, look for corresponding value in multi_id list and keep only that item. Remove all other position items from the list
So, I tried the below but it doesn't work for nested list of lists
for a, b, c in zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id']):
    for i, x in enumerate(b):
        print(x)
        print(a[0])
        if a[0] in x:
            print(x.index(a[0]))
            pos = x.index(a[0])
            print(c[pos-1])
I expect my output to look like the one below. In the real world, I will have more cases like the first input row (nested lists with multiple levels).
Here is one approach which works with any number of nested lists:
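def func(z, X, Y):
    # Collect the entries of X equal to z, plus the entries of Y at the same positions.
    A, B = [], []
    for x, y in zip(X, Y):
        if isinstance(x, list):
            # Nested list: recurse and keep the filtered sublists.
            a, b = func(z, x, y)
            A.append(a)
            B.append(b)
        elif x == z:
            A.append(x)
            B.append(y)
    return A, B
# df is the question's sample_df
c = ['single_item_list', 'multi_item_list', 'multi_id']
df[c[1:]] = [func(z, X, Y) for [z], X, Y in df[c].to_numpy()]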
Result
   single_proj_name  single_item_list  single_id  multi_proj_name           multi_item_list         multi_id
0  [jsfk]            [ABC_123]         [1234]     [AAA, VVVV, SASD]         [[ABC_123], [ABC_123]]  [[29481], [5432]]
1  [fhjk]            [DEF123]          [5678]     [QEWWQ, SFA, JKKK, fhjk]  [DEF123]                [57567]
2  [ERRW]            [FAS324]          [91011]    [ERRW, TTTT]              [FAS324]                [8775]
3  [SJBAK]           [HSJD123]         [121314]   [SJBAK, YYYY]             [HSJD123]               [43512]
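To see what the recursive helper does without pandas in the way, the sketch below applies the same logic to plain Python lists taken from the first two rows of the question. The name `filter_nested` is a hypothetical rename of `func` used only for this illustration.

```python
def filter_nested(target, items, ids):
    """Keep entries of `items` equal to `target` and the matching entries of `ids`.

    Sublists are processed recursively, so the nesting of the input is preserved.
    """
    kept_items, kept_ids = [], []
    for item, id_ in zip(items, ids):
        if isinstance(item, list):
            sub_items, sub_ids = filter_nested(target, item, id_)
            kept_items.append(sub_items)
            kept_ids.append(sub_ids)
        elif item == target:
            kept_items.append(item)
            kept_ids.append(id_)
    return kept_items, kept_ids


# Row 0 of the question: a list of lists.
nested = filter_nested(
    'ABC_123',
    [['XYZAV', 'ADS23', 'ABC_123'], ['ABC_123', 'ADC_123']],
    [[2167, 2147, 29481], [5432, 1234]],
)

# Row 1 of the question: a flat list.
flat = filter_nested(
    'DEF123',
    ['XYZAV', 'DEF123', 'ABC_123', 'SAJKF'],
    [2313, 57567, 2321, 7898],
)
```

The sketch assumes that the item list and the id list have the same shape at every level; `zip` silently truncates to the shorter one if they do not.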
The code you've provided uses a zip() function to iterate over the 'single_item_list', 'multi_item_list', and 'multi_id' columns of the DataFrame simultaneously.
For each iteration, it uses a nested for loop to iterate over the sublists in the 'multi_item_list' column. It checks if the first element of the 'single_item_list' is present in the current sublist, using the in operator. If it is present, it finds the index of the matching element in the sublist using the index() method, and assigns it to the variable pos. Then it prints the value in the corresponding index of the 'multi_id' column.
This code will work correctly, but it's only printing the matched value in multi_id column, it's not updating the multi_item_list and multi_id columns of the DataFrame.
In order to update the DataFrame with the matched values, you will have to assign to the individual cells, for example with the .at accessor.
e.g.: sample_df.at[i, 'multi_id'] = new_val
for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id'])):
    for j, item_list in enumerate(multi_item):
        if single[0] in item_list:
            pos = item_list.index(single[0])
            sample_df.at[i,'multi_item_list'] = [item_list]
            sample_df.at[i,'multi_id'] = [multi_id[j]]
print(sample_df)
This will print the updated DataFrame with the filtered values in the 'multi_item_list' and 'multi_id' columns.
Please note that the print(sample_df) should be placed after the for loop to make sure the table is printed after the updates.
This code iterates over the 'single_item_list', 'multi_item_list', and 'multi_id' columns of the DataFrame simultaneously.
In each iteration, it uses a nested for loop to iterate over the sublists in the 'multi_item_list' column.
It checks if the first element of the 'single_item_list' is present in the current sublist, using the in operator. If it is present, it finds the index of the matching element in the sublist using the index() method, and assigns it to the variable pos.
Then it updates the 'multi_item_list' and 'multi_id' columns of the DataFrame at the current index with the matched value using the at method.
Please note that this code will remove the non-matching items from the 'multi_item_list' and 'multi_id' columns, if there is no matching item it will keep the original values.
I made use of isinstance to check whether it is a nested list or not and came up with something like the code below, which produces the expected output. I am open to suggestions and improvements from the experts here.
for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id'])):
    if not any(isinstance(m, list) for m in multi_item):
        for j, item_list in enumerate(multi_item):
            if single[0] in item_list:
                pos = item_list.index(single[0])
                sample_df.at[i,'multi_item_list'] = [item_list]
                sample_df.at[i,'multi_id'] = [multi_id[j]]
    else:
        print("under nested list")
        for j, item_list in enumerate(zip(multi_item,multi_id)):
            if single[0] in multi_item[j]:
                pos = multi_item[j].index(single[0])
                sample_df.at[i,'multi_item_list'][j] = single[0]
                sample_df.at[i,'multi_id'][j] = multi_id[j][pos]
            else:
                sample_df.at[i,'multi_item_list'][j] = np.nan
                sample_df.at[i,'multi_id'][j] = np.nan

Get list with all values from all cells of pd DataFrame

I have a pd Dataframe cooc_all (symmetric matrix) from which I would like to create a list that contains all the values from the DataFrame.
Currently, I have done this as follows:
pd_list = []
for i in range(0,40):
    for j in range(i, 40):
        pd_list.append(cooc_all[j][i])
Is this the best way to do it? Or are there faster/shorter ways?
Try with ravel then tolist
outlist = df.values.ravel().tolist()
Update: to get only the upper triangle (diagonal included, in the same order as your loop):
idx = np.triu_indices(len(df))
outlist = df.to_numpy()[idx].tolist()
You can use np.tril to extract the lower triangle of the symmetric matrix, then flatten it in Fortran (column-major) order to match your list, and finally drop the zeros that come from the upper side. Note that this also drops any genuine zeros stored in the lower triangle:
>>> out = np.tril(df).ravel(order="F")
>>> out[out != 0].tolist()
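If the matrix can contain real zeros, as a co-occurrence matrix often does, filtering on `!= 0` would discard them. A minimal sketch of an index-based alternative using `np.triu_indices` follows. The helper name `upper_triangle_values` and the 3x3 sample matrix are assumptions made for illustration, and the comparison loop indexes a nested list as `cooc[i][j]` (row, then column), which corresponds to `cooc_all[j][i]` on the question's DataFrame.

```python
import numpy as np


def upper_triangle_values(matrix):
    """Return the upper triangle (diagonal included) of a square matrix, row by row."""
    arr = np.asarray(matrix)
    # triu_indices yields the (row, col) pairs of the upper triangle in row-major order.
    rows, cols = np.triu_indices(arr.shape[0])
    return arr[rows, cols].tolist()


# Small symmetric example that contains a genuine zero.
cooc = [[1, 0, 3],
        [0, 4, 5],
        [3, 5, 6]]

n = len(cooc)
# Same traversal as the double loop in the question, written for a nested list.
loop_values = [cooc[i][j] for i in range(n) for j in range(i, n)]
vector_values = upper_triangle_values(cooc)
```

For a DataFrame, passing `df.to_numpy()` to the helper should give the same result, since only positions are used and not labels.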

Removing values >0 in data set

I have a data set which is a list of lists, looking like this:
[[-0.519418066, -0.680905835],
[0.895518429, -0.654813183],
[0.092350219, 0.135117023],
[-0.299403315, -0.568458405],....]
Its shape is (9760,) and I am trying to remove all entries where the first number in the entry is greater than 0, so in this example the 2nd and 3rd entries would be removed to leave
[[-0.519418066, -0.680905835],
[-0.299403315, -0.568458405],....]
So far I have written:
for x in range(9670):
    for j in filterfinal[j][0]:
        if filterfinal[j][0] > 0:
            np.delete(filterfinal[j])
this returns: TypeError: list indices must be integers or slices, not list
Thanks in advance for any help on this problem!
You can use numpy's boolean indexing:
>>> x = np.random.randn(10).reshape((5,2))
>>> x
array([[-0.46490993, 0.09064271],
[ 1.01982349, -0.46011639],
[-0.40474591, -1.91849573],
[-0.69098115, 0.19680831],
[ 2.00139248, -1.94348869]])
>>> x[x[:,0] > 0]
array([[ 1.01982349, -0.46011639],
[ 2.00139248, -1.94348869]])
Some explanation:
x[:,0] selects the first column of your array.
x > 0 returns an array of the same shape where each value is replaced by the result of the element-wise comparison (i.e., is the value > 0 or not?).
So x[:,0] > 0 gives you a one-dimensional boolean array of shape (n,), with True or False depending on the first value of each row.
You can then pass this boolean array as an index to your original array, which returns only the rows where it is True. Note that the example above keeps the rows whose first value is positive; to remove them, as the question asks, invert the condition: x[x[:,0] <= 0].
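The explanation above can be checked on the four rows shown in the question. This is a small sketch, assuming the data fits in a regular two-column NumPy array; the variable names `mask` and `kept` are illustrative.

```python
import numpy as np

data = np.array([[-0.519418066, -0.680905835],
                 [0.895518429, -0.654813183],
                 [0.092350219, 0.135117023],
                 [-0.299403315, -0.568458405]])

# One boolean per row: True where the first value is greater than 0.
mask = data[:, 0] > 0

# ~mask inverts the booleans, so only rows with a non-positive first value remain.
kept = data[~mask]
```

The mask has one entry per row, which is why indexing with it selects or drops whole rows rather than single values.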
You are talking about "shape", so I assume that you are using numpy. Also, you are mentioning np in your example code, so you are able to apply element wise operations together with boolean indexing
array = np.array([[-0.519418066, -0.680905835],
                  [0.895518429, -0.654813183],
                  [0.092350219, 0.135117023],
                  [-0.299403315, -0.568458405]])
filtered = array[array[:, 0] < 0]
Use a list comprehension:
lol = [[-0.519418066, -0.680905835],[0.895518429, -0.654813183],[0.092350219, 0.135117023],[-0.299403315, -0.568458405]]
filtered_lol = [l for l in lol if l[0] <= 0]
You can use a list comprehension that unpacks the items of each sub-list and retains only the sub-lists whose first item is <= 0 (assuming your list of lists is stored as variable l):
[[a, b] for a, b in l if a <= 0]
You can go through this in a for loop and make a new list without the positives, like so:
new_list = []
for item in old_list:
    if item[0] < 0:
        new_list.append(item)
But I'd prefer to use the built-in filter function instead, if you are comfortable with it, and do something like:
def is_negative(item):
    return item[0] < 0
filtered_list = filter(is_negative, old_list)
This is similar to a list comprehension, or just using a for loop. However, in Python 3 it returns a lazy iterator instead of a list, so you never have to hold two lists in memory; wrap it in list() if you need an actual list.
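One consequence of the lazy behaviour is easy to miss: the object returned by `filter` can be consumed only once. The sketch below illustrates this with the rows from the question; the variable names are illustrative, and the `<= 0` condition is an assumption about how a first value of exactly 0 should be treated.

```python
rows = [[-0.519418066, -0.680905835],
        [0.895518429, -0.654813183],
        [0.092350219, 0.135117023],
        [-0.299403315, -0.568458405]]

# Nothing is evaluated yet; filter only remembers the predicate and the input.
lazy = filter(lambda row: row[0] <= 0, rows)

first = next(lazy)      # evaluates rows until the first match
rest = list(lazy)       # consumes the remaining matches
leftover = list(lazy)   # the iterator is now exhausted
```

If the filtered rows are needed more than once, building a list up front (with `list(filter(...))` or a list comprehension) is usually the simpler choice.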

how to get a range of values from a list on python?

I'd like to do the following Matlab code:
indexes=find(data>0.5);
data2=data(indexes(1):indexes(length(indexes))+1);
in Python, so I did:
indexes=[x for x in data if x>0.5]
init=indexes[1]
print(indexes)
end=indexes[len(indexes)]+1
data2=data[init:end]
but I'm getting this error:
end=indexes[len(indexes)]+1 IndexError: list index out of range
I think the indexes in Python may not be the same ones as I get in Matlab?
Your list comprehension isn't building a list of indices, but a list of the items themselves. You should generate the indices alongside the items using enumerate:
ind = [i for i, x in enumerate(data) if x > 0.5]
And no need to be so verbose with slicing:
data2 = data[ind[0]: ind[-1]+1] # Matlab's index 1 is Python's index 0
Indexing the list of indices with len(ind) will give an IndexError as indexing in Python starts from 0 (unlike Matlab) and the last index should be fetched with ind[len(ind)-1] or simply ind[-1].
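Putting the two fixes together, a minimal sketch could look like the following. The function name `slice_between_matches` and the sample data are assumptions made for illustration. Note that the slice ends at the last matching element; the Matlab line in the question appears to include one further element, which would correspond to `ind[-1] + 2` as the end of the Python slice.

```python
def slice_between_matches(data, threshold=0.5):
    """Return the part of `data` from the first to the last value above `threshold`."""
    # enumerate yields (index, value) pairs, so we can keep the indices.
    ind = [i for i, x in enumerate(data) if x > threshold]
    if not ind:
        # No value exceeds the threshold, so there is nothing to slice.
        return []
    # Slices exclude their end, hence the + 1 to include the last match.
    return data[ind[0]: ind[-1] + 1]


sample = [0.1, 0.7, 0.2, 0.9, 0.3]
data2 = slice_between_matches(sample)
```

Guarding against an empty `ind` avoids an IndexError on `ind[0]` when no value exceeds the threshold.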
len(indexes) gives you the number of elements in the list, which is one more than the index of the last element, so indexes[len(indexes)] is out of the range of the list.
It looks like what you're trying to do is find the indices of the list that have values greater than 0.5 and put those values into data2. This is better suited to a numpy array, which supports boolean indexing:
import numpy as np
data = np.asarray(data)
data2 = data[data > 0.5]

Compare 1 column of 2D array and remove duplicates Python

Say I have a 2D array like:
array = [['abc',2,3,],
         ['abc',2,3],
         ['bb',5,5],
         ['bb',4,6],
         ['sa',3,5],
         ['tt',2,1]]
I want to remove any rows where the first column duplicates
ie compare array[0] and return only:
removeDups = [['sa',3,5],
              ['tt',2,1]]
I think it should be something like:
(set first col as tmp variable, compare tmp with remaining and #set array as returned from compare)
for x in range(len(array)):
    tmpCol = array[x][0]
    del array[x]
    removed = compare(array, tmpCol)
    array = copy.deepcopy(removed)
    print repr(len(removed)) #testing
where compare is:
(compare first col of each remaining array items with tmp, if match remove else return original array)
def compare(valid, tmpCol):
    for x in range(len(valid)):
        if valid[x][0] != tmpCol:
            del valid[x]
            return valid
        else:
            return valid
I keep getting 'index out of range' error. I've tried other ways of doing this, but I would really appreciate some help!
Similar to other answers, but using a dictionary instead of importing Counter:
counts = {}
for elem in array:
    # add 1 to counts for this string, creating new element at this key
    # with initial value of 0 if needed
    counts[elem[0]] = counts.get(elem[0], 0) + 1
new_array = []
for elem in array:
    # check that there's only 1 instance of this element.
    if counts[elem[0]] == 1:
        new_array.append(elem)
One option you can try is to create a counter for the first column of your array beforehand and then filter the list based on the count value, i.e., keep a row only if its first element appears exactly once:
from collections import Counter
count = Counter(a[0] for a in array)
[a for a in array if count[a[0]] == 1]
# [['sa', 3, 5], ['tt', 2, 1]]
You can use a dictionary and count the occurrences of each key.
You can also use Counter from the collections library, which does exactly this.
Do as follows:
from collections import Counter
# count each first-column value once, outside the loop
counts = Counter(k for k, _, _ in array)
removed = []
for k, val1, val2 in array:
    if counts[k] == 1:
        removed.append([k, val1, val2])
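None of the answers spell out the likely cause of the 'index out of range' error in the question: `range(len(array))` is computed once, while `del` keeps shrinking the list, so later indices no longer exist. Building a new list avoids the problem altogether. The sketch below wraps the Counter approach in a function; the name `drop_repeated_keys` and the `key_index` parameter are hypothetical additions for illustration.

```python
from collections import Counter


def drop_repeated_keys(rows, key_index=0):
    """Return the rows whose value at `key_index` occurs exactly once in `rows`."""
    # First pass: count every key. Second pass: keep rows with a count of 1.
    counts = Counter(row[key_index] for row in rows)
    # A new list is built, so the input is never modified while iterating.
    return [row for row in rows if counts[row[key_index]] == 1]


array = [['abc', 2, 3],
         ['abc', 2, 3],
         ['bb', 5, 5],
         ['bb', 4, 6],
         ['sa', 3, 5],
         ['tt', 2, 1]]

unique_rows = drop_repeated_keys(array)
```

The input list is left untouched, so it can still be inspected after the call.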
