next item of list inside dataframe - python

I have a dataframe that has a column where each row has a list.
I want to get the next element after the value I am looking for (in another column).
For example:
Let's say I am looking for 'b':
|lists |next_element|
|---------|------------|
|[a,b,c,d]| c | #(c is the next value after b)
|[c,b,a,e]| a | #(a is the next value after b)
|[a,e,f,b]| [] | #(empty, because there is no next value after b)
*All lists have the element. There are no lists without the value I am looking for
Thank you

Try writing a function and use apply.
value = 'b'

def get_next(x):
    get_len = len(x) - 1
    for i in x:
        if value.lower() == i.lower():
            curr_idx = x.index(i)
            if curr_idx == get_len:
                return []
            else:
                return x[curr_idx + 1]

df["next_element"] = df["lists"].apply(get_next)
df
Out[649]:
lists next_element
0 [a, b, c, d] c
1 [c, b, a, e] a
2 [a, e, f, b] []

First observation: since you want the next element of a list of strings, the expected data type for that column should be a string, not a list.
So instead of a next_element column of [c, a, []], it's better to use [c, a, None].
Second, try to avoid calling apply directly on a series; instead use the str methods that pandas provides for series, which are a vectorized and fast way of solving such problems.
With the above in mind, let's try this completely vectorized one-liner:
element = 'b'
df['next_element'] = df.lists.str.join('').str.split(element).str[-1].str[0]
lists next_element
0 [a, b, c, d] c
1 [c, b, a, e] a
2 [a, e, f, b] NaN
First, I join each row into a single string: [a,b,c,d] -> 'abcd'.
Next, I split this string by 'b' to get substrings.
Then I pick the last substring and, finally, its first character, for each row, using str functions that are vectorized over the rows.
Read more about the pandas.Series.str methods in the official documentation.
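The steps above can be sketched one at a time. This is only an illustration that rebuilds the example dataframe from the question; the intermediate names (joined, pieces, tail) are made up here. Note that the approach assumes every list element is a single character, because joining the list discards the boundaries between elements, and that it looks at the text after the last occurrence of the element.

```python
import pandas as pd

# Example data taken from the question
df = pd.DataFrame({"lists": [["a", "b", "c", "d"],
                             ["c", "b", "a", "e"],
                             ["a", "e", "f", "b"]]})
element = "b"

joined = df["lists"].str.join("")    # 'abcd', 'cbae', 'aefb'
pieces = joined.str.split(element)   # ['a', 'cd'], ['c', 'ae'], ['aef', '']
tail = pieces.str[-1]                # 'cd', 'ae', ''
df["next_element"] = tail.str[0]     # 'c', 'a', NaN (the empty string has no first character)

print(df)
```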

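df = df.assign(next_element="")
print(df)
for ind in df.index:
    c = df["lists"][ind]
    for i, v in enumerate(c):
        # Only look ahead when "b" is not the last item, to avoid an IndexError
        if v == "b" and i + 1 < len(c):
            # .loc avoids chained assignment on a copy
            df.loc[ind, "next_element"] = c[i + 1]
print(df)
Try this one; it gives the output you expected, with an empty string where nothing follows 'b'.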
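The same lookup can also be sketched without the inner loop. This is a minimal illustration, assuming (as the question states) that every list contains the value; the helper name next_after is made up for this example, and it returns None rather than an empty list when nothing follows the value.

```python
def next_after(items, target):
    """Return the element that follows the first occurrence of target, or None."""
    idx = items.index(target)  # raises ValueError if target is missing
    return items[idx + 1] if idx + 1 < len(items) else None

rows = [["a", "b", "c", "d"], ["c", "b", "a", "e"], ["a", "e", "f", "b"]]
result = [next_after(row, "b") for row in rows]
print(result)  # ['c', 'a', None]
```

On a dataframe, the helper could be applied row by row, for example with df["lists"].apply(next_after, args=("b",)).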

Related

Rename None in a list under pandas column

Let's say I have the following dataframe:
Value
[None, A, B, C]
[None]
I would like to replace the None values in the column with none, but I couldn't figure out how.
I tried this, but it does not work:
df['Value'] = df['Value'].str.replace('None','none')
None is a built-in constant in Python, not a string, so str.replace cannot match it. To get a lowercase version, you have to substitute the string 'none' for it.
There is no built-in way in pandas to replace values inside lists, but you can use explode to expand all the lists so that each item of each list gets its own row, then replace, then group the rows back together into the original list format:
df['Value'] = df['Value'].explode().replace({None: 'none'}).groupby(level=0).apply(list)
Output:
>>> df
Value
0 [none, A, B, C]
1 [none]
Here is a way using map():
df['Value'] = df['Value'].map(lambda x: ['none' if i is None else i for i in x])
Output:
Value
0 [none, A, B, C]
1 [none]
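The replacement itself does not need pandas. As a minimal sketch, assuming the column holds plain Python lists, a small helper (the name lower_none is made up for this illustration) can be checked on its own before it is passed to map:

```python
def lower_none(values):
    """Return a copy of values with every None replaced by the string 'none'."""
    # `is None` is an identity check, the idiomatic way to test for None
    return ["none" if item is None else item for item in values]

rows = [[None, "A", "B", "C"], [None]]
cleaned = [lower_none(row) for row in rows]
print(cleaned)  # [['none', 'A', 'B', 'C'], ['none']]
```

With the dataframe from the question, this would be used as df['Value'].map(lower_none).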

loops list slicing + elements allocation Python

I am a beginner in Python and I am trying to do the following:
main_list = [80,80,30,30,30,30,20,10,5,4,3,2,1] # list of integers
- slice main_list into multiple lists, for example list1, list2, list3, ..., listn, where the sum of each sublist is < 100
for i in range of n:
print(list(i))
list1[80,20], list2[80,10,5,4,1], list3[30,30,30], listn[30,3,2]
Thanks!
It's not really clear to me what you consider an acceptable output, so I'm assuming that it's any list whose elements sum to less than 100.
The solution I found uses recursion. For the list [a, b, c, d] we are going to check the condition for these sublists:
[a]
[a, b] (if the condition for [a] is met)
[a, b, c] (if the condition for [a, b] is met)
[a, b, c, d] (if the condition for [a, b, c] is met)
[a, c] (if the condition for [a] is met)
[a, c, d] (if the condition for [a, c] is met)
[a, d] (if the condition for [a] is met)
[b]
[b, c] (if the condition for [b] is met)
[b, c, d] (if the condition for [b, c] is met)
[b, d] (if the condition for [b] is met)
[c]
[c, d] (if the condition for [c] is met)
[d]
The concept is that, for each element in the list, we look for every sublist that starts with that element and meets the requirement, from the element on its own up to the element followed by everything after it. The sublists are built from the elements to the right of the element studied in each iteration, so for the first 30 the sublist to use would be [30, 30, 30, 20, 10, 5, 4, 3, 2, 1].
This process of finding the sublists for each element is the part that uses recursion. The function calls itself for each element of the sublist, checking whether the condition is still met. For the example above, if the condition is met for [a, b], it will also try [a, b, c] and [a, b, d] (by calling itself with the sum of (a, b) and the sublist [c, d]).
I've added a few prints so you can study how it works, but you should just use the results variable at the end of the script to get your results.
main_list = [80,80,30,30,30,30,20,10,5,4,3,2,1]

def less_than_hundred(input) -> bool:
    return input < 100

def sublists_meet_condition(condition, input):
    """
    This function is used to call the sublists_for_element function with every element in the original list and its sublist:
    - For the first element (80) it calls the second function with the sublist [80,30,30,30,30,20,10,5,4,3,2,1]
    - For the fifth element (30) it calls the second function with the sublist [30,20,10,5,4,3,2,1]
    Its purpose is to collect all the sublists that meet the requirements for each element
    """
    results = []
    for index, element in enumerate(input):
        print('Iteration {} - Element {}'.format(index, element))
        if condition(element):
            results.append([element])
            print('{} = {}'.format([element], element))
            num_elements = len(input) - index
            main_element = element
            sublist = input[index+1:]
            for result in sublists_for_element(condition, main_element, sublist):
                new_result = [element] + result
                sum_new_result = sum(new_result)
                results.append(new_result)
                print('{} = {}'.format([element] + result, sum_new_result))
    return results

def sublists_for_element(condition, sum_main_elements, sublist):
    """
    This function is used to check every sublist with the given condition.
    The variable sum_main_elements allows the function to call itself and check if for a given list of numbers that meet the conditions [30, 30, 4] for example, any of the elements of the remaining sublists also meets the condition for example adding the number 3 still meets the condition.
    Its purpose is to return all the sublists that meet the requirements for the given sum of main elements and remaining sublist
    """
    num_elements = '{}{}'.format('0' if len(sublist) + 1 < 10 else '',len(sublist) + 1)
    #print('Elements: {} | Main element: {} | Sublist: {}'.format(num_elements, sum_main_elements, sublist))
    result = []
    for index, element in enumerate(sublist):
        if condition(sum_main_elements + element):
            result.append([element])
            sublist_results = sublists_for_element(condition, sum_main_elements + element, sublist[index+1:])
            for sublist_result in sublist_results:
                result.append([element] + sublist_result)
    return result

results = sublists_meet_condition(less_than_hundred, main_list)
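To see the recursion at a glance, here is a condensed sketch of the same idea without the prints, run on a shorter list. The function name sublists_under and the sample list are made up for this illustration, and it assumes non-negative numbers (a branch is abandoned as soon as its running total reaches the limit).

```python
def sublists_under(limit, items, running_sum=0):
    """Return every order-preserving sublist whose total, added to running_sum, stays below limit."""
    found = []
    for index, element in enumerate(items):
        total = running_sum + element
        if total < limit:
            # The element alone (on top of what was already chosen) is valid
            found.append([element])
            # Try to extend it with the elements to its right
            for tail in sublists_under(limit, items[index + 1:], total):
                found.append([element] + tail)
    return found

small = [80, 30, 20, 10]
results = sublists_under(100, small)
print(results)
# [[80], [80, 10], [30], [30, 20], [30, 20, 10], [30, 10], [20], [20, 10], [10]]
```

[80, 20] is missing from the output because its sum is exactly 100, which does not satisfy the strict "less than 100" condition.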

select column in pandas dataframe whose nested list value matches a given value [duplicate]

In the following example, how do I keep only rows that have "a" in the array present in column tags?
df = pd.DataFrame(columns=["val", "tags"], data=[[5,["a","b","c"]]])
df[3<df.val] # this works
df["a" in df.tags] # is there an equivalent for filtering on tags?
I think using sets is intuitive. Then you can use >= as set containment:
df[df.tags.apply(set) >= {'a'}]
val tags
0 5 [a, b, c]
A NumPy alternative would be
import numpy as np
tags = df['tags']
n = len(tags)
out = np.zeros(n, bool)  # plain bool; the np.bool8 alias is deprecated in newer NumPy releases
i = np.arange(n).repeat(tags.str.len())
np.logical_or.at(out, i, np.concatenate(tags) == 'a')
df[out]
Per @JonClements
You can use set.issubset in a map (very clever):
df[df.tags.map({'a'}.issubset)]
val tags
0 5 [a, b, c]
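Both set-based answers rely on the same test. A small sketch on plain lists (the sample rows are invented for illustration) shows what >= and issubset compute for each row:

```python
rows = [["a", "b", "c"], ["b", "c"], ["a"]]

# set(tags) >= {"a"} is True when the row's set contains every member of {"a"}
mask_superset = [set(tags) >= {"a"} for tags in rows]

# {"a"}.issubset accepts any iterable, so the lists need no conversion
mask_subset = list(map({"a"}.issubset, rows))

print(mask_superset)  # [True, False, True]
print(mask_subset)    # [True, False, True]
```

Either list of booleans has the shape that df[...] expects as a row mask.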
Use a list comprehension:
df1 = df[["a" in x for x in df.tags]]
You could use apply with a lambda function that tests whether 'a' is in its argument:
df.tags.apply(lambda x: 'a' in x)
Result:
0 True
Name: tags, dtype: bool
This can also be used to index your dataframe:
df[df.tags.apply(lambda x: 'a' in x)]
Result:
val tags
0 5 [a, b, c]

Fixing a 4 nested for loops in Python

So I'm trying to implement the agglomerative clustering algorithm, and to check the distances between each pair of clusters I use this:
a, b = None, None
c = max
for i in range(len(map)-1):
    for n in range(len(map[i])):
        for j in range(i+1, len(map)):
            for m in range(len(map[j])):
                # dist is distance func.
                d = dist(map[i][n], map[j][m])
                if c > d:
                    a, b, c = i, j, d
print(a, ' ', b)
return a, b
map looks like this: { 0: [[1,2,3], [2,2,2]], 1: [[3,3,3]], 2: [[4,4,4], [5,5,5]] }
What I expect from this is for each row item to compare with every row/col of every other row. So something like this:
comparisons:
[1,2,3] and [3,3,3], [1,2,3] and [4,4,4], [1,2,3] and [5,5,5], [2,2,2] and [3,3,3] and so on
When I run this, it works only the first time and fails on every subsequent try at line 6 with a KeyError.
I suspect that the problem is either here or in merging clusters.
If map is a dict, you have a general problem with your indexing:
for m in range(len(map[j])):
You use range() to create numerical indices. However, in this example j needs to be a valid key of the dictionary map.
EDIT:
That is, of course, assuming that you did not use 0-based incremented integers as the keys of map, in which case you might as well have used a list. In general you seem to be relying on the ordering provided by a list or an OrderedDict (or a dict in Python 3.6+, where it is an implementation detail; it is guaranteed from Python 3.7). See for j in range(i+1, len(map)): as a good example. Therefore I would advise using a list.
EDIT 2: Alternatively, create a list from map.keys() and use it to index the map:
a, b = None, None
c = max
keys = list(map.keys())
for i in range(len(map)-1):
    for n in range(len(map[keys[i]])):
        for j in range(i+1, len(map)):
            for m in range(len(map[keys[j]])):
                #dist is distance func.
                d = dist(map[keys[i]][n], map[keys[j]][m])
                if c > d:
                    a, b, c = i, j, d
print(a, ' ', b)
return a, b
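Another way to avoid positional indexing altogether is to iterate over pairs of keys directly. The following is only a sketch under stated assumptions: dist is shown here as a Euclidean distance because the original function is not given, the names closest_clusters and clusters are made up, and the sample keys are deliberately non-contiguous (as they might be after a merge) to show that no KeyError occurs.

```python
import math
from itertools import combinations

def dist(p, q):
    # Hypothetical stand-in for the question's distance function (Euclidean)
    return math.dist(p, q)

def closest_clusters(clusters):
    """Return the keys of the two clusters with the smallest point-to-point distance."""
    best_pair, best_dist = (None, None), float("inf")
    # combinations yields each unordered pair of keys exactly once
    for key_a, key_b in combinations(clusters, 2):
        for p in clusters[key_a]:
            for q in clusters[key_b]:
                d = dist(p, q)
                if d < best_dist:
                    best_pair, best_dist = (key_a, key_b), d
    return best_pair

clusters = {0: [[1, 2, 3], [2, 2, 2]], 3: [[3, 3, 3]], 7: [[5, 5, 5], [6, 6, 6]]}
pair = closest_clusters(clusters)
print(pair)  # (0, 3)
```

Unlike the loop above, this returns the dictionary keys themselves rather than positions in the key list, and it starts from float("inf") instead of comparing against max.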
Before accessing map[j], check whether the key is valid, like:
if j in map:
    # whatever
or put it in a try/except:
try:
    # ...
except KeyError:
    # ....
Edit:
It's better to write the for loop like this:
for i in map.keys():
    # .....

Sum values of array using zip

I have an array of arrays like this.
I want to sum a specific field (like the 3rd in each list).
data = [['d', 408.56087701, 87.26907024],
        ['b', 277.95015117, 75.19386881],
        ['b', 385.41416264, 84.73488504],
        ['b', 380.31630662, 71.23504808],
        ['b', 392.10729207, 83.80720357],
        ['b', 399.70877373, 76.59640833],
        ['b', 350.93124656, 79.34979059],
        ['b', 330.09702335, 79.37166555]]
I am trying this, but it causes a problem because I have to select only the 3rd item in each list (the first field is a string):
data = [sum(x) for x in zip(*data)]
I have to add a condition to specify that x is the third item in each sublist.
the_sum = sum(x[2] for x in data)
Or, if you're wondering why zip(*...) seemed like a good idea in the first place:
the_sum = sum(list(zip(*data))[2])  # in Python 3, zip returns an iterator, so wrap it in list() before indexing
Although that's more wasteful of memory.
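A short sketch with made-up numbers shows that both forms give the same total; the variable names are chosen for this illustration only.

```python
data = [["d", 1.5, 2.0],
        ["b", 2.5, 3.0],
        ["b", 4.0, 5.5]]

# Generator expression: reads only the third item of each row
third_sum = sum(row[2] for row in data)

# zip(*data) transposes rows into columns; every column is built in memory
columns = list(zip(*data))
third_sum_zip = sum(columns[2])

print(third_sum, third_sum_zip)  # 10.5 10.5
```

The transposed form materializes the string column and the second column as well, which is why it uses more memory than the generator expression.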
