Remove only half of specific adjacent duplicates in python list - python

I have a tool that outputs some data. Whenever '10' occurs in the data, the tool emits an extra '10', so each real '10' appears as '10', '10'. Sometimes there are four consecutive '10' values, which means there are actually two.
While reading the data I am trying to remove these duplicates. So far I have learnt how to remove a duplicate when exactly two adjacent duplicates are found, but when a longer even-length run is found I want to keep half of it.
x = ['10', '10', '00', 'DF', '20', '10', '10', '10', '10', ....]
Expected output:
['10', '00', 'DF', '20', '10', '10', ..]

You can use groupby() from itertools:
from itertools import groupby

X = ['10', '10', '00', 'DF', '20', '10', '10', '10', '10']
result = []
for k, g in groupby(X):
    group = list(g)
    if k == '10':
        # Keep half of each run of '10' (rounded up for odd-length runs).
        result.extend(group[:(len(group) + 1) // 2])
    else:
        result.extend(group)
print(result)
This gives:
['10', '00', 'DF', '20', '10', '10']

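A pure Python approach (note that it halves runs of any repeated value, not only '10'):
ls = []
dupe = True  # True when the next adjacent repeat should be dropped
for item in x:
    if ls and ls[-1] == item and dupe:
        dupe = False
        continue
    dupe = True
    ls.append(item)
print(ls)
This gives:
['10', '00', 'DF', '20', '10', '10']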
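If only one value should be treated this way, the same flag-based loop can be restricted to that value. The following is a minimal sketch, not part of the original answers; the function name halve_runs and the target parameter are hypothetical names chosen for illustration.

```python
def halve_runs(items, target="10"):
    """Drop every second element in each run of `target`; keep everything else."""
    result = []
    skip_next = False  # True right after a kept `target`, so its duplicate is dropped
    for item in items:
        if item == target and skip_next:
            skip_next = False
            continue
        result.append(item)
        skip_next = (item == target)
    return result


x = ['10', '10', '00', 'DF', '20', '10', '10', '10', '10']
print(halve_runs(x))  # ['10', '00', 'DF', '20', '10', '10']
```

With this variant, adjacent duplicates of other values (for example two consecutive '00' entries) are left untouched.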

Related

How to get all the substrings in string using Regex in Python

I have a string such as "12345".
Using a regex, how can I get all of its substrings that consist of one to three consecutive characters, to get output such as:
'1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345'
You can use re.findall with a positive lookahead pattern that captures a fixed number of characters, iterating that number from 1 to 3:
[match for size in range(1, 4) for match in re.findall('(?=(.{%d}))' % size, s)]
However, it would be more efficient to use a list comprehension with nested for clauses to iterate through all the sizes and starting indices:
[s[start:start + size] for size in range(1, 4) for start in range(len(s) - size + 1)]
Given s = '12345', both of the above would return:
['1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345']
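To try both approaches side by side, here is a small self-contained sketch. The function names substrings_regex and substrings_slices and the max_len parameter are made up for this example. One caveat: by default `.` does not match a newline, so the two versions can differ on strings that contain newlines.

```python
import re


def substrings_regex(s, max_len=3):
    # A lookahead consumes no characters, so overlapping matches are all found.
    return [match
            for size in range(1, max_len + 1)
            for match in re.findall('(?=(.{%d}))' % size, s)]


def substrings_slices(s, max_len=3):
    # Plain slicing: every start index for every size.
    return [s[start:start + size]
            for size in range(1, max_len + 1)
            for start in range(len(s) - size + 1)]


s = '12345'
print(substrings_regex(s) == substrings_slices(s))  # True
```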

Search exact string in list

I am doing an exercise where I need to search for the exact function name from the list fun and get the corresponding information from another list, detail.
Here is the dynamic list detail:
csvCpReportContents = [
    ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15'],
    ['rand', '10', '11', '12'],
    ['__random_r', '23', '45'],
    ['__random', '10', '11', '12'],
    [],
    ['multiply_matrices()', '23', '45']]
Here is the list fun, which contains the function names to be searched:
fun = ['multiply_matrices()', '__random_r', '__random']
Expected output for fun[2]:
['__random', '10', '11', '12']
Expected output for fun[1]:
['__random_r', '23', '45']
Here is what I have tried for fun[2]:
import re

for i in range(0, len(csvCpReportContents)):
    row = csvCpReportContents[i]
    if len(row) != 0:
        search1 = re.search("\\b" + str(fun[2]).strip() + "\\b", str(row))
        if search1:
            print(csvCpReportContents[i])
Please suggest how to search for the exact word and fetch only that information.
For each function name in fun, you can just iterate through the CSV list and check whether the first element of each row starts with it:
csvCpReportContents = [
    ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15'],
    ['rand', '10', '11', '12'],
    [],
    ['multiply_matrices()', '23', '45']]
fun = ['multiply_matrices()', '[PLT] rand', 'rand']
for f in fun:
    for c in csvCpReportContents:
        # Skip empty rows, then compare against the first column only.
        if len(c) and c[0].startswith(f):
            print(f'fun function {f} is in csv row {c}')
OUTPUT
fun function multiply_matrices() is in csv row ['multiply_matrices()', '23', '45']
fun function [PLT] rand is in csv row ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15']
fun function rand is in csv row ['rand', '10', '11', '12']
Updated code, since the test cases and requirement in the question changed. My first answer was based on your original test cases, where you wanted to match rows that start with an item from fun. The requirement now seems to be an exact match, falling back to a starts-with match if there is no exact match. The code below handles that scenario. Next time, please be clear in your question and don't change the criteria after several people have answered.
csvCpReportContents = [
    ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15'],
    ['rand', '10', '11', '12'],
    ['__random_r', '23', '45'],
    ['__random', '10', '11', '12'],
    [],
    ['multiply_matrices()', '23', '45']]
fun = ['multiply_matrices()', '__random_r', '__random', 'asd']
for f in fun:
    result = []
    for c in csvCpReportContents:
        if len(c):
            if f == c[0]:
                # An exact match always wins.
                result = c
            elif not result and c[0].startswith(f):
                # Fall back to the first starts-with match.
                result = c
    if result:
        print(f'fun function {f} is in csv row {result}')
    else:
        print(f'fun function {f} is not found in csv')
OUTPUT
fun function multiply_matrices() is in csv row ['multiply_matrices()', '23', '45']
fun function __random_r is in csv row ['__random_r', '23', '45']
fun function __random is in csv row ['__random', '10', '11', '12']
fun function asd is not found in csv
The input above is a nested list, so you have to use 2D indexing. For example, given
l = [[1,2,3,4],[2,5,7,9]]
to get the element 3
you have to use the index l[0][2]
With a custom search_by_func_name function:
csvCpReportContents = [
    ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15'],
    ['rand', '10', '11', '12'],
    [],
    ['multiply_matrices()', '23', '45']]
fun = ['multiply_matrices()', '[PLT] rand', 'rand']
def search_by_func_name(name, content_list):
    for lst in content_list:
        if any(i.startswith(name) for i in lst):
            return lst
print(search_by_func_name(fun[1], csvCpReportContents)) # ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15']
print(search_by_func_name(fun[2], csvCpReportContents)) # ['rand', '10', '11', '12']
You can also use a call_fun function, as in the code below.
def call_fun(fun_name):
    for ind, i in enumerate(csvCpReportContents):
        if i:
            if i[0].startswith(fun_name):
                return csvCpReportContents[ind]
# call_fun(fun[2])
# ['rand', '10', '11', '12']
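When only exact matches on the first column are wanted, a dictionary keyed by that column avoids any prefix ambiguity between names such as __random and __random_r. This is a sketch under that assumption, not taken from the answers above; index_by_name and rows_by_name are hypothetical names.

```python
# Rows copied from the question.
csvCpReportContents = [
    ['[PLT] rand (DEBUG INFO NOT FOUND)', '11', '15'],
    ['rand', '10', '11', '12'],
    ['__random_r', '23', '45'],
    ['__random', '10', '11', '12'],
    [],
    ['multiply_matrices()', '23', '45']]


def index_by_name(rows):
    """Map each row's first element to the row, keeping the first occurrence."""
    index = {}
    for row in rows:
        if row:  # skip empty rows
            index.setdefault(row[0], row)
    return index


rows_by_name = index_by_name(csvCpReportContents)
print(rows_by_name.get('__random'))    # ['__random', '10', '11', '12']
print(rows_by_name.get('__random_r'))  # ['__random_r', '23', '45']
print(rows_by_name.get('asd'))         # None
```

The index is built once, so each later lookup does not rescan the list.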

Extend list of lists only if first item of new list is unique

I'm working on parsing an output file for a NCBI Blast Search for a bioinformatics application. Essentially, the search takes a template genetic sequence and finds a series of sequences (contigs) with significant similarity to the template sequence.
In order to extract the many matches for contigs, my goal is to create a list of lists with the following format:
'[(contig #), (frame #), (first character # of the subject ("Sbjct")), (last character # of the subject ("Sbjct"))]'
e.g. the output sublist for a given section with contig #1568, frame = -1, starting on character #5509 of the subject and ending on character #3914 of the subject is:
[1568,-1,5509,3914]
In this question I've left off the final item of the sublists. My challenge is that because there are multiple readout files, sometimes containing the same contig as other files, the list of lists that I'm creating sometimes gets extended with the same contig twice. Let me explain.
As depicted in the posted code block below, I tried to only add a new sublist if the sublist was unique (not already present). The issue I think I had with that is that all of the items in a sublist were compared to all of the items in the other sublist. This led to duplicates owing to the fact that although the contig # was the same, the other parameters were not the same. I just want the first sublist with a particular contig # to be the one it keeps without regard to the other parameters.
for ind, line in enumerate(contents, 1):
    if re.search("(.*)>(.*)", line):
        c1 = line.split('[')
        c2 = c1[1].split(']')
        c3 = c2[0]
        my_line = getline(file.name, ind + 5)
        f1 = my_line.split('= ')
        if '+' in f1[1]:
            f2 = f1[1].split('+')
            f3 = f2[1].split('\n')[0]
        else:
            f3 = f1[1].split('\n')[0]
        my_line2 = getline(file.name, ind + 7)
        q1 = my_line2.split(' ')[2]
        my_line3 = getline(file.name, ind - 3)
        l1 = [c3, f3, q1]
        if l1 not in x:
            x.extend([l1])
Here is what I received for my actual output:
[['1568', '-1', '12'], ['0003', '1', '12'], ['0130', '3', '12'], ['0097', '1', '20'], ['0512', '3', '11'], ['0315', '-1', '296'], ['0118', '-2', '52'], ['0308', '-3', '488'], ['1568', '-1', '1'], ['0003', '1', '1'], ['0130', '3', '4'], ['0097', '1', '28'], ['0512', '3', '23'], ['0315', '-1', '21'], ['0118', '-2', '39'], ['0102', '-3', '293'], ['0495', '-1', '146'], ['0386', '-3', '146']]
And here is what I expected:
[['1568', '-1', '12'], ['0003', '1', '12'], ['0130', '3', '12'], ['0097', '1', '20'], ['0512', '3', '11'], ['0315', '-1', '296'], ['0118', '-2', '52'], ['0308', '-3', '488'], ['0102', '-3', '293'], ['0495', '-1', '146'], ['0386', '-3', '146']]
How might I only add a sublist if the first item of the new sublist isn't in any of the other sublists? Please help!
As a quick fix, replace the line:
if l1 not in x:
with:
if (not any(c3 == temp[0] for temp in x)):
This checks whether c3 (the first element of the l1 sublist) already appears as the first element of any sublist temp in x, and adds l1 only if it does not.
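For larger inputs, scanning every stored sublist on each insertion can get slow. One common alternative is to track the contig IDs already seen in a set. The sketch below assumes the sublists are already parsed into lists of strings as in the question; keep_first_per_contig is a hypothetical helper name, and the sample rows are a few entries taken from the question's output.

```python
def keep_first_per_contig(rows):
    """Return rows in order, keeping only the first row for each contig ID."""
    seen = set()   # contig IDs (first element of each row) already kept
    unique = []
    for row in rows:
        contig = row[0]
        if contig not in seen:
            seen.add(contig)
            unique.append(row)
    return unique


rows = [['1568', '-1', '12'], ['0003', '1', '12'],
        ['1568', '-1', '1'], ['0102', '-3', '293']]
print(keep_first_per_contig(rows))
# [['1568', '-1', '12'], ['0003', '1', '12'], ['0102', '-3', '293']]
```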

using a for loop to compare lists

The problem at hand is I have a list of lists that I need to iterate through and compare one by one.
def stockcheck():
    stock = open("Stock.csv", "r")
    reader = csv.reader(stock)
    stockList = []
    for row in reader:
        stockList.append(row)
The output from print(stockList) is:
[['Product', 'Current Stock', 'Reorder Level', 'Target Stock'], ['plain blankets', '5', '10', '50'], ['mugs', '15', '20', '120'], ['100m rope', '60', '15', '70'], ['burner', '90', '20', '100'], ['matches', '52', '10', '60'], ['bucket', '85', '15', '100'], ['spade', '60', '10', '65'], ['wood', '100', '10', '200'], ['sleeping bag', '50', '10', '60'], ['chair', '30', '10', '60']]
I've searched the basics for this but I've had no luck... I'm sure the solution is simple but it's escaping me! Essentially I need to check whether the current stock is less than the reorder level, and if it is, save it to a CSV (that part I can do no problem).
for item in stockList:
    if stockList[1][1] < stockList[1][2]:
        print("do the add to CSV jiggle")
This is as much as I can do, but it doesn't iterate through... Any ideas? Thanks in advance!
Iterate through stockList using a list comprehension and print the results:
[sl for sl in stockList[1:] if sl[1] < sl[2]]
You will get the following results:
[['mugs', '15', '20', '120']]
In case you were wondering stockList[1:] is to ensure that you ignore the header.
However, you must note that the values are strings that are being compared. Hence, the values are compared char by char. If you want integer comparisons then you must convert the strings to integers, assuming you are absolutely sure that sl[1] and sl[2] will always be integers - just being presented as strings. Just try doing:
[sl for sl in stockList[1:] if int(sl[1]) < int(sl[2])]
The result changes:
[['plain blankets', '5', '10', '50'], ['mugs', '15', '20', '120']]
Use [1:] to skip the header, and then make the comparison, converting the values to int so that they are compared numerically rather than as strings:
for item in stockList[1:]:
    if int(item[1]) < int(item[2]):
        print(item)
        print("do the add to CSV jiggle")
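Another option is csv.DictReader, which uses the header row as keys, so the columns can be referred to by name instead of by index. The sketch below is an illustration only: it reads from an in-memory stand-in for Stock.csv containing a few rows from the question, and needs_reorder is a hypothetical function name.

```python
import csv
import io

# In-memory stand-in for Stock.csv, using a few rows from the question.
sample = io.StringIO(
    "Product,Current Stock,Reorder Level,Target Stock\n"
    "plain blankets,5,10,50\n"
    "mugs,15,20,120\n"
    "100m rope,60,15,70\n"
)


def needs_reorder(csv_file):
    """Return product names whose current stock is below the reorder level."""
    reader = csv.DictReader(csv_file)  # the header row becomes the dict keys
    return [row["Product"] for row in reader
            if int(row["Current Stock"]) < int(row["Reorder Level"])]


print(needs_reorder(sample))  # ['plain blankets', 'mugs']
```

With a real file, the same function could be passed an object opened with open("Stock.csv", newline="").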

How can I sort integers in a list, if they are in a string?

I have a controlled assessment and need to be able to order scores from a test in numerical and alphabetical order. How do I do this if the scores are connected to the name of the person who completed the quiz? All names are within one list, for example ["John, 9"], ["alfie, 6"], etc.
Any help much appreciated!
If you want to sort a list of strings based on a transformation on each of these strings, you can use the function sorted with the key keyword argument:
>>> l = ['10', '9', '100', '8']
>>> sorted(l)
['10', '100', '8', '9']
>>> sorted(l, key=int)
['8', '9', '10', '100']
>>> def transformation(x):
...     return -int(x)
...
>>> sorted(l, key=transformation)
['100', '10', '9', '8']
With a key function, the strings are not compared directly; instead, the values returned by the function for each string are compared.
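Applied to entries shaped like the ones in the question, the key function can pull out either the score or the name. This sketch assumes a flat list of "name, score" strings; the entry "Beth, 10" and the helper names score_of and name_of are made up for illustration.

```python
scores = ["John, 9", "alfie, 6", "Beth, 10"]  # "Beth, 10" is a made-up entry


def score_of(entry):
    # "John, 9" -> 9 (int() tolerates the leading space)
    return int(entry.split(",")[1])


def name_of(entry):
    # "alfie, 6" -> "alfie"; lower-cased so capitalisation does not affect order
    return entry.split(",")[0].strip().lower()


print(sorted(scores, key=score_of, reverse=True))
# ['Beth, 10', 'John, 9', 'alfie, 6']
print(sorted(scores, key=name_of))
# ['alfie, 6', 'Beth, 10', 'John, 9']
```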
