Lining up DNA base pairs in an array - python

Trying to import a series of base pairs into an array.
I want it in the form ['AA','AT','AG','AC'...]
Here's my code:
paths = [str(x[4:7]) for x in mm_start]
paths
['A A', 'A T', 'A G', 'A C', 'T A', 'T T', 'T G', 'T C', 'G A', 'G T', 'G G', 'G C', 'C A', 'C T', 'C G', 'C C']
I get spaces in between the letters!
This replace call isn't helping either.
paths = str(paths).replace(" ","")
paths
"['AA','AT','AG','AC','TA','TT','TG','TC','GA','GT','GG','GC','CA','CT','CG','CC']"
Now I get a (") at the beginning and end of this array.
Any ideas very welcome!
The text file has the base pairs laid out
1 2 3 4 A A
1 2 3 1 A T
...
Thanks

You're converting the list into a string. You really want to say
paths = [pair.replace(" ", "") for pair in paths]
That iterates over each pair in your list of strings and removes spaces rather than converting the entire list to a string.

paths = str(paths).replace(" ","")
Well of course it doesn't work, because you converted the list to a string (and you didn't convert it back). What you need to do is apply the replace to each element of the list, not to the string expression of the list.
paths = [str(x[4:7]).replace(" ","") for x in mm_start]
(that's only one way of many).
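As a sketch of doing the same thing straight from the text file: assuming every non-blank line looks like the sample in the question (four numbers followed by two bases, separated by whitespace), splitting on whitespace and joining the last two fields avoids the spaces altogether. The function name `base_pairs` and the inline sample lines are made up for illustration.

```python
def base_pairs(lines):
    """Return the last two whitespace-separated fields of each line, joined."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append("".join(fields[-2:]))
    return pairs


# Sample lines in the assumed file layout (not real data).
sample = ["1 2 3 4 A A", "1 2 3 1 A T", ""]
print(base_pairs(sample))  # ['AA', 'AT']
```

With a real file, the same function could be fed an open file object, since iterating over a file yields its lines.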

Related

Sorting a subset of a list of strings

I have a list of strings containing the column names of a specific dataframe, and I want to sort a subset of the list so that it follows a certain standard format.
Specifically, to clarify things here is an example :
Input list :
array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']
Desired output :
array = ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']
In other words there is a subset of the array that should be left untouched (ie var1 var2 var3) and another subset that should be sorted first by the element after the whitespace and then by the element preceding whitespace.
How could this be done efficiently ?
In this specific case:
array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']
array[3:] = sorted(array[3:], key=lambda s:s.split()[::-1])
The various parts of this should be straightforward: the fourth element onwards is replaced with the sorted remainder of the list, according to a custom key. The key splits each element on whitespace and compares the pieces in reverse order, so the last piece takes priority.
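To see what the key does, it can help to print it for a few elements. This is only an illustration of the snippet above; the name `reversed_parts` is made up here.

```python
array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']


def reversed_parts(s):
    # '2010 a' -> ['a', '2010'], so the letter is compared before the year
    return s.split()[::-1]


keys = [reversed_parts(s) for s in array[3:]]
print(keys[:3])  # [['a', '2010'], ['b', '2010'], ['c', '2010']]

# Only the slice from index 3 onwards is sorted; the first three stay put.
array[3:] = sorted(array[3:], key=reversed_parts)
print(array)
```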
This solution assumes that 'var...' (or whatever you want to leave untouched) can appear anywhere.
Extract the elements you want to sort, remember their indexes, sort them, then put them back:
lst = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']
where, what = zip(*((i, x) for i, x in enumerate(lst) if not x.startswith('var')))
what = sorted(what, key=lambda x: x.split()[::-1])
for i, x in zip(where, what):
    lst[i] = x
print(lst)
# ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']
from bisect import insort_left
def sort_second(string_list: list):
    """
    Sort a list of strings according to the value after the space
    """
    output = []
    sorting_dict = {}
    for string in string_list:
        try:
            # split the first and the second values
            value, key = string.split(" ")
            try:
                # something with the same key has already been read in
                # sort this value into a list with the other value(s) from
                # the same key
                insort_left(sorting_dict[key], value)
            except KeyError:
                # nothing else with the same key has been read in yet
                sorting_dict[key] = [value]
        except ValueError:
            # split didn't give two parts, therefore must be a single value entry
            output.append(string)
    # for loop sorts second key
    for key in sorted(sorting_dict.keys()):
        for value in sorting_dict[key]:
            # list contains values sorted according to the first key
            output.append(" ".join((value, key)))
    return output
I'd need to run some tests but this does the job and should be reasonably quick.
I have used dict as opposed to ordereddict because ordereddict is implemented in python rather than C
I think the time complexity is O(n log n), but I'm not sure what kind of sort sorted() uses, so it may be worse (if it is, I'm fairly sure there will be something else to do the job more efficiently).
Edit:
I accidentally stated the time complexity as O(log n) in the original post; as Kelly pointed out, this is impossible. The correct time complexity (as edited above) is O(n log n).
Given:
array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']
desired = ['var1', 'var2', 'var3', '2010 a', '2011 a', '2010 b', '2011 b', '2010 c', '2011 c']
Three easy ways.
First, with a split and find if there is a second element:
>>> sorted(array, key=lambda e: "" if len(e.split())==1 else e.split()[1])==desired
True
Or use partition and use the last element:
>>> sorted(array, key=lambda e: e.partition(' ')[2])==desired
True
Or with a regex to remove the first element:
>>> import re
>>> sorted(array, key=lambda e: re.sub(r'^\S+\s*','', e))==desired
True
All three rely on the fact that Python's sort is stable.
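Stability here means that elements whose keys compare equal keep their original relative order. A small check, reusing the partition key from above (the name `suffix` is only for illustration):

```python
array = ['var1', 'var2', 'var3', '2010 a', '2010 b', '2010 c', '2011 a', '2011 b', '2011 c']


def suffix(e):
    # Everything after the first space; '' when there is no space.
    return e.partition(' ')[2]


result = sorted(array, key=suffix)

# 'var1', 'var2', 'var3' all have key '' and keep their input order.
print(result[:3])   # ['var1', 'var2', 'var3']
# '2010 a' and '2011 a' share key 'a' and also keep their input order.
print(result[3:5])  # ['2010 a', '2011 a']

# The years themselves are never compared, so unsorted years stay unsorted.
unsorted_years = sorted(['2011 a', '2010 a'], key=suffix)
print(unsorted_years)  # ['2011 a', '2010 a']
```

In other words, these keys give the desired result only because the years already appear in ascending order in the input.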

How to split by newline and ignore blank lines using regex?

Let's say I have this data:
data = '''a, b, c
d, e, f
g. h, i
 
j, k , l


'''
The 4th line contains a single space; the 6th and 7th lines contain no characters at all, just a newline.
Now when I split the same using splitlines
data.splitlines()
I get
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '']
However expected was just
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Is there a simple solution using regular expressions to do this?
Please note that I know the other way of doing this, by filtering empty strings from the output of splitlines().
I am not sure whether the same can be achieved using regex.
When I use regex to split on new line, it gives me
import re
re.split("\n", data)
Output:
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '', '', '']
I disagree with your assessment that filtering is more complicated than using regular expressions. However, if you really want to use regex, you could split at multiple consecutive newlines like so:
>>> re.split(r"\n+", data)
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l', '']
Unfortunately, this leaves an empty string at the end of your list.
To get around this, use re.findall to find everything that isn't a newline:
>>> re.findall(r"([^\n]+)", data)
['a, b, c', 'd, e, f', 'g. h, i', ' ', 'j, k , l']
Since that regex still keeps the line that contains only a space, here's one that skips whitespace-only lines as well:
>>> re.findall(r"^([ \t]*\S.*)$", data, re.MULTILINE)
['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Here's the explanation:
^([ \t]*\S.*)$
^ $ : Start of line and end of line
( ) : Capturing group
[ \t]* : Zero or more spaces or tabs (i.e. whitespace that isn't a newline)
\S : One non-whitespace character
.* : Zero or more of any character
List comprehension approach
You can build the list with a condition that keeps only the elements that are neither empty nor whitespace-only.
If a line is still truthy after stripping its whitespace, it is not an empty string, so it is added to the list.
filtered_data = [el for el in data.splitlines() if el.strip()]
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
Regexp approach
import re
p = re.compile(r"^([^\s]+.+)", re.M)
p.findall(data)
# ['a, b, c', 'd, e, f', 'g. h, i', 'j, k , l']
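The approaches above agree on the sample data, but they are not interchangeable in general. The following sketch runs three of them on a made-up input (the `sample` string is an assumption, not from the question) that contains an indented line and a single-character line:

```python
import re

# Made-up input: a normal line, a whitespace-only line, an indented line,
# a one-character line and a trailing empty line.
sample = "a, b\n \n  x, y\nz\n\n"

# Filter the output of splitlines() by truthiness after strip().
by_filter = [el for el in sample.splitlines() if el.strip()]

# Pattern that allows leading spaces/tabs before the first visible character.
by_anchor = re.findall(r"^([ \t]*\S.*)$", sample, re.MULTILINE)

# Pattern that demands a non-whitespace first character plus at least one more.
by_nonspace = re.findall(r"^([^\s]+.+)", sample, re.M)

print(by_filter)    # ['a, b', '  x, y', 'z']
print(by_anchor)    # ['a, b', '  x, y', 'z']
print(by_nonspace)  # ['a, b']
```

The last pattern drops lines that start with whitespace and lines that are only one character long, which may or may not matter for a given input.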

Merging items in list given condition

Let's say I have ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'].
If the second word in each element of the list is same as first word in any other elements in the list, they should be merged into one item. The order matters as well.
['A B C D E G', 'X Y Z'] should be the final product.
Letters will not form a cycle, i.e., ['A B', 'B C', 'C A'] is not possible.
As for why G was added at the end, let's say we are at 'A B C' and we see 'C D'. And later on in the list we have 'C G'. We add 'C D' since it comes first, so we have 'A B C D', and then if there is 'D E' we merge that to 'A B C D E'. Now at the end of a 'chain', we add 'C G' so we have 'A B C D E G'. If there had been a 'B H', since B comes first, then it would be 'A B C D E H G'.
Since order matters, ['A B', 'C A'] is still ['A B', 'C A'] since B does not connect to C.
Another example is:
If we have ['A B', 'A C'] then at step 1 we just have A B and then we see A C and we merge that into A B and we have A B C.
I have tried many things, such as dissecting each element into two lists, then indexing and removing etc. It is way too complex and I feel there should be a more intuitive way to solve this in Python. Alas, I can't seem to figure it out.
A simple algorithm solving this appears to be:
initialize results as empty list
repeat for each pair in input list:
    repeat for each sublist R in results:
        if R contains the first item of pair, append second item to R and continue with next pair
    if no sublist R contained the first item of pair, append pair as new sublist to results
The implementation in Python is straightforward and should be self-explanatory:
def merge_pairs(pairs):
    results = []
    for pair in pairs:
        first, second = pair.split()
        for result in results:
            if first in result:
                result.append(second)
                break
        else:
            results.append([first, second])
    return [' '.join(result) for result in results]
The only extra steps are the conversions between space-separated letters and lists of letters by .split() and ' '.join().
Note the else clause of the for statement which is only executed if no break was encountered.
Some examples:
>>> merge_pairs(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G'])
['A B C D E G', 'X Y Z']
>>> merge_pairs(['A B', 'A C'])
['A B C']
>>> merge_pairs(['B C', 'A B'])
['B C', 'A B']
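The for ... else construct mentioned above can be seen in isolation in this small sketch (the helper `first_group_with` is hypothetical and not part of the answer): the else branch runs only when the loop finishes without hitting break.

```python
def first_group_with(groups, item):
    """Return the index of the first group containing item, or None."""
    for index, group in enumerate(groups):
        if item in group:
            break  # found: the else branch below is skipped
    else:
        # Reached only if the loop never hit break (also when groups is empty).
        return None
    return index


groups = [['A', 'B', 'C'], ['X', 'Y']]
print(first_group_with(groups, 'Y'))  # 1
print(first_group_with(groups, 'Q'))  # None
```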
A bit messy, but this works.
a = ['AB', 'BC', 'XY', 'CD', 'YZ', 'DE', 'CG']
used_item = []
result = []
for f, fword in enumerate(a):
    for s, sword in enumerate(a):
        if f == s:
            continue
        if fword[1] == sword[0]:
            if f in used_item or s in used_item:
                idx = [i for i, w in enumerate(result) if fword[1] in w][0]
                result = [r + sword[1] if i == idx else r for i, r in enumerate(result) ]
                used_item.append(f)
                used_item.append(s)
            else:
                result.append(fword+sword[1])
                used_item.append(f)
                used_item.append(s)
print(result)
The output is ['ABCDGE', 'XYZ'].
You could sort 'ABCDGE' if necessary.
I appreciate all the above answers and wanted to attempt the question using Graph Theory.
This question is a classic example of connected components, where we can assign a unique identifier to each node (nodes are the different characters in the list = A, B, C, .... Z) accordingly.
Also, as we need to maintain the order, DFS is the correct choice to proceed with.
Algorithm:
1. Treat the characters A, B, C, .... Z as 26 distinct nodes.
2. Perform DFS on each node.
    2.1 If the node has not been assigned a unique identifier:
        2.1.1. Assign a new unique identifier to the node.
        2.1.2. dfs(node)
    2.2 Else:
        do nothing
The function dfs() will group the nodes according to the unique identifier. While performing DFS, the child gets the same unique identifier as its parent.
Have a look at the following implementation:
class Graph:
    def __init__(self, n):
        self.number_of_nodes = n
        self.visited = []
        self.adjacency_list = []
        self.identifier = [-1] * self.number_of_nodes
        self.merged_list = []
    def add_edge(self, edge_start, edge_end):
        if(self.encode(edge_end) not in self.adjacency_list[self.encode(edge_start)]):
            self.adjacency_list[self.encode(edge_start)] = self.adjacency_list[self.encode(edge_start)] + [self.encode(edge_end)]
    def initialize_graph(self):
        self.visited = [False] * self.number_of_nodes
        for i in range(0, self.number_of_nodes):
            self.adjacency_list = self.adjacency_list + [[]]
    def get_adjacency_list(self):
        return self.adjacency_list
    def encode(self, node):
        return ord(node) - 65
    def decode(self, node):
        return chr(node + 65)
    def dfs(self, start_index):
        if(self.visited[self.encode(start_index)] == True):
            return
        self.visited[self.encode(start_index)] = True
        for node in self.adjacency_list[self.encode(start_index)]:
            if(self.identifier[node] == -1):
                self.identifier[node] = self.identifier[self.encode(start_index)]
            if(self.visited[node] == False):
                self.merged_list[self.identifier[node]] = self.merged_list[self.identifier[node]] + self.decode(node)
            self.dfs(self.decode(node))
graph = Graph(26)
graph.initialize_graph()
input_list = ['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    edge_end = edge[1]
    graph.add_edge(edge_start, edge_end)
unique_identifier = 0
for inputs in input_list:
    edge = inputs.split()
    edge_start = edge[0]
    if(graph.identifier[graph.encode(edge_start)] == -1):
        graph.identifier[graph.encode(edge_start)] = unique_identifier
        graph.merged_list = graph.merged_list + [edge_start]
        unique_identifier = unique_identifier + 1
        graph.dfs(edge_start)
print(graph.merged_list)
Output:
['ABCDEG', 'XYZ']
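For comparison, here is a more compact sketch of the same depth-first idea using a dict for the adjacency list, so nodes need not be single uppercase letters. It is not a drop-in replacement for the class above and may order letters differently from what the question asks for on other inputs; the name `merge_by_dfs` is made up.

```python
def merge_by_dfs(pairs):
    # Build an adjacency list that preserves the order edges appear in.
    adjacency = {}
    for pair in pairs:
        start, end = pair.split()
        children = adjacency.setdefault(start, [])
        if end not in children:
            children.append(end)

    visited = set()

    def dfs(node, component):
        # Pre-order traversal: record the node, then follow its edges in order.
        visited.add(node)
        component.append(node)
        for child in adjacency.get(node, []):
            if child not in visited:
                dfs(child, component)

    merged = []
    for pair in pairs:
        start = pair.split()[0]
        if start not in visited:
            component = []
            dfs(start, component)
            merged.append(' '.join(component))
    return merged


print(merge_by_dfs(['A B', 'B C', 'X Y', 'C D', 'Y Z', 'D E', 'C G']))
# ['A B C D E G', 'X Y Z']
```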

How to find a substring in a string list?

I try to find some string in a list, but have problems because of word order.
list = ['a b c d', 'e f g', 'h i j k']
str = 'e g'
I need to find the 2nd item in a list and output it.
You can use combination of any() and all() to check the presence in one line:
>>> my_list = ['a b c d', 'e f g', 'h i j k']
>>> my_str = 'e g'
>>> any(all(s in sub_list for s in my_str.split()) for sub_list in my_list)
True
Here, the expression above returns True or False depending on whether every character of your string is present in some item of the list.
To also get that item as the return value, replace any() with a list comprehension:
>>> [sub_list for sub_list in my_list if all(s in sub_list for s in my_str.split())]
['e f g']
It'll return the list of strings containing your chars.
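Note that `s in sub_list` is a substring test on the whole string, so a query word such as 'e' would also match an item like 'eg h'. If whole-word matching is wanted, one possible variation (a sketch, not the answer's method) compares sets of words instead; `find_items` is a made-up name.

```python
def find_items(items, query):
    """Return the items that contain every word of query, in any order."""
    wanted = set(query.split())
    # set <= set is the subset test: every wanted word must be in the item.
    return [item for item in items if wanted <= set(item.split())]


my_list = ['a b c d', 'e f g', 'h i j k']
print(find_items(my_list, 'e g'))   # ['e f g']
print(find_items(['eg h'], 'e g'))  # []
```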
You can try:
for l in list:
    l_words = l.split(" ")
    if all([x in l_words for x in str.split(" ")]):
        print(l_words)
You can try this
list = ['a b c d', 'e f g', 'h i j k']
str = list[1].split()
for letter in str:
    print(letter)
This can be achieved by using sets and list comprehension
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
print([i for i in ls if len(set(s.replace(" ", "")).intersection(set(i.replace(" ", "")))) == len(s.replace(" ", ""))])
OR
ls = ['a b c d', 'e f g', 'h i j k']
s = 'e g'
s_set = set(s.replace(" ", ""))
print([i for i in ls if len(s_set.intersection(set(i.replace(" ", "")))) == len(s_set)])
Output
['e f g']
The list comprehension keeps only the items of ls that contain every character of s, and drops any item that is missing at least one of them.

Concatenating Arbitrary number of items of a string in Python

Given a list ['a','b','c','d','e','f'] and a number of divisions, say 2: for the first string I want to take elements 0, 2, 4 of the list and concatenate them separated by a space, and do the same for the second string with elements 1, 3, 5.
The output needs to be in the form of k = ["a c e", "b d f"]
The actual program takes in a string (e.g. {ball,bat,doll,choclate,bat,kite}) and the number of kids who take those gifts (e.g. 2), then divides the gifts so that the first kid takes a gift and goes to the back of the line, the second kid takes a gift and stands at the back, and so on until all kids have taken a gift. If gifts remain, the first kid takes another gift and the cycle continues.
Desired output for the above example: {"ball doll bat" , "bat choclate kite"}
Here is a general way to do this for any number of groups:
def merge(lst, ngroups):
    return [' '.join(lst[start::ngroups]) for start in range(ngroups)]
Here is how it's used:
>>> lst = ['a','b','c','d','e','f']
>>> merge(lst, 2)
['a c e', 'b d f']
>>> merge(lst, 3)
['a d', 'b e', 'c f']
lst = ['a','b','c','d','e','f']
k = [" ".join(lst[::2]), " ".join(lst[1::2])]
output:
['a c e', 'b d f']
A more generic solution:
def group(lst, n):
    return [" ".join(lst[i::n]) for i in range(n)]
lst = ['a','b','c','d','e','f']
print(group(lst, 3))
output:
['a d', 'b e', 'c f']
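Applied to the gift example from the question, the same slicing idea might look like the sketch below. The brace-and-comma input format is taken from the question's example, but the parsing and the name `distribute` are assumptions for illustration.

```python
def distribute(gifts_text, kids):
    """Deal gifts round-robin to the given number of kids."""
    # "{ball,bat}" -> ['ball', 'bat']; empty entries are ignored.
    gifts = [g.strip() for g in gifts_text.strip("{}").split(",") if g.strip()]
    # Kid number `start` gets gifts start, start + kids, start + 2 * kids, ...
    return [" ".join(gifts[start::kids]) for start in range(kids)]


print(distribute("{ball,bat,doll,choclate,bat,kite}", 2))
# ['ball doll bat', 'bat choclate kite']
```

When the gifts do not divide evenly, the kids at the front of the line simply end up with one more gift, which matches the cycle described in the question.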
