How to get all the substrings in a string using regex in Python

I have a string such as: "12345"
Using a regex, how can I get all of its substrings that consist of one to three consecutive characters, producing output such as:
'1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345'

You can use re.findall with a positive lookahead pattern that captures a run of characters whose length is iterated from 1 to 3:
[match for size in range(1, 4) for match in re.findall('(?=(.{%d}))' % size, s)]
However, it would be more efficient to use a list comprehension with nested for clauses to iterate through all the sizes and starting indices:
[s[start:start + size] for size in range(1, 4) for start in range(len(s) - size + 1)]
Given s = '12345', both of the above would return:
['1', '2', '3', '4', '5', '12', '23', '34', '45', '123', '234', '345']
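As a quick check that the two approaches agree, here is a minimal sketch that wraps each one in a function. The function names and the max_size parameter are illustrative and do not come from the answer above.

```python
import re

def substrings_regex(s, max_size=3):
    # The lookahead (?=(.{n})) is zero-width, so matches may overlap:
    # findall returns the captured n-character group at every position
    # where n characters are still available.
    return [m for size in range(1, max_size + 1)
            for m in re.findall('(?=(.{%d}))' % size, s)]

def substrings_slice(s, max_size=3):
    # Plain slicing: every start index for every size.
    return [s[start:start + size]
            for size in range(1, max_size + 1)
            for start in range(len(s) - size + 1)]

s = '12345'
print(substrings_regex(s) == substrings_slice(s))  # True
```

One difference worth knowing: without the re.DOTALL flag, . does not match a newline, so the regex version skips any substring containing '\n' while the slicing version keeps it.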

Related

How to get every string variations with list of possibilities for each position

Let's say I have a list of lists called master_list, which has the following properties:
Each list inside the master_list contains strings that are single digit positive integers
Each list inside the master_list has a length from 2 to 5
The master_list has a length from 1 to 8
What I want to do is return a list of string variations using the list of possibilities for each position.
Here is an example of a master_list and what the output would look like:
master_list = [['3', '2', '6'], ['6', '5', '3', '9'], ['9', '8', '6']]
# In this case the output would contain 3*4*3 = 36 elements
output = ["339","366","399","658","636","258","268","669","668","266","369","398",
"256","296","259","368","638","396","238","356","659","639","666","359",
"336","299","338","696","269","358","656","698","699","298","236","239"]
I've tried iterating through each list with nested for loops, then I realized that I would need to use recursion because the number of lists is variable. But I'm stuck on how to do that. The nested loops got me the output above, but it's essentially hard-coded for that case.
Try this (It uses itertools.product):
import itertools
li = [['3', '2', '6'], ['6', '5', '3', '9'], ['9', '8', '6']]
result = [''.join(i) for i in itertools.product(*li)]
print(result)
Outputs:
['369', '368', '366', '359', '358', '356', '339', '338', '336', '399', '398', '396', '269', '268', '266', '259', '258', '256', '239', '238', '236', '299', '298', '296', '669', '668', '666', '659', '658', '656', '639', '638', '636', '699', '698', '696']
If you want to convert each element into int:
result = [int(''.join(i)) for i in itertools.product(*li)]
itertools.product will give you all combinations as tuples, as in this example: All combinations of a list of lists.
You can then join each tuple into a single string, or build a number and convert it to a string.
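To see why no recursion is needed, here is a small sketch showing that unpacking master_list with * handles any number of sublists, and that the result size is the product of the sublist lengths. The variable names variations and expected_count are illustrative; math.prod assumes Python 3.8 or later.

```python
from itertools import product
from math import prod  # available from Python 3.8

master_list = [['3', '2', '6'], ['6', '5', '3', '9'], ['9', '8', '6']]

# product(*master_list) yields one tuple per choice of one item from each sublist.
variations = [''.join(combo) for combo in product(*master_list)]

# The number of variations is the product of the sublist lengths (3 * 4 * 3).
expected_count = prod(len(sub) for sub in master_list)

print(len(variations), expected_count)  # 36 36
print(variations[:4])                   # ['369', '368', '366', '359']
```

The order differs from the list in the question because product varies the last position fastest; the set of strings is the same size.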

How can I use regex to match only one character in Python?

I am trying to process a list of files
file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']
the goal is to find the files whose name has only one character.
I tried this code
r = re.compile('[0-9]')
result_list = list(filter(r.match, file_list))
result_list
and got
['9', '7', '6', '8', '01', '4', '3', '2', '5']
where '01' should not be included.
I made a workaround
tmp = []
for i in file_list:
    if len(i)==1:
        tmp.append(i)
tmp
and I got
['9', '7', '6', '8', '4', '3', '2', '5']
This is exactly what I want, although the method is ugly.
How can I use regex in Python to finish the task?
r = re.compile('^[0-9]$')
The ^ matches the beginning of the string and $ matches the end (with the re.MULTILINE flag they match at the start and end of each line instead).
And if you really want it to match any character, not just numbers, it should be
r = re.compile('^.$')
The . in the regex is a single-character wildcard.
Match a string if it's simply any single character appearing at the beginning of the string (^.) right before the end of the string ($):
^.$
Your Python then becomes:
r = re.compile('^.$')
result_list = list(filter(r.match, file_list))
Your loop is equivalent to this list comprehension:
[ i for i in file_list if len(i)==1]
This method works for any file whose name has exactly one character, whatever that character is.
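As an alternative to anchoring with ^ and $, re.fullmatch (Python 3.4+) requires the whole string to match the pattern. The sketch below uses the file_list from the question; the names digit and result_list are just for illustration.

```python
import re

file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']

digit = re.compile('[0-9]')
# fullmatch succeeds only if the entire name is a single digit.
result_list = [name for name in file_list if digit.fullmatch(name)]
print(result_list)  # ['9', '7', '6', '8', '4', '3', '2', '5']

# '$' also matches just before a trailing newline; fullmatch does not.
print(bool(re.match('^[0-9]$', '7\n')))    # True
print(bool(re.fullmatch('[0-9]', '7\n')))  # False
```

For file names the trailing-newline case rarely matters, but it can when the names are read line by line from a text file without stripping.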

Extend list of lists only if first item of new list is unique

I'm working on parsing an output file for a NCBI Blast Search for a bioinformatics application. Essentially, the search takes a template genetic sequence and finds a series of sequences (contigs) with significant similarity to the template sequence.
In order to extract the many matches for contigs, my goal is to create a list of lists with the following format:
'[(contig #), (frame #), (first character # of the subject ("Sbjct")),(last character # of the subject ("Sbjct")]'
e.g. the output sublist for a given section with contig #1568, frame = -1, starting on character #5509 of the subject and ending on character #3914 of the subject is:
[1568,-1,5509,3914]
In this question I've left off the final item of the sublists. My challenge is that because there are multiple readout files, sometimes containing the same contig as other files, the list of lists that I'm creating sometimes gets extended with the same contig twice. Let me explain.
As depicted in the posted code block below, I tried to only add a new sublist if the sublist was unique (not already present). The issue I think I had with that is that all of the items in a sublist were compared to all of the items in the other sublist. This led to duplicates owing to the fact that although the contig # was the same, the other parameters were not the same. I just want the first sublist with a particular contig # to be the one it keeps without regard to the other parameters.
for ind, line in enumerate(contents,1):
    if re.search("(.*)>(.*)", line):
        c1 = line.split('[')
        c2 = c1[1].split(']')
        c3 = c2[0]
        my_line = getline(file.name, ind + 5)
        f1 = my_line.split('= ')
        if '+' in f1[1]:
            f2 = f1[1].split('+')
            f3 = f2[1].split('\n')[0]
        else:
            f3 = f1[1].split('\n')[0]
        my_line2 = getline(file.name, ind + 7)
        q1 = my_line2.split(' ')[2]
        my_line3 = getline(file.name, ind - 3)
        l1= [c3,f3,q1]
        if l1 not in x:
            x.extend([l1])
Here is what I received for my actual output:
[['1568', '-1', '12'], ['0003', '1', '12'], ['0130', '3', '12'], ['0097', '1', '20'], ['0512', '3', '11'], ['0315', '-1', '296'], ['0118', '-2', '52'], ['0308', '-3', '488'], ['1568', '-1', '1'], ['0003', '1', '1'], ['0130', '3', '4'], ['0097', '1', '28'], ['0512', '3', '23'], ['0315', '-1', '21'], ['0118', '-2', '39'], ['0102', '-3', '293'], ['0495', '-1', '146'], ['0386', '-3', '146']]
And here is what I expected:
[['1568', '-1', '12'], ['0003', '1', '12'], ['0130', '3', '12'], ['0097', '1', '20'], ['0512', '3', '11'], ['0315', '-1', '296'], ['0118', '-2', '52'], ['0308', '-3', '488'], ['0102', '-3', '293'], ['0495', '-1', '146'], ['0386', '-3', '146']]
How might I only add a sublist if the first item of the new sublist isn't in any of the other sublists? Please help!
This might be a quick fix, replace the line:
if l1 not in x:
With:
#if (any(c3 in temp for temp in x)):
if (not any(c3 == temp[0] for temp in x)):
This checks whether c3 (the first element of your l1 sublist) already appears as the first element of any sublist in x, so l1 is added only when its contig has not been seen before.
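The any(...) check rescans x for every new sublist. If the list grows large, one possible variation is to track the contig numbers already seen in a set. This is only a sketch: the function name keep_first_by_key and the sample rows (taken from the output in the question) are illustrative and stand apart from the parsing loop.

```python
def keep_first_by_key(rows):
    """Keep only the first row for each distinct first element."""
    seen = set()   # first elements (contig numbers) already kept
    unique = []
    for row in rows:
        key = row[0]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [['1568', '-1', '12'], ['0003', '1', '12'], ['1568', '-1', '1'],
        ['0102', '-3', '293'], ['0003', '1', '1']]
print(keep_first_by_key(rows))
# [['1568', '-1', '12'], ['0003', '1', '12'], ['0102', '-3', '293']]
```

Inside the original loop the same idea would mean keeping a seen set next to x and testing c3 against it before extending.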

Append to list from another list

I have a list like
list = ['1,2,3,4,5', '6,7,8,9,10']
I have a problem with the "," in the list, because '1,2,3,4,5' is a single string.
I want to have list2 = ['1','2','3','4'...]
How can I do this?
It should be something like this (note that it converts each piece to int):
nums = []
for str in list:
    nums = nums + [int(n) for n in str.split(',')]
You can loop through and split the strings up.
list = ['1,2,3,4,5', '6,7,8,9,10']
result = []
for s in list:
    result += s.split(',')
print(result)
Split each value in the original by , and then keep appending them to a new list.
l = []
for x in ['1,2,3,4,5', '6,7,8,9,10']:
    l.extend(y for y in x.split(','))
print(l)
Use itertools.chain.from_iterable with map:
from itertools import chain
lst = ['1,2,3,4,5', '6,7,8,9,10']
print(list(chain.from_iterable(map(lambda x: x.split(','), lst))))
# ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
Note that you shouldn't use list as a variable name, since it shadows the built-in.
You can also use list comprehension
li = ['1,2,3,4,5', '6,7,8,9,10']
res = [c for s in li for c in s.split(',') ]
print(res)
#['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
list2 = []
list2+=(','.join(list).split(','))
','.join(list) produces a string of '1,2,3,4,5,6,7,8,9,10'
','.join(list).split(',') produces ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
The join method joins the elements of a list with a delimiter and returns a single string; here the elements are joined by ','.
The split method splits a string at a delimiter and returns a list of substrings.
# Without using loops
li = ['1,2,3,4,5', '6,7,8,9,10']
p = ",".join(li).split(",")
#['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

How to remove empty separators from read files in Python?

Here is my input file sample (z.txt)
>qrst
ABCDE-- 6 6 35 25 10
>qqqq
ABBDE-- 7 7 28 29 2
I store the alpha and numeric in separate lists. Here is the output of numerics list
#Output : ['', '6', '', '6', '35', '25', '10']
['', '7', '', '7', '28', '29', '', '2']
The output has an extra space when there are single digits because of the way the file has been created. Is there any way to get rid of the '' (empty strings)?
You can take advantage of filter with None as function for that:
numbers = ['', '7', '', '7', '28', '29', '', '2']
numbers = list(filter(None, numbers))  # list() is needed on Python 3, where filter returns an iterator
print(numbers)
See it in action here: https://eval.in/640707
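A list comprehension is an equivalent way to drop the empty strings, and it always builds a list regardless of Python version. A minimal sketch, with cleaned as an illustrative name:

```python
numbers = ['', '7', '', '7', '28', '29', '', '2']

# Keep only non-empty strings; '' is falsy, any other string is truthy.
cleaned = [n for n in numbers if n]
print(cleaned)  # ['7', '7', '28', '29', '2']

# A string such as '0' is non-empty, so it is kept.
print([n for n in ['0', '', '10'] if n])  # ['0', '10']
```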
If your input looks like this:
>>> li=[' 6  6  35  25  10', ' 7 7 28  29 2']
Just use .split() which will handle the repeated whitespace as a single delimiter:
>>> [e.split() for e in li]
[['6', '6', '35', '25', '10'], ['7', '7', '28', '29', '2']]
vs .split(" "):
>>> [e.split(" ") for e in li]
[['', '6', '', '6', '', '35', '', '25', '', '10'], ['', '7', '7', '28', '', '29', '2']]
I guess there are many ways to do this. I prefer using regular expressions, although this might be slower if you have a large input file with tens of thousands of lines. For smaller files, it's okay.
A few points:
Use a context manager (the with statement) to open files. When the with block ends, the file is closed automatically.
An alternative to re.findall() is re.match() or re.search(). The subsequent code will be slightly different.
If org, sequence and numbers are related element-wise, I suggest you maintain a list of 3-element tuples instead. Of course, you have to buffer the org field and add it to the list of tuples when the next line is obtained.
import re

org = []
sequence = []
numbers = []
with open('ddd', 'r') as f:
    for line in f:
        line = line.strip()
        if re.search(r'^>', line):
            # Header line such as ">qrst"
            org.append(line)
        else:
            # The line is already stripped, so the pattern must not demand
            # trailing whitespace, or the last number would be cut off.
            m = re.findall(r'^([A-Z]+--)\s+(.*)', line)
            if m:
                sequence.append(m[0][0])
                numbers.append(list(map(int, m[0][1].split())))  # convert from str to int
print(org, sequence, numbers)
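The suggestion above about keeping 3-element tuples could look roughly like the following. This is a sketch under assumptions: parse_records is a hypothetical helper, it works on an in-memory list of lines rather than a file so that it can be run as-is, and the sample lines mimic z.txt from the question with some repeated spaces.

```python
import re

def parse_records(lines):
    """Return a list of (org, sequence, numbers) tuples."""
    records = []
    org = None  # buffered header until its data line arrives
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            org = line
            continue
        m = re.match(r'([A-Z]+--)\s+(.*)', line)
        if m and org is not None:
            # split() with no argument treats runs of whitespace as one separator
            numbers = [int(n) for n in m.group(2).split()]
            records.append((org, m.group(1), numbers))
            org = None
    return records

sample = ['>qrst', 'ABCDE-- 6  6 35 25 10', '>qqqq', 'ABBDE-- 7  7 28 29  2']
print(parse_records(sample))
# [('>qrst', 'ABCDE--', [6, 6, 35, 25, 10]), ('>qqqq', 'ABBDE--', [7, 7, 28, 29, 2])]
```

Keeping each record in one tuple means the header, sequence and numbers cannot drift out of step the way three parallel lists can.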
