I have a list of characters
a = ["s", "a"]
I have some words.
b = "asp"
c = "lat"
d = "kasst"
I know that the characters in the list can appear only once or consecutively (or at most one small set can appear inside a bigger one).
I would like to split my words by putting the elements of a in the middle, and the rest on the left or on the right (and put a "=" if there is nothing)
so b = ["*", "as", "p"]
If a bigger set of characters which contains
d = ["k", "ass", "t"]
I know that the combinations can be at most of length 4.
So I have divided the possible combinations depending on the length:
import itertools
c4 = [''.join(i) for i in itertools.product(a, repeat = 4)]
c3 = [''.join(i) for i in itertools.product(a, repeat = 3)]
c2 = [''.join(i) for i in itertools.product(a, repeat = 2)]
c1 = [''.join(i) for i in itertools.product(a, repeat = 1)]
For each c, starting with the greater
For simplicity, let's say I start with c3 in this case and not with length 4.
I have to do this with a lot of data.
Is there a way to simplify the code ?
You can do something similar using a regular expression:
>>> import re
>>> p = re.compile(r'([sa]{1,4})')
p matches the characters 's' or 'a' repeated between 1 and 4 times.
To split a given string at this pattern, use p.split. The use of capturing parentheses in the pattern leads to the pattern itself being included in the result.
>>> p.split('asp')
['', 'as', 'p']
>>> p.split('lat')
['l', 'a', 't']
>>> p.split('kasst')
['k', 'ass', 't']
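To see what the capturing parentheses change, here is a small sketch that compares the pattern above with the same pattern minus the group. The names with_group and without_group are only illustrative.

```python
import re

with_group = re.compile(r'([sa]{1,4})')    # same pattern as above
without_group = re.compile(r'[sa]{1,4}')   # identical, minus the capturing parentheses

kept = with_group.split('kasst')        # the separator is kept in the result
dropped = without_group.split('kasst')  # the separator is discarded

print(kept)     # ['k', 'ass', 't']
print(dropped)  # ['k', 't']
```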
Use regex ?
import re
a = ["s", "a"]
text = "kasst"
pattern = re.compile("[" + "".join(a) + "]{1,4}")
match = pattern.search(text)
parts = [text[:match.start()], text[match.start():match.end()], text[match.end():]]
parts = [part if part else "*" for part in parts]
However, note that this won't handle the case where none of the elements in a match.
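One possible way to cover the no-match case is to check the result of search before slicing. This is only a sketch: build_pattern and split_around are hypothetical helper names, and returning None for a word without any of the characters is an assumption about what you want in that case.

```python
import re

def build_pattern(chars, max_len=4):
    # re.escape guards against characters that are special inside [...]
    return re.compile("[" + re.escape("".join(chars)) + "]{1,%d}" % max_len)

def split_around(pattern, text, filler="*"):
    match = pattern.search(text)
    if match is None:
        return None  # assumption: signal "no match" to the caller
    parts = [text[:match.start()], match.group(), text[match.end():]]
    return [part if part else filler for part in parts]

pattern = build_pattern(["s", "a"])  # compiled once, reused for every word
for word in ["asp", "lat", "kasst", "xyz"]:
    print(word, split_around(pattern, word))
```

Compiling the pattern once outside the loop keeps the per-word cost low when there is a lot of data.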
I would do a regular expression to simplify the matching.
import re
a = ["s", "a"]
s = "kasst"  # example word to split
splitters = ''.join(a)
pattern = re.compile("([^%s]*)([%s]+)([^%s]*)" % (splitters, splitters, splitters))
words = [v if v else '=' for v in pattern.match(s).groups()]
This doesn't allow the characters in the first or last group, so not every string will match correctly (when a string fails to match, pattern.match returns None and calling .groups() on it raises an exception). You can allow them if you want. Feel free to modify the regular expression to better match what you want it to do.
Also, you only need to run re.compile once, not for every string you are trying to match.
I am pulling data from a table that changes often using Python, and the method I am using is not ideal. What I would like to have is a method to pull all strings that contain at most one letter and leave out anything that has 2 or more.
An example of data I might get:
115
19A6
HYS8
568
In this example, I would like to pull 115, 19A6, and 568.
Currently I am using the isdigit() method to determine if it is a digit and this filters out all numbers with one letter, which works for some purposes, but is less than ideal.
Try this:
string_list = ["115", "19A6", "HYS8", "568"]
output_list = []
for item in string_list:  # goes through the string list
    letter_counter = 0
    for letter in item:  # goes through the characters of one string
        if not letter.isdigit():  # counts every character that is not a digit
            letter_counter += 1
    if letter_counter < 2:  # strings with more than 1 letter are left out of the output list
        output_list.append(item)
print(output_list)
Output:
['115', '19A6', '568']
Here is a one-liner with a regular expression:
import re
data = ["115", "19A6", "HYS8", "568"]
out = [string for string in data if len(re.sub(r"\d", "", string)) < 2]
print(out)
Output:
['115', '19A6', '568']
This is an excellent case for regular expressions (regex), which are available through the built-in re module.
The code below follows the logic:
Define the dataset. Two examples have been added to show that a string containing two alpha-characters is rejected.
Compile a character pattern to be matched. In this case, zero or more digits, followed by zero or one upper-case letter, ending with zero or more digits.
Use the filter function to detect matches in the data list and output as a list.
For example:
import re
data = ['115', '19A6', 'HYS8', '568', 'H', 'HI']
rexp = re.compile(r'^\d*[A-Z]{0,1}\d*$')
result = list(filter(rexp.match, data))
print(result)
Output:
['115', '19A6', '568', 'H']
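One edge case worth knowing: every part of that pattern is optional, so an empty string also matches. A possible variant, sketched below, uses fullmatch and skips empty strings explicitly. Whether empty strings should be rejected is an assumption here, and the extra '' entry in data is only there to show the effect.

```python
import re

data = ['115', '19A6', 'HYS8', '568', 'H', 'HI', '']
rexp = re.compile(r'\d*[A-Z]?\d*')

# fullmatch requires the whole string to match, so no ^ or $ is needed.
# The extra "if s" drops the empty string, which the pattern alone would accept.
result = [s for s in data if s and rexp.fullmatch(s)]
print(result)  # ['115', '19A6', '568', 'H']
```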
Another solution, without re using str.maketrans/str.translate:
lst = ["115", "19A6", "HYS8", "568"]
d = str.maketrans(dict.fromkeys(map(str, range(10)), ""))
out = [i for i in lst if len(i.translate(d)) < 2]
print(out)
Prints:
['115', '19A6', '568']
z = False
a = str(a)
for I in range(len(a)):
    if a[I].isdigit():
        z = True
        break
    else:
        z = "no digit"
print(z)
Suppose I have the following string:
trend = '(A|B|C)_STRING'
I want to expand this to:
A_STRING
B_STRING
C_STRING
The OR condition can be anywhere in the string. i.e STRING_(A|B)_STRING_(C|D)
would expand to
STRING_A_STRING_C
STRING_B_STRING_C
STRING_A_STRING_D
STRING_B_STRING_D
I also want to cover the case of an empty conditional:
(|A_)STRING would expand to:
A_STRING
STRING
Here's what I've tried so far:
def expandOr(trend):
    parenBegin = trend.index('(') + 1
    parenEnd = trend.index(')')
    orExpression = trend[parenBegin:parenEnd]
    originalTrend = trend[0:parenBegin - 1]
    expandedOrList = []
    for oe in orExpression.split("|"):
        expandedOrList.append(originalTrend + oe)
But this is obviously not working.
Is there any easy way to do this using regex?
Here's a pretty clean way. You'll have fun figuring out how it works :-)
def expander(s):
    import re
    from itertools import product
    pat = r"\(([^)]*)\)"
    pieces = re.split(pat, s)
    pieces = [piece.split("|") for piece in pieces]
    for p in product(*pieces):
        yield "".join(p)
Then:
for s in ('(A|B|C)_STRING',
          '(|A_)STRING',
          'STRING_(A|B)_STRING_(C|D)'):
    print(s, "->")
    for t in expander(s):
        print(" ", t)
displays:
(A|B|C)_STRING ->
A_STRING
B_STRING
C_STRING
(|A_)STRING ->
STRING
A_STRING
STRING_(A|B)_STRING_(C|D) ->
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
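For anyone who would rather see the intermediate steps of expander, here is the same logic unrolled for one input. The variable names choices and expanded are only illustrative.

```python
import re
from itertools import product

s = 'STRING_(A|B)_STRING_(C|D)'

# Splitting on a pattern with a capturing group keeps the group contents,
# so literal text and the insides of the parentheses alternate in the list.
pieces = re.split(r"\(([^)]*)\)", s)
print(pieces)   # ['STRING_', 'A|B', '_STRING_', 'C|D', '']

# Literal pieces become one-element lists; alternations become several options.
choices = [piece.split("|") for piece in pieces]
print(choices)  # [['STRING_'], ['A', 'B'], ['_STRING_'], ['C', 'D'], ['']]

# product picks one option per position; joining gives each expanded string.
expanded = ["".join(combo) for combo in product(*choices)]
print(expanded)
```

One thing to keep in mind with this approach: a literal "|" outside parentheses would also be treated as an alternation, because every piece is split on "|".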
import exrex
trend = '(A|B|C)_STRING'
trend2 = 'STRING_(A|B)_STRING_(C|D)'
>>> list(exrex.generate(trend))
[u'A_STRING', u'B_STRING', u'C_STRING']
>>> list(exrex.generate(trend2))
[u'STRING_A_STRING_C', u'STRING_A_STRING_D', u'STRING_B_STRING_C', u'STRING_B_STRING_D']
I would do this to extract the groups:
def extract_groups(trend):
    l_parens = [i for i,c in enumerate(trend) if c == '(']
    r_parens = [i for i,c in enumerate(trend) if c == ')']
    assert len(l_parens) == len(r_parens)
    return [trend[l+1:r].split('|') for l,r in zip(l_parens,r_parens)]
And then you can evaluate the product of those extracted groups using itertools.product:
expr = 'STRING_(A|B)_STRING_(C|D)'
from itertools import product
list(product(*extract_groups(expr)))
Out[92]: [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
Now it's just a question of splicing those back onto your original expression. I'll use re for that :)
# python3.3+
import re
def _gen(it):
    yield from it
p = re.compile(r'\(.*?\)')
for tup in product(*extract_groups(expr)):
    gen = _gen(tup)
    print(p.sub(lambda x: next(gen), expr))
STRING_A_STRING_C
STRING_A_STRING_D
STRING_B_STRING_C
STRING_B_STRING_D
There's probably a more readable way to get re.sub to sequentially substitute things from an iterable, but this is what came off the top of my head.
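A possibly more readable take on the sequential substitution, sketched under the same assumptions (no nested parentheses): wrap each tuple in iter and let the replacement callback pull the next value. The function name expand is hypothetical.

```python
import re
from itertools import product

def expand(expr):
    group_pattern = re.compile(r'\(([^)]*)\)')
    # findall with one capturing group returns the text inside each (...)
    groups = [inner.split('|') for inner in group_pattern.findall(expr)]
    results = []
    for combo in product(*groups):
        replacements = iter(combo)
        # each (...) in expr is replaced by the next value from this combination
        results.append(group_pattern.sub(lambda _m: next(replacements), expr))
    return results

print(expand('STRING_(A|B)_STRING_(C|D)'))
print(expand('(|A_)STRING'))  # ['STRING', 'A_STRING']
```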
It is easy to achieve with sre_yield module:
>>> import sre_yield
>>> trend = '(A|B|C)_STRING'
>>> strings = list(sre_yield.AllStrings(trend))
>>> print(strings)
['A_STRING', 'B_STRING', 'C_STRING']
The goal of sre_yield is to efficiently generate all values that can match a given regular expression, or count possible matches efficiently... It does this by walking the tree as constructed by sre_parse (same thing used internally by the re module), and constructing chained/repeating iterators as appropriate. There may be duplicate results, depending on your input string though -- these are cases that sre_parse did not optimize.
I encountered a problem while trying to solve an exercise where, given some strings and their lengths, you need to find their common substring. My code for the part that loops through the list and then through each word in it is this:
num_of_cases = int(input())
for i in range(1, num_of_cases+1):
    if __name__ == '__main__':
        len_of_str = list(map(int, input().split()))
        len_of_virus = int(input())
        strings = []
        def string(strings, len_of_str):
            len_of_list = len(len_of_str)
            for i in range(1, len_of_list+1):
                strings.append(input())
        lst_of_subs = []
        virus_index = []
        def substr(strings, len_of_virus):
            for word in strings:
                for i in range(len(len_of_str)):
                    leng = word[i:len_of_virus]
                    lst_of_subs.append(leng)
                    virus_index.append(i)
        print(string(strings, len_of_str))
        print(substr(strings, len_of_virus))
And it prints the following given the strings: ananasso, associazione, tassonomia, massone
['anan', 'nan', 'an', 'n', 'asso', 'sso', 'so', 'o', 'tass', 'ass', 'ss', 's', 'mass', 'ass', 'ss', 's']
It seems that the end index doesn't increase, although I tried writing len_of_virus += 1 at the end of the loop.
sample input:
1
8 12 10 7
4
ananasso
associazione
tassonomia
massone
where the first line is the number of cases, the second line holds the lengths of the strings, the third is the length of the virus (the common substring), and then come the given strings that I should loop through.
expected output:
Case #1: 4 0 1 1
where the four numbers are the starting indexes of the common substring. (I don't think the code for printing matters for this particular problem.)
What should I do? Please help!!
The problem, besides defining functions in odd places and using them for their side effects in ways that aren't really encouraged, is here:
for i in range(len(len_of_str)):
    leng = word[i:len_of_virus]
i increases in each iteration, but len_of_virus stays the same, so you are effectively doing
word[0:4] #when len_of_virus=4
word[1:4]
word[2:4]
word[3:4]
...
That is where 'anan', 'nan', 'an', 'n' come from for the first word "ananasso", and the same goes for the others:
>>> word="ananasso"
>>> len_of_virus = 4
>>> for i in range(len(word)):
        word[i:len_of_virus]
'anan'
'nan'
'an'
'n'
''
''
''
''
>>>
You can fix it by moving the upper end by i, but that leaves the same problem at the other end:
>>> for i in range(len(word)):
        word[i:len_of_virus+i]
'anan'
'nana'
'anas'
'nass'
'asso'
'sso'
'so'
'o'
>>>
So a simple adjustment to the range solves the problem:
>>> for i in range(len(word)-len_of_virus+1):
        word[i:len_of_virus+i]
'anan'
'nana'
'anas'
'nass'
'asso'
>>>
Now that the substring part is done, the rest is also easy:
>>> def substring(text,size):
        return [text[i:i+size] for i in range(len(text)-size+1)]
>>> def find_common(lst_text,size):
        subs = [set(substring(x,size)) for x in lst_text]
        return set.intersection(*subs)
>>> test="""ananasso
associazione
tassonomia
massone""".split()
>>> find_common(test,4)
{'asso'}
>>>
To find the part common to all the strings in our list we can use sets: first we put all the substrings of each word into a set, and finally we intersect them all.
The rest is just printing it to your liking:
>>> virus = find_common(test,4).pop()
>>> print("case 1:",*[x.index(virus) for x in test])
case 1: 4 0 1 1
>>>
First extract all the substrings of the given size from the shortest string. Then select the first of these substrings that is present in all of the strings. Finally output the position of this common substring in each of the strings:
def commonSubs(strings,size):
    base = min(strings,key=len) # shortest string
    subs = [base[i:i+size] for i in range(len(base)-size+1)] # all substrings
    cs = next(ss for ss in subs if all(ss in s for s in strings)) # first common
    return [s.index(cs) for s in strings] # indexes of common substring
output:
S = ["ananasso", "associazione", "tassonomia", "massone"]
print(commonSubs(S,4))
[4, 0, 1, 1]
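Note that next(...) in commonSubs raises StopIteration when no substring of that size is shared by all strings. If that can happen in your data, a variant along these lines may help. It is only a sketch: the name common_substring_positions and the choice to return None are assumptions.

```python
def common_substring_positions(strings, size):
    base = min(strings, key=len)  # shortest string
    candidates = (base[i:i + size] for i in range(len(base) - size + 1))
    # the second argument to next() is the fallback when nothing is found
    common = next((c for c in candidates if all(c in s for s in strings)), None)
    if common is None:
        return None
    return [s.index(common) for s in strings]

S = ["ananasso", "associazione", "tassonomia", "massone"]
print(common_substring_positions(S, 4))               # [4, 0, 1, 1]
print(common_substring_positions(["abc", "xyz"], 2))  # None
```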
You could also use a recursive approach:
def commonSubs(strings,size,i=0):
    sub = strings[0][i:i+size]
    if all(sub in s for s in strings):
        return [s.index(sub) for s in strings]
    return commonSubs(strings,size,i+1)
from suffix_trees import STree
STree.STree(["come have some apple pies",
             'apple pie available',
             'i love apple pie haha']).lcs()
The simplest way is to use STree.
I should note that we are only allowed to use built in python string functions and loop functions.
A = 'bet[bge]geee[tb]bb'
B = 'betggeeetbb'
The square brackets mean any single one of the characters inside the bracket can be used so you could have
betbgeeetbb
betggeeetbb
betegeeetbb
betbgeeebbb
betggeeebbb
betegeeebbb
How do I check whether A has a combination that can be found within B?
A can have any number of brackets, with a minimum of 2 characters and a maximum of 4 characters in each square bracket
Thank you
Read up on the regular expressions library. The solution is literally the re.match function, whose documentation includes the following bit:
[] Used to indicate a set of characters. In a set:
Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Since regular expressions use backslashes for their own purpose (beyond Python's normal escapes, e.g. "\n" to represent a newline), raw strings are idiomatic in the matching string.
>>> import re
>>> A = r'bet[bge]geee[tb]bb'
>>> B = 'betggeeetbb'
>>> m = re.match(A, B)
>>> m
<_sre.SRE_Match object; span=(0, 11), match='betggeeetbb'>
>>> m.group(0)
'betggeeetbb'
You can also verify that it doesn't match if (say) the second bracket is not matched:
>>> C = "betggeeezbb"
>>> m = re.match(A, C)
>>> m is None
True
Before you go about adding this liberally to an existing project, make sure you understand:
What is the difference between re.search and re.match?
What is the cost of creating a regular expression? How can you avoid this cost if the regular expression is used repeatedly?
How can you extract parts of a matching expression (e.g. the character matched by [bge] in your example)?
How can you perform matches on strings that contain newlines?
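As a starting point for those questions, here is a short sketch that touches each of them. The extra parentheses around the brackets are added here only to capture the chosen characters; they are not part of the original pattern A.

```python
import re

# Compiling once and reusing the object avoids re-parsing the pattern on every call.
pattern = re.compile(r'bet([bge])geee([tb])bb')

# Extracting the characters matched by each bracket.
chosen = pattern.match('betggeeetbb').groups()
print(chosen)  # ('g', 't')

# re.match only tries at the start of the string; re.search scans forward.
at_start = re.match(r'geee', 'betggeeetbb')
anywhere = re.search(r'geee', 'betggeeetbb')
print(at_start, anywhere.start())  # None 4

# match accepts trailing text; fullmatch requires the whole string to fit.
print(pattern.match('betggeeetbbXYZ') is not None)      # True
print(pattern.fullmatch('betggeeetbbXYZ') is not None)  # False

# By default "." does not match a newline; re.DOTALL changes that.
print(re.match(r'a.b', 'a\nb') is None)                 # True
print(re.match(r'a.b', 'a\nb', re.DOTALL) is not None)  # True
```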
Finally, when learning regular expressions (similarly to learning class inheritance), it's tempting to use them everywhere. Meditate on this koan from Jamie Zawinski:
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
It's easiest to break your problem up into simpler tasks. There are many ways to convert your pattern from just a plain string into something with more structure, but here's something that uses only plain string operations to get you started:
def parse_pattern(pattern):
    '''
    >>> parse_pattern('bet[bge]geee[tb]bb')
    ['b', 'e', 't', ['b', 'g', 'e'], 'g', 'e', 'e', 'e', ['t', 'b'], 'b', 'b']
    '''
    in_group = False
    group = []
    result = []
    # Iterate through the pattern, character by character
    for c in pattern:
        if in_group:
            # If we're currently parsing a character
            # group, we either add a char into current group
            # or we end the group and go back to looking at
            # normal characters
            if c == ']':
                in_group = False
                result.append(group)
                group = []
            else:
                group.append(c)
        elif c == '[':
            # A [ starts a character group
            in_group = True
        else:
            # Otherwise, we just handle normal characters
            result.append(c)
    return result
def check_if_matches(string, pattern):
    parsed_pattern = parse_pattern(pattern)
    # Useful thing to note: `string` and `parsed_pattern`
    # have the same number of elements once we parse the
    # `pattern`
    ...
if __name__ == '__main__':
    print(check_if_matches('betggeeetbb', 'bet[bge]geee[tb]bb'))
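The body of check_if_matches is deliberately left open above. One possible way to finish the idea, using only loops and basic string operations, is sketched below. The names parse_bracket_pattern and matches_exactly are hypothetical, and the sketch assumes the whole of the string must match the pattern (same length, no nested or unclosed brackets).

```python
def parse_bracket_pattern(pattern):
    """Turn 'a[bc]d' into ['a', 'bc', 'd']: the allowed characters per position."""
    allowed = []
    group = None  # None while outside brackets, else the characters collected so far
    for ch in pattern:
        if ch == '[':
            group = ''
        elif ch == ']':
            allowed.append(group)
            group = None
        elif group is not None:
            group += ch
        else:
            allowed.append(ch)
    return allowed

def matches_exactly(string, pattern):
    allowed = parse_bracket_pattern(pattern)
    # After parsing, each position of the string lines up with one entry.
    if len(string) != len(allowed):
        return False
    for ch, options in zip(string, allowed):
        if ch not in options:
            return False
    return True

print(matches_exactly('betggeeetbb', 'bet[bge]geee[tb]bb'))  # True
print(matches_exactly('betggeeezbb', 'bet[bge]geee[tb]bb'))  # False
```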
I have a list of strings, and I want to find all the strings that end with _1234, where 1234 can be any 4-digit number. Ideally I'd find all the matching elements and what the digits actually are, or at least return the first matching element and what its 4 digits are.
For example, I have
['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
I want to get
['1024', '0510']
So far I have found that _\d{4}$ will match _1234 and return a match object, and match_object.group(0) is the actual matched string. But is there a better way to look for _\d{4}$ but only return \d{4} without the _?
Use re.search():
import re
lst = ['A', 'BB_1024', 'CQ_2', 'x_0510']
newlst = []
for item in lst:
    match = re.search(r'_(\d{4})\Z', item)
    if match:
        newlst.append(match.group(1))
print(newlst) # ['1024', '0510']
As for the regex, the pattern matches an underscore and exactly 4 digits at the end of the string, capturing only the digits (note the parens). The captured group is then accessible via match.group(1) (remember that group(0) is the entire match).
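The choice of \Z over $ matters only when a string can end in a newline: $ also matches just before a trailing newline, while \Z matches only at the very end. A small sketch of the difference (the variable names are illustrative):

```python
import re

with_dollar = re.search(r'_(\d{4})$', 'x_0510\n')
with_z = re.search(r'_(\d{4})\Z', 'x_0510\n')

print(with_dollar.group(1))  # 0510
print(with_z)                # None
```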
import re
src = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', 'AB2421', 'D3&1345']
res = []
p = re.compile(r'.*\D(\d{4})$')
for s in src:
    m = p.match(s)
    if m:
        res.append(m.group(1))
print(res)
Works fine. \D means "not a digit", so it will also match 'AB2421', 'D3&1345' and so on.
Please show some code next time you ask a question here, even if it doesn't work at all. It makes it easier for people to help you.
If you're interested in a solution without any regex, here's a way with list comprehensions:
>>> data = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
>>> endings = [text.split('_')[-1] for text in data]
>>> endings
['A', '1024', '2', '0510', '98765']
>>> [x for x in endings if x.isdigit() and len(x)==4]
['1024', '0510']
Try this:
[s[-4:] for s in lst if len(s) > 4 and s[-5] == '_' and s[-4:].isdigit()]
This checks that the last four characters are digits and that the character just before them is an underscore.
The len(s) > 4 check was added to correct the mistake Joran pointed out.
Try this code:
import re
r = re.compile(r".*?([0-9]+)$")
newlist = list(filter(r.match, mylist))  # keeps the elements of mylist that end in digits
print(newlist)