I am working on some string search tasks to learn more efficient ways of searching.
I am trying to count how many times a substring occurs in a given string by using backward search.
For example, given the following strings:
original = 'panamabananas$'
s = 'smnpbnnaaaaa$a'
s1 = '$aaaaaabmnnnps'  # sorted version of s
I am trying to find how many times the substring 'ban' occurs. To do so, I was thinking of iterating through both strings with the zip function. In the backward search, I should first look for the last character of ban (n) in s1 and see where it lines up with the next character, a, in s. It matches at indexes 9, 10 and 11, which are the third, fourth and fifth a in s. The next character to look for is b, but only among the matches found so far (that is, where n in s1 lined up with a in s). So we take those a's (the third, fourth and fifth) from s and check whether the third, fourth or fifth a in s1 lines up with a b in s. If one does, we have found an occurrence of 'ban'.
Iterating and saving these quasi-occurrences seems complex to me, so what I was trying is something like this:
n = 0  # counter of occurrences
for i, j in zip(s1, s):
    if i == 'n' and j == 'a':  # this should save the match
        if i[3:6] == 'a' and any(j[3:6] == 'b'):
            n += 1
I think nested if statements may be needed, but I am still a beginner. I am getting 0 occurrences when there is one occurrence of ban in the original.
You can run a loop with find to count the number of (non-overlapping) occurrences of the substring.
s = 'panamabananasbananasba'
ss = 'ban'
count = 0
idx = s.find(ss, 0)
while idx != -1:
    count += 1
    idx += len(ss)  # skip past this match before searching again
    idx = s.find(ss, idx)
print(count)
If you really want a backward search, reverse both the string and the substring and apply the same mechanism.
s = 'panamabananasbananasban'
s = s[::-1]
ss = 'ban'
ss = ss[::-1]
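For reference, the procedure described in the question is usually called backward search over a Burrows-Wheeler transform. Below is a minimal sketch of it, assuming s is the BWT of the original string (terminated by $). The names bwt_count, first and rank are made up for illustration, and rank is computed naively with str.count instead of a precomputed table, so this is not tuned for speed.

```python
from collections import Counter

def bwt_count(bwt, pattern):
    """Count occurrences of pattern in the text whose BWT is bwt."""
    # first[c] = index in the sorted column (s1) where the run of c starts
    counts = Counter(bwt)
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def rank(c, i):
        # Number of occurrences of c in bwt[:i] (naive, linear time)
        return bwt.count(c, 0, i)

    lo, hi = 0, len(bwt)  # half-open range of rows still matching
    for c in reversed(pattern):  # walk the pattern from its last character
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

s = 'smnpbnnaaaaa$a'
print(bwt_count(s, 'ban'))  # 1
print(bwt_count(s, 'ana'))  # 3
```

Unlike the find loop, this counts overlapping occurrences, which is why 'ana' is reported three times for 'panamabananas$'.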
I need to count the occurrences of the letter "a" in the first n characters of a string s that is repeated as often as needed.
Example:
s = "abcac"
n = 10
String to check for occurrences of letter "a": "abcacabcac".
Occurrences: 4
My code works, but I need it to work faster for larger values of n.
What can I do to optimize this code?
def repeatedString(s, n):
    a_count, word_iter = 0, 0
    for i in range(n):
        if s[word_iter] == "a":
            a_count += 1
        word_iter += 1
        if word_iter == len(s):
            word_iter = 0
    return a_count
You don't need to assemble the full repeated string to do it. Count the occurrences of the specified character in the whole string and multiply that by the number of times the string is fully repeated (n//len(s) times). Add to that the number of occurrences in the truncated part at the end of the repetitions (i.e. the first n%len(s) characters).
def countChar(s,n,c):
    # parentheses matter: multiply by the number of full repetitions
    return s.count(c)*(n//len(s))+s[:n%len(s)].count(c)
output:
countChar("abcac",10,"a") # 4 times in 'abcacabcac'
countChar("abcac",17,"a") # 7 times in 'abcacabcacabcacab'
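A quick way to convince yourself that the closed form is right is to compare it against the brute-force loop. This is only a sketch: count_char and count_char_slow are illustrative names, and divmod is used here to get the number of full repetitions and the length of the tail in one step.

```python
def count_char(s, n, c):
    # Closed form: full repetitions plus the truncated tail.
    full, rest = divmod(n, len(s))
    return s.count(c) * full + s[:rest].count(c)

def count_char_slow(s, n, c):
    # Brute force: walk the first n characters of the repeated string.
    return sum(1 for i in range(n) if s[i % len(s)] == c)

# The two versions should agree for every n.
for n in range(40):
    assert count_char("abcac", n, "a") == count_char_slow("abcac", n, "a")

print(count_char("abcac", 10, "a"))  # 4
print(count_char("aaa", 5, "a"))     # 5
```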
Count the number of times a appears in the string s repeated up to length n:
s = "abcac"
n = 10
# repeat one extra time so the slice always has at least n characters
(s * (n // len(s) + 1))[:n].count('a')
You can use regular expressions:
import re
a_count = len(re.findall(r'a',s))
re.findall returns a list of all matches, so we can just take its length. Using a regular expression allows for greater generalization and the ability to search for more complex patterns. Debra's original answer is better for a simple string search, though:
a_count = s.count('a')
In Python I am trying to extract all the longest common leading substrings that contain at least 4 characters from a list. For example, in the list called "data" below, the 2 longest common substrings that fit my criteria are "johnjack" and "detc". I know how to find the single longest common substring with the code below, which returns nothing (as expected) because no substring is common to every item. But I am struggling to build a script that can detect multiple common substrings within a list, where each common substring must have a length of 4 or more.
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
def ls(data):
    if len(data)==0:
        prefix = ''
    else:
        prefix = data[0]
        for i in data:
            while not i.startswith(prefix) and len(prefix) > 0:
                prefix = prefix[:-1]
    print(prefix)
ls(data)
Here's one, but I think it's probably not the fastest or most efficient. Let's start with just the data and a container for our answer:
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
substrings = []
Note I added a dupe for chunganh -- that's a common edge case we should be handling.
See How do I find the duplicates in a list and create another list with them?
So, to capture the duplicates in the data:
seen = {}
dupes = []
for x in data:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
for dupe in dupes:
    substrings.append(dupe)
Now let's record the unique values in the data as-is
# Capture the unique values in the data
last = set(data)
From here, we can loop through our data, popping characters off the end of each value. If the size of our set shrinks, two or more values have collapsed into the same prefix, so we've found a common substring.
# Handle strings up to 10000 characters long
for k in range(-1, -10000, -1):
    # Use negative indexing to strip one more trailing character on each pass
    last, middle = set(i[:k] for i in data), last
    # The set shrank: some values now share a prefix
    if len(last) != len(middle):
        for prefix in last:
            count = 0
            for word in middle:
                # only leading substrings count
                if word.startswith(prefix):
                    count += 1
            if count > 1:
                substrings.append(prefix)
    # Early stopping
    if len(last) == 1:
        break
Finally, you mentioned needing only substrings of length 4 or more.
list(filter(lambda x: len(x) >= 4, substrings))
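As a point of comparison, here is a different and shorter technique (not the one used above): sort the list and take the common prefix of each pair of neighbours with os.path.commonprefix, which compares strings character by character. This is a sketch under the assumption that a prefix shared by two adjacent items in sorted order is what you are after; common_prefixes and min_len are illustrative names.

```python
import os

def common_prefixes(data, min_len=4):
    # After sorting, items that share a prefix end up next to each other.
    words = sorted(data)
    found = set()
    for a, b in zip(words, words[1:]):
        prefix = os.path.commonprefix([a, b])
        if len(prefix) >= min_len:
            found.add(prefix)
    return sorted(found)

data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
print(common_prefixes(data))  # ['detc', 'johnjack']
```

Note that nested groups produce more than one prefix: adding 'johnjxyz' to the list would report both 'johnj' and 'johnjack'.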
I'm trying to count how many times characters repeat in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 repeats of 'o', 2 repeats of 'e').
I'm trying to implement this with string-level functions and I can do it that way, but is there an easier way to do this? A regex, or something along those lines?
Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. set applied to a string will return a collection of its unique letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() method.
The expression is
(\w)\1+
which captures a word character (a-zA-Z0-9_, plus other Unicode word characters for str patterns in Python 3) and then matches as many further repetitions of that same character as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
Try this:
word = input('something:')
total = 0
chars = set(word)  # get the set of unique characters
for item in chars:  # iterate over the set and output the count for each repeated item
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to test some edge cases, which is good practice anyway.
I think your code is comparing the wrong things.
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char:  # <-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.
A program to detect an n-length pattern in a string, even without knowing where the pattern starts, can easily be written by creating a list of n-length substrings and checking whether, from some starting point, the same item repeats for the rest of the list. Without any information other than the string itself, is brute-forcing through all lengths the only way to recognize the pattern, or is there a more efficient algorithm?
(I'm just a beginner in Python, so this may be easy to code...)
Current code, which only handles a pattern starting at index 0:
def search(s):
    match=s[0]+s[1]
    while (match != s) and (match[0] != match[-1]):
        for matchLen in range(len(match),len(s)-1):
            letter = s[matchLen]
            if letter == match[-1]:
                match += s[len(match)]
                break
    if match == s:
        return None
    else:
        return match[:-1]
You can use re.findall(r'(.{2,})\1+', string). The parentheses create a capture group that is later backreferenced by \1. The . matches any character (except for line breaks). The {2,} requires the pattern to be at least two characters long (otherwise strings like ss would be considered a pattern). Finally, the + requires that pattern to repeat one or more times (in addition to the first time it occurred inside the capture group). You can see it working in action.
Pattern is a far too vague term, but assuming you mean some string repeating itself, the regexp (?P<pat>.+)(?P=pat) will work.
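To make the two regular expressions concrete, here is a small sketch that runs both against the same input. The sample string is an assumption chosen so that the repeat starts at index 0; on other inputs the greedy match may pick a different (shorter or longer) repeat.

```python
import re

text = "abcdeabcde"

# Backreference by number: group 1 must be followed by at least one more copy.
print(re.findall(r'(.{2,})\1+', text))  # ['abcde']

# The same idea with a named group and a named backreference.
m = re.search(r'(?P<pat>.+)(?P=pat)', text)
print(m.group('pat'))  # abcde
```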
Given a string, you could do the following.
Start with length = 1, and take two pointer variables, i and j, which you will use to traverse the string.
Set i = 0 and j = i + length
if str[i] == str[j]:
    i++, j++   // repeat until j reaches the length of the string
else:
    length = length + 1
    // increase length by 1 and start the algorithm over from i = 0
Take the example abcdeabcde:
Initially i = 0, j = 1,
but str[0] != str[1], i.e. a != b.
Then we get length = 2, i.e. i = 0, j = 2,
but str[0] != str[2], i.e. a != c.
Continuing in the same fashion,
we see that when length = 5, i = 0 and j = 5,
str[0] == str[5],
and i and j keep incrementing until j equals the string length.
The pattern length at that point is your answer. It may not seem obvious, so I would suggest you dry-run this algorithm on some of your test cases and let me know the results.
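The two-pointer idea above can be sketched in Python roughly as follows. The function name smallest_period is made up for illustration, and the sketch assumes the pattern starts at index 0 and that a trailing partial repetition is acceptable.

```python
def smallest_period(s):
    n = len(s)
    for length in range(1, n):
        # j = i + length; every character must equal the one `length` places later
        if all(s[i] == s[i + length] for i in range(n - length)):
            return length
    # No shorter period: the string only "repeats" itself once
    return n

print(smallest_period("abcdeabcde"))  # 5
print(smallest_period("abc"))         # 3
```

The pattern itself is then s[:smallest_period(s)]. In the worst case this is quadratic in the length of the string.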
You can use re.findall() to find all matches:
import re
s = "somethingabcdeabcdeabcdeabcdeabcdeelseabcdeabcdeabcde"
li = re.findall(r'abcde',s)
print(li)
Output:
['abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde', 'abcde']
For a homework assignment, I have to take 2 user-inputted strings and figure out how many letters are common (in the same position of both strings), as well as find common letters. For example, for the two strings 'cat' and 'rat', there are 2 common letter positions (positions 2 and 3 in this case), and the number of common letters is also 2 because 'a' is found once and 't' is found once too.
So I made a program and it worked fine, but then my teacher updated the homework with more examples, specifically examples with repeated letters, and my program isn't working for those. For example, with the strings 'ahahaha' and 'huhu' there are 0 common letters in the same positions, but there are 3 common letters between them (because 'h' in string 2 appears in string 1 three times).
My whole issue is that I can't figure out how to count "h" when it appears multiple times in the first string, and I don't know how to NOT check the SECOND 'h' in huhu, because only unique letters should be counted, so the overall common letter count should be 2.
This is my current code:
S1 = input("Enter a string: ")
S2 = input("Enter a string: ")
i = 0
big_string = 0
short_string = 0
same_letter = 0
common_letters = 0
if len(S1) > len(S2):
    big_string = len(S1)
    short_string = len(S2)
elif len(S1) < len(S2):
    big_string = len(S2)
    short_string = len(S1)
elif len(S1) == len(S2):
    big_string = short_string = len(S1)
while i < short_string:
    if (S1[i] == S2[i]) and (S1[i] in S2):
        same_letter += 1
        common_letters += 1
    elif (S1[i] == S2[i]):
        same_letter += 1
    elif (S1[i] in S2):
        common_letters += 1
    i += 1
print("Number of positions with the same letter: ", same_letter)
print("Number of letters from S1 that are also in S2: ", common_letters)
So this code worked for strings without repeated letters, but when I try to use it with "ahahaha" and "huhu" I get 0 common positions (which makes sense) and 2 common letters (when it should be 3). I figured it might work if I tried to add the following:
while x < short_string:
    if S1[i] in S2[x]:
        common_letters += 1
    else:
        pass
    x += 1
However, this doesn't work either.
I am not asking for a direct answer or a piece of code to do this, because I want to do it on my own; I just need a couple of hints or ideas on how to do this.
Note: I can't use any functions we haven't covered in class, and in class we've only done basic loops and strings.
You need a data structure like a multidict. To my knowledge, the most similar data structure in the standard library is Counter from collections.
For simple frequency counting:
>>> from collections import Counter
>>> strings = ['cat', 'rat']
>>> counters = [Counter(s) for s in strings]
>>> sum((counters[0] & counters[1]).values())
2
With index counting:
>>> counters = [Counter(zip(s, range(len(s)))) for s in strings]
>>> sum((counters[0] & counters[1]).values())
2
For your examples ahahaha and huhu, you should get 2 and 0, respectively, since the two h's are shared but sit in the wrong positions.
Since you can't use advanced constructs, you just need to simulate the counter with arrays.
Create one 26-element array per string.
Loop over each string and increment the index that corresponds to each letter.
Loop over both arrays simultaneously and sum the minimums of the corresponding entries.
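The three steps above could be sketched like this. It assumes only the letters a to z matter (anything else is skipped, and upper case is folded to lower case); common_letter_count is an illustrative name.

```python
def common_letter_count(a, b):
    # One slot per letter of the alphabet for each string.
    counts_a = [0] * 26
    counts_b = [0] * 26
    for ch in a.lower():
        if 'a' <= ch <= 'z':
            counts_a[ord(ch) - ord('a')] += 1
    for ch in b.lower():
        if 'a' <= ch <= 'z':
            counts_b[ord(ch) - ord('a')] += 1
    # A letter is shared as many times as it appears in both strings.
    total = 0
    for i in range(26):
        total += min(counts_a[i], counts_b[i])
    return total

print(common_letter_count('cat', 'rat'))       # 2
print(common_letter_count('ahahaha', 'huhu'))  # 2
```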
A shorter version is this:
def gen1(listItem):
    # collect the unique, non-space characters in order of appearance
    returnValue = []
    for character in listItem:
        if character not in returnValue and character != " ":
            returnValue.append(character)
    return returnValue
st = "first string"
r1 = gen1(st)
st2 = "second string"
r2 = gen1(st2)
if len(st) > len(st2):
    print(list(set(r1).intersection(r2)))
else:
    print(list(set(r2).intersection(r1)))
Note:
This is a pretty old post, but since it's got new activity, I posted my version.
Since you can't use arrays or lists, maybe try adding every common character to a var_string, then test
if c not in var_string:
before incrementing your common counter, so you are not counting the same character multiple times.
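One way to read that hint, using only loops and strings, is sketched below. It assumes the interpretation from the question: each distinct letter of the second string is looked up once, and every occurrence of it in the first string is counted. The names common_letters and seen are illustrative.

```python
def common_letters(s1, s2):
    seen = ""   # letters of s2 that have already been handled
    total = 0
    for c in s2:
        if c not in seen:
            seen += c
            # count every occurrence of this letter in s1
            for d in s1:
                if d == c:
                    total += 1
    return total

print(common_letters('ahahaha', 'huhu'))  # 3
print(common_letters('cat', 'rat'))       # 2
```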
You are only getting '2' because you only look at 4 characters of ahahaha (because huhu, the shorter string, is only 4 characters long). Change your while loop to go over big_string instead, and then prepend (len(S2) > i) and to your first two conditional tests; the last test performs an in, so it can't run past the end of the shorter string.
NB: All of the above implicitly assumes that len(S1) >= len(S2); that should be easy enough to ensure, using a conditional and an assignment, and it would simplify other parts of your code to do so. You can replace the first block entirely with something like:
if (len(S2) > len(S1)): (S2, S1) = (S1, S2)
big_string = len(S1)
short_string = len(S2)
We can solve this by using one for loop inside another, as follows (C-style):
int i, j, y = 0;
for (i = 0; i < big_string; i++)
{
    for (j = 0; j < short_string; j++)
    {
        if (s1[i] == s2[j])
        {
            y++;
            break;  /* count each letter of the big string at most once */
        }
    }
}
If you enter 'ahahaha' and 'huhu', the outer loop first takes the first character of the big string, 'a'. The inner loop then takes the first letter of the small string, 'h', and compares the two; as they are not equal, y is not incremented. In the next step control stays in the inner loop with j incremented, so 'a' is compared against the second letter of the small string, 'u'; again they are not equal and y remains zero. y is incremented in the following cases:
when the second letter of the big string, 'h', is compared with the first letter of the small string, y is incremented once, i.e. y = 1;
when the fourth letter of the big string, 'h', is compared with the first letter of the small string, y is incremented again, i.e. y = 2;
when the sixth letter of the big string, 'h', is compared with the first letter of the small string, y is incremented again, i.e. y = 3.
The final output is 3. I think that is what we want.