Get certain items based on their formatting

Get certain items based on their formatting - python

I have a list of values, some are numeric only, others made up of words, others a mix of the two.
I would like to select only those items composed by the combination number, single letter, number.
let me explain, this is my list of values
l = ['980X2350', 'DO_UN_HPL_Glas_Links', 'DO_UN_HPL_Glas_Rechts',
'930x2115', 'DO_UN_HPL_Links', 'DO_UN_HPL_Rechts', '830X2115',
'Deuropening', 'BF_32_Tourniquets_dubbeledeur_Aluminium']
I'd like to just get back:
['980X2350', '930x2115', '830X2115']

There is no need of importing re for such trivial matter.
Here is an approach that is more efficient than the regex based one:
allowed = '0123456786x'
def filter_str(lst):
output = []
for s in lst:
c = s.lower().strip()
if all(i in allowed for i in c) and c.count('x') == 1:
output.append(s)
return output
If the strings must contain two numeric fields:
allowed = '0123456786x'
def filter_str(lst):
output = []
for s in lst:
c = s.lower().strip()
n = len(c) - 1
if all(i in allowed for i in c) and c.count('x') == 1 and c.index('x') not in (0, n):
output.append(s)
return output
all function short-circuits (i.e. it stops checking as soon as Falsy value is registered), all Python logical operators also short-circuit, for the and operator, the right-hand operand won't be executed if the left-hand operand is Falsy, so my code does look it's longer than the regex based one, but it actually executes faster because regex checks whole strings and does not short-circuit.

Assuming a list of strings as input, you can use a regex and a list comprehension:
l = ['980X2350', 'DO_UN_HPL_Glas_Links', 'DO_UN_HPL_Glas_Rechts',
'930x2115', 'DO_UN_HPL_Links', 'DO_UN_HPL_Rechts', '830X2115',
'Deuropening', 'BF_32_Tourniquets_dubbeledeur_Aluminium']
import re
regex = re.compile('\d+x\d+', flags=re.I)
out = [s for s in l if regex.match(s.strip())]
output:
['980X2350', '930x2115', '830X2115']

Assuming a list of strings :
you can store in a counter the number of letters encountered, if this number is exactly equal to 1 and you have encountered some numbers then you can store it to your output list :
a = ['980X2350', 'DO_UN_HPL_Glas_Links', 'DO_UN_HPL_Glas_Rechts', '930x2115', 'DO_UN_HPL_Links',
'DO_UN_HPL_Rechts', '830X2115', 'Deuropening' ]
alphabet = 'abcdefghijklmnopqrstuvwxyz'
alphabet+= alphabet.upper()
numeric = '0123456789'
numeric_flag = False
output = []
for item in a:
alphabet_count = 0
for char in item:
if char in alphabet:
alphabet_count += 1
if char in numeric:
numeric_flag = True
if alphabet_count == 1 and numeric_flag:
output.append(item)
print(output)
# ['980X2350', '930x2115', '830X2115']

Related

Count max substring of the same character

i want to write a function in which it receives a string (s) and a single letter (s). the function needs to return the length of the longest substring of this letter. i dont know why the function i wrote doesn't work
for exmaple: print(count_longest_repetition('eabbaaaacccaaddd', 'a') supposed to return '4'
def count_longest_repetition(s, c):
n= len(s)
lst=[]
length_charachter=0
for i in range(n-1):
if s[i]==c and s[i+1]==c:
if s[i] in lst:
lst.append(s[i])
length_charachter= len(lst)
return length_charachter

Due to the condition if s[i] in lst, nothing will be appended to 'lst' as originally 'lst' is empty and the if condition will never be satisfied. Also, to traverse through the entire string you need to use range(n) as it generates numbers from 0 to n-1. This should work -
def count_longest_repetition(s, c):
n= len(s)
length_charachter=0
max_length = 0
for i in range(n):
if s[i] == c:
length_charachter += 1
else:
length_charachter = 0
max_length = max(max_length, length_charachter)
return max_length

I might suggest using a regex approach here with re.findall:
def count_longest_repetition(s, c):
matches = re.findall(r'' + c + '+', s)
matches = sorted(matches, key=len, reverse=True)
return len(matches[0])
cnt = count_longest_repetition('eabbaaaacccaaddd', 'a')
print(cnt)
This prints: 4
To better explain the above, given the inputs shown, the regex used is a+, that is, find groups of one or more a characters. The sorted list result from the call to re.findall is:
['aaaa', 'aa', 'a']
By sorting descending by string length, we push the longest match to the front of the list. Then, we return this length from the function.

Your function doesn't work because if s[i] in lst: will initially return false and never gets to add anything to to the lst list (so it will remain false throughout the loop).
You should look into regular expressions for this kind of string processing/search:
import re
def count_longest_repetition(s, c):
return max((0,*map(len,re.findall(f"{re.escape(c)}+",s))))
If you're not allowed to use libraries, you could compute repetitions without using a list by adding matches to a counter that you reset on every mismatch:
def count_longest_repetition(s, c):
maxCount = count = 0
for b in s:
count = (count+1)*(b==c)
maxCount = max(count,maxCount)
return maxCount

This can also be done by groupby
from itertools import groupby
def count_longest_repetition(text,let):
return max([len(list(group)) for key, group in groupby(list(text)) if key==let])
count_longest_repetition("eabbaaaacccaaddd",'a')
#returns 4

Is there an easy way to get the number of repeating character in a word?

I'm trying to get how many any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string level functions and I can do it this way but, is there an easy way to do this? Regex, or some other sort of things?

Original question: order of repetition does not matter
You can subtract the number of unique letters by the number of total letters. set applied to a string will return a unique collection of letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5

You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
for m in rx.finditer("loooooveee")
for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (one of a-zA-Z0-9_) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.

try this:
word=input('something:')
sum = 0
chars=set(list(word)) #get the set of unique characters
for item in chars: #iterate over the set and output the count for each item
if word.count(char)>1:
sum+=word.count(char)
print('{}|{}'.format(item,str(word.count(char)))
print('Total:'+str(sum))
EDIT:
added total count of repetitions

Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to look out for some edge cases, which you should as it is a good practice.

I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
if text[-i] == char: #<-- surely this is test[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
char = text[0]
emphasis_size = 0
for i in range(1, len(text)):
if text[i] == char:
emphasis_size += 1
else:
char = text[i]
return emphasis_size
This seems to work.

How to stop over counting of duplicate letters in a list of strings

I'm trying to count the number of times a duplicate letter shows up in the list element.
For example, given
arr = ['capps','hat','haaah']
I out put a list and I get ['1','0','1']
def myfunc(words):
counter = 0 #counters dup letters in words
len_ = len(words)-1
for i in range(len_):
if words[i] == words[i+1]: #if the letter ahead is the same add one
counter+=1
return counter
def minimalOperations(arr):
return [*map(myfunc,arr)] #map fuc applies myfunc to element in words.
But my code would output [1,0,2]
I'm not sure why I am over counting.
Can anyone help me resolve this, thank you in advance.

A more efficient solution using a regular expression:
import re
def myfunc(words):
reg_str = r"(\w)\1{1,}"
return len(re.findall(reg_str, words))
This function will find the number of substrings of length 2 or more containing the same letter. Thus 'aaa' in your example will only be counted once.
For a string like
'hhhhfafaahggaa'
the output will be 4 , since there are 4 maximal substrings of the same letter occuring at least twice : 'hhh' , 'ss', 'gg', 'aa'

You aren't accounting for situations where you have greater than 2 identical characters in succession. To do this, you can look back as well as forward:
if (words[i] == words[i+1]) and (words[i] != words[i-1] if i != 0 else True)
# as before
The ternary statement helps for the first iteration of the loop, to avoid comparing the last letter of a string with the first.
Another solution is to use itertools.groupby and count the number of instances where a group has a length greater than 1:
arr = ['capps','hat','haaah']
from itertools import groupby
res = [sum(1 for _, j in groupby(el) if sum(1 for _ in j) > 1) for el in arr]
print(res)
[1, 0, 1]
The sum(1 for _ in j) part is used to count the number items in a generator. It's also possible to use len(list(j)), though this requires list construction.

Well, your code counts the number of duplications, so what you observe is quite logical:
your input is arr = ['capps','hat','haaah']
in 'capps', the letter p is duplicated 1 time => myfunc() returns 1
in 'hat', there is no duplicated letter => myfunc() returns 0
in 'haaah', the letter a is duplicated 2 times => myfunc() returns 2
So finally you get [1,0,2].
For your purpose, I suggest you to use a regex to match and count the number of groups of duplicated letters in each word. I also replaced the usage of map() with a list comprehension that I find more readable:
import re
def myfunc(words):
return len(re.findall(r'(\w)\1+', words))
def minimalOperations(arr):
return [myfunc(a) for a in arr]
arr = ['capps','hat','haaah']
print(minimalOperations(arr)) # [1,0,1]
arr = ['cappsuul','hatppprrrrtyyy','haaah']
print(minimalOperations(arr)) # [2,3,1]

You need to keep track of a little more state, specifically if you're looking at duplicates now.
def myfunc(words):
counter = 0 #counters dup letters in words
seen = None
len_ = len(words)-1
for i in range(len_):
if words[i] == words[i+1] and words[i+1] != seen: #if the letter ahead is the same add one and wasn't the first
counter+=1
seen = words[i]
return counter
This gives you the following output
>>> arr = ['capps','hat','haaah']
>>> map(myfunc, arr)
[1, 0, 1]
As others have pointed out, you could use a regular expression and trade clarity for performance. They key is to find a regular expression that means "two or more repeated characters" and may depend on what you consider to be characters (e.g. how do you treat duplicate punctuation?)
Note: the "regex" used for this is technically an extension on regular expressions because it requires memory.
The form will be len(re.findall(regex, words))

I would break this kind of problem into smaller chunks. Starting by grouping duplicates.
The documentation for itertools has groupby and recipes for this kind of things.
A slightly edited version of unique_justseen would look like this:
duplicates = (len(sum(1 for _ in group) for _key, group in itertools.groupby("haaah")))
and yields values: 1, 3, 1. As soon as any of these values are greater than 1 you have a duplicate. So just count them:
sum(n > 1 for n in duplicates)

Use re.findall for matches of 2 or more letters
>>> arr = ['capps','hat','haaah']
>>> [len(re.findall(r'(.)\1+', w)) for w in arr]
[1, 0, 1]

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For examle from sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iteration on each letter).
I have no clue how to do that in pyhton 2.7. I was trying with regular expressions but it was not searching for every variants.
How can I achive that?

You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
l, b = len(s), set('ATCG')
options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

This is another way you can do it. Maybe not as fast and nice as chrisz answere. But maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
letters=['A','G','T','C']
j=i
seq=[]
while len(letters)>0 and j<(len(DNA)):
seq.append(DNA[j])
try:
letters.remove(DNA[j])
except:
pass
j+=1
if len(letters)==0:
toSave.append(seq)
print(toSave)

Since the substring you are looking for may be of about any length, a LIFO queue seems to work. Append each letter at a time, check if there are at least one of each letters. If found return it. Then remove letters at the front and keep checking until no longer valid.
def find_agtc_seq(seq_in):
chars = 'AGTC'
cur_str = []
for ch in seq_in:
cur_str.append(ch)
while all(map(cur_str.count,chars)):
yield("".join(cur_str))
cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG

I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
print(x)
s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s such that c is smaller than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exist in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
r.append(x)
s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']

If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
flagged_items = []
for item in input_list:
# Create itertools.groupby object
groups = groupby(str(item))
# Create list of tuples: (digit, number of repeats)
result = [(label, sum(1 for _ in group)) for label, group in groups]
# Extract just number of repeats
char_lens = np.array([x[1] for x in result])
# Append to flagged items
if any(char_lens >= n_repeats):
flagged_items.append(item)
# Return flagged items
return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Determine prefix from a set of (similar) strings

I have a set of strings, e.g.
my_prefix_what_ever
my_prefix_what_so_ever
my_prefix_doesnt_matter
I simply want to find the longest common portion of these strings, here the prefix. In the above the result should be
my_prefix_
The strings
my_prefix_what_ever
my_prefix_what_so_ever
my_doesnt_matter
should result in the prefix
my_
Is there a relatively painless way in Python to determine the prefix (without having to iterate over each character manually)?
PS: I'm using Python 2.6.3.

Never rewrite what is provided to you: os.path.commonprefix does exactly this:
Return the longest path prefix (taken
character-by-character) that is a prefix of all paths in list. If list
is empty, return the empty string (''). Note that this may return
invalid paths because it works a character at a time.
For comparison to the other answers, here's the code:
# Return the longest prefix of all list elements.
def commonprefix(m):
"Given a list of pathnames, returns the longest common leading component"
if not m: return ''
s1 = min(m)
s2 = max(m)
for i, c in enumerate(s1):
if c != s2[i]:
return s1[:i]
return s1

Ned Batchelder is probably right. But for the fun of it, here's a more efficient version of phimuemue's answer using itertools.
import itertools
strings = ['my_prefix_what_ever',
'my_prefix_what_so_ever',
'my_prefix_doesnt_matter']
def all_same(x):
return all(x[0] == y for y in x)
char_tuples = itertools.izip(*strings)
prefix_tuples = itertools.takewhile(all_same, char_tuples)
''.join(x[0] for x in prefix_tuples)
As an affront to readability, here's a one-line version :)
>>> from itertools import takewhile, izip
>>> ''.join(c[0] for c in takewhile(lambda x: all(x[0] == y for y in x), izip(*strings)))
'my_prefix_'

Here's my solution:
a = ["my_prefix_what_ever", "my_prefix_what_so_ever", "my_prefix_doesnt_matter"]
prefix_len = len(a[0])
for x in a[1 : ]:
prefix_len = min(prefix_len, len(x))
while not x.startswith(a[0][ : prefix_len]):
prefix_len -= 1
prefix = a[0][ : prefix_len]

The following is an working, but probably quite inefficient solution.
a = ["my_prefix_what_ever", "my_prefix_what_so_ever", "my_prefix_doesnt_matter"]
b = zip(*a)
c = [x[0] for x in b if x==(x[0],)*len(x)]
result = "".join(c)
For small sets of strings, the above is no problem at all. But for larger sets, I personally would code another, manual solution that checks each character one after another and stops when there are differences.
Algorithmically, this yields the same procedure, however, one might be able to avoid constructing the list c.

Just out of curiosity I figured out yet another way to do this:
def common_prefix(strings):
if len(strings) == 1:#rule out trivial case
return strings[0]
prefix = strings[0]
for string in strings[1:]:
while string[:len(prefix)] != prefix and prefix:
prefix = prefix[:len(prefix)-1]
if not prefix:
break
return prefix
strings = ["my_prefix_what_ever","my_prefix_what_so_ever","my_prefix_doesnt_matter"]
print common_prefix(strings)
#Prints "my_prefix_"
As Ned pointed out it's probably better to use os.path.commonprefix, which is a pretty elegant function.

The second line of this employs the reduce function on each character in the input strings. It returns a list of N+1 elements where N is length of the shortest input string.
Each element in lot is either (a) the input character, if all input strings match at that position, or (b) None. lot.index(None) is the position of the first None in lot: the length of the common prefix. out is that common prefix.
val = ["axc", "abc", "abc"]
lot = [reduce(lambda a, b: a if a == b else None, x) for x in zip(*val)] + [None]
out = val[0][:lot.index(None)]

Here's a simple clean solution. The idea is to use zip() function to line up all the characters by putting them in a list of 1st characters, list of 2nd characters,...list of nth characters. Then iterate each list to check if they contain only 1 value.
a = ["my_prefix_what_ever", "my_prefix_what_so_ever", "my_prefix_doesnt_matter"]
list = [all(x[i] == x[i+1] for i in range(len(x)-1)) for x in zip(*a)]
print a[0][:list.index(0) if list.count(0) > 0 else len(list)]
output: my_prefix_

Here is another way of doing this using OrderedDict with minimal code.
import collections
import itertools
def commonprefix(instrings):
""" Common prefix of a list of input strings using OrderedDict """
d = collections.OrderedDict()
for instring in instrings:
for idx,char in enumerate(instring):
# Make sure index is added into key
d[(char, idx)] = d.get((char,idx), 0) + 1
# Return prefix of keys while value == length(instrings)
return ''.join([k[0] for k in itertools.takewhile(lambda x: d[x] == len(instrings), d)])

I had a slight variation of the problem and google sends me here, so I think it will be useful to document:
I have a list like:
my_prefix_what_ever
my_prefix_what_so_ever
my_prefix_doesnt_matter
some_noise
some_other_noise
So I would expect my_prefix to be returned. That can be done with:
from collections import Counter
def get_longest_common_prefix(values, min_length):
substrings = [value[0: i-1] for value in values for i in range(min_length, len(value))]
counter = Counter(substrings)
# remove count of 1
counter -= Counter(set(substrings))
return max(counter, key=len)

In one line without using itertools, for no particular reason, although it does iterate through each character:
''.join([z[0] for z in zip(*(list(s) for s in strings)) if all(x==z[0] for x in z)])

Find the common prefix in all words from the given input string, if there is no common prefix print -1
stringList = ['my_prefix_what_ever', 'my_prefix_what_so_ever', 'my_prefix_doesnt_matter']
len2 = len( stringList )
if len2 != 0:
# let shortest word is prefix
prefix = min( stringList )
for i in range( len2 ):
word = stringList[ i ]
len1 = len( prefix )
# slicing each word as lenght of prefix
word = word[ 0:len1 ]
for j in range( len1 ):
# comparing each letter of word and prefix
if word[ j ] != prefix[ j ]:
# if letter does not match slice the prefix
prefix = prefix[ :j ]
break # after getting comman prefix move to next word
if len( prefix ) != 0:
print("common prefix: ",prefix)
else:
print("-1")
else:
print("string List is empty")

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Get certain items based on their formatting - python

Related

Count max substring of the same character

Is there an easy way to get the number of repeating character in a word?

How to stop over counting of duplicate letters in a list of strings

Finding regular expression with at least one repetition of each letter

Determine prefix from a set of (similar) strings

Categories

Resources