masking alphanumeric strings to similar format


I'm trying to impose a format on an alphanumeric string so that every digit becomes a 9 and every letter becomes an A.
e.g. N43563 -> A99999
e.g. dhfgb85fb -> AAAAA99AA
Something along these lines, in Python. I have tried regex, but I found it a bit confusing, which is why I'm asking for assistance.

>>> import re
>>> result1 = re.sub(r'[a-zA-Z]', 'A', 'dhfgb85fb')
>>> result2 = re.sub(r'[0-9]', '9', result1)
>>> result2
'AAAAA99AA'
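The two substitutions can also be folded into a single pass by giving re.sub a callback. A minimal sketch, assuming ASCII input; mask_re is a hypothetical name chosen for illustration:

```python
import re

def mask_re(text):
    # One pass over the string: the callback picks the replacement
    # for each matched letter or digit; other characters are untouched.
    return re.sub(
        r'[A-Za-z0-9]',
        lambda m: '9' if m.group().isdigit() else 'A',
        text,
    )

print(mask_re('dhfgb85fb'))  # AAAAA99AA
print(mask_re('N43563'))     # A99999
```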

You don't need re for that. If you only want to replace each digit with a 9 and each letter with an A, you can do this:
sample = ['N43563', 'dhfgb85fb']
for s in sample:
    new_s = ''.join(
        '9' if letter.isdigit() else 'A' for letter in s
    )
    print(new_s)
Output:
A99999
AAAAA99AA
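The join above turns every non-digit character into an A, including punctuation or spaces. If only letters and digits should be masked, a translation table is one option. This is a sketch assuming ASCII input; MASK_TABLE and mask are names made up for this example:

```python
import string

# Map every ASCII letter to 'A' and every ASCII digit to '9'.
# Characters missing from the table (e.g. '-', ' ') pass through unchanged.
MASK_TABLE = str.maketrans(
    string.ascii_letters + string.digits,
    "A" * len(string.ascii_letters) + "9" * len(string.digits),
)

def mask(text):
    return text.translate(MASK_TABLE)

print(mask("N43563"))     # A99999
print(mask("dhfgb85fb"))  # AAAAA99AA
print(mask("ab-12"))      # AA-99
```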

Related

How can I check a string for two letters or more?

I am pulling data from a table that changes often using Python, and the method I am using is not ideal. What I would like is a method to pull all strings that contain at most one letter and leave out anything with two or more.
An example of data I might get:
115
19A6
HYS8
568
In this example, I would like to pull 115, 19A6, and 568.
Currently I am using the isdigit() method to determine whether a string is all digits, but this also filters out numbers with one letter, which works for some purposes but is less than ideal.

Try this:
string_list = ["115", "19A6", "HYS8", "568"]
output_list = []
for item in string_list:  # goes through the string list
    letter_counter = 0
    for letter in item:  # goes through the characters of one string
        if not letter.isdigit():  # counts every character that is not a digit
            letter_counter += 1
    if letter_counter < 2:  # strings with more than one letter are left out
        output_list.append(item)
print(output_list)
Output:
['115', '19A6', '568']
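The same check can be written more compactly by counting the non-digit characters with sum. A sketch equivalent to the loop above; the helper name at_most_one_non_digit is hypothetical:

```python
def at_most_one_non_digit(item):
    # Each non-digit character contributes True (counted as 1) to the sum.
    return sum(not ch.isdigit() for ch in item) < 2

string_list = ["115", "19A6", "HYS8", "568"]
output_list = [item for item in string_list if at_most_one_non_digit(item)]
print(output_list)  # ['115', '19A6', '568']
```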

Here is a one-liner with a regular expression:
import re
data = ["115", "19A6", "HYS8", "568"]
out = [string for string in data if len(re.sub(r"\d", "", string)) < 2]
print(out)
Output:
['115', '19A6', '568']

This is a good case for regular expressions (regex), which are available through the built-in re module.
The code below follows this logic:
Define the dataset. Two examples ('H' and 'HI') have been added to show that a string containing two alpha-characters is rejected.
Compile a character pattern to be matched. In this case, zero or more digits, followed by zero or one upper-case letter, ending with zero or more digits.
Use the filter function to detect matches in the data list and output the result as a list.
For example:
import re
data = ['115', '19A6', 'HYS8', '568', 'H', 'HI']
rexp = re.compile(r'^\d*[A-Z]{0,1}\d*$')
result = list(filter(rexp.match, data))
print(result)
Output:
['115', '19A6', '568', 'H']
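The pattern above only accepts upper-case letters. If lower-case letters should count as well (an assumption, since the question's data is all upper-case), a variant with fullmatch might look like the sketch below; the extra item '7b2' is made up for illustration:

```python
import re

# Digits, then at most one ASCII letter of either case, then digits.
pattern = re.compile(r"\d*[A-Za-z]?\d*")

data = ['115', '19A6', 'HYS8', '568', 'H', 'HI', '7b2']
# fullmatch succeeds only if the whole string fits the pattern,
# so no ^ or $ anchors are needed.
result = [s for s in data if pattern.fullmatch(s)]
print(result)  # ['115', '19A6', '568', 'H', '7b2']
```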

Another solution, without re using str.maketrans/str.translate:
lst = ["115", "19A6", "HYS8", "568"]
d = str.maketrans(dict.fromkeys(map(str, range(10)), ""))
out = [i for i in lst if len(i.translate(d)) < 2]
print(out)
Prints:
['115', '19A6', '568']

z = False
a = str(a)
for I in range(len(a)):
    if a[I].isdigit():
        z = True
        break
else:
    z = "no digit"
print(z)

Common substring in list of strings

I ran into a problem while trying to solve an exercise where, given some strings and their lengths, you need to find their common substring. My code for the part that loops through the list and then through each word in it is this:
num_of_cases = int(input())
for i in range(1, num_of_cases+1):
    if __name__ == '__main__':
        len_of_str = list(map(int, input().split()))
        len_of_virus = int(input())
        strings = []
        def string(strings, len_of_str):
            len_of_list = len(len_of_str)
            for i in range(1, len_of_list+1):
                strings.append(input())
        lst_of_subs = []
        virus_index = []
        def substr(strings, len_of_virus):
            for word in strings:
                for i in range(len(len_of_str)):
                    leng = word[i:len_of_virus]
                    lst_of_subs.append(leng)
                    virus_index.append(i)
        print(string(strings, len_of_str))
        print(substr(strings, len_of_virus))
And it prints the following given the strings: ananasso, associazione, tassonomia, massone
['anan', 'nan', 'an', 'n', 'asso', 'sso', 'so', 'o', 'tass', 'ass', 'ss', 's', 'mass', 'ass', 'ss', 's']
It seems that the end index doesn't increase, although I tried writing len_of_virus += 1 at the end of the loop.
sample input:
1
8 12 10 7
4
ananasso
associazione
tassonomia
massone
where the first line is the number of cases, the second line gives the lengths of the strings, the third is the length of the virus (the common substring), and then come the strings that I should loop through.
expected output:
Case #1: 4 0 1 1
where the four numbers are the starting indexes of the common substring. (I don't think the printing code matters for this particular problem.)
What should i do? Please help!!

The problem, besides defining functions in odd places and relying on their side effects in ways that aren't really encouraged, is here:
for i in range(len(len_of_str)):
    leng = word[i:len_of_virus]
i increases with each iteration, but len_of_virus stays the same, so you are effectively doing
word[0:4] #when len_of_virus=4
word[1:4]
word[2:4]
word[3:4]
...
That is where 'anan', 'nan', 'an', 'n' come from for the first word, "ananasso", and likewise for the others:
>>> word="ananasso"
>>> len_of_virus = 4
>>> for i in range(len(word)):
...     word[i:len_of_virus]
...
'anan'
'nan'
'an'
'n'
''
''
''
''
>>>
You can fix it by moving the upper end by i, but that leaves the same problem at the other end:
>>> for i in range(len(word)):
...     word[i:len_of_virus+i]
...
'anan'
'nana'
'anas'
'nass'
'asso'
'sso'
'so'
'o'
>>>
So a simple adjustment to the range solves the problem:
>>> for i in range(len(word)-len_of_virus+1):
...     word[i:len_of_virus+i]
...
'anan'
'nana'
'anas'
'nass'
'asso'
>>>
Now that the substring part is done, the rest is also easy:
>>> def substring(text,size):
...     return [text[i:i+size] for i in range(len(text)-size+1)]
...
>>> def find_common(lst_text,size):
...     subs = [set(substring(x,size)) for x in lst_text]
...     return set.intersection(*subs)
...
>>> test="""ananasso
associazione
tassonomia
massone""".split()
>>> find_common(test,4)
{'asso'}
>>>
To find the part common to all the strings in our list we can use sets: first we put all the substrings of each word into a set, and finally we intersect them all.
The rest is just printing it to your liking:
>>> virus = find_common(test,4).pop()
>>> print("case 1:",*[x.index(virus) for x in test])
case 1: 4 0 1 1
>>>
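Putting the pieces together, the whole exercise could be sketched as below. This assumes the input layout given in the question and that each case has at least one common substring; solve and substrings are hypothetical names, and the input is passed as a list of lines rather than read with input() so the example is easy to run:

```python
def substrings(text, size):
    # All contiguous slices of the given size, as a set.
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def solve(lines):
    # lines: the raw input split into lines, laid out as in the question.
    it = iter(lines)
    results = []
    for case in range(1, int(next(it)) + 1):
        lengths = next(it).split()   # one length per string
        size = int(next(it))         # length of the common substring
        words = [next(it).strip() for _ in lengths]
        common = set.intersection(*(substrings(w, size) for w in words))
        virus = min(common)          # raises ValueError if nothing is common
        indexes = " ".join(str(w.index(virus)) for w in words)
        results.append(f"Case #{case}: {indexes}")
    return results

sample = """1
8 12 10 7
4
ananasso
associazione
tassonomia
massone""".splitlines()
print("\n".join(solve(sample)))  # Case #1: 4 0 1 1
```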

First extract all the substrings of the given size from the shortest string. Then select the first of these substrings that is present in all of the strings. Finally output the position of this common substring in each of the strings:
def commonSubs(strings,size):
    base = min(strings,key=len) # shortest string
    subs = [base[i:i+size] for i in range(len(base)-size+1)] # all substrings
    cs = next(ss for ss in subs if all(ss in s for s in strings)) # first common
    return [s.index(cs) for s in strings] # indexes of common substring
Output:
S = ["ananasso", "associazione", "tassonomia", "massone"]
print(commonSubs(S,4))
[4, 0, 1, 1]
You could also use a recursive approach:
def commonSubs(strings,size,i=0):
    sub = strings[0][i:i+size]
    if all(sub in s for s in strings):
        return [s.index(sub) for s in strings]
    return commonSubs(strings,size,i+1)
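Both versions assume that a common substring of the requested size exists. Without one, the next() call raises StopIteration, and the recursive version eventually tests shorter slices (down to the empty string, which is contained in every string) and returns misleading indexes. A sketch of a variant that returns None instead; common_subs_or_none is a hypothetical name:

```python
def common_subs_or_none(strings, size):
    base = min(strings, key=len)  # the shortest string limits the candidates
    candidates = (base[i:i + size] for i in range(len(base) - size + 1))
    # next() with a default returns None instead of raising StopIteration.
    cs = next((c for c in candidates if all(c in s for s in strings)), None)
    if cs is None:
        return None
    return [s.index(cs) for s in strings]

S = ["ananasso", "associazione", "tassonomia", "massone"]
print(common_subs_or_none(S, 4))                 # [4, 0, 1, 1]
print(common_subs_or_none(["abcd", "wxyz"], 2))  # None
```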

from suffix_trees import STree
STree.STree(["come have some apple pies",
'apple pie available',
'i love apple pie haha']).lcs()
The simplest way is to use STree from the third-party suffix_trees package; lcs() returns the longest common substring.

Finding element in a string and smarter manipulation

I have a list of characters
a = ["s", "a"]
I have some words.
b = "asp"
c= "lat"
d = "kasst"
I know that the characters in the list can appear only once or in linear order(or at most on small set can appear in the bigger one).
I would like to split my words by putting the elements of a in the middle and the rest on the left or on the right (and put a "=" if there is nothing)
so b = ["*", "as", "p"]
If a bigger set of characters which contains
d = ["k", "ass", "t"]
I know that the combinations can be at most of length 4.
So I have divided the possible combinations depending on the length:
import itertools
c4 = [''.join(i) for i in itertools.product(a, repeat = 4)]
c3 = [''.join(i) for i in itertools.product(a, repeat = 3)]
c2 = [''.join(i) for i in itertools.product(a, repeat = 2)]
c1 = [''.join(i) for i in itertools.product(a, repeat = 1)]
For each c, starting with the greater
For simplicity, let's say I start with c3 in this case and not with length 4.
I have to do this with a lot of data.
Is there a way to simplify the code ?

You can do something similar using a regular expression:
>>> import re
>>> p = re.compile(r'([sa]{1,4})')
p matches the characters 's' or 'a' repeated between 1 and 4 times.
To split a given string at this pattern, use p.split. The use of capturing parentheses in the pattern leads to the pattern itself being included in the result.
>>> p.split('asp')
['', 'as', 'p']
>>> p.split('lat')
['l', 'a', 't']
>>> p.split('kasst')
['k', 'ass', 't']
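To get the placeholder the question asks for, the empty strings returned by split can be swapped out afterwards. A sketch building the pattern from the list a; split_word and its placeholder parameter are hypothetical, and maxsplit=1 is an added assumption so that the result never has more than three parts:

```python
import re

a = ["s", "a"]
# re.escape guards against characters that are special inside [...].
p = re.compile("([" + re.escape("".join(a)) + "]{1,4})")

def split_word(word, placeholder="*"):
    # maxsplit=1 stops after the first run of characters from a,
    # even if they show up again later in the word.
    parts = p.split(word, maxsplit=1)
    return [part if part else placeholder for part in parts]

print(split_word("asp"))    # ['*', 'as', 'p']
print(split_word("kasst"))  # ['k', 'ass', 't']
print(split_word("lat"))    # ['l', 'a', 't']
```

A word containing none of the characters comes back as a one-element list, so callers may want to check the length of the result.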

Use regex ?
import re
a = ["s", "a"]
text = "kasst"
pattern = re.compile("[" + "".join(a) + "]{1,4}")
match = pattern.search(text)
parts = [text[:match.start()], text[match.start():match.end()], text[match.end():]]
parts = [part if part else "*" for part in parts]
However, note that this won't handle the case when there is no match on the elements in a

I would do a regular expression to simplify the matching.
import re
splitters = ''.join(a)
pattern = re.compile("([^%s]*)([%s]+)([^%s]*)" % (splitters, splitters, splitters))
words = [v if v else '=' for v in pattern.match(s).groups()]  # s is the word to split
This doesn't allow the characters in the first or last group, so not every string will match. When there is no match, pattern.match returns None and calling .groups() on it raises an exception. You can allow them if you want. Feel free to modify the regular expression to better match what you want it to do.
Also you only need to run the re.compile once, not for every string you are trying to match.
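A sketch of how this could look when processing many words: the pattern is compiled once, and words that do not fit return None instead of raising. Using fullmatch rather than match is a substitution made here so that words with two separate runs of the characters are rejected rather than silently truncated; split_or_none is a hypothetical name:

```python
import re

a = ["s", "a"]
chars = re.escape("".join(a))
# Compiled once, reused for every word.
pattern = re.compile("([^%s]*)([%s]+)([^%s]*)" % (chars, chars, chars))

def split_or_none(word):
    m = pattern.fullmatch(word)
    if m is None:  # no run of the characters, or more than one run
        return None
    return [g if g else "=" for g in m.groups()]

for word in ["asp", "lat", "kasst", "xyz", "asxa"]:
    print(word, split_or_none(word))
```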

Python find element in list that ends with number

I have a list of strings, and I want to find all the strings that end with _1234, where 1234 can be any 4-digit number. Ideally I would find all the matching elements and what the digits actually are, or at least return the first matching element and its 4 digits.
For example, I have
['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
I want to get
['1024', '0510']
So far I have found that _\d{4}$ will match _1234 and return a match object, and that match_object.group(0) is the actual matched string. But is there a better way to look for _\d{4}$ that returns only the \d{4} part, without the _?

Use re.search():
import re
lst = ['A', 'BB_1024', 'CQ_2', 'x_0510']
newlst = []
for item in lst:
    match = re.search(r'_(\d{4})\Z', item)
    if match:
        newlst.append(match.group(1))
print(newlst) # ['1024', '0510']
As for the regex, the pattern matches an underscore and exactly 4 digits at the end of the string, capturing only the digits (note the parens). The captured group is then accessible via match.group(1) (remember that group(0) is the entire match).
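The question also asks for the matching elements themselves, or at least the first one. A sketch that pairs each element with its captured digits, assuming Python 3.8+ for the := operator; the names pairs and first are made up here:

```python
import re

lst = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
pattern = re.compile(r'_(\d{4})\Z')  # compiled once, reused per item

# Pair each matching element with the digits captured by group 1.
pairs = [(item, m.group(1)) for item in lst if (m := pattern.search(item))]
print(pairs)  # [('BB_1024', '1024'), ('x_0510', '0510')]

# First match only, or None if nothing matches.
first = next(iter(pairs), None)
print(first)  # ('BB_1024', '1024')
```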

import re
src = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', 'AB2421', 'D3&1345']
res = []
p = re.compile(r'.*\D(\d{4})$')
for s in src:
    m = p.match(s)
    if m:
        res.append(m.group(1))
print(res)
Works fine. \D means "not a digit", so it will also match 'AB2421', 'D3&1345' and so on.

Please show some code next time you ask a question here, even if it doesn't work at all. It makes it easier for people to help you.
If you're interested in a solution without any regex, here's a way with list comprehensions:
>>> data = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
>>> endings = [text.split('_')[-1] for text in data]
>>> endings
['A', '1024', '2', '0510', '98765']
>>> [x for x in endings if x.isdigit() and len(x)==4]
['1024', '0510']

Try this:
[s[-4:] for s in lst if s[-4:].isdigit() and len(s) > 4]
This just checks whether the last four characters are digits. Note that it does not look for the underscore, so 'y_98765' would also yield '8765'.
I added len(s) > 4 to correct the mistake Joran pointed out.

Try this code:
import re
r = re.compile(r".*?([0-9]+)$")
newlist = list(filter(r.match, mylist))  # mylist is your list of strings
print(newlist)

How do I extract certain digits from raw input in Python?

Let's say I ask a user for some random letters and numbers, and they give me 1254jf4h. How would I take the letters jfh and put them into a separate variable, and then take the numbers 12544 and put them into another variable?

>>> s = "1254jf4h"
>>> num = []
>>> alpha = []
>>> for i in s:
...     if i.isdigit():
...         num.append(i)
...     else:
...         alpha.append(i)
...
>>> alpha
['j', 'f', 'h']
>>> num
['1', '2', '5', '4', '4']

A for loop is simple enough. Personally, I would use filter(). In Python 3 it returns an iterator, so join the result to get a string back:
s = "1254jf4h"
nums = "".join(filter(lambda x: x.isdigit(), s))
chars = "".join(filter(lambda x: x.isalpha(), s))
print(nums)  # 12544
print(chars)  # jfh

edit: oh well, you already got your answer. Ignore.
NUMBERS = "0123456789"
LETTERS = "abcdefghijklmnopqrstuvwxyz"
def numalpha(string):
    # In Python 3, str.maketrans('', '', chars) builds a table that deletes chars
    return (string.translate(str.maketrans('', '', NUMBERS)),
            string.translate(str.maketrans('', '', LETTERS)))
print(numalpha("14asdf129h53"))
The function numalpha returns a 2-tuple of strings: the first contains the original string with the digits removed (here, the letters), the second contains it with the lower-case letters removed (here, the digits).
Note that this is inefficient, as it traverses the string twice and doesn't take advantage of the fact that digits and letters have consecutive ASCII codes, though it does have the advantage of being easy to adapt to other character sets.
Note also that I only extracted lower-case letters. Yeah, it's not the best code snippet I've ever written xD. Hope it helps anyway.
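For comparison, a single-pass version that also handles upper-case letters might look like this sketch; split_digits_letters is a hypothetical name, and dropping characters that are neither letters nor digits is an assumption:

```python
def split_digits_letters(text):
    digits, letters = [], []
    for ch in text:          # one pass over the input
        if ch.isdigit():
            digits.append(ch)
        elif ch.isalpha():   # upper- and lower-case letters
            letters.append(ch)
        # anything else (spaces, punctuation) is dropped
    return "".join(letters), "".join(digits)

print(split_digits_letters("1254jf4h"))  # ('jfh', '12544')
```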
