Repair one string relative to another in Python - python

Query "AAAAA-AAACAAA-AAAAAA"
Reference "AA-AATAAAAAAATAAAAAA"
In Python, how do I repair a string (Query) relative to a Reference string, where dashes in the Query are replaced by the corresponding Reference character, and dashes in the Reference delete the corresponding Query character?
"AAAAA-AAACAAA-AAAAAA" should become
"AAAATAAACAAATAAAAAA"
(where the parentheses in "AA()AA(T)AAACAAA(T)AAAAAA" highlight the modified characters)
Below is code that repairs the dashes in the Query relative to the Reference, which may or may not be helpful (line numbers are specific to the file and not relevant here; I apologize for the non-Pythonic code!). What I cannot do is modify the Query according to the dashes in the Reference.
if "Query identifier" in line:
Query = line[24:-12]
if "-" in Query:
indices = [i for i, x in enumerate(Query) if x == "-"]
QueryStringUntilFirstDash = Query[:indices[0]]
found = 2
if found ==2 and "Reference identifier" in line:
Ref = line[24:-12]
if len(indices) == 1:
QueryDashToEnd.append(Query[indices[0]+1:])
print QueryStringUntilFirstDash+Ref[indices[0]]+str(QueryDashToEnd[0])
del(A[:])
else:
while y < len(indices):
y+=1
if y < len(indices):
DashesMiddleofQuery.append(Query[indices[y-1]:indices[y]])
DashesMiddleofQuerySubstitution = [B.replace('-', Ref[indices[y-1]]) for B in B]
Concat= ''.join(B)
del(B[:])
print UID
print Beg+str(Concat)+Query[indices[-1]+1:]+">1"
found = 0
y = 0

IIUC, something like this might work:
>>> query = "AAAAA-AAACAAA-AAAAAA"
>>> ref = "AA-AATAAAAAAATAAAAAA"
>>> fixed = ''.join(r if q == '-' else '' if r == '-' else q
...                 for q, r in zip(query, ref))
>>>
>>> fixed
'AAAATAAACAAATAAAAAA'
Or if you want to push the logic into a function:
>>> def fixer(q, r):
...     if q == '-':
...         return r
...     if r == '-':
...         return ''
...     return q
...
>>> fixed = ''.join(map(fixer, query, ref))
>>> fixed
'AAAATAAACAAATAAAAAA'
I think it's easier to think in terms of pairs of characters, and what to do with those directly, rather than indices.
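To make the pairwise idea a bit more defensive, here is a minimal sketch that wraps it in a function and rejects inputs of different lengths (zip would otherwise silently truncate to the shorter string). The name repair and the length check are my own additions, not something the question asked for. Note that if both strings have a dash at the same position, the dash from the reference is kept.

```python
def repair(query, ref):
    """Repair query against ref, one aligned character pair at a time."""
    if len(query) != len(ref):
        # Assumption: the two strings are aligned, so they must be equally long.
        raise ValueError("query and ref must have the same length")
    pieces = []
    for q, r in zip(query, ref):
        if q == '-':
            pieces.append(r)   # gap in the query: take the reference character
        elif r == '-':
            continue           # gap in the reference: drop the query character
        else:
            pieces.append(q)   # no gap: keep the query character
    return ''.join(pieces)


print(repair("AAAAA-AAACAAA-AAAAAA", "AA-AATAAAAAAATAAAAAA"))
```
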


How to get the index of a repeating element in list?

I wanted to make a Japanese transliteration program.
I won't explain the details, but some characters have different values in pairs than they do separately, so I made a loop that gets two characters (the current one and the next one):
b = "きゃきゃ"
b = list(b)
name = ""
for i in b:
    if b.index(i) + 1 <= len(b) - 1:
        if i in "き / キ" and b[b.index(i) + 1] in "ゃ ャ":
            if b[b.index(i) + 1] != " ":
                del b[b.index(i) + 1]
                del b[int(b.index(i))]
                cur = "kya"
                name += cur
print(name)
But it always gives index 0 for "き", so I can't check it more than once.
How can I change that?
I tried deleting an element after analyzing it, but that didn't help.
Rather than looking ahead a character, it may be easier to store a reference to the previous character, and replace the previous transliteration if you find a combo match.
Example (I'm not sure if I got all of the transliterations correct):
COMBOS = {('き', 'ゃ'): 'kya', ('き', 'ャ'): 'kya', ('キ', 'ゃ'): 'kya', ('キ', 'ャ'): 'kya'}
TRANSLITERATIONS = {'き': 'ki', 'キ': 'ki', 'ャ': 'ya', 'ゃ': 'ya'}
def transliterate(text: str) -> str:
    transliterated = []
    last = None
    for c in text:
        try:
            combo = COMBOS[(last, c)]
        except KeyError:
            transliterated.append(TRANSLITERATIONS.get(c, c))
        else:
            transliterated.pop()  # remove the last value that was added
            transliterated.append(combo)
        last = c
    return ''.join(transliterated)  # combine the transliterations into a single str
That being said, rather than re-inventing the wheel, it may make more sense to use an existing library that already handles transliterating Japanese to romaji, such as Pykakasi.
Example:
>>> import pykakasi
>>> kks = pykakasi.kakasi()
>>> kks.convert('きゃ')
[{'orig': 'きゃ', 'hira': 'きゃ', 'kana': 'キャ', 'hepburn': 'kya', 'kunrei': 'kya', 'passport': 'kya'}]
If you are looking for the indices of 'き':
b = "きゃきゃ"
b = list(b)
indices = [i for i, x in enumerate(b) if x == "き"]
print(indices)
[0, 2]
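The underlying problem in the question is that list.index always returns the position of the first matching element, so a repeated character can never be found a second time. A rough sketch of the look-ahead idea that walks the text with an explicit position instead could look like this. The COMBOS and SINGLES tables are tiny, hypothetical stand-ins for illustration only, not a complete kana mapping.

```python
# Hypothetical, deliberately incomplete lookup tables.
COMBOS = {("き", "ゃ"): "kya", ("き", "ャ"): "kya",
          ("キ", "ゃ"): "kya", ("キ", "ャ"): "kya"}
SINGLES = {"き": "ki", "キ": "ki", "ゃ": "ya", "ャ": "ya"}


def transliterate_lookahead(text):
    out = []
    i = 0  # explicit position, so repeated characters are handled correctly
    while i < len(text):
        pair = tuple(text[i:i + 2])  # current character plus the next one, if any
        if pair in COMBOS:
            out.append(COMBOS[pair])
            i += 2  # both characters of the pair are consumed
        else:
            out.append(SINGLES.get(text[i], text[i]))  # unknown characters pass through
            i += 1
    return "".join(out)


print(transliterate_lookahead("きゃきゃ"))  # kyakya
```
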

python replace multiple occurrences of string with different values

I am writing a Python script that replaces every occurrence of a math function such as log with its value, but I ran into a problem: I am unable to replace multiple occurrences of a function with their respective answers.
text = "69+log(2)+log(3)-log(57)/420"
log_list = []
log_answer = []
z = ""
c = 0
hit_l = False
for r in text:
if hit_l:
c += 1
if c >= 4 and r != ")":
z += r
elif r == ")":
hit_l = False
if r == "l":
hit_l = True
log_list.append(z)
if z != '':
logs = log_list[-1]
logs = re.sub("og\\(", ";", logs)
log_list = logs.split(";")
for ans in log_list:
log_answer.append(math.log(int(ans)))
for a in log_answer:
text = re.sub(f"log\\({a}\\)", str(a), text)
I want to replace log(10) and log(2) with 1 and 0.301 respectively. I tried using re.sub, but it is not working: the function calls are not replaced with their answers. Any help will be appreciated, thank you.
Here is my take on this using eval along with re.sub with a callback function:
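import math
import re
x = "log(10)+log(2)"
# the callback receives each match object; group(1) is the number inside log(...)
output = re.sub(r'log\((\d+(?:\.\d+)?)\)', lambda m: str(eval('math.log(' + m.group(1) + ', 10)')), x)
print(output) # 1.0+0.301029995664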
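If you would rather avoid eval altogether, the same callback idea should work by converting the captured number with float and calling math.log10 directly. This is only a sketch: it assumes base-10 logarithms (as the expected 1 and 0.301 in the question suggest) and plain non-negative decimal arguments; replace_logs is a name I made up.

```python
import math
import re

# Matches log(<number>) where <number> is a plain decimal such as 2 or 2.5.
LOG_CALL = re.compile(r"log\((\d+(?:\.\d+)?)\)")


def replace_logs(expression):
    """Replace every log(<number>) in expression with its base-10 logarithm."""
    def evaluate(match):
        return str(math.log10(float(match.group(1))))
    return LOG_CALL.sub(evaluate, expression)


print(replace_logs("69+log(2)+log(3)-log(57)/420"))
```

Because re.sub calls the function once per match, each occurrence gets its own value, which is the part the original loop was struggling with.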
As long as your string contains no spaces and there are + signs between different logarithmic functions, eval could be a way to do it.
>>> a = 'log(10)+log(2)'
>>> b = a.split('+')
>>> b
['log(10)', 'log(2)']
>>> from math import log10 as log
>>> [eval(i) for i in b]
[1.0, 0.3010299956639812]
EDIT:
You could repeatedly use str.replace method to replace all mathematical operators (if there are more than one) with whitespaces and eventually use str.split like:
>>> text.replace('+', ' ').replace('-', ' ').replace('*', ' ').replace('/', ' ').split()
['69', 'log(2)', 'log(3)', 'log(57)', '420']

Looking for numeric characters in a single word string PYTHON

I have a variety of values in a text field of a CSV
Some values look something like this
AGM00BALDWIN
AGM00BOUCK
however, some have duplicates, changing the names to
AGM00BOUCK01
AGM00COBDEN01
AGM00COBDEN02
My goal is to write a specific ID to values NOT containing a numeric suffix
Here is the code so far
prov_count = 3000
prov_ID = 0
items = (name, x, y)
xy_tup = tuple(items)
if "*1" not in name and "*2" not in name:
prov_ID = prov_count + 1
else:
prov_ID = ""
It seems that the wildcard isn't the appropriate method here, but I can't seem to find an appropriate solution.
Using regular expressions seems appropriate here:
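import re
pattern = re.compile(r'\d+$')  # one or more digits at the end of the string
prov_count = 3000
prov_ID = 0
items = (name, x, y)
xy_tup = tuple(items)
# use search, not match: match only looks at the start of the string,
# and both return None (never False) when nothing is found
if pattern.search(name) is None:
    prov_ID = prov_count + 1
else:
    prov_ID = ""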
There are different ways to do it, one with the isdigit function:
a = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
for i in a:
    if i[-1].isdigit():  # can use i[-1] and i[-2] for both numbers
        print (i)
Using regex:
import re
a = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
pat = re.compile(r"^.*\d$") # can use "\d\d" instead of "\d" for 2 numbers
for i in a:
    if pat.match(i): print (i)
Another:
for i in a:
    if i[-1:] in map(str, range(10)): print (i)
All of the above methods print the inputs that have a numeric suffix:
AGM00BOUCK01
AGM00COBDEN01
AGM00COBDEN02
You can use slicing to find the last 2 characters of the element and then check if it ends with '01' or '02':
l = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
for i in l:
    if i[-2:] in ('01', '02'):
        print('{} is a duplicate'.format(i))
Output:
AGM00BOUCK01 is a duplicate
AGM00COBDEN01 is a duplicate
AGM00COBDEN02 is a duplicate
Or another way would be using the str.endswith method:
l = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
for i in l:
    if i.endswith('01') or i.endswith('02'):
        print('{} is a duplicate'.format(i))
So your code would look like this:
prov_count = 3000
prov_ID = 0
items = (name, x, y)
xy_tup = tuple(items)
# name[-2:] is the last two characters; IDs go to names WITHOUT a suffix
if name[-2:] not in ('01', '02'):
    prov_ID = prov_count + 1
else:
    prov_ID = ""
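Putting the pieces together, something along these lines might be closer to the stated goal of writing an ID only to names without a numeric suffix. It is a sketch under assumptions: assign_ids is a hypothetical helper, and I am guessing that each unsuffixed name should get the next number after prov_count, which the question does not actually say.

```python
import re

TRAILING_DIGITS = re.compile(r"\d+$")  # digits at the very end of the name


def assign_ids(names, start=3000):
    """Map each name to a new ID, or to "" if the name ends in digits."""
    ids = {}
    next_id = start
    for name in names:
        if TRAILING_DIGITS.search(name):
            ids[name] = ""        # duplicate (numeric suffix): no ID
        else:
            next_id += 1          # assumption: IDs count up from start
            ids[name] = next_id
    return ids


names = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
print(assign_ids(names))
```

The digits in the middle of a name (the "00" in "AGM00BALDWIN") do not matter here, because the pattern is anchored to the end of the string.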

Splitting an unspaced string of decimal values - Python

An awful person has given me a string like this
values = '.850000.900000.9500001.000001.50000'
and I need to split it to create the following list:
['.850000', '.900000', '.950000', '1.00000', '1.50000']
I know that if I were dealing only with numbers < 1 I could use the code
dl = '.'
splitvalues = [dl+e for e in values.split(dl) if e != ""]
But in cases like this one, where there are numbers greater than 1 buried in the string, splitvalues would end up being
['.850000', '.900000', '.9500001', '.000001', '.50000']
So is there a way to split a string with multiple delimiters while also splitting the string differently based on which delimiter is encountered?
I think this is somewhat closer to a fixed width format string. Try a regular expression like this:
import re
pattern = r"(\d{1,2}\.\d{5})"  # one or two digits, a dot, then five digits
m = re.search(pattern, input_str)  # input_str is the string you want to split
your_first_number = m.group(0)
Try this repeatedly on the remaining string to consume all numbers.
>>> import re
>>> source = '0.850000.900000.9500001.000001.50000'
>>> re.findall(r"(.*?00+(?!0))", source)
['0.850000', '.900000', '.950000', '1.00000', '1.50000']
The split is based on looking for "anything, a double zero, then a run of zeros (followed by a non-zero)".
Assume that the value before the decimal is less than 10, and then we have,
values = '0.850000.900000.9500001.000001.50000'
result = list()
last_digit = None
for value in values.split('.'):
    if value.endswith('0'):
        result.append(''.join([i for i in [last_digit, '.', value] if i]))
        last_digit = None
    else:
        result.append(''.join([i for i in [last_digit, '.', value[0:-1]] if i]))
        last_digit = value[-1]
if values.startswith('0'):
    result = result[1:]
print(result)
# Output
['.850000', '.900000', '.950000', '1.00000', '1.50000']
How about using re.split():
import re
values = '0.850000.900000.9500001.000001.50000'
print([a + b for a, b in zip(*(lambda x: (x[1::2], x[2::2]))(re.split(r"(\d\.)", values)))])
OUTPUT
['0.85000', '0.90000', '0.950000', '1.00000', '1.50000']
Here the numbers are of fixed width: 6 digits, or 7 characters including the dot. Take the slices from 0 to 7, 7 to 14, and so on. Because we don't need the initial zero, I use the slice values[1:] for extraction.
values = '0.850000.900000.9500001.000001.50000'
[values[1:][start:start+7] for start in range(0,len(values[1:]),7)]
['.850000', '.900000', '.950000', '1.00000', '1.50000']
Test:
''.join([values[1:][start:start+7] for start in range(0,len(values[1:]),7)]) == values[1:]
True
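The same fixed-width idea can be wrapped in a small helper. This is just a sketch and assumes every value really is exactly 7 characters wide, as in the original string without the leading zero; split_fixed_width is a made-up name.

```python
def split_fixed_width(text, width=7):
    """Cut text into consecutive chunks of exactly width characters."""
    if len(text) % width != 0:
        # Assumption violated: the string is not a whole number of chunks.
        raise ValueError("length of text is not a multiple of width")
    return [text[start:start + width] for start in range(0, len(text), width)]


values = '.850000.900000.9500001.000001.50000'
print(split_fixed_width(values))
# ['.850000', '.900000', '.950000', '1.00000', '1.50000']
```
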
With a fixed / variable string, you may try something like:
values = '0.850000.900000.9500001.000001.50000'
str_list = []
first_index = values.find('.')
while first_index > 0:
    last_index = values.find('.', first_index + 1)
    if last_index != -1:
        str_list.append(values[first_index - 1: last_index - 2])
        first_index = last_index
    else:
        str_list.append(values[first_index - 1: len(values) - 1])
        break
print(str_list)
Output:
['0.8500', '0.9000', '0.95000', '1.0000', '1.5000']
Assuming that there will always be a single digit before the decimal.
Please take this as a starting point and not a copy paste solution.

Funny behaviour of my recursive function

t = 8
string = "1 2 3 4 3 3 2 1"
string.replace(" ","")
string2 = [x for x in string]
print string2
for n in range(t-1):
    string2.remove(' ')
print string2
def remover(ca):
    newca = []
    print len(ca)
    if len(ca) == 1:
        return ca
    else:
        for i in ca:
            newca.append(int(i) - int(min(ca)))
        for x in newca:
            if x == 0:
                newca.remove(0)
        print newca
        return remover(newca)
print (remover(string2))
It's supposed to be a program that takes in a list of numbers, and for every number in the list it subtracts from it, the min(list). It works fine for the first few iterations but not towards the end. I've added print statements here and there to help out.
EDIT:
t = 8
string = "1 2 3 4 3 3 2 1"
string = string.replace(" ","")
string2 = [x for x in string]
print len(string2)
def remover(ca):
    newca = []
    if len(ca) == 1: return()
    else:
        for i in ca:
            newca.append(int(i) - int(min(ca)))
        while 0 in newca:
            newca.remove(0)
        print len(newca)
        return remover(newca)
print (remover(string2))
for x in newca:
    if x == 0:
        newca.remove(0)
Iterating over a list and removing things from it at the same time can lead to strange and unexpected behavior, because the loop skips the element that slides into the position just vacated. Try using a while loop instead.
while 0 in newca:
    newca.remove(0)
Or a list comprehension:
newca = [item for item in newca if item != 0]
Or create yet another temporary list:
newnewca = []
for x in newca:
    if x != 0:
        newnewca.append(x)
print newnewca
return remover(newnewca)
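To see the skipping in isolation, here is a small sketch that compares the remove-while-iterating loop with the list comprehension on a list containing two adjacent zeros. Both function names are made up for the demonstration.

```python
def drop_zeros_buggy(values):
    values = list(values)  # work on a copy so the caller's list is untouched
    for x in values:
        if x == 0:
            values.remove(0)  # shifts later items left; the loop then skips one
    return values


def drop_zeros(values):
    return [v for v in values if v != 0]  # builds a new list, nothing is skipped


print(drop_zeros_buggy([0, 0, 1, 2]))  # [0, 1, 2]  (one zero survives)
print(drop_zeros([0, 0, 1, 2]))        # [1, 2]
```

When the zeros are not next to each other the buggy loop happens to work, which would explain why the original program looked fine for the first few iterations.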
(Not a real answer, JFYI:)
Your program can be waaay shorter if you decompose it into proper parts.
def aboveMin(items):
    min_value = min(items)  # only calculate it once
    return differenceWith(min_value, items)
def differenceWith(min_value, items):
    result = []
    for value in items:
        result.append(value - min_value)
    return result
The above pattern can, as usual, be replaced with a comprehension:
def differenceWith(min_value, items):
    return [value - min_value for value in items]
Try it:
>>> print aboveMin([1, 2, 3, 4, 5])
[0, 1, 2, 3, 4]
Note how no item is ever removed, and that data are generally not mutated at all. This approach helps reason about programs a lot; try it.
So IF I've understood the description of what you expect,
I believe the script below would result in something closer to your goal.
Logic:
split returns a list of each "number" passed to raw_input. Even if you had used the output of replace, you would end up with one very long number (the spaces that separated the numbers are gone), and your list comprehension then splits that string into single digits, which does not match your described intent.
You should test that each input provided is an integer.
As you already print in your function, there is no need for it to return anything.
Avoid adding zeros to your new array; just test first.
string = raw_input()
array = string.split()
intarray = []
for x in array:
    try:
        intarray.append(int(x))
    except ValueError:  # skip anything that is not an integer
        pass
def remover(arrayofint):
    newarray = []
    minimum = min(arrayofint)
    for i in arrayofint:  # iterate over the argument, not the global list of strings
        if i > minimum:
            newarray.append(i - minimum)
    if len(newarray) > 0:
        print(newarray)
        remover(newarray)
remover(intarray)
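If the recursion itself is not a requirement, the same process can be sketched as a plain loop that collects each intermediate list instead of printing it. reduction_steps is a hypothetical name, and returning the steps as a list is my choice, not something taken from the question.

```python
def reduction_steps(numbers):
    """Repeatedly subtract the minimum and drop the resulting zeros."""
    steps = []
    current = list(numbers)
    while current:
        smallest = min(current)
        # Keeping only n > smallest is the same as subtracting, then dropping zeros.
        current = [n - smallest for n in current if n > smallest]
        if current:
            steps.append(current)
    return steps


for step in reduction_steps([1, 2, 3, 4, 3, 3, 2, 1]):
    print(step)
# [1, 2, 3, 2, 2, 1]
# [1, 2, 1, 1]
# [1]
```
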
