Looking for numeric characters in a single word string PYTHON - python

I have a variety of values in a text field of a CSV
Some values look something like this
AGM00BALDWIN
AGM00BOUCK
however, some have duplicates, changing the names to
AGM00BOUCK01
AGM00COBDEN01
AGM00COBDEN02
My goal is to write a specific ID to values NOT containing a numeric suffix
Here is the code so far
prov_count = 3000
prov_ID = 0
items = (name, x, y)
xy_tup = tuple(items)
if "*1" not in name and "*2" not in name:
prov_ID = prov_count + 1
else:
prov_ID = ""
It seems that the the wildcard isn't the appropriate method here but I can't seem to find an appropriate solution.

Using regular expressions seems appropriate here:
import re
pattern= re.compile(r'(\d+$)')
prov_count = 3000
prov_ID = 0
items = (name, x, y)
xy_tup = tuple(items)
if pattern.match(name)==False:
prov_ID = prov_count + 1
else:
prov_ID = ""

There are different ways to do it, one with the isdigit function:
a = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
for i in a:
if i[-1].isdigit(): # can use i[-1] and i[-2] for both numbers
print (i)
Using regex:
import re
a = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
pat = re.compile(r"^.*\d$") # can use "\d\d" instead of "\d" for 2 numbers
for i in a:
if pat.match(i): print (i)
another:
for i in a:
if name[-1:] in map(str, range(10)): print (i)
all above methods return inputs with numeric suffix:
AGM00BOUCK01
AGM00COBDEN01
AGM00COBDEN02

You can use slicing to find the last 2 characters of the element and then check if it ends with '01' or '02':
l = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
for i in l:
if i[-2:] in ('01', '02'):
print('{} is a duplicate'.format(i))
Output:
AGM00BOUCK01 is a duplicate
AGM00COBDEN01 is a duplicate
AGM00COBDEN02 is a duplicate
Or another way would be using the str.endswith method:
l = ["AGM00BALDWIN", "AGM00BOUCK", "AGM00BOUCK01", "AGM00COBDEN01", "AGM00COBDEN02"]
for i in l:
if i.endswith('01') or i.endswith('02'):
print('{} is a duplicate'.format(i))
So your code would look like this:
prov_count = 3000
prov_ID = 0
items = (name, x, y)
xy_tup = tuple(items)
if name[-2] in ('01', '02'):
prov_ID = prov_count + 1
else:
prov_ID = ""

Related

split or chunk dynamic string into specific parts and merging in python

Is there way to split or chunk the dynamic string into fixed size? let me explain:
Suppose:
name = Natalie
Family = David12
length = len(name) #7 bit
length = len(Family) # 7 bit
i want to split the name and family into and merging as :
result=nadatavilid1e2
and again split and extract the the 2 string as
x= Natalie
y= david
another Example:
Name = john
Family= mark
split and merging:
result= jomahnrk
and again split and extract the the 2 string as
x=john
y= mark
.
Remember variable name and family have different size length every time not static! . i hope my question is clear. i have seen some related solution about it like here and here and here and here and here and here and here but none of these work with what im looking for. Any suggestion ?? Thanks
i'm using spyder python 3.6.4
I have try this code split data into two parts:
def split(data):
indices = list(int(x) for x in data[-1:])
data = data[:-1]
rv = []
for i in indices[::-1]:
rv.append(data[-i:])
data=data[:-i]
rv.append(data)
return rv[::-1]
data='Natalie'
x,c=split(str(data))
print (x)
print (c)
Given you have stated names will always be of equal length you could use wrap to split in to 2 char pairs and the zip and chain to join them up. In the split part you can again use wwrap to split in 2 char pairs but if the number of pairs is odd then you need to split the last pair into 2 single entries. something like.
from textwrap import wrap
from itertools import chain
def merge_names(name, family):
name_split = wrap(name, 2)
family_split = wrap(family, 2)
return "".join(chain(*zip(name_split, family_split)))
def split_names(merged_name):
names = ["", ""]
char_pairs = wrap(merged_name, 2)
if len(char_pairs) % 2:
char_pairs.append(char_pairs[-1][1])
char_pairs[-2] = char_pairs[-2][0]
for index, chars in enumerate(char_pairs):
pos = 1 if index % 2 else 0
names[pos] += chars
return names
print(merge_names("john", "mark"))
print(split_names("jomahnrk"))
print(merge_names("stephen", "natalie"))
print(split_names("stnaeptaheline"))
print(merge_names("Natalie", "David12"))
print(split_names("NaDatavilid1e2"))
OUTPUT
jomahnrk
['john', 'mark']
stnaeptaheline
['stephen', 'natalie']
NaDatavilid1e2
['Natalie', 'David12']
Something like:
a = "Eleonora"
b = "James"
l = max(len(a), len(b))
a = a.lower() + " " * (l-len(a))
b = b.lower() + " " * (l-len(b))
n = 2
a = [a[i:i+n] for i in range(0, len(a), n)]
b = [b[i:i+n] for i in range(0, len(b), n)]
ans = "".join(map(lambda xy: "".join(xy), zip(a, b))).replace(" ", "")
Giving for this example:
eljaeomenosra

Cut string within a specific pattern in python

I have string of some length consisting of only 4 characters which are 'A,T,G and C'. I have pattern 'GAATTC' present multiple times in the given string. I have to cut the string at intervals where this pattern is..
For example for a string, 'ATCGAATTCATA', I should get output of
string one - ATCGA
string two - ATTCATA
I am newbie in using Python but I have come up with the following (incomplete) code:
seq = seq.upper()
str1 = "GAATTC"
seqlen = len(seq)
seq = list(seq)
for i in range(0,seqlen-1):
site = seq.find(str1)
print(site[0:(i+2)])
Any help would be really appreciated.
First lets develop your idea of using find, so you can figure out your mistakes.
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GAATTC"
split_at = 2
seqlen = len(seq)
i = 0
while i < seqlen:
site = seq.find(pattern, i)
if site != -1:
print(seq[i: site + split_at])
i = site + split_at
else:
print seq[i:]
break
Yet python string sports a powerful replace method that directly replaces fragments of string. The below snippet uses the replace method to insert separators when needed:
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GA","ATTC"
pattern1 = ''.join(pattern) # 'GAATTC'
pattern2 = ' '.join(pattern) # 'GA ATTC'
splited_seq = seq.replace(pattern1, pattern2) # 'ATCGA ATTCATAATCGA ATTCATAATCGA ATTCATA'
print (splited_seq.split())
I believe it is more intuitive and should be faster then RE (which might have lower performance, depending on library and usage)
Here is a simple solution :
seq = 'ATCGAATTCATA'
seq_split = seq.upper().split('GAATTC')
result = [
(seq_split[i] + 'GA') if i % 2 == 0 else ('ATTC' + seq_split[i])
for i in range(len(seq_split)) if len(seq_split[i]) > 0
]
Result :
print(result)
['ATCGA', 'ATTCATA']
BioPython has a restriction enzyme package to do exactly what you're asking.
from Bio.Restriction import *
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
print(EcoRI.site) # You will see that this is the enzyme you listed above
test = 'ATCGAATTCATA'.upper() # This is the sequence you want to search
my_seq = Seq(test, IUPACAmbiguousDNA()) # Create a biopython Seq object with our sequence
cut_sites = EcoRI.search(my_seq)
cut_sites contain a list of exactly where to cut the input sequence (such that GA is in the left sequence and ATTC is in the right sequence.
You can then split the sequence into contigs using:
cut_sites = [0] + cut_sites # We add a leading zero so this works for the first
# contig. This might not always be needed.
contigs = [test[i:j] for i,j in zip(cut_sites, cut_sites[1:]+[None])]
You can see this page for more details about BioPython.
My code is a bit sloppy, but you could try something like this when you want to iterate over multiple occurrences of the string
def split_strings(seq):
string1 = seq[:seq.find(str1) +2]
string2 = seq[seq.find(str1) +2:]
return string1, string2
test = 'ATCGAATTCATA'.upper()
str1 = 'GAATTC'
seq = test
while str1 in seq:
string1, seq = split_strings(seq)
print string1
print seq
Here's a solution using the regular expression module:
import re
seq = 'ATCGAATTCATA'
restriction_site = re.compile('GAATTC')
subseq_start = 0
for match in restriction_site.finditer(seq):
print seq[subseq_start:match.start()+2]
subseq_start = match.start()+2
print seq[subseq_start:]
Output:
ATCGA
ATTCATA

Funny behaviour of my recursive function

t = 8
string = "1 2 3 4 3 3 2 1"
string.replace(" ","")
string2 = [x for x in string]
print string2
for n in range(t-1):
string2.remove(' ')
print string2
def remover(ca):
newca = []
print len(ca)
if len(ca) == 1:
return ca
else:
for i in ca:
newca.append(int(i) - int(min(ca)))
for x in newca:
if x == 0:
newca.remove(0)
print newca
return remover(newca)
print (remover(string2))
It's supposed to be a program that takes in a list of numbers, and for every number in the list it subtracts from it, the min(list). It works fine for the first few iterations but not towards the end. I've added print statements here and there to help out.
EDIT:
t = 8
string = "1 2 3 4 3 3 2 1"
string = string.replace(" ","")
string2 = [x for x in string]
print len(string2)
def remover(ca):
newca = []
if len(ca) == 1: return()
else:
for i in ca:
newca.append(int(i) - int(min(ca)))
while 0 in newca:
newca.remove(0)
print len(newca)
return remover(newca)
print (remover(string2))
for x in newca:
if x == 0:
newca.remove(0)
Iterating over a list and removing things from it at the same time can lead to strange and unexpected behvaior. Try using a while loop instead.
while 0 in newca:
newca.remove(0)
Or a list comprehension:
newca = [item for item in newca if item != 0]
Or create yet another temporary list:
newnewca = []
for x in newca:
if x != 0:
newnewca.append(x)
print newnewca
return remover(newnewca)
(Not a real answer, JFYI:)
Your program can be waaay shorter if you decompose it into proper parts.
def aboveMin(items):
min_value = min(items) # only calculate it once
return differenceWith(min_value, items)
def differenceWith(min_value, items):
result = []
for value in items:
result.append(value - min_value)
return result
The above pattern can, as usual, be replaced with a comprehension:
def differenceWith(min_value, items):
return [value - min_value for value in items]
Try it:
>>> print aboveMin([1, 2, 3, 4, 5])
[0, 1, 2, 3, 4]
Note how no item is ever removed, and that data are generally not mutated at all. This approach helps reason about programs a lot; try it.
So IF I've understood the description of what you expect,
I believe the script below would result in something closer to your goal.
Logic:
split will return an array composed of each "number" provided to raw_input, while even if you used the output of replace, you'd end up with a very long number (you took out the spaces that separated each number from one another), and your actual split of string splits it in single digits number, which does not match your described intent
you should test that each input provided is an integer
as you already do a print in your function, no need for it to return anything
avoid adding zeros to your new array, just test first
string = raw_input()
array = string.split()
intarray = []
for x in array:
try:
intarray.append(int(x))
except:
pass
def remover(arrayofint):
newarray = []
minimum = min(arrayofint)
for i in array:
if i > minimum:
newarray.append(i - minimum)
if len(newarray) > 0:
print newarray
remover(newarray)
remover(intarray)

formatting list to convert into string

Here is my question
count += 1
num = 0
num = num + 1
obs = obs_%d%(count)
mag = mag_%d%(count)
while num < 4:
obsforsim = obs + mag
mylist.append(obsforsim)
for index in mylist:
print index
The above code gives the following results
obs1 = mag1
obs2 = mag2
obs3 = mag3
and so on.
obsforrbd = parentV = {0},format(index)
cmds.dynExpression(nPartilce1,s = obsforrbd,c = 1)
However when i run the code above it only gives me
parentV = obs3 = mag3
not the whole list,it only gives me the last element of the list why is that..??
Thanks.
I'm having difficulty interpreting your question, so I'm just going to base this on the question title.
Let's say you have a list of items (they could be anything, numbers, strings, characters, etc)
myList = [1,2,3,4,"abcd"]
If you do something like:
for i in myList:
print(i)
you will get:
1
2
3
4
"abcd"
If you want to convert this to a string:
myString = ' '.join(myList)
should have:
print(myString)
>"1 2 3 4 abcd"
Now for some explanation:
' ' is a string in python, and strings have certain methods associated with them (functions that can be applied to strings). In this instance, we're calling the .join() method. This method takes a list as an argument, and extracts each element of the list, converts it to a string representation and 'joins' it based on ' ' as a separator. If you wanted a comma separated list representation, just replace ' ' with ','.
I think your indentations wrong ... it should be
while num < 4:
obsforsim = obs + mag
mylist.append(obsforsim)
for index in mylist:
but Im not sure if thats your problem or not
the reason it did not work before is
while num < 4:
obsforsim = obs + mag
#does all loops before here
mylist.append(obsforsim) #appends only last
The usual pythonic way to spit out a list of numbered items would be either the range function:
results = []
for item in range(1, 4):
results.append("obs%i = mag_%i" % (item, item))
> ['obs1 = mag_1', 'obs2 = mag_2', 'ob3= mag_3']
and so on (note in this example you have to pass in the item variable twice to get it to register twice.
If that's to be formatted into something like an expression you could use
'\n'.join(results)
as in the other example to create a single string with the obs = mag pairs on their own lines.
Finally, you can do all that in one line with a list comprehension.
'\n'.join([ "obs%i = mag_%i" % (item, item) for item in range (1, 4)])
As other people have pointed out, while loops are dangerous - its easier to use range

Python: How to replace N random string occurrences in text?

Say that I have 10 different tokens, "(TOKEN)" in a string. How do I replace 2 of those tokens, chosen at random, with some other string, leaving the other tokens intact?
>>> import random
>>> text = '(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)'
>>> token = '(TOKEN)'
>>> replace = 'foo'
>>> num_replacements = 2
>>> num_tokens = text.count(token) #10 in this case
>>> points = [0] + sorted(random.sample(range(1,num_tokens+1),num_replacements)) + [num_tokens+1]
>>> replace.join(token.join(text.split(token)[i:j]) for i,j in zip(points,points[1:]))
'(TOKEN)__(TOKEN)__(TOKEN)__(TOKEN)__foo__(TOKEN)__foo__(TOKEN)__(TOKEN)__(TOKEN)'
In function form:
>>> def random_replace(text, token, replace, num_replacements):
num_tokens = text.count(token)
points = [0] + sorted(random.sample(range(1,num_tokens+1),num_replacements)) + [num_tokens+1]
return replace.join(token.join(text.split(token)[i:j]) for i,j in zip(points,points[1:]))
>>> random_replace('....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....','(TOKEN)','FOO',2)
'....FOO....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....(TOKEN)....FOO....'
Test:
>>> for i in range(0,9):
print random_replace('....(0)....(0)....(0)....(0)....(0)....(0)....(0)....(0)....','(0)','(%d)'%i,i)
....(0)....(0)....(0)....(0)....(0)....(0)....(0)....(0)....
....(0)....(0)....(0)....(0)....(1)....(0)....(0)....(0)....
....(0)....(0)....(0)....(0)....(0)....(2)....(2)....(0)....
....(3)....(0)....(0)....(3)....(0)....(3)....(0)....(0)....
....(4)....(4)....(0)....(0)....(4)....(4)....(0)....(0)....
....(0)....(5)....(5)....(5)....(5)....(0)....(0)....(5)....
....(6)....(6)....(6)....(0)....(6)....(0)....(6)....(6)....
....(7)....(7)....(7)....(7)....(7)....(7)....(0)....(7)....
....(8)....(8)....(8)....(8)....(8)....(8)....(8)....(8)....
If you need exactly two, then:
Detect the tokens (keep some links to them, like index into the string)
Choose two at random (random.choice)
Replace them
What are you trying to do, exactly? A good answer will depend on that...
That said, a brute-force solution that comes to mind is to:
Store the 10 tokens in an array, such that tokens[0] is the first token, tokens[1] is the second, ... and so on
Create a dictionary to associate each unique "(TOKEN)" with two numbers: start_idx, end_idx
Write a little parser that walks through your string and looks for each of the 10 tokens. Whenever one is found, record the start/end indexes (as start_idx, end_idx) in the string where that token occurs.
Once done parsing, generate a random number in the range [0,9]. Lets call this R
Now, your random "(TOKEN)" is tokens[R];
Use the dictionary in step (3) to find the start_idx, end_idx values in the string; replace the text there with "some other string"
My solution in code:
import random
s = "(TOKEN)test(TOKEN)fgsfds(TOKEN)qwerty(TOKEN)42(TOKEN)(TOKEN)ttt"
replace_from = "(TOKEN)"
replace_to = "[REPLACED]"
amount_to_replace = 2
def random_replace(s, replace_from, replace_to, amount_to_replace):
parts = s.split(replace_from)
indices = random.sample(xrange(len(parts) - 1), amount_to_replace)
replaced_s_parts = list()
for i in xrange(len(parts)):
replaced_s_parts.append(parts[i])
if i < len(parts) - 1:
if i in indices:
replaced_s_parts.append(replace_to)
else:
replaced_s_parts.append(replace_from)
return "".join(replaced_s_parts)
#TEST
for i in xrange(5):
print random_replace(s, replace_from, replace_to, 2)
Explanation:
Splits string into several parts using replace_from
Chooses indexes of tokens to replace using random.sample. This returned list contains unique numbers
Build a list for string reconstruction, replacing tokens with generated index by replace_to.
Concatenate all list elements into single string
Try this solution:
import random
def replace_random(tokens, eqv, n):
random_tokens = eqv.keys()
random.shuffle(random_tokens)
for i in xrange(n):
t = random_tokens[i]
tokens = tokens.replace(t, eqv[t])
return tokens
Assuming that a string with tokens exists, and a suitable equivalence table can be constructed with a replacement for each token:
tokens = '(TOKEN1) (TOKEN2) (TOKEN3) (TOKEN4) (TOKEN5) (TOKEN6) (TOKEN7) (TOKEN8) (TOKEN9) (TOKEN10)'
equivalences = {
'(TOKEN1)' : 'REPLACEMENT1',
'(TOKEN2)' : 'REPLACEMENT2',
'(TOKEN3)' : 'REPLACEMENT3',
'(TOKEN4)' : 'REPLACEMENT4',
'(TOKEN5)' : 'REPLACEMENT5',
'(TOKEN6)' : 'REPLACEMENT6',
'(TOKEN7)' : 'REPLACEMENT7',
'(TOKEN8)' : 'REPLACEMENT8',
'(TOKEN9)' : 'REPLACEMENT9',
'(TOKEN10)' : 'REPLACEMENT10'
}
You can call it like this:
replace_random(tokens, equivalences, 2)
> '(TOKEN1) REPLACEMENT2 (TOKEN3) (TOKEN4) (TOKEN5) (TOKEN6) (TOKEN7) (TOKEN8) REPLACEMENT9 (TOKEN10)'
There are lots of ways to do this. My approach would be to write a function that takes the original string, the token string, and a function that returns the replacement text for an occurrence of the token in the original:
def strByReplacingTokensUsingFunction(original, token, function):
outputComponents = []
matchNumber = 0
unexaminedOffset = 0
while True:
matchOffset = original.find(token, unexaminedOffset)
if matchOffset < 0:
matchOffset = len(original)
outputComponents.append(original[unexaminedOffset:matchOffset])
if matchOffset == len(original):
break
unexaminedOffset = matchOffset + len(token)
replacement = function(original=original, offset=matchOffset, matchNumber=matchNumber, token=token)
outputComponents.append(replacement)
matchNumber += 1
return ''.join(outputComponents)
(You could certainly change this to use shorter identifiers. My style is somewhat more verbose than typical Python style.)
Given that function, it's easy to replace two random occurrences out of ten. Here's some sample input:
sampleInput = 'a(TOKEN)b(TOKEN)c(TOKEN)d(TOKEN)e(TOKEN)f(TOKEN)g(TOKEN)h(TOKEN)i(TOKEN)j(TOKEN)k'
The random module has a handy method for picking random items from a population (not picking the same item twice):
import random
replacementIndexes = random.sample(range(10), 2)
Then we can use the function above to replace the randomly-chosen occurrences:
sampleOutput = strByReplacingTokensUsingFunction(sampleInput, '(TOKEN)',
(lambda matchNumber, token, **keywords:
'REPLACEMENT' if (matchNumber in replacementIndexes) else token))
print sampleOutput
And here's some test output:
a(TOKEN)b(TOKEN)cREPLACEMENTd(TOKEN)e(TOKEN)fREPLACEMENTg(TOKEN)h(TOKEN)i(TOKEN)j(TOKEN)k
Here's another run:
a(TOKEN)bREPLACEMENTc(TOKEN)d(TOKEN)e(TOKEN)f(TOKEN)gREPLACEMENTh(TOKEN)i(TOKEN)j(TOKEN)k
from random import sample
mystr = 'adad(TOKEN)hgfh(TOKEN)hjgjh(TOKEN)kjhk(TOKEN)jkhjk(TOKEN)utuy(TOKEN)tyuu(TOKEN)tyuy(TOKEN)tyuy(TOKEN)tyuy(TOKEN)'
def replace(mystr, substr, n_repl, replacement='XXXXXXX', tokens=10, index=0):
choices = sorted(sample(xrange(tokens),n_repl))
for i in xrange(choices[-1]+1):
index = mystr.index(substr, index) + 1
if i in choices:
mystr = mystr[:index-1] + mystr[index-1:].replace(substr,replacement,1)
return mystr
print replace(mystr,'(TOKEN)',2)

Categories