How to remove consecutive single letter characters from a string in python - python

I have the following string, from which I want to remove every run of five or more consecutive single-character tokens.
mystring = "the nucleotide sequence of wheat triticum aestivum l chloroplastid ribosome associated 4 5 s rna is u a g u g a g c g c g a g a c g a g c g u a u a g u g u c a g u g a g u g c a g u g a u g u a u g c a g c u g a g c a u c u a c g a c g a c g a u g a coh"
My output should be as follows.
myoutput = "the nucleotide sequence of wheat triticum aestivum l chloroplastid ribosome associated 4 5 s rna is coh"
I tried to do it as follows.
count = 0
for i, my in enumerate(mystring.split()):
    if len(my) == 1:
        count = count + 1
    else:
        count = 0
    if count == 5:
        print(i)
In summary, I keep a count, check whether it has reached five consecutive single-character tokens, remove those five positions from the list, and so on.
However, rather than using a counter variable and removing tokens five at a time, I would like to do this in a more efficient, Pythonic way.
I am happy to provide more details if needed.

I believe a regular expression can solve this problem:
import re

mystring = ("the nucleotide sequence of wheat triticum aestivum l "
            "chloroplastid ribosome associated 4 5 s rna is u a "
            "g u g a g c g c g a g a c g a g c g u a u a g u g u "
            "c a g u g a g u g c a g u g a u g u a u g c a g c u "
            "g a g c a u c u a c g a c g a c g a u g a coh")
print(mystring)
# See https://regex101.com/r/aUDK7K/1
# \b: word boundary
# \w: word char
# \s+: one or more white spaces
# {5,}: 5 or more times
shorten = re.sub(r'(\b\w\s+){5,}', '', mystring)
print(shorten)
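If a regex feels opaque, the same idea can probably be sketched with itertools.groupby, which groups neighbouring tokens by whether they are a single character. The function name drop_single_char_runs and the min_run parameter are illustrative and not part of the answer above. Note that the result is re-joined with single spaces, so irregular whitespace in the input gets normalized.

```python
from itertools import groupby


def drop_single_char_runs(text, min_run=5):
    """Remove every run of at least `min_run` consecutive one-character tokens."""
    kept = []
    # groupby yields maximal runs of adjacent tokens that share the same key.
    for is_single, run in groupby(text.split(), key=lambda tok: len(tok) == 1):
        tokens = list(run)
        if is_single and len(tokens) >= min_run:
            continue  # a long run of single characters: drop it entirely
        kept.extend(tokens)
    return " ".join(kept)


sample = "4 5 s rna is u a g u g a g c coh"
print(drop_single_char_runs(sample))  # 4 5 s rna is coh
```

Short runs such as "4 5 s" survive because they have fewer than min_run tokens.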

Related

How can I delete extra whitespace between characters in a string in Python?

How can I normalize a string that contains extra whitespace between characters?
From this (letters separated by one space, words by two):
'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
Into this:
'I need to delete the extra whitespace between characters'
Here's an option using split and join:
>>> s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
>>> ''.join(c if c else " " for c in s.split(" "))
'I need to delete the extra whitespace between characters'
You can use a regex to delete every space that is not followed by another space:
s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
import re
out = re.sub(r'\s(?!\s)', '', s)
output: 'I need to delete the extra whitespace between characters'
Alternative that handles any number of spaces: a run of two or more is reduced to one, and a single space is deleted:
s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
import re
# \1 is empty when only the second alternative matched (Python 3.5+)
out = re.sub(r'(\s)\s+|\s', r'\1', s)
As discussed in the comments, you want to remove every space that is followed by a non-whitespace character. Doing exactly this in the Python shell:
>>> import re
>>> s = 'I  n e e d  t o  d e l e t e  t h e  e x t r a  w h i t e s p a c e  b e t w e e n  c h a r a c t e r s'
>>> re.sub(r"\s(\S)", r"\1", s)
'I need to delete the extra whitespace between characters'
The r"\s(\S)" says "match a whitespace character and then a non-whitespace character". The non-whitespace character is captured in group 1. The r"\1" says to replace the full match with group 1.
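Another way to look at it, assuming (as in the question) that letters are separated by exactly one space and words by two or more: split on the wide gaps first, then strip the remaining whitespace inside each word. The helper name collapse_spaced_letters is hypothetical, and this is only a sketch of that reading of the problem.

```python
import re


def collapse_spaced_letters(text):
    """Join spaced-out letters into words; runs of 2+ whitespace mark word breaks."""
    words = re.split(r"\s{2,}", text.strip())
    # Inside each word, every remaining whitespace character is spurious.
    return " ".join(re.sub(r"\s", "", word) for word in words)


print(collapse_spaced_letters("I  n e e d  t o  d e l e t e"))  # I need to delete
```

Because the word breaks are detected with \s{2,}, three or more spaces between words are handled the same way as two.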

Alphabetical Grid using python3

How do I write a function grid that returns an alphabetical grid of size NxN, where a = 0, b = 1, c = 2, and so on, in Python?
Example:
a b c d
b c d e
c d e f
d e f g
Here I tried to write it with three for loops, but it prints the whole alphabet every time:
def grid(N):
    for i in range(N):
        for j in range(N):
            for k in range(ord('a'), ord('z') + 1):
                print(chr(k))
    pass
Not the most elegant, but gets the job done.
import string
def grid(N):
    i = 0
    for x in range(N):
        for y in string.ascii_lowercase[i:N+i]:
            print(y, end=" ")
        i += 1
        print()
grid(4)
Output
a b c d
b c d e
c d e f
d e f g
Extending @MichHeng's suggestion with a list comprehension:
letters = [chr(x) for x in range(ord('a'), ord('z') + 1)]
def grid(N):
    for i in range(N):
        # j walks over the N letters of row i
        print(' '.join([letters[j] for j in range(i, N + i)]))
grid(4)
output is
a b c d
b c d e
c d e f
d e f g
You have specified for k in range(ord('a'),ord('z')+1), which prints the entire series from 'a' to 'z' on every iteration. What you probably need is a reference list of letters to pick from, built for example with a list comprehension:
[chr(x) for x in range(ord('a'),ord('z')+1)]
Try this:
letters = [chr(x) for x in range(ord('a'),ord('z')+1)]
def grid(N):
    for i in range(N):
        for j in range(i, N+i):
            print(letters[j], end=' ')
            if j==N+i-1:
                print('') #to move to next line
grid(4)
Output
a b c d
b c d e
c d e f
d e f g
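Do you need to add a check for N <= 13? The last row needs index 2N-2, so for larger N letters[j] raises an IndexError because only 26 letters are available.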
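The check raised above could look roughly like the following sketch. It returns the rows as strings instead of printing them, which makes the function easier to test; the name grid_rows and the choice to raise ValueError are assumptions, not something the answers prescribe.

```python
import string


def grid_rows(n):
    """Return the n rows of the alphabetical grid as strings."""
    # The last row ends at index 2n - 2, so 26 letters suffice only for n <= 13.
    if not 1 <= n <= 13:
        raise ValueError("n must be between 1 and 13")
    letters = string.ascii_lowercase
    return [" ".join(letters[row:row + n]) for row in range(n)]


for line in grid_rows(4):
    print(line)
```

An alternative design would be to wrap around with a modulo on the letter index, but that changes what the grid means for large N, so it is left out here.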

Trying to verify last position of a string

I'm trying to verify that the last character is not in my list:
def acabar_char(input):
    list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
    tam = 0
    tam = (len(input)-1)
    for char in input:
        if char[tam] in list_chars:
            return False
        else:
            return True
When I try this, I get this error:
    if char[tam] in list_chars:
IndexError: string index out of range
You can index from the end (of a string or a list) with negative numbers:
def acabar_char(input, list_chars):
    return input[-1] not in list_chars
It seems that you are trying to assert that the last element of an input string (or also list/tuple) is NOT in a subset of disallowed chars.
Currently, your loop never gets past its first iteration because both branches return inside the loop. In addition, char is a single character, so char[tam] raises an IndexError whenever the input is longer than one character; the last element is only checked when the input has length 1.
I suggest something like this instead (also using the string.ascii_letters and string.digits constants):
import string
DISALLOWED_CHARS = string.ascii_letters + string.digits
def acabar_char(val, disallowed_chars=DISALLOWED_CHARS):
    if len(val) == 0:
        return False
    return val[-1] not in disallowed_chars
Does this work for you?
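For string inputs only, a similar check can be sketched with str.endswith, which accepts a tuple of candidate suffixes and so avoids indexing altogether. This is an alternative formulation, not the answer's own code; it assumes "disallowed" means ASCII letters and digits, and it keeps the behaviour of returning False for an empty string.

```python
import string

# Assumption: the disallowed set is ASCII letters plus digits.
DISALLOWED_CHARS = tuple(string.ascii_letters + string.digits)


def acabar_char(val, disallowed_chars=DISALLOWED_CHARS):
    """Return True if val is non-empty and its last character is not disallowed."""
    # An empty string ends with none of the suffixes, so the bool(val) guard
    # is what keeps the result False for empty input.
    return bool(val) and not val.endswith(disallowed_chars)


print(acabar_char("abc!"))  # True
print(acabar_char("abc1"))  # False
print(acabar_char(""))      # False
```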
You are already iterating over the input in that for loop, so there's no need to use indices. You could use a list comprehension as the other answer suggests, but I'm guessing you're trying to learn Python, so here is a way to rewrite your function.
list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
def acabar_char(input):
    # Returns False as soon as any character of the input is in list_chars
    for char in input:
        if char in list_chars:
            return False
    return True
list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
def acabar_char(input):
    if input in list_chars:
        print('True')

Optimizing pandas filter inside apply function

I have a list of pairs--stored in a DataFrame--each pair having an 'a' column and a 'b' column. For each pair I want to return the 'b's that have the same 'a'. For example, given the following set of pairs:
   a  b
0  c  d
1  e  f
2  c  g
3  e  h
4  i  j
5  e  k
I would like to end up with:
   a  b  equivalents
0  c  d  [g]
1  e  f  [h, k]
2  c  g  [d]
3  e  h  [f, k]
4  i  j  []
5  e  k  [f, h]
I can do this with the following:
def equivalents(x):
    l = pairs[pairs["a"] == x["a"]]["b"].tolist()
    return [b for b in l if b != x["b"]]  # drop the row's own 'b'
pairs["equivalents"] = pairs.apply(equivalents, axis = 1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions how I could do this faster?
I think this ought to be a bit faster. First, concatenate the 'b' values within each 'a' group (this works as shown only because every 'b' is a single character).
df['equiv'] = df.groupby('a')['b'].transform('sum')
   a  b  equiv
0  c  d  dg
1  e  f  fhk
2  c  g  dg
3  e  h  fhk
4  i  j  j
5  e  k  fhk
Now convert to a list and remove whichever letter is already in column 'b'.
df.apply(lambda x: [y for y in list(x.equiv) if y != x.b], axis=1)
0    [g]
1    [h, k]
2    [d]
3    [f, k]
4    []
5    [f, h]
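The underlying idea (group once, then look each row up) can also be sketched in plain Python, which avoids relying on single-character values. The function below works on two ordinary lists, which could for instance come from pairs["a"].tolist() and pairs["b"].tolist(); the name equivalents_by_group is hypothetical, and whether this is actually faster on a million rows would need to be measured.

```python
from collections import defaultdict


def equivalents_by_group(a_values, b_values):
    """For each (a, b) pair, list the other b values that share the same a."""
    groups = defaultdict(list)
    for a, b in zip(a_values, b_values):
        groups[a].append(b)
    # Each row now costs one dictionary lookup instead of a full DataFrame filter.
    # Assumption: a b value equal to the row's own b is never wanted in the result.
    return [
        [other for other in groups[a] if other != b]
        for a, b in zip(a_values, b_values)
    ]


a_col = ["c", "e", "c", "e", "i", "e"]
b_col = ["d", "f", "g", "h", "j", "k"]
for row in equivalents_by_group(a_col, b_col):
    print(row)
```

The returned list has one entry per input row, so it can be assigned back as a new column.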

Why write() method writes unknown characters?

I have this simple txt file:
[header]
width=8
height=5
tilewidth=175
tileheight=150

[tilesets]
tileset=../GFX/ts1.png,175,150,0,0

[layer]
type=Tile Layer 1
data=
1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,1,
1,0,0,0,0,1,1,1,
1,0,0,0,6,0,0,1,
1,1,1,1,4,1,1,1
I want to separate the text into the "[header]", "[tilesets]" and "[layer]" sections. The problem is, if I split it this way:
m = open(self.fullPath, 'r+')
sliced = m.read().split() # with no argument, split() breaks on any whitespace
print(sliced)
It separates every line (and even breaks "type=Tile Layer 1" at its spaces), because the text returned by read() has a '\n' at the end of every line:
['[header]', 'width=8', 'height=5', 'tilewidth=175', 'tileheight=150', '[tilesets]', 'tileset=../GFX/ts1.png,175,150,0,0', '[layer]', 'type=Tile', 'Layer', '1', 'data=', '1,1,1,1,1,1,1,1,', '1,0,0,0,0,0,0,1,', '1,0,0,0,0,1,1,1,', '1,0,0,0,6,0,0,1,', '1,1,1,1,4,1,1,1']
But it would be possible to split perfectly if, instead of a newline character, there were a "#" sign or some other marker separating each section.
Then I thought: "There are empty lines there, and they are newline characters, so I just need to test whether the line equals the newline character and replace it with '#'":
for line in m.readlines():
    if line == '\n':
        m.write('#')
for line in m.readlines():
    print(line)
Perfect... except that, instead of achieving this:
[header]
width=8
height=5
tilewidth=175
tileheight=150
#
[tilesets]
tileset=../GFX/ts1.png,175,150,0,0
#
[layer]
type=Tile Layer 1
data=
1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,1,
1,0,0,0,0,1,1,1,
1,0,0,0,6,0,0,1,
1,1,1,1,4,1,1,1
I get this:
[header]
width=8
height=5
tilewidth=175
tileheight=150
[tilesets]
tileset=../GFX/ts1.png,175,150,0,0
[layer]
type=Tile Layer 1
data=
1,1,1,1,1,1,1,1,
1,0,0,0,0,0,0,1,
1,0,0,0,0,1,1,1,
1,0,0,0,6,0,0,1,
1,1,1,1,4,1,1,1##
[followed by several lines of unreadable binary data]
It makes no sense :).
Simultaneously reading from and writing to a file tends to have unpredictable effects on what kind of output you get.
If your categories are always separated by two newlines, then just split on that, instead of doing any fancy find/replace operations.
with open("input.txt", "r") as m:
    sliced = m.read().split("\n\n")
print("data has been split into {} categories.".format(len(sliced)))
# print the starting line of each category
for category in sliced:
    print(category.split("\n")[0])
Result:
data has been split into 3 categories.
[header]
[tilesets]
[layer]
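Building on the split above, one possible next step is to key each section by its header line. The sketch below works on an in-memory string so it runs without a file; SAMPLE is a shortened stand-in for the question's file, and split_sections is a hypothetical helper name. It assumes sections are separated by at least one blank line.

```python
import re

# Shortened stand-in for the question's file contents.
SAMPLE = """[header]
width=8
height=5

[tilesets]
tileset=../GFX/ts1.png,175,150,0,0

[layer]
type=Tile Layer 1
data=
1,1,1,1,
1,0,0,1
"""


def split_sections(text):
    """Map each section's first line (e.g. '[header]') to its remaining lines."""
    sections = {}
    # One or more blank lines (possibly containing stray spaces) separate sections.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.split("\n")
        sections[lines[0]] = lines[1:]
    return sections


sections = split_sections(SAMPLE)
print(list(sections))        # ['[header]', '[tilesets]', '[layer]']
print(sections["[header]"])  # ['width=8', 'height=5']
```

Reading the whole file first and only then processing the text keeps reads and writes from interleaving on the same file object.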
