extracting characters from python string using variable index - python

I'm trying to split a string of letters and numbers into a list of tuples like this:
[(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
This kind of works, but when I try to print "37" with print(cigar[d:pos]), it does not print the entire number, only 3.
#iterate through cigar sequence
print(cigar)
#count position in cigar sequence
pos=0
#count position of last key
d=0
splitCigar=[]
for char in cigar:
    #print(cigar[pos])
    if char.isalpha() == False:
        print("first for-loop")
        print(cigar[d])
        print(cigar[pos])
        print(cigar[d:pos])
        num=(cigar[d:pos])
        pos+=1
    if char.isalpha() == True:
        print("second for-loop")
        splitCigar.append((num,char))
        pos+=1
        d=pos
print(splitCigar)
The output of this code:
37M1I5M1D25M33S
first for-loop
3
3
first for-loop
3
7
3
second for-loop
<and so on...>
second for-loop
[('3', 'M'), ('', 'I'), ('', 'M'), ('', 'D'), ('2', 'M'), ('3', 'S')]
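The truncated numbers in this output seem to come from the slice itself. cigar[d:pos] is taken while pos still points at the current digit, and a slice excludes its end index, so the last digit of every number is dropped. A minimal sketch of one possible fix follows. It assumes the input is well formed (digits followed by a single letter), and the function name split_cigar is hypothetical, not from the question:

```python
def split_cigar(cigar):
    """Split a CIGAR-like string into (count, letter) tuples.

    Hypothetical helper; assumes every letter is preceded by at least one digit.
    """
    pairs = []
    start = 0  # index where the current run of digits begins
    for pos, char in enumerate(cigar):
        if char.isalpha():
            # cigar[start:pos] covers every digit before this letter
            pairs.append((int(cigar[start:pos]), char))
            start = pos + 1
    return pairs


print(split_cigar("37M1I5M1D25M33S"))
# [(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
```

Slicing only when a letter is reached means the slice always ends just after the last digit, so no separate position counter is needed.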

Solution using regexp:
import re
cigar = "37M1I5M1D25M33S"
digits = re.findall('[0-9]+', cigar)
chars = re.findall('[A-Z]+', cigar)
results = list(zip(digits, chars))
Everything printed so you can see what it does:
>>> print(digits)
['37', '1', '5', '1', '25', '33']
>>> print(chars)
['M', 'I', 'M', 'D', 'M', 'S']
>>> print(results)
[('37', 'M'), ('1', 'I'), ('5', 'M'), ('1', 'D'), ('25', 'M'), ('33', 'S')]
I hope this "functional" approach suits you
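A possible variant of the same idea, sketched here as an assumption rather than part of the answer above: a single pattern with two capture groups returns the pairs directly, so the digits and letters cannot drift out of step, and the counts can be converted to int in the same pass.

```python
import re

cigar = "37M1I5M1D25M33S"

# Each match is a (digits, letter) pair; convert the digits to int on the way.
pairs = [(int(count), op) for count, op in re.findall(r"(\d+)([A-Za-z])", cigar)]
print(pairs)
# [(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
```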

The pyparsing library makes writing parsers more maintainable and readable.
If the format of the data changes, you can modify the parser without too much effort.
import pyparsing as pp
def make_grammar():
    # Number consists of several digits
    num = pp.Word(pp.nums).setName("Num")
    # Convert the num to int
    num = num.setParseAction(
        pp.pyparsing_common.convertToInteger)
    # 1 letter
    letter = pp.Word(pp.alphas, exact=1)\
        .setName("Letter")
    # 1 num followed by letter with possibly
    # some spaces in between
    package = pp.Group(num + letter)
    # 1 or more packages
    grammar = pp.OneOrMore(package)
    return grammar

def main():
    x = "37M1I5M1D25M33S"
    g = make_grammar()
    result = g.parseString(x, parseAll=True)
    print(result)
    # [[37, 'M'], [1, 'I'], [5, 'M'],
    # [1, 'D'], [25, 'M'], [33, 'S']]
    # If you really want tuples:
    print([tuple(r) for r in result])

main()

Sounds like a job for itertools.groupby:
import itertools
inp = '37M1I5M1D25M33S'
e = [''.join(g) for k, g in itertools.groupby(inp, key=lambda l: l.isdigit())]
print(e)
This will give you:
['37', 'M', '1', 'I', '5', 'M', '1', 'D', '25', 'M', '33', 'S']
Basically, groupby collects consecutive elements for which the key function (.isdigit) returns the same value into groups, and each of those groups is turned into a string using ''.join.
Now all you have to do is zip them together:
res = list(zip(e[::2], e[1::2]))
print(res)
That will give you
[('37', 'M'), ('1', 'I'), ('5', 'M'), ('1', 'D'), ('25', 'M'), ('33', 'S')]
If you want integers instead of the string representation of the numbers, that's also simple:
res = list(map(lambda l: (int(l[0]), l[1]), res))
Which yields
[(37, 'M'), (1, 'I'), (5, 'M'), (1, 'D'), (25, 'M'), (33, 'S')]
I'd say this is a pretty pythonic solution for your problem.

You can simply attain the desired output as follows:
cigar= '37M1I5M1D25M33S'
splitCigar=[]
t=[]
num=''
for char in cigar:
    if char.isalpha()==False:
        num+= char
    else:
        t.append(num)
        num=''
        t.append(char)
        splitCigar.append(tuple(t))
        t=[]
print(splitCigar)
Output:
[('37', 'M'), ('1', 'I'), ('5', 'M'), ('1', 'D'), ('25', 'M'), ('33', 'S')]

Related

about list comprehension in python

I'm completely stuck on this code and have no idea why the outputs of these two versions differ.
Reference answer:
str = 'I am an NLPer'
def ngram(n, lst):
    return list(zip(*[lst[i:] for i in range(n)]))
ngram(2,str)
output
[('I', ' '),
(' ', 'a'),
('a', 'm'),
('m', ' '),
(' ', 'a'),
('a', 'n'),
('n', ' '),
(' ', 'N'),
('N', 'L'),
('L', 'P'),
('P', 'e'),
('e', 'r')]
my code
def myngram(n):
    for i in range(n):
        return list(zip(*str[i:]))
myngram(2)
output
[('I', ' ', 'a', 'm', ' ', 'a', 'n', ' ', 'N', 'L', 'P', 'e', 'r')]
Any idea?
I have another solution below, but the one above is far more sophisticated.
str = 'I am an NLPer'
list = []
def ngram(n):
    for i in range(len(str)-1):
        list.append((str[i], str[i+1]))
    return(list)
ngram(2)
You can break the function down to understand it better.
[lst[i:] for i in range(n)]
gives a list of two strings,
['I am an NLPer', ' am an NLPer']
and zip(*...) then pairs up the characters at matching positions of those strings, which produces the output shown above. Your version returns during the first pass of the loop (i = 0) and unpacks a single string into zip, so every character ends up in one tuple.
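The breakdown can be checked step by step. This is a small sketch rather than the original poster's code; the variable names text, shifted, bigrams and first_pass_only are chosen here for illustration (text is used instead of str so the built-in is not shadowed).

```python
text = 'I am an NLPer'
n = 2

# Step 1: n copies of the string, each starting one character later.
shifted = [text[i:] for i in range(n)]
print(shifted)
# ['I am an NLPer', ' am an NLPer']

# Step 2: zip(*shifted) pairs the characters found at the same position.
bigrams = list(zip(*shifted))
print(bigrams[:3])
# [('I', ' '), (' ', 'a'), ('a', 'm')]

# What the question's version does: it returns on the first pass (i = 0) and
# unpacks ONE string, so zip receives 13 one-character arguments.
first_pass_only = list(zip(*text[0:]))
print(len(first_pass_only), len(first_pass_only[0]))
# 1 13
```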

Where did 0 index go in return statements of tuples in this sorting algorithm?

I'm looking at a sorting algorithm that puts lowercase letters in front, then uppercase letters, then odd digits, and lastly even digits. For example, String1234 becomes ginrtS1324.
Code
def getKey(x):
    if x.islower():
        return(1,x)
    elif x.isupper():
        return(2,x)
    elif x.isdigit():
        if int(x)%2 == 1:
            return(3,x)
        else:
            return(4,x)
print(*sorted('String1234',key=getKey),sep='')
I understand that the tuples returned are (1, 'g'), (1, 'i'), ... (2, 'S'), (3, '1'), (3, '3'), (4, '2'), (4, '4'). What I don't understand is why a list is created, ['g', 'i', 'n', 'r', 't', 'S', '1', '3', '2', '4'], and what happened to the 0 indexes of the tuples.
sorted returns a sorted list with the elements of whatever iterable you pass into it:
>>> sorted('String1234')
['1', '2', '3', '4', 'S', 'g', 'i', 'n', 'r', 't']
If you want to turn this back into a string, an easy way is join():
>>> ''.join(sorted('String1234'))
'1234Sginrt'
If you pass a key parameter, the resulting keys (obtained by calling the key function on each element to be sorted) are used for the comparison within the sort, but the output is still built out of the original elements, not the keys!
>>> ''.join(sorted('String1234', key=getKey))
'ginrtS1324'
If you wanted to get the list of tuples rather than a list of letters, you'd do that by mapping your key function over the list itself before sorting it (and do that instead of passing it as a separate parameter to sorted):
>>> sorted(map(getKey, 'String1234'))
[(1, 'g'), (1, 'i'), (1, 'n'), (1, 'r'), (1, 't'), (2, 'S'), (3, '1'), (3, '3'), (4, '2'), (4, '4')]
>>> ''.join(map(lambda x: ''.join(map(str, x)), sorted(map(getKey, 'String1234'))))
'1g1i1n1r1t2S31334244'
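One way to picture what happens to the keys is the "decorate, sort, undecorate" pattern. The sketch below is only an illustration of the idea, not how sorted is implemented internally. The name get_key and the fallback rank 5 for other characters are assumptions added here.

```python
def get_key(ch):
    """Rank a character: lowercase, uppercase, odd digit, even digit."""
    if ch.islower():
        return (1, ch)
    if ch.isupper():
        return (2, ch)
    if ch.isdigit():
        return (3, ch) if int(ch) % 2 == 1 else (4, ch)
    return (5, ch)  # assumption: anything else sorts last


word = 'String1234'

# Decorate: pair each original element with its key.
decorated = [(get_key(ch), ch) for ch in word]
# Sort by the key only.
decorated.sort(key=lambda pair: pair[0])
# Undecorate: throw the keys away and keep the original elements.
undecorated = [ch for _, ch in decorated]

print(''.join(undecorated))
# ginrtS1324
print(undecorated == sorted(word, key=get_key))
# True
```

The keys only decide the order; they are never part of the result, which is why the first items of the tuples do not appear in the output.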

Creating a Tuple with a variable and Boolean Python

I'm supposed to add a variable and a boolean to a new Tuple - the actual assignment is below, with my code. I know tuples are immutable - this is the first time I've tried to make one. Additionally, I can't find anything about inserting the variable and a boolean. Thanks in advance!
My code just creates a new list. This is the desired result:
[('h', False), ('1', True), ('C', False), ('i', False), ('9', True), ('True', False), ('3.1', False), ('8', True), ('F', False), ('4', True), ('j', False)]
Assignment:
The string module provides sequences of various types of Python
characters. It has an attribute called digits that produces the string
‘0123456789’. Import the module and assign this string to the variable
nums. Below, we have provided a list of characters called chars. Using
nums and chars, produce a list called is_num that consists of tuples.
The first element of each tuple should be the character from chars,
and the second element should be a Boolean that reflects whether or
not it is a Python digit.
import string
nums = string.digits
chars = ['h', '1', 'C', 'i', '9', 'True', '3.1', '8', 'F', '4', 'j']
is_num = []
for item in chars:
    if item in string.digits:
        is_num.insert(item, bool)
    elif item not in string.digits:
        is_num.insert(item, bool)
You can use a list comprehension for this, which is like a more concise for loop that creates a new list
>>> from string import digits
>>> chars = ['h', '1', 'C', 'i', '9', 'True', '3.1', '8', 'F', '4', 'j']
>>> is_num = [(i, i in digits) for i in chars]
>>> is_num
[('h', False), ('1', True), ('C', False), ('i', False), ('9', True), ('True', False), ('3.1', False), ('8', True), ('F', False), ('4', True), ('j', False)]
This would be equivalent to the following loop:
is_num = []
for i in chars:
    is_num.append((i, i in digits))
>>> is_num
[('h', False), ('1', True), ('C', False), ('i', False), ('9', True), ('True', False), ('3.1', False), ('8', True), ('F', False), ('4', True), ('j', False)]
Note that the containment check is being done using in against string.digits
>>> digits
'0123456789'
>>> '7' in digits
True
>>> 'b' in digits
False
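One caveat worth knowing: with a string on the right-hand side, in is a substring test, so a multi-character item such as '12' also counts as contained. That does not affect the chars list in the question, but a stricter check could look like the sketch below. The helper name is_single_digit and the extra '12' entry are illustrative additions, not part of the assignment.

```python
from string import digits

# Substring behaviour of `in` on strings:
print('12' in digits)  # True, because '12' is a substring of '0123456789'
print('' in digits)    # True, the empty string is a substring of any string
print('21' in digits)  # False


def is_single_digit(item):
    """True only for exactly one character that is a digit (hypothetical helper)."""
    return len(item) == 1 and item in digits


# The question's list plus one extra multi-digit entry for illustration.
chars = ['h', '1', 'C', 'i', '9', 'True', '3.1', '8', 'F', '4', 'j', '12']
is_num = [(c, is_single_digit(c)) for c in chars]
print(is_num[-1])
# ('12', False)
```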
The easiest way is to cast nums to a list first, i.e. nums = list(nums), before checking membership. With a list, in tests for an exact element match rather than a substring.
import string
chars = ['h', '1', 'C', 'i', '9', 'True', '3.1', '8', 'F', '4', 'j']
nums = string.digits
nums = list(nums)
is_num = []
for char in chars:
    if char in nums:
        is_num.append((char, True))
    else:
        is_num.append((char, False))
print(is_num)

make a spark rdd from tuples list and use groupByKey

I have a list of tuples like below
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
I would like to use pyspark and groupByKey to produce:
nc=[['c','s', 'm', 'p'], ['h','bi','vi'], ['n','l', 'nc']]
I don't know how to make a Spark RDD and use groupByKey.
I tried:
tem=ls.groupByKey()
'list' object has no attribute 'groupByKey'
You are getting that error because your object is a list and not an rdd. Python lists do not have a groupByKey() method (as the error states).
You can first convert your list to an rdd using sc.parallelize:
myrdd = sc.parallelize(ls)
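# mapValues(list) turns each group's iterable into a plain list so it prints as shown
nc = myrdd.groupByKey().mapValues(list).collect()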
print(nc)
#[('c',['s', 'm', 'p']), ('h',['bi','vi']), ('n',['l', 'nc'])]
This returns a list of tuples where the first element is the key and the second element is a list of the values. If you wanted to flatten these tuples, you can use itertools.chain.from_iterable:
from itertools import chain
nc = [tuple(chain.from_iterable(v)) for v in nc]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
However, you can avoid Spark completely and achieve the desired result using itertools.groupby:
from itertools import groupby, chain
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
nc = [
    (key,) + tuple(chain.from_iterable(g[1:] for g in list(group)))
    for key, group in groupby(ls, key=lambda x: x[0])
]
print(nc)
#[('c', 's', 'm', 'p'), ('h', 'bi', 'vi'), ('n', 'l', 'nc')]
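Note that itertools.groupby only merges items that are next to each other, so it relies on the list already being ordered by key, as it is in the question. If that cannot be assumed, a dictionary-based grouping is one alternative. The sketch below uses a deliberately shuffled copy of the list as an illustration and produces the list-of-lists shape asked for in the question.

```python
from collections import defaultdict

# Same pairs as in the question, shuffled to show that key order does not matter.
ls = [('c', 's'), ('h', 'bi'), ('c', 'm'), ('n', 'l'),
      ('c', 'p'), ('h', 'vi'), ('n', 'nc')]

groups = defaultdict(list)
for key, value in ls:
    groups[key].append(value)

# Keys come out in first-seen order; each row is [key, value1, value2, ...].
nc = [[key] + values for key, values in groups.items()]
print(nc)
# [['c', 's', 'm', 'p'], ['h', 'bi', 'vi'], ['n', 'l', 'nc']]
```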
As pault mentioned, the problem here is that Spark operates on specialised parallelized datasets, such as an RDD. To get the exact format you're after using groupByKey you'll need to do some funky stuff with lists:
ls = sc.parallelize(ls)
tem=ls.groupByKey().map(lambda x: ([x[0]] + list(x[1]))).collect()
print(tem)
#[['h', 'bi', 'vi'], ['c', 's', 'm', 'p'], ['n', 'l', 'nc']]
However, it's generally best to avoid groupByKey, as it can result in a large amount of data being shuffled. This problem could also be solved with reduceByKey using:
ls=[('c', 's'),('c', 'm'), ('c', 'p'), ('h', 'bi'), ('h', 'vi'), ('n', 'l'), ('n', 'nc')]
ls = sc.parallelize(ls)
tem=ls.map(lambda x: (x[0], [x[1]])).reduceByKey(lambda x,y: x + y).collect()
print(tem)
This will scale more effectively, but note that RDD operations can start to look a little cryptic when you need to manipulate list structure.

How to get all permutations of string as list of strings (instead of list of tuples)?

The goal was to create a list of all possible combinations of certain letters in a word... Which is fine, except it now ends up as a list of tuples with too many quotes and commas.
import itertools
mainword = input(str("Enter a word: "))
n_word = int((len(mainword)))
outp = (list(itertools.permutations(mainword,n_word)))
What I want:
[yes, yse, eys, esy, sye, sey]
What I'm getting:
[('y', 'e', 's'), ('y', 's', 'e'), ('e', 'y', 's'), ('e', 's', 'y'), ('s', 'y', 'e'), ('s', 'e', 'y')]
Looks to me I just need to remove all the brackets, quotes, and commas.
I've tried:
def remove(old_list, val):
    new_list = []
    for items in old_list:
        if items!=val:
            new_list.append(items)
    return new_list
    print(new_list)
where I just run the function a few times. But it doesn't work.
You can recombine those tuples with a comprehension like:
Code:
new_list = [''.join(d) for d in old_list]
Test Code:
data = [
    ('y', 'e', 's'), ('y', 's', 'e'), ('e', 'y', 's'),
    ('e', 's', 'y'), ('s', 'y', 'e'), ('s', 'e', 'y')
]
data_new = [''.join(d) for d in data]
print(data_new)
Results:
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
You need to call str.join() on each tuple of characters to convert it back to a single string. Your code can be simplified with a list comprehension:
>>> from itertools import permutations
>>> word = 'yes'
>>> [''.join(w) for w in permutations(word)]
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
OR you may also use map() to get the desired result as:
>>> list(map(''.join, permutations(word)))
['yes', 'yse', 'eys', 'esy', 'sye', 'sey']
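One thing to watch for, sketched here as an aside: itertools.permutations treats elements as distinct by position, so a word with repeated letters yields repeated strings. If only distinct strings are wanted, collecting them in a set is one option. The word 'aab' is an illustrative input, not from the question.

```python
from itertools import permutations

word = 'aab'  # illustrative word with a repeated letter

all_perms = [''.join(p) for p in permutations(word)]
print(len(all_perms))
# 6

# A set removes the repeats; sorted() gives a stable order for printing.
unique_perms = sorted(set(all_perms))
print(unique_perms)
# ['aab', 'aba', 'baa']
```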
You can use the join function. The code below works:
import itertools
mainword = input(str("Enter a word: "))
n_word = int((len(mainword)))
outp = (list(itertools.permutations(mainword,n_word)))
# join each tuple of characters back into a string
for i in range(len(outp)):
    outp[i]=''.join(outp[i])
print(outp)
