Calculating Length of Sequences from .PBS File

I am new here and am looking for help with a bioinformatics task: calculating the total length of all the sequences in a .pbs file.
When opened, the file displays something like:
The Length is 102
The Length is 1100
The Length is 101
The Length is 111200
The Length is 102
Each length is given on its own line, as text followed by a number. I need help figuring out what Python code to write to add all the lengths together. Not all the lengths are the same.
So far my code is:
f = open('lengthofsequence2.pbs.o8767272','r')
lines = f.readlines()
f.close()
def lengthofsequencesinpbsfile(i):
    for x in i:
        if
            return x +=
print lengthofsequencesinpbsfile(lines)
I am not sure what to do with the for loop. I want to just count the numbers after the statement "The Length is..."
Thank You!

"The Length is " has 14 characters so line[14:] will give you the substring corresponding to the number you are after (starting after the 14th character), you then just have to convert it to int with int(line[14:]) before adding to your total: total += int(line[14:])

You need to parse your input to get the data you want to work with.
a. x.replace('The Length is ','') - this removes the unwanted text.
b. int(x.replace('The Length is ','')) - this converts the digit characters to an integer.
c. Add to a total: total += int(x.replace('The Length is ',''))
All of this is directly accessible using google. I looked for python string functions and type conversion functions. I've only looked briefly at python and never programmed with it, but I think these two items should help you do what you want to do.
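Both suggestions above come down to the same loop. Here is a minimal sketch of each, assuming every relevant line starts with exactly "The Length is " and using a hypothetical in-memory sample in place of the real .pbs output file (the names PREFIX, total_by_slicing and total_by_replace are made up for illustration):

```python
import io

PREFIX = "The Length is "  # 14 characters, as noted in the first answer


def total_by_slicing(lines):
    """Sum the numbers by slicing off the prefix (first answer)."""
    total = 0
    for line in lines:
        if line.startswith(PREFIX):           # skip blank or unrelated lines
            total += int(line[len(PREFIX):])  # int() tolerates the trailing newline
    return total


def total_by_replace(lines):
    """Sum the numbers by removing the prefix with str.replace (second answer)."""
    return sum(int(line.replace(PREFIX, "")) for line in lines if PREFIX in line)


# Hypothetical stand-in for the opened file; a real file handle yields lines the same way.
sample = io.StringIO(
    "The Length is 102\n"
    "The Length is 1100\n"
    "The Length is 101\n"
    "The Length is 111200\n"
    "The Length is 102\n"
)
lines = sample.readlines()
print(total_by_slicing(lines))  # 112605
print(total_by_replace(lines))  # 112605
```

With a real file, the list of lines would come from open(...) inside a with block instead of io.StringIO.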

Related

Python - append zeros to beginning of numbers depending on the string length

I have a Python script that outputs numbers, for example:
1001
10001
100001
I need to add zeros at the beginning so that the length is always equal to 10:
0000001001
0000010001
0000100001
I need to know how to do this in Python.
Thanks in advance.

In Python 3, integer literals with leading zeros (such as 0001001) raise a SyntaxError. You can, however, add the zeros when formatting the number as a string for display, which is what @ddejohn mentioned in the comments.

I figured out an answer:
x = 1001
n = 10 - len(str(x))
zeros = '0' * n
Then I can prepend the zeros to the string form of x, for example zeros + str(x).
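The same padding is also available from built-in string tools. A short sketch, assuming non-negative integers and a target width of 10 (the helper name pad10 is made up for illustration):

```python
def pad10(number, width=10):
    """Return the number as a string, left-padded with zeros to `width` characters."""
    return str(number).zfill(width)


for value in (1001, 10001, 100001):
    print(pad10(value))     # str.zfill on the string form
    print(f"{value:010d}")  # same result using a format specification
```

Numbers that already have more than 10 digits are returned unchanged rather than truncated.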

Why my code consumes too much memory even after clearing list?

So I'm trying to solve this problem, and the question goes like this:
Probably you all know about the famous Japanese cartoon characters Nobita and Shizuka. Nobita and Shizuka are very good friends. However, Shizuka loves a special kind of string called Tokushuna.
A string T is called Tokushuna if:
The length of the string is greater than or equal to 3 (|T| ≥ 3)
It starts and ends with the character '1' (one)
It contains (|T|-2) '0' (zero) characters
Here |T| is the length of string T. For example, 10001 and 101 are Tokushuna strings, but 1100, 1111 and 0000 are not.
One day Shizuka gave Nobita a problem and promised to go on a date with him if he could solve it. Shizuka gave a string S and asked him to count the number of Tokushuna strings that can be found among all possible substrings of S. Nobita wants to go on a date with Shizuka, but as you know, he is very weak in math and counting and always gets the lowest marks in math. And this time Doraemon is not present to help him, so he needs your help to solve the problem.
Input
The first line of the input contains an integer T, the number of test cases. In each test case, you are given a binary string S consisting only of 0 and 1.
Subtasks
Subtask #1 (50 points)
1 ≤ T ≤ 100
1 ≤ |S| ≤ 100
Subtask #2 (50 points)
1 ≤ T ≤ 100
1 ≤ |S| ≤ 10^5
Output
For each test case, output a line "Case X: Y", where X is the case number and Y is the number of Tokushuna strings that can be found among all possible substrings of S.
Sample
Input
3
10001
10101
1001001001
Output
Case 1: 1
Case 2: 2
Case 3: 3
In the first case, 10001 is itself a Tokushuna string.
In the second case, two substrings, S[1-3] = 101 and S[3-5] = 101, are Tokushuna strings.
What I've done so far
I've already solved the problem, but it shows that my code exceeds the memory limit (512 MB). I'm guessing it is because of the large input size. To solve that, I've tried to clear the list that holds all the substrings of one string after completing each operation, but this isn't helping.
My code
num = int(input())
num_list = []
for i in range(num):
    i = input()
    num_list.append(i)

def condition(a_list):
    case = 0
    case_no = 1
    sub = []
    for st in a_list:
        sub.append([st[i:j] for i in range(len(st)) for j in range(i + 1, len(st) + 1)])
        for i in sub:
            for item in i:
                if len(item) >= 3 and (item[0] == '1' and item[-1] == '1') and (len(item) - 2 == item.count('0')):
                    case += 1
        print("Case {}: {}".format(case_no, case))
        case = 0
        case_no += 1
        sub.clear()

condition(num_list)
Is there any better approach to solve the memory consumption problem?

Have you tried taking java heap dump and java thread dump? These will tell the memory leak and also the thread that is consuming memory.

Your method of creating all possible substrings won't scale very well to larger problems. If the input string has length N, the number of substrings is N * (N + 1) / 2. In other words, the memory needed grows roughly like N ** 2. That said, it would be a bit puzzling for your code to exceed 512MB if the input strings were limited to 105 characters. With a limit of 10^5 characters, however, there are about 5 * 10 ** 9 substrings per string, which easily explains it.
In any case, there is no need to store all of those substrings in memory, because a Tokushuna string cannot contain other Tokushuna strings nested within it:
1     # Leading one.
0...  # Some zeros. Cannot be or contain a Tokushuna.
1     # Trailing one. Could also be the start of the next Tokushuna.
That means a single scan over the string should be sufficient to find them all.
You could write your own algorithmic code to scan the characters and keep track of whether it finds a Tokushuna string, but that requires some tedious bookkeeping.
A better option is regex, which is very good at character-by-character analysis:
import sys
import re

# Usage: python foo.py 3 10001 10101 1001001001
cases = sys.argv[2:]

# Match a Tokushuna string without consuming the last '1', using a lookahead.
rgx = re.compile(r'10+(?=1)')

# Check the cases.
for i, c in enumerate(cases):
    matches = list(rgx.finditer(c))
    msg = 'Case {}: {}'.format(i + 1, len(matches))
    print(msg)
If you do not want to use regex, my first instinct would be to start the algorithm by finding the indexes of all of the ones: indexes = [j for j, c in enumerate(case) if c == '1']. Then pair those indexes up: zip(indexes, indexes[1:]). Then iterate over the pairs, checking whether the part in the middle is all zeros.
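The index-pairing idea could look roughly like the sketch below. It assumes the input contains only '0' and '1', so everything between two consecutive ones is already known to be zeros and only the gap size needs checking. The function name count_tokushuna is made up for illustration.

```python
def count_tokushuna(s):
    """Count substrings of the form 1, one or more 0s, 1 in a binary string."""
    # Positions of every '1' in the string.
    ones = [j for j, c in enumerate(s) if c == '1']
    # Consecutive ones form a Tokushuna string when at least one zero sits between them.
    return sum(1 for left, right in zip(ones, ones[1:]) if right - left >= 2)


for number, case in enumerate(["10001", "10101", "1001001001"], start=1):
    print("Case {}: {}".format(number, count_tokushuna(case)))
# Case 1: 1
# Case 2: 2
# Case 3: 3
```

Memory use here is proportional to the number of ones in the string, rather than to the number of substrings.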
A small note regarding your current code:
# Rather than this,
sub = []
for st in a_list:
    sub.append([...])  # Incurs memory cost of the temporary list
                       # and a need to drill down to the inner list.
    ...
    sub.clear()        # Also requires a step that's easy to forget.

# just do this.
for st in a_list:
    sub = [...]
    ...

How to call an index value from an itertools permutation without converting it to a list?

I need to create all combinations of these characters:
'0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
That are 100 letters long, such as:
'0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001'
I'm currently using this code:
import itertools
k_c = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
babel = itertools.product(k_c, repeat=100)
This code works, but I need to be able to return the combination at a certain index. However, itertools.product does not support indexing, turning the product into a list raises a MemoryError, and iterating through the product until it reaches a certain index takes too long for indexes over a billion.
Thanks for any help.

With 64 characters and 100 letters there will be 64^100 combinations. For each value of the first letter, there will be 64^99 combinations of the remaining letters, then 64^98, 64^97, and so on.
This means that your Nth combination can be expressed as N in base 64 where each "digit" represents the index of the letter in the string.
An easy solution would be to build the string recursively by progressively determining the index of each position and getting the rest of the string with the remainder of N:
chars = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '
def comboNumber(n,size=100):
    if size == 1: return chars[n]
    return comboNumber(n//len(chars),size-1)+chars[n%len(chars)]
output:
c = comboNumber(123456789000000000000000000000000000000000000123456789)
print(c)
# 000000000000000000000000000000000000000000000000000000000000000000000059.90jDxZuy6drpQdWATyZ8007dNJs
c = comboNumber(1083232247617211325080159061900470944719547986644358934)
print(c)
# 0000000000000000000000000000000000000000000000000000000000000000000000Python.Person says Hello World
Conversely, if you want to know at which combination index a particular string is located, you can compute its base-64 value by combining the character index (digit) at each position:
s = "Python.Person says Hello World" # leading zeroes are implied
i = 0
for c in s:
    i = i*len(chars)+chars.index(c)
print(i) # 1083232247617211325080159061900470944719547986644358934
You are now this much closer to understanding Base64 encoding, which is the same idea applied to 24-bit numbers coded over 4 characters (i.e. 3 binary bytes --> 4 alphanumeric characters), or any variant thereof.
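As a cross-check of the base-64 reasoning, here is an iterative sketch of the same conversion with a range check, compared against itertools.product on a small size where enumerating is still cheap. The names nth_combination and index_of are made up for illustration, and the sketch assumes the same chars string as above.

```python
import itertools

chars = '0123456789qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM. '


def nth_combination(n, size=100):
    """Return the n-th element (0-based) of itertools.product(chars, repeat=size) as a string."""
    base = len(chars)
    if not 0 <= n < base ** size:
        raise IndexError("combination index out of range")
    digits = []
    for _ in range(size):
        n, digit = divmod(n, base)  # peel off the least significant "digit"
        digits.append(chars[digit])
    return ''.join(reversed(digits))  # most significant digit first


def index_of(s):
    """Inverse of nth_combination: the index at which string s appears."""
    i = 0
    for c in s:
        i = i * len(chars) + chars.index(c)
    return i


# Compare with the real product on a small size.
size, n = 3, 200
expected = ''.join(next(itertools.islice(itertools.product(chars, repeat=size), n, None)))
print(nth_combination(n, size) == expected)         # True
print(index_of(nth_combination(10 ** 9)) == 10 ** 9)  # True
```

The loop avoids recursion, and the explicit range check turns an out-of-range index into a clear IndexError.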

Python3, Concatenating Strings

I'm new here and also new to Python. So, this is my question and I really don't know where to start. I created a list and wrote liste.add(str(input('input a string'))). Can you help me, thanks.
I have N strings and the target is to obtain maximum points by concatenating some of these strings. If we want to add string "B" after string "A" they should satisfy rules below:
● "A" should be lexicographically smaller than "B".
● Some suffix of "A" (with minimum length of 1) should be same with some prefix of "B". Ex: last three characters of string “abaca” is same with first three characters of string “acaba”.
● After concatenating string "A" and "B", we gain points equal to length of their overlap(3 for example above)
Range:
1 ≤ N ≤ 500
1 ≤ |Si| ≤ 500 (length of any string)
Input Format:
The first line contains N, the number of strings. The following lines contain the strings (all of them contain only lowercase English characters).
Output Format:
In a single line, print the maximum points the user can get.
Sample Input : 4 a ba ab acaba
Sample Output : 3
Explanation:
With a - acaba - ba order, user can get 1 + 2 = 3 points.

Firstly, I went ahead and got some sample code to run in my machine to fit your needs for an input. I also took the liberty of adding some additions to make sure your input's requirements were fully met. Here are my results below:
import itertools

n = int(input('Number of strings: '))
while n not in range(1, 501):
    print('Error, new number needed')
    n = int(input('Number of strings: '))
strings = []
for i in range(n):
    string = input('String: ')
    # Enforce the stated limit 1 <= |Si| <= 500.
    while not 1 <= len(string) <= 500:
        print('Error: string length must be between 1 and 500')
        string = input('String: ')
    strings.append(string)
You may notice that I imported itertools. That is because if you would like every combination of the strings in your list, itertools can help you get that done. I have also included the code from the question "Get every combination of strings", where this topic is covered. I'm not an expert on itertools, but hopefully the code below is a good leap forward:
S = set(['a', 'ab', 'ba'])
collect = set()
step = set([''])
while step:
    step = set(a+b for a in step for b in S if len(a+b) <= 6)
    collect |= step
print(sorted(collect))
If you have any other questions, please let me know. Happy programming!
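For the scoring part of the problem itself, one possible approach is dynamic programming over the strings in sorted order, since every chain must be strictly increasing lexicographically. The sketch below is only an illustration under stated assumptions: the points for a pair are taken to be the longest matching suffix/prefix, a chain may start at any string, and the names overlap and max_points are made up. It has not been tuned for the largest inputs.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def max_points(strings):
    """Best total overlap over chains that are strictly increasing lexicographically."""
    ordered = sorted(strings)
    best = [0] * len(ordered)  # best[j]: max points of a chain ending at ordered[j]
    for j in range(len(ordered)):
        for i in range(j):
            if ordered[i] < ordered[j]:  # strictly smaller, so duplicates never chain
                k = overlap(ordered[i], ordered[j])
                if k:
                    best[j] = max(best[j], best[i] + k)
    return max(best, default=0)


print(max_points(["a", "ba", "ab", "acaba"]))  # 3
```

On the sample input this reproduces the order a, acaba, ba with 1 + 2 = 3 points.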

Why do I have to change integers to strings in order to iterate them in Python?

First of all, I have only recently started to learn Python on Codecademy, and this is probably a very basic question, so thank you for the help and please forgive my lack of knowledge.
The function below takes a positive integer as input and returns the sum of that number's digits. What I don't understand is why I have to change the type of the input into str first, and then back into an integer, in order to add the number's digits to each other. Could someone help me out with an explanation please? The code works fine for the exercise, but I feel I am missing the big picture here.
def digit_sum(n):
    num = 0
    for i in str(n):
        num += int(i)
    return num

Integers are not sequences of digits. They are just (whole) numbers, so they can't be iterated over.
By turning the integer into a string, you created a sequence of digits (characters), and a string can be iterated over. It is no longer a number, it is now text.
See it as a representation; you could also have turned the same number into hexadecimal text, or octal text, or binary text. It would still be the same numerical value, just written down differently in text.
Iteration over a string works, and gives you single characters, which for a number means that each character is also a digit. The code takes that character and turns it back into a number with int(i).
You don't have to use that trick. You could also use maths:
def digit_sum(n):
    total = 0
    while n:
        n, digit = divmod(n, 10)
        total += digit
    return total
This uses a while loop, and repeatedly divides the input number by ten (keeping the remainder) until 0 is reached. The remainders are summed, giving you the digit sum. So 1234 is turned into 123 and 4, then 12 and 3, etc.
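To make the point about representations concrete, here is a small sketch that writes one integer as decimal, hexadecimal, octal and binary text, then sums its digits arithmetically in a chosen base. The base parameter is an addition for illustration and is not part of the original exercise. Non-negative integers are assumed.

```python
def digit_sum(n, base=10):
    """Sum the digits of a non-negative integer n when written in the given base."""
    total = 0
    while n:
        n, digit = divmod(n, base)  # digit is the remainder, n the quotient
        total += digit
    return total


n = 1234
# One numerical value, four different text representations.
print(str(n), format(n, 'x'), format(n, 'o'), format(n, 'b'))

print(digit_sum(n))             # 10, i.e. 1 + 2 + 3 + 4
print(digit_sum(255, base=16))  # 30, i.e. 15 + 15 for hexadecimal ff
```

The integer itself has no digits until a base is chosen, which is why the string version and the divmod version both have to pick one.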

Let's say the number is 12345.
I would need 1, 2, 3, 4, 5 from the given number and then sum them up.
So how do we get the individual digits? One mathematical way is what @Martijn Pieters showed.
Another is to convert the number into a string, which makes it iterable.
This is one of the many ways to do it:
>>> sum(map(int, list(str(12345))))
15
The list() function breaks a string into individual characters, so I needed a string. Once I have all the digits as individual characters, I can convert them into integers and add them up. (The list() call is optional here, since map() can iterate over the string directly.)
