Extracting columns having values >= 90% - python

I wrote this script to extract values from my .txt file that have >= 90 % identity. However, this program does not take into consideration values higher than 100.00 for example 100.05, why?
import re
output=open('result.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    id_per=new_list[2]
    if id_per >= '90':
        new_list.append(id_per)
        output.writelines(line)
f.close()
output.close()
Input file example
A 99.12
B 93.45
C 100.00
D 100.05
E 87.5

You should compare them as floats not strings. Something as follows:
import re
output=open('result.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    id_per=new_list[2]
    if float(id_per) >= 90.0:
        new_list.append(id_per)
        output.writelines(line)
f.close()
output.close()
This is because Python is comparing the values as strings, even though you want them treated as numbers. For strings, Python compares character by character using Unicode code points. That is why your code raises no error but does not behave as you expect: it follows string rules rather than float rules.
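To make the difference concrete, here is a minimal sketch that filters the sample values from the question both ways. The list `values` is just the second column of the example input, typed in by hand for illustration.

```python
# Strings compare character by character; floats compare by value.
values = ["99.12", "93.45", "100.00", "100.05", "87.5"]

# String rules: '1' sorts before '9', so anything starting with "1" is dropped.
as_strings = [v for v in values if v >= "90"]

# Float rules: each value is converted before the comparison.
as_floats = [v for v in values if float(v) >= 90.0]

print(as_strings)  # ['99.12', '93.45']
print(as_floats)   # ['99.12', '93.45', '100.00', '100.05']
```

Note that the string comparison drops 100.00 as well as 100.05, since both start with the character '1'.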

As an alternative to @sshashank124's answer, you could use simple string manipulation if your lines have a simple format:
output=open('result.txt','w')
f=open('file.txt','r')
for line in f:
    words = line.split()
    num_per=words[1]
    if float(num_per) >= 90:
        # write the matching line unchanged
        output.write(line)
f.close()
output.close()

Python is a dynamically but strongly typed language. Therefore 90 and '90' are completely different things: one is an integer and the other is a string.
You're comparing strings, and in string comparison '90' is "greater" than '100.05' (strings are compared character by character, and '9' is greater than '1').
So what you need to do is:
convert id_per to number (you'll want probably floats, as you care about decimal places)
compare it to number, i.e., 90, not a '90'
In code:
id_per = float(new_list[2])
if id_per >= 90:
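Putting the two steps together, a small helper might look like the sketch below. The function name `keep_high_identity` and its parameters are hypothetical, and the column index defaults to 1 because the example input shows the value in the second field; adjust it if your real file has the percentage in another column.

```python
def keep_high_identity(lines, column=1, threshold=90.0):
    """Return the lines whose value in `column` is >= threshold.

    Hypothetical helper for illustration; `column` is a 0-based field index.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) <= column:
            continue  # skip blank lines or lines with too few fields
        if float(fields[column]) >= threshold:
            kept.append(line)
    return kept


sample = ["A\t99.12\n", "B\t93.45\n", "C\t100.00\n", "D\t100.05\n", "E\t87.5\n"]
result = keep_high_identity(sample)
print("".join(result), end="")
```

With the sample input, lines A through D are kept and line E is dropped.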

You are using string comparison: lexically, '100' is less than '90'. I bet that it works for 950...
Get rid of the quotes around the '90' and convert id_per to a number before comparing.


Comparing slices of binary data in python

Say I have a file in hexadecimal and I need to search for repeating sets of bytes in it. What would be the best way to do this in python?
Right now what I'm doing is treating everything as a string with the re module, which is extremely slow and not the right way to do it. I just can't figure out how to slice up and compare binary data.
for i in range(int(len(data))):
    string = data[i:i+16]
    pattern = re.compile(string)
    m = pattern.findall(data)
    count += 1
    if len(m) > 1:
        k = [str(i), str(len(m))]
        t = ":".join(k)
        output_file.write(' {}'.format(t))
    else:
        continue
Just to make sure there's no confusion, data here is just a big string of hex data from open('pathtofile/file', 'r')

Python: split line by comma, then by space

I'm using Python 3 and I need to parse a line like this
-1 0 1 0 , -1 0 0 1
I want to split this into two lists using Fraction so that I can also parse entries like
1/2 17/12 , 1 0 1 1
My program uses a structure like this
from sys import stdin
...
functions'n'stuff
...
for line in stdin:
and I'm trying to do
for line in stdin:
    X = [str(elem) for elem in line.split(" , ")]
    num = [Fraction(elem) for elem in X[0].split()]
    den = [Fraction(elem) for elem in X[1].split()]
but all I get is a list index out of range error: den = [Fraction(elem) for elem in X[1].split()]
IndexError: list index out of range
I don't get it. I get a string from line. I split that string into two strings at " , " and should get one list X containing two strings. These I split at the whitespace into two separate lists while converting each element into Fraction. What am I missing?
I also tried adding X[-1] = X[-1].strip() to get rid of \n that I get from ending the line.
The problem is that your file has a line without a " , " in it, so the split doesn't return 2 elements.
I'd use split(',') instead, and then use strip to remove the leading and trailing blanks. Note that str(...) is redundant, split already returns strings.
X = [elem.strip() for elem in line.split(",")]
You might also have a blank line at the end of the file, which would still only produce one result for split, so you should have a way to handle that case.
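One way to handle both cases is sketched below. The helper name `parse_line` is hypothetical; it returns None for any line that does not split into exactly two parts, so the caller can skip it.

```python
from fractions import Fraction


def parse_line(line):
    """Split 'a b c , d e f' into two lists of Fractions (hypothetical helper).

    Returns None when the line does not contain exactly one comma,
    which also covers blank lines.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 2:
        return None
    num = [Fraction(tok) for tok in parts[0].split()]
    den = [Fraction(tok) for tok in parts[1].split()]
    return num, den


print(parse_line("1/2 17/12 , 1 0 1 1\n"))
print(parse_line("\n"))  # None
```

Inside the `for line in stdin:` loop you would call `parse_line(line)` and `continue` when it returns None.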
With valid input, your code actually works.
You probably get an invalid line, with too much space or even an empty line or so. So first thing inside the loop, print line. Then you know what's going on, you can see right above the error message what the problematic line was.
Or maybe you're not using stdin right. Write the input lines in a file, make sure you only have valid lines (especially no empty lines). Then feed it into your script:
python myscript.py < test.txt
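How about this one (each side of the comma still has to be split on whitespace before converting to Fraction):
pairs = [line.split(",") for line in stdin]
num = [[Fraction(x) for x in elem[0].split()] for elem in pairs if len(elem) == 2]
den = [[Fraction(x) for x in elem[1].split()] for elem in pairs if len(elem) == 2]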

Codeeval Challenge not returning correct output. (Python)

So I started doing the challenges on CodeEval and I'm stuck at an easy challenge called "word to digit".
This is the challenge description:
Having a string representation of a set of numbers you need to print
this numbers.
All numbers are separated by semicolon. There are up to 20 numbers in one line. The numbers are "zero" to "nine"
input sample:
zero;two;five;seven;eight;four
three;seven;eight;nine;two
output sample:
025784
37892
I have tested my code and it works, but in codeeval the output is always missing the last number from each line of words in the input file.
This is my code:
import sys
def WordConverter(x):
    test=str()
    if (x=="zero"):
        test="0"
    elif (x=="one"):
        test="1"
    elif (x=="two"):
        test="2"
    elif (x=="three"):
        test="3"
    elif (x=="four"):
        test="4"
    elif (x=="five"):
        test="5"
    elif (x=="six"):
        test="6"
    elif (x=="seven"):
        test="7"
    elif (x=="eight"):
        test="8"
    elif (x=="nine"):
        test="9"
    return (test)
t=str()
string=str()
test_cases=open(sys.argv[1],'r')
for line in test_cases:
    string=line.split(";")
    for i in range(0,len(string)):
        t+=WordConverter(string[i])
    print (t)
    t=str()
Am I doing something wrong? Or is it a Codeeval bug?
You just need to remove the newline char from the input. Replace:
string=line.split(";")
With
string=line.strip().split(";")
However, using string as the variable name is not a good decision...
When iterating over the lines of a file with for line in test_cases:, each value of line will include the newline at the end of the line (if any). This results in the last element of string having a newline at the end, and so this value won't compare equal to anything in WordConverter, causing an empty string to be returned. You need to remove the newline from the string at some point.
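As a sketch of the same fix in a more compact form, the if/elif chain can be replaced by a dictionary lookup. The names `DIGITS` and `words_to_digits` are made up for this example; unknown words map to an empty string, mirroring what WordConverter returns.

```python
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def words_to_digits(line):
    """Convert 'zero;two;five' to '025' (hypothetical helper)."""
    # strip() removes the trailing newline so the last word matches a key
    words = line.strip().split(";")
    return "".join(DIGITS.get(word, "") for word in words)


print(words_to_digits("zero;two;five;seven;eight;four\n"))  # 025784
print(words_to_digits("three;seven;eight;nine;two\n"))      # 37892
```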

How does len function actually work for files?

The python docs say: Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).
Code:
from sys import argv
script, from_file = argv
input = open(from_file)
indata = input.read()
print "The input file is %d bytes long" % len(indata)
Contents of the file:
One two three
Upon running this simple program I get as output: The input file is 14 bytes long
Question:
I don't understand: if my file contains only 11 characters (One two three), how can len return 14 bytes and not simply 11? (What's with the bytes, by the way?) In the Python interpreter, if I type s = "One two three" and then len(s), I get 13, so I am very confused.
"One two three" is indeed 13 chars (11 letters + 2 spaces).
>>> open("file.txt", 'w').write("one two three")
>>> len(open("file.txt").read())
13
Most likely you have an extra char for the endline, which explains the 14.
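A quick way to check this is to look at the repr of what was read, which makes invisible characters visible. The sketch below uses an in-memory string in place of the file contents; `with_newline` stands in for what `indata` would hold if the file ends with a line break.

```python
text = "One two three"
with_newline = text + "\n"   # what read() returns if the file ends with a newline

print(len(text))                        # 13
print(len(with_newline))                # 14
print(repr(with_newline))               # 'One two three\n'
print(len(with_newline.rstrip("\n")))   # 13
```

Most text editors add a final newline when saving, which would account for the extra character.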
One two three
one = 3 characters
two = 3 characters
three = 5 characters
and then you have two spaces. So a total of 13 characters.
When reading from the file there is one extra character in it (a space or, more likely, a trailing newline), so you get 14 characters.
In your python interpreter do this:
st = " "
len(st)
output = 1
I used your code and created a file with similar content. The result: you do indeed have an extra non-printable character in your "One two three" file; it's the only explanation. A space or a line break are the most obvious things to look for.

python: find and replace numbers < 1 in text file

I'm pretty new to Python programming and would appreciate some help to a problem I have...
Basically I have multiple text files which contain velocity values as such:
0.259515E+03 0.235095E+03 0.208262E+03 0.230223E+03 0.267333E+03 0.217889E+03 0.156233E+03 0.144876E+03 0.136187E+03 0.137865E+00
etc for many lines...
What I need to do is convert all the values in the text file that are less than 1 (e.g. 0.137865E+00 above) to an arbitrary value of 0.100000E+01. While it seems pretty simple to replace specific values with the 'replace()' method and a while loop, how do you do this if you want to replace a range?
thanks
I think when you are beginning programming, it's useful to see some examples; and I assume you've tried this problem on your own first!
Here is a break-down of how you could approach this:
contents='0.259515E+03 0.235095E+03 0.208262E+03 0.230223E+03 0.267333E+03 0.217889E+03 0.156233E+03 0.144876E+03 0.136187E+03 0.137865E+00'
The split method works on strings. It returns a list of strings. By default, it splits on whitespace:
string_numbers=contents.split()
print(string_numbers)
# ['0.259515E+03', '0.235095E+03', '0.208262E+03', '0.230223E+03', '0.267333E+03', '0.217889E+03', '0.156233E+03', '0.144876E+03', '0.136187E+03', '0.137865E+00']
The map function applies its first argument (the function float) to each of the elements of its second argument (the list string_numbers). The float function converts each string into a floating-point object. In Python 3, map returns an iterator, so wrap it in list to get something you can print and reuse.
float_numbers=list(map(float,string_numbers))
print(float_numbers)
# [259.515, 235.095, 208.262, 230.223, 267.333, 217.889, 156.233, 144.876, 136.187, 0.137865]
You can use a list comprehension to process the list, converting numbers less than 1 into the number 1. The conditional expression (1 if num<1 else num) equals 1 when num is less than 1, otherwise, it equals num.
processed_numbers=[(1 if num<1 else num) for num in float_numbers]
print(processed_numbers)
# [259.515, 235.095, 208.262, 230.223, 267.333, 217.889, 156.233, 144.876, 136.187, 1]
This is the same thing, all in one line:
processed_numbers=[(1 if num<1 else num) for num in map(float,contents.split())]
To generate a string out of the elements of processed_numbers, you could use the str.join method:
comma_separated_string=', '.join(map(str,processed_numbers))
# '259.515, 235.095, 208.262, 230.223, 267.333, 217.889, 156.233, 144.876, 136.187, 1'
A typical technique would be:
read the file line by line
split each line into a list of strings
convert each string to a float
compare the converted value with 1
replace when needed
write back to a new file
As I don't see you having any code yet, I hope that this would be a good start:
def float_filter(input):
    # yield each number unchanged, or the replacement if it is below 1
    for number in input.split():
        if float(number) < 1.0:
            yield "0.100000E+01"
        else:
            yield number
input = "0.259515E+03 0.235095E+03 0.208262E+03 0.230223E+03 0.267333E+03 0.217889E+03 0.156233E+03 0.144876E+03 0.136187E+03 0.137865E+00"
print(" ".join(float_filter(input)))
import numpy as np
a = np.genfromtxt('file.txt') # read file
a[a<1] = 1.0 # replace values below 1 with 1.0 (0.100000E+01)
np.savetxt('converted.txt', a) # save to file
You could use regular expressions for parsing the string. I'm assuming here that the mantissa is never larger than 1 (ie, begins with 0). This means that for the number to be less than 1, the exponent must be either 0 or negative. The following regular expression matches '0', '.', unlimited number of decimal digits (at least 1), 'E' and either '+00' or '-' and two decimal digits.
0\.\d+E(-\d\d|\+00)
Assuming that you have the file read into variable 'text', you can use the regexp with the following python code:
result = re.sub(r"0\.\d+E(-\d\d|\+00)", "0.100000E+01", text)
Edit: Just realized that the description doesn't limit the valid range of input numbers to positive numbers. Negative numbers can be matched with the following regexp:
-0\.\d+E[-+]\d\d
This can be alternated with the first one using the (pattern1|pattern2) syntax which results in the following Python code:
result = re.sub(r"(0\.\d+E(-\d\d|\+00)|-0\.\d+E[-+]\d\d)", "0.100000E+01", text)
Also, if there's a chance that the exponent goes past 99, the regexp can be further modified by adding a '+' sign after the '\d\d' patterns. This allows matching exponents with two or more digits.
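Here is a small runnable sketch of the combined pattern applied to a made-up sample string. It keeps the answer's assumption that every mantissa starts with "0."; putting the negative alternative first and using a non-capturing group are choices made for this example, not part of the original answer.

```python
import re

# Negative numbers (any exponent), or positive numbers with exponent <= 0.
# Assumes every mantissa is written as 0.xxxxxx, as in the question's data.
PATTERN = re.compile(r"-0\.\d+E[-+]\d\d|0\.\d+E(?:-\d\d|\+00)")

text = "0.259515E+03 0.137865E+00 0.500000E-02 -0.300000E+02 0.100000E+01"
result = PATTERN.sub("0.100000E+01", text)

print(result)
# 0.259515E+03 0.100000E+01 0.100000E+01 0.100000E+01 0.100000E+01
```

Values that do not follow the 0.xxxxxxE±dd layout (for example 10.5E+00) would be mishandled, so converting with float and comparing is the safer route when the format is not guaranteed.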
I've got the script working as I want now...thanks people.
When writing the list to a new file I used the replace method to get rid of the brackets and commas - is there a simpler way?
ftext = open("C:\\Users\\hhp06\\Desktop\\out.grd", "r")
otext = open("C:\\Users\\hhp06\\Desktop\\out2.grd", "w+")
for line in ftext:
    stringnum = line.split()
    floatnum = map(float, stringnum)
    procnum = [(1.0 if num<1 else num) for num in floatnum]
    stringproc = str(procnum)
    s = (stringproc).replace(",", " ").replace("[", " ").replace("]", "")
    otext.writelines(s + "\n")
otext.close()
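One possibly simpler way, sketched here, is to build the output line with str.join instead of converting the whole list to a string and stripping the brackets and commas afterwards. The helper name `format_line` and the sample list are made up for illustration.

```python
def format_line(numbers):
    """Join numbers with single spaces (hypothetical helper)."""
    return " ".join(str(num) for num in numbers)


procnum = [259.515, 1.0, 136.187]
s = format_line(procnum)
print(s)  # 259.515 1.0 136.187
```

In the loop above this would replace the `str(procnum)` and `replace` lines, followed by `otext.write(s + "\n")`. Note that `str` does not reproduce the original E-notation; if the output must keep that layout, keeping the original string tokens and swapping only the ones below 1 avoids reformatting altogether.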
