Python - Appending and Sorting a List - python

I'm writing a program that takes an option (i, w or f) from the command line via argv. Then, using input(), I want to read a list of integers, floats or words and process it.
For example, the user enters 'f' on the command line and then inputs a list of floating-point numbers, which are appended to an empty list. The program then sorts the list of floats and prints the result.
I want to do something similar for words and integers.
If the input is a list of words, the output should be the words in alphabetical order. If the input is a list of integers, the output should be the list in reverse order.
This is the code that I have so far, but right now the input values are only being appended to the empty list. What am I missing that prevents the code from executing properly?
For example, the program is started with the program name and 'w' for words:
$ test.py w
>>> abc ABC def DEF
[ABC, DEF,abc,def] # list by length, alphabetizing words
code
import sys, re
script, options = sys.argv[0], sys.argv[1:]
a = []
for line in options:
    if re.search('f',line): # 'f' in the command line
        a.append(input())
        a.join(sorted(a)) # sort floating point ascending
        print (a)
    elif re.search('w', line):
        a.append.sort(key=len, reverse=True) # print list in alphabetize order
        print(a)
    else: re.search('i', line)
    a.append(input())
    ''.join(a)[::-1] # print list in reverse order
    print (a)

Try this:
import sys
option, values = sys.argv[1], sys.argv[2:]
tmp = {
    'i': lambda v: map(int, v),
    'w': lambda v: map(str, v),
    'f': lambda v: map(float, v)
}
print(sorted(tmp[option](values)))
Output:
shell$ python my.py f 1.0 2.0 -1.0
[-1.0, 1.0, 2.0]
shell$
shell$ python my.py w aa bb cc
['aa', 'bb', 'cc']
shell$
shell$ python my.py i 10 20 30
[10, 20, 30]
shell$
You'll have to add the necessary error handling. For example:
>>> float('aa')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: aa
>>>
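As a rough sketch of how the three behaviours from the question (floats ascending, words alphabetical, integers reversed) could be combined with the error handling mentioned above: the function name process and its arguments are made up for illustration, and the code assumes Python 3 and whitespace-separated input.

```python
def process(option, tokens):
    """Convert tokens according to option and return them in the requested order.

    `process` is a hypothetical helper, not part of the original answer.
    """
    if option == "f":
        return sorted(float(t) for t in tokens)   # floats, ascending
    if option == "w":
        return sorted(tokens)                     # words, alphabetical (uppercase sorts first)
    if option == "i":
        return [int(t) for t in tokens][::-1]     # integers, input order reversed
    raise ValueError("unknown option: %r" % option)
```

A script could then call process(sys.argv[1], input().split()) inside a try/except ValueError block, so that input such as 'aa' for option f produces a readable message instead of a traceback.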

Related

Read python list correctly

I have defined a function that takes in a list like this
arr = ['C','D','E','I','M']
I have another function that produces a similar kind of list, the function is:
def tree_count(arr):
    feat = ['2','2','2','2','0']
    feat_2 = []
    dictionary = dict(zip(arr, feat))
    print('dic',dictionary)
    feat_2.append([k for k,v in dictionary.items() if v=='2'])
    newarr = str(feat_2)[1:-1]
    print(newarr)
This outputs the correct result that I want, i.e:
['C','D','E','I']
But the problem is that when I use this list in another function, its values should be read as C, D, E, I. Instead, when I print them, the bracket [ and the quote ' are included in the result:
for i in newarr:
    print(i)
The printed result is [, ', C, ', and so on, one character per line. I want to get rid of the [ and '. How do I solve this?
For some reason you are using str() on the array; this is what causes the square brackets and quotes from the array to appear in the printed output.
See if the following methods suit you:
print(arr) # ['C','D','E','I'] - the array itself
print(str(arr)) # "['C', 'D', 'E', 'I']" - the array as string literal
print(''.join(arr)) # 'CDEI' - array contents as string with no spaces
print(' '.join(arr)) # 'C D E I' - array contents as string with spaces
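To make the difference concrete, here is a small sketch (variable names are illustrative only) of what a loop sees when it iterates over str(list) versus the list itself:

```python
letters = ['C', 'D', 'E', 'I']

as_text = str(letters)   # a single string: "['C', 'D', 'E', 'I']"
chars = list(as_text)    # iterating a string yields one character at a time
items = list(letters)    # iterating the list yields its elements
```

Here chars starts with '[', "'", 'C', which is exactly the unwanted output from the question, while items holds the four letters.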
Make your function return the list of matching keys rather than just printing it:
def tree_count(arr):
    feat = ['2','2','2','2','0']
    dictionary = dict(zip(arr, feat))
    dictionary = [k for k in dictionary if dictionary[k] == '2']
    return dictionary
For instance,
>>> results = tree_count(['C','D','E','I','M'])
>>> print(results)
['I', 'C', 'D', 'E']
Pretty-printing is then fairly straightforward:
>>> print("\n".join(results))
I
C
D
E
... or, if you just want commas:
>>> print(", ".join(results))
I, C, D, E

Parsing key-value log in Python

I have a system log that looks like the following:
{
a = 1
b = 2
c = [
x:1,
y:2,
z:3,
]
d = 4
}
I want to parse this in Python into a dictionary, splitting key-value pairs on =. At the same time, the array enclosed by [] should be preserved. I want to keep this as generic as possible so that the parser can also handle future variations.
What I have tried so far (code still to be written): split each line on "=" into a key-value pair, determine where [ and ] start and end, and then split the lines in between on ":" into key-value pairs. That seems a little hard-coded. Any better ideas?
This can be converted to YAML fairly easily. Run pip install pyyaml, then set up like so:
import string, yaml
data = """
{
a = 1
b = 2
c = [
x:1,
y:2,
z:3,
]
d = 4
}
"""
With this setup, you can use the following to parse your data:
data2 = data.replace(":", ": ").replace("=", ":").replace("[","{").replace("]","}")
lines = data2.splitlines()
for i, line in enumerate(lines):
    if len(line)>0 and line[-1] in string.digits and not line.endswith(",") or i < len(lines) - 1 and line.endswith("}"):
        lines[i] += ","
data3 = "\n".join(lines)
# safe_load needs no Loader argument, which recent PyYAML versions require for yaml.load
yaml.safe_load(data3) # {'a': 1, 'b': 2, 'c': {'x': 1, 'y': 2, 'z': 3}, 'd': 4}
Explanation
In the first line, we perform some simple substitutions:
- YAML requires a space after the colon in key/value pairs, so replace(":", ": ") ensures this.
- Since YAML key/value pairs are always denoted by a colon and your format sometimes uses equals signs, we replace equals signs with colons using .replace("=", ":")
- Your format sometimes uses square brackets where YAML needs curly brackets. We fix this using .replace("[","{").replace("]","}")
At this point, your data looks like this:
{
a : 1
b : 2
c : {
x: 1,
y: 2,
z: 3,
}
d : 4
}
Next, we have a for loop. This is simply responsible for adding commas after lines where they're missing. The two cases in which commas are missing are:
- They're absent after a numeric value
- They're absent after a closing bracket
We match the first of these cases using len(line)>0 and line[-1] in string.digits (the last character in the line is a digit)
The second case is matched using i < len(lines) - 1 and line.endswith("}"). This checks if the line ends with }, and also checks that the line is not the last, since YAML won't allow a comma after the last bracket.
After the loop, we have:
{
a : 1,
b : 2,
c : {
x: 1,
y: 2,
z: 3,
},
d : 4,
}
which is valid YAML. All that's left is to load it with PyYAML, and you've got yourself a Python dict.
If anything isn't clear please leave a comment and I'll happily elaborate.
There is probably a better answer, but I would take advantage of all your dictionary keys being at the same indentation level. There's no obvious way to do this with newline splitting, JSON loading, or that sort of thing, since the list structure is a bit weird (it seems like a cross between a list and a dictionary).
Here's an implementation that parses keys based on indentation level:
import re
log = '''{
 a = 1
 b = 2
 c = [
  x:1,
  y:2,
  z:3,
  ]
 d = 4
}'''
log_lines = log.split('\n')[1:-1] # strip bracket lines
KEY_REGEX = re.compile(r' [^ ]') # key line: one space of indentation, then a non-space
d = {}
current_pair = ''
for i, line in enumerate(log_lines):
    if KEY_REGEX.match(line):
        # a new key starts: store the pair collected so far
        if current_pair:
            key, value = current_pair.split('=')
            d[key.strip()] = value.strip()
        current_pair = line
    else:
        # deeper-indented line: part of the current value
        current_pair += line.strip()
if current_pair:
    key, value = current_pair.split('=')
    d[key.strip()] = value.strip()
print(d)
Output:
{'d': '4', 'c': '[x:1,y:2,z:3,]', 'a': '1', 'b': '2'}
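In the output above every value is still a string, including the bracketed block. If a nested dictionary is wanted without PyYAML, a line-by-line parser is one option. The sketch below is only an illustration under stated assumptions: parse_log is a made-up name, all leaf values are assumed to be integers, and brackets are assumed to nest only one level deep, as in the sample log.

```python
def parse_log(text):
    """Parse the sample log format into a dict (hypothetical helper).

    Assumes integer leaf values and at most one level of [ ... ] nesting.
    """
    result = {}
    nested_key = None                      # key whose [ ... ] block is currently open
    for raw in text.splitlines():
        line = raw.strip().rstrip(",")
        if not line or line in ("{", "}"):
            continue                       # skip blank lines and the outer braces
        if line == "]":
            nested_key = None              # the bracketed block is finished
        elif nested_key is not None:
            key, value = line.split(":", 1)
            result[nested_key][key.strip()] = int(value)
        else:
            key, value = (part.strip() for part in line.split("=", 1))
            if value == "[":
                nested_key = key
                result[key] = {}
            else:
                result[key] = int(value)
    return result


sample = """{
a = 1
b = 2
c = [
x:1,
y:2,
z:3,
]
d = 4
}"""
parsed = parse_log(sample)
```

Because it does not rely on indentation, this also works on logs whose leading whitespace has been stripped.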

How do I fix the error "unhashable type: 'list'" when trying to make a dictionary of lists?

I'm trying to make a dictionary of lists. The input I am using looks like this:
4
1: 25
2: 20 25 28
3: 27 32 37
4: 22
where 4 is the number of lines that follow in that format. The first thing I did was remove the "#: " prefix and keep the rest of each line, like so:
['25']
['20','25','28']
['27','32','37']
['22']
Finally, the goal was to put those lists into a dictionary as keys, each mapped to the number of values it holds.
So I wanted the dictionary to look like this:
['25'] : 1
['20','25','28'] : 3
['27','32','37'] : 3
['22'] : 1
However, when I tried to code the program, I got the error:
TypeError: unhashable type: 'list'
Here is my code:
def findPairs(people):
    pairs = {}
    for person in range(people):
        data = raw_input()
        #Remove the "#: " in each line and format numbers into ['1','2','3']
        data = (data[len(str(person))+2:]).split()
        options = len(data)
        pairs.update({data:options})
    print(pairs)
findPairs(input())
Does anyone know how I can fix this and create my dictionary?
Lists are mutable and therefore cannot be hashed (what would happen if the list changed after being used as a key?).
Use tuples, which are immutable, instead:
d = dict()
lst = [1,2,3]
d[tuple(lst)] = "some value"
print(d[tuple(lst)]) # prints "some value"
A list is an unhashable type; you need to convert it into a tuple before using it as a dictionary key:
>>> lst = [['25'], ['20','25','28'], ['27','32','37'], ['22']]
>>> print dict((tuple(l), len(l)) for l in lst)
{('20', '25', '28'): 3, ('22',): 1, ('25',): 1, ('27', '32', '37'): 3}
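Applied to the function from the question, the tuple conversion might look like the following Python 3 sketch. The name find_pairs and the choice to pass the lines in as a list (instead of reading them with raw_input) are assumptions made to keep the example self-contained.

```python
def find_pairs(lines):
    """Map each line's numbers (as a tuple) to how many numbers the line holds."""
    pairs = {}
    for line in lines:
        # Drop the "N: " prefix, then split the rest on whitespace.
        _, _, rest = line.partition(":")
        numbers = tuple(rest.split())   # tuples are hashable, lists are not
        pairs[numbers] = len(numbers)
    return pairs


pairs = find_pairs(["1: 25", "2: 20 25 28", "3: 27 32 37", "4: 22"])
```

Note that two lines holding the same numbers would collapse into a single key, which is inherent in using the numbers themselves as dictionary keys.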

Why I can't convert a list of str to a list of floats?

I'm starting to write some code, but it fails right at the beginning.
This is my code:
import csv
reader = csv.reader(open("QstartRefseqhg19.head"), dialect='excel-tab' )
for row in reader:
    C = row[1].split(",")[1:]
    C1 = [float(i) for i in C]
    print C1
and the error log says:
Traceback (most recent call last):
File "/home/geparada/workspace/SJtag/src/TagGen.py", line 8, in <module>
C1 = [float(i) for i in C]
ValueError: empty string for float()
I've also tried
import csv
reader = csv.reader(open("QstartRefseqhg19.head"), dialect='excel-tab' )
for row in reader:
    C = row[1].split(",")[1:]
    C1 = map(float, C)
    print C1
My input file looks like this:
NM_032291 0,227,291,316,388,445,500,676,688,700,725,777,863,956,1031,1532,1660,1787,1847,1959,2115,2248,2451,2516,2681, tttctctcagcatcttcttggtagcctgcctgtaggtgaagaagcaccagcagcatccatggcctgtcttttggcttaacacttatctcctttggctttgacagcggacggaatagacctcagcagcggcgtggtgaggacttagctgggacctggaatcgtatcctcctgtgttttttcagactccttggaaattaaggaatgcaattctgccaccatgatggaaggattgaaaaaacgtacaaggaaggcctttggaatacggaagaaagaaaaggacactgattctacaggttcaccagatagagatggaattcagcccagcccacacgaaccaccctacaatagcaaagcagagtgtgcgcgtgaaggaggaaaaaaagtttcgaagaaaagcaatggggcaccaaatggattttatgcggaaattgattgggaaagatataactcacctgagctggatgaagaaggctacagcatcagacccgaggaacccggctctaccaaaggaaagcacttttattcttcaagtgaatcggaagaagaagaagaatcacataagaaatttaatatcaagattaaaccattgcaatctaaagacattcttaagaatgctgcaactgtagatgaattgaaggcatcaataggcaacatcgcactttccccatcaccagtgaggaaaagtccgaggcgcagcccgggtgcaattaaaaggaacttatccagtgaagaagtggcaagacccaggcgttccacaccaactccagaacttataagcaaaaagcctccagatgacactacggcccttgctcctctctttggcccaccactagaatcagcttttgatgaacagaagacagaagttcttttagatcagcctgagatatggggttcaggccaaccaattaatccaagcatggagtcgccaaagttaacaaggccttttcccactggaacacctccaccactgcctccaaaaaatgtaccagctaccccaccccgaacaggatcccccttaacaattggaccaggaaatgaccagtcagccacagaggtcaaaattgaaaaactaccatccatcaatgacttggacagcatttttgggccagtattgtcccccaagtctgttgctgttaatgctgaagaaaagtgggtccatttttctgatacatccccggaacatgttactccggagttgactccaagggaaaaagtggtgtccccaccagctacaccagacaacccagctgactccccagctccaggccctctcggccccccaggtcccacaggccccccagggcctcctgggcctcctcgcaatgtactatcgccgctcaatttagaagaagtccagaagaaagtcgctgagcagaccttcattaaagatgattacttagaaacaatctcatctcctaaagattttgggttgggacaaagagcaactccacctcccccaccaccacccacctacaggactgtggtttcgtcccccggacctggctcgggccctggtccggggaccaccagtggtgcatcatcccctgctcgaccagccactcctttggttccttgcagaagtaccactccacctccacctcctccccggcctccatcccggccaaagctacctccaggaaaacctggagttggagatgtgtccagaccttttagccctcccattcattcttccagccctcctccaatagcacccttagcgcgggctgaaagcacttcttcaatatcgtcaaccaattccttgagcgcagccaccactcccacagttgagaatgaacagccttccctcgtttggtttgacagaggaaagttttatttgacttttgaaggttcttccaggggacccagccccctaaccatgg
gagctcaggacactctccctgttgcagcagcatttacagaaacagtcaatgcctatttcaaaggagcagacccaagcaaatgtatcgttaagattaccggagaaatggtgttgtcatttcctgctggcatcaccagacactttgccaacaacccgtccccagctgctctgacttttcgggtgataaatttcagcaggttagaacacgtcctgccaaacccccaacttctctgctgtgataatacacaaaatgatgccaataccaaggaattctgggtaaacatgccaaatttgatgactcacctaaagaaagtgtctgaacaaaaaccccaggctacatattataacgttgacatgctcaaatatcaggtgtctgcccagggcattcagtccacacctctgaacctggcagtgaattggcgatgtgagccttcaagcactgacctgcgcatagattacaaatataatacagatgcaatgacgactgctgtggccctcaacaatgtgcagttcctggtccccatcgacggaggagtcaccaagctccaggcagtgctcccaccagcagtctggaatgctgaacaacagagaatattgtggaagattcctgatatctctcagaagtcagaaaatggaggggtgggttctttgttggcaagatttcagttatctgaaggcccaagcaaaccttctccattggttgtgcagttcacaagtgaaggaagcaccctttctggctgtgacattgaacttgttggagcagggtatcgattttcactcatcaagaaaaggtttgctgcaggaaaatacttggcagataactaatgaaatcttatgcaaggatttggaggattcatataatggagaactgatgtatgagaaacagattttaattttggtttgatgaaaacaaaccaatatctgcacttgggatatatcaggtggaaagtcaatgactttcatctgtgatttccctcacacactaccatgatgaccagtcctacagtatttacttctaggtgtaatattgttaatggttttaaaatgtaattattgtatttgtaaattgtactctcattccagtaaggcagttagacacttgagttttagcattttaccattcctgaaatggatgtaatttaaactgtggtatgtaaatttaatagtagtattgttgaatggcacaatgcttacagaggtagattgcattttgtcaatatataaaatttaaatataatattgatagctgtcataaagggggtgccacatattaaagaaacttaagtggaaccagaagaaaaagaaacaaacttacttttcttcaatgcttagtatgttttactctagtgctaaataaaaactctatcttcaaatgtttagtgggttaaattgagaaactatttcagaaaaaaattctaaggttacagcatattcaaagaaaagcattagttaccactttttaaaaagcttttttttcaaactgcaaatttcataaaaatgcaaactgtgtaaacagggcctcttatttttataacttgtgtaaaaagggaaagcaattcatatttaaagtttaagtatattaaattataatcaagagtaaagaagatgttgaagtcttaactacttgcccctctctacagtttcgcaaatgtggggattgctgaataatcagtcagactaaaaccaaaattgtgattttaagatttcaagactttccgtagttgaactggttaagaatttttgcttagttactctgaatagatgatcttactcatccagtatgggggaatgatacctcacgtcttcctctttacccacaggaatcaaaacgctgagactgagaattttagggaaaaaaaagtccactgtttagatccagaaggagagttttaatcattgtttatatcatttgagaatgaaaaaataagcttcataaatgaaattctattcacattactgtgtaataaatttccttttggatgattaggattcattgtataa
aactgtaaatctttgccattcttggagaagcaaaaggagagttatcaaaaatgtatgtcgtttcatcgttgcaaggtataataaaaactgtaattattcaatctggccctgccatatgaacatttagaaagacaaacttcttcgggagtctcagttgtaaaaccttccctcattaatatctgaaaatgttagtcttcctttaagtcatagaacttatttaaacataaaccaatttctattacaggttatgctattaaatagctgtaattattaagttattatttttataattagttgttaaatttcattttacacccactcaaatttaacaaagaatctttagcccctttaaattttagaattaaattaaatttttaaagttttacttctaaaatgagattgtgactggcaattgtttatagtgaaactttttaaattaatctttgtactcctctatcagtgcttgctaccaagagaatgtccaaaatgatttgttttaccatgggaaaattcttactattcaacaaactctcagttggccccctacagcagtctggtgttgaagtttctttgaacgaactaaatatactcattttatgtaaaggtatccaatttgattttgaaaccaaaatagaaaatgcaaaattctaaattccatgaaacatggaatttatgacaccaaaatcaatggagagtaagcagcagcaaactgagaattatccagcatatgaatataacaatgtgtttttaagtaatcaattcatttaaaaaattgaatattaatacaaagcatattaaaaacatgtaaatatta
NM_001080397 0,397,490,715,1443,1597,1774,1980, atgatccccgcagccagcagcaccccgccgggagatgccctcttccccagcgtggccccacaggacttctggaggtcccaggtcacgggctactcggggtccgtgacacgacacctcagtcaccgggccaacaacttcaaacgacaccccaagaggaggaagtgcattcgtccctccccacccccgccccccaacaccccgtgcccgcttgagctggtggacttcggggacctgcacccccagaggtccttccgggagctgcttttcaacggctgcattctctttggcatcgagttcagctacgccatggagacggcgtacgtgaccccggtgctcctgcagatgggcctgcccgaccagctctacagcctggtgtggttcatcagccccatcctcggattcctactgcagcctctgttgggtgcttggagtgaccggtgtacctcaaggtttggaaggagacgccctttcattcttgtcctggctataggggcactgctgggcctctcgctcttgctgaatggccgggacattggcatcgccctggctgacgtgaccgggaaccacaagtggggcctgctgctgaccgtgtgcggtgtggtgctgatggactttagcgccgactcggcggacaaccccagccacgcctacatgatggacgtgtgcagccccgcagaccaggaccgaggcctgaacatccacgccctcctggcaggtctcggaggaggctttggatacgtggtcggcggaatccactgggataaaacgggcttcgggagggccctggggggacagctccgagtcatttacctcttcactgcggtcaccctgagcgtcaccaccgtcctgaccctggtcagcatccctgagaggccgctgcggccgccgagtgagaagcgggcagccatgaagagccccagcctcccgctgcccccgtccccacccgtcctgccagaggaaggccctggcgacagcctcccgtcgcacacggccaccaacttctccagccccatctcgccgcccagccccctcacgcccaagtacggcagcttcatcagcagggacagctccctgacgggcatcagcgagttcgcctcatcctttggcacggccaacatagacagcgtcctcattgactgcttcacgggcggccacgacagctacctggccatccctggcagcgtccccaggccgcccatcagcgtcagcttcccccgggcccccgacggcttctaccgccaggaccgtggacttctggagggcagagagggtgccctgacctccggctgtgacggggacattctgagggtgggctccttggacacctctaagccgaggtcatcagggattctgaagagacctcagaccttggccatcccggacgcagccggaggagggggtcccgaaaccagcaggagaaggaatgtgaccttcagtcagcaggtggccaatatcctgctcaacggcgtgaagtatgagagcgagctgacgggctccagcgagcgcgcggagcagcctctgtccgtggggcgcctctgctccaccatctgcaacatgcccaaggcgctacgcaccctctgcgtcaaccacttcctggggtggctctcattcgaggggatgttgctcttctacacagacttcatgggcgaggtggtgtttcagggggaccccaaggccccgcacacatcagaggcgtatcagaagtacaacagcggcgtgaccatgggctgctggggcatgtgtatctacgccttcagtgctgccttctactcagctatcctggagaagctggaggagttcctcagcgtccgcaccctctacttcatcgcctatctcgccttcggcctggggaccgggcttgccaccctctccaggaacctctacgtggtcctgtcgctctgcataacctacgggattttattttccaccctgtgcaccttgccttactcgct
gctctgcgattactatcagagtaagaagtttgcagggtccagtgcggacggcacccggcggggcatgggcgtggacatctctctgctgagctgccagtacttcctggctcagattctggtctccctggtcctggggcccctgacctcggccgtgggcagtgccaacggggtgatgtacttctccagcctcgtgtccttcctgggctgcctgtactcctccctgtttgtcatttatgaaattcctcccagcgacgctgcagacgaggagcaccggcccctcctgctgaacgtctgacatcgcggagcctcgactccggacacgcgcctgcacctgggggtctggagcaggccgaccagtgaggaccaaagggccttgttggacagggggactggctgcctactggaatgtaaatatgtgataaaataataaatgacagcggcaaagccta
NM_001145277 0,182,283,388,470,579,757, gaaacctggtcagagagtcgcaccgcttccgtccgtcggacagaggaacggtggaagtcgccggaagttcggtgggctccaggcgtcgcgatggaggagagcgggtacgagtcggtgctctgtgtcaagcctgacgtccacgtctaccgcatccctccgcgggctaccaaccgtggctacagggctgcggagtggcagctggaccagccatcatggagtggccggctgaggatcactgcaaagggacagatggcctacatcaagctggaggacaggacgtcaggggagctctttgctcaggccccggtggatcagtttcctggcacagctgtggagagtgtgacggattccagcaggtacttcgtgatccgcatcgaagatggaaatgggcgacgggcgtttattggaattggcttcggggaccgaggtgatgcctttgacttcaatgttgcattgcaggaccatttcaagtgggtgaaacagcagtgtgaatttgcaaaacaagcccagaacccagaccaaggccctaaactggacctgggcttcaaggagggccagaccatcaagctcaacatcgcaaacatgaagaagaaggaaggagcagctgggaatccccgagtccggcctgccagcacaggagggctgagcctgcttccccctcccccaggggggaaaacctccaccctgatccctccccctggggagcagttggctgtggggggatccctcgtccagccagcagttgctcccagttcagatcaacttccagccagacccagccaggcacaggctgggtccagttctgacctgagcacggtttttcctcatgtgacttctgggaaggcgctccctcatctgggccaaaggaaggaggacgaagccctcctcagctggcctgtgtttggggcatgaatctctcctctcctccttgtctggctctgttgacaaaccgggcatgtttggcagtaaattggcaccgtgtcacactgtttcctgggattcaagtatgcaaccagaacacaggagaagaaaagctccaggatccctgtccccatctgtcctcttgatgtgagagagactctgagacttcttccatcgcaatgacctgtattaaacacaagccccccaagcaaaagaagaggttgagtttgctgccaggattcagatcagcccttcccagggtctgcaggtgtcacatgatcacagttcagcgggaggctttccgtacccacactggctgtagccacttcagtccatctgccctccagaggaggggtttcttcctgatttttagcaggtttagaggctgcagcttgagctacaatcaggagggaaattggaaggattagcagcttttaaaaatgtttaaatattttgctttgctaatgtgctgatccgcactaactcatctttgcaaaaggaactgctccctcggcgtgccccagctggggcctctgaagggattcctcactgtgggcagctgccctgagcttcaggcagcagtgtttatctctggccagttgtctggtttccatgtattctaggccaggtaggcaacacagagccaaggcgggtgctggaagccagacggaacagtgttggggcaggaaggtggatgctgttgtcatggagctgtgggagttggcactctgtctgctggtggccctctcggctcacatgttcacagtgcagctcctggcagacttgggttttctctttggtggtttctaaagtgccttatctgcaaacaacttcttttctccttcaggaactgtgaatggctagaagaaggagctcagtaaactagaagtccagggttgcttggtttactggtttataagaaatctgaaagcacctctgacattccttttattaactcacctctcagttgaaagatttcttctttgaaaggtcaagaccgtgaactgaaaaaagtgttggcctttttgcgggaccagat
ttttaagataaaataaatatttttacttctgtcattgtatgtgaaaaaaaaaaaaa
I'm stuck, thanks for your time!!
The number field on each input line ends with a comma, e.g.:
0,182,283,388,470,579,757,
So naturally, this will be split into:
['0', '182', '283', '388', '470', '579', '757', '']
The last element will always be an empty string, ''. You'd need to account for this. One way is simply by ignoring it:
C1 = [float(i) for i in C if i]
Or by cutting out the last element before casting:
C = row[1].split(",")[1:-1] # the slice will exclude first and last elements
You could also use try/except when converting to float and catch the ValueError:
C = row[1].split(",")[1:]
for item in C:
    try:
        convert = float(item)
    except ValueError:
        print("not a number")
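A small self-contained sketch of the trailing-comma effect and the filtering fix follows. The helper name parse_floats is made up for illustration, and unlike the question's code it keeps the leading 0 rather than dropping it with [1:].

```python
def parse_floats(field):
    """Split a comma-separated field and convert every non-empty piece to float."""
    return [float(piece) for piece in field.split(",") if piece]


field = "0,182,283,388,470,579,757,"
pieces = field.split(",")     # the trailing comma leaves '' as the last piece
values = parse_floats(field)  # the `if piece` filter skips that empty string
```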

Getting the maximum value from dictionary

I'm facing a problem with this. I have 10,000 rows in my dictionary, and this is one of the rows:
Example: A (8) C (4) G (48419) T (2) when printed out
I'd like to get 'G' as an answer, since it has the highest value.
I'm currently using Python 2.4 and I have no idea how to solve this, as I'm quite new to Python.
Thanks a lot for any help given :)
Here's a solution that:
- uses a regexp to scan all occurrences of an uppercase letter followed by a number in brackets
- transforms the string pairs from the regexp with a generator expression into (value, key) tuples
- returns the key from the tuple that has the highest value
I also added a main function so that the script can be used as a command-line tool that reads all lines from one file and writes the key with the highest value for each line to an output file. The program uses iterators, so it is memory efficient no matter how large the input file is.
import re
KEYVAL = re.compile(r"([A-Z])\s*\((\d+)\)")
def max_item(row):
    return max((int(v),k) for k,v in KEYVAL.findall(row))[1]
def max_item_lines(fh):
    for row in fh:
        yield "%s\n" % max_item(row)
def process_file(infilename, outfilename):
    infile = open(infilename)
    max_items = max_item_lines(infile)
    outfile = open(outfilename, "w")
    outfile.writelines(max_items)
    outfile.close()
if __name__ == '__main__':
    import sys
    infilename, outfilename = sys.argv[1:]
    process_file(infilename, outfilename)
For a single row, you can call:
>>> max_item("A (8) C (4) G (48419) T (2)")
'G'
And to process a complete file:
>>> process_file("inputfile.txt", "outputfile.txt")
If you want an actual Python list holding the key with the highest value for every row, then you can use:
>>> list(map(max_item, open("inputfile.txt")))
max(d.itervalues())
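This is faster than max(d.values()) in Python 2 because itervalues() iterates lazily instead of building an intermediate list. Note that it returns the largest value, not the key that holds it.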
Try the following:
st = "A (8) C (4) G (48419) T (2)" # your start string
a=st.split(")")
b=[x.replace("(","").strip() for x in a if x!=""]
c=[x.split(" ") for x in b]
d=[(int(x[1]),x[0]) for x in c]
max(d) # this is your result.
Use regular expressions to split the line. Then for all the matched groups, you have to convert the matched strings to numbers, get the maximum, and figure out the corresponding letter.
import re
r = re.compile(r'A \((\d+)\) C \((\d+)\) G \((\d+)\) T \((\d+)\)')
for line in my_file:
    m = r.match(line)
    if not m:
        continue # or complain about invalid line
    value, n = max((int(value), n) for (n, value) in enumerate(m.groups()))
    print "ACGT"[n], value
row = "A (8) C (4) G (48419) T (2)"
lst = row.replace("(",'').replace(")",'').split() # ['A', '8', 'C', '4', 'G', '48419', 'T', '2']
dd = dict(zip(lst[0::2],map(int,lst[1::2]))) # {'A': 8, 'C': 4, 'T': 2, 'G': 48419}
max(map(lambda k:[dd[k],k], dd))[1] # 'G'
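On a newer interpreter the same idea can be written more directly with max and a key function. This is only a sketch: it relies on dict comprehensions and the key argument of max, so it assumes Python 2.7 or 3 and would not run on the Python 2.4 mentioned in the question. The names PAIR, counts and top_key are illustrative.

```python
import re

# An uppercase letter followed by a number in parentheses, e.g. "G (48419)".
PAIR = re.compile(r"([A-Z])\s*\((\d+)\)")


def counts(row):
    """Turn 'A (8) C (4) ...' into {'A': 8, 'C': 4, ...}."""
    return {letter: int(number) for letter, number in PAIR.findall(row)}


def top_key(row):
    """Return the letter whose count is highest in the row."""
    table = counts(row)
    return max(table, key=table.get)


row = "A (8) C (4) G (48419) T (2)"
best = top_key(row)
```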
