How to print different character of repeated words in python?

My data in a text file PDBs.txt looks like this:
150L_A
150L_B
150L_C
150L_D
16GS_A
16GS_B
17GS_A
17GS_B
The end result needed is:
"First chain of 150L is A and second is B and third is C and forth is D"
"First chain of 16GS is A and second is B"
etc.
in chains.txt output file.
Thank you for your help.

You could achieve this by first reading the file and extracting the PDB and chain labels to a dictionary mapping the PDB ID to a list of chain labels, here called results. Then, you can write the "chains.txt" file line by line by iterating through these results and constructing the output lines you indicated:
from collections import defaultdict

results = defaultdict(list)
with open("PDBs.txt") as fh:
    for line in fh:
        line = line.strip()
        if line:
            pdb, chain = line.split("_")
            results[pdb].append(chain)

# Note that you would need to extend this if more than 4 chains are possible
prefix = {2: "second", 3: "third", 4: "fourth"}

with open("chains.txt", "w") as fh:
    for pdb, chains in results.items():
        fh.write(f"First chain of {pdb} is {chains[0]}")
        for ii, chain in enumerate(chains[1:], start=1):
            fh.write(f" and {prefix[ii + 1]} is {chain}")
        fh.write("\n")
Content of "chains.txt":
First chain of 150L is A and second is B and third is C and fourth is D
First chain of 16GS is A and second is B
First chain of 17GS is A and second is B
First chain of 18GS is A and second is B
First chain of 19GS is A and second is B
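The prefix dictionary above only knows the words for the second through fourth chains. If more than four chains per PDB are possible, one way to extend it (a sketch, not part of the original answer) is to fall back to a numeric ordinal when no word form is defined:

# Sketch: same writing loop, but with a fallback ordinal so any number of
# chains is handled. The "5th"/"6th" form is rough (it does not special-case
# 21st, 22nd, ...), which is usually fine for protein chains.
prefix = {2: "second", 3: "third", 4: "fourth"}

def ordinal(n):
    return prefix.get(n, "%dth" % n)

with open("chains.txt", "w") as fh:
    for pdb, chains in results.items():
        fh.write(f"First chain of {pdb} is {chains[0]}")
        for ii, chain in enumerate(chains[1:], start=2):
            fh.write(f" and {ordinal(ii)} is {chain}")
        fh.write("\n")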

You can achieve that with simple split operations and a loop.
First split your data on whitespace to get the separate chunks as a list. Each chunk consists of a key and a value separated by an underscore, so you can iterate over all chunks and split each of them into key and value. Then simply build a Python dictionary with a list of all values per key.
data = "150L_A 150L_B 150L_C 150L_D 16GS_A 16GS_B 17GS_A 17GS_B 18GS_A 18GS_B 19GS_A 19GS_B"
chunks = data.split()
result = {}
for chunk in chunks:
(key, value) = chunk.split('_')
if not key in result:
result[key] = []
result[key].append(value)
print(result)
# {'150L': ['A', 'B', 'C', 'D'], '16GS': ['A', 'B'], '17GS': ['A', 'B'], '18GS': ['A', 'B'], '19GS': ['A', 'B']}
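The question also asks for the sentence output in chains.txt; the dictionary produced here can be fed into the same kind of formatting loop as in the first answer. A minimal sketch (assuming, as above, at most four chains per entry):

# Sketch: turn the result dictionary into the requested sentences.
ordinals = {2: "second", 3: "third", 4: "fourth"}
with open("chains.txt", "w") as out:
    for pdb, chains in result.items():
        parts = ["First chain of %s is %s" % (pdb, chains[0])]
        for pos, chain in enumerate(chains[1:], start=2):
            parts.append("%s is %s" % (ordinals[pos], chain))
        out.write(" and ".join(parts) + "\n")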

Related

How to make a fill down like process in list of lists?

I have the following list a with None values for which I want to make a "fill down".
a = [
['A','B','C','D'],
[None,None,2,None],
[None,1,None,None],
[None,None,8,None],
['W','R',5,'Q'],
['H','S','X','V'],
[None,None,None,7]
]
The expected output would be like this:
b = [
['A','B','C','D'],
['A','B',2,'D'],
['A',1,'C','D'],
['A','B',8,'D'],
['W','R',5,'Q'],
['H','S','X','V'],
['H','S','X',7]
]
I was able to write the following code and it seems to work, but I was wondering if there is a built-in method or a more direct
way to do it. I know there is something like this in pandas, but that requires converting to a DataFrame, and I want
to keep working with lists: if possible, update the list a in place, and if that is not possible, get the output in a list b. Thanks
b = []
for z in a:
    if None in z:
        b.append([temp[i] if value == None else value for i, value in enumerate(z)])
    else:
        b.append(z)
        temp = z
You could use a list comprehension for this, but I'm not sure it adds much over the solution you already have.
b = [a[0]]
for a_row in a[1:]:
    b.append([i if i else j for i, j in zip(a_row, b[-1])])
I'm not sure if it's by design, but in your example a number is never carried down to the next row. If you wanted to ensure that only letters are carried down, this could be added by keeping track of the letters last seen in each position. Assuming that the first row of a is always letters, then:
last_seen_letters = a[0]
b = []
for a_row in a:
    b.append(b_row := [i if i else j for i, j in zip(a_row, last_seen_letters)])
    last_seen_letters = [i if isinstance(i, str) else j for i, j in zip(b_row, last_seen_letters)]
First, consider the process of "filling down" into a single row. We have two rows as input: the row above and the row below; we want to consider elements from the two lists pairwise. For each pair, our output is determined by simple logic - use the first value if the second value is None, and the second value otherwise:
def fill_down_new_cell(above, current):
    return above if current is None else current
which we then apply to each pair in the pairwise iteration:
def fill_down_new_row(above, current):
    return [fill_down_new_cell(a, c) for a, c in zip(above, current)]
Next we need to consider overlapping pairs of rows from our original list. Each time, we replace the contents of the "current" row with the fill_down_new_row result, by slice-assigning them to the entire list. In this way, we can elegantly update the row list in place, which allows changes to propagate to the next iteration. So:
def fill_down_inplace(rows):
    for above, current in zip(rows, rows[1:]):
        current[:] = fill_down_new_row(above, current)
Let's test it:
>>> a = [
... ['A','B','C','D'],
... [None,None,2,None],
... [None,1,None,None],
... [None,None,8,None],
... ['W','R',5,'Q'],
... ['H','S','X','V'],
... [None,None,None,7]
... ]
>>> fill_down_inplace(a)
>>> import pprint
>>> pprint.pprint(a)
[['A', 'B', 'C', 'D'],
['A', 'B', 2, 'D'],
['A', 1, 2, 'D'],
['A', 1, 8, 'D'],
['W', 'R', 5, 'Q'],
['H', 'S', 'X', 'V'],
['H', 'S', 'X', 7]]

Converting a list to json in python

Here is the code. I have a list which I want to convert to JSON with dynamic keys.
>>> print (list) #list
['a', 'b', 'c', 'd']
>>> outfile = open('c:\\users\\fawads\desktop\csd\\Test44.json','w')#writing data to file
>>> for entry in list:
...     data={'key'+str(i):entry}
...     i+=1
...     json.dump(data,outfile)
...
>>> outfile.close()
The result is as follows:
{"key0": "a"}{"key1": "b"}{"key2": "c"}{"key3": "d"}
which is not valid JSON.
Enumerate your list (which you should not call list, by the way; it shadows the built-in list):
>>> import json
>>> lst = ['a', 'b', 'c', 'd']
>>> jso = {'key{}'.format(k):v for k, v in enumerate(lst)}
>>> json.dumps(jso)
'{"key3": "d", "key2": "c", "key1": "b", "key0": "a"}'
data = []
for entry in lst:
    data.append({'key'+str(lst.index(entry)): entry})
json.dump(data, outfile)
As a minimal change which I originally posted in a comment:
outfile = open('c:\\users\\fawads\desktop\csd\\Test44.json','w') # writing data to file
all_data = []  # keep a list of all the entries
i = 0
for entry in list:
    data = {'key'+str(i): entry}
    i += 1
    all_data.append(data)  # add the data to the list
json.dump(all_data, outfile)  # write the list to the file
outfile.close()
Calling json.dump on the same file multiple times is very rarely useful, as it creates multiple segments of JSON data that need to be separated before they can be parsed; it makes much more sense to call it only once, when you are done constructing the data.
I'd also like to suggest you use enumerate to handle the i variable, as well as a with statement to deal with the file IO:
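To illustrate the point (a small sketch with a made-up filename), two back-to-back dumps into one file produce concatenated objects that json.load cannot read back as a single document:

import json

# Sketch: the file ends up containing {"a": 1}{"b": 2}, i.e. two JSON
# documents glued together rather than one.
with open("broken.json", "w") as fh:
    json.dump({"a": 1}, fh)
    json.dump({"b": 2}, fh)

with open("broken.json") as fh:
    json.load(fh)  # fails with an "Extra data" error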
all_data = []  # keep a list of all the entries
for i, entry in enumerate(list):
    data = {'key'+str(i): entry}
    all_data.append(data)

with open('c:\\users\\fawads\desktop\csd\\Test44.json','w') as outfile:
    json.dump(all_data, outfile)
# file is automatically closed at the end of the with block (even if there is an error)
The loop could be shortened even further with a list comprehension:
all_data = [{'key'+str(i): entry}
            for i, entry in enumerate(list)]
Which (if you really want) could be put directly into the json.dump:
with open('c:\\users\\fawads\desktop\csd\\Test44.json','w') as outfile:
    json.dump([{'key'+str(i): entry}
               for i, entry in enumerate(list)],
              outfile)
although then you start to lose readability so I don't recommend going that far.
Here is what you need to do:
mydict = {}
i = 0
for entry in list:
    dict_key = "key" + str(i)
    mydict[dict_key] = entry
    i = i + 1
json.dump(mydict, outfile)
Currently you are creating and dumping a new dict in every iteration of the loop, hence the result is not valid JSON.

How to cluster (or group) the data from a CSV file?

I have a three column data set in CSV,
A,B,10
A,C,15
A,D,21
B,A,10
B,C,20
I want to group or cluster the A,B,C,D pairs based on the third column. The condition is increments of 10: 0-10 is one cluster, 11-20 another cluster, and so on. Each cluster will contain pairs of A,B,C,D. Basically, if the third column is between 0 and 10, a pair goes into the first cluster; A,B has 10 in the third column, so it goes in the first cluster. I expect there to be 10-15 clusters.
Here is how I opened CSV:
fileread = open('/data/dataset.csv', 'rU')
readcsv = csv.reader(fileread, delimiter=',')
L = list(readcsv)
I have created a set:
set(item[2] for item in L if (item[0]=='A' and item[1] == 'B' and item[2] <= 10))
My basic question here is that how to check the third column and store the pairs in a cluster?
How about this: loop over the data and determine the groups by integer-dividing the third element by 10.
import csv

with open('data.txt') as f:
    groups = {}
    for item in list(csv.reader(f, delimiter=',')):
        n = int(item[2]) // 10
        group = "%d-%d" % (n*10, n*10+9)
        groups.setdefault(group, []).append(item[:2])
Using your data, groups ends up as this:
{'20-29': [['A', 'D'], ['B', 'C']],
'10-19': [['A', 'B'], ['A', 'C'], ['B', 'A']]}
Dictionaries are unordered, so if you want to print them in sorted order you have to sort the keys. This is a bit tricky, since they are strings and would be sorted lexicographically. But you could do this:
for k in sorted(groups, key=lambda k: int(k.split('-')[0])):
    print k, groups[k]
(or use just the smaller number as key in the first place)
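As a sketch of that last suggestion, keying the dictionary on the integer bucket index keeps the sort trivial, and the "10-19" label can be built only when printing:

import csv

# Sketch: use the integer bucket index n as the dictionary key.
with open('data.txt') as f:
    groups = {}
    for item in csv.reader(f, delimiter=','):
        n = int(item[2]) // 10
        groups.setdefault(n, []).append(item[:2])

for n in sorted(groups):  # plain numeric sort, no lambda needed
    print("%d-%d %s" % (n * 10, n * 10 + 9, groups[n]))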

Create a dictionary from text file

Alright, well, I am trying to create a dictionary from a text file, so that the key is a single lowercase character and each value is a list of the words from the file that start with that letter.
The text file contains one lowercase word per line, e.g.:
airport
bathroom
boss
bottle
elephant
Output:
words = {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e':['elephant']}
Haven't got a lot done really, just confused about how I would get the first character of each line and set it as the key and append the values. Would really appreciate it if someone can help me get started.
words = {}
for line in infile:
    line = line.strip() # not sure if this line is correct
So let's examine your example:
words = {}
for line in infile:
    line = line.strip()
This looks good for a beginning. Now you want to do something with the line. Probably you'll need the first character, which you can access through line[0]:
first = line[0]
Then you want to check whether the letter is already in the dict. If not, you can add a new, empty list:
if first not in words:
    words[first] = []
Then you can append the word to that list:
words[first].append(line)
And you're done!
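Putting those pieces together (a sketch; it assumes the file has already been opened as infile, as in your snippet):

words = {}
for line in infile:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    first = line[0]
    if first not in words:
        words[first] = []
    words[first].append(line)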
If the lines are already sorted like in your example file, you can also make use of itertools.groupby, which is a bit more sophisticated:
from itertools import groupby
from operator import itemgetter
with open('infile.txt', 'r') as f:
    words = {k: map(str.strip, g) for k, g in groupby(f, key=itemgetter(0))}
You can also sort the lines first, which makes this method generally applicable:
groupby(sorted(f), ...)
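Spelled out, that variant might look like this (a sketch; list(...) is used so the values are real lists rather than map objects on Python 3):

from itertools import groupby
from operator import itemgetter

with open('infile.txt') as f:
    lines = sorted(line.strip() for line in f if line.strip())
    words = {k: list(g) for k, g in groupby(lines, key=itemgetter(0))}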
defaultdict from the collections module is a good choice for this kind of task:
>>> import collections
>>> words = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...     lines = [l.strip() for l in f if l.strip()]
...
>>> lines
['airport', 'bathroom', 'boss', 'bottle', 'elephant']
>>> for word in lines:
...     words[word[0]].append(word)
...
>>> print words
defaultdict(<type 'list'>, {'a': ['airport'], 'b': ['bathroom', 'boss', 'bottle'], 'e': ['elephant']})

Getting the maximum value from dictionary

I'm facing a problem with this. I have 10,000 rows in my dictionary, and this is one of the rows when printed out:
A (8) C (4) G (48419) T (2)
I'd like to get 'G' as the answer, since it has the highest value.
I'm currently using Python 2.4 and I have no idea how to solve this, as I'm quite new to Python.
Thanks a lot for any help given :)
Here's a solution that
uses a regexp to scan all occurrences of an uppercase letter followed by a number in brackets
transforms the string pairs from the regexp with a generator expression into (value,key) tuples
returns the key from the tuple that has the highest value
I also added a main function so that the script can be used as a command-line tool to read all lines from one file and then write the key with the highest value for each line to an output file. The program uses iterators, so it is memory efficient no matter how large the input file is.
import re

KEYVAL = re.compile(r"([A-Z])\s*\((\d+)\)")

def max_item(row):
    return max((int(v), k) for k, v in KEYVAL.findall(row))[1]

def max_item_lines(fh):
    for row in fh:
        yield "%s\n" % max_item(row)

def process_file(infilename, outfilename):
    infile = open(infilename)
    max_items = max_item_lines(infile)
    outfile = open(outfilename, "w")
    outfile.writelines(max_items)
    outfile.close()

if __name__ == '__main__':
    import sys
    infilename, outfilename = sys.argv[1:]
    process_file(infilename, outfilename)
For a single row, you can call:
>>> max_item("A (8) C (4) G (48419) T (2)")
'G'
And to process a complete file:
>>> process_file("inputfile.txt", "outputfile.txt")
If you want an actual Python list of every row's maximum value, then you can use:
>>> map(max_item, open("inputfile.txt"))
max(d.itervalues())
This will be faster than max(d.values()), as itervalues() returns an iterator instead of building a list first.
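Note that this gives the highest count itself, not the letter. If you want the corresponding key on Python 2.4, one option (a sketch in the same Python 2 style) is to compare (value, key) pairs:

d = {'A': 8, 'C': 4, 'G': 48419, 'T': 2}  # assumed shape of one parsed row
best_value, best_key = max((v, k) for (k, v) in d.iteritems())
print best_key  # 'G'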
Try the following:
st = "A (8) C (4) G (48419) T (2)" # your start string
a=st.split(")")
b=[x.replace("(","").strip() for x in a if x!=""]
c=[x.split(" ") for x in b]
d=[(int(x[1]),x[0]) for x in c]
max(d) # this is your result.
Use regular expressions to split the line. Then for all the matched groups, you have to convert the matched strings to numbers, get the maximum, and figure out the corresponding letter.
import re

r = re.compile(r'A \((\d+)\) C \((\d+)\) G \((\d+)\) T \((\d+)\)')
for line in my_file:
    m = r.match(line)
    if not m:
        continue  # or complain about invalid line
    value, n = max((int(value), n) for (n, value) in enumerate(m.groups()))
    print "ACGT"[n], value
row = "A (8) C (4) G (48419) T (2)"
lst = row.replace("(",'').replace(")",'').split() # ['A', '8', 'C', '4', 'G', '48419', 'T', '2']
dd = dict(zip(lst[0::2],map(int,lst[1::2]))) # {'A': 8, 'C': 4, 'T': 2, 'G': 48419}
max(map(lambda k:[dd[k],k], dd))[1] # 'G'
