How to cluster (or group) the data from a CSV file? - python

I have a three column data set in CSV,
A,B,10
A,C,15
A,D,21
B,A,10
B,C,20
I want to group or cluster the A,B,C,D pairs based on the third column, in increments of 10: 0-10 is one cluster, 11-20 another, and so on. Each cluster will contain pairs. For example, A,B has 10 in the third column, so that pair goes in the first cluster. I expect 10-15 clusters in total.
Here is how I opened CSV:
fileread = open('/data/dataset.csv', 'rU')
readcsv = csv.reader(fileread, delimiter=',')
L = list(readcsv)
I have created a set:
set(item[2] for item in L if item[0] == 'A' and item[1] == 'B' and int(item[2]) <= 10)
My basic question is: how do I check the third column and store each pair in the right cluster?

How about this: loop over the data and determine the groups by integer-dividing the third element by 10.
import csv

groups = {}
with open('data.txt') as f:
    for item in csv.reader(f, delimiter=','):
        n = int(item[2]) // 10
        group = "%d-%d" % (n*10, n*10+9)
        groups.setdefault(group, []).append(item[:2])
Using your data, groups ends up as this:
{'20-29': [['A', 'D'], ['B', 'C']],
 '10-19': [['A', 'B'], ['A', 'C'], ['B', 'A']]}
Dictionaries are unordered, so if you want to print them in sorted order you have to sort the keys. This is a bit tricky, since they are strings and would be sorted lexicographically. But you could do this:
for k in sorted(groups, key=lambda k: int(k.split('-')[0])):
    print k, groups[k]
(or use just the smaller number as key in the first place)
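A minimal sketch of that variant, keeping the integer decade as the key so sorting needs no string parsing (reusing the list L read earlier; names are illustrative):
groups = {}
for item in L:
    n = int(item[2]) // 10              # decade index: 0 for 0-9, 1 for 10-19, ...
    groups.setdefault(n, []).append(item[:2])

for n in sorted(groups):
    print "%d-%d" % (n * 10, n * 10 + 9), groups[n]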

Related

How to make a fill down like process in list of lists?

I have the following list a with None values for which I want to make a "fill down".
a = [
['A','B','C','D'],
[None,None,2,None],
[None,1,None,None],
[None,None,8,None],
['W','R',5,'Q'],
['H','S','X','V'],
[None,None,None,7]
]
The expected output would be like this:
b = [
['A','B','C','D'],
['A','B',2,'D'],
['A',1,'C','D'],
['A','B',8,'D'],
['W','R',5,'Q'],
['H','S','X','V'],
['H','S','X',7]
]
I was able to write the following code and it seems to work, but I was wondering if there is a built-in method or a more direct way to do it. I know pandas offers something like this, but it requires converting to a DataFrame, and I want to keep working with lists: ideally updating a in place, or otherwise getting the output in a list b. Thanks
b = []
for z in a:
    if None in z:
        b.append([temp[i] if value is None else value for i, value in enumerate(z)])
    else:
        b.append(z)
        temp = z
You could use a list comprehension for this, though I'm not sure it adds much over the solution you already have.
b = [a[0]]
for a_row in a[1:]:
    b.append([i if i else j for i, j in zip(a_row, b[-1])])
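Running this on the sample a shows the caveat discussed next: every filled value is carried down, numbers included, so row three becomes ['A', 1, 2, 'D'] rather than ['A', 1, 'C', 'D']:
>>> b
[['A', 'B', 'C', 'D'],
 ['A', 'B', 2, 'D'],
 ['A', 1, 2, 'D'],
 ['A', 1, 8, 'D'],
 ['W', 'R', 5, 'Q'],
 ['H', 'S', 'X', 'V'],
 ['H', 'S', 'X', 7]]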
I'm not sure if it's by design, but in your example a number is never carried down to the next row. If you wanted to ensure that only letters are carried down, this could be added by keeping track of the letters last seen in each position. Assuming that the first row of a is always letters, then (note that the := assignment expression used below requires Python 3.8+):
last_seen_letters = a[0]
b = []
for a_row in a:
    b.append(b_row := [i if i else j for i, j in zip(a_row, last_seen_letters)])
    last_seen_letters = [i if isinstance(i, str) else j for i, j in zip(b_row, last_seen_letters)]
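With the sample data this reproduces the expected b from the question exactly:
>>> b == [
...     ['A','B','C','D'],
...     ['A','B',2,'D'],
...     ['A',1,'C','D'],
...     ['A','B',8,'D'],
...     ['W','R',5,'Q'],
...     ['H','S','X','V'],
...     ['H','S','X',7],
... ]
True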
First, consider the process of "filling down" into a single row. We have two rows as input: the row above and the row below; we want to consider elements from the two lists pairwise. For each pair, our output is determined by simple logic - use the first value if the second value is None, and the second value otherwise:
def fill_down_new_cell(above, current):
    return above if current is None else current
which we then apply to each pair in the pairwise iteration:
def fill_down_new_row(above, current):
    return [fill_down_new_cell(a, c) for a, c in zip(above, current)]
Next we need to consider overlapping pairs of rows from our original list. Each time, we replace the contents of the "current" row with the fill_down_new_row result, by slice-assigning them to the entire list. In this way, we can elegantly update the row list in place, which allows changes to propagate to the next iteration. So:
def fill_down_inplace(rows):
    for above, current in zip(rows, rows[1:]):
        current[:] = fill_down_new_row(above, current)
Let's test it:
>>> a = [
... ['A','B','C','D'],
... [None,None,2,None],
... [None,1,None,None],
... [None,None,8,None],
... ['W','R',5,'Q'],
... ['H','S','X','V'],
... [None,None,None,7]
... ]
>>> fill_down_inplace(a)
>>> import pprint
>>> pprint.pprint(a)
[['A', 'B', 'C', 'D'],
 ['A', 'B', 2, 'D'],
 ['A', 1, 2, 'D'],
 ['A', 1, 8, 'D'],
 ['W', 'R', 5, 'Q'],
 ['H', 'S', 'X', 'V'],
 ['H', 'S', 'X', 7]]

How to print the different characters of repeated words in Python?

My data in a text file PDBs.txt looks like this:
150L_A
150L_B
150L_C
150L_D
16GS_A
16GS_B
17GS_A
17GS_B
The end result needed is:
"First chain of 150L is A and second is B and third is C and forth is D"
"First chain of 16GS is A and second is B"
etc.
in chains.txt output file.
Thank you for your help.
You could achieve this by first reading the file and extracting the PDB and chain labels to a dictionary mapping the PDB ID to a list of chain labels, here called results. Then, you can write the "chains.txt" file line by line by iterating through these results and constructing the output lines you indicated:
from collections import defaultdict

results = defaultdict(list)
with open("PDBs.txt") as fh:
    for line in fh:
        line = line.strip()
        if line:
            pdb, chain = line.split("_")
            results[pdb].append(chain)

# Note that you would need to extend this if more than 4 chains are possible
prefix = {2: "second", 3: "third", 4: "fourth"}
with open("chains.txt", "w") as fh:
    for pdb, chains in results.items():
        fh.write(f"First chain of {pdb} is {chains[0]}")
        for ii, chain in enumerate(chains[1:], start=1):
            fh.write(f" and {prefix[ii + 1]} is {chain}")
        fh.write("\n")
Content of "chains.txt":
First chain of 150L is A and second is B and third is C and fourth is D
First chain of 16GS is A and second is B
First chain of 17GS is A and second is B
First chain of 18GS is A and second is B
First chain of 19GS is A and second is B
You can achieve that simply with split operations and a loop.
First split your data on whitespace to get the separate chunks as a list. Each chunk consists of a key and a value, separated by an underscore. You can iterate over all chunks and split each of them into its key and value, then build a dictionary mapping each key to a list of all its values.
data = "150L_A 150L_B 150L_C 150L_D 16GS_A 16GS_B 17GS_A 17GS_B 18GS_A 18GS_B 19GS_A 19GS_B"
chunks = data.split()
result = {}
for chunk in chunks:
(key, value) = chunk.split('_')
if not key in result:
result[key] = []
result[key].append(value)
print(result)
# {'150L': ['A', 'B', 'C', 'D'], '16GS': ['A', 'B'], '17GS': ['A', 'B'], '18GS': ['A', 'B'], '19GS': ['A', 'B']}
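To go from that dictionary to the sentences requested in the question, a minimal sketch (assuming at most four chains per PDB, as in the sample):
ordinals = {2: "second", 3: "third", 4: "fourth"}  # extend if more chains are possible
with open("chains.txt", "w") as fh:
    for key, values in result.items():
        line = f"First chain of {key} is {values[0]}"
        for i, value in enumerate(values[1:], start=2):
            line += f" and {ordinals[i]} is {value}"
        fh.write(line + "\n")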

Cycle through parallel lists deleting matches, until no more matches exist

I have 3 parallel lists representing 3-tuples (date, description, amount), and 3 new parallel lists that I need to merge in without creating duplicate entries. The lists do have overlapping entries, but the duplicates are scattered rather than grouped together (it is not the case that indices 0 through x are all duplicates and everything after x is new).
The problem I'm having is iterating the correct number of times to ensure all of the duplicates are caught. Instead, my code moves on with duplicates remaining.
for x in dates:
    MoveNext = 'false'
    while MoveNext == 'false':
        Reiterate = 'false'
        for a, b in enumerate(descriptions):
            if Reiterate == 'true':
                break
            if b in edescriptions:
                eindex = [c for c, d in enumerate(edescriptions) if d == b]
                for e, f in enumerate(eindex):
                    if Reiterate == 'true':
                        break
                    if edates[f] == dates[a]:
                        if eamounts[f] == amounts[a]:
                            del dates[a]
                            del edates[f]
                            del descriptions[a]
                            del edescriptions[f]
                            del amounts[a]
                            del eamounts[f]
                            Reiterate = 'true'
                            break
                        else:
                            MoveNext = 'true'
                    else:
                        MoveNext = 'true'
            else:
                MoveNext = 'true'
I don't know if it's a coincidence, but currently exactly half of the new items get deleted and the other half remains. In reality, far fewer should remain. That makes me think the for x in dates: loop is not iterating the correct number of times.
I suggest a different approach: instead of trying to remove items from a list (or worse, several parallel lists), run through the input and yield only the data that passes your test, in this case data you haven't seen before. This is much easier with a single stream of input.
Your lists of data are crying out to be made into objects, since each piece (like the date) is meaningless without the other two... at least for your current purpose. Below, I start by combining each triplet into an instance of Record, a collections.namedtuple. They're great for this kind of use-once-and-throw-away work.
In the program below, build_records creates Record objects from your three input lists. dedup_records merges multiple streams of Record objects, using unique to filter out the duplicates. Keeping each function small (most of the main function is test data) makes each step easy to test.
#!/usr/bin/env python3

import collections
import itertools

Record = collections.namedtuple('Record', ['date', 'description', 'amount'])

def unique(records):
    '''
    Yields only the unique Records in the given iterable of Records.
    '''
    seen = set()
    for record in records:
        if record not in seen:
            seen.add(record)
            yield record
    return

def dedup_records(*record_iterables):
    '''
    Yields unique Records from multiple iterables of Records, preserving the
    order of first appearance.
    '''
    all_records = itertools.chain(*record_iterables)
    yield from unique(all_records)
    return

def build_records(dates, descriptions, amounts):
    '''
    Yields Record objects built from each date-description-amount triplet.
    '''
    for args in zip(dates, descriptions, amounts):
        yield Record(*args)
    return

def main():
    # Sample data
    dates_old = [
        '2000-01-01',
        '2001-01-01',
        '2002-01-01',
        '2003-01-01',
        '2000-01-01',
        '2001-01-01',
        '2002-01-01',
        '2003-01-01',
    ]
    dates_new = [
        '2000-01-01',
        '2001-01-01',
        '2002-01-01',
        '2003-01-01',
        '2003-01-01',
        '2002-01-01',
        '2001-01-01',
        '2000-01-01',
    ]
    descriptions_old = ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd']
    descriptions_new = ['b', 'b', 'c', 'a', 'a', 'c', 'd', 'd']
    amounts_old = [0, 1, 0, 1, 0, 1, 0, 1]
    amounts_new = [0, 0, 0, 0, 1, 1, 1, 1]
    old = [dates_old, descriptions_old, amounts_old]
    new = [dates_new, descriptions_new, amounts_new]
    for record in dedup_records(build_records(*old), build_records(*new)):
        print(record)
    return

if '__main__' == __name__:
    main()
This reduces the 16 input Records to 11:
Record(date='2000-01-01', description='a', amount=0)
Record(date='2001-01-01', description='b', amount=1)
Record(date='2002-01-01', description='c', amount=0)
Record(date='2003-01-01', description='d', amount=1)
Record(date='2000-01-01', description='b', amount=0)
Record(date='2001-01-01', description='b', amount=0)
Record(date='2003-01-01', description='a', amount=0)
Record(date='2003-01-01', description='a', amount=1)
Record(date='2002-01-01', description='c', amount=1)
Record(date='2001-01-01', description='d', amount=1)
Record(date='2000-01-01', description='d', amount=1)
Note that the yield from ... syntax requires Python 3.3 or greater.
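If you need the deduplicated data back as three parallel lists, the Record fields can be unzipped again (a small sketch; the variable names are illustrative):
records = list(dedup_records(build_records(*old), build_records(*new)))
dates, descriptions, amounts = (list(column) for column in zip(*records))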

How to combine two rows in Python list

Suppose I have a 2D list,
a= [['a','b','c',1],
['a','b','d',2],
['a','e','d',3],
['a','e','c',4]]
I want to obtain a list in which rows whose first two elements are identical are combined: the fourth elements are summed and the third element is dropped, like the following:
b = [['a','b',3],
['a','e',7]]
What is the most efficient way to do this?
If your list is already sorted, then you can use itertools.groupby. Once you group by the first two elements, you can use a generator expression to sum the 4th element and create your new lists.
>>> from itertools import groupby
>>> a = [['a','b','c',1],
...      ['a','b','d',2],
...      ['a','e','d',3],
...      ['a','e','c',4]]
>>> [g[0] + [sum(i[3] for i in g[1])] for g in groupby(a, key=lambda i: i[:2])]
[['a', 'b', 3], ['a', 'e', 7]]
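If the list might not already be sorted by those two columns, sorting it first with the same key keeps groupby correct; a small sketch:
>>> key = lambda i: i[:2]
>>> [k + [sum(i[3] for i in g)] for k, g in groupby(sorted(a, key=key), key=key)]
[['a', 'b', 3], ['a', 'e', 7]]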
Using pandas's groupby:
import pandas as pd
df = pd.DataFrame(a)
df.groupby([0, 1]).sum().reset_index().values.tolist()
Output:
[['a', 'b', 3L], ['a', 'e', 7L]]
You can use pandas groupby methods to achieve that goal.
import pandas as pd
a= [['a','b','c',1],
['a','b','d',2],
['a','e','d',3],
['a','e','c',4]]
df = pd.DataFrame(a)
df_sum = df.groupby([0,1])[3].sum().reset_index()
array_return = df_sum.values
list_return = array_return.tolist()
print(list_return)
list_return is the result you want.
If you're interested, here is an implementation in raw Python. I've only tested it on the dataset you provided.
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]

b_dict = {}
for row in a:
    key = (row[0], row[1])
    b_dict[key] = b_dict[key] + row[3] if key in b_dict else row[3]

b = [[key[0], key[1], value] for key, value in b_dict.iteritems()]
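A slightly tidier variant of the same idea, sketched with collections.defaultdict (written for Python 3, where iteritems() becomes items()):
from collections import defaultdict

b_dict = defaultdict(int)               # missing keys start at 0
for row in a:
    b_dict[(row[0], row[1])] += row[3]

b = [[k0, k1, total] for (k0, k1), total in b_dict.items()]
# [['a', 'b', 3], ['a', 'e', 7]]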

Getting the maximum value from dictionary

I'm facing a problem with this. I have 10,000 rows in my dictionary, and this is one of the rows when printed out:
A (8) C (4) G (48419) T (2)
I'd like to get 'G' as an answer, since it has the highest value.
I'm currently using Python 2.4 and I have no idea how to solve this, as I'm quite new to Python.
Thanks a lot for any help given :)
Here's a solution that:
- uses a regexp to scan all occurrences of an uppercase letter followed by a number in brackets
- transforms the string pairs from the regexp with a generator expression into (value, key) tuples
- returns the key from the tuple that has the highest value
I also added a main function so that the script can be used as a command line tool to read all lines from one file and write the key with the highest value for each line to an output file. The program uses iterators, so it is memory efficient no matter how large the input file is.
import re

KEYVAL = re.compile(r"([A-Z])\s*\((\d+)\)")

def max_item(row):
    return max((int(v), k) for k, v in KEYVAL.findall(row))[1]

def max_item_lines(fh):
    for row in fh:
        yield "%s\n" % max_item(row)

def process_file(infilename, outfilename):
    infile = open(infilename)
    max_items = max_item_lines(infile)
    outfile = open(outfilename, "w")
    outfile.writelines(max_items)
    outfile.close()

if __name__ == '__main__':
    import sys
    infilename, outfilename = sys.argv[1:]
    process_file(infilename, outfilename)
For a single row, you can call:
>>> max_item("A (8) C (4) G (48419) T (2)")
'G'
And to process a complete file:
>>> process_file("inputfile.txt", "outputfile.txt")
If you want an actual Python list of every row's maximum value, then you can use:
>>> map(max_item, open("inputfile.txt"))
max(d.itervalues())
This will be much faster than d.values(), since it iterates without building an intermediate list. Note, however, that it returns the highest value (48419 here), not its key. To get the key on Python 2.4, pair each value with its key first:
max((v, k) for k, v in d.iteritems())[1]  # 'G'
Try the following:
st = "A (8) C (4) G (48419) T (2)" # your start string
a=st.split(")")
b=[x.replace("(","").strip() for x in a if x!=""]
c=[x.split(" ") for x in b]
d=[(int(x[1]),x[0]) for x in c]
max(d) # this is your result.
Use regular expressions to split the line. Then for all the matched groups, you have to convert the matched strings to numbers, get the maximum, and figure out the corresponding letter.
import re

r = re.compile(r'A \((\d+)\) C \((\d+)\) G \((\d+)\) T \((\d+)\)')
for line in my_file:
    m = r.match(line)
    if not m:
        continue  # or complain about invalid line
    value, n = max((int(value), n) for (n, value) in enumerate(m.groups()))
    print "ACGT"[n], value
row = "A (8) C (4) G (48419) T (2)"
lst = row.replace("(",'').replace(")",'').split() # ['A', '8', 'C', '4', 'G', '48419', 'T', '2']
dd = dict(zip(lst[0::2],map(int,lst[1::2]))) # {'A': 8, 'C': 4, 'T': 2, 'G': 48419}
max(map(lambda k:[dd[k],k], dd))[1] # 'G'
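As a side note, on Python 2.5 and later (not the asker's 2.4, where max() has no key argument) the final lookup can be written directly:
print max(dd, key=dd.get)  # 'G': the key with the largest value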
