Description of the problem:
I have an external *.xls file that I have converted to a *.csv file containing blocks of data such as:
"Legend number one";;;;Number of items;6
X;-358.6806792;-358.6716338;;;
Y;0.8767189;0.8966855;Avg;;50.1206378
Z;-0.7694626;-0.7520983;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
;;;;;
There are many, many blocks.
Each block may contain some additional lines of data:
"Legend number six";;;;Number of items;19
X;-358.6806792;-358.6716338;;;
Y;0.8767189;0.8966855;Avg;;50.1206378
Z;-0.7654644;-0.75283;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
A;0;1;Value;;0
B;1;0;;;
;;;;;
The structure is such that an empty line separates each block, which is the ';;;;;' line in my samples.
The first line after this begins with a unique identifier of the block.
Each line contains 6 elements, such as key1;elem1;elem2;key2;elem3;elem4, which would be nice to represent as two 3-element vectors, key1;elem1;elem2 and key2;elem3;elem4, on two separate lines. Example for the second sample:
"Legend number six";;
;;Number of items;19
X;-358.6806792;-358.6716338;
;;
Y;0.8767189;0.8966855;
Avg;;50.1206378
Z;-0.7654644;-0.75283;
Std;;-0.0010354
D;8.0153902;8;
Err;;1.010385
A;0;1;
Value;;0
B;1;0;
;;
;;;;;
Some are empty but I do not want to discard them for the moment.
But I would like to end up with a DataFrame containing columnwise elements for each block of data.
The cleanest "pre-solution" I have so far:
With this Python code I ended up with a more organized list of dictionaries:
import os, sys, re, glob
import pandas as pd

workingDir = '.'  # adjust to the folder containing the file
csvFile = os.path.join(workingDir, 'file.csv')
h = 0  # number of lines to skip at the head
s = 2  # number of values per key
s += 1
str1 = 'Number of items'
# Read the file into a global list, storing each line in a sublist:
A = [line.split(';') for line in open(csvFile).read().split('\n')]
# Split each 6-element sublist into a new sublist of two
# elements, each holding 3 values:
B = [(';'.join(el[:s])+'\n'+';'.join(el[s:])).split('\n') for el in A]
# Init empty structures:
names = []  # to store each block's unique identifier (the name in the legend)
L = []      # future list of dictionaries
for el in B:
    for idx, elj in enumerate(el):
        vi = elj.split(';')[1:]
        # Grab the name only when the 2nd element of the
        # first line contains the string "Number of items",
        # which is constant all over the file:
        if len(vi) > 1 and vi[0] == str1:
            name = el[idx-1].split(';')[0]
            names.append(name)
            #print(name)
# Loop again over B to append to a new list one dictionary per
# 3-element vector, because each 3-element vector is structured
# like key;elem1;elem2:
for el in B:
    for elj in el:
        k = elj.split(';')[0]
        v = elj.split(';')[1:]
        # Little tweak because the key2;elem3;elem4 part of the
        # first line (the one containing the name) has the key
        # in second place, like "elem3;key2;elem4":
        if len(v) > 1 and v[0] == str1:
            kp = v[0]
            v = [v[1], k]
            k = kp
        if k != '':
            dct = {k: v}
            L.append(dct)
So far I have been unsuccessful in extracting the name as a global identifier and all values of the blocks as variables. I can't use a modulo-based technique because of the variable number of lines in each block, even though all blocks contain at least some common keys.
I also tried a while condition inside a for loop over each dictionary, but it became a mess.
zip could potentially be a nice option but I don't really know how to use it properly.
Target DataFrame:
What I'd like to end up with should ideally look something like a DataFrame containing:
index                'Number of items'  'X'  ''  'Y'  'Avg'  'Z'  'Std'  ...
"Legend number one"   6                 ...
"Legend number six"  19                 ...
"Legend number 11"    6                 ...
"Legend number 15"   18                 ...
The column names are the keys, and the table contains the values for each block of data on a separate line.
A numbered index with a new "Legend name" column would be OK as well.
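To make the target concrete, here is a minimal sketch of the parsing direction I have in mind. It assumes the ';;;;;' separator lines and the constant "Number of items" marker shown above; parse_blocks is just a hypothetical helper name, and repeated empty keys inside a block would overwrite each other:

import pandas as pd

def parse_blocks(path):
    records = {}
    name, block = None, {}
    with open(path) as f:
        for raw in f:
            cells = raw.rstrip('\n').split(';')
            if all(c == '' for c in cells):       # the ';;;;;' separator line
                if name is not None:
                    records[name] = block
                name, block = None, {}
                continue
            if 'Number of items' in cells:        # first line: legend name + count
                name = cells[0]
                block['Number of items'] = cells[-1]
                continue
            for half in (cells[:3], cells[3:]):   # two 3-element vectors per line
                if half:
                    block[half[0]] = half[1:]
    if name is not None:                          # flush the last block
        records[name] = block
    return records

df = pd.DataFrame.from_dict(parse_blocks('file.csv'), orient='index')

The DataFrame cells then hold 2-element lists, so the .apply(pd.Series) expansion used further down still applies.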
CSV sample to play with:
"Legend number one";;;;Number of items;6
X;8.6806792;8.6716338;;;
Y;0.1557;0.1556;Avg;;50.1206378
Z;-0.7859;-0.7860;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
;;;;;
"Legend number six";;;;Number of items;19
X;56.6806792;56.6716338;;;
Y;0.1324;0.1322;Avg;;50.1206378
Z;-0.7654644;-0.75283;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
A;0;1;Value;;0
B;1;0;;;
;;;;;
"Legend number 11";;;;Number of items;6
X;358.6806792;358.6716338;;;
Y;0.1324;0.1322;Avg;;50.1206378
Z;-0.7777;-0.7778;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
;;;;;
"Legend number 15";;;;Number of items;18
X;58.6806792;58.6716338;;;
Y;0.1324;0.1322;Avg;;50.1206378
Z;0.5555;0.5554;Std;;-0.0010354
D;8.0153902;8;Err;;1.010385
A;0;1;Value;;0
B;1;0;;;
C;0;0;k;1;0
;;;;;
I'm using Ubuntu and Python 3.6 but the script must work on a Windows computer as well.
Appending this to the previous code should work pretty well:
Dict1 = {}  # must be initialized before the loop
for elem in L:
    for key, val in elem.items():
        if key in names:
            name = key
            Dict2 = {}
        else:
            Dict2[key] = val
            Dict1[name] = Dict2
df1 = pd.DataFrame.from_dict(Dict1, orient='index')
df2 = pd.DataFrame(index=df1.index)
for col in df1.columns:
    colS = df1[col].apply(pd.Series)
    colS = colS.rename(columns=lambda x: col + '_' + str(x))
    df2 = pd.concat([df2, colS], axis=1)
df2.to_csv('output.csv', sep=',', index=True, header=True)
There are probably many other ways to go...
This link was helpful:
https://chrisalbon.com/python/data_wrangling/pandas_expand_cells_containing_lists/
Individually the prints in my for loops correctly print the items I want, but I'm having difficulties printing them together on the same line.
# Grabbing text from the first column in the table that contains "Elephant"
for cell in driver.find_elements_by_xpath("//*[contains(text(),'Elephant')]"):
    ElephantText = cell.text
    print(ElephantText)
# This prints:
# Elephant 1
# Elephant 2
# Elephant 3 etc. ...which is what I want
for element in driver.find_elements_by_xpath("//*[contains(text(),'Elephant')]/following::td[1]/span/select[1]"):
    selected = Select(element).first_selected_option
    select_text = selected.text
    print(select_text)
#This acquires the selected option in the dropdown menu following the cell that contains "Elephant" and prints the selected option which is what I want.
I tried:
print(ElephantText, select_text)
But this just returns the last value in ElephantText and none of the select_text Selected options.
I also tried to zip the two together using:
zipped = zip(ElephantText, select_text)
print(zipped)
But it returns this:
<zip object at 'random hexadecimal number'>
I tried turning these into lists again, but it just turned each letter in the result into an item within the list, so I'm kind of out of ideas at this point. Any direction would be appreciated.
EDIT
This is what I'd like my results to look like:
Elephant 1 - Selected
Elephant 2 - Selected
Elephant 3 - Selected
ElephantText and select_text are single strings, not the collections of values you iterated over, so zipping them pairs up their characters rather than your values. You need to store all the text values (since you're iterating over the two collections one after the other) and then zip the two lists of stored values:
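To see why zipping the bare strings misbehaves, note that zip on two strings pairs up their characters:

list(zip("Elephant 1", "Selected"))
# [('E', 'S'), ('l', 'e'), ('e', 'l'), ('p', 'e'), ...]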
ElephantTexts = []
for cell in driver.find_elements_by_xpath("//*[contains(text(),'Elephant')]"):
    ElephantText = cell.text
    print(ElephantText)
    ElephantTexts.append(ElephantText)

Selected_texts = []
for element in driver.find_elements_by_xpath("//*[contains(text(),'Elephant')]/following::td[1]/span/select[1]"):
    selected = Select(element).first_selected_option
    select_text = selected.text
    print(select_text)
    Selected_texts.append(select_text)

merged = tuple(zip(ElephantTexts, Selected_texts))  # assuming they are the same size
for tup in merged:
    print(tup)
I ran the following code with hardcoded lists:
ElephantTexts = ['Elephant1', 'Elephant2', 'Elephant3']
Selected_texts = ['Selected1', 'Selected2', 'Selected3']
merged = tuple(zip(ElephantTexts, Selected_texts)) # assuming they are the same size
for tup in merged:
    print(tup)
and this is the output:
('Elephant1', 'Selected1')
('Elephant2', 'Selected2')
('Elephant3', 'Selected3')
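And if you want the exact "Elephant 1 - Selected" format from your edit, format the zipped pairs instead of printing the raw tuples:

for elephant, selection in zip(ElephantTexts, Selected_texts):
    print('{} - {}'.format(elephant, selection))
# Elephant1 - Selected1
# Elephant2 - Selected2
# Elephant3 - Selected3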
Please, I need some help again.
I have a big database file (let's call it db.csv) containing a lot of information.
Simplified database file to illustrate:
I run usearch61 -cluster_fast on my genes sequences in order to cluster them.
I obtained a file named 'clusters.uc'. I opened it as a csv, then wrote code to create a dictionary (let's say dict_1) with cluster numbers as keys and gene_ids (VFG...) as values.
Here is an example of what I made then stored in a file: dict_1
0 ['VFG003386', 'VFG034084', 'VFG003381']
1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636']
2 ['VFG018349', 'VFG018485', 'VFG043567']
...
14471 ['VFG015743', 'VFG002143']
So far so good. Then, using db.csv, I made another dictionary (dict_2) where gene_ids (VFG...) are keys and VF_Accessions (IA... or CVF.. or VF...) are values. Illustration of dict_2:
VFG044259 IA027
VFG044258 IA027
VFG011941 CVF397
VFG012016 CVF399
...
What I want in the end is to have, for each VF_Accession, the numbers of its cluster groups; illustration:
IA027 [0,5,6,8]
CVF399 [15, 1025, 1562, 1712]
...
So, since I'm still a beginner in coding, I guess I need code that compares the values from dict_1 (VFG...) to the keys from dict_2 (VFG...) and, when they match, puts the VF_Accession as a key with all its cluster numbers as values. Since VF_Accessions are keys they can't be duplicated, so I need a dictionary of lists; I guess I can do that, because I made one for dict_1. But my problem is that I can't figure out a way to compare the values from dict_1 to the keys from dict_2 and assign each VF_Accession its cluster numbers. Please help me.
First, let's give your dictionaries some better names than dict_1, dict_2, ...; that makes it easier to work with them and to remember what they contain.
You first created a dictionary that has cluster numbers as keys and gene_ids (VFG...) as values:
cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'],
1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'],
2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'],
5: ['VFG011941'],
7949: ['VFG003386'],
14471: ['VFG015743', 'VFG002143', 'VFG012016']}
And you also have another dictionary where gene_ids are keys and VF_Accessions (IA... or CVF.. or VF...) are values:
gene_id_to_vf_accession = {'VFG044259': 'IA027',
'VFG044258': 'IA027',
'VFG011941': 'CVF397',
'VFG012016': 'CVF399',
'VFG000676': 'VF0142',
'VFG002231': 'VF0369',
'VFG003386': 'CVF051'}
And we want to create a dictionary where each VF_Accession key has as value the numbers of cluster groups: vf_accession_to_cluster_groups.
We also note that a VF Accession can belong to multiple gene IDs (for example, the VF Accession IA027 has both the VFG044259 and the VFG044258 gene IDs).
So we use a defaultdict to build a dictionary with the VF Accession as key and a list of gene IDs as value:
from collections import defaultdict

vf_accession_to_gene_ids = defaultdict(list)
for gene_id, vf_accession in gene_id_to_vf_accession.items():
    vf_accession_to_gene_ids[vf_accession].append(gene_id)
For the sample data I posted above, vf_accession_to_gene_ids now looks like:
defaultdict(<class 'list'>, {'VF0142': ['VFG000676'],
'CVF051': ['VFG003386'],
'IA027': ['VFG044258', 'VFG044259'],
'CVF399': ['VFG012016'],
'CVF397': ['VFG011941'],
'VF0369': ['VFG002231']})
Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and see if the gene ID is present there:
vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group
The end result for the above sample data now is:
{'VF0142': [],
'CVF051': [0, 7949],
'IA027': [0],
'CVF399': [2, 14471],
'CVF397': [5],
'VF0369': []}
Caveat: I don't do much Python development, so there's likely a better way to do this. You can first map your VFG... gene_ids to their cluster numbers, and then use that to process the second dictionary:
from collections import defaultdict
import sys
import ast

# see https://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists
vfg_cluster_map = defaultdict(list)

# map all of the vfg... keys to their cluster numbers first
with open(sys.argv[1], 'r') as dict_1:
    for line in dict_1:
        # split the line at the first space to separate the cluster number and gene ID list
        # e.g. after splitting the line "0 ['VFG003386', 'VFG034084', 'VFG003381']",
        # cluster_group_num holds "0", and vfg_list holds "['VFG003386', 'VFG034084', 'VFG003381']"
        cluster_group_num, vfg_list = line.strip().split(' ', 1)
        cluster_group_num = int(cluster_group_num)
        # convert "['VFG...', 'VFG...']" from a string to an actual list
        vfg_list = ast.literal_eval(vfg_list)
        for vfg in vfg_list:
            vfg_cluster_map[vfg].append(cluster_group_num)

# you now have a dictionary mapping gene IDs to the clusters they
# appear in, e.g.
# {'VFG003386': [0],
#  'VFG034084': [0],
#  ...}
# you can look in that dictionary to find the cluster numbers corresponding
# to your vfg... keys in dict_2 and add them to the list for that vf_accession
vf_accession_cluster_map = defaultdict(list)
with open(sys.argv[2], 'r') as dict_2:
    for line in dict_2:
        vfg, vf_accession = line.strip().split(' ')
        # add the list of cluster numbers corresponding to this vfg... to
        # the list of cluster numbers corresponding to this vf_accession
        vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[vfg])

for vf_accession, cluster_list in vf_accession_cluster_map.items():
    print(vf_accession + ' ' + str(cluster_list))
Then save the above script and invoke it like python <script name> dict1_file dict2_file > output (or you could write the strings to a file instead of printing them and redirecting).
EDIT: After looking at @BioGeek's answer, I should note that it would make more sense to process this all in one shot than to create dict_1 and dict_2 files, read them in, parse the lines back into numbers and lists, etc. If you don't need to write the dictionaries to a file first, then you can just add your other code to the script and use the dictionaries directly.
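For example, a minimal one-shot sketch that works directly on the two in-memory dictionaries from @BioGeek's answer (no intermediate files; same defaultdict idea as above):

from collections import defaultdict

vf_accession_to_clusters = defaultdict(list)
for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
    for gene_id in gene_ids:
        # only map gene IDs that have a known VF_Accession
        if gene_id in gene_id_to_vf_accession:
            vf_accession_to_clusters[gene_id_to_vf_accession[gene_id]].append(cluster_nr)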
Hello, I'm trying to get this program to print out the list data for the corridor entered in the class call at the bottom, but it only prints out the very last row in the list. This program takes in a .csv file and turns it into a list. I'm not by any means a very experienced Python programmer.
class csv_get(object):  # class to bring the .csv file into the program
    import os
    os.chdir('C:\Users\U2970\Documents\ArcGIS')
    gpsTrack = open('roadlog_intersection_export_02_18_2014_2.csv', 'rb')
    # Figure out position of lat and long in the header
    headerLine = gpsTrack.readline()
    valueList = headerLine.split(",")

class data_set(object):  # place columns from the .csv file into a python dictionary
    dict = {'DESC': csv_get.valueList.index("TDD_DESC"),
            'ROUTE_NAME': csv_get.valueList.index("ROUTE_NAME"),
            'CORRIDOR': csv_get.valueList.index("CORRIDOR"),
            'ROADBED': csv_get.valueList.index("DC_RBD"),
            'BEG_RP': csv_get.valueList.index("BEG_RP"),
            'END_RP': csv_get.valueList.index("END_RP"),
            'DESIGNATION': csv_get.valueList.index("NRLG_SYS_DESC")}

class columns_set(object):  # append the dict into a list
    new_list = []
    for line in csv_get.gpsTrack.readlines():
        segmentedLine = line.split(",")
        new_list.append([segmentedLine[data_set.dict['DESC']],
                         '{:>7}'.format(segmentedLine[data_set.dict['ROUTE_NAME']]),
                         '{:>7}'.format(segmentedLine[data_set.dict['CORRIDOR']]),
                         '{:>7}'.format(segmentedLine[data_set.dict['ROADBED']]),
                         '{:>7}'.format(segmentedLine[data_set.dict['BEG_RP']]),
                         '{:>7}'.format(segmentedLine[data_set.dict['END_RP']]),
                         '{:>7}'.format(segmentedLine[data_set.dict['DESIGNATION']])])
class data:
    def __init__(self, corridor):
        for col in columns_set.new_list:  # for each row in the list new_list
            self.desc = col[0]
            self.route = col[1]           # assigns attribute names to column numbers
            self.corridor = col[2]
            self.roadbed = col[3]
            self.beg_rp = col[4]
            self.end_rp = col[5]
            self.designation = col[6]

    def displayData(self):  # print data for the corridor number entered
        print self.desc, \
            self.route, \
            self.corridor, \
            self.roadbed, \
            self.beg_rp, \
            self.end_rp, \
            self.designation

set1 = data('C000021')  # corridor number to be sent into the data class
# should print all the corridor data but only prints the very last record
set1.displayData()
You're only storing data from the current row, and overwriting it with each new row. A line like self.desc = col[0] says "overwrite self.desc so it refers to the value of col[0]".
I hate to say it, but all of this code is flawed at a fundamental level. Your classes, except for data, are really functions. And even data is defective because it pulls in hardwired elements from outside itself.
You really should use Python's included CSV module to parse a CSV file into lists of lists. It can even give you a list of dictionaries and handle the header line.
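For instance, a minimal csv.DictReader sketch (the file name and column names are taken from your code; treat it as a starting point, not a drop-in replacement):

import csv

with open('roadlog_intersection_export_02_18_2014_2.csv') as gpsTrack:
    for row in csv.DictReader(gpsTrack):  # one dict per row, keyed by the header line
        if row['CORRIDOR'] == 'C000021':
            print('{:>7} {:>7} {:>7}'.format(row['TDD_DESC'], row['ROUTE_NAME'], row['CORRIDOR']))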
I have a csv file with a single column, but 6.2 million rows, all containing strings between 6 and 20ish letters. Some strings will be found in duplicate (or more) entries, and I want to write these to a new csv file - a guess is that there should be around 1 million non-unique strings. That's it, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. Any script I've written so far takes at least a week (!) to run, according to some timings I did.
First try:
import csv

in_file_1 = open('UniProt Trypsinome (full).csv', 'r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv', 'w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)

# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
    ref_dict[row] = in_list_1[row]

# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
    Peptide = ref_dict.pop(n)
    if Peptide in ref_dict.values():  # Non-unique peptides
        Peptide_list.append(Peptide)
    else:
        Uniques.append(Peptide)  # Unique peptides

for m in range(len(Peptide_list)):
    Write_list = (str(Peptide_list[m]).replace("'", "").replace("[", '').replace("]", ''), '')
    writer_1.writerow(Write_list)
Second try:
in_file_1 = open('UniProt Trypsinome (full).csv', 'r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'w+')
writer_1 = csv.writer(out_file_1)

ref_dict = {}
for row in range(len(in_list_1)):
    Peptide = in_list_1[row]
    if Peptide in ref_dict.values():
        write = (in_list_1[row], '')
        writer_1.writerow(write)
    else:
        ref_dict[row] = in_list_1[row]
EDIT: here are a few lines from the csv file:
SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR
Do it with Numpy. Roughly:
import numpy as np

column = 42
mat = np.loadtxt("thefile", dtype=[TODO])
uniq = set(np.unique(mat[:, column]))
for row in mat:
    if row[column] not in uniq:   # value already seen and removed: a repeat
        print(row)
    else:
        uniq.remove(row[column])  # first occurrence: drop it from the set
You could even vectorize the output stage using numpy.savetxt and the char-array operators, but it probably won't make very much difference.
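If your NumPy is new enough (return_counts was added in NumPy 1.9), np.unique can also report how often each value occurs, which makes extracting the duplicated values a two-liner:

values, counts = np.unique(mat[:, column], return_counts=True)
duplicates = values[counts > 1]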
First hint: Python supports lazy evaluation; better to use it when dealing with huge datasets. So:
iterate over your csv.reader instead of building a huge in-memory list, and
don't build huge in-memory lists with ranges - use enumerate(seq) instead if you need both the item and the index, and just iterate over your sequence's items if you don't need the index.
Second hint: the main point of a dict (hashtable) is to look up keys, not values... so don't build a huge dict that's used as a list.
Third hint: if you just want a way to store "already seen" values, use a set (see the sketch below).
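Putting those hints together, a minimal sketch (assuming one peptide string per row, as in your sample lines):

import csv

seen = set()
non_unique = set()
with open('UniProt Trypsinome (full).csv') as in_file:
    for row in csv.reader(in_file):
        peptide = row[0]              # one peptide string per row
        if peptide in seen:
            non_unique.add(peptide)   # second or later occurrence
        else:
            seen.add(peptide)
with open('UniProt Non-Unique Reference Trypsinome.csv', 'w') as out_file:
    out_file.writelines(p + '\n' for p in sorted(non_unique))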
I'm not so good in Python, so I don't know how 'in' works under the hood, but your algorithm seems to run in O(n²).
Try sorting your list after reading it, with an O(n log n) algorithm like quicksort; it should work better.
Once the list is ordered, you just have to check whether two consecutive elements are the same.
So you get the reading in O(n), the sorting in O(n log n) (at best), and the comparison in O(n).
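In Python, that sort-then-compare idea might look like this sketch (using the built-in sort, which is already O(n log n)):

with open('UniProt Trypsinome (full).csv') as f:
    peptides = sorted(line.rstrip() for line in f)
# after sorting, duplicates sit next to each other
duplicates = {a for a, b in zip(peptides, peptides[1:]) if a == b}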
Although I think that the numpy solution is the best, I'm curious whether we can speed up the given example. My suggestions are:
skip csv.reader costs and just read the line
rb to skip the extra scan needed to fix newlines
use bigger file buffer sizes (read 1Meg, write 64K is probably good)
use the dict keys as an index - key lookup is much faster than value lookup
I'm not a numpy guy, so I'd do something like
in_file_1 = open('UniProt Trypsinome (full).csv', 'rb', 1048576)
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'w+', 65536)
ref_dict = {}
for line in in_file_1:
    peptide = line.rstrip()
    if peptide in ref_dict:
        out_file_1.write(peptide + '\n')
    else:
        ref_dict[peptide] = None
I have a file filled with lines like this (this is just a small bit of the file):
9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae
The number refers to a cluster, and then it goes 'Genus' 'Species' 'Family'.
What I want to do is write a program that will look through each line and report back to me: a list of the different genera in each cluster, and how many of each of those genera are in the cluster. So I'm interested in cluster number and the first 'word' in each line.
My trouble is that I'm not sure how to get this information. I think I need to use a for-loop, starting at lines that begin with '0'. The output would be a file that looks something like:
Cluster 0: Brucella(2) # Lists cluster, followed by genera in cluster with number, something like that.
Cluster 1: Streptomyces(2)
Cluster 2: Brucella(1)
etc.
Eventually I want to do the same kind of count with the Families in each cluster, and then Genera and Species together. Any thoughts on how to start would be greatly appreciated!
I thought this would be a fun little toy project, so I wrote a little hack to read in an input file like yours from stdin, count and format the output recursively and spit out output that looks a little like yours, but with a nested format, like so:
Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)
I also added some junk data to my table so that I could see an extra entry in Cluster 0, separated from the other two... The idea here is that you should be able to see a top level "Cluster" entry and then nested, indented entries for genus, species, family... it shouldn't be hard to extend for deeper trees, either, I hope.
# sys for stdio stuff
import sys
# re for re.split -- this can go if you find another way to parse your data
import re

# A global (shame on me) for storing the data we're going to parse from stdin
data = []

# read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on spaces (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of input (strip)
    entry = re.split(r"\s+", line.strip())
    # Throw the array into my global data array, it'll look like this:
    # [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)

# Sort: the first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of the keys we care about for each item
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))

# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# parameters:
#   start - an integer telling us what line to begin working from so we needn't
#           walk the whole tree each time to figure out where we are.
#   super - An array that captures where we are in the search. This array
#           will have more elements in it as we deepen our traversal of the "tree".
#           Initially, it will be [].
#           In the next ply of the tree, it will be [ '0' ].
#           Then something like [ '0', 'Brucella' ] and so on.
#   data  - The global data structure -- this never mutates after the sort above,
#           I could have just used the global directly
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth = len(super)

    # Count entries in the super class, starting from the "start" index in the
    # array: for the few records in the data file that match our "super" exactly,
    # we count occurrences.
    count = 0
    if depth != 0:
        for i in range(start, len(data)):
            if data[i][0:depth] == data[start][0:depth]:
                count = count + 1
            else:
                # We can stop counting as soon as a match fails,
                # because of the way our input data is sorted
                break
    else:
        count = len(data)

    # At depth == 1, we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if depth == 1:
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif depth > 0:
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) +
                         data[start][depth - 1] + '(' + str(count) + ')\n')

    # Recursion: we're going to figure out a new depth and a new "super"
    # and then call ourselves again. We break out on depth == 4 because
    # of one other assumption (I lied before about the abstract thing) I'm
    # making about our input data here. This could
    # be made more robust/flexible without a lot of work
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i]  # The original record from our data
        newdepth = depth + 1
        if newdepth > 4:
            break
        # array slice creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
            # Track our new "subsuper" for subsequent comparisons
            # as we loop through matches
            subsuper = newsuper
        # Track our position in the data for the next recursion, so we can
        # start on the right line
        substart = substart + 1

# First call to groupedReport starts the recursion
groupedReport(0, [], data)
If you make my Python code into a file like "classifier.py", then you can run your input.txt file (or whatever you call it) through it like so:
cat input.txt | python classifier.py
Most of the magic of the recursion, if you care, is implemented using slices of arrays and leans heavily on the ability to compare array slices, as well as the fact that I can order the input data meaningfully with my sort routine. You may want to convert your input data to all-lowercase, if it is possible that case inconsistencies could yield mismatches.
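To illustrate the slice comparison that groupedReport leans on:

record = ['0', 'Brucella', 'melitensis', 'Brucellaceae']
record[0:2] == ['0', 'Brucella']        # True: list slices compare element-wise
record[0:2] == ['0', 'Mycobacterium']   # False

And the lowercase normalization would just be line.strip().lower() inside the parsing loop.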
It is easy to do:
Create an empty dict {} to store your result; let's call it 'result'.
Loop over the data line by line.
Split each line on spaces to get the 4 elements of your structure: cluster, genus, species, family.
Increment the count of that genus inside its cluster key each time it is seen; it has to be set to 1 on the first occurrence, though.
result = { '0': {'Brucella': 2}, '1': {'Streptomyces': 2}, ... }
Code:
my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae"""
result = {}
for line in my_data.split("\n"):
    cluster, genus, species, family = line.split(" ")
    result.setdefault(cluster, {}).setdefault(genus, 0)
    result[cluster][genus] += 1
print(result)
{'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
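To render that dictionary in the "Cluster 0: Brucella(2)" style the question asks for, one more formatting loop does it:

for cluster in sorted(result, key=int):
    genera = ', '.join('{}({})'.format(genus, count) for genus, count in result[cluster].items())
    print('Cluster {}: {}'.format(cluster, genera))
# Cluster 0: Brucella(2)
# Cluster 1: Streptomyces(2)
# Cluster 2: Mycobacterium(1)
# ...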