I am working on Python and I need to match strings across several data files. First I used pickle to unpack my files, and then I placed the contents into lists. I only want to match strings that have the same conditions. These conditions are indicated at the end of the string.
My working script looks approximately like this:
import pickle

f = open("data_a.dat")
list_a = pickle.load(f)
f.close()

f = open("data_b.dat")
list_b = pickle.load(f)
f.close()

f = open("data_c.dat")
list_c = pickle.load(f)
f.close()

f = open("data_d.dat")
list_d = pickle.load(f)
f.close()

for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"
This seems to work fine for these lists. The problems begin when I try to add 8 or 9 more data files for which I also need to match the same conditions. The script simply won't finish; it gets stuck. I appreciate your help.
Edit: Each of the lists contains histograms named after the parameters that were used to create them. The name of each histogram contains these parameters and their values at the end of the string. In the example I did it for 2 data sets; now I would like to do it for 9 data sets without using nested loops.
Edit 2: I just expanded the code to reflect more accurately what I want to do. Now if I try to do that for 9 lists, it not only looks horrible, it also doesn't work.
Off the top of my head:
files = ["file_a", "file_b", "file_c"]
sets = []
for name in files:
    f = open(name)
    sets.append(set(pickle.load(f)))
    f.close()

intersection = sets[0].intersection(*sets[1:])
EDIT: Well, I overlooked your mapping to x.GetName()[12:], but you should still be able to reduce your problem to set logic.
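For instance, here is a rough sketch of that set-logic idea applied to the histogram names. It assumes (as in the question) that the condition string always starts at character 12 of the name and, additionally, that it is identical across files rather than merely contained in the other names:

import pickle

# Sketch only: assumes each pickled list element has a GetName() method and
# that everything from character 12 onwards is the condition string.
data_files = ["data_a.dat", "data_b.dat", "data_c.dat", "data_d.dat"]

suffix_maps = []
for path in data_files:
    with open(path, "rb") as handle:
        objects = pickle.load(handle)
    # Map condition suffix -> object for this file.
    suffix_maps.append({obj.GetName()[12:]: obj for obj in objects})

# Conditions that appear in every file.
common = set(suffix_maps[0]).intersection(*suffix_maps[1:])

for condition in common:
    matched = [m[condition] for m in suffix_maps]  # one object per file
    # "do whatever" with the matched objects here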
Here is a small piece of code you can take inspiration from. The main idea is a matching function that is called repeatedly while stepping through every combination of positions in the lists.
For simplicity's sake, I assume the data is already loaded into lists, but you can load it from the files first:
data_files = [
    'data_a.dat',
    'data_b.dat',
    'data_c.dat',
    'data_d.dat',
    'data_e.dat',
]
lists = [pickle.load(open(f)) for f in data_files]
And because I don't really get the details of what you need to do, my goal here is to find the matches on the first four characters:
def do_whatever(string):
    print "I have matched the string '%s'" % string

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]
positions = [0 for i in range(len(lists))]

def recursive_match(positions, lists):
    strings = map(lambda p, l: l[p], positions, lists)
    match = True
    searched_string = strings.pop(0)[:4]
    for string in strings:
        if searched_string not in string:
            match = False
            break
    if match:
        do_whatever(searched_string)
    # increment positions:
    new_positions = positions[:]
    lists_len = len(lists)
    for i, l in enumerate(reversed(lists)):
        max_position = len(l) - 1
        list_index = lists_len - i - 1
        current_position = positions[list_index]
        if max_position > current_position:
            new_positions[list_index] += 1
            break
        else:
            new_positions[list_index] = 0
            continue
    return new_positions, not any(new_positions)

search_is_finished = False
while not search_is_finished:
    positions, search_is_finished = recursive_match(positions, lists)
Of course you can optimize a lot of things here (this is draft code), but take a look at how the matching function steps through the positions; that is the key idea.
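As an aside, the same walk over every combination of one string per list can be written without the manual position bookkeeping by using itertools.product; a rough, equivalent sketch of the loop above:

from itertools import product

# Same example lists and 4-character matching rule as above.
for strings in product(*lists):
    searched_string = strings[0][:4]
    if all(searched_string in s for s in strings[1:]):
        do_whatever(searched_string)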
In the end I used the built-in map function. I realize now I should have been even more explicit than I was (which I will do in the future).
My data files are histograms with 5 parameters, some with 3 or 4. Something like this:
par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]
I need to examine the behavior of the quantity plotted for each possible combination of the values of the parameters. In the end, I get a data file with ~300 histograms each identified in their name with the corresponding values of the parameters and the sample name. It looks something like,
datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"
So I get 300 histograms for each of the 9 data files, but luckily all of these histograms are created in the same order, so I can pair them all up just using the built-in map function. I unpack the data files, put each into a list, and then use map to pair each histogram with its corresponding configuration in the other data samples.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
    do_something(lst)
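Note that map(None, ...) only works in Python 2; in Python 3 a rough equivalent is itertools.zip_longest, which also pads the shorter lists with None:

from itertools import zip_longest

# Python 3 sketch of the same pairing; pass all nine lists in practice.
for lst in zip_longest(data1_histosli, data2_histosli):  # ... up to data9_histosli
    do_something(lst)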
This solves my problem. Thank you to all for your help!
Best way to show what I'm trying to do:
I have a list of different hashes that consist of ordered elements, separated by an underscore. Each element may or may not have other possible replacement values. I'm trying to generate a list of all possible combinations of this hash, after taking into account replacement values.
Example:
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
original_hash = "1_2_3_4_5"
I want to be able to generate a list of the following hashes:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
"1a_2_3a_4_5",
"1b_2_3a_4_5",
]
The challenge is that this'll be needed on large dataframes.
So far here's what I have:
def return_all_possible_hashes(df, grouped_elements):
    rows_to_append = []
    for grouped_element in grouped_elements:
        for index, row in df[
            df["hash"].str.contains("|".join(grouped_element))
        ].iterrows():
            (element_used_in_hash,) = set(grouped_element) & set(row["hash"].split("_"))
            hash_used = row["hash"]
            replacement_elements = set(grouped_element) - set([element_used_in_hash])
            for replacement_element in replacement_elements:
                row["hash"] = hash_used.replace(
                    element_used_in_hash, replacement_element
                )
                rows_to_append.append(row)
    return df.append(rows_to_append)
But the problem is that this will only append hashes with all combinations of a given grouped_element, and not all combinations of all grouped_elements at the same time. So using the example above, my function would return:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
]
I feel like I'm not far from the solution, but I also feel stuck, so any help is much appreciated!
If you make a list of the original hash value's elements and replace each element with a list of all its possible variations, you can use itertools.product to get the Cartesian product across these sublists. Transforming each element of the result back to a string with '_'.join() will get you the list of possible hashes:
from itertools import product

def possible_hashes(original_hash, grouped_elements):
    hash_list = original_hash.split('_')
    variations = list(set().union(*grouped_elements))
    var_list = hash_list.copy()
    for i, h in enumerate(hash_list):
        if h in variations:
            for g in grouped_elements:
                if h in g:
                    var_list[i] = g
                    break
        else:
            var_list[i] = [h]
    return ['_'.join(h) for h in product(*var_list)]
possible_hashes("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
['1_2_3_4_5',
'1_2_3a_4_5',
'1a_2_3_4_5',
'1a_2_3a_4_5',
'1b_2_3_4_5',
'1b_2_3a_4_5']
To use this function on various original hash values stored in a dataframe column, you can do something like this:
df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))
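If you then want one row per generated hash rather than a list in each cell, Series.explode (available in pandas 0.25+) can flatten the result; a small illustrative sketch with made-up data:

import pandas as pd

grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
df = pd.DataFrame({'hash': ["1_2_3_4_5", "6_7_8_9_10"]})

# One list of candidate hashes per original row...
expanded = df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))
# ...then one row per candidate hash.
flat = expanded.explode().reset_index(drop=True)
print(flat)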
I have been trying to figure out an issue where my program does the topological sort, but not in the order that I'm trying to get. For instance, if I give it the input:
Learn Python
Understand higher-order functions
Learn Python
Read the Python tutorial
Do assignment 1
Learn Python
It should output:
Read the Python tutorial
Understand higher-order functions
Learn Python
Do assignment 1
Instead, the first two items come out swapped. For some of my other test cases this occurs as well, where two seemingly random items are swapped. Here's my code:
import sys

graph = {}

def populate(name, dep):
    if name in graph:
        graph[name].append(dep)
    else:
        graph[name] = [dep]
    if dep not in graph:
        graph[dep] = []

def main():
    last = ""
    for line in sys.stdin:
        lne = line.strip()
        if last == "":
            last = lne
        else:
            populate(last, lne)
            last = ""

def topoSort(graph):
    sortedList = []  # result
    zeroDegree = []
    inDegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            inDegree[v] += 1
    for i in inDegree:
        if inDegree[i] == 0:
            zeroDegree.append(i)
    while zeroDegree:
        v = zeroDegree.pop(0)
        sortedList.append(v)
        # selection sort for alphabetical sort
        for x in graph[v]:
            inDegree[x] -= 1
            if inDegree[x] == 0:
                zeroDegree.insert(0, x)
    sortedList.reverse()
    #for y in range(len(sortedList)):
    #    min = y
    #    for j in range(y+1, len(sortedList)):
    #        if sortedList[min] > sortedList[y]:
    #            min = j
    #    sortedList[y], sortedList[min] = sortedList[min], sortedList[y]
    return sortedList

if __name__ == '__main__':
    main()
    result = topoSort(graph)
    if len(result) == len(graph):
        print(result)
    else:
        print("cycle")
Any ideas as to why this may be occurring?
Elements within sets are unordered, and dictionary iteration order is only guaranteed to follow insertion order from Python 3.7 onwards; iterating over them does not necessarily give the order you expect. I think that is the reason why you get seemingly random results from your sorting algorithm. I guess it has something to do with inDegree, but I didn't debug very much.
I can't offer you a specific fix for your code, but according to the wanted input and output it could look like this:
# read tuples from stdin until ctrl+d is pressed (on linux) or EOF is reached
graph = set()
while True:
    try:
        graph |= { (input().strip(), input().strip()) }
    except:
        break

# apply topological sort and print it to stdout
print("----")
while graph:
    z = { (a,b) for a,b in graph if not [1 for c,d in graph if b==c] }
    print( "\n".join( sorted( {b for a,b in z} )
                    + sorted( {a for a,b in z if not [1 for c,d in graph if a==d]} ) ) )
    graph -= z
The great advantage of Python (here 3.9.1) is how short the solution can be. Instead of lists I would use sets, because they are easier to edit: graph | {elements} inserts items into the set and graph - {elements} removes them. Duplicates are ignored.
First, pairs of lines are read from stdin with input(), input() and added to the graph set as tuples.
The set comprehension z = {...} filters the elements that can be printed next, which are then subtracted from the graph set.
The generated sets are unordered, so the printed output is turned into sorted lists at the end, joined with newlines.
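For completeness, here is a rough sketch of a more conventional fix: Kahn's algorithm with a heap, so that among tasks whose prerequisites are already finished the alphabetically smallest comes first. The function name and structure are illustrative, not a drop-in patch for the code above; for the input from the question it prints the tutorial first and the assignment last.

import heapq

def topo_sort_alphabetical(pairs):
    """Each pair is (task, prerequisite); prerequisites come out first,
    and ties are broken alphabetically via the heap."""
    graph = {}      # prerequisite -> list of tasks that depend on it
    in_degree = {}  # task -> number of unmet prerequisites
    for task, dep in pairs:
        graph.setdefault(task, [])
        graph.setdefault(dep, []).append(task)
        in_degree.setdefault(task, 0)
        in_degree.setdefault(dep, 0)
        in_degree[task] += 1

    heap = [node for node, deg in in_degree.items() if deg == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        node = heapq.heappop(heap)      # smallest available item first
        order.append(node)
        for nxt in graph[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                heapq.heappush(heap, nxt)
    return order if len(order) == len(in_degree) else None  # None => cycle

pairs = [("Learn Python", "Understand higher-order functions"),
         ("Learn Python", "Read the Python tutorial"),
         ("Do assignment 1", "Learn Python")]
print(topo_sort_alphabetical(pairs))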
I am trying to compare 2 street networks, and when I run this code it returns a single ratio of .253529... I need it to compare each row and get a value per row, so I can query out the streets that don't match. What can I do to get it to return a ratio for each row?
# Set local variables
inFeatures = gp.GetParameterAsText(0)
fieldName = gp.GetParameterAsText(1)
fieldName1 = gp.GetParameterAsText(2)
fieldName2 = gp.GetParameterAsText(3)
expression = difflib.SequenceMatcher(None,fieldName1,fieldName2).ratio()
# Execute CalculateField
arcpy.CalculateField_management(inFeatures, fieldName, expression, "PYTHON_9.3")
If you know both files always have the exact same number of lines, a simple approach like this would work:
import difflib

ratios = []
with open('fieldName1', 'r') as f1, open('fieldName2', 'r') as f2:
    for l1, l2 in zip(f1, f2):
        R = difflib.SequenceMatcher(None, l1, l2).ratio()
        ratios.append((l1, l2, R))
This will produce a list of tuples like this:
[("aa", "aa", 1), ("aa", "ab", 0.5), ...]
If your files are different sizes, you'll need to find some way to match up the lines or otherwise handle it.
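For example, itertools.zip_longest pads the shorter file with a fill value, so every line still gets a ratio; a sketch where a missing line simply compares against the empty string:

import difflib
from itertools import zip_longest

ratios = []
with open('fieldName1', 'r') as f1, open('fieldName2', 'r') as f2:
    for l1, l2 in zip_longest(f1, f2, fillvalue=''):
        # A line with no counterpart compares against '', giving a ratio
        # of 0.0 unless the other line is also empty.
        r = difflib.SequenceMatcher(None, l1, l2).ratio()
        ratios.append((l1, l2, r))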
I have a csv file with a single column, but 6.2 million rows, all containing strings between 6 and 20ish letters. Some strings will be found in duplicate (or more) entries, and I want to write these to a new csv file - a guess is that there should be around 1 million non-unique strings. That's it, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. Any script I've written so far takes at least a week (!) to run, according to some timings I did.
First try:
import csv

in_file_1 = open('UniProt Trypsinome (full).csv', 'r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv', 'w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)

# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
    ref_dict[row] = in_list_1[row]

# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
    Peptide = ref_dict.pop(n)
    if Peptide in ref_dict.values():  # Non-unique peptides
        Peptide_list.append(Peptide)
    else:
        Uniques.append(Peptide)  # Unique peptides

for m in range(len(Peptide_list)):
    Write_list = (str(Peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
    writer_1.writerow(Write_list)
Second try:
import csv

in_file_1 = open('UniProt Trypsinome (full).csv', 'r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'w+')
writer_1 = csv.writer(out_file_1)

ref_dict = {}
for row in range(len(in_list_1)):
    Peptide = in_list_1[row]
    if Peptide in ref_dict.values():
        write = (in_list_1[row], '')
        writer_1.writerow(write)
    else:
        ref_dict[row] = in_list_1[row]
EDIT: here are a few lines from the csv file:
SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR
Do it with Numpy. Roughly:
import numpy as np

column = 42
mat = np.loadtxt("thefile", dtype=[TODO])
uniq = set(np.unique(mat[:,column]))
for row in mat:
    if row[column] not in uniq:
        print row
You could even vectorize the output stage using numpy.savetxt and the char-array operators, but it probably won't make very much difference.
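For reference, here is a sketch of how the duplicated strings could be pulled out with NumPy's return_counts, assuming the file really is a single column of plain strings (file names taken from the question):

import numpy as np

# Load the single column of peptide strings.
peptides = np.loadtxt("UniProt Trypsinome (full).csv", dtype=str)

# values: distinct strings; counts: how often each one occurs.
values, counts = np.unique(peptides, return_counts=True)
non_unique = values[counts > 1]

np.savetxt("UniProt Non-Unique Reference Trypsinome.csv", non_unique, fmt="%s")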
First hint: Python has good support for lazy iteration; it is better to use it when dealing with huge datasets. So:
iterate over your csv.reader instead of building a huge in-memory list,
don't build huge in-memory lists with ranges - use enumerate(seq) instead if you need both the item and index, and just iterate over your sequence's items if you don't need the index.
Second hint: the main point of using a dict (hash table) is to look up keys, not values... So don't build a huge dict that's used as a list.
Third hint: if you just want a way to store "already seen" values, use a set.
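A rough sketch combining those three hints (same file names as in the question; each repeated peptide is written once per extra occurrence):

import csv

seen = set()
with open('UniProt Trypsinome (full).csv', 'r') as infile, \
     open('UniProt Non-Unique Reference Trypsinome.csv', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:           # iterate lazily, no giant in-memory list
        if not row:
            continue             # skip blank lines
        peptide = row[0]
        if peptide in seen:      # set membership test is O(1) on average
            writer.writerow([peptide])
        else:
            seen.add(peptide)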
I'm not so good at Python, so I don't know exactly how 'in' works here, but your algorithm seems to run in O(n²).
Try sorting your list after reading it, with an O(n log n) algorithm like quicksort; it should work better.
Once the list is ordered, you just have to check whether two consecutive elements of the list are the same.
So you get the reading in O(n), the sorting in O(n log n) (at best), and the comparison in O(n).
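In Python, that sort-and-compare-neighbours idea might look roughly like this (illustrative only; file names taken from the question):

with open('UniProt Trypsinome (full).csv') as infile:
    peptides = sorted(line.strip() for line in infile)   # O(n log n)

non_unique = set()
for previous, current in zip(peptides, peptides[1:]):
    if previous == current:            # equal neighbours => duplicate
        non_unique.add(current)

with open('UniProt Non-Unique Reference Trypsinome.csv', 'w') as outfile:
    outfile.write('\n'.join(sorted(non_unique)) + '\n')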
Although I think that the numpy solution is the best, I'm curious whether we can speed up the given example. My suggestions are:
skip csv.reader costs and just read the line
rb to skip the extra scan needed to fix newlines
use bigger file buffer sizes (read 1Meg, write 64K is probably good)
use the dict keys as an index - key lookup is much faster than value lookup
I'm not a numpy guy, so I'd do something like
in_file_1 = open('UniProt Trypsinome (full).csv', 'rb', 1048576)
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv', 'w+', 65536)

ref_dict = {}
for line in in_file_1:
    peptide = line.rstrip()
    if peptide in ref_dict:
        out_file_1.write(peptide + '\n')
    else:
        ref_dict[peptide] = None
Hopefully this can be done with Python! I used two clustering programs on the same data and now have a cluster file from each. I reformatted the files so that they look like this:
Cluster 0:
	Brucellaceae(10)
		Brucella(10)
			abortus(1)
			canis(1)
			ceti(1)
			inopinata(1)
			melitensis(1)
			microti(1)
			neotomae(1)
			ovis(1)
			pinnipedialis(1)
			suis(1)
Cluster 1:
	Streptomycetaceae(28)
		Streptomyces(28)
			achromogenes(1)
			albaduncus(1)
			anthocyanicus(1)
etc.
These files contain bacterial species info. So I have the cluster number (Cluster 0), then right below it 'family' (Brucellaceae) and the number of bacteria in that family (10). Under that is the genera found in that family (name followed by number, Brucella(10)) and finally the species in each genera (abortus(1), etc.).
My question: I have 2 files formatted in this way and want to write a program that will look for differences between the two. The only problem is that the two programs cluster in different ways, so two clusters may be the same even if the actual "Cluster Number" is different (so the contents of Cluster 1 in one file might match Cluster 43 in the other file, the only difference being the actual cluster number). So I need something that ignores the cluster number and focuses on the cluster contents.
Is there any way I could compare these 2 files to examine the differences? Is it even possible? Any ideas would be greatly appreciated!
Given:
file1 = '''Cluster 0:
	giant(2)
		red(2)
			brick(1)
			apple(1)
Cluster 1:
	tiny(3)
		green(1)
			dot(1)
		blue(2)
			flower(1)
			candy(1)'''.split('\n')

file2 = '''Cluster 18:
	giant(2)
		red(2)
			brick(1)
			tomato(1)
Cluster 19:
	tiny(2)
		blue(2)
			flower(1)
			candy(1)'''.split('\n')
Is this what you need?
def parse_file(open_file):
    result = []
    for line in open_file:
        indent_level = len(line) - len(line.lstrip())
        if indent_level == 0:
            levels = ['', '', '']
        item = line.lstrip().split('(', 1)[0]
        levels[indent_level - 1] = item
        if indent_level == 3:
            result.append('.'.join(levels))
    return result

data1 = set(parse_file(file1))
data2 = set(parse_file(file2))

differences = [
    ('common elements', data1 & data2),
    ('missing from file2', data1 - data2),
    ('missing from file1', data2 - data1)]
To see the differences:
for desc, items in differences:
    print desc
    print
    for item in items:
        print '\t' + item
    print
prints
common elements
	giant.red.brick
	tiny.blue.candy
	tiny.blue.flower
missing from file2
	tiny.green.dot
	giant.red.apple
missing from file1
	giant.red.tomato
So, just to help, as I see lots of different answers in the comments, I'll give you a very, very simple implementation of a script that you can start from.
Note that this does not answer your full question, but points you in one of the directions mentioned in the comments.
Normally, if you have no experience, I'd argue you should go ahead and read up on Python (which I'll argue anyway, so I'll throw in a few links at the bottom of the answer).
On to the fun stuff! :)
class Cluster(object):
    '''
    This is a class that will contain your information about the Clusters.
    '''
    def __init__(self, number):
        '''
        This is what some languages call a constructor, but it's not.
        This method initializes the properties with values from the method call.
        '''
        self.cluster_number = number
        self.family_name = None
        self.bacteria_name = None
        self.bacteria = []


# This part below isn't a part of the class, this is the actual script.
with open('bacteria.txt', 'r') as file:
    cluster = None
    clusters = []
    for index, line in enumerate(file):
        if line.startswith('Cluster'):
            cluster = Cluster(index)
            clusters.append(cluster)
        else:
            if not cluster.family_name:
                cluster.family_name = line
            elif not cluster.bacteria_name:
                cluster.bacteria_name = line
            else:
                cluster.bacteria.append(line)
I wrote this as simply as I could, without any fancy stuff, for Python 2.7.2.
You could copy this into a .py file and run it directly from the command line, for example python bacteria.py.
Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)
http://learnpythonthehardway.org/
http://www.diveintopython.net/
http://docs.python.org/2/tutorial/inputoutput.html
You have to write some code to parse the file. If you ignore the cluster lines, you should be able to distinguish between family, genera and species based on indentation.
The easiest way is to define a named tuple:
import collections
Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])
You can make an instance of this object like this:
b = Bacterium('Brucellaceae', 'Brucella', 'canis')
Your parser should read the file line by line and keep track of the current family and genera. When it finds a species, it should add a Bacterium to a list:
with open('cluster0.txt', 'r') as infile:
    lines = infile.readlines()

family = None
genera = None
bacteria = []
for line in lines:
    # set family and genera.
    # if you detect a bacterium:
    bacteria.append(Bacterium(family, genera, species))
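For illustration only, here is one way those placeholder comments might be filled in, assuming one tab of indentation per rank (family, genus, species) as in the listings above; parse_cluster_file is a hypothetical helper name:

import collections

Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])

def parse_cluster_file(path):
    # Collect one Bacterium per species line; family and genera are
    # remembered from the most recent lines with a shallower indent.
    bacteria = []
    family = None
    genera = None
    with open(path, 'r') as infile:
        for line in infile:
            if not line.strip() or line.lstrip().startswith('Cluster'):
                continue                  # skip blanks and cluster headers
            depth = len(line) - len(line.lstrip())
            name = line.strip().split('(', 1)[0]
            if depth == 1:                # family line
                family = name
            elif depth == 2:              # genus line
                genera = name
            else:                         # species line
                bacteria.append(Bacterium(family, genera, name))
    return bacteria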
Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:
s = [b for b in bacteria if b.genera == 'Streptomycetaceae']
Comparing two clusterings is not a trivial task and reinventing the wheel is unlikely to be successful. Check out this package, which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).
The library is called CluSim and can be found here:
https://github.com/Hoosier-Clusters/clusim/
After learning so much from Stackoverflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, after which the comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be considered equivalent, since L1 and L2 essentially segment the items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above will result in L2==L1.
I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:
import numpy as np

def alignClusters(clstr1, clstr2):
    """Given 2 cluster assignments, this function will rename the second to
    maximize alignment of elements within each cluster. This method is
    described in Menéndez, Héctor D. A genetic approach to the graph and
    spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
    are consecutive integers starting with zero.)

    INPUTS:
    clstr1 - The first clustering assignment
    clstr2 - The second clustering assignment

    OUTPUTS:
    clstr2_temp - The second clustering assignment with clusters renumbered to
                  maximize alignment with the first clustering assignment"""
    K = np.max(clstr1) + 1
    simdist = np.zeros((K, K))

    for i in range(K):
        for j in range(K):
            dcix = clstr1 == i
            dcjx = clstr2 == j
            dd = np.dot(dcix.astype(int), dcjx.astype(int))
            simdist[i, j] = (dd / np.sum(dcix != 0) + dd / np.sum(dcjx != 0)) / 2

    mask = np.zeros((K, K))
    for i in range(K):
        simdist_vec = np.reshape(simdist.T, (K**2, 1))
        I = np.argmax(simdist_vec)
        xy = np.unravel_index(I, simdist.shape, order='F')
        x = xy[0]
        y = xy[1]
        mask[x, y] = 1
        simdist[x, :] = 0
        simdist[:, y] = 0

    swapIJ = np.unravel_index(np.where(mask.T), simdist.shape, order='F')
    swapI = swapIJ[0][1, :]
    swapJ = swapIJ[0][0, :]
    clstr2_temp = np.copy(clstr2)
    for k in range(swapI.shape[0]):
        swapj = [swapJ[k] == i for i in clstr2]
        clstr2_temp[swapj] = swapI[k]
    return clstr2_temp