difflib.SequenceMatcher not returning unique ratio - python

I am trying to compare 2 street networks, and when I run this code it returns a ratio of .253529... for every row. I need it to compare each row and return a unique value so I can query out the streets that don't match. What can I do to get it to return a unique ratio value per row?
# Set local variables
inFeatures = gp.GetParameterAsText(0)
fieldName = gp.GetParameterAsText(1)
fieldName1 = gp.GetParameterAsText(2)
fieldName2 = gp.GetParameterAsText(3)
expression = difflib.SequenceMatcher(None,fieldName1,fieldName2).ratio()
# Execute CalculateField
arcpy.CalculateField_management(inFeatures, fieldName, expression, "PYTHON_9.3")

If you know both files always have the exact same number of lines, a simple approach like this would work:
import difflib

ratios = []
with open('fieldName1', 'r') as f1, open('fieldName2', 'r') as f2:
    for l1, l2 in zip(f1, f2):
        R = difflib.SequenceMatcher(None, l1, l2).ratio()
        ratios.append((l1, l2, R))
This will produce a list of tuples like this:
[("aa", "aa", 1), ("aa", "ab", 0.5), ...]
If your files are different sizes you'll need to find some way to match up the lines, or otherwise handle the mismatch.
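For example, if one file may be longer than the other, one simple option is itertools.zip_longest (izip_longest in Python 2), which pads the shorter file so unmatched lines just come out with a ratio of 0.0. A rough sketch, with the file names here as placeholders:

import difflib
from itertools import zip_longest  # izip_longest on Python 2

ratios = []
with open('streets_a.txt') as f1, open('streets_b.txt') as f2:  # placeholder names
    for l1, l2 in zip_longest(f1, f2, fillvalue=''):
        # a line with no counterpart is compared against '' and scores 0.0
        ratios.append((l1, l2, difflib.SequenceMatcher(None, l1, l2).ratio()))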


Finding all possible permutations of a hash when given list of grouped elements

Best way to show what I'm trying to do:
I have a list of different hashes that consist of ordered elements separated by an underscore. Each element may or may not have other possible replacement values. I'm trying to generate a list of all possible combinations of this hash after taking into account the replacement values.
Example:
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
original_hash = "1_2_3_4_5"
I want to be able to generate a list of the following hashes:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
"1a_2_3a_4_5",
"1b_2_3a_4_5",
]
The challenge is that this'll be needed on large dataframes.
So far here's what I have:
def return_all_possible_hashes(df, grouped_elements):
    rows_to_append = []
    for grouped_element in grouped_elements:
        for index, row in df[
            df["hash"].str.contains("|".join(grouped_element))
        ].iterrows():
            (element_used_in_hash,) = set(grouped_element) & set(row["hash"].split("_"))
            hash_used = row["hash"]
            replacement_elements = set(grouped_element) - set([element_used_in_hash])
            for replacement_element in replacement_elements:
                row["hash"] = hash_used.replace(
                    element_used_in_hash, replacement_element
                )
                rows_to_append.append(row)
    return df.append(rows_to_append)
But the problem is that this will only append hashes with all combinations of a given grouped_element, and not all combinations of all grouped_elements at the same time. So using the example above, my function would return:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
]
I feel like I'm not far from the solution, but I also feel stuck, so any help is much appreciated!
If you make a list of the original hash value's elements and replace each element with a list of all its possible variations, you can use itertools.product to get the Cartesian product across these sublists. Transforming each element of the result back to a string with '_'.join() will get you the list of possible hashes:
from itertools import product

def possible_hashes(original_hash, grouped_elements):
    hash_list = original_hash.split('_')
    variations = list(set().union(*grouped_elements))
    var_list = hash_list.copy()
    for i, h in enumerate(hash_list):
        if h in variations:
            for g in grouped_elements:
                if h in g:
                    var_list[i] = g
                    break
        else:
            var_list[i] = [h]
    return ['_'.join(h) for h in product(*var_list)]
possible_hashes("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
['1_2_3_4_5',
'1_2_3a_4_5',
'1a_2_3_4_5',
'1a_2_3a_4_5',
'1b_2_3_4_5',
'1b_2_3a_4_5']
To use this function on various original hash values stored in a dataframe column, you can do something like this:
df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))
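If you would rather end up with one row per generated hash instead of a list per row, pandas' DataFrame.explode (available since pandas 0.25) can expand the result. A small sketch, using possible_hashes from above and a toy dataframe standing in for your real one:

import pandas as pd

df = pd.DataFrame({"hash": ["1_2_3_4_5"]})  # toy dataframe, stands in for your real one
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]

expanded = (
    df.assign(hash=df["hash"].apply(lambda x: possible_hashes(x, grouped_elements)))
      .explode("hash")
      .reset_index(drop=True)
)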

Python: How to iterate through set of files based on file names?

I have a set of files named like this:
qd-p64-dZP-d64-z8-8nn.q
qd-p8-dPZ-d8-z1-1nn.q qq-p8-dZP-d8-z1-2nn.q
qd-p8-dPZ-d8-z1-2nn.q qq-p8-dZP-d8-z1-4nn.q
qd-p8-dPZ-d8-z1-4nn.q qq-p8-dZP-d8-z16-1nn.q
qd-p8-dPZ-d8-z16-1nn.q qq-p8-dZP-d8-z16-2nn.q
qd-p8-dPZ-d8-z16-2nn.q qq-p8-dZP-d8-z16-4nn.q
qd-p8-dPZ-d8-z16-4nn.q qq-p8-dZP-d8-z16-8nn.q
qd-p8-dPZ-d8-z16-8nn.q qq-p8-dZP-d8-z1-8nn.q
qd-p8-dPZ-d8-z1-8nn.q qq-p8-dZP-d8-z2-1nn.q
qd-p8-dPZ-d8-z2-1nn.q qq-p8-dZP-d8-z2-2nn.q
qd-p8-dPZ-d8-z2-2nn.q qq-p8-dZP-d8-z2-4nn.q
qd-p8-dPZ-d8-z2-4nn.q qq-p8-dZP-d8-z2-8nn.q
qd-p8-dPZ-d8-z2-8nn.q qq-p8-dZP-d8-z32-1nn.q
qd-p8-dPZ-d8-z32-1nn.q qq-p8-dZP-d8-z32-2nn.q
qd-p8-dPZ-d8-z32-2nn.q qq-p8-dZP-d8-z32-4nn.q
qd-p8-dPZ-d8-z32-4nn.q qq-p8-dZP-d8-z32-8nn.q
qd-p8-dPZ-d8-z32-8nn.q qq-p8-dZP-d8-z4-1nn.q
qd-p8-dPZ-d8-z4-1nn.q qq-p8-dZP-d8-z4-2nn.q
qd-p8-dPZ-d8-z4-2nn.q qq-p8-dZP-d8-z4-4nn.q
The information to iterate is given in the file names, for example:
Fix
dZP, 1nn, z2,
and vary
d
with values
{d8, d16, d32, d64}
Then, increase z value to get
dZP, 1nn, z4
and vary d again
{d8, d16, d32, d64}
Once I'm able to iterate like this I need to do some information processing from the files.
Looks like a good task for a generator. I just did it for d, z, and n, but it should be easy enough to generalize to all of your filename fields:
def filename_generator():
    l1 = ['d8', 'd16', 'd32', 'd64']
    l2 = ['z1', 'z2', 'z4', 'z8', 'z16', 'z32']
    l3 = ['1nn', '2nn', '4nn', '8nn']
    for n in l3:
        for z in l2:
            for d in l1:
                yield '%s-%s-%s.q' % (d, z, n)
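You could then drive your processing loop from that generator. A minimal sketch, assuming the two prefixes seen in your listing ('qd-p8-dPZ' and 'qq-p8-dZP') and a hypothetical process() function of your own; combinations that don't exist on disk are simply skipped:

import os

for suffix in filename_generator():
    for prefix in ('qd-p8-dPZ', 'qq-p8-dZP'):  # assumed fixed fields
        fname = '%s-%s' % (prefix, suffix)
        if os.path.exists(fname):  # not every combination is present
            with open(fname) as fh:
                process(fh)  # placeholder for your own information processing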
You could do something like the following. It may not be exactly what you want, since you've left some important details out of your question, but I've attempted to write it in a way that makes it easy for you to change as necessary depending on what you really want.
In a nutshell, it uses the re module to break each filename up into "fields" containing the numeric value found in each. These values are assigned corresponding names in a temporary dictionary, which is then used to create a namedtuple of the values with the desired field precedence. Other parts of the filename are ignored.
The initial filename list can be obtained from the file system using os.listdir() or glob.glob().
from collections import namedtuple
import re
filenames = ['qd-p64-dZP-d64-z8-8nn.q', 'qd-p8-dPZ-d8-z1-1nn.q',
'qd-p8-dPZ-d8-z1-2nn.q', 'qd-p8-dPZ-d8-z1-4nn.q',
'qd-p8-dPZ-d8-z16-1nn.q', 'qd-p8-dPZ-d8-z16-2nn.q',
'qd-p8-dPZ-d8-z16-4nn.q', 'qd-p8-dPZ-d8-z16-8nn.q',
'qd-p8-dPZ-d8-z1-8nn.q', 'qd-p8-dPZ-d8-z2-1nn.q',
'qd-p8-dPZ-d8-z2-2nn.q', 'qd-p8-dPZ-d8-z2-4nn.q',
'qd-p8-dPZ-d8-z2-8nn.q', 'qd-p8-dPZ-d8-z32-1nn.q',
'qd-p8-dPZ-d8-z32-2nn.q', 'qd-p8-dPZ-d8-z32-4nn.q',
'qd-p8-dPZ-d8-z32-8nn.q', 'qd-p8-dPZ-d8-z4-1nn.q',
'qd-p8-dPZ-d8-z4-2nn.q', 'qq-p8-dZP-d8-z1-2nn.q',
'qq-p8-dZP-d8-z1-4nn.q', 'qq-p8-dZP-d8-z16-1nn.q',
'qq-p8-dZP-d8-z16-2nn.q', 'qq-p8-dZP-d8-z16-4nn.q',
'qq-p8-dZP-d8-z16-8nn.q', 'qq-p8-dZP-d8-z1-8nn.q',
'qq-p8-dZP-d8-z2-1nn.q', 'qq-p8-dZP-d8-z2-2nn.q',
'qq-p8-dZP-d8-z2-4nn.q', 'qq-p8-dZP-d8-z2-8nn.q',
'qq-p8-dZP-d8-z32-1nn.q', 'qq-p8-dZP-d8-z32-2nn.q',
'qq-p8-dZP-d8-z32-4nn.q', 'qq-p8-dZP-d8-z32-8nn.q',
'qq-p8-dZP-d8-z4-1nn.q', 'qq-p8-dZP-d8-z4-2nn.q',
'qq-p8-dZP-d8-z4-4nn.q']
filename_order = ('p', 'd', 'z', 'nn') # order fields occur in the filenames
fieldname_order = ('z', 'd', 'p', 'nn') # desired field sort order
OrderedTuple = namedtuple('OrderedTuple', fieldname_order)
def keyfunc(filename):
    values = [int(value) for value in re.findall(r'-\D*(\d+)', filename)]
    parts = dict(zip(filename_order, values))
    return OrderedTuple(**parts)
filenames.sort(key=keyfunc) # sort filename list in-place
Resulting order of filenames in list:
['qd-p8-dPZ-d8-z1-1nn.q', 'qd-p8-dPZ-d8-z1-2nn.q', 'qq-p8-dZP-d8-z1-2nn.q',
'qd-p8-dPZ-d8-z1-4nn.q', 'qq-p8-dZP-d8-z1-4nn.q', 'qd-p8-dPZ-d8-z1-8nn.q',
'qq-p8-dZP-d8-z1-8nn.q', 'qd-p8-dPZ-d8-z2-1nn.q', 'qq-p8-dZP-d8-z2-1nn.q',
'qd-p8-dPZ-d8-z2-2nn.q', 'qq-p8-dZP-d8-z2-2nn.q', 'qd-p8-dPZ-d8-z2-4nn.q',
'qq-p8-dZP-d8-z2-4nn.q', 'qd-p8-dPZ-d8-z2-8nn.q', 'qq-p8-dZP-d8-z2-8nn.q',
'qd-p8-dPZ-d8-z4-1nn.q', 'qq-p8-dZP-d8-z4-1nn.q', 'qd-p8-dPZ-d8-z4-2nn.q',
'qq-p8-dZP-d8-z4-2nn.q', 'qq-p8-dZP-d8-z4-4nn.q',
'qd-p64-dZP-d64-z8-8nn.q', 'qd-p8-dPZ-d8-z16-1nn.q',
'qq-p8-dZP-d8-z16-1nn.q', 'qd-p8-dPZ-d8-z16-2nn.q',
'qq-p8-dZP-d8-z16-2nn.q', 'qd-p8-dPZ-d8-z16-4nn.q',
'qq-p8-dZP-d8-z16-4nn.q', 'qd-p8-dPZ-d8-z16-8nn.q',
'qq-p8-dZP-d8-z16-8nn.q', 'qd-p8-dPZ-d8-z32-1nn.q',
'qq-p8-dZP-d8-z32-1nn.q', 'qd-p8-dPZ-d8-z32-2nn.q',
'qq-p8-dZP-d8-z32-2nn.q', 'qd-p8-dPZ-d8-z32-4nn.q',
'qq-p8-dZP-d8-z32-4nn.q', 'qd-p8-dPZ-d8-z32-8nn.q',
'qq-p8-dZP-d8-z32-8nn.q']
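If you then want to iterate in runs where z is fixed while the other fields vary, as described in the question, itertools.groupby over the sorted list gives you those runs, since z is the primary sort key here. A small sketch reusing keyfunc from above:

from itertools import groupby

for z_value, group in groupby(filenames, key=lambda name: keyfunc(name).z):
    print("z = %d" % z_value)
    for fname in group:
        print("    %s" % fname)  # open and process each file here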

Algorithmic / coding help for a PySpark markov model

I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions that work fine for sequences of a couple thousand, but when we get into the 20,000+ range (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it..
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point.. where I return a dictionary (for each row of data) that I then serialize and store in an in memory database.
{500:
{HNLLNH : 0.333},
{LNHMLH : 0.333},
{MLHHML : 0.333},
{LNHHNL : 0.000},
etc..
}
So in essence, each sequence is combined with the next (HNL,LNH become 'HNLLNH'), then for all possible transitions (combinations of sequences) we count their occurrence and then divide by the total number of transitions (3 in this case) and get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH.. So for HNLLNH, 1/3 = 0.333
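To make those numbers concrete, here is the same counting in plain Python on the sample row (just an illustration of the target output, not the Spark code):

from collections import Counter

seq = ["HNL", "LNH", "MLH", "HML"]  # the example row with ID 500
pairs = [seq[i] + seq[i + 1] for i in range(len(seq) - 1)]  # ['HNLLNH', 'LNHMLH', 'MLHHML']
total = len(pairs)  # 3 transitions
freqs = {k: v / float(total) for k, v in Counter(pairs).items()}
# {'HNLLNH': 0.333..., 'LNHMLH': 0.333..., 'MLHHML': 0.333...}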
As a side note, and I'm not sure if it's relevant, but the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code had previously done was to collect() the rdd, and map it a couple times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc.. so I ended up with something like this..
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counted the total occurrence of that key in the full list, and divided by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with its ID number (resulting in the 2nd code block, up above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows. Probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great..
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y)-1):
        s = s + str(y[i] + str(y[i+1])) + ","
    return str(cust_id + ',' + s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id = str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list = []
    middle = int((len(trans[0])+1)/2)
    for i in trans:
        temp_list.append((''.join(i)[:middle], ''.join(i)[middle:]))
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i)/(len(temp_list))
    my_dict = {}
    my_dict[cust_id] = state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0], s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict

def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k, v in x.iteritems():
            db_insert(k, v)
You can try something like the code below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1

    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),  # Join strings
        Counter,  # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )

    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions I would recommend using a sparse representation and ignore zeros. Moreover you can replace dictionaries with sparse vectors to reduce memory footprint.
You can achieve this result with pure PySpark; that is how I did it.
To create the frequencies, let's say you have already reached the point where these are your input RDDs:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get pair counts like (HNL, LNH), (LNH, MLH), ...:
inputRDD.map(lambda (k, states): get_frequencies(states)).flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)

def get_frequencies(states_list):
    """
    :param states_list: It's a list of customer states.
    :return: State frequencies list.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx+1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results like
((HNL, LNH), 98), ((LNH, MLH), 458), ...
After this you may convert the result RDDs into DataFrames, or you can insert directly into the DB from the RDDs using mapPartitions.

Identifying coordinate matches from two files using python

I've got two sets of data describing atomic positions. They're in separate files that I would like to compare, the aim being to identify matching atoms by their coordinates. The data looks like the following in both cases, and there are going to be up to 1000 or so entries. The files are of different lengths since they describe different-sized systems, and have the following format:
1 , 0.000000000000E+00 0.000000000000E+00
2 , 0.000000000000E+00 2.468958660000E+00
3 , 0.000000000000E+00 -2.468958660000E+00
4 , 2.138180920454E+00 -1.234479330000E+00
5 , 2.138180920454E+00 1.234479330000E+00
The first column is the entry ID, the second is the x,y coordinate pair.
What I'd like to do is compare the coordinates in both sets of data, identify matches and the corresponding ID eg "Entry 3 in file 1 corresponds to Entry 6 in file 2." I'll be using this information to alter the coordinate values within file 2.
I've read the files line by line, split each line into two entries, and put them into a list, but I'm a bit stumped as to how to specify the comparison part, particularly telling it to compare the second entries only whilst still being able to call the first entry. I'd imagine it would require looping?
Code looks like this so far:
open1 = open('./3x3supercell_coord_clean','r')
openA = open('./6x6supercell_coord_clean','r')

small_list = []
for line in open1:
    stripped_small_line = line.strip()
    column_small = stripped_small_line.split(",")
    small_list.append(column_small)

big_list = []
for line in openA:
    stripped_big_line = line.strip()
    column_big = stripped_big_line.split(",")
    big_list.append(column_big)

print small_list[2][1]  # prints out coords only
Use a dictionary with coordinates as keys.
data1 = """1 , 0.000000000000E+00 0.000000000000E+00
2 , 0.000000000000E+00 2.468958660000E+00
3 , 0.000000000000E+00 -2.468958660000E+00
4 , 2.138180920454E+00 -1.234479330000E+00
5 , 2.138180920454E+00 1.234479330000E+00"""
# Read data1 into a list of tuples (id, x, y)
coords1 = [(int(line[0]), float(line[2]), float(line[3])) for line in
           (line.split() for line in data1.split("\n"))]

# This dictionary will map (x, y) -> id
coordsToIds = {}

# Add coords1 to this dictionary.
for id, x, y in coords1:
    coordsToIds[(x, y)] = id

# Read coords2 the same way.
# Left as an exercise to the reader.

# Look up each of coords2 in the dictionary.
for id, x, y in coords2:
    if (x, y) in coordsToIds:
        print(coordsToIds[(x, y)])  # the ID in coords1
Beware that comparing floats is always a problem.
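One common workaround is to round the coordinates to a fixed number of decimals before using them as dictionary keys, so values that differ only by floating-point noise still land on the same key (the precision here is an assumption; pick what suits your data):

def coord_key(x, y, ndigits=6):
    # round both coordinates so tiny floating-point differences map to the same key
    return (round(x, ndigits), round(y, ndigits))

coordsToIds = {}
coordsToIds[coord_key(0.0, 2.468958660000E+00)] = 2
print(coord_key(1e-12, 2.46895866) in coordsToIds)  # True: treated as the same point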
If all you are doing is comparing the second element of each entry in two lists, you can check every coordinate against every coordinate in the opposite file. This is definitely not the fastest way to go about it, but it should get you the results you need. It scans through small_list and checks every small_entry[1] (the coordinate) against every coordinate for each entry in big_list:
for small_entry in small_list:
    for big_entry in big_list:
        if small_entry[1] == big_entry[1]:
            print(small_entry[0] + " matches " + big_entry[0])
something like this?
Build two dictionaries the following way:
# do your splitting to populate two dictionaries of this format:
# mydata1[coordinate] = ID
# i.e.
mydata1 = {}
mydata2 = {}
for line in data1.splitlines():
    fields = line.split()
    coord = fields[2] + ' ' + fields[3]
    id = fields[0]
    mydata1[coord] = id
for line in data2.splitlines():
    fields = line.split()
    coord = fields[2] + ' ' + fields[3]
    id = fields[0]
    mydata2[coord] = id

# then we can use set intersection to find all coordinates in both key sets
set1 = set(mydata1.keys())
set2 = set(mydata2.keys())
intersect = set1.intersection(set2)
for coordinate in intersect:
    print ' '.join(["Coordinate", str(coordinate), "found in set1 id",
                    mydata1[coordinate], "and set2 id", mydata2[coordinate]])
Here's an approach that uses dictionaries:
coords = {}
with open('first.txt', 'r') as first_list:
    for i in first_list:
        pair = [j for j in i.split(' ') if j]
        coords[','.join(pair[2:4])] = pair[0]
        # reformatted coords used as key: "2.138180920454E+00,-1.234479330000E+00"

with open('second.txt', 'r') as second_list:
    for i in second_list:
        pair = [j for j in i.split(' ') if j]
        if ','.join(pair[2:4]) in coords:
            # reformatted coords from second list checked for presence in keys of dictionary
            print coords[','.join(pair[2:4])], pair[0]
What's going on here is that each of your coordinates from file A (which you have stated will be distinct) gets stored in a dictionary as a key. Then the first file is closed and the second file is opened. The second list's coordinates get read, reformatted to match how the dictionary keys are stored, and checked for membership. If the coordinate string from list B is in the dictionary coords, the pair exists in both lists. It then prints the IDs from the first and second lists for that match.
Dictionary lookups are much faster, O(1). This approach also has the advantage of not needing to hold all the data in memory in order to check (just one list), as well as not worrying about type-casting, e.g. float/int conversions.

Matching strings for multiple data sets in Python

I am working in Python and I need to match the strings of several data files. First I used pickle to unpack my files, and then I placed them into lists. I only want to match strings that have the same conditions. These conditions are indicated at the end of the string.
My working script looks approximately like this:
import pickle

f = open("data_a.dat")
list_a = pickle.load(f)
f.close()

f = open("data_b.dat")
list_b = pickle.load(f)
f.close()

f = open("data_c.dat")
list_c = pickle.load(f)
f.close()

f = open("data_d.dat")
list_d = pickle.load(f)
f.close()

for a in list_a:
    for b in list_b:
        for c in list_c:
            for d in list_d:
                if a.GetName()[12:] in b.GetName():
                    if a.GetName()[12:] in c.GetName():
                        if a.GetName()[12:] in d.GetName():
                            "do whatever"
This seems to work fine for these 2 lists. The problems begin when I try to add 8 or 9 more data files for which I also need to match the same conditions. The script simply won't process, and it gets stuck. I appreciate your help.
Edit: Each of the lists contains histograms named after the parameters that were used to create them. The name of each histogram contains these parameters and their values at the end of the string. In the example I did it for 2 data sets; now I would like to do it for 9 data sets without using multiple loops.
Edit 2: I just expanded the code to reflect more accurately what I want to do. Now if I try to do that for 9 lists, it not only looks horrible, it also doesn't work.
Off the top of my head:
files = ["file_a", "file_b", "file_c"]
sets = []
for f in files:
f = open("data_a.dat")
sets.append(set(pickle.load(f)))
f.close()
intersection = sets[0].intersection(*sets[1:])
EDIT: Well I overlooked your mapping to x.GetName()[12:], but you should be able to reduce your problem to set logic.
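That reduction could look something like this sketch: key each list by the condition string GetName()[12:] and intersect the key sets (assuming the condition sits at the same offset in every name, and using the list_a..list_d loaded above):

# one {condition_suffix: object} dict per pickled list
keyed = [dict((obj.GetName()[12:], obj) for obj in lst)
         for lst in (list_a, list_b, list_c, list_d)]

# condition strings present in every list
common = set(keyed[0]).intersection(*keyed[1:])

for cond in common:
    matched = [d[cond] for d in keyed]  # one object per list, all sharing the condition
    # "do whatever" with matched here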
Here is a small piece of code you can take inspiration from. The main idea is the use of a recursive function.
For simplicity's sake, I assume the data is already loaded in lists, but you can get it from files beforehand:
data_files = [
'data_a.dat',
'data_b.dat',
'data_c.dat',
'data_d.dat',
'data_e.dat',
]
lists = [pickle.load(open(f)) for f in data_files]
And because I don't really get the details of what you need to do, my goal here is to find matches on the first four characters:
def do_whatever(string):
    print "I have matched the string '%s'" % string

lists = [
    ["hello", "world", "how", "grown", "you", "today", "?"],
    ["growl", "is", "a", "now", "on", "appstore", "too bad"],
    ["I", "wish", "I", "grow", "Magnum", "mustache", "don't you?"],
]

positions = [0 for i in range(len(lists))]

def recursive_match(positions, lists):
    strings = map(lambda p, l: l[p], positions, lists)
    match = True
    searched_string = strings.pop(0)[:4]
    for string in strings:
        if searched_string not in string:
            match = False
            break
    if match:
        do_whatever(searched_string)
    # increment positions:
    new_positions = positions[:]
    lists_len = len(lists)
    for i, l in enumerate(reversed(lists)):
        max_position = len(l) - 1
        list_index = lists_len - i - 1
        current_position = positions[list_index]
        if max_position > current_position:
            new_positions[list_index] += 1
            break
        else:
            new_positions[list_index] = 0
            continue
    return new_positions, not any(new_positions)

search_is_finished = False
while not search_is_finished:
    positions, search_is_finished = recursive_match(positions, lists)
Of course you can optimize a lot of things here, this is draft code, but take a look at the recursive function, this is a major concept.
In the end I used the built-in map function. I realize now I should have been even more explicit than I was (which I will do in the future).
My data files are histograms with 5 parameters, some with 3 or 4. Something like this,
par1=["list with some values"]
par2=["list with some values"]
par3=["list with some values"]
par4=["list with some values"]
par5=["list with some values"]
I need to examine the behavior of the plotted quantity for each possible combination of the parameter values. In the end, I get a data file with ~300 histograms, each identified in its name by the corresponding parameter values and the sample name. It looks something like,
datasample1-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample1-"permutation of the above values"
...
datasample9-par1=val1-par2=val2-par3=val3-par4=val4-par5=val5
datasample9-"permutation of the above values"
So I get 300 histograms for each of the 9 data files, but luckily all of these histograms are created in the same order. Hence I can pair all of them just using the built-in map function. I unpack the data files, put each one into a list, and then use map to pair each histogram with its corresponding configuration in the other data samples.
for lst in map(None, data1_histosli, data2_histosli, ...data9_histosli):
    do_something(lst)
This solves my problem. Thank you to all for your help!
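As a side note, map(None, ...) only works in Python 2, where it pads the shorter lists with None; in Python 3 the closest equivalent is itertools.zip_longest. A rough sketch using the list names from above (only three of the nine lists are shown to keep it short):

from itertools import zip_longest

for lst in zip_longest(data1_histosli, data2_histosli, data9_histosli):
    do_something(lst)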
