Converting nested dict into a single unchanging unique number - python

I'm working on a minimax algorithm project and I am trying to find a way to save board values in a text file so they don't need to be calculated over and over again each time the program is tested. I have the board stored as a nested dictionary.
rows = {
4:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
3:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
2:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
1:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
}
I tried doing this, which gives the desired result but is not at all optimized and I'm sure there is a way to do this better.
rows = {
4:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
3:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
2:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
1:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
}
e = []
for key in rows:
e.append(list(rows[key].values()))
e=str(e)
e=e.replace ("[",""); e=e.replace ("]","")
e=e.replace (" ","")
e=e.replace (",","")
print(e)

You could make use of a str.join(), map is used to convert integers to strings:
res = ''.join(''.join(map(str, r.values())) for r in rows.values())
print(res)
Out:
00000000000000000000000000000000

Related

Finding all possible permutations of a hash when given list of grouped elements

Best way to show what I'm trying to do:
I have a list of different hashes that consist of ordered elements, seperated by an underscore. Each element may or may not have other possible replacement values. I'm trying to generate a list of all possible combinations of this hash, after taking into account replacement values.
Example:
grouped_elements = [["1", "1a", "1b"], ["3", "3a"]]
original_hash = "1_2_3_4_5"
I want to be able to generate a list of the following hashes:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
"1a_2_3a_4_5",
"1b_2_3a_4_5",
]
The challenge is that this'll be needed on large dataframes.
So far here's what I have:
def return_all_possible_hashes(df, grouped_elements)
rows_to_append = []
for grouped_element in grouped_elements:
for index, row in enriched_routes[
df["hash"].str.contains("|".join(grouped_element))
].iterrows():
(element_used_in_hash,) = set(grouped_element) & set(row["hash"].split("_"))
hash_used = row["hash"]
replacement_elements = set(grouped_element) - set([element_used_in_hash])
for replacement_element in replacement_elements:
row["hash"] = stop_hash_used.replace(
element_used_in_hash, replacement_element
)
rows_to_append.append(row)
return df.append(rows_to_append)
But the problem is that this will only append hashes with all combinations of a given grouped_element, and not all combinations of all grouped_elements at the same time. So using the example above, my function would return:
[
"1_2_3_4_5",
"1a_2_3_4_5",
"1b_2_3_4_5",
"1_2_3a_4_5",
]
I feel like I'm not far from the solution, but I also feel stuck, so any help is much appreciated!
If you make a list of the original hash value's elements and replace each element with a list of all its possible variations, you can use itertools.product to get the Cartesian product across these sublists. Transforming each element of the result back to a string with '_'.join() will get you the list of possible hashes:
from itertools import product
def possible_hashes(original_hash, grouped_elements):
hash_list = original_hash.split('_')
variations = list(set().union(*grouped_elements))
var_list = hash_list.copy()
for i, h in enumerate(hash_list):
if h in variations:
for g in grouped_elements:
if h in g:
var_list[i] = g
break
else:
var_list[i] = [h]
return ['_'.join(h) for h in product(*var_list)]
possible_hashes("1_2_3_4_5", [["1", "1a", "1b"], ["3", "3a"]])
['1_2_3_4_5',
'1_2_3a_4_5',
'1a_2_3_4_5',
'1a_2_3a_4_5',
'1b_2_3_4_5',
'1b_2_3a_4_5']
To use this function on various original hash values stored in a dataframe column, you can do something like this:
df['hash'].apply(lambda x: possible_hashes(x, grouped_elements))

Apply operation and a division operation in the same step using Python

I am trying to get proportion of nouns in my text using the code below and it is giving me an error. I am using a function that calculates the number of nouns in my text and I have the overall word count in a different column.
pos_family = {
'noun' : ['NN','NNS','NNP','NNPS']
}
def check_pos_tag(x, flag):
cnt = 0
try:
for tag,value in x.items():
if tag in pos_family[flag]:
cnt +=value
except:
pass
return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" in the first run and get the noun count and include it again and run, it works fine but if I run it for the first time I get the below error.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas series, but you need to use a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']

Python Linear Search Better Efficiency

I've got a question regarding Linear Searching in Python. Say I've got the base code of
for l in lines:
for f in search_data:
if my_search_function(l[1],[f[0],f[2]]):
print "Found it!"
break
in which we want to determine where in search_data exists the value stored in l[1]. Say my_search_function() looks like this:
def my_search_function(search_key, search_values):
for s in search_values:
if search_key in s:
return True
return False
Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.
for line in lines:
negative_index = -1
positive_index = 0
middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
found = False
while positive_index < middle_element:
# print str(positive_index)+","+str(negative_index)
if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
print "Found it!"
break
positive_index = positive_index +1
negative_index = negative_index -1
However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing speed in half as I'm working with large amounts of CSV and the processing time for one file is > 00:15 which is unacceptable as I'm processing batches of 30+ files. Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking down the values like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!
Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.
Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.
Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)
Code to illustrate the second approach:
lines = ['AS12', 'AS123', 'AS123J', 'AS123JK','AS124']
search_keys = ['AS123', 'AS125']
try:
iter_keys = iter(sorted(search_keys))
key = next(iter_keys)
for line in sorted(lines):
if line.startswith(key):
print('Line {} matches {}'.format(line, key))
else:
while key < line[:len(key)]:
key = next(iter_keys)
except StopIteration: # all keys processed
pass
Depends on problem detail.
For instance if you search for complete words, you could create a hashtable on searchable elements, and the final search would be a simple lookup.
Filling the hashtable is pseudo-linear.
Ultimately, I was broke down and implemented Binary Search on my multidimensional lists by sorting using the sorted() function with a lambda as a key argument.Here is the first pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were
def binary_search(master_row, source_data,master_search_index, source_search_index):
lower_bound = 0
upper_bound = len(source_data) - 1
found = False
while lower_bound <= upper_bound and not found:
middle_pos = (lower_bound + upper_bound) // 2
if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
if search([source_data[middle_pos][source_search_index]],[master_row[master_search_index]]):
return {"result": True, "index": middle_pos}
break
lower_bound = middle_pos + 1
elif source_data[middle_pos][source_search_index] > master_row[master_search_index] :
if search([master_row[master_search_index]],[source_data[middle_pos][source_search_index]]):
return {"result": True, "index": middle_pos}
break
upper_bound = middle_pos - 1
else:
if len(source_data[middle_pos][source_search_index]) > 5:
return {"result": True, "index": middle_pos}
else:
break
and then where we actually make the Binary Search call
#where master_copy is the first multidimensional list, data_copy is the second
#the search columns are the columns we want to search against
for line in master_copy:
for m in master_search_columns:
found = False
for d in data_search_columns:
data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
results = binary_search(line, data_copy,m, d)
found = results["result"]
if found:
line = update_row(line, data_copy[results["index"]], column_mapping)
found_count = found_count +1
break
if found:
break
Here's the info for sorting a multidimensional list Python Sort Multidimensional Array Based on 2nd Element of Subarray

Python - Count JSON elements before extracting data

I use an API which gives me a JSON file structured like this:
{
offset: 0,
results: [
{
source_link: "http://www.example.com/1",
source_link/_title: "Title example 1",
source_link/_source: "/1",
source_link/_text: "Title example 1"
},
{
source_link: "http://www.example.com/2",
source_link/_title: "Title example 2",
source_link/_source: "/2",
source_link/_text: "Title example 2"
},
...
And I use this code in Python to extract the data I need:
import json
import urllib2
u = urllib2.urlopen('myapiurl')
z = json.load(u)
u.close
link = z['results'][1]['source_link']
title = z['results'][1]['source_link/_title']
The problem is that to use it I have to know the number of the element from which I'm extracting the data. My results can have different length every time, so what I want to do is to count the number of elements in results at first, so I would be able to set up a loop to extract data from each element.
To check the length of the results key:
len(z["results"])
But if you're just looping around them, a for loop is perfect:
for result in x["results"]:
print(result["source_link"])
You didn't need to know the length of the result, you are fine with a for loop:
for result in z['results']:
# process the results here
Anyway, if you want to know the length of 'results': len(z.results)
If you want to get the length, you can try:
len(z['result'])
But in python, what we usually do is:
for i in z['result']:
# do whatever you like with `i`
Hope this helps.
You don't need, or likely want, to count them in order to loop over them, you could do:
import json
import urllib2
u = urllib2.urlopen('myapiurl')
z = json.load(u)
u.close
for result in z['results']:
link = result['source_link']
title = result['source_link/_title']
# do something with link/title
Or you could do:
u = urllib2.urlopen('myapiurl')
z = json.load(u)
u.close
link = [result['source_link'] for result in z['results']]
title = [result['source_link/_title'] for result in z['results']]
# do something with links/titles lists
Few pointers:
No need to know results's length to iterate it. You can use for result in z['results'].
lists start from 0.
If you do need the index take a look at enumerate.
use this command to print the result on the terminal and then can check the number of results
print(len(z['results'][0]))

Optimizing searches in very large csv files

I have a csv file with a single column, but 6.2 million rows, all containing strings between 6 and 20ish letters. Some strings will be found in duplicate (or more) entries, and I want to write these to a new csv file - a guess is that there should be around 1 million non-unique strings. That's it, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. Any script I've written so far takes at least a week (!) to run, according to some timings I did.
First try:
in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv','w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)
# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
ref_dict[row] = in_list_1[row]
# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
Peptide = ref_dict.pop(n)
if Peptide in ref_dict.values(): # Non-unique peptides
Peptide_list.append(Peptide)
else:
Uniques.append(Peptide) # Unique peptides
for m in range(len(Peptide_list)):
Write_list = (str(Peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
writer_1.writerow(Write_list)
Second try:
in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
writer_1 = csv.writer(out_file_1)
ref_dict = {}
for row in range(len(in_list_1)):
Peptide = in_list_1[row]
if Peptide in ref_dict.values():
write = (in_list_1[row],'')
writer_1.writerow(write)
else:
ref_dict[row] = in_list_1[row]
EDIT: here's a few lines from the csv file:
SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR
Do it with Numpy. Roughly:
import numpy as np
column = 42
mat = np.loadtxt("thefile", dtype=[TODO])
uniq = set(np.unique(mat[:,column]))
for row in mat:
if row[column] not in uniq:
print row
You could even vectorize the output stage using numpy.savetxt and the char-array operators, but it probably won't make very much difference.
First hint : Python has support for lazy evaluation, better to use it when dealing with huge datasets. So :
iterate over your csv.reader instead of building a huge in-memory list,
don't build huge in-memory lists with ranges - use enumerate(seq) instead if you need both the item and index, and just iterate over your sequence's items if you don't need the index.
Second hint : the main point of using a dict (hashtable) is to lookup on keys, not values... So don't build a huge dict that's used as a list.
Third hint : if you just want a way to store "already seen" values, use a Set.
I'm not so good in Python, so I don't know how the 'in' works, but your algorithm seems to run in n².
Try to sort your list after reading it, with an algo in n log(n), like quicksort, it should work better.
Once the list is ordered, you just have to check if two consecutive elements of the list are the same.
So you get the reading in n, the sorting in n log(n) (at best), and the comparison in n.
Although I think that the numpy solution is the best, I'm curious whether we can speed up the given example. My suggestions are:
skip csv.reader costs and just read the line
rb to skip the extra scan needed to fix newlines
use bigger file buffer sizes (read 1Meg, write 64K is probably good)
use the dict keys as an index - key lookup is much faster than value lookup
I'm not a numpy guy, so I'd do something like
in_file_1 = open('UniProt Trypsinome (full).csv','rb', 1048576)
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+', 65536)
ref_dict = {}
for line in in_file_1:
peptide = line.rstrip()
if peptide in ref_dict:
out_file_1.write(peptide + '\n')
else:
ref_dict[peptide] = None

Categories