How to get a family of independent universal hash functions? - python

I am trying to implement the hyperloglog counting algorithm using stochastic averaging. To do that, I need many independent universal hash functions to hash items in different substreams.
I found that there are only a few hash functions available in hashlib,
and there seems to be no way to provide a seed or anything like it. I am thinking of using different salts for different substreams.

You probably DON'T need different hash functions. A common solution to this problem is to use only part of the hash to compute the HyperLogLog rho statistic, and the other part to select the substream. If you use a good hash function (e.g. murmur3), it effectively behaves as multiple independent ones.
See the "stochastic averaging" section here for an explanation of this:
https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
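A minimal sketch of that bit-splitting idea, using hashlib only because the question mentions it (in practice a faster non-cryptographic hash such as murmur3 would be preferable); the precision p and the helper names are illustrative:

```python
import hashlib

P = 6  # 2**P = 64 substreams (buckets)

def split_hash(item, p=P, total_bits=64):
    """Hash once, then split the bits: the top p bits pick the substream,
    the remaining bits feed the rho (leading-zero) statistic."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")          # take 64 bits of the digest
    bucket = h >> (total_bits - p)                 # first p bits -> substream index
    rest = h & ((1 << (total_bits - p)) - 1)       # remaining bits -> rho input
    return bucket, rest

def rho(w, bits=64 - P):
    """1-based position of the leftmost 1-bit in the remaining bits."""
    return bits - w.bit_length() + 1 if w else bits + 1

bucket, rest = split_hash("some item")
print(bucket, rho(rest))
```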

Related

lmfit/scipy.optimize minimization methods description?

Is there any place with a brief description of each of the algorithms available for the method parameter of lmfit's minimize function? Neither the lmfit docs nor the SciPy documentation explains the details of each algorithm. Right now I know I can choose between them, but I don't know which one to choose...
My current problem
I am using lmfit in Python to minimize a function. I want to minimize the function within a finite and predefined range where the function has the following characteristics:
It is almost zero everywhere, to the point of being numerically indistinguishable from zero over most of the range.
It has a very, very sharp peak in some point.
The peak can be anywhere within the region.
This causes many minimization algorithms to fail. Right now I am using a combination of the brute-force method (method="brute") to find a point close to the peak, then feeding that value to the Nelder-Mead algorithm (method="nelder") to perform the final minimization. It works roughly 50% of the time; the other 50% of the time it fails to find the minimum. I wonder if there are better algorithms for cases like this one...
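A minimal sketch of the two-stage setup described above, with a made-up sharply peaked residual; the bounds, brute_step, and peak location are purely illustrative:

```python
import numpy as np
import lmfit

def residual(params):
    # Stand-in objective: numerically zero almost everywhere, with a very
    # narrow minimum near x = 3.7 (location chosen arbitrarily).
    x = params["x"].value
    return -np.exp(-((x - 3.7) ** 2) / 1e-6)

params = lmfit.Parameters()
params.add("x", value=0.0, min=0.0, max=10.0, brute_step=0.001)

# Stage 1: coarse grid search over the bounded range to land near the peak.
coarse = lmfit.minimize(residual, params, method="brute")

# Stage 2: polish with Nelder-Mead, starting from the brute-force optimum.
final = lmfit.minimize(residual, coarse.params, method="nelder")
print(final.params["x"].value)
```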
I think it is a fair point that docs for lmfit (such as https://lmfit.github.io/lmfit-py/fitting.html#fit-methods-table) and scipy.optimize (such as https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#optimization-scipy-optimize) do not give detailed mathematical descriptions of the algorithms.
Then again, most of the docs for scipy, numpy, and related libraries describe how to use the methods, but do not describe in much mathematical detail how the algorithms work.
In fairness, the different optimization algorithms share many features and the differences between them can get pretty technical. All of these methods try to minimize some metric (often called "cost" or "residual") by changing the values of parameters for the supplied function.
It sort of takes a textbook (or at least a Wikipedia page) to establish the concepts and mathematical terms used for these methods, and then a paper (or at least a Wikipedia page) to describe how each method differs from the others. So I think the basic answer would be to look up the different methods.

Using ideas from HashEmbeddings with sklearn's HashingVectorizer

Svenstrup et al. (2017) propose an interesting way to handle hash collisions in hashing vectorizers: use two different hash functions and concatenate their results before modeling.
They claim that the combination of multiple hash functions approximates a single hash function with much larger range (see section 4 of the paper).
I'd like to try this out with some text data I'm working with in sklearn. The idea would be to run the HashingVectorizer twice, with a different hash function each time, and then concatenate the results as an input to my model.
How might I do this with sklearn? There's no option to change the hash function used, but maybe I could modify the vectorizer somehow?
Or maybe there's a way I could achieve this with SparseRandomProjection?
HashingVectorizer in scikit-learn already includes a mechanism to mitigate hash collisions: the alternate_sign=True option. This adds a random sign during token summation, which improves the preservation of distances in the hashed space (see scikit-learn#7513 for more details).
By using N hash functions and concatenating the output, one would increase both n_features and the number of non-zero entries (nnz) in the resulting sparse matrix by a factor of N. In other words, each token will now be represented by N elements. This is quite wasteful memory-wise. In addition, since the run time of sparse array computations depends directly on nnz (and less so on n_features), this will have a much larger negative performance impact than only increasing n_features. I'm not sure that such an approach is very useful in practice.
If you nevertheless want to implement such a vectorizer, a few comments:
Because FeatureHasher is implemented in Cython, it is difficult to modify its functionality from Python without editing and re-compiling the code.
Writing a quick pure-Python implementation of HashingVectorizer could be one way to do it.
Otherwise, there is a somewhat experimental re-implementation of HashingVectorizer in the text-vectorize package. Because it is written in Rust (with Python bindings), other hash functions are easily accessible and could potentially be added.
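If you do want to experiment anyway, one pure-Python workaround is to salt the tokens before hashing, which emulates independent hash functions without touching the Cython FeatureHasher. A minimal sketch (the salts, n_features, and documents are made up, and whether salting is a good enough proxy for truly different hash functions is an assumption):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"]

def make_vectorizer(salt, n_features=2**18):
    # Salting every token before hashing plays the role of switching to a
    # different hash function, without modifying FeatureHasher itself.
    base = HashingVectorizer(n_features=n_features).build_analyzer()
    return HashingVectorizer(
        n_features=n_features,
        analyzer=lambda doc: [salt + tok for tok in base(doc)],
    )

X1 = make_vectorizer("h1_").fit_transform(docs)
X2 = make_vectorizer("h2_").fit_transform(docs)

# Column-wise concatenation of the two hashed representations, as in the paper.
X = sp.hstack([X1, X2]).tocsr()
print(X.shape)  # (2, 2 * 2**18)
```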

Optimizing/learning function in Python

I would like to create a function that, given a list of integers as input, returns a boolean based on that number. I would like it to use an algorithm to find the optimum cut-off value that optimizes the number of correct returns.
Is there some tool built-in with Python for this? Otherwise, how would I approach such a problem using Python? Preferably, I would want to learn how to do both.
This appears to be something that a linear machine-learning algorithm could solve. In fact, the Ordinary Least Squares linear model seems to follow the exact outline you provide: it uses an algorithm to match its output with your examples based on the numerical input, and the heuristic it attempts to minimize is the number of answers it gets wrong. If this is indeed the case, I believe scikit-learn is the library you want. As for learning how this is done, the document linked above will at least get you started.
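A minimal sketch of that idea with scikit-learn, swapping in LogisticRegression (a linear classifier, since the desired output is a boolean) and made-up training data; the learned decision boundary plays the role of the cut-off value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled examples: each integer and the boolean it should map to.
X = np.array([1, 3, 4, 7, 8, 10, 12, 15]).reshape(-1, 1)
y = np.array([False, False, False, False, True, True, True, True])

clf = LogisticRegression()
clf.fit(X, y)

# For a single feature, the decision boundary w*x + b = 0 is the learned cut-off.
cutoff = -clf.intercept_[0] / clf.coef_[0, 0]
print("learned cut-off:", cutoff)
print(clf.predict([[5], [9]]))  # expected: [False  True]
```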

How to measure similarity between two python code blocks?

Many would want to measure code similarity to catch plagiarism; my intention, however, is to cluster a set of Python code blocks (say, answers to the same programming question) into different categories and distinguish the different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.
You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what will turn out to be similar will then depend highly on how you defined the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
A hard problem is finding a hashing function that classifies code blocks in a way that matches our natural sense of similarity. Despite lots of research, nobody has yet found anything better to judge this than "I'll know they are similar when I see them."
You surely do not want it to be dependent on layout/indentation/whitespace/comments, or slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute a hash by counting various types of operators and operands, i.e., by computing a metric over the lexical tokens. These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (e.g., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries.
The sequence
return ID; } void ID ( int ID ) {
is an 11-gram that occurs frequently in C-like languages but clearly isn't a useful clone. The result is that false positives tend to occur, i.e., you get claimed matches where there isn't one. (A toy sketch of this token-normalization idea appears after this answer.)
Abstract syntax tree (AST) based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code containing token sequences of different lengths in the middle of a match, e.g., where one statement (of arbitrary size) is replaced by another.
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
[There are also graph-based schemes that match control and data flow graphs. They should arguably produce even better matches, but apparently do not do much better in practice.]
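A toy Python sketch of the lexical scheme (item 2 above): identifiers and literals are collapsed to fixed placeholders with the standard tokenize module, then N-grams of the normalized token stream are compared. The value of N, the placeholder names, and the Jaccard comparison are all just illustrative choices:

```python
import io
import keyword
import tokenize

def normalized_ngrams(source, n=4):
    """Collapse identifiers/literals to placeholders, then collect N-grams
    over the normalized token stream."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    toks = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skip:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            toks.append("ID")            # all identifiers look the same
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            toks.append("LIT")           # all literals look the same
        else:
            toks.append(tok.string)      # keywords, operators, punctuation
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

a = normalized_ngrams("def f(x):\n    return x + 1\n")
b = normalized_ngrams("def g(y):\n    return y + 2\n")
print(len(a & b) / len(a | b))  # Jaccard similarity of the N-gram sets -> 1.0
```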
One approach would be to count the number of functions, objects, and keywords, possibly grouped into categories such as branching, creating, manipulating, etc., and the number of variables of each type, without relying on the methods and variables having the same names.
For a given problem, similar approaches will tend to come out with similar scores for these; e.g., a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have far fewer.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
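A rough sketch of that counting idea using Python's ast module: count node types (branches, loops, calls, ...) rather than relying on names, then compare the resulting count vectors, here with cosine similarity (the metric choice is arbitrary):

```python
import ast
from collections import Counter
from math import sqrt

def node_counts(source):
    """Count AST node types (FunctionDef, If, For, Call, ...) in a code block,
    ignoring names, whitespace and comments entirely."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

code_a = "def f(xs):\n    return [x * 2 for x in xs if x > 0]\n"
code_b = (
    "def g(items):\n"
    "    out = []\n"
    "    for i in items:\n"
    "        if i > 0:\n"
    "            out.append(i * 2)\n"
    "    return out\n"
)

print(cosine(node_counts(code_a), node_counts(code_b)))
```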

Does Matlab's fminimax apply Pareto optimality?

I am working on multi-objective optimization in Matlab, and am using fminimax from the Optimization Toolbox. I want to know whether fminimax applies Pareto optimization, and if not, why not? Also, can you suggest a multi-objective optimization package in Matlab or Python that does use Pareto optimality?
For Python, DEAP may be what you're looking for. It has extensive documentation with a lot of real-life examples, and a really helpful Google Groups forum. It implements two robust MO algorithms: NSGA-II and SPEA-II.
Edit (as requested)
I am using DEAP for my MSc thesis, so I will let you know how we are using Pareto optimality. Setting DEAP up is pretty straightforward, as you will see in the examples. Use this one as a starting point. This is the short version, which uses the built-in algorithms and operators. Read both and then follow these guidelines.
As the OneMax example is single-objective, it doesn't use MO algorithms. However, it's easy to implement them:
Change your evaluation function so it returns an n-tuple with the desired scores. If you want to minimize the standard deviation too, something like return sum(individual), numpy.std(individual) would work.
Also, modify the weights parameter of the base.Fitness object so it matches that returned n-tuple. A positive float means maximization, while a negative one means minimization. You can use any real number, but I would stick with 1.0 and -1.0 for the sake of simplicity.
Change your genetic operators to cxSimulatedBinaryBounded(), mutPolynomialBounded() and selNSGA2(), for crossover, mutation and selection operations, respectively. These are the suggested methods, as they were developed by the NSGA-II authors.
If you want to use one of the embedded ready-to-go algorithms in DEAP, choose MuPlusLambda().
When calling the algorithm, remember to change the halloffame parameter from HallOfFame() to ParetoFront(). This will keep all non-dominated individuals, instead of the lexicographically sorted best individuals across all generations. You can then process your Pareto front as desired: weighted sum, custom lexicographic sorting, etc.
I hope that helps. Take into account that there's also a full, somewhat more advanced NSGA-II example available here. A condensed sketch of the guidelines above is shown below.
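A condensed, illustrative sketch of those guidelines; real-valued genes are used so the bounded SBX/polynomial operators make sense, and the genome length, population sizes, probabilities, and the second objective are arbitrary choices:

```python
import random
import numpy
from deap import algorithms, base, creator, tools

N = 30  # genome length (arbitrary)

# Two objectives: maximize the sum, minimize the standard deviation.
creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

def evaluate(individual):
    return sum(individual), numpy.std(individual)

toolbox = base.Toolbox()
toolbox.register("attr_float", random.random)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_float, N)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxSimulatedBinaryBounded, low=0.0, up=1.0, eta=20.0)
toolbox.register("mutate", tools.mutPolynomialBounded, low=0.0, up=1.0,
                 eta=20.0, indpb=1.0 / N)
toolbox.register("select", tools.selNSGA2)

pop = toolbox.population(n=100)
hof = tools.ParetoFront()  # keeps every non-dominated individual seen

pop, log = algorithms.eaMuPlusLambda(pop, toolbox, mu=100, lambda_=100,
                                     cxpb=0.6, mutpb=0.3, ngen=50,
                                     halloffame=hof, verbose=False)
print(len(hof), "non-dominated solutions on the Pareto front")
```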
For fminimax and fgoalattain, it looks like the answer is no. However, the genetic algorithm solver, gamultiobj, is Pareto-set based, though I'm not sure if it's the kind of multi-objective optimization function you want to use. gamultiobj implements the NSGA-II evolutionary algorithm. There's also this package that implements the Strength Pareto Evolutionary Algorithm 2 (SPEA2) in C with a Matlab MEX interface. It's a bit old, so you might want to recompile it (you'll need to anyway if you're not on 32-bit Windows).
