Python: create a polynomial of degree n

I have a feature set
[x1, x2, ..., xm]
Now I want to create a polynomial feature set from it.
What that means is that if the degree is two, then I have the feature set
[x1, ..., xm, x1^2, x2^2, ..., xm^2, x1x2, x1x3, ..., x1xm, ..., xm-1xm]
So it contains all terms up to order 2.
The same goes if the degree is three: then you will have cubic terms as well.
How to do this?
Edit 1: I am working on a machine learning project where I have close to 7 features, and a non-linear regression on these features is giving OK results. Hence I thought that, to get more features, I could map them to a higher dimension.
So one way is to consider polynomial order of the feature vector...
Also, generating x1*x1 is easy :) but getting the rest of the combinations is a bit trickier.
Can combinations give me x1x2x3 result if the order is 3?

Use
itertools.combinations(features, r)
where features is the feature set and r is the order of the desired polynomial terms. Then multiply the elements of each tuple it yields; that should give you {x1*x2, x1*x3, ...}. You'll still need to construct the terms with repeated factors (such as x1^2) separately, then union all the parts.
[Edit]
Better: itertools.combinations_with_replacement(features, r) will give you sorted length-r tuples directly, with repeated elements allowed.
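As a minimal sketch of this idea (the function name and the numeric toy values are illustrative, not from the question):

```python
import itertools
from functools import reduce
from operator import mul

def polynomial_features(features, degree):
    """Return all products of `degree` features, repetition allowed."""
    new_feats = []
    for combo in itertools.combinations_with_replacement(features, degree):
        # multiply the chosen features together, e.g. (x1, x1) -> x1*x1
        new_feats.append(reduce(mul, combo))
    return new_feats

# degree-2 terms for the 3-feature vector [2, 3, 5]:
# pairs are (2,2), (2,3), (2,5), (3,3), (3,5), (5,5)
print(polynomial_features([2, 3, 5], 2))  # [4, 6, 10, 9, 15, 25]
```

To build the full feature set, union the original features with the output for each degree from 2 up to n.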

You could use itertools.product to create all possible n-tuples of values chosen from the original set; but keep in mind that this will generate (x2, x1) as well as (x1, x2).
Similarly, itertools.combinations will produce sets without repetition or re-ordering, but that means you won't get (x1, x1) for example.
What exactly are you trying to do? What do you need these result values for? Are you sure you do want those x1^2 type terms (what does it mean to have the same feature more than once)? What exactly is a "feature" in this context anyway?

Using Karl's answer as inspiration, try using product and then deduplicating with a set. Note that the inner collections must be hashable, so use frozenset rather than set. Something like:
set(frozenset(comb) for comb in itertools.product(range(5), range(5)))
This will get rid of recurring pairs. Then you can turn the set back into a list and sort it or iterate over it as you please.
EDIT:
This will actually kill the x_m^2 terms (frozenset collapses repeated elements), so build sorted tuples instead of sets; tuples are hashable and keep the repeats while still de-duplicating reorderings:
set(tuple(sorted(comb)) for comb in itertools.product(range(5), range(5)))

Related

How to define the grid (for using grid search) from scratch in Python?

I want to write a function that accepts a dictionary of parameter names (could be any number > 0) and numeric values, and returns a list of dictionaries with all the possible combinations.
For example:
myFunc({'ParA':[1,2,3], 'ParB': [0.1,0.2,0.3,0.4]})
Will return
[{'ParA':1, 'ParB':0.1},{'ParA':1, 'ParB':0.2}...{'ParA':2, 'ParB':0.1}....{'ParA':3, 'ParB':0.4}]
Normally I'd use nested loops, but I do not know the number of parameters in advance.
Another challenge is that the number of possible values can differ between parameters, so the value lists in the dictionary are not all the same length.
Is there a smart way of doing that?
Note: it has to be from scratch, so no grid-search functions from SciPy
Here is the solution based on Julien's answer (itertools.product):
import itertools

def makeGrid(pars_dict):
    keys = pars_dict.keys()
    combinations = itertools.product(*pars_dict.values())
    ds = [dict(zip(keys, cc)) for cc in combinations]
    return ds
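For reference, a self-contained version of that function with a small usage example (parameter names are the ones from the question):

```python
import itertools

def makeGrid(pars_dict):
    keys = pars_dict.keys()
    # product(*values) expands to one nested loop per parameter,
    # however many parameters there are
    combinations = itertools.product(*pars_dict.values())
    return [dict(zip(keys, cc)) for cc in combinations]

grid = makeGrid({'ParA': [1, 2, 3], 'ParB': [0.1, 0.2]})
print(len(grid))   # 6: one dict per combination
print(grid[0])     # {'ParA': 1, 'ParB': 0.1}
```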

How to create a list of numbers as user input and then find the mean, median and mode of the provided list

So I want to create functions for the median, mean and mode of a list.
The list must be a user input. How would I go about this? Thanks.
You do not have to create functions for median, mean, and mode, because they are already implemented and can be called explicitly from the NumPy and SciPy libraries in Python. Implementing these functions yourself would mean "reinventing the wheel"; it could lead to errors and take time. Feel free to use libraries, because in most cases they are tested and safe to use. For example:
import numpy as np
from scipy import stats
mylist = [0,1,2,3,3,4,5,6]
median = np.median(mylist)
mean = np.mean(mylist)
mode = int(stats.mode(mylist)[0])
To get user input you should use input(). See https://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/io.html
If this is supposed to be homework, I'll give you some hints:
mean: iterate through the list, calculate the sum of the elements and divide by the element count.
median: first sort the list in increasing order, then find out whether its length is even or odd. If odd, return the center element; if even, take the center element and the one next to it and return their average.
mode: first create a 'helper' list containing the distinct elements of the input list. Then create a function with one parameter: a value whose occurrences in the input list are to be counted. Run this function in a for loop over the distinct elements, saving each result as a tuple of (element value, element count). Afterwards you should have an array of tuples; select the tuple with the maximum element count and return the corresponding element value.
Please note that these are just fast hints that can be useful in order to create your own implementation based on the right algorithm you prefer. This could be a good exercise to get started with algorithms and data structures, I hope you'll not skip it:) Good luck!
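Following those hints, one possible from-scratch sketch (function names are mine; try writing your own version before peeking):

```python
def my_mean(nums):
    total = 0
    for n in nums:                 # sum of the elements...
        total += n
    return total / len(nums)       # ...divided by the element count

def my_median(nums):
    s = sorted(nums)               # sort in increasing order first
    mid = len(s) // 2
    if len(s) % 2 == 1:            # odd length: the center element
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2  # even: average the two center elements

def my_mode(nums):
    distinct = set(nums)           # 'helper' collection of distinct elements
    # (count, value) tuples; count first so max() compares counts
    counts = [(nums.count(v), v) for v in distinct]
    return max(counts)[1]          # value with the maximum count

data = [0, 1, 2, 3, 3, 4, 5, 6]
print(my_mean(data), my_median(data), my_mode(data))  # 3.0 3.0 3
```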

cross-similarity between users in python

I have a user group list UserGroupA = [CustomerA_id1, CustomerA_id2, ...] containing 1000 users, a user group list UserGroupB = [CustomerB_id1, CustomerB_id2, ...] containing 10000 users, and a similarity function defined for any pair of users from UserGroupA and UserGroupB:
Similarity(CustomerA_id(k), CustomerB_id(l)), where k and l are indices for users in Group A and B.
My objective is to find the 1000 users from Group B most similar to the users in Group A, and CrossSimilarity below is how I plan to determine that. Is there a more efficient way to do it, especially as the size of Group B increases?
CrossSimilarity = [0] * 10000
for i in range(10000):
    for j in range(1000):
        CrossSimilarity[i] += Similarity(CustomerA_id[j], CustomerB_id[i])
CrossSimilarity.sort()
It really depends on the Similarity function and how much time it takes. I expect it will heavily dominate your runtime, but without a runtime profile, it's hard to say. I have some general advice only:
Have a look at how you calculate Similarity and whether you can improve the process by doing everyone from group A, or B in one go rather than starting from scratch.
There are some micro-optimisations you can do: for example, += will be a tiny bit faster, and caching CustomerB_id[i] in the outer loop helps as well. You can likely squeeze some time out of your similarity function the same way. But I wouldn't expect this time to matter.
If your code is pure Python and CPU-heavy, you could try compiling it with Cython, or running it in PyPy instead of the standard CPython interpreter.
Since what you are doing is basically a matrix multiplication between the two lists (UserGroupA and UserGroupB), a more efficient and faster way to perform it in memory could be to use the scikit-learn module, which provides the function:
sklearn.metrics.pairwise.pairwise_distances(X, Y, metric='euclidean')
where X = UserGroupA and Y = UserGroupB, and in the metric field you can use one of sklearn's built-in metrics or pass your own callable.
It will return a distance matrix D such that D_{i, k} is the distance between the ith array from X and the kth array from Y.
Then, to find the top 1000 similar users, you can simply transform the matrix into a list and sort it.
It is maybe a little more involved than your solution, but it should be faster :)
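A minimal NumPy sketch of that matrix view (toy data with made-up shapes; a plain dot product stands in for your Similarity function):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))   # 5 Group-A users, 3 features each (toy data)
B = rng.random((8, 3))   # 8 Group-B users

# similarity matrix: S[i, j] = similarity of B user i to A user j,
# computed in one vectorised call instead of nested Python loops
S = B @ A.T              # shape (8, 5)

# aggregate similarity of each B user to all of Group A, then take the top 2
totals = S.sum(axis=1)
top = np.argsort(totals)[::-1][:2]
print(top)               # indices of the 2 most similar Group-B users
```

With real data you would replace the dot product with your own metric (or pass a callable to pairwise_distances) and take the top 1000 instead of 2.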

Optimal algorithm for covering all non-ordered 3-tuples in a set of N, using M-tuples (M<N)

I have a set of N items, of which I need to cover all 3-tuples.
I can do this using M-tuples, and it's important that I use a minimal number of M-tuples, because for each one I need to make an expensive API call.
For example:
N = 5, M=4
Set = {1,2,3,4,5}
A minimal set of non-ordered 4-tuples will be:
{{1,2,3,4},{2,3,4,5},{1,2,3,5},{1,3,4,5}} - and it covers all possible 3-tuples from the set
Is there a known algorithm (or better yet a python library :)) that does it?
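No algorithm to offer, but the claim in the example is easy to check with itertools: every 3-subset of {1..5} is contained in at least one of the four 4-tuples:

```python
from itertools import combinations

cover = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3, 5}, {1, 3, 4, 5}]

# every one of the C(5,3) = 10 triples must be a subset of some 4-tuple
for triple in combinations(range(1, 6), 3):
    assert any(set(triple) <= block for block in cover), triple
print("all 10 triples covered")
```

The same check can serve as a correctness test for whatever covering construction you end up using.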

Calculate difference between two values (python)

If I have a variable x that holds a sequence of numbers (floats), how can I calculate the difference between all adjacent numbers (e.g. x[1] - x[0], x[2] - x[1], and so on up to the last term)?
Look at what you've written down in your question. The answer is there staring at you.
[x[i+1]-x[i] for i in range(len(x)-1)]
One of the nicest things about Python is that it has declarative features. You can often get what you want just by describing it; you don't always have to explicitly give the recipe.
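A small sketch of the same idea on made-up data, including a zip-based variant that avoids indexing:

```python
x = [3.0, 4.5, 7.0, 6.0]

# index-based, as in the answer above
diffs = [x[i + 1] - x[i] for i in range(len(x) - 1)]
print(diffs)      # [1.5, 2.5, -1.0]

# equivalent: zip the list with itself shifted by one element
diffs_zip = [b - a for a, b in zip(x, x[1:])]
print(diffs_zip)  # [1.5, 2.5, -1.0]
```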
