I would like to generate all combinations of length n, for a list of k variables. I can do this as follows:
import itertools
import pandas as pd
from sklearn import datasets
dataset = datasets.load_breast_cancer()
X = dataset.data
y = dataset.target
df = pd.DataFrame(X, columns=dataset.feature_names)
features = dataset.feature_names
x = set(['mean radius', 'mean texture'])
for s in itertools.combinations(features, 3):
    if x.issubset(s):
        print(s)
len(features) is 30, so this generates 4060 combinations for n=3; for n=10 there are already 30,045,015 combinations:
len(tuple(itertools.combinations(features, 10)))  # 30045015
Each of these combinations is then evaluated against the conditional statement. However, for n > 10 this becomes infeasible.
Instead of generating all combinations and then filtering them by some condition, as in this example, is it possible to generate only the combinations that satisfy the condition?
In other words, generate all combinations where n = 3, 4, 5, ..., k, given that 'mean radius' and 'mean texture' appear in the combination?
Just generate the combinations without 'mean radius' and 'mean texture' and add those two to every combination, which greatly reduces the number of combinations. This way you don't have to filter; every combination generated will be useful.
# remove the fixed features from the pool:
features = set(features) - x
for s in itertools.combinations(features, n - len(x)):
    s = set(s) | x  # add the fixed features back to each combination
    print(s)
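A rough sketch of how much this shrinks the search space, using math.comb for the counts (the n value and the islice preview are just for illustration):
import itertools
import math

from sklearn import datasets

features = set(datasets.load_breast_cancer().feature_names)  # 30 features
fixed = {'mean radius', 'mean texture'}
pool = features - fixed

n = 10  # desired combination length
# brute force: C(30, 10) candidates, most of which get filtered out
print(math.comb(len(features), n))           # 30045015
# constrained: C(28, 8) candidates, every one of them valid
print(math.comb(len(pool), n - len(fixed)))  # 3108105

# every generated combination already contains the fixed features
for s in itertools.islice(itertools.combinations(pool, n - len(fixed)), 3):
    print(fixed | set(s))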
I want to generate random numbers from two different ranges, [0, 0.3) and [0.7, 1), in Python.
numpy.random.uniform only allows drawing from a single interval.
I assume you want to choose an interval with probability weighted by its size, then sample uniformly from the chosen interval. In that case, the following Python code will do this:
import random

# Define the intervals. They should be disjoint.
intervals = [[0, 0.05], [0.7, 1]]

# Choose one number uniformly inside the set
random.uniform(*random.choices(intervals,
    weights=[r[1] - r[0] for r in intervals])[0])
import numpy

# Generate a NumPy array of a given size
size = 1000
numpy.asarray([
    random.uniform(*random.choices(intervals,
        weights=[r[1] - r[0] for r in intervals])[0])
    for i in range(size)
])
Note that the intervals you give, [[0, 0.3], [0.7, 1]], appear to be arbitrary; this solution works for any number of disjoint intervals, and it samples uniformly at random from the union of those intervals.
How about this one? It draws uniform samples over the combined length of both intervals and then shifts every value that lands past the end of the first interval into the second one.
import numpy as np

first_interval = np.array([0, 0.3])
second_interval = np.array([0.7, 1])
total_length = np.ptp(first_interval) + np.ptp(second_interval)

n = 100
numbers = np.random.random(n) * total_length
numbers += first_interval.min()
# shift the values that overflow the first interval into the second one
numbers[numbers > first_interval.max()] += second_interval.min() - first_interval.max()
Refer to the related thread on picking from multiple ranges; the answer there boils down to the following (note that it picks integers from the given ranges rather than floats):
import random

def random_of_ranges(*ranges):
    # flatten all the ranges into a single list of candidate values
    all_ranges = [value for r in ranges for value in r]
    return random.choice(all_ranges)

print(random_of_ranges(range(65, 90), range(97, 122)))
You can just concatenate random numbers from those two intervals.
import numpy as np

rng = np.random.default_rng(12345)
a = rng.uniform(0, 0.3, 1000)
b = rng.uniform(0.7, 1, 1000)
my_rnd = np.concatenate([a, b])
These look fairly uniform across the two intervals.
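The two intervals in the question happen to have equal length, so equal sample sizes are fine here; if the lengths differed, you would want to draw each sample in proportion to its interval's length so the combined result stays uniform over the union. A small sketch of that idea (the counts and variable names are illustrative):
import numpy as np

rng = np.random.default_rng(12345)
total = 10_000
len_a, len_b = 0.3 - 0.0, 1.0 - 0.7  # interval lengths (equal here, but they need not be)

# decide how many of the samples should land in the first interval
n_a = rng.binomial(total, len_a / (len_a + len_b))

a = rng.uniform(0, 0.3, n_a)
b = rng.uniform(0.7, 1, total - n_a)
my_rnd = rng.permutation(np.concatenate([a, b]))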
I have a house price prediction dataset, and I have to split it into train and test sets.
I would like to know if it is possible to do this using NumPy or SciPy?
I cannot use scikit-learn at the moment.
I know that your question was only about doing a train_test_split with NumPy or SciPy, but there is actually a very simple way to do it with pandas:
import pandas as pd
# Shuffle your dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
For those who would like a fast and easy solution.
Although this is an old question, this answer might help.
This is roughly how sklearn implements train_test_split; the method given below takes similar arguments to sklearn's.
import numpy as np
from itertools import chain


def _indexing(x, indices):
    """
    :param x: array from which indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from the given array and indices
    """
    # numpy array indexing
    if hasattr(x, 'shape'):
        return x[indices]
    # list indexing
    return [x[idx] for idx in indices]


def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: size of the test set in the range (0, 1)
    :param shuffle: whether to shuffle the arrays or not
    :param random_seed: random seed value
    :return: list of 2*len(arrays) arrays, each input split into train and test
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length * test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))
Of course sklearn's implementation supports stratified k-fold splitting, splitting of pandas Series, etc. This one only works for splitting lists and numpy arrays, which I think should work for your case.
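A quick usage sketch of the function above (the toy arrays are made up for illustration):
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# returns X_train, X_test, y_train, y_test, mirroring sklearn's calling convention
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_seed=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  # (7, 2) (3, 2) (7,) (3,)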
This solution uses pandas and NumPy only:
import numpy as np


def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[valid_set_size:valid_set_size + test_set_size]
    # train gets whatever is left after the validation and test slices
    train_indices = shuffled_indices[valid_set_size + test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]


train_set, valid_set, test_set = split_train_valid_test(dataset, valid_ratio=0.2, test_ratio=0.2)
print(len(train_set), len(valid_set), len(test_set))
## out: 12384 4128 4128
This code should work (assuming X_data is a pandas DataFrame):
import numpy as np

num_of_rows = int(len(X_data) * 0.8)
values = X_data.values
np.random.shuffle(values)          # shuffles the rows in place to make the split random
train_data = values[:num_of_rows]  # rows for training data
test_data = values[num_of_rows:]   # rows for test data
Hope this helps!
import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"],
            axis=1, inplace=True)  # important to drop prices as well

# create a random train/test split
indices = np.arange(X_data.shape[0])
num_training_instances = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_instances]
test_indices = indices[num_training_instances:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]
This assumes you want a random split. What happens is that we create an array of indices as long as the number of data points, i.e. the length of the first axis of X_data (or Y_data). We then put them in random order and take the first 80% of those shuffled indices as training data and the rest for testing. [:num_training_instances] just selects the first num_training_instances entries from the array. After that you extract the rows from your data using the lists of random indices, and your data is split. Remember to drop the prices from X_data, and set a seed if you want the split to be reproducible (np.random.seed(some_integer) at the beginning), as sketched below.
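A minimal sketch of the reproducible version, assuming the same X_data frame as above and an arbitrary seed value:
import numpy as np

np.random.seed(42)  # fix the RNG so the shuffle below is repeatable

indices = np.arange(X_data.shape[0])
np.random.shuffle(indices)
num_training_instances = int(0.8 * X_data.shape[0])

# identical runs with the same seed reproduce exactly the same train/test rows
X_data_train = X_data.iloc[indices[:num_training_instances]]
X_data_test = X_data.iloc[indices[num_training_instances:]]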
I am using sklearn DBSCAN to cluster my data as follows.
from sklearn.cluster import DBSCAN

# Apply DBSCAN (sims == my data as a list of lists)
db1 = DBSCAN(min_samples=1, metric='precomputed').fit(sims)
db1_labels = db1.labels_
db1n_clusters_ = len(set(db1_labels)) - (1 if -1 in db1_labels else 0)

# Returns the number of clusters (e.g., 10 clusters)
print('Estimated number of clusters: %d' % db1n_clusters_)
Now I want to get the top 3 clusters sorted by size (the number of data points in each cluster). How can I obtain the cluster sizes in sklearn?
Another option would be to use numpy.unique:
import numpy as np

db1_labels = db1.labels_
labels, counts = np.unique(db1_labels[db1_labels >= 0], return_counts=True)
# labels of the three largest clusters
print(labels[np.argsort(-counts)[:3]])
Well, you can use the bincount function in NumPy to get the frequencies of the labels. For example, using the DBSCAN example from the scikit-learn documentation:
import numpy as np

# Store the labels
labels = db.labels_

# Then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels >= 0])
print(counts)
# Output : [243 244 245]
Then, to get the top values, use argsort in NumPy. In our example, since there are only 3 clusters, I will extract the top 2:
top_labels = np.argsort(-counts)[:2]
print(top_labels)
# Output : [2 1]

# To get their respective frequencies
print(counts[top_labels])
I'm trying to generate all possible bidirectional graphs given a set of nodes. I'm storing my graphs as numpy vectors, and I need them that way because some downstream code consumes the graphs in this format.
Let's say I have two sets of nodes, where nodes within the same set do not connect. It is, however, possible to have graphs in which members of the two sets do not meet at all.
posArgs = [0, 1]  # (0->1) / (1->0) are not allowed (same set); neither are self-loops such as (1->1)
negArgs = [2]     # (0->2) is possible, and so is (0 2), meaning no connection at all
To illustrate what I mean, the following vector translates into a matrix like this:
import numpy as np

graph = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0])
singleGraph = np.vstack(np.array_split(np.array(graph), nargs))
>> [[0 0 1]
    [0 0 1]
    [0 0 0]]
# Where the first row represents node 0's relationships with nodes 0, 1, 2,
# the second row represents node 1's relationships with nodes 0, 1, 2, etc.
What I want is to generate all the possible ways in which these two sets can be connected (i.e. all the possible node combinations as vectors). At the moment I use itertools.product to generate the power set of all possible edges, then I identify the vectors that contain circular connections or same-set connections and delete them from the power set. Using the sets above, I have the following code:
import numpy as np
import itertools

posArgs = [0, 1]  # same-set connections (0->1, 1->0) and self-loops are not allowed
negArgs = [2]
nargs = len(posArgs + negArgs)

allPermutations = np.array(list(itertools.product([0, 1], repeat=nargs * nargs)))

# Create a list of attacks that we will never need: circular attacks,
# and attacks between arguments of the same polarity
circularAttacks = (np.arange(0, nargs * nargs, nargs + 1)).tolist()
samePolarityAttacks = []
posList = list(itertools.permutations(posArgs, 2))
negList = list(itertools.permutations(negArgs, 2))
totList = posList + negList
for l in totList:
    ptn = ((l[0] + 1) * nargs) - ((nargs + 1) - l[1]) + 1  # the odd +1s account for the 0-based index
    samePolarityAttacks.append(ptn)

graphsToDelete = np.unique([circularAttacks + samePolarityAttacks])
subGraphs = allPermutations[:, graphsToDelete]
cutDownGraphs = np.delete(allPermutations, (np.where(subGraphs > 0)[0]).tolist(), axis=0)

for graph in cutDownGraphs:
    singleGraph = np.vstack(np.array_split(np.array(graph), nargs))
    print(singleGraph)
My problem is that once I have more than a handful of nodes across my two sets, this blows up: even 5 nodes already means itertools.product produces 2^25 vectors (one entry per cell of the 5x5 adjacency matrix). This is of course really expensive and quickly runs out of memory.
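To put rough numbers on the blow-up (a back-of-the-envelope check with illustrative set sizes, not from the original question): the full product enumerates 2^(nargs*nargs) vectors, while only the cross-set positions can actually vary, so enumerating just those would be dramatically smaller.
posArgs = list(range(5))      # e.g. 5 nodes in the first set (illustrative)
negArgs = list(range(5, 10))  # and 5 in the second
nargs = len(posArgs) + len(negArgs)

full = 2 ** (nargs * nargs)                # every cell of the adjacency matrix is free
allowed = 2 * len(posArgs) * len(negArgs)  # only cross-set edges (in both directions) can vary
reduced = 2 ** allowed

print(f"{full:.3e} vs {reduced:.3e}")      # 1.268e+30 vs 1.126e+15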
Are you aware of a smart way in which I can reshape this code while ensuring my graphs stay in this numpy array format?
--Additional info:
For two sets with one node each, all possible combinations look as follows:
Thanks
I was using StratifiedKFold from scikit-learn, but now I also need to watch for "groups". There is a nice function, GroupKFold, but my data are very time-dependent. Similarly to the example in the help, the week number is the grouping index, and each week should appear in only one fold.
Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.
Shuffling is meant in the group sense, so whole groups should be shuffled among each other.
Is there an elegant way to do this with scikit-learn? It seems to me that GroupKFold is insensitive to shuffling the data first.
If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.
Matrix, label and groups are the inputs.
EDIT: This solution does not work.
I think using sklearn.utils.shuffle is an elegant solution!
For data in X, y and groups:
from sklearn.utils import shuffle
X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=0)
Then use X_shuffled, y_shuffled and groups_shuffled with GroupKFold:
from sklearn.model_selection import GroupKFold
group_k_fold = GroupKFold(n_splits=10)
splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
Of course, you probably want to shuffle multiple times and do the cross-validation with each shuffle. You could put the entire thing in a loop - here's a complete example with 5 shuffles (and only 3 splits instead of your required 10):
import numpy as np
from sklearn.utils import shuffle
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]

n_shuffles = 5
group_k_fold = GroupKFold(n_splits=3)

for i in range(n_shuffles):
    X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=i)
    splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
    # do something with splits here, I'm just printing them out
    print('Shuffle', i)
    print('groups_shuffled:', groups_shuffled)
    for train_idx, val_idx in splits:
        print('Train:', train_idx)
        print('Val:', val_idx)
The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds)
In GroupKFold, the groups array has the same shape as the data (one group label per sample).
For data in X, y and groups:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.array([[1, 2, 1, 1], [3, 4, 7, 8], [5, 6, 1, 3], [7, 8, 4, 7]])
y = np.array([0, 2, 1, 2])
groups = np.array([2, 1, 0, 1])

group_kfold = GroupKFold(n_splits=len(np.unique(groups)))
group_kfold.get_n_splits(X, y, groups)

param_grid = {
    'min_child_weight': [50, 100],
    'subsample': [0.1, 0.2],
    'colsample_bytree': [0.1, 0.2],
    'max_depth': [2, 3],
    'learning_rate': [0.01],
    'n_estimators': [100, 500],
    'reg_lambda': [0.1, 0.2]
}

xgb = XGBClassifier()
grid_search = GridSearchCV(xgb, param_grid, cv=group_kfold.split(X, y, groups), n_jobs=-1)
result = grid_search.fit(X, y)
Here is a performant solution that essentially reassigns the values of the keys in a way that respects the original groups.
Code is shown below, but the 4 steps are:
Shuffle the grouping-key vector. The key goal here is to rearrange the order in which each grouping key first appears.
Use np.unique() to return the first_index values for each unique key and the inverse_index values that could be used to reconstruct the grouping-key vector.
Use fancy indexing (the first_index values indexed by the inverse indexes) to construct a new array of grouping keys, in which each grouping key has been transformed to a number representing the position at which it first shows up in the shuffled grouping vector.
This new vector of grouping keys can be used in the standard GroupKFold splitter to get a different set of splits than the original because you have reordered the grouping indexes.
To give a quick example, imagine your original grouping-key vector was [3, 1, 1, 5, 3, 5]; this procedure would create the new grouping-key vector [0, 1, 1, 3, 0, 3]. Each key is replaced by the position at which it first appears in the vector: the 3's become 0's because a 3 sits at position 0, the 1's become 1's because a 1 first shows up at position 1, and the 5's become 3's because a 5 first shows up at position 3. The exact values don't matter; as long as you shuffle the keys, you get a different relabelling of the grouping keys, leading to a different set of splits by GroupKFold.
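That relabelling can be checked directly (a quick illustration, separate from the code below):
import numpy as np

keys = np.array([3, 1, 1, 5, 3, 5])
_, first_index, inverse = np.unique(keys, return_index=True, return_inverse=True)
print(first_index[inverse])  # [0 1 1 3 0 3]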
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Say that A is the official grouping key
A = list(range(10)) + list(range(10))
B = list(range(20))
y = np.zeros(20)

X = pd.DataFrame({
    'group': A,
    'var': B
})

X = X.sample(frac=1)  # shuffle the rows, and with them the grouping keys

original_grouping_vector = X['group']
unique_values, indexes, inverse = np.unique(original_grouping_vector, return_inverse=True, return_index=True)
new_grouping_vector = indexes[inverse]  # This is where the magic happens!

splitter = GroupKFold()
for train, test in splitter.split(X, y, groups=new_grouping_vector):
    print(X.iloc[test, :])
The above will print out different splits upon shuffling because the grouping-keys are being reordered, causing the value of new_grouping_vector to change.