I want to classify cracks by their depth.
To do that, I store the following features in a data frame:
WindowsDf = pd.DataFrame(dataForWindowsDf, columns=['IsCrack', 'CheckTypeEncode', 'DepthCrack',
                                                    'WindowOfInterest'])
# dataForWindowsDf is a list built iteratively from csv files.
# The windows data frame is built from this list.
So my target column is 'DepthCrack', and the other columns form the feature vector.
WindowOfInterest is a column of 2D lists - each one is a list of points, a graph representing a test (the electromagnetic wave returned from the surface as a function of time):
[[0.9561600000000001, 0.10913097635410397], [0.95621,0.1100000]...]
The problem I faced is how to train a model using a column of 2D lists (I tried to pass the column as-is and it didn't work).
What approach do you suggest for dealing with this?
I thought about extracting features from the 2D list to get one-dimensional features (integral, etc.) - for example, something like the sketch below.
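A minimal sketch of that idea, assuming each window is a list of [time, value] pairs (the particular summary statistics are only placeholders, not a recommendation):

import numpy as np
import pandas as pd

def window_features(window):
    # Collapse one list of [time, value] points into a few scalar features.
    arr = np.asarray(window, dtype=float)   # shape (n_points, 2)
    t, v = arr[:, 0], arr[:, 1]
    return {
        'integral': np.trapz(v, t),         # area under the curve
        'v_mean': v.mean(),
        'v_max': v.max(),
        't_at_max': t[v.argmax()],          # time of the peak
    }

# Expand the column of lists into scalar feature columns:
features = WindowsDf['WindowOfInterest'].apply(window_features).apply(pd.Series)
WindowsDf = pd.concat([WindowsDf.drop(columns='WindowOfInterest'), features], axis=1)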
You could split this one feature into two, so that WindowOfInterest becomes:
WindowOfInterest_x1 and WindowOfInterest_x2
For example, starting from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'IsCrack': [1, 1, 1, 1, 1],
... 'CheckTypeEncode': [0, 1, 0, 0, 0],
... 'DepthCrack': [0.4, 0.2, 1.4, 0.7, 0.1],
... 'WindowOfInterest': [[0.9561600000000001, 0.10913097635410397], [0.95621,0.1100000], [0.459561, 0.635410397], [0.4495621,0.32], [0.621,0.2432]]},
... index = [0, 1, 2, 3, 4])
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest
0 1 0 0.4 [0.9561600000000001, 0.10913097635410397]
1 1 1 0.2 [0.95621, 0.11]
2 1 0 1.4 [0.459561, 0.635410397]
3 1 0 0.7 [0.4495621, 0.32]
4 1 0 0.1 [0.621, 0.2432]
We can split the list like so:
>>> df[['WindowOfInterest_x1','WindowOfInterest_x2']] = pd.DataFrame(df['WindowOfInterest'].tolist(), index=df.index)
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest WindowOfInterest_x1 WindowOfInterest_x2
0 1 0 0.4 [0.9561600000000001, 0.10913097635410397] 0.956160 0.109131
1 1 1 0.2 [0.95621, 0.11] 0.956210 0.110000
2 1 0 1.4 [0.459561, 0.635410397] 0.459561 0.635410
3 1 0 0.7 [0.4495621, 0.32] 0.449562 0.320000
4 1 0 0.1 [0.621, 0.2432] 0.621000 0.243200
To finish, we can drop the WindowOfInterest column:
>>> df = df.drop(['WindowOfInterest'], axis=1)
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest_x1 WindowOfInterest_x2
0 1 0 0.4 0.956160 0.109131
1 1 1 0.2 0.956210 0.110000
2 1 0 1.4 0.459561 0.635410
3 1 0 0.7 0.449562 0.320000
4 1 0 0.1 0.621000 0.243200
Now you can pass WindowOfInterest_x1 and WindowOfInterest_x2 as features to your model.
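As a follow-up, a hedged sketch of feeding those columns into a regressor for DepthCrack (the choice of model here is arbitrary, just to show the shape of the data):

from sklearn.ensemble import RandomForestRegressor

# Illustrative only: any regressor works once the features are scalar columns.
X = df[['IsCrack', 'CheckTypeEncode', 'WindowOfInterest_x1', 'WindowOfInterest_x2']]
y = df['DepthCrack']

model = RandomForestRegressor(random_state=0)
model.fit(X, y)
print(model.predict(X[:3]))  # predictions for the first three rows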
I'm trying to predict probabilities for X_test and I get two values per row in an array. I need to compare those two values and reduce them to a single 0 or 1.
When I write
y_pred = classifier.predict_proba(X_test)
y_pred
It gives output like
array([[0.5, 0.5],
[0.6, 0.4],
[0.7, 0.3],
...,
[0.5, 0.5],
[0.4, 0.6],
[0.3, 0.7]])
We know that if a value is >= 0.5 then it's a 1, and if it's less than 0.5 it's a 0.
I converted the above array into a pandas DataFrame using the code below:
proba = pd.DataFrame(proba)
proba.columns = [['pred_0', 'pred_1']]
proba.head()
And output is
pred_0 pred_1
0 0.5 0.5
1 0.6 0.4
2 0.7 0.3
3 0.4 0.6
4 0.3 0.7
How do I iterate over these rows and write a condition that returns 1 when the row's value in column 1 is greater than or equal to the value in column 2, and 0 when it is less?
For example, given the data frame above, the output should be:
output
0 0
1 1
2 1
3 1
4 1
You could just map over your initial array, without converting it to a pandas DataFrame, so that it returns True when the first value of each subarray is >= 0.5 and False otherwise, and finally convert the result to int:
>>> import numpy as np
>>> a = np.array([[0.5, 0.5], [0.6, 0.4], [0.3, 0.7]])
>>> a
array([[0.5, 0.5],
[0.6, 0.4],
[0.3, 0.7]])
>>> result = map(lambda x:int(x[0] >= 0.5), a)
>>> print(list(result))
[1, 1, 0]
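The same result without Python-level iteration, in case the array is large (a small NumPy-only variant):

>>> (a[:, 0] >= 0.5).astype(int)
array([1, 1, 0])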
Compare the two columns to create a boolean mask, then convert it to int using astype:
Option 1:
df['output'] = (df['pred_0'] >= df['pred_1']).astype(int)
Option 2:
df['output'] = df['pred_0'].ge(df['pred_1']).astype(int)
Or via np.where:
Option 3:
df['output'] = np.where(df['pred_0'] >= df['pred_1'], 1, 0)
Option 4:
df['output'] = np.where(df['pred_0'].ge(df['pred_1']), 1, 0)
pred_0 pred_1 output
0 0.5 0.5 1
1 0.6 0.4 1
2 0.7 0.3 1
3 0.4 0.6 0
4 0.3 0.7 0
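As a side note, if the end goal is simply a hard class label per row, the column index of the largest probability gives it directly, with no thresholding (a small sketch, assuming y_pred holds the predict_proba output):

import numpy as np

# Index of the largest probability in each row (ties resolve to the first column).
hard_labels = np.argmax(y_pred, axis=1)

# For most scikit-learn classifiers this is what classifier.predict(X_test)
# returns, mapped back to the original labels via classifier.classes_.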
I am trying to find the most valuable features by applying feature selection methods to my dataset. I'm using the SelectKBest function for now. I can generate the score values and sort them as I want, but I don't understand exactly how this score value is calculated. I know that, in theory, a higher score is more valuable, but I need a mathematical formula or an example to calculate the score so I can understand it in depth.
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(dataValues, dataTargetEncoded)
feat_importances = pd.Series(fit.scores_, index=dataValues.columns)
topFatures = feat_importances.nlargest(50).copy().index.values
print("TOP 50 Features (Best to worst) :\n")
print(topFatures)
Thank you in advance
Say you have one feature and a target with 3 possible values
X = np.array([3.4, 3.4, 3. , 2.8, 2.7, 2.9, 3.3, 3. , 3.8, 2.5])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
X y
0 3.4 0
1 3.4 0
2 3.0 0
3 2.8 1
4 2.7 1
5 2.9 1
6 3.3 2
7 3.0 2
8 3.8 2
9 2.5 2
First we binarize the target
y = LabelBinarizer().fit_transform(y)
X y1 y2 y3
0 3.4 1 0 0
1 3.4 1 0 0
2 3.0 1 0 0
3 2.8 0 1 0
4 2.7 0 1 0
5 2.9 0 1 0
6 3.3 0 0 1
7 3.0 0 0 1
8 3.8 0 0 1
9 2.5 0 0 1
Then take the dot product of the binarized target and the feature, i.e. sum the feature values within each class:
observed = y.T.dot(X)
>>> observed
array([ 9.8, 8.4, 12.6])
Next, take the sum of the feature values and compute the class frequencies:
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = y.mean(axis=0).reshape(1, -1)
>>> class_prob, feature_count
(array([[0.3, 0.3, 0.4]]), array([[30.8]]))
Now, as in the first step, we take a dot product to get the expected matrix (to go with the observed values from before):
expected = np.dot(class_prob.T, feature_count)
>>> expected
array([[ 9.24],[ 9.24],[12.32]])
Finally we calculate a chi^2 value:
chi2 = ((observed.reshape(-1,1) - expected) ** 2 / expected).sum(axis=0)
>>> chi2
array([0.11666667])
We have a chi^2 value; now we need to judge how extreme it is. For that we use a chi^2 distribution with (number of classes - 1) degrees of freedom and compute the area from our chi^2 value to infinity, i.e. the probability of getting a value at least as extreme as the one we observed. This is the p-value (computed with the chi-squared survival function from SciPy):
p = scipy.special.chdtrc(3 - 1, chi2)
>>> p
array([0.94333545])
Compare with SelectKBest:
s = SelectKBest(chi2, k=1)
s.fit(X.reshape(-1,1),y)
>>> s.scores_, s.pvalues_
(array([0.11666667]), [0.943335449873492])
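For reference, the whole walkthrough above can be reproduced end to end with a short standalone script (a sketch assuming NumPy, SciPy and scikit-learn are installed); the two printed pairs should match:

import numpy as np
import scipy.special
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([3.4, 3.4, 3.0, 2.8, 2.7, 2.9, 3.3, 3.0, 3.8, 2.5]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

# Manual chi^2, step by step as above
Y = LabelBinarizer().fit_transform(y)          # one indicator column per class
observed = Y.T.dot(X)                          # per-class sums of the feature
feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = Y.mean(axis=0).reshape(-1, 1)
expected = class_prob.dot(feature_count)
chi2_manual = ((observed - expected) ** 2 / expected).sum(axis=0)
p_manual = scipy.special.chdtrc(len(np.unique(y)) - 1, chi2_manual)

# Same numbers from SelectKBest
selector = SelectKBest(chi2, k=1).fit(X, y)
print(chi2_manual, p_manual)                   # [0.11666667] [0.94333545]
print(selector.scores_, selector.pvalues_)     # [0.11666667] [0.94333545]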
This is a follow-up to this question: determine the coordinates where two pandas time series cross, and how many times the time series cross.
I have 2 series in my Pandas dataframe, and would like to know where they intersect.
A B
0 1 0.5
1 2 3.0
2 3 1.0
3 4 1.0
4 5 6.0
With this code, we can create a third column that contains True every time the two series intersect:
df['difference'] = df.A - df.B
df['cross'] = np.sign(df.difference.shift(1))!=np.sign(df.difference)
np.sum(df.cross)-1
Now, instead of a simple True or False, I would like to know in which direction the intersection took place. For example: from 1 to 2 it intersected upwards, from 2 to 3 downwards, from 3 to 4 there was no intersection, and from 4 to 5 upwards.
A B Cross_direction
0 1 0.5 None
1 2 3.0 Upwards
2 3 1.0 Downwards
3 4 1.0 None
4 5 6.0 Upwards
In pseudo-code, it should be like this:
cross_directions = [none, none, ... * series size]
for item in df['difference']:
    if item > 0 and next_item < 0:
        cross_directions.append("up")
    elif item < 0 and next_item > 0:
        cross_directions.append("down")
The problem is that next_item is not available with this syntax (in the original approach we obtain it using .shift(1)), and that it takes a lot of code.
Should I look into implementing the code above using something that can group the loop by 2 items at a time? Or is there a simpler and more elegant solution like the one from the previous question?
You can use numpy.select. The code below should work for you:
df = pd.DataFrame({'A': [1, 2, 3, 4,5], 'B': [0.5, 3, 1, 1, 6]})
df['Diff'] = df.A - df.B
df['Cross'] = np.select(
    [(df.Diff < 0) & (df.Diff.shift() > 0), (df.Diff > 0) & (df.Diff.shift() < 0)],
    ['Up', 'Down'],
    'None',
)
#Output dataframe
A B Diff Cross
0 1 0.5 0.5 None
1 2 3.0 -1.0 Up
2 3 1.0 2.0 Down
3 4 1.0 3.0 None
4 5 6.0 -1.0 Up
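For reference, np.select takes a list of boolean conditions, a list of matching choices, and a default for positions where no condition holds; a minimal standalone illustration:

import numpy as np

x = np.array([-2, 0, 3])
# The first matching condition wins; unmatched positions get the default.
labels = np.select([x < 0, x > 0], ['negative', 'positive'], default='zero')
# -> array(['negative', 'zero', 'positive'], dtype='<U8')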
My very lousy and redundant solution.
dataframe['difference'] = dataframe['A'] - dataframe['B']
dataframe['temporary_a'] = np.array(dataframe.difference) > 0
dataframe['temporary_b'] = np.array(dataframe.difference.shift(1)) < 0
cross_directions = []
for index, row in dataframe.iterrows():
    if not row['temporary_a'] and not row['temporary_b']:
        cross_directions.append("up")
    elif row['temporary_a'] and row['temporary_b']:
        cross_directions.append("down")
    else:
        cross_directions.append("not")
dataframe['cross_direction'] = cross_directions
How can I aggregate to get, for each row, the average of b within its group a while excluding the current row? (The target result is in c.)
a b c
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
2 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
3 1 0.5 # (avg of 0 & 1, excluding 1)
3 0 1 # (avg of 1 & 1, excluding 0)
3 1 0.5 # (avg of 0 & 1, excluding 1)
Data dump:
import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
Suppose a group has values x_1, ..., x_n.
The average of the entire group would be
m = (x_1 + ... + x_n)/n
The sum of the group without x_i would be
(m*n - x_i)
The average of the group without x_i would be
(m*n - x_i)/(n-1)
Therefore, you could compute the desired column of values with
import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
which yields
In [32]: df
Out[32]:
a b c result
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
In [33]: assert df['result'].equals(df['c'])
Per the comments, in the OP's actual use case the DataFrame's a column contains strings:
import numpy as np

def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
              .view('|S{}'.format(strlen)))

N = 3*10**6
df = pd.DataFrame({'a': make_random_str_array(letters='ABCD', strlen=10, size=N),
                   'b': np.random.randint(10, size=N)})
so that there are about a million unique values in df['a'] out of 3 million
total:
In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337
In [89]: len(df)
Out[89]: 3000000
In this case the calculation above requires (on my machine) about 11 seconds:
In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 10.5 s per loop
Pandas converts all string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width string dtype, and then group according to those values.
Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:
In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 1.39 s per loop
Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n that is smaller than the longest string, those strings get silently truncated with no error or warning, which could lead to values being grouped together that should not be. The onus is therefore on you to choose the correct fixed-width dtype.
You could use
dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))
to ensure the conversion uses the correct dtype.
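A quick demonstration of the silent-truncation pitfall mentioned above: two distinct keys collapse into the same group key when the dtype is too narrow.

>>> import numpy as np
>>> np.array(['ABCDE', 'ABCDF']).astype('|S4')
array([b'ABCD', b'ABCD'], dtype='|S4')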
You can calculate the statistics manually by iterating group by group:
# Set up input
import pandas as pd
df = pd.DataFrame([
[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
[2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
[3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])
df
a b c
0 1 1 0.5
1 1 1 0.5
2 1 0 1.0
3 2 1 0.5
4 2 0 1.0
5 2 1 0.5
6 3 1 0.5
7 3 0 1.0
8 3 1 0.5
# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
for idx, row in group.iterrows():
# The group excluding current row
group_other = group.drop(idx)
avg = group_other['b'].mean()
results.append(row.tolist() + [avg])
# Compare our results with what is expected
results_df = pd.DataFrame(
results, columns=['a', 'b', 'c', 'c_new']
)
results_df
a b c c_new
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
This way you can use any statistic you want.
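For example, a hedged sketch of the same loop with the median as the leave-one-out statistic (reusing df from the setup above; the loo_median column name is just for illustration):

# Same grouping idea, different statistic: drop the current row, take the median.
loo_median = pd.Series(index=df.index, dtype=float)
for _, group in df.groupby('a'):
    for idx in group.index:
        loo_median.loc[idx] = group['b'].drop(idx).median()
df['loo_median'] = loo_median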
I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
# The above are example values, it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)

tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))

index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)

metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # Above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is that when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think my indexing is correct. Also, I have tested my binary_metric_train function separately; it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As #EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd
trains = [ 1, 1, 1, 2, 2, 2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1, 0, 1, 0, 1, 0]
df = pd.DataFrame({
    'trains' : trains,
    'thresholds' : thresholds,
    'C1' : data,
    'C2' : data
}).set_index(['trains', 'thresholds'])
print df
df.ix[(2, 30), 0] = 3 # using column index
# or...
df.ix[(2, 30), 'C1'] = 3 # using column name
df.loc[(2, 30), 'C1'] = 3 # using column name
# but not...
df.loc[(2, 30), 1] = 3 # creates a new column
print df
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
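Applied back to the original loop, the fix is to assign through a single .loc call with the full (row, column) key, rather than chained indexing (which writes into a temporary copy). Note that .ix, shown above, is deprecated in current pandas; .loc is the replacement. A sketch, assuming the rest of the setup (trains, tresholds, metrics, binary_metric_train, pearsonr) is exactly as in the question:

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                # One .loc call with the MultiIndex row key and the column label
                df.loc[(i, j), k] = abs(pearsonr(trA, trB)[0])
            else:
                df.loc[(i, j), k] = np.nan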