I was wondering if there is a better way to get the probabilities of a 2D numpy array, maybe using some of numpy's built-in functions.
For simplicity, say we have an example array:
[['apple','pie'],
['apple','juice'],
['orange','pie'],
['strawberry','cream'],
['strawberry','candy']]
I would like to get probabilities such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Here 'juice' as the second word has a probability of 0.2, since 'apple' occurs with probability 2/5 and 'juice' follows 'apple' with probability 1/2.
On the other hand, 'pie' as a second word has a probability of 0.4, combining the contributions from 'apple' and 'orange'.
The way I approached the problem was to add 3 new columns to the array: the probability of the 1st column value, the probability of the 2nd column value, and the final probability. I group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.
Below is my code:
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  ## TRANSPOSE a, AND GET unique
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  ## ADD 3 COLUMNS TO a
groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique  ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
I was wondering if there is a shorter/faster way of doing this using numpy or other means. Adding columns is not necessary; that was just my way of doing it. Other approaches are acceptable.
Based on the definition of the probability distribution you have given, you can use pandas to do the same, i.e.
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Frequency of the first word divided by the total number of rows
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]
# Divide 1 by the count of the first word
df[3] = 1 / (df[0].map(df[0].value_counts()))
# Multiply the probabilities
df[4] = df[2] * df[3]
Output:
0 1 2 3 4
0 apple pie 0.4 0.5 0.2
1 apple juice 0.4 0.5 0.2
2 orange pie 0.2 1.0 0.2
3 strawberry cream 0.4 0.5 0.2
4 strawberry candy 0.4 0.5 0.2
If you want it in the form of a list, you can use df.values.tolist().
If you don't want the intermediate columns, then
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
0 1 2
0 apple pie 0.2
1 apple juice 0.2
2 orange pie 0.2
3 strawberry cream 0.2
4 strawberry candy 0.2
For the combined probability of each second word, print(df.groupby(1)[2].sum()):
candy 0.2
cream 0.2
juice 0.2
pie 0.4
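If you would rather stay in plain numpy, a vectorized sketch along the same lines is possible using np.unique with return_inverse (only checked against this small example, so treat it as a starting point rather than a polished solution):
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],
              ['strawberry','cream'],['strawberry','candy']])

# P(first word): frequency of the column-0 value over all rows
_, first_inv, first_counts = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = first_counts[first_inv] / len(a)

# P(second | first): frequency of the (first, second) pair within its column-0 group
pairs = np.char.add(np.char.add(a[:, 0], '|'), a[:, 1])
_, pair_inv, pair_counts = np.unique(pairs, return_inverse=True, return_counts=True)
p_second = pair_counts[pair_inv] / first_counts[first_inv]

for row, p1, p2 in zip(a, p_first, p_second):
    print(row, p1, p2, p1 * p2)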
Related
I would like to generate pandas dataframes with simulated data.
There should be x sets of columns.
Each set corresponds to y number of columns.
Each set should have a value, a, in z number of rows. The value, a, is a float.
However, z may differ between the individual column sets.
The remaining columns will have another value, b, which is also a float.
I would like to write a function to generate such pandas data frames where I can specify the variables x, y, a, b and where a specific value for z can be set for the individual column sets.
Here is an example df:
data = [[0.5, 0.5, 0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.5, 0.5, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.5, 0.5]]
df = pd.DataFrame(data, columns=['set1_col1', 'set1_col2', 'set2_col1', 'set2_col2', 'set3_col1', 'set3_col2'])
df
But I would like to be able to specify the variables, which for the above example would be:
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 1 #(for all column sets)
b = 0.1
Advice on this would be greatly appreciated!
Thanks!
Use numpy.random.choice:
import numpy as np
import pandas as pd

N = 5 #No of rows
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 1 #(for all column sets)
b = 0.1
#names of sets
sets = [f'set{w+1}' for w in range(x)]
#names of columns
cols = [f'col{w+1}' for w in range(y)]
#MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
#DataFrame with default value
df = pd.DataFrame(b, index=range(N), columns=mux)
#randomly assign a at a random index, with no repeats
for c, i in zip(df.columns.levels[0], np.random.choice(df.index, z * x, replace=False)):
    df.loc[i, c] = a
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
set1_col1 set1_col2 set2_col1 set2_col2 set3_col1 set3_col2
0 0.1 0.1 0.1 0.1 0.1 0.1
1 0.1 0.1 0.5 0.5 0.1 0.1
2 0.5 0.5 0.1 0.1 0.1 0.1
3 0.1 0.1 0.1 0.1 0.1 0.1
4 0.1 0.1 0.1 0.1 0.5 0.5
EDIT: For consecutive random values use:
N = 6 #No of rows
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 2 #(for all column sets)
b = 0.1
#names of sets
sets = [f'set{w+1}' for w in range(x)]
#names of columns
cols = [f'col{w+1}' for w in range(y)]
#MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
#DataFrame with default value, index is created by consecutive groups
df = pd.DataFrame(b, index=np.arange(N) // z, columns=mux)
print (df)
#randomly assign a at a random index, with no repeats
for c, i in zip(df.columns.levels[0],
                np.random.choice(np.unique(df.index), x, replace=False)):
    df.loc[i, c] = a
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index(drop=True)
print (df)
set1_col1 set1_col2 set2_col1 set2_col2 set3_col1 set3_col2
0 0.5 0.5 0.1 0.1 0.1 0.1
1 0.5 0.5 0.1 0.1 0.1 0.1
2 0.1 0.1 0.1 0.1 0.5 0.5
3 0.1 0.1 0.1 0.1 0.5 0.5
4 0.1 0.1 0.5 0.5 0.1 0.1
5 0.1 0.1 0.5 0.5 0.1 0.1
I want to classify cracks by their depths.
To do it, I store in a data frame the following features:
WindowsDf = pd.DataFrame(dataForWindowsDf, columns=['IsCrack', 'CheckTypeEncode', 'DepthCrack',
                                                    'WindowOfInterest'])
# dataForWindowsDf is a list built iteratively from csv files.
# The windows data frame is built from this list.
So, my target column is 'DepthCrack' and the other columns are part of the feature vector.
WindowOfInterest is a column of 2D lists, i.e. lists of points; each one is a curve representing the test that was performed (electromagnetic waves returned from a surface as a function of time):
[[0.9561600000000001, 0.10913097635410397], [0.95621,0.1100000]...]
The problem I faced is how to train a model using a column of 2D lists (I tried to pass it in as-is and it didn't work).
How do you suggest dealing with this problem?
I thought about extracting features from the 2D list to get one-dimensional features (integral, etc.).
You might transform this one feature into two, so WindowOfInterest becomes:
WindowOfInterest_x1 and WindowOfInterest_x2
For example, from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'IsCrack': [1, 1, 1, 1, 1],
... 'CheckTypeEncode': [0, 1, 0, 0, 0],
... 'DepthCrack': [0.4, 0.2, 1.4, 0.7, 0.1],
... 'WindowOfInterest': [[0.9561600000000001, 0.10913097635410397], [0.95621,0.1100000], [0.459561, 0.635410397], [0.4495621,0.32], [0.621,0.2432]]},
... index = [0, 1, 2, 3, 4])
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest
0 1 0 0.4 [0.9561600000000001, 0.10913097635410397]
1 1 1 0.2 [0.95621, 0.11]
2 1 0 1.4 [0.459561, 0.635410397]
3 1 0 0.7 [0.4495621, 0.32]
4 1 0 0.1 [0.621, 0.2432]
We can split the list like so:
>>> df[['WindowOfInterest_x1','WindowOfInterest_x2']] = pd.DataFrame(df['WindowOfInterest'].tolist(), index=df.index)
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest WindowOfInterest_x1 WindowOfInterest_x2
0 1 0 0.4 [0.9561600000000001, 0.10913097635410397] 0.956160 0.109131
1 1 1 0.2 [0.95621, 0.11] 0.956210 0.110000
2 1 0 1.4 [0.459561, 0.635410397] 0.459561 0.635410
3 1 0 0.7 [0.4495621, 0.32] 0.449562 0.320000
4 1 0 0.1 [0.621, 0.2432] 0.621000 0.243200
To finish, we can drop the WindowOfInterest column:
>>> df = df.drop(['WindowOfInterest'], axis=1)
>>> df
IsCrack CheckTypeEncode DepthCrack WindowOfInterest_x1 WindowOfInterest_x2
0 1 0 0.4 0.956160 0.109131
1 1 1 0.2 0.956210 0.110000
2 1 0 1.4 0.459561 0.635410
3 1 0 0.7 0.449562 0.320000
4 1 0 0.1 0.621000 0.243200
Now you can pass WindowOfInterest_x1 and WindowOfInterest_x2 as features for your model.
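If WindowOfInterest actually holds a longer list of points (the full curve), a rough sketch of the kind of scalar summaries the question mentions (integral, etc.) could look like the following; the window data and feature names here are hypothetical:
import numpy as np
import pandas as pd

# hypothetical window: a list of [time, amplitude] points
window = [[0.95616, 0.10913], [0.95621, 0.11000], [0.95630, 0.11500]]

def summarize_window(points):
    arr = np.asarray(points, dtype=float)   # shape (n_points, 2)
    t, v = arr[:, 0], arr[:, 1]
    return pd.Series({'woi_integral': np.trapz(v, t),  # area under the curve
                      'woi_max': v.max(),
                      'woi_mean': v.mean()})

print(summarize_window(window))
# applied to a whole column: features = df['WindowOfInterest'].apply(summarize_window)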
I came across the following code while reading up on RL. The probs vector contains the probability of each action. I believe the loop tries to choose an action randomly from that distribution. Why/how does this work?
a = 0
rand_select = np.random.rand()
while True:
    rand_select -= probs[a]
    if rand_select < 0 or a + 1 == n_actions:
        break
    a += 1
actions = a
After going through similar code, I realised that "actions" contains the final action to be taken.
You can view the probabilities as a distribution of contiguous parts on the line from 0.0 to 1.0.
If we have A: 0.2, B: 0.3, C: 0.5, the line could be
0.0 --A--> 0.2
0.2 --B--> 0.5
0.5 --C--> 1.0
which totals 1.0.
The algorithm is choosing a random location between 0.0->1.0 and finds out where it "landed" (A, B or C) by sequentially ruling out parts.
Suppose we draw 0.73. We can visualize it like this (selection marked with *):
0.0 ---------------------------> 1.0
*
0.0 --A--> 0.2 --B--> 0.5 --C--> 1.0
0.73 - 0.2 > 0, so we subtract 0.2 (leaving 0.53) and are left with:
0.2 --B--> 0.5
0.5 --C--> 1.0
0.53 - 0.3 > 0, so we subtract 0.3 (leaving 0.23) and are left with:
0.5 --C--> 1.0
0.23 - 0.5 < 0, so we know the part we drew was C.
The selection is distributed according to the probabilities, and the algorithm is O(n), where n is the number of probabilities.
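To convince yourself, you can wrap the loop from the question in a function and compare the empirical frequencies against the probabilities (a quick sanity-check sketch; the probs vector and sample count are made up):
import numpy as np

def sample_action(probs):
    # walk the cumulative intervals until the random draw falls inside one
    n_actions = len(probs)
    a = 0
    rand_select = np.random.rand()
    while True:
        rand_select -= probs[a]
        if rand_select < 0 or a + 1 == n_actions:
            break
        a += 1
    return a

probs = [0.2, 0.3, 0.5]
draws = [sample_action(probs) for _ in range(100000)]
print(np.bincount(draws) / len(draws))  # roughly [0.2, 0.3, 0.5]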
I have a dataframe A that looks like this
value Frequency
0.1 3
0.2 2
and I want to convert it to dataframe B like below
Sample
0.1
0.1
0.1
0.2
0.2
Simply put, dataframe A is the samples and their frequency (repetition). Dataframe B is literally expanding that. Is there a straightforward way to do this?
What I did (minimal working example reproducing the above):
import pandas as pd

X = pd.DataFrame([(0.1, 3), (0.2, 2)], columns=['value', 'Frequency'])
Sample = list()
for index, row in X.iterrows():
    Value = row['value']
    Freq = int(row['Frequency'])
    Sample = Sample + [Value] * Freq
Data = pd.DataFrame({'Sample': pd.Series(Sample)})
You can use Series.repeat, where the repeats argument can also be a series of ints:
df.value.repeat(df.Frequency).reset_index(drop=True).to_frame('Sample')
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
Use repeat
>>> df['value'].repeat(df.Frequency)
0 0.1
0 0.1
0 0.1
1 0.2
1 0.2
Name: value, dtype: float64
Or create a new DataFrame with
>>> pd.DataFrame(df['value'].repeat(df.Frequency).to_numpy(),columns=["Sample"])
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
You can use reindex + repeat
X = X.reindex(X.index.repeat(X.Frequency))
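This keeps both columns and repeats the original index labels; with the X from the question, the result should look roughly like:
   value  Frequency
0    0.1          3
0    0.1          3
0    0.1          3
1    0.2          2
1    0.2          2
Add .reset_index(drop=True) (and select the value column) if you want to match the Sample output exactly.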
I am working with a DataFrame and had to do a groupby in order to perform some operations on my data.
This is an example of my Dataframe:
I SI deltas
1 10 0.1
1 14 0.1
2 10 0.1
2 18 0.3
1 17 0.05
2 30 0.3
1 10 0.4
1 14 0.2
2 10 0.1
2 18 0.2
1 17 0.15
Now, for each I, I count the relative frequency of the SI in this way:
results = df.groupby(['I', 'SI'])[['deltas']].sum()
#for each I, we sum all the weights (Deltas)
denom = results.groupby('I')['deltas'].sum()
#for each I, we divide each deltas by the sum, getting them normalized to one
results.deltas = results.deltas / denom
So my Dataframe now looks like this:
I = 1
deltas
SI = 10 0.5
SI = 14 0.3
SI = 17 0.2
I = 2
deltas
SI = 10 0.2
SI = 18 0.5
SI = 30 0.3
....
What I need to do is print, for each I, the sum of deltas times their respective SI:
I = 1 sum = 0.5 * 10 + 0.3*14 + 0.2*17 = 12.6
I = 2 sum = 0.2*10 + 0.5*18 + 0.3*30 = 20
But since I am now working with a DataFrame whose indices are I and SI, I do not know how to use them. I tried this code:
for idx2, j in enumerate(results.index.get_level_values(0).unique()):
    # print(results.loc[j])
    f.write("%d\t" % (j) + results.loc[j].to_string(index=False) + '\n')
but I am not sure how I should proceed to get the index values.
Let's assume you have an input dataframe df following your initial transformations. If SI is your index, elevate it to a column via df = df.reset_index() as an initial step.
I SI weight
0 1 10 0.5
1 1 14 0.3
2 1 17 0.2
3 2 10 0.2
4 2 18 0.5
5 2 30 0.3
You can then calculate the product of SI and weight, then use GroupBy + sum:
res = df.assign(prod=df['SI']*df['weight'])\
        .groupby('I')['prod'].sum().reset_index()
print(res)
I prod
0 1 12.6
1 2 20.0
For a single dataframe in isolation, you can use np.dot for the dot product.
s = pd.Series([0.5, 0.3, 0.2], index=[10, 14, 17])
s.index.name = 'SI'
res = np.dot(s.index, s) # 12.6
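If you prefer to keep working with the MultiIndex results frame from the question (index levels I and SI, with the normalized deltas column), a sketch of the same calculation without resetting the index might look like this (assuming the SI level holds numeric values):
# `results` is assumed to still carry the (I, SI) MultiIndex from the question
weighted = results['deltas'] * results.index.get_level_values('SI')
print(weighted.groupby(level='I').sum())
# I
# 1    12.6
# 2    20.0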