I have a similar question to one I posed here, but subtly different as it includes an extra step to the process involving a probability:
Using a Python pandas dataframe column as input to a loop through another column
I've got two pandas dataframes: one has these variables
Year Count Probability
1 8 25%
2 26 19%
3 17 26%
4 9 10%
Another is a table with these variables:
ID Value
1 100
2 25
3 50
4 15
5 75
Essentially I need to use the Count x in the first dataframe to loop through the 2nd dataframe x times, but only pull a value from the 2nd dataframe y percent of the times (using random number generation) - and then create a new column in the first dataframe that represents the sum of the values in the loop.
So - just to demonstrate - for that first row, we'd loop through the 2nd table 8 times, but only pull a random value from that table 25% of the time - so we might get output of:
0 100 0 0 25 0 0 0
...which sums to 125 - so our added column in the first table looks like
Year Count Probability Sum
1 8 25% 125
....and so on. Thanks in advance.
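For reference, here is a literal (and slow) transcription of that process, just to pin down what's being computed; it uses numeric probabilities (0.25 instead of '25%') and NumPy's Generator for the coin flips, so treat it as a sketch rather than the target solution:
import numpy as np
import pandas as pd

rng = np.random.default_rng()
vals = pd.DataFrame({'Year': [1, 2, 3, 4], 'Count': [8, 26, 17, 9],
                     'Probability': [0.25, 0.19, 0.26, 0.10]})
table = pd.DataFrame({'ID': [1, 2, 3, 4, 5], 'Value': [100, 25, 50, 15, 75]})

sums = []
for count, prob in zip(vals['Count'], vals['Probability']):
    total = 0
    for _ in range(count):
        if rng.random() < prob:                       # pull a value only `prob` of the time
            total += rng.choice(table['Value'].to_numpy())
    sums.append(total)
vals['Sum'] = sums
The answers below vectorize exactly this process.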
We'll use numpy binomial and pandas sample to get this done.
import pandas as pd
import numpy as np
# Set up dataframes
vals = pd.DataFrame([[1,8,'25%'], [2,26,'19%'], [3,17,'26%'],[4,9,'10%']])
vals.columns = ['Year', 'Count', 'Probability']
temp = pd.DataFrame([[1,100], [2,25], [3,50], [4,15], [5,75]])
temp.columns = ['ID', 'Value']
# Get probability fraction from string
vals['Numeric_Probability'] = pd.to_numeric(vals['Probability'].str.replace('%', '')) / 100
# Total rows is a binomial random variable with n=Count, p=Probability.
vals['Total_Rows'] = np.random.binomial(n=vals['Count'], p=vals['Numeric_Probability'])
# Sample "total rows" from other DataFrame and sum.
vals['Sum'] = vals['Total_Rows'].apply(lambda x: temp['Value'].sample(
    n=x, replace=True).sum())
# Drop intermediate columns
vals.drop(columns=['Numeric_Probability', 'Total_Rows'], inplace=True)
print(vals)
Year Count Probability Sum
0 1 8 25% 15
1 2 26 19% 350
2 3 17 26% 190
3 4 9 10% 0
You could pass a list of probabilities to np.random.choice:
In [1]: import numpy as np
...: import pandas as pd
In [2]: d_1 = {
...: 'Year': [1, 2, 3, 4],
...: 'Count': [8, 26, 17, 9],
...: 'Probability': ['25%', '19%', '26%', '10%'],
...: }
...: df_1 = pd.DataFrame(data=d_1)
In [3]: d_2 = {
...: 'ID': [1, 2, 3, 4, 5],
...: 'Value': [100, 25, 50, 15, 75],
...: }
...: df_2 = pd.DataFrame(data=d_2)
In [4]: def get_probabilities(values: pd.Series, percentage: float) -> list[float]:
...:     percentage /= 100
...:     percent_per_val = percentage / values.size
...:     return [percent_per_val] * values.size + [1 - percentage]
...:
In [5]: df_1['Sum'] = [
...:     np.random.choice(a=pd.concat([df_2['Value'], pd.Series([0])]),
...:                      size=n,
...:                      p=get_probabilities(values=df_2['Value'],
...:                                          percentage=float(percent[:-1]))).sum()
...:     for n, percent in zip(df_1['Count'], df_1['Probability'])
...: ]
...: df_1
Out[5]:
Year Count Probability Sum
0 1 8 25% 100
1 2 26 19% 375
2 3 17 26% 275
3 4 9 10% 50
Consider I have a dataframe:
import pandas as pd

data = [[11, 10, 13], [16, 15, 45], [35, 14, 9]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df
The data looks like:
A B C
0 11 10 13
1 16 15 45
2 35 14 9
The real data consists of a hundred columns and thousand rows.
I have a function whose aim is to count how many values in one column are higher than the minimum value of another column. The function looks like this:
def get_count_higher_than_min(df, column_name_string, df_col_based):
    seriesObj = df.apply(lambda x: True if x[column_name_string] > df_col_based.min(skipna=True) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numOfRows
Example output from the function like this:
get_count_higher_than_min(df, 'A', df['B'])
The output is 3. That is because the minimum value of df['B'] is 10 and three values from df['A'] are higher than 10, so the output is 3.
The problem is that I want to compute this pairwise for all columns using that function.
I don't know an effective and efficient way to solve this. I want the output in a form similar to a confusion matrix or a correlation matrix.
Example output:
A B C
A X 3 X
B X X X
C X X X
This is O(n²m), where n is the number of columns and m the number of rows.
minima = df.min()
m = pd.DataFrame({c: (df > minima[c]).sum()
                  for c in df.columns})
Result:
>>> m
A B C
A 2 3 3
B 2 2 3
C 2 2 2
In theory O(n log(n) m) is possible.
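One way to get there is to sort every column once and binary-search each column minimum, instead of re-comparing every value for every pair. A rough, unbenchmarked sketch of that idea, using the small example df from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame([[11, 10, 13], [16, 15, 45], [35, 14, 9]], columns=['A', 'B', 'C'])

minima = df.min()                          # one minimum per column
sorted_cols = np.sort(df.values, axis=0)   # sort every column once

# For target column c and source column i: count of values in column i strictly
# greater than minima[c], found by binary search in the pre-sorted column.
m = pd.DataFrame(
    {c: [len(df) - np.searchsorted(sorted_cols[:, i], minima[c], side='right')
         for i in range(df.shape[1])]
     for c in df.columns},
    index=df.columns)
print(m)
This produces the same matrix as above; the per-pair work drops from a full column scan to a single binary search.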
from itertools import product
pairs = product(df.columns, repeat=2)
min_value = {}
output = []
for each_pair in pairs:
    # making sure that we are calculating min only once per column
    if each_pair[1] not in min_value:
        min_value[each_pair[1]] = df[each_pair[1]].min()
    min_ = min_value[each_pair[1]]
    count = df[df[each_pair[0]] > min_][each_pair[0]].count()
    output.append(count)
df_desired = pd.DataFrame(
    [output[i: i+len(df.columns)] for i in range(0, len(output), len(df.columns))],
    columns=df.columns, index=df.columns)
print(df_desired)
A B C
A 2 3 3
B 2 2 3
C 2 2 2
I have the following DataFrame:
0
0 5
1 10
2 15
3 20
I want to get the location of the value nearest to n. For example, if n=7 then the nearest number is 5.
After that, return the location of 5, i.e. [0][0].
Use Series.abs and Series.idxmin:
# Setup
df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
(n - df[0]).abs().idxmin()
[out]
0
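If you also need the matching value, the label returned by idxmin can be fed straight back into .loc (a small follow-up, reusing the df and n from the setup above):
idx = (n - df[0]).abs().idxmin()
print(idx, df.loc[idx, 0])  # 0 5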
Use numpy.argmin to get the closest number index:
df[1] = 7
df[2] = df[1] - df[0]
df[2] = df[2].abs()
print(np.argmin(df[2]))
I have a dataframe with 3 columns; each row holds the probabilities that, for that row, the feature T takes the value 1, 2 or 3:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
For row 0, T is 1 with 80% chance, 2 with 10% and 3 with 10%
I want to simulate the value of T for each row and change the columns T1, T2, T3 into binary features.
I have a solution, but it loops over the rows of the dataframe and is really slow (my real dataframe has over 1 million rows):
possib = df.columns
for i in range(df.shape[0]):
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib, 1, p=probas)[0]
    for pos in possib:
        if pos == choix_transp:
            df.iloc[i][pos] = 1
        else:
            df.iloc[i][pos] = 0
Is there a way to vectorize this code ?
Thank you !
Here's one based on vectorized random.choice with a given matrix of probabilities -
def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out
ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
Verify with a large dataset for the probabilities -
In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})
In [140]: df
Out[140]:
T1 T2 T3
0 0.80 0.10 0.1
1 0.50 0.20 0.3
2 0.01 0.89 0.1
In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)
In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]:
array([[0.80064, 0.0995 , 0.09986],
[0.50051, 0.20113, 0.29836],
[0.01015, 0.89045, 0.0994 ]])
In [145]: np.round(_,2)
Out[145]:
array([[0.8 , 0.1 , 0.1 ],
[0.5 , 0.2 , 0.3 ],
[0.01, 0.89, 0.1 ]])
Timings on 1,000,000 rows -
# Setup input
In [169]: N = 1000000
...: a = np.random.rand(N,3)
...: df = pd.DataFrame(a/a.sum(1,keepdims=1),columns=[['T1','T2','T3']])
# #gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop
# Soln from this post
In [172]: %%timeit
...: ar_out = matrixprob_to_onehot(df.values)
...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop
We can use numpy for this:
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
This generates a single column of random values and compares it to the column-wise cumsum of the dataframe, which results in a DataFrame of booleans where the first False value shows which "bucket" the random value falls in. With idxmin, we can get the label of this bucket, which we can then convert to one-hot columns with pd.get_dummies.
Example:
import numpy as np
import pandas as pd
np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]
df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
print(result)
Output:
0 1 2
0 1 0 0
1 0 0 1
2 0 1 0
3 0 1 0
4 1 0 0
5 0 0 1
6 0 1 0
7 0 1 0
8 0 0 1
9 0 1 0
A note:
Most of the slowdown comes from pd.get_dummies; if you use Divakar's method of pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns), it gets a lot faster.
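For instance, a rough sketch of that combination (an illustration, not code from either answer), using the small example df above: keep the cumulative-sum comparison but build the 0/1 frame from a NumPy array instead of calling pd.get_dummies:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"T1": [0.8, 0.5, 0.01], "T2": [0.1, 0.2, 0.89], "T3": [0.1, 0.3, 0.1]})

cum = df.values.cumsum(axis=1)
# Index of the first bucket whose cumulative probability exceeds the draw.
idx = (np.random.rand(len(df), 1) > cum).sum(axis=1)
idx = np.minimum(idx, df.shape[1] - 1)    # guard against floating-point spillover
onehot = np.zeros(df.shape, dtype=bool)
onehot[np.arange(len(df)), idx] = True
df_out = pd.DataFrame(onehot.view('i1'), index=df.index, columns=df.columns)
print(df_out)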
I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys into a list called 'stopword' if the value is more than 5 and the key is not in another dataframe 'df'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5 and, using isin with the inverted mask ~, for words that are not present in the other df.
You can then get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
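If you only need the list, a plain set lookup gives the same result without building an intermediate DataFrame (a small sketch, assuming df1 is the second dataframe from the question):
existing = set(df1['word'])
stopword = [k for k, v in wordfreq.items() if v > 5 and k not in existing]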
I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
#The above are example values, it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)
metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # Above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is that when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately; it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As @EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd
trains = [ 1, 1, 1, 2, 2, 2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1, 0, 1, 0, 1, 0]
df = pd.DataFrame({
    'trains' : trains,
    'thresholds' : thresholds,
    'C1' : data,
    'C2' : data
}).set_index(['trains', 'thresholds'])
print(df)

# Note: .ix is deprecated (and removed in pandas 1.0+); prefer .loc.
df.ix[(2, 30), 0] = 3      # using the positional column index (older pandas only)
# or...
df.ix[(2, 30), 'C1'] = 3   # using the column name (older pandas only)
df.loc[(2, 30), 'C1'] = 3  # using the column name

# but not...
df.loc[(2, 30), 1] = 3     # this creates a new column

print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
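Applied to the loop in the question, the chained assignment df[k][i][j] = corr becomes a single df.loc[(i, j), k] call on the (trains, tresholds) MultiIndex. A minimal runnable sketch (binary_metric_train is a made-up stub here, purely so the example executes):
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical stand-in for the question's binary_metric_train, just so this runs.
def binary_metric_train(flag, train, tresh=None):
    rng = np.random.default_rng(train if tresh is None else train * 100 + tresh)
    return rng.integers(0, 2, size=35)

trains = np.arange(1, 6)                 # small example instead of 900 trains
tresholds = np.arange(10, 70, 10)
index = pd.MultiIndex.from_product([trains, tresholds], names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains)

metrics = {i: binary_metric_train(True, i) for i in trains}

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                df.loc[(i, j), k] = abs(pearsonr(trA, metrics[k])[0])
            else:
                df.loc[(i, j), k] = np.nan
print(df.head())
With .loc the correlations and NaNs land in the frame itself; the original df[k][i][j] chain assigns into a temporary copy, which is why the DataFrame stayed all zeros.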