Generate a simulated dataset in pandas - python

I would like to generate pandas dataframes with simulated data.
There should be x sets of columns.
Each set consists of y columns.
Each set should contain a value, a, in z of its rows. The value a is a float.
However, z may differ between the column sets.
The remaining cells will hold another value, b, which is also a float.
I would like to write a function to generate such pandas DataFrames, where I can specify the variables x, y, a and b, and where a specific value of z can be set for each individual column set.
Here is an example df:
data = [[0.5, 0.5, 0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.5, 0.5, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.5, 0.5]]
df = pd.DataFrame(data, columns=['set1_col1', 'set1_col2', 'set2_col1', 'set2_col2', 'set3_col1', 'set3_col2'])
df
But I would like to be able to specify the variables, which for the above example would be:
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 1 #(for all column sets)
b = 0.1
Advice on this would be greatly appreciated!
Thanks!

Use numpy.random.choice:
import numpy as np
import pandas as pd

N = 5  # number of rows
x = 3  # (set1, set2, set3)
y = 2  # (col1, col2 for each set)
a = 0.5
z = 1  # (for all column sets)
b = 0.1

# names of sets
sets = [f'set{w+1}' for w in range(x)]
# names of columns
cols = [f'col{w+1}' for w in range(y)]
# MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
# DataFrame filled with the default value b
df = pd.DataFrame(b, index=range(N), columns=mux)
# randomly assign a: pick a distinct random row for each set (no repeats)
for c, i in zip(df.columns.levels[0], np.random.choice(df.index, z * x, replace=False)):
    df.loc[i, c] = a
# flatten the MultiIndex columns to 'setN_colM'
df.columns = df.columns.map(lambda t: f'{t[0]}_{t[1]}')
print(df)
   set1_col1  set1_col2  set2_col1  set2_col2  set3_col1  set3_col2
0        0.1        0.1        0.1        0.1        0.1        0.1
1        0.1        0.1        0.5        0.5        0.1        0.1
2        0.5        0.5        0.1        0.1        0.1        0.1
3        0.1        0.1        0.1        0.1        0.1        0.1
4        0.1        0.1        0.1        0.1        0.5        0.5
EDIT: To place a in z consecutive rows for each set, use:
N = 6  # number of rows
x = 3  # (set1, set2, set3)
y = 2  # (col1, col2 for each set)
a = 0.5
z = 2  # (for all column sets)
b = 0.1

# names of sets
sets = [f'set{w+1}' for w in range(x)]
# names of columns
cols = [f'col{w+1}' for w in range(y)]
# MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
# DataFrame with default value; the index is built from consecutive groups of z rows
df = pd.DataFrame(b, index=np.arange(N) // z, columns=mux)
print(df)
# randomly assign a: pick a distinct group of consecutive rows for each set
for c, i in zip(df.columns.levels[0],
                np.random.choice(np.unique(df.index), x, replace=False)):
    df.loc[i, c] = a
# flatten the MultiIndex columns and restore a default RangeIndex
df.columns = df.columns.map(lambda t: f'{t[0]}_{t[1]}')
df = df.reset_index(drop=True)
print(df)
   set1_col1  set1_col2  set2_col1  set2_col2  set3_col1  set3_col2
0        0.5        0.5        0.1        0.1        0.1        0.1
1        0.5        0.5        0.1        0.1        0.1        0.1
2        0.1        0.1        0.1        0.1        0.5        0.5
3        0.1        0.1        0.1        0.1        0.5        0.5
4        0.1        0.1        0.5        0.5        0.1        0.1
5        0.1        0.1        0.5        0.5        0.1        0.1
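To fold this into the reusable function the question asks for, here is a minimal sketch based on the non-consecutive variant above. The function name make_simulated_df, the seed parameter, and accepting z as either a single int or a per-set list are my own additions; note that, unlike the loop above, different sets may pick overlapping rows.

import numpy as np
import pandas as pd

def make_simulated_df(x, y, a, b, z, n_rows, seed=None):
    """Sketch: x column sets of y columns each; value a in z rows per set
    (z may be an int or a list with one entry per set), value b elsewhere."""
    rng = np.random.default_rng(seed)
    z_per_set = [z] * x if np.isscalar(z) else list(z)
    sets = [f'set{w+1}' for w in range(x)]
    cols = [f'col{w+1}' for w in range(y)]
    df = pd.DataFrame(b, index=range(n_rows),
                      columns=pd.MultiIndex.from_product([sets, cols]))
    for s, zi in zip(sets, z_per_set):
        rows = rng.choice(n_rows, zi, replace=False)  # rows that get a for this set
        df.loc[rows, s] = a
    df.columns = df.columns.map('_'.join)  # flatten to 'setN_colM'
    return df

# Example matching the question: z=1 for all sets, or e.g. z=[1, 2, 1] per set
df = make_simulated_df(x=3, y=2, a=0.5, b=0.1, z=1, n_rows=5, seed=0)
print(df)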


I'm trying to predict probability of X_test and getting 2 values in an array. I need to compare those 2 values and make it 1

I'm trying to predict probabilities for X_test and I get 2 values per row in an array. I need to compare those 2 values and reduce them to a single 0/1 value.
When I write the code
y_pred = classifier.predict_proba(X_test)
y_pred
It gives output like
array([[0.5, 0.5],
       [0.6, 0.4],
       [0.7, 0.3],
       ...,
       [0.5, 0.5],
       [0.4, 0.6],
       [0.3, 0.7]])
We know that if a value is >= 0.5 it is a 1, and if it is less than 0.5 it is a 0.
I converted the above array into a pandas DataFrame using the code below:
proba = pd.DataFrame(y_pred)
proba.columns = ['pred_0', 'pred_1']
proba.head()
And the output is:
   pred_0  pred_1
0     0.5     0.5
1     0.6     0.4
2     0.7     0.3
3     0.4     0.6
4     0.3     0.7
How do I iterate over the rows and write a condition so that the result is 1 when the value in the first column is greater than or equal to the value in the second column, and 0 when the value in the first column is less than the value in the second column?
For example, by seeing the above data frame the output must be
   output
0       0
1       1
2       1
3       1
4       1
You could just map over your initial array, without converting it to a pandas DataFrame, so that it returns True when the first value of every subarray is >= 0.5 and False otherwise, and finally convert the result to int:
>>> import numpy as np
>>> a = np.array([[0.5, 0.5], [0.6, 0.4], [0.3, 0.7]])
>>> a
array([[0.5, 0.5],
       [0.6, 0.4],
       [0.3, 0.7]])
>>> result = map(lambda x:int(x[0] >= 0.5), a)
>>> print(list(result))
[1, 1, 0]
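For what it's worth, the same check can also be vectorized directly on the array instead of using map, by slicing the first column and casting the boolean mask to int (a small sketch on the sample array above):

import numpy as np

a = np.array([[0.5, 0.5], [0.6, 0.4], [0.3, 0.7]])
# True where the first value of each row is >= 0.5, then cast the mask to 0/1
result = (a[:, 0] >= 0.5).astype(int)
print(result)  # [1 1 0]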
Compare the two columns to create a boolean index then convert to int using astype:
Option 1:
df['output'] = (df['pred_0'] >= df['pred_1']).astype(int)
Option 2:
df['output'] = df['pred_0'].ge(df['pred_1']).astype(int)
Or via np.where:
Option 3:
df['output'] = np.where(df['pred_0'] >= df['pred_1'], 1, 0)
Option 4:
df['output'] = np.where(df['pred_0'].ge(df['pred_1']), 1, 0)
   pred_0  pred_1  output
0     0.5     0.5       1
1     0.6     0.4       1
2     0.7     0.3       1
3     0.4     0.6       0
4     0.3     0.7       0
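For completeness, here is a self-contained sketch of Option 1, using the sample probabilities and the column names pred_0/pred_1 from the question:

import pandas as pd

df = pd.DataFrame({'pred_0': [0.5, 0.6, 0.7, 0.4, 0.3],
                   'pred_1': [0.5, 0.4, 0.3, 0.6, 0.7]})
# Boolean comparison of the two columns, cast to 0/1
df['output'] = (df['pred_0'] >= df['pred_1']).astype(int)
print(df)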

calculating a function using files

I have written a Python function for a calculation that uses the two data sets below. I want to calculate z for every value in data_2, using each row (row 1, row 2, row 3, ...) of data_1. However, as I am new to Python, my attempt fails partway through. Please help. Thanks.
data_1
file    a    b    c    d
file1   0.5  0.6  0.8  0.3
file1   0.2  0.2  0.4  0.1
file1   0.1  0.4  0.5  0.2

data_2
x
0.5
0.8
0.9
My attempted code is here:
import numpy as np
file1=np.loadtxt('data_1',skiprows=1,usecols=(1,2,3))
file2=np.loadtxt('data_2',skiprows=1,usecols=(0))
def calculation(a,b,c,x):
    z=(a+b+c)*x
    return z

for value in file2:
    print(value)
    calculation
my expected output should be something like
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 -
file1 0.5 0.6 0.8 0.3 -
file1 0.5 0.6 0.8 0.3 -
file1 0.2 0.2 0.4 0.1 -
file1 0.2 0.2 0.4 0.1 -
file1 0.2 0.2 0.4 0.1 -
file1 0.1 0.4 0.5 0.2 -
file1 0.1 0.4 0.5 0.2 -
file1 0.1 0.4 0.5 0.2 -
Python is a dynamic language and numpy tends to override normal operators to apply operations to entire collections of data. Often, if you have a for loop, you are not taking advantage of that.
numpy arrays can only hold a single data type, but you have a string in column 0. pandas wraps numpy and makes multiple data types easier to deal with, so I've switched to reading pandas.DataFrame objects instead of arrays.
It looks like you want the cartesian product of file2["x"] with the rows in file1. One way to do that is to create a dummy column in both dataframes that have matching values and then merge. Use the sum method for a + b + c and then multiply with x, and you have the result.
import pandas as pd
# read space separated tables
file1=pd.read_table('data_1', sep=r"\s+")
file2=pd.read_table('data_2', sep=r"\s+")
# we want (a+b+c)*x, for each value in file2["x"]. Do the sum, then
# use `merge` with a temporary key to create the cartesian product
# with x. For each `x`, merge will create a row for each matching
# key and since all keys match, we've got a cartesian product.
# Finally, multiply.
file1["_tmpsums"] = file1[["a", "b", "c"]].sum(axis=1)
file1["_tmpmergekey"] = file2["_tmpmergekey"] = 1
file1 = pd.merge(file1, file2, on="_tmpmergekey")
file1["z"] = file1["_tmpsums"] * file1["x"]
file1 = file1.drop(["_tmpsums", "_tmpmergekey", "x"], axis=1)
print(" data_3")
print(file1.to_string(col_space=6, index=False, justify="center"))
Result
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 0.95
file1 0.5 0.6 0.8 0.3 1.52
file1 0.5 0.6 0.8 0.3 1.71
file1 0.2 0.2 0.4 0.1 0.40
file1 0.2 0.2 0.4 0.1 0.64
file1 0.2 0.2 0.4 0.1 0.72
file1 0.1 0.4 0.5 0.2 0.50
file1 0.1 0.4 0.5 0.2 0.80
file1 0.1 0.4 0.5 0.2 0.90
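As a side note, if your pandas version is 1.2 or newer, the temporary merge key can be replaced by a cross join. Here is a sketch of the same computation using how="cross" (same file names and columns as above):

import pandas as pd

file1 = pd.read_table('data_1', sep=r"\s+")
file2 = pd.read_table('data_2', sep=r"\s+")

# Cartesian product of the two frames without a dummy key (pandas >= 1.2)
out = pd.merge(file1, file2, how="cross")
out["z"] = out[["a", "b", "c"]].sum(axis=1) * out["x"]
out = out.drop(columns="x")
print(out)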
Using pandas as follows
import pandas as pd
# Load Data
data_1 = pd.read_csv('data_1.txt', delimiter = r"\s+")
data_2 = pd.read_csv('data_2.txt', delimiter = r"\s+")
# Compute the cartesian product of data_1 with data_2
# since for each row in data_1, we need sequence of rows in data_2
# We do this using DataFrame merge by injecting a key that is repeated for each row
# i.e. 'merge_key'
data_1['merge_key'] = pd.Series([1]*len(data_1))
data_2['merge_key'] = pd.Series([1]*len(data_2))
df = pd.merge(data_1, data_2, on = 'merge_key')
# Drop merge key from result
df.drop('merge_key', axis = 'columns', inplace = True)
# DataFrame df now has columns file, a, b, c, d, x
# We can apply the calculation function from the question to each row using apply,
# specifying the columns to send to calculation
def calculation(a, b, c, x):
    return (a + b + c) * x

df['z'] = df.apply(lambda row: calculation(row['a'], row['b'], row['c'], row['x']), axis = 'columns')
# Drop x column
df.drop('x', axis = 'columns', inplace = True)
# Write to CSV file
df.to_csv('data_3.txt', index=False, sep = " ")
Output
Pandas DataFrame df
file a b c d z
0 file1 0.5 0.6 0.8 0.3 0.95
1 file1 0.5 0.6 0.8 0.3 1.52
2 file1 0.5 0.6 0.8 0.3 1.71
3 file1 0.2 0.2 0.4 0.1 0.40
4 file1 0.2 0.2 0.4 0.1 0.64
5 file1 0.2 0.2 0.4 0.1 0.72
6 file1 0.1 0.4 0.5 0.2 0.50
7 file1 0.1 0.4 0.5 0.2 0.80
8 file1 0.1 0.4 0.5 0.2 0.90
CSV File data_3.txt
file a b c d z
file1 0.5 0.6 0.8 0.3 0.9500000000000001
file1 0.5 0.6 0.8 0.3 1.5200000000000002
file1 0.5 0.6 0.8 0.3 1.7100000000000002
file1 0.2 0.2 0.4 0.1 0.4
file1 0.2 0.2 0.4 0.1 0.6400000000000001
file1 0.2 0.2 0.4 0.1 0.7200000000000001
file1 0.1 0.4 0.5 0.2 0.5
file1 0.1 0.4 0.5 0.2 0.8
file1 0.1 0.4 0.5 0.2 0.9
Basic Python
Same output
# Get data from first file
with open('data_1.txt', 'r') as f:
    # first file header
    header1 = f.readline()
    # Let's get the lines of data
    data_1 = []
    for line in f:
        new_data = line.rstrip().split()  # strip '\n' and split on whitespace
        for i in range(1, len(new_data)):
            new_data[i] = float(new_data[i])  # convert columns after file to float
        data_1.append(new_data)

# Get data from second file
with open('data_2.txt', 'r') as f:
    # second file header
    header2 = f.readline()
    # Let's get the lines of data
    data_2 = []
    for line in f:
        new_data = float(line.rstrip())  # only one value per line
        data_2.append(new_data)

# Output file
with open('data_3.txt', 'w') as f:
    # Write header
    f.write("file a b c d z\n")
    # Use double loop to go through all rows of data_2 for each row in data_1
    for v1 in data_1:
        # For each row in data_1
        file, a, b, c, d = v1  # unpack the values in v1 into individual variables
        for v2 in data_2:
            # for each row in data_2
            x = v2  # data_2 just has a single value per row
            # Calculation using the posted formula
            z = calculation(a, b, c, x)
            # Write result
            f.write(f"{file} {a} {b} {c} {d} {z}\n")
Numpy Version
import numpy as np

file1 = np.loadtxt('data_1.txt', skiprows=1, usecols=(1, 2, 3, 4))
file2 = np.loadtxt('data_2.txt', skiprows=1, usecols=(0))

with open('data_3.txt', 'w') as f:
    # Write header
    f.write("file a b c d z\n")
    # Double loop through the values of file1 and file2
    for val1 in file1:
        for val2 in file2:
            # Only use the first 3 values (val1[:3]), i.e. ignore d
            z = calculation(*val1[:3], val2)  # *val1[:3] unpacks the values into calculation
            # Write result:
            #   map(str, val1) converts the values to strings
            #   str(z) converts z to a string
            #   ' '.join([*map(str, val1), str(z)]) creates a space-separated string
            f.write(' '.join([*map(str, val1), str(z)]) + "\n")
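The double loop in the numpy version can also be replaced by broadcasting. Below is a rough sketch that assumes the same files and the (a + b + c) * x formula from the question; it only produces the z values, not the full output table:

import numpy as np

vals = np.loadtxt('data_1.txt', skiprows=1, usecols=(1, 2, 3, 4))  # a, b, c, d
x = np.loadtxt('data_2.txt', skiprows=1, usecols=(0,))

# (a + b + c) per data_1 row, multiplied against every x at once:
# shape (rows_1, 1) * shape (rows_2,) broadcasts to shape (rows_1, rows_2)
z = vals[:, :3].sum(axis=1)[:, None] * x
# Flatten row-major so the order matches the loop version (data_1 row major)
z_flat = z.ravel()
print(z_flat)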

Unwrap a Frequency Table in pandas

I have a dataframe A that looks like this
value Frequency
0.1 3
0.2 2
and I want to convert it to dataframe B like below
Sample
0.1
0.1
0.1
0.2
0.2
Simply put, dataframe A is the samples and their frequency (repetition). Dataframe B is literally expanding that. Is there a straightforward way to do this?
what I did (minimal working example reproducing above):
X = pd.DataFrame([(0.1,3),(0.2,2)],columns=['value','Frequency'])
Sample = list()
for index, row in X.iterrows():
    Value = row['value']
    Freq = int(row['Frequency'])
    Sample = Sample + [Value]*Freq
Data = pd.DataFrame({'Sample':pd.Series(Sample)})
You can use Series.repeat, where the repeats argument can also be a series of ints:
df.value.repeat(df.Frequency).reset_index(drop=True).to_frame('Sample')
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
Use repeat
>>> df['value'].repeat(df.Frequency)
0 0.1
0 0.1
0 0.1
1 0.2
1 0.2
Name: value, dtype: float64
Or create a new DataFrame with:
>>> pd.DataFrame(df['value'].repeat(df.Frequency).to_numpy(),columns=["Sample"])
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
You can use reindex + repeat
X = X.reindex(X.index.repeat(X.Frequency))
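If you prefer to do the repetition in numpy, np.repeat accepts a per-element repeat count. A small sketch using the same X as above:

import numpy as np
import pandas as pd

X = pd.DataFrame([(0.1, 3), (0.2, 2)], columns=['value', 'Frequency'])
# np.repeat expands each value according to its Frequency, then wrap the result in a new frame
B = pd.DataFrame({'Sample': np.repeat(X['value'].to_numpy(), X['Frequency'])})
print(B)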

pandas product of a column with its index groupby

I am working with a dataframe, and had to do a groupby in order to make some operations on my data.
This is an example of my Dataframe:
I SI deltas
1 10 0.1
1 14 0.1
2 10 0.1
2 18 0.3
1 17 0.05
2 30 0.3
1 10 0.4
1 14 0.2
2 10 0.1
2 18 0.2
1 17 0.15
Now, for each I, I count the relative frequency of the SI in this way:
results = df.groupby(['I', 'SI'])[['deltas']].sum()
#for each I, we sum all the weights (Deltas)
denom = results.groupby('I')['deltas'].sum()
#for each I, we divide each deltas by the sum, getting them normalized to one
results.deltas = results.deltas / denom
So my Dataframe now looks like this:
      deltas
I SI
1 10     0.5
  14     0.3
  17     0.2
2 10     0.2
  18     0.5
  30     0.3
...
What I need to do is to print for each I the sum of deltas times their relative SI:
I = 1: sum = 0.5*10 + 0.3*14 + 0.2*17 = 12.6
I = 2: sum = 0.2*10 + 0.5*18 + 0.3*30 = 20
But since now I am working with a dataframe where the indices are I and SI, I do not know how to use them. I tried this code:
for idx2, j in enumerate(results.index.get_level_values(0).unique()):
    #print results.loc[j]
    f.write("%d\t"%(j)+results.loc[j].to_string(index=False)+'\n')
but I am not sure how to proceed to get the index values.
Let's assume you have an input dataframe df following your initial transformations. If SI is your index, elevate it to a column via df = df.reset_index() as an initial step.
I SI weight
0 1 10 0.5
1 1 14 0.3
2 1 17 0.2
3 2 10 0.2
4 2 18 0.5
5 2 30 0.3
You can then calculate the product of SI and weight, then use GroupBy + sum:
res = df.assign(prod=df['SI']*df['weight'])\
        .groupby('I')['prod'].sum().reset_index()
print(res)
I prod
0 1 12.6
1 2 20.0
For a single dataframe in isolation, you can use np.dot for the dot product.
s = pd.Series([0.5, 0.3, 0.2], index=[10, 14, 17])
s.index.name = 'SI'
res = np.dot(s.index, s) # 12.6
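If you would rather stay on the MultiIndexed frame from the question instead of resetting the index, you can pull the SI level straight out of the index. A sketch that rebuilds the normalized frame from the question (index levels I and SI, column deltas) and then groups by level:

import pandas as pd

# Rebuild the normalized frame from the question
results = pd.DataFrame(
    {'deltas': [0.5, 0.3, 0.2, 0.2, 0.5, 0.3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 14), (1, 17), (2, 10), (2, 18), (2, 30)], names=['I', 'SI']),
)

# Multiply each normalized delta by the SI value taken from the index,
# then sum within each I group
weighted = results['deltas'] * results.index.get_level_values('SI')
print(weighted.groupby(level='I').sum())
# I
# 1    12.6
# 2    20.0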

Best way to get joint probability from 2D numpy

I was wondering if there is a better way to get joint probabilities from a 2D numpy array, maybe using some of numpy's built-in functions.
[['apple','pie'],
 ['apple','juice'],
 ['orange','pie'],
 ['strawberry','cream'],
 ['strawberry','candy']]
Would like to get the probability such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Here 'juice' as the second word has a probability of 0.2, since 'apple' has a probability of 2/5 and 'juice' has a probability of 1/2 given 'apple'.
On the other hand, 'pie' as a second word has a total probability of 0.4, combining the contributions from 'apple' and 'orange'.
The way I approached the problem was to add 3 new columns to the array: the probability of the 1st-column word, the probability of the 2nd-column word within its group, and the final probability. I group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.
Below is my code:
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  ## TRANSPOSE a, AND GET unique
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  ## ADD 3 COLUMNS to a
groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:,0] == _unique  ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count/num_rows
    groups.append(curr_a)
## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
I was wondering if there is a shorter/faster way of doing this using numpy or other means. Adding columns is not necessary; that was just my way of doing it. Other approaches are acceptable.
Based on the definition of the probability distribution you have given, you can use pandas to do the same, i.e.:
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Find the frequency of the first word and divide by the total number of rows
df[2] = df[0].map(df[0].value_counts())/df.shape[0]
# Divide 1 by the number of repetitions of the first word
df[3] = 1/(df[0].map(df[0].value_counts()))
# Multiply the probabilities
df[4] = df[2]*df[3]
Output:
0 1 2 3 4
0 apple pie 0.4 0.5 0.2
1 apple juice 0.4 0.5 0.2
2 orange pie 0.2 1.0 0.2
3 strawberry cream 0.4 0.5 0.2
4 strawberry candy 0.4 0.5 0.2
If you want that in the form of list you can use df.values.tolist()
If you don't want the intermediate columns, then:
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
0 1 2
0 apple pie 0.2
1 apple juice 0.2
2 orange pie 0.2
3 strawberry cream 0.2
4 strawberry candy 0.2
For the combined probability of each second word, print(df.groupby(1)[2].sum()):
candy 0.2
cream 0.2
juice 0.2
pie 0.4
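If you want to stay in numpy, the counting can also be done with np.unique and its inverse index instead of explicit grouping. A sketch follows; encoding the (first, second) pairs as strings with np.char.add is just one convenient way to count pairs:

import numpy as np

a = np.array([['apple', 'pie'], ['apple', 'juice'], ['orange', 'pie'],
              ['strawberry', 'cream'], ['strawberry', 'candy']])

# P(first word): count each distinct first word, broadcast back per row via the inverse index
first, inv1, cnt1 = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = cnt1[inv1] / len(a)

# P(second | first): count each (first, second) pair and divide by the first-word count
pair_keys = np.char.add(np.char.add(a[:, 0], '|'), a[:, 1])
pairs, inv2, cnt2 = np.unique(pair_keys, return_inverse=True, return_counts=True)
p_second_given_first = cnt2[inv2] / cnt1[inv1]

joint = p_first * p_second_given_first  # equals cnt2[inv2] / len(a)
for row, p1, p2, p in zip(a, p_first, p_second_given_first, joint):
    print(list(row), p1, p2, p)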
