Unwrap a Frequency Table in pandas - python

I have a dataframe A that looks like this
value  Frequency
0.1    3
0.2    2
and I want to convert it to dataframe B like below
Sample
0.1
0.1
0.1
0.2
0.2
Simply put, dataframe A holds the samples and their frequencies (repetition counts); dataframe B literally expands that. Is there a straightforward way to do this?
What I did (minimal working example reproducing the above):
import pandas as pd

X = pd.DataFrame([(0.1, 3), (0.2, 2)], columns=['value', 'Frequency'])
Sample = list()
for index, row in X.iterrows():
    Value = row['value']
    Freq = int(row['Frequency'])
    Sample = Sample + [Value] * Freq
Data = pd.DataFrame({'Sample': pd.Series(Sample)})

You can use Series.repeat, where the repeats argument can also be a series of ints:
df.value.repeat(df.Frequency).reset_index(drop=True).to_frame('Sample')
   Sample
0     0.1
1     0.1
2     0.1
3     0.2
4     0.2
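If you prefer to stay in numpy, np.repeat does the same element-wise repetition. A minimal sketch, assuming the same df with value and Frequency columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [0.1, 0.2], 'Frequency': [3, 2]})
# repeat each value by its matching count, then wrap in a dataframe
B = pd.DataFrame({'Sample': np.repeat(df['value'].to_numpy(), df['Frequency'].to_numpy())})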

Use repeat
>>> df['value'].repeat(df.Frequency)
0    0.1
0    0.1
0    0.1
1    0.2
1    0.2
Name: value, dtype: float64
Or create a new dataframe with
>>> pd.DataFrame(df['value'].repeat(df.Frequency).to_numpy(),columns=["Sample"])
   Sample
0     0.1
1     0.1
2     0.1
3     0.2
4     0.2

You can use reindex + repeat
X = X.reindex(X.index.repeat(X.Frequency))
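Note that reindex keeps the repeated original index labels, so if you want a clean 0..n-1 index like dataframe B you can reset it afterwards. A short sketch, assuming the same X as in the question:
# expand rows by their frequency, then tidy the index and columns
X = X.reindex(X.index.repeat(X.Frequency)).reset_index(drop=True)
X = X.drop(columns='Frequency').rename(columns={'value': 'Sample'})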

Related

Generate a simulated dataset in pandas

I would like to generate pandas dataframes with simulated data.
There should be x sets of columns.
Each set corresponds to y number of columns.
Each set should have a value, a, in z number of rows. The value, a, is a float.
However, z may be different for the different sets of column sets.
The remaining columns will have another value, b, which is also a float.
I would like to write a function to generate such pandas data frames where I can specify the variables x, y, a, b and where a specific value for z can be set for the individual column sets.
Here is an example df:
data = [[0.5, 0.5, 0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.5, 0.5, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.5, 0.5]]
df = pd.DataFrame(data, columns=['set1_col1', 'set1_col2', 'set2_col1', 'set2_col2', 'set3_col1', 'set3_col2'])
df
But I would like to be able to specify the variables, which for the above example would be:
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 1 #(for all column sets)
b = 0.1
Advice on this would be greatly appreciated!
Thanks!
Use numpy.random.choice:
import numpy as np
import pandas as pd

N = 5 #No of rows
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 1 #(for all column sets)
b = 0.1
#names of sets
sets = [f'set{w+1}' for w in range(x)]
#names of columns
cols = [f'col{w+1}' for w in range(y)]
#MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
#DataFrame with default value
df = pd.DataFrame(b, index=range(N), columns=mux)
#randomly assign a by random index with no repeats
for c, i in zip(df.columns.levels[0], np.random.choice(df.index, z * x, replace=False)):
    df.loc[i, c] = a
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
set1_col1 set1_col2 set2_col1 set2_col2 set3_col1 set3_col2
0 0.1 0.1 0.1 0.1 0.1 0.1
1 0.1 0.1 0.5 0.5 0.1 0.1
2 0.5 0.5 0.1 0.1 0.1 0.1
3 0.1 0.1 0.1 0.1 0.1 0.1
4 0.1 0.1 0.1 0.1 0.5 0.5
EDIT: For consecutive random values use:
N = 6 #No of rows
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 2 #(for all column sets)
b = 0.1
#names of sets
sets = [f'set{w+1}' for w in range(x)]
#names of columns
cols = [f'col{w+1}' for w in range(y)]
#MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
#DataFrame with default value; the index is created by consecutive groups
df = pd.DataFrame(b, index=np.arange(N) // z, columns=mux)
print (df)
#randomly assign a by random group index with no repeats
for c, i in zip(df.columns.levels[0],
                np.random.choice(np.unique(df.index), x, replace=False)):
    df.loc[i, c] = a
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index(drop=True)
print (df)
set1_col1 set1_col2 set2_col1 set2_col2 set3_col1 set3_col2
0 0.5 0.5 0.1 0.1 0.1 0.1
1 0.5 0.5 0.1 0.1 0.1 0.1
2 0.1 0.1 0.1 0.1 0.5 0.5
3 0.1 0.1 0.1 0.1 0.5 0.5
4 0.1 0.1 0.5 0.5 0.1 0.1
5 0.1 0.1 0.5 0.5 0.1 0.1
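The consecutive grouping works because integer division maps every z consecutive row positions to the same index label, so df.loc[i, c] = a assigns a to a whole block of rows at once. A minimal illustration:
import numpy as np
print(np.arange(6) // 2)  # [0 0 1 1 2 2]: rows 0-1, 2-3 and 4-5 share one label each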

If condition that adds new values as a whole column instead of one row in one column

I am new to Python and trying to write an if condition so that if a given probability is an even number (in its tenths digit), I add 0.1 to that number and store the result in a new column "New_Probability". However, when I run it, it adds the new value as a column name, with each original probability in every row.
data['New_Probability'] = data['Probability']
data.loc[data['Probability'] * 10 % 2 == 0, data['New_Probability']] = (data['Probability'] + .1)
This prevents the code from running correctly and changing only the intended probability values. Is there a better/easier way to do this, or do I just have something out of place?
You don't need a loop to do it, for example:
data = pd.DataFrame({'Probability':[0.1,0.2,0.3,0.4]})
You can use the boolean you have:
data['Probability'] * 10 % 2 == 0
0 False
1 True
2 False
3 True
Name: Probability, dtype: bool
We multiply this by 0.1; False counts as 0 and True as 1, so it gives the expected output in the end:
data['New_Probability'] = data['Probability'] \
    + 0.1 * (data['Probability'] * 10 % 2 == 0)
Probability New_Probability
0 0.1 0.1
1 0.2 0.3
2 0.3 0.3
3 0.4 0.5
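For completeness, the original .loc attempt can also be made to work by passing the column label (a string) as the second indexer rather than the column's values. A sketch on the same data:
data['New_Probability'] = data['Probability']
mask = data['Probability'] * 10 % 2 == 0
# 'New_Probability' (the label) selects the column; data['New_Probability'] (the values) does not
data.loc[mask, 'New_Probability'] = data['Probability'] + 0.1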

How to obtain the same result from the plus operation in Python as in math?

I run the following code:
k = 0
while k <= 1:
    print(k)
    k += 0.1
and get this result:
0
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
However, the expected output is
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
How can I make Python's output match the math?
Incrementing by a step size that's not exactly 0.1 (since that cannot be represented as a fixed-point binary number) will continually increase your error: the float that the literal 0.1 gets translated to is not exactly 0.1. Computing the nearest binary approximation of the correct fraction is a better way to go:
k = 0
while k <= 10:
    print(k / 10)
    k += 1
k / 10 will not be an exact representation of the numbers you want except for 0.0, 0.5 and 1.0, but since it will be the closest available float, it will print correctly.
As an aside, switching to integers allows you to rewrite your loop more idiomatically as
for k in range(11):
    print(k / 10)
Just use the built-in round function with a precision of 1:
In [1217]: k = 0
      ...: while k <= 1:
      ...:     print(round(k, 1))
      ...:     k += 0.1
      ...:
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
You can do it with the format() function as below:
k = 0
while k <= 1:
    print('{0:.2g}'.format(k))
    k += 0.1
An alternative to your code is to initialize a variable i with 0 and increment it on each iteration, setting k = 0.1 * i. You thereby avoid cumulative rounding errors.
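A minimal sketch of that idea; individual products may still need rounding for display, but the error no longer grows with each step:
i = 0
while i <= 10:
    k = 0.1 * i  # computed fresh each time, so no accumulated error
    print(round(k, 1))
    i += 1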

calculating a function using files

I have written a Python function for a calculation that uses the two data sets below. I want to calculate z for every value in data_2 using row1, row2, row3, row4, row5 of data_1. However, as I am new to Python, my attempt fails partway through. Please help. Thanks.
data_1                       data_2
file    a    b    c    d     x
file1   0.5  0.6  0.8  0.3   0.5
file1   0.2  0.2  0.4  0.1   0.8
file1   0.1  0.4  0.5  0.2   0.9
My attempted code is here:
import numpy as np

file1 = np.loadtxt('data_1', skiprows=1, usecols=(1, 2, 3))
file2 = np.loadtxt('data_2', skiprows=1, usecols=(0))

def calculation(a, b, c, x):
    z = (a + b + c) * x
    return z

for value in file2:
    print(value)
    calculation
My expected output should be something like:
data_3
file   a    b    c    d    z
file1  0.5  0.6  0.8  0.3  -
file1  0.5  0.6  0.8  0.3  -
file1  0.5  0.6  0.8  0.3  -
file1  0.2  0.2  0.4  0.1  -
file1  0.2  0.2  0.4  0.1  -
file1  0.2  0.2  0.4  0.1  -
file1  0.1  0.4  0.5  0.2  -
file1  0.1  0.4  0.5  0.2  -
file1  0.1  0.4  0.5  0.2  -
Python is a dynamic language and numpy tends to override normal operators to apply operations to entire collections of data. Often, if you have a for loop, you are not taking advantage of that.
numpy arrays can only hold a single data type, but you have a string in column 0. pandas wraps numpy and makes multiple data types easier to deal with, so I've switched to reading pandas.DataFrame objects instead of arrays.
It looks like you want the cartesian product of file2["x"] with the rows in file1. One way to do that is to create a dummy column in both dataframes that have matching values and then merge. Use the sum method for a + b + c and then multiply with x, and you have the result.
import pandas as pd
# read space separated tables
file1=pd.read_table('data_1', sep=r"\s+")
file2=pd.read_table('data_2', sep=r"\s+")
# we want (a+b+c)*x, for each value in file2["x"]. Do the sum, then
# use `merge` with a temporary key to create the cartesian product
# with x. For each `x`, merge will create a row for each matching
# key and since all keys match, we've got a cartesian product.
# Finally, multiply.
file1["_tmpsums"] = file1[["a", "b", "c"]].sum(axis=1)
file1["_tmpmergekey"] = file2["_tmpmergekey"] = 1
file1 = pd.merge(file1, file2, on="_tmpmergekey")
file1["z"] = file1["_tmpsums"] * file1["x"]
file1 = file1.drop(["_tmpsums", "_tmpmergekey", "x"], axis=1)
print(" data_3")
print(file1.to_string(col_space=6, index=False, justify="center"))
Result
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 0.95
file1 0.5 0.6 0.8 0.3 1.52
file1 0.5 0.6 0.8 0.3 1.71
file1 0.2 0.2 0.4 0.1 0.40
file1 0.2 0.2 0.4 0.1 0.64
file1 0.2 0.2 0.4 0.1 0.72
file1 0.1 0.4 0.5 0.2 0.50
file1 0.1 0.4 0.5 0.2 0.80
file1 0.1 0.4 0.5 0.2 0.90
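As an aside, on pandas 1.2+ the temporary merge key is unnecessary: pd.merge accepts how="cross" for a cartesian product directly. A sketch of the same computation under that assumption:
file1["_tmpsums"] = file1[["a", "b", "c"]].sum(axis=1)
file1 = pd.merge(file1, file2, how="cross")
file1["z"] = file1["_tmpsums"] * file1["x"]
file1 = file1.drop(["_tmpsums", "x"], axis=1)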
Using pandas as follows:
import pandas as pd
# Load Data
data_1 = pd.read_csv('data_1.txt', delimiter = r"\s+")
data_2 = pd.read_csv('data_2.txt', delimiter = r"\s+")
# Compute the cartesian product of data_1 with data_2
# since for each row in data_1, we need sequence of rows in data_2
# We do this using DataFrame merge by injecting a key that is repeated for each row
# i.e. 'merge_key'
data_1['merge_key'] = pd.Series([1]*len(data_1))
data_2['merge_key'] = pd.Series([1]*len(data_2))
df = pd.merge(data_1, data_2, on = 'merge_key')
# Drop merge key from result
df.drop('merge_key', axis = 'columns', inplace = True)
# DataFrame df now has columns File, a, b, c, d, x
# We can apply the function calculation to each row using apply
# and specifying the columns to send to calculation
df['z'] = df.apply(lambda row: calculation(row['a'], row['b'], row['c'], row['x']), axis = 'columns')
# Drop x column
df.drop('x', axis = 'columns', inplace = True)
# Write to CSV file
df.to_csv('data_3.txt', index=False, sep = " ")
Output
Pandas DataFrame df
file a b c d z
0 file1 0.5 0.6 0.8 0.3 0.95
1 file1 0.5 0.6 0.8 0.3 1.52
2 file1 0.5 0.6 0.8 0.3 1.71
3 file1 0.2 0.2 0.4 0.1 0.40
4 file1 0.2 0.2 0.4 0.1 0.64
5 file1 0.2 0.2 0.4 0.1 0.72
6 file1 0.1 0.4 0.5 0.2 0.50
7 file1 0.1 0.4 0.5 0.2 0.80
8 file1 0.1 0.4 0.5 0.2 0.90
CSV File data_3.txt
file a b c d z
file1 0.5 0.6 0.8 0.3 0.9500000000000001
file1 0.5 0.6 0.8 0.3 1.5200000000000002
file1 0.5 0.6 0.8 0.3 1.7100000000000002
file1 0.2 0.2 0.4 0.1 0.4
file1 0.2 0.2 0.4 0.1 0.6400000000000001
file1 0.2 0.2 0.4 0.1 0.7200000000000001
file1 0.1 0.4 0.5 0.2 0.5
file1 0.1 0.4 0.5 0.2 0.8
file1 0.1 0.4 0.5 0.2 0.9
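The long decimals in the CSV are ordinary binary floating-point artifacts. If you want the file rounded, to_csv accepts a float_format argument, for example:
df.to_csv('data_3.txt', index=False, sep=" ", float_format="%.2f")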
Basic Python
Same output
# Get data from first file
with open('data_1.txt', 'r') as f:
    # first file header
    header1 = f.readline()
    # Let's get the lines of data
    data_1 = []
    for line in f:
        new_data = line.rstrip().split()  # strip '\n' and split on whitespace
        for i in range(1, len(new_data)):
            new_data[i] = float(new_data[i])  # convert columns after file to float
        data_1.append(new_data)
# Get data from second file
with open('data_2.txt', 'r') as f:
    # second file header
    header2 = f.readline()
    # Let's get the lines of data
    data_2 = []
    for line in f:
        new_data = float(line.rstrip())  # only one value per line
        data_2.append(new_data)
with open('data_3.txt', 'w') as f:
    # Output file
    # Write header
    f.write("file a b c d z\n")
    # Use a double loop: all rows of data_2 for each row of data_1
    for v1 in data_1:
        # For each row in data_1
        file, a, b, c, d = v1  # unpack the values in v1 into individual variables
        for v2 in data_2:
            # for each row in data_2
            x = v2  # data_2 has just a single value per row
            # Calculation using the posted formula
            z = calculation(a, b, c, x)
            # Write result
            f.write(f"{file} {a} {b} {c} {d} {z}\n")
Numpy Version
import numpy as np

file1 = np.loadtxt('data_1.txt', skiprows=1, usecols=(1, 2, 3, 4))
file2 = np.loadtxt('data_2.txt', skiprows=1, usecols=(0))
with open('data_3.txt', 'w') as f:
    # Write header
    f.write("file a b c d z\n")
    # Double loop through the values of file1 and file2
    for val1 in file1:
        for val2 in file2:
            # Only use the first 3 values (val1[:3]) so that d is ignored
            z = calculation(*val1[:3], val2)  # *val1[:3] unpacks the values into calculation
            # Write result:
            #   map(str, val1) converts the values to strings
            #   str(z) converts z to a string
            #   ' '.join([*map(str, val1), str(z)]) builds a space-separated line
            f.write(' '.join([*map(str, val1), str(z)]) + "\n")
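In the spirit of the remark about loops above, the double loop in the numpy version can also be replaced by broadcasting. A sketch, assuming the same file1 and file2 arrays:
sums = file1[:, :3].sum(axis=1)      # a + b + c for each row of data_1
z = (sums[:, None] * file2).ravel()  # outer product pairs every row with every x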

Best way to get joint probability from 2D numpy

I was wondering if there is a better way to get the joint probability from a 2D numpy array, maybe using some of numpy's built-in functions.
For simplicity, say we have an example array:
[['apple','pie'],
['apple','juice'],
['orange','pie'],
['strawberry','cream'],
['strawberry','candy']]
Would like to get the probability such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Here 'juice' as the second word has a probability of 0.2, since 'apple' has probability 2/5 and 'juice' given 'apple' has probability 1/2. On the other hand, 'pie' as a second word has a probability of 0.4: the combination of the probabilities from 'apple' and 'orange'.
The way I approached the problem was to add 3 new columns to the array, for the probability of the 1st column, the probability of the 2nd column, and the final probability, then group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.
Below is my code:
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  ## TRANSPOSE a, AND GET unique
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  ## ADD 3 COLUMNS to a
groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique  ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
I was wondering if there is a better, shorter/faster way of doing this using numpy or other means. Adding the columns is not necessary; that was just my way of doing it. Other approaches are acceptable.
Based on the definition of the probability distribution you have given, you can use pandas to do the same, i.e.
import numpy as np
import pandas as pd
a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Find the frequency of first word and divide by the total number of rows
df[2]=df[0].map(df[0].value_counts())/df.shape[0]
# Divide 1 by the total repetion
df[3]=1/(df[0].map(df[0].value_counts()))
# Multiply the probabilities
df[4]= df[2]*df[3]
Output:
0 1 2 3 4
0 apple pie 0.4 0.5 0.2
1 apple juice 0.4 0.5 0.2
2 orange pie 0.2 1.0 0.2
3 strawberry cream 0.4 0.5 0.2
4 strawberry candy 0.4 0.5 0.2
If you want that in the form of a list, you can use df.values.tolist()
If you don't want the intermediate columns, then
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
0 1 2
0 apple pie 0.2
1 apple juice 0.2
2 orange pie 0.2
3 strawberry cream 0.2
4 strawberry candy 0.2
For the combined probability, print(df.groupby(1)[2].sum())
candy 0.2
cream 0.2
juice 0.2
pie 0.4
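For a numpy-only route, np.unique with return_inverse and return_counts gives the first-word probabilities without loops. A sketch under the same assumption as the pandas answer (each (first, second) pair occurs once, so P(second | first) = 1 / count(first)):
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],
              ['strawberry','cream'],['strawberry','candy']])
_, inv, cnt = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = cnt[inv] / len(a)   # P(first word), e.g. 0.4 for 'apple'
p_joint = p_first / cnt[inv]  # multiplied by 1/count(first) = P(second | first)
print(p_joint)                # [0.2 0.2 0.2 0.2 0.2]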
