I would like to generate pandas DataFrames with simulated data.
There should be x sets of columns.
Each set consists of y columns.
Each set should have a value, a, in z of its rows. The value a is a float.
However, z may be different for the different column sets.
All remaining cells will hold another value, b, which is also a float.
I would like to write a function that generates such pandas DataFrames, where I can specify the variables x, y, a, and b, and where a specific value of z can be set for each individual column set.
Here is an example df:
import pandas as pd

data = [[0.5, 0.5, 0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.5, 0.5, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.5, 0.5]]
df = pd.DataFrame(data, columns=['set1_col1', 'set1_col2', 'set2_col1', 'set2_col2', 'set3_col1', 'set3_col2'])
df
But I would like to be able to specify the variables, which for the above example would be:
x = 3 #(set1, set2, set3)
y = 2 #(col1, col2 for each set)
a = 0.5
z = 1 #(for all column sets)
b = 0.1
Advice on this would be greatly appreciated!
Thanks!
Use numpy.random.choice:
import numpy as np
import pandas as pd

N = 5  # number of rows
x = 3  # number of sets (set1, set2, set3)
y = 2  # columns per set (col1, col2)
a = 0.5
z = 1  # rows with value a, for all column sets
b = 0.1

# names of sets
sets = [f'set{w+1}' for w in range(x)]
# names of columns
cols = [f'col{w+1}' for w in range(y)]
# MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
# DataFrame filled with the default value
df = pd.DataFrame(b, index=range(N), columns=mux)
# randomly assign a to z rows of each set, drawing row indices without repeats
idx = np.random.choice(df.index, z * x, replace=False).reshape(x, z)
for c, i in zip(df.columns.levels[0], idx):
    df.loc[i, c] = a
# flatten the MultiIndex into set_col names
df.columns = df.columns.map(lambda t: f'{t[0]}_{t[1]}')
print(df)
set1_col1 set1_col2 set2_col1 set2_col2 set3_col1 set3_col2
0 0.1 0.1 0.1 0.1 0.1 0.1
1 0.1 0.1 0.5 0.5 0.1 0.1
2 0.5 0.5 0.1 0.1 0.1 0.1
3 0.1 0.1 0.1 0.1 0.1 0.1
4 0.1 0.1 0.1 0.1 0.5 0.5
EDIT: If the z rows with value a should be consecutive for each set, use:
import numpy as np
import pandas as pd

N = 6  # number of rows
x = 3  # number of sets (set1, set2, set3)
y = 2  # columns per set (col1, col2)
a = 0.5
z = 2  # consecutive rows with value a, for all column sets
b = 0.1

# names of sets
sets = [f'set{w+1}' for w in range(x)]
# names of columns
cols = [f'col{w+1}' for w in range(y)]
# MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
# DataFrame with the default value; the index is built from consecutive groups
df = pd.DataFrame(b, index=np.arange(N) // z, columns=mux)
# randomly assign one group of consecutive rows to each set, with no repeats
for c, i in zip(df.columns.levels[0],
                np.random.choice(np.unique(df.index), x, replace=False)):
    df.loc[i, c] = a
# flatten the MultiIndex into set_col names and restore a clean index
df.columns = df.columns.map(lambda t: f'{t[0]}_{t[1]}')
df = df.reset_index(drop=True)
print(df)
set1_col1 set1_col2 set2_col1 set2_col2 set3_col1 set3_col2
0 0.5 0.5 0.1 0.1 0.1 0.1
1 0.5 0.5 0.1 0.1 0.1 0.1
2 0.1 0.1 0.1 0.1 0.5 0.5
3 0.1 0.1 0.1 0.1 0.5 0.5
4 0.1 0.1 0.5 0.5 0.1 0.1
5 0.1 0.1 0.5 0.5 0.1 0.1
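To get the function the question asks for, the same logic can be wrapped up with a per-set z. A sketch (make_df is a hypothetical name; here the a-rows are drawn independently per set, so two sets may occasionally mark the same row):

import numpy as np
import pandas as pd

def make_df(x, y, a, b, z, n_rows):
    # x sets of y columns; set k gets the value a in z[k] random rows, b elsewhere
    zs = [z] * x if np.isscalar(z) else list(z)  # one z for all sets, or one per set
    sets = [f'set{w+1}' for w in range(x)]
    cols = [f'col{w+1}' for w in range(y)]
    mux = pd.MultiIndex.from_product([sets, cols])
    df = pd.DataFrame(b, index=range(n_rows), columns=mux)
    for s, zk in zip(sets, zs):
        rows = np.random.choice(df.index, zk, replace=False)  # zk rows for this set
        df.loc[rows, s] = a
    df.columns = df.columns.map(lambda t: f'{t[0]}_{t[1]}')
    return df

print(make_df(x=3, y=2, a=0.5, b=0.1, z=1, n_rows=3))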
I run the following code:
k = 0
while k <= 1:
    print(k)
    k += 0.1
and get this result:
0
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
However, the expected output is
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
How can I make Python's output match the mathematical result?
Incrementing by a step size that is not exactly 0.1 (0.1 cannot be represented exactly as a binary floating-point number) continually increases your error: the float that the literal 0.1 translates to is not exactly 0.1. Computing the nearest binary approximation of the correct fraction at each step is a better way to go:
k = 0
while k <= 10:
    print(k / 10)
    k += 1
k / 10 will not be an exact representation of the numbers you want except for 0.0, 0.5, 1.0, but since it will be the closest available float, it will print correctly.
As an aside, switching to integers allows you to rewrite your loop more idiomatically as
for k in range(11):
    print(k / 10)
Just use the built-in round function with a precision of 1:
In [1217]: k = 0
...: while k<=1:
...: print(round(k,1))
...: k += 0.1
...:
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
You can do it with the format() function, as below:
k = 0
while k <= 1:
    print('{0:.2g}'.format(k))
    k += 0.1
An alternative to your code is to initialize a counter variable i with 0 and increment it on each iteration, then set k = 0.1 * i.
You thereby avoid cumulative rounding errors, since each k comes from a single multiplication rather than repeated additions.
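A minimal sketch of that counter-based loop (rounding on print is still needed, because the single multiplication by 0.1 can itself leave one last rounding step):

i = 0
while i <= 10:
    k = 0.1 * i  # one rounding step per value, never compounded
    print(round(k, 1))
    i += 1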
I have written a Python function for a calculation that uses the two data sets below. I want to calculate z for every value in data_2 using each row of data_1. However, as I am new to Python, my attempt fails partway through. Please help. Thanks!
data_1                        data_2
file   a    b    c    d       x
file1  0.5  0.6  0.8  0.3     0.5
file1  0.2  0.2  0.4  0.1     0.8
file1  0.1  0.4  0.5  0.2     0.9
Here is the code I tried:
import numpy as np
file1 = np.loadtxt('data_1', skiprows=1, usecols=(1, 2, 3))
file2 = np.loadtxt('data_2', skiprows=1, usecols=(0))
def calculation(a, b, c, x):
    z = (a + b + c) * x
    return z
for value in file2:
    print(value)
    calculation
My expected output should be something like:
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 -
file1 0.5 0.6 0.8 0.3 -
file1 0.5 0.6 0.8 0.3 -
file1 0.2 0.2 0.4 0.1 -
file1 0.2 0.2 0.4 0.1 -
file1 0.2 0.2 0.4 0.1 -
file1 0.1 0.4 0.5 0.2 -
file1 0.1 0.4 0.5 0.2 -
file1 0.1 0.4 0.5 0.2 -
Python is a dynamic language, and numpy tends to override normal operators to apply operations to entire collections of data. Often, if you have a for loop, you are not taking advantage of that.
numpy arrays can only hold a single data type, but you have a string in column 0. pandas wraps numpy and makes multiple data types easier to deal with, so I've switched to reading pandas.DataFrame objects instead of arrays.
It looks like you want the cartesian product of file2["x"] with the rows in file1. One way to do that is to create a dummy column in both dataframes that have matching values and then merge. Use the sum method for a + b + c and then multiply with x, and you have the result.
import pandas as pd
# read space separated tables
file1=pd.read_table('data_1', sep=r"\s+")
file2=pd.read_table('data_2', sep=r"\s+")
# we want (a+b+c)*x, for each value in file2["x"]. Do the sum, then
# use `merge` with a temporary key to create the cartesian product
# with x. For each `x`, merge will create a row for each matching
# key and since all keys match, we've got a cartesian product.
# Finally, multiply.
file1["_tmpsums"] = file1[["a", "b", "c"]].sum(axis=1)
file1["_tmpmergekey"] = file2["_tmpmergekey"] = 1
file1 = pd.merge(file1, file2, on="_tmpmergekey")
file1["z"] = file1["_tmpsums"] * file1["x"]
file1 = file1.drop(["_tmpsums", "_tmpmergekey", "x"], axis=1)
print(" data_3")
print(file1.to_string(col_space=6, index=False, justify="center"))
Result
data_3
file a b c d z
file1 0.5 0.6 0.8 0.3 0.95
file1 0.5 0.6 0.8 0.3 1.52
file1 0.5 0.6 0.8 0.3 1.71
file1 0.2 0.2 0.4 0.1 0.40
file1 0.2 0.2 0.4 0.1 0.64
file1 0.2 0.2 0.4 0.1 0.72
file1 0.1 0.4 0.5 0.2 0.50
file1 0.1 0.4 0.5 0.2 0.80
file1 0.1 0.4 0.5 0.2 0.90
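As an aside, pandas 1.2+ supports a built-in cross merge, which removes the need for the temporary key. A sketch, applied to the freshly read file1 and file2 (before any temporary columns are added):

# pandas >= 1.2: cartesian product without a dummy key
df = pd.merge(file1, file2, how="cross")
df["z"] = (df["a"] + df["b"] + df["c"]) * df["x"]
df = df.drop("x", axis=1)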
Using pandas, as follows:
import pandas as pd
# Load Data
data_1 = pd.read_csv('data_1.txt', delimiter = r"\s+")
data_2 = pd.read_csv('data_2.txt', delimiter = r"\s+")
# Compute the cartesian product of data_1 with data_2
# since for each row in data_1, we need sequence of rows in data_2
# We do this using DataFrame merge by injecting a key that is repeated for each row
# i.e. 'merge_key'
data_1['merge_key'] = pd.Series([1]*len(data_1))
data_2['merge_key'] = pd.Series([1]*len(data_2))
df = pd.merge(data_1, data_2, on = 'merge_key')
# Drop merge key from result
df.drop('merge_key', axis = 'columns', inplace = True)
# DataFrame df now has columns file, a, b, c, d, x
# We can apply the calculation function (as defined in the question)
# to each row using apply, specifying the columns to send to it
df['z'] = df.apply(lambda row: calculation(row['a'], row['b'], row['c'], row['x']), axis = 'columns')
# Drop x column
df.drop('x', axis = 'columns', inplace = True)
# Write to CSV file
df.to_csv('data_3.txt', index=False, sep = " ")
Output
Pandas DataFrame df
file a b c d z
0 file1 0.5 0.6 0.8 0.3 0.95
1 file1 0.5 0.6 0.8 0.3 1.52
2 file1 0.5 0.6 0.8 0.3 1.71
3 file1 0.2 0.2 0.4 0.1 0.40
4 file1 0.2 0.2 0.4 0.1 0.64
5 file1 0.2 0.2 0.4 0.1 0.72
6 file1 0.1 0.4 0.5 0.2 0.50
7 file1 0.1 0.4 0.5 0.2 0.80
8 file1 0.1 0.4 0.5 0.2 0.90
CSV File data_3.txt
file a b c d z
file1 0.5 0.6 0.8 0.3 0.9500000000000001
file1 0.5 0.6 0.8 0.3 1.5200000000000002
file1 0.5 0.6 0.8 0.3 1.7100000000000002
file1 0.2 0.2 0.4 0.1 0.4
file1 0.2 0.2 0.4 0.1 0.6400000000000001
file1 0.2 0.2 0.4 0.1 0.7200000000000001
file1 0.1 0.4 0.5 0.2 0.5
file1 0.1 0.4 0.5 0.2 0.8
file1 0.1 0.4 0.5 0.2 0.9
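If the long floating-point tails in the CSV bother you, to_csv accepts a float_format argument; for example:

# write floats rounded to two decimal places
df.to_csv('data_3.txt', index=False, sep=" ", float_format="%.2f")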
Basic Python
Same output
# Get data from first file
with open('data_1.txt', 'r') as f:
    # first file header
    header1 = f.readline()
    # Let's get the lines of data
    data_1 = []
    for line in f:
        new_data = line.rstrip().split()  # strip '\n' and split on whitespace
        for i in range(1, len(new_data)):
            new_data[i] = float(new_data[i])  # convert columns after file to float
        data_1.append(new_data)

# Get data from second file
with open('data_2.txt', 'r') as f:
    # second file header
    header2 = f.readline()
    # Let's get the lines of data
    data_2 = []
    for line in f:
        new_data = float(line.rstrip())  # only one value per line
        data_2.append(new_data)

# Output file
with open('data_3.txt', 'w') as f:
    # Write header
    f.write("file a b c d z\n")
    # Use a double loop: all rows of data_2 for each row in data_1
    for v1 in data_1:
        # unpack the values in v1 into individual variables
        file, a, b, c, d = v1
        for v2 in data_2:
            # data_2 just has a single value per row
            x = v2
            # Calculation using the posted formula
            z = calculation(a, b, c, x)
            # Write result
            f.write(f"{file} {a} {b} {c} {d} {z}\n")
Numpy Version
import numpy as np

# numeric columns a, b, c, d (column 0 holds the file name string)
file1 = np.loadtxt('data_1.txt', skiprows=1, usecols=(1, 2, 3, 4))
file2 = np.loadtxt('data_2.txt', skiprows=1, usecols=(0,))
# read the file names separately, since a numpy array holds a single dtype
names = np.loadtxt('data_1.txt', skiprows=1, usecols=(0,), dtype=str)

with open('data_3.txt', 'w') as f:
    # Write header
    f.write("file a b c d z\n")
    # Double loop through the values of file1 and file2
    for name, val1 in zip(names, file1):
        for val2 in file2:
            # Only use the first 3 values (val1[:3]), so d is ignored
            z = calculation(*val1[:3], val2)  # unpack a, b, c into the posted calculation
            # map(str, val1) converts the values to strings;
            # ' '.join(...) builds a space separated line
            f.write(' '.join([name, *map(str, val1), str(z)]) + "\n")
I have a dataframe A that looks like this
value Frequency
0.1 3
0.2 2
and I want to convert it to dataframe B like below
Sample
0.1
0.1
0.1
0.2
0.2
Simply put, dataframe A is the samples and their frequency (repetition). Dataframe B is literally expanding that. Is there a straightforward way to do this?
what I did (minimal working example reproducing above):
import pandas as pd

X = pd.DataFrame([(0.1, 3), (0.2, 2)], columns=['value', 'Frequency'])
Sample = list()
for index, row in X.iterrows():
    Value = row['value']
    Freq = int(row['Frequency'])
    Sample = Sample + [Value] * Freq
Data = pd.DataFrame({'Sample': pd.Series(Sample)})
You can use Series.repeat, where the repeats argument can also be a series of ints:
df.value.repeat(df.Frequency).reset_index(drop=True).to_frame('Sample')
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
Use repeat
>>> df['value'].repeat(df.Frequency)
0 0.1
0 0.1
0 0.1
1 0.2
1 0.2
Name: value, dtype: float64
Or create a new dataframe with:
>>> pd.DataFrame(df['value'].repeat(df.Frequency).to_numpy(),columns=["Sample"])
Sample
0 0.1
1 0.1
2 0.1
3 0.2
4 0.2
You can use reindex + repeat
X = X.reindex(X.index.repeat(X.Frequency))
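Note that reindex keeps the repeated index labels; if you also want a fresh RangeIndex and the Sample column name from the question, something like:

X = X.reindex(X.index.repeat(X.Frequency)).reset_index(drop=True)
Data = X[['value']].rename(columns={'value': 'Sample'})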
I am trying to print the items of two separate lists so that items in list_1 align with items in list_2.
Here is my attempt:
import numpy as np
list_1 = [1, 2, 3, 4]
list_2 = np.arange(0.1, 0.4, 0.1)
for x in list_1:
    j = x / 2.0
    for y in list_2:
        print j, ',', y
My Output:
0.5 , 0.1
0.5 , 0.2
0.5 , 0.3
0.5 , 0.4
1.0 , 0.1
1.0 , 0.2
1.0 , 0.3
1.0 , 0.4
1.5 , 0.1
1.5 , 0.2
1.5 , 0.3
1.5 , 0.4
2.0 , 0.1
2.0 , 0.2
2.0 , 0.3
2.0 , 0.4
Desired Output:
0.5 , 0.1
1.0 , 0.2
1.5 , 0.3
2.0 , 0.4
What you want is zip().
Example:
>>> l1 = range(10)
>>> l2 = range(20,30)
>>> for x, y in zip(l1, l2):
...     print x, y
0 20
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
Explanation:
zip receives iterables and iterates over all of them at once, starting from element 0 of each, then element 1, and so on. Once any of the iterables reaches its end, zip stops. You can use izip_longest from itertools to fill the missing items in shorter iterables with None (or you can do fancier things, but that is for a different question).
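Applied to the question's lists (Python 3 print syntax; note that np.arange excludes its stop value, so the stop is raised to 0.5 here so that 0.4 is included):

import numpy as np

list_1 = [1, 2, 3, 4]
list_2 = np.arange(0.1, 0.5, 0.1)  # stop is exclusive: yields 0.1, 0.2, 0.3, 0.4

for x, y in zip(list_1, list_2):
    print(x / 2.0, ',', round(y, 1))  # round away arange's floating-point noise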