I have a problem with a DataFrame calculation over a range. In the first row I calculate and add the data; every subsequent row depends on the previous one, so the first formula is "different" and the rest repeat. I did this in a DataFrame and it works, but very slowly. All the other data is already in the DataFrame.
import pandas as pd
import numpy as np
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = calc[0]
calc['op_ol'][0] = calc[0][0]
for ee in range(1,5):
    calc['op_ol'][ee] = 0 if calc['op_ol'][ee-1] == 0 else calc[0][ee-1] * calc['op_ol'][ee-1]
How could I speed this up?
Loops are generally slow with pandas. I suggest these lines instead:
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])
Here cumprod is the cumulative product; the result is shifted down by one, with the first value used as the fill.
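A quick check (mine, not part of the original answer) that the vectorized line matches the loop is to build the column both ways on the same data and compare:

import numpy as np
import pandas as pd

calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5, 1)))

# Loop version from the question, written into a plain Series
loop = calc[0].copy()
for ee in range(1, 5):
    loop[ee] = 0 if loop[ee - 1] == 0 else calc[0][ee - 1] * loop[ee - 1]

# Vectorized version from the answer
vectorized = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])

print((loop == vectorized).all())  # should print True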
Is this an efficient method of updating columns based on conditions in other columns using pandas?
I am looking to generalize an update function that shifts Gaussian values, and I had difficulty using lambda because there are multiple columns that could act as conditions. Similarly, apply was problematic because I couldn't get the arguments into the form it wanted, though honestly I probably could have spent more time on that part.
Problem statement:
How should I handle updating large pandas DataFrames based on a value in another column, in such a way that I can run many of these functions at an acceptable speed? Please respond with a complete example and, if possible, use my 'silly_series_generator' to make sure we stay on the same problem case. Thanks.
import random
import pandas as pd

def silly_series_generator():
    # requires import of random and pandas
    ret = []
    ret.append(random.choice(['X', 'Y', 'Z']))
    for i in range(9):
        ret.append(random.gauss(0, 1))
    return pd.Series(ret, list("ABCDEFGHIJ"))

def silly_update(df, condition_col, condition_value, target_col, mean, sd=.1):
    # requires import of random and pandas
    effected_cells = df[condition_col] == condition_value[0]
    x = df[effected_cells][target_col] + random.gauss(mean, sd)
    df[target_col].update(x)
    return df

def run_test():
    # requires import of random and pandas
    # requires functions: silly_series_generator and silly_update
    rows = []
    for i in range(50):
        rows.append(silly_series_generator())
    original_df = pd.DataFrame(rows)
    print('original_df', original_df['B'].mean())
    updated_df = silly_update(original_df, 'A', 'X', 'B', 1)
    print('updated_df', updated_df['B'].mean())

if __name__ == "__main__":
    run_test()
I'm not sure the examples below are any faster (I'm sure the apply() one is slower), but this is how I would do it. Looking back at your problem, I'm not sure it's even different enough to write up, but here it is.
Make the data
import numpy as np
import pandas as pd
import random

def silly_series_generator():
    # requires import of random and pandas
    ret = []
    ret.append(random.choice(['X', 'Y', 'Z']))
    for i in range(9):
        ret.append(random.gauss(0, 1))
    return pd.Series(ret, list("ABCDEFGHIJ"))

rows = []
for i in range(50):
    rows.append(silly_series_generator())

df = pd.DataFrame(rows)
Using apply
I think apply is typically the slowest route because it runs one row at a time. However, I still like it, so here's an example. We can provide the extra arguments to apply() as keyword arguments.
def update(row, condition_col, condition_value, target_col, mean, sd=.1):
    if row[condition_col] == condition_value:
        v = row[target_col] + random.gauss(mean, sd)
    else:
        v = row[target_col]
    return v

df['B'] = df.apply(update, axis=1, condition_col='A', condition_value='X', target_col='B', mean=1)
Using a mask
This is basically what you did; I just used .loc[] instead of .update(). I'm not sure if it's any faster, but it's another option.
mask = df['A'] == 'X'
df.loc[mask, 'B'] = df['B'] + random.gauss(1, 0.1)
Using a mask - new random value for each row
It's unclear whether you want the same random number added to each row. The way we have it set up now, the same random number is added to everything that matches. It's likely you want each value shifted by a different random number.
Here's an example of making a new random number for each row. I'm leaving around some extra columns for debug.
mask = df['A'] == 'X'
# Generate a random number for each row
# df['r'] = np.random.normal(1, 0.1, size=df.shape[0])
# Only generate the random numbers for the mask locations
df.loc[mask, 'r'] = np.random.normal(1, 0.1, size=mask.sum())
df.loc[mask, 'Bprime'] = df['B'] + df['r']
I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c / len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually has around 50-100. This means the task easily gets very time-consuming. I know there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is cython, but you can also change the engine to numba, or use the njit decorator directly, to speed things up. Look up the pandas 'Enhancing performance' (enhancingperf) docs.
Numba converts Python code to optimized machine code. pandas is highly integrated with numpy, and hence with numba as well. You can experiment with the parallel, nogil, cache, and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With numba you can do eager compilation, or the first execution will take a little time for compilation and subsequent calls will be fast.
import pandas as pd
import numpy as np
import numba as nb

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
a = df1.values

# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())

# This is just an illustration, not the correct logic. Change the logic according to your needs:
# @nb.njit((nb.int64,))
# def f(x):
#     sum = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             sum += (x[i] == a[j]).sum()
#     return sum

# Experiment with the engine
print(df2['Sequence'].apply(f))
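If you want to try the eager-compilation route mentioned above, here is a minimal sketch (the explicit signature and the element-wise overlap logic are my assumptions; it expects int64 arrays, which is what np.random.randint returns on most 64-bit platforms):

import numba as nb

# Compiled eagerly at definition time because the signature is given up front.
# parallel=True / nb.prange, cache=True or fastmath=True could be layered on top as experiments.
@nb.njit("float64(int64[:, :], int64[:, :])")
def overlap(seq, ref):
    matches = 0
    for i in range(seq.shape[0]):
        for j in range(seq.shape[1]):
            if seq[i, j] == ref[i, j]:
                matches += 1
    return matches / seq.size

# Same call pattern as above, reusing `a = df1.values`
print(df2['Sequence'].apply(lambda x: overlap(x, a)))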
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
I am running a simple Python script for MC (Monte Carlo). Basically it reads through every row in the dataframe and selects the max and min of the two variables. Then the simulation is run 1000 times, selecting a random value between the min and max, computing the product, and writing the P50 value back to the datatable.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = (row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1))
        ht = (row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1))
        outdata[k] = phi * ht
    df['out_p50'] = np.percentile(outdata, 50)

print(df)
By df['out_p50'] = np.percentile(outdata,50) you are saying that you want the whole column to be set to the given value, not a specific row of the column. Therefore the numbers are generated and saved, but they are saved to the whole column, and in the end you see the last generated number in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata,50) to specify the specific row you want to set.
Yup, you're writing a scalar value to the entire column, and you overwrite that value on each iteration. If you want, you can simply specify the row with df.loc for a quick fix. Also consider using np.median(outdata) instead of np.percentile(outdata, 50).
Perhaps the most important feature of pandas is its built-in support for vectorization: you work with entire columns of data rather than looping through the DataFrame. Think of it like a list comprehension where you don't need the for row in df iteration at the end.
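As a concrete illustration (my sketch, not the only way to do it), the whole simulation above can be done with one vectorized draw per variable; np.random.uniform broadcasts the per-row bounds across the simulation axis, and np.median gives the P50 per row:

import numpy as np
import pandas as pd

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

# One (rows x NumSim) block of draws per variable, no Python loops
phi = np.random.uniform(df['P_min'].values[:, None], df['P_max'].values[:, None], (len(df), NumSim))
ht = np.random.uniform(df['H_min'].values[:, None], df['H_max'].values[:, None], (len(df), NumSim))

# Median (P50) of the simulated products, one value per row
df['out_p50'] = np.median(phi * ht, axis=1)
print(df)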
I am sorting every column of a very large pandas dataframe using a for loop. However, this process is taking a very long time because the dataframe has more than 1 million columns. I want this process to run much faster than it is running right now.
This is the code I have at the moment:
top25s = []
for i in range(1, len(mylist)):
    topchoices = df.sort_values(i, ascending=False).iloc[0:25, 0].values
    top25s.append(topchoices)
Here len(mylist) is 14256 but can easily go up to more than 1000000 in the future. df has a dimension of 343 rows × 14256 columns.
Thanks for all of your inputs!
You can use nlargest:
df.apply(lambda x: x.nlargest(25).reset_index(drop=True))
But I doubt this will gain you much time honestly. As commented, you just have a lot of data to go through.
I'd propose using a bit of help from numpy, which should speed things up significantly.
The following code will return a 2D numpy array with the top 25 elements of each column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(50,100)) # Generate random data
rank = df.rank(axis = 0, ascending=False)
top25s = np.extract(rank<=25, df).reshape(25, 100)
I am relatively new to python and numpy. I am currently trying to replicate the following table as shown in the image in python using numpy.
As in the figure, the columns "group, sub_group, value" are populated. I want to transpose the column "sub_group" and do a simple calculation, i.e. value minus shift(value), and display the figure in the lower diagonal of the matrix for each group. If sub_group is "0", then assign the whole column as 0. The transposed sub_group columns can be named anything (preferably index numbers) if it makes things easier. I am OK with a pandas solution as well; I just think pandas may be slow?
Below is code in array form:
import numpy as np
a=np.array([(1,-1,10),(1,0,10),(1,-2,15),(1,-3,1),(1,-4,1),(1,0,12),(1,-5,16)], dtype=[('group',float),('sub_group',float),('value',float)])
Any help would be appreciated.
Regards,
S
Try this out :
import numpy as np
import pandas as pd

a = np.array([(1,-1,10),(1,0,10),(1,-2,15),(1,-3,1),(1,-4,1),(1,0,12),(1,-5,16)],
             dtype=[('group',float),('sub_group',float),('value',float)])
df = pd.DataFrame(a)

for i in df.index:
    col_name = str(int(df['sub_group'][i]))
    df[col_name] = None
    if df['sub_group'][i] == 0:
        df[col_name] = 0
    else:
        val = df['value'][i]
        for j in range(i, df.index[-1] + 1):
            df.loc[j, col_name] = val - df['value'][j]
For the upper triangle of the matrix, I have put None values. You can replace them with whatever you want.
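For instance (my addition, assuming the df built above), a one-liner to show zeros there instead:

df = df.fillna(0)  # replace the None cells in the upper triangle with 0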
This piece of code does the calculation for the sub_group example. I am not sure if this is what you actually want; if not, post a comment here and I will edit.
import numpy as np

array_1 = np.array([(1,-1,10),(1,0,10),(1,-2,15),(1,-3,1),(1,-4,1),(1,0,12),(1,-5,16)])

# transpose the matrix
transposed_group = array_1.transpose()

# loop over the first row
for i in range(0, len(transposed_group[1, :])):
    # value[i] minus the first value of the row
    transposed_group[0, i] = transposed_group[0, i] - transposed_group[0, 0]

print(transposed_group)
In case you want to display that on the diagonal of the matrix, you can loop through the rows and columns, for example:
import numpy as np

# create an array of zeros
array = np.zeros(shape=(3, 3))
print(array)

# loop over rows
for i in range(0, len(array[:, 1])):
    # loop over columns
    for j in range(0, len(array[1, :])):
        # set this element to 1
        array[i, j] = 1

print(array)
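If you really only want ones on the diagonal (rather than setting every element, as the loop above does), numpy has a helper for that; a minimal alternative sketch:

import numpy as np

array = np.zeros(shape=(3, 3))
np.fill_diagonal(array, 1)  # in place: ones on the diagonal, zeros elsewhere
print(array)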