multiplying all combinations of columns - python

I am trying to find an efficient way of multiplying each column combination within a pandas dataframe. I have managed to achieve this with itertools; however, as the size of the dataframe increases it slows down dramatically. I will need to perform this on a dataframe with a size of about (100, 1000).
An example of working code with a smaller dataframe is below:
import numpy as np
import pandas as pd
from itertools import combinations_with_replacement

df = pd.DataFrame(np.random.randn(3, 10))
new_df = pd.DataFrame()
for p in combinations_with_replacement(df.columns, 2):
    title = p
    new_df[title] = df[p[0]] * df[p[1]]
Does anybody have any suggestions on how this could be done more efficiently?

Combining an index view and array.prod(axis), this runs ~100 times faster:
def f1():
    # with loop
    new_df = pd.DataFrame()
    for p in combinations_with_replacement(df.columns, 2):
        title = p
        new_df[title] = df[p[0]] * df[p[1]]
    return new_df

def f2():
    # with index array and prod
    n = len(df.columns)
    ix = np.indices((n, n))[:, ~np.tri(n, k=-1, dtype=bool)]
    return pd.DataFrame(df.values.T[ix.T].prod(1).T, columns=list(map(tuple, ix.T)))
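A quick sanity check that the two produce the same values (a minimal check added here, not part of the original answer):
out_loop = f1()
out_vec = f2()
# columns come out in the same (i, j) order, so the values should match exactly
assert np.allclose(out_loop.values, out_vec.values)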

Related

DataFrame Pandas for range

I have a problem with a DataFrame and a range loop.
In the first row I would like to calculate and add the data; each subsequent row depends on the previous one.
So the first formula is "different", and the rest are repeated.
I did this in a DataFrame and it works, but very slowly.
All the other data is already in the DataFrame.
import pandas as pd
import numpy as np

calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5, 1)))
calc['op_ol'] = calc[0]
calc['op_ol'][0] = calc[0][0]
for ee in range(1, 5):
    calc['op_ol'][ee] = 0 if calc['op_ol'][ee-1] == 0 else calc[0][ee-1] * calc['op_ol'][ee-1]
How could I speed this up?
It's generally slow when you use loops with pandas. I suggest these lines:
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])
Here cumprod is the cumulative product; we shift it by one position and fill the first value with calc[0][0].
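A quick way to check the vectorised line against the original loop (a minimal sanity check, not part of the answer):
reference = calc[0].copy()
for ee in range(1, 5):
    reference[ee] = 0 if reference[ee-1] == 0 else calc[0][ee-1] * reference[ee-1]
# the cumprod-based column should reproduce the loop result exactly
assert (calc['op_ol'] == reference).all()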

Is this an efficient method of updating columns based on conditions in other columns using pandas

Is this an efficient method of updating columns based on conditions in other columns using pandas?
I am looking to generalize an update function that shifts Gaussian values, and I had difficulty using lambda because multiple columns could act as conditions. Similarly, apply was problematic because I couldn't get the variables into the form it wanted, though honestly I probably could have spent more time on that part.
Problem statement:
How should I handle updating large pandas DataFrames based on a value in another column, in such a way that I could run many of these functions within acceptable speed limits? Please respond with a complete example and, if possible, use my silly_series_generator so we stay on the same problem case. Thanks.
import random
import pandas

def silly_series_generator():
    # requires import of random and pandas
    ret = []
    ret.append(random.choice(['X', 'Y', 'Z']))
    for i in range(9):
        ret.append(random.gauss(0, 1))
    return pandas.Series(ret, list("ABCDEFGHIJ"))

def silly_update(df, condition_col, condition_value, target_col, mean, sd=.1):
    # requires import of random and pandas
    effected_cells = df[condition_col] == condition_value[0]
    x = df[effected_cells][target_col] + random.gauss(mean, sd)
    df[target_col].update(x)
    return df

def run_test():
    # requires import of random and pandas
    # requires functions: silly_series_generator and silly_update
    rows = []
    for i in range(50):
        rows.append(silly_series_generator())
    original_df = pandas.DataFrame(rows)
    print('original_df', original_df['B'].mean())
    updated_df = silly_update(original_df, 'A', 'X', 'B', 1)
    print('updated_df', updated_df['B'].mean())

if __name__ == "__main__":
    run_test()
I'm not sure the examples below are any faster (I'm sure the apply() version is slower), but it's how I would do it. Looking back at your problem, I'm not sure it's even different enough to write up, but here it is.
Make the data
import numpy as np
import pandas as pd
import random

def silly_series_generator():
    # requires import of random and pandas
    ret = []
    ret.append(random.choice(['X', 'Y', 'Z']))
    for i in range(9):
        ret.append(random.gauss(0, 1))
    return pd.Series(ret, list("ABCDEFGHIJ"))

rows = []
for i in range(50):
    rows.append(silly_series_generator())
df = pd.DataFrame(rows)
Using apply
I think apply is typically the slowest route because it runs one row at a time. However, I still like it, so here's an example. We can provide the extra arguments to apply() as keyword arguments.
def update(row, condition_col, condition_value, target_col, mean, sd=.1):
    if row[condition_col] == condition_value:
        v = row[target_col] + random.gauss(mean, sd)
    else:
        v = row[target_col]
    return v

df['B'] = df.apply(update, axis=1, condition_col='A', condition_value='X', target_col='B', mean=1)
Using a mask
This is basically what you did - I just used the .loc[] instead of .update(). I'm not sure if it's any faster, but it's another option.
mask = df['A'] == 'X'
df.loc[mask, 'B'] = df['B'] + random.gauss(1, 0.1)
Using a mask - new random value for each row
It's unclear if you want the same random number added to each row. The way we have it set up now, the same random number is added to everything that matches. More likely you want each value shifted by a different random number.
Here's an example of generating a new random number for each row. I'm leaving some extra columns around for debugging.
mask = df['A'] == 'X'
# Generate a random number for each row
# df['r'] = np.random.normal(1, 0.1, size=df.shape[0])
# Only generate the random numbers for the mask locations
df.loc[mask, 'r'] = np.random.normal(1, 0.1, size=mask.sum())
df.loc[mask, 'Bprime'] = df['B'] + df['r']
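If you want to fold the per-row-random mask version back into the shape of the original silly_update, here is a possible sketch (the function name and the use of np.random.normal are my own, not from the original):
def silly_update_vec(df, condition_col, condition_value, target_col, mean, sd=.1):
    mask = df[condition_col] == condition_value
    # one independent draw per matching row
    df.loc[mask, target_col] = df.loc[mask, target_col] + np.random.normal(mean, sd, size=mask.sum())
    return df

updated_df = silly_update_vec(df, 'A', 'X', 'B', 1)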

Python list comparison numpy optimization

I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2) with 3 columns. One of these columns holds, for each row, an array of sequences of 7 integers. Example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago',
                          np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York',
                          np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c / len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is cython, but you can change the engine to numba or use the njit decorator to speed things up. Look up enhancingperf in the pandas documentation.
Numba converts Python code to optimized machine code. pandas is highly integrated with numpy, and hence with numba as well. You can experiment with the parallel, nogil, cache and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With Numba you can do eager compilation, or accept that the first execution takes a little time to compile while subsequent calls run fast.
import numpy as np
import pandas as pd
import numba as nb

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago',
                          np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York',
                          np.random.randint(1, 5, (100, 7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just an illustration, not the correct logic. Change the logic according to your needs:
# @nb.njit((nb.int64,))
# def f(x):
#     total = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             total += (x[i] == a[j]).sum()
#     return total
# Experiment with the engine
print(df2['Sequence'].apply(f))
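For reference, a minimal njit sketch with the same element-wise overlap logic as the direct-comparison answer below (this assumes every Sequence array has the same shape as df1.values; adjust the loops if you need row-level matching instead):
@nb.njit
def overlap(x):
    # fraction of positions where x and df1's values agree
    count = 0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i, j] == a[i, j]:
                count += 1
    return count / a.size

df2['Overlap'] = df2['Sequence'].apply(overlap)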
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)

Function returns only one iteration, instead of multiple. What is wrong?

First of all, I'm a beginner and I'm having an issue with functions and returning values. Afterwards, I need to do some matrix operations to take the minimum value of the right column. However, since I cannot return these values (I could not figure out why), I'm not able to do any operations on them. The problem is that every time I try to use return, it gives me only the first or the last row of the matrix. If you can help, I'd really appreciate it. Thanks.
import numpy as np
import pandas as pd

df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]

def minreg():
    for k in range(2, 16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k, x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = np.mean(np.abs(x - x_pred) / np.abs(x)) * 100
        m = np.array([k, mape_value])
        return m

print(minreg())
The return m statement basically terminates the function and returns m; as a result, the function exits after the first iteration of the outer loop. So firstly, you need to call return after your loop ends. Secondly, you need to append each m value generated in the loop to a list, and return that list.
import numpy as np
import pandas as pd

df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]

def minreg():
    m_arr = []
    for k in range(2, 16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k, x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = np.mean(np.abs(x - x_pred) / np.abs(x)) * 100
        m_arr.append(np.array([k, mape_value]))
    return m_arr

print(minreg())
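Since your goal is to take the minimum of the right column, here is a small follow-up sketch on the returned list (variable names are mine):
results = np.array(minreg())           # shape (14, 2): columns are k and MAPE
best_k, best_mape = results[results[:, 1].argmin()]
print(best_k, best_mape)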

Split very large Pandas dataframe, alternative to Numpy array_split

Any ideas on the limit of rows to use the Numpy array_split method?
I have a dataframe with +6m rows and would like to split it in 20 or so chunks.
My attempt followed the approach described in:
Split a large pandas dataframe
using Numpy and the array_split function; however, with such a large dataframe it just goes on forever.
My dataframe is df which includes 8 columns and 6.6 million rows.
df_split = np.array_split(df,20)
Any ideas on an alternative method to split this? Alternatively tips to improve dataframe performance are also welcomed.
Maybe this resolves your problem: separate the dataframe into chunks, as in this example:
import numpy as np
import pandas as pds

df = pds.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, 5):
    df_split = np.array_split(i, 20)
    print(df_split)
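Applied to the question's case, the same chunker can produce the ~20 slices directly (a sketch; df here stands for the 6.6-million-row frame):
chunk_size = len(df) // 20 + 1
df_split = list(chunker(df, chunk_size))   # a list of roughly 20 DataFrames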
I do not have a general solution, however there are two things you could consider:
You could try loading the data in chunks, instead of loading it all and then splitting it. If you use pandas.read_csv, the skiprows and nrows arguments (or chunksize) would be the way to go.
You could reshape your data with df.values.reshape((20, -1, 8)); a sketch is shown below. However, this requires the number of rows to be divisible by 20, so you could consider dropping the last few (at most 19) rows to make it fit. This would of course be the fastest solution.
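A minimal sketch of the reshape idea, assuming df has the 8 columns mentioned in the question and they share a numeric dtype:
n_chunks = 20
n_rows = (len(df) // n_chunks) * n_chunks   # drop at most 19 trailing rows
chunks = df.values[:n_rows].reshape(n_chunks, -1, 8)
Note that chunks is then a single 3-D NumPy array rather than a list of DataFrames.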
With a few modifications to Houssem Maamria's code, this could help someone trying to export each chunk to an Excel file.
import pandas as pd
import numpy as np

dfLista_90 = pd.read_excel('my_excel.xlsx', index_col=0)  # to include the headers
count = 0
limit = 200
rows = len(dfLista_90)
partition = (rows // limit) + 1

def chunker(df, size):
    return (df[pos:pos + size] for pos in range(0, len(df), size))

for a in chunker(dfLista_90, limit):
    to_excel = np.array_split(a, partition)
    count += 1
    a.to_excel('file_{:02d}.xlsx'.format(count), index=True)
