Optimizing Iterations on large datasets - Python

I have two large datasets df1 and df2, both have a column that records the time each observation was made. I want to find the time difference between every entry of df1 and every entry of df2.
The code below works but runs into memory errors when I attempt to run it on the entire datasets. How can I optimize this for memory efficiency?
import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

LINE_NUMBER_table0 = []  # Initialize an empty list where we will add the row numbers of table0
LINE_NUMBER_table1 = []  # Initialize an empty list where we will add the row numbers of table1
TIME_DIFFERENCE = []     # Initialize an empty list where we will add the time difference between row i of table0 and row j of table1

for i in range(1000):
    for j in range(1000):
        LINE_NUMBER_table0.append(i)  # Add the row number i of table0
        LINE_NUMBER_table1.append(j)  # Add the row number j of table1
        timedifference = df1["mjd"][i] - df2["MJD"][j]  # Calculate the time difference between row i and row j
        TIME_DIFFERENCE.append(timedifference)  # Add this time difference to the list TIME_DIFFERENCE

You do not need a loop for that. Python loops are generally slow, especially when iterating over Pandas dataframes (see this post). Use vectorized calls instead, for example Numpy functions or Pandas' built-in ones. In this case, you can use np.tile and np.repeat. Here is an (untested) example:
import numpy as np
import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

tmp = np.arange(1000)
LINE_NUMBER_table0 = np.repeat(tmp, 1000)  # 0,0,...,0, 1,1,...,1, ...
LINE_NUMBER_table1 = np.tile(tmp, 1000)    # 0,1,...,999, 0,1,...,999, ...
# Use the first 1000 rows, to mirror the question's loops
df1_mjd = np.repeat(df1["mjd"].to_numpy()[:1000], 1000)
df2_MJD = np.tile(df2["MJD"].to_numpy()[:1000], 1000)
TIME_DIFFERENCE = df1_mjd - df2_MJD
Note that you can convert a Numpy array back to a list using your_array.tolist(), but it is better to keep working with Numpy arrays for the sake of performance (Pandas uses Numpy arrays internally, so the conversion between a Pandas dataframe and a Numpy array is cheap, as opposed to lists).
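If memory is the main constraint, a broadcasting variant may also be worth trying: it skips the two repeated copies of the input columns and only materializes the result. A minimal sketch, assuming the same column names as above:

import numpy as np
import pandas as pd

df1 = pd.read_csv("table0.csv")
df2 = pd.read_csv("table1.csv")

a = df1["mjd"].to_numpy()[:1000]
b = df2["MJD"].to_numpy()[:1000]
# (1000, 1) minus (1, 1000) broadcasts to a (1000, 1000) matrix of pairwise
# differences; ravel() flattens it so entry i*1000 + j equals a[i] - b[j],
# matching the order of the original nested loop.
TIME_DIFFERENCE = (a[:, None] - b[None, :]).ravel()

If even the flattened result does not fit in memory, the same expression can be applied to row chunks of a, writing each chunk out before computing the next.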

Related

Python list comparison numpy optimization

I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c / len(df1)
Now the problem is that my df2 has 500,000 rows and my df1 usually has around 50-100 rows. This means the task easily gets very time consuming. I know there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine used in pandas is Cython, but you can also change the engine to numba or use the njit decorator to speed things up; look up the pandas enhancingperf docs.
Numba converts Python code to optimized machine code; pandas is highly integrated with numpy, and hence with numba as well. You can experiment with the parallel, nogil, cache, and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With Numba you can do eager compilation, or let the first execution take a little time for compilation so that subsequent calls are fast.
import pandas as pd
import numpy as np
import numba as nb

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])

a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just an illustration, not the correct logic. Change the logic according to your needs:
# @nb.njit((nb.int64[:, :],))
# def f(x):
#     total = 0
#     for i in nb.prange(x.shape[0]):
#         for j in range(a.shape[0]):
#             total += (x[i] == a[j]).sum()
#     return total
# Experiment with the engine options
print(df2['Sequence'].apply(f))
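For completeness, here is a hedged njit sketch that implements one possible reading of the question's original loop (count how many rows of df1 appear verbatim in a 'Sequence' array); treat it as a starting point, not the definitive logic:

import numba as nb

@nb.njit(parallel=True)
def overlap_fraction(x, a):
    # x: one (100, 7) 'Sequence' array, a: df1's (100, 7) values
    count = 0
    for i in nb.prange(a.shape[0]):   # rows of df1
        for j in range(x.shape[0]):   # rows of the Sequence array
            if (x[j] == a[i]).all():  # exact 7-integer row match
                count += 1
                break
    return count / a.shape[0]

a = df1.values
df2['Overlap'] = df2['Sequence'].apply(lambda x: overlap_fraction(x, a))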
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
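One caveat with a 500,000-row df2: df1.values inside the lambda is rebuilt on every apply call. Hoisting it out once may help (a small sketch, not benchmarked):

arr = df1.to_numpy()  # build df1's array once, outside the lambda
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x == arr).sum() / arr.size)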

Efficient way to rebuild a dictionary of dataframes

I have a dictionary that is filled with multiple dataframes. Now I am searching for an efficient way of changing the key structure, but the solution I have found is rather slow when more dataframes / bigger dataframes are involved. That's why I wanted to ask if anyone might know a more convenient / efficient / faster way or approach than mine. So first, I created this example to show where I initially started:
import pandas as pd
import numpy as np

# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}

# fill dic with random entries
for t1 in teams:
    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30),
                                  'Goals': pd.Series(np.random.randint(0, 5, size=30)),
                                  'Chances': pd.Series(np.random.randint(0, 15, size=30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size=30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size=30))})
    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None
Now I basically have a dictionary where every key is a team, which means I have a dataframe for every team with information on their game performance over time. Now I would prefer to change this particular dictionary so I get a structure where the key is the date, instead of a team. This would mean that I have a dataframe for every date, which is filled with the performance of each team on that date. I managed to do that using the following code, which works but is really slow once I add more teams and performance factors:
# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index=teams, columns=perf)
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
Because I am using a nested loop, the restructuring of my dictionary is slow. Does anyone have an idea how I could improve the second piece of code? I'm not necessarily searching just for a solution, but also for the logic or idea behind a better approach.
Thanks in advance, any help is highly appreciated
Creating Pandas dataframes the way you do is (strangely) awfully slow, as is direct indexing.
Copying a dataframe is surprisingly quite fast. Thus you can use an empty reference dataframe copied multiple times. Here is the code:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index=teams, columns=perf)
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]
This is about 2 times faster than the reference on my machine.
Overcoming the slow dataframe direct indexing is tricky. We can use numpy to do that. Indeed, we can convert the dataframes to a 3D numpy array, use numpy to perform the transposition, and finally convert the slices into dataframes again. Note that this approach assumes that all values are integers and that the input dataframes are well structured.
Here is the final implementation:
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# Create a numpy array from the Pandas dataframes.
# Assume the `dates` and `perf` indices (and their order) are the same in all dataframes.
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId, tName in enumerate(teams):
    full[tId, :, :] = dic_teams[tName].to_numpy()

# New structure where key = date, created from the numpy array
for dId, dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:, dId, pId] for pId, pName in enumerate(perf)}, index=teams)
This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is sadly spent in the pd.DataFrame calls. Thus, if you want faster code, use a basic 3D numpy array!
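For reference, here is a minimal sketch of that last suggestion, reusing the names from the code above: keep full as-is and hand out per-date 2D slices instead of building per-date DataFrames at all:

# Each value is a (len(teams), len(perf)) view into `full` (no copy):
# rows follow the `teams` order, columns follow the `perf` order.
dic_dates_np = {dName: full[:, dId, :] for dId, dName in enumerate(dates)}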

Monte Carlo simulation in python - problem with looping

I am running a simple python script for MC. Basically it reads through every row in the dataframe and selects the max and min of the two variables. Then the simulation is run 1000 times, selecting a random value between the min and max, computing the product, and writing the P50 value back to the datatable.
Somehow the P50 output is the same for all rows. Any help on where I am going wrong?
import pandas as pd
import random
import numpy as np

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

for index, row in df.iterrows():
    outdata = np.zeros(shape=(NumSim,), dtype=float)
    for k in range(NumSim):
        phi = row['P_min'] + (row['P_max'] - row['P_min']) * random.uniform(0, 1)
        ht = row['H_min'] + (row['H_max'] - row['H_min']) * random.uniform(0, 1)
        outdata[k] = phi * ht
    df['out_p50'] = np.percentile(outdata, 50)
print(df)
By df['out_p50'] = np.percentile(outdata, 50) you are saying that you want the whole column to be set to the given value, not a specific row of the column. Therefore the numbers are generated and saved, but they are saved to the whole column each time, and in the end you see the last generated number in every row.
Instead, use df.loc[index, 'out_p50'] = np.percentile(outdata, 50) to specify the specific row you want to set.
Yup -- you're writing a scalar value to the entire column, and you overwrite that value on each iteration. If you want, you can simply specify the row with df.loc for a quick fix. Also consider using np.median(outdata) instead of the percentile call.
Perhaps the most important feature of pandas is its built-in support for vectorization: you work with entire columns of data rather than looping through the data frame. Think of it like a list comprehension in which you don't need the for row in df iteration at the end.
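To make that concrete, here is a hedged sketch of a fully vectorized version of the simulation above (same column names as the question; not benchmarked):

import numpy as np
import pandas as pd

data = [[0.075, 0.085, 120, 150], [0.055, 0.075, 150, 350], [0.045, 0.055, 175, 400]]
df = pd.DataFrame(data, columns=['P_min', 'P_max', 'H_min', 'H_max'])
NumSim = 1000

rng = np.random.default_rng()
# Draw all samples for all rows at once: shape (len(df), NumSim)
u = rng.random((len(df), NumSim))
v = rng.random((len(df), NumSim))
phi = df['P_min'].to_numpy()[:, None] + (df['P_max'] - df['P_min']).to_numpy()[:, None] * u
ht = df['H_min'].to_numpy()[:, None] + (df['H_max'] - df['H_min']).to_numpy()[:, None] * v
# The median along the simulation axis gives one P50 per row
df['out_p50'] = np.median(phi * ht, axis=1)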

How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
import pandas as pd
import numpy as np

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)      # shape (10, 6, 8)
some_df['B'] = processed              # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended; it is painful and slow, and later processing is not easy.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
from typing import Dict

import pandas as pd
import numpy as np


def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))


def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.ndarray],
    n_samples_per_np: int,
):
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df


def parse_series_into_np(df, col_name, shp):
    # the shape could also be parsed from the column names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
Usage, to put an ndarray into a Dataframe:

full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of nd-arrays with the same shape[0], keyed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
The accepted answer remains, as it answers the original question, but this one is a much better practice.
I know this question already has an answer to it, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimental time-series in the form of pandas dataframes, with one column holding same-length numpy arrays and the other columns containing information on meta-data with respect to certain measurement conditions etc.
The proposed solution by jezrael works very well, and I used it on a regular basis for the last 4 years. But this method can run into huge memory problems. In my case I came across these problems working with dataframes beyond 5 million rows and time-series with approx. 100 data points each.
The solution to these problems is extremely simple; since I did not find it anywhere, I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign this to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))

Efficient combined in-place adding/removing of rows of a huge 2D numpy array

I have a 2D NumPy array and it's huge. I have some computer memory, which is not so huge.
A single copy of the array fits snugly in the computer memory. A second copy of this array brings the computer to its knees crying.
Before I can cut up the matrix into smaller, more manageable, chunks I need to add a few rows to it and remove some. Luckily I need to remove more rows than add new ones, so in theory this could all be done in-place. I'm working on a function to accomplish this, but I'm curious to what advice any of you can give me.
The plan so far:
1. Make a list of rows to remove
2. Make a matrix of rows to add
3. Replace rows to remove by the rows to add (one by one; cannot use fancy indexing here?)
4. Move any rows that still need to be removed to the end of the matrix
5. Call .resize() on the matrix to resize it in memory
Especially step 4 is hard to implement efficiently.
Code so far:
import numpy as np

n_rows = 100
n_columns = 1000000
n_rows_to_drop = 20
n_rows_to_add = 10

# Init huge array
data = np.random.rand(n_rows, n_columns)

# Some rows we drop
to_drop = np.arange(n_rows)
np.random.shuffle(to_drop)
to_drop = to_drop[:n_rows_to_drop]

# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)

# Start replacing rows with new rows
for new_data_idx, to_drop_idx in enumerate(to_drop):
    if new_data_idx >= n_rows_to_add:
        break  # no more new data to add
    # Replace a row to drop with a new row
    data[to_drop_idx] = new_data[new_data_idx]

# These should still be dropped
to_drop = to_drop[n_rows_to_add:]
to_drop.sort()

# Make a list of row indices to keep, last rows first
to_keep = set(range(n_rows)) - set(to_drop)
to_keep = list(to_keep)
to_keep.sort()
to_keep = to_keep[::-1]

# Replace rows to drop with rows at the end of the matrix
for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
    if to_drop_idx > to_keep_idx:
        # All remaining rows to drop are at the end of the matrix
        break
    data[to_drop_idx] = data[to_keep_idx]

# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
This seems to work, but is there any way to make this more elegant/efficient? Any way to check whether a copy of the huge array is made at some point?
This seems to perform the same as your code but is a little more brief. I'm relatively sure no copies of the big array are made here: the fancy indexing only appears on the left-hand side of assignments, which writes in place rather than creating a copy.
import numpy as np

n_rows = 100
n_columns = 100000
n_rows_to_drop = 20
n_rows_to_add = 10

# Init huge array
data = np.random.rand(n_rows, n_columns)

# Some rows we drop (np.unique also sorts; it may return fewer
# than n_rows_to_drop indices if the draw contains duplicates)
to_drop = np.random.randint(0, n_rows, n_rows_to_drop)
to_drop = np.unique(to_drop)

# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)

# Start replacing rows with new rows
data[to_drop[:n_rows_to_add]] = new_data

# These should still be dropped
to_drop = to_drop[n_rows_to_add:]

# Make a list of row indices to keep, last rows first
to_keep = np.setdiff1d(np.arange(n_rows), to_drop, assume_unique=True)[::-1][:len(to_drop)]

# Replace rows to drop with rows at the end of the matrix
for to_drop_i, to_keep_i in zip(to_drop, to_keep):
    if to_drop_i > to_keep_i:
        break  # all remaining rows to drop are already at the end
    data[to_drop_i] = data[to_keep_i]

# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
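As for the last part of the question (how to check whether a copy of the huge array is made at some point), one rough option is peak-memory tracing with the standard library; recent NumPy versions report their allocations to tracemalloc, so a full copy shows up as a peak near twice the array size. A sketch:

import tracemalloc

tracemalloc.start()
# ... run the replacement/resize code from above here ...
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# If `peak` approaches twice the size of `data`, a full copy was likely made
print(f"peak traced memory: {peak / 1e9:.2f} GB")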
