Split very large Pandas dataframe, alternative to Numpy array_split - python

Any ideas on the limit of rows to use the Numpy array_split method?
I have a dataframe with 6m+ rows and would like to split it into 20 or so chunks.
My attempt followed that described in:
Split a large pandas dataframe
using NumPy and the array_split function; however, with such a large dataframe it just goes on forever.
My dataframe is df which includes 8 columns and 6.6 million rows.
df_split = np.array_split(df,20)
Any ideas on an alternative method to split this? Alternatively, tips to improve dataframe performance are also welcome.

Maybe this resolves your problem, by splitting the dataframe into chunks like in this example:
import numpy as np
import pandas as pds

df = pds.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, 5):
    df_split = np.array_split(i, 20)
    print(df_split)
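Applied to the question's frame, a hedged adaptation of the same generator (df here is assumed to be the 6.6-million-row dataframe from the question):

import math

# roughly 20 equal pieces, without calling np.array_split at all
size = math.ceil(len(df) / 20)
chunks = list(chunker(df, size))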

I do not have a general solution, however there are two things you could consider:
You could try loading the data in chunks instead of loading it all and then splitting it. If you use pandas.read_csv, the skiprows (together with nrows) arguments would be the way to go.
You could reshape your data with df.values.reshape((20, -1, 8)). However, this requires the number of rows to be divisible by 20. You could consider dropping the last few rows (at most 19) to make it fit. This would of course be the fastest solution.
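A minimal sketch of both ideas, assuming the data lives in a CSV file (the filename, chunk size and the processing step are placeholders, not from the question):

import pandas as pd

# Option 1: read the file in chunks instead of splitting it afterwards
# (chunksize is the built-in way to iterate; skiprows/nrows would also work)
for chunk in pd.read_csv('data.csv', chunksize=330_000):  # ~20 chunks of 6.6M rows
    pass  # process each chunk here

# Option 2: drop at most 19 trailing rows so the row count divides by 20,
# then reshape the underlying values into 20 blocks of 8 columns each
df = pd.read_csv('data.csv')
usable = len(df) - len(df) % 20
blocks = df.values[:usable].reshape((20, -1, 8))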

With a few modifications to Houssem Maamria's code, this snippet could help someone trying to export each chunk to an Excel file.
import pandas as pd
import numpy as np

dfLista_90 = pd.read_excel('my_excel.xlsx', index_col=0)  # to include the headers

count = 0
limit = 200
rows = len(dfLista_90)
partition = (rows // limit) + 1

def chunker(df, size):
    return (df[pos:pos + size] for pos in range(0, len(df), size))

for a in chunker(dfLista_90, limit):
    to_excel = np.array_split(a, partition)
    count += 1
    a.to_excel('file_{:02d}.xlsx'.format(count), index=True)

Related

DataFrame Pandas for range

I have a problem with a DataFrame calculation over a range.
In the first row I would like to calculate and add the data; each subsequent row depends on the previous one.
So the first formula is "different", the rest are repeated.
I did this in a DataFrame and it works, but very slowly.
All the other data I need so far is already in the DataFrame.
import pandas as pd
import numpy as np

calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5, 1)))
calc['op_ol'] = calc[0]
calc['op_ol'][0] = calc[0][0]
for ee in range(1, 5):
    calc['op_ol'][ee] = 0 if calc['op_ol'][ee-1] == 0 else calc[0][ee-1] * calc['op_ol'][ee-1]
How could I speed this up?
It's generally slow when you use loops with pandas. I suggest these lines:
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])
Where cumprod is the cumulative product and we shift it with the first value.
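A quick way to convince yourself that the two give the same result (the seed is only there to make the comparison reproducible):

import numpy as np
import pandas as pd

np.random.seed(0)
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5, 1)))

# loop version from the question
loop = calc[0].astype(float)
for ee in range(1, 5):
    loop[ee] = 0 if loop[ee - 1] == 0 else calc[0][ee - 1] * loop[ee - 1]

# vectorised version from the answer
vec = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])

print(np.allclose(loop, vec))  # expected: True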

How do I save a N x M array/list using Pandas?

I have an N x M numpy array / list. I want to save this matrix into a .csv file using Pandas. Unfortunately I don't know the values of M and N a priori, and they can be large. I am interested in Pandas because I find it manageable in terms of data column access.
Let's start with this MWE:
import numpy as np
import pandas as pd

N, M = np.random.randint(10, 100, size=2)
A = np.random.randint(10, size=(N, M))

columns = []
for i in range(len(A[0, :])):
    columns.append("column_{} ".format(i))
I cannot do something like pd.append(), i.e. appending columns with new additional indices via a for loop.
Is there a way to save A into a .csv file?
Following the comment of Quang Hoang, there are 2 possibilities:
pd.DataFrame(A).to_csv('yourfile.csv').
np.save("yourfile.npy",A) and then A = np.load("yourfile.npy").

How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.
I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.
Code:
import numpy as np
import pandas as pd

some_df = pd.DataFrame(columns=['A'])
for i in range(10):
    some_df.loc[i] = [np.random.rand(4, 6, 8)]

data = np.stack(some_df['A'].values)  # shape (10, 4, 6, 8)
processed = np.max(data, axis=1)      # shape (10, 6, 8)
some_df['B'] = processed              # This fails
I want the new column 'B' to contain numpy arrays of shape (6, 8)
How can this be done?
This is not recommended; it is painful and slow, and later processing is not easy.
One possible solution is use list comprehension:
some_df['B'] = [x for x in processed]
Or convert to list and assign:
some_df['B'] = processed.tolist()
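A quick check, assuming the some_df and processed objects built in the question are in scope (with the list-comprehension assignment each cell is an ndarray, while .tolist() would store nested Python lists instead):

some_df['B'] = [x for x in processed]
print(type(some_df['B'].iloc[0]), some_df['B'].iloc[0].shape)  # ndarray, (6, 8)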
Coming back to this after 2 years, here is a much better practice:
from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict

def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))

def create_flat_columns_df_from_dict_of_numpy(
    named_np: Dict[str, np.ndarray],
    n_samples_per_np: int,
):
    named_np_correct_lenth = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_lenth.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_lenth.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df

def parse_series_into_np(df, col_name, shp):
    # the shape could also be parsed from the column names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np
Usage, to put ndarrays into a DataFrame:
full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)
where d is a dict of ndarrays sharing the same shape[0], keyed by ["name1", "name2"].
The reverse operation can be obtained by parse_series_into_np.
The accepted answer remains, as it answers the original question, but this one is a much better practice.
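For completeness, a hedged sketch of the reverse direction, reusing d and full_rate_df from the usage above (this assumes d["name1"] is already an ndarray):

# recover the original array for "name1" from the flat DataFrame
name1_back = parse_series_into_np(full_rate_df, "name1", d["name1"].shape[1:])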
I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is in general not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related meta-data.
In general I organize my experimental time series as pandas dataframes with one column holding same-length numpy arrays and the other columns containing meta-data on certain measurement conditions etc.
The proposed solution by jezrael works very well, and I have used it on a regular basis for the last 4 years. But this method can run into huge memory problems. In my case I came across these problems working with dataframes beyond 5 million rows and time series with approx. 100 data points each.
The solution to these problems is extremely simple, and since I did not find it anywhere I just want to share it here: simply transform your 2D array into a pandas Series object and assign it to a column of your dataframe:
df["new_list_column"] = pd.Series(list(numpy_array_2D))

multiplying all combinations of columns

I am trying to find an efficient way of multiplying every combination of columns within a pandas dataframe. I have managed to achieve this with itertools, however as the dataframe grows it slows down dramatically. I am going to need to perform this on a dataframe of about (100, 1000).
Example of working code with smaller dataframe below,
import numpy as np
import pandas as pd
from itertools import combinations_with_replacement
import numpy as np
import pandas as pd
from itertools import combinations_with_replacement

df = pd.DataFrame(np.random.randn(3, 10))
new_df = pd.DataFrame()

for p in combinations_with_replacement(df.columns, 2):
    title = p
    new_df[title] = df[p[0]] * df[p[1]]
Does anybody have any suggestions on how this could be achieved?
Combining index view and array.prod(axis), this runs ~100 times faster:
def f1():
    # with loop
    new_df = pd.DataFrame()
    for p in combinations_with_replacement(df.columns, 2):
        title = p
        new_df[title] = df[p[0]] * df[p[1]]
    return new_df

def f2():
    # with an index view and array.prod(axis)
    n = len(df.columns)
    ix = np.indices((n, n))[:, ~np.tri(n, k=-1, dtype=bool)]
    return pd.DataFrame(df.values.T[ix.T].prod(1).T, columns=list(map(tuple, ix.T)))
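A quick sanity check that the two agree on the small example frame above (assuming df and the imports from the question are still in scope):

# compare the loop-based and vectorised results
r1, r2 = f1(), f2()
print(np.allclose(r1.values, r2.values))  # expected: True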

Sorting every column on a very large pandas dataframe

I am sorting every column of a very large pandas dataframe using a for loop. However, this process takes a very long time because the dataframe has more than 1 million columns. I want this process to run much faster than it does right now.
This is the code I have at the moment:
top25s = []
for i in range(1, len(mylist)):
    topchoices = df.sort_values(i, ascending=False).iloc[0:25, 0].values
    top25s.append(topchoices)
Here len(mylist) is 14256 but can easily go up to more than 1000000 in the future. df has a dimension of 343 rows × 14256 columns.
Thanks for all of your inputs!
You can use nlargest:
df.apply(lambda x: x.nlargest(25).reset_index(drop=True))
But I doubt this will gain you much time honestly. As commented, you just have a lot of data to go through.
I'd propose using a bit of help from numpy, which should speed things up significantly.
The following code returns a 2D numpy array with the top 25 elements of each column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(50,100)) # Generate random data
rank = df.rank(axis = 0, ascending=False)
top25s = np.extract(rank<=25, df).reshape(25, 100)
