Sorting every column on a very large pandas dataframe - python

I am sorting every column of a very large pandas dataframe using a for loop. However, this process takes a very long time because the dataframe has so many columns (potentially more than 1 million). I want this process to run much faster than it currently does.
This is the code I have at the moment:
top25s = []
for i in range(1, len(mylist)):
    topchoices = df.sort_values(i, ascending=False).iloc[0:25, 0].values
    top25s.append(topchoices)
Here len(mylist) is 14256 but can easily go up to more than 1000000 in the future. df has a dimension of 343 rows × 14256 columns.
Thanks for all of your inputs!

You can use nlargest:
df.apply(lambda x: x.nlargest(25).reset_index(drop=True))
But honestly, I doubt this will gain you much time. As commented, you simply have a lot of data to go through.
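If the pandas-level sort is still too slow, one numpy argsort per column avoids the repeated sort_values calls. A minimal sketch, assuming (as the question's loop suggests) that for each column you want the first column's values at the rows where that column is largest; the shape mirrors the 343 × 14256 frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(343, 14256))   # same shape as in the question

vals = df.to_numpy()
first_col = vals[:, 0]
# One argsort per column gives the row indices of the 25 largest entries
order = np.argsort(-vals, axis=0)[:25]          # shape (25, n_columns)
top25s = first_col[order].T                     # row k holds the 25 values for column k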

I'd propose using a bit of help from numpy, which should speed things up significantly. The following code returns a 2D numpy array with the top 25 elements of each column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(50,100)) # Generate random data
rank = df.rank(axis = 0, ascending=False)
top25s = np.extract(rank<=25, df).reshape(25, 100)
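If you only need each column's own largest values, an alternative one-liner (a sketch, not from the answer above) is to sort each column once with numpy and slice off the top of the result:
top25s = np.sort(df.values, axis=0)[::-1][:25]   # top 25 values of every column, shape (25, 100)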

Related

DataFrame Pandas for range

I have a problem with a DataFrame and a for-range loop.
In the first row I calculate and add the data; each subsequent row depends on the previous one.
So the first formula is "different" and the rest are repeated.
I did this in a DataFrame and it works, but very slowly.
All the other data is already in the DataFrame.
import pandas as pd
import numpy as np
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = calc[0]
calc['op_ol'][0] = calc[0][0]
for ee in range(1,5):
    calc['op_ol'][ee] = 0 if calc['op_ol'][ee-1] == 0 else calc[0][ee-1] * calc['op_ol'][ee-1]
How could I speed this up?
Loops are generally slow with pandas. I suggest these lines instead:
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5,1)))
calc['op_ol'] = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])
Here cumprod is the cumulative product, and the result is shifted down by one with the first value used as fill.
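A quick, hedged check that the one-liner matches the original loop (the seed is only there to make the comparison reproducible; it is not part of the answer):
import pandas as pd
import numpy as np

np.random.seed(1)                                  # reproducibility only
calc = pd.DataFrame(np.random.binomial(n=10, p=0.2, size=(5, 1)))

# Vectorized line from the answer
vec = (calc[0].cumprod() * calc[0][0]).shift(fill_value=calc[0][0])

# Reference loop from the question
ref = calc[0].copy()
for ee in range(1, 5):
    ref[ee] = 0 if ref[ee - 1] == 0 else calc[0][ee - 1] * ref[ee - 1]

print((vec == ref).all())   # True: once a factor is 0 the product stays 0 anyway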

Python list comparison numpy optimization

I basically have a dataframe (df1) with 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
    c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
    df2.at[i, 'Overlap'] = c / len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default the engine pandas uses is Cython, but you can change the engine to numba, or use the njit decorator, to speed things up; look up the pandas "Enhancing performance" documentation.
Numba converts Python code to optimized machine code; pandas is highly integrated with numpy, and hence with numba as well. You can experiment with the parallel, nogil, cache and fastmath options for extra speedup. This method shines for huge inputs where speed is needed.
With Numba you can compile eagerly, or let the first execution take a little extra time for compilation; subsequent calls will be fast.
import pandas as pd
import numpy as np
import numba as nb

df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                   data=np.random.randint(1, 5, (100, 7)))
df2 = pd.DataFrame(columns=['Name', 'Location', 'Sequence'],
                   data=[['Alfred', 'Chicago', np.random.randint(1, 5, (100, 7))],
                         ['Nicola', 'New York', np.random.randint(1, 5, (100, 7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just illustration, not correct logic. Change the logic according to needs
# nb.njit((nb.int64,))
# def f(x):
# sum = 0
# for i in nb.prange(x.shape[0]):
# for j in range(a.shape[0]):
# sum += (x[i] == a[j]).sum()
# return sum
# Experiment with engine
print(df2['Sequence'].apply(f))
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
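If every 'Sequence' entry has the same shape as df1, a further sketch (building on the df1/df2 defined in the question) is to stack them into one 3-D array and let numpy broadcast the whole comparison, removing the Python-level apply loop entirely:
import numpy as np

seqs = np.stack(df2['Sequence'].to_list())          # shape (n_rows, 100, 7)
df2['Overlap'] = (seqs == df1.to_numpy()).mean(axis=(1, 2))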

How to reduce the time taken working on a big Data Frame

I think my code is inefficient and I think there may be a way to do it better.
The objective of the code is to take an Excel listing and relate each element of a column to every other element of the same column and, depending on some conditions, store the pair in a new data frame with the joined information. In my case the file has more than 16,000 rows, so the exercise requires about 16,000 × 16,000 = 256,000,000 iterations, and it takes days to process.
The code I have is the following:
import pandas as pd
import numpy as np
excel1="Cs.xlsx"
dataframe1=pd.read_excel(excel1)
col_names=["Eb","Eb_n","Eb_Eb","L1","Ll1","L2","Ll2","D"]
my_df =pd.DataFrame(columns=col_names)
count_row = dataframe1.shape[0]
print(count_row)
for n in range(0, count_row):
    for p in range(0, count_row):
        if abs(dataframe1.iloc[n, 1] - dataframe1.iloc[p, 1]) < 0.27 and \
           abs(dataframe1.iloc[n, 2] - dataframe1.iloc[p, 2]) < 0.27:
            Nb_Nb = dataframe1.iloc[n, 0] + "_" + dataframe1.iloc[p, 0]
            myrow = pd.Series([dataframe1.iloc[n, 0], dataframe1.iloc[p, 0], Nb_Nb,
                               dataframe1.iloc[n, 1], dataframe1.iloc[n, 2],
                               dataframe1.iloc[p, 1], dataframe1.iloc[p, 2]],
                              index=["Eb", "Eb_n", "Eb_Eb", "L1", "Ll1", "L2", "Ll2"])
            my_df = my_df.append(myrow, ignore_index=True)
print(my_df.head(5))
To start with, you can try using a different Python structure; dataframes take a lot of memory and are slower to process. Ordered from simple structures with more efficient processing to complex structures with less efficient processing (a vectorized numpy sketch follows this list):
Lists
Dictionaries
Numpy Arrays
Pandas Series
Pandas Dataframes
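Beyond switching structures, the pairwise check itself can be vectorized. A hedged sketch with numpy broadcasting (not the original code; it assumes the same column positions as the question's iloc calls and that the first column holds strings):
import numpy as np
import pandas as pd

dataframe1 = pd.read_excel("Cs.xlsx")

names = dataframe1.iloc[:, 0].to_numpy()
x = dataframe1.iloc[:, 1].to_numpy(dtype=float)
y = dataframe1.iloc[:, 2].to_numpy(dtype=float)

# Pairwise |difference| matrices via broadcasting, shape (n, n).
# At 16,000 rows each intermediate float matrix is roughly 2 GB, so chunk over rows if memory is tight.
close = (np.abs(x[:, None] - x[None, :]) < 0.27) & (np.abs(y[:, None] - y[None, :]) < 0.27)

n_idx, p_idx = np.nonzero(close)
my_df = pd.DataFrame({
    "Eb":    names[n_idx],
    "Eb_n":  names[p_idx],
    "Eb_Eb": [a + "_" + b for a, b in zip(names[n_idx], names[p_idx])],
    "L1":    x[n_idx],
    "Ll1":   y[n_idx],
    "L2":    x[p_idx],
    "Ll2":   y[p_idx],
})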

Split very large Pandas dataframe, alternative to Numpy array_split

Any ideas on the limit of rows to use the Numpy array_split method?
I have a dataframe with +6m rows and would like to split it in 20 or so chunks.
My attempt followed the approach described in:
Split a large pandas dataframe
using Numpy and the array_split function; however, with such a large dataframe it just goes on forever.
My dataframe is df which includes 8 columns and 6.6 million rows.
df_split = np.array_split(df,20)
Any ideas on an alternative method to split this? Alternatively tips to improve dataframe performance are also welcomed.
Maybe this resolves your problem: separate the dataframe into chunks, as in this example:
import numpy as np
import pandas as pds
df = pds.DataFrame(np.random.rand(14,4), columns=['a', 'b', 'c', 'd'])
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, 5):
    df_split = np.array_split(i, 20)
    print(df_split)
I do not have a general solution, however there are two things you could consider:
You could try loading the data in chunks instead of loading it and then splitting it. If you use pandas.read_csv, the skiprows (or chunksize) argument would be the way to go; a chunksize-based sketch follows below.
You could reshape your data with df.values.reshape((20,-1,8)). However this would require the number of rows to be divisible by 20. You could consider not using the last (a maximum of 19) of the samples to make it fit. This would of course be the fastest solution.
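A chunksize-based sketch of the first point, assuming the data can be read from a CSV file (the filename below is only a placeholder):
import pandas as pd

# Read and process the file in ~20 pieces instead of loading it all and splitting afterwards
for i, piece in enumerate(pd.read_csv("big_file.csv", chunksize=330_000)):
    piece.to_csv("part_{:02d}.csv".format(i), index=False)   # or any other per-chunk processing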
With a few modifications to Houssem Maamria's code, this could help someone trying to export each chunk to an Excel file.
import pandas as pd
import numpy as np
dfLista_90 = pd.read_excel('my_excel.xlsx', index_col=0)  # use the first column as the index
count = 0
limit = 200
rows = len(dfLista_90)
partition = (rows // limit) + 1
def chunker(df, size):
    return (df[pos:pos + size] for pos in range(0, len(df), size))

for a in chunker(dfLista_90, limit):
    to_excel = np.array_split(a, partition)
    count += 1
    a.to_excel('file_{:02d}.xlsx'.format(count), index=True)

Speed-up Python loop running by rows and elements in each row

I have a dataframe containing dates as rows and columns as $investment in each stock on a particular day ("ndate"). Also, I have a Series ("portT") containing the sum of the total investments in all stocks each date (series size: len(ndate)*1). Here is the code that calculates the weight of each stock/each date by dividing each element of each row of ndate by sum of that day:
(l,w)=port1.shape
for i in range(0,l):
    port1.iloc[i] = np.divide(ndate.iloc[i], portT.iloc[i])
The code works very slowly, could you please let me know how I can modify and speed it up? I tried to do this by vectorising, but did not succeed.
As this is just a simple division of two dataframes of the same shape (or you can formulate it as such), you can use the simple / operator; pandas will execute it element-wise (possibly with broadcasting if the shapes don't match, so be sure about that):
import pandas as pd
df1 = pd.DataFrame([[1,2], [3,4]])
df2 = pd.DataFrame([[2,2], [3,3]])
df_new = df1 / df2
#>>> pd.DataFrame([[0.5, 1.],[1., 1.333333]])
This is most likely doing the same operations internally that you specified in your example; however, the per-row assignments and checks are bypassed, which should give you some speedup.
EDIT:
I was mistaken about the outline of your problem; maybe include a minimal self-contained code example next time. Still, the / operator also works for a DataFrame and a Series in combination:
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]])
s = pd.Series([1,2])
new_df = df / s
#>>> pd.DataFrame([[1., 1.],[3., 2.]])
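Note that / aligns a Series on the columns. For the question's layout, where there is one total per date (one per row), the division has to broadcast down the rows instead; a minimal sketch with DataFrame.div and axis=0, using stand-in data:
import pandas as pd
import numpy as np

ndate = pd.DataFrame(np.random.rand(4, 3))   # stand-in for the $-per-stock frame
portT = ndate.sum(axis=1)                    # one total per date (per row)

port1 = ndate.div(portT, axis=0)             # each row divided by that date's total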
