Faster Way of Getting Distinct Rows - python

Suppose we have a PySpark dataframe with ~10M rows. Is there a faster way of getting distinct rows compared to df.distinct()? Maybe use df.groupBy()?

If you select only the columns of interest before doing the operation, it will be faster (smaller dataset).
Something like:
columns_to_select = ["col1", "col2"]
df.select(columns_to_select).distinct()
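For illustration, a minimal sketch reusing the made-up col1/col2 names above: three spellings that return the same distinct pairs, including the groupBy idea from the question. Whichever you pick, projecting down to the needed columns first keeps the shuffled data small:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 10), (1, "a", 20), (2, "b", 30)],
    ["col1", "col2", "col3"])

via_distinct = df.select("col1", "col2").distinct()                        # project first, then dedupe
via_dropdup  = df.dropDuplicates(["col1", "col2"]).select("col1", "col2")  # dedupe on a subset of columns
via_groupby  = df.groupBy("col1", "col2").count().drop("count")            # the groupBy variant from the question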

Related

pyspark Drop rows in dataframe to only have X distinct values in one column

So I have a dataframe with a column "Category" that has over 12k distinct values. For sampling purposes, I would like to get a small sample where there are only 1000 different values of this category column.
Before, I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient, as I'm doing a distinct of the category column and then collecting it into a normal Python list so I can use the isin() filter.
Is there any "Spark" way of doing this? I thought maybe something with rolling window functions could do the job, but I can't work it out.
Thanks!
You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using left_semi is an efficient way to filter a table using another table while keeping the same schema.
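For illustration, a toy sketch (hypothetical data) of what the semi join does: it keeps only the rows of df whose category appears in small_distinct, and it never pulls columns in from the right-hand side:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "a")], ["id", "category"])

small_distinct = df.select("category").distinct().limit(2)   # pretend 2 is the sample size
sampled = df.join(small_distinct, "category", "left_semi")   # same schema as df, fewer rows
sampled.show()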

Fastest way to search dataframe with conditions

I am looking for the most efficient way to search a large dataframe based on specific conditions. I have tried .loc, .iloc, and numpy, but all of them are too slow. The fastest thus far is with numpy, where my code looks something like this:
ParsedTimestamp = []
for index, row in df_primary.iterrows():
    d_index = list(np.where((df_data['filePath'] == row['FilePath'])
                            & (df_data['session id'] == row['ChannelName'])
                            & (df_data['message'] == row['Text'])
                            & (df_data['d_temp'] == row['MessageTimestamp']))[0])[0]
    ParsedTimestamp.append(df_data.loc[d_index]['Datetime UTC'])
As you may be able to tell, I have one dataframe (df_primary) for which I need to match 4 values against another dataframe (df_data) to find a more accurate timestamp. The issue is that each search for the index in df_data that matches the row in df_primary takes over 1 second, which is much too long. The df_data dataframe is about 2.5 million rows.
I am open to converting the dataframes to dictionaries or any other forms, but from my research I have been told that dictionaries are less efficient at this size. Does anybody have any suggestions?
Why don't you just merge?
ParsedTimestamp = pd.merge(
    df_data, df_primary,
    left_on=['filePath', 'session id', 'message', 'd_temp'],
    right_on=['FilePath', 'ChannelName', 'Text', 'MessageTimestamp']
)['Datetime UTC']
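If the result needs to line up row-for-row with df_primary, a left merge keyed from df_primary does the same lookup; a sketch assuming the column names from the question (unmatched rows come back as NaN, and duplicate matches in df_data would duplicate rows):
import pandas as pd

merged = df_primary.merge(
    df_data,
    how='left',
    left_on=['FilePath', 'ChannelName', 'Text', 'MessageTimestamp'],
    right_on=['filePath', 'session id', 'message', 'd_temp'])
ParsedTimestamp = merged['Datetime UTC']   # one entry per df_primary row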

Fastest way to update pandas columns based on matching column from other pandas dataframe

I have two pandas dataframes, and one has updated values for a subset of the values in the primary dataframe. The main one is ~2m rows and the column to update is ~20k. This operation runs extremely slowly as I have it below, which is O(m*n) as far as I can tell. Is there a good way to vectorize it or otherwise increase the speed? I don't see what other optimizations apply to this case. I have also tried making the 'object_id' column the index, but that didn't lead to a meaningful increase in speed.
# df_primary: this is 2m rows
# df_updated: this is 20k rows
for idx, row in df_updated.iterrows():
    df_primary.loc[df_primary.object_id == row.object_id, ['status', 'category']] = [row.status, row.category]
Let's try DataFrame.update to update df_primary in place using values from df_updated:
df_primary = df_primary.set_index('object_id')
df_primary.update(df_updated.set_index('object_id')[['status', 'category']])
df_primary = df_primary.reset_index()
Use join methods (left/right/inner, depending on your requirements). A vectorized join will be far faster than iterating row by row.
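As a sketch of that join-based idea (one possible way to write it), assuming the object_id/status/category names from the question: merge the updates in with a suffix, prefer the new values where they exist, then drop the helper columns:
import pandas as pd

merged = df_primary.merge(
    df_updated[['object_id', 'status', 'category']],
    on='object_id', how='left', suffixes=('', '_new'))

for col in ['status', 'category']:
    # take the updated value where one exists, otherwise keep the original
    merged[col] = merged[col + '_new'].combine_first(merged[col])

df_primary = merged.drop(columns=['status_new', 'category_new'])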

Pandas dataframe: selecting max by column for subset

Am fairly new to pandas and going around in circles trying to find an easy way to solve the following problem:
I have a large correlation matrix (several thousand rows / columns) as a dataframe and would like to extract the maximum value by column excluding the '1' which is of course present in all columns (diagonal of the matrix).
I have tried all sorts of variations of .max() and .idxmax(), including the following:
corr.drop(corr.idxmax()).max()
But I only get nonsense results. Any help is highly appreciated.
You can probably use np.fill_diagonal:
import numpy as np

df_values = df.values.copy()          # work on a copy so df itself is untouched
np.fill_diagonal(df_values, -np.inf)  # knock out the 1s on the diagonal
df_values.max(0)                      # column-wise max of the remaining values
Or as a one-liner, dropping each row's diagonal entry and taking the row-wise max (which equals the column-wise max here, because a correlation matrix is symmetric):
df.values[~np.eye(df.shape[0], dtype=bool)].reshape(df.shape[0], -1).max(1)
This will get the 2nd highest values from each column.
As array:
np.partition(df.values, len(df)-2, axis=0)[len(df)-2]
or in a dataframe:
pd.DataFrame(np.partition(df.values, len(df)-2, axis=0)[len(df)-2],
             index=df.columns, columns=['2nd'])
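If you prefer to stay in pandas and keep the column labels, here is one more way to express the same idea (a sketch, assuming df is the square correlation matrix): mask the diagonal with NaN and let max() skip it:
import numpy as np

col_max = df.mask(np.eye(len(df), dtype=bool)).max()   # Series indexed by column name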

How to SELECT DISTINCT from a pandas hdf5store?

I have a large amount of data in an HDFStore (as a table), on the order of 80M rows with 1500 columns. Column A has integer values ranging between 1 and 40M or so. The values in column A are not unique and there may be between 1 and 30 rows with the same column A value. In addition, all rows which share a common value in column A will also have a common value in column B (not the same value as column A though).
I would like to do a select against the table to get a list of column A values and their corresponding column B values. The equivalent SQL statement would be something like SELECT DISTINCT ColA, ColB FROM someTable. What are some ways to achieve this? Can it be done such that the results of the query are stored directly into another table in the HDFStore?
Blocked Algorithms
One solution would be to look at dask.dataframe which implements a subset of the Pandas API with blocked algorithms.
import dask.dataframe as dd
df = dd.read_hdf('myfile.hdf5', '/my/data', columns=['A', 'B'])
result = df.drop_duplicates().compute()
In this particular case dd.DataFrame.drop_duplicates would pull out a medium-sized block of rows, perform the pd.DataFrame.drop_duplicates call and store the (hopefully smaller) result. It would do this for all blocks, concatenate them, and then perform a final pd.DataFrame.drop_duplicates on the concatenated intermediate result. You could also do this with just a for loop. Your case is a bit odd in that you also have a large number of unique elements. This might still be a challenge to compute even with blocked algorithms. Worth a shot though.
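For reference, a rough sketch of that plain for-loop variant with just pandas, assuming the same 'myfile.hdf5' / '/my/data' names as above and a made-up '/distinct_ab' key for storing the result back into the same file:
import pandas as pd

pieces = []
with pd.HDFStore('myfile.hdf5') as store:
    # read only the two needed columns, one block at a time
    for chunk in store.select('/my/data', columns=['A', 'B'], chunksize=1_000_000):
        pieces.append(chunk.drop_duplicates())
    result = pd.concat(pieces).drop_duplicates()
    # write the deduplicated pairs back as another table in the same store
    store.put('/distinct_ab', result, format='table')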
Column Store
Alternatively you should consider looking into a storage format that can store your data as individual columns. This would let you collect just the two columns that you need, A and B, rather than having to wade through all of your data on disk. Arguably you should be able to fit 80 million rows into a single Pandas dataframe in memory. You could consider bcolz for this.
To be clear, you tried something like this and it didn't work?
import pandas
import tables
import pandasql
Check that your store is the type you think it is:
in: store
out: <class 'pandas.io.pytables.HDFStore'>
You can select a table from a store like this:
df = store.select('tablename')
Check that it worked:
in: type(df)
out: pandas.core.frame.DataFrame
Then you can do something like this:
q = """SELECT DISTINCT ColA, ColB FROM df"""
distinct_df = pandasql.sqldf(q, locals())
(note that you will get deprecation warnings doing it this way, but it works)
