I have a requirement to compute the distinct values for a large number of columns (>20,000). I am currently using pyspark.sql.functions.approx_count_distinct() (formerly approxCountDistinct()) to get an approximate distinct count for each column. That is super fast (HyperLogLog). After that, if the distinct count is below some threshold (like 10), we want the actual values. I have a loop that does this:
for cname in candidate_cols:
    distinct_values_list[cname] = df.select(cname).distinct().collect()
This is extremely slow because, most of the time, I have many columns to process, possibly half of them (10K). Is there a way to make Spark handle many columns at a time? It seems it only parallelizes within each column and cannot process many columns at once.
I appreciate any help I can get.
(Updated)
Not sure it is fast enough, but you may want to try:
import pyspark.sql.functions as F

df.select(*[
    F.collect_set(c).alias(c)
    for c in LIST_10k_COLS
]).collect()
Suppose you have only 2 values in each column. Then the number of unique combinations is 2^20000 ≈ 10^6020. That is a 1 followed by about 6,000 zeroes. If some columns have more than 2 values, this number will be even higher.
Revise your model. Are all the columns really independent? Could it be that many of these columns just represent different values of the same dimension? If so, maybe you can substantially reduce the number of columns.
Also consider whether you are using the proper tool. Maybe a completely different approach (Neo4j, ...) suits the problem better?
I have two datasets that partially overlap. The overlapping part should have identical values in two columns. However, I suspect that's not always the case. I want to check this using pandas, but I run into a problem: since the dataframes are structured differently, their row indexes do not correspond. Moreover, corresponding rows have a different "Name" or "ID". Therefore, I wanted to match rows by matching values from three other columns that I am confident are the same: latitude, longitude and number of samples (I need all three because some rows are collected at the same location and some rows may have the same number of samples).
In short, I want to formulate a condition that requires three columns in a row from either dataframe to be equal, and then check the values of the columns that I suspect differ. Unfortunately, I have not been able to formulate this problem well enough to get Google to find me the right function.
Many thanks!
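One way to sketch this with pandas: inner-join the two dataframes on the three trusted columns, then compare the suspect columns on the merged result. All column names here (lat, lon, n_samples, value) are hypothetical stand-ins:

```python
import pandas as pd

# Two toy dataframes standing in for the partially overlapping datasets;
# row order and indexes deliberately do not correspond.
df1 = pd.DataFrame({"lat": [1.0, 2.0], "lon": [3.0, 4.0],
                    "n_samples": [5, 6], "value": [10, 20]})
df2 = pd.DataFrame({"lat": [2.0, 1.0], "lon": [4.0, 3.0],
                    "n_samples": [6, 5], "value": [20, 99]})

# Match rows on the three columns known to agree.
merged = df1.merge(df2, on=["lat", "lon", "n_samples"], suffixes=("_1", "_2"))

# Rows where the suspect column disagrees between the two datasets.
mismatches = merged[merged["value_1"] != merged["value_2"]]
```

An inner merge keeps only the overlapping part, which is exactly the subset where the values are supposed to be identical.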
I have a few questions about slicing operations.
In pandas we can do operations like the following:
df["A"].iloc[0]
df["B"].iloc[-1]
# here df["A"] and df["B"] are sorted
Since we can't do this (slicing after sorting by multiple columns) with Dask (I am not 100% sure), I used another way to do it:
sorted_a = df.sort_values(by=['A'])
first = list(sorted_a["A"])[0]
sorted_b = df.sort_values(by=['B'])
end = list(sorted_b["B"])[-1]
This way is really time-consuming when the dataframe is large. Is there any other way to do this operation?
https://docs.dask.org/en/latest/dataframe-indexing.html
https://docs.dask.org/en/latest/array-slicing.html
I tried working with this, but it does not work.
The index in Dask is different from pandas, because a pandas index is a global ordering of the data. A Dask dataframe's index restarts within each partition, so there can be multiple rows with the same index value. This is why row-wise .iloc is disallowed, I think.
For this specifically, use:
first: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.first.html
last: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.last.html
Sorting is a very expensive operation for large dataframes spread across multiple machines, whereas first and last are very parallelizable operations, because they can be computed per partition and then combined across the partition results.
It's possible to get almost .iloc-like behaviour with Dask dataframes, but it requires a pass through the whole dataset. For very large datasets, this might be a meaningful time cost.
The rough steps are: 1) create a unique index that matches the row numbering (modifying this answer to start from zero, or using this answer), and 2) swap .iloc[N] for .loc[N].
This won't help with relative syntax like .iloc[-1]; however, if you know the total number of rows, you can compute the corresponding absolute position to pass into .loc.
Suppose we have a very, very large table (with columns like id, date, and value).
Which should be faster: using PySpark to read the whole table and run
df.groupBy("id").agg({'value': 'sum'})
or reading the table partially, filtering on date first, running
df.where(...).groupBy("id").agg({'value': 'sum'})
for each date, and then summing all the partial results together?
Can you please elaborate your question? I am assuming that you want to perform the following operations:
case-1: sum(value) group by id (adds all date)
case-2: sum(value) group by id where date = 1
In either case, your performance will depend on the following:
The cardinality of your id column: whether you have a very large number of unique values, or a small set of unique values repeating.
The type of file format you are using. If it is columnar (like parquet) vs row (like csv)
The partitioning and bucketing strategy you are using for storing the files. If the date column holds few values, go for partitioning; otherwise, bucketing.
All these factors will help determine whether the above two cases show similar processing times or differ drastically. This is because reading a large amount of data takes more time than reading less (pruned) data with the given filters. Your shuffle block size and partition count also play a key role.
I have two large datasets. Let's say a few thousand rows for the V dataset, with 18 columns. I need to find correlations between individual rows (e.g., row V125 is similar to row V569 across the 18 columns). But since the dataset is large, I don't know how to filter it afterwards. Another problem is that I have a B dataset (different information on my 18 columns), and I would like to find similar patterns between the two datasets (e.g., row V55 and row B985 are similar; V3 is present only if B45 is present; etc.). Is there a way to find this out? I'm open to any solutions. PS: this is my first question, so let me know if it needs to be edited or if I'm not clear. Thank you for any help.
V125 is a value; perhaps you meant row 125. If two rows are the same, you can use the duplicated() function in pandas, or Find from the Home menu in Excel.
For the second question, this can be done using bash or the Windows terminal for large datasets, but the simplest approach is to merge the two datasets. For datasets of a few thousand rows, this is very quick. If you are using pandas dataframes, you can use pd.concat (DataFrame.append is deprecated) to combine them and then find the duplicates.
Although there is already an answer, the answers may be useful together, so I pose the first part of the question as code. From that point on, it may be possible to solve the rest by making use of the first answer.
```
import numpy as np
from scipy import signal

"""
I have two large datasets.
Let's say a few thousand rows for the V dataset with 18 columns.
"""
features = 18
rows = 3000
V = np.random.random(features * rows).reshape(rows, features)

"""I would need to find correlations between individual rows
(e.g., row V125 is similar to row V569 across the 18 columns)."""
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html
Correlations = signal.correlate(V, V.copy(), mode='full', method='auto')

"""As per the docs, the Correlations variable above is an N-dimensional array
containing a subset of the discrete linear cross-correlation of in1 with in2."""
```
PS: Using V and V.copy() may not be exactly the solution you want; I find it convenient to correlate a set with itself to reach conclusions.
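As an alternative to signal.correlate: if plain Pearson correlation between rows is what is wanted, np.corrcoef gives the full row-by-row correlation matrix directly. A sketch under that assumption (the 0.9 threshold is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((3000, 18))

# R[i, j] is the Pearson correlation between row i and row j
# across the 18 columns.
R = np.corrcoef(V)  # shape (3000, 3000)

# Distinct row pairs whose correlation exceeds a threshold; np.triu with
# k=1 zeroes the diagonal and lower triangle so each pair appears once.
i, j = np.where(np.triu(R, k=1) > 0.9)
pairs = list(zip(i.tolist(), j.tolist()))
```

For the cross-dataset case, np.corrcoef(V, B) stacks the rows of both arrays, so the V-vs-B correlations sit in the off-diagonal blocks of the result.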
I'm new to Python and pandas, and maybe I'm missing something. I have many columns in a dataframe, with two important columns in particular: Weight and Volume. I want to group the rows of the dataframe into many clusters, with these conditions:
No cluster accumulates (summing) more than 30,000 kg in weight.
No cluster accumulates (summing) more than 30 m^3 in volume.
Each cluster is as large as possible while staying below the limits expressed in the first two points.
The resulting cluster for each row is annotated in a "cluster" column in the same dataframe.
The algorithm is currently implemented in a procedural style, with nested loops. I have been reading about the rolling and expanding functions in pandas, but I can't find a pandas-idiomatic way (without loops?) to do this. Is there a way?
Here is some code to help explain what I mean.