Compare values from two datasets using pandas - python

I have two datasets that partially overlap. The overlapping part should have identical values in two columns. However, I suspect that's not always the case. I want to check this using pandas, but I run into a problem: since the dataframes are structured differently, their row indexes do not correspond. Moreover, corresponding rows have a different "Name" or "ID". Therefore, I wanted to match rows by matching values from three other columns that I am confident are the same: latitude, longitude and number of samples (I need all three because some rows are collected at the same location and some rows may have the same number of samples).
In short, I want to formulate a condition that requires three columns in a row from either dataframe to be equal, and then check the values of the columns that I suspect are different. Unfortunately, I have not been able to formulate this problem well enough to make google find me the correct function.
Many thanks!
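A minimal sketch of the merge-then-compare idea described above, assuming made-up column names (latitude, longitude, n_samples for the match keys, value for a column suspected to differ):

import pandas as pd

# Toy stand-ins for the two datasets; the column names are assumptions.
df1 = pd.DataFrame({
    "latitude": [10.0, 20.0],
    "longitude": [1.0, 2.0],
    "n_samples": [5, 8],
    "value": [100, 200],   # a column suspected to differ
})
df2 = pd.DataFrame({
    "latitude": [10.0, 20.0],
    "longitude": [1.0, 2.0],
    "n_samples": [5, 8],
    "value": [100, 250],
})

keys = ["latitude", "longitude", "n_samples"]

# Inner-merge on the three key columns so only the overlapping rows remain;
# the suffixes distinguish the columns coming from each dataframe.
merged = df1.merge(df2, on=keys, suffixes=("_df1", "_df2"))

# Keep the rows where the suspect column disagrees between the two datasets.
mismatches = merged[merged["value_df1"] != merged["value_df2"]]
print(mismatches)

A second suspect column can be checked the same way by or-ing another inequality into the mask.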

Related

Data science: How to find patterns and correlations in two datasets in python

I have two large datasets. Let's say a few thousand rows for dataset V, with 18 columns. I would need to find correlations between individual rows (e.g., row V125 is similar to row V569 across the 18 columns). But since it's large, I don't know how to filter it afterwards. Another problem is that I have a second dataset, B (different information in my 18 columns), and I would like to find similar patterns between the two datasets (e.g., row V55 and row B985 are similar, V3 is present only if B45 is present, etc...). Is there a way to find this out? I'm open to any solutions. PS: this is my first question, so let me know if it needs to be edited or isn't clear. Thank you for any help.
Row V125 is a value; perhaps you meant row 125. If two rows are the same, you can use the duplicated function in pandas, or Find from the Home menu in Excel.
For the second question, this can be done using bash or the Windows terminal for large datasets, but the simplest option would be to merge the two datasets. For datasets of a few thousand rows, this is very quick. If you are using pandas dataframes, you can concatenate them (pandas.concat; the older append is deprecated) and then find the duplicates.
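As a rough, self-contained sketch of that second suggestion (stack the two frames, then look for duplicated rows); V and B here are tiny made-up frames sharing the same columns:

import pandas as pd

# Tiny made-up frames standing in for the two datasets (same columns in both).
V = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6], "c3": [7, 8, 9]})
B = pd.DataFrame({"c1": [1, 9, 3], "c2": [4, 9, 6], "c3": [7, 9, 9]})

# Stack the two datasets (pd.concat replaces the deprecated append) and keep
# every row that occurs more than once across them.
combined = pd.concat([V, B], keys=["V", "B"])
overlap = combined[combined.duplicated(keep=False)]
print(overlap)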
Although there is already an answer, and considering that the answers may be useful together, I have set up the first part of the question as code below. From that point on, it should be possible to solve it using the first answer.
`
import numpy as np
from scipy import signal
"""
I have two large datasets.
Let's say few thousands rows for V dataset with 18 columns.
"""
features = 18
rows = 3000
V = np.random.random(features * rows).reshape(rows, features)
"""I would need to find correlations between individual rows
(e.g., row V125 is similar to row V569 across the 18 columns). """
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html
Correlations = signal.correlate(V, V.copy(), mode='full', method='auto')
""" As per docs:
Correlations variable above ,is An N-dimensional array containing a subset of the
discrete linear cross-correlation of in1 with in2."""
`
PS: Using V and V.copy() may not be the solution you want; I find it convenient to correlate a set with itself before drawing conclusions.
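If row-to-row Pearson correlation is what is actually wanted, numpy.corrcoef may be closer to the goal than scipy.signal.correlate; this is only a sketch under that assumption, with random data standing in for the real V:

import numpy as np

rng = np.random.default_rng(0)
V = rng.random((3000, 18))   # 3000 rows, 18 columns, standing in for the real data

# corrcoef treats each row as a variable, so R[i, j] is the Pearson correlation
# between row i and row j across the 18 columns.
R = np.corrcoef(V)

# Ignore the trivial self-correlations on the diagonal, then pick out the most
# similar pair of rows.
np.fill_diagonal(R, -np.inf)
i, j = np.unravel_index(np.argmax(R), R.shape)
print(f"rows {i} and {j} are the most correlated: r = {R[i, j]:.3f}")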

How to check differences between column values in pandas?

I'm manually comparing two or three very similar rows using pandas. Is there a more automated way to do this? I would like a better method than using '=='.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html
See if this will satisfy your needs.
df['sales_diff'] = df['sales'].diff()
The above code snippet creates a new column in your dataframe which, by default, contains the difference from the previous row. You can play around with the axis parameter to compare rows or columns, and change periods to compare against a row or column a given number of steps away.
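For example, a quick sketch of those two parameters on a made-up frame:

import pandas as pd

df = pd.DataFrame({"sales": [10, 12, 15, 11], "returns": [1, 3, 2, 2]})

df["sales_diff"] = df["sales"].diff()                      # difference from the previous row
two_rows_back = df[["sales", "returns"]].diff(periods=2)   # compare with the row two back
across_columns = df[["sales", "returns"]].diff(axis=1)     # difference between adjacent columns
print(two_rows_back)
print(across_columns)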

Split Pandas Dataframe by column value with different length

I have a DataFrame like this : DATASET
I want to separate my data set into four parts.
Each of the four parts corresponds to a title.
The difficulty is that each of these four parts of the table has a different size. Moreover, I don't have only one table but many others (with parts of different sizes, but still four parts each).
The goal is to find a way to split my table by referring to the TITLE each time. Do you have an idea of how to do it?
Sincerely,
Etienne
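Without seeing the actual layout, one hedged sketch: assuming the title is available as a column (called TITLE here), groupby splits the table into its four parts regardless of their sizes:

import pandas as pd

# Made-up table; assume a TITLE column marks which of the four parts a row belongs to.
df = pd.DataFrame({
    "TITLE": ["A", "A", "B", "C", "C", "C", "D"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

# One sub-dataframe per title, whatever the size of each part.
parts = {title: part.reset_index(drop=True) for title, part in df.groupby("TITLE", sort=False)}
print(parts["C"])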

Optimize distinct values on a large number of columns

I have a requirement to compute distinct values for a large number of columns (>20,000). I am now using pyspark.sql.functions.approxCountDistinct() to get an approximation for each column's distinct count. That is super fast (HyperLogLog). After that, if the distinct count is below some threshold (like 10), we want the values. I have a loop that does this.
distinct_values_list[cname] = df.select(cname).distinct().collect()
It is extremely slow because, most of the time, I have many columns to process, possibly half of the columns (10K). Is there no way to make Spark handle many columns at a time? It seems to only parallelize within each column, but it cannot do many columns at once.
Appreciate any help I can get.
(Updated)
Not sure it is fast enough, but you may want to try
import pyspark.sql.functions as F

df.select(*[
    F.collect_set(c).alias(c)
    for c in LIST_10k_COLS
]).collect()
Suppose you have only 2 values in each column. Then the number of unique combinations is 2^20000 ≈ 10^6020, i.e., a 1 followed by more than 6000 zeroes. If some columns have more than 2 values, this number will be even higher.
Revise your model. Are all columns really independent? Could it be that many of these columns just represent different values of the same dimension? If so, maybe you can substantially reduce the number of columns.
Also consider whether you are using the proper tool. Maybe a completely different approach (Neo4j, ...) would suit better?
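For what it's worth, a sketch that combines the two steps (approximate counts first, then a single collect_set pass over only the low-cardinality columns); the toy dataframe and the THRESHOLD constant are just placeholders:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy frame standing in for the wide table; in practice it has >20,000 columns.
df = spark.createDataFrame(
    [(1, "a", x) for x in range(100)] + [(2, "b", x) for x in range(100)],
    ["low_card_1", "low_card_2", "high_card"],
)

THRESHOLD = 10

# Pass 1: approximate distinct counts for every column, in a single job.
approx = df.select([F.approx_count_distinct(c).alias(c) for c in df.columns]).first().asDict()

# Pass 2: collect the actual distinct values, but only for the columns whose
# approximate cardinality is below the threshold, again in a single job.
small_cols = [c for c, n in approx.items() if n <= THRESHOLD]
distinct_values = df.select([F.collect_set(c).alias(c) for c in small_cols]).first().asDict()
print(distinct_values)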

Joining a large and a massive spark dataframe

My problem is as follows:
I have a large dataframe called details containing 900K rows, and another one named attributes containing 80M rows.
Both have a column A on which I would like to do a left-outer join, the left dataframe being details.
There are only 75K unique entries in column A in the dataframe details. The dataframe attributes has 80M unique entries in column A.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join i.e. details.join(attributes, "A", how="left_outer") just times out (or gives out of memory).
Since there are only 75K unique entries in column A in details, we don't care about the rest in the dataframe in attributes. So, first I filter that using:
uniqueA = details.select('A').distinct().collect()
uniqueA = map(lambda x: x.A, uniqueA)
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
I thought this would work out because the attributes table comes down from 80M rows to a mere 75K rows. However, it still takes forever to complete the join (and it never completes).
Next, I thought that there are too many partitions and the data to be joined is not on the same partition. Though I don't know how to bring all the data to the same partition, I figured repartitioning might help. So here it goes.
details_repartitioned = details.repartition("A")
attributes_repartitioned = attributes.repartition("A")
The above operation brings the number of partitions in attributes down from 70K to 200. The number of partitions in details is about 1100.
details_attributes = details_repartitioned.join(
    broadcast(attributes_repartitioned), "A", how='left_outer')  # tried without broadcast too
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.
P.S. I have already seen this question but that does not answer this question.
The details table has 900K items with 75K distinct entries in column A. I think the filter on column A you have tried is the correct direction. However, the collect followed by the map operation
attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA))
is too expensive. An alternative approach would be
from pyspark import StorageLevel

uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # breaking the DAG lineage
attrJoined = attributes.join(uniqueA, "A", "inner")
Also, you probably need to set the shuffle partitions correctly if you haven't done so yet.
One problem that could happen in your dataset is skew: it could be that, among the 75K unique values, only a few join with a large number of rows in the attributes table. In that case the join could take much longer and may never finish.
To resolve that, you need to find the skewed values of column A and process them separately.
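If the skewed keys sit in attributes, one workaround is to avoid shuffling attributes on A at all for the filtering step by broadcasting the small distinct-key set; a sketch reusing details and attributes from the question (an illustration, not a guaranteed fix):

from pyspark import StorageLevel
from pyspark.sql.functions import broadcast

# Only ~75K distinct keys, so this set is small enough to broadcast.
uniqueA = details.select("A").distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # materialise it and break the lineage

# Broadcast semi-join: shrink attributes to the keys that actually occur in
# details, without shuffling the 80M-row table on A.
attributes_filtered = attributes.join(broadcast(uniqueA), "A", "inner")

# The final left-outer join now runs against the much smaller table.
result = details.join(attributes_filtered, "A", "left_outer")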
