Aggregate over two columns in pandas - python

I'm new to Python and pandas, so maybe I'm missing something. I have a dataframe with many columns, two of which are particularly important: Weight and Volume. I want to group the rows of the dataframe into clusters, subject to these conditions:
No cluster accumulates more than 30000 kg in total weight.
No cluster accumulates more than 30 m^3 in total volume.
Each cluster is as large as possible while staying below the two limits above.
The resulting cluster for each row is recorded in a "cluster" column of the same dataframe.
The algorithm is implemented in a procedural style with nested loops. I've been reading about the rolling and expanding functions in pandas, but I can't find an idiomatic pandas way (without loops?) to do this. Is there one?
Here is some code to help explain what I mean.
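The asker's snippet is not reproduced above; purely as a rough sketch of the greedy, loop-based approach described (column names Weight and Volume and the 30000 kg / 30 m^3 limits are taken from the question, everything else is assumed), it might look something like this:

import pandas as pd

MAX_WEIGHT = 30000   # kg, from the question
MAX_VOLUME = 30      # m^3, from the question

def assign_clusters(df):
    # Greedy single pass: keep adding rows to the current cluster until the
    # next row would push either running total over its limit, then start a
    # new cluster. Columns named "Weight" and "Volume" are assumed.
    cluster, total_w, total_v, labels = 0, 0.0, 0.0, []
    for weight, volume in zip(df["Weight"], df["Volume"]):
        if total_w + weight > MAX_WEIGHT or total_v + volume > MAX_VOLUME:
            cluster += 1
            total_w, total_v = 0.0, 0.0
        total_w += weight
        total_v += volume
        labels.append(cluster)
    return df.assign(cluster=labels)

Because each assignment depends on where the previous cluster was cut, the running totals cannot be expressed as a plain cumulative sum, which is part of why a purely vectorized pandas one-liner is hard to come by here.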

Related

Data science: How to find patterns and correlations in two datasets in python

I have two large datasets. Let's say a few thousand rows for dataset V, with 18 columns. I need to find correlations between individual rows (e.g., row V125 is similar to row V569 across the 18 columns). But since the dataset is large, I don't know how to filter it afterwards. Another problem is that I have dataset B (different information in my 18 columns) and I would like to find similar patterns between the two datasets (e.g., row V55 and row B985 are similar, V3 is present only if B45 is present, etc.). Is there a way to find this out? I'm open to any solution. PS: this is my first question, so let me know if it needs to be edited or if I'm not clear. Thank you for any help.
Row V125 is a value; perhaps you meant row 125. If two rows are the same, you can use the duplicated function in pandas, or Find from the Home menu in Excel.
For the second question, this can be done using bash or the Windows terminal for large datasets, but the simplest approach would be to merge the two datasets. For datasets of a few thousand rows, this is very quick. If you are using pandas DataFrames, you can then use append (or pd.concat) to merge them and find the duplicates.
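As a small illustration of that last suggestion (with tiny made-up stand-in frames rather than the real V and B data), concatenating and flagging exact duplicates could look like:

import pandas as pd

# Tiny stand-in frames for the V and B datasets (same column layout assumed)
V = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
B = pd.DataFrame({"a": [2, 7, 3], "b": [5, 8, 6]})

# Stack both datasets; duplicated(keep=False) marks rows that appear more than once
combined = pd.concat([V, B], keys=["V", "B"])
exact_matches = combined[combined.duplicated(keep=False)]
print(exact_matches)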
Although there is already an answer, and considering that the answers may be useful together, I pose the first part of the question as code; after this point, it may be possible to solve it by building on the first answer.
import numpy as np
from scipy import signal
"""
I have two large datasets.
Let's say few thousands rows for V dataset with 18 columns.
"""
features = 18
rows = 3000
V = np.random.random(features * rows).reshape(rows, features)
"""I would need to find correlations between individual rows
(e.g., row V125 is similar to row V569 across the 18 columns). """
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html
Correlations = signal.correlate(V, V.copy(), mode='full', method='auto')
""" As per docs:
Correlations variable above ,is An N-dimensional array containing a subset of the
discrete linear cross-correlation of in1 with in2."""
PS: Using V and V.copy() may not be the solution you want.
I find it convenient to correlate a set with itself to arrive at conclusions.
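If what is actually wanted is a row-by-row similarity matrix rather than a signal-style cross-correlation, np.corrcoef is another option worth mentioning; this is a minimal sketch on the same kind of random data, not something from the original answer:

import numpy as np

features, rows = 18, 3000
V = np.random.random((rows, features))

# np.corrcoef treats each row as a variable, so corr[i, j] is the
# Pearson correlation between row i and row j (shape: rows x rows).
corr = np.corrcoef(V)

# e.g. the five rows most similar to row 125, excluding row 125 itself
most_similar = np.argsort(corr[125])[::-1][1:6]
print(most_similar)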

Undersampling a large dataset under a specific condition applied to another column in python/pandas

I am currently working with a large dataset (about 40 columns and tens of thousands of rows) and I would like to undersample it to be able to work with it more easily.
For the undersampling, unlike the resample method from pandas, which resamples according to a timedelta, I'm trying to specify conditions on other columns to determine which data points to keep.
I'm not sure this is very clear, but for example, let's say I have 3 columns (index, time and temperature) as follows:
Now for the resampling, I would like to keep a data point every 1 s or every 2 °C; the resulting dataset would look like this:
I couldn't find a simple way of doing this with pandas. The only way would be to iterate over the rows, but that was very slow because of the size of my datasets.
I thought about using the diff method, but of course it can only compute the difference over a specified period; the same goes for pct_change, which could have been used to keep only the points in the regions where the variations are largest.
Thanks in advance if you have any suggestions on how to proceed with this resampling.
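There is no answer recorded here, but as a rough sketch of the greedy "keep a point every 1 s or every 2 °C" rule (assuming columns literally named time and temperature; the thresholds are the ones from the question), a single pass over NumPy arrays is already much faster than iterating over DataFrame rows:

import numpy as np
import pandas as pd

def undersample(df, time_col="time", temp_col="temperature", dt=1.0, dtemp=2.0):
    # Keep a row whenever at least dt seconds have passed or the temperature
    # has changed by at least dtemp degrees since the last kept row.
    times = df[time_col].to_numpy(dtype=float)
    temps = df[temp_col].to_numpy(dtype=float)
    keep = np.zeros(len(df), dtype=bool)
    keep[0] = True
    last_t, last_v = times[0], temps[0]
    for i in range(1, len(df)):
        if times[i] - last_t >= dt or abs(temps[i] - last_v) >= dtemp:
            keep[i] = True
            last_t, last_v = times[i], temps[i]
    return df[keep]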

How to speed up the same calculation on sub-combinations of pandas DataFrame columns?

I'm looking to apply the same function to multiple sub-combinations of a pandas DataFrame's columns. Imagine the full DataFrame having 15 columns; if I want to draw from this full DataFrame a sub-frame containing 10 columns, I would have 3003 such sub-frames in total. My current approach is to use multiprocessing, which works well for a full DataFrame with about 20 columns (184,756 combinations); however, the real full frame has 50 columns, leading to more than 10 billion combinations, at which point it will take too long. Is there any library that would be suitable for this type of calculation? I have used dask before and it's incredibly powerful, but dask is only suitable for calculations on a single DataFrame, not on different ones.
Thanks.
It's hard to answer this question without an MVCE. The best path forward depends on what you want to do with your 10 billion DataFrame combinations (write them to disk, train models, aggregations, etc.).
I'll provide some high level advice:
Using a columnar file format like Parquet allows for column pruning (grabbing certain columns rather than all of them), which can be memory efficient.
Persisting the entire DataFrame in memory with ddf.persist() may be a good way for you to handle this combinations problem, so you're not constantly reloading it.
Feel free to add more detail about the problem and I can add a more detailed solution.
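As a minimal sketch of those two points (the file name and column names below are hypothetical), column pruning and persisting with dask could look like:

import dask.dataframe as dd

# Column pruning: read only the columns needed for one sub-combination
cols = ["col_a", "col_b", "col_c"]          # hypothetical column names
sub = dd.read_parquet("full_frame.parquet", columns=cols)

# Or: load the full frame once, persist it in memory, and slice column
# subsets from the persisted copy instead of re-reading the file each time
full = dd.read_parquet("full_frame.parquet").persist()
result = full[cols].sum().compute()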

Pandas find the percentage of overlap and split to train test

I am running an ML experiment in Python and I am stuck with data that have overlaps.
I have a dataframe with multiple columns, and the rows are to a large extent similar to subsequent rows.
Are there pandas functions that can split my dataframe into two sets while trying to reduce the overlap between them, in the sense that the overall overlap between the two sets is as small as possible?
Unfortunately I cannot share the dataset, but if you can point me to relevant functions, that will be enough for me to continue searching and reading.
I would like to thank you in advance for your reply.
Regards
Alex
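One possible direction (a sketch only, not a complete answer): label near-identical rows as groups, then use scikit-learn's GroupShuffleSplit so that all rows of a group land in the same split. The grouping key below is entirely hypothetical and would need to reflect whatever "overlap" means for the real data.

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Stand-in data; in practice df would be the real dataframe
df = pd.DataFrame({"x": [1, 1, 2, 2, 3, 3, 4, 4], "y": range(8)})

# Hypothetical grouping: rows sharing the same value of "x" are considered
# overlapping and must not be split across train and test
groups = df.groupby("x").ngroup()

gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=groups))
train, test = df.iloc[train_idx], df.iloc[test_idx]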

Optimize distinct values on a large number of columns

I have a requirement to compute distinct values for a large number of columns (>20,000). I am currently using pyspark.sql.functions.approxCountDistinct() to get an approximation of each column's distinct count. That is super fast (HyperLogLog). After that, if the distinct count is below some threshold (like 10), we want the actual values. I have a loop that does this.
distinct_values_list[cname] = df.select(cname).distinct().collect()
It is extremely slow because, most of the time, I have many columns to process, possibly half of the columns (10K). Is there no way to make Spark do many columns at a time? It seems like it will only parallelize within each column, but is unable to handle many columns at once.
Appreciate any help I can get.
(Updated)
Not sure if it is fast enough, but you may want to try:
import pyspark.sql.functions as F

# Collect the set of distinct values for every column in a single job
df.select(*[
    F.collect_set(c).alias(c)
    for c in LIST_10k_COLS
]).collect()
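The first step (the approximate counts) can be done in one pass over all columns in the same spirit; a minimal sketch, reusing the threshold of 10 from the question:

import pyspark.sql.functions as F

# One aggregation job: approximate distinct count for every column at once
counts = df.agg(*[
    F.approx_count_distinct(c).alias(c) for c in df.columns
]).collect()[0].asDict()

# Keep only the columns whose approximate cardinality is below the threshold
low_card_cols = [c for c, n in counts.items() if n <= 10]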
Suppose you have only 2 values in each column. Then the number of unique combinations is 2^20000 ≈ 10^6020, i.e. a 1 followed by more than 6000 zeroes. If there are more than 2 values in some columns, this number will be even higher.
Revise your model. Are all columns really independent? Could it be that many of these columns just represent different values of the same dimension? Then maybe you can essentially reduce the number of columns.
Consider whether you are using the proper tool. Maybe some completely different approach (Neo4j, ...) would suit better?
