Select rows in one DataFrame based on rows in another - python

Let's assume I have a very large pandas DataFrame dfBig with columns Param1, Param2, ..., ParamN, score, step, and a smaller DataFrame dfSmall with columns Param1, Param2, ..., ParamN (i.e. missing the score and step columns).
I want to select all the rows of dfBig for which the values of columns Param1, Param2, ..., ParamN match those of some row in dfSmall. Is there a clean way of doing this in pandas?
Edit: To give an example, consider this DataFrame dfBig:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
A | 2 | 0.4 | 10
A | 2 | 0.5 | 20
A | 2 | 0.6 | 30
B | 1 | 0.1 | 10
B | 1 | 0.2 | 20
B | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
Let's imagine a model is specified by a pair (Arch, Layers). I want to query dfBig and get the score-over-time series for the best-performing model with Arch A and with Arch B.
Following EdChum's answer below, I take it that the best solution is to do something like this procedurally:
modelColumns = [col for col in dfBig.columns if col not in ["Time", "Score"]]
# best score reached within each Arch
groupedBest = dfBig.groupby("Arch", as_index=False)["Score"].max()
# the (Arch, Layers) pairs that achieved those best scores
dfSmall = pd.merge(groupedBest, dfBig)[modelColumns].drop_duplicates()
dfBest = pd.merge(dfSmall, dfBig)
which yields:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
If there's a better way to do this, I'm happy to hear it.

If I understand your question correctly, you should be able to just call merge on dfBig and pass dfSmall; by default merge performs an inner join on the shared columns, so it will look for matches in those aligned columns and only return the matching rows.
Example:
In [71]:
import numpy as np
import pandas as pd

dfBig = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100), 'c': np.arange(100)})
dfSmall = pd.DataFrame({'a':[3,4,5,6]})
dfBig.merge(dfSmall)
Out[71]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6

Related

Calculate mean, median, first and third quartile from a dataframe with a distribution, in pandas

I have an aggregate-level dataframe as follows:
errors | class | num_students
1 | A | 5
2 | A | 8
3 | A | 2
...
10 | A | 1
1 | B | 9
2 | B | 12
3 | B | 5
10 | B | 2
...
The original data was at student-ID level, so this dataframe holds the distribution of errors for each class. I want to get summary statistics per class from my dataframe, looking like the table below:
Class | average error | median error | Q1 error | Q3 error
A | 2.1 | 2 | 1 | 3
B | 3.4 | 3 | 2 | 5
What is the best way to accomplish this?
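One possible approach (a minimal sketch, assuming the aggregated counts can simply be expanded back to one row per student): repeat each row num_students times with Index.repeat, then aggregate per class.
import pandas as pd

# simplified version of the aggregated error distribution above
df = pd.DataFrame({
    "errors":       [1, 2, 3, 10, 1, 2, 3, 10],
    "class":        ["A", "A", "A", "A", "B", "B", "B", "B"],
    "num_students": [5, 8, 2, 1, 9, 12, 5, 2],
})

# expand back to one row per student, then summarise per class
expanded = df.loc[df.index.repeat(df["num_students"]), ["class", "errors"]]
summary = expanded.groupby("class")["errors"].agg(
    average_error="mean",
    median_error="median",
    Q1_error=lambda s: s.quantile(0.25),
    Q3_error=lambda s: s.quantile(0.75),
)
print(summary)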

Filter according to dynamic conditions that are dependent

I have a dataframe that looks like:
+----+---------+---------+
| | Count | Value |
|----+---------+---------|
| 0 | 10 | 0.5 |
| 1 | 17 | 0.9 |
| 2 | 56 | 0.6 |
| 3 | 25 | 0.7 |
| 4 | 80 | 0.7 |
| 5 | 190 | 0.6 |
| 6 | 3 | 0.8 |
| 7 | 60 | 0.5 |
+----+---------+---------+
Now I want to filter: rows with a smaller Count need a higher Value to stay in focus.
The dependencies could look like: dict({100:0.5, 50:0.6, 40:0.7, 20:0.75, 10:0.8})
Examples:
if Count is at least 100, Value only needs to be greater than or equal to 0.5
if Count is only 10 to 19, Value needs to be greater than or equal to 0.8
I could filter it easily with:
df[((df["Count"]>=100) & (df["Value"]>=0.5)) |
((df["Count"]>=50) & (df["Value"]>=0.6)) |
((df["Count"]>=40) & (df["Value"]>=0.7)) |
((df["Count"]>=20) & (df["Value"]>=0.75)) |
((df["Count"]>=10) & (df["Value"]>=0.8))]
+----+---------+---------+
| | Count | Value |
|----+---------+---------|
| 1 | 17 | 0.9 |
| 2 | 56 | 0.6 |
| 4 | 80 | 0.7 |
| 5 | 190 | 0.6 |
+----+---------+---------+
But I want to change the thresholds periodically (also adding or removing threshold steps) without constantly rewriting the filter. How could I do this in pandas?
MWE
import pandas as pd

df = pd.DataFrame({
    "Count": [10, 17, 56, 25, 80, 190, 3, 60],
    "Value": [0.5, 0.9, 0.6, 0.7, 0.7, 0.6, 0.8, 0.5]
})
limits = {100: 0.5, 50: 0.6, 40: 0.7, 20: 0.75, 10: 0.8}
R equivalent
In R I could solve a similar question with the following code (thanks to akrun), but I don't know how to adapt it to pandas.
library(data.table)
set.seed(33)
df = data.table(CPE=sample(1:500, 100),
                PERC=runif(min = 0.1, max = 1, n=100))
lst1 <- list(c(20, 0.95), c(50, 0.9), c(100,0.85), c(250,0.8))
df[Reduce(`|`, lapply(lst1, \(x) CPE > x[1] & PERC > x[2]))]
Let's simplify your code by using boolean reduction with np.logical_or. This is also very close to what you are trying to do in R:
import numpy as np

c = ['Count', 'Value']
# each (count_threshold, value_threshold) pair is checked against both columns at once
df[np.logical_or.reduce([df[c].ge(t).all(1) for t in limits.items()])]
Count Value
1 17 0.9
2 56 0.6
4 80 0.7
5 190 0.6
I would use pandas.cut to perform the comparison in linear time. If you have many threshold groups, performing multiple comparisons becomes inefficient (O(n*m) complexity):
# sorted bins and matching labels
bins = sorted(limits)
# [10, 20, 40, 50, 100]
labels = [limits[x] for x in bins]
# [0.8, 0.75, 0.7, 0.6, 0.5]
# mapping threshold from bins
s = pd.cut(df['Count'], bins=[0]+bins+[np.inf], labels=[np.inf]+labels, right=False).astype(float)
out = df[df['Value'].ge(s)]
Output:
Count Value
1 17 0.9
2 56 0.6
4 80 0.7
5 190 0.6
Intermediate s:
0 0.80
1 0.80
2 0.60
3 0.75
4 0.60
5 0.50
6 inf
7 0.60
Name: Count, dtype: float64

How do I transpose and aggregate this dataframe in the right order?

I am trying to find an efficient way to create a dataframe which lists all distinct game values as columns and aggregates the game-play hours by user_id. This is my example df:
user_id | game | game_hours | rank_order
1 | Fortnight | 1.5 | 1
1 | COD | 0.5 | 2
1 | Horizon | 1.7 | 3
1 | ... | ... | n
2 | Fifa2021 | 1.9 | 1
2 | A Way Out | 0.2 | 2
2 | ... | ... | n
...
Step 1: How do I get from this to the following df format (matching rank_order correctly, since it reflects the time sequence)?
user_id | game_1 | game_2 | game_3 | game_n ...| game_hours
1 | Fortnight | COD | Horizon| | 3.7
2 | Fifa21 | A Way Out | | | 2.1
...
Use DataFrame.pivot with DataFrame.add_prefix, and for the new column use DataFrame.assign with a sum aggregation:
df = (df.pivot(index='user_id', columns='rank_order', values='game')
        .add_prefix('game_')
        .assign(game_hours=df.groupby('user_id')['game_hours'].sum())
        .reset_index()
        .rename_axis(None, axis=1))
print(df)
user_id game_1 game_2 game_3 game_hours
0 1 Fortnight COD Horizon 3.7
1 2 Fifa2021 A Way Out NaN 2.1

Add values in two Spark DataFrames, row by row

I have two Spark DataFrames with values that I would like to add and then multiply, keeping only the lowest results. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest values results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest-value results. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in pandas I would essentially write a nested for loop and append the results to another DataFrame. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Multiply columns:
Multiplied:
DFA.ID | DFA.Mul | DFB.ID | DFB.Mul
0 | 8 | 0 | 20
1 | 18 | 0 | 20
0 | 8 | 1 | 63
1 | 18 | 1 | 63
Step 3.
Group by DFA.ID and select min from DFA.Mul and DFB.Mul
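A minimal PySpark sketch of those three steps; here the combination step uses the asker's math_func formula ((aValOne + bValOne) * (aValTwo + bValTwo)) so the numbers match the expected "Final Results" above, and the column aliases are only illustrative:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dfA = spark.createDataFrame([(0, 2, 4), (1, 3, 6)], ["ID", "ValOne", "ValTwo"])
dfB = spark.createDataFrame([(0, 4, 5), (1, 7, 9)], ["ID", "ValOne", "ValTwo"])

# Step 1: cartesian join of the two DataFrames
joined = dfA.alias("a").crossJoin(dfB.alias("b"))

# Step 2: combine the paired columns into a single Value per (A, B) row pair
combined = joined.select(
    F.col("a.ID").alias("DataFrameA_ID"),
    F.col("b.ID").alias("DataFrameB_ID"),
    ((F.col("a.ValOne") + F.col("b.ValOne"))
     * (F.col("a.ValTwo") + F.col("b.ValTwo"))).alias("Value"),
)

# Step 3: group by DataFrameA_ID and keep the lowest Value
# (join this back to `combined` if you also need the matching DataFrameB_ID)
result = combined.groupBy("DataFrameA_ID").agg(F.min("Value").alias("Value"))
result.show()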

Pandas merge or join in smaller dataframe

I have one long dataframe and one short dataframe, and I want to merge them so that the shorter dataframe repeats itself to fill the length of the longer (left) df.
df1:
| Index | Wafer | Chip | Value |
---------------------------------
| 0 | 1 | 32 | 0.99 |
| 1 | 1 | 33 | 0.89 |
| 2 | 1 | 39 | 0.96 |
| 3 | 2 | 32 | 0.81 |
| 4 | 2 | 33 | 0.87 |
df2:
| Index | x | y |
-------------------------
| 0 | 1 | 3 |
| 1 | 2 | 2 |
| 2 | 1 | 6 |
df_combined:
| Index | Wafer | Chip | Value | x | y |
-------------------------------------------------
| 0 | 1 | 32 | 0.99 | 1 | 3 |
| 1 | 1 | 33 | 0.89 | 2 | 2 |
| 2 | 1 | 39 | 0.96 | 1 | 6 |
| 3 | 2 | 32 | 0.81 | 1 | 3 | <--- auto-repeats...
| 4 | 2 | 33 | 0.87 | 2 | 2 |
Is this a built-in join/merge type, or does it require a loop of some sort?
{This is just dummy data, but the real dfs are over 1000 rows...}
My current code is a simple outer merge, but it doesn't fill/repeat to the end:
df = main.merge(df_coords, left_index=True, right_index=True, how='outer')
which just gives NaNs.
I've checked around:
Merge two python pandas data frames of different length but keep all rows in output data frame
pandas: duplicate rows from small dataframe to large based on cell value
and it feels like this could be an argument somewhere in a merge function... but I can't find it.
Any help gratefully received.
Thanks
You can repeat df2 until it's as long as df1, then reset_index and merge:
import math

new_len = math.ceil(len(df1) / len(df2))   # enough copies of df2 to cover df1
repeated = (pd.concat([df2] * new_len)
            .reset_index(drop=True)
            .iloc[:len(df1)])
repeated
x y
0 1 3
1 2 2
2 1 6
3 1 3
4 2 2
df1.merge(repeated, how="outer", left_index=True, right_index=True)
Wafer Chip Value x y
0 1 32 0.99 1 3
1 1 33 0.89 2 2
2 1 39 0.96 1 6
3 2 32 0.81 1 3
4 2 33 0.87 2 2
A little hacky, but it should work.
Note: I'm assuming your Index column is not actually a column, but is in fact intended to represent the data frame index. I'm making this assumption because you refer to left_index/right_index args in your merge() code. If Index is actually its own column, this code will basically work, you'll just need to drop Index as well if you don't want it in the final df.
You can achieve this with a left join on the value of df1["Index"] mod the length of df2["Index"]:
# Creating Modular Index values on df1
n = df2.shape[0]
df1["Modular Index"] = df1["Index"].apply(lambda x: str(int(x)%n))
# Merging dataframes
df_combined = df1.merge(df2, how="left", left_on="Modular Index", right_on="Index")
# Dropping unnecessary columns
df_combined = df_combined.drop(["Modular Index", "Index_y"], axis=1)
print(df_combined)
  Index_x Wafer Chip Value x y
0 0 1 32 0.99 1 3
1 1 1 33 0.89 2 2
2 2 1 39 0.96 1 6
3 3 2 32 0.81 1 3
4 4 2 33 0.87 2 2
