Pandas merge or join in smaller dataframe - python

I have an issue whereby I have one long dataframe and one short dataframe, and I want to merge so that the shorter dataframe repeats itself to fill the length of the longer (left) df.
df1:
| Index | Wafer | Chip | Value |
|:-----:|:-----:|:----:|:-----:|
|   0   |   1   |  32  | 0.99  |
|   1   |   1   |  33  | 0.89  |
|   2   |   1   |  39  | 0.96  |
|   3   |   2   |  32  | 0.81  |
|   4   |   2   |  33  | 0.87  |
df2:
| Index | x | y |
|:-----:|:-:|:-:|
|   0   | 1 | 3 |
|   1   | 2 | 2 |
|   2   | 1 | 6 |
df_combined:
| Index | Wafer | Chip | Value | x | y |
|:-----:|:-----:|:----:|:-----:|:-:|:-:|
|   0   |   1   |  32  | 0.99  | 1 | 3 |
|   1   |   1   |  33  | 0.89  | 2 | 2 |
|   2   |   1   |  39  | 0.96  | 1 | 6 |
|   3   |   2   |  32  | 0.81  | 1 | 3 |
|   4   |   2   |  33  | 0.87  | 2 | 2 |

(rows 3 and 4 show df2 auto-repeating from its first row)
Is this a built-in join/merge type, or does it require a loop of some sort?
(This is just dummy data, but the real dfs are over 1000 rows...)
My current code is a simple outer merge, but it doesn't repeat the shorter frame to the end and just gives NaNs:
df = main.merge(df_coords, left_index=True, right_index=True, how='outer')
I've checked around:
Merge two python pandas data frames of different length but keep all rows in output data frame
pandas: duplicate rows from small dataframe to large based on cell value
and it feels like this could be an argument somewhere in a merge function... but I can't find it.
Any help gratefully received.
Thanks

You can repeat df2 until it's as long as df1, then reset_index and merge:
import math

# repeat enough times to cover df1, then trim to its exact length
new_len = math.ceil(len(df1) / len(df2))
repeated = (pd.concat([df2] * new_len)
            .reset_index(drop=True)
            .iloc[:len(df1)])
repeated
x y
0 1 3
1 2 2
2 1 6
3 1 3
4 2 2
df1.merge(repeated, how="outer", left_index=True, right_index=True)
Wafer Chip Value x y
0 1 32 0.99 1 3
1 1 33 0.89 2 2
2 1 39 0.96 1 6
3 2 32 0.81 1 3
4 2 33 0.87 2 2
A little hacky, but it should work.
Note: I'm assuming your Index column is not actually a column but is intended to be the dataframe index, since you refer to the left_index/right_index args in your merge() code. If Index is actually its own column, this code will still basically work; you'll just need to drop Index as well if you don't want it in the final df.
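For what it's worth, the repeat-to-length step can also be done without computing a repeat count, by indexing df2 positionally with a modulo pattern. A minimal sketch, assuming (as above) that Index is the frame index:

import numpy as np

# Take row i % len(df2) of df2 for every row position in df1,
# then give it df1's positional index so a plain join lines up.
repeated = df2.iloc[np.arange(len(df1)) % len(df2)].reset_index(drop=True)
combined = df1.reset_index(drop=True).join(repeated)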

You can achieve this with a left join on the value of df1["Index"] mod the length of df2["Index"]:
# Creating modular index values on df1 that cycle through df2's rows
n = df2.shape[0]
df1["Modular Index"] = df1["Index"] % n
# Merging dataframes on the modular index
df_combined = df1.merge(df2, how="left", left_on="Modular Index", right_on="Index")
# Dropping the helper columns
df_combined = df_combined.drop(["Modular Index", "Index_y"], axis=1)
print(df_combined)
  Index_x Wafer Chip Value x y
0 0 1 32 0.99 1 3
1 1 1 33 0.89 2 2
2 2 1 39 0.96 1 6
3 3 2 32 0.81 1 3
4 4 2 33 0.87 2 2

Related

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is to look at a subset of the features and, if there are duplicate rows, keep the first and note which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is, for a particular row, determine which rows are duplicates of it by saving those rows as id: label combinations. So while I'm able to extract the id and label for each duplicate, I have no way to map it back to the original row it duplicates.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (given the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
#mask of duplicate rows (all but the first occurrence in each group)
m = sub_df.duplicated()
#create (id, label) tuples, aggregate them to a dict per group of duplicates
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
     .groupby(cols)['a']
     .agg(lambda x: dict(list(x))))
#add new column
df = df.join(s.rename('duplicates'), on=cols)
#replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
Alternative with a custom function that assigns all duplicates except the first to a new column on the first row of each group; the last step uses a changed mask so every other row gets an empty string:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    # attach a dict of the remaining (id, label) pairs to the group's first row
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x
df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
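If neither of the above is easy to follow, a plain loop over the groups is another option; it touches each group once and sidesteps the groupby/apply subtleties. A rough sketch, assuming the same df and cols as above:

# Map the index of each group's first row to a dict of the remaining
# {id: label} pairs; groups without duplicates never enter the map.
dupe_map = {}
for _, g in df.groupby(cols, sort=False):
    if len(g) > 1:
        dupe_map[g.index[0]] = dict(zip(g['id'].iloc[1:], g['label'].iloc[1:]))

df['duplicates'] = [dupe_map.get(i, '') for i in df.index]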

Add values in two Spark DataFrames, row by row

I have two Spark DataFrames, with values that I would like to add, and then multiply, and keep the lowest pair of values only. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest values results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest-value result. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop", and then just write the results to another DataFrame and append the results. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new at Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Multiply columns:
Multiplied:
DFA.ID | DFA.Mul | DFB.ID | DFB.Mul
0 | 8 | 0 | 20
1 | 18 | 0 | 20
0 | 8 | 1 | 63
1 | 18 | 1 | 63
Step 3.
Group by DFA.ID and select min from DFA.Mul and DFB.Mul
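The steps above don't come with code, so here is a rough PySpark sketch of the same cartesian-join-then-aggregate pattern, applied to the question's math_func formula (add across frames, then multiply), with a window used so the matching DFB ID survives the "min" step. The column aliases and session setup are my own assumptions:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dfa = spark.createDataFrame([(0, 2, 4), (1, 3, 6)], ["ID", "ValOne", "ValTwo"])
dfb = spark.createDataFrame([(0, 4, 5), (1, 7, 9)], ["ID", "ValOne", "ValTwo"])

# Step 1: cartesian join (rename columns first so the two sides stay distinct)
a = dfa.select(F.col("ID").alias("A_ID"),
               F.col("ValOne").alias("A_ValOne"),
               F.col("ValTwo").alias("A_ValTwo"))
b = dfb.select(F.col("ID").alias("B_ID"),
               F.col("ValOne").alias("B_ValOne"),
               F.col("ValTwo").alias("B_ValTwo"))
crossed = a.crossJoin(b)

# Step 2: apply the question's formula to every pair of rows
scored = crossed.withColumn(
    "Value",
    (F.col("A_ValOne") + F.col("B_ValOne")) * (F.col("A_ValTwo") + F.col("B_ValTwo")),
)

# Step 3: keep only the lowest Value per A_ID, remembering which B row produced it
w = Window.partitionBy("A_ID").orderBy("Value")
result = (scored
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .select("A_ID", "Value", "B_ID"))
result.show()

On the toy data this yields 54 for A_ID 0 and 77 for A_ID 1, matching the final results in the question.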

Shuffling Several DataFrames Together

Is it possible to shuffle several DataFrames together?
For example I have a DataFrame df1 and a DataFrame df2. I want to shuffle the rows randomly, but for both DataFrames in the same way.
Example
df1:
|___|_______|
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
df2:
|___|_______|
| 1 | ... |
| 2 | ... |
| 3 | ... |
| 4 | ... |
After shuffling a possible order for both DataFrames could be:
|___|_______|
| 2 | ... |
| 3 | ... |
| 4 | ... |
| 1 | ... |
I think you can reindex both DataFrames using numpy.random.permutation applied to the index, but it is necessary that both DataFrames have the same length and the same unique index values:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a':range(5)})
print (df1)
a
0 0
1 1
2 2
3 3
4 4
df2 = pd.DataFrame({'a':range(5)})
print (df2)
a
0 0
1 1
2 2
3 3
4 4
idx = np.random.permutation(df1.index)
print (df1.reindex(idx))
a
2 2
4 4
1 1
3 3
0 0
print (df2.reindex(idx))
a
2 2
4 4
1 1
3 3
0 0
Alternative: reindex along axis 0 explicitly (the older reindex_axis method is deprecated in newer pandas versions):
print (df1.reindex(idx, axis=0))
print (df2.reindex(idx, axis=0))
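If the two frames don't share the same index labels (only the same length), a positional variant of the same idea is to draw one permutation of row positions and apply it to both frames with iloc. A small sketch:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': range(5)})
df2 = pd.DataFrame({'b': list('vwxyz')}, index=[10, 11, 12, 13, 14])

# One permutation of positions, applied to both frames
perm = np.random.permutation(len(df1))
df1_shuffled = df1.iloc[perm].reset_index(drop=True)
df2_shuffled = df2.iloc[perm].reset_index(drop=True)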

Select rows in one DataFrame based on rows in another

Let's assume I have a very large pandas DataFrame dfBig with columns Param1, Param2, ..., ParamN, score, step, and a smaller DataFrame dfSmall with columns Param1, Param2, ..., ParamN (i.e. missing the score and step columns).
I want to select all the rows of dfBig for which the values of columns Param1, Param2, ..., ParamN match those of some row in dfSmall. Is there a clean way of doing this in pandas?
Edit: To give an example, consider this DataFrame dfBig:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
A | 2 | 0.4 | 10
A | 2 | 0.5 | 20
A | 2 | 0.6 | 30
B | 1 | 0.1 | 10
B | 1 | 0.2 | 20
B | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
Let's imagine a model is specified by a pair (Arch, Layers). I want to query dfBig and get the time series for scores over time for the best performing models with Arch A and Arch B.
Following EdChum's answer below, I take it that the best solution is to do something like this procedurally:
modelColumns = [col for col in dfBig.columns if col not in ["Time", "Score"]]
groupedBest = dfBig.loc[dfBig.groupby("Arch").Score.idxmax()]
dfSmall = groupedBest[modelColumns]
dfBest = pd.merge(dfSmall, dfBig)
which yields:
Arch | Layers | Score | Time
A | 1 | 0.3 | 10
A | 1 | 0.6 | 20
A | 1 | 0.7 | 30
B | 2 | 0.7 | 10
B | 2 | 0.8 | 20
B | 2 | 0.8 | 30
If there's a better way to do this, I'm happy to hear it.
If I understand your question correctly, you should be able to just call merge on dfBig and pass dfSmall, which will look for matches in the aligned columns and only return those rows.
Example:
In [71]:
dfBig = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100), 'c':np.arange(100)})
dfSmall = pd.DataFrame({'a':[3,4,5,6]})
dfBig.merge(dfSmall)
Out[71]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
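A variant that avoids the merge (and so keeps dfBig's original index and row order) is to test membership of dfSmall's key combinations with a (Multi)Index. A sketch using the same toy frames:

# Build index objects from the shared key columns and filter dfBig
keys = list(dfSmall.columns)
mask = dfBig.set_index(keys).index.isin(dfSmall.set_index(keys).index)
filtered = dfBig[mask]

Whether this beats the merge depends on the data; the merge shown above is the simpler default.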

Interleaving Pandas Dataframes by Timestamp

I've got 2 Pandas DataFrame, each of them containing 2 columns. One of the columns is a timestamp column [t], the other one contains sensor readings [s].
I now want to create a single DataFrame, containing 4 columns, that is interleaved on the timestamp column.
Example:
First Dataframe:
+----+----+
| t1 | s1 |
+----+----+
| 0 | 1 |
| 2 | 3 |
| 3 | 3 |
| 5 | 2 |
+----+----+
Second DataFrame:
+----+----+
| t2 | s2 |
+----+----+
| 1 | 5 |
| 2 | 3 |
| 4 | 3 |
+----+----+
Target:
+----+----+----+----+
| t1 | t2 | s1 | s2 |
+----+----+----+----+
| 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 5 |
| 2 | 1 | 3 | 5 |
| 2 | 2 | 3 | 3 |
| 3 | 2 | 3 | 3 |
| 3 | 4 | 3 | 3 |
| 5 | 4 | 2 | 3 |
+----+----+----+----+
I had a look at pandas.merge, but that left me with a lot of NaNs and an unsorted table.
a.merge(b, how='outer')
Out[55]:
t1 s1 t2 s2
0 0 1 NaN NaN
1 2 3 2 3
2 3 3 NaN NaN
3 5 2 NaN NaN
4 1 NaN 1 5
5 4 NaN 4 3
Merging will put NaNs in common columns that you merge on, if those values are not present in both indexes. It will not create new data that is not present in the dataframes that are being merged.
For example, index 0 in your target dataframe shows t2 with a value of 0. This is not present in the second dataframe, so you cannot expect it to appear in the merged dataframe either. Same applies for other rows as well.
What you can do instead is reindex the dataframes to a common index. In your case, since the maximum index is 5 in the target dataframe, let's use this list to reindex both input dataframes:
In [382]: ind
Out[382]: [0, 1, 2, 3, 4, 5]
Now, we will reindex both inputs according to this index:
In [372]: x = a.set_index('t1').reindex(ind).fillna(0).reset_index()
In [373]: x
Out[373]:
t1 s1
0 0 1
1 1 0
2 2 3
3 3 3
4 4 0
5 5 2
In [374]: y = b.set_index('t2').reindex(ind).fillna(0).reset_index()
In [375]: y
Out[375]:
t2 s2
0 0 0
1 1 5
2 2 3
3 3 0
4 4 5
5 5 0
And, now we merge it to get something close to the target dataframe:
In [376]: x.merge(y, left_on=['t1'], right_on=['t2'], how='outer')
Out[376]:
t1 s1 t2 s2
0 0 1 0 0
1 1 0 1 5
2 2 3 2 3
3 3 3 3 0
4 4 0 4 5
5 5 2 5 0
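If what the target table is really after is carry-forward semantics (each new reading keeps the other sensor's last value), a sketch with pd.merge_ordered plus a forward fill gets the same values, though it keeps a single time column rather than separate t1/t2 columns:

import pandas as pd

a = pd.DataFrame({"t1": [0, 2, 3, 5], "s1": [1, 3, 3, 2]})
b = pd.DataFrame({"t2": [1, 2, 4], "s2": [5, 3, 3]})

# Put both sensors on one ordered time axis, then forward-fill each reading
merged = pd.merge_ordered(a.rename(columns={"t1": "t"}),
                          b.rename(columns={"t2": "t"}),
                          on="t", how="outer")
merged[["s1", "s2"]] = merged[["s1", "s2"]].ffill().fillna(0)
print(merged)

The dtypes come back as floats because of the intermediate NaNs; cast with astype(int) if that matters.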
