Pandas join on immediate predecessor - Python

I will describe my problem through an example:
import pandas as pd

x = pd.DataFrame.from_dict({'row': [5, 10, 12], 'val_x': [11, 222, 333]})
y = pd.DataFrame.from_dict({'row': [2, 4, 9, 13], 'val_y': [1, 12, 123, 4]})
In [4]: x
   row  val_x
0    5     11
1   10    222
2   12    333

In [5]: y
   row  val_y
0    2      1
1    4     12
2    9    123
3   13      4
I want each row in x to be joined with the row in y that immediately precedes it according to the row column (an equal value is also allowed).
In other words, the output should look like this:
   row  val_x  row_y  val_y
0    5     11      4     12
1   10    222      9    123
2   12    333      9    123
I know I need some sort of special merge on the row columns, but I have no idea exactly how to express it.

Try using pd.merge_asof. The trailing merge with y brings y's row column back in, since merge_asof keeps only the join key itself:
pd.merge_asof(x, y, on='row', direction='backward').merge(y, on='val_y')
Out[828]:
   row_x  val_x  val_y  row_y
0      5     11     12      4
1     10    222    123      9
2     12    333    123      9
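If you'd rather avoid the second merge, a sketch of an alternative is to copy y's key into an extra column before the asof merge so it survives the join (this also avoids any fan-out the val_y merge could cause if val_y contained duplicates):
# duplicate y's key so merge_asof carries it through (a sketch)
y2 = y.assign(row_y=y['row'])
pd.merge_asof(x, y2, on='row', direction='backward')
#    row  val_x  val_y  row_y
# 0    5     11     12      4
# 1   10    222    123      9
# 2   12    333    123      9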
EDIT:
from itertools import product
import pandas as pd

# Cross-join every row key of x with every row key of y, keep only the
# pairs where y's key is <= x's key, then take the closest one per x key.
DF = pd.DataFrame(list(product(x.row, y.row)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF >= 0, :]
DF = DF.sort_values(['l1', 'DIFF']).drop_duplicates(['l1'], keep='first')
x.merge(DF, left_on='row', right_on='l1', how='left') \
 .merge(y, left_on='l2', right_on='row')[['row_x', 'val_x', 'row_y', 'val_y']]
   row_x  val_x  row_y  val_y
0      5     11      4     12
1     10    222      9    123
2     12    333      9    123
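For intuition, the cross join above and merge_asof compute the same predecessor lookup; here is a rough sketch of it with np.searchsorted (assuming y.row is sorted and every value in x.row has a predecessor in y):
import numpy as np

# side='right' makes equal keys count as valid predecessors
pos = np.searchsorted(y['row'].to_numpy(), x['row'].to_numpy(), side='right') - 1
x.assign(row_y=y['row'].to_numpy()[pos], val_y=y['val_y'].to_numpy()[pos])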

Related

Adding dataframe columns from shorter lists

I have a dataframe with three columns. The first column specifies a group into which each row is classified. Each group normally consists of 3 data points (rows), but it is possible for the last group to be "cut off" and contain fewer than three data points. In the real world, this could happen when the experiment or data collection process is cut off prematurely. In the example below, group 3 is cut off and contains only one data point.
import pandas as pd
data = {
    "group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
    "valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
    "valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
# load the data into a DataFrame object
df = pd.DataFrame(data)
print(df)
I also have two lists with additional values.
x_list = [1, 3, 5]
y_list = [2, 4, 6]
I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this.
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
Notice that even though the length of a column is not divisible by the length of the shorter lists, the number of rows in the dataframe does not change.
How do I achieve this without losing dataframe rows or adding new rows with NaN values?
You can use GroupBy.cumcount to generate an indexer, then use it to repeat the values within each group:
new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()
Output:
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
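The key step is the indexer: cumcount numbers the rows from 0 within each group, so reindexing the small frame by those positions repeats x and y per group. For this data:
print(df.groupby('group_id').cumcount().tolist())
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]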
As both lists have the same length (three values), you can wrap the within-group counter with mod(3) and use it to reindex:
df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
                  .reindex(df.groupby('group_id').cumcount().mod(3)).values)
print(df)
# Output
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
Let's use np.resize:
import pandas as pd
import numpy as np
data = {
    "group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
    "valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
    "valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
# load the data into a DataFrame object
df = pd.DataFrame(data)

# np.resize tiles each list until it reaches the requested length
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df
Output:
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
An alternative in case the lists have different lengths:
def tile_list(lst, length):
    # repeat the list until it covers `length` rows, then truncate
    return (lst * (length // len(lst) + 1))[:length]

df['x'] = tile_list(x_list, len(df))
df['y'] = tile_list(y_list, len(df))
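For completeness, a stdlib-only sketch of the same tiling with itertools, which also handles lists of any length:
from itertools import cycle, islice

df['x'] = list(islice(cycle(x_list), len(df)))
df['y'] = list(islice(cycle(y_list), len(df)))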

Convert a column with values in a list into separate rows grouped by specific columns

I'm trying to convert a column with values in a list into separate rows, grouped by specific columns.
That's the dataframe I have:
 id  rooms  bathrooms  facilities
111      1          2   [2, 3, 4]
222      2          3   [4, 5, 6]
333      2          1   [2, 3, 4]
That's the dataframe I need:
 id  rooms  bathrooms  facility
111      1          2         2
111      1          2         3
111      1          2         4
222      2          3         4
222      2          3         5
222      2          3         6
333      2          1         2
333      2          1         3
333      2          1         4
I tried converting the facilities column into its own frame first:
facilities = pd.DataFrame(df.facilities.tolist())
and then joining it back on the other columns and melting, following another suggested solution:
df[['id', 'rooms', 'bathrooms']].join(facilities).melt(id_vars=['id', 'rooms', 'bathrooms']).drop('variable', axis=1)
Unfortunately, it didn't work for me.
Another solution?
Thanks in advance!
You need explode:
df.explode('facilities')
#     id  rooms  bathrooms facilities
# 0  111      1          2          2
# 0  111      1          2          3
# 0  111      1          2          4
# 1  222      2          3          4
# 1  222      2          3          5
# 1  222      2          3          6
# 2  333      2          1          2
# 2  333      2          1          3
# 2  333      2          1          4
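Note that explode keeps the original index (hence the repeated 0, 0, 0 above) and leaves the exploded column as object dtype; if that matters, a common follow-up is:
out = df.explode('facilities').reset_index(drop=True)
out['facilities'] = out['facilities'].astype(int)  # assuming integer facility ids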
It is a bit awkward to have lists as values in a dataframe, so one way to work around this is to unpack the lists, store each position in its own column, and then use the melt function.
# recreate your data
d = {"id": [111, 222, 333],
     "rooms": [1, 2, 2],
     "bathrooms": [2, 3, 1],
     "facilities": [[2, 3, 4], [4, 5, 6], [2, 3, 4]]}
df = pd.DataFrame(d)

# unpack the lists, one column per list position
f0, f1, f2 = [], [], []
for row in df.itertuples():
    f0.append(row.facilities[0])
    f1.append(row.facilities[1])
    f2.append(row.facilities[2])
df["f0"] = f0
df["f1"] = f1
df["f2"] = f2

# melt the dataframe
df = pd.melt(df, id_vars=['id', 'rooms', 'bathrooms'],
             value_vars=["f0", "f1", "f2"], value_name="facilities")

# optionally sort the values and remove the "variable" column
df.sort_values(by=['id'], inplace=True)
df = df[['id', 'rooms', 'bathrooms', 'facilities']]
I think that should get you the dataframe you need.
    id  rooms  bathrooms  facilities
0  111      1          2           2
3  111      1          2           3
6  111      1          2           4
1  222      2          3           4
4  222      2          3           5
7  222      2          3           6
2  333      2          1           2
5  333      2          1           3
8  333      2          1           4
The following will give the desired output:
def changeDf(x):
    df_m = pd.DataFrame(columns=['id', 'rooms', 'bathrooms', 'facilities'])
    for index, fc in enumerate(x['facilities']):
        df_m.loc[index] = [x['id'], x['rooms'], x['bathrooms'], fc]
    return df_m

df_modified = df.apply(changeDf, axis=1)
df_final = pd.concat([i for i in df_modified])
print(df_final)
"df" is input dataframe and "df_final" is desired dataframe
Try this:
import numpy as np

# each row is repeated once per element of its facilities list
reps = [len(x) for x in df.facilities]
facilities = pd.Series(np.array(df.facilities.tolist()).ravel())
df = df.loc[df.index.repeat(reps)].reset_index(drop=True)
df.facilities = facilities
df
    id  rooms  bathrooms  facilities
0  111      1          2           2
1  111      1          2           3
2  111      1          2           4
3  222      2          3           4
4  222      2          3           5
5  222      2          3           6
6  333      2          1           2
7  333      2          1           3
8  333      2          1           4
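One caveat: np.array(df.facilities.tolist()).ravel() only flattens cleanly because every list here has the same length; with ragged lists, a sketch using np.concatenate is safer:
facilities = pd.Series(np.concatenate(df.facilities.tolist()))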

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names of the form x.y, and I would like to sum up all columns that share the same x without having to name them explicitly. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
   x.1  x.2  y.8  y.92
0    1    5   19    10
1    2    4    2     9
2    3    3    1     2
3    4    2    3     4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
   x   y
0  6  29
1  6  11
2  6   3
3  6   7
Another option: you can extract the prefix from the column names and use it as the grouping key:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
#    x   y
# 0  6  29
# 1  6  11
# 2  6   3
# 3  6   7
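Note that groupby(..., axis=1) is deprecated in recent pandas (2.x); a sketch of an equivalent without it transposes, groups the rows, and transposes back:
df.T.groupby(df.columns.str.split('.').str[0]).sum().T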
You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print(df)

   x      y
   1  2   8  92
0  1  5  19  10
1  2  4   2   9
2  3  3   1   2
3  4  2   3   4
df = df.groupby(axis=1, level=0).sum()
print(df)

   x   y
0  6  29
1  6  11
2  6   3
3  6   7

Pandas - Using a list of values to create a smaller frame

I have a list of values that are found in a large pandas dataframe:
value_list = [1, 4, 5, 6, 54]
Example DataFrame df is below:
   column    x
0       1    3
1       4    6
2       5    8
3       6   19
4       8   21
5      12   97
6      54  102
I would like to create a subset of the data frame using only these values:
df_new = df[df['column'] is in value_list] # pseudo code
Is this possible?
You might be looking for the isin operation.
In [60]: df[df['column'].isin(value_list)]
Out[60]:
   column    x
0       1    3
1       4    6
2       5    8
3       6   19
6      54  102
Also, you can use query (note the @ prefix for referencing a local variable):
In [63]: df.query('column in @value_list')
Out[63]:
   column    x
0       1    3
1       4    6
2       5    8
3       6   19
6      54  102
You missed a for loop; a boolean mask built with a list comprehension works too:
df_new = df[[v in value_list for v in df['column']]]
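On a large frame, the vectorized options above will be much faster than a Python-level loop; numpy's isin builds the same boolean mask (a sketch):
import numpy as np

df_new = df[np.isin(df['column'].to_numpy(), value_list)]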

Pandas lookup from one of multiple columns, based on value

I have the following DataFrame:
Date best   a  b   c   d
1990    a   5  4   7   2
1991    c  10  1   2   0
1992    d   2  1   4  12
1993    a   5  8  11   6
I would like to make a dataframe as follows:
Date best  value
1990    a      5
1991    c      2
1992    d     12
1993    a      5
So I am looking to pick a value from one of several columns, based on another column's value, using the column names. For instance, the 1990 row should look up column "a" from the first df (=5), and the 1991 row should look up column "c" (=2).
Any ideas?
There is a built-in lookup function that can handle this type of situation (it looks up by row/column). I don't know how optimized it is, but it may be faster than the apply solution.
In [9]: df['value'] = df.lookup(df.index, df['best'])

In [10]: df
Out[10]:
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
You can create a lookup function and apply it to your dataframe row-wise; this isn't very efficient for large dfs, though:
In [245]:
def lookup(x):
    return x[x.best]

df['value'] = df.apply(lambda row: lookup(row), axis=1)
df
Out[245]:
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
You can do this using np.where as below; I think it will be more efficient:
import numpy as np
import pandas as pd

df = pd.DataFrame([['1990', 'a', 5, 4, 7, 2],
                   ['1991', 'c', 10, 1, 2, 0],
                   ['1992', 'd', 2, 1, 4, 12],
                   ['1993', 'a', 5, 8, 11, 6]],
                  columns=('Date', 'best', 'a', 'b', 'c', 'd'))

# start from the labels in 'best' and successively replace each label
# with the value from the matching column
arr = df.best.values
cols = df.columns[2:]
for col in cols:
    arr2 = df[col].values
    arr = np.where(arr == col, arr2, arr)
df.drop(columns=cols, inplace=True)
df["values"] = arr
df
Result
   Date best  values
0  1990    a       5
1  1991    c       2
2  1992    d      12
3  1993    a       5
lookup is deprecated since version 1.2.0. With melt you can 'unpivot' the columns to the row axis, where the column names are stored by default in the column variable and their values in value. query then returns only the rows where the columns best and variable are equal. drop and sort_values are used to match your requested format.
df_new = (
    df.melt(id_vars=['Date', 'best'], value_vars=['a', 'b', 'c', 'd'])
      .query('best == variable')
      .drop('variable', axis=1)
      .sort_values('Date')
)
Output:
    Date best  value
0   1990    a      5
9   1991    c      2
14  1992    d     12
3   1993    a      5
A simple solution that uses a mapper dictionary:
vals = df[['a', 'b', 'c', 'd']].to_dict('list')
mapper = {k: vals[v][k] for k, v in zip(df.index, df['best'])}
df['value'] = df.index.map(mapper).to_numpy()
Output:
   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
Look up the values by row index and column labels with factorize, because DataFrame.lookup is deprecated since version 1.2.0:
import numpy as np

idx, cols = pd.factorize(df['best'])
df['value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)

   Date best   a  b   c   d  value
0  1990    a   5  4   7   2      5
1  1991    c  10  1   2   0      2
2  1992    d   2  1   4  12     12
3  1993    a   5  8  11   6      5
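A close variant of the factorize idea uses Index.get_indexer to map each 'best' label to its column position (a sketch, assuming every label in best is an actual column of df):
import numpy as np

pos = df.columns.get_indexer(df['best'])
df['value'] = df.to_numpy()[np.arange(len(df)), pos]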
