Merging dataframes with multiple key columns - python

I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Note that the numbers in df2 were chosen this way for simplicity; in practice they can be arbitrary.
How do I perform such a merge? Thanks!

IIUC, you need a double merge/join.
First, melt df1 into a single column while keeping the index. Then merge with df2 to get the matches. Finally, join the result back to the original DataFrame.
s = (df1
     .reset_index().melt(id_vars='index')
     .merge(df2, left_on='value', right_on='ID')
     .set_index('index')['Value']
     )
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
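If you want a clean RangeIndex like the df_output shown in the question, reset it after the join:
df_output = df1.join(s).reset_index(drop=True)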
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
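For reference, the intermediate Series of this alternative looks like this: stack() drops the NaN in C, map() returns NaN for values absent from df2['ID'], and dropna() removes those before the join.
s
# 0    422.0
# 0     72.0
# 1    278.0
# 2    198.0
# dtype: float64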

Related

Generating a new variable based on the values of other variables

I have the following data set
import pandas as pd
df = pd.DataFrame({"ID": [1,1,1,1,1,2,2,2,2,2],
"TP1": [1,2,3,4,5,9,8,7,6,5],
"TP2": [11,22,32,43,53,94,85,76,66,58],
"TP10": [114,222,324,443,535,94,385,76,266,548],
"count": [1,2,3,4,10,1,2,3,4,10]})
print (df)
I want a "Final" variable in the df that will be based on the ID, TP and count variable.
The final result will look like following.
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [1,1,1,1,1,2,2,2,2,2], "TP1": [1,2,3,4,5,9,8,7,6,5],
"TP2": [11,22,32,43,53,94,85,76,66,58], "TP10": [114,222,324,443,535,94,385,76,266,548],
"count": [1,2,3,4,10,1,2,3,4,10],
"final" : [1,22,np.nan,np.nan,535,9,85,np.nan,np.nan,548]})
print (df)
So, for example, the loop/if should do the following:
It will look at the ID.
Then, for the 1st ID, it should look at the value of count; if the value of count is 1, it should look at the variable TP1, and its 1st value should be placed in the "final" variable.
The loop will then look at count 2 for ID 1, and the value of TP2 should go into the "final" variable, and so on.
I hope my question is clear. I am looking for a loop because there are 1000 TP variables in the original dataset.
I tried to make a code something like the following but it is utterly rubbish.
for col in df.columns:
    if col.startswith('TP') and count == int(col[2:]):
        df["Final"] = count
Thanks
If my understanding is correct, if count=1 then pick TP1, if count=2 then pick TP2 etc.
This can be done with numpy.select(). Note that I have added the condition if f"TP{x}" in df.columns because not all columns TP1, TP2, TP3, ... TP10 are available in the dataframe. If all are available in your actual dataframe then this if statement is not required.
import numpy as np
conds = [df["count"] == x for x in range(1,11) if f"TP{x}" in df.columns]
output = [df[f"TP{x}"] for x in range(1,11) if f"TP{x}" in df.columns]
df["final"] = np.select(conds, output, np.nan)
print(df)
Output:
ID TP1 TP2 TP10 count final
0 1 1 11 114 1 1.0
1 1 2 22 222 2 22.0
2 1 3 32 324 3 NaN
3 1 4 43 443 4 NaN
4 1 5 53 535 10 535.0
5 2 9 94 94 1 9.0
6 2 8 85 385 2 85.0
7 2 7 76 76 3 NaN
8 2 6 66 266 4 NaN
9 2 5 58 548 10 548.0
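With 1000 TP columns you can extend the range accordingly, but a per-row label lookup avoids enumerating the conditions altogether. A minimal sketch (slower than numpy.select, but driven directly by the column names):
import numpy as np

# Build each row's target label from its count and look it up on the row,
# falling back to NaN when that TP column does not exist.
df["final"] = df.apply(
    lambda row: row.get(f"TP{int(row['count'])}", np.nan), axis=1
)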

Pandas reshape extracting multiple values from colname

I have a wide dataframe I want to be able to reshape.
I have some columns that I want to preserve. I have been exploring melt and wide_to_long, but I'm not sure either is what I need.
Imagine I have some columns named: 'id', 'classroom', 'city'
And other columns called: 'alumn_x_subject_y_mark', 'alumn_x_subject_y_name', 'alumn_x_subject_y_teacher'
And x and y are the product of [range(20), range(10)].
I would like to end with a df that has columns: id, classroom, city, alumn, subject, mark, name, teacher
With all the original 20*10 columns converted to rows.
An empty dataframe with that structure can be generated this way:
import pandas as pd
import itertools
vals = list(itertools.product(*[range(20), range(10)]))
pd.DataFrame(columns=['id', 'classroom', 'city'] +
                     ['alumn_{0}_subject_{1}_mark'.format(x, y) for x, y in vals] +
                     ['alumn_{0}_subject_{1}_name'.format(x, y) for x, y in vals] +
                     ['alumn_{0}_subject_{1}_teacher'.format(x, y) for x, y in vals],
             dtype=object)
I'm not building this dataframe but receiving it from a file; that's why it has so many columns, and I cannot change that.
If you had only 2 parameters to extract, wide_to_long would work.
Here you have 3, so you can instead perform a manual reshape with a MultiIndex:
regex = r'alumn_(\d+)_subject_(\d+)_(.*)'

out = (df
       .set_index(['id', 'classroom', 'city'])
       .pipe(lambda d: d.set_axis(pd.MultiIndex
                                  .from_frame(d.columns.str.extract(regex),
                                              names=['alumn', 'subject', None]),
                                  axis=1))
       .stack(['alumn', 'subject'])
       .reset_index()
       )
output:
Empty DataFrame
Columns: [id, classroom, city, alumn, subject, mark, name, teacher]
Index: []
output with a single row (after df.loc[0] = range(df.shape[1])):
id classroom city alumn subject mark name teacher
0 0 1 2 0 0 3 203 403
1 0 1 2 0 1 4 204 404
2 0 1 2 0 2 5 205 405
3 0 1 2 0 3 6 206 406
4 0 1 2 0 4 7 207 407
.. .. ... ... ... ... ... ... ...
195 0 1 2 9 5 98 298 498
196 0 1 2 9 6 99 299 499
197 0 1 2 9 7 100 300 500
198 0 1 2 9 8 101 301 501
199 0 1 2 9 9 102 302 502
[200 rows x 8 columns]
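To see what the regex captures on its own, you can run the extraction against a single column name (purely illustrative):
import pandas as pd

cols = pd.Index(['alumn_3_subject_7_mark'])
print(cols.str.extract(r'alumn_(\d+)_subject_(\d+)_(.*)'))
#    0  1     2
# 0  3  7  mark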

Extend and fill a Pandas DataFrame to match another

I have two Pandas DataFrames A and B.
They have an identical index (weekly dates) up to a point: the series ends at the beginning of the year
for A and continues for a number of further observations in B. I need to extend data frame A to the same index as frame B, filling each column with its own last value.
Thank you in advance.
Tikhon
EDIT: thank you for the advice on the question. What I need is for dfA_before to look at dfB and become dfA_after:
print(dfA_before)
a b
0 10 100
1 20 200
2 30 300
print(dfB)
a b
0 11 111
1 22 222
2 33 333
3 44 444
4 55 555
print(dfA_after)
a b
0 10 100
1 20 200
2 30 300
3 30 300
4 30 300
This should work:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':[10,20,30],'b':[100,200,300]})
df2 = pd.DataFrame({'a':[11,22,33,44,55],'b':[111,222,333,444,555]})
# solution: repeat the last row of df1 until it matches the length of df2
# (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
last = df1.iloc[-1].to_numpy()
df3 = pd.DataFrame(np.tile(last, (len(df2) - len(df1), 1)),
                   columns=df1.columns)
df4 = pd.concat([df1, df3], ignore_index=True)
# method 2: grow df1 in place, one labelled row at a time
for _ in range(len(df2) - len(df1)):
    df1.loc[len(df1)] = df1.loc[len(df1) - 1]
# method 3: concatenate the last row repeatedly
for _ in range(df2.shape[0] - df1.shape[0]):
    df1 = pd.concat([df1, df1.iloc[[-1]]], ignore_index=True)
# result
    a    b
0  10  100
1  20  200
2  30  300
3  30  300
4  30  300
Probably very inefficient - I am a beginner:
dfA_New = dfB.copy()
dfA_New.loc[:] = 0
dfA_New.loc[:] = dfA.loc[:]
dfA_New.ffill(inplace=True)
dfA = dfA_New
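A more idiomatic route (assuming, as in the example, that dfA's index is a leading subset of dfB's): reindex A onto B's index and forward-fill in one step. Note that the intermediate NaNs upcast integer columns to float.
dfA_after = dfA.reindex(dfB.index).ffill()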

Merge two Dataframes with same columns with overwrite

I have a dataframe like this:
df = pd.DataFrame({"flag": ["1","0","1","0"],
                   "val": ["111","111","222","222"],
                   "qwe": ["","11","","12"]})
It gives:
  flag qwe  val
0    1      111
1    0  11  111
2    1      222
3    0  12  222
Then I filter the first dataframe like this:
dff = df.loc[df["flag"]=="1"]
was:
dff.loc["qwe"] = "123"
edited (setting all rows of column "qwe" to "123"):
dff["qwe"] = "123"
And now I need to merge/join df and dff in such a way as to get:
  flag  qwe  val
0    1  123  111
1    0   11  111
2    1  123  222
3    0   12  222
That is, taking 'qwe' from dff only where the value in df is empty.
Something like this:
pd.merge(df, dff, left_index=True, right_index=True, how="left")
gives
  flag_x qwe_x val_x flag_y qwe_y val_y
0      1         111      1         111
1      0    11   111    NaN   NaN   NaN
2      1         222      1         222
3      0    12   222    NaN   NaN   NaN
So after that I would need to drop flag_y and val_y, rename the _x columns, and manually combine qwe_x and qwe_y. Is there any way to make this easier?
pd.merge has an on argument that you can use to join columns with the same name in different dataframes.
Try:
pd.merge(df, dff, how="left", on=['flag', 'qwe', 'val'])
However, I don't think you need the merge at all. You can produce the same result using df.loc to conditionally assign a value (note that the empty cells here are empty strings, not NaN, so isnull() would not match them):
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
After the edited changes, this code works for me:
c1 = dff.combine_first(df)
It produces:
  flag  qwe  val
0    1  123  111
1    0   11  111
2    1  123  222
3    0   12  222
Which is exactly what I was looking for.
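An in-place alternative worth knowing (a sketch, not part of the original answers): DataFrame.update overwrites df with dff's non-NaN values, aligned on index and columns.
df.update(dff)   # mutates df in place; rows 0 and 2 get qwe = "123"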

Attributes/information contained in DataFrame column names

I have some data imported from a csv, to create something similar I used this:
data = pd.DataFrame([[1,0,2,3,4,5],[0,1,2,3,4,5],[1,1,2,3,4,5],[0,0,2,3,4,5]], columns=['split','sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split','sex']).mean()
so the dataframe looks something like this:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5
You'll notice that each column actually contains 2 variables (group# and height). (It was set up this way for running repeated measures ANOVA in SPSS.)
I want to split the columns up, so I can also groupby "group", like this (I actually screwed up the order of the numbers, but hopefully the idea is clear):
                 low  high
split sex group
0     0   0       95   265
          1      123    54
      1   0      120   220
          1       98   111
1     0   0      150   190
          1      211   300
      1   0      139    86
          1      132   250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
           group0Low  group0High  group1Low  group1High
split sex
0     0            2           3          4           5
      1            2           3          4           5
1     0            2           3          4           5
      1            2           3          4           5
In [13]: stacked = means.stack().reset_index(level=2)

In [14]: stacked.columns = ['group_level', 'mean']

In [15]: stacked.head(2)
Out[15]:
          group_level  mean
split sex
0     0     group0Low     2
      0    group0High     3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]

In [21]: stacked['level'] = stacked.group_level.str[6:]

In [22]: stacked.head(2)
Out[22]:
          group_level  mean   group level
split sex
0     0     group0Low     2  group0   Low
      0    group0High     3  group0  High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
              mean
group  level
group0 High     12
       Low       8
group1 High     20
       Low      16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
                        mean
split sex group  level
0     0   group0 High      3
                 Low       2
          group1 High      5
                 Low       4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
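For instance, a regex split (a sketch, not taken from the linked answer; the pattern is an assumption about the column names) avoids relying on fixed positions:
# Split 'group0Low' / 'group1High' style labels with a regex instead of
# fixed string positions.
parts = stacked.group_level.str.extract(r'(group\d+)(Low|High)')
stacked['group'] = parts[0]
stacked['level'] = parts[1]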
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
           group0Low  group0High  group1Low  group1High
split sex
0     0          222          97        167         242
      1          117         245        153          59
1     0          261          71        292          86
      1          137         120        266         138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val              high  low
split sex group
0     0   0        97  222
          1       242  167
      1   0       245  117
          1        59  153
1     0   0        71  261
          1        86  292
      1   0       120  137
          1       138  266
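With the reshaped frame you can now group by 'group' directly, which was the original goal; for example:
# Mean of 'high' and 'low' within each group, aggregated over split and sex.
df.stack(level='group').groupby(level='group').mean()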
