I am trying to add a column to a pandas dataframe (df1) from another dataframe (df2). df1 has a unique identifier column ('id'), and df2 carries the same identifiers in its 'sameid' column. I have tried merge, but I need to add only one specific column ('addthiscolumn'), not all of the columns. What is the best way to do this?
print df1
'id' 'column1'
0 aaa randomdata1
1 aab randomdata2
2 aac randomdata3
3 aad randomdata4
print df2
'sameid' 'irrelevant' 'addthiscolumn'
0 aaa irre1 1234
1 aab irre2 2345
2 aac irre3 3456
3 aad irre4 4567
4 aae irre5 5678
5 aad irre6 6789
Desired Result
print df1
'id' 'column1' 'addthiscolumn'
0 aaa randomdata1 1234
1 aab randomdata2 2345
2 aac randomdata3 3456
3 aad randomdata4 4567
Because you just want to merge a single column, you can select only the columns you need from df2 first:
df1.merge(df2[['sameid', 'addthiscolumn']], left_on='id', right_on='sameid')
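Note that because 'aad' appears twice in df2, this merge will actually return two rows for 'aad' (one with 4567 and one with 6789). If you want exactly the desired result, where the first match wins, and you also want to avoid carrying over the redundant 'sameid' column, a map-based sketch like the following should work (df2_first is just an illustrative name; ids in df1 with no match in df2 would come back as NaN):
df2_first = df2.drop_duplicates('sameid')  # keep the first row for duplicated ids such as 'aad'
df1['addthiscolumn'] = df1['id'].map(df2_first.set_index('sameid')['addthiscolumn'])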
Related
I am trying to compare a column shared by two different dataframes and insert a new column into the second dataframe from the first one.
I have two dataframes, df1 and df2. I would like to match the ID column of df1 and df2 and insert filename into df2.
df1:
ID Date filename col2
1 20220207 data1.csv AAA
2 20220207 data2.csv BBB
3 20220207 data2.csv CCC
df2:
ID Date col1
1 20220207 123XER
2 20220207 234FGY
3 20220207 000GGG
Result
df2:
ID Date col1 filename
1 20220207 123XER data1.csv
2 20220207 234FGY data2.csv
3 20220207 000GGG data2.csv
I tried the code below:
df2['FileName']=np.where(df1['ID'].equals(df2['ID']), df1['filename'], '')
It throws the error below:
Length of values (1863) does not match length of index (1862)
Can anyone please help me with this logic?
np.where(df1['ID'] == df2['ID'], ...) only works when both frames have the same length and row order, which is exactly what your error message says they don't. A more robust approach, assuming ID is unique in df1, is to look each filename up by ID with map:
df2['filename'] = df2['ID'].map(df1.set_index('ID')['filename'])
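A minimal runnable sketch on your sample data (an ID in df2 with no match in df1 would simply come back as NaN):
import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Date': [20220207] * 3,
                    'filename': ['data1.csv', 'data2.csv', 'data2.csv'],
                    'col2': ['AAA', 'BBB', 'CCC']})
df2 = pd.DataFrame({'ID': [1, 2, 3],
                    'Date': [20220207] * 3,
                    'col1': ['123XER', '234FGY', '000GGG']})
# build an ID -> filename lookup from df1 and apply it to each ID in df2
df2['filename'] = df2['ID'].map(df1.set_index('ID')['filename'])
print(df2)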
I know how to return rows that contain specific text by specifying the column name, like below.
import pandas as pd
data = {'id': ['1', '2', '3', '4'],
        'City1': ['abc', 'def', 'abc', 'khj'],
        'City2': ['JRR', 'ytu', 'rr', 'abc']}
df = pd.DataFrame(data)
df.loc[df['City1'] == 'abc']
and the output is:
id City1 City2
0 1 abc JRR
2 3 abc rr
But what I need is different: my specific value 'abc' can be in any column, and I need to return the rows that contain that specific text (e.g. 'abc') without naming a column. Is there any way? I need the output below:
id City1 City2
0 1 abc JRR
1 3 abc rr
2 4 khj abc
You can use any with axis=1 to apply the check across all columns and get the expected result:
>>> df[(df == 'abc').any(axis=1)]
id City1 City2
0 1 abc JRR
2 3 abc rr
3 4 khj abc
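If you ever need to match several values at once, the same pattern works with isin(), which also returns a boolean frame:
>>> df[df.isin(['abc']).any(axis=1)]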
I have two dataframes containing similar columns:
ID prop1
1 UUU &&&
2 III ***
3 OOO )))
4 PPP %%%
and
ID prop2
1 UUU 1234
2 WWW 4567
3 III 7890
5 EEE 0123
6 OOO 3456
7 RRR 6789
8 PPP 9012
I need to merge these two dataframes where the IDs match, and add the prop2 column to the original.
ID prop1 prop2
1 UUU &&& 1234
2 III *** 7890
3 OOO ))) 3456
4 PPP %%% 9012
I've tried every combination of merge, join, concat, for, iter, etc. It will either fail to merge, lose the index, or straight-up drop the column values.
You can use pd.merge():
pd.merge(df1, df2, on='ID')
Output:
ID prop1 prop2
0 UUU &&& 1234
1 III *** 7890
2 OOO ))) 3456
3 PPP %%% 9012
You can also use df.merge() as follows:
df1.merge(df2, on='ID')
Same result.
Whether you use pd.merge() or df.merge(), the default for the how= parameter is 'inner', so you are already doing an inner join without specifying it.
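For example, if you wanted to keep every row of df1 even when its ID has no match in df2, you would pass how= explicitly:
df1.merge(df2, on='ID', how='left')  # unmatched rows get NaN in prop2
In your sample data every ID in df1 does have a match, so inner and left joins happen to return the same rows here.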
More complex scenario:
If you need to keep the original index of df1 (1, 2, 3, 4 instead of 0, 1, 2, 3), you can reset the index before the merge and then set the index back from the interim 'index' column that reset_index() produces:
df1.reset_index().merge(df2, on='ID').set_index('index')
Output:
ID prop1 prop2
index
1 UUU &&& 1234
2 III *** 7890
3 OOO ))) 3456
4 PPP %%% 9012
Now, the index 1 2 3 4 of the original df1 is kept.
Optionally, if you don't want the axis label index to appear on top of the row index, you can add a rename_axis() as follows:
df1.reset_index().merge(df2, on='ID').set_index('index').rename_axis(index=None)
Output:
ID prop1 prop2
1 UUU &&& 1234
2 III *** 7890
3 OOO ))) 3456
4 PPP %%% 9012
You can also use .map to add the prop2 values to your original dataframe, where the ID column values match.
df1['prop2'] = df1['ID'].map(dict(df2[['ID', 'prop2']].to_numpy()))
Should there be any IDs in your original dataframe that aren't also in the second one (and so don't have a prop2 value to bring across), you can fill those holes by adding .fillna() with the value of your choice.
df1['prop2'] = df1['ID'].map(dict(df2[['ID', 'prop2']].to_numpy())).fillna(your_fill_value_here)
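An equivalent way to build the same lookup, assuming ID is unique in the second dataframe, is to map against an indexed Series instead of a dict:
df1['prop2'] = df1['ID'].map(df2.set_index('ID')['prop2']).fillna(your_fill_value_here)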
I have a dataframe with 500 columns, two of which ('FieldTitle' and 'Value') hold rows that I want to 'flip' into columns. df looks like this:
id FieldTitle Value UID number XID
1 fname aaa 12 123 345
1 lname bbb 12 123 345
2 fname ccc 23 432 543
2 lname ddd 23 432 543
How do I make the dataframe look like this?:
id fname lname UID number XID
1 aaa bbb 12 123 345
2 ccc ddd 23 432 543
Currently, when I pivot, only the columns from 'FieldTitle' and 'Value' remain, while all the static columns get dropped.
pivoted_df = pd.pivot_table(df, index='id', columns='FieldTitle', values='Value', aggfunc='first').reset_index()
I have also tried the below, with no success:
pivoted_df = pd.pivot_table(df, index='id', columns='FieldTitle', values=['Value'], aggfunc='first').reset_index()
You can pass a list of column names to the index parameter:
pivoted_df = pd.pivot_table(df, index=['id','UID','number','XID'],
columns='FieldTitle',
values='Value',
aggfunc='first').reset_index()
print(pivoted_df)
FieldTitle id UID number XID fname lname
0 1 12 123 345 aaa bbb
1 2 23 432 543 ccc ddd
If you want to build the list of index columns dynamically:
cols = df.columns.difference(['FieldTitle','Value']).tolist()
pivoted_df = pd.pivot_table(df, index=cols,
columns='FieldTitle',
values='Value',
aggfunc='first').reset_index()
print(pivoted_df)
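As a side note, if every index/FieldTitle combination occurs at most once, the same reshape works without pivot_table, reusing the cols list from above (only a sketch: with duplicated combinations unstack raises an error, which is why aggfunc='first' is the safer general choice):
pivoted_df = df.set_index(cols + ['FieldTitle'])['Value'].unstack().reset_index()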
I have a dataframe as below:
id timestamp name
1 2018-01-23 15:49:53 "aaa"
1 2018-01-23 15:54:56 "bbb"
1 2018-01-23 15:49:57 "bbb"
1 2018-01-23 15:49:54 "ccc"
This is one example of an id group from my data; I have several such groups.
What I am trying to do is collapse each group into a single row, in chronological order according to the timestamp, e.g. like this:
id name
1 aaa->ccc->bbb->bbb
The values in name are in chronological order according to the timestamp. Any pointers regarding this?
I took the liberty of adding some data to your df:
print(df)
Output:
id timestamp name
0 1 2018-01-23T15:49:53 aaa
1 1 2018-01-23T15:54:56 bbb
2 1 2018-01-23T15:49:57 bbb
3 1 2018-01-23T15:49:54 ccc
4 2 2018-01-23T15:49:54 ccc
5 2 2018-01-23T15:49:57 aaa
Then you need:
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['id', 'timestamp'])
grp = df.groupby('id')['name'].aggregate(lambda x: '->'.join(x)).reset_index()
print(grp)
Output:
id name
0 1 aaa->ccc->bbb->bbb
1 2 ccc->aaa
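If you prefer, the sort, group, and join steps collapse into a single chain (same assumption that timestamp has already been converted with pd.to_datetime):
grp = (df.sort_values(['id', 'timestamp'])
         .groupby('id')['name']
         .agg('->'.join)
         .reset_index())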