How can I add data from other DataFrames without removing NaN values?
I have three DataFrames similar to these:
df_main = pd.DataFrame({'ID': ['10', '11', '12', '13', '14', '15', '16'], 'Name': [ np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
ID Name
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
df2 = pd.DataFrame({'ID': ['10', '11', '12'], 'Name': [ 'Peter', 'Bruce', 'Tony']})
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
df3 = pd.DataFrame({'ID': ['15', '16'], 'Name': ['Wanda', 'Natasha']})
ID Name
0 15 Wanda
1 16 Natasha
What I want to end up with is data like this:
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 NaN
4 14 NaN
5 15 Wanda
6 16 Natasha
I tried this code, but it did not work:
for id in df2['ID'].unique():
    if id in df_main['ID'].unique():
        df_main.loc[df_main['ID'] == id, 'Name'] = df2.loc[df2['ID'] == id, 'Name']
for id in df3['ID'].unique():
    if id in df_main['ID'].unique():
        df_main.loc[df_main['ID'] == id, 'Name'] = df3.loc[df3['ID'] == id, 'Name']
IIUC, you can use concat with GroupBy.first:
out = pd.concat([df2, df_main, df3]).groupby("ID", as_index=False).first()
Output :
print(out)
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 None
4 14 None
5 15 Wanda
6 16 Natasha
Concatenate df2 and df3, then map the values:
df_main['Name'] = df_main['ID'].map(pd.concat([df2, df3]).set_index('ID')['Name'])
Output:
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 NaN
4 14 NaN
5 15 Wanda
6 16 Natasha
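A self-contained sketch of the map approach above. The name lookup and the drop_duplicates step are additions for illustration and are not part of the original answer; they assume that an ID appearing in both df2 and df3 should keep its first match, since a lookup Series with repeated index labels can make map fail. IDs missing from the lookup stay NaN:

```python
import numpy as np
import pandas as pd

df_main = pd.DataFrame({"ID": ["10", "11", "12", "13", "14", "15", "16"],
                        "Name": [np.nan] * 7})
df2 = pd.DataFrame({"ID": ["10", "11", "12"], "Name": ["Peter", "Bruce", "Tony"]})
df3 = pd.DataFrame({"ID": ["15", "16"], "Name": ["Wanda", "Natasha"]})

# Build one ID -> Name lookup from all source frames.
# drop_duplicates is a precaution (assumption): keep the first name per ID.
lookup = (pd.concat([df2, df3])
            .drop_duplicates("ID")
            .set_index("ID")["Name"])

# IDs absent from the lookup (13 and 14) come back as NaN.
df_main["Name"] = df_main["ID"].map(lookup)
print(df_main)
```

Note that this overwrites the whole Name column, so it fits the case in the question where df_main starts out with no names at all.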
out = (df_main.set_index("ID")
       .combine_first(df2.set_index("ID"))
       .combine_first(df3.set_index("ID"))
       .reset_index())
print(out)
ID Name
0 10 Peter
1 11 Bruce
2 12 Tony
3 13 NaN
4 14 NaN
5 15 Wanda
6 16 Natasha
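As for why the loop in the question fails: assigning a Series through .loc aligns on index labels, not on position. The rows of df3 carry the labels 0 and 1, while the matching rows of df_main carry the labels 5 and 6, so the assignment finds nothing to align with and writes NaN. A minimal sketch of that behaviour follows; the trimmed frame, the explicit index and the cast of Name to object dtype are assumptions made to keep the example small and the string assignment unambiguous:

```python
import numpy as np
import pandas as pd

df_main = pd.DataFrame({"ID": ["14", "15", "16"], "Name": [np.nan] * 3})
df_main["Name"] = df_main["Name"].astype(object)  # assumption: column will hold strings
df_main.index = [4, 5, 6]                         # same row labels as in the question
df3 = pd.DataFrame({"ID": ["15", "16"], "Name": ["Wanda", "Natasha"]})

mask = df_main["ID"] == "15"
source = df3.loc[df3["ID"] == "15", "Name"]       # Series whose only index label is 0

# Aligned assignment: label 5 is not in source's index, so NaN is written.
df_main.loc[mask, "Name"] = source
aligned_result = df_main.loc[5, "Name"]

# Positional assignment: drop the index so values are matched by position.
df_main.loc[mask, "Name"] = source.to_numpy()
positional_result = df_main.loc[5, "Name"]

print(aligned_result, positional_result)
```

The df2 part of the loop happens to work only because the labels 0, 1 and 2 coincide in both frames.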
Given the following dataframe, I would like to clean the data by replacing several strings and numbers with NaN: i.e. 68, Tardeo Road and 0 in state, 567 in dept, and #ERROR! and 123 in phonenumber:
id state dept \
0 1 Abu Dhabi {Marketing}
1 2 MO {Other}
2 3 68, Tardeo Road {"Human Resources"}
3 4 National Capital Territory of Delhi {"Human Resources"}
4 5 Aargau Canton {Marketing}
5 6 Aargau Canton 567
6 18 NB {"Finance & Administration"}
7 19 0 {Sales}
8 20 Abu Dhabi {"Human Resources"}
9 21 Aargau {"Finance & Administration"}
phonenumber
0 123
1 5635888000
2 18006708450
3 #ERROR!
4 12032722596
5 18003928343
6 NaN
7 #ERROR!
8 NaN
9 NaN
I have tried the following code:
Solution 1:
mask = (df.state == '0') | (df.state == '68, Tardeo Road')
df.loc[mask, ['state']] = np.nan
Solution 2:
df.loc[(df.state == '68, Tardeo Road') | (df.state == 0), 'state'] = np.nan
Solution 3:
df.loc[df.state == '0', 'state'] = np.nan
df.loc[df.state == '68, Tardeo Road', 'state'] = np.nan
All of them work, but applying them to multiple columns gets a little long.
I'm just wondering whether it can be made more concise and efficient, for example by using str.replace. Thanks.
You can do a replace:
df = df.replace({'state': ['68, Tardeo Road', '0'],
                 'dept': ['567'],
                 'phonenumber': ['#ERROR!', '123']}, np.nan)
Output:
id state dept phonenumber
-- ---- ----------------------------------- ---------------------------- -------------
0 1 Abu Dhabi {Marketing} nan
1 2 MO {Other} 5635888000
2 3 nan {"Human Resources"} 18006708450
3 4 National Capital Territory of Delhi {"Human Resources"} nan
4 5 Aargau Canton {Marketing} 12032722596
5 6 Aargau Canton nan 18003928343
6 18 NB {"Finance & Administration"} nan
7 19 nan {Sales} nan
8 20 Abu Dhabi {"Human Resources"} nan
9 21 Aargau {"Finance & Administration"} nan
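If the column-keyed replace feels opaque, the same cleanup can be sketched with isin and mask, which makes the per-column matching explicit. The dictionary name bad_values and the trimmed three-row frame are illustrative assumptions. The values must match the stored type: the string '0' does not match an integer 0, so this sketch assumes all three columns hold strings.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 3, 19],
    "state": ["Abu Dhabi", "68, Tardeo Road", "0"],
    "dept": ["{Marketing}", "567", "{Sales}"],
    "phonenumber": ["123", "18006708450", "#ERROR!"],
})

# Hypothetical mapping: column name -> values to blank out in that column.
bad_values = {
    "state": ["68, Tardeo Road", "0"],
    "dept": ["567"],
    "phonenumber": ["#ERROR!", "123"],
}

for col, values in bad_values.items():
    # mask() replaces entries with NaN wherever the condition is True.
    df[col] = df[col].mask(df[col].isin(values))

print(df)
```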
I have a problem with groupby in pandas. I start with this data:
import pandas as pd
data = {'Code_Name':[1,2,3,4,1,2,3,4] ,'Name':['Tom', 'Nicko', 'Krish','Jack kr','Tom', 'Nick', 'Krishx', 'Jacks'],'Cat':['A', 'B','C','D','A', 'B','C','D'], 'T':[9, 7, 14, 12,4, 3, 12, 11]}
# Create DataFrame
df = pd.DataFrame(data)
df
I have this:
Code_Name Name Cat T
0 1 Tom A 9
1 2 Nick B 7
2 3 Krish C 14
3 4 Jack kr D 12
4 1 Tom A 4
5 2 Nick B 3
6 3 Krishx C 12
7 4 Jacks D 11
Now, with groupby:
df.groupby(['Code_Name','Name','Cat'],as_index=False)['T'].sum()
I got this:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 14
3 3 Krishx C 12
4 4 Jack kr D 12
5 4 Jacks D 11
But I need this result:
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krish C 26
3 4 Jack D 23
I don't care about Name; Code_Name is the only thing that matters to me, along with the sum of T.
Thanks.
There are two ways. To avoid losing columns, add an aggregation function for each one: first, last or ', '.join for string columns, and aggregation functions like sum or mean for numeric columns:
df = df.groupby('Code_Name',as_index=False).agg({'Name':'first', 'Cat':'first', 'T':'sum'})
print (df)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nicko B 10
2 3 Krish C 26
3 4 Jack kr D 23
Or, if some values are duplicated within each group (like the Cat values here), add those columns to the groupby; only the column order changes in the output:
df = df.groupby(['Code_Name','Cat'],as_index=False).agg({'Name':'first', 'T':'sum'})
print (df)
Code_Name Cat Name T
0 1 A Tom 13
1 2 B Nicko 10
2 3 C Krish 26
3 4 D Jack kr 23
If you don't care about the other variable then just group by the column of interest:
gb = df.groupby(['Code_Name'],as_index=False)['T'].sum()
print(gb)
Code_Name T
0 1 13
1 2 10
2 3 26
3 4 23
Now to get your output, you can take the last value of Name for each group:
gb = df.groupby(['Code_Name'],as_index=False).agg({'Name': 'last', 'Cat': 'first', 'T': 'sum'})
print(gb)
Code_Name Name Cat T
0 1 Tom A 13
1 2 Nick B 10
2 3 Krishx C 26
3 4 Jacks D 23
Perhaps you can try:
(df.groupby("Code_Name", as_index=False)
.agg({"Name":"first", "Cat":"first", "T":"sum"}))
See https://datascience.stackexchange.com/questions/53405/pandas-dataframe-groupby-and-then-sum-multi-columns-sperately for the original answer.
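The dictionary form of agg used in these answers can also be written with named aggregation, which spells out each output column explicitly. A small sketch on the question's data; the choice of first for Name and Cat is an assumption, since the question only cares about Code_Name and the sum of T:

```python
import pandas as pd

df = pd.DataFrame({
    "Code_Name": [1, 2, 3, 4, 1, 2, 3, 4],
    "Name": ["Tom", "Nicko", "Krish", "Jack kr", "Tom", "Nick", "Krishx", "Jacks"],
    "Cat": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "T": [9, 7, 14, 12, 4, 3, 12, 11],
})

# Group on the key only; every other column gets an explicit aggregation.
out = df.groupby("Code_Name", as_index=False).agg(
    Name=("Name", "first"),  # keep the first spelling seen per code
    Cat=("Cat", "first"),
    T=("T", "sum"),
)
print(out)
```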
I have a dataframe:
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
Amy Bob Carl Chris Ben Other Year
0 2 4 7 8 1 3 2013
1 9 2 4 5 5 6 2014
And a dictionary:
d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
I would like to reshape my dataframe to look like this:
Group Name Year Value
0 A Amy 2013 2
1 A Amy 2014 9
2 B Bob 2013 4
3 B Bob 2014 2
4 B Ben 2013 1
5 B Ben 2014 5
6 C Carl 2013 7
7 C Carl 2014 4
8 C Chris 2013 8
9 C Chris 2014 5
10 Other 2013 3
11 Other 2014 6
Note that Other doesn't have any values in the Name column and the order of the rows does not matter. I think I should be using the melt function but the examples that I've come across aren't too clear.
melt gets you part way there.
In [29]: m = pd.melt(df, id_vars=['Year'], var_name='Name')
This has everything except Group. To get that, we need to reshape d a bit as well.
In [30]: d2 = {}
In [31]: for k, v in d.items():
   ....:     for item in v:
   ....:         d2[item] = k
   ....:
In [32]: d2
Out[32]: {'Amy': 'A', 'Ben': 'B', 'Bob': 'B', 'Carl': 'C', 'Chris': 'C'}
In [34]: m['Group'] = m['Name'].map(d2)
In [35]: m
Out[35]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 Other 3 NaN
11 2014 Other 6 NaN
[12 rows x 4 columns]
Then move 'Other' from Name to Group:
In [8]: mask = m['Name'] == 'Other'
In [9]: m.loc[mask, 'Name'] = ''
In [10]: m.loc[mask, 'Group'] = 'Other'
In [11]: m
Out[11]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 3 Other
11 2014 6 Other
[12 rows x 4 columns]
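The steps above can be condensed into one script. This is a sketch under a stated assumption: any column that appears in no group of d (here only Other) is treated as its own group with an empty Name. The names name_to_group and tidy are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]],
    columns=["Amy", "Bob", "Carl", "Chris", "Ben", "Other", "Year"],
)
d = {"A": ["Amy"], "B": ["Bob", "Ben"], "C": ["Carl", "Chris"]}

# Invert the dictionary: group -> names becomes name -> group.
name_to_group = {name: group for group, names in d.items() for name in names}

# Wide to long: one row per (Year, Name) pair.
tidy = df.melt(id_vars="Year", var_name="Name", value_name="Value")
tidy["Group"] = tidy["Name"].map(name_to_group)

# Assumption: unmapped columns become their own group and lose their Name.
unmapped = tidy["Group"].isna()
tidy.loc[unmapped, "Group"] = tidy.loc[unmapped, "Name"]
tidy.loc[unmapped, "Name"] = ""

tidy = tidy[["Group", "Name", "Year", "Value"]]
print(tidy)
```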
Pandas melt function:
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
For example:
melted = pd.melt(df, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
We use melt to transform wide data into long data.
I have two dataframes that I want to join based on a relationship described in a dictionary of lists, where the keys in the dictionary refer to ids from dfA idA column, and the items in the list are ids from dfB idB column. The dataframes and dictionary look something like this:
dfA
colA colB idA
0 a abc 3
1 b def 4
2 b ghi 5
dfB
colX idB colZ
0 bob 7 a
1 bob 7 b
2 bob 7 c
3 jim 8 d
4 jake 9 a
5 jake 9 e
myDict = { '3': [ '7', '8' ], '4': [], '5': ['7', '9'] }
How can I use myDict to join the two dataframes to produce a dataframe like the following?
dfC
colA colB idA colX idB colZ
0 a abc 3 bob 7 a
1 b
2 c
3 jim 8 d
4 b def 4 None None None
5 b ghi 5 bob 7 a
6 b
7 c
8 jake 9 a
9 e
You can create a linking table (DataFrame) from your dictionary. Below is a full working example. It might need some row and column sorting at the end to produce exactly your output.
import pandas as pd
import numpy as np
dfA = pd.DataFrame({'colA': ('a', 'b', 'b'),
'colB': ('abc', 'def', 'ghi'),
'idA': ('3', '4', '5')})
dfB = pd.DataFrame({'colX': ('bob', 'bob', 'bob', 'jim', 'jake', 'jake'),
'idB': ('7', '7', '7', '8', '9', '9'),
'colZ': ('a', 'b', 'c', 'd', 'a', 'e')})
myDict = {'3': ['7', '8'], '4': [], '5': ['7', '9']}
dfC = pd.DataFrame(columns=['idA', 'idB'])
i = 0
for key, value in myDict.items():
    # the if statement is for empty list to create one record with NaNs
    if not value:
        dfC.loc[i, 'idA'] = key
        dfC.loc[i, 'idB'] = np.nan
        i += 1
    for val in value:
        dfC.loc[i, 'idA'] = key
        dfC.loc[i, 'idB'] = val
        i += 1
temp = dfA.merge(dfC, how='right')
result = temp.merge(dfB, how='outer')
print(result)
The output is:
colA colB idA idB colX colZ
0 a abc 3 7 bob a
1 a abc 3 7 bob b
2 a abc 3 7 bob c
3 b ghi 5 7 bob a
4 b ghi 5 7 bob b
5 b ghi 5 7 bob c
6 a abc 3 8 jim d
7 b def 4 NaN NaN NaN
8 b ghi 5 9 jake a
9 b ghi 5 9 jake e
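The linking table can also be built in one step instead of row by row. The following is a sketch of the same idea, not the answer's exact code: the names pairs and link are illustrative, and both joins are left joins on the assumption that only rows reachable from dfA are wanted.

```python
import numpy as np
import pandas as pd

dfA = pd.DataFrame({"colA": ["a", "b", "b"],
                    "colB": ["abc", "def", "ghi"],
                    "idA": ["3", "4", "5"]})
dfB = pd.DataFrame({"colX": ["bob", "bob", "bob", "jim", "jake", "jake"],
                    "idB": ["7", "7", "7", "8", "9", "9"],
                    "colZ": ["a", "b", "c", "d", "a", "e"]})
myDict = {"3": ["7", "8"], "4": [], "5": ["7", "9"]}

# One (idA, idB) pair per list element; an empty list yields a single NaN pair
# so that the idA row survives the joins.
pairs = [(key, val) for key, vals in myDict.items() for val in (vals or [np.nan])]
link = pd.DataFrame(pairs, columns=["idA", "idB"])

# dfA -> link -> dfB, keeping every row of dfA.
result = (dfA.merge(link, on="idA", how="left")
             .merge(dfB, on="idB", how="left"))
print(result)
```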
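This is not the greatest solution, but it is fairly simple and gets the job done:
# Note: idAaux is not defined in this snippet; presumably it is a column of dfA
# holding each row's list of idB values taken from myDict.
temp = pd.DataFrame(dfA.idAaux.tolist(), index = dfA.idA).stack()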
temp = temp.reset_index()[['idA', 0]]
temp.columns = ['idA', 'idB']
temp2 = dfA.merge(temp, left_on='idA', right_on='idA', how='left').drop('idAaux', axis=1)
temp2['idB'] = pd.to_numeric(temp2['idB'])
res= temp2.merge(dfB, left_on='idB', right_on='idB', how='left')
Output:
colA colB idA idB colX colZ
0 a abc 3 7.0 bob a
1 a abc 3 7.0 bob b
2 a abc 3 7.0 bob c
3 a abc 3 8.0 jim d
4 b def 4 NaN NaN NaN
5 b ghi 5 7.0 bob a
6 b ghi 5 7.0 bob b
7 b ghi 5 7.0 bob c
8 b ghi 5 9.0 jake a
9 b ghi 5 9.0 jake e
In Pandas merge function you can merge two data frames, but I need to merge N, similar to an SQL statement where you combine N tables in a full outer join. For example, I need to merge the three data frames below by 'type_1', 'subject_id_1', 'type_2', 'subject_id_2' and 'type_3', 'subject_id_3'. Is this possible?
import pandas as pd
raw_data = {
'type_1': [1, 1, 0, 0, 1],
'subject_id_1': ['1', '2', '3', '4', '5'],
'first_name_1': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']}
df_a = pd.DataFrame(raw_data, columns = ['type_1', 'subject_id_1', 'first_name_1'])
raw_datab = {
'type_2': [1, 1, 0, 0, 0],
'subject_id_2': ['4', '5', '6', '7', '8'],
'first_name_2': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty']}
df_b = pd.DataFrame(raw_datab, columns = ['type_2', 'subject_id_2', 'first_name_2'])
raw_datac = {
'type_3': [1, 1],
'subject_id_3': ['4', '5'],
'first_name_3': ['Joe', 'Paul']}
df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'subject_id_3', 'first_name_3'])
### need to include here the third data frame
merged = pd.merge(df_a, df_b, left_on=['type_1','subject_id_1'],
right_on = ['type_2','subject_id_2'], how='outer')
print(merged)
Note: The names of the fields to join are different in each data frame.
I believe you need to join on indices created by set_index, using concat:
dfs = [df_a.set_index(['type_1','subject_id_1']),
df_b.set_index(['type_2','subject_id_2']),
df_c.set_index(['type_3','subject_id_3'])]
df = pd.concat(dfs, axis=1)
print (df)
first_name_1 first_name_2 first_name_3
0 3 Allen NaN NaN
4 Alice NaN NaN
6 NaN Bran NaN
7 NaN Bryce NaN
8 NaN Betty NaN
1 1 Alex NaN NaN
2 Amy NaN NaN
4 NaN Billy Joe
5 Ayoung Brian Paul
df = pd.concat(dfs, axis=1).rename_axis(('type','subject_id')).reset_index()
print (df)
type subject_id first_name_1 first_name_2 first_name_3
0 0 3 Allen NaN NaN
1 0 4 Alice NaN NaN
2 0 6 NaN Bran NaN
3 0 7 NaN Bryce NaN
4 0 8 NaN Betty NaN
5 1 1 Alex NaN NaN
6 1 2 Amy NaN NaN
7 1 4 NaN Billy Joe
8 1 5 Ayoung Brian Paul
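A different way to chain N outer joins is to rename the key columns to shared names and fold pd.merge over the list with functools.reduce. This is a sketch that assumes the type_N / subject_id_N naming pattern from the question; the names frames, keys and renamed are illustrative.

```python
from functools import reduce

import pandas as pd

df_a = pd.DataFrame({"type_1": [1, 1, 0, 0, 1],
                     "subject_id_1": ["1", "2", "3", "4", "5"],
                     "first_name_1": ["Alex", "Amy", "Allen", "Alice", "Ayoung"]})
df_b = pd.DataFrame({"type_2": [1, 1, 0, 0, 0],
                     "subject_id_2": ["4", "5", "6", "7", "8"],
                     "first_name_2": ["Billy", "Brian", "Bran", "Bryce", "Betty"]})
df_c = pd.DataFrame({"type_3": [1, 1],
                     "subject_id_3": ["4", "5"],
                     "first_name_3": ["Joe", "Paul"]})

frames = [df_a, df_b, df_c]
keys = ["type", "subject_id"]

# Rename type_N / subject_id_N to shared key names (assumes this naming pattern).
renamed = [
    frame.rename(columns={f"type_{n}": "type", f"subject_id_{n}": "subject_id"})
    for n, frame in enumerate(frames, start=1)
]

# Fold a full outer join over the list, two frames at a time.
merged = reduce(lambda left, right: pd.merge(left, right, on=keys, how="outer"),
                renamed)
print(merged)
```

The concat version generally expects each key pair to appear at most once per frame, whereas merge also handles repeated keys, at the cost of one join per frame.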