Iterate the rows and join in Python pandas

I have a master dataset like this:
master = pd.DataFrame({'Channel': ['1', '1', '1', '1', '1'],
                       'Country': ['India', 'Singapore', 'Japan', 'United Kingdom', 'Austria'],
                       'Product': ['X', '6', '7', 'X', 'X']})
and a user table like this:
user = pd.DataFrame({'User': ['101', '101', '102', '102', '102', '103', '103', '103', '103', '103'],
                     'Country': ['India', 'Brazil', 'India', 'Brazil', 'Japan', 'All', 'Austria', 'Japan', 'Singapore', 'United Kingdom'],
                     'count': ['2', '1', '3', '2', '1', '1', '1', '1', '1', '1']})
I want to left join the master table with the user table for each user, like below:
merge_101 = pd.merge(master,user[(user.User=='101')],how='left',on=['Country'])
merge_102 = pd.merge(master,user[(user.User=='102')],how='left',on=['Country'])
merge_103 = pd.merge(master,user[(user.User=='103')],how='left',on=['Country'])
merge_all = pd.concat([merge_101, merge_102,merge_103], ignore_index=True)
How do I iterate over each user here? Right now I am filtering the dataset per user, creating a separate dataset, and appending them all together at the end.
Is there a better way to do this task, such as a for loop or some kind of join?
Thanks

IIUC, you need:
pd.concat([pd.merge(master, user[user.User == x], how='left', on=['Country'])
           for x in user['User'].unique()],
          ignore_index=True)
Output:
Channel Country Product User count
0 1 India X 101 2
1 1 Singapore 6 NaN NaN
2 1 Japan 7 NaN NaN
3 1 United Kingdom X NaN NaN
4 1 Austria X NaN NaN
5 1 India X 102 3
6 1 Singapore 6 NaN NaN
7 1 Japan 7 102 1
8 1 United Kingdom X NaN NaN
9 1 Austria X NaN NaN
10 1 India X NaN NaN
11 1 Singapore 6 103 1
12 1 Japan 7 103 1
13 1 United Kingdom X 103 1
14 1 Austria X 103 1
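If you want to skip the per-user loop entirely, you can also build the full master × user grid first and then attach the counts. A minimal sketch, assuming pandas >= 1.2 for how='cross':
users = pd.DataFrame({'User': user['User'].unique()})
expanded = master.merge(users, how='cross')  # every (master row, user) pair
merge_all = expanded.merge(user, how='left', on=['User', 'Country'])
The row order differs from the concat version (users interleave within each country), but a stable sort on 'User' restores it.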

Related

Pandas: how to add values to an existing dataframe by index

I have an example data frame, let's call it df. I want to add more numbers to df, but I don't want to start adding after the NaNs (which would mean starting at index 7); I want to start filling from index 3.
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 NaN NaN
4 1965 NaN NaN
5 1970 NaN NaN
6 1975 NaN NaN
Let's say we have a column like this:
number2
0 25
1 30
2 35
3 40
My target is to get a df like this:
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 25 NaN
4 1965 30 NaN
5 1970 35 NaN
6 1975 40 NaN
I hope I explained it well enough. Thank you for your support!
number2 = [25,30,35,40]
df.loc[df.number.isna(), 'number'] = number2
Result df:
year number letter
0 1945 10 a
1 1950 15 b
2 1955 20 c
3 1960 25 NaN
4 1965 30 NaN
5 1970 35 NaN
6 1975 40 NaN
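One caveat worth adding (not part of the original answer): a plain Python list is assigned purely by position, but a pandas Series aligns on its index labels first, so passing a Series here would silently misalign. A sketch of the safe variant:
number2 = pd.Series([25, 30, 35, 40])
# the Series index (0-3) does not match the masked row labels (3-6),
# so strip the index before assigning
df.loc[df.number.isna(), 'number'] = number2.to_numpy()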

Add the values of several columns when the number of columns exceeds 3 - Pandas

I have a pandas dataframe with several columns of dates, invoice IDs and bill amounts. I would like to add the amounts of all the later invoices into the 3rd one and replace its invoice ID with 1111.
Here is an example:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3  Bill4  Date 4      ID Bill 4  Bill5  Date 5      ID Bill 5
4            6      2000-10-04  1          45     2000-11-05  2          51     1999-12-05  3          23     2001-11-23  6          76     2011-08-19  12
6            8      2016-05-03  7          39     2017-08-09  8          38     2018-07-14  17         21     2009-05-04  9          NaN    NaN         NaN
12           14     2016-11-16  10         73     2017-05-04  15         NaN    NaN         NaN        NaN    NaN         NaN        NaN    NaN         NaN
And I would like to get this:
ID customer  Bill1  Date 1      ID Bill 1  Bill2  Date 2      ID Bill 2  Bill3  Date3       ID Bill 3
4            6      2000-10-04  1          45     2000-11-05  2          150    1999-12-05  1111
6            8      2016-05-03  7          39     2017-08-09  8          59     2018-07-14  1111
12           14     2016-11-16  10         73     2017-05-04  15         NaN    NaN         NaN
This example is a sample of my data; I may have many more than 5 bill columns.
Thanks for your help
With a little data manipulation, you should be able to do it as:
import numpy as np

df = df.replace('Nan', np.nan)
idx_col_bill3 = 7       # column position of 'Bill3'
step = 3                # each bill occupies 3 columns (amount, date, id)
idx_col_bill3_id = 10   # keep columns up to and including 'ID Bill 3'
cols = df.columns
# sum the amounts of Bill3, Bill4, Bill5, ... row by row
bills = df[cols[list(range(idx_col_bill3, len(cols), step))]].sum(axis=1)
bills.replace(0, np.nan, inplace=True)  # rows with no remaining bills stay NaN
df = df[cols[list(range(idx_col_bill3_id))]].copy()
df['Bill3'] = bills
df.loc[df['ID Bill 3'].notna(), 'ID Bill 3'] = 1111
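If the bill amounts were read in as strings (the 'Nan' placeholders suggest an object dtype), .sum(axis=1) will not add them as numbers. A small coercion sketch that would go just before the sum, reusing the names defined above:
bill_cols = cols[list(range(idx_col_bill3, len(cols), step))]
# parse the amounts as numbers; anything unparseable becomes NaN
df[bill_cols] = df[bill_cols].apply(pd.to_numeric, errors='coerce')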

Creating new column based on column values in row and column values in other rows in df?

I have df below as:
id | name | status | country | ref_id
3 Bob False Germany NaN
5 422 True USA 3
7 Nick False India NaN
6 Chris True Australia 7
8 324 True Africa 28
28 Tim False Canada 53
I want to add a new column where, for each row whose status is True, if that row's ref_id exists in the id column of another row whose status is False, I get the value of name from that matching row.
So expected output below would be:
id | name | status | country | ref_id | new
3 Bob False Germany NaN NaN
5 422 True USA 3 Bob
7 Nick False India NaN NaN
6 Chris True Australia 7 Nick
8 324 True Africa 28 Tim
28 Tim False Canada 53 NaN
I have code below that I am using for other purposes; it just filters for rows that have a status of True and a ref_id value that exists in the id column:
df.loc[df["status"] & df["ref_id"].astype(float).isin(df.loc[~df["status"], "id"])]
But I am trying to also compute a new column, as described above, holding the matched name if there is one.
Thanks!
Let us try:
df['new'] = df.loc[df.status, 'ref_id'].map(df.set_index('id')['name'])
df
id name status country ref_id new
0 3 Bob False Germany NaN NaN
1 5 422 True USA 3.0 Bob
2 7 Nick False India NaN NaN
3 6 Chris True Australia 7.0 Nick
4 8 324 True Africa 28.0 Tim
5 28 Tim False Canada 53.0 NaN
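Note that the map above builds its lookup from every row, not just the False-status ones; it makes no difference for this data, but a stricter sketch that enforces the False condition from the question would be:
lookup = df.loc[~df['status']].set_index('id')['name']  # only False-status rows
df['new'] = df.loc[df['status'], 'ref_id'].map(lookup)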
This is essentially a merge:
merged = (df.loc[df['status'], ['ref_id']]
          .merge(df.loc[~df['status'], ['id', 'name']], left_on='ref_id', right_on='id')
          )
df['new'] = (df['ref_id'].map(merged.set_index('ref_id')['name'])
             .where(df['status'])
             )

Replace multiple strings and numbers from multiple columns with NaN in Pandas

If I have the following dataframe, I would like to clean the data by replacing multiple strings and numbers with NaN: i.e. '68, Tardeo Road' and 0 in state, 567 in dept, and #ERROR! and 123 in phonenumber:
id state dept \
0 1 Abu Dhabi {Marketing}
1 2 MO {Other}
2 3 68, Tardeo Road {"Human Resources"}
3 4 National Capital Territory of Delhi {"Human Resources"}
4 5 Aargau Canton {Marketing}
5 6 Aargau Canton 567
6 18 NB {"Finance & Administration"}
7 19 0 {Sales}
8 20 Abu Dhabi {"Human Resources"}
9 21 Aargau {"Finance & Administration"}
phonenumber
0 123
1 5635888000
2 18006708450
3 #ERROR!
4 12032722596
5 18003928343
6 NaN
7 #ERROR!
8 NaN
9 NaN
I have tried the following code:
Solution 1:
mask = (df.state == '0') | (df.state == '68, Tardeo Road')
df.loc[mask, ['state']] = np.nan
Solution 2:
df.loc[(df.state == '68, Tardeo Road') | (df.state == 0), 'state'] = np.nan
Solution 3:
df.loc[df.state == '0', 'state'] = np.nan
df.loc[df.state == '68, Tardeo Road', 'state'] = np.nan
All of them work, but applying them to multiple columns gets a little long.
Just wondering if it's possible to make it more concise and efficient, by using str.replace for example? Thanks.
You can do a replace:
df = df.replace({'state': ['68, Tardeo Road', '0'],
                 'dept': ['567'],
                 'phonenumber': ['#ERROR!', '123']}, np.nan)
Output:
id state dept phonenumber
-- ---- ----------------------------------- ---------------------------- -------------
0 1 Abu Dhabi {Marketing} nan
1 2 MO {Other} 5635888000
2 3 nan {"Human Resources"} 18006708450
3 4 National Capital Territory of Delhi {"Human Resources"} nan
4 5 Aargau Canton {Marketing} 12032722596
5 6 Aargau Canton nan 18003928343
6 18 NB {"Finance & Administration"} nan
7 19 nan {Sales} nan
8 20 Abu Dhabi {"Human Resources"} nan
9 21 Aargau {"Finance & Administration"} nan
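If the tokens to scrub never appear as legitimate values anywhere, a flat list is even shorter (a sketch; note it replaces these values in every column, not per column):
df = df.replace(['68, Tardeo Road', '0', '567', '#ERROR!', '123'], np.nan)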

Python Pandas - difference between 'loc' and 'where'?

Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the results where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks if each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other='0'), to replace values that don't meet the condition with 0.
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column names, while iloc uses their index number. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
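Since where originally comes from numpy, the raw numpy call has the same shape, for reference (a sketch):
import numpy as np
np.where(df['Goals'] > 10, df['Goals'], 0)  # element-wise: Goals where True, 0 where False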
If you check the docs for DataFrame.where, you'll see it replaces values where the condition is False - with NaN by default, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is used for filtering rows - it drops the rows that don't match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by a condition and select columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (with NaN by default).
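One last aside, not raised in the answers above: DataFrame.mask is the mirror image of where - it replaces values where the condition is True instead of False:
df2 = df.mask(df['Goals'] > 10)  # NaN where Goals > 10, original values elsewhere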
