I have 2 dataframes:
df1:
branch sessions users
toledo 10 3
new york 14 2
boston 102 43
seattle 9 7
df2:
branch guests
toledo 10
new york 14
boston 102
seattle 9
The result I'm looking for merges the "guests" column from df2 to df1 like this:
df1:
branch sessions users guests
toledo 10 3 10
new york 14 2 14
boston 102 43 102
seattle 9 7 9
I've tried concat, join and merge with no luck.
With merge:
guest_sessions = df2[['branch','guests']].copy()
pd.merge(df1, guest_sessions, left_index=True, right_index=True)
I get this:
branch_x sessions users guests_x branch_y guests_y
What am I doing wrong?
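Merging on the index aligns rows by position and keeps both branch columns as branch_x/branch_y; merge on the shared 'branch' key instead. A minimal sketch, rebuilding the frames from the sample above:
import pandas as pd

df1 = pd.DataFrame({'branch': ['toledo', 'new york', 'boston', 'seattle'],
                    'sessions': [10, 14, 102, 9],
                    'users': [3, 2, 43, 7]})
df2 = pd.DataFrame({'branch': ['toledo', 'new york', 'boston', 'seattle'],
                    'guests': [10, 14, 102, 9]})

# Join on the key column, not the index, so 'branch' stays a single column.
result = df1.merge(df2, on='branch')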
I have a dataframe which looks like this:
Name Age Job
0 Alex 20 Student
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
4 Rosa 20 senior manager
5 johanes 25 Dentist
6 lina 23 Student
7 yaser 25 Pilot
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
...
I want to select the rows before and after any row that has a NaN value in column Job, together with that row itself. For that I have the following code:
Rows = df[df.shift(1, fill_value="dummy").Job.isna()
          | df.Job.isna()
          | df.shift(-1, fill_value="dummy").Job.isna()]
print(Rows)
the result is this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
The only problem is row number 10: it should appear twice in the result, because it is both the row after a NaN row (number 9) and the row before a NaN row (number 11), i.e. it sits between two rows with NaN values. So in the end I want to have this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
So every row that sits between two rows with NaN values should appear twice in the result (i.e. be duplicated). Is there any way to do this? Any help will be appreciated.
Use concat with the rows before, the rows after, and the rows matching the condition:
m = df.Job.isna()
# shift(1) flags the row after each NaN, shift(-1) the row before it;
# concatenating the three selections keeps duplicates, so a row between
# two NaN rows appears twice after sort_index.
df = pd.concat([df[m.shift(fill_value=False)],
                df[m.shift(-1, fill_value=False)],
                df[m]]).sort_index()
print(df)
Name Age Job
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
I want to break down multi-level columns and have them as column values.
Original data input (Excel):
As read in dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame:
I have tried stack() and unstack() and also swaplevel(), but I couldn't get the dates column to 'drop' into the rows. It looks like the merged cells in Excel produce NaN in the dataframe, and when it is the columns that are merged, I get an unnamed column. How do I work around it? Am I missing something really simple here?
Using stack:
df.stack(level=0).reset_index(level=1)
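A fuller sketch, assuming the sheet is re-read so pandas builds the column MultiIndex itself (the file name 'sales.xlsx' and the exact header/index positions are assumptions):
import pandas as pd

# The two header rows (dates, then product names) become a column
# MultiIndex; the first two columns become the row index.
df = pd.read_excel('sales.xlsx', header=[0, 1], index_col=[0, 1])

# stack(level=0) moves the outer column level (the dates) into the rows,
# giving one row per (company name, company code, date).
tidy = df.stack(level=0).reset_index()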
I have two data frames. I have to compare the two data frames and get the positions of the unmatched data using Python.
Note:
The first column will not always be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect the output of (3,5)-(Indore - Chennai).
df1=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Indore'],'D':[6000.0,41500.0,4500.0]})
df2=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Chennai'],'D':[6000.0,41500.0,4500.0]})
df1['df'] = 'df1'   # tag each frame so we can see where a difference comes from
df2['df'] = 'df2'
# keep=False drops every row present in both frames, leaving only the differences
df = pd.concat([df1, df2], sort=False).drop_duplicates(subset=['A', 'B', 'C', 'D'], keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added a df column to show which dataframe each difference comes from.
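If the cell positions themselves are needed, a sketch built on the frames above (the helper df column is dropped first, and it assumes both frames hold the same keys so sorting on 'A' aligns the rows):
import numpy as np

a = df1.drop(columns='df').sort_values('A').reset_index(drop=True)
b = df2.drop(columns='df').sort_values('A').reset_index(drop=True)

# Element-wise inequality, then the (row, column) position of each mismatch.
rows, cols = np.where(a.ne(b).to_numpy())
for r, c in zip(rows, cols):
    print((r, c), '-', a.iat[r, c], '-', b.iat[r, c])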
I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe where matching quarters and region names are grouped together, with the mean of the Average column.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
For example:
new_df = df.groupby(['Resion_Name','Date']).mean()
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:
new_df = df3.groupby(['Region_Name', 'Date'])
# transform('mean') returns the group mean aligned to every original row
new1 = new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
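If the goal is instead a collapsed frame with one row per region and quarter, a sketch using mean() with as_index=False rather than transform:
# One row per (Region_Name, Date) pair, holding the group mean of Average.
new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()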
I have a dataframe like this:
user = pd.DataFrame({'User':['101','101','101','102','102','101','101','102','102','102'],'Country':['India','Japan','India','Brazil','Japan','UK','Austria','Japan','Singapore','UK']})
I want to apply a custom sort on Country, with Japan at the top for both users.
I have tried this, but it does not give my expected output:
user.sort_values(['User','Country'], ascending=[True, False], inplace=True)
My expected output:
expected_output = pd.DataFrame({'User':['101','101','101','101','101','102','102','102','102','102'],'Country':['Japan','India','India','UK','Austria','Japan','Japan','Brazil','Singapore','UK']})
I tried casting the column as a category, passing the categories with Japan at the top, but I don't want to pass the whole list of countries every time. Is there another approach? I just want to specify user 101 - Japan or user 102 - UK, and the remaining rows should follow in their existing order.
Thanks
Create a new helper key to sort by, using map:
(user.assign(New=user.Country.map({'Japan': 1}).fillna(0))
     .sort_values(['User', 'New'], ascending=[True, False])
     .drop(columns='New'))
Out[80]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
4 Japan 102
7 Japan 102
3 Brazil 102
8 Singapore 102
9 UK 102
Update based on comment:
mapdf = pd.DataFrame({'Country': ['Japan', 'UK'], 'User': ['101', '102'], 'New': [1, 1]})
(user.merge(mapdf, how='left')
     .fillna(0)
     .sort_values(['User', 'New'], ascending=[True, False])
     .drop(columns='New'))
Out[106]:
Country User
1 Japan 101
0 India 101
2 India 101
5 UK 101
6 Austria 101
9 UK 102
3 Brazil 102
4 Japan 102
7 Japan 102
8 Singapore 102
Use boolean indexing with append, then sort by column User:
user = (user[user['Country'] == 'Japan']
.append(user[user['Country'] != 'Japan'])
.sort_values('User'))
Alternative solution:
user = (user.query('Country == "Japan"')
.append(user.query('Country != "Japan"'))
.sort_values('User'))
print(user)
User Country
1 101 Japan
0 101 India
2 101 India
5 101 UK
6 101 Austria
4 102 Japan
7 102 Japan
3 102 Brazil
8 102 Singapore
9 102 UK
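Note that DataFrame.append was removed in pandas 2.0; a concat-based sketch of the same idea:
import pandas as pd

m = user['Country'] == 'Japan'
# kind='stable' keeps the Japan-first rows ahead of the rest within each user.
user = pd.concat([user[m], user[~m]]).sort_values('User', kind='stable')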