I have two dataframes. The first one has 1000 rows and looks like:
Date tri23_1 hsgç_T2 bbbj-1Y_jn Family Bonus
2011-06-09 qwer 1 rits Laavin 456
2011-07-09 ww 43 mayo Grendy 679
2011-09-10 wwer 44 ramya Fantol 431
2011-11-02 5 sam Gondow 569
The second dataframe contains all the unique values and the hotels that are associated with them:
Group Hotel
tri23_1 Jamel
hsgç_T2 Frank
bbbj-1Y_jn Luxy
mlkl_781 Grand Hotel
vchs_94 Vancouver
My goal is to replace the column names of the first dataframe with the corresponding values of the Hotel column of the second dataframe, so that the output looks like this:
Date Jamel Frank Luxy Family Bonus
2011-06-09 qwer 1 rits Laavin 456
2011-07-09 ww 43 mayo Grendy 679
2011-09-10 wwer 44 ramya Fantol 431
2011-11-02 5 sam Gondow 569
Can I achieve this using Python?
You could try this, using to_dict():
df1.columns = [df2.set_index('Group').to_dict()['Hotel'][i]
               if i in df2.set_index('Group').to_dict()['Hotel'].keys()
               else i
               for i in df1.columns]
print(df1)
Output:
df1
Date tri23_1 hsgç_T2 bbbj-1Y_jn Family Bonus
0 2011-06-09 qwer 1 rits Laavin 456.0
1 2011-07-09 ww 43 mayo Grendy 679.0
2 2011-09-10 wwer 44 ramya Fantol 431.0
3 2011-11-02 5 sam Gondow 569
df2
Group Hotel
0 tri23_1 Jamel
1 hsgç_T2 Frank
2 bbbj-1Y_jn Luxy
3 mlkl_781 Grand Hotel
4 vchs_94 Vancouver
df1 changed
Date Jamel Frank Luxy Family Bonus
0 2011-06-09 qwer 1 rits Laavin 456.0
1 2011-07-09 ww 43 mayo Grendy 679.0
2 2011-09-10 wwer 44 ramya Fantol 431.0
3 2011-11-02 5 sam Gondow 569
Update: Explanation
First, if df2['Group'] isn't the index of df2, we set it as index.
Then we convert the dataframe to a dict:
df2.set_index('Group').to_dict()
>>>{'Hotel': {'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}}
Then we select the value of the key 'Hotel':
df2.set_index('Group').to_dict()['Hotel']
>>>{'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}
Then, column by column, we look the name up in that dictionary; if the column doesn't exist among the keys, we just keep the same name, e.g. Date, Family, Bonus:
i='Date'
i in df2.set_index('Group').to_dict()['Hotel'].keys()  ---> False
return 'Date'
...
i='tri23_1'
i in df2.set_index('Group').to_dict()['Hotel'].keys()  ---> True
return df2.set_index('Group').to_dict()['Hotel']['tri23_1']
...
...
#And so on...
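A more concise equivalent is to build the Group-to-Hotel mapping once and pass it to rename; this is a sketch assuming the same df1 and df2 as above (rename leaves unmapped columns such as Date, Family and Bonus untouched):
# Build the mapping once instead of rebuilding it for every column
hotel_map = df2.set_index('Group')['Hotel'].to_dict()
# rename() only touches columns that appear in the mapping
df1 = df1.rename(columns=hotel_map)
print(df1)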
I have a big dataset about news reading that I'm trying to clean. I created a checklist of the cities I want to keep (the dataset contains all cities). How can I drop rows based on that checklist? For example, if my checklist (a Python list) contains all the French cities, how can I drop the rows for the other cities?
To picture the dataframe (I have 1.5M rows, by the way):
City Age
0 Paris 25-34
1 Lyon 45-54
2 Kiev 35-44
3 Berlin 25-34
4 New York 25-34
5 Paris 65+
6 Toulouse 35-44
7 Nice 55-64
8 Hannover 45-54
9 Lille 35-44
10 Edinburgh 65+
11 Moscow 25-34
You can do this using pandas.DataFrame.isin. It returns a boolean Series indicating whether each element is in the list x. You can then use that boolean mask to take the subset of rows that return True with df[df['City'].isin(x)]. Following is my solution:
import pandas as pd
x = ['Paris', 'Marseille']
df = pd.DataFrame(data={'City': ['Paris', 'London', 'New York', 'Marseille'],
                        'Age': [1, 2, 3, 4]})
print(df)
df = df[df['City'].isin(x)]
print(df)
Output:
>>> City Age
0 Paris 1
1 London 2
2 New York 3
3 Marseille 4
City Age
0 Paris 1
3 Marseille 4
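If you prefer to phrase it as dropping rows rather than keeping them, the same boolean mask can be negated with ~; a small sketch assuming the df and x from above (isin is vectorised, so this also works fine on a 1.5M-row frame):
# Drop every row whose City is not in the checklist (same result as df[df['City'].isin(x)])
df = df.drop(df[~df['City'].isin(x)].index)
print(df)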
If I have this dataframe:
import pandas as pd

# data
data = [['london_1', 10, 'london'], ['london_2', 15, 'london'],
        ['london_3', 14, 'london'], ['london', 49, '']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['station', 'info', 'parent_station'])
So:
station info parent_station
0 london_1 10 london
1 london_2 15 london
2 london_3 14 london
3 london 49
I would like to overwrite the info value of the child station according to the info value of the parent station:
station info parent_station
0 london_1 49 london
1 london_2 49 london
2 london_3 49 london
3 london 49
Is there a simple way to do that?
Additional information:
There could be more than one parent station, but only one parent station per station.
You can map, then conditionally assign:
df.loc[df.parent_station.ne(''),'info'] = df.parent_station.map(df.set_index('station')['info'])
df
Out[329]:
station info parent_station
0 london_1 49.0 london
1 london_2 49.0 london
2 london_3 49.0 london
3 london 49.0
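For readers who prefer the one-liner spelled out, here is a sketch of the same idea broken into steps, assuming the df built in the question:
# Series mapping each station name to its own info value
lookup = df.set_index('station')['info']
# Translate parent_station through the lookup; rows without a parent become NaN
parent_info = df['parent_station'].map(lookup)
# Overwrite info only where a parent actually exists (parent_station is not '')
has_parent = df['parent_station'].ne('')
df.loc[has_parent, 'info'] = parent_info[has_parent]
print(df)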
I have two dataframes DfMaster and DfError
DfMaster which looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster I would like to change the Building value of a record to DD if the record appears in the DfError dataframe. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this using np.where:
import numpy as np
errors = list(DfError['Id'].unique())
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects that you input an index or a Boolean series, not a value from a column.
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
Basically, it says:
For all rows where Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
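As a self-contained check, here is a sketch with miniature, hypothetical versions of DfMaster and DfError (only a few rows from the question) that confirms the loc/isin pattern:
import pandas as pd

# Hypothetical miniature frames shaped like the ones in the question
DfMaster = pd.DataFrame({'Id': [4653, 3467, 4567, 244],
                         'Name': ['Jane Smith', 'Steve Jones', 'John Evans', 'Stella Howard'],
                         'Building': ['A', 'B', 'A', 'D']})
DfError = pd.DataFrame({'Id': [4567, 244]})

# Rows whose Id also appears in DfError get Building set to 'DD'
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
print(DfMaster)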
My pandas dataframe df can produce the result below:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within each company code.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
    'Robert Baratheon',
    'Jon Snow',
    'Daenerys Targaryen',
    'Theon Greyjoy',
    'Tyrion Lannister'
]
df = pd.DataFrame({
    'season': np.random.randint(1, 7, size=100),
    'actor': np.random.choice(names, size=100),
    'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (clipped at 4)
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
Why would you want things to be complicated when simpler code is possible:
Z = (df.groupby('company_code')['sector']
       .value_counts()
       .groupby(level=0)
       .head(3)
       .sort_values(ascending=False)
       .to_frame('counts')
       .reset_index())
Z
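Applied to the grouped series from the question (assuming grouped is the MultiIndex Series shown above, with levels company_code and sector), a sketch of the same pattern would be:
# Sort all counts descending, then keep the first three rows of each company_code group
top3 = (grouped.sort_values(ascending=False)
               .groupby(level='company_code')
               .head(3))
print(top3)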
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that in the 2nd dataframe I have new columns with the count of ethnicities for each company, such as American - 2, Mexican - 5 and so on, so that later on I can calculate a diversity score.
The columns in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group with groupby and size, reshape with unstack, and finally join to the second DataFrame:
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace the unit suffix with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float).sort_values(ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
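Note that assigning the result back to df['a'] aligns on the index, so the sort_values inside that chain does not actually reorder the rows of df. If a frame sorted by the parsed values is wanted, a sketch using the same df would be to sort afterwards:
# Parse the M/B suffixes into a numeric column, then sort the whole frame by it
d = {'M': '0' * 6, 'B': '0' * 9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
df = df.sort_values('a', ascending=False)
print(df)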