I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName should be stacked together with df2.directorName, and actorID with directorID. How can I do this?
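A sketch of one way to do this (using truncated copies of the frames as an assumption): rename one frame's columns to match the other, concatenate, then relabel with the combined names:

```python
import pandas as pd

# Truncated reconstruction of the two frames from the question.
df1 = pd.DataFrame({'actorID': ['annie_potts', 'bill_farmer'],
                    'actorName': ['Annie Potts', 'Bill Farmer']})
df2 = pd.DataFrame({'directorID': ['john_lasseter', 'joe_johnston'],
                    'directorName': ['John Lasseter', 'Joe Johnston']})

# Rename df2's columns to line up with df1, then concatenate
# and relabel the result with the combined column names.
combined = pd.concat(
    [df1, df2.rename(columns={'directorID': 'actorID',
                              'directorName': 'actorName'})],
    ignore_index=True)
combined.columns = ['actorID-directorID', 'actorName-directorName']
```

A dict passed to `rename(columns=...)` makes the pairing explicit, which is the "easy way to specify" the question asks for.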
Related
I'm trying to extract the first two words from the strings in a dataframe column:
df["Name"]
Name
Anthony Frank Hawk
John Rodney Mullen
Robert Dean Silva Burnquis
Geoffrey Joseph Rowley
To get the index of the second " " (space) I tried this, but find returns NaN instead of the character position of the second space.
df["temp"] = df["Name"].str.find(" ")+1
df["temp"] = df["Name"].str.find(" ", start=df["temp"], end=None)
df["temp"]
0 NaN
1 NaN
2 NaN
3 NaN
The last step is to slice those names. I tried this code, but it doesn't work either.
df["Name"] = df["Name"].str.slice(0, df["temp"])
df["Name"]
0 NaN
1 NaN
2 NaN
3 NaN
expected return
0 Anthony Frank
1 John Rodney
2 Robert Dean
3 Geoffrey Joseph
If you have a more efficient way to do this, please let me know!
str.rpartition splits on the last space by default, so column 0 of the result holds everything before the last word:
df['temp'] = df.Name.str.rpartition().get(0)
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean Silva
3 Geoffrey Joseph Rowley Geoffrey Joseph
EDIT
If only the first two words are required in the output:
df['temp'] = df.Name.str.split().str[:2].str.join(' ')
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join(x[:2]))
df
OR
df['temp'] = df.Name.str.split().apply(lambda x:' '.join([x[0], x[1]]))
df
Output
Name temp
0 Anthony Frank Hawk Anthony Frank
1 John Rodney Mullen John Rodney
2 Robert Dean Silva Burnquis Robert Dean
3 Geoffrey Joseph Rowley Geoffrey Joseph
You can use str.index(substring) instead of str.find; it returns the smallest index at which the substring (such as " ") is found. To locate the second space, call it again with the search start set just past the first match, then slice the string at that index.
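In plain Python, the str.index idea from the comment above looks like this (a sketch on a single name, not a vectorized pandas solution):

```python
# Find the second space by searching again just past the first one,
# then slice the string at that position.
name = "Anthony Frank Hawk"
first_space = name.index(" ")
second_space = name.index(" ", first_space + 1)
first_two = name[:second_space]  # "Anthony Frank"
```

For a whole column, the split-based answers above are simpler; this just illustrates the indexing logic.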
I'm finding this problem easy to write out, but difficult to solve with my pandas DataFrame.
When searching for anything like 'unique values' and 'list' I only get answers for getting the unique values of a list.
There is a brute-force solution with a double for loop, but there must be a pandas solution faster than O(n²).
I have a DataFrame with two columns: Name and Likes Food.
As output, I want a list of unique Likes Food values for each unique Name.
Example Dataframe df
Index Name Likes Food
0 Tim Pizza
1 Marie Pizza
2 Tim Pasta
3 Tim Pizza
4 John Pizza
5 Amy Pizza
6 Amy Sweet Potatoes
7 Marie Sushi
8 Tim Sushi
I know how to aggregate and groupby the unique count of Likes Food:
df = (df.groupby(by='Name', as_index=False)
        .agg({'Likes Food': pandas.Series.nunique})
        .sort_values(by='Likes Food', ascending=False)
        .reset_index(drop=True))
>>>
Index Name Likes Food
0 Tim 3
1 Marie 2
2 Amy 2
3 John 1
But given that, what ARE the foods for each Name in that DataFrame? For readability, expressing them as a list makes good sense. List sorting doesn't matter (and is easy to fix anyway).
Example Output
<code here>
>>>
Index Name Likes Food Food List
0 Tim 3 [Pizza, Pasta, Sushi]
1 Marie 2 [Pizza, Sushi]
2 Amy 2 [Pizza, Sweet Potatoes]
3 John 1 [Pizza]
To obtain the output without the counts, just try unique:
df.groupby("Name")["Likes Food"].unique()
Name
Amy [Pizza, Sweet Potatoes]
John [Pizza]
Marie [Pizza, Sushi]
Tim [Pizza, Pasta, Sushi]
Name: Likes Food, dtype: object
Additionally, you can use named aggregation to get the count and the list in one pass:
df.groupby("Name").agg(**{"Likes Food": pd.NamedAgg(column='Likes Food', aggfunc="nunique"),
                          "Food List": pd.NamedAgg(column='Likes Food', aggfunc="unique")}).reset_index()
Name Likes Food Food List
0 Amy 2 [Pizza, Sweet Potatoes]
1 John 1 [Pizza]
2 Marie 2 [Pizza, Sushi]
3 Tim 3 [Pizza, Pasta, Sushi]
To get both columns, also sorted, try this:
df = (df.groupby("Name")["Likes Food"]
        .agg(counts='nunique', food_list='unique')
        .reset_index()
        .sort_values(by='counts', ascending=False))
df
Name counts food_list
3 Tim 3 [Pizza, Pasta, Sushi]
0 Amy 2 [Pizza, Sweet Potatoes]
2 Marie 2 [Pizza, Sushi]
1 John 1 [Pizza]
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have a simple data frame with first and last names. I would like to get the equivalent of a SQL SELF JOIN in pandas.
Here goes the full example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'first_name': ['Rose','Summer','Jane','Kim','Jack'],
'last_name': ['Howard','Solstice','Kim','Cruz','Rose'],
'customer_id': [1,2,3,4,5]})
df
first_name last_name customer_id
0 Rose Howard 1
1 Summer Solstice 2
2 Jane Kim 3
3 Kim Cruz 4
4 Jack Rose 5
REQUIRED OUTPUT
customer_id first_name last_name customer_id_1 first_name_1 last_name_1
1 Rose Howard 5 Jack Rose
4 Kim Cruz 3 Jane Kim
Using SQL
select a.first_name, a.last_name, b.first_name, b.last_name
from df as a, df as b
where a.first_name = b.last_name
My attempt
(pd.concat( [ df[['first_name','last_name']],
df[['first_name','last_name']].add_suffix('_1')
], axis=1, ignore_index=False)
)
first_name last_name first_name_1 last_name_1
0 Rose Howard Rose Howard
1 Summer Solstice Summer Solstice
2 Jane Kim Jane Kim
But,
(pd.concat( [ df,df.add_suffix('_1')], axis=1)
.query(" first_name == last_name_1 ")
)
This gives empty output, to my surprise!
I want two rows and four columns, as given by the SQL.
Use left_on and right_on
df.merge(df, left_on='first_name', right_on='last_name')
Result:
first_name_x last_name_x customer_id_x first_name_y last_name_y \
0 Rose Howard 1 Jack Rose
1 Kim Cruz 4 Jane Kim
customer_id_y
0 5
1 3
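To also reproduce the _1 column names from the required output, merge's suffixes parameter can be passed; a sketch (the suffix choice itself is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['Rose', 'Summer', 'Jane', 'Kim', 'Jack'],
                   'last_name': ['Howard', 'Solstice', 'Kim', 'Cruz', 'Rose'],
                   'customer_id': [1, 2, 3, 4, 5]})

# Self-join on first_name == last_name; every column name overlaps,
# so the left side keeps its names and the right side gets '_1'.
out = df.merge(df, left_on='first_name', right_on='last_name',
               suffixes=('', '_1'))
```

This yields the two matching rows (Rose/Jack Rose and Kim/Jane Kim) with columns first_name, last_name, customer_id, first_name_1, last_name_1, customer_id_1.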
My pandas DataFrame df produces the result below:
grouped = df[(df['X'] == 'venture') & (df['company_code'].isin(['TDS','XYZ','UVW']))].groupby(['company_code','sector'])['X_sector'].count()
The output of this is as follows:
company_code sector
TDS Meta 404
Electrical 333
Mechanical 533
Agri 453
XYZ Sports 331
Electrical 354
Movies 375
Manufacturing 355
UVW Sports 505
Robotics 345
Movies 56
Health 3263
Manufacturing 456
Others 524
Name: X_sector, dtype: int64
What I want to get is the top three sectors within the company codes.
What is the way to do it?
You will have to chain a groupby here. Consider this example:
import pandas as pd
import numpy as np
np.random.seed(111)
names = [
'Robert Baratheon',
'Jon Snow',
'Daenerys Targaryen',
'Theon Greyjoy',
'Tyrion Lannister'
]
df = pd.DataFrame({
'season': np.random.randint(1, 7, size=100),
'actor': np.random.choice(names, size=100),
'appearance': 1
})
s = df.groupby(['season','actor'])['appearance'].count()
print(s.sort_values(ascending=False).groupby('season').head(1)) # <-- head(3) for 3 values
Returns:
season actor
4 Daenerys Targaryen 7
6 Robert Baratheon 6
3 Robert Baratheon 6
5 Jon Snow 5
2 Theon Greyjoy 5
1 Jon Snow 4
Where s is (truncated at season 4):
season actor
1 Daenerys Targaryen 2
Jon Snow 4
Robert Baratheon 2
Theon Greyjoy 3
Tyrion Lannister 4
2 Daenerys Targaryen 4
Jon Snow 3
Robert Baratheon 1
Theon Greyjoy 5
Tyrion Lannister 3
3 Daenerys Targaryen 2
Jon Snow 1
Robert Baratheon 6
Theon Greyjoy 3
Tyrion Lannister 3
4 ...
A simpler chain also works:
Z = df.groupby('company_code')['sector'].value_counts().groupby(level=0).head(3).sort_values(ascending=False).to_frame('counts').reset_index()
Z
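Applied to the question's own data (reconstructed here as a hypothetical Series with the same MultiIndex), the sort-then-head idiom gives the top three sectors per company:

```python
import pandas as pd

# Hypothetical reconstruction of the question's grouped counts.
s = pd.Series(
    [404, 333, 533, 453, 331, 354, 375, 355],
    index=pd.MultiIndex.from_tuples(
        [('TDS', 'Meta'), ('TDS', 'Electrical'), ('TDS', 'Mechanical'),
         ('TDS', 'Agri'), ('XYZ', 'Sports'), ('XYZ', 'Electrical'),
         ('XYZ', 'Movies'), ('XYZ', 'Manufacturing')],
        names=['company_code', 'sector']),
    name='X_sector')

# Sort descending, then keep the first three rows of each company.
top3 = s.sort_values(ascending=False).groupby('company_code').head(3)
```

Grouping by the index level name 'company_code' keeps the original MultiIndex intact, so the result slots straight back into the question's layout.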
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gains new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The columns of the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group with groupby + size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need to replace the unit suffix ('M', 'B') with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
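If the column may also contain lowercase suffixes (the earlier frame has 146m) or bare numbers, a small parsing helper may be sturdier than the regex replace; this is a hypothetical sketch, and the 'K' entry is an assumption:

```python
# Hypothetical helper, not from the answer above: handles lowercase
# suffixes and values with no suffix at all.
def parse_abbrev(value):
    multipliers = {'K': 1e3, 'M': 1e6, 'B': 1e9}
    suffix = value[-1].upper()
    if suffix in multipliers:
        return float(value[:-1]) * multipliers[suffix]
    return float(value)

# Usage on the frame above:
# df['a'] = df['sale'].map(parse_abbrev)
```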