So I have data like this:
Id  Title                 Fname   lname  email
1   meeting with Jay, Aj  Jay     kay    jk#something.com
1   meeting with Jay, Aj  Aj      xyz    aj#something.com
2   call with Steve       Steve   Jack   st#something.com
2   call with Steve       Harvey  Ray    h#something.com
3   lunch                 Mike    Mil    m#something.com
I want to remove the first name and last name from Title for each unique Id.
I tried grouping by Id, which gives Series objects for Title, Fname, lname, etc.:
df.groupby('Id')
I concatenated Fname with .agg(lambda x: x.sum() if x.dtype == 'float64' else ','.join(x)) and kept the result in a concated dataframe. Likewise, all the other columns get aggregated. The question is how to replace values in Title based on this aggregated series. My attempt:
concated['newTitle'] = [
    concated.Title.str.replace(e[0], '').str.replace(e[1], '')
    for e in zip(concated.Fname.str.split(','), concated.lname.str.split(','))
]
I want something like this, or some other way by which, for each Id, I could get newTitle with the replaced values.
The output should be like:
Id  Title
1   meeting with ,
2   call with
3   lunch
Create a mapper series by joining Fname and lname, then replace with regex:
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(x.stack()))
df.set_index('Id')['Title'].replace(s, '', regex=True).drop_duplicates()
Id
1 meeting with ,
2 call with
3 lunch
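Why this works: joining with '|' builds a regex alternation per Id (for Id 1 it's 'Jay|kay|Aj|xyz'), and replace(..., regex=True) strips any of those names from the matching titles. The same thing spelled out row by row, as a sketch reusing s from above:
import re
# For each row, remove that Id's names from its Title
df['newTitle'] = [re.sub(s.loc[i], '', t) for i, t in zip(df['Id'], df['Title'])]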
I have a dataset with three columns:
Name Customer Value
Johnny Mike 1
Christopher Luke 0
Christopher Mike 0
Carl Marilyn 1
Carl Stephen 1
I need to create a new dataset with two columns: one holding the unique values from the Name and Customer columns, and the Value column. Values in Value were assigned per Name (multiple rows with the same Name share the same value: Carl has value 1, Christopher has value 0, and Johnny has value 1), so Customer entries should have empty values in the Value column of the new dataset.
My expected output is
All Value
Johnny 1
Christopher 0
Carl 1
Mike
Luke
Marilyn
Stephen
For the unique values in the All column, I take unique().tolist() from both Name and Customer:
name = file['Name'].unique().tolist()
customer = file['Customer'].unique().tolist()
all_with_dupl = name + customer
customers = list(dict.fromkeys(all_with_dupl))
df = pd.DataFrame(columns=['All', 'Value'])
df['All'] = customers
I do not know how to assign the values in the new dataset after creating the list of all names and customers with no duplicates.
Any help would be great.
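One way to finish the approach started above, as a sketch (assuming file is the loaded dataframe and customers is the de-duplicated list already built): build a Name -> Value dict and map it over the combined column; customers absent from the dict come out as NaN, which shows blank.
import pandas as pd

name_to_value = dict(zip(file['Name'], file['Value']))  # Name -> Value lookup

df = pd.DataFrame({'All': customers})
df['Value'] = df['All'].map(name_to_value)  # names get their value, customers get NaN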
Split the columns, use .drop_duplicates on each piece to remove duplicates, and then append them back together:
(df.drop(columns='Customer')
   .drop_duplicates()
   .rename(columns={'Name': 'All'})
   .append(
       df[['Customer']].rename(columns={'Customer': 'All'})
                       .drop_duplicates(),
       ignore_index=True
   ))
All Value
0 Johnny 1.0
1 Christopher 0.0
2 Carl 1.0
3 Mike NaN
4 Luke NaN
5 Marilyn NaN
6 Stephen NaN
Or to split the steps up:
names = df.drop(columns='Customer').drop_duplicates().rename(columns={'Name': 'All'})
customers = df[['Customer']].drop_duplicates().rename(columns={'Customer': 'All'})
names.append(customers, ignore_index=True)
Another way (note: this assumes the two name columns were read in as a single 'Name Customer' column):
d = dict(zip(df['Name Customer'].str.split(r'\s').str[0], df['Value']))  # create Name -> Value dict
df['Name Customer'] = df['Name Customer'].str.split(r'\s')
df = df.explode('Name Customer').drop_duplicates(keep='first').assign(Value='')  # explode dataframe and drop duplicates
df['Value'] = df['Name Customer'].map(d).fillna('')  # map values back
I have a problem where I need to update a value if people were at the same table.
import pandas as pd
data = {"p1":['Jen','Mark','Carrie'],
"p2":['John','Jason','Rob'],
"value":[10,20,40]}
df = pd.DataFrame(data,columns=["p1",'p2','value'])
meeting = {'person':['Jen','Mark','Carrie','John','Jason','Rob'],
'table':[1,2,3,1,2,3]}
meeting = pd.DataFrame(meeting,columns=['person','table'])
df is a relationship table, and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the df row accordingly.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about maybe doing a self join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
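For reference, that self join could look like the following sketch, built on the df and meeting frames above (person_x/person_y are pandas' default merge suffixes):
# Self-join meeting on table to enumerate all same-table pairs
pairs = meeting.merge(meeting, on='table')
pairs = pairs[pairs.person_x != pairs.person_y]

# Left-join df against the pairs; a non-null table means p1 and p2 shared a table
m = df.merge(pairs, left_on=['p1', 'p2'], right_on=['person_x', 'person_y'], how='left')
df['value'] += m['table'].notna() * 100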
IIUC you could set person as the index in the meeting dataframe and use its table values to replace the names in df. Then, where both mappings give the same value (table), replace with df.value + 100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
I am trying to use pandas to rename a column in CSV files. I want to use a dictionary, since columns with the same information can sometimes be named differently (e.g. mobile_phone or telephone instead of phone).
I want to rename only the first instance of a phone-like column. Here is an example to hopefully explain more.
Here is the original in this example:
0 name mobile_phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
Here is what I want it to do:
0 name phone telephone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
This is the code I tried:
phone_dict = {
    'phone_number': 'phone',
    'mobile_phone': 'phone',
    'telephone': 'phone',
    'phones': 'phone',
}

if 'phone' not in df.columns:
    df.rename(columns=phone_dict, inplace=True)
if 'phone' not in df.columns:
    raise ValueError("What are these people's numbers!? (Need 'phone' column)")
I made a dictionary of some possible column names that I want renamed to 'phone'. However, when I run this code, it changes the second column instead of the first one that matches a key in the dictionary. I want it to stop after the first matching column it comes across in the CSV.
This is what is happening:
0 name mobile_phone phone
1 Bob 12364234234 12364234234
2 Joe 23534235435 43564564563
3 Jill 34573474563 78098080807
If there is, for example, a third column that matches the dictionary, all of them turn into 'phone', which again is not what I want. I am trying to get it to change only the first column it matches.
Here is an example of what happens when I add a third column.
It goes from:
0 name mobile_phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
To this:
0 name phone phone phone
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
But I want it to be this:
0 name phone telephone phone_1
1 Bob 12364234234 12364234234 36346346311
2 Joe 23534235435 43564564563 34634634623
3 Jill 34573474563 78098080807 34634654622
Any advice or tips to stop it changing every dictionary match instead of just the first one?
Before, I had a bunch of elif statements, but I thought a dictionary would be cleaner and easier to read.
You shouldn't expect pd.DataFrame.rename to apply any particular sequential ordering with a dict input. Even if the logic worked, it would be an implementation detail as the docs don't describe the actual process.
Instead, you can use pd.DataFrame.filter to find the first valid column label:
df = df.rename(columns={df.filter(like='phone').columns[0]: 'phone'})
print(df)
0 name phone telephone
0 1 Bob 12364234234 12364234234
1 2 Joe 23534235435 43564564563
2 3 Jill 34573474563 78098080807
If it's possible a valid column may not exist, you can catch IndexError:
try:
    df = df.rename(columns={df.filter(like='phones').columns[0]: 'phone'})
except IndexError:
    print('No columns including "phones" exist.')
Here's one solution:
df:
Columns: [name, mobile_phone, telephone]
Index: []
Finding the first instance of phone (left to right) in the column index:
a = [('phone' in df.columns[i]) and (i == 0 or 'phone' not in df.columns[i - 1])
     for i in range(len(df.columns))]
Getting the column that needs to be renamed phone:
phonecol = df.columns[a][0]
Renaming the column:
df.rename(columns = {phonecol : 'phone'})
Output:
Columns: [name, phone, telephone]
Index: []
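A terser way to express the same first-match-wins logic, as a sketch using a plain generator expression:
# Grab the first column label containing 'phone', scanning left to right
phonecol = next((c for c in df.columns if 'phone' in c), None)
if phonecol is not None:
    df = df.rename(columns={phonecol: 'phone'})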
Suppose I have a dataframe like this:
fname lname email
Joe Aaron
Joe Aaron some#some.com
Bill Smith
Bill Smith
Bill Smith some2#some.com
Is there a terse and convenient way to drop rows where {fname, lname} is duplicated and email is blank?
You should first check whether your "empty" data is NaN or empty strings. If it is a mixture, you may need to modify the logic below.
If empty rows are NaN
Using pd.DataFrame.sort_values and pd.DataFrame.drop_duplicates:
df = df.sort_values('email')\
.drop_duplicates(['fname', 'lname'])
If empty rows are strings
If your empty rows are strings, you need to specify ascending=False when sorting:
df = df.sort_values('email', ascending=False)\
.drop_duplicates(['fname', 'lname'])
Result
print(df)
fname lname email
4 Bill Smith some2#some.com
1 Joe Aaron some#some.com
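If the column mixes NaN and empty strings, one option (a sketch) is to normalize blanks to NaN first and then reuse the NaN version above:
import numpy as np

df['email'] = df['email'].replace('', np.nan)  # unify blanks as NaN
df = df.sort_values('email').drop_duplicates(['fname', 'lname'])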
You can use first with groupby. (Notice the replace of empty strings with np.nan, since first returns the first non-null value in each column; this assumes numpy is imported as np.)
df.replace('',np.nan).groupby(['fname','lname']).first().reset_index()
Out[20]:
fname lname email
0 Bill Smith some2#some.com
1 Joe Aaron some#some.com
I have 2 Excel sheets that I have loaded. I need to add information from one to the other. See the example below.
table 1:
cust_id fname lname date_registered
1 bob holly 1/1/80
2 terri jones 2/3/90
table 2:
fname lname date_registered cust_id zip
lawrence fisher 2/3/12 3 12345
So I need to add cust_id 3 from table 2 to table 1, along with all the other information: fname, lname, and date_registered. I don't need all the columns, though, such as zip.
I am thinking I can use pandas merge, but I am new to all this and not sure how it works. I need to populate the next row in table 1 with the corresponding row information from table 2. Any information would be helpful. Thanks!
With concat:
In [1]: import pandas as pd
In [2]: table_1 = pd.DataFrame({'cust_id':[1,2], 'fname':['bob', 'teri'], 'lname':['holly', 'jones'], 'date_registered':['1/1/80', '2/3/90']})
In [3]: table_2 = pd.DataFrame({'cust_id':[3], 'fname':['lawrence'], 'lname':['fisher'], 'date_registered':['2/3/12'], 'zip':[12345]})
In [4]: final_table = pd.concat([table_1, table_2])
In [5]: final_table
Out[5]:
cust_id date_registered fname lname zip
0 1 1/1/80 bob holly NaN
1 2 2/3/90 teri jones NaN
0 3 2/3/12 lawrence fisher 12345.0
Use append:
appended = table1.append(table2[table1.columns])
or concat:
concated = pd.concat([table1,table2], join='inner')
Both result in:
cust_id fname lname date_registered
0 1 bob holly 1/1/80
1 2 terri jones 2/3/90
0 3 lawrence fisher 2/3/12
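Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the concat spelling is the durable one. Passing ignore_index=True also avoids the repeated 0 in the index:
concated = pd.concat([table1, table2], join='inner', ignore_index=True)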