My goal below is to create a single column containing all of the individual words from each string in the 'Name' column.
Although I am achieving this, I am losing the column header on df = df['Name'].str.split(' ', expand=True). I would like to preserve the header if possible so that I can refer to it later in the script.
I am also ending up with a multi-level index, which is fine, but if there is a way to avoid it, that would be great.
Any help is greatly appreciated. Thank you.
import pandas as pd
data = {'Name':['Tom Wilson', 'nick snyder', 'krish moham', 'jack oconnell']}
df = pd.DataFrame(data)
df = df['Name'].str.split(' ', expand=True)
df = df.stack(dropna=True)
print(df)
Try this:
data = {'Name': ['Tom Wilson', 'nick snyder', 'krish moham', 'jack oconnell']}
df = pd.DataFrame(data)
df = df['Name'].str.split(' ').explode().to_frame()
print(df)
Prints:
Name
0 Tom
0 Wilson
1 nick
1 snyder
2 krish
2 moham
3 jack
3 oconnell
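If you also want a plain 0..n index instead of the repeated row numbers (the second point in the question), a small variation on the same idea is to reset the index after exploding:
df = df['Name'].str.split(' ').explode().reset_index(drop=True).to_frame()
print(df)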
I have a list of names, and I want to retrieve the corresponding information for each name from several different dataframes to form a new dataframe.
I converted the list into a one-column dataframe and then tried to look up its corresponding values in the other dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David","Mike","Peter", "Lucy"],
'Hobby': ['Music','Sports','Cooking','Reading'],
'Member': ['Yes','Yes','Yes','No']}
data_s = {'Name': ["David","Lancy", "Mike","Lucy"],
'Speed': [56, 42, 35, 66],
'Location': ['East','East','West','West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns this error message:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter each dataframe down to the join column plus the columns you want to append, then merge:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use functools.reduce with the same column filtering:
from functools import reduce
dfList = [df,
          df_hobby[['Name','Hobby']],
          df_speed[['Name','Location']]]
df = reduce(lambda df1, df2: pd.merge(df1, df2, on='Name'), dfList)
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
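If each lookup table only contributes one column, Series.map on an indexed column is another option that behaves like a SQL-style lookup. A sketch using the same dataframes (it assumes Name is unique within each lookup table):
df['Hobby'] = df['Name'].map(df_hobby.set_index('Name')['Hobby'])
df['Location'] = df['Name'].map(df_speed.set_index('Name')['Location'])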
When I'm working in SQL, I find almost all the things I do with a column are related to the following four operations:
Add a column.
Remove a column.
Change a column type.
Rename a column.
What is the preferred way to do these four operations in pandas? For example, let's suppose I am starting with the following DataFrame:
import pandas as pd
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
How would I:
Change the df.id column from a string (or object, as it says) to an int64?
Rename the column product to product_type?
Add a new column called 'cost' with values [2.99, 3.99]?
Remove the column called brand?
Simple and complete:
import numpy as np
import pandas as pd
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
# Change the df.id column from a string (or object, as it says) to an int64
df['id'] = df['id'].astype(np.int64)
# Rename the column product to product_type
df = df.rename(columns={'product': 'product_type'})
# Add a new column called 'cost' with values [2.99, 3.99]
df['cost'] = pd.Series([2.99, 3.99])
# Remove the column called brand
df = df.drop(columns='brand')
These functions can also be chained together, though I would not recommend it, since a single chain is not as easy to debug step by step as the version above:
# do all the steps above with a single line
df = (df.astype({'id': np.int64})
        .rename(columns={'product': 'product_type'})
        .assign(cost=[2.99, 3.99])
        .drop(columns='brand'))
There is also another way, using inplace=True, which performs the assignment in place. I don't recommend it, as it is not as explicit as the first method:
# Using inplace=True where it is supported
df['id'] = df['id'].astype(np.int64)  # Series.astype has no inplace option, so assign back
df.rename(columns={'product': 'product_type'}, inplace=True)
# No change from previous
df['cost'] = pd.Series([2.99, 3.99])
# pop brand out
df.pop('brand')
print(df)
You can perform these steps like this (starting with your original data frame):
# add a column
df = pd.concat([df, pd.Series([2.99, 3.99], name='cost')], axis=1)
# change column name
df = df.rename(columns={'product': 'product_type'})
# remove brand
df = df.drop(columns='brand')
# change data type
df['id'] = df['id'].astype('int')
print(df)
product_type id cost
0 drink 19 2.99
1 cup 11 3.99
You could do:
df = pd.DataFrame([
{'product': 'drink', 'brand': 'spindrift', 'id': '19'},
{'product': 'cup', 'brand': None, 'id': '11'}
])
df = (df.assign(cost=[2.99, 3.99],
id=lambda d: d.id.astype(int))
.drop(columns=['brand'])
.rename({"product": 'product_type'}, axis=1))
This should work:
# change datatype
>>> df['id'] = df['id'].astype('int64')
>>> df.dtypes
brand object
id int64
product object
# rename column
>>> df.rename(columns={'product': 'product_type'}, inplace=True)
>>> df
brand id product_type
0 spindrift 19 drink
1 None 11 cup
# create new column
>>> df['Cost'] = pd.Series([2.99, 3.99])
>>> df
brand id product_type Cost
0 spindrift 19 drink 2.99
1 None 11 cup 3.99
# drop column
>>> df.drop(['brand'], axis=1, inplace=True)
>>> df
id product_type Cost
0 19 drink 2.99
1 11 cup 3.99
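One extra note on the type change: astype('int64') raises if the column ever contains a non-numeric string. A more defensive cast, my suggestion rather than part of the answers above, is pd.to_numeric:
# coerce bad values to NaN, then keep them with the nullable Int64 dtype
df['id'] = pd.to_numeric(df['id'], errors='coerce').astype('Int64')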
I already have an idea as to how I'm going to do this - I'm just curious about whether my method is the most efficient.
So for instance, let's say that for whatever reason, I have a table in which the first 4 columns are all repeated; they just hold info about the employee. The reason these rows repeat is that each employee handles multiple clients.
In some cases, I am missing info on the Age and Employee duration of an employee. Another colleague gave me this information in an excel sheet.
So now, I have info on Brian's and Dennis' age and employment duration, and I need to fill all rows with their employee IDs based on the information. My plan for doing that is this:
data = {"14": # Brian's Employee ID
{"Age":31,
:"Employment Duration":3},
"21": # Dennis' Employee ID
{"Age":45,
"Employment Duratiaon":12}
}
After making the above dictionary of dictionaries with the necessary values, my plan is to iterate over each row in the above dataframe, and fill in the 'Age' and 'Employment Duration' columns based on the value in 'Employee ID':
for index, row in df.iterrows():
    if row["Employee ID"] in data:
        # write through df.at so the change persists; assigning to row only mutates a copy
        df.at[index, "Age"] = data[row["Employee ID"]]["Age"]
        df.at[index, "Employment Duration"] = data[row["Employee ID"]]["Employment Duration"]
That's my plan for populating the missing values!
I'm curious about whether there's a simpler way that's just not presenting itself to me, because this was the first thing that sprang to mind!
Don't iterate over rows in pandas when you can avoid it. Instead, make the most of the library with vectorized operations like these:
Assume we have a dataframe:
data = pd.DataFrame({
'name' : ['john', 'john', 'mary', 'mary'],
'age' : ['', '', 25, 25]
})
Which looks like:
name age
0 john
1 john
2 mary 25
3 mary 25
We can apply a lambda function like so:
# with axis=1, x.name is the row's index label, so use x['name'] to reach the column
data['age'] = data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)
Or we can use pandas .loc:
data.loc[data['name'] == 'john', 'age'] = 27
Test them out and compare how long each takes to execute versus iterating over rows.
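A minimal way to compare them, a sketch using the standard timeit module on the data frame above (absolute numbers will vary by machine):
import timeit
def with_apply():
    data.apply(lambda x: 27 if x['name'] == 'john' else x['age'], axis=1)
def with_loc():
    data.loc[data['name'] == 'john', 'age'] = 27
print('apply:', timeit.timeit(with_apply, number=1000))
print('loc:', timeit.timeit(with_loc, number=1000))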
Ensure missing values are represented as null values (np.NaN). The second set of information should be stored in another DataFrame with the same column labels.
Then, by setting the index to 'Employee ID', update will align on the indices and fill in the missing values.
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'Employee ID': ["11", "11", "14", "21"],
'Name': ['Alan', 'Alan', 'Brian', 'Dennis'],
'Age': [14,14, np.NaN, np.NaN],
'Employment Duration': [3,3, np.NaN, np.NaN],
'Clients Handled': ['A', 'B', 'C', 'G']})
data = {"14": {"Age": 31, "Employment Duration": 3},
"21": {"Age": 45, "Employment Duration": 12}}
df2 = pd.DataFrame.from_dict(data, orient='index')
Code
# df = df.replace('', np.NaN)  # if missing values are empty strings rather than NaN in your dataset
df = df.set_index('Employee ID')
df.update(df2, overwrite=False)
print(df)
Name Age Employment Duration Clients Handled
Employee ID
11 Alan 14.0 3.0 A
11 Alan 14.0 3.0 B
14 Brian 31.0 3.0 C
21 Dennis 45.0 12.0 G
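An alternative that avoids changing the index, sketched with the same df and df2 as above (before the set_index step), is to map the lookup table onto the ID column and fill only the gaps:
df['Age'] = df['Age'].fillna(df['Employee ID'].map(df2['Age']))
df['Employment Duration'] = df['Employment Duration'].fillna(df['Employee ID'].map(df2['Employment Duration']))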
I have a dataframe that looks like the following but with multiple rows:
data= [['tom is good', 'tom is bad'], ['nick is good', 'nick is good'], ['juli is nice', 'juli is wise']]
df = pd.DataFrame(data, columns = ['Name1', 'Name2'])
I want to find the rows where the two sentences differ and copy only the differing words into another column, like the following:
df = df.assign(difference="")
with df.difference holding something like:
["good" -> "bad", NaN, "nice" -> "wise"]
I have tried
index= np.where(df["Name1"]!= df["Name2"])
list_m= index[0].tolist()
but do not know how to just take the different words and not the whole sentence and how to copy them in the format I specified in another column.
Thank you very much in advance
This is one approach using set.
Ex:
data= [['tom is good', 'tom is bad'], ['nick is good', 'nick is good'], ['juli is nice', 'juli is wise']]
df = pd.DataFrame(data, columns = ['Name1', 'Name2'])
df['difference'] = df['Name1'].str.split().map(set) - df['Name2'].str.split().map(set)
print(df)
Use symmetric_difference
Ex:
df['difference'] = df.apply(lambda x: set(x['Name1'].split()) ^ set(x['Name2'].split()), axis=1)
Edit as per comment
df['difference'] = df.apply(lambda x: ["{} --> {}".format(a, b) for a, b in zip(x['Name1'].split(), x['Name2'].split()) if a != b], axis=1)
Output:
Name1 Name2 difference
0 tom is good tom is bad [good --> bad]
1 nick is good nick is good []
2 juli is nice juli is wise [nice --> wise]
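If you would rather see NaN than an empty list for identical rows, matching the format sketched in the question, one follow-up step (my addition, not part of the original answer) is:
import numpy as np
df['difference'] = df['difference'].apply(lambda lst: lst if lst else np.nan)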
I have two dataframes, df1 and df2.
df1 = pd.DataFrame ({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame ({'Name': ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford'],
'gender': ['M', 'F', 'M', 'M']})
I need to join these two dataframes with pandas.merge on the column Name. However, as you can see, there are slight differences in the Name column between the two dataframes. Let's assume the names refer to the same people. If I simply do:
pd.merge(df1, df2, how='inner', on='Name')
I only get back a dataframe with the exact matches, 'John Weber' and 'Ian Ford'.
Does anyone know how to merge these two dataframes? I guess this is a pretty common situation when joining two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.
I am using fuzzywuzzy here:
from fuzzywuzzy import process
df2['key'] = df2.Name.apply(lambda x: process.extract(x, df1.Name, limit=1)[0][0])
df2.merge(df1, left_on='key', right_on='Name')
Out[1238]:
Name_x gender key Age Name_y
0 adam Smith M Adam Smith 43 Adam Smith
1 Annie Kim F Anne Kim 21 Anne Kim
2 John Weber M John Weber 55 John Weber
3 Ian Ford M Ian Ford 24 Ian Ford
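If you want to guard against weak matches, process.extractOne accepts a score cutoff and returns None below it. A sketch of that variant (the cutoff of 80 is an arbitrary assumption):
from fuzzywuzzy import process
def best_match(name, choices, cutoff=80):
    # closest name in choices, or None if nothing scores at or above the cutoff
    match = process.extractOne(name, choices, score_cutoff=cutoff)
    return match[0] if match else None
df2['key'] = df2.Name.apply(lambda x: best_match(x, df1.Name))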
Not sure if fuzzy matching is what you are looking for. Maybe just normalize every name to title case?
df1.Name = df1.Name.str.title()
df2.Name = df2.Name.str.title()
pd.merge(df1, df2, how='inner', on='Name')
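For what it's worth, after title-casing the inner merge should return three rows (Adam Smith, John Weber, Ian Ford); 'Annie Kim' versus 'Anne Kim' still will not match, which only the fuzzy approach above catches.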