Joining strings of a column over several indexes while keeping other columns - python

Here is an example data set:
>>> df1 = pandas.DataFrame({
"Name": ["Alice", "Marie", "Smith", "Mallory", "Bob", "Doe"],
"City": ["Seattle", None, None, "Portland", None, None],
"Age": [24, None, None, 26, None, None],
"Group": [1, 1, 1, 2, 2, 2]})
>>> df1
Age City Group Name
0 24.0 Seattle 1 Alice
1 NaN None 1 Marie
2 NaN None 1 Smith
3 26.0 Portland 2 Mallory
4 NaN None 2 Bob
5 NaN None 2 Doe
I would like to merge the Name column for all rows of the same group while keeping the City and the Age, to get something like:
>>> df1_summarised
Age City Group Name
0 24.0 Seattle 1 Alice Marie Smith
1 26.0 Portland 2 Mallory Bob Doe
From the structure of my starting data, I know that those 2 columns (Age, City) will be NaN/None after the first row of a given group.
I have tried the following:
>>> print(df1.groupby('Group')['Name'].apply(' '.join))
Group
1 Alice Marie Smith
2 Mallory Bob Doe
Name: Name, dtype: object
But I would like to keep the Age and City columns...

try this:
In [29]: df1.groupby('Group').ffill().groupby(['Group','Age','City']).Name.apply(' '.join)
Out[29]:
Group Age City
1 24.0 Seattle Alice Marie Smith
2 26.0 Portland Mallory Bob Doe
Name: Name, dtype: object
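In recent pandas versions, groupby('Group').ffill() does not return the grouping column itself, so the chained call above may raise a KeyError on 'Group'. A minimal sketch of the same idea that should not depend on that behaviour is to forward-fill only Age and City within each group. The names filled and summary are illustrative, not from the original answer:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Marie", "Smith", "Mallory", "Bob", "Doe"],
    "City": ["Seattle", None, None, "Portland", None, None],
    "Age": [24, None, None, 26, None, None],
    "Group": [1, 1, 1, 2, 2, 2]})

# Forward-fill only Age and City within each group; Group stays untouched.
filled = df1.copy()
filled[["Age", "City"]] = filled.groupby("Group")[["Age", "City"]].ffill()

# Group on all three keys and join the names of each group.
summary = (filled.groupby(["Group", "Age", "City"])["Name"]
                 .apply(" ".join)
                 .reset_index())
print(summary)
```

The final reset_index() turns the three grouping keys back into ordinary columns.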

using dropna and assign with groupby
df1.dropna(subset=['Age', 'City']) \
.assign(Name=df1.groupby('Group').Name.apply(' '.join).values)
Update: use groupby and agg.
I thought of this, and it feels far more satisfying:
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join))
to get the exact output
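# reindex_axis is no longer available in later pandas releases; reindex(columns=...) does the same job
df1.groupby('Group').agg(dict(Age='first', City='first', Name=' '.join)) \
    .reset_index().reindex(columns=df1.columns)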
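For reference, here is a self-contained sketch of the groupby/agg approach written with named aggregation (available from pandas 0.25) and as_index=False, so that Group comes back as a regular column without a separate reset_index(). The variable name summary is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Alice", "Marie", "Smith", "Mallory", "Bob", "Doe"],
    "City": ["Seattle", None, None, "Portland", None, None],
    "Age": [24, None, None, 26, None, None],
    "Group": [1, 1, 1, 2, 2, 2]})

# 'first' returns the first non-null value in each group,
# so the NaN/None rows after the first one are skipped.
summary = df1.groupby("Group", as_index=False).agg(
    Age=("Age", "first"),
    City=("City", "first"),
    Name=("Name", " ".join),
)
print(summary)
```

The keyword names on the left of each pair set the output column names, so the column order of the result follows the order of the keywords.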


Replace column based on string

I'm trying to replace the column "Name" with a new variable "Gender" based on the first letters found in each name.
INPUT:
df['Name'].value_counts()
OUTPUT:
Mr. Gordon Hemmings 1
Miss Jane Wilkins 1
Mrs. Audrey North 1
Mrs. Wanda Sharp 1
Mr. Victor Hemmings 1
..
Miss Heather Abraham 1
Mrs. Kylie Hart 1
Mr. Ian Langdon 1
Mr. Gordon Watson 1
Miss Irene Vance 1
Name: Name, Length: 4999, dtype: int64
Now, see the Mr., Mrs., and Miss? The first question that comes to mind is: how many different titles are there?
INPUT
df.Name.str.split().str[0].value_counts(dropna=False)
Mr. 3351
Mrs. 937
Miss 711
NaN 1
Name: Name, dtype: int64
Now I'm trying to:
#Replace missing value
df['Name'].fillna('Mr.', inplace=True)
# Create Column Gender
df['Gender'] = df['Name']
for i in range(0, df[0]):
    A = df['Name'].values[i][0:3]=="Mr."
    df['Gender'].values[i] = A
df.loc[df['Gender']==True, 'Gender']="Male"
df.loc[df['Gender']==False, 'Gender']="Female"
del df['Name'] #Delete column 'Name'
df
But I'm missing something since I get the following error:
KeyError: 0
The KeyError occurs because df[0] looks up a column labelled 0, and you don't have one; the number of rows would be len(df) or df.shape[0]. However, I would ditch that code and try something more efficient.
You can use np.where with str.contains to search for names containing Mr. after using fillna(). Then just drop the Name column:
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
df
Full example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Name': {0: 'Mr. Gordon Hemmings',
                            1: 'Miss Jane Wilkins',
                            2: 'Mrs. Audrey North',
                            3: 'Mrs. Wanda Sharp',
                            4: 'Mr. Victor Hemmings'},
                   'Value': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}})
print(df)
df['Name'] = df['Name'].fillna('Mr.')
# raw string so that \. reaches the regex engine as an escaped dot
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
print('\n')
print(df)
Name Value
0 Mr. Gordon Hemmings 1
1 Miss Jane Wilkins 1
2 Mrs. Audrey North 1
3 Mrs. Wanda Sharp 1
4 Mr. Victor Hemmings 1
Value Gender
0 1 Male
1 1 Female
2 1 Female
3 1 Female
4 1 Male
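Note that str.contains matches the pattern anywhere in the string, not only at the start. A stricter variant, sketched here under the assumption that the title is always the first whitespace-separated token, maps that token through a lookup table. The dictionary title_to_gender is hypothetical and would need extending if other titles appear:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mr. Gordon Hemmings", "Miss Jane Wilkins",
                            "Mrs. Audrey North", None],
                   "Value": [1, 1, 1, 1]})

# Hypothetical lookup table covering the three titles seen in the question.
title_to_gender = {"Mr.": "Male", "Mrs.": "Female", "Miss": "Female"}

# Fill the missing name as in the question, then keep only the first token.
titles = df["Name"].fillna("Mr.").str.split().str[0]

# Titles that are not in the table would come out as NaN, which makes them easy to spot.
df["Gender"] = titles.map(title_to_gender)
df = df.drop(columns="Name")
print(df)
```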

Fillna() in column based on condition

I have created a small dictionary, where a specific title is assigned a median age.
Age
Title
Master. 3.5
Miss. 21.0
Mr. 30.0
Mrs. 35.0
other 44.5
Now I want to use this dictionary to fill the missing values in a single column in a dataframe, based on that title. So, for rows where the "Age" is missing, and the title = "Master.", I want to insert the value 3.5 and so on.
I tried this piece of code, but it does not work; it doesn't produce an error, but it also doesn't replace the missing values. What am I doing wrong?
for title in piv.keys():
    train[["Age"]][train["Title"]==title].fillna(piv[title], inplace=True)
where "piv" is the name of the dictionary, and "train" is the name of the dataframe.
Also, is there a more elegant way to do this?
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr.
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs.
{'Master.': 3.5, 'Miss.': 21.0, 'Mr.': 30.0, 'Mrs.': 35.0, 'other': 44.5}
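One option, which computes the fill value for each title from the data itself (use x.median() instead of x.mean() to match the medians in the question):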
train['Age'] = train.groupby('Title')['Age'].transform(lambda x: x.fillna(x.mean()))
Another option, if piv is a DataFrame with a Title column (if it is already a dict, pass it to map directly):
pivdict = piv.set_index('Title').squeeze().to_dict()
train['Age'] = train['Age'].fillna(train['Title'].map(pivdict))
One method:
# create lookup dictionary
title = ['Master', 'Miss.', 'Mr.', 'Mrs.', 'other']
age = [3.5, 21, 30, 35, 44]
title_dict = dict(zip(title, age))
# mock dataframe
df = pd.DataFrame({'Name': ['Bob', 'Alice', 'Charles', 'Mary'],
'Age': [12, 27, None, None],
'Title': ['Master', 'Miss.', 'Mr.', 'other']})
# if age is Na then look it up in dictionary
df['Age'] = df['Age'].fillna(df['Title'].map(title_dict))
Input:
Name Age Title
0 Bob 12.0 Master
1 Alice 27.0 Miss.
2 Charles NaN Mr.
3 Mary NaN other
Output:
Name Age Title
0 Bob 12.0 Master
1 Alice 27.0 Miss.
2 Charles 30.0 Mr.
3 Mary 44.0 other
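As for why the loop in the question has no effect: train[["Age"]] and the Boolean selection that follows each build a new object, so fillna(..., inplace=True) modifies a temporary copy rather than train. If you want to keep the loop, a minimal sketch is to assign through a single .loc call. The small train frame below is made up for illustration:

```python
import pandas as pd

piv = {"Master.": 3.5, "Miss.": 21.0, "Mr.": 30.0, "Mrs.": 35.0, "other": 44.5}

# Made-up stand-in for the Titanic training data.
train = pd.DataFrame({"Title": ["Mr.", "Mrs.", "Master.", "Mr."],
                      "Age": [22.0, None, None, None]})

for title, median_age in piv.items():
    # Rows with this title whose Age is still missing.
    mask = (train["Title"] == title) & train["Age"].isna()
    # A single .loc assignment writes into train itself, not into a copy.
    train.loc[mask, "Age"] = median_age

print(train)
```

The map-based one-liner above does the same thing without a loop.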

Amend row in a data-frame if it exists in another data-frame

I have two dataframes DfMaster and DfError
DfMaster which looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster I would like to change the Building value for a record to DD if it appears in the DfError data-frame. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this, using np.where:
import numpy as np
errors = DfError['Id'].unique()
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects index labels or a Boolean mask, not values taken from a column.
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
In words:
For all rows whose Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
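To illustrate the point about labels: the original .loc call would work if Id were the index, because the Id values would then be row labels. This is a sketch under that assumption, with the frames rebuilt from the question; the names indexed and result are illustrative:

```python
import pandas as pd

DfMaster = pd.DataFrame({
    "Id": [4653, 3467, 34, 4567, 3643, 244],
    "Name": ["Jane Smith", "Steve Jones", "Kim Lee",
             "John Evans", "Kevin Franks", "Stella Howard"],
    "Building": ["A", "B", "F", "A", "S", "D"]})
DfError = pd.DataFrame({
    "Id": [4567, 244],
    "Name": ["John Evans", "Stella Howard"],
    "Building": ["A", "D"]})

# With Id as the index, the Ids from DfError are valid row labels for .loc.
indexed = DfMaster.set_index("Id")
indexed.loc[DfError["Id"].tolist(), "Building"] = "DD"

# Restore Id as an ordinary column.
result = indexed.reset_index()
print(result)
```

This variant raises a KeyError if DfError contains an Id that is missing from DfMaster, whereas the isin mask simply ignores such Ids.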

Adding a function to a string split command in Pandas

I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks if the value != null, then do the command listed above, otherwise enter 'NA' for First_Name and 'Last_Name'
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure if NULL is the issue. I have some names that are 3-4 strings long. i.e.
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way: split and take, say, the first two values as the first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (with no separator argument, because the default splits on whitespace) and then index with .str to select list elements by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
# data borrowed from A-Za-z's answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n, so that everything after the first name goes into the last name:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit, which keeps only the final word as the last name:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
This should do
This should fix your problem
Setup
data= pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
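A vectorised alternative to the row-wise apply, sketched under the assumption that everything after the first word should count as the last name: str.split(n=1, expand=True) performs at most one split, so it never yields more than two columns, and missing pieces can then be filled with 'NA' as the question asked. The sample names are made up:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"director_name": ["John Doe", np.nan,
                                       "John Allen Doe Jr", "Cher"]})

# At most one split, so at most two pieces per name.
parts = data["director_name"].str.split(n=1, expand=True)

# Guard for the edge case where no name contains a space (only column 0 exists).
parts = parts.reindex(columns=[0, 1])

# Missing names and missing last names both become the string 'NA'.
data["First_Name"] = parts[0].fillna("NA")
data["Last_Name"] = parts[1].fillna("NA")
print(data)
```

Because the number of columns is fixed, this avoids the 'Columns must be same length as key' error from the question.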

Create a Pandas DataFrame from deeply nested JSON

I'm trying to create a single Pandas DataFrame object from a deeply nested JSON string.
The JSON schema is:
{"intervals": [
{
pivots: "Jane Smith",
"series": [
{
"interval_id": 0,
"p_value": 1
},
{
"interval_id": 1,
"p_value": 1.1162791357932633e-8
},
{
"interval_id": 2,
"p_value": 0.0000028675012051504467
}
],
},
{
"pivots": "Bob Smith",
"series": [
{
"interval_id": 0,
"p_value": 1
},
{
"interval_id": 1,
"p_value": 1.1162791357932633e-8
},
{
"interval_id": 2,
"p_value": 0.0000028675012051504467
}
]
}
]
}
Desired outcome: I need to flatten this to produce a table:
Actor Interval_id Interval_id Interval_id ...
Jane Smith 1 1.1162 0.00000 ...
Bob Smith 1 1.1162 0.00000 ...
The first column is the Pivots values, and the remaining columns are the values of the keys interval_id and p_value stored in the list series.
So far I've got:
import requests as r
import pandas as pd
actor_data = r.get("url/to/data").json()['data']['intervals']
df = pd.DataFrame(actor_data)
actor_data is a list where the length is equal to the number of individuals ie pivots.values(). The df object simply returns
<bound method DataFrame.describe of pivots Series
0 Jane Smith [{u'p_value': 1.0, u'interval_id': 0}, {u'p_va...
1 Bob Smith [{u'p_value': 1.0, u'interval_id': 0}, {u'p_va...
.
.
.
How can I iterate through that series list to get to the dict values and create N distinct columns? Should I try to create a DataFrame for the series list, reshape it, and then do a column bind with the actor names?
UPDATE:
pvalue_list = [i['p_value'] for i in json_data['series']]
this gives me a list of lists. Now I need to figure out how to add each list as a row in a DataFrame.
value_list = []
for i in pvalue_list:
    pvs = [j['p_value'] for j in i]
    value_list = value_list.append(pvs)
return value_list
This returns a NoneType
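The None comes from list.append, which modifies the list in place and returns None, so assigning its result back to value_list throws the list away. A minimal sketch of the loop without the reassignment, using made-up data in place of the real series lists:

```python
# Hypothetical stand-in for the per-actor "series" lists in the question.
series_per_actor = [
    [{"interval_id": 0, "p_value": 1.0}, {"interval_id": 1, "p_value": 0.5}],
    [{"interval_id": 0, "p_value": 0.25}, {"interval_id": 1, "p_value": 0.125}],
]

value_list = []
for series in series_per_actor:
    pvs = [entry["p_value"] for entry in series]
    value_list.append(pvs)  # mutates value_list in place; do not reassign the result

print(value_list)
```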
Solution
def get_hypthesis_data():
    raw_data = r.get("/url/to/data").json()['data']
    actor_dict = {}
    for actor_series in raw_data['intervals']:
        actor = actor_series['pivots']
        p_values = []
        for interval in actor_series['series']:
            p_values.append(interval['p_value'])
        actor_dict[actor] = p_values
    return pd.DataFrame(actor_dict).T
This returns the correct DataFrame. I transposed it so the individuals were rows and not columns.
I think organizing your data in a way that yields repeating column names is only going to create headaches for you later on down the road. A better approach IMHO is to create a column for each of pivots, interval_id, and p_value. This will make it extremely easy to query your data after loading it into pandas.
Also, your JSON has some errors in it. I ran it through this to find the errors.
jq helps here
import json
import pandas as pd
import sh
jq = sh.jq.bake('-M')  # disable colorizing
json_data = "from above"
rule = """[{pivots: .intervals[].pivots,
            interval_id: .intervals[].series[].interval_id,
            p_value: .intervals[].series[].p_value}]"""
out = jq(rule, _in=json_data).stdout
res = pd.DataFrame(json.loads(out))
This will yield output similar to
interval_id p_value pivots
32 2 2.867501e-06 Jane Smith
33 2 1.000000e+00 Jane Smith
34 2 1.116279e-08 Jane Smith
35 2 2.867501e-06 Jane Smith
36 0 1.000000e+00 Bob Smith
37 0 1.116279e-08 Bob Smith
38 0 2.867501e-06 Bob Smith
39 0 1.000000e+00 Bob Smith
40 0 1.116279e-08 Bob Smith
41 0 2.867501e-06 Bob Smith
42 1 1.000000e+00 Bob Smith
43 1 1.116279e-08 Bob Smith
Adapted from this comment
Of course, you can always call res.drop_duplicates() to remove the duplicate rows. This gives
In [175]: res.drop_duplicates()
Out[175]:
interval_id p_value pivots
0 0 1.000000e+00 Jane Smith
1 0 1.116279e-08 Jane Smith
2 0 2.867501e-06 Jane Smith
6 1 1.000000e+00 Jane Smith
7 1 1.116279e-08 Jane Smith
8 1 2.867501e-06 Jane Smith
12 2 1.000000e+00 Jane Smith
13 2 1.116279e-08 Jane Smith
14 2 2.867501e-06 Jane Smith
36 0 1.000000e+00 Bob Smith
37 0 1.116279e-08 Bob Smith
38 0 2.867501e-06 Bob Smith
42 1 1.000000e+00 Bob Smith
43 1 1.116279e-08 Bob Smith
44 1 2.867501e-06 Bob Smith
48 2 1.000000e+00 Bob Smith
49 2 1.116279e-08 Bob Smith
50 2 2.867501e-06 Bob Smith
[18 rows x 3 columns]
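The jq rule above appears to combine every interval_id with every p_value, which would explain both the duplicates and rows such as interval_id 0 paired with 1.116279e-08. As an alternative that stays inside pandas, and is not part of the original answer, pd.json_normalize (top-level since pandas 1.0) keeps each p_value with its own interval_id. The payload below is the corrected JSON from the question; long_df and wide_df are illustrative names:

```python
import pandas as pd

series = [
    {"interval_id": 0, "p_value": 1},
    {"interval_id": 1, "p_value": 1.1162791357932633e-8},
    {"interval_id": 2, "p_value": 0.0000028675012051504467},
]
payload = {"intervals": [
    {"pivots": "Jane Smith", "series": series},
    {"pivots": "Bob Smith", "series": series},
]}

# One row per (pivots, interval_id) pair, with the matching p_value.
long_df = pd.json_normalize(payload["intervals"],
                            record_path="series", meta="pivots")

# Optional: one row per actor and one column per interval_id.
wide_df = long_df.pivot(index="pivots", columns="interval_id", values="p_value")

print(long_df)
print(wide_df)
```

The long form gives the three-column layout recommended above, and the pivot gives the one-row-per-actor table the question asked for, with the interval ids as distinct column names.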
