How to fix "KeyError: ('size', 'occurred at index City')" - python

I am trying to follow the example in "Use Python & Pandas to replace NaN in the 'size' column with a specific value, depending on the City". In the example below I am trying to assign a value of 18 if the City is St. Louis.
I used a lambda function to do it, since the original dataframe has many rows with repeated City names and only a few of them have NaN values.
When I run the code I get this error: KeyError: ('size', 'occurred at index City')
Below is a snippet of the code:
import pandas as pd

raw_data = {'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
            'size': [24, 36, 'NaN', 'NaN', 22],
            'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']}
df = pd.DataFrame(raw_data)
df
df['size'] = df.apply(lambda x: x['size'].fillna(value=18 if x['City' == 'St Louis'] else x['size'], axis=1, inplace=True))
df
Expected: 18 to be populated in the 'size' column for St Louis.
Actual: KeyError: ('size', 'occurred at index City')

If all you're trying to do is set the size of St. Louis, you can run:
df.loc[df['City'] == 'St Louis', 'size'] = 18
However, if you instead want to set all values of NaN to 18, you could likewise run:
df.loc[df['size'] == 'NaN', 'size'] = 18
And if you'd just like to set the size of all St. Louis entries where the size is NaN, you could do:
df.loc[(df['City'] == 'St Louis') & (df['size'] == 'NaN'), 'size'] = 18
Note the parentheses and the & operator: using Python's and on two Series raises a ValueError about ambiguous truth values.

There is a simple solution using the fillna method:
df['size'] = df['size'].fillna(18)
EDIT
What I failed to notice is that you populate the cells with the string 'NaN', not with real NaN values.
If you change your input data to use real NaN values:
import numpy as np

raw_data = {'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
            'size': [24, 36, np.nan, np.nan, 22],
            'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']}
Then the following method will allow you to re-populate the 'size' column's cells by city name:
df = pd.DataFrame(raw_data)
df[['City', 'size']] = df.set_index('City')['size'].fillna({'St Louis': 18, 'SFO': 20}).reset_index()
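If you cannot change the input data, here is a minimal sketch of the same idea: coerce the 'NaN' strings into real NaN values with pd.to_numeric first, then fill per city. The per-city default values here are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({'City': ['Dallas', 'Chicago', 'St Louis', 'SFO', 'St Louis'],
                   'size': [24, 36, 'NaN', 'NaN', 22],
                   'Type': ['Pie', 'Hallo', 'Zombi', 'Dru', 'Zoro']})

# 'NaN' strings (and anything else non-numeric) become real NaN
df['size'] = pd.to_numeric(df['size'], errors='coerce')
# per-city defaults (assumed values), aligned to df's index via map
fill_values = df['City'].map({'St Louis': 18, 'SFO': 20})
df['size'] = df['size'].fillna(fill_values)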

Related

How to convert JSON string in data frame column into multiple columns

I have a data frame with a JSON string in a column:
ID  Data
11  {'Name': 'Sam', 'Age': 21}
22  {'Name': 'Nam', 'Age': 22}
33  {'Name': 'Pam', 'Age': 21, 'Salary': 10000}
How can I convert the above JSON string in to columns?
Desired result:
ID  Name  Age  Salary
11  Sam   21
22  Nam   22
33  Pam   21   10000
You can use pandas.Series to read your column of dictionary values into columns.
Creating the data:
import pandas as pd

data = {
    'Id': [11, 22, 33],
    'Data': ["{'Name': 'Sam', 'Age': 21}",
             "{'Name': 'Nam', 'Age': 22}",
             "{'Name': 'Pam', 'Age': 21, 'Salary': 10000}"],
}
df = pd.DataFrame(data)
Converting the dictionary strings into columns:
df['Data'] = df['Data'].map(lambda x: eval(x) if pd.notnull(x) else x)
df = pd.concat([df, df.pop("Data").apply(pd.Series)], axis=1)
Output:
Id Name Age Salary
0 11 Sam 21 NaN
1 22 Nam 22 NaN
2 33 Pam 21 10000.0
Alternate solution
You can also use json_normalize to unravel the dictionary column into separate columns based on the dictionary keys:
df['Data'] = df['Data'].map(lambda x: eval(x) if pd.notnull(x) else x)
df = pd.concat([df, pd.json_normalize(df.pop("Data"))], axis=1)
which gives you the same output.
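As a side note, eval executes arbitrary code, so if the strings come from an untrusted source, a safer sketch of the same approach uses ast.literal_eval, which only parses Python literals:
import ast
import pandas as pd

df = pd.DataFrame(data)
# literal_eval only accepts literal syntax (dicts, lists, numbers, strings),
# so it cannot execute arbitrary expressions the way eval can
df['Data'] = df['Data'].map(lambda x: ast.literal_eval(x) if pd.notnull(x) else x)
df = pd.concat([df, pd.json_normalize(df.pop('Data'))], axis=1)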

Replace certain values of one column, with different values from a different df, pandas

I have a df, for example:
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
'age': [21, 23, 24, 28],
'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
'knowledge':['python', 'medical','sql','c++'],
})
and another df -
df2 = pd.DataFrame({'occupation': ['data scientist', 'data analyst'],
'knowledge':['5', '4'],
})
I want to replace the knowledge values of the first df with the knowledge values of the second, but only for the rows where the occupation matches,
making the first df look like this:
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
'age': [21, 23, 24, 28],
'occupation': ['data scientist', 'doctor', 'data analyst', 'engineer'],
'knowledge':['5', 'medical','4','c++'],
})
I tried to do stuff with replace, but it didn't work...
You may try this:
occ_know_dict = df2.set_index('occupation').to_dict()['knowledge']
df['knowledge'] = df[['knowledge', 'occupation']].apply(
    lambda row: occ_know_dict.get(row['occupation'], row['knowledge']), axis=1)
You can also map the knowledge column of df2, which shares the occupation column with df, onto df, then update the values in df:
df['knowledge'].update(df['occupation'].map(df2.set_index('occupation')['knowledge']))
Note that update happens in place.
print(df)
name age occupation knowledge
0 name1 21 data scientist 5
1 name2 23 doctor medical
2 name3 24 data analyst 4
3 name4 28 engineer c++
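For completeness, a sketch of the same idea without the in-place update, using map plus fillna so that rows without a match in df2 keep their original value:
df['knowledge'] = (df['occupation']
                   .map(df2.set_index('occupation')['knowledge'])
                   .fillna(df['knowledge']))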

Reshaping a Pandas data frame with duplicate values

Using the Plotly go.Table() function and Pandas, I'm attempting to create a table to summarize some data. My data is as follows:
import pandas as pd
test_df = pd.DataFrame({'Manufacturer':['BMW', 'Chrysler', 'Chrysler', 'Chrysler', 'Brokertec', 'DWAS', 'Ford', 'Buick'],
'Metric':['Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator', 'Indicator'],
'Dimension':['Short', 'Short', 'Short', 'Long', 'Short', 'Short', 'Long', 'Long'],
'User': ['USA', 'USA', 'USA', 'USA', 'USA', 'New USA', 'USA', 'Los USA'],
'Value':[50, 3, 3, 2, 5, 7, 10, 5]
})
My desired output is as follows (summing the Dimension by Manufacturer):
Manufacturer Short Long
Chrysler 6 2
Buick 5 5
Mercedes 7 0
Ford 0 10
I need to shape the Pandas data frame a bit (and this is where I'm running into trouble). My code was as follows:
table_columns = ['Manufacturer', 'Longs', 'Shorts']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
          .set_index(['Manufacturer', 'Dimension'])
          ['Value'].unstack()
          .reset_index()[table_columns]
          )
Then, create the table using the Plotly go.Table() function:
import plotly.graph_objects as go
direction_table = go.Figure(go.Table(
header=dict(
values=table_columns,
font=dict(size=12),
line_color='darkslategray',
fill_color='lightskyblue',
align='center'
),
cells=dict(
values=df_new.T, # using Transpose here
line_color='darkslategray',
fill_color='lightcyan',
align = 'center')
)
)
direction_table
The error I'm seeing is:
ValueError: Index contains duplicate entries, cannot reshape
What is the best way to work around this?
Thanks in advance!
You need to use pivot_table with aggfunc='sum' instead of set_index + unstack:
table_columns = ['Manufacturer', 'Long', 'Short']
manufacturers = ['Chrysler', 'Buick', 'Mercedes', 'Ford']
df_new = (test_df[test_df['Manufacturer'].isin(manufacturers)]
.pivot_table(index='Manufacturer', columns='Dimension',
values='Value', aggfunc='sum', fill_value=0)
.reset_index()
.rename_axis(columns=None)[table_columns]
)
print (df_new)
Manufacturer Long Short
0 Buick 5 0
1 Chrysler 2 6
2 Ford 10 0
Note this is not the same as your expected output, but I don't think your input can produce the expected output (for example, there is no 'Mercedes' row in the input data).
Or get the same result with groupby.sum and unstack:
(test_df[test_df['Manufacturer'].isin(manufacturers)]
.groupby(['Manufacturer', 'Dimension'])
['Value'].sum()
.unstack(fill_value=0)
.reset_index()
.rename_axis(columns=None)[table_columns]
)
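For completeness, pd.crosstab is a sketch of yet another way to build the same reshaped frame, reusing table_columns and manufacturers as defined above:
import pandas as pd

sub = test_df[test_df['Manufacturer'].isin(manufacturers)]
df_new = (pd.crosstab(sub['Manufacturer'], sub['Dimension'],
                      values=sub['Value'], aggfunc='sum')
          .fillna(0).astype(int)        # missing combinations become 0
          .reset_index()
          .rename_axis(columns=None)[table_columns])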

filter pandas where some columns contain any of the words in a list

I would like to filter a Dataframe. The resulting dataframe should contain all rows where any of a number of columns contains any of the words in a list.
I started with for loops, but there should be a better pythonic/pandonic way.
Example:
# importing pandas
import pandas as pd
# Creating the dataframe with dict of lists
df = pd.DataFrame({'Name': ['Geeks', 'Peter', 'James', 'Jack', 'Lisa'],
'Team': ['Boston', 'Boston', 'Boston', 'Chele', 'Barse'],
'Position': ['PG', 'PG', 'UG', 'PG', 'UG'],
'Number': [3, 4, 7, 11, 5],
'Age': [33, 25, 34, 35, 28],
'Height': ['6-2', '6-4', '5-9', '6-1', '5-8'],
'Weight': [89, 79, 113, 78, 84],
'College': ['MIT', 'MIT', 'MIT', 'Stanford', 'Stanford'],
'Salary': [99999, 99994, 89999, 78889, 87779]},
index =['ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
df1 = df[df['Team'].str.contains("Boston") | df['College'].str.contains('MIT')]
print(df1)
So it is clear how to filter individual columns that contain a particular word.
Further on, it is also clear how to filter rows per column containing any of the strings in a list:
df[df.Name.str.contains('|'.join(search_values ))]
Where search_values contains a list of words or strings.
search_values = ['boston','mike','whatever']
I am looking for a short way to code
#pseudocode
give me a subframe of df where any of the columns 'Name','Position','Team' contains any of the words in search_values
I know I can do
df[df['Name'].str.contains('|'.join(search_values)) | df['Position'].str.contains('|'.join(search_values)) | df['Team'].str.contains('|'.join(search_values))]
but if I had, say, 20 columns, that would be a mess of a line of code.
Any suggestions?
EDIT Bonus:
When looking in a list of columns, e.g. 'Name', 'Position', 'Team', how can I also include the index? Passing ['index', 'Name', 'Position', 'Team'] does not work.
Thanks.
I had a look at these:
https://www.geeksforgeeks.org/get-all-rows-in-a-pandas-dataframe-containing-given-substring/
https://kanoki.org/2019/03/27/pandas-select-rows-by-condition-and-string-operations/
Filter out rows based on list of strings in Pandas
You can also stack and check with any on level 0 (note that Series.any(level=0) was removed in pandas 2.0; groupby(level=0).any() is the equivalent):
cols_list = ['Name', 'Team'] # add column names
df[df[cols_list].stack().str.contains('|'.join(search_values), case=False, na=False)
   .groupby(level=0).any()]
Name Team Position Number Age Height Weight College Salary
ind1 Geeks Boston PG 3 33 6-2 89 MIT 99999
ind2 Peter Boston PG 4 25 6-4 79 MIT 99994
ind3 James Boston UG 7 34 5-9 113 MIT 89999
Or use apply with any:
mask = df[cols_list].apply(lambda x: x.str.contains('|'.join(search_values)), axis=1).any(axis=1)
df[mask]
You can simply apply in this case:
cols_to_filter = ['Name', 'Position', 'Team']
search_values = ['word1', 'word2']
patt = '|'.join(search_values)
mask = df[cols_to_filter].apply(lambda x: x.str.contains(patt)).any(axis=1)
df[mask]
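Regarding the bonus question: the index is not a column, so a column list will never match it. A sketch that folds the (unnamed) index in as a regular column first; reset_index names that column 'index' by default:
tmp = df.reset_index()  # the unnamed index becomes a column called 'index'
cols = ['index', 'Name', 'Position', 'Team']
patt = '|'.join(search_values)
mask = tmp[cols].apply(lambda c: c.str.contains(patt, case=False, na=False)).any(axis=1)
df[mask.to_numpy()]  # use the raw array, since tmp has a fresh RangeIndex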

dataframe groupby one column and average one column while finding the most occurring value in another

I have a pandas dataframe and I want to group by one column while averaging a second column and finding the most occurring value in a third.
I was able to do it, but I think there should be a more concise way than my 4 lines of code.
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA', 'Russia', 'Russia'],
                   'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'Chicago', 'Moscow', 'Moscow'],
                   'Flights': [22, 45, 32, 16, 31, 25]})
w = df.groupby('Country').mean(numeric_only=True).round(decimals=2)
x = df.groupby('Country')['City'].agg(pd.Series.mode)
y = x.to_frame()
z = pd.concat([w, y], axis=1, join='outer')
         Flights      City
Country
Russia     29.33    Moscow
USA        27.67  New-York
Use GroupBy.agg with lambda functions; for the mode it is possible to add Series.iat to select the first value, because mode can return multiple values:
z = df.groupby('Country').agg({'Flights': lambda x: round(x.mean(), 2),
                               'City': lambda x: x.mode().iat[0]})
print (z)
Flights City
Country
Russia 29.33 Moscow
USA 27.67 New-York
z = df.groupby('Country', as_index=False).agg({'Flights': lambda x: round(x.mean(), 2),
                                               'City': lambda x: x.mode().iat[0]})
print (z)
Country Flights City
0 Russia 29.33 Moscow
1 USA 27.67 New-York
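A sketch of the same aggregation using named aggregation (available since pandas 0.25), which avoids the dict-of-lambdas and names the output columns explicitly:
z = (df.groupby('Country', as_index=False)
       .agg(Flights=('Flights', lambda x: round(x.mean(), 2)),
            City=('City', lambda x: x.mode().iat[0])))
print(z)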
