How to concat dataframes in pandas? - python

I am fetching data from MongoDB into Python through pymongo and then converting it into a pandas DataFrame:
df = pd.DataFrame(list(db.dataset2.find()))
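Note that each document also comes back with MongoDB's _id field; if you don't want it in the DataFrame, you can exclude it with a projection (pymongo's second argument to find):
df = pd.DataFrame(list(db.dataset2.find({}, {'_id': 0})))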
This is how the data looks in MongoDB:
"dish" : [
{
"dish_id" : "005" ,
"dish_name" : "Sandwitch",
"dish_price" : 50,
"coupon_applied" : "Yes",
"coupon_type" : "Rs 20 off"
},
{
"dish_id" : "006" ,
"dish_name" : "Chicken Hundi",
"dish_price" : 125,
"coupon_applied" : "No",
"coupon_type" : "Null"
}
],
I want to separate the dish attributes into two rows in a pandas DataFrame. Here is the code which does that. (There are 3 dish documents, so I am iterating through them with a for loop.)
for i in range(0, len(df.dish)):
    data_dish = json_normalize(df['dish'][i])
    print(data_dish)
But it gives me the output below:
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 001 Chicken Biryani 120
1 No Null 001 Paneer Biryani 100
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 40 off 002 Mutton Biryani 130
1 No Null 004 Aaloo tikki 95
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 005 Sandwitch 50
1 No Null 006 Chicken Hundi 125
And I want the output in the following format:
coupon_applied coupon_type dish_id dish_name dish_price
0 Yes Rs 20 off 001 Chicken Biryani 120
1 No Null 001 Paneer Biryani 100
2 Yes Rs 40 off 002 Mutton Biryani 130
3 No Null 004 Aaloo tikki 95
4 Yes Rs 20 off 005 Sandwitch 50
5 No Null 006 Chicken Hundi 125
Can you help me with this? Thanks in advance :)

There is a more direct way: normalize each dish list and concat the results once.
dishes = [json_normalize(d) for d in df['dish']]
df = pd.concat(dishes, ignore_index=True)
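For reference, a minimal self-contained sketch with made-up dish data (the json_normalize import path depends on your pandas version):
import pandas as pd
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

df = pd.DataFrame({'dish': [
    [{'dish_id': '005', 'dish_name': 'Sandwitch', 'dish_price': 50},
     {'dish_id': '006', 'dish_name': 'Chicken Hundi', 'dish_price': 125}],
]})
dishes = [json_normalize(d) for d in df['dish']]
print(pd.concat(dishes, ignore_index=True))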

You should be able to collect the DataFrames in a list and then concat them.
Create an empty list of DataFrames:
dflist = []
Loop and append DataFrames:
for i in range(0, len(df.dish)):
    data_dish = json_normalize(df['dish'][i])
    dflist.append(data_dish)
Then concat the list into the full DataFrame:
df = pd.concat(dflist, ignore_index=True)
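Building the list first and calling pd.concat once at the end is also much faster than concatenating inside the loop, because every call to pd.concat copies all of the data passed to it.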

Related


How to split a huge CSV into multiple CSVs based on column header

I'll try to illustrate my problem with the simple example below. I have the main csv shown below and I am trying to split it into two or more csvs based on column headers, keeping the unique id column intact in every csv file.
Below is the code I am trying, but it does not quite give the result I want.
import pandas as pd
df = pd.read_csv('abc.csv')
df[['id','name','age']] = df['csv1'].str.split(' ', expand=True)
csv
id name age color Gender
0 101 Jack 23 white M
1 102 Mary 25 black F
2 103 Tom 24 brown M
Output required
csv1
id name age
0 101 Jack 23
1 102 Mary 25
2 103 Tom 24
csv2 -
id color Gender
0 101 white M
1 102 black F
2 103 brown M
UPDATE
I found a better approach with np.array_split.
I used this example df:
x y R TR x_c y_c xxx yyy RRR TTTR xxx_c yyy_c
id
1256780.0 13989 6241 6.689222 20.986341 14050.83 6315.33 213989 36241 46.689222 520.986341 614050.83 76315.33
12000.0 14013 6278 53.152036 0.000000 14060.00 6288.00 214013 36278 453.152036 5.000000 614060.00 76288.00
1100.0 14111 6379 87.598357 5.000000 14070.55 7000.00 214111 36379 487.598357 55.000000 614070.55 76288.00
which has 12 columns.
# the 4 means, split the df into 4 evenly sized chunks
chunks = np.array_split(df,4, axis=1)
Chunks is a list containing all the separate dataframes.
Output:
# chunks[0]
x y R
id
1256780.0 13989.0 6241.0 6.689222
12000.0 14013.0 6278.0 53.152036
1100.0 14111.0 6379.0 87.598357
# chunks[1]
TR x_c y_c
id
1256780.0 20.986341 14050.83 6315.33
12000.0 0.000000 14060.00 6288.00
1100.0 5.000000 14070.55 7000.00
# chunks[2]
xxx yyy RRR
id
1256780.0 213989.0 36241.0 46.689222
12000.0 214013.0 36278.0 453.152036
1100.0 214111.0 36379.0 487.598357
# chunks[3]
TTTR xxx_c yyy_c
id
1256780.0 520.986341 614050.83 76315.33
12000.0 5.000000 614060.00 76288.00
1100.0 55.000000 614070.55 76288.00
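If the end goal is separate csv files, each chunk can be written straight out; because id is the index, it lands in every file. A sketch, reusing abc.csv and the 4-way split from above:
import numpy as np
import pandas as pd

df = pd.read_csv('abc.csv').set_index('id')
chunks = np.array_split(df, 4, axis=1)
for i, chunk in enumerate(chunks, start=1):
    chunk.to_csv(f'csv{i}.csv')  # the id index is written into every file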
Old answer:
You could calculate half of the columns and then use iloc to split the dataframes into two parts.
df = df.set_index('id')
half = len(df.columns)//2
df1, df2 = df.iloc[:,:half], df.iloc[:,half:]
df1 = df1.reset_index()
df2 = df2.reset_index()
Output:
#df1
id name age
0 101 Jack 23
1 102 Mary 25
2 103 Tom 24
#df2
id color Gender
0 101 white M
1 102 black F
2 103 brown M

Replace column value of Dataframe based on a condition on another Dataframe

I have two dataframes where I need to update the first one based on the value of the second one, if it exists. The sample below replaces a student_Id that appears in the 'old_id' column of updatedId with the corresponding 'new_id'.
import pandas as pd
import numpy as np
student = {
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
    'old_id': ['ok83', '234v'],
    'new_id': ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)
# Method with np.where
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'], df_student['student_Id'])
# print(df_student)
# Method with dataframe.mask
for index, row in df_updated_id.iterrows():
    df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'], row['new_id'], inplace=True)
print(df_student)
Both methods above work and yield the correct result:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 ok83
5 ganesh male 40 234v
old_id new_id
0 ok83 83ko
1 234v v432
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
Nonetheless, the actual student data has about 500,000 rows and updated_id has 6,000 rows, so I run into performance issues: the loop is very slow.
A simple timer was placed to observe the time taken for different numbers of records in df_updated_id:
100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443
500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145
1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914
My question is whether I can optimize this using a merge (join), or by ditching iterrows. I tried something like the posts below but failed to get it to work:
Replace dataframe column values based on matching id in another dataframe, and How to iterate over rows in a DataFrame in Pandas
Please advise.
You can also try with map:
df_student['student_Id'] = (
    df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
    .fillna(df_student['student_Id'])
)
print(df_student)
# Output
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
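Here set_index('old_id')['new_id'] turns the lookup table into a Series indexed by old_id, so map uses it like a dict: matched ids are swapped for their new_id, unmatched ids come back as NaN, and fillna then restores the original value for those.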
Update
If updated_id isn't unique and the data needs further pre-processing, you could drop duplicates first, treating the last value (keep='last') as the most recent for a given old_id:
sr = df_updated_id.drop_duplicates('old_id', keep='last').set_index('old_id')['new_id']
df_student['student_Id'] = df_student['student_Id'].map(sr).fillna(df_student['student_Id'])
Note: this is effectively what @BENY's answer does. Since it builds a mapping, only the last occurrence of an old_id is kept. However, if you want to keep the first value that appears, that code doesn't work; with drop_duplicates you can adjust the keep parameter.
We can just use replace:
df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
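Since the question explicitly asks about a merge: a left join works too. A sketch (not from the answers above), assuming each old_id appears at most once, since duplicates would multiply rows:
merged = df_student.merge(df_updated_id, how='left',
                          left_on='student_Id', right_on='old_id')
df_student['student_Id'] = merged['new_id'].fillna(merged['student_Id'])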

Pandas: index-derived column with specific increments based on other columns

I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel'
])
Now what I need to do is to add sort order to this dataframe. Each new "FullName" would increment it by 100 and each consecutive "ResponseLabel" for a given "FullName" would increment it by 1 (for this specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result I get in my "SortOrder" column is correct, but I wonder if there is a better, more idiomatic pandas approach?
Thank you!
The best way to do this would be to use ngroup and cumcount
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
    name_group.ngroup(ascending=False).add(1).mul(100) +
    name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
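One subtlety: ngroup(ascending=False) numbers the groups in reverse-sorted key order, which happens to coincide with the order of appearance here (SEX, EXACT_AGE, Country is reverse-alphabetical). To number groups strictly by first appearance whatever the names, pass sort=False to groupby; a sketch:
name_group = pandas_df.groupby('FullName', sort=False)
pandas_df['sort_order'] = (
    name_group.ngroup().add(1).mul(100) +
    name_group.cumcount().add(1)
)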

Pandas hierarchical sort

I have a dataframe of categories and amounts. Categories can be nested into subcategories to arbitrary depth using a colon-separated string. I wish to sort it by descending amount, but in a hierarchical fashion, as shown below.
How I need it sorted
CATEGORY AMOUNT
Transport 5000
Transport : Car 4900
Transport : Train 100
Household 1100
Household : Utilities 600
Household : Utilities : Water 400
Household : Utilities : Electric 200
Household : Cleaning 100
Household : Cleaning : Bathroom 75
Household : Cleaning : Kitchen 25
Household : Rent 400
Living 250
Living : Other 150
Living : Food 100
EDIT:
The data frame:
df = pd.DataFrame({
"category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
"amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})
Note: this is the order I want it. It may be in any arbitrary order before the sort.
EDIT2:
If anyone looking for a similar solution I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure
One way could be to first str.split the category column.
df_ = df['category'].str.split(' : ', expand=True)
print (df_.head())
0 1 2
0 Transport None None
1 Transport Car None
2 Transport Train None
3 Household None None
4 Household Utilities None
Then take the amount column; what you want is the maximum amount per group, based on:
the first column alone,
then the first and second columns,
then the first, second and third columns, ...
You can do this with groupby.transform('max'), concatenating the column created at each level.
s = df['amount']
l_cols = list(df_.columns)
dfa = pd.concat([s.groupby([df_[col] for col in range(0, lv+1)]).transform('max')
                 for lv in l_cols], keys=l_cols, axis=1)
print (dfa)
0 1 2
0 5000 NaN NaN
1 5000 4900.0 NaN
2 5000 100.0 NaN
3 1100 NaN NaN
4 1100 600.0 NaN
5 1100 600.0 400.0
6 1100 600.0 200.0
7 1100 100.0 NaN
8 1100 100.0 75.0
9 1100 100.0 25.0
10 1100 400.0 NaN
11 250 NaN NaN
12 250 150.0 NaN
13 250 100.0 NaN
Now you just need to sort_values on all columns in the right order (first 0, then 1, then 2), get the index, and use loc to reorder df in the expected way:
dfa = dfa.sort_values(l_cols, na_position='first', ascending=False)
dfs = df.loc[dfa.index] #here you can reassign to df directly
print (dfs)
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
6 Household : Utilities : Electric 200
10 Household : Rent 400 #here is the one difference with this data
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
11 Living 250
12 Living : Other 150
13 Living : Food 100
I packaged @Ben.T's answer into a more generic function; hopefully this is clearer to read!
EDIT: I have changed the function to group by columns in order rather than one by one, to address potential issues noted by @Ben.T in the comments.
import pandas as pd
def category_sort_df(df, sep, category_col, numeric_col, ascending=False):
    '''Sorts dataframe by nested categories using `sep` as the delimiter for `category_col`.
    Sorts numeric columns in descending order by default.
    Returns a copy.'''
    df = df.copy()
    try:
        to_sort = pd.to_numeric(df[numeric_col])
    except ValueError:
        print(f'Column `{numeric_col}` is not numeric!')
        raise
    categories = df[category_col].str.split(sep, expand=True)
    # Strip any whitespace before and after sep
    categories = categories.apply(lambda x: x.str.strip())
    levels = list(categories.columns)
    to_concat = []
    for level in levels:
        # Group by all columns up to this level, in order
        level_by = [categories[col] for col in range(0, level + 1)]
        gb = to_sort.groupby(level_by)
        to_concat.append(gb.transform('max'))
    dfa = pd.concat(to_concat, keys=levels, axis=1)
    ixs = dfa.sort_values(levels, na_position='first', ascending=ascending).index
    df = df.loc[ixs].copy()
    return df
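For example, with the dataframe from the question:
df_sorted = category_sort_df(df, ' : ', 'category', 'amount')
print(df_sorted)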
Using Python 3.7.3, pandas 0.24.2
To answer my own question: I found a way. It's kind of long-winded, but here it is.
import numpy as np
import pandas as pd
def sort_tree_df(df, tree_column, sort_column):
    sort_key = sort_column + '_abs'
    df[sort_key] = df[sort_column].abs()
    df.index = pd.MultiIndex.from_frame(
        df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
    sort_columns = [df[tree_column].values, df[sort_key].values] + [
        df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
        for x in range(df.index.nlevels - 1, 0, -1)
    ]
    sort_indexes = np.lexsort(sort_columns)
    df_sorted = df.iloc[sort_indexes[::-1]]
    df_sorted.reset_index(drop=True, inplace=True)
    df_sorted.drop(sort_key, axis=1, inplace=True)
    return df_sorted
sort_tree_df(df, 'category', 'amount')
If you don't mind adding an extra column, you can extract the main category from the category and then sort by main category, amount and category, i.e.:
df['main_category'] = df.category.str.extract(r'^([^ ]+)')
df.sort_values(['main_category', 'amount', 'category'], ascending=False)[['category', 'amount']]
Output:
category amount
0 Transport 5000
1 Transport : Car 4900
2 Transport : Train 100
11 Living 250
12 Living : Other 150
13 Living : Food 100
3 Household 1100
4 Household : Utilities 600
5 Household : Utilities : Water 400
10 Household : Rent 400
6 Household : Utilities : Electric 200
7 Household : Cleaning 100
8 Household : Cleaning : Bathroom 75
9 Household : Cleaning : Kitchen 25
Note that this will work well only if your main categories are single words without spaces. Otherwise you will need to do it in a different way, i.e. extract everything up to the colon and strip the trailing space:
df['main_category'] = df.category.str.extract(r'^([^:]+)')
df['main_category'] = df.main_category.str.rstrip()
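Both steps can also be combined into one line; expand=False makes str.extract return a Series directly:
df['main_category'] = df['category'].str.extract(r'^([^:]+)', expand=False).str.rstrip()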
