Flatten a column in pandas Dataframe - python

I have a JSON string like the following:
js = """[{"id": 13,
          "kits": [{"kit": "KIT1216A", "quantity_parts": 80, "quantity_kit": 1},
                   {"kit": "KIT1216B", "quantity_parts": 680, "quantity_kit": 11}],
          "transaction_date": "2020-11-27T05:02:03.822000Z",
          "dispatch_date": "2020-11-27T05:02:05.919000Z",
          "transaction_no": 2005,
          "transporter_name": "TCI",
          "vehicle_details": "hr55ab3337",
          "invoice_number": "355733019",
          "remarks": "0",
          "sending_location": 11,
          "owner": 4}]"""
Here kits is a list that contains multiple dictionaries.
How do I flatten the dataframe I created from it so that the data in kits is included in the row itself?
I simply tried:
data = json.loads(js)
df = pd.DataFrame(data)
output:
id kits transaction_date dispatch_date transaction_no transporter_name vehicle_details invoice_number remarks sending_location owner
0 13 [{'kit': 'KIT1216A', 'quantity_parts': 80, 'qu... 2020-11-27T05:02:03.822000Z 2020-11-27T05:02:05.919000Z 2005 TCI hr55ab3337 355733019 0 11 4
Desired Output:

Use json_normalize:
data = json.loads(js)
cols = ['id','transaction_date','dispatch_date','transaction_no','transporter_name',
'vehicle_details','invoice_number','remarks','sending_location','owner']
df = pd.json_normalize(data, 'kits', cols)
print (df)
kit quantity_parts quantity_kit id transaction_date \
0 KIT1216A 80 1 13 2020-11-27T05:02:03.822000Z
1 KIT1216B 680 11 13 2020-11-27T05:02:03.822000Z
dispatch_date transaction_no transporter_name \
0 2020-11-27T05:02:05.919000Z 2005 TCI
1 2020-11-27T05:02:05.919000Z 2005 TCI
vehicle_details invoice_number remarks sending_location owner
0 hr55ab3337 355733019 0 11 4
1 hr55ab3337 355733019 0 11 4
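An alternative sketch, not from the answer: stay on the DataFrame and use explode, reusing data from above; the column names come straight from the kit dictionaries.
data = json.loads(js)
df = pd.DataFrame(data).explode('kits').reset_index(drop=True)
kits = pd.DataFrame(df.pop('kits').tolist())   # one column per key of each kit dict
df = pd.concat([kits, df], axis=1)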

Adding columns to a data frame with data coming from multiple dataframes

I want this dataframe as the outcome:
int = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
"sales_year_data":['15','205','24','920','1310'],
"country_image_data":['2','5','-6','7','-1'],
}
df_inv = pd.DataFrame(int)
The data in column "sales_year_data" of df_inv comes from df1:
sales_year_data = {
"country":['Spain','France','Germany','Belgium','Italy'],
"2014":['45','202','24','216','219'],
"2015":['15','55','214','2016','209'],
"2016":['615','2333','205','207','920'],
"2017":['1215','255','234','2116','101'],
"2018":['415','1320','214','2516','2019'],
"2019":['215','220','5614','416','1310'],
"2020":['205','202','44','296','2011'],
}
df1 = pd.DataFrame(sales_year_data)
As you can see in the column "sales_year_data" of df_inv, the number 15 is the intersection in df1 between year 2015 and Spain, the number 205 is in the intersection between Spain and 2020, 24 is in the intersection between Germany and 2014 and so on.
The data in column "country_image_data" of df_inv comes from df2:
country_change_data = {
"country":['Spain','Spain','Germany','Italy','Italy'],
"2014":['4','2','-6','6','9'],
"2015":['2','5','-5','2','3'],
"2016":['5','3','5','7','9'],
"2017":['8','7','5','6','1'],
"2018":['5','1','4','6','2'],
"2019":['1','2','4','6','-1'],
"2020":['5','2','4','6','2'],
}
df2 = pd.DataFrame(country_change_data)
As you can see in the column "country_image_data" of df_inv, the number 2 is the intersection in df2 between year 2015 and Spain, the number 5 is in the intersection between Spain and 2020, -6 is in the intersection between Germany and 2014, and so on.
If my original dataframe is:
inv = {
"vendor":['A','B','C','D','E'],
"country":['Spain','Spain','Germany','Italy','Italy'],
"yeardum":['2015','2020','2014','2016','2019'],
}
df0 = pd.DataFrame(inv)
How could I automate the lookups across df1 and df2 at the intersections of interest to build df_inv, starting from df0?
This does it.
sales_counters = {}
country_counters = {}
new_df_data = []
for row in df0.iloc:
    c = row['country']
    y = row['yeardum']
    sales_idx = sales_counters[c] = sales_counters.get(c, -1) + 1
    country_idx = country_counters[c] = country_counters.get(c, -1) + 1
    d1 = df1[df1['country'] == c]
    d2 = df2[df2['country'] == c]
    sales_year = d1.iloc[min(sales_idx, d1.shape[0]-1)][y]
    country_image = d2.iloc[min(country_idx, d2.shape[0]-1)][y]
    new_df_data.append([sales_year, country_image])
df0 = pd.concat([df0, pd.DataFrame(new_df_data)], axis=1).rename({0: 'sales_year_data', 1: 'country_image_data'}, axis=1)
Test:
>>> df0
vendor country yeardum sales_year_data country_image_data
0 A Spain 2015 15 2
1 B Spain 2020 205 2
2 C Germany 2014 24 -6
3 D Italy 2016 920 7
4 E Italy 2019 1310 -1
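For tables where every country appears only once (true for df1 but not for df2, which is why the per-country counters above are needed), a shorter lookup sketch is possible; this is an illustration under that assumption, not part of the answer, and starts again from the original df0:
lookup1 = df1.set_index('country')
df0['sales_year_data'] = [lookup1.at[c, y] for c, y in zip(df0['country'], df0['yeardum'])]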

How to merge two unequal rows of a pandas dataframe, where one column value is to match and another column is to be added?

I have the following pandas dataframe:
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(data=d)
print(df)
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 4321 2013 6
6 9567 2002 150
7 1169 2012 47
I now want to merge two rows of the DataFrame that have two different IDs, so that ultimately only one remains. The merge should only take place if the values in the "YEAR" column match. The values in the "VALUE" column should be added.
The output should look like this:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 1169 2012 47
Line 1 and line 5 have been merged. Line 5 is removed and line 1 remains with the previous ID, but the VALUEs of line 1 and line 5 have been added.
I would like to specify later which two lines or which two IDs should be merged. One of the two should always remain. The two IDs to be merged come from another function.
I experimented with the groupby() function, but I don't know how to merge two different IDs there. I only managed it with identical values in the "ID" column. That looked like this:
df.groupby(['ID', 'YEAR'])['VALUE'].sum().reset_index(name ='VALUE')
Unfortunately, even after extensive searching, I have not found anything suitable. I would be very happy if someone can help me! I would like to apply the whole thing later to a much larger DataFrame with more rows. Thanks in advance and best regards!
Try this, just group on 'ID' and take the max YEAR and sum VALUE:
df.groupby('ID', as_index=False).agg({'YEAR':'max', 'VALUE':'sum'})
Output:
ID YEAR VALUE
0 1234 2013 27
1 4321 2013 6
Or group on year and take first ID:
df.groupby('YEAR', as_index=False).agg({'ID':'first', 'VALUE':'sum'})
Output:
YEAR ID VALUE
0 2012 1234 3
1 2013 1234 30
Based on all the comments and the update to the question, it sounds like the following logic (maybe not this exact code) is required...
Try:
import pandas as pd
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)
df['ID'] = df['ID'].astype(int)
def correctRows(l, i):
    for x in l:
        if df.loc[x, 'YEAR'] == df.loc[i, 'YEAR']:
            row = x
            break
    return row

def mergeRows(a, b):
    rowa = list(df[df['ID'] == a].index)
    rowb = list(df[df['ID'] == b].index)
    if len(rowa) > 1:
        if type(rowb) == list:
            rowa = correctRows(rowa, rowb[0])
        else:
            rowa = correctRows(rowa, rowb)
    else:
        rowa = rowa[0]
    if len(rowb) > 1:
        if type(rowa) == list:
            rowb = correctRows(rowb, rowa[0])
        else:
            rowb = correctRows(rowb, rowa)
    else:
        rowb = rowb[0]
    print('Keeping:', df.loc[rowa].to_string().replace('\n', ', ').replace('  ', ' '))
    print('Dropping:', df.loc[rowb].to_string().replace('\n', ', ').replace('  ', ' '))
    df.loc[rowa, 'VALUE'] = df.loc[rowa, 'VALUE'] + df.loc[rowb, 'VALUE']
    df.drop(df.index[rowb], inplace=True)
    df.reset_index(drop=True, inplace=True)
    return None
# Add two IDs. The first 'ID' is kept; the second is dropped, but the 'VALUE'
# of the second is added to the 'VALUE' of the first.
# Note the df['ID'].astype(int) line near the start, hence integer IDs are required.
# mergeRows(4321, 1234)
mergeRows(1234, 4321)
Outputs:
Keeping: ID 1234, YEAR 2013, VALUE 24
Dropping: ID 4321, YEAR 2013, VALUE 6
Frame now looks like:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30 #<-- sum of 6 + 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 169 2012 47
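A hypothetical shortcut on the original string-typed df from the question, not part of either answer: relabel the ID to be dropped, then aggregate. Note that this relabels that ID's rows even when the YEARs do not match, which differs from the rule stated above, so it only fits cases like 1234/4321 where the years line up.
keep, drop = '1234', '4321'   # hypothetical pair to merge
merged = (df.assign(ID=df['ID'].replace({drop: keep}))
            .groupby(['ID', 'YEAR'], as_index=False, sort=False)['VALUE'].sum())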

Pandas: index-derived column with specific increments based on other columns

I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
    ["SEX", "Male"],
    ["SEX", "Female"],
    ["EXACT_AGE", None],
    ["Country", "Afghanistan"],
    ["Country", "Albania"]],
    columns=['FullName', 'ResponseLabel'])
Now what I need to do is add a sort order to this dataframe. Each new "FullName" increments it by 100, and each consecutive "ResponseLabel" for a given "FullName" increments it by 1 (within that specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result I get in my "SortOrder" column is correct, but I wonder if there is a better pandas approach?
Thank you!
The best way to do this would be to use ngroup and cumcount
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
    name_group.ngroup(ascending=False).add(1).mul(100) +
    name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
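A variant sketch, not part of the answer: if the 100-blocks must follow the order of appearance of each "FullName" rather than the reverse-sorted group names that ngroup(ascending=False) happens to match here, factorize makes that ordering explicit.
codes, _ = pd.factorize(pandas_df['FullName'])   # 0, 0, 1, 2, 2 in order of appearance
pandas_df['sort_order'] = (codes + 1) * 100 + pandas_df.groupby('FullName').cumcount() + 1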

Pandas dataframe manipulation to summarize days until certain status appears

What is the most efficient way to transform this dataframe:
date transaction_ids id status
2020-01-01 00 1 approved
2020-02-01 00 1 approved
2020-03-01 25 1 disapproved
2020-03-08 33 1 trying_to_approve
2020-03-21 25 1 disapproved
2020-03-22 00 1 approved
2020-04-01 00 1 approved
2020-01-01 00 2 approved
2020-02-01 25 2 disapproved
2020-03-01 25 2 disapproved
2020-04-01 25 2 disapproved
To this one:
id status days_until_approve transaction_ids_not_approved total_transaction_ids
1 approved 21 [25, 33, 25] 3
Here days_until_approve is the time, in days, between 2020-03-01, the first disapproved for id 1, and 2020-03-22, the first approved (00) after that disapproval.
Yes, use groupby and define your own function that computes each of the output columns:
# make sure date is a datetime object
df['date'] = pd.to_datetime(df['date'])

def fun(df):
    r = pd.DataFrame(columns=['status', 'days_until_approve',
                              'transaction_ids_not_approved',
                              'total_transaction_ids'])
    id = df.id.max()
    r.loc[id, :] = None
    r.loc[id, 'status'] = df.query('date == date.max()').status.iloc[0]
    first_disapproved = df[df.status != 'approved'].index[0]
    next_approve = df[(df.status == 'approved') &
                      (df.index > first_disapproved)]
    if next_approve.empty is False:
        next_approve = next_approve.index[0]
        r.loc[id, 'days_until_approve'] = (df.iloc[next_approve].date -
                                           df.iloc[first_disapproved].date).days
    else:
        r.loc[id, 'days_until_approve'] = None
    r.loc[id, 'transaction_ids_not_approved'] = df.query(
        'status != "approved"'
    ).transaction_ids.to_list()
    r.loc[id, 'total_transaction_ids'] = df.transaction_ids.nunique()
    return r
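# The call that builds `t` is not shown in the answer; a minimal sketch
# (an assumption) would apply fun once per id group:
t = pd.concat(fun(g) for _, g in df.groupby('id'))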
t.head()
# status days_until_approve transaction_ids_not_approved total_transaction_ids
# id
# 1 approved 21 [25, 33, 25] 3
# 2 disapproved None [25, 25, 25] 2
df['date'] = pd.to_datetime(df['date'])
# Create new dataframe to populate
column_names = ['id', 'status', 'days_until_approve',
                'transaction_ids_not_approved', 'total_transaction_ids']
new_df = pd.DataFrame(columns=column_names)
for id, group in df.groupby('id'):
    # Set values for the new dataframe
    new_df.at[id, 'total_transaction_ids'] = group['transaction_ids'].nunique()
    new_df.at[id, 'id'] = id
    new_df.at[id, 'status'] = 'approved' if (group['status'].isin(['approved'])).any() else 'not approved'
    # Find transaction ids that were never approved
    t_ids = group['transaction_ids'].unique()
    not_approved = [str(i) for i in t_ids
                    if ((group['transaction_ids'] == i) & (~group['status'].isin(['approved']))).any()]
    new_df.at[id, 'transaction_ids_not_approved'] = ','.join(not_approved)
    # Find the number of days it took to approve for each id group
    first_rows_dis = [i for i, row in group.iterrows()
                      if row['status'] == 'disapproved'
                      and ((group.index > i) & (group['status'].isin(['approved']))).any()]
    if first_rows_dis:
        first_row = min(first_rows_dis)
        first_disapproved_date = group.loc[first_row, 'date']
        first_approved_date = group.loc[(group.index > first_row) & (group['status'].isin(['approved'])), 'date'].min()
        diff_days = (first_approved_date - first_disapproved_date).days
        new_df.at[id, 'days_until_approve'] = diff_days
    else:
        new_df.at[id, 'days_until_approve'] = None
Output:
id status days_until_approve transaction_ids_not_approved total_transaction_ids
1 approved 21 25,33 3
2 approved None 25 2

Specifying column order following groupby aggregation

The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')
I think you can use subset:
df_out = (df.groupby(df.index_col)
            .agg({'age': np.mean, 'height': np.sum, 'weight': np.sum})
            [['age', 'height', 'weight']])
Also you can use pandas functions:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})
            [['age', 'height', 'weight']])
Sample:
df = pd.DataFrame({'name':['q','q','a','a'],
'address':['a','a','s','s'],
'age':[7,8,9,10],
'height':[1,3,5,7],
'weight':[5,3,6,8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})
            [['age', 'height', 'weight']])
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT (by suggestion): add reset_index; as_index=False does not work here if the index values are also needed:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})
            [['age', 'height', 'weight']]
            .reset_index())
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8
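A side note, not from the answer: on pandas 0.25+, named aggregation also pins the output column order explicitly; a sketch on the same sample:
df_out = (df.groupby(['name', 'address'])
            .agg(age=('age', 'mean'), height=('height', 'sum'), weight=('weight', 'sum'))
            .reset_index())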
If you care mostly about the order when written to a file and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
...     {'age': [28, 63, 28, 45],
...      'height': [183, 156, 170, 201],
...      'weight': [70.2, 62.5, 65.9, 81.0],
...      'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
...     columns=['name', 'age', 'weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
...     {'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5
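If the long float representation (136.10000000000002) in the file is unwanted, to_csv also accepts a float_format; a small optional tweak:
>>> df_out.to_csv('out.csv', sep=',', columns=['age', 'height', 'weight'], float_format='%.1f')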
