Broaden Pandas DataFrame with multiple source columns - python

Following on from this question, is it possible to perform a similar 'broaden' operation in pandas where there are multiple source columns per 'entity'?
If my data now looks like:
Box,Code,Category
Green,1221,Active
Green,8391,Inactive
Red,3709,Inactive
Red,2911,Pending
Blue,9820,Active
Blue,4530,Active
How do I most efficiently get to:
Box,Code0,Category0,Code1,Category1
Green,1221,Active,8391,Inactive
Red,3709,Inactive,2911,Pending
Blue,9820,Active,4530,Active
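(For reference, a minimal sketch to rebuild this sample frame; the variable name df matches what the answers below use, and the values come straight from the CSV above:)
import pandas as pd

df = pd.DataFrame({
    'Box': ['Green', 'Green', 'Red', 'Red', 'Blue', 'Blue'],
    'Code': [1221, 8391, 3709, 2911, 9820, 4530],
    'Category': ['Active', 'Inactive', 'Inactive', 'Pending', 'Active', 'Active'],
})  # reproduces the input table row by row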
So far, the only solution I have been able to put together that 'works' is to follow the example from the linked page and create two separate DataFrames, one grouped by Box and Code, the other grouped by Box and Category, and then merge the two together on Box.
a = get_clip.groupby('Box')['Code'].apply(list)
b = get_clip.groupby('Box')['Category'].apply(list)
broadeneda = pd.DataFrame(a.values.tolist(), index = a.index).add_prefix('Code').reset_index()
broadenedb = pd.DataFrame(b.values.tolist(), index = b.index).add_prefix('Category').reset_index()
merged = pd.merge(broadeneda, broadenedb, on='Box', how = 'inner')
Is there a way to achieve this without broadening each column separately and merging at the end?

groupby + cumcount + unstack
df1=df.assign(n=df.groupby('Box').cumcount()).set_index(['Box','n']).unstack(1)
df1.columns=df1.columns.map('{0[0]}{0[1]}'.format)
df1
Out[141]:
Code0 Code1 Category0 Category1
Box
Blue 9820 4530 Active Active
Green 1221 8391 Active Inactive
Red 3709 2911 Inactive Pending

Option 1
Using set_index, pipe, and set_axis
df.set_index(['Box', df.groupby('Box').cumcount()]).unstack().pipe(
    lambda d: d.set_axis(d.columns.map('{0[0]}{0[1]}'.format), 1, False)
)
Code0 Code1 Category0 Category1
Box
Blue 9820 4530 Active Active
Green 1221 8391 Active Inactive
Red 3709 2911 Inactive Pending
Option 2
Using defaultdict
from collections import defaultdict
d = defaultdict(dict)
for a, *b in df.values:
    i = len(d[a]) // len(b)
    c = (f'Code{i}', f'Category{i}')
    d[a].update(dict(zip(c, b)))
pd.DataFrame.from_dict(d, 'index').rename_axis('Box')
Code0 Category0 Code1 Category1
Box
Blue 9820 Active 4530 Active
Green 1221 Active 8391 Inactive
Red 3709 Inactive 2911 Pending

This can be done by iterating over sub-dataframes:
cols = ["Box","Code0","Category0","Code1","Category1"]
newdf = pd.DataFrame(columns = cols) # create an empty dataframe to be filled
for box in pd.unique(df.Box): # for each color in Box
subdf = df[df.Box == box] # get a sub-dataframe
newrow = subdf.values[0].tolist() # get its values and then its full first row
newrow.extend(subdf.values[1].tolist()[1:3]) # add second and third entries of second row
newdf = pd.concat([newdf, pd.DataFrame(data=[newrow], columns=cols)], axis=0) # add to new dataframe
print(newdf)
Output:
Box Code0 Category0 Code1 Category1
0 Green 1221.0 Active 8391.0 Inactive
0 Red 3709.0 Inactive 2911.0 Pending
0 Blue 9820.0 Active 4530.0 Active

It seems that rows with the same color appear consecutively and that each color has the same number of rows. (Two important assumptions.) Thus, we can split the df into the first row of each pair, df[::2], and the second row of each pair, df[1::2], and then merge them together.
pd.merge(df[::2], df[1::2], on="Box")
Box Code_x Category_x Code_y Category_y
0 Green 1221 Active 8391 Inactive
1 Red 3709 Inactive 2911 Pending
2 Blue 9820 Active 4530 Active
One can rename it easily by resetting its columns.
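For example, one way to do that renaming (a small sketch, assuming the merged result above is assigned to a variable):
merged = pd.merge(df[::2], df[1::2], on="Box")
merged.columns = ['Box', 'Code0', 'Category0', 'Code1', 'Category1']  # replace the _x/_y suffixes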

Related

Populate Pandas dataframe with group_by calculations made in Pandas series

I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType Colour
Car Green 2
Yellow 1
Truck Black 1
Green 1
dtype: int64
What I'd like to do next is to extract the calculated group-by values from the Pandas Series and put them into the Frequency column of the Pandas dataframe. I've tried various approaches but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series=grp_by_series.reset_index()
res=df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
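A small variation on the same idea (a sketch, not part of the original answer) gives the size column an explicit name up front, so you don't have to refer to it as res[0]:
counts = df.groupby(['VehicleType', 'Colour']).size().reset_index(name='Freq')  # 'Freq' is an arbitrary name
df['Frequency'] = df[['VehicleType', 'Colour']].merge(counts, how='left')['Freq'].to_numpy()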
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
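If you would rather not depend on a particular value column (note that 'count' skips NaNs in Year), a size-based variant should behave the same here; this is a sketch, not part of the original answer:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Colour'].transform('size')  # size counts rows regardless of NaN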

parse all col names and create new columns

date red,heavy,new blue,light,old
1-2-20 320 120
2-3-20 220 125
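(A minimal way to reproduce this wide frame for the answers below; the name df is assumed:)
import pandas as pd

df = pd.DataFrame({
    'date': ['1-2-20', '2-3-20'],
    'red,heavy,new': [320, 220],
    'blue,light,old': [120, 125],
})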
I want to iterate through all rows and columns so that I can parse the column names and use them as values for new columns. I want to get data of this format:
I want the dates to be repeated. The 'value' column comes from the original table.
date   color weight condition value
1-2-20 red   heavy  new       320
1-2-20 blue  light  old       120
2-3-20 red   heavy  new       220
I tried this, and it worked when I only had one column:
colName = df_retransform.columns[1]
lst = colName.split(",")
color = lst[0]
weight = lst[1]
condition = lst[2]
df_retransform.rename(columns={colName: 'value'}, inplace=True)
df_retransform['color'] = color
df_retransform['weight'] = weight
df_retransform['condition'] = condition
but I am unable to modify it so that it works for all columns.
Use DataFrame.melt with Series.str.split; DataFrame.pop both uses and drops the variable column. Finally, change the order of the column names if necessary.
First you can test whether all columns other than date contain exactly 2 commas:
print ([col for col in df.columns if col.count(',') != 2])
['date']
df = df.melt('date')
df[['color', 'weight', 'condition']] = df.pop('variable').str.split(',', expand=True)
df = df[['date', 'color', 'weight', 'condition', 'value']]
print (df)
date color weight condition value
0 1-2-20 red heavy new 320
1 2-3-20 red heavy new 220
2 1-2-20 blue light old 120
3 2-3-20 blue light old 125
Or use DataFrame.stack for a MultiIndex Series, then split the index and recreate all levels as new columns:
print (df)
date red,heavy,new blue,light,old
0 1-2-20 320 NaN
1 NaN 220 125.0
s = df.set_index('date').stack(dropna=False)
s.index = pd.MultiIndex.from_tuples([(i, *j.split(',')) for i, j in s.index],
                                    names=['date', 'color', 'weight', 'condition'])
df = s.reset_index(name='value')
print (df)
date color weight condition value
0 1-2-20 red heavy new 320.0
1 1-2-20 blue light old NaN
2 NaN red heavy new 220.0
3 NaN blue light old 125.0
You could also use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(index="date",
names_to=("color", "weight", "condition"),
names_sep=",")
date color weight condition value
0 1-2-20 red heavy new 320
1 2-3-20 red heavy new 220
2 1-2-20 blue light old 120
3 2-3-20 blue light old 125
You pass the names of the new columns to names_to, and specify the separator (,) in names_sep.
If you want the result returned in order of appearance, you can pass True to the sort_by_appearance argument:
df.pivot_longer(
    index="date",
    names_to=("color", "weight", "condition"),
    names_sep=",",
    sort_by_appearance=True,
)
date color weight condition value
0 1-2-20 red heavy new 320
1 1-2-20 blue light old 120
2 2-3-20 red heavy new 220
3 2-3-20 blue light old 125

Python Dataframe: Dropping duplicates base on certain conditions

Dataframe with duplicate Shop IDs where some Shop IDs occurred twice and some occurred thrice:
I only want to keep unique Shop IDs, based on the shortest Shop Distance assigned to its Area.
Area Shop Name Shop Distance Shop ID
0 AAA Ly 86 5d87790c46a77300
1 AAA Hi 230 5ce5522012138400
2 BBB Hi 780 5ce5522012138400
3 CCC Ly 450 5d87790c46a77300
...
91 MMM Ju 43 4f76d0c0e4b01af7
92 MMM Hi 1150 5ce5522012138400
...
Using pandas drop_duplicates drops the duplicate rows, but the condition is based on the first/last occurrence of each Shop ID, which does not allow me to pick by distance:
shops_df = shops_df.drop_duplicates(subset='Shop ID', keep= 'first')
I also tried to group by Shop ID and then sort, but the sort returns an error about duplicates:
bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)
So far i tried doing up till this stage:
# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep= False)]
# create a mask for all unique Shop ID
mask = df_toclean['Shop ID'].value_counts()
# create a mask for the Shop ID that occurred 2 times
shop_2 = mask[mask==2].index
# create a mask for the Shop ID that occurred 3 times
shop_3 = mask[mask==3].index
# create a mask for the Shops that are under radius 750
dist_1 = df_toclean['Shop Distance']<=750
# returns results for all the Shop IDs that appeared twice and under radius 750
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results.
I think I'm doing it the long way and still haven't figured out how to drop the duplicates. Does anyone know how to solve this in a shorter way? I'm new to Python, thanks for helping out!
Try to first sort the dataframe based on distance, then drop the duplicate shops.
df = shops_df.sort_values('Shop Distance')
df = df[~df['Shop ID'].duplicated()]  # The tilde (~) inverts the boolean mask.
Or just as one chained expression (per comment from #chmielcode).
df = (
    shops_df
    .sort_values('Shop Distance')
    .drop_duplicates(subset='Shop ID', keep='first')
    .reset_index(drop=True)  # Optional.
)
You can use idxmin:
df.loc[df.groupby('Area')['Shop Distance'].idxmin()]
Area Shop Name Shop Distance Shop ID
0 AAA Ly 86 5d87790c46a77300
2 BBB Hi 780 5ce5522012138400
3 CCC Ly 450 5d87790c46a77300
4 MMM Ju 43 4f76d0c0e4b01af7
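If you need strictly one row per Shop ID (the shortest distance for each shop regardless of Area), the same idxmin idea applies to a Shop ID grouping; this is a sketch along the same lines, not taken from the answer above:
df.loc[df.groupby('Shop ID')['Shop Distance'].idxmin()]  # keep the closest row for each Shop ID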

Creating function to filter and calculate division of rows based on filter?

I have a df such as below:
I am using simple code, shown below, that filters the df on a column and then does some simple math based on the column's value.
For example, when the Status values are Cancelled, Processed, and so on, I want to calculate the % or number of rows that were cancelled out of the entire df (all the rows).
df looks like:
ID | Status | Color
555 Cancelled Green
434 Processed Red
212 Cancelled Blue
121 Cancelled Green
242 Cancelled Blue
352 Processed Green
343 Processed Blue
The Code Im currently using is:
df[df['Color'] == 'Green']
df[(df['Status']=='Cancelled') & (df['Color']=='Green')]
Meaning, for each different color I first manually filter the df to get the # of rows, then filter again to get the # of cancelled orders, and then manually divide that # by the # of just the green rows.
If I wanted to create a function where I can insert the color name and the status and do the math that way in a simple function what would be the best approach for that?
Expected Output would be something like:
Status Green
Cancelled 0.666667
Processed 0.333333
dtype: float64
Thanks so much!
You can use groupby and len():
df.groupby(by='Status').apply(lambda x: len(x)/len(df))
Status
Cancelled 0.666667
Processed 0.333333
dtype: float64
Breakdown by both Status and Color:
cc = df.groupby(by='Color').ID.count()
df.groupby(by=['Color', 'Status']).apply(lambda x: len(x)/cc.loc[x.Color.iloc[0]])
Color Status
Blue Cancelled 0.666667
Processed 0.333333
Green Cancelled 0.666667
Processed 0.333333
Red Processed 1.000000
dtype: float64
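To turn this into the reusable function the question asks for, a minimal sketch (the function name and signature are assumptions, not an established API):
def status_share(df, color, status='Cancelled'):
    """Fraction of rows of the given color that have the given status."""
    subset = df[df['Color'] == color]
    if subset.empty:
        return 0.0
    return (subset['Status'] == status).mean()

status_share(df, 'Green')                # 0.666... (2 of 3 Green rows are Cancelled)
status_share(df, 'Blue', 'Processed')    # 0.333...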

Broaden pandas dataframe

I have data that looks like this:
Box,Code
Green,1221
Green,8391
Red,3709
Red,2911
Blue,9820
Blue,4530
Using a pandas dataframe, I'm wondering if it is possible to output something like this:
Box,Code1,Code2
Green,1221,8391
Red,3709,2911
Blue,9820,4530
My data always has an equal number of rows per 'Box'.
I've been experimenting with pivots and crosstabs (as well as stack and unstack) in pandas but haven't found anything that gets me to the 'broaden' result I'm looking for.
You can use groupby to collect lists and then the DataFrame constructor:
a = df.groupby('Box')['Code'].apply(list)
df = pd.DataFrame(a.values.tolist(), index=a.index).add_prefix('Code').reset_index()
print (df)
Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
Or use cumcount to create a new Series and pass it to pandas.pivot:
g = df.groupby('Box').cumcount()
df = pd.pivot(index=df['Box'], columns=g, values=df['Code']).add_prefix('Code').reset_index()
print (df)
Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
And a similar solution with unstack:
df['g'] = df.groupby('Box').cumcount()
df = df.set_index(['Box', 'g'])['Code'].unstack().add_prefix('Code').reset_index()
print (df)
g Box Code0 Code1
0 Blue 9820 4530
1 Green 1221 8391
2 Red 3709 2911
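Note that the question's expected output starts numbering at Code1 rather than Code0; if that matters, one option (a small tweak to the last solution above, assuming the original df) is to shift the counter before pivoting:
df['g'] = df.groupby('Box').cumcount() + 1
df = df.set_index(['Box', 'g'])['Code'].unstack().add_prefix('Code').reset_index()  # columns become Code1, Code2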
