I have data that looks like this:
Box,Code
Green,1221
Green,8391
Red,3709
Red,2911
Blue,9820
Blue,4530
Using a pandas dataframe, I'm wondering if it is possible to output something like this:
Box,Code1,Code2
Green,1221,8391
Red,3709,2911
Blue,9820,4530
My data always has an equal number of rows per 'Box'.
I've been experimenting with pivots and crosstabs (as well as stack and unstack) in pandas but haven't found anything that gets me to the 'broaden' result I'm looking for.
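For reference, the sample data can be reconstructed with a minimal sketch like this (assuming the CSV text above verbatim):
import io
import pandas as pd

data = """Box,Code
Green,1221
Green,8391
Red,3709
Red,2911
Blue,9820
Blue,4530"""
df = pd.read_csv(io.StringIO(data))  # two rows per Box, as stated in the question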
You can use groupby to aggregate the codes into lists and then pass them to the DataFrame constructor:
a = df.groupby('Box')['Code'].apply(list)
df = pd.DataFrame(a.values.tolist(), index=a.index).add_prefix('Code').reset_index()
print(df)
     Box  Code0  Code1
0   Blue   9820   4530
1  Green   1221   8391
2    Red   3709   2911
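Note that the DataFrame constructor assigns integer column labels starting at 0, which is why the result has Code0/Code1 rather than the Code1/Code2 shown in the question. Because every Box has the same number of rows, all the lists have equal length; unequal lists would be padded with None.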
Or use cumcount to build a helper column and reshape with DataFrame.pivot:
g = df.groupby('Box').cumcount()
df = (df.assign(g=g)
        .pivot(index='Box', columns='g', values='Code')
        .add_prefix('Code')
        .reset_index())
print(df)
g    Box  Code0  Code1
0   Blue   9820   4530
1  Green   1221   8391
2    Red   3709   2911
A similar solution with unstack:
df['g'] = df.groupby('Box').cumcount()
df = df.set_index(['Box', 'g'])['Code'].unstack().add_prefix('Code').reset_index()
print(df)
g    Box  Code0  Code1
0   Blue   9820   4530
1  Green   1221   8391
2    Red   3709   2911
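The stray g in the printed header is the name of the column axis, inherited from the helper column; appending rename_axis(None, axis=1) removes it if it bothers you.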
Related
So I have my first dataframe, which has countries as headers and infected and dead counts as subheaders,
df
Dates       Antigua & Barbuda   Australia
            Infected     Dead   Infected     Dead
2020-01-22         0        0          0        0 ...
2020-01-23         0        0          0        0 ...
...
then I have my second dataframe,
df_indicators
Dates       Location     indicator_1  indicator_2 .....
2020-04-24  Afghanistan            0            0
2020-04-25  Afghanistan            0            0
...
2020-04-24  Yemen                  0            0
2020-04-25  Yemen                  0            0
I want to merge the dataframes so that the indicator columns become subheaders under each country, just like the Infected and Dead subheaders in df.
What I want to produce is something like this,
df_merge
Dates       Antigua & Barbuda
            Infected  Dead  indicator_1  indicator_2 ....
2020-04-24         0     0            0            0 ...
There are so many indicators, all named differently, that I don't feel I can list them all, so I'm not sure if there's a way to do this easily.
Thank you in advance for any help!
Because there are duplicates, first aggregate by mean, then reshape with Series.unstack and DataFrame.swaplevel:
df2 = df_indicators.groupby(['Dates','Location']).mean().unstack().swaplevel(0,1,axis=1)
Or with DataFrame.pivot_table:
df2 = (df_indicators.pivot_table(index='Dates', columns='Location', aggfunc='mean')
                    .swaplevel(0, 1, axis=1))
Finally, join and sort the MultiIndex columns:
df = pd.concat([df, df2], axis=1).sort_index(axis=1)
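As a sanity check, here is a minimal self-contained sketch of the whole pipeline; the country name, dates, and indicator_1 values are made-up stand-ins for the real data:
import pandas as pd

df = pd.DataFrame(
    [[0, 0], [1, 0]],
    index=pd.Index(['2020-04-24', '2020-04-25'], name='Dates'),
    columns=pd.MultiIndex.from_product([['Afghanistan'], ['Infected', 'Dead']]),
)
df_indicators = pd.DataFrame({
    'Dates': ['2020-04-24', '2020-04-25'],
    'Location': ['Afghanistan', 'Afghanistan'],
    'indicator_1': [5, 6],
})

df2 = df_indicators.groupby(['Dates', 'Location']).mean().unstack().swaplevel(0, 1, axis=1)
out = pd.concat([df, df2], axis=1).sort_index(axis=1)
print(out)  # Afghanistan now has Dead, Infected and indicator_1 subcolumns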
I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
  VehicleType  Colour  Year  Frequency
0       Truck   Green  2002          0
1         Car   Green  2014          0
2       Truck   Black  1975          0
3         Car  Yellow  1987          0
4         Car   Green  1987          0
I'd like it to look like this:
  VehicleType  Colour  Year  Frequency
0       Truck   Green  2002          1
1         Car   Green  2014          2
2       Truck   Black  1975          1
3         Car  Yellow  1987          1
4         Car   Green  1987          2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType  Colour
Car          Green     2
             Yellow    1
Truck        Black     1
             Green     1
dtype: int64
What I'd like to do next is to extract the calculated group_by values from the Panda series and put them into the Frequency column of the Pandas dataframe. I've tried various approaches but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series = grp_by_series.reset_index()
res = df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]
print(df)
Output:
  VehicleType  Colour  Year  Frequency
0       Truck   Green  2002          1
1         Car   Green  2014          2
2       Truck   Black  1975          1
3         Car  Yellow  1987          1
4         Car   Green  1987          2
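Note that size() produces an unnamed Series, so after reset_index() the counts end up in a column literally named 0; that is what res[0] selects. Using grp_by_series.reset_index(name='Frequency') avoids the magic name.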
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
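One caveat: 'count' only counts non-null values, so this result silently depends on Year having no NaNs. A small variant (not from the original answer) that counts rows unconditionally is 'size':
# 'size' counts rows in each group regardless of NaNs in any column:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Colour'].transform('size')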
date    red,heavy,new  blue,light,old
1-2-20            320             120
2-3-20            220             125
I want to iterate through all rows and columns so that I can parse the column names and use them as values for new columns. I want to get data in this format:
I want dates to be repeated. The 'value' col is from the original table.
date    color  weight  condition  value
1-2-20  red    heavy   new        320
1-2-20  blue   light   old        120
2-3-20  red    heavy   new        220
I tried this, and it worked when I only had one column:
colName = df_retransform.columns[1]
lst = colName.split(",")
color = lst[0]
weight = lst[1]
condition = lst[2]
df_retransform.rename(columns={colName: 'value'}, inplace=True)
df_retransform['color'] = color
df_retransform['weight'] = weight
df_retransform['condition'] = condition
but I am unable to modify it so that it works for all columns.
Use DataFrame.melt with Series.str.split; DataFrame.pop both selects and drops the variable column, and last you can reorder the columns if necessary:
First, you can test whether every column other than date contains exactly two commas:
print([col for col in df.columns if col.count(',') != 2])
['date']
df = df.melt('date')
df[['color', 'weight', 'condition']] = df.pop('variable').str.split(',', expand=True)
df = df[['date', 'color', 'weight', 'condition', 'value']]
print(df)
     date color weight condition value
0  1-2-20   red  heavy       new   320
1  2-3-20   red  heavy       new   220
2  1-2-20  blue  light       old   120
3  2-3-20  blue  light       old   125
Or use DataFrame.stack to get a MultiIndex Series, then split the labels and rebuild all the index levels as new columns:
print(df)
     date  red,heavy,new  blue,light,old
0  1-2-20            320             NaN
1     NaN            220           125.0
s = df.set_index('date').stack(dropna=False)
s.index = pd.MultiIndex.from_tuples([(i, *j.split(',')) for i, j in s.index],
                                    names=['date', 'color', 'weight', 'condition'])
df = s.reset_index(name='value')
print(df)
     date color weight condition  value
0  1-2-20   red  heavy       new  320.0
1  1-2-20  blue  light       old    NaN
2     NaN   red  heavy       new  220.0
3     NaN  blue  light       old  125.0
You could also use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(index="date",
names_to=("color", "weight", "condition"),
names_sep=",")
date color weight condition value
0 1-2-20 red heavy new 320
1 2-3-20 red heavy new 220
2 1-2-20 blue light old 120
3 2-3-20 blue light old 125
You pass the names of the new columns to names_to and specify the separator (,) in names_sep.
If you want the rows returned in order of appearance, pass True to the sort_by_appearance argument:
df.pivot_longer(
    index="date",
    names_to=("color", "weight", "condition"),
    names_sep=",",
    sort_by_appearance=True,
)

     date color weight condition value
0  1-2-20   red  heavy       new   320
1  1-2-20  blue  light       old   120
2  2-3-20   red  heavy       new   220
3  2-3-20  blue  light       old   125
Following on from this question, is it possible to perform a similar 'broaden' operation in pandas where there are multiple source columns per 'entity'?
If my data now looks like:
Box,Code,Category
Green,1221,Active
Green,8391,Inactive
Red,3709,Inactive
Red,2911,Pending
Blue,9820,Active
Blue,4530,Active
How do I most efficiently get to:
Box,Code0,Category0,Code1,Category1
Green,1221,Active,8391,Inactive
Red,3709,Inactive,2911,Pending
Blue,9820,Active,4530,Active
So far, the only solution I have been able to put together that 'works' is to follow the example from the linked page: create two separate DataFrames, one grouped by Box and Code, the other grouped by Box and Category, and then merge the two together on Box.
a = get_clip.groupby('Box')['Code'].apply(list)
b = get_clip.groupby('Box')['Category'].apply(list)
broadeneda = pd.DataFrame(a.values.tolist(), index=a.index).add_prefix('Code').reset_index()
broadenedb = pd.DataFrame(b.values.tolist(), index=b.index).add_prefix('Category').reset_index()
merged = pd.merge(broadeneda, broadenedb, on='Box', how='inner')
Is there a way to achieve this without broadening each column separately and merging at the end?
groupby + cumcount + unstack
df1=df.assign(n=df.groupby('Box').cumcount()).set_index(['Box','n']).unstack(1)
df1.columns=df1.columns.map('{0[0]}{0[1]}'.format)
df1
Out[141]:
      Code0  Code1 Category0 Category1
Box
Blue   9820   4530    Active    Active
Green  1221   8391    Active  Inactive
Red    3709   2911  Inactive   Pending
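The map('{0[0]}{0[1]}'.format) call joins the two levels of each column tuple into a single label, so ('Code', 0) becomes 'Code0'.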
Option 1
Using set_index, pipe, and set_axis
df.set_index(['Box', df.groupby('Box').cumcount()]).unstack().pipe(
    lambda d: d.set_axis(d.columns.map('{0[0]}{0[1]}'.format), axis=1)
)
      Code0  Code1 Category0 Category1
Box
Blue   9820   4530    Active    Active
Green  1221   8391    Active  Inactive
Red    3709   2911  Inactive   Pending
Option 2
Using defaultdict
from collections import defaultdict

d = defaultdict(dict)
for a, *b in df.values:
    i = len(d[a]) // len(b)
    c = (f'Code{i}', f'Category{i}')
    d[a].update(dict(zip(c, b)))

pd.DataFrame.from_dict(d, 'index').rename_axis('Box')
      Code0 Category0  Code1 Category1
Box
Blue   9820    Active   4530    Active
Green  1221    Active   8391  Inactive
Red    3709  Inactive   2911   Pending
This can be done by iterating over sub-dataframes:
cols = ["Box", "Code0", "Category0", "Code1", "Category1"]
newdf = pd.DataFrame(columns=cols)                 # create an empty dataframe to be filled
for box in pd.unique(df.Box):                      # for each colour in Box
    subdf = df[df.Box == box]                      # get its sub-dataframe
    newrow = subdf.values[0].tolist()              # the full first row of the group
    newrow.extend(subdf.values[1].tolist()[1:3])   # add Code and Category from the second row
    newdf = pd.concat([newdf, pd.DataFrame(data=[newrow], columns=cols)], axis=0)
print(newdf)
Output:
     Box   Code0 Category0   Code1 Category1
0  Green  1221.0    Active  8391.0  Inactive
0    Red  3709.0  Inactive  2911.0   Pending
0   Blue  9820.0    Active  4530.0    Active
It seems that rows with the same colour appear consecutively and that every colour has the same number of rows (two important assumptions). Thus, we can split df into the even rows, df[::2], and the odd rows, df[1::2], and then merge the two together:
pd.merge(df[::2], df[1::2], on="Box")
     Box  Code_x Category_x  Code_y Category_y
0  Green    1221     Active    8391   Inactive
1    Red    3709   Inactive    2911    Pending
2   Blue    9820     Active    4530     Active
One can rename the columns easily afterwards.
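For example, a minimal sketch of that renaming step (the target names simply mirror the desired output above):
out = pd.merge(df[::2], df[1::2], on='Box')
out.columns = ['Box', 'Code0', 'Category0', 'Code1', 'Category1']
print(out)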
I hope the title is accurate enough, I wasn't quite sure how to phrase it.
Anyhow, my problem is that I have a Pandas df which looks like the following:
    Customer Source  CustomerSource
0      Apple      A             141
1      Apple      B              36
2  Microsoft      A             143
3     Oracle      C             225
4        Sun      C             151
This is a df derived from a larger dataset, and the value of CustomerSource is the accumulated count of each Customer and Source combination; for example, in this case there are 141 occurrences of Customer Apple with Source A, 225 of Customer Oracle with Source C, and so on.
What I want to do with this, is I want to do a stacked barplot which gives me all Customers on the x-axis and the values of CustomerSource stacked on top of each other on the y-axis. Similar to the below example. Any hints as to how I would proceed with this?
You can use pivot or unstack to reshape, and then DataFrame.plot.bar:
df.pivot(index='Customer', columns='Source', values='CustomerSource').plot.bar(stacked=True)

df.set_index(['Customer', 'Source'])['CustomerSource'].unstack().plot.bar(stacked=True)
Or, if there are duplicates in the Customer, Source pairs, use pivot_table or groupby with sum aggregation:
print(df)
    Customer Source  CustomerSource
0      Apple      A             141   <- same (Apple, A)
1      Apple      A             200   <- same (Apple, A)
2      Apple      B              36
3  Microsoft      A             143
4     Oracle      C             225
5        Sun      C             151
df = df.pivot_table(index='Customer', columns='Source', values='CustomerSource', aggfunc='sum')
print(df)
Source          A     B      C
Customer
Apple       341.0  36.0    NaN   <- 141 + 200 = 341
Microsoft   143.0   NaN    NaN
Oracle        NaN   NaN  225.0
Sun           NaN   NaN  151.0
(df.pivot_table(index='Customer', columns='Source', values='CustomerSource', aggfunc='sum')
   .plot.bar(stacked=True))

df.groupby(['Customer', 'Source'])['CustomerSource'].sum().unstack().plot.bar(stacked=True)
It is also possible to swap the index and columns:
df.pivot(index='Customer', columns='Source', values='CustomerSource').plot.bar(stacked=True)
df.pivot(index='Source', columns='Customer', values='CustomerSource').plot.bar(stacked=True)
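Putting one variant together into a runnable script (a sketch; the sample values are taken from the question, and matplotlib is assumed to be available):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Customer': ['Apple', 'Apple', 'Microsoft', 'Oracle', 'Sun'],
                   'Source': ['A', 'B', 'A', 'C', 'C'],
                   'CustomerSource': [141, 36, 143, 225, 151]})

# Reshape so each Source becomes a column, then draw the stacked bars:
ax = (df.pivot(index='Customer', columns='Source', values='CustomerSource')
        .plot.bar(stacked=True))
ax.set_ylabel('CustomerSource')
plt.tight_layout()
plt.show()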