Reshaping Data with Python [duplicate]

How can I melt a pandas data frame using multiple variable names and values? I have the following data frame that changes its shape in a for loop. In one of the for loop iterations, it looks like this:
ID Cat Class_A Class_B Prob_A Prob_B
1 Veg 1 2 0.9 0.1
2 Veg 1 2 0.8 0.2
3 Meat 1 2 0.6 0.4
4 Meat 1 2 0.3 0.7
5 Veg 1 2 0.2 0.8
I need to melt it in such a way that it looks like this:
ID Cat Class Prob
1 Veg 1 0.9
1 Veg 2 0.1
2 Veg 1 0.8
2 Veg 2 0.2
3 Meat 1 0.6
3 Meat 2 0.4
4 Meat 1 0.3
4 Meat 2 0.7
5 Veg 1 0.2
5 Veg 2 0.8
During the for loop, the data frame will contain a different number of classes with their probabilities. That is why I am looking for a general approach that works in every iteration of the loop. I saw this question and this one, but they were not helpful.

You need lreshape with a dict that maps each new column name to the list of columns to stack:
d = {'Class':['Class_A', 'Class_B'], 'Prob':['Prob_A','Prob_B']}
df = pd.lreshape(df,d)
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
More dynamic solution:
Class = [col for col in df.columns if col.startswith('Class')]
Prob = [col for col in df.columns if col.startswith('Prob')]
df = pd.lreshape(df, {'Class':Class, 'Prob':Prob})
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
EDIT:
lreshape is now undocumented and may be removed in a future version (possibly together with pd.wide_to_long).
A possible solution is merging all three functions into one, maybe melt, but that is not implemented yet. If a new version of pandas adds it, I will update my answer.

Or you can try this using str.contains and pd.concat (here df2 is the original data frame):
DF1=df2.loc[:,df2.columns.str.contains('_A|Cat|ID')]
name=['ID','Cat','Class','Prob']
DF1.columns=name
DF2=df2.loc[:,df2.columns.str.contains('_B|Cat|ID')]
DF2.columns=name
pd.concat([DF1,DF2],axis=0)
Out[354]:
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
0 1 Veg 2 0.1
1 2 Veg 2 0.2
2 3 Meat 2 0.4
3 4 Meat 2 0.7
4 5 Veg 2 0.8
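
The two-suffix version above can be generalized to any number of classes. The following is only a minimal sketch, assuming every column outside the ID columns is named <stub>_<suffix>; the helper name melt_by_suffix is hypothetical and not part of pandas.

```python
import pandas as pd


def melt_by_suffix(df, id_vars, sep="_"):
    """Stack columns named <stub><sep><suffix> into one column per stub.

    Hypothetical helper: assumes every column outside id_vars follows
    the <stub><sep><suffix> naming pattern.
    """
    value_cols = [c for c in df.columns if c not in id_vars]
    # Suffixes in order of first appearance, e.g. ['A', 'B'].
    suffixes = list(dict.fromkeys(c.split(sep, 1)[1] for c in value_cols))
    parts = []
    for suffix in suffixes:
        cols = [c for c in value_cols if c.split(sep, 1)[1] == suffix]
        # Rename 'Class_A' -> 'Class', 'Prob_A' -> 'Prob'.
        mapping = {c: c.split(sep, 1)[0] for c in cols}
        parts.append(df[id_vars + cols].rename(columns=mapping))
    return pd.concat(parts, ignore_index=True)


df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Cat": ["Veg", "Veg", "Meat", "Meat", "Veg"],
    "Class_A": [1] * 5,
    "Class_B": [2] * 5,
    "Prob_A": [0.9, 0.8, 0.6, 0.3, 0.2],
    "Prob_B": [0.1, 0.2, 0.4, 0.7, 0.8],
})

long_df = melt_by_suffix(df, id_vars=["ID", "Cat"])
# Optional: order rows by ID, as in the desired output of the question.
long_df = long_df.sort_values("ID", kind="stable").reset_index(drop=True)
print(long_df)
```

Because the suffixes are discovered from the column names, the same call should work in each loop iteration regardless of how many Class_/Prob_ pairs are present.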

The top-voted answer uses the undocumented lreshape, which may at some point be deprecated because of its similarity to pd.wide_to_long, which is documented and can be used directly here. By default, suffix matches only numbers, so you must change it to match characters (here I used one or more word characters). Because the column names join the stub and the suffix with an underscore, pass sep='_' as well.
pd.wide_to_long(df, stubnames=['Class', 'Prob'], i=['ID', 'Cat'], j='DROPME', sep='_', suffix=r'\w+')\
.reset_index()\
.drop('DROPME', axis=1)
ID Cat Class Prob
0 1 Veg 1 0.9
1 1 Veg 2 0.1
2 2 Veg 1 0.8
3 2 Veg 2 0.2
4 3 Meat 1 0.6
5 3 Meat 2 0.4
6 4 Meat 1 0.3
7 4 Meat 2 0.7
8 5 Veg 1 0.2
9 5 Veg 2 0.8
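
Since the question asks for something that works with a varying number of classes, here is a small sketch of pd.wide_to_long on a frame with three classes. The three-class data is made up for illustration, and deriving the stub names from the column names assumes every wide column contains exactly one underscore.

```python
import pandas as pd

# Made-up example with three Class_/Prob_ pairs instead of two.
df = pd.DataFrame({
    "ID": [1, 2],
    "Cat": ["Veg", "Meat"],
    "Class_A": [1, 1],
    "Class_B": [2, 2],
    "Class_C": [3, 3],
    "Prob_A": [0.5, 0.2],
    "Prob_B": [0.3, 0.3],
    "Prob_C": [0.2, 0.5],
})

# Derive the stub names ('Class', 'Prob') from the columns that contain '_'.
stubs = sorted({c.split("_")[0] for c in df.columns if "_" in c})

long_df = (
    pd.wide_to_long(df, stubnames=stubs, i=["ID", "Cat"],
                    j="suffix", sep="_", suffix=r"\w+")
    .reset_index()
    .drop(columns="suffix")
)
print(long_df)
```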

You could also use pd.melt.
# Make DataFrame
import pandas as pd

df = pd.DataFrame({'ID': list(range(1, 6)),
                   'Cat': ['Veg']*2 + ['Meat']*2 + ['Veg'],
                   'Class_A': [1]*5,
                   'Class_B': [2]*5,
                   'Prob_A': [0.9, 0.8, 0.6, 0.3, 0.2],
                   'Prob_B': [0.1, 0.2, 0.4, 0.7, 0.8]})
# Make class dataframe and prob dataframe
df_class = df.loc[:, ['ID', 'Cat', 'Class_A', 'Class_B']]
df_prob = df.loc[:, ['ID', 'Cat', 'Prob_A', 'Prob_B']]
# Melt class dataframe and prob dataframe
df_class = df_class.melt(id_vars=['ID', 'Cat'],
                         value_vars=['Class_A', 'Class_B'],
                         value_name='Class')
df_prob = df_prob.melt(id_vars=['ID', 'Cat'],
                       value_vars=['Prob_A', 'Prob_B'],
                       value_name='Prob')
# Clean variable column so only 'A', 'B' is left in both dataframes
df_class['variable'] = df_class['variable'].str.partition('_')[2]
df_prob['variable'] = df_prob['variable'].str.partition('_')[2]
# Merge class dataframe with prob dataframe on 'ID', 'Cat', and 'variable';
# drop 'variable'; sort values by 'ID', 'Cat'
final = (df_class.merge(df_prob, how='inner', on=['ID', 'Cat', 'variable'])
         .drop('variable', axis=1)
         .sort_values(by=['ID', 'Cat']))

One option is pivot_longer from pyjanitor, which abstracts the process, and is efficient:
# pip install pyjanitor
import janitor
df.pivot_longer(
index = ['ID', 'Cat'],
names_to = '.value',
names_pattern = '([a-zA-Z]+)_*')
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
5 1 Veg 2 0.1
6 2 Veg 2 0.2
7 3 Meat 2 0.4
8 4 Meat 2 0.7
9 5 Veg 2 0.8
The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.

Related

Pivot Tables with Pandas

I have the following data saved as a pandas dataframe
Animal Day age Food kg
1 1 3 17 0.1
1 1 3 22 0.7
1 2 3 17 0.8
2 2 7 15 0.1
With pivot I get the following:
output = df.pivot(index=["Animal", "Food"], columns="Day", values="kg") \
.add_prefix("Day") \
.reset_index() \
.rename_axis(None, axis=1)
>>> output
Animal Food Day1 Day2
0 1 17 0.1 0.8
1 1 22 0.7 NaN
2 2 15 NaN 0.1
However I would like to have the age column (and other columns) still included.
It could also be possible that for animal x the value age is not always the same, then it doesn't matter which age value is taken.
Animal Food Age Day1 Day2
0 1 17 3 0.1 0.8
1 1 22 3 0.7 NaN
2 2 15 7 NaN 0.1
How do I need to change the code above?

IIUC, what you want is to pivot the weight, but to aggregate the age.
To my knowledge, you need to do both operations separately. One with pivot, the other with groupby (here I used first for the example, but this could be anything), and join:
(df.pivot(index=["Animal", "Food"],
columns="Day",
values="kg",
)
.add_prefix('Day')
.join(df.groupby(['Animal', 'Food'])['age'].first())
.reset_index()
)
I am adding an unambiguous example (here the age of Animal 1 changes on Day 2).
Input:
Animal Day age Food kg
0 1 1 3 17 0.1
1 1 1 3 22 0.7
2 1 2 4 17 0.8
3 2 2 7 15 0.1
output:
Animal Food Day1 Day2 age
0 1 17 0.1 0.8 3
1 1 22 0.7 NaN 3
2 2 15 NaN 0.1 7

Use pivot and add the other columns to the index (recent pandas versions require keyword arguments here):
>>> df.pivot(index=df.columns[~df.columns.isin(['Day', 'kg'])].tolist(), columns='Day', values='kg') \
.add_prefix('Day').reset_index().rename_axis(columns=None)
Animal age Food Day1 Day2
0 1 3 17 0.1 0.8
1 1 3 22 0.7 NaN
2 2 7 15 NaN 0.1
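
One thing to keep in mind with this approach: every column placed in the index becomes part of the row key. The sketch below uses the data in which the age of Animal 1 changes on Day 2, and shows that the pivot then produces a separate row for each distinct age rather than picking one value.

```python
import pandas as pd

# Animal 1 has age 3 on Day 1 and age 4 on Day 2.
df = pd.DataFrame({
    "Animal": [1, 1, 1, 2],
    "Day": [1, 1, 2, 2],
    "age": [3, 3, 4, 7],
    "Food": [17, 22, 17, 15],
    "kg": [0.1, 0.7, 0.8, 0.1],
})

# Every column except 'Day' and 'kg' goes into the index.
index_cols = [c for c in df.columns if c not in ("Day", "kg")]

out = (
    df.pivot(index=index_cols, columns="Day", values="kg")
    .add_prefix("Day")
    .reset_index()
    .rename_axis(columns=None)
)
print(out)  # four rows: (Animal 1, Food 17) is split across ages 3 and 4
```

If a single row per Animal and Food is required even when age varies, aggregating age separately (for example with groupby and first) and joining it back avoids the split.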

How often is an item in a purchase?

I would like to calculate how often an item appears in a shopping cart.
Each purchase is identified by a buyerid. A buyer can buy several items, and the same item more than once (twice, three times, ..., n times). Items are identified by itemid and description.
I would like to count how often an item ends up in a shopping cart. For example, if 3 out of 5 purchases contain an apple, the share is 3 / 5 = 0.6 (60%). I would like to compute this for all products. How do I do that?
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
display(df.head(20))
My try:
# What percentage of the articles are the same?
# this is wrong... :/
df_grouped = df.groupby('description').count()
display(df_grouped)
df_apple = df_grouped.iloc[0]
percentage = df_apple[0] / df.shape[0]
print(percentage)
[OUT] 0.45454545454545453
The mathematical formula:
count of all buys (count_buy) = 5
count of buys in which an apple appears (count_apple) = 3
count_apple / count_buy = 3 / 5 = 0.6
What I would like to have (please note, I have not calculated the values, these are just dummy values)

Use GroupBy.size and divide by count of unique values of buyerid by Series.nunique:
print (df.groupby(['itemid','description']).size())
itemid description
0 Banana 1
1 Apple 3
Banana 2
Strawberry 1
2 Apple 2
4 Dog-Food 1
5 Beef 1
dtype: int64
purch = df['buyerid'].nunique()
df1 = df.groupby(['itemid','description']).size().div(purch).reset_index(name='percentage')
print (df1)
itemid description percentage
0 0 Banana 0.2
1 1 Apple 0.6
2 1 Banana 0.4
3 1 Strawberry 0.2
4 2 Apple 0.4
5 4 Dog-Food 0.2
6 5 Beef 0.2

I would group it and create a new column as follows:
df_grp = df.groupby('description')['buyerid'].sum().reset_index(name='total')
df_grp['percentage'] = (df_grp.total / df_grp.total.sum()) * 100
df_grp
Result:
description total percentage
0 Apple 11 39.285714
1 Banana 7 25.000000
2 Beef 4 14.285714
3 Dog-Food 4 14.285714
4 Strawberry 2 7.142857

As always, there are multiple ways to reach the goal, but I would use pivoting, as follows:
Your input:
import pandas as pd
d = {'buyerid': [0,0,1,2,3,3,3,4,4,4,4],
'itemid': [0,1,2,1,1,1,2,4,5,1,1],
'description': ['Banana', 'Apple', 'Apple', 'Strawberry', 'Apple', 'Banana', 'Apple', 'Dog-Food', 'Beef', 'Banana', 'Apple'], }
df = pd.DataFrame(data=d)
In the next step, pivot the data with buyerid as the index and description as the columns, and replace NA with 0:
df2 = df.pivot_table(values='itemid', index='buyerid', columns='description', aggfunc='count')
df2 = df2.fillna(0)
resulting in
description Apple Banana Beef Dog-Food Strawberry
buyerid
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0
3 2.0 1.0 0.0 0.0 0.0
4 1.0 1.0 1.0 1.0 0.0
calling the mean on the table:
df_final = df2.mean()
results in
description
Apple 1.0
Banana 0.6
Beef 0.2
Dog-Food 0.2
Strawberry 0.2
dtype: float64
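
Note that counting rows also counts repeat purchases by the same buyer, which is why Apple comes out at 1.0 above. If the goal is the share of buyers who bought an item at least once (an assumption about what the question is after), a minimal sketch is to count distinct buyers per description:

```python
import pandas as pd

d = {
    "buyerid": [0, 0, 1, 2, 3, 3, 3, 4, 4, 4, 4],
    "itemid": [0, 1, 2, 1, 1, 1, 2, 4, 5, 1, 1],
    "description": ["Banana", "Apple", "Apple", "Strawberry", "Apple", "Banana",
                    "Apple", "Dog-Food", "Beef", "Banana", "Apple"],
}
df = pd.DataFrame(data=d)

# Total number of distinct buyers (5 in this data).
n_buyers = df["buyerid"].nunique()

# Distinct buyers per item, divided by the total number of buyers.
share = df.groupby("description")["buyerid"].nunique().div(n_buyers)
print(share)
```

Here buyer 3 bought Apple twice but is counted once, so Apple ends up at 4 / 5 = 0.8.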

Split panda frames when a specific column reaches a given value

I want to use pd.cut() with two different bin sizes for two specific parts of a dataframe. I believe the easiest way to do that is to split my dataframe in two, so I can use pd.cut() on the two independent dataframes with two independent sets of bins.
I understand I can use df.head(), but the dataframe keeps changing and does not always have the same size. For example, with the following dataframe
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
5 0.4 0.1773
6 0.43 0.91773
7 0.5 0.891773
I want two dataframes: one for values of A lower than 0.4 and one for values of A greater than or equal to 0.4.
So I would have df2:
A B
1 0.1 0.423655
2 0.2 0.645894
3 0.3 0.437587
4 0.31 0.891773
and df3:
A B
1 0.4 0.1773
2 0.43 0.91773
3 0.5 0.891773
Again, df.head(4) or df.tail(3) won't work.

df2 = df[df["A"] < 0.4]
df3 = df[df["A"] >= 0.4]
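
Once the frame is split with a boolean mask, pd.cut can be applied to each part with its own bins. This is a minimal sketch using the data from the question; the bin edges are made-up values for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3, 0.31, 0.4, 0.43, 0.5],
    "B": [0.423655, 0.645894, 0.437587, 0.891773, 0.1773, 0.91773, 0.891773],
})

# The mask does not depend on the number of rows, unlike head()/tail().
mask = df["A"] < 0.4
df2 = df[mask].copy()
df3 = df[~mask].copy()

# Hypothetical bin edges, one set per part (intervals are right-inclusive).
df2["bin"] = pd.cut(df2["A"], bins=[0.0, 0.2, 0.4])
df3["bin"] = pd.cut(df3["A"], bins=[0.3, 0.45, 0.6])

print(df2)
print(df3)
```

The .copy() calls avoid a SettingWithCopyWarning when the new column is added to each part.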

This should work:
import pandas as pd
data = {'A': [0.1,0.2,0.1,0.2,5,6,7,8], 'B': [5,0.2,4,8,11,9,10,14]}
df = pd.DataFrame(data)
df2 = df[df.A >= 0.4]
print(df2)
# A B
#4 5.0 11.0
#5 6.0 9.0
#6 7.0 10.0
#7 8.0 14.0
df3 = df[df.A < 0.4]
print(df3)
# A B
#0 0.1 5.0
#1 0.2 0.2
#2 0.1 4.0
#3 0.2 8.0

I added in some fictitious data as an example:
data = {'A': [1,2,3,4,5,6,7,8], 'B': [5,8,9,10,11,12,13,14]}
df = pd.DataFrame(data)
df1 = df[df.A > 4]
df2 = df[df.A <13]
print(df1)
print(df2)
Output
>>> print(df1)
A B
4 5 11
5 6 12
6 7 13
7 8 14
>>> print(df2)
A B
0 1 5
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
6 7 13
7 8 14

Create multiple dataframes based on the original dataframe columns number

I've searched for quite some time, but I haven't found any similar question. If there is one, please let me know!
I am currently trying to divide one dataframe into n dataframes, where n is the number of columns of the original dataframe. Each resulting dataframe must keep the first column of the original dataframe. As an extra, I would like to gather them all together in a list, for example, for further access.
To visualize my intention, here is a brief example:
>> original df
GeneID A B C D E
1 0.3 0.2 0.6 0.4 0.8
2 0.5 0.3 0.1 0.2 0.6
3 0.4 0.1 0.5 0.1 0.3
4 0.9 0.7 0.1 0.6 0.7
5 0.1 0.4 0.7 0.2 0.5
My desired output would be something like this:
>> df1
GeneID A
1 0.3
2 0.5
3 0.4
4 0.9
5 0.1
>> df2
GeneID B
1 0.2
2 0.3
3 0.1
4 0.7
5 0.4
....
And so on, until all the columns of the original dataframe are covered.
What would be the best solution?

You can use df.columns to get all column names and then create sub-dataframes:
outdflist = []
# for each column beyond first:
for col in oridf.columns[1:]:
    # create a subdf with desired columns:
    subdf = oridf[['GeneID', col]]
    # append subdf to list of df:
    outdflist.append(subdf)
# to view all dataframes created:
for df in outdflist:
    print(df)
Output:
GeneID A
0 1 0.3
1 2 0.5
2 3 0.4
3 4 0.9
4 5 0.1
GeneID B
0 1 0.2
1 2 0.3
2 3 0.1
3 4 0.7
4 5 0.4
GeneID C
0 1 0.6
1 2 0.1
2 3 0.5
3 4 0.1
4 5 0.7
GeneID D
0 1 0.4
1 2 0.2
2 3 0.1
3 4 0.6
4 5 0.2
GeneID E
0 1 0.8
1 2 0.6
2 3 0.3
3 4 0.7
4 5 0.5
Above for loop can also be written more simply as list comprehension:
outdflist = [ oridf[['GeneID', col]]
for col in oridf.columns[1:] ]
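
If the sub-dataframes need to be looked up by name later, a dict comprehension is a small variation on the list comprehension. This sketch rebuilds the frame from the question; the variable name frames is just an illustrative choice.

```python
import pandas as pd

oridf = pd.DataFrame({
    "GeneID": [1, 2, 3, 4, 5],
    "A": [0.3, 0.5, 0.4, 0.9, 0.1],
    "B": [0.2, 0.3, 0.1, 0.7, 0.4],
    "C": [0.6, 0.1, 0.5, 0.1, 0.7],
    "D": [0.4, 0.2, 0.1, 0.6, 0.2],
    "E": [0.8, 0.6, 0.3, 0.7, 0.5],
})

# One sub-dataframe per data column, keyed by that column's name.
frames = {col: oridf[["GeneID", col]] for col in oridf.columns[1:]}

print(frames["B"])
```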

You can do this with groupby (note that GeneID ends up as its own entry instead of being kept in every frame):
d={'df'+ str(x): y for x , y in df.groupby(level=0,axis=1)}
d
Out[989]:
{'dfA': A
0 0.3
1 0.5
2 0.4
3 0.9
4 0.1, 'dfB': B
0 0.2
1 0.3
2 0.1
3 0.7
4 0.4, 'dfC': C
0 0.6
1 0.1
2 0.5
3 0.1
4 0.7, 'dfD': D
0 0.4
1 0.2
2 0.1
3 0.6
4 0.2, 'dfE': E
0 0.8
1 0.6
2 0.3
3 0.7
4 0.5, 'dfGeneID': GeneID
0 1
1 2
2 3
3 4
4 5}

You can create a list of column names, and manually loop through and create a new DataFrame each loop.
>>> import pandas as pd
>>> d = {'col1':[1,2,3], 'col2':[3,4,5], 'col3':[6,7,8]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2 col3
0 1 3 6
1 2 4 7
2 3 5 8
>>> newstuff=[]
>>> columns = list(df)
>>> for column in columns:
...     newstuff.append(pd.DataFrame(data=df[column]))
Unless your dataframe is unreasonably massive, the code above should do the job.

Pandas: Adding an excel SUMIF column like =A1/SUMIF(B:B,B1,A:A)

I have a pandas DataFrame like:
pet treats lbs
0 cat 2 5.0
1 dog 1 9.9
2 snek 3 1.1
3 cat 6 4.5
4 dog 1 9.4
I would like to add a fourth column that takes each treat as a percentage of the total treats for pets of that kind. So, the treat value in row 0, divided by the sum of all treats for pets matching "cat" (and so on for each row).
In Excel, I think I would do something like this:
A B C D
1 cat 2 5.0 =B1/SUMIF(A:A,A1,B:B)
2 dog 1 9.9 =B2/SUMIF(A:A,A2,B:B)
3 snek 3 1.1 =B3/SUMIF(A:A,A3,B:B)
4 cat 6 4.5 =B4/SUMIF(A:A,A4,B:B)
5 dog 1 9.4 =B5/SUMIF(A:A,A5,B:B)
Anyone have an idea how I could add this "treat_percent" column using pandas?
pet treats lbs treat_percent
0 cat 2 5.0 25.00
1 dog 1 9.9 50.00
2 snek 3 1.1 100.00
3 cat 6 4.5 75.00
4 dog 1 9.4 50.00
So far, I have tried:
df['treat_percent'] = df['pet'] / df.groupby('pet')['treats'].sum()
and
df['treat_percent'] = df['pet'] / df.loc[df['pet'] == df['pet'], 'treats'].sum()

You can use transform:
df['treat_rate']=df.treats/df.groupby('pet').treats.transform('sum')
df
Out[153]:
pet treats lbs treat_rate
0 cat 2 5.0 0.25
1 dog 1 9.9 0.50
2 snek 3 1.1 1.00
3 cat 6 4.5 0.75
4 dog 1 9.4 0.50
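
transform('sum') returns the per-group total aligned to the original rows, which is what makes the row-wise division work. To get percentages in a treat_percent column, as asked in the question, the ratio can be scaled and rounded; this is a minimal sketch on the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "pet": ["cat", "dog", "snek", "cat", "dog"],
    "treats": [2, 1, 3, 6, 1],
    "lbs": [5.0, 9.9, 1.1, 4.5, 9.4],
})

# Per-pet treat totals, broadcast back to every row (like SUMIF in Excel).
group_total = df.groupby("pet")["treats"].transform("sum")

df["treat_percent"] = (df["treats"] / group_total * 100).round(2)
print(df)
```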
