How can I melt a pandas data frame using multiple variable names and values? I have the following data frame that changes its shape in a for loop. In one of the for loop iterations, it looks like this:
ID Cat Class_A Class_B Prob_A Prob_B
1 Veg 1 2 0.9 0.1
2 Veg 1 2 0.8 0.2
3 Meat 1 2 0.6 0.4
4 Meat 1 2 0.3 0.7
5 Veg 1 2 0.2 0.8
I need to melt it in such a way that it looks like this:
ID Cat Class Prob
1 Veg 1 0.9
1 Veg 2 0.1
2 Veg 1 0.8
2 Veg 2 0.2
3 Meat 1 0.6
3 Meat 2 0.4
4 Meat 1 0.3
4 Meat 2 0.7
5 Veg 1 0.2
5 Veg 2 0.8
Across the for loop iterations the data frame will contain a different number of classes with their probabilities. That is why I am looking for a general approach that works in every iteration. I saw similar questions, but they were not helpful!
You need lreshape, passing a dict to specify the categories:
d = {'Class':['Class_A', 'Class_B'], 'Prob':['Prob_A','Prob_B']}
df = pd.lreshape(df,d)
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
A more dynamic solution:
Class = [col for col in df.columns if col.startswith('Class')]
Prob = [col for col in df.columns if col.startswith('Prob')]
df = pd.lreshape(df, {'Class':Class, 'Prob':Prob})
print (df)
Cat ID Class Prob
0 Veg 1 1 0.9
1 Veg 2 1 0.8
2 Meat 3 1 0.6
3 Meat 4 1 0.3
4 Veg 5 1 0.2
5 Veg 1 2 0.1
6 Veg 2 2 0.2
7 Meat 3 2 0.4
8 Meat 4 2 0.7
9 Veg 5 2 0.8
EDIT:
lreshape is now undocumented and may be removed in the future (along with pd.wide_to_long).
A possible solution is merging all three functions into one - probably melt - but that is not implemented yet. Maybe in some new version of pandas; this answer will be updated then.
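Until then, here is a sketch of an lreshape-free equivalent built only from documented primitives (melt plus pivot; the list-valued index in pivot needs pandas 1.1+). It assumes every value column follows the <stub>_<suffix> naming used above:
# Reshape to long form, split 'Class_A' into stub 'Class' and suffix 'A',
# then pivot the stubs back out as columns.
long = df.melt(id_vars=['ID', 'Cat'])
long[['stub', 'suffix']] = long['variable'].str.split('_', expand=True)
out = (long.pivot(index=['ID', 'Cat', 'suffix'], columns='stub', values='value')
           .reset_index()
           .drop(columns='suffix'))
# note: melt stacks the int Class and float Prob values into one column,
# so both come back as float (Class shows as 1.0/2.0)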
Or you can try this, using str.contains and pd.concat:
DF1 = df.loc[:, df.columns.str.contains('_A|Cat|ID')]
name = ['ID', 'Cat', 'Class', 'Prob']
DF1.columns = name
DF2 = df.loc[:, df.columns.str.contains('_B|Cat|ID')]
DF2.columns = name
pd.concat([DF1, DF2], axis=0)
Out[354]:
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
0 1 Veg 2 0.1
1 2 Veg 2 0.2
2 3 Meat 2 0.4
3 4 Meat 2 0.7
4 5 Veg 2 0.8
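If the number of classes changes between iterations, the same idea generalizes by looping over the suffixes; a sketch, assuming every class column is named Class_<suffix> with a matching Prob_<suffix>:
# collect the suffixes actually present (A, B, C, ...)
suffixes = [c.split('_', 1)[1] for c in df.columns if c.startswith('Class_')]
parts = [df[['ID', 'Cat', f'Class_{s}', f'Prob_{s}']]
           .set_axis(['ID', 'Cat', 'Class', 'Prob'], axis=1)
         for s in suffixes]
pd.concat(parts, ignore_index=True)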
The top-voted answer uses the undocumented lreshape, which may at some point get deprecated because of its similarity to pd.wide_to_long, which is documented and can be used directly here. By default suffix matches only numbers; you must change it so it matches the letter suffixes (here '.+' accepts any characters).
pd.wide_to_long(df, stubnames=['Class', 'Prob'], i=['ID', 'Cat'], j='DROPME', suffix='.+')\
.reset_index()\
.drop('DROPME', axis=1)
ID Cat Class Prob
0 1 Veg 1 0.9
1 1 Veg 2 0.1
2 2 Veg 1 0.8
3 2 Veg 2 0.2
4 3 Meat 1 0.6
5 3 Meat 2 0.4
6 4 Meat 1 0.3
7 4 Meat 2 0.7
8 5 Veg 1 0.2
9 5 Veg 2 0.8
You could also use pd.melt:
# Make DataFrame
df = pd.DataFrame({'ID': list(range(1, 6)),
                   'Cat': ['Veg'] * 2 + ['Meat'] * 2 + ['Veg'],
                   'Class_A': [1] * 5,
                   'Class_B': [2] * 5,
                   'Prob_A': [0.9, 0.8, 0.6, 0.3, 0.2],
                   'Prob_B': [0.1, 0.2, 0.4, 0.7, 0.8]})
# Make class dataframe and prob dataframe
df_class = df.loc[:, ['ID', 'Cat', 'Class_A', 'Class_B']]
df_prob = df.loc[:, ['ID', 'Cat', 'Prob_A', 'Prob_B']]
# Melt class dataframe and prob dataframe
df_class = df_class.melt(id_vars=['ID', 'Cat'],
                         value_vars=['Class_A', 'Class_B'],
                         value_name='Class')
df_prob = df_prob.melt(id_vars=['ID', 'Cat'],
                       value_vars=['Prob_A', 'Prob_B'],
                       value_name='Prob')
# Clean the variable column so only 'A'/'B' is left in both dataframes
df_class['variable'] = df_class['variable'].str.partition('_')[2]
df_prob['variable'] = df_prob['variable'].str.partition('_')[2]
# Merge on 'ID', 'Cat' and 'variable'; drop 'variable'; sort by 'ID', 'Cat'
final = (df_class.merge(df_prob, how='inner', on=['ID', 'Cat', 'variable'])
                 .drop('variable', axis=1)
                 .sort_values(by=['ID', 'Cat']))
One option is pivot_longer from pyjanitor, which abstracts the process and is efficient:
# pip install pyjanitor
import janitor
df.pivot_longer(
index = ['ID', 'Cat'],
names_to = '.value',
names_pattern = '([a-zA-Z]+)_*')
ID Cat Class Prob
0 1 Veg 1 0.9
1 2 Veg 1 0.8
2 3 Meat 1 0.6
3 4 Meat 1 0.3
4 5 Veg 1 0.2
5 1 Veg 2 0.1
6 2 Veg 2 0.2
7 3 Meat 2 0.4
8 4 Meat 2 0.7
9 5 Veg 2 0.8
The idea for this particular reshape is that whatever group in the regular expression is paired with the .value stays as the column header.
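If you want the exact row order from the question (rows grouped by ID), sort the result afterwards:
df.pivot_longer(
    index = ['ID', 'Cat'],
    names_to = '.value',
    names_pattern = '([a-zA-Z]+)_*'
).sort_values(['ID', 'Class'], ignore_index = True)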
Here is my data frame.
df
store_1 store_2 store_3 store_4
0 banana banana plum banana
1 orange tangerine pear orange
2 apple pear melon apple
3 pear raspberry pineapple plum
4 plum tomato peach tomato
I'm looking for a way to count the number of co-occurrences across stores (to compare their similarity).
You can try something like this:
import itertools as it
import numpy as np
import pandas as pd

# fraction of a's products that also appear in b (an overlap measure,
# not a true correlation)
overlap = lambda a, b: len(set(a).intersection(set(b))) / len(a)
c = [overlap(*x) for x in it.combinations_with_replacement(df.T.values.tolist(), 2)]
j = 0
x = []
for i in range(df.shape[1], 0, -1):
    x.append([np.nan] * (df.shape[1] - i) + c[j:j + i])
    j += i
pd.DataFrame(x, columns=df.columns, index=df.columns)
Which yields
store_1 store_2 store_3 store_4
store_1 1.0 0.4 0.4 0.8
store_2 NaN 1.0 0.2 0.4
store_3 NaN NaN 1.0 0.2
store_4 NaN NaN NaN 1.0
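If the triangular layout is not needed, a shorter sketch builds the full symmetric matrix directly (each pair is computed twice, which hardly matters at this size):
overlap = lambda a, b: len(set(a) & set(b)) / len(a)
sim = pd.DataFrame([[overlap(df[r], df[c]) for c in df.columns] for r in df.columns],
                   index=df.columns, columns=df.columns)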
If you wish to estimate the similarity of the stores with regard to their products, then you could use:
One-hot encoding
Each store can then be described by a vector whose length n is the number of distinct products across all stores, such as:
banana
orange
apple
pear
plum
tangerine
raspberry
tomato
melon
.
.
.
Store_1 then is described as 1 1 1 1 1 0 0 0 0 0 ...
Store_2 1 0 0 1 0 1 1 1 0 ...
This leaves you with numerical vectors, on which you can compute a dissimilarity measure such as the Euclidean distance.
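A minimal sketch of that idea, assuming df is the store table above and the usual import numpy as np / import pandas as pd; pd.crosstab builds the 0/1 product-indicator vectors and numpy computes the pairwise Euclidean distances:
# long form: one (store, product) pair per row
long = df.melt(var_name='store', value_name='product')
# stores x products indicator matrix (each count is 0 or 1 here)
vectors = pd.crosstab(long['store'], long['product'])
arr = vectors.to_numpy(dtype=float)
# pairwise Euclidean distances between the store vectors
dist = pd.DataFrame(np.sqrt(((arr[:, None, :] - arr[None, :, :]) ** 2).sum(axis=-1)),
                    index=vectors.index, columns=vectors.index)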
I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this, which looks promising:
Pandas Groupby and Sum Only One Column
I need to find the maximum percent change in col 'B' per group of col 'A' and insert it into the col maxpercchng for all rows in the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng col to all rows in each group?
I believe you need transform, which returns a Series of the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
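The aggregated frame df1 can then be broadcast back to every row with a merge, which reproduces the maxpercchng column from the transform approach:
df = df.merge(df1.rename(columns={'B': 'maxpercchng'}), on='A', how='left')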
Two tables:
Price list table PRICE_LIST:
ITEM PRICE
MANGO 5
BANANA 2
APPLE 2.5
ORANGE 1.5
Records of sale REC_SALE (list of transactions)
ITEM SELLLING_PRICE
MANGO 4
MANGO 3
BANANA 2
BANANA 1
ORANGE 0.5
ORANGE 4
Selecting records from REC_SALE where items were sold for less than the price listed in the PRICE_LIST table:
SELECT A.*
FROM
(
select RS.ITEM, RS.SELLLING_PRICE, PL.PRICE AS ACTUAL_PRICE
from REC_SALE RS,
PRICE_LIST PL
where RS.ITEM = PL.ITEM
) A
WHERE A.SELLLING_PRICE < A.ACTUAL_PRICE ;
Result:
ITEM SELLLING_PRICE ACTUAL_PRICE
MANGO 4 5
MANGO 3 5
BANANA 1 2
ORANGE 0.5 1.5
I have these same two tables as DataFrames in a Jupyter notebook. What would be an equivalent Python statement for the SQL above, using pandas?
Use merge with .loc:
df1.merge(df2).loc[lambda x: x.PRICE > x.SELLLING_PRICE]
Out[198]:
ITEM PRICE SELLLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5
Use merge with query:
df = pd.merge(df1, df2, on='ITEM').query('PRICE > SELLLING_PRICE')
print (df)
ITEM PRICE SELLLING_PRICE
0 MANGO 5.0 4.0
1 MANGO 5.0 3.0
3 BANANA 2.0 1.0
4 ORANGE 1.5 0.5
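Plain boolean indexing is a third equivalent, if you prefer to avoid query and the lambda; it gives the same result:
merged = pd.merge(df1, df2, on='ITEM')
print (merged[merged['PRICE'] > merged['SELLLING_PRICE']])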
There is a plethora of questions on SO about how to select rows in a DataFrame and replace values in a column of those rows, but one use case is missing. To use the example DataFrame from a related question:
In [1]: df
Out[1]:
apple banana cherry
0 0 3 good
1 1 4 bad
2 2 5 good
And this works if one wants to change a single column based on another:
df.loc[df.cherry == 'bad', 'apple'] = df.banana * 2
Or this sets the values in two columns:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = np.nan
But this doesn't work:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = [df.banana, df.apple]
because apparently the right side is 3x2 while the left side is 1x2, hence the error message:
ValueError: Must have equal len keys and value when setting with an ndarray
So I understand what the problem is, but what is the solution?
IIUC you can try:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
print (df)
apple banana cherry a b
0 0 3 good 0 6
1 1 4 bad 3 8
2 2 5 good 6 10
df[['a', 'b']] = df.loc[df.cherry == 'bad', ['apple', 'banana']]
print (df)
apple banana cherry a b
0 0 3 good NaN NaN
1 1 4 bad 1.0 4.0
2 2 5 good NaN NaN
Or select with the condition and assign the helper columns' values:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
df.loc[df.cherry == 'bad', ['apple', 'banana']] = \
    df.loc[df.cherry == 'bad', ['a', 'b']].values
print (df)
apple banana cherry a b
0 0 3 good 0 6
1 3 8 bad 3 8
2 2 5 good 6 10
Another option uses the original columns (starting again from the unmodified df):
print (df[['apple','banana']].shift() * 2)
   apple  banana
0    NaN     NaN
1    0.0     6.0
2    2.0     8.0
df.loc[df.cherry == 'bad', ['apple', 'banana']] = df[['apple','banana']].shift() * 2
print (df)
   apple  banana cherry
0    0.0     3.0   good
1    0.0     6.0    bad
2    2.0     5.0   good
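For the swap the original question actually asks about, the usual pattern is to select the same rows on the right-hand side and drop down to the underlying array with .values, which bypasses the column alignment that caused the ValueError; a sketch, assuming the original df:
mask = df.cherry == 'bad'
df.loc[mask, ['apple', 'banana']] = df.loc[mask, ['banana', 'apple']].values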