Subset dataframe based on the slope - python

I have the following data frame:
import pandas as pd

df = pd.DataFrame()
df['fruit'] = ['apple','pear','banana','banana','pear','banana','apple','apple','pear','apple','apple','apple']
df['price'] = [2,1,3,3,1,3.3,1.8,1.8,1,1.6,1.6,1.6]
df['date_buy'] = ['01/01/2005','01/01/2005','01/01/2005','01/01/2005','01/02/2005','01/02/2005','01/02/2005','01/02/2005','01/03/2005','01/03/2005','01/03/2005','01/03/2005']
df['date_buy'] = pd.to_datetime(df['date_buy'])
df.set_index('date_buy', inplace=True)
The data is:
fruit price
date_buy
2005-01-01 apple 2.0
2005-01-01 pear 1.0
2005-01-01 banana 3.0
2005-01-01 banana 3.0
2005-01-02 pear 1.0
2005-01-02 banana 3.3
2005-01-02 apple 1.8
2005-01-02 apple 1.8
2005-01-03 pear 1.0
2005-01-03 apple 1.6
2005-01-03 apple 1.6
2005-01-03 apple 1.6
I have converted this dataframe into a pivot table:
df.pivot_table(index=['date_buy'], columns=['fruit'], values=['price'], aggfunc=len).\
    fillna(0).resample('D', level=0).sum()
price
fruit apple banana pear
date_buy
2005-01-01 1.0 2.0 1.0
2005-01-02 2.0 1.0 1.0
2005-01-03 3.0 0.0 1.0
I want to subset this dataset based on one criterion: the top two slopes of the trend lines. For apple the slope is 1, for banana it is -1, and for pear it is 0. The result should be:
price
fruit apple pear
date_buy
2005-01-01 1.0 1.0
2005-01-02 2.0 1.0
2005-01-03 3.0 1.0
This dataset is just a toy version of a much larger one, which is why I'm not simply subsetting by the names of the two fruits I can see here. Please, any help will be greatly appreciated.

You can use polyfit from numpy to get the slopes. While not strictly necessary, you can use the offset of the index in days as x and the full pivoted dataframe as y. Then use argsort to rank the slopes and select the number of top slopes you want. Finally, use iloc to select those columns:
import numpy as np

pv_df = (df.pivot_table(index=['date_buy'], columns=['fruit'],
                        values=['price'], aggfunc=len)
           .fillna(0).resample('D', level=0).sum()
        )

# number of top slopes to keep
nb_top = 2

# get the slopes: fit a degree-1 polynomial per column,
# using the day offset from the first date as x
slopes = np.polyfit(x=(pv_df.index - pv_df.index.min()).days,
                    y=pv_df, deg=1)[0]

# select the columns with the largest slopes
res = pv_df.iloc[:, np.argsort(slopes)[-nb_top:]]
print(res)
price
fruit pear apple
date_buy
2005-01-01 1.0 1.0
2005-01-02 1.0 2.0
2005-01-03 1.0 3.0
Note: for the slopes, you can directly use slopes = np.polyfit(x=pv_df.index.astype(int), y=pv_df, deg=1)[0], but the resulting values are less obvious than the 1, 0 and -1 you mention in your question.
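A small optional variation on the selection step (not part of the original answer): wrapping the slopes in a Series keyed by the pivoted columns lets you pick the top columns by label with nlargest instead of by position.
# assumes pv_df, slopes and nb_top from the snippet above
slope_s = pd.Series(slopes, index=pv_df.columns)    # one slope per pivoted column
res = pv_df.loc[:, slope_s.nlargest(nb_top).index]  # keep the nb_top steepest columns
print(res)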

Related

Transpose dataframe with cells as sum over columns

I have a dataframe in the following form:
x_30d x_60d y_30d y_60d
127 1.0 1.0 0.0 1.0
223 1.0 0.0 1.0 NaN
1406 1.0 NaN 1.0 0.0
2144 1.0 0.0 1.0 1.0
2234 1.0 0.0 NaN NaN
I need to transform it into the following form (where each cell is the sum over each column above):
30d 60d
x 5 1
y 3 2
I've tried using dictionaries, splitting columns, melting the dataframe, transposing it, etc., but I cannot seem to get the correct pattern.
To make things slightly more complicated, here are some actual column names that have a mix of forms for date ranges: PASC_new_aches_30d_60d, PASC_new_aches_60d_180d, ... PASC_new_aches_360d, ..., PASC_new_jt_pain_180d_360d, ...
In [131]: new = df.sum()

In [132]: new.index = pd.MultiIndex.from_frame(
     ...:     new.index.str.extract(r"^(.*?)_(\d+d.*)$"))

In [133]: new
Out[133]:
0               1
PASC_new_aches  30d_60d     5.0
                60d_180d    1.0
x               30d         3.0
PASC_new_aches  360d        2.0
dtype: float64

In [134]: new.unstack()
Out[134]:
1               30d  30d_60d  360d  60d_180d
0
PASC_new_aches  NaN      5.0   2.0       1.0
x               3.0      NaN   NaN       NaN
Step by step:
- sum as usual per column
- the original's columns are now at the index; we need to split them
- using a regex here: ^(.*?)_(\d+d.*)$
  - ^: the beginning
  - (.*?): anything, but lazily (as little as possible) until...
  - _(\d+d.*): ...an underscore followed by the "digits + d" pattern, and anything after it
  - $: the end
- while splitting we extracted the parts before and after the underscore with the (...) groups
- make them the new index (a MultiIndex now)
- unstack the inner level to become the new columns, i.e., the parts after "_"
- note that the "1" and "0" at the top left are the "name"s of the axes of the frame; 0 is that of the result's index, 1 that of its columns. They are there due to pd.MultiIndex.from_frame and can be removed with .rename_axis(index=None, columns=None).
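Putting the steps together on the small x/y example from the question (a minimal sketch; the data is simply retyped from the frame shown above):
import pandas as pd

df = pd.DataFrame(
    {"x_30d": [1.0, 1.0, 1.0, 1.0, 1.0],
     "x_60d": [1.0, 0.0, None, 0.0, 0.0],
     "y_30d": [0.0, 1.0, 1.0, 1.0, None],
     "y_60d": [1.0, None, 0.0, 1.0, None]},
    index=[127, 223, 1406, 2144, 2234])

new = df.sum()                                   # sum per column, NaN skipped
new.index = pd.MultiIndex.from_frame(
    new.index.str.extract(r"^(.*?)_(\d+d.*)$"))  # "x_30d" -> ("x", "30d")
out = new.unstack().rename_axis(index=None, columns=None)
print(out)
#    30d  60d
# x  5.0  1.0
# y  3.0  2.0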
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd

(df
 .agg(['sum'])
 .pivot_longer(
     index=None,
     names_to=('other', '.value'),
     names_sep='_')
)

  other  30d  60d
0     x  5.0  1.0
1     y  3.0  2.0
The .value determines which parts of the columns remain as column headers.
If your dataframe looks complicated (based on the columns you shared):
PASC_new_aches_30d_60d PASC_new_aches_60d_180d PASC_new_aches_360d PASC_new_jt_pain_180d_360d
127 1.0 1.0 0.0 1.0
223 1.0 0.0 1.0 NaN
1406 1.0 NaN 1.0 0.0
2144 1.0 0.0 1.0 1.0
2234 1.0 0.0 NaN NaN
then a regex, similar to @MustafaAydin's answer, works better:
(df
 .agg(['sum'])
 .pivot_longer(
     index=None,
     names_to=('other', '.value'),
     names_pattern=r"(\D+)_(.+)")
)

              other  30d_60d  60d_180d  360d  180d_360d
0    PASC_new_aches      5.0       1.0   3.0        NaN
1  PASC_new_jt_pain      NaN       NaN   NaN        2.0

How to fill missing values in different ways in Python?

I have a dataset that has a number of numerical variables and a number of ordinal numeric variables. To fill the missing values I want to use the mean for the numerical variables and the median for the ordinal numeric variables. With the following code, each of them is created separately and they are not combined into one dataframe.
import pandas as pd

data = [['age', 'score'],
        [10, 1],
        [20, ""],
        ["", 0],
        [40, 1],
        [50, 0],
        ["", 3],
        [70, 1],
        [80, ""],
        [90, 0],
        [100, 1]]
df = pd.DataFrame(data[1:])
df.columns = data[0]
df = df[['age']].fillna(df.mean())
df = df[['score']].fillna(df.median())
pandas.DataFrame.fillna accepts a dict whose keys are column names, so you might do:
import pandas as pd

data = [['age', 'score'],
        [10, 1],
        [20, None],
        [None, 0],
        [40, 1],
        [50, 0],
        [None, 3],
        [70, 1],
        [80, None],
        [90, 0],
        [100, 1]]
df = pd.DataFrame(data[1:], columns=data[0])
df = df.fillna({'age': df['age'].mean(), 'score': df['score'].median()})
print(df)
output
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
Keep in mind that an empty string is different from NaN; the latter can be created using Python's None.
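A quick illustration of that difference (a minimal standalone sketch, not part of the original answers):
import numpy as np
import pandas as pd

s = pd.Series([10, "", None])
print(s.isna())                      # only the None (stored as a missing value) is detected
print(s.replace("", np.nan).isna())  # after the replace, the empty string counts as missing too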
First replace empty strings with missing values and then fill the missing values per column:
import numpy as np

df = df.replace('', np.nan)
df['age'] = df['age'].fillna(df['age'].mean())
df['score'] = df['score'].fillna(df['score'].median())
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
You can also use DataFrame.agg to get a Series of aggregate values and pass it to DataFrame.fillna:
df = df.replace('', np.nan)
print (df.agg({'age':'mean', 'score':'median'}))
age 57.5
score 1.0
dtype: float64
df = df.fillna(df.agg({'age':'mean', 'score':'median'}))
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0

How to merge rows with combination of values in a DataFrame

I have a DataFrame (df1) as given below
Hair Feathers Legs Type Count
R1 1 NaN 0 1 1
R2 1 0 NaN 1 32
R3 1 0 2 1 4
R4 1 NaN 4 1 27
I want to merge rows based on different combinations of the values in each column, and also want to add up the count values for each merged row. The resultant dataframe (df2) will look like this:
Hair Feathers Legs Type Count
R1 1 0 0 1 33
R2 1 0 2 1 36
R3 1 0 4 1 59
The merging is performed in such a way that any NaN value will be merged with 0 or 1. In df2, R1 is calculated by merging the NaN value of Feathers (df1, R1) with the 0 value of Feathers (df1, R2). Similarly, the 0 value of Legs (df1, R1) is merged with the NaN value of Legs (df1, R2), and then the counts of R1 (1) and R2 (32) are added. In the same manner R2 and R3 are merged, because the Feathers value in R2 (df1) is the same as in R3 (df1) and the NaN value of Legs is merged with the 2 in R3 (df1), and the counts of R2 (32) and R3 (4) are added.
I hope the explanation makes sense. Any help will be highly appreciated
A possible way to do it is by replicating each of the rows containing NaN and filling them with the possible values for the column.
First, we need to get the possible not-null unique values per column:
unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
> unique_values
{'Hair': [1.0], 'Feathers': [0.0], 'Legs': [0.0, 2.0, 4.0], 'Type': [1.0]}
Then iterate through each row of the dataframe and replace each NaN by the possible values for each column. We can do this using pandas.DataFrame.iterrows:
mask = df.iloc[:, :-1].isnull().any(axis=1)

# Keep the rows that do not contain NaN,
# then append the modified rows
list_of_df = [r for i, r in df[~mask].iterrows()]

for row_index, row in df[mask].iterrows():
    for c in row[row.isnull()].index:
        # For each null column of the row, replace
        # NaN by each possible value for the column
        for v in unique_values[c]:
            list_of_df.append(row.copy().fillna({c: v}))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
The result is a dataframe where all the NaN have been filled with possible values for the column:
> df_res
Hair Feathers Legs Type Count
0 1.0 0.0 2.0 1.0 4.0
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
3 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
To get the final result of Count grouping by possible combinations of ['Hair', 'Feathers', 'Legs', 'Type'] we just need to do:
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 33.0
1 1.0 0.0 2.0 1.0 36.0
2 1.0 0.0 4.0 1.0 59.0
Hope it serves
UPDATE
If more than one element in a row is missing, the procedure has to look for all the possible combinations of the missing values at the same time. Let us add a new row with two elements missing:
> df
Hair Feathers Legs Type Count
0 1.0 NaN 0.0 1.0 1.0
1 1.0 0.0 NaN 1.0 32.0
2 1.0 0.0 2.0 1.0 4.0
3 1.0 NaN 4.0 1.0 27.0
4 1.0 NaN NaN 1.0 32.0
We will proceed in a similar way, but the replacement combinations will be obtained using itertools.product:
import itertools

unique_values = df.iloc[:, :-1].apply(
    lambda x: x.dropna().unique().tolist(), axis=0).to_dict()
mask = df.iloc[:, :-1].isnull().any(axis=1)

list_of_df = [r for i, r in df[~mask].iterrows()]
for row_index, row in df[mask].iterrows():
    cols = row[row.isnull()].index.tolist()
    for p in itertools.product(*[unique_values[c] for c in cols]):
        list_of_df.append(row.copy().fillna({c: v for c, v in zip(cols, p)}))

df_res = pd.concat(list_of_df, axis=1, ignore_index=True).T
> df_res.sort_values(['Hair', 'Feathers', 'Legs', 'Type']).reset_index(drop=True)
Hair Feathers Legs Type Count
1 1.0 0.0 0.0 1.0 1.0
2 1.0 0.0 0.0 1.0 32.0
6 1.0 0.0 0.0 1.0 32.0
0 1.0 0.0 2.0 1.0 4.0
3 1.0 0.0 2.0 1.0 32.0
7 1.0 0.0 2.0 1.0 32.0
4 1.0 0.0 4.0 1.0 32.0
5 1.0 0.0 4.0 1.0 27.0
8 1.0 0.0 4.0 1.0 32.0
> df_res.groupby(['Hair', 'Feathers', 'Legs', 'Type']).sum().reset_index()
Hair Feathers Legs Type Count
0 1.0 0.0 0.0 1.0 65.0
1 1.0 0.0 2.0 1.0 68.0
2 1.0 0.0 4.0 1.0 91.0

Pandas: sum up multiple columns into one column without last column

If I have a dataframe similar to this one
Apples Bananas Grapes Kiwis
2 3 nan 1
1 3 7 nan
nan nan 2 3
I would like to add a column like this
Apples Bananas Grapes Kiwis Fruit Total
2 3 nan 1 6
1 3 7 nan 11
nan nan 2 3 5
I guess you could use df['Apples'] + df['Bananas'] and so on, but my actual dataframe is much larger than this. I was hoping a formula like df['Fruit Total']=df[-4:-1].sum could do the trick in one line of code. That didn't work however. Is there any way to do it without explicitly summing up all columns?
You can first select by iloc and then sum:
df['Fruit Total']= df.iloc[:, -4:-1].sum(axis=1)
print (df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 5.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 2.0
To sum all columns, use:
df['Fruit Total']= df.sum(axis=1)
This may be helpful for beginners, so for the sake of completeness, if you know the column names (e.g. they are in a list), you can use:
column_names = ['Apples', 'Bananas', 'Grapes', 'Kiwis']
df['Fruit Total']= df[column_names].sum(axis=1)
This gives you flexibility about which columns you use, since you simply have to manipulate the list column_names, and you can do things like pick only columns with the letter 'a' in their name (a short sketch of that follows the next snippet). Another benefit is that it's easier for humans to understand what the code is doing through column names. Combine this with list(df.columns) to get the column names in a list format. Thus, if you want to drop the last column, all you have to do is:
column_names = list(df.columns)
df['Fruit Total']= df[column_names[:-1]].sum(axis=1)
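For example, a small sketch of the "pick only columns with the letter 'a'" idea mentioned above (the new column name is purely illustrative):
# sum only the fruit columns whose name contains the letter "a"
a_columns = [name for name in ['Apples', 'Bananas', 'Grapes', 'Kiwis'] if 'a' in name.lower()]
df['A-fruit total'] = df[a_columns].sum(axis=1)  # Apples + Bananas + Grapes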
It is possible to do it without knowing the number of columns and even without iloc:
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
cols_to_sum = df.columns[ : df.shape[1]-1]
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 5.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 2.0
Using df['Fruit Total']= df.iloc[:, -4:-1].sum(axis=1) on your original df won't include the last column ('Kiwis'); you should use df.iloc[:, -4:] instead to select all four columns:
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
df['Fruit Total']=df.iloc[:,-4:].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 6.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 5.0
I want to build on Ramon's answer if you want to come up with the total without knowing the shape/size of the dataframe.
I will use his answer below, but fix the one item that excluded the last column from the total.
I have removed the -1 from the shape, changing this:
cols_to_sum = df.columns[ : df.shape[1]-1]
To this:
cols_to_sum = df.columns[ : df.shape[1]]
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
cols_to_sum = df.columns[ : df.shape[1]]
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 6.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 5.0
Which then gives you the correct total without skipping the last column.

Append a new column based on existing columns

Pandas newbie here.
I'm trying to create a new column in my data frame that will serve as a training label when I feed this into a classifier.
The value of the label column is 1.0 if a given Id has (Value1 > 0) or (Value2 > 0) for Apples or Pears, and 0.0 otherwise.
My dataframe is row indexed by Id and looks like this:
Out[30]:
            Value1                                                Value2  \
ProductName    7Up     Apple Cheetos     Onion      Pear PopTart    7Up
ProductType Drinks Groceries  Snacks Groceries Groceries  Snacks Drinks
Id
100            0.0       1.0     2.0       4.0       0.0     0.0    0.0
101            3.0       0.0     0.0       0.0       3.0     0.0    4.0
102            0.0       0.0     0.0       0.0       0.0     2.0    0.0

ProductName     Apple Cheetos     Onion      Pear PopTart
ProductType Groceries  Snacks Groceries Groceries  Snacks
Id
100               1.0     3.0       3.0       0.0     0.0
101               0.0     0.0       0.0       2.0     0.0
102               0.0     0.0       0.0       0.0     1.0
If the pandas wizards could give me a hand with the syntax for this operation - my mind is struggling to put it all together.
Thanks!
The answer provided by @vlad.rad works, but it is not very efficient since pandas has to loop in Python over all rows and cannot take advantage of the speedup of numpy's vectorized functions. The following vectorized solution should be more efficient:
condition = (df['Value1'] > 0) | (df['Value2'] > 0)
df.loc[condition, 'label'] = 1.
df.loc[~condition, 'label'] = 0.
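Note that the frame in the question has MultiIndex columns (Value1/Value2 on top, product names below), so df['Value1'] > 0 returns a whole sub-frame rather than a Series. A hedged sketch of one way to adapt the vectorized idea to that layout, restricted to the Apple and Pear products as the question asks (the level positions and labels are read off the printed frame, so adjust them to your real data):
# keep only the Apple/Pear sub-columns of Value1 and Value2
fruit_cols = [col for col in df.columns
              if col[0] in ('Value1', 'Value2') and col[1] in ('Apple', 'Pear')]

# label is 1.0 if any of those cells is > 0 for a given Id, else 0.0
df['label'] = (df[fruit_cols] > 0).any(axis=1).astype(float)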
Define your function:
def new_column(x):
    if x['Value1'] > 0:
        return 1.0
    if x['Value2'] > 0:
        return 1.0
    return 0.0
Apply it to your data:
df['label'] = df.apply(new_column, axis=1)
