Problem with reindexing a DataFrame (dealing with categorical data) - Python

I have data that looks like this:
subject_id hour_measure urine color heart_rate
3 1 red 40
3 1.15 red 60
4 2 yellow 50
I want to reindex the data to get 24 hours of measurements for every patient. I use the following code:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
df = df.groupby(['subject_id','hour_measure']).mean().reindex(mux).reset_index()
df.to_csv('totalafterreindex.csv')
It works well with numeric values, but it drops the categorical columns. How can I enhance this code to use the mean for numeric columns and the most frequent value for categorical columns?
The wanted output:
subject_id hour_measure urine color heart_rate
3 1 red 40
3 2 red 60
3 3 yellow 50
3 4 yellow 50
.. .. ..

The idea is to use GroupBy.agg with mean for numeric columns and mode for categorical ones; next with iter is added so that None is returned if mode returns an empty Series:
mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1,24)],
names=['subject_id','hour_measure'])
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df1 = df.groupby(['subject_id','hour_measure']).agg(f).reindex(mux).reset_index()
Detail:
print (df.groupby(['subject_id','hour_measure']).agg(f))
urine color heart_rate
subject_id hour_measure
3 1.00 red 40
1.15 red 60
4 2.00 yellow 50
Last, if necessary, forward fill missing values per subject_id with GroupBy.ffill:
cols = df1.columns.difference(['subject_id','hour_measure'])
df1[cols] = df1.groupby('subject_id')[cols].ffill()
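A minimal end-to-end sketch of the approach above, using sample data reconstructed from the question (note the hypothetical use of np.arange(1, 25) so that hours 1 through 24 are all covered):
import numpy as np
import pandas as pd

df = pd.DataFrame({'subject_id': [3, 3, 4],
                   'hour_measure': [1, 1.15, 2],
                   'urine color': ['red', 'red', 'yellow'],
                   'heart_rate': [40, 60, 50]})

mux = pd.MultiIndex.from_product([df['subject_id'].unique(), np.arange(1, 25)],
                                 names=['subject_id', 'hour_measure'])

# mean for numeric columns, most frequent value (mode) for everything else
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else next(iter(x.mode()), None)
df1 = df.groupby(['subject_id', 'hour_measure']).agg(f).reindex(mux).reset_index()

# forward fill the newly created hours within each subject
cols = df1.columns.difference(['subject_id', 'hour_measure'])
df1[cols] = df1.groupby('subject_id')[cols].ffill()
print(df1.head())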

Related

Python pandas. How to include rows before specific conditioned rows?

I have a CSV file, which for example looks like this:
duration    concentration    measurement
1.2         0                10
1.25        0                12
...         ...              ...
10.3        0                11
10.5        10               100
10.6        20               150
10.67       30               156
10.75       0                12.5
11          0                12
...         ...              ...
I filtered out all the rows with concentration 0 with the following code:
dF2 = dF1[dF1["concentration"]>10][["duration","measurement","concentration"]]
But I would like to keep 100 (or some specific number n of) extra rows with concentration 0 before the rows with concentrations bigger than 10 begin, so that I have a baseline when plotting the data.
Does anybody have experience with a similar problem? Could somebody help me, please?
You can use boolean masks for boolean indexing:
# number of baseline rows to keep
n = 2
# cols to keep
cols = ['duration', 'measurement', 'concentration']
# is the concentration greater than 10?
m1 = dF1['concentration'].gt(10)
# is the row one of the n initial concentration 0?
m2 = dF1['concentration'].eq(0).cumsum().le(n)
# if you have values in between 0 and 10 and do not want those
# m2 = (m2:=dF1['concentration'].eq(0)) & m2.cumsum().le(n)
# or
# m2 = df.index.isin(dF1[dF1['concentration'].eq(0)].head(n).index)
# keep rows where either condition is met
dF2 = dF1.loc[m1|m2, cols]
If you only want to keep initial rows before the first value above threshold, change m2 to:
# keep up to n initial rows with concentration=0
# only until the first row above threshold is met
m2 = dF1['concentration'].eq(0).cumsum().le(n) & ~m1.cummax()
output:
duration measurement concentration
0 1.20 10.0 0
1 1.25 12.0 0
4 10.60 150.0 20
5 10.67 156.0 30
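For reference, a minimal reproduction of the result above; the rows hidden behind the "..." are left out, so the index values match the printed output (dF1 is rebuilt here from the visible sample rows):
import pandas as pd

dF1 = pd.DataFrame({
    'duration':      [1.2, 1.25, 10.3, 10.5, 10.6, 10.67, 10.75, 11],
    'concentration': [0,   0,    0,    10,   20,   30,    0,     0],
    'measurement':   [10,  12,   11,   100,  150,  156,   12.5,  12],
})

n = 2
cols = ['duration', 'measurement', 'concentration']
m1 = dF1['concentration'].gt(10)                 # above threshold
m2 = dF1['concentration'].eq(0).cumsum().le(n)   # first n baseline rows
print(dF1.loc[m1 | m2, cols])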
You can filter the records and concat to get the desired results:
n = 100  # number of initial rows with concentration 0 required
dF2 = pd.concat([dF1[dF1["concentration"]==0].head(n),dF1[dF1["concentration"]>10]])[["duration","measurement","concentration"]]
You can simply filter the data frame for when the concentration is zero, select the top 100 (or top n) rows from your filtered data frame using 'head', and append that to your dF2.
n = 100 # you can change this to include the number of rows you want.
df_baseline = dF1[dF1["concentration"] == 0][["duration","measurement","concentration"]].head(n)
dF2 = dF1[dF1["concentration"]>10][["duration","measurement","concentration"]]
df_final = pd.concat([df_baseline, dF2])

How to decode column value from rare label by matching column names

I have two dataframes as shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'grade': rng.choice(list('ACD'),size=(5)),
'dash': rng.choice(list('PQRS'),size=(5)),
'dumeel': rng.choice(list('QWER'),size=(5)),
'dumma': rng.choice((1234),size=(5)),
'target': rng.choice([0,1],size=(5))
})
tdf = pd.DataFrame({'Id': [1,1,1,1,3,3,3],
'feature': ['grade=Rare','dash=Q','dumma=rare','dumeel=R','dash=Rare','dumma=rare','grade=D'],
'value': [0.2,0.45,-0.32,0.56,1.3,1.5,3.7]})
My objective is to
a) Replace the Rare or rare values in the feature column of the tdf dataframe with the original value from the cdf dataframe.
b) To identify the original value, we can make use of the string before the = in the Rare/rare entries: that string is the column name in the cdf dataframe (from which the original value replacing rare can be found).
I was trying something like the below but am not sure how to go on from here:
replace_df = cdf.merge(tdf,how='inner',on='Id')
replace_df ["replaced_feature"] = np.where(((replace_df["feature"].str.contains('rare',regex=True)]) & (replace_df["feature"].str.split('='))])
I have to apply this to big data where I have a million rows and more than 1000 replacements to be made like this.
I expect my output to be as shown below.
Here is one possible approach using MultiIndex.map to substitute values from cdf into tdf:
s = tdf['feature'].str.split('=')
m = s.str[1].isin(['rare', 'Rare'])
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())
tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)
print(tdf)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
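For clarity, a small sketch of the lookup that MultiIndex.map relies on above: stacking cdf yields a Series keyed by (Id, column name), so each (Id, 'grade'), (Id, 'dash'), ... pair maps to the original value that replaces the Rare/rare placeholder (the exact values depend on the rng seed):
lookup = cdf.set_index('Id').stack()
print(lookup.head())             # MultiIndex (Id, column name) -> original value
print(lookup.loc[(1, 'grade')])  # e.g. the grade recorded for Id 1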
# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower()=='rare' else x+'='+y for x,y in tdf['feature'].str.split('=')]
# create a mask where feature is in columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter your frame and use merge to join cdf on the id and feature column - after you use stack
tdf.loc[mask, 'feature'] = tdf.loc[mask, 'feature']+'='+tdf.loc[mask].merge(cdf.set_index('Id').stack().to_frame(),
right_index=True, left_on=['Id', 'feature'])[0].astype(str)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
My feeling is there's no need to look for Rare values.
Extract the column name from tdf to look up in cdf. Afterwards, flatten your cdf dataframe to extract the right values:
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
.loc[zip(r.values, r.index)] \
.astype(str).values
Output:
>>> tdf
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=A 3.70
>>> r
Id # <- the index is the row of cdf
1 grade # <- the values are the column of cdf
1 dash
1 dumma
1 dumeel
3 dash
3 dumma
3 grade
Name: feature, dtype: object

Python - delete a row based on condition from a pandas.core.series.Series after groupby

I have this pandas.core.series.Series after grouping by the 2 columns case and area:
case  area
A     1        2494
      2        2323
B     1       59243
      2       27125
      3          14
I want to keep only the areas that are in case A, which means the result should be like this:
case  area
A     1        2494
      2        2323
B     1       59243
      2       27125
I tried this code:
a = df['B'][~df['B'].index.isin(df['A'].index)].index
df['B'].drop(a)
And it worked; the output was as expected. But it didn't drop the rows in the dataframe itself, it is still the same. When I assign the result of dropping, all the values become NaN:
df['B'] = df['B'].drop(a)
What should I do?
It is possible to drop after grouping; here's one way:
import pandas as pd
import numpy as np
np.random.seed(1)
ungroup_df = pd.DataFrame({
'case':[
'A','A','A','A','A','A',
'A','A','A','A','A','A',
'B','B','B','B','B','B',
'B','B','B','B','B','B',
],
'area':[
1,2,1,2,1,2,
1,2,1,2,1,2,
1,2,3,1,2,3,
1,2,3,1,2,3,
],
'value': np.random.random(24),
})
df = ungroup_df.groupby(['case','area'])['value'].sum()
print(df)
#index into the multi-index to just the 'A' areas
#the ":" is saying any value at the first level (A or B)
#then the df.loc['A'].index is filtering to second level of index (area) that match A's
filt_df = df.loc[:,df.loc['A'].index]
print(filt_df)
Test df:
case area
A 1 1.566114
2 2.684593
B 1 1.983568
2 1.806948
3 2.079145
Name: value, dtype: float64
Output after dropping
case area
A 1 1.566114
2 2.684593
B 1 1.983568
2 1.806948
Name: value, dtype: float64
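A closely related variant, offered only as a sketch: the same selection can be written with Index.isin at the 'area' level (assuming the grouped Series is named df as above), which also sidesteps the NaN problem that appears when a partial selection is assigned back:
# keep every (case, area) row whose area also occurs under case 'A'
keep_areas = df.loc['A'].index
filt_df = df[df.index.isin(keep_areas, level='area')]
print(filt_df)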

Check if column consists of numbers in string data type

Below is a script for a simplified version of the df in question:
import pandas as pd
df = pd.DataFrame({
'feature' : ['cd_player', 'sat_nav', 'sub_woofer', 'usb_port','cd_player', 'sat_nav', 'sub_woofer', 'usb_port','cd_player', 'sat_nav', 'sub_woofer', 'usb_port'],
'feature_value' : ['1','1','0','4','1','0','0','1','1','1','1','0'],
'feature_colour' : ['red','orange','yellow','green','blue','indigo','violet','red','orange','yellow','green','blue']
})
df
feature feature_value feature_colour
0 cd_player 1 red
1 sat_nav 1 orange
2 sub_woofer 0 yellow
3 usb_port 4 green
4 cd_player 1 blue
5 sat_nav 0 indigo
6 sub_woofer 0 violet
7 usb_port 1 red
8 cd_player 1 orange
9 sat_nav 1 yellow
10 sub_woofer 1 green
11 usb_port 0 blue
df.dtypes
feature           object
feature_value     object
feature_colour    object
dtype: object
I want to find a way to identify all columns with numerical values and convert their datatypes to integers and/or floats. Of course in this example it is easy to do manually, however the DF in question has ~50 potential columns with numerical values, and as they all have object dtypes, it would be rather inefficient to determine them manually.
INTENDED OUTPUT:
df.dtypes
feature           object
feature_value     int64
feature_colour    object
dtype: object
Any help would be greatly appreciated.
Try this:
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
df.dtypes
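Note that errors='ignore' is deprecated in recent pandas releases. A minimal sketch of the same idea without it, converting a column only when every non-missing value parses as a number (the helper name is just for illustration):
def to_numeric_where_possible(frame):
    out = frame.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors='coerce')
        # keep the conversion only if it did not introduce new missing values
        if converted.isna().sum() == out[col].isna().sum():
            out[col] = converted
    return out

df = to_numeric_where_possible(df)
print(df.dtypes)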

Python random sample selection based on multiple conditions

I want to make a random sample selection in Python from the following df such that at least 65% of the resulting sample has the color yellow and the cumulative sum of the selected quantities is less than or equal to 18.
Original Dataset:
Date Id color qty
02-03-2018 A red 5
03-03-2018 B blue 2
03-03-2018 C green 3
04-03-2018 D yellow 4
04-03-2018 E yellow 7
04-03-2018 G yellow 6
04-03-2018 H orange 8
05-03-2018 I yellow 1
06-03-2018 J yellow 5
I have got the total-quantity condition covered but am stuck on how to integrate the percentage condition:
df2 = df1.sample(n=df1.shape[0])
df3= df2[df2.qty.cumsum() <= 18]
Required dataset:
Date Id color qty
03-03-2018 B blue 2
04-03-2018 D yellow 4
04-03-2018 G yellow 6
06-03-2018 J yellow 5
Or something like this:
Date Id color qty
02-03-2018 A red 5
04-03-2018 D yellow 4
04-03-2018 E yellow 7
05-03-2018 I yellow 1
Any help would be really appreciated!
Thanks in advance.
Filter rows with 'yellow' and select a random sample of at least 65% of your total sample size:
import random
yellow_size = random.randint(65, 100) / 100
df_yellow = df3[df3['color'] == 'yellow'].sample(int(yellow_size * sample_size))
Filter rows with other colors and select a random sample for the remainder of your sample size:
others_size = 1 - yellow_size
df_others = df3[df3['color'] != 'yellow'].sample(int(others_size * sample_size))
Combine them both and shuffle the rows.
df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
UPDATE:
If you want to check for both conditions simultaneously, this could be one way to do it:
import random
df_sample = df
while sum(df_sample['qty']) > 18:
    yellow_size = random.randint(65, 100) / 100
    df_yellow = df[df['color'] == 'yellow'].sample(int(yellow_size * sample_size))
    others_size = 1 - yellow_size
    df_others = df[df['color'] != 'yellow'].sample(int(others_size * sample_size))
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
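For context, a self-contained sketch of the loop above on the question's data (Date column omitted); sample_size is an assumed target row count, which the answer leaves to the reader, and math.ceil is used so the yellow share never drops below 65%:
import math
import random
import pandas as pd

df = pd.DataFrame({
    'Id': list('ABCDEGHIJ'),
    'color': ['red', 'blue', 'green', 'yellow', 'yellow', 'yellow',
              'orange', 'yellow', 'yellow'],
    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5],
})

sample_size = 4                                       # assumed number of rows wanted
df_sample = df
while df_sample['qty'].sum() > 18:
    yellow_size = random.randint(65, 100) / 100
    n_yellow = math.ceil(yellow_size * sample_size)   # at least 65% yellow rows
    df_yellow = df[df['color'] == 'yellow'].sample(n_yellow)
    df_others = df[df['color'] != 'yellow'].sample(sample_size - n_yellow)
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
print(df_sample)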
I would use this package to over sample your yellows into a new sample that has the balance you want:
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
From there, just randomly select items and check the sum until you have the set you want.
Something less time-complex would be binary searching over a range the length of your data frame and using the binary search term as your sample size until you get the cumsum you want. This assumes the feature is symmetrically distributed.
I think this example will help you. I add a column df2['yellow_rate'] and calculate the rate; you only need to check the df2.iloc[df2.shape[0] - 1]['yellow_rate'] value.
df1 = pd.DataFrame({'id': ['A','B','C','D','E','G','H','I','J'],
                    'color': ['red','blue','green','yellow','yellow','yellow','orange','yellow','yellow'],
                    'qty': [5, 2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
df2['yellow_rate'] = df2[df2.qty.cumsum() <= 18]['color'].apply(lambda x: 1 if x == 'yellow' else 0)
df2 = pd.concat([df2.dropna(), (df2.sum(numeric_only=True) / df2.count(numeric_only=True)).to_frame().T], ignore_index=True)
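A brief usage note for the approach above, assuming df2 was built as shown: the appended summary row carries the mean of yellow_rate over the kept rows, so the last row tells you whether the 65% condition held:
rate = df2.iloc[df2.shape[0] - 1]['yellow_rate']
print(f"yellow share: {rate:.0%}")
if rate < 0.65:
    print("condition not met - draw a new sample")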
