Find row with nearest value in a subset of a pandas DataFrame - python

I have a dataframe of the following structure:
import pandas as pd
df = pd.DataFrame({'x': [1, 5, 8, 103, 105, 112],
                   'date': pd.DatetimeIndex(('2022-02-01', '2022-02-03', '2022-02-06',
                                             '2022-02-05', '2022-02-05', '2022-02-07'))})
x date
0 1 2022-02-01
1 5 2022-02-03
2 8 2022-02-06
3 103 2022-02-05
4 105 2022-02-05
5 112 2022-02-07
How can I add a new column y that contains x if x < 100, and otherwise the x value of the row with the next smaller date, taken from the subset where x < 100?
What I currently have is this code. It works, but doesn't look very efficient:
df['y'] = df.x
df_ref = df.loc[df.x < 100].sort_values('date').copy()
df_ref.set_index('x', inplace=True)
for ix, row in df.iterrows():
    if row.x >= 100:
        delta = row.date - df_ref.date
        delta_gt = delta.loc[delta > pd.Timedelta(0)]
        if delta_gt.size > 0:
            df.loc[ix, 'y'] = delta_gt.idxmin()
x date y
0 1 2022-02-01 1
1 5 2022-02-03 5
2 8 2022-02-06 8
3 103 2022-02-05 5
4 105 2022-02-05 5
5 112 2022-02-07 8

Sort by date, mask the values greater than or equal to 100, ffill, then sort by index again:
(df.sort_values(by='date')
   .assign(y=df['x'].mask(df['x'].ge(100)))
   .assign(y=lambda d: d['y'].ffill())
   .sort_index()
)
Output:
x date y
0 1 2022-02-01 1
1 5 2022-02-03 5
2 8 2022-02-06 8
3 103 2022-02-05 5
4 105 2022-02-05 5
5 112 2022-02-07 8

We can use merge_asof:
#df.date = pd.to_datetime(df.date)
df = df.sort_values('date')
out = pd.merge_asof(df,
                    df[df['x'] < 100].rename(columns={'x': 'y'}),
                    on='date',
                    direction='backward').sort_values('x')
out
Out[160]:
x date y
0 1 2022-02-01 1
1 5 2022-02-03 5
4 8 2022-02-06 8
2 103 2022-02-05 5
3 105 2022-02-05 5
5 112 2022-02-07 8
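One side note on this approach: with direction='backward', merge_asof also matches rows that share the exact same date, while the asker's loop only considers strictly earlier dates (it uses delta > pd.Timedelta(0)). If that distinction matters, a possible tweak (a sketch of my own, not part of the answer above) is to pass allow_exact_matches=False and then restore x for the rows below 100:
left = df.sort_values('date')
right = (df.loc[df['x'] < 100, ['date', 'x']]
           .rename(columns={'x': 'y'})
           .sort_values('date'))
out = pd.merge_asof(left, right, on='date',
                    direction='backward',
                    allow_exact_matches=False)
# rows with x < 100 keep their own value rather than a strictly earlier one
out['y'] = out['y'].where(out['x'] >= 100, out['x'])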

Related

Python delete rows for each group after first occurrence in a column

I have a dataframe as follows:
df = pd.DataFrame({'Key': [1,1,1,1,2,2,2,4,4,4,5,5],
                   'Activity': ['A','A','H','B','B','H','H','A','C','H','H','B'],
                   'Date': ['2022-12-03','2022-12-04','2022-12-06','2022-12-08','2022-12-03','2022-12-06','2022-12-10','2022-12-03','2022-12-04','2022-12-07','2022-12-03','2022-12-13']})
I need to count the activities for each 'Key' that occur before 'Activity' == 'H', as in the required output (shown as an image in the original post).
My approach:
Sort df by Key & Date (the sample input is already sorted).
Drop the rows that occur after the 'H' activity in each group.
Group the result: df.groupby(['Key', 'Activity']).count()
Is there a better approach? If not, please help me with the code for dropping the rows that occur after the 'H' activity in each group.
Thanks in advance!
You can bring the H dates "back" into each previous row to use in a comparison.
First mark each H date in a new column:
df.loc[df["Activity"] == "H" , "End"] = df["Date"]
Key Activity Date End
0 1 A 2022-12-03 NaT
1 1 A 2022-12-04 NaT
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 NaT
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 NaT
8 4 C 2022-12-04 NaT
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
Backward fill the new column for each group:
df["End"] = df.groupby("Key")["End"].bfill()
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
2 1 H 2022-12-06 2022-12-06
3 1 B 2022-12-08 NaT
4 2 B 2022-12-03 2022-12-06
5 2 H 2022-12-06 2022-12-06
6 2 H 2022-12-10 2022-12-10
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
9 4 H 2022-12-07 2022-12-07
10 5 H 2022-12-03 2022-12-03
11 5 B 2022-12-13 NaT
You can then select the rows with Date before End:
df.loc[df["Date"] < df["End"]]
Key Activity Date End
0 1 A 2022-12-03 2022-12-06
1 1 A 2022-12-04 2022-12-06
4 2 B 2022-12-03 2022-12-06
7 4 A 2022-12-03 2022-12-07
8 4 C 2022-12-04 2022-12-07
To generate the final form - you can use .pivot_table()
(df.loc[df["Date"] < df["End"]]
.pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
.reindex(df["Key"].unique()) # Add in keys with no match e.g. `5`
.fillna(0)
.astype(int))
Activity A B C
Key
1 2 0 0
2 0 1 0
4 1 0 1
5 0 0 0
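Putting the steps above together in one place (a rough consolidation of this answer, assuming Date has been parsed with pd.to_datetime so the NaT/bfill behaviour shown above applies):
import pandas as pd

df["Date"] = pd.to_datetime(df["Date"])
df.loc[df["Activity"] == "H", "End"] = df["Date"]   # mark each H date
df["End"] = df.groupby("Key")["End"].bfill()        # pull it back within each Key
out = (df.loc[df["Date"] < df["End"]]               # rows strictly before the next H
         .pivot_table(index="Key", columns="Activity", values="Date", aggfunc="count")
         .reindex(df["Key"].unique())               # keep keys with no match, e.g. 5
         .fillna(0)
         .astype(int))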
Try this:
(df.loc[df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)]
   .set_index('Key')['Activity']
   .str.get_dummies()
   .groupby(level=0).sum()
   .reindex(df['Key'].unique(), fill_value=0)
   .reset_index())
Output:
Key A B C
0 1 2 0 0
1 2 0 1 0
2 4 1 0 1
3 5 0 0 0
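For what it's worth, the mask in the first line can be inspected on its own: the cumulative sum of the 'H' indicator stays at 0 within each Key until the first 'H', so .eq(0) keeps only the rows before it. A small check (my own illustration, not part of the answer):
mask = df['Activity'].eq('H').groupby(df['Key']).cumsum().eq(0)
print(df.assign(before_first_H=mask))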
You can try:
# sort by Key and Date
df.sort_values(['Key', 'Date'], inplace=True)
# this is to keep Key in the result when no values are kept after the filter
df.Key = df.Key.astype('category')
# filter all rows after the 1st H for each Key and then pivot
df[~df.Activity.eq('H').groupby(df.Key).cummax()].pivot_table(
    index='Key', columns='Activity', aggfunc='size'
).reset_index()
#Activity Key A B C
#0 1 2 0 0
#1 2 0 1 0
#2 4 1 0 1
#3 5 0 0 0

Python pandas add N multiplied by the value of the current second and add with the value of N multiplied by the value of the previous second

I have a CSV file on which I would like to do some calculations.
My CSV looks something like this:
Time      Value1  Value2
10:00:00  4       1
10:00:01  5       0
10:00:02  4       4
10:00:03  5       3
10:00:04  4       1
10:00:05  5       2
10:00:06  4       4
10:00:07  5       8
10:00:08  4       4
10:00:09  5       8
In the background, I add the variable N = 10.
I would like to add a new column to the same CSV file, calculated as (N + 1) * Value1 of the current row plus (N + 1) * Value2 of the previous row (for example, the 66 in the second row below comes from (10 + 1) * 5 + (10 + 1) * 1).
To get something like this:
Time      Value1  Value2  Value3
10:00:00  4       1       44
10:00:01  5       0       66
10:00:02  4       4       44
10:00:03  5       3       99
10:00:04  4       1       77
10:00:05  5       2       66
10:00:06  4       4       66
10:00:07  5       8       99
10:00:08  4       4       132
10:00:09  5       8       99
I tried something like
out = df.groupby(['Time'])
out = %N+1 * df['Value1'] + %N+1 * df['Value2']
But it doesn't quite work.
import pandas as pd
df = pd.DataFrame(
    [['10:00:00', 4, 1], ['10:00:01', 5, 0]],
    columns=['Time', 'Value1', 'Value2']
)
N = 10
df['Value3'] = (N + 1) * (df.Value1 + df.Value2.shift(1))
print(df)
prints
index  Time      Value1  Value2  Value3
0      10:00:00  4       1       NaN
1      10:00:01  5       0       66.0
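If the first row should come out as 44, as in the expected output (i.e. treating the missing previous Value2 as 0), one possible tweak, assuming that is the intended behaviour, is to fill the shifted value:
df['Value3'] = (N + 1) * (df.Value1 + df.Value2.shift(1).fillna(0))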

How can I assign the result of a filtered, grouped aggregation as a new column in the original Pandas DataFrame

I am having trouble making the transition from using R data.table to using Pandas for data munging.
Specifically, I am trying to assign the results of aggregations back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform() is the right approach.
To illustrate, I'm trying to replicate what I would do in R with:
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr', `:=`(vwap_ish = sum(hours * price),
                       avg_id = mean(id)),
   .(node, term)]
df
# id term node hours price vwap_ish avg_id
# 1: 1 qtr A 300 107 90600 2
# 2: 2 qtr A 300 104 90600 2
# 3: 3 qtr A 300 91 90600 2
# 4: 4 qtr B 300 89 95400 5
# 5: 5 qtr B 300 113 95400 5
# 6: 6 qtr B 300 116 95400 5
# 7: 7 mth A 50 110 NA NA
# 8: 8 mth A 100 119 NA NA
# 9: 9 mth A 150 99 NA NA
# 10: 10 mth B 50 111 NA NA
# 11: 11 mth B 100 106 NA NA
# 12: 12 mth B 150 108 NA NA
Using Pandas, I can create an object from df that contains all rows of the original df, with the aggregations
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df1
"""
id term node hours price vwap_ish avg_id
node term
B mth 9 10 mth B 50 111 32350 10.0
10 11 mth B 100 106 32350 10.0
11 12 mth B 150 108 32350 10.0
qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
A mth 6 7 mth A 50 110 32250 7.0
7 8 mth A 100 119 32250 7.0
8 9 mth A 150 99 32250 7.0
qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
"""
This doesn't really get me what I want because a) it re-orders and creates indices, and b) it has calculated the aggregation for all rows.
I can get the subset easily enough with
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df2
"""
id term node hours price vwap_ish avg_id
node term
A qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
B qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
And also a few variations of .merge and .join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
In writing this I realised I could take my first approach and then just do
df[df.term=='qtr', ['vwap_ish','avg_id']] = NaN
but this seems quite hacky. It means I have to use a new column rather than overwriting an existing one on the filtered rows, and if the aggregation function were to break, say when term='mth', that would be problematic too.
I'd really appreciate any help with this as it's been a very steep learning curve to try to make the transition from data.table to Pandas and there's so much I would do in a one-liner that is taking me hours to figure out.
You can add the group_keys=False parameter to remove the MultiIndex, so the left join works well:
df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
    lambda gp: gp.assign(
        vwap_ish = (gp.hours * gp.price).sum(),
        avg_id = np.mean(gp.id)
    )
)
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
Solution without left join:
m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
                                         .apply(lambda gp: gp.assign(
                                             vwap_ish = (gp.hours * gp.price).sum(),
                                             avg_id = np.mean(gp.id)
                                         )))
An improved solution uses named aggregation; creating the vwap_ish column before the groupby can also improve performance:
df2 = (df[df.term == 'qtr']
       .assign(vwap_ish = lambda x: x.hours * x.price)
       .groupby(['node','term'], as_index=False)
       .agg(vwap_ish=('vwap_ish','sum'),
            avg_id=('id','mean')))
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
One option, if you are keen on performance and want to avoid apply, is to break it into individual steps:
Compute the product of hours and price before grouping:
temp = df.assign(vwap_ish = df.hours * df.price, avg_id = df.id)
Get the groupby object after filtering term:
temp = (temp
        .loc[temp.term.eq('qtr'), ['vwap_ish', 'avg_id']]
        .groupby([df.node, df.term])
        )
Assign back the aggregated values with transform; pandas will take care of the index alignment:
(df
 .assign(vwap_ish = temp.vwap_ish.transform('sum'),
         avg_id = temp.avg_id.transform('mean'))
)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
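The same transform idea can also be written as a direct .loc assignment on the filtered rows (a compact sketch of my own, not part of the answer above):
m = df['term'].eq('qtr')
sub = df.loc[m]
keys = [sub['node'], sub['term']]
# transform keeps the per-row shape, so the result aligns back onto the qtr rows
df.loc[m, 'vwap_ish'] = (sub['hours'] * sub['price']).groupby(keys).transform('sum')
df.loc[m, 'avg_id'] = sub['id'].groupby(keys).transform('mean')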
This is just an aside, and you can totally ignore it - pydatatable attempts to mimic R's datatable as much as it can. This is one solution with pydatatable:
from datatable import dt, f, by, ifelse, update
DT = dt.Frame(df)
query = f.term == 'qtr'
agg = {'vwap_ish': ifelse(query, (f.hours * f.price), np.nan).sum(),
       'avg_id' : ifelse(query, f.id.mean(), np.nan).sum()}
# update is a near equivalent to `:=`
DT[:, update(**agg), by('node', 'term')]
DT
| id term node hours price vwap_ish avg_id
| int64 str32 str32 int64 int64 float64 float64
-- + ----- ----- ----- ----- ----- -------- -------
0 | 1 qtr A 300 107 90600 6
1 | 2 qtr A 300 104 90600 6
2 | 3 qtr A 300 91 90600 6
3 | 4 qtr B 300 89 95400 15
4 | 5 qtr B 300 113 95400 15
5 | 6 qtr B 300 116 95400 15
6 | 7 mth A 50 110 NA NA
7 | 8 mth A 100 119 NA NA
8 | 9 mth A 150 99 NA NA
9 | 10 mth B 50 111 NA NA
10 | 11 mth B 100 106 NA NA
11 | 12 mth B 150 108 NA NA
[12 rows x 7 columns]

Create New Pandas DataFrame Column Equaling Values From Other Row in Same DataFrame

I'm new to python and very new to Pandas. I've looked through the Pandas documentation and tried multiple ways to solve this problem unsuccessfully.
I have a DataFrame with timestamps in one column and prices in another, such as:
d = {'TimeStamp': [1603822620000, 1603822680000,1603822740000, 1603823040000,1603823100000,1603823160000,1603823220000], 'Price': [101,105,102,108,105,101,106], 'OtherData1': [1,2,3,4,5,6,7], 'OtherData2': [7,6,5,4,3,2,1]}
df = pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2
0 1603822620000 101 1 7
1 1603822680000 105 2 6
2 1603822740000 102 3 5
3 1603823040000 108 4 4
4 1603823100000 105 5 3
5 1603823160000 101 6 2
6 1603823220000 106 7 1
In addition to the two columns of interest, this DataFrame also has additional columns with data not particularly relevant to the question (represented with OtherData Cols).
I want to create a new column 'Fut2Min' (Price Two Minutes into the Future). There may be missing data, so this problem can't be solved by simply getting the data from 2 rows below.
I'm trying to find a way to make the value of the Fut2Min column in each row equal to the Price at the row with timestamp + 120000 (2 minutes into the future), or null (NaN or whatever) if the corresponding timestamp doesn't exist.
For the example data, the DF should be updated to:
(Code used to mimic desired result)
d = {'TimeStamp': [1603822620000, 1603822680000, 1603822740000, 1603822800000, 1603823040000, 1603823100000, 1603823160000, 1603823220000],
     'Price': [101, 105, 102, 108, 105, 101, 106, 111],
     'OtherData1': [1, 2, 3, 4, 5, 6, 7, 8],
     'OtherData2': [8, 7, 6, 5, 4, 3, 2, 1],
     'Fut2Min': [102, 108, 'NaN', 'NaN', 106, 111, 'NaN', 'NaN']}
df = pd.DataFrame(d)
df
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102
1 1603822680000 105 2 7 108
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106
5 1603823100000 101 6 3 111
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
Assuming that the DataFrame is:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 0
1 1603822680000 105 2 7 0
2 1603822740000 102 3 6 0
3 1603822800000 108 4 5 0
4 1603823040000 105 5 4 0
5 1603823100000 101 6 3 0
6 1603823160000 106 7 2 0
7 1603823220000 111 8 1 0
Then, if you use pandas.DataFrame.apply, along the column axis:
import pandas as pd
def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if (futTimeStamp in df.TimeStamp.values):
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

df['Fut2Min'] = df.apply(Fut2MinFunc, axis=1)
You will get exactly what you describe as:
TimeStamp Price OtherData1 OtherData2 Fut2Min
0 1603822620000 101 1 8 102.0
1 1603822680000 105 2 7 108.0
2 1603822740000 102 3 6 NaN
3 1603822800000 108 4 5 NaN
4 1603823040000 105 5 4 106.0
5 1603823100000 101 6 3 111.0
6 1603823160000 106 7 2 NaN
7 1603823220000 111 8 1 NaN
EDIT 2: I have updated the solution since it had some sloppy parts (exchanged the list for index determination with a dictionary and restricted the search for timestamps).
This (with import numpy as np)
indices = {ts - 120000: i for i, ts in enumerate(df['TimeStamp'])}
df['Fut2Min'] = [
    np.nan
    if (ts + 120000) not in df['TimeStamp'].values[i:] else
    df['Price'].iloc[indices[ts]]
    for i, ts in enumerate(df['TimeStamp'])
]
gives you
TimeStamp Price Fut2Min
0 1603822620000 101 102.0
1 1603822680000 105 108.0
2 1603822740000 102 NaN
3 1603822800000 108 NaN
4 1603823040000 105 106.0
5 1603823100000 101 111.0
6 1603823160000 106 NaN
7 1603823220000 111 NaN
But I'm not sure if that is an optimal solution.
EDIT: Inspired by the discussion in the comments I did some timing:
With the sample frame
from itertools import accumulate
import numpy as np
rng = np.random.default_rng()
n = 10000
timestamps = [1603822620000 + t
              for t in accumulate(rng.integers(1, 4) * 60000
                                  for _ in range(n))]
df = pd.DataFrame({'TimeStamp': timestamps, 'Price': n * [100]})
TimeStamp Price
0 1603822680000 100
... ... ...
9999 1605030840000 100
[10000 rows x 2 columns]
and the two test functions
# (1) Other solution
def Fut2MinFunc(row):
    futTimeStamp = row.TimeStamp + 120000
    if (futTimeStamp in df.TimeStamp.values):
        return df.loc[df['TimeStamp'] == futTimeStamp, 'Price'].iloc[0]
    else:
        return None

def test_1():
    df['Fut2Min'] = df.apply(Fut2MinFunc, axis=1)

# (2) Solution here
def test_2():
    indices = list(df['TimeStamp'] - 120000)
    df['Fut2Min'] = [
        np.nan
        if (timestamp + 120000) not in df['TimeStamp'].values else
        df['Price'].iloc[indices.index(timestamp)]
        for timestamp in df['TimeStamp']
    ]
I conducted the experiment
from timeit import timeit
t1 = timeit('test_1()', number=100, globals=globals())
t2 = timeit('test_2()', number=100, globals=globals())
print(t1, t2)
with the result
135.98962861 40.306039344
which seems to imply that the version here is faster? (I also measured directly with time() and without the wrapping in functions and the results are virtually identical.)
With my updated version the result looks like
139.33713767799998 14.178187169000012
I finally did one try with a frame with 1,000,000 rows (number=1) and the result was
763.737430931 175.73120002400003
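As a further point of comparison (my own sketch, not one of the timed versions above), the lookup can also be done without any Python-level loop by mapping each shifted timestamp against a Price series keyed by TimeStamp; this assumes the timestamps are unique:
# Price indexed by TimeStamp; missing keys map to NaN
price_by_ts = df.set_index('TimeStamp')['Price']
df['Fut2Min'] = (df['TimeStamp'] + 120000).map(price_by_ts)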

How to get percentage count based on multiple columns in pandas dataframe?

I have 20 columns in a dataframe.
I list 4 of them here as an example:
is_guarantee: 0 or 1
hotel_star: 0, 1, 2, 3, 4, 5
order_status: 40, 60, 80
journey (Label): 0, 1, 2
is_guarantee hotel_star order_status journey
0 0 5 60 0
1 1 5 60 0
2 1 5 60 0
3 0 5 60 1
4 0 4 40 0
5 0 4 40 1
6 0 4 40 1
7 0 3 60 0
8 0 2 60 0
9 1 5 60 0
10 0 2 60 0
11 0 2 60 0
But the system needs the occurrence matrix as input in a specific format (posted as an image in the original question) in order to work.
Can anybody help?
df1 = pd.DataFrame(index=range(0,20))
df1['is_guarantee'] = np.random.choice([0,1], df1.shape[0])
df1['hotel_star'] = np.random.choice([0,1,2,3,4,5], df1.shape[0])
df1['order_status'] = np.random.choice([40,60,80], df1.shape[0])
df1['journey'] = np.random.choice([0,1,2], df1.shape[0])
I think you need to reshape with melt and get counts with groupby and size, then reshape with unstack.
Then divide by the sum per row and join the MultiIndex levels into a single index:
df = (df.melt('journey')
        .astype(str)
        .groupby(['variable', 'journey', 'value'])
        .size()
        .unstack(1, fill_value=0))
df = (df.div(df.sum(1), axis=0)
        .mul(100)
        .add_prefix('journey_')
        .set_index(df.index.map(' = '.join))
        .rename_axis(None, axis=1))
print (df)
journey_0 journey_1
hotel_star = 2 100.000000 0.000000
hotel_star = 3 100.000000 0.000000
hotel_star = 4 33.333333 66.666667
hotel_star = 5 80.000000 20.000000
is_guarantee = 0 66.666667 33.333333
is_guarantee = 1 100.000000 0.000000
order_status = 40 33.333333 66.666667
order_status = 60 88.888889 11.111111
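A possible alternative sketch (my own, not from the answer) starts again from the original df and uses pd.crosstab with normalize='index' to get the row-wise percentages directly:
m = df.melt('journey')
out = (pd.crosstab(index=[m['variable'], m['value']],
                   columns=m['journey'],
                   normalize='index')   # each row sums to 1
         .mul(100)
         .add_prefix('journey_'))
print(out)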
