I have a dataframe 'df'. Using the validation data validData, I want to compute the response rate (Florence = 1/Yes) using rfm_aboveavg (the RFM combinations whose response rates are above the overall response rate). The response rate is the share of 1/Yes among all 0/No and 1/Yes rows, so it would be rfm_crosstab[1] / rfm_crosstab['All'].
Using the results from the validation data, I want to display only the rows whose RFM value also appears in the training data output. How do I do this?
Data: 'df'
Seq# ID# Gender M R F FirstPurch ChildBks YouthBks CookBks ... ItalCook ItalAtlas ItalArt Florence Related Purchase Mcode Rcode Fcode Yes_Florence No_Florence
0 1 25 1 297 14 2 22 0 1 1 ... 0 0 0 0 0 5 4 2 0 1
1 2 29 0 128 8 2 10 0 0 0 ... 0 0 0 0 0 4 3 2 0 1
2 3 46 1 138 22 7 56 2 1 2 ... 1 0 0 0 2 4 4 3 0 1
3 4 47 1 228 2 1 2 0 0 0 ... 0 0 0 0 0 5 1 1 0 1
4 5 51 1 257 10 1 10 0 0 0 ... 0 0 0 0 0 5 3 1 0 1
My code: Crosstab for training data trainData
trainData, validData = train_test_split(df, test_size=0.4, random_state=1)
# Response rate for training data as a whole
responseRate = (sum(trainData.Florence == 1) / sum(trainData.Florence == 0)) * 100
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
trainData['RFM'] = trainData['Mcode'].astype(str) + trainData['Rcode'].astype(str) + trainData['Fcode'].astype(str)
rfm_crosstab = pd.crosstab(index = [trainData['RFM']], columns = trainData['Florence'], margins = True)
rfm_crosstab['Percentage of 1/Yes'] = 100 * (rfm_crosstab[1] / rfm_crosstab['All'])
# RFM combinations with response rates above the overall response rate
rfm_aboveavg = rfm_crosstab['Percentage of 1/Yes'] > responseRate
rfm_crosstab[rfm_aboveavg]
Output: Training data
Florence 0 1 All Percentage of 1/Yes
RFM
121 3 2 5 40.000000
131 9 1 10 10.000000
212 1 2 3 66.666667
221 6 3 9 33.333333
222 6 1 7 14.285714
313 2 1 3 33.333333
321 17 3 20 15.000000
322 20 4 24 16.666667
323 2 1 3 33.333333
341 61 10 71 14.084507
343 17 2 19 10.526316
411 12 3 15 20.000000
422 26 5 31 16.129032
423 32 8 40 20.000000
441 96 12 108 11.111111
511 19 4 23 17.391304
513 44 8 52 15.384615
521 24 5 29 17.241379
523 74 16 90 17.777778
533 177 28 205 13.658537
My code: Crosstab for validation data validData
# Response rate for RFM categories
# RFM: Combine R, F, M categories into one category
validData['RFM'] = validData['Mcode'].astype(str) + validData['Rcode'].astype(str) + validData['Fcode'].astype(str)
rfm_crosstab1 = pd.crosstab(index = [validData['RFM']], columns = validData['Florence'], margins = True)
rfm_crosstab1['Percentage of 1/Yes'] = 100 * (rfm_crosstab1[1] / rfm_crosstab1['All'])
rfm_crosstab1
Output: Validation data
Florence 0 1 All Percentage of 1/Yes
RFM
131 3 1 4 25.000000
141 8 0 8 0.000000
211 2 1 3 33.333333
212 2 0 2 0.000000
213 0 1 1 100.000000
221 5 0 5 0.000000
222 2 0 2 0.000000
231 21 1 22 4.545455
232 3 0 3 0.000000
233 1 0 1 0.000000
241 11 1 12 8.333333
242 8 0 8 0.000000
243 2 0 2 0.000000
311 7 0 7 0.000000
312 8 0 8 0.000000
313 1 0 1 0.000000
321 12 0 12 0.000000
322 13 0 13 0.000000
323 4 1 5 20.000000
331 19 1 20 5.000000
332 25 2 27 7.407407
333 11 1 12 8.333333
341 36 2 38 5.263158
342 30 2 32 6.250000
343 12 0 12 0.000000
411 8 2 10 20.000000
412 7 0 7 0.000000
413 13 1 14 7.142857
421 21 2 23 8.695652
422 30 1 31 3.225806
423 26 1 27 3.703704
431 51 3 54 5.555556
432 42 7 49 14.285714
433 41 5 46 10.869565
441 68 2 70 2.857143
442 78 3 81 3.703704
443 70 5 75 6.666667
511 17 0 17 0.000000
512 13 1 14 7.142857
513 26 6 32 18.750000
521 19 1 20 5.000000
522 25 6 31 19.354839
523 50 6 56 10.714286
531 66 3 69 4.347826
532 65 3 68 4.411765
533 128 24 152 15.789474
541 86 7 93 7.526882
542 100 6 106 5.660377
543 178 17 195 8.717949
All 1474 126 1600 7.875000
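One minimal sketch of a way to do this (assuming the training objects rfm_crosstab and rfm_aboveavg from above are still in scope): keep only the validation rows whose RFM index value also appears in the filtered training output.
# RFM values kept in the filtered training output (drop the 'All' margin row if present)
train_rfm = rfm_crosstab[rfm_aboveavg].index.drop('All', errors='ignore')
# restrict the validation crosstab to those RFM combinations
rfm_crosstab1.loc[rfm_crosstab1.index.isin(train_rfm)]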
I have a set of stock information with the datetime set as index. The stock market is only open on weekdays, so all my rows are weekdays, which is fine. I would like to determine whether a row is the start or the end of the week, which might NOT always fall on Monday/Friday due to holidays. A better idea is to determine whether there is a row entry on the next/previous day in the dataframe (since my data is guaranteed to exist only for workdays), but I don't know how to calculate this. Here is an example of my data:
date day_of_week day_of_month day_of_year month_of_year
5/1/2017 0 1 121 5
5/2/2017 1 2 122 5
5/3/2017 2 3 123 5
5/4/2017 3 4 124 5
5/8/2017 0 8 128 5
5/9/2017 1 9 129 5
5/10/2017 2 10 130 5
5/11/2017 3 11 131 5
5/12/2017 4 12 132 5
5/15/2017 0 15 135 5
5/16/2017 1 16 136 5
5/17/2017 2 17 137 5
5/18/2017 3 18 138 5
5/19/2017 4 19 139 5
5/23/2017 1 23 143 5
5/24/2017 2 24 144 5
5/25/2017 3 25 145 5
5/26/2017 4 26 146 5
5/30/2017 1 30 150 5
Here is my current code
# Date fields
def DateFields(df_input):
    dates = df_input.index.to_series()
    df_input['day_of_week'] = dates.dt.dayofweek
    df_input['day_of_month'] = dates.dt.day
    df_input['day_of_year'] = dates.dt.dayofyear
    df_input['month_of_year'] = dates.dt.month
    df_input['isWeekStart'] = "No"  #<--- Need help here
    df_input['isWeekEnd'] = "No"  #<--- Need help here
    df_input['date'] = dates.dt.strftime('%Y-%m-%d')
    return df_input
How can I calculate whether a row is the beginning or the end of a week?
Example of what I am looking for:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
5/1/2017 0 1 121 5 1 0
5/2/2017 1 2 122 5 0 0
5/3/2017 2 3 123 5 0 0
5/4/2017 3 4 124 5 0 1 # short week, Thursday is last work day
5/8/2017 0 8 128 5 1 0
5/9/2017 1 9 129 5 0 0
5/10/2017 2 10 130 5 0 0
5/11/2017 3 11 131 5 0 0
5/12/2017 4 12 132 5 0 1
5/15/2017 0 15 135 5 1 0
5/16/2017 1 16 136 5 0 0
5/17/2017 2 17 137 5 0 0
5/18/2017 3 18 138 5 0 0
5/19/2017 4 19 139 5 0 1
5/23/2017 1 23 143 5 1 0 # short week, Tuesday is first work day
5/24/2017 2 24 144 5 0 0
5/25/2017 3 25 145 5 0 0
5/26/2017 4 26 146 5 0 1
5/30/2017 1 30 150 5 1 0
EDIT: I forgot that some holidays fall in the middle of the week. In this situation it would be good if it could treat these as a separate "week", with the days before and after marked accordingly. Although if it's not smart enough to figure this out, just getting the long weekends right would be a good start.
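For reference, here is a minimal sketch of the membership check described above, under the assumption that pandas is imported as pd and the DatetimeIndex holds only trading days; the two placeholder lines in DateFields could become:
    dates_idx = df_input.index.normalize()
    # a row starts/ends a week when the previous/next calendar day is absent from the index;
    # the very first and last rows of the dataset are marked as boundaries by construction
    df_input['isWeekStart'] = (~(dates_idx - pd.Timedelta(days=1)).isin(dates_idx)).astype(int)
    df_input['isWeekEnd'] = (~(dates_idx + pd.Timedelta(days=1)).isin(dates_idx)).astype(int)
A holiday in the middle of a week splits it into two short "weeks" (the day before is marked as an end, the day after as a start), which matches the behaviour asked for in the EDIT.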
Here's an idea with BusinessDay:
prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week !=
                        prev_working_day.dt.isocalendar().week)
And similarly for the last business day (see the sketch after the output below). Note that BusinessDay only skips weekends; to also skip holidays you would need a custom business-day offset with a holiday calendar, and the one built into pandas is the US federal calendar.
Output:
date day_of_week day_of_month day_of_year month_of_year isFirstWeekDay
0 2017-05-01 0 1 121 5 True
1 2017-05-02 1 2 122 5 False
2 2017-05-03 2 3 123 5 False
3 2017-05-04 3 4 124 5 False
4 2017-05-08 0 8 128 5 True
5 2017-05-09 1 9 129 5 False
6 2017-05-10 2 10 130 5 False
7 2017-05-11 3 11 131 5 False
8 2017-05-12 4 12 132 5 False
9 2017-05-15 0 15 135 5 True
10 2017-05-16 1 16 136 5 False
11 2017-05-17 2 17 137 5 False
12 2017-05-18 3 18 138 5 False
13 2017-05-19 4 19 139 5 False
14 2017-05-23 1 23 143 5 False
15 2017-05-24 2 24 144 5 False
16 2017-05-25 3 25 145 5 False
17 2017-05-26 4 26 146 5 False
18 2017-05-30 1 30 150 5 False
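A minimal sketch of the symmetric check for the last working day of each week, under the same assumptions as the snippet above:
next_working_day = df['date'] + pd.tseries.offsets.BusinessDay(1)
df['isLastWeekDay'] = (df['date'].dt.isocalendar().week !=
                       next_working_day.dt.isocalendar().week)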
Here's an approach using weekly groupby.
# make sure 'date' is a datetime column
df['date'] = pd.to_datetime(df['date'])
# collect the dates belonging to each calendar week into a list
business_days = df.assign(date_copy = df['date']).groupby(pd.Grouper(key='date_copy', freq='W'))['date'].apply(list).to_frame()
# the earliest date in each week is the week start, the latest is the week end
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
# explode the per-week lists back to one row per date and join onto the original frame
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')
Output:
date day_of_week day_of_month day_of_year month_of_year isWeekStart isWeekEnd
0 2017-05-01 0 1 121 5 1 0
1 2017-05-02 1 2 122 5 0 0
2 2017-05-03 2 3 123 5 0 0
3 2017-05-04 3 4 124 5 0 1
4 2017-05-08 0 8 128 5 1 0
5 2017-05-09 1 9 129 5 0 0
6 2017-05-10 2 10 130 5 0 0
7 2017-05-11 3 11 131 5 0 0
8 2017-05-12 4 12 132 5 0 1
9 2017-05-15 0 15 135 5 1 0
10 2017-05-16 1 16 136 5 0 0
11 2017-05-17 2 17 137 5 0 0
12 2017-05-18 3 18 138 5 0 0
13 2017-05-19 4 19 139 5 0 1
14 2017-05-23 1 23 143 5 1 0
15 2017-05-24 2 24 144 5 0 0
16 2017-05-25 3 25 145 5 0 0
17 2017-05-26 4 26 146 5 0 1
18 2017-05-30 1 30 150 5 1 1
Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.
I have a set of data (200 rows) related to vanilla pound cake baking, with 27 features as below. The label caketaste is a measure of how good the baked cake was, defined as bad (0), neutral (1), good (2).
Features = cake_id, flour_g, butter_g, sugar_g, salt_g, eggs_count, bakingpowder_g, milk_ml, water_ml, vanillaextract_ml, lemonzest_g, mixingtime_min, bakingtime_min, preheattime_min, coolingtime_min, bakingtemp_c, preheattemp_c, color_red, color_green, color_blue, traysize_small, traysize_medium, traysize_large, milktype_lowfat, milktype_skim, milktype_whole, trayshape.
Label = caketaste ["bad", "neutral", "good"]
My task is to find:
a) the 5 most important features that affect the label's outcome;
b) to find the values of the identified 5 most important features that contributed to "good" classification in the label.
I am able to solve task a) using sklearn (Python) by fitting the data with RandomForestClassifier() and then identifying the 5 most important features via feature_importances_, which are mixingtime_min, bakingtime_min, sugar_g, flour_g and preheattemp_c.
Minimal, Complete, and Verifiable Example:
#################################################################
# a) Libraries
#################################################################
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import time
#################################################################
# b) Data Loading
#################################################################
df = pd.read_excel("poundcake.xlsx", sheet_name="Sheet0", engine='openpyxl')
#################################################################
# c) Analyzing Dataframe
#################################################################
#Getting dataframe details e.g columns, total entries, data types etc
print("\n<syntax>: df.info()")
df.info()
#Getting the 1st 5 lines in the dataframe
print("\n<syntax>: df.head()")
df.head()
#################################################################
# d) Data Visualization
#################################################################
#Scatterplot cake_id vs caketaste
fig=plt.figure()
ax=fig.add_axes([0,0,1,1])
ax.scatter(df["cake_id"], df["caketaste"], color='r')
ax.set_xlabel('cake_id')
ax.set_ylabel('caketaste')
ax.set_title('scatter plot')
plt.show()
#################################################################
# e) Feature selection
#################################################################
#Note:
#Machine learning models, scikit-learn's in particular, cannot work directly with categorical (string) data.
#Need to convert the categorical variables into numeric types before building a machine learning model.
categorical_columns = ["trayshape"]
numerical_columns = ["flour_g","butter_g","sugar_g","salt_g","eggs_count","bakingpowder_g","milk_ml","water_ml","vanillaextract_ml","lemonzest_g","mixingtime_min","bakingtime_min","preheattime_min","coolingtime_min","bakingtemp_c","preheattemp_c","color_red","color_green","color_blue","traysize_small","traysize_medium","traysize_large","milktype_lowfat","milktype_skim","milktype_whole"]
#################################################################
# f) Dataset (Train Test Split)
#
# (Dataset)
# ┌──────────────────────────────────────────┐
# ┌──────────────────────────┬────────────┐
# | Training │ Test │
# └──────────────────────────┴────────────┘
#################################################################
# Predictor features
X = df[categorical_columns + numerical_columns]
# Prediction target
y = df["caketaste"]
# Break off validation set from training data. Default: train_size=0.75, test_size=0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)
#################################################################
# Pipeline
#################################################################
#######################
# g) Column Transformer
#######################
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
#Mean might not be suitable, Remove rows?
numerical_pipe = Pipeline([
('imp', SimpleImputer(strategy='mean'))
])
preprocessing = ColumnTransformer(
[('cat', categorical_encoder, categorical_columns),
('num', numerical_pipe, numerical_columns)])
#####################
# b) Pipeline Printer
#####################
#RF: builds multiple decision trees and merges (bagging) them together
#to get a more accurate and stable prediction (averaging).
pipe_xxx_xxx_rfo = Pipeline([
('pre', preprocessing),
('scl', None),
('pca', None),
('clf', RandomForestClassifier(random_state=42))
])
pipe_abs_xxx_rfo = Pipeline([
('pre', preprocessing),
('scl', MaxAbsScaler()),
('pca', None),
('clf', RandomForestClassifier(random_state=42))
])
#################################################################
# h) Hyper-Parameter Tuning
#################################################################
parameters_rfo = {
'clf__n_estimators':[100],
'clf__criterion':['gini'],
'clf__min_samples_split':[2,5],
'clf__min_samples_leaf':[1,2]
}
parameters_rfo_bk = {
'clf__n_estimators':[10,20,30,40,50,60,70,80,90,100,1000],
'clf__criterion':['gini','entropy'],
'clf__min_samples_split':[5,10,15,20,25,30],
'clf__min_samples_leaf':[1,2,3,4,5]
}
#########################
# i) GridSearch Printer
#########################
# scoring can be used as 'accuracy' or for MAE use 'neg_mean_absolute_error'
scr='accuracy'
grid_xxx_xxx_rfo = GridSearchCV(pipe_xxx_xxx_rfo,
param_grid=parameters_rfo,
scoring=scr,
cv=5,
refit=True)
grid_abs_xxx_rfo = GridSearchCV(pipe_abs_xxx_rfo,
param_grid=parameters_rfo,
scoring=scr,
cv=5,
refit=True)
print("Pipeline setup.... Complete")
###################################################
# Machine Learning Models Evaluation Algorithm
###################################################
grids = [grid_xxx_xxx_rfo, grid_abs_xxx_rfo]
grid_dict = { 0: 'RandomForestClassifier',
1: 'RandomForestClassifier with AbsMaxScaler',
}
# Fit the grid search objects
print('Performing model optimizations...\n')
best_test_scr = -999999999999999  # Python 3 no longer allows comparing None with numbers
best_clf = 0
best_gs = ''
for idx, grid in enumerate(grids):
    start_time = time.time()
    print('*' * 100)
    print('\nEstimator: %s' % grid_dict[idx])
    # Fit grid search
    grid.fit(X_train, y_train)
    # Calculate the scores once and reuse them when needed
    test_scr = grid.score(X_test, y_test)
    train_scr = grid.score(X_train, y_train)
    # Track the best (highest test score) model
    if test_scr > best_test_scr:
        best_test_scr = test_scr
        best_train_scr = train_scr
        best_rf = grid
        best_clf = idx
        print("..........................this model is better. SELECTED")
    print("Best params : %s" % grid.best_params_)
    print("Training accuracy : %s" % best_train_scr)
    print("Test accuracy : %s" % best_test_scr)
    print("Modeling time : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))
print('\nClassifier with best test set score: %s' % grid_dict[best_clf])
#########################################################################################
# j) Feature Importance using Gini Importance or Mean Decrease in Impurity (MDI)
# Note:
# 1. Calculates each feature importance as the sum over the number of splits (across
#    all trees) that include the feature, proportionally to the number of samples it splits.
# 2. Biased towards high-cardinality features, i.e. numerical variables
########################################################################################
ohe = (best_rf.best_estimator_.named_steps['pre'].named_transformers_['cat'])
feature_names = ohe.get_feature_names(input_features=categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]
tree_feature_importances = (best_rf.best_estimator_.named_steps['clf'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()
# Figure: Top Features
count=-28
y_ticks = np.arange(0, abs(count))
fig, ax = plt.subplots()
ax.barh(y_ticks[count:], tree_feature_importances[sorted_idx][count:])
ax.set_yticklabels(feature_names[sorted_idx][count:], fontsize=7)
ax.set_yticks(y_ticks[count:])
ax.set_title("Random Forest Tree's Feature Importance from Mean Decrease in Impurity (MDI)")
fig.tight_layout()
plt.show()
What approach can one use to solve task b)? I am trying to answer the research question below:
What are the values of mixingtime_min, bakingtime_min, flour_g, sugar_g and preheattemp_c that statistically contributed to a good caketaste (Good: 2)?
Possible Expected Result:
mixingtime_min = [5,10,15] AND
bakingtime_min = [50,51,52,53,54,55] AND
flour_g = [150,160,170,180] AND
sugar_g = [200, 250] AND
preheattemp_c = [150,160,170]
The above result basically concludes that if a person wants a GOOD-tasting cake, they need to bake it using 150-180 g flour with 200-250 g sugar, mix the dough for 5-15 minutes, and then bake it for 50-55 minutes in an oven preheated to 150-170 ºC.
Hope you can give some pointers.
Question
Would you be able to guide me on how to go about approaching this research question?
Is there any library, in sklearn or otherwise, that is able to get this information?
Any additional information such as confidence interval, outliers etc. is a bonus.
The data (poundcake.xlsx):
cake_id flour_g butter_g sugar_g salt_g eggs_count bakingpowder_g milk_ml water_ml vanillaextract_ml lemonzest_g mixingtime_min bakingtime_min preheattime_min coolingtime_min bakingtemp_c preheattemp_c color_red color_green color_blue traysize_small traysize_medium traysize_large milktype_lowfat milktype_skim milktype_whole trayshape caketaste
0 180 50 250 2 3 3 15 80 1 2 10 30 25 15 170 175 1 0 0 1 0 0 1 0 0 square 1
1 195 50 500 6 6 1 30 60 1 2 10 40 30 10 170 170 0 1 0 1 0 0 0 1 0 rectangle 1
2 160 40 600 6 5 1 15 90 3 3 5 30 30 10 155 160 1 0 0 1 0 0 0 0 1 square 2
3 200 80 350 8 4 2 15 50 1 1 7 40 20 10 175 165 0 1 0 1 0 0 0 0 1 rectangle 0
4 175 90 400 6 6 4 25 90 1 1 9 60 25 15 160 155 1 0 0 0 0 1 0 1 0 rectangle 0
5 180 60 650 6 3 4 20 80 2 3 7 15 20 20 155 160 0 0 1 0 0 1 0 1 0 rectangle 2
6 165 50 200 6 4 2 20 80 1 2 7 60 30 20 150 170 0 1 0 1 0 0 1 0 0 rectangle 0
7 170 70 200 6 2 3 25 50 2 3 8 70 20 10 170 150 0 1 0 1 0 0 0 1 0 rectangle 1
8 160 90 300 8 4 4 25 60 3 2 9 35 30 15 175 170 0 1 0 1 0 0 1 0 0 square 1
9 165 50 350 6 4 1 25 80 1 2 11 30 10 10 170 170 1 0 0 0 1 0 1 0 0 square 1
10 180 90 650 4 3 4 20 50 2 3 8 30 30 15 165 170 1 0 0 1 0 0 0 1 0 square 1
11 165 40 350 6 2 2 30 60 3 3 5 50 25 15 175 170 0 0 1 1 0 0 0 0 1 rectangle 1
12 175 70 500 6 2 1 25 80 1 1 7 60 20 15 170 170 0 1 0 1 0 0 1 0 0 square 2
13 175 70 350 6 2 1 15 60 2 3 9 45 30 15 175 170 0 0 0 1 0 0 0 1 0 rectangle 1
14 160 70 600 4 6 4 30 60 2 3 5 60 25 10 150 155 0 1 0 1 0 0 0 1 0 rectangle 0
15 165 50 500 2 3 4 20 60 1 3 10 30 15 20 175 175 0 1 0 1 0 0 1 0 0 rectangle 0
16 195 50 600 6 5 2 25 60 1 1 5 30 10 20 170 150 0 0 0 1 0 0 0 0 1 square 2
17 160 60 600 8 5 4 25 70 3 3 9 30 30 10 175 150 0 0 0 1 0 0 1 0 0 rectangle 0
18 160 80 550 6 3 3 23 80 1 1 9 25 30 15 155 170 0 0 1 1 0 0 0 0 1 rectangle 1
19 170 60 600 4 5 1 20 90 3 3 10 55 20 15 165 155 0 0 1 1 0 0 0 0 1 square 0
20 175 70 300 6 5 4 25 70 1 1 11 65 15 20 170 155 0 0 1 1 0 0 0 1 0 round 0
21 195 80 250 6 6 2 23 70 2 3 11 20 30 15 170 155 0 0 1 1 0 0 1 0 0 rectangle 0
22 170 90 650 6 3 4 20 70 1 2 10 60 25 15 170 155 0 0 1 0 0 1 0 1 0 rectangle 1
23 180 40 200 6 3 1 15 60 3 1 5 35 15 15 170 170 0 1 0 1 0 0 0 1 0 rectangle 2
24 165 50 550 8 4 2 23 80 1 2 5 65 30 15 155 175 0 0 0 1 0 0 1 0 0 rectangle 1
25 170 50 250 6 2 3 25 70 2 2 6 30 20 15 165 175 0 0 0 0 0 1 0 1 0 rectangle 2
26 180 50 200 6 4 2 30 80 1 3 10 30 20 15 165 165 0 0 0 1 0 0 0 1 0 rectangle 2
27 200 90 500 6 3 4 25 70 2 1 9 60 30 15 170 160 0 1 0 1 0 0 0 1 0 rectangle 2
28 170 60 300 6 2 3 25 80 1 1 9 15 15 15 160 150 1 0 0 0 0 1 0 0 1 round 1
29 170 60 400 2 3 2 25 60 1 3 9 25 15 15 160 175 0 0 0 1 0 0 1 0 0 square 0
30 195 50 650 4 5 2 25 60 1 3 7 40 15 15 165 170 0 1 0 1 0 0 1 0 0 rectangle 1
31 170 50 350 6 6 1 25 80 2 2 8 50 25 15 150 170 0 1 0 1 0 0 1 0 0 rectangle 2
32 160 80 550 4 4 4 20 70 1 3 7 25 25 15 170 165 1 0 0 0 0 1 0 0 1 rectangle 2
33 170 50 300 4 4 2 23 50 2 2 10 30 20 15 150 170 0 0 0 1 0 0 1 0 0 rectangle 0
34 175 70 650 4 4 1 23 70 3 3 10 55 10 15 150 170 0 0 1 1 0 0 0 0 1 rectangle 0
35 180 70 400 6 2 2 20 60 1 1 6 55 30 15 170 150 0 0 0 1 0 0 1 0 0 square 2
36 195 60 300 6 6 4 23 70 2 2 10 30 30 15 170 175 1 0 0 1 0 0 1 0 0 rectangle 0
37 180 70 400 6 4 1 20 70 3 2 9 30 30 20 160 170 1 0 0 1 0 0 0 1 0 rectangle 2
38 170 90 600 8 3 1 20 50 1 2 9 30 30 15 155 170 1 0 0 1 0 0 0 1 0 rectangle 2
39 180 60 200 2 3 2 20 70 1 2 10 55 30 20 165 155 0 1 0 1 0 0 0 1 0 round 2
40 180 70 400 6 4 2 15 60 1 3 7 45 30 10 170 175 0 0 0 1 0 0 0 1 0 rectangle 2
41 170 70 200 6 3 1 30 60 3 2 6 40 15 15 170 175 0 0 1 1 0 0 0 0 1 rectangle 2
42 170 60 550 6 3 4 20 80 1 2 9 60 20 15 150 165 1 0 0 1 0 0 1 0 0 round 2
43 170 50 600 6 4 3 30 60 1 2 11 15 30 15 155 150 1 0 0 0 1 0 1 0 0 rectangle 0
44 175 70 200 4 4 3 30 70 3 2 6 20 10 20 170 170 0 0 0 1 0 0 1 0 0 rectangle 1
45 195 70 500 8 4 2 25 60 2 3 6 15 30 15 165 170 1 0 0 0 0 1 0 1 0 rectangle 2
46 180 80 200 4 4 4 15 80 1 3 6 50 30 15 155 150 0 0 0 1 0 0 0 1 0 rectangle 2
47 165 50 350 6 4 2 20 60 1 1 9 40 20 15 150 155 0 0 0 1 0 0 1 0 0 rectangle 0
48 170 70 550 2 2 4 20 60 3 2 9 55 30 15 165 165 0 1 0 1 0 0 0 0 1 round 0
49 175 70 350 6 5 4 30 80 1 2 9 55 30 10 155 170 0 0 0 0 0 1 1 0 0 rectangle 1
50 180 50 400 6 4 3 25 50 2 2 9 20 20 20 160 160 0 0 0 1 0 0 0 1 0 rectangle 2
51 165 50 650 6 5 4 20 60 1 2 5 60 30 15 175 170 0 0 1 1 0 0 0 0 1 square 0
52 170 70 200 2 6 3 25 60 1 3 8 35 25 15 170 155 1 0 0 1 0 0 0 0 1 rectangle 1
53 180 40 350 4 4 3 30 60 3 2 12 45 30 15 150 175 0 0 0 1 0 0 0 1 0 rectangle 1
54 175 50 600 8 3 1 20 80 2 1 7 30 15 15 150 160 0 0 0 1 0 0 0 0 1 square 0
55 175 70 400 4 3 1 25 90 1 2 5 50 30 10 170 170 1 0 0 0 0 1 1 0 0 rectangle 1
56 170 50 650 6 6 3 20 70 1 1 6 25 30 15 170 160 1 0 0 1 0 0 0 1 0 rectangle 2
57 200 70 650 6 3 1 15 60 2 1 10 25 10 15 170 150 0 1 0 1 0 0 0 0 1 rectangle 2
58 175 80 650 6 5 2 23 70 1 1 5 45 15 15 160 170 0 1 0 1 0 0 0 0 1 rectangle 1
59 170 50 200 8 3 4 30 70 1 3 11 35 25 15 170 170 0 0 0 1 0 0 0 1 0 rectangle 1
60 170 60 300 6 3 1 20 60 3 3 11 20 30 15 170 170 1 0 0 1 0 0 0 0 1 rectangle 0
61 180 40 350 2 4 3 20 70 3 2 12 20 10 15 150 160 0 0 0 1 0 0 1 0 0 square 2
62 175 60 200 6 6 1 15 80 2 2 12 25 20 15 155 160 1 0 0 1 0 0 0 0 1 rectangle 2
63 170 70 650 6 2 3 23 90 3 3 10 25 30 20 170 155 1 0 0 1 0 0 0 1 0 rectangle 2
64 170 70 600 6 4 2 25 80 2 2 6 50 15 15 170 155 0 0 0 1 0 0 0 1 0 rectangle 0
65 170 60 250 6 2 2 30 60 1 2 9 20 15 10 165 165 0 0 0 1 0 0 0 1 0 rectangle 2
66 175 50 650 4 2 1 23 60 2 2 11 20 30 20 170 175 1 0 0 1 0 0 0 1 0 rectangle 1
67 175 70 350 4 3 3 30 50 1 2 10 35 25 15 175 170 0 0 0 1 0 0 1 0 0 rectangle 0
68 165 90 600 6 5 2 23 60 1 3 9 55 10 15 160 165 0 1 0 1 0 0 1 0 0 square 0
69 200 80 600 6 3 1 30 60 2 1 8 30 30 15 175 165 1 0 0 0 1 0 0 0 1 rectangle 1
70 165 50 200 6 5 2 23 60 2 1 12 55 30 15 170 170 0 0 0 0 0 1 0 0 1 round 0
71 175 60 300 4 6 1 15 60 3 2 12 55 20 15 175 165 0 0 0 1 0 0 0 0 1 square 0
72 175 70 200 8 5 4 20 60 1 3 12 60 25 15 175 170 0 1 0 1 0 0 0 1 0 rectangle 2
73 180 60 200 4 4 4 30 70 1 3 8 35 30 10 175 170 0 0 0 1 0 0 1 0 0 rectangle 2
74 170 80 650 6 3 1 30 60 1 2 5 55 30 20 155 175 1 0 0 1 0 0 0 0 1 rectangle 2
75 170 60 500 8 4 1 23 60 3 2 7 60 30 15 165 170 0 0 0 1 0 0 0 1 0 square 2
76 175 70 250 6 4 2 30 60 1 2 12 65 20 15 170 160 1 0 0 0 0 1 0 0 1 square 2
77 180 50 500 8 5 1 15 70 3 3 8 40 10 15 165 155 0 0 1 0 1 0 0 0 1 rectangle 1
78 175 60 550 6 4 2 20 90 1 2 7 25 30 15 175 165 0 1 0 1 0 0 0 0 1 rectangle 0
79 170 70 600 8 4 4 15 80 3 3 11 60 30 15 175 150 1 0 0 1 0 0 0 0 1 rectangle 1
80 195 60 200 4 5 3 30 60 1 2 8 30 20 15 170 170 0 1 0 1 0 0 0 1 0 square 0
81 180 70 300 6 3 3 20 90 1 3 11 25 20 10 170 150 0 0 0 1 0 0 0 1 0 rectangle 0
82 170 40 550 2 4 3 30 60 1 2 9 35 30 10 170 170 0 0 0 0 0 1 0 1 0 square 1
83 175 60 550 6 5 2 15 90 1 1 11 30 10 15 170 175 1 0 0 1 0 0 0 0 1 rectangle 0
84 180 50 350 4 4 3 23 50 2 2 7 20 30 10 170 175 0 0 0 1 0 0 0 0 1 rectangle 2
85 180 80 600 4 4 1 25 60 1 1 5 55 30 10 170 165 0 0 1 1 0 0 0 0 1 rectangle 1
86 175 50 650 8 2 3 15 50 1 2 10 50 25 15 160 160 0 0 0 1 0 0 0 0 1 square 0
87 175 50 350 2 6 3 23 80 2 2 10 20 25 15 170 155 1 0 0 1 0 0 0 0 1 rectangle 1
88 170 50 350 4 2 4 25 60 2 1 10 20 15 15 150 155 0 1 0 1 0 0 1 0 0 rectangle 0
89 180 50 550 6 5 4 30 90 2 3 7 60 30 15 155 175 0 0 0 1 0 0 0 1 0 rectangle 2
90 170 70 600 6 5 3 15 90 1 2 6 45 10 15 170 170 0 1 0 1 0 0 1 0 0 round 1
91 170 70 300 4 4 2 20 60 1 1 10 15 30 10 165 155 0 0 0 1 0 0 1 0 0 rectangle 1
92 180 50 650 4 2 4 20 80 1 2 8 65 30 15 150 160 0 1 0 1 0 0 0 0 1 rectangle 2
93 170 50 350 6 3 3 30 60 1 3 7 55 30 20 155 170 1 0 0 1 0 0 1 0 0 rectangle 0
94 170 90 400 6 4 1 30 60 3 2 12 70 30 15 170 160 0 0 1 1 0 0 0 1 0 rectangle 1
95 160 70 400 2 6 4 23 70 2 1 9 20 30 10 150 175 0 0 0 1 0 0 0 0 1 square 1
96 170 80 250 4 2 3 30 60 3 1 10 30 30 15 155 165 0 0 0 0 0 1 0 0 1 rectangle 1
97 195 70 250 6 6 4 30 80 3 1 11 20 15 15 170 170 1 0 0 1 0 0 0 0 1 rectangle 2
98 180 50 650 6 6 1 30 90 3 1 7 25 15 15 170 170 1 0 0 1 0 0 0 0 1 rectangle 2
99 195 50 200 6 3 1 23 90 1 1 9 55 25 15 160 170 0 0 0 1 0 0 0 0 1 rectangle 0
100 175 50 200 4 3 3 20 50 2 2 12 15 30 10 170 170 0 0 1 1 0 0 0 1 0 square 1
101 165 70 350 4 4 4 15 90 1 2 12 40 15 15 155 155 0 1 0 1 0 0 0 0 1 rectangle 1
102 180 80 600 4 4 3 25 50 1 2 11 30 10 15 155 170 0 0 1 1 0 0 0 0 1 rectangle 1
103 165 50 300 6 3 1 30 60 1 1 9 40 25 15 160 170 0 0 0 1 0 0 0 1 0 rectangle 1
104 160 50 600 8 2 4 20 60 1 2 12 60 30 15 170 170 0 0 0 1 0 0 1 0 0 square 2
105 170 90 200 2 2 2 15 60 3 2 5 40 20 15 170 160 0 0 0 1 0 0 0 1 0 rectangle 2
106 175 90 600 6 4 2 15 60 1 1 7 20 30 15 175 170 1 0 0 0 0 1 0 1 0 rectangle 2
107 180 70 550 6 3 1 15 90 1 1 9 25 30 15 150 160 1 0 0 1 0 0 0 1 0 rectangle 2
108 170 90 250 8 4 4 30 60 2 3 6 60 25 15 155 155 0 0 0 1 0 0 0 0 1 rectangle 0
109 200 40 500 6 6 2 20 60 3 2 10 50 30 15 170 155 0 0 0 1 0 0 1 0 0 rectangle 0
110 175 70 500 2 3 4 30 60 3 2 5 65 20 15 170 155 1 0 0 1 0 0 0 0 1 rectangle 2
111 165 60 550 6 3 2 30 80 2 1 9 20 25 20 170 175 0 0 0 1 0 0 0 0 1 rectangle 2
112 195 70 350 6 6 2 25 90 2 2 12 50 30 15 150 165 0 0 1 1 0 0 0 1 0 square 2
113 165 90 300 4 3 4 30 60 1 2 9 30 25 15 165 170 0 1 0 0 0 1 0 1 0 rectangle 0
114 195 40 650 6 2 1 23 80 1 2 5 25 25 15 170 165 0 1 0 1 0 0 0 1 0 rectangle 1
115 175 60 200 2 4 3 15 50 3 3 6 25 30 15 155 170 1 0 0 1 0 0 1 0 0 square 0
116 175 70 400 6 4 3 15 60 2 3 11 20 20 15 150 170 1 0 0 0 1 0 0 1 0 rectangle 2
117 195 70 350 6 3 2 30 60 3 2 12 25 25 20 175 175 0 0 0 1 0 0 0 0 1 rectangle 2
118 170 50 500 6 4 3 30 80 2 3 10 60 30 15 170 160 0 1 0 1 0 0 0 0 1 rectangle 0
119 195 60 650 6 4 1 20 70 3 2 5 65 20 20 170 150 0 0 1 0 0 1 0 0 1 rectangle 2
120 170 70 650 8 4 4 25 80 1 2 9 45 30 15 170 170 0 0 1 1 0 0 0 1 0 round 1
121 170 70 650 8 4 2 30 90 1 2 12 30 15 15 170 170 0 0 1 1 0 0 1 0 0 square 0
122 170 60 400 4 6 4 15 60 2 2 11 60 30 15 170 150 0 0 1 1 0 0 1 0 0 square 0
123 175 60 300 8 6 3 20 60 2 2 12 50 25 15 150 175 0 0 1 0 1 0 0 1 0 round 2
124 175 50 400 4 3 1 23 50 3 2 9 50 30 15 150 150 0 0 1 1 0 0 0 1 0 square 0
125 180 40 300 6 4 1 15 50 3 2 10 60 30 15 170 175 0 0 1 0 1 0 0 1 0 rectangle 2
126 195 60 250 6 4 3 25 90 2 2 6 60 30 10 170 175 1 0 0 0 0 1 0 0 1 rectangle 2
127 160 70 300 4 2 1 20 60 2 2 5 40 20 15 160 170 0 0 0 1 0 0 0 1 0 square 2
128 170 60 300 8 6 2 30 80 1 1 10 65 30 15 155 155 0 1 0 1 0 0 0 0 1 square 2
129 160 40 350 6 6 2 15 60 1 1 5 25 30 15 155 170 0 0 1 0 0 1 0 1 0 rectangle 2
130 170 60 500 2 5 3 30 50 3 2 10 60 10 15 165 160 0 0 0 1 0 0 1 0 0 rectangle 1
131 170 60 650 8 3 3 23 90 1 1 10 70 15 15 170 175 1 0 0 1 0 0 1 0 0 rectangle 2
132 170 50 600 4 4 1 20 50 2 2 5 60 25 15 170 160 1 0 0 1 0 0 0 0 1 square 2
133 180 50 350 6 5 2 25 90 3 2 5 20 30 15 175 160 0 0 0 1 0 0 1 0 0 rectangle 0
134 170 90 200 4 2 4 20 90 3 2 10 20 25 15 170 175 0 0 0 1 0 0 0 0 1 rectangle 1
135 200 40 350 6 6 1 30 80 1 1 5 60 25 20 170 175 0 0 1 1 0 0 0 1 0 rectangle 2
136 165 60 250 2 3 2 25 60 1 1 8 20 15 15 170 170 0 1 0 1 0 0 0 0 1 rectangle 0
137 175 70 250 6 6 4 15 60 2 2 11 50 30 15 175 175 0 1 0 0 0 1 0 1 0 rectangle 2
138 180 50 350 6 4 2 25 70 3 2 5 45 25 15 170 170 0 0 0 0 0 1 1 0 0 rectangle 0
139 195 60 600 6 4 2 20 50 1 1 10 35 15 15 165 175 1 0 0 1 0 0 0 1 0 round 2
140 180 60 300 8 4 4 25 80 1 1 5 60 30 15 165 170 0 0 0 1 0 0 0 0 1 rectangle 1
141 200 60 500 8 4 1 23 70 2 2 8 15 30 15 160 170 0 0 0 1 0 0 1 0 0 rectangle 0
142 170 60 550 6 4 4 30 60 2 2 6 65 20 15 175 165 0 1 0 1 0 0 0 0 1 rectangle 1
143 170 40 600 2 2 1 15 70 1 2 11 30 25 20 175 165 0 0 0 1 0 0 0 0 1 rectangle 0
144 175 70 250 6 4 3 30 60 1 2 10 60 30 20 155 175 0 1 0 1 0 0 1 0 0 rectangle 2
145 180 50 250 4 5 3 15 80 1 2 6 60 30 15 170 170 0 0 0 1 0 0 0 0 1 rectangle 2
146 165 50 350 6 4 4 25 80 1 2 12 25 15 15 155 165 1 0 0 1 0 0 0 0 1 rectangle 0
147 170 60 500 6 5 4 23 60 1 2 10 15 30 20 160 170 1 0 0 1 0 0 1 0 0 rectangle 1
148 170 50 400 6 4 3 20 60 2 3 6 35 10 15 170 175 0 0 1 1 0 0 0 0 1 rectangle 1
149 195 80 650 8 4 3 30 90 1 1 6 15 20 10 165 160 1 0 0 0 1 0 1 0 0 rectangle 2
150 165 90 500 8 3 4 20 60 2 2 5 25 30 15 165 170 0 1 0 0 0 1 0 0 1 rectangle 1
151 160 80 200 2 4 4 30 80 3 1 5 50 25 15 170 160 0 1 0 1 0 0 0 1 0 rectangle 0
152 180 50 500 2 6 1 15 60 1 1 8 65 20 15 170 170 1 0 0 0 0 1 1 0 0 rectangle 2
153 165 60 600 6 4 1 30 70 3 3 11 15 30 10 170 170 0 0 0 1 0 0 1 0 0 rectangle 0
154 180 60 600 2 3 2 30 70 1 2 6 55 15 15 150 165 1 0 0 1 0 0 0 0 1 rectangle 2
155 160 60 400 2 6 4 15 60 1 1 9 55 30 10 170 160 1 0 0 1 0 0 1 0 0 rectangle 0
156 180 60 250 4 3 2 25 80 3 1 6 25 25 20 170 160 0 1 0 0 1 0 0 1 0 square 2
157 195 50 200 6 4 3 30 70 3 2 6 35 30 15 165 170 1 0 0 0 0 1 1 0 0 rectangle 2
158 170 50 650 6 5 2 15 60 3 2 12 35 30 10 170 175 1 0 0 0 1 0 0 1 0 rectangle 0
159 160 70 400 6 3 2 20 50 1 2 9 20 30 15 155 155 0 0 1 0 0 1 1 0 0 rectangle 0
160 175 90 600 6 4 4 23 80 3 3 7 20 20 15 155 160 1 0 0 1 0 0 0 1 0 rectangle 0
161 180 50 400 4 4 1 23 70 1 2 12 20 30 20 165 170 0 1 0 1 0 0 0 0 1 rectangle 1
162 170 90 250 6 3 3 20 80 2 2 12 25 15 15 170 155 0 0 0 1 0 0 0 1 0 round 2
163 170 60 200 2 6 1 23 80 3 1 10 30 30 15 170 175 0 1 0 0 0 1 0 1 0 rectangle 2
164 175 50 650 2 5 3 25 70 3 2 11 60 25 15 175 160 0 1 0 1 0 0 0 0 1 rectangle 2
165 195 90 400 6 3 3 23 60 1 2 7 35 25 20 170 155 0 0 0 1 0 0 1 0 0 round 1
166 180 50 600 6 3 4 25 60 2 2 10 20 10 15 155 175 0 1 0 1 0 0 0 1 0 square 0
167 200 50 500 6 3 3 15 90 2 1 6 20 25 10 170 155 0 1 0 1 0 0 0 0 1 rectangle 1
168 200 60 200 6 2 3 20 60 3 3 5 20 10 15 170 170 1 0 0 1 0 0 0 1 0 rectangle 1
169 200 60 300 4 5 3 20 90 3 2 12 30 25 15 155 160 0 0 1 1 0 0 0 0 1 rectangle 0
170 180 70 250 6 4 3 30 50 1 2 12 35 25 10 155 150 0 0 0 1 0 0 1 0 0 rectangle 1
171 175 70 200 4 6 4 30 60 2 2 5 25 30 15 150 160 0 0 1 1 0 0 0 1 0 square 0
172 165 90 400 2 5 1 30 90 3 2 6 70 30 15 170 170 0 1 0 1 0 0 0 0 1 rectangle 2
173 165 70 200 6 6 4 20 70 1 1 5 65 20 20 175 155 0 0 0 1 0 0 0 1 0 round 0
174 180 50 650 2 3 3 20 70 3 2 12 40 30 15 155 170 0 0 0 1 0 0 0 0 1 rectangle 1
175 180 40 200 6 3 2 30 80 3 3 7 60 30 10 175 150 0 1 0 1 0 0 1 0 0 rectangle 2
176 180 60 400 2 5 3 20 50 1 3 5 20 30 15 175 150 0 1 0 1 0 0 0 1 0 rectangle 1
177 200 50 400 4 6 4 23 60 2 2 7 55 20 15 160 170 0 1 0 1 0 0 0 0 1 round 0
178 180 50 550 6 4 3 20 50 2 2 8 20 25 20 170 170 1 0 0 0 0 1 1 0 0 rectangle 0
179 175 70 250 8 4 1 20 50 2 3 6 60 30 15 170 170 0 0 0 0 1 0 0 0 1 square 0
180 195 70 400 6 4 4 23 60 3 1 7 65 25 15 170 150 1 0 0 1 0 0 0 1 0 rectangle 1
181 160 50 500 6 4 3 25 50 1 1 11 55 10 15 170 170 0 0 0 0 0 1 0 0 1 rectangle 1
182 180 90 500 6 3 3 23 60 2 1 8 20 30 15 170 170 0 0 0 0 0 1 0 1 0 rectangle 1
183 170 70 650 2 3 3 25 80 1 3 8 45 20 10 170 170 0 1 0 1 0 0 0 1 0 round 2
184 195 70 600 6 4 2 25 60 1 2 6 40 30 15 155 170 1 0 0 1 0 0 0 0 1 rectangle 1
185 165 70 200 6 4 1 20 60 1 2 8 45 15 15 170 150 0 1 0 1 0 0 0 0 1 round 1
186 165 80 200 4 4 3 30 60 1 1 8 25 30 10 160 170 0 1 0 1 0 0 1 0 0 round 0
187 175 60 600 4 2 3 20 60 1 2 6 25 20 15 170 155 0 0 0 1 0 0 1 0 0 rectangle 2
188 180 70 500 6 4 3 30 70 2 2 7 55 30 15 170 150 1 0 0 1 0 0 1 0 0 square 1
189 180 50 600 2 4 4 30 60 3 1 9 40 25 15 170 170 1 0 0 0 0 1 0 0 1 rectangle 0
190 160 50 600 8 3 2 20 60 3 2 12 30 30 15 165 150 0 0 0 0 1 0 1 0 0 rectangle 2
191 180 60 200 6 2 1 30 60 3 2 7 20 30 15 175 160 1 0 0 1 0 0 1 0 0 rectangle 2
192 195 70 600 6 4 3 23 80 2 2 12 50 25 10 170 170 0 0 0 0 0 1 1 0 0 rectangle 1
193 180 60 250 6 3 1 15 60 2 3 5 60 30 20 175 165 1 0 0 0 0 1 0 1 0 rectangle 1
194 170 70 250 6 4 1 20 90 2 2 10 25 20 20 175 170 0 0 0 1 0 0 0 1 0 round 1
195 180 90 250 6 3 1 25 50 1 2 9 55 30 15 170 175 1 0 0 0 0 1 1 0 0 rectangle 1
196 160 70 550 6 3 4 30 90 3 2 10 60 20 15 165 165 0 1 0 1 0 0 0 1 0 round 0
197 175 60 200 8 2 3 15 60 1 2 11 50 30 15 165 175 0 1 0 1 0 0 0 0 1 rectangle 1
198 170 80 500 6 3 2 25 50 1 1 5 60 20 15 175 150 1 0 0 1 0 0 0 1 0 square 2
199 180 50 600 4 4 4 15 80 1 1 5 50 20 15 170 170 1 0 0 1 0 0 0 1 0 rectangle 2
A very simple solution could be to run a decision tree classifier on your data and visualize the tree using the graphviz library; here's the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
You could also visualize the generated dot file in WebGraphviz. The outcome of this exercise can be the value ranges you are expecting.
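For illustration, a minimal sketch of that suggestion, assuming df is the poundcake dataframe from the question and restricting the tree to the five features already identified (the depth and file name here are arbitrary choices, not part of the original answer):
from sklearn.tree import DecisionTreeClassifier, export_graphviz

top_features = ["mixingtime_min", "bakingtime_min", "flour_g", "sugar_g", "preheattemp_c"]
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(df[top_features], df["caketaste"])

# Write a dot file that graphviz/WebGraphviz can render; the root-to-leaf paths ending in
# leaves dominated by class 2 ("good") give candidate value ranges for each feature.
export_graphviz(tree_clf, out_file="caketaste_tree.dot",
                feature_names=top_features,
                class_names=["bad", "neutral", "good"],
                filled=True, rounded=True)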
I'm trying to get some data from XML using pandas. Currently I have "working" code, and by working I mean it almost works.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
response = requests.get(url).content
soup = BeautifulSoup(response)
tables = soup.find_all('tabela_rozklad')
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']
df = pd.DataFrame()
for table in tables:
    all = map(lambda x: table.find(x).text, tags)
    df = df.append([all])
df.columns = tags
a = df[(df.sem == "1")]
a = a[(a.id_spec == "0")]
a = a[(a.dzien == "1")]
print(a)
So I'm getting an error on a = df[(df.sem == "1")], which is:
File "pandas\index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas\index.c:4443)
File "pandas\index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas\index.c:4289)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13733)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13687)
As I read other Stack Overflow questions I saw people suggesting df.loc, so I modified this line to
a = df.loc[(df.sem == "1")]
Now the code compiles, but the result looks as if this line doesn't exist. I need to mention that the problem is with the "sem" tag only; the rest works perfectly, but unfortunately I need to use exactly this tag. If anyone could explain what is causing this error and how to fix it, I would be grateful.
You can add ignore_index=True to append to avoid a duplicated index, and you need to select the column sem with [], because sem is also a built-in DataFrame method (it computes the standard error of the mean), so df.sem refers to that method rather than to your column:
df = pd.DataFrame()
for table in tables:
    all = map(lambda x: table.find(x).text, tags)
    df = df.append([all], ignore_index=True)
df.columns = tags
#print (df)
a = df[(df['sem'] == '1') & (df.id_spec == "0") & (df.dzien == "1")]
print(a)
dzien godz ilosc tyg id_naucz id_sala id_prz rodz grupa id_st sem id_spec
0 1 1 2 0 52 79 13 W 1 13 1 0
1 1 3 2 0 12 79 32 W 1 13 1 0
2 1 5 2 0 52 65 13 Ćw 1 13 1 0
3 1 11 2 0 201 3 70 Ćw 10 13 1 0
4 1 5 2 0 36 78 13 Ps 5 13 1 0
5 1 5 2 1 18 32 450 Ps 3 13 1 0
6 1 5 2 2 18 32 450 Ps 4 13 1 0
7 1 7 2 1 18 32 450 Ps 7 13 1 0
8 1 7 2 2 18 32 450 Ps 8 13 1 0
9 1 7 2 0 66 65 104 Ćw 1 13 1 0
10 1 7 2 0 283 3 104 Ćw 5 13 1 0
11 1 7 2 0 346 5 104 Ćw 8 13 1 0
12 1 7 2 0 184 29 13 Ćw 7 13 1 0
13 1 9 2 0 66 65 104 Ćw 2 13 1 0
14 1 9 2 0 346 5 70 Ćw 8 13 1 0
15 1 9 1 0 73 3 203 Ćw 9 13 1 0
16 1 10 1 0 73 3 203 Ćw 10 13 1 0
17 1 9 2 0 184 19 13 Ps 13 13 1 0
18 1 11 2 0 184 19 13 Ps 14 13 1 0
19 1 11 2 0 44 65 13 Ćw 9 13 1 0
87 1 9 2 0 201 54 463 W 1 17 1 0
88 1 3 2 0 36 29 13 Ćw 2 17 1 0
89 1 3 2 0 211 5 70 Ćw 1 17 1 0
90 1 5 2 0 211 5 70 Ćw 2 17 1 0
91 1 7 2 0 36 78 13 Ps 4 17 1 0
105 1 1 2 1 11 16 32 Ps 2 18 1 0
106 1 1 2 2 11 16 32 Ps 3 18 1 0
107 1 3 2 0 51 3 457 W 1 18 1 0
110 1 5 2 2 11 16 32 Ps 1 18 1 0
111 1 7 2 0 91 64 97 Ćw 2 18 1 0
112 1 5 2 0 283 3 457 Ćw 2 18 1 0
254 1 5 1 0 12 29 32 Ćw 6 13 1 0
255 1 6 1 0 12 29 32 Ćw 5 13 1 0
462 1 7 2 0 98 1 486 W 1 19 1 0
463 1 9 1 0 91 1 484 W 1 19 1 0
487 1 5 2 0 116 19 13 Ps 1 17 1 0
488 1 7 2 0 116 19 13 Ps 2 17 1 0
498 1 5 2 0 0 0 431 Ps 2 17 1 0
502 1 5 2 0 0 0 431 Ps 15 13 1 0
503 1 5 2 0 0 0 431 Ps 16 13 1 0
504 1 5 2 0 0 0 431 Ps 19 13 1 0
505 1 5 2 0 0 0 431 Ps 20 13 1 0
531 1 13 2 0 350 79 493 W 1 13 1 0
532 1 13 2 0 350 79 493 W 2 17 1 0
533 1 13 2 0 350 79 493 W 1 18 1 0
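As a quick aside (not part of the original answer), you can see why attribute access fails for this particular column: the sem name collides with the built-in DataFrame.sem method, so only bracket indexing reaches the data.
print(type(df.sem))      # <class 'method'> - DataFrame.sem, the standard error of the mean
print(df['sem'].head())  # the actual 'sem' column parsed from the XML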
Given the following dataframe df:
app platform uuid minutes
0 1 0 a696ccf9-22cb-428b-adee-95c9a97a4581 67
1 2 0 8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
2 2 1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 1 0 26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
4 2 0 34271596-eebb-4423-b890-dc3761ed37ca 8
5 3 1 C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
6 2 0 245501ec2e39cb782bab1fb02d7813b7 1
7 3 1 DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
8 3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
9 2 0 9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
10 3 1 19fdaedfd0dbdaf6a7a6b49619f11a19 3
11 3 1 AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
12 2 0 4eb1024b-c293-42a4-95a2-31b20c3b524b 24
13 3 1 8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
14 3 1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
15 2 0 ec7fedb6-b118-424a-babe-b8ffad579685 266
16 1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
17 2 0 f786528ded200c9f553dd3a5e9e9bb2d 10
18 3 1 1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
19 2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
I'll group it:
y = df.groupby(['app','platform','uuid']).sum().reset_index().sort_values(['app','platform','minutes'], ascending=[1,1,0]).set_index(['app','platform','uuid'])
minutes
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 67
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 2
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 470
ec7fedb6-b118-424a-babe-b8ffad579685 266
4eb1024b-c293-42a4-95a2-31b20c3b524b 24
f786528ded200c9f553dd3a5e9e9bb2d 10
34271596-eebb-4423-b890-dc3761ed37ca 8
245501ec2e39cb782bab1fb02d7813b7 1
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 1
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 58
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 30
C57D0F52-B565-4322-85D2-C2798F7CA6FF 16
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 13
8E0B0BE3-8553-4F38-9837-6C907E01F84C 7
19fdaedfd0dbdaf6a7a6b49619f11a19 3
So now I have the minutes per uuid in descending order.
Now I will compute the cumulative sum of minutes within each app/platform group:
y.groupby(level=[0,1]).cumsum()
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184
a696ccf9-22cb-428b-adee-95c9a97a4581 251
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 2408
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 2878
ec7fedb6-b118-424a-babe-b8ffad579685 3144
4eb1024b-c293-42a4-95a2-31b20c3b524b 3168
f786528ded200c9f553dd3a5e9e9bb2d 3178
34271596-eebb-4423-b890-dc3761ed37ca 3186
245501ec2e39cb782bab1fb02d7813b7 3187
8e17a2eb-f0ee-49ae-b8c2-c9f9926aa56d 3188
1 40AD6CD1-4A7B-48DD-8815-1829C093A95C 13
3 0 f88eb774-fdf3-4d1d-a91d-0b4ab95cf36e 10
1 E8B2849C-F050-4DCD-B311-5D57015466AE 465
AAF1CFF7-4564-4C79-B2D8-F0AAF9C9971B 523
DE6E4714-5A3C-4C80-BD81-EAACB2364DF0 553
C57D0F52-B565-4322-85D2-C2798F7CA6FF 569
1E291633-AF27-4DFB-8DA4-4A5B63F175CF 582
8E0B0BE3-8553-4F38-9837-6C907E01F84C 589
19fdaedfd0dbdaf6a7a6b49619f11a19 592
My question is: how can I get the percent against the total cumulative sum, per group, i.e., something like this:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 184 0.26
a696ccf9-22cb-428b-adee-95c9a97a4581 251 0.36
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 253 0.36
...
...
...
It's not clear how you came up with 0.26 and 0.36 in your desired output, but assuming those are just dummy numbers, to get a running % of total for each group you could do this:
y['cumsum'] = y.groupby(level=[0,1]).cumsum()
y['running_pct'] = y.groupby(level=[0,1])['cumsum'].transform(lambda x: x / x.iloc[-1])
That should give the right output.
In [398]: y['running_pct'].head()
Out[398]:
app platform uuid
1 0 7e302dcb-ceaf-406c-a9e5-66933d921064 0.727273
a696ccf9-22cb-428b-adee-95c9a97a4581 0.992095
26c1022a-7a8e-42a2-b7cc-bea6bffa7a6f 1.000000
2 0 953a525c-97e0-4c2f-90e0-dfebde3ec20d 0.755332
9c08c860-7a6d-4810-a5c3-f3af2a3fcf66 0.902760
Name: running_pct, dtype: float64
EDIT:
Per the comments, if you're looking to wring out a little more performance, this will be faster as of version 0.14.1:
y['cumsum'] = y.groupby(level=[0,1])['minutes'].transform('cumsum')
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('sum')
And as #Jeff notes, in 0.15.0 this may be faster yet.
y['running_pct'] = y['cumsum'] / y.groupby(level=[0,1])['minutes'].transform('last')