Mapping - Feature Importance vs Label classification - python
I have a set of data (200 rows) related to vanilla pound cake baking, with 27 features, as below. The label caketaste is a measure of how good the baked cake was, defined as bad (0), neutral (1), good (2).
Features = cake_id, flour_g, butter_g, sugar_g, salt_g, eggs_count, bakingpowder_g, milk_ml, water_ml, vanillaextract_ml, lemonzest_g, mixingtime_min, bakingtime_min, preheattime_min, coolingtime_min, bakingtemp_c, preheattemp_c, color_red, color_green, color_blue, traysize_small, traysize_medium, traysize_large, milktype_lowfat, milktype_skim, milktype_whole, trayshape.
Label = caketaste ["bad", "neutral", "good"]
My task is to find:
a) the 5 most important features that affect the label's outcome;
b) the values of those 5 most important features that contributed to a "good" classification of the label.
I am able to solve task a) using sklearn (Python): fit the data with RandomForestClassifier(), then identify the 5 most important features via the fitted model's feature_importances_ attribute, which are mixingtime_min, bakingtime_min, sugar_g, flour_g and preheattemp_c.
Minimal, Complete, and Verifiable Example:
#################################################################
# a) Libraries
#################################################################
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import time
#################################################################
# b) Data Loading
#################################################################
df = pd.read_excel("poundcake.xlsx", sheet_name="Sheet0", engine='openpyxl')
#################################################################
# c) Analyzing Dataframe
#################################################################
#Getting dataframe details e.g columns, total entries, data types etc
print("\n<syntax>: df.info()")
df.info()
#Getting the 1st 5 lines in the dataframe
print("\n<syntax>: df.head()")
df.head()
#################################################################
# d) Data Visualization
#################################################################
#Scatterplot: cake_id vs caketaste
fig=plt.figure()
ax=fig.add_axes([0,0,1,1])
ax.scatter(df["cake_id"], df["caketaste"], color='r')
ax.set_xlabel('cake_id')
ax.set_ylabel('caketaste')
ax.set_title('scatter plot')
plt.show()
#################################################################
# e) Feature selection
#################################################################
#Note:
#scikit-learn models cannot work directly with categorical (string) data.
#Need to convert the categorical variables into numeric types before building a machine learning model.
categorical_columns = ["trayshape"]
numerical_columns = ["flour_g","butter_g","sugar_g","salt_g","eggs_count","bakingpowder_g","milk_ml","water_ml","vanillaextract_ml","lemonzest_g","mixingtime_min","bakingtime_min","preheattime_min","coolingtime_min","bakingtemp_c","preheattemp_c","color_red","color_green","color_blue","traysize_small","traysize_medium","traysize_large","milktype_lowfat","milktype_skim","milktype_whole"]
#################################################################
# f) Dataset (Train Test Split)
#
# (Dataset)
# ┌──────────────────────────────────────────┐
# ┌──────────────────────────┬────────────┐
# | Training │ Test │
# └──────────────────────────┴────────────┘
#################################################################
# Features
X = df[categorical_columns + numerical_columns]
# Prediction target
y = df["caketaste"]
# Break off a test set from the data. Default: train_size=0.75, test_size=0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)
#################################################################
# Pipeline
#################################################################
#######################
# g) Column Transformer
#######################
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
#Mean might not be suitable, Remove rows?
numerical_pipe = Pipeline([
('imp', SimpleImputer(strategy='mean'))
])
preprocessing = ColumnTransformer(
[('cat', categorical_encoder, categorical_columns),
('num', numerical_pipe, numerical_columns)])
#####################
# Pipeline Definitions
#####################
#RF: builds multiple decision trees and merges (bagging) them together
#to get a more accurate and stable prediction (averaging).
pipe_xxx_xxx_rfo = Pipeline([
('pre', preprocessing),
('scl', None),
('pca', None),
('clf', RandomForestClassifier(random_state=42))
])
pipe_abs_xxx_rfo = Pipeline([
('pre', preprocessing),
('scl', MaxAbsScaler()),
('pca', None),
('clf', RandomForestClassifier(random_state=42))
])
#################################################################
# h) Hyper-Parameter Tuning
#################################################################
parameters_rfo = {
'clf__n_estimators':[100],
'clf__criterion':['gini'],
'clf__min_samples_split':[2,5],
'clf__min_samples_leaf':[1,2]
}
parameters_rfo_bk = {
'clf__n_estimators':[10,20,30,40,50,60,70,80,90,100,1000],
'clf__criterion':['gini','entropy'],
'clf__min_samples_split':[5,10,15,20,25,30],
'clf__min_samples_leaf':[1,2,3,4,5]
}
#########################
# i) GridSearch Printer
#########################
# scoring can be used as 'accuracy' or for MAE use 'neg_mean_absolute_error'
scr='accuracy'
grid_xxx_xxx_rfo = GridSearchCV(pipe_xxx_xxx_rfo,
param_grid=parameters_rfo,
scoring=scr,
cv=5,
refit=True)
grid_abs_xxx_rfo = GridSearchCV(pipe_abs_xxx_rfo,
param_grid=parameters_rfo,
scoring=scr,
cv=5,
refit=True)
print("Pipeline setup.... Complete")
###################################################
# Machine Learning Models Evaluation Algorithm
###################################################
grids = [grid_xxx_xxx_rfo, grid_abs_xxx_rfo]
grid_dict = { 0: 'RandomForestClassifier',
1: 'RandomForestClassifier with AbsMaxScaler',
}
# Fit the grid search objects
print('Performing model optimizations...\n')
best_test_scr = -np.inf #sentinel; None cannot be compared with ">" in Python 3
best_clf = 0
best_rf = None
for idx, grid in enumerate(grids):
start_time = time.time()
print('*' * 100)
print('\nEstimator: %s' % grid_dict[idx])
# Fit grid search
grid.fit(X_train, y_train)
#Calculate the score once and use when needed
test_scr = grid.score(X_test,y_test)
train_scr = grid.score(X_train,y_train)
# Track the best (highest test score) model
if test_scr > best_test_scr:
best_test_scr = test_scr
best_train_scr = train_scr
best_rf = grid
best_clf = idx
print("..........................this model is better. SELECTED")
print("Best params : %s" % grid.best_params_)
print("Training accuracy : %s" % best_train_scr)
print("Test accuracy : %s" % best_test_scr)
print("Modeling time : %s" % time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))
print('\nClassifier with best test set score: %s' % grid_dict[best_clf])
#########################################################################################
# j) Feature Importance using Gini Importance or Mean Decrease in Impurity (MDI)
# Note:
# 1. Calculates each feature's importance as the sum over the number of splits (across
#    all trees) that include the feature, proportionally to the number of samples it splits.
# 2. Biased towards high-cardinality features, i.e. the numerical variables
########################################################################################
ohe = (best_rf.best_estimator_.named_steps['pre'].named_transformers_['cat'])
feature_names = ohe.get_feature_names(input_features=categorical_columns)  # in sklearn >= 1.0 use get_feature_names_out(categorical_columns)
feature_names = np.r_[feature_names, numerical_columns]
tree_feature_importances = (best_rf.best_estimator_.named_steps['clf'].feature_importances_)
sorted_idx = tree_feature_importances.argsort()
# Figure: Top Features
count=-28
y_ticks = np.arange(0, abs(count))
fig, ax = plt.subplots()
ax.barh(y_ticks[count:], tree_feature_importances[sorted_idx][count:])
ax.set_yticks(y_ticks[count:])
ax.set_yticklabels(feature_names[sorted_idx][count:], fontsize=7)
ax.set_title("Random Forest Tree's Feature Importance from Mean Decrease in Impurity (MDI)")
fig.tight_layout()
plt.show()
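#################################################################
# Side note: permutation_importance is imported above but never used.
# A minimal sketch of using it as a cross-check on the fitted pipeline --
# permutation importance is less biased toward high-cardinality features
# than MDI (n_repeats=10 is an arbitrary choice):
#################################################################
result = permutation_importance(best_rf.best_estimator_, X_test, y_test,
                                n_repeats=10, random_state=42)
perm_sorted_idx = result.importances_mean.argsort()
for name, mean in zip(X.columns[perm_sorted_idx], result.importances_mean[perm_sorted_idx]):
    print("%s: %.4f" % (name, mean))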
What approach can one use to solve task b)? I am trying to answer the research question below:
What are the values of mixingtime_min, bakingtime_min, flour_g, sugar_g and preheattemp_c that statistically contributed to a "good" caketaste (Good: 2)?
Possible Expected Result:
mixingtime_min = [5,10,15] AND
bakingtime_min = [50,51,52,53,54,55] AND
flour_g = [150,160,170,180] AND
sugar_g = [200, 250] AND
preheattemp_c = [150,160,170]
The above result basically concludes that if a person would like a good-tasting cake, they need to bake it using 150-180 g flour with 200-250 g sugar, mix the dough for 5-15 minutes, then bake it for 50-55 minutes in an oven preheated to 150-170 °C.
Hope you can give some pointers.
Question
Would you be able to guide me on how to approach this research question?
Is there any library in sklearn, or otherwise, that is able to extract this information?
Any additional information such as confidence interval, outliers etc. is a bonus.
The data (poundcake.xlsx):
cake_id flour_g butter_g sugar_g salt_g eggs_count bakingpowder_g milk_ml water_ml vanillaextract_ml lemonzest_g mixingtime_min bakingtime_min preheattime_min coolingtime_min bakingtemp_c preheattemp_c color_red color_green color_blue traysize_small traysize_medium traysize_large milktype_lowfat milktype_skim milktype_whole trayshape caketaste
0 180 50 250 2 3 3 15 80 1 2 10 30 25 15 170 175 1 0 0 1 0 0 1 0 0 square 1
1 195 50 500 6 6 1 30 60 1 2 10 40 30 10 170 170 0 1 0 1 0 0 0 1 0 rectangle 1
2 160 40 600 6 5 1 15 90 3 3 5 30 30 10 155 160 1 0 0 1 0 0 0 0 1 square 2
3 200 80 350 8 4 2 15 50 1 1 7 40 20 10 175 165 0 1 0 1 0 0 0 0 1 rectangle 0
4 175 90 400 6 6 4 25 90 1 1 9 60 25 15 160 155 1 0 0 0 0 1 0 1 0 rectangle 0
5 180 60 650 6 3 4 20 80 2 3 7 15 20 20 155 160 0 0 1 0 0 1 0 1 0 rectangle 2
6 165 50 200 6 4 2 20 80 1 2 7 60 30 20 150 170 0 1 0 1 0 0 1 0 0 rectangle 0
7 170 70 200 6 2 3 25 50 2 3 8 70 20 10 170 150 0 1 0 1 0 0 0 1 0 rectangle 1
8 160 90 300 8 4 4 25 60 3 2 9 35 30 15 175 170 0 1 0 1 0 0 1 0 0 square 1
9 165 50 350 6 4 1 25 80 1 2 11 30 10 10 170 170 1 0 0 0 1 0 1 0 0 square 1
10 180 90 650 4 3 4 20 50 2 3 8 30 30 15 165 170 1 0 0 1 0 0 0 1 0 square 1
11 165 40 350 6 2 2 30 60 3 3 5 50 25 15 175 170 0 0 1 1 0 0 0 0 1 rectangle 1
12 175 70 500 6 2 1 25 80 1 1 7 60 20 15 170 170 0 1 0 1 0 0 1 0 0 square 2
13 175 70 350 6 2 1 15 60 2 3 9 45 30 15 175 170 0 0 0 1 0 0 0 1 0 rectangle 1
14 160 70 600 4 6 4 30 60 2 3 5 60 25 10 150 155 0 1 0 1 0 0 0 1 0 rectangle 0
15 165 50 500 2 3 4 20 60 1 3 10 30 15 20 175 175 0 1 0 1 0 0 1 0 0 rectangle 0
16 195 50 600 6 5 2 25 60 1 1 5 30 10 20 170 150 0 0 0 1 0 0 0 0 1 square 2
17 160 60 600 8 5 4 25 70 3 3 9 30 30 10 175 150 0 0 0 1 0 0 1 0 0 rectangle 0
18 160 80 550 6 3 3 23 80 1 1 9 25 30 15 155 170 0 0 1 1 0 0 0 0 1 rectangle 1
19 170 60 600 4 5 1 20 90 3 3 10 55 20 15 165 155 0 0 1 1 0 0 0 0 1 square 0
20 175 70 300 6 5 4 25 70 1 1 11 65 15 20 170 155 0 0 1 1 0 0 0 1 0 round 0
21 195 80 250 6 6 2 23 70 2 3 11 20 30 15 170 155 0 0 1 1 0 0 1 0 0 rectangle 0
22 170 90 650 6 3 4 20 70 1 2 10 60 25 15 170 155 0 0 1 0 0 1 0 1 0 rectangle 1
23 180 40 200 6 3 1 15 60 3 1 5 35 15 15 170 170 0 1 0 1 0 0 0 1 0 rectangle 2
24 165 50 550 8 4 2 23 80 1 2 5 65 30 15 155 175 0 0 0 1 0 0 1 0 0 rectangle 1
25 170 50 250 6 2 3 25 70 2 2 6 30 20 15 165 175 0 0 0 0 0 1 0 1 0 rectangle 2
26 180 50 200 6 4 2 30 80 1 3 10 30 20 15 165 165 0 0 0 1 0 0 0 1 0 rectangle 2
27 200 90 500 6 3 4 25 70 2 1 9 60 30 15 170 160 0 1 0 1 0 0 0 1 0 rectangle 2
28 170 60 300 6 2 3 25 80 1 1 9 15 15 15 160 150 1 0 0 0 0 1 0 0 1 round 1
29 170 60 400 2 3 2 25 60 1 3 9 25 15 15 160 175 0 0 0 1 0 0 1 0 0 square 0
30 195 50 650 4 5 2 25 60 1 3 7 40 15 15 165 170 0 1 0 1 0 0 1 0 0 rectangle 1
31 170 50 350 6 6 1 25 80 2 2 8 50 25 15 150 170 0 1 0 1 0 0 1 0 0 rectangle 2
32 160 80 550 4 4 4 20 70 1 3 7 25 25 15 170 165 1 0 0 0 0 1 0 0 1 rectangle 2
33 170 50 300 4 4 2 23 50 2 2 10 30 20 15 150 170 0 0 0 1 0 0 1 0 0 rectangle 0
34 175 70 650 4 4 1 23 70 3 3 10 55 10 15 150 170 0 0 1 1 0 0 0 0 1 rectangle 0
35 180 70 400 6 2 2 20 60 1 1 6 55 30 15 170 150 0 0 0 1 0 0 1 0 0 square 2
36 195 60 300 6 6 4 23 70 2 2 10 30 30 15 170 175 1 0 0 1 0 0 1 0 0 rectangle 0
37 180 70 400 6 4 1 20 70 3 2 9 30 30 20 160 170 1 0 0 1 0 0 0 1 0 rectangle 2
38 170 90 600 8 3 1 20 50 1 2 9 30 30 15 155 170 1 0 0 1 0 0 0 1 0 rectangle 2
39 180 60 200 2 3 2 20 70 1 2 10 55 30 20 165 155 0 1 0 1 0 0 0 1 0 round 2
40 180 70 400 6 4 2 15 60 1 3 7 45 30 10 170 175 0 0 0 1 0 0 0 1 0 rectangle 2
41 170 70 200 6 3 1 30 60 3 2 6 40 15 15 170 175 0 0 1 1 0 0 0 0 1 rectangle 2
42 170 60 550 6 3 4 20 80 1 2 9 60 20 15 150 165 1 0 0 1 0 0 1 0 0 round 2
43 170 50 600 6 4 3 30 60 1 2 11 15 30 15 155 150 1 0 0 0 1 0 1 0 0 rectangle 0
44 175 70 200 4 4 3 30 70 3 2 6 20 10 20 170 170 0 0 0 1 0 0 1 0 0 rectangle 1
45 195 70 500 8 4 2 25 60 2 3 6 15 30 15 165 170 1 0 0 0 0 1 0 1 0 rectangle 2
46 180 80 200 4 4 4 15 80 1 3 6 50 30 15 155 150 0 0 0 1 0 0 0 1 0 rectangle 2
47 165 50 350 6 4 2 20 60 1 1 9 40 20 15 150 155 0 0 0 1 0 0 1 0 0 rectangle 0
48 170 70 550 2 2 4 20 60 3 2 9 55 30 15 165 165 0 1 0 1 0 0 0 0 1 round 0
49 175 70 350 6 5 4 30 80 1 2 9 55 30 10 155 170 0 0 0 0 0 1 1 0 0 rectangle 1
50 180 50 400 6 4 3 25 50 2 2 9 20 20 20 160 160 0 0 0 1 0 0 0 1 0 rectangle 2
51 165 50 650 6 5 4 20 60 1 2 5 60 30 15 175 170 0 0 1 1 0 0 0 0 1 square 0
52 170 70 200 2 6 3 25 60 1 3 8 35 25 15 170 155 1 0 0 1 0 0 0 0 1 rectangle 1
53 180 40 350 4 4 3 30 60 3 2 12 45 30 15 150 175 0 0 0 1 0 0 0 1 0 rectangle 1
54 175 50 600 8 3 1 20 80 2 1 7 30 15 15 150 160 0 0 0 1 0 0 0 0 1 square 0
55 175 70 400 4 3 1 25 90 1 2 5 50 30 10 170 170 1 0 0 0 0 1 1 0 0 rectangle 1
56 170 50 650 6 6 3 20 70 1 1 6 25 30 15 170 160 1 0 0 1 0 0 0 1 0 rectangle 2
57 200 70 650 6 3 1 15 60 2 1 10 25 10 15 170 150 0 1 0 1 0 0 0 0 1 rectangle 2
58 175 80 650 6 5 2 23 70 1 1 5 45 15 15 160 170 0 1 0 1 0 0 0 0 1 rectangle 1
59 170 50 200 8 3 4 30 70 1 3 11 35 25 15 170 170 0 0 0 1 0 0 0 1 0 rectangle 1
60 170 60 300 6 3 1 20 60 3 3 11 20 30 15 170 170 1 0 0 1 0 0 0 0 1 rectangle 0
61 180 40 350 2 4 3 20 70 3 2 12 20 10 15 150 160 0 0 0 1 0 0 1 0 0 square 2
62 175 60 200 6 6 1 15 80 2 2 12 25 20 15 155 160 1 0 0 1 0 0 0 0 1 rectangle 2
63 170 70 650 6 2 3 23 90 3 3 10 25 30 20 170 155 1 0 0 1 0 0 0 1 0 rectangle 2
64 170 70 600 6 4 2 25 80 2 2 6 50 15 15 170 155 0 0 0 1 0 0 0 1 0 rectangle 0
65 170 60 250 6 2 2 30 60 1 2 9 20 15 10 165 165 0 0 0 1 0 0 0 1 0 rectangle 2
66 175 50 650 4 2 1 23 60 2 2 11 20 30 20 170 175 1 0 0 1 0 0 0 1 0 rectangle 1
67 175 70 350 4 3 3 30 50 1 2 10 35 25 15 175 170 0 0 0 1 0 0 1 0 0 rectangle 0
68 165 90 600 6 5 2 23 60 1 3 9 55 10 15 160 165 0 1 0 1 0 0 1 0 0 square 0
69 200 80 600 6 3 1 30 60 2 1 8 30 30 15 175 165 1 0 0 0 1 0 0 0 1 rectangle 1
70 165 50 200 6 5 2 23 60 2 1 12 55 30 15 170 170 0 0 0 0 0 1 0 0 1 round 0
71 175 60 300 4 6 1 15 60 3 2 12 55 20 15 175 165 0 0 0 1 0 0 0 0 1 square 0
72 175 70 200 8 5 4 20 60 1 3 12 60 25 15 175 170 0 1 0 1 0 0 0 1 0 rectangle 2
73 180 60 200 4 4 4 30 70 1 3 8 35 30 10 175 170 0 0 0 1 0 0 1 0 0 rectangle 2
74 170 80 650 6 3 1 30 60 1 2 5 55 30 20 155 175 1 0 0 1 0 0 0 0 1 rectangle 2
75 170 60 500 8 4 1 23 60 3 2 7 60 30 15 165 170 0 0 0 1 0 0 0 1 0 square 2
76 175 70 250 6 4 2 30 60 1 2 12 65 20 15 170 160 1 0 0 0 0 1 0 0 1 square 2
77 180 50 500 8 5 1 15 70 3 3 8 40 10 15 165 155 0 0 1 0 1 0 0 0 1 rectangle 1
78 175 60 550 6 4 2 20 90 1 2 7 25 30 15 175 165 0 1 0 1 0 0 0 0 1 rectangle 0
79 170 70 600 8 4 4 15 80 3 3 11 60 30 15 175 150 1 0 0 1 0 0 0 0 1 rectangle 1
80 195 60 200 4 5 3 30 60 1 2 8 30 20 15 170 170 0 1 0 1 0 0 0 1 0 square 0
81 180 70 300 6 3 3 20 90 1 3 11 25 20 10 170 150 0 0 0 1 0 0 0 1 0 rectangle 0
82 170 40 550 2 4 3 30 60 1 2 9 35 30 10 170 170 0 0 0 0 0 1 0 1 0 square 1
83 175 60 550 6 5 2 15 90 1 1 11 30 10 15 170 175 1 0 0 1 0 0 0 0 1 rectangle 0
84 180 50 350 4 4 3 23 50 2 2 7 20 30 10 170 175 0 0 0 1 0 0 0 0 1 rectangle 2
85 180 80 600 4 4 1 25 60 1 1 5 55 30 10 170 165 0 0 1 1 0 0 0 0 1 rectangle 1
86 175 50 650 8 2 3 15 50 1 2 10 50 25 15 160 160 0 0 0 1 0 0 0 0 1 square 0
87 175 50 350 2 6 3 23 80 2 2 10 20 25 15 170 155 1 0 0 1 0 0 0 0 1 rectangle 1
88 170 50 350 4 2 4 25 60 2 1 10 20 15 15 150 155 0 1 0 1 0 0 1 0 0 rectangle 0
89 180 50 550 6 5 4 30 90 2 3 7 60 30 15 155 175 0 0 0 1 0 0 0 1 0 rectangle 2
90 170 70 600 6 5 3 15 90 1 2 6 45 10 15 170 170 0 1 0 1 0 0 1 0 0 round 1
91 170 70 300 4 4 2 20 60 1 1 10 15 30 10 165 155 0 0 0 1 0 0 1 0 0 rectangle 1
92 180 50 650 4 2 4 20 80 1 2 8 65 30 15 150 160 0 1 0 1 0 0 0 0 1 rectangle 2
93 170 50 350 6 3 3 30 60 1 3 7 55 30 20 155 170 1 0 0 1 0 0 1 0 0 rectangle 0
94 170 90 400 6 4 1 30 60 3 2 12 70 30 15 170 160 0 0 1 1 0 0 0 1 0 rectangle 1
95 160 70 400 2 6 4 23 70 2 1 9 20 30 10 150 175 0 0 0 1 0 0 0 0 1 square 1
96 170 80 250 4 2 3 30 60 3 1 10 30 30 15 155 165 0 0 0 0 0 1 0 0 1 rectangle 1
97 195 70 250 6 6 4 30 80 3 1 11 20 15 15 170 170 1 0 0 1 0 0 0 0 1 rectangle 2
98 180 50 650 6 6 1 30 90 3 1 7 25 15 15 170 170 1 0 0 1 0 0 0 0 1 rectangle 2
99 195 50 200 6 3 1 23 90 1 1 9 55 25 15 160 170 0 0 0 1 0 0 0 0 1 rectangle 0
100 175 50 200 4 3 3 20 50 2 2 12 15 30 10 170 170 0 0 1 1 0 0 0 1 0 square 1
101 165 70 350 4 4 4 15 90 1 2 12 40 15 15 155 155 0 1 0 1 0 0 0 0 1 rectangle 1
102 180 80 600 4 4 3 25 50 1 2 11 30 10 15 155 170 0 0 1 1 0 0 0 0 1 rectangle 1
103 165 50 300 6 3 1 30 60 1 1 9 40 25 15 160 170 0 0 0 1 0 0 0 1 0 rectangle 1
104 160 50 600 8 2 4 20 60 1 2 12 60 30 15 170 170 0 0 0 1 0 0 1 0 0 square 2
105 170 90 200 2 2 2 15 60 3 2 5 40 20 15 170 160 0 0 0 1 0 0 0 1 0 rectangle 2
106 175 90 600 6 4 2 15 60 1 1 7 20 30 15 175 170 1 0 0 0 0 1 0 1 0 rectangle 2
107 180 70 550 6 3 1 15 90 1 1 9 25 30 15 150 160 1 0 0 1 0 0 0 1 0 rectangle 2
108 170 90 250 8 4 4 30 60 2 3 6 60 25 15 155 155 0 0 0 1 0 0 0 0 1 rectangle 0
109 200 40 500 6 6 2 20 60 3 2 10 50 30 15 170 155 0 0 0 1 0 0 1 0 0 rectangle 0
110 175 70 500 2 3 4 30 60 3 2 5 65 20 15 170 155 1 0 0 1 0 0 0 0 1 rectangle 2
111 165 60 550 6 3 2 30 80 2 1 9 20 25 20 170 175 0 0 0 1 0 0 0 0 1 rectangle 2
112 195 70 350 6 6 2 25 90 2 2 12 50 30 15 150 165 0 0 1 1 0 0 0 1 0 square 2
113 165 90 300 4 3 4 30 60 1 2 9 30 25 15 165 170 0 1 0 0 0 1 0 1 0 rectangle 0
114 195 40 650 6 2 1 23 80 1 2 5 25 25 15 170 165 0 1 0 1 0 0 0 1 0 rectangle 1
115 175 60 200 2 4 3 15 50 3 3 6 25 30 15 155 170 1 0 0 1 0 0 1 0 0 square 0
116 175 70 400 6 4 3 15 60 2 3 11 20 20 15 150 170 1 0 0 0 1 0 0 1 0 rectangle 2
117 195 70 350 6 3 2 30 60 3 2 12 25 25 20 175 175 0 0 0 1 0 0 0 0 1 rectangle 2
118 170 50 500 6 4 3 30 80 2 3 10 60 30 15 170 160 0 1 0 1 0 0 0 0 1 rectangle 0
119 195 60 650 6 4 1 20 70 3 2 5 65 20 20 170 150 0 0 1 0 0 1 0 0 1 rectangle 2
120 170 70 650 8 4 4 25 80 1 2 9 45 30 15 170 170 0 0 1 1 0 0 0 1 0 round 1
121 170 70 650 8 4 2 30 90 1 2 12 30 15 15 170 170 0 0 1 1 0 0 1 0 0 square 0
122 170 60 400 4 6 4 15 60 2 2 11 60 30 15 170 150 0 0 1 1 0 0 1 0 0 square 0
123 175 60 300 8 6 3 20 60 2 2 12 50 25 15 150 175 0 0 1 0 1 0 0 1 0 round 2
124 175 50 400 4 3 1 23 50 3 2 9 50 30 15 150 150 0 0 1 1 0 0 0 1 0 square 0
125 180 40 300 6 4 1 15 50 3 2 10 60 30 15 170 175 0 0 1 0 1 0 0 1 0 rectangle 2
126 195 60 250 6 4 3 25 90 2 2 6 60 30 10 170 175 1 0 0 0 0 1 0 0 1 rectangle 2
127 160 70 300 4 2 1 20 60 2 2 5 40 20 15 160 170 0 0 0 1 0 0 0 1 0 square 2
128 170 60 300 8 6 2 30 80 1 1 10 65 30 15 155 155 0 1 0 1 0 0 0 0 1 square 2
129 160 40 350 6 6 2 15 60 1 1 5 25 30 15 155 170 0 0 1 0 0 1 0 1 0 rectangle 2
130 170 60 500 2 5 3 30 50 3 2 10 60 10 15 165 160 0 0 0 1 0 0 1 0 0 rectangle 1
131 170 60 650 8 3 3 23 90 1 1 10 70 15 15 170 175 1 0 0 1 0 0 1 0 0 rectangle 2
132 170 50 600 4 4 1 20 50 2 2 5 60 25 15 170 160 1 0 0 1 0 0 0 0 1 square 2
133 180 50 350 6 5 2 25 90 3 2 5 20 30 15 175 160 0 0 0 1 0 0 1 0 0 rectangle 0
134 170 90 200 4 2 4 20 90 3 2 10 20 25 15 170 175 0 0 0 1 0 0 0 0 1 rectangle 1
135 200 40 350 6 6 1 30 80 1 1 5 60 25 20 170 175 0 0 1 1 0 0 0 1 0 rectangle 2
136 165 60 250 2 3 2 25 60 1 1 8 20 15 15 170 170 0 1 0 1 0 0 0 0 1 rectangle 0
137 175 70 250 6 6 4 15 60 2 2 11 50 30 15 175 175 0 1 0 0 0 1 0 1 0 rectangle 2
138 180 50 350 6 4 2 25 70 3 2 5 45 25 15 170 170 0 0 0 0 0 1 1 0 0 rectangle 0
139 195 60 600 6 4 2 20 50 1 1 10 35 15 15 165 175 1 0 0 1 0 0 0 1 0 round 2
140 180 60 300 8 4 4 25 80 1 1 5 60 30 15 165 170 0 0 0 1 0 0 0 0 1 rectangle 1
141 200 60 500 8 4 1 23 70 2 2 8 15 30 15 160 170 0 0 0 1 0 0 1 0 0 rectangle 0
142 170 60 550 6 4 4 30 60 2 2 6 65 20 15 175 165 0 1 0 1 0 0 0 0 1 rectangle 1
143 170 40 600 2 2 1 15 70 1 2 11 30 25 20 175 165 0 0 0 1 0 0 0 0 1 rectangle 0
144 175 70 250 6 4 3 30 60 1 2 10 60 30 20 155 175 0 1 0 1 0 0 1 0 0 rectangle 2
145 180 50 250 4 5 3 15 80 1 2 6 60 30 15 170 170 0 0 0 1 0 0 0 0 1 rectangle 2
146 165 50 350 6 4 4 25 80 1 2 12 25 15 15 155 165 1 0 0 1 0 0 0 0 1 rectangle 0
147 170 60 500 6 5 4 23 60 1 2 10 15 30 20 160 170 1 0 0 1 0 0 1 0 0 rectangle 1
148 170 50 400 6 4 3 20 60 2 3 6 35 10 15 170 175 0 0 1 1 0 0 0 0 1 rectangle 1
149 195 80 650 8 4 3 30 90 1 1 6 15 20 10 165 160 1 0 0 0 1 0 1 0 0 rectangle 2
150 165 90 500 8 3 4 20 60 2 2 5 25 30 15 165 170 0 1 0 0 0 1 0 0 1 rectangle 1
151 160 80 200 2 4 4 30 80 3 1 5 50 25 15 170 160 0 1 0 1 0 0 0 1 0 rectangle 0
152 180 50 500 2 6 1 15 60 1 1 8 65 20 15 170 170 1 0 0 0 0 1 1 0 0 rectangle 2
153 165 60 600 6 4 1 30 70 3 3 11 15 30 10 170 170 0 0 0 1 0 0 1 0 0 rectangle 0
154 180 60 600 2 3 2 30 70 1 2 6 55 15 15 150 165 1 0 0 1 0 0 0 0 1 rectangle 2
155 160 60 400 2 6 4 15 60 1 1 9 55 30 10 170 160 1 0 0 1 0 0 1 0 0 rectangle 0
156 180 60 250 4 3 2 25 80 3 1 6 25 25 20 170 160 0 1 0 0 1 0 0 1 0 square 2
157 195 50 200 6 4 3 30 70 3 2 6 35 30 15 165 170 1 0 0 0 0 1 1 0 0 rectangle 2
158 170 50 650 6 5 2 15 60 3 2 12 35 30 10 170 175 1 0 0 0 1 0 0 1 0 rectangle 0
159 160 70 400 6 3 2 20 50 1 2 9 20 30 15 155 155 0 0 1 0 0 1 1 0 0 rectangle 0
160 175 90 600 6 4 4 23 80 3 3 7 20 20 15 155 160 1 0 0 1 0 0 0 1 0 rectangle 0
161 180 50 400 4 4 1 23 70 1 2 12 20 30 20 165 170 0 1 0 1 0 0 0 0 1 rectangle 1
162 170 90 250 6 3 3 20 80 2 2 12 25 15 15 170 155 0 0 0 1 0 0 0 1 0 round 2
163 170 60 200 2 6 1 23 80 3 1 10 30 30 15 170 175 0 1 0 0 0 1 0 1 0 rectangle 2
164 175 50 650 2 5 3 25 70 3 2 11 60 25 15 175 160 0 1 0 1 0 0 0 0 1 rectangle 2
165 195 90 400 6 3 3 23 60 1 2 7 35 25 20 170 155 0 0 0 1 0 0 1 0 0 round 1
166 180 50 600 6 3 4 25 60 2 2 10 20 10 15 155 175 0 1 0 1 0 0 0 1 0 square 0
167 200 50 500 6 3 3 15 90 2 1 6 20 25 10 170 155 0 1 0 1 0 0 0 0 1 rectangle 1
168 200 60 200 6 2 3 20 60 3 3 5 20 10 15 170 170 1 0 0 1 0 0 0 1 0 rectangle 1
169 200 60 300 4 5 3 20 90 3 2 12 30 25 15 155 160 0 0 1 1 0 0 0 0 1 rectangle 0
170 180 70 250 6 4 3 30 50 1 2 12 35 25 10 155 150 0 0 0 1 0 0 1 0 0 rectangle 1
171 175 70 200 4 6 4 30 60 2 2 5 25 30 15 150 160 0 0 1 1 0 0 0 1 0 square 0
172 165 90 400 2 5 1 30 90 3 2 6 70 30 15 170 170 0 1 0 1 0 0 0 0 1 rectangle 2
173 165 70 200 6 6 4 20 70 1 1 5 65 20 20 175 155 0 0 0 1 0 0 0 1 0 round 0
174 180 50 650 2 3 3 20 70 3 2 12 40 30 15 155 170 0 0 0 1 0 0 0 0 1 rectangle 1
175 180 40 200 6 3 2 30 80 3 3 7 60 30 10 175 150 0 1 0 1 0 0 1 0 0 rectangle 2
176 180 60 400 2 5 3 20 50 1 3 5 20 30 15 175 150 0 1 0 1 0 0 0 1 0 rectangle 1
177 200 50 400 4 6 4 23 60 2 2 7 55 20 15 160 170 0 1 0 1 0 0 0 0 1 round 0
178 180 50 550 6 4 3 20 50 2 2 8 20 25 20 170 170 1 0 0 0 0 1 1 0 0 rectangle 0
179 175 70 250 8 4 1 20 50 2 3 6 60 30 15 170 170 0 0 0 0 1 0 0 0 1 square 0
180 195 70 400 6 4 4 23 60 3 1 7 65 25 15 170 150 1 0 0 1 0 0 0 1 0 rectangle 1
181 160 50 500 6 4 3 25 50 1 1 11 55 10 15 170 170 0 0 0 0 0 1 0 0 1 rectangle 1
182 180 90 500 6 3 3 23 60 2 1 8 20 30 15 170 170 0 0 0 0 0 1 0 1 0 rectangle 1
183 170 70 650 2 3 3 25 80 1 3 8 45 20 10 170 170 0 1 0 1 0 0 0 1 0 round 2
184 195 70 600 6 4 2 25 60 1 2 6 40 30 15 155 170 1 0 0 1 0 0 0 0 1 rectangle 1
185 165 70 200 6 4 1 20 60 1 2 8 45 15 15 170 150 0 1 0 1 0 0 0 0 1 round 1
186 165 80 200 4 4 3 30 60 1 1 8 25 30 10 160 170 0 1 0 1 0 0 1 0 0 round 0
187 175 60 600 4 2 3 20 60 1 2 6 25 20 15 170 155 0 0 0 1 0 0 1 0 0 rectangle 2
188 180 70 500 6 4 3 30 70 2 2 7 55 30 15 170 150 1 0 0 1 0 0 1 0 0 square 1
189 180 50 600 2 4 4 30 60 3 1 9 40 25 15 170 170 1 0 0 0 0 1 0 0 1 rectangle 0
190 160 50 600 8 3 2 20 60 3 2 12 30 30 15 165 150 0 0 0 0 1 0 1 0 0 rectangle 2
191 180 60 200 6 2 1 30 60 3 2 7 20 30 15 175 160 1 0 0 1 0 0 1 0 0 rectangle 2
192 195 70 600 6 4 3 23 80 2 2 12 50 25 10 170 170 0 0 0 0 0 1 1 0 0 rectangle 1
193 180 60 250 6 3 1 15 60 2 3 5 60 30 20 175 165 1 0 0 0 0 1 0 1 0 rectangle 1
194 170 70 250 6 4 1 20 90 2 2 10 25 20 20 175 170 0 0 0 1 0 0 0 1 0 round 1
195 180 90 250 6 3 1 25 50 1 2 9 55 30 15 170 175 1 0 0 0 0 1 1 0 0 rectangle 1
196 160 70 550 6 3 4 30 90 3 2 10 60 20 15 165 165 0 1 0 1 0 0 0 1 0 round 0
197 175 60 200 8 2 3 15 60 1 2 11 50 30 15 165 175 0 1 0 1 0 0 0 0 1 rectangle 1
198 170 80 500 6 3 2 25 50 1 1 5 60 20 15 175 150 1 0 0 1 0 0 0 1 0 square 2
199 180 50 600 4 4 4 15 80 1 1 5 50 20 15 170 170 1 0 0 1 0 0 0 1 0 rectangle 2
A very simple solution could be to run a decision tree classifier on your data and visualize the tree using the graphviz library; here's the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
You could also visualize the generated dot file in WebGraphviz. The outcome of this exercise can be the range values you are expecting; a minimal sketch follows.
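A sketch of that idea, assuming the top-5 feature names from the question and the X_train/y_train split defined there (max_depth=4 is an arbitrary choice to keep the rules readable):

from sklearn.tree import DecisionTreeClassifier, export_text, export_graphviz

top5 = ["mixingtime_min", "bakingtime_min", "sugar_g", "flour_g", "preheattemp_c"]
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train[top5], y_train)

# Every root-to-leaf path ending in class 2 ("good") is a conjunction of
# threshold conditions, i.e. exactly the AND-of-ranges form in the expected result
print(export_text(dt, feature_names=top5))

# Optional: write a dot file for Graphviz / WebGraphviz
export_graphviz(dt, out_file="cake_tree.dot", feature_names=top5,
                class_names=["bad", "neutral", "good"], filled=True)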
Related
In Pandas, given a datetime index with rows on all work days, how to determine if a row is the beginning or end of the week?
I have a set of stock information with datetime set as index. The stock market only opens on weekdays, so all my rows are weekdays, which is fine. I would like to determine if a row is the start or end of the week, which might NOT always fall on Monday/Friday due to holidays. A better idea is to determine if there is a row entry on the next/previous day in the dataframe (since my data is guaranteed to only exist for workdays), but I don't know how to calculate this. Here is an example of my data:

date       day_of_week  day_of_month  day_of_year  month_of_year
5/1/2017   0            1             121          5
5/2/2017   1            2             122          5
5/3/2017   2            3             123          5
5/4/2017   3            4             124          5
5/8/2017   0            8             128          5
5/9/2017   1            9             129          5
5/10/2017  2            10            130          5
5/11/2017  3            11            131          5
5/12/2017  4            12            132          5
5/15/2017  0            15            135          5
5/16/2017  1            16            136          5
5/17/2017  2            17            137          5
5/18/2017  3            18            138          5
5/19/2017  4            19            139          5
5/23/2017  1            23            143          5
5/24/2017  2            24            144          5
5/25/2017  3            25            145          5
5/26/2017  4            26            146          5
5/30/2017  1            30            150          5

Here is my current code:

# Date fields
def DateFields(df_input):
    dates = df_input.index.to_series()
    df_input['day_of_week'] = dates.dt.dayofweek
    df_input['day_of_month'] = dates.dt.day
    df_input['day_of_year'] = dates.dt.dayofyear
    df_input['month_of_year'] = dates.dt.month
    df_input['isWeekStart'] = "No"  # <--- Need help here
    df_input['isWeekEnd'] = "No"    # <--- Need help here
    df_input['date'] = dates.dt.strftime('%Y-%m-%d')
    return df_input

How can I calculate if a row is the beginning or end of the week? Example of what I am looking for:

date       day_of_week  day_of_month  day_of_year  month_of_year  isWeekStart  isWeekEnd
5/1/2017   0            1             121          5              1            0
5/2/2017   1            2             122          5              0            0
5/3/2017   2            3             123          5              0            0
5/4/2017   3            4             124          5              0            1  # short week, Thursday is last work day
5/8/2017   0            8             128          5              1            0
5/9/2017   1            9             129          5              0            0
5/10/2017  2            10            130          5              0            0
5/11/2017  3            11            131          5              0            0
5/12/2017  4            12            132          5              0            1
5/15/2017  0            15            135          5              1            0
5/16/2017  1            16            136          5              0            0
5/17/2017  2            17            137          5              0            0
5/18/2017  3            18            138          5              0            0
5/19/2017  4            19            139          5              0            1
5/23/2017  1            23            143          5              1            0  # short week, Tuesday is first work day
5/24/2017  2            24            144          5              0            0
5/25/2017  3            25            145          5              0            0
5/26/2017  4            26            146          5              0            1
5/30/2017  1            30            150          5              1            0

EDIT: I forgot that some holidays fall in the middle of the week; in this situation it would be good if it could treat these as a separate "week" with before and after marked accordingly. Although if it's not smart enough to figure this out, just getting the long weekend would be a good start.
Here's an idea with BusinessDay:

prev_working_day = df['date'] - pd.tseries.offsets.BusinessDay(1)
df['isFirstWeekDay'] = (df['date'].dt.isocalendar().week != prev_working_day.dt.isocalendar().week)

And similar for the last business day (sketched below). Note that the default holiday calendar is the US one. Check out this post for a different one.

Output:

         date  day_of_week  day_of_month  day_of_year  month_of_year  isFirstWeekDay
0  2017-05-01            0             1          121              5            True
1  2017-05-02            1             2          122              5           False
2  2017-05-03            2             3          123              5           False
3  2017-05-04            3             4          124              5           False
4  2017-05-08            0             8          128              5            True
5  2017-05-09            1             9          129              5           False
6  2017-05-10            2            10          130              5           False
7  2017-05-11            3            11          131              5           False
8  2017-05-12            4            12          132              5           False
9  2017-05-15            0            15          135              5            True
10 2017-05-16            1            16          136              5           False
11 2017-05-17            2            17          137              5           False
12 2017-05-18            3            18          138              5           False
13 2017-05-19            4            19          139              5           False
14 2017-05-23            1            23          143              5           False
15 2017-05-24            2            24          144              5           False
16 2017-05-25            3            25          145              5           False
17 2017-05-26            4            26          146              5           False
18 2017-05-30            1            30          150              5           False
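A sketch of the symmetric check for the last business day of the week (same holiday-calendar caveat applies; assumes df['date'] is already a datetime column):

next_working_day = df['date'] + pd.tseries.offsets.BusinessDay(1)
df['isLastWeekDay'] = (df['date'].dt.isocalendar().week != next_working_day.dt.isocalendar().week)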
Here's an approach using a weekly groupby:

df['date'] = pd.to_datetime(df['date'])
business_days = (df.assign(date_copy=df['date'])
                   .groupby(pd.Grouper(key='date_copy', freq='W'))['date']
                   .apply(list).to_frame())
business_days['isWeekStart'] = business_days['date'].apply(lambda x: [1 if i == min(x) else 0 for i in x])
business_days['isWeekEnd'] = business_days['date'].apply(lambda x: [1 if i == max(x) else 0 for i in x])
business_days = business_days.apply(pd.Series.explode)
pd.merge(df, business_days, left_on='date', right_on='date')

Output:

         date  day_of_week  day_of_month  day_of_year  month_of_year isWeekStart isWeekEnd
0  2017-05-01            0             1          121              5           1         0
1  2017-05-02            1             2          122              5           0         0
2  2017-05-03            2             3          123              5           0         0
3  2017-05-04            3             4          124              5           0         1
4  2017-05-08            0             8          128              5           1         0
5  2017-05-09            1             9          129              5           0         0
6  2017-05-10            2            10          130              5           0         0
7  2017-05-11            3            11          131              5           0         0
8  2017-05-12            4            12          132              5           0         1
9  2017-05-15            0            15          135              5           1         0
10 2017-05-16            1            16          136              5           0         0
11 2017-05-17            2            17          137              5           0         0
12 2017-05-18            3            18          138              5           0         0
13 2017-05-19            4            19          139              5           0         1
14 2017-05-23            1            23          143              5           1         0
15 2017-05-24            2            24          144              5           0         0
16 2017-05-25            3            25          145              5           0         0
17 2017-05-26            4            26          146              5           0         1
18 2017-05-30            1            30          150              5           1         1

Note that 2017-05-30 is marked as both WeekStart and WeekEnd because it is the only date of that week.
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
def bar1():
    df=pd.read_csv('#CSVFILELOCATION#',encoding= 'unicode_escape')
    x=np.arange(11)
    df=df.set_index(['Country'])
    dfl=df.iloc[:,[4,9]]
    w=dfl.groupby('Country')['SummerTotal' , 'WinterTotal'].sum()
    final_df=w.sort_values(by='Country').tail(11)
    final_df.reset_index(inplace=True)
    final_df.columns=('Country','SummerTotal','WinterTotal')
    final_df=final_df.drop(11,axis='index')
    Countries=df['Country']
    STotalMed=df['SummerTotal']
    WTotalMed=df['WinterTotal']
    plt.bar(x-0.25,STotalMed,label='Total Medals by Countries in Summer',color='g')
    plt.bar(x+0.25,WTotalMed,label='Total Medals by Countries in Winter',color='r')
    plt.xticks(r,Countries,rotation=30)
    plt.title('Olympics Data Analysis of Top 10 Countries',color='red',fontsize=10)
    plt.xlabel('Countries')
    plt.ylabel('Total Medals')
    plt.grid()
    plt.legend()
    plt.show()

This is the code for a bar graph I am using in a project, and it raises this error:

ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements

Please help; I want to submit this project soon.

CSV:

Country SummerTimesPart Sumgoldmedal Sumsilvermedal Sumbronzemedal SummerTotal WinterTimesPart Wingoldmedal Winsilvermedal Winbronzemedal WinterTotal TotalTimesPart Tgoldmedal Tsilvermedal Tbronzemedal TotalMedal
Afghanistan 14 0 0 2 2 0 0 0 0 0 14 0 0 2 2
Algeria 13 5 4 8 17 3 0 0 0 0 16 5 4 8 17
Argentina 24 21 25 28 74 19 0 0 0 0 43 21 25 28 74
Armenia 6 2 6 6 14 7 0 0 0 0 13 2 6 6 14
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
Australia 26 147 163 187 497 19 5 5 5 15 45 152 168 192 512
Austria 27 18 33 36 87 23 64 81 87 232 50 82 114 123 319
Azerbaijan 6 7 11 24 42 6 0 0 0 0 12 7 11 24 42
Bahamas 16 6 2 6 14 0 0 0 0 0 16 6 2 6 14
Bahrain 9 2 1 0 3 0 0 0 0 0 9 2 1 0 3
Barbados 12 0 0 1 1 0 0 0 0 0 12 0 0 1 1
Belarus 6 12 27 39 78 7 8 5 5 18 13 20 32 44 96
Belgium 26 40 53 55 148 21 1 2 3 6 47 41 55 58 154
Bermuda 18 0 0 1 1 8 0 0 0 0 26 0 0 1 1
Bohemia 3 0 1 3 4 0 0 0 0 0 3 0 1 3 4
Botswana 10 0 1 0 1 0 0 0 0 0 10 0 1 0 1
Brazil 22 30 36 63 129 8 0 0 0 0 30 30 36 63 129
British West Indies 1 0 0 2 2 0 0 0 0 0 1 0 0 2 2
Bulgaria 20 51 87 80 218 20 1 2 3 6 40 52 89 83 224
Burundi 6 1 1 0 2 0 0 0 0 0 6 1 1 0 2
Cameroon 14 3 1 2 6 1 0 0 0 0 15 3 1 2 6

INFO:
SummerTimesPart: No. of times participated in summer by each country
WinterTimesPart: No. of times participated in winter by each country
A few changes were needed to get the chart working:

- A tick array is required to plot the country names
- Use final_df for the chart data, not df
- Set the bar width so the bars don't overlap

Here is the updated code:

data = '''
Country SummerTimesPart Sumgoldmedal Sumsilvermedal Sumbronzemedal SummerTotal WinterTimesPart Wingoldmedal Winsilvermedal Winbronzemedal WinterTotal TotalTimesPart Tgoldmedal Tsilvermedal Tbronzemedal TotalMedal
Afghanistan 14 0 0 2 2 0 0 0 0 0 14 0 0 2 2
Algeria 13 5 4 8 17 3 0 0 0 0 16 5 4 8 17
Argentina 24 21 25 28 74 19 0 0 0 0 43 21 25 28 74
Armenia 6 2 6 6 14 7 0 0 0 0 13 2 6 6 14
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12
Australia 26 147 163 187 497 19 5 5 5 15 45 152 168 192 512
Austria 27 18 33 36 87 23 64 81 87 232 50 82 114 123 319
Azerbaijan 6 7 11 24 42 6 0 0 0 0 12 7 11 24 42
Bahamas 16 6 2 6 14 0 0 0 0 0 16 6 2 6 14
Bahrain 9 2 1 0 3 0 0 0 0 0 9 2 1 0 3
Barbados 12 0 0 1 1 0 0 0 0 0 12 0 0 1 1
Belarus 6 12 27 39 78 7 8 5 5 18 13 20 32 44 96
Belgium 26 40 53 55 148 21 1 2 3 6 47 41 55 58 154
Bermuda 18 0 0 1 1 8 0 0 0 0 26 0 0 1 1
Bohemia 3 0 1 3 4 0 0 0 0 0 3 0 1 3 4
Botswana 10 0 1 0 1 0 0 0 0 0 10 0 1 0 1
Brazil 22 30 36 63 129 8 0 0 0 0 30 30 36 63 129
BritishWestIndies 1 0 0 2 2 0 0 0 0 0 1 0 0 2 2
Bulgaria 20 51 87 80 218 20 1 2 3 6 40 52 89 83 224
Burundi 6 1 1 0 2 0 0 0 0 0 6 1 1 0 2
Cameroon 14 3 1 2 6 1 0 0 0 0 15 3 1 2 6
'''.strip()

with open('data.csv', 'w') as f:
    f.write(data)  # write test file

############################

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def bar1():
    df=pd.read_csv('data.csv', encoding= 'unicode_escape', sep=' ', index_col=False)
    x=np.arange(11)
    df=df.set_index(['Country'])
    dfl=df.iloc[:,[4,9]]
    w=dfl.groupby('Country')['SummerTotal' , 'WinterTotal'].sum()
    final_df=w.sort_values(by='Country').tail(11)
    final_df.reset_index(inplace=True)
    final_df.columns=('Country','SummerTotal','WinterTotal')
    print(final_df)
    # final_df=final_df.drop(11,axis='index')
    Countries=final_df['Country']
    STotalMed=final_df['SummerTotal']
    WTotalMed=final_df['WinterTotal']
    plt.bar(x-0.25,STotalMed,width=.2, label='Total Medals by Countries in Summer',color='g')
    plt.bar(x+0.25,WTotalMed,width=.2, label='Total Medals by Countries in Winter',color='r')
    plt.xticks(np.arange(11),Countries,rotation=30)
    plt.title('Olympics Data Analysis of Top 10 Countries',color='red',fontsize=10)
    plt.xlabel('Countries')
    plt.ylabel('Total Medals')
    plt.grid()
    plt.legend()
    plt.show()

bar1()

Output: a grouped bar chart of the summer and winter medal totals per country.
How to fill in values of a dataframe column if the difference between values in another column is sufficiently small?
I have a dataframe df1:

   Time  Delta_time
0     0         NaN
1    15          15
2    18           3
3    30          12
4    45          15
5    64          19
6    80          16
7    82           2
8   100          18
9   120          20

where Delta_time is the difference between adjacent values in the Time column. I have another dataframe df2 that has time values numbering from 0 to 120 (121 rows) and another column called 'Short_gap'. How do I set the value of Short_gap to 1 for all Time values that lie in a Delta_time value smaller than 5? For example, the Short_gap column should have a value of 1 for Time = 15, 16, 17, 18, since Delta_time = 3 < 5.

Edit: Currently, df2 looks like this:

     Time  Short_gap
0       0          0
1       1          0
2       2          0
3       3          0
...   ...        ...
118   118          0
119   119          0
120   120          0

The expected output for df2 is:

     Time  Short_gap
0       0          0
1       1          0
2       2          0
...   ...        ...
13     13          0
14     14          0
15     15          1
16     16          1
17     17          1
18     18          1
19     19          0
20     20          0
...   ...        ...
78     78          0
79     79          0
80     80          1
81     81          1
82     82          1
83     83          0
84     84          0
...   ...        ...
119   119          0
120   120          0
Try:

t = df['Delta_time'].shift(-1)
df2 = ((t < 5).repeat(t.fillna(1)).astype(int).reset_index(drop=True)
       .to_frame(name='Short_gap').rename_axis('Time').reset_index())

print(df2.head(20))
print('...')
print(df2.loc[78:84])

Output:

    Time  Short_gap
0      0          0
1      1          0
2      2          0
3      3          0
4      4          0
5      5          0
6      6          0
7      7          0
8      8          0
9      9          0
10    10          0
11    11          0
12    12          0
13    13          0
14    14          0
15    15          1
16    16          1
17    17          1
18    18          0
19    19          0
...
    Time  Short_gap
78    78          0
79    79          0
80    80          1
81    81          1
82    82          0
83    83          0
84    84          0
get data from xml using pandas
I'm trying to get some data from XML using pandas. Currently I have "working" code, and by working I mean it almost works:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "http://degra.wi.pb.edu.pl/rozklady/webservices.php?"
response = requests.get(url).content
soup = BeautifulSoup(response)
tables = soup.find_all('tabela_rozklad')
tags = ['dzien', 'godz', 'ilosc', 'tyg', 'id_naucz', 'id_sala',
        'id_prz', 'rodz', 'grupa', 'id_st', 'sem', 'id_spec']

df = pd.DataFrame()
for table in tables:
    all = map(lambda x: table.find(x).text, tags)
    df = df.append([all])
df.columns = tags

a = df[(df.sem == "1")]
a = a[(a.id_spec == "0")]
a = a[(a.dzien == "1")]
print(a)

I'm getting an error on a = df[(df.sem == "1")], which is:

File "pandas\index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas\index.c:4443)
File "pandas\index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas\index.c:4289)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13733)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13687)

As I read other Stack questions I saw people suggest using df.loc, so I modified this line into a = df.loc[(df.sem == "1")]. Now the code runs, but the results show as if this line doesn't exist. I need to mention that the problem is with the "sem" tag only; the rest works perfectly, but unfortunately I need to use exactly this tag. If anyone could explain what is causing this error and how to fix it, I would be grateful.
You can add ignore_index=True to append to avoid a duplicated index, and you then need to select column sem with [], because sem is a DataFrame method, so df.sem resolves to the function rather than your column:

df = pd.DataFrame()
for table in tables:
    all = map(lambda x: table.find(x).text, tags)
    df = df.append([all], ignore_index=True)
df.columns = tags
#print (df)

a = df[(df['sem'] == '1') & (df.id_spec == "0") & (df.dzien == "1")]
print(a)

    dzien godz ilosc tyg id_naucz id_sala id_prz rodz grupa id_st sem id_spec
0       1    1     2   0       52      79     13    W     1    13   1       0
1       1    3     2   0       12      79     32    W     1    13   1       0
2       1    5     2   0       52      65     13   Ćw     1    13   1       0
3       1   11     2   0      201       3     70   Ćw    10    13   1       0
4       1    5     2   0       36      78     13   Ps     5    13   1       0
5       1    5     2   1       18      32    450   Ps     3    13   1       0
6       1    5     2   2       18      32    450   Ps     4    13   1       0
7       1    7     2   1       18      32    450   Ps     7    13   1       0
8       1    7     2   2       18      32    450   Ps     8    13   1       0
9       1    7     2   0       66      65    104   Ćw     1    13   1       0
10      1    7     2   0      283       3    104   Ćw     5    13   1       0
11      1    7     2   0      346       5    104   Ćw     8    13   1       0
12      1    7     2   0      184      29     13   Ćw     7    13   1       0
13      1    9     2   0       66      65    104   Ćw     2    13   1       0
14      1    9     2   0      346       5     70   Ćw     8    13   1       0
15      1    9     1   0       73       3    203   Ćw     9    13   1       0
16      1   10     1   0       73       3    203   Ćw    10    13   1       0
17      1    9     2   0      184      19     13   Ps    13    13   1       0
18      1   11     2   0      184      19     13   Ps    14    13   1       0
19      1   11     2   0       44      65     13   Ćw     9    13   1       0
87      1    9     2   0      201      54    463    W     1    17   1       0
88      1    3     2   0       36      29     13   Ćw     2    17   1       0
89      1    3     2   0      211       5     70   Ćw     1    17   1       0
90      1    5     2   0      211       5     70   Ćw     2    17   1       0
91      1    7     2   0       36      78     13   Ps     4    17   1       0
105     1    1     2   1       11      16     32   Ps     2    18   1       0
106     1    1     2   2       11      16     32   Ps     3    18   1       0
107     1    3     2   0       51       3    457    W     1    18   1       0
110     1    5     2   2       11      16     32   Ps     1    18   1       0
111     1    7     2   0       91      64     97   Ćw     2    18   1       0
112     1    5     2   0      283       3    457   Ćw     2    18   1       0
254     1    5     1   0       12      29     32   Ćw     6    13   1       0
255     1    6     1   0       12      29     32   Ćw     5    13   1       0
462     1    7     2   0       98       1    486    W     1    19   1       0
463     1    9     1   0       91       1    484    W     1    19   1       0
487     1    5     2   0      116      19     13   Ps     1    17   1       0
488     1    7     2   0      116      19     13   Ps     2    17   1       0
498     1    5     2   0        0       0    431   Ps     2    17   1       0
502     1    5     2   0        0       0    431   Ps    15    13   1       0
503     1    5     2   0        0       0    431   Ps    16    13   1       0
504     1    5     2   0        0       0    431   Ps    19    13   1       0
505     1    5     2   0        0       0    431   Ps    20    13   1       0
531     1   13     2   0      350      79    493    W     1    13   1       0
532     1   13     2   0      350      79    493    W     2    17   1       0
533     1   13     2   0      350      79    493    W     1    18   1       0
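For context on why df.sem fails, a tiny demonstration on a toy frame: sem is a built-in DataFrame method (standard error of the mean), and pandas attribute access resolves to the method before any same-named column.

import pandas as pd

df = pd.DataFrame({'sem': ['1', '2']})
print(df.sem)     # <bound method ... sem ...> -- the method shadows the column
print(df['sem'])  # the actual column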
Delete lines that contain decimal numbers
I am trying to delete lines that contain decimal numbers. For instance:

82.45 76.16 21.49 -2.775
5 24 13 6 9 0 3 2 4 9 7 11 54 11 1 1 18 5 0 0 1 1 0 2 2 0 0 0 0 0 0 0 14 90 21 5 24 26 73 13 20 33 23 59 158 85 17 6 158 66 15 13 13 10 2 37 81 0 0 0 1 3 0 19 8 158 75 7 10 8 5 1 23 58 148 77 120 78 6 7 158 80 15 10 16 21 6 37 100 25 0 0 0 0 0 3 1 10 9 1 0 0 0 0 11 16 57 15 0 0 0 0 158 76 9 1 0 0 0 0 22 17 0 0 0 0 0 0
50.04 143.84 18.52 -1.792
3 0 0 0 0 0 0 0 36 0 0 0 2 4 0 1 23 2 0 0 8 24 4 12 21 9 5 2 0 0 0 4 40 0 0 0 0 0 0 12 150 11 2 7 12 16 4 59 72 8 30 88 68 83 15 27 21 11 49 94 6 1 1 8 17 8 0 0 0 0 0 5 150 150 33 46 9 0 0 20 28 49 81 150 76 5 8 17 36 23 41 48 7 1 16 88 0 3 0 0 0 0 0 0 36 108 13 9 2 0 3 61 19 26 14 34 27 8 98 150 14 2 0 1 1 0 115 150
114.27 171.37 10.74 -2.245
...

and this pattern continues for thousands of lines; likewise, I have about 3000 files with a similar pattern of data. So I want to delete the lines that have these decimal numbers. In most cases, every 8th line has decimal numbers, and hence I tried using awk 'NR % 8 != 0' < file_name. But the problem is that not all files in the database have their every 8th line as decimal numbers. So, is there a way I can delete the lines that have decimal numbers? I am coding in Python 2.7 on Ubuntu.
You can just look for lines containing a decimal point:

with open('filename_without_decimals.txt','wb') as of:
    with open('filename.txt') as fp:
        for line in fp:
            if line.find(".") == -1:  # find returns -1 on a miss; index would raise ValueError
                of.write(line)

If you prefer to use sed, it would be cleaner:

sed -i '/\./d' file.txt
The solution would be something like:

file = open('textfile.txt')
text = ""
for line in file.readlines():
    if '.' not in line:
        text += line
print text
Have you tried this, using awk:

awk '!/\./{print}' your_file
deci = open('with.txt')
no_deci = open('without.txt', 'w')

for line in deci.readlines():
    if '.' not in line:
        no_deci.write(line)

deci.close()
no_deci.close()

readlines returns a list of all the lines in the file.