I want to create a Quantile column, calculated for each date from the unique Sales values. A Category always corresponds to the same Sales number on a particular date.
I have a dataframe which is indexed by date. There are many dates, and each date appears on multiple rows. Example of a subset of df for one day:
Category Sales Ratio 1 Ratio 2
11/19/2016 Bar 300 0.46 0.96
11/19/2016 Bar 300 0.56 0.78
11/19/2016 Bar 300 0.43 0.96
11/19/2016 Bar 300 0.47 0.94
11/19/2016 Casino 550 0.92 0.12
11/19/2016 Casino 550 0.43 0.74
11/19/2016 Casino 550 0.98 0.65
11/19/2016 Casino 550 0.76 0.67
11/19/2016 Casino 550 0.79 0.80
11/19/2016 Casino 550 0.90 0.91
11/19/2016 Casino 550 0.89 0.31
11/19/2016 Café 700 0.69 0.99
11/19/2016 Café 700 0.07 0.18
11/19/2016 Café 700 0.75 0.59
11/19/2016 Café 700 0.07 0.64
11/19/2016 Café 700 0.14 0.42
11/19/2016 Café 700 0.30 0.67
11/19/2016 Pub 250 0.64 0.09
11/19/2016 Pub 250 0.93 0.37
11/19/2016 Pub 250 0.69 0.42
I want code that adds a new column called Quantile, holding for each date the 0.5 quantile of the unique Sales values. The key thing to note is that Sales is always the same for a category on a particular date (things change as dates change).
Example of a solution: df['Quantile'] = df.Sales.groupby(df.index).transform(lambda x: x.quantile(q=0.5, axis=0, interpolation='midpoint'))
However, this would not suffice (even if it worked). For this example (for this one date), all values in the new column df['Quantile'] would be the same.
For this date the calculation would use 300, 550, 700 and 250 for the quantile.
Therefore the final df would look like this:
Category Sales Ratio 1 Ratio 2 Quantile
11/19/2016 Bar 300 0.46 0.96 425
11/19/2016 Bar 300 0.56 0.78 425
11/19/2016 Bar 300 0.43 0.96 425
11/19/2016 Bar 300 0.47 0.94 425
11/19/2016 Casino 550 0.92 0.12 425
11/19/2016 Casino 550 0.43 0.74 425
11/19/2016 Casino 550 0.98 0.65 425
11/19/2016 Casino 550 0.76 0.67 425
11/19/2016 Casino 550 0.79 0.80 425
11/19/2016 Casino 550 0.90 0.91 425
11/19/2016 Casino 550 0.89 0.31 425
11/19/2016 Café 700 0.69 0.99 425
11/19/2016 Café 700 0.07 0.18 425
11/19/2016 Café 700 0.75 0.59 425
11/19/2016 Café 700 0.07 0.64 425
11/19/2016 Café 700 0.14 0.42 425
11/19/2016 Café 700 0.30 0.67 425
11/19/2016 Pub 250 0.64 0.09 425
11/19/2016 Pub 250 0.93 0.37 425
11/19/2016 Pub 250 0.69 0.42 425
If I were to take the quantile of all Sales for a particular date, without restricting it to one element per category, I would get something like 550 (which I do not want).
The key thing is that I would like the code to be simple and reasonably fast (as the data is quite big). Also, the interpolation has to be midpoint.
It seems you need drop_duplicates:
df['Quantile'] = (df.Sales.groupby(df.index)
                    .transform(lambda x: x.drop_duplicates().quantile(0.5, interpolation='midpoint')))
print (df)
Category Sales Ratio 1 Ratio 2 Quantile
11/19/2016 Bar 300 0.46 0.96 425
11/19/2016 Bar 300 0.56 0.78 425
11/19/2016 Bar 300 0.43 0.96 425
11/19/2016 Bar 300 0.47 0.94 425
11/19/2016 Casino 550 0.92 0.12 425
11/19/2016 Casino 550 0.43 0.74 425
11/19/2016 Casino 550 0.98 0.65 425
11/19/2016 Casino 550 0.76 0.67 425
11/19/2016 Casino 550 0.79 0.80 425
11/19/2016 Casino 550 0.90 0.91 425
11/19/2016 Casino 550 0.89 0.31 425
11/19/2016 Cafe 700 0.69 0.99 425
11/19/2016 Cafe 700 0.07 0.18 425
11/19/2016 Cafe 700 0.75 0.59 425
11/19/2016 Cafe 700 0.07 0.64 425
11/19/2016 Cafe 700 0.14 0.42 425
11/19/2016 Cafe 700 0.30 0.67 425
11/19/2016 Pub 250 0.64 0.09 425
11/19/2016 Pub 250 0.93 0.37 425
11/19/2016 Pub 250 0.69 0.42 425
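Alternatively, use np.percentile on the unique values (this uses the default linear interpolation, which coincides with midpoint for the 0.5 quantile):
import numpy as np
df['Quantile'] = (df.Sales.groupby(df.index)
                    .transform(lambda x: np.percentile(x.unique(), 50)))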
print (df)
Category Sales Ratio 1 Ratio 2 Quantile
11/19/2016 Bar 300 0.46 0.96 425
11/19/2016 Bar 300 0.56 0.78 425
11/19/2016 Bar 300 0.43 0.96 425
11/19/2016 Bar 300 0.47 0.94 425
11/19/2016 Casino 550 0.92 0.12 425
11/19/2016 Casino 550 0.43 0.74 425
11/19/2016 Casino 550 0.98 0.65 425
11/19/2016 Casino 550 0.76 0.67 425
11/19/2016 Casino 550 0.79 0.80 425
11/19/2016 Casino 550 0.90 0.91 425
11/19/2016 Casino 550 0.89 0.31 425
11/19/2016 Cafe 700 0.69 0.99 425
11/19/2016 Cafe 700 0.07 0.18 425
11/19/2016 Cafe 700 0.75 0.59 425
11/19/2016 Cafe 700 0.07 0.64 425
11/19/2016 Cafe 700 0.14 0.42 425
11/19/2016 Cafe 700 0.30 0.67 425
11/19/2016 Pub 250 0.64 0.09 425
11/19/2016 Pub 250 0.93 0.37 425
11/19/2016 Pub 250 0.69 0.42 425
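For reference, here is a minimal self-contained sketch of the drop_duplicates approach. The small frame below is made up for illustration (two dates, a handful of rows) and is not the asker's data. It passes interpolation='midpoint' explicitly, as the question requires, and adds a NaiveQuantile column only to show what happens when duplicates are left in.

```python
import pandas as pd

# Illustrative frame: Sales repeats within each category on a given date.
df = pd.DataFrame(
    {
        "Category": ["Bar", "Bar", "Casino", "Casino", "Casino", "Cafe", "Pub",
                     "Bar", "Pub", "Pub"],
        "Sales": [300, 300, 550, 550, 550, 700, 250, 100, 400, 400],
    },
    index=pd.to_datetime(["2016-11-19"] * 7 + ["2016-11-20"] * 3),
)

# Per date: 0.5 quantile of the unique Sales values, midpoint interpolation.
df["Quantile"] = df["Sales"].groupby(df.index).transform(
    lambda s: s.drop_duplicates().quantile(0.5, interpolation="midpoint")
)

# For comparison only: the same quantile with duplicates left in.
df["NaiveQuantile"] = df["Sales"].groupby(df.index).transform(
    lambda s: s.quantile(0.5, interpolation="midpoint")
)

print(df)
```

On the first date the deduplicated values are 250, 300, 550 and 700, so Quantile is 425, while the naive version is pulled to 550 by the repeated Casino rows.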
My class names are lengthy. So, when I save the classification report from sklearn in a txt file, the format looks messy. For example:
precision recall f1-score support
カップ 0.96 0.94 0.95 69
セット 0.70 0.61 0.65 23
パウチ 0.96 0.92 0.94 53
ビニール容器 0.53 0.47 0.50 19
ビン 0.91 0.90 0.90 305
プラ容器(ヤクルト型) 0.69 0.80 0.74 25
プラ容器(ヨーグルト型) 0.94 0.53 0.68 32
ペットボトル 0.98 0.98 0.98 1189
ペットボトル(ミニ) 0.71 0.74 0.72 23
ボトル缶 0.93 0.89 0.91 96
ポーション 0.80 0.52 0.63 23
箱(飲料) 0.76 0.77 0.76 134
紙パック(Pキャン) 0.86 0.69 0.76 35
紙パック(キャップ付き) 0.93 0.80 0.86 126
紙パック(ゲーブルトップ) 0.85 0.93 0.88 54
紙パック(ブリックパック) 0.84 0.90 0.87 277
紙パック(円柱型) 0.90 0.56 0.69 16
缶 0.89 0.96 0.92 429
accuracy 0.91 2928
macro avg 0.84 0.77 0.80 2928
weighted avg 0.91 0.91 0.91 2928
Is there any way to adjust the spacing between the metrics so that the alignment stays the same for all rows and columns?
I have looked into the sklearn.metrics.classification_report parameters, but I have not found anything to adjust spacing.
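One possible approach, sketched under assumptions: the report appears to pad labels by character count, while full-width Japanese characters usually take two columns in a monospaced font, so the columns drift. classification_report does have an output_dict=True parameter, so the rows can be formatted by hand with padding that accounts for display width. The helper names below (display_width, pad, format_report) are hypothetical, and the sample dict is hand-written to mimic the structure of the dict output rather than produced by sklearn.

```python
import unicodedata


def display_width(text):
    # Count East Asian wide/full-width characters as two columns, others as one.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)


def pad(text, width):
    # Left-justify by display width rather than by character count.
    return text + " " * max(width - display_width(text), 0)


def format_report(report):
    # Keep only per-class and average rows; the scalar 'accuracy' entry is skipped.
    rows = {k: v for k, v in report.items() if isinstance(v, dict)}
    label_width = max(display_width(k) for k in rows) + 2
    metrics = ("precision", "recall", "f1-score", "support")
    lines = [pad("", label_width) + "".join(f"{m:>10}" for m in metrics)]
    for label, m in rows.items():
        lines.append(
            pad(label, label_width)
            + f"{m['precision']:>10.2f}{m['recall']:>10.2f}"
            + f"{m['f1-score']:>10.2f}{m['support']:>10.0f}"
        )
    return "\n".join(lines)


# Hand-written stand-in for classification_report(..., output_dict=True).
sample = {
    "缶": {"precision": 0.89, "recall": 0.96, "f1-score": 0.92, "support": 429},
    "紙パック(円柱型)": {"precision": 0.90, "recall": 0.56, "f1-score": 0.69, "support": 16},
    "accuracy": 0.91,
    "macro avg": {"precision": 0.84, "recall": 0.77, "f1-score": 0.80, "support": 2928},
}
print(format_report(sample))
```

Whether the result lines up still depends on the viewer using a monospaced font in which wide characters occupy exactly two columns.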
How do I fix this code? Do I need to make features_train and features_test DataFrames?
Does anyone have an idea how to fix this code? I really can't understand the problem.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.metrics import r2_score
admissions_data = pd.read_csv('admissions_data.csv')
labels = admissions_data.iloc[:, -1]
features = admissions_data.iloc[:, 1:8]
features_train, labels_train, features_test, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
sc = StandardScaler()
features_train_scaled = sc.fit_transform(features_train)
features_test_scale = sc.transform(features_test)
features_train_scaled = pd.DataFrame(features_train_scaled)
features_test_scale = pd.DataFrame(features_test_scale)
The error is:
Traceback (most recent call last):
File "script.py", line 26, in <module>
features_test_scale = sc.transform(features_test)
File "/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_data.py", line 794, in transform
force_all_finite='allow-nan')
File "/usr/local/lib/python3.6/dist-packages/sklearn/base.py", line 420, in _validate_data
X = check_array(X, **check_params)
File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 624, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0.57 0.78 0.59 0.64 0.47 0.63 0.65 0.89 0.84 0.73 0.75 0.64 0.46 0.78
0.62 0.53 0.85 0.67 0.84 0.94 0.64 0.53 0.47 0.86 0.62 0.7 0.77 0.61
0.61 0.63 0.86 0.82 0.65 0.58 0.7 0.7 0.84 0.72 0.71 0.77 0.69 0.8
0.52 0.62 0.79 0.71 0.9 0.84 0.6 0.86 0.67 0.61 0.71 0.52 0.62 0.37
0.73 0.64 0.71 0.8 0.88 0.78 0.45 0.62 0.62 0.86 0.74 0.94 0.58 0.7
0.92 0.64 0.65 0.83 0.34 0.66 0.67 0.7 0.71 0.54 0.68 0.61 0.68 0.79
0.57 0.94 0.59 0.79 0.73 0.91 0.86 0.95 0.9 0.92 0.68 0.84 0.69 0.72
0.94 0.53 0.45 0.77 0.77 0.91 0.61 0.78 0.77 0.82 0.9 0.92 0.54 0.92
0.72 0.5 0.68 0.78 0.72 0.53 0.79 0.49 0.68 0.72 0.73 0.93 0.72 0.52
0.54 0.86 0.65 0.93 0.89 0.72 0.34 0.64 0.96 0.79 0.73 0.49 0.73 0.94
0.7 0.95 0.65 0.86 0.78 0.75 0.89 0.94 0.91 0.87 0.93 0.81 0.94 0.89
0.57 0.77 0.39 0.46 0.78 0.64 0.76 0.58 0.56 0.53 0.79 0.9 0.92 0.96
0.67 0.65 0.64 0.58 0.94 0.76 0.78 0.88 0.84 0.68 0.66 0.42 0.56 0.66
0.46 0.65 0.58 0.72 0.48 0.68 0.89 0.95 0.46 0.71 0.79 0.52 0.57 0.76
0.52 0.8 0.77 0.91 0.75 0.49 0.72 0.72 0.61 0.97 0.8 0.85 0.73 0.64
0.87 0.63 0.97 0.72 0.82 0.54 0.71 0.45 0.8 0.49 0.77 0.93 0.89 0.93
0.81 0.62 0.81 0.66 0.78 0.76 0.48 0.61 0.82 0.68 0.7 0.68 0.62 0.81
0.87 0.94 0.38 0.67 0.64 0.84 0.62 0.7 0.62 0.5 0.79 0.78 0.36 0.77
0.57 0.87 0.74 0.71 0.61 0.57 0.64 0.73 0.81 0.74 0.8 0.69 0.66 0.64
0.93 0.64 0.59 0.71 0.82 0.69 0.69 0.89 0.93 0.74 0.64 0.84 0.91 0.97
0.55 0.74 0.72 0.71 0.93 0.96 0.8 0.8 0.81 0.88 0.64 0.38 0.87 0.73
0.78 0.89 0.56 0.61 0.76 0.46 0.78 0.71 0.81 0.59 0.47 0.7 0.42 0.76
0.8 0.67 0.94 0.65 0.51 0.73 0.9 0.8 0.65 0.7 0.96 0.96 0.73 0.79
0.86 0.89 0.85 0.76 0.76 0.71 0.83 0.76 0.42 0.9 0.58 0.66 0.86 0.71
0.8 0.51 0.65 0.58 0.76 0.8 0.7 0.61 0.71 0.69 0.95 0.72 0.79 0.97
0.74 0.96 0.47 0.56 0.73 0.94 0.76 0.79 0.71 0.58 0.94 0.66 0.75 0.76
0.84 0.59 0.68 0.75 0.76 0.72 0.87 0.78 0.67 0.79 0.91 0.57 0.77 0.69
0.73 0.43 0.93 0.68 0.82 0.67 0.74 0.82 0.85 0.62 0.54 0.71 0.92 0.85
0.79 0.63 0.59 0.73 0.66 0.74 0.9 0.81].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You have made a mistake when unpacking the split. train_test_split() returns features_train, features_test, labels_train, labels_test, in that order.
With your ordering, the name features_test receives the training labels, which are 1D, and since transform expects a 2D array it raises the error.
So, change your code like this:
#features_train, labels_train, features_test, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
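To see the corrected ordering end to end, here is a small sketch that swaps the CSV for synthetic data (the 50 rows and 7 feature columns are assumptions made only so the snippet runs without admissions_data.csv).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for admissions_data.csv: 7 feature columns and 1 label.
rng = np.random.default_rng(13)
features = pd.DataFrame(rng.normal(size=(50, 7)))
labels = pd.Series(rng.uniform(size=50))

# Order matters: both feature splits come first, then both label splits.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=13
)

sc = StandardScaler()
features_train_scaled = sc.fit_transform(features_train)  # fit on training rows only
features_test_scaled = sc.transform(features_test)        # reuse training statistics

print(features_train_scaled.shape, features_test_scaled.shape)
```

Both scaled arrays are 2D here, so no reshape and no conversion to a DataFrame is needed for the scaler to accept them.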
I'm having trouble converting my data from wide format to long format using pd.wide_to_long() method. The error reads IndexError: Too many levels: Index has only 1 level, not 2
My code:
import pandas as pd
df = pd.read_csv('data/data.csv', index_col=False)
print(df)
df.reset_index(inplace=True,drop=True)
df['ID'] = df.index
pd.wide_to_long(df, ['OT_', 'NT_'], i='ID', j=['MISS', 'HIT', 'CR', 'FA']).reset_index().rename(columns={'OT_': 'OT', 'NT_': 'NT'})
CSV (it's just junk data):
PID,OT_MISS,OT_HIT,OT_CR,OT_FA,NT_MISS,NT_HIT,NT_CR,NT_FA
111,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
121,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
212,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
321,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
423,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
534,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
621,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
721,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
812,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
922,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
In pandas, you can also use the melt() function to transform the data from wide to long format. Note that melt takes id_vars rather than i and j, and it produces a single value column:
df2 = pd.melt(df, id_vars=['PID'], var_name='variable', value_name='value')
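The issue is that pandas.wide_to_long didn't recognize the suffixes: the stub names should be 'OT' and 'NT' with sep='_', j must be a single string rather than a list, and i can be the existing 'PID' column instead of a new 'ID'.
r'\w+' is a regular expression matching one or more word characters; it is needed because the default suffix pattern only matches digits.
import pandas as pd
df2 = pd.wide_to_long(df, ['OT', 'NT'], i='PID', j='stubs', sep='_', suffix=r'\w+')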
print(df2)
OT NT
PID stubs
111 MISS 0.10 0.90
121 MISS 0.10 0.90
212 MISS 0.10 0.90
321 MISS 0.10 0.90
423 MISS 0.10 0.90
534 MISS 0.10 0.90
621 MISS 0.10 0.90
721 MISS 0.10 0.90
812 MISS 0.10 0.90
922 MISS 0.10 0.90
111 HIT 0.23 1.00
121 HIT 0.23 1.00
212 HIT 0.23 1.00
321 HIT 0.23 1.00
423 HIT 0.23 1.00
534 HIT 0.23 1.00
621 HIT 0.23 1.00
721 HIT 0.23 1.00
812 HIT 0.23 1.00
922 HIT 0.23 1.00
111 CR 0.56 0.92
121 CR 0.56 0.92
212 CR 0.56 0.92
321 CR 0.56 0.92
423 CR 0.56 0.92
534 CR 0.56 0.92
621 CR 0.56 0.92
721 CR 0.56 0.92
812 CR 0.56 0.92
922 CR 0.56 0.92
111 FA 0.11 0.68
121 FA 0.11 0.68
212 FA 0.11 0.68
321 FA 0.11 0.68
423 FA 0.11 0.68
534 FA 0.11 0.68
621 FA 0.11 0.68
721 FA 0.11 0.68
812 FA 0.11 0.68
922 FA 0.11 0.68
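The same call can be tried without a file on disk. This sketch reads the first two rows of the junk CSV from a string and resets the index so PID and stubs become ordinary columns (the name long_df is just illustrative).

```python
import io

import pandas as pd

csv_text = """PID,OT_MISS,OT_HIT,OT_CR,OT_FA,NT_MISS,NT_HIT,NT_CR,NT_FA
111,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
121,0.1,0.23,0.56,0.11,0.9,1.0,0.92,0.68
"""
df = pd.read_csv(io.StringIO(csv_text))

# Stub names are the column prefixes; the part after '_' becomes the 'stubs' value.
long_df = pd.wide_to_long(
    df, stubnames=["OT", "NT"], i="PID", j="stubs", sep="_", suffix=r"\w+"
).reset_index()

print(long_df)
```

Each PID ends up with one row per suffix (MISS, HIT, CR, FA), with the OT and NT values side by side.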
I am trying to use np.genfromtxt to load data that looks something like this into a matrix:
0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -0.42 -0.66 0.10 -0.79 0.21 0.10 0.93 0.79 0.10 0.91 -0.72 0.10 0.25 0.64 0.10 -0.27 -0.36 0.10 -0.66 -0.52 0.10 0.92 -0.39 0.10 0.43 0.63 0.10 0.25 -0.58 0.10 -0.03 0.59 0.10 0.02 -0.69 0.10 0.79 0.30 0.10 0.09 0.70 0.10 0.67 -0.04 0.10 -0.65 -0.07 0.10 0.70 -0.06 0.10 0.08 7 566 112 32 163 615 424 543 424 422 490 47 499 595 94 515 163 535
0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -0.42 -0.66 0.10 -0.79 0.21 0.10 0.93 0.79 0.10 0.91 -0.72 0.10 0.25 0.64 0.10 -0.27 -0.36 0.10 -0.66 -0.52 0.10 0.92 -0.39 0.10 0.43 0.63 0.10 0.25 -0.58 0.10 -0.03 0.59 0.10 0.02 -0.69 0.10 0.79 0.30 0.10 0.09 0.70 0.10 0.67 -0.04 0.10 -0.65 -0.07 0.10 0.70 -0.06 0.10 0.08 263 112 32 30 163 366 543 457 424 422 556 55 355 485 112 515 163 509 112 535
0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -0.42 -0.66 0.10 -0.79 0.21 0.10 0.93 0.79 0.10 0.91 -0.72 0.10 0.25 0.64 0.10 -0.27 -0.36 0.10 -0.66 -0.52 0.10 0.92 -0.39 0.10 0.43 0.63 0.10 0.25 -0.58 0.10 -0.03 0.59 0.10 0.02 -0.69 0.10 0.79 0.30 0.10 0.09 0.70 0.10 0.67 -0.04 0.10 -0.65 -0.07 0.10 0.70 -0.06 0.10 0.08 311 112 32 543 457 77 639 355 412 422 509 112 535 163 77 125 30 412 422 556 55 355 485 112 515
Suppose I want to import data into a matrix of size (4, 5). If not all rows have 5 columns, the import should fill the missing columns of the shorter rows with "". For example, if the data were simpler, it would look like this:
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,"","","",""
Thus, I want the number of columns to be imported to match that of the max row column count, and if a row doesn't have that many columns, it will fill it with "". I am reading from a file called "data.txt".
This is what I have tried so far:
trainData = np.genfromtxt('data.txt', usecols = range(0, 5), invalid_raise=False, missing_values = "", filling_values="")
However, it gives errors saying:
Line #4 (got 1 columns instead of 5)
How can I solve this?
Thanks!
Pandas has more robust readers and you can use the DataFrame methods to handle the missing values.
You'll have to figure out how many columns to use first:
columns = max(len(l.split()) for l in open('data.txt'))
To read the file:
import pandas
df = pandas.read_table('data.txt',
delim_whitespace=True,
header=None,
usecols=range(columns),
engine='python')
To convert to a numpy array:
import numpy
a = numpy.array(df)
This will fill in NaNs in the blank positions. You can use .fillna() to get other values for blanks.
filled = numpy.array(df.fillna(999))
You need to modify the filling_values argument to np.nan (which is considered of type float so you won't have the string conversion issue) and specify the delimiter to be comma since by default genfromtxt expects only white space as delimiters:
trainData = np.genfromtxt('data.txt', usecols = range(0, 5), invalid_raise=False, missing_values = "", filling_values=np.nan, delimiter=',')
I managed to figure out a solution.
df = pandas.DataFrame([line.strip().split() for line in open('data.txt', 'r')])
data = np.array(df)
With a copy-and-paste of the 3 big lines, this pandas reader works:
In [149]: pd.read_csv(BytesIO(txt), delim_whitespace=True, header=None,
     ...:             error_bad_lines=False, names=list(range(91)))
Out[149]:
0 1 2 3 4 5 6 7 8 9 ... 81 82 \
0 0.79 0.1 0.91 -0.17 0.1 0.33 -0.9 0.1 -0.19 -0.0 ... 515 163
1 0.79 0.1 0.91 -0.17 0.1 0.33 -0.9 0.1 -0.19 -0.0 ... 515 163
2 0.79 0.1 0.91 -0.17 0.1 0.33 -0.9 0.1 -0.19 -0.0 ... 125 30
83 84 85 86 87 88 89 90
0 535 NaN NaN NaN NaN NaN NaN NaN
1 509 112.0 535.0 NaN NaN NaN NaN NaN
2 412 422.0 556.0 55.0 355.0 485.0 112.0 515.0
Use _.values to get the array.
The key is specifying a big enough names list. Pandas can fill incomplete lines, while genfromtxt requires explicit delimiters.
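As a compact illustration of the "big enough names list" idea, the sketch below first counts the widest row and then supplies that many column names. The three-line text stands in for data.txt and is made up; sep=r"\s+" is used here as the whitespace separator.

```python
import io

import pandas as pd

# Illustrative ragged, whitespace-separated text standing in for data.txt.
text = "1 2 3\n4 5 6 7 8\n9\n"

# First pass: the widest row decides how many column names to supply.
n_cols = max(len(line.split()) for line in text.splitlines())

# Second pass: rows shorter than the names list are padded with NaN.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None,
                 names=list(range(n_cols)))
arr = df.to_numpy()

print(arr)
```

The array is float because NaN marks the gaps; df.fillna(...) before to_numpy() would substitute another placeholder.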
Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart of the 5 lines (one for each Ref), where the X axis is the columns (+1, +2, ...) and starts from 0? If it can be done in seaborn, even better, but matplotlib solutions are also welcome.
Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
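A variation on the same reshape, offered as a sketch rather than as the answer's method: transposing after set_index('Ref') gives the same layout without melt and pivot, and converting the '+1', '+2', ... labels to integers avoids relying on string sort order. The frame below is a shortened, hand-typed copy of the question's data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Ref": [1, 2, 3],
        "+1": [-0.44, 0.84, 0.09],
        "+2": [0.03, 1.03, 0.25],
        "+3": [0.06, 0.96, 0.06],
    },
    index=pd.to_datetime(["2013-05-28", "2013-07-05", "2013-08-21"]),
)

# One column per Ref, one row per offset: the shape DataFrame.plot() expects.
plot_table = df.set_index("Ref").T

# Turn '+1', '+2', ... into integers so the x axis sorts numerically
# (a plain string sort would misplace a label such as '+10').
plot_table.index = plot_table.index.str.lstrip("+").astype(int)

# Add a zero row so every line starts from the origin.
plot_table.loc[0] = 0.0
plot_table = plot_table.sort_index()

print(plot_table)
```

With matplotlib installed, plot_table.plot() then draws one line per Ref; seaborn's lineplot should also accept a wide-form table like this via its data argument.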