Loading my data with numpy genfromtxt raises errors - python

My data file contains 7500 lines like these:
Y1C 1.53 -0.06 0.58 0.52 0.42 0.16 0.79 -0.6 -0.3
-0.78 -0.14 0.38 0.34 0.23 0.26 -1.8 -0.1 -0.17 0.3
0.6 0.9 0.71 0.5 0.49 1.06 0.25 0.96 -0.39 0.24 0.69
0.41 0.7 -0.16 -0.39 0.6 1.04 0.4 -0.04 0.36 0.23 -0.14
-0.09 0.15 -0.46 -0.05 0.32 -0.54 -0.28 -0.15 1.34 0.29
0.59 -0.43 -0.55 -0.18 -0.01 0.68 -0.06 -0.11 -0.67
-0.25 -0.34 -0.38 0.02 -0.21 0.12 0.01 0.07 0.15 0.14
0.15 -0.11 0.07 -0.41 -0.2 0.24 0.06 0.12 0.12 0.11
0.1 0.24 -0.71 0.22 -0.02 0.15 0.84 1.39 0.13 0.48
0.19 -0.23 -0.12 0.33 0.37 0.18 0.06 0.32 0.09
-0.09 0.02 -0.01 -0.06 -0.23 0.52 0.14 0.24 -0.05 0.37
0.1 0.45 0.38 1.34 0.74 0.5 0.92 0.91 1.34 1.78 2.26
0.05 0.29 0.53 0.17 0.41 0.47 0.47 1.21 0.87 0.68
1.08 0.89 0.13 0.5 0.57 -0.5 -0.78 -0.34 -0.3 0.54
0.31 0.64 1.23 0.335 0.36 -0.65 0.39 0.39 0.31 0.73
0.54 0.3 0.26 0.47 0.13 0.24 -0.6 0.02 0.11 0.27
0.21 -0.3 -1 -0.44 -0.15 -0.51 0.3 0.14 -0.15 -0.27 -0.27
Y2W -0.01 -0.3 0.23 0.01 -0.15 0.45 -0.04 0.14 -1.16
-0.14 -0.56 -0.13 0.77 0.77 -0.57 0.48 0.22 -0.08
-1.81 -0.46 -0.17 0.2 -0.18 -0.45 -0.4 1.35 0.81 1.21
0.52 0.02 -0.06 0.37 0 -0.38 -0.02 0.48 0 0.58 0.81
0.54 0.18 -0.11 0.03 0.1 -0.38 0.17 0.37 -0.05 0.13
-0.01 -0.17 0.36 0.22 0 -1.4 -0.67 -0.45 -0.62 -0.58
-0.47 -0.86 -1.12 -0.43 0.1 0.06 -0.45 -0.14 0.68 -0.16
0.14 0.14 0.18 0.14 0.17 0.13 0.07 0.05 0.04 0.07
-0.01 0.03 0.05 0.02 0.12 0.34 -0.04 -0.75 1.68 0.23
0.49 0.38 -0.57 0.17 -0.04 0.19 0.22 0.29 -0.04 -0.3
0.18 0.04 0.3 -0.06 -0.07 -0.21 -0.01 0.51 -0.04 -0.04
-0.23 0.06 0.9 -0.14 0.19 2.5 2.84 3.27 2.13 2.5 2.66
4.16 3.52 -0.12 0.13 0.44 0.32 0.44 0.46 0.7 0.68
0.99 0.83 0.74 0.51 0.33 0.22 0.01 0.33 -0.19 0.4
0.41 0.07 0.18 -0.01 0.45 -0.37 -0.49 1.02 -0.59
-1.44 -1.53 -0.74 -1.48 0.12 0.05 0.02 -0.1 0.57
-0.36 0.1 -0.16 -0.23 -0.34 -0.61 -0.37 -0.14 -0.22 -0.27
-0.08 -0.08 -0.17 0.18 -0.74
Y3W 0.15 -0.07 -0.25 -0.3 -1.12 -0.67 -0.15 -0.43 0.63
0.92 0.25 0.33 0.81 -0.12 -0.12 0.67 0.86 0.86
1.54 -0.3 0 -0.29 -0.74 0.15 0.59 0.15 0.34 0.23
0.5 0.52 0.25 0.86 0.53 0.51 0.25 -1.29 -1.79
-0.45 -0.64 0.01 -0.58 -0.51 -0.74 -1.32 -0.47
-0.81 0.55 -0.09 0.46 -0.3 -0.2 -0.81 -1.56 -2.74 1.03
1 1.01 0.29 -0.64 -1.03 0.07 0.46 0.33 0.04 -0.6
-0.64 -0.51 -0.36 -0.1 0.13 -1.4 -1.17 -0.64 -0.16 -0.5
-0.47 0.75 0.62 0.7 1.06 0.93 0.56 -2.25 -0.94 -0.09
0.08 -0.15 -1.6 -1.43 -0.84 -0.25 -1.22 -0.92 -1.22
-0.97 -0.84 -0.89 0.24 0 -0.04 -0.64 -0.94 -1.56 -2.32
0.63 -0.17 -3.06 -2.4 -2 -1.4 -0.81 -1.6 -3.06 -1.79
0.17 0.28 -0.67 -2.82 -1.47 -1.82 -1.69 -1.38 -1.96
-1.88 -2.34 -3.06 -0.18 0.5 -0.03 -0.49 -0.61 -0.54 -0.37
0.1 -0.92 -1.79 -0.03 -0.54 0.94 -1 0.15 0.95 0.55
-0.36 0.4 -0.73 0.85 -0.26 0.55 0.14 -0.36 0.38 0.87
0.62 0.66 0.79 -0.67 0.48 0.62 0.48 0.72 0.73 0.29
-0.3 -0.81
Y4W 0.24 0.76 0.2 0.34 0.11 0.07 0.01 0.36 0.4 -0.25
-0.45 0.09 -0.97 0.19 0.28 -1.81 -0.64 -0.49 -1.27
-0.95 -0.1 0.12 -0.1 0 -0.08 0.77 1.02 0.92 0.56
0.1 0.7 0.57 0.16 1.29 0.82 0.55 0.5 1.83 1.79 0.01
0.24 -0.67 -0.85 -0.42 -0.37 0.2 0.07 -0.01 -0.17 -0.2
-0.43 -0.34 0.12 -0.21 -0.23 -0.22 -0.1 -0.07 -0.61
-0.14 -0.43 -0.97 0.27 0.7 0.54 0.11 -0.5 -0.39 0.01
0.61 0.88 1 0.35 0.67 0.6 0.78 0.46 0.09 -0.06
-0.35 0.08 -0.14 -0.32 -0.11 0 0.01 0.02 0.77 0.18
0.36 -1.15 -0.42 -0.19 0.06 -0.25 -0.81 -0.63 -1.43
-0.85 -0.88 -0.68 -0.59 -1.01 -0.68 -0.71 0.15 0.08 0.08
-0.03 -0.2 0.03 -0.18 -0.01 -0.08 -1.61 -0.67 -0.74
-0.54 -0.8 -1.02 -0.84 -1.91 -0.22 -0.02 0.05 -0.84
-0.65 -0.82 -0.4 -0.83 -0.9 -1.04 -1.23 -0.91 0.28 0.68
0.57 -0.02 0.4 -1.52 0.17 0.44 -1.18 0.04 0.17 0.16
0.04 -0.26 0.04 0.1 -0.11 -0.64 -0.09 -0.16 0.16 -0.05
0.39 0.39 -0.06 0.46 0.2 -0.45 -0.35 -1.3 -0.26 -0.29
0.02 0.16 0.18 -0.35 -0.45 -1.04 -0.69
Y5C 2.85 3.34 -1 -0.47 -0.66 -0.03 1.41
0.8 0 0.41 -0.14 -0.86 -0.79 -1.69 0 0 1.52
1.29 0.84 0.58 1.02 1.35 0.45 1.02 1.47 0.82 0.46
0.25 0.77 0.93 -0.58 -0.67 -0.18 -0.56 -0.01 0.25
-0.71 -0.49 -0.43 0 -1.06 0.44 -0.29 0.26 -0.04
-0.14 -0.1 -0.12 -1.6 0.33 0.62 0.52 0.7 -0.22 0.44
-0.6 0.86 1.19 1.58 0.93 1 0.85 1.24 1.06 0.49
0.26 0.18 0.3 -0.09 -0.42 0.05 0.54 0.24 0.37 0.86
0.9 0.49 -1.47 -0.2 -0.43 0.2 0.1 -0.81 -0.74 -1.36 -0.97
-0.94 -0.86 -1.56 -1.89 -1.89 -1.06 0.12 0.06 0.04
-0.01 -0.12 0.01 -0.15 0.76 0.89 0.71 -1.12 0.03
-0.86 0.26 0 -0.25 -0.06 0.19 0.41 0.58 -0.46 0.01
-0.15 0.04 -1.01 -0.57 -0.71 -0.3 -1.01 1.83 0.59
1.04 -1.43 0.38 0.65 -6.64 -0.42 0.24 0.46 0.96 0.24
0.7 1.21 0.6 0.12 0.77 -0.03 0.53 0.31 0.46 0.51
-0.45 0.23 0.32 -0.34 -0.1 0.1 -0.45 0.74 -0.06 0.21
0.29 0.45 0.68 0.29 0.45
Y7C -0.22 -0.12 -0.29 -0.51 -0.81 -0.47 0.28 -0.1 0.15
0.38 0.18 -0.27 0.12 -0.15 0.43 0.25 0.19 0.33 0.67
0.86 -0.56 -0.29 -0.36 -0.42 0.08 0.04 -0.04 0.15 0.38
-0.07 -0.1 -0.2 -0.03 -0.29 0.06 0.65 0.58 0.86 2.05
0.3 0.33 -0.29 -0.23 -0.15 -0.32 0.08 0.34 0.15 0
-0.01 0.28 0.36 0.25 0.46 0.4 0.7 0.49 0.97 1.04
0.36 -0.47 -0.29 0.77 0.57 0.45 0.77 0.24 -0.23 0.12
0.49 0.62 0.49 0.84 0.89 1.08 0.87 -0.18 -0.43
-0.39 -0.18 -0.02 0.01 0.2 -0.2 -0.03 0.01 0.25 0.1
-0.07 -1.43 -0.2 -0.4 0.32 0.72 -0.42 -0.3 -0.38
-0.22 -0.81 -1.15 -1.6 -1.89 -2.06 -2.4 0.08 0.34 0.1
-0.15 -0.06 -0.17 -0.47 -0.4 0.15 -1.22 -1.43 -1.03
-1.03 -1.64 -1.84 -2.64 -2 0.05 0.4 0.88 -1.54 -1.21
-1.46 -1.92 -1.52 -1.92 -1.7 -1.94 -1.86 -0.1 -0.02
-0.22 -0.34 -0.48 0.28 0 0.14 0.4 -0.29 -0.27 -0.3
-0.67 -0.09 0.23 0.33 0.23 0.1 0.38 -0.51 0.23 -0.73
0.22 -0.47 0.24 0.68 0.53 0.23 -0.1 0.11 -0.18 0.16
0.68 0.55 0.28 -0.03 0.03 0.08 0.12
There are missing values. I wanted to load the file as a matrix, so I used:
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan)
When I run it I get:
Traceback (most recent call last):
File "matrix.py", line 8, in <module>
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan ,usecols=np.arange(0,174))
File "/home/anaconda2/lib/python2.7/numpy/lib/npyio.py", line 1769, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #25 (got 172 columns instead of 174)
I also tried:
data = np.genfromtxt("This_data.txt", delimiter='\t', missing_values=np.nan ,usecols=np.arange(0,174))
But I get the same error. Any suggestions?

A short sample bytestring substitute for a file:
In [168]: txt = b"""Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81
...: """
Minimal load with correct delimiter. Note the first column is nan, because it can't convert the strings to float.
In [169]: np.genfromtxt(txt.splitlines(),delimiter='\t')
Out[169]:
array([[ nan, -0.22, -0.12, -0.29, -0.51, -0.81],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81]])
With dtype=None it sets each column's dtype automatically, creating a structured array:
In [170]: np.genfromtxt(txt.splitlines(),delimiter='\t',dtype=None)
Out[170]:
array([(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81),
(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81),
(b'Y7C', -0.22, -0.12, -0.29, -0.51, -0.81)],
dtype=[('f0', 'S3'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ('f5', '<f8')])
Spell out the columns to use, skipping the first:
In [172]: np.genfromtxt(txt.splitlines(),delimiter='\t',usecols=np.arange(1,6))
Out[172]:
array([[-0.22, -0.12, -0.29, -0.51, -0.81],
[-0.22, -0.12, -0.29, -0.51, -0.81],
[-0.22, -0.12, -0.29, -0.51, -0.81]])
But if I ask for more columns than it finds, I get an error like yours:
In [173]: np.genfromtxt(txt.splitlines(),delimiter='\t',usecols=np.arange(1,7))
---------------------------------------------------------------------------
....
ValueError: Some errors were detected !
Line #1 (got 6 columns instead of 6)
Line #2 (got 6 columns instead of 6)
Line #3 (got 6 columns instead of 6)
Your missing_values parameter doesn't help; that's not what it is for.
The correct use of missing_values is to name the string that marks a missing entry, so that it can be replaced with a valid float value via filling_values:
In [177]: np.genfromtxt(txt.splitlines(),delimiter='\t',missing_values='Y7C',filling_values=0)
Out[177]:
array([[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81]])
If the file has enough delimiters, the empty fields between them are treated as missing values:
In [178]: txt = b"""Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: Y7C\t-0.22\t-0.12\t-0.29\t-0.51\t-0.81\t\t
...: """
In [179]: np.genfromtxt(txt.splitlines(),delimiter='\t')
Out[179]:
array([[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan],
[ nan, -0.22, -0.12, -0.29, -0.51, -0.81, nan, nan]])
In [180]: np.genfromtxt(txt.splitlines(),delimiter='\t',filling_values=0)
Out[180]:
array([[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ],
[ 0. , -0.22, -0.12, -0.29, -0.51, -0.81, 0. , 0. ]])
I believe the pandas csv reader can handle 'ragged' columns and missing values better.
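A minimal sketch of the pandas route, assuming the number of fields in the widest row is known in advance (4 in this made-up sample, 174 in the question). Passing explicit column names fixes the column count, so shorter rows are padded with NaN. The sample text and the variable names are illustrative only:

```python
import io

import pandas as pd

# Hypothetical ragged sample: the second row is two fields short.
txt = "Y1C\t1.53\t-0.06\t0.58\nY2W\t-0.01\nY3W\t0.15\t-0.07\t-0.25\n"

# Assumption: the widest row has 4 fields (label + 3 values).
n_fields = 4

df = pd.read_csv(
    io.StringIO(txt),
    sep="\t",
    header=None,
    names=list(range(n_fields)),  # explicit names fix the column count
    index_col=0,                  # keep the row labels as the index
)

# Plain float matrix; missing trailing fields show up as NaN.
matrix = df.to_numpy()
```

If zeros are preferred over NaN, df.fillna(0) can be applied before converting.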

Evidently genfromtxt does not like the fact that some rows have missing fields, probably because it is building a matrix and every row must have the same number of columns; it can only substitute NaNs for fields that are actually delimited. Try adding 0s in the places with missing values, or at least the tab delimiters, so that every line registers as having all 174 columns.
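Adding the missing tab delimiters can also be done in memory rather than by editing the file. The following is only a sketch on a made-up three-line sample; it assumes the missing fields are always at the end of a row, which the question does not confirm:

```python
import numpy as np

# Hypothetical sample lines; the second row is two fields short.
lines = [
    "Y1C\t1.53\t-0.06\t0.58",
    "Y2W\t-0.01",
    "Y3W\t0.15\t-0.07\t-0.25",
]

# Split each line and find the widest row.
rows = [line.rstrip("\n").split("\t") for line in lines]
width = max(len(row) for row in rows)

# Pad short rows with empty fields so every row has `width` columns.
padded = ["\t".join(row + [""] * (width - len(row))) for row in rows]

# Skip column 0 (the text label); empty fields are read as nan.
data = np.genfromtxt(padded, delimiter="\t", usecols=range(1, width))
```

For a real file, the list of lines would come from reading the file instead of the literal above.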

How to add a sum() value above the df column values?

Suppose I have a df as below. How can I add a row of sum() values to this DataFrame?
df.columns=['value_a','value_b','name','up_or_down','difference']
df
value_a value_b name up_or_down difference
project_name
# sum 27.56 25.04 sum down -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
I tried
df.sum().columns=['value_a_sun','value_b_sum','difference_sum']
And I would like to add the sum values below as a row above the column values:
sum 27.56 25.04 sum down -1.31
But I got AttributeError: 'Series' object has no attribute 'column'. How can I fix this? Thanks so much for any advice.
Select the numeric columns with [] before calling sum, then assign the result as a new row with DataFrame.loc:
df.loc['sum'] = df[['value_a','value_b','difference']].sum()
To put the sum in the first row instead:
df1 = df[['value_a','value_b','difference']].sum().to_frame().T
df = pd.concat([df1, df], ignore_index=True)
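Note that ignore_index=True replaces the project_name labels with 0..n. If those labels should be kept, one possible variant is to label the totals row 'sum' and concatenate without ignore_index. This is a sketch on a two-row sample taken from the question's table; the names num_cols, total and result are illustrative:

```python
import pandas as pd

# Two rows copied from the question's table.
df = pd.DataFrame(
    {
        "value_a": [0.43, 0.62],
        "value_b": [0.48, 0.56],
        "name": ["2021-project11", "2021-project1"],
        "up_or_down": ["up", "down"],
        "difference": [0.05, -0.06],
    },
    index=pd.Index(["2021-project11", "2021-project1"], name="project_name"),
)

num_cols = ["value_a", "value_b", "difference"]

# One-row frame whose index label is 'sum'.
total = df[num_cols].sum().to_frame("sum").T

# Put the totals first and restore the original column order;
# the non-numeric columns are NaN in the 'sum' row.
result = pd.concat([total, df])[df.columns.tolist()]
```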

How to add rows up to point?

I have a pandas Series. I would like to add up its rows until the sum equals a given value.
For example, I would like to add up the first few values so that they sum to 0.19798863694528301.
Then add from the beginning so that the second value will be equal to 3.79478220811793, and so forth.
The data is given below:
0.0
0.015
0.03
0.045
0.06
0.075
0.09
0.105
0.12
0.135
0.15
0.165
0.18
0.195
0.21
0.225
0.24
0.255
0.27
0.285
0.3
0.315
0.33
0.345
0.36
0.375
0.39
0.405
0.42
0.435
0.45
0.465
0.48
0.495
0.51
0.525
0.54
0.555
0.57
0.585
0.6
0.615
0.63
0.645
0.66
0.675
0.69
0.705
0.72
0.735
0.75
0.765
0.78
0.795
0.81
0.825

StandardScaler() python error for scaling data

How do I fix this code? Do I need to make features_train and features_test DataFrames?
Does anyone have an idea of how to fix it? I really can't understand the problem.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.metrics import r2_score
admissions_data = pd.read_csv('admissions_data.csv')
labels = admissions_data.iloc[:, -1]
features = admissions_data.iloc[:, 1:8]
features_train, labels_train, features_test, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
sc = StandardScaler()
features_train_scaled = sc.fit_transform(features_train)
features_test_scale = sc.transform(features_test)
features_train_scaled = pd.DataFrame(features_train_scaled)
features_test_scale = pd.DataFrame(features_test_scale)
The error is:
Traceback (most recent call last):
File "script.py", line 26, in <module>
features_test_scale = sc.transform(features_test)
File "/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_data.py", line 794, in transform
force_all_finite='allow-nan')
File "/usr/local/lib/python3.6/dist-packages/sklearn/base.py", line 420, in _validate_data
X = check_array(X, **check_params)
File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 73, in inner_f
return f(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py", line 624, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[0.57 0.78 0.59 0.64 0.47 0.63 0.65 0.89 0.84 0.73 0.75 0.64 0.46 0.78
0.62 0.53 0.85 0.67 0.84 0.94 0.64 0.53 0.47 0.86 0.62 0.7 0.77 0.61
0.61 0.63 0.86 0.82 0.65 0.58 0.7 0.7 0.84 0.72 0.71 0.77 0.69 0.8
0.52 0.62 0.79 0.71 0.9 0.84 0.6 0.86 0.67 0.61 0.71 0.52 0.62 0.37
0.73 0.64 0.71 0.8 0.88 0.78 0.45 0.62 0.62 0.86 0.74 0.94 0.58 0.7
0.92 0.64 0.65 0.83 0.34 0.66 0.67 0.7 0.71 0.54 0.68 0.61 0.68 0.79
0.57 0.94 0.59 0.79 0.73 0.91 0.86 0.95 0.9 0.92 0.68 0.84 0.69 0.72
0.94 0.53 0.45 0.77 0.77 0.91 0.61 0.78 0.77 0.82 0.9 0.92 0.54 0.92
0.72 0.5 0.68 0.78 0.72 0.53 0.79 0.49 0.68 0.72 0.73 0.93 0.72 0.52
0.54 0.86 0.65 0.93 0.89 0.72 0.34 0.64 0.96 0.79 0.73 0.49 0.73 0.94
0.7 0.95 0.65 0.86 0.78 0.75 0.89 0.94 0.91 0.87 0.93 0.81 0.94 0.89
0.57 0.77 0.39 0.46 0.78 0.64 0.76 0.58 0.56 0.53 0.79 0.9 0.92 0.96
0.67 0.65 0.64 0.58 0.94 0.76 0.78 0.88 0.84 0.68 0.66 0.42 0.56 0.66
0.46 0.65 0.58 0.72 0.48 0.68 0.89 0.95 0.46 0.71 0.79 0.52 0.57 0.76
0.52 0.8 0.77 0.91 0.75 0.49 0.72 0.72 0.61 0.97 0.8 0.85 0.73 0.64
0.87 0.63 0.97 0.72 0.82 0.54 0.71 0.45 0.8 0.49 0.77 0.93 0.89 0.93
0.81 0.62 0.81 0.66 0.78 0.76 0.48 0.61 0.82 0.68 0.7 0.68 0.62 0.81
0.87 0.94 0.38 0.67 0.64 0.84 0.62 0.7 0.62 0.5 0.79 0.78 0.36 0.77
0.57 0.87 0.74 0.71 0.61 0.57 0.64 0.73 0.81 0.74 0.8 0.69 0.66 0.64
0.93 0.64 0.59 0.71 0.82 0.69 0.69 0.89 0.93 0.74 0.64 0.84 0.91 0.97
0.55 0.74 0.72 0.71 0.93 0.96 0.8 0.8 0.81 0.88 0.64 0.38 0.87 0.73
0.78 0.89 0.56 0.61 0.76 0.46 0.78 0.71 0.81 0.59 0.47 0.7 0.42 0.76
0.8 0.67 0.94 0.65 0.51 0.73 0.9 0.8 0.65 0.7 0.96 0.96 0.73 0.79
0.86 0.89 0.85 0.76 0.76 0.71 0.83 0.76 0.42 0.9 0.58 0.66 0.86 0.71
0.8 0.51 0.65 0.58 0.76 0.8 0.7 0.61 0.71 0.69 0.95 0.72 0.79 0.97
0.74 0.96 0.47 0.56 0.73 0.94 0.76 0.79 0.71 0.58 0.94 0.66 0.75 0.76
0.84 0.59 0.68 0.75 0.76 0.72 0.87 0.78 0.67 0.79 0.91 0.57 0.77 0.69
0.73 0.43 0.93 0.68 0.82 0.67 0.74 0.82 0.85 0.62 0.54 0.71 0.92 0.85
0.79 0.63 0.59 0.73 0.66 0.74 0.9 0.81].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You made a mistake when unpacking the split. train_test_split() returns features_train, features_test, labels_train, labels_test, in that order.
In your code the variable named features_test therefore receives the training labels, which are 1D, and since transform expects a 2D array, it raises the error.
So, change your code like this:
#features_train, labels_train, features_test, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.2, random_state=13)
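To check the corrected order end to end, here is a small self-contained sketch. It uses random stand-in data instead of admissions_data.csv (which is not available here), so the shapes and values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 50 samples, 7 feature columns, one label column.
rng = np.random.default_rng(13)
features = pd.DataFrame(rng.normal(size=(50, 7)))
labels = pd.Series(rng.uniform(size=50))

# Order matters: both feature splits come first, then both label splits.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=13
)

sc = StandardScaler()
# Fit on the training features only, then reuse the fitted scaler on the test set.
features_train_scaled = pd.DataFrame(sc.fit_transform(features_train))
features_test_scaled = pd.DataFrame(sc.transform(features_test))
```

Both feature splits stay 2D, so transform no longer sees a 1D array.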

How to concat two pivot tables without losing column name

I am trying to concat two pivot tables, but after joining the two tables the column names are lost.
Pivot1:
SATISFIED_CHECKOUT 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.01 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 NaN 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.01 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.04 0.02 0.15 0.79
Pivot2:
SATISFIED_FOOD 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.00 0.01 0.07 0.20 0.71
BOTH_TX_SPEND_NO_GROWTH 0.00 0.01 0.08 0.19 0.71
ONLY_SHOPPED_2018 0.01 0.01 0.07 0.19 0.71
ONLY_SHOPPED_2019 0.00 0.01 0.10 0.19 0.69
ONLY_SPEND_GROWN 0.00 0.01 0.08 0.18 0.72
ONLY_TX_GROWN 0.00 0.02 0.07 0.19 0.72
SHOPPED_NEITHER NaN NaN 0.10 0.20 0.70
The original df looks like below:
SATISFIED_CHECKOUT SATISFIED_FOOD Persona
1 1 BOTH_TX_SPEND_GROWN
2 3 BOTH_TX_SPEND_NO_GROWTH
3 2 ONLY_SHOPPED_2019
.... .... ............
5 3 ONLY_SHOPPED_2019
I am using the code:
a = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_FOOD"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
b = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_CHECKOUT"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
pd.concat([a, b],axis=1)
The result looks like this:
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
But what I want to see is a result like the one below:
SATISFIED_CHECKOUT SATISFIED_FOOD
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
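One possible way to get that layout is the keys argument of pd.concat, which adds an outer column level naming each table. The sketch below uses a tiny made-up frame and swaps the question's pivot_table call for pd.crosstab with normalize="index" to compute the row shares (so missing combinations appear as 0 rather than NaN); the helper share_table is hypothetical:

```python
import pandas as pd

# Tiny made-up survey data with the question's column names.
df = pd.DataFrame(
    {
        "SEGMENT": ["A", "A", "B", "B"],
        "SATISFIED_CHECKOUT": [1, 2, 2, 2],
        "SATISFIED_FOOD": [3, 3, 1, 2],
    }
)


def share_table(frame, column):
    """Row-wise share of each score per segment (hypothetical helper)."""
    return pd.crosstab(frame["SEGMENT"], frame[column], normalize="index").round(2)


a = share_table(df, "SATISFIED_CHECKOUT")
b = share_table(df, "SATISFIED_FOOD")

# `keys` labels each block, producing two-level column headers.
combined = pd.concat(
    [a, b], axis=1, keys=["SATISFIED_CHECKOUT", "SATISFIED_FOOD"]
)
```

A single score column can then be selected with a tuple, for example combined[("SATISFIED_FOOD", 2)].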

Row based chart plot (Seaborn or Matplotlib)

Given that my data is a pandas dataframe and looks like this:
Ref +1 +2 +3 +4 +5 +6 +7
2013-05-28 1 -0.44 0.03 0.06 -0.31 0.13 0.56 0.81
2013-07-05 2 0.84 1.03 0.96 0.90 1.09 0.59 1.15
2013-08-21 3 0.09 0.25 0.06 0.09 -0.09 -0.16 0.56
2014-10-15 4 0.35 1.16 1.91 3.44 2.75 1.97 2.16
2015-02-09 5 0.09 -0.10 -0.38 -0.69 -0.25 -0.85 -0.47
How can I plot a chart of the 5 lines (one for each Ref), where the X axis is the columns (+1, +2, ...) and starts from 0? A seaborn solution would be even better, but matplotlib solutions are also welcome.
Plotting a dataframe in pandas is generally all about reshaping the table so that the individual lines you want are in separate columns, and the x-values are in the index. Some of these reshape operations are a bit ugly, but you can do:
df = pd.read_clipboard()
plot_table = pd.melt(df.reset_index(), id_vars=['index', 'Ref'])
plot_table = plot_table.pivot(index='variable', columns='Ref', values='value')
# Add extra row to have all lines start from 0:
plot_table.loc['+0', :] = 0
plot_table = plot_table.sort_index()
plot_table
Ref 1 2 3 4 5
variable
+0 0.00 0.00 0.00 0.00 0.00
+1 -0.44 0.84 0.09 0.35 0.09
+2 0.03 1.03 0.25 1.16 -0.10
+3 0.06 0.96 0.06 1.91 -0.38
+4 -0.31 0.90 0.09 3.44 -0.69
+5 0.13 1.09 -0.09 2.75 -0.25
+6 0.56 0.59 -0.16 1.97 -0.85
+7 0.81 1.15 0.56 2.16 -0.47
Now that you have a table with the right shape, plotting is pretty automatic:
plot_table.plot()
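For readers without the table on the clipboard, here is a self-contained sketch of the same reshaping on the first two rows and three offset columns of the question's data. It uses set_index plus a transpose as a shorter stand-in for the melt/pivot step, which gives the same layout when each Ref occurs once:

```python
import pandas as pd

# First two rows of the question's table, first three offset columns.
df = pd.DataFrame(
    {
        "Ref": [1, 2],
        "+1": [-0.44, 0.84],
        "+2": [0.03, 1.03],
        "+3": [0.06, 0.96],
    },
    index=["2013-05-28", "2013-07-05"],
)

# One column per Ref, offsets (+1, +2, ...) in the index.
plot_table = df.set_index("Ref").T

# Add an extra row so that every line starts from 0.
plot_table.loc["+0", :] = 0
plot_table = plot_table.sort_index()

# plot_table.plot() would now draw one line per Ref (requires matplotlib).
```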
