Basemap Contour - Correct Indices - python

I'm trying to make a contour map with Basemap. My lat, lon and eof1 arrays are all 1-D and 79 items long. When I run this code, I get an error saying:
IndexError: too many indices for array
Any suggestions? I'm guessing I need a meshgrid or something, but none of the combinations I tried worked.
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import numpy as np
data = np.genfromtxt('/Volumes/NO_NAME/Classwork/Lab3PCAVarimax.txt',usecols=(1,2,3,4,5,6,7),skip_header=1)
eof1 = data[:,6]
locs = np.genfromtxt('/Volumes/NO_NAME/Classwork/OK_vic_grid.txt')
lat = locs[:,1]
lon = locs[:,2]
fig, ax = plt.subplots()
m = Basemap(projection='stere', lon_0=-95, lat_0=35., lat_ts=40,
            llcrnrlat=33, urcrnrlat=38,
            llcrnrlon=-103.8, urcrnrlon=-94)
X,Y = m(lon,lat)
m.drawcoastlines()
m.drawstates()
m.drawcountries()
m.drawmapboundary(fill_color='lightblue')
m.drawparallels(np.arange(0.,40.,2.),color='gray',dashes=[1,3],labels=[1,0,0,0])
m.drawmeridians(np.arange(0.,360.,2.),color='gray',dashes=[1,3],labels=[0,0,0,1])
m.fillcontinents(color='beige',lake_color='lightblue',zorder=0)
plt.title('Oklahoma PCA-Derived Soil Moisture Regions (Varimax)')
m.contour(X,Y,eof1)
lat and lon data:
1 33.75 -97.75
2 33.75 -97.25
3 33.75 -96.75
4 33.75 -96.25
5 33.75 -95.75
6 33.75 -95.25
7 33.75 -94.75
8 34.25 -99.75
9 34.25 -99.25
10 34.25 -98.75
11 34.25 -98.25
12 34.25 -97.75
13 34.25 -97.25
14 34.25 -96.75
15 34.25 -96.25
16 34.25 -95.75
17 34.25 -95.25
18 34.25 -94.75
19 34.75 -99.75
20 34.75 -99.25
21 34.75 -98.75
22 34.75 -98.25
23 34.75 -97.75
24 34.75 -97.25
25 34.75 -96.75
26 34.75 -96.25
27 34.75 -95.75
28 34.75 -95.25
29 34.75 -94.75
30 35.25 -99.75
31 35.25 -99.25
32 35.25 -98.75
33 35.25 -98.25
34 35.25 -97.75
35 35.25 -97.25
36 35.25 -96.75
37 35.25 -96.25
38 35.25 -95.75
39 35.25 -95.25
40 35.25 -94.75
41 35.75 -99.75
42 35.75 -99.25
43 35.75 -98.75
44 35.75 -98.25
45 35.75 -97.75
46 35.75 -97.25
47 35.75 -96.75
48 35.75 -96.25
49 35.75 -95.75
50 35.75 -95.25
51 35.75 -94.75
52 36.25 -99.75
53 36.25 -99.25
54 36.25 -98.75
55 36.25 -98.25
56 36.25 -97.75
57 36.25 -97.25
58 36.25 -96.75
59 36.25 -96.25
60 36.25 -95.75
61 36.25 -95.25
62 36.25 -94.75
63 36.75 -102.75
64 36.75 -102.25
65 36.75 -101.75
66 36.75 -101.25
67 36.75 -100.75
68 36.75 -100.25
69 36.75 -99.75
70 36.75 -99.25
71 36.75 -98.75
72 36.75 -98.25
73 36.75 -97.75
74 36.75 -97.25
75 36.75 -96.75
76 36.75 -96.25
77 36.75 -95.75
78 36.75 -95.25
79 36.75 -94.75
eof data
PC5 PC3 PC2 PC6 PC7 PC4 PC1
1 0.21 0.14 0.33 0.39 0.73 0.13 0.03
2 0.19 0.17 0.42 0.24 0.78 0.1 0.04
3 0.17 0.18 0.51 0.18 0.71 0.01 0.1
4 0.18 0.2 0.58 0.19 0.67 0.07 0.11
5 0.15 0.17 0.76 0.2 0.43 0.11 0.13
6 0.12 0.16 0.82 0.17 0.34 0.12 0.15
7 0.1 0.2 0.84 0.14 0.28 0.14 0.13
8 0.16 0.09 0.2 0.73 0.29 0.25 0.1
9 0.18 0.14 0.18 0.68 0.36 0.24 0.14
10 0.23 0.22 0.18 0.63 0.53 0.21 0.14
11 0.19 0.23 0.23 0.52 0.62 0.19 0.14
12 0.2 0.18 0.23 0.43 0.74 0.15 0.11
13 0.21 0.19 0.43 0.24 0.77 0.11 0.11
14 0.15 0.21 0.51 0.15 0.72 0.1 0.15
15 0.14 0.23 0.58 0.19 0.66 0.1 0.12
16 0.13 0.22 0.74 0.19 0.49 0.13 0.12
17 0.08 0.24 0.85 0.19 0.28 0.15 0.1
18 0.1 0.29 0.86 0.15 0.18 0.16 0.07
19 0.26 0.11 0.17 0.77 0.1 0.24 0.06
20 0.36 0.16 0.14 0.74 0.24 0.23 0.12
21 0.32 0.27 0.14 0.65 0.42 0.14 0.14
22 0.39 0.29 0.21 0.58 0.47 0.09 0.21
23 0.3 0.3 0.29 0.47 0.48 0.09 0.33
24 0.25 0.35 0.35 0.42 0.45 0.09 0.45
25 0.25 0.33 0.43 0.29 0.52 0.11 0.46
26 0.24 0.36 0.48 0.26 0.53 0.09 0.4
27 0.18 0.35 0.62 0.24 0.48 0.11 0.28
28 0.13 0.4 0.83 0.12 0.15 0.12 0.06
29 0.13 0.42 0.81 0.1 0.14 0.08 0.05
30 0.45 0.14 0.14 0.7 0.05 0.2 0.04
31 0.52 0.19 0.13 0.68 0.25 0.18 0.06
32 0.53 0.2 0.16 0.66 0.32 0.09 0.08
33 0.48 0.26 0.2 0.56 0.37 0.06 0.21
34 0.41 0.34 0.28 0.44 0.35 0.06 0.43
35 0.37 0.4 0.28 0.37 0.32 0.06 0.54
36 0.24 0.41 0.39 0.27 0.33 0.11 0.56
37 0.29 0.47 0.37 0.28 0.32 0.11 0.54
38 0.3 0.61 0.36 0.25 0.26 0.13 0.47
39 0.21 0.6 0.66 0.13 0.18 0.1 0.12
40 0.13 0.48 0.75 0.1 0.13 0.07 0.06
41 0.55 0.15 0.14 0.63 0.07 0.25 0.1
42 0.55 0.19 0.17 0.65 0.13 0.2 0.11
43 0.6 0.19 0.15 0.62 0.27 0.04 0.06
44 0.63 0.18 0.16 0.53 0.25 0.04 0.16
45 0.69 0.27 0.22 0.36 0.23 -0.01 0.28
46 0.56 0.39 0.25 0.22 0.24 0.06 0.47
47 0.45 0.51 0.28 0.23 0.25 0.11 0.51
48 0.38 0.63 0.3 0.27 0.24 0.14 0.4
49 0.3 0.75 0.34 0.19 0.21 0.13 0.3
50 0.29 0.77 0.44 0.16 0.19 0.12 0.13
51 0.18 0.66 0.63 0.11 0.17 0.1 0.06
52 0.53 0.12 0.08 0.35 0.1 0.52 0.14
53 0.68 0.19 0.14 0.4 0.09 0.36 0.12
54 0.76 0.24 0.14 0.34 0.09 0.29 0.12
55 0.84 0.25 0.12 0.29 0.15 0.1 0.14
56 0.82 0.25 0.11 0.28 0.21 0.03 0.12
57 0.64 0.44 0.22 0.23 0.21 0.06 0.36
58 0.54 0.52 0.27 0.21 0.2 0.09 0.39
59 0.44 0.72 0.26 0.22 0.17 0.17 0.23
60 0.3 0.79 0.28 0.17 0.14 0.11 0.19
61 0.26 0.81 0.35 0.18 0.17 0.12 0.08
62 0.24 0.82 0.37 0.16 0.17 0.1 0.05
63 0.17 0.07 0.22 0.26 0.18 0.75 0.07
64 0.25 0.15 0.24 0.23 0.12 0.82 0.08
65 0.3 0.15 0.16 0.23 0.11 0.82 0.04
66 0.39 0.23 0.15 0.19 0.06 0.77 0.05
67 0.58 0.2 0.09 0.21 0.12 0.55 -0.1
68 0.68 0.17 0.04 0.21 0.11 0.48 -0.07
69 0.59 0.18 0.01 0.14 0.04 0.47 0.07
70 0.75 0.2 0.1 0.29 0.06 0.36 0.11
71 0.75 0.22 0.13 0.26 0.13 0.31 0.07
72 0.82 0.25 0.12 0.2 0.19 0.17 0.06
73 0.79 0.3 0.11 0.15 0.13 0.16 0.03
74 0.76 0.41 0.13 0.16 0.17 0.08 0.13
75 0.65 0.48 0.16 0.14 0.15 0.13 0.15
76 0.52 0.66 0.18 0.16 0.2 0.22 0.05
77 0.45 0.74 0.24 0.16 0.19 0.2 0.06
78 0.38 0.78 0.32 0.17 0.14 0.15 0.02
79 0.28 0.79 0.34 0.15 0.16 0.11 0

AFAICT the essence of your problem is that your x/y grid isn't strictly rectangular. The documentation for matplotlib.pyplot.contour says:
X and Y must both be 2-D with the same shape as Z, or they must both
be 1-D such that len(X) is the number of columns in Z and len(Y) is
the number of rows in Z
see http://matplotlib.org/api/pyplot_api.html
With your unmodified data you can get a quiver plot, e.g.:
# create vectors pointing up and slightly right
v = eof1
u = eof1 * 0.5          # eof1 is a NumPy array, so this scales elementwise
m.quiver(lon, lat, u, v, latlon=True)
plt.show()
So you will have to map your data to the 1-D,1-D,2-D or 2-D,2-D,2-D format required by contour().
It's fairly easy to make your data cover a smaller rectangular lat/lon area by deleting rows 1-7 and 63-68 (or, I suppose, you could pad it with zero values to cover your original area). However, once the lon/lat pairs are projected to your stere projection coordinates they are no longer rectangular, which I think will also be a problem. How about using a merc projection, just to get things going?
Overall, though, I think you will need more data: to get contours right up to the Oklahoma boundary you need data points up to that boundary. Pass latlon=True to the contour call so it transforms the lon and lat correctly, even with the merc projection. I also tried adding tri=True, but that seems to place different requirements on the x/y/z data.
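As a sketch of that mapping (not from the original answer; the 0.5-degree grid spacing below is an assumption read off the station list), scipy.interpolate.griddata can put the 79 scattered values onto a regular lon/lat mesh, which contour() will accept. Points outside the data's convex hull come back as NaN and need masking:
import numpy as np
from scipy.interpolate import griddata

# regular 0.5-degree target grid covering the stations
lon_g = np.arange(-103.75, -94.5, 0.5)
lat_g = np.arange(33.75, 37.0, 0.5)
glon, glat = np.meshgrid(lon_g, lat_g)

# lon, lat, eof1 are the 1-D arrays from the question
eof_grid = griddata((lon, lat), eof1, (glon, glat), method='linear')

# mask the NaNs outside the data region and let Basemap project lon/lat
m.contour(glon, glat, np.ma.masked_invalid(eof_grid), latlon=True)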
As another example, you can get a bubble plot using scatter():
s = eof1 * 500          # marker sizes scaled from the EOF values
m.scatter(lon, lat, s=s, latlon=True)
Addition:
Managed to get some contours!
The simplest solution was to hard-code your lat/lon/data for the rectangular region: meshgrid turns the 1-D lon1 and lat1 into full 2-D grids xx and yy, and the data values are a 2-D array. Here's the code:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
data = np.genfromtxt('Lab3PCAVarimax.txt',usecols=(1,2,3,4,5,6,7),skip_header=1)
eof1 = data[:,6]
print(eof1)
eof11= [
[ 0.1 ,0.14 ,0.14 ,0.14 ,0.11 ,0.11 ,0.15 ,0.12 ,0.12 ,0.1 ,0.07]
,[ 0.06 ,0.12 ,0.14 ,0.21 ,0.33 ,0.45 ,0.46 ,0.4 ,0.28 ,0.06 ,0.05]
,[ 0.04 ,0.06 ,0.08 ,0.21 ,0.43 ,0.54 ,0.56 ,0.54 ,0.47 ,0.12 ,0.06]
,[ 0.1 ,0.11 ,0.06 ,0.16 ,0.28 ,0.47 ,0.51 ,0.4 ,0.3 ,0.13 ,0.06]
,[ 0.14 ,0.12 ,0.12 ,0.14 ,0.12 ,0.36 ,0.39 ,0.23 ,0.19 ,0.08 ,0.05]
,[ 0.07 ,0.11 ,0.07 ,0.06 ,0.03 ,0.13 ,0.15 ,0.05 ,0.06 ,0.02 ,0. ]
]
locs = np.genfromtxt('OK_vic_grid.txt')
lat = locs[:,1]
lon = locs[:,2]
lat1 = [34.25 ,34.75,35.25,35.75,36.25,36.75]
lon1 =[-99.75,-99.25, -98.75, -98.25, -97.75, -97.25, -96.75, -96.25, -95.75, -95.25, -94.75]
fig, ax = plt.subplots()
m = Basemap(projection='merc', lon_0=-95, lat_0=35., lat_ts=40,
            llcrnrlat=33, urcrnrlat=38,
            llcrnrlon=-103.8, urcrnrlon=-94)
#X,Y = m(lon,lat)
m.drawcoastlines()
m.drawstates()
m.drawcountries()
m.drawmapboundary(fill_color='lightblue')
m.drawparallels(np.arange(0.,40.,2.),color='gray',dashes=[1,3],labels=[1,0,0,0])
m.drawmeridians(np.arange(0.,360.,2.),color='gray',dashes=[1,3],labels=[0,0,0,1])
m.fillcontinents(color='beige',lake_color='lightblue',zorder=0)
plt.title('Oklahoma PCA-Derived Soil Moisture Regions (Varimax)')
xx, yy = m(*np.meshgrid(lon1,lat1))
m.contourf(xx,yy,eof11)
plt.show()
Further addition: Actually this still works when the projection is stere :-)
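A possible refinement, not from the original answer: instead of hand-typing eof11, the 1-D values can be dropped into a 2-D array using each station's position on the 0.5-degree grid (the grid bounds below are an assumption based on the station list); cells with no station stay NaN and get masked:
import numpy as np

lat1 = np.arange(33.75, 37.25, 0.5)        # rows of the target grid
lon1 = np.arange(-103.75, -94.25, 0.5)     # columns of the target grid
grid = np.full((lat1.size, lon1.size), np.nan)

# row/column index of each station on the 0.5-degree grid
i = np.rint((lat - lat1[0]) / 0.5).astype(int)
j = np.rint((lon - lon1[0]) / 0.5).astype(int)
grid[i, j] = eof1

xx, yy = m(*np.meshgrid(lon1, lat1))
m.contourf(xx, yy, np.ma.masked_invalid(grid))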

Related

Is there any way to adjust the spacing among precision, recall, and f1 in sklearn's classification report?

My class names are lengthy. So, when I save the classification report from sklearn in a txt file, the format looks messy. For example:
precision recall f1-score support
カップ 0.96 0.94 0.95 69
セット 0.70 0.61 0.65 23
パウチ 0.96 0.92 0.94 53
ビニール容器 0.53 0.47 0.50 19
ビン 0.91 0.90 0.90 305
プラ容器(ヤクルト型) 0.69 0.80 0.74 25
プラ容器(ヨーグルト型) 0.94 0.53 0.68 32
ペットボトル 0.98 0.98 0.98 1189
ペットボトル(ミニ) 0.71 0.74 0.72 23
ボトル缶 0.93 0.89 0.91 96
ポーション 0.80 0.52 0.63 23
箱(飲料) 0.76 0.77 0.76 134
紙パック(Pキャン) 0.86 0.69 0.76 35
紙パック(キャップ付き) 0.93 0.80 0.86 126
紙パック(ゲーブルトップ) 0.85 0.93 0.88 54
紙パック(ブリックパック) 0.84 0.90 0.87 277
紙パック(円柱型) 0.90 0.56 0.69 16
缶 0.89 0.96 0.92 429
accuracy 0.91 2928
macro avg 0.84 0.77 0.80 2928
weighted avg 0.91 0.91 0.91 2928
Is there any way to adjust the spacing between the different metrics so that the alignment stays consistent across all rows and columns?
I have looked at the sklearn.metrics.classification_report parameters, but I have not found anything that adjusts spacing.
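One possible workaround, not from the original thread: classification_report can return a dict (output_dict=True), and pandas can then write it out in whatever layout you like, e.g. tab-separated, which sidesteps the fixed-width alignment problem with wide CJK class names. A minimal sketch with toy labels:
import pandas as pd
from sklearn.metrics import classification_report

# toy labels just for illustration
y_true = ['カップ', 'ビン', 'ビン', '缶']
y_pred = ['カップ', 'ビン', '缶', '缶']

report = classification_report(y_true, y_pred, output_dict=True)
df = pd.DataFrame(report).transpose().round(2)
df.to_csv('report.txt', sep='\t')   # tab-separated, so columns stay aligned when opened in a spreadsheet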

How to concat two pivot tables without losing the column names

I am trying to concat two pivot tables, but after joining them the column names are lost.
Pivot1:
SATISFIED_CHECKOUT 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.01 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 NaN 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.01 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.04 0.02 0.15 0.79
Pivot2:
SATISFIED_FOOD 1.0 2.0 3.0 4.0 5.0
SEGMENT
BOTH_TX_SPEND_GROWN 0.00 0.01 0.07 0.20 0.71
BOTH_TX_SPEND_NO_GROWTH 0.00 0.01 0.08 0.19 0.71
ONLY_SHOPPED_2018 0.01 0.01 0.07 0.19 0.71
ONLY_SHOPPED_2019 0.00 0.01 0.10 0.19 0.69
ONLY_SPEND_GROWN 0.00 0.01 0.08 0.18 0.72
ONLY_TX_GROWN 0.00 0.02 0.07 0.19 0.72
SHOPPED_NEITHER NaN NaN 0.10 0.20 0.70
The original df looks like below:
SATISFIED_CHECKOUT SATISFIED_FOOD Persona
1 1 BOTH_TX_SPEND_GROWN
2 3 BOTH_TX_SPEND_NO_GROWTH
3 2 ONLY_SHOPPED_2019
.... .... ............
5 3 ONLY_SHOPPED_2019
I am using the code:
a = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_FOOD"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
b = pd.pivot_table(df,index=["SEGMENT"], columns=["SATISFIED_CHECKOUT"], aggfunc='size').apply(lambda x: x / x.sum(), axis=1).round(2)
pd.concat([a, b],axis=1)
The result looks like this:
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
But what I want to see is a result like this:
SATISFIED_CHECKOUT SATISFIED_FOOD
1.0 2.0 3.0 4.0 ... 2.0 3.0 4.0 5.0
SEGMENT ...
BOTH_TX_SPEND_GROWN 0.01 0.03 0.07 0.23 ... 0.03 0.04 0.14 0.80
BOTH_TX_SPEND_NO_GROWTH 0.01 0.03 0.06 0.22 ... 0.03 0.04 0.14 0.78
ONLY_SHOPPED_2018 0.01 0.04 0.08 0.24 ... 0.03 0.04 0.15 0.78
ONLY_SHOPPED_2019 0.01 0.03 0.08 0.25 ... 0.02 0.05 0.13 0.78
ONLY_SPEND_GROWN 0.00 0.03 0.07 0.22 ... 0.02 0.03 0.12 0.82
ONLY_TX_GROWN 0.01 0.02 0.05 0.22 ... 0.03 0.03 0.14 0.79
SHOPPED_NEITHER NaN 0.01 0.07 0.28 ... 0.04 0.02 0.15 0.79
[7 rows x 15 columns]
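One way to get that layout (a sketch, not from the original post) is to pass keys= to pd.concat, which adds each pivot's name as the top level of a MultiIndex on the columns; with the a and b pivots from the code above that would be pd.concat([b, a], axis=1, keys=['SATISFIED_CHECKOUT', 'SATISFIED_FOOD']). A self-contained toy version:
import pandas as pd

# toy stand-ins for the two pivot tables
b = pd.DataFrame({1.0: [0.01, 0.01], 5.0: [0.80, 0.78]}, index=['SEG_A', 'SEG_B'])  # SATISFIED_CHECKOUT
a = pd.DataFrame({1.0: [0.00, 0.00], 5.0: [0.71, 0.71]}, index=['SEG_A', 'SEG_B'])  # SATISFIED_FOOD

# keys= labels each block, producing a two-level column index
out = pd.concat([b, a], axis=1, keys=['SATISFIED_CHECKOUT', 'SATISFIED_FOOD'])
print(out)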

Is there a way to interpolate values while maintaining a ratio?

I have a dataframe of percentages, and I want to interpolate the intermediate values
0 5 10 15 20 25 30 35 40
A 0.50 0.50 0.50 0.49 0.47 0.41 0.35 0.29 0.22
B 0.31 0.31 0.31 0.29 0.28 0.24 0.22 0.18 0.13
C 0.09 0.09 0.09 0.09 0.08 0.07 0.06 0.05 0.04
D 0.08 0.08 0.08 0.08 0.06 0.06 0.05 0.04 0.03
E 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.04
F 0.01 0.01 0.01 0.04 0.10 0.20 0.30 0.41 0.54
So far, I've been using scipy's interp1d and iterating row by row, but it doesn't always maintain the ratios perfectly down the column. Is there a way to do this all together in one function?
reindex then interpolate
r = range(df.columns.min(), df.columns.max() + 1)
df.reindex(columns=r).interpolate(axis=1)
0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
A 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 ... 0.338 0.326 0.314 0.302 0.29 0.276 0.262 0.248 0.234 0.22
B 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 0.31 ... 0.212 0.204 0.196 0.188 0.18 0.170 0.160 0.150 0.140 0.13
C 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 ... 0.058 0.056 0.054 0.052 0.05 0.048 0.046 0.044 0.042 0.04
D 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 ... 0.048 0.046 0.044 0.042 0.04 0.038 0.036 0.034 0.032 0.03
E 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.022 0.024 0.026 0.028 0.03 0.032 0.034 0.036 0.038 0.04
F 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ... 0.322 0.344 0.366 0.388 0.41 0.436 0.462 0.488 0.514 0.54
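For completeness, a self-contained sketch of the same idea (the two-row frame below is just an illustration, rebuilt with integer column labels, which the range() call above relies on). Note that each interpolated column is a weighted average of its two neighbouring original columns, so columns that sum to 1 in the input still sum to 1 in the output:
import pandas as pd

cols = [0, 5, 10, 15, 20, 25, 30, 35, 40]
df = pd.DataFrame(
    [[0.50, 0.50, 0.50, 0.49, 0.47, 0.41, 0.35, 0.29, 0.22],
     [0.50, 0.50, 0.50, 0.51, 0.53, 0.59, 0.65, 0.71, 0.78]],
    index=['A', 'F'], columns=cols)

r = range(df.columns.min(), df.columns.max() + 1)
out = df.reindex(columns=r).interpolate(axis=1)
print(out.sum())     # every column still sums to 1.0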

How to use np.genfromtxt and fill in missing columns?

I am trying to use np.genfromtxt to load a data that looks something like this into a matrix:
0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -0.42 -0.66 0.10 -0.79 0.21 0.10 0.93 0.79 0.10 0.91 -0.72 0.10 0.25 0.64 0.10 -0.27 -0.36 0.10 -0.66 -0.52 0.10 0.92 -0.39 0.10 0.43 0.63 0.10 0.25 -0.58 0.10 -0.03 0.59 0.10 0.02 -0.69 0.10 0.79 0.30 0.10 0.09 0.70 0.10 0.67 -0.04 0.10 -0.65 -0.07 0.10 0.70 -0.06 0.10 0.08 7 566 112 32 163 615 424 543 424 422 490 47 499 595 94 515 163 535
0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -0.42 -0.66 0.10 -0.79 0.21 0.10 0.93 0.79 0.10 0.91 -0.72 0.10 0.25 0.64 0.10 -0.27 -0.36 0.10 -0.66 -0.52 0.10 0.92 -0.39 0.10 0.43 0.63 0.10 0.25 -0.58 0.10 -0.03 0.59 0.10 0.02 -0.69 0.10 0.79 0.30 0.10 0.09 0.70 0.10 0.67 -0.04 0.10 -0.65 -0.07 0.10 0.70 -0.06 0.10 0.08 263 112 32 30 163 366 543 457 424 422 556 55 355 485 112 515 163 509 112 535
0.79 0.10 0.91 -0.17 0.10 0.33 -0.90 0.10 -0.19 -0.00 0.10 -0.99 -0.06 0.10 -0.42 -0.66 0.10 -0.79 0.21 0.10 0.93 0.79 0.10 0.91 -0.72 0.10 0.25 0.64 0.10 -0.27 -0.36 0.10 -0.66 -0.52 0.10 0.92 -0.39 0.10 0.43 0.63 0.10 0.25 -0.58 0.10 -0.03 0.59 0.10 0.02 -0.69 0.10 0.79 0.30 0.10 0.09 0.70 0.10 0.67 -0.04 0.10 -0.65 -0.07 0.10 0.70 -0.06 0.10 0.08 311 112 32 543 457 77 639 355 412 422 509 112 535 163 77 125 30 412 422 556 55 355 485 112 515
Suppose I want to import the data into a matrix of size (4, 5). If a row doesn't have 5 columns, the missing entries should be filled with "" when the matrix is imported. For example, if the data were simpler, it would look like this:
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
16,"","","",""
Thus, I want the number of imported columns to match the maximum column count over all rows, and any row with fewer columns should be padded with "". I am reading from a file called "data.txt".
This is what I have tried so far:
trainData = np.genfromtxt('data.txt', usecols = range(0, 5), invalid_raise=False, missing_values = "", filling_values="")
However, it gives errors saying:
Line #4 (got 1 columns instead of 5)
How can I solve this?
Thanks!
Pandas has more robust readers and you can use the DataFrame methods to handle the missing values.
You'll have to figure out how many columns to use first:
columns = max(len(l.split()) for l in open('data.txt'))
To read the file:
import pandas
df = pandas.read_table('data.txt',
                       delim_whitespace=True,
                       header=None,
                       usecols=range(columns),
                       engine='python')
To convert to a numpy array:
import numpy
a = numpy.array(df)
This will fill in NaNs in the blank positions. You can use .fillna() to get other values for blanks.
filled = numpy.array(df.fillna(999))
You need to change the filling_values argument to np.nan (which is of type float, so you won't have the string-conversion issue) and specify the delimiter as a comma, since by default genfromtxt expects only whitespace as the delimiter:
trainData = np.genfromtxt('data.txt', usecols = range(0, 5), invalid_raise=False, missing_values = "", filling_values=np.nan, delimiter=',')
I managed to figure out a solution.
df = pandas.DataFrame([line.strip().split() for line in open('data.txt', 'r')])
data = np.array(df)
With the copy-n-paste of the 3 big lines, this pandas reader works:
In [149]: pd.read_csv(BytesIO(txt), delim_whitespace=True, header=None,
     ...:             error_bad_lines=False, names=list(range(91)))
Out[149]:
0 1 2 3 4 5 6 7 8 9 ... 81 82 \
0 0.79 0.1 0.91 -0.17 0.1 0.33 -0.9 0.1 -0.19 -0.0 ... 515 163
1 0.79 0.1 0.91 -0.17 0.1 0.33 -0.9 0.1 -0.19 -0.0 ... 515 163
2 0.79 0.1 0.91 -0.17 0.1 0.33 -0.9 0.1 -0.19 -0.0 ... 125 30
83 84 85 86 87 88 89 90
0 535 NaN NaN NaN NaN NaN NaN NaN
1 509 112.0 535.0 NaN NaN NaN NaN NaN
2 412 422.0 556.0 55.0 355.0 485.0 112.0 515.0
_.values to get the array.
The key is specifying a big enough names list. Pandas can fill incomplete lines, while genfromtxt requires explicit delimiters.
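If you do want to stay with genfromtxt, one hedged option (not from the original answers) is to pad every line to the maximum width yourself before parsing, so the table genfromtxt sees is rectangular and the blanks become nan:
import numpy as np
from io import StringIO

with open('data.txt') as f:
    rows = [line.split() for line in f]
width = max(len(r) for r in rows)

# pad short rows with empty fields and hand genfromtxt a comma-delimited, rectangular table
padded = "\n".join(",".join(r + [""] * (width - len(r))) for r in rows)
a = np.genfromtxt(StringIO(padded), delimiter=",", filling_values=np.nan)
print(a.shape)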

Pandas histogram plot with kde?

I have a Pandas dataframe (Dt) like this:
Pc Cvt C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
0 1 2 0.08 0.17 0.16 0.31 0.62 0.66 0.63 0.52 0.38
1 2 2 0.09 0.15 0.13 0.49 0.71 1.28 0.42 1.04 0.43
2 3 2 0.13 0.24 0.22 0.17 0.66 0.17 0.28 0.11 0.30
3 4 1 0.21 0.10 0.23 0.08 0.53 0.14 0.59 0.06 0.53
4 5 1 0.16 0.21 0.18 0.13 0.44 0.08 0.29 0.12 0.52
5 6 1 0.14 0.14 0.13 0.20 0.29 0.35 0.40 0.29 0.53
6 7 1 0.21 0.16 0.19 0.21 0.28 0.23 0.40 0.19 0.52
7 8 1 0.31 0.16 0.34 0.19 0.60 0.32 0.56 0.30 0.55
8 9 1 0.20 0.19 0.26 0.19 0.63 0.30 0.68 0.22 0.58
9 10 2 0.12 0.18 0.13 0.22 0.59 0.40 0.50 0.24 0.36
10 11 2 0.10 0.10 0.19 0.17 0.89 0.36 0.65 0.23 0.37
11 12 2 0.19 0.20 0.17 0.17 0.38 0.14 0.48 0.08 0.36
12 13 1 0.16 0.17 0.15 0.13 0.35 0.12 0.50 0.09 0.52
13 14 2 0.19 0.19 0.29 0.16 0.62 0.19 0.43 0.14 0.35
14 15 2 0.01 0.16 0.17 0.20 0.89 0.38 0.63 0.27 0.46
15 16 2 0.09 0.19 0.33 0.15 1.11 0.16 0.87 0.16 0.29
16 17 2 0.07 0.18 0.19 0.15 0.61 0.19 0.37 0.15 0.36
17 18 2 0.14 0.23 0.23 0.20 0.67 0.38 0.45 0.27 0.33
18 19 1 0.27 0.15 0.20 0.10 0.40 0.05 0.53 0.02 0.52
19 20 1 0.12 0.13 0.18 0.22 0.60 0.49 0.66 0.39 0.66
20 21 2 0.15 0.20 0.18 0.32 0.74 0.58 0.51 0.45 0.37
.
.
.
From this I want to plot a histogram with a KDE for each column from C1 to C10, arranged just like the grid I obtain if I plot it with pandas:
Dt.iloc[:,2:].hist()
But so far I haven't been able to add the KDE to each histogram; I want something like this:
Any ideas on how to accomplish this?
You want to first plot your histogram, then plot the KDE on a secondary axis.
Minimal, Complete, and Verifiable Example (MCVE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(1000, 4)).add_prefix('C')

k = len(df.columns)
n = 2
m = (k - 1) // n + 1
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))

for i, (name, col) in enumerate(df.iteritems()):
    r, c = i // n, i % n
    ax = axes[r, c]
    col.hist(ax=ax)
    ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
    ax2.set_ylim(0)

fig.tight_layout()
How It Works
Keep track of total number of subplots
k = len(df.columns)
n will be the number of chart columns. Change this to suit individual needs. m will be the calculated number of required rows based on k and n
n = 2
m = (k - 1) // n + 1
Create a figure and array of axes with required number of rows and columns.
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
Iterate through the columns, tracking the column name and the running index i. Within each iteration, plot.
for i, (name, col) in enumerate(df.iteritems()):
    r, c = i // n, i % n
    ax = axes[r, c]
    col.hist(ax=ax)
    ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
    ax2.set_ylim(0)
Use tight_layout() as an easy way to sharpen up the layout spacing
fig.tight_layout()
Here is a pure seaborn solution, using FacetGrid.map_dataframe as explained here.
Stealing the example from @piRSquared:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000, 4)).add_prefix('C')
Get the data in the required format:
df = df.stack().reset_index(level=1, name="val")
Result:
level_1 val
0 C0 0.879714
0 C1 -0.927096
0 C2 -0.929429
0 C3 -0.571176
1 C0 -1.127939
Then:
import seaborn as sns
import matplotlib.pyplot as plt   # needed for plt.gca() below

def distplot(x, **kwargs):
    ax = plt.gca()
    data = kwargs.pop("data")
    sns.distplot(data[x], ax=ax, **kwargs)

g = sns.FacetGrid(df, col="level_1", col_wrap=2, size=3.5)
g = g.map_dataframe(distplot, "val")
You can adjust col_wrap as needed.
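As a side note beyond the original answers: newer seaborn versions (0.11+) can draw the same histogram-plus-KDE grid in one call with displot, assuming the stacked long-format df built above:
import seaborn as sns

# one facet per original column, histogram with a KDE overlaid
g = sns.displot(df, x="val", col="level_1", col_wrap=2, kde=True, height=3.5)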
