I'm working with pandas DataFrames full of floats, but with integers in one of every three rows (the whole row is made of integers). When I print the df, all the values are displayed as floats (the integer values get a .000000 appended), for example:
aromatics charged polar unpolar
Ac_obs_counts 712.000000 1486.000000 2688.000000 2792.000000
Ac_obs_freqs 0.092732 0.193540 0.350091 0.363636
Ac_pvalues 0.524752 0.099010 0.356436 0.495050
Am_obs_counts 10.000000 59.000000 62.000000 50.000000
Am_obs_freqs 0.055249 0.325967 0.342541 0.276243
Am_pvalues 0.495050 0.980198 0.356436 0.009901
Ap_obs_counts 18.000000 34.000000 83.000000 78.000000
Ap_obs_freqs 0.084507 0.159624 0.389671 0.366197
Ap_pvalues 0.524752 0.039604 0.980198 0.663366
When I use df.iloc[range(0, len(df.index), 3)], I see integers displayed:
aromatics charged polar unpolar
Ac_obs_counts 712 1486 2688 2792
Am_obs_counts 10 59 62 50
Ap_obs_counts 18 34 83 78
Pa_obs_counts 47 81 125 144
Pf_obs_counts 31 58 99 109
Pg_obs_counts 27 106 102 108
Ph_obs_counts 7 49 42 36
Pp_obs_counts 15 83 45 65
Ps_obs_counts 57 125 170 216
Pu_obs_counts 14 62 102 84
When I use df.to_csv("mydf.csv", sep=",", encoding="utf-8"), the integers are written as floats; how can I force these rows to be written as integers? Would it be better to split the data into two DataFrames?
Thanks in advance.
Simply cast to object:
df.astype('object')
Out[1517]:
aromatics charged polar unpolar
Ac_obs_counts 712 1486 2688 2792
Ac_obs_freqs 0.092732 0.19354 0.350091 0.363636
Ac_pvalues 0.524752 0.09901 0.356436 0.49505
Am_obs_counts 10 59 62 50
Am_obs_freqs 0.055249 0.325967 0.342541 0.276243
Am_pvalues 0.49505 0.980198 0.356436 0.009901
Ap_obs_counts 18 34 83 78
Ap_obs_freqs 0.084507 0.159624 0.389671 0.366197
Ap_pvalues 0.524752 0.039604 0.980198 0.663366
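If the goal is the CSV file itself, here is a minimal sketch (the float_format variant is an addition of mine, not part of the answer; %g drops trailing zeros but limits floats to about six significant digits):

# cast to object as above, then write out
df.astype('object').to_csv("mydf.csv", sep=",", encoding="utf-8")

# alternative: let to_csv format the numbers, dropping the redundant .000000
df.to_csv("mydf.csv", sep=",", encoding="utf-8", float_format="%g")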
I am trying to linearly interpolate a series of CDS rates. Below is the data I have available; the maturities are expressed in years.
Maturity Company 1 Company 2
0 0.5 186.73 186.73
1 1.0 210.65 210.65
2 2.0 249.09 249.09
3 3.0 285.4 285.4
4 4.0 317.59 317.59
5 5.0 344.06 344.06
6 6.0 363.01 363.01
7 7.0 375.69 375.69
8 8.0 384.31 384.31
9 9.0 391.0 391.0
10 10.0 396.12 396.12
I am now trying to use this set of maturities and their CDS rate in a similar format to interpolate the rate; below is an example of the maturities I will need to interpolate.
Maturity Years
28 0.10410958904109589
29 0.1863013698630137
30 0.27671232876712326
31 0.3561643835616438
32 0.4328767123287671
33 0.5260273972602739
34 0.5945205479452055
35 0.684931506849315
36 0.7753424657534247
37 0.852054794520548
38 0.9397260273972603
39 1.0164383561643835
40 1.104109589041096
41 1.1863013698630136
42 1.2728102189781023
43 1.35492700729927
44 1.4343065693430657
45 1.5136861313868615
46 1.6012773722627738
47 1.686131386861314
48 1.770985401459854
49 1.853102189781022
50 1.9434306569343067
51 2.02007299270073
52 2.10492700729927
53 2.184306569343066
54 2.2751540041067764
55 2.3572895277207393
56 2.433949349760438
57 2.518822724161533
58 2.6009582477754964
59 2.6830937713894594
60 2.7707049965776864
61 2.8528405201916494
62 2.940451745379877
63 3.0198494182067077
64 3.1047227926078027
65 3.1813826146475015
66 3.2749178532311065
67 3.3543263964950714
68 3.4309967141292446
69 3.521358159912377
70 3.6007667031763417
71 3.6801752464403066
72 3.7677984665936473
73 3.8526834611171963
74 3.940306681270537
75 4.019715224534502
76 4.101861993428258
77 4.186746987951808
78 4.274760383386581
79 4.351437699680511
80 4.4281150159744405
81 4.518484710178001
82 4.600638977635782
83 4.68553172067549
84 4.767685988133272
85 4.8498402555910545
86 4.940209949794614
87 5.0196257416704695
88 5.099041533546326
89 5.18667275216796
90 5.269847477512711
91 5.354712553773954
92 5.434102463824795
93 5.518967540086038
94 5.595619867031678
95 5.685960109503324
96 5.776300351974971
97 5.8529526789206106
98 5.940555338287055
99 6.017207665232695
100 6.10481032459914
101 6.186937817755182
102 6.275154004106776
103 6.357289527720739
104 6.433949349760438
105 6.5160848733744015
106 6.6009582477754964
107 6.685831622176591
108 6.7734428473648185
109 6.852840520191649
110 6.945927446954141
111 7.014373716632443
112 7.104722792607803
113 7.186858316221766
114 7.275022817158503
115 7.3571645877700025
116 7.433830240340736
117 7.513233951931853
118 7.600851840584119
119 7.685731670216002
120 7.7706114998478855
121 7.852753270459385
122 7.943109218132035
123 8.019774870702769
124 8.10465470033465
125 8.184058411925768
126 8.274917853231106
127 8.357064622124863
128 8.433734939759036
129 8.518619934282585
130 8.600766703176342
131 8.6829134720701
132 8.77053669222344
133 8.852683461117197
134 8.940306681270537
135 9.019715224534503
136 9.104600219058051
137 9.181270536692224
138 9.272523643603783
139 9.35191637630662
In the past, I have created log-linear interpolation functions that will allow me to interpolate discount rates based on maturities represented as dates, not as yearly numerical values; below is an example of that function.
import numpy as np
import pandas as pd

def loglinearinterpolation(df, list_of_dates, dt_rng_name, rate_rng_name):
    asofDate = pd.to_datetime(list_of_dates)
    # bracket the target date with the nearest known points
    low_lim = df[df[dt_rng_name] <= asofDate].tail(1)
    upper_lim = df[df[dt_rng_name] >= asofDate].head(1)
    if low_lim.index == upper_lim.index:
        return low_lim[rate_rng_name].iloc[0]
    mat_dt_min = low_lim[dt_rng_name].iloc[0]
    mat_dt_max = upper_lim[dt_rng_name].iloc[0]
    y_min = low_lim[rate_rng_name].iloc[0]
    y_max = upper_lim[rate_rng_name].iloc[0]
    # interpolate linearly in log space, then exponentiate back
    return np.exp(((np.log(y_max) - np.log(y_min)) / (mat_dt_max - mat_dt_min).days)
                  * (asofDate - mat_dt_min).days + np.log(y_min))
df_Libor_interpolated = [(pd.to_datetime(x), loglinearinterpolation(df_libor_curve, pd.to_datetime(x), 'Dates', 'value'))
                         for x in df_client_curve['Date'].unique()]
I now need to do a similar task using the same formula in the return statement, except linearly, not log-linearly; however, my code breaks because the values I feed in as dates get converted to datetime, raising numpy/Timestamp comparison errors.
I have tried using the code below as a workaround; however, it is not providing me with the values my team expects.
[np.interp(x, df_cds['Maturity'], df_cds['Company 1']) for x in df_cds_interpolated['Maturity Years']]
Any guidance or insight on how I can modify the function and formula above to work with the input data I provided would be greatly appreciated!
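For what it's worth, a plain-linear analogue of the function above, working directly on numeric maturities instead of dates, might look like the sketch below. It is an illustration under assumptions, not the asker's code: the name linearinterpolation is made up, the column names follow the tables above, and maturities outside the pillar range are simply clamped.

import numpy as np

def linearinterpolation(df, maturity, mat_col, rate_col):
    # bracket the target maturity with the nearest pillar points
    low_lim = df[df[mat_col] <= maturity].tail(1)
    upper_lim = df[df[mat_col] >= maturity].head(1)
    if low_lim.empty:      # below the first pillar: clamp (an assumption)
        return upper_lim[rate_col].iloc[0]
    if upper_lim.empty:    # above the last pillar: clamp (an assumption)
        return low_lim[rate_col].iloc[0]
    if low_lim.index == upper_lim.index:   # exact hit on a pillar
        return low_lim[rate_col].iloc[0]
    x_min, x_max = low_lim[mat_col].iloc[0], upper_lim[mat_col].iloc[0]
    y_min, y_max = low_lim[rate_col].iloc[0], upper_lim[rate_col].iloc[0]
    # the log-linear formula above, without the logs
    return (y_max - y_min) / (x_max - x_min) * (maturity - x_min) + y_min

rates = [linearinterpolation(df_cds, m, 'Maturity', 'Company 1')
         for m in df_cds_interpolated['Maturity Years']]

On interior points this should agree with the np.interp one-liner, so it mainly preserves the structure of the date-based function.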
This is a continuation of the following question:
Plot a line on a curve that is undersampled
I tried the solution provided, but with real data I am getting a straight line. The full data is pasted below:
mdcol, tvdcol = 'md_m', 'tvd_m'
df = df[[mdcol, tvdcol]].copy().set_index(mdcol)
df = df[~df.index.duplicated()]
data_intp = (df.reindex(index = range(int(df.index.min()), int(df.index.max())))
               .reset_index()  # optional, you could write 'index' in the second line plot, too.
               .interpolate()
             )
data_intp
Dataframe is shown below:
md_m tvd_m
0 0.00 0.00
1 281.00 281.00
2 300.00 300.00
3 330.00 330.00
4 360.00 360.00
5 390.00 390.00
6 420.00 420.00
7 450.00 450.00
8 480.00 480.00
9 510.00 510.00
10 540.00 539.99
11 570.00 569.98
12 600.00 599.97
13 630.00 629.94
14 660.00 659.91
15 690.00 689.88
16 720.00 719.84
17 750.00 749.80
18 780.00 779.75
19 810.00 809.69
20 840.00 839.58
21 870.00 869.34
22 900.00 898.90
23 930.00 928.19
24 950.00 947.55
25 960.00 957.18
26 970.00 966.76
27 980.00 976.32
28 990.00 985.83
29 1000.00 995.32
30 1010.00 1004.77
31 1020.00 1014.20
32 1030.00 1023.60
33 1040.00 1032.96
34 1050.00 1042.29
35 1060.00 1051.56
36 1070.00 1060.78
37 1080.00 1069.94
38 1090.00 1079.05
39 1100.00 1088.11
40 1110.00 1097.12
41 1120.00 1106.10
42 1130.00 1115.03
43 1140.00 1123.91
44 1150.00 1132.73
45 1160.00 1141.48
46 1170.00 1150.17
47 1180.00 1158.80
48 1190.00 1167.37
49 1200.00 1175.86
50 1210.00 1184.28
51 1220.00 1192.61
52 1230.00 1200.88
53 1240.00 1209.09
54 1250.00 1217.24
55 1260.00 1225.36
56 1270.00 1233.50
57 1280.00 1241.70
58 1290.00 1249.95
59 1300.00 1258.22
60 1310.00 1266.50
61 1320.00 1274.79
62 1330.00 1283.11
63 1340.00 1291.46
64 1350.00 1299.84
65 1360.00 1308.23
66 1370.00 1316.64
67 1380.00 1325.08
68 1390.00 1333.55
69 1400.00 1342.05
70 1410.00 1350.59
71 1420.00 1359.16
72 1430.00 1367.75
73 1440.00 1376.37
74 1450.00 1385.00
75 1460.00 1393.65
76 1470.00 1402.31
77 1480.00 1411.01
78 1490.00 1419.75
79 1500.00 1428.51
80 1510.00 1437.30
81 1520.00 1446.11
82 1530.00 1454.92
83 1540.00 1463.71
84 1550.00 1472.46
85 1560.00 1481.20
86 1570.00 1489.93
87 1580.00 1498.65
88 1590.00 1507.37
89 1600.00 1516.09
90 1610.00 1524.84
91 1620.00 1533.62
92 1630.00 1542.40
93 1640.00 1551.18
94 1650.00 1559.96
95 1660.00 1568.74
96 1670.00 1577.53
97 1680.00 1586.29
98 1690.00 1595.01
99 1700.00 1603.69
100 1710.00 1612.36
101 1720.00 1621.02
102 1730.00 1629.66
103 1740.00 1638.27
104 1750.00 1646.84
105 1760.00 1655.35
106 1770.00 1663.83
107 1780.00 1672.27
108 1790.00 1680.65
109 1800.00 1688.97
110 1810.00 1697.23
111 1820.00 1705.42
112 1830.00 1713.54
113 1840.00 1721.60
114 1850.00 1729.61
115 1860.00 1737.63
116 1870.00 1745.66
117 1880.00 1753.69
118 1890.00 1761.72
119 1900.00 1769.70
120 1910.00 1777.61
121 1920.00 1785.44
122 1930.00 1793.20
123 1940.00 1800.86
124 1950.00 1808.43
125 1960.00 1815.92
126 1970.00 1823.31
127 1980.00 1830.62
128 1990.00 1837.83
129 2000.00 1844.95
130 2010.00 1851.96
131 2020.00 1858.89
132 2030.00 1865.76
133 2040.00 1872.58
134 2050.00 1879.35
135 2060.00 1886.05
136 2070.00 1892.70
137 2080.00 1899.28
138 2090.00 1905.78
139 2100.00 1912.20
140 2110.00 1918.50
141 2120.00 1924.66
142 2130.00 1930.68
143 2140.00 1936.57
144 2150.00 1942.34
145 2160.00 1947.97
146 2170.00 1953.47
147 2180.00 1958.83
148 2190.00 1964.06
149 2200.00 1969.16
150 2210.00 1974.12
151 2220.00 1978.93
152 2230.00 1983.63
153 2240.00 1988.25
154 2250.00 1992.78
155 2260.00 1997.23
156 2270.00 2001.60
157 2280.00 2005.87
158 2290.00 2010.06
159 2300.00 2014.15
160 2310.00 2018.12
161 2320.00 2021.97
162 2330.00 2025.68
163 2340.00 2029.25
164 2373.20 2039.67
165 2401.60 2047.31
166 2430.80 2054.90
167 2459.70 2062.45
168 2488.30 2069.84
169 2488.30 2069.88
170 2489.97 2070.30
171 2493.30 2071.11
172 2503.50 2073.51
173 2519.97 2077.32
174 2549.97 2083.99
175 2563.51 2086.88
176 2579.97 2090.18
177 2609.97 2095.34
178 2639.97 2099.36
179 2662.86 2101.68
180 2752.86 2109.47
181 2759.97 2110.08
182 2789.97 2112.33
183 2819.97 2114.10
184 2849.97 2115.39
185 2879.97 2116.19
186 2902.87 2116.48
187 2909.96 2116.53
188 2939.96 2116.72
189 2969.96 2116.92
190 2999.96 2117.11
191 3029.96 2117.31
192 3059.96 2117.51
193 3089.96 2117.70
194 3119.96 2117.90
195 3149.96 2118.09
196 3179.96 2118.29
197 3209.96 2118.49
198 3239.96 2118.68
199 3252.87 2118.76
200 3352.87 2119.41
201 3359.96 2119.45
202 3389.96 2119.65
203 3419.96 2119.84
204 3449.96 2120.04
205 3479.96 2120.23
206 3509.96 2120.43
207 3539.96 2120.62
208 3569.96 2120.82
209 3599.96 2121.01
210 3629.96 2121.21
211 3652.87 2121.35
212 3779.95 2122.17
213 3852.87 2122.64
Plotting shows the horizontal line:
TOOLS = ["box_zoom", "reset", "save", "crosshair", "pan", "wheel_zoom", "lasso_select"]
p = figure(plot_width=1600,
           plot_height=800, tools=TOOLS,
           title='Well Survey', toolbar_location='above')
p.line(data_intp[mdcol], data_intp[tvdcol], line_width=2, color='red')
show(p)
Not sure where the interpolation is going wrong. I was hoping it would interpolate between the points. Anyone have an idea what I am doing wrong here?
The issue with your data (and this proposed solution) is that you have a single duplicate value in md_m:
chk = df.groupby('md_m').agg({'tvd_m':['count',lambda x: list(x)]})
print(chk[chk['tvd_m']['count']>1])
returns the single duplicated depth: md_m = 2488.30, with two distinct tvd_m values (2069.84 and 2069.88).
Pandas can't "reindex from a duplicate axis", which is what this approach relies on, and really, linear interpolation won't work either when you have two identical x values and two distinct y values.
An extra layer of QA could be done on the input data, inspecting it beforehand like my snippet does, and taking a groupby average or something like that if appropriate.
The only other thing I'd point out is that using a range (i.e. integers) for the reindexing is somewhat unnecessary: you should be able to reindex with floats at any step size you want, as in the sketch below.
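A minimal sketch of that float-based reindex, assuming the duplicate index value has already been dropped:

import numpy as np

# build a regular grid at any float step, union it with the original index,
# interpolate on the index values, then keep only the grid points
grid = np.arange(df.index.min(), df.index.max(), 0.5)
data_intp = (df.reindex(df.index.union(grid))
               .interpolate(method='index')
               .loc[grid]
               .reset_index())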
Thanks for the answer gmerrit123, but I believe I removed that error by using this line:
df = df[~df.index.duplicated()]
What solved it for me was converting the md column to int, as the reindexing was running with step = 1 meter, but the raw data had 2 decimal places.
mdcol, tvdcol = 'md_m', 'tvd_m'
df = data[[mdcol, tvdcol]].copy()
df[mdcol] = df[mdcol].astype('int')
df = df.set_index(mdcol)
df = df[~df.index.duplicated()]
data_intp = (df.reindex(index = range(int(df.index.min()), int(df.index.max() + 1)))
               .reset_index()
               .interpolate()
             )
I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate rows based on the same parameter Hgt from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What to do to get the final result?
Because you asked what you did wrong, let me point out the useless/bad code.
Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Worse, DataFrame.insert works in place and returns None, so df = df.insert(...) actually leaves df set to None.
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-membered groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a dataframe but a DataFrameGroupBy object. That is very useful to store in a variable when you know what you're doing; here, however, it causes the error, as you thought it was a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't do (likely because, as seen above, there isn't much interesting to do on single-membered groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, all is wrong, as df1['parameter'] is equivalent to df.groupby('index')[['x', 'y', 'parameter']]['parameter'] (the cause of the error, as you select 'parameter' twice). Even if you removed this error, the equality comparison would give a single True/False, as you still have a DataFrameGroupBy and not a DataFrame, and this would incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.
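For completeness, if the groups really had multiple rows, the usual way back to a DataFrame is an aggregation, or get_group to pull out one group; a sketch (the groupby key here is chosen for illustration):

# aggregate each parameter group back into a DataFrame
means = df.groupby('parameter')[['x', 'y']].mean()

# or extract a single group directly
hgt = df.groupby('parameter').get_group('Hgt')[['x', 'y', 'parameter']]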
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
Here is my dataframe:
Boston
Zipcode Employees Latitude Longitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
And I want to draw circles on my map, each circle corresponding to one point, with the size of the circle depending on the number of Employees.
Here is my map code (I tried to use Marker, but I think circles would be better):
boston_map = folium.Map([Boston['Longitude'].mean(), Boston['Latitude'].mean()], zoom_start=12)
incidents2 = plugins.MarkerCluster().add_to(boston_map)
for Latitude, Longitude, Employees in zip(Boston.Latitude, Boston.Longitude, Boston.Employees):
    folium.Marker(location=[Latitude, Longitude], icon=None, popup=Employees).add_to(incidents2)
boston_map.add_child(incidents2)
boston_map
Here is my map:
If the number of employees can show in the circle, it will be better! Thank you very much!
To draw circles you can use CircleMarker instead of Marker.
BTW: your column names are swapped. Boston has lat 42.361145, long -71.057083, but you have values around 42 in the Longitude column and around -71 in the Latitude column.
Because I don't use Jupyter, I save the map to an HTML file and use webbrowser to open it automatically in the web browser.
Because it created big circles, I divide Employees by 10 to create smaller circles. But now some circles are very small, and the cluster shows the number of circles instead of the circles themselves. Maybe math.log() or some other method should be used to make the radii smaller (normalized); see the sketch after the code below.
I use tooltip=str(employees) to display the number when you hover over a circle.
text = '''
Zipcode Employees Longitude Latitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
'''
import pandas as pd
import io
import folium
import folium.plugins
boston = pd.read_csv(io.StringIO(text), sep=r'\s+')

boston_map = folium.Map([boston.Latitude.mean(), boston.Longitude.mean()], zoom_start=12)
incidents2 = folium.plugins.MarkerCluster().add_to(boston_map)

for latitude, longitude, employees in zip(boston.Latitude, boston.Longitude, boston.Employees):
    print(latitude, longitude, employees)
    folium.vector_layers.CircleMarker(
        location=[latitude, longitude],
        tooltip=str(employees),
        radius=employees/10,
        color='#3186cc',
        fill=True,
        fill_color='#3186cc'
    ).add_to(incidents2)

boston_map.add_child(incidents2)

# display in web browser
import webbrowser
boston_map.save('map.html')
webbrowser.open('map.html')
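As for making the radii smaller, a log scale is one option; a minimal sketch (the constants are arbitrary and would need tuning):

import math

# hypothetical scaling: grows slowly with the employee count and
# never collapses below a visible minimum radius
radius = max(3, 4 * math.log(employees + 1))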
EDIT: the answer to the question how to add a label on each circle in a folium.circile map python shows how to use Marker with icon=DivIcon(text) to add text, but it doesn't work as I expect.
If I have a set of data that's of shape (1000,1000) and I know that the values I need from it are contained within the indices (25:888,11:957), how would I go about separating the two sections of data from one another?
I couldn't figure out how to get np.delete() to like the specific 2D case and I also need both the good and the bad sections of data for analysis, so I can't just specify my array bounds to be within the good indices.
I feel like there's a simple solution I'm missing here.
Is this how you want to divide the array?
In [364]: arr = np.ones((1000,1000),int)
In [365]: beta = arr[25:888, 11:957]
In [366]: beta.shape
Out[366]: (863, 946)
In [367]: arr[:25,:].shape
Out[367]: (25, 1000)
In [368]: arr[888:,:].shape
Out[368]: (112, 1000)
In [369]: arr[25:888,:11].shape
Out[369]: (863, 11)
In [370]: arr[25:888,957:].shape
Out[370]: (863, 43)
I'm imagining a square with a rectangle cut out of the middle. It's easy to specify that rectangle, but the frame has to be viewed as 4 rectangles - unless it is described via the mask of what is missing.
Checking that I got everything:
In [376]: x = np.array([_366,_367,_368,_369,_370])
In [377]: np.multiply.reduce(x, axis=1).sum()
Out[377]: 1000000
Let's say your original numpy array is my_arr
Extracting the "Good" Section:
This is easy because the good section has a rectangular shape.
good_arr = my_arr[25:888, 11:957]
Extracting the "Bad" Section:
The "bad" section doesn't have a rectangular shape. Rather, it has the shape of a rectangle with a rectangular hole cut out of it.
So, you can't really store the "bad" section alone, in any array-like structure, unless you're ok with wasting some extra space to deal with the cut out portion.
What are your options for the "Bad" Section?
Option 1:
Be happy and content with having extracted the good section. Let the bad section remain as part of the original my_arr. While iterating through my_arr, you can always discriminate between good and bad items based on the indices, as sketched below. The disadvantage is that, whenever you want to process only the bad items, you have to do it through a nested double loop, rather than use the vectorized features of numpy.
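A sketch of that per-index test, using the bounds from the question (plain Python loops, hence the slowness mentioned above):

# nested double loop: classify each element by its indices
for r in range(my_arr.shape[0]):
    for c in range(my_arr.shape[1]):
        is_good = (25 <= r < 888) and (11 <= c < 957)
        # ... process my_arr[r, c] according to is_good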
Option 2:
Suppose we want to perform some operations, such as row-wise or column-wise totals, on only the bad items of my_arr, and suppose you don't want the overhead of the nested for loops. You can create something called a numpy masked array. With a masked array, you can perform most of your usual numpy operations, and numpy will automatically exclude masked-out items from the calculations. Note that internally there will be some memory wastage involved, just to store an item as "masked".
The code below illustrates how you can create a masked array called masked_arr from your original array my_arr:
import numpy as np
my_size = 10 # In your case, 1000
r_1, r_2 = 2, 8 # In your case, r_1 = 25, r_2 = 889 (which is 888+1)
c_1, c_2 = 3, 5 # In your case, c_1 = 11, c_2 = 958 (which is 957+1)
# Using nested list comprehension, build a boolean mask as a list of lists, of shape (my_size, my_size).
# The mask will have False everywhere, except in the sub-region [r_1:r_2, c_1:c_2], which will have True.
mask_list = [[True if ((r in range(r_1, r_2)) and (c in range(c_1, c_2))) else False
              for c in range(my_size)] for r in range(my_size)]
# Your original, complete 2d array. Let's just fill it with some "toy data"
my_arr = np.arange((my_size * my_size)).reshape(my_size, my_size)
print (my_arr)
masked_arr = np.ma.masked_where(mask_list, my_arr)
print ("masked_arr is:\n", masked_arr, ", and its shape is:", masked_arr.shape)
The output of the above is:
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
masked_arr is:
[[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 -- -- 25 26 27 28 29]
[30 31 32 -- -- 35 36 37 38 39]
[40 41 42 -- -- 45 46 47 48 49]
[50 51 52 -- -- 55 56 57 58 59]
[60 61 62 -- -- 65 66 67 68 69]
[70 71 72 -- -- 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]] , and its shape is: (10, 10)
Now that you have a masked array, you will be able to perform most of the numpy operations on it, and numpy will automatically exclude the masked items (the ones that appear as "--" when you print the masked array)
Some examples of what you can do with the masked array:
# Now, you can print column-wise totals, of only the bad items.
print (masked_arr.sum(axis=0))
# Or row-wise totals, for that matter.
print (masked_arr.sum(axis=1))
The output of the above is:
[450 460 470 192 196 500 510 520 530 540]
[45 145 198 278 358 438 518 598 845 945]
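Inverting the mask gives the same vectorized convenience for the good region; a sketch:

# mask everything *outside* the good rectangle instead,
# so operations now run on only the good items
good_only = np.ma.masked_where(~np.asarray(mask_list), my_arr)
print(good_only.sum())  # total over the good rectangle only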