Python: Reduce an array by keeping only numbers between two limits

I have an Nx4 array and I want to reduce it by keeping only the rows whose values in the second and third columns fall within a specific range. I have written code that does not work, because it does not account for the fact that the array shrinks as I delete rows.
Example of data/array :
1 358 33 7.1
2 659 85 7.1
3 111 145 7.1
4 558 116 7.1
5 632 40 7.1
6 415 335 7.1
7 207 30 7.1
8 564 47 7.1
9 352 41 7.1
10 700 570 7.1
11 275 499 7.1
12 482 177 7.1
13 737 565 7.1
14 298 43 7.1
15 155 195 7.1
16 598 417 7.1
17 93 313 7.1
18 1150 597 7.1
19 410 451 7.1
20 34 793 7.1
21 997 904 7.1
22 1024 452 7.1
23 740 128 7.1
24 522 86 7.1
25 679 643 7.1
26 973 37 7.1
27 372 42 7.1
For example, I want to keep the rows whose second-column value is in the range [80, 2000] and whose third-column value is in the range [130, 2000]. My real array has over 1,000,000 rows.
Here is my code :
def filter_data(data, XRANGE, YRANGE):
    data_f = np.copy(data)
    for l in range(len(data_f)):
        if XRANGE[0] < data_f[l, 1] < XRANGE[1] and YRANGE[0] < data_f[l, 2] < YRANGE[1]:
            pass
        else:
            data = np.delete(data, l, axis=0)
    return data
How could I do this differently, and more efficiently?

You can pull this off by using masks and combining them by computing their point-wise products (equivalent to the AND operator with booleans):
>>> x_range, y_range = [0, 2], [0, 5]
>>> data
array([[ 1,  2,  3],
       [ 1,  1,  1],
       [ 5,  1,  7],
       [ 1, 10,  2]])
First construct two masks based on constraints on data[:, 0] and data[:, 1]:
>>> x_mask = (data[:,0] > x_range[0]) * (data[:,0] < x_range[1])
>>> x_mask
array([ True,  True, False,  True])
>>> y_mask = (data[:,1] > y_range[0]) * (data[:,1] < y_range[1])
>>> y_mask
array([ True,  True,  True, False])
Essentially the resulting mask is equivalent to (x > x_min) & (x < x_max) & (y > y_min) & (y < y_max):
>>> x_mask*y_mask
array([ True, True, False, False])
>>> data[x_mask*y_mask]
array([[1, 2, 3],
       [1, 1, 1]])
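Multiplying boolean masks works, but the idiomatic NumPy spelling is the element-wise `&` operator, with parentheses around each comparison since `&` binds tighter than `<`. A minimal sketch on a few rows of the question's data:

```python
import numpy as np

# a few rows of the example data (columns: id, x, y, value)
data = np.array([[1, 358,  33, 7.1],
                 [2, 659,  85, 7.1],
                 [3, 111, 145, 7.1],
                 [4, 558, 116, 7.1]])
x_range, y_range = [80, 2000], [130, 2000]

# combine the two range checks with & instead of *
mask = ((data[:, 1] > x_range[0]) & (data[:, 1] < x_range[1])
        & (data[:, 2] > y_range[0]) & (data[:, 2] < y_range[1]))
filtered = data[mask]   # only the row with id 3 passes both range checks
```

This is a single vectorized pass, so it scales to a 1,000,000-row array without the per-row np.delete calls of the original loop.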

Here is a simple example of what I think you are talking about.
I first create an example array (n,3).
Then I find where the values in the second and third columns exceed a value (let's call it 4) and multiply that boolean mask by the second and third columns, zeroing out the values that fail the test.
Lastly, I concatenate this new array to the first column from the original array, as follows:
a = np.asarray([[2, 3, 4],
                [3, 4, 5],
                [4, 5, 6],
                [10, 12, 14]])
val = 4
b = a[:, 1:3] > val
c = a[:, 1:3] * b
np.concatenate((a[:, 0:1], c), axis=1)
EDIT: After you updated your example, for an (n,4) array:
a = np.asarray([[2, 3, 4, 5],
                [3, 4, 5, 6],
                [4, 5, 6, 8],
                [10, 12, 14, 9]])
val = 4
b = a[:, 1:3] > val
c = a[:, 1:3] * b
np.concatenate((a[:, 0:1], c, a[:, 3:4]), axis=1)

Related

How can I specify a different decimal format on each column when using Pandas DataFrame to CSV?

I am parsing specific columns from a text file with data that looks like this:
n Elapsed time TimeUTC HeightMSL GpsHeightMSL P Temp RH Dewp Dir Speed Ecomp Ncomp Lat Lon
s hh:mm:ss m m hPa °C % °C ° m/s m/s m/s ° °
1 0 23:15:43 198 198 978.5 33.70 47 20.87 168.0 7.7 -1.6 7.6 32.835222 -97.297940
2 1 23:15:44 202 201 978.1 33.03 48 20.62 162.8 7.3 -2.2 7.0 32.835428 -97.298000
3 2 23:15:45 206 206 977.6 32.89 48 20.58 160.8 7.5 -2.4 7.0 32.835560 -97.298077
4 3 23:15:46 211 211 977.1 32.81 49 20.58 160.3 7.8 -2.6 7.4 32.835660 -97.298160
5 4 23:15:47 217 217 976.5 32.74 49 20.51 160.5 8.3 -2.7 7.8 32.835751 -97.298242
6 5 23:15:48 223 223 975.8 32.66 48 20.43 160.9 8.7 -2.8 8.2 32.835850 -97.298317
I perform one calculation on the first m/s column (converting m/s to kt) and write all data where hpa > 99.9 to an output file. That output looks like this:
978.5,198,33.7,20.87,168.0,14.967568
978.1,201,33.03,20.62,162.8,14.190032
977.6,206,32.89,20.58,160.8,14.5788
977.1,211,32.81,20.58,160.3,15.161952
976.5,217,32.74,20.51,160.5,16.133872
975.8,223,32.66,20.43,160.9,16.911407999999998
The code executes fine and the output file works for what I'm using it for, but is there a way to format the column output to a specific decimal place? As you can see in my code, I've tried df.round but it doesn't impact the output. I've also looked at float_format parameter, but that seems like it would apply the format to all columns. My intended output should look like this:
978.5, 198, 33.7, 20.9, 168, 15
978.1, 201, 33.0, 20.6, 163, 14
977.6, 206, 32.9, 20.6, 161, 15
977.1, 211, 32.8, 20.6, 160, 15
976.5, 217, 32.7, 20.5, 161, 16
975.8, 223, 32.7, 20.4, 161, 17
My code is below:
import pandas as pd

headers = ['n', 's', 'time', 'm1', 'm2', 'hpa', 't', 'rh', 'td', 'dir', 'spd', 'u', 'v', 'lat', 'lon']
df = pd.read_csv('edt_20220520_2315.txt', encoding_errors='ignore', skiprows=2, sep='\s+', names=headers)
df['spdkt'] = df['spd'] * 1.94384
df['hpa'].round(decimals=1)
df['spdkt'].round(decimals=0)
df['t'].round(decimals=1)
df['td'].round(decimals=1)
df['dir'].round(decimals=0)
extract = ['hpa', 'm2', 't', 'td', 'dir', 'spdkt']
with open('test_output.txt', 'w') as fh:
    df_to_write = df[df['hpa'] > 99.9]
    df_to_write.to_csv(fh, header=None, index=None, columns=extract, sep=',')
You can pass a dictionary of per-column decimals to round, and then cast the columns rounded to 0 decimals to integers:
d = {'hpa':1, 'spdkt':0, 't':1, 'td':1, 'dir':0}
df = df.round(d).astype({k:'int' for k, v in d.items() if v == 0})
print(df)
   n  s      time   m1   m2    hpa     t  rh    td  dir  spd    u    v  \
0  1  0  23:15:43  198  198  978.5  33.7  47  20.9  168  7.7 -1.6  7.6
1  2  1  23:15:44  202  201  978.1  33.0  48  20.6  163  7.3 -2.2  7.0
2  3  2  23:15:45  206  206  977.6  32.9  48  20.6  161  7.5 -2.4  7.0
3  4  3  23:15:46  211  211  977.1  32.8  49  20.6  160  7.8 -2.6  7.4
4  5  4  23:15:47  217  217  976.5  32.7  49  20.5  160  8.3 -2.7  7.8
5  6  5  23:15:48  223  223  975.8  32.7  48  20.4  161  8.7 -2.8  8.2

         lat        lon  spdkt
0  32.835222 -97.297940     15
1  32.835428 -97.298000     14
2  32.835560 -97.298077     15
3  32.835660 -97.298160     15
4  32.835751 -97.298242     16
5  32.835850 -97.298317     17
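One caveat with rounding alone: to_csv drops trailing zeros (33.70 is written as 33.7). If the exact number of printed decimals matters, one workaround is to format each column to strings before writing. This is a sketch; the fmt mapping and the two-column frame below are illustrative, not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({'hpa': [978.54, 977.61], 'spdkt': [14.9675, 14.5788]})

# hypothetical per-column formats: convert each column to strings first,
# so to_csv writes exactly the requested number of decimals
fmt = {'hpa': '{:.1f}', 'spdkt': '{:.0f}'}
out = df.copy()
for col, f in fmt.items():
    out[col] = out[col].map(f.format)
csv_text = out.to_csv(index=False)
```

Unlike float_format, which applies one format to every float column, this gives each column its own format string.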

How to read data that has been split into multiple columns?

I have the following dataframe:
q
1 0.83 97 0.7 193 0.238782 289 0.129692 385 0.090692
2 0.75 98 0.7 194 0.238782 290 0.129692 386 0.090692
...
96 0.94693 192 0.299753 288 0.145046 384 0.0965338 480 0.0823061
This data comes from somewhere else, and it has been split. However, the values correspond to a single variable 'q', along with its indices. To clarify, even though there are many columns, they all correspond to one column 'q', plus an index column (notice that the starting index of each column is the continuation of the end of the previous column).
How can I read the data with pandas? I believe I can do it by assigning names to each column and then merging them all together, but I was looking for a more elegant solution. Plus, the number of columns is not fixed.
This is the code that I am using at the moment:
q_param = pd.read_csv('Initial_solutions/initial_q_20y.dat', delim_whitespace=True)
Which does not do the trick. I would prefer to use pandas to solve this issue, but I can also work without it.
EDIT:
At the request of @user17242583, the following command:
print(q_param.head().to_dict())
Gives this output:
{'q': {(1, 0.83, 97, 0.7, 193, 0.238782, 289, 0.129692, 385): 0.090692, (2, 0.75, 98, 0.7, 194, 0.238782, 290, 0.129692, 386): 0.090692, (3, 0.64, 99, 0.64, 195, 0.238782, 291, 0.129692, 387): 0.090692, (4, 0.7, 100, 0.7, 196, 0.238782, 292, 0.129692, 388): 0.0884839, (5, 0.64, 101, 0.64, 197, 0.238782, 293, 0.129692, 389): 0.090692}}
It seems most of your data is index. Try:
df = pd.DataFrame(
    {k: v
     for lst in [list(k) + [v] for k, v in q_param['q'].items()]
     for k, v in zip(lst[::2], lst[1::2])},
    index=['q']).T.sort_index()
Try this:
data = {
    0: pd.concat(q[c] for c in q.columns[0::2]).reset_index(drop=True),
    1: pd.concat(q[c] for c in q.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 1 0.830000
1 2 0.750000
2 3 0.640000
3 4 0.700000
4 5 0.640000
5 97 0.700000
6 98 0.700000
7 99 0.640000
8 100 0.700000
9 101 0.640000
10 193 0.238782
11 194 0.238782
12 195 0.238782
13 196 0.238782
14 197 0.238782
15 289 0.129692
16 290 0.129692
17 291 0.129692
18 292 0.129692
19 293 0.129692
20 385 0.090692
21 386 0.090692
22 387 0.090692
23 388 0.088484
24 389 0.090692
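Since every odd column is an index and every even column is a value, another option is to slice the raw array with strides and ravel column-by-column. A sketch on a toy version of the layout, assuming every row is complete:

```python
import numpy as np
import pandas as pd

# toy version of the split layout: (index, value) pairs sitting side by side
arr = np.array([[1, 0.83, 3, 0.70],
                [2, 0.75, 4, 0.64]])

idx = arr[:, 0::2].ravel(order='F').astype(int)   # indices: 1, 2, 3, 4
vals = arr[:, 1::2].ravel(order='F')              # values in matching order
q = pd.Series(vals, index=idx, name='q')
```

order='F' ravels column-by-column, which stitches the pieces back in the order they were split; it works for any even number of columns, so the column count need not be fixed.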

Numpy Finding Matching number with Array

Any help is greatly appreciated!! I have been trying to solve this for the last few days....
I have two arrays:
import pandas as pd
OldDataSet = {
'id': [20,30,40,50,60,70]
,'OdoLength': [26.12,43.12,46.81,56.23,111.07,166.38]}
NewDataSet = {
'id': [3000,4000,5000,6000,7000,8000]
,'OdoLength': [25.03,42.12,45.74,46,110.05,165.41]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
The result that I am trying to get is:
Array 1 and Array 2 matched by closest difference, based on the left-over numbers from Array 2:
20 26.12 3000 25.03
30 43.12 4000 42.12
40 46.81 6000 46
50 56.23 7000 110.05
60 111.07 8000 165.41
70 166.38 0 0
Starting at Array 1, ID 20: find the nearest value, which in this case is the first number in Array 2, ID 3000 (26.12 - 25.03). So ID 20 gets matched to 3000.
Where it gets tricky is that if a value in Array 2 is not the closest, it is skipped. For example, ID 40's value 46.81 is compared to 45.74 and 46, and the smallest difference is 0.81, from value 46 (ID 6000). So ID 40 → ID 6000, and ID 5000 in Array 2 is skipped for all future comparisons. When comparing Array 1 ID 50, it is compared against the next available number in Array 2, 110.05, so Array 1 ID 50 is matched to Array 2 ID 7000.
UPDATE
So here's the code that I have tried, and it works. It is not the greatest, so if someone has another suggestion please let me know.
import pandas as pd
import operator
OldDataSet = {
    'id': [20, 30, 40, 50, 60, 70],
    'OdoLength': [26.12, 43.12, 46.81, 56.23, 111.07, 166.38]}
NewDataSet = {
    'id': [3000, 4000, 5000, 6000, 7000, 8000],
    'OdoLength': [25.03, 42.12, 45.74, 46, 110.05, 165.41]}
df1 = pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
OldDataSetArray = df1.as_matrix()
NewDataSetArray = df2.as_matrix()
newPos = 1
CurrentNumber = 0
OldArrayLen = len(OldDataSetArray) - 1
NewArrayLen = len(NewDataSetArray) - 1
numberResults = []
for oldPos in range(len(OldDataSetArray)):
    PreviousNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[oldPos, 0])
    while newPos <= len(NewDataSetArray) - 1:
        CurrentNumber = abs(OldDataSetArray[oldPos, 0] - NewDataSetArray[newPos, 0])
        # if it is the last row for the inner array, then match the next available
        # in Array 1 to that last record
        if newPos == NewArrayLen and oldPos < newPos and oldPos + 1 <= OldArrayLen:
            numberResults.append([OldDataSetArray[oldPos + 1, 1], NewDataSetArray[newPos, 1],
                                  OldDataSetArray[oldPos + 1, 0], NewDataSetArray[newPos, 0]])
        if PreviousNumber < CurrentNumber:
            numberResults.append([OldDataSetArray[oldPos, 1], NewDataSetArray[newPos - 1, 1],
                                  OldDataSetArray[oldPos, 0], NewDataSetArray[newPos - 1, 0]])
            newPos += 1
            break
        elif PreviousNumber > CurrentNumber:
            PreviousNumber = CurrentNumber
            newPos += 1

# sort by array one values
numberResults = sorted(numberResults, key=operator.itemgetter(0))
numberResultsDf = pd.DataFrame(numberResults)
You can use NumPy broadcasting to build a distance matrix:
a = numpy.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38,])
b = numpy.array([25.03, 42.12, 45.74, 46, 110.05, 165.41,])
numpy.abs(a[:, None] - b[None, :])
# array([[ 1.09, 16. , 19.62, 19.88, 83.93, 139.29],
# [ 18.09, 1. , 2.62, 2.88, 66.93, 122.29],
# [ 21.78, 4.69, 1.07, 0.81, 63.24, 118.6 ],
# [ 31.2 , 14.11, 10.49, 10.23, 53.82, 109.18],
# [ 86.04, 68.95, 65.33, 65.07, 1.02, 54.34],
# [ 141.35, 124.26, 120.64, 120.38, 56.33, 0.97]])
From that matrix you can then find the closest elements using argmin, either row- or column-wise (depending on whether you want to search in a or b).
numpy.argmin(numpy.abs(a[:, None] - b[None, :]), axis=1)
# array([0, 1, 3, 3, 4, 5])
Compute all the differences, and use `np.argmin` to look up the closest:
a, b = np.random.rand(2, 10)
all_differences = np.abs(np.subtract.outer(a, b))
ia = all_differences.argmin(axis=1)
for i in range(10):
    print(i, a[i], ia[i], b[ia[i]])
0 0.231603891949 8 0.21177584152
1 0.27810475456 7 0.302647382888
2 0.582133214953 2 0.548920922033
3 0.892858042793 1 0.872622982632
4 0.67293347218 6 0.677971552011
5 0.985227546492 1 0.872622982632
6 0.82431697833 5 0.83765895237
7 0.426992114791 4 0.451084369838
8 0.181147161752 8 0.21177584152
9 0.631139744522 3 0.653554586691
EDIT
with dataframes and indexes:
va, vb = np.random.rand(2, 10)
na, nb = np.random.randint(0, 100, (2, 10))
dfa = pd.DataFrame({'id': na, 'odo': va})
dfb = pd.DataFrame({'id': nb, 'odo': vb})
all_differences = np.abs(np.subtract.outer(dfa.odo, dfb.odo))
ia = all_differences.argmin(axis=1)
dfc = dfa.merge(dfb.loc[ia].reset_index(drop=True),
                left_index=True, right_index=True)
Input :
In [337]: dfa
Out[337]:
id odo
0 72 0.426457
1 12 0.315997
2 96 0.623164
3 9 0.821498
4 72 0.071237
5 5 0.730634
6 45 0.963051
7 14 0.603289
8 5 0.401737
9 63 0.976644
In [338]: dfb
Out[338]:
id odo
0 95 0.333215
1 7 0.023957
2 61 0.021944
3 57 0.660894
4 22 0.666716
5 6 0.234920
6 83 0.642148
7 64 0.509589
8 98 0.660273
9 19 0.658639
Output :
In [339]: dfc
Out[339]:
id_x odo_x id_y odo_y
0 72 0.426457 64 0.509589
1 12 0.315997 95 0.333215
2 96 0.623164 83 0.642148
3 9 0.821498 22 0.666716
4 72 0.071237 7 0.023957
5 5 0.730634 22 0.666716
6 45 0.963051 22 0.666716
7 14 0.603289 83 0.642148
8 5 0.401737 95 0.333215
9 63 0.976644 22 0.666716
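A middle ground between the hand-written loop and plain argmin is a greedy pass over the distance matrix that marks each matched column as used, so no Array 2 value is matched twice. Note this is not quite the forward-only rule in the question: a skipped value (here ID 5000) can still be matched later, so 56.23 pairs with 45.74 rather than 110.05. A sketch:

```python
import numpy as np

a = np.array([26.12, 43.12, 46.81, 56.23, 111.07, 166.38])
b = np.array([25.03, 42.12, 45.74, 46, 110.05, 165.41])

dist = np.abs(a[:, None] - b[None, :])
used = np.zeros(len(b), dtype=bool)
matches = []
for i in range(len(a)):
    d = np.where(used, np.inf, dist[i])  # mask out already-matched columns
    j = int(d.argmin())
    used[j] = True
    matches.append((i, j))
# matches: [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4), (5, 5)]
```

Each Array 1 row still gets exactly one distinct partner, which the plain argmin answers above do not guarantee (they can map several rows to the same column).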

How to solve Index out of bounds error?

I'm a bit confused on an error I keep running into. I didn't have it before, but at the same time my data was wrong so I had to re-write the code.
Running the following:
plt.figure(figsize=(20, 10))
x = np.arange(1416, 1426, 0.009766)
gaverage = np.empty((21, 1024), dtype=np.float64)
calibdata = open(pathc + 'calib_5m.dat').readlines()
# print(np.size(calibdata))      ||| Yields: 624
# print(np.size(calibdata)//16)  ||| Yields: 39
calib = np.empty(shape=(np.size(calibdata)//16, 1024), dtype=np.float64)
for i in range(0, np.size(calibdata)//4):
    calib[i] = calibdata[i*4+3].split()
    caverage = np.average(calib[i], axis=0)
Yields this:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-25-87f3f4739851> in <module>()
11 calib = np.empty(shape=(np.size(calibdata)//16,1024), dtype=np.float64)
12 for i in range(0, np.size(calibdata)//4):
---> 13 calib[i] = calibdata[i*4+3].split()
14 caverage = np.average(calib[i] ,axis = 0)
15
IndexError: index 39 is out of bounds for axis 0 with size 39
Now what I'm trying to do here is basically take every 4th line in the file read in calibdata and write it to a new array, calib[i]. If the indices are the same size how are they out of bounds? I think there's some fundamentally flawed logic here on my part so if anyone can point out where I'm falling short, that would be great.
calib is initialized to size (39, n), but the iterator i goes well beyond that:
In [243]: for i in range(np.size(calibdata)//4):
     ...:     print(i, i*4+3)
     ...:
0 3
1 7
2 11
3 15
4 19
5 23
6 27
7 31
8 35
....
147 591
148 595
149 599
150 603
151 607
152 611
153 615
154 619
155 623
In [244]: calib=np.zeros((np.size(calibdata)//16),int)
In [245]: calib.shape
Out[245]: (39,)
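The fix is to make the loop bound and the array's first dimension agree: both should be len(calibdata) // 4, since one calibration row is taken from every 4 file lines. A sketch with synthetic lines standing in for calib_5m.dat:

```python
import numpy as np

# synthetic stand-in for calibdata: every 4th line holds 1024 values
calibdata = []
for _ in range(2):
    calibdata += ['header', 'meta', 'meta', ' '.join(['1.5'] * 1024)]

n_rows = len(calibdata) // 4          # one calibration row per 4 file lines
calib = np.empty((n_rows, 1024), dtype=np.float64)
for i in range(n_rows):
    calib[i] = calibdata[i * 4 + 3].split()
```

With the bound computed once and reused for both the allocation and the loop, the index can never run past the first axis.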

matplotlib histogram with frequency and counts

I have data (from a space-delimited text file with two columns) which is already binned, but with a bin width of only 1. I want to increase this width to about 5. How can I do this using numpy/matplotlib in Python?
Using,
data = loadtxt('file.txt')
x = data[:, 0]
y = data[:, 1]
plt.bar(x,y)
creates too many bars and using,
plt.hist(data)
doesn't plot the histogram appropriately. I guess I don't understand how matplotlib's histogram plotting works.
See some of the data below.
264 1
265 1
266 4
267 2
268 2
269 2
270 2
271 2
272 5
273 3
274 2
275 6
276 7
277 3
278 7
279 5
280 9
281 4
282 8
283 11
284 9
285 15
286 19
287 11
288 12
289 10
290 13
291 18
292 20
293 14
294 15
What if you use numpy.reshape to transform your data before using plt.bar? For example:
In [83]: import numpy as np
In [84]: import matplotlib.pyplot as plt
In [85]: data = np.array([[1,2,3,4,5,6], [4,3,8,9,1,2]]).T
In [86]: data
Out[86]:
array([[1, 4],
[2, 3],
[3, 8],
[4, 9],
[5, 1],
[6, 2]])
In [87]: y = data[:,1].reshape(-1,2).sum(axis=1)
In [89]: y
Out[89]: array([ 7, 17, 3])
In [91]: x = data[:,0].reshape(-1,2).mean(axis=1)
In [92]: x
Out[92]: array([ 1.5, 3.5, 5.5])
In [96]: plt.bar(x, y)
Out[96]: <Container object of 3 artists>
In [97]: plt.show()
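Applied to the sample data in the question, the same reshape idea re-bins the width-1 counts into width-5 bins directly, without dequantizing. This sketch assumes the x values are consecutive; any ragged tail shorter than one bin is dropped first:

```python
import numpy as np

x = np.arange(264, 294)  # the sample's first column
counts = np.array([1, 1, 4, 2, 2, 2, 2, 2, 5, 3, 2, 6, 7, 3, 7,
                   5, 9, 4, 8, 11, 9, 15, 19, 11, 12, 10, 13, 18, 20, 14])

width = 5
n = (len(x) // width) * width                   # drop any ragged tail
newx = x[:n].reshape(-1, width).mean(axis=1)    # bin centres
newc = counts[:n].reshape(-1, width).sum(axis=1)
# newc: [10, 14, 25, 37, 66, 75]
```

plt.bar(newx, newc, width=width) then draws one bar per 5-unit bin.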
I am not an expert at matplotlib but I find hist to be incredibly useful. The examples on the matplotlib site give a great overview of some of the features.
I don't know how to use your provided sample data without transforming it. I altered your example to dequantize those data before creating a histogram.
I calculated the bin size using this question's first answer.
import matplotlib.pyplot as plt
import numpy as np

data = np.loadtxt('file.txt')
dequantized = data[:, 0].repeat(data[:, 1].astype(int))
dequantized[0:7]
# Each row's first column is repeated the number of times found in the
# second column, creating a single array:
# array([ 264., 265., 266., 266., 266., 266., 267.])

def bins(xmin, xmax, binwidth, padding):
    # Returns an array of integers which can be used to represent bins
    return np.arange(
        xmin - (xmin % binwidth) - padding,
        xmax + binwidth + padding,
        binwidth)

histbins = bins(min(dequantized), max(dequantized), 5, 5)
plt.figure(1)
plt.hist(dequantized, histbins)
plt.show()
The resulting histogram (image not reproduced here) shows the counts grouped into 5-unit-wide bins.
I hope this example is useful.
