Read txt data separated by empty lines as several numpy arrays


I have some data in a txt file as follows:
# Contour 0, label: 37
41.6 7.5
41.5 7.4
41.5 7.3
41.4 7.2
# Contour 1, label:
48.3 2.9
48.4 3.0
48.6 3.1
# Contour 2, label:
61.4 2.9
61.3 3.0
....
So every block begins with a comment line and ends with a blank line.
I want to read the data into a list of numpy arrays, like this:
# list as I want it:
[array([[41.6, 7.5], [41.5, 7.4], [41.5, 7.3], [41.4, 7.2]]),
array([[48.3, 2.9], [48.4, 3.0], [48.6, 3.1]]),
array([[61.4, 2.9], [61.3, 3.0]]), ...]
Is there an efficient way to do that with numpy? genfromtxt and loadtxt do not seem to have the required options.

Like this?
import numpy as np
text = \
'''
# Contour 0, label: 37
41.6 7.5
41.5 7.4
41.5 7.3
41.4 7.2
# Contour 1, label:
48.3 2.9
48.4 3.0
48.6 3.1
# Contour 2, label:
61.4 2.9
61.3 3.0
'''
for line in text.split('\n'):
    if line != '' and not line.startswith('#'):
        data = line.strip().split()
        array = np.array([float(d) for d in data])
        print(array)
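The loop above prints one array per data line rather than one array per block. A minimal sketch of collecting the rows block by block, assuming each block starts with a '#' line and that blank lines can be ignored (read_blocks is a hypothetical helper name, not part of numpy):

```python
import io
import numpy as np

def read_blocks(lines):
    """Split an iterable of text lines into one 2-D array per '#'-headed block."""
    blocks = []
    rows = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('#'):
            # a comment line opens a new block; store the previous one if it had data
            if rows:
                blocks.append(np.array(rows))
            rows = []
        elif stripped:
            # a data line: parse the whitespace-separated floats
            rows.append([float(v) for v in stripped.split()])
    if rows:
        # store the last block, which has no following comment line
        blocks.append(np.array(rows))
    return blocks

sample = """# Contour 0, label: 37
41.6 7.5
41.5 7.4

# Contour 1, label:
48.3 2.9
48.4 3.0
48.6 3.1
"""
arrays = read_blocks(io.StringIO(sample))
print([a.shape for a in arrays])
```

An open file object can be passed instead of the io.StringIO wrapper, since both yield lines when iterated.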

You could use Python's groupby function to group the 3 entries together as follows:
from itertools import groupby
import numpy as np
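array_list = []
with open('data.txt') as f_data:
    # groupby yields alternating runs of comment lines (k is True) and data lines (k is False)
    for k, g in groupby(f_data, lambda x: x.startswith('#')):
        if not k:
            # skip blank lines inside a data run
            array_list.append(np.array([[float(x) for x in d.split()] for d in g if len(d.strip())]))
for entry in array_list:
    print(entry)
    print()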
This would display the array_list as follows:
[[ 41.6 7.5]
[ 41.5 7.4]
[ 41.5 7.3]
[ 41.4 7.2]]
[[ 48.3 2.9]
[ 48.4 3. ]
[ 48.6 3.1]]
[[ 61.4 2.9]
[ 61.3 3. ]]

Related

How do I mask only the output (labelled data)? I don't have any problem with the input data

I have many NaN values in my output data, and I padded those values with zeros. Please don't suggest deleting the NaNs or imputing them with some other number. I want the model to skip those NaN positions.
Example:
x = np.arange(0.5, 30)
x.shape = [10, 3]
x = [[ 0.5 1.5 2.5]
[ 3.5 4.5 5.5]
[ 6.5 7.5 8.5]
[ 9.5 10.5 11.5]
[12.5 13.5 14.5]
[15.5 16.5 17.5]
[18.5 19.5 20.5]
[21.5 22.5 23.5]
[24.5 25.5 26.5]
[27.5 28.5 29.5]]
y = np.arange(2, 10, 0.8)
y.shape = [10, 1]
y[4, 0] = 0.0
y[6, 0] = 0.0
y[7, 0] = 0.0
y = [[2. ]
[2.8]
[3.6]
[4.4]
[0. ]
[6. ]
[0. ]
[0. ]
[8.4]
[9.2]]
I expect the Keras deep learning model to predict zeros for the 5th, 7th and 8th rows, matching the padded values in 'y'.
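One way to read "skip those positions" is a loss that ignores the padded labels. The following is only a sketch of that idea in plain NumPy, not Keras code; masked_mse and pad_value are hypothetical names, and it assumes that 0.0 never occurs as a genuine label:

```python
import numpy as np

def masked_mse(y_true, y_pred, pad_value=0.0):
    """Mean squared error over the entries of y_true that are not padding."""
    mask = y_true != pad_value          # True where a real label exists
    if not mask.any():
        return 0.0                      # nothing to score
    diff = (y_true - y_pred)[mask]      # keep only the unpadded positions
    return float(np.mean(diff ** 2))

y_true = np.array([[2.0], [2.8], [0.0], [6.0]])
y_pred = np.array([[2.0], [2.8], [5.0], [7.0]])
# only rows 0, 1 and 3 contribute; the padded row 2 is ignored
print(masked_mse(y_true, y_pred))
```

In a Keras model the same masking would typically go into a custom loss or per-sample weights; that wiring is not shown here.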

pretty print 2d matrix and call a sorting function at the same time

L = [['kevin', 8.5, 17.1, 5.9, 15.0, 18], ['arsene', 7.1, 4.4, 15.0, 5.6, 18], ['toufik', 1.1, 2.2, 13.4, 3.1, 20], ['lubin', 16.3, 14.8, 13.1, 5.6, 20], ['yannis', 18.8, 2.4, 12.0, 8.0, 18], ['aurelie', 3.6, 18.8, 8.2, 18.2, 18], ['luna', 14.6, 11.5, 15.2, 18.5, 19], ['sophie', 7.4, 2.1, 18.1, 2.9, 19], ['shadene', 17.9, 7.1, 16.7, 2.5, 19], ['anna', 9.7, 12.8, 10.6, 6.9, 20]]
def triNom(L):
    '''sorts names alphabetically'''
    n = len(L)
    for i in range(n):
        for j in range(n - i - 1):
            if L[j] > L[j + 1]:
                L[j], L[j + 1] = L[j + 1], L[j]
    return L
print('\n'.join(['\t'.join([str(cell) for cell in row]) for row in L]))
Output :
kevin 8.5 17.1 5.9 15.0 18
arsene 7.1 4.4 15.0 5.6 18
toufik 1.1 2.2 13.4 3.1 20
lubin 16.3 14.8 13.1 5.6 20
yannis 18.8 2.4 12.0 8.0 18
aurelie 3.6 18.8 8.2 18.2 18
luna 14.6 11.5 15.2 18.5 19
sophie 7.4 2.1 18.1 2.9 19
shadene 17.9 7.1 16.7 2.5 19
anna 9.7 12.8 10.6 6.9 20
How can I pretty print like this and call my function at the same time, so that the output is both pretty and sorted? It's my first time coding something like this and I can't figure it out.

Your problem is with the sorting function; the pretty print is working correctly. Here is one way to do the sorting without reinventing the wheel, using built-in Python functions.
First, convert L from a list of lists into a dictionary of the following form.
L2 = {'kevin': [8.5, 17.1, 5.9, 15.0, 18], 'arsene': [7.1, 4.4, 15.0, 5.6, 18] }
This makes it easy to access the names, which we can then sort alphabetically with sorted(list(L2)).
To convert to a dictionary of the above format you can simply do
L2: dict = {}
for item in L:  # in pseudo-code: L2[name] = nums
    L2[item[0]] = item[1:]  # item[0] is the name and item[1:] the rest of the data (the numbers)
Then we can sort L2 by converting it to a list of names, looping through the sorted list, and rebuilding the original list of lists L in order.
L = []  # set this to empty because we are re-transferring the data
SORTED_L2 = sorted(list(L2))  # sort the keys of L2 (only the names)
for name in SORTED_L2:
    numbers = L2[name]
    L.append([name, *numbers])  # * is for unpacking
Finally, calling print('\n'.join(['\t'.join([str(cell) for cell in row]) for row in L])) pretty prints them. The output:
anna 9.7 12.8 10.6 6.9 20
arsene 7.1 4.4 15.0 5.6 18
aurelie 3.6 18.8 8.2 18.2 18
kevin 8.5 17.1 5.9 15.0 18
lubin 16.3 14.8 13.1 5.6 20
luna 14.6 11.5 15.2 18.5 19
shadene 17.9 7.1 16.7 2.5 19
sophie 7.4 2.1 18.1 2.9 19
toufik 1.1 2.2 13.4 3.1 20
yannis 18.8 2.4 12.0 8.0 18
You can now wrap it all in functions as follows:
def sortL(L):
    # Convert to dictionary for easy access
    L2: dict = {}
    for item in L:
        L2[item[0]] = [i for i in item[1:len(item)]]
    SORTED_L2 = sorted(list(L2))  # Sort the names
    L = []
    for item in SORTED_L2:
        L.append([item, *L2[item]])
    return L
def SortAndPrettyPrint(L):
    L = sortL(L)
    print('\n'.join(['\t'.join([str(cell) for cell in row]) for row in L]))
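As an alternative to the dictionary round trip, the built-in sorted accepts a key function, so the rows can be ordered by name directly. A minimal sketch (sort_and_pretty_print is a hypothetical name; the data is a shortened copy of L from the question):

```python
def sort_and_pretty_print(rows):
    """Sort rows by their first element (the name) and print them tab-separated."""
    ordered = sorted(rows, key=lambda row: row[0])  # returns a new list, rows is untouched
    print('\n'.join('\t'.join(str(cell) for cell in row) for row in ordered))
    return ordered

L = [['kevin', 8.5, 17.1, 5.9, 15.0, 18],
     ['arsene', 7.1, 4.4, 15.0, 5.6, 18],
     ['anna', 9.7, 12.8, 10.6, 6.9, 20]]
result = sort_and_pretty_print(L)
```

Unlike the dictionary approach, this keeps rows that share the same name instead of overwriting them.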

How to read specific lines from text using a starting and ending condition?

I have a document.gca file from which I need to extract certain information. The following pattern repeats throughout the text:
#Sta/Elev= xx
(here goes pair numbers)
#Mann
This part of the text repeats several times. My goal is to capture the number pairs in that interval, and to do so for every occurrence in the text. How can I extract them? Say I have this:
Sta/Elev= 259
0 2186.31 .3 2186.14 .9 2185.83 1.4 2185.56 2.5 2185.23
3 2185.04 3.6 2184.83 4.7 2184.61 5.6 2184.4 6.4 2184.17
6.9 2183.95 7.5 2183.69 7.6 2183.59 8 2183.35 8.6 2182.92
10.2 2181.47 10.8 2181.03 11.3 2180.63 11.9 2180.27 12.4 2179.97
13 2179.72 13.6 2179.47 14.1 2179.3 14.3 2179.21 14.7 2179.11
15.7 2178.9 17.4 2178.74 17.9 2178.65 20.1 2178.17 20.4 2178.13
20.4 2178.12 21.5 2177.94 22.6 2177.81 22.6 2177.8 22.9 2177.79
24.1 2177.78 24.4 2177.75 24.6 2177.72 24.8 2177.68 25.2 2177.54
Mann= 3 , 0 , 0
0 .2 0 26.9 .2 0 46.1 .2 0
Bank Sta=26.9,46.1
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2176.01,0.3, 56
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
Type RM Length L Ch R = 1 ,2655 ,11.2,11.1,10.5
XS GIS Cut Line=4
858341.2470677761196439.12427935858354.9998313071196457.53292637
858369.2753539641196470.40256485858387.8228168661196497.81690065
Node Last Edited Time=Aug/05/2019 11:42:02
Sta/Elev= 245
0 2191.01 .8 2190.54 2.5 2189.4 5 2187.76 7.2 2186.4
8.2 2185.73 9.5 2184.74 10.1 2184.22 10.3 2184.04 10.8 2183.55
12.8 2180.84 13.1 2180.55 13.3 2180.29 13.9 2179.56 14.2 2179.25
14.5 2179.03 15.8 2178.18 16.4 2177.81 16.7 2177.65 17 2177.54
17.1 2177.51 17.2 2177.48 17.5 2177.43 17.6 2177.4 17.8 2177.39
18.3 2177.37 18.8 2177.37 19.7 2177.44 20 2177.45 20.6 2177.45
20.7 2177.45 20.8 2177.44 21 2177.42 21.3 2177.41 21.4 2177.4
21.7 2177.32 22 2177.26 22.1 2177.21 22.2 2177.13 22.5 2176.94
22.6 2176.79 22.9 2176.54 23.2 2176.19 23.5 2175.88 23.9 2175.68
24.4 2175.55 24.6 2175.54 24.8 2175.53 24.9 2175.53 25.1 2175.54
25.7 2175.63 26 2175.71 26.3 2175.78 26.4 2175.8 26.4 2175.82
#Mann= 3 , 0 , 0
0 .2 0 22.9 .2 0 43 .2 0
Bank Sta=22.9,43
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2175.68,0.3, 51
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
I want to select the numbers between Sta/Elev and Mann and save them as a pair of vectors for each Sta/Elev. Right now I have this:
import re
with open('a.g01','r') as file:
    file_contents = file.read()
    #print(file_contents)
try:
    found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
except AttributeError:
    found = '' # apply your error handling
print(found)
found is empty, and I want to capture all the numbers in the interval between '#Sta/Elev' and '#Mann'.

The problem is in your regex. Try switching
found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
to
found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
output:
>>> import re
>>> file_contents = 'Sta/ElevthisisatestMann'
>>> found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
>>> print(found)
thisisatest
Edit:
For multiline matching, add the re.DOTALL flag and make the quantifier non-greedy, so that the match stops at the first Mann instead of running to the last one in the file:
found = re.search('Sta/Elev=(.*?)Mann',file_contents, re.DOTALL).group(1)
It was not clear to me what the separating string is, since it differs between your examples, but you can simply change it in the regular expression.
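re.search returns only the first match. To collect every Sta/Elev block, re.findall can be used with a non-greedy group. The sketch below is an illustration under assumptions: extract_sta_elev is a hypothetical helper, the leading '#' is treated as optional because the examples differ, and the numbers are assumed to be whitespace-separated station/elevation pairs:

```python
import re

def extract_sta_elev(text):
    """Return one (stations, elevations) pair of lists per Sta/Elev block."""
    # skip the rest of the 'Sta/Elev=' header line, then capture lazily up to the next 'Mann'
    pattern = re.compile(r'#?Sta/Elev=[^\n]*\n(.*?)#?Mann', re.DOTALL)
    blocks = []
    for body in pattern.findall(text):
        numbers = [float(tok) for tok in body.split()]
        stations = numbers[0::2]    # every first number of a pair
        elevations = numbers[1::2]  # every second number of a pair
        blocks.append((stations, elevations))
    return blocks

sample = """Sta/Elev= 3
0 2186.31 .3 2186.14 .9 2185.83
Mann= 3 , 0 , 0
0 .2 0 26.9 .2 0 46.1 .2 0
Sta/Elev= 2
0 2191.01 .8 2190.54
#Mann= 3 , 0 , 0
"""
for stations, elevations in extract_sta_elev(sample):
    print(stations, elevations)
```

If adjacent numbers in the real file run together without whitespace, the split() step would need to be replaced by fixed-width slicing.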

Standard error of values in array corresponding to values in another array

I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the standard error of all the data at a fixed value of the distance?
The standard error is the standard deviation divided by the square root of the number of observations.
e.g. distances (d):
[1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]
e.g. the data corresponding to each entry of the distances
(so value=3.3 at d=1; value=2.1 at d=1; value=3.5 at d=14; etc.):
[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]
For example, at distance d=6 I should calculate the standard error of 2.5, 7.8, 9.2 and 4.3, which is the standard deviation of these values divided by the square root of the number of values (4 in this case).
I've used the following code, which works, but I don't know how to divide the result by the square root of the number of values at each distance:
import numpy as np
result = []
for d in set(key):  # key holds the distances, dist the corresponding values
    result.append(np.std([dist[i] for i in range(len(key)) if key[i] == d]))
Any help would be greatly appreciated. Thanks!

Does this help?
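for d in set(key):
    # collect the values at distance d, then divide their std by sqrt of how many there are
    group = [dist[i] for i in range(len(key)) if key[i] == d]
    result.append(np.std(group) / np.sqrt(len(group)))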

I'm having a bit of a hard time telling exactly how you want things structured, but I would recommend a dictionary, so that you can know which result is associated with which key value. If your data is like this:
>>> key
array([ 1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3,
6, 5, 8])
>>> values
array([ 3.3 , 2.1 , 3.5 , 2.5 , 4.6 , 7.4 , 2.6 , 7.8 , 9.2 ,
10.11, 14.3 , 2.5 , 6.7 , 3.4 , 7.5 , 8.5 , 9.7 , 4.3 ,
2.8 , 4.1 ])
You can set up a dictionary along these lines with a dict comprehension:
result = {f'distance_{i}':np.std(values[key==i]) / np.sqrt(sum(key==i)) for i in set(key)}
>>> result
{'distance_1': 1.0045988005169029, 'distance_3': 1.818424226264781, 'distance_4': 0.0, 'distance_5': 0.0, 'distance_6': 1.3372079120316331, 'distance_7': 1.2056170619230633, 'distance_8': 0.0, 'distance_9': 0.0, 'distance_12': 0.0, 'distance_14': 0.3181980515339463}
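The zeros above come from distances with a single observation, where the population standard deviation (NumPy's default, ddof=0) is 0. A sketch that makes the ddof choice explicit follows; standard_error_by_key is a hypothetical helper, and returning NaN for groups that are too small is an assumption about the desired behaviour:

```python
import numpy as np

def standard_error_by_key(key, values, ddof=0):
    """Map each distinct key to std(values at that key) / sqrt(count)."""
    key = np.asarray(key)
    values = np.asarray(values, dtype=float)
    result = {}
    for k in np.unique(key):
        group = values[key == k]  # boolean mask selects the values at this key
        if group.size > ddof:
            result[int(k)] = float(np.std(group, ddof=ddof) / np.sqrt(group.size))
        else:
            result[int(k)] = float('nan')  # too few observations for this ddof
    return result

key = [1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8]
values = [3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
          14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1]
sem = standard_error_by_key(key, values)
print(sem[6])
```

Passing ddof=1 switches to the sample standard deviation, which is the more common convention for a standard error.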

Match two numpy arrays to find the same elements

I have a task that is similar to an SQL search. I have a "table" which contains the following 1D arrays (about 1 million elements) identified by ID1:
ID1, z, e, PA, n
Another "table" contains the following 1D arrays (about 1.5 million elements) identified by ID2:
ID2, RA, DEC
I want to match ID1 and ID2 to find the common ones and form another "table" which contains ID, z, e, PA, n, RA, DEC. Most elements in ID1 can be found in ID2, but not all; otherwise I could use numpy.in1d(ID1,ID2) to accomplish it. Does anyone have a fast way to accomplish this task?
For example:
ID1, z, e, PA, n
101, 1.0, 1.2, 1.5, 1.8
104, 1.5, 1.8, 2.2, 3.1
105, 1.4, 2.0, 3.3, 2.8
ID2, RA, DEC
101, 4.5, 10.5
107, 90.1, 55.5
102, 30.5, 3.3
103, 60.1, 40.6
104, 10.8, 5.6
The output should be
ID, z, e, PA, n, RA, DEC
101, 1.0, 1.2, 1.5, 1.8, 4.5, 10.5
104, 1.5, 1.8, 2.2, 3.1, 10.8, 5.6

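You can use np.in1d twice, swapping the first columns of the two arrays/tables, to get two masks for indexing into the arrays. Then simply stack the results. This assumes the IDs are unique and that the matching rows appear in the same relative order in both tables (for example, both sorted by ID); in recent NumPy versions np.isin is the recommended replacement for np.in1d: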
mask1 = np.in1d(a[:,0], b[:,0])
mask2 = np.in1d(b[:,0], a[:,0])
out = np.column_stack(( a[mask1], b[mask2,1:] ))
Sample run -
In [44]: a
Out[44]:
array([[ 101. , 1. , 1.2, 1.5, 1.8],
[ 104. , 1.5, 1.8, 2.2, 3.1],
[ 105. , 1.4, 2. , 3.3, 2.8]])
In [45]: b
Out[45]:
array([[ 101. , 4.5, 10.5],
[ 102. , 30.5, 3.3],
[ 103. , 60.1, 40.6],
[ 104. , 10.8, 5.6],
[ 107. , 90.1, 55.5]])
In [46]: mask1 = np.in1d(a[:,0], b[:,0])
In [47]: mask2 = np.in1d(b[:,0], a[:,0])
In [48]: np.column_stack(( a[mask1], b[mask2,1:] ))
Out[48]:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
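When the two tables are not in the same order, np.intersect1d with return_indices=True returns the positions of the common IDs in each array, which can be used for indexing directly. This is a sketch that assumes the IDs are unique within each table; the arrays are the question's example with the rows of b shuffled:

```python
import numpy as np

a = np.array([[101, 1.0, 1.2, 1.5, 1.8],
              [104, 1.5, 1.8, 2.2, 3.1],
              [105, 1.4, 2.0, 3.3, 2.8]])
b = np.array([[104, 10.8, 5.6],
              [107, 90.1, 55.5],
              [102, 30.5, 3.3],
              [103, 60.1, 40.6],
              [101, 4.5, 10.5]])

# common holds the shared IDs in sorted order; ia and ib are their row positions in a and b
common, ia, ib = np.intersect1d(a[:, 0], b[:, 0], return_indices=True)
out = np.column_stack((a[ia], b[ib, 1:]))
print(out)
```

The rows of the result follow the sorted order of the common IDs rather than the original order of either table.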

Assuming your second table, table B, is sorted, you can do a sorted lookup, then check if the indexed element is actually found:
idx = np.searchsorted(B[:-1, 0], A[:, 0])
found = A[:, 0] == B[idx, 0]
np.hstack((A[found, :], B[idx[found], 1:]))
Result:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
The last element of the B indices is excluded to simplify the case where the item in A is beyond the final element in B. Without it, it is possible that the returned index would be greater than the length of B and cause indexing errors.

Use pandas:
import pandas as pd
id1 = pd.read_csv('id1.txt')
id2 = pd.read_csv('id2.txt')
df = id1.merge(id2.sort_values(by='ID2').drop_duplicates('ID2').rename(columns={'ID2':'ID1'}))
print(df)
Produces:
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
With large datasets you may need to do things in place:
# [Optional] sort locations and drop duplicates
id2.sort_values(by='ID2', inplace=True)
id2.drop_duplicates('ID2', inplace=True)
# columns that you are merging must have the same name
id2.rename(columns={'ID2':'ID1'}, inplace=True)
# perform the merge
df = id1.merge(id2)
Without drop_duplicates you get one row for each item:
df = id1.merge(id2.rename(columns={'ID2':'ID1'}))
print(id2)
print(df)
Giving:
ID2 RA DEC
0 101 4.5 10.5
1 107 90.1 55.5
2 102 30.5 3.3
3 103 60.1 40.6
4 104 10.8 5.6
5 103 60.1 40.6
6 104 10.9 5.6
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
2 104 1.5 1.8 2.2 3.1 10.9 5.6
Note that this solution preserves the different types for the columns:
>>> id1.ID1.dtype
dtype('int64')
>>> id1[' z'].dtype
dtype('float64')
Since you have spaces after the commas in the header row, those spaces became part of the column names, so you need to refer to the second column as id1[' z']. By modifying the read statement, this is no longer necessary:
>>> id1 = pd.read_csv('id1.txt', skipinitialspace=True)
>>> id1.z.dtype
dtype('float64')
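The rename step can also be avoided by telling merge which column to use on each side. A small sketch with in-memory frames standing in for id1.txt and id2.txt (only a subset of the columns is shown, as an illustration):

```python
import pandas as pd

id1 = pd.DataFrame({'ID1': [101, 104, 105],
                    'z': [1.0, 1.5, 1.4],
                    'e': [1.2, 1.8, 2.0]})
id2 = pd.DataFrame({'ID2': [101, 107, 102, 103, 104],
                    'RA': [4.5, 90.1, 30.5, 60.1, 10.8],
                    'DEC': [10.5, 55.5, 3.3, 40.6, 5.6]})

# inner join on differently named key columns, then drop the redundant ID2 column
df = id1.merge(id2, left_on='ID1', right_on='ID2', how='inner').drop(columns='ID2')
print(df)
```

An inner join keeps only the IDs present in both frames, which matches the expected output in the question.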
