Removing line breaks from numpy array [duplicate] - python

This question already has answers here:
How do I print the full NumPy array, without truncation?
(22 answers)
Closed 3 years ago.
I have a function that calculates the average vector for each name (each name is made of many words); it returns a numpy.ndarray with shape (100,). The resulting vector looks like the following:
[ 0.00127441 0.0002633 0.00039622 0.00055501 0.00070984 -0.00089766
-0.00073814 -0.00224919 0.00233035 -0.00037628 0.00125402 -0.00052623
0.00114087 -0.00070441 -0.00419099 0.00031204 -0.0002703 -0.00290918
...(13 lines)
0.00260704 -0.00000406 -0.00160876 0.00134342]
Upon receiving the numpy array, I try to remove the line breaks as follows:
temp = ["%.8f" % number for number in name_avg_vector]
temp = re.sub('\s+', ' ', temp)
name_avg_vector = np.array(list(temp))
but I am getting the following error:
---> 79 temp=re.sub('\s+', ' ', name_avg_vector)
TypeError: cannot use a string pattern on a bytes-like object
I also tried changing the print options, but I still get line breaks in the file that stores the numpy array values:
import sys
np.set_printoptions(threshold=sys.maxsize)
np.set_printoptions(threshold=np.inf)
Afterwards, I tried array_repr to remove the line breaks:
name_avg_vector = np.array_repr(name_avg_vector).replace('\n', '')
but it saves as:
['array([-0.00849786, 0.00113221, -0.00643946, 0.00437448, -0.00740928, 0.00381133, 0.00178376, -0.00065115, -0.00050142, -0.0001178 , 0.00029183, 0.00015484, -0.00001569, 0.0006973 , 0.00051486, 0.00006652, -0.00099618, -0.00049231, 0.0003479 , 0.00135821, 0.00078396, 0.00038927, 0.00040825, -0.00093267, 0.00025755, -0.00012063, -0.00074733, 0.00120466, 0.00041425, -0.00062592, 0.00098112, 0.00101578, -0.00048335, 0.00079251, -0.00112981,
...
-0.00050014, 0.00133685, -0.00020537, -0.00082505])']
As stated by Anoyz here, converting to a list gets rid of the line breaks, e.g. name_avg_vector.tolist().
Thanks

Your numpy array appears to have dtype float, so it doesn't actually contain any newlines. What you are seeing are presumably the line breaks numpy inserts when you do something like print(name_avg_vector). One way to solve the problem is to write your own loop that prints the values in the format you want.
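For example, a minimal sketch of such a loop, assuming name_avg_vector is a 1-D float array and "vectors.txt" is a placeholder output file:

# Build the whole line yourself, so no wrapping is ever introduced.
line = " ".join("%.8f" % v for v in name_avg_vector)
with open("vectors.txt", "a") as f:
    f.write(line + "\n")

(Raising the wrap width with np.set_printoptions(linewidth=10000) would also stop the line breaks when printing, but formatting the values yourself gives you full control over the output file.)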

Related

How to split chunks of xy data into lists between isalpha() and newline \n

So I've got a cleaned-up datafile of number strings, representing coordinates for polygons. I've had experience assigning one polygon's data from a datafile into a column and plotting it in numpy/matplotlib, but here I have to plot multiple polygons from one datafile separated by headers. The data isn't evenly sized either; every header is followed by several lines of data in two columns, but not the same number of lines.
i.e. I've used .readlines() to go from:
# Title of the polygons
# a comment on the datasource
# A comment on polygon projection
Poly Two/a bit
(331222.6210000003, 672917.1531000007)
(331336.0946000004, 672911.7816000003)
(331488.4949000003, 672932.4191999994)
##etc
Poly One
[(331393.15660000034, 671982.1392999999), (331477.28839999996, 671959.8816), (331602.10170000046, 671926.8432999998), (331767.28160000034, 671894.7273999993), (331767.28529999964, 671894.7267000005), (##etc)]
to:
PolyOneandTwo
319547.04899999965,673790.8118999992
319553.2614000002,673762.4122000001
319583.4143000003,673608.7760000005
319623.6182000004,673600.1608000007
319685.3598999996,673600.1608000007
##etc
PolyTwoandabit
319135.9966000002,673961.9215999991
319139.7357999999,673918.9201999996
319223.0153000001,673611.6477000006
319254.6040000003,673478.1133999992
##etc etc
PolyOneHundredFifty
##etc
My code so far involves cleaning the original dataset up to make it like you see above:
data_easting = []
data_northing = []
County = open('counties.dat', 'r')
for line in County.readlines():
    if line.lstrip().startswith('#'):
        print('Comment line ignored and leading whitespace removed')
        continue
    line = line.replace('/', 'and').replace(' ', '').replace('[', '').replace(']', '').replace('),(', '\n')
    line = line.strip('()\n')
    print(line)
    if line.isalpha():
        print('Skipped header: ' + line)
        continue
I've been using isalpha() to ignore the headers for each polygon so far, and I was planning on using if line == '\n': continue and line.split(',') to skip the newlines between data blocks and begin splitting the easting and northing lists. I've already got the numpy and matplotlib section of the code (not shown) sorted for making one polygon, but I don't know how to extend it to plot multiple arrays/multiple polygons.
I realised early on, though, that if I tried to assign all the data to the two x and y lists, that would just make one large unending polygon and a spaghetti mess of my plot, as imaginary lines would be drawn to connect the pieces.
I want to use the isalpha() section to instead identify the polygon names and make a dictionary or list of them, and attach an array for each polygon datablock to that, but I'm not sure how to implement it (or if you even can). I'm also not certain how to make it stop loading data into a list at the end of a polygon datablock (maybe if line == '\n': break? but how do I make it start and stop again 149 more times for the other chunks?).
To make it a bit more difficult, there are 150 polygons with x and y data in this file, so making 150 individual x and y lists and writing specific code for each wouldn't be very efficient.
So, how do I basically do:
if line.isalpha():
    # (assign to a Counties dictionary or a list as PolyOne, PolyTwo, ... PolyOneHundredFifty)
    # (a way of getting the data between the header and newline into a separate list)
    # (a way to relate that PolyOne data list of x and y to the dictionary "PolyOne")
if line == '\n':
    # (break? continue?)
    # (then restart and repeat for PolyTwo, ... PolyOneHundredFifty)
line.split(',')
data_easting.append(x)  # x1, x2, ... x150?
data_northing.append(y)  # y1, y2, ... y150?
Is there a way of doing what I intend? How would I go about that without pandas?
Thanks for your time.
Parsing the raw data/file:
When you encounter a line/block like the second in your example,
>>> s = '''[(331393.15660000034, 671982.1392999999), (331477.28839999996, 671959.8816), (331602.10170000046, 671926.8432999998), (331767.28160000034, 671894.7273999993), (331767.28529999964, 671894.7267000005)]'''
it can be converted directly to a 2-d numpy array as follows using ast.literal_eval, which is a safe way to convert text to a Python object, in this case a list of tuples.
>>> import numpy as np
>>> import ast
>>> if s.startswith('['):
...     # print(ast.literal_eval(s))
...     array = np.array(ast.literal_eval(s))
...
>>> array
array([[331393.1566, 671982.1393],
       [331477.2884, 671959.8816],
       [331602.1017, 671926.8433],
       [331767.2816, 671894.7274],
       [331767.2853, 671894.7267]])
>>> array.shape
(5, 2)
For the blocks that resemble the first in your (raw) example, accumulate each line as a tuple of floats in a list; when you reach the next block, make an array of that list and reset it. I put this all in a generator function which yields blocks as 2-d arrays.
import ast
import numpy as np

def parse(lines_read):
    data = []
    for line in lines_read:
        if line.startswith('#'):
            continue
        elif line.startswith('('):
            n, e = line[1:-2].split(',')
            data.append((float(n), float(e)))
        elif line.startswith('['):
            array = np.array(ast.literal_eval(line))
            yield array
        else:
            if data:
                array = np.array(data)
                data = []
                yield array
    if data:  # don't drop a trailing '(...)' block at the end of the file
        yield np.array(data)
Used like this:
>>> for block in parse(f.readlines()):
...     print(block)
...     print('*******************')
...
[[331222.621  672917.1531]
 [331336.0946 672911.7816]
 [331488.4949 672932.4192]]
*******************
[[331393.1566 671982.1393]
 [331477.2884 671959.8816]
 [331602.1017 671926.8433]
 [331767.2816 671894.7274]
 [331767.2853 671894.7267]]
*******************
>>>
If you need to select the northing or easting columns separately, consult the Numpy docs.
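For example, continuing the session above and assuming the (n, e) tuple order used in parse:
>>> northing = block[:, 0]  # first column, the n values
>>> easting = block[:, 1]   # second column, the e values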
Parsing with two regular expressions. This operates on the whole file read as a string: s = fileobject.read(). It needs to go over the file twice and does not preserve the block order.
import re, ast
import numpy as np

pattern1 = re.compile(r'(\n\([^)]+\))+')
pattern2 = re.compile(r'^\[[^]]+\]', flags=re.MULTILINE)

for m in pattern1.finditer(s):
    block = m.group().strip().split('\n')
    data = []
    for line in block:
        line = line[1:-1]
        n, e = map(float, line.split(','))
        data.append((n, e))
    print(np.array(data))
    print('****')

for m in pattern2.finditer(s):
    print(np.array(ast.literal_eval(m.group())))
    print('****')
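If you also want the dictionary of polygon names that the question asks about, here is a sketch built on the same line-by-line idea. It assumes the cleaned format shown in the question, i.e. one coordinate pair per line, and that every non-comment, non-coordinate line is a header:

import numpy as np

def parse_named(lines_read):
    # yields (name, 2-d array) pairs
    name, data = None, []
    for line in lines_read:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        if line.startswith('(') or line[0].isdigit():
            x, y = line.strip('()').split(',')
            data.append((float(x), float(y)))
        else:  # header line: flush the previous block
            if name is not None and data:
                yield name, np.array(data)
            name, data = line, []
    if name is not None and data:  # don't lose the last block
        yield name, np.array(data)

with open('counties.dat') as f:
    counties = dict(parse_named(f))

counties['PolyOneandTwo'] then holds an (N, 2) array you can pass straight to matplotlib.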

How to turn items from extracted data to numbers for plotting in python?

So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but the entries are not numbers I can use for anything. I want to plot the numbers in a graph, but the elements in the array are text strings. How would I turn them into numbers and remove unnecessary characters like commas and n=, for instance?
Here is my code, and below is my print output.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write, based on the mentioned post) and within a try/except block, so failed conversions are simply dropped from your list.
To make it more consistent, any conversion error should discard the whole affected line, so you might want to use intermediate variables and a check for each field.
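For illustration, a minimal sketch of such a helper (strtofloat is just the hypothetical name used above):

import re

def strtofloat(word):
    # keep only characters that can appear in a float literal,
    # e.g. 'n=1' -> '1' and '0.1,' -> '0.1'; None signals a failed conversion
    cleaned = re.sub(r'[^0-9eE+\-.]', '', word)
    try:
        return float(cleaned)
    except ValueError:
        return None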
By the way, I noticed the argument to the extract function; it would seem logical to make that argument a string containing the name of the file from which to extract the data.
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
I would consider using a regular expression:
import re

match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')

for line in infile:
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
But it seems like your data is already in CSV format, so try using pandas. Read the data into a dataframe without a header (the column names will be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, if you would like to remove the n= and convert each element to an integer, you may try the following.
import numpy as np

array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
                  'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)
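Alternatively, a sketch that stays in numpy, assuming every entry really has the n= prefix:

import numpy as np

array = np.array(['n=1', 'n=2', 'n=19'])
numbers = np.char.lstrip(array, 'n=').astype(int)  # strip leading 'n'/'=' characters, then cast
print(numbers)  # [ 1  2 19]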

matplotlib.pyplot.plot, ValueError: could not convert string to float: f

I'm trying to use Python to make plots. Below is a simplified version of my code that causes the error.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use("AGG")

distance = np.array('f')
depth = np.array('f')  # make sure these two arrays store float type values
with open('Line671.txt', 'r') as datafile:
    for line in datafile:
        word_count = 0
        for word in line.split():
            word = float(word)  # convert string type to float
            if word_count == 0:
                distance = np.append(distance, word)
            elif word_count == 1:
                depth = np.append(depth, word)
            else:
                print 'Error'
            word_count += 1
datafile.closed
print depth
print distance  # outputs look correct
# original data is like this: -5.3458000e+00
# output of the array is: ['f' '-5.3458' '-5.3463' ..., '-5.4902' '-5.4912' '-5.4926']
plt.plot(depth, distance)  # here comes the problem
The error message says that in the line plt.plot(depth, distance): ValueError: could not convert string to float: f
I don't understand this because it seems I converted all string values into float type. I tried to search for this problem on Stack Overflow, but the answers all seem to solve it by casting the string values to float or int. Can anyone give any suggestions on this problem? I would really appreciate any help.
You confused the value with the type. If you're trying to declare the type, you need to use dtype=. What you actually did was stick the single character 'f' into the array.
To answer a later question, your line
word = float(word)
likely worked just fine. However, we can't tell because you didn't do anything with the resulting value. Are you expecting this to alter the original inside the variable "line"? Common variables don't work that way.
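A minimal sketch of the fix, assuming the same two-column file (note that np.append returns a new array rather than modifying in place):

import numpy as np

distance = np.array([], dtype=float)  # empty float array, not np.array('f')
depth = np.array([], dtype=float)

with open('Line671.txt') as datafile:
    for line in datafile:
        words = line.split()
        distance = np.append(distance, float(words[0]))
        depth = np.append(depth, float(words[1]))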

How do I get a text output from a string created from an array to remain unshortened?

Python/Numpy problem. Final-year Physics undergrad... I have a small piece of code that creates an array (essentially an n×n matrix) from a formula. I reshape the array to a single column of values, create a string from that, format it to remove extraneous brackets etc., then output the result to a text file saved in the user's Documents directory, which is then used by another piece of software. The trouble is that above a certain value for n, the output gives me only the first and last three values, with "...," in between. I think that Python is automatically abridging the final result to save time and resources, but I need all those values in the final text file, regardless of how long it takes to process, and I can't for the life of me find out how to stop it. Relevant code copied beneath...
import numpy as np; import os.path; import os
'''
Create a single column matrix in text format from Gaussian Eqn.
'''
save_path = os.path.join(os.path.expandvars("%userprofile%"), "Documents")
name_of_file = 'outputfile'  # <---- change this as required.
completeName = os.path.join(save_path, name_of_file + ".txt")

matsize = 32

def gaussf(x, y):  # defining gaussian but can be any f(x,y)
    pisig = 1/(np.sqrt(2*np.pi) * matsize)  # first term
    sumxy = (-(x**2 + y**2))  # sum of squares term
    expden = (2 * (matsize/1.0)**2)  # 2 sigma squared
    expn = pisig * np.exp(sumxy/expden)  # and put it all together
    return expn

matrix = [[gaussf(x, y)]
          for x in range(-matsize/2, matsize/2)
          for y in range(-matsize/2, matsize/2)]
zmatrix = np.reshape(matrix, (matsize*matsize, 1))  # single column
string2 = (str(zmatrix).replace('[', '').replace(']', '').replace(' ', ''))
zbfile = open(completeName, "w")
zbfile.write(string2)
zbfile.close()
print completeName

num_lines = sum(1 for line in open(completeName))
print num_lines
Any help would be greatly appreciated!
Generally you should iterate over the array/list if you just want to write the contents.
zmatrix = np.reshape(matrix, (matsize*matsize, 1))

with open(completeName, "w") as zbfile:  # with closes your files automatically
    for row in zmatrix:
        zbfile.writelines(map(str, row))
        zbfile.write("\n")
Output:
0.00970926751178
0.00985735189176
0.00999792646484
0.0101306077521
0.0102550302672
0.0103708481917
0.010477736974
0.010575394844
0.0106635442315
.........................
But using numpy we simply need to use tofile:
zmatrix = np.reshape(matrix, (matsize*matsize, 1))
# pass sep or you will get binary output
zmatrix.tofile(completeName,sep="\n")
Output is in the same format as above.
Calling str on the matrix gives you output formatted the same way as when you print it, so what you were writing to the file was exactly that formatted, truncated output.
Considering you are using Python 2, xrange would be more efficient than range, which creates a list. Also, having multiple imports separated by semicolons is not recommended; you can simply:
import numpy as np, os.path, os
Also, variable and function names should use underscores: z_matrix, zb_file, complete_name, etc.
You shouldn't need to fiddle with the string representations of numpy arrays. One way is to use tofile:
zmatrix.tofile('output.txt', sep='\n')
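If you also want control over the number format, np.savetxt is another option; a sketch, where the fmt string is just an example:

np.savetxt('output.txt', zmatrix, fmt='%.12g')  # one value per line, 12 significant digits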

Writing to a multi dimensional array with split

I am trying to use Python to parse a text file (whose path is stored in the variable trackList) with times and titles in it. It looks like this:
00:04:45 example text
00:08:53 more example text
12:59:59 the last bit of example text
My regular expression (rem) works, and I am able to split each string (i) into two parts correctly (that is, I separate the times from the text), but I am unable to then add the arrays that the split returns (using .extend) to the large array I created earlier (sLines).
f = open(trackList)
count = 0
sLines = [[0 for x in range(0)] for y in range(34)]
line = []
for i in f:
    count += 1
    line.append(i)
    rem = re.match("\A\d\d\:\d\d\:\d\d\W", line[count-1])
    if rem:
        sLines[count-1].extend(line[count-1].split(' ', 1))
    else:
        print("error on line: " + count)
That code should go through each line in the file trackList, test whether the line is as expected, and if so separate the time from the text and save the result as an array inside an array at the index one less than the current line number; if not, it should print an error pointing me to the offending line.
I use array[count-1] as python arrays are zero indexed and file lines are not.
I use .extend() as I want both elements of the smaller array added to the larger array in the same iteration of the parent for loop.
So, you have some pretty confusing code there.
For instance doing:
[0 for x in range(0)]
Is a really fancy way of initializing an empty list:
>>> [] == [0 for x in range(0)]
True
Also, how do you know to make a matrix that is 34 lines long? You're also confusing yourself by calling your line 'i' in your for loop; usually that name is reserved as shorthand for an index, which you'd expect to be a numerical value. Appending i to line and then re-referencing it as line[count-1] is redundant when you already have the line in your variable (i).
Your overall code can be simplified to something like this:
import re

# load the file and extract the lines
f = open(trackList)
lines = f.readlines()
f.close()

# compile the expression once (faster inside loops)
expr = re.compile(r'^(\d\d:\d\d:\d\d)\s*(.*)$')

sLines = []
# loop over the lines, collecting both the index (i) and the line (line)
for i, line in enumerate(lines):
    result = expr.match(line)
    # validate the line
    if not result:
        print("error on line: " + str(i+1))
        # add an invalid entry to the matrix
        sLines.append([])  # or whatever you want as your invalid line
        continue
    # add the (time, text) tuple to the matrix
    sLines.append(result.groups())
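For the sample file above, sLines then ends up holding one (time, title) tuple per valid line:

[('00:04:45', 'example text'),
 ('00:08:53', 'more example text'),
 ('12:59:59', 'the last bit of example text')]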
