Suppose there are a few rather long numpy arrays:
import random
import numpy as np

long_array1 = np.array([random.random() for i in range(10000)])
long_array2 = np.array([random.random() for i in range(10000)])
long_array3 = np.array([random.random() for i in range(10000)])
I would like to save the arrays into the file file.dat, one row per numpy array.
The text representation of an array should be in a Python list-like format; i.e., for the following numpy array:
a = np.array([0.3213,0.145323,0.852,0.723,0.421452])
I want to save the following line in the file:
[0.3213,0.145323,0.852,0.723,0.421452]
Here is what I do:
array1_str = ",".join([str(item) for item in long_array1]);
array2_str = ",".join([str(item) for item in long_array2]);
array3_str = ",".join([str(item) for item in long_array3]);
with open("file.dat","w") as file_arrays:
file_arrays.write("[" + array1_str + "]\n");
file_arrays.write("[" + array2_str + "]\n");
file_arrays.write("[" + array3_str + "]\n");
Everything actually works fine. I am just doubtful about the efficiency of my code, and I am almost sure there has to be a better, more efficient way to do this.
I welcome comments on the random list generation as well.
This is the fastest way:
','.join(map(str, long_array1.tolist()))
If you want to keep the text more compact, this is fast too:
','.join(map(lambda x: '%.7g' % x, long_array1.tolist()))
Source: I benchmarked every possible method for this as the maintainer of the pycollada library.
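For completeness, a minimal sketch of writing one of the question's arrays to file.dat with this approach (using only the names already defined in the question):
import random
import numpy as np

long_array1 = np.array([random.random() for i in range(10000)])

# Bracket the comma-joined string to match the requested list-like format.
with open("file.dat", "w") as file_arrays:
    file_arrays.write("[" + ",".join(map(str, long_array1.tolist())) + "]\n")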
Since you want a Python-list-like format, how about actually using the Python list format?
array1_str = repr(list(long_array1))
That's going to stay mostly in C-land and performance should be much better.
If you don't want the spaces, take 'em out after:
array1_str = repr(list(long_array1)).translate(None, " ")
Memory usage may be an issue, however.
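A small sketch of that idea; note it uses .tolist() rather than list() so the elements are plain Python floats, and str.replace to drop the spaces (which also works on Python 3):
# repr() of a list of Python floats gives "[0.3213, 0.145323, ...]";
# removing the spaces yields the requested compact form.
array1_str = repr(long_array1.tolist()).replace(" ", "")
with open("file.dat", "w") as file_arrays:
    file_arrays.write(array1_str + "\n")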
It sounds like you might be able to use numpy.savetxt() for this; something like:
import numpy

def dump_array(outfile, arraylike):
    outfile.write('[')
    numpy.savetxt(outfile, arraylike, newline=',', fmt="%s")
    outfile.write(']\n')
although I don't think the corresponding numpy.loadtxt() will be able to read this format back in.
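A hedged usage sketch for the arrays from the question (assuming a text-mode file handle and a reasonably recent NumPy; note that newline=',' leaves a trailing comma before the closing bracket):
# Write all three arrays, one bracketed row each.
with open("file.dat", "w") as outfile:
    for arr in (long_array1, long_array2, long_array3):
        dump_array(outfile, arr)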
So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but they are not numbers that I can use for anything. I want to use the numbers to plot them in a graph, but the elements in the array are text strings. How would I turn them into numbers and remove unnecessary characters such as commas and n=, for instance?
Here is my code, and below is my print output.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write, based on the post mentioned above) and within a try/except block, so failed conversions are simply dropped from your list.
To make it more consistent, any conversion error should discard the whole line affected, so you might want to use intermediate variables and a check for each field.
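A minimal sketch of such a strtofloat helper (hypothetical, since the referenced post isn't shown here, so the exact cleanup rules are an assumption):
def strtofloat(word):
    # Drop a leading label such as 'n=' and any trailing comma, then convert.
    cleaned = word.strip().lstrip('n=').rstrip(',')
    try:
        return float(cleaned)
    except ValueError:
        return None  # caller can discard the whole line when this happens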
By the way, I noticed the argument to the extract function; it would seem logical to make that argument a string containing the file name from which to extract the data.
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
I would consider using a regular expression:
import re

match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')

delta_x, abs_error, n = [], [], []  # collected values
for line in infile:  # infile: the open data file from your extract function
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
But it seems like your data is already in CSV format, so try using pandas. Read the data into a DataFrame without a header (the column names will then be integers):
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, if you would like to remove the 'n=' prefix and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)
My script cleans unwanted strings like "##$!" and other junk out of arrays.
The script works as intended, but it becomes extremely slow when the Excel file has many rows.
I tried using numpy to see if it could speed things up, but I'm not too familiar with it, so I might be using it incorrectly.
import numpy as np
import pandas as pd
from tqdm import tqdm

xls = pd.ExcelFile(path)
df = xls.parse("Sheet2")
TeleNum = np.array(df['telephone'].values)

def replace(orignstr):  # removes the unwanted strings from numbers
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    return orignstr

for UncleanNum in tqdm(TeleNum):
    newnum = replace(str(UncleanNum))  # calling the replace function
    df['telephone'] = df['telephone'].replace(UncleanNum, newnum)  # store the string back in the data frame
I also tried removing the function to see if that would help, and just placing everything as one block of code, but the speed remained the same.
for UncleanNum in tqdm(TeleNum):
    orignstr = str(UncleanNum)
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    print(orignstr)
    df['telephone'] = df['telephone'].replace(UncleanNum, orignstr)
TeleNum = np.array(df['telephone'].values)
The current speed of the script on an Excel file with 200,000 rows is around 70 it/s, and it takes around an hour to finish, which is not great since this is just one function of many.
I'm not too advanced in Python; I'm just learning as I script, so any pointers would be appreciated.
Edit:
Most of the array elements I'm dealing with are numbers, but some have strings in them. I'm trying to remove all of the string characters from the array elements.
Ex.
FD3459002912
*345*9002912$
If you are trying to clear everything that isn't a digit from the strings, you can use re.sub directly, like this:
import re

string = "FD3459002912"
regex_result = re.sub(r"\D", "", string)
print(regex_result)  # 3459002912
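To address the speed concern from the question, a vectorized sketch follows (an assumption on my part: df['telephone'] holds strings and every non-digit character should go; Series.str.replace with regex=True applies the substitution to the whole column without a Python-level loop):
# Hypothetical column-wide cleanup of df['telephone'] from the question above.
df['telephone'] = df['telephone'].astype(str).str.replace(r'\D', '', regex=True)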
I'm writing code for college right now that works with very large amounts of data, using PyTables with various matrices so as not to overflow memory, and it's been working well so far.
Right now I need to assign an integer identifier (from 0 upwards) to a number of distinct strings, store the assignment, and be able to get the corresponding integer for a certain string and vice versa. Of course, normal in-memory types don't cut it; there are just too many strings, so I need to use something that works with files, like PyTables.
I thought of just using a one-dimensional PyTables EArray (because I can't know in advance how many strings there will be), storing the strings there, and letting the index of each element be the assigned integer identifier of that string.
This is an example of what I thought of using:
>>> import tables as tb, numpy as np
>>> file = tb.open_file("sample_file.hdf5", mode='w')
>>> sample_array = file.create_earray(file.root, 'data', tb.StringAtom(itemsize=50),
...                                   shape=(0,), expectedrows=10000)
>>> sample_array.append(np.array(["String_value"]))
That way I can get the string value for a given integer, just like with any normal array:
>>> sample_array[0]
b'String_value'
But I can't for the life of me find out how to do the opposite, to find the index given the string; I'm only coming up with more absurd ways of doing it...
>>> sample_array[np.where("String_value") in sample_array]
b'String_value'
>>> sample_array[np.where("String_value")]
array([b'String_value'], dtype='|S50')
>>> np.where("String_value") in sample_array
False
Thank you in advance!
EDIT:
Forgot to update: I figured it out while working on something else... Facepalmed hard, very hard. It was really stupid, but I couldn't figure out what was wrong for hours.
>>> np.where(sample_array[:] == b'String_value')
(array([0]),)
OP answered his own question above. However, it's buried under EDIT:, so it's not obvious in search results (or to the casual reader). Also, there is another way to approach the problem (using a Table instead of an EArray). This answer provides a comparison of the two methods.
OP's solution with an EArray (with some embellishment):
import tables as tb, numpy as np
h5f = tb.open_file("sample_file.hdf5", mode='w')
sample_array = h5f.create_earray(h5f.root, 'data', tb.StringAtom(itemsize=50),
shape=(0,), expectedrows=10000)
sample_array.append(np.array(['str_val0']))
sample_array.append(np.array(['str_val10']))
sample_array.append(np.array(['str_val20']))
sample_array.append(np.array(['str_val30']))
sample_array.append(np.array(['str_val40']))
print (sample_array[0])
print (sample_array[-1])
print (np.where(sample_array[:] == b'str_val0'))
print (np.where(sample_array[:] == b'str_val40'))
print ('\n')
h5f.close()
Output looks like this:
b'str_val0'
b'str_val40'
(array([0], dtype=int64),)
(array([4], dtype=int64),)
My approach with a Table:
I like Tables in Pytables. They are handy because they have multiple built-in search and iteration methods (in this case using .get_where_list(); there are many others). This example shows Table creation from a np.recarray (uses dtype to define fields/columns, and data to populate the table). Additional data rows are added later with the .append() method.
import tables as tb, numpy as np
h5f = tb.open_file("sample_file.hdf5", mode='w')
simple_recarray = np.recarray((4,),dtype=[('tstr','S50')])
simple_recarray['tstr'][0] = 'str_val1'
simple_recarray['tstr'][1] = 'str_val2'
simple_recarray['tstr'][2] = 'str_val10'
simple_recarray['tstr'][3] = 'str_val20'
simple_table = h5f.create_table(h5f.root, 'table_data', simple_recarray, 'Simple dataset')
print (simple_table.get_where_list("tstr == b'str_val1'"))
print (simple_table.get_where_list("tstr == b'str_val20'"))
simple_table.append([('str_val30',), ('str_val31',)])
print (simple_table.get_where_list("tstr == b'str_val31'"))
h5f.close()
Output looks like this (slightly different b/c strings are not stored in arrays):
[0]
[3]
[5]
My problem is really quite simple.
I have 100 images on my computer; those images are called 1.ppm, 2.ppm, and so on up to 100.ppm.
I want to read each image into a variable using imread and then perform a few operations. I want to do exactly the same thing to all of the images.
My question is this: instead of copy-pasting one hundred times, is it possible to use imread in a loop? Something like:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/'i'.ppm')
Instead of copy-pasting the same code block and just changing the picture number a hundred times, I want to do this in a loop.
I have a similar issue with numpy.load. I want to load files called ICA1 ICA2 etc up to ICA100. Is it possible to write something like
numpy.load('/home/oria/Desktop/ICA DB/ICA'i'.npy)?
Like this:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/%s.ppm' % (i))
Or, like this:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/' + str(i) + '.ppm')
Go ahead and read the article on basic string operations, as well as this simple article on string formatting.
If I correctly understand what you're asking, it could be done as:
for i in range(1, 101):
    x = io.imread('/home/oria/Desktop/more pics/' + str(i) + '.ppm')
Note that the high end of the range function is not inclusive, so using range(1, 100) would only produce 1, 2, 3...99. Also note that i must be converted to a string or you will receive TypeError: cannot concatenate 'str' and 'int' objects.
import cv2
import os
def load_images_from_folder(folder):
    images = []
    for filename in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, filename))
        if img is not None:
            images.append(img)
    return images
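A hypothetical usage with the folder from the question (note this reads every file in the folder, in directory-listing order rather than numeric order):
images = load_images_from_folder('/home/oria/Desktop/more pics')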
Just use str.format, passing the variable i:
for i in range(1,100):
    X = io.imread('/home/oria/Desktop/more pics/{}.ppm'.format(i))
When you want to load with numpy do the same thing again:
for i in range(1,100):
    X = numpy.load('/home/oria/Desktop/ICA DB/ICA{}.npy'.format(i))
I have data stored in comma delimited txt files. One of the columns represents a datetime.
I need to load each column into separate numpy arrays (and decode the date into a python datetime object).
What is the fastest way to do this (in terms of run time)?
NB. the files are several hundred MB of data and currently take several minutes to load in.
e.g. mydata.txt
15,3,0,2003-01-01 00:00:00,12.2
15,4.5,0,2003-01-01 00:00:00,13.7
15,6,0,2003-01-01 00:00:00,18.4
15,7.5,0,2003-01-01 00:00:00,17.9
15,9,0,2003-01-01 00:00:00,17.7
15,10.5,0,2003-01-01 00:00:00,16.3
15,12,0,2003-01-01 00:00:00,17.2
Here is my current code (it works, but is slow):
import csv
import datetime
import time
import numpy
a=[]
b=[]
c=[]
d=[]
timestmp=[]
myfile = open('mydata.txt',"r")
# Read in the data
csv_reader = csv.reader(myfile)
for row in csv_reader:
    a.append(row[0])
    b.append(row[1])
    c.append(row[2])
    timestmp.append(row[3])
    d.append(row[4])
a = numpy.array(a)
b = numpy.array(b)
c = numpy.array(c)
d = numpy.array(d)
# Convert Time string list into list of Python datetime objects
times = []
time_format = "%Y-%m-%d %H:%M:%S"
for i in xrange(len(timestmp)):
    times.append(datetime.datetime.fromtimestamp(time.mktime(time.strptime(timestmp[i], time_format))))
Is there a more efficient way to do this?
Any help is very much appreciated -thanks!
(edit: In the end the bottleneck turned out to be with the datetime conversion, and not reading the file as I originally assumed.)
First, you should run your sample script with Python's built-in profiler to see where the problem actually might be. You can do this from the command-line:
python -m cProfile myscript.py
Secondly, something that jumps out at me: why is that loop at the bottom necessary? Is there a technical reason it can't be done while reading mydata.txt, in the loop you have above the instantiation of the numpy arrays?
Thirdly, you should create the datetime objects directly, since datetime also supports strptime. You don't need to build a time struct, convert it to a timestamp, and then make a datetime from that timestamp.
Your loop at the bottom can just be re-written like this:
times = []
timestamps = []
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
for t in timestmp:
    parsed_time = datetime.datetime.strptime(t, TIME_FORMAT)
    times.append(parsed_time)
    timestamps.append(time.mktime(parsed_time.timetuple()))
I took the liberty of PEP-8ing your code a bit, such as changing your constant to all caps. Also, you can iterate over a list just by using the in operator.
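Following the second point, a sketch of folding the conversion into the reading loop (assuming the column layout of mydata.txt shown above):
import csv
import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
a, b, c, d, times = [], [], [], [], []

with open('mydata.txt') as myfile:
    for row in csv.reader(myfile):
        a.append(float(row[0]))
        b.append(float(row[1]))
        c.append(float(row[2]))
        times.append(datetime.datetime.strptime(row[3], TIME_FORMAT))
        d.append(float(row[4]))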
Try numpy.loadtxt(); its docstring has a good example.
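For instance, a minimal sketch for the mydata.txt layout above (two passes over the file: the numeric columns in one call, the datetime column as strings in another):
import numpy as np
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Numeric columns 0, 1, 2 and 4 in a single pass.
a, b, c, d = np.loadtxt('mydata.txt', delimiter=',', usecols=(0, 1, 2, 4), unpack=True)

# Datetime column 3 as strings, parsed afterwards.
raw_times = np.loadtxt('mydata.txt', delimiter=',', usecols=(3,), dtype=str)
times = [datetime.strptime(t, TIME_FORMAT) for t in raw_times]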
You can also try passing copy=False when calling numpy.array, since the default behavior is to copy the data; this can speed up the script (especially since you said it processes a lot of data).
npa = numpy.array(ar, copy=False)
If you follow Mahmoud Abdelkader's advice and use the profiler, and find out that the bottleneck is in the csv loader, you could always try replacing your csv_reader with this:
for line in open("mydata.txt"):
    row = line.split(',')
    a.append(int(row[0]))
    b.append(float(row[1]))  # column 1 holds values like 4.5, so float rather than int
    c.append(int(row[2]))
    timestmp.append(row[3])
    d.append(float(row[4]))
More probably, though, you have a lot of data conversions; the last loop for the time conversion in particular will take a long time if you have millions of conversions! If you manage to do it all in one step (read + convert), plus take Terseus' advice on not copying the arrays into numpy duplicates, you will reduce execution times.
I'm not completely sure whether this will help, but you may be able to speed up the reading of the file by using ast.literal_eval. For example:
from ast import literal_eval
myfile = open('mydata.txt', "r")
mylist = []
for line in myfile:
    line = line.strip()
    e = line.rindex(",")
    row = literal_eval('[%s"%s"%s]' % (line[:e-19], line[e-19:e], line[e:]))
    mylist.append(row)
a, b, c, timestamp, d = zip(*mylist)
# a, b, c, timestamp, and d are what they were after your csv_reader loop