How can I optimize this code to cope with a larger dataset?

I have a set of images that are all the same size, and I want to load them into a dataframe whose rows are the image names and whose columns are the pixels. The images are all in the same directory.
I can already do this for a folder with a few images (as shown in the "Example for 7 images" link below), but when I try it on a dataset of 9912 images, the process is terminated and prints "Killed". How can I optimize this code to handle all the images?
from matplotlib import image
import numpy as np
import pandas as pd
import glob
columns = ["file"]
for i in range(150528):
    columns.append("pixel" + str(i))
df = pd.DataFrame(columns=columns)
i = 0
for file in glob.glob('/home/nuno/resizepics/*.jpg'):
    imgarr = image.imread(file)
    imgarr = imgarr.flatten()
    df.loc[i, "file"] = file
    for j in range(len(imgarr)):
        df.iloc[i, j+1] = imgarr[j]
    i += 1
#print(df)
df.to_csv('pixels.csv')
Example for 7 images

If "Killed" means the script raises an error, you can try using exceptions (try, except, else) so that it resumes from the point where it stopped. You can also try slowing it down a little with the time module, since it is working with a large amount of data.
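A bare "Killed" message with no Python traceback often means the operating system ended the process, typically because it ran out of memory, and a try/except cannot catch that. One way to keep memory use flat is to skip the dataframe and write each image to the CSV as soon as it is read. The following is only a minimal sketch under that assumption: write_pixel_rows and its parameters are illustrative names, not part of the original code, and load_image stands for any callable such as matplotlib's image.imread.

```python
import csv

import numpy as np


def write_pixel_rows(paths, out_path, load_image, n_pixels):
    """Write one CSV row per image, holding only one image in memory at a time."""
    header = ["file"] + [f"pixel{i}" for i in range(n_pixels)]
    rows_written = 0
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for path in paths:
            # load_image is an assumed callable returning an array-like image
            flat = np.asarray(load_image(path)).ravel()
            if flat.size != n_pixels:
                raise ValueError(f"{path}: expected {n_pixels} pixels, got {flat.size}")
            writer.writerow([path] + flat.tolist())
            rows_written += 1
    return rows_written
```

With the question's setup this could be called as write_pixel_rows(sorted(glob.glob('/home/nuno/resizepics/*.jpg')), 'pixels.csv', image.imread, 150528). Unlike df.to_csv, this sketch does not write an index column.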

Related

How can I get my dicom splitter/divider to save split/divided images?

I have cobbled together some Python code to work through a folder of DICOM files, splitting each image in two.
All my DICOM files are X-rays of both the left and right feet, and I need to separate them.
To do this I am adapting some code produced by #g_unit seen here
Unfortunately, this attempt results in two unaltered copies of the original file, unsplit. It does work when writing the files as PNG or JPG, but not when writing them as DICOM. My test image in the console also looks good.
In the example below, I am using a folder with only one file in it. I will adapt it to write the new files and filenames once I get my single sample to work.
import matplotlib.pyplot as plt
import pydicom
import pydicom as pd
import os
def main():
    path = 'C:/.../test_block_out/'
    # iterate through the names of contents of the folder
    for file in os.listdir(path):
        # create the full input path and read the file
        input_path = os.path.join(path, file)
        dataset = pd.dcmread(input_path)
        shape = dataset.pixel_array.shape
        # get the half of the x dimension. For the y dimension use shape[0]
        half_x = int(shape[1] / 2)
        # slice the halves
        # [first_axis, second_axis] so [:,:half_x] means slice all from first axis, slice 0 to half_x from second axis
        left_part = dataset.pixel_array[:, :half_x].tobytes()
        right_part = dataset.pixel_array[:,half_x:].tobytes()
        #Save halves
        path_to_left_image = 'C:.../test_file/left.dcm'
        path_to_right_image = 'C:.../test_file/right.dcm'
        dataset.save_as(path_to_left_image, left_part)
        dataset.save_as(path_to_right_image, right_part)
        #print test image
        plt.imshow(dataset.pixel_array[:, :half_x])
        #plt.imshow(dataset.pixel_array[:,half_x:])
if __name__ == '__main__':
    main()
I have tried writing the pixel array to dataset.PixelData, but this throws the error:
ValueError: The length of the pixel data in the dataset (5120000 bytes) doesn't match the expected length (10240000 bytes). The dataset may be corrupted or there may be an issue with the pixel data handler.
Which makes sense, since it's half my original dimensions. It will write a DCM, but I cannot load that DCM into any DICOM viewer tools ('Decode error!').
Is there a way to get this to write the files as DCMs, not PNGs? Or will the DCMs always break if the dimensions are incorrect?

A kind colleague has helped by providing the answer.
The issue was that I was saving "dataset", not "left_part".
The solution was to create a new pydicom object by deep copying the dataset read from the DCM file, and then modifying the copy (both its PixelData and its Columns value).
Code below:
import copy  # needed for copy.deepcopy below
# iterate through the names of contents of the folder
for file in os.listdir(path):
    # create the full input path and read the file
    input_path = os.path.join(path, file)
    dataset = pd.dcmread(input_path)
    left_part = copy.deepcopy(dataset)
    right_part = copy.deepcopy(dataset)
    shape = dataset.pixel_array.shape
    # get the half of the x dimension. For the y dimension use shape[0]
    half_x = int(shape[1] / 2)
    # slice the halves
    # [first_axis, second_axis] so [:,:half_x] means slice all from first axis, slice 0 to half_x from second axis
    left_part.PixelData = dataset.pixel_array[:, :half_x].tobytes()
    left_part['Columns'].value=half_x
    right_part.PixelData = dataset.pixel_array[:,half_x:].tobytes()
    right_part['Columns'].value=shape[1]-half_x
    #Save halves
    path_to_left_image = os.path.join(path, 'left_'+file)
    path_to_right_image = os.path.join(path, 'right_'+file)
    left_part.save_as(path_to_left_image)
    right_part.save_as(path_to_right_image)
    #print test image
    plt.imshow(left_part.pixel_array)
    plt.show()
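The ValueError from the question comes down to arithmetic: for uncompressed, single-frame data, the byte length of PixelData has to equal Rows x Columns x SamplesPerPixel x (BitsAllocated / 8) as declared in the header, which is why Columns is updated together with PixelData. A minimal NumPy-only sketch of that check follows. The 2000 x 2560, 16-bit, single-sample image is a hypothetical size chosen only because it reproduces the 10240000 bytes in the error message, and expected_pixel_bytes is an illustrative helper, not a pydicom function.

```python
import numpy as np


def expected_pixel_bytes(rows, columns, bits_allocated=16, samples_per_pixel=1):
    """Byte length an uncompressed single-frame image of this declared size should have."""
    return rows * columns * samples_per_pixel * (bits_allocated // 8)


# Hypothetical 16-bit grayscale image; the real dimensions are not given in the question.
rows, columns = 2000, 2560
pixels = np.zeros((rows, columns), dtype=np.uint16)

half_x = columns // 2
left = pixels[:, :half_x]
right = pixels[:, half_x:]

# Header still declares the full width, but the data covers only half of it: mismatch.
stale_header_bytes = expected_pixel_bytes(rows, columns)
# Header updated to the new width: the lengths agree.
left_header_bytes = expected_pixel_bytes(rows, left.shape[1])

print(len(left.tobytes()), stale_header_bytes, left_header_bytes)
# prints: 5120000 10240000 5120000
```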

Accelerating speed of reading contents from dataframe in pandas

Suppose we have a table with the following dimensions:
print(metadata.shape)  # (8732, 8)
Suppose we want to read slice_file_name for each row (and then read the sound files from the drive) and extract MFCC features:
def feature_extractor(file_name):
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features
If I use the following loop:
from tqdm import tqdm
extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])
it takes the following amount of time in total:
3555it [21:15, 2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
n_fft, y.shape[-1]
8326it [48:40, 3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
n_fft, y.shape[-1]
8329it [48:41, 3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
n_fft, y.shape[-1]
8732it [50:53, 2.86it/s]
How can I optimize this code so that it takes less time? Is it possible?

You could try running the feature extractor in parallel. This would give you a new column in your dataframe containing the mfccs_scaled_features.
import os
import librosa
import numpy as np
from pandarallel import pandarallel
pandarallel.initialize()
PATH = os.path.abspath(Base_Directory)
def feature_extractor(file_name):
    # If using Windows, you may need to put these imports here
    # import librosa
    # import numpy as np
    # import os
    file_name = os.path.join(PATH, file_name)
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features
# df is the dataframe called metadata in the question
df['mfccs_scaled_features'] = df['slice_file_name'].parallel_apply(feature_extractor)

How to concatenate three or more images vertically?

I am trying to concatenate three images vertically and need some help with the code. So far I have imported one image and cropped three smaller images of the same size from it. Now I want to concatenate them into one image that is long but narrow. However, I can't find an appropriate function, and when I do find one I get an error message when I apply it to my code.
I already tried making a collection from my three pictures and then using the function skimage.io.concatenate_images(sf_collection), but this results in a 4-dimensional array that cannot be displayed as an image.
sf_collection = (img1,img2,img3)
concat_page = skimage.io.concatenate_images(sf_collection)
My expected output is the three images concatenated vertically into one image (very long and narrow).

I've never used skimage.io.concatenate_images, but I think you are looking for np.concatenate. It defaults to axis=0 (a vertical stack), but you can specify axis=1 for a horizontal stack. This also assumes you have already loaded the images into array form.
from scipy.datasets import face  # scipy.misc.face in older SciPy releases; scipy.misc is deprecated
import numpy as np
import matplotlib.pyplot as plt
face1 = face()
face2 = face()
face3 = face()
merge = np.concatenate((face1,face2,face3))
plt.gray()
plt.imshow(merge)
which displays the three faces stacked vertically in a single image.
If you look at the skimage.io.concatenate_images docs, it uses np.concatenate too. It seems that function builds an array that holds a collection of images rather than merging them into a single image.
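The difference between the two results is easiest to see in the array shapes. A minimal sketch, assuming three hypothetical same-sized RGB arrays; np.stack is used here only to mimic a 4-D collection with the image index as the first axis, not as a claim about how skimage implements concatenate_images.

```python
import numpy as np

# Three hypothetical same-sized RGB crops (height 100, width 400).
img1 = np.zeros((100, 400, 3), dtype=np.uint8)
img2 = np.full((100, 400, 3), 128, dtype=np.uint8)
img3 = np.full((100, 400, 3), 255, dtype=np.uint8)

# Stacking along a NEW leading axis keeps the images separate: a 4-D collection.
collection = np.stack((img1, img2, img3))
print(collection.shape)  # (3, 100, 400, 3)

# Joining along the EXISTING row axis merges them into one tall 3-D image.
tall = np.concatenate((img1, img2, img3), axis=0)
print(tall.shape)  # (300, 400, 3)

# Merging the first two axes of the collection gives the same tall image.
merged = collection.reshape(-1, 400, 3)
print(np.array_equal(merged, tall))  # True
```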

Like this:
import numpy as np
h, w = 100, 400
yellow = np.zeros((h,w,3),dtype=np.uint8) + np.array([255,255,0],dtype=np.uint8)
red = np.zeros((h,w,3),dtype=np.uint8) + np.array([255,0,0],dtype=np.uint8)
blue = np.zeros((h,w,3),dtype=np.uint8) + np.array([0,0,255],dtype=np.uint8)
# Stack vertically
result = np.vstack((yellow,red,blue))
Use the following to stack side-by-side (horizontally):
result = np.hstack((yellow,red,blue))

Parallel process dealing with read a bunch of files and save into one Pandas.Dataframe

Background
For a place of interest, we want to extract some useful information from an open data source. Take meteorological data as an example: we only want to identify the long-term temporal pattern at one point, but the data files sometimes cover the whole world.
Here, I use Python to extract the vertical velocity at one location from FNL (ds083.2) files, every 6 hours over one year.
In other words, I want to read the original data and save the target variable along the timeline.
My attempt
import numpy as np
from netCDF4 import Dataset
import pygrib
import pandas as pd
import os, time, datetime
# Find the corresponding grid box
def find_nearest(array,value):
    idx = (np.abs(array-value)).argmin()
    return array[idx]
## Obtain the X, Y indices
site_x,site_y =116.4074, 39.9042 ## The area of interest
grib='./fnl_20140101_06_00.grib2' ## Any file, used only to obtain the lat-lon grid
grbs=pygrib.open(grib)
grb = grbs.select(name='Vertical velocity')[8]#
lon_list,lat_list = grb.latlons()[1][0],grb.latlons()[0].T[0]
x_indice = np.where(lon_list == find_nearest(lon_list, site_x))[0]
y_indice = np.where(lat_list == find_nearest(lat_list, site_y))[0]
def extract_vm():
    files = os.listdir('.') ### All files have already been saved in one path
    files.sort()
    dict_vm = {"V":[]}
    ### Traversing the files
    for file in files[1:]:
        if file[-5:] == "grib2":
            grib=file
            grbs=pygrib.open(grib)
            grb = grbs.select(name='Vertical velocity')[4] ## Select certain Z level
            data=grb.values
            data = data[y_indice,x_indice]
            dict_vm['V'].append(data)
    ff = pd.DataFrame(dict_vm)
    return ff
extract_vm()
My thought
How can I speed up the reading process? Right now I read the files sequentially, so the run time increases linearly with the length of the period being processed.
Can we split the files into several groups and process them separately on a multi-core processor? Is there any other advice for improving the speed of my code?
Any comments would be appreciated!
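One way the idea of splitting the files across cores could look, as a minimal sketch using the standard library's concurrent.futures. The names extract_all and extract_one are hypothetical; in this setting extract_one would be a function that opens one grib2 file with pygrib and returns the value at the chosen grid point.

```python
from concurrent.futures import ProcessPoolExecutor


def extract_all(files, extract_one, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Apply extract_one to every file in parallel, keeping the input order.

    extract_one must be a top-level function (so it can be pickled) when a
    process pool is used; pass ThreadPoolExecutor as executor_cls to use
    threads instead.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as `files`.
        return list(pool.map(extract_one, files))
```

The returned list could then be passed to pd.DataFrame({"V": values}), as in extract_vm. On platforms that start worker processes by spawning (Windows, and macOS by default), the call should sit under if __name__ == '__main__':.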

How do I import tif using gdal?

I'm trying to get my tif file in a usable format in Python, so I can analyze the data. However, every time I import it, I just get an empty list. Here's my code:
xValues = [447520.0, 432524.0, 451503.0]
yValues = [4631976.0, 4608827.0, 4648114.0]
gdal.AllRegister()
dataset = gdal.Open('final_snow.tif', GA_ReadOnly)
if dataset is None:
    print 'Could not open image'
    sys.exit(1)
data = np.array([gdal.Open(name, gdalconst.GA_ReadOnly).ReadAsArray() for name, descr in dataset.GetSubDatasets()])
print 'this is data ', data
It always prints an empty list, but it doesn't throw an error. I checked out other questions, such as "Create shapefile from tif file using GDAL". What might be the problem?

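A plain GeoTIFF has no subdatasets, so dataset.GetSubDatasets() returns an empty list and the list comprehension has nothing to read. For osgeo.gdal, it should look like this: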
from osgeo import gdal
gdal.UseExceptions() # not required, but a good idea
dataset = gdal.Open('final_snow.tif', gdal.GA_ReadOnly)
data = dataset.ReadAsArray()
Where data is either a 2D array for 1-banded rasters, or a 3D array for multiband.
An alternative with rasterio looks like:
import rasterio
with rasterio.open('final_snow.tif', 'r') as r:
    data = r.read()
Where data is always a 3D array, with the first dimension as band index.
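Since multiband data comes back with the band index first, while plotting functions such as matplotlib's imshow expect the band (channel) axis last, a small NumPy step is often needed before display. A minimal sketch, using a hypothetical 3-band array as a stand-in for data:

```python
import numpy as np

# Hypothetical stand-in for `data`: 3 bands, 4 rows, 5 columns (band index first).
data = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# A single band is a plain 2D array.
band1 = data[0]
print(band1.shape)  # (4, 5)

# Move the band axis to the end to get (rows, columns, bands).
band_last = np.moveaxis(data, 0, -1)
print(band_last.shape)  # (4, 5, 3)
```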
