Writing a function to combine arrays from different files - python

Sorry for the novice question. I'm just starting to learn Python and I don't have any coding background. I already ended up doing this process manually, but I'm curious what the automated process would look like and would like to learn from the example.
So I have a folder of 50 npz files. I need to pull a specific 29x5 array from each npz file and concatenate all of it into a single csv. This is what I did manually:
import numpy as np
import os
os.chdir('D:/Documents/WorkingDir')
data1 = np.load('file1.npz', mmap_mode='r')
array1 = data1.f.array
#data2 = etc.
#array2 = etc.
grandarray = np.concatenate((array1, array2), axis=0)
np.savetxt('grandarray.csv', grandarray, delimiter=",")
I gather you can use glob to get a list of all files in the same folder with the .npz extension, but I can't figure out how to turn my manual process into a script and automate it. I'll gladly take links to tutorial websites that can get me going in this direction as well. Thank you all for your time.

You need to use iteration. A plain for loop would be fine, but a list comprehension works nicely here.
import glob
import numpy as np
import os
os.chdir('D:/Documents/WorkingDir')
filenames = glob.glob('*.npz')
data_arrays = [np.load(filename, mmap_mode='r').f.array for filename in filenames]
grandarray = np.concatenate(data_arrays, axis=0)
np.savetxt('grandarray.csv', grandarray, delimiter=",")
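For comparison, here is the same thing as an explicit for loop (identical behavior, just more verbose):
import glob
import numpy as np
import os

os.chdir('D:/Documents/WorkingDir')
data_arrays = []
for filename in glob.glob('*.npz'):
    # .f.array grabs the array that was saved under the name 'array' in each .npz
    data_arrays.append(np.load(filename, mmap_mode='r').f.array)
grandarray = np.concatenate(data_arrays, axis=0)
np.savetxt('grandarray.csv', grandarray, delimiter=",")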

Related

How do I save each iteration as my file format without overwriting the previous iteration?

I am new to coding. I basically have a bunch of files in "nifti" format; I wanted to simply load them, apply a thresholding function to them, and then save them. I was able to write the few lines of code to do it to one file (it worked), but since I have many files I created another Python file and tried to make a for loop. I think it does everything fine, but the last step keeps overwriting the saved file, so in the end I only get one output file.
import numpy as np
import nibabel as nb
import glob
import os
path = 'subjects'
all_files = glob.glob(path + '/*.nii')
for filename in all_files:
    image = nb.load(filename)
    data = image.get_fdata()
    data[data < 0.1] = 0
    new_image = nb.Nifti1Image(data, affine=image.affine, header=image.header)
    nb.save(new_image, filename + 1)
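To save each result without clobbering anything, build a distinct output path for each input instead of reusing filename. A minimal sketch, where the _thresh suffix is just an illustrative choice:
import glob
import os
import nibabel as nb

for filename in glob.glob('subjects/*.nii'):
    image = nb.load(filename)
    data = image.get_fdata()
    data[data < 0.1] = 0
    new_image = nb.Nifti1Image(data, affine=image.affine, header=image.header)
    # 'subjects/foo.nii' -> 'subjects/foo_thresh.nii'
    root, ext = os.path.splitext(filename)
    nb.save(new_image, root + '_thresh' + ext)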

Difficulty combining csv files into a single file

My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own csv file so '2009.csv', '2010.csv', all the way to '2018.csv'
Each file is roughly 700mb
I used the following to combine the csv files:
import pandas as pd
import numpy as np
import os, sys
import glob
os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index=False, encoding='utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.
I am presuming that my file is too large and that I will need to run this on a virtual machine (e.g. AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all the data into a dataframe at the same time. Since they all have the same structure, I would advise creating one file writer, then opening each input file with a reader and writing (if we want to be fancy, let's call it streaming) the data across line by line. Just be careful not to copy the header each time, since you only want it once. Pandas is simply not the best tool for this task :)
In general, this is a typical task that can also be done easily, and even faster, directly on the command line (the exact command depends on the OS).
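A minimal Python sketch of that streaming approach, assuming every file has the same single header line (the output filename is arbitrary):
import glob

files = sorted(glob.glob('*.csv'))   # collect inputs before creating the output
with open('combined_airline_csv.csv', 'w', encoding='utf-8-sig') as out:
    for i, name in enumerate(files):
        with open(name, encoding='utf-8') as f:
            header = f.readline()
            if i == 0:
                out.write(header)    # keep the header line from the first file only
            for line in f:
                out.write(line)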

Loading Multiple Data files from same folder in Python

I am trying to load a large number of data files from the same folder in Python. The ultimate goal here is to simply choose which file I would like to use in calculations, rather than individually opening files.
Here is what I have. This seems to work for opening the data in the files, but I am having a hard time choosing a specific file to work with (and assigning a value to each column in each file).
import astropy
import numpy as np
import matplotlib.pyplot as plt
dir = '/S34_east_tfa/'
import glob, os
os.chdir(dir)
for file in glob.glob("*.data"):
    data = np.loadtxt(file)
    print(data)
    Time = data[:, 0]
Use a Python dictionary instead of overwriting the results in the data variable inside your loop.
data_dict = dict()
for file in glob.glob("*.data"):
    data_dict[file] = np.loadtxt(file)
Is this what you were looking for?
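A specific file can then be picked out by name and its columns sliced off (the filename and second column below are hypothetical):
data = data_dict["star_01.data"]  # hypothetical filename
Time = data[:, 0]                 # first column
Flux = data[:, 1]                 # second column, if the file has one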

Python: unziping special files into memory and getting them into a DataFrame

I'm quite stuck with some code I'm writing in Python. I'm a beginner and maybe it's really easy, but I just can't see it. Any help would be appreciated, so thank you in advance :)
Here is the problem: I have to read some special data files with a special extension .fen into a pandas DataFrame. These .fen files are inside a zipped file .fenx that contains the .fen file and a .cfg configuration file.
In the code I've written I use the zipfile library to unzip the files, and then get them into the DataFrame. The code is the following:
import zipfile
import numpy as np
import pandas as pd
def readfenxfile(Directory, File):
    fenxzip = zipfile.ZipFile(Directory + '\\' + File, 'r')
    fenxzip.extractall()
    fenxzip.close()
    cfgGeneral, cfgDevice, cfgChannels, cfgDtypes = readCfgFile(Directory, File[:-5] + '.CFG')
    # readCfgFile reads the .cfg file and returns some important data.
    # Here only cfgDtypes matters: it contains the type of the data inside the
    # .fen file, and it becomes the column index of the final DataFrame.
    if cfgChannels != None:
        dtDtype = eval('np.dtype([' + cfgDtypes + '])')
        dt = np.fromfile(Directory + '\\' + File[:-5] + '.fen', dtype=dtDtype)
        dt = pd.DataFrame(dt)
    else:
        dt = []
    return dt, cfgChannels, cfgDtypes
Now, the extractall() call saves the unzipped files to the hard drive. The .fenx files can be quite big, so this need to store them (and delete them afterwards) is really slow. I would like to do the same as now, but getting the .fen and .cfg files into memory, not onto the hard drive.
I have tried things like fenxzip.read('whateverthenameofthefileis.fen') and some other methods like .open() from the zipfile library. But I can't get what .read() returns into a numpy array in any way I tried.
I know it can be a difficult question to answer, because you don't have the files to try and see what happens. But if someone would have any ideas I would be glad of reading them. :) Thank you very much!
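For what it's worth, the bytes that ZipFile.read() returns can usually be decoded in memory with np.frombuffer, with no temporary file at all. A minimal sketch, where the archive name, member name, and record dtype are all illustrative:
import zipfile
import numpy as np
import pandas as pd

with zipfile.ZipFile('example.fenx') as z:       # illustrative archive name
    raw = z.read('example.fen')                  # bytes of the archived .fen member
dt = np.dtype([('t', '<f8'), ('v', '<f8')])      # illustrative record layout
df = pd.DataFrame(np.frombuffer(raw, dtype=dt))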
Here is the solution I finally found, in case it can be helpful for anyone. It uses the tempfile library to create a temporary file that lives in memory.
import zipfile
import tempfile
import numpy as np
import pandas as pd
def readfenxfile(Directory, File, ExtractDirectory):
    fenxzip = zipfile.ZipFile(Directory + '\\' + File, 'r')
    # spill to disk only if the member exceeds max_size bytes
    fenfile = tempfile.SpooledTemporaryFile(max_size=10000000000, mode='w+b')
    fenfile.write(fenxzip.read(File[:-5] + '.fen'))
    cfgGeneral, cfgDevice, cfgChannels, cfgDtypes = readCfgFile(fenxzip, File[:-5] + '.CFG')
    if cfgChannels != None:
        dtDtype = eval('np.dtype([' + cfgDtypes + '])')
        fenfile.seek(0)
        dt = np.fromfile(fenfile, dtype=dtDtype)
        dt = pd.DataFrame(dt)
    else:
        dt = []
    fenfile.close()
    fenxzip.close()
    return dt, cfgChannels, cfgDtypes

Iterate a simple calculation in Pandas across multiple files

The code below generates a sum from the "Value" column of a csv file called 'File1.csv'.
How do I apply this code to every file in a directory and place the sums in a new file called Sum.csv?
import pandas as pd
import numpy as np
df = pd.read_csv("~/File1.csv")
df["Value"].sum()
Many thanks!
There's probably a nice way to do this with a pandas Panel, but this is a basic Python implementation.
import os
import pandas as pd
# Get the home directory (not recommended, work somewhere else)
directory = os.environ["HOME"]
# Read all files in the directory, filter out non-csv
files = [os.path.join(directory, f)
         for f in os.listdir(directory) if f.endswith(".csv")]
# Make a list of tuples [(filename, sum)]
sums = [(filename, pd.read_csv(filename)["Value"].sum())
        for filename in files]
# Make a dataframe and write it out
df = pd.DataFrame(sums, columns=["filename", "sum"])
df.to_csv(os.path.join(directory, "files_with_sum.csv"))
Note that the built-in Python os.listdir() doesn't understand "~/" like pandas does, so we read the home directory from the environment instead. Using the home directory isn't really recommended, so this gives any adopter of this code an opportunity to set a different path.
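If you would rather keep a "~/"-style path, os.path.expanduser expands it the same way pandas does:
import os
directory = os.path.expanduser("~/")  # e.g. '/home/user/'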
