How to load data set having multiple 'No-extension files' in python?

How to load data set having multiple 'No-extension files' in python? - python

I am trying to load a dataset for my machine learning project and it requires me to load files having no extensions.
I tried :
import os
import glob
files = filter(os.path.isfile, glob.glob("./[0-9]*"))
for name in files:
with open(name) as fh:
contents = fh.read()
But doesn't return anything, mainly that glob command has nothing in it.
Also tried :
import os
import glob
path = './dataset1/training_validation/2012-07-10/'
for infile in glob.glob(os.path.join(path, '*')):
print("test")
file = open(infile, 'r')
print(file)
but this returns [] because of that glob command.
I'm stuck in here and couldn't find anything over the internet.
My actual problem is to load 'no extension files in a training and testing set' from two folders, validation, and the test itself. I can iterate through the folder but don't know how to handle those file types.
When I open those files in a text editor. it shows me something like this.
So I know that it's a binary format of an image, but have no idea how can I store and train them.
any help would be appreciated. thanks.

Two things:
File extensions (.txt , .dat , .bat, .f90, etc.) are not meaningful to python, at least when using glob or numpy or something of the sort, because it's just part of a string. Some of us are raised (within Windows) to believe that file extensions mean something (I too fell for it).
The file you are looking at is a text file, containing the ASCII representation of a binary image on 0's and 1's. So, it's not a binary file, and it's not an image file (per-se), but it is a text file, which means we can read it as such from python.
To read this in, you could do either:
1. Use numpy to do data = numpy.loadtxt(<filename>), however you might have trouble delimiting the digits.
2. Use Python's standard open function on the file, and loop through each line using for line in <file_handle>:. This way, each row of data is a string, which can be parsed easily (see documentation on string indexing).
Good luck!

IMO this simply means that your path does not exist.
Perhaps you try in a first test an absolute path to your folder, as you eventually confused the relative position of the folder to your current working directory.

I got it to work with the following code.
fileNames = [f for f in listdir(dirName) if isfile(join(dirName, f))]
random.shuffle(fileNames)
for files in fileNames:
data = open(dirName+'/'+files,'r');
Thanks for your responses.

Related

For loop to iterate through a folder of images and output into a single JSON file

I am not a developer and I have a very limited knowledge of programming. I understand how to use and operate python scripts, however writing them is something I am yet to learn. Please can someone help a total noob :)
I am using an API by sightengine to asses a large folder of .jpg images for their sharpness and colour properties. The documentation for the API only provides a small script for assessing one image at a time. I have spoken to Sight Engines support and they are unwilling to provide a script for batch processing which is bizarre considering all other API companies usually do.
I need some help creating a for loop that will use a python script to iterate through a folder of images and output the API result into a single JSON file. Any help in how to structure this script would be extremely appreciated.
Here is the sightengine code for a simple one image check:
from sightengine.client import SightengineClient
client = SightengineClient("{api_user}", "{api_secret}")
output = client.check('properties','type').set_file('/path/to/local/file.jpg')
Thank you

Part of this is sort of a guess, as I don't know exactly what output will look like. I am assuming it's returned as json format. Which if thats the case, you can append the individual json responses into a single json structure, then use json.dump() to write to file.
So that part is a guess. The other aspect is you want to iterate through your jpg files, which you can do by using os and fnmatch. Just adjust the root directory/folder for it to walk through while it searches fro all the .jpg extensions.
from sightengine.client import SightengineClient
import os
from fnmatch import fnmatch
import json
client = SightengineClient("{api_user}", "{api_secret}")
# Get your jpg files into a list
r = 'C:/path/to/local'
pattern = "*.jpg"
filenames = []
for path, subdirs, files in os.walk(r):
for name in files:
if fnmatch(name, pattern):
#print (path+'/'+name)
filenames.append(path+'/'+name)
# Now iterate through those jpg files
jsonData = []
for file in filenames:
output = client.check('properties','type').set_file(file)
jsonData.append(output)
with open('C:/result.json', 'w') as fp:
json.dump(jsonData, fp, indent=2)

Iterating over .wav files in subdirectories of parent directory

Cheers everybody,
I need help with something in python 3.6 exactly. So i have structure of data like this:
|main directory
| |subdirectory's(plural)
| | |.wav files
I'm currently working from a directory where main directory is placed so I don't need to specify paths before that. So firstly I wanna iterate over my main directory and find all subdirectorys. Then in each of them I wanna find the .wav files, and when done with processing them I wanna go to next subdirectory and so on until all of them are opened, and all .wav files are processed. Exactly what I wanna do with those .wav files is input them in my program, process them so i can convert them to numpy arrays, and then I convert that numpy array into some other object (working with tensorflow to be exact, and wanna convert to TF object). I wrote about the whole process if anybody has any fast advices on doing that too so why not.
I tried doing it with for loops like:
for subdirectorys in open(data_path, "r"):
for files in subdirectorys:
#doing some processing stuff with the file
The problem is that it always raises error 13, Permission denied showing on that data_path I gave him but when I go to properties there it seems okay and all permissions are fine.
I tried some other ways like with os.open or i replaced for loop with:
with open(data_path, "r") as data:
and it always raises permission denied error.
os.walk works in some way but it's not what I need, and when i tried to modify it id didn't give errors but it also didnt do anything.
Just to say I'm not any pro programmer in python so I may be missing an obvious thing but ehh, I'm here to ask and learn. I also saw a lot of similiar questions but they mainly focus on .txt files and not specificaly in my case so I need to ask it here.
Anyway thanks for help in advance.

Edit: If you want an example for glob (more sane), here it is:
from pathlib import Path
# The pattern "**" means all subdirectories recursively,
# with "*.wav" meaning all files with any name ending in ".wav".
for file in Path(data_path).glob("**/*.wav"):
if not file.is_file(): # Skip directories
continue
with open(file, "w") as f:
# do stuff
For more info see Path.glob() on the documentation. Glob patterns are a useful thing to know.
Previous answer:
Try using either glob or os.walk(). Here is an example for os.walk().
from os import walk, path
# Recursively walk the directory data_path
for root, _, files in walk(data_path):
# files is a list of files in the current root, so iterate them
for file in files:
# Skip the file if it is not *.wav
if not file.endswith(".wav"):
continue
# os.path.join() will create the path for the file
file = path.join(root, files)
# Do what you need with the file
# You can also use block context to open the files like this
with open(file, "w") as f: # "w" means permission to write. If reading, use "r"
# Do stuff
Note that you may be confused about what open() does. It opens a file for reading, writing, and appending. Directories are not files, and therefore cannot be opened.
I suggest that you Google for documentation and do more reading about the functions used. The documentation will help more than I can.
Another good answer explaining in more detail can be seen here.

import glob
import os
main = '/main_wavs'
wavs = [w for w in glob.glob(os.path.join(main, '*/*.wav')) if os.path.isfile(w)]
In terms of permissions on a path A/B/C... A, B and C must all be accessible. For files that means read permission. For directories, it means read and execute permissions (listing contents).

Run only if "if " statement is true.!

So I've a question, Like I'm reading the fits file and then i'm using the information from the header of the fits to define the other files which are related to the original fits file. But for some of the fits file, the other files (blaze_file, bis_file, ccf_table) are not available. And because of that my code gives the pretty obvious error that No Such file or directory.
import pandas as pd
import sys, os
import numpy as np
from glob import glob
from astropy.io import fits
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
if filename.endswith("_e2ds_A.fits"):
e2ds_hdu = fits.open(filename)
e2ds_header = e2ds_hdu[0].header
date = e2ds_header['DATE-OBS']
date2 = date = date[0:19]
blaze_file = e2ds_header['HIERARCH ESO DRS BLAZE FILE']
bis_file = glob('HARPS.' + date2 + '*_bis_G2_A.fits')
ccf_table = glob('HARPS.' + date2 + '*_ccf_G2_A.tbl')
if not all(file in os.listdir(PATH) for file in [blaze_file,bis_file,ccf_table]):
continue
So what i want to do is like, i want to make my code run only if all the files are available otherwise don't. But the problem is that, i'm defining the other files as variable inside the for loop as i'm using the header information. So how can i define them before the for loop???? and then use something like
So can anyone help me out of this?

The filenames returned by os.listdir() are always relative to the path given there.
In order to be used, they have to be joined with this path.
Example:
PATH = os.path.join("home", "Desktop", "2d_spectra")
for filename in os.listdir(PATH):
if filename.endswith("_e2ds_A.fits"):
filepath = os.path.join(PATH, filename)
e2ds_hdu = fits.open(filepath)
…
Let the filenames be ['a', 'b', 'a_ed2ds_A.fits', 'b_ed2ds_A.fits']. The code now excludes the two first names and then prepends the file path to the remaining two.
a_ed2ds_A.fits becomes /home/Desktop/2d_spectra/a_ed2ds_A.fits and
b_ed2ds_A.fits becomes /home/Desktop/2d_spectra/b_ed2ds_A.fits.
Now they can be accessed from everywhere, not just from the given file path.
I should become accustomed to reading a question in full before trying to answer it.
The problem I mentionned is a problem if you don't start the script from any path outside the said directory. Nevertheless, applying it will make your code much more consistent.
Your real problem, however, lies somewhere else: you examine a file and then, after checking its contents, want to read files whose names depend on informations from that first file.
There are several ways to accomplish your goal:
Just extend your loop with the proper tests.
Pseudo code:
for file in files:
if file.endswith("fits"):
open file
read date from header
create file names depending on date
if all files exist:
proceed
or
for file in files:
if file.endswith("fits"):
open file
read date from header
create file names depending on date
if not all files exist:
continue # actual keyword, no pseudo code!
proceed
Put some functionality into functions (variation of 1.)
Create a loop in a generator function which yields the "interesting information" of one fits file (or alternatively nothing) and have another loop run over them to actually work with the data.
If I am still missing some points or am not detailled enough, please let me know.

Since you have to read the fits file to know the other dependant files names, there's no way you can avoid reading the fit file first. The only thing you can do is test for the dependant files existance before trying to read them and skip the rest of the loop (using continue) if not.

Edit this line
e2ds_hdu = fits.open(filename)
And replace with
e2ds_hdu = fits.open(os.path.join(PATH, filename))

Errors with Glob while outputting file names

I am combining two questions here because they are related to each other.
Question 1: I am trying to use glob to open all the files in a folder but it is giving me "Syntax Error". I am using Python 3.xx. Has the syntax changed for Python 3.xx?
Error Message:
File "multiple_files.py", line 29
files = glob.glob(/src/xyz/rte/folder/)
SyntaxError: invalid syntax
Code:
import csv
import os
import glob
from pandas import DataFrame, read_csv
#extracting
files = glob.glob(/src/xyz/rte/folder/)
for fle in files:
with open (fle) as f:
print("output" + fle)
f_read.close()
Question 2: I want to read input files, append "output" to the names and print out the names of the files. How can I do that?
Example: Input file name would be - xyz.csv and the code should print output_xyz.csv .
Your help is appreciated.

Your first problem is that strings, including pathnames, need to be in quotes. This:
files = glob.glob(/src/xyz/rte/folder/)
… is trying to divide a bunch of variables together, but the leftmost and rightmost divisions are missing operands, so you've confused the parser. What you want is this:
files = glob.glob('/src/xyz/rte/folder/')
Your next problem is that this glob pattern doesn't have any globs in it, so the only thing it's going to match is the directory itself.
That's perfectly legal, but kind of useless.
And then you try to open each match as a text file. Which you can't do with a directory, hence the IsADirectoryError.
The answer here is less obvious, because it's not clear what you want.
Maybe you just wanted all of the files in that directory? In that case, you don't want glob.glob, you want listdir (or maybe scandir): os.listdir('/src/xyz/rte/folder/').
Maybe you wanted all of the files in that directory or any of its subdirectories? In that case, you could do it with rglob, but os.walk is probably clearer.
Maybe you did want all the files in that directory that match some pattern, so glob.glob is right—but in that case, you need to specify what that pattern is. For example, if you wanted all .csv files, that would be glob.glob('/src/xyz/rte/folder/*.csv').
Finally, you say "I want to read input files, append "output" to the names and print out the names of the files". Why do you want to read the files if you're not doing anything with the contents? You can do that, of course, but it seems pretty wasteful. If you just want to print out the filenames with output appended, that's easy:
for filename in os.listdir('/src/xyz/rte/folder/'):
print('output'+filename)

This works in http://pyfiddle.io:
Doku: https://docs.python.org/3/library/glob.html
import csv
import os
import glob
# create some files
for n in ["a","b","c","d"]:
with open('{}.txt'.format(n),"w") as f:
f.write(n)
print("\nFiles before")
# get all files
files = glob.glob("./*.*")
for fle in files:
print(fle) # print file
path,fileName = os.path.split(fle) # split name from path
# open file for read and second one for write with modified name
with open (fle) as f,open('{}{}output_{}'.format(path,os.sep, fileName),"w") as w:
content = f.read() # read all
w.write(content.upper()) # write all modified
# check files afterwards
print("\nFiles after")
files = glob.glob("./*.*") # pattern for all files
for fle in files:
print(fle)
Output:
Files before
./d.txt
./main.py
./c.txt
./b.txt
./a.txt
Files after
./d.txt
./output_c.txt
./output_d.txt
./main.py
./output_main.py
./c.txt
./b.txt
./output_b.txt
./a.txt
./output_a.txt
I am on windows and would use os.walk (Doku) instead.
for d,subdirs,files in os.walk("./"): # deconstruct returned aktDir, all subdirs, files
print("AktDir:", d)
print("Subdirs:", subdirs)
print("Files:", files)
Output:
AktDir: ./
Subdirs: []
Files: ['d.txt', 'output_c.txt', 'output_d.txt', 'main.py', 'output_main.py',
'c.txt', 'b.txt', 'output_b.txt', 'a.txt', 'output_a.txt']
It also recurses into subdirs.

taking data from files which are in folder

How do I get the data from multiple txt files that placed in a specific folder. I started with this could not fix. It gives an error like 'No such file or directory: '.idea' (??)
(Let's say I have an A folder and in that, there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x,y,z)
def find_get(folder):
for file in os.listdir(folder):
f = open(file, 'r')
for data in open(file, 'r'):
print data
find_get('filex')
Thanks.

If you just want to print each line:
import glob
import os
def find_get(path):
for f in glob.glob(os.path.join(path,"*.txt")):
with open(os.path.join(path, f)) as data:
for line in data:
print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename, unless the file was in the same directory you were running the code from python would not be able to find the file without the full path. Another issue is you seem to have a directory .idea which would also give you an error when trying to open it as a file. This also presumes you actually have permissions to read the files in the directory.
If your files were larger I would avoid reading all into memory and/or storing the full content.

First of all make sure you add the folder name to the file name, so you can find the file relative to where the script is executed.
To do so you want to use os.path.join, which as it's name suggests - joins paths. So, using a generator:
def find_get(folder):
for filename in os.listdir(folder):
relative_file_path = os.path.join(folder, filename)
with open(relative_file_path) as f:
# read() gives the entire data from the file
yield f.read()
# this consumes the generator to a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print files_data
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
for filename in os.listdir(folder):
relative_file_path = os.path.join(folder, filename)
with open(relative_file_path) as f:
# read() gives the entire data from the file
yield (relative_file_path, f.read(), )
# this consumes the generator to a list
files_data = dict(find_get('filex'))
You will now have a mapping from the file's name to it's content.
Also, take a look at the answer by #Padraic Cunningham . He brought up the glob module which is suitable in this case.

The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
pathname = os.path.join(directory, filename)
with open(pathname) as f:
# do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.

Full variant:
import os
def find_get(path):
files = {}
for file in os.listdir(path):
if os.path.isfile(os.path.join(path,file)):
with open(os.path.join(path,file), "r") as data:
files[file] = data.read()
return files
print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After the you could generate one file from that content, etc.
Key-thing:
os.listdir return a list of files without full path, so you need to concatenate initial path with fount item to operate.
there could be ideally used dicts :)
os.listdir return files and folders, so you need to check if list item is really file

You should check if the file is actually file and not a folder, since you can't open folders for reading. Also, you can't just open a relative path file, since it is under a folder, so you should get the correct path with os.path.join. Check below:
import os
def find_get(folder):
for file in os.listdir(folder):
if not os.path.isfile(file):
continue # skip other directories
f = open(os.path.join(folder, file), 'r')
for line in f:
print line

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.