Merging Multiple (Ideally) JSON Files Into One - python

Simple enough situation; I'm working from within a directory which contains a script, and a subdirectory at the same level which contains many JSON files.
Ideally using Python, I'd like to combine all of the JSON files into one. Depending on your suggestion, this may leave behind redundant headers, but I can pop those off as I convert the combined file into a Python dictionary object. Not a problem.
The problem is that I have been unable to combine the files into one. I'm practicing on text files for a start, to no avail. I'm using the Python "os" module, but no luck. Specifically:
path = "/Users/me/ScriptsAndData/BagOfJSON"
...
for filename in os.listdir(path):
    with open(filename, 'rb') as readfile:
        ...
Results in the error;
with open(filename, 'rb') as readfile:
FileNotFoundError: [Errno 2] No such file or directory: 'firstFile.JSON'
So the loop finds and names the first file in the directory, but can't actually open it as a file.
tldr;
I'm trying to merge multiple JSON files, all located within a single directory, into a single JSON file. If you know how to do this for any filetype, I'd be happy to know how you do it, then build from there.
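For reference, here is roughly the shape I'm aiming for once the file-opening part works (the output name combined.json, and merging everything into a single list, are just placeholders):
import os
import json

path = "/Users/me/ScriptsAndData/BagOfJSON"

combined = []
for filename in os.listdir(path):
    if filename.lower().endswith(".json"):
        # os.listdir gives bare names, so join them back onto the directory
        with open(os.path.join(path, filename)) as readfile:
            combined.append(json.load(readfile))

# write everything out as a single JSON array
with open("combined.json", "w") as outfile:
    json.dump(combined, outfile)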
Cheers!

Related

How to load data set having multiple 'No-extension files' in python?

I am trying to load a dataset for my machine learning project and it requires me to load files having no extensions.
I tried :
import os
import glob
files = filter(os.path.isfile, glob.glob("./[0-9]*"))
for name in files:
    with open(name) as fh:
        contents = fh.read()
But this doesn't return anything; mainly, that glob call finds nothing.
Also tried :
import os
import glob
path = './dataset1/training_validation/2012-07-10/'
for infile in glob.glob(os.path.join(path, '*')):
print("test")
file = open(infile, 'r')
print(file)
but this prints nothing, because that glob call returns [].
I'm stuck here and couldn't find anything on the internet.
My actual problem is to load 'no-extension files in a training and testing set' from two folders, validation and test. I can iterate through the folders but don't know how to handle those file types.
When I open those files in a text editor, it shows me something like this.
So I know that it's a binary format of an image, but have no idea how can I store and train them.
Any help would be appreciated. Thanks.
Two things:
File extensions (.txt, .dat, .bat, .f90, etc.) are not meaningful to Python, at least when using glob or numpy or something of the sort, because the extension is just part of a string. Some of us were raised (within Windows) to believe that file extensions mean something (I too fell for it).
The file you are looking at is a text file containing the ASCII representation of a binary image made of 0s and 1s. So it's not a binary file, and it's not an image file per se, but it is a text file, which means we can read it as such from Python.
To read this in, you could do either:
1. Use numpy to do data = numpy.loadtxt(<filename>); however, you might have trouble delimiting the digits.
2. Use Python's standard open function on the file and loop through each line using for line in <file_handle>:. This way each row of data is a string, which can be parsed easily (see the documentation on string indexing); a sketch of this follows below.
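A minimal sketch of that second option, assuming each line of the file is an unbroken run of 0s and 1s (the filename "0001" is just a placeholder):
with open("0001") as fh:
    rows = []
    for line in fh:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # each character is one pixel, e.g. "0101" -> [0, 1, 0, 1]
        rows.append([int(ch) for ch in line])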
Good luck!
IMO this simply means that your path does not exist.
As a first test, perhaps try an absolute path to your folder, since you may have confused the relative position of the folder with respect to your current working directory.
I got it to work with the following code.
from os import listdir
from os.path import isfile, join
import random

fileNames = [f for f in listdir(dirName) if isfile(join(dirName, f))]
random.shuffle(fileNames)
for files in fileNames:
    data = open(dirName + '/' + files, 'r')
Thanks for your responses.

How to work with CSV files inside a zipped folder?

I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk; in the zipfile docs for Python there is ZipFile.open(), which gives you a file-like object. That is an object that mostly behaves like a regular file on disk, but lives in memory. Reading it gives bytes, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.filelist)
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            mc = myfile.read()
            # decode the bytes and let csv parse the resulting text
            c = csv.reader(io.StringIO(mc.decode()))
            for row in c:
                print(row)
Python's documentation is actually quite good once you have learned how to find things, as well as some of the basic programming terms/descriptions used in it.
The csv module cannot consume raw bytes directly, hence the extra decoding step through io.StringIO.
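If the CSV files are large, a streaming variant that avoids reading each member fully into memory is to wrap the file-like object in io.TextIOWrapper so csv can decode it on the fly (a sketch, assuming UTF-8 content):
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            # decode the bytes lazily and parse rows as they are read
            reader = csv.reader(io.TextIOWrapper(myfile, encoding='utf-8'))
            for row in reader:
                print(row)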

Reading gzipped data in Python

I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import tarfile
import os

subset_path = r'c:\data\grant\files'  # raw string keeps the backslashes literal
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
subset_data_path = os.path.join(subset_path, 'subset_full')
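If you only need to read the files and don't want anything written to disk, tarfile can also hand back file-like objects for individual members, roughly like this (the .txt filter is just an assumption about the archive's contents):
import tarfile

with tarfile.open("subset_full.tar.gz") as tar:
    for member in tar.getmembers():
        if member.isfile() and member.name.endswith('.txt'):
            f = tar.extractfile(member)  # file-like object for this member
            print(f.read())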

taking data from files which are in folder

How do I get the data from multiple txt files placed in a specific folder? I started with this but could not fix it. It gives an error like 'No such file or directory: .idea' (??)
(Let's say I have a folder A, and in it there are x.txt, y.txt, z.txt and so on. I am trying to get and print the information from all the files x, y, z.)
def find_get(folder):
    for file in os.listdir(folder):
        f = open(file, 'r')
        for data in open(file, 'r'):
            print data

find_get('filex')
Thanks.
If you just want to print each line:
import glob
import os
def find_get(path):
    for f in glob.glob(os.path.join(path, "*.txt")):
        with open(f) as data:  # glob already returns the joined path
            for line in data:
                print(line)
glob will find only your .txt files in the specified path.
Your error comes from not joining the path to the filename: unless the file is in the same directory you are running the code from, Python will not be able to find it without the full path. Another issue is that you seem to have a directory named .idea, which would also give you an error when you try to open it as a file. This also presumes you actually have permission to read the files in the directory.
If your files were larger I would avoid reading all into memory and/or storing the full content.
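For instance, if you needed to pass the lines along rather than just print them, a generator keeps only one line in memory at a time (a sketch in the same spirit as the code above):
import glob
import os

def iter_lines(path):
    for f in glob.glob(os.path.join(path, "*.txt")):
        with open(f) as data:
            for line in data:
                # yield lines one at a time instead of storing them
                yield line

for line in iter_lines("filex"):
    print(line)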
First of all make sure you add the folder name to the file name, so you can find the file relative to where the script is executed.
To do so you want to use os.path.join, which, as its name suggests, joins paths. So, using a generator:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield f.read()

# this consumes the generator to a list
files_data = list(find_get('filex'))
See what we got in the list that consumed the generator:
print files_data
It may be more convenient to produce tuples which can be used to construct a dict:
def find_get(folder):
    for filename in os.listdir(folder):
        relative_file_path = os.path.join(folder, filename)
        with open(relative_file_path) as f:
            # read() gives the entire data from the file
            yield (relative_file_path, f.read(), )

# this consumes the generator to a dict
files_data = dict(find_get('filex'))
You will now have a mapping from each file's path to its content.
Also, take a look at the answer by @Padraic Cunningham. He brought up the glob module, which is suitable in this case.
The error you're facing is simple: listdir returns filenames, not full pathnames. To turn them into pathnames you can access from your current working directory, you have to join them to the directory path:
for filename in os.listdir(directory):
    pathname = os.path.join(directory, filename)
    with open(pathname) as f:
        # do stuff
So, in your case, there's a file named .idea in the folder directory, but you're trying to open a file named .idea in the current working directory, and there is no such file.
There are at least four other potential problems with your code that you also need to think about and possibly fix after this one:
You don't handle errors. There are many very common reasons you may not be able to open and read a file--it may be a directory, you may not have read access, it may be exclusively locked, it may have been moved since your listdir, etc. And those aren't logic errors in your code or user errors in specifying the wrong directory, they're part of the normal flow of events, so your code should handle them, not just die. Which means you need a try statement.
You don't do anything with the files but print out every line. Basically, this is like running cat folder/* from the shell. Is that what you want? If not, you have to figure out what you want and write the corresponding code.
You open the same file twice in a row, without closing in between. At best this is wasteful, at worst it will mean your code doesn't run on any system where opens are exclusive by default. (Are there such systems? Unless you know the answer to that is "no", you should assume there are.)
You don't close your files. Sure, the garbage collector will get to them eventually--and if you're using CPython and know how it works, you can even prove the maximum number of open file handles that your code can accumulate is fixed and pretty small. But why rely on that? Just use a with statement, or call close.
However, none of those problems are related to your current error. So, while you have to fix them too, don't expect fixing one of them to make the first problem go away.
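A sketch of what addressing those points might look like, while keeping the same behaviour of printing every line:
import os

def find_get(folder):
    for filename in os.listdir(folder):
        pathname = os.path.join(folder, filename)
        if not os.path.isfile(pathname):
            continue  # skip subdirectories
        try:
            # with ensures the file is closed even if reading fails
            with open(pathname) as f:
                for line in f:
                    print(line)
        except (IOError, OSError) as e:
            print('could not read {0}: {1}'.format(pathname, e))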
Full variant:
import os
def find_get(path):
    files = {}
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            with open(os.path.join(path, file), "r") as data:
                files[file] = data.read()
    return files
print(find_get("filex"))
Output:
{'1.txt': 'dsad', '2.txt': 'fsdfs'}
After that, you could generate one file from that content, etc. (see the sketch after the list below).
Key things:
os.listdir returns a list of file names without the full path, so you need to join the initial path onto each found item before operating on it.
A dict is ideal for collecting the contents :)
os.listdir returns both files and folders, so you need to check whether each list item is really a file.
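As mentioned above, a minimal sketch of that last step, concatenating every collected file's content into one output file (merged.txt is just a placeholder name):
files = find_get("filex")  # the dict returned by the full variant above

with open("merged.txt", "w") as out:
    for name in sorted(files):
        out.write(files[name])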
You should check if the file is actually a file and not a folder, since you can't open folders for reading. Also, you can't just open the bare file name, since it is under a folder, so you should build the correct path with os.path.join. Check below:
import os

def find_get(folder):
    for file in os.listdir(folder):
        if not os.path.isfile(os.path.join(folder, file)):
            continue  # skip other directories
        f = open(os.path.join(folder, file), 'r')
        for line in f:
            print line

simple python file search and record to csv

import os, csv
f=open("C:\\tempa\\file.csv", 'wb') #write to an existing blank csv file
w=csv.writer(f)
for path, dirs, files, in os.walk("C:\\tempa"):
for filename in files:
w.writerow([filename])
Running Win7 64-bit with the latest Python, using Anaconda Spyder and PyScripter; the issue persists regardless of the IDE.
I have some media in folders inside tempa (jpg, pdf and mov...), and I wanted to get a file list of all of them. The code works, but it stops without any error at row 113; there is nothing special about the file it stops on, no weird characters.
I could have 3 blocks of code, one for each folder, to work around this weird bug, but it shouldn't be an issue. The folders are all in the root folder, without going too deep into subfolders:
C:\
  tempa
    jpg
    pdf
    mov
I have heard there are issues with os.walk, but I didn't expect anything weird like this.
Maybe I need an f.close()?
You were examining the file before it was fully closed. (f won't be closed until, at least, it is no longer referenced by any in-scope variable name.) If you examine a file before it is closed, you may not see the final, partial, data buffer.
Use the file object's context manager to ensure that the file is flushed and closed in all cases:
import os, csv
with open("C:\\tempa\\file.csv", 'wb') as f: #write to an existing blank csv file
w=csv.writer(f)
for path, dirs, files, in os.walk("C:\\tempa"):
for filename in files:
w.writerow([filename])
# Now no need for f.close()
