extracting git time recursively for subfolders and files - python

I am trying to create a dictionary with elements in the format filename: timestamp in yy-mm-dd hh:mm:ss. This should recursively include all subfolders and files in the repo. I came across this piece of code:
import git

repo = git.Repo("./repo")
tree = repo.tree()
for blob in tree:
    commit = next(repo.iter_commits(paths=blob.path, max_count=1))
    print(blob.path, commit.committed_date)
However, this includes only the top-level folders. How do I include subfolders and files recursively?
Note: The following solution by Roland here does not include subfolders, only files. Also, I need to be in the path where the git repo is downloaded and then run the script by giving its absolute path:
Get time of last commit for Git repository files via Python?

This works for me:
http://gitpython.readthedocs.io/en/stable/tutorial.html#the-tree-object
As per the doc: "As trees allow direct access to their intermediate child entries only, use the traverse method to obtain an iterator to retrieve entries recursively."
It creates a generator object which does the work:
print(tree.traverse())
<generator object traverse at 0x0000000004129DC8>
d = dict()
for blob in tree.traverse():
    commit = next(repo.iter_commits(paths=blob.path))
    d[blob.path] = commit.committed_date
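The question asks for timestamps in yy-mm-dd hh:mm:ss, while committed_date is a Unix epoch. A minimal sketch of the conversion (format_epoch is a hypothetical helper name; note that fromtimestamp uses the local timezone):

```python
from datetime import datetime

def format_epoch(epoch):
    # committed_date is seconds since the epoch; render it as yy-mm-dd hh:mm:ss
    return datetime.fromtimestamp(epoch).strftime("%y-%m-%d %H:%M:%S")
```

The dictionary could then be built with d[blob.path] = format_epoch(commit.committed_date).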

Related

Python, Inconsistent zip file extraction

I am trying to extract zip files using the zipfile module's extractall method.
My code snippet is:
import zipfile

file_path = '/something/airway.zip'
dir_path = 'something/'
with zipfile.ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall(dir_path)
I have two zip files, test (1.1 MB) and airway (520 MB).
For test.zip the target folder contains all the files, but for airway.zip it creates another folder named Airway inside my target folder and then extracts all the files there. Even after renaming airway.zip to a garbage name, the result was the same.
Is there some workaround to get only the files extracted into my target folder? This is critical for me, as I'm automating this extraction from Django.
Python version: 3.9.6;
Django version: 2.2
I ran your code, and the behavior comes from the zip file itself, not from your code. If you create a zip file by selecting only the individual files, you get the result you got with test.zip. If you create it by selecting a folder holding the files, that folder will be there again when you extract, no matter what you name your zip file.
I have two articles related to this:
https://www.kite.com/python/docs/zipfile.ZipFile.extractall
https://www.geeksforgeeks.org/working-zip-files-python/
If neither of these articles solves your problem, then I suspect that instead of zipping the files in the folder, you zipped the folder itself, so try zipping the files inside the folder instead.
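If re-zipping is not an option, the archive can also be flattened at extraction time by writing each file entry out under its base name. A sketch under that assumption (extract_flat is a hypothetical helper; files with duplicate names will overwrite each other):

```python
import os
import zipfile

def extract_flat(zip_path, dest_dir):
    """Extract all regular files from a zip archive directly into dest_dir,
    discarding any directory structure inside the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries entirely
            target = os.path.join(dest_dir, os.path.basename(info.filename))
            with zf.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
```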

Dealing with OS Error Python: [Errno 20] Not a directory: '/Volumes/ARLO/ADNI/.DS_Store'

I am trying to write a piece of code that will recursively iterate through the subdirectories of a specific directory and stop only when reaching files with a '.nii' extension, appending these files to a list called images - a form of a breadth first search. Whenever I run this code, however, I keep receiving [Errno 20] Not a directory: '/Volumes/ARLO/ADNI/.DS_Store'
*/Volumes/ARLO/ADNI is the folder I wish to traverse through
*I am doing this in Mac using the Spyder IDE from Anaconda because it is the only way I can use the numpy and nibabel libraries, which will become important later
*I have already checked that this folder directly contains only other folders and not files
#preprocessing all the MCIc files
import os
#import nibabel as nib
#import numpy as np

def pwd():
    cmd = 'pwd'
    os.system(cmd)
    print(os.getcwd())

#Part 1
os.chdir('/Volumes/ARLO')
images = [] #creating an empty list to store MRI images
os.chdir('/Volumes/ARLO/ADNI')
list_sample = [] #just an empty list for an earlier version of the program

#Part 2
#function to recursively iterate through folder of raw MRI
#images and extract them into a list
#breadth first search
def extract(dir):
    #dir = dir.replace('.DS_Store', '') #DS issue
    lyst = os.listdir(dir)
    for item in lyst:
        if 'nii' not in item: #if item is not a .nii file, i.e.
                              #item is another folder
            newpath = dir + '/' + item
            #os.chdir(newpath) #DS issue
            extract(newpath)
        else: #if item is the desired file type, append it to
              #the list images
            images.append(item)

#Part 3
adni = os.getcwd() #big folder I want to traverse
#print(adni) #adni is a string containing the path to the ADNI
#folder w/ all the images
#print(os.listdir(adni)) #this also works, prints the actual list
"""adni = adni + '/' + '005_S_0222'
os.chdir(adni)
print(os.listdir(adni))""" #one iteration of the recursion, works
extract(adni)
print(images)
With every iteration, I wish to traverse further into the nested folders by appending the folder name to the growing path, and part 3 of the code works, i.e. I know that a single iteration works. Why does os keep adding the '.DS_Store' part to my directories in the extract() function? How can I correct my code so that the breadth first traversal can work? This folder contains hundreds of MRI images, I cannot do it without automation.
Thank you.
The .DS_Store files are not being created by the os module, but by the Finder (or, I think, sometimes Spotlight). They're where macOS stores things like the view options and icon layout for each directory on your system.
And they've probably always been there. The reason you didn't see them when you looked is that files that start with a . are "hidden by convention" on Unix, including macOS. Finder won't show them unless you ask it to show hidden files; ls won't show them unless you pass the -a flag; etc.
So, that's your core problem:
I have already checked that this folder directly contains only other folders and not files
… is wrong. The folder does contain at least one regular file: .DS_Store.
So, what can you do about that?
You could add special handling for .DS_Store.
But a better solution is probably to just check each file to see if it's a file or directory, by calling os.path.isdir on it.
Or, even better, use os.scandir instead of listdir, which gives you entries with more information than just the name, so you don't need to make extra calls like isdir.
Or, best of all, just throw out this code and use os.walk to recursively visit every file in every directory underneath your top-level directory.
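For comparison, the os.walk version is only a few lines (find_nii is a hypothetical name; like the original code it collects bare filenames, so use os.path.join(dirpath, name) instead if you need full paths):

```python
import os

def find_nii(top):
    """Collect the names of all .nii files anywhere beneath top,
    ignoring everything else (including hidden files like .DS_Store)."""
    images = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.endswith('.nii'):
                images.append(name)
    return images
```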

move files from child directories (from unzipping) to parent directory in unzip step?

I've got a specific problem:
I am downloading some large sets of data using requests. Each request provides me with a compressed file, containing a manifest of the download, and folders, each containing 1 file.
I can unzip the archive + remove archive, and afterwards extract all files from subdirectories + remove subdirectories.
Is there a way to combine this? Since I'm new to both actions, I studied some tutorials and stack overflow questions on both topics. I'm glad it is working, but I'd like to refine my code and possibly combine these two steps - I didn't encounter it while I was browsing other information.
So for each set of parameters, I perform a request which ends up with:
# Write the file
with open(file_location + file_name, "wb") as output_file:
    output_file.write(response.content)

# Unzip it
with tarfile.open(file_location + file_name, "r:gz") as tarObj:
    tarObj.extractall(path=file_location)

# Remove compressed file
os.remove(file_location + file_name)
And then for the next step I wrote a function that:
target_dir = keyvalue[1]  # target directory is stored in this tuple
subdirs = get_imm_subdirs(target_dir)  # function to get subdirectories
for f in subdirs:
    for c in os.listdir(os.path.join(target_dir, f)):  # find files in subdir
        shutil.move(os.path.join(target_dir, f, c),
                    os.path.join(target_dir, "ALL_FILES/"))  # move them into 1 subdir
for x in subdirs:
    os.rmdir(os.path.join(target_dir, x))  # remove the now-empty subdirs
Is there an action I can perform during the unzip step?
You can extract the files individually rather than using extractall.
with tarfile.open('musthaves.tar.gz') as tarObj:
    for member in tarObj.getmembers():
        if member.isfile():
            member.name = os.path.basename(member.name)
            tarObj.extract(member, ".")
With appropriate credit to this SO question and the tarfile docs.
getmembers() will provide a list of what is inside the archive (as TarInfo objects); you could use getnames() instead, but then you'd have to devise your own test as to whether or not each entry is a file or a directory.
isfile() - if it's not a file, you don't want it.
member.name = os.path.basename(member.name) resets the subdirectory depth - the extractor thinks everything is at the top level.
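Put together, the write-and-flatten step could be wrapped like this (extract_flat_tar is a hypothetical helper combining the snippet above with a destination directory; colliding base names will overwrite each other):

```python
import os
import tarfile

def extract_flat_tar(tar_path, dest_dir):
    """Extract only the regular files from a tar archive into dest_dir,
    flattening any internal directory structure."""
    with tarfile.open(tar_path, "r:*") as tarObj:
        for member in tarObj.getmembers():
            if member.isfile():
                member.name = os.path.basename(member.name)
                tarObj.extract(member, dest_dir)
```

This replaces both the extractall call and the later move-and-rmdir pass in one step.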

Issue in list() method of module boto

I am using the list method as:
all_keys = self.s3_bucket.list(self.s3_path)
The path "s3_path" in the bucket contains files and folders. The return value of the above line is confusing. It is returning:
Parent directory
A few directories not all
All the files in folder and subfolders.
I had assumed it would return files only.
There is actually no such thing as a folder in Amazon S3. It is just provided for convenience. Objects can be stored in a given path even if a folder with that path does not exist. The Key of the object is the full path plus the filename.
For example, this will copy a file even if the folder does not exist:
aws s3 cp file.txt s3://my-bucket/foo/bar/file.txt
This will not create the /foo/bar folder. It simply creates an object with a Key of: /foo/bar/file.txt
However, if folders are created in the S3 Management Console, a zero-length file is created with the name of the folder so that it appears in the console. When listing files, this will appear as the name of the directory, but it is actually the name of a zero-length file.
That is why some directories might appear but not others -- it depends on whether they were explicitly created, or whether objects were simply stored in that path.
Bottom line: Amazon S3 is an object storage system. It is really just a big Key/Value store -- the Key is the name of the Object, the Value is the contents of the object. Do not assume it works the same as a traditional file system.
If you have a lot of items in the bucket, the results of a list_objects call will be paginated. By default, it will return up to 1000 items. See the Boto docs to learn how to use Marker to paginate through all items.
Oh, looks like you're on Boto 2. For you, it will be BucketListResultSet.

Turn subversion path into walkable directory

I have a subversion repo, i.e. "http://crsvn/trunk/foo". I want to walk this directory or, for starters, simply get a directory listing.
The idea is to create a script that will do mergeinfo on all the branches in "http://crsvn/branches/bar" and compare them to trunk to see if the branch has been merged.
So the first problem I have is that I cannot walk or do
os.listdir('http://crsvn/branches/bar')
I get "the value label syntax is incorrect" (mentioning the URL).
You can use PySVN. In particular, the pysvn.Client.list method should do what you want:
import pysvn
svncl = pysvn.Client()
entries = svncl.list("http://rabbitvcs.googlecode.com/svn/trunk/")
# Gives you a list of directories:
dirs = (entry[0].repos_path for entry in entries if entry[0].kind == pysvn.node_kind.dir)
list(dirs)
No checkout needed. You could even specify a revision to work on, to ensure your script can ignore other people working on the repository while it runs.
listdir takes a path, not a URL. It would be nice if Python could be aware of the structure on a remote server, but I don't think that is the case.
If you were to check out your repository locally first, you could easily walk the directories using Python's functions.
