I am new to Python and currently working on data analysis.
I am trying to open multiple folders in a loop and read all the files in each folder.
For example, the working directory contains 10 folders that need to be opened, and each folder contains 10 files.
My code for opening a single folder of .txt files:
file_open = glob.glob("home/....../folder1/*.txt")
I want to open folder 1 and read all its files, then go to folder 2 and read all its files, and so on until folder 10.
Can anyone help me write a loop to open the folders, including which library I need to use?
My background is in R; there, for example, I could write a loop to open folders and files using the code below.
folder_open <- dir("......./main/")
for (n in 1:length(folder_open)) {
  file_open <- dir(paste0("......./main/", folder_open[n]))
  for (k in 1:length(file_open)) {
    file_content <- readLines(paste0("...../main/", folder_open[n], "/", file_open[k]))
    # Finally I can read all folders and files.
  }
}
This recursive method will scan all directories within a given directory and then print the names of the txt files. I kindly invite you to take it forward.
import os

def scan_folder(parent):
    # iterate over all the files in directory 'parent'
    for file_name in os.listdir(parent):
        if file_name.endswith(".txt"):
            # if it's a txt file, print its name (or do whatever you want)
            print(file_name)
        else:
            current_path = "".join((parent, "/", file_name))
            if os.path.isdir(current_path):
                # if we're checking a sub-directory, recursively call this method
                scan_folder(current_path)

scan_folder("/example/path")  # Insert parent directory's path
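If you also want to read the contents of each file rather than just print the names, a minimal variation might look like this (the path below is only a placeholder):
import os

def read_txt_files(parent):
    # recursively collect the text of every .txt file under 'parent'
    contents = {}
    for file_name in os.listdir(parent):
        current_path = os.path.join(parent, file_name)
        if os.path.isdir(current_path):
            contents.update(read_txt_files(current_path))
        elif file_name.endswith(".txt"):
            with open(current_path) as f:
                contents[current_path] = f.read()
    return contents

all_texts = read_txt_files("/example/path")  # placeholder path, as above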
Given the following folder/file tree:
C:.
├───folder1
│ file1.txt
│ file2.txt
│ file3.csv
│
└───folder2
file4.txt
file5.txt
file6.csv
The following code will recursively locate all .txt files in the tree:
import os
import fnmatch

for path, dirs, files in os.walk('.'):
    for file in files:
        if fnmatch.fnmatch(file, '*.txt'):
            fullname = os.path.join(path, file)
            print(fullname)
Output:
.\folder1\file1.txt
.\folder1\file2.txt
.\folder2\file4.txt
.\folder2\file5.txt
Your glob() pattern is almost correct. Try one of these:
file_open = glob.glob("home/....../*/*.txt")
file_open = glob.glob("home/....../folder*/*.txt")
The first one will examine all of the text files in any first-level subdirectory of home/......, whatever that is. The second will limit itself to subdirectories named like "folder1", "folder2", etc.
I don't speak R, but this might translate your code:
import glob

for filename in glob.glob("......../main/*/*.txt"):
    with open(filename) as file_handle:
        for line in file_handle:
            pass  # process each line of text here
I think a nice way to do that would be to use os.walk. It will generate a tree, and you can then iterate through it.
import os

directory = './'
for d in os.walk(directory):
    print(d)
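Each tuple yielded by os.walk is (dirpath, dirnames, filenames), so a minimal sketch that also opens every .txt file (assuming that is what you want to do with the tree) could look like:
import os

directory = './'
for dirpath, dirnames, filenames in os.walk(directory):
    for name in filenames:
        if name.endswith('.txt'):
            full_path = os.path.join(dirpath, name)
            with open(full_path) as f:
                for line in f:
                    pass  # process each line here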
This code will look for all directories inside a directory, printing out the names of all files found there:
#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: print filenames one level down from starting folder
#--------*---------*---------*---------*---------*---------*---------*---------*
import os, fnmatch, sys

def find_dirs(directory, pattern):
    for item in os.listdir(directory):
        if os.path.isdir(os.path.join(directory, item)):
            if fnmatch.fnmatch(item, pattern):
                filename = os.path.join(directory, item)
                yield filename

def find_files(directory, pattern):
    for item in os.listdir(directory):
        if os.path.isfile(os.path.join(directory, item)):
            if fnmatch.fnmatch(item, pattern):
                filename = os.path.join(directory, item)
                yield filename

#--------*---------*---------*---------*---------*---------*---------*---------#
while True:  # M A I N L I N E #
#--------*---------*---------*---------*---------*---------*---------*---------#
    # Set directory
    os.chdir("C:\\Users\\Mike\\Desktop")
    for filedir in find_dirs('.', '*'):
        print('Got directory:', filedir)
        for filename in find_files(filedir, '*'):
            print(filename)
    sys.exit()  # END PROGRAM
pathlib is a good choice:
from pathlib import Path

# or use: glob('**/*.txt')
for txt_path in [p for p in Path('demo/test_dir').rglob('*.txt') if p.is_file()]:
    print(txt_path.absolute())
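If you also need the contents, Path objects can read themselves; a small sketch using the same example path:
from pathlib import Path

for txt_path in Path('demo/test_dir').rglob('*.txt'):
    if txt_path.is_file():
        lines = txt_path.read_text().splitlines()  # whole file, split into lines
        # process 'lines' here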
Related question:
Usually I navigate to the folder I am extracting data from and copy the file name directly:
df2=pd.read_csv('10_90_bnOH-MEA.csv',usecols=[1])
If I have multiple files and want to do the same for all the files, how do I specify the folder to open and get all the files inside?
I want to run the above code without specifying the file's full path
(C:\Users\X\Desktop\Y\Z\10_90_bnOH-MEA.csv)
You want listdir from the os module.
import os
import pandas as pd

path = "C:\\Users\\X\\Desktop\\Y\\Z\\"
files = os.listdir(path)
print(files)

dataframe_list = []
for filename in files:
    dataframe_list.append(pd.read_csv(os.path.join(path, filename)))
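If the folder contains files other than CSVs, one possible refinement (reusing the hypothetical path above and the usecols=[1] argument from the question) is to filter by extension before reading:
import os
import pandas as pd

path = "C:\\Users\\X\\Desktop\\Y\\Z\\"
dataframe_list = [
    pd.read_csv(os.path.join(path, filename), usecols=[1])
    for filename in os.listdir(path)
    if filename.endswith(".csv")
]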
You should open the desired directory, loop through all the files, and then do something with them.
# import required module
import os

# assign directory
directory = 'files'

# iterate over files in the directory
def goThroughDirectory(directory):
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(f):
            # do something
            pass
If you also want to descend into every sub-directory, add a check with os.path.isdir(f), like this:
...
def goThroughDirectory(directory):
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(f):
            # do something
            pass
        elif os.path.isdir(f):
            # it's not a file but a directory, so loop through that directory as well
            goThroughDirectory(f)
For more information you can check GeeksforGeeks.
I have the following directory structure:
-mailDir
-folderA
-sub1
-sub2
-inbox
-1.txt
-2.txt
-89.txt
-subInbox
-subInbox2
-folderB
-sub1
-sub2
-inbox
-1.txt
-2.txt
-200.txt
-577.txt
The aim is to copy all the txt files under inbox folder into another folder.
For this I tried the below code
import os
from os import path
import shutil

rootDir = "mailDir"
destDir = "destFolder"
eachInboxFolderPath = []

for root, dirs, files in os.walk(rootDir):
    for dirName in dirs:
        if dirName == "inbox":
            eachInboxFolderPath.append(root + "\\" + dirName)

for ii in eachInboxFolderPath:
    for i in os.listdir(ii):
        shutil.copy(path.join(ii, i), destDir)
If the inbox directory only has .txt files, the above code works fine. But since the inbox folder under the folderA directory has sub-directories along with the .txt files, the code returns a permission denied error. What I understood is that shutil.copy won't copy folders.
The aim is to copy only the .txt files from every inbox folder to some other location. If file names repeat across different inbox folders, I have to keep both files. How can we improve the code in this case? Please note that other than the .txt files, everything else is a folder.
One simple solution is to filter out any i that does not have the .txt extension, using the string endswith() method.
import os
from os import path
import shutil

rootDir = "mailDir"
destDir = "destFolder"
eachInboxFolderPath = []

for root, dirs, files in os.walk(rootDir):
    for dirName in dirs:
        if dirName == "inbox":
            eachInboxFolderPath.append(root + "\\" + dirName)

for ii in eachInboxFolderPath:
    for i in os.listdir(ii):
        if i.endswith('.txt'):
            shutil.copy(path.join(ii, i), destDir)
This should ignore any folders and non-txt files that are found with os.listdir(ii). I believe that is what you are looking for.
Just remembered that I once wrote several files to solve this exact problem before. You can find the source code here on my Github.
In short, there are two functions of interest here:
list_files(loc, return_dirs=False, return_files=True, recursive=False, valid_exts=None)
copy_files(loc, dest, rename=False)
For your case, you could copy and paste these functions into your project and modify copy_files like this:
from os.path import join
from re import sub
from shutil import copy

def copy_files(loc, dest, rename=False):
    # get files with full path (list_files is defined in the linked source)
    files = list_files(loc, return_dirs=False, return_files=True, recursive=True, valid_exts=('.txt',))

    # copy files in list to dest
    for i, this_file in enumerate(files):
        # change name if renaming
        if rename:
            # replace slashes with hyphens to preserve unique name
            out_file = sub(r'^\./', '', this_file)
            out_file = sub(r'\\|/', '-', out_file)
            out_file = join(dest, out_file)
            copy(this_file, out_file)
            files[i] = out_file
        else:
            copy(this_file, dest)

    return files
Then just call it like so:
copy_files('mailDir', 'destFolder', rename=True)
The renaming scheme might not be exactly what you want, but it will at least not overwrite your files. I believe this should solve all your problems.
Here you go:
import os
from os import path
import shutil

destDir = '<absolute-path>'

for root, dirs, files in os.walk(os.getcwd()):
    # Filter out only '.txt' files.
    files = [f for f in files if f.endswith('.txt')]
    # Filter out only the 'inbox' directory.
    dirs[:] = [d for d in dirs if d == 'inbox']
    for f in files:
        p = path.join(root, f)
        # print p
        shutil.copy(p, destDir)
Quick and simple.
Sorry, I forgot the part where you also need unique file names. The above solution only works for distinct file names in a single inbox folder.
For copying files from multiple inboxes and having a unique name in the destination folder, you can try this:
import os
from os import path
import shutil

sourceDir = os.getcwd()
fixedLength = len(sourceDir)
destDir = '<absolute-path>'
filteredFiles = []

for root, dirs, files in os.walk(sourceDir):
    # Filter out only '.txt' files in all the inbox directories.
    if root.endswith('inbox'):
        # here I am joining the file name to the full path while filtering txt files
        files = [path.join(root, f) for f in files if f.endswith('.txt')]
        # add the filtered files to the main list
        filteredFiles.extend(files)

# making a tuple of file path and file name
filteredFiles = [(f, f[fixedLength+1:].replace('/', '-')) for f in filteredFiles]

for (f, n) in filteredFiles:
    print 'copying file...', f
    # copying from the path to the dest directory with specific name
    shutil.copy(f, path.join(destDir, n))

print 'copied', str(len(filteredFiles)), 'files to', destDir
If you need to copy all files instead of just txt files, then just change the condition f.endswith('.txt') to os.path.isfile(f) while filtering out the files.
I have been working on this challenge for about a day or so. I've looked at multiple questions and answers on SO and tried to "MacGyver" the code for my purpose, but I'm still having issues.
I have a directory (let's call it "src\") with hundreds of files (.txt and .xml). Each .txt file has an associated .xml file (let's call them a pair). Example:
src\text-001.txt
src\text-001.xml
src\text-002.txt
src\text-002.xml
src\text-003.txt
src\text-003.xml
Here's an example of how I would like it to turn out so each pair of files are placed into a single unique folder:
src\text-001\text-001.txt
src\text-001\text-001.xml
src\text-002\text-002.txt
src\text-002\text-002.xml
src\text-003\text-003.txt
src\text-003\text-003.xml
What I'd like to do is create an associated folder for each pair and then move each pair of files into its respective folder using Python. I've been working from code I found (thanks to a post from Nov '12 by Sethdd), but am having trouble figuring out how to use the move function to grab pairs of files. Here's where I'm at:
import os
import shutil

srcpath = "PATH_TO_SOURCE"
srcfiles = os.listdir(srcpath)
destpath = "PATH_TO_DEST"

# grabs the name of the file before the extension and uses it as the dest folder name
destdirs = list(set([filename[0:9] for filename in srcfiles]))

def create(dirname, destpath):
    full_path = os.path.join(destpath, dirname)
    os.mkdir(full_path)
    return full_path

def move(filename, dirpath):
    shutil.move(os.path.join(srcpath, filename), dirpath)

# create destination directories and store their names along with full paths
targets = [
    (folder, create(folder, destpath)) for folder in destdirs
]

for dirname, full_path in targets:
    for filename in srcfiles:
        if dirname == filename[0:9]:
            move(filename, full_path)
I feel like it should be easy, but Python isn't something I work with everyday and it's been a while since my scripting days... Any help would be greatly appreciated!
Thanks,
WK2EcoD
Use the glob module to iterate over all of the 'txt' files. From that you can parse and create the folders and copy the files.
The process should be as simple as it appears to you as a human.
for file_name in os.listdir(srcpath):
    dir = file_name[:9]
    # if dir doesn't exist, create it
    # move file_name to dir
You're doing a lot of intermediate work that seems to be confusing you.
Also, insert some simple print statements to track data flow and execution flow. It appears that you have no tracing output so far.
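Filling in those two comments, a minimal sketch (with srcpath as the placeholder from your own snippet) could be:
import os
import shutil

srcpath = "PATH_TO_SOURCE"  # placeholder, as in the question

for file_name in os.listdir(srcpath):
    if not file_name.endswith((".txt", ".xml")):
        continue
    dir = os.path.join(srcpath, file_name[:9])
    # if dir doesn't exist, create it
    if not os.path.isdir(dir):
        os.mkdir(dir)
    # move file_name to dir
    shutil.move(os.path.join(srcpath, file_name), dir)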
You can do it with the os module. For every file in the directory, check whether the associated folder exists, create it if needed, and then move the file. See the code below:
import os

SRC = 'path-to-src'

for fname in os.listdir(SRC):
    filename, file_extension = os.path.splitext(fname)
    # os.path.splitext keeps the leading dot in the extension
    if file_extension not in ['.xml', '.txt']:
        continue
    folder_path = os.path.join(SRC, filename)
    if not os.path.exists(folder_path):
        os.mkdir(folder_path)
    os.rename(
        os.path.join(SRC, fname),
        os.path.join(folder_path, fname)
    )
My approach would be:
Find the pairs that I want to move (do nothing with files without a pair)
Create a directory for every pair
Move the pair to the directory
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import os, shutil
import re

def getPairs(files):
    pairs = []
    file_re = re.compile(r'^(.*)\.(.*)$')
    for f in files:
        match = file_re.match(f)
        if match:
            (name, ext) = match.groups()
            if ext == 'txt' and name + '.xml' in files:
                pairs.append(name)
    return pairs

def movePairsToDir(pairs):
    for name in pairs:
        os.mkdir(name)
        shutil.move(name + '.txt', name)
        shutil.move(name + '.xml', name)

files = os.listdir()
pairs = getPairs(files)
movePairsToDir(pairs)
NOTE: This script works when called inside the directory with the pairs.
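If you would rather not have to run it from inside that directory, a slight variation (just a sketch, with a placeholder path) is to pass the source directory in and join every path against it:
import os
import shutil

def move_pairs(src):
    files = set(os.listdir(src))
    for f in files:
        name, ext = os.path.splitext(f)
        if ext == '.txt' and name + '.xml' in files:
            target = os.path.join(src, name)
            os.mkdir(target)
            shutil.move(os.path.join(src, name + '.txt'), target)
            shutil.move(os.path.join(src, name + '.xml'), target)

move_pairs('PATH_TO_SOURCE')  # placeholder path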
I have a directory structure that resembles the following:
Dir1
Dir2
Dir3
Dir4
L SubDir4.1
L SubDir4.2
L SubDir4.3
I want to generate a list of files (with full paths) that include all the contents of Dirs1-3, but only SubDir4.2 inside Dir4. The code I have so far is
import fnmatch
import os

for root, dirs, files in os.walk( '.' )
    if 'Dir4' in dirs:
        if not 'SubDir4.2' in 'Dir4':
            dirs.remove( 'Dir4' )
    for file in files
        print os.path.join( root, file )
My problem is that the part where I attempt to exclude any file that does not have SubDir4.2 in its path is excluding everything in Dir4, including the things I would like to remain. How should I amend the above to do what I want?
Update 1: I should add that there are a lot of directories below Dir4, so manually listing them in an excludes list isn't a practical option. I'd like to be able to specify SubDir4.2 as the only subdirectory within Dir4 to be read.
Update 2: For reasons outside of my control, I only have access to Python version 2.4.3.
There are a few typos in your snippet. I propose this:
import os

def any_p(iterable):
    for element in iterable:
        if element:
            return True
    return False

include_dirs = ['Dir4/SubDir4.2', 'Dir1/SubDir4.2', 'Dir3', 'Dir2']  # List all your included folder names here

for root, dirs, files in os.walk('.'):
    dirs[:] = [d for d in dirs if any_p(d in os.path.join(root, q_inc) for q_inc in include_dirs)]
    for file in files:
        print file
EDIT: According to the comments, I have changed this so it is an include list instead of an exclude one.
EDIT2: Added any_p(), an any() equivalent for Python versions < 2.5.
EDIT3bis: if you have other subfolders with the same name 'SubDir4.2' in other folders, you can use the following to specify the location:
include_dirs = ['Dir4/SubDir4.2', 'Dir1/SubDir4.2']
Assuming you have a Dir1/SubDir4.2.
If there are a lot of those, then you may want to refine this approach with fnmatch, or perhaps a regex query.
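For example, a rough sketch of the fnmatch variant (the patterns below are only illustrative) filters at the file level instead of pruning dirs:
import fnmatch
import os

# example patterns: everything under Dir1-3, but only SubDir4.2 inside Dir4
include_patterns = ['./Dir1/*', './Dir2/*', './Dir3/*', './Dir4/SubDir4.2/*']

for root, dirs, files in os.walk('.'):
    for file in files:
        full = os.path.join(root, file)
        if any_p(fnmatch.fnmatch(full, p) for p in include_patterns):
            print full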
I altered mstud's solution to give you what you are looking for:
import os

for root, dirs, files in os.walk('.'):
    # Split the root into its path parts
    tmp = root.split(os.path.sep)
    # If the length of the path is long enough to be your path AND
    # the second-to-last part of the path is Dir4 AND
    # the last part of the path is NOT SubDir4.2, THEN
    # stop processing this pass.
    if (len(tmp) > 2) and (tmp[-2] == 'Dir4') and (tmp[-1] != 'SubDir4.2'):
        continue
    # If we aren't in Dir4, print the file paths.
    if tmp[-1] != 'Dir4':
        for file in files:
            print os.path.join(root, file)
In short, the first "if" skips the printing of any directory contents under Dir4 that aren't SubDir4.2. The second "if" skips the printing of the contents of the Dir4 directory.
for root, dirs, files in os.walk('.'):
    tmp = root.split(os.path.sep)
    if len(tmp) > 2 and tmp[-2] == "Dir4" and tmp[-1] == "SubDir4.2":
        continue
    for file in files:
        print os.path.join(root, file)
I am trying to list directories and files (recursively) in a directory with Python:
./rootdir
./file1.html
./subdir1
./file2.html
./file3.html
./subdir2
./file4.html
Now I can list the directories and files just fine (borrowed it from here). But I would like to list it in the following format and ORDER (which is very important for what I am doing).
/rootdir/
/rootdir/file1.html
/rootdir/subdir1/
/rootdir/subdir1/file2.html
/rootdir/subdir1/file3.html
/rootdir/subdir2/
/rootdir/file4.html
I don't care how it gets done, whether I walk the directory and then organize the results, or get everything in order as I go. Either way, thanks in advance!
EDIT: Added code below.
# list books
import os
import sys

lstFiles = []
rootdir = "/srv/http/example/www/static/dev/library/books"

# Append the directories and files to a list
for path, dirs, files in os.walk(rootdir):
    #lstFiles.append(path + "/")
    lstFiles.append(path)
    for file in files:
        lstFiles.append(os.path.join(path, file))

# Open the file for writing
f = open("sidebar.html", "w")
f.write("<ul>")

for item in lstFiles:
    splitfile = os.path.split(item)
    webpyPath = splitfile[0].replace("/srv/http/example/www", "")
    itemName = splitfile[1]
    if item.endswith("/"):
        f.write('<li>' + itemName + '</li>\n')
    else:
        f.write('<li>' + itemName + '</li>\n')

f.write("</ul>")
f.close()
Try the following:
for path, dirs, files in os.walk("."):
    print path
    for file in files:
        print os.path.join(path, file)
You do not need to print entries from dirs because each directory will be visited as you walk the path, so you will print it later with print path.
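If the exact order and the trailing slash on directory lines matter (as in your desired output), one possible variation, reusing rootdir from your own code, is to sort in place and print the directory itself with a separator:
import os

rootdir = "/srv/http/example/www/static/dev/library/books"

for path, dirs, files in os.walk(rootdir):
    dirs.sort()        # walk sub-directories in alphabetical order
    print path + "/"   # directory line with a trailing slash
    for file in sorted(files):
        print os.path.join(path, file)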