Getting folder names and file names within them into a data frame - python

I have a folder called Contracts and then in that folder I have folders for several companies. In the company folders I have several contracts that we have with those companies. I am trying to get a data frame that has two columns, Folder_Name and Contract.
I tried to follow this question, "Python list directory, subdirectory, and files", which got me close, I think, but I could not get a column with the name of the folder that each contract came from.
I thought this would work:
import pathlib, sys, os
import pandas as pd
cwd = os.getcwd()
lst1 = []
lst2 = []
for path, subdir, file in os.walk(cwd):
    for i in subdir:
        for name in file:
            lst1.append(i)
            lst2.append(name)
df = pd.DataFrame(zip(lst1, lst2), columns = ['Folder_Name', 'Contract'])
but it only gave me the folder names in one column and the names of the files in the Contracts folder itself, instead of the files in the company folders:
          Folder_Name        Contract
0  .ipynb_checkpoints  Untitled.ipynb
1                 AWS  Untitled.ipynb

I ran this code:
import pathlib, sys, os
import pandas as pd
cwd = os.getcwd()
lst1 = []
lst2 = []
for path, subdir, file in os.walk(os.path.join(cwd,'Contracts')):
    print(path, subdir, file)
    for i in subdir:
        for name in file:
            print(i,name)
I ran it in an example folder and found your problem.
The console output showed that whenever subdir is non-empty, file is empty, and whenever file is non-empty, subdir is empty.
For the directory currently being walked (path), subdir lists the folders directly inside it and file lists the files directly inside it. In your situation each directory contains either folders or files, but never both at the same time. That is why one of your nested loops always iterates over an empty list and nothing is ever printed.
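To make the behaviour of os.walk concrete, here is a minimal sketch that builds a throwaway tree and collects what os.walk yields for each directory. The folder and file names (AWS, IBM, a.pdf, b.pdf) and the helper walk_summary are made up for illustration; they are not from the question.

```python
import os
import tempfile

def walk_summary(root):
    """Return (relative path, folder names, file names) for each directory under root."""
    rows = []
    for path, subdirs, files in os.walk(root):
        rel = os.path.relpath(path, root)
        rows.append((rel, sorted(subdirs), sorted(files)))
    return sorted(rows)

# Build a throwaway tree: Contracts/AWS/a.pdf and Contracts/IBM/b.pdf (hypothetical names)
demo_root = tempfile.mkdtemp()
for company, contract in [("AWS", "a.pdf"), ("IBM", "b.pdf")]:
    os.makedirs(os.path.join(demo_root, "Contracts", company))
    with open(os.path.join(demo_root, "Contracts", company, contract), "w") as fh:
        fh.write("")

summary = walk_summary(os.path.join(demo_root, "Contracts"))
for row in summary:
    print(row)
```

In this layout the top-level entry has folders but no files, and each company entry has files but no folders, so a loop nested as "for each folder, for each file" over a single entry never runs its body.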
I tried to write something that works in the situation you described. It is a little bit longer, but you can try it:
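import os
import pandas as pd
cwd = os.getcwd()
contracts_path = os.path.join(cwd, 'Contracts')
lst1 = []
lst2 = []
# the first item yielded by os.walk describes contracts_path itself,
# so its subdir list holds the company folders
for path, subdir, file in os.walk(contracts_path):
    for company in subdir:
        # walk each company folder and record every file found in it
        for company_path, company_subdir, company_files in os.walk(os.path.join(contracts_path, company)):
            for name in company_files:
                lst1.append(company)
                lst2.append(name)
    break  # only the top level is needed here; the inner walk covers deeper levels
df = pd.DataFrame(zip(lst1, lst2), columns = ['Folder_Name', 'Contract'])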
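As an alternative, the same folder/file pairs could be collected with pathlib. This is only a sketch under the assumption that contracts sit directly inside each company folder; the helper name contract_rows and the demo names (AWS, IBM, msa.pdf, nda.pdf) are hypothetical.

```python
from pathlib import Path
import tempfile

def contract_rows(contracts_dir):
    """Return sorted (folder name, file name) pairs for files directly inside each company folder."""
    rows = []
    for company in Path(contracts_dir).iterdir():
        if not company.is_dir():
            continue  # skip loose files such as notebooks in the top folder
        for contract in company.iterdir():
            if contract.is_file():
                rows.append((company.name, contract.name))
    return sorted(rows)

# Throwaway demo tree with hypothetical company and file names
root = Path(tempfile.mkdtemp()) / "Contracts"
(root / "AWS").mkdir(parents=True)
(root / "IBM").mkdir()
(root / "AWS" / "msa.pdf").touch()
(root / "IBM" / "nda.pdf").touch()
(root / "Untitled.ipynb").touch()  # loose file in Contracts, ignored

rows = contract_rows(root)
print(rows)
```

The resulting list can be passed to pd.DataFrame(rows, columns=['Folder_Name', 'Contract']). Note that a folder such as .ipynb_checkpoints would be treated like a company folder if it contains files, so it may need to be filtered out by name.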

Related

Using glob to parse through folders for files

I have the following example dataset output from pandas.
What I would like to do, in an efficient way, is use glob to search for each filename only in its associated main folder and subfolder, rather than looping through all the main folders and subfolders as my current code does. I then need to match it against a main folder and subfolder I have, and if it matches, copy the file. I have code that works, but it is very inefficient because it has to go through all folders and subfolders for each search. The code is below. At the moment, main_folder and searchdate are lists, and filenames_i_want is the list that I match against. Is there any way I can make it go straight to the folder/subfolder, e.g. if I provided this as CSV input?
import itertools
import glob
import shutil
from pathlib import Path
filenames_i_want = Search_param
main_folder=locosearch
searchfolder= Search_date
TargetFolder = r'C:\ELK\LOGS\XX\DEST'
for directory,folder in itertools.product(main_folder, searchfolder):
    files = glob.glob('Z:/{}/{}/asts_data_logger/*.bz2'.format(directory, folder))
    for f in files:
        current_path = Path(f)
        cpn = current_path.name
        if current_path.name in filenames_i_want:
            print(f"found target file: {f}")
            shutil.copy2(f, TargetFolder)
I created a column (Search) that combines the fields into an absolute path, then used itertuples to go through each row and shutil to copy each file:
TargetFolder = r'C:\ELK\LOGS\ATH\DEST'
for row in df.itertuples():
    search = row.Search
    try:
        shutil.copy2(search, TargetFolder)
    except Exception as e:
        print(e)
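The idea of going straight to each file, driven by CSV input, might look roughly like the sketch below. The CSV column names (main_folder, search_date, filename), the helper copy_listed_files, and the demo folder layout are all assumptions made for illustration; the real layout in the question also has an asts_data_logger level that would need to be added to the path.

```python
import csv
import shutil
import tempfile
from pathlib import Path

def copy_listed_files(csv_path, source_root, target_dir):
    """Copy each file named in the CSV straight from its own folder; return the names copied."""
    copied = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            # Hypothetical columns: main_folder, search_date, filename
            src = Path(source_root) / row["main_folder"] / row["search_date"] / row["filename"]
            if src.is_file():  # skip rows whose file is not there
                shutil.copy2(src, target_dir)
                copied.append(src.name)
    return copied

# Throwaway demo data
base = Path(tempfile.mkdtemp())
(base / "src" / "loco1" / "2020-01-01").mkdir(parents=True)
(base / "src" / "loco1" / "2020-01-01" / "a.bz2").touch()
(base / "dest").mkdir()
csv_file = base / "wanted.csv"
csv_file.write_text(
    "main_folder,search_date,filename\n"
    "loco1,2020-01-01,a.bz2\n"
    "loco1,2020-01-01,missing.bz2\n"
)

copied = copy_listed_files(csv_file, base / "src", base / "dest")
print(copied)
```

Because each row already names its folder and file, nothing is globbed or walked; the cost is one existence check per row.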

Search and copy files listed in a dataframe

Hi, I'm working on a simple script that copies files from one directory to another based on a dataframe that contains a list of invoices.
Is there any way to do this as a partial match? I want all the files whose names contain "F11000", "G13000", and so on, continuing the loop until there is no more data in the DataFrame.
I tried to figure it out by myself, and I'm pretty sure changing the "x" in the copy function will do the trick, but I can't see how.
import pandas as pd
import os
import glob
import shutil
data = {'Invoice':['F11000','G13000','H14000']}
df = pd.DataFrame(data,columns=['Invoice'])
path = 'D:/Pyfilesearch'
dest = 'D:/Dest'
def find(name,path):
    for root,dirs,files in os.walk(path):
        if name in files:
            return os.path.join(root,name)
def copy():
    for x in df['Invoice']:
        shutil.copy(find(x,path),dest)
copy()
Using pathlib
This is part of the standard library.
It treats paths as objects with methods instead of strings.
Python 3's pathlib Module: Taming the File System
The script assumes dest is an existing directory.
.rglob searches the directory and its subdirectories recursively for files.
from pathlib import Path
import pandas as pd
import shutil
# convert paths to pathlib objects
path = Path('D:/Pyfilesearch')
dest = Path('D:/Dest')
# find files and copy
for v in df.Invoice.unique():  # iterate through unique column values
    files = list(path.rglob(f'*{v}*'))  # create a list of files for a value
    files = [f for f in files if f.is_file()]  # if not using file extension, verify item is a file
    for f in files:  # iterate through and copy files
        print(f)
        shutil.copy(f, dest)
Copy to subdirectories for each value
path = Path('D:/Pyfilesearch')
for v in df.Invoice.unique():
    dest = Path('D:/Dest')
    files = list(path.rglob(f'*{v}*'))
    files = [f for f in files if f.is_file()]
    dest = dest / v  # create path with value
    if not dest.exists():  # check if directory exists
        dest.mkdir(parents=True)  # if not, create directory
    for f in files:
        shutil.copy(f, dest)

Get all folder names except for one in python

I need to get all folder names EXCEPT for "Archives", using Path() ONLY, as I need to use glob later in the for loop. I'm on Kali Linux, and the file structure is ./sheets/ containing the folders Archives and Test (which also have nothing inside), along with the files creds.json and sheets.py.
# Imports
from pathlib import Path
import pandas as pd
import pygsheets
import glob
import os
# Setup
gc = pygsheets.authorize(service_file='./creds.json')
email = str(input("Enter email to share sheet: "))
folderName = Path("./") # <<<<<<<<<<<<< HERE IS PROBLEM
for file in folderName.glob("*.txt"):
    if not Path("./Archives").exists():
        os.mkdir("./Archives")
    df = pd.DataFrame()
    df['name'] = ['John', 'Steve', 'Sarah', 'YESSSS']
    gc.create(folderName)
    sh = gc.open(file)
    sh.share(email, role='writer', type='user')
    wks = sh[0]
    wks.set_dataframe(df,(1,1))
I expect the variable folderName to hold the name of any folder except Archives, as a string.
My goal is a script that, when run, uses the folder name in ./sheets/ (Test in this case) as the newly created spreadsheet's name, uses the file names as headers, puts each file's contents (separated by newlines) underneath its header, and then shares the sheet with me at my email. I am using pygsheets, by the way.
from pathlib import Path
p = Path('./sheets')
# create Archives under p if needed
archives = p / 'Archives'
if not archives.exists():
    archives.mkdir()
# find all directories under p that don't include 'Archives'
folder_dirs = [x for x in p.glob('**/*') if x.is_dir() and 'Archives' not in x.parts]
# find all *.txt* files under p that don't include 'Archives'
txt_file_dirs = [x for x in p.glob('**/*.txt') if x.is_file() and 'Archives' not in x.parts]
for file in txt_file_dirs:
    file_name = file.stem
When using pathlib you will be working with objects, not strings, and the library provides many methods for working with those objects.
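If only the names of the folders directly under ./sheets are needed, as plain strings, a small sketch along these lines may be enough. The helper folder_names_except is a made-up name, and the demo tree only mirrors the layout described in the question.

```python
from pathlib import Path
import tempfile

def folder_names_except(parent, excluded="Archives"):
    """Return the names (as strings) of the folders directly under parent, skipping `excluded`."""
    return sorted(
        child.name
        for child in Path(parent).iterdir()
        if child.is_dir() and child.name != excluded
    )

# Throwaway demo tree mirroring the layout described in the question
sheets = Path(tempfile.mkdtemp()) / "sheets"
(sheets / "Archives").mkdir(parents=True)
(sheets / "Test").mkdir()
(sheets / "creds.json").touch()

names = folder_names_except(sheets)
print(names)
```

Path.name is already a str, so no conversion is needed before passing a name on as a spreadsheet title.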

How to read particular text files out-of multiple files in a sub directories in python

I have one folder that contains 5 subfolders.
Each subfolder contains the files 'x.txt', 'y.txt' and 'z.txt', and this repeats in every subfolder.
Now I need to read and print only the 'y.txt' file from all subfolders.
My problem is that I'm unable to read and print the 'y.txt' files. Can you tell me how to solve this problem?
Below is the code I have written for reading the y.txt file:
import os, sys
import pandas as pd
file_path = ('/Users/Naga/Desktop/Python/Data')
for root, dirs, files in os.walk(file_path):
    for name in files:
        print(os.path.join(root, name))
        pd.read_csv('TextInformation.txt',delimiter=";", names = ['Name', 'Value'])
Error: File TextInformation.txt does not exist: 'TextInformation.txt'
You could also try the following approach to fetch all y.txt files from your subdirectories:
import glob
import pandas as pd
# get all y.txt files from all subdirectories
all_files = glob.glob('/Users/Naga/Desktop/Python/Data/*/y.txt')
for file in all_files:
    data_from_this_file = pd.read_csv(file, sep=" ", names = ['Name', 'Value'])
    # do something with the data
Subsequently, you can apply your code to all the files in the list all_files. The great thing about glob is that you can use wildcards (*). With them you don't need the names of the subdirectories (you can even use a wildcard within the filename, e.g. *y.txt). Also see the documentation on glob.
Your issue is that you forgot to add the parent path of the 'y.txt' file. I suggest this code; I hope it helps.
import os
pth = '/Users/Naga/Desktop/Python/Data'
list_sub = os.listdir(pth)
filename = 'TextInformation.txt'
for sub in list_sub:
    # build the full path: parent folder / subfolder / file name
    with open('{}/{}/{}'.format(pth, sub, filename), 'r') as f:
        TextInfo = f.read()
    print(TextInfo)
I wrote a little code for you. You can customize it any way you like, but it should work for your case.
import os
for dirPath, foldersInDir, fileName in os.walk(path_to_main_folder):
    if fileName:  # skip directories that contain no files
        for file in fileName:
            if file.endswith('y.txt'):
                loc = os.sep.join([dirPath, file])
                with open(loc) as y_txt:
                    y = y_txt.read()
                print(y)
But keep in mind that path_to_main_folder is the path of the folder that contains the subfolders.
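For comparison, here is a hedged pathlib sketch of the same task: it reads one named file from every direct subfolder. The helper read_named_file and the demo folders (sub1, sub2) are hypothetical, and the sketch assumes the files are small enough to read as text in one go.

```python
from pathlib import Path
import tempfile

def read_named_file(parent, filename="y.txt"):
    """Return {subfolder name: file text} for every direct subfolder that contains `filename`."""
    contents = {}
    for candidate in sorted(Path(parent).glob(f"*/{filename}")):
        contents[candidate.parent.name] = candidate.read_text()
    return contents

# Throwaway demo tree: two subfolders, each with x.txt and y.txt
data_dir = Path(tempfile.mkdtemp())
for sub in ("sub1", "sub2"):
    (data_dir / sub).mkdir()
    (data_dir / sub / "x.txt").write_text("ignored")
    (data_dir / sub / "y.txt").write_text(f"Name;{sub}")

texts = read_named_file(data_dir)
print(texts)
```

Because the pattern names the file exactly, x.txt and z.txt are never opened, and the full path to each match is available without joining strings by hand.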

Loop over multiple folders from list with glob.glob

How can I loop over a defined list of folders and all of the individual files inside each of those folders?
I'm trying to have it copy all the months in each year folder, but when I run it nothing happens.
import shutil
import glob
P4_destdir = ('Z:/Source P4')
yearlist = ['2014','2015','2016']
for year in yearlist:
    for file in glob.glob(r'{0}/*.csv'.format(yearlist)):
        print (file)
        shutil.copy2(file,P4_destdir)
I think the problem might be that you need a trailing / in your destination path. Also note that the glob pattern has to be formatted with year (the loop variable) rather than with the whole yearlist, otherwise no file will match:
import shutil
import glob
P4_destdir = ('Z:/Source P4/')
yearlist = ['2014','2015','2016'] # assuming these folders are in the same directory as your code.
for year in yearlist:
    for file in glob.glob(r'{0}/*.csv'.format(year)):
        print (file)
        shutil.copy2(file,P4_destdir)
Another thing that might be a problem is if the destination folder does not yet exist. You can create it using os.mkdir:
import os
dest = os.path.isdir('Z:/Source P4/') # Tests if the folder exists
if not dest:
    os.mkdir('Z:/Source P4/')
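Putting both points together, a pathlib version might look like the sketch below. The helper copy_year_csvs, the temporary folders, and the file names (jan_2014.csv, jan_2015.csv) are assumptions for the demo; the real source and destination paths would replace them.

```python
import shutil
import tempfile
from pathlib import Path

def copy_year_csvs(source_root, years, dest_dir):
    """Copy every *.csv found directly inside each year folder; return the copied file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the destination folder if needed
    copied = []
    for year in years:
        # format the pattern with the single year, not the whole list
        for csv_file in sorted((Path(source_root) / year).glob("*.csv")):
            shutil.copy2(csv_file, dest)
            copied.append(csv_file.name)
    return copied

# Throwaway demo tree with one CSV per year folder
base = Path(tempfile.mkdtemp())
for year in ("2014", "2015"):
    (base / year).mkdir()
    (base / year / f"jan_{year}.csv").touch()

copied = copy_year_csvs(base, ["2014", "2015"], base / "Source P4")
print(copied)
```

Using mkdir with exist_ok=True avoids the separate existence check, and building the source path from an explicit root removes the dependence on the current working directory.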
