Reading HTML files (in different folders) - Python

I want to read HTML files in Python. Normally I do it like this (and it works):
import codecs
f = codecs.open("test.html",'r')
print f.read()
The problem is that my HTML files are not all in the same folder, since I have a program which generates these HTML files and saves them into folders that are inside the folder where I have my script to read the files.
Summarizing: I have my script in a folder, and inside this folder there are more folders where the generated HTML files are.
Does anybody know how I can proceed?

import os
import codecs

for root, dirs, files in os.walk("./"):
    for name in files:
        abs_path = os.path.normpath(root + '/' + name)
        file_name, file_ext = os.path.splitext(abs_path)
        if file_ext == '.html':
            f = codecs.open(abs_path, 'r')
            print f.read()
This will walk through ./ (which resolves to the directory you run the script from) and loop through all files in each sub-directory.
It checks whether the extension is .html and does the work on each .html file.
You could also accept more file endings (for instance .htm).
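For instance, a small sketch of accepting several endings, assuming the same Python 2 / codecs setup as above (the tuple of extensions is just an illustration):

import os
import codecs

accepted = ('.html', '.htm')  # extensions treated as HTML; adjust as needed
for root, dirs, files in os.walk("./"):
    for name in files:
        if os.path.splitext(name)[1].lower() in accepted:
            f = codecs.open(os.path.join(root, name), 'r')
            print f.read()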

Use os.walk:
import os, codecs

for root, dirs, files in os.walk("/mydir"):
    for file in files:
        if file.endswith(".html"):
            f = codecs.open(os.path.join(root, file), 'r')
            print f.read()
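If you happen to be on Python 3, the same walk can also be written with pathlib's rglob; a minimal sketch (the UTF-8 encoding is an assumption about your files):

from pathlib import Path

for html_file in Path(".").rglob("*.html"):
    with html_file.open("r", encoding="utf-8") as f:  # encoding is an assumption
        print(f.read())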

Open a folder to then use the files in Python correctly

Usually I navigate to the folder I am extracting data from and copy the file name directly:
df2=pd.read_csv('10_90_bnOH-MEA.csv',usecols=[1])
If I have multiple files and want to do the same for all the files, how do I specify the folder to open and get all the files inside?
I want to run the above code without specifying the file's full path
(C:\Users\X\Desktop\Y\Z\10_90_bnOH-MEA.csv)
You want listdir from the os module.
import os
import pandas as pd

path = "C:\\Users\\X\\Desktop\\Y\\Z\\"
files = os.listdir(path)
print(files)

dataframe_list = []
for filename in files:
    dataframe_list.append(pd.read_csv(os.path.join(path, filename)))
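If the folder can also contain files that are not CSVs, a hedged variant using the glob module only picks up the .csv files (the path and usecols value just mirror the question):

import glob
import os
import pandas as pd

path = "C:\\Users\\X\\Desktop\\Y\\Z"
dataframe_list = []
for csv_path in glob.glob(os.path.join(path, "*.csv")):
    dataframe_list.append(pd.read_csv(csv_path, usecols=[1]))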
You should open the desired directory, loop through all the files, and then do something with them.
# import required module
import os

# assign directory
directory = 'files'

# iterate over files in that directory
def goThroughDirectory(directory):
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(f):
            # do something
            pass
If you also want to recurse into sub-directories, add a check for os.path.isdir(f), like this:
...
def goThroughDirectory(directory):
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(f):
            # do something
            pass
        elif os.path.isdir(f):
            # it's not a file but a directory, so loop through that directory as well
            goThroughDirectory(f)  # f is already the joined path
For more information, see GeeksforGeeks.
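For what it's worth, os.walk already does this recursion for you; a rough equivalent of the function above (the print is just a placeholder for "do something"):

import os

def goThroughDirectory(directory):
    for root, dirs, files in os.walk(directory):
        for filename in files:
            f = os.path.join(root, filename)
            # do something with the file path f
            print(f)

goThroughDirectory('files')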

Find directories missing .csv file in Python

I have ~1000 directories, containing various .csv files within them. I am trying to check if a specific type of csv file, with a filename that begins with PTSD_OCOTBER, exists in each directory.
If this file does not exist in the directory, I want to print out that directory into a .txt file.
Here is what I have so far.
import os, sys, time, shutil
import subprocess

# determine filetype to look for
file_type = ".csv"
print("Running file counter for " + repr(file_type))

# for each folder in the root directory
for subdir, dirs, files in os.walk(rootdir):
    if "GeneSet" in subdir:
        folder_name = subdir.rsplit('/', 1)[-1]  # get the folder name
        for f in files:
            # unclear how to write this part:
            # how to tell if no files exist in the directory?
            pass
This successfully finds the .csv files of interest, but how do I achieve the above?
So files is the list of files in the directory that you are currently walking. You want to know if there are no files that start with PTSD_OCOTBER (did you mean PTSD_OCTOBER?):
for subdir, dirs, files in os.walk(rootdir):
    if "GeneSet" in subdir:
        folder_name = subdir.rsplit('/', 1)[-1]  # get the folder name
        dir_of_interest = not any(f.startswith('PTSD_OCOTBER') for f in files)
        if dir_of_interest:
            # do stuff with folder_name
            pass
Now you want to save the results into a text file? If you have a Unix-style computer, then you can use output redirection on your terminal, such as
python3 fileanalysis.py > result.txt
after writing print(folder_name) instead of # do stuff with folder_name.
Or you can use Python itself to write the file, such as:
found_dirs = []
for subdir, dirs, files in os.walk(rootdir):
    ...
    if dir_of_interest:
        found_dirs.append(folder_name)

with open('result.txt', 'w') as f:
    f.write('\n'.join(found_dirs))
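If you are unsure whether the files on disk use the misspelled prefix or the corrected one, str.startswith also accepts a tuple of prefixes; a small sketch (the corrected spelling is an assumption):

import os

rootdir = '.'  # placeholder for your real root directory
prefixes = ('PTSD_OCOTBER', 'PTSD_OCTOBER')  # misspelled and corrected variants
found_dirs = []
for subdir, dirs, files in os.walk(rootdir):
    if "GeneSet" in subdir and not any(f.startswith(prefixes) for f in files):
        found_dirs.append(subdir.rsplit('/', 1)[-1])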

Processing filenames in Python

I've written a function to strip double spaces out of my raw data files:
def fixDat(file):
    '''
    Removes extra spaces in the data files. Replaces original file with new
    and renames original to "...._original.dat".
    '''
    import os
    import re
    with open(file + '.dat', 'r') as infile:
        with open(file + '_fixed.dat', 'w') as outfile:
            lines = infile.readlines()
            for line in lines:
                fixed = re.sub(r"\s\s+", " ", line)
                outfile.write(fixed)
    os.rename(file + '.dat', file + '_original.dat')
    os.rename(file + '_fixed.dat', file + '.dat')
I have 19 files in a folder that I need to process with this function, but I'm not sure how to parse the filenames and pass them to the function. Something like
for filename in folder:
    fixDat(filename)
but how do I code filename and folder in Python?
If I understand correctly, you are asking about the os module's .walk() functionality. An example would look like:
import os

for root, dirs, files in os.walk(".", topdown=False):  # "." uses the current folder;
    # change it to a path if you want to process files that are not where your script is located
    for name in files:
        print(os.path.join(root, name))
This prints filename outputs which can be fed to your fixDat() function, such as:
./tmp/test.py
./amrood.tar.gz
./httpd.conf
./www.tar.gz
./mysql.tar.gz
./test.py
Note that these are all strings so you could change the script to:
import os

for root, dirs, files in os.walk(".", topdown=False):
    for name in files:
        if name.endswith('.dat'):  # or some other extension
            print(os.path.join(root, name))
            # fixDat() appends '.dat' itself, so pass the path without the extension
            fixDat(os.path.join(root, os.path.splitext(name)[0]))
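On Python 3.5+ the same search can also be written with a recursive glob pattern; a sketch that assumes fixDat() is defined as above (it appends '.dat' itself, so the extension is stripped first):

import glob
import os

for dat_path in glob.glob('**/*.dat', recursive=True):
    fixDat(os.path.splitext(dat_path)[0])  # fixDat() appends '.dat' again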

Trying to reach all .txt files in Python

I have a folder called "labels". In this folder there are 50 more folders, and each of these 50 folders contains .txt files. How can I reach these .txt files using Python 2?
Here's code that will go through all folders in labels and print the content of the txt files located inside them.
import os

for folder in os.listdir('labels'):
    for txt_file in os.listdir('labels/{}'.format(folder)):
        if txt_file.endswith('.txt'):
            file = open('labels/{}/{}'.format(folder, txt_file), 'r')
            content = file.read()
            file.close()
            print(content)
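Since the layout is exactly labels/&lt;folder&gt;/&lt;file&gt;.txt, a glob pattern works too and is Python 2 compatible (a sketch):

import glob

for txt_path in glob.glob('labels/*/*.txt'):
    with open(txt_path, 'r') as f:
        print(f.read())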
If you just want to list the files in the folders:
import os

rootdir = 'C:/Users/youruser/Desktop/test'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        print(os.path.join(subdir, file))

Copying files in python using shutil

I have the following directory structure:
-mailDir
    -folderA
        -sub1
        -sub2
        -inbox
            -1.txt
            -2.txt
            -89.txt
            -subInbox
            -subInbox2
    -folderB
        -sub1
        -sub2
        -inbox
            -1.txt
            -2.txt
            -200.txt
            -577.txt
The aim is to copy all the txt files under inbox folder into another folder.
For this I tried the below code
import os
from os import path
import shutil

rootDir = "mailDir"
destDir = "destFolder"

eachInboxFolderPath = []
for root, dirs, files in os.walk(rootDir):
    for dirName in dirs:
        if dirName == "inbox":
            eachInboxFolderPath.append(root + "\\" + dirName)

for ii in eachInboxFolderPath:
    for i in os.listdir(ii):
        shutil.copy(path.join(ii, i), destDir)
If the inbox directories only contain .txt files, the above code works fine. But since the inbox folder under the folderA directory has other sub-directories along with the .txt files, the code returns a permission-denied error. What I understood is that shutil.copy won't copy folders.
The aim is to copy only the txt files in every inbox folder to some other location. If the file names are the same in different inbox folders, I have to keep both files. How can the code be improved in this case? Please note that other than the .txt files, all other entries are folders.
One simple solution is to filter out any i that does not have the .txt extension by using the string endswith() method.
import os
from os import path
import shutil

rootDir = "mailDir"
destDir = "destFolder"

eachInboxFolderPath = []
for root, dirs, files in os.walk(rootDir):
    for dirName in dirs:
        if dirName == "inbox":
            eachInboxFolderPath.append(root + "\\" + dirName)

for ii in eachInboxFolderPath:
    for i in os.listdir(ii):
        if i.endswith('.txt'):
            shutil.copy(path.join(ii, i), destDir)
This should ignore any folders and non-txt files that are found with os.listdir(ii). I believe that is what you are looking for.
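If the same file name can occur in several inbox folders, one hedged way to keep both copies is to prefix the destination name with the inbox's parent folder; a sketch that replaces the second loop above (folder layout assumed as in the question):

for ii in eachInboxFolderPath:
    # e.g. "mailDir\\folderA\\inbox" -> prefix "folderA"
    prefix = path.basename(path.dirname(ii))
    for i in os.listdir(ii):
        if i.endswith('.txt'):
            shutil.copy(path.join(ii, i), path.join(destDir, prefix + '_' + i))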
Just remembered that I once wrote several files to solve this exact problem before. You can find the source code on my GitHub.
In short, there are two functions of interest here:
list_files(loc, return_dirs=False, return_files=True, recursive=False, valid_exts=None)
copy_files(loc, dest, rename=False)
For your case, you could copy and paste these functions into your project and modify copy_files like this:
def copy_files(loc, dest, rename=False):
    # get files with full path
    files = list_files(loc, return_dirs=False, return_files=True, recursive=True, valid_exts=('.txt',))

    # copy files in list to dest
    for i, this_file in enumerate(files):
        # change name if renaming
        if rename:
            # replace slashes with hyphens to preserve unique name
            out_file = sub(r'^./', '', this_file)
            out_file = sub(r'\\|/', '-', out_file)
            out_file = join(dest, out_file)
            copy(this_file, out_file)
            files[i] = out_file
        else:
            copy(this_file, dest)

    return files
Then just call it like so:
copy_files('mailDir', 'destFolder', rename=True)
The renaming scheme might not be exactly what you want, but it will at least not override your files. I believe this should solve all your problems.
Here you go:
import os
from os import path
import shutil

destDir = '<absolute-path>'

for root, dirs, files in os.walk(os.getcwd()):
    # Only copy from 'inbox' directories.
    if path.basename(root) != 'inbox':
        continue
    # Filter out only '.txt' files.
    files = [f for f in files if f.endswith('.txt')]
    for f in files:
        p = path.join(root, f)
        # print p
        shutil.copy(p, destDir)
Quick and simple.
Sorry, I forgot the part where you also need unique file names. The above solution only works for distinct file names in a single inbox folder.
For copying files from multiple inboxes and having a unique name in the destination folder, you can try this:
import os
from os import path
import shutil

sourceDir = os.getcwd()
fixedLength = len(sourceDir)
destDir = '<absolute-path>'

filteredFiles = []
for root, dirs, files in os.walk(sourceDir):
    # Filter out only the '.txt' files in all the inbox directories.
    if root.endswith('inbox'):
        # join the file name to the full path while filtering txt files
        files = [path.join(root, f) for f in files if f.endswith('.txt')]
        # add the filtered files to the main list
        filteredFiles.extend(files)

# make a tuple of (full file path, unique destination file name)
filteredFiles = [(f, f[fixedLength+1:].replace('/', '-')) for f in filteredFiles]

for (f, n) in filteredFiles:
    print 'copying file...', f
    # copy from the path to the dest directory with the specific name
    shutil.copy(f, path.join(destDir, n))

print 'copied', str(len(filteredFiles)), 'files to', destDir
If you need to copy all files instead of just the txt files, change the condition f.endswith('.txt') to a real-file check while filtering (note that f is just the bare file name, so join it with root first).
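For instance, the filtering line inside the walk above would then become something like this (a sketch; os.walk already separates files from directories, so the isfile check is mainly a safety net):

files = [path.join(root, f) for f in files if path.isfile(path.join(root, f))]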
