How to read files from two folders and avoid duplicates in Python

How to read files from two folders and avoid duplicates in Python - python

I have the following folders that I read SQL files from and save them as variables:
++folder1
-1.sql
-2.sql
-3.sql
++folder2
-2.sql
The following code does the job well for a single folder. How I can modify this code to read not just from one folder but from two with a rule that if a file exists in folder2 than don't read the file with the same name from folder1?
folder1 = '../folder1/'
for filename in os.listdir(folder1):
path = os.path.join(folder1, filename)
if os.path.isdir(path):
continue
with open(folder1 + filename, 'r') as myfile:
data = myfile.read()
query_name = filename.replace(".sql", "")
exec (query_name + " = data")

You can try something like follows:
folders = ['../folder2/','../folder1/']
checked =[]
for folder in folders:
for filename in os.listdir(folder):
if filename not in checked:
checked.append(filename)
path = os.path.join(folder, filename)
if os.path.isdir(path):
continue
with open(folder + filename, 'r') as myfile:
data = myfile.read()
query_name = filename.replace(".sql", "")
exec (query_name + " = data")

The answer to this is simple: Do two listdir calls, then skip over the files in folder1 that are also in folder2.
One way to do this is with set operations: the set difference a - b means all elements in a that are not also in b, which is exactly what you want.
files1 = set(os.listdir(folder1))
files2 = set(os.listdir(folder2))
files1 -= files2
paths1 = [os.path.join(folder1, file) for file in files1]
paths2 = [os.path.join(folder2, file) for file in files2]
for path in paths1 + paths2:
if os.path.isdir(path):
# etc.
As a side note, dynamically creating a bunch of variables like this is almost always a very bad idea, and doing it with exec instead of globals or setattr is an even worse idea. It's usually be much better to store everything in, e.g., a dict. For example:
queries = {}
for path in paths1 + paths2:
if os.path.isdir(path):
continue
name = os.path.splitext(os.path.basename(path))[0]
with open(path) as f:
queries[name] = f.read()

Related

How to zip files that ends with certain extension

I want to get all files in a directory (I reached it after doing several for loops - hence fourth.path) that ends with .npy or with csv and then zip those files.
My code is running putting one file only in the zip file. What am I doing wrong?
I tried to change my indents, but no zip file is being created
import json
import os
import zipfile
import zlib
directory = os.path.join(os.getcwd(), 'recs')
radarfolder = 'RadarIfxAvian'
file = os.listdir(directory)
def r(p, name):
p = os.path.join(p, name)
return p.replace("/", "\\")
#This code will list all json files in e ach file
for first in os.scandir(directory):
if first.is_dir():
for second in os.scandir(first.path):
if second.is_dir():
for third in os.scandir(second.path):
if third.is_dir():
radar_folder_name = ''
list_files = ()
for fourth in os.scandir(third.path):
if fourth.is_dir():
if radarfolder in fourth.path:
radar_folder_name = fourth.path
print(radar_folder_name)
list_files = ()
for file in os.listdir(fourth.path):
if file.endswith(".npy") | file.endswith(".csv"):
list_files = (file)
print(list_files)
with zipfile.ZipFile(radar_folder_name +'\\' +'radar.zip', 'w', compression=zipfile.ZIP_DEFLATED ) as zipMe:
zipMe.write(radar_folder_name +'\\' +list_files)
zipMe.close()
I tried to change my indents either resulting in error: TypeError: can only concatenate str (not "tuple") to str or no zip file being created

As I said in my second comment, your problem comes from the 'w' argument in your zipping statement. It causes the zip to be overwritten every time it's opened, which you do for each file you zip in. You can fix this 2 ways (at least):
Replace 'w' with 'a'; this way the files will be appended to your zip (with the side effect that, if you do this several times, files will be added more than once).
Keep the 'w', but only open the zip once, having listed all the files you want to zip before. See my code below.
I've taken the liberty to rewrite the part of your code where you look for the 'RadarIfxAvian' folder, since embedded for are clumsy (and if your folder structure changes, they might not work), replacing it with a multi-purpose recursive function.
Note that the folder structure will be included in the .zip; if you want to zip only the files themselves, consider doing os.chdir(radar_folder_name) before zipping the files.
# This function recursively looks for the 'filename' file or folder
# under 'start_path' and returns the full path, or an empty string if not found.
def find_file(start_path, filename):
if filename in os.listdir(start_path):
return start_path + '/' + filename
for file in os.scandir(start_path):
if not file.is_dir():
continue
if (deep_path:=find_file(start_path + '/' + file.name, filename)):
return deep_path
return ''
directory = os.path.join(os.getcwd(), 'recs')
radarfolder = 'RadarIfxAvian'
radar_folder_name = find_file(directory, radarfolder)
print(radar_folder_name)
list_files = []
for file in os.listdir(radar_folder_name):
if file.endswith(".npy") or file.endswith(".csv"):
list_files.append(file)
with zipfile.ZipFile(radar_folder_name + '/' + 'radar.zip', 'w', compression=zipfile.ZIP_DEFLATED ) as zipMe:
for file in list_files:
zipMe.write(radar_folder_name + '/' + file)

If I understand your code correctly, you are looking for a folder "RadarIfxAvian" and want to place a .ZIP in that folder containing any .CSV or .NPY files in that directory. This should do the equivalent, using os.walk for the recursive search:
import os
import zipfile
for path, dirs, files in os.walk('recs'):
if os.path.basename(path) == 'RadarIfxAvian':
print(path)
with zipfile.ZipFile(os.path.join(path, 'radar.zip'), 'w', zipfile.ZIP_DEFLATED) as zip:
for file in files:
if file.endswith(".npy") | file.endswith(".csv"):
print(file)
zip.write(file)
break # stop search once the directory is found and processed

I adjusted my code with the following steps:
Put the if in a function
writing the the zip by looping over each item in the list I appended
import json
import os
import glob
import zipfile
import zlib
directory = os.path.join(os.getcwd(), 'recs')
radarfolder = 'RadarIfxAvian'
file = os.listdir(directory)
list_files = []
def r(p, name):
p = os.path.join(p, name)
return p.replace("/", "\\")
def tozip(path, file):
filestozip = []
if file.endswith(".npy") or file.endswith(".csv"):
filestozip = (path + '\\' + file)
list_files.append(filestozip)
return list_files
#This code will list all json files in each file
for first in os.scandir(directory):
if first.is_dir():
for second in os.scandir(first.path):
if second.is_dir():
for third in os.scandir(second.path):
if third.is_dir():
radar_folder_name = ''
filestozip = []
list_files.clear()
for fourth in os.scandir(third.path):
if fourth.is_dir():
if radarfolder in fourth.path:
radar_folder_name = fourth.path
for file in os.listdir(fourth.path):
filestozip = tozip(radar_folder_name,file)
print(filestozip)
ZipFile = zipfile.ZipFile(r(radar_folder_name,"radar.zip"), "w")
for a in filestozip:
ZipFile.write(a, compress_type= zipfile.ZIP_DEFLATED)
print(radar_folder_name + "added to zip")

Finding Files in File Tree Based on List of Extensions

I'm working on a small python 3 utility to build a zip file based on a list of file extensions. I have a text file of extensions and I'm passing a folder into the script:
working_folder = sys.argv[1]
zip_name = sys.argv[2]
#Open the extension file
extensions = []
with open('CxExt.txt') as fp:
lines = fp.readlines()
for line in lines:
extensions.append(line)
#Now get the files in the directory. If they have the right exttension add them to the list.
files = os.listdir(working_folder)
files_to_zip = []
for ext in extensions:
results = glob.glob(working_folder + '**/' + ext, recursive=True)
print(str(len(results)) + " results for " + working_folder + '**/*' + ext)
#search = "*"+ext
#results = [y for x in os.walk(working_folder) for y in glob(os.path.join(x[0], search))]
#results = list(Path(".").rglob(search))
for result in results:
files_to_zip.append(result)
if len(files_to_zip) == 0:
print("No Files Found")
sys.exit()
for f in files:
print("Checking: " + f)
filename, file_extension = os.path.splitext(f)
print(file_extension)
if file_extension in extensions:
print(f)
files_to_zip.append(file)
ZipFile = zipfile.ZipFile(zip_name, "w" )
for z in files_to_zip:
ZipFile.write(os.path.basename(z), compress_type=zipfile.ZIP_DEFLATED)
I've tried using glob, os.walk, and Path.rglob and I still can't get a list of files. There's got to be something just obvious that I'm missing. I built a test directory that has some directories, py files, and a few zip files. It returns 0 for all file types. What am I overlooking?

This is my first answer, so please don't expect it to be perfect.
I notice you're using file.readlines(). According to the Python docs here, file.readlines() returns a list of lines including the newline at the end. If your text file has the extensions separated by newlines, maybe try using file.read().split("\n") instead. Besides that, your code looks okay. Tell me if this fix doesn't work.

How to search through both zipped and unzipped folders for a specific line

I'm trying to implement a Python script that takes a folder from the user (can be zipped or unzipped), and search through all the files in the folder to output the specific lines that my regular expression matches. My code below works for regular unzipped folders, but I can't figure out how to do the same with zipped folders that are inputted to function. Below are my code, thanks in advance!
def myFunction(folder_name):
path = folder_name
for (path, subdirs, files) in os.walk(path):
files = [f for f in os.listdir(path) if f.endswith('.txt') or f.endswith('.log') or f.endswith('-release') or f.endswith('.out') or f.endswith('messages') or f.endswith('.zip')] # Specify here the format of files you hope to search from (ex: ".txt" or ".log")
files.sort() # file is sorted list
files = [os.path.join(path, name) for name in files] # Joins the path and the name, so the files can be opened and scanned by the open() function
# The following for loop searches all files with the selected format
for filename in files:
#print('start parsing... ' + str(datetime.datetime.now()))
matched_line = []
try:
with open(filename, 'r', encoding = 'utf-8') as f:
f = f.readlines()
except:
with open(filename, 'r') as f:
f = f.readlines()
# print('Finished parsing... ' + str(datetime.datetime.now()))
for line in f:
#0strip out \x00 from read content, in case it's encoded differently
line = line.replace('\x00', '')
RE2 = r'^Version: \d.+\d.+\d.\w\d.+'
RE3 = r'^.+version.(\d+.\d+.\d+.\d+)'
pattern2 = re.compile('('+RE2+'|'+RE3+')', re.IGNORECASE)
for match2 in pattern2.finditer(line):
matched_line.append(line)
print(line)
#Calling the function to use it
myFunction(r"SampleZippedFolder.zip")
The try and except block of my code was my attempt to open the zipped folder and read it. I'm still not very clear with how to open the zipped folder or how it works. Please let me know how I can modify my code to make it work, much appreciated!

One possibility is first determine what object type folder_name is using zipfile and os.isdir() and whichever one succeeds, get the list of files and proceed. Maybe something like this:
import zipfile, os, re
def myFunction(folder_name):
files = None # nothing yet
path = folder_name
if zipfile.is_zipfile(path):
print('ZipFile: {}'.format(path))
f = zipfile.ZipFile(path)
files = f.namelist()
# for name in f.namelist(): # debugging
# print('file: {}'.format(name))
elif os.path.isdir(path):
print('Folder: {}'.format(path))
files = os.listdir(path)
# for name in os.listdir(path): # debugging
# print('file: {}'.format(name))
# should now have a list of files
# proceed processing the files
for filename in files:
...

renaming the extracted file from zipfile

I have lots of zipped files on a Linux server and each file includes multiple text files.
what I want is to extract some of those text files, which have the same name across zipped files and save it a folder; I am creating one folder for each zipped file and extract the text file to it. I need to add the parent zipped folder name to the end of file names and save all text files in one directory. For example, if the zipped folder was March132017.zip and I extracted holding.txt, my filename would be holding_march13207.txt.
My problem is that I am not able to change the extracted file's name.
I would appreciate if you could advise.
import os
import sys
import zipfile
os.chdir("/feeds/lipper/emaxx")
pwkwd = "/feeds/lipper/emaxx"
for item in os.listdir(pwkwd): # loop through items in dir
if item.endswith(".zip"): # check for ".zip" extension
file_name = os.path.abspath(item) # get full path of files
fh = open(file_name, "rb")
zip_ref = zipfile.ZipFile(fh)
filelist = 'ISSUERS.TXT' , 'SECMAST.TXT' , 'FUND.TXT' , 'HOLDING.TXT'
for name in filelist :
try:
outpath = "/SCRATCH/emaxx" + "/" + os.path.splitext(item)[0]
zip_ref.extract(name, outpath)
except KeyError:
{}
fh.close()

import zipfile
zipdata = zipfile.ZipFile('somefile.zip')
zipinfos = zipdata.infolist()
# iterate through each file
for zipinfo in zipinfos:
# This will do the renaming
zipinfo.filename = do_something_to(zipinfo.filename)
zipdata.extract(zipinfo)
Reference:
https://bitdrop.st0w.com/2010/07/23/python-extracting-a-file-from-a-zip-file-with-a-different-name/

Why not just read the file in question and save it yourself instead of extracting? Something like:
import os
import zipfile
source_dir = "/feeds/lipper/emaxx" # folder with zip files
target_dir = "/SCRATCH/emaxx" # folder to save the extracted files
# Are you sure your files names are capitalized in your zip files?
filelist = ['ISSUERS.TXT', 'SECMAST.TXT', 'FUND.TXT', 'HOLDING.TXT']
for item in os.listdir(source_dir): # loop through items in dir
if item.endswith(".zip"): # check for ".zip" extension
file_path = os.path.join(source_dir, item) # get zip file path
with zipfile.ZipFile(file_path) as zf: # open the zip file
for target_file in filelist: # loop through the list of files to extract
if target_file in zf.namelist(): # check if the file exists in the archive
# generate the desired output name:
target_name = os.path.splitext(target_file)[0] + "_" + os.path.splitext(file_path)[0] + ".txt"
target_path = os.path.join(target_dir, target_name) # output path
with open(target_path, "w") as f: # open the output path for writing
f.write(zf.read(target_file)) # save the contents of the file in it
# next file from the list...
# next zip file...

You could simply run a rename after each file is extracted right? os.rename should do the trick.
zip_ref.extract(name, outpath)
parent_zip = os.path.basename(os.path.dirname(outpath)) + ".zip"
new_file_name = os.path.splitext(os.path.basename(name))[0] # just the filename
new_name_path = os.path.dirname(outpath) + os.sep + new_file_name + "_" + parent_zip
os.rename(outpath, new_namepath)
For the filename, if you want it to be incremental, simply start a count and for each file, go up by on.
count = 0
for file in files:
count += 1
# ... Do our file actions
new_file_name = original_file_name + "_" + str(count)
# ...
Or if you don't care about the end name you could always use something like a uuid.
import uuid
random_name = uuid.uuid4()

outpath = '/SCRATCH/emaxx'
suffix = os.path.splitext(item)[0]
for name in filelist :
index = zip_ref.namelist().find(name)
if index != -1: # check the file exists in the zipfile
filename, ext = os.path.splitext(name)
zip_ref.filelist[index].filename = f'{filename}_{suffix}.{ext}' # rename the extracting file to the suffix file name
zip_ref.extract(zip_ref.filelist[index], outpath) # use the renamed file descriptor to extract the file

I doubt this is possible to rename file during their extraction.
What about renaming files once they are extracted ?
Relying on linux bash, you can achieve it in a one line :
os.system("find "+outpath+" -name '*.txt' -exec echo mv {} `echo {} | sed s/.txt/"+zipName+".txt/` \;")
So, first we search all txt files in the specified folder, then exec the renaming command, with the new name computed by sed.
Code not tested, i'm on windows now ^^'

Exporting multiple files with different filenames

Lets say I have n files in a directory with filenames: file_1.txt, file_2.txt, file_3.txt .....file_n.txt. I would like to import them into Python individually and then do some computation on them, and then store the results into n corresponding output files: file_1_o.txt, file_2_o.txt, ....file_n_o.txt.
I've figured out how to import multiple files:
import glob
import numpy as np
path = r'home\...\CurrentDirectory'
allFiles = glob.glob(path + '/*.txt')
for file in allFiles:
# do something to file
...
...
np.savetxt(file, ) ???
Not quite sure how to append the _o.txt (or any string for that matter) after the filename so that the output file is file_1_o.txt

Can you use the following snippet to build the output filename?
parts = in_filename.split(".")
out_filename = parts[0] + "_o." + parts[1]
where I assumed in_filename is of the form "file_1.txt".
Of course would probably be better to put "_o." (the suffix before the extension) in a variable so that you can change at will just in one place and have the possibility to change that suffix more easily.
In your case it means
import glob
import numpy as np
path = r'home\...\CurrentDirectory'
allFiles = glob.glob(path + '/*.txt')
for file in allFiles:
# do something to file
...
parts = file.split(".")
out_filename = parts[0] + "_o." + parts[1]
np.savetxt(out_filename, ) ???
but you need to be careful, since maybe before you pass out_filename to np.savetxt you need to build the full path so you might need to have something like
np.savetxt(os.path.join(path, out_filename), )
or something along those lines.
If you would like to combine the change in basically one line and define your "suffix in a variable" as I mentioned before you could have something like
hh = "_o." # variable suffix
..........
# inside your loop now
for file in allFiles:
out_filename = hh.join(file.split("."))
which uses another way of doing the same thing by using join on the splitted list, as mentioned by #NathanAck in his answer.

import os
#put the path to the files here
filePath = "C:/stack/codes/"
theFiles = os.listdir(filePath)
for file in theFiles:
#add path name before the file
file = filePath + str(file)
fileToRead = open(file, 'r')
fileData = fileToRead.read()
#DO WORK ON SPECIFIC FILE HERE
#access the file through the fileData variable
fileData = fileData + "\nAdd text or do some other operations"
#change the file name to add _o
fileVar = file.split(".")
newFileName = "_o.".join(fileVar)
#write the file with _o added from the modified data in fileVar
fileToWrite = open(newFileName, 'w')
fileToWrite.write(fileData)
#close open files
fileToWrite.close()
fileToRead.close()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to read files from two folders and avoid duplicates in Python - python

Related

How to zip files that ends with certain extension

Finding Files in File Tree Based on List of Extensions

How to search through both zipped and unzipped folders for a specific line

renaming the extracted file from zipfile

Exporting multiple files with different filenames

Categories

Resources