I have two PDFs that I would like to determine if they're equal. I don't care about the byte level but more so on appearance of the PDF. It seems that the Python package filecmp doesn't do the comparison justice; I get False when I run
filecamp.cmp(Path1, Path2)
when I know the PDFs are the same. Does anyone have any solutions to this?
They're certain packages to implement this. One that I used for my needs is diff-pdf-visually. I had to do some additional installations but otherwise works quite well.
I feel this question needs a better title and I will amend it if someone suggests something better. The problem is I'm not sure of the terminology of the feature that I'm using here.
The best way to describe my problem is to show what I've done. The project is here: https://github.com/jeffnyman/quendor
This project is setup so it can be executed as a module. For example, from the project root someone could do this:
python3 -m quendor
I also have a build script to generate an in-memory zip (if I'm using that terminology correctly):
https://github.com/jeffnyman/quendor/blob/master/build.py
That works in that if you run build.py it will generate a quendor.py file that executes the entire project. That worked fine up until I included other directories (like my utilities and zinterface).
With the project as it is in the repo right now, if you run the build (.\build.py) and then run the generated file:
./quendor.py
You get the following error:
File "./quendor.py/quendor/__main__.py", line 6, in <module>
ModuleNotFoundError: No module named 'quendor.zinterface'
So a key point: if all of my files are in the same directory (i.e., in quendor) this build script works fine in terms of producing an executable script file.
But once I include the subdirectories and files in those directories, things go south on me with the above error.
I'm sure all the files are being gathered. I handle that starting on line 18 (https://github.com/jeffnyman/quendor/blob/master/build.py#L18). And if you were to add to line 24 this statement:
print(f"* {file_path}")
You would see it outputs the following:
* quendor/__init__.py
* quendor/__version__.py
* quendor/zinterface/fileio.py
* quendor/utilities/messages.py
* quendor/__main__.py
So I'm suspecting it might have to do with the code where I write the string at line 28 (https://github.com/jeffnyman/quendor/blob/master/build.py#L28). I feel I have to do more to let the executable zipped script file know about the modules.
But I'm not sure if (1) I'm accurate and (2) even if I'm accurate, if that's possible. I'm finding I'm in a bit over my head here.
Any thoughts would be appreciated and I'm happy to update with any necessarily clarifications or terminology.
So it won't let me comment unless I have more reputation but I can post an answer. Even though I don't have an answer, but rather a comment. I think the above comment was not meant for your actual __main__.py file but rather the one that is getting generated in your quendor.py file. You might want to try adding the import statements to your packed string that you write.
For example, see what happens if on line 32 you add this: import quendor.zinterface.fileio as zio. (Don't replace the line that's there. Just put my line and then keep your others.) I'm not sure how the zip process works but if it tries to mirror the module process that should work. However, if it doesn't, that won't work. You might also just want to try doing import quendor.zinterface. By itself that won't work but it would be interesting to see if it gave you a different error.
Actually, it turns out I found a way to do this! It required using os.walk rather than os.listdir. This required taking a few ideas that people here discussed. Here is the script that does the trick:
https://github.com/jeffnyman/quendor/blob/master/build.py
You can compare that with my previous commit that was trying to handle this a different way.
Eldritch was right that I couldn't just flatten the directory nor could I just add imports to the string I was writing to the final zip file. Jean-François was correct that I had to focus on the __main__.py that was being generated. My contribution was figuring out os.walk() and then parameterizing the written string to handle the different directories.
Finally, this solution does require, as per HTF's suggestion, that I put an empty __init__.py file in each package.
With my solution in place, you can run build.py which then generates the quendor.py script. That script then executes correctly, in terms of recognizing the imports to various packages.
Playing around with just about every variation of import and file gathering that I can think of with your repo, there's a good news / bad news thing.
The bad news is that the answer is this: it isn't possible.
The good news is this: you do have a working implementation if you just keep all files in the quendor directory rather than having subdirectories.
The other good news is you stumbled on something, and posed a problem, that Python gurus aren't able to answer. And there's a certain pleasure to be found in that! I guarantee you will not get an answer to this that works (except for the "all files in one directory" solution).
A refinement to the answer is that if you're setting up the program to run as a module anyway, just use a pip configuration. That basically does the same thing that you want but without having to go through the contortions. (Unless there's a reason you were doing the build the way you were rather than using pip.)
I saw this answer - How can I search sub-folders using glob.glob module in Python? but I didn't understand what os.walk() was doing (I read the docs but it didn't quite make sense).
I'm really new to pathing and still trying to make sense of it.
I have a script that lives in - /users/name/Desktop/folder/ and I want to access some files in /users/name/Desktop/folder/subfolder/*.html
I tried glob.glob('/users/name/Desktop/folder/subfolder/*.html') but it returned an empty list. I realize this is what the previous person did and it didn't work (I was just hoping that glob had been updated!)
Any thoughts on how to do this?
Without any further information it's hard to say what the issue is. I tested your syntax, it works fine for me. Are you sure the extension is .html not .htm in your /users/name/Desktop/folder/subfolder/ directory?
Also, to further troubleshoot you can check what python can see in you directory by running:
os.listdir('/users/name/Desktop/folder/subfolder/')
This should get you started.
I am looking for an efficient implementation for finding modified files in multiple directories.
I do know that I can just recursively go through all those directories and check the modified date of those files I wanna check. This is quite trivial.
But what if I have a complex folder structure with many files in it? The upper approach won't really scale and might take up several minutes.
Is there a better approach to probe for modifications? Like is there something like a checksum on folders that I could use to narrow down the number of folders I have to check for modifications or anything like that?
A second step to my problem is also finding newly created files.
I am looking for a python based solution which is windows compatible
I'm trying to write a project that will have some autonomous components. One of these is the need to diff two folders and spit out the different files into an array of strings. Dircmp does part of this - it spits out the different files. But it would appear it doesn't actually go into the remaining files to see which are different when compared against the same file in a different folder.
Currently I've played with difflib and filecmp, and unless I'm doing something entirely wrong I can't find a way to achieve what I'm looking for without writing it all from scratch. The reason I need this is because this python script will be deployed on windows boxen where the standard linux diff tools will not be available.
My only other thought would be to just call diff and such from the command line, but that doesn't solve either of my problems (getting the files in an array AND not requiring GNU tools).
Can anyone help me? I'm still a total scrub at python and would really appreciate the expert advice. Thank you!
It seems that filecmp.dircmp does what you want already. If you compare two directories, diff_files will be a list of files which are in both directories, but whose contents differ:
>>> dc = filecmp.dircmp('dir1', 'dir2')
>>> dc.diff_files
<<< ['foo']
As pointed out by Jonathanb, if you want actual diffs, it's easy to use difflib at this point to do so.