I have two PDFs and would like to determine whether they're equal. I don't care about byte-level equality, only about how the PDFs look. The Python module filecmp doesn't seem to do the comparison justice; I get False when I run
filecmp.cmp(Path1, Path2)
when I know the PDFs are the same. Does anyone have any solutions to this?
There are packages that implement this. One that I used for my needs is diff-pdf-visually. I had to do some additional installations, but otherwise it works quite well.
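For context on why filecmp reports False: filecmp.cmp compares the files' bytes (or, with shallow=True, first their os.stat signatures), so two PDFs that look the same but differ in something like an embedded creation date compare as unequal. A minimal sketch of that behaviour, using made-up byte strings as stand-ins rather than real PDFs (the helper names are hypothetical):

```python
import filecmp
import os
import tempfile

def bytes_equal(path_a, path_b):
    """Return True only if both files contain exactly the same bytes."""
    # shallow=False forces a content comparison instead of trusting os.stat().
    return filecmp.cmp(path_a, path_b, shallow=False)

def _write(directory, name, data):
    """Hypothetical helper: write data to directory/name and return the path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(data)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for two exports of the same document (not real PDFs):
    # identical "page content", different embedded creation date.
    first = _write(tmp, "a.pdf", b"/CreationDate (D:20200101) page content")
    second = _write(tmp, "b.pdf", b"/CreationDate (D:20200102) page content")
    copy = _write(tmp, "c.pdf", b"/CreationDate (D:20200101) page content")

    metadata_differs = bytes_equal(first, second)
    exact_copy = bytes_equal(first, copy)
    print(metadata_differs, exact_copy)
```

A comparison by appearance has to look at the rendered pages instead of the raw bytes, which is why a separate tool is needed for that.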
I feel this question needs a better title and I will amend it if someone suggests something better. The problem is I'm not sure of the terminology of the feature that I'm using here.
The best way to describe my problem is to show what I've done. The project is here: https://github.com/jeffnyman/quendor
This project is set up so it can be executed as a module. For example, from the project root someone could do this:
python3 -m quendor
I also have a build script to generate an in-memory zip (if I'm using that terminology correctly):
https://github.com/jeffnyman/quendor/blob/master/build.py
That works in that if you run build.py it will generate a quendor.py file that executes the entire project. That worked fine up until I included other directories (like my utilities and zinterface).
With the project as it is in the repo right now, if you run the build (.\build.py) and then run the generated file:
./quendor.py
You get the following error:
File "./quendor.py/quendor/__main__.py", line 6, in <module>
ModuleNotFoundError: No module named 'quendor.zinterface'
So a key point: if all of my files are in the same directory (i.e., in quendor) this build script works fine in terms of producing an executable script file.
But once I include the subdirectories and files in those directories, things go south on me with the above error.
I'm sure all the files are being gathered. I handle that starting on line 18 (https://github.com/jeffnyman/quendor/blob/master/build.py#L18). And if you were to add to line 24 this statement:
print(f"* {file_path}")
You would see it outputs the following:
* quendor/__init__.py
* quendor/__version__.py
* quendor/zinterface/fileio.py
* quendor/utilities/messages.py
* quendor/__main__.py
So I'm suspecting it might have to do with the code where I write the string at line 28 (https://github.com/jeffnyman/quendor/blob/master/build.py#L28). I feel I have to do more to let the executable zipped script file know about the modules.
But I'm not sure if (1) I'm accurate and (2) even if I'm accurate, if that's possible. I'm finding I'm in a bit over my head here.
Any thoughts would be appreciated and I'm happy to update with any necessary clarifications or terminology.
The site won't let me comment until I have more reputation, so I'm posting this as an answer even though it's really a comment. I think the above comment was not meant for your actual __main__.py file but rather the one that is getting generated in your quendor.py file. You might want to try adding the import statements to your packed string that you write.
For example, see what happens if on line 32 you add this: import quendor.zinterface.fileio as zio. (Don't replace the line that's there. Just put my line and then keep your others.) I'm not sure how the zip process works but if it tries to mirror the module process that should work. However, if it doesn't, that won't work. You might also just want to try doing import quendor.zinterface. By itself that won't work but it would be interesting to see if it gave you a different error.
Actually, it turns out I found a way to do this! It required using os.walk rather than os.listdir. This required taking a few ideas that people here discussed. Here is the script that does the trick:
https://github.com/jeffnyman/quendor/blob/master/build.py
You can compare that with my previous commit that was trying to handle this a different way.
Eldritch was right that I couldn't just flatten the directory nor could I just add imports to the string I was writing to the final zip file. Jean-François was correct that I had to focus on the __main__.py that was being generated. My contribution was figuring out os.walk() and then parameterizing the written string to handle the different directories.
Finally, this solution does require, as per HTF's suggestion, that I put an empty __init__.py file in each package.
With my solution in place, you can run build.py which then generates the quendor.py script. That script then executes correctly, in terms of recognizing the imports to various packages.
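For readers who don't want to dig through the repo, here is a minimal sketch of the general idea, not the actual build.py. It assumes a package laid out like quendor (every subdirectory has an __init__.py); build_zip_script and the demo_pkg files are hypothetical names made up for the illustration. The key point is that os.walk visits the subdirectories and each file is stored under its path relative to the package's parent, so the package structure survives inside the zip.

```python
import os
import subprocess
import sys
import tempfile
import zipfile

def build_zip_script(package_dir, output_path):
    """Pack package_dir (including subpackages) into a zip Python can run."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    package_name = os.path.basename(package_dir)
    with zipfile.ZipFile(output_path, "w") as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in sorted(files):
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store e.g. "demo_pkg/sub/tool.py" so that it stays
                    # importable as demo_pkg.sub.tool from inside the zip.
                    arcname = os.path.relpath(full_path, parent)
                    archive.write(full_path, arcname.replace(os.sep, "/"))
        # The top-level __main__.py is what Python executes for a zip file.
        archive.writestr(
            "__main__.py",
            f"import runpy\nrunpy.run_module({package_name!r}, run_name='__main__')\n",
        )
    return output_path

with tempfile.TemporaryDirectory() as tmp:
    # Throwaway package with one subpackage, standing in for the real project.
    pkg = os.path.join(tmp, "src", "demo_pkg")
    os.makedirs(os.path.join(pkg, "sub"))
    sources = {
        "__init__.py": "",
        "__main__.py": "from demo_pkg.sub.tool import greet\nprint(greet())\n",
        os.path.join("sub", "__init__.py"): "",
        os.path.join("sub", "tool.py"): "def greet():\n    return 'hello from sub'\n",
    }
    for relative, text in sources.items():
        with open(os.path.join(pkg, relative), "w") as handle:
            handle.write(text)

    script = build_zip_script(pkg, os.path.join(tmp, "demo.pyz"))
    result = subprocess.run([sys.executable, script], capture_output=True, text=True)
    output = result.stdout.strip()
    print(output)
```

The standard library's zipapp module covers a similar use case and may be worth a look if you don't need a custom build step.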
After playing around with just about every variation of import and file gathering that I can think of with your repo, I have good news and bad news.
The bad news is that the answer is this: it isn't possible.
The good news is this: you do have a working implementation if you just keep all files in the quendor directory rather than having subdirectories.
The other good news is you stumbled on something, and posed a problem, that Python gurus aren't able to answer. And there's a certain pleasure to be found in that! I guarantee you will not get an answer to this that works (except for the "all files in one directory" solution).
A refinement to the answer is that if you're setting up the program to run as a module anyway, just use a pip configuration. That basically does the same thing that you want but without having to go through the contortions. (Unless there's a reason you were doing the build the way you were rather than using pip.)
I am looking for an efficient implementation for finding modified files in multiple directories.
I do know that I can just recursively go through all those directories and check the modified date of those files I wanna check. This is quite trivial.
But what if I have a complex folder structure with many files in it? The approach above won't really scale and might take several minutes.
Is there a better approach to probe for modifications? Like is there something like a checksum on folders that I could use to narrow down the number of folders I have to check for modifications or anything like that?
A second step to my problem is also finding newly created files.
I am looking for a Python-based solution that is Windows-compatible.
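One possible direction, as a sketch rather than a definitive answer: a directory's own modification time generally changes only when entries are added, removed, or renamed, not when an existing file's content is edited, so it can't act as a folder checksum for edits. What can be done is to keep a snapshot of per-file modification times and compare it with a later one, which yields both modified and newly created files. The sketch below uses os.scandir, which is available on Windows; snapshot and compare are hypothetical names.

```python
import os
import tempfile

def snapshot(root):
    """Map every file path under root to its modification time in nanoseconds."""
    result = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    result[entry.path] = entry.stat(follow_symlinks=False).st_mtime_ns
    return result

def compare(old, new):
    """Return (created, modified) path lists between two snapshots."""
    created = sorted(path for path in new if path not in old)
    modified = sorted(path for path in new if path in old and new[path] != old[path])
    return created, modified

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    a_path = os.path.join(tmp, "a.txt")
    for path in (a_path, os.path.join(tmp, "sub", "b.txt")):
        with open(path, "w") as handle:
            handle.write("original")
    before = snapshot(tmp)

    # Simulate a later edit by pushing a.txt's mtime five seconds forward,
    # then create a brand-new file in the subdirectory.
    os.utime(a_path, ns=(before[a_path], before[a_path] + 5_000_000_000))
    with open(os.path.join(tmp, "sub", "c.txt"), "w") as handle:
        handle.write("new")

    created, modified = compare(before, snapshot(tmp))
    created_names = [os.path.basename(path) for path in created]
    modified_names = [os.path.basename(path) for path in modified]
    print(created_names, modified_names)
```

This still visits every directory on each scan, so it reduces the per-file cost rather than avoiding the walk. In a real setup the snapshot would be saved between runs (for example as JSON).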
You can see what I did in this link.
I am a beginner with Python and I have to do the same thing for my work. (link included)
My filecmp code compares the directories and takes the files that differ. Everything is okay with ls -1R /tmp/test/current/ and ls -1R /tmp/test/old/, but the third listing, the most important one, does not give me any results:
ls -1R /users/diff/newdiff
I would appreciate your answer. What should I change to get results?
try this:
dircmp dir1 dir2
description is here: http://www.computerhope.com/unix/udircmp.htm
works for Windows too. You may have to manipulate the output depending on what your needs are, but that should be fairly simple. I use it all the time.
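Since the question mentions Python's filecmp, the same idea can also be sketched with filecmp.dircmp. This is not the asker's code (that was only linked); changed_files and copy_changed are hypothetical names, and the directory names in the demo mirror the ones from the question. Note that dircmp compares files shallowly by default (os.stat signature first, contents only if the signatures differ).

```python
import filecmp
import os
import shutil
import tempfile

def changed_files(current_dir, old_dir):
    """Relative paths that exist only in current_dir or differ from old_dir."""
    found = []

    def visit(comparison, prefix):
        # left_only: present only in current_dir (a file or a whole directory)
        # diff_files: present on both sides but compared as different
        for name in comparison.left_only + comparison.diff_files:
            found.append(os.path.join(prefix, name))
        for name, sub in comparison.subdirs.items():
            visit(sub, os.path.join(prefix, name))

    visit(filecmp.dircmp(current_dir, old_dir), "")
    return sorted(found)

def copy_changed(current_dir, old_dir, target_dir):
    """Copy new or changed files into target_dir, keeping relative paths."""
    copied = []
    for relative in changed_files(current_dir, old_dir):
        source = os.path.join(current_dir, relative)
        if os.path.isfile(source):  # directories found only on one side are skipped
            destination = os.path.join(target_dir, relative)
            os.makedirs(os.path.dirname(destination), exist_ok=True)
            shutil.copy2(source, destination)
            copied.append(relative.replace(os.sep, "/"))
    return copied

with tempfile.TemporaryDirectory() as tmp:
    current = os.path.join(tmp, "current")
    old = os.path.join(tmp, "old")
    newdiff = os.path.join(tmp, "newdiff")
    for base in (current, old):
        os.makedirs(os.path.join(base, "sub"))
    files = {
        (current, "a.txt"): "new version",
        (old, "a.txt"): "old",
        (current, "same.txt"): "unchanged",
        (old, "same.txt"): "unchanged",
        (current, os.path.join("sub", "b.txt")): "only in current",
    }
    for (base, relative), text in files.items():
        with open(os.path.join(base, relative), "w") as handle:
            handle.write(text)

    copied = copy_changed(current, old, newdiff)
    copied_exists = os.path.isfile(os.path.join(newdiff, "sub", "b.txt"))
    print(copied)
```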
I would like to be able to compare a binary file X to a directory of other binary files and find which other file is most similar to X. The nature of the data is such that identical chunks will exist between files, but possibly shifted in location. The files are all 1MB in size, and there are about 200 of them. I would like to have something quick enough to analyze these in a few minutes or less on a modern desktop computer.
I've googled a bit and found a few different binary diff utilities, but none of them seem appropriate for my application.
For example, there is bsdiff, which looks like it creates a patch file optimized for size, and vbindiff, which just displays the differences graphically, but neither really helps me figure out whether one file is more similar to X than another file.
If there is not a tool that I can use directly for this purpose, is there a good library someone could recommend for writing my own utility? Python would be preferable, but I'm flexible.
Here's a simple Perl script which more or less tries to do exactly that.
Edit: Also have a look at the following Stack Overflow thread.
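Since Python was the preferred language, here is one possible approach as a rough sketch. It is not the Perl script mentioned above, and it is not tuned for speed. The assumption is that "similar" can be approximated by how many short byte windows two files share: taking a window at every offset means a chunk still matches when it has moved, and the Jaccard ratio of the two window sets gives a score between 0 and 1. The function names and the window width of 8 bytes are made up for the illustration.

```python
import os
import random
import tempfile

def shingles(data, width=8):
    """Set of every width-byte window in data, taken at every offset."""
    return {data[i:i + width] for i in range(len(data) - width + 1)}

def jaccard(left, right):
    """Shared windows divided by all distinct windows (0.0 to 1.0)."""
    union = left | right
    return len(left & right) / len(union) if union else 1.0

def rank_by_similarity(target_path, directory, width=8):
    """Return (score, file name) pairs for directory, most similar first."""
    with open(target_path, "rb") as handle:
        target = shingles(handle.read(), width)
    scores = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                scores.append((jaccard(target, shingles(handle.read(), width)), name))
    return sorted(scores, reverse=True)

def random_bytes(rng, count):
    """Deterministic pseudo-random test data."""
    return bytes(rng.getrandbits(8) for _ in range(count))

with tempfile.TemporaryDirectory() as tmp:
    rng = random.Random(0)
    chunk_a, chunk_b = random_bytes(rng, 2000), random_bytes(rng, 2000)
    candidates = os.path.join(tmp, "candidates")
    os.makedirs(candidates)
    samples = {
        # Same two chunks as X, but reordered and shifted by 137 bytes.
        "shifted.bin": random_bytes(rng, 137) + chunk_b + chunk_a,
        "unrelated.bin": random_bytes(rng, 4000),
    }
    for name, data in samples.items():
        with open(os.path.join(candidates, name), "wb") as handle:
            handle.write(data)
    target_path = os.path.join(tmp, "x.bin")
    with open(target_path, "wb") as handle:
        handle.write(chunk_a + chunk_b)

    ranking = rank_by_similarity(target_path, candidates)
    print(ranking[0][1])
```

For 200 files of 1MB each, building a full window set per file in pure Python may be slow and memory-hungry; keeping only a sample of the windows (for example those whose hash falls in a fixed residue class) is a common way to cut that down, at the cost of some accuracy.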