How to delete almost duplicate files

How to delete almost duplicate files - python

Edit 2:
Solved, see my answer waaaaaaay below.
Edit:
After banging my head a few times, I almost did it.
Here's my (not cleaned up, you can tell I was troubleshooting a bunch of stuff) code:
http://pastebin.com/ve4Qkj2K
And here's the problem: It works sometimes and other times not so much. For example, it will work perfectly with some files, then leave one of the longest codes instead of the shortest one, and for others it will delete maybe 2 out of 5 duplicates, leaving 3 behind. If it just performed reliably, I might be able to fix it, but I don't understand the seemingly random behavior. Any ideas?
Original Post:
Just so you know, I'm just beginning with python, and I'm using python 3.3
So here's my problem:
Let's say I have a folder with about 5,000 files in it. Some of these files have very similar names, but different contents and possibly different extensions. After a readable name, there is a code, always with a "(" or a "[" (no quotes) before it. The name and code are separated by a space. For example:
something (TZA).blah
something [TZZ].another
hello (YTYRRFEW).extension
something (YJTR).another_ext
I'm trying to keep only one of the "something" files and delete the others. Another fact which may be important is that there is usually more than one code, such as "something (THTG) (FTGRR) [GTGEES!#!].yet_another_random_extension", all separated by spaces. Although it doesn't matter 100%, it would be best to save the one with the fewest codes.
I made some (very very short) code to get a list of all files:
import glob
# glob.glob("*") returns every non-hidden entry in the current directory
files = glob.glob("*")
but after this I'm pretty much lost. Any help would be appreciated, even if it's just pointing me in the right direction!

I would suggest creating a separate array of bare file names and then checking, for each element, whether it also appears anywhere else in the array, that is, among all indices except the one being checked in the current loop iteration.
The
if str_fragment in name
condition simply finds any string fragment in any string-type name. It can be useful as well.
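The bare-name idea above can be sketched roughly as follows. This is only an illustration, not the asker's final code: the helper names split_name and pick_files_to_delete are made up, a "code" is assumed to be any text wrapped in (...) or [...], and the bare name is assumed to be everything before the first " (" or " [". The sketch only returns the names it would delete; nothing is removed from disk.

```python
import os
import re
from collections import defaultdict

# Assumption: a code is any run of text wrapped in (...) or [...].
CODE_RE = re.compile(r"\([^)]*\)|\[[^\]]*\]")

def split_name(filename):
    """Return (bare_name, number_of_codes) for one file name."""
    match = re.search(r" [(\[]", filename)
    if match:
        bare = filename[:match.start()]
    else:
        # No code at all: fall back to the name without its extension.
        bare = os.path.splitext(filename)[0]
    return bare, len(CODE_RE.findall(filename))

def pick_files_to_delete(filenames):
    """Group by bare name and keep the file with the fewest codes."""
    groups = defaultdict(list)
    for name in filenames:
        bare, n_codes = split_name(name)
        groups[bare].append((n_codes, name))
    to_delete = []
    for entries in groups.values():
        entries.sort()  # fewest codes first, ties broken alphabetically
        to_delete.extend(name for _, name in entries[1:])
    return sorted(to_delete)
```

After checking the returned list by eye, each name could be passed to os.remove. Because every group is handled in one go, a single pass is enough and no file is treated specially for being last in the list.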

I got it! The version I ended up with works (99%) perfectly. Although it needs to make multiple passes, reading, analyzing, and deleting over 2 thousand files took about 2 seconds on my pitiful, slow notebook. My final version is here:
http://pastebin.com/i7SE1mh6
The only tiny bug is that if the final item in the list has a duplicate, it will leave it there (and no more than 2). That's very simple to manually correct so I didn't bother to fix it (ain't nobody got time fo that and all).
Hope sometime in the future this could actually help somebody other than me.
I didn't get too many answers here, but it WAS a pretty unusual problem, so thanks anyway. See ya.

Related

Edit and run a sequence of scripts in a less-manual way [Python]

I am trying to find a way to edit and run a sequence of python scripts in a less manual way.
For context, I am running a series of simulations, which consists of running three codes in order 10 times, making minor changes to each code every time. The problem I am encountering is that this process leads to easy mistakes and chaotic work.
These are the type of edits I have to make to each code.
- Modify input/output file name
- Change value of a parameter
What is the best practice to deal with this? I imagine that the best idea would be to write another python script that does all this. Is there a way to edit other python codes, from within a code, and run them?
I don't intend or want anyone to write a code for me. I just need to be pointed in a general direction. I have searched for ways to 'automatize' codes, but haven't yet been successful in finding a solution to my query (mainly the editing part).
Thanks!

The thing that can change (files or parameter values) should be able to be either passed in or injected. Could be from a command line parameter, configuration file, or method argument. This is the "general direction" I offer.
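As a rough illustration of the "passed in" option, each script could read its file names and parameter from the command line with argparse instead of having them edited by hand. The option names --input, --output and --rate, and the function run_step, are hypothetical placeholders, not anything from the question.

```python
import argparse

def parse_args(argv=None):
    """Read the values that change between runs from the command line."""
    parser = argparse.ArgumentParser(description="One simulation step")
    parser.add_argument("--input", required=True, help="input file name")
    parser.add_argument("--output", required=True, help="output file name")
    # Hypothetical parameter; replace with whatever value changes per run.
    parser.add_argument("--rate", type=float, default=1.0)
    return parser.parse_args(argv)

def run_step(input_path, output_path, rate):
    """Placeholder for the real computation; returns a summary string."""
    return "{} -> {} (rate={})".format(input_path, output_path, rate)

def main(argv=None):
    args = parse_args(argv)
    return run_step(args.input, args.output, args.rate)
```

With the scripts written this way, a small driver script (or a shell loop) can run all three steps ten times with different arguments, and the scripts themselves never need to be edited between runs.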

Knowing exactly which part of a computation goes wrong in IntelliJ

I have the following python code inside a large loop
arr_a[indx]*arr_b[arr_c[indx],]
In one run an exception was raised saying the index was out of range, but there are two possibilities (indx is out of range for arr_a, or arr_c[indx] is out of range for arr_b). How do I know which part went wrong?
This problem also extends to the general case: when several operations are written on one line and something goes wrong, it is hard to tell which part caused it. Note that the expression above is inside a large loop, so one cannot simply start a debug session and step through to find out.

Add a print statement for each segment that might be to blame to see which one fails:
print arr_a[indx]
print arr_c[indx]
arr_a[indx]*arr_b[arr_c[indx],]
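If printing on every iteration of a large loop is too noisy, another option is to do each lookup separately and only report when one fails. The following is a minimal sketch that assumes plain Python sequences; the helper names lookup and checked_product are invented for the example.

```python
def lookup(seq, index, label):
    """Index into seq, naming the failing lookup if the index is out of range."""
    try:
        return seq[index]
    except IndexError:
        raise IndexError("%s is out of range (index=%r, length=%d)"
                         % (label, index, len(seq)))

def checked_product(arr_a, arr_b, arr_c, indx):
    """Compute arr_a[indx] * arr_b[arr_c[indx]] one lookup at a time."""
    a = lookup(arr_a, indx, "arr_a[indx]")
    c = lookup(arr_c, indx, "arr_c[indx]")
    b = lookup(arr_b, c, "arr_b[arr_c[indx]]")
    return a * b
```

The error message then names the lookup that failed together with the offending index, so the loop can run at full speed until the problem occurs.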

Maintaining two versions of an ipython notebook

I often need to create two versions of an ipython notebook: One contains tasks to be carried out (usually including some python code and output), the other contains the same text plus solutions. Let's call them the assignment and the solution.
It is easy to generate the solution document first, then strip the answers to generate the assignment (or vice versa). But if I subsequently need to make changes (and I always do), I need to repeat the stripping process. Is there a reasonable workflow that will allow changes in the assignment to be propagated to the solutions document?
Partial self-answer: I have experimented with leveraging mercurial's hg copy, which will let two files with different names share history. But I can only get this to work if assignment and solution are in different directories, in two linked hg repositories. I would much prefer a simpler set-up. I've also noticed that diff gets very confused when one JSON file has more sections than another, making a VCS-based solution even less attractive. (To be clear: Ordinary use of a VCS with notebooks is fine; it's the parallel versions that stumble).
This question covers similar ground, but does not solve my problem. In fact an answer to my question would solve the OP's second remaining problem, "pulling changes" (see the Update section).

It sounds like you are maintaining an assignment and an answer key of some kind and want to be able to distribute the assignments (without solutions) to students, and still have the answers for yourself or a TA.
For something like this, I would create two branches "unsolved" and "solved". First write the questions on the "unsolved" branch. Then create the "solved" branch from there and add the solutions. If you ever need to update a question, update back to the "unsolved" branch, make the update and merge the change into "solved" and fix the solution.
You could try going the other way, but my hunch is that going "backwards" from solved to unsolved might be strange to maintain.

After some experimentation I concluded that it is best to tackle this by processing the notebook's JSON code. Version control systems are not the right approach, for the following reasons:
JSON doesn't diff very well when adding or deleting cells. A minimal change leads to mis-matched braces and a very messy diff.
In my use case, the superset version of the file (containing both the assignments and their solutions) must be the source document. This is because the assignment includes example code and output that depends on earlier parts, to be written by the students. This model does not play well with version control, as pointed out by @ChrisPhillips in his answer.
I ended up filtering the JSON structure for the notebook and stripping out the solution cells; they may be recognized via special metadata (which can be set interactively using the metadata button in the interface), or by pattern-matching on the cell contents. The following snippet shows how to filter out cells whose first line starts with # SOLUTION:
import json
import re
def stripcell(cell, pattern):
    """Check if the first line of the cell's content matches `pattern`"""
    if cell["cell_type"] == "code":
        content = cell["input"]
    else:
        content = cell["source"]
    return len(content) > 0 and re.search(pattern, content[0]) is not None
pattern = r"^# SOLUTION:"
with open("input.ipynb") as infile:
    struct = json.load(infile)
# Keep only the cells whose first line does not match the pattern
cells = struct["worksheets"][0]["cells"]
struct["worksheets"][0]["cells"] = [c for c in cells if not stripcell(c, pattern)]
with open("output.ipynb", "w") as outfile:
    json.dump(struct, outfile, indent=1)
I used the generic json library rather than the notebook API. If there's a better way to go about it, please let me know.
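The snippet above targets the older notebook layout, which has a "worksheets" list. In the version 4 notebook format the cells sit in a top-level "cells" list, and both code and text cells keep their content under "source", which may be a single string or a list of lines. A minimal sketch of the same filter for that layout, still using only the json library, might look like this (the function names are illustrative and not part of any notebook API):

```python
import json
import re

def first_line(cell):
    """Return the first line of a cell's source, whether stored as str or list."""
    source = cell.get("source", "")
    if isinstance(source, list):
        return source[0] if source else ""
    return source.split("\n", 1)[0]

def strip_solutions(notebook, pattern=r"^# SOLUTION:"):
    """Return a copy of the notebook dict without the matching cells."""
    stripped = dict(notebook)
    stripped["cells"] = [c for c in notebook["cells"]
                         if not re.search(pattern, first_line(c))]
    return stripped

def convert(in_path, out_path):
    """Read a notebook file, drop the solution cells, write the result."""
    with open(in_path, encoding="utf-8") as infile:
        notebook = json.load(infile)
    with open(out_path, "w", encoding="utf-8") as outfile:
        json.dump(strip_solutions(notebook), outfile, indent=1)
```

The superset notebook stays the source document, and something like convert("solution.ipynb", "assignment.ipynb") would regenerate the assignment after every change.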

Python 2.7: How to begin parsing in the middle of a document

I am writing a program to parse the IETF Internet-drafts and pull out such things as title, date, protocol, and the countries of the authors. I realize this has been done before (arkko.com), but it's a little self-imposed programming exercise.
The problem I'm having is this:
Using some logic, some basic parsing, and
position = doc.tell()
I have precisely identified the point in each document where I need to begin examining lines and looking for, identifying, and pulling out the authors' countries of origin. And I can get to that precise point with:
doc.seek(position)
The problem I'm having is...then what? Having gotten to that position, I've tried every combination of file and string methods that I know to start parsing an arbitrary number of following lines, but I cannot make it work.
Sorry I don't have any full code snippets, but I've tried way too many and I think I might be barking up the entirely wrong tree at this point.
Edit: Actually I came up with a fairly simple solution:
I went through the file once, counted lines, and noted the line number of where I needed to begin parsing.
Then I went through the file again counting lines, and when the line numbers were greater than the first line number, I began parsing.
Probably not the most elegant solution in that I think I should have been able to use doc.seek() to avoid a second count, but it works. And now I know an area of string and file manipulation I need to explore a bit more.

You just need to call doc.read(some_buffer_length) and you'll get a string back.
How you deal with that string is a completely separate issue, but it doesn't matter if it comes from the beginning of the file, or not.
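For line-oriented parsing, a file object that has been positioned with seek() can also simply be iterated, and the lines come out starting from that position. Below is a small sketch of the tell()/seek() round trip; the marker string passed in and the two function names are assumptions for the example. Note that the first pass uses readline() rather than a for loop, because in Python 3 tell() is disabled while a text file is being iterated.

```python
def find_offset_after(path, marker):
    """Return the file position just after the first line containing marker."""
    with open(path) as doc:
        while True:
            line = doc.readline()
            if not line:
                return None  # reached end of file without finding the marker
            if marker in line:
                return doc.tell()

def lines_from(path, position):
    """Seek to a position obtained from tell() and return the remaining lines."""
    with open(path) as doc:
        doc.seek(position)
        return [line.rstrip("\n") for line in doc]
```

This avoids the second counting pass described in the question's edit: the saved position takes the place of the remembered line number.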

Do you have a hard time keeping to 80 columns with Python? [closed]

I find myself breaking strings constantly just to get them on the next line. And of course when I go to change those strings (think logging messages), I have to reformat the breaks to keep them within the 80 columns.
How do most people deal with this?

I recommend trying to stay true to 80 columns, but not at any cost. Sometimes, like for logging messages, it just makes more sense to keep them long than to break them up. But for most cases, like complex conditions or list comprehensions, breaking up is a good idea because it will help you divide the complex logic into more understandable parts. It's easier to understand:
print sum(n for n in xrange(1000000)
          if palindromic(n, bits_of_n) and palindromic(n, digits))
Than:
print sum(n for n in xrange(1000000) if palindromic(n, bits_of_n) and palindromic(n, digits))
It may look the same to you if you've just written it, but after a few days those long lines become hard to understand.
Finally, while PEP 8 dictates the column restriction, it also says:
A style guide is about consistency. Consistency with this style guide
is important. Consistency within a project is more important.
Consistency within one module or function is most important.
But most importantly: know when to be inconsistent -- sometimes the
style guide just doesn't apply. When in doubt, use your best judgment.
Look at other examples and decide what looks best. And don't hesitate
to ask!

"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines."
The important part is "foolish".
The 80-column limit, like other parts of PEP 8 is a pretty strong suggestion. But, there is a limit, beyond which it could be seen as foolish consistency.
I have the indentation guides and edge line turned on in Komodo. That way, I know when I've run over. The questions are "why?" and "is it worth fixing it?"
Here are our common situations.
logging messages. We try to make these easy to wrap. They look like this
logger.info( "unique thing %s %s %s",
    arg1, arg2, arg3 )
Django filter expressions. These can run on, but that's a good thing. We often
knit several filters together in a row. But it doesn't have to be one line of code,
multiple lines can make it more clear what's going on.
This is an example of functional-style programming, where a long expression is sensible. We avoid it, however.
Unit Test Expected Result Strings. These happen because we cut and paste to create the unit test code and don't spend a lot of time refactoring it. When it bugs us we pull the strings out into separate string variables and clean the self.assertXXX() lines up.
We generally don't have long lines of code because we don't use lambdas. We don't strive for fluent class design. We don't pass lots and lots of arguments (except in a few cases).
We rarely have a lot of functional-style long-winded expressions. When we do, we're not embarrassed to break them up and leave an intermediate result lying around. If we were functional purists, we might have gas with intermediate result variables, but we're not purists.

It doesn't matter what year it is or what output devices you use (to some extent). Your code should, if possible, be readable by humans. It is hard for humans to read long lines.
How long a line should be depends on its content. If it is a log message, then its length matters less. If it is complex code, then a long line won't help anyone comprehend it.

Temporary variables. They solve almost every problem I have with long lines. Very occasionally, I'll need to use some extra parens (like in a longer if-statement). I won't make any arguments for or against 80 character limitations since that seems irrelevant.
Specifically, for a log message; instead of:
self._log.info('Insert long message here')
Use:
msg = 'Insert long message here'
self._log.info(msg)
The cool thing about this is that it's going to be two lines no matter what, but by using good variable names, you also make it self-documenting. E.g., instead of:
obj.some_long_method_name(subtotal * (1 + tax_rate))
Use:
grand_total = subtotal * (1 + tax_rate)
obj.some_long_method_name(grand_total)
Most every long line I've seen is trying to do more than one thing and it's trivial to pull one of those things out into a temp variable. The primary exception is very long strings, but there's usually something you can do there too, since strings in code are often structured. Here's an example:
br = mechanize.Browser()
ua = '; '.join(('Mozilla/5.0 (Macintosh', 'U', 'Intel Mac OS X 10.4',
                'en-US', 'rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6'))
br.addheaders = [('User-agent', ua)]

This is a good rule to keep to a large part of the time, but don't pull your hair out over it. The most important thing is that stylistically your code looks readable and clean, and keeping your lines to reasonable length is part of that.
Sometimes it's nicer to let things run on for more than 80 columns, but most of the time I can write my code such that it's short and concise and fits in 80 or less. As some responders point out the limit of 80 is rather dated, but it's not bad to have such a limit and many people have terminals
Here are some of the things that I keep in mind when trying to restrict the length of my lines:
is this code that I expect other people to use? If so, what's the standard that those other people use for this type of code?
do those people have laptops, use giant fonts, or have other reasons for their screen real estate being limited?
does the code look better to me split up into multiple lines, or as one long line?
This is a stylistic question, but style is really important because you want people to read and understand your code.

I would suggest being willing to go beyond 80 columns. 80 columns is a holdover from when it was a hard limit based on various output devices.
Now, I wouldn't go hog wild...set a reasonable limit, but an arbitrary limit of 80 columns seems a bit overzealous.
EDIT: Other answers are also clarifying this: it matters what you're breaking. Strings can more often be "special cases" where you may want to bend the rules a bit, for the sake of clarity. If your code, on the other hand, is getting long, that's a good time to look at where it is logical to break it up.

80 character limits? What year is it?
Make your code readable. If a long line is readable, it's fine. If it's hard to read, split it.
For example, I tend to make long lines when there is a method call with lots of arguments, and the arguments are the normal arguments you'd expect. So, let's say I'm passing 10 variables around to a bunch of methods. If every method takes a transaction id, an order id, a user id, a credit card number, etc, and these are stored in appropriately named variables, then it's ok for the method call to appear on one line with all the variables one after another, because there are no surprises.
If, however, you are dealing with multiple transactions in one method, you need to ensure that the next programmer can see that THIS time you're using transId1, and THAT time transId2. In that case make sure it's clear. (Note: sometimes using long lines HELPS that too).
Just because a "style guide" says you should do something doesn't mean you have to do it. Some style guides are just plain wrong.

The 80 column count is one of the few places I disagree with the Python style guide. I'd recommend you take a look at the audience for your code. If everyone you're working with uses a modern IDE on a monitor with a reasonable resolution, it's probably not worth your time. My monitor is cheap and has a relatively weak resolution, but I can still fit 140 columns plus scroll bars, line markers, break markers, and a separate tree-view frame on the left.
However, you will probably end up following some kind of limit, even if it's not a fixed number. With the exception of messages and logging, long lines are hard to read. Lines that are broken up are also harder to read. Judge each situation on its own, and do what you think will make life easiest for the person coming after you.

Strings are special because they tend to be long, so break them when you need and don't worry about it.
When your actual code starts bumping the 80 column mark it's a sign that you might want to break up deeply nested code into smaller logical chunks.

I deal with it by not worrying about the length of my lines. I know that some of the lines I write are longer than 80 characters but most of them aren't.
I know that my position is not considered "pythonic" by many and I understand their points. Part of being an engineer is knowing the trade-offs for each decision and then making the decision that you think is the best.

Sticking to 80 columns is important not only for readability, but because many of us like to have narrow terminal windows so that, at the same time as we are coding, we can also see things like module documentation loaded in our web browser and an error message sitting in an xterm. Giving your whole screen to your IDE is a rather primitive, if not monotonous, way to use screen space.
Generally, if a line stretches to more than 80 columns it means that something is going wrong anyway: either you are trying to do too much on one line, or have allowed a section of your code to become too deeply indented. I rarely find myself hitting the right edge of the screen unless I am also failing to refactor what should be separate functions, name temporary results, and do other things like that which will make testing and debugging much easier in the end. Read Linus's Kernel Coding Style guide for good points on this topic, albeit from a C perspective:
http://www.kernel.org/doc/Documentation/CodingStyle
And always remember that long strings can either be broken into smaller pieces:
print ("When Python reads in source code"
       " with string constants written"
       " directly adjacent to one another"
       " without any operators between"
       " them, it considers them one"
       " single string constant.")
Or, if they are really long, they're generally best defined as a constant then used in your code under that abbreviated name:
STRING_MESSAGE = (
    "When Python reads in source code"
    " with string constants written directly adjacent to one"
    " another without any operators between them, it considers"
    " them one single string constant.")
...
print STRING_MESSAGE
...

Pick a style you like, apply a layer of common sense, and use it consistently.
PEP 8 is a style guide for libraries included as part of the Python standard library. It was never intended to be picked up as the style rules for all Python code. That said, there's no reason people shouldn't use it, but it's definitely not a set of hard rules. Like any style, there is no single correct way and the most important thing is consistency.

I do run into code that spills past 79 columns every now and then. I've either broken such lines up with '\' (although recently I've read about using parentheses instead as a preferred alternative, so I'll give that a shot), or just let them be if they're no more than 15 or so columns past. And this is coming from someone who indents only 2, not 4 spaces (I know, shame on me :\ )! It isn't entirely science. It's also part style, and sometimes keeping things on one line is just easier to manage or read. Other times, excessive side-to-side scrolling can be worse.
Much of the time it has to do with longer variable names. For variables beyond temp values and iterators, I don't want to reduce them to 1 to 5 letters. Names that are 7 to 15 characters long actually do provide context as to their uses and what classes they refer to.
When I need to print stuff out where parts of the output are dynamic, I'll replace those portions with function calls, which cuts down on the conditional statements and sheer content that would otherwise have been in that body of code.
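To make the backslash versus parentheses comparison mentioned above concrete, here is a small sketch with made-up variable names. Both forms compute the same value; the parenthesized one relies on Python's implicit line joining inside brackets.

```python
def order_total(subtotal, tax_rate, shipping_cost, discount):
    """Compute the same total twice to compare the two continuation styles."""
    # Backslash continuation: works, but nothing (not even a trailing
    # space) may follow the backslash on that line.
    with_backslash = subtotal * (1 + tax_rate) \
        + shipping_cost - discount
    # Parentheses: the expression continues until the closing parenthesis,
    # so the line can be broken anywhere between tokens.
    with_parens = (subtotal * (1 + tax_rate)
                   + shipping_cost - discount)
    return with_backslash, with_parens
```

The parenthesized form also allows a comment at the end of each continued line, which the backslash form does not.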
