Python identify file with largest number as part of filename - python

I have files with a number appended at the end e.g:
file_01.csv
file_02.csv
file_03.csv
I am looking for a simple way of identifying the file with the largest number appended to it. Is there a moderately simple way of achieving this? ... I was thinking of importing all file names in the folder, extracting the last digits, converting to number, and then looking for max number, however that seems moderately complicated for what I assume is a relatively common task.

If the filenames really are formatted this regularly (same prefix, zero-padded numbers), then you can simply use max:
>>> max(['file_01.csv', 'file_02.csv', 'file_03.csv'])
'file_03.csv'
but note that:
>>> 'file_5.csv' > 'file_23.csv'
True
>>> 'my_file_01' > 'file_123'
True
>>> 'fyle_01' > 'file_42'
True
so you might want to add some kind of validation to your function, and/or use glob.glob:
>>> import glob
>>> max(glob.glob('/tmp/file_??'))
'/tmp/file_03'

import re
x=["file_01.csv","file_02.csv","file_03.csv"]
# convert the extracted digits to int so that 5 < 23 (as strings, "5" > "23")
print(max(x,key=lambda name:int(re.split(r"_|\.",name)[1])))
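Since 'file_5.csv' > 'file_23.csv' when compared as strings, one hedged alternative is to compare the numbers themselves. The sketch below assumes the number sits directly before the file extension; trailing_number and largest_numbered are hypothetical helper names, not part of any library.

```python
import re

def trailing_number(name):
    # Digits immediately before the extension, e.g. 'file_23.csv' -> 23.
    # Names without such a number get -1 so they never win the comparison.
    match = re.search(r'(\d+)\.[^.]+$', name)
    return int(match.group(1)) if match else -1

def largest_numbered(names):
    # max() compares the integer keys, not the strings themselves.
    return max(names, key=trailing_number)

names = ['file_5.csv', 'file_23.csv', 'file_03.csv']
print(largest_numbered(names))  # file_23.csv
```

In practice the list of names could come from os.listdir or glob.glob instead of a literal list.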

Related

Using Regular Expressions to extract numerical quantities from a file and find the sum

I am a beginner learning Python. The problem is that I have to extract numbers from a file and find their sum. The numbers can be anywhere and can appear multiple times in the same line; some lines may have no numbers, and some lines may be empty. I did manage to solve it, and this was my code
import re
new=[]
s=0
fhand=open("sampledata.txt")
for line in fhand:
    if re.search('^.+',line): #to exclude lines which have nothing
        y=re.findall('([0-9]*)',line) #this part is supposed to extract only the
        for i in range(len(y)): #the numerical part, but it extracts all the words. why?
            try:
                y[i]=float(y[i])
            except:
                y[i]=0
        s=s+sum(y)
print(s)
The code works, but it is not a pythonic way to do it. Why is the ([0-9]*) extracting all the words instead of only numbers?
What is the pythonic way to do it?
Your regular expression has ([0-9]*), which matches zero or more digits, so it also produces an empty match at every position where there is no digit. You probably want ([0-9]+) instead.
Hello, you made a mistake in the regular expression by adding the "*". Use "+" so that each match is a complete run of digits; something like this should work:
y=re.findall('([0-9]+)',line)
Expanding on wind85's answer, you might want to fine tune your regular expression depending on what kind of numbers you expect to find in your file. For example, if your numbers might have a decimal point in them, then you might want something like [0-9]+(?:\.[0-9]+)? (one or more digits optionally followed by a period and one or more digits).
As for making it more pythonic, here's how I'd probably write it:
import re

s=0
for line in open("sampledata.txt"):
    s += sum(float(y) for y in re.findall(r'[0-9]+',line))
print(s)
If you want to get really fancy, you can make it a one-liner:
print(sum(float(y) for line in open('sampledata.txt')
          for y in re.findall(r'[0-9]+',line)))
but personally I find that kind of thing hard to read.
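To illustrate the decimal-aware pattern mentioned above, here is a minimal sketch. It assumes plain unsigned numbers (no signs, exponents or thousands separators); sum_numbers is a hypothetical helper name, and the sample lines are made up for the example.

```python
import re

# One or more digits, optionally followed by a period and more digits.
NUMBER = re.compile(r'[0-9]+(?:\.[0-9]+)?')

def sum_numbers(lines):
    # findall returns whole matches here because the group is non-capturing.
    return sum(float(tok) for line in lines for tok in NUMBER.findall(line))

sample = ["a 1 b 2.5", "", "no numbers here", "10x3"]
print(sum_numbers(sample))  # 16.5
```

Because the function only iterates over lines, an open file object can be passed in place of the list, for example inside a with open("sampledata.txt") as fhand: block.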

Simple regular expression not working

I am trying to match a string with a regular expression but it is not working.
What I am trying to do is simple: it is the typical situation where a user enters a range of pages, or single pages. I am reading the string and checking whether it is correct.
Expressions I am expecting, for a range of pages are like: 1-3, 5-6, 12-67
Expressions I am expecting, for single pages are like: 1,5,6,9,10,12
This is what I have done so far:
pagesOption1 = re.compile(r'\b\d\-\d{1,10}\b')
pagesOption2 = re.compile(r'\b\d\,{1,10}\b')
Seems like the first expression works, but not the second.
And would it be possible to merge both of them into one single regular expression, so that if the user enters either something like 1-2, 7-10 or something like 3,5,6,7, the expression is recognized as valid?
Simpler is better
Matching the entire input with a single regex isn't simple, as the proposed solutions show; at least it is not as simple as it could or should be. Such a pattern quickly becomes unmaintainable, and anyone who isn't regex-savvy will probably scrap it for a simpler, more explicit solution the first time they need to modify it.
Simplest
First split the entire string on commas with .split(",") into individual entries (stripping any surrounding whitespace); you need these anyway in order to process the usable numbers.
Then the test for each entry becomes very simple:
^(\d+)(?:-(\d+))?$
It says that the string must start with one or more digits, optionally followed by a single - and one or more digits, and then the string must end.
This makes your logic as simple and maintainable as possible. You also get the benefit of knowing exactly what part of the input is wrong and why so you can report it back to the user.
The capturing groups are there because you need the numbers parsed out to actually use them; this way you get them straight from the match without writing more code to parse the input again.
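The split-then-test approach described above could look roughly like this. It is only a sketch: parse_pages is a hypothetical name, single pages are represented as a range with equal endpoints, and raising ValueError on bad input is just one possible design.

```python
import re

# One entry: a page number, optionally followed by "-" and a second number.
ENTRY = re.compile(r'^(\d+)(?:-(\d+))?$')

def parse_pages(text):
    pages = []
    for raw in text.split(','):
        entry = raw.strip()              # tolerate "1-3, 5-6"
        m = ENTRY.match(entry)
        if m is None:
            # We know exactly which entry is wrong and can report it.
            raise ValueError('invalid page entry: %r' % entry)
        first = int(m.group(1))
        last = int(m.group(2)) if m.group(2) else first
        pages.append((first, last))
    return pages

print(parse_pages('1-3, 5-6, 12-67'))  # [(1, 3), (5, 6), (12, 67)]
```

An input such as '1-3,1,2,4.5' raises ValueError naming '4.5' as the offending entry, which is the error-reporting benefit mentioned above.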
This regex should work -
^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$
Testing this -
>>> test_vals = [
... '1-3, 5-6, 12-67',
... '1,5,6,9,10,12',
... '1-3,1,2,4.5',
... 'abcd',
... ]
>>> regex = re.compile(r'^(?:(\d+\-\d+)|(\d+))(?:\,[ ]*(?:(\d+\-\d+)|(\d+)))*$')
>>> for val in test_vals:
...     print(val)
...     if regex.match(val) is None:
...         print("Fail")
...     else:
...         print("Pass")
...
1-3, 5-6, 12-67
Pass
1,5,6,9,10,12
Pass
1-3,1,2,4.5
Fail
abcd
Fail

Compare two strings in python

Well, I need to compare two strings, or at least find a sequence of characters from one string in another. The two strings contain MD5 hashes of files, which I must compare to say whether there is a match.
my current code is:
def comparemd5():
    origmd5=getreferrerurl()
    dlmd5=md5_for_file(file_name)
    print("original md5 is",origmd5)
    print("downloader file md5 is",dlmd5)
    s = difflib.SequenceMatcher(None, origmd5, dlmd5)
    print("ratio is:",s.ratio())
the output i get is:
original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40
12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
downloader file md5 is 59739ccda2f15d5ac16db6695cae3378
ratio is : 0.0
Thus! there is a match from dlmd5 in origmd5 but somehow its not finding it...
I am doing something wrong somewhere...Please help me out :/
Basically, you want the idiom if test_string in list_of_strings. Looks like you don't need case sensitivity, so you might want
if test_string.lower() in (s.lower() for s in list_of_strings)
In your case:
>>> originals = ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
>>> test = '59739ccda2f15d5ac16db6695cae3378'
>>> if test.lower() in (s.lower() for s in originals):
...     print('%s is match, yeih!' % test)
...
59739ccda2f15d5ac16db6695cae3378 is match, yeih!
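Looks like you're having a problem because the case of the letters doesn't match, and because SequenceMatcher is being given a list and a string rather than two strings. You may want to try:
def comparemd5():
    origmd5=[item.lower() for item in getreferrerurl()]
    dlmd5=md5_for_file(file_name).lower()
    print("original md5 is",origmd5)
    print("downloader file md5 is",dlmd5)
    # compare the downloaded hash against each candidate string in turn
    for item in origmd5:
        s = difflib.SequenceMatcher(None, item, dlmd5)
        print("ratio is:",s.ratio())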
Given the input:
original md5 is ['0430f244a18146a0815aa1dd4012db46', '0430f244a18146a0815aa1dd40 12db46', '59739CCDA2F15D5AC16DB6695CAE3378']
downloader file md5 is 59739ccda2f15d5ac16db6695cae3378
You have two problems.
First of all, that first one isn't just an MD5, but an MD5 and two other things.
To fix that: If you know that origmd5 will always be in this format, just use origmd5[2] instead of origmd5. If you have no idea what origmd5 is, except that one of the things in it is the actual MD5, you'll have to compare against all of the elements.
Second, the actual MD5 values are both hex strings representing the same binary data, but they're different hex strings (because one is in uppercase, the other in lowercase). You could fix this by just doing a case-insensitive comparison, but it's probably more robust to unhexlify them both and compare the binary values.
In fact, if you've copied and pasted the output correctly, at least one of those hex strings has a space in the middle of it, so you actually need to unhexlify hex strings with optional spaces between hex pairs. AFAIK, there is no stdlib function that does this, but you can write it yourself in one step:
import binascii

def unhexlify(s):
    return binascii.unhexlify(s.replace(' ', ''))
Meanwhile, I'm not sure why you're trying to use difflib.SequenceMatcher at all. Two slightly different MD5 hashes refer to completely different original sources; that's kind of the whole point of MD5, and crypto hash functions in general. There's no such thing as a 95% match; there's either a match, or a non-match.
So, if you know the 3rd value in origmd5 is the one you want, just do this:
s = unhexlify(origmd5[2]) == unhexlify(dlmd5)
Otherwise, do this:
s = any(unhexlify(origthingy) == unhexlify(dlmd5) for origthingy in origmd5)
Or, turning it around to make it simpler:
s = unhexlify(dlmd5) in map(unhexlify, origmd5)
Or whatever equivalent you find most readable.
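Putting those pieces together, a self-contained sketch might look like the following. The hash values are the ones from the question's output; has_matching_md5 is a hypothetical name, and the unhexlify helper is the space-tolerant wrapper described above.

```python
import binascii

def unhexlify(s):
    # Drop optional spaces, then decode the hex digits to raw bytes.
    # Upper- and lowercase hex digits decode to the same bytes.
    return binascii.unhexlify(s.replace(' ', ''))

def has_matching_md5(candidates, digest):
    # Exact comparison only: hashes either match or they do not.
    target = unhexlify(digest)
    return any(unhexlify(c) == target for c in candidates)

origmd5 = ['0430f244a18146a0815aa1dd4012db46',
           '0430f244a18146a0815aa1dd40 12db46',
           '59739CCDA2F15D5AC16DB6695CAE3378']
dlmd5 = '59739ccda2f15d5ac16db6695cae3378'
print(has_matching_md5(origmd5, dlmd5))  # True
```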

how to handle '../' in python?

i need to strip ../something/ from a url
eg. strip ../first/ from ../first/bit/of/the/url.html where first can be anything.
what's the best way to achieve this?
thanks :)
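You can simply split the path twice at the path separator and take the last bit. For a filesystem path the separator is os.sep (and not necessarily '/'); for a URL, as in the question, it is always '/'. On a system where os.sep is '/':
>>> import os
>>> s = "../first/bit/of/the/path.html"
>>> s.split(os.sep, 2)[-1]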
'bit/of/the/path.html'
This is also more efficient than splitting the path completely and stringing it back together.
Note that this code does not complain when the path contains fewer than three path elements (for instance, 'file.html' yields 'file.html'). If you want the code to raise an exception if the path is not of the expected form, you can just ask for its third element (which is not present for paths that are too short):
>>> s.split(os.sep, 2)[2]
This can help detect some subtle errors.
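For the URL case from the question, where the separator is always '/', the same idea can be wrapped in a small function. This is a sketch under the assumption that the input must start with '../'; strip_leading_dir is a hypothetical name.

```python
def strip_leading_dir(url):
    # Split at most twice: ['..', 'first', 'bit/of/the/url.html']
    parts = url.split('/', 2)
    if len(parts) < 3 or parts[0] != '..':
        raise ValueError('expected ../<something>/<rest>, got %r' % url)
    return parts[2]

print(strip_leading_dir('../first/bit/of/the/url.html'))  # bit/of/the/url.html
```

Checking the shape explicitly gives a clearer error than the IndexError raised by indexing a too-short list.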
EOL has given a nice and clean approach however I could not resist giving a regex alternative to it:)
>>> import re
>>> m=re.search(r'^(\.{2}\/\w+/)(.*)$','../first/bit/of/the/path.html')
>>> m.group(1)
'../first/'
>>> m.group(2)
'bit/of/the/path.html'

How to work with very long strings in Python?

I'm tackling Project Euler's problem 220 (looked easy, in comparison to some of the others - thought I'd try a higher numbered one for a change!)
So far I have:
D = "Fa"
def iterate(D,num):
    for i in range (0,num):
        D = D.replace("a","A")
        D = D.replace("b","B")
        D = D.replace("A","aRbFR")
        D = D.replace("B","LFaLb")
    return D
instructions = iterate("Fa",50)
print(instructions)
Now, this works fine for low values, but when you put it to repeat higher then you just get a "Memory error". Can anyone suggest a way to overcome this? I really want a string/file that contains instructions for the next step.
The trick is in noticing which patterns emerge as you run the string through each iteration. Try evaluating iterate(D,n) for n between 1 and 10 and see if you can spot them. Also feed the string through a function that calculates the end position and the number of steps, and look for patterns there too.
You can then use this knowledge to simplify the algorithm to something that doesn't use these strings at all.
Python strings are not going to be the answer to this one. Strings are stored as immutable arrays, so each one of those replacements creates an entirely new string in memory. Not to mention, the set of instructions after 10^12 steps will be at least 1TB in size if you store them as characters (and that's with some minor compressions).
Ideally, there should be a way to mathematically (hint, there is) generate the answer on the fly, so that you never need to store the sequence.
Just use the string as a guide to determine a method which creates your path.
If you think about how many "a" and "b" characters there are in D(0), D(1), etc, you'll see that the string gets very long very quickly. Calculate how many characters there are in D(50), and then maybe think again about where you would store that much data. I make it 4.5*10^15 characters, which is 4500 TB at one byte per char.
Come to think of it, you don't have to calculate - the problem tells you there are 10^12 steps at least, which is a terabyte of data at one byte per character, or quarter of that if you use tricks to get down to 2 bits per character. I think this would cause problems with the one-minute time limit on any kind of storage medium I have access to :-)
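The character count quoted above can be checked without ever building the string. The sketch below only tracks two counters, relying on the fact that every "a" or "b" is replaced by five characters containing exactly one "a" and one "b"; instruction_length is a hypothetical name.

```python
def instruction_length(n):
    letters = 1   # number of 'a'/'b' characters in D(0) = "Fa"
    length = 2    # total characters in D(0)
    for _ in range(n):
        length += 4 * letters   # each letter becomes 5 chars: net +4
        letters *= 2            # each letter yields one 'a' and one 'b'
    return length

print(instruction_length(50))  # 4503599627370494
```

That is 2^52 - 2, or roughly 4.5*10^15 characters, in line with the estimate above.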
Since you can't materialize the string, you must generate it. If you yield the individual characters instead of returning the whole string, you might get it to work.
def repl220( string ):
    for c in string:
        if c == 'a': yield "aRbFR"
        elif c == 'b': yield "LFaLb"
        else: yield c
Something like that will do replacement without creating a new string.
Now, of course, you need to call it recursively, and to the appropriate depth. So, each yield isn't just a yield, it's something a bit more complex.
Trying not to solve this for you, so I'll leave it at that.
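For illustration only, here is one way the recursive version could be sketched in Python 3. It yields the characters of D(n) lazily, so memory use stays small, but stepping through 10^12 instructions one at a time would still be far too slow for the actual problem. The names RULES and expand are assumptions made for this example.

```python
from itertools import islice

RULES = {'a': 'aRbFR', 'b': 'LFaLb'}

def expand(string, depth):
    # Lazily yield the characters of `string` after `depth` rewrites.
    for c in string:
        if depth > 0 and c in RULES:
            # Recurse into the replacement instead of building a new string.
            yield from expand(RULES[c], depth - 1)
        else:
            yield c

print(''.join(expand('Fa', 2)))              # FaRbFRRLFaLbFR
print(''.join(islice(expand('Fa', 50), 14)))  # FaRbFRRLFaLbFR
```

The second call shows that the first instructions of D(50) are available immediately, because nothing beyond the requested characters is ever generated.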
Just as a word of warning be careful when using the replace() function. If your strings are very large (in my case ~ 5e6 chars) the replace function would return a subset of the string (around ~ 4e6 chars) without throwing any errors.
You could treat D as a byte stream file.
Something like:-
seedfile = open('D1.txt', 'w')
seedfile.write("Fa")
seedfile.close()
n = 0
while (n
warning totally untested
