Python - Removing unknown 10 character string - python

I'm using a modified version of Eric Bidelman's/HTML5Rocks cachebust.py file for CSS/JS.
Instead of appending a timestamp like
.css?2012-07-30
I changed the variable to
cachebust = ''.join(random.choice(string.ascii_uppercase + string.digits) for x in range(10))
so it becomes (for example)
.css?6SKD39SFJ3
His original version didn't seem to remove the date either, so I'm not really sure how that counts as 'cache control', but I figured that if I could just auto-strip those 10 characters, it would work. The idea is to first target any JS files (for new files), and then, if a file already has js? with a cache-control token in place, strip that existing token:
asset = re.search('\.(js")><\/script>', line)
if asset is not None:
    existing = re.search('\.(js?"', line)
    if existing is not None:
        line[i] = line.replace('.js?'STRING????'"', '.js"')
    lines[i] = line.replace('.js"></script>', '.js?%s"></script>' % cachebust)
Any thoughts on what that STRING???? should be, or whether this method wouldn't work? I'm new to Python, so I'm just experimenting here...

You could replace the 3 lines:
existing = re.search('\.(js?"', line)
if existing is not None:
    line[i] = line.replace('.js?'STRING????'"', '.js"')
with:
line = re.sub(r'\.js\?[-0-9]{10}">', r'.js?">', line)
Output:
>>> re.sub(r'\.js\?[-0-9]{10}">',r'.js?">', '<script type="blah" src="url/to/path.js?2012-07-02">')
'<script type="blah" src="url/to/path.js?">'
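I have used the regexp [-0-9]{10}, which matches exactly 10 characters, each of them a digit or a dash. If the token can be any 10 characters, use .{10} instead; for the random token in the question (uppercase letters and digits), [A-Z0-9]{10} is a tighter match.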
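Putting the strip and the re-append together, here is a minimal sketch. It assumes the token is always exactly 10 characters made of uppercase letters, digits, or dashes (which covers both the original date and the random string), and that the src attribute ends right after it. The helper name rebust is made up for illustration and is not part of cachebust.py.

```python
import random
import re
import string

# Matches '.js', optionally followed by '?' and an old 10-character token,
# immediately before the closing quote of the src attribute.
TOKEN_RE = re.compile(r'\.js(?:\?[A-Z0-9-]{10})?"')


def rebust(line, cachebust):
    """Replace any existing token on .js URLs in line with cachebust."""
    return TOKEN_RE.sub('.js?%s"' % cachebust, line)


cachebust = ''.join(random.choice(string.ascii_uppercase + string.digits)
                    for _ in range(10))

# Works for a tag that already carries a token and for a fresh one.
print(rebust('<script src="app.js?6SKD39SFJ3"></script>', cachebust))
print(rebust('<script src="app.js"></script>', cachebust))
```

Because the old token is optional in the pattern, a single substitution handles both new files and files that were already cache-busted, so no separate "existing" check is needed.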

Related

removing a string of four characters from the front and thirteen characters from the end of a filename

I have seen basic Python code for replacing parts of filenames in a directory, but it always targets known strings. How would you remove unknown characters of a certain length?
Would this work?
newFileName = file.replace([-5:], "")
I am trying to remove the last five characters from the filename without removing the extension.
Here is an update:
I am trying to do this:
DMC-CIWS15-AAAA-A00-00-0000-00A-018A-D_014-00_EN-US.xml
to
CIWS15-AAAA-A00-00-0000-00A-018A-D.xml
which removes DMC- from the start and _014-00_EN-US from the end.
I need to add this to code that will fix a directory of files.
This problem (if I understand it correctly) separates cleanly into three steps: remove the extension, remove X characters from the beginning and the end, and then add the extension back to get the final answer.
import os
oldFileName = 'xxxx-filename-xxxxx.XML'
# remove n chars in beginning, m chars at end
n = 5
m = 6
name, ext = os.path.splitext(oldFileName)
# splice away the chars, and add the extension
newFileName = '{}{}'.format(name[0:-m][n:], ext)
# newFileName == 'filename.XML'
So in your case, you would use n=4 and m=13.
If you didn't know the lengths, but you knew you wanted to drop everything up to and including the first dash, and likewise everything from the first underscore onward (which means there can't be underscores in the part of the filename you keep), this would also work:
import os
oldFileName = 'DMC-CIWS15-AAAA-A00-00-0000-00A-018A-D_014-00_EN-US.xml'
name, ext = os.path.splitext(oldFileName)
newFileName = '{}{}'.format(name[name.index('-')+1:name.index('_')], ext)
# newFileName == 'CIWS15-AAAA-A00-00-0000-00A-018A-D.xml'
And even if the pattern is something else, but there is a pattern, you can code to match it, like I have here.
It's not pretty, but I hope this works for you.
If you know the files that you want to rename all have the same length, you can try:
>>> file = 'DMC-CIWS15-AAAA-A00-00-0000-00A-018A-D_014-00_EN-US.xml'
>>> ext = file[51:]
>>> newFile = file[4:38] + ext
When you print newFile you now have:
>>> print(newFile)
CIWS15-AAAA-A00-00-0000-00A-018A-D.xml
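Since the question asks about fixing a whole directory, here is a rough sketch of how the fixed-length approach could be applied to every file. The function names trimmed_name and rename_all, the .xml filter, and the directory argument are assumptions made for illustration; try it on a copy of the directory first.

```python
import os


def trimmed_name(filename, n, m):
    """Drop n leading and m trailing characters of the stem, keep the extension."""
    name, ext = os.path.splitext(filename)
    # Slice with an explicit end index so that m == 0 also works.
    return name[n:len(name) - m] + ext


def rename_all(directory, n=4, m=13):
    """Rename every .xml file in directory (a hypothetical path) in place."""
    for filename in os.listdir(directory):
        if filename.lower().endswith('.xml'):
            os.rename(os.path.join(directory, filename),
                      os.path.join(directory, trimmed_name(filename, n, m)))


print(trimmed_name('DMC-CIWS15-AAAA-A00-00-0000-00A-018A-D_014-00_EN-US.xml', 4, 13))
```

Keeping the name computation in its own function makes it easy to check the result on a few sample names before calling rename_all on real files.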

Bell character as Fields separator in Python print output

I am fairly new to Python and need a little help here.
I have a Python script running on Python 2.6 that parses some JSON.
Example Code:
if "prid" in data[p]["prdts"][n]:
    print data[p]["products"][n]["prid"],
if "metrics" in data[p]["prdts"][n]:
    lenmet = len(data[p]["prdts"][n]["metrics"])
    i = 0
    while (i < lenmet):
        if (data[p]["prdts"][n]["metrics"][i]["metricId"] == "price"):
            print data[p]["prdts"][n]["metrics"][i]["metricValue"]["value"]
            break
Now, this prints values in 2 columns:
prid price
123 20
234 40
As you can see, the field separator above is ' '. How can I use a field separator like the BEL character in the output?
Sample expected output:
prid price
123^G20
234^G40
FWIW, your while loop doesn't increment i, so it will loop forever, but I assume that was just a copy & paste error, and I'll ignore it in the rest of my answer.
If you want to use two separate print statements to print your data on one line you can't avoid getting that space produced by the first print statement. Instead, simply save the prid data until you can print it with the price in one go using string concatenation. Eg,
if "prid" in data[p]["prdts"][n]:
    prid = data[p]["products"][n]["prid"]
if "metrics" in data[p]["prdts"][n]:
    lenmet = len(data[p]["prdts"][n]["metrics"])
    i = 0
    while (i < lenmet):
        if (data[p]["prdts"][n]["metrics"][i]["metricId"] == "price"):
            price = data[p]["prdts"][n]["metrics"][i]["metricValue"]["value"]
            print str(prid) + '\a' + str(price)
            break
Note that I'm explicitly converting the prid and price to string. Obviously, if either of those items is already a string then you don't need to wrap it in str(). Normally, we can let print convert objects to string for us, but we can't do
print prid, '\a', price
here because that will give us an unwanted space between each item.
Another approach is to make use of the new print() function, which we can import using a __future__ import at the top of the script, before other imports:
from __future__ import print_function
# ...
if "prid" in data[p]["prdts"][n]:
    print(data[p]["products"][n]["prid"], end='\a')
if "metrics" in data[p]["prdts"][n]:
    lenmet = len(data[p]["prdts"][n]["metrics"])
    i = 0
    while (i < lenmet):
        if (data[p]["prdts"][n]["metrics"][i]["metricId"] == "price"):
            print(data[p]["prdts"][n]["metrics"][i]["metricValue"]["value"])
            break
I don't understand why you want to use BEL as a separator rather than something more conventional, eg TAB. The BEL char may print as ^G in your terminal, but it's invisible in mine, and if you save this output to a text file it may not display correctly in a text viewer / editor.
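A third option is to build the whole row first and join the fields with the separator, which also makes it trivial to switch from BEL to TAB. This is only a sketch: the helper names find_price and format_row and the sample record are invented for illustration, and the record shape is guessed from the question's code.

```python
def find_price(metrics):
    """Return the value of the first metric whose metricId is 'price', else None."""
    for metric in metrics:
        if metric["metricId"] == "price":
            return metric["metricValue"]["value"]
    return None


def format_row(fields, sep='\a'):
    """Join the fields with sep (BEL by default), with no extra spaces."""
    return sep.join(str(field) for field in fields)


# Hypothetical product record shaped like the JSON in the question.
product = {"prid": 123,
           "metrics": [{"metricId": "price", "metricValue": {"value": 20}}]}

print(format_row([product["prid"], find_price(product["metrics"])]))
```

Iterating over the metrics directly with a for loop also sidesteps the index bookkeeping of the while loop.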
BTW, it would have been better if you had posted a Minimal, Complete, Verifiable Example that focuses on your actual printing problem, rather than all that crazy JSON parsing stuff, which just makes your question look more complicated than it really is and makes it impossible for us to test your code or our modifications to it.

Regex and renaming file error

I was writing a little program that finds all files with a given prefix, let's say 'spam' for this example, in a folder, locates gaps in the numbering, and renames subsequent files to fill the gap. Below is the portion of the program that locates a file using a regex and renames it:
prefix = 'spam'
newNumber = 005
# Regex for finding files with specified prefix + any numbering + any file extension
prefixRegex = re.compile(r'(%s)((\d)+)(\.[a-zA-Z0-9]+)' % prefix)
# Rename file by keeping group 1 (prefix) and group 4 (file extension),
# but substituting numbering with newNumber
newFileName = prefixRegex.sub(r'\1%s\4' % newNumber, 'spam006.txt')
What I was expecting from above was spam005.txt, but instead I got #5.txt
I figured out I could use r'%s%s\4' % (prefix, newNumber) instead and then it does work as intended, but I'd like to understand why this error is happening. Does it have something to do with the %s used during re.compile()?
There are two problems here:
Your newNumber needs to be a string if you want it to be 005, as the two leading zeros are dropped when it is interpreted as an integer (Python 2 reads a literal with a leading 0 as octal, and Python 3 rejects it as a syntax error).
Your next problem is indeed in your substitution. By using string formatting you effectively create the replacement template \15\4 (note the 5 in there, which was your newNumber). When Python sees this, it looks for capturing group 15 rather than group 1 followed by a literal 5. You can write the reference as \g<1> to get your desired behavior: \g<1>5\4
So your code needs to be changed to this:
prefix = 'spam'
newNumber = '005'
# Regex for finding files with specified prefix + any numbering + any file extension
prefixRegex = re.compile(r'(%s)((\d)+)(\.[a-zA-Z0-9]+)' % prefix)
# Rename file by keeping group 1 (prefix) and group 4 (file extension),
# but substituting numbering with newNumber
newFileName = prefixRegex.sub(r'\g<1>%s\4' % newNumber, 'spam006.txt')
More information about the \g<n> syntax can be found at the end of the re.sub documentation.
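The same idea can be wrapped in a small helper that also takes care of the zero padding, so the number can stay an integer. This is a sketch under assumptions: the function name renumber and the default width of 3 are made up for illustration, and re.escape is added so that a prefix containing regex metacharacters is matched literally.

```python
import re


def renumber(filename, prefix, new_number, width=3):
    """Replace the digits after prefix with new_number, zero-padded to width."""
    pattern = re.compile(r'(%s)(\d+)(\.[a-zA-Z0-9]+)' % re.escape(prefix))
    padded = str(new_number).zfill(width)
    # \g<1> keeps the group reference separate from the digits that follow it.
    return pattern.sub(r'\g<1>%s\g<3>' % padded, filename)


print(renumber('spam006.txt', 'spam', 5))
```

Using \g<3> for the extension as well is not required, but it keeps both references in the same unambiguous style.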

Python/IPython strange non reproducible list index out of range error

I have recently been learning some Python and how to apply it to my work. I have written a couple of scripts successfully, but I am having an issue I just cannot figure out.
I am opening a file with ~4000 lines, two tab-separated columns per line. When reading the input file, I get an index error saying that the list index is out of range. However, while I get the error every time, it doesn't happen on the same line every time (as in, it will throw the error on different lines every time!). So, for some reason, it works generally but then (seemingly) randomly fails.
As I literally only started learning Python last week, I am stumped. I have looked around for the same problem, but not found anything similar. Furthermore I don't know if this is a problem that is language specific or IPython specific. Any help would be greatly appreciated!
input = open("count.txt", "r")
changelist = []
listtosort = []
second = str()
output = open("output.txt", "w")
for each in input:
    splits = each.split("\t")
    changelist = list(splits[0])
    second = int(splits[1])
    print second
    if changelist[7] == ";":
        changelist.insert(6, "000")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[8] == ";":
        changelist.insert(6, "00")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    elif changelist[9] == ";":
        changelist.insert(6, "0")
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
    else:
        #output.write(str("".join(changelist)))
        va = "".join(changelist)
        var = va + ("\t") + str(second)
        listtosort.append(var)
        output.write(var)
output.close()
The error
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/home/a/Desktop/sharedfolder/ipytest/individ.ins.count.test/<ipython-input-87-32f9b0a1951b> in <module>()
57 splits = each.split("\t")
58 changelist = list(splits[0])
---> 59 second = int(splits[1])
60
61 print second
IndexError: list index out of range
Input:
ID=cds0;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
Desired output:
ID=cds0000;Name=NP_414542.1;Parent=gene0;Dbxref=ASAP:ABE-0000006,UniProtKB%2FSwiss-Prot:P0AD86,Genbank:NP_414542.1,EcoGene:EG11277,GeneID:944742;gbkey=CDS;product=thr 12
ID=cds1000;Name=NP_415538.1;Parent=gene1035;Dbxref=ASAP:ABE-0003451,UniProtKB%2FSwiss-Prot:P31545,Genbank:NP_415538.1,EcoGene:EG11735,GeneID:946500;gbkey=CDS;product=deferrrochelatase%2C 50
ID=cds1001;Name=NP_415539.1;Parent=gene1036;Note=PhoB-dependent%2C 36
The reason you're getting the IndexError is that your input-file is apparently not entirely tab delimited. That's why there is nothing at splits[1] when you attempt to access it.
Your code could use some refactoring. First of all, you're repeating yourself with the if-checks, which is unnecessary. I threw the following together to demonstrate how you could refactor your code to be a little more Pythonic and DRY. Note that it just right-pads the cds value with zeros to 7 characters, which is probably not exactly what you want. I can't guarantee it'll work with your dataset, but I'm hoping it might help you understand how to do things differently.
to_sort = []
# We can open two files using the with statement. This will also handle
# closing the files for us when we exit the block.
with open("count.txt", "r") as inp, open("output.txt", "w") as out:
    for each in inp:
        # Split at ';', so you won't have to worry about whether or not
        # the file is tab delimited.
        changed = each.split(";")
        # Get the value you want. This is called unpacking.
        # The value before '=' will always be 'ID', so we don't really care about it.
        # _ is generally used as a variable name when the value is discarded.
        _, value = changed[0].split("=")
        # 0-pad the desired value to 7 characters. Python string formatting
        # makes this very easy. This will replace the current value in the list.
        changed[0] = "ID={:0<7}".format(value)
        # Join the changed list with the original separator
        # and append it to the sort list.
        to_sort.append(";".join(changed))
    # Write the results to the file all at once, while it is still open.
    # Your test data already provided the newlines, so you can just write
    # it out as it is.
    out.writelines(to_sort)
# Do whatever else you need to do. Maybe to_sort.sort()?
You'll notice that this reduces your code to about 8 lines, produces the same result for your sample input, does not repeat itself, and is pretty easy to understand.
Please read PEP 8 and the Zen of Python, and go through the official tutorial.
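If the goal is what the original if-chain does, namely inserting zeros in front of the number so that cds0 becomes cds0000 and a value such as cds12 would become cds0012, a regex with zfill is one way to sketch it. The names ID_RE and pad_id and the default width of 4 are assumptions for illustration, and the second sample line is invented.

```python
import re

# Captures the literal 'ID=cds' prefix and the digits that follow it.
ID_RE = re.compile(r'^(ID=cds)(\d+)')


def pad_id(line, width=4):
    """Left-pad the number after 'ID=cds' with zeros to width digits."""
    return ID_RE.sub(lambda m: m.group(1) + m.group(2).zfill(width), line)


print(pad_id('ID=cds0;Name=NP_414542.1\t12'))
print(pad_id('ID=cds12;Name=example\t7'))
```

Lines that do not start with ID=cds followed by digits are returned unchanged, so the helper never indexes into a line that is shorter than expected.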
This happens when there is a line in count.txt which doesn't contain the tab character. So when you split by tab character there will not be any splits[1]. Hence the error "Index out of range".
To find out which line is causing the error, just add a print(each) after the splits assignment on line 57. The line printed just before the error message is your culprit. If your input file keeps changing, the error will show up at different locations. Change your script to handle such malformed lines.
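One way to handle malformed lines is to validate each line before using it and report the ones that fail. The sketch below assumes every good line is "label, tab, integer count"; the function name parse_line and the sample lines are invented for illustration.

```python
def parse_line(line):
    """Return (label, count) for a 'label<TAB>count' line, or None if malformed."""
    label, tab, count = line.rstrip("\n").partition("\t")
    # partition returns an empty separator when the line contains no tab.
    if not tab or not count.strip().isdigit():
        return None
    return label, int(count)


# Hypothetical sample: the second line has no tab, so it is reported, not parsed.
sample = ["ID=cds0;Name=NP_414542.1\t12\n", "ID=cds1;broken line\n"]
for number, line in enumerate(sample, 1):
    parsed = parse_line(line)
    if parsed is None:
        print("skipping malformed line %d: %r" % (number, line))
    else:
        print(parsed)
```

Printing the line with %r makes stray spaces or missing tabs visible, which helps when the offending line looks fine at first glance.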

str.startswith() not working as I intended

I'm trying to test for a \t or a space character and I can't understand why this bit of code won't work. What I am doing is reading in a file, counting the LOC for the file, and then recording the name of each function in the file along with its individual lines of code. The bit of code below is where I attempt to count the LOC for the functions.
import re
...
    else:
        loc += 1
        for line in infile:
            line_t = line.lstrip()
            if len(line_t) > 0 \
            and not line_t.startswith('#') \
            and not line_t.startswith('"""'):
                if not line.startswith('\s'):
                    print ('line = ' + repr(line))
                    loc += 1
                    return (loc, name)
                else:
                    loc += 1
            elif line_t.startswith('"""'):
                while True:
                    if line_t.rstrip().endswith('"""'):
                        break
                    line_t = infile.readline().rstrip()
        return(loc,name)
Output:
Enter the file name: test.txt
line = '\tloc = 0\n'
There were 19 lines of code in "test.txt"
Function names:
count_loc -- 2 lines of code
As you can see, my test print for the line shows a \t, but the if statement explicitly says (or so I thought) that it should only execute when no whitespace characters are present.
Here is my full test file I have been using:
def count_loc(infile):
""" Receives a file and then returns the amount
of actual lines of code by not counting commented
or blank lines """
loc = 0
for line in infile:
line = line.strip()
if len(line) > 0 \
and not line.startswith('//') \
and not line.startswith('/*'):
loc += 1
func_loc, func_name = checkForFunction(line);
elif line.startswith('/*'):
while True:
if line.endswith('*/'):
break
line = infile.readline().rstrip()
return loc
if __name__ == "__main__":
print ("Hi")
Function LOC = 15
File LOC = 19
\s is only whitespace to the re package when doing pattern matching.
For startswith, an ordinary method of ordinary strings, \s is nothing special: not a pattern, just the two characters backslash and s.
Your question has already been answered and this is slightly off-topic, but...
If you want to parse code, it is often easier and less error-prone to use a parser. If your code is Python code, Python comes with a couple of parsers (tokenize, ast, parser). For other languages, you can find a lot of parsers on the internet. ANTLR is a well-known one with Python bindings.
As an example, the following couple of lines of code print all lines of a Python module that are not comments and not doc-strings:
import tokenize
ignored_tokens = [tokenize.NEWLINE, tokenize.COMMENT, tokenize.N_TOKENS,
                  tokenize.STRING, tokenize.ENDMARKER, tokenize.INDENT,
                  tokenize.DEDENT, tokenize.NL]
with open('test.py', 'r') as f:
    g = tokenize.generate_tokens(f.readline)
    line_num = 0
    for a_token in g:
        if a_token[2][0] != line_num and a_token[0] not in ignored_tokens:
            line_num = a_token[2][0]
            print(a_token)
As a_token above is already parsed, you can easily check for a function definition, too. You can also keep track of where the function ends by looking at the current column start a_token[2][1]. If you want to do more complex things, you should use ast.
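As a rough illustration of the ast route, the sketch below reports how many source lines each function definition spans. It assumes Python 3.8 or later (for the end_lineno attribute), and the span includes docstrings, comments, and blank lines inside the function, so it is not a true LOC count. The function name function_spans and the sample source are made up for the example.

```python
import ast

source = '''
def count_loc(infile):
    """Docstring."""
    loc = 0
    return loc

def other():
    pass
'''


def function_spans(code):
    """Return {function name: number of source lines the definition spans}."""
    spans = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            # lineno and end_lineno are 1-based and inclusive.
            spans[node.name] = node.end_lineno - node.lineno + 1
    return spans


print(function_spans(source))
```

Combining these spans with the tokenize filter above would be one way to count only real code lines per function.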
Your string literals aren't what you think they are.
You can specify a space or TAB like so:
space = ' '
tab = '\t'
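To test for either character in one call, str.startswith also accepts a tuple of prefixes. A minimal sketch, with the helper name is_indented made up for illustration:

```python
def is_indented(line):
    """Return True if the line begins with a space or a tab."""
    # str.startswith accepts a tuple of prefixes and checks each in turn.
    return line.startswith((' ', '\t'))


print(is_indented('\tloc = 0\n'))
print(is_indented('def count_loc(infile):\n'))
```

In the question's code, the check would then read `if not line.startswith((' ', '\t')):`.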
