How to get everything before a character and x amount after? - python

It has been trial and error, and I can't seem to get what I want.
I am accessing an API to get some info. Unfortunately it's the only API that provides that info, and to do so it downloads the binary content of a file and names it:
folder\filename.whatever
i.e. test\purpleMonkeyTest.docx
There is a bunch more info that comes in from the call but there is this line:
Saved the binary content to: /home/user/python/test\purpleMonkeyTest.docx
Some of the files have " or other special characters so I can't just get the file name and delete it as part of the script, since I won't know what to escape.
So my goal here is to strip the line and get:
/home/user/python/test\purpleMonkeyTest.docx
then get only:
/home/user/python/test\pu
then:
os.remove "/home/user/python/test\pu"*
I'm thinking that a wildcard should work for all, unless there is a better way to do it. All the saved files have the character \ in them, so I've got to the point where I'm getting everything prior to the \, but I want one or two characters after it as well.
Here's what I've tried:
def fileName(itemID):
    import fnmatch
    details = itemDetails(itemID, True)  # get item id and file details
    filepath = matchPattern(details, 'Saved the binary content to: *')
    filepath = filepath.split('\\')[0]
    print(filepath)
    #os.remove(re.escape(filepath))
    return matchPattern(details, 'Binary component: *')

def matchPattern(details, pattern):
    import fnmatch
    return fnmatch.filter(details, pattern)[0].split(": ", 1)[1]
Output:
/home/user/python/test
purpleMonkeyTest.docx
I do want the file name for later: that's actually the main goal. The API downloads the damn file automatically though.
EDIT:
The answer below works for getting the chars I want. os.remove is not removing the file, though.
OSError: [Errno 2] No such file or directory: '/home/user/python/test\\Re*'
Managed to get it to work using glob; I guess os.remove doesn't support wildcards.
files = glob.glob(filepath + "*")
for file in files:
    os.remove(file)
Thanks for the help!!

As far as I understand your question, you would like to retrieve two parts: everything between the first / and the \ plus two characters afterwards, and then everything after the \:
line = "Saved the binary content to: /home/user/python/test\purpleMonkeyTest.docx"  # renamed from str to avoid shadowing the built-in
print(line[line.index("/"):line.rindex("\\") + 3])
print(line[line.rindex("\\") + 1:])
Output
/home/user/python/test\pu
purpleMonkeyTest.docx
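For the deletion step from the question's edit, glob is the right tool, since os.remove itself does not expand wildcards. A minimal sketch under that assumption (the path is the question's example; glob.escape guards against wildcard characters that may already appear in the saved path):

```python
import glob
import os

line = "Saved the binary content to: /home/user/python/test\\purpleMonkeyTest.docx"
saved_path = line.split(": ", 1)[1]
# Keep everything up to the backslash plus two characters after it.
prefix = saved_path[:saved_path.index("\\") + 3]
# glob.escape neutralises characters like *, [ and ] inside the prefix itself,
# so only the trailing "*" we append acts as a wildcard.
for f in glob.glob(glob.escape(prefix) + "*"):
    os.remove(f)
```

The loop is a no-op when nothing matches, so it is safe to run even if the file was already cleaned up.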

Related

Python - open a file [duplicate]

def choose_option(self):
    if self.option_picker.currentRow() == 0:
        description = open(":/description_files/program_description.txt", "r")
        self.information_shower.setText(description.read())
    elif self.option_picker.currentRow() == 1:
        requirements = open(":/description_files/requirements_for_client_data.txt", "r")
        self.information_shower.setText(requirements.read())
    elif self.option_picker.currentRow() == 2:
        menus = open(":/description_files/menus.txt", "r")
        self.information_shower.setText(menus.read())
I am using resource files, and something goes wrong when I use one as an argument to the open function; when I use them for loading pictures and icons, everything is fine.
That is not a valid file path. You must either use a full path
open(r"C:\description_files\program_description.txt","r")
Or a relative path
open("program_description.txt","r")
Add 'r' at the start of the path:
path = r"D:\Folder\file.txt"
That works for me.
I also ran into this error when I used open(file_path). The reason in my case was that my file_path contained a special character like "?" or "<".
I received the same error when trying to print an absolutely enormous dictionary. When I attempted to print just the keys of the dictionary, all was well!
In my case, I was using an invalid string prefix.
Wrong:
path = f"D:\Folder\file.txt"
Right:
path = r"D:\Folder\file.txt"
In my case the error was due to lack of permissions to the folder path. I entered and saved the credentials and the issue was solved.
I had the same problem.
It happens because filenames can't contain special characters like ":", "?", ">", etc.
You should replace these characters using the replace() function:
filename = filename.replace("special character to replace", "-")
You should add one more '\' before the last '\' of the path, that is: change open('C:\Python34\book.csv') to open('C:\Python34\\book.csv'), because '\b' is otherwise interpreted as an escape sequence. For example:
import csv
with open('C:\Python34\\book.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')  # delimiter must be a one-character string
    for row in spamreader:
        print(row)
Just use "/" in the file path instead:
open("description_files/program_description.txt","r")
In Windows/PyCharm: if the file path contains a sequence like \t, you need to escape it with an additional backslash, i.e. \\t.
Just use single quotation marks with a raw-string 'r' prefix and forward slashes,
e.g.
f = open(r'C:/Desktop/file.txt', 'r')
print(f.read())
I had special characters like '*' in my strings; for example, for one location I had a file Varzea*, and when I tried to save ('Varzea.csv') with an f-string, Windows complained. I just "sanitized" the string and all got back to normal.
The best approach in my case was to keep the strings to letters only, without special characters!
For me this issue was caused by trying to use a datetime as a filename.
Note: this doesn't work:
myFile = open(str(datetime.now()), "a")
The datetime.now() value contains the colon ':' character, which is not allowed in Windows filenames.
To fix this, use a filename that avoids restricted special characters. Note this resource on detecting and replacing invalid characters:
https://stackoverflow.com/a/13593932/9053474
For completeness, replace unwanted characters with the following:
import re
re.sub(r'[^\w_. -]', '_', filename)
Note these are Windows restricted characters and invalid characters differ by platform.
for folder, subs, files in os.walk(unicode(docs_dir, 'utf-8')):
    for filename in files:
        if not filename.startswith('.'):
            file_path = os.path.join(folder, filename)
In my case, the problem existed because I had not set permissions for drive "C:\"; when I changed my path to another drive like "F:\", the problem was resolved.
import pandas as pd
df = pd.read_excel('C:/Users/yourlogin/new folder/file.xlsx')
print(df)
I got this error because an old server instance was still running and holding the log file, so the new instance was not able to write to it. After deleting the log file, the issue was resolved.
When I copied the path by right-clicking the file -> Properties -> Security, it showed the error. What worked for me was copying the path and the filename separately.
I faced the same issue while working with pandas, trying to open a big csv file (the \t in "\titanic.csv" is interpreted as a tab unless the backslash is doubled):
wrong_df = pd.read_csv("D:\Python Projects\ML\titanic.csv")
right_df = pd.read_csv("D:\Python Projects\ML\\titanic.csv")
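A quick way to see why several of the answers above recommend the raw-string r prefix: backslash sequences like \n and \t inside an ordinary string literal become control characters, so the operating system is asked to open a filename containing a literal newline or tab, which does not exist. A minimal sketch (the path is made up):

```python
plain = "C:\new\text.txt"    # \n and \t silently become newline and tab
raw = r"C:\new\text.txt"     # the raw string keeps the backslashes literal

print(len(plain), len(raw))  # the two strings are not even the same length
assert "\n" in plain and "\t" in plain
assert "\n" not in raw and "\t" not in raw
```

This is also exactly the wrong_df/right_df difference in the pandas example above.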

os.path.basename to outfile

For every input file processed (see code below) I am trying to use os.path.basename to write to a new output file. I know I am missing something obvious...?
import os
import glob
import gzip
# Chem presumably comes from RDKit: from rdkit import Chem

dbpath = '/home/university/Desktop/test'
for infile in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
    print("current file is: " + infile)
    # ** the problem lines **
    outfile = os.path.basename('/home/university/Desktop/test/G[D|E]'
                               '/????/??????.xaa.fastq.gz').rsplit('.xaa.fastq.gz')[0]
    file = open(outfile, 'w+')
    # **
    gzsuppl = Chem.ForwardSDMolSupplier(gzip.open(infile))
    for m in gzsuppl:
        if m is None: continue
        # ...etc
    file.close()
print(count)
It is not clear to me how to capture element [0] (i.e. everything upstream of .xaa.fastq.gz) and use it as the basename for the new output file.
Unfortunately it simply writes the new output file as "??????" rather than the actual sequence of 6 letters.
Thanks for any help given.
This seems like it will get everything upstream of the .xaa.fastq.gz in the paths returned by glob() in your sample code:
import os

filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'
filepath = os.path.normpath(filepath)  # Changes path separators for Windows.

# This section was adapted from answer https://stackoverflow.com/a/3167684/355230
folders = []
while True:
    filepath, folder = os.path.split(filepath)
    if folder:
        folders.append(folder)
    else:
        if filepath:
            folders.append(filepath)
        break
folders.reverse()

if len(folders) > 1:
    # The last element of folders should contain the original filename.
    filename_prefix = os.path.basename(folders[-1]).split('.')[0]
    outfile = os.path.join(*(folders[:-1] + [filename_prefix + '.rest_of_filename']))
    print(outfile)  # -> \home\university\Desktop\test\GD \AAML\DEAAML.rest_of_filename
Of course what ends up in outfile isn't the final path plus filename, since I don't know what the remainder of the filename will be and just put in a placeholder (the '.rest_of_filename').
I'm not familiar with the kind of input data you're working with, but here's what I can tell you:
The "something obvious" you're missing is that outfile has no connection to infile. Your outfile line uses the ?????? rather than the actual filename because that's what you ask for. It's glob.glob that turns it into a list of matches.
Here's how I'd write that aspect of the outfile line:
outfile = infile.rsplit('.xaa.fastq.gz', 1)[0]
(The , 1 ensures that it'll never split more than once, no matter how crazy a filename gets. It's just a good habit to get into when using split or rsplit like this.)
You're setting yourself up for a bug: the glob pattern can match *.gz files which don't end in .xaa.fastq.gz. A random .gz file that happens to wind up in the folder listing would cause outfile to have the same path as infile, and you'd end up writing to the input file.
There are three solutions to this problem which apply to your use case:
Use *.xaa.fastq.gz instead of *.gz in your glob. I don't recommend this, because it's easy for a typo to sneak in and make them different again, which would silently reintroduce the bug.
Write your output to a different folder than you took your input from.
outfile = os.path.join(outpath, os.path.relpath(infile, dbpath))
outparent = os.path.dirname(outfile)
if not os.path.exists(outparent):
    os.makedirs(outparent)
Add an assert outfile != infile line so the program will die with a meaningful error message in the "this should never actually happen" case, rather than silently doing incorrect things.
The indentation of what you posted could be wrong, but it looks like you're opening a bunch of files, then only closing the last one. My advice is to use this instead, so it's impossible to get that wrong:
with open(outfile, 'w+') as file:
    # put things which use `file` here
The name file is already present in the standard library and the variable names you chose are unhelpful. I'd rename infile to inpath, outfile to outpath, and file to outfile. That way, you can tell whether each one is a path (ie. a string) or a Python file object just from the variable name and there's no risk of accessing file before you (re)define it and getting a very confusing error message.
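Tying those suggestions together (rsplit with a maxsplit of 1, a separate output folder, and the safety assert), here is a minimal sketch; OUTDIR, make_outpath, and the '.out.gz' suffix are made-up placeholders, not from the question:

```python
import os

OUTDIR = '/tmp/out'  # hypothetical output folder, separate from the input tree

def make_outpath(inpath):
    # Everything upstream of the multi-part suffix; split at most once.
    prefix = inpath.rsplit('.xaa.fastq.gz', 1)[0]
    outpath = os.path.join(OUTDIR, os.path.basename(prefix) + '.out.gz')
    # Die loudly rather than ever writing to the input file.
    assert outpath != inpath
    return outpath

print(make_outpath('/home/university/Desktop/test/GD/AAML/DEAAML.xaa.fastq.gz'))
# -> /tmp/out/DEAAML.out.gz
```

The assert costs nothing in the normal case but turns the silent clobber scenario into an immediate, debuggable failure.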

How to take input of a directory

What I'm trying to do is trawl through a directory of log files whose names begin like "filename001.log"; there can be hundreds of files in a directory.
The code I want to run against each file checks that the 8th position of the log always contains a number. I have a suspicion that a non-digit is throwing off our parser. Here's some simple code I'm trying to check this with:
# import re
from urlparse import urlparse

a = '/folderA/filename*.log'  #<< currently this only does 1 file
b = '/folderB/'               #<< I'd like it to write the same file name as it read

with open(b, 'w') as newfile, open(a, 'r') as oldfile:
    data = oldfile.readlines()
    for line in data:
        parts = line.split()
        status = parts[8]  # value of 8th position in the log file
        isDigit = status.isdigit()
        if isDigit == False:
            print " Not A Number :", status
            newfile.write(status)
My problem is:
How do I tell it to read all the files in a directory? (The above really only works for one file at a time.)
If I find something that is not a number, I would like to write that character into a file in a different folder, but with the same name as the log file. For example, if I find that filename002.log has a "*" in one of its log lines, I would like folderB/filename002.log to be made and the non-digit character written to it.
Sounds simple enough; I'm just not very good at coding.
To read files in one directory matching a given pattern and write to another, use the glob module and the os.path functions to construct the output paths:
import glob
import os

srcpat = '/folderA/filename*.log'
dstdir = '/folderB'
for srcfile in glob.iglob(srcpat):
    if not os.path.isfile(srcfile): continue
    dstfile = os.path.join(dstdir, os.path.basename(srcfile))
    with open(srcfile) as src, open(dstfile, 'w') as dst:
        for line in src:
            parts = line.split()
            status = parts[8]  # value of 8th position in the log file
            if not status.isdigit():
                print " Not A Number :", status
                dst.write(status)  # Or print >>dst, status if you want a newline
This will create empty files even if no bad entries are found. You can either wait until you're finished processing a file (when its with block has closed), check the output file's size, and delete it if it's empty; or you can take a lazy approach: unconditionally delete the output file before beginning iteration, but don't open it yet. Only when you hit a bad value do you open the file (for append instead of write, so earlier loops' output isn't discarded), write to it, and let it close.
Import os and use: for filename in os.listdir('path'):. This lists every entry in the directory, including the names of subdirectories (but it does not descend into them).
Simply open a second file with the correct path. Since you already have filename from iterating with the above method, you only have to replace the directory. You can use os.path.join for that.
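The lazy-open variant described above can be sketched as a function; the helper name filter_bad_status and the paths are made up for illustration:

```python
import glob
import os

def filter_bad_status(srcpat, dstdir):
    """Write the non-numeric 9th fields of each matching log to a same-named
    file in dstdir; files with no bad values produce no output file at all."""
    for srcfile in glob.iglob(srcpat):
        dstfile = os.path.join(dstdir, os.path.basename(srcfile))
        if os.path.exists(dstfile):
            os.remove(dstfile)      # clear any stale output up front
        dst = None                  # opened lazily, only on the first bad value
        with open(srcfile) as src:
            for line in src:
                parts = line.split()
                if len(parts) > 8 and not parts[8].isdigit():
                    if dst is None:
                        dst = open(dstfile, 'a')  # append: keep earlier writes
                    dst.write(parts[8] + '\n')
        if dst is not None:
            dst.close()
```

The len(parts) > 8 guard also protects against short lines, which would raise an IndexError in the original code.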

Python - hashlib won't "pick up" the files in a routine

I'm using the hashlib module in a script I'm writing, and I can't get it to take the files I'm pointing it towards: it only returns the hash of a 0 kB file, d41d8cd98f00b204e9800998ecf8427e.
I'm calling it like so: fHash=md5Checksum(f) where f is a full path string of a file.
The md5Checksum func is a lift from here: http://www.joelverhagen.com/blog/2011/02/md5-hash-of-file-in-python/ and I've tested the 2nd example directly with an explicitly typed path to a file and it returns the correct hash.
I am also using the os.path.getsize method in the same way (fSize=os.path.getsize(f)) and that is picking the file up correctly.
When I dump the contents of the f string in my code, and compare it to the explicitly typed path, I notice it lacks ' markers around the string:
/home/.../(500) Days of Summer[2009]/11 - Regina Spektor - Hero.mp3 from a 'print f'
and from the explicitly typed path:
print 'The MD5 checksum of text.txt is', md5Checksum('/home/.../deduper/test.txt') (which works)
If I manually add ' markers to the path the code falls over:
IOError: [Errno 2] No such file or directory: "'/home/.../(500) Days of Summer[2009]/11 - Regina Spektor - Hero.mp3'"
This makes me suspect I'm not passing the path correctly. I'm on a ubuntu box if that matters.
EDIT
I'm a buffoon. I've been stuck on this for a few days, and it's only through posting it on here and checking the indentation that I noticed I'd messed one level up in the md5Checksum method along the way... I've fixed it, and this totally works. Thank you all for making me check. (For the record, I had the m.update(data) line in line with the break. That's not going to work now, is it... :s)
import hashlib

def md5Checksum(filePath):
    fh = open(filePath, 'rb')
    m = hashlib.md5()
    # print "File being hashed: " + filePath
    while True:
        data = fh.read(8192)
        if not data:
            break
        m.update(data)
    return m.hexdigest()
I had somehow misaligned the indentation: not enough to cause an error, but enough for it not to work. The answer is posted in the original question.
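For the record, a quick sanity check makes the indentation bug visible: with m.update(data) inside the loop, hashing in 8192-byte chunks matches hashing the whole payload at once, and an empty stream gives exactly the 0 kB digest from the question. A minimal sketch using io.BytesIO in place of a real file:

```python
import hashlib
import io

def md5_chunked(fh, chunk=8192):
    m = hashlib.md5()
    while True:
        data = fh.read(chunk)
        if not data:
            break
        m.update(data)  # must sit inside the loop, not alongside `break`
    return m.hexdigest()

payload = b'x' * 20000  # spans multiple 8192-byte chunks
assert md5_chunked(io.BytesIO(payload)) == hashlib.md5(payload).hexdigest()
print(md5_chunked(io.BytesIO(b'')))  # the empty-input digest from the question
```

With the update misindented to sit next to the break, no data is ever hashed and every input collapses to that same empty digest.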

file won't write in python

I'm trying to replace a string in all the files within the current directory. For some reason, my temp file ends up blank. It seems my .write isn't working; maybe because secondfile was declared outside its scope? I'm new to Python, so still climbing the learning curve... thanks!
edit: I'm aware my temp file isn't being copied back currently. I'm also aware there are much more efficient ways of doing this; I'm doing it this way for practice. If someone could explain specifically why the .write method fails to work here, that would be great. Thanks!
import os
import shutil

for filename in os.listdir("."):
    file1 = open(filename, 'r')
    secondfile = open("temp.out", 'w')
    print filename
    for line in file1:
        line2 = line.replace('mrddb2.', 'shpdb2.')
        line3 = line2.replace('MRDDB2.', 'SHPDB2.')
        secondfile.write(line3)
    print 'file copy in progress'
    file1.close()
    secondfile.close()
Just glancing at the thing, it appears that your problem is with the 'w'. It looks like you keep overwriting, not appending. So you're basically looping through the files, and by the end you've only copied the last file to your temp file. You may want to open the file with 'a' instead of 'w'.
Your code (correctly indented, though I don't think there's a way to indent it so it runs but doesn't work right) actually seems right. Keep in mind, temp.out will be the replaced contents of only the last source file. Could it be that file is just blank?
Firstly, you have forgotten to copy the temp file back onto the original.
Secondly: use sed -i or perl -i instead of python. For instance:
perl -i -pe 's/mrddb2/shpdb2/;s/MRDDB2/SHPDB2/' *
I don't have the exact answer for you, but what might help is to stick some print lines in there in strategic places, like print each line before it was modified, then again after it was modified. Then place another one after the line was modified just before it is written to the file. Then just before you close the new file do a:
print secondfile.read()
You could also try to limit the results you get if there are too many for debugging purposes. You can limit string output by attaching a subscript modifier to the end, for example:
print secondfile.read()[:n]
If n = 100 it will limit the output to 100 characters.
If your code is actually indented as shown in the post, the write is working fine. But if it is failing, the write call may be outside the inner for loop.
Just to make sure I wasn't really missing something, I tested the code and it worked fine for me. Maybe you could try continue for everything but one specific filename and then check the contents of temp.out after that.
import os

for filename in os.listdir("."):
    if filename != 'findme.txt': continue
    print 'Processing', filename
    file1 = open(filename, 'r')
    secondfile = open("temp.out", 'w')
    print filename
    for line in file1:
        line2 = line.replace('mrddb2.', 'shpdb2.')
        line3 = line2.replace('MRDDB2.', 'SHPDB2.')
        print 'About to write:', line3
        secondfile.write(line3)
    print 'Done with', filename
    file1.close()
    secondfile.close()
Also, as others have mentioned, you're just clobbering your temp.out file each time you process a new file. You've also imported shutil without actually doing anything with it. Are you forgetting to copy temp.out back to your original file?
I noticed it sometimes will not print to the file if you don't call file.close() after file.write().
For example, this program never actually saves to the file; it just makes a blank file (unless you add outfile.close() right after outfile.write):
outfile = open("ok.txt", "w")
fc = "filecontents"
outfile.write(fc.encode("utf-8"))
while 1:
    print "working..."
OP, you might also want to try the fileinput module (this way, you don't have to use your own temp file):
import os
import fileinput

for filename in os.listdir("."):
    for line in fileinput.FileInput(filename, inplace=1):
        line = line.strip().replace('mrddb2.', 'shpdb2.')
        line = line.strip().replace('MRDDB2.', 'SHPDB2.')
        print line
Set "inplace" to 1 for editing the file in place; set it to 0 for normal printing to stdout.
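Tying the answers together, here is a minimal Python 3 sketch that keeps a per-file temp file but adds the with blocks and the copy-back step the question's code omitted; the helper name replace_in_file is made up:

```python
import os
import shutil

def replace_in_file(path, pairs):
    """Apply (old, new) replacements line by line via a temp file, then
    move the temp file over the original (the copy-back step the
    question's code left out)."""
    tmp = path + '.tmp'
    with open(path) as src, open(tmp, 'w') as dst:  # a fresh temp per input file
        for line in src:
            for old, new in pairs:
                line = line.replace(old, new)
            dst.write(line)
    shutil.move(tmp, path)

# Usage over the current directory would look like:
#   for filename in os.listdir('.'):
#       if os.path.isfile(filename):
#           replace_in_file(filename, [('mrddb2.', 'shpdb2.'),
#                                      ('MRDDB2.', 'SHPDB2.')])
```

Because each input gets its own temp file which is immediately moved back, nothing is overwritten by the next iteration, which was the original bug.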