Proper format for a file name in python

I'm importing an mp3 file using IPython (more specifically, the IPython.display.display(IPython.display.Audio()) command), and I wanted to know if there is a specific way you are supposed to format the file path.
The documentation says it takes the file path, so I assumed (perhaps incorrectly) that it should be something like \home\downloads\randomfile.mp3, which I used an online converter to convert into unicode. I put that in (using, of course, filename=u'unicode here'), but that didn't work and instead gave a bunch of errors. I tried reformatting it in different ways (just \downloads\randomfile.mp3, etc.), but none of them worked. For those who are curious, here are the unicode characters: \u005c\u0044\u006f\u0077\u006e\u006c\u006f\u0061\u0064\u0073\u005c\u0062\u0064\u0061\u0079\u0069\u006e\u0073\u0074\u0072\u0075\u006d\u0065\u006e\u0074\u002e\u006d\u0070\u0033 which translates to \Downloads\bdayinstrument.mp3, I believe.
So, am I doing something wrong? What is the correct way to format the "file path"?
Thanks!

Related

python youtube-dl output file not in unicode?

Is there any way to get the youtube-dl.extract_info() function to use unicode when creating the output file?
I have encountered the problem that if you download something with unicode characters like | in the title, then the output file name will not have the same character. It will be replaced with _ instead.
Take this song title for example.
If I download it with youtube-dl, then I get this file name: 【Nightcore】→ Pretty Girl _ Lyrics-dMAOnScOyGE. The same thing happens with other kinds of characters.
Is there any way to stop this?
It's annoying if you want to do anything with that file afterwards.
To get the new file name I would need to do something like os.listdir(dir) to find the file. So it's not impossible to get the new file name, but I am interested in whether there is an easier way.

The encoding of | to _ is hardcoded in sanitize_filename in youtube_dl/utils.py. You can turn it off programmatically by substituting youtube_dl.utils.sanitize_filename with your own implementation.
However, doing so is not recommended, and not supported out of the box. This is because | is an invalid character on Windows and can be used to execute arbitrary commands if expanded in a buggy script.
Insecure filenames were supported at one time, but I removed them from youtube-dl because too many people were shooting themselves in the foot, and often reported problems that clearly would have let any attacker execute arbitrary code on their machines.

Playing Audio with subprocess.call in Python

I wanted to play a .wav file without using external modules, and I read I could do that using this:
def play(audio_file_path):
subprocess.call(["ffplay", "-nodisp", "-autoexit", /Users/me/Downloads/sample.wav])
I however get:
SyntaxError: invalid syntax
If I use os.path.realpath to get the absolute path of the file, I get the same error. (This is the path I see under Get Info.)
The environment is OS X, Python 2.7.
Can someone tell me what I am doing wrong? I am new to Python (and to programming).

There are multiple problems:
1. Indentation: code inside the function should be indented to show that it is part of the function.
2. The file name should be in quotes: it should be a string.
It should be:
import subprocess
def play(audio_file_path):
    subprocess.call(["ffplay", "-nodisp", "-autoexit", "/Users/me/Downloads/sample.wav"])
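The corrected function still ignores its audio_file_path parameter and always plays the hardcoded file. A minimal sketch of a version that uses the parameter is below. The helper name build_ffplay_command is hypothetical (it is not from the original answer), and the sketch assumes ffplay is installed and on the PATH.

```python
import subprocess

def build_ffplay_command(audio_file_path):
    """Return the ffplay argument list for playing one file without a window."""
    # Each argument is its own list element, so paths with spaces need no quoting.
    return ["ffplay", "-nodisp", "-autoexit", audio_file_path]

def play(audio_file_path):
    """Play the file and return ffplay's exit code (assumes ffplay is on PATH)."""
    return subprocess.call(build_ffplay_command(audio_file_path))
```

Calling play("/Users/me/Downloads/sample.wav") would then run the same command as the answer above.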

Can you use os.path.exists() on a file that starts with a number?

I have a set of files named 16ID_#.txt where # represents a number. I want to check if a specific file number exists, using os.path.exists(), before attempting to import the file into Python. When I put together my variable for the folder where the files are with the name of the file (e.g. folderpath+"\16ID_#.txt"), Python interprets the "\16" as a music note.
Is there any way I can prevent this, so that folderpath+"\16ID_#.txt" is interpreted as I want it to be?
I cannot change the names of the files, they are output by another program over which I have no control.

You can use / to build paths, regardless of operating system, but the correct way is to use os.path.join:
os.path.exists(os.path.join(folderpath, "16ID_#.txt"))
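As a rough illustration of the os.path.join approach, the sketch below wraps the check in a small helper and tries it in a temporary directory. The function name numbered_file_exists and the file numbers are made up for the example, and it assumes Python 3 (for tempfile.TemporaryDirectory).

```python
import os
import tempfile

def numbered_file_exists(folderpath, number):
    """Check for a file named 16ID_<number>.txt inside folderpath."""
    filename = "16ID_{}.txt".format(number)
    # os.path.join inserts the right separator, so no backslash is typed by hand.
    return os.path.exists(os.path.join(folderpath, filename))

# Demonstration in a throwaway directory containing only 16ID_3.txt.
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "16ID_3.txt"), "w").close()
    found = numbered_file_exists(folder, 3)
    missing = numbered_file_exists(folder, 4)
```

Because the file name never follows a hand-written backslash, the "\16" escape problem cannot occur.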

I take it these are Windows paths. The problem is most likely that you need to escape the backslash, because \16 is interpreted as an escape sequence. So you need to write \\16 instead of \16.
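To see what the interpreter does with each spelling, here is a small sketch. The file name 16ID_1.txt is just an example value.

```python
unescaped = "\16ID_1.txt"   # "\16" is an octal escape: a single control character
escaped = "\\16ID_1.txt"    # doubled backslash: a literal backslash, then "16"
raw = r"\16ID_1.txt"        # raw string: same result as the escaped form

# Code point of the first character of the unescaped string.
first_char_code = ord(unescaped[0])
```

The unescaped string is two characters shorter than the other two, because the backslash and both digits collapse into one character.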

Cannot read in files

I have a small problem with reading in my file. My code:
import csv as csv
import numpy
with open("train_data.csv","rb") as training:
    csv_file_object = csv.reader(training)
    header = csv_file_object.next()
    data = []
    for row in csv_file_object:
        data.append(row)
data = numpy.array(data)
I get the error no such file "train_data.csv", so I know the problem lies with the location. But whenever I specify the path like this: open("C:\Desktop...etc) it doesn't work either. What am I doing wrong?

If you give the full file path, your script should work. Since it does not, it must be that you have escape characters in your path. To fix this, use a raw string to specify the file path:
# Put an 'r' at the start of the string to make it a raw string.
with open(r"C:\path\to\file\train_data.csv","rb") as training:
Raw strings do not process escape characters.
Also, as a technical note, not giving the full file path causes Python to look for the file in the current working directory, which is normally the directory the script is launched from. If the file is not there, an error is raised.
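Both points can be checked with a short sketch. The path C:\temp\new_data.csv is a hypothetical one, chosen only because "\t" and "\n" happen to be escape sequences.

```python
import os

# Without the r prefix, "\t" becomes a tab and "\n" becomes a newline.
plain = "C:\temp\new_data.csv"
raw = r"C:\temp\new_data.csv"

contains_tab = "\t" in plain
contains_newline = "\n" in plain

# A bare file name is resolved against the current working directory.
resolved = os.path.abspath("train_data.csv")
expected = os.path.join(os.getcwd(), "train_data.csv")
```

Printing os.getcwd() just before the open() call is a quick way to see where Python is actually looking.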

When you use open() on Windows, you need to deal with the backslashes properly.
Option 1.) Use a raw string, which is a string prefixed with an r.
open(r'C:\Users\Me\Desktop\train_data.csv')
Option 2.) Escape the backslashes
open('C:\\Users\\Me\\Desktop\\train_data.csv')
Option 3.) Use forward slashes
open('C:/Users/Me/Desktop/train_data.csv')
As for finding the file you are using, if you just do open('train_data.csv') it is looking in the directory you are running the python script from. So, if you are running it from C:\Users\Me\Desktop\, your train_data.csv needs to be on the desktop as well.
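The three options can be compared directly. This sketch uses the standard-library ntpath module (the Windows flavor of os.path, importable on any platform) to show how Windows path rules treat the forward-slash form.

```python
import ntpath  # Windows path rules, usable on any operating system

raw = r'C:\Users\Me\Desktop\train_data.csv'
escaped = 'C:\\Users\\Me\\Desktop\\train_data.csv'
forward = 'C:/Users/Me/Desktop/train_data.csv'

# Options 1 and 2 produce exactly the same string.
same_string = (raw == escaped)

# Option 3 is a different string, but normalizes to the same Windows path.
normalized = ntpath.normpath(forward)
```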

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
input1 = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
    print page.extractText()

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
"This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated."
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.
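If you would rather get the text back into Python than into a file, something like the following sketch may help. It assumes Python 3 and a pdftotext binary on the PATH. The function names are made up for this example, and it relies on pdftotext writing to standard output when the output name is given as "-".

```python
import shutil
import subprocess

def build_pdftotext_command(pdf_path, text_path="-"):
    """Return the pdftotext argument list; "-" sends the text to stdout."""
    return ["pdftotext", pdf_path, text_path]

def extract_text(pdf_path):
    """Return the text of pdf_path, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # The output encoding is assumed to be UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

extract_text("forms.pdf") would then return the whole document as one string, which can be split on form feed characters if per-page text is needed.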

This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and even sometimes when there's no non-ASCII characters. When pyPdf encounters strings that are not in standard Unicode encoding, it just sees a bunch of byte code; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.

You could also try the pdfminer library (also in python), and see if it's better at extracting the text. For splitting however, you will have to stick with pyPdf as pdfminer doesn't support that.

I sometimes find it useful to convert the file to PS (try both pdf2ps and pdftops, since their output can differ) and then back to PDF (ps2pdf). Then try your original script again.

I had a similar problem with some PDFs, and on Windows this works very well for me:
1.- Download Xpdf tools for Windows
2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64
3.- Use subprocess to run the command from the console:
import subprocess
# filePath is assumed to already hold the path to the PDF file
try:
    extInfo = subprocess.check_output('pdftotext.exe '+filePath + ' -',shell=True,stderr=subprocess.STDOUT).strip()
except Exception as e:
    print (e)

I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp 1-82, which have text page labels (pdftotext can extract these), and pp 83-end, which have no page labels but which pyPdf can extract, and it explicitly knows the pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.
