python youtube-dl output file not in unicode? - python

Is there anyway to get the youtube-dl.extract_info() function to use unicode when creating the output file?
I have encountered the problem that if you download something with unicode characters like | in the title then the output file name will not have the same character. It will be replaced with _ instead.
Take this song title for example.
If I download it with youtube-dl then I get this file name 【Nightcore】→ Pretty Girl _ Lyrics-dMAOnScOyGE. Same thing happens with different kind of characters.
Is there any way to stop this?
Because it's a annoying if you want do do anything with that file afterwards.
To get the new file name I would need to do something like os.listdir(dir) to get the file. So it's not impossible to get the new file name, but I am just interested if there is a easier way.

The encoding of | to _ is hardcoded in sanitize_filename in youtube_dl/utils.py. You can turn it off programatically by substituting youtube_dl.utils.sanitize_filename with your own implementation.
However, doing so is not recommended, and not supported out of the box. This is because | is an invalid character on Windows and can be used to execute arbitrary commands if expanded in a buggy script.
Insecure filenames were supported at one time, but I removed them from youtube-dl because too many people were shooting themselves in the foot, and often reported problems that clearly would have let any attacker execute arbitrary code on their machines.

Related

Executing subprocess cannot find specified file on Windows

I'm working inside a system that has Jython2.5 but I need to be able to call some of Google's apis so I wrote an offline script that I wanted to call from my Jython environment and return to me small pieces of data. Like a JobID or a sheet URL or something from Google.
I've tried a number of things but I always get an error back from Windows, saying that it cannot find the file specified.
Path is done in two ways.
The first way using a string
stringPath = r"‪C:\GooglePipes\Scripts\filetobq.py C:\GooglePipes\Keys\DEV-BigQueryKey.json nofile C:\GooglePipes\BQ_Downtime\TESTFILE.CSV dataset1 table1"
And the second way, as a sequence (per the docs, using shell=false supply a sequence)
seqPath = [r"‪C:\GooglePipes\Scripts\filetobq.py",r"C:\GooglePipes\Keys\DEV-BigQueryKey.json","nofile",r"C:\GooglePipes\BQ_Downtime\TESTFILE.CSV","dataset1","table1"]
Called with
data, err = Popen(seqPath, shell=True, stderr=PIPE, stdout=PIPE).communicate()
#Read values back in
print data
print err
Replacing seqPath with stringPath to try it either way.
I've been at this all weekend, every time I run it I get from Windows
The system cannot find the path specified.
from the err print. I've been unable to debug much further than this. I'm not really sure what's happening. When I paste the stringPath variable directly into my computer's command window it executes.
I've also called subprocess.list2cmdline(seqPath) to see what it's outputting. It's giving me a ? in front of the string, but I haven't been able to figure out what that means. I can paste the rest of the string, starting after the question mark into the command window and it executes.
?C:\GooglePipes\Scripts\filetobq.py C:\GooglePipes...
I've tried a number of different combinations of true and false on shell, passing different args into Popen, double slashes, and I have no less than 30 tabs open from stack overflow and other help forums. I just have no idea what to do at this point and any help is appreciated.
Edit
The ? at the start of the sting is actually a NULL character when I did some additional logging. This seems to be the root of my problem. I can't figure out why it shows up, but it was present in my copy pastes. I started manually typing, and I got it working. When I feed the path with my Jython program it is present again.
Ultimately the error was the ?/NULL character.
I went back to the source value where the program was grabbing the path and it was present there. After I hand-re keyed it in, everything started working.
If you copy and paste what I put in the question, you can see the NULL character in the string if you run it through a string->ASCII converter.
>C:
>NULL 67 58
What a bunch of bullsh***.

Proper format for a file name in python

I'm importing an mp3 file using IPython (more specifically, the IPython.display.display(IPython.display.Audio() command) and I wanted to know if there was a specific way you were supposed to format the file path.
The documentation says it takes the file path so I assumed (perhaps incorrectly) that it should be something like \home\downloads\randomfile.mp3 which I used an online converter to convert into unicode. I put that in (using, of course, filename=u'unicode here' but that didn't work, instead giving a bunch of errors. I tried reformatting it in different ways (just \downloads\randomfile.mp3, etc) but none of them worked. For those who are curious, here is the unicode characters: \u005c\u0044\u006f\u0077\u006e\u006c\u006f\u0061\u0064\u0073\u005c\u0062\u0064\u0061\u0079\u0069\u006e\u0073\u0074\u0072\u0075\u006d\u0065\u006e\u0074\u002e\u006d\u0070\u0033 which translates to \home\Downloads\bdayinstrument.mp3, I believe.
So, am I doing something wrong? What is the correct way to format the "file path"?
Thanks!

Can you use os.path.exists() on a file that starts with a number?

I have a set of files named 16ID_#.txt where # represents a number. I want to check if a specific file number exists, using os.path.exists(), before attempting to import the file to python. When I put together my variable for the folder where the files are, with the name of the file (e.x.: folderpath+"\16ID_#.txt"), python interprets the "\16" as a music note.
Is there any way I can prevent this, so that folderpath+"\16ID_#.txt" is interpreted as I want it to be?
I cannot change the names of the files, they are output by another program over which I have no control.
You can use / to build paths, regardless of operating system, but the correct way is to use os.path.join:
os.path.exists(os.path.join(folderpath, "16ID_#.txt"))
I get these are windows \paths. Maybe the problem is that you need to escape the backslash, because \16 could be interpreted as a special code. So maybe you need to put \\16 instead of \16.

Specifying filename in os.system call from python

I am creating a simple file in python to reorganize some text data I grabbed from a website. I put the data in a .txt file and then want to use the "tail" command to get rid of the first 5 lines. I'm able to make this work for a simple filename shown below, but when I try to change the filename (to what I'd actually like it to be) I get an error. My code:
start = 2010
end = 2010
for i in range(start,end+1)
year = str(i)
...write data to a file called file...
teamname=open(file).readline() # want to use this in the new filename
teamfname=teamname.replace(" ","") #getting rid of spaces
file2 = "gotdata2_"+year+".txt"
os.system("tail -n +5 gotdata_"+year+".txt > "+file2)
The above code works as intended, creating file, then creating file2 that excludes the first 5 lines of file. However, when I change the name of file2 to be:
file2 = teamfname+"_"+year+".txt"
I get the error:
sh: line 1: _2010.txt: command not found
It's as if the end of my file2 statement is getting chopped off and the .txt part isn't being recognized. In this case, my code outputs a file but is missing the _2010.txt at the end. I've double checked that both year and teamfname are strings. I've also tried it with and without spaces in the teamfname string. I get the same error when I try to include a os.system mv statement that would rename the file to what I want it to be, so there must be something wrong with my understanding of how to specify the string here.
Does anyone have any ideas about what causes this? I haven't been able to find a solution, but I've found this problem difficult to search for.
Without knowing what your actual strings are, it's impossible to be sure what the problem is. However, it's almost certainly something to do with failing to properly quote and/or escape arguments for the command line.
My first guess would be that you have a newline in the middle of your filename, and the shell is truncating the command at the newline. But I wouldn't bet too heavily on that. If you actually printed out the repr of the pathname, I could tell you for sure. But why go through all this headache?
The solution to almost any problem with os.system is to not use os.system.
If you look at the docs, they even tell you this:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
If you use subprocess instead of os.system, you can avoid the shell entirely. You can also pass arguments as a list instead of trying to figure out how to quote them and escape them properly. Which would completely avoid the exact problem you're having.
For example, if you do this:
file2 = "gotdata2_"+year+".txt"
with open(file2, 'wb') as f:
subprocess.check_call(['tail', '-n', '+5', "gotdata_"+year+".txt"], stdout=f)
Then, if you change that first line to this:
file2 = teamfname+"_"+year+".txt"
It will still work even if teamfname has a space or a quote or another special character in it.
That being said, I'm not sure why you want to use tail in the first place. You can skip the first 5 lines just as easily directly in Python.

wrong rename with python

I need to rename my picture with the EXIF data, but I have a problem: if I use ":" to separate the time (hour:minute:second), the file name gets crazy!
metadata = pyexiv2.ImageMetadata(lunga + i)
metadata.read()
tag = metadata['Exif.Image.DateTime']
estensione = ".jpg"
new_data = tag.value.strftime('%Y-%m-%d %H-%M-%S')
new_name = lunga + str(new_data) + estensione
os.renames(lunga + i, new_name)
works great, but with
new_data = tag.value.strftime('%Y-%m-%d %H:%M:%S')
I get something like
2A443K~H.jpg
The problem is that you're not allowed to put colons into filenames on Windows. You're not actually using Windows… but you are using an SMB share, which means you're bound by Windows rules.
The fix is to not put colons into your filenames.
If you want to understand why this bizarre stuff is happening, read on.
The details on Windows filenames are described in Naming Files, Paths, and Namespaces at MSDN, but I'll summarize the relevant parts here.
The NT kernel underneath Windows has no problems with colons, but the Win32 layer on top of it can't handle them (and the quasi-POSIX layer in MSVCRT sits on top of Win32).
So, at the C level, if you call NT functions like NtSetInformationFile, it will save them just fine. If you call Win32 functions like MoveFileEx, they will normally give you an error, but if you use the special \\?\ syntax to say "pass this name straight through to NT", it will work. And if you call MSVCRT functions like rename, you will get an error. Older versions of Python called rename, which would just give you an error. Newer versions call MoveFileEx, and will try to wrap the name up in \\?\ syntax (because that also allows you to get around some other stupid limitations, like the excessively short MAX_PATH value).
So, what happens if you give a file a name that Win32 can't understand? Remember that on Windows, every file has two different names: the "long name" and the "short name". The short name is a DOS-style 8.3 filename. So whenever it can't display the long name, it displays the short name instead.
Where does the short name come from? If you don't create one explicitly, Windows will create one for you from the long name by using the first 6 characters, a tilde, and a number of letter. So, for example, the short name for "Program Files" is "PROGRA~1". But if Windows can't handle the long name, it will just make up a short name out of 6 random characters, a tilde, and a random character. So you get something like 2A443K~H.
The NTFS filesystem, being designed for Windows, expects to be used in Windows-y ways. So, if you're using an NTFS volume, even on a non-Windows system, the driver will emulate some of this functionality, giving you similar but not identical behavior.
And of course if you're talking to a share from a Windows system, or a share backed by an NTFS drive on a non-Windows system, again, some of the same things will apply.
Even if both your computer and the file server are non-Windows and the filesystem is not NTFS, if you're using SMB/CIFS for file sharing, SMB was also designed for Windows, and you will again get similar behavior.
At least you no longer have to worry about VMS, classic Mac, and other naming systems, just POSIX and Windows.
Colons are reserved characters in Windows filesystem (see How would I go about creating a filename with invalid characters such as :?>?), so the name was replaced with an auto-generated on instead.
To be clear, this is not a Python issue. Don't use colons or other reserved characters in filenames if you don't want this to happen.

Categories