While the following code works well in windows, in Linux server (of pythonanywhere) the function only returns 0, without errors. What am I missing?
import os
def folder_size(path):
total = 0
for entry in os.scandir(path):
if entry.is_file():
total += entry.stat().st_size
elif entry.is_dir():
total += folder_size(entry.path)
return total
print(folder_size("/media"))
Ref: Code from https://stackoverflow.com/a/37367965/6546440
The solution was given by #gilen-tomas in the comments:
import os
def folder_size(path):
total = 0
for entry in os.scandir(path):
if entry.is_file():
total += entry.stat().st_size
elif entry.is_dir():
total += folder_size(entry.path)
return total
print(folder_size("/home/your-user/your-proyect/media/"))
A complete path is needed!
Depending on the filesystem, the underlying struct dirent may not know if any given entry is a file or directory (or something else). Perhaps, on the filesystem used by pythonanywhere, you need to stat first (stat_result.st_type ought to be valid).
Edit: A look in discussion on os.scandir suggests the DT_UNKNOWN case is handled by doing another stat. I'd still try confirming those checks work as expected.
you can try this..
For linux :
import os
path = '/home/user/Downloads'
folder = sum([sum(map(lambda fname: os.path.getsize(os.path.join(directory, fname)), files)) for directory, folders, files in os.walk(path)])
MB=1024*1024.0
print "%.2f MB"%(folder/MB)
For windows :
import win32com.client as com
folderPath = r"/home/user/Downloads"
fso = com.Dispatch("Scripting.FileSystemObject")
folder = fso.GetFolder(folderPath)
MB=1024*1024.0
print "%.2f MB"%(folder.Size/MB)
It's worked for me in linux (Ubuntu server 16.04, python 3.5), but there could be some permission errors if the process doesn't have permission for reading a file.
Not a solution for this, but other way to get the size is using the cmd from python:
import subprocess
import re
cmd = ["du", "-sh", "-b", "media"]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
tmp = str(proc.stdout.read())
tmp = re.findall('\d+', tmp)[0]
print(tmp)
If you are executing this from your proyect (instead of manually in the terminal) a complete path is needed in "media" ("/home/your-user/your-proyect/media/")
I need to obtain the path to the directory created for a usb drive(I think it's something like /media/user/xxxxx) for a simple usb mass storage device browser that I am making. Can anyone suggest the best/simplest way to do this? I am using an Ubuntu 13.10 machine and will be using it on a linux device.
Need this in python.
This should get you started:
#!/usr/bin/env python
import os
from glob import glob
from subprocess import check_output, CalledProcessError
def get_usb_devices():
sdb_devices = map(os.path.realpath, glob('/sys/block/sd*'))
usb_devices = (dev for dev in sdb_devices
if 'usb' in dev.split('/')[5])
return dict((os.path.basename(dev), dev) for dev in usb_devices)
def get_mount_points(devices=None):
devices = devices or get_usb_devices() # if devices are None: get_usb_devices
output = check_output(['mount']).splitlines()
is_usb = lambda path: any(dev in path for dev in devices)
usb_info = (line for line in output if is_usb(line.split()[0]))
return [(info.split()[0], info.split()[2]) for info in usb_info]
if __name__ == '__main__':
print get_mount_points()
How does it work?
First, we parse /sys/block for sd* files (courtesy of https://stackoverflow.com/a/3881817/1388392) to filter out usb devices.
Later you call mount and parse output for lines only for those devices.
Of course they might be some edge cases, when this won't work, portability issues etc. Or better ways to do it. But for more information you should rather seek help on SuperUser or ServerFault, with more experienced linux hackers.
I had to modify #m.wasowski 's code to make it work on Python3.5.4 as follows.
def get_mount_points(devices=None):
devices = devices or get_usb_devices() # if devices are None: get_usb_devices
output = check_output(['mount']).splitlines()
output = [tmp.decode('UTF-8') for tmp in output]
def is_usb(path):
return any(dev in path for dev in devices)
usb_info = (line for line in output if is_usb(line.split()[0]))
return [(info.split()[0], info.split()[2]) for info in usb_info]
Using m.wasowski code, unexpected behavior can occur:
return [(info.split()[0], info.split()[2]) for info in usb_info]
This part of code can produce bug, if your USB device name has white space character in it. I got that behavior with device named "USB DEVICE".
info.split()[2]
Returned media/home/USB for me, when it is media/home/USB DEVICE.
I modified that part, so it was founding word "type", and replaced that line with this:
#return [(info.split()[0], info.split()[2]) for info in usb_info]
fullInfo = []
for info in usb_info:
print(info)
mountURI = info.split()[0]
usbURI = info.split()[2]
print(info.split().__sizeof__())
for x in range(3, info.split().__sizeof__()):
if info.split()[x].__eq__("type"):
for m in range(3, x):
usbURI += " "+info.split()[m]
break
fullInfo.append([mountURI, usbURI])
return fullInfo
I had to further modify #nick-sikrier and #m-wasowski response to handle LUKs encrypted devices.
def get_usb_devices():
sdb_devices = map(os.path.realpath, glob('/sys/block/sd*'))
usb_devices = (dev for dev in sdb_devices
if any(['usb' in dev.split('/')[5],
'usb' in dev.split('/')[6]]))
return dict((os.path.basename(dev), dev) for dev in usb_devices)
def get_mount_points(
devices = get_usb_devices()
fullInfo = []
for dev in devices:
output = subprocess.check_output(['lsblk', '-lnpo', 'NAME,MOUNTPOINT', '/dev/' + dev]).splitlines()
for mnt_point in output:
mnt_point_split = mnt_point.split(' ', 1)
if len(mnt_point_split) > 1 and mnt_point_split[1].strip():
fullInfo.append([mnt_point_split[0], mnt_point_split[1]])
return fullInfo
With a simple shell pipe executed in python:
import subprocess
driver_name = "my_usb_stick"
path = subprocess.check_output("cat /proc/mounts | grep '"+driver_name+"' | awk '{print $2}'", shell=True)
path = path.decode('utf-8') # convert bytes in string
>>> "/media/user/my_usb_stick"
Explanations
/proc/mounts/ : Is a file listing all mounted devices
The 1st column specifies the device that is mounted.
The 2nd column reveals the mount point.
The 3rd column tells the file-system type.
The 4th column tells you if it is mounted read-only (ro) or read-write (rw).
The 5th and 6th columns are dummy values designed to match the format used in /etc/mtab
More details see this answer : How to interpret /proc/mounts?
grep returns the line containing your driver's name
awk returns the 2nd columns, aka the mount point, aka your path.
I have downloaded a bunch of videos from coursera.org and have them stored in one particular folder. There are many individual videos in a particular folder (Coursera breaks a lecture into multiple short videos). I would like to have a python script which gives the combined length of all the videos in a particular directory. The video files are .mp4 format.
First, install the ffprobe command (it's part of FFmpeg) with
sudo apt install ffmpeg
then use subprocess.run() to run this bash command:
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 -- <filename>
(which I got from http://trac.ffmpeg.org/wiki/FFprobeTips#Formatcontainerduration), like this:
from pathlib import Path
import subprocess
def video_length_seconds(filename):
result = subprocess.run(
[
"ffprobe",
"-v",
"error",
"-show_entries",
"format=duration",
"-of",
"default=noprint_wrappers=1:nokey=1",
"--",
filename,
],
capture_output=True,
text=True,
)
try:
return float(result.stdout)
except ValueError:
raise ValueError(result.stderr.rstrip("\n"))
# a single video
video_length_seconds('your_video.webm')
# all mp4 files in the current directory (seconds)
print(sum(video_length_seconds(f) for f in Path(".").glob("*.mp4")))
# all mp4 files in the current directory and all its subdirectories
# `rglob` instead of `glob`
print(sum(video_length_seconds(f) for f in Path(".").rglob("*.mp4")))
# all files in the current directory
print(sum(video_length_seconds(f) for f in Path(".").iterdir() if f.is_file()))
This code requires Python 3.7+ because that's when text= and capture_output= were added to subprocess.run. If you're using an older Python version, check the edit history of this answer.
Download MediaInfo and install it (don't install the bundled adware)
Go to the MediaInfo source downloads and in the "Source code, All included" row, choose the link next to "libmediainfo"
Find MediaInfoDLL3.py in the downloaded archive and extract it anywhere.
Example location: libmediainfo_0.7.62_AllInclusive.7z\MediaInfoLib\Source\MediaInfoDLL\MediaInfoDLL3.py
Now make a script for testing (sources below) in the same directory.
Execute the script.
MediaInfo works on POSIX too. The only difference is that an so is loaded instead of a DLL.
Test script (Python 3!)
import os
os.chdir(os.environ["PROGRAMFILES"] + "\\mediainfo")
from MediaInfoDLL3 import MediaInfo, Stream
MI = MediaInfo()
def get_lengths_in_milliseconds_of_directory(prefix):
for f in os.listdir(prefix):
MI.Open(prefix + f)
duration_string = MI.Get(Stream.Video, 0, "Duration")
try:
duration = int(duration_string)
yield duration
print("{} is {} milliseconds long".format(f, duration))
except ValueError:
print("{} ain't no media file!".format(f))
MI.Close()
print(sum(get_lengths_in_milliseconds_of_directory(os.environ["windir"] + "\\Performance\\WinSAT\\"
)), "milliseconds of content in total")
In addition to Janus Troelsen's answer above, I would like to point out a small problem I
encountered when implementing his answer. I followed his instructions one by one but had different results on windows (7) and linux (ubuntu). His instructions worked perfectly under linux but I had to do a small hack to get it to work on windows. I am using a 32-bit python 2.7.2 interpreter on windows so I utilized MediaInfoDLL.py. But that was not enough to get it to work for me I was receiving this error at this point in the process:
"WindowsError: [Error 193] %1 is not a valid Win32 application".
This meant that I was somehow using a resource that was not 32-bit, it had to be the DLL MediaInfoDLL.py was loading. If you look at the MediaInfo intallation directory you will see 3 dlls MediaInfo.dll is 64-bit while MediaInfo_i386.dll is 32-bit. MediaInfo_i386.dll is the one which I had to use because of my python setup. I went to
MediaInfoDLL.py (which I already had included in my project) and changed this line:
MediaInfoDLL_Handler = windll.MediaInfo
to
MediaInfoDLL_Handler = WinDLL("C:\Program Files (x86)\MediaInfo\MediaInfo_i386.dll")
I didn't have to change anything for it to work in linux
Nowadays pymediainfo is available, so Janus Troelsen's answer could be simplified.
You need to install MediaInfo and pip install pymediainfo. Then the following code would print you the total length of all video files:
import os
from pymediainfo import MediaInfo
def get_track_len(file_path):
media_info = MediaInfo.parse(file_path)
for track in media_info.tracks:
if track.track_type == "Video":
return int(track.duration)
return 0
print(sum(get_track_len(f) for f in os.listdir('directory with video files')))
This link shows how to get the length of a video file https://stackoverflow.com/a/3844467/735204
import subprocess
def getLength(filename):
result = subprocess.Popen(["ffprobe", filename],
stdout = subprocess.PIPE, stderr = subprocess.STDOUT)
return [x for x in result.stdout.readlines() if "Duration" in x]
If you're using that function, you can then wrap it up with something like
import os
for f in os.listdir('.'):
print "%s: %s" % (f, getLength(f))
Here's my take. I did this on Windows. I took the answer from Federico above, and changed the python program a little bit to traverse a tree of folders with video files. So you need to go above to see Federico's answer, to install MediaInfo and to pip install pymediainfo, and then write this program, summarize.py:
import os
import sys
from pymediainfo import MediaInfo
number_of_video_files = 0
def get_alternate_len(media_info):
myJson = media_info.to_data()
myArray = myJson['tracks']
for track in myArray:
if track['track_type'] == 'General' or track['track_type'] == 'Video':
if 'duration' in track:
return int(track['duration'] / 1000)
return 0
def get_track_len(file_path):
global number_of_video_files
media_info = MediaInfo.parse(file_path)
for track in media_info.tracks:
if track.track_type == "Video":
number_of_video_files += 1
if type(track.duration) == int:
len_in_sec = int(track.duration / 1000)
elif type(track.duration) == str:
len_in_sec = int(float(track.duration) / 1000)
else:
len_in_sec = get_alternate_len(media_info)
if len_in_sec == 0:
print("File path = " + file_path + ", problem in type of track.duration")
return len_in_sec
return 0
sum_in_secs = 0.0
os.chdir(sys.argv[1])
for root, dirs, files in os.walk("."):
for name in files:
sum_in_secs += get_track_len(os.path.join(root, name))
hours = int(sum_in_secs / 3600)
remain = sum_in_secs - hours * 3600
minutes = int(remain / 60)
seconds = remain - minutes * 60
print("Directory: " + sys.argv[1])
print("Total number of video files is " + str(number_of_video_files))
print("Length: %d:%02d:%02d" % (hours, minutes, seconds))
Run it: python summarize.py <DirPath>
Have fun. I found I have about 1800 hours of videos waiting for me to have some free time. Yeah sure
I'm getting an error in a program that is supposed to run for a long time that too many files are open. Is there any way I can keep track of which files are open so I can print that list out occasionally and see where the problem is?
To list all open files in a cross-platform manner, I would recommend psutil.
#!/usr/bin/env python
import psutil
for proc in psutil.process_iter():
print(proc.open_files())
The original question implicitly restricts the operation to the currently running process, which can be accessed through psutil's Process class.
proc = psutil.Process()
print(proc.open_files())
Lastly, you'll want to run the code using an account with the appropriate permissions to access this information or you may see AccessDenied errors.
I ended up wrapping the built-in file object at the entry point of my program. I found out that I wasn't closing my loggers.
import io
import sys
import builtins
import traceback
from functools import wraps
def opener(old_open):
#wraps(old_open)
def tracking_open(*args, **kw):
file = old_open(*args, **kw)
old_close = file.close
#wraps(old_close)
def close():
old_close()
open_files.remove(file)
file.close = close
file.stack = traceback.extract_stack()
open_files.add(file)
return file
return tracking_open
def print_open_files():
print(f'### {len(open_files)} OPEN FILES: [{", ".join(f.name for f in open_files)}]', file=sys.stderr)
for file in open_files:
print(f'Open file {file.name}:\n{"".join(traceback.format_list(file.stack))}', file=sys.stderr)
open_files = set()
io.open = opener(io.open)
builtins.open = opener(builtins.open)
On Linux, you can look at the contents of /proc/self/fd:
$ ls -l /proc/self/fd/
total 0
lrwx------ 1 foo users 64 Jan 7 15:15 0 -> /dev/pts/3
lrwx------ 1 foo users 64 Jan 7 15:15 1 -> /dev/pts/3
lrwx------ 1 foo users 64 Jan 7 15:15 2 -> /dev/pts/3
lr-x------ 1 foo users 64 Jan 7 15:15 3 -> /proc/9527/fd
Although the solutions above that wrap opens are useful for one's own code, I was debugging my client to a third party library including some c extension code, so I needed a more direct way. The following routine works under darwin, and (I hope) other unix-like environments:
def get_open_fds():
'''
return the number of open file descriptors for current process
.. warning: will only work on UNIX-like os-es.
'''
import subprocess
import os
pid = os.getpid()
procs = subprocess.check_output(
[ "lsof", '-w', '-Ff', "-p", str( pid ) ] )
nprocs = len(
filter(
lambda s: s and s[ 0 ] == 'f' and s[1: ].isdigit(),
procs.split( '\n' ) )
)
return nprocs
If anyone can extend to be portable to windows, I'd be grateful.
On Linux, you can use lsof to show all files opened by a process.
As said earlier, you can list fds on Linux in /proc/self/fd, here is a simple method to list them programmatically:
import os
import sys
import errno
def list_fds():
"""List process currently open FDs and their target """
if not sys.platform.startswith('linux'):
raise NotImplementedError('Unsupported platform: %s' % sys.platform)
ret = {}
base = '/proc/self/fd'
for num in os.listdir(base):
path = None
try:
path = os.readlink(os.path.join(base, num))
except OSError as err:
# Last FD is always the "listdir" one (which may be closed)
if err.errno != errno.ENOENT:
raise
ret[int(num)] = path
return ret
On Windows, you can use Process Explorer to show all file handles owned by a process.
There are some limitations to the accepted response, in that it does not seem to count pipes. I had a python script that opened many sub-processes, and was failing to properly close standard input, output, and error pipes, which were used for communication. If I use the accepted response, it will fail to count these open pipes as open files, but (at least in Linux) they are open files and count toward the open file limit. The lsof -p solution suggested by sumid and shunc works in this situation, because it also shows you the open pipes.
Get a list of all open files. handle.exe is part of Microsoft's Sysinternals Suite. An alternative is the psutil Python module, but I find 'handle' will print out more files in use.
Here is what I made. Kludgy code warning.
#!/bin/python3
# coding: utf-8
"""Build set of files that are in-use by processes.
Requires 'handle.exe' from Microsoft SysInternals Suite.
This seems to give a more complete list than using the psutil module.
"""
from collections import OrderedDict
import os
import re
import subprocess
# Path to handle executable
handle = "E:/Installers and ZIPs/Utility/Sysinternalssuite/handle.exe"
# Get output string from 'handle'
handle_str = subprocess.check_output([handle]).decode(encoding='ASCII')
""" Build list of lists.
1. Split string output, using '-' * 78 as section breaks.
2. Ignore first section, because it is executable version info.
3. Turn list of strings into a list of lists, ignoring first item (it's empty).
"""
work_list = [x.splitlines()[1:] for x in handle_str.split(sep='-' * 78)[1:]]
""" Build OrderedDict of pid information.
pid_dict['pid_num'] = ['pid_name','open_file_1','open_file_2', ...]
"""
pid_dict = OrderedDict()
re1 = re.compile("(.*?\.exe) pid: ([0-9]+)") # pid name, pid number
re2 = re.compile(".*File.*\s\s\s(.*)") # File name
for x_list in work_list:
key = ''
file_values = []
m1 = re1.match(x_list[0])
if m1:
key = m1.group(2)
# file_values.append(m1.group(1)) # pid name first item in list
for y_strings in x_list:
m2 = re2.match(y_strings)
if m2:
file_values.append(m2.group(1))
pid_dict[key] = file_values
# Make a set of all the open files
values = []
for v in pid_dict.values():
values.extend(v)
files_open = sorted(set(values))
txt_file = os.path.join(os.getenv('TEMP'), 'lsof_handle_files')
with open(txt_file, 'w') as fd:
for a in sorted(files_open):
fd.write(a + '\n')
subprocess.call(['notepad', txt_file])
os.remove(txt_file)
I'd guess that you are leaking file descriptors. You probably want to look through your code to make sure that you are closing all of the files that you open.
You can use the following script. It builds on Claudiu's answer. It addresses some of the issues and adds additional features:
Prints a stack trace of where the file was opened
Prints on program exit
Keyword argument support
Here's the code and a link to the gist, which is possibly more up to date.
"""
Collect stacktraces of where files are opened, and prints them out before the
program exits.
Example
========
monitor.py
----------
from filemonitor import FileMonitor
FileMonitor().patch()
f = open('/bin/ls')
# end of monitor.py
$ python monitor.py
----------------------------------------------------------------------------
path = /bin/ls
> File "monitor.py", line 3, in <module>
> f = open('/bin/ls')
----------------------------------------------------------------------------
Solution modified from:
https://stackoverflow.com/questions/2023608/check-what-files-are-open-in-python
"""
from __future__ import print_function
import __builtin__
import traceback
import atexit
import textwrap
class FileMonitor(object):
def __init__(self, print_only_open=True):
self.openfiles = []
self.oldfile = __builtin__.file
self.oldopen = __builtin__.open
self.do_print_only_open = print_only_open
self.in_use = False
class File(self.oldfile):
def __init__(this, *args, **kwargs):
path = args[0]
self.oldfile.__init__(this, *args, **kwargs)
if self.in_use:
return
self.in_use = True
self.openfiles.append((this, path, this._stack_trace()))
self.in_use = False
def close(this):
self.oldfile.close(this)
def _stack_trace(this):
try:
raise RuntimeError()
except RuntimeError as e:
stack = traceback.extract_stack()[:-2]
return traceback.format_list(stack)
self.File = File
def patch(self):
__builtin__.file = self.File
__builtin__.open = self.File
atexit.register(self.exit_handler)
def unpatch(self):
__builtin__.file = self.oldfile
__builtin__.open = self.oldopen
def exit_handler(self):
indent = ' > '
terminal_width = 80
for file, path, trace in self.openfiles:
if file.closed and self.do_print_only_open:
continue
print("-" * terminal_width)
print(" {} = {}".format('path', path))
lines = ''.join(trace).splitlines()
_updated_lines = []
for l in lines:
ul = textwrap.fill(l,
initial_indent=indent,
subsequent_indent=indent,
width=terminal_width)
_updated_lines.append(ul)
lines = _updated_lines
print('\n'.join(lines))
print("-" * terminal_width)
print()
This is a question regarding Unix shell scripting (any shell), but any other "standard" scripting language solution would also be appreciated:
I have a directory full of files where the filenames are hash values like this:
fd73d0cf8ee68073dce270cf7e770b97
fec8047a9186fdcc98fdbfc0ea6075ee
These files have different original file types such as png, zip, doc, pdf etc.
Can anybody provide a script that would rename the files so they get their appropriate file extension, probably based on the output of the file command?
Answer:
J.F. Sebastian's script will work for both ouput of the filenames as well as the actual renaming.
Here's mimetypes' version:
#!/usr/bin/env python
"""It is a `filename -> filename.ext` filter.
`ext` is mime-based.
"""
import fileinput
import mimetypes
import os
import sys
from subprocess import Popen, PIPE
if len(sys.argv) > 1 and sys.argv[1] == '--rename':
do_rename = True
del sys.argv[1]
else:
do_rename = False
for filename in (line.rstrip() for line in fileinput.input()):
output, _ = Popen(['file', '-bi', filename], stdout=PIPE).communicate()
mime = output.split(';', 1)[0].lower().strip()
ext = mimetypes.guess_extension(mime, strict=False)
if ext is None:
ext = os.path.extsep + 'undefined'
filename_ext = filename + ext
print filename_ext
if do_rename:
os.rename(filename, filename_ext)
Example:
$ ls *.file? | python add-ext.py --rename
avi.file.avi
djvu.file.undefined
doc.file.dot
gif.file.gif
html.file.html
ico.file.obj
jpg.file.jpe
m3u.file.ksh
mp3.file.mp3
mpg.file.m1v
pdf.file.pdf
pdf.file2.pdf
pdf.file3.pdf
png.file.png
tar.bz2.file.undefined
Following #Phil H's response that follows #csl' response:
#!/usr/bin/env python
"""It is a `filename -> filename.ext` filter.
`ext` is mime-based.
"""
# Mapping of mime-types to extensions is taken form here:
# http://as3corelib.googlecode.com/svn/trunk/src/com/adobe/net/MimeTypeMap.as
mime2exts_list = [
["application/andrew-inset","ez"],
["application/atom+xml","atom"],
["application/mac-binhex40","hqx"],
["application/mac-compactpro","cpt"],
["application/mathml+xml","mathml"],
["application/msword","doc"],
["application/octet-stream","bin","dms","lha","lzh","exe","class","so","dll","dmg"],
["application/oda","oda"],
["application/ogg","ogg"],
["application/pdf","pdf"],
["application/postscript","ai","eps","ps"],
["application/rdf+xml","rdf"],
["application/smil","smi","smil"],
["application/srgs","gram"],
["application/srgs+xml","grxml"],
["application/vnd.adobe.apollo-application-installer-package+zip","air"],
["application/vnd.mif","mif"],
["application/vnd.mozilla.xul+xml","xul"],
["application/vnd.ms-excel","xls"],
["application/vnd.ms-powerpoint","ppt"],
["application/vnd.rn-realmedia","rm"],
["application/vnd.wap.wbxml","wbxml"],
["application/vnd.wap.wmlc","wmlc"],
["application/vnd.wap.wmlscriptc","wmlsc"],
["application/voicexml+xml","vxml"],
["application/x-bcpio","bcpio"],
["application/x-cdlink","vcd"],
["application/x-chess-pgn","pgn"],
["application/x-cpio","cpio"],
["application/x-csh","csh"],
["application/x-director","dcr","dir","dxr"],
["application/x-dvi","dvi"],
["application/x-futuresplash","spl"],
["application/x-gtar","gtar"],
["application/x-hdf","hdf"],
["application/x-javascript","js"],
["application/x-koan","skp","skd","skt","skm"],
["application/x-latex","latex"],
["application/x-netcdf","nc","cdf"],
["application/x-sh","sh"],
["application/x-shar","shar"],
["application/x-shockwave-flash","swf"],
["application/x-stuffit","sit"],
["application/x-sv4cpio","sv4cpio"],
["application/x-sv4crc","sv4crc"],
["application/x-tar","tar"],
["application/x-tcl","tcl"],
["application/x-tex","tex"],
["application/x-texinfo","texinfo","texi"],
["application/x-troff","t","tr","roff"],
["application/x-troff-man","man"],
["application/x-troff-me","me"],
["application/x-troff-ms","ms"],
["application/x-ustar","ustar"],
["application/x-wais-source","src"],
["application/xhtml+xml","xhtml","xht"],
["application/xml","xml","xsl"],
["application/xml-dtd","dtd"],
["application/xslt+xml","xslt"],
["application/zip","zip"],
["audio/basic","au","snd"],
["audio/midi","mid","midi","kar"],
["audio/mpeg","mp3","mpga","mp2"],
["audio/x-aiff","aif","aiff","aifc"],
["audio/x-mpegurl","m3u"],
["audio/x-pn-realaudio","ram","ra"],
["audio/x-wav","wav"],
["chemical/x-pdb","pdb"],
["chemical/x-xyz","xyz"],
["image/bmp","bmp"],
["image/cgm","cgm"],
["image/gif","gif"],
["image/ief","ief"],
["image/jpeg","jpg","jpeg","jpe"],
["image/png","png"],
["image/svg+xml","svg"],
["image/tiff","tiff","tif"],
["image/vnd.djvu","djvu","djv"],
["image/vnd.wap.wbmp","wbmp"],
["image/x-cmu-raster","ras"],
["image/x-icon","ico"],
["image/x-portable-anymap","pnm"],
["image/x-portable-bitmap","pbm"],
["image/x-portable-graymap","pgm"],
["image/x-portable-pixmap","ppm"],
["image/x-rgb","rgb"],
["image/x-xbitmap","xbm"],
["image/x-xpixmap","xpm"],
["image/x-xwindowdump","xwd"],
["model/iges","igs","iges"],
["model/mesh","msh","mesh","silo"],
["model/vrml","wrl","vrml"],
["text/calendar","ics","ifb"],
["text/css","css"],
["text/html","html","htm"],
["text/plain","txt","asc"],
["text/richtext","rtx"],
["text/rtf","rtf"],
["text/sgml","sgml","sgm"],
["text/tab-separated-values","tsv"],
["text/vnd.wap.wml","wml"],
["text/vnd.wap.wmlscript","wmls"],
["text/x-setext","etx"],
["video/mpeg","mpg","mpeg","mpe"],
["video/quicktime","mov","qt"],
["video/vnd.mpegurl","m4u","mxu"],
["video/x-flv","flv"],
["video/x-msvideo","avi"],
["video/x-sgi-movie","movie"],
["x-conference/x-cooltalk","ice"]]
#NOTE: take only the first extension
mime2ext = dict(x[:2] for x in mime2exts_list)
if __name__ == '__main__':
import fileinput, os.path
from subprocess import Popen, PIPE
for filename in (line.rstrip() for line in fileinput.input()):
output, _ = Popen(['file', '-bi', filename], stdout=PIPE).communicate()
mime = output.split(';', 1)[0].lower().strip()
print filename + os.path.extsep + mime2ext.get(mime, 'undefined')
Here's a snippet for old python's versions (not tested):
#NOTE: take only the first extension
mime2ext = {}
for x in mime2exts_list:
mime2ext[x[0]] = x[1]
if __name__ == '__main__':
import os
import sys
# this version supports only stdin (part of fileinput.input() functionality)
lines = sys.stdin.read().split('\n')
for line in lines:
filename = line.rstrip()
output = os.popen('file -bi ' + filename).read()
mime = output.split(';')[0].lower().strip()
try: ext = mime2ext[mime]
except KeyError:
ext = 'undefined'
print filename + '.' + ext
It should work on Python 2.3.5 (I guess).
You can use
file -i filename
to get a MIME-type. You could potentially lookup the type in a list and then append an extension. You can find a list of MIME-types and example file extensions on the net.
Following csl's response:
You can use
file -i filename
to get a MIME-type.
You could potentially lookup the type
in a list and then append an
extension. You can find list of
MIME-types and suggested file
extensions on the net.
I'd suggest you write a script that takes the output of file -i filename, and returns an extension (split on spaces, find the '/', look up that term in a table file) in your language of choice - a few lines at most. Then you can do something like:
ls | while read f; do mv "$f" "$f".`file -i "$f" | get_extension.py`; done
in bash, or throw that in a bash script. Or make the get_extension script bigger, but that makes it less useful next time you want the relevant extension.
Edit: change from for f in * to ls | while read f because the latter handles filenames with spaces in (a particular nightmare on Windows).
Of course, it should be added that deciding on a MIME type just based on file(1) output can be very inaccurate/vague (what's "data" ?) or even completely incorrect...
Agreeing with Keltia, and elaborating some on his answer:
Take care -- some filetypes may be problematic.
JPEG2000, for example.
And others might return too much info given the "file" command without any option tags. The way to avoid this is to use "file -b" for a brief return of information.BZT