Reading srt (subtitle) files with Python3

I wish to be able to read an srt file with python3.
These files can be found here:
http://www.opensubtitles.org/
With info here:
http://en.wikipedia.org/wiki/SubRip
SubRip supports any encoding: ASCII or Unicode, for example.
If I understand correctly, I need to specify which decoder to use when I call Python's read function. So am I right in saying that I need to know how the file is encoded in order to make that choice? If so, how do I establish that for each file when I have a hundred such files from different sources and in different languages?
Ultimately I would prefer to convert the files so that they are all UTF-8 encoded to start with, but some of them might use some obscure encoding for all I know.
Please help,
Barry

You could use the charade package (formerly chardet) to detect the encoding.
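A minimal sketch of how that detection might look, assuming the chardet package (or its charade fork) is installed; the file name is a placeholder, and detect() only returns a guess with a confidence score, not a guarantee:
import chardet

with open('movie.srt', 'rb') as f:  # hypothetical file name
    raw = f.read()

guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'utf-8')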

You can check for a byte order mark (BOM) at the start of each .srt file to test for the encoding. However, this won't work for all files, since the BOM is optional and only used with the UTF encodings anyway. A check can be performed like this:
testStr = b'\xff\xfeOtherdata'
if testStr[0:2] == b'\xff\xfe':
    print('UTF-16 Little Endian')
elif testStr[0:2] == b'\xfe\xff':
    print('UTF-16 Big Endian')
# ...
What you probably want to do is simply open your file, decode whatever you pull out of it into Unicode, work with the Unicode representation until you are ready to output, and then encode it back again. See this talk for more information and code samples that might be relevant.
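For example, here is a minimal sketch of that workflow for converting one file to UTF-8; the file names and the cp1252 source encoding are assumptions, so substitute whatever you detected:
with open('movie.srt', 'rb') as f:
    raw = f.read()

text = raw.decode('cp1252')  # decode once on the way in (assumed source encoding)

with open('movie.utf8.srt', 'wb') as f:
    f.write(text.encode('utf-8'))  # encode once on the way out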

There's also a decent library for handling SRT files:
https://pypi.python.org/pypi/pysrt
You can specify the encoding when opening and writing SRT files.
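For instance, a short sketch based on pysrt's documented open()/save() interface; the file names and the source encoding are placeholders:
import pysrt

subs = pysrt.open('movie.srt', encoding='iso-8859-1')
subs.save('movie.utf8.srt', encoding='utf-8')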

Related

Python - pdfme - writing utf-8 characters to file

I would like to generate a PDF report using the pdfme library. I need the Polish characters to be there as well. The example report ends with:
with open('document.pdf', 'wb') as f:
    build_pdf(document, f)
So I cannot add encoding = "utf-8". Is there any way I can still use Polish characters?
I tried:
Changing to write mode and setting encoding to utf-8. Getting: "TypeError: write() argument must be str, not bytes".
Adding .encode("utf-8") to the Polish strings, for example "Paweł".encode("utf-8"). Getting: "TypeError: value of . attr must be of type str, list or tuple: b'Pawe\xc5\x82'"
In this case, the part of the code responsible for dealing with the Unicode characters is the PDF library. The build_pdf call, for whatever library it is, has to be able to handle any character in "document". If it fails, it is the code around the "build_pdf" call, inside the PDF library, that has to be changed so that it handles all the characters you need.
"utf-8" is just one way of expressing characters as bytes. A PDF file is a binary file, and it has its own internal headers, structures and settings for character encoding: your text may end up inside the PDF encoded as UTF-8, or as some other, legacy encoding, but that is transparent to you and to anyone using the PDF file.
It may be that the document is text (we don't know whether it is plain text or some object from your library that has already been pre-processed). If it is text, and your library says that build_pdf can accept bytes instead, you could encode the document prior to the call: build_pdf(document.encode('utf-8'), f). But that would be a strange way of working; it is likely that either build_pdf does the encoding itself, or whatever process generated the document had already done so.
To get more meaningful help, you have to say which library you are using to generate the PDF and include the import lines in your code, including the creation of your document, so that we have a minimal reproducible example: i.e. I can copy your code, paste it into a .py file here, install the lib, run it, and see a corrupted PDF file with the Polish characters mangled; then I, and others, will be able to fix it. Otherwise, this answer is as far as I can get.

I keep getting 'charmap' codec can't encode characters error when trying to save python script's output to clipboard or text file [duplicate]

I'm trying to scrape a website, but it gives me an error.
I'm using the following code:
import urllib.request
from bs4 import BeautifulSoup
get = urllib.request.urlopen("https://www.website.com/")
html = get.read()
soup = BeautifulSoup(html)
print(soup)
And I'm getting the following error:
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?
I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:
with open(fname, "w") as f:
f.write(html)
with this:
with open(fname, "w", encoding="utf-8") as f:
f.write(html)
If you need to support Python 2, then use this:
import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)
If you want to use a different encoding than UTF-8, specify whatever your actual encoding is for encoding.
I fixed it by adding .encode("utf-8") to soup.
That means that print(soup) becomes print(soup.encode("utf-8")).
In Python 3.7 running on Windows 10, this worked (I am not sure whether it will work on other platforms and/or other versions of Python):
Replacing this line:
with open('filename', 'w') as f:
With this:
with open('filename', 'w', encoding='utf-8') as f:
The reason this works is that the encoding is changed to UTF-8 when opening the file, so any UTF-8 character can be written out, instead of an error being raised whenever a character is encountered that is not supported by the default encoding.
Another option is to set these environment variables before running your script:
set PYTHONIOENCODING=utf-8
set PYTHONLEGACYWINDOWSSTDIO=utf-8
You may or may not need to set that second environment variable PYTHONLEGACYWINDOWSSTDIO.
Alternatively, this can be done in code (although it seems that doing it through env vars is recommended):
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
Additionally: Reproducing this error was a bit of a pain, so leaving this here too in case you need to reproduce it on your machine:
set PYTHONIOENCODING=windows-1252
set PYTHONLEGACYWINDOWSSTDIO=windows-1252
I got the same error on Python 3.7 on Windows 10 while saving the response of a GET request. The response received from the URL was UTF-8 encoded, so it is always recommended to check the response encoding and pass the same encoding when writing the file; it avoids a trivial issue that can kill a lot of time in production.
import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open('NiftyList.txt', 'w') as f:
    f.write(resp.text)
When I added encoding="utf-8" to the open call, it saved the file with the correct response:
with open('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)
I faced the same issue with the encoding, which occurs when you try to print the data, read/write it, or open it. As others mentioned above, adding .encode("utf-8") will help if you are trying to print it:
soup.encode("utf-8")
If you are trying to open scraped data and maybe write it into a file, then open the file with (......,encoding="utf-8")
with open(filename_csv , 'w', newline='',encoding="utf-8") as csv_file:
For those still getting this error, adding encode("utf-8") to soup will also fix this.
soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.
Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get a UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252" because this limited 8-bit character set has no way to represent these characters.
Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc etc), though some of them support some additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc). An 8-bit encoding has only a maximum of 256 character codes and no way to represent a character which isn't among them.
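A tiny demonstration of that failure mode (code page 1252 has no code points for Cyrillic):
try:
    'Привет'.encode('cp1252')
except UnicodeEncodeError as exc:
    print(exc)  # 'charmap' codec can't encode characters in position 0-5: character maps to <undefined>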
Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
On platforms where the terminal is not capable of printing Unicode (only Windows these days really, though if you're into retrocomputing, this problem was also prevalent on other platforms in the previous millennium) attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like Héllö instead of Héllö, this is your issue.
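A short illustration of how that particular mojibake arises (UTF-8 bytes interpreted as Windows code page 1252):
text = 'Héllö'
print(text.encode('utf-8').decode('cp1252'))  # prints 'HÃ©llÃ¶'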
In short, then, you need to know:
What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.
What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically?
Perhaps see also How to display utf-8 in windows console
If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly becoming the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and more recently Windows code page 1252.
Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.)
Perhaps see also https://utf8everywhere.org/
Naturally, if you are printing to a file, you also need to examine that file using a tool which can correctly display it. A common pilot error is to open the file using a tool which only displays the currently selected system encoding, or one which tries to guess the encoding, but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 would result, for example, in Héllö displaying as Héllö.
If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process with some guesswork involved. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)
To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
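A small sketch of that kind of lookup in code, trying the byte string from the example above against a few candidate legacy codecs:
raw = b'H\x8ell\x9a'
for codec in ('mac_roman', 'cp1252', 'latin-1'):
    try:
        print(codec, raw.decode(codec))
    except UnicodeDecodeError as exc:
        print(codec, 'failed:', exc)
# mac_roman yields 'Héllö', which matches the expected text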
Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252.
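If you are unsure what your setup defaults to, you can ask Python directly (a small sketch; output varies per machine):
import locale
import sys

print(locale.getpreferredencoding(False))  # what open() uses when no encoding= is given
print(sys.stdout.encoding)                 # what print() uses for the console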
(Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
If you are on Windows and write UTF-8 to a text file, maybe specify encoding="utf-8-sig" which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.
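For example (the file name is a placeholder):
with open('report.txt', 'w', encoding='utf-8-sig') as f:
    f.write('Héllö wörld\n')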
Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand how that's not generally the correct approach, and how to figure out - rather than guess - which encoding to use.
From Python 3.7 onwards, you can set the environment variable PYTHONUTF8 to 1.
The following script also sets other useful system environment variables:
setx /m PYTHONUTF8 1
setx PATHEXT "%PATHEXT%;.PY" ; In CMD, a Python file can then be executed without the extension.
setx /m PY_PYTHON 3.10 ; Sets the default Python version for py.
I got the same error, so I used encoding="utf-8" and it solved the error.
This generally happens when the text data contains some symbol or pattern that the encoder does not understand.
with open("text.txt", "w", encoding='utf-8') as f:
    f.write(data)
This should solve your problem.
If you are using Windows, try passing encoding='latin1', encoding='iso-8859-1' or encoding='cp1252'.
Example:
csv_data = pd.read_csv(csvpath, encoding='iso-8859-1')
print(soup.encode('iso-8859-1'))

Which encoding is in use by csv.DictReader when reading csv?

I have a csv file saved encoded as UTF-8.
It contains non-ascii chars [umlauts].
I am reading the file using:
csv.DictReader(<file>,delimiter=<delimiter>).
My questions are:
In which encoding is the file being read?
I noticed that in order to refer to the strings as utf-8 I need to perform:
str.decode('utf-8')
Is there a better approach than reading the file in one encoding and then converting it to another, i.e. utf-8?
[Python version: 2.7]
In Python 2.7, the csv module does not apply any decoding; it opens the file in binary mode and returns byte strings.
Use https://github.com/jdunck/python-unicodecsv, which decodes on the fly.
Use it like:
with open("myfile.csv", 'rb') as my_file:
r = unicodecsv.DictReader(my_file, encoding='utf-8')
r will yield dicts of unicode strings. It's important that the source file is opened in binary mode.
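As a side note, if you can move to Python 3, the standard csv module reads text directly and you pass the encoding to open() instead; a minimal sketch with a placeholder file name:
import csv

with open('myfile.csv', newline='', encoding='utf-8') as my_file:
    for row in csv.DictReader(my_file):
        print(row)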
How about using instances and classes in order to achieve this?
You can store the shared dictionary at the class level and also make it load Unicode text files, and even detect their encoding, with or without use of BOM file masks.
A long time ago I wrote a simple library which overrides the default open() with one that is Unicode aware.
If you do import tendo.unicode, you will change the way the csv library loads files too.
If your files do not have a BOM header, the library will assume UTF-8 instead of the old ASCII. You can even specify another fallback encoding if you want.

Python open("x", "r") function, how do I know or control which encoding the file is supposed to have?

If a python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?
Note that since I'm executing this script from my own program, if there is any way to control this through environment variables, then that is good enough for me.
This is Python 2.7 by the way.
The code in question comes from Mercurial; it can be given a list of files to, say, add to the repository through a file on disk, instead of passing them on the command line.
So basically, instead of this:
hg add A B C
I can write out A, B and C to a file, with newlines between each, and then execute the following:
hg add listfile:input.txt
The code that ends up reading this file is this:
files = open(name, 'r').read().split(delimiter)
Hence my question. The answer I was given on IRC when I asked which encoding I should use was this:
it is the same encoding than the one you use on command line when passing a file argument
I take this to mean that it is the same encoding I "use" when I execute Mercurial (hg). Since I have no idea which encoding that is (I just hand everything to the .NET Process object), I am asking here.
You can't. Reading a file is independent of its encoding; you'll need to know the encoding in advance in order to properly interpret the bytes you read in.
For example, if you know the file is encoded in UTF-8:
with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig deals with BOM, if present
Or if you know the file is ASCII only:
with open('filename', 'r') as f:
    contents = f.read()  # results in a str object
If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
UPDATE:
I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)
The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but unlikely I think). So you'll want to make a text file that contains only ASCII (codepoint < 128) characters, and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate considering that Mercurial deals with filenames, which can contain Unicode characters.
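For example, a minimal sketch of writing such a listfile (the file names are hypothetical); binary mode plus an explicit ASCII encode keeps the bytes unambiguous for the open(name, 'r').read().split(delimiter) call shown above:
names = ['A', 'B', 'C']
with open('input.txt', 'wb') as f:
    f.write('\n'.join(names).encode('ascii'))
Then run: hg add listfile:input.txt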

Can seek and tell work with UTF-8 encoded documents in Python?

I have an application that generates some large log files > 500MB.
I have written some utilities in Python that allows me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load it all into memory.
I thus want to scan the document once, build an index, and then load into memory only the section of the document that I want to look at.
This works for me when I open a 'file', read it one line at a time, and store the offset with file.tell().
I can then come back to that section of the file later with file.seek(offset, 0).
My problem is however that I may have UTF-8 in the log files so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell but they do not match up.
I assume that codecs needs to do some buffering or maybe it returns character counts instead of bytes from tell?
Is there a way around this?
If true, this sounds like a bug or limitation of the codecs module, as it's probably confusing byte and character offsets.
I would use the regular open() function for opening the file, then seek()/tell() will give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').
Beware though, that using the f.read() function can land you in the middle of a multi-byte character, thus producing an UTF-8 decode error. readline() will always work.
This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.
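A minimal sketch of that index-then-seek approach (assuming a UTF-8 log and a line-based index; the file name is a placeholder; binary mode keeps seek()/tell() in bytes):
offsets = []
with open('app.log', 'rb') as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets.append(pos)

with open('app.log', 'rb') as f:
    f.seek(offsets[42])                      # jump straight to the 43rd line
    print(f.readline().decode('utf-8'))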
For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (invoking the .decode method on the string). Breaking the file at line boundaries is safe; the only unsafe way to split it would be in the middle of a multi-byte character (which you can recognize from its byte value of 128 or greater).
Much of what goes on with UTF-8 in Python makes sense if you look at how it is done in Python 3. In your case, it will make quite a bit more sense if you read the Files chapter in Dive into Python 3: http://diveintopython3.org/files.html
The short of it, though, is that file.seek and file.tell work with byte positions, whereas unicode characters can take up multiple bytes. Thus, if you do:
f.seek(10)
f.read(1)
f.tell()
You can easily get something other than 11, depending on the byte length of the one character you read.
Update: You can't do seek/tell on the object returned by codecs.open(). You need to use a normal file, and decode the strings to unicode after reading.
I do not know why it doesn't work, but I can't make it work. The seek seems to work only once, for example; then you need to close and reopen the file, which is of course not useful.
The tell does not use character positions, but it also doesn't show you where your position in the stream is (probably it shows where the underlying file object is in reading from disk).
So, probably because of some sort of underlying buffering, you can't do it. But decoding after reading works just fine, so go for that.
