Translate Java file compression to Python3 with GZIP - python

I need to compress a file into a specific format that is required by our country's tax regulation entity and it has to be sent encoded in base64.
I work in Python 3 and attempted the compression with the following code:
import base64
import gzip
# Work file generated earlier and stored in a bytes buffer
my_file = bytes_buffer.getvalue()
def compress(work_file):
    encoded_work_file = base64.b64encode(work_file)
    compressed_work_file = gzip.compress(encoded_work_file)
    return base64.b64encode(compressed_work_file)
compress(my_file)
Now the tax entity returns an error message about an unknown compression format.
Luckily, they provided us the following Java example code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
public class DemoGZIP {
    private final byte[] BUFFER = new byte[1024];
    /**
     * @param work_file File to compress
     * The file is compressed over the original file name with the extension .zip
     * @return boolean
     * TRUE success
     * FALSE failure
     */
    public boolean compress(File work_file) {
        try (GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream(work_file.getAbsolutePath() + ".zip"));
             FileInputStream in = new FileInputStream(work_file)) {
            int len;
            while ((len = in.read(BUFFER)) != -1) {
                out.write(BUFFER, 0, len);
            }
            out.close();
        } catch (IOException ex) {
            System.err.println(ex.getMessage());
            return false;
        }
        return true;
    }
}
The problem is that I do not have any experience working on Java and do not understand much of the provided code.
Can someone please help me adapt my code to do what the provided code does in python?

As noted in the comment, the Java code does not do Base64 coding, and names the resulting file incorrectly. It is most definitely not a zip file, it is a gzip file. The suffix should be ".gz". Though I doubt that the name matters to your tax agency.
More importantly, you are encoding with Base64 twice. From your description, you should only do that once, after gzip compression. From the Java code, you shouldn't do Base64 encoding at all! You need to get clarification on that.
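A minimal sketch of the "gzip first, then Base64 once" order described above. The function name compress_for_upload is made up for illustration, and whether the final Base64 step is wanted at all is an assumption taken from the question's description, since the Java sample does not do it:

```python
import base64
import gzip


def compress_for_upload(raw: bytes) -> bytes:
    """Gzip the raw file bytes, then Base64-encode the result once."""
    # gzip.compress produces a gzip stream, the same container format
    # that Java's GZIPOutputStream writes.
    compressed = gzip.compress(raw)
    # Assumption: the receiver expects the gzip bytes wrapped in Base64.
    # Drop this step if the raw .gz bytes are expected instead.
    return base64.b64encode(compressed)


if __name__ == "__main__":
    sample = b"example work file contents\n" * 10
    encoded = compress_for_upload(sample)
    # Round trip to confirm nothing was lost.
    print(gzip.decompress(base64.b64decode(encoded)) == sample)
```

Decoding the output with base64.b64decode should give bytes that start with the gzip magic number 0x1f 0x8b, which is a quick way to check that the payload really is gzip data.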

Related

tabula-py can't read file when the python script called by java

I'm working on a Java-based project. The Java program runs a command that calls a Python script.
The Python script uses tabula-py to read a PDF file and return the data.
The script works when I call it directly in a terminal (python3 xxx.py).
However, when I call the script from Java, it throws this error:
Error from tabula-java:Error: File does not exist
Command '['java', '-Dfile.encoding=UTF8', '-jar', '/home/ubuntu/.local/lib/python3.8/site-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--lattice', '--guess', '--format', 'JSON', '/home/ubuntu/Documents/xxxx.pdf']' returned non-zero exit status 1.
I tried calling the script by its full path, passing the PDF file by its full path, and adding the script path with sys.path.append; none of these worked.
I also tried calling tabula directly with the java command, i.e. java -Dfile.encoding=UTF8 -jar /home/ubuntu/.local/lib/python3.8/site-packages/tabula/tabula-1.0.5-jar-with-dependencies.jar "file_path"
That works and can read the file. However, calling the Python script from Java still does not work.
Is there any way to solve this? Using tabula directly in the Java program is not an option in my case.
Since you mention that you use Java for the base code and Python only for reading the PDF, it would be better to use Java for everything. Why? Because ready-made tools already exist, so there is no need to struggle with linking one language to another.
code:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
/**
 * This class is used to read an existing
 * pdf file using iText jar.
 */
public class PDFReadExample {
    public static void main(String args[]) {
        try {
            // Create PdfReader instance.
            PdfReader pdfReader = new PdfReader("D:\\testFile.pdf");
            // Get the number of pages in pdf.
            int pages = pdfReader.getNumberOfPages();
            // Iterate the pdf through pages.
            for (int i = 1; i <= pages; i++) {
                // Extract the page content using PdfTextExtractor.
                String pageContent =
                    PdfTextExtractor.getTextFromPage(pdfReader, i);
                // Print the page content on console.
                System.out.println("Content on Page "
                    + i + ": " + pageContent);
            }
            // Close the PdfReader.
            pdfReader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Pass a pickle buffer from Node to Python

I have a Node application that subscribes to JSON data streams. I would like to extend this to subscribe to Python pickle data streams (I am willing to drop or convert non primitive types). The node-pickle & jpickle packages have failed me. I now wish to write my own Python script to convert pickles to JSON.
I fiddled with the node-pickle source code to get part of it to work (I can pass JSON from Node to Python and get back a pickle string, and I can also pass a predefined Python dict to Node as JSON). My problem is getting Python to recognize the data from Node as pickled data. I am passing the data stream buffer from Node to Python and trying desperately to get the string buffer argument into a format that I can pickle.loads.
After much trial and error I have ended up with this:
main.js
const pickle = require('node-pickle');
const amqp = require('amqplib/callback_api');
amqp.connect(`amqp://${usr}:${pwd}@${url}`, (err, conn) => {
  if (err) {
    console.error(err);
  }
  conn.createChannel((err, ch) => {
    if (err) {
      console.error(err);
    }
    ch.assertExchange(ex, 'fanout', { durable: false });
    ch.assertQueue('', {}, (err, q) => {
      ch.bindQueue(q.queue, ex, '');
      console.log('consuming');
      ch.consume(q.queue, msg => {
        console.log('Received [x]');
        const p = msg.content.toString('base64');
        pickle.loads(p).then(r => console.log('Res:', r));
        // conn.close();
      });
    });
  });
});
index.js (node-pickle)
const spawn = require('child_process').spawn,
  Bluebird = require('bluebird');
module.exports.loads = function loads(pickle) {
  return new Bluebird((resolve, reject) => {
    const convert = spawn('python', [__dirname + '/convert.py', '--loads']),
      stdout_buffer = [];
    convert.stdout.on('data', function(data) {
      stdout_buffer.push(data);
    });
    convert.on('exit', function(code) {
      const data = stdout_buffer.join('');
      // console.log('buffer toString', stdout_buffer[0] ? stdout_buffer[0].toString() : null);
      if (data == -1) {
        resolve(false);
      } else {
        let result;
        try {
          result = JSON.parse(data);
        } catch (err) {
          console.log('failed parse');
          result = false;
        }
        resolve(result);
      }
    });
    convert.stdin.write(pickle);
    convert.stdin.end();
  });
};
convert.py (node-pickle)
import sys
try:
    import simplejson as json
except ImportError:
    import json
try:
    import cPickle as pickle
except ImportError:
    import pickle
import codecs
import jsonpickle
def main(argv):
    try:
        if argv[0] == '--loads':
            buffer = sys.stdin.buffer.read()
            decoded = codecs.decode(buffer, 'base64')
            d = pickle.loads(decoded, encoding='latin1')
            j = jsonpickle.encode(d, False)
            sys.stdout.write(j)
        elif argv[0] == '--dumps':
            d = json.loads(argv[1])
            p = pickle.dumps(d)
            sys.stdout.write(str(p))
    except Exception as e:
        print('Error: ' + str(e))
        sys.stdout.write('-1')
if __name__ == '__main__':
    main(sys.argv[1:])
The error I come up against at the moment is:
invalid load key, '\xef'
EDIT 1:
I am now sending the buffer string representation, instead of the buffer, to Python. I then use stdin to read it in as bytes. I started writing the bytes object to a file to compare to the data received from Node, to the buffer received when I subscribe to the data stream from a Python script. I have found that they seem to be identical, apart from certain \x.. sequences found when subscribing from Python, being represented as \xef\xbf\xbd when subscribing from Node. I assume this has something to do with string encoding?? Some examples of the misrepresented sequences are: \x80 (this is the first sequence after the b'; however \x80 does appear elsewhere), \xe3, and \x85.
EDIT 2:
I have now encoded the string I'm sending to Python as base64, then, in Python, decoding the stdin buffer using codecs.decode. The buffer I'm writing to the file now looks more identical to the Python only stream, with no more \xef\xbf\xbd substitutions. However, I now come up against this error:
'ascii' codec can't decode byte 0xe3 in position 1: ordinal not in range(128)
Also, I found a slight difference when trying to match the last 1000 characters of each stream. There is a section in the Python stream (\x0c,'\x023) that looks like this (\x0c,\'\x023) in the stream from Node. Not sure how much that will affect things.
EDIT 3 (Success!):
After looking up my new error, I found the last piece of this encoding puzzle. Since I was working in Python 3 and the pickle came from Python 2.x, I needed to specify the encoding for pickle.loads as bytes or latin1 (the one I needed). I was then able to use the wonderful jsonpickle package to do the work of JSON-serializing the dict, changing datetime objects into date strings.
So I was able to get the node-pickle npm package to work. My flow of getting a buffer of pickled data from Node to Python to get back JSON is:
In Node:
1. Encode the buffer as a base64 string
2. Send the string to the Python child process via stdin, not as an argument
In Python:
3. Read in the buffer from stdin as bytes
4. Use codecs to decode it from base64
5. If using Python 3, specify bytes or latin1 encoding for pickle.loads
6. Use jsonpickle to serialize Python objects to JSON
In Node:
7. Collect the buffer from stdout and JSON.parse it
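The Python side of this flow can be sketched with the standard library alone. This is an illustration under assumptions, not the node-pickle code: the function name pickle_b64_to_json is made up, and json.dumps with default=str stands in for jsonpickle, so datetime values simply become their str() form:

```python
import base64
import json
import pickle


def pickle_b64_to_json(b64_payload: bytes) -> str:
    """Turn a base64-encoded pickle into a JSON string."""
    # Undo the base64 step that protected the bytes on their way through stdin.
    raw = base64.b64decode(b64_payload)
    # encoding="latin1" lets Python 3 load str objects pickled by Python 2.
    # Only unpickle data from a source you trust.
    obj = pickle.loads(raw, encoding="latin1")
    # Assumption: default=str is a stand-in for jsonpickle; anything json
    # cannot serialize natively is written as its str() form.
    return json.dumps(obj, default=str)


if __name__ == "__main__":
    demo = base64.b64encode(pickle.dumps({"price": 1.5, "tags": ["a", "b"]}, protocol=2))
    print(pickle_b64_to_json(demo))
```

In a child-process setup like the one above, the payload would come from sys.stdin.buffer.read() and the result would be written to sys.stdout.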

Create image from string sent from iOS device

I have an iOS mobile application that sends an encoded image to a Python3 server.
static func prepareImageAndUpload(imageView: UIImageView) -> String?
{
if let image: UIImage? = imageView.image {
// You create a NSData from your image
let imageData = UIImageJPEGRepresentation(imageView.image!, 0.5)
// You create a base64 string
let base64String = imageData!.base64EncodedStringWithOptions([])
// And you encode it in order to delete any problem of specials char
let encodeImg = base64String.stringByAddingPercentEncodingWithAllowedCharacters(.URLHostAllowedCharacterSet()) as String!
return encodeImg
}
return nil
}
And I am trying to receive that image using the following code:
import base64
# imgData is the string received from the iOS client
imageName = "imageToSave.jpg"
imgDataBytes = bytes(imgData, encoding="ascii")
imgDataBytesDecoded = base64.b64decode(imgDataBytes)
with open(imageName, "wb") as fh:
    fh.write(imgDataBytesDecoded)
I create the image file successfully and nothing breaks. And I can see that the filesize is correct, but the image is not correct, since it can't be opened and shows that it is broken.
I am not sure where the error can be, since the logic is as follows:
1. Encode image with base64 on iOS device
2. Send it
3. Decode image with base64 on Python3 server
4. Save image from decoded bytes
I have tried two new variants:
- Remove stringByAddingPercentEncodingWithAllowedCharacters: the result was the same
- Add urldecode in Python3 server: the result was the same
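The thread does not say what actually corrupted the image, so the following is only a sketch of a stricter server-side decode that makes such problems visible. The function name decode_uploaded_image is hypothetical, and the idea that "+" characters may have been turned into spaces in transit is an assumption, not something the question confirms:

```python
import base64
from urllib.parse import unquote


def decode_uploaded_image(img_data: str) -> bytes:
    """Undo percent-encoding, then strictly Base64-decode the payload."""
    # Reverse the percent-encoding applied on the client.
    cleaned = unquote(img_data)
    # Assumption: form decoding may have replaced "+" with a space.
    # Base64 never contains spaces, so mapping them back is safe.
    cleaned = cleaned.replace(" ", "+")
    # validate=True raises binascii.Error on any character outside the
    # Base64 alphabet instead of silently skipping it.
    return base64.b64decode(cleaned, validate=True)


def looks_like_jpeg(data: bytes) -> bool:
    """JPEG files start with the bytes FF D8 FF."""
    return data[:3] == b"\xff\xd8\xff"
```

Checking looks_like_jpeg on the decoded bytes before writing the file helps tell a transport problem apart from an image viewer problem.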

How to generate valid filename with unicode characters for download? [duplicate]

Web applications that want to force a resource to be downloaded rather than directly rendered in a Web browser issue a Content-Disposition header in the HTTP response of the form:
Content-Disposition: attachment; filename=FILENAME
The filename parameter can be used to suggest a name for the file into which the resource is downloaded by the browser. RFC 2183 (Content-Disposition), however, states in section 2.3 (The Filename Parameter) that the file name can only use US-ASCII characters:
Current [RFC 2045] grammar restricts
parameter values (and hence
Content-Disposition filenames) to
US-ASCII. We recognize the great
desirability of allowing arbitrary
character sets in filenames, but it is
beyond the scope of this document to
define the necessary mechanisms.
There is empirical evidence, nevertheless, that most popular Web browsers today seem to permit non-US-ASCII characters yet (for the lack of a standard) disagree on the encoding scheme and character set specification of the file name. Question is then, what are the various schemes and encodings employed by the popular browsers if the file name “naïvefile” (without quotes and where the third letter is U+00EF) needed to be encoded into the Content-Disposition header?
For the purpose of this question, popular browsers being:
Google Chrome
Safari
Internet Explorer or Edge
Firefox
Opera
I know this is an old post but it is still very relevant. I have found that modern browsers support RFC 5987, which allows UTF-8 file names in percent-encoded (URL-encoded) form. Then Naïve file.txt becomes:
Content-Disposition: attachment; filename*=UTF-8''Na%C3%AFve%20file.txt
Safari (5) does not support this. Instead you should use the Safari standard of writing the file name directly in your utf-8 encoded header:
Content-Disposition: attachment; filename=Naïve file.txt
IE8 and older don't support it either, and you need to use the IE convention of UTF-8 encoding with percent-encoding:
Content-Disposition: attachment; filename=Na%C3%AFve%20file.txt
In ASP.Net I use the following code:
string contentDisposition;
if (Request.Browser.Browser == "IE" && (Request.Browser.Version == "7.0" || Request.Browser.Version == "8.0"))
    contentDisposition = "attachment; filename=" + Uri.EscapeDataString(fileName);
else if (Request.Browser.Browser == "Safari")
    contentDisposition = "attachment; filename=" + fileName;
else
    contentDisposition = "attachment; filename*=UTF-8''" + Uri.EscapeDataString(fileName);
Response.AddHeader("Content-Disposition", contentDisposition);
I tested the above using IE7, IE8, IE9, Chrome 13, Opera 11, FF5, Safari 5.
Update November 2013:
Here is the code I currently use. I still have to support IE8, so I cannot get rid of the first part. It turns out that browsers on Android use the built in Android download manager and it cannot reliably parse file names in the standard way.
string contentDisposition;
if (Request.Browser.Browser == "IE" && (Request.Browser.Version == "7.0" || Request.Browser.Version == "8.0"))
    contentDisposition = "attachment; filename=" + Uri.EscapeDataString(fileName);
else if (Request.UserAgent != null && Request.UserAgent.ToLowerInvariant().Contains("android")) // android built-in download manager (all browsers on android)
    contentDisposition = "attachment; filename=\"" + MakeAndroidSafeFileName(fileName) + "\"";
else
    contentDisposition = "attachment; filename=\"" + fileName + "\"; filename*=UTF-8''" + Uri.EscapeDataString(fileName);
Response.AddHeader("Content-Disposition", contentDisposition);
The above now tested in IE7-11, Chrome 32, Opera 12, FF25, Safari 6, using this filename for download: 你好abcABCæøåÆØÅäöüïëêîâéíáóúýñ½§!#¤%&()=`#£$€{[]}+´¨^~'-_,;.txt
On IE7 it works for some characters but not all. But who cares about IE7 nowadays?
This is the function I use to generate safe file names for Android. Note that I don't know which characters are supported on Android but that I have tested that these work for sure:
private static readonly Dictionary<char, char> AndroidAllowedChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ._-+,#£$€!½§~'=()[]{}0123456789".ToDictionary(c => c);
private string MakeAndroidSafeFileName(string fileName)
{
    char[] newFileName = fileName.ToCharArray();
    for (int i = 0; i < newFileName.Length; i++)
    {
        if (!AndroidAllowedChars.ContainsKey(newFileName[i]))
            newFileName[i] = '_';
    }
    return new string(newFileName);
}
@TomZ: I tested in IE7 and IE8 and it turned out that I did not need to escape apostrophe ('). Do you have an example where it fails?
@Dave Van den Eynde: Combining the two file names on one line according to RFC 6266 works except for Android and IE7+8, and I have updated the code to reflect this. Thank you for the suggestion.
@Thilo: No idea about GoodReader or any other non-browser. You might have some luck using the Android approach.
@Alex Zhukovskiy: I don't know why but as discussed on Connect it doesn't seem to work terribly well.
There is no interoperable way to encode non-ASCII names in Content-Disposition. Browser compatibility is a mess.
The theoretically correct syntax for use of UTF-8 in Content-Disposition is very weird: filename*=UTF-8''foo%c3%a4 (yes, that's an asterisk, and no quotes except an empty single quote in the middle)
This header is kinda-not-quite-standard (HTTP/1.1 spec acknowledges its existence, but doesn't require clients to support it).
There is a simple and very robust alternative: use a URL that contains the filename you want.
When the name after the last slash is the one you want, you don't need any extra headers!
This trick works:
/real_script.php/fake_filename.doc
And if your server supports URL rewriting (e.g. mod_rewrite in Apache) then you can fully hide the script part.
Characters in URLs should be in UTF-8, urlencoded byte-by-byte:
/mot%C3%B6rhead # motörhead
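As a small illustration of "UTF-8, urlencoded byte-by-byte", Python's urllib.parse.quote produces the same form. The variable names are only for the example:

```python
from urllib.parse import quote, unquote

path = "/motörhead"
# quote() encodes the string as UTF-8 and percent-encodes each non-ASCII byte;
# "/" is left alone by default so path separators survive.
encoded_path = quote(path)
# unquote() reverses the process, decoding the bytes as UTF-8.
decoded_path = unquote(encoded_path)

print(encoded_path)  # /mot%C3%B6rhead
print(decoded_path)  # /motörhead
```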
There is discussion of this, including links to browser testing and backwards compatibility, in the proposed RFC 5987, "Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters."
RFC 2183 indicates that such headers should be encoded according to RFC 2184, which was obsoleted by RFC 2231, covered by the draft RFC above.
RFC 6266 describes the “Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP)”. Quoting from that:
6. Internationalization Considerations
The “filename*” parameter (Section 4.3), using the encoding defined
in [RFC5987], allows the server to transmit characters outside the
ISO-8859-1 character set, and also to optionally specify the language
in use.
And in their examples section:
This example is the same as the one above, but adding the "filename"
parameter for compatibility with user agents not implementing
RFC 5987:
Content-Disposition: attachment;
filename="EURO rates";
filename*=utf-8''%e2%82%ac%20rates
Note: Those user agents that do not support the RFC 5987 encoding
ignore “filename*” when it occurs after “filename”.
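The two-parameter form from the RFC 6266 example can be sketched in Python as follows. The function name content_disposition and the choice to replace unsupported characters with "?" or "_" in the plain fallback are assumptions for illustration and are not prescribed by the RFC:

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build an attachment header value with a plain and an RFC 5987 name."""
    # Plain fallback for user agents that ignore filename*: keep ASCII only,
    # and neutralize characters that would break the quoted string.
    fallback = filename.encode("ascii", "replace").decode("ascii")
    fallback = fallback.replace("\\", "_").replace('"', "_")
    # RFC 5987 form: UTF-8 bytes, percent-encoded, no quotes around the value.
    encoded = quote(filename, safe="")
    # "filename" comes first so that older agents pick it up, as in the RFC example.
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


print(content_disposition("Naïve file.txt"))
# attachment; filename="Na?ve file.txt"; filename*=UTF-8''Na%C3%AFve%20file.txt
```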
In Appendix D there is also a long list of suggestions to increase interoperability. It also points at a site which compares implementations. Current all-pass tests suitable for common file names include:
attwithisofnplain: plain ISO-8859-1 file name with double quotes and without encoding. This requires a file name which is all ISO-8859-1 and does not contain percent signs, at least not in front of hex digits.
attfnboth: two parameters in the order described above. Should work for most file names on most browsers, although IE8 will use the “filename” parameter.
That RFC 5987 in turn references RFC 2231, which describes the actual format. 2231 is primarily for mail, and 5987 tells us what parts may be used for HTTP headers as well. Don't confuse this with MIME headers used inside a multipart/form-data HTTP body, which is governed by RFC 2388 (section 4.4 in particular) and the HTML 5 draft.
The following document, linked from the draft RFC mentioned by Jim in his answer, further addresses the question and is definitely worth a direct note here:
Test Cases for HTTP Content-Disposition header and RFC 2231/2047 Encoding
Put the file name in double quotes. Solved the problem for me. Like this:
Content-Disposition: attachment; filename="My Report.doc"
http://kb.mozillazine.org/Filenames_with_spaces_are_truncated_upon_download
I've tested multiple options. Browsers do not support the specs and act differently, I believe double quotes is the best option.
I use the following code snippets for encoding (assuming fileName contains the filename and extension of the file, i.e.: test.txt):
PHP:
// strpos returns false when "MSIE" is absent, so compare against false explicitly
if ( strpos ( $_SERVER [ 'HTTP_USER_AGENT' ], "MSIE" ) !== false )
{
    header ( 'Content-Disposition: attachment; filename="' . rawurlencode ( $fileName ) . '"' );
}
else
{
    header( 'Content-Disposition: attachment; filename*=UTF-8\'\'' . rawurlencode ( $fileName ) );
}
Java:
fileName = request.getHeader ( "user-agent" ).contains ( "MSIE" ) ? URLEncoder.encode ( fileName, "utf-8") : MimeUtility.encodeWord ( fileName );
response.setHeader ( "Content-disposition", "attachment; filename=\"" + fileName + "\"");
In ASP.NET MVC 2 I use something like this:
return File(
tempFile
, "application/octet-stream"
, HttpUtility.UrlPathEncode(fileName)
);
I guess if you don't use mvc(2) you could just encode the filename using
HttpUtility.UrlPathEncode(fileName)
In ASP.NET Web API, I url encode the filename:
public static class HttpRequestMessageExtensions
{
    public static HttpResponseMessage CreateFileResponse(this HttpRequestMessage request, byte[] data, string filename, string mediaType)
    {
        HttpResponseMessage response = new HttpResponseMessage(HttpStatusCode.OK);
        var stream = new MemoryStream(data);
        stream.Position = 0;
        response.Content = new StreamContent(stream);
        response.Content.Headers.ContentType =
            new MediaTypeHeaderValue(mediaType);
        // URL-Encode filename
        // Fixes behavior in IE, that filenames with non US-ASCII characters
        // stay correct (not "_utf-8_.......=_=").
        var encodedFilename = HttpUtility.UrlEncode(filename, Encoding.UTF8);
        response.Content.Headers.ContentDisposition =
            new ContentDispositionHeaderValue("attachment") { FileName = encodedFilename };
        return response;
    }
}
In PHP this did it for me (assuming the filename is UTF8 encoded):
header('Content-Disposition: attachment;'
. 'filename="' . addslashes(utf8_decode($filename)) . '";'
. 'filename*=utf-8\'\'' . rawurlencode($filename));
Tested against IE8-11, Firefox and Chrome.
If the browser can interpret filename*=utf-8 it will use the UTF8 version of the filename, else it will use the decoded filename. If your filename contains characters that can't be represented in ISO-8859-1 you might want to consider using iconv instead.
Just an update, since I was trying all this stuff today in response to a customer issue.
With the exception of Safari configured for Japanese, all browsers our customer tested worked best with filename=text.pdf - where text is a customer value serialized by ASP.Net/IIS in utf-8 without url encoding. For some reason, Safari configured for English would accept and properly save a file with utf-8 Japanese name but that same browser configured for Japanese would save the file with the utf-8 chars uninterpreted. All other browsers tested seemed to work best/fine (regardless of language configuration) with the filename utf-8 encoded without url encoding.
I could not find a single browser implementing RFC 5987/8187 at all. I tested with the latest Chrome and Firefox builds plus IE 11 and Edge. I tried setting the header with just filename*=utf-8''texturlencoded.pdf, and with both filename=text.pdf; filename*=utf-8''texturlencoded.pdf. Not one feature of RFC 5987/8187 appeared to be processed correctly in any of the above.
If you are using a nodejs backend you can use the following code I found here
var fileName = 'my file(2).txt';
var header = "Content-Disposition: attachment; filename*=UTF-8''"
  + encodeRFC5987ValueChars(fileName);
function encodeRFC5987ValueChars (str) {
  return encodeURIComponent(str).
    // Note that although RFC3986 reserves "!", RFC5987 does not,
    // so we do not need to escape it
    replace(/['()]/g, escape). // i.e., %27 %28 %29
    replace(/\*/g, '%2A').
    // The following are not required for percent-encoding per RFC5987,
    // so we can allow for a little better readability over the wire: |`^
    replace(/%(?:7C|60|5E)/g, unescape);
}
I tested the following code in all major browsers, including older Explorers (via the compatibility mode), and it works well everywhere:
$filename = $_GET['file']; //this string from $_GET is already decoded
if (strstr($_SERVER['HTTP_USER_AGENT'],"MSIE"))
$filename = rawurlencode($filename);
header('Content-Disposition: attachment; filename="'.$filename.'"');
I ended up with the following code in my "download.php" script (based on this blogpost and these test cases).
$il1_filename = utf8_decode($filename);
$to_underscore = "\"\\#*;:|<>/?";
$safe_filename = strtr($il1_filename, $to_underscore, str_repeat("_", strlen($to_underscore)));
header("Content-Disposition: attachment; filename=\"$safe_filename\""
.( $safe_filename === $filename ? "" : "; filename*=UTF-8''".rawurlencode($filename) ));
This uses the standard way of filename="..." as long as there are only iso-latin1 and "safe" characters used; if not, it adds the filename*=UTF-8'' url-encoded way. According to this specific test case, it should work from MSIE9 up, and on recent FF, Chrome, Safari; on lower MSIE version, it should offer filename containing the ISO8859-1 version of the filename, with underscores on characters not in this encoding.
Final note: the max. size for each header field is 8190 bytes on apache. UTF-8 can be up to four bytes per character; after rawurlencode, it is x3 = 12 bytes per one character. Pretty inefficient, but it should still be theoretically possible to have more than 600 "smiles" %F0%9F%98%81 in the filename.
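The arithmetic in the final note can be checked with a few lines of Python. The 8190-byte limit is the figure quoted above; the calculation ignores the bytes taken by the header name and the filename*=UTF-8'' prefix, so the real ceiling is slightly lower:

```python
from urllib.parse import quote

APACHE_HEADER_FIELD_LIMIT = 8190  # figure quoted in the note above

emoji = "\U0001F601"  # one "smile", 4 bytes in UTF-8
encoded = quote(emoji, safe="")  # each byte becomes a 3-character %XX triplet
bytes_per_emoji = len(encoded)
max_emoji = APACHE_HEADER_FIELD_LIMIT // bytes_per_emoji

print(encoded)          # %F0%9F%98%81
print(bytes_per_emoji)  # 12
print(max_emoji)        # 682
```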
From .NET 4.5 (and Core 1.0) you can use ContentDispositionHeaderValue to do the formatting for you.
var fileName = "Naïve file.txt";
var h = new System.Net.Http.Headers.ContentDispositionHeaderValue("attachment");
h.FileNameStar = fileName;
h.FileName = "fallback-ascii-name.txt";
Response.Headers.Add("Content-Disposition", h.ToString());
h.ToString() Will result in:
attachment; filename*=utf-8''Na%C3%AFve%20file.txt; filename=fallback-ascii-name.txt
PHP framework Symfony 4 has $filenameFallback in HeaderUtils::makeDisposition.
You can look into this function for details - it is similar to the answers above.
Usage example:
$filenameFallback = preg_replace('#^.*\.#', md5($filename) . '.', $filename);
$disposition = $response->headers->makeDisposition(ResponseHeaderBag::DISPOSITION_ATTACHMENT, $filename, $filenameFallback);
$response->headers->set('Content-Disposition', $disposition);
For those who need a JavaScript way of encoding the header, I found that this function works well:
function createContentDispositionHeader(filename) {
  const encoded = encodeURIComponent(filename);
  return `attachment; filename*=UTF-8''${encoded}; filename="${encoded}"`;
}
This is based on what Nextcloud seems to be doing when downloading a file. The filename appears first as UTF-8 encoded, and possibly for compatibility with some browsers, the filename also appears without the UTF-8 prefix.
Classic ASP Solution
Most modern browsers now support passing the filename as UTF-8. However, the file upload solution I use, which is based on FreeASPUpload.Net (site no longer exists, link points to archive.org), could not handle it: its parsing of the binary data relied on reading single-byte ASCII-encoded strings, which worked fine with UTF-8 encoded data until it reached characters that ASCII doesn't support.
However I was able to find a solution to get the code to read and parse the binary as UTF-8.
Public Function BytesToString(bytes) 'UTF-8..
    Dim bslen
    Dim i, k , N
    Dim b , count
    Dim str
    bslen = LenB(bytes)
    str=""
    i = 0
    Do While i < bslen
        b = AscB(MidB(bytes,i+1,1))
        If (b And &HFC) = &HFC Then
            count = 6
            N = b And &H1
        ElseIf (b And &HF8) = &HF8 Then
            count = 5
            N = b And &H3
        ElseIf (b And &HF0) = &HF0 Then
            count = 4
            N = b And &H7
        ElseIf (b And &HE0) = &HE0 Then
            count = 3
            N = b And &HF
        ElseIf (b And &HC0) = &HC0 Then
            count = 2
            N = b And &H1F
        Else
            count = 1
            str = str & Chr(b)
        End If
        If i + count - 1 > bslen Then
            str = str&"?"
            Exit Do
        End If
        If count>1 then
            For k = 1 To count - 1
                b = AscB(MidB(bytes,i+k+1,1))
                N = N * &H40 + (b And &H3F)
            Next
            str = str & ChrW(N)
        End If
        i = i + count
    Loop
    BytesToString = str
End Function
Credit goes to Pure ASP File Upload: by implementing the BytesToString() function from include_aspuploader.asp in my own code, I was able to get UTF-8 filenames working.
Useful Links
Multipart/form-data and UTF-8 in a ASP Classic application
Unicode, UTF, ASCII, ANSI format differences
The method mimeHeaderEncode($string) from the library class Unicode does the job.
$file_name= Unicode::mimeHeaderEncode($file_name);
Example in drupal/php:
https://github.com/drupal/core-utility/blob/8.8.x/Unicode.php
/**
 * Encodes MIME/HTTP headers that contain incorrectly encoded characters.
 *
 * For example, Unicode::mimeHeaderEncode('tést.txt') returns
 * "=?UTF-8?B?dMOpc3QudHh0?=".
 *
 * See http://www.rfc-editor.org/rfc/rfc2047.txt for more information.
 *
 * Notes:
 * - Only encode strings that contain non-ASCII characters.
 * - We progressively cut-off a chunk with self::truncateBytes(). This ensures
 *   each chunk starts and ends on a character boundary.
 * - Using \n as the chunk separator may cause problems on some systems and
 *   may have to be changed to \r\n or \r.
 *
 * @param string $string
 *   The header to encode.
 * @param bool $shorten
 *   If TRUE, only return the first chunk of a multi-chunk encoded string.
 *
 * @return string
 *   The mime-encoded header.
 */
public static function mimeHeaderEncode($string, $shorten = FALSE) {
  if (preg_match('/[^\x20-\x7E]/', $string)) {
    // floor((75 - strlen("=?UTF-8?B??=")) * 0.75);
    $chunk_size = 47;
    $len = strlen($string);
    $output = '';
    while ($len > 0) {
      $chunk = static::truncateBytes($string, $chunk_size);
      $output .= ' =?UTF-8?B?' . base64_encode($chunk) . "?=\n";
      if ($shorten) {
        break;
      }
      $c = strlen($chunk);
      $string = substr($string, $c);
      $len -= $c;
    }
    return trim($output);
  }
  return $string;
}
We had a similar problem in a web application, and ended up reading the filename from the HTML <input type="file"> and setting it in URL-encoded form in a new HTML <input type="hidden">. We also had to remove the path prefix, such as "C:\fakepath\", that some browsers return.
This does not directly answer the OP's question, but it may be a solution for others.
I normally URL-encode (with %xx) the filenames, and it seems to work in all browsers. You might want to do some tests anyway.

Android , Read in binary data and write it to file

I'm trying to read an image file from a server with the code below. It keeps going into the exception. I know the correct number of bytes is being sent, as I print it out when received. I'm sending the image file from Python like so:
# open the image file and read it into an object
imgfile = open(marked_image, 'rb')
obj = imgfile.read()
# get the number of bytes in the image and convert it to a string
num_bytes = str(len(obj))
# send the number of bytes
self.conn.send(num_bytes + '\n')
# sendall returns None on success (this is Python 2 code)
if self.conn.sendall(obj) is None:
    imgfile.flush()
    imgfile.close()
    print 'Image Sent'
else:
    print 'Error'
Here is the Android part; this is where I'm having the problem. Any suggestions on the best way to receive the image and write it to a file?
// read the number of bytes in the image
String noOfBytes = in.readLine();
Toast.makeText(this, noOfBytes, 5).show();
byte bytes[] = new byte[Integer.parseInt(noOfBytes)];
// create a file to store the retrieved image
File photo = new File(Environment.getExternalStorageDirectory(), "PostKey.jpg");
DataInputStream dis = new DataInputStream(link.getInputStream());
try {
    os = new FileOutputStream(photo);
    byte buf[] = new byte[1024];
    int len;
    while ((len = dis.read(buf)) > 0)
        os.write(buf, 0, len);
    Toast.makeText(this, "File received", 5).show();
    os.close();
    dis.close();
} catch (IOException e) {
    Toast.makeText(this, "An IO Error Occurred", 5).show();
}
EDIT: I still can't seem to get it working. I have been at it since, and all my efforts have resulted either in a file that is not the full size or in the app crashing. I know the file is not corrupt before sending on the server side. As far as I can tell it is definitely being sent too, as the sendall method in Python either sends everything or throws an exception on error, and so far it has never thrown an exception. So the client side is the part that is broken. I have to send the file from the server, so I can't use the approach suggested by Brian.
The best way to get a bitmap from a server is to execute the following.
HttpClient client = new DefaultHttpClient();
HttpGet get = new HttpGet("http://yoururl");
HttpResponse response = client.execute(get);
InputStream is = response.getEntity().getContent();
Bitmap image = BitmapFactory.decodeStream(is);
That will give you your bitmap, to save it to a file do something like the following.
FileOutputStream fos = new FileOutputStream("yourfilename");
image.compress(CompressFormat.PNG, 1, fos);
fos.close();
You can also combine the two and just write straight to disk
HttpClient client = new DefaultHttpClient();
HttpGet get = new HttpGet("http://yoururl");
HttpResponse response = client.execute(get);
InputStream is = response.getEntity().getContent();
FileOutputStream fos = new FileOutputStream("yourfilename");
byte[] buffer = new byte[256];
int read = is.read(buffer);
while (read != -1) {
    fos.write(buffer, 0, read);
    read = is.read(buffer);
}
fos.close();
is.close();
Hope this helps.
I'm not sure I understand your code. You are calling dis.readFully(bytes); to write the content of dis into your byte array. But then you don't do anything with the array, and then try to write the content of dis through a buffer into your FileOutputStream.
Try commenting out the line dis.readFully(bytes);.
As a side note, I would write to the log rather than popping up a toast for things like the number of bytes or when an exception occurs:
...
} catch (IOException e) {
Log.e("MyTagName","Exception caught " + e.toString());
e.printStackTrace();
}
You could look at these links for examples of writing a file to the SD card:
Android download binary file problems
Android write to sd card folder
I solved it with the help of an Ubuntu Forums member. The reading of the byte count was the problem: it was cutting off some of the bytes from the image. The solution was to just send the image whole and remove the sending of the byte count from the equation altogether.
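For readers who do want to keep the length prefix, here is a sketch of a receiver that reads the header and then exactly that many payload bytes from the same socket, so nothing is swallowed by a separate buffered reader. It is written in Python rather than Android Java, the function names recv_exact and recv_framed are made up, and it assumes a blocking socket and the "decimal length, newline, payload" framing used by the sender above:

```python
import socket


def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = conn.recv(min(remaining, 4096))
        if not chunk:
            raise ConnectionError("socket closed with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_framed(conn: socket.socket) -> bytes:
    """Read an ASCII length line ending in a newline, then that many bytes."""
    header = bytearray()
    while True:
        # Read the header one byte at a time so no payload bytes are consumed.
        byte = recv_exact(conn, 1)
        if byte == b"\n":
            break
        header += byte
    return recv_exact(conn, int(header.decode("ascii")))
```

Reading the header byte by byte is slow but short, and it guarantees that the first payload byte is still waiting on the socket when the payload read begins.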
