I have a JSON file I am trying to read. The file includes the following text, which I am trying to decode with the code below.
Silikon-Ersatzpl\u00ef\u00bf\u00bdttchen Regenlichtsensor
import json

with open("file_name", encoding="utf-8") as file:
    pdf_labels = json.loads(file.read())
When I try to load it with the json module and specify UTF-8 encoding, I get some weird results.
"\u00ef\u00bf\u00bd" becomes "�" instead of the desired "ä".
The desired output should look like the following:
Silikon-Ersatzplättchen Regenlichtsensor
Please don't be harsh, this is my first question :)
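For what it's worth, the escape sequence itself can be inspected; a quick sketch (using a shortened version of the string) shows it decodes to U+FFFD, the Unicode replacement character, which suggests the "ä" was already lost before the JSON was written:

```python
import json

# \u00ef\u00bf\u00bd is the Latin-1 reading of the UTF-8 bytes EF BF BD,
# i.e. U+FFFD (the replacement character). json.loads is working correctly;
# the data was already corrupted when the file was produced.
s = json.loads('"Silikon-Ersatzpl\\u00ef\\u00bf\\u00bdttchen"')
print(s.encode('latin-1').decode('utf-8'))  # Silikon-Ersatzpl�ttchen
```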
I'd like to modify code that extracts columns from a (nested) JSON file inside a '.PBIX' file (Power BI tool) in Python. The details are below:
Original code to extract some columns written by Mr Umberto Grando:
Code on GitHub: https://github.com/Inzaniak/pybistuff/tree/master/pbixExtractor
Explanation of the code on Medium: https://python.plainenglish.io/extracting-measures-and-fields-from-a-power-bi-report-in-python-1b928d9fb128
I extracted the Layout file for your convenience in Google Drive: https://drive.google.com/drive/folders/1Z5cqgE-iuS0__G5hCl7Ge-MZW9WKmJu7?usp=sharing
Request:
I need to extract the 'visualType' field as well, in a similar format to the GitHub code above, for business documentation. Being able to extract parameters like this into a table automatically would save hours of documentation time.
I tried:
JS Beautify
json.dumps, json.loads, json_normalize, different encoding types, and adding ['visualTypes'] to the code, though I didn't know how to append the result. I also have trouble understanding the structure of the JSON file here.
Other attempts:
!cd "path/Layout"
import json

# Open the JSON file
f = open('path/Layout', encoding='utf-16-le')

# json.load returns the JSON object as a dictionary
data = json.load(f)

# Iterating over the dictionary yields its top-level keys
for i in data:
    print(i)
Output:
id sections pods config reportId resourcePackages layoutOptimization publicCustomVisuals
The main reason this was downvoted is that you did not post any code or errors from your attempts to solve it.
When I tried I got this error:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
and:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 9380: invalid start byte
You can most likely find lots of answers to this with Google. Byte errors from json.load() are most likely an encoding problem, so chardet will help you:
#!/usr/bin/env python3
import json
import chardet

raw = "C:\\Users\\mobj\\Downloads\\Layout"

# Detect the encoding from the raw bytes
with open(raw, 'rb') as FI:
    print(chardet.detect(FI.read()))

dat = {}
# encoding from chardet:
with open(raw, 'r', encoding='utf-16le', errors='ignore') as FI:
    dat = json.load(FI)

print(json.dumps(dat, indent=2, sort_keys=True))
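Once the Layout parses, the 'visualType' values can typically be pulled out by walking the nested structure. This is a sketch only; the 'sections' → 'visualContainers' → 'config' path and the 'singleVisual' key are assumptions based on typical Layout files, so adjust them to what your file actually contains:

```python
import json

def extract_visual_types(layout):
    """Yield (section name, visualType) pairs from a parsed Layout dict."""
    for section in layout.get('sections', []):
        for vc in section.get('visualContainers', []):
            # 'config' is itself a JSON-encoded string inside the Layout
            cfg = json.loads(vc.get('config', '{}'))
            vtype = cfg.get('singleVisual', {}).get('visualType')
            yield section.get('displayName'), vtype

# Fabricated minimal example of the assumed structure:
layout = {'sections': [{'displayName': 'Page 1', 'visualContainers': [
    {'config': json.dumps({'singleVisual': {'visualType': 'barChart'}})}]}]}
print(list(extract_visual_types(layout)))  # [('Page 1', 'barChart')]
```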
I am trying to Base64-encode a .docx file and decode/serve it on the frontend/UI in Streamlit. So far I know how to encode/decode strings with base64, but not a .docx file.
If any of you have code for how to achieve this, please share it here.
import base64
import streamlit as st
data = open('/home/lungsang/Desktop/streamlit-practice/content/A0/A0.02-vocab.docx', 'rb').read()
encoded = base64.b64encode(data)
decoded = base64.b64decode(encoded)
st.download_button('Download Here', decoded)
I used the above code but am not getting the desired result. Instead I got a collection of .xml files, as shown in the screenshot below.
The decoded document should look like this:
If you need the .docx file I am trying to encode/decode, here is the link: https://docs.google.com/document/d/10zkg1HLDHhZNh83i2tbJqBVMfIsdqW-3/edit
You need to add the file_name argument to the download_button function:
import base64
import streamlit as st
data = open("test.docx", "rb").read()
encoded = base64.b64encode(data)
decoded = base64.b64decode(encoded)
st.download_button('Download Here', decoded, "decoded_file.docx")
This is just encoding; you have to decode as well:
import base64

with open('YOUR DATA FILE', 'rb') as binary_file:
    binary_file_data = binary_file.read()
    base64_encoded_data = base64.b64encode(binary_file_data)
    base64_message = base64_encoded_data.decode('utf-8')
    print(base64_message)
Open the file using open('Your Data file', 'rb'). Note the 'rb' argument passed along with the file path: it tells Python we are reading a binary file. Without 'rb', Python would assume we are reading a text file.
Then use the read() method to get all the data in the file into the binary_file_data variable. Just as with strings, we Base64-encode the bytes with base64.b64encode and then call decode('utf-8') on base64_encoded_data to get the Base64-encoded data as human-readable characters.
Running the script (python3 encoding_binary.py) will print the Base64 string for the file.
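The reverse step, turning the Base64 text back into the original bytes, is the mirror image. A minimal sketch (the sample string and output file name here are made up):

```python
import base64

base64_message = 'aGVsbG8gd29ybGQ='  # stands in for your encoded data
decoded_bytes = base64.b64decode(base64_message)

# Write the recovered bytes back out as a binary file
with open('decoded_output.bin', 'wb') as out:
    out.write(decoded_bytes)
print(decoded_bytes)  # b'hello world'
```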
I want to decode a CSV file, but it gives the wrong data.
For example: the CSV file contains BP1-R241, but after decoding the file I get BP1+AC0-R241.
If the columns contain characters such as -, /, \, *, etc., a +AC0 is added.
How can I rectify this?
My code:
import base64
data = 'Y29kZSxxdWFudGl0eSxsb2NhdGlvbgoxMjM0NTY2NDMsMSxCUDErQUMwLVIyNDEKMTIzNDUsMixCUDErQUMwLVIyNDEKMTIzNDU2LDMsQlAxK0FDMC1SMjQxCnEyMzIzNDM1NDY1Niw0LEJQMStBQzAtUjI0MQpkc2Zkc2YsNSxCUDErQUMwLVIyNDEKMjMzNDU2LDYsQlAxK0FDMC1SMjQxCmRkZnNkZiw3LEJQMStBQzAtUjI0MQozNTQ2NzgsOCxCUDErQUMwLVIyNDEKMTIzNDU2Nyw5LEJQMStBQzAtUjI0MQoyMzQ1NjcsMTAsQlAxK0FDMC1SMjQxCml1NjU0MzIsMTEsQlAxK0FDMC1SMjQxCmpoZ2ZkLDEyLEJQMStBQzAtUjI0MQp4Y3ZmZ2JobiwxMyxCUDErQUMwLVIyNDEKY2ZjZ2hqaywxNCxCUDErQUMwLVIyNDEKc2RmZ2hqLDE1LEJQMStBQzAtUjI0MQphc2RmZ2hqLDE2LEJQMStBQzAtUjI0MQpzYWRmZ2hqaywxNyxCUDErQUMwLVIyNDEKc2RzZHNkc2QsMTgsQlAxK0FDMC1SMjQxCjExMjIzMzQ0LDE5LEJQMStBQzAtUjI0MQoxMTIyMzM0NDIsMjAsQlAxK0FDMC1SMjQxClRFU1QxMjMsMjEsQlAxK0FDMC1SMjQxCg=='
data = base64.b64decode(data).decode('utf-8')
Output:
code,quantity,location
123456643,1,BP1+AC0-R241
12345,2,BP1+AC0-R241
123456,3,BP1+AC0-R241
q23234354656,4,BP1+AC0-R241
dsfdsf,5,BP1+AC0-R241
233456,6,BP1+AC0-R241
ddfsdf,7,BP1+AC0-R241
354678,8,BP1+AC0-R241
1234567,9,BP1+AC0-R241
234567,10,BP1+AC0-R241
iu65432,11,BP1+AC0-R241
jhgfd,12,BP1+AC0-R241
xcvfgbhn,13,BP1+AC0-R241
cfcghjk,14,BP1+AC0-R241
sdfghj,15,BP1+AC0-R241
asdfghj,16,BP1+AC0-R241
sadfghjk,17,BP1+AC0-R241
sdsdsdsd,18,BP1+AC0-R241
11223344,19,BP1+AC0-R241
112233442,20,BP1+AC0-R241
TEST123,21,BP1+AC0-R241
The data you've pasted in simply contains BP1+AC0-R241; there's no way around it.
The problem is not in the decoding, it's in wherever you got that data from.
Googling "+AC0" leads me to this thread, specifically this:
The data in your file is encoded as UTF-7 (http://en.wikipedia.org/wiki/UTF-7), instead of the more usual ascii/latin-1 or UTF-8. Each of the +ACI- sequences encodes one double quote character.
Are you sure you've exported the file as UTF-8, not UTF-7?
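If re-exporting as UTF-8 isn't an option, the +AC0- sequences can also be undone on the Python side by decoding the bytes as UTF-7 instead of UTF-8. A sketch using one cell of the data; it assumes the whole payload really is UTF-7:

```python
import base64

encoded = 'QlAxK0FDMC1SMjQx'  # base64 of one cell, b'BP1+AC0-R241'
raw = base64.b64decode(encoded)

# In UTF-7, '+AC0-' is the escaped form of U+002D, a plain hyphen
print(raw.decode('utf-7'))  # BP1-R241
```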
I'm attempting to parse an XML file and print sections of the contents into a CSV file for manipulation with a program such as Microsoft Excel. The issue I'm running into is that the XML file contains multiple alphabets (Arabic, Cyrillic, etc.) and I'm getting confused over what encoding I should be using.
import csv
import xml.etree.ElementTree as ET
import os
file = 'example.xml'
csvf = open(os.path.splitext(file)[0] + '.csv', "w+", newline='')
csvw = csv.writer(csvf, delimiter=',')

root = ET.parse(file).getroot()
name_base = root.find("name")
name_base_string = ET.tostring(name_base, encoding="unicode", method="xml").strip()
csvw.writerow([name_base_string])
csvf.close()
I do not know what encoding to pass to the tostring() method. If I use 'unicode' it returns a Python str and all is well when writing to the CSV file, but Excel seems to handle it really improperly (all editors on Windows and Linux display the character sets correctly). If I use encoding 'UTF-8' the method returns a bytes object, which, if I pass it to the CSV writer without decoding, gives me the string b'stuff' in the CSV document.
Is there something I'm missing here? Does Excel just handle certain encodings badly? I've read up on how UTF-8 is an encoding and Unicode is just a character set (so you can't really compare them), but I'm still confused.
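The str-versus-bytes part at least is easy to demonstrate; a minimal sketch (with a made-up element) of what the two encoding arguments return:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<name>пример</name>')  # made-up element with Cyrillic text

s = ET.tostring(elem, encoding='unicode')  # a Python str, no XML declaration
b = ET.tostring(elem, encoding='utf-8')    # bytes, prefixed with an XML declaration

print(type(s), s)
print(type(b), b.decode('utf-8'))
```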
I am trying to write strings containing special characters, such as Chinese characters and French accents, to a CSV file. At first I was getting the classic Unicode encode error and looked online for a solution. Many resources told me to use .encode('utf-8', errors='ignore') to solve the problem, but this places bytes in the Excel file. In my code shown below, the function that appends the character to the CSV file opens it with UTF-8 encoding. This makes the program run without error; however, when I open the Excel document I see that instead of "é" and "蒋" being added to the file, I see "é" and "è’‹".
import csv

def appendToCSV(specialCharacter):
    with open('myCSVFile.csv', "a", newline="", encoding='utf-8') as csvFile:
        csvFileWriter = csv.writer(csvFile)
        csvFileWriter.writerow([specialCharacter])
appendToCSV('é')
appendToCSV('蒋')
I would like to display the characters in the Excel document exactly as shown; any help would be appreciated. Thank you.
Use utf-8-sig for the encoding. Excel requires the byte order mark (BOM) signature or it will interpret the file in the local ANSI encoding.
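Applied to the appendToCSV function above, only the encoding argument needs to change:

```python
import csv

def appendToCSV(specialCharacter):
    # utf-8-sig writes the BOM (EF BB BF) at the start of a new file,
    # which is the signature Excel uses to detect UTF-8
    with open('myCSVFile.csv', 'a', newline='', encoding='utf-8-sig') as csvFile:
        csv.writer(csvFile).writerow([specialCharacter])

appendToCSV('é')
appendToCSV('蒋')
```

In append mode Python suppresses the BOM when the file already has content, so repeated calls don't scatter BOMs through the file.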
I'm pretty sure your Excel worksheet is set to use "Latin-1". Try switching the setting to UTF-8.
Note:
>>> x = "蒋"
>>> bs = x.encode()
>>> bs
b'\xe8\x92\x8b'
>>> bs.decode("latin")
'è\x92\x8b'
And:
>>> x = 'é'
>>> bs = x.encode()
>>> bs.decode('latin-1')
'é'