import json
f=open("99_jiayi.txt",'r',encoding = 'utf-8-sig')
a=json.load(f)
f.close()
https://i.stack.imgur.com/vD3M5.png
https://i.stack.imgur.com/QP0oW.png
I tried to turn the text file into a list in order to analyze the data
I used the "json load" and it worked on another file which is written in the same mode
But when i want to use it on another file it comes out the error
i searched google for a lot of time but still cant get the ans
Hope someone can help me with this question
i have some problem to express my thought with eng so if anyone cant understand what i am typing plz let me know tks!!
The error message points to "1":NR, which does not look like valid JSON, so it seems like a valid error.
Edit: try putting all NR within quotes.
Related
I want to find out the filetypes of some files with no file ending.
The best case would be if I could get the same string you get in the file properties. Iam using Python and already tried with mimetypes, which doesn't worked.
Thanks for any help :)
I think your question has been resolved in this issue on stackoverflow :
check type of file without extensions
hopes this help :)
You can try using the file command, executed via subprocess.
result = subprocess.check_output(['file', '/path/to/allcfgconv'])
The resulting string is a bit verbose; you'll have to parse the file type from it yourself.
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 1 year ago.
Improve this question
I am very new to the coding world and have been stuck on this one problem for 3 days now, searching everywhere for an answer, so any help will be greatly appreciated. I am needing to extract a small amount of text from a url-located Pdf file. I'm using sessions.get(chart_PDF) as the driver for locating the URL where chart_PDF is the example url below.
Example url is https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
I know I am able to write it to my local drive but I don't want to do that, I want to be able to do it remotely, since I only need a couple of numbers from it.
I have tried finding the password from the url page for decrypting, couldn't find. I've tried to use PyPDF2, pdfminer and pikepdf (probably not well).
I only need to retrieve two numbers near the bottom of the PDF that can be used for the rest of my code. Please help, even if it is a simple fix, I'm new to all this and need some help. Thanks.
from io import BytesIO
from pikepdf import Pdf as PDF
from pdfminer import high_level
chart_PDF = https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
retrieve = s.get(chart_PDF)
content = retrieve.content
response =urllib.request.urlopen(chart_PDF)
p = BytesIO(content)
p.getbuffer()
check = PDFPage.get_pages(p, check_extractable=False)
extract = high_level.extract_text(p)
I'm getting:
PDFTextExtractionNotAllowedWarning: The PDF <_io.BytesIO object at 0x000001B007ABEC20> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding.warnings.warn(warning_msg, PDFTextExtractionNotAllowedWarning)
Alternately, if I try this:
from pikepdf import Pdf as PDF
from pdfminer.pdfpage import PDFPage
from PyPDF2 import PdfFileReader
new_pdf = PDF.new()
with PDF.open(p) as pdf:
print(len(pdf.pages))
page1 = pdf.pages[0]
if PdfFileReader.getIsEncrypted(pdf):
print(True)
PdfFileReader.decrypt(page1, password='')
pdf.close()
I get:
line 1987, in decrypt
return self._decrypt(password)
AttributeError: _decrypt
UPDATE 3/8/21
Thank you so much K J! You've seriously been a huge help!
from io import BytesIO
from pdfminer.pdfpage import PDFPage
from pdfminer import high_level
retrieve = s.get(chart_PDF)
content = retrieve.content
bytes = BytesIO(content)
bytes.getbuffer()
PDFPage.get_pages(bytes, check_extractable=False)
extract = high_level.extract_text(bytes, password='') #THIS LINE THROWS ERROR
joined = ''.join(extract)
find_txt = re.findall(r'[(]\d*[-]\d[.]\d[)]', joined)
print(find_txt)
bytes.close()
This is now working well and I have been able to pull the numbers that I need (I have basically pulled all numbers from inside brackets off the PDF). I'll sort through that to find which one I need.
Strangely enough, although its giving me what I need, my extract = high_level.extract_text(bytes, password='') line still throws the Warning: (warning_msg, PDFTextExtractionNotAllowedWarning) which is rather annoying. Not sure how this process works but its still letting the info out.
I can't use try except or it skips over it. What is the way around this? how can I stop that error coming up?
FINAL UPDATE
I got around the warning and it works well now.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
extract = high_level.extract_text(bytes)
Cheers fellas for putting up with my ignorance, you've helped so much.
The whole file has to be downloaded to a device via RAM so the blob as a FILE can be parsed at the very END for one OR more %%EOF and the location of page 0 (it gets converted to 1 or i) it could be ANYWHERE IN THE STREAM,.
THEN you can navigate to other sequential numbered pages in the RANDOM order they are built. Any complaints please contact Adobe.
However it is easiest if it is cached as a physical FILE object. If you dont want that on disk use a ram drive for your browser.
Again those two objects at bottom of page one could be anywhere mixed into the content of "page" 99's objects, or otherwise. each letter in a PDF can in its extreme be more than one object anywhere in the file. but a good authoring editor would try to keep them as lines by lines. (there is no such PDF thing as a word or paragraph.)
We can Print the file as Plain Text to see how it is composited and although (secured) that is allowed.
I tried printing from browser with little success but know that can depend on browser system and OS print drivers. Here I have printed the page as text using Acrobat portable, so we can see the sequential offsets of each text block from Left Hand margin JUST LIKE a PDF VIEWER would need to rebuild them.
UPDATE
You said your target is (1380-4.4) to the RIGHT of ALTERNATE but again A PDF has no concept of Left and Right or BEFORE or AFTER so we find IN THIS FILE the variable target is in 2 separate pieces PRIOR to the KNOWN characters which luckily is a complete single block (alternate). Thus here proximity of plain text could well work if the capture is confined to that nearby locality. However there is no guarantee that ALTERNATE would always be a single block.
It was perhaps not a good Idea To show the way a Printer would be given a stream of sequential data
Here is the way one PDF viewer goes about decrypting the file
As stated on this occasion the word ALTERNATE is defined as text however the next item is the "3" under "B" which is text as a vector path it is not called a "character" although it looks like one but a numbered glyph from a font table. We do see later that some of those numbers are stored as "text" and for your target it is mixed in with similar text in the same object.
Thus you need to call a PDF interpreter to give you a meaningful translation of all bits and pieces of objects so that you can extract the "right" text.
The easiest way for a "simple" one line target in a complex file is to use MuPDF to first tidy up the file
mutool clean -gggg -D infile.pdf outfile.pdf
combined with
PDFTOTXT -layout outfile.pdf outfile.txt
or similar to hopefully export that text on a line by line basis, such that you can consistently find your target instantly before ! or after ALTERNATE.
N.B Mutool convert to HTML would place the target value in a table entry AFTER the key word, and if the lines are consistent in number that would be a simpler way to find or grep.
So I'm thinking this is one of those problems where I can't see the forest for the tree. Here is the assignment:
Using the file object input, write code that read an integer from a file called
rawdata into a variable datum (make sure you assign an integer value to datum).
Open the file at the beginning of your code, and close it at the end.
okay so first thing: I thought the input function was for assigning data to an object such as a variable, not for reading data from an object. Wouldn't that be read.file_name ?
But I gave it shot:
infile = open('rawdata','r')
datum = int(input.infile())
infile.close()
Now first problem... MyProgrammingLab doesn't want to grade it. By that I mean I type in the code, click 'submit' and I get the "Checking" screen. And that's it. At the time of writing this, my latest attempt to submit as been 'checking' for 11 minutes. It's not giving me an error, it's just not... 'checking' I guess.
Now at the moment I can't use Python to try the program because it's looking for a while and I'm on a school computer that is write locked, so even if I have the code right (I doubt it), the program will fail to run because it can neither find the file rawdata nor create it.
So... what's the deal? Am I reading the instructions wrong or is it telling me to use input in some other way then I'm trying to use it? Or am I supposed to be using a different method?
You are so close. You're just using the file object slightly incorrectly. Once it's open, you can just .read() it, and get the value.
It would probably look something like this
infile = open('rawdata','r')
datum = int(infile.read())
infile.close()
I feel like your confusion is based purely on the wording of the question - the term "file object input" can certainly be confusing if you haven't worked with Python I/O before. In this case, the "file object" is infile and the "input" would be the rawdata file, I suppose.
Currently taking this class and figured this out. This is my contribution to all of us college peeps just trying to make it through, lol. MPL accepts this answer.
input = open('rawdata','r')
datum = int(input.readline())
input.close()
I got a InDesign page filled with 4 pictures.
Im using
ex = os.popen('exiftool -T -IngredientsFilePath '+item).read()
to get back all pictures and their filepath used in that document
But when i try to write this answer into a file it seems the encoding went went wrong,
or the answer from that command is already in any other encoding.
All ü,Ü,ä,Ä,ö,Ö,',',ß or spaces look like this : u%cc%88, o%cc%88, %c3%9f and so on.
Does anyone know how to fix this? At the moment im using .replace('%c3%9f', 'ß') for ex
to get the right encoding but im not srsly lucky with that.
Greetings Yierith
Here's my first simple test program in Python. Without importing the os library the program runs fine... Leading me to believe there's something wrong with my import statement, however this is the only way i ever see them written. Why am I still getting a syntax error?
import os # <-- why does this line give me a syntax error?!?!?! <unicode error> -->
CalibrationData = r'C:/Users/user/Desktop/blah Redesign/Data/attempts at data gathering/CalibrationData.txt'
File = open(CalibrationData, 'w')
File.write('Test')
File.close()
My end goal is to write a simple program that will look through a directory and tabularize data from relevant .ini files within it.
Well, as MDurant pointed out... I pasted in some unprintable character - probably when i entered the URL.