Download PDFs and join them using Python [duplicate]

This question already has answers here:
Merge PDF files
(15 answers)
Closed 29 days ago.
I have a list, named links_to_announcement, of URLs for different PDFs.
How do I download them and join them together? My code generates a corrupt PDF which doesn't open in a PDF reader at all.
with open('joined_pdfs.pdf', 'wb') as f:
    for l in links_to_announcement:
        response = requests.get(l)
        f.write(response.content)

Many file formats have a specific, structured layout (rather than being simply lines of arbitrary text), so appending them isn't sufficient!
Instead, in this case (writing a PDF) and with many other formats, it's necessary to rewrite them with something that understands their structure.
For a simple example, if two CSVs were blindly appended, the second CSV's header would be spliced into the middle of the new document, and any columns that don't match exactly would fail to parse or be misinterpreted, even if the offending line were removed.
file1.csv
colA,colB
1,2
file2.csv
colC,colD,colA
3,4,5
Blindly appending the two files:
colA,colB
1,2
colC,colD,colA
3,4,5
How should this document be interpreted?
Instead, a context-aware parser can merge the documents correctly:
colA,colB,colC,colD
1,2,,
5,,3,4
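For instance, a minimal sketch of such a context-aware merge with pandas (assuming the two files above are saved as file1.csv and file2.csv; those file names are only illustrative):
import pandas as pd

# read each CSV with its own header intact
df1 = pd.read_csv("file1.csv")  # columns: colA, colB
df2 = pd.read_csv("file2.csv")  # columns: colC, colD, colA

# concat aligns values by column name and leaves the gaps empty (NaN)
merged = pd.concat([df1, df2], ignore_index=True)
merged.to_csv("merged.csv", index=False)
print(merged)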
As suggested by @esqew, PDF files can be merged with logic like that in Merge PDF files.
You show how to download the files, but it's probably significantly faster to unpack each web request into a BytesIO and combine them all in memory (requests can expose the response body as a file-like object for streaming).
NOTE: this will frustrate attempts to restart after a failed request. If you find requests failing frequently, consider writing each PDF to disk as an intermediate step and checking whether the file is already in your local cache before downloading it again (a rough sketch of that appears after the code below).
import io

import requests  # aiohttp could be used with asyncio.gather() to download concurrently
from pypdf import PdfMerger

merger = PdfMerger()
with open("links_to_announcement.txt") as fh:
    for url in fh:
        r = requests.get(url.strip(), stream=True)
        # TODO error handling: .raise_for_status(), backoff, etc.
        r.raw.decode_content = True  # possibly fix encoding issues
        merger.append(io.BytesIO(r.raw.read()))  # pypdf needs a seekable stream, so buffer the bytes
merger.write("combined.pdf")
merger.close()
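If failed requests do turn out to be common, a rough sketch of that cache-to-disk idea might look like the following (the pdf_cache directory and the hash-based file names are illustrative assumptions, not part of the original code):
import hashlib
import io
import os

import requests
from pypdf import PdfMerger

cache_dir = "pdf_cache"  # hypothetical local cache directory
os.makedirs(cache_dir, exist_ok=True)

def fetch_pdf(url):
    # hash the URL to get a stable, filesystem-safe cache key
    cached = os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest() + ".pdf")
    if not os.path.exists(cached):
        r = requests.get(url)
        r.raise_for_status()
        with open(cached, "wb") as f:
            f.write(r.content)
    with open(cached, "rb") as f:
        return f.read()

merger = PdfMerger()
with open("links_to_announcement.txt") as fh:
    for url in fh:
        merger.append(io.BytesIO(fetch_pdf(url.strip())))
merger.write("combined.pdf")
merger.close()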

Related

Python 3.10.6 - Trying to use zipfile to extract from zip with outdated header

I'm working with the following block of code, in an attempt to extract data from a zip file
import zipfile

def get_zip(filenam, targetdir):
    with zipfile.ZipFile(filenam, "r") as zip_ref:
        zip_ref.extractall(targetdir)

zip_file = 'coolThing.zip'
targetdir = 'C:/puItHere/'
get_zip(zip_file, targetdir)
However, I get the error
"BadZipFile: Bad magic number for file header"
Looking through previous forums like this one, I find that my zip file needs to have the header "\x50\x4B\x03\x04" but it actually has the header "b'PK\x03\x04"
Does anyone know of a way where I can use zipfile, pyunpack, or any other library in order to extract what I need from this file type? I'm getting data from a large repository, and will be iterating through 30 TB of data, only taking what I need out of the zip files, and so far from what I've seen, they all use the same header
Thanks!

How to read Json files in a directory separately with a for loop and performing a calculation

Update: Sorry, it seems my question wasn't asked properly. I am analyzing a transportation network consisting of more than 5000 links. All the data is included in a big CSV file. I have several JSON files, each consisting of a subset of this network. I am trying to loop through all the JSON files INDIVIDUALLY (i.e. not trying to concatenate them or anything), read each JSON file, extract the corresponding information from the CSV file, perform a calculation, and save the result along with the name of the file in a new dataframe. Something like this:
(image: example of the desired output dataframe)
This is the code I wrote, but I'm not sure if it's efficient enough.
import glob
import json
import os

import pandas as pd

name = []
percent_of_truck = []
path_to_json = r"\\directory"  # folder containing the JSON files

z = glob.glob(os.path.join(path_to_json, '*.json'))
for i in z:
    with open(i, 'r') as myfile:
        l = json.load(myfile)
    name.append(i)
    d_2019 = final.loc[final['LINK_ID'].isin(l)]  # retrieve data from the main CSV file
    avg_m = (d_2019['AADTT16'] / d_2019['AADT16'] * d_2019['Length']).sum() / d_2019['Length'].sum()  # calculation
    percent_of_truck.append(avg_m)

f = pd.DataFrame()
f['Name'] = name
f['% of truck'] = percent_of_truck
I'm assuming here you just want a dictionary of all the JSON. If so, use the json library (import json); this code may be of use:
import json

def importSomeJSONFile(f):
    with open(f) as fh:
        return json.load(fh)

# make sure the file exists in the same directory
example = importSomeJSONFile("example.json")
print(example)

# access a value within it, replacing "name" with whatever key you want
print(example["name"])
Since you haven't added a schema or any other specific requirements, you can follow this approach to solve your problem in any language you prefer (a short Python sketch follows the list):
1. Get the directory of the JSON files that need to be read.
2. Get a list of all files present in that directory.
3. For each file name returned in step 2:
   a. Read the file.
   b. Parse the JSON from its text.
   c. Perform the required calculation.
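A minimal sketch of those steps, reusing the \\directory placeholder path from the question and using len() as a stand-in for whatever calculation is actually needed:
import glob
import json
import os

path_to_json = r"\\directory"  # step 1: directory containing the JSON files (placeholder path)

# step 2: list every JSON file in that directory
json_files = glob.glob(os.path.join(path_to_json, '*.json'))

results = {}
for path in json_files:        # step 3: handle each file individually
    with open(path) as fh:     # 3a: read the file
        data = json.load(fh)   # 3b: parse the JSON from its text
    # 3c: perform the required calculation; len() is only a placeholder here
    results[os.path.basename(path)] = len(data)

print(results)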

Unzipping a gzip file that contains a csv

I have just hit an endpoint and can pull down a gzip compressed file.
I have tried saving it and extracting the CSV inside, but I keep getting encoding errors whether I try decoding the bytes as utf-8 or utf-16.
To write to the saved gzip I write in binary mode:
r = requests.get(url, auth=auth, stream=True)
with gzip.open('file.gz', 'wb') as f:
    f.write(r.content)
Where r.content looks like:
b'PK\x03\x04\x14\x00\x08\x08\x08\x00f\x8dKM\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00A\x00\x00\x00RANKTRACKING_report_created_at_11_10_18_17_41-20181011-174141.csv\xec\xbdk\x8f\xe3V\x96\xae\xf9}\x80\xf9\x0f\ ... '
To extract the file on my machine manually, I first have to extract it to a zip and then extract that to get the CSV. I tried the same in code but ran into encoding errors there too.
I'm looking for a way to pull out this CSV so I can print its lines in the Python console.
That's not a gzip file. That's a zip file. You are then taking the zip file that you retrieved from the URL, and trying to compress it again as a gzip file. So now you have a zip file inside a gzip file. You have moved one step further away from extracting the CSV contents, as opposed to one step closer.
You need to use zipfile to extract the contents of the zip file that you downloaded.
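A minimal sketch of that, assuming url and auth are the same variables as in the question and that the archive contains the single CSV report shown above:
import io
import zipfile

import requests

r = requests.get(url, auth=auth, stream=True)
r.raise_for_status()

# treat the downloaded bytes as a zip archive, not a gzip stream
with zipfile.ZipFile(io.BytesIO(r.content)) as archive:
    csv_name = archive.namelist()[0]  # the embedded CSV report
    with archive.open(csv_name) as csv_file:
        for line in csv_file:
            print(line.decode('utf-8'))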

How to parse WIkidata JSON (.bz2) file using Python?

I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (from here, a .bz2 file, size ~18 GB).
However, I cannot open the file; it's just too big for my computer.
Is there a way to look into the file without extracting the full .bz2 file, ideally using Python? I know that there is a PHP dump reader (here), but I can't use it.
I came up with a strategy that uses the json module to access the information without extracting the whole file:
import bz2
import json
with bz2.open(filename, "rt") as bzinput:
    lines = []
    for i, line in enumerate(bzinput):
        if i == 10:
            break
        tweets = json.loads(line)
        lines.append(tweets)
In this way lines will be a list of dictionaries that you can easily manipulate, for example reducing their size by removing keys that you don't need.
Note also that (obviously) the condition i == 10 can be changed arbitrarily to fit your needs. For example, you may parse some lines at a time, analyze them, and write to a text file the indices of the lines you really want from the original file. Then it will be sufficient to read only those lines (using a similar condition on i in the for loop).
You can use the BZ2File interface to manipulate the compressed file, but you can NOT simply load it all with the json module; that would take too much space. You will have to index the file, meaning you read it line by line and save the position and length of each interesting object in a dictionary (hashtable); then you can extract a given object and load it with the json module.
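A rough sketch of that indexing idea, assuming the dump's usual one-entity-per-line layout where each line is a JSON object with an "id" key (the two-pass structure and the trailing-comma handling are just one way to do it):
import bz2
import json

path = "latest.json.bz2"

# first pass: remember where each entity's line starts and how long it is
index = {}
with bz2.open(path, "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        stripped = line.strip().rstrip(b",")
        if stripped in (b"[", b"]", b""):
            continue
        entity_id = json.loads(stripped)["id"]
        index[entity_id] = (pos, len(line))

# later: pull out a single entity without keeping the whole dump in memory
def load_entity(entity_id):
    pos, length = index[entity_id]
    with bz2.open(path, "rb") as f:
        f.seek(pos)  # seeking in bz2 is slow, but avoids loading everything
        raw = f.read(length).strip().rstrip(b",")
        return json.loads(raw)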
You'd have to do line-by-line processing:
import bz2
import json

path = "latest.json.bz2"

with bz2.BZ2File(path) as file:
    for line in file:
        line = line.decode().strip()
        if line in {"[", "]"}:
            continue
        if line.endswith(","):
            line = line[:-1]
        entity = json.loads(line)

        # do your processing here
        print(str(entity)[:50] + "...")
Seeing as WikiData is now 70GB+, you might wish to process it directly from the URL:
import bz2
import json
from urllib.request import urlopen

path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"

with urlopen(path) as stream:
    with bz2.BZ2File(stream) as file:  # wrap the HTTP stream, not the URL string
        ...

Python basics - request data from API and write to a file

I am trying to use "requests" package and retrieve info from Github, like the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
I have to say I don't understand the second code block.
filename - in what form do I provide the path to the file I'm creating? Where will it be saved if I don't specify a path?
'wb' - what is this variable? (Shouldn't the second parameter be 'mode'?)
The following two lines presumably iterate over the data retrieved with the request and write it to the file.
The Python docs explanation also isn't helping much.
EDIT: What I am trying to do:
use Requests to connect to an API (Github and later Facebook GraphAPI)
retrieve data into a variable
write this into a file (later, as I get more familiar with Python, into my local MySQL database)
Filename
When using open, the path is relative to your current working directory, i.e. whatever folder you ran the Python script from. So if you said open('file.txt', 'w'), it would create a new file named file.txt in that folder. You can also specify an absolute path, for example /home/user/file.txt on Linux. If a file by the name 'file.txt' already exists, its contents will be completely overwritten.
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means bytes. You use 'w' when you want to write to (rather than read from) a file, and you use 'b' for binary files (rather than text files). It is actually a little odd to use 'b' in this case, as the content you are writing is text; specifying 'w' alone would work just as well here. Read more on the modes in the docs for open.
The Loop
This part is using the iter_content method from requests, which is intended for use with large files that you may not want in memory all at once. This is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to be in memory. However, we can make your code more readable and easy to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt', 'w') as fd:
    fd.write(r.text)
filename is a string with the path you want to save to. It accepts either a relative or an absolute path, so you can just have filename = 'example.html'.
wb stands for WRITE & BYTES; learn more here.
The for loop goes over the entire returned content (in chunks, in case it is too large for proper memory handling) and writes it out until there is no more. That's useful for large files, but for a single webpage you could just do:
# just 'w' because we are not writing bytes anymore, just text
with open(filename, 'w') as fd:
    fd.write(r.text)
