Python Pandas PDF/Web Scrape - python

I am trying to extract the top two PDFs listed under the current product range on this page:
https://www.intermediary.natwest.com/intermediary-solutions/products.html
I have managed to write a function that uses Selenium to click the download links and save the two PDFs to a temporary location. However, I am struggling to find a viable way to read in the tables with minimal cleaning required.
Can anyone suggest a way to read these two PDF tables and export them to CSV files? I have tried pdfplumber, but it converts the data into a list of lists, which is a nightmare to clean. I have also tried PyPDF2, which is just as messy and needs hundreds of lines of code to clean the data. I would like to find a best-practice solution that reads the PDFs in as they are and converts them to CSVs.
Any help would be immensely appreciated.
:)
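One possible starting point, offered as a sketch rather than a definitive answer: keep pdfplumber, but pass each table's list of lists through a small cleaning step before writing it with the csv module. The function names (clean_rows, write_table, pdf_tables_to_csv) and the output file naming are made up for this example. It assumes the PDFs contain real text tables that pdfplumber's extract_tables() can detect, which has not been checked against the NatWest files.

```python
import csv


def clean_rows(rows):
    """Tidy one table given as a list of rows (lists of cells).

    Cells may be None or contain embedded newlines; whitespace is
    collapsed and rows that end up completely empty are dropped.
    """
    cleaned = []
    for row in rows:
        cells = [" ".join(str(cell).split()) if cell is not None else "" for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def write_table(rows, csv_path):
    """Write the cleaned rows of one table to csv_path."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(clean_rows(rows))


def pdf_tables_to_csv(pdf_path, out_prefix):
    """Write every table pdfplumber finds in pdf_path to its own CSV file."""
    # Third-party dependency, imported here so the helpers above work without it.
    import pdfplumber

    written = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for table_number, rows in enumerate(page.extract_tables(), start=1):
                # Hypothetical naming scheme: <prefix>_p<page>_t<table>.csv
                csv_path = f"{out_prefix}_p{page_number}_t{table_number}.csv"
                write_table(rows, csv_path)
                written.append(csv_path)
    return written
```

Calling pdf_tables_to_csv on each downloaded file would give one CSV per detected table. If the tables have no ruling lines, the extraction settings will probably need tuning.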

Related

Extracting specific page from multiple PDF files

I have hundreds of PDF files with the same format but different content.
I need to extract the second page from each file individually, like this:
original1.pdf -> 2_original1.pdf
original2.pdf -> 2_original2.pdf
original3.pdf -> 2_original3.pdf
I'm trying to use PyPDF2, but I cannot work out the code from what I found on Google.
So far I have used the commands below, but I don't think they are correct.
cd C:\Users\ukil.yeom\Desktop\pdf_extracts
convert -density 150 *.pdf[2] only-page-2.pdf
Please show me how to do this.
Thanks in advance.
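A minimal sketch of one way to do this with PyPDF2, under some assumptions: it uses the PdfReader and PdfWriter names from recent PyPDF2 releases (older releases call them PdfFileReader and PdfFileWriter), and the helper names output_name, extract_page and extract_from_folder are made up for this example.

```python
from pathlib import Path


def output_name(pdf_path, page_number=2):
    """Return the target path, e.g. original1.pdf -> 2_original1.pdf."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(f"{page_number}_{pdf_path.name}")


def extract_page(pdf_path, page_number=2):
    """Copy one page (1-based) of pdf_path into a new single-page PDF."""
    # Third-party dependency; imported here so output_name works without it.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(str(pdf_path))
    if len(reader.pages) < page_number:
        return None  # the file is too short to have that page
    writer = PdfWriter()
    writer.add_page(reader.pages[page_number - 1])  # pages are 0-indexed
    target = output_name(pdf_path, page_number)
    with open(target, "wb") as handle:
        writer.write(handle)
    return target


def extract_from_folder(folder, page_number=2):
    """Run extract_page on every PDF in folder and return the files written."""
    results = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # Skip files that look like output from an earlier run.
        if pdf_path.name.startswith(f"{page_number}_"):
            continue
        target = extract_page(pdf_path, page_number)
        if target is not None:
            results.append(target)
    return results
```

For the folder in the question, this would be called as extract_from_folder(r"C:\Users\ukil.yeom\Desktop\pdf_extracts").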

Switching the order of pages of a pdf in python

I'm currently working on generating PDF files for name tags in Python. However, my freshly generated files contain all the fronts and then all the backs, instead of a front, then the corresponding back, then the next front, and so on. I would like to correct that after the files have been generated.
So I have the following:
p1f, p2f, p3f,... ,p1b, p2b, p3b,...
Where pn describes the n-th page, f is for front and b is for back. What I want to end up with is:
p1f, p1b, p2f, p2b, p3f, p3b,...
What are possible ways to approach this? What libraries could I use?
Thanks in advance!
For libraries, you can use PyPDF2 or pdfrw.
As for the approach, if the files are small I'd suggest loading them into memory, reordering the pages, and writing them back to disk.
If a PDF file is too large, you could split the pages into separate files and build the output file one page at a time.
There are probably more efficient ways to do this, though.
You might also want to check out PDF-Shuffler, a python-gtk tool that performs such tasks without any programming.
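The in-memory approach could look roughly like the sketch below. It assumes the file holds exactly as many backs as fronts, and it uses the PdfReader and PdfWriter names from recent PyPDF2 releases (older releases call them PdfFileReader and PdfFileWriter). The function names interleaved_order and interleave_pdf are made up for this example.

```python
def interleaved_order(page_count):
    """Return 0-based page indices for front1, back1, front2, back2, ...

    The input layout is assumed to be all fronts followed by all backs.
    """
    if page_count % 2:
        raise ValueError("expected an even number of pages (fronts then backs)")
    half = page_count // 2
    order = []
    for i in range(half):
        order.extend([i, half + i])  # i-th front, then its matching back
    return order


def interleave_pdf(src_path, dst_path):
    """Write a copy of src_path with fronts and backs interleaved."""
    # Third-party dependency; imported here so interleaved_order works without it.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in interleaved_order(len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
```

For six pages, interleaved_order(6) gives [0, 3, 1, 4, 2, 5], which is p1f, p1b, p2f, p2b, p3f, p3b.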

Scraping PDF data into Excel *absolute beginner*

This is literally day 1 of python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following guides online for coding a pdf scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test out some of the code I've found online.
Basic Info
Windows 7 64bit
python 3.6.0
Spyder3
I have installed many of the PDF-related packages (PyPDF2, pdfminer, pdfquery, pdfrw, etc.)
Goals
To create something in Python that converts the PDFs in a folder into an Excel file (ideally) OR a text file (which I would then convert using VBA).
Issues
Every time I try sample code from guides I've found online, I run into syntax errors on the lines where I call the PDF that I want to test the code on. Some guide links and error examples are below. Should I be putting my test.pdf into the same folder as the .py file?
How to scrape tables in thousands of PDF files?
I got an invalid syntax error due to "for" on the last line
PDFMiner guide (Link)
runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
print pdf_to_csv('test.pdf', separator, threshold)
^
SyntaxError: invalid syntax
It seems that the tutorials you are following use Python 2. There are usually few noticeable differences, but the biggest is that in Python 3 print became a function, so the failing line needs parentheses:
print(pdf_to_csv('test.pdf', separator, threshold))
I would recommend either changing your version of Python or finding a tutorial for Python 3. Hope this helps.
Here: "Pdfminer python 3.5", an example of how to extract information from a PDF.
But it does not solve the problem of the tables you want to export to Excel. Commercial products are probably better at doing that...
I am trying to do this exact same thing! I have been able to convert my PDF to text, but the formatting is extremely random and messy, and I need the tables to stay intact so that I can write them into Excel sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)
btw, use python 2 if you're going to use pdfminer. Here's some help with pdfminer https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

How to extract financial statements only from XBRL files using Arelle's Python API?

Somehow, with the broken documentation on Arelle's python API as of date, I managed to make the API work and successfully load an XBRL file.
Anyways, my question is:
How do I extract only the STATEMENTS from the XBRL file?
Below is a screenshot from Arelle's Windows App.
URL used in this example: https://www.sec.gov/Archives/edgar/data/101984/000010198416000062/ueic-20151231.xml
I tried experimenting with the API and here's my code
from arelle import Cntlr
xbrl = Cntlr.Cntlr().modelManager.load('https://www.sec.gov/Archives/edgar/data/101984/000010198416000062/ueic-20151231.xml')
for fact in xbrl.facts:
    print(fact)
but after executing this snippet, I'm bombarded with these:
I tried getting the keys available per modelFact, and it's a mixture of contextRef, id, decimals and unitRef, which is not helpful for what I want to extract. With no documentation to help further, I'm at a loss here. Can someone enlighten me on how to extract only the statements?
I am doing something similar and have so far had some progress which I can share:
By going through arelle's Python source files you can find out which properties you can access on the different classes, such as ModelFact, ModelContext, ModelUnit, etc.
To extract the individual data, you can for example put them in a pandas DataFrame as follows:
import pandas as pd

# One row per fact; xbrl is the model loaded in the question's snippet.
factData = pd.DataFrame(data=[(fact.concept.qname,
                               fact.value,
                               fact.isNumeric,
                               fact.contextID,
                               fact.context.isStartEndPeriod,
                               fact.context.isInstantPeriod,
                               fact.context.isForeverPeriod,
                               fact.context.startDatetime,
                               fact.context.endDatetime,
                               fact.unitID) for fact in xbrl.facts])
Now it is easier to work with all the data, filter the facts you want to use, and so on. If you want to reproduce the statement tables, you will also need to incorporate the links for each of the facts and then order and sort them, but I haven't gotten that far either.

Script that converts html tables to CSV (preferably python)

I have a large number of html tables that I'd like to convert into CSV. Pasting individual tables into excel and saving them as .csv works, as does pasting the html tables into simple online converters. But I have thousands of individual tables, so I need a script that can automate the conversion process.
I was wondering if anyone has any suggestions as to how I could go about doing this? Python is the only language I have a decent knowledge of, so some sort of python script would be ideal. I've searched for similar questions, but all the python examples I've found are quite complicated to me, and go beyond my basic level of understanding.
Any advice would be much appreciated.
Use pandas. It has a function to read html tables into a data structure, and then a function that will write that data structure to a csv file.
import pandas as pd
url = 'http://myurl.com/mypage/'
for i, df in enumerate(pd.read_html(url)):
    df.to_csv('myfile_%s.csv' % i)
Note that since an html page may have more than one table, the function to get the table always returns a list of tables (even if there is only one table). That is why I use a loop here.
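If installing pandas is not an option, a rough standard-library alternative is sketched below. It is not the pandas approach above: it swaps in html.parser and csv instead. It assumes simple, well-formed tables with explicit closing tags and no nested tables. The names TableParser, html_to_tables and tables_to_csv are made up for this example.

```python
import csv
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def html_to_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def tables_to_csv(html, prefix):
    """Write each table to <prefix>_<index>.csv and return the paths."""
    paths = []
    for i, rows in enumerate(html_to_tables(html)):
        path = "%s_%s.csv" % (prefix, i)
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths
```

Like pd.read_html, html_to_tables always returns a list of tables, so the same kind of loop applies. For messy real-world HTML, pandas is likely to be the more robust choice.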
