Scraping PDF data into Excel *absolute beginner* - python

This is literally day 1 of python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following guides online for coding a pdf scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test out some of the code I've found online.
Basic Info
Windows 7 64bit
python 3.6.0
Spyder3
I have many of the pdf related code packages (PyPDF2, pdfminer, pdfquery, pdfwrw, etc)
Goals
To create something in python that allows me to convert PDFs from a folder into an excel file (ideallY) OR a text file (from which I will use VBA to convert).
Issues
Every time I try some sample code from guides i've found online, I always run into syntax errors on the lines where I am calling the pdf that I want to test the code on. Some guide links and error examples below. Should I be putting my test.pdf into the same file as the .py file?
How to scrape tables in thousands of PDF files?
I got an invalid syntax error due to "for" on the last line
PDFMiner guide (Link)
runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
print pdf_to_csv('test.pdf', separator, threshold)
^
SyntaxError: invalid syntax

It seems that the tutorials you are following make use of python 2. There are usually few noticable differences, the the biggest is that in python 3, print became a funtion so
print()
I would recomment either changing you version of python or finding a tutorial for python 3. Hope this helps

Here
Pdfminer python 3.5 an example, how to extract informations from a PDF.
But it does not solve the problem with tables you want to export to Excel. Commercial products are probably better in doing that...

I am trying to do this exact same thing! I have been able to convert my pdf to text however the formatting is extremely random and messy and I need the tables to stay in tact to be able to write them into excel data sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)
btw, use python 2 if you're going to use pdfminer. Here's some help with pdfminer https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

Related

How to convert fb2 file in to txt using Python

Hello I am struggling to convert hundreds of fb2 files to txt using Python. I find pyandoc and EbookLib but I didn't find in their functionality this option, or I didn't search carefully.
Can someone suggest me something relevant in my case ? Maybe free API, but I think there could be a library.
something relevant in my case
I did look for fb2 and FictionBook2 at PyPI and found 2 potentially useful to you: catpandoc and FB2. 1st does Cat multiple documents to the terminal. and support numerous file formats. 2nd is Python package for working with FictionBook2. For FB2 example is given how to create FB2 file, but not how to read. I do not know if it means documentation is poor or it does not have read support at all.
EDIT: After some research I found that FictionBook2 files are XML files. Example can be seen here. That being said I encourage you to first try existing FB2 tools and only if they fail to deliver desired result implement extraction by XML parsing.

Automating excel reporting and graphs - Python xlsxWriter/xlswings or Ruby axlsx/win32ole

I want to create a program, which automates excel reporting including various graphs in colours. The program needs to be able to read an excel dataset. Based on this dataset, the program then has to create report pages and graphs and then export to an excel file as well as pdf file.
I have done some research and it seems this is possible using python with pandas - xlsxWriter or xlswings as well as Ruby gems - axlsx or win32ole.
Which is the user-friendlier and easy to learn alternative? What are the advantages and disadvantages? Are there other options I should consider (I would like to avoid VBA - as this is how the reports are currently produced)?
Any responses and comments are appreciated. Thank you!
If you already have VBA that works for your project, then translating it to Ruby + WIN32OLE is probably your quickest path to working code. Anything you can do in VBA is doable in Ruby (if you find something you can't do, post here to ask for help).
I prefer working with Excel via OLE since I know the file produced by Excel will work anywhere I open it. I haven't used axlsx but I'm sure it's a fine project; I just wouldn't trust that it would produce working Excel files every time.

Scientific Notation is badly written to csv from python

I'm totally puzzled by the situation at hand. I have a large dataset with a broad range of numbers, all between 0 and 2. However, when I write the data to a .csv file with
df_Signals.to_csv('signals_IDG_TOut1.csv', sep=',')
to be able to import the file in another program something strange happens. When I for example call the number with
print(df_Signals["Column"].iloc[44])
python prints: 2.8147020866287068e-05
However, when I open the .csv file it reads 281470208662,87. A quick inspection shows that this happens for all number written in E-notation I could find. I have tried to figure out what is going on, but have no idea what the answer is . So my main question is: Why? And secondly, how can I resolve this? And is this a structural problem when exporting to .csv files?
I use PyCharm 2017.1.4, with the Anaconda 3 interpreter.
Regards
Update: As the comments correctly pointed out, it is Excel that wrongly opens the data. Which still intrigues me why that happens.

Using python to build a latex document

I have a script I have written in python which pulls data from a bunch of files on my computer which change daily. I want to insert the results into a latex template so that I can review the summary.
What is the best way to open a file and insert text into it at a specific point?
Preferably using a python, but I'm open to other tools if there is something better.
Thanks
Russ
You could also do it all from within python using the pylatex library.
https://github.com/JelteF/PyLaTeX
This way you only have to run the python file every day.
I figured out how to do it.
It seems to me the best way is to have python output low level latex code, then use latex's command \input to pull that code into a larger document.
https://en.wikibooks.org/wiki/LaTeX/Modular_Documents

How to parse a .shp file?

I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can open in excel and contains the information from the feature class' table.
However, when I try to open a .shp file in any program (excel, textpad, etc...) all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
If you don't want to go to all the trouble of writing a parser, you should take look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
might be a long shot, but you should check out ctypes, and maybe use the .dll file that came with a program (if it even exists lol) that can read that type of file. in my experience, things get weird when u start digging around .dlls

Categories