Opening a pdf and reading in tables with python pandas - python

Is it possible to open PDFs and read it in using python pandas or do I have to use the pandas clipboard for this function?

you can use tabula
https://blog.chezo.uno/tabula-py-extract-table-from-pdf-into-python-dataframe-6c7acfa5f302
from tabula import read_pdf
df = read_pdf('data.pdf')
I can see more in the link!

There is a new version of tabula called tabula-py
pip install tabula-py
the .read_pdf method works just like in the old version, documentation is here:
https://pypi.org/project/tabula-py/

In case it is a one-off, you can copy the data from your PDF table into a text file, format it (using search-and-replace, Notepad++ macros, a script), save it as a CSV file and load it into Pandas.
If you need to do this in a scalable way, you might try this product: http://tabula.technology/. I have not used it yet, so I don't know how well it works, but you can explore it if you need it.

I have been doing some tests with Camelot (https://camelot-py.readthedocs.io/en/master/), and it works very good in many situations. And you can try to adjust some parameters if the default ones doesn't work.
It's similar to Tabula, but it use different algorithms (Tabula use the vector data in the PDF and raster the lines of the table; Camelot uses Hough Transform), so you can try both to find the best one.
Both have a web version, so you can try with some example to decide which is the best one for your application.

this is not possible. PDF is a data format for printing. The table structure is therefor lost. with some luck you can extract the text with pypdf and guess the former table columns.

Copy the table data from a PDF and paste into an Excel file (which usually gets pasted as a single rather than multiple columns). Then use FlashFill (available in Excel 2016, not sure about earlier Excel versions) to separate the data into the columns originally viewed in the PDF. The process is fast and easy. Then use Pandas to wrangle the Excel data.

I use Tabula library
for install, via:
pip install tabula-py
reading several tables inside PDF by link , example:
import tabula
df = tabula.io.read_pdf(url, pages='all')
then you will get many tables, you can call it by using index, it's like printing element from list, Example:
# ex
df[0]
more info here - https://pypi.org/project/tabula-py/

Related

Excel data reshaping with Python

I have a particular spreadsheet which has point of sale data exported feom a sql database.
Im trying to migrate to a new point of sale system and si i need to copy this data that i exported into a csv file into another csv file which has a different format, for example different columns thatvi have to rearrange the original data into.
Im trto do this using python but im failing to find a way to automate this task.
Does anyone have any ideas or any videos on a similar project
Pandas seems like the python tool for you.
Open up the first CSV file with Pandas as a DataFrame, apply any modifications you want, and save as a new CSV file. There is A LOT of documentation and support for Pandas, so I'm sure you can find tutorials on how to do any kind of data reshaping that you want.

How to_csv in Bluemix

We have a dataframe we are working it in a ipython notebook. Granted, if one could save a dataframe in such a way that the whole group could have access to it through their notebooks, would be ideal, and I'd love to know how to do that. However could you help with the following specific problem?
When we do df.to_csv("Csv file name") it appears that it is located in the exact same place as the files we placed in object storage to utilize in the ipython notebook. However, when one goes to Manage Files, it's nowhere to be found.
When one runs pd.DataFrame.to_csv(df), text of the csv file is apparently given. However when one copies that into a text editor (ex- Sublime text), saves it at a csv, and attempts to read it in to a dataframe, the expected dataframe is not yielded.
How does one export a dataframe to csv format, and then access it?
I'm not familiar with bluemix, but it sounds like you're trying to save a pandas dataframe in a way that all of your collaborators can access and it look the same way for everyone.
Maybe saving and reading from CSVs is messing up the formatting of your dataframe. Have you tried using pickling? Since pickling is based around python, it should give consistent results.
Try this:
import pandas as pd
pd.to_pickle(df, "/path/to/pickle/My_pickle")
and on the read side:
df_read = pd.read_pickle("/path/to/pickle/My_pickle")

Automatic input from text file in excel

My problem is rather simple : I have an Excel Sheet that does calculations and creates a graph based on the values of two cells in the sheet. I also have two lists of inputs in text files. I would like to loop through those text files, add the values to the excel sheet, refresh the sheet, and print the resulting graph to a pdf file or an excel file named something like 'input1 - input2.xlsx'.
My programming knowledge is limited, I am decent with Python and have looked into python libraries that work with excel such as openpyxl, however most of those don't seem to work for me for various reasons. Openpyxl deletes the graphs when opening an excel file; XlsxWriter can only write files, not read from them; and xlwings won't work for me.
Should I use python, which I'm familiar with, or would VBA work for this kind of problem? Have any of you ever done something of the sort?
Thanks in advance
As a more transitional approach to what m. wasowski wrote above, I'd suggest you do the following.
Install the pandas package, and see how easy it is to load a file using read_excel. Then, read 10 Minutes to Pandas, and manipulate the data.
You state that the Excel sheet is complex. In general, the more complex it is, this approach will eventually make it simpler. But you don't have to switch everything immediately. You can still do parts in Excel and parts in pandas.
I think you should consider win32Com for excel operation in python instead of Openpyxl,XlsxWriter.
you can read/write excel, create chart and format excel file using win32com without any limitation.
And creating chart you can consider matplotlib, in that after creating chart you can save it in pdf file also.

Modifying, recalculating and and extracting results from Excel in Python

I would like to try and make a program which does the following, preferably in Python, but if it needs to be C# or other, that's OK.
Writes data to an excel spreadsheet
Makes Excel recalculate formulas etc. in the modified spreadsheet
Extracts the results back out
I have used things like openpyxl before, but this obviously can't do step 2. Is the a way of recalculating the spreadsheet without opening it in Excel? Any thoughts greatly appreciated.
Thanks,
Jack
You need some sort of UI automation with which you can control a UI application such as Excel. Excel probably exposes some COM interface that you should be able to use for what you need. Python has the PyWin32 library which you should install, after which you'll have the win32com module available.
See also:
Excel Python API
Automation Excel from Python
If you don't necessarily have to work with Excel specifically and just need to do spreadsheet using Python, you might want to look at http://manns.github.io/pyspread/.
you could you pandas for reading in the data, using python to recalculate and then write the new files.
For pandas it's something like:
#Import Excel file
xls = pd.ExcelFile('path_to_file' + '/' + 'file.xlsx')
xls.parse('nyc-condominium-dataset', index_col='property_id', na_values=['NA'])
so not difficult. Here the link to pandas.
Have fun!

Getting data from an Excel sheet

How do I load data from an Excel sheet into my Django application? I'm using database PosgreSQL as the database.
I want to do this programmatically. A client wants to load two different lists onto the website weekly and they don't want to do it in the admin section, they just want the lists loaded from an Excel sheet. Please help because I'm kind of new here.
Have a look at the xlrd package, which allows you to read Excel files in Python. Once you've read the data you can do whatever you want with it, including saving it to the database.
For a basic usage example, look at http://scienceoss.com/read-excel-files-from-python/
Use django-batchimport http://code.google.com/p/django-batchimport/ It provides a very simple way to upload data in Excel sheets to your Django models. I have used it in a couple of projects. It can be integrated very easily into your existing Django project.
Read the documentation on the project page to know how to use it.
It is built on XLRD.
Have a look at the presentation "Excel & Python" that Chris Withers gave at PyCon US:
"This lightning talk explains that you don't need to use COM or be on Windows to read and write native Excel files."
http://www.simplistix.co.uk/presentations/python_excel_09/excel-lightning.pdf
Programatically or manually? If manualy then just save the excel as a CSV (with csv or txt extension) and import into Postgresql using
copy the_data from '/path/to/csv/MYFILE.txt' DELIMITERS ',' CSV;
As I remember of this.
The best way is to save this sheet as plain text ( CSV or something )
And then load with some custom SQL script.
http://www.postgresql.org/docs/8.3/static/populate.html
Or have a look at SQLAlchemy if you're going to write some kind of script to help you with that.(http://www.sqlalchemy.org/)
If you want to use COM to interface excel (i.e. you are running on a Windows machine), see "Migrating Excel data to SQLite" - http://www.saltycrane.com/blog/2007/11/migrating-excel-to-sqlite-using-python/
I built django-batchimport on top of xlrd which is AMAZING. The only issues I had were with getting data into Django. Had nothing to do with any limitations of xlrd. It rocks. John's work is incredible.
Note that I've actually done some update work to django-batchimport and just released. Take a look: http://code.google.com/p/django-batchimport/
Just started using XLRD and it looks very easy and simple to use.
Beware that it does not support Excel 2007 yet, so keep in mind to save your excel at 2003 format.

Categories