I have a large number of html tables that I'd like to convert into CSV. Pasting individual tables into excel and saving them as .csv works, as does pasting the html tables into simple online converters. But I have thousands of individual tables, so I need a script that can automate the conversion process.
I was wondering if anyone has any suggestions as to how I could go about doing this? Python is the only language I have a decent knowledge of, so some sort of python script would be ideal. I've searched for similar questions, but all the python examples I've found are quite complicated to me, and go beyond my basic level of understanding.
Any advice would be much appreciated.
Use pandas. It has a function to read html tables into a data structure, and then a function that will write that data structure to a csv file.
import pandas as pd
url = 'http://myurl.com/mypage/'
for i, df in enumerate(pd.read_html(url)):
df.to_csv('myfile_%s.csv' % i)
Note that since an html page may have more than one table, the function to get the table always returns a list of tables (even if there is only one table). That is why I use a loop here.
Related
I am trying to extract the first top 2 pdf's from this page under the current product range.
https://www.intermediary.natwest.com/intermediary-solutions/products.html
I have managed to create a function that uses selenium to click the download links and download the 2 pdf's into a temporary location, however, I am struggling to find a viable way to read in the tables with minimal cleaning required.
Please can anyone help with potential solutions to download these 2 pdf tables and export them into csv's, I have tried using PDF plumber but it converts the data into a list of lists which is a nightmare to clean. I have also tried PyPDF2 which is also very messy with hundreds of lines of code needed to clean the data. I would just like to find a good best practice solution to read the pdfs in as they are and convert them to csv's.
Any help would be immensely appreciated.
:)
I have a particular spreadsheet which has point of sale data exported feom a sql database.
Im trying to migrate to a new point of sale system and si i need to copy this data that i exported into a csv file into another csv file which has a different format, for example different columns thatvi have to rearrange the original data into.
Im trto do this using python but im failing to find a way to automate this task.
Does anyone have any ideas or any videos on a similar project
Pandas seems like the python tool for you.
Open up the first CSV file with Pandas as a DataFrame, apply any modifications you want, and save as a new CSV file. There is A LOT of documentation and support for Pandas, so I'm sure you can find tutorials on how to do any kind of data reshaping that you want.
I'm kinda new to Python and webscraping but I'm currently at a point where I need to extract data to a database. Can someone tell me the pros and cons by using sqlite, excel or xml?
I've read that sqlite should be the fastest, so I may go for that database structure, but can someone then tell me what IDE you use to handle sqlite data after I've extracted it from python?
Edit: I hope my post makes sense. I'm currently trying to use a web scraper from here: https://github.com/gingeleski/odds-portal-scraper
Thanks in advance.
For the short term, Excel is a good way to examine your data and prototype analysis and visualizations. It gets old using it for very large datasets, or multiple similar datasets. Basically as soon as you start doing the same thing more than twice or writing VB code you should switch to the pandas/matplotlib solution.
It looks like the scraper you are using already puts the results in an SQLITE database, but if you have your data in a list or dictionary, I'd suggest using pandas to do calculations and matplotlib for visualizations, as that will give you a robust, extensible solution over the long term. It is very easy to read and write data between an SQLITE database and pandas.
A good way of viewing the data in the DB is a must. I'm currently using SQLiteStudio.
When you say IDE, I'm assuming you're looking for a way to view the SQLite data? If so, DBeaver is a free, open source SQL client. You could use this to view the data quite easily.
How do I get 4 million rows and 28 columns from Python to Tableau in a table form?
I assume (based on searching) that I should use a JSON format. This format can handle a lot of data and is fast enough.
I have made a subset of 12 rows of the data and tried to get it working. The good news is: it's working. The bad news: not the way I want to.
My issue is that when I import it in Tableau it doesn't look like a table. I have tried the variances which are displayed here.
This is the statement in Python (pandas):
jsonfile = pbg.to_json("//vsv1f40/Pricing_Management$/Z-DataScience/01_Requests/Marketing/Campaign_Dashboard/Bronbestanden/pbg.json",orient='values')
Maybe I select too many schemas in Tableau (I select them all), but I think my problem is in Python. Do I need to use another library instead of Pandas? Or do I need to change the variables?
Other ways are also welcome. I have no preference for JSON, but I thought that was the best way, based on the search results.
Note: I am new to python and tableau :) I use python 3.5.2 and work in Jupyter. From Tableau I only have the free trial desktop version.
JSON is good for certain types of data, but if your DataFrame is purely tabular (no MultiIndexes, complex objects, etc.) and contains simple data types (strings, digits, floats), then a comma-separated value (CSV) text file is probably the best format to use, as it would take up the least space. A DataFrame can easily be saved as a CSV using the to_csv() method, and there are a number of customization options available. I'm not terribly familiar with Tableau, but according to their website CSV files are a supported input format.
I have this problem, I need to scrape lots of different HTML data sources, each data source contains a table with lots of rows, for example country name, phone number, price per minute.
I would like to build some semi automatic scraper which will try to ..
find automatically the right table in the HTML page,
-- probably by searching the text for some sample data and trying to find the common HTML element which contain both
extract the rows
-- by looking at above two elements and selecting the same patten
identify which column contains what
-- by using some fuzzy algorithm to best guess which column is what.
export it to some python / other list
-- cleaning everytihng.
does this look like a good design ? what tools would you choose to do it in if you program in python ?
does this look like a good design ?
No.
what tools would you choose to do it in if you program in python ?
Beautiful Soup
find automatically the right table in the HTML page, -- probably by searching the text for some sample data and trying to find the common HTML element which contain both
Bad idea. A better idea is to write a short script to find all tables, dump the table and the XPath to the table. A person looks at the table and copies the XPath into a script.
extract the rows -- by looking at above two elements and selecting the same patten
Bad idea. A better idea is to write a short script to find all tables, dump the table with the headings. A person looks at the table and configures a short block of Python code to map the table columns to data elements in a namedtuple.
identify which column contains what -- by using some fuzzy algorithm to best guess which column is what.
A person can do this trivially.
export it to some python / other list -- cleaning everytihng.
Almost a good idea.
A person picks the right XPath to the table. A person writes a short snippet of code to map column names to a namedtuple. Given these parameters, then a Python script can get the table, map the data and produce some useful output.
Why include a person?
Because web pages are filled with notoriously bad errors.
After having spent the last three years doing this, I'm pretty sure that fuzzy logic and magical "trying to find" and "selecting the same patten" isn't a good idea and doesn't work.
It's easier to write a simple script to create a "data profile" of the page.
It's easier to write a simple script reads a configuration file and does the processing.
I cannot see better solution.
It is convenient to use XPath to find the right table.