Need to extract data from html tables - python

I am new to scraping and I am trying to extract the data from html tables and save it as a csv file. How do I do that?
This is what I have done so far:
from bs4 import BeautifulSoup
import os
os.chdir('/Users/adityavemuganti/Downloads/Accounts_Monthly_Data-June2018')
soup=BeautifulSoup(open('Prod224_0055_00007464_20170930.html'),"html.parser")
Format=soup.prettify()
table=soup.find("table",attrs={"class":"details"})
Here is the html file I am trying to scrape from:
http://download.companieshouse.gov.uk/Accounts_Bulk_Data-2019-08-03.zip (it is a zip file). I have uncompressed the zip file and read the contents into 'soup' as shown above. Now I am trying to get the data sitting in the <table> tag into csv/xlsx format.

Pandas is the way to go here. Use read_html to parse the tables and to_csv to write them out, or to_excel if you would rather output xlsx. read_html accepts a URL or a local file path.
import pandas as pd
dataframes = pd.read_html('yoururlhere')
# Assuming there is only one table in the file, if not then you may need to do a little more digging
df = dataframes[0]
df.to_csv('filename.csv')
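If you would rather stay with the BeautifulSoup code from the question, the rows of the table can also be written out by hand with the csv module. This is only a minimal sketch: the helper names table_rows and rows_to_csv and the sample HTML are made up for illustration, and the real file may have nested tables or merged cells that need more care.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_rows(html, css_class="details"):
    """Return the cell text of every row in the first table with this class."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": css_class})
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


def rows_to_csv(rows, handle):
    """Write the rows to an open text handle as CSV."""
    csv.writer(handle).writerows(rows)


# Hypothetical stand-in for the contents of the downloaded HTML file.
sample_html = """
<table class="details">
  <tr><th>Item</th><th>Value</th></tr>
  <tr><td>Turnover</td><td>1200</td></tr>
  <tr><td>Profit</td><td>300</td></tr>
</table>
"""

rows = table_rows(sample_html)
buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
```

To write a real file instead of an in-memory buffer, pass a handle opened with open('filename.csv', 'w', newline='').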

Related

How do you run an excel function from within Python?

We are using an excel plugin to pull some data from an API. Our excel file contains a column with an entity identifier, and we use an excel formula to pull data for this entity from the internet.
Is it possible to run this from within Python?
I could export my pd.DataFrame to csv, open it with excel, append the data I want, and read it back into pandas... but is there a quicker way?
You can call the API directly with the requests library and parse the response with its json() method:
import pandas as pd
import requests
url = 'https://api.covid19api.com/summary'
r = requests.get(url)
data = r.json()  # parsed JSON as Python dicts and lists
data
Then you have the data and just need to include it in your DataFrame.
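As a rough sketch of that last step, the parsed JSON can be turned into its own DataFrame and merged onto the existing one by the identifier column. The payload below is a hypothetical stand-in for the API response, and the column names entity_id, CountryCode and TotalConfirmed are assumptions for illustration; the real response may be shaped differently.

```python
import pandas as pd

# Hypothetical payload standing in for the parsed API response.
payload = {
    "Countries": [
        {"CountryCode": "DE", "TotalConfirmed": 100},
        {"CountryCode": "FR", "TotalConfirmed": 80},
    ]
}

# Existing frame with an identifier column (name is an assumption).
df = pd.DataFrame({"entity_id": ["FR", "DE", "IT"]})

# A list of dicts becomes one row per dict.
api_df = pd.DataFrame(payload["Countries"])

# A left merge keeps every row of df; identifiers with no match get NaN.
merged = df.merge(api_df, how="left", left_on="entity_id", right_on="CountryCode")
```

The left merge keeps the original row order, so the result can replace the round trip through Excel.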

Using Pandas, how to read a csv file inside a zip file which you fetch using a URL [Python]

This url
https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip
contains 2 csv files, and 1 pdf which is updated daily, containing Covid-19 Data.
I want to be able to load the Summary_stats_all_locs.csv as a Pandas DataFrame.
Usually if there is a url that points to a csv I can just use df = pd.read_csv(url) but since the csv is inside a zip, I can't do that here.
How would I do this?
Thanks
You will need to fetch the file first, then open it with the zipfile module. Pandas can actually read a csv from inside a zip directly, but the problem here is that the archive contains several files, so we need to open it ourselves and specify the file name.
import requests
import pandas as pd
from zipfile import ZipFile
from io import BytesIO
r = requests.get("https://ihmecovid19storage.blob.core.windows.net/latest/ihme-covid19.zip")
files = ZipFile(BytesIO(r.content))
df = pd.read_csv(files.open("2020_05_16/Summary_stats_all_locs.csv"))
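The folder name in the path above looks like a date, so it presumably changes as the archive is updated (that is an assumption, not something confirmed here). A sketch that looks the member up by file name instead of hard-coding the folder could look like this. The helper read_csv_from_zip and the small in-memory archive are made up for illustration; in practice zip_bytes would be r.content.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd


def read_csv_from_zip(zip_bytes, csv_name):
    """Load the first archive member whose path ends with csv_name."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        matches = [name for name in archive.namelist() if name.endswith(csv_name)]
        if not matches:
            raise FileNotFoundError(f"{csv_name} not found in archive")
        with archive.open(matches[0]) as member:
            return pd.read_csv(member)


# Build a tiny archive in memory to stand in for the downloaded bytes.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "2020_05_16/Summary_stats_all_locs.csv", "location,deaths\nA,1\nB,2\n"
    )
    archive.writestr("2020_05_16/Other.csv", "x\n1\n")

df = read_csv_from_zip(buffer.getvalue(), "Summary_stats_all_locs.csv")
```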

Python pandas create dataframe from csv embedded within a web txt file

I am trying to import CSV formatted data into a Pandas dataframe. The CSV data is located within a .txt file that is located at a web URL. The issue is that I only want to import the part (or parts) of the .txt file that is formatted as CSV (see image below). Essentially I need to skip the first 9 rows and then import rows 10-16 as CSV.
My code
import csv
import pandas as pd
import io
url = "http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt"
df = pd.read_csv(io.StringIO(url), skiprows = 9, sep =',', skipinitialspace = True)
df
I get a lengthy error msg that ultimately says "EmptyDataError: No columns to parse from file"
I have looked at similar examples, such as "Read .txt file with Python Pandas - strings and floats", but this is different.
The code above parses the URL string itself as CSV data rather than the text file fetched from that URL, because io.StringIO(url) wraps the string, not the page contents. To see what I mean, take out the skiprows parameter and then show the data frame. You'll see this:
Empty DataFrame
Columns: [http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt]
Index: []
Note that the columns are the URL itself.
Import requests (you may have to install it first) and then try this:
import requests
content = requests.get(url).content
df = pd.read_csv(io.StringIO(content.decode('utf-8')), skiprows=9)
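To stop at row 16 as the question asks, nrows can be combined with skiprows. Below is a small sketch on made-up text, assuming line 10 is the header and lines 11-16 are the data; the sample content and column names are hypothetical and not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded text: 9 lines of notes,
# a header line, 6 data lines, then a trailing footer.
notes = [f"note line {i}" for i in range(1, 10)]
table = [
    "Month, Calm, N",
    "Jan, 5, 12",
    "Feb, 4, 10",
    "Mar, 6, 11",
    "Apr, 7, 9",
    "May, 8, 8",
    "Jun, 9, 7",
]
footer = ["end of data"]
text = "\n".join(notes + table + footer)

df = pd.read_csv(
    io.StringIO(text),
    skiprows=9,             # drop the 9 note lines
    nrows=6,                # read 6 data rows, stopping before the footer
    skipinitialspace=True,  # ignore the space after each comma
)
```

With the real file, text would be content.decode('utf-8') from the snippet above.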

How to Read a WebPage with Python and write to a flat file?

Very novice at Python here.
Trying to read the table presented at this page (w/ the current filters set as is) and then write it to a csv file.
http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB
I tried this next approach. It creates the csv file but does not fill it w/ the actual table contents.
Appreciate any help in advance. thanks.
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/optionsL=47579&O=243&TEAM=DAL&POS=RB'
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[-1].to_csv(csv_file)
Generally, try to state your problem more clearly, try to debug, and don't put everything on one line. With that said, your specific problems here were the table index ([-1] instead of [1]) and the missing ? in the URL (after options):
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'
# note the ? after "options", which was missing
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[1].to_csv(csv_file)
# note the index [1] instead of [-1]
This yields a CSV file with the table in it.
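One way to avoid dropping the ? is to build the query string from a dict instead of typing it by hand. This is just a sketch using urlencode from the standard library; the names base_url and params are my own.

```python
from urllib.parse import urlencode

base_url = "http://www65.myfantasyleague.com/2017/options"
params = {"L": 47579, "O": 243, "TEAM": "DAL", "POS": "RB"}

# urlencode joins the pairs with & in dict order; the ? is added once here.
url = f"{base_url}?{urlencode(params)}"
```

requests.get also takes a params argument, so requests.get(base_url, params=params) is an alternative to building the string yourself.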

How to use Python Web Scraping to download CSV file then convert it to Pandas Dataframe?

I'd like my script to do the following:
1) Access this website:
2) Import a CSV file titled "Sales Data with Leading Indicator"
3) Convert it to pandas Dataframe for data analysis.
Currently, the code I have is this:
from urllib import request
response = request.urlopen("http://vincentarelbundock.github.io/Rdatasets/datasets.html")
csv = response.read()
Thanks in advance
The pandas.read_csv() function accepts a URL to a csv file as its first argument, so this should work without downloading the page first:
import pandas as pd
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv')
