I'm in the process of figuring out how to take my scraped data and convert it to a dataframe using pandas.
As an experiment I scraped data off a grocery website with Beautiful Soup: after fetching the URL and parsing it, I wrote a loop to pull every div tag with a certain class, then used the code below to extract the data:
import urllib2
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
link=requests.get("https://www.iga.net/en/online_grocery/frozen_grocery")
soup = BeautifulSoup(link.content, 'html.parser')
##print soup.prettify()
bowl=soup.find_all('div',class_='js-product js-equalized js-addtolist-container js-ga')
for bowls in bowl:
    list = bowls.get('data-product')
    print list
Output printed:
{'ProductId':'00000_000000005500059917','BrandName':'Nestle','FullDisplayName':'10 Pack Mini Rolo Bars','IsAgeRequired':false,'SizeLabel':'','Size':'10 x 45 ml','ProductUrl':'/en/product/mini-rolo-bars10-pack/00000_000000005500059917','ProductImageUrl':'https://az836796.vo.msecnd.net/media/image/product/en/medium/0005500059917.jpg','HasNewPrice':false,'PromotionName':null,'RegularPrice':6.49000,'SalesPrice':null}
{'ProductId':'00000_000000005574253356','BrandName':'Compliments','FullDisplayName':'100% Pure Frozen Concentrate Pulp Free Juice','IsAgeRequired':false,'SizeLabel':'','Size':'283 ml','ProductUrl':'/en/product/juice100--pure-frozen-concentrate-pulp-free/00000_000000005574253356','ProductImageUrl':'https://az836796.vo.msecnd.net/media/image/product/en/medium/0005574253356.jpg','HasNewPrice':false,'PromotionName':null,'RegularPrice':1.79000,'SalesPrice':null}
I'm trying to take the ProductId, Size and RegularPrice, for example, and dump them into a table. I'd even be OK taking all the keys and values and dumping them into a dataframe so I can play around with it in Excel.
I've tried the following, but I get an error (the dataframe is added in the last block):
import urllib2
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
link=requests.get("https://www.iga.net/en/online_grocery/frozen_grocery")
soup = BeautifulSoup(link.content, 'html.parser')
##print soup.prettify()
bowl=soup.find_all('div',class_='js-product js-equalized js-addtolist-container js-ga')
for bowls in bowl:
    list = bowls.get('data-product')
    df = pd.DataFrame(list)
    print df
This results in an error. Any help is appreciated. I'm a rookie to this.
You need to convert each data-product attribute to a valid Python dictionary, then merge them all into one dictionary; after that, you can convert it to a dataframe like this:
import urllib2
import requests
import pandas as pd
from bs4 import BeautifulSoup
import pickle
import json
import collections
link=requests.get("https://www.iga.net/en/online_grocery/frozen_grocery")
soup = BeautifulSoup(link.content, 'html.parser')
##print soup.prettify()
bowl=soup.find_all('div',class_='js-product js-equalized js-addtolist-container js-ga')
super_dict = collections.defaultdict(list)
for bowls in bowl:
    data = bowls.get('data-product')
    data = data.replace("\'", "\"")  # json.loads accepts only double quotes for JSON properties, so replace ' with "
    dict_data = json.loads(data)  # convert to a valid Python dictionary
    for k, v in dict_data.iteritems():  # dict_data.items() in Python 3+
        super_dict[k].append(v)  # merge all dictionaries into one
df = pd.DataFrame(dict(super_dict))
df
Output will be the dataframe you want:
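A simpler variant of the same idea is to collect each parsed dict in a list and pass the list straight to pd.DataFrame, which builds one row per dict. A minimal sketch, using shortened sample strings in place of the live scraped data:

```python
import json
import pandas as pd

# Two sample data-product strings, shortened from the output shown above
raw_items = [
    '{"ProductId": "00000_000000005500059917", "Size": "10 x 45 ml", "RegularPrice": 6.49}',
    '{"ProductId": "00000_000000005574253356", "Size": "283 ml", "RegularPrice": 1.79}',
]

rows = [json.loads(item) for item in raw_items]  # each string becomes one dict
df = pd.DataFrame(rows)  # one row per dict; keys become columns

print(df[["ProductId", "Size", "RegularPrice"]])
```

This avoids the defaultdict merging step entirely, at the cost of holding all rows in a list first.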
Update:
If you want to view the dataframe in an Excel file, you can write it out with the code below:
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Then you can open file pandas_simple.xlsx to check the data in excel format.
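Note that recent pandas versions removed writer.save(); a context manager (or a plain df.to_excel(path)) handles closing the writer for you. A minimal sketch, assuming an Excel engine such as openpyxl or xlsxwriter is installed:

```python
import pandas as pd

# A tiny sample frame standing in for the scraped data above
df = pd.DataFrame({"ProductId": ["00000_000000005500059917"], "RegularPrice": [6.49]})

# The context manager closes the writer and flushes the file on exit,
# so no explicit save() or close() call is needed
with pd.ExcelWriter("pandas_simple.xlsx") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
```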
Related
I want to create a script that fetches all the data from the following website, https://www.bis.doc.gov/dpl/dpl.txt, stores it in an Excel file, and counts the number of records in it, using Python. I've tried to achieve this with the following code:
import requests
import re
from bs4 import BeautifulSoup
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
I've fetched the data but don't know the next step of storing it as an Excel file. Can anyone guide me or share ideas? Thank you in advance!
You can do it easily with pandas, since the data is in tab-separated-values format.
Note: openpyxl needs to be installed for this to work.
import requests
import io
import pandas as pd
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
df = pd.read_csv(io.StringIO(page.text), sep="\t")
df.to_excel(r'i_data.xlsx', index = False)
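For the record count the question also asks about, len(df) gives the number of data rows once the file is loaded. A small offline sketch with a made-up tab-separated sample (the real dpl.txt has its own columns):

```python
import io
import pandas as pd

# A tiny tab-separated sample standing in for the downloaded dpl.txt content
tsv_text = "Name\tCity\nAlpha\tOslo\nBeta\tCairo\n"

df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
record_count = len(df)  # number of data rows, header excluded
print(record_count)
```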
I am very new to using Beautiful Soup and I'm trying to import data from the URL below as a pandas dataframe.
However, the final result has the correct column names, but no numbers in the rows.
What should I be doing instead?
Here is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]
url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)
The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to csv:
import json
import requests
import pandas as pd
data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')
This saves data.csv (screenshot from LibreOffice):
The website you're trying to scrape data from is rendering the table values dynamically and using requests.get will only return the HTML the server sends prior to JavaScript rendering.
You will have to find an alternative way of accessing the data or render the webpage's JS (see this example).
A common way of doing this is to use selenium to automate a browser which allows you to render the JavaScript and get the source code that way.
Here is a quick example:
import time
import pandas as pd
from selenium.webdriver import Chrome
#Request the dynamically loaded page source
c = Chrome(r'/path/to/webdriver.exe')
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')
#Wait for it to render in browser
time.sleep(5)
html_data = c.page_source
#Load into pd.DataFrame
tables = pd.read_html(html_data)
df = tables[0]
df.columns = df.columns.droplevel() #Convert the MultiIndex to an Index
Note that I didn't use BeautifulSoup, you can directly pass the html to pd.read_html. You'll have to do some more cleaning from there but that's the gist.
Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or search for a way to access the data as JSON or .csv from elsewhere and use that, etc.
I am trying to import a table from a website and afterwards transform the data into a pandas dataframe.
The website is: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
That's my code so far:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
website_url = requests.get(
'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_url,'lxml')
My_table = soup.find('table',{'class':'wikitable sortable'})
for x in soup.find_all('table',{'class':'wikitable sortable'}):
    table = x.text
print(My_table)
print(table)
Output of print(My_table)
Output of print(table)
How do I convert this webpage table to a pandas dataframe?
Have you tried
pd.read_html()
?
Also, since the table is very standard, why not copy it directly into Excel and import it as a DataFrame?
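To illustrate, here is a minimal sketch of pd.read_html on a small inline table (the real Wikipedia page returns several tables, so you would pick the right index from the resulting list):

```python
import io
import pandas as pd

# A minimal HTML table standing in for the Wikipedia page content
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M1A</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```

With the live page you would pass the fetched HTML text (wrapped in io.StringIO) instead of the sample string.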
I'm trying to pull the data from the table at this site and save it in a CSV with the column 'ticker' included. Right now my code is this:
import requests
import pandas as pd
url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database#marketCap=mid|stages=approved,crl'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[0]
print (df)
df.to_csv('my data.csv')
and it results in a file that looks like this.
I want to have the 'ticker' column in my CSV file with the corresponding ticker listed for each company. The ticker is in the HTML here (class="ticker--small"). The output should look like this.
I'm totally stuck on this. I've tried doing it in BeautifulSoup too but I can't get it working. Any help would be greatly appreciated.
The page has multiple tables; use BeautifulSoup to extract them and loop over them to write the CSV.
from bs4 import BeautifulSoup
import requests, lxml
import pandas as pd
url = 'https://www.biopharmcatalyst.com/biotech-stocks/company-pipeline-database#marketCap=mid|stages=approved,crl'
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
tables = soup.findAll('table')
for table in tables:
    df = pd.read_html(str(table))[0]
    with open('my_data.csv', 'a+') as f:
        df.to_csv(f)
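As a sketch of attaching the tickers themselves: pull the class="ticker--small" values with BeautifulSoup and add them as a column. The HTML below is a made-up snippet using the class name from the question; the real page's markup may differ:

```python
import io
import pandas as pd
from bs4 import BeautifulSoup

# A made-up snippet using the ticker--small class mentioned in the question;
# the real page's markup may differ
html = """
<table>
  <tr><th>Company</th></tr>
  <tr><td>Acme Bio <span class="ticker--small">ACME</span></td></tr>
  <tr><td>Beta Pharma <span class="ticker--small">BETA</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One ticker per row, in the same order as the table rows
tickers = [el.get_text(strip=True) for el in soup.select(".ticker--small")]

df = pd.read_html(io.StringIO(html))[0]
df["ticker"] = tickers
print(df)
```

This relies on the tickers appearing in the same order as the table rows, which holds when each row contains exactly one ticker element.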
I download and scrape a webpage for some data in TSV format. Around the TSV data is HTML that I don't want.
I download the html for the webpage, and scrape out the data I want, using beautifulsoup.
However, I've now got the TSV data in memory.
How can I use this TSV data in memory with pandas? Every method I can find seems to want to read from file or URI rather than from data I've already scraped in.
I don't want to download text, write it to file, and then rescrape it.
#!/usr/bin/env python2
from pandas import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2
def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string
    data = p.LOAD_CSV(tab_sepd_vals)
    process(data)
If you feed the text/string version of the data into a StringIO.StringIO (or io.StringIO in Python 3.X), you can pass that object to the pandas parser. So your code becomes:
#!/usr/bin/env python2
import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2
import StringIO
def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string
    # make the StringIO object
    tsv = StringIO.StringIO(tab_sepd_vals)
    # something like this
    data = p.read_csv(tsv, sep='\t')
    # then what you had
    process(data)
Methods like read_csv do two things: they parse the CSV and construct a DataFrame object. So in your case you might want to construct the DataFrame directly:
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]])
>>> print(df)
0 1
0 a 1
1 b 2
2 c 3
The constructor accepts a variety of data structures.
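For instance, the same kind of frame can come from a dict of columns or from a list of row dicts (a quick sketch):

```python
import pandas as pd

# From a dict of columns: each key is a column name, each value a column of data
df_cols = pd.DataFrame({"letter": ["a", "b", "c"], "number": [1, 2, 3]})

# From a list of row dicts: keys become columns, one row per dict
df_rows = pd.DataFrame([{"letter": "a", "number": 1}, {"letter": "b", "number": 2}])

print(df_cols)
print(df_rows)
```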