Web Mining with Python

I am planning to scrape exchange rates with Python. After I get the raw data from HTML pages, what kind of processing will I need to prepare it for output/visualization? Will I need text processing, NLP algorithms, graph processing, or data cleaning?

I don't know exactly what you need, but based on your comment, you can use the following code to extract all the data from that page:
import urllib
import bs4

# Download the XML page; the TCMB feed is Windows-1252 encoded
raw = urllib.urlopen('http://www.tcmb.gov.tr/kurlar/201501/02012015.xml').read().decode('Windows-1252')
soup = bs4.BeautifulSoup(raw, 'html.parser')
# Flatten the whole document to plain text, separating nodes with spaces
data = soup.get_text(' ')
print(data)
This script was written for Python 2.7, and you need to install beautifulsoup4.
Alternatively, you can use the code below (Python 3); it extracts the rates for the US dollar:
import urllib.request
import xml.etree.ElementTree as ET

# Download the XML file and save a local copy
content = urllib.request.urlopen('http://www.tcmb.gov.tr/kurlar/201501/02012015.xml').read()
with open('data.xml', 'w+b') as f:
    f.write(content)

# Parse the saved file; the first child holds the first currency's fields
tree = ET.parse('data.xml')
root = tree.getroot()
for field in root[0]:
    print(field.text)
Or you can extract all the ForexBuying rates:
for rate in root.iter('ForexBuying'):
    print(rate.text)
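If you also need to know which currency each rate belongs to, here is a minimal sketch, assuming (as in the TCMB feed) that each rate sits inside a Currency element carrying a CurrencyCode attribute:
# Pair each ForexBuying rate with its currency code
for currency in root.iter('Currency'):
    code = currency.get('CurrencyCode')
    buying = currency.findtext('ForexBuying')
    print(code, buying)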

Related

How to import a file with extension .A?

I downloaded a file with extension .A which contains a time series I would like to work on in Python. I'm not an expert at all with .A files, but if I open it with a notepad I see it contains the data I'd like to work on. How can I convert that file in Python into something I can work with (i.e. an array, a pandas series...)?
import requests

# Fetch the series directly from the ECB Statistical Data Warehouse REST API
response = requests.get("https://sdw-wsrest.ecb.europa.eu/service/data/EXR/D.USD.EUR.SP00.A?startPeriod=2021-02-20&endPeriod=2021-02-25")
data = response.text
You need to read up on parsing XML. This code will get the data into a data structure typical for XML. You may mangle it as you see fit from there. You need to provide more information about how you'd like these data to look in order to get a more complete answer.
import requests
import xml.etree.ElementTree as ET

response = requests.get("https://sdw-wsrest.ecb.europa.eu/service/data/EXR/D.USD.EUR.SP00.A?startPeriod=2021-02-20&endPeriod=2021-02-25")
data = response.text
# Parse the SDMX-ML response into an ElementTree
root = ET.fromstring(data)
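From there, here is a minimal sketch of one way to flatten the response into (date, value) pairs. It assumes the default SDMX-ML generic-data format, in which each observation is an Obs element with ObsDimension and ObsValue children; adjust the tag names if your response differs.
def local(tag):
    # Strip the {namespace} prefix that ElementTree puts on tag names
    return tag.rsplit('}', 1)[-1]

# Collect (date, value) pairs from the observations
pairs = []
for obs in root.iter():
    if local(obs.tag) == 'Obs':
        date = value = None
        for child in obs:
            if local(child.tag) == 'ObsDimension':
                date = child.get('value')
            elif local(child.tag) == 'ObsValue':
                value = child.get('value')
        pairs.append((date, value))
print(pairs)
A list of pairs like this can then be turned into, say, a pandas Series with pandas.Series(dict(pairs)).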

Replace code with no python external library

This code uses the external Python library lxml, but I need to change it so that it works with Python's standard library plus the requests library only, because Zapier doesn't support external libraries.
import requests
import lxml.html

# download & parse web page
doc = requests.get('https://www.makeuseof.com/feed/')
parser = lxml.html.fromstring(doc.content)

# find all image tags and collect their src attributes
la = []
for t in parser.xpath('//img'):
    la.append(t.get("src"))

# remove duplicates and print
for cell in set(la):
    if cell is not None:
        print(cell)
PS: I have heard urllib can do this, but I don't know how. Also, the above code only finds the images from the feed page.
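A minimal sketch of one way to do this with only the standard library plus requests: the stdlib html.parser module can pull out the img tags. This assumes the feed embeds its images as <img> tags inside the item HTML; HTMLParser is lenient enough to find them even in an RSS/XML document.
import requests
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    # Collect the src attribute of every <img> tag encountered
    def __init__(self):
        super().__init__()
        self.srcs = set()

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.srcs.add(src)

doc = requests.get('https://www.makeuseof.com/feed/')
collector = ImgSrcCollector()
collector.feed(doc.text)
for src in collector.srcs:
    print(src)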

HTML hidden elements

I'm trying to code a little "GPS", and I couldn't use the Google API because of its daily quota restriction.
I decided to use the site "viamichelin", which provides the distance between two addresses. I wrote a little script to build all the URLs I needed, like this:
import pandas
import numpy as np

# Load client and agency addresses from Excel (raw strings avoid backslash issues)
df = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
clients = np.squeeze(np.asarray(df.as_matrix(columns=None)))
agences = np.squeeze(np.asarray(df2.as_matrix(columns=None)))

comptetotal = 0
for agence in agences:
    for client in clients:
        print agence
        print client
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agence + '&arrival=' + client + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print url
        comptetotal += 1
All my data is in Excel, which is why I used the pandas library. I now have all the URLs needed for my project.
However, I would like to extract the number of kilometres, and there's a problem: the information I need doesn't appear in the page source, so I can't extract it with Python... The site is presented like this:
[screenshot: Michelin]
When I click on "Inspect" I can find the information I need in the inspector, but not in the page source... Can someone help me?
[screenshot: Itinerary]
I have already tried this, without success:
import requests
from bs4 import BeautifulSoup

requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the page in the browser's inspector, the actual routing is done via a JavaScript invocation to a separate, rather long URL (visible in the network tab).
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
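As a hedged sketch of that approach (the routing URL below is a placeholder for whatever you see in the network tab, and a JavaScript object literal is not always strict JSON, so the parsing step may need adjusting):
import json
import re
import requests

# Hypothetical: replace with the long routing URL from the network tab
response = requests.get('https://fr.viamichelin.be/...')
text = response.text
# Strip the _scriptLoaded( ... ) wrapper and parse what is inside
match = re.search(r'_scriptLoaded\((.*)\)\s*;?\s*$', text, re.DOTALL)
if match:
    payload = json.loads(match.group(1))
    print(type(payload))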

Extracting text/numbers from HTML list using Python requests and lxml

I am trying to extract the 'Seller rank' from items on Amazon using Python requests and lxml. So:
<li id="SalesRank">
<b>Amazon Bestsellers Rank:</b>
957,875 in Books (See Top 100 in Books)
From this example, 957875 is the number I want to extract.
(Please note, the actual HTML has about 100 blank lines between 'Amazon Bestsellers Rank:' and '957875'. I'm unsure if this is affecting my result.)
My current Python code is set up like so:
import re
import requests
from lxml import html

page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)
salesrank = tree.xpath('//li[@id="SalesRank"]/text()')
print 'Sales Rank:', salesrank
and the printed output is
Sales Rank: []
I was expecting to receive the full list of text data, including all the blank lines, which I would parse later.
Am I correct in assuming that /text() is not the correct use in this instance and I need to put something else?
Any help is greatly appreciated.
You are getting an empty list because a single call to the URL does not give you the complete data of the web page. Instead, you can stream the URL and read the data in small chunks, then look for the required element in each non-empty chunk. The code for this is:
import requests as rq
import re
from bs4 import BeautifulSoup as bs

r = rq.get('http://www.amazon.in/gp/product/0007950306/ref=s9_al_bw_g14_i1?pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-3&pf_rd_r=1XBKB22RGT2HBKH4K2NP&pf_rd_t=101&pf_rd_p=798805127&pf_rd_i=4143742031', stream=True)
for chunk in r.iter_content(chunk_size=1024):
    if chunk:
        data = chunk
        soup = bs(data)
        elem = soup.find_all('li', attrs={'id': 'SalesRank'})
        if elem != []:
            s = re.findall('#[\d+,*]*\sin', str(elem[0]))
            print s[0].split()[0]
            break
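For comparison, here is a minimal lxml-based sketch of the extraction the question attempted, assuming the page actually serves the SalesRank element to non-browser clients (Amazon does not always do so); the blank lines around the number are collapsed before matching:
import re
import requests
from lxml import html

page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)
# Join every text node under the list item, blank lines included
text = ' '.join(tree.xpath('//li[@id="SalesRank"]//text()'))
match = re.search(r'([\d,]+)\s+in', text)
if match:
    print(match.group(1).replace(',', ''))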

Wrong output using tree.xpath

I am a beginner at data scraping. I want to extract all the marathon event names from a Wikipedia page.
For this I have written a small script:
from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)
events = tree.xpath('//td/a[@class="new"]/text()')
print events
The problem is that when I execute this code, an empty list comes out as the output. What is the problem with this code? I would be grateful if anyone could help me correct it or find the mistakes in my code.
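One likely issue, as a hedged note: in Wikipedia's markup, class="new" marks red links to articles that don't exist yet, so that XPath matches only a small subset of events (possibly none). A minimal sketch that instead grabs every link inside a table cell, to be narrowed as needed:
from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)
# All link texts inside table cells; filter or tighten the XPath as needed
events = tree.xpath('//td/a/text()')
print(events)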
