Display XML tree structure with BeautifulSoup - python

When working with a new XML structure, it is always helpful to see the big picture first.
When loading it with BeautifulSoup:
import requests, bs4
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml')
print(x)
is there a built-in way to display its tree structure with different depths?
Example for https://www.w3schools.com/xml/cd_catalog.xml, with maxdepth=0, it would be:
CATALOG
with maxdepth=1, it would be:
CATALOG
CD
CD
CD
...
and with maxdepth=2, it would be:
CATALOG
CD
TITLE
ARTIST
COUNTRY
COMPANY
PRICE
YEAR
CD
TITLE
ARTIST
COUNTRY
COMPANY
PRICE
YEAR
...

Here's a quick way to do it: Use the prettify() function to structure it, then get the indentation and opening tag names via regex (catches uppercase words inside opening tags in this case). If the indentation from pretify() meets the depth specification, then print it with the specified indentation size.
import requests, bs4
import re
maxdepth = 1
indent_size = 2
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = bs4.BeautifulSoup(s, 'xml').prettify()
for line in x.split("\n"):
match = re.match("(\s*)<([A-Z]+)>", line)
if match and len(match.group(1)) <= maxdepth:
print(indent_size*match.group(1) + match.group(2))

I have used xmltodict 0.12.0 (installed via anaconda), which did the job for xml parsing, not for depth-wise viewing though. Works much like any other dictionary. From here a recursion with depth counting should be a way to go.
import requests, xmltodict, json
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
x = xmltodict.parse(s, process_namespaces=True)
for key in x:
print(json.dumps(x[key], indent=4, default=str))

Here is one solution without BeautifulSoup.
import requests
s = requests.get('https://www.w3schools.com/xml/cd_catalog.xml').text
array = []
tab_size = 2
target_depth = 2
for element in s.split('\n'):
depth = (len(element) - len(element.lstrip())) / tab_size
if depth <= target_depth:
print(' ' * int(depth) + element)

Related

Python requests returns incomprehensible content

I'm trying to parse the site. I don't want to use selenium. Requests is coping. BUT! something strange is happening. I can't cut out the text I need with a regular expression (and it's there - you can see it if you do print(data.text)) But re doesn't see him. If this text is copied to notepad++, it outputs this - it sees these characters as a single line.
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
print(data.text)
What is it and how to work with it?pay attention to the line numbers
You can try to use their Ajax API to load all usernames + thumb images:
import pandas as pd
import requests
url = 'https://ru.runetki3.com/tools/listing_v3.php?livetab=female&offset=0&limit={}'
headers = {'X-Requested-With': 'XMLHttpRequest'}
all_data = []
for p in range(1, 4): # <-- increase number of pages here
data = requests.get(url.format(p * 144), headers=headers).json()
for m in data['models']:
all_data.append((m['username'], m['display_name'], m['thumb_image'].replace('{ext}', 'jpg')))
df = pd.DataFrame(all_data, columns=['username', 'display_name', 'thumb'])
print(df.head())
Prints:
username display_name thumb
0 wetlilu Little_Lilu //i.bimbolive.com/live/034/263/131/xbig_lq/c30823.jpg
1 mellannie8 mellannieSEX //i.bimbolive.com/live/034/24f/209/xbig_lq/314348.jpg
2 mokkoann mokkoann //i.bimbolive.com/live/034/270/279/xbig_lq/cb25cb.jpg
3 ogurezzi CynEp-nuCbka //i.bimbolive.com/live/034/269/02c/xbig_lq/3ebe2a.jpg
4 Pepetka22 _-Katya-_ //i.bimbolive.com/live/034/24f/36e/xbig_lq/18da8e.jpg
Avoid using . in a regex unless you really want to get any character; here, the usernames (as far as I can see) only contain - and alphanumeric characters, so you can retrieve them with:
re.findall(r'"username":"([\w|-]+)"',data.text)
An even simpler way, which will remove the need to deal with special characters by getting all characters except " is:
re.findall(r'"username":"([^"]+)"',data.text)
So here's a way of getting the info you seek (I joined them into a dictionary, but you can change that to whatever you prefer):
import requests
import re
data = requests.get('https://ru.runetki3.com/?page=1')
with open ("return.txt",'w', encoding = 'utf-8') as f:
f.write(data.text)
names = re.findall(r'"username":"([^"]+)"',data.text)
disp_names = re.findall(r'"display_name":"([^"]+)"',data.text)
thumbs = re.findall(r'"thumb_image":"([^"]+)"',data.text)
names_dict = {name:[disp, thumb.replace('{ext}', 'jpg')] for name, disp, thumb in zip(names, disp_names, thumbs)}
Example
names_dict['JuliaCute']
# ['_Cute',
# '\\/\\/i.bimbolive.com\\/live\\/055\\/2b0\\/15d\\/xbig_lq\\/d89ef4.jpg']

How can I edit web scraped text data using python?

Trying to build my first webscraper to print out how the stock market is doing on Yahoo finance. I have found out how to isolate the information I want but it returns super sloppy. How can I manipulate this data to present in an easier way?
import requests
from bs4 import BeautifulSoup
#Import your website here
html_text = requests.get('https://finance.yahoo.com/').text
soup = BeautifulSoup(html_text, 'lxml')
#Find the part of the webpage where your information is in
sp_market = soup.find('h3', class_ = 'Maw(160px)').text
print(sp_market)
The return here is : S&P 5004,587.18+65.64(+1.45%)
I want to grab these elements such as the labels and percentages and isolate them so I can print them in a way I want. Anyone know how? Thanks so much!
edit:
((S&P 5004,587.18+65.64(+1.45%)))
For simple splitting you could use the .split(separator) method that is built-in. (f.e. First split by 'x', then split by 'y', then split by 'z' with x, y, z being seperators). Since this is not efficient and if you have bit more complex regular expressions that look the same way for different elements (here: stocks) then take a look at the python regex module.
string = "Stock +45%"
pattern = '[a-z]+[0-9][0-9]'
Then, consider to use a function like find_all oder search.
I assume that the format is always S&P 500\n[number][+/-][number]([+/-][number]%).
If that is the case, we could do the following.
import re
# [your existing code]
# e.g.
# sp_market = 'S&P 500\n4,587.18+65.64(+1.45%)'
label,line2 = sp_market.split('\n')
pm = re.findall(r"[+-]",line2)
total,change,percent,_ = re.split(r"[\+\-\(\)%]+",line2)
total = float(''.join(total.split(',')))
change = float(change)
if pm[0]=='-':
change=-change
percent = float(percent)
if pm[1]=='-':
percent=-percent
print(label, total,change,percent)
# S&P 500 4587.18 65.64 1.45
Not sure, cause question do not provide an expected result, but you can "isolate" the information with stripped_strings.
This will give you a list of "isolated" values you can process:
list(soup.find('h3', class_ = 'Maw(160px)').stripped_strings)
#Output
['S&P 500', '4,587.18', '+65.64', '(+1.45%)']
For example stripping following characters "()%":
[x.strip('\(|\)|%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings]
#Output
['S&P 500', '4,587.18', '+65.64', '+1.45']
Simplest way to print the data not that sloppy way, is to join() the values by whitespace:
' '.join([x.strip('\(|\)|%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])
#Output
S&P 500 4,587.18 +65.64 +1.45
You can also create dict() and print the key / value pairs:
for k, v in dict(zip(['Symbol','Last Price','Change','% Change'], [x.strip('\(|\)|%') for x in soup.find('h3', class_ = 'Maw(160px)').stripped_strings])).items():
print(f'{k}: {v}')
#Output
Symbol: S&P 500
Last Price: 4,587.18
Change: +65.64
% Change: +1.45

Turning text into a dictionary

I've successfully extracted my sitemap, and I would like to turn the urls into a list. I can't quite figure out how to do that, separating the https from the dates modified. Ideally I would also like to turn it into a dictionary, with the associated date stamp. In the end, I plant to iterate over the list and create text files of the web pages, and save the date time stamp at the top of the text file.
I will settle for the next step of turning this into a list. This is my code:
import urllib.request
import inscriptis
from inscriptis import get_text
sitemap = "https://grapaes.com/sitemap.xml"
i=0
url = sitemap
html=urllib.request.urlopen(url).read().decode('utf-8')
text=get_text(html)
dicto = {text}
print(dicto)
for i in dicto:
if i.startswith ("https"):
print (i + '/n')
The output is basically a row with the date stamp, space, and the url.
You can split the text around whitespaces first, then proceed like this:
text = text.split(' ')
dicto = {}
for i in range(0, len(text), 2):
dicto[text[i+1]] = text[i]
gives a dictionary with timestamp as key and URL as value, as follows:
{
'2020-01-12T09:19+00:00': 'https://grapaes.com/',
'2020-01-12T12:13+00:00': 'https://grapaes.com/about-us-our-story/',
...,
'2019-12-05T12:59+00:00': 'https://grapaes.com/211-retilplast/',
'2019-12-01T08:29+00:00': 'https://grapaes.com/fruit-logistica-berlin/'
}
I believe you can do further processing from here onward.
In addition to the answer above: You could also use an XML Parser (standard module) to achieve what you are trying to do:
# Save your xml on disk
with open('sitemap.xml', 'w') as f:
f.write(text)
f.close()
# Import XML-Parser
import xml.etree.ElementTree as ET
# Load xml and obtain the root node
tree = ET.parse('sitemap.xml')
root_node = tree.getroot()
From here you can access your xml's nodes just like every other list-like object:
print(root_node[1][0].text) # output: 'https://grapaes.com/about-us-our-story/'
print(root_node[1][1].text) # output: '2020-01-12T12:13+00:00'
Creating a dict from this is as easy as that:
dicto = dict()
for child in root_node:
dicto.setdefault(child[0], child[1])

Can't isolate desired results out of crude ones

I've created a script in python to get the name of neighbors from a webpage. I've used requests library along with re module to parse the content from some script tag out of that site. when I run the script I get the name of neighbors in the right way. However, the problem is i've used this line if not item.startswith("NY:"):continue to get rid of unwanted results from that page. I do not wish to use this hardcoded portion NY: to do this trick.
website link
I've tried with:
import re
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
data = json.loads(re.findall(r'data-hypernova-key[^{]+(.*)--></script>',resp.text)[0])
items = data['searchPageProps']['filterPanelProps']['filterInfoMap']
for item in items:
if not item.startswith("NY:"):continue
print(item)
Result I'm getting (desired result):
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
If I do not use this line if not item.startswith("NY:"):continue, the results are something like:
rating
NY:New_York:Brooklyn:Mill_Basin
NY:New_York:Bronx:Edenwald
NY:New_York:Staten_Island:Stapleton
NY:New_York:Staten_Island:Lighthouse_Hill
NY:New_York:Queens:Rochdale
NY:New_York:Queens:Pomonok
BusinessParking.validated
food_court
NY:New_York:Queens:Little_Neck
The bottom line is I wish to get everything started with NY:New_York:. What I meant by unwanted results are rating, BusinessParking.validated, food_court and so on.
How can I get the neighbors without using any hardcoded portion of search within the script?
I'm not certain what your complete data set looks like, but based on your sample,
you might use something like:
if ':' not in item:
continue
# or perhaps:
if item.count(':') < 3:
continue
# I'd prefer a list comprehension if I didn't need the other data
items = [x for x in data['searchPageProps']['filterPanelProps']['filterInfoMap'] if ':' in x]
If that doesn't work for what you're trying to achieve then you could just use a variable for the state.
Another solution - using BeautifulSoup - which doesn't involve regex or hardcoding "NY:New_York" is below; it's convoluted, but mainly because Yelp buried it's treasure several layers deep...
So for future reference:
from bs4 import BeautifulSoup as bs
import json
import requests
link = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&start=1'
resp = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
target = soup.find_all('script')[14]
content = target.text.replace('<!--','').replace('-->','')
js_data = json.loads(content)
And now the fun of extracting NYC info from the json begins....
for a in js_data:
if a == 'searchPageProps':
level1 = js_data[a]
for b in level1:
if b == 'filterPanelProps':
level2 = level1[b]
for c in level2:
if c == 'filterSets':
level3 = level2[c][1]
for d in level3:
if d == 'moreFilters':
level4 = level3[d]
for e in range(len(level4)):
print(level4[e]['title'])
print(level4[e]['sectionFilters'])
print('---------------')
The output is the name of each borough plus a list of all neighborhoods in that borough. For example:
Manhattan
['NY:New_York:Manhattan:Alphabet_City',
'NY:New_York:Manhattan:Battery_Park',
'NY:New_York:Manhattan:Central_Park', 'NY:New_York:Manhattan:Chelsea',
'...]
etc.

Python/lxml/Xpath: How do I find the row containing certain text?

Given the URL http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView , how would you capture and print the contents of an entire row of data?
For example, what would it take to get an output that looked something like:
"Cash & Short Term Investments 144,841 169,760 189,252 86,743 57,379"? Or something like "Property, Plant & Equipment - Gross 725,104 632,332 571,467 538,805 465,493"?
I've been introduced to the basics of Xpath through sites http://www.techchorus.net/web-scraping-lxml . However, the Xpath syntax is still largely a mystery to me.
I already have successfully done this in BeautifulSoup. I like the fact that BeautifulSoup doesn't require me to know the structure of the file - it just looks for the element containing the text I search for. Unfortunately, BeautifulSoup is too slow for a script that has to do this THOUSANDS of times. The source code for my task in BeautifulSoup is (with title_input equal to "Cash & Short Term Investments"):
page = urllib2.urlopen (url_local)
soup = BeautifulSoup (page)
soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
list_output = soup_line_item.findAll('td') # List of elements
So what would the equivalent code in lxml be?
EDIT 1: The URLs were concealed the first time I posted. I have now fixed that.
EDIT 2: I have added my BeautifulSoup-based solution to clarify what I'm trying to do.
EDIT 3: +10 to root for your solution. For the benefit of future developers with the same question, I'm posting here a quick-and-dirty script that worked for me:
#!/usr/bin/env python
import urllib
import lxml.html
url = 'balancesheet.html'
result = urllib.urlopen(url)
html = result.read()
doc = lxml.html.document_fromstring(html)
x = doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
print x
In [18]: doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
Out[18]: [' 144,841', ' 169,760', ' 189,252', ' 86,743', ' 57,379']
or you can define a little function to get the rows by text:
In [19]: def func(doc,txt):
...: exp=u'.//th[div[text()="{0}"]]'\
...: u'/following-sibling::td/text()'.format(txt)
...: return [i.strip() for i in doc.xpath(exp)]
In [20]: func(doc,u'Total Accounts Receivable')
Out[20]: ['338,594', '270,133', '214,169', '244,940', '236,331']
or you can get all the rows to a dict:
In [21]: d={}
In [22]: for i in doc.xpath(u'.//tbody/tr'):
...: if len(i.xpath(u'.//th/div/text()')):
...: d[i.xpath(u'.//th/div/text()')[0]]=\
...: [e.strip() for e in i.xpath(u'.//td/text()')]
In [23]: d.items()[:3]
Out[23]:
[('Accounts Receivables, Gross',
['344,241', '274,894', '218,255', '247,600', '238,596']),
('Short-Term Investments',
['27,165', '26,067', '24,400', '851', '159']),
('Cash & Short Term Investments',
['144,841', '169,760', '189,252', '86,743', '57,379'])]
let html holds the html source code:
import lxm.html
doc = lxml.html.document_fromstring(html)
rows_element = doc.xpath('/html/body/div/div[2]/div/div[5]/div/div/table/tbody/tr')
for row in rows_element:
print row.text_content()
not tested but should work
P.S.Install xpath cheker or firefinder in firefox to help you with xpath

Categories