I'm having a hard time extracting data from an HTTP request response.
Can somebody help me? Here's part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
>>> 42136
The value 42136 means that the string 'loginfield' exists in the response text, starting at that offset. But how do I extract specific strings from it?
For example, I want to extract this exact string:
<title>Some title here</title>
or this one:
<div id='bla...' # keep extracting from here until the point where I want it to stop
Does anybody have an idea how I should approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, 'html.parser')  # name the parser explicitly to avoid a warning and parser-dependent results
print(soup.find('title').text)
Should print:
Some title here
Note that find() returns only the first matching tag, so this prints the first title on the page.
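Where installing BeautifulSoup is not an option, roughly the same first-title lookup can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the BeautifulSoup API; TitleParser, first_title and sample_html are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting for the first <title> we encounter.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def first_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title


# Made-up HTML standing in for r.text
sample_html = (
    "<html><head><title>Some title here</title></head>"
    "<body><div id='bla'>content</div></body></html>"
)
print(first_title(sample_html))
```

For anything beyond a single tag, BeautifulSoup is likely to be less work than growing a parser like this by hand.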
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful Soup. Your program will be less fragile and more maintainable that way.
str.find returns -1 if the string does not exist.
There is no string "loginfield" in the page you retrieved.
Once you have the correct index for your string, the returned value is the position of the first char of that string.
Since you edited your question:
>>> r.text.find('loginfield')
42136
That means the string "loginfield" starts at offset 42136 in the text. You could display, say, 200 chars starting at that position like this:
>>> print(r.text[42136:42136+200])
To find the various values you are looking for, you have to figure out where they are relative to that position.
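As a minimal sketch of that offset-based approach: find the start marker, then search for the end marker from that position onward, and slice between the two. The helper name extract_between and the sample_html string are assumptions made for illustration; in the question the text would come from r.text.

```python
def extract_between(text, start_marker, end_marker):
    """Return the slice from start_marker through end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return None
    return text[start:end + len(end_marker)]


# Made-up HTML standing in for r.text
sample_html = (
    "<html><head><title>Some title here</title></head>"
    "<body><div id='bla'>content</div></body></html>"
)
print(extract_between(sample_html, "<title>", "</title>"))
print(extract_between(sample_html, "<div id='bla'", "</div>"))
```

This works for flat markup, but it stops at the first closing marker, so nested tags of the same name will be cut short; that is one reason a real HTML parser is the more robust choice.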
I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, so that the string argument can't find it? An example of a section on the page that I AM able to find has html like the one below, and I can easily find it using the string argument and re.compile with "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking. Just remove the regular expression part, take the text and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
soup.find('section', attrs={'class':'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text that has "sentinel" in it, preceded by at least one character. Be careful: the text may start with characters such as spaces, which is why there is a . at the beginning of the regex. Note that . does not match a newline by default, so you might want a more robust regex, which you can test here:
https://regex101.com/
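To address the worry about the periods in the filename, here is a small sketch of partial matching with the re module on its own. The filename string is copied from the question with surrounding newlines added as an assumption about how the text node looks. The BeautifulSoup documentation describes regex filters as being applied with search(), so the first two checks mirror that behaviour.

```python
import re

# Text node as it might appear inside the section (the newlines are an assumption)
filename = "\nsentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif\n"

# search() finds the pattern anywhere in the string, so a bare substring is enough.
partial = re.compile(r"CIcyano")
print(bool(partial.search(filename)))

# re.escape() makes the periods literal instead of "any character".
literal = re.compile(re.escape(".CIcyano.LakeOkee.tif"))
print(bool(literal.search(filename)))

# match() only succeeds at the start of the string, so it fails here.
print(bool(re.match(r"CIcyano", filename)))
```

The periods therefore do not prevent a match; an unescaped . simply matches more than intended.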
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class':'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break
links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.
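For comparison, the same "first section whose text contains a substring" idea can be sketched with only the standard library's html.parser. This is a stand-in technique, not BeautifulSoup; SectionLinkFinder, first_link_matching, the sample markup, and the example.invalid URLs are all hypothetical.

```python
from html.parser import HTMLParser


class SectionLinkFinder(HTMLParser):
    """Record the first link and the combined text of every <section>."""

    def __init__(self):
        super().__init__()
        self.sections = []    # one dict per section: {"href": ..., "text": ...}
        self._current = None  # the section being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            self._current = {"href": None, "text": ""}
        elif tag == "a" and self._current is not None and self._current["href"] is None:
            self._current["href"] = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "section" and self._current is not None:
            self.sections.append(self._current)
            self._current = None


def first_link_matching(html, needle):
    """Return the href of the first section whose text contains needle, else None."""
    finder = SectionLinkFinder()
    finder.feed(html)
    for section in finder.sections:
        if needle in section["text"]:
            return section["href"]
    return None


sample = """
<section class="onecol habonecol"><h3>Name</h3></section>
<section class="onecol habonecol">
<a href="https://example.invalid/download-a" title="Download"><img src="x.png"/></a>
sentinel-3.2022335.CIcyano.LakeOkee.tif
</section>
"""
print(first_link_matching(sample, "CIcyano"))
```

Returning None when nothing matches avoids the situation in the loop above where s would still refer to the last section even though no filename matched.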
I have an element of type bs4.element.Tag, obtained by web scraping. I usually do json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears in a plain <script> </script> tag, so I had to do scripts = soup.find_all('script') and go through the list until I found the one that interests me: script = scripts[18].
The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo']. Since it is a bs4.element.Tag, I tried script.attrs, which returns {}. Then I tried to convert it to JSON with json.loads(str(script)), which raises the exception 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'.
This is my code:
import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'
response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")
scripts = soup.find_all('script')
script = scripts[18]
print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
print(type(script))
#output: bs4.element.Tag
print(str(json.loads(str(script))))
You can use json module to extract the data, but first it's necessary to locate the right info - you can use re module for that.
For example:
import re
import json
import requests
url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'
txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)
# print(json.dumps(data, indent=4)) # <-- uncomment to see all data
print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])
Prints:
Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
BS4 does not parse JavaScript; from the POV of BS4's Tag object, the text in a <script> tag is, well, just text. I don't have any idea what this script looks like (since you didn't post it and I'm not going to bother trying to find it), but if you expected script['goodsInfo'] to return the value of a JS variable named 'goodsInfo' then, bad news, it's not going to work that way.
Also, JavaScript is not JSON, so the chances that a JS snippet will be valid JSON are rather small, to say the least. The proper syntax to test it would be quite simply the same as the one you used for your first use case, i.e. json.loads(script.text), but I assume that's the first thing you tried ;-)
So, well, I'm afraid you'll have to manually parse this script to extract the relevant part. Depending on what the JS code looks like, it may be a matter of a few lines of basic string parsing / regexp stuff, or it may require a proper JavaScript parser etc.
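As one possible shape for that manual parsing, the sketch below locates a key with a regular expression and then lets json.JSONDecoder.raw_decode read exactly one JSON value starting at that position, which avoids guessing where the closing brace is. It assumes the value after the key happens to be valid JSON; sample_script and extract_json_after are hypothetical and do not reflect the real page.

```python
import json
import re


def extract_json_after(script_text, key):
    """Decode the JSON value that follows `key:` in script_text, or return None."""
    match = re.search(re.escape(key) + r"\s*:\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON document starting at the given index
    # and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Made-up script body; the real one may be structured differently.
sample_script = (
    'var page = { goodsInfo: {"detail": {"goods_name": "Dress", '
    '"tags": ["a", "b"]}}, other: 1 };'
)
data = extract_json_after(sample_script, "goodsInfo")
print(data["detail"]["goods_name"])
```

If the value is a JavaScript literal that is not valid JSON (single quotes, unquoted keys, trailing commas), raw_decode raises json.JSONDecodeError and a real JavaScript parser would be needed instead.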
I am a (very) new Python user, and decided some of my first work would be to grab some lyrics from a forum and sort according to word frequency. I obviously haven't gotten to the frequency part yet, but the following is the code that does not work for obtaining the string values I want, resulting in an "AttributeError: 'ResultSet' object has no attribute 'getText' ":
from bs4 import BeautifulSoup
import urllib.request
url = 'http://www.thefewgoodmen.com/thefgmforum/threads/gdr-marching-songs-section-b.14998'
wp = urllib.request.urlopen(url)
soup = BeautifulSoup(wp.read())
message = soup.findAll("div", {"class": "messageContent"})
words = message.getText()
print(words)
If I alter the code to have getText() operate on the soup object:
words = soup.getText()
I, of course, get all of the string values throughout the webpage, rather than those limited to only the class messageContent.
My question, therefore, is two-fold:
1) Is there a simple way to limit the tag-stripping to only the intended sections?
2) What simple thing do I not understand in that I cannot have getText() operate on the message object?
Thanks.
The message in this case is a BeautifulSoup ResultSet, which is a list of BeautifulSoup Tag(s). What you need to do is call getText on each element of message like so,
words = [item.getText() for item in message]
Similarly, if you are just interested in a single Tag (let's say the first one for the sake of argument), you could get its content with,
words = message[0].getText()
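Once the text of each message is in a plain list of strings, the word-frequency step mentioned in the question could be sketched with collections.Counter. The texts list below is made-up data standing in for [item.getText() for item in message], and word_frequencies is a hypothetical helper.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lowercase words across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Stand-in for the strings extracted from each messageContent div
texts = ["Marching on, marching on", "On the road"]
freq = word_frequencies(texts)
print(freq.most_common(2))
```

The simple [a-z']+ pattern only covers unaccented Latin letters, so lyrics in other alphabets would need a different pattern.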
I am trying to make a simple python script to extract certain links from a webpage. I am able to extract link successfully but now I want to extract some more information like bitrate,size,duration given on that webpage.
I am using the below xpath to extract the above mentioned info
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
What I need now is, for a particular link, the required info in the form of a tuple like (bitrate, size, duration).
The XPath above returns the required info, but it is so badly formatted that I cannot work out any logic to turn it into the format I need.
So, is there any way to get the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup, for example:
import bs4
import urllib.request
soup = bs4.BeautifulSoup(urllib.request.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read(), 'html.parser')
print(soup.find_all('a'))
It also has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for the whole string, or:
info.rfind(" ")
since the translate leaves a space character, though you could replace that with whatever you wanted.
Addl info found here
How are you with regular expressions and Python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the list goes, re.match(regex, info[n]) should suffice; as for the triple, Python's tuple syntax takes care of it. Simply match against members of your info list with re.match.
import re
matching_re = '.*' # placeholder: this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc. for incoming_value_2 and incoming_value_3
triple = (incoming_value_1, incoming_value_2, incoming_value_3)
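To make the regular-expression route concrete, here is a hedged sketch that strips the whitespace from each item and classifies it by pattern. The song_info list is a hypothetical single-song sample modelled on the output in the question, and PATTERNS and to_triple are illustrative names; the patterns are guesses based on values such as '3.71 mb', '192 kbps' and '2:41'.

```python
import re

# Assumed formats, inferred from the sample values in the question
PATTERNS = {
    "bitrate": re.compile(r"\d+\s*kbps", re.IGNORECASE),
    "size": re.compile(r"\d+(?:\.\d+)?\s*mb", re.IGNORECASE),
    "duration": re.compile(r"\d+:\d{2}"),
}


def to_triple(items):
    """Return (bitrate, size, duration), using the first item that fits each pattern."""
    found = {}
    for item in items:
        cleaned = item.strip()  # drop the surrounding \n and \t characters
        for name, pattern in PATTERNS.items():
            if name not in found and pattern.fullmatch(cleaned):
                found[name] = cleaned
    return (found.get("bitrate"), found.get("size"), found.get("duration"))


# Hypothetical text nodes for a single song
song_info = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
print(to_triple(song_info))
```

A field that is missing from the input comes back as None in the tuple, which makes incomplete entries easy to spot.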
Hey all, I am using BeautifulSoup (after unsuccessfully struggling for two days with Scrapy) to scrape StarCraft 2 league data, but I am encountering a problem.
I have this table of results, and I want the string content of all its tags, which I get like this:
from BeautifulSoup import *
from urllib import urlopen
def parseWithSoup(url):
    print "Reading:" , url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    table = bs.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="tblt_table")
    rows = table.findAll(lambda tag: tag.name=='tr')
    rows.pop(0) #first row is header
    for row in rows:
        tags = row.findAll(lambda tag: tag.name=='a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content
if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)
However, the output is as follows:
[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...
My question is: where does the u'' come from (is it from Unicode?) and how can I remove it? I just need the strings that are inside the u''...
The u means Unicode string. It doesn't change anything for you as a programmer and you should just disregard it. Treat them like normal strings. You actually want this u there.
Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any non-ASCII characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the unicode string's encode() method, which returns a byte string.
What you see are Python unicode strings.
Check the Python documentation
http://docs.python.org/howto/unicode.html
in order to deal correctly with unicode strings.
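The question's code is Python 2, where the u prefix marks a unicode string in a printed list. As a small sketch of how this looks in Python 3, where every str is already Unicode and the prefix no longer appears, using one of the names from the output above as sample data:

```python
name = "naniwa"

# In Python 3 the repr of a text string has no u prefix.
print(repr(name))

# encode() turns text into bytes; the b prefix marks a byte string.
encoded = name.encode("utf-8")
print(repr(encoded))

# decode() goes the other way, from bytes back to text.
print(encoded.decode("utf-8") == name)
```

In both Python versions the prefix is only part of how the value is displayed inside a list; printing a single string shows the bare text.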