I've been trying to scrape data from a profile to detect whether something has changed. Here's a snippet of what the overall code would probably look like:
import requests
response = requests.get('https://twitter.com/elonmusk')
print(response.text[30907:30957])
#need to print out "sensitive_media_settings_enabled":{"value":false}
I need to have "sensitive_media_settings_enabled":{"value":false} printed out in the shell. How can I do this?
Like Ali said in a comment, a better approach is to use a regular expression to find and extract the string you're looking for. When I tried this, the start and stop indices were 43539 and 43589 respectively.
Here's how you could do it with a regex:
import re
import requests
response = requests.get('https://twitter.com/elonmusk')
reg_expression = r'"sensitive_media_settings_enabled":{"value":(true|false)}'
result = re.search(reg_expression, response.text)
if result:  # re.search returns None when there is no match
    print(result.group(0))
prints "sensitive_media_settings_enabled":{"value":false}
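Since the fragment you're matching is itself valid JSON, you can go one step further and parse the captured value instead of comparing strings. A minimal sketch, using a hypothetical stand-in for the page text (the real text would come from `response.text`):

```python
import json
import re

# Hypothetical stand-in for response.text; the real text would come
# from requests.get('https://twitter.com/elonmusk').text
page_text = 'foo "sensitive_media_settings_enabled":{"value":false} bar'

pattern = r'"sensitive_media_settings_enabled":(\{"value":(?:true|false)\})'
match = re.search(pattern, page_text)
if match:
    # The captured group is valid JSON, so json.loads yields a real bool
    setting = json.loads(match.group(1))
    print(setting["value"])  # False
```

This way, a later comparison is against the boolean `False` rather than the substring `"false"`, which is less fragile if Twitter ever changes whitespace or key ordering around the value.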
I'm using urllib to parse a URL, but I want it to take input from a text box so I can enter multiple URLs whenever I need to, instead of changing the code to parse just one URL. I tried using tkinter, but I couldn't figure out how to get urllib to grab the input from it.
You haven't provided much information about your use case, but let's pretend you already have multiple URLs and that part is working:
def retrieve_input(list_of_urls):
    for url in list_of_urls:
        # do parsing as needed
Now if you wanted to have a way to get more than one URL and put them in a list, maybe you would do something like:
list_of_urls = []
while True:
    url = input('What is your URL?')
    if url != 'Stop':
        list_of_urls.append(url)
    else:
        break
With that example you would probably want to validate inputs more, but it should give you an idea. If you're expecting help with the tkinter portion, you'll need to provide more information: examples of what you have tried, your expected input (and input method), and expected output.
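As one way to "validate inputs more", here's a minimal sketch that checks each candidate with the standard library's urllib.parse before accepting it (the helper name and sample values are illustrative, not from the question):

```python
from urllib.parse import urlparse

def looks_like_url(text):
    """Rudimentary check: require an http(s) scheme and a network location."""
    parts = urlparse(text)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

# Filter a batch of candidate inputs (hypothetical values)
candidates = ['https://example.com', 'not a url', 'http://python.org']
list_of_urls = [c for c in candidates if looks_like_url(c)]
print(list_of_urls)  # ['https://example.com', 'http://python.org']
```

In the input loop above, you would call `looks_like_url(url)` before appending, and re-prompt on failure.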
My idea was to explore Groupon's website to extract the URLs of the deals. The problem is that I'm trying to do a findall on the Groupon page to find data like this (from this page: http://www.groupon.de/alle-deals/muenchen/restaurant-296):
"category":"RESTAURANT1","dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330", and I'd like to get 'deals/muenchen-special/Casa-Lavecchia/24788330'.
I worked on it the whole night but I'm unable to find a correct regex. I tried:
import urllib2
import re
Page_Web = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
for m in re.findall('category*RESATAURANT1*dealPermaLink*:?/*/*/*/*\d$',Page_Web):
print m
But it doesn't print anything.
To extract the block that interests you, I would do it this way:
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen('http://www.groupon.de/alle-deals/muenchen/restaurant-296').read()
soup = BeautifulSoup(html, 'html.parser')
scriptResults = soup('script',{'type' : 'text/javascript'})
js_block = scriptResults[12]  # the script block holding the deal data; its position may change
Starting from this, you can parse it with a regex if you want, or try to interpret the JS (there are some threads on Stack Overflow about that).
Anyway, like the others said, you should use the Groupon API...
P.S.
The block that you are parsing can easily be parsed as a dictionary; it is already a list of dictionaries if you look closely...
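To illustrate the regex route on that block, here's a Python 3 sketch run against a fabricated fragment of the JavaScript (the first entry mirrors the data in the question; the second is invented for illustration):

```python
import re

# Hypothetical fragment of the JavaScript block described above
js_block = ('var deals = [{"category":"RESTAURANT1",'
            '"dealPermaLink":"/deals/muenchen-special/Casa-Lavecchia/24788330"},'
            '{"category":"RESTAURANT1",'
            '"dealPermaLink":"/deals/muenchen-special/Some-Other-Deal/12345678"}];')

# Capture everything between the quotes after "dealPermaLink":
links = re.findall(r'"dealPermaLink":"([^"]+)"', js_block)
print(links)
```

The `[^"]+` pattern avoids the pitfalls of `*` in the question's attempt: `category*RESATAURANT1*` repeats the preceding *character*, it does not act as a wildcard.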
How about changing RESATAURANT1 to RESTAURANT1, for starters?
I'm having a hard time extracting data from an HTTP request response.
Can somebody help me? Here's a part of my code:
import requests
r = requests.get('https://www.example.com', verify=True)
keyword = r.text.find('loginfield')
print (keyword)
42136
The value 42136 basically means that the string 'loginfield' exists in response.text. But how do I extract specific strings from it?
Like for example I want to extract these exact strings:
<title>Some title here</title>
or this one:
<div id='bla...' #continues extracting of strings until it stops where I want it to stop extracting.
Anybody got an idea on how should I approach this problem?
You can use BeautifulSoup to parse HTML and get tags. Here's an example piece of code:
import requests
from bs4 import BeautifulSoup as BS
r = requests.get('https://www.example.com', verify=True)
soup = BS(r.text, 'html.parser')
print(soup.find('title').text)
Should print:
Some title here
Though that depends on whether it's the first title on the page.
Please note that for HTML-page data extraction, you should take a look at a specialized library like Beautiful Soup. Your program will be less fragile and more maintainable that way.
str.find returns -1 if the substring does not exist.
There is no string "loginfield" in the page you retrieved.
Once you have the correct index for your string, the returned value is the position of the first character of that string.
Since you edited your question:
>>> r.text.find('loginfield')
42136
That means, the string "loginfield" starts at offset 42136 in the text. You could display say 200 chars starting at that position that way:
>>> print(r.text[42136:42136+200])
To find the various values you're looking for, you have to figure out where they are relative to that position.
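A minimal sketch of that find-then-slice idea, using a short inline stand-in for `r.text` (the real page would be much longer):

```python
# A short stand-in for r.text
text = "<html><body><input id='loginfield' name='user'></body></html>"

pos = text.find('loginfield')
print(pos)                        # position of the first character of the match
print(text.find('no_such_text'))  # -1: the substring is absent
print(text[pos:pos + 20])         # a slice of text starting at the match
```

This is workable for a quick check, but as noted above, an HTML parser is the robust way to pull out whole tags like `<title>` or a `<div>`.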
Is there a way to extract parts of text from MediaWikia's API? For example, this link dumps all the content into XML format:
http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content&format=xml
But there isn't much structure to it, even in the JSON format.
I'd like to get the text of Writer1_1, Penciler1_1, etc. Perhaps I'm not making my parameters right, so maybe there are other options I could output.
You can see the content in a more user-readable way here.
I'm sure the regex and final splitting could be more efficient, but this gets the job done for what you asked.
import urllib2
import re
data = urllib2.urlopen('http://marvel.wikia.com/api.php?action=query&prop=revisions&titles=All-New%20X-Men%20Vol%201%201&rvprop=content')
regex = re.compile('(Writer1_1|Penciler1_1)')
for line in data.read().split('|'):
    if regex.search(line):
        # assume everything after = is the full name
        print ' '.join(line.split()[2:])
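A Python 3 variant of the same idea, run against a hypothetical excerpt of the wikitext (the template markup and field values here are illustrative, not fetched from the API):

```python
import re

# Hypothetical excerpt of the wikitext returned by the API
wikitext = ("{{Marvel Database:Comic Template\n"
            "| Writer1_1 = Brian Michael Bendis\n"
            "| Penciler1_1 = Stuart Immonen\n"
            "}}")

# Capture "name = value" pairs for the fields of interest
fields = dict(re.findall(r'(Writer1_1|Penciler1_1)\s*=\s*([^\n|]+)', wikitext))
print(fields['Writer1_1'])
print(fields['Penciler1_1'])
```

Capturing name/value pairs into a dict keeps each field addressable by name, rather than relying on the split position within the line.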
I am trying to make a simple python script to extract certain links from a webpage. I am able to extract link successfully but now I want to extract some more information like bitrate,size,duration given on that webpage.
I am using the below xpath to extract the above mentioned info
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is that for a particular link the info I require is generated in a form of tuple like (bitrate,size,duration).
The XPath I mentioned above produces the required info, but it is ill-formatted: it is not possible to achieve my required format with any logic, or at least I am not able to.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup - for example:
import bs4
import urllib
soup = bs4.BeautifulSoup(urllib.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read())
print soup.find_all('a')
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for altogether, or:
info.rfind(" ")
Since the translate leaves a space character, but you could replace that with whatever you wanted.
Additional info found here.
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the array goes, re.match(regex, info[n]) should suffice; as for the triple tuple, Python's tuple syntax takes care of it. Simply match against members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re,info[1])
# etc.
truple = (incoming_value_1, incoming_value_2, incoming_value_3)
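Putting the pieces together on the data from the question, a runnable sketch of building the (bitrate, size, duration) tuple (the patterns assume the positions shown in the question's `info` list):

```python
import re

# The text nodes from the question's XPath output
info = ['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t',
        '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']

# Pull the interesting substrings out of the whitespace-padded entries
size = re.search(r'([\d.]+\s*mb)', info[1]).group(1)
bitrate = re.search(r'(\d+\s*kbps)', info[5]).group(1)
duration = info[6]

truple = (bitrate, size, duration)
print(truple)  # ('192 kbps', '3.71 mb', '2:41')
```

Using `re.search` rather than `re.match` means the leading newlines and tabs don't have to be accounted for in the pattern.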