A URL returns this JSON-like response, which is not in standard format:
{}&& {identifier:'ID', label:'As at 08-03-2018 5:06 PM',items:[{ID:0,N:'2ndChance W200123',SIP:'',NC:'CDWW',R:'',I:'',M:'',LT:0.009,C:0.000,VL:108.200,BV:2149.900,B:'0.008',S:'0.009',SV:7218.300,O:0.009,H:0.009,L:0.008,V:873.700,SC:'5',PV:0.009,P:0.000,BL:'100',P_:'X',V_:''},{ID:1,N:'3Cnergy',SIP:'',NC:'502',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:50.000,B:'0.022',S:'0.025',SV:36.000,O:0,H:0,L:0,V:0.000,SC:'2',PV:0.021,P:0,BL:'100',P_:'X',V_:''},{ID:2,N:'3Cnergy W200528',SIP:'',NC:'1E0W',R:'',I:'',M:'t',LT:0,C:0,VL:0.000,BV:0,B:'',S:'0.004',SV:50.000,O:0,H:0,L:0,V:0.000,SC:'5',PV:0.002,P:0,BL:'100',P_:'X',V_:''}
I want to load all of the data into a list or a pandas DataFrame, starting from ID. The prefix {}&& {identifier:'ID', label:'As at 08-03-2018 5:06 PM',items: is not wanted when I request the URL:
import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = 'http://www.sgx.com/JsonRead/JsonstData?qryId=RAll'
page = requests.get(url, headers=headers)
alldata = html.fromstring(page.content)
However, I am unable to continue because the JSON format is not standard. How can I correct it?
import requests
import execjs

url = 'http://www.sgx.com/JsonRead/JsonstData?qryId=RAll'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get(url, headers=headers)
# Use .text (str) rather than .content (bytes) so startswith('{}&& ') works in Python 3
content = page.text[len('{}&& '):] if page.text.startswith('{}&& ') else page.text
data = execjs.get().eval(content)
print(data)
The data is a JavaScript object in literal notation: the keys are unquoted and the strings use single quotes, so json.loads rejects it. We can use PyExecJS to evaluate it and get the corresponding Python dict.
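If installing PyExecJS is not an option, the same response can often be coaxed into valid JSON with the standard library alone. This is only a sketch, under the assumption that no value contains an apostrophe or a brace (true of the sample above); raw below is a trimmed copy of the response.

```python
import json
import re

# Trimmed sample of the non-standard response shown in the question
raw = ("{}&& {identifier:'ID', label:'As at 08-03-2018 5:06 PM',"
       "items:[{ID:0,N:'2ndChance W200123',LT:0.009}]}")

# 1. Strip the "{}&& " anti-JSON-hijacking prefix if present
body = raw[len('{}&& '):] if raw.startswith('{}&& ') else raw
# 2. Quote the bare keys so the text becomes valid JSON
body = re.sub(r'([{,])\s*([A-Za-z_]\w*)\s*:', r'\1"\2":', body)
# 3. Swap single quotes for double quotes (assumes no apostrophes in values)
body = body.replace("'", '"')

data = json.loads(body)
print(data['items'][0]['N'])  # 2ndChance W200123
```

From here, `pandas.DataFrame(data['items'])` would give the table starting from ID.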
Related
How can I run this with Celery so that the result expires after 60 seconds?
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

def weather(cities):
    results = []
    for city in cities:
        res = requests.get(f'https://www.google.com/search?q={city} weather&oq={city} weather&aqs=chrome.0.35i39l2j0l4j46j69i60.6128j1j7&sourceid=chrome&ie=UTF-8', headers=headers)
        soup = BeautifulSoup(res.text, 'html.parser')
        weather = soup.select('#wob_tm')[0].getText().strip()
        results.append({city: weather})
    return results

cities = ["tehran", "Mashhad", "Shiraaz", "Semirom", "Ahvaz", "zahedan", "baghdad", "van", "herat", "sari"]
weather_data = weather(cities)
print(weather_data)

def temporary_city(city):
    res = requests.get(f'https://www.google.com/search?q={city} weather&oq={city} weather&aqs=chrome.0.35i39l2j0l4j46j69i60.6128j1j7&sourceid=chrome&ie=UTF-8', headers=headers)
    return res
I am using the Python version of Selenium to capture comments on a Chinese website.
The website is https://v.douyu.com/show/kDe0W2q5bB2MA4Bz
I want to find a span element called "弹幕列表" ("barrage/comment list") in Chinese.
I tried an absolute path like:
driver.find_elements_by_xpath('/body/demand-video-app/main/div[2]/demand-video-helper//div/div[1]/a[3]/span')
But it raises NoSuchElementException. I suspect this site has some protection mechanism, but I don't know much about Selenium and would like to ask for help. Thanks in advance.
I guess you are using Selenium because requests can't capture the value; if that's not what you want to do, skip this answer.
Since you are calling requests.get(url='https://v.douyu.com/show/kDe0W2q5bB2MA4Bz'), you need to find the API URL that actually serves the data, in the Network tab of the browser's F12 developer tools.
The actual source of the comments is
https://v.douyu.com/wgapi/vod/center/getBarrageListByPage + parameters
↓
https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset=-1
I can't help you solve the Selenium problem itself, but I would use the following method to get the data.
import requests

url = 'https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset=-1'
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
res = requests.get(url=url, headers=headers).json()
print(res)
for i in res['data']['list']:
    print(i)
Get All Data
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
url = 'https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset=-1'
while True:
    res = requests.get(url=url, headers=headers).json()
    # Print the current page before checking for the end, so the last page is not skipped
    for i in res['data']['list']:
        print(i)
    next_json = res['data']['pre']
    if next_json == -1:
        break
    url = f'https://v.douyu.com/wgapi/vod/center/getBarrageListByPage?vid=kDe0W2q5bB2MA4Bz&forward=0&offset={next_json}'
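The pagination logic can also be separated from the network call, which makes it testable offline. iter_barrages below is a hypothetical wrapper, and fetch is whatever returns the decoded JSON for a URL (e.g. lambda u: requests.get(u, headers=headers).json()); the fake two-page fixture only demonstrates the traversal.

```python
def iter_barrages(fetch, vid):
    """Yield every barrage entry, following the 'pre' offset until it is -1."""
    base = ('https://v.douyu.com/wgapi/vod/center/getBarrageListByPage'
            f'?vid={vid}&forward=0&offset=')
    offset = '-1'
    while True:
        res = fetch(base + offset)
        # Yield the current page before checking for the end,
        # so the last page is not dropped
        yield from res['data']['list']
        pre = res['data']['pre']
        if pre == -1:
            break
        offset = str(pre)

# Two fake pages stand in for the real API, keyed by the offset parameter
pages = {
    '-1': {'data': {'list': ['a', 'b'], 'pre': 5}},
    '5':  {'data': {'list': ['c'], 'pre': -1}},
}
fake_fetch = lambda url: pages[url.rsplit('=', 1)[1]]
print(list(iter_barrages(fake_fetch, 'kDe0W2q5bB2MA4Bz')))  # ['a', 'b', 'c']
```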
I want to find the email in this webpage:
https://reachuae.com/livesearch/brand-detail/3910/A-ALICO-LTD-Sharjah
I wrote this code, but no email is found:
import requests
import re
url = 'https://reachuae.com/livesearch/brand-detail/3910/A-ALICO-LTD-Sharjah'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
r = requests.get(url, headers=headers)
print(r.status_code)
page_text = r.text
email = re.findall(r'\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,}\b',page_text)
print(email)
It returns an empty list.
The email is not in the URL you mentioned in the question. But when you click "(Click here to send enquiry)", another URL is generated at the bottom of the page, and that URL contains the email address. You can extract it with the Python code below:
import requests
from lxml import html

Mail_url = 'https://reachuae.com/livesearch/brand-detail/3910/A-ALICO-LTD-Sharjah'

def mailExtractor():
    mail = Mail_url.split('/')
    innumber = mail[-2]
    Actual_url = 'https://reachuae.com/includes/contact_company.php?id={}&KeepThis=true&'.format(innumber)
    getr = requests.get(Actual_url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
    sour = html.fromstring(getr.content)
    emails = sour.xpath('//input[@name="mail"]/@value')
    for mail in emails:
        print(mail)

mailExtractor()
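As a side note, the regex in the question would miss the address even on a page that contained one: it matches # where an email has @, and without re.I it only matches upper-case text, while r.text is usually lower-case. A corrected sketch, run against a hypothetical page fragment:

```python
import re

# '@' instead of '#', plus re.I so lower-case addresses also match
pattern = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b', re.I)

sample = '<input name="mail" value="info@example.com">'  # hypothetical fragment
print(pattern.findall(sample))  # ['info@example.com']
```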
I am practicing web crawling to get text from a website, but I have a problem with `headers = headers`. When I run the .py file, it returns:
AttributeError: 'set' object has no attribute 'items'
My code is below:
import requests
import time
import re

headers = {'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
f = open('/Users/pgao/Desktop/doupo.rtf', 'a+')

def get_info(url):
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
        for content in contents:
            f.write(content + '\n')
    else:
        pass

if __name__ == '__main__':
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(str(i)) for i in range(2, 10)]
    for url in urls:
        get_info(url)
        time.sleep(1)
    f.close()
I also struggle with the reason for using `headers=headers`: sometimes web scraping works without it and sometimes it is required, and what I found on Google was not very helpful.
The headers need to be a dict, but you created a set. The syntax is similar, but notice how the following has a key: value pair:
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
From the docs, headers for requests.get() must be a dict:
If you'd like to add HTTP headers to a request, simply pass in a dict to the headers parameter.
You have passed a set. Sets do not have an items() method, which is why you are getting this AttributeError.
headers = {'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
print(type(headers))
# <class 'set'>
Add a key to your headers variable.
headers = {'User-Agent': 'Mozilla/5.0 .....'}
Edit: Updated key value for "User-Agent" header.
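The difference is easy to see interactively; both literals below use braces, but only the second is a dict (the User-Agent strings are abbreviated):

```python
# One-element set: just a value, no key
headers_set = {'Mozilla/5.0 (Macintosh; ...)'}
# One-entry dict: a 'User-Agent' key mapped to the value
headers_dict = {'User-Agent': 'Mozilla/5.0 (Macintosh; ...)'}

print(type(headers_set))              # <class 'set'>
print(type(headers_dict))             # <class 'dict'>
# requests iterates headers.items() internally; only the dict has it
print(hasattr(headers_set, 'items'))  # False
print(hasattr(headers_dict, 'items')) # True
```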
This is the first time I am trying requests.post(), as I have always used requests.get(). I'm trying to navigate to a website and search. I am using yellowpages.com, and before I get negative feedback about scraping the site or being told to use an API, I just want to try it out. The problem I am running into is that it spits out HTML that isn't remotely what I am looking for. My code is below to show you what I am talking about.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://www.yellowpages.com"
search_terms = "Car Dealership"
location = "Jackson, MS"
q = {'search_terms': search_terms, 'geo_locations_terms': location}
page = requests.post(url, headers=headers, params=q)
print(page.text)
Your request boils down to
$ curl -X POST \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36' \
'https://www.yellowpages.com/?search_terms=Car Dealership&geo_locations_terms=Jackson, MS'
For this the server returns a 502 Bad Gateway status code.
The reason is that you used POST together with query parameters (params). The two don't go well together; use data instead:
requests.post(url, headers=headers, data=q)
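The difference is visible with the standard library's urlencode, which is roughly how requests serializes both cases: params ends up in the URL, data is form-encoded into the request body. Here q mirrors the question's payload:

```python
from urllib.parse import urlencode

q = {'search_terms': 'Car Dealership', 'geo_locations_terms': 'Jackson, MS'}

# params=q: appended to the URL as a query string
print('https://www.yellowpages.com/?' + urlencode(q))
# data=q: the same pairs, but sent in the POST body instead of the URL
print(urlencode(q))  # search_terms=Car+Dealership&geo_locations_terms=Jackson%2C+MS
```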