I have output of API request in given below.
From each atom:entry I need to extract
<c:series href="http://company.com/series/product/123"/>
<c:series-order>2020-09-17T00:00:00Z</c:series-order>
<f:assessment-low precision="0">980</f:assessment-low>
I tried to extract them to different list with BeautifulSoup, but that wasn't successful because in some entries there are dates but there isn't price (I've shown example below). How could I conditionally extract it? at least put N/A for entries where price is ommited.
soup = BeautifulSoup(request.text, "html.parser")
date = soup.find_all('c:series-order')
value = soup.find_all('f:assessment-low')
quot=soup.find_all('c:series')
p_day = []
p_val = []
q_val=[]
for i in date:
p_day.append(i.text)
for j in value:
p_val.append(j.text)
for j in quot:
q_val.append(j.get('href'))
d2={'date': p_day,
'price': p_val,
'quote': q_val
}
and
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" xmlns:a="http://company.com/ns/assets" xmlns:c="http://company.com/ns/core" xmlns:f="http://company.com/ns/fields" xmlns:s="http://company.com/ns/search">
<atom:id>http://company.com/search</atom:id>
<atom:title> COMPANYSearch Results</atom:title>
<atom:updated>2022-11-24T19:36:19.104414Z</atom:updated>
<atom:author>COMPANY atom:author>
<atom:generator> COMPANY/search Endpoint</atom:generator>
<atom:link href="/search" rel="self" type="application/atom"/>
<s:first-result>1</s:first-result>
<s:max-results>15500</s:max-results>
<s:selected-count>212</s:selected-count>
<s:returned-count>212</s:returned-count>
<s:query-time>PT0.036179S</s:query-time>
<s:request version="1.0">
<s:scope>
<s:series>http://company.com/series/product/123</s:series>
</s:scope>
<s:constraints>
<s:compare field="c:series-order" op="ge" value="2018-10-01"/>
<s:compare field="c:series-order" op="le" value="2022-11-18"/>
</s:constraints>
<s:options>
<s:first-result>1</s:first-result>
<s:max-results>15500</s:max-results>
<s:order-by key="commodity-name" direction="ascending" xml:lang="en"/>
<s:no-currency-rate-scheme>no-element</s:no-currency-rate-scheme>
<s:precision>embed</s:precision>
<s:include-last-commit-time>false</s:include-last-commit-time>
<s:include-result-types>live</s:include-result-types>
<s:relevance-score algorithm="score-logtfidf"/>
<s:lang-data-missing-scheme>show-available-language-content</s:lang-data-missing-scheme>
</s:options>
</s:request>
<s:facets/>
<atom:entry>
<atom:title>http://company.com/series-item/product/123-pricehistory-20200917000000</atom:title>
<atom:id>http://company.com/series-item/product/123-pricehistory-20200917000000</atom:id>
<atom:updated>2020-09-17T17:09:43.55243Z</atom:updated>
<atom:relevance-score>60800</atom:relevance-score>
<atom:content type="application/vnd.icis.iddn.entity+xml"><a:price-range>
<c:id>http://company.com/series-item/product/123-pricehistory-20200917000000</c:id>
<c:version>1</c:version>
<c:type>series-item</c:type>
<c:created-on>2020-09-17T17:09:43.55243Z</c:created-on>
<c:descriptor href="http://company.com/descriptor/price-range"/>
<c:domain href="http://company.com/domain/product"/>
<c:released-on>2020-09-17T21:30:00Z</c:released-on>
<c:series href="http://company.com/series/product/123"/>
<c:series-order>2020-09-17T00:00:00Z</c:series-order>
<f:assessment-low precision="0">980</f:assessment-low>
<f:assessment-high precision="0">1020</f:assessment-high>
<f:mid precision="1">1000</f:mid>
<f:assessment-low-delta>0</f:assessment-low-delta>
<f:assessment-high-delta>+20</f:assessment-high-delta>
<f:delta-type href="http://company.com/ref-data/delta-type/regular"/>
</a:price-range></atom:content>
</atom:entry>
<atom:entry>
<atom:title>http://company.com/series-item/product/123-pricehistory-20200910000000</atom:title>
<atom:id>http://company.com/series-item/product/123-pricehistory-20200910000000</atom:id>
<atom:updated>2020-09-10T18:57:55.128308Z</atom:updated>
<atom:relevance-score>60800</atom:relevance-score>
<atom:content type="application/vnd.icis.iddn.entity+xml"><a:price-range>
<c:id>http://company.com/series-item/product/123-pricehistory-20200910000000</c:id>
<c:version>1</c:version>
<c:type>series-item</c:type>
<c:created-on>2020-09-10T18:57:55.128308Z</c:created-on>
<c:descriptor href="http://company.com/descriptor/price-range"/>
<c:domain href="http://company.com/domain/product"/>
<c:released-on>2020-09-10T21:30:00Z</c:released-on>
<c:series href="http://company.com/series/product/123"/>
<c:series-order>2020-09-10T00:00:00Z</c:series-order>
for example here is no price
<f:delta-type href="http://company.com/ref-data/delta-type/regular"/>
</a:price-range></atom:content>
</atom:entry>
May try to iterate per entry, use xml parser to get a propper result and check if element exists:
soup = BeautifulSoup(request.text,'xml')
data = []
for i in soup.select('entry'):
data.append({
'date':i.find('series-order').text,
'value': i.find('assessment-low').text if i.find('assessment-low') else None,
'quot': i.find('series').get('href')
})
data
or with html.parser:
soup = BeautifulSoup(xml,'html.parser')
data = []
for i in soup.find_all('atom:entry'):
data.append({
'date':i.find('c:series-order').text,
'value': i.find('f:assessment-low').text if i.find('assessment-low') else None,
'quot': i.find('c:series').get('href')
})
data
Output:
[{'date': '2020-09-17T00:00:00Z',
'value': '980',
'quot': 'http://company.com/series/product/123'},
{'date': '2020-09-10T00:00:00Z',
'value': None,
'quot': 'http://company.com/series/product/123'}]
You can try this:
split your request.text by <atom:entry>
deal with each section seperately.
Use enumerate to identify the section that it came from
entries = request.text.split("<atom:entry>")
p_day = []
p_val = []
q_val=[]
for i, entry in enumerate(entries):
soup = BeautifulSoup(entry, "html.parser")
date = soup.find_all('c:series-order')
value = soup.find_all('f:assessment-low')
quot=soup.find_all('c:series')
for d in date:
p_day.append([i, d.text])
for v in value:
p_val.append([i, v.text])
for q in quot:
q_val.append([i, q.get('href')])
d2={'date': p_day,
'price': p_val,
'quote': q_val
}
print(d2)
OUTPUT:
{'date': [[1, '2020-09-17T00:00:00Z'], [2, '2020-09-10T00:00:00Z']],
'price': [[1, '980']],
'quote': [[1, 'http://company.com/series/product/123'],
[2, 'http://company.com/series/product/123']]}
Related
I need to iterate over the tag ObjectHeader and when the tag ObjectType/Id is equal to 1424 I need to extract all the values inside the following tags ObjectVariant/ObjectValue/Characteristic/Name and ObjectVariant/ObjectValue/PropertyValue/Value and put them in a dictionary. The expected output will be like this:
{"Var1": 10.4,
"Var2": 15.6}
Here is a snippet from the XML that I'm working with which has 30k lines (Hint: Id 1424 only appears once in the whole XML file).
<ObjectContext>
<ObjectHeader>
<ObjectType>
<Id>1278</Id>
<Name>ID_NAME</Name>
</ObjectType>
<ObjectVariant>
<ObjectValue>
<Characteristic>
<Name>Var1</Name>
<Description>Something about the name</Description>
</Characteristic>
<PropertyValue>
<Value>10.6</Value>
<Description>Something about the value</Description>
</PropertyValue>
</ObjectValue>
</ObjectVariant>
</ObjectHeader>
<ObjectHeader>
<ObjectType>
<Id>1424</Id>
<Name>ID_NAME</Name>
</ObjectType>
<ObjectVariant>
<ObjectValue>
<Characteristic>
<Name>Var1</Name>
<Description>Something about the name</Description>
</Characteristic>
<PropertyValue>
<Value>10.4</Value>
<Description>Something about the value</Description>
</PropertyValue>
</ObjectValue>
<ObjectValue>
<Characteristic>
<Name>Var2</Name>
<CharacteristicType>Something about the name</CharacteristicType>
</Characteristic>
<PropertyValue>
<Value>15.6</Value>
<Description>Something about the value</Description>
</PropertyValue>
</ObjectValue>
</ObjectVariant>
</ObjectHeader>
</ObjectContext>
Here is one possibility to write all to pandas and then filter the interessting values:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse("xml_to_dict.xml")
root = tree.getroot()
columns = ["id", "name", "value"]
row_list = []
for objHead in root.findall('.//ObjectHeader'):
for elem in objHead.iter():
if elem.tag == 'Id':
id = elem.text
if elem.tag == 'Name':
name = elem.text
if elem.tag == 'Value':
value = elem.text
row = id, name, value
row_list.append(row)
df = pd.DataFrame(row_list, columns=columns)
dff = df.query('id == "1424"')
print("Dictionary:", dict(list(zip(dff['name'], dff['value']))))
Output:
Dictionary: {'Var1': '10.4', 'Var2': '15.6'}
I have a csv file passed into a function as a string:
csv_input = """
quiz_date,location,size
2022-01-01,london_uk,134
2022-01-02,edingburgh_uk,65
2022-01-01,madrid_es,124
2022-01-02,london_uk,125
2022-01-01,edinburgh_uk,89
2022-01-02,madric_es,143
2022-01-02,london_uk,352
2022-01-01,edinburgh_uk,125
2022-01-01,madrid_es,431
2022-01-02,london_uk,151"""
I want to print the sum of how many people were surveyed in each city by date, so something like:
Date. City. Pop-Surveyed
2022-01-01. London. 134
2022-01-01. Edinburgh. 214
2022-01-01. Madrid. 555
2022-01-02. London. 628
2022-01-02. Edinburgh. 65
2022-01-02. Madrid. 143
As I can't import pandas on my machine (can't install without internet access) I thought I could use a defaultdict to store the value of each city by date
from collections import defaultdict
survery_data = csv_input.split()[1:]
survery_data = [survey.split(',') for survey in survery_data]
survey_sum = defaultdict(dict)
for survey in survery_data:
date = survey[0]
city = survey[1].split("_")[0]
quantity = survey[-1]
survey_sum[date][city] += quantity
print(survey_sum)
But doing this returns a KeyError:
KeyError: 'london'
When I was hoping to have a defaultdict of
{'2022-01-01': {'london': 134}, {'edinburgh': 214}, {'madrid': 555}},
{'2022-01-02': {'london': 628}, {'edinburgh': 65}, {'madrid': 143}}
Is there a way to create a default dict that gives a structure so I could then iterate over to print out each column like above?
Try:
csv_input = """\
quiz_date,location,size
2022-01-01,london_uk,134
2022-01-02,edingburgh_uk,65
2022-01-01,madrid_es,124
2022-01-02,london_uk,125
2022-01-01,edinburgh_uk,89
2022-01-02,madric_es,143
2022-01-02,london_uk,352
2022-01-01,edinburgh_uk,125
2022-01-01,madrid_es,431
2022-01-02,london_uk,151"""
header, *rows = (
tuple(map(str.strip, line.split(",")))
for line in map(str.strip, csv_input.splitlines())
)
tmp = {}
for date, city, size in rows:
key = (date, city.split("_")[0])
tmp[key] = tmp.get(key, 0) + int(size)
out = {}
for (date, city), size in tmp.items():
out.setdefault(date, []).append({city: size})
print(out)
Prints:
{
"2022-01-01": [{"london": 134}, {"madrid": 555}, {"edinburgh": 214}],
"2022-01-02": [{"edingburgh": 65}, {"london": 628}, {"madric": 143}],
}
Changing
survey_sum = defaultdict(dict)
to
survey_sum = defaultdict(lambda: defaultdict(int))
allows the return of
defaultdict(<function survey_sum.<locals>.<lambda> at 0x100edd8b0>, {'2022-01-01': defaultdict(<class 'int'>, {'london': 134, 'madrid': 555, 'edinburgh': 214}), '2022-01-02': defaultdict(<class 'int'>, {'edingburgh': 65, 'london': 628, 'madrid': 143})})
Allowing iterating over to create a list.
So basically I was trying to scrape a Reddit link about game of thrones. This is the link: https://www.reddit.com/r/gameofthrones/wiki/episode_discussion, this has many other links! What i was trying was to scrape all the links in a file which is done! Now i Have to individually scrape every link and print out the data in individual files either csv or json.
Ive tried all possible methods from google but still unable to come to a solution! Any help would be helpful
import praw
import json
import pandas as pd #Pandas for scraping and saving it as a csv
#This is PRAW.
reddit = praw.Reddit(client_id='',
client_secret='',
user_agent='android:com.example.myredditapp:v1.2.3 (by /u/AshKay12)',
username='******',
password='******')
subreddit=reddit.subreddit("gameofthrones")
Comments = []
submission = reddit.submission("links")
with open('got_reddit_links.json') as json_file:
data = json.load(json_file)
for p in data:
print('season: ' + str(p['season']))
print('episode: ' + str(p['episode']))
print('title: ' + str(p['title']))
print('links: ' + str(p['links']))
print('')
submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
print(20*'#')
print('Parent ID:',comment.parent())
print('Comment ID:',comment.id)
print(comment.body)
Comments.append([comment.body, comment.id])
Comments = pd.DataFrame(Comments, columns=['All_Comments', 'Comment ID'])
Comments.to_csv('Reddit3.csv')
This code prints out the links, title and episode number. It also extracts data when the link is manually entered but there are over 50 links in the webiste so i extracted those and put it in a file.
You can find all episode blocks with the links, and then write a function to scrape the comments for each episode discovered by each link:
from selenium import webdriver
import requests, itertools, re
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.reddit.com/r/gameofthrones/wiki/episode_discussion')
new_d = soup(d.page_source, 'html.parser').find('div', {'class':'md wiki'}).find_all(re.compile('h2|h4|table'))
g = [(a, list(b)) for a, b in itertools.groupby(new_d, key=lambda x:x.name == 'h2')]
r = {g[i][-1][0].text:{g[i+1][-1][k].text:g[i+1][-1][k+1] for k in range(0, len(g[i+1][-1]), 2)} for i in range(0, len(g), 2)}
final_r = {a:{b:[j['href'] for j in c.find_all('a', {'href':re.compile('redd\.it')})] for b, c in k.items()} for a, k in r.items()}
Now, you have a dictionary with all the links structured according to Season and episode:
{'Season 1 Threads': {'1.01 Winter Is Coming': ['https://redd.it/gsd0t'], '1.02 The Kingsroad': ['https://redd.it/gwlcx'], '1.03 Lord Snow': ['https://redd.it/h1otp/'], '1.04 Cripples, Bastards, & Broken Things': ['https://redd.it/h70vv'].....
To get the comments, you have to use selenium as well to be able click on the button to display the entire comment structure:
import time
d = webdriver.Chrome('/path/to/chromedriver')
def scrape_comments(url):
d.get(url)
_b = [i for i in d.find_elements_by_tag_name('button') if 'VIEW ENTIRE DISCUSSION' in i.text][0]
_b.send_keys('\n')
time.sleep(1)
p_obj = soup(d.page_source, 'html.parser').find('div', {'class':'_1YCqQVO-9r-Up6QPB9H6_4 _1YCqQVO-9r-Up6QPB9H6_4'}).contents
p_obj = [i for i in p_obj if i != '\n']
c = [{'poster':'[deleted]' if i.a is None else i.a['href'], 'handle':getattr(i.find('div', {'class':'_2X6EB3ZhEeXCh1eIVA64XM _2hSecp_zkPm_s5ddV2htoj _zMIUk6t-WDI7fxfkvD02'}), 'text', 'N/A'), 'points':getattr(i.find('span', {'class':'_2ETuFsVzMBxiHia6HfJCTQ _3_GZIIN1xcMEC5AVuv4kfa'}), 'text', 'N/A'), 'time':getattr(i.find('a', {'class':'_1sA-1jNHouHDpgCp1fCQ_F'}), 'text', 'N/A'), 'comment':getattr(i.p, 'text', 'N/A')} for i in p_obj]
return c
Sample output when running scrape_comments on one of the urls:
[{'poster': '/user/BWPhoenix/', 'handle': 'N/A', 'points': 'Score hidden', 'time': '2 years ago', 'comment': 'Week one, so a couple of quick questions:'}, {'poster': '/user/No0neAtAll/', 'handle': 'N/A', 'points': '957 points', 'time': '2 years ago', 'comment': "Davos fans showing their love Dude doesn't say a word the entire episode and gives only 3 glances but still get's 548 votes."}, {'poster': '/user/MairmanChao/', 'handle': 'N/A', 'points': '421 points', 'time': '2 years ago', 'comment': 'Davos always gets votes for being the most honorable man in Westeros'}, {'poster': '/user/BourbonSlut/', 'handle': 'N/A', 'points': '47 points', 'time': '2 years ago', 'comment': 'I was hoping for some Tyrion dialogue too..'}.....
Now, putting it all together:
final_result = {a:{b:[scrape_comments(i) for i in c] for b, c in k.items()} for a, k in final_r.items()}
From here, you can now create a pd.DataFrame from final_result or write the results to the file.
I'm trying to parse the item names and it's corresponding values from the below snippet. dt tag holds names and dd containing values. There are few dt tags which do not have corresponding values. So, all the names do not have values. What I wish to do is keep the values blank against any name if the latter doesn't have any values.
These are the elements I would like to scrape data from:
content="""
<div class="movie_middle">
<dl>
<dt>Genres:</dt>
<dt>Resolution:</dt>
<dd>1920*1080</dd>
<dt>Size:</dt>
<dd>1.60G</dd>
<dt>Quality:</dt>
<dd>1080p</dd>
<dt>Frame Rate:</dt>
<dd>23.976 fps</dd>
<dt>Language:</dt>
</dl>
</div>
"""
I've tried like below:
soup = BeautifulSoup(content,"lxml")
title = [item.text for item in soup.select(".movie_middle dt")]
result = [item.text for item in soup.select(".movie_middle dd")]
vault = dict(zip(title,result))
print(vault)
It gives me messy results (wrong pairs):
{'Genres:': '1920*1080', 'Resolution:': '1.60G', 'Size:': '1080p', 'Quality:': '23.976 fps'}
My expected result:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p','Frame Rate:':'23.976 fps','Language:':''}
Any help on fixing the issue will be highly appreciated.
You can loop through the elements inside dl. If the current element is dt and the next element is dd, then store the value as the next element, else set the value as empty string.
dl = soup.select('.movie_middle dl')[0]
elems = dl.find_all() # Returns the list of dt and dd
data = {}
for i, el in enumerate(elems):
if el.name == 'dt':
key = el.text.replace(':', '')
# check if the next element is a `dd`
if i < len(elems) - 1 and elems[i+1].name == 'dd':
data[key] = elems[i+1].text
else:
data[key] = ''
You can use BeautifulSoup to parse the dl structure, and then write a function to create the dictionary:
from bs4 import BeautifulSoup as soup
import re
def parse_result(d):
while d:
a, *_d = d
if _d:
if re.findall('\<dt', a) and re.findall('\<dd', _d[0]):
yield [a[4:-5], _d[0][4:-5]]
d = _d[1:]
else:
yield [a[4:-5], '']
d = _d
else:
yield [a[4:-5], '']
d = []
print(dict(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1])))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
For a slightly longer, although cleaner solution, you can create a decorator to strip the HTML tags of the output, thus removing the need for the extra string slicing in the main parse_result function:
def strip_tags(f):
def wrapper(data):
return {a[4:-5]:b[4:-5] for a, b in f(data)}
return wrapper
#strip_tags
def parse_result(d):
while d:
a, *_d = d
if _d:
if re.findall('\<dt', a) and re.findall('\<dd', _d[0]):
yield [a, _d[0]]
d = _d[1:]
else:
yield [a, '']
d = _d
else:
yield [a, '']
d = []
print(parse_result(list(filter(None, str(soup(content, 'html.parser').find('dl')).split('\n')))[1:-1]))
Output:
{'Genres:': '', 'Resolution:': '1920*1080', 'Size:': '1.60G', 'Quality:': '1080p', 'Frame Rate:': '23.976 fps', 'Language:': ''}
from collections import defaultdict
test = soup.text.split('\n')
d = defaultdict(list)
for i in range(len(test)):
if (':' in test[i]) and (':' not in test[i+1]):
d[test[i]] = test[i+1]
elif ':' in test[i]:
d[test[i]] = ''
d
defaultdict(list,
{'Frame Rate:': '23.976 fps',
'Genres:': '',
'Language:': '',
'Quality:': '1080p',
'Resolution:': '1920*1080',
'Size:': '1.60G'})
The logic here is that you know that every key will have a colon. Knowing this, you can write an if else statement to capture the unique combinations, whether that is key followed by key or key followed by value
Edit:
In case you wanted to clean your keys, below replaces the : in each one:
d1 = { x.replace(':', ''): d[x] for x in d.keys() }
d1
{'Frame Rate': '23.976 fps',
'Genres': '',
'Language': '',
'Quality': '1080p',
'Resolution': '1920*1080',
'Size': '1.60G'}
The problem is that empty elements are not present. Since there is no hierarchy between the <dt> and the <dd>, I'm afraid you'll have to craft the dictionary yourself.
vault = {}
category = ""
for item in soup.find("dl").findChildren():
if item.name == "dt":
if category == "":
category = item.text
else:
vault[category] = ""
category = ""
elif item.name == "dd":
vault[category] = item.text
category = ""
Basically this code iterates over the child elements of the <dl> and fills the vault dictionary with the values.
I have a collection of an item like below in my mongoDB database:
{u'Keywords': [[u'european', 7], [u'bill', 5], [u'uk', 5], [u'years', 4], [u'brexit', 4]], u'Link': u'http://www.bbc.com/
news/uk-politics-39042876', u'date': datetime.datetime(2017, 2, 21, 22, 47, 7, 463000), u'_id': ObjectId('58acc36b3040a218bc62c6d3')}
.....
These come from a mongo DB query
mydb = client['BBCArticles']
##mydb.adminCommand({'setParameter': True, 'textSearchEnabled': True})
my_collection = mydb['Articles']
print 'Articles containing higher occurences of the keyword is sorted as follow:'
for doc in my_collection.find({"Keywords":{"$elemMatch" : {"$elemMatch": {"$in": [keyword.lower()]}}}}):
print doc
However, I want to print documents as follow:
doc1
Kewords: european,bill, uk
Link:"http://www.bbc.com/"
doc2
....
Since your collection looks like a list of dictionaries, it should be iterable and parseable using a for-loop. If indeed you want only a portion of the url and keywords, this should work:
# c = your_collection, a list of dictionaries
from urlparse import urlparse
for n in range(len(c)):
print 'doc{n}'.format(n=n+1)
for k, v in c[n].iteritems():
if k == 'Keywords':
print k+':', ', '.join([str(kw[0]) for kw in v[0:3]])
if k == 'Link':
parsed_uri = urlparse( v )
domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
print k+':', '"{0}"\n'.format(domain)
prints:
doc1
Keywords: european, bill, uk
Link: "http://www.bbc.com/"