Parsing a python list converted from JSON - python

I am trying to use the Google Custom Search API to search US news outlets. Using the code example provided by Google, you end up with a Python dictionary containing a multitude of other dictionaries and lists. The tags listed next to "res" in the meta function are the values I am trying to access for each article.
import os.path
import csv
from lxml import html
from googleapiclient.discovery import build

def newslist():
    '''
    Uses Google Custom Search to search 20 US news sources for gun control articles,
    and converts the info into a Python dictionary.
    in - none
    out - res: JSON-formatted search results
    '''
    service = build("customsearch", "v1",
                    developerKey="key")
    res = service.cse().list(
        q='query',
        cx='searchid',
    ).execute()
    return res
def meta(res, doc_count):
    '''
    Finds necessary metadata of all articles. Avoids collections, such as those found on Huffington Post and the New York Times.
    in - res: defined above
    out - meta_csv: csv file with article metadata
    '''
    row1 = ['doc_id', 'url', 'title', 'publisher', 'date']
    if res['context']['items']['pagemap']['metatags']['applicationname'] != 'collection':
        for art in res['context']['items']:
            url = res['context']['items']['link']
            title = res['context']['items']['pagemap']['article']['newsarticle']['headline']
            publisher = res['context']['items']['displayLink'].split('www.' and '.com')
            date = res['context']['items']['pagemap']['newsarticle']['datepublished']
            row2 = [doc_count, url, title, publisher, date]
            with open('meta.csv', 'w', encoding='utf-8') as meta:
                csv_file = csv.writer(meta, delimiter=',', quotechar='|',
                                      quoting=csv.QUOTE_MINIMAL)
                if doc_count == 1:
                    csv_file.writerow(row1)
                csv_file.writerow(row2)
            doc_count += 1
Here's an example of the printed output from a search query:
{'context': {'title': 'Gun Control articles'},
'items': [{'displayLink': 'www.washingtonpost.com',
'formattedUrl': 'https://www.washingtonpost.com/.../white-resentment-is-fueling-opposition- '
'to-gun-control-researchers-say/',
'htmlFormattedUrl': 'https://www.washingtonpost.com/.../white-resentment-is-fueling-opposition- '
'to-<b>gun</b>-<b>control</b>-researchers-say/',
'htmlSnippet': 'Apr 4, 2016 <b>...</b> Racial prejudice could play '
'a significant role in white Americans' '
'opposition to <br>\n'
'<b>gun control</b>, according to new research from '
'political scientists at ...',
'htmlTitle': 'White resentment is fueling opposition to <b>gun '
'control</b>, researchers say',
'kind': 'customsearch#result',
'link': 'https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/',
'pagemap': {'cse_image': [{'src': 'https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2015/10/03/Others/Images/2015-10-03/Botsford_gunshow1004_15_10_03_41831443897980.jpg'}],
'cse_thumbnail': [{'height': '183',
'src': 'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSXtMnfm_GHkX3d2dOWgmto3rFjmhzxV8InoPao1tBuiBrEWsDMz4WDKcPB',
'width': '275'}],
'metatags': [{'apple-itunes-app': 'app-id=938922398, '
'app-argument=https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/',
'article:author': 'https://www.facebook.com/chrisingraham',
'article:publisher': 'https://www.facebook.com/washingtonpost',
'author': 'https://www.facebook.com/chrisingraham',
'fb:admins': '1513210492',
'fb:app_id': '41245586762',
'news_keywords': 'guns, gun control, '
'racial resentment, '
'white people',
'og:description': 'Some white gun owners '
'"understand '
"'freedom' in a very "
'particular way."',
'og:image': 'https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2015/10/03/Others/Images/2015-10-03/Botsford_gunshow1004_15_10_03_41831443897980.jpg',
'og:site_name': 'Washington Post',
'og:title': 'White resentment is fueling '
'opposition to gun control, '
'researchers say',
'og:type': 'article',
'og:url': 'https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/',
'referrer': 'unsafe-url',
'twitter:card': 'summary_large_image',
'twitter:creator': '#_cingraham',
'viewport': 'width=device-width, '
'initial-scale=1.0, '
'user-scalable=yes, '
'minimum-scale=0.5, '
'maximum-scale=2.0'}],
'newsarticle': [{'articlebody': 'People look at '
'handguns during the '
"Nation's Gun Show in "
'Chantilly, Va. in '
'October 2015. (Photo '
'by Jabin Botsford/The '
'Washington Post) '
'Racial prejudice '
'could play a '
'significant role in '
'white...',
'datepublished': '2016-04-04T11:46-500',
'description': 'Some white gun owners '
'"understand '
"'freedom' in a very "
'particular way."',
'headline': 'White resentment is '
'fueling opposition to '
'gun control, researchers '
'say',
'image': 'https://img.washingtonpost.com/rf/image_1484w/2010-2019/WashingtonPost/2015/10/03/Others/Images/2015-10-03/Botsford_gunshow1004_15_10_03_41831443897980.jpg',
'mainentityofpage': 'True',
'url': 'https://www.washingtonpost.com/news/wonk/wp/2016/04/04/white-resentment-is-fueling-opposition-to-gun-control-researchers-say/'}],
'person': [{'name': 'Christopher Ingraham'}]},
'snippet': 'Apr 4, 2016 ... Racial prejudice could play a '
"significant role in white Americans' opposition to \n"
'gun control, according to new research from political '
'scientists at\xa0...',
'title': 'White resentment is fueling opposition to gun control, '
'researchers say'},
I understand that I could basically write a for loop, but I'm wondering if there is an easier, less code-intensive way of accessing this data for each desired value: URL, title, publisher, and date.

Why don't you use the json module?
import json
s = ... # Your JSON text
result = json.loads(s)
result will be a dict or a list, depending on your JSON.
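That said, the res returned by the API client is already a Python dict, so no extra parsing layer is needed; the four fields can be pulled straight out of res['items']. A minimal sketch based on the sample output above (the .get() defaults and the meta_rows name are illustrative, not from the original code):
# Each entry in res['items'] is one article; .get() guards against missing keys.
meta_rows = []
for art in res.get('items', []):
    news = art.get('pagemap', {}).get('newsarticle', [{}])[0]
    meta_rows.append({
        'url': art.get('link'),
        'title': news.get('headline', art.get('title')),
        'publisher': art.get('displayLink'),
        'date': news.get('datepublished'),
    })
Note that pagemap['newsarticle'] is a list in the sample output, so its first element is taken before reading its keys.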

Related

Clean Html Tags using Item Pipelines and get output in Scrapy

I am scraping company data using Scrapy.
Some outputs are strings and some are lists.
I could process the string outputs with .extract() and .strip() in the spider file itself, but the list outputs are tricky.
I want a way to process all items together using item pipelines and print the result.
E.g., this is my parse function:
def parse(self, response):
    item = CompanycrawlerItem()
    item['address'] = response.xpath('//*[@class="text data"]/text()').extract()[0]
    item['status'] = response.xpath('//*[@class="text data"]/text()').extract()[1]
    item['type'] = response.xpath('//*[@class="text data"]/text()').extract()[2]
    item['accounts'] = response.xpath('//*[@class="column-half"]/p').extract()[:2]
    item['confirmation_status'] = response.xpath('//*[@class="column-half"]/p').extract()[2:4]
    item['incorporate'] = response.xpath('//*[@id="company-creation-date"]/text()').extract()
    yield item
this is the output I am getting:
{'accounts': ['<p>\n'
'Next accounts made up to <strong>31 '
'December 2020</strong>\n'
' <br>\n'
' due by\n'
' <strong>30 September 2021</strong>\n'
' </p>',
'<p>\n'
' Last accounts made up to\n'
' <strong>31 December 2019</strong>\n'
' </p>'],
'address': '\n'
'Wellington House, 69/71 Upper Ground, London, SE1 9PQ ',
'confirmation_status': ['<p>\n'
'Next statement date <strong>7 June '
'2021</strong> <br>\n'
' due by <strong>21 June 2021</strong>\n'
' </p>',
'<p>\n'
' Last statement dated <strong>7 June '
'2020</strong>\n'
' </p>'],
'incorporate': ['31 December 1987'],
'status': '\n Active\n ',
'type': '\n Private limited Company\n '}
I want to use Item Pipelines to strip the tags and the extra whitespace.
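One possible approach (a sketch, not from the original thread) is a single pipeline that normalizes every field: strip the tags with w3lib's remove_tags (w3lib ships as a Scrapy dependency) and collapse the whitespace, handling string and list fields uniformly. The CleanFieldsPipeline name and the pipeline priority below are illustrative:
import re
from w3lib.html import remove_tags

class CleanFieldsPipeline:
    # Strips HTML tags and squeezes runs of whitespace on every item field.
    def _clean(self, value):
        return re.sub(r'\s+', ' ', remove_tags(value)).strip()

    def process_item(self, item, spider):
        for field, value in item.items():
            if isinstance(value, str):
                item[field] = self._clean(value)
            elif isinstance(value, list):
                item[field] = [self._clean(v) for v in value]
        return item
Enable it in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.CleanFieldsPipeline': 300} (the module path depends on your project layout).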

Is it possible for a python script to check whether row exists in google sheets before writing that row?

I have a python script that searches for vehicles on a vehicle listing site and writes the results to a spreadsheet. What I want is to automate this script to run every night to get new listings, but what I don't want is to create numerous duplicates if the listing exists each day that the script is run.
So is it possible to get the script to check whether that row (potential duplicate) already exists before writing a new row?
To clarify: the code I have works perfectly to print the results exactly how I want them into the Google Sheets document; what I am trying to do is run a check before it writes new lines to the sheet, to see whether that result already exists. Is that clearer? With thanks in advance.
Here is a screenshot of an example where I might have a row already existing with the specific title, but one of the column cells may have a different value in it and I only want to update the original row with the latest/highest price value.
UPDATE:
I am trying something like this, but it just seems to print everything rather than only the rows that don't already exist, which is what I am trying to do.
listing = [title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant]
list_of_dicts = sheet2.get_all_records()
# Convert listing into a dictionary so the next statement can check whether it already exists in the sheet before printing
i = iter(listing)
d_listing = dict(zip(i, i))
if d_listing not in list_of_dicts:
    print(listing)
    index = 2
    row = [title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant]
    sheet2.insert_row(row, index)
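A note on why this check never matches (an observation, not from the original thread): zip(i, i) pairs each element of listing with the next element, so d_listing maps values to values, while get_all_records() returns dicts keyed by the sheet's header row; the two can never be equal. A tiny demonstration:
i = iter(['a', 1, 'b', 2])
print(dict(zip(i, i)))  # {'a': 1, 'b': 2} -- consecutive values paired together, not header: value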
My code is:
import requests
import re
from bs4 import BeautifulSoup
import pandas
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('creds.json', scope)
client = gspread.authorize(creds)

sheet = client.open("CAR AGGREGATOR")
sheet2 = sheet.worksheet("Auctions - Live")

url = "https://themarket.co.uk/live.xml"
get_url = requests.get(url)
get_text = get_url.text

soup = BeautifulSoup(requests.get(url).text, 'lxml')
for loc in soup.select('url > loc'):
    loc = loc.text
    r = requests.get(loc)
    c = r.content
    hoop = BeautifulSoup(c, 'html.parser')
    soup = BeautifulSoup(c, 'lxml')
    current_bid = soup.find('div', 'bid-step__header')
    bid = soup.find('bid-display')
    title = soup.find('h2').text.split()
    year = title[0]
    if not year:
        year = ''
    if any(make in 'ASTON ALFA HEALEY ROVER Arnolt Bristol Amilcar Amphicar LOREAN De Cadenet Cosworth'.split() for make in title):
        make = title[1] + ' ' + title[2]
        model = title[3]
        try:
            variant = title[4]
        except:
            variant = ''
    else:
        make = title[1]
        model = title[2]
        try:
            variant = title[3]
            if 'REIMAGINED' in variant:
                variant = 'REIMAGINED BY SINGER'
            if 'SINGER' in variant:
                variant = 'REIMAGINED BY SINGER'
        except:
            variant = ''
    title = year + ' ' + make + ' ' + model
    img = soup.find('img')
    vehicle_details = soup.find('ul', 'vehicle__overview')
    try:
        mileage = vehicle_details.find_all('li')[1].text.split()[2]
    except:
        mileage = ''
    try:
        vin = vehicle_details.find_all('li')[2].text.split()[2]
    except:
        vin = ''
    try:
        gearbox = vehicle_details.find_all('li')[4].text.split()[2]
    except:
        gearbox = 'N/A'
    try:
        exterior_colour = vehicle_details.find_all('li')[5].text.split()[1:]
        exterior_colour = "-".join(exterior_colour)
    except:
        exterior_colour = 'N/A'
    try:
        interior_colour = vehicle_details.find_all('li')[6].text.split()[1:]
        interior_colour = "-".join(interior_colour)
    except:
        interior_colour = 'N/A'
    try:
        video = soup.find('iframe')['src']
    except:
        video = ''
    tag = soup.countdown
    try:
        auction_date = tag.attrs['formatted_date'].split()
        auction_day = auction_date[0][:2]
        auction_month = auction_date[1]
        auction_year = auction_date[2]
        auction_time = auction_date[3]
        auction_date = auction_day + ' ' + auction_month + ' ' + auction_year + ' ' + auction_time
    except:
        continue
    print(title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant)
    index = 2
    row = [title, img['src'], video, vin, loc, exterior_colour, interior_colour, 'N/A', mileage, gearbox, 'N/A', 'Live', auction_date, '', '£' + bid.attrs['amount'][:-3], 'The Market', '', '', '', '', year, make, model, variant]
    sheet2.insert_row(row, index)
I would load all the data into two dictionaries, one representing the freshly scraped information, the other the full information of the Google Sheet. (To load the information from the Google Sheet, use its API, as described in Google's documentation.)
Both dictionaries, let's call them scraped and sheets, could have the titles as keys and all the other columns as values (represented in a dictionary), so they would look like this:
{
    "1928 Aston Martin V8": {
        "Link": "...",
        "Price": "12 $",
    },
    ...
}
Then update the sheets dictionary with dict.update():
sheets.update(scraped)
and rewrite the Google Sheet with the data in sheets.
Without knowing your exact update logic, I cannot give more specific advice than this.
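For the narrower goal of avoiding duplicate inserts, a membership test against the existing rows may already be enough. A sketch using only gspread calls that appear in the question; the 'Title' header name is an assumption about the sheet's first row:
# get_all_records() returns one dict per row, keyed by the header row
existing_titles = {rec['Title'] for rec in sheet2.get_all_records()}  # 'Title' header is assumed
if title not in existing_titles:
    sheet2.insert_row(row, 2)  # only insert listings not already in the sheet
To update the price of an existing row instead, you would locate the row first (for example with sheet2.find(title)) and overwrite just that cell.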

Why isn't this if statement returning True?

I'm making a program that counts how many times a band has played a song, using a webpage of all their setlists. I have grabbed the webpage and converted all the songs played into one big list, so all I wanted to do was check whether the song name is in the list and add to a counter, but it isn't working and I can't figure out why.
I've tried using the count function instead, and that didn't work either.
sugaree_counter = 0
link = 'https://www.cs.cmu.edu/~mleone/gdead/dead-sets/' + year + '/' + month+ '-' + day + '-' + year + '.txt'
page = requests.get(link)
page_text = page.text
page_list = [page_text.split('\n')]
print(page_list)
This code returns the list:
[['Winterland Arena, San Francisco, CA (1/2/72)', '', "Truckin'", 'Sugaree',
'Mr. Charlie', 'Beat it on Down the Line', 'Loser', 'Jack Straw',
'Chinatown Shuffle', 'Your Love At Home', 'Tennessee Jed', 'El Paso',
'You Win Again', 'Big Railroad Blues', 'Mexicali Blues',
'Playing in the Band', 'Next Time You See Me', 'Brown Eyed Women',
'Casey Jones', '', "Good Lovin'", 'China Cat Sunflower', 'I Know You Rider',
"Good Lovin'", 'Ramble On Rose', 'Sugar Magnolia', 'Not Fade Away',
"Goin' Down the Road Feeling Bad", 'Not Fade Away', '',
'One More Saturday Night', '', '']]
But when I do:
sugaree_counter = int(sugaree_counter)
if 'Sugaree' in page_list:
    sugaree_counter += 1
print(str(sugaree_counter))
It will always be zero.
It should add 1 to that because 'Sugaree' is in that list
Your page_list is a list of lists, so you need two for loops to reach the songs, plus a check on each one (the membership test alone only looks at the outer list):
for page in page_list:
    for item in page:
        if item == 'Sugaree':
            sugaree_counter += 1
Or use sum() with a list comprehension:
sugaree_counter = sum([page.count('Sugaree') for page in page_list])
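The root cause is the extra pair of brackets in page_list = [page_text.split('\n')], which wraps the list of songs in another list; dropping them makes the original membership test work and lets count() do the job directly:
page_list = page_text.split('\n')              # a flat list of song names
sugaree_counter = page_list.count('Sugaree')   # counts every occurrence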

Adding symbols to json Python

I am getting a JSON list of tags from a website:
['adventure in the comments', 'artist:s.guri', 'clothes', 'comments locked down', 'dashie slippers', 'edit', 'fractal', 'no pony', 'recursion', 'safe', 'simple background', 'slippers', 'tanks for the memories', 'the ride never ends', 'transparent background', 'vector', 'wat', 'we need to go deeper']
And I want to print it more or less like this:
#adventureinthecomments #artist:s.guri #clothes #commentslockeddown #dashie #slippers #edit #fractal #nopony #recursion
Does somebody know what method I need to use to remove all the commas and add a hashtag before each word?
P.S. I'm using Python 3.
One way is to join everything into a single string on '#', strip all the white space, and then replace '#' with ' #' (with a leading space):
arr = ['adventure in the comments', 'artist:s.guri', 'clothes', 'comments locked down', 'dashie slippers', 'edit', 'fractal', 'no pony', 'recursion', 'safe', 'simple background', 'slippers', 'tanks for the memories', 'the ride never ends', 'transparent background', 'vector', 'wat', 'we need to go deeper']
s= "#"
res = '#' + s.join(arr)
newVal = res.replace(' ','')
newNew = newVal.replace('#', ' #')
print(newNew)
What's the rule for splitting the original strings? The first one looks like
'adventure in the comments' = '#adventureinthecomments'
but
'dashie slippers' is split into #dashie #slippers
?
If there are no rules, this could work:
>>> jsontags = ['adventure in the comments', 'artist:s.guri', 'clothes', 'comments locked down', 'dashie slippers', 'edit', 'fractal', 'no pony', 'recursion', 'safe', 'simple background', 'slippers', 'tanks for the memories', 'the ride never ends', 'transparent background', 'vector', 'wat', 'we need to go deeper']
>>> '#'+' #'.join([tag.replace(' ','') for tag in jsontags])
This will be the result
'#adventureinthecomments #artist:s.guri #clothes #commentslockeddown #dashieslippers #edit #fractal #nopony #recursion #safe #simplebackground #slippers #tanksforthememories #therideneverends #transparentbackground #vector #wat #weneedtogodeeper'

Scraping HTML data from a page that adds new tables as you scroll

I'm trying to learn HTML scraping for a project; I'm using Python and lxml. I've been successful so far in getting the data I needed, but now I have another problem. On the site I'm scraping (op.gg), when you scroll down it adds new tables with more information. When I run my script (below) it only gets the first 50 entries and nothing more. My question is how I can get at least the first 200 names on the page, if that is even possible.
from lxml import html
import requests

page = requests.get('https://na.op.gg/ranking/ladder/')
tree = html.fromstring(page.content)
names = tree.xpath('//td[@class="SummonerName Cell"]/a/text()')
print(names)
Borrowing the idea from Pedro: https://na.op.gg/ranking/ajax2/ladders/start=number will give you 50 records starting from number, for example:
https://na.op.gg/ranking/ajax2/ladders/start=0 get (1-50),
https://na.op.gg/ranking/ajax2/ladders/start=50 get (51-100),
https://na.op.gg/ranking/ajax2/ladders/start=100 get (101-150),
https://na.op.gg/ranking/ajax2/ladders/start=150 get (151-200),
etc....
After that you can change your scraping code, since this page is laid out differently from your original one. Supposing you want the first 200 names, here is the amended code:
from lxml import html
import requests

start_url = 'https://na.op.gg/ranking/ajax2/ladders/start='
names_200 = list()
for i in [0, 50, 100, 150]:
    dest_url = start_url + str(i)
    page = requests.get(dest_url)
    tree = html.fromstring(page.content)
    names_50 = tree.xpath('//a[not(@target) and not(@onclick)]/text()')
    names_200.extend(names_50)
print(names_200)
print(len(names_200))
Output:
[u'am\xc3\xa9liorer', 'pireaNn', 'C9 Ray', 'P1 Pirean', 'Pobelter', 'mulgokizary', 'consensual clown', 'Jue VioIe Grace', 'Deep Learning', 'Keegun', 'Free Papa Chau', 'C9 Gun', 'Dhokla', 'Arrowlol', 'FOX Brandini', 'Jurassiq', 'Win or Learn', 'Acoldblazeolive', u'R\xc3\xa9venge', u'M\xc3\xa9ru', 'Imaqtpie', 'Rohammers', 'blaberfish2', 'qldurtms', u'd\xc3\xa0wolfsclaw', 'TheOddOrange', 'PandaTv 656826', 'stuntopolis', 'Butler Delta', 'P1 Shady', 'Entranced', u'Linsan\xc3\xadty', 'Ablazeolive', 'BukZacH', 'Anivia Kid', 'Contractz', 'Eitori', 'MistyStumpey', 'Prodedgy', 'Splitting', u'S\xc4\x99b B\xc4\x99rnal', 'N For New York', 'Naeun', '5tunt', 'C9 Winter', 'Doubtfull', 'MikeYeung', 'Rikara', u'RAH\xc3\x9cLK', ' Sudzzi', 'joong ki song', 'xWeixin VinLeous', 'rhubarbs', u'Ch\xc3\xa0se', 'XueGao', 'Erry', 'C9 EonYoung', 'Yeonbee', 'M ckg', u'Ari\xc3\xa1na Lovato', 'OmarGod', 'Wiggily', 'lmpactful', 'Str1fe', 'LL Stylish', '2017', 'FlREFLY', 'God Fist Monk', 'rWeiXin VinLeous', 'Grigne', 'fantastic ad', 'bobqinX', 'grigne 1v10', 'Sora1', 'Juuichi san ', 'duoking2', 'SandPaperX', 'Xinthus', 'TwichTv CoMMa', 'xFSN Rin', 'UBC CJ', 'PotIuck', 'DarkWingsForSale', 'Get After lt', 'old chicken', u'\xc4\x86ris', 'VK Deemo', 'Pekin Woof', 'YIlIlIlIlI', 'RiceLegend', 'Chimonaa1', 'DJNDREE5', u'CloudNguy\xc3\xa9n', 'Diamond 1 Khazix', 'dawolfsfang', 'clg imaqtpie69', 'Pyrites', 'Lava', 'Rathma', 'PieCakeLord', 'feed l0rd', 'Eygon', 'Autolycus1', 'FateFalls 20xx', 'nIsHIlEzHIlA', 'C9 Sword', 'TET Fear', 'a very bad time', u'Jur\xc3\xa1ssiq', 'Ginormous Noob', 'Saskioo', 'S D 2 NA', 'C9 Smoothie', 'dufTlalgkqtlek', 'Pants are Dragon', u'H\xc3\xb3llywood', 'Serenitty', 'Waggily ', 'never lucky help', u'insan\xc3\xadty', 'Joyul', 'TheeBrandini', 'FoTheWin', 'RyuShoryu', 'avi is me', 'iKingVex', 'PrismaI', 'An Obese Panda', 'TdollasAKATmoney', 'feud999', 'Soligo', 'Steel I', 'SNH48 Ruri', 'BillyBoss1', 'Annie Bot', 'Descraton', 'Cris', 'GrayHoves', 'RegisZZ', 'lron Pyrite', 'Zaion', 'Allorim', 't d', u'Alex \xc3\xafch', 'godrjsdnd', 'DOUBLELIFTSUCKS', 'John Mcrae', u'Lobo Solitari\xc3\xb3', 'MikeYeunglol', 'i xo u', 'NoahMost', 'Vsionz', 'GladeGleamBright', 'Tuesdayy', 'RealDarkness', 'CC Dean', 'na mid xd LFT', 'Piggy Kitten', 'Abou222', 'TG Strompest', 'MooseHater', 'Day after Day', 'bat8man', 'AxAxAxAxA', 'Boyfriend', 'EvanRL', '63FYWJMbam', 'Fiftygbl', u'Br\xc4\xb1an', 'MlST', u'S\xc3\xb8ren Bjerg', 'FOX Akaadian', '5word', 'tchikou', 'Hakuho', 'Noobkiller291', 'woxiangwanAD', 'Doublelift', 'Jlaol', u'z\xc3\xa3ts', 'Cow Goes Mooooo', u'Be Like \xc3\x91e\xc3\xb8\xc3\xb8', 'Liquid Painless', 'Zergy', 'Huge Rooster', 'Shiphtur', 'Nikkone', 'wiggily1', 'Dylaran', u'C\xc3\xa0m', 'byulbit', 'dirtybirdy82', 'FreeXpHere', u'V\xc2\xb5lcan', 'KaNKl', 'LCS Actor 4', 'bie sha wo', 'Mookiez', 'BKSMOOTH', 'FatMiku']
200
BTW, you can extend this based on your requirements.
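If the offsets need to vary, the paging loop can be wrapped in a small helper. A sketch (the fetch_names name is illustrative, and it assumes the ajax2 endpoint keeps behaving as above):
from lxml import html
import requests

def fetch_names(total, page_size=50):
    # Page through the ajax endpoint, page_size records at a time.
    names = []
    for start in range(0, total, page_size):
        page = requests.get('https://na.op.gg/ranking/ajax2/ladders/start=' + str(start))
        tree = html.fromstring(page.content)
        names.extend(tree.xpath('//a[not(@target) and not(@onclick)]/text()'))
    return names

print(len(fetch_names(200)))  # 200, if each page returns a full batch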
