Scrapy add.xpath or join xpath

Scrapy add.xpath or join xpath - python

I hope everyone is doing well.
I have this code(part of it) for a spider, now this is the last part of the scraping, here it start to scrape and then write in the csv file, so I got this doubdt, it is possible to join or add xpath with the result printed in the file, for example:
<h5>Soundbooster</h5> <br><br>
<p class="details">
<b>Filtro attuale</b>
</p>
<blockquote>
<p>
<b>Catalogo:</b>
Aliant</br>
<b>Marca e Modello:</b>
Mazda - 3 </br>
<b>Versione:</b>
(3th gen) 2013-now (Petrol)
</p>
</blockquote>
I want to join the following for one field in the csv file, should be something like this:
Soundbooster per Mazda - 3 - (3th gen) 2013-now (Petrol)
And here it is where I am lost, It is possible? I don't know if I have to use add.xpath or join or another method and how to use it right.
This is part of my code:
def parse_content_details(self, response):
exists = os.path.isfile("ntp/ntp_aliant.csv")
with open("ntp/ntp_aliant.csv", "a+", newline='') as csvfile:
fieldnames = ['*Action(SiteID=Italy|Country=IT|Currency=EUR|Version=745|CC=UTF-8)','*Category','*Title','Model','ConditionID','PostalCode',\
'VATPercent','*C:Marca','Product:EAN','*C:MPN','PicURL', 'Description','*Format','*Duration','StartPrice','*Quantity','PayPalAccepted','PayPalEmailAddress',\
'PaymentInstructions','*Location','ShippingService-1:FreeShipping', 'ShippingService-1:Option','ShippingService-1:Cost', 'ShippingService-1:Priority',\
'ShippingService-2:Option','ShippingService-2:Cost','ShippingService-2:Priority','ShippingService-3:Option','ShippingService-3:Cost',\
'ShippingService-3:Priority','ShippingService-4:Option','ShippingService-4:Cost','ShippingService-4:Priority','*DispatchTimeMax',\
'*ReturnsAcceptedOption','ReturnsWithinOption','RefundOption','ShippingCostPaidByOption']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
if not exists:
writer.writeheader()
for ntp in response.css('div.content-1col-nobox'):
name = ntp.xpath('normalize-space(//h5/text())').extract_first()
brand = ntp.xpath('normalize-space(//div/blockquote[1]/p/text()[4])').extract_first()
version = ntp.xpath('normalize-space(//div/blockquote[1]/p/text()[6])').extract_first()
result = response.xpath(name + " per " + brand + " - " + version)
MPN = ntp.xpath('normalize-space(//tr[2]/td[1]/text())').extract_first()
description = ntp.xpath('normalize-space(//div[6]/div[1]/div[2]/div/blockquote[2]/p/text())').extract_first()
price = ntp.xpath('normalize-space(//tr[2]/td[#id="right_cell"][1])').extract()[0].split(None,1)[0].replace(",",".")
picUrl = response.urljoin(ntp.xpath('//div/p[3]/img/#src').extract_first())
writer.writerow({
'*Action(SiteID=Italy|Country=IT|Currency=EUR|Version=745|CC=UTF-8)':'Add',\
'*Category':'30895',\
'*Title': name,\
'Model': result,\
'ConditionID': '1000',\
'PostalCode':'154',\
'VATPercent':'22',\
'*C:Marca':'Priority Parts',\
'Product:EAN':'',\
'*C:MPN': MPN,\
'PicURL': picUrl,\
'Description': description,\
'*Format' : 'FixedPrice',\
'*Duration': 'GTC',\
'StartPrice' : price,\
'*Quantity':'3',\
'PayPalAccepted': '1',\
'PayPalEmailAddress' : 'your#gmail.com',\
'PaymentInstructions' : 'your#gmail.com',\
'*Location' : 'Italia',\
'ShippingService-1:FreeShipping' : '1',\
'ShippingService-1:Option' : 'IT_OtherCourier3To5Days',\
'ShippingService-1:Cost' : '10',\
'ShippingService-1:Priority' : '1',\
'ShippingService-2:Option' : 'IT_QuickPackage3',\
'ShippingService-2:Cost' : '15',\
'ShippingService-2:Priority' : '2',\
'ShippingService-3:Option': 'IT_QuickPackage1',\
'ShippingService-3:Cost' : '12',\
'ShippingService-3:Priority' : '3',\
'ShippingService-4:Option': 'IT_Pickup',\
'ShippingService-4:Cost' : '0',\
'ShippingService-4:Priority' : '4',\
'*DispatchTimeMax' : '5',\
'*ReturnsAcceptedOption' : 'ReturnsAccepted',\
'ReturnsWithinOption' : 'Days_14',\
'RefundOption' : 'MoneyBackOrExchange',\
'ShippingCostPaidByOption' : 'Buyer'})
Any help will be appreciate it.
Cheers.
Valter.

At the end #Casper was right, in the comments we see the right answer
"{} per {} - {}".format(name, brand, version)
This is the final result:
name = ntp.xpath('normalize-space(//h5/text())').extract_first()
brand = ntp.xpath('normalize-space(//div/blockquote[1]/p//text()[4])').extract_first()
version = ntp.xpath('normalize-space(//div/blockquote[1]/p//text()[6])').extract_first()
result = ("{} per {} - {}".format(name, brand, version))
writer.writerow({
'*Title': result,\

Related

Search in List; Display names based on search input

I have sought different articles here about searching data from a list, but nothing seems to be working right or is appropriate in what I am supposed to implement.
I have this pre-created module with over 500 list (they are strings, yes, but is considered as list when called into function; see code below) of names, city, email, etc. The following are just a chunk of it.
empRecords="""Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114,386-248-4118,386-208-6976,joles#gmail.com,http://www.paganophilipgesq.com,;
Alesia,Hixenbaugh,9 Front St,Washington,District of Columbia,DC,3/3/2000,20001,202-646-7516,202-276-6826,alesia_hixenbaugh#hixenbaugh.org,http://www.kwikprint.com,;
Lai,Harabedian,1933 Packer Ave #2,Novato,Marin,CA,1/5/2000,94945,415-423-3294,415-926-6089,lai#gmail.com,http://www.buergimaddenscale.com,;
Brittni,Gillaspie,67 Rv Cent,Boise,Ada,ID,11/28/1974,83709,208-709-1235,208-206-9848,bgillaspie#gillaspie.com,http://www.innerlabel.com,;
Raylene,Kampa,2 Sw Nyberg Rd,Elkhart,Elkhart,IN,12/19/2001,46514,574-499-1454,574-330-1884,rkampa#kampa.org,http://www.hermarinc.com,;
Flo,Bookamer,89992 E 15th St,Alliance,Box Butte,NE,12/19/1957,69301,308-726-2182,308-250-6987,flo.bookamer#cox.net,http://www.simontonhoweschneiderpc.com,;
Jani,Biddy,61556 W 20th Ave,Seattle,King,WA,8/7/1966,98104,206-711-6498,206-395-6284,jbiddy#yahoo.com,http://www.warehouseofficepaperprod.com,;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804,407-413-4842,407-557-8857,chauncey_motley#aol.com,http://www.affiliatedwithtravelodge.com
"""
a = empRecords.strip().split(";")
And I have the following code for searching:
import empData as x
def seecity():
empCitylist = list()
for ct in x.a:
empCt = ct.strip().split(",")
empCitylist.append(empCt)
t = sorted(empCitylist, key=lambda x: x[3])
for c in t:
city = (c[3])
print(city)
live_city = input("Enter city: ")
for cy in city:
if live_city in cy:
print(c[1])
# print("Name: "+ c[1] + ",", c[0], "| Current City: " + c[3])
Forgive my idiotic approach as I am new to Python. However, what I am trying to do is user will input the city, then the results should display the employee's last name, first name who are living in that city (I dunno if I made sense lol)
By the way, the code I used above doesn't return any answers. It just loops to the input.
Thank you for helping. Lovelots. <3
PS: the format of the empData is: first name, last name, address, city, country, birthday, zip, phone, and email

You can use the csv module to read easily a file with comma separated values
import csv
with open('test.csv', newline='') as csvfile:
records = list(csv.reader(csvfile))
def search(data, elem, index):
out = list()
for row in data:
if row[index] == elem:
out.append(row)
return out
#test
print(search(records, 'Orlando', 3))

Based on your original code, you can do it like this:
# Make list of list records, sorted by city
t = sorted((ct.strip().split(",") for ct in x.a), key=lambda x: x[3])
# List cities
print("Cities in DB:")
for c in t:
city = (c[3])
print("-", city)
# Define search function
def seecity():
live_city = input("Enter city: ")
for c in t:
if live_city == c[3]:
print("Name: "+ c[1] + ",", c[0], "| Current City: " + c[3])
seecity()
Then, after you understand what's going on, do as #Hoxha Alban suggested, and use the csv module.

The beauty of python lies in list comprehension.
empRecords="""Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114,386-248-4118,386-208-6976,joles#gmail.com,http://www.paganophilipgesq.com,;
Alesia,Hixenbaugh,9 Front St,Washington,District of Columbia,DC,3/3/2000,20001,202-646-7516,202-276-6826,alesia_hixenbaugh#hixenbaugh.org,http://www.kwikprint.com,;
Lai,Harabedian,1933 Packer Ave #2,Novato,Marin,CA,1/5/2000,94945,415-423-3294,415-926-6089,lai#gmail.com,http://www.buergimaddenscale.com,;
Brittni,Gillaspie,67 Rv Cent,Boise,Ada,ID,11/28/1974,83709,208-709-1235,208-206-9848,bgillaspie#gillaspie.com,http://www.innerlabel.com,;
Raylene,Kampa,2 Sw Nyberg Rd,Elkhart,Elkhart,IN,12/19/2001,46514,574-499-1454,574-330-1884,rkampa#kampa.org,http://www.hermarinc.com,;
Flo,Bookamer,89992 E 15th St,Alliance,Box Butte,NE,12/19/1957,69301,308-726-2182,308-250-6987,flo.bookamer#cox.net,http://www.simontonhoweschneiderpc.com,;
Jani,Biddy,61556 W 20th Ave,Seattle,King,WA,8/7/1966,98104,206-711-6498,206-395-6284,jbiddy#yahoo.com,http://www.warehouseofficepaperprod.com,;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804,407-413-4842,407-557-8857,chauncey_motley#aol.com,http://www.affiliatedwithtravelodge.com
"""
rows = empRecords.strip().split(";")
data = [ r.strip().split(",") for r in rows ]
then you can use any condition to filter the list, like
print ( [ "Name: " + emp[1] + "," + emp[0] + "| Current City: " + emp[3] for emp in data if emp[3] == "Washington" ] )
['Name: Hixenbaugh,Alesia| Current City: Washington']

Check for a build up name if it exist in a list of names -python

I'm trying to build a cvs file that has all Active directory fields I need, from 2 external files.
the first file has a list of users that need to be created and some other info relevant to an AD object
and the second report is a list of exported SamAccountName and emails dumped from AD. So what I want to do is create a unique SamAccountName, I form my test SamaAccountName from the firstname and lastname of the first report and wan to compare it vs the second report. I'm currently storing ins alist all the data I get from the second report and I want to check my generated SamAccountName exists in that list
so far I'm not able to do so and only get a csv with the SamAccoutnNames I made up( it does not do the check )
note: I can't use any other plugin to check directly to Active Directory
import csv
def getSamA(fname, lname):
Sams = []
sama = lname[0:5] + fname[0:2]
with open('test-input.txt','r') as AD:
rows = csv.DictReader(AD)
for ad in rows:
Sams.append(ad['SamAccountName'])
#check if built sama is in list
if sama in Sams:
#if sama in list, add one more character to sama
sama = lname[0:5] + fname[0:3]
return sama.lower()
else:
return sama.lower()
with open('users.csv') as csv_file:
rows = csv.DictReader(csv_file)
with open('users-COI2-Multi.csv', 'w', newline='') as output:
header = ['FirstName','Initials','LastName','DisplayName','Description','Office','TelePhone','UserLogonName','SamAccountName','JobTitle','Department','Manager','Mobile','faxNumber','Notes','Assistant','employeeID','ex1','ex2','ex3','ex15','Office365License','ExpiresInDays','EmailToUSer','AddToGroup']
output_file = csv.DictWriter(output, fieldnames=header, delimiter=';')
output_file.writeheader()
for data in rows:
employeeId = data['Associate ID']
fName = data['First Name']
lName = data['Last Name']
Location = data['Location']
Department = data['Department']
Manager = data['Manager Name']
JobTitle = data['Title']
context = {
'FirstName' : fName,
'Initials' : getInitials(fName, lName),
'LastName' : lName,
'DisplayName' : getDisplayName(fName, lName),
'Description' : 'Account for: '+getDisplayName(fName, lName),
'Office': getOffice(Location).strip(),
'TelePhone' : '+1 XXX XXX XXXX',
'UserLogonName' : getMail(fName, lName),
'SamAccountName' : getSamA(fName, lName),
'JobTitle' : JobTitle,
'Department' : Department,
'Manager' : Manager,
'Mobile' : '',
'faxNumber' : '',
'Notes' : '',
'Assistant' : '',
'employeeID' : employeeId,
'ex1' : 'End User',
'ex2' : 'NoMailbox',
'ex3' : getSiteCode(Location),
'ex15' : getSKID(Location),
'Office365License' : '',
'ExpiresInDays' : '',
'EmailToUSer' : 'user#test.com',
'AddToGroup' : '',
}
output_file.writerow(context)

Error when creating dictionaries from text files

I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And so on. The first name is an authors name (last name first, first name last), followed by awards they may have won, and then authors who are similar to them. This is what I've got so far:
def load_author_dicts(text_file, similar_authors, awards_authors):
name_of_author = True
awards = False
similar = False
for line in text_file:
if name_of_author:
author = line.split(', ')
nameA = author[1].strip() + ' ' + author[0].strip()
name_of_author = False
awards = True
continue
if awards:
if ',' in line:
awards = False
similar = True
else:
if nameA in awards_authors:
listawards = awards_authors[nameA]
listawards.append(line.strip())
else:
listawards = []
listawards.append(line.strip()
awards_authors[nameA] = listawards
if similar:
if line == '\n':
similar = False
name_of_author = True
else:
sim_author = line.split(', ')
nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
if nameA in similar_authors:
similar_list = similar_authors[nameA]
similar_list.append(nameS)
else:
similar_list = []
similar_list.append(nameS)
similar_authors[nameA] = similar_list
continue
This works great! However, if the text file contains an entry with just a name (i.e. no awards, and no similar authors), it screws the whole thing up, generating an IndexError: list index out of range at this part Zname = sim_author[1].strip()+" "+sim_author[0].strip() )
How can I fix this? Maybe with a 'try, except function' in that area?
Also, I wouldn't mind getting rid of those continue functions, I wasn't sure how else to keep it going. I'm still pretty new to this, so any help would be much appreciated! I keep trying stuff and it changes another section I didn't want changed, so I figured I'd ask the experts.

How about doing it this way, just to get the data in, then manipulate the dictionary any ways you want.
test.txt contains your data
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And my code to parse it.
award_parse.py
data = {}
name = ""
awards = []
f = open("test.txt")
for l in f:
# make sure the line is not blank don't process blank lines
if not l.strip() == "":
# if this is a name and we're not already working on an author then set the author
# otherwise treat this as a new author and set the existing author to a key in the dictionary
if "," in l and len(name) == 0:
name = l.strip()
elif "," in l and len(name) > 0:
# check to see if recipient is already in list, add to end of existing list if he/she already
# exists.
if not name.strip() in data:
data[name] = awards
else:
data[name].extend(awards)
name = l.strip()
awards = []
# process any lines that are not blank, and do not have a ,
else:
awards.append(l.strip())
f.close()
for k, v in data.items():
print("%s got the following awards: %s" % (k,v))

Extracting data with BeautifulSoup and output to CSV

As mentioned in the previous questions, I am using Beautiful soup with python to retrieve weather data from a website.
Here's how the website looks like:
<channel>
<title>2 Hour Forecast</title>
<source>Meteorological Services Singapore</source>
<description>2 Hour Forecast</description>
<item>
<title>Nowcast Table</title>
<category>Singapore Weather Conditions</category>
<forecastIssue date="18-07-2016" time="03:30 PM"/>
<validTime>3.30 pm to 5.30 pm</validTime>
<weatherForecast>
<area forecast="TL" lat="1.37500000" lon="103.83900000" name="Ang Mo Kio"/>
<area forecast="SH" lat="1.32100000" lon="103.92400000" name="Bedok"/>
<area forecast="TL" lat="1.35077200" lon="103.83900000" name="Bishan"/>
<area forecast="CL" lat="1.30400000" lon="103.70100000" name="Boon Lay"/>
<area forecast="CL" lat="1.35300000" lon="103.75400000" name="Bukit Batok"/>
<area forecast="CL" lat="1.27700000" lon="103.81900000" name="Bukit Merah"/>`
<channel>
I managed to retrieve the information I need using these codes :
import requests
from bs4 import BeautifulSoup
import urllib3
#getting the ValidTime
r = requests.get('http://www.nea.gov.sg/api/WebAPI/?
dataset=2hr_nowcast&keyref=781CF461BB6606AD907750DFD1D07667C6E7C5141804F45D')
soup = BeautifulSoup(r.content, "xml")
time = soup.find('validTime').string
print "validTime: " + time
#getting the date
for currentdate in soup.find_all('item'):
element = currentdate.find('forecastIssue')
print "date: " + element['date']
#getting the time
for currentdate in soup.find_all('item'):
element = currentdate.find('forecastIssue')
print "time: " + element['time']
for area in soup.find('weatherForecast').find_all('area'):
area_attrs_li = [area.attrs for area in soup.find('weatherForecast').find_all('area')]
print area_attrs_li
Here are my results :
{'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West',
'forecast': u'LR'}, {'lat': u'1.31200000', 'lon': u'103.86200000', 'name':
u'Kallang', 'forecast': u'LR'},
How do I remove u' from the result? I tried using the method I found while googling but it doesn't seem to work
I'm not strong in Python and have been stuck at this for quite a while.
EDIT : I tried doing this :
f = open("C:\\scripts\\nea.csv" , 'wt')
try:
for area in area_attrs_li:
writer = csv.writer(f)
writer.writerow( (time, element['date'], element['time'], area_attrs_li))
finally:
f.close()
print open("C:/scripts/nea.csv", 'rt').read()
It worked however, I would like to split the area apart as the records are duplicates in the CSV :
Thank you.

EDIT 1 -Topic:
You're missing escape characters:
C:\scripts>python neaweather.py
File "neaweather.py", line 30
writer.writerow( ('time', 'element['date']', 'element['time']', 'area_attrs_li') )
writer.writerow( ('time', 'element[\'date\']', 'element[\'time\']', 'area_attrs_li')
^
SyntaxError: invalid syntax
EDIT 2:
if you want to insert values:
writer.writerow( (time, element['date'], element['time'], area_attrs_li) )
EDIT 3:
to split the result to different lines:
for area in area_attrs_li:
writer.writerow( (time, element['date'], element['time'], area)
EDIT 4:
The splitting is not correct at all, but it shall give a better understanding of how to parse and split data to change it for your needs.
to split the area element again as you show in your image, you can parse it
for area in area_attrs_li:
# cut off the characters you don't need
area = area.replace('[','')
area = area.replace(']','')
area = area.replace('{','')
area = area.replace('}','')
# remove other characters
area = area.replace("u'","\"").replace("'","\"")
# split the string into a list
areaList = area.split(",")
# create your own csv-seperator
ownRowElement = ';'.join(areaList)
writer.writerow( (time, element['date'], element['time'], ownRowElement)
Offtopic:
This works for me:
import csv
import json
x="""[
{'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West','forecast': u'LR'}
]"""
jsontxt = json.loads(x.replace("u'","\"").replace("'","\""))
f = csv.writer(open("test.csv", "w+"))
# Write CSV Header, If you dont need that, remove this line
f.writerow(['lat', 'lon', 'name', 'forecast'])
for jsontext in jsontxt:
f.writerow([jsontext["lat"],
jsontext["lon"],
jsontext["name"],
jsontext["forecast"],
])

Looping over a list within a dict

I'm new to python, so bear with me.
I have a dict containing lists:
ophav = {'ill': ['Giunta, John'], 'aut': ['Fox, Gardner', 'Doe, John'], 'clr': ['Mumle, Mads'], 'trl': ['Cat, Fat']}
The key names ('ill', 'aut', ...) and the number of items in the lists will be different on each run of the script.
I'd love to do something like:
opfmeta = []
for key, person in ophav.items():
opfmeta.append('<dc:creator role="' + key + '">' + person + '</dc:creator>')
I know this is not working ("cannot concatenate 'str' and 'list' objects") - I have to loop over the list within the dict somehow. How do I do that?
Edit: I need separate entries for each person, like:
<dc:creator role="ill">Fox, Gardner</dc:creator>
<dc:creator role="ill">Doe, John</dc:creator>

You can do that using ' & '.join():
opfmeta = []
for key, person in ophav.items():
opfmeta.append('<dc:creator role="' + key + '">' + ' & '.join(person) + '</dc:creator>')
This will join all the elements of the list together with the specified delimiter (in this case ' & ') so your result will something like this:
<dc:creator role="ill">Fox, Gardner & Doe, John</dc:creator>
You can check out the full working demonstration HERE
Answer to your updated question:
ophav = {'ill': ['Giunta, John'], 'aut': ['Fox, Gardner', 'Doe, John'], 'clr': ['Mumle, Mads'], 'trl': ['Cat, Fat']}
opfmeta = []
for key, person in ophav.items():
for i in person:
opfmeta.append('<dc:creator role="' + key + '">' + i + '</dc:creator>')
for i in opfmeta:
print i
[OUTPUT]
<dc:creator role="ill">Giunta, John</dc:creator>
<dc:creator role="aut">Fox, Gardner</dc:creator>
<dc:creator role="aut">Doe, John</dc:creator>
<dc:creator role="clr">Mumle, Mads</dc:creator>
<dc:creator role="trl">Cat, Fat</dc:creator>
NEW DEMO

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scrapy add.xpath or join xpath - python

Related

Search in List; Display names based on search input

Check for a build up name if it exist in a list of names -python

Error when creating dictionaries from text files

Extracting data with BeautifulSoup and output to CSV

Looping over a list within a dict

Categories

Resources