How to scrape multiple websites with different data in URLs - Python

I'm scraping some data from a web page whose URL ends with the ID of the product. The script appears to rewrite the data on every single row, as if it's not appending the data from the next product. I don't know exactly what's going on, whether my first for loop is wrong or the indentation is. I tried before without the dictionary, and it was appending, but everything landed on the same line; I transposed it but it didn't work as I wanted, so I rewrote it this way, and now it doesn't append the next lines. Help, please.
import urllib2
import bs4 as bs
import pandas as pd

data_cols = []
cols = {'pro_header': [],
        'pro_id': [],
        # ...
        'pro_uns5': []
        }

# the id for each product
fileID = open('idProductsList.txt', 'r')
proIDS = fileID.read().split()

for proID in proIDS:
    url = 'https://website.com/mall/es/mx/Catalog/Product/' + proID
    html = urllib2.urlopen(url).read()
    soup = bs.BeautifulSoup(html, 'lxml')
    table = soup.find("table", {"class": "ProductDetailsTable"})
    rows = table.find_all('tr')
    for row in rows:
        labels.append(str(row.find_all('td')[0].text))
        try:
            data.append(str(row.find_all('td')[1].text))
        except IndexError:
            data.append('')
    cols['pro_header'].append(data[0])
    cols['pro_id'].append(data[1])
    # ...
    cols['pro_uns5'].append(data[43])

df = pd.DataFrame(cols)
df.set_index
#df.reindex()
df.to_csv('sample1.csv')
The actual output is:
pro_id pro_priceCostumer pro_priceData
1FK7011-5AK24-1AA3 " Mostrar precios
" PM300:Producto activo
1FK7011-5AK24-1AA3 " Mostrar precios
" PM300:Producto activo
1FK7011-5AK24-1AA3 " Mostrar precios
" PM300:Producto activo
Should be something like this (This is just a small representation of the data):
pro_id pro_priceCostumer pro_priceData
1FK7011-5AK24-1AA3 " Mostrar precios
" PM300:Producto activo
1FK7011-5AK24-1JA3 " Mostrar precios
" PM300:Producto activo
1FK7022-5AK21-1UA0 " Mostrar precios
" PM300:Producto activo

I guess labels is being used as a plain variable. To append to it, you need a list:
add labels = list() at the top of your code as a global variable. The same thing should be done for data too.
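A minimal sketch of that fix against the question's loop. Clearing both lists on every pass of the product loop is an extra assumption beyond this answer, but without it data[0]..data[43] keep indexing the first product's rows, which would explain the repeated output above:

labels = list()  # defined once at the top, as suggested
data = list()

for proID in proIDS:
    del labels[:]  # assumption: reset per product so the indexes below
    del data[:]    # point at the current product, not the first one
    # ... fetch and parse the product table as in the question ...
    cols['pro_header'].append(data[0])
    # ... the remaining cols[...].append(data[i]) lines ...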


Trying to create a streamlit app that uses user-provided URLs to scrape and return a downloadable df

I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax but I just cannot get this to work as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I try running the same script with Streamlit syntax in the streamlit.py script I get the above error. Once you run streamlit run streamlit.py in your terminal and have a look at the UI I've created, it should be quite clear what I'm aiming at, which is providing the user with a csv of all ingredients in the recipe URLs they provided, for a convenient grocery shopping list.
Any help would be greatly appreciated!
import numpy as np
import pandas as pd
from recipe_scrapers import scrape_me
# replace_measurement_symbols and justify are helper functions defined elsewhere in the repo

def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk",
        "snufje zout" have been omitted from the ingredient list
    """
    df_list = []
    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())
        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")
        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name
                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)
                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2
                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError):
                pass
    df = pd.concat(df_list)
    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui',),  # trailing comma so this is a tuple, not a string
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook',),
        'Kipworst': ('Gekruide kipworst',),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma',)
    }
    reverse_label_ing = {x: k for k, v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)
    print("Assigning ingredient categories")
    category_dict = {
        'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
        'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
                   'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
                   'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
                   'Spaghetti', 'Witte tortilla'),
        'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
                     'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
                     'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
                     'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
                     'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
                     'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
                     'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
                     'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
                     'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
                     'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
        'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
                    'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
                    'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
                    'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
                    'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
                    'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
                    'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
                    'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
                    'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
                    'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
                    'Zonnebloemolie', 'Zwarte balsamicoazijn'),
        'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
                  'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
                  'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
        'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
                   'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
                   'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
    }
    reverse_label_cat = {x: k for k, v in category_dict.items() for x in v}
    df["Category"] = df["Ingredients"].map(reverse_label_cat)
    col = "Category"
    first_col = df.pop(col)
    df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])
    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)
    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False).apply(
        lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))
    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
           .dropna(how='all', subset=oth_cols, axis=0))
    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # to drop rows that are being duplicated with 0 for some reason; will check later
    print("Processing complete!")
    return res
Your function create_df needs a list as an argument, but st.text_input always returns a string.
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). But if you need to handle multiple URLs, you should use str.split, like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
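If the user might type spaces around the commas, a slightly more defensive variant (just a sketch, not part of the original answer) strips whitespace and skips empty entries:

def create_df(recipes):
    # tolerate input like "url1, url2 , url3" and drop empty entries
    recipes = [r.strip() for r in recipes.split(",") if r.strip()]
    ### rest of the code ###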

More efficient way to manipulate large dataframe

It's my first real Python script, so feel free to make comments in order to improve my code.
The purpose of this script is to extract 2 Oracle tables with Python, store them in a dataframe and then join them with pandas.
But for queries returning more than 500k rows it feels slow. Do you know why?
import pandas as pd
from datetime import date
from sqlalchemy import create_engine
import cx_Oracle, time
import config

## Timer variable
start = time.time()

## User input on the command line
year = input('Enter a year: ')
month = input('Enter the month, in MM format: ')
societe_var = input('SA (APPLE,PEACH,BANANA,ANANAS,ALL) : ')

## SAs and the BU lists matching each SA
sa_list = ['APPLE', 'PEACH', 'BANANA', 'ANANAS']
bu_list_APPLE = ['006111','1311402','1311403','1311404','1340115','13411106','1311407','1111','6115910','1166157','6811207','8311345','1111','1188100','8118101','8811102','8810113','8811104','8118105','8811106','8811107','8118108','1111']
bu_list_PEACH = ['131400','310254']
bu_list_BANANA = ['0151100','1110073','1007115','1311335','1113340','1311341','1113342','1331143']
bu_list_ANANAS = ['1211345','13111345','11113395','73111345']

# Points to the right BU list for the SA that was entered
bu_list_map = {
    'APPLE': bu_list_APPLE,
    'PEACH': bu_list_PEACH,
    'BANANA': bu_list_BANANA,
    'ANANAS': bu_list_ANANAS
}

if societe_var == 'ALL':
    print('not implemented yet')
elif societe_var in sa_list:
    bu_list = bu_list_map.get(societe_var)
    sa_var = societe_var
    i = 1
    for bu in bu_list:
        start_bu = time.time()
        ## Load the SQL query with the right variables for gla_va_parametre -- EPOST
        query1 = open('gla_va_parametre - VAR.sql', "r").read()
        query1 = query1.replace('#ANNEE', "'" + year + "'").replace('%MOIS%', "'" + month + "'").replace('%SA%', "'" + societe_var + "'").replace('%BUGL%', "'" + bu + "'").replace('%DIVISION%', '"C__en__PS_S1_D_OP_UNIT13".OPERATING_UNIT')
        ## Load the SQL query with the right variables for cle-gla_tva -- FPOST
        query2 = open('cle-gla_tva - VAR.sql', "r").read()
        query2 = query2.replace('#ANNEE', "'" + year + "'").replace('%MOIS%', "'" + month + "'").replace('%SA%', "'" + societe_var + "'").replace('%BUGL%', "'" + bu + "'").replace('%DIVISION%', 'OPERATING_UNIT')
        # Connection parameters
        connection_EPOST = cx_Oracle.connect(user=config.user_EPOST, password=config.password_EPOST, dsn=config.host_EPOST)
        connection_FPOST = cx_Oracle.connect(user=config.user_FPOST, password=config.password_FPOST, dsn=config.host_FPOST)
        ## Fetch the EPOST part
        with connection_EPOST:
            # Empty list to collect the chunks
            dfl = []
            # Empty DataFrame
            dfs = pd.DataFrame()
            z = 1
            # Start chunking
            for chunk in pd.read_sql(query1, con=connection_EPOST, chunksize=25000):
                # Append each data chunk from the SQL result set to the list
                dfl.append(chunk)
                print('chunk num : ' + str(z))
                z = z + 1
            # Concatenate the chunks from the list into one dataframe
            dfs = pd.concat(dfl, ignore_index=True)
            print('parameters fetched')
        ## Fetch the FPOST part
        with connection_FPOST:
            # Empty list to collect the chunks
            df2 = []
            # Empty DataFrame
            dfs2 = pd.DataFrame()
            # Start chunking
            for chunk in pd.read_sql(query2, con=connection_FPOST, chunksize=10000):
                # Append each data chunk from the SQL result set to the list
                df2.append(chunk)
            # Concatenate the chunks from the list into one dataframe
            dfs2 = pd.concat(df2, ignore_index=True)
            print('keys fetched')
        print('Starting the join')
        jointure = pd.merge(dfs, dfs2, how='left', left_on=['Code_BU_GL','Code_division','Code_ecriture','Date_comptable','Code_ligne_ecriture','UNPOST_SEQ'], right_on=['BUSINESS_UNIT','OPERATING_UNIT','JOURNAL_ID','JOURNAL_DATE','JOURNAL_LINE','UNPOST_SEQ']).drop(columns=['BUSINESS_UNIT','OPERATING_UNIT','JOURNAL_ID','JOURNAL_DATE','JOURNAL_LINE'])
        jointure.to_csv('out/gla_va_' + year + month + "_" + societe_var + "_" + bu + "_" + date.today().strftime("%Y%m%d") + '.csv', index=False, sep='|')
        print('File ' + str(i) + "/" + str(len(bu_list)) + ' generated in: ' + str(time.time() - start_bu) + ' seconds')
        i = i + 1
    print("Extraction of the SA " + societe_var + " scope completed in: " + str((time.time() - start) / 60) + " min")

Python | Modifying objects that are not bound together still modifies both

I have a problem concerning copying object attributes and making sure the attributes are not bound together.
I am implementing data tables in Python, whose attributes are:
rows : list[list]
column_names : list[str]
I want a method that copies a table, to make modifications for instance, but I want the copy to be fully independent from the original table so I can take a non-destructive approach.
def copy(self):
    print(Fore.YELLOW + "COPY FCN BEGINS" + Fore.WHITE)
    rows = list(self.rows)
    # FOR DEBUG PURPOSES
    pprint(rows)
    pprint(self.rows)
    print(f'{rows is self.rows}')
    names = list(self.column_names)
    # FOR DEBUG PURPOSES
    pprint(names)
    pprint(self.column_names)
    print(f'{names is self.column_names}')
    print(Fore.YELLOW + "COPY FCN ENDS" + Fore.WHITE)
    return Tableau(rows=rows, column_names=names)
Then I test it:
creating a table and copying it into another,
then I modify the new table and check that it modified the latter only.
Problem: it modifies both.
However, I made sure that the rows lists were not pointing to the same object, so I am a bit confused.
Here are the results:
[image: result of the copy function]
And here is the test function (using unittest):
def test_copy(self):
    # make sure the two Tableaux are identical after the copy, but different
    # once one of them is modified (they must not share the same list in
    # terms of memory address)
    # copy
    NewTable = self.TableauA.copy()
    self.assertEqual(NewTable.rows, self.TableauA.rows)
    self.assertEqual(NewTable.column_names, self.TableauA.column_names)
    print("Row B")
    pprint(self.rowB)
    print("New Table")
    print(NewTable)
    print("tableau A")
    print(self.TableauA)
    print(Fore.GREEN + "IS THE SAME OBJECT ?" + Fore.WHITE)
    print(f"{NewTable is self.TableauA}")
    print(Fore.GREEN + "ROWS IS THE SAME OBJECT ?" + Fore.WHITE)
    print(f"{NewTable.rows is self.TableauA.rows}")
    print(Fore.GREEN + "NAMES IS THE SAME OBJECT ?" + Fore.WHITE)
    print(f"{NewTable.column_names is self.TableauA.column_names}")
    # modify the new Tableau
    NewTable.add_column(name="NewCol", column=self.rowB)
    print(Fore.YELLOW + "MODIFICATION" + Fore.WHITE)
    print(Fore.GREEN + "New Table" + Fore.WHITE)
    print(NewTable)
    print(Fore.GREEN + "tableau A" + Fore.WHITE)
    print(self.TableauA)
    # make sure we did not modify the rows in both tables
    self.assertNotEqual(NewTable.rows, self.TableauA.rows)
    return
And the results:
[image: results of the test function]
And finally, the add_column method:
def add_column(self, name: str, column: list, position: int = -1):
    n = len(self.rows)
    if position == -1:
        position = n
    for k in range(n):
        self.rows[k].insert(position, column[k])
    self.column_names.insert(position, name)
    return
Thank you !
I found it; in the end it was very subtle.
rows is a list of lists: the outermost list was indeed unique,
but its contents were not: the inner observation lists were still tied together, even after the list() call.
Here is how I made them unique:
rows = [ list( self.rows[k] ) for k in range( len(self.rows) ) ]
And here is the final code that works:
def copy(self):
    rows = [list(self.rows[k]) for k in range(len(self.rows))]
    names = list(self.column_names)
    return Tableau(rows=rows, column_names=names)
hope this will help others
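As a side note (not part of the original answer), copy.deepcopy from the standard library duplicates arbitrarily nested structures in one call, so the Tableau.copy method could also be written as:

from copy import deepcopy

def copy(self):
    # deepcopy duplicates rows and every nested row list in one call
    return Tableau(rows=deepcopy(self.rows),
                   column_names=list(self.column_names))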

Search in List; Display names based on search input

I have looked through different articles here about searching data in a list, but nothing seems to work right or fit what I am supposed to implement.
I have this pre-created module with over 500 records (they are strings, yes, but they are treated as a list when passed into the function; see the code below) of names, city, email, etc. The following is just a chunk of it.
empRecords="""Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114,386-248-4118,386-208-6976,joles#gmail.com,http://www.paganophilipgesq.com,;
Alesia,Hixenbaugh,9 Front St,Washington,District of Columbia,DC,3/3/2000,20001,202-646-7516,202-276-6826,alesia_hixenbaugh#hixenbaugh.org,http://www.kwikprint.com,;
Lai,Harabedian,1933 Packer Ave #2,Novato,Marin,CA,1/5/2000,94945,415-423-3294,415-926-6089,lai#gmail.com,http://www.buergimaddenscale.com,;
Brittni,Gillaspie,67 Rv Cent,Boise,Ada,ID,11/28/1974,83709,208-709-1235,208-206-9848,bgillaspie#gillaspie.com,http://www.innerlabel.com,;
Raylene,Kampa,2 Sw Nyberg Rd,Elkhart,Elkhart,IN,12/19/2001,46514,574-499-1454,574-330-1884,rkampa#kampa.org,http://www.hermarinc.com,;
Flo,Bookamer,89992 E 15th St,Alliance,Box Butte,NE,12/19/1957,69301,308-726-2182,308-250-6987,flo.bookamer#cox.net,http://www.simontonhoweschneiderpc.com,;
Jani,Biddy,61556 W 20th Ave,Seattle,King,WA,8/7/1966,98104,206-711-6498,206-395-6284,jbiddy#yahoo.com,http://www.warehouseofficepaperprod.com,;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804,407-413-4842,407-557-8857,chauncey_motley#aol.com,http://www.affiliatedwithtravelodge.com
"""
a = empRecords.strip().split(";")
And I have the following code for searching:
import empData as x

def seecity():
    empCitylist = list()
    for ct in x.a:
        empCt = ct.strip().split(",")
        empCitylist.append(empCt)
    t = sorted(empCitylist, key=lambda x: x[3])
    for c in t:
        city = (c[3])
        print(city)
    live_city = input("Enter city: ")
    for cy in city:
        if live_city in cy:
            print(c[1])
            # print("Name: "+ c[1] + ",", c[0], "| Current City: " + c[3])
Forgive my idiotic approach as I am new to Python. What I am trying to do: the user inputs a city, and the results should display the last name and first name of every employee living in that city (I dunno if I made sense lol).
By the way, the code I used above doesn't return any answers; it just loops back to the input prompt.
Thank you for helping. Lovelots. <3
PS: the format of the empData is: first name, last name, address, city, country, birthday, zip, phone, and email
You can use the csv module to easily read a file with comma-separated values:
import csv

with open('test.csv', newline='') as csvfile:
    records = list(csv.reader(csvfile))

def search(data, elem, index):
    out = list()
    for row in data:
        if row[index] == elem:
            out.append(row)
    return out

# test
print(search(records, 'Orlando', 3))
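Since the question's data lives in a string rather than a file, the same reader also works through io.StringIO (a sketch reusing the search function above; the ';' record separators are stripped first):

import csv
import io

# strip the ';' separators so csv.reader sees plain comma-separated lines
cleaned = empRecords.replace(";", "").strip()
records = list(csv.reader(io.StringIO(cleaned)))
print(search(records, 'Orlando', 3))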
Based on your original code, you can do it like this:
# Make list of list records, sorted by city
t = sorted((ct.strip().split(",") for ct in x.a), key=lambda x: x[3])

# List cities
print("Cities in DB:")
for c in t:
    city = (c[3])
    print("-", city)

# Define search function
def seecity():
    live_city = input("Enter city: ")
    for c in t:
        if live_city == c[3]:
            print("Name: " + c[1] + ",", c[0], "| Current City: " + c[3])

seecity()
Then, after you understand what's going on, do as @Hoxha Alban suggested, and use the csv module.
The beauty of python lies in list comprehension.
empRecords="""Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114,386-248-4118,386-208-6976,joles#gmail.com,http://www.paganophilipgesq.com,;
Alesia,Hixenbaugh,9 Front St,Washington,District of Columbia,DC,3/3/2000,20001,202-646-7516,202-276-6826,alesia_hixenbaugh#hixenbaugh.org,http://www.kwikprint.com,;
Lai,Harabedian,1933 Packer Ave #2,Novato,Marin,CA,1/5/2000,94945,415-423-3294,415-926-6089,lai#gmail.com,http://www.buergimaddenscale.com,;
Brittni,Gillaspie,67 Rv Cent,Boise,Ada,ID,11/28/1974,83709,208-709-1235,208-206-9848,bgillaspie#gillaspie.com,http://www.innerlabel.com,;
Raylene,Kampa,2 Sw Nyberg Rd,Elkhart,Elkhart,IN,12/19/2001,46514,574-499-1454,574-330-1884,rkampa#kampa.org,http://www.hermarinc.com,;
Flo,Bookamer,89992 E 15th St,Alliance,Box Butte,NE,12/19/1957,69301,308-726-2182,308-250-6987,flo.bookamer#cox.net,http://www.simontonhoweschneiderpc.com,;
Jani,Biddy,61556 W 20th Ave,Seattle,King,WA,8/7/1966,98104,206-711-6498,206-395-6284,jbiddy#yahoo.com,http://www.warehouseofficepaperprod.com,;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804,407-413-4842,407-557-8857,chauncey_motley#aol.com,http://www.affiliatedwithtravelodge.com
"""
rows = empRecords.strip().split(";")
data = [ r.strip().split(",") for r in rows ]
then you can use any condition to filter the list, like
print ( [ "Name: " + emp[1] + "," + emp[0] + "| Current City: " + emp[3] for emp in data if emp[3] == "Washington" ] )
['Name: Hixenbaugh,Alesia| Current City: Washington']

Scrapy add.xpath or join xpath

I hope everyone is doing well.
I have this code (part of it) for a spider. This is the last part of the scraping: here it scrapes and then writes to the csv file. My doubt is whether it is possible to join or combine XPath results so they end up in a single field of the file. For example, given this markup:
<h5>Soundbooster</h5> <br><br>
<p class="details">
<b>Filtro attuale</b>
</p>
<blockquote>
<p>
<b>Catalogo:</b>
Aliant</br>
<b>Marca e Modello:</b>
Mazda - 3 </br>
<b>Versione:</b>
(3th gen) 2013-now (Petrol)
</p>
</blockquote>
I want to join the following into one field in the csv file; it should be something like this:
Soundbooster per Mazda - 3 - (3th gen) 2013-now (Petrol)
And here is where I am lost. Is it possible? I don't know whether I have to use add_xpath, join, or another method, nor how to use it right.
This is part of my code:
def parse_content_details(self, response):
    exists = os.path.isfile("ntp/ntp_aliant.csv")
    with open("ntp/ntp_aliant.csv", "a+", newline='') as csvfile:
        fieldnames = ['*Action(SiteID=Italy|Country=IT|Currency=EUR|Version=745|CC=UTF-8)', '*Category', '*Title', 'Model', 'ConditionID', 'PostalCode',
                      'VATPercent', '*C:Marca', 'Product:EAN', '*C:MPN', 'PicURL', 'Description', '*Format', '*Duration', 'StartPrice', '*Quantity', 'PayPalAccepted', 'PayPalEmailAddress',
                      'PaymentInstructions', '*Location', 'ShippingService-1:FreeShipping', 'ShippingService-1:Option', 'ShippingService-1:Cost', 'ShippingService-1:Priority',
                      'ShippingService-2:Option', 'ShippingService-2:Cost', 'ShippingService-2:Priority', 'ShippingService-3:Option', 'ShippingService-3:Cost',
                      'ShippingService-3:Priority', 'ShippingService-4:Option', 'ShippingService-4:Cost', 'ShippingService-4:Priority', '*DispatchTimeMax',
                      '*ReturnsAcceptedOption', 'ReturnsWithinOption', 'RefundOption', 'ShippingCostPaidByOption']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if not exists:
            writer.writeheader()
        for ntp in response.css('div.content-1col-nobox'):
            name = ntp.xpath('normalize-space(//h5/text())').extract_first()
            brand = ntp.xpath('normalize-space(//div/blockquote[1]/p/text()[4])').extract_first()
            version = ntp.xpath('normalize-space(//div/blockquote[1]/p/text()[6])').extract_first()
            result = response.xpath(name + " per " + brand + " - " + version)
            MPN = ntp.xpath('normalize-space(//tr[2]/td[1]/text())').extract_first()
            description = ntp.xpath('normalize-space(//div[6]/div[1]/div[2]/div/blockquote[2]/p/text())').extract_first()
            price = ntp.xpath('normalize-space(//tr[2]/td[@id="right_cell"][1])').extract()[0].split(None, 1)[0].replace(",", ".")
            picUrl = response.urljoin(ntp.xpath('//div/p[3]/img/@src').extract_first())
            writer.writerow({
                '*Action(SiteID=Italy|Country=IT|Currency=EUR|Version=745|CC=UTF-8)': 'Add',
                '*Category': '30895',
                '*Title': name,
                'Model': result,
                'ConditionID': '1000',
                'PostalCode': '154',
                'VATPercent': '22',
                '*C:Marca': 'Priority Parts',
                'Product:EAN': '',
                '*C:MPN': MPN,
                'PicURL': picUrl,
                'Description': description,
                '*Format': 'FixedPrice',
                '*Duration': 'GTC',
                'StartPrice': price,
                '*Quantity': '3',
                'PayPalAccepted': '1',
                'PayPalEmailAddress': 'your@gmail.com',
                'PaymentInstructions': 'your@gmail.com',
                '*Location': 'Italia',
                'ShippingService-1:FreeShipping': '1',
                'ShippingService-1:Option': 'IT_OtherCourier3To5Days',
                'ShippingService-1:Cost': '10',
                'ShippingService-1:Priority': '1',
                'ShippingService-2:Option': 'IT_QuickPackage3',
                'ShippingService-2:Cost': '15',
                'ShippingService-2:Priority': '2',
                'ShippingService-3:Option': 'IT_QuickPackage1',
                'ShippingService-3:Cost': '12',
                'ShippingService-3:Priority': '3',
                'ShippingService-4:Option': 'IT_Pickup',
                'ShippingService-4:Cost': '0',
                'ShippingService-4:Priority': '4',
                '*DispatchTimeMax': '5',
                '*ReturnsAcceptedOption': 'ReturnsAccepted',
                'ReturnsWithinOption': 'Days_14',
                'RefundOption': 'MoneyBackOrExchange',
                'ShippingCostPaidByOption': 'Buyer'})
Any help will be appreciated.
Cheers.
Valter.
In the end @Casper was right; the comments show the right answer:
"{} per {} - {}".format(name, brand, version)
This is the final result:
name = ntp.xpath('normalize-space(//h5/text())').extract_first()
brand = ntp.xpath('normalize-space(//div/blockquote[1]/p//text()[4])').extract_first()
version = ntp.xpath('normalize-space(//div/blockquote[1]/p//text()[6])').extract_first()
result = "{} per {} - {}".format(name, brand, version)
writer.writerow({
    '*Title': result,
    # ... the remaining fields stay as in the original writerow ...
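On Python 3.6+, an f-string builds the same field and is arguably easier to read:

result = f"{name} per {brand} - {version}"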
