Remove HTML markup (getting the desired text) - python

When parsing data from an AJAX table, the values come back with the type "bs4.element.Tag" (checked via "type"), even though I specified the text attribute in the request, and I can't get the text I need out of the HTML markup (screenshots were attached to the original post). For example, the replace/strip etc. functions do not work with this type of data.
The class containing the number of comments is class_="tcl2". The problem is that I can't delete that 2 through re.sub, because the number of comments can itself be equal to 2, so the stray 2 remains.
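For reference, a bs4.element.Tag already exposes get_text()/.text for pulling just the visible text, so no regex over str(tag) is needed. A minimal sketch (the class name is taken from the question; the HTML snippet is an assumed example):

from bs4 import BeautifulSoup

# illustrative snippet: one table cell with the comment count, markup assumed
html = '<tr><td class="tc2 cntr">2</td></tr>'
soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('td', class_='tc2 cntr')
print(type(cell))                  # <class 'bs4.element.Tag'>
print(cell.get_text(strip=True))   # '2' - plain text, no markup left to strip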
Code:
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import json
import re
import time

catalog = {}

def parse():
    header = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.167 YaBrowser/22.7.5.940 Yowser/2.5 Safari/537.36"
    }
    session = requests.Session()
    retry = Retry(connect=1, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    inb = 1
    comp = 'all'
    while inb <= 1:
        url_not_based = (f"http://foodmarkets.ru/firms/filter_result/7/{comp}/0/posted/page{inb}/")
        session.mount(f'http://foodmarkets.ru/firms/filter_result/7/{comp}/0/posted/page{inb}/', adapter)
        r = session.get(url_not_based, verify=True, headers=header, timeout=5)
        soup = BeautifulSoup(r.text, "lxml")
        rounded_block = soup.find_all('tr')
        for round in rounded_block:
            round_сompany = round.find('td', class_='tcl'>'href')
            clear_comp1 = re.sub(r'[a-zA-Z<>/\t\n=''0-9.:""]', '', str(round_сompany))
            clear_comp2 = re.sub(r'[\xa0]', ' ', clear_comp1)
            round_сity = round.find('td', class_="tc2 nowrap")
            clear_city1 = re.sub(r'[a-zA-Z<>/\t\n=''0-9.:""]', '', str(round_сity))
            clear_city2 = re.sub(r'[\xa0]', ' ', clear_city1)
            round_сommment = round.find('td', class_="tc2 cntr")
            clear_comm1 = re.sub(r'[a-zA-Z<>""''/\t\n=.:]', '', str(round_сommment))
            if round_сompany in catalog:
                continue
            else:
                catalog[round_сompany] = {
                    "Company": clear_comp2,
                    "City": clear_city2,
                    "Comment": clear_comm1,
                }
        inb = inb + 1
        time.sleep(0.5)
    # print (catalog)
    # with open ("catalog.json","w",encoding='utf-8') as file:
    #     json.dump(catalog, file, indent=4, ensure_ascii=False)

if __name__ == "__main__":
    parse()

To get the data from the table into a DataFrame you can use the following code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://foodmarkets.ru/firms/filter_result/7/all/0/posted/page1/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for row in soup.select('tr:has(td.tcl)'):
    tds = [cell.get_text(strip=True, separator=' ') for cell in row.select('td')]
    all_data.append(tds)

df = pd.DataFrame(all_data, columns=['Название компании', 'Города', 'Комментариев', 'Последнее сообщение'])
print(df)
Prints:
Название компании Города Комментариев Последнее сообщение
0 РусКонфета ООО Санкт-Петербург 0 Сегодня 08:46 от ruskonfeta.ru
1 "СУББОТА" ООО Санкт-Петербург 0 Вчера 14:07 от limon4ik
2 "КАРАБАНОВСКИЙ КОНДИТЕРСКИЙ КОМБИНАТ" ООО Москва 1 Вчера 12:13 от kurmanskiy
3 Мажор ООО Москва 0 30.01.2023 23:11 от bstgroup
4 ОрионСвит ООО Минск 0 16.01.2023 09:28 от Boetc
5 КД Стайл ООО Санкт-Петербург 1 11.01.2023 15:00 от kozhemyakinaJ
6 БАСТИОН ООО Владивосток 0 10.01.2023 14:52 от dv_zakupka
7 ИП Давыдов Р.А. ИП Саратов 0 21.12.2022 07:53 от dfkthbz98
8 Гипермаркет Сити Центр ООО Назрань 0 28.11.2022 21:23 от Calibri
9 ЭКА ООО Краснодар 1 Вчера 08:49 от intense
10 Арсанукаев Бекхан Бадруддинович ИП Грозный 1 26.10.2022 08:33 от sale555
11 ООО "Хлебный Дом" ООО Симферополь 0 20.10.2022 07:39 от AlexMeganom
12 Горелый Николай Иванович ИП Брянск 0 18.10.2022 10:20 от Dinastya Vlad
13 АЛЬЯНС ПРОДУКТ ООО Орел 1 10.10.2022 12:32 от viola_vrn2017
14 ООО «ТК Русские Традиции» ООО Москва 1 25.11.2022 15:34 от ZefirVK
15 "Технотрейд" ООО Минск 0 23.09.2022 08:28 от Alejandros
16 ООО ТК АТЛАС ООО Киров 0 15.09.2022 09:47 от Sal291279
17 Владторг ООО Владивосток 4 25.01.2023 05:54 от Andrey_Bragin
18 Кондитерская фабрика "Золотая Русь" ООО Тула 0 30.08.2022 14:48 от ilya_aldagir
19 ООО "Кондитерская фабрика "Финтур" ООО Санкт-Петербург 1 15.08.2022 11:15 от dvp_wholesaler
20 Новая Система Услуг ООО Тамбов 0 04.08.2022 11:32 от NSU
21 Шидакова И.М. (Ника-Трейд) ИП Нальчик 1 17.10.2022 12:16 от otdelprodazh.6a
22 Лапин Вячеслав Геннадьевич ИП Белгород 4 24.01.2023 13:24 от Anton Bel
23 ТД Первый Вкус ИП Москва 5 18.01.2023 12:34 от pvkioni
24 ГУДДРИНКС ООО Москва 0 25.07.2022 12:49 от sergeiaustraliangroup.ru
25 ООО ГХП Бизнес Гифт ООО Москва 0 19.07.2022 14:51 от visss71
26 Винотека ООО Севастополь 0 13.07.2022 12:30 от Ooo.vinoteka
27 Череповецхлеб АО Череповец 0 06.07.2022 13:54 от Alexcher35
28 Лысанов ИВ(Лысанова ЛБ Мещуров СВ) ИП Пермь 6 10.11.2022 08:34 от Andrey_Bragin
29 ХОРЕКА РБ ООО Уфа 0 09.06.2022 23:52 от horecarb
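The same approach can be extended to several pages by looping over the page number in the URL (a sketch based on the URL pattern from the question; the page range is illustrative):

import requests
import pandas as pd
from bs4 import BeautifulSoup

all_data = []
for page in range(1, 4):  # illustrative range; adjust to the number of pages needed
    url = f'http://foodmarkets.ru/firms/filter_result/7/all/0/posted/page{page}/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for row in soup.select('tr:has(td.tcl)'):
        all_data.append([cell.get_text(strip=True, separator=' ') for cell in row.select('td')])

df = pd.DataFrame(all_data, columns=['Название компании', 'Города', 'Комментариев', 'Последнее сообщение'])
print(df)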


Beautiful Soup 4 Python Webscraping price from website (not the same)

Someone closed my post automatically, but this is not the same question as the other one. Please do not close it.
I am trying to get the price from the website. Can anyone please find the error in my code?
import requests
from bs4 import BeautifulSoup

def priceTracker():
    url = ' https://www.britishairways.com/travel/book/public/en_gb/flightList?onds=LGW-AMS_2022-3-11&ad=1&yad=0&ch=0&inf=0&cabin=M&flex=LOWEST&usedChangeSearch=true'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify)
    price = soup.find(class_=" heading-sm ng-tns-c70-4 ng-star-inserted").next_sibling
    # print(soup)  # test if soup working properly
    print(price)

while True:
    priceTracker()
I have attached a screenshot of the DOM around the price. I have also updated the URL (in case it does not work, go to the main website and press the search button to regenerate it).
The page is rendered through JavaScript. You can get the data through the API, but it requires a little work:
import requests
import pandas as pd

s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
s.get('https://www.britishairways.com', headers=headers)
cookies = s.cookies.get_dict()

cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
           'ba_integrity_tokenV1': '27f532c2f83fb5c560bcd25af3125d9658321fb753c1becc68735fd076ccbc46',
           'ba_api_context': 'https://www.britishairways.com/api/sc4',
           'ba_client_applicationName': 'ba.com',
           'Authorization': 'Bearer 09067a6cba44a44a7119a15c123064f6',
           'x-dtpc': '1$590599503_459h15vVJSUFBOHGPMRNQQHGCWULORUCSWNCPSO-0e0',
           'ba_client_sessionId': '72bb7a96-f635-4a55-bf5f-125f8c83c464',
           'Content-Type': 'application/json',
           'Referer': 'https://www.britishairways.com/travel/book/public/en_gb/flightList?onds=LGW-AMS_2022-3-11&ad=1&yad=0&ch=0&inf=0&cabin=M&flex=LOWEST&usedChangeSearch=true',
           'Cookie': cookieStr}

url = 'https://www.britishairways.com/api/sc4/badotcomadapter-paa/rs/v1/flightavailability/search;ondwanted=1'

payload = {"ondSearches": [
               {"originLocationCode": "LGW",
                "destinationLocationCode": "AMS",
                "departureDate": "2022-03-11"}],
           "cabin": "M",
           "ticketFlexibility": "LOWEST",
           "passengerMix": {
               "adultCount": 1,
               "youngAdultCount": 0,
               "childCount": 0,
               "infantCount": 0},
           "cug": 'false',
           "includeCalendar": 'true',
           "calendarDays": 3,
           "baIntegrityTokenV1": "27f532c2f83fb5c560bcd25af3125d9658321fb753c1becc68735fd076ccbc46"}

jsonData = s.post(url, json=payload, headers=headers).json()

calendarEntries = pd.DataFrame(jsonData['calendar']['calendarEntries'])
flightEvents = pd.json_normalize(jsonData['flightOption'], record_path=['flightEvents'])
availableCabinsForOption = pd.json_normalize(jsonData['flightOption'], record_path=['availableCabinsForOption'])
Output:
for table in [calendarEntries, flightEvents, availableCabinsForOption]:
    print(table)
date cheapestSegmentPrice cheapestJourneyPrice
0 2022-03-08 47.07 47.07
1 2022-03-09 54.07 54.07
2 2022-03-10 51.07 51.07
3 2022-03-11 80.07 80.07
4 2022-03-12 69.73 69.73
5 2022-03-13 51.07 51.07
6 2022-03-14 54.07 54.07
eventType duration ... aircraft.aircraftCode aircraft.aircraftName
0 FLIGHT_SEGMENT PT1H30M ... 319 Airbus A319 jet
1 FLIGHT_SEGMENT PT1H25M ... 320 Airbus A320 jet
2 FLIGHT_SEGMENT PT1H10M ... E90 Embraer E190SR
3 FLIGHT_SEGMENT PT1H25M ... 319 Airbus A319 jet
4 FLIGHT_SEGMENT PT1H5M ... E90 Embraer E190SR
5 FLIGHT_SEGMENT PT1H5M ... E90 Embraer E190SR
6 FLIGHT_SEGMENT PT1H20M ... 319 Airbus A319 jet
7 FLIGHT_SEGMENT PT1H25M ... 319 Airbus A319 jet
[8 rows x 32 columns]
availabilityInSellingClass ... fareBasisCode.BA2758
0 9 ... NaN
1 9 ... NaN
2 9 ... NaN
3 1 ... NaN
4 1 ... NaN
5 9 ... NaN
6 9 ... NaN
7 9 ... NaN
8 9 ... NaN
9 2 ... NaN
10 2 ... NaN
11 9 ... NaN
12 9 ... NaN
13 9 ... NaN
14 9 ... NaN
15 7 ... NaN
16 7 ... NaN
17 9 ... NaN
18 2 ... NaN
19 2 ... NaN
20 9 ... NaN
21 2 ... [KZ0RO]
22 2 ... [KV2RO]
23 9 ... [DMV2RO]
[24 rows x 33 columns]
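If only the headline price for the searched date is wanted, it can be read straight out of the calendarEntries frame shown above (a sketch; it assumes jsonData from the request above and uses the column names printed in that output):

# pick the cheapest journey price for the requested departure date
row = calendarEntries.loc[calendarEntries['date'] == '2022-03-11']
print(row['cheapestJourneyPrice'].iloc[0])   # 80.07 in the sample output above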
Try this:
import requests
from bs4 import BeautifulSoup

def priceTracker():
    url = ' https://www.britishairways.com/travel/book/public/en_gb/flightList?onds=LGW-AMS_2022-3-11&ad=1&yad=0&ch=0&inf=0&cabin=M&flex=LOWEST&usedChangeSearch=true'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    # print(soup.prettify)
    price = soup.find(class_="ng-star-inserted").text  # changed
    # print(soup)  # test if soup working properly
    print(price)

while True:
    priceTracker()

Pulling information from a Bid and Ask column in a stock exchange website

I am having trouble pulling prices from the Bid and Ask columns of this website: https://banggia.vps.com.vn/chung-khoan/derivative-VN30. Right now I can only pull the element with the class "price-table-content". How can I improve this code so that I can pull the prices in the Bid and Ask columns? Any help is greatly appreciated :)
from selenium import webdriver

options = webdriver.ChromeOptions()
options.headless = True
path = 'C:/Users/quank/PycharmProjects/pythonProject2/chromedriver.exe'
driver = webdriver.Chrome(executable_path=path, options=options)

url = 'https://banggia.vps.com.vn/chung-khoan/derivative-VN30'
driver.get(url=url)

element = driver.find_elements_by_css_selector('#root > div > div.content.undefined > div.derivative > table.price-table > tbody')
for i in element:
    print(i.get_attribute('outerHTML'))
Here is the result of running this code:
C:\Users\quank\PycharmProjects\Botthudulieu\venv\Scripts\python.exe
C:/Users/quank/PycharmProjects/pythonProject2/Botthudulieu.py
<tbody class="price-table-content"></tbody>
When you check the network activity you'll see that the data is retrieved from an API. So query the API directly rather than trying to scrape the site.
import requests
data = requests.get('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203').json()
Or with pandas:
import pandas as pd
df = pd.read_json('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203')
Resulting dataframe (truncated to the first columns for readability; the full frame has 31 columns per contract):
     id        sym  mc       c       f       r  lastPrice  lastVolume     lot  avePrice  highPrice  lowPrice  ...
0  2100  VN30F2109   4  1505.1  1308.3  1406.7       1420        6018  351832   1406.49     1422.9      1390  ...
1  2100  VN30F2110   4  1504.3  1307.5  1405.9       1418          14     462   1406.94       1422      1390  ...
2  2100  VN30F2112   4  1505.5  1308.7  1407.1       1420           1      54   1424.31       1420    1390.8  ...
3  2100  VN30F2203   4  1503.9  1307.3  1405.6       1420           1      50   1402.19       1420      1390  ...

[4 rows x 31 columns]

The remaining columns are fBVol, fBValue, fSVolume, fSValue, g1-g7, mkStatus, listing_status, matureDate, closePrice, ptVol, oi, oichange and lv.
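To answer the original question about the Bid and Ask columns: the per-symbol price ladder appears to live in the g1-g5 fields of this snapshot. Which fields are bids and which are asks is an assumption (the API is undocumented), so verify against the numbers shown on the site. A sketch that pulls just those columns:

import pandas as pd

df = pd.read_json('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203')

# assumption: g1-g3 look like bid levels and g4-g5 like ask levels in the snapshot above
bid_ask = df[['sym', 'g1', 'g2', 'g3', 'g4', 'g5']]
print(bid_ask)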

How to scrape a non-tabular list from Wikipedia and create a DataFrame?

en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
The link above contains un-tabulated data for the neighbourhoods of Istanbul.
I want to fetch these neighbourhoods into a data frame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response = requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList = soup.findAll('a', {'class': "new"})

neighborhoods = []
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)

df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
But some data is not fetched; check the list below and compare it to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that appear as plain list items rather than links with that class, e.g.:
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I develop the code into two columns, one for each 'Neighborhood' and one for its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response = requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList = soup.findAll('span', {'class': "toctext"})

districts = []
blocked_words = ['Neighbourhoods by districts', 'Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)

df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
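For the follow-up about getting two columns (each neighbourhood together with its district), one possible approach is to walk the district headings and collect the list items that follow each one. This is only a sketch: the heading/list structure of the article is an assumption and may need adjusting to the actual markup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')

rows = []
# assumption: each district is an h2/h3 heading followed by an ol/ul of neighbourhood items
for heading in soup.find_all(['h2', 'h3']):
    district = heading.get_text(strip=True).replace('[edit]', '')
    neighbourhood_list = heading.find_next_sibling(['ol', 'ul'])
    if neighbourhood_list is None:
        continue
    for li in neighbourhood_list.find_all('li'):
        rows.append({'District': district, 'Neighborhood': li.get_text(strip=True)})

df = pd.DataFrame(rows)
print(df)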

Python: splitting strings and converting them to a list that notices empty fields

It took me the whole day trying to fix this problem, but I didn't find a solution, so I hope you can help me. I have already managed to extract the data from the website. The problem is that I don't know how to split the list so that "500g" becomes "500", "g". On the website the quantity is sometimes just 1 and sometimes something like 1 1/2 kg. I then need to convert it into a CSV file and load it into a MySQL database. What I want in the end is a CSV file with the columns: ingredient ID, ingredient, quantity, and the unit of the quantity, for example:
0, meat, 500, g. This is the code I already have to extract the data from this website:
import re
from bs4 import BeautifulSoup
import requests
import csv

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []

def read_recipes():
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                ingredients.append([id2, *[ingredient.get_text(strip=True)]])
            else:
                ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])

read_recipes()
I hope you can help me. Thank you!
It appears that the strings containing fractions use the unicode symbols for 1/2 etc., so I think a good way of starting is replacing those by looking up the specific code and passing it to str.replace(). Splitting up the units and the amount for this example was easy, since they are separated by a space. But it might be necessary to generalize this more if you encounter other combinations.
The following code works for this specific example:
import re
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []
einheit = []

for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
    soup2 = BeautifulSoup(requests.get(url).content)
    for ingredient in soup2.select('.td-left'):
        # get rid of multiple spaces and replace 1/2 unicode character
        raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
        # split into unit and number
        splitlist = raw_string.split(" ")
        menge.append(splitlist[0])
        if len(splitlist) == 2:
            einheit.append(splitlist[1])
        else:
            einheit.append('')
    for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
        if ingredient.name == 'h3':
            continue
        else:
            ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])

result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
result.loc[:, "unit"] = einheit
result.loc[:, "amount"] = menge
Output:
>>> result
ID Ingredients unit amount
0 0 Beinscheibe(n), vom Rind, ca. 4 cm dick geschn... 4
1 0 Mehl etwas
2 0 Zwiebel(n) 1
3 0 Knoblauchzehe(n) 2
4 0 Karotte(n) 1
5 0 Lauchstange(n) 1
6 0 Staudensellerie 0.5
7 0 Tomate(n), geschält Dose 1
8 0 Tomatenmark EL 1
9 0 Rotwein zum Ablöschen
10 0 Rinderfond oder Fleischbrühe Liter 0.5
11 0 Olivenöl zum Braten
12 0 Gewürznelke(n) 2
13 0 Pimentkörner 10
14 0 Wacholderbeere(n) 5
15 0 Pfefferkörner
16 0 Salz
17 0 Pfeffer, schwarz, aus der Mühle
18 0 Thymian
19 0 Rosmarin
20 0 Zitrone(n), unbehandelt 1
21 0 Knoblauchzehe(n) 2
22 0 Blattpetersilie Bund 1
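From there, the CSV layout asked for in the question (ID, ingredient, amount, unit) is just a column reorder plus to_csv; the file name is illustrative:

# write the requested columns in the requested order; 'zutaten.csv' is just an example name
result[['ID', 'Ingredients', 'amount', 'unit']].to_csv('zutaten.csv', index=False, encoding='utf-8')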

Beautiful Soup: how to scrape multiple URLs and save the data to a CSV file

So I am wondering how to scrape multiple websites/URLs and save the data to a CSV file. Right now I can only save the first page. I have tried many different ways, but it doesn't seem to work. How can I save 5 pages in a CSV file and not only one?
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time

urls = ['https://store.steampowered.com/search/?specials=1&page=1', 'https://store.steampowered.com/search/?specials=1&page=2', 'https://store.steampowered.com/search/?specials=1&page=3', 'https://store.steampowered.com/search/?specials=1&page=4', 'https://store.steampowered.com/search/?specials=1&page=5']

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')

    data = []
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        title = container.find('span', attrs={'class': 'title'}).text
        if container.find('span', attrs={'class': 'win'}):
            win = '1'
        else:
            win = '0'
        if container.find('span', attrs={'class': 'mac'}):
            mac = '1'
        else:
            mac = '0'
        if container.find('span', attrs={'class': 'linux'}):
            linux = '1'
        else:
            linux = '0'
        data.append({
            'Title': title.encode('utf-8'),
            'Time': st,
            'Win': win,
            'Mac': mac,
            'Linux': linux})

with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Win', 'Mac', 'Linux', 'Time']
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(data)

testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
The issue is that you are re-initializing your data on each URL and only writing it after the very last iteration, meaning you'll always just have whatever data you got from the last URL. You need that data to keep appending rather than being overwritten on each iteration:
import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import timedelta
import datetime
import time

urls = ['https://store.steampowered.com/search/?specials=1&page=1', 'https://store.steampowered.com/search/?specials=1&page=2', 'https://store.steampowered.com/search/?specials=1&page=3', 'https://store.steampowered.com/search/?specials=1&page=4', 'https://store.steampowered.com/search/?specials=1&page=5']

results_df = pd.DataFrame()  # <-- initialize a results dataframe to dump/store the data you collect after each iteration

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')

    data = []  # <-- your data list is "reset" after each iteration of your urls
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')

    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        title = container.find('span', attrs={'class': 'title'}).text
        if container.find('span', attrs={'class': 'win'}):
            win = '1'
        else:
            win = '0'
        if container.find('span', attrs={'class': 'mac'}):
            mac = '1'
        else:
            mac = '0'
        if container.find('span', attrs={'class': 'linux'}):
            linux = '1'
        else:
            linux = '0'
        data.append({
            'Title': title,
            'Time': st,
            'Win': win,
            'Mac': mac,
            'Linux': linux})

    temp_df = pd.DataFrame(data)  # <-- temporarily store the data in a dataframe
    results_df = results_df.append(temp_df).reset_index(drop=True)  # <-- dump that data into the results dataframe

results_df.to_csv('data.csv', index=False)  # <-- write the results dataframe to csv

testing = pd.read_csv('data.csv')
heading = testing.head(100)
discription = testing.describe()
print(heading)
Output:
print (results_df)
Linux Mac ... Title Win
0 0 0 ... Tom Clancy's Rainbow Six® Siege 1
1 0 0 ... Tom Clancy's Rainbow Six® Siege 1
2 1 1 ... Total War: WARHAMMER II 1
3 0 0 ... Tom Clancy's Rainbow Six® Siege 1
4 1 1 ... Total War: WARHAMMER II 1
5 0 1 ... Frostpunk 1
6 0 0 ... Tom Clancy's Rainbow Six® Siege 1
7 1 1 ... Total War: WARHAMMER II 1
8 0 1 ... Frostpunk 1
9 1 1 ... Two Point Hospital 1
10 0 0 ... Tom Clancy's Rainbow Six® Siege 1
11 1 1 ... Total War: WARHAMMER II 1
12 0 1 ... Frostpunk 1
13 1 1 ... Two Point Hospital 1
14 0 0 ... Black Desert Online 1
15 0 0 ... Tom Clancy's Rainbow Six® Siege 1
16 1 1 ... Total War: WARHAMMER II 1
17 0 1 ... Frostpunk 1
18 1 1 ... Two Point Hospital 1
19 0 0 ... Black Desert Online 1
20 1 1 ... Kerbal Space Program 1
21 0 0 ... Tom Clancy's Rainbow Six® Siege 1
22 1 1 ... Total War: WARHAMMER II 1
23 0 1 ... Frostpunk 1
24 1 1 ... Two Point Hospital 1
25 0 0 ... Black Desert Online 1
26 1 1 ... Kerbal Space Program 1
27 1 1 ... BioShock Infinite 1
28 0 0 ... Tom Clancy's Rainbow Six® Siege 1
29 1 1 ... Total War: WARHAMMER II 1
... .. ... ... ..
1595 0 0 ... VEGAS Pro 14 Edit Steam Edition 1
1596 0 0 ... ABZU 1
1597 0 0 ... Sacred 2 Gold 1
1598 0 0 ... Sakura Bundle 1
1599 1 1 ... Distance 1
1600 0 0 ... LEGO® Batman™: The Videogame 1
1601 0 0 ... Sonic Forces 1
1602 0 0 ... The Stronghold Collection 1
1603 0 0 ... Miscreated 1
1604 0 0 ... Batman™: Arkham VR 1
1605 1 1 ... Shadowrun Returns 1
1606 0 0 ... Upgrade to VEGAS Pro 16 Edit 1
1607 0 0 ... Girl Hunter VS Zombie Bundle 1
1608 0 1 ... Football Manager 2019 Touch 1
1609 0 1 ... Total War: NAPOLEON - Definitive Edition 1
1610 1 1 ... SteamWorld Dig 2 1
1611 0 0 ... Condemned: Criminal Origins 1
1612 0 0 ... Company of Heroes 1
1613 0 0 ... LEGO® Batman™ 2: DC Super Heroes 1
1614 1 1 ... Euro Truck Simulator 2 Map Booster 1
1615 0 0 ... Sonic Adventure DX 1
1616 0 0 ... Worms Armageddon 1
1617 1 1 ... Unforeseen Incidents 1
1618 0 0 ... Warhammer 40,000: Space Marine Collection 1
1619 0 0 ... VEGAS Pro 14 Edit Steam Edition 1
1620 0 0 ... ABZU 1
1621 0 0 ... Sacred 2 Gold 1
1622 0 0 ... Sakura Bundle 1
1623 1 1 ... Distance 1
1624 0 0 ... Worms Revolution 1
[1625 rows x 5 columns]
So I was apparently very blind to my own code; that can happen when you stare at it all day. All I actually had to do was move the "data = []" above the for loop so it wouldn't reset every time.
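In sketch form, that minimal fix looks like this (same scraping logic as in the question, just with the list created once and the CSV written once after the loop):

import csv
import time
import datetime
import requests
from bs4 import BeautifulSoup

urls = [f'https://store.steampowered.com/search/?specials=1&page={n}' for n in range(1, 6)]

data = []  # created once, before the URL loop, so rows from every page accumulate
st = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for container in soup.find_all('div', attrs={'class': 'responsive_search_name_combined'}):
        data.append({
            'Title': container.find('span', attrs={'class': 'title'}).text,
            'Time': st,
            'Win': '1' if container.find('span', attrs={'class': 'win'}) else '0',
            'Mac': '1' if container.find('span', attrs={'class': 'mac'}) else '0',
            'Linux': '1' if container.find('span', attrs={'class': 'linux'}) else '0',
        })

# one write after all pages have been collected
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Title', 'Win', 'Mac', 'Linux', 'Time'])
    writer.writeheader()
    writer.writerows(data)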
