Scraping news title from a page with bs4 in python [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 3 months ago.
I was trying to scrape the "entry-title" of the latest news on the site "https://www.abafg.it/category/avvisi/", but my code prints [] instead. What am I doing wrong?
This is the result the code returns instead of the "entry-title" elements of the page I want to scrape.
I tried to scrape the class "entry-title" so I could save the title, the link each news item leads to, and the publication date.

The entry-title class is not on the a tag of the link, but on the h2 wrapped around it.
You can use
names = [h.a for h in soup.find_all('h2', class_='entry-title')]
But I think using CSS selectors would be better here
names = soup.select('h2.entry-title > a[href]')
will select any a tag that has an href attribute and an h2 parent of class entry-title.
Then,
for a in names:
    print(a.get_text().strip(), a.get('href'))
will print
AVVISO LEZIONI DI SCULTURA : PROF.BORRELLI https://www.abafg.it/avviso-lezioni-di-scultura-prof-borrelli/
ORARIO DELLE LEZIONI A.A.2022/2023 IN VIGORE DAL 21 NOVEMBRE 2022 https://www.abafg.it/orario-delle-lezioni-a-a-2022-2023-in-vigore-dal-21-novembre-2022/
PROROGA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 https://www.abafg.it/proroga-bando-affidamenti-interni-d-d-n-3-del-4-11-2022/
D.D. n. 7 del 15.11.2022 DECRETO GRADUATORIA PROVVISORIA ABPR19 https://www.abafg.it/d-d-n-7-del-15-11-2022-decreto-graduatoria-provvisoria-abpr19/
D.D. n. 5 DEL 10.11.2022 DECRETO DI NOMINA COMMISSIONE ABPR19 https://www.abafg.it/d-d-n-5-del-10-11-2022-decreto-di-nomina-commissione-abpr19/
RIAPERTURA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 https://www.abafg.it/riapertura-bando-affidamenti-interni-d-d-n-4-del-4-11-2022/
D.D.81 del 26.10.2022 GRADUATORIA DEFINITIVA ABST48 STORIA DELLE ARTI APPLICATE https://www.abafg.it/d-d-81-del-26-10-2022-graduatoria-definitiva-abst48-storia-delle-arti-applicate/
AVVISO PRESENTAZIONE DOMANDE CULTORE DELLA MATERIA A.A.22.23-SCADENZA 11.11.2022 https://www.abafg.it/avviso-presentazione-domande-cultore-della-materia-a-a-22-23-scadenza-11-11-2022/
D.D. N.78 DEL 19/10/2022 BANDO GRADUATORIE D’ISTITUTO-SCADENZA 9/11/2022. https://www.abafg.it/d-d-n-78-bando-graduatorie-distituto-scadenza-9-11-2022/
ORARIO PROVVISIORIO DELLE LEZIONI A.A. 2022/2023: TRIENNIO E BIENNIO https://www.abafg.it/orario-provvisiorio-delle-lezioni-a-a-2022-2023-triennio-e-biennio/
EDIT: to save the printed text into a file, you could first join it into one string with '\n'.join:
asText = '\n'.join([f'{a.get_text().strip()} {a.get("href")}' for a in names])
and then you could save it with
with open('./resources/titles.txt', 'w', encoding='utf-8') as f:
    f.write(asText)
If you want something more visually friendly, I suggest using pandas
import pandas

asDF = pandas.DataFrame([{
    'title': a.get_text().strip(), 'link': a.get('href')
} for a in names])
asText = asDF.to_markdown(index=False)
and now asText looks like
| title | link |
|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|
| ORARIO DELLE LEZIONI A.A.2022/2023 IN VIGORE DAL 21 NOVEMBRE 2022 | https://www.abafg.it/orario-delle-lezioni-a-a-2022-2023-in-vigore-dal-21-novembre-2022/ |
| PROROGA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 | https://www.abafg.it/proroga-bando-affidamenti-interni-d-d-n-3-del-4-11-2022/ |
| D.D. n. 7 del 15.11.2022 DECRETO GRADUATORIA PROVVISORIA ABPR19 | https://www.abafg.it/d-d-n-7-del-15-11-2022-decreto-graduatoria-provvisoria-abpr19/ |
| D.D. n. 5 DEL 10.11.2022 DECRETO DI NOMINA COMMISSIONE ABPR19 | https://www.abafg.it/d-d-n-5-del-10-11-2022-decreto-di-nomina-commissione-abpr19/ |
| RIAPERTURA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 | https://www.abafg.it/riapertura-bando-affidamenti-interni-d-d-n-4-del-4-11-2022/ |
| D.D.81 del 26.10.2022 GRADUATORIA DEFINITIVA ABST48 STORIA DELLE ARTI APPLICATE | https://www.abafg.it/d-d-81-del-26-10-2022-graduatoria-definitiva-abst48-storia-delle-arti-applicate/ |
| AVVISO PRESENTAZIONE DOMANDE CULTORE DELLA MATERIA A.A.22.23-SCADENZA 11.11.2022 | https://www.abafg.it/avviso-presentazione-domande-cultore-della-materia-a-a-22-23-scadenza-11-11-2022/ |
| D.D. N.78 DEL 19/10/2022 BANDO GRADUATORIE D’ISTITUTO-SCADENZA 9/11/2022. | https://www.abafg.it/d-d-n-78-bando-graduatorie-distituto-scadenza-9-11-2022/ |
| ORARIO PROVVISIORIO DELLE LEZIONI A.A. 2022/2023: TRIENNIO E BIENNIO | https://www.abafg.it/orario-provvisiorio-delle-lezioni-a-a-2022-2023-triennio-e-biennio/ |
| GRADUATORIA DEFINITIVA ABST47 STILE,STORIA DELL’ARTE E DEL COSTUME | https://www.abafg.it/graduatoria-definitiva-abst47-stilestoria-dellarte-e-del-costume/ |
And then, instead of TXT, you could also save it as CSV with
asDF.to_csv('./resources/titles.csv', index=False)
so that you can view it as a spreadsheet
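Putting the scraping part together, here is a minimal self-contained sketch. It runs against an inline HTML snippet (made up for illustration, mimicking the page's h2.entry-title structure) so it works offline; for the live page you would first fetch https://www.abafg.it/category/avvisi/ with requests and parse response text instead:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page; for the live site you would do e.g.
# html = requests.get('https://www.abafg.it/category/avvisi/').text
html = '''
<article>
  <h2 class="entry-title"><a href="https://www.abafg.it/avviso-1/">AVVISO 1</a></h2>
  <h2 class="entry-title"><a href="https://www.abafg.it/avviso-2/">AVVISO 2</a></h2>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')
# CSS selector: a tags with an href, directly inside h2.entry-title
names = soup.select('h2.entry-title > a[href]')
for a in names:
    print(a.get_text().strip(), a.get('href'))
```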

Related

Python pandas for manipulating text & inconsistent data

How do I take specific text from one column in Python pandas when the format is inconsistent? For example, like this:
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the desired output format would be like this
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$')
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
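A runnable version of the above, with the DataFrame reconstructed from the question's sample rows:

```python
import pandas as pd

# Reconstruction of the question's data
df = pd.DataFrame({
    'Area': ['Bali Island: 4600', 'Java Island:7200', 'Hallo Island : 2400'],
    'Owners': ['John', 'Van Hour', 'Petra'],
})

# Capture the trailing run of digits from 'Area' into a new column
df['Area Number'] = df['Area'].str.extract(r'(\d+)$')
print(df)
```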

SAS Programming: How to replace missing values in multiple columns using one column?

Background
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)
After applying specific filters to the numeric columns, there are no missing values for any numeric variable. However, there are thousands to hundreds of thousands of missing values across the remaining 13 character/string variables.
Request
Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Here is the Python code from the blog post.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders]
for i, column in zip(keys, columns):
    database[i] = database[i].fillna(
        database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace=True)
What would be the equivalent SAS code for the Python code shown above?
It's basically just doing a word search of sorts.
A simplified example in SAS:
data want;
    set have;
    array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");
    do i=1 to dim(_fuel);
        if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
        *does not deal with multiple finds so the last one found will be kept;
    end;
run;
You can expand this by creating an array for each variable and looping through your lists. You could probably replace the loop with one of SAS's PRX regex functions as well, but regex requires too much thinking, so someone else will have to provide that answer.
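For reference, the core of the quoted Python loop is the fillna-from-extract pattern, which can be checked on toy data (values made up for illustration) before translating it to SAS:

```python
import re
import pandas as pd

# Toy data: 'condition' has a gap that the free-text 'description' can fill
database = pd.DataFrame({
    'condition': ['good', None, 'fair'],
    'description': ['good car', 'Excellent ride, one owner', 'fair, some rust'],
})

condition = '(excellent|good|fair|like new|salvage|new)'
# Fill missing 'condition' values with the first keyword found in 'description'
database['condition'] = database['condition'].fillna(
    database['description'].str.extract(condition, flags=re.IGNORECASE, expand=False)).str.lower()
print(database['condition'].tolist())
```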

Scrapy handle missing path

I am building a forum-scraper for a university project. The page of the forum that I am using is the following: https://www.eurobricks.com/forum/index.php?/forums/topic/163541-lego-ninjago-2019/&tab=comments#comment-2997338.
I am able to extract all the information that I need except for the location. This information is stored inside the following path.
<li class="ipsType_light">
    <span class="fc">Country_name</span>
</li>
The problem is that sometimes this element and path do not exist, and my current solution cannot handle that.
Here the code I wrote to get the information about the location.
location_path = "//span[@class='fc']/text()"

def parse_thread(self, response):
    comments = response.xpath("//*[@class='cPost_contentWrap ipsPad']")
    username = response.xpath(self.user_path).extract()
    x = len(username)
    if x > 0:
        score = response.xpath(self.score_path).extract()
        content = ["".join(comment.xpath(".//*[@data-role='commentContent']/p/text()").extract()) for comment in comments]
        date = response.xpath(self.date_path).extract()
        location = response.xpath(self.location_path).extract()
        for i in range(x):
            yield {
                "title": title,
                "category": category,
                "user": username[i],
                "score": score[i],
                "content": content[i],
                "date": date[i],
                "location": location[i]
            }
One possible solution I tried was to check the length of the location list, but it does not work.
Right now the code results in the following (sample data)
Title | category | test1 | 502 | 22 june 2020 | correct country
Title | category | test2 | 470 | 22 june 2020 | wrong country (it takes the next user country)
Title | category | test3 | 502 | 28 june 2020 | correct country
And what I would like to achieve is:
Title | category | test1 | 502 | 22 june 2020 | correct country
Title | category | test2 | 470 | 22 june 2020 | Not available
Title | category | test3 | 502 | 28 june 2020 | correct country
The solution to my problem: instead of selecting each piece of information one by one, first select the entire block that contains all the pieces of information, and only then pick out the individual fields relative to that block.
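That per-block idea can be sketched as follows. This uses BeautifulSoup on an inline snippet rather than Scrapy (the div.comment and span.author class names are made up for illustration; only span.fc comes from the question), but the same structure carries over to Scrapy by iterating comment blocks and calling .xpath(...) relative to each one:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the forum page: the second comment has no location span
html = '''
<div class="comment"><span class="author">test1</span><span class="fc">Italy</span></div>
<div class="comment"><span class="author">test2</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')
rows = []
for block in soup.select('div.comment'):
    # Look up each field inside this block only, with a fallback for missing data
    loc = block.select_one('span.fc')
    rows.append({
        'user': block.select_one('span.author').get_text(),
        'location': loc.get_text() if loc else 'Not available',
    })
print(rows)
```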

Changing values in a column based on a match

I have a Pandas DataFrame which contains names of Brazilian universities, but sometimes I have these names in a short form and sometimes in a long form (for example, Universidade Federal do Rio de Janeiro is sometimes identified as UFRJ).
The DataFrame look like this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| UFRJ |
| Universidade de Sao Paulo |
| USP |
| Catholic University of Minas Gerais |
And I have another one which has, in separate columns, the short name and the long name of SOME (not all) of those universities. It looks like this:
| long_name | short_name |
|----------------------------------------|------------|
| Universidade Federal do Rio de Janeiro | UFRJ |
| Universidade de Sao Paulo | USP |
What I want is to replace all short names with long names, so in this context the college column of the first dataframe would become:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| Universidade Federal do Rio de Janeiro |
| Universidade de Sao Paulo |
| Universidade de Sao Paulo |
| Catholic University of Minas Gerais | <--- note: this one does not have a match, so it stays the same
Is there a way to do that using pandas and numpy (or any other library)?
Use Series.map with a mapping built from the second DataFrame; rows with no match become missing values, so chain Series.fillna to restore the originals:
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
                  .fillna(df1['college']))
print(df1)
college
0 Universidade Federal do Rio de Janeiro
1 Universidade Federal do Rio de Janeiro
2 Universidade de Sao Paulo
3 Universidade de Sao Paulo
4 Catholic University of Minas Gerais
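A runnable version of this answer, with both frames reconstructed from the question's tables:

```python
import pandas as pd

df1 = pd.DataFrame({'college': [
    'Universidade Federal do Rio de Janeiro', 'UFRJ',
    'Universidade de Sao Paulo', 'USP',
    'Catholic University of Minas Gerais']})
df2 = pd.DataFrame({
    'long_name': ['Universidade Federal do Rio de Janeiro', 'Universidade de Sao Paulo'],
    'short_name': ['UFRJ', 'USP']})

# Map short names to long names; unmatched rows become NaN, so restore them
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
                  .fillna(df1['college']))
print(df1)
```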

Python program that reorganizes Excel formatting?

I am working on a Python program that aims to take Excel data that is vertical and make it horizontal.
For example, the data is shaped something like this:
County | State | Number | Date
Oakland | MI | 19 | 1/12/10
Oakland | MI | 32 | 1/19/10
Wayne | MI | 9 | 1/12/10
Wayne | MI | 6 | 1/19/10
But I want it like this (purposefully excluding the state):
County | 1/12/10 | 1/19/10
Oakland | 19 | 32
Wayne | 9 | 6
(And for the actual data, it’s quite long).
My logic so far:
Read in the Excel File
Loop through the counties
If county name is the same, place # in Row 1?
Make a new Excel File?
Any ideas on how to write this? I think I am a little stuck on the syntax here.
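One way the steps above could be sketched is with pandas, whose pivot does exactly this long-to-wide reshape (data reconstructed from the question's sample; for a real workbook you would read with pd.read_excel and save with to_excel instead of printing):

```python
import pandas as pd

# Reconstruction of the sample data from the question
df = pd.DataFrame({
    'County': ['Oakland', 'Oakland', 'Wayne', 'Wayne'],
    'State': ['MI', 'MI', 'MI', 'MI'],
    'Number': [19, 32, 9, 6],
    'Date': ['1/12/10', '1/19/10', '1/12/10', '1/19/10'],
})

# Pivot long -> wide: one row per County, one column per Date; State is dropped
wide = df.pivot(index='County', columns='Date', values='Number').reset_index()
print(wide)
```

Note that pivot raises an error on duplicate (County, Date) pairs; if the real data can contain them, pivot_table with an aggregation function would be the safer choice.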
