I'm very beginner to python but I know intermediate JavaScript. I have one project to complete this is like a scraper but I want to automate some work for me.
1 ) I have a Excel with more than 1000 data and it also has URLs. I want to code that python visit every URL from that Excel sheet and search first page for Some predefine search texts (List of Texts)
2 ) If my code find any of the Text from that web page then it should return true else false
I want any idea or logic to do this kind of process. Any help will make my head pain less 😅
it is very heavy work which is not very good idea to do in JavaScript that's why I want to do it in Python
An easy way to do this would be to get the requests module. Then learn how to use the csv module which can read spreadsheets such as excel spreadsheets. Then here is what you want to do
import csv
import requests
URLS = []
def GetUrlFromCSVFile():
global URLS
#Figure out how to get link from csv file then append them to the URLS list
for url in URLS:
r = requests.get(URL, headers=#You Should Probs get some headers)
if whatever_keyword_u_looking_for in r.text:
print("Found")
else:
print("Not here")
I suggest the following:
Read about the csv library - to read the content of an excel file.
Read about the requests library - to get the page's content from its URL.
Read about regular expressions in the re library.
Since I've been trying to figure out how to make a loop and I couldn't make it from another threads, I need help. I am totally new to this so editing existing codes is hard for me.
I am trying to web scrape data from website. Here's what I've done so far, but I have to insert pages "manually". I want it to automatically scrape prices in zl/m2 from 1 to 20 pages for example:
import requests
from bs4 import BeautifulSoup
link=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=1")
page = requests.get(link).text
link1=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=2")
page1 = requests.get(link1).text
link2=("https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona=3")
page2 = requests.get(link2).text
pages=page+page1+page2+page3+page4+page5+page6
soup = BeautifulSoup(pages, 'html.parser')
price_box = soup.findAll('p', attrs={'class':'list__item__details__info details--info--price'})
prices=[]
for i in range(len(price_box)):
prices.append(price_box[i].text.strip())
prices
I've tried with this code, but got stuck. I don't know what should I add to get output from 20 pages at once and how to save it to csv file.
npages=20
baselink="https://ogloszenia.trojmiasto.pl/nieruchomosci-mam-do-wynajecia/wi,100.html?strona="
for i in range (1,npages+1):
link=baselink+str(i)
page = requests.get(link).text
Thanks in advance for any help.
Python is whitespace sensitive, so the code block of any loops needs to be indented, like so:
for i in range (1,npages+1):
link=baselink+str(i)
page = requests.get(link).text
If you want all of the pages in a single string (so you can use the same approach as with your pages variable above), you can append the strings together in your loop:
pages = ""
for i in range (1,npages+1):
link=baselink+str(i)
pages += requests.get(link).text
To create a csv file with your results, you can look into the csv.writer() method in python's built-in csv module, but I usually find it easier to write to a file using print():
with open(samplefilepath, mode="w+") as output_file:
for price in prices:
print(price, file=output_file)
w+ tells python to create the file if it doesn't exist and overwrite if it does exist. a+ would append to the existing file if it exists
I have a code that retrieves information from HPEs website regarding switches. The script works just fine and outputs the information into a CSV file. however, now I need to loop the script through 30 different switches.
I have a list of URLs that are stored in a CSV document. Here are a few examples.
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4813A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J4903A
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9019B
https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?ProductNumber=J9022A
in my code, I bind 'URL' to one of these links, which pushes that through the code to retrieve the information I need.
Here is my full code:
url = "https://h10145.www1.hpe.com/downloads/SoftwareReleases.aspx?
ProductNumber=J9775A"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = []
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
with open('c:\source\output_file.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow({url})
writer.writerow(headers)
writer.writerows(rows)
I am trying to find the best way to automate this, as this script needs to run at least once a week. It will need to output to 1 CSV file that is overwritten every time. That CSV file then links to my excel sheet as a data source
Please find patience in dealing with my ignorance. I am a novice at Python and I haven't been able to find a solution elsewhere.
Are you on a linux system? You could set up a cron job to run your script whenever you want.
Personally, I would just make an array of each "ProductNumber" unique query parameter value and iterate through that array via a loop.
Then, by calling the rest of the code as it would be encapsulated in that loop, you should be able to accomplish this task.
when I do curl to a API call link http://example.com/passkey=wedsmdjsjmdd
curl 'http://example.com/passkey=wedsmdjsjmdd'
I get the employee output data on a csv file format, like:
"Steve","421","0","421","2","","","","","","","","","421","0","421","2"
how can parse through this using python.
I tried:
import csv
cr = csv.reader(open('http://example.com/passkey=wedsmdjsjmdd',"rb"))
for row in cr:
print row
but it didn't work and I got an error
http://example.com/passkey=wedsmdjsjmdd No such file or directory:
Thanks!
Using pandas it is very simple to read a csv file directly from a url
import pandas as pd
data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
This will read your data in tabular format, which will be very easy to process
You need to replace open with urllib.urlopen or urllib2.urlopen.
e.g.
import csv
import urllib2
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
print row
This would output the following
Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...
The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.
You could do it with the requests module as well:
url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
To increase performance when downloading a large file, the below may work a bit more efficiently:
import requests
from contextlib import closing
import csv
url = "http://download-and-process-csv-efficiently/python.csv"
with closing(requests.get(url, stream=True)) as r:
reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
for row in reader:
# Handle each row here...
print row
By setting stream=True in the GET request, when we pass r.iter_lines() to csv.reader(), we are passing a generator to csv.reader(). By doing so, we enable csv.reader() to lazily iterate over each line in the response with for row in reader.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
This question is tagged python-2.x so it didn't seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good google juice for "python csv urllib", so here's an updated Python 3 solution.
It's now necessary to decode urlopen's response (in bytes) into a valid local encoding, so the accepted answer has to be modified slightly:
import csv, urllib.request
url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)
for row in cr:
print(row)
Note the extra line beginning with lines =, the fact that urlopen is now in the urllib.request module, and print of course requires parentheses.
It's hardly advertised, but yes, csv.reader can read from a list of strings.
And since someone else mentioned pandas, here's a pandas rendition that displays the CSV in a console-friendly output:
python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'
Pandas is not a lightweight library, though. If you don't need the things that pandas provides, or if startup time is important (e.g. you're writing a command line utility or any other program that needs to load quickly), I'd advise that you stick with the standard library functions.
import pandas as pd
url='https://raw.githubusercontent.com/juliencohensolal/BankMarketing/master/rawData/bank-additional-full.csv'
data = pd.read_csv(url,sep=";") # use sep="," for coma separation.
data.describe()
I am also using this approach for csv files (Python 3.6.9):
import csv
import io
import requests
r = requests.get(url)
buff = io.StringIO(r.text)
dr = csv.DictReader(buff)
for row in dr:
print(row)
what you were trying to do with the curl command was to download the file to your local hard drive(HD). You however need to specify a path on HD
curl http://example.com/passkey=wedsmdjsjmdd -o ./example.csv
cr = csv.reader(open('./example.csv',"r"))
for row in cr:
print row
HI, I've got a list of 10 websites in CSV. All of the sites have the same general format, including a large table. I only want the the data in the 7th columns. I am able to extract the html and filter the 7th column data (via RegEx) on an individual basis but I can't figure out how to loop through the CSV. I think I'm close but my script won't run. I would really appreciate it if someone could help me figure-out how to do this. Here's what i've got:
#Python v2.6.2
import csv
import urllib2
import re
urls = csv.reader(open('list.csv'))
n =0
while n <=10:
for url in urls:
response = urllib2.urlopen(url[n])
html = response.read()
print re.findall('td7.*?td',html)
n +=1
When I copied your routine, I did get a white space / tab error error. Check your tabs. You were indexing into the URL string incorrectly using your loop counter. This would have also messed you up.
Also, you don't really need to control the loop with a counter. This will loop for each line entry in your CSV file.
#Python v2.6.2
import csv
import urllib2
import re
urls = csv.reader(open('list.csv'))
for url in urls:
response = urllib2.urlopen(url[0])
html = response.read()
print re.findall('td7.*?td',html)
Lastly, be sure that your URLs are properly formed:
http://www.cnn.com
http://www.fark.com
http://www.cbc.ca