PyQuery Python not working with for loop

I am trying to write a program that pulls the URLs from each line of a .txt file and uses PyQuery to scrape lyrics data off LyricsWiki. Everything seems to work fine until I actually put the PyQuery stuff in. For example, when I do:
full_lyrics = ""
# open up the input file
links = open('links.txt')
for line in links:
    full_lyrics += line
print(full_lyrics)
links.close()
It prints everything out as expected, one big string with all the data in it. However, when I implement the actual HTML parsing, it only pulls the lyrics from the last URL and skips all the previous ones.
import requests, re, sqlite3
from pyquery import PyQuery
from collections import Counter

full_lyrics = ""
# open up the input file
links = open('links.txt')
output = open('web.txt', 'w')
output.truncate()
for line in links:
    r = requests.get(line)
    # create the PyQuery object and parse text
    results = PyQuery(r.text)
    results = results('div.lyricbox').remove('script').text()
    full_lyrics += (results + " ")
output.write(full_lyrics)
links.close()
output.close()
I'm writing to a txt file to avoid encoding issues with PowerShell. Anyway, after I run the program and open up the txt file, it only shows the lyrics of the last link in the links.txt document.
For reference, 'links.txt' should contain several links to lyricswiki song pages, like this:
http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off
http://lyrics.wikia.com/Maroon_5:Animals
'web.txt' should be a blank output file.
Why does PyQuery break the for loop? The loop clearly works when it's doing something simpler, like just concatenating the individual lines of a file.

The problem is the trailing newline character in every line that you read from the file (links.txt): requests.get is being handed URLs ending in "\n". Only the last line has no trailing newline, which is why it is the only one that works. Try adding another blank line to your links.txt and you'll see that even the last entry will no longer be processed.
I recommend doing a right strip on the line variable at the top of the for loop, like this:
for line in links:
    line = line.rstrip()
    r = requests.get(line)
    ...
It should work.
I also think you don't need requests to fetch the HTML; PyQuery can load a URL directly. Try results = PyQuery(line) and see if it works.
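Putting both fixes together, a minimal corrected sketch of the original script (assuming the same links.txt and web.txt layout as above) might look like this:

import requests
from pyquery import PyQuery

full_lyrics = ""
with open('links.txt') as links:
    for line in links:
        url = line.rstrip()          # drop the trailing newline
        if not url:                  # skip any blank lines
            continue
        r = requests.get(url)
        results = PyQuery(r.text)
        full_lyrics += results('div.lyricbox').remove('script').text() + " "

with open('web.txt', 'w') as output:
    output.write(full_lyrics)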

Related

Reading and printing a text file from a website URL line by line

I have this code here:
import requests

url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
File = url.text
for line in File:
    print(line)
The output looks like this:
p
i
l
l
o
w
and so on...
Instead, I want it to look like this:
pillow
fire
thumb
and so on...
I know I can add end="" inside print(line), but I want a variable to be equal to those lines. For example:
Word = line
and when you print Word, it should look like this:
pillow
fire
thumb
The .text of a requests response is a str; you can use .splitlines() to iterate over its lines as follows:
import requests

url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
for line in url.text.splitlines():
    print(line)
Note that .splitlines() deals with the different newline conventions, so you can use it without worrying about which newlines are actually used (.split("\n") is fine as long as you are sure you are working with Linux-style newlines).
You cannot do for line in url.text because url.text is a str, not an IO object (a file): iterating over a string yields individual characters. Instead, you can either print it directly (the \n line breaks will print as line breaks) or, if you really need to split on newlines, do for line in url.text.split('\n'):
import requests

url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
for line in url.text.split('\n'):
    print(line)
Edit: You might also want to call .strip() on each line to remove extra line breaks.
The response text is a str object, which you need to split() first:
import requests

url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
file = url.text.split()
for line in file:
    print(line)
You can also use split("\n"):
import requests

for l in requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt").text.split("\n"):
    print(l)
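If the goal is a variable holding those lines rather than just printing them, one sketch is to keep the list that .splitlines() returns:

import requests

url = requests.get("https://raw.githubusercontent.com/deppokol/testrepo/main/testfile.txt")
words = url.text.splitlines()   # e.g. ['pillow', 'fire', 'thumb', ...]
for word in words:
    print(word)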

Searching for word in file and taking whole line

I am running this program to get the page source of a website I put in. It saves the source to a file, and what I want is for it to look for a specific string: the emails are obfuscated with # in place of @. However, I can't get it to work.
import requests
import re

url = 'https://www.youtube.com/watch?v=GdKEdN66jUc&app=desktop'
data = requests.get(url)

# dump resulting text to file
with open("data6.txt", "w") as out_f:
    out_f.write(data.text)

with open("data6.txt", "r") as f:
    searchlines = f.readlines()

for i, line in enumerate(searchlines):
    if "#" in line:
        for l in searchlines[i:i+3]:
            print(l)
You can use the regex method findall to find all the email addresses in your text content, and use file.read() instead of file.readlines() to get the whole content in one string rather than split into separate lines. For example:
import re

with open("data6.txt", "r") as file:
    content = file.read()

emails = re.findall(r"[\w\.]+#[\w\.]+", content)
You might then cast the result to a set for uniqueness, and save it to a file however you like.
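For example, a sketch of that last step (the output filename here is just an example) might be:

import re

with open("data6.txt", "r") as file:
    content = file.read()

emails = set(re.findall(r"[\w\.]+#[\w\.]+", content))  # set() removes duplicates

with open("emails.txt", "w") as out_f:                 # example output filename
    for email in sorted(emails):
        out_f.write(email + "\n")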

Scraping text from multiple web pages in Python

I've been tasked to scrape all the text off of any webpage a certain client of ours hosts. I've managed to write a script that scrapes the text off a single webpage, where you can manually replace the URL in the code each time you want a different page. But obviously this is very inefficient. Ideally, Python could connect to some list that contains all the URLs I need, iterate through it, and print all the scraped text into a single CSV. I've tried to write a "test" version of this code by creating a list two URLs long and trying to get my code to scrape both. However, as you can see, my code only keeps the most recent URL in the list and does not hold onto the first page it scraped. I think this is due to a deficiency in my write statement, since it will always write over itself. Is there a way to hold everything I scraped somewhere until the loop goes through the entire list, and only then write everything out?
Feel free to totally dismantle my code. I know nothing of computer languages. I just keep getting assigned these tasks and use Google to do my best.
import urllib.request
import re
from bs4 import BeautifulSoup

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv'
urlTable = ['url1', 'url2']

def extractText(string):
    page = urllib.request.urlopen(string)
    soup = BeautifulSoup(page, 'html.parser')

    ## Extract all paragraph and header elements from the URL as GroupObjects
    text = soup.find_all("p")
    headers1 = soup.find_all("h1")
    headers2 = soup.find_all("h2")
    headers3 = soup.find_all("h3")

    ## Force GroupObjects into str
    text = str(text)
    headers1 = str(headers1)
    headers2 = str(headers2)
    headers3 = str(headers3)

    ## Strip HTML tags and brackets from the extracted strings
    text = re.sub('<[^<]+?>', '', text.strip('[').strip(']'))
    headers1 = re.sub('<[^<]+?>', '', headers1.strip('[').strip(']'))
    headers2 = re.sub('<[^<]+?>', '', headers2.strip('[').strip(']'))
    headers3 = re.sub('<[^<]+?>', '', headers3.strip('[').strip(']'))

    print_to_file = open(data_file_name, 'w', encoding='utf-8')
    print_to_file.write(text + headers1 + headers2 + headers3)
    print_to_file.close()

for i in urlTable:
    extractText(i)
Try this: 'w' opens the file with the pointer at the beginning and truncates it, so each call overwrites what was written before. You want the pointer at the end of the file, i.e. append mode:
print_to_file = open(data_file_name, 'a', encoding='utf-8')
Here are all the different read and write modes, for future reference:
The argument mode points to a string beginning with one of the following
sequences (additional characters may follow these sequences):

``r''   Open text file for reading. The stream is positioned at the
        beginning of the file.

``r+''  Open for reading and writing. The stream is positioned at the
        beginning of the file.

``w''   Truncate file to zero length or create text file for writing.
        The stream is positioned at the beginning of the file.

``w+''  Open for reading and writing. The file is created if it does not
        exist, otherwise it is truncated. The stream is positioned at
        the beginning of the file.

``a''   Open for writing. The file is created if it does not exist. The
        stream is positioned at the end of the file. Subsequent writes
        to the file will always end up at the then current end of file,
        irrespective of any intervening fseek(3) or similar.

``a+''  Open for reading and writing. The file is created if it does not
        exist. The stream is positioned at the end of the file.
        Subsequent writes to the file will always end up at the then
        current end of file, irrespective of any intervening fseek(3)
        or similar.
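Alternatively, a sketch that opens the file once before the loop and passes the handle into the function avoids reopening it on every call (note this version also uses BeautifulSoup's get_text() in place of the regex tag-stripping, which is a different technique from the original):

import urllib.request
from bs4 import BeautifulSoup

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv'
urlTable = ['url1', 'url2']   # same placeholder URLs as the question

def extractText(url, out_file):
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # collect the same tags the original gathered, then write their text
    parts = soup.find_all(["p", "h1", "h2", "h3"])
    out_file.write(" ".join(part.get_text() for part in parts))

with open(data_file_name, 'w', encoding='utf-8') as out_file:
    for i in urlTable:
        extractText(i, out_file)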

Trying to loop through URLs and save contents, as a data frame, to text file

I think this block of code is pretty close to being right, but something is throwing it off. I'm trying to loop through 10 URLs, download the contents of each to a text file, and make sure everything is structured in an orderly way, in a dataframe.
import pandas as pd

rawHtml = ''
url = r'http://www.pga.com/golf-courses/search?page=" + i + "&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0'

g = open("C:/Users/rshuell001/Desktop/MyData.txt", "w")
for i in range(0, 10):
    df = pd.DataFrame.from_csv(url)
    print(df)
    g.write(str(df))
g.close()
The error that I get says:
CParserError: Error tokenizing data.
C error: Expected 1 fields in line 22, saw 2
I have no idea what that means. I only have 9 lines of code, so I don't know why it's mentioning a problem on line 22.
Can someone give me a push to get this working?
pandas.DataFrame.from_csv() takes a first argument which is either a path or a file-like handle, either of which is supposed to point at a valid CSV file.
You are providing it with a URL.
It seems that you want a different function: the top-level pandas.read_csv. This function will actually fetch the data from a valid URL for you, then parse it.
If for any reason you insist on using pandas.DataFrame.from_csv(), you will have to:
1. Get the text from the page.
2. Persist the text, or parts thereof, as a valid CSV file (or a file-like object).
3. Provide the path to that file, or the handle of the file-like object, as the first argument to pandas.DataFrame.from_csv().
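A sketch of the read_csv route, assuming the URL really does point at CSV data (the URL below is hypothetical):

import pandas as pd

# hypothetical URL that serves an actual CSV file
url = "http://example.com/data.csv"
df = pd.read_csv(url)   # read_csv accepts a path, a file-like object, or a URL
print(df.head())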
I finally got it working. This is what I was trying to do all along.
import requests
from bs4 import BeautifulSoup

link = "http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"
html = requests.get(link).text
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("div", {"class": "views-field-nothing"})
for r in res:
    print("Address: " + r.find("span", {'class': 'field-content'}).text)

Search for a string in webpage and print the entire line containing it using python

I would like to search a webpage for a string and print the entire line containing that string.
I have an input file containing the links whose pages I would like to search for that string.
String to be searched: "vcore"
My Input File:
http://abc/cluster/app/application_1447334090028_225490
http://abc/cluster/app/application_1447334090028_228858
Expected Output File:
http://abc/cluster/app/application_1447334090028_225490 12434 vcore, 123 mb
http://abc/cluster/app/application_1447334090028_228858 12132 vcore, 131 mb
Code so far:
import sys
import re
import urllib

Links = [Link.strip() for Link in open('/home/try/Input.txt', 'r').readlines()]
for link in Links:
    webPage = urllib.urlopen(link).read()
    print webPage
Then I use grep to search for the string and store the result in another file. But I want the code itself to do this, and the matching line to appear next to the corresponding link. Can anyone help me with this?
lines = urllib.urlopen(link).readlines()
for line in lines:
    if "vcore" in line:
        print line
import re
import urllib

Links = [Link.strip() for Link in open('/home/try/Urls.txt', 'r').readlines()]
for link in Links:
    lines = urllib.urlopen(link).readlines()
    for line in lines:
        if "vcore" in line:
            print link, line
The only remaining issue is blank lines after every print statement: readlines() keeps each line's trailing newline and print adds another, so printing line.strip() removes them.
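For reference, a Python 3 sketch of the same idea using requests, stripping each matching line so the blank lines disappear:

import requests

with open('/home/try/Urls.txt') as f:
    links = [link.strip() for link in f if link.strip()]

for link in links:
    for line in requests.get(link).text.splitlines():
        if "vcore" in line:
            print(link, line.strip())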
