readlines() error with for-loop in Python

This error is hard to describe because I can't figure out how the loop is even affecting the readline() and readlines() methods. When I try the former, I get unexpected Traceback errors. When I try the latter, my code runs but nothing happens. I have determined that the bug is located in the first eight lines. The first few lines of the Topics.txt file are posted below.
Code
import requests
from html.parser import HTMLParser
from bs4 import BeautifulSoup

Url = "https://ritetag.com/best-hashtags-for/"
Topicfilename = "Topics.txt"
Topicfile = open(Topicfilename, 'r')
Line = Topicfile.readlines()
Linenumber = 0

for Line in Topicfile:
    Linenumber += 1
    print("Reading line", Linenumber)
    Topic = Line
    Newtopic = Topic.strip("\n").replace(' ', '').replace(',', '')
    print(Newtopic)
    Link = Url.join(Newtopic)
    print(Link)
    Sourcecode = requests.get(Link)
When I run this bit, it prints the URL preceded by the first character of the line. For example, it prints 2https://ritetag.com/best-hashtags-for/4https://ritetag.com/best-hashtags-for/Hhttps://ritetag.com/best-hashtags-for/ and so on for 24 Hour Fitness.
Topics.txt
21st Century Fox
24 Hour Fitness
2K Games
3M
Full Error
Reading line 1
24HourFitness
2https://ritetag.com/best-hashtags-for/4https://ritetag.com/best-hashtags-for/Hhttps://ritetag.com/best-hashtags-for/ohttps://ritetag.com/best-hashtags-for/uhttps://ritetag.com/best-hashtags-for/rhttps://ritetag.com/best-hashtags-for/Fhttps://ritetag.com/best-hashtags-for/ihttps://ritetag.com/best-hashtags-for/thttps://ritetag.com/best-hashtags-for/nhttps://ritetag.com/best-hashtags-for/ehttps://ritetag.com/best-hashtags-for/shttps://ritetag.com/best-hashtags-for/s
Traceback (most recent call last):
  File "C:\Users\Caden\Desktop\Programs\LususStudios\AutoDealBot\HashtagScanner.py", line 17, in <module>
    Sourcecode = requests.get(Link)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\sessions.py", line 579, in send
    adapter = self.get_adapter(url=request.url)
  File "C:\Python34\lib\site-packages\requests-2.10.0-py3.4.egg\requests\sessions.py", line 653, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '2https://ritetag.com/best-hashtags-for/4https://ritetag.com/best-hashtags-for/Hhttps://ritetag.com/best-hashtags-for/ohttps://ritetag.com/best-hashtags-for/uhttps://ritetag.com/best-hashtags-for/rhttps://ritetag.com/best-hashtags-for/Fhttps://ritetag.com/best-hashtags-for/ihttps://ritetag.com/best-hashtags-for/thttps://ritetag.com/best-hashtags-for/nhttps://ritetag.com/best-hashtags-for/ehttps://ritetag.com/best-hashtags-for/shttps://ritetag.com/best-hashtags-for/s'

I think there are two issues:
You seem to be iterating over Topicfile instead of the list returned by Topicfile.readlines().
Url.join(Newtopic) isn't returning what you think it is. str.join takes an iterable (here, the characters of the string Newtopic) and inserts Url between each element.
Here is code with these problems addressed:
import requests

Url = "https://ritetag.com/best-hashtags-for/"
Topicfilename = "topics.txt"
Topicfile = open(Topicfilename, 'r')
Lines = Topicfile.readlines()
Linenumber = 0

for Line in Lines:
    Linenumber += 1
    print("Reading line", Linenumber)
    Topic = Line
    Newtopic = Topic.strip("\n").replace(' ', '').replace(',', '')
    print(Newtopic)
    Link = '{}{}'.format(Url, Newtopic)
    print(Link)
    Sourcecode = requests.get(Link)
As an aside, I also recommend using lowercased variable names since camel case is generally reserved for class names in Python :)

Firstly, Python convention is to lowercase variable names.
Secondly, you are exhausting the file iterator: you read all the lines first and then continue to loop over the (now exhausted) file object.
Try simply opening the file and then looping over it:
linenumber = 0
with open("Topics.txt") as topicfile:
    for line in topicfile:
        # do work
        linenumber += 1
Then there is the issue in the traceback: if you look closely, you are building up a really long URL string that is definitely not a valid URL, so requests throws an error:
InvalidSchema: No connection adapters were found for '2https://ritetag.com/best-hashtags-for/4https://ritetag.com/...
And if you debug, you can see that Url.join(Newtopic) is "interleaving" the Url string between each character of the Newtopic string, which is exactly what str.join will do.
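For illustration, here is a minimal snippet (the topic value is just an example) showing the interleaving behaviour of str.join versus plain concatenation:

url = "https://ritetag.com/best-hashtags-for/"
topic = "3M"

print(url.join(topic))              # 3https://ritetag.com/best-hashtags-for/M
print(url + topic)                  # https://ritetag.com/best-hashtags-for/3M
print('{}{}'.format(url, topic))    # same result as plain concatenation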

Related

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) when using aiomangadexapi

What I'm working on is something to access MangaDex and check for new chapters, using an admittedly old and, as far as I can tell, decrepit API; however, it is the only API I have found for Python. In short, what I believe should be happening is that the code uses json.loads() to turn a string fetched from a link into a dict and then accesses the keys/values of that dict. However, something is clearly wrong with the string it should be converting.
Traceback starting from relevant code:
newLink = await getLink(mangaName, newChapterNum)
File "/home/eisbar/PycharmProjects/pythonProject1/main.py", line 45, in getLink
link = await mdapi.get_chapter(session, manga, chapter)
File "/home/eisbar/PycharmProjects/pythonProject1/venv/lib/python3.9/site-packages/aiomangadexapi/__init__.py", line 170, in get_chapter
data = await search(session,name)
File "/home/eisbar/PycharmProjects/pythonProject1/venv/lib/python3.9/site-packages/aiomangadexapi/__init__.py", line 129, in search
data = parse(response)
File "/home/eisbar/PycharmProjects/pythonProject1/venv/lib/python3.9/site-packages/aiomangadexapi/__init__.py", line 78, in parse
payload = json.loads(response)
File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I do understand from the traceback that the string s to be converted is in some way malformed at the first character. However, here is the string that should be converted, with the links removed:
{"code":200,"status":"OK","data":{"manga":{"id":48959,"title":"Ardor's Kingdom","altTitles":["\uc544\ub354\uc758 \uc655\uad6d"],"description":"An exiled prince's hectic school life in his quest to become the next King Ardor!","artist":["Yoonsoo"],"author":["Yoonsoo"],"publication":{"language":"kr","status":2,"demographic":1},"tags":[5,10,13,36,47],"lastChapter":null,"lastVolume":null,"isHentai":false,"links":{"ap":"ardors-kingdom","mu":"178415","raw":"XXXXXXXX"},"chapters":[{"id":896342,"hash":"db877e42e29ecf94c610d672c5602c19","mangaId":48959,"mangaTitle":"Ardor's Kingdom","volume":"","chapter":"1","title":"","language":"gb","groups":[657],"uploader":1166686,"timestamp":1589822996,"threadId":251978,"comments":4,"views":889}],"groups":[{"id":657,"name":"no group"}]}}
The only explanation I could think of after going through this is that the string above is somehow not being passed through the json.loads() chain properly, although I have no clue as to why or how that would happen, or how to test or fix it. As a final note, in case it helps someone who can touch the source code themselves, the module is aiomangadexapi.py.
Edit0: I'll try my best to add a reprex, but this is my first time doing one that crosses multiple files and into library source.
Code written by me that requests the link and begins the chain:
async def getLink(manga, chapter):
    session = await aiomangadexapi.login('username', 'password')
    link = await aiomangadexapi.get_chapter(session, mangaName, chapterNumber)  # Point at which traceback errors
    await session.close()
    return link
Between these is a two-line function that passes mangaName from above to the search function below
async def search(session, name, link=False):
    if not link:
        name = name.replace(" ", "%20")
        name = name.replace("[", "%5B")
        name = name.replace("[", "%5D")
        name = name.replace("?", "%3F")
        response = await fetch(session, 'https://mangadex.org/search?tag_mode_exc=any&tag_mode_inc=all&title={}'.format(name))
        link = fetch_link(response)
    else:
        link = 'https://mangadex.org/' + "api/v2/manga/" + name.split("title/")[1].split('/', 1)[0] + "?include=chapters"
        response = await fetch(session, link)  # this line previously in source was not indented, but that caused a separate error that was fixed when indented
    data = parse(response)  # Point at which traceback errors
    return data
And here is the final code before the call enters Python's built-in JSON modules, going through loads -> decode -> raw_decode in the traceback.
def parse(response):
    GENRES = {
        # Removed here because not necessary and/or certain parts against SO rules
    }
    payload = json.loads(response)  # Point at which traceback errors
    data, chapters = payload['data']['manga'], payload['data']['chapters']
    # More code relating to the data variable below this, removed as it is not
    # executed because the code errors on the payload definition.
    return [data]
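Since json.loads() is failing on the very first character, one quick way to narrow this down is to inspect exactly what parse() receives before decoding. A minimal sketch (the repr/truncation print is purely a debugging aid I added, not part of the library):

import json

def parse(response):
    # Show exactly what json.loads will see; an empty string or an HTML error
    # page here, rather than the JSON quoted above, would explain
    # "Expecting value: line 1 column 1 (char 0)".
    print(repr(response[:200]))
    payload = json.loads(response)
    return payload['data']['manga'], payload['data']['chapters']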

Beautifulsoup: ValueError: Tag.index: element not in tag

I am converting ePub to single HTML files, so I need to concatenate the individual chapters into one HTML file. They are named "..._split_000.html" etc., and I set up various structures to iterate over the ToC, generate directory names and so on.
I want to concatenate the HTML content of the individual parts with BeautifulSoup by appending the content of the body element of each following part to the body of the first part, but my code doesn't seem to work. "book" is an instance of the epub class of ebooklib. "docsfiles" is a dictionary with the names of the HTML files as keys and a list of files as one value among others:
def concat_articles(book, docsfiles, toc):
    articles = {}
    for doc, val in docsfiles.iteritems():
        firstsoup = False
        for f in val['files']:
            content = book.get_item_with_href(f).content
            soup = BeautifulSoup(content, "html.parser")
            if not firstsoup:
                firstsoup = soup
                continue
            body = copy.copy(soup.body)
            firstsoup.body.append(body)
        articles[val['id']] = firstsoup.prettify("utf-8")
    return articles
When I run this on my ePub, an error occurs:
Traceback (most recent call last):
File "extract-new.py", line 170, in <module>
articles_html = concat_articles(book, docsfiles, toc)
File "extract-new.py", line 97, in concat_articles
firstsoup.body.append(body)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 338, in append
self.insert(len(self.contents), tag)
File "/Library/Python/2.7/site-packages/bs4/element.py", line 291, in insert
new_child.extract()
File "/Library/Python/2.7/site-packages/bs4/element.py", line 235, in extract
del self.parent.contents[self.parent.index(self)]
File "/Library/Python/2.7/site-packages/bs4/element.py", line 888, in index
raise ValueError("Tag.index: element not in tag")
ValueError: Tag.index: element not in tag
Actually I should unwrap() the soup.body in the above code, but that leads to another error, so I thought I would solve this first.
Strangely enough, it works when I use Martijn Pieters' "clone()" method from this StackOverflow post:
body = clone(soup.body)
firstsoup.body.append(body)
Why this works and "copy.copy()" doesn't, I have yet to figure out.
The complete working solution without duplication of the body tags looks like this:
body = clone(soup.body)
for child in body.contents:
    firstsoup.body.append(clone(child))
This also works when I use "copy.copy()" in the first line, but not when I replace "clone()" with "copy.copy()" in the last line.
It might be too late, but I ran into a similar problem and found a simpler solution: turn all the objects you extract with BeautifulSoup into strings using the str() function.
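As a rough sketch of that idea applied to the loop in the question (variable names follow the question's code; the html.parser re-parse is the extra step this answer suggests), serializing the extracted body with str() and re-parsing it detaches the nodes from their original tree before they are appended:

from bs4 import BeautifulSoup

# Inside the loop over parts, instead of body = copy.copy(soup.body):
detached_body = BeautifulSoup(str(soup.body), "html.parser").body
for child in list(detached_body.contents):
    firstsoup.body.append(child)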

urllib.request.urlopen not accepting query string with spaces

I am taking a Udacity course on Python where we are supposed to check for profane words in a document. I am using the website http://www.wdylike.appspot.com/?q= (text_to_be_checked_for_profanity). The text to be checked can be passed as a query string in the above URL, and the website returns true or false after checking it for profane words. Below is my code.
import urllib.request

# Read the content from a document
def read_content():
    quotes = open("movie_quotes.txt")
    content = quotes.read()
    quotes.close()
    check_profanity(content)

def check_profanity(text_to_read):
    connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q=" + text_to_read)
    result = connection.read()
    print(result)
    connection.close

read_content()
It gives me the following error
Traceback (most recent call last):
File "/Users/Vrushita/Desktop/Rishit/profanity_check.py", line 21, in <module>
read_content()
File "/Users/Vrushita/Desktop/Rishit/profanity_check.py", line 11, in read_content
check_profanity(content)
File "/Users/Vrushita/Desktop/Rishit/profanity_check.py", line 16, in check_profanity
connection = urllib.request.urlopen("http://www.wdylike.appspot.com/?q="+text_to_read)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
The document that I am trying to read contains the string "Hello world". However, if I change the string to "Hello+world", the same code works and returns the desired result. Can someone explain why this is happening and what a workaround would be?
urllib accepts it; the server doesn't. And well it should not, because a space is not a valid URL character.
Escape your query string properly with urllib.parse.quote_plus(); it'll ensure your string is valid for use in query parameters. Or, better still, use the urllib.parse.urlencode() function to encode all key-value pairs:
from urllib.parse import urlencode
params = urlencode({'q': text_to_read})
connection = urllib.request.urlopen(f"http://www.wdylike.appspot.com/?{params}")
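For illustration, a quick check of what these helpers produce for a string containing a space (the sample text is just an example):

from urllib.parse import quote_plus, urlencode

print(quote_plus("Hello world"))          # Hello+world
print(urlencode({'q': "Hello world"}))    # q=Hello+world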
The response below is for Python 3.*
A 400 Bad Request occurs when there is a space in your input text.
To avoid this, use urllib.parse, so import it:
from urllib import request, parse
If you are sending any text along with the URL, then quote the text:
url = "http://www.wdylike.appspot.com/?q="
url = url + parse.quote(input_to_check)
Check the explanation here - https://discussions.udacity.com/t/problem-in-profanity-with-python-3-solved/227328
The Udacity profanity checker program:
from urllib import request, parse

def read_file():
    fhand = open(r"E:\Python_Programming\Udacity\movie_quotes.txt")
    file_content = fhand.read()
    # print(file_content)
    fhand.close()
    profanity_check(file_content)

def profanity_check(input_to_check):
    url = "http://www.wdylike.appspot.com/?q="
    url = url + parse.quote(input_to_check)
    req = request.urlopen(url)
    answer = req.read()
    # print(answer)
    req.close()
    if b"true" in answer:
        print("Profanity Alert!!!")
    else:
        print("Nothing to worry")

read_file()
I think this code is closer to what the lesson was aiming at, illustrating the difference between native functions, classes, and functions inside classes:
from urllib import request, parse

def read_text():
    quotes = open('C:/Users/Alejandro/Desktop/movie_quotes.txt', 'r+')
    contents_of_file = quotes.read()
    print(contents_of_file)
    check_profanity(contents_of_file)
    quotes.close()

def check_profanity(text_to_check):
    connection = request.urlopen('http://www.wdylike.appspot.com/?q=' + parse.quote(text_to_check))
    output = connection.read()
    # print(output)
    connection.close()
    if b"true" in output:
        print("Profanity Alert!!!")
    elif b"false" in output:
        print("This document has no curse words!")
    else:
        print("Could not scan the document properly")

read_text()
I'm working on the same project, also using Python 3 like most here.
While looking for a solution in Python 3, I found this HowTo and decided to give it a try.
It seems that on some websites, including Google, connections made from code (for example, via the urllib module) sometimes do not work properly. Apparently this has to do with the User-Agent header the website receives when the connection is built.
I did some further research and came up with the following solution:
First I imported URLopener from urllib.request and created a class called ForceOpen as a subclass of URLopener.
Now I could set a "regular" User-Agent via the version variable inside the ForceOpen class. Then I just created an instance of it and used its open method in place of urlopen to open the URL.
(It works fine, but I'd still appreciate comments, suggestions or any feedback, also because I'm not absolutely sure whether this approach is a good alternative - many thanks.)
from urllib.request import URLopener

class ForceOpen(URLopener):  # create a subclass of URLopener
    version = "Mozilla/5.0 (cmp; Konqueror ...)(Kubuntu)"

force_open = ForceOpen()  # create an instance of it

def read_text():
    quotes = open(
        "/.../profanity_editor/data/quotes.txt"
    )
    contents_of_file = quotes.read()
    print(contents_of_file)
    quotes.close()
    check_profanity(contents_of_file)

def check_profanity(text_to_check):
    # now use the open method to open the URL
    connection = force_open.open(
        "http://www.wdylike.appspot.com/?q=" + text_to_check
    )
    output = connection.read()
    connection.close()
    if b"true" in output:
        print("Attention! Curse word(s) have been detected.")
    elif b"false" in output:
        print("No curse word(s) found.")
    else:
        print("Error! Unable to scan document.")

read_text()

lxml.etree.XPathEvalError: Invalid expression

I am getting an error with Python that I am not able to understand. I have simplified my code to the very bare minimum:
response = requests.get('http://pycoders.com/archive')
tree = html.fromstring(response.text)
r = tree.xpath('//divass="campaign"]/a/@href')
print(r)
and still get the error
Traceback (most recent call last):
File "ultimate-1.py", line 17, in <module>
r = tree.xpath('//divass="campaign"]/a/@href')
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702)
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954)
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression
Would anyone have an idea of where the issue is coming from? Might it be a dependencies problem? Thanks.
The expression '//divass="campaign"]/a/@href' is not syntactically correct and does not make much sense. Instead, you meant to check the class attribute:
//div[@class="campaign"]/a/@href
Now, that would avoid the Invalid expression error, but the expression would still find nothing. This is because the data is not present in the response that requests receives. You would need to mimic what the browser does and make an additional request to get the JavaScript file containing the campaigns.
Here is what works for me:
import ast
import re
import requests
from lxml import html
with requests.Session() as session:
    # extract script url
    response = session.get('http://pycoders.com/archive')
    tree = html.fromstring(response.text)
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0]

    # get the script
    response = session.get(script_url)
    data = ast.literal_eval(re.match(r'document.write\((.*?)\);$', response.content).group(1))

    # extract the desired data
    tree = html.fromstring(data)
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')]

print(campaigns)
Prints:
['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140',
...
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481'
]
You went wrong in constructing the XPath. If you want to get all the hrefs, your XPath should look like this:
hrefs = tree.xpath('//div[@class="campaign"]/a')
for href in hrefs:
    print(href.get('href'))
or in one line:
hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]

Not clear on why my function is returning None

I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib
from bs4 import BeautifulSoup

def get_definition(x):
    url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={0}'.format(x)
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    return soup.find('pre', text=True)[0]

lines = []
with open('vocab.txt') as f:
    lines = f.readlines()
lines = [line.strip() for line in lines]

definitions = []
for line in lines:
    definitions.append(get_definition(line))

out_str = '\n'.join(definitions)

with open('definitions.txt', 'w') as f:
    f.write(out_str)
The problem I'm having is:
Traceback (most recent call last):
File "WIP.py", line 20, in <module>
definitions.append(get_definition(line))
File "WIP.py", line 11, in get_definition
return soup.find('pre', text=True)[0]
File "/Library/Python/2.7/site-packages/bs4/element.py", line 958, in __getitem__
return self.attrs[key]
KeyError: 0
I understand that soup.find('pre', text=True) is returning None, but not why or how to fix it.
Your problem is that find() returns a single result, not a list. That result is a dict-like Tag object, so [0] tries to look up the attribute key 0, which it cannot find.
Just remove the [0] and you should be fine.
Also, soup.find(...) is not returning None. It is returning an answer! If it were returning None, you would get the error
'NoneType' object has no attribute '__getitem__'
Beautiful soup documentation for find()
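To make that concrete, here is a minimal sketch of the function with the [0] removed and a guard added for the case where no matching tag is found (the empty-string fallback is just one possible choice), keeping the same Python 2 urllib setup as the question:

import urllib
from bs4 import BeautifulSoup

def get_definition(word):
    url = ('http://services.aonaware.com/DictService/Default.aspx'
           '?action=define&dict=wn&query={0}'.format(word))
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    tag = soup.find('pre', text=True)  # a single Tag, or None if nothing matches
    return tag.get_text().strip() if tag else ''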
