Converting python webcrawler to 3.4 from 2.7 - python

I am converting a working Python web crawler from 2.7 to 3.4. I've made some modifications, but I still get errors when running it:
Traceback (most recent call last):
File "Z:\testCrawler.py", line 11, in <module>
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.request.urlopen(myurl).read(), re.I):
File "C:\Python34\lib\re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
This is the code itself, please tell me if you see what the syntax errors are.
#! C:\python34
import re
import urllib.request
textfile = open('depth_1.txt','wt')
print ("Enter the URL you wish to crawl..")
print ('Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes')
myurl = input("#> ")
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.request.urlopen(myurl).read(), re.I):
    print (i)
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.request.urlopen(i).read(), re.I):
        print (ee)
        textfile.write(ee+'\n')
textfile.close()

Change
urllib.request.urlopen(myurl).read()
to, for example:
urllib.request.urlopen(myurl).read().decode('utf-8')
In Python 3, .read() returns bytes rather than str as it did in Python 2.7, so the result has to be decoded with a suitable encoding before a str pattern can be applied to it.
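As a minimal sketch of the decode-then-search step, the snippet below runs the question's regex over a hard-coded bytes value instead of a live download. The name extract_links and the sample HTML are made up for illustration, and UTF-8 is only an assumption about the page's encoding.

```python
import re

# Same pattern as in the question, compiled once as a str pattern.
HREF_PATTERN = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def extract_links(raw_bytes, encoding="utf-8"):
    """Decode the downloaded bytes, then apply the str pattern.

    `encoding` is an assumption; a real page may declare a different one.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    return HREF_PATTERN.findall(text)


# Stand-in for urllib.request.urlopen(myurl).read(), which returns bytes.
sample = b'<a href="http://example.com/a">A</a> <A HREF=\'/b\'>B</A>'
links = extract_links(sample)
print(links)
```

Passing errors="replace" keeps the crawler running when a page contains bytes that are not valid in the assumed encoding.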

Related

How to search for text string in executable output with python?

I'm trying to create a Python script to auto-update a program for me. When I run program.exe --help, it gives a long output, and somewhere inside it is a string of the form "Version: X.X.X". How can I make a script that runs the command and isolates the version number from the executable's output?
I should have mentioned that I tried the following:
import re
import subprocess
regex = r'Version: ([\d\.]+)'
match = re.search(regex, subprocess.run(["program.exe", "--help"]))
print((match.group(0)))
and got the error:
Traceback (most recent call last):
File "run.py", line 6, in <module>
match = re.search(regex, subprocess.run(["program.exe", "--help"]))
File "C:\Python37\lib\re.py", line 183, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
Something like this should work:
re.search(r'Version: ([\d\.]+)', subprocess.check_output(['program.exe', '--help']).decode()).group(1)
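The same idea can be spelled out a little more defensively. This is a sketch only: find_version is a hypothetical helper, and because program.exe is not available here, a small Python one-liner stands in for it and prints a fake "Version:" line.

```python
import re
import subprocess
import sys


def find_version(command):
    """Run `command` and return the X.X.X part of 'Version: X.X.X', or None."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,      # some programs print --help to stderr
        universal_newlines=True,       # decode the output to str
    )
    match = re.search(r"Version: ([\d.]+)", result.stdout)
    return match.group(1) if match else None


# Stand-in for ["program.exe", "--help"]: a child Python process that prints
# some text containing a version line.
fake_program = [sys.executable, "-c", "print('usage: ...\\nVersion: 1.2.3')"]
version = find_version(fake_program)
print(version)
```

Note that subprocess.run returns a CompletedProcess object, not text, which is why the original attempt failed with "expected string or bytes-like object"; the captured text lives in its stdout attribute.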

Python Bag of Words NameError: name 'unicode' is not defined

I have been following this site, https://radimrehurek.com/data_science_python/, to apply bag of words on a list of tweets.
import csv
from textblob import TextBlob
import pandas
messages = pandas.read_csv('C:/Users/Suki/Project/Project12/newData1.csv', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])

def split_into_tokens(message):
    message = unicode(message, encoding="utf8")  # convert bytes into proper unicode
    return TextBlob(message).words

messages.message.head().apply(split_into_tokens)
print (messages)
However, I keep getting this error. I've checked that I am following the code on the site, but the error keeps occurring.
Error
Traceback (most recent call last):
File "C:/Users/Suki/Project/Project12/projectBagofWords.py", line 34, in <module>
messages.message.head().apply(split_into_tokens)
File "C:\Program Files\Python36\lib\site-packages\pandas\core\series.py", line 2510, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/src\inference.pyx", line 1521, in pandas._libs.lib.map_infer
File "C:/Users/Suki/Project/Project12/projectBagofWords.py", line 31, in split_into_tokens
message = unicode(message, encoding="utf8") # convert bytes into proper unicode
NameError: name 'unicode' is not defined
Can someone offer advice on how I could rectify this?
Thanks
unicode is a Python 2 built-in type; it does not exist in Python 3. If you are not sure which version will run this code, you can add this at the beginning of your code so that the old name unicode refers to the new str:
import sys
if sys.version_info[0] >= 3:
    unicode = str
unicode is a Python 2.x built-in type. If you are running Python 3.x, all strings are already Unicode and that call is not needed.
https://docs.python.org/3/howto/unicode.html
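One caveat worth illustrating: in Python 3, str(message, encoding="utf8") only accepts bytes and raises TypeError when message is already a str, so the conversion is best made conditional. The sketch below uses a hypothetical helper to_text, and a plain whitespace split stands in for TextBlob(message).words so that the example runs without third-party packages.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as str, decoding only when it is actually bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def split_into_tokens(message):
    # Stand-in tokenizer: the original code uses TextBlob(message).words here.
    return to_text(message).split()


print(split_into_tokens("already a str in Python 3"))
print(split_into_tokens(b"bytes are decoded first"))
```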

Python re.sub specific syntax

I am writing a script in Python that replaces specific lines in files on Linux. Say I have a file called hi in the /home directory that contains:
hi 873840
Here is my script:
#! /usr/bin/env python
import re
fp = open("/home/hi","w")
re.sub(r"hi+", "hi 90", fp)
My desired outcome is:
hi 90
However, when I run it I get this error and the hi file ends up being blank:
Traceback (most recent call last):
File "./script.py", line 6, in <module>
re.sub(r"hi+", "hi 90", fp)
File "/usr/lib/python2.7/re.py", line 155, in sub
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
Is there something wrong with my syntax?
Thanks
Use "r" mode to read the file; "w" mode creates an empty file for writing, which is why your file ends up blank. .readline() returns the line as a string that can be passed to re.sub(). The pattern r" .*" matches everything from the first space onward, so replacing it with " 90" gives the string you want. I assume 'hi 873840' is the only text in your file and your desired output is only 'hi 90'.
echo "hi 873840" > hi.txt
python3.6
import re
fp = open("hi.txt", "r")
print(re.sub(r" .*", " 90", fp.readline()))
You should open the file in read mode. re.sub expects three arguments, pattern, repl, string. The problem is the third argument you are passing is a file pointer.
Eg:
import re
with open('/home/hi', 'r', encoding='utf-8') as infile:
    for line in infile:
        # r"hi.*" replaces the whole line; r"hi+" would leave the old number behind
        print(re.sub(r"hi.*", "hi 90", line.strip()))
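Both answers print the result rather than changing the file. If the goal is to update the file in place, one possible approach is to read the text, substitute, and then reopen the file for writing. This is a sketch under assumptions: replace_value is a hypothetical helper, and a temporary file stands in for /home/hi.

```python
import os
import re
import tempfile


def replace_value(path, new_value):
    """Rewrite every line starting with 'hi ' so that it reads 'hi <new_value>'."""
    with open(path, "r", encoding="utf-8") as infile:
        content = infile.read()
    # A function replacement avoids backslash handling in the new value.
    updated = re.sub(r"^hi .*$", lambda m: "hi " + new_value, content, flags=re.M)
    # Only now open in "w" mode, which truncates the file before writing.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(updated)
    return updated


# Demonstration on a temporary file instead of /home/hi.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "hi")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("hi 873840\n")
    result = replace_value(demo_path, "90")
    with open(demo_path, "r", encoding="utf-8") as f:
        on_disk = f.read()

print(repr(result))
```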

Python 3 [TypeError: 'str' object cannot be interpreted as an integer] when working with sockets

I run the script with python3 in the terminal but when I reach a certain point in it, I get the following error:
Traceback (most recent call last):
File "client.py", line 50, in <module>
client()
File "client.py", line 45, in client
s.send(bytes((msg, 'utf-8')))
TypeError: 'str' object cannot be interpreted as an integer
This is the code it refers to.
else :
    # user entered a message
    msg = sys.stdin.readline()
    s.send(bytes((msg, 'utf-8')))
    sys.stdout.write(bytes('[Me] '))
    sys.stdout.flush()
I read the official documentation for bytes() and another source
https://docs.python.org/3.1/library/functions.html#bytes
http://www.pythoncentral.io/encoding-and-decoding-strings-in-python-3-x/
but I am no closer to understanding how to fix this. I realise that my msg is a string and I need an integer, but I am confused about how to convert it. Can you please help me, or direct me to a source that will help me?
Edit 1: I changed the line
s.send(bytes((msg, 'utf-8')))
to
s.send(bytes(msg, 'utf-8'))
but now I get the following error:
Traceback (most recent call last):
File "client.py", line 50, in <module>
client()
File "client.py", line 46, in client
sys.stdout.write(bytes('[Me] '))
TypeError: string argument without an encoding
Edit 2: Following @falsetru's updated answer, using a bytes literal gives me
TypeError: must be str, not bytes
Change the following line:
s.send(bytes((msg, 'utf-8')))
as:
s.send(bytes(msg, 'utf-8'))
In other words, pass a string and an encoding name instead of passing a tuple to bytes.
UPDATE according to the question change:
You need to pass a string to sys.stdout.write. Simply pass a string literal:
sys.stdout.write('[Me] ')
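The rule of thumb is bytes for the socket and str for sys.stdout. The sketch below shows both directions; it uses socket.socketpair() as a stand-in for the real client connection, which is an assumption made purely so the example runs without a server.

```python
import socket
import sys

# A connected pair of sockets stands in for the client/server connection.
sender, receiver = socket.socketpair()
try:
    msg = "hello\n"

    # Passing a tuple makes bytes() treat each element as an integer.
    try:
        bytes((msg, "utf-8"))
        tuple_error = None
    except TypeError as exc:
        tuple_error = str(exc)

    # Sockets need bytes: bytes(msg, 'utf-8') and msg.encode('utf-8') are equivalent.
    sender.sendall(msg.encode("utf-8"))
    received = receiver.recv(1024)

    # sys.stdout.write needs str, so decode before printing.
    sys.stdout.write("[Me] " + received.decode("utf-8"))
    sys.stdout.flush()
finally:
    sender.close()
    receiver.close()
```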

While function python

Hello, I use Eclipse and PyDev. I was wondering why my sample code, which uses a while loop, won't work.
print("Welcome to the annoying program")
response = ""
while response != "Because.":
    response = input("why\n")
print("Oh,ok")
the output is the following:
Welcome to the annoying program
why
Because.
Traceback (most recent call last):
File "/Users/calebmatthias/Document/workspace/de.vogella.python.first/simpprogram.py", line 9, in <module>
response = input("why\n")
File "/Users/calebmatthias/Desktop/eclipse 2/plugins/org.python.pydev_2.2.3.2011100616/PySrc/pydev_sitecustomize/sitecustomize.py", line 210, in input
return eval(raw_input(prompt))
File "<string>", line 1
Because.
^
SyntaxError: unexpected EOF while parsing
The input function in 2.x evaluates the input as Python code! Use the raw_input function instead.
With input, your text "Because." is being evaluated, and it's a syntax error because the dot isn't followed by anything.
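A small sketch of both points follows: picking a line reader that never evaluates the text, and reproducing the syntax error that eval raises on "Because.". The names read_line and ask_until are hypothetical, and the loop is driven by scripted answers so that the example does not wait on the keyboard.

```python
# Use raw_input where it exists (Python 2); fall back to input (Python 3),
# which already returns the typed text without evaluating it.
try:
    read_line = raw_input
except NameError:
    read_line = input


def ask_until(expected, reader=read_line):
    """Keep asking until the reply equals `expected`; return the number of tries."""
    attempts = 0
    response = ""
    while response != expected:
        response = reader("why\n")
        attempts += 1
    return attempts


# Scripted replies instead of real keyboard input.
scripted = iter(["Why not?", "Because."])
attempts = ask_until("Because.", reader=lambda prompt: next(scripted))
print("Oh,ok after", attempts, "tries")

# What Python 2's input() effectively did with the reply:
try:
    eval("Because.")
    eval_failed = False
except SyntaxError:
    eval_failed = True
print("eval raised SyntaxError:", eval_failed)
```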
