I can't get my program to produce every possible string from a split.
Here is one thing I tried:
var2 = "apple banana orange"
for var in var2.split():
    # Here I would put what I want to do with the variable, but I put print() to show what happens
    print(var)
I got:
applebananaorange
Full Code:
import requests
import wget

response = requests.get('https://raw.githubusercontent.com/Charonum/JSCode/main/Files.txt')
responsecontent = str(response.content)
for file in responsecontent.split("\n"):
    file = file.replace("b'", "")
    file = file.replace("'", "")
    file = file.replace(r"\n", "")
    if file == "":
        pass
    else:
        print(file)
        url = 'https://raw.githubusercontent.com/Charonum/JSCode/main/code/windows/' + file + ""
        wget.download(url)
What should I do?
It looks like one of the files in the list is not available. It is good practice to always wrap input/output operations in a try/except block to handle problems like this. The code below downloads all available files and tells you which files could not be downloaded:
import requests
import wget
from urllib.error import HTTPError

response = requests.get('https://raw.githubusercontent.com/Charonum/JSCode/main/Files.txt')
responsecontent = str(response.content)
for file in responsecontent.split("\\n"):
    file = file.replace("b'", "")
    file = file.replace("'", "")
    file = file.replace(r"\n", "")
    if file == "":
        pass
    else:
        url = 'https://raw.githubusercontent.com/Charonum/JSCode/main/code/windows/' + file + ""
        print(url)
        try:
            wget.download(url)
        except HTTPError:
            print(f"Error 404: {url} not found")
It seems to work for me after replacing the for statement with this one:

for file in responsecontent.split("\\n"):
    ...

str(response.content) gives you the text of the bytes literal, so each newline shows up as the two characters \ and n, which is why you have to split on "\\n" rather than "\n".
Instead of responsecontent = str(response.content) try:
responsecontent = response.text
and then for file in responsecontent.split().
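A minimal sketch of that approach, keeping the question's URLs; the try/except from the earlier answer still applies if some files are missing:

import requests
import wget

response = requests.get('https://raw.githubusercontent.com/Charonum/JSCode/main/Files.txt')
# response.text is already a decoded str, so there is no b'...' prefix
# and no escaped \n sequences to strip before splitting.
for file in response.text.split():
    url = 'https://raw.githubusercontent.com/Charonum/JSCode/main/code/windows/' + file
    wget.download(url)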
Related
I am trying to download files using python and then add lines at the end of the downloaded files, but it returns an error:
f.write(data + """<auth-user-pass>
TypeError: can't concat str to bytes
Edit: Thanks, it works now when I do this b"""< auth-user-pass >""", but I only want to add the string at the end of the file. When I run the code, it adds the string for every line.
I also tried something like this but it also did not work: f.write(str(data) + "< auth-user-pass >")
Here is my full code:
import os
import requests
from multiprocessing.pool import ThreadPool

def download_url(url):
    print("downloading: ", url)
    # assumes that the last segment after the / represents the file name
    # if url is abc/xyz/file.txt, the file name will be file.txt
    file_name_start_pos = url.rfind("/") + 1
    file_name = url[file_name_start_pos:]
    save_path = 'ovpns/'
    complete_path = os.path.join(save_path, file_name)
    print(complete_path)
    r = requests.get(url, stream=True)
    if r.status_code == requests.codes.ok:
        with open(complete_path, 'wb') as f:
            for data in r:
                f.write(data + """<auth-user-pass>
username
password
</auth-user-pass>""")
    return url

servers = [
    "us-ca72.nordvpn.com",
    "us-ca73.nordvpn.com"
]

urls = []
for server in servers:
    urls.append("https://downloads.nordcdn.com/configs/files/ovpn_legacy/servers/" + server + ".udp1194.ovpn")

# Run 5 multiple threads. Each call will take the next element in urls list
results = ThreadPool(5).imap_unordered(download_url, urls)
for r in results:
    print(r)
Try this:
import os
import requests
from multiprocessing.pool import ThreadPool

def download_url(url):
    print("downloading: ", url)
    # assumes that the last segment after the / represents the file name
    # if url is abc/xyz/file.txt, the file name will be file.txt
    file_name_start_pos = url.rfind("/") + 1
    file_name = url[file_name_start_pos:]
    save_path = 'ovpns/'
    complete_path = os.path.join(save_path, file_name)
    print(complete_path)
    r = requests.get(url, stream=True)
    if r.status_code == requests.codes.ok:
        with open(complete_path, 'wb') as f:
            for data in r:
                # write only the raw bytes here; nothing is concatenated
                f.write(data)
    return complete_path

servers = [
    "us-ca72.nordvpn.com",
    "us-ca73.nordvpn.com"
]

urls = []
for server in servers:
    urls.append("https://downloads.nordcdn.com/configs/files/ovpn_legacy/servers/" + server + ".udp1194.ovpn")

# Run 5 threads. Each call will take the next element in the urls list
results = ThreadPool(5).imap_unordered(download_url, urls)

for path in results:
    # after each download finishes, append the block once, in append-binary mode
    with open(path, 'ab') as f:
        f.write(b"""<auth-user-pass>
username
password
</auth-user-pass>""")
    print(path)
You are opening the file in binary mode, so encode your string before concatenating; that is, replace

for data in r:
    f.write(data + """<auth-user-pass>
username
password
</auth-user-pass>""")

with

for data in r:
    f.write(data + """<auth-user-pass>
username
password
</auth-user-pass>""".encode())
You open the file for writing in binary mode.
Because of that you can't write normal strings, as the comment from #user56700 said.
You either need to encode the string or open the file another way (e.g. 'a' for appending).
I'm not completely sure, but it is also possible that opening the file for writing discards its existing data. Opening with 'w' normally truncates the file, so you may need to change the mode to 'ab' (append in binary) instead of 'wb'.
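A minimal sketch of that append-in-binary idea, using a hypothetical file name:

# Open the already-downloaded file a second time in 'ab' (append, binary) mode,
# so the existing contents are kept and the block can be written as bytes.
with open("ovpns/example.udp1194.ovpn", "ab") as f:
    f.write(b"""<auth-user-pass>
username
password
</auth-user-pass>""")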
I'm writing a short piece of code in Python to check the status code of a list of URLs. The steps are:
1. Read the URLs from a csv file.
2. Check the request status code.
3. Write the status code into the csv next to the checked URL.
I've managed the first two steps, but I'm stuck on writing the output of the requests into the same csv, next to the URLs. Please help.
import urllib.request
import urllib.error
from multiprocessing import Pool

file = open('innovators.csv', 'r', encoding="ISO-8859-1")
urls = file.readlines()

def checkurl(url):
    try:
        conn = urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        print('HTTPError: {}'.format(e.code) + ', ' + url)
    except urllib.error.URLError as e:
        print('URLError: {}'.format(e.reason) + ', ' + url)
    else:
        print('200' + ', ' + url)

if __name__ == "__main__":
    p = Pool(processes=1)
    result = p.map(checkurl, urls)

with open('innovators.csv', 'w') as f:
    for line in file:
        url = ''.join(line)
        checkurl(urls + "," + checkurl)
The .readlines() call leaves the file object at the end of the file. When you then loop through the lines of file again without first rewinding it (file.seek(0)) or closing and reopening it (file.close() followed by opening it again), there are no lines remaining. It is always recommended to use the with open(...) as file construct to ensure the file is closed when the operation is finished.
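For example, a minimal sketch of rewinding before a second pass, in case you really do want to iterate the file object twice:

with open('innovators.csv', 'r', encoding="ISO-8859-1") as file:
    urls = file.readlines()
    file.seek(0)   # rewind so the same file object can be iterated again
    for line in file:
        print(line.strip())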
Additionally, there appears to be an error in your input to checkurl. You have added a list (urls) to a string (",") to a function (checkurl).
You probably meant for this section to read
with open('innovators.csv', 'w') as f:
    for line in urls:
        url = ''.join(line.replace('\n', ''))  # readlines leaves a linefeed character at the end of each line
        f.write(url + "," + checkurl(url))
The checkurl function should return what you are intending to place into the csv file. You are simply printing to standard output (screen). Thus, replace your checkurl with
def checkurl(url):
    try:
        conn = urllib.request.urlopen(url)
        ret = '0'
    except urllib.error.HTTPError as e:
        ret = 'HTTPError: {}'.format(e.code)
    except urllib.error.URLError as e:
        ret = 'URLError: {}'.format(e.reason)
    else:
        ret = '200'
    return ret
or something equivalent to your needs.
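A minimal sketch putting the two pieces together, assuming innovators.csv holds one URL per line as in the question and checkurl is the returning version above:

# read all URLs first, then rewrite the file as "url,status" rows
with open('innovators.csv', 'r', encoding="ISO-8859-1") as f:
    urls = f.readlines()

with open('innovators.csv', 'w') as f:
    for line in urls:
        url = line.replace('\n', '')
        f.write(url + "," + checkurl(url) + "\n")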
Save the status in a dict and convert it to a DataFrame, then simply write it to a csv file. str(code.getcode()) returns '200' if the URL is reachable; otherwise urlopen raises an exception, in which case I assign the status '000'. So your csv file will contain url,200 when a URL is connecting and url,000 when it is not.
import urllib.request
import pandas as pd

status_dict = {}
for line in lines:                    # the URLs read from the csv
    try:
        code = urllib.request.urlopen(line)
        status = str(code.getcode())
        status_dict[line] = status
    except Exception:
        status = "000"
        status_dict[line] = status

df = pd.DataFrame(list(status_dict.items()), columns=['url', 'status'])
df.to_csv('filename.csv', index=False)
Hey guys, I'm making a Python web crawler at the moment. I have a link whose last characters are "search?q=", and after it I append a word from a wordlist that I loaded into a list beforehand. But when I try to open that with urllib2.urlopen(url) it throws me an error (urlopen error no host given). When I open that link with urllib normally (typing in the word that would otherwise be pasted in automatically), it works just fine. Can you tell me why this is happening?
Thanks and regards
Full error:
File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 61, in <module>
getResults()
File "C:/Users/David/PycharmProjects/GetAppResults/main.py", line 40, in getResults
usock = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 402, in open
req = meth(req)
File "C:\Python27\lib\urllib2.py", line 1113, in do_request_
raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>
Code:
with open(filePath, "r") as ins:
    wordList = []
    for line in ins:
        wordList.append(line)

def getResults():
    packageID = ""
    count = 0
    word = "Test"
    for x in wordList:
        word = x
        print word
        url = 'http://www.example.com/search?q=' + word
        usock = urllib2.urlopen(url)
        page_source = usock.read()
        usock.close()
        print page_source
        startSequence = "data-docid=\""
        endSequence = "\""
        while page_source.find(startSequence) != -1:
            start = page_source.find(startSequence) + len(startSequence)
            end = page_source.find(endSequence, start)
            print str(start)
            print str(end)
            link = page_source[start:end]
            print link
            if link:
                if not link in packageID:
                    packageID += link + "\r\n"
                    print packageID
            page_source = page_source[end + len(endSequence):]
            count += 1
So when I print the string word, it outputs the correct word from the wordlist.
I solved the problem. I am simply using urllib now instead of urllib2 and everything works fine. Thank you all :)
Note that urlopen() returns a response, not a request.
You may have a broken proxy configuration; verify that your proxies are working:

print(urllib.getproxies())

or bypass proxy support altogether with:

usock = urllib.urlopen(
    "http://www.example.com/search?q=" + text_to_check,
    proxies={})
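For example, used inside the question's getResults() loop (Python 2 urllib, proxy support bypassed):

import urllib

usock = urllib.urlopen('http://www.example.com/search?q=' + word, proxies={})
page_source = usock.read()
usock.close()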
Here is a sample way of combining the URL with a word from the wordlist. It combines entries from the two lists to fetch the image URLs from the resulting page and downloads the images. Loop over your lists to cover everything you have.
import urllib
import re

print "The URL crawler starts.."

mylist = ["http://www.ebay", "https://www.npmjs.org/"]
wordlist = [".com", "asss"]

x = 1
urlcontent = urllib.urlopen(mylist[0] + wordlist[0]).read()
imgUrls = re.findall('img .*?src="(.*?)"', urlcontent)

for imgUrl in imgUrls:
    img = imgUrl
    print img
    urllib.urlretrieve(img, str(x) + ".jpg")
    x = x + 1
Hope this helps, else post your code and error logs.
I am trying to access pages by incrementing the page counter using the opencorporates api. But the problem is that there are times when the data is useless. For example, for jurisdiction_code = ae_az the url below returns a page showing just this:
{"api_version":"0.2","results":{"companies":[],"page":1,"per_page":26,"total_pages":0,"total_count":0}}
which is technically empty. How do I check for such data and skip over it to move on to the next jurisdiction?
This is my code
import urllib2
import json, os

f = open('codes', 'r')
for line in f.readlines():
    id = line.strip('\n')
    url = 'http://api.opencorporates.com/v0.2/companies/search?q=&jurisdiction_code={0}&per_page=26&current_status=Active&page={1}?api_token=ab123cd45'
    i = 0
    directory = id
    os.makedirs(directory)
    while True:
        i += 1
        req = urllib2.Request(url.format(id, i))
        print url.format(id, i)
        try:
            response = urllib2.urlopen(url.format(id, i))
        except urllib2.HTTPError, e:
            break
        content = response.read()
        fo = str(i) + '.json'
        OUTFILE = os.path.join(directory, fo)
        with open(OUTFILE, 'w') as f:
            f.write(content)
Interpret the response you get back (you already know it's json) and check if the data you want is there.
...
content = response.read()
data = json.loads(content)
if not data.get('results', {}).get('companies'):
    break
...
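A minimal sketch of that check dropped into the question's paging loop, keeping the question's variable names:

while True:
    i += 1
    try:
        response = urllib2.urlopen(url.format(id, i))
    except urllib2.HTTPError:
        break
    content = response.read()
    data = json.loads(content)
    if not data.get('results', {}).get('companies'):
        break  # empty page: move on to the next jurisdiction
    with open(os.path.join(directory, str(i) + '.json'), 'w') as f:
        f.write(content)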
Here's your code written with Requests and using the answer here. It is nowhere near as robust or clean as it should be, but demonstrates the path you might want to take. The rate limit is a guess, and doesn't seem to work. Remember to put your actual API key in.
import json
import os
from time import sleep

import requests

url = 'http://api.opencorporates.com/v0.2/companies/search'
token = 'ab123cd45'
rate = 20  # seconds to wait after rate limited

with open('codes') as f:
    codes = [l.strip('\n') for l in f]

def get_page(code, page, **kwargs):
    params = {
        # 'api_token': token,
        'jurisdiction_code': code,
        'page': page,
    }
    params.update(kwargs)
    while True:
        r = requests.get(url, params=params)
        try:
            data = r.json()
        except ValueError:
            return None
        if 'error' in data:
            print data['error']['message']
            sleep(rate)
            continue
        return data['results']

def dump_page(code, page, data):
    with open(os.path.join(code, str(page) + '.json'), 'w') as f:
        json.dump(data, f)

for code in codes:
    try:
        os.makedirs(code)
    except os.error:
        pass
    data = get_page(code, 1)
    if data is None:
        continue
    dump_page(code, 1, data['companies'])
    # page 1 is already saved, so fetch the remaining pages 2..total_pages
    for page in xrange(2, int(data.get('total_pages', 1)) + 1):
        data = get_page(code, page)
        if data is None:
            break
        dump_page(code, page, data['companies'])
I think this example is actually not "technically empty." It contains data and is therefore, technically, not empty; the data just does not include any fields that are useful to you. :-)
If you want your code to skip over responses with uninteresting data, just check whether the JSON has the necessary fields before writing anything:
content = response.read()
try:
    json_content = json.loads(content)
    if json_content['results']['total_count'] > 0:
        fo = str(i) + '.json'
        OUTFILE = os.path.join(directory, fo)
        with open(OUTFILE, 'w') as f:
            f.write(content)
except KeyError:
    break
except ValueError:
    break
etc. You might want to report the ValueError or the KeyError, but that's up to you.
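For example, if you do want to report them before breaking out of the loop, the same fragment could read:

try:
    json_content = json.loads(content)
    if json_content['results']['total_count'] > 0:
        with open(os.path.join(directory, str(i) + '.json'), 'w') as f:
            f.write(content)
except KeyError as e:
    print 'missing key in response: {}'.format(e)
    break
except ValueError as e:
    print 'response is not valid JSON: {}'.format(e)
    break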
Hello! I have the following script:
import os
import stat

curDir = os.getcwd()

autorun_signature = ["[Autorun]",
                     "Open=regsvr.exe",
                     "Shellexecute=regsvr.exe",
                     "Shell\Open\command=regsvr.exe",
                     "Shell=Open"]
content = []

def read_signature(file_path):
    try:
        with open(file_path) as data:
            for i in range(0, 5):
                content.append(data.readline())
    except IOError as err:
        print("File Error: " + str(err))

read_signature(os.getcwd() + '/' + 'autorun.inf')
if(content == autorun_signature):
    print("Equal content")
else:
    print("Not equal")
It prints not equal, then I tried this method:
import os
import stat

curDir = os.getcwd()

autorun_signature = "[Autorun]\nOpen=regsvr.exe\nShellexecute=regsvr.exe\nShell\Open\command=regsvr.exe\nShell=Open"
content = ""

def read_signature(file_path):
    try:
        with open(file_path) as data:
            content = data.read()
    except IOError as err:
        print("File Error: " + str(err))

read_signature(os.getcwd() + '/' + 'autorun.inf')
if(content == autorun_signature):
    print("Equal content")
else:
    print("Not equal")
It also prints "Not equal"!
I want to store the content of the autorun.inf file in the script and, every time I find such a file, check whether its content matches. I could not get this to work; can anyone help me?
The content of autorun.inf:
[Autorun]
Open=regsvr.exe
Shellexecute=regsvr.exe
Shell\Open\command=regsvr.exe
Shell=Open
Linebreaks under Windows \r\n are different from Linux's \n.
So try replacing \n with \r\n.
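A minimal sketch of that idea: read the file, normalize Windows \r\n line endings, strip the trailing newline, and return the content instead of relying on a global (the file is assumed to be autorun.inf in the current directory):

def read_signature(file_path):
    # read the file and normalize Windows \r\n line endings to \n
    with open(file_path) as data:
        return data.read().replace("\r\n", "\n").strip()

autorun_signature = ("[Autorun]\n"
                     "Open=regsvr.exe\n"
                     "Shellexecute=regsvr.exe\n"
                     "Shell\\Open\\command=regsvr.exe\n"
                     "Shell=Open")

if read_signature("autorun.inf") == autorun_signature:
    print("Equal content")
else:
    print("Not equal")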
It is probably due to the fact that Windows newlines are \r\n instead of \n.
Also, you should escape the "\" characters, so write "\\" instead (e.g. "Shell\\Open\\command=regsvr.exe").