How do I read an HTML file in Python from multiple URLs?

I'm writing a script that will pull data from a basic HTML page based on the following:
The first parameter in the URL is a float between -90.0 and 90.0 (inclusive) and the second is a float between -180.0 and 180.0 (inclusive). Each URL leads to a page with a single number as its body (for example, http://jawbone-virality.herokuapp.com/scanner/desert/-89.7/131.56/). I need to find the largest virality number across all of the pages reachable through these URLs.
Right now the script prints the first and second number along with the number in the body (we call it virality). It only prints to the console; every time I try writing the output to a file, I get errors. Any hints about what I'm missing? I'm very new to Python, so I may well be overlooking something.
import shutil
import os
import time
import datetime
import math
import urllib
from array import array
myFile = open('test.html','w')
m = 5
for x in range(-900,900,1):
    for y in range(-1800,1800,1):
        filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/'+str(x/10)+'/'+str(y/10)+'/')
        print 'Planet Desert: (' + str(x/10) +','+ str(y/10) + '), Virality: ' + filehandle.readlines()[0] #lines
        #myFile.write('Planet Desert: (' + str(x/10) +','+ str(y/10) + '), Virality: ' + filehandle.readlines()[0])
myFile.close()
filehandle.close()
Thank you!

When writing to the file, do you still have the print statement before? Then your problem would be that Python advances the file pointer to the end of the file when you call readlines(). The second call to readlines() will thus return an empty list and your access to the first element results in an IndexError.
See this example execution:
filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
print(filehandle.readlines()) # prints ['5']
print(filehandle.readlines()) # prints []
The solution is to save the result into a variable and then use it.
filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
res = filehandle.readlines()[0]
print(res) # prints 5
print(res) # prints 5
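The read-once behaviour is not specific to urllib; any file-like object keeps an internal position. Here is a minimal sketch of that, using io.BytesIO as a stand-in for the HTTP response (an assumption, so the example runs without network access):

```python
import io

# io.BytesIO stands in for the object returned by urlopen();
# both are file-like and keep an internal read position.
response = io.BytesIO(b"5\n")

first = response.readlines()   # consumes everything up to the end of the stream
second = response.readlines()  # nothing is left to read

# Reading once and keeping the value lets you reuse it freely.
response = io.BytesIO(b"5\n")
virality = response.read().strip()
message = "Virality: " + virality.decode("ascii")
print(message)
```

Indexing `second[0]` here would raise the same IndexError as the second readlines() call in the question.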
Yet, as already pointed out in the comments, calling readlines() is not needed here, because the page body appears to be just a single integer. The concept of lines adds no information, so let's replace it with the simpler read() (even readline() is unnecessary here).
filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
res = filehandle.read()
print(res) # prints 5
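There's still another problem in your source code. From your usage of urllib.urlopen() I infer that you are using Python 2. In Python 2, however, dividing two integers with / performs integer division and floors the result (similar to integer division in C or Java, although those truncate toward zero). Thus, consecutive iterations request the same URL, for example http://jawbone-virality.herokuapp.com/scanner/desert/-90/-180/, many times over instead of stepping by 0.1.
This can be fixed by any one of the following:
adding from __future__ import division at the top of the script
using str(x / 10.0) and str(y / 10.0)
switching to Python 3, where / is true division, and using urllib.request instead of urllib.urlopen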
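To tie the pieces together, here is a rough Python 3 sketch of the whole scan. It rests on a few assumptions: the page body is always a plain integer, the helper names (fetch_virality, scan, fake_fetch) are made up for illustration, and the demonstration at the bottom uses a fake fetcher so nothing is actually downloaded.

```python
import urllib.request

BASE_URL = "http://jawbone-virality.herokuapp.com/scanner/desert/"


def fetch_virality(lat, lon):
    """Download one page and return its body as an int (assumes an integer body)."""
    url = "{}{}/{}/".format(BASE_URL, lat, lon)
    with urllib.request.urlopen(url) as response:
        return int(response.read().decode("ascii").strip())


def scan(lat_steps, lon_steps, fetch=fetch_virality, log=None):
    """Return (max_virality, lat, lon) over steps given in tenths of a degree."""
    best = None
    for x in lat_steps:
        for y in lon_steps:
            lat, lon = x / 10.0, y / 10.0   # float division in Python 2 and 3
            virality = fetch(lat, lon)      # read once, reuse below
            if log is not None:
                log.write("Planet Desert: ({},{}), Virality: {}\n".format(lat, lon, virality))
            if best is None or virality > best[0]:
                best = (virality, lat, lon)
    return best


# Offline demonstration: a made-up fetcher that peaks at (0.0, 0.1).
def fake_fetch(lat, lon):
    return 100 - round(lat * 10) ** 2 - (round(lon * 10) - 1) ** 2


result = scan(range(-2, 3), range(-3, 4), fetch=fake_fetch)
print(result)
```

For the real run you would pass an open text file as log and call scan(range(-900, 901), range(-1800, 1801)); the upper bounds are 901 and 1801 because range() excludes its end value and the question says the limits are inclusive. Note that this is roughly 6.5 million requests, so expect it to take a long time.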
Hopefully, I could help.

Related

String Manipulation for Json webscraping

I am trying to scrape a website and have all the data needed in very long matrices which were obtained through requests and json imports.
I am having issues getting any output.
Is it because of the merge of two strings in requests.get()?
Here is the part with the problem, all things used were declared at the start of the code.
balance=[]
for q in range(len(DepositMatrix)):
    address= requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
    data4 = address.json()
    TokenBalances = data4['balances'] #returns a dictionary
    balance.append(TokenBalances)
print(balance)
Example of DepositMatrix - list of lists with 4 elements, [[string , float, int, int]]
[['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]
I think the error is in this part:
requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
This change doesn't help either:
requests.get('https://ethplorer.io/service/service.php?data=' + DepositMatrix[q][0])

Like I said in my comment, I tried your code and it worked for me. But I wanted to highlight some things that could help your code be clearer:
import requests
import pprint
DepositMatrix = [['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
                 ['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]
balance=[]
for deposit in DepositMatrix:
    address = requests.get('https://ethplorer.io/service/service.php?data=' + deposit[0])
    data4 = address.json()
    TokenBalances = data4['balances'] #returns a dictionary
    balance.append(TokenBalances)
pprint.pprint(balance)
For your loop, instead of creating a range of the length of your list (q) and then using this q to get the information back from your list, it's simpler to get each element directly (for deposit in DepositMatrix:)
I've used the pprint module to ease the visualization of your data.
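If you want to rule out the string concatenation as a source of trouble, one option is to build the URL separately and check it before sending anything. The sketch below is only an illustration: build_url, collect_balances and fake_fetch_json are hypothetical names, and the fake response shape is invented so the example runs offline.

```python
from urllib.parse import urlencode

SERVICE_URL = "https://ethplorer.io/service/service.php"


def build_url(address):
    """Build the query URL for one address, escaping the parameter safely."""
    return SERVICE_URL + "?" + urlencode({"data": address})


def collect_balances(deposits, fetch_json):
    """Map each address (first column) to the 'balances' entry of its response."""
    balances = {}
    for address, *_rest in deposits:
        data = fetch_json(build_url(address))
        # .get() avoids a KeyError if the service omits the key.
        balances[address] = data.get("balances")
    return balances


# Offline stand-in for a real HTTP call; the payload shape is made up.
def fake_fetch_json(url):
    return {"balances": [{"url": url}]}


deposit_matrix = [
    ["0x2b5634c42055806a59e9107ed44d43c426e58258", 488040277.1535826, 660, 7103],
    ["0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c", 376515313.83254075, 2069, 12705],
]
result = collect_balances(deposit_matrix, fake_fetch_json)
print(build_url(deposit_matrix[0][0]))
```

With requests installed, passing lambda url: requests.get(url).json() as fetch_json should give the live data, and printing the built URL first makes it easy to paste into a browser and check.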

python variable m.start() from re.finditer does not get overwritten

I am currently writing a small python program for manipulating text files. (I am a newb programmer)
First, I am using re.finditer to find a specific string in lines1. Then I write this into a file and close it.
Next I want to grab the first line and search for this in another text file. The first time using re.finditer it was working great.
The problem is: m.start() always returns the last value of the first m.start. It does not get overwritten as it was the first time using re.finditer.
Could you help me understand why?
my code:
for m in re.finditer(finder1,lines1):
    end_of_line = lines1.find('\n',m.start())
    #print(m.start())
    found_tag = lines1[m.start()+lenfinder1:end_of_line]
    writefile.write(found_tag+'\n')
    lenfinder2 = len(found_tag)
input_file3 = open ('out.txt')
writefile.close()
num_of_lines3 = file_len('out.txt')
n=1
while (n < num_of_lines3):
    line = linecache.getline('out.txt', n)
    n = n+1
    re.finditer(line,lines2)
    #print(m.start())

You haven't assigned the variable line that you use here:
re.finditer(line,lines2)
So, change :
linecache.getline('out.txt', n)
to
line = linecache.getline('out.txt', n)
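Beyond that, note that the second re.finditer() call only returns a new iterator. If its result is never looped over, m still refers to the last match from the first loop, which would explain the stale m.start(). A small sketch with made-up sample text (lines2 and tags below are invented for illustration):

```python
import re

lines2 = "alpha tag1 beta\ngamma tag2 tag1\n"   # made-up text to search
tags = ["tag1\n", "tag2\n"]                      # lines as linecache.getline() returns them

positions = {}
for line in tags:
    needle = line.rstrip("\n")            # drop the newline that getline() keeps
    pattern = re.escape(needle)           # treat the tag as literal text, not a regex
    # Iterate over the *new* iterator; each m belongs to this search.
    positions[needle] = [m.start() for m in re.finditer(pattern, lines2)]

print(positions)
```

The rstrip() matters because linecache.getline() keeps the trailing newline, so the unstripped line would only match where the tag is followed by a line break.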

Having an issue with using median function in numpy

I am having an issue with using the median function in numpy. The code used to work on a previous computer but when I tried to run it on my new machine, I got the error "cannot perform reduce with flexible type". In order to try to fix this, I attempted to use the map() function to make sure my list was a floating point and got this error message: could not convert string to float: .
After some more debugging, it seems that my issue is with the splitting of the lines in my input file. The lines are of the form: 2456893.248202,4.490 and I want to split on the ",". However, when I print out the list for the second column of that line, I get
4
.
4
9
0
so it seems to somehow be splitting each character or something though I'm not sure how. The relevant section of code is below, I appreciate any thoughts or ideas and thanks in advance.
def curve_split(fn):
    with open(fn) as f:
        for line in f:
            line = line.strip()
            time,lc = line.split(",")
            #debugging stuff
            g=open('test.txt','w')
            l1=map(lambda x:x+'\n',lc)
            g.writelines(l1)
            g.close()
            #end debugging stuff
    return time,lc
if __name__ == '__main__':
    # place where I keep the lightcurve files from the image subtraction
    dirname = '/home/kuehn/m4/kepler/subtraction/detrending'
    files = glob.glob(dirname + '/*lc')
    print(len(files))
    # in order to create our lightcurve array, we need to know
    # the length of one of our lightcurve files
    lc0 = curve_split(files[0])
    lcarr = np.zeros([len(files),len(lc0)])
    # loop through every file
    for i,fn in enumerate(files):
        time,lc = curve_split(fn)
        lc = map(float, lc)
        # debugging
        print(fn[5:58])
        print(lc)
        print(time)
        # end debugging
        lcm = lc/np.median(float(lc))
        #lcm = ((lc[qual0]-np.median(lc[qual0]))/
        #       np.median(lc[qual0]))
        lcarr[i] = lcm
        print(fn,i,len(files))
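Judging only from the code shown, a likely explanation is that curve_split returns the values of a single line as strings. lc is then the string "4.490", so both the debugging map() and map(float, lc) walk over it character by character and fail on ".". The sketch below collects every line into lists of floats instead. It is an illustration under assumptions: it takes an open file rather than a filename, the sample data is made up, and statistics.median stands in for np.median so it runs without third-party packages.

```python
import io
import statistics


def curve_split(f):
    """Read 'time,flux' lines from an open text file into two lists of floats."""
    times, fluxes = [], []
    for line in f:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        t, lc = line.split(",")
        times.append(float(t))            # convert each field once, here
        fluxes.append(float(lc))
    return times, fluxes


# Made-up sample data in the same format as the input lines in the question.
sample = io.StringIO("2456893.248202,4.490\n2456893.268636,4.510\n2456893.289070,4.500\n")
times, fluxes = curve_split(sample)

# What happens when a single field is iterated: one item per character.
chars = list("4.490")

# statistics.median stands in for np.median in this sketch.
median = statistics.median(fluxes)
normalized = [value / median for value in fluxes]
print(chars, median)
```

With a list of floats like fluxes, np.array(fluxes) / np.median(fluxes) should work as intended; the "flexible type" error typically means NumPy was handed strings.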

handling command output in python

I want to work with the output of a wifi scan command. The output spans several lines, and I am interested in two pieces of information from it. The goal is to have the ESSID and the address in a two-dimensional array (I hope that's the right term). Here is what I have so far:
#!/usr/bin/python
import subprocess
import re
from time import sleep
# set wifi interface
wif = "wlan0"
So I get the command stdout and I find out that to work with this output in a loop I have to use iter
# check for WiFis nearby
wifi_out = subprocess.Popen(["iwlist", wif ,"scan"],stdout=subprocess.PIPE)
wifi_data = iter(wifi_out.stdout.readline,'')
Then I used enumerate to have the index and therefore I search for the line with the address and the next line (index + 1) would contain the ESSID
for index, line in enumerate(wifi_data):
    searchObj = re.search( r'.* Cell [0-9][0-9] - Address: .*', line, re.M|re.I)
    if searchObj:
        print index, line
        word = line.split()
        wifi = [word[4],wifi_data[index + 1]]
Now I have two problems
1) wifi_data is the wrong Type
TypeError: 'callable-iterator' object has no attribute '__getitem__'
2) I guess with
wifi = [word[4],wifi_data[index + 1]]
I overwrite the variable on every iteration instead of appending to it. But I want a variable that, in the end, holds all ESSIDs together with their corresponding addresses.
I am new to Python, so currently I am imagining something like
WIFI[0][0] returns ESSID
WIFI[0][1] returns address to ESSID in WIFI[0][0]
WIFI[1][0] returns next ESSID
WIFI[1][1] returns address to ESSID in WIFI[1][0]
and so on. Or is there something else in Python that would be better suited to this kind of information?

I think you want
next(wifi_data)
since you cannot index into an iterator ... this will give you the next item ... but it may screw up your loop ...
although really you could just do
wifi_out = subprocess.Popen(["iwlist", wif ,"scan"],stdout=subprocess.PIPE)
wifi_data = wifi_out.communicate()[0].splitlines()
or even easier perhaps
wifi_data = subprocess.check_output(["iwlist",wif,"scan"]).splitlines()
and then you will have a list ... which will behave the way you expect when you access the data by index (there's not really a good reason to use an iterator for this, as far as I can tell)
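For the second part of the question (collecting all ESSIDs with their addresses), something along these lines might work once the output is available as text. This is a sketch under assumptions: the SAMPLE text is made up to resemble iwlist output, real output has more fields and may differ between versions, and parse_scan is a hypothetical helper name.

```python
import re

# Made-up sample in the general shape of `iwlist <iface> scan` output.
SAMPLE = """\
wlan0     Scan completed :
          Cell 01 - Address: 00:11:22:33:44:55
                    Channel:6
                    ESSID:"HomeNet"
          Cell 02 - Address: 66:77:88:99:AA:BB
                    Channel:11
                    ESSID:"Cafe"
"""

ADDRESS_RE = re.compile(r"Cell \d+ - Address: (\S+)")
ESSID_RE = re.compile(r'ESSID:"(.*)"')


def parse_scan(text):
    """Return a list of [essid, address] pairs, one per cell."""
    networks = []
    address = None
    for line in text.splitlines():
        found = ADDRESS_RE.search(line)
        if found:
            address = found.group(1)      # remember it until the ESSID shows up
            continue
        found = ESSID_RE.search(line)
        if found and address is not None:
            networks.append([found.group(1), address])   # append, don't overwrite
            address = None
    return networks


wifi = parse_scan(SAMPLE)
print(wifi)
```

Remembering the last address instead of looking at index + 1 avoids relying on the ESSID being on the very next line. In Python 3, subprocess.check_output(["iwlist", wif, "scan"], universal_newlines=True) returns a str that can be passed to parse_scan directly.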

Writing a random amount of random numbers to a file and returning their squares

So, I'm trying to write a random amount of random whole numbers (in the range of 0 to 1000), square these numbers, and return these squares as a list. Initially, I started off writing to a specific txt file that I had already created, but it didn't work properly. I looked for some methods I could use that might make things a little easier, and I found the tempfile.NamedTemporaryFile method that I thought might be useful. Here's my current code, with comments provided:
# This program calculates the squares of numbers read from a file, using several functions
# reads file- or writes a random number of whole numbers to a file -looping through numbers
# and returns a calculation from (x * x) or (x**2);
# the results are stored in a list and returned.
# Update 1: after errors and logic problems, found Python method tempfile.NamedTemporaryFile:
# This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system, and creates a temporary file that can be written on and accessed
# (say, for generating a file with a list of integers that is random every time).
import random, tempfile
# Writes to a temporary file for a length of random (file_len is >= 1 but <= 100), with random numbers in the range of 0 - 1000.
def modfile(file_len):
    with tempfile.NamedTemporaryFile(delete = False) as newFile:
        for x in range(file_len):
            newFile.write(str(random.randint(0, 1000)))
        print(newFile)
    return newFile
# Squares random numbers in the file and returns them as a list.
def squared_num(newFile):
    output_box = list()
    for l in newFile:
        exp = newFile(l) ** 2
        output_box[l] = exp
    print(output_box)
    return output_box
print("This program reads a file with numbers in it - i.e. prints numbers into a blank file - and returns their conservative squares.")
file_len = random.randint(1, 100)
newFile = modfile(file_len)
output = squared_num(file_name)
print("The squared numbers are:")
print(output)
Unfortunately, now I'm getting this error in line 15, in my modfile function: TypeError: 'str' does not support the buffer interface. As someone who's relatively new to Python, can someone explain why I'm having this, and how I can fix it to achieve the desired result? Thanks!
EDIT: now fixed code (many thanks to unutbu and Pedro)! Now: how would I be able to print the original file numbers alongside their squares? Additionally, is there any minimal way I could remove decimals from the outputted float?

By default tempfile.NamedTemporaryFile creates a binary file (mode='w+b'). To open the file in text mode and be able to write text strings (instead of byte strings), you need to change the temporary file creation call to not use the b in the mode parameter (mode='w+'):
tempfile.NamedTemporaryFile(mode='w+', delete=False)
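To see the difference between the two modes side by side, here is a small sketch for Python 3 (the variable names are arbitrary):

```python
import os
import tempfile

# Default mode is 'w+b': writing a str raises TypeError on Python 3.
with tempfile.NamedTemporaryFile(delete=False) as binary_file:
    binary_name = binary_file.name
    try:
        binary_file.write("42\n")
        binary_error = None
    except TypeError as exc:
        binary_error = exc
    binary_file.write(b"42\n")            # bytes are accepted

# Text mode accepts str directly.
with tempfile.NamedTemporaryFile(mode="w+", delete=False) as text_file:
    text_name = text_file.name
    text_file.write("42\n")

with open(text_name) as f:
    text_content = f.read()
with open(binary_name, "rb") as f:
    binary_content = f.read()

# delete=False leaves cleanup to the caller.
os.remove(binary_name)
os.remove(text_name)
print(type(binary_error).__name__, repr(text_content))
```

The exact wording of the TypeError message varies between Python versions, which is why the sketch only checks its type.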

You need to put newlines after each int, lest they all run together creating a huge integer:
newFile.write(str(random.randint(0, 1000))+'\n')
(Also set the mode, as explained in PedroRomano's answer):
with tempfile.NamedTemporaryFile(mode = 'w+', delete = False) as newFile:
modfile returns a closed filehandle. You can still get a filename out of it, but you can't read from it. So in modfile, just return the filename:
return newFile.name
And in the main part of your program, pass the filename on to the squared_num function:
filename = modfile(file_len)
output = squared_num(filename)
Now inside squared_num you need to open the file for reading.
with open(filename, 'r') as f:
    for l in f:
        exp = float(l)**2 # `l` is a string. Convert to float before squaring
        output_box.append(exp) # build output_box with append
Putting it all together:
import random, tempfile
def modfile(file_len):
    with tempfile.NamedTemporaryFile(mode = 'w+', delete = False) as newFile:
        for x in range(file_len):
            newFile.write(str(random.randint(0, 1000))+'\n')
        print(newFile)
    return newFile.name
# Squares random numbers in the file and returns them as a list.
def squared_num(filename):
    output_box = list()
    with open(filename, 'r') as f:
        for l in f:
            exp = float(l)**2
            output_box.append(exp)
    print(output_box)
    return output_box
print("This program reads a file with numbers in it - i.e. prints numbers into a blank file - and returns their conservative squares.")
file_len = random.randint(1, 100)
filename = modfile(file_len)
output = squared_num(filename)
print("The squared numbers are:")
print(output)
PS. Don't write lots of code without running it. Write little functions, and test that each works as expected. For example, testing modfile would have revealed that all your random numbers were being concatenated. And printing the argument sent to squared_num would have shown it was a closed filehandle.
Testing the pieces gives you firm ground to stand on and lets you develop in an organized way.
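Regarding the follow-up in the question's EDIT (printing the original numbers next to their squares, without decimals): one possible approach is to convert with int() instead of float() and keep each number together with its square. The function names below are made up for this sketch and are not part of the code above.

```python
import os
import random
import tempfile


def write_numbers(count):
    """Write `count` random integers, one per line, and return the file name."""
    with tempfile.NamedTemporaryFile(mode="w+", delete=False) as new_file:
        for _ in range(count):
            new_file.write(str(random.randint(0, 1000)) + "\n")
    return new_file.name


def number_square_pairs(filename):
    """Return (number, square) tuples; int() keeps the squares free of decimals."""
    pairs = []
    with open(filename) as f:
        for line in f:
            number = int(line)            # int() tolerates the trailing newline
            pairs.append((number, number ** 2))
    return pairs


filename = write_numbers(random.randint(1, 100))
pairs = number_square_pairs(filename)
os.remove(filename)                       # delete=False means we clean up ourselves

for number, square in pairs:
    print("{:>4} squared is {}".format(number, square))
```

Since the file only ever contains whole numbers, int() is enough here; if it could contain decimals, float() followed by formatting such as "{:.0f}" would be an alternative.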
