Efficient regex parsing of html - python

I have a piece of Python code scraping datapoint values from what appears to be a JavaScript graph on a webpage. The data looks like:
...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...
where the dots stand for plotting data I skipped.
To scrape the useful information - the x/y coordinates of the datapoints - I used regex:
# first, get the raw x matches
xData = re.findall(r"'x':\d+", htmlContent)
# now read each value one by one
xData = [int(re.findall(r"\d+", x)[0]) for x in xData]
Same for the y values. I don't know if this is terribly inefficient, but it doesn't look pretty or very smart, as I have many redundant calls to re.findall. Is there a way to do it in one pass? One pass for x and one pass for y?

You can do it a little more simply:
htmlContent = """
...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...
"""
# Get the numbers
xData = [int(_) for _ in re.findall(r"'x':(\d+)", htmlContent)]
print(xData)
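If you want both coordinates from a single pass over the text, one option is a regex with two capture groups. A sketch, assuming each point lists 'y' before 'x' exactly as in the excerpt above:
import re

# One pass over htmlContent: grab the y and x value of each point together.
# Assumes every point contains 'y':<digits> ... 'x':<digits> in that order.
pairs = re.findall(r"'y':(\d+).*?'x':(\d+)", htmlContent)
yData = [int(y) for y, x in pairs]
xData = [int(x) for y, x in pairs]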

Related

extracting x and y data from a "messy" txt file

I assume the question might be quite basic, but I had no idea how I should search for this specific issue:
I have a .txt file where, over several lines, several x-y data points are present per line. x and y values that belong together are separated by a comma, while the different couples are separated by a space.
Here in example:
2,20 12,40 13,100 14,300
15,440 16,10 24,50 25,350
26,2322 27,3323 28,9999 29,2152
30,2622 31,50
I simply want to use Python to store all x and y values in individual arrays. There must be an easy solution, but I just can't get my head around how I should read them out.
Thanks a lot for any help in advance.
I tried to read each line by itself and then each line value by value, but that is not working.
fileInp = "2,20 12,40 13,100 14,300 15,440 16,10 24,50 25,350 26,2322 27,3323 28,9999 29,2152 30,2622 31,50"
x = list()
y = list()
for data in fileInp.split():
    x_y_data = data.split(",")
    x.append(x_y_data[0])
    y.append(x_y_data[1])
print(x)
print(y)
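For what it's worth, that loop is nearly there; the only missing step is converting the strings to numbers. A minimal sketch, assuming the values are meant to be ints:
x, y = [], []
for pair in fileInp.split():
    a, b = pair.split(",")
    x.append(int(a))  # convert from string to int
    y.append(int(b))
print(x)  # [2, 12, 13, 14, 15, 16, 24, 25, 26, 27, 28, 29, 30, 31]
print(y)  # [20, 40, 100, 300, 440, 10, 50, 350, 2322, 3323, 9999, 2152, 2622, 50]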

How to remove a double/nested for-loop? String to float transformation in Python

I have the following polygon of a geographic area that I fetch via a request in CAP/XML format from an API
The raw data looks like this:
<polygon>22.3243,113.8659 22.3333,113.8691 22.4288,113.8691 22.4316,113.8742 22.4724,113.9478 22.5101,113.9951 22.5099,113.9985 22.508,114.0017 22.5046,114.0051 22.5018,114.0085 22.5007,114.0112 22.5007,114.0125 22.502,114.0166 22.5038,114.0204 22.5066,114.0245 22.5067,114.0281 22.5057,114.0371 22.5051,114.0409 22.5041,114.0453 22.5025,114.0494 22.5023,114.0511 22.5035,114.0549 22.5047,114.0564 22.5059,114.057 22.5104,114.0576 22.512,114.0584 22.5144,114.0608 22.5163,114.0637 22.517,114.0657 22.5172,114.0683 22.5181,114.0717 22.5173,114.0739</polygon>
I store the requested items in a dictionary and then work through them to transform to a GeoJSON list object that is suitable for ingestion into Elasticsearch according to the schema I'm working with. I've removed irrelevant code here for ease of reading.
# fetch and store the data in a dictionary
r = requests.get("https://alerts.weather.gov/cap/ny.php?x=0")
xpars = xmltodict.parse(r.text)
json_entry = json.dumps(xpars['feed']['entry'])
dict_entry = json.loads(json_entry)

# transform items if necessary
for entry in dict_entry:
    if entry['cap:polygon']:
        polygon = entry['cap:polygon']
        polygon = polygon.split(" ")
        coordinates = []
        # take the split list items, swap their positions and enclose them in their own arrays
        for p in polygon:
            p = p.split(",")
            p[0], p[1] = float(p[1]), float(p[0])  # swap lat,lon to lon,lat
            coordinates += [p]
        # more code adding fields to the new dict object, not relevant to the question
The output of the p in polygon loop looks like:
[ [113.8659, 22.3243], [113.8691, 22.3333], [113.8691, 22.4288], [113.8742, 22.4316], [113.9478, 22.4724], [113.9951, 22.5101], [113.9985, 22.5099], [114.0017, 22.508], [114.0051, 22.5046], [114.0085, 22.5018], [114.0112, 22.5007], [114.0125, 22.5007], [114.0166, 22.502], [114.0204, 22.5038], [114.0245, 22.5066], [114.0281, 22.5067], [114.0371, 22.5057], [114.0409, 22.5051], [114.0453, 22.5041], [114.0494, 22.5025], [114.0511, 22.5023], [114.0549, 22.5035], [114.0564, 22.5047], [114.057, 22.5059], [114.0576, 22.5104], [114.0584, 22.512], [114.0608, 22.5144], [114.0637, 22.5163], [114.0657, 22.517], [114.0683, 22.5172], [114.0717, 22.5181], [114.0739, 22.5173] ]
Is there a way to do this that is better than O(N^2)? Thank you for taking the time to read.
O(K×N×M)
This process involves three obvious loops:
- checking each entry (K)
- splitting valid entries into points (M×N) and iterating through those points (N)
- splitting those points into their respective coordinates (M)
The number of characters in a polygon string is ~M×N, because there are N points each roughly M characters long, so the split iterates through M×N characters.
Now that we know all of this, let's pinpoint where each occurs:
ENTRIES (K):
    IF:
        SPLIT (M×N)
        POINTS (N):
            COORDS (M)
So we can conclude that this is O(K × (M×N + M×N)), which is just O(K×N×M).
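Since the total work is linear in the size of the input text, the nested loop is not actually a problem. If the goal is just fewer visible loops, the inner pair can be collapsed into a comprehension; a sketch with the same O(K×N×M) cost:
# Equivalent one-pass transform of one polygon string; the complexity is
# unchanged, every character is still visited exactly once.
coordinates = [
    [float(lon), float(lat)]  # GeoJSON wants [lon, lat]
    for lat, lon in (p.split(",") for p in entry['cap:polygon'].split(" "))
]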

How to split chunks of xy data into lists between isalpha() and newline \n

So I've got a cleaned-up datafile of number strings, representing coordinates for polygons. I've had experience assigning one polygon's data in a datafile to a column and plotting it in numpy/matplotlib, but for this I have to plot multiple polygons from one datafile, separated by headers. The data isn't evenly sized either; every header has several lines of data in two columns, but not the same number of lines.
i.e. I've used .readlines() to go from:
# Title of the polygons
# a comment on the datasource
# A comment on polygon projection
Poly Two/a bit
(331222.6210000003, 672917.1531000007)
(331336.0946000004, 672911.7816000003)
(331488.4949000003, 672932.4191999994)
##etc
Poly One
[(331393.15660000034, 671982.1392999999), (331477.28839999996, 671959.8816), (331602.10170000046, 671926.8432999998), (331767.28160000034, 671894.7273999993), (331767.28529999964, 671894.7267000005), (##etc)]
to:
PolyOneandTwo
319547.04899999965,673790.8118999992
319553.2614000002,673762.4122000001
319583.4143000003,673608.7760000005
319623.6182000004,673600.1608000007
319685.3598999996,673600.1608000007
##etc
PolyTwoandabit
319135.9966000002,673961.9215999991
319139.7357999999,673918.9201999996
319223.0153000001,673611.6477000006
319254.6040000003,673478.1133999992
##etc etc
PolyOneHundredFifty
##etc
My code so far involves cleaning the original dataset up to make it like you see above;
data_easting=[]
data_northing=[]
County = open('counties.dat','r')
for line in County.readlines():
    if line.lstrip().startswith('#'):
        print('Comment line ignored and leading whitespace removed')
        continue
    line = line.replace('/', 'and').replace(' ', '').replace('[', '').replace(']', '').replace('),(', '\n')
    line = line.strip('()\n')
    print(line)
    if line.isalpha():
        print('Skipped header: ' + line)
        continue
I've been using isalpha() to ignore the headers for each polygon so far, and I was planning on using if line == '\n': continue and line.split(',') to ignore the newlines between data and begin splitting the easting and northing lists. I've already got the numpy and matplotlib section of the code (not shown) sorted to make one polygon, but I don't know how to implement it to plot multiple arrays/multiple polygons.
I realised early on though that if I tried to assign all the data to the 2 x and y lists, that would just make one large unending polygon that will make a spaghetti mess of my plot as imaginary lines will be drawn to connect them up.
I want to use the isalpha() section to instead identify and make a dictionary or list of the polygon names, and attach an array for each polygon datablock to that, but I'm not sure how to implement it (or if you even can). I'm also not certain how to make it stop loading data into a list at the end of a polygon datablock (maybe if line == '\n': break? but how to make it start and stop again 149 more times, once for each of the other chunks?).
To make it a bit more difficult, there are 150 polygons with x and y data in this file, so making 150 x and y lists for each individual polygon and writing specific code for each wouldn't be very efficient.
So, how do I basically do:
if line.isalpha():
    # (assign to a Counties dictionary or a list as PolyOne, PolyTwo, ... PolyOneHundredFifty)
    # (a way of getting the data between the header and newline into a separate list)
    # (a way to relate that PolyOne data list of x and y to the dictionary "PolyOne")
if line == '\n':
    # (break? continue?)
    # (then restart and repeat for PolyTwo, ... PolyOneHundredFifty)
line.split(',')
data_easting.append(x)  # x1, x2, ... x150?
data_northing.append(y)  # y1, y2, ... y150?
Is there a way of doing what I intend? How would I go about that without pandas?
Thanks for your time.
Parsing the raw data/file:
When you encounter a line/block like the second in your example,
>>> s = '''[(331393.15660000034, 671982.1392999999), (331477.28839999996, 671959.8816), (331602.10170000046, 671926.8432999998), (331767.28160000034, 671894.7273999993), (331767.28529999964, 671894.7267000005)]'''
it can be converted directly to a 2-d numpy array using ast.literal_eval, which is a safe way to convert text to a Python object - in this case a list of tuples.
>>> import numpy as np
>>> import ast
>>> if s.startswith('['):
...     # print(ast.literal_eval(s))
...     array = np.array(ast.literal_eval(s))
...
>>> array
array([[331393.1566, 671982.1393],
       [331477.2884, 671959.8816],
       [331602.1017, 671926.8433],
       [331767.2816, 671894.7274],
       [331767.2853, 671894.7267]])
>>> array.shape
(5, 2)
For the blocks that resemble the first in your (raw) example, accumulate each line as a tuple of floats in a list; when you reach the next block, make an array of that list and reset it. I put all of this in a generator function which yields blocks as 2-d arrays.
import ast
import numpy as np

def parse(lines_read):
    data = []
    for line in lines_read:
        if line.startswith('#'):
            # comment line
            continue
        elif line.startswith('('):
            # a "(number, number)" line: strip "(" and ")\n", then split
            n, e = line[1:-2].split(',')
            data.append((float(n), float(e)))
        elif line.startswith('['):
            # a whole block on a single line: literal_eval parses it directly
            array = np.array(ast.literal_eval(line))
            yield array
        else:
            # a header or blank line ends the current block
            if data:
                array = np.array(data)
                data = []
                yield array
    # yield a final block if the file ends on a data line
    if data:
        yield np.array(data)
Used like this
>>> for block in parse(f.readlines()):
... print(block)
... print('*******************')
[[331222.621  672917.1531]
 [331336.0946 672911.7816]
 [331488.4949 672932.4192]]
*******************
[[331393.1566 671982.1393]
 [331477.2884 671959.8816]
 [331602.1017 671926.8433]
 [331767.2816 671894.7274]
 [331767.2853 671894.7267]]
*******************
*******************
>>>
If you need to select the northing or easting columns separately, consult the Numpy docs.
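For example, with basic numpy slicing (a sketch; assuming, as in the sample data, that the first column is the easting):
# each yielded block is an (N, 2) array
for block in parse(open('counties.dat')):
    eastings = block[:, 0]   # first column
    northings = block[:, 1]  # second column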
Parsing with two regular expressions. This operates on the whole file read as a string: s = fileobject.read(). It needs to go over the file twice and does not preserve the block order.
import re, ast
import numpy as np

pattern1 = re.compile(r'(\n\([^)]+\))+')
pattern2 = re.compile(r'^\[[^]]+\]', flags=re.MULTILINE)

for m in pattern1.finditer(s):
    block = m.group().strip().split('\n')
    data = []
    for line in block:
        line = line[1:-1]
        n, e = map(float, line.split(','))
        data.append((n, e))
    print(np.array(data))
    print('****')

for m in pattern2.finditer(s):
    print(np.array(ast.literal_eval(m.group())))
    print('****')
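To get the dictionary keyed by polygon name that the question asks for, the line-by-line generator can be extended to also capture the headers. A sketch, assuming (as in the cleaned-up sample) that each block is preceded by a line that passes isalpha(), such as PolyOneandTwo:
import numpy as np

def parse_named(lines_read):
    # yields (name, block) pairs; a hypothetical extension of parse()
    name, data = None, []
    for line in lines_read:
        line = line.strip()
        if line.startswith('#') or not line:
            continue
        if line.isalpha():               # a header like 'PolyOneandTwo'
            if name and data:
                yield name, np.array(data)
            name, data = line, []
        else:
            e, n = line.split(',')
            data.append((float(e), float(n)))
    if name and data:                    # final block
        yield name, np.array(data)

counties = dict(parse_named(open('counties.dat')))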

removing datetime.datetime from a list in python

I'm attempting to get two different elements from an XML file and plot them as the x and y of a scatter plot. I can manage to get both of the elements, but when I plot them, the dates don't pair up with the values. I'm using the code below to get a weather HTML page and save it as XML.
url = "http://api.met.no/weatherapi/locationforecast/1.9/?lat=52.41616;lon=-4.064598"
response = requests.get(url)
xml_text=response.text
weather= bs4.BeautifulSoup(xml_text, "xml")
f = open('file.xml', "w")
f.write(weather.prettify())
f.close()
I'm then trying to get the time ('from') element and the ('windSpeed' > 'mps') element and attribute. I'm then trying to plot it as an x and y on a scatter plot.
import datetime
import bs4
import matplotlib.pyplot as plt

with open('file.xml') as file:
    soup = bs4.BeautifulSoup(file, "xml")

times = soup.find_all("time")
windspeed = soup.select("windSpeed")
form = "%Y-%m-%dT%H:%M:%SZ"
x = []
y = []
for element in times:
    time = element.get("from")
    t = datetime.datetime.strptime(time, form)
    x.append(t)
for mps in windspeed:
    speed = mps.get("mps")
    y.append(speed)
plt.scatter(x, y)
plt.show()
I'm trying to make two lists from two loops and then read them as the x and y, but when I run it, it gives the error:
raise ValueError("x and y must be the same size")
ValueError: x and y must be the same size
I'm assuming it's because it prints the list as datetime.datetime(2016, 12, 22, 21, 0). How do I remove the datetime.datetime from the list?
I know there's probably a simple way of fixing it, any ideas would be great, you people here on stack are helping me a lot with learning to code. Thanks
Simply make two lists, one containing the x-axis values and the other containing the y-axis values, and pass them to the scatter function:
plt.scatter(list1, list2)
I suggest that you use lxml for analysing XML because it gives you the ability to use XPath expressions, which can make life much easier. In this case, not every time entry contains a windSpeed entry; therefore, it's essential to identify the windSpeed entries first and then get the associated times. This code does that. There are two little problems I usually encounter: (1) I still need to 'play' with XPath to get it right; (2) sometimes I get a list when I expect a singleton, which is why there's a '[0]' in the code. I find it's better to build the code interactively.
>>> from lxml import etree
>>> XML = open('file.xml')
>>> tree = etree.parse(XML)
>>> for count, windSpeeds in enumerate(tree.xpath('//windSpeed')):
...     print((windSpeeds.attrib['mps'], windSpeeds.xpath('../..')[0].attrib['from']))
...     if count > 5:
...         break
...
('3.9', '2016-12-29T18:00:00Z')
('4.8', '2016-12-29T21:00:00Z')
('5.0', '2016-12-30T00:00:00Z')
('4.5', '2016-12-30T03:00:00Z')
('4.1', '2016-12-30T06:00:00Z')
('3.8', '2016-12-30T09:00:00Z')
('4.4', '2016-12-30T12:00:00Z')
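Building on that, a sketch of how the matched pairs could feed the original scatter plot (same XPath as above; the '../..' step assumes the met.no layout where windSpeed sits two levels below the time element):
import datetime
from lxml import etree
import matplotlib.pyplot as plt

tree = etree.parse(open('file.xml'))
form = "%Y-%m-%dT%H:%M:%SZ"
x, y = [], []
for ws in tree.xpath('//windSpeed'):
    # only windSpeed entries contribute, so x and y stay the same length
    t = ws.xpath('../..')[0].attrib['from']
    x.append(datetime.datetime.strptime(t, form))
    y.append(float(ws.attrib['mps']))
plt.scatter(x, y)
plt.show()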

Test Highcharts with selenium webdriver

I would like to test the accuracy of a Highcharts graph presenting data from a JSON file (which I already read) using Python and Selenium Webdriver.
How can I read the Highchart data from the website?
thank you,
Evgeny
The highchart data is converted to an SVG path, so you'd have to interpret the path yourself. I'm not sure why you would want to do this, actually: in general you can trust 3rd party libraries to work as advertised; the testing of that code should reside in that library.
If you still want to do it, then you'd have to dive into Javascript to retrieve the data. Taking the Highcharts Demo as an example, you can extract the data points for the first line as shown below. This will give you the SVG path definition as a string, which you can then parse to determine the origin and the data points. Comparing this to the size of the vertical axis should allow you to calculate the value implied by the graph.
import re

# Get the origin and datapoints of the first line; .attr('d') returns the
# SVG path definition string, e.g. "M 10 20 L 30 40 L ..."
s = selenium.get_eval(
    "window.jQuery('svg g.highcharts-tracker path:eq(0)').attr('d')")
splitted = re.split(r'\s+L\s+', s)
origin = splitted[0].split(' ')[1:]  # drop the leading 'M'
data = [p.split(' ') for p in splitted[1:]]
# Convert to floats
origin = [float(origin[0]), float(origin[1])]
data = [[float(x), float(y)] for x, y in data]
# Get the min and max y-axis value and position
min_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').text()"))
max_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').text()"))
min_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').attr('y')"))
max_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').attr('y')"))
# Calculate the value based on the retrieved positions
y_scale = min_y_pos - max_y_pos
y_range = max_y_val - min_y_val
y_percentage = data[0][1] * 100.0 / y_scale
value = max_y_val - (y_range * y_percentage / 100.0)
Disclaimer: I didn't have time to fully verify it, but something along these lines should give you what you want.
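Side note: selenium.get_eval comes from the old Selenium RC API. With the current WebDriver Python bindings, the rough equivalent is execute_script; a sketch, assuming the page exposes jQuery and the same selector applies:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.highcharts.com/demo")
# execute_script hands the value of a JS 'return' expression back to Python
d = driver.execute_script(
    "return window.jQuery('svg g.highcharts-tracker path:eq(0)').attr('d');")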
