I would like to test the accuracy of a Highcharts graph presenting data from a JSON file (which I have already read) using Python and Selenium WebDriver.
How can I read the Highchart data from the website?
thank you,
Evgeny
The Highcharts data is converted to an SVG path, so you'd have to interpret the path yourself. I'm not sure why you would want to do this, actually: in general you can trust third-party libraries to work as advertised; the testing of that code should reside in that library.
If you still want to do it, then you'd have to dive into JavaScript to retrieve the data. Taking the Highcharts Demo as an example, you can extract the data points for the first line as shown below. This will give you the SVG path definition as a string, which you can then parse to determine the origin and the data points. Comparing this to the size of the vertical axis should allow you to calculate the value implied by the graph.
import re

# Get the origin and datapoints of the first line
s = selenium.get_eval(
    "window.jQuery('svg g.highcharts-tracker path:eq(0)').attr('d')")
splitted = re.split(r'\s+L\s+', s)
origin = splitted[0].split(' ')[1:]
data = [p.split(' ') for p in splitted[1:]]
# Convert to floats
origin = [float(origin[0]), float(origin[1])]
data = [[float(x), float(y)] for x, y in data]
# Get the min and max y-axis value and position
min_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').text()"))
max_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').text()"))
min_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').attr('y')"))
max_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').attr('y')"))
# Calculate the value of the first data point from the retrieved positions
y_scale = min_y_pos - max_y_pos
y_range = max_y_val - min_y_val
y_percentage = data[0][1] * 100.0 / y_scale
value = max_y_val - (y_range * y_percentage / 100.0)
Disclaimer: I didn't have time to fully verify it, but something along these lines should give you what you want.
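If you are driving the browser through the WebDriver API (as mentioned in the question) rather than the old Selenium RC interface used above, the same path string could be retrieved with execute_script. This is only a sketch: the URL is a placeholder and the CSS selector is an assumption derived from the jQuery selector above, so it may need adjusting for your chart's DOM.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/page-with-highcharts")  # hypothetical page URL
# Return the 'd' attribute of the first series path (selector is an assumption)
path_d = driver.execute_script(
    "return document.querySelectorAll('svg g.highcharts-tracker path')[0]"
    ".getAttribute('d');")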
I am new to Python. I am trying to extract mixed fractions from a PDF file using Python, but I have no idea which tool I should use for the extraction. My sample PDF contains only one page with simple text. I would like to extract the part name and the length of the part using Python. A screenshot of the sample PDF page is shown in the image link Page 1 of Pdf- Screenshot. The PDF file can be downloaded from the following link (Sample Pdf)
EDIT 1: - UPDATED
Thank you for suggesting pdfplumber. It is a great tool, and I could extract the information with it. However, in some cases, when I extract the length, I get the whole number combined with the denominator. Say, if I have 36 1/2 as the length (as shown in the screenshot), then I get the value 362 inches.
import pdfplumber

with pdfplumber.open("Sample.pdf") as pdf:
    first_page = pdf.pages[0]
    text = first_page.extract_text()

for row in text.split('\n'):
    if 'inches' in row:
        num = row.split()[0]
        print(num)
Output: 362
This code works for me in most cases. Just in some cases, I get 362 as my output, instead of getting 36 as a separate value. How could I resolve this issue?
pdfplumber gives output like this:
shape: square
part name: square
1
36 𝑖𝑛𝑐ℎ𝑒𝑠
2
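To see why the whole number and the denominator sometimes end up glued together, it can help to look at word positions rather than the assembled text. This is only a sketch for inspecting the layout; extract_words() returns each word together with its coordinates, so the 36, the 1 and the 2 of the fraction should appear as separate entries:
import pdfplumber

with pdfplumber.open("Sample.pdf") as pdf:
    for word in pdf.pages[0].extract_words():
        # Print each word with its left edge and vertical position
        print(word["text"], round(word["x0"], 1), round(word["top"], 1))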
I would suggest using pdfplumber; it's a very powerful and well-documented tool for extracting text, tables, and images from PDFs.
Moreover, it has a very convenient function, called crop, that allows you to crop and extract just the portion of the page that you need.
Just as an example, the code would be something like this (note that this will work with any number of pages):
import pdfplumber

filename = 'path/to/your/PDF'
crop_coords = [x0, top, x1, bottom]
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for i, page in enumerate(pdf.pages):
        my_width = page.width
        my_height = page.height
        # Crop pages
        my_bbox = (crop_coords[0]*float(my_width), crop_coords[1]*float(my_height),
                   crop_coords[2]*float(my_width), crop_coords[3]*float(my_height))
        page_crop = page.crop(bbox=my_bbox)
        text = text + str(page_crop.extract_text()).lower()
        pages.append(page_crop)
Here is the explanation of coords:
x0 = % Distance from left vertical cut to left side of page.
top = % Distance from upper horizontal cut to upper side of page.
x1 = % Distance from right vertical cut to right side of page.
bottom = % Distance from lower horizontal cut to lower side of page.
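For instance, to keep the full page width but drop the top and bottom 10% of every page, you could pass something like this (the fractions are purely illustrative):
# [x0, top, x1, bottom] as fractions of the page size
crop_coords = [0.0, 0.1, 1.0, 0.9]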
I'm trying to resample a set of GRIB2 arrays at 0.25 degree resolution to a coarser 0.5 degree resolution using the xESMF package (xarray's coarsen method does not work here because there is an odd number of coordinates in the latitude).
I have converted the GRIB data to xarray format through the pygrib package, which then subsets out the specific grid I need:
import os

import pygrib
import xarray as xr

fhr = 96
gridDefs = {
    "0.25":
        {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/landsfc.pgrb2.0p25"},
    "0.5":
        {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/landsfc.pgrb2.0p50"},
}
fileDefs = {
    "0.25":
        {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2019/2019051900/c00/Days%3A1-10/tmp_pres_2019051900_c00.grib2",
         'localfile': "tmp_pres.grib2"},
    "0.5":
        {'url': "https://noaa-gefs-retrospective.s3.amazonaws.com/GEFSv12/reforecast/2019/2019051900/c00/Days%3A1-10/tmp_pres_abv700mb_2019051900_c00.grib2",
         'localfile': "tmp_pres_abv_700.grib2"},
}

def grib_to_xs(grib, vName):
    # Wrap the GRIB message values in an xarray Dataset with lat/lon dimensions
    arr = xr.DataArray(grib.values)
    arr = arr.rename({'dim_0': 'lat', 'dim_1': 'lon'})
    xs = arr.to_dataset(name=vName)
    return xs

gribs = {}
for key, item in gridDefs.items():
    if not os.path.exists(item['url'][item['url'].rfind('/')+1:]):
        os.system("wget " + item['url'])
    lsGrib = pygrib.open(item['url'][item['url'].rfind('/')+1:])
    landsea = lsGrib[1].values
    gLats = lsGrib[1]["distinctLatitudes"]
    gLons = lsGrib[1]["distinctLongitudes"]
    gribs["dataset" + key] = xr.Dataset({'lat': gLats, 'lon': gLons})
    lsGrib.close()

for key, item in fileDefs.items():
    if not os.path.exists(item['localfile']):
        os.system("wget " + item['url'])
        os.system("mv " + item['url'][item['url'].rfind('/')+1:] + " " + item['localfile'])

for key, item in fileDefs.items():
    hold = pygrib.open(item['localfile'])
    subsel = hold.select(forecastTime=fhr)
    # Grab the first item
    gribs[key] = grib_to_xs(subsel[1], "TT" + key)
    hold.close()
The above code downloads two constant files (landsfc) for the two grid domains (0.25 and 0.5), then downloads a GRIB file at each of the two resolutions as well. I'm trying to resample the 0.25 degree GRIB file (tmp_pres.grib2) to the 0.5 degree domain as such:
regridder = xe.Regridder(ds, gribs['dataset0.5'], 'bilinear')
print(regridder)
ds2 = regridder(ds)
My issue is that I generate two warning messages when trying to use the regridder:
/media/robert/HDD/Anaconda3/envs/wrf-work/lib/python3.8/site-packages/xarray/core/dataarray.py:682: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
return key in self.data
/media/robert/HDD/Anaconda3/envs/wrf-work/lib/python3.8/site-packages/xesmf/backend.py:53: UserWarning: Latitude is outside of [-90, 90]
warnings.warn('Latitude is outside of [-90, 90]')
The output xarray does have the correct coordinates; however, the values inside the grid are way off (outside the maxima/minima of the finer-resolution grid) and exhibit strange banding patterns that make no physical sense.
What I would like to know is: is this the correct process to upscale an array using xESMF, and if not, how should I address the problem?
Any assistance would be appreciated, thanks!
I would recommend first trying conservative instead of bilinear (it's recommended in their documentation), and maybe check whether you're using the parameters correctly, because it seems something is wrong. My first guess would be that something you're doing moves the latitude around for some reason. I'm leaving the docs links here and hope someone knows more.
Regridder docs:
https://xesmf.readthedocs.io/en/latest/user_api.html?highlight=regridder#xesmf.frontend.Regridder.__init__
Upscaling recommendation (search for upscaling, there's also a guide for increasing resolution):
https://xesmf.readthedocs.io/en/latest/notebooks/Compare_algorithms.html?highlight=upscaling
Thanks to the documentation links and recommendations provided by MASACR 99, I was able to do some more digging into the xESMF package and find a working example of resampling methods from the package author (https://github.com/geoschem/GEOSChem-python-tutorial/blob/main/Chapter03_regridding.ipynb). My issue was solved by two changes:
I changed the method from bilinear to conservative (this also required adding two fields to the input array: the boundaries for latitude and longitude).
Instead of directly passing the variable being resampled to the resampler, I instead had to define two fixed grids to create the resampler, then pass individual variables.
To solve the first change, I created a new function to give me the boundary variables:
import numpy as np

def get_bounds(arr, gridSize):
    # Build the cell-centre and cell-boundary coordinates required by the conservative method
    lonMin = np.nanmin(arr["lon"].values)
    latMin = np.nanmin(arr["lat"].values)
    lonMax = np.nanmax(arr["lon"].values)
    latMax = np.nanmax(arr["lat"].values)
    sizeLon = len(arr["lon"])
    sizeLat = len(arr["lat"])
    bounds = {}
    bounds["lon"] = arr["lon"].values
    bounds["lat"] = arr["lat"].values
    bounds["lon_b"] = np.linspace(lonMin-(gridSize/2), lonMax+(gridSize/2), sizeLon+1)
    bounds["lat_b"] = np.linspace(latMin-(gridSize/2), latMax+(gridSize/2), sizeLat+1).clip(-90, 90)
    return bounds
For the second change, I modified the regridder definition and application to use the statically defined grids, then passed the desired variable to resample:
regridder = xe.Regridder(get_bounds(gribs['dataset0.25'], 0.25), get_bounds(gribs['dataset0.5'], 0.5), 'conservative')
print(regridder)
ds2 = regridder(ds)
I have a piece of Python code scraping datapoint values from what seems to be a JavaScript graph on a webpage. The data looks like:
...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...
where the dots stand for plotting data I have skipped.
To scrape the useful information - the x/y datapoint coordinates - I used regex:
import re

# First, get the raw x data
xData = re.findall(r"'x':\d+", htmlContent)
# Now read each value one by one
xData = [int(re.findall(r"\d+", x)[0]) for x in xData]
The same goes for the y values. I don't know if this is terribly inefficient, but it does not look pretty or very smart, as I have many redundant calls to re.findall. Is there a way to do it in one pass? One pass for x and one pass for y?
You can do it a little more simply:
htmlContent = """
...html/javascript...
{'y':765000,...,'x':1248040800000,...},
{'y':1020000,...,'x':1279144800000,...},
{'y':1105000,...,'x':1312754400000,...}
...html/javascript...
"""
# Get the numbers
xData = [int(_) for _ in re.findall(r"'x':(\d+)", htmlContent)]
print(xData)
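If you also need the y values, a single pass can capture both coordinates at once, assuming every datapoint keeps the 'y':...,'x': ordering shown in the sample above:
import re

# Capture (y, x) from each datapoint in one findall call
pairs = re.findall(r"'y':(\d+).*?'x':(\d+)", htmlContent)
yData = [int(y) for y, x in pairs]
xData = [int(x) for y, x in pairs]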
I'm currently having a little issue with a FITS file. The data is in table format, a format I haven't previously used. I'm a Python user, and rely heavily on astropy.io.fits to manipulate FITS images. A quick output of the info gives:
No. Name Type Cards Dimensions Format
0 PRIMARY PrimaryHDU 60 ()
1 BinTableHDU 29 3072R x 2C [1024E, 1024E]
The header for the BinTableHDU is as follows:
XTENSION= 'BINTABLE' /Written by IDL: Mon Jun 22 23:28:21 2015
BITPIX = 8 /
NAXIS = 2 /Binary table
NAXIS1 = 8192 /Number of bytes per row
NAXIS2 = 3072 /Number of rows
PCOUNT = 0 /Random parameter count
GCOUNT = 1 /Group count
TFIELDS = 2 /Number of columns
TFORM1 = '1024E ' /Real*4 (floating point)
TFORM2 = '1024E ' /Real*4 (floating point)
TTYPE1 = 'COUNT_RATE' /
TUNIT1 = '1e-6cts/s/arcmin^2' /
TTYPE2 = 'UNCERTAINTY' /
TUNIT2 = '1e-6cts/s/arcmin^2' /
HISTORY g000m90r1b120pm.fits created on 10/08/97. PI channel range: 8: 19
PIXTYPE = 'HEALPIX ' / HEALPIX pixelisation
ORDERING= 'NESTED ' / Pixel ordering scheme, either RING or NESTED
NSIDE = 512 / Healpix resolution parameter
NPIX = 3145728 / Total number of pixels
OBJECT = 'FULLSKY ' / Sky coverage, either FULLSKY or PARTIAL
FIRSTPIX= 0 / First pixel # (0 based)
LASTPIX = 3145727 / Last pixel # (zero based)
INDXSCHM= 'IMPLICIT' / indexing : IMPLICIT or EXPLICIT
GRAIN = 0 / GRAIN = 0: No index,
COMMENT GRAIN =1: 1 pixel index for each pixel,
COMMENT GRAIN >1: 1 pixel index for Grain consecutive pixels
BAD_DATA= -1.63750E+30 / Sentinel value given to bad pixels
COORDSYS= 'G ' / Pixelization coordinate system
COMMENT G = Galactic, E = ecliptic, C = celestial = equatorial
END
I'd like to access the FITS image which is stored within the TTYPE labeled 'COUNT_RATE', and then have this in a format with which I can add it to other count-rate arrays with the same dimensions.
I started with my usual procedure for opening a FITS file:
hdulist_RASS_SXRB_R1 = fits.open('/Users/.../RASS_SXRB_R1.fits')
hdulist_RASS_SXRB_R1.info()
image_XRAY_SKYVIEW_R1 = hdulist_RASS_SXRB_R1[1].data
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
image_XRAY_SKYVIEW_header_R1 = hdulist_RASS_SXRB_R1[1].header
But this is coming back with IndexError: too many indices for array. I've had a look at accessing table data in the astropy documentation here (Accessing data stored as a table in a multi-extension FITS (MEF) file)
If anyone has a tried and tested method for accessing such images from a fits table I'd be very grateful! Many thanks.
I can't be sure without seeing the full traceback but I think the exception you're getting is from this:
image_XRAY_SKYVIEW_R1 = numpy.array(image_XRAY_SKYVIEW_R1)
There's no reason to manually wrap numpy.array() around the array. It's already a Numpy array. But in this case it's a structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html).
@Andromedae93's answer is the right one. But also for general documentation on this see: http://docs.astropy.org/en/stable/io/fits/index.html#working-with-table-data
However, the way you're working (which is fine for images) - manually calling fits.open, accessing the .data attribute of the HDU, etc. - is fairly low level, and Numpy structured arrays are good at representing tables, but not great for manipulating them.
You're better off generally using Astropy's higher-level Table interface. A FITS table can be read directly into an Astropy Table object with Table.read(): http://docs.astropy.org/en/stable/io/unified.html#fits
The only reason the same thing doesn't exist for FITS images is that there's no generic "Image" class yet.
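As a minimal sketch of that approach for the file in the question (assuming the column name matches TTYPE1 in the header shown above):
from astropy.table import Table

# Read the binary table extension (HDU 1) directly into an astropy Table
t = Table.read('RASS_SXRB_R1.fits', hdu=1)
count_rate = t['COUNT_RATE']   # a (3072, 1024) float32 column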
I used astropy.io.fits during my internship in astrophysics, and this is my process for opening a .fits file and performing some operations:
from astropy.io import fits

# Opening the .fits file, which is named SMASH.fits
field = fits.open('SMASH.fits')
# Reading the FITS table data from the first extension
tbdata = field[1].data
Now, with this kind of method, tbdata is a numpy array and you can do lots of things with it.
For example, if you have data like :
ID, Name, Object
1, HD 1527, Star
2, HD 7836, Star
3, NGC 6739, Galaxy
If you want to print the data from one column:
Data_name = tbdata['Name']
You will get :
HD 1527
HD 7836
NGC 6739
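Applied to the file in the question, the same pattern would pull out the count-rate and uncertainty columns (the column names are taken from TTYPE1 and TTYPE2 in the header shown above):
count_rate = tbdata['COUNT_RATE']
uncertainty = tbdata['UNCERTAINTY']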
I don't know exactly what you want to do with your data, but I can help you ;)
I'm creating netCDF files with pre-defined data types for the variables and attributes, and I'm using netCDF4 and Python for this.
My minimal example looks like this:
from netCDF4 import Dataset
import numpy as np
root_grp = Dataset("test_single_band.nc" ,'w',format = 'NETCDF4')
data_grp = root_grp.createGroup("data")
data_grp.createDimension("num_pixels", 3264)
data_grp.createDimension("num_lines", None)
measurement_data_grp = data_grp.createGroup("measurement_data")
measurement_data_grp.createVariable("band", "u2", \
("num_pixels","num_lines"), fill_value = np.uint16(8191))
measurement_data_grp["band"].long_name = "radiances"
measurement_data_grp["band"].units = "W m-2 sr-1 um-1"
measurement_data_grp["band"].scale_factor = np.float(0.085006105)
measurement_data_grp["band"].add_offset = np.float(7.61)
measurement_data_grp["band"].valid_min = np.uint16(0)
measurement_data_grp["band"].valid_max = np.uint16(8190)
data_max = 4830.
data_min = 30.
data = data_max*np.random.random((3264,3800)) + data_min
target = root_grp["data/measurement_data/band"]
target[:] = data.astype(target.dtype)
root_grp.close()
And my issue is as follows: The file created by this script is displayed weirdly by Panoply:
Erroneous plot by Panoply
i.e. many values are displayed as if they were NaN or larger than valid_max, which they are not by construction. It should instead look like this:
Correct plot by Panoply
Panoply displays the data correctly if I leave out the definition of valid_max, or if valid_max is set to a floating point data type. Using valid_range instead doesn't change anything.
Any pointers to what is going wrong?
Your code generates random data values between 30 and 4860, but since you have specified a scale_factor and add_offset, the values will be stored as packed data. So in this case the values written to the file will be ints between 263 [= (30 - 7.61) / 0.085006105] and 57082 [= (4860 - 7.61) / 0.085006105].
Where the problem lies is that the convention when using packed data in a netCDF file in conjunction with valid_min and valid_max specifications is that the min and max must be specified in terms of the packed values rather than the unpacked values. Since you specified a valid_max of 8190, then any value which packed as an int between 8190 and 57082 will be treated as invalid when unpacked by software following standard netCDF conventions.
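To illustrate the packing arithmetic, here is a small sketch using the scale_factor and add_offset from the question:
scale_factor = 0.085006105
add_offset = 7.61

def pack(value):
    # The integer actually written to the file for an unpacked physical value
    return int(round((value - add_offset) / scale_factor))

print(pack(30.), pack(4860.))  # roughly 263 and 57083; the latter is far above the valid_max of 8190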
See:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch08.html#packed-data
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch02s05.html#missing-data