String manipulation for JSON web scraping - Python

I am trying to scrape a website and already have all the data I need in very long matrices, obtained through the requests and json modules.
I am having trouble getting any output.
Is it because of the concatenation of two strings in requests.get()?
Here is the part with the problem; everything used here was declared at the start of the code.
balance = []
for q in range(len(DepositMatrix)):
    address = requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
    data4 = address.json()
    TokenBalances = data4['balances']  # returns a dictionary
    balance.append(TokenBalances)
print(balance)
Example of DepositMatrix - a list of lists, each with 4 elements: [[string, float, int, int]]
[['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]
I think the error is in this part:
requests.get('https://ethplorer.io/service/service.php?data=' + str(DepositMatrix[q][0]))
This change doesn't help either:
requests.get('https://ethplorer.io/service/service.php?data=' + DepositMatrix[q][0])

Like I said in my comment, I tried your code and it worked for me. But I wanted to highlight a few things that could make your code clearer:
import requests
import pprint

DepositMatrix = [['0x2b5634c42055806a59e9107ed44d43c426e58258', 488040277.1535826, 660, 7103],
                 ['0x05ee546c1a62f90d7acbffd6d846c9c54c7cf94c', 376515313.83254075, 2069, 12705]]

balance = []
for deposit in DepositMatrix:
    address = requests.get('https://ethplorer.io/service/service.php?data=' + deposit[0])
    data4 = address.json()
    TokenBalances = data4['balances']  # returns a dictionary
    balance.append(TokenBalances)

pprint.pprint(balance)
For your loop, instead of creating a range over the length of your list and then using the index q to look each element up again, it's simpler to iterate over the elements directly (for deposit in DepositMatrix:).
I've used the pprint module to make the output easier to read.
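One more small suggestion: rather than concatenating the value into the URL yourself, you can let requests build the query string through its params argument (standard requests behaviour, shown here with the first address from your example):

import requests

deposit = '0x2b5634c42055806a59e9107ed44d43c426e58258'
response = requests.get('https://ethplorer.io/service/service.php',
                        params={'data': deposit})  # requests appends ?data=... for you
token_balances = response.json()['balances']
print(token_balances)

Either form should produce the same request; this just saves you the manual string handling.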


For Loop 60 items 10 per 10

I'm working with an API that gives me 61 items, which I include in a Discord embed in a for loop.
As all of this is meant to go into a Discord bot using pagination from DiscordUtils, I need to make it build an embed for every 10 entries to avoid hitting the 2000-character message limit.
Currently, the data my loop works with is here: https://api.nepmia.fr/spc/ (I recommend a JSON-formatting extension for your browser or it will be a bit hard to read).
But what I want to create is something that looks like this: https://api.nepmia.fr/spc/formated/
That way I can iterate over each range in a different embed and then use pagination.
I use TinyDB to generate the JSON files shown above, with this script:
import urllib.request, json
from shutil import copyfile
from termcolor import colored
from tinydb import TinyDB, Query

db = TinyDB("/home/nepmia/Myazu/db/db.json")

def api_get():
    print(colored("[Myazu]", "cyan"), colored("Fetching WynncraftAPI...", "white"))
    try:
        with urllib.request.urlopen("https://api.wynncraft.com/public_api.php?action=guildStats&command=Spectral%20Cabbage") as u1:
            api_1 = json.loads(u1.read().decode())
            count = 0

            if members := api_1.get("members"):
                print(colored("[Myazu]", "cyan"),
                      colored("Got expected answer, starting saving process.", "white"))
                for member in members:
                    nick = member.get("name")
                    ur2 = f"https://api.wynncraft.com/v2/player/{nick}/stats"
                    u2 = urllib.request.urlopen(ur2)
                    api_2 = json.loads(u2.read().decode())

                    data = api_2.get("data")
                    for item in data:
                        meta = item.get("meta")
                        playtime = meta.get("playtime")

                        print(colored("[Myazu]", "cyan"),
                              colored("Saving playtime for player", "white"),
                              colored(f"{nick}...", "green"))

                        db.insert({"username": nick, "playtime": playtime})
                        count += 1
            else:
                print(colored("[Myazu]", "cyan"),
                      colored("Unexpected answer from WynncraftAPI [ERROR 1]", "white"))
    except:
        print(colored("[Myazu]", "cyan"),
              colored("Unhandled error in saving process [ERROR 2]", "white"))
    finally:
        print(colored("[Myazu]", "cyan"),
              colored(f"Finished saving data for", "white"),
              colored(f"{count}", "green"),
              colored("players.", "white"))
But this will only create output like this: https://api.nepmia.fr/spc/
What I would like is something like this: https://api.nepmia.fr/spc/formated/
Thanks for your help!
PS: Sorry for your eyes I'm still new to Python so I know I don't do stuff really properly :s
To follow up from the comments, you shouldn't store items in your database in a format that is specific to how you want to return results from the database to a different API, as it will make it more difficult to query in other contexts, among other reasons.
If you want to paginate items from a database it's better to do that when you query it.
According to the docs, you can iterate over all documents in a TinyDB database just by iterating directly over the DB like:
for doc in db:
    ...
For any iterable you can use the enumerate function to associate an index to each item like:
for idx, doc in enumerate(db):
    ...
If you want the indices to start with 1 as in your examples you would just use idx + 1.
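As a side note, enumerate also accepts a start parameter, so the same thing can be written without the + 1 (a minor variation using only the standard library):

for idx, doc in enumerate(db, start=1):
    ...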
Finally, to paginate the results, you need some function that can return items from an iterable in fixed-sized batches, such as one of the many solutions on this question or elsewhere. E.g. given a function chunked(iter, size) you could do:
pages = enumerate(chunked(enumerate(db), 10))
Then list(pages) gives a list of tuples like [(page_num, [(player_num, player), ...]), ...].
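For reference, a minimal chunked could be written like this (just one possible sketch; it assumes the iterable can be pulled into a list, which is fine for a small database):

def chunked(iterable, size):
    # materialise the iterable, then slice it into fixed-size batches
    items = list(iterable)
    return [items[i:i + size] for i in range(0, len(items), size)]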
The only difference between a list of lists and what you want is you seem to want a dictionary structure like
{'range1': {'1': {...}, '2': {...}, ...}, 'range2': {'11': {...}, ...}}
This is no different from a list of lists; the only difference is you're using dictionary keys to give numerical indices to each item in a collection, rather than the indices being implicit in the list structure. There are many ways you can go from a list of lists to this. The easiest, I think, is using a (nested) dict comprehension:
{f'range{page_num + 1}': {str(player_num + 1): player for player_num, player in page}
for page_num, page in pages}
This will give output in exactly the format you want.
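If it helps to see the shape of the result, here is a tiny self-contained run using the chunked sketch above and a plain list of dicts as a stand-in for the TinyDB documents (the player data is made up purely for illustration):

# toy stand-in for db, just to show the structure of the output
docs = [{'username': f'player{i}', 'playtime': i * 10} for i in range(1, 25)]

pages = enumerate(chunked(enumerate(docs), 10))
result = {f'range{page_num + 1}': {str(player_num + 1): player
                                   for player_num, player in page}
          for page_num, page in pages}

# result['range1'] has keys '1'..'10', result['range3'] has keys '21'..'24'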
Thanks @Iguananaut for your precious help.
In the end I made something similar to your solution, using a generator.
from math import ceil

import discord

def chunker(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def embed_creator(embeds):
    pages = []
    current_page = None
    for i, chunk in enumerate(chunker(embeds, 10)):
        current_page = discord.Embed(
            title=f'**SPC** Last week online time',
            color=3903947)
        for elt in chunk:
            current_page.add_field(
                name=elt.get("username"),
                value=elt.get("play_output"),
                inline=False)
        current_page.set_footer(
            icon_url="https://cdn.discordapp.com/icons/513160124219523086/a_3dc65aae06b2cf7bddcb3c33d7a5ecef.gif?size=128",
            text=f"{i + 1} / {ceil(len(embeds) / 10)}"
        )
        pages.append(current_page)
        current_page = None
    return pages
Using embed_creator I generate a list named pages that I can then use directly with the DiscordUtils paginator.

How to turn items from extracted data into numbers for plotting in Python?

So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but they are not numbers that I can use for anything. I want to use the numbers to plot them in a graph, but the elements in the array are text strings. How would I turn them into numbers and remove unnecessary characters like commas and n=, for instance?
Here is the code, and below is my print statement.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write, based on the mentioned post) and within a try/except block, so failed conversions are simply left out of your list.
To make it more consistent, any conversion error should discard the whole line affected, so you might want to use intermediate variables and a check for each field.
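Purely for illustration (this is not the linked post's code, just one way such a helper could look), strtofloat might be as simple as stripping everything that can't appear in a float literal before converting:

def strtofloat(word):
    # keep only characters that can occur in a float literal; float() will
    # still raise ValueError if nothing sensible is left, which the
    # surrounding try/except can then catch
    cleaned = ''.join(c for c in word if c in '0123456789.+-eE')
    return float(cleaned)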
Btw, I noticed the argument to the extract function; it would seem logical to make that argument a string containing the name of the file from which to extract the data.
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
I would consider using a regular expression:
import re

match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')

for line in infile:
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
But it seems like your data is already in CSV format, so try using pandas.
Read the data into a DataFrame without a header (the column names will be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
Starting from the array given in the question, if you would like to remove the 'n=' and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
                  'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)

How can I increase the amount of array iterated during the run-time of script?

My script removes unwanted strings like "##$!" and other junk from arrays.
The script works as intended, but it is extremely slow when the Excel file has many rows.
I tried using numpy to see if it could speed things up, but I'm not too familiar with it, so I might be using it incorrectly.
import numpy as np
import pandas as pd
from tqdm import tqdm

xls = pd.ExcelFile(path)
df = xls.parse("Sheet2")
TeleNum = np.array(df['telephone'].values)

def replace(orignstr):  # removes the unwanted strings from numbers
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    return orignstr

for UncleanNum in tqdm(TeleNum):
    newnum = replace(str(UncleanNum))  # calling replace function
    df['telephone'] = df['telephone'].replace(UncleanNum, newnum)  # store string back in data frame
I also tried removing the method, to see if that would help, and just placing it as one block of code, but the speed remained the same.
for UncleanNum in tqdm(TeleNum):
    orignstr = str(UncleanNum)
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    print(orignstr)
    df['telephone'] = df['telephone'].replace(UncleanNum, orignstr)
TeleNum = np.array(df['telephone'].values)
The current speed of the script on an Excel file of 200,000 rows is around 70 it/s, and it takes around an hour to finish, which is not great since this is just one function of many.
I'm not too advanced in Python; I'm just learning as I script, so any pointers would be appreciated.
Edit:
Most of the array elements I'm dealing with are numbers, but some have characters mixed in. I'm trying to remove all non-numeric characters from the array elements.
Ex.
FD3459002912
*345*9002912$
If you are trying to clear everything that isn't a digit from the strings you can directly use re.sub like this:
import re

string = "FD3459002912"
regex_result = re.sub(r"\D", "", string)
print(regex_result)  # 3459002912
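As a follow-up to the original loop: the same idea can be applied to the whole column at once with pandas' vectorised string methods, which is usually far faster than replacing values row by row. A sketch, assuming the telephone column should end up containing digits only:

import pandas as pd

df = pd.read_excel(path, sheet_name="Sheet2")
# strip every non-digit character from the whole column in one vectorised call
df['telephone'] = df['telephone'].astype(str).str.replace(r'\D', '', regex=True)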

Converting molecule name to SMILES?

I was just wondering, is there any way to convert IUPAC or common molecular names to SMILES? I want to do this without having to manually convert every single one utilizing online systems. Any input would be much appreciated!
For background, I am currently working with Python and RDKit, so I wasn't sure if RDKit could do this and I was just unaware. My current data is in the csv format.
Thank you!
RDKit can't convert names to SMILES.
Chemical Identifier Resolver can convert names and other identifiers (like CAS numbers) and has an API, so you can convert with a script.
from urllib.request import urlopen
from urllib.parse import quote

def CIRconvert(ids):
    try:
        url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
        ans = urlopen(url).read().decode('utf8')
        return ans
    except:
        return 'Did not work'

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']

for ids in identifiers:
    print(ids, CIRconvert(ids))
Output
3-Methylheptane CCCCC(C)CC
Aspirin CC(=O)Oc1ccccc1C(O)=O
Diethylsulfate CCO[S](=O)(=O)OCC
Diethyl sulfate CCO[S](=O)(=O)OCC
50-78-2 CC(=O)Oc1ccccc1C(O)=O
Adamant Did not work
OPSIN (https://opsin.ch.cam.ac.uk/) is another solution for name2structure conversion.
It can be used by installing the CLI, or via https://github.com/gorgitko/molminer
(OPSIN is used by the RDKit KNIME nodes also)
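If you just want a quick script without installing anything locally, OPSIN also exposes a simple web service that returns SMILES for a systematic name. The URL pattern below is my assumption from the OPSIN site, so double-check it against their documentation:

from urllib.request import urlopen
from urllib.parse import quote

def opsin_convert(name):
    # assumed URL pattern for the OPSIN web service - verify against the OPSIN docs
    url = 'https://opsin.ch.cam.ac.uk/opsin/' + quote(name) + '.smi'
    try:
        return urlopen(url).read().decode('utf8').strip()
    except Exception:
        return 'Did not work'

print(opsin_convert('2-acetoxybenzoic acid'))  # aspirin's systematic name

Note that OPSIN only parses systematic nomenclature, so trade names like "Aspirin" will not resolve.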
PubChemPy has some great features that can be used for this purpose. It supports IUPAC systematic names, trade names and all known synonyms for a given Compound, as documented in the PubChem database:
https://pubchempy.readthedocs.io/en/latest/
>>> import pubchempy as pcp
>>> results = pcp.get_compounds('Glucose', 'name')
>>> print(results)
[Compound(79025), Compound(5793), Compound(64689), Compound(206)]
The first argument is the identifier, and the second argument is the identifier type, which must be one of name, smiles, sdf, inchi, inchikey or formula. It looks like there are 4 compounds in the PubChem Database that have the name Glucose associated with them. Let’s take a look at them in more detail:
>>> for compound in results:
...     print(compound.isomeric_smiles)
C([C@@H]1[C@H]([C@@H]([C@H]([C@H](O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
C(C1C(C(C(C(O1)O)O)O)O)O
It looks like they all have different stereochemistry information !
The accepted answer uses the Chemical Identifier Resolver but for some reason the website seems to be buggy for me and the API seems to be messed up.
So another way to convert SMILES to an IUPAC name is with the PubChem Python API, which works if your SMILES is in their database
e.g.
#!/usr/bin/env python
import sys
import pubchempy as pcp
smiles = str(sys.argv[1])
print(smiles)
s = pcp.get_compounds(smiles, 'smiles')
print(s[0].iupac_name)
You can use the batch query service of PubChem:
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange-help.html
You can use the pubchem API (PUG REST) for this
(https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial)
Basically, the URL you are calling takes the compound as a "name"; you give the name, then specify that you want the "property" "CanonicalSMILES", as text.
import pandas as pd
import requests

identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']

smiles_df = pd.DataFrame(columns=['Name', 'Smiles'])

for x in identifiers:
    try:
        url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/' + x + '/property/CanonicalSMILES/TXT'
        # remove new line character with rstrip
        smiles = requests.get(url).text.rstrip()
        if 'NotFound' in smiles:
            print(x, " not found")
        else:
            smiles_df = smiles_df.append({'Name': x, 'Smiles': smiles}, ignore_index=True)
    except:
        print("boo ", x)

print(smiles_df)

How do I read an HTML file in Python from multiple URLs?

I'm writing a script that will pull data from a basic HTML page based on the following:
The first parameter in the URL floats between -90.0 and 90.0 (inclusive) and the second set of numbers are between -180.0 and 180.0 (inclusive). The URL will direct you to one page with a single number as the body of the page (for example, http://jawbone-virality.herokuapp.com/scanner/desert/-89.7/131.56/). I need to find the largest virality number between all of the pages attached to the URL.
So, right now I have it printing the first and second number, as well as the number in the body (we call it virality). It only prints to the console; every time I try writing it to a file it breaks and I get errors. Any hints on anything I'm missing? I'm very new to Python, so I'm not sure if I'm overlooking something.
import shutil
import os
import time
import datetime
import math
import urllib
from array import array

myFile = open('test.html', 'w')
m = 5

for x in range(-900, 900, 1):
    for y in range(-1800, 1800, 1):
        filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/'+str(x/10)+'/'+str(y/10)+'/')
        print 'Planet Desert: (' + str(x/10) + ',' + str(y/10) + '), Virality: ' + filehandle.readlines()[0]  # lines
        #myFile.write('Planet Desert: (' + str(x/10) + ',' + str(y/10) + '), Virality: ' + filehandle.readlines()[0])

myFile.close()
filehandle.close()
Thank you!
When writing to the file, do you still have the print statement before? Then your problem would be that Python advances the file pointer to the end of the file when you call readlines(). The second call to readlines() will thus return an empty list and your access to the first element results in an IndexError.
See this example execution:
filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
print(filehandle.readlines()) # prints ['5']
print(filehandle.readlines()) # prints []
The solution is to save the result into a variable and then use it.
filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
res = filehandle.readlines()[0]
print(res) # prints 5
print(res) # prints 5
Yet, as already pointed out in the comments, calling readlines() here is not needed, because the body of the page is just a single integer. The concept of lines does not really exist there, or at least does not provide any extra information. So let's drop it in exchange for the simpler read() (we don't even need readline() here).
filehandle = urllib.urlopen('http://jawbone-virality.herokuapp.com/scanner/desert/0/0/')
res = filehandle.read()
print(res) # prints 5
There's still another problem in your source code. From your usage of urllib.urlopen() I can tell you are using Python 2. However, in Python 2 division of integers works as in C or Java: it results in an integer rounded towards the floor. Thus, you will call http://jawbone-virality.herokuapp.com/scanner/desert/-90/-180/ ten times.
This can be fixed by either:
from __future__ import division
str(x / 10.0) and str(y / 10.0)
switching to Python 3 and using urllib.request
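Putting those pieces together, a Python 3 sketch of your loop could look like the following; it writes every result to the file and keeps track of the largest virality value. It is only an outline of the fixes above and has not been run against the live service:

import urllib.request

largest = None
with open('test.html', 'w') as out:
    # 901 / 1801 so that the upper bounds 90.0 and 180.0 are included, as the task states
    for x in range(-900, 901):
        for y in range(-1800, 1801):
            url = ('http://jawbone-virality.herokuapp.com/scanner/desert/'
                   + str(x / 10) + '/' + str(y / 10) + '/')
            with urllib.request.urlopen(url) as handle:
                virality = int(handle.read().decode())
            out.write('Planet Desert: (%s,%s), Virality: %s\n' % (x / 10, y / 10, virality))
            if largest is None or virality > largest:
                largest = virality

print('Largest virality:', largest)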
Hopefully, I could help.
