I've been trying to pull a table from Salesforce into a Jupyter notebook to turn it into a DataFrame. When I create the query and try executing it, I get this message:
Malformed request https://lumn.my.salesforce.com/services/data/v42.0/query/?q=SELECT+WhoId%2C+ActivityDate%2C+OwnerId%2C+IsDeleted%2C+AccountId%2C+IsClosed%2C+CreatedDate%.... Response content: [{'message': "exceeded 100000 distinct who/what's", 'errorCode': 'OPERATION_TOO_LARGE'}]
I pulled in another table successfully using sf.query_all(), and it had more than 100,000 rows, so I'm stumped on this one. There isn't much documentation on what to do at this point, since the response doesn't give me a nextRecordsUrl to pull the data in increments. I also tried sf.bulk.Task.query_all(), but that doesn't work either, and pulling the data through Excel gave the same error message. The project I'm working on does not filter this table down in any way, so it has to pull everything from the columns I'm working with. If anyone has any idea how to get around this, please share!
Not familiar with this error, but you might give "PK chunking" a go: https://developer.salesforce.com/docs/atlas.en-us.234.0.api_asynch.meta/api_asynch/async_api_headers_enable_pk_chunking.htm
And there's some good reading on working with huge tables in general: https://developer.salesforce.com/docs/atlas.en-us.salesforce_large_data_volumes_bp.meta/salesforce_large_data_volumes_bp/ldv_deployments_introduction.htm
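PK chunking is a Bulk API request header, so it isn't something sf.query_all() (which uses the REST API) can switch on by itself. If the Bulk API route doesn't work out, a related technique worth trying is manual keyset pagination on Id, which is roughly what PK chunking does for you server-side. A rough sketch, assuming the authenticated simple-salesforce client from the question and the Task object (credentials below are placeholders, and very large chunks may still trip the same limit):

from simple_salesforce import Salesforce
import pandas as pd

# Placeholder credentials; reuse your existing, already-authenticated sf object instead.
sf = Salesforce(username='user@example.com', password='...', security_token='...')

columns = 'Id, WhoId, ActivityDate, OwnerId, IsDeleted, AccountId, IsClosed, CreatedDate'
rows = []
last_id = ''

while True:
    # Walk the table in Id order, 2,000 records at a time.
    where = "WHERE Id > '{}'".format(last_id) if last_id else ''
    soql = 'SELECT {} FROM Task {} ORDER BY Id LIMIT 2000'.format(columns, where)
    batch = sf.query(soql)['records']
    if not batch:
        break
    rows.extend(batch)
    last_id = batch[-1]['Id']

df = pd.DataFrame(rows).drop(columns='attributes')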
I am using Python, pandas, sqlalchemy and cx_Oracle to execute a select query in an Oracle DB.
But I am not able to retrieve data from a column that has type LONG. When I remove it from the query, it runs fine, but when it is included I get "(cx_Oracle.DatabaseError) DPI-1037: column at array position (*) fetched with error 1406"
I ran into this post, How to solve error: ORA-01406 fetched column value was truncated?, and read this page on the error: http://www.dba-oracle.com/t_ora_01406_fetched_column_value_was_truncated.htm
The problem seems to be related to the size of the column buffer but I don't know how I would overcome it using pandas or another python library.
Is there any way I can deal with this error?
Thank you in advance
For anyone who stumbles upon this: I ended up executing the query via cx_Oracle directly and loading the result into pandas afterwards.
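A minimal sketch of that approach (connection string, table, and column names are placeholders):

import cx_Oracle
import pandas as pd

# Placeholder DSN; use your own connection details.
conn = cx_Oracle.connect('user/password@host:1521/service_name')
cursor = conn.cursor()

# Fetch the LONG column with a plain cx_Oracle cursor, then hand the rows to pandas.
cursor.execute('SELECT id, long_col FROM my_table')
columns = [col[0] for col in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=columns)

cursor.close()
conn.close()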
I'm using the Python client library to insert data into a BigQuery table. The code is as follows:
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts. Inserting data into BigQuery with streaming inserts causes some latency in the table preview on the BigQuery console; the data does not appear there immediately. So you need to query the table to confirm that the data was inserted.
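For example, a quick check could look like this (dataset_id and table_id are placeholders for your own dataset and table):

from google.cloud import bigquery

client = bigquery.Client(project_id)

# Rows written via streaming inserts may not show in the console preview yet,
# but a query against the table will see them.
sql = 'SELECT COUNT(*) AS n FROM `{}.{}.{}`'.format(project_id, dataset_id, table_id)
for row in client.query(sql).result():
    print(row.n)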
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
I got the answer to my question. The problem was that I was inserting one extra column for which there was no data. I found a hack to figure out why data is not getting inserted into a BigQuery table:
Convert the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for each column.
Run bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json in your terminal and see if it throws any errors. If it does, something is likely wrong with your data or your table schema.
Fix the data or the table schema based on the error and retry the insert via Python.
It would be helpful if the Python API threw an error/exception the way the terminal command does.
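If you'd rather stay in Python, a load job behaves like the bq load command above and does surface errors. A minimal sketch, assuming a reasonably recent google-cloud-bigquery and reusing the tablename and data_to_insert names from the question:

from google.cloud import bigquery

client = bigquery.Client(project_id)

# A load job validates rows against the table schema and, unlike a streaming
# insert, fails loudly instead of silently dropping rows.
job = client.load_table_from_json(data_to_insert, tablename)
try:
    job.result()  # waits for the job; raises if the load failed
except Exception:
    print(job.errors)  # per-row / per-field error details
    raise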
I know the basics of programming, but I am completely new to RDF and SPARQL, so I hope what follows is clear.
I am trying to download some data available at http://data.camera.it/data/en/datasets/, where all the data are organized in RDF/XML format, in an ontology.
I noticed this website has an online SPARQL query editor (http://dati.camera.it/sparql), and using some of their examples I was able to retrieve and convert some of the data I need with Python. I used the following code and query, with SPARQLWrapper:
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dati.camera.it/sparql")
sparql.setQuery(
'''
SELECT distinct ?deputatoId ?cognome ?nome ?data ?argomento ?titoloSeduta ?testo
WHERE {
?dibattito a ocd:dibattito; ocd:rif_leg <http://dati.camera.it/ocd/legislatura.rdf/repubblica_17>.
?dibattito ocd:rif_discussione ?discussione.
?discussione ocd:rif_seduta ?seduta.
?seduta dc:date ?data; dc:title ?titoloSeduta.
?seduta ocd:rif_assemblea ?assemblea.
?discussione rdfs:label ?argomento.
?discussione ocd:rif_intervento ?intervento.
?intervento ocd:rif_deputato ?deputatoId; dc:relation ?testo.
?deputatoId foaf:firstName ?nome; foaf:surname ?cognome .
}
ORDER BY ?data ?cognome ?nome
LIMIT 100
'''
)
sparql.setReturnFormat(JSON)
results_raw = sparql.query().convert()
However, I have a problem: the website only allows downloading 10,000 results per query. As far as I understand, this limit cannot be changed.
Therefore I decided to download the datasets to my computer. I tried to work with all these RDF files, but I don't know how to do it since, as far as I know, SPARQLWrapper does not work with local files.
So my questions are:
How do I create a dataset containing all the RDF files so that I can work on them as if they were a single object?
How do I query on such an object to retrieve the information I need? Is that possible?
Is this way of reasoning the right approach?
Any suggestion on how to tackle the problem is appreciated.
Thank you!
Download all the RDF/XML files from their download area, and load them into a local instance of Virtuoso (which happens to be the engine they are using for their public SPARQL endpoint). You will have the advantage of running a much more recent version (v7.2.5.1 or later, whether Open Source or Enterprise Edition) than the one they've got (Open Source v7.1.0, from March 2014!).
Use your new local SPARQL endpoint, found at http://localhost:8890/sparql by default. You can configure it to have no limits on result set sizes, or query runtimes, or otherwise.
Seems likely.
(P.S. You might encourage the folks at dati.camera.it (assistenza-dati#camera.it) to upgrade their Virtuoso instance. There are substantial performance and feature enhancements awaiting!)
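Once the local endpoint is up, the only change needed in the Python script is the endpoint URL. A quick smoke test might look like this (default Virtuoso port assumed; also note that the ocd:, dc:, rdfs:, and foaf: prefixes the public endpoint predefines may need explicit PREFIX declarations locally):

from SPARQLWrapper import SPARQLWrapper, JSON

# Point SPARQLWrapper at the local Virtuoso endpoint instead of dati.camera.it.
sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery('SELECT * WHERE { ?s ?p ?o } LIMIT 10')  # swap in the full query once the data is loaded
sparql.setReturnFormat(JSON)
results_raw = sparql.query().convert()
print(len(results_raw['results']['bindings']))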
So our existing setup used to create a new table for each day, which worked fine with the "WRITE_TRUNCATE" option. However, when we updated our code to use a partitioned table through our Dataflow job, it wouldn't work with WRITE_TRUNCATE.
It works perfectly fine with the write disposition set to "WRITE_APPEND" (from what I understood of Beam, WRITE_TRUNCATE may try to delete the table and then recreate it); since I'm supplying the table decorator, it fails to create a new table.
Sample snippet using Python code:
beam.io.Write('Write({})'.format(date), beam.io.BigQuerySink(output_table_name + '$' + date, create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER, write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
which gives the error:
Table IDs must be alphanumeric
since it tried to recreate the table, and we supply the partition decorator in the argument.
Here are some of the things that I've tried:
Updating the write_disposition to WRITE_APPEND: although it works, it defeats the purpose, since running the job for the same date again would duplicate data.
Using
bq --apilog /tmp/log.txt load --replace --source_format=NEWLINE_DELIMITED_JSON 'table.$20160101' sample_json.json
command, to see if I can observe any logs on how truncate actually works, based on the link that I found.
Tried some other links, but those use WRITE_APPEND as well.
Is there a way to write to a partitioned table from a Dataflow job using the WRITE_TRUNCATE method?
Let me know if any additional details are required.
Thanks
Seems like this is not supported at this time. Credit goes to #Pablo for finding out from the IO dev.
According to the Beam documentation on the GitHub page, their JIRA page would be the appropriate place to request such a feature. I'd recommend filing a feature request there and posting a link in a comment here so that others in the community can follow it and show their support.
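One possible stopgap in the meantime (not something confirmed by the IO dev, just a sketch, and the dataset/table/date values below are placeholders): make the WRITE_APPEND run idempotent by deleting the target partition before launching the job, e.g. by shelling out to the bq CLI.

import subprocess

def clear_partition(dataset_table, date_yyyymmdd):
    # Delete a single day's partition (e.g. mydataset.mytable$20160101) so a
    # subsequent WRITE_APPEND run for that date does not duplicate data.
    partition = '{}${}'.format(dataset_table, date_yyyymmdd)
    subprocess.check_call(['bq', 'rm', '-f', '-t', partition])  # -f skips the confirmation prompt

clear_partition('mydataset.mytable', '20160101')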
Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table DB design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed URLs and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success getting a "tinyurl" variable into the SQLite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially the 1st and 2nd normal forms. Once you're done with it, I hope there won't be a need for default values, and your DB schema will evolve into something more appropriate; a rough sketch of what that might look like follows below.
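For instance, something along these lines (the column names are just guesses based on what you described, so adjust as needed): tags go into their own table keyed by entry, and an entry with no URL simply stores NULL, so no default values are required.

import sqlite3

conn = sqlite3.connect('feeds.db')
cur = conn.cursor()

# One row per RSS entry; url is simply NULL when the entry has none.
cur.execute('''CREATE TABLE IF NOT EXISTS entries (
                   id      INTEGER PRIMARY KEY,
                   feed_id INTEGER,
                   summary TEXT,
                   date    TEXT,
                   url     TEXT)''')

# Tags live in their own table, linked back to entries, so an entry can have
# zero, one, or many tags without any placeholder values.
cur.execute('''CREATE TABLE IF NOT EXISTS tags (
                   entry_id INTEGER REFERENCES entries(id),
                   tag      TEXT)''')

entry_id = cur.execute(
    'INSERT INTO entries (feed_id, summary, date, url) VALUES (?, ?, ?, ?)',
    (1, 'some tweet text', '2009-06-01', 'http://tinyurl.com/example')).lastrowid
cur.executemany('INSERT INTO tags (entry_id, tag) VALUES (?, ?)',
                [(entry_id, t) for t in ('python', 'rss')])
conn.commit()
conn.close()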
There are plenty of options for sharing your source code on the web. Depending on what versioning system you're most comfortable with, you might have a look at well-known sites such as Google Code, Bitbucket, GitHub, and many others.