I am using Elasticsearch as a database, and it holds millions of records. I am using the code below to retrieve the data, but it is not giving me the complete data.
response = requests.get("http://localhost:9200/cityindex/_search?q=*:*&size=10000")
This gives me only 10000 records.
When I extend the size to the doc count of the index (which is 784234), it throws an error:
'Result window is too large, from + size must be less than or equal to: [10000] but was [100000]. See the scroll API for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'
Context for what I want to do:
I want to extract all the data of a particular index and then do analysis on it (I am looking to get the whole data in JSON format). I am using Python for my project.
Can someone please help me with this?
You need to scroll over the pages ES returns to you and store them in a list/array. You can use the elasticsearch Python library for this.
Example Python code:
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="localhost", port=9200, timeout=30)

# Initial search opens a scroll context kept alive for 5 minutes.
# (search_type='scan' from older examples was removed in Elasticsearch 5.x,
# so hits already arrive with this first response.)
page = es.search(
    index='index_name',
    scroll='5m',
    size=5000)

sid = page['_scroll_id']
records = [rec['_source'] for rec in page['hits']['hits']]
scroll_size = len(page['hits']['hits'])
print(scroll_size)

while scroll_size > 0:
    print("Scrolling...")
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = page['_scroll_id']
    # Number of hits returned by the last scroll request
    scroll_size = len(page['hits']['hits'])
    for rec in page['hits']['hits']:
        records.append(rec['_source'])
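Alternatively (not part of the original answer above), the elasticsearch-py package ships a scan helper that wraps this scroll loop for you; a minimal sketch against the same localhost cluster and placeholder index name:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(hosts="localhost", port=9200, timeout=30)

# scan() issues the initial search plus all follow-up scroll requests
# and yields one hit at a time
records = [
    hit['_source']
    for hit in scan(es, index='index_name',
                    query={"query": {"match_all": {}}}, size=5000)
]
print(len(records))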
Related
I want to retrieve items from a table in DynamoDB and then append this data after the last row of a table in BigQuery.
import boto3
from boto3.dynamodb.conditions import Attr
import pandas as pd

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('table')
response = table.scan(FilterExpression=Attr('created_at').gt(max_date_of_the_table_in_big_query))

# first part
data = response['Items']

# second part
while response.get('LastEvaluatedKey'):
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])

df = pd.DataFrame(data)
df = df[['query', 'created_at', 'result_count', 'id', 'isfuzy']]
# load df to big query
.....
The date filter works correctly, but in the while loop (the second part), the code retrieves all items.
After the first part, I have 100 rows. But after this code
while response.get('LastEvaluatedKey'):
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    data.extend(response['Items'])
I have 500,000 rows. I could use only the first part, but I know there is a 1 MB limit per scan, which is why I am using the second part. How can I get the data in the given date range?
Your 1st scan API call has a FilterExpression set, which applies your data filter:
response = table.scan(FilterExpression=Attr('created_at').gt(max_date_of_the_table_in_big_query))
However, the 2nd scan API call doesn't have one set and thus is not filtering your data:
response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
Apply the FilterExpression to both calls:
while response.get('LastEvaluatedKey'):
    response = table.scan(
        ExclusiveStartKey=response['LastEvaluatedKey'],
        FilterExpression=Attr('created_at').gt(max_date_of_the_table_in_big_query)
    )
    data.extend(response['Items'])
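As a side note (not part of the original answer), the low-level boto3 client also provides a scan paginator that follows LastEvaluatedKey for you and re-sends the filter on every page. A rough sketch, assuming created_at is stored as a string; note that the low-level client expects the expression-string form of FilterExpression and returns items in DynamoDB's typed JSON format rather than plain Python values:

import boto3

client = boto3.client('dynamodb')
paginator = client.get_paginator('scan')

items = []
# Every page is filtered server-side before being returned
for page in paginator.paginate(
        TableName='table',
        FilterExpression='created_at > :d',
        ExpressionAttributeValues={':d': {'S': max_date_of_the_table_in_big_query}}):
    items.extend(page['Items'])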
I have a script that uses shareplum to get items from a very large and growing SharePoint (SP) list. Because of the size, I encountered the dreaded 5000 item limit set in SP. To get around that, I tried to page the data based on the 'ID' with a Where clause on the query.
# this is wrapped in a while.
# the idx is updated to the latest max if the results aren't empty.
df = pd.DataFrame(columns=cols)
idx = 0
query = {'Where': [('Gt', 'ID', str(idx))], 'OrderBy': ['ID']}
data = sp_list.GetListItems(view, query=query, row_limit=4750)
df = df.append(pd.DataFrame(data[0:]))
That seemed to work but, after I added the Where, it started returning rows not visible on the SP web list. For example, the minimum ID on the web is, say, 500 while shareplum returns rows starting at 1. It also seems to be pulling in rows that are filtered out on the web. For example, it includes column values not included on the web. If the Where is removed, it brings back the exact list viewed on the web.
What is it that I'm getting wrong here? I'm brand new to shareplum; I looked at the docs but they don't go into much detail and all the examples are rather trivial.
Why does a Where clause cause more data to be returned?
After further investigation, it seems shareplum ignores any filters the view applies to the list whenever a query is passed to GetListItems. This is easily verified by removing the query param.
As a workaround, I'm now paging 'All Items' with a row_limit and query as below. This at least lets me get all the data and do any further filtering/grouping in Python.
df = pd.DataFrame(columns=cols)
idx = 0
more = True
while more:
    query = {'Where': [('Gt', 'ID', str(idx))]}
    # Page 'All Items' based on 'ID' > idx
    data = sp_list.GetListItems('All Items', query=query, row_limit=4500)
    data_df = pd.DataFrame(data[0:])
    if not data_df.empty:
        df = df.append(data_df)
        ids = pd.to_numeric(data_df['ID'])
        idx = ids.max()
    else:
        more = False
Why shareplum behaves this way is still an open question.
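If you also need the web view's filter while paging, shareplum's query dict appears to accept compound conditions, so the filter can be folded into the paging query itself. A rough sketch, where the Status field and its value are hypothetical stand-ins for whatever the view actually filters on:

idx = 0
query = {
    # 'And' combines the paging condition with the (hypothetical) view filter
    'Where': ['And',
              ('Gt', 'ID', str(idx)),
              ('Eq', 'Status', 'Active')],
    'OrderBy': ['ID'],
}
data = sp_list.GetListItems('All Items', query=query, row_limit=4500)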
Is it possible to retrieve the total number of rows that a query has returned without downloading all the results? For example, here is what I'm currently doing:
client = bigquery.Client()
res = client.query("SELECT funding_round_type FROM `investments`")
results = res.result()
>>> results.num_results
0
>>> records = [_ for _ in results]
>>> results.num_results
168647
In other words, without downloading the results, I cannot get the numResults. Is there another way to get the total number of results / number of MB in the resultant query set without having to download all the data?
The result of any query is stored in a so-called anonymous table. You can retrieve a reference to this table using the jobs.get API, and then use the tables.get API to retrieve info about that table, the row count and size in particular. For example, in Python:
>>> table = client.get_table(res.destination)
>>> print (table.num_rows, table.num_bytes)
168647 1451831
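As an aside (not part of the original answer), newer versions of the google-cloud-bigquery client also expose the row count on the iterator returned by result(), which can save the extra tables.get call; a small sketch:

from google.cloud import bigquery

client = bigquery.Client()
res = client.query("SELECT funding_round_type FROM `investments`")
results = res.result()      # rows are not downloaded at this point

# total_rows is populated from the job statistics, no iteration needed
print(results.total_rows)

# byte size still comes from the anonymous destination table
table = client.get_table(res.destination)
print(table.num_rows, table.num_bytes)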
I am trying to make multiple calls to export a large data set from BigQuery into CSV via Python (e.g. rows 0-10000, rows 10001-20000, etc.), but I am not sure how to set a dynamic parameter correctly, i.e. how to keep updating a and b.
The reason I need to put the query into a loop is that the dataset is too big for a one-time extraction.
a = 0
b = 10000
while a <= max(counts):  # i.e. counts = 7165920
    query = """
    SELECT *
    FROM `bigquery-public-data.ethereum_blockchain.blocks`
    limit #a, #b
    """
    params = [
        bigquery.ScalarQueryParameter('a', 'INT', a),
        bigquery.ScalarQueryParameter('b', 'INT', b)]
    query_job = client.query(query)
    export_file = open("output.csv", "a")
    output = csv.writer(export_file, lineterminator='\n')
    for rows in query_job:
        output.writerow(rows)
    export_file.close()
    a = b + 1
    b = b + b
For a small data set, without using a loop, I am able to get the output without any params (I just limit to 10, but that is for a single pull).
But when I try the above method, I keep getting errors.
Suggestion of another approach
To export a table
As you want to export the whole content of the table as a CSV, I would advise you to use an ExtractJob. It is meant to send the content of a table to Google Cloud Storage, as a CSV or JSON. Here's a nice example from the docs:
destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
For a query
Pandas has a read_gbq function that loads the result of a query into a DataFrame. If the result of the query fits in memory, you could use this and then call to_csv() on the resulting DataFrame (as sketched below). Be sure to install the pandas-gbq package to do this.
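To illustrate (not from the original answer), a minimal sketch, assuming a hypothetical my-project project id and a LIMIT small enough to fit in memory:

import pandas as pd

query = "SELECT * FROM `bigquery-public-data.ethereum_blockchain.blocks` LIMIT 10000"

# read_gbq runs the query and returns the result as a DataFrame
# (requires the pandas-gbq package)
df = pd.read_gbq(query, project_id='my-project', dialect='standard')
df.to_csv('output.csv', index=False)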
If the query result is too big for that, add a destination table to your QueryJobConfig so the result is written to a BigQuery table, which you can then export to Google Cloud Storage with an ExtractJob as shown above.
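A rough sketch of that approach (the my_dataset and blocks_copy names are placeholders, not from the question):

from google.cloud import bigquery

client = bigquery.Client()

# Write the query result into a regular table instead of pulling it locally
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset('my_dataset').table('blocks_copy')
job_config.write_disposition = 'WRITE_TRUNCATE'

query_job = client.query(
    "SELECT * FROM `bigquery-public-data.ethereum_blockchain.blocks`",
    job_config=job_config)
query_job.result()  # wait for completion; the table can now be extracted to GCS as above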
Answer to your question
You could simply use string formatting:
query = """
SELECT *
FROM `bigquery-public-data.ethereum_blockchain.blocks`
WHERE some_column = {}
LIMIT {}
"""
query_job = client.query(query.format(desired_value, number_lines))
(This places desired_value in the WHERE and number_lines in the LIMIT)
If you want to use scalar query parameters, you'll have to create a job config:
my_config = bigquery.job.QueryJobConfig()
my_config.query_parameters = params # this is the list of ScalarQueryParameter's
client.query(query, job_config=my_config)
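For completeness, a hedged sketch of how the query text itself would reference such parameters with the @name syntax (the column and values are placeholders carried over from the example above):

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT *
FROM `bigquery-public-data.ethereum_blockchain.blocks`
WHERE some_column = @desired_value
LIMIT 1000
"""

my_config = bigquery.job.QueryJobConfig()
my_config.query_parameters = [
    bigquery.ScalarQueryParameter('desired_value', 'STRING', 'some value'),
]

for row in client.query(query, job_config=my_config):
    print(row)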
Hi, I am trying to fetch the total number of movies I have liked, but the Graph API restricts the results to 25. I have tried using the until timestamp and also the limit parameter, but still only 25 movies are fetched. My code goes like this:
query = "https://graph.facebook.com/USER_NAME?limit=200&access_token=%s&fields=name,movies" % TOKEN
result = requests.get(query)
data = json.loads(result.text)
fd = open('Me','a')
for key in data:
if key=='movies':
fd.write("KEY: MOVIES\n")
#print data[key]
count = len((data[unicode(key)])['data'])
fd.write("COUNT = "+str(count)+"\n")
for i in (data[unicode(key)])['data']:
fd.write((i['name']).encode('utf8'))
fd.write("\n")
Please help me fix it. Thanks in advance.
Since the platform update on October 2nd, only 25 likes are returned at once via the Graph API (see https://developers.facebook.com/roadmap/completed-changes/), and movies are likes. You can either implement result pagination (a sketch follows below) or use the FQL table page_fan with the following FQL:
select page_id, name from page where page_id in (select page_id from page_fan where uid=me() and profile_section = 'movies')
You have to count the entries in your application; FB has no aggregation functionality.
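A rough sketch of the pagination route in Python, reusing the TOKEN from the question (the paging keys are the standard Graph API 'paging'/'next' fields, but their exact shape may vary by API version):

import requests

url = "https://graph.facebook.com/USER_NAME/movies?limit=25&access_token=%s" % TOKEN

names = []
# Keep following the 'next' link until the API stops returning one
while url:
    payload = requests.get(url).json()
    names.extend(item['name'] for item in payload.get('data', []))
    url = payload.get('paging', {}).get('next')

print(len(names))   # the total count is computed client-side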