Converting Google Datastore query result to pandas DataFrame in Python

I need to convert a Google Cloud Datastore query result to a dataframe, to create a chart from the retrieved data. The query:
def fetch_times(limit):
    start_date = '2019-10-08'
    end_date = '2019-10-19'
    query = datastore_client.query(kind='ParticleEvent')
    query.add_filter('published_at', '>', start_date)
    query.add_filter('published_at', '<', end_date)
    query.order = ['-published_at']
    times = query.fetch(limit=limit)
    return times
creates a JSON-like string of the results for each entity returned by the query:
Entity('ParticleEvent', 5942717456580608) {'gc_pub_sub_id': '438169950283983', 'data': '605', 'event': 'light intensity', 'published_at': '2019-10-11T14:37:45.407Z', 'device_id': 'e00fce6847be7713698287a1'}>
I thought I found something that would translate to JSON, which I could then convert to a dataframe, but I get an error that the properties attribute does not exist:
def to_json(gql_object):
    result = []
    for item in gql_object:
        result.append(dict([(p, getattr(item, p)) for p in item.properties()]))
    return json.dumps(result, cls=JSONEncoder)
Is there a way to iterate through the query results to get them into a dataframe, either directly or by converting to JSON and then to a dataframe?

Datastore entities can be treated as Python base dictionaries! So you should be able to do something as simple as...
df = pd.DataFrame(datastore_entities)
...and pandas will do all the rest.
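A minimal sketch of that, reusing the fetch_times function from the question (the limit of 10 is arbitrary):
import pandas as pd

entities = list(fetch_times(10))  # each Entity behaves like a dict
df = pd.DataFrame(entities)       # one column per entity property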
If you need to convert the entity key, or any of its attributes, to a column as well, you can pack them into the dictionary separately:
for e in entities:
    e['entity_key'] = e.key
    e['entity_key_name'] = e.key.name  # for example
df = pd.DataFrame(entities)

You can use pd.read_json to read your JSON query output into a dataframe. Assuming the output is the string you have shared above, the following approach can work.
# Extract the beginning of the dictionary
startPos = line.find("{")
df = pd.DataFrame([eval(line[startPos:-1])])  # [:-1] drops the trailing '>'
Output looks like :
gc_pub_sub_id data event published_at \
0 438169950283983 605 light intensity 2019-10-11T14:37:45.407Z
device_id
0 e00fce6847be7713698287a1
Here, line[startPos:-1] is essentially the entire dictionary in the string input. Using eval, we can convert it into an actual dictionary. Once we have that, it can easily be converted into a dataframe object.
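Since eval executes arbitrary code, a safer variant of the same idea is ast.literal_eval, which only accepts Python literals:
import ast
import pandas as pd

startPos = line.find("{")
record = ast.literal_eval(line[startPos:-1])  # parses the dict literal, nothing else
df = pd.DataFrame([record])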

The original poster found a workaround, which is to convert each item in the query result object to a string and then manually parse the string to extract the needed data into a list; a sketch of that approach follows.
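A minimal sketch of that kind of string parsing, assuming the Entity repr shown in the question (the slicing and the limit of 10 are illustrative):
import ast
import pandas as pd

rows = []
for item in fetch_times(10):
    s = str(item)                              # "Entity('ParticleEvent', ...) {...}>"
    payload = s[s.find("{"):s.rfind("}") + 1]  # keep only the dict portion
    rows.append(ast.literal_eval(payload))     # safer than eval for literals
df = pd.DataFrame(rows)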

The return value of the fetch function is google.cloud.datastore.query.Iterator which behaves like a List[dict] so the output of fetch can be passed directly into pd.DataFrame.
import pandas as pd
df = pd.DataFrame(fetch_times(10))
This is similar to @bkitej's answer, but I added the use of the original poster's function.

Related

Index count being wrapped into the data as one object

First time on Stack Overflow, so bear with me. Code is below. Basically, df_history is a dataframe with different variables. I am trying to pull the 'close' variable and sort it based on the categorical type of the currency.
When I pull data using the .query command, it gives me one object with all the individual observations together, separated by spaces. I know how to separate that back into independent data, but the issue is that it is pulling the index count along with the observations. In the image you can see 179, 178, 177, etc. in the BTC object. I don't want that there and didn't intend to pull it. How do I get rid of it?
additional_rows = []
for currency in selected_coins:
    df_history = df_history.sort_values(['date'], ascending=True)
    row_data = [currency,
                df_history.query('granularity == \'daily\' and currency == @currency')['close'],
                df_history.query('granularity == \'daily\' and currency == @currency').head(180)['close'].pct_change(),
                df_history['date']
                ]
    additional_rows.append(row_data)

df_additional_info = pd.DataFrame(additional_rows, columns=['currency',
                                                            'close',
                                                            'returns',
                                                            'df_history'])
df_additional_info.set_index('currency').transpose()

import ast
list_of_lists = df_additional_info.close.to_list()
flat_list = [i for sublist in list_of_lists for i in ast.literal_eval(sublist)]
uniq_list = list(set(flat_list))
len(uniq_list), len(flat_list)
I was trying to pull data from one dataframe to the next and sort it based on a categorical input from the currency variable, but it is not transferring over well.
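A hedged sketch of one likely fix: a pandas Series carries its index along, so taking .tolist() of the filtered column keeps only the values (df_history, currency, and the query string are the ones from the question's code):
# the Series prints with its index; .tolist() keeps only the values
close_values = df_history.query(
    'granularity == \'daily\' and currency == @currency')['close'].tolist()
row_data = [currency, close_values, df_history['date'].tolist()]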

How to filter a REST API row response by a value of the row with Python?

I configured a REST API POST Request that connects to an Oracle Service Cloud and it receives the response in the following json format:
{"count":16,"name":"Report Name","columnNames":["Connection","Complete Name","Login ID","Login Time","Logout Time","IP Direction"],"rows":[["PGALICHI","Robert The IT Guy","3205","2018-01-25 08:52:23","2018-01-25 15:00:50","201.255.56.151"],["PGALICHI","Lucas The other IT Guy","3204","2018-01-25 08:52:21","2018-01-25 15:00:51","201.255.56.151"]],"links":[{"rel":"self","href":"https://web--tst1.custhelp.com/services/rest/connect/v1.4/analyticsReportResults"},{"rel":"canonical","href":"https://web--tst1.custhelp.com/services/rest/connect/v1.4/analyticsReportResults"},{"rel":"describedby","href":"https://web--tst1.custhelp.com/services/rest/connect/v1.4/metadata-catalog/analyticsReportResults","mediaType":"application/schema+json"}]}
This information will be the input for a script that prints just the rows. What I need now is, first, to sort all the rows by "Login Time", and then to filter the rows with a "Login Time" equal to or earlier than a variable holding the last "Login Time".
This is an example of the code I use now to get only the rows:
class OracleORNHandler:
    def __init__(self, **args):
        pass

    def __call__(self, response_object, raw_response_output, response_type, req_args, endpoint):
        if response_type == "json":
            output = json.loads(raw_response_output)
            for row in output["rows"]:
                print_xml_stream(json.dumps(row))
        else:
            print_xml_stream(raw_response_output)
This needs more coding; however, I will explain the logical approach. You may want to vary the coding details a little.
Load the string into a dictionary, with the response hard-coded as a string:
json_out = '{"count":16,"name":"Report Name","columnNames":["Connection","Complete Name","Login ID","Login Time","Logout Time","IP Direction"],"rows":[["PGALICHI","Robert The IT Guy","3205","2018-01-25 08:52:23","2018-01-25 15:00:50","201.255.56.151"],["PGALICHI","Lucas The other IT Guy","3204","2018-01-25 08:52:21","2018-01-25 15:00:51","201.255.56.151"]],"links":[{"rel":"self","href":"https://web--tst1.custhelp.com/services/rest/connect/v1.4/analyticsReportResults"},{"rel":"canonical","href":"https://web--tst1.custhelp.com/services/rest/connect/v1.4/analyticsReportResults"},{"rel":"describedby","href":"https://web--tst1.custhelp.com/services/rest/connect/v1.4/metadata-catalog/analyticsReportResults","mediaType":"application/schema+json"}]}'
dic_json = json.loads(json_out)
Convert the rows into a dictionary. "Login Time" is used as the value, as we will need it for sorting later. Use datetime to convert the string to a date:
from datetime import datetime

rows_list = dic_json['rows']
d = dict()
for x in rows_list:
    # "Login Time" is the fourth column (index 3) in columnNames
    datetime_object = datetime.strptime(x[3], '%Y-%m-%d %H:%M:%S')
    d[str(x)] = datetime_object
Note: since "Login Time" may or may not be unique, I chose the whole row (as a string) to be the key. It is not even certain that "Login ID" is unique here.
Use collections.OrderedDict for sorting on time, i.e.:
import collections
# sort by time, i.e. by the dictionary values
od2 = collections.OrderedDict(sorted(d.items(), key=lambda t: t[1]))
Filter values based on a date-time condition, with the input time given as a string:
date_time_req = datetime.strptime('2018-01-25 08:52:23', '%Y-%m-%d %H:%M:%S')
for y in od2.keys():
    if od2[y] <= date_time_req:
        print(y)
This output is a string; you can trim it and convert it back to a list.
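A hedged sketch of that last step, assuming the keys were produced with str(x) as above:
import ast
# each key is the str() of a row list, so it parses back with literal_eval
sorted_rows = [ast.literal_eval(k) for k in od2]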

ElasticSearch query to pandas dataframe

I have a query:
s = Search(using=client, index='myindex', doc_type='mytype')
s.query = Q('bool', must=[Q('match', BusinessUnit=bunit),
                          Q('range', **dicdate)])
res = s.execute()
This returns 627,033 hits; I want to convert the result into a dataframe with 627,033 rows.
Based on your comment I think what you're looking for is size:
es.search(index="my-index", doc_type="mydocs", body="your search", size=1000)
I'm not sure if this will work for 627,033 lines -- you might need scroll for that.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
If your request is likely to return more than 10,000 documents from Elasticsearch, you will need to use the scrolling function of Elasticsearch. Documentation and examples for this function are rather difficult to find, so I will provide you with a full, working example:
import pandas as pd
from elasticsearch import Elasticsearch
import elasticsearch.helpers
es = Elasticsearch('127.0.0.1',
                   http_auth=('my_username', 'my_password'),
                   port=9200)
body = {"query": {"match_all": {}}}
results = elasticsearch.helpers.scan(es, query=body, index="my_index")
df = pd.DataFrame.from_dict([document['_source'] for document in results])
Simply edit the fields that start with "my_" to correspond to your own values.
I found the solution by Phil B to be a good template for my situation. However, all results are returned as lists rather than atomic data types. To get around this, I added the following helper function and code:
def flat_data(val):
    if isinstance(val, list):
        return val[0]
    else:
        return val

df = pd.DataFrame.from_dict([{k: flat_data(v) for (k, v) in document['fields'].items()}
                             for document in results])

Extract string from data frame in python

I am new to Python and I would like to extract string data from my dataframe (census data, shown as an image in the original post). The question: Which state has the most counties in it?
Unfortunately I could not extract a string! Here is my code:
import pandas as pd
census_df = pd.read_csv('census.csv')

def answer_five():
    return census_df[census_df['COUNTY'] == census_df['COUNTY'].max()]['STATE']

answer_five()
How about this:
import pandas as pd
census_df = pd.read_csv('census.csv')

def answer_five():
    """
    Returns the 'STATE' corresponding to the max 'COUNTY' value
    """
    max_county = census_df['COUNTY'].max()
    s = census_df.loc[census_df['COUNTY'] == max_county, 'STATE']
    return s

answer_five()
This should output a pd.Series object featuring the 'STATE' value(s) where 'COUNTY' is maxed. If you only want the value and not the Series (as your question stated, and since in your image there is only one max value for COUNTY), then returning s.iloc[0] (instead of s) should do; positional iloc is used because the matching row's label is unlikely to be 0.
def answer_five():
    return census_df.groupby('STNAME')['COUNTY'].nunique().idxmax()
You can aggregate the data by grouping on state name, counting the unique counties per state, and returning the index (the state name) of the max count.
I had the same issue; for some reason, I tried using .item() and managed to extract the exact value I needed. In your case it would look like:
return census_df[census_df['COUNTY'] == census_df['COUNTY'].max()]['STATE'].item()
Note that .item() raises a ValueError if the Series does not contain exactly one element.

pandas read_json: "If using all scalar values, you must pass an index"

I am having some difficulty importing a JSON file with pandas.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
This is the error that I get:
ValueError: If using all scalar values, you must pass an index
The file structure is simplified like this:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
It is from the machine learning course of University of Washington on Coursera. You can find the file here.
Try
ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
That file only contains key value pairs where values are scalars. You can convert it to a dataframe with ser.to_frame('count').
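For example, a small sketch (the 'count' column name is arbitrary):
import pandas as pd

ser = pd.read_json('people_wiki_map_index_to_word.json', typ='series')
df = ser.to_frame('count')  # index = words, single 'count' column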
You can also do something like this:
import json
with open('people_wiki_map_index_to_word.json', 'r') as f:
    data = json.load(f)
Now data is a dictionary. You can pass it to a dataframe constructor like this:
df = pd.DataFrame({'count': data})
You can do as @ayhan mentions, which will give you a column-based format.
Or you can enclose the object in [ ] (source), as shown below, to give you a row format that will be convenient if you are loading multiple values and planning on using a matrix for your machine learning models.
df = pd.DataFrame([data])
I think what is happening is that the data in
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json')
is being read as a string instead of as JSON:
{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}
is actually
'{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}'
Since a string is a scalar, pandas complains about the missing index; you have to convert the string to a dict, which is exactly what the other response is doing.
The best way is to do json.loads on the string to convert it to a dict and load it into pandas:
with open('people_wiki_map_index_to_word.json') as f:
    myfile = f.read()
jsonData = json.loads(myfile)
df = pd.DataFrame([jsonData])  # wrap in a list, since the values are scalars
Given a file values.json containing scalar values for each key:
{
    "biennials": 522004,
    "lb915": 116290
}
df = pd.read_json('values.json')
returns the error
If using all scalar values, you must pass an index.
because pd.read_json expects a list for each key, as in
{
    "biennials": [522004],
    "lb915": [116290]
}
You can resolve this by specifying the typ argument in pd.read_json:
map_index_to_word = pd.read_json('Datasets/people_wiki_map_index_to_word.json', typ='series')
For newer pandas, 0.19.0 and later, use the lines parameter, set to True. The file is then read as one JSON object per line.
import pandas as pd
map_index_to_word = pd.read_json('people_wiki_map_index_to_word.json', lines=True)
It fixed the following errors I encountered, especially when some of the JSON files had only one value:
ValueError: If using all scalar values, you must pass an index
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ValueError: Trailing data
For example, given:
cat values.json
{
    "name": "Snow",
    "age": "31"
}
df = pd.read_json('values.json')
Chances are you will end up with this error:
if using all scalar values, you must pass an index
Pandas looks for a list or dictionary in the values, something like:
cat values.json
{
    "name": ["Snow"],
    "age": ["31"]
}
So try doing this. Later on, to convert to HTML, use to_html():
df = pd.DataFrame([pd.read_json(report_file, typ='series')])
result = df.to_html()
I solved this by converting it into an array, like so:
[{"biennials": 522004, "lb915": 116290, "shatzky": 127647, "woode": 174106, "damfunk": 133206, "nualart": 153444, "hatefillot": 164111, "missionborn": 261765, "yeardescribed": 161075, "theoryhe": 521685}]
