OverflowError: MongoDB can only handle up to 8-byte ints?

I have spent the last 12 hours scouring the web. I am completely lost, please help.
I am trying to pull data from an API endpoint and put it into MongoDB. The data looks like this:
{"_links": {
"self": {
"href": "https://us.api.battle.net/data/sc2/ladder/271302?namespace=prod"
}
},
"league": {
"league_key": {
"league_id": 5,
"season_id": 37,
"queue_id": 201,
"team_type": 0
},
"key": {
"href": "https://us.api.battle.net/data/sc2/league/37/201/0/5?namespace=prod"
}
},
"team": [
{
"id": 6956151645604413000,
"rating": 5321,
"wins": 131,
"losses": 64,
"ties": 0,
"points": 1601,
"longest_win_streak": 15,
"current_win_streak": 4,
"current_rank": 1,
"highest_rank": 10,
"previous_rank": 1,
"join_time_stamp": 1534903699,
"last_played_time_stamp": 1537822019,
"member": [
{
"legacy_link": {
"id": 9964871,
"realm": 1,
"name": "mTOR#378",
"path": "/profile/9964871/1/mTOR"
},
"played_race_count": [
{
"race": "Zerg",
"count": 195
}
],
"character_link": {
"id": 9964871,
"battle_tag": "Hellghost#11903",
"key": {
"href": "https://us.api.battle.net/data/sc2/character/Hellghost-11903/9964871?namespace=prod"
}
}
}
]
},
{
"id": 11611747760398664000, .....
....
Here's the code:
for ladder_number in ladder_array:
    ladder_call_url = ladder_call+slash+str(ladder_number)+eng_locale+access_token
    url = str(ladder_call_url)
    response = requests.get(url)
    print('trying ladder number '+str(ladder_number))
    print('calling :'+url)
    if response.status_code == 200:
        print('status: '+str(response))
        mmr_db.ladders.insert_one(response.json())
I get an error:
OverflowError: MongoDB can only handle up to 8-byte ints
Is this because the data I am trying to load is too large? Are the "ID" integers too large?
Oh man, any help would be sincerely appreciated.
EDIT
Edited to include the traceback:
Traceback (most recent call last):
File "C:\scripts\mmr_from_ladders.py", line 96, in <module>
mmr_db.ladders.insert_one(response.json(), bypass_document_validation=True)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\collection.py", line 693, in insert_one
session=session),
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\collection.py", line 607, in _insert
bypass_doc_val, session)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\collection.py", line 595, in _insert_one
acknowledged, _insert_command, session)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\mongo_client.py", line 1243, in _retryable_write
return self._retry_with_session(retryable, func, s, None)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\mongo_client.py", line 1196, in _retry_with_session
return func(session, sock_info, retryable)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\collection.py", line 590, in _insert_command
retryable_write=retryable_write)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\pool.py", line 584, in command
self._raise_connection_failure(error)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\pool.py", line 745, in _raise_connection_failure
raise error
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\pool.py", line 579, in command
unacknowledged=unacknowledged)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\network.py", line 114, in command
codec_options, ctx=compression_ctx)
File "C:\Users\me\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pymongo\message.py", line 679, in _op_msg
flags, command, identifier, docs, check_keys, opts)
OverflowError: MongoDB can only handle up to 8-byte ints

The BSON spec (MongoDB's native binary serialization format) only supports 32-bit (signed) and 64-bit (signed) integers; 8 bytes is 64 bits.
The maximum integer value that can be stored in a signed 64-bit int is:
9,223,372,036,854,775,807
In your example you appear to have larger ids, for example:
11,611,747,760,398,664,000
I'm guessing that the app generating this data uses uint64 types (an unsigned 64-bit integer can hold values up to 2^64 - 1, roughly twice the signed maximum).
I would start by looking at either of these potential solutions, if possible:
Changing the producing side to use signed int64 types for the IDs.
Replacing the incoming IDs using ObjectId(), which gives you a 12-byte GUID-like value for your unique IDs.
If neither is possible, you can coerce the oversized values yourself before inserting; see the sketch below.
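A minimal sketch, assuming you are happy to store oversized IDs as strings; coerce_big_ints is a hypothetical helper name, not part of pymongo:

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def coerce_big_ints(value):
    """Recursively convert ints outside the BSON int64 range to strings."""
    if isinstance(value, dict):
        return {k: coerce_big_ints(v) for k, v in value.items()}
    if isinstance(value, list):
        return [coerce_big_ints(v) for v in value]
    if isinstance(value, int) and not (INT64_MIN <= value <= INT64_MAX):
        return str(value)  # e.g. 11611747760398664000 -> "11611747760398664000"
    return value

mmr_db.ladders.insert_one(coerce_big_ints(response.json()))

The trade-off is that stringified IDs lose numeric sorting and range queries on the MongoDB side.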

Related

Iterate through nested JSON in Python

js = {
    "status": "ok",
    "meta": {
        "count": 1
    },
    "data": {
        "542250529": [
            {
                "all": {
                    "spotted": 438,
                    "battles_on_stunning_vehicles": 0,
                    "avg_damage_blocked": 39.4,
                    "capture_points": 40,
                    "explosion_hits": 0,
                    "piercings": 3519,
                    "xp": 376586,
                    "survived_battles": 136,
                    "dropped_capture_points": 382,
                    "damage_dealt": 783555,
                    "hits_percents": 74,
                    "draws": 2,
                    "battles": 290,
                    "damage_received": 330011,
                    "frags": 584,
                    "stun_number": 0,
                    "direct_hits_received": 1164,
                    "stun_assisted_damage": 0,
                    "hits": 4320,
                    "battle_avg_xp": 1299,
                    "wins": 202,
                    "losses": 86,
                    "piercings_received": 1004,
                    "no_damage_direct_hits_received": 103,
                    "shots": 5857,
                    "explosion_hits_received": 135,
                    "tanking_factor": 0.04
                }
            }
        ]
    }
}
Let's name this JSON "js"; the variable is produced inside a for loop.
To give some context: I'm trying to collect data from a game.
The game has hundreds of different tanks, and each tank has a tank_id. I can post a tank_id to the game server, and it responds with the performance data shown above as "js":
for tank_id: json = requests.post(tank_id) etc...
I then want to insert all these values into my database.
My Python code for it:
def api_get():
    for property in js['data']['542250529']['all']:
        spotted = property['spotted']
        battles_on_stunning_vehicles = property['battles_on_stunning_vehicles']
        # etc
        # ...
        insert_to_db(spotted, battles_on_stunning_vehicles, etc....)
The exception is:
for property in js['data']['542250529']['all']:
TypeError: list indices must be integers or slices, not str
And when I run:
print(js['data']['542250529'])
I get the rest of the js printed, but I can't iterate over it the way I expected. What's inside js['data']['542250529'] is a list whose only item holds 'all'. Any help would be appreciated.
You just missed [0] to get the first item in the list. Note that 'all' maps to a dict of stats, so you can index it directly instead of looping over it:
def api_get():
    all_stats = js['data']['542250529'][0]['all']
    spotted = all_stats['spotted']
    # ...
Look carefully at the data structure in the source JSON.
There is a list containing the dictionary with a key of all. So you need to use js['data']['542250529'][0]['all'] not js['data']['542250529']['all']. Then you can use .items() to get the key-value pairs.
See below.
js = { ... }  # same JSON as posted in the question
for key, val in js['data']['542250529'][0]['all'].items():
    print("key:", key, " val:", val)

# Or this way
for key in js['data']['542250529'][0]['all']:
    print("key:", key, " val:", js['data']['542250529'][0]['all'][key])

Keep jsonschema from always making requests to URI

Background
I am trying to validate a JSON file using jsonschema. However, the library is trying to make a GET request and I want to avoid that.
from jsonschema import validate

point_schema = {
    "$id": "https://example.com/schemas/point",
    "type": "object",
    "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
    "required": ["x", "y"],
}

polygon_schema = {
    "$id": "https://example.com/schemas/polygon",
    "type": "array",
    "items": {"$ref": "https://example.com/schemas/point"},
}

a_polygon = [{'x': 1, 'y': 2}, {'x': 3, 'y': 4}, {'x': 1, 'y': 2}]

validate(instance=a_polygon, schema=polygon_schema)
Error
I am trying to connect both schemas using a $ref key from the spec:
https://json-schema.org/understanding-json-schema/structuring.html?highlight=ref#ref
Unfortunately for me, this means the library will make a GET request to the URI specified and try to decode it:
Traceback (most recent call last):
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 777, in resolve_from_url
document = self.resolve_remote(url)
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 860, in resolve_remote
result = requests.get(uri).json()
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/requests/models.py", line 910, in json
return complexjson.loads(self.text, **kwargs)
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 932, in validate
error = exceptions.best_match(validator.iter_errors(instance))
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/exceptions.py", line 367, in best_match
best = next(errors, None)
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 328, in iter_errors
for error in errors:
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/_validators.py", line 81, in items
for error in validator.descend(item, items, path=index):
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 344, in descend
for error in self.iter_errors(instance, schema):
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 328, in iter_errors
for error in errors:
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/_validators.py", line 259, in ref
scope, resolved = validator.resolver.resolve(ref)
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 766, in resolve
return url, self._remote_cache(url)
File "/home/user/anaconda3/envs/myapp-py/lib/python3.7/site-packages/jsonschema/validators.py", line 779, in resolve_from_url
raise exceptions.RefResolutionError(exc)
jsonschema.exceptions.RefResolutionError: Expecting value: line 1 column 1 (char 0)
I don't want this, I just want the polygon schema to reference the point schema that is right above (as for this purpose, a polygon is a list of points).
In fact, these schemas are in the same file.
Questions
I could always do the following:
point_schema = {
    "$id": "https://example.com/schemas/point",
    "type": "object",
    "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
    "required": ["x", "y"],
}

polygon_schema = {
    "$id": "https://example.com/schemas/polygon",
    "type": "array",
    "items": point_schema,
}
And this would technically work.
However, I would simply be building a bigger dictionary, and I would not be using the spec as it was designed.
How can I use the spec to solve my problem?
You have to provide your other schemas to the implementation.
With this implementation, you must provide a RefResolver to the validate function.
You'll need to either provide a single base_uri and referrer (the schema), or a store which contains a dictionary of URI to schema.
Additionally, you may handle protocols with a function.
Your RefResolver would look like the following...
refResolver = jsonschema.RefResolver(base_uri='https://example.com/schemas/point', referrer=point_schema)
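To make the $ref resolve without any network access, you can also pre-populate the resolver's store with both schemas, keyed by their $id, and pass the resolver through validate (extra keyword arguments to validate are forwarded to the validator class). A sketch, noting that RefResolver is deprecated in newer jsonschema releases in favor of the referencing library:

import jsonschema
from jsonschema import validate

schema_store = {
    point_schema["$id"]: point_schema,
    polygon_schema["$id"]: polygon_schema,
}

resolver = jsonschema.RefResolver(
    base_uri=polygon_schema["$id"],
    referrer=polygon_schema,
    store=schema_store,
)

# The $ref to the point schema is now resolved locally from the store,
# so no GET request is made.
validate(instance=a_polygon, schema=polygon_schema, resolver=resolver)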

jsonb join not working properly in sqlalchemy

I have a query that joins on a jsonb column in Postgres, which I want to convert to SQLAlchemy in Django using the aldjemy package:
SELECT anon_1.key AS tag, count(anon_1.value ->> 'polarity') AS count_1, anon_1.value ->> 'polarity' AS anon_2
FROM feedback f
JOIN tagging t ON t.feedback_id = f.id
JOIN jsonb_each(t.json_content -> 'entityMap') AS anon_3 ON true
JOIN jsonb_each(((anon_3.value -> 'data') - 'selectionState') - 'segment') AS anon_1 ON true
WHERE f.id = 2
GROUP BY anon_1.value ->> 'polarity', anon_1.key;
The json_content field stores data in the following format:
{
    "entityMap": {
        "0": {
            "data": {
                "people": {
                    "labelId": 5,
                    "polarity": "positive"
                },
                "segment": "a small segment",
                "selectionState": {
                    "focusKey": "9xrre",
                    "hasFocus": true,
                    "anchorKey": "9xrre",
                    "isBackward": false,
                    "focusOffset": 75,
                    "anchorOffset": 3
                }
            },
            "type": "TAG",
            "mutability": "IMMUTABLE"
        },
        "1": {
            "data": {
                "product": {
                    "labelId": 6,
                    "polarity": "positive"
                },
                "segment": "another segment",
                "selectionState": {
                    "focusKey": "9xrre",
                    "hasFocus": true,
                    "anchorKey": "9xrre",
                    "isBackward": false,
                    "focusOffset": 138,
                    "anchorOffset": 79
                }
            },
            "type": "TAG",
            "mutability": "IMMUTABLE"
        }
    }
}
I wrote the following sqlalchemy code to achieve the query
first_alias = aliased(func.jsonb_each(Tagging.sa.json_content["entityMap"]))
print(first_alias)
second_alias = aliased(
    func.jsonb_each(
        first_alias.c.value.op("->")("data")
        .op("-")("selectionState")
        .op("-")("segment")
    )
)
polarity = second_alias.c.value.op("->>")("polarity")
p_tag = second_alias.c.key
_count = (
    Feedback.sa.query()
    .join(
        CampaignQuestion,
        CampaignQuestion.sa.question_id == Feedback.sa.question_id,
        isouter=True,
    )
    .join(Tagging)
    .join(first_alias, true())
    .join(second_alias, true())
    .filter(CampaignQuestion.sa.campaign_id == campaign_id)
    .with_entities(p_tag.label("p_tag"), func.count(polarity), polarity)
    .group_by(polarity, p_tag)
    .all()
)
print(_count)
but it gives me a NotImplementedError: Operator 'getitem' is not supported on this expression error when accessing first_alias.c.
The stack trace:
Traceback (most recent call last):
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/rest_framework/views.py", line 506, in dispatch
response = handler(request, *args, **kwargs)
File "/home/work/api/app/campaign/views.py", line 119, in results_p_tags
d = campaign_service.get_p_tag_count_for_campaign_results(id)
File "/home/work/api/app/campaign/services/campaign.py", line 177, in get_p_tag_count_for_campaign_results
return campaign_selectors.get_p_tag_counts_for_campaign(campaign_id)
File "/home/work/api/app/campaign/selectors.py", line 196, in get_p_tag_counts_for_campaign
polarity = second_alias.c.value.op("->>")("polarity")
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 1093, in __get__
obj.__dict__[self.__name__] = result = self.fget(obj)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/selectable.py", line 746, in columns
self._populate_column_collection()
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/selectable.py", line 1617, in _populate_column_collection
self.element._generate_fromclause_column_proxies(self)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/selectable.py", line 703, in _generate_fromclause_column_proxies
fromclause._columns._populate_separate_keys(
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/base.py", line 1216, in _populate_separate_keys
self._colset.update(c for k, c in self._collection)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/base.py", line 1216, in <genexpr>
self._colset.update(c for k, c in self._collection)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/operators.py", line 434, in __getitem__
return self.operate(getitem, index)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 831, in operate
return op(self.comparator, *other, **kwargs)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/operators.py", line 434, in __getitem__
return self.operate(getitem, index)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/type_api.py", line 75, in operate
return o[0](self.expr, op, *(other + o[1:]), **kwargs)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/default_comparator.py", line 173, in _getitem_impl
_unsupported_impl(expr, op, other, **kw)
File "/home/.cache/pypoetry/virtualenvs/api-FPSaTdE5-py3.8/lib/python3.8/site-packages/sqlalchemy/sql/default_comparator.py", line 177, in _unsupported_impl
raise NotImplementedError(
NotImplementedError: Operator 'getitem' is not supported on this expression
Any help would be greatly appreciated.
PS: The SQLAlchemy version I'm using for this is 1.4.6.
I used the same SQLAlchemy query expression before in a Flask project using SQLAlchemy version 1.3.22, and it was working correctly.
Fixed the issue by using table-valued functions as described in the docs, and accessing the ColumnCollection of the function by index instead of by key. Code is as follows:
first_alias = func.jsonb_each(Tagging.sa.json_content["entityMap"]).table_valued(
    "key", "value"
)
second_alias = func.jsonb_each(
    first_alias.c[1].op("->")("data").op("-")("selectionState").op("-")("segment")
).table_valued("key", "value")
polarity = second_alias.c[1].op("->>")("polarity")
p_tag = second_alias.c[0]
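The rest of the query can then be assembled exactly as in the question; a sketch under the assumption that the aldjemy .sa mappings behave as before:

from sqlalchemy import func, true

_count = (
    Feedback.sa.query()
    .join(Tagging)
    .join(first_alias, true())    # equivalent of JOIN jsonb_each(...) ON true
    .join(second_alias, true())
    .filter(Feedback.sa.id == 2)
    .with_entities(p_tag.label("tag"), func.count(polarity), polarity)
    .group_by(polarity, p_tag)
    .all()
)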

Python Pandas ValueError: Expected object or value

I am trying to have Python read a JSON file and export it to a CSV. I am using Pandas for the conversion, but I am getting "ValueError: Expected object or value" when I run the code below.
import pandas as pd
df = pd.read_json('contacts.json')
I am using Visual Studio Code for testing the script. When I run the above code, I get the message below in the Terminal window.
PS C:\Users\TaRan\tableau-wdc-tutorial-part-1> & "C:/Program Files/Python38/python.exe" "c:/Users/TaRan/Dropbox/Team Operational Resources/G. BI Internal/Testing/Hubspot/conversion.py"
Traceback (most recent call last):
  File "c:/Users/TaRan/Dropbox/Team Operational Resources/G. BI Internal/Testing/Hubspot/conversion.py", line 3, in <module>
    df = pd.read_json('contacts.txt')
  File "C:\Program Files\Python38\lib\site-packages\pandas\util\_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "C:\Program Files\Python38\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Program Files\Python38\lib\site-packages\pandas\io\json\_json.py", line 618, in read_json
    result = json_reader.read()
  File "C:\Program Files\Python38\lib\site-packages\pandas\io\json\_json.py", line 755, in read
    obj = self._get_object_parser(self.data)
  File "C:\Program Files\Python38\lib\site-packages\pandas\io\json\_json.py", line 777, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "C:\Program Files\Python38\lib\site-packages\pandas\io\json\_json.py", line 886, in parse
    self._parse_no_numpy()
  File "C:\Program Files\Python38\lib\site-packages\pandas\io\json\_json.py", line 1119, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value
I thought it might be a problem with the JSON file, so I wrote a different one, but I still received the error. To me, it looks like something might be wrong with the Pandas package. I tried reinstalling it, but I still get the error.
EDIT
Here is a sample from the JSON file. I am displaying only one contact and changed the confidential information.
{"results":[{"id":"101","properties":{"createdate":"2020-06-05T15:18:37.746Z","email":"someone#aplace.com","firstname":"First","hs_object_id":"101","lastmodifieddate":"2020-08-12T15:17:35.104Z","lastname":"Last"},"createdAt":"2020-06-05T15:18:37.746Z","updatedAt":"2020-08-12T15:17:35.104Z","archived":false}],"paging":{"next":{"after":"452","link":"https://api.hubapi.com/sampleurl.com"},"prev":null}}
I am getting the JSON file from the Hubspot API. I am not doing any kind of formatting before pulling it into Python for the conversion (nor do I want to - I am trying to automate this entire process). Please note that my JSON is all on one line. I am not sure if this matters or not.
Are you sure your json file is correctly formatted? I wrote this json file and it seems to work fine for me.
{
    "Name": {
        "0": "John",
        "1": "Nick",
        "2": "Ali",
        "3": "Joseph"
    },
    "Gender": {
        "0": "Male",
        "1": "Male",
        "2": "Female",
        "3": "Male"
    },
    "Nationality": {
        "0": "UK",
        "1": "French",
        "2": "USA",
        "3": "Brazil"
    },
    "Age": {
        "0": 10,
        "1": 25,
        "2": 35,
        "3": 29
    }
}
I used the same code you wrote, but added a print statement to check the output and I was able to print out the head of the dataframe.
% python test.py
Name Gender Nationality Age
0 John Male UK 10
1 Nick Male French 25
2 Ali Female USA 35
3 Joseph Male Brazil 29
EDIT: Using the JSON you provided, it looks like it is malformed. The JSON you provided is missing a closing "]", and there were also some missing brackets in the second array item.
It should look like this depending on what you're trying to do.
{
    "results": [
        {
            "id": "101",
            "properties": {
                "createdate": "2020-06-05T15:18:37.746Z",
                "email": "someone#aplace.com",
                "firstname": "First",
                "hs_object_id": "101",
                "lastmodifieddate": "2020-08-12T15:17:35.104Z",
                "lastname": "Last"
            },
            "createdAt": "2020-06-05T15:18:37.746Z",
            "updatedAt": "2020-08-12T15:17:35.104Z",
            "archived": false
        },
        {
            "paging": {
                "next": {
                    "after": "452",
                    "link": "https://api.hubapi.com/sampleurl.com"
                },
                "prev": null
            }
        }
    ]
}
You have problems in your JSON format, e.g. in the posted part you have a '[' but no matching ']'.
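As a side note, even once the JSON parses, read_json will not give you one flat row per contact, because the fields are nested under properties. A sketch that flattens the results list with pandas.json_normalize before writing the CSV, assuming the file contains the payload shown above:

import json

import pandas as pd

with open("contacts.json", encoding="utf-8") as f:
    payload = json.load(f)

# one row per contact; nested keys such as properties.email become dotted columns
df = pd.json_normalize(payload["results"])
df.to_csv("contacts.csv", index=False)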

From a single JSON create and insert multiple rows to BigQuery with Pub/Sub and Dataflow

I have created a Beam Dataflow pipeline that parses a single JSON from a PubSub topic:
{
    "data": "test data",
    "options": {
        "test options": "test",
        "test_units": {
            "test": {
                "test1": "test1",
                "test2": "test2"
            },
            "test2": {
                "test1": "test1",
                "test2": "test2"
            },
            "test3": {
                "test1": "test1",
                "test2": "test2"
            }
        }
    }
}
My output is something like this:
{
    "data": "test data",
    "test_test_unit": "test1",
    "test_test_unit": "test2",
    "test1_test_unit": "test1",
    ...
},
{
    "data": "test data",
    "test_test_unit": "test1",
    "test_test_unit": "test2",
    "test1_test_unit": "test1",
    ...
}
Basically what I'm doing is flattening the data based on how many test_units are in the JSON from the PubSub and returning that many rows in a single dict.
I have created a Class to flatten the data which returns a dict of rows.
Here is my Beam pipeline:
lines = (
    p
    | 'Read from PubSub' >> beam.io.ReadStringsFromPubSub(known_args.input_topic)
    | 'Parse data' >> beam.ParDo(parse_pubsub())
    | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
        known_args.output_table,
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
    )
)
Here is some of the class to handle the flattening:
class parse_pubsub(beam.DoFn):
    def process(self, element):
        # ...
        # flattens the data
        # ...
        return rows
Here is the error from the Stackdriver logs:
Error processing instruction -138. Original traceback is Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 151, in _execute
response = task() File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
line 186, in <lambda> self._execute(lambda: worker.do_instruction(work), work) File "/usr/local/lib/python2.7/
dist-packages/apache_beam/runners/worker/sdk_worker.py", line 265, in do_instruction request.instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 281, in
process_bundle delayed_applications = bundle_processor.process_bundle(instruction_id) File "/usr/local/lib/
python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py", line 552, in process_bundle op.finish()
File "apache_beam/runners/worker/operations.py", line 549, in
apache_beam.runners.worker.operations.DoOperation.finish def finish(self): File "apache_beam/runners/worker/
operations.py", line 550, in apache_beam.runners.worker.operations.DoOperation.finish with
self.scoped_finish_state: File "apache_beam/runners/worker/operations.py", line 551, in
apache_beam.runners.worker.operations.DoOperation.finish self.dofn_runner.finish() File "apache_beam/runners/
common.py", line 758, in apache_beam.runners.common.DoFnRunner.finish self._invoke_bundle_method
(self.do_fn_invoker.invoke_finish_bundle) File "apache_beam/runners/common.py", line 752, in
apache_beam.runners.common.DoFnRunner._invoke_bundle_method self._reraise_augmented(exn) File "apache_beam/
runners/common.py", line 777, in apache_beam.runners.common.DoFnRunner._reraise_augmented raise_with_traceback
(new_exn) File "apache_beam/runners/common.py", line 750, in
apache_beam.runners.common.DoFnRunner._invoke_bundle_method bundle_method() File "apache_beam/runners/common.py",
line 361, in apache_beam.runners.common.DoFnInvoker.invoke_finish_bundle def invoke_finish_bundle(self): File
"apache_beam/runners/common.py", line 365, in apache_beam.runners.common.DoFnInvoker.invoke_finish_bundle
self.signature.finish_bundle_method.method_value()) File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/
gcp/bigquery.py", line 630, in finish_bundle self._flush_batch() File "/usr/local/lib/python2.7/dist-packages/
apache_beam/io/gcp/bigquery.py", line 637, in _flush_batch table_id=self.table_id, rows=self._rows_buffer) File
# HERE:
"/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/bigquery_tools.py",
line 611, in insert_rows for k, v in iteritems(row): File "/usr/local/lib/python2.7/dist-packages/future/utils/
__init__.py", line 308, in iteritems func = obj.items AttributeError: 'int' object has no attribute 'items'
[while running 'generatedPtransform-135']
I've also tried returning a list and got the same kind of error ('list' object has no attribute 'items'), so I'm converting the list of rows to a dict like this:
0 {
    "data": "test data",
    "test_test_unit": "test1",
    "test_test_unit": "test2",
    "test1_test_unit": "test1",
    ...
},
1 {
    "data": "test data",
    "test_test_unit": "test1",
    "test_test_unit": "test2",
    "test1_test_unit": "test1",
    ...
}
I'm fairly new to this so any help will be appreciated!
You'll need to use the yield keyword to emit multiple outputs in your DoFn. For example:
class parse_pubsub(beam.DoFn):
    def process(self, element):
        # ...
        # flattens the data
        # ...
        for row in rows:
            yield row
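For completeness, a minimal sketch of what the whole DoFn might look like for the sample message above; the flattening logic and the row keys are my guess at the intended output, not the asker's actual code:

import json

import apache_beam as beam

class ParsePubSub(beam.DoFn):
    """Yield one flat row per entry under options.test_units."""

    def process(self, element):
        record = json.loads(element)
        for unit_name, unit in record['options']['test_units'].items():
            row = {'data': record['data']}
            for key, value in unit.items():
                # e.g. unit 'test' with key 'test1' -> column 'test_test1'
                row['%s_%s' % (unit_name, key)] = value
            yield row  # each yielded dict becomes one BigQuery row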
