How to get PyMongo's bson to decode everything - python

I'm trying to get some data stored in a .bson file into a jupyter notebook.
Per this answer and this answer, the accepted approach is basically to use the bson module from PyMongo and then run the following code:

    import bson

    FILE = "file.bson"
    with open(FILE, 'rb') as f:
        data = bson.decode_all(f.read())
Now, data is a list of length 1, and data[0] is a dictionary.
The first key in this dictionary is "a": data[0]["a"] is a dictionary with keys tag and data, and data[0]["a"]["data"] is exactly what it should be, a list of integers that I can work with in Python.
On the other hand, the second key in this dictionary is "b": data[0]["b"] is a dictionary with keys tag, type, size, and data, and data[0]["b"]["data"] is of type bytes, which I'm not sure how to work with.
I have never worked with bson before, so any input is appreciated. Some of my specific questions:
Does anyone have a good reference on how to work with bson in Python?
Does anyone know why a gets read in a readable way (not bytes), while b gets read in with more keys but is not readable (bytes as opposed to integers)?
I was really hoping decode_all would take care of everything; does anyone know why it doesn't, or what I should do differently? I've tried applying decode_all again to the parts still in bytes, but I get the error message InvalidBSON: invalid message size.
Does anyone have a solution for my goal of getting the information from data[0]["b"]["data"] into a usable format (i.e. a list of integers)?
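No accepted answer is recorded here, but one common explanation is that the tag/type/size/data layout describes a packed binary buffer that was serialized as raw bytes. If that is the case (an assumption; the exact meaning of the type and size fields depends on whatever program wrote the file), numpy can reinterpret the bytes directly. A minimal sketch:

    import bson
    import numpy as np

    with open("file.bson", "rb") as f:
        doc = bson.decode_all(f.read())[0]

    raw = doc["b"]["data"]  # bytes object
    # Assumption: the bytes form a contiguous little-endian array of fixed-width
    # numbers; adjust the dtype (e.g. np.int32, np.float64) to match whatever the
    # "type" field in your file indicates.
    values = np.frombuffer(raw, dtype=np.int64).tolist()
    print(len(values), doc["b"]["size"])

If the resulting length doesn't line up with the size field, the dtype guess is wrong and a different element width or byte order is needed.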

Related

Map JSON string to struct in PySpark

I have a JSON string as follows:

    '''{"col1":"value1", "col2":"[{'col3':'val3'},{'col3':'val4'}]"}'''

I want to convert it to:

    {"col1": "value1",
     "col2": [{'col3': 'val3'}, {'col3': 'val4'}]}

And I want to read this into a PySpark dataframe. How do I convert the list inside the string to a JSON struct?
The (whole) data is not a JSON string, chiefly because ' characters are not allowed as string delimiters in JSON. The best option would be to go back to wherever this is generated and correct the malformed data before going onwards.
Once you have corrected the bad data, you can do:

    import json
    result = json.loads('''{"col1":"value1", "col2":[{"col3":"val3"},{"col3":"val4"}]}''')
If you can't change how the data is given to you, one solution would be to string-replace the bad characters (but this might cause all sorts of trouble along the way):

    import json
    result = json.loads('''{"col1":"value1", "col2":"[{'col3':'val3'},{'col3':'val4'}]"}''')
    result['col2'] = json.loads(result['col2'].replace("'", '"'))
Either way, I would go back and rework the way you get the data for the most reliable results. As it stands, that is not JSON data, at least not in the sense you think it is.
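Since the question asks about PySpark specifically and the answer above only uses the plain json module, here is a minimal sketch of parsing the repaired string column into a struct with from_json. It assumes a running SparkSession and that col2 has already been fixed into valid (double-quoted) JSON:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: one row whose col2 holds a valid JSON array string.
    df = spark.createDataFrame(
        [("value1", '[{"col3":"val3"},{"col3":"val4"}]')],
        ["col1", "col2"],
    )

    schema = ArrayType(StructType([StructField("col3", StringType())]))
    parsed = df.withColumn("col2", from_json(col("col2"), schema))
    parsed.printSchema()
    parsed.show(truncate=False)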

Using nested for loops to iterate through JSON file of tweets in Python

So I am new to Python, but I know what I am trying to accomplish. Basically, I have the output of tweets from twitter in a JSON file loaded into Python. What I need to do is iterate through the tweets to access the "text" key, that has the text of each tweet, because that's what I'm going to use to do topic modeling. So, I have discovered that "text" is triple nested in this data structure, and it's been really difficult to find the correct way to write the for loop code in order to iterate through the dataset and pull the "text" from every tweet.
Here is a look at what the JSON structure is like: https://pastebin.com/fUH5MTMx
So, I have figured out that the "text" key that I want is within [hits][hits][_source]. What I can't figure out is the appropriate for loop to iterate through _source and pull those texts. Here is my code so far (again, I'm a beginner, so apologies if the code is way off):
    for hits in tweets["hits"]["hits"]:
        for _source in hits:
            for text in _source:
                for item in text:
                    print(item)
I also tried this:

    for item in tweets['hits']["hits"]["_source"]:
        print(item['text'])
But I keep getting either syntax errors for the first one or "TypeError: list indices must be integers or slices, not str" for the second one. I understand that I need to indicate somehow that I am accessing a list, and that I'm missing something to show that it is a list and that I'm not looking for integers as output from the iterations. (I am using the json module in Python for this, on a Mac with Python 3 in Spyder.)
Any insight would be greatly appreciated! This multiple nesting is confusing me a lot.
tweets['hits']["hits"] is not a dictionary with ["_source"], but a list of one or more items which each have ["_source"]. That means:

    tweets['hits']["hits"][0]["_source"]
    tweets['hits']["hits"][1]["_source"]
    tweets['hits']["hits"][2]["_source"]

So this should work:

    for item in tweets['hits']["hits"]:
        print(item["_source"]['text'])
Not sure if you realize it, but JSON is transformed into a Python dictionary, not a list. Anyway, let's get into this nest.
tweets['hits'] will give you another dict.
tweets['hits']['hits'] will give you a list (notice the brackets)
This apparently is a list of dictionaries, and in this case (not sure if it will always be), the dict with the "_source" key you are looking for is the first one, so:
tweets['hits']['hits'][0] will give you the dict you want. Then, finally:
tweets['hits']['hits'][0]['_source'] should give you the text.
The value of the second "hits" is a list.
Try:
    for hit in tweets["hits"]["hits"]:
        print(hit["_source"]["text"])
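For the asker's stated goal of collecting every tweet's text for topic modeling, the same loop can be written as a list comprehension; the .get() call is a defensive assumption in case some hits lack a "_source" or "text" field:

    texts = [
        hit["_source"]["text"]
        for hit in tweets["hits"]["hits"]
        if "text" in hit.get("_source", {})
    ]
    print(len(texts), "tweets collected")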

How do I write unknown keys to a CSV in large data sets?

I'm currently working on a script that will query data from a REST API, and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but it returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.
What I have so far (this is a simplified structure for the purposes of this question):
    import csv

    def process_data(my_data):
        # This section should write my_data to a CSV file.
        # I know I can use csv.DictWriter as part of the solution.
        # Notice that in this example, "fieldnames" is undefined;
        # defining it is part of the question.
        with open('output.csv', 'a') as output_file:
            writer = csv.DictWriter(output_file, fieldnames=fieldnames)
            for element in my_data:
                writer.writerow(element)

    resp = client.get_list()
    while resp.token:
        my_data = resp.data
        process_data(my_data)
        resp = client.get_list(resp.token)
The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is, for example, entry 364 introducing an entirely new key.
Options that I've considered:
Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to 2 total iterations over the data, and feels unnecessarily complex.
Hard code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?
Options 1 and 2 both seem reasonable.
Does the CSV need to be valid and readable while you're creating it? If not, you could do the append of missing columns in one pass after you've finished reading from the API (which would be a combination of the two approaches). If you do this, you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your column definitions will grow while you're writing.
One thing to bear in mind: if the overall file is expected to be large (e.g. it won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.).
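A minimal sketch of the combined approach described above, assuming client.get_list behaves as in the question and each entry is a flat, JSON-serializable dict: the first pass spools raw entries to a JSON-lines file while collecting every key seen, and the second pass streams them back through csv.DictWriter (restval fills in columns an entry is missing):

    import csv
    import json

    def fetch_and_spool(client, spool_path="entries.jsonl"):
        # First pass: write raw entries to disk and collect every key seen.
        fieldnames = []  # a list, to preserve first-seen column order
        with open(spool_path, "w") as spool:
            resp = client.get_list()
            while True:
                for entry in resp.data:
                    spool.write(json.dumps(entry) + "\n")
                    for key in entry:
                        if key not in fieldnames:
                            fieldnames.append(key)
                if not resp.token:
                    break
                resp = client.get_list(resp.token)
        return fieldnames

    def write_csv(fieldnames, spool_path="entries.jsonl", out_path="output.csv"):
        # Second pass: stream the spooled entries into a CSV with the full header.
        with open(spool_path) as spool, open(out_path, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
            writer.writeheader()
            for line in spool:
                writer.writerow(json.loads(line))

This keeps memory usage flat regardless of how many records the API returns, at the cost of writing the data to disk twice.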

Pandas read_csv() - Can I fix missing values before they are handled by a converter in read_csv?

Inside read_csv I am using a custom converter that translates timestamps to integers. The problem is that some of the values that are sent to the converter are bad, meaning either empty strings or non numerically representable strings.
I tried to raise an exception inside the converter whenever those values are detected but that doesn't seem to be working. I am aware of methods like dropna() but that can be applied only after the data are loaded into memory. I am also aware of the na_values argument but not quite sure how to use it for this purpose.
Is there a way to filter out those bad values before they get sent into the converter?
Here is a 10-row sample of the input file, with a bad value in the 9th row:
    0:00:00.000 644.2748413
    0:00:00.002 644.4001465
    0:00:00.004 642.8461914
    0:00:00.006 643.7497559
    0:00:00.008 644.1955566
    0:00:00.010 642.1920166
    0:00:00.012 643.4448853
    0:00:00.014 #VALUE!
    0:00:00.016 644.1955566
    0:00:00.018 643.2954102
I have also tried to make the converter return NaN in case the value cannot be converted, but that doesn't work either. I assume the converter does not work completely sequentially on the data, and that is also why it's hard to track down which values it gets stuck on using print statements.
Update 2: As I found out, the problem was not in the timestamp column (the one processed by the converter, as you can see in the sample file), although the error message stated that there was a problem in the line where the converter gets called, which is what confused me. So now it's clearer, although I don't know why the error message was misleading. I can probably use na_values to clear things up.
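Given the sample above, na_values should indeed be enough: it turns the bad strings in the value column into NaN before dtype conversion, while the converter only touches the timestamp column. A minimal sketch, where the filename, column names, and the to_millis converter are all assumptions for illustration:

    import pandas as pd

    def to_millis(ts):
        # Hypothetical converter: "0:00:00.014" -> integer milliseconds.
        h, m, s = ts.split(":")
        return int((int(h) * 3600 + int(m) * 60 + float(s)) * 1000)

    df = pd.read_csv(
        "sample.txt",                       # assumed filename
        sep=r"\s+",                         # whitespace-delimited, as in the sample
        header=None,
        names=["timestamp", "value"],
        converters={"timestamp": to_millis},
        na_values=["#VALUE!"],              # bad strings in the value column become NaN
    )
    clean = df.dropna()                     # or keep the NaNs, depending on the use case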

Reading/Writing/Appending using json module

I was recommended some code from someone here but I still need some help. Here's the code that was recommended:
    import json

    def write_data(data, filename):
        with open(filename, 'w') as outfh:
            json.dump(data, outfh)

    def read_data(filename):
        with open(filename, 'r') as infh:
            json.load(infh)
The writing code works fine, but the reading code doesn't return any strings after inputting the data:
    read_data('player.txt')
Another thing I'd like to be able to do is specify a line to be printed. Something that is also pretty important for this project/exercise is being able to append strings to my file. Thanks to anyone who can help me.
Edit: I need to be storing strings in the file that I can convert to values, e.g.

    name = "Karatepig"

is something I would store so I can recall the data if I ever need to load a previous set of data or something of that sort.
I'm very much a noob at Python, so I don't know what would be best, whether a list or a dictionary; I haven't really used a dictionary before, and I also have no idea yet how I'm going to convert strings into values.
Suppose you have a dictionary like this:
    data = {'name': 'Karatepig', 'score': 10}
Your write_data function is fine, but your read_data function needs to actually return the data after reading it from the file:

    def read_data(filename):
        with open(filename, 'r') as infh:
            data = json.load(infh)
        return data

    data = read_data('player.json')
Now suppose you want to print the name and update the score:

    print(data['name'])
    data['score'] += 5  # add 5 points to the score
You can then write the data back to disk using the write_data function. You can not use the json module to append data to a file. You need to actually read the file, modify the data, and write the file again. This is not so bad for small amounts of data, but consider using a database for larger projects.
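A short sketch of that read-modify-write cycle using the two functions above; the 'level' field is just a made-up example of new data being "appended":

    data = read_data('player.json')   # load the existing file
    data['level'] = 2                 # add or change whatever you need ("level" is hypothetical)
    write_data(data, 'player.json')   # rewrite the whole file with the updated contents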
Whether you use a list or dictionary depends on what you're trying to do. You need to explain what you're trying to accomplish before anyone can recommend an approach. Dictionaries store values assigned to keys (see above). Lists are collections of items without keys and are accessed using an index.
It isn't clear what you mean by "convert strings into values." Notice that the above code shows the string "Karatepig" using the dictionary key "name".
