Store data from a JSON file in MongoDB using PyMongo (Python)

I have a JSON file with 10 lines, each containing one dict with the data. I want to store this data in MongoDB using PyMongo. Here is the code I have written:
import pymongo
from pymongo import MongoClient
client = MongoClient()
db = client.twitterdata
coll = db.twitterset
f = open('twitterdata.json', 'r')
dblist = []
for line in f:
    dblist.append(line)
I am trying to build a list with all the dicts as its elements and then add it to the collection using the insert_all() method. But since I am appending each line, will the elements of the list be strings or dicts?

First of all, if you have one dict on each line, the file as a whole is not valid JSON. For example, this is not valid JSON:
{"id": 1, "value": "abc"}
{"id": 2, "value": "xyz"}
{"id": 3, "value": "mop"}
If your data is structured like this, I suggest converting it to valid JSON, such as:
[{"id": 1,"value": "abc"},
{"id": 2,"value": "xyz"},
{"id": 3,"value": "mop"}]
If for any reason you have to stay with the first format, you can make sure you're inserting dicts (not strings) into the database like this:
import json
dblist = []
with open('twitterdata.json', 'r') as f:
    for line in f:
        # json.loads turns the text of one line into a dict
        dblist.append(json.loads(line))
If you choose to format the file correctly, the code gets nicer:
import json
dblist = []
with open('twitterdata.json', 'r') as f:
    dblist.extend(json.load(f))
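To answer the original question directly: each line read from the file is a str, and json.loads turns it into a dict. A minimal sketch of the whole flow follows. The helper names parse_json_lines and store_in_mongo are made up for illustration, and insert_many() is used because PyMongo has no insert_all() method. The database step assumes a MongoDB server reachable with the default MongoClient() settings.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of text lines, one JSON object per line, into dicts."""
    return [json.loads(line) for line in lines if line.strip()]


def store_in_mongo(docs, db_name="twitterdata", coll_name="twitterset"):
    """Insert parsed documents; needs pymongo and a reachable MongoDB server."""
    # Imported lazily so the parsing part works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient()
    result = client[db_name][coll_name].insert_many(docs)
    return result.inserted_ids


# Two sample lines standing in for the contents of twitterdata.json.
sample = ['{"id": 1, "value": "abc"}\n', '{"id": 2, "value": "xyz"}\n']
docs = parse_json_lines(sample)
print(type(sample[0]).__name__, type(docs[0]).__name__)  # str dict
```

With a real file, something like store_in_mongo(parse_json_lines(open('twitterdata.json'))) would be the equivalent call, assuming the server is running.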

Related

How can I loop through an array of arrays in a JSON object in Python?

https://github.com/Asabeneh/30-Days-Of-Python/blob/ff24ab221faaec455b664ad5bbdc6e0de76c3caf/data/countries_data.json
How can I loop through this countries_data.json file (see link above) to get 'languages'?
I have tried:
import json
f = open("countries_data.json")
file = f.read()
# print(file)
for item in file:
    print(item)
You have almost everything set up, but you didn't parse the JSON: f.read() returns a string, so the loop iterates over its characters one by one. There is also a double space in "f =  open", which is harmless. You also didn't pass the read mode to open(), but "r" is the default, so it isn't required.
Correct code:
import json
f = open("countries_data.json", "r")
file = json.loads(f.read())
for item in file:
    print(item)
Hope this helped, always double check your code.
You import the json module at the beginning, so you might as well use it.
If you go to the documentation, you will see a function (json.load) that reads this file directly.
In the end you get a list of dictionaries, so the code can be summarized as follows.
import json
with open("test/countries_data.json") as file:
    data = json.load(file)
for item in data:
    print(item["languages"])
You are missing one essential step: parsing the JSON data into Python data structures.
import json
# read file
f = open("countries.json")
# parse JSON into Python data structures
countries = json.load(f)
# now you have a list of countries
print(type(countries))
# loop through list of countries
for country in countries:
    # you can access languages with country["languages"]; JSON objects are Python dictionaries now
    print(type(country))
    for language in country["languages"]:
        print(language)
f.close()
Expected output:
<class 'list'>
<class 'dict'>
Pashto
Uzbek
Turkmen
...
You can use the built-in json package to deserialize the content of that file.
A usage sample:
import json
data = """[
{
"name": "Afghanistan",
"capital": "Kabul",
"languages": [
"Pashto",
"Uzbek",
"Turkmen"
],
"population": 27657145,
"flag": "https://restcountries.eu/data/afg.svg",
"currency": "Afghan afghani"
},
{
"name": "Åland Islands",
"capital": "Mariehamn",
"languages": [
"Swedish"
],
"population": 28875,
"flag": "https://restcountries.eu/data/ala.svg",
"currency": "Euro"
}]"""
# deserializing
print(json.loads(data))
For more complex content, have a look at json.JSONDecoder in the documentation.
EDIT:
import json
path = ...  # my file (set this to the path of your JSON file)
with open(path, 'r') as fd:
    # iterate over the dictionaries
    for d in json.loads(fd.read()):
        print(d['languages'])
EDIT: extra - top 10 languages
import json
import itertools as it
path = ...  # path to file (set this to the path of your JSON file)
with open(path, 'r') as fd:
    text = fd.read()
# flatten the per-country language lists into one list
languages_from_file = list(it.chain(*(d['languages'] for d in json.loads(text))))
# get unique "list" of languages
languages_all = set(languages_from_file)
# count the repeated languages
languages_count = {l: languages_from_file.count(l) for l in languages_all}
# order them per descending value
top_ten_languages = sorted(languages_count.items(), key=lambda k: k[1], reverse=True)[:10]
print(top_ten_languages)
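The counting step above calls list.count once per language, which rescans the whole list each time. As an alternative sketch, collections.Counter does the same tally in one pass. The function name top_languages and the three sample countries are invented for illustration; the real data would come from json.load on the file.

```python
from collections import Counter


def top_languages(countries, n=10):
    """Return the n most common languages as (language, count) pairs."""
    counts = Counter(
        language
        for country in countries
        for language in country["languages"]
    )
    return counts.most_common(n)


# Hypothetical stand-in for the parsed countries_data.json content.
sample = [
    {"name": "X", "languages": ["English", "French"]},
    {"name": "Y", "languages": ["English"]},
    {"name": "Z", "languages": ["English", "French", "Swedish"]},
]
print(top_languages(sample, n=2))  # [('English', 3), ('French', 2)]
```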

Reading a json file that has multiple lines

I have a function that I apply to a json file. It works if it looks like this:
import json
def myfunction(dictionary):
    # does things
    return new_dictionary
data = """{
"_id": {
"$oid": "5e7511c45cb29ef48b8cfcff"
},
"description": "some text",
"startDate": {
"$date": "5e7511c45cb29ef48b8cfcff"
},
"completionDate": {
"$date": "2021-01-05T14:59:58.046Z"
},
"videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]
}"""
info = json.loads(data)
refined = myfunction(info)
new_data = json.dumps(refined)
print(new_data)
However, I need to apply it to a whole file, and the input looks like this (there are multiple elements, not separated by commas, one after another):
{"_id":{"$oid":"5f06cb272cfede51800b6b53"},"company":{"$oid":"5cdac819b6d0092cd6fb69d3"},"name":"SomeName","videos":[{"$oid":"5ecf6cc19ad2a4dfea993fed"}]}
{"_id":{"$oid":"5ddb781fb4a9862c5fbd298c"},"company":{"$oid":"5d22cf72262f0301ecacd706"},"name":"SomeName2","videos":[{"$oid":"5dd3f09727658a1b9b4fb5fd"},{"$oid":"5d78b5a536e59001a4357f4c"},{"$oid":"5de0b85e129ef7026f27ad47"}]}
How could I do this? I tried opening and reading the file, using load and dump instead of loads and dumps, and it still doesn't work. Do I need to read, or iterate over every line?
You are dealing with the NDJSON (newline-delimited JSON) data format.
You have to read the whole data string, split it into lines, and parse each line as a JSON object, resulting in a list of dicts:
import json
def parse_ndjson(data):
    # skip blank lines, which json.loads cannot parse
    return [json.loads(l) for l in data.splitlines() if l.strip()]
with open('C:\\Users\\test.json', 'r', encoding="utf8") as handle:
    data = handle.read()
dicts = parse_ndjson(data)
for d in dicts:
    new_d = myfunction(d)
    print("New dict", new_d)
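If the file is large, reading it all into memory first is not necessary, since a file handle can be iterated line by line. The sketch below shows that variant; iter_ndjson is a made-up helper name, io.StringIO stands in for the open file, and the two sample records are shortened versions of the ones in the question.

```python
import io
import json


def iter_ndjson(lines):
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open file handle here.
fake_file = io.StringIO(
    '{"name": "SomeName", "videos": [{"$oid": "5ecf6cc19ad2a4dfea993fed"}]}\n'
    "\n"
    '{"name": "SomeName2", "videos": []}\n'
)
records = list(iter_ndjson(fake_file))

# Write the results back out in the same one-object-per-line format.
output = "\n".join(json.dumps(r) for r in records)
print(len(records))
```

The transformation function from the question would be applied to each record inside the loop or list comprehension, before json.dumps.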

KeyError occurs while reading a JSON .txt file into a DataFrame

I had code that gave me an empty DataFrame with no saved tweets.
I tried to debug it by putting print(line) under both "for line in json_file:" and "json_data = json.loads(line)".
That resulted in a KeyError.
How do I fix it?
Thank you.
import json
import pandas as pd
list_df = list()
# read the .txt file, line by line, and append the json data in each line to the list
with open('tweet_json.txt', 'r') as json_file:
    for line in json_file:
        print(line)
        json_data = json.loads(line)
        print(line)
        tweet_id = json_data['tweet_id']
        fvrt_count = json_data['favorite_count']
        rtwt_count = json_data['retweet_count']
        list_df.append({'tweet_id': tweet_id,
                        'favorite_count': fvrt_count,
                        'retweet_count': rtwt_count})
# create a pandas DataFrame using the list
df = pd.DataFrame(list_df, columns = ['tweet_id', 'favorite_count', 'retweet_count'])
df.head()
Your comment says you're trying to save to a file, but your code kind of says that you're trying to read from a file. Here are examples of how to do both:
Writing to JSON
import json
import pandas as pd
content = {  # This is just dummy data, in the form of a dictionary
    "tweet1": {
        "id": 1,
        "msg": "Yay, first!"
    },
    "tweet2": {
        "id": 2,
        "msg": "I'm always second :("
    }
}
# Write it to a file called "tweet_json.txt" in JSON
with open("tweet_json.txt", "w") as json_file:
    json.dump(content, json_file, indent=4)  # indent=4 is optional, it makes it easier to read
Note the w (as in write) in open("tweet_json.txt", "w"). You're using r (as in read), which doesn't give you permission to write anything. Also note the use of json.dump() rather than json.load(). We then get a file that looks like this:
$ cat tweet_json.txt
{
    "tweet1": {
        "id": 1,
        "msg": "Yay, first!"
    },
    "tweet2": {
        "id": 2,
        "msg": "I'm always second :("
    }
}
Reading from JSON
Let's read the file that we just wrote, using pandas read_json():
import pandas as pd
df = pd.read_json("tweet_json.txt")
print(df)
Output looks like this:
>>> df
          tweet1                tweet2
id             1                     2
msg  Yay, first!  I'm always second :(
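Neither example above addresses the KeyError itself. A KeyError on json_data['tweet_id'] means that at least one parsed line has no such key. The following is a hedged sketch for finding out which keys each record actually has: the helper name collect_rows and the two sample records are invented for illustration, and the real key names in tweet_json.txt may differ.

```python
import json


def collect_rows(lines, fields):
    """Parse one JSON object per line; separate complete rows from problem lines."""
    rows, problems = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = [f for f in fields if f not in record]
        if missing:
            # Remember the line number, the missing keys, and the keys present.
            problems.append((number, missing, sorted(record)))
        else:
            rows.append({f: record[f] for f in fields})
    return rows, problems


# Hypothetical records; the second one uses "id" instead of "tweet_id".
sample_lines = [
    '{"tweet_id": 1, "favorite_count": 10, "retweet_count": 2}',
    '{"id": 2, "favorite_count": 5, "retweet_count": 1}',
]
fields = ["tweet_id", "favorite_count", "retweet_count"]
rows, problems = collect_rows(sample_lines, fields)
print(rows)
print(problems)
```

The rows list can then be passed to pd.DataFrame(rows), and the problems list shows which lines need a different key name.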

Can't access JSON loaded with json.dumps(json.loads(input))

Suppose I have json data like this.
{"id": {"$oid": "57dbv34346"}, "from": {"$oid": "57dbv34346sbgwe"}, "type": "int"}
{"id": {"$oid": "57dbv34345"}, "from": {"$oid": "57dbv34345sbgwe"}, "type": "int"}
I wrote a script like this in python
import json
with open('klinks_buildson.json', 'r') as f:
    for line in f:
        distros_dict = json.dumps(json.loads(line), sort_keys=True, indent=4)
        print distros_dict['from']
        print "\n"
But it gives me an error:
print distros_dict['from']
TypeError: string indices must be integers, not str
I want the value of "from" from both lines.
If the whole file is valid JSON (for example, a single array of objects), you don't need to parse it line by line; you can load the file in one call:
with open('klinks_buildson.json', 'r') as f:
    data = json.load(f)
Now data is a list, where each item is an object. You can iterate through it:
for row in data:
    print(row['from'])
With one object per line, as in your sample, json.load would raise an error, so keep reading line by line. To fix your immediate problem, remove json.dumps, which converts an object to a string and is not what you want here:
distros_dict = json.loads(line)
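To make the difference concrete, here is a small sketch using the two sample lines from the question in place of the file. json.loads gives a dict that can be indexed by key, while json.dumps gives back a str, which is what triggered the TypeError. The variable names are chosen for illustration only.

```python
import json

# The two sample lines from the question, standing in for the file contents.
lines = [
    '{"id": {"$oid": "57dbv34346"}, "from": {"$oid": "57dbv34346sbgwe"}, "type": "int"}',
    '{"id": {"$oid": "57dbv34345"}, "from": {"$oid": "57dbv34345sbgwe"}, "type": "int"}',
]

from_values = []
for line in lines:
    record = json.loads(line)               # dict: can be indexed by key
    as_text = json.dumps(record, indent=4)  # str: only useful for display
    assert isinstance(as_text, str)
    from_values.append(record["from"])

print(from_values)
```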

Python CSV to JSON W/ Array Output

I'm trying to take data from a CSV and put it in a top-level array in JSON format.
Currently I am running this code:
import csv
import json
csvfile = open('music.csv', 'r')
jsonfile = open('file.json', 'w')
fieldnames = ("ID","Artist","Song", "Artist")
reader = csv.DictReader( csvfile, fieldnames)
for row in reader:
    json.dump(row, jsonfile)
    jsonfile.write('\n')
The CSV file is formatted as so:
| 1 | Empire of the Sun | We Are The People | Walking on a Dream |
| 2 | M83 | Steve McQueen | Hurry Up We're Dreaming |
Where = Column 1: ID | Column 2: Artist | Column 3: Song | Column 4: Album
And getting this output:
{"Song": "Empire of the Sun", "ID": "1", "Artist": "Walking on a Dream"}
{"Song": "M83", "ID": "2", "Artist": "Hurry Up We're Dreaming"}
I'm trying to get it to look like this though:
{
    "Music": [
        {
            "id": 1,
            "Artist": "Empire of the Sun",
            "Name": "We are the People",
            "Album": "Walking on a Dream"
        },
        {
            "id": 2,
            "Artist": "M83",
            "Name": "Steve McQueen",
            "Album": "Hurry Up We're Dreaming"
        }
    ]
}
Pandas solves this really simply. First, read the file:
import pandas
df = pandas.read_csv('music.csv', names=("id","Artist","Song", "Album"))
Now you have some options. The quickest way to get a proper json file out of this is simply
df.to_json('file.json', orient='records')
Output:
[{"id":1,"Artist":"Empire of the Sun","Song":"We Are The People","Album":"Walking on a Dream"},{"id":2,"Artist":"M83","Song":"Steve McQueen","Album":"Hurry Up We're Dreaming"}]
This doesn't handle the requirement that you want it all in a "Music" object or the order of the fields, but it does have the benefit of brevity.
To wrap the output in a Music object, we can use to_dict:
import json
with open('file.json', 'w') as f:
    json.dump({'Music': df.to_dict(orient='records')}, f, indent=4)
Output:
{
    "Music": [
        {
            "id": 1,
            "Album": "Walking on a Dream",
            "Artist": "Empire of the Sun",
            "Song": "We Are The People"
        },
        {
            "id": 2,
            "Album": "Hurry Up We're Dreaming",
            "Artist": "M83",
            "Song": "Steve McQueen"
        }
    ]
}
I would advise you to reconsider insisting on a particular order for the fields since the JSON specification clearly states "An object is an unordered set of name/value pairs" (emphasis mine).
Alright, this is untested, but try the following:
import csv
import json
from collections import OrderedDict
fieldnames = ("ID", "Artist", "Song", "Album")
entries = []
# the with statement is better since it handles closing your file properly after usage.
with open('music.csv', 'r') as csvfile:
    # before Python 3.7 the standard dict did not guarantee any order,
    # but if you write into an OrderedDict, the order of write operations is kept in the output.
    reader = csv.DictReader(csvfile, fieldnames)
    for row in reader:
        entry = OrderedDict()
        for field in fieldnames:
            entry[field] = row[field]
        entries.append(entry)
output = {
    "Music": entries
}
with open('file.json', 'w') as jsonfile:
    json.dump(output, jsonfile)
    jsonfile.write('\n')
Your logic is in the wrong order. json is designed to convert a single object into JSON, recursively. So you should always be thinking in terms of building up a single object before calling dump or dumps.
First collect it into an array:
music = [r for r in reader]
Then put it in a dict:
result = {'Music': music}
Then dump to JSON:
json.dump(result, jsonfile)
Or all in one line:
json.dump({'Music': [r for r in reader]}, jsonfile)
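Put together, the build-one-object-then-dump-once idea might look like the sketch below. It assumes a plain comma-separated layout for music.csv (io.StringIO stands in for the file), uses the field names from the desired output in the question, and converts the id to an integer, which is an extra step not shown in the snippets above.

```python
import csv
import io
import json

# io.StringIO stands in for music.csv; a plain comma-separated layout is assumed.
csv_text = (
    "1,Empire of the Sun,We Are The People,Walking on a Dream\n"
    "2,M83,Steve McQueen,Hurry Up We're Dreaming\n"
)
fieldnames = ("id", "Artist", "Name", "Album")

reader = csv.DictReader(io.StringIO(csv_text), fieldnames=fieldnames)
music = []
for row in reader:
    row["id"] = int(row["id"])  # the desired output has numeric ids
    music.append(row)

result = {"Music": music}
text = json.dumps(result, indent=2)  # one call, one valid JSON document
print(text)
```

Because dumps is called exactly once on the finished object, the commas between the entries and the enclosing "Music" array come out automatically.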
"Ordered" JSON
If you really care about the order of object properties in the JSON (even though you shouldn't), you shouldn't use the DictReader. Instead, use the regular reader and create OrderedDicts yourself:
from collections import OrderedDict
...
reader = csv.reader(csvfile)
music = [OrderedDict(zip(fieldnames, r)) for r in reader]
Or in a single line again:
json.dump({'Music': [OrderedDict(zip(fieldnames, r)) for r in reader]}, jsonfile)
Other
Also, use context managers for your files to ensure they're closed properly:
with open('music.csv', 'r') as csvfile, open('file.json', 'w') as jsonfile:
    # Rest of your code inside this block
It didn't write to the JSON file in the order I would have liked
csv.DictReader returns its rows as dict objects. In older Python versions (before 3.7), dictionaries were unordered collections, so you had no control over their presentation order.
Python does provide an OrderedDict, which you can use if you avoid using csv.DictReader().
and it skipped the song name altogether.
This is because the file is not really a CSV file. In particular, each line begins and ends with the field separator. We can use .strip("|") to fix this.
I need all this data to be output into an array named "Music"
Then the program needs to create a dict with "Music" as a key.
I need it to have commas after each artist info. In the output I get I get
This problem occurs because you call json.dump() multiple times. You should call it only once if you want a valid JSON file.
Try this:
import csv
import json
from collections import OrderedDict
def MyDictReader(fp, fieldnames):
    # strip whitespace and the leading/trailing "|" from each line
    fp = (x.strip().strip('|').strip() for x in fp)
    reader = csv.reader(fp, delimiter="|")
    # strip the padding around each field
    reader = ([field.strip() for field in row] for row in reader)
    dict_reader = (OrderedDict(zip(fieldnames, row)) for row in reader)
    return dict_reader
fieldnames = ("ID","Artist","Song", "Album")
# open both files in one with statement so they are closed properly
with open('music.csv', 'r') as csvfile, open('file.json', 'w') as jsonfile:
    reader = MyDictReader(csvfile, fieldnames)
    json.dump({"Music": list(reader)}, jsonfile, indent=2)
