how to transfer .txt to MongoDB? - python

I would like to ask how to use Python to import a .txt file into MongoDB.
The .txt file is huge (ca. 800 MB) but has a simple data structure:
title1...TAB...text1text1text1text1text1text1\n
title2...TAB...text2text2text2text2text2text2\n
title3...TAB...text3text3text3text3text3text3\n
The ...TAB... means there is a tab character, or a big space. (Sorry, I don't know exactly how to describe it.)
The desired MongoDB document should look like:
{
    "title": title1,
    "description": text1text1text1text1text1text1\n,
    "extra": EMPTY
}
... and so on.
I tried the code from "storing full text from txt file into mongodb":
from pymongo import MongoClient
client = MongoClient()
db = client.test_database # use a database called "test_database"
collection = db.files # and inside that DB, a collection called "files"
f = open('F:\\ttt.txt') # open a file
text = f.read() # read the entire contents, should be UTF-8 text
# build a document to be inserted
text_file_doc = {"file_name": "F:\\ttt.txt", "contents" : text }
# insert the contents into the "file" collection
collection.insert(text_file_doc)
To be honest, as a newbie I don't quite understand what the code means. So it is no surprise that the code above doesn't work for my purpose.
Could anybody please help me out of this problem? Any help will be highly appreciated!

It boils down to how your input file is formatted.
If it consistently follows the format you outlined, i.e. there are no tab/whitespace characters in the title portion and the "extra" field is always empty, you could go for something like this:
# your mongo stuff (MongoClient, database, collection) goes here
file_content = []
with open("ttt.txt") as f:
    for line in f:
        # assuming tabs and not multiple space characters
        title, desc = line.strip().split("\t", maxsplit=1)
        file_content.append({"title": title, "description": desc, "extra": None})
# insert_many takes the list of dicts directly (no json.dumps needed)
collection.insert_many(file_content)
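Since the file is around 800 MB, building one list with every document can itself use a lot of memory. A minimal sketch of inserting in batches instead, assuming the same tab-separated format (the batch size and file path are placeholders):

from pymongo import MongoClient

client = MongoClient()
collection = client.test_database.files

BATCH_SIZE = 1000  # arbitrary; tune to taste
batch = []
with open("ttt.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        title, desc = line.strip().split("\t", maxsplit=1)
        batch.append({"title": title, "description": desc, "extra": None})
        if len(batch) >= BATCH_SIZE:
            collection.insert_many(batch)  # flush the current batch
            batch = []
if batch:
    collection.insert_many(batch)  # insert whatever is left over

This keeps at most BATCH_SIZE documents in memory at a time instead of the whole file.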

Related

Power BI (PBIX) - Parsing Layout file

I am trying to document the reports, visuals and measures used in a PBIX file. I have a PBIX file (containing some visuals and pointing to a Tabular Model in Live Mode); I exported it as a PBIT and renamed it to .zip. Inside this zip file there is a folder called Report, and within that a file called Layout. The Layout file looks like a JSON file, but when I try to read it via Python,
import json
# Opening JSON file
f = open("C://Layout",)
# returns JSON object as
# a dictionary
#f1 = str.replace("\'", "\"")
data = json.load(f)
I get the below issue:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Renaming it to Layout.json doesn't help either and gives the same issue. Is there an easy way, or a parser, to specifically parse this Layout file and get the below information out of it?
Report Name | Visual Name | Column or Measure Name
Not sure if you have come across an answer for your question yet, but I have been looking into something similar.
Here is what I have had to do in order to get the file to parse correctly.
The big items to note here are the encoding and the control-character replacements. After this runs, data will contain the parsed object.
with open('path/to/Layout', 'r', encoding="cp1252") as json_file:
    data_str = json_file.read().replace(chr(0), "").replace(chr(28), "").replace(chr(29), "").replace(chr(25), "")
    data = json.loads(data_str)
This script may help: https://github.com/grenzi/powerbi-model-utilization
a portion of the script is:
import json
import zipfile

def get_layout_from_pbix(pbixpath):
    """
    get_layout_from_pbix loads a pbix file, grabs the layout from it, and returns json
    :parameter pbixpath: file to read
    :return: json goodness
    """
    archive = zipfile.ZipFile(pbixpath, 'r')
    bytes_read = archive.read('Report/Layout')
    s = bytes_read.decode('utf-16-le')
    json_obj = json.loads(s, object_hook=parse_pbix_embedded_json)
    return json_obj
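For a self-contained variant of the same idea without that repository's object_hook (parse_pbix_embedded_json is defined elsewhere in the linked script), here is a minimal sketch; the file name is a placeholder:

import json
import zipfile

def read_pbix_layout(pbixpath):
    # a PBIX/PBIT file is a zip archive; the report layout sits at Report/Layout
    with zipfile.ZipFile(pbixpath, 'r') as archive:
        raw = archive.read('Report/Layout')
    # the Layout entry is UTF-16-LE encoded JSON
    return json.loads(raw.decode('utf-16-le'))

layout = read_pbix_layout('MyReport.pbit')  # placeholder path
print(type(layout), str(layout)[:200])      # peek at what came back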
I had a similar issue. My workaround was to save it as Layout.txt with UTF-8 encoding, then continue as you have.

Retrieve document '_id' of a GridFS document, by its 'filename'

I am currently working on a project in which I must retrieve a document uploaded to a MongoDB database using GridFS and store it in my local directory.
Up to now I have written these lines of code:
if not fs.exists({'filename': 'my_file.txt'}):
    CRAWLED_FILE = os.path.join(SAVING_FOLDER, 'new_file.txt')
else:
    file = fs.find_one({'filename': 'my_file.txt'})
    CRAWLED_FILE = os.path.join(SAVING_FOLDER, 'new_file.txt')
    with open(CRAWLED_FILE, 'wb') as f:
        f.write(file.read())
        f.close()
I believe that find_one doesn't let me write the content of the file stored in the database into a new file. f.write(file.read()) writes into the file just created (new_file.txt) the path of the directory in which new_file.txt is stored! So I get a txt completely different from the one I uploaded to the database, and the only line in it is: E:\\my_folder\\sub_folder\\my_file.txt
It's kind of weird, I don't even know why it is happening.
I thought it could work if I used the fs.get(ObjectId(ID)) method, which, according to the official documentation of PyMongo and GridFS, provides a file-like interface for reading. However, I only know the name of the txt saved in the database; I have no clue what its ObjectId is, and keeping a list or dict of all my documents' IDs wouldn't be worthwhile. I have checked many posts here on StackOverflow and everyone suggests using subscription. Basically you create a cursor using fs.find(), then you can iterate over the cursor, for example like this:
for x in fs.find({'filename': 'my_file.txt'}):
    ID = x['_id']
See, many answers here suggest I do something like the above; the only problem is that the Cursor object is not subscriptable, and I have no clue how to resolve this issue.
I must find a way to get the document's '_id' given its filename, so I can later use it with fs.get(ObjectId(ID)).
Hope you can help me, thank you a lot!
Matteo
You can just access it like this:
ID = x._id
But "_" is a protected member in Python, so I was looking around for other solutions (could not find much). For getting just the ID, you could do:
for ID in fs.find({'filename': 'my_file.txt'}).distinct('_id'):
    ...  # do something with ID
Since that only gets the IDs, you would probably need to do:
query = fs.find({'filename': 'my_file.txt'}).limit(1)  # equivalent to find_one
content = next(query, None)  # iterate the GridOutCursor; yields either one element or None
if content:
    ID = content._id
    ...
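Putting this together with the original goal of saving the stored file locally, a minimal sketch assuming a connected GridFS instance (the database name and paths are placeholders):

import os
from pymongo import MongoClient
from gridfs import GridFS

db = MongoClient().my_database              # placeholder database name
fs = GridFS(db)
SAVING_FOLDER = 'downloads'                 # placeholder, as in the question

grid_file = fs.find_one({'filename': 'my_file.txt'})
if grid_file is not None:
    ID = grid_file._id                      # the ObjectId you were after
    local_path = os.path.join(SAVING_FOLDER, 'new_file.txt')
    with open(local_path, 'wb') as out:
        out.write(fs.get(ID).read())        # fs.get() returns a file-like GridOut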

How to create collections dynamically and insert each JSON file's data into its own collection

Following SO suggestions, I tried inserting a bunch of JSON files' data into a single collection like this:
import json
import pymongo
client = pymongo.MongoClient('mongodb+srv://********:*******#cluster0-kb3os.mongodb.net/test?retryWrites=true&w=majority')
db = client['mydb']
test = db['test']
Then I have the JSON files a.json, b.json, ..., z.json. To insert all of these into a single collection, I did it this way:
file_names = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']
json_file_names = [x + '.json' for x in file_names]
for file_name in json_file_names:
    with open(file_name, encoding="utf8") as f:
        file_data = json.load(f)
        for word in file_data:
            word_obj = file_data[word]
            test.insert_one(word_obj)
When querying for results that refer to a particular letter's records, I guess it would be good to have a separate collection per letter, which might also improve performance compared to searching the whole collection.
I have been looking at how to create collections dynamically, such that collection a_col has a.json's data inserted, b_col has b.json's, and so on.
Is there a way to do this? Any guiding links or snippets as an answer would be much appreciated. TIA
Inside your for loop, after reading each file, change the collection name according to the file name, e.g. test = db[file_name]. MongoDB will create the collection automatically the first time you insert data into it. A minimal sketch of this is shown below.
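The sketch reuses the variable names from the question; the connection string and the '_col' suffix are placeholders of my own, and insert_many does one bulk insert per file:

import json
import pymongo

client = pymongo.MongoClient('mongodb+srv://<user>:<password>@cluster0-kb3os.mongodb.net/test?retryWrites=true&w=majority')
db = client['mydb']

file_names = ['a', 'b', 'c']                  # shortened here; use the full alphabet as in the question
for file_name in file_names:
    collection = db[file_name + '_col']       # e.g. a.json -> collection "a_col", created on first insert
    with open(file_name + '.json', encoding="utf8") as f:
        file_data = json.load(f)
    docs = [file_data[word] for word in file_data]
    if docs:
        collection.insert_many(docs)          # one bulk insert per file/collection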

Python: Json.load large json file MemoryError

I'm trying to load a large JSON file (300 MB) to parse into Excel. I just started running into a MemoryError when I do json.load(file). Similar questions have been posted but have not been able to answer my specific question. I want to be able to return all the data from the JSON file in one block, like I did in the code. What is the best way to do that? The code and JSON structure are below.
The code looks like this.
def parse_from_file(filename):
    """ proceed to load the json file that given and verified,
    it and returns the data that was in the json file so it can actually be read
    Args:
        filename (string): full branch location, used to grab the json file plus '_metrics.json'
    Returns:
        data: whatever data is being loaded from the json file
    """
    print("STARTING PARSE FROM FILE")
    with open(filename) as json_file:
        d = json.load(json_file)
        json_file.close()
    return d
The structure looks like this.
[
    {
        "analysis_type": "test_one",
        "date": 1505900472.25,
        "_id": "my_id_1.1.1",
        "content": {
            ...
        }
    },
    {
        "analysis_type": "test_two",
        "date": 1605939478.91,
        "_id": "my_id_1.1.2",
        "content": {
            ...
        }
    },
    ...
]
Inside "content" the information is not consistent but has 3 distinct but different possible template that can be predicted based of analysis_type.
I did it like this; hope it helps you. You may also need to skip the first line ("[") and remove the "," at the end of a line if it ends with "},":
import ujson  # third-party package; the standard json module would also work here

with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = ujson.loads(line)
                break
            except ValueError:
                # not yet a complete JSON value; pull in the next line
                line += next(f)
        # do something with jfile
If all the tested libraries are giving you memory problems, my approach would be to split the file into one file per object inside the array.
If the file has the newlines and padding you showed in the OP, I would read it line by line, discarding lines that are just [ or ], and writing the accumulated lines to a new file every time you reach a closing "}," (where you also need to remove the comma). Then try to load every file and print a message when you finish reading each one, to see where it fails, if it does; a rough sketch of this is shown below.
If the file has no newlines or is not properly padded, you would need to read char by char, keeping two counters, increasing them when you find [ or { and decreasing them when you find ] or } respectively. Also take into account that you may need to ignore any curly or square bracket that is inside a string, though that may not be needed.
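A minimal sketch of that splitting idea, combining line-by-line reading with bracket counting to find where each top-level object ends. It assumes, as noted above, that no braces or brackets appear inside string values; the file and prefix names are placeholders:

import json

def split_array_file(path, prefix="part"):
    """Split a pretty-printed JSON array into one file per top-level object."""
    depth = 0                  # nesting depth relative to the outer array
    buffer = []
    count = 0
    with open(path) as src:
        for line in src:
            if depth == 0:
                if line.strip() == "[":   # opening bracket of the outer array
                    depth = 1
                continue
            buffer.append(line)
            # naive depth tracking; assumes no brackets inside string values
            depth += line.count("{") + line.count("[")
            depth -= line.count("}") + line.count("]")
            if depth == 1:                # back at array level: one object is complete
                text = "".join(buffer).rstrip().rstrip(",")
                piece = f"{prefix}_{count}.json"
                with open(piece, "w") as out:
                    out.write(text)
                json.loads(text)          # sanity check: the piece parses on its own
                print(f"{piece} parsed OK")
                buffer = []
                count += 1
    return count

split_array_file("my_metrics.json")       # placeholder file name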

Getting the content of .CSV cell

I'm having trouble reading a .CSV file even though I have tried to read the online Python docs.
The thing is, I have been using the xlrd module in Python to read through .xls files and it went superbly.
Now I want to try with .CSV, but I find things much more complicated.
When I wanted Python to return the content of a cell (i, j), I used sheet.cell(i, j).value and it worked; that was it.
It's a ";" delimited csv.
Something like :
Ref;A;B;C;D;E;f
P;x1;x2;x3;x4...
L;y1;y2;y3
M;z1...
N;w1 ...
I want to display a list box containing A, B, C, D ...
And bind this list to a Cur_Selection function that will do some calculations with the x, y, z, w values of a selected ref A, B, C, D ...
That was very easy in xlrd. I don't get it here.
Can someone help?
Are you asking how to access the data in the CSV? I typically parse CSVs with a simple function using string manipulation methods. It works for me with rather small CSV files which I generate in Excel.
def parse_csv(content, delimiter=';'):
    csv_data = []
    for line in content.split('\n'):
        csv_data.append([x.strip() for x in line.split(delimiter)])  # strips spaces also
    return csv_data

content = open(uri, 'r').read()
list_data = parse_csv(content)
print(list_data[2][1])
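For the use case in the question, the standard library csv module also handles a ";" delimiter directly; here is a minimal sketch (the file name is a placeholder), where rows[i][j] plays the role of cell (i, j):

import csv

# read the whole sheet into a list of rows, each row a list of cell strings
with open('data.csv', newline='') as f:    # 'data.csv' is a placeholder
    rows = list(csv.reader(f, delimiter=';'))

header = rows[0]    # e.g. ['Ref', 'A', 'B', 'C', 'D', 'E', 'f']
print(rows[1][1])   # content of cell (1, 1), i.e. 'x1' in the sample above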
