Retrieve document '_id' of a GridFS document, by its 'filename' - python

I am currently working on a project in which I must retrieve a document uploaded to a MongoDB database using GridFS and store it in my local directory.
So far I have written these lines of code:
if not fs.exists({'filename': 'my_file.txt'}):
    CRAWLED_FILE = os.path.join(SAVING_FOLDER, 'new_file.txt')
else:
    file = fs.find_one({'filename': 'my_file.txt'})
    CRAWLED_FILE = os.path.join(SAVING_FOLDER, 'new_file.txt')
    with open(CRAWLED_FILE, 'wb') as f:
        f.write(file.read())
I believe find_one doesn't let me write the content of the file stored in the database into a new file. f.write(file.read()) writes into the newly created file (new_file.txt) the path where the original file was stored! So I get a txt completely different from the one I uploaded to the database, and its only line is: E:\\my_folder\\sub_folder\\my_file.txt
It's odd, and I don't even know why it is happening.
I thought it could work if I used the fs.get(ObjectId(ID)) method, which, according to the official documentation of PyMongo and GridFS, provides a file-like interface for reading. However, I only know the name of the txt saved in the database; I have no clue what its ObjectId is, and keeping a list or dict of all my documents' IDs wouldn't be worthwhile. I have checked many posts here on Stack Overflow, and everyone suggests subscripting: you create a cursor using fs.find() and then iterate over it, for example like this:
for x in fs.find({'filename': 'my_file.txt'}):
    ID = x['_id']
Many answers here suggest doing exactly that; the only problem is that the Cursor object is not subscriptable, and I have no clue how to resolve this issue.
I must find a way to get a document's '_id' given its filename, so I can later use it with fs.get(ObjectId(ID)).
Hope you can help me, thank you a lot!
Matteo

You can just access it like this:
ID = x._id
But "_" is a protected member in Python, so I was looking around for other solutions (could not find much). For getting just the ID, you could do:
for ID in fs.find({'filename': 'my_file.txt'}).distinct('_id'):
    # do something with ID
Since that only gets the IDs, you would probably need to do:
query = fs.find({'filename': 'my_file.txt'}).limit(1)  # equivalent to find_one
content = next(query, None)  # iterate the GridOutCursor; yields either one element or None
if content:
    ID = content._id
    ...
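For completeness, here is a minimal sketch of the whole round trip under the same assumptions as above; the database name my_database and the downloads folder are placeholders:
import os
import gridfs
from pymongo import MongoClient

client = MongoClient()
fs = gridfs.GridFS(client.my_database)  # 'my_database' is a placeholder name

SAVING_FOLDER = 'downloads'  # placeholder local folder

grid_out = fs.find_one({'filename': 'my_file.txt'})
if grid_out is not None:
    ID = grid_out._id  # the ObjectId you were looking for
    target = os.path.join(SAVING_FOLDER, 'new_file.txt')
    with open(target, 'wb') as f:
        f.write(fs.get(ID).read())  # fs.get returns a file-like GridOut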

Related

pymongo: insert_one() is running but isn't adding anything to mongodb database?

I'm trying to upload a .txt file to a MongoDB collection using PyCharm, but nothing appears inside the collection. Here's the script I'm using at the moment:
from pymongo import MongoClient

client = MongoClient()
db = client.memorizer_data  # use a database called "memorizer_data"
collection = db.english  # and inside that DB, a collection called "english"

with open('7_1_1.txt', 'r') as f:
    text = f.read()  # read the txt file

name = '7_1_1.txt'
# build a document to be inserted
text_file_doc = {"file_name": name, "contents": text}
# insert the document into the "english" collection
collection.insert_one(text_file_doc)
PyCharm runs through the script with no errors. I've also tried printing the acknowledged attribute just to see what comes up:
result = collection.insert_one(text_file_doc)
print(result.acknowledged)
That gives me True. I wasn't sure if I was actually connecting to my database, so I tried db.list_collection_names(), and my collection 'english' is in the list, so as far as I can tell I am connecting to it.
I'm a newbie to MongoDB so I realize I've probably gone about things the wrong way. At the moment I'm just trying to get the script working for a single .txt file before uploading everything my project is using to the db.
What makes you think there's nothing in the collection? Two ways to check:
In your pymongo code, add a final debug line:
print(collection.find_one())
Or, in the mongodb shell:
use memorizer_data
db.english.findOne()
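If you want to check for the specific document rather than just any document, a variant of the same idea (field names taken from the script above):
# count and fetch the exact document the script should have inserted
print(db.english.count_documents({"file_name": "7_1_1.txt"}))
print(db.english.find_one({"file_name": "7_1_1.txt"}))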

How do I store dictionaries in a file and read/write that file?

I am using tkinter to manage the GUI for a note retrieval program. I can pull up my notes by typing a keyword and hitting Enter in a text field, but I would like to move my dictionary to a file so that my code is not filled up with a massive dictionary.
I have been looking around, but I am not sure how I would go about doing this.
I have the file in my directory. I know I can use open("filename", "mode") to open said file for reading, but how do I call each section of the notes?
For example what I do now is just call a keyword from my dictionary and have it write the definition for that keyword to a text box in my GUI. Can I do the same from the file?
How would I go about reading from the file the keyword and returning the definition to a variable or directly to the text box? For now I just need to figure out how to read the data. I think once I know that I can figure out how to write new notes or edit existing notes.
This is how I am set up now.
To call my function:
root.bind('<Return>', kw_entry)
How I return my definition to my text box
def kw_entry(event=None):
    e1Current = keywordEntry.get().lower()
    if e1Current in notes:
        root.text.delete(1.0, END)
        root.text.insert(tkinter.END, notes[e1Current])
        root.text.see(tkinter.END)
    else:
        root.text.delete(1.0, END)
        root.text.insert(tkinter.END, "Not a Keyword")
        root.text.see(tkinter.END)
Sounds like you'd need to load the dictionary into memory at init time and use it like a normal dictionary.
I am assuming your dictionary is a standard Python dict of strings, so I recommend using the standard-library json module.
The easiest way to do this is to export the dictionary once to a file as JSON, using something like:
with open(filename, 'w') as fp:
    json.dump(dictionary, fp)
and then change your code to load the dict at init time using:
with open(filename) as fp:
    dictionary = json.load(fp)
Alternatively, if your data is more complex than text, you can use the Python shelve module, a persistent, dictionary-like object to which you can pass any pickle-able object. Note that shelve has its drawbacks, so read the linked doc.
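For illustration, a minimal shelve sketch; the filename notes.db and the stored note are made-up examples:
import shelve

# writes go straight to disk; the file is created if it doesn't exist
with shelve.open('notes.db') as db:  # 'notes.db' is a placeholder filename
    db['python'] = 'Python is a dynamically typed language ...'

with shelve.open('notes.db') as db:
    print(db.get('python', 'Not a Keyword'))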
sqlitedict is a project providing a persistent dictionary using sqlite in the background. You can use it like a normal dictionary e.g. by assigning arbitrary (picklable) objects to it.
If you access an element from the dictionary, only the value you requested is loaded from disk.
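A short usage sketch, with a placeholder filename:
from sqlitedict import SqliteDict

# autocommit writes each assignment to disk immediately
with SqliteDict('notes.sqlite', autocommit=True) as db:  # placeholder filename
    db['python'] = 'Python is a dynamically typed language ...'
    print(db['python'])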

Modify a DBF file

I want to modify one column in a .dbf file using Python with this library http://pythonhosted.org/dbf/. Printing a column works just fine, but when I try to modify a column, I get this error:
unable to modify fields individually except in with or Process()
My code:
table = dbf.Table('./path/to/file.dbf')
table.open()
for record in table:
    record.izo = sys.argv[2]
table.close()
The docs recommend doing it like:
for record in Write(table):
But I also get an error:
name 'Write' is not defined
And:
record.write_record(column=sys.argv[2])
also gives me an error:
write_record - no such field in table
Thanks!
My apologies for the state of the docs. Here are a couple options that should work:
table = dbf.Table('./path/to/file.dbf')
# Process will open and close a table if not already opened
for record in dbf.Process(table):
    record.izo = sys.argv[2]
or
with dbf.Table('./path/to/file.dbf') as table:
    # with opens a table, closes when done
    for record in table:
        with record:
            record.izo = sys.argv[2]
I had been trying to make a change to my dbf file for several days and had searched and browsed several websites; this page was the only one that gave me a solution that worked. Just to add a little more information so that whoever lands here understands the piece of code that Ethan Furman shared:
import dbf

table = dbf.Table('your_dbf_filename.dbf')
# Process will open and close a table if not already opened
for record in dbf.Process(table):
    record.your_column_name = 'New_Value_of_that_column'
Now, because there is no condition here, you would end up updating all the rows of that column. Remember, this statement immediately writes the new value to the column, so the advice is to save a copy of the dbf file before making any edits to it.
I also tried the 2nd solution that Ethan mentions, but as originally posted (without the as table part) it threw an error that 'table' was not defined.
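Building on that comment, here is a sketch of updating only the rows that match a condition; the filter column city and its value are hypothetical:
import dbf
import sys

table = dbf.Table('your_dbf_filename.dbf')
# Process opens the table, writes each modified record, and closes it again
for record in dbf.Process(table):
    if record.city == 'Prague':  # hypothetical filter column and value
        record.izo = sys.argv[2]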

Getting MSI properties python

I have an MSI file from which I'm trying to extract some of the parameters shown in the Details tab of the file properties.
I found msilib, where SummaryInformation.GetProperty(field) looks like the way to go, but I don't understand how to use it: how do I 'connect' it to the existing MSI file rather than to one that is being created?
The MSI file contains both cab files and information in a database format.
See this link for more info about its structure and how to view it: MSI structure answer.
I have never used the Python msilib, but from reading the documentation my guess is this.
To get the db object, use something like:
dbobject = msilib.OpenDatabase(path, msilib.MSIDBOPEN_READONLY)
If you want something from the summary information, you can do something like:
info = dbobject.GetSummaryInformation(1)
prop = info.GetProperty(field)
If the information you need is in one of the db tables, you should run a SQL query against it:
view = dbobject.OpenView(sql)
view.Execute(params)
rec = view.Fetch()
str_val = rec.GetString(field)
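Putting those pieces together, a minimal sketch; the path installer.msi and the queried property name are assumptions:
import msilib

db = msilib.OpenDatabase('installer.msi', msilib.MSIDBOPEN_READONLY)  # placeholder path

# summary information stream (title, author, etc.)
info = db.GetSummaryInformation(0)  # 0 updated properties: we only read
print(info.GetProperty(msilib.PID_AUTHOR))

# the Property table holds entries such as ProductName and ProductVersion
view = db.OpenView("SELECT Value FROM Property WHERE Property = 'ProductName'")
view.Execute(None)
record = view.Fetch()
print(record.GetString(1))  # record fields are 1-indexed
view.Close()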

Passing a directory path to create a dictionary instead of a list.

Essentially I want to do a makeshift super word count, but I'm uncertain how to create a dict object from a directory path (passed in as an argument), as opposed to a list, to do what I need.
While creating the dictionary, I also want to turn the ASCII values of the keys, which are filenames, into email or message objects using the email module. Then I want to extract the body using the payload and parse it that way. I have an example below:
mylist = os.listdir(sys.stdin)
for emails in mylist:
    email_str = emails.open()
    # uncertain if this will get all emails and their content or not
    # all emails are supposed to have a unique identifier; they are essentially still just ascii
    file_dict = {emails: email_str}
    # file_dict = dict(zip(mylist, mylist))
for emails in file_dict[emails]:
    msg = email.message_from_string(email_str)
    body = msg.get_payload(decode=True)
    # I'm not entirely sure how message objects and sub objects work, but I want the header to
    # signature, and I'm not sure about the type of emails as far as header style
    # pretend I have a parsing method here that implements the word count and prints it as a dict:
    body.parse(regex)
I don't entirely need the keys other than to parse their values so I may consider using message_from_file instead.
You can use any string as a file path, and you can even use relative file paths. If you're just trying to format data for yourself, you could iterate through your list of email messages and store the output.
file_dict = {}
for emailpath in list_of_email_paths:
    # open the path -- only works if the path exists
    f = open(emailpath)
    file_dict[emailpath] = f.read()
    f.close()
It's not a great idea to use open file objects as keys (if it's even possible); just read them and store a string as an identifier. Read the docs on os.path for more (by the way, you need import os for os.listdir).
Aside from that, any immutable object or reference can be a dictionary key, so there's no problem with storing paths as keys. Python doesn't care where the path came from, nor does the dict care if its keys are paths ;)
Unfortunately, because you are asking to be shown so much information at once, my answer has to be a bit more general. Even though you stated that your example is purely pseudocode, it's all so completely wrong that it's hard to know which parts you understand and which you don't, so I will cover all the bases of what you said in your comments.
How to read files
You are misusing os.listdir, as it takes a string path, not a file-type object. Personally, I like to use glob for this; it saves a few steps by giving you the full path, filtered by a pattern. Let's assume all your email files end in .mail:
import sys
import glob

first_path = sys.argv[1]
pattern = "%s/*.mail" % first_path

for mail in glob.iglob(pattern):
    # the with context will close the file automatically
    with open(mail) as f:
        data = f.read()
        # do something with data here
Parsing email formats
The example for using the email module are extensive, so there is no point in me showing them here other than giving you a link to review: http://docs.python.org/library/email-examples.html
If the files are actually emails, then you should be able to use this module to parse them and read the message body of each one.
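As a rough sketch under that assumption, where data is the raw text of one file read as in the glob example above:
import email

msg = email.message_from_string(data)
if msg.is_multipart():
    # take the first part; real messages may nest more deeply
    body = msg.get_payload(0).get_payload(decode=True)
else:
    body = msg.get_payload(decode=True)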
Using a dictionary
Using a dictionary here is no different from any general use of a Python dict. You would start by creating an empty dict:
file_dict = {}
On every iteration of your directory listing you will have the string path name, which you wanted as your key. Whether you read the file's raw data as in the first example, or used the email module to get the message body, either way you end up with a chunk of text data.
for mail in glob.iglob(pattern):
    ...
    # do stuff to get the text data from the file
    data = some_approach_to_reading_file()
    ...
    file_dict[mail] = data
Now you have a file_dict with the key being the path to the original file, and the value being the read data.
Summary
With these three sections, you should have plenty of general information to put this all together.