Representing filesystem table - python

I'm working on a simple class, something like an "in-memory Linux-like filesystem", for educational purposes. Files will be StringIO objects. I can't decide how to implement the files-folders hierarchy in Python. I'm thinking about using a list of objects with fields: type, name, parent. What else? Maybe I should look at trees and graphs.
Update:
There will be these methods:
new_dir(path)
dir_list(path)
is_file(path)
is_dir(path)
remove(path)
read(file_descr)
file_descr = open(file_path, mode=w|r)
close(file_descr)
write(file_descr, str)

It's perfectly possible to represent a tree as a nested set of lists. However, since entries are typically indexed by name, and a directory is generally considered to be unordered, nested dictionaries would make many operations faster and easier to write.
I wouldn't store the parent for each entry though; that's implicit from its position in the hierarchy.
Also, if you want your virtual file system to efficiently support hard links, you need to separate a file's contents from the directory hierarchy. That way, you can re-use the contents by giving each piece of content any number of names, which is what hard linking does.
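For illustration, here is a minimal sketch of that separation, assuming a nested-dict directory tree; the Inode class and fs_root name are invented for this example:

from io import StringIO

class Inode(object):
    """Holds a file's contents and a link count, independent of any name."""
    def __init__(self):
        self.data = StringIO()
        self.nlink = 0

# The directory tree maps names to either a dict (subdirectory) or an Inode.
fs_root = {}
doc = Inode()
fs_root["home"] = {"a.txt": doc, "b.txt": doc}  # two hard links, one content
doc.nlink = 2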

Maybe you can try using networkx. You just have to use a little intuition to adapt it to files and folders.
A simple example
import os
import networkx as nx

G = nx.Graph()
for (path, dirs, files) in os.walk(os.getcwd()):
    # os.path.split(path) returns a (head, tail) tuple; we want just the name
    bname = os.path.basename(path)
    for f in files:
        G.add_edge(bname, f)
# Now do whatever you want with the graph

You should first ask the question: What operations should my "file system" support?
Based on the answer you select the data representation.
For example, if you choose to support only create and delete, and the order of the files in a directory is not important, then a Python dictionary is a good fit. A dictionary will map a file name (sub-path name) to either another dictionary or the file container object.
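A tiny sketch of that mapping (the file contents here are invented):

from io import StringIO

root = {
    "readme.txt": StringIO("hello"),           # a file container object
    "src": {"main.py": StringIO("print(1)")},  # a subdirectory is a nested dict
}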

What's the API of the filestore? Do you want to keep creation, modification and access times? Presumably the primary lookup will be by file name. Are any other retrieval operations anticipated?
If only lookup by name is required then one possible representation is to map the filestore root directory on to a Python dict. Each entry's key will be the filename, and the value will either be a StringIO object (hint: in Python 2 use cStringIO for better performance if it becomes an issue) or another dict. The StringIO objects represent your files, the dicts represent subdirectories.
So, to access any path you split it up into its constituent components (using .split("/")) and then use each to look up a successive element. Any KeyError exceptions imply "File or directory not found," as would any attempts to index a StringIO object (I'm too lazy to verify the specific exception).
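A minimal sketch of that lookup; the helper name resolve is invented here, and it assumes the nested-dict layout described above:

def resolve(root, path):
    """Walk the nested dicts; a missing component raises KeyError."""
    node = root
    for part in path.strip("/").split("/"):
        if part:  # skip empty components from leading or doubled slashes
            node = node[part]  # KeyError here means "File or directory not found"
    return node

(Indexing into a StringIO leaf this way raises TypeError, for the record.)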
If you want to implement greater detail then you would replace the StringIO objects and dicts with instances of some "filestore object" class. You could call it a "link" (since that's what it models: A Linux hard link). The various attributes of this object can easily be manipulated to keep the file attributes up to date, and the .data attribute can be either a StringIO object or a dict as before.
Overall I would prefer the second solution, since then it's easy to implement methods that do things like keep access times up to date by updating them as the operations are performed, but as I said much depends on the level of detail you want to provide.
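A rough sketch of such a "link" class; the names and fields are illustrative, not a fixed design:

import time
from io import StringIO

class Link(object):
    """A directory entry: file attributes plus contents or a subdirectory."""
    def __init__(self, is_dir=False):
        self.ctime = self.mtime = self.atime = time.time()
        self.data = {} if is_dir else StringIO()

    def touch(self):
        # called by read/write operations to keep the access time up to date
        self.atime = time.time()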

Related

Dynamic index name in pymongo

I'm about to add some data to the MongoDB Atlas service and would like the indexes to be dynamic based on the source. I am currently ingesting logs from different sources (they all end up as files, however) like so:
filenames = ["access_logs.log", "error.log", "stats.log"]
Currently I insert all of them into the same index:
db_client.logs.insert_many(data)
However, I want to use the filename (it is already being passed as a variable) as the index name. Is there a dynamic way of doing this?
I can of course write case-specific inserts and hardcode the name of the index for each case, but I was wondering if there is another, smarter way of doing this.
Basically I'd like to achieve something like this (please pardon the wrong syntax here; the $ is supposed to be the variable):
for filename in filenames:
    db_client.$filename.insert_many(data)
I suppose this question is more of a Python-specific question rather than one that is only applicable to pymongo. However, I am not sure how to phrase this specific requirement.
Any help is welcome.
To use a dynamic collection name, reference it in this format:
for filename in filenames:
    db_client[filename].insert_many(data)
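Since these names come from filenames, you may want to strip the extension first; a small sketch (the cleanup rule is just an assumption):

import os

for filename in filenames:
    coll_name = os.path.splitext(filename)[0]  # "access_logs.log" -> "access_logs"
    db_client[coll_name].insert_many(data)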

python mapping with syntax-checkable keys

I need to create a mapping of keys (strings, all valid Python identifiers) to values in Python (3.9).
All keys and values are constant and known at creation time, and I want to make sure that every single key has an associated value.
1. dict
The first idea that comes to mind for this would be using a dictionary, which comes with the big problem that keys (in my case) would be strings.
That means I have to retype the key in a string literal each time a value is accessed manually, so IDEs and type checkers can't spot typos or suggest key names in autocomplete, and I can't use their utility functions to rename or find usages of a key.
1.5 dict with constant variable keys
The naive solution for this would be to create a constant for each key, or an enum, which I don't think is a good solution. Not only is at least one name lookup added to each access, it also means that the key definition and the value assignment are separated, which can lead to keys that don't have a value assigned to them.
2. enum
This leads to the idea of skipping the dict and using an enum to associate the keys directly with the values. Enums are conveniently supported by syntax checkers, autocompletion and the like, as they support both attribute reference via "dot notation" and subscription via "[]".
However, an enum has the big disadvantage that it requires all keys/enum members to have unique values; keys violating this rule are automatically converted to aliases, which makes outputs very confusing.
I already thought about copying the Enum code and removing the unwanted bits, but this seems like a lot of effort for such a basic problem.
Question:
So basically, what I'm looking for is a pythonic, neat and concise way to define a (potentially immutable) mapping from string keys to arbitrary values which supports the following:
iterable (over keys)
keys with identical values don't interfere with each other
keys are required to have an associated value
keys are considered by syntax-checkers, auto-completion, refactorings, etc.
The preferred way of using it would be to define it in a Python source file, but it would be a nice bonus if the solution supported easy means to write the data to a text file (JSON, INI or similar) and to create a new instance from such a file.
How would you do that and why would you choose a specific solution?
For the first part, I would use aenum [1], which has a noalias setting (so duplicate values can exist with distinct names):
from aenum import NoAliasEnum

class Unique(NoAliasEnum):
    first = 1
    one = 1
and in use:
>>> Unique.first
<Unique.first: 1>
>>> Unique.one
<Unique.one: 1>
>>> # name lookup still works
>>> Unique['one']
<Unique.one: 1>
>>> # but value lookups do not
>>> Unique(1)
Traceback (most recent call last):
...
TypeError: NoAlias enumerations cannot be looked up by value
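Iteration over the keys (one of the requirements above) should also work as with a normal enum, since no members are collapsed into aliases; a quick sketch:
>>> [member.name for member in Unique]
['first', 'one']
>>> {member.name: member.value for member in Unique}
{'first': 1, 'one': 1}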
For the second part, decide which you want:
read and create enums from a file
create enum in Python and write to a file
Doing both doesn't seem to make a lot of sense.
To create from a file you can use my JSONEnumMeta answer.
To write to a file you can use my share enums with arduino answer (after adapting the __init_subclass__ code).
The only thing I'm not certain of is the last point of syntax-checker and auto-completion support.
[1] Disclosure: I am the author of the Python stdlib Enum, the enum34 backport, and the Advanced Enumeration (aenum) library.

Python - wrap same object to make it unique

I have a dictionary that is being built while iterating through objects. The same object can be accessed multiple times, and I'm using the object itself as a key.
So if the same object is accessed more than once, the key is no longer unique and my dictionary is no longer correct.
Though I need access by object, because later on, if someone wants to access the contents, they can request them by the current object. And that will be correct, because it will access the last active object at that time.
So I'm wondering if it is possible to wrap an object somehow, so it would keep its state and all attributes the same, but the only difference would be that this new kind of object is actually unique.
For example:
dct = {}
for obj in some_objects_lst:
    # Well, this kind of wraps it, but it loses state: if I instantiated
    # it, I would lose all the information that was in obj.
    wrapped = type('Wrapped', (type(obj),), {})
    dct[wrapped] = ...  # add some content
Now if there are better alternatives than this, I would like to hear them too.
P.S. The objects being iterated would be in different contexts, so even if the object is the same, it would be treated differently.
Update
As requested, to give better example where the problem comes from:
I have this Excel report generator module. Using it, you can generate various Excel reports. For that you need to write a configuration using a Python dictionary.
Now before a report is generated, it must do two things: get metadata (metadata here means the position of each cell that will exist when the report is about to be created) and, second, parse the configuration to fill cells with content.
One of the value types that can be used in this module is a formula (an Excel formula). And the problem in my question is specifically with one of the ways a formula can be computed: formula values that are retrieved for a parent from its children.
For example imagine this excel file structure:
  A           | B           | C
  Total       | Childs Name | Amount
1 sum(childs) |             |
2             | child_1     | 10
3             | child_2     | 20
4 sum(childs) |             |
...
Now in this example, the sum in cell A1 would need to be 10+20=30, if the sum expression sums its children's column (in this case the C column). And all of this works until the same object (I call these iterables) is repeated. When building metadata I need to store it, to retrieve it later, and the key is the object being iterated itself. So when the object is iterated again while parsing values, it will not see all the information, because some of it will have been overwritten by the same object.
For example, imagine there are invoice objects, then there are partner objects which are related to invoices, and there are some other arbitrary objects that, given an invoice and a partner, produce specific amounts.
So when extracting such information in excel, it goes like this:
invoice1 -> partner1 -> amount_obj1, amount_obj2
invoice2 -> partner1 -> amount_obj3, amount_obj4
Notice that the partner in this example is the same. Here is the problem: I can't store the partner as a key, because when parsing values I will iterate over this object twice, while the metadata will only hold the values for amount_obj3 and amount_obj4.
P.S. I don't know if I explained it better; there is lots of code and I don't want to put huge walls of code here.
Update 2
I'll try to explain this problem from a more abstract angle, because it seems being too specific just confuses everyone even more.
So, given an objects list and an empty dictionary, the dictionary is built by iterating over the objects. The objects act as keys in the dictionary, which contains metadata used later on.
Now the same list can be iterated again for a different purpose. When that is done, the code needs to access the dictionary values using the iterated object (the same objects that are keys in that dictionary). But the problem is, if the same object was used more than once, the dictionary will only hold the latest value stored for that key.
That means the object is not a unique key here. And the only thing I know when I need to retrieve a value is the object itself. But because it is the same iteration, the iteration index will be the same when accessing the same object both times.
So the unique key, I guess, is (index, object).
I'm not sure if I understand your problem, so here are two options. If it's the object content that matters, keep object copies as keys. Something crude like:
import copy

new_obj = copy.deepcopy(obj)
dct[new_obj] = whatever_you_need_to_store(new_obj)
If the object doesn't change between the first time it's seen by your code and the next, the operation is just performed a second time with no effect. Not optimal, but probably not a big problem. If it does change, though, you get separate records for the old and new versions. To save memory you will probably want to replace copies with hashes, a __str__() method that serializes the object's data, or whatever. But that depends on what your object is; maybe hashing will take too much time for minuscule savings in memory. Run some tests and see what works.
If, on the other hand, it's important to keep the same value for the same object whether the data within it has changed or not (say, the object is a user session that can change its data between login and logoff), use object ids. Not the builtin id() function, because if the object gets GCed or deleted, some other object may get its id. Define an id attribute for your objects and make sure different objects cannot possibly get the same one.
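A minimal sketch of that last idea; the class name and counter are invented for illustration:

import itertools

class Tracked(object):
    """Mixin giving every instance a process-unique, stable uid."""
    _counter = itertools.count()

    def __init__(self):
        self.uid = next(self._counter)

# Usage: key the metadata dict by obj.uid, or by (index, obj.uid) if the
# same object must be distinguished per iteration position.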

Extending pyyaml to find and replace like xml ElementTree

I'd like to extend this SO question to treat a non-trivial use-case.
Background: pyyaml is pretty sweet insofar as it eats YAML and poops Python-native data structures. But what if you want to find a specific node in the YAML? The referenced question would suggest that, hey, you just know where in the data structure the node lives and index right into it. In fact pretty much every answer to every pyyaml question on SO seems to give this same advice.
But what if you don't know where the node lives in the YAML in advance?
If I were working with XML I'd solve this problem with an xml.etree.ElementTree. These provide nice facilities for loading an XML document into memory and finding elements based on certain search criteria. See find() and findall().
Questions:
Does pyyaml provide search capabilities analogous to ElementTree? (If yes, feel free to yell at me for being bad at Google.)
If no, does anyone have a nice recipe for extending pyyaml to achieve similar things? (Bonus points for not traversing the deserialized YAML all over again.)
Note that one important thing that ElementTree provides in addition to just being able to find things is the ability to modify the XML document given an element reference. I'd like to be able to do this on YAML as well.
The answer to question 1 is: no. PyYAML implements the YAML 1.1 language standard and there is nothing about finding scalars by any path in the standard nor in the library.
However, if you safe-load a YAML structure, everything is either a mapping, a sequence or a scalar. Even such a simplistic representation (simple compared to full-fledged object instantiation with !type markers) can already contain recursive, self-referencing structures:
&a x: *a
This is not possible in XML without external semantic interpretation, and it makes writing a generic tree walker much harder for YAML than for XML.
The type loading mechanism of YAML also makes a generic tree walker much more difficult, even if you exclude the problem of self-references.
If you don't know where a node lives in advance, you still need to know how to identify it, and since you don't know how you would walk its parent (which might be represented in multiple layers of combined mappings and sequences), it is almost useless to have a generic mechanism that depends on context.
Without being able to rely on context (in general) the thing that is left is a uniquely identifiable value (like the HTML id attribute). If all your objects in YAML have such a unique id, then it is possible to search the (safeloaded) tree for such an id value and extract any structure underneath it (mappings, sequences) until you hit a leaf (scalar), or some structure that has an id of its own (another object).
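A small sketch of that idea: recursively walk a safe-loaded tree looking for a mapping that carries a given id value (the function name and the "id" key are made up here; a real version would also track visited nodes to survive the self-references mentioned above):

def find_by_id(node, wanted, key="id"):
    """Depth-first search of nested dicts/lists for a mapping whose key equals wanted."""
    if isinstance(node, dict):
        if node.get(key) == wanted:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # scalar leaf, nothing below it
    for child in children:
        found = find_by_id(child, wanted, key)
        if found is not None:
            return found
    return None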
I have been following the YAML development for quite some time now (the earliest emails from the YAML mailing list in my YAML folder are from 2004) and I have not seen anything generic evolve since then. I do have some tools to walk the trees and find things, which I use for extracting parts of the simplified structure for testing my ruamel.yaml library, but no code that is in a releasable shape (it would already have been on PyPI if it were), and nothing near a generic solution like you can make for XML (which is, IMO, on its own syntactically less complex than YAML).
Do you know how to search through Python objects? Then you know how to search through the results of a yaml.load()...
YAML is different from XML in two important ways: one is that while every element in XML has a tag and a value, in YAML there can be some things that are only values. But secondly... again, YAML creates Python objects. There is no intermediate in-memory format to use.
E.G. if you load a YAML file like this:
- First
- Second
- Third
you'll get a list like ['First', 'Second', 'Third']. Want to find 'Third' and don't know where it is? You can use [x for x in my_list if 'Third' in x] to find it. Need to look up an item in a dictionary? Just do it.
If you want to modify an object, you don't modify the YAML, you modify the object. E.g. now I want the second entry to be in German. I just do my_list[1] = 'zweite', modifying it in place. Now the Python list looks like ['First', 'zweite', 'Third'], and dumping it to YAML looks like:
- First
- zweite
- Third
Note that PyYAML is pretty smart... you can even create objects with loops:
>>> import yaml
>>> a = [1,2,3]
>>> b = {}
>>> b[1] = a
>>> b[2] = a
>>> print yaml.dump(b)
1: &id001 [1, 2, 3]
2: *id001
>>> b[2] = [3,4,5]
>>> print yaml.dump(b)
1: [1, 2, 3]
2: [3, 4, 5]
In the first case, it even figured out that b[1] and b[2] point to the same object, so it created an anchor and an alias linking one to the other... in the original object, if you did something like a.pop(), both b[1] and b[2] would show that one entry was gone. If you dump that object to YAML and then load it back in, that will still be true.
(And note that in the second case, where they aren't the same, PyYAML doesn't create the extra notation, as it doesn't need to.)
In short: Most likely, you're just overthinking it.

How do I get a formatted list of methods from a given object?

I'm a beginner, and the answers I've found online so far for this have been too complicated to be useful, so I'm looking for an answer with vocabulary and complexity similar to this writing.
I'm using Python 2.7 in the IPython Notebook environment, along with related modules as distributed by Anaconda, and I need to learn about library-specific objects in the course of my daily work. The case I'm using here is a pandas DataFrame object, but the answer must work for any object of Python or of an imported module.
I want to be able to print a list of methods for the given object, directly from my program, in a concise and readable format. Even if it's just the method names in alphabetical order, that would be great. A bit more detail would be even better, and an ordering based on what each method does is fine, but I'd like the output to look like a table, one row per method, not big blocks of text. What I've tried is below, and it fails for me because it's unreadable: it puts copies of my data between each line, and it has no formatting.
(I love stackoverflow. I aspire to have enough points someday to upvote all your wonderful answers.)
import pandas
import inspect

data_json = """{"0":{"comment":"I won\'t go to school"}, "1":{"note":"Then you must stay in bed"}}"""
data_df = pandas.io.json.read_json(data_json, typ='frame',
                                   dtype=True, convert_axes=True,
                                   convert_dates=True, keep_default_dates=True,
                                   numpy=False, precise_float=False,
                                   date_unit=None)
inspect.getmembers(data_df, inspect.ismethod)
Thanks,
- Sharon
Create an object of type str:
name = "Fido"
List all its attributes (in Python, "methods" are just attributes that happen to be callable) in alphabetical order:
for attr in sorted(dir(name)):
    print attr
Get more information about the lower (function) attribute:
print(name.lower.__doc__)
In an interactive session, you can also use the more convenient
help(name.lower)
function.
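To get closer to the table-like output asked for (one row per method, with a hint of what each does), here is a minimal sketch; the helper name method_table is made up, and it should work on any object:

import inspect

def method_table(obj):
    """Print one row per public callable attribute: name, then first doc line."""
    for name in sorted(dir(obj)):
        if name.startswith('_'):
            continue  # skip private and dunder attributes
        try:
            attr = getattr(obj, name)
        except Exception:
            continue  # some properties raise on access
        if callable(attr):
            doc = (inspect.getdoc(attr) or '').split('\n')[0]
            print('%-35s %s' % (name, doc[:60]))

method_table(name)  # works the same on the data_df from the question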
