Dynamic index name in pymongo - python

I'm about to add some data to the MongoDB Atlas service and would like the indexes to be dynamic based on the source. I am currently ingesting logs from different sources (all ends up as files however) like so:
filenames = ["access_logs.log", "error.log", "stats.log"]
Currently I insert all of them to the same index:
db_client.logs.insert_many(data)
I however want to use the filename (this is being passed as a variable already) as the index name. Is there a dynamic way of doing this?
I can of course write case specific inserts and hardcode the name of the index based on each case but was wondering if there is another, smarter, way of doing this?
Basically i'd like to achive (please pardon the wrong syntax here, the $ is suppose to be the variable:
for filename in filenames:
db_client.$filename.insert_many(data)
I suppose this question is more of a python specific question rather than one that is only applicable for pymongo. However, I am not sure how to phrase this specific requirement.
Any help is welcome

To use a dynamic collection name, reference it in this format:
for filename in filenames:
db_client[filename].insert_many(data)

Related

when converting XML to SEVERAL dataframes, how to name these dfs in a dynamic way?

my code is on the bottom
"parse_xml" function can transfer a xml file to a df, for example, "df=parse_XML("example.xml", lst_level2_tags)" works
but as I want to save to several dfs so I want to have names like df_ first_level_tag, etc
when I run the bottom code, I get an error "f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal"
I also tried .format method instead of f-string but it also hasn't worked
there are at least 30 dfs to save and I don't want to do it one by one. always succeeded with f-string in Python outside pandas though
Is the problem here about f-string/format method or my code has other logic problem?
if necessary for you, the parse_xml function is directly from this link
the function definition
for first_level_tag in first_level_tags:
lst_level2_tags = []
for subchild in root[0]:
lst_level2_tags.append(subchild.tag)
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
lst_level2_tags = []
for subchild in root[0]:
lst_level2_tags.append(subchild.tag)
dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't get dynamic variable names in Python without doing ugly things. In general, storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).

python parsing file into data structure

So I started looking into it, and I haven't found a good way to parse a file following the format I will show you below. I have taken a data structures course, but it doesn't really help me with what I want to do. Any help will be greatly appreciated!
Goal: Create a tool that can read, create, and manipulate a custom file type
File Format: I'm sure there is a name for this type of format, but I couldn't find it. Anyways, the format is subject to some change since the variable names can be added, removed, or changed. Also, after each variable name the data could be one of several different types. Right now the files do not use sub groups, but I want to be prepared in case they decide to change that. The only things I can think of that will remain constant are the GROUP = groupName, END_GROUP = groupName, and the varName = data.
GROUP = myGroup
name1 = String, datenum, number, list, array
name2 = String, datenum, number, list, array
// . . .
name# = String, datenum, number, list, array
GROUP = mySubGroup
name1 = String, datenum, number, list, array
END_GROUP = mySubGroup
// More names could go here
END_GROUP = myGroup
GROUP = myGroup2
// etc.
END_GROUP = myGroup2
Strings and dates are enclosed in " (ie "myString")
Numbers are written as a raw ascii encoded number. They also use the E format if they are large or small (ie 5.023E-6)
Lists are comma separated and enclosed in parentheses (ie (1,2,3,4) )
Additional Info:
I want to be able to easily read a file and manipulate it as needed. For example, if I read the file and I want to change an attribute of a specific variable within a group I should be able to do something along the lines of dataStructure.groupName.varName = newData.
It should be easy to create my own file (using a default template that I will make myself or a custom template that has been passed in).
I want it to treat numbers as numbers and not strings. I should be able to add, subtract, multiply, etc. values within the data structure that are numbers
The big kicker, I'd like to have this written in vanilla python since our systems have only the most basic modules. It is a huge pain for someone to download another module since they have to create their own virtual environment and import the module to it. This tool should be as system independent as possible
Initial Attempt: I was thinking of using a dictionary to organize the data in levels. I do, however, like the idea of using dot structures (like what one would see using MATLAB structures). I wrote a function that will read all the lines of the file and remove the newline characters from each line. From there I want to check for every GROUP = I can find. I would start adding data to that group until I hit an END_GROUP line. Using regular expressions I should be able to parse out the line to determine whether it is a date, number, string, etc.
I am asking this question because I hope to have some insight on things I may be missing. I'd like for this tool to be used long after I've left the dev team which is why I'm trying to do my best to make it as intuitive and easy to use as possible. Thank you all for your help, I really appreciate it! Let me know if you need any more information to help you help me.
EDIT: To clarify what help I need, here are my two main questions I am hoping to answer:
How should I build a data structure to hold grouped data?
Is there an accepted algorithm for parsing data like this?

Separating a string by the first occurrence of a separator

I just got into python very recently and now I'm practicing by (what I imagine to be rather simple, but challenging enough for me) creating small tools to sort files into folders.
So far it has been going pretty well, but now I've encountered a problem:
My files are in the following format:
myAsset_prefix1_prefix2_prettyName.ext ;
(i.e. Tiger_texture_spec_brightOrange.png)
myAsset always has a different length since it's dependent on name.
I want to sort every file of the same asset ( "myAsset_" tag) in a separate folder.
The copying to a separate folder etc is no challenge but..
I don't want to update an array by hand every time I create/receive a new asset.
So instead of using the startswith operation and make it run through a list, I'd like to build that array when my script runs, by making the script look at the name of the file and store everything up to and including the first "_" in a variable/array.
Is that possible?
I think you want the glob module. This allows you to list the files that match a certain format.
For example:
for filename in glob.glob(*.ext):
asset_tag = filename.split(" ")[0]

Python or bash: Merging two csv files based on several matching field values, formatting, the outputting CSV

My preference would be for this to be in Python since I am working on learning more. If you can provide help in bash that would still be helpful, though.
I've looked around Stack Overflow and found some helpful things but not enough for me to finish this.
I have two CSV files with some shared fields. The data is not INT. I would like to join based on matching 3 specific fields and write it out to a new output.csv when all the processing is done.
sourceA.csv looks like this:
fieldname_1,fieldname_2,fieldname_3,fieldname_4,fieldname_5,fieldname_6,fieldname_7,fieldname_8,fieldname_9,fieldname_10,fieldname_11,fieldname_12,fieldname_13,fieldname_14,fieldname_15,fieldname_16
sourceB.csv looks like this:
fieldname_4,fieldname_5,fieldname_OTHER,fieldname_8,fieldname_16
As you can see, sourceB.csv has 4 field names that are also in sourceA.csv and one field name that does not. The data in fieldname_OTHER will need to replace the data in sourceA[fieldname_6].
The whole process should go like this:
Replace data in sourceA[fieldname_6] with data from sourceB[fieldname_OTHER] if all of the following criteria are met:
data in sourceA[fieldname_4]=sourceB[fieldname_4]
data in sourceA[fieldname_8]=sourceB[fieldname_8]
data in sourceA[fieldname_16]=sourceB[fieldname_16]
(The data in sourceB[fieldname_5] does not need to be evaluated.)
If the above criteria aren't met, just replace sourceA[fieldname_6] with the text ANY.
Write each processed line out to output.csv.
A sample of what I would like the output to be based on the input CSVs and processing outlined above:
dataA,dataB,dataC,dataD,dataE,dataOTHER,dataG,dataH,dataI,dataJ,dataK,dataL,dataM,dataN,dataO,dataP
I hope the details I've provided haven't made it more confusing than it needs to be. Thank you for all your help!
I'm not sure I'd bother with SQL for a one-off merger like this. It's straightforward in python.
Read in both files with the csv module, to get two lists. Index sourceA into a dictionary whose key is the tuple of fields that need to be matched. You can then loop over sourceB, find the matching row instantly, and merge into it from sourceB.
When you're done, you can just output the list you read from sourceA: the dict and the list point to the same values, which you've now updated.

Representing filesystem table

I’m working on simple class something like “in memory linux-like filesystem” for educational purposes. Files will be as StringIO objects. I can’t make decision how to implement files-folders hierarchy type in Python. I’m thinking about using list of objects with fields: type, name, parent what else? Maybe I should look for trees and graphs.
Update:
There will be these methods:
new_dir(path),
dir_list(path),
is_file(path),
is_dir(path), remove(path),
read(file_descr),
file_descr open(file_path, mode=w|r),
close(file_descr),
write(file_descr, str)
It's perfectly possible to represent a tree as a nested set of lists. However, since entries are typically indexed by name, and a directory is generally considered to be unordered, nested dictionaries would make many operations faster and easier to write.
I wouldn't store the parent for each entry though, that's implicit from its position in the hierarchy.
Also, if you want your virtual file system to efficiently support hard links, you need to separate a file's contents from the directory hierarchy. That way, you can re-use the contents by giving each piece of content any number of names, which is what hard linking does.
May be you can try using networkx. You just have to intutive to adapt it to use with files and folder.
A simple example
import os,networkx as nx
G=nx.Graph()
for (path, dirs, files) in os.walk(os.getcwd()):
bname = os.path.split(path)
for f in files:
G.add_edge(bname,f)
# Now do what ever you want with the Graph
You should first ask the question: What operations should my "file system" support?
Based on the answer you select the data representation.
For example, if you choose to support only create and delete and the order of the files in the dictionary is not important, then select a python dictionary. A dictionary will map a file name (sub path name) to either a dictionary or the file container object.
What's the API of the filestore? Do you want to keep creation, modification and access times? Presumably the primary lookup will be by file name. Are any other retrieval operations anticipated?
If only lookup by name is required then one possible representation is to map the filestore root directory on to a Python dict. Each entry's key will be the filename, and the value will either be a StringIO object (hint: in Python 2 use cStringIO for better performance if it becomes an issue) or another dict. The StringIO objects represent your files, the dicts represent subdirectories.
So, to access any path you split it up into its constituent components (using .split("/")) and then use each to look up a successive element. Any KeyError exceptions imply "File or directory not found," as would any attempts to index a StringIO object (I'm too lazy to verify the specific exception).
If you want to implement greater detail then you would replace the StringIO objects and dicts with instances of some "filestore object" class. You could call it a "link" (since that's what it models: A Linux hard link). The various attributes of this object can easily be manipulated to keep the file attributes up to date, and the .data attribute can be either a StringIO object or a dict as before.
Overall I would prefer the second solution, since then it's easy to implement methods that do things like keep access times up to date by updating them as the operations are performed, but as I said much depends on the level of detail you want to provide.

Categories