I'm using minidom in Python to create an XML formatted log file for completed tasks. Part of the process is to compare the last modified time of a file to the time that that files data was recorded into the log. I plan on doing that via:
if modTime < recTime:
do_something()
For example, foo.pdf was modified at 10:40am, then at 10:46am the log recorded foo.pdf's modified time. So a portion of the log should look something like this:
<Printed Orders>
<foo.pdf>
<Date Recorded>
1352486780
</Date Recorded>
</foo.pdf>
However, when I attempt to write the times in their integer formats to the XML file I get the error:
TypeError: node contents must be a string
So, my questions are:
Is there a way to write an integer to an XML file? (Preferrably using minidom as to not clutter my script with more imports)
If there isn't, is there a better way to compare the modified time I pull from the file itself and recorded time I pull from the XML file than converting the recorded time to a string, writing to the XML file, pulling the rec time from the XML file later on, and then converting that string back to an integer?
Also, in case you're wondering, the plan is to do once-daily purges of a directory, deleting foo.pdf and other files based on the comparison of their own mod/rec times. If foo.pdf hasn't been modified since it was entered into the log, it will be deleted.
Thanks!
Just look at the output you expect. How would XML know if that is an integer or a string. With XML in general, you have to say everything with tags. Thus, everything is treated as a string.
You do not need to convert the string to a int, unless the other time is an int, because the time-string will not become any longer than it is now for a really long time (over 3,000 years). However, I am not sure why you have so much dislike for doing that conversion. If it's really a big deal, use JSON.
Related
I have a folder with multiple files (.doc and .docx). For the sake of this question I want to primarily deal with the .doc files unless for of these file types and be accounted for in the code.
I'm writing a code to read the folder and identify the .doc files. The objective is to output the paragraph 3, 4, and 7. I'm not sure why but python is reading each paragraph from a different spot in each file. I'm thinking maybe there are spacing/formatting inconsistencies that I wasn't aware of initially. To work around the formatting issue, I was thinking I could define the strings I want outputted. But I'm not sure how to do that. I tried to take add a string in the code but that didn't work.
How can I modify my code to be able to account for finding the strings that I want?
Original Code
doc = ''
for file in glob.glob(r'folderpathway*.docx'):
doc = docx.Document(file)
print (doc.paragraphs[3].text)
print (doc.paragraphs[4].text)
print (doc.paragraphs[7].text)
Code to account for the formatting issues
doc = ''
for file in glob.glob(r'folderpathway*.docx'):
doc = docx.Document(file)
print (doc.paragraphs["Substance Number"].text)
TypeError: list indices must be integers or slices, not str
I am trying to read a json which includes a number of tweets, but I get the following error.
OverflowError: int too large to convert
The script filters multiple json files to get specific tweets, and it crashes when reaching to a specific json.
The line that creates the error is this one :
df_temp = pd.read_json(path_or_buf=json_path, lines=True)
Here is the error in the cmd
Just store the user id as a String, and treat it like it is one (this is actually what you should do when dealing with this kind of ids). If you can't change the json input format, you can always parse it like a string before parsing it like a json object, and add the quotes to the id code, using for instance regexes: Regex in python.
I don't know with which library you are parsing the json, but maybe also implicit casting will work: either try the "getString" method on the number instead of the "getInt" method, or force python to treat the object like a string, with something like x = "" + json.getId()
Python is pretty loose on typing and may let you do it.
I have a folder with 300+ .txt files with total size of 15GB+. These files contain tweets. Each line is a different tweet. I have a list of keywords I'd like to search the tweets for. I have created a script that searches each line of every file for every item on my list. If the tweet contains the keyword, then it writes the line into another file. This is my code:
# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
f = open(file_path + file, 'r')
for line in f:
for key in keywords:
if re.search(key, line, re.IGNORECASE):
db.write(line)
This is the format each line has:
{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"#deerwalkinc 24000+ tweeps bigdata #Team #Genomics http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}
The script works but it takes a lot of time. For ~40 keywords it needs more than 2 hours. Obviously my code is not optimized. What can I do to improve the speed?
p.s. I have read some relevant questions regarding searching and speed but I suspect that the problem in my script lies in the fact that I'm using a list for the keywords. I've tried some of the suggested solutions but to no avail.
1) External library
If you're willing to lean on external libraries (and time to execute is more important than the one-off time cost to install), you might be able to gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vector operation. To get the matching tweets, you would do something like:
import pandas as pd
dataframe_from_text = pd.read_csv("/path/to/file.txt")
matched_tweets_index = dataframe_from_text.str.match("keyword_a|keyword_b")
dataframe_from_text[matched_tweets_index] # Uses the boolean search above to filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `dataframe_from_text`.
# You could loop through these to save them out to a file using the `.to_dict(orient="records")` format.
Dataframe operations within Pandas can be really quick so might be worth investigating.
2) Group your regex
Looks like you're not logging which keyword you matched against. If this is true, you could group your keywords into a single regex query like so:
for line in f:
keywords_combined = "|".join(keywords)
if re.search(keywords_combined, line, re.IGNORECASE):
db.write(line)
I've not tested this but by reducing the number of loops per line, that could trim some time off.
Why it's slow
You are regex searching through a json dump, which is not always a good idea. For example, if you keywords include words like user, time, profile and image each line will result in a match because the json format for tweets has all these terms as dictionary keys.
Besides the raw JSON is huge, each tweet will be more than 1kb in size (this one is 2.1kb) but the only part that's relevent in your sample is:
"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",
And this is less than 100 bytes, a typical tweet is still less than 140 characters despite recent changes to the API.
Things to try:
pre compile the regex as suggested by Padraic Cunningham
Option 1. Load this data into a postgresql JSONB field. JSONB fields are indexable and can be searched very quickly
Option 2. Load this into any old database, with the context of the text field having it's own column so that this column can be searched easily.
Option 3. last but not least, extract just the text field into it's own file. You can have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB will be shrunk to about 1GB
In short what you are doing now is searching the whole farm for the needle when you only need to search the haystack.
I've just written my first script for my first proper job (rather proud).
Basically I parse a large xml file that contains 3 types of data: estates, symbols and types.
It then creates 3 .txt files listing all items of the 3 types, one file for estates, one for symbols and one for types.
What I need to do now is format the output for use in our internal wiki.
I want to be able to filter it with a drop down menu, so that I can select an "Estate" and see what "Symbols" are in the "Estate" and then see what "Type" these symbols are.
For scope there are ~50 estates, ~26 types, and about 93000 Symbols (They all vary from day to day).
A symbol belongs to an estate and each symbol has a type.
If you want any code snippets from either the xml doc or my current script, feel free to ask, I didn't want to dump a load of code in here.
EDIT:
Here is an example of how the XML is formatted, showing the symbol name, its estate and then its type
<Symbol SymbolName="<SYMNAME>" Estate="<ESTATENAME>" TickType="<TYPE>" />
Names have been omitted for confidentiality.
EDIT2:
Had the idea of using dictionaries to better sort the parsed data.
Eg.
dictionary1 = {symbol1[estate], symbol2[estate]}
dictionary2 = {symbol1[type], symbol2[type]}
TO CLARFIY: I have a bunch of data from an xml that needs to be written to an output file in such a way that it can be filtered on a web page (drop down menus preferable)
I am working on a project that uses the Unity3D game engine. For some of the pipeline requirements, it is best to be able to update some files from external tools using Python. Unity's meta and anim files are in YAML so I thought this would be strait forward enough using PyYAML.
The problem is that Unity's format uses custom attributes and I am not sure how to work with them as all the examples show more common tags used by Python and Ruby.
Here is what the top lines of a file look like:
%YAML 1.1
%TAG !u! tag:unity3d.com,2011:
--- !u!74 &7400000
AnimationClip:
m_ObjectHideFlags: 0
m_PrefabParentObject: {fileID: 0}
...
When I try to read the file I get this error:
could not determine a constructor for the tag 'tag:unity3d.com,2011:74'
Now after looking at all the other questions asked, this tag scheme does not seem to resemble those questions and answers. For example this file uses "!u!" which I was unable to figure out what it means or how something similar would behave (my wild uneducated guess says it looks like an alias or namespace).
I can do a hack way and strip the tags out but that is not the ideal way to try to do this. I am looking for help on a solution that will properly handle the tags and allow me to parse & encode the data in a way that preserves the proper format.
Thanks,
-R
I also had this problem, and the internet was not very helpful. After bashing my head against this problem for 3 days, I was able to sort it out...or at least get a working solution. If anyone wants to add more info, please do. But here's what I got.
1) The documentation on Unity's YAML file format(they call it a "textual scene file" because it contains text that is human readable) - http://docs.unity3d.com/Manual/TextualSceneFormat.html
It is a YAML 1.1 compliant format. So you should be able to use PyYAML or any other Python YAML library to load up a YAML object.
Okay, great. But it doesn't work. Every YAML library has issues with this file.
2) The file is not correctly formed. It turns out, the Unity file has some syntactical issues that make YAML parsers error out on it. Specifically:
2a) At the top, it uses a %TAG directive to create an alias for the string "unity3d.com,2011". It looks like:
%TAG !u! tag:unity3d.com,2011:
What this means is anywhere you see "!u!", replace it with "tag:unity3d.com,2011".
2b) Then it goes on to use "!u!" all over the place before each object stream. But the problem is that - to be YAML 1.1 compliant - it should actually declare a tag alias for each stream (any time a new object starts with "--- "). Declaring it once at the top and never again is only valid for the first stream, and the next stream knows nothing about "!u!", so it errors out.
Also, this tag is useless. It basically appends "tag:unity3d.com,2011" to each entry in the stream. Which we don't care about. We already know it's a Unity YAML file. Why clutter the data?
3) The object types are given by Unity's Class ID. Here is the documentation on that:
http://docs.unity3d.com/Manual/ClassIDReference.html
Basically, each stream is defined as a new class of object...corresponding to the IDs in that link. So a "GameObject" is "1", etc. The line looks like this:
--- !u!1 &100000
So the "--- " defines a new stream. The "!u!" is an alias for "tag:unity3d.com,2011" and the "&100000" is the file ID for this object (inside this file, if something references this object, it uses this ID....remember YAML is a node-based representation, so that ID is used to denote a node connection).
The next line is the root of the YAML object, which happens to be the name of the Unity Class...example "GameObject". So it turns out we don't actually need to translate from Class ID to Human Readable node type. It's right there. If you ever need to use it, just take the root node. And if you need to construct a YAML object for Unity, just keep a dictionary around based on that documentation link to translate "GameObject" to "1", etc.
The other problem is that most YAML parsers (PyYAML is the one I tested) only support 3 types of YAML objects out of the box:
Scalar
Sequence
Mapping
You can define/extend custom nodes. But this amounts to hand writing your own YAML parser because you have to define EXPLICITLY how each YAML constructor is created, and outputs. Why would I use a Library like PyYAML, then go ahead and write my own parser to read these custom nodes? The whole point of using a library is to leverage previous work and get all that functionality from day one. I spent 2 days trying to make a new constructor for each class ID in unity. It never worked, and I got into the weeds trying to build the constructors correctly.
THE GOOD NEWS/SOLUTION:
Turns out, all the Unity nodes I've ever run into so far are basic "Mapping" nodes in YAML. So you can throw away the custom node mapping and just let PyYAML auto-detect the node type. From there, everything works great!
In PyYAML, you can pass a file object, or a string. So, my solution was to write a simple 5 line pre-parser to strip out the bits that confuse PyYAML(the bits that Unity incorrectly syntaxed) and feed this new string to PyYAML.
1) Remove line 2 entirely, or just ignore it:
%TAG !u! tag:unity3d.com,2011:
We don't care. We know it's a unity file. And the tag does nothing for us.
2) For each stream declaration, remove the tag alias ("!u!") and remove the class ID. Leave the fileID. Let PyYAML auto-detect the node as a Mapping node.
--- !u!1 &100000
becomes...
--- &100000
3) The rest, output as is.
The code for the pre-parser looks like this:
def removeUnityTagAlias(filepath):
"""
Name: removeUnityTagAlias()
Description: Loads a file object from a Unity textual scene file, which is in a pseudo YAML style, and strips the
parts that are not YAML 1.1 compliant. Then returns a string as a stream, which can be passed to PyYAML.
Essentially removes the "!u!" tag directive, class type and the "&" file ID directive. PyYAML seems to handle
rest just fine after that.
Returns: String (YAML stream as string)
"""
result = str()
sourceFile = open(filepath, 'r')
for lineNumber,line in enumerate( sourceFile.readlines() ):
if line.startswith('--- !u!'):
result += '--- ' + line.split(' ')[2] + '\n' # remove the tag, but keep file ID
else:
# Just copy the contents...
result += line
sourceFile.close()
return result
To create a PyYAML object from a Unity textual scene file, call your pre-parser function on the file:
import yaml
# This fixes Unity's YAML %TAG alias issue.
fileToLoad = '/Users/vlad.dumitrascu/<SOME_PROJECT>/Client/Assets/Gear/MeleeWeapons/SomeAsset_test.prefab'
UnityStreamNoTags = removeUnityTagAlias(fileToLoad)
ListOfNodes = list()
for data in yaml.load_all(UnityStreamNoTags):
ListOfNodes.append( data )
# Example, print each object's name and type
for node in ListOfNodes:
if 'm_Name' in node[ node.keys()[0] ]:
print( 'Name: ' + node[ node.keys()[0] ]['m_Name'] + ' NodeType: ' + node.keys()[0] )
else:
print( 'Name: ' + 'No Name Attribute' + ' NodeType: ' + node.keys()[0] )
Hope that helps!
-Vlad
PS. To Answer the next issue in making this usable:
You also need to walk the entire project directory and parse all ".meta" files for the "GUID", which is Unity's inter-file reference. So, when you see a reference in a Unity YAML file for something like:
m_Materials:
- {fileID: 2100000, guid: 4b191c3a6f88640689fc5ea3ec5bf3a3, type: 2}
That file is somewhere else. And you can re-cursively open that one to find out any dependencies.
I just ripped through the game project and saved a dictionary of GUID:Filepath Key:Value pairs which I can match against.