Using the Python oledump tool gives me an "extraction failed" error

I would like some help with the oledump tool from the oletools python toolset.
I am doing a training exercise where I need to extract an attachment from a .msg file and get the MD5 hash for it. However, I am having trouble extracting the attachment using oletools in a Linux environment. Keep in mind this is a training lab environment and I cannot get any tools other than the ones provided in the lab (oledump).
Checking the .msg file, it clearly has an attachment (attach_version section) with binary data (ending in hex 0102)
Checking section 3, it is a word doc
Section 4 seems to be the start of the binary data
However, when I try to extract this part, I always get an "extraction failed" error. I tried the "-d" flag, but that only dumps the section, not the whole file.
What would be the best way to extract the file with oledump without these errors?

I had the same problem. This finally worked for me:
python3 oledump.py Salary-Ranges.msg -s4 -d > maldoc1.docm
The MD5 checksum matched the one in the task.
The '-e' argument probably is not designed for that job, or I just don't know how to use it correctly - still learning these tools.
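To double-check the hash afterwards, Python's standard library is enough; a minimal snippet (file name taken from the command above):
import hashlib

with open("maldoc1.docm", "rb") as f:
    print(hashlib.md5(f.read()).hexdigest())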


Creating a kindle dictionary

I am trying to create a Kindle dictionary that can be used for offline lookup. I already have the words and their inflections, but turning this into a working dictionary is difficult.
There is some documentation about this provided by Amazon. It basically says that you should:
Create an XHTML file with their special markup specifying all inflections etc. (see the sketch after this list)
Turn it into an epub
Open it with Kindle Previewer
Export it with Kindle Previewer to MOBI
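For reference, a single entry in that markup looks roughly like the following. This is a hand-written sketch based on the Kindle Publishing Guidelines, so treat the exact tags and attributes as approximate:
<idx:entry scriptable="yes">
  <idx:orth value="идти">идти
    <idx:infl>
      <idx:iform value="иду"/>
      <idx:iform value="идёт"/>
    </idx:infl>
  </idx:orth>
  <p>to go, to walk</p>
</idx:entry>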
So I created a large XHTML file (23 MB or so) according to the Amazon specifications and opened it in Kindle Previewer, and it looked fine. However, Kindle Previewer does not let you export XHTML files to MOBI. They want you to create an intermediate epub file.
I tried using Pandoc to do the conversion, which did not work because it stripped out all the specific HTML tags and only left the paragraphs. Then I tried calibre. The normal XHTML -> epub conversion failed because, according to an error message, the XHTML file was too large. Calibre suggests turning on "heuristic mode" if you run into this error, which I tried, but it had not finished after hours of runtime.
Then I attempted to create the epub file myself, using a sample file taken from this tutorial. I discovered that this is not trivial, and a check with epubcheck revealed many hard-to-understand errors in my generated file. Generating the epub is further complicated by the fact that you probably need to split the XHTML into many smaller files, of maybe 250 KB each, because e-readers tend to struggle with parsing larger files.
So I thought there should be an easier way to do this, or maybe a library that helps with it. Maybe it would even be a good idea to output the words + inflections in some other, easier dictionary format and then convert that to a MOBI using an existing library, leaving out the XHTML generation completely. Currently I am using Python, but I'd also use other languages if necessary. What could I try?
Edit: To add to the things I have tried: there is an apparently closed-source script here that unfortunately doesn't support inflections, so it does not work. And there are instructions here that advise converting the file to PRC using Mobipocket Creator and then opening it with Kindle Previewer. The problem with this approach is that Kindle Previewer throws the error:
Kindle Previewer does not support this file, which has either been created using an older version of KindleGen or a third party application. We recommend using EPUB or DOCX format directly for previewing and publishing your book on Kindle.
There are also more detailed instructions for Mobipocket Creator here, which tell you to directly move the generated .prc file onto the kindle. I tried that but it is not being recognized as a dictionary.
I figured it out. First I implemented a solution myself; then I found the pyglossary library (right now the code below only works with the version from GitHub, not the one from pip) and used it like this:
from pyglossary.glossary import Glossary

Glossary.init()

glos = Glossary()
defiFormat = "h"  # definitions are HTML

# get_base_forms, get_inflections and get_definition are my own helpers.
base_forms = get_base_forms()
for canonical_form in base_forms:
    inflections = get_inflections(canonical_form)
    definitions = get_definition(canonical_form)
    definitionhtml = ""
    for definition in definitions:
        definitionhtml += "<p>" + definition + "</p>"
    all_forms = [canonical_form]
    all_forms.extend(inflections)
    glos.addEntryObj(glos.newEntry(all_forms, definitionhtml, defiFormat))

glos.setInfo("title", "Russian-English Dictionary")
glos.setInfo("author", "Vuizur")
glos.sourceLangName = "Russian"
glos.targetLangName = "English"
glos.write("test.mobi", format="Mobi", keep=True, kindlegen_path="path/to/kindlegen.exe")

Is the Paragram_300_SL999 Word Embeddings file corrupt?

I need to use the Paragram_SL999_300 embeddings for my project, which uses the open-source code from a published article (https://github.com/cecilialeiqi/adversarial_text). When I try to run Step 4 (generate adversarial examples) from https://github.com/cecilialeiqi/adversarial_text, I get a ValueError saying int() was expected but got ','. I know from the readme.txt for Paragram-SL999 300 that the file is supposed to contain one token per line followed by its embeddings. Upon trying to open the Paragram_SL999_300.txt file to see whether it matches this format, it loads about halfway and then the text editor closes, without letting me edit it. It also crashes LibreOffice if I try to open it there. This was in an Ubuntu 18.04 virtual machine. I wasn't sure whether this was because the author's code is wrong (in discrete_attack.py at https://github.com/cecilialeiqi/adversarial_text/blob/master/src/discrete_attack.py) or because the file is corrupt, so I tried downloading and extracting the Paragram-SL999 300 archive from Wieting's website (http://www.cs.cmu.edu/~jwieting/). On my Windows computer, I get a message saying that the archive is corrupted, which prevents me from extracting and using the Paragram_SL999_300.txt file. On another Windows computer, I get Error Code 0x80004005: Unspecified error when trying to extract the archive.
Is there any way to get around this issue or anyone who can provide insight on it? Would it be recommended instead to produce the embeddings from Wieting's GitHub (https://github.com/jwieting/paragram-word)? I would very much appreciate any input as these embeddings are paramount to my project.
I managed to download it from the Google Drive link at https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F0B9w48e1rj-MOck1fRGxaZW1LU2M%2Fview%3Fusp%3Dsharing&data=02%7C01%7C%7C36fd021bae0343bbe54408d7bdd28c81%7C1faf88fea9984c5b93c9210a11d9a5c2%7C0%7C0%7C637186584305548961&sdata=PouX2kyBlnQHpzAaDKjqe7gFC3ctti6tjBcGWt8pg1s%3D&reserved=0. In the end it worked, but I'm not sure why the other attempts failed. Also, I didn't realise that for the code I had, I also needed to add the vocabulary size and the embedding size on the first line of the file (1703756 300).
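In case anyone else hits the same ValueError: a minimal sketch of prepending that header line without loading the multi-gigabyte file into memory (the output file name is my own choice, and I'm assuming the file is UTF-8):
# Prepend the "<vocab_size> <dim>" header the loading code expects.
with open("Paragram_SL999_300.txt", "r", encoding="utf-8") as src, \
        open("Paragram_SL999_300_fixed.txt", "w", encoding="utf-8") as dst:
    dst.write("1703756 300\n")
    for line in src:
        dst.write(line)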

Extremely new user to Python. "No module named request" error while trying code to detect image subdomains in a website to extract them to a folder

I may sound rather uninformed writing this, and unfortunately, my current issue may require a very articulate answer to fix. Therefore, I will try to be as specific as possible to ensure that my problem can be concisely understood.
My apologies for that - this Python code was merely obtained from a friend of mine, who wrote it for me in order to complete a certain task. I myself have extremely minimal programming knowledge.
Essentially, I am running Python 3.6 on a Mac. I am trying to work out a code that allows Python to scan through a bulk of a particular website's potentially existent subdomains in order to find possibly-existent JPG images files contained within said subdomains, and download any and all of the resulting found files to a distinct folder on my Desktop.
The Setup-
The code itself, named "download.py" on my computer, is written as follows:
import urllib.request

start = int(input("Start range:100000"))  # first 6-digit number to try
stop = int(input("End range:199999"))     # last 6-digit number to try

for i in range(start, stop + 1):
    filename = str(i).rjust(6, '0') + ".jpg"       # e.g. "100000.jpg"
    url = "http://website.com/Image_" + filename
    urllib.request.urlretrieve(url, filename)      # save into the current folder
    print(url)
(Note that the words "website" and "Image" have been substituted for the actual text included in my code).
Before I proceed, perhaps some explanation would be necessary.
Basically, the website in question contains several subdomains that include .JPG images; however, the majority of the exact URLs that allow the user to access these subdomains are unknown and are a hidden component of the internal website itself. The format is "website.com/Image_xxxxxx.jpg", where each x is a digit: there are 6 numerical digits in total, and only certain combinations correspond to images that actually exist on the site.
So as you can see, I have calibrated the code so that Python will initially search through number values in the aforementioned URL format from 100000 to 199999, and upon discovering any .JPG images among the thousands of link combinations, will directly download all the images it uncovers to a specific folder on my Desktop. The aim is to start from that specific range of number values and, after running the code and fetching any images (or not), keep renumbering the range to work my way through all of the possible 6-digit combos until the operation is ultimately a success.
(Possible side-issue: although I am fairly confident that my friend's code is written so that Python will only download .JPG files that actually do exist at a given URL, rather than swarming my folder with blank/bare files from every single URL attempt regardless of whether it was successful, I am admittedly not completely certain. If the latter is the case, informing me of a more suitable edit to my code would be tremendously appreciated.)
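On that side-issue: urlretrieve raises an exception for URLs that don't exist, so misses shouldn't leave files behind, but the loop will crash on the first missing number unless the error is caught. A sketch (same hypothetical URL pattern as above, Python 3):
import urllib.error
import urllib.request

for i in range(100000, 200000):
    filename = str(i).rjust(6, '0') + ".jpg"
    url = "http://website.com/Image_" + filename
    try:
        urllib.request.urlretrieve(url, filename)  # saved only if the URL exists
        print(url)
    except urllib.error.HTTPError:
        pass  # 404 or similar: no image at this number, nothing is written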
The Execution-
Right off the bat, the code experienced a large error. I'll list through the series of steps that led to the creation of said error.
#1- Of course, I first copy-pasted the code into a text document, and saved it as "download.py". I saved it inside of a folder named "Images" where I sought the images to be directly downloaded to. I used BBEdit.
#2- I proceeded, in Terminal, to input the commands "cd Desktop/Images" (to account for the file being held within the "Images" folder on my Desktop), followed by the command "Python download.py" (to actually run the code).
As you can see, the error which I obtained following my attempt to run the code was the ImportError: No module named request. Despite me guessing that the answer to solving this is simple, I can legitimately say I have got such minimal knowledge regarding Python that I've absolutely no idea how to solve this.
Hint: Prior to making the download.py file, the folder, and typing the Terminal commands, the only interactions I had made with Python were downloading the program (3.6) and placing it in my toolbar. I'm not even quite sure whether I am required to create any additional scripts/text files, or make any additional downloads, before a script like this would work and successfully download the resulting images into my "Images" folder, as is my desired goal. If I sincerely missed something integral at any point during this long read, hopefully someone here can provide a thoroughly detailed explanation as to how to solve my issue.
Finishing statements for those who've managed to stick along this far:
Thank you. I know this is one hell of a read, and I'm getting more tired as I go along. What I hope to get out of this question is
1.) Obviously, what would constitute a direct solution to the "No module named request" ImportError in Terminal. In other words, what I did wrong there or am missing.
2.) Any other helpful information that you know would assist this code, for example, any integral step or condition I've missed or failed to meet that would ultimately cause the entirety of my code to cease to work. If you do see a fault, I only ask you to be specific, as I've not got much experience in the programming world. After all, I know there are a lot of developers out there who are far more informed and experienced than I am. Thanks.
urllib.request exists in Python 3 only. When running 'python' on a Mac, you're running Python 2 by default, so try executing the script with python3 instead. You can check your version with
python --version
and you might need to
brew install python3
urllib.request is a Python 3 construct. Most systems run Python 2 by default, and that is what you get when you simply run python.
To install Python 3, go to https://brew.sh/ and follow the instructions to install the Homebrew package manager. Then run
brew install python3
python3 download.py

Scraping PDF data into Excel *absolute beginner*

This is literally day 1 of python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following guides online for coding a pdf scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test out some of the code I've found online.
Basic Info
Windows 7 64bit
python 3.6.0
Spyder3
I have many of the PDF-related packages (PyPDF2, pdfminer, pdfquery, pdfrw, etc.)
Goals
To create something in Python that allows me to convert PDFs from a folder into an Excel file (ideally) OR a text file (from which I will use VBA to convert).
Issues
Every time I try some sample code from guides I've found online, I always run into syntax errors on the lines where I am calling the PDF that I want to test the code on. Some guide links and error examples are below. Should I be putting my test.pdf into the same folder as the .py file?
How to scrape tables in thousands of PDF files?
I got an invalid syntax error due to "for" on the last line
PDFMiner guide (Link)
runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
print pdf_to_csv('test.pdf', separator, threshold)
^
SyntaxError: invalid syntax
It seems that the tutorials you are following use Python 2. There are usually few noticeable differences; the biggest is that in Python 3, print became a function, so it needs parentheses:
print()
I would recommend either changing your version of Python or finding a tutorial for Python 3. Hope this helps.
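For example, line 79 from your traceback would become:
print(pdf_to_csv('test.pdf', separator, threshold))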
Here is a PDFMiner example for Python 3.5, showing how to extract information from a PDF.
But it does not solve the problem of the tables you want to export to Excel. Commercial products are probably better at that...
I am trying to do this exact same thing! I have been able to convert my PDF to text, however the formatting is extremely random and messy, and I need the tables to stay intact to be able to write them into Excel data sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)
btw, use python 2 if you're going to use pdfminer. Here's some help with pdfminer https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf
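If you can move to Python 3 instead, the pdfminer.six fork offers a high-level helper; a minimal sketch (assumes test.pdf sits in the same folder as the script):
from pdfminer.high_level import extract_text

# Extract all text from the PDF and save it for the later VBA/Excel step.
text = extract_text("test.pdf")
with open("test.txt", "w", encoding="utf-8") as f:
    f.write(text)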

How do I implement a pre-commit hook script in SVN that calls dos2unix to validate checked-in files

I was wondering if anyone here had some experience writing this type of script and if they could give me some pointers.
I would like to modify this script to validate that checked-in files do not have a carriage return in the EOL formatting. The EOL format is CR LF on Windows and LF on Unix. When a user checks in code with the Windows format, it no longer compiles on Unix. I know this can be done on the client side, but I need to have this validation done on the server side. To achieve this, I need to do the following:
1) Make sure the file I check is not a binary. I don't know how to do this with svnlook; should I check the svn:mime-type property of the file? The Red Book does not indicate this clearly, or I must have missed it.
2) Run the dos2unix command to validate that the file has the correct EOL format. I would compare the output of dos2unix against the original file; if they differ, I give an error message to the client and cancel the check-in.
I would like your comments/feedback on this approach.
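A rough Python sketch of the approach described above (untested; it assumes the standard pre-commit arguments REPOS and TXN, that svnlook is on the PATH, and that checking for a bare CR byte is equivalent to diffing against dos2unix output):
#!/usr/bin/env python
import subprocess
import sys

repos, txn = sys.argv[1], sys.argv[2]

def svnlook(subcmd, *args, check=True):
    # Run an svnlook subcommand against the in-flight transaction.
    result = subprocess.run(["svnlook", subcmd, "-t", txn, repos, *args],
                            capture_output=True, check=check)
    return result.stdout

for line in svnlook("changed").decode().splitlines():
    action, path = line[:4].strip(), line[4:]
    if action == "D" or path.endswith("/"):
        continue  # skip deletions and directories
    # 1) Skip binaries: anything whose svn:mime-type property is not text.
    mime = svnlook("propget", "svn:mime-type", path, check=False).decode().strip()
    if mime and not mime.startswith("text/"):
        continue
    # 2) Reject files containing a carriage return (Windows CR LF endings).
    if b"\r" in svnlook("cat", path):
        sys.stderr.write("CR LF line endings found in %s\n" % path)
        sys.exit(1)
sys.exit(0)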
I think you can avoid a commit hook script in this case by using the svn:eol-style property as described in the SVNBook:
End-of-Line Character Sequences
Subversion Properties
This way SVN can worry about your line endings for you.
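For example, to have SVN normalize a file's line endings from then on (hypothetical path):
svn propset svn:eol-style native src/main.c
The repository then stores LF internally and each working copy gets its platform's native endings.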
Good luck!
What exactly are you trying to do?
Of course, there are numerous places to learn about svn pre-commit hooks (e.g. here, here, and in the Red Book), but it depends on what you're trying to do and what is available on your system.
Can you be more specific?
