Writing a script to find certain lines/string in multiple documents - python

I have a folder with multiple files (.doc and .docx). For the sake of this question I want to deal primarily with the .doc files, unless both of these file types can be accounted for in the code.
I'm writing code to read the folder and identify the .doc files. The objective is to output paragraphs 3, 4, and 7. I'm not sure why, but Python is reading each paragraph from a different spot in each file. I'm thinking maybe there are spacing/formatting inconsistencies that I wasn't aware of initially. To work around the formatting issue, I thought I could define the strings I want output, but I'm not sure how to do that. I tried adding a string index in the code, but that didn't work.
How can I modify my code to be able to account for finding the strings that I want?
Original Code
doc = ''
for file in glob.glob(r'folderpathway*.docx'):
    doc = docx.Document(file)
    print(doc.paragraphs[3].text)
    print(doc.paragraphs[4].text)
    print(doc.paragraphs[7].text)
Code to account for the formatting issues
doc = ''
for file in glob.glob(r'folderpathway*.docx'):
    doc = docx.Document(file)
    print(doc.paragraphs["Substance Number"].text)
TypeError: list indices must be integers or slices, not str
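Since paragraph positions vary between files, one workaround is to search paragraph text instead of indexing by position. Here is a minimal sketch using python-docx, assuming the phrase you want (e.g. "Substance Number") actually appears in the paragraph text; the folder path and phrase are placeholders:
import glob
import docx

wanted = ["Substance Number"]  # hypothetical phrases to look for

for file in glob.glob(r'folderpathway*.docx'):
    doc = docx.Document(file)
    for paragraph in doc.paragraphs:
        # print every paragraph that contains one of the wanted phrases
        if any(phrase in paragraph.text for phrase in wanted):
            print(file, paragraph.text)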

Related

Populate DB with JSON files, search values, return only matching. Or something different

First, I am relatively new at programming; Python is the only language I have any familiarity with using. Secondly, I put DB in the question because that's what seems right to me after searching around, but I am open to not using a DB at all if that's easier or more efficient.
What I Have to Work With
I have a folder with ~75,000 JSON files. They all have the same structure; here is an example of what they look like (more on that below):
{
    "id": 93480,
    "author": "",
    "joined by": [],
    "date_created": "2010-04-28T16:07:21Z",
    "date_modified": "2020-02-21T21:42:45.655644Z",
    "type": "010combined",
    "page_count": null,
    "plain_text": "",
    "html": "",
    "extracted_by_ocr": false,
    "cited": []
}
One way that the real files differ from the above is that either the "plain_text" or the "html" key will have an actual value, namely text (whether plaintext or HTML). The length of that text can vary from a couple of sentences to over 200 pages worth of text. Thus, the JSON files range in size from 907 bytes at the smallest to 2.1 MB.
What I'm Trying to Do
I want to be able, essentially, to search through all the files for a word or phrase contained in either the plain_text or HTML fields and, at a minimum, return a list of files containing that word or phrase. [Ideally, I'd do other things with them, as well, but I can figure that stuff out later. What I'm stumped on is where to begin.]
What I Can't Figure Out
Whether to even bother with a document-store db like MongoDB (or PostgreSQL). If that's the appropriate way to handle this, I'm open to working my way through it. But I can't even tell if that's how I should attack the problem, or if I should instead just use a Python script to iterate over the files in the folder directly. Can you populate a DB with all the files in a folder, then search for a substring in each row? The fact that some of these files have a ton of text in one of the values makes it seem weird to me to use a DB at all, but again: I don't know what I'm doing.
I think I know how to iterate over the files directly with Python. I know how to open files, and I know how to get a list of keys from JSON files. But how do you search for a matching substring in two JSON values? And then, if the substring is found in one of them, how do you append the "id" field to a list, close the file, and move to the next one? (I mean, obviously, the basic structure is a conditional.) Here's the logical structure of what I'm thinking:
Variable = "substring I want to match"
List = [] # Will hold ids of files containing variable
Open file
Read file to the end
Search file [or just the two JSON keys?] for variable
If variable found append "id" to list
Close file
Move to the next one in the directory
It's the actual code part that I'm stumbling over.
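For what it's worth, a minimal sketch of that loop using only the standard library might look like the following (the folder path and search term are placeholders, and it assumes each file is a single JSON object with the keys shown above):
import glob
import json

substring = "substring I want to match"  # placeholder search term
matching_ids = []  # will hold the ids of files containing the substring

for path in glob.glob('folderpath/*.json'):  # placeholder folder path
    with open(path, encoding='utf-8') as f:
        record = json.load(f)
    # search only the two text-bearing values; "or ''" guards against null
    if substring in (record.get('plain_text') or '') or substring in (record.get('html') or ''):
        matching_ids.append(record['id'])

print(matching_ids)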
Here is an idea using pandas (I don't know much about search engines), partly copied from: How to read multiple json files into pandas dataframe?
dfs = []  # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True)  # read data frame from json file
    dfs.append(data)  # append the data frame to the list
temp = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
Creating it will take forever, but once that's done you can search and do operations quickly. E.g., if you want to find all ids where author is not empty:
id_list = temp.loc[temp['author'] != '']['id'].tolist()
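Since the goal is a substring search over the text fields, a similar lookup (assuming the columns are named plain_text, html, and id as in the sample) might be:
search_term = "substring I want to match"  # placeholder
mask = (temp['plain_text'].str.contains(search_term, case=False, na=False)
        | temp['html'].str.contains(search_term, case=False, na=False))
id_list = temp.loc[mask, 'id'].tolist()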
If the combined size of all your files is gigantic, you may want to consult the docs to store things more efficiently https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html or use another method.

How would I extract only specific text from this webpage?

I am looking for ways to take this line of code:
{"id":"76561198170104957","names":[{"name":"Mountain Dew"},{"name":"Sugardust"}],"kills":2394,"deaths":2617,"ff_kills":89,"ff_deaths":110,"playtime":"P5DT3H45M18S"}
and extract ONLY the "kills", "deaths", "ff_kills", and "ff_deaths" keys and their associated numbers into a list. This data varies in length depending on the user, so a static index won't really work, I don't think. The data is also read from a webpage, if that opens up any possibilities. Thanks.
That format is called JSON. You can easily parse it with Python. Example:
import json
line = r'{"id":"76561198170104957","names":[{"name":"Mountain Dew"},{"name":"Sugardust"}],"kills":2394,"deaths":2617,"ff_kills":89,"ff_deaths":110,"playtime":"P5DT3H45M18S"}'
j = json.loads(line)
print(j['kills'])
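To pull out just the four fields you mention, one possible follow-up (key names taken from your sample) would be:
wanted_keys = ['kills', 'deaths', 'ff_kills', 'ff_deaths']
stats = [(key, j[key]) for key in wanted_keys]
print(stats)  # [('kills', 2394), ('deaths', 2617), ('ff_kills', 89), ('ff_deaths', 110)]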

How to read "well" from a file in python

I have to read a file that has always the same format.
Since I know it has the same format, I can readline() and tokenize. But I guess there is a way to read it that is more, how to say it, "pretty to the eyes".
The file I have to read has this format :
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
I just want a different way to read it without having to tokenize, if that is possible.
Your question seems to imply that "tokenizing" is some kind of mysterious and complicated process. But in fact, the thing you are trying to do is exactly tokenizing.
Here is a perfectly valid way to read the file you show, break it up into tokens, and store it in a data structure:
def read_file_data(data_file_path):
    result = {}
    with open(data_file_path) as data_file:
        for line in data_file:
            # strip the trailing newline, then split into key and value on the first space
            key, value = line.rstrip('\n').split(' ', maxsplit=1)
            result[key] = value
    return result
That wasn't complicated, it wasn't a lot of code, it doesn't need a third-party library, and it's easy to work with:
data = read_file_data('path/to/file')
print(data['Nom']) # prints "NMS-01"
Now, this implementation makes many assumptions about the structure of the file. Among other things, it assumes:
The entire file is structured as key/value pairs
Each key/value pair fits on a single line
Every line in the file is a key/value pair (no comments or blank lines)
The key cannot contain space characters
The value cannot contain newline characters
The same key does not appear multiple times in the file (or, if it does, it is acceptable for the last value given to be the only one returned)
Some of these assumptions may be false, but they are all true for the data sample you provided.
More generally: if you want to parse some kind of structured data, you need to understand the structure of the data and how values are delimited from each other. That's why common structured data formats like XML, JSON, and YAML (among many others!) were invented. Once you know the language you are parsing, tokenization is simply the code you write to match up the language with the text of your input.
Pandas does many magical things, so maybe that is prettier for you?
import pandas as pd
pd.read_csv('input.txt', sep=' ', header=None, index_col=0)
This gives you a dataframe that you can manipulate further:
0 1
Nom NMS-01
MAC AAAAAAAAAAA
UDPport 2019
TCPport 9129
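From there, individual values can be looked up by row label; a usage example (assigning the result to a variable first, with column label 1 coming from header=None):
df = pd.read_csv('input.txt', sep=' ', header=None, index_col=0)
print(df.loc['UDPport', 1])  # 2019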

Number of lines added and deleted in files using gitpython

How to get/extract number of lines added and deleted?
(Just like we do using git diff --numstat).
from git import Repo

repo_ = Repo('git-repo-path')
git_ = repo_.git
log_ = git_.diff('--numstat', 'HEAD~1')
print(log_)
prints the entire output (lines added/deleted and file-names) as a single string. Can this output format be modified or changed so as to extract useful information?
Output format: num(added) num(deleted) file-name
For all files modified.
If I understand you correctly, you want to extract data from your log_ variable and then re-format it and print it? If that's the case, then I think the simplest way to do it is with a regular expression:
import re
for line in log_.split('\n'):
    m = re.match(r"(\d+)\s+(\d+)\s+(.+)", line)
    if m:
        print("{}: rows added {}, rows deleted {}".format(m[3], m[1], m[2]))
The exact output you can of course modify any way you want, once you have the data in the match m. Getting the hang of regular expressions may take a while, but they can be very helpful for small scripts.
Be advised, though: regular expressions tend to be write-only code and can be very hard to debug. For extracting small parts like this, however, they are very helpful.
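If you would rather end up with the numbers as integers than with printed text, a variation on the same idea splits on tabs (git's --numstat output is tab-separated); this is only a sketch, and the repository path is a placeholder:
from git import Repo

repo = Repo('git-repo-path')
diff_text = repo.git.diff('--numstat', 'HEAD~1')

stats = []  # list of (added, deleted, filename) tuples
for line in diff_text.splitlines():
    added, deleted, name = line.split('\t', maxsplit=2)
    # binary files report "-" instead of counts, so guard the int conversion
    if added.isdigit() and deleted.isdigit():
        stats.append((int(added), int(deleted), name))

print(stats)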

Write Integer to XML Node in minidom

I'm using minidom in Python to create an XML formatted log file for completed tasks. Part of the process is to compare the last modified time of a file to the time that that file's data was recorded into the log. I plan on doing that via:
if modTime < recTime:
    do_something()
For example, foo.pdf was modified at 10:40am, then at 10:46am the log recorded foo.pdf's modified time. So a portion of the log should look something like this:
<Printed Orders>
    <foo.pdf>
        <Date Recorded>
            1352486780
        </Date Recorded>
    </foo.pdf>
However, when I attempt to write the times in their integer formats to the XML file I get the error:
TypeError: node contents must be a string
So, my questions are:
Is there a way to write an integer to an XML file? (Preferably using minidom, so as not to clutter my script with more imports.)
If there isn't, is there a better way to compare the modified time I pull from the file itself and the recorded time I pull from the XML file than converting the recorded time to a string, writing it to the XML file, pulling the recorded time from the XML file later on, and then converting that string back to an integer?
Also, in case you're wondering, the plan is to do once-daily purges of a directory, deleting foo.pdf and other files based on the comparison of their own mod/rec times. If foo.pdf hasn't been modified since it was entered into the log, it will be deleted.
Thanks!
Just look at the output you expect. How would XML know whether that is an integer or a string? With XML in general, you have to say everything with tags. Thus, everything is treated as a string.
You do not need to convert the string to an int, unless the other time is an int, because the time string will not become any longer than it is now for a really long time (over 3,000 years). That said, I am not sure why you have so much dislike for doing that conversion. If it's really a big deal, use JSON.
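For what it's worth, a minimal sketch of writing the timestamp as text and converting it back on read, assuming minidom and hypothetical element names similar to your example (minidom will reject tag names containing spaces, so these use none):
from xml.dom import minidom
import time

rec_time = int(time.time())

# write: integers must be converted to strings before becoming text nodes
doc = minidom.Document()
root = doc.createElement('PrintedOrders')   # hypothetical tag names
entry = doc.createElement('DateRecorded')
entry.appendChild(doc.createTextNode(str(rec_time)))
root.appendChild(entry)
doc.appendChild(root)
xml_text = doc.toprettyxml(indent='  ')

# read: pull the text back out and convert it to an int for comparison
parsed = minidom.parseString(xml_text)
stored = int(parsed.getElementsByTagName('DateRecorded')[0].firstChild.data.strip())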
