How to read a specific paragraph from multiple folders and files - python

I have a list that contains directories and filenames that I want to open, read a paragraph from and save that paragraph to a list.
The problem is that I don't know how to "filter" the paragraph out from the files and insert into my list.
My code so far.
import os
from glob import iglob

rr = []
file_list = [f for f in iglob('**/README.md', recursive=True) if os.path.isfile(f)]
for f in file_list:
    with open(f, 'rt') as fl:
        lines = fl.read()
        rr.append(lines)
print(rr)
The format of the file I'm trying to read from. The text between the paragraph start and the new paragraph is what I'm looking for
There is text above this paragraph
## Required reading
* line
* line
* line
/n
### Supplementary reading
There is text below this paragraph
When I run the code I get all the lines from the files as expected.

You need to learn how your imported text is structured. How are the paragraphs separated? If the separator looks like '\n\n', could you split your text on '\n\n' and pick the index of the paragraph you want?
text = 'paragraph one text\n\nparagraph two text\n\nparagraph three text'.split('\n\n')[1]
print(text)
>>> 'paragraph two text'
The other option, as someone else mentioned is Regular Expression aka RegEx, which you can import using
import re
RegEx is used to find patterns in text.
Go to https://pythex.org/, paste in a sample of one of the documents and experiment until you find the pattern that matches the paragraph you want.
Learn more about RegEx here
https://regexone.com/references/python
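As a rough illustration of the RegEx route, here is a minimal sketch. It assumes the section always starts with the exact heading "## Required reading" and ends at the next line that begins with '#'. The sample string and the names SECTION_RE and required_reading are made up for this example.

```python
import re

# Hypothetical sample mirroring the README layout from the question.
sample = (
    "There is text above this paragraph\n"
    "## Required reading\n"
    "* line\n"
    "* line\n"
    "* line\n"
    "\n"
    "### Supplementary reading\n"
    "There is text below this paragraph\n"
)

# Capture everything after the "## Required reading" heading line,
# up to (not including) the next line that starts with '#'.
SECTION_RE = re.compile(
    r"^## Required reading[ \t]*\n(.*?)(?=^#|\Z)",
    re.MULTILINE | re.DOTALL,
)

def required_reading(text):
    """Return the body of the section, or None if the heading is missing."""
    match = SECTION_RE.search(text)
    return match.group(1).strip() if match else None

print(required_reading(sample))
```

Because the heading itself sits outside the capture group, it does not end up in the result.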

Solved my problem with string slicing.
Basically, I scan each file's text for a start string and an end string and slice out everything between them. These slices then get appended to a list and written into a file.
for f in file_list:
    with open(f, 'rt') as fl:
        lines = fl.read()
        lines = lines[lines.find('## Required reading'):lines.find('## Supplementary reading')]
        lines = lines[lines.find('## Required reading'):lines.find('### Supplementary reading')]
        lines = lines[lines.find('## Required reading'):lines.find('## Required reading paragraph')]
        rr.append(lines)
But I still have "## Required reading" in my list and in my file so I run a second read/write method.
def removeHashTag():
    # Read all lines first, then rewrite the file without the heading line
    with open("required_reading.md", "r") as f:
        lines = f.readlines()
    with open("required_reading.md", "w") as f:
        for line in lines:
            if line != "## Required reading" + "\n":
                f.write(line)

removeHashTag()
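A possible way to do the slicing and the heading removal in one step is sketched below. Note that str.find returns -1 when a marker is missing, which silently cuts off the last character in a slice, so this version checks for it. The function name extract_section and its fallback behaviour are assumptions for illustration, not part of the code above.

```python
def extract_section(text, start_marker, end_marker):
    """Return the text between two markers, excluding both markers.

    Returns an empty string when the start marker is missing; if only the
    end marker is missing, everything after the start marker is returned.
    """
    start = text.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)          # skip past the heading itself
    end = text.find(end_marker, start)  # search only after the heading
    if end == -1:
        end = len(text)
    return text[start:end].strip()
```

In the loop above this could be used as rr.append(extract_section(fl.read(), '## Required reading', '### Supplementary reading')), which would make the second read/write pass unnecessary.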

Related

Split text file into lines by key word python

I have a large text file I have imported in python and want to split into lines by a key word, then use those lines to extract the relevant information into a dataframe.
The data follows the same pattern for each line, but the lines won't have exactly the same number of characters and some lines may have extra data.
So I have a text file such as:
{data: name:Mary, friends:2, cookies:10, chairs:4},{data: name:Gerald friends:2, cookies:10, chairs:4, outside:4},{data: name:Tom, friends:2, cookies:10, chairs:4, stools:1}
There is always the key word data between lines, is there any way I can split it out by using this word as the beginning of the line (then put it into a dataframe)?
I'm not sure where to begin so any help would be amazing
When you get the content of a .txt file like this...
with open("file.txt", 'r') as file:
    content = file.read()
...you have it as a string, so you can split it with the function str.split():
content = content.split(my_keyword)
You can do it with a function:
def splitter(path: str, keyword: str) -> list:
    # str.split() returns a list of the pieces between occurrences of keyword
    with open(path, 'r') as file:
        content = file.read()
    return content.split(keyword)
that you can call this way:
>>> splitter("file.txt", "data")
["I really like to write the word ", ", because I think it has a lot of meaning."]
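To get from the split pieces to something a dataframe can take, one option is to turn each piece into a dict. The sketch below assumes every field has the form key:value with no spaces inside the key or the value, as in the sample from the question. The name parse_records is made up for this example.

```python
import re

# Sample text copied from the question.
raw = (
    "{data: name:Mary, friends:2, cookies:10, chairs:4},"
    "{data: name:Gerald friends:2, cookies:10, chairs:4, outside:4},"
    "{data: name:Tom, friends:2, cookies:10, chairs:4, stools:1}"
)

def parse_records(text, keyword="data:"):
    """Split on the keyword and turn each chunk into a dict of key/value pairs."""
    records = []
    # Element 0 is whatever precedes the first keyword, so skip it.
    for chunk in text.split(keyword)[1:]:
        pairs = re.findall(r"(\w+):(\w+)", chunk)
        records.append(dict(pairs))
    return records

records = parse_records(raw)
print(records[0])
```

All values come back as strings. If pandas is available, pandas.DataFrame(records) should build the table, with missing keys such as outside or stools filled in as NaN.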

How to extract data from a text file if the txt file does not have columns?

I want to save only the WIMP mass from the text file below into another txt file for plotting purposes. I have a lot of other .txt files to read Wimp_Mass from. I tried to use np.loadtxt but cannot, since the file contains strings. Can you suggest code to extract Wimp_Mass and append the value to a .txt file without deleting the old values?
Omegah^2: 2.1971043635736895E-003
x_f: 25.000000000000000
Wimp_Mass: 100.87269924860568 GeV
sigmav(xf): 5.5536288606920853E-008
sigmaN_SI_p: -1.0000000000000000 GeV^-2: -389000000.00000000 pb
sigmaN_SI_n: -1.0000000000000000 GeV^-2: -389000000.00000000 pb
sigmaN_SD_p: -1.0000000000000000 GeV^-2: -389000000.00000000 pb
sigmaN_SD_n: -1.0000000000000000 GeV^-2: -389000000.00000000 pb
Nevents: 0
smearing: 0
%:a1a1_emep 24.174602466963883
%:a1a1_ddx 0.70401899013937730
%:a1a1_uux 10.607701601533348
%:a1a1_ssx 0.70401807105204617
%:a1a1_ccx 10.606374255125269
%:a1a1_bbx 0.70432127586224602
%:a1a1_ttx 0.0000000000000000
%:a1a1_mummup 24.174596692050287
%:a1a1_tamtap 24.172981870222447
%:a1a1_vevex 1.3837949256836950
%:a1a1_vmvmx 1.3837949256836950
%:a1a1_vtvtx 1.3837949256836950
You can use regex for this, please find below code:
import re

def get_wimp_mass():
    # read the whole file into one string
    with open("txt.file", "r") as f:
        data = f.read()
    # use regex to fetch the data
    wimp_search = re.search(r'Wimp_Mass:\s+([0-9\.]+)', data, re.M | re.I)
    if wimp_search:
        return wimp_search.group(1)
    else:
        return "Wimp mass not found"

if __name__ == '__main__':
    wimp_mass = get_wimp_mass()
    print(wimp_mass)
You can use a basic regex if you only want to extract the value. Assuming txt holds the file contents, the code would go something like this:
import re
ans = re.findall(r"Wimp_Mass:[ ]+([\d.]+)", txt)
ans
>>['100.87269924860568']
If you wanted more general code to extract everything, it could be
re.findall(r"([a-zA-Z0-9^:%]+)[ ]+([\d.]+)", txt)
You might have to add in some edge cases, though.
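Two edge cases visible in the sample file are negative values and scientific notation such as 2.1971043635736895E-003. A possible pattern that covers both is sketched below. It assumes each line starts with a label followed by whitespace and a number, and the names PAIR_RE and values are only illustrative.

```python
import re

# A few lines copied from the sample file in the question.
txt = """Omegah^2: 2.1971043635736895E-003
x_f: 25.000000000000000
Wimp_Mass: 100.87269924860568 GeV
sigmaN_SI_p: -1.0000000000000000 GeV^-2: -389000000.00000000 pb
Nevents: 0
%:a1a1_emep 24.174602466963883
"""

# Label at the start of the line, an optional trailing colon, whitespace,
# then a signed number with an optional fraction and exponent.
PAIR_RE = re.compile(
    r"^(\S+?):?[ \t]+([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)",
    re.MULTILINE,
)

# Only the first number on each line is kept.
values = {label: float(number) for label, number in PAIR_RE.findall(txt)}
print(values["Wimp_Mass"])
```

Converting to float at this point means the numbers are ready for plotting.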
Here is a simple solution:
with open('new_file.txt', 'w') as f_out:  # starts from an empty file; remove these two lines to keep existing values
    pass

filepaths = ['file1.txt', 'file2.txt']  # take all files
with open('new_file.txt', 'a') as f_out:
    for file in filepaths:
        with open(file, 'r') as f:
            for line in f:
                if line.startswith('Wimp_Mass'):
                    _, mass, _ = line.split()
                    f_out.write(mass)  # writing current mass
                    f_out.write('\n')  # writing newline
It first creates a new, empty text file (remove that step if the file already exists and you want to keep its values; append mode creates the file if it is missing). Then you need to enter all file paths (or just the names if the files are in the same directory). new_file.txt is opened in append mode, and for each file the mass is found and added to new_file.txt.

Read keywords from a txt file and print added text + keyword

I read many keywords from a txt file into Python using f = open().
I want to add text before each keyword.
For example:
(http://www.google.com/) + (abcdefg)
(added text) + (imported keyword)
I have tried it, but I can't get the result I want.
f = open("C:/abc/abc.txt", 'r')
data = f.read()
print("http://www.google.com/" + data)
f.close()
I also tried it using a "for" loop, but I couldn't get it to work.
Please let me know the solution.
Many thanks.
Your original code has some flaws:
data = f.read() reads the whole file into a single string, so the text is added only once, in front of the entire content. If you want to handle the file line by line, use a for loop;
data is a str-type variable, which may contain more than one word. Thus, you must split it into words, using data.split()
To solve your problem, you need to read each line from the file, split the line into the words it has, then loop through the list with the words, add the desired text then the word itself.
The correct program is this:
f = open("C:/abc/abc.txt", 'r')
for data in f:
    words = data.split()
    for i in words:
        print("http://www.google.com/" + i)
f.close()
with open('text.txt','r') as f:
    for line in f:
        # strip the trailing newline so print() does not emit blank lines
        print("http://www.google.com/" + line.rstrip('\n'))
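If the prefixed keywords are needed for more than printing, the loop can be wrapped in a small generator. This is only a sketch; the name prefix_keywords and the sample list are made up, and it assumes keywords are separated by whitespace or newlines.

```python
def prefix_keywords(lines, prefix="http://www.google.com/"):
    """Yield prefix + keyword for every whitespace-separated keyword."""
    for line in lines:
        # split() with no arguments also drops the trailing newline
        # and yields nothing for empty lines.
        for keyword in line.split():
            yield prefix + keyword

# Any iterable of lines works, including an open file object.
for url in prefix_keywords(["abcdefg\n", "foo bar\n", "\n"]):
    print(url)
```

With a real file this would be called as prefix_keywords(f) inside a with open(...) as f: block.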

Extract chunks of text from document and write them to new text file

I have a large text file that I want to read several lines of, and write these lines out as one line to a text file. For instance, I want to start reading in lines at a certain start word, and end on a lone parenthesis. So if my start word is 'CAR', I would want to keep reading until a lone parenthesis followed by a line break is read. The start and end words are to be kept as well.
What is the best way to achieve this? I have tried pattern matching and avoiding regex but I don't think that is possible.
Code:
import re

array = []
with open('text.txt', 'r') as infile, open(r'temp2.txt', 'w') as outfile:
    data = infile.read()
    for x in re.findall(r'CAR(.*?)\)(?:\n|$)', data, re.DOTALL):
        array.append(x)
        outfile.write(x)
What the text may look like
( CAR: *random info*
*random info* - could be many lines of this
)
Using regular expressions is totally fine for this type of problem. You cannot use them when your pattern contains recursion, such as getting the content of nested parentheses: ((text1)(text2)).
You can use the following regular expression: (CAR[\s\S]*?(?=\)))
We can match the text you're interested in with the regex pattern (CAR.*?)\) and the re.DOTALL flag, so that . also matches newlines. The non-greedy .*? stops at the first closing parenthesis, so each CAR block becomes its own match.
Then we just have to remove the newline characters from the resulting matches and write them to a file.
import re

with open("text.txt", 'r') as f:
    matches = re.findall(r"(CAR.*?)\)", f.read(), re.DOTALL)
with open("output.txt", 'w') as f:
    for match in matches:
        f.write(" ".join(match.split('\n')))
        f.write('\n')
The output file looks like this:
CAR: *random info* *random info* - could be many lines of this
EDIT:
updated code to put newline between matches in output file
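If the random info can itself contain parentheses, a variant that only stops at a parenthesis standing alone on its line may be safer. The sketch below assumes exactly that layout. The sample text and the names BLOCK_RE and car_blocks are made up for illustration.

```python
import re

# Hypothetical sample with two CAR blocks; the first contains inner parentheses.
sample = (
    "( CAR: first (with parens) info\n"
    "more info\n"
    ")\n"
    "unrelated text\n"
    "( CAR: second\n"
    ")\n"
)

# Start at CAR and stop at a ')' that is alone on its line.
# MULTILINE makes ^ and $ match at every line boundary,
# DOTALL lets . run across newlines.
BLOCK_RE = re.compile(r"CAR.*?^[ \t]*\)[ \t]*$", re.DOTALL | re.MULTILINE)

def car_blocks(text):
    """Return each CAR block collapsed onto one line, closing parenthesis kept."""
    return [" ".join(block.split()) for block in BLOCK_RE.findall(text)]

for block in car_blocks(sample):
    print(block)
```

Here both the start word and the closing parenthesis stay in the output, as the question asked.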

Find a string and insert text after it in Python

I am still a learner in Python. I was not able to find a specific string and insert multiple strings after it. I want to search for the line in the file and insert the content of the write function after it.
I have tried the following which is inserting at the end of the file.
line = '<abc hij kdkd>'
dataFile = open('C:\\Users\\Malik\\Desktop\\release_0.5\\release_0.5\\5075442.xml', 'a')
dataFile.write('<!--Delivery Date: 02/15/2013-->\n<!--XML Script: 1.0.0.1-->\n')
dataFile.close()
You can use fileinput to modify the same file inplace and re to search for particular pattern
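import fileinput, re, sys

def modify_file(file_name, pattern, value=""):
    fh = fileinput.input(file_name, inplace=True)
    for line in fh:
        # keep the matched text and append value right after it
        line = re.sub(pattern, lambda m: m.group(0) + value, line)
        # with inplace=True, stdout is redirected into the file
        sys.stdout.write(line)
    fh.close()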
You can call this function something like this:
modify_file("C:\\Users\\Malik\\Desktop\\release_0.5\\release_0.5\\5075442.xml",
"abc..",
"!--Delivery Date:")
Python strings are immutable, which means that you wouldn't actually modify the input string. You would create a new one which has the first part of the input string, then the text you want to insert, then the rest of the input string.
You can use the find method on Python strings to locate the text you're looking for:
def insertAfter(haystack, needle, newText):
    """ Inserts 'newText' into 'haystack' right after 'needle'. """
    i = haystack.find(needle)
    if i == -1:
        return haystack  # needle not found: return the string unchanged
    return haystack[:i + len(needle)] + newText + haystack[i + len(needle):]
You could use it like
print(insertAfter("Hello World", "lo", " beautiful")) # prints 'Hello beautiful World'
Here is a suggestion to deal with files. I assume the pattern you search for is a whole line (there is nothing more on the line than the pattern, and the pattern fits on one line).
line = ... # What to match
input_filepath = ... # input full path
output_filepath = ... # output full path (must be different than input)
encoding = "utf-8" # assumed; use the encoding of your file
all_data_to_insert = ... # text to insert after the matching line
with open(input_filepath, "r", encoding=encoding) as fin, \
        open(output_filepath, "w", encoding=encoding) as fout:
    pattern_found = False
    for theline in fin:
        # Write input to output unmodified
        fout.write(theline)
        # if you want to get rid of spaces
        theline = theline.strip()
        # Find the matching pattern
        if pattern_found is False and theline == line:
            # Insert extra data in output file
            fout.write(all_data_to_insert)
            pattern_found = True
    # Final check
    if pattern_found is False:
        raise RuntimeError("No data was inserted because line was not found")
This code is for Python 3; some modifications may be needed for Python 2, especially the with statement (see contextlib.nested). If your pattern fits in one line but is not the entire line, you may use "line in theline" instead of "theline == line". If your pattern can spread over more than one line, you need a stronger algorithm. :)
To write to the same file, you can write to another file and then move the output file over the input file. I didn't plan to release this code, but I was in the same situation some days ago. So here is a class that inserts content in a file between two tags and supports writing to the input file: https://gist.github.com/Cilyan/8053594
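The "write to another file, then move it over the input" idea could look roughly like the sketch below. It is not the code from the gist. The function name insert_after_line and its parameters are made up for illustration, and it assumes the target is a whole line, as in the snippet above.

```python
import os
import tempfile

def insert_after_line(path, target, new_text, encoding="utf-8"):
    """Insert new_text after the first line equal to target, rewriting path.

    Returns True if the target line was found, False otherwise.
    """
    found = False
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that os.replace
    # does not have to move data across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w", encoding=encoding) as fout, \
                open(path, "r", encoding=encoding) as fin:
            for line in fin:
                fout.write(line)
                if not found and line.strip() == target:
                    fout.write(new_text)
                    found = True
        os.replace(tmp_path, path)  # move the output over the input
    except BaseException:
        os.remove(tmp_path)  # clean up the temporary file on failure
        raise
    return found
```

The original file is only replaced once the new content has been written completely, so a failure halfway through should leave the input untouched.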
Frerich Raabe...it worked perfectly for me...good one...thanks!!!
import shutil

def insertAfter(haystack, needle, newText):
    #""" Inserts 'newText' into 'haystack' right after 'needle'. """
    i = haystack.find(needle)
    if i == -1:
        return haystack  # leave lines without the needle unchanged
    return haystack[:i + len(needle)] + newText + haystack[i + len(needle):]

with open(sddraft) as f1:  # sddraft holds the path of the input file
    tf = open("<path to your file>", 'a+')
    # Read the lines in the file and insert the required content
    for line in f1.readlines():
        build = insertAfter(line, "<string to find in your file>", "<new value to be inserted after the string is found in your file>") # inserts value
        tf.write(build)
    tf.close()
shutil.copy("<path to the source file --> tf>", "<path to the destination where tf needs to be copied with the file name>")
Hope this helps someone:)
