Reading a structured text file in Python

I have a text file in the following format:
1. AUTHOR1
(blank line, with a carriage return)
Citation1
2. AUTHOR2
(blank line, with a carriage return)
Citation2
(...)
That is, in this file, some lines begin with an integer, followed by a dot, a space, and text giving an author's name; each of these lines is followed by a blank line (which includes a carriage return) and then by a line of text beginning with an alphabetic character (an article or book citation).
What I want is to read this file into a Python list, joining the author's names and citation, so that each list element is of the form:
['AUTHOR1 Citation1', 'AUTHOR2 Citation2', '...']
It looks like a simple programming problem, but I could not figure out a solution to it. What I attempted was as follows:
articles = []
with open("sample.txt", "rb") as infile:
    while True:
        text = infile.readline()
        if not text: break
        authors = ""
        citation = ""
        if text == '\n': continue
        if text[0].isdigit():
            authors = text.strip('\n')
        else:
            citation = text.strip('\n')
        articles.append(authors + ' ' + citation)
but the articles list gets authors and citations stored as separate elements!
Thanks in advance for any help in solving this vexing problem... :-(

Assuming your input file structure:
"""
1. AUTHOR1
Citation1
2. AUTHOR2
Citation2
"""
is not going to change, I would use readlines() and slicing:
with open('sample.txt', 'r') as infile:
    lines = infile.readlines()
    if lines:
        lines = filter(lambda x: x != '\n', lines)  # remove empty lines
        auth = map(lambda x: x.strip().split('.')[-1].strip(), lines[0::2])
        cita = map(lambda x: x.strip(), lines[1::2])
        result = ['%s %s' % (auth[i], cita[i]) for i in xrange(len(auth))]
        print result
        # ['AUTHOR1 Citation1', 'AUTHOR2 Citation2']

The problem is that in each loop iteration you only get one of the two pieces, the author or the citation, never both, so when you append you only have one of them.
One way to fix this is to read both in the same loop iteration.
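For example, a rough sketch of that idea, assuming the file always follows the exact numbered-author / blank line / citation pattern shown in the question:
articles = []
with open("sample.txt") as infile:
    for line in infile:
        line = line.strip()
        if line and line[0].isdigit():             # author line, e.g. "1. AUTHOR1"
            author = line.split('.', 1)[1].strip()
            next(infile)                           # skip the blank line
            citation = next(infile).strip()        # read the citation line
            articles.append(author + ' ' + citation)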

This should work:
articles = []
with open("sample.txt") as infile:
    for raw_line in infile:
        line = raw_line.strip()
        if not line:
            continue
        if line[0].isdigit():
            author = line.split(None, 1)[-1]
        else:
            articles.append('{} {}'.format(author, line))

Solution processing a full entry in each loop iteration:
citations = []
with open('sample.txt') as file:
    for author in file:                # Reads an author line
        next(file)                     # Reads and ignores the empty line
        citation = next(file).strip()  # Reads the citation line
        author = author.strip().split(' ', 1)[1]
        citations.append(author + ' ' + citation)
print(citations)
Solution first reading all lines and then going through them:
citations = []
with open('sample.txt') as file:
    lines = list(map(str.strip, file))
for author, citation in zip(lines[::3], lines[2::3]):
    author = author.split(' ', 1)[1]
    citations.append(author + ' ' + citation)
print(citations)

The solutions based on slicing are pretty neat, but if there's just one blank line out of place, it throws the whole thing off. Here's a solution using regex which should work even if there's a variation in the structure:
import re

pattern = re.compile(r'^\d+\.\s*(.*\S)\s*$\n*(^\w.*$)', re.MULTILINE)
with open("sample.txt") as infile:
    text = infile.read()
matches = pattern.findall(text)
formatted_output = [author + ' ' + citation for author, citation in matches]

You can use readline to skip empty lines.
Here's your loop body:
author = infile.readline().strip().split(' ', 1)[1]
infile.readline()  # skip the blank line
citation = infile.readline().strip()
articles.append("{} {}".format(author, citation))
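For completeness, a rough sketch of how that body could sit inside a full loop; the end-of-file check on an empty read is an assumption, not part of the original suggestion:
articles = []
with open("sample.txt") as infile:
    while True:
        line = infile.readline()
        if not line:  # end of file
            break
        author = line.strip().split(' ', 1)[1]
        infile.readline()  # skip the blank line
        citation = infile.readline().strip()
        articles.append("{} {}".format(author, citation))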

Related

Python list iterate not working as expected

I have a file called list.txt:
['d1','d2','d3']
I want to loop through all the items in the list. Here is the code:
deviceList = open("list.txt", "r")
deviceList = deviceList.read()
for i in deviceList:
    print(i)
The issue is that, when I run the code, it prints every character separately:
% python3 run.py
[
'
d
1
'
,
'
d
2
'
,
'
d
3
'
]
It's as if all the items were treated as one string? I think it needs to be parsed? Please let me know what I am missing.
That is simply because you do not have a list; you are reading plain text...
I suggest writing the list without the [] so you can use the split() function.
Write the file like this: d1;d2;d3
and use this script to obtain a list
f = open("filename", 'r')
line = f.readline()
f.close()
items = line.strip().split(";")
if you need the [] in the file, simply add a strip() function like this
f = open("filename", 'r')
line = f.readline()
f.close()
stripped = line.strip().strip("[]")
items = stripped.split(";")
should work the same
This isn't the cleanest solution, but it will do if your .txt file is always just in the "[x,y,z]" format.
deviceList = open("list.txt", "r").read().strip()
deviceList = deviceList[1:-1]
deviceList = deviceList.split(",")
for i in deviceList:
    print(i)
This takes your string, strips the "[" and "]", and then separates the entire string between the commas and turns that into a list. As other users have suggested, there are probably better ways to store this list than a text file as it is, but this solution will do exactly what you are asking. Hope this helps!
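Alternatively, if the file really does contain a Python list literal like ['d1','d2','d3'], one common approach (just a sketch, not part of the answers above) is to parse it with ast.literal_eval from the standard library:
import ast

with open("list.txt") as f:
    deviceList = ast.literal_eval(f.read())  # turns the text "['d1','d2','d3']" into a real list

for device in deviceList:
    print(device)  # prints d1, then d2, then d3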

How to open a file in python, read the comments ("#"), find a word after the comments and select the word after it?

I have a function that loops through a file that looks like this:
"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content
That is to say that after the "#" there is a comment. My function aims to read each line, and if it starts with a specific word, select what comes after the ":".
For example, with these two lines, I would like to read through them, and if a line starts with "#" and contains the word "Column.4", the word "pre_edge" should be stored.
An example of my current approach follows:
with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")
I think my trouble is specifically this: after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
If a # comment contains the string Column.4: as stated above, you could parse it this way.
with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # Here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: '
                # So if you want to get the first word ->
                word = remainder.split()[0]
        else:
            # Here you can process lines that are not comments
            pass
Note
Also, it is good practice to use the for line in f: statement instead of f.readlines() (as mentioned in other answers), because this way you don't load all lines into memory but process them one by one.
You should start by reading the file into a list and then work through that instead:
file = 'test.txt'  # <- call the file whatever you want
with open(file, "r") as f:
    txt = f.readlines()

for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")
Output:
>>> ['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
>>> pre_edge
Used a try/except because the first line also starts with "#" and we can't split it with your current logic.
Also, as a side note, in the question you have the file with lines starting as "#" with the quotation marks so the startswith() function was altered as such.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()

for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])
        # pre_edge

Removing word from the beginning of my text object?

I have a function that scrapes speeches from millercenter.org and returns the processed speech. However, every one of my speeches has the word "transcript" at the beginning (that's just how it's coded into the HTML). So, all of my text files look like this:
\n <--- there's really just a new line, here, not literally '\n'
transcript
fourscore and seven years ago, blah blah blah
I have these saved in my U:/ drive - how can I iterate through these files and remove 'transcript'?
Edit:
import glob

speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read()
    filecontent.replace('transcript', '', 1)
    speech_dict[filename] = filecontent  # put the speeches into a dictionary to run through the algorithm
This is not doing anything to change my speeches. 'transcript' is still there.
I also tried putting it into my text-processing function, but that doesn't work, either:
def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub(' ', item_str)
    item_str_processed_final = item_str_processed.replace('—', ' ').replace('transcript', '', 1)
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return filename, item_str_processed_final  # giving back filename and the text itself
Here's an example url I run through processURL: http://millercenter.org/president/harding/speeches/speech-3805
You can use Python's excellent replace() for this:
data = data.replace('transcript', '', 1)
This line will replace 'transcript' with '' (the empty string). The final parameter is the maximum number of replacements to make: 1 replaces only the first instance of 'transcript'; omit it to replace all instances. Note that strings are immutable, so replace() returns a new string rather than modifying the original; you have to assign the result back, which is what your current code is missing.
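Applied to the loop from the question, that might look something like this (a sketch reusing the question's own glob pattern):
import glob

speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read()
    # assign the result back; str.replace() does not modify the string in place
    filecontent = filecontent.replace('transcript', '', 1)
    speech_dict[filename] = filecontent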
If you know that the data you want always starts on line x then do this:
with open('filename.txt', 'r') as fin:
    for _ in range(x):  # This loop will skip x no. of lines.
        next(fin)
    for line in fin:
        # do something with the line.
        print(line)
Or let's say you want to remove any lines before transcript:
with open('filename.txt', 'r') as fin:
    while next(fin).strip() != 'transcript':  # This loop skips lines until the *transcript* line has been read.
        pass
    # if you want to skip the empty line after *transcript*
    next(fin)  # skips the next line.
    for line in fin:
        # do something with the line.
        print(line)

Python Insert text before a specific line

I want to insert some text before a specific line. Specifically, I want to insert 'Hello Everyone' before the line starting with 'Number'.
My code:
import re

result = []
with open("text2.txt", "r+") as f:
    a = [x.rstrip() for x in f]  # stores all lines from f into an array and removes "\n"
    # Find the first occurrence of "Number" and store its index
    for item in a:
        if item.startswith("Number"):  # same as your re check
            break
    ind = a.index(item)  # here it produces index no./line no.
    result.extend(a[:ind])
    f.write('Hello Everyone')
Text file:
QWEW
RW
...
Number hey
Number ho
Expected output:
QWEW
RW
...
Hello Everyone
Number hey
Number ho
Please help me fix my code: I don't get anything inserted into my text file! Please help!
Answers will be appreciated!
The problem
When you do open("text2.txt", "r"), you open your file for reading, not for writing. Therefore, nothing appears in your file.
The fix
Using r+ instead of r allows you to also write to the file (this was also pointed out in the comments). However, it overwrites, so be careful (this is an OS limitation, as described e.g. here). The following should do what you desire: it inserts "Hello everyone" into the list of lines and then overwrites the file with the updated lines.
with open("text2.txt", "r+") as f:
a = [x.rstrip() for x in f]
index = 0
for item in a:
if item.startswith("Number"):
a.insert(index, "Hello everyone") # Inserts "Hello everyone" into `a`
break
index += 1
# Go to start of file and clear it
f.seek(0)
f.truncate()
# Write each line back
for line in a:
f.write(line + "\n")
The correct answer to your problem is the hlt one, but consider also using the fileinput module:
import fileinput

found = False
for line in fileinput.input('DATA', inplace=True):
    if not found and line.startswith('Number'):
        print('Hello everyone')
        found = True
    print(line, end='')
This is basically the same question as here: they propose to do it in three steps: read everything / insert / rewrite everything
with open("/tmp/text2.txt", "r") as f:
lines = f.readlines()
for index, line in enumerate(lines):
if line.startswith("Number"):
break
lines.insert(index, "Hello everyone !\n")
with open("/tmp/text2.txt", "w") as f:
contents = f.writelines(lines)

Need to remove line breaks from a text file with certain conditions

I have a text file running into 20,000 lines. A block of meaningful data for me would consist of name, address, city, state, zip, and phone. My file has each of these on a new line, so a file would go like:
StoreName1
, Address
, City
,State
,Zip
, Phone
StoreName2
, Address
, City
,State
,Zip
, Phone
I need to create a CSV file and will need the above information for each store in 1 single line :
StoreName1, Address, City,State,Zip, Phone
StoreName2, Address, City,State,Zip, Phone
So essentially, I am trying to remove \r\n only at the appropriate points. How do I do this with Python re? Examples would be very helpful, as I am a newbie at this.
Thanks.
s/[\r\n]+,/,/g
Globally substitute 'linebreak(s),' with ','
Edit:
If you want to reduce it further with a single linebreak between records:
s/[\r\n]+(,|[\r\n])/$1/g
Globally substitute 'linebreak(s) followed by (comma or linebreak)' with capture group 1.
Edit:
And, if it really gets out of whack, this might cure it:
s/[\r\n]+\s*(,|[\r\n])\s*/$1/g
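In Python, these substitutions can be applied with re.sub along these lines (a sketch assuming the whole file fits in memory; note that Python uses \1 rather than $1 for backreferences, and the file names here are placeholders):
import re

with open('stores.txt') as f:  # placeholder input file name
    text = f.read()

# collapse 'linebreak(s) followed by a comma' into just the comma
text = re.sub(r'[\r\n]+,', ',', text)
# optionally reduce further: 'linebreak(s) followed by a comma or a linebreak' becomes that character
text = re.sub(r'[\r\n]+(,|[\r\n])', r'\1', text)

with open('stores.csv', 'w') as f:  # placeholder output file name
    f.write(text)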
This iterator/generator version doesn't require reading the entire file into memory at once
from itertools import groupby

with open("inputfile.txt") as f:
    groups = groupby(f, key=str.isspace)
    for row in ("".join(map(str.strip, x[1])) for x in groups if not x[0]):
        ...
Assuming the data is "normal" - see my comment - I'd approach the problem this way:
with open('data.txt') as fhi, open('newdata.txt', 'w') as fho:
    # Iterate over the input file.
    for store in fhi:
        # Read in the rest of the pertinent data.
        fields = [next(fhi).rstrip() for _ in range(5)]
        # Generate a list of all fields for this store.
        row = [store.rstrip()] + fields
        # Output to the new data file.
        fho.write('%s\n' % ''.join(row))
        # Consume a blank line in the input file.
        next(fhi)
First mind-numbing solution
import re

ch = ('StoreName1\r\n'
      ', Address\r\n'
      ', City\r\n'
      ',State\r\n'
      ',Zip\r\n'
      ', Phone\r\n'
      '\r\n'
      'StoreName2\r\n'
      ', Address\r\n'
      ', City\r\n'
      ',State\r\n'
      ',Zip\r\n'
      ', Phone')

regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '(.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,[^\r\n]+)')

with open('csvoutput.txt', 'wb') as f:
    f.writelines(''.join(mat.groups()) + '\r\n' for mat in regx.finditer(ch))
ch mimics the content of a file on a Windows platform (newlines == \r\n)
Second mind-numbing solution
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,[^\r\n]+')

with open('csvoutput.txt', 'wb') as f:
    f.writelines(mat.group().replace('\r\n', '') + '\r\n' for mat in regx.finditer(ch))
Third mind-numbing solution, if you want to create a CSV file with delimiters other than commas:
import csv

regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,([^\r\n]+)')

with open('csvtry3.txt', 'wb') as f:
    csvw = csv.writer(f, delimiter='#')
    for mat in regx.finditer(ch):
        csvw.writerow(mat.groups())
EDIT 1
You are right, tchrist, the following solution is far simpler:
regx = re.compile('(?<!\r\n)\r\n')

with open('csvtry.txt', 'wb') as f:
    f.write(regx.sub('', ch))
EDIT 2
A regex isn't required:
with open('csvtry.txt', 'wb') as f:
    f.writelines(x.replace('\r\n', '') + '\r\n' for x in ch.split('\r\n\r\n'))
EDIT 3
Processing a file instead of the ch string, an 'à la gnibbler' solution for cases when the file is too big to be read into memory all at once:
from itertools import groupby

with open('csvinput.txt', 'r') as f, open('csvoutput.txt', 'w') as g:
    groups = groupby(f, key=lambda v: not str.isspace(v))
    g.writelines(''.join(x).replace('\n', '') + '\n' for k, x in groups if k)
I have another solution with regex:
import re

regx = re.compile('^((?:.+?\n)+?)(?=\n|\Z)', re.MULTILINE)

with open('input.txt', 'r') as f, open('csvoutput.txt', 'w') as g:
    g.writelines(mat.group().replace('\n', '') + '\n' for mat in regx.finditer(f.read()))
I find it similar to the gnibbler-like solution
f = open(infilepath, 'r')
s = ''.join([line for line in f])
s = s.replace('\n\n', '\\n')
s = s.replace('\n', '')
s = s.replace("\\n", "\n")
f.close()

f = open(infilepath, 'w')
f.write(s)
f.close()
That should do it. It will replace your input file with the new format
