How do I delete the title from a text file in Python?

I have around 2,000 text files containing summaries of news articles and I want to remove the title from all the files that have titles (some don't have titles for some reason) using Python.
Here's an example:
Ad sales boost Time Warner profit
Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Time Warner's fourth quarter profits were slightly better than analysts' expectations.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.
My question is how to remove the line "Ad sales boost Time Warner profit".
Edit: I basically want to remove everything before a line break.
TIA.

If it's (as you say) just a simple matter of removing the first line, when followed by \n\n, you could use a simple regex like this:
import re
with open('testing.txt', 'r') as fin:
    doc = fin.read()
doc = re.sub(r'^.+?\n\n', '', doc)

Try this. It splits the text at the first blank line ("\n\n") and keeps only the last element (the body):
line.split('\n\n', 1)[-1]
This also works when there is no blank line in the text, because split then returns the whole text as a single element.
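Since the question involves around 2,000 files, the same idea can be applied to a whole folder. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper names strip_title and strip_titles_in_folder, the "*.txt" pattern, and the UTF-8 encoding are all hypothetical choices, and a "title" is taken to be whatever precedes the first blank line.

```python
from pathlib import Path


def strip_title(text):
    """Return text without its title (whatever precedes the first blank line)."""
    head, sep, body = text.partition("\n\n")
    # partition returns an empty separator when no blank line exists,
    # so untitled text is returned unchanged.
    return body if sep else text


def strip_titles_in_folder(folder, pattern="*.txt"):
    """Rewrite every matching file in folder; return how many files changed."""
    changed = 0
    for path in sorted(Path(folder).glob(pattern)):
        original = path.read_text(encoding="utf-8")  # assumed encoding
        stripped = strip_title(original)
        if stripped != original:
            path.write_text(stripped, encoding="utf-8")
            changed += 1
    return changed
```

Because the files are overwritten in place, it is worth running this on a copy of the folder first.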

Rewriting a file while you are still reading it is error-prone, so the solution in this case is to read the lines into a variable, modify them, and write them back to the file.
lines = []
# open the text file in read mode and readlines (returns a list of lines)
with open('textfile.txt', 'r') as file:
    lines = file.readlines()
# open the text file in write mode and write lines
with open('textfile.txt', 'w') as file:
    # if the number of lines is bigger than 1 (assumption) write summary else write all lines
    file.writelines(lines[2:] if len(lines) > 1 else lines)
The above is a simple example of how you can achieve what you're after, but keep in mind that edge cases may exist. For example, an untitled file whose body spans several lines would lose its first two lines.

This will remove everything before the first line break ('\n\n').
with open('text.txt', 'r') as file:
    f = file.read()
idx = f.find('\n\n') # Search for a line break
if idx > 0: # If found, return everything after it
    g = f[idx+2:]
else: # Otherwise, return the original text file
    g = f
print(g)
# Save the file
with open('text.txt', 'w') as file:
    file.write(g)
"Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Time Warner's fourth quarter profits were slightly better than analysts' expectations.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n"

Related

How to change specific lines in a text file and enforce this change when writing output file

I'm having an issue with implementing a change on specific lines in a text file. I have looped over the lines and identified the ones starting with a specific prefix (N2).
I'm trying to wrap each abstract so that no line is longer than 100 characters. The file is output from an online source and contains numerous abstracts, each on a line prefixed with N2.
The information appears as separate lines in the text file, ForEoin.txt:
TY - JOUR
ID - 31513460
T1 - Systematic Review: Clinical Metabolomics to Forecast Outcomes in Liver Transplantation Surgery.
A1 - Attard, Joseph A
A1 - Dunn, Warwick B
A1 - Mergental, Hynek
A1 - Mirza, Darius F
A1 - Afford, Simon C
A1 - Perera, M Thamara P R
Y1 - 2019//
N2 - Liver transplantation is an effective intervention for end-stage liver disease, fulminant hepatic failure, and early hepatocellular carcinoma. Yet, there is marked patient-to-patient variation in liver transplantation outcomes. This calls for novel diagnostics to enable rational deployment of donor livers. Metabolomics is a postgenomic high-throughput systems biology approach to diagnostic innovation in clinical medicine. We report here an original systematic review of the metabolomic studies that have identified putative biomarkers in the context of liver transplantation. Eighteen studies met the inclusion criteria that involved sampling of blood (n = 4), dialysate fluid (n = 4), bile (n = 5), and liver tissue (n = 5). Metabolites of amino acid and nitrogen metabolism, anaerobic glycolysis, lipid breakdown products, and bile acid metabolism were significantly different in transplanted livers with and without graft dysfunction. However, criteria for defining the graft dysfunction varied across studies. This systematic review demonstrates that metabolomics can be deployed in identification of metabolic indicators of graft dysfunction with a view to implicated molecular mechanisms. We conclude the article with a horizon scanning of metabolomics technology in liver transplantation and its future prospects and challenges in research and clinical practice.
KW - *Biomarkers
KW - Genotype
So far I have iterated over the lines of the file and used the textwrap module to wrap them, but I can't get my head around writing the new wrapped lines over the existing lines in the output file.
#!/usr/bin/env python
import textwrap
filename_org = 'ForEoin.txt'
filename_new = 'Eoin_Shortline_v2'
with open(filename_org, 'r') as rf:
    with open(filename_new, 'w') as wf:
        for line in rf:
            if line.startswith("N2"):
                wrapper = textwrap.TextWrapper(width=100)
                new_line = wrapper.fill(text=line)
                wf.write(new_line)
Do you just need an else statement, to write the line un-altered if it doesn't start with N2?
with open(filename_org, 'r') as rf:
    with open(filename_new, 'w') as wf:
        for line in rf:
            if line.startswith("N2"):
                wrapper = textwrap.TextWrapper(width=100)
                new_line = wrapper.fill(text=line)
                # fill() drops the trailing newline, so add it back
                wf.write(new_line + "\n")
            else:
                wf.write(line)
import textwrap
filename_org = 'ForEoin.txt'
filename_new = 'Eoin_Shortline_v2'
with open(filename_org, 'r') as rf:
    with open(filename_new, 'w') as wf:
        for line in rf:
            if line.startswith("N2"):
                wrapper = textwrap.TextWrapper(width=100)
                new_line = wrapper.fill(text=line)
                # fill() drops the trailing newline, so add it back
                wf.write(new_line + "\n")
            else:
                wf.write(line)
If the line starts with "N2", wrap it and write the result to the file. In the else branch you also have to write the unchanged line to the file.
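The wrap-or-pass-through logic can also be written as a small generator, which makes it easy to try on a few strings before touching real files. This is only a sketch: the function name wrap_prefixed_lines and its parameters are hypothetical, not part of the original code.

```python
import textwrap


def wrap_prefixed_lines(lines, prefix="N2", width=100):
    """Yield lines unchanged, except lines starting with prefix, which are wrapped."""
    wrapper = textwrap.TextWrapper(width=width)  # built once, reused for every line
    for line in lines:
        if line.startswith(prefix):
            # fill() strips the trailing newline, so it is added back here
            yield wrapper.fill(line) + "\n"
        else:
            yield line
```

With the two files open as in the answers above, the loop would then reduce to wf.writelines(wrap_prefixed_lines(rf)).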

convert output received to dataframe in python

I have selected some fields from a JSON file and saved each name along with its respective comments so that I can preprocess them.
Below is the code:
import re
import json
import string
with open('C:/Users/User/Desktop/Coding/parsehubjsonfileeg/all.json', encoding='utf8') as f:
    data = json.load(f)
# dictionary for element which you want to keep
new_data = {'selection1': []}
print(new_data)
# copy item from old data to new data if it has 'reviews'
for item in data['selection1']:
    if 'reviews' in item:
        new_data['selection1'].append(item)
        print(item['reviews'])
        print('--')
# save in file
with open('output.json', 'w') as f:
    json.dump(new_data, f)
selection1 = new_data['selection1']
for item in selection1:
    name = item['name']
    print('>>>>>>>.', name)
    CommentID = item['reviews']
    for com in CommentID:
        comment = com['review'].lower() # converting all to lowercase
        result = re.sub(r'\d+', '', comment) # remove numbers
        results = (result.translate(
            str.maketrans('', '', string.punctuation))).strip() # remove punctuations and white spaces
        comments = (results)
        print(comment)
My output is:
>>>>>>>. Heritage The Villas
we booked at villa valriche through mari deal for 2 nights and check-in was too lengthy (almost 2 hours) and we were requested to make a deposit of rs 10,000 or credit card which we were never informed about it upon booking.
lovely place to recharge.
one word: suoerb
definitely not a 5 star. extremely poor staff service.
>>>>>>>. Oasis Villas by Evaco Holidays
excellent
spent 3 days with my family and really enjoyed my stay. the advantage of oasis is its privacy - with 3 children under 6 years, going to dinner/breakfast at hotels is often a burden rather than an enjoyable experience.
staff were very friendly and welcoming. artee and menni made sure everything was fine and brought breakfast - warm croissants - every morning. atish made the check-in arrangements - and was fast and hassle free.
will definitely go again!
What should I do to convert this output to a DataFrame with the columns name and comment?
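One possible approach, sketched here under assumptions, is to collect one row per review while looping and hand the rows to pandas at the end. The helper names collect_rows and to_dataframe are hypothetical. The structure (a list of items with 'name' and 'reviews' keys, each review having a 'review' key) is taken from the code above, and pandas is assumed to be installed.

```python
def collect_rows(selection):
    """Flatten the nested reviews into one {name, comment} dict per review."""
    rows = []
    for item in selection:
        # items without a 'reviews' key contribute no rows
        for review in item.get("reviews", []):
            rows.append({"name": item["name"], "comment": review["review"]})
    return rows


def to_dataframe(rows):
    """Build the DataFrame; pandas is imported here so collect_rows works without it."""
    import pandas as pd
    return pd.DataFrame(rows, columns=["name", "comment"])
```

Any cleaning step, such as the lowercasing and punctuation removal shown earlier, could be applied to the comment before it is appended.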

Replacing a word in a text file

Does anyone know how to replace a word in a text file?
Here's one line from my stock file:
bread 0.99 12135479 300 200 400
I want to be able to replace my 4th word (in this instance '300') when I print 'productline' with a new number created by the nstock part of this code:
for line in details: #for every line in the file:
    if digits in line: #if the barcode is in the line
        productline=line #it stores the line as 'productline'
        itemsplit=productline.split(' ') #separates into different words
        price=float(itemsplit[1]) #the price is the second part of the line
        current=int(itemsplit[3]) #the current stock level is the fourth part of the line
        quantity=int(input("How much of the product do you wish to purchase?\n"))
        if quantity<current:
            total=(price)*(quantity) #this works out the price
            print("Your total spent on this product is:\n" + "£" +str(total)+"\n") #this tells the user, in total how much they have spent
            with open("updatedstock.txt","w") as f:
                f.writelines(productline) #writes the line with the product in
            nstock=int(current-quantity) #the new stock level is the current level minus the quantity
My code does not replace the 4th word (which is the current stock level) with the new stock level (nstock).
Actually you can use regular expressions for that purpose.
import re
string1='bread 0.99 12135479 300 200 400'
pattern='300'
to_replace_with="youyou"
string2=re.sub(pattern, to_replace_with, string1)
You will have the output below:
'bread 0.99 12135479 youyou 200 400'
Hope this was what you were looking for ;)
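Note that re.sub with the pattern '300' replaces every occurrence of 300 in the line, so it would also change a barcode or another stock figure that happens to contain those digits. An alternative sketch that replaces a field by position instead is shown below. The function name replace_field and the purchase of 25 units are hypothetical.

```python
def replace_field(line, index, new_value, sep=" "):
    """Return line with the field at position index (0-based) replaced by new_value."""
    fields = line.rstrip("\n").split(sep)
    fields[index] = str(new_value)
    # keep the trailing newline if the original line had one
    return sep.join(fields) + ("\n" if line.endswith("\n") else "")


productline = "bread 0.99 12135479 300 200 400\n"
nstock = 300 - 25  # hypothetical purchase of 25 units
print(replace_field(productline, 3, nstock), end="")
# prints: bread 0.99 12135479 275 200 400
```

The returned string is what would be written to the stock file in place of the original productline.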

Python for loop iteration to merge multiple lines in a single line

I have a CSV file that I am trying to parse, but the problem is that one of the cells contains blocks of data full of nulls and line breaks. I need to enclose each row inside an array and merge all the content from this particular cell within its corresponding row. I recently posted a similar question and the answer solved my problem partially, but I am having problems building a loop that iterates through every single line that does not meet a certain start condition. The code that I have merges only the first line that does not meet that condition, but it breaks after that.
I have:
file = "myfile.csv"
condition = "DAT"
data = open(file).read().split("\n")
for i, line in enumerate(data):
    if not line.startswith(condition):
        data[i-1] = data[i-1]+line
        data.pop(i)
print(data)
For a CSV that looks like this:
Case | Info
-------------------
DAT1 single line
DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.
Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.
Kraft met the young sports fan and attended the HBO premiere of the documentary in New York in October. Kraft made a $500,000 matching pledge to the foundation.
The Boston Globe reported that Berns was invited to a Patriots practice that month, and gave the players an impromptu motivational speech.
DAT3 single line
DAT4 YWYWQIDOWCOOXXOXOOOOOOOOOOO
It does join the full sentence with the previous line. But when it hits a double space or double line it fails and registers it as a new line. For example, if I print:
data[0]
The output is:
DAT1 single line
If I print:
data[1]
The output is:
DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.
But if I print:
data[2]
The output is:
Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.
Instead of:
DAT3 single line
How do I merge that full block of text in the column "Info" so that it always stays with the corresponding DAT row instead of popping up as a new row, regardless of null or newline characters?
You can extract the lines directly into data with a regular expression:
import re
f = open("myfile.csv")
text = f.read()
data = re.findall(r"\n(DAT\d+.*)", text)
Correct me if this doesn't help.
UPDATE:
I believe this would fix the problem with new lines:
import re
f = open("myfile.csv")
text = f.read()
lines = re.split(r"\n(?=DAT\d+)", text)
lines.pop(0)
Changing data while iterating over it is "bad". Build a new list instead:
new_data = []
for line in data:
    if not new_data or line.startswith(condition):
        new_data.append(line)
    else:
        new_data[-1] += line
print(new_data)
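The new-list approach can be tried on a small in-memory sample before running it on the real file. The sketch below makes a few assumptions that are not in the original answer: the function name merge_continuation_lines is hypothetical, blank lines are skipped, and a single space replaces each removed line break.

```python
def merge_continuation_lines(lines, prefix="DAT"):
    """Append every non-blank line that does not start with prefix to the previous record."""
    records = []
    for line in lines:
        if not records or line.startswith(prefix):
            records.append(line)
        elif line.strip():
            # a space stands in for the removed line break
            records[-1] += " " + line
    return records


sample = 'DAT1 single line\nDAT2 "first paragraph\n\nsecond paragraph"\nDAT3 single line'
print(merge_continuation_lines(sample.split("\n")))
# prints: ['DAT1 single line', 'DAT2 "first paragraph second paragraph"', 'DAT3 single line']
```

Because nothing is removed from the list being iterated, blank lines and consecutive continuation lines no longer shift the indices.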

python script to split a file in two parts, name each one separately

I'm trying to code up a map visualization using d3.js and crossfilter, right now I have a big file and some pernicious row that is breaking the whole thing.
I want to create a script to split my input data into two halves so I can narrow down the source of the problem and thereby eliminate it whilst preserving my sanity.
The input data looks like this:
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2ODE3ODgifQ.1u6YvzMuu_HbWqRaMwFd8zYNP43w7wYFnRbl5r2qSoY,C# Developer,Connectus,Chesterton,52.202499,0.131237,United Kingdom,statistics,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2ODk1ODIifQ.jxcx56YcDm-4nmB8VvoIGQKew4yquszeaPon60hcDKs,Senior Java Developer,Redhill,Godstow,51.784375,-1.308003,United Kingdom,java|metadata,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2OTEyMjIifQ.qK3xtYQDxRpKJkNargPu6Jef4njm2fSZnNIVulRHoqA,Software Development Manager,Spring Technology ,Woolstone,52.042198,-0.7047,United Kingdom,software development|sdlc|data analysis,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI4NDM1MzgifQ.pYnBX-APPdB3edTRC_M8x6usmBq_GfIxcdZOXSLJN04,Data Scientists Python R Scala Java or Matlab,Aspire Data Recruitment,East Boldon,54.94452,-1.42815,United Kingdom,data science|java|python|scala|matlab|analysis,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI4NzM4NTMifQ.mgRKEZh-0GLUXQmZ9Bp6H10haZNAieIKAH1uoWV63YU,Data Analyst - Programmatic Tech Company,Ultimate Asset Limited,London,51.50853,-0.12574,United Kingdom,data analysis|analysis|statistics,1
My idea is to divide it roughly evenly, so that I would then have this file:
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2ODE3ODgifQ.1u6YvzMuu_HbWqRaMwFd8zYNP43w7wYFnRbl5r2qSoY,C# Developer,Connectus,Chesterton,52.202499,0.131237,United Kingdom,statistics,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2ODk1ODIifQ.jxcx56YcDm-4nmB8VvoIGQKew4yquszeaPon60hcDKs,Senior Java Developer,Redhill,Godstow,51.784375,-1.308003,United Kingdom,java|metadata,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2OTEyMjIifQ.qK3xtYQDxRpKJkNargPu6Jef4njm2fSZnNIVulRHoqA,Software Development Manager,Spring Technology ,Woolstone,52.042198,-0.7047,United Kingdom,software development|sdlc|data analysis,1
and this one:
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2OTEyMjIifQ.qK3xtYQDxRpKJkNargPu6Jef4njm2fSZnNIVulRHoqA,Software Development Manager,Spring Technology ,Woolstone,52.042198,-0.7047,United Kingdom,software development|sdlc|data analysis,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI4NDM1MzgifQ.pYnBX-APPdB3edTRC_M8x6usmBq_GfIxcdZOXSLJN04,Data Scientists Python R Scala Java or Matlab,Aspire Data Recruitment,East Boldon,54.94452,-1.42815,United Kingdom,data science|java|python|scala|matlab|analysis,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI4NzM4NTMifQ.mgRKEZh-0GLUXQmZ9Bp6H10haZNAieIKAH1uoWV63YU,Data Analyst - Programmatic Tech Company,Ultimate Asset Limited,London,51.50853,-0.12574,United Kingdom,data analysis|analysis|statistics,1
for instance, naming them with a convention such that starting_input.csv becomes:
starting_input_a.csv
and
starting_input_b.csv
and then afterwards when I want to run it again:
starting_input_aa.csv
and
starting_input_ab.csv
and so on.
Can you follow my idea?
I tried this:
splitLen = 20 # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.
# This is shorthand and not friendly with memory
# on very large files, but it works.
input = open('input.txt', 'r').read().split('\n')
at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]
    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()
    # Increment the counter
    at += 1
but it didn't work.
Here is a hint.
Just read the file twice. Once to get the line count and then again to get the top half and bottom half.
Simple example. Given your 5 line example input:
$ cat /tmp/f1.txt
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2ODE3ODgifQ.1u6YvzMuu_HbWqRaMwFd8zYNP43w7wYFnRbl5r2qSoY,C# Developer,Connectus,Chesterton,52.202499,0.131237,United Kingdom,statistics,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2ODk1ODIifQ.jxcx56YcDm-4nmB8VvoIGQKew4yquszeaPon60hcDKs,Senior Java Developer,Redhill,Godstow,51.784375,-1.308003,United Kingdom,java|metadata,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI2OTEyMjIifQ.qK3xtYQDxRpKJkNargPu6Jef4njm2fSZnNIVulRHoqA,Software Development Manager,Spring Technology ,Woolstone,52.042198,-0.7047,United Kingdom,software development|sdlc|data analysis,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI4NDM1MzgifQ.pYnBX-APPdB3edTRC_M8x6usmBq_GfIxcdZOXSLJN04,Data Scientists Python R Scala Java or Matlab,Aspire Data Recruitment,East Boldon,54.94452,-1.42815,United Kingdom,data science|java|python|scala|matlab|analysis,1
http://www.edsa-project.eu/adzuna/eyJhbGciOiJIUzI1NiJ9.eyJzIjoia0EtLWlpVHhUMUNtSFM0SzE4TUVzUSIsImkiOiIzMzI4NzM4NTMifQ.mgRKEZh-0GLUXQmZ9Bp6H10haZNAieIKAH1uoWV63YU,Data Analyst - Programmatic Tech Company,Ultimate Asset Limited,London,51.50853,-0.12574,United Kingdom,data analysis|analysis|statistics,1
You can do something like this:
def divide(fn):
    # get total lines in file
    with open(fn) as f:
        lc = sum(1 for _ in f)
    with open(fn) as fin:
        # top half of file:
        for i, line in enumerate(fin):
            print(line, end="")
            if i >= lc // 2:
                break
        # middle
        print("=======")
        # remainder
        for line in fin:
            print(line, end="")
That will print 3 lines from the top of the file, then the '=======' divider, then the last 2 lines of the example.
Instead of printing, you can write to two files with 'a' and 'b' added to the base names. Reapply to the resulting files until you are done.
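A sketch of the file-writing variant follows. It differs from the hint above in that it reads the whole file into memory once rather than reading it twice, which is simpler but less friendly to very large files. The function name split_in_half is hypothetical, and the naming rule rests on an assumption: a file name whose stem already ends in an underscore followed only by the letters a and b is treated as the result of an earlier split.

```python
import re
from pathlib import Path


def split_in_half(filename):
    """Split filename into two files tagged 'a' and 'b'; return the two new paths."""
    path = Path(filename)
    with path.open() as f:
        lines = f.readlines()
    middle = (len(lines) + 1) // 2  # the first half gets the extra line when the count is odd
    # Assumption: a stem ending in "_" plus only a/b letters came from an earlier split
    already_split = re.fullmatch(r".+_[ab]+", path.stem)
    base = path.stem if already_split else path.stem + "_"
    outputs = []
    for tag, part in (("a", lines[:middle]), ("b", lines[middle:])):
        out = path.with_name(base + tag + path.suffix)
        with out.open("w") as f:
            f.writelines(part)
        outputs.append(out)
    return outputs
```

Under that assumption, starting_input.csv yields starting_input_a.csv and starting_input_b.csv, and splitting starting_input_a.csv again yields starting_input_aa.csv and starting_input_ab.csv.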
