I have selected some fields from a JSON file and saved each name along with its respective comments to do some preprocessing.
Below is my code:
import re
import json
import string  # needed for string.punctuation below

with open('C:/Users/User/Desktop/Coding/parsehubjsonfileeg/all.json', encoding='utf8') as f:
    data = json.load(f)

# dictionary for the elements which you want to keep
new_data = {'selection1': []}
print(new_data)

# copy an item from the old data to the new data if it has 'reviews'
for item in data['selection1']:
    if 'reviews' in item:
        new_data['selection1'].append(item)
        print(item['reviews'])
        print('--')

# save to file
with open('output.json', 'w') as f:
    json.dump(new_data, f)

selection1 = new_data['selection1']
for item in selection1:
    name = item['name']
    print('>>>>>>>.', name)
    CommentID = item['reviews']
    for com in CommentID:
        comment = com['review'].lower()  # convert everything to lowercase
        result = re.sub(r'\d+', '', comment)  # remove numbers
        results = (result.translate(
            str.maketrans('', '', string.punctuation))).strip()  # remove punctuation and whitespace
        comments = results
        print(comment)
My output is:
>>>>>>>. Heritage The Villas
we booked at villa valriche through mari deal for 2 nights and check-in was too lengthy (almost 2 hours) and we were requested to make a deposit of rs 10,000 or credit card which we were never informed about it upon booking.
lovely place to recharge.
one word: suoerb
definitely not a 5 star. extremely poor staff service.
>>>>>>>. Oasis Villas by Evaco Holidays
excellent
spent 3 days with my family and really enjoyed my stay. the advantage of oasis is its privacy - with 3 children under 6 years, going to dinner/breakfast at hotels is often a burden rather than an enjoyable experience.
staff were very friendly and welcoming. artee and menni made sure everything was fine and brought breakfast - warm croissants - every morning. atish made the check-in arrangements - and was fast and hassle free.
will definitely go again!
What should I do to convert this output into a DataFrame with a name column and a comment column?
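One possible sketch (not the only way): flatten the nested reviews into one row per comment and hand the rows to pandas. The inline `selection1` below is hypothetical sample data standing in for the parsed JSON:

```python
import re
import string
import pandas as pd

# hypothetical sample mirroring the structure of new_data['selection1']
selection1 = [
    {'name': 'Heritage The Villas',
     'reviews': [{'review': 'Lovely place to recharge.'},
                 {'review': 'One word: superb'}]},
    {'name': 'Oasis Villas by Evaco Holidays',
     'reviews': [{'review': 'Excellent'}]},
]

rows = []
for item in selection1:
    for com in item['reviews']:
        comment = com['review'].lower()                         # lowercase
        comment = re.sub(r'\d+', '', comment)                   # remove numbers
        comment = comment.translate(
            str.maketrans('', '', string.punctuation)).strip()  # remove punctuation
        rows.append({'name': item['name'], 'comment': comment})

df = pd.DataFrame(rows, columns=['name', 'comment'])
print(df)
```

Each villa name is repeated once per comment, which keeps the name/comment pairing intact when you later group or filter.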
I have an input file such as
[headline - https://prachatai.com/journal/2020/10/89984]
'ประยุทธ์' ขอบคุณทุกฝ่าย ยืนยันเจ้าหน้าที่ปฏิบัติตามหลักสากลทุกประการ - ด้านตำรวจยืนยันไม่มีการใช้กระสุนยางและแก๊สน้ำตากระชับพื้นที่ผู้ชุมนุม ระบุสารเคมีผสมน้ำไม่มีอันตราย ใช้เพื่อระุบตัวผู้ชุมนุมดำเนินคดีในอนาคต
เมื่อคืนวันที่ 16 ต.ค. 2563 อนุชา บูรพชัยศรี โฆษกประจำสำนักนายกรัฐมนตรี เปิดเผยว่า พล.อ. ประยุทธ์ จันทร์โอชา นายกรัฐมนตรี และรัฐมนตรีว่าการกระทรวงกลาโหม ขอขอบคุณเจ้าหน้าที่ทุกฝ่าย ประชาชนทุกกลุ่ม และผู้ชุมนุมที่ให้ความร่วมมือกับทางเจ้าหน้าที่ของรัฐในการยุติการชุมนุม
[headline - https://prachatai.com/english/about/internship]
Here is some english text
[headline - https://prachatai.com/english/node/8813]
Foreigners attended the protest at Thammasat University to show their support for the people of Thailand and their fight for democracy. The use of social media has greatly contributed to the expansion of foreign participation in protests.
A protester with a Guy Fawkes mask at the 19 Sept protest.
[headline - https://prachatai.com/journal/2020/10/89903]
ต.ค.62-ก.ย.63 แรงงานไทยในต่างประเทศส่งเงินกลับบ้าน 200,254 ล้านบาท
นายสุชาติ ชมกลิ่น รัฐมนตรีว่าการกระทรวงแรงงาน เปิดเผยว่า นับจากช่วงที่ประเทศไทยเข้าสู่สถานการณ์การแพร่ระบาดของโรคโควิด-19 ส่งผลกระทบต่อการจัดส่งแรงงานไทยไปทำงานต่างประเทศในภาพรวม เนื่องจากหลายประเทศที่เป็นเป้าหมายในการเดินทางไปทำงานของแรงงานไทย ชะลอการรับคนต่างชาติเข้าประเทศ
My goal here is to remove every English article. I have multiple large text files, so I want an efficient way to get rid of the English articles and keep everything else.
An example output would look like:
[headline - https://prachatai.com/journal/2020/10/89984]
'ประยุทธ์' ขอบคุณทุกฝ่าย ยืนยันเจ้าหน้าที่ปฏิบัติตามหลักสากลทุกประการ - ด้านตำรวจยืนยันไม่มีการใช้กระสุนยางและแก๊สน้ำตากระชับพื้นที่ผู้ชุมนุม ระบุสารเคมีผสมน้ำไม่มีอันตราย ใช้เพื่อระุบตัวผู้ชุมนุมดำเนินคดีในอนาคต
เมื่อคืนวันที่ 16 ต.ค. 2563 อนุชา บูรพชัยศรี โฆษกประจำสำนักนายกรัฐมนตรี เปิดเผยว่า พล.อ. ประยุทธ์ จันทร์โอชา นายกรัฐมนตรี และรัฐมนตรีว่าการกระทรวงกลาโหม ขอขอบคุณเจ้าหน้าที่ทุกฝ่าย ประชาชนทุกกลุ่ม และผู้ชุมนุมที่ให้ความร่วมมือกับทางเจ้าหน้าที่ของรัฐในการยุติการชุมนุม
[headline - https://prachatai.com/journal/2020/10/89903]
ต.ค.62-ก.ย.63 แรงงานไทยในต่างประเทศส่งเงินกลับบ้าน 200,254 ล้านบาท
นายสุชาติ ชมกลิ่น รัฐมนตรีว่าการกระทรวงแรงงาน เปิดเผยว่า นับจากช่วงที่ประเทศไทยเข้าสู่สถานการณ์การแพร่ระบาดของโรคโควิด-19 ส่งผลกระทบต่อการจัดส่งแรงงานไทยไปทำงานต่างประเทศในภาพรวม เนื่องจากหลายประเทศที่เป็นเป้าหมายในการเดินทางไปทำงานของแรงงานไทย ชะลอการรับคนต่างชาติเข้าประเทศ
As you can see, all the English articles are under
[headline - https://.../english/...
Each article begins with one of these [headline tags, which holds its URL, and the English articles happen to have english in their URLs.
So now I want to get rid of the English articles. How do I achieve this?
My current code:
with open('example.txt', 'r') as inputFile:
    data = inputFile.read().splitlines()

Outputtext = ""
for line in data:
    if line.startswith("[headline"):
        if "english" in line:  # str has no .contains() method; use the in operator
            pass  # somehow read until the next [headline and do the check
        else:
            Outputtext = Outputtext + line + "\n"
You can probably do this with just a regex. It may need to be tweaked to fit the specific rules of your formatting, though.
import re
all_articles = "..."
# match "[headline...english" and everything after till another "[headline"
english_article_regex = r"\[headline[^\]]*\/english[^\]]*].*?(?=(\[headline|$))"
result = re.sub(english_article_regex, "", all_articles, 0, re.DOTALL)
Here's the live example:
https://regex101.com/r/heKomA/3
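For a quick sanity check, the same regex can be run against a small inline sample (placeholder lines stand in for the Thai article bodies):

```python
import re

# small stand-in for the real file: one Thai article, one English, one Thai
all_articles = (
    "[headline - https://prachatai.com/journal/2020/10/89984]\n"
    "THAI TEXT ONE\n"
    "[headline - https://prachatai.com/english/node/8813]\n"
    "Here is some english text\n"
    "[headline - https://prachatai.com/journal/2020/10/89903]\n"
    "THAI TEXT TWO\n"
)

# match "[headline...english" and everything after it up to the next "[headline"
english_article_regex = r"\[headline[^\]]*\/english[^\]]*].*?(?=(\[headline|$))"
result = re.sub(english_article_regex, "", all_articles, 0, re.DOTALL)
print(result)
```

Only the English block (its headline plus its body) is removed; the two Thai articles survive untouched.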
I think if you had put a bit more time into it, you might have solved this problem yourself. When I look at your code, I see someone learning to program who is confused about what he needs to do.
You need to think step by step. Here, you have a text composed of articles, and you want to filter out some articles depending on a condition. What's the first thing you need to do?
You first need to know how to recognize an article. Is an article a pack of 3 lines in your file? Oh, the size changes, so you need another common factor. They all begin with [headline? Alright. Now I need to make "groups" of articles. There are very many ways you could do this, but I just wanted to give you an insight into how you could approach the problem. One step at a time.
Here is a solution to your problem, and it is far from the only one. Given this hello.txt as input:
HELLO
IGNORE
THESE
[headline - https://prachatai.com/journal/2020/10/89984]
NOENGLISHTEXT
MULTIPLE
LINES
TEXT
[headline - https://prachatai.com/english/about/internship]
Here is some english text
[headline - https://prachatai.com/english/node/8813]
Foreigners attended the protest at Thammasat University to show their support for the people of Thailand and their fight for democracy. The use of social media has greatly contributed to the expansion of foreign participation in protests.
A protester with a Guy Fawkes mask at the 19 Sept protest.
[headline - https://prachatai.com/journal/2020/10/89903]
NOENGLISHTEXT SECOND
MULTIPLE
LINES
And my solution, in pure python.
def filter_out_english_block(lines: list) -> str:
    filtered_lines = []
    flag = False
    for line in lines:
        if line.startswith("[headline"):
            if 'english' not in line:
                flag = True
            else:
                flag = False
        if flag:
            filtered_lines.append(line)
    return "".join(filtered_lines)
if __name__ == '__main__':
    with open("hello.txt", "r") as f:
        lines = f.readlines()
    print(lines)
    # ['HELLO\n', 'IGNORE\n', 'THESE\n', '[headline - https://prachatai.com/journal/2020/10/89984]\n', 'NOENGLISHTEXT\n', 'MULTIPLE\n', 'LINES\n', 'TEXT\n', '[headline - https://prachatai.com/english/about/internship]\n', 'Here is some english text\n', '[headline - https://prachatai.com/english/node/8813]\n', 'Foreigners attended the protest at Thammasat University to show their support for the people of Thailand and their fight for democracy. The use of social media has greatly contributed to the expansion of foreign participation in protests.\n', 'A protester with a Guy Fawkes mask at the 19 Sept protest.\n', '[headline - https://prachatai.com/journal/2020/10/89903]\n', 'NOENGLISHTEXT SECOND\n', 'MULTIPLE\n', 'LINES']
    new_text = filter_out_english_block(lines)
    print(new_text)
    # [headline - https://prachatai.com/journal/2020/10/89984]
    # NOENGLISHTEXT
    # MULTIPLE
    # LINES
    # TEXT
    # [headline - https://prachatai.com/journal/2020/10/89903]
    # NOENGLISHTEXT SECOND
    # MULTIPLE
    # LINES
The explanation:
I first iterate through the file as a list of lines.
I store a line only if I have previously seen a condition that suits me (here, a [headline line that does not contain the english string).
My storing flag is set to False by default, so the first lines are ignored until I see a headline that qualifies for storing.
I have a text file from which I need to extract the first five lines once a specified keyword occurs in the paragraph.
I am able to find the keywords but not able to write the next five lines after that keyword.
mylines = []
with open('D:\\Tasks\\Task_20\\txt\\CV (4).txt', 'rt') as myfile:
    for line in myfile:
        mylines.append(line)

for element in mylines:
    print(element, end='')

print(mylines[0].find("P"))
Please help if anybody has any idea of how to do this.
Input Text File Example:-
Philippine Partner Agency: ALL POWER STAFFING SOLUTIONS, INC.
Training Objectives: : To have international cultural exposure and hands-on experience in the field
of hospitality management as a gateway to a meaningful hospitality career. To develop my hospitality
management skills and become globally competitive.
Education
Institution Name: SOUTHVILLE FOREIGN UNIVERSITY - PHILIPPINES
Location Hom as Pinas City, Philippine Institution start date: (June 2007
Required Output:-
Training Objectives: : To have international cultural exposure and hands-on experience in the field
of hospitality management as a gateway to a meaningful hospitality career. To develop my hospitality
management skills and become globally competitive.
I have to search for the Training Objectives keyword in the text file, and once it is found, the script should write the next 5 lines only.
If you're simply trying to extract the entire "Training Objectives" block, look for the keyword and keep appending lines until you hit an empty line (or some other suitable marker, the next header for example).
(edited to handle multiple files and keywords)
def extract_block(filename, keywords):
    mylines = []
    with open(filename) as myfile:
        save_flag = False
        for line in myfile:
            if any(line.startswith(kw) for kw in keywords):
                save_flag = True
            elif line.strip() == '':
                save_flag = False
            if save_flag:
                mylines.append(line)
    return mylines

filenames = ['file1.txt', 'file2.txt', 'file3.txt']
keywords = ['keyword1', 'keyword2', 'keyword3']
for filename in filenames:
    block = extract_block(filename, keywords)
This assumes there is only 1 block that you want in each file. If you're extracting multiple blocks from each file, it would get more complicated.
If you really want 5 lines, always and every time, then you could do something similar but add a counter to count out your 5 lines.
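A sketch of that counter variant (assuming the keyword line itself is not counted among the 5, and using made-up sample lines):

```python
def extract_five_lines(lines, keyword):
    """Return the 5 lines that follow the first line containing keyword."""
    collected = []
    counting = False
    for line in lines:
        if counting:
            collected.append(line)
            if len(collected) == 5:  # stop once the 5 lines are counted out
                break
        elif keyword in line:
            counting = True
    return collected

# hypothetical sample: the keyword appears on the third line
sample = [f"line {i}" for i in range(10)]
sample[2] = "Training Objectives: example"
print(extract_five_lines(sample, "Training Objectives"))
```

If fewer than 5 lines remain after the keyword, the function simply returns what it found, which may or may not be the behaviour you want.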
It depends on where your \n's are, but I put together a regex that might help. Here is a sample of how my text looks in the variable st:
In [254]: st
Out[254]: 'Philippine Partner Agency: ALL POWER STAFFING SOLUTIONS, INC.\n\nTraining Objectives::\nTo have international cultural exposure and hands-on experience \nin the field of hospitality management as a gateway to a meaningful hospitality career. \nTo develop my hospitality management skills and become globally competitive.\n\n\nEducation Institution Name: SOUTHVILLE FOREIGN UNIVERSITY - PHILIPPINES Location Hom as Pinas City, Philippine Institution start date: (June 2007\n'
import re
re.findall('Training Objectives:.*\n((?:.*\n){1,5})', st)
Out[255]: ['To have international cultural exposure and hands-on experience \nin the field of hospitality management as a gateway to a meaningful hospitality career. \nTo develop my hospitality management skills and become globally competitive.\n\n\n']
Try this:
with open('test.txt') as f:
    content = f.readlines()

index = [x for x in range(len(content)) if 'training objectives' in content[x].lower()]

for num in index:
    for lines in content[num:num+5]:
        print(lines)
If you have only a few words (just to get the index):
index = []
for i, line in enumerate(content):
    if 'hello' in line or 'there' in line:  # add your words with `or` here
        index.append(i)
print(index)
If you have many (just to get the index):
words = ["hello", "there", "blink"]  # insert your words here
index = []
for i, line in enumerate(content):
    for item in words:
        if item in line:
            index.append(i)
print(index)
I have a file containing the DBLP dataset, which consists of bibliographic data in computer science. I want to delete some of the records with missing information. For example, I want to delete records with a missing venue. In this dataset, the venue follows '#c'.
In this code, I am splitting documents by the title of the manuscripts ("#*"). Now, I am trying to delete the records without a venue name.
Input Data:
#*Toward Connectionist Parsing.
##Steven L. Small,Garrison W. Cottrell,Lokendra Shastri
#t1982
#c
#index14997
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
#*Efficient Goal-Directed Exploration.
##Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons
#t1996
#cAAAI/IAAI, Vol. 1
#index14999
My code:
inFile = open('lorem.txt', 'r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt', 'w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not ('#!' in data[idx]):
        del data[idx]
        idx = idx - 1
    else:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()
Expected Output:
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
#*Efficient Goal-Directed Exploration.
##Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons
#t1996
#cAAAI/IAAI, Vol. 1
#index14999
Actual Output:
An empty output file
str.find will give you an index of sub-string, or -1 if the sub-string does not exist.
DOCUMENT_SEP = '#*'

with open('lorem.txt') as in_file:
    documents = in_file.read().split(DOCUMENT_SEP)

with open('testdata.txt', 'w') as out_file:
    for document in documents:
        i = document.find('#c')
        if i < 0:  # no "#c"
            continue
        # "#c" exists, but no trailing venue information
        if not document[i+2:i+3].strip():
            continue
        out_file.write(DOCUMENT_SEP)
        out_file.write(document)
Instead of closing manually, I used a with statement.
No need to use index; deleting an item in the middle of loop will make index calculation complex.
Using regular expressions like #c[A-Z].. will make the code simpler.
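For illustration, a sketch of that regex idea on an inline sample (assuming, as in the dataset above, that a record has a venue exactly when a non-whitespace character follows #c at the start of a line):

```python
import re

# two made-up records in the DBLP text format: one without a venue, one with
records = """#*Toward Connectionist Parsing.
##Steven L. Small,Garrison W. Cottrell,Lokendra Shastri
#t1982
#c
#index14997
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
"""

kept = []
for doc in records.split("#*"):
    if not doc.strip():
        continue
    # keep only records whose "#c" line has venue text after it
    if re.search(r"^#c\S", doc, flags=re.MULTILINE):
        kept.append("#*" + doc)

result = "".join(kept)
print(result)
```

`^#c\S` with `re.MULTILINE` matches a line starting with #c followed immediately by a non-whitespace character, so an empty `#c` line never matches.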
The reason your code wasn't working is that there's no #! in any of your entries.
If you want to exclude entries with empty #c fields, you can try this:
inFile = open('lorem.txt', 'r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt', 'w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not '#c\n' in data[idx] and len(word) > 0:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()
In general, try not to delete elements of a list you're looping through. It can cause a lot of unexpected drama.
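One way to honor that advice is to build a new list rather than deleting from the one you're iterating over. A minimal sketch with made-up records:

```python
# two hypothetical records: one with a venue, one with an empty "#c" line
raw = ("#*Good record\n#cVENUE\n#index1\n"
       "#*Bad record\n#c\n#index2\n")

records = [r for r in raw.split("#*") if r]
# build a filtered list instead of calling del/pop on `records` mid-loop
kept = ["#*" + r for r in records if "#c\n" not in r]
print(kept)
```

The original list is never mutated, so there is no index bookkeeping and no skipped elements.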
I have around 2,000 text files containing summaries of news articles and I want to remove the title from all the files that have titles (some don't have titles for some reason) using Python.
Here's an example:
Ad sales boost Time Warner profit
Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Time Warner's fourth quarter profits were slightly better than analysts' expectations.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.
My question is: how do I remove the line "Ad sales boost Time Warner profit"?
Edit: I basically want to remove everything before a line break.
TIA.
If it's (as you say) just a simple matter of removing the first line, when followed by \n\n, you could use a simple regex like this:
import re
with open('testing.txt', 'r') as fin:
doc = fin.read()
doc = re.sub(r'^.+?\n\n', '', doc)
Try this:
It will split the text at the first line break "\n\n" and select only the last element (the body):
line.split('\n\n', 1)[-1]
This also works when there is no line break in the text.
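A quick check of that claim on two hypothetical strings, one with a title and one without:

```python
text_with_title = "Ad sales boost Time Warner profit\n\nQuarterly profits jumped 76%."
text_without_title = "Quarterly profits jumped 76%."

# split at the first blank line and keep the last piece (the body)
body1 = text_with_title.split('\n\n', 1)[-1]
body2 = text_without_title.split('\n\n', 1)[-1]
```

With no "\n\n" present, split returns a one-element list, and [-1] is just the whole string, so no special-casing is needed.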
As you may know, you can't read and write to a file at the same time, so the solution in this case is to read the lines into a variable, modify them, and re-write them to the file.
lines = []

# open the text file in read mode and readlines (returns a list of lines)
with open('textfile.txt', 'r') as file:
    lines = file.readlines()

# open the text file in write mode and write lines
with open('textfile.txt', 'w') as file:
    # if the number of lines is bigger than 1 (assumption), write the summary; else write all lines
    file.writelines(lines[2:] if len(lines) > 1 else lines)
The above is a simple example of how you can achieve what you're after. - Although keep in mind that edge cases might be present.
This will remove everything before the first line break ('\n\n').
with open('text.txt', 'r') as file:
    f = file.read()

idx = f.find('\n\n')  # Search for a line break
if idx > 0:  # If found, keep everything after it
    g = f[idx+2:]
else:  # Otherwise, keep the original text
    g = f
print(g)

# Save the file
with open('text.txt', 'w') as file:
    file.write(g)
"Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Time Warner's fourth quarter profits were slightly better than analysts' expectations.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n"
I have a CSV file that I am trying to parse, but one of the cells contains blocks of data full of nulls and line breaks. I need to enclose each row inside an array and merge all the content from this particular cell into its corresponding row. I recently posted a similar question, and the answer solved my problem partially, but I am having trouble building a loop that iterates through every line that does not meet a certain start condition. The code I have merges only the first line that does not meet that condition, but it breaks after that.
I have:
file = "myfile.csv"
condition = "DAT"
data = open(file).read().split("\n")

for i, line in enumerate(data):
    if not line.startswith(condition):
        data[i-1] = data[i-1] + line
        data.pop(i)

print(data)
For a CSV that looks like this:
Case | Info
-------------------
DAT1 single line
DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.
Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.
Kraft met the young sports fan and attended the HBO premiere of the documentary in New York in October. Kraft made a $500,000 matching pledge to the foundation.
The Boston Globe reported that Berns was invited to a Patriots practice that month, and gave the players an impromptu motivational speech.
DAT3 single line
DAT4 YWYWQIDOWCOOXXOXOOOOOOOOOOO
It does join the first continuation line with the previous line, but when it hits a double space or a blank line, it fails and registers what follows as a new row. For example, if I print:
data[0]
The output is:
DAT1 single line
If I print:
data[1]
The output is:
DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.
But if I print:
data[2]
The output is:
Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.
Instead of:
DAT3 single line
How do I merge that full block of text in the "Info" column so that it always stays with the corresponding DAT row instead of popping up as a new row, regardless of null or newline characters?
You can split the lines with a regular expression directly into data:
import re

f = open("myfile.csv")
text = f.read()
data = re.findall(r"\n(DAT\d+.*)", text)
Correct me if this doesn't help.
UPDATE:
I believe this would fix the problem with the new lines:
import re
f = open("myfile.csv")
text = f.read()
lines = re.split(r"\n(?=DAT\d+)", text)
lines.pop(0)
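To see what the lookahead split produces, here it is on a small made-up sample (only the structure matters, not the real CSV content):

```python
import re

text = ('Case | Info\n'
        'DAT1 single line\n'
        'DAT2 "multi\n'
        'line\n'
        'cell"\n'
        'DAT3 single line')

# split only at newlines that are immediately followed by a DAT row marker
lines = re.split(r"\n(?=DAT\d+)", text)
lines.pop(0)  # drop the header chunk, as above
print(lines)
```

Because `(?=DAT\d+)` is a zero-width lookahead, the DAT markers themselves survive the split, and the newlines inside the multi-line cell are left alone.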
Changing data while iterating over it is "bad".
new_data = []
for line in data:
    if not new_data or line.startswith(condition):
        new_data.append(line)
    else:
        new_data[-1] += line

print(new_data)