Using Python to delete unwanted parts of a text file

I have an input file such as
[headline - https://prachatai.com/journal/2020/10/89984]
'ประยุทธ์' ขอบคุณทุกฝ่าย ยืนยันเจ้าหน้าที่ปฏิบัติตามหลักสากลทุกประการ - ด้านตำรวจยืนยันไม่มีการใช้กระสุนยางและแก๊สน้ำตากระชับพื้นที่ผู้ชุมนุม ระบุสารเคมีผสมน้ำไม่มีอันตราย ใช้เพื่อระุบตัวผู้ชุมนุมดำเนินคดีในอนาคต
เมื่อคืนวันที่ 16 ต.ค. 2563 อนุชา บูรพชัยศรี โฆษกประจำสำนักนายกรัฐมนตรี เปิดเผยว่า พล.อ. ประยุทธ์ จันทร์โอชา นายกรัฐมนตรี และรัฐมนตรีว่าการกระทรวงกลาโหม ขอขอบคุณเจ้าหน้าที่ทุกฝ่าย ประชาชนทุกกลุ่ม และผู้ชุมนุมที่ให้ความร่วมมือกับทางเจ้าหน้าที่ของรัฐในการยุติการชุมนุม
[headline - https://prachatai.com/english/about/internship]
Here is some english text
[headline - https://prachatai.com/english/node/8813]
Foreigners attended the protest at Thammasat University to show their support for the people of Thailand and their fight for democracy. The use of social media has greatly contributed to the expansion of foreign participation in protests.
A protester with a Guy Fawkes mask at the 19 Sept protest.
[headline - https://prachatai.com/journal/2020/10/89903]
ต.ค.62-ก.ย.63 แรงงานไทยในต่างประเทศส่งเงินกลับบ้าน 200,254 ล้านบาท
นายสุชาติ ชมกลิ่น รัฐมนตรีว่าการกระทรวงแรงงาน เปิดเผยว่า นับจากช่วงที่ประเทศไทยเข้าสู่สถานการณ์การแพร่ระบาดของโรคโควิด-19 ส่งผลกระทบต่อการจัดส่งแรงงานไทยไปทำงานต่างประเทศในภาพรวม เนื่องจากหลายประเทศที่เป็นเป้าหมายในการเดินทางไปทำงานของแรงงานไทย ชะลอการรับคนต่างชาติเข้าประเทศ
My goal here is to remove every English article. I have multiple large text files, so I want an efficient way to get rid of the English articles and keep everything else.
So an example output would look like this:
[headline - https://prachatai.com/journal/2020/10/89984]
'ประยุทธ์' ขอบคุณทุกฝ่าย ยืนยันเจ้าหน้าที่ปฏิบัติตามหลักสากลทุกประการ - ด้านตำรวจยืนยันไม่มีการใช้กระสุนยางและแก๊สน้ำตากระชับพื้นที่ผู้ชุมนุม ระบุสารเคมีผสมน้ำไม่มีอันตราย ใช้เพื่อระุบตัวผู้ชุมนุมดำเนินคดีในอนาคต
เมื่อคืนวันที่ 16 ต.ค. 2563 อนุชา บูรพชัยศรี โฆษกประจำสำนักนายกรัฐมนตรี เปิดเผยว่า พล.อ. ประยุทธ์ จันทร์โอชา นายกรัฐมนตรี และรัฐมนตรีว่าการกระทรวงกลาโหม ขอขอบคุณเจ้าหน้าที่ทุกฝ่าย ประชาชนทุกกลุ่ม และผู้ชุมนุมที่ให้ความร่วมมือกับทางเจ้าหน้าที่ของรัฐในการยุติการชุมนุม
[headline - https://prachatai.com/journal/2020/10/89903]
ต.ค.62-ก.ย.63 แรงงานไทยในต่างประเทศส่งเงินกลับบ้าน 200,254 ล้านบาท
นายสุชาติ ชมกลิ่น รัฐมนตรีว่าการกระทรวงแรงงาน เปิดเผยว่า นับจากช่วงที่ประเทศไทยเข้าสู่สถานการณ์การแพร่ระบาดของโรคโควิด-19 ส่งผลกระทบต่อการจัดส่งแรงงานไทยไปทำงานต่างประเทศในภาพรวม เนื่องจากหลายประเทศที่เป็นเป้าหมายในการเดินทางไปทำงานของแรงงานไทย ชะลอการรับคนต่างชาติเข้าประเทศ
As you can see, all the English articles are under
[headline - https://.../english/...
Each article begins with one of these [headline tags, which contain the article's URL, and the English articles happen to have english in their URLs.
So now I want to get rid of the English articles. How do I achieve this?
My current code:
with open('example.txt', 'r') as inputFile:
    data = inputFile.read().splitlines()

Outputtext = ""
for line in data:
    if line.startswith("[headline"):
        if "english" in line:
            # somehow read until the next [headline and do check
            pass
        else:
            Outputtext = Outputtext + line + "\n"
    else:
        pass  # (unfinished)
You can possibly do this with just Regex. It may need to be tweaked to fit the specific rules for your formatting, though.
import re
all_articles = "..."
# match "[headline...english" and everything after till another "[headline"
english_article_regex = r"\[headline[^\]]*\/english[^\]]*].*?(?=(\[headline|$))"
result = re.sub(english_article_regex, "", all_articles, 0, re.DOTALL)
Here's the live example:
https://regex101.com/r/heKomA/3
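For instance, here is a minimal sketch applying that substitution to a whole file; the input and output file names are assumptions, and encoding='utf8' is needed for the Thai text:

import re

english_article_regex = r"\[headline[^\]]*\/english[^\]]*].*?(?=(\[headline|$))"

# Read the whole file, strip out the English blocks, and write the rest back out.
with open("example.txt", "r", encoding="utf8") as f:
    all_articles = f.read()

result = re.sub(english_article_regex, "", all_articles, 0, re.DOTALL)

with open("example_filtered.txt", "w", encoding="utf8") as f:
    f.write(result)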

I think if you had put a bit more time into it, you might have solved this problem yourself. When I look at your code, I see someone learning programming who is confused about what he needs to do.
You need to think step by step. Here, you have a text composed of articles, and you want to filter out some of them depending on a condition. What's the first thing you need to do?
You first need to know how to recognize an article. Is an article a pack of 3 lines in your file? No, the size changes, so you need another common factor. They all begin with [headline? Alright. Now you need to make "groups" of articles. There are a very large number of ways you could do it, but I just wanted to give you an insight into how you could approach your problem: one step at a time.
Here is a solution to your problem, and it is far from the only one. First, the test file hello.txt I used:
HELLO
IGNORE
THESE
[headline - https://prachatai.com/journal/2020/10/89984]
NOENGLISHTEXT
MULTIPLE
LINES
TEXT
[headline - https://prachatai.com/english/about/internship]
Here is some english text
[headline - https://prachatai.com/english/node/8813]
Foreigners attended the protest at Thammasat University to show their support for the people of Thailand and their fight for democracy. The use of social media has greatly contributed to the expansion of foreign participation in protests.
A protester with a Guy Fawkes mask at the 19 Sept protest.
[headline - https://prachatai.com/journal/2020/10/89903]
NOENGLISHTEXT SECOND
MULTIPLE
LINES
And my solution, in pure Python:
def filter_out_english_block(lines: list) -> str:
    filtered_lines = []
    flag = False
    for line in lines:
        if line.startswith("[headline"):
            if 'english' not in line:
                flag = True
            else:
                flag = False
        if flag:
            filtered_lines.append(line)
    return "".join(filtered_lines)
if __name__ == '__main__':
    with open("hello.txt", "r") as f:
        lines = f.readlines()
    print(lines)
    # ['HELLO\n', 'IGNORE\n', 'THESE\n', '[headline - https://prachatai.com/journal/2020/10/89984]\n', 'NOENGLISHTEXT\n', 'MULTIPLE\n', 'LINES\n', 'TEXT\n', '[headline - https://prachatai.com/english/about/internship]\n', 'Here is some english text\n', '[headline - https://prachatai.com/english/node/8813]\n', 'Foreigners attended the protest at Thammasat University to show their support for the people of Thailand and their fight for democracy. The use of social media has greatly contributed to the expansion of foreign participation in protests.\n', 'A protester with a Guy Fawkes mask at the 19 Sept protest.\n', '[headline - https://prachatai.com/journal/2020/10/89903]\n', 'NOENGLISHTEXT SECOND\n', 'MULTIPLE\n', 'LINES']
    new_text = filter_out_english_block(lines)
    print(new_text)
    # [headline - https://prachatai.com/journal/2020/10/89984]
    # NOENGLISHTEXT
    # MULTIPLE
    # LINES
    # TEXT
    # [headline - https://prachatai.com/journal/2020/10/89903]
    # NOENGLISHTEXT SECOND
    # MULTIPLE
    # LINES
The explanation is:
I first decide to iterate through the file as a list.
I store a line only if I have previously seen a condition that suits me (here, that would be seeing a [headline line that does not contain the string english).
And my storing condition is set to False by default, so that the first lines are ignored until I see a condition that suits me for storing.
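Since the question mentions multiple large text files, here is a minimal sketch of running the same filter over a whole directory; the glob pattern and the _filtered output naming are assumptions:

import glob

for path in glob.glob("*.txt"):
    with open(path, "r", encoding="utf8") as f:
        lines = f.readlines()
    # Write the filtered text next to the original; note a rerun would
    # pick up the _filtered files too unless you narrow the pattern.
    with open(path.replace(".txt", "_filtered.txt"), "w", encoding="utf8") as f:
        f.write(filter_out_english_block(lines))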

Related

Convert output received to dataframe in python

I have selected some fields from a JSON file and saved each name along with its respective comments to do some preprocessing.
Below is the code:
import re
import json
import string

with open('C:/Users/User/Desktop/Coding/parsehubjsonfileeg/all.json', encoding='utf8') as f:
    data = json.load(f)

# dictionary for elements which you want to keep
new_data = {'selection1': []}
print(new_data)

# copy item from old data to new data if it has 'reviews'
for item in data['selection1']:
    if 'reviews' in item:
        new_data['selection1'].append(item)
        print(item['reviews'])
        print('--')

# save in file
with open('output.json', 'w') as f:
    json.dump(new_data, f)

selection1 = new_data['selection1']
for item in selection1:
    name = item['name']
    print('>>>>>>>.', name)
    CommentID = item['reviews']
    for com in CommentID:
        comment = com['review'].lower()  # converting all to lowercase
        result = re.sub(r'\d+', '', comment)  # remove numbers
        results = (result.translate(
            str.maketrans('', '', string.punctuation))).strip()  # remove punctuation and whitespace
        comments = (results)
        print(comment)
My output is:
>>>>>>>. Heritage The Villas
we booked at villa valriche through mari deal for 2 nights and check-in was too lengthy (almost 2 hours) and we were requested to make a deposit of rs 10,000 or credit card which we were never informed about it upon booking.
lovely place to recharge.
one word: suoerb
definitely not a 5 star. extremely poor staff service.
>>>>>>>. Oasis Villas by Evaco Holidays
excellent
spent 3 days with my family and really enjoyed my stay. the advantage of oasis is its privacy - with 3 children under 6 years, going to dinner/breakfast at hotels is often a burden rather than an enjoyable experience.
staff were very friendly and welcoming. artee and menni made sure everything was fine and brought breakfast - warm croissants - every morning. atish made the check-in arrangements - and was fast and hassle free.
will definitely go again!
What should I do to convert this output to a dataframe with columns name and comment?
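A minimal sketch of one way to do it, assuming pandas is available: collect each (name, comment) pair into a list of records inside the loop you already have, then build the DataFrame in one call.

import pandas as pd

rows = []
for item in selection1:
    name = item['name']
    for com in item['reviews']:
        # Reuse whatever cleaning you already do in the loop above.
        comment = com['review'].lower()
        rows.append({'name': name, 'comment': comment})

df = pd.DataFrame(rows, columns=['name', 'comment'])
print(df.head())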

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches the text for certain words that are grouped (predefined by reading from a CSV). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the CSV would contain those terms. I know the code below is messy: the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the CSV, yet the result prints out as 25. I think it's returning a character count, not a word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive,' or other, word.
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
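For instance, here is a minimal sketch of that split-and-count idea using collections.Counter, with the file name and tester words taken from the question:

import string
from collections import Counter

# Lowercase the text, strip punctuation, and split on whitespace.
with open("test.txt", "r") as f:
    words = f.read().lower().translate(
        str.maketrans('', '', string.punctuation)).split()

counts = Counter(words)
positive_words = ["excited", "happy", "optimistic"]

# Sum the occurrences of only the predefined positive words
# (Counter returns 0 for words that never appear).
print("Positive:", sum(counts[w] for w in positive_words))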
I looked at your code and ran some of my own as a sample.
I have two ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to [textresult] is great. Then you did the same with the positive lexicon, producing [positivelist], which I thought was the right move. But then you converted [positivelist] into what is essentially one big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. Merge the two dataframes [textresult (less stopwords) and positivelist] on common words, as in an 'inner join'
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
Second assumption: you are focusing on "excited", "happy", and "optimistic", and you are trying to isolate text themes into those 3 categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries; they break down emotive words into 8 emotional groups, so if you only want 3 of the 8 emotive groups, take a subset
3. Process it like you did to get [positivelist]
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user. I am adapting what I use in R to Python in the above (it's not subtle either :) )

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits end trying to figure out how to do this and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word (for example, the answers posted here: how to split single txt file into multiple txt files by Python, and here: Python read through file until match, read until next pattern). None of them seem to work once I adjust them to accept my regex of all-capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re

thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
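Usage is just a call with the path to your file (the file name here is an assumption):

split_letters('letters.txt')
# writes letters1.txt, letters2.txt, ... with one letter per file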
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            # Name each output file after the signature line that ends the letter.
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
If you run on Linux, you can also do this from the shell with csplit; something along the lines of csplit letters.txt '/^[A-Z]\{4,\}/+1' '{*}' (split one line after each all-caps signature) should be close, though you may need to adjust the pattern.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?

Want to merge specific code lines from 2 text files

For starters, I am actually a medical student, so I don't know the first thing about programming, but I found myself in desperate need of this, so pardon my complete ignorance of the subject.
I have 2 XML files containing text; each one contains nearly 2 million lines. The first one looks like this:
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>1</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<original>Уникальная массовая дуэль: Битва один на один до полного уничтожения в один раунд</original>
</TEXT>
<TEXT>
<Unknown1>-65535</Unknown1>
<autoId>2</autoId>
<autoId2>0</autoId2>
<alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<original>Уникальная массовая дуэль: Битва трое на трое до полного уничтожения в один раунд</original>
and the second one looks like this:
<TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>1</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<replacement>Unique mass duel one on one battle to the complete destruction of one round</replacement>
  </TEXT>
  <TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>2</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
    <replacement>Unique mass duel Battle three against three to the complete destruction of one round</replacement>
  </TEXT>
Those blocks are repeated throughout the files about half a million times, which gets me to the 2 million lines I told you about.
Now what I need to do is merge both files to make the final product look like this:
<TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>1</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_1vs1</alias>
<original>Уникальная массовая дуэль: Битва один на один до полного уничтожения в один раунд</original>
    <replacement>Unique mass duel one on one battle to the complete destruction of one round</replacement>
  </TEXT>
  <TEXT>
    <Unknown1>-65535</Unknown1>
    <autoId>2</autoId>
    <autoId2>0</autoId2>
    <alias>Name2.Boast_Duel_Season01_sudden_death_3vs3</alias>
<original>Уникальная массовая дуэль: Битва трое на трое до полного уничтожения в один раунд</original>
    <replacement>Unique mass duel Battle three against three to the complete destruction of one round</replacement>
  </TEXT>
So basically I want to add the "replacement" line under each respective "original" line while the rest of the file is kept intact (it's the same in both). Doing this manually would take me about 2 weeks, and I only have 1 day to do it!
Any help is appreciated, and again, sorry if I sound like a total idiot at this, because I kind of am!
P.S.: I can't even choose a proper tag! I will totally understand if I just get lashed in the answers now; this job is way too big for me!
The truth about "where to start" is to learn basic python string manipulation. I was feeling nice and I like these sorts of problem, however, so here's a (quick and dirty) solution. The only things you'll need to change are the "original.xml" and "replacement.xml" file names. You'll also need a working python version, of course. That's up to you to figure out.
A couple caveats about my code:
Parsing XML is a solved problem. Using regular expressions to do it is frowned upon, but it works, and when you're doing something as simple and fixed as this, it really doesn't matter.
I made a few assumptions when building the outputted XML file (for example an indentation style of 4 spaces), but it outputs valid XML. The application you're using should play nice with it.
import re

def loadfile(filename):
    '''
    Returns a string containing all data from file
    '''
    infile = open(filename, 'r')
    infile_string = infile.read()
    infile.close()
    return infile_string

def main():
    # load the files into strings
    original = loadfile("original.xml")
    replacement = loadfile("replacement.xml")

    # grab all of the "replacement" lines from the replacement file
    replacement_regex = re.compile("(<replacement>.*?</replacement>)")
    replacement_list = replacement_regex.findall(replacement)

    # grab all of the "TEXT" blocks from the original file
    original_regex = re.compile("(<TEXT>.*?</TEXT>)", re.DOTALL)
    original_list = original_regex.findall(original)

    # a string to write out to the new file
    outfile_string = ""
    to_find = "</original>"  # the replacement text is appended after this point
    additional_len = len(to_find)

    for i in range(len(original_list)):  # loop through all of the original text blocks
        # build a new string with the replacement text after the original
        build_string = ""
        build_string += original_list[i][:original_list[i].find(to_find) + additional_len]
        build_string += "\n" + " " * 4
        build_string += replacement_list[i]
        build_string += "\n</TEXT>\n"
        outfile_string += build_string

    # write the outfile string out to a file
    outfile = open("outfile.txt", 'w')
    outfile.write(outfile_string)
    outfile.close()

if __name__ == "__main__":
    main()
Edit (reply to comment): The IndexError (list index out of range) means the regexes aren't working properly: they are not finding the same number of replacement items as <TEXT> blocks. I tested what I wrote on the blurbs you provided, so there is a discrepancy between the blurbs and the full-blown XML files. If there aren't the same number of original/replacement tags, or anything like that, the code will break. It's impossible for me to figure out without access to the files themselves.
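Since the caveats above note that parsing XML is a solved problem, here is a rough sketch of the same merge using xml.etree.ElementTree instead of regexes. The file names are the same assumed ones as above, and because the files as shown are bare sequences of <TEXT> elements with no single root, the sketch wraps them in a dummy <root> element:

import xml.etree.ElementTree as ET

def load_blocks(filename):
    # Wrap the bare <TEXT> sequence in a dummy root so it parses
    # as well-formed XML.
    with open(filename, encoding="utf8") as f:
        return ET.fromstring("<root>" + f.read() + "</root>")

original = load_blocks("original.xml")
replacement = load_blocks("replacement.xml")

# Index each <replacement> element by its block's autoId.
repl_by_id = {t.findtext("autoId"): t.find("replacement")
              for t in replacement.iter("TEXT")}

# Append the matching <replacement> into each original <TEXT> block,
# which places it right after <original>, the last existing child.
for text in original.iter("TEXT"):
    rep = repl_by_id.get(text.findtext("autoId"))
    if rep is not None:
        text.append(rep)

# The output keeps the dummy <root> wrapper; strip it if the
# consuming application cannot tolerate it.
ET.ElementTree(original).write("outfile.xml", encoding="utf-8")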
Here I present a straightforward way to do that (without XML parsing).
def parse_org(file_handle):
    record = None
    for line in file_handle:
        if "<TEXT>" in line:
            record = line  # start a new record on tag <TEXT>
        elif "</TEXT>" in line:
            yield record  # end a record on tag </TEXT>
            record = None
        elif record is not None:
            record += line

def parse_rep(file_handle):
    record = None
    for line in file_handle:
        if "<TEXT>" in line:
            record = None
        elif "</TEXT>" in line:
            yield record
        elif "<replacement>" in line:
            record = line

if __name__ == "__main__":
    original_file = open("filepath/original.xml")
    replacement_file = open("filepath/replacement.xml")
    a_new_file = open("result_file", "w")
    # create each generator once, then walk both files in lockstep
    org_records = parse_org(original_file)
    rep_records = parse_rep(replacement_file)
    while True:
        try:
            org = next(org_records)
            rep = next(rep_records)
            a_new_file.write(org + rep + "</TEXT>\n")
        except StopIteration:
            break
    a_new_file.close()
    original_file.close()
    replacement_file.close()
The code is written in Python and uses the keyword yield. Use http://www.codecademy.com/ if you want to learn Python, and google "yield python" to learn how to use yield. If you would like to process such text files in the future, you should learn a scripting language; Python may be the easiest one. If you encounter any questions you can post them on this website, but don't just do nothing and ask "write this program for me".

Python for loop iteration to merge multiple lines in a single line

I have a CSV file that I am trying to parse, but the problem is that one of the cells contains blocks of data full of nulls and line breaks. I need to enclose each row inside an array and merge all the content from this particular cell into its corresponding row. I recently posted a similar question and the answer solved my problem partially, but I am having trouble building a loop that iterates through every single line that does not meet a certain start condition. The code I have merges only the first line that does not meet that condition, but it breaks after that.
I have:
file ="myfile.csv"
condition = "DAT"
data = open(file).read().split("\n")
for i, line in enumerate(data):
if not line.startswith(condition):
data[i-1] = data[i-1]+line
data.pop(i)
print data
For a CSV that looks like this:
Case | Info
-------------------
DAT1 single line
DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.
Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.
Kraft met the young sports fan and attended the HBO premiere of the documentary in New York in October. Kraft made a $500,000 matching pledge to the foundation.
The Boston Globe reported that Berns was invited to a Patriots practice that month, and gave the players an impromptu motivational speech.
DAT3 single line
DAT4 YWYWQIDOWCOOXXOXOOOOOOOOOOO
It does join the full sentence with the previous line, but when it hits a double space or a second continuation line it fails and registers it as a new row. For example, if I print:
data[0]
The output is:
DAT1 single line
If I print:
data[1]
The output is:
DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.
But if I print:
data[2]
The output is:
Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.
Instead of:
DAT3 single line
How do I merge that whole block of text in the Info column so that it always matches the corresponding DAT row instead of popping up as a new row, regardless of null or newline characters?
You can split lines with a regular expression directly into data:
import re

f = open("myfile.csv")
text = f.read()
data = re.findall(r"\n(DAT\d+.*)", text)
Let me know if this doesn't help.
UPDATE:
I believe this would fix the problem with new lines:
import re
f = open("myfile.csv")
text = f.read()
lines = re.split(r"\n(?=DAT\d+)", text)
lines.pop(0)
Changing data while iterating over it is "bad": popping an element shifts the indexes, so the loop skips the line right after every merge, which is why your rows come out incomplete and out of order.
new_data = []
for line in data:
    if not new_data or line.startswith(condition):
        new_data.append(line)
    else:
        new_data[-1] += line
print(new_data)
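Putting it together with the file-reading from the question (file name and condition as in the question), a minimal end-to-end sketch:

condition = "DAT"
with open("myfile.csv") as f:
    data = f.read().split("\n")

new_data = []
for line in data:
    # A line either starts a new DAT row or continues the previous one.
    if not new_data or line.startswith(condition):
        new_data.append(line)
    else:
        new_data[-1] += line

for row in new_data:
    print(row)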
