It's late and I have been trying to work on a simple script to rename point cloud data to a working format. I don't know what I'm doing wrong, as the code at the bottom works fine. Why doesn't the code in the for loop work? It is adding each line to the list, but the line is not getting formatted by the replace function. Sorry, I know this isn't a debugger, but I am really stuck on this and it would probably take two seconds for someone else to see the problem.
# Opening and Loading the text file then sticking its lines into a list []
filename = "/Users/sacredgeometry/Desktop/data.txt"
text = open(filename, 'r')
lines = text.readlines()
linesNew = []
temp = None
# This bloody for loop is the problem
for i in lines:
    temp = str(i)
    temp.replace(' ', ', ',2)
    linesNew.append(temp)
# DEBUGGING THE CODE
print(linesNew[0])
print(linesNew[1])
# Another test to check that the replace works ... It does!
test2 = linesNew[0].replace(' ', ', ',2)
test2 = test2.replace('\t', ', ')
print('Proof of Concept: ' + '\n' + test2)
text.close()
You're not assigning the return value of replace() to anything. Also, readlines() and str(i) are unnecessary.
Try this:
filename = "/Users/sacredgeometry/Desktop/data.txt"
text = open(filename, 'r')
linesNew = []
for line in text:
    # line is already a string, no need to call str() on it
    # temp = str(i)
    # also, just append the result of the replace to linesNew:
    linesNew.append(line.replace(' ', ', ', 2))
# DEBUGGING THE CODE
print(linesNew[0])
print(linesNew[1])
# Another test to check that the replace works ... It does!
test2 = linesNew[0].replace(' ', ', ',2)
test2 = test2.replace('\t', ', ')
print('Proof of Concept: ' + '\n' + test2)
text.close()
Strings are immutable. replace returns a new string, which is what you have to insert into the linesNew list.
# This bloody for loop is the problem
for i in lines:
    temp = str(i)
    temp2 = temp.replace(' ', ', ',2)
    linesNew.append(temp2)
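To make the immutability point concrete, here is a minimal Python 3 sketch. The two sample lines are made up to stand in for the contents of data.txt (an assumption, since the real file is not shown), and the list names `ignored` and `fixed` are only for illustration.

```python
# Hypothetical sample lines standing in for the contents of data.txt.
lines = ["1.0 2.0 3.0\t255\n", "4.0 5.0 6.0\t128\n"]

ignored = []
fixed = []
for line in lines:
    # str.replace never changes the string it is called on ...
    line.replace(' ', ', ', 2)
    ignored.append(line)
    # ... it returns a new string, which has to be kept.
    fixed.append(line.replace(' ', ', ', 2))

print(ignored[0] == lines[0])  # the discarded result left the line untouched
print(repr(fixed[0]))
```

The first print shows that calling replace() without using its result leaves the line as it was; the second shows the formatted line once the return value is stored.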
I had a similar problem and came up with the code below to help solve it. My specific issue was that I need to swap out certain parts of a string with the corresponding label. I also wanted something that would be reusable in different places within my application.
With the code below, I'm able to do the following:
>>> string = "Let's take a trip to Paris next January"
>>> lod = [{'city':'Paris'}, {'month':'January'}]
>>> processed = TextLabeler(string, lod)
>>> processed.text
"Let's take a trip to [[ city ]] next [[ month ]]"
Here is all of the code:
class TextLabeler():
    def __init__(self, text, lod):
        self.text = text
        self.iterate(lod)
    def replace_kv(self, _dict):
        """Replace any occurrence of a value with the key"""
        # items() works on Python 2 and 3; iteritems() exists only on Python 2
        for key, value in _dict.items():
            label = """[[ {0} ]]""".format(key)
            self.text = self.text.replace(value, label)
        return self.text
    def iterate(self, lod):
        """Iterate over each dict object in a given list of dicts, `lod` """
        for _dict in lod:
            self.text = self.replace_kv(_dict)
        return self.text
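If a class is more than you need, the same idea fits in a plain function. This is only a sketch of an alternative, not the author's code; the name `label_text` is made up for illustration, and it is written for Python 3.

```python
def label_text(text, lod):
    """Replace each value found in the dicts of `lod` with a [[ key ]] label."""
    for mapping in lod:
        for key, value in mapping.items():
            # replace() returns a new string, so rebind `text` each time
            text = text.replace(value, "[[ {0} ]]".format(key))
    return text


sentence = "Let's take a trip to Paris next January"
labels = [{'city': 'Paris'}, {'month': 'January'}]
print(label_text(sentence, labels))
```

Because strings are immutable, the function has no state to manage: it just keeps rebinding `text` to the result of each replace() call and returns the final value.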
Related
I'm trying to remove all blank lines from a large .txt file but whatever method I use it always returns this traceback:
Traceback (most recent call last):
File "C:\Users\svp12\PycharmProjects\practiques\main.py", line 53, in <module>
doc = nlp(texts[line])
IndexError: list index out of range
If I don't remove these blank lines then I get IndexErrors in the subsequent two for loops (or at least I think that's the reason), which is why I'm using try/except like this:
try:
    for word in doc.sentences[0].words:
        noun.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)
        xpos.append(word.xpos)
        deprel.append(word.deprel)
except IndexError:
    errors += 1
    pass
I'd like to be able to remove all blank lines and not have to work around IndexErrors like this. Any idea how to fix it?
Here's the whole code:
import io
import stanza
import os
def linecount(filename):
    ffile = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = ffile.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)
    return lines
errors = 0
with io.open('#_Calvia_2018-01-01_2022-04-01.txt', 'r+', encoding='utf-8') as f:
    text = f.read()
# replacing eos with \n, numbers and symbols
texts = text.replace('eos', '.\n')
texts = texts.replace('0', ' ').replace('1', ' ').replace('2', ' ').replace('3', ' ').replace('4', ' ')\
    .replace('5', ' ').replace('6', ' ').replace('7', ' ').replace('8', ' ').replace('9', ' ').replace(',', ' ')\
    .replace('"', ' ').replace('·', ' ').replace('?', ' ').replace('¿', ' ').replace(':', ' ').replace(';', ' ')\
    .replace('-', ' ').replace('!', ' ').replace('¡', ' ').replace('.', ' ').splitlines()
os.system("sed -i \'/^$/d\' #_Calvia_2018-01-01_2022-04-01.txt") # removing empty lines to avoid IndexError
nlp = stanza.Pipeline(lang='ca')
nouns = []
lemmas = []
poses = []
xposes = []
heads = []
deprels = []
total_lines = linecount('#_Calvia_2018-01-01_2022-04-01.txt') - 1
for line in range(50): # range should be total_lines which is 6682
    noun = []
    lemma = []
    pos = []
    xpos = []
    head = []
    deprel = []
    # print('analyzing: '+str(line+1)+' / '+str(len(texts)), end='\r')
    doc = nlp(texts[line])
    try:
        for word in doc.sentences[0].words:
            noun.append(word.text)
            lemma.append(word.lemma)
            pos.append(word.pos)
            xpos.append(word.xpos)
            deprel.append(word.deprel)
    except IndexError:
        errors += 1
        pass
    try:
        for word in doc.sentences[0].words:
            head.extend([lemma[word.head-1] if word.head > 0 else "root"])
    except IndexError:
        errors += 1
        pass
    nouns.append(noun)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)
print(nouns)
print(lemmas)
print(poses)
print(xposes)
print(heads)
print(deprels)
print("errors: " + str(errors)) # weird, seems to be range/2-1
And as a side question, is it worth importing os just for this line (the one that removes the blank lines)?
os.system("sed -i \'/^$/d\' #_Calvia_2018-01-01_2022-04-01.txt")
I can't guarantee that this works because I couldn't test it, but it should give you an idea of how you'd approach this task in Python.
I'm omitting the head processing/the second loop here, that's for you to figure out.
I'd recommend you throw some prints in there and look at the output, make sure you understand what's going on (especially with different data types) and look at examples of applications using Stanford NLP, watch some tutorials online (from start to finish, no skipping), etc.
import stanza
import re
def clean(line):
    # function that does the text cleaning
    line = line.replace('eos', '.\n')
    line = re.sub(r'[\d,"·?¿:;!¡.-]', ' ', line)
    return line.strip()
nlp = stanza.Pipeline(lang='ca')
# instead of individual variables, you could keep the values in a dictionary
# (or just leave them as they are - your call)
values_to_extract = ['text', 'lemma', 'pos', 'xpos', 'deprel']
data = {v:[] for v in values_to_extract}
with open('#_Calvia_2018-01-01_2022-04-01.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # clean the text
        line = clean(line)
        # skip empty lines
        if not line:
            continue
        doc = nlp(line)
        # loop over sentences - this will work even if it's an empty list
        for sentence in doc.sentences:
            # append a new list to the dictionary entries
            for v in values_to_extract:
                data[v].append([])
            for word in sentence.words:
                for v in values_to_extract:
                    # extract the attribute (e.g.,
                    # a surface form, a lemma, a pos tag, etc.)
                    attribute = getattr(word, v)
                    # and add it to its slot
                    data[v][-1].append(attribute)
for v in values_to_extract:
    print('Value:', v)
    print(data[v])
    print()
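The cleaning and blank-line skipping can be checked on their own, without loading a stanza pipeline. The sketch below reuses the `clean` function from the answer above; the sample lines are invented for illustration and are not taken from the real data file.

```python
import re


def clean(line):
    # same cleaning steps as in the answer above
    line = line.replace('eos', '.\n')
    line = re.sub(r'[\d,"·?¿:;!¡.-]', ' ', line)
    return line.strip()


# Hypothetical input: one normal line, one empty line,
# one line that becomes empty after cleaning, and one more normal line.
sample_lines = ["Hola, 2022! eos\n", "\n", "  ...  \n", "Bon dia\n"]

# Keep only the lines that still contain something after cleaning.
kept = [cleaned for cleaned in (clean(l) for l in sample_lines) if cleaned]
print(kept)
```

Only the non-empty cleaned lines reach the list, so nothing downstream has to index into the result of processing an empty string.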
Because texts doesn't have 50 lines, why do you hardcode 50?
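If you just need to remove blank lines, you can do text = text.replace("\n\n","\n"). Note that a single pass only collapses pairs of newlines, so runs of several blank lines need the call repeated.
If you also need to remove lines that contain only whitespace, you can do: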
text = '\n'.join(line.rstrip() for line in text.split('\n') if line.strip())
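A quick comparison of the two approaches on a made-up string (the sample text is an assumption for illustration only):

```python
text = "first\n\n\nsecond\n   \nthird\n"

# A single pass of replace leaves a blank line where three newlines met,
# and it does nothing about the whitespace-only line.
once = text.replace("\n\n", "\n")

# Filtering on strip() drops empty and whitespace-only lines in one go.
filtered = '\n'.join(line.rstrip() for line in text.split('\n') if line.strip())

print(repr(once))
print(repr(filtered))
```

The join-based version is the safer default when the input may contain several consecutive blank lines or lines made up of spaces.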
Consider:
def fix_county_string(s):
    """ Insert Docstring """
    fp = open("michigan_COVID_08_24_21.txt", "r")
    fp.readline()
    for line in fp:
        county = line[24:43]
        x = county.split()
        t = x.pop(-1)
        s = x.append("County")
    return s
fix_county_string(s)
The parameter is s, a string. Every county name should end with the word "County"; if it already does, do nothing (simply return s). Otherwise, correct the ending word to be "County".
Use:
def fix_county_string():
    """ Insert Docstring """
    fp = open("michigan_COVID_08_24_21.txt", "r")
    s = ''
    for line in fp:
        county = line[24:43]
        x = county.split()
        t = x.pop(-1)
        x.append("County")
        s += line + ' '.join(x)
    return s
s = fix_county_string()
I think this is what you are trying to do. You can then write the output back to a file.
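If the function is instead meant to take a single county string and return the corrected string, as the question's signature suggests, a minimal sketch could look like this. It assumes the rule is "the last word must be County", which is inferred from the `x.append("County")` call in the question rather than stated there, and it leaves the file reading to the caller.

```python
def fix_county_string(s):
    """Return s with its last word forced to "County" (assumed rule)."""
    words = s.split()
    # Nothing to fix for an empty string or a name that already ends correctly.
    if not words or words[-1] == "County":
        return s
    # list.append() returns None, so build the result explicitly.
    words[-1] = "County"
    return ' '.join(words)


print(fix_county_string("Wayne County"))
print(fix_county_string("Grand Traverse Conty"))
```

Note that the original `s = x.append("County")` always sets s to None, because list.append modifies the list in place and returns nothing.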
I am trying to use the replace function to take items from a list and replace the fields below with their corresponding values, but no matter what I do, it only seems to work when it reaches the end of the range (on its last possible value of i, it successfully replaces a substring, but before that it does not).
for i in range(len(fieldNameList)):
    foo = fieldNameList[i]
    bar = fieldValueList[i]
    msg = msg.replace(foo, bar)
    print msg
This is what I get after running that code
<<name>> <<color>> <<age>>
<<name>> <<color>> <<age>>
<<name>> <<color>> 18
I've been stuck on this for way too long. Any advice would be greatly appreciated. Thanks :)
Full code:
def writeDocument():
    msgFile = raw_input("Name of file you would like to create or write to?: ")
    msgFile = open(msgFile, 'w+')
    msg = raw_input("\nType your message here. Indicate replaceable fields by surrounding them with \'<<>>\' Do not use spaces inside your fieldnames.\n\nYou can also create your fieldname list here. Write your first fieldname surrounded by <<>> followed by the value you'd like to assign, then repeat, separating everything by one space. Example: \"<<name>> ryan <<color>> blue\"\n\n")
    msg = msg.replace(' ', '\n')
    msgFile.write(msg)
    msgFile.close()
    print "\nDocument written successfully.\n"
def fillDocument():
    msgFile = raw_input("Name of file containing the message you'd like to fill?: ")
    fieldFile = raw_input("Name of file containing the fieldname list?: ")
    msgFile = open(msgFile, 'r+')
    fieldFile = open(fieldFile, 'r')
    fieldNameList = []
    fieldValueList = []
    fieldLine = fieldFile.readline()
    while fieldLine != '':
        fieldNameList.append(fieldLine)
        fieldLine = fieldFile.readline()
        fieldValueList.append(fieldLine)
        fieldLine = fieldFile.readline()
    print fieldNameList[0]
    print fieldValueList[0]
    print fieldNameList[1]
    print fieldValueList[1]
    msg = msgFile.readline()
    for i in range(len(fieldNameList)):
        foo = fieldNameList[i]
        bar = fieldValueList[i]
        msg = msg.replace(foo, bar)
        print msg
    msgFile.close()
    fieldFile.close()
###Program Starts#####--------------------
while True==True:
    objective = input("What would you like to do?\n1. Create a new document\n2. Fill in a document with fieldnames\n")
    if objective == 1:
        writeDocument()
    elif objective == 2:
        fillDocument()
    else:
        print "That's not a valid choice."
Message file:
<<name>> <<color>> <<age>>
Fieldname file:
<<name>>
ryan
<<color>>
blue
<<age>>
18
Cause:
This is because every line read from the fieldname file, except the last one, ends with a "\n" character. So when the program reaches the replacing part, fieldNameList, fieldValueList and msg look like this:
fieldNameList = ['<<name>>\n', '<<color>>\n', '<<age>>\n']
fieldValueList = ['ryan\n', 'blue\n', '18']
msg = '<<name>> <<color>> <<age>>\n'
So replace() actually searches for '<<name>>\n', '<<color>>\n' and '<<age>>\n' in the msg string, and only the <<age>> field gets replaced, because it is the only one followed by a newline. (The message file must end with "\n"; otherwise <<age>> would not be replaced either.)
Solution:
Use the rstrip() method when reading lines to strip the newline character at the end:
fieldLine = fieldFile.readline().rstrip()
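Here is a minimal Python 3 sketch of the fixed flow. It uses io.StringIO as an in-memory stand-in for the fieldname file from the question, and the names `field_file`, `stripped` and `fields` are made up for illustration. It also pairs names and values in a dictionary rather than two parallel lists, which is a design choice of the sketch and not part of the answer.

```python
import io

# In-memory stand-ins for the two files shown in the question.
field_file = io.StringIO("<<name>>\nryan\n<<color>>\nblue\n<<age>>\n18")
msg = "<<name>> <<color>> <<age>>\n"

# Strip the trailing newline from every line as it is read.
stripped = [line.rstrip('\n') for line in field_file]

# Lines alternate name, value, name, value, ...
fields = dict(zip(stripped[0::2], stripped[1::2]))

for name, value in fields.items():
    msg = msg.replace(name, value)

print(msg)
```

With the newlines stripped, every field name matches its placeholder in the message, not just the last one.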
Here is my function. Trying to get this all to print to one line.
Here is the output ->
config::$var['pdf']['meta']['staff_member_name']
= ";"
The = ";" portion of the string prints on a new line in the console for some reason.
This is just a personal hack to help with a repetitive job requirement, so I'm not looking for anything fancy.
Here is my function ->
def auto_pdf_config(file):
    with open(file) as f:
        content = f.readlines()
    kill = " = array("
    start = "config::$var['intake']"
    new_line = ""
    for line in content:
        if kill not in line:
            pass
        elif start in line:
            new_line = line
            x = new_line.replace(kill, "")
            y = x.replace(start,"")
            pdf_end = ' = ";" '
            z = "config::$var['pdf']['meta']{}{}".format(y,pdf_end)
            print(z)
It seems your "y" variable has a newline in it. You can try stripping it off:
y = x.replace(start,"").strip('\n')
Since x = new_line.replace(kill, "") and y = x.replace(start,""), and new_line is a line of content, y still contains the line's trailing newline character (\n). That is why a line break appears before pdf_end. You just need to remove the newline from y.
You can do something like this:
y = y.strip('\n')
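A small sketch showing where the newline comes from and that stripping it puts everything on one line. The input line is hypothetical: it is reconstructed from the output in the question, since the actual config file is not shown.

```python
# Hypothetical line as readlines() would return it, newline included.
line = "config::$var['intake']['staff_member_name'] = array(\n"
kill = " = array("
start = "config::$var['intake']"

y = line.replace(kill, "").replace(start, "")
print(repr(y))  # the trailing \n from the file is still attached

# Remove the newline before building the output string.
y = y.strip('\n')
z = "config::$var['pdf']['meta']{}{}".format(y, ' = ";" ')
print(z)
```

Printing with repr() is a handy way to spot invisible characters like this one when output breaks across lines unexpectedly.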
i hope my question makes sense. i am looking for a way to read a csv file, and map a dictionary to each cell. i can make it work without csv, but i am having a hard time making it work when reading a csv file.
note:
string0 would be cell A1 or row[0]
string1 would be cell B1 or row[1]
string2 would be cell C1 or row[2]
this is what i have so far:
dict0 = {'A':'CODE1', 'B':'CODE2'}
text0 = []
string0 = 'A'
dict1 = {'avenue':'ave', 'street':'st', 'road':'rd', 'court':'ct'}
text1 = []
string1 = '123 MAIN AVENUE'
dict2 = {'(':'', ')':'', '-':'', ' ':'', '/':'', '\\':''}
text2 = []
string2 = '(123) 456/7890'
for i in string0:
    newcode = dict0.get(i,i)
    text0.append(newcode)
print ' '.join(text0)
for i in string1.lower().split(' '):
    newaddress = dict1.get(i.lower(),i)
    text1.append(newaddress)
print ' '.join(text1)
for i in string2:
    newphone = dict2.get(i,i)
    text2.append(newphone)
print ''.join(text2)
the code above works exactly as i intend it to work, but im having a hard time trying to make it work when reading a csv file.
thank you very much
edit #1:***********************************************
here is an excerpt of sample1.csv:
A,123 MAIN STREET,(123) 456-7890
B,888 TEST ROAD,(222) 555-5555
sorry if the code isnt much cleaner/clearer, but that is why i am asking for guidance.
in essence, every column will have a dictionary associated with it, so that the "code" column will write "CODE1 or CODE2" depending on the value of cell A1 ("A" or "B").
column 2 will have dict1{} associated with it, and will clean up the address column.
column 3 will have dict2{} associated with it, and will remove (,),/,\ from the phone number column.
my issue is i do not know how to start the code. i can write the code if i set the cell information as variables (see my code above, variables: string0, string1, string2), but i do not know how i would start to iterate the csv file.
thank you
edit #2:***********************************************
here is my code when i try using import csv
dict0 = {'A':'CODE1', 'B':'CODE2'}
text0 = []
dict1 = {'avenue':'ave', 'street':'st', 'road':'rd', 'court':'ct'}
text1 = []
dict2 = {'(':'', ')':'', '-':'', ' ':'', '/':'', '\\':''}
text2 = []
import csv
with open('O:/sample1.csv', 'rb') as c:
    reader = csv.reader(c)
    for row in reader:
        for i in row[0]:
            newcode = dict0.get(i,i)
            text0.append(newcode)
        for i in row[1].lower().split(' '):
            newaddress = dict1.get(i.lower(),i)
            text1.append(newaddress)
        for i in row[2]:
            newphone = dict2.get(i,i)
            text2.append(newphone)
        print str(' '.join(text0)) + ',' + str(' '.join(text1)) + ',' + str(''.join(text2))
prints:
CODE1,123 main st,1234567890
CODE1 CODE2,123 main st 888 test rd,12345678902225555555
i would like to print:
CODE1,123 main st,1234567890
CODE2,888 test rd,2225555555
hopefully someone can help
thank you
edit #3 *********************************************************************************************************************
can the following be improved (syntax, indentation etc..):
sample1.csv:
A,123 MAIN STREET,(123) 456-7890
B,888 TEST ROAD,(222) 555-5555
here is the code:
import csv
newcsv = csv.writer(open('O:/csvfile1.csv', 'ab'))
with open('O:/sample1.csv', 'rb') as c:
    reader = csv.reader(c)
    dict0 = {'A':'CODE1', 'B':'CODE2'}
    dict1 = {'avenue':'ave', 'street':'st', 'road':'rd', 'court':'ct'}
    dict2 = {'(':'', ')':'', '-':'', ' ':'', '/':'', '\\':''}
    # read element in *reader*
    for row in reader:
        text0 = []
        text1 = []
        text2 = []
        newline = []
        # read element in *row*
        for i in row[0]:
            newcode = dict0.get(i,i)
            text0.append(newcode)
        newline.append(' '.join(text0))
        for i in row[1].lower().split(' '):
            newaddress = dict1.get(i.lower(),i)
            text1.append(newaddress)
        newline.append(' '.join(text1))
        for i in row[2]:
            newphone = dict2.get(i,i)
            text2.append(newphone)
        newline.append(''.join(text2))
        newcsv.writerow(newline)
        print newline
prints the following:
['CODE1', '123 main st', '1234567890']
['CODE2', '888 test rd', '2225555555']
creates csvfile1.csv (using '|' as a 'cell delimiter') and its exactly what i want:
CODE1|123 main st|1234567890
CODE2|888 test rd|2225555555
just wondering if the above code can be improved/written in a more efficient way.
thank you
The reason for the garbled output is that you are not clearing the text<n> variables on each cycle of the loop, so every row's results get appended to the previous rows' results. The fix is below, but I also recommend reading up on how to define functions and rewriting the code with fewer global variables, so that you don't run into this kind of problem again.
with open('O:/sample1.csv', 'rb') as c:
    reader = csv.reader(c)
    for row in reader:
        text0 = []
        text1 = []
        text2 = []
        for i in row[0]:
            newcode = dict0.get(i,i)
            text0.append(newcode)
        for i in row[1].lower().split(' '):
            newaddress = dict1.get(i.lower(),i)
            text1.append(newaddress)
        for i in row[2]:
            newphone = dict2.get(i,i)
            text2.append(newphone)
        print str(' '.join(text0)) + ',' + str(' '.join(text1)) + ',' + str(''.join(text2))
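As a rough illustration of the "define functions" advice, here is one way the per-column cleaning could be split up. It is a sketch under several assumptions: it is written for Python 3 (the code above is Python 2), it reads from an in-memory io.StringIO copy of sample1.csv instead of the real file, and all function and constant names are made up for illustration.

```python
import csv
import io

CODES = {'A': 'CODE1', 'B': 'CODE2'}
STREET_ABBREVIATIONS = {'avenue': 'ave', 'street': 'st', 'road': 'rd', 'court': 'ct'}
PHONE_NOISE = '()-/\\ '  # characters to drop from phone numbers


def clean_code(cell):
    # Looks up the whole cell, unlike the per-character loop above;
    # the two agree for single-letter codes such as 'A' and 'B'.
    return CODES.get(cell, cell)


def clean_address(cell):
    words = cell.lower().split(' ')
    return ' '.join(STREET_ABBREVIATIONS.get(word, word) for word in words)


def clean_phone(cell):
    return ''.join(ch for ch in cell if ch not in PHONE_NOISE)


def clean_row(row):
    # Each call builds fresh values, so nothing leaks from one row to the next.
    return [clean_code(row[0]), clean_address(row[1]), clean_phone(row[2])]


sample = io.StringIO("A,123 MAIN STREET,(123) 456-7890\nB,888 TEST ROAD,(222) 555-5555\n")
cleaned = [clean_row(row) for row in csv.reader(sample)]
for row in cleaned:
    print(','.join(row))
```

Because every function returns a new value instead of appending to a shared list, there is no state to reset between rows, which is what caused the accumulating output in the first place.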