Line split is not functioning as intended - python

I am trying to split each line only once, into an acronym and its meaning, but the code is not working as expected:
for line in text_line:
    one_line = line.split(' ',1)
    if len(one_line) > 1:
        acro = one_line[0].strip()
        meaning = one_line[1].strip()
        if acro in acronyms_dict:
            acronyms_dict[acro] = acronyms_dict[acro] + ', ' + meaning
        else:
            acronyms_dict[acro] = meaning

Remove the ' ' argument from str.split. The file uses tabs, not spaces, to delimit the acronyms, and str.split without a separator splits on any run of whitespace:
import requests
data_site = requests.get(
    "https://raw.githubusercontent.com/priscian/nlp/master/OpenNLP/models/coref/acronyms.txt"
)
text_line = data_site.text.split("\n")
acronyms_dict = {}
for line in text_line:
    one_line = line.split(maxsplit=1)  # <-- remove the ' '
    if len(one_line) > 1:
        acro = one_line[0].strip()
        meaning = one_line[1].strip()
        if acro in acronyms_dict:
            acronyms_dict[acro] = acronyms_dict[acro] + ", " + meaning
        else:
            acronyms_dict[acro] = meaning
print(acronyms_dict)
Prints:
{
'24KHGE': '24 Karat Heavy Gold Electroplate',
'2B1Q': '2 Binary 1 Quaternary',
'2D': '2-Dimensional',
...
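To see why the separator matters, here is a minimal sketch that runs without downloading the file. The sample line is an assumption, modeled on the output above rather than copied from the file:

```python
# Hypothetical sample line in the same shape as the acronyms file:
# acronym, a tab, then the meaning.
line = "2B1Q\t2 Binary 1 Quaternary"

# Splitting on a literal space cuts at the first space, which sits inside the meaning.
by_space = line.split(' ', 1)

# With no separator, split() cuts at the first run of any whitespace, here the tab.
by_whitespace = line.split(maxsplit=1)

print(by_space)       # ['2B1Q\t2', 'Binary 1 Quaternary']
print(by_whitespace)  # ['2B1Q', '2 Binary 1 Quaternary']
```

With the space separator, the "acronym" wrongly includes the tab and the first word of the meaning.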

Related

Managing various printing conditions cleanly?

I have the following code which prints the data I have available but currently looks bad and cluttered:
has_printed = 0
output = ''
is_banner_printed = 0
if time.time() > next_print_time:
    output = output + (' ' + datetime.now().strftime("%I:%M:%S %p") + ' ---------------------------------------------------------------------------------c')
    is_banner_printed = 1
if is_banner_printed or not(len(self.post_list)):
    print(output)
    output = ''
for i, post in enumerate(self.post_list):
    post.update()
    output = output + (' Age: ' + datetime.fromtimestamp(self.utils.get_submission_age(post.created_utc)).strftime("%M:%S") +
                       ' | URL: http://url.com/' + post.id +
                       ' | Score: ' + str(post.score).ljust(6, ' ') + ... )
    if i == 0:
        output = '\n' + output
    output = output + '\n'
if output != '' and time.time() > next_print_time and is_banner_printed:
    print(output)
    has_printed = 1
    next_print_time = time.time() + self.PRINT_RATE
if has_printed:
    print(' ---------------------------------------------------------------------------------------------c\n\n\n')
elif is_banner_printed and output == '':
    print('')
    next_print_time = time.time() + self.PRINT_RATE
I wonder if it's possible to clean this mess to make the actual code readable?

list index out of range when extending list

This function takes an email body as input and returns the values that follow Application name, Source and Message respectively, and it works fine:
def parse_subject(line):
    info = {}
    segments = line.split(' ')
    info['time'] = segments[0]+' '+segments[1]
    for i in range(2, len(segments)):
        key = ''
        if segments[i] == 'Application name:':
            key = 'appname'
        elif segments[i] == 'Source:':
            key = 'source'
        elif segments[i] == 'Message:':
            key = 'message'
        if key != '':
            i += 1
            info[key] = segments[i]
    return info
For another email body format I need to use more segments, because I need to search more lines in the message body. So I changed info['time'], and as soon as I extend it beyond 2 segments I get out-of-range errors:
info['time'] = segments[0]+' '+segments[1]+' '+segments[2]+' '+segments[3]+' '+segments[4]+' '+segments[5]......up to segment[17]
I may need to extend it even further, and the function above fails with "list index out of range".
I changed the code but got the same error.
I also tried changing the start of the range to match the number of segments, with the same result:
for i in range(<number of segments>, len(segments)):
The length of segments will vary, because the string after Message has a different value each time; sometimes it is a URL string.
Question
When I index a fixed number of segments, say up to segments[17], what do I need to change in the function so that it does not throw an index-out-of-range error?
def parse_subject(line):
    info = {}
    segments = line.split(' ')
    info['time'] = (segments[0]+' '+segments[1] + ' ' + segments[2] + ' ' + segments[3] + ' ' + segments[4]
                    + ' ' + segments[5] + ' ' + segments[6] + ' ' + segments[7] + ' ' + segments[8]
                    +' ' + segments[9] + ' ' + segments[10] + ' ' + segments[11] + ' ' + segments[12]
                    +' ' + segments[13] + ' ' + segments[14] + ' '
                    + segments[15] +' ' + segments[16] + ' ' + segments[17])
    for i in range(16, len(segments)):
        key = ''
        if segments[i] == 'name:':
            key = 'appname'
        elif segments[i] == 'Source:':
            key = 'source'
        elif segments[i] == 'Message:':
            key = 'message'
        if key != '':
            i += 1
            info[key] = segments[i]
    return info
if mail["Subject"].find("PA1") > 0 or mail["Subject"].find("PA2") > 0:
    body = get_autosys_body(mail)
    # print(body)
    for line in body.splitlines():
        if 'Application Name' in line:
            job_info = parse_subject(line)
            break
    print(job_info)
I need to pass the line variable (content below)
name:Contoso.Service
Source: host15
Timestamp: 2019-01-22T00:00:43.901Z
Message:null
to the parse_subject(line) function and, from the input above, get:
Contoso.Service as value of job_info['appname']
host15 as value of job_info['source']
null as value of job_info['message']
You need to debug your code. The error is telling you exactly what is wrong: segments is shorter than the index you are asking for.
def old_parse_subject(line):
    info = {}
    segments = line.split(' ')
    if len(segments) < 18:
        raise ValueError("segments[17] won't work if segments is not that long")
You could have done a print(len(segments)) or just print(segments) right before the line where you know the error occurs.
For reading an email header, if you know it has multiple lines, you get those with split('\n') and then for each line if you know it is "name: value" you get that with split(':', 1).
The second argument to split says only split on 1 colon, because any additional colons are allowed to be part of the data. For example, timestamps have colons.
def parse_subject(headers):
    info = {}
    # split the full header into separate lines
    for line in headers.split('\n'):
        # split on colon, but only once
        key, value = line.split(':', 1)
        # store info
        info[key] = value
    return info
data = """name:Contoso.Service
Source: host15
Timestamp: 2019-01-22T00:00:43.901Z
Message:null"""
print (parse_subject(data))
{'name': 'Contoso.Service', 'Source': ' host15', 'Timestamp': ' 2019-01-22T00:00:43.901Z', 'Message': 'null'}
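Note that the output above keeps the space after the colon (' host15') and would fail on a line that has no colon at all. A minimal sketch of a more forgiving variant follows; the function name parse_headers and the choice to skip colon-less lines are assumptions for illustration, not part of the original answer:

```python
def parse_headers(text):
    """Parse 'Name: value' lines into a dict, ignoring lines without a colon."""
    info = {}
    for line in text.splitlines():
        # partition() splits at the first colon only and never raises:
        # sep is '' when the line contains no colon.
        key, sep, value = line.partition(':')
        if not sep:
            continue
        # Strip the space that may follow the colon, e.g. 'Source: host15'.
        info[key.strip()] = value.strip()
    return info


data = """name:Contoso.Service
Source: host15
Timestamp: 2019-01-22T00:00:43.901Z
Message:null"""

job_info = parse_headers(data)
print(job_info['name'])       # Contoso.Service
print(job_info['Source'])     # host15
print(job_info['Timestamp'])  # 2019-01-22T00:00:43.901Z
print(job_info['Message'])    # null
```

Because only the first colon is used as the separator, the colons inside the timestamp stay part of the value.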

Selenium Python web scraping UTF-8

This may have been asked before, but I could not find a proper answer, so I am asking a similar question. I have been trying to scrape a Turkish car sales website named 'Sahibinden'. I use Jupyter Notebook and the Sublime editor. When I write the data to a CSV file, the Turkish letters change to different characters. I tried 'UTF-8 Encoding', '# -- coding: utf-8 --', ISO 8859-9, etc., but I could not solve the problem. The other issue is that the Sublime editor does not create the CSV file, even though I had no problem in Jupyter Notebook. You will find the CSV file output in the image link. I would appreciate any reply.
Note: the program works without problems when I only run the print commands in the editors.
Thanks a lot.
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
import unicodedata
with open ('result1.csv','w') as f:
    f.write('brand, model, year, oil_type, gear, odometer, body, hp, '
            'eng_dim, color, warranty, condition, price, safe, '
            'in_fea, outs_fea, mul_fea,pai_fea, rep_fea, acklm \n')
chrome_path = r"C:\Users\Mike\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
def final_page(fn_20):
    for lur in fn_20:
        driver.get(lur)
        brand = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[3]/span''')
        brand = brand.text
        brand = brand.encode("utf-8")
        print (brand)
        model = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[5]/span''')
        model = model.text
        model = model.encode("utf-8")
        print (model)
        year = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[6]/span''')
        year = year.text
        year = year.encode("utf-8")
        print (year)
        oil_type = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[7]/span''')
        oil_type = oil_type.text
        oil_type = oil_type.encode("utf-8")
        print (oil_type)
        gear = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[8]/span''')
        gear = gear.text
        gear = gear.encode("utf-8")
        print (gear)
        odometer = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[9]/span''')
        odometer = odometer.text
        odometer = odometer.encode("utf-8")
        print (odometer)
        body = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[10]/span''')
        body = body.text
        body = body.encode("utf-8")
        print (body)
        hp = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[11]/span''')
        hp = hp.text
        hp = hp.encode("utf-8")
        print (hp)
        eng_dim = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[12]/span''')
        eng_dim = eng_dim.text
        eng_dim = eng_dim.encode("utf-8")
        print (eng_dim)
        color = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[14]/span''')
        color = color.text
        color = color.encode("utf-8")
        print (color)
        warranty = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[15]/span''')
        warranty = warranty.text
        warranty = warranty.encode("utf-8")
        print (warranty)
        condition = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/ul/li[19]/span''')
        condition = condition.text
        condition = condition.encode("utf-8")
        print (condition)
        price = driver.find_element_by_xpath('''//*[@id="classifiedDetail"]/div[1]/div[2]/div[2]/h3''')
        price = price.text
        price = price.encode("utf-8")
        print (price)
        safe = ''
        safety1 = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[1]/li[@class='selected']''')
        for ur in safety1:
            ur1 = ur.text
            ur1 = ur1.encode("utf-8")
            safe +=ur1 + ', '
        print (safe)
        in_fea = ''
        in_features = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[2]/li[@class='selected']''')
        for ins in in_features:
            ins1 = ins.text
            ins1 = ins1.encode("utf-8")
            in_fea += ins1 + ', '
        print (in_fea)
        outs_fea = ''
        out_features = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[3]/li[@class='selected']''')
        for outs in out_features:
            out1 = outs.text
            out1 = out1.encode("utf-8")
            outs_fea += out1 + ', '
        print (outs_fea)
        mul_fea = ''
        mult_features = driver.find_elements_by_xpath('''//div[@id='classifiedProperties']/ul[4]/li[@class='selected']''')
        for mults in mult_features:
            mul = mults.text
            mul = mul.encode("utf-8")
            mul_fea += mul + ', '
        print (mul_fea)
        pai_fea = ''
        paint = driver.find_elements_by_xpath('''//div[@class='classified-pair custom-area ']/ul[1]/li[@class='selected']''')
        for pai in paint:
            pain = pai.text
            pain = pain.encode("utf-8")
            pai_fea += pain + ', '
        print (pai_fea)
        rep_fea = ''
        replcd = driver.find_elements_by_xpath('''//div[@class='classified-pair custom-area']/ul[2]/li[@class='selected']''')
        for rep in replcd:
            repa = rep.text
            repa = repa.encode("utf-8")
            rep_fea += repa + ', '
        print (rep_fea)
        acklm = driver.find_element_by_xpath('''//div[@id='classified-detail']/div[@class='uiBox'][1]/div[@id='classifiedDescription']''')
        acklm = acklm.text
        acklm = acklm.encode("utf-8")
        print (acklm)
        try:
            with open ('result1.csv', 'a') as f:
                f.write (brand + ',' + model + ',' + year + ',' + oil_type + ',' + gear + ',' + odometer + ',' + body + ',' + hp + ',' + eng_dim + ',' + color + ',' + warranty + ',' + condition + ',' + price + ',' + safe + ',' + in_fea + ',' + outs_fea + ',' + mul_fea + ',' + pai_fea + ',' + rep_fea + ',' + acklm + '\n')
        except Exception as e:
            print (e)
    driver.close()
import codecs
file = codecs.open("utf_test", "w", "utf-8")
file.write(u'\ufeff')
file.write("test with utf-8")
file.write("字符")
file.close()
Or this also works for me:
with codecs.open("utf_test", "w", "utf-8-sig") as temp:
    temp.write("this is a utf-test\n")
    temp.write(u"test")
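Both snippets above put a byte order mark (BOM) at the start of the file, which many spreadsheet programs, Excel among them, use to recognize UTF-8. As a minimal sketch of the same idea applied to a CSV file, assuming Python 3: keep the scraped values as str (no .encode calls) and let open() do the encoding. The rows and the file name below are hypothetical stand-ins for the scraped data:

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped values; they stay str, not bytes.
rows = [
    ["brand", "model", "color"],
    ["Şahin", "Doğan", "Gümüş Gri"],
]

path = os.path.join(tempfile.gettempdir(), "result_utf8_demo.csv")

# utf-8-sig prepends the BOM; newline="" is what the csv module expects.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Reading back with the same encoding drops the BOM and restores the text.
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back[1])
```

Using csv.writer instead of joining fields with ',' also quotes values that themselves contain commas, such as the feature lists.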

Data sorting, combining 2 lines

I have a file looking this way:
;1;108/1;4, 109
;1;51;4, 5
;2;109/2;4, 5
;2;108/2;4, 109
;3;108/2;4, 109
;3;51;4, 5
;4;109/2;4, 5
;4;51;4, 5
;5;109/2;4, 5
;5;40/6;5, 6, 7
where
;id1;id2;position_on_shelf_id2
;id1;id3;position_on_shelf_id3
As a result, I want to get:
id1;id2-id3;x
where x is the set of shelf positions common to both id2 and id3. It should look like this:
1;108/1-51;4
2;109/2-108/2;4
3;108/2-51;4
4;109/2-51;4, 5
5;109/2-40/6;5
My script works fine up to the point where I need to write out the common shelf positions. I tried using .intersection, but it does not work properly when a position has more than one character (e.g. pos: 144, result: 14; pos: 551, result: 51; pos: 2222, result: 2):
result = id2_chars.intersection(id3_chars)
Is there any fix for the intersection? Or maybe you have a better method in mind?
Code so far:
Part 1 - merge every two lines together:
exp = open('output.txt', 'w')
with open("dane.txt") as f:
    content = f.readlines()
    strng = ""
    for i in range(1,len(content)+1):
        strng += content[i-1].strip()
        if i % 2 == 0:
            exp.writelines(strng + '\n')
            strng = ""
exp.close()
part2 - intersection:
exp = open('output2.txt', 'w')
imp = open('output.txt')
for line in imp:
    none, lp1, dz1, poz1, lp2, dz2, poz2 = line.split(';')
    s1 = poz1.lower()
    s2 = poz2.lower()
    s1_chars = set(s1)
    s2_chars = set(s2)
    result = s1_chars.intersection(s2_chars)
    result = str(result)
    exp.writelines(lp1 + ';' + dz1 + '-' + dz2 + ';' + result + '\n')
exp.close()
** I have not filtered the result for my needs yet (it is in "list" form), but that won't be a problem once I get the right intersection result.
Your main problem is that you try to intersect 2 sets of characters while you should intersect positions. So you should at least use:
...
s1 = poz1.lower()
s2 = poz2.lower()
s1_poz = set(x.strip() for x in s1.split(','))
s2_poz = set(x.strip() for x in s2.split(','))
result = s1_poz.intersection(s2_poz)
result = ', '.join(result)
...
But in fact, you could easily do the whole processing in one single pass:
exp = open('output.txt', 'w')
with open("dane.txt") as f:
    old = None
    for line in f:                 # one line at a time is enough
        line = line.strip()
        if old is None:            # first line of a block, just store it
            old = line
        else:                      # second line of a block, process both
            none, lp1, dz1, poz1 = old.split(';')
            none, lp2, dz2, poz2 = line.split(';')
            poz1x = set(x.strip() for x in poz1.lower().split(','))
            poz2x = set(x.strip() for x in poz2.lower().split(','))
            result = ', '.join(poz1x.intersection(poz2x))
            exp.write(lp1 + ';' + dz1 + '-' + dz2 + ';' + result + '\n')
            old = None
exp.close()
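Sets are unordered, so the joined result can come out as "5, 4" instead of "4, 5". Here is a minimal sketch of the same approach with a stable order, run on an in-memory sample instead of dane.txt. The helper names common_positions and merge_pairs are hypothetical, and sorting with key=int assumes every position is a whole number:

```python
def common_positions(first, second):
    """Return the positions present in both comma-separated strings, in numeric order."""
    first_set = {p.strip() for p in first.split(',')}
    second_set = {p.strip() for p in second.split(',')}
    # Intersect whole positions (not characters), then sort for a stable result.
    return sorted(first_set & second_set, key=int)


def merge_pairs(lines):
    """Combine each pair of consecutive ';id1;id2;positions' lines into one output line."""
    merged = []
    for upper, lower in zip(lines[0::2], lines[1::2]):
        _, id1, id2, pos2 = upper.strip().split(';')
        _, _, id3, pos3 = lower.strip().split(';')
        shared = ', '.join(common_positions(pos2, pos3))
        merged.append(f"{id1};{id2}-{id3};{shared}")
    return merged


sample = [
    ";1;108/1;4, 109",
    ";1;51;4, 5",
    ";4;109/2;4, 5",
    ";4;51;4, 5",
]
for row in merge_pairs(sample):
    print(row)
# 1;108/1-51;4
# 4;109/2-51;4, 5
```

Because whole positions are compared, 144 no longer matches 14 or 4.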

remove similar lines in text file

I do not normally use Python, but I have a script written in Python.
Part of the script:
elif line.find("CONECT") > -1:
    con = line.split()
    line_value = line_value + 1
    #print line_value
    #print con[2]
    try:
        line_j = "e" + ', ' + str(line_value) + ', ' + con[2] + "\n"
        output_file.write(line_j)
        print(line_j)
        line_i = "e" + ', ' + str(line_value) + ', ' + con[3] + "\n"
        output_file.write(line_i)
        print(line_i)
        line_k = "e"+ ', ' + str(line_value) + ', ' + con[4] + "\n"
        print(line_k)
        output_file.write(line_k)
    except IndexError:
        continue
which gives .txt output in the format
e, 1, 2
e, 1, 3
e, 1, 4
e, 2, 1
e, 2, 3
etc.
I need to remove lines that contain the same numbers as an earlier line, regardless of the order of the numbers,
e.g. the line e, 2, 1, which repeats e, 1, 2.
Is that possible?
Of course, it is better to modify your code to drop those lines BEFORE you write them to the file. You can use a list to store the values already saved and, on each iteration, check whether the values you want to add already exist in that list. The code below isn't tested or optimized, but it explains the idea:
# 'added = []' should be placed somewhere before 'if'
added = []
# your part of the code
elif line.find("CONECT") > -1:
    con = line.split()
    line_value = line_value + 1
    try:
        line_j = "e, %s, %s\n" % (str(line_value),con[2])
        tmp = sorted((str(line_value),con[2]))
        if tmp not in added:
            added.append(tmp)
            output_file.write(line_j)
            print(line_j)
        line_i = "e, %s, %s\n" % (str(line_value),con[3])
        tmp = sorted((str(line_value),con[3]))
        if tmp not in added:
            added.append(tmp)
            output_file.write(line_i)
            print(line_i)
        line_k = "e, %s, %s\n" % (str(line_value),con[4])
        tmp = sorted((str(line_value),con[4]))
        if tmp not in added:
            added.append(tmp)
            print(line_k)
            output_file.write(line_k)
    except IndexError:
        continue
Here is a comparison method for two lines of your file:
from collections import Counter
def compare(line1, line2):
    els1 = line1.strip().split(', ')
    els2 = line2.strip().split(', ')
    return Counter(els1) == Counter(els2)
See the documentation for the Counter class.
If the count of elements doesn't matter, you can replace the Counter class with set instead.
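To make the difference between the two choices concrete, here is a small sketch. The function names same_multiset and same_set are hypothetical, and the sample lines follow the "e, 1, 2" format from the question:

```python
from collections import Counter


def same_multiset(line1, line2):
    """True when both lines hold the same elements the same number of times."""
    return Counter(line1.strip().split(', ')) == Counter(line2.strip().split(', '))


def same_set(line1, line2):
    """True when both lines hold the same distinct elements, ignoring repeats."""
    return set(line1.strip().split(', ')) == set(line2.strip().split(', '))


print(same_multiset("e, 1, 2\n", "e, 2, 1\n"))  # True: order is ignored
print(same_multiset("e, 1, 1\n", "e, 1\n"))     # False: '1' appears twice vs once
print(same_set("e, 1, 1\n", "e, 1\n"))          # True: repeats are ignored
```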
The following approach should work. First add the following line further up in your code:
seen = set()
Then replace everything inside the try with the following code:
for con_value in con[2:5]:
    # convert line_value to str so that (1, '2') and (2, '1') compare equal
    entry = frozenset((str(line_value), con_value))
    if entry not in seen:
        seen.add(entry)
        line_j = "e" + ', ' + str(line_value) + ', ' + con_value + "\n"
        output_file.write(line_j)
        print(line_j)
Make sure this code is indented to the same level as the code it replaces.
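For a version that can be run on its own, here is a minimal sketch of the same seen-set idea wrapped in a generator. The function name unique_edges and the sample CONECT records are assumptions for illustration; the numbering follows the question's script, where line_value counts CONECT lines:

```python
def unique_edges(conect_lines):
    """Yield 'e, a, b' lines, skipping any pair already seen in either order."""
    seen = set()
    line_value = 0
    for line in conect_lines:
        if "CONECT" not in line:
            continue
        con = line.split()
        line_value += 1
        # Slicing never raises IndexError, even when fewer than 5 fields exist.
        for con_value in con[2:5]:
            # Both members are strings, so ('1', '2') and ('2', '1') compare equal.
            entry = frozenset((str(line_value), con_value))
            if entry not in seen:
                seen.add(entry)
                yield "e, %s, %s" % (line_value, con_value)


# Hypothetical CONECT records: an atom number followed by its bonded atoms.
sample = [
    "CONECT    1    2    3    4",
    "CONECT    2    1    3",
    "CONECT    3    1    2",
]
print(list(unique_edges(sample)))
# ['e, 1, 2', 'e, 1, 3', 'e, 1, 4', 'e, 2, 3']
```

A set lookup takes roughly constant time, whereas searching a list of already-written pairs gets slower as the list grows.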
