extracting hyperlinks from rtf with python

extracting hyperlinks from rtf with python - python

I'm trying to extract hyperlinks from rtfs, with python. I have like a 1000 rtfs to go through so figured if this could ease my task. But my code doesn't extract links to the articles, just the front page of that database. Here's what I wrote:
import csv
import re
with open('text.rtf', 'r') as file:
for line in file:
urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line)
print(urls)
with open ('some.csv','w') as fw:
writer = csv.writer(fw)
writer.writerows(urls)
And this is what's printed out:
[]
[]
[]
['https://database.com']
[]
[]
csv file is empty ...(And I want to write those urls into a csv file... Is it even possible?)
I guess this needs to be modified: 'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line'
I do not know how.

Python's built-in function open cannot alone decode RTF files. You'll need to install another package to handle that. There's another that can help with extracting urls from text as well. Not sure it's the most accurate url extractor solution, though.
pip install striprtf urlextract
Back in your main file you can try something like:
import csv
from striprtf import striprtf
from urlextract import URLExtract
with open( 'text.rtf', 'r') as rtf_file:
file_text = striprtf.rtf_to_text( rtf_file.read() )
extractor = URLExtract()
urls = extractor.find_urls(file_text)
with open('some.csv', 'w', newline='') as fw:
fieldnames = ['urls']
writer = csv.DictWriter(fw, fieldnames = fieldnames)
writer.writeheader()
for link in urls:
writer.writerow( {'urls': link} )
Hopefully this will get you what you need.

Related

Iterate through a html file and extract data to CSV file

I have searched high and low for a solution, but non have quite fit what I need to do.
I have an html page that is saved, as a file, lets call it sample.html and I need to extract recurring json data from it. An example file is as follows:
I need to get the info from these files regulary, so the amount of objects change every time, an object would be considered as "{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}"
I need to get each of the values to a CSV file, with column headings being SpecificIdent, SpecificNum, Meter, Power, WPower, Snumber, isI. The associated data would be the rows from each.
I apologize if this is a basic question in Python, but I am pretty new to it and cannot fathom the best way to do this. Any assistance would be greatly appreciated.
Kind regards
A
<html><head><meta name="color-scheme" content="light dark"></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">[{"SpecificIdent":2588,"SpecificNum":29,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":3716,"SpecificNum":39,"Meter":1835,"Power":11240.0,"WPower":null,"SNumber":"0703-403548","isI":false},{"SpecificIdent":6364,"SpecificNum":27,"Meter":7768,"Power":29969.0,"WPower":null,"SNumber":"467419","isI":false},{"SpecificIdent":6583,"SpecificNum":51,"Meter":7027,"Power":36968.0,"WPower":null,"SNumber":"JE1449-521248","isI":false},{"SpecificIdent":6612,"SpecificNum":57,"Meter":12828,"Power":53918.0,"WPower":null,"SNumber":"JE1509-534327","isI":false},{"SpecificIdent":7139,"SpecificNum":305,"Meter":6264,"Power":33101.0,"WPower":null,"SNumber":"JE1449-521204","isI":false},{"SpecificIdent":7551,"SpecificNum":116,"Meter":0,"Power":21569.0,"WPower":null,"SNumber":"JE1449-521252","isI":false},{"SpecificIdent":7643,"SpecificNum":56,"Meter":7752,"Power":40501.0,"WPower":null,"SNumber":"JE1449-521200","isI":false},{"SpecificIdent":8653,"SpecificNum":49,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":9733,"SpecificNum":142,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":10999,"SpecificNum":20,"Meter":7723,"Power":6987.0,"WPower":null,"SNumber":"JE1608-625534","isI":false},{"SpecificIdent":12086,"SpecificNum":24,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":14590,"SpecificNum":35,"Meter":394,"Power":10941.0,"WPower":null,"SNumber":"BN1905-944799","isI":false},{"SpecificIdent":14954,"SpecificNum":100,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"517163","isI":false},{"SpecificIdent":14995,"SpecificNum":58,"Meter":0,"Power":38789.0,"WPower":null,"SNumber":"JE1444-511511","isI":false},{"SpecificIdent":15245,"SpecificNum":26,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"430149","isI":false},{"SpecificIdent":18824,"SpecificNum":55,"Meter":8236,"Power":31358.0,"WPower":null,"SNumber":"0703-310839","isI":false},{"SpecificIdent":20745,"SpecificNum":41,"Meter":0,"Power":60963.0,"WPower":null,"SNumber":"JE1447-517260","isI":false},{"SpecificIdent":31584,"SpecificNum":11,"Meter":0,"Power":3696.0,"WPower":null,"SNumber":"467154","isI":false},{"SpecificIdent":32051,"SpecificNum":40,"Meter":7870,"Power":13057.0,"WPower":null,"SNumber":"JE1608-625593","isI":false},{"SpecificIdent":32263,"SpecificNum":4,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":33137,"SpecificNum":132,"Meter":5996,"Power":26650.0,"WPower":null,"SNumber":"459051","isI":false},{"SpecificIdent":33481,"SpecificNum":144,"Meter":4228,"Power":16136.0,"WPower":null,"SNumber":"JE1603-617807","isI":false},{"SpecificIdent":33915,"SpecificNum":145,"Meter":5647,"Power":3157.0,"WPower":null,"SNumber":"JE1518-549610","isI":false},{"SpecificIdent":36051,"SpecificNum":119,"Meter":2923,"Power":12249.0,"WPower":null,"SNumber":"135493","isI":false},{"SpecificIdent":37398,"SpecificNum":21,"Meter":58,"Power":5540.0,"WPower":null,"SNumber":"BN1925-982761","isI":false},{"SpecificIdent":39024,"SpecificNum":50,"Meter":7217,"Power":38987.0,"WPower":null,"SNumber":"JE1445-511599","isI":false},{"SpecificIdent":39072,"SpecificNum":59,"Meter":5965,"Power":32942.0,"WPower":null,"SNumber":"JE1449-521199","isI":false},{"SpecificIdent":40601,"SpecificNum":9,"Meter":0,"Power":59655.0,"WPower":null,"SNumber":"JE1447-517150","isI":false},{"SpecificIdent":40712,"SpecificNum":37,"Meter":0,"Power":5715.0,"WPower":null,"SNumber":"JE1502-525840","isI":false},{"SpecificIdent":41596,"SpecificNum":53,"Meter":8803,"Power":60669.0,"WPower":null,"SNumber":"JE1503-527155","isI":false},{"SpecificIdent":50276,"SpecificNum":30,"Meter":2573,"Power":4625.0,"WPower":null,"SNumber":"JE1545-606334","isI":false},{"SpecificIdent":51712,"SpecificNum":69,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":56140,"SpecificNum":10,"Meter":5169,"Power":26659.0,"WPower":null,"SNumber":"JE1547-609024","isI":false},{"SpecificIdent":56362,"SpecificNum":6,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":58892,"SpecificNum":113,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65168,"SpecificNum":5,"Meter":12739,"Power":55833.0,"WPower":null,"SNumber":"JE1449-521284","isI":false},{"SpecificIdent":65255,"SpecificNum":60,"Meter":5121,"Power":27784.0,"WPower":null,"SNumber":"JE1449-521196","isI":false},{"SpecificIdent":65665,"SpecificNum":47,"Meter":11793,"Power":47576.0,"WPower":null,"SNumber":"JE1509-534315","isI":false},{"SpecificIdent":65842,"SpecificNum":8,"Meter":10783,"Power":46428.0,"WPower":null,"SNumber":"JE1509-534401","isI":false},{"SpecificIdent":65901,"SpecificNum":22,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":65920,"SpecificNum":17,"Meter":9316,"Power":38242.0,"WPower":null,"SNumber":"JE1509-534360","isI":false},{"SpecificIdent":66119,"SpecificNum":43,"Meter":12072,"Power":52157.0,"WPower":null,"SNumber":"JE1449-521259","isI":false},{"SpecificIdent":70018,"SpecificNum":34,"Meter":11172,"Power":49706.0,"WPower":null,"SNumber":"JE1449-521285","isI":false},{"SpecificIdent":71388,"SpecificNum":54,"Meter":6947,"Power":36000.0,"WPower":null,"SNumber":"JE1445-512406","isI":false},{"SpecificIdent":71892,"SpecificNum":36,"Meter":15398,"Power":63691.0,"WPower":null,"SNumber":"JE1447-517256","isI":false},{"SpecificIdent":72600,"SpecificNum":38,"Meter":14813,"Power":62641.0,"WPower":null,"SNumber":"JE1447-517189","isI":false},{"SpecificIdent":73645,"SpecificNum":2,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77208,"SpecificNum":28,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":77892,"SpecificNum":15,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false},{"SpecificIdent":78513,"SpecificNum":31,"Meter":6711,"Power":36461.0,"WPower":null,"SNumber":"JE1445-511601","isI":false},{"SpecificIdent":79531,"SpecificNum":18,"Meter":0,"Power":0.0,"WPower":null,"SNumber":"","isI":false}]</pre></body></html>
I have tried examples from bs4, jsontoxml, and others, but I am sure there is a simple way to iterate and extract this?

I would harness python's standard library following way
import csv
import json
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
def handle_data(self, data):
if data.strip():
self.data = data
parser = MyHTMLParser()
with open("sample.html","r") as f:
parser.feed(f.read())
with open('sample.csv', 'w', newline='') as csvfile:
fieldnames = ['SpecificIdent', 'SpecificNum', 'Meter', 'Power', 'WPower', 'SNumber', 'isI']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(json.loads(parser.data))
which creates file starting with following lines
SpecificIdent,SpecificNum,Meter,Power,WPower,SNumber,isI
2588,29,0,0.0,,,False
3716,39,1835,11240.0,,0703-403548,False
6364,27,7768,29969.0,,467419,False
6583,51,7027,36968.0,,JE1449-521248,False
6612,57,12828,53918.0,,JE1509-534327,False
7139,305,6264,33101.0,,JE1449-521204,False
7551,116,0,21569.0,,JE1449-521252,False
7643,56,7752,40501.0,,JE1449-521200,False
8653,49,0,0.0,,,False
Disclaimer: this assumes JSON array you want is last text element which is not empty (i.e. contain at least 1 non-whitespace character).

There is a python library, called BeautifulSoup, that you could utilize to parse the whole HTML file:
# pip install bs4
from bs4 import BeautifulSoup
html = BeautifulSoup(your-html)
From here on, you can perform any actions upon the html. In your case, you just need to find the <pre> element, and get its contents. This can be achieved easily:
pre = html.body.find('pre')
text = pre.text
Finally, you need to parse the text, which it seems is JSON. You can do with Python's internal json library:
import json
result = json.loads(text)
Now, we need to convert this to a CSV file. This could be done, using the csv library:
import csv
with open('GFG', 'w') as f:
writer = csv.DictWriter(f, fieldnames=[
"SpecificIdent",
"SpecificNum",
"Meter",
"Power",
"WPower",
"SNumber",
"isI"
])
writer.writeheader()
writer.writerows(result)
Finally, your code should look something like this:
from bs4 import BeautifulSoup
import json
import csv
with open('raw.html', 'r') as f:
raw = f.read()
html = BeautifulSoup(raw)
pre = html.body.find('pre')
text = pre.text
result = json.loads(text)
with open('result.csv', 'w') as f:
writer = csv.DictWriter(f, fieldnames=[
"SpecificIdent",
"SpecificNum",
"Meter",
"Power",
"WPower",
"SNumber",
"isI"
])
writer.writeheader()
writer.writerows(result)

how to replace HTML codes in HTML file using python?

I'm trying to replace all HTML codes in my HTML file in a for Loop (not sure if this is the easiest approach) without changing the formatting of the original file. When I run the code below I don't get the codes replaced. Does anyone know what could be wrong?
import re
tex=open('ALICE.per-txt.txt', 'r')
tex=tex.read()
for i in tex:
if i =='õ':
i=='õ'
elif i == 'ç':
i=='ç'
with open('Alice1.replaced.txt', "w") as f:
f.write(tex)
f.close()

You can use html.unescape.
>>> import html
>>> html.unescape('õ')
'õ'
With your code:
import html
with open('ALICE.per-txt.txt', 'r') as f:
html_text = f.read()
html_text = html.unescape(html_text)
with open('ALICE.per-txt.txt', 'w') as f:
f.write(html_text)
Please note that I opened the files with a with statement. This takes care of closing the file after the with block - something you forgot to do when reading the file.

csv.Error: iterable expected trying to save a CSV file from a For

everything good?
I need some help to save this script in CSV that reads a CSV and transforms the data through a lib. I've been racking my brain for hours and I can't figure out why I can't save the CSV file.
Can anybody help me? I am a beginner in python and I am learning the tool to use in ETL processes.
import csv
from user_agents import parse
with open('UserAgent.csv', 'r') as csv_file:
csv_reader = csv.reader(csv_file)
idUser = 0
space = ' / '
for line in csv_reader:
user_agent = parse(line[0])
idUser = idUser + 1
with open('data.csv', 'w') as f:
writer = csv.writer(f)
writer.writerow(user_agent)

writer.writerow expects an iterable. Your user_agent must not be an iterable.
Try
writer.writerow( [user_agent] )
instead of
writer.writerow(user_agent)
Check if that's what you want.

Python Web Api to CSV

I am looking for some assistance with writing API results to a .CSV file using Python.
I have my source as CSV file. It contains the below urls in a column as separate rows.
https://webapi.nhtsa.gov/api/SafetyRatings/modelyear/2013/make/Acura/model/rdx?format=csv
https://webapi.nhtsa.gov/api/SafetyRatings/modelyear/2017/make/Chevrolet/model/Corvette?format=csv
I can call the Web API and get the printed results. Please find attached 'Web API results' snapshot.
When I try to export these results into a csv, I am getting them as per the attached 'API results csv'. It is not transferring all the records. Right now, It is only sending the last record to csv.
My final output should be as per the attached 'My final output should be' for all the given inputs.
Please find the below python code that I have used. I appreciate your help on this. Please find attached image for my code.My Code
import csv, requests
with open('C:/Desktop/iva.csv',newline ='') as f:
reader = csv.reader(f)
for row in reader:
urls = row[0]
print(urls)
r = requests.get(urls)
print (r.text)
with open('C:/Desktop/ivan.csv', 'w') as csvfile:
csvfile.write(r.text)

You'll have to create a writer object of the csvfile(to be created). and use the writerow() method you could write to the csvfile.
import csv,requests
with open('C:/Desktop/iva.csv',newline ='') as f:
reader = csv.reader(f)
for row in reader:
urls = row[0]
print(urls)
r = requests.get(urls)
print (r.text)
with open('C:/Desktop/ivan.csv', 'w') as csvfile:
writerobj=csv.writer(r.text)
for line in reader:
writerobj.writerow(line)

One problem in your code is that every time you open a file using open and mode w, any existing content in that file will be lost. You could prevent that by using append mode open(filename, 'a') instead.
But even better. Just open the output file once, outside the for loop.
import csv, requests
with open('iva.csv') as infile, open('ivan.csv', 'w') as outfile:
reader = csv.reader(infile)
for row in reader:
r = requests.get(urls[0])
outfile.write(r.text)

Merge txt files in a folder and replacing characters in python

I have a doubt about how to do to continue the code, I need to take all files from a folder and merge them in 1 file with another text format.
Example:
The Input files are of text format like this:
"{'nr': '3173391045', 'data': '27/12/2017'}"
"{'nr': '2173391295', 'data': '05/01/2017'}"
"{'nr': '5173351035', 'data': '07/03/2017'}"
The Output files must be lines like this:
"3173391045","27/09/2017"
"2173391295","05/01/2017"
"5173351035","07/03/2017"
This is my working code, it's working for merge and taking out the blank lines
import glob2
import datetime
filenames=glob2.glob("*.txt")
with open(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")+".SAI", 'w') as file:
for filename in filenames:
with open(filename,"r") as f:
file.write(f.read())
I'm trying something with .replace but is not working, I get syntax errors or blank files
filedata = filedata.replace("{", "") for line in filedata

If your input files had contained valid JSON strings, the correct way would have been to parse the lines as JSON and write them back in csv. As strings are enclosed in single quotes (') they are rejected by the json module of the Python library, and my advice is to use a regex to parse them. Code could become:
import glob2
import datetime
import csv
import re
# the regex to parse the line
rx = re.compile(r".*'nr'\s*:\s*'(\d+)'.*'data'\s*:\s*'([/\d]+)'")
filenames=glob2.glob("*.txt")
with open(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")+".SAI", 'w') as file:
wr = csv.writer(file, quoting = csv.QUOTE_ALL)
for filename in filenames:
with open(filename,"r") as f:
for line in f: # process line by line
m = rx.match(line)
wr.writerow(m.groups())

With a few tweaks, the input data can be coerced into a form suitable for JSON parsing:
from datetime import datetime
import json
import glob2
import csv
with open(datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")+".SAI", 'w', newline='') as f_output:
csv_output = csv.writer(f_output, quoting=csv.QUOTE_ALL)
for filename in glob2.glob('*.txt'):
with open(filename) as f_input:
for row in f_input:
row_dict = json.loads(row.strip('"\n').replace("'", '"'))
csv_output.writerow([row_dict['nr'], row_dict['data']])
Giving you:
"3173391045","27/12/2017"
"2173391295","05/01/2017"
"5173351035","07/03/2017"
Note, in Python 3.x the output file should be opened with newline=''. Without this, extra blank lines can appear in the output file.

using regex/replaces to parse those strings is dangerous. You could always stumble on a data containing the delimiter, the comma, etc..
And in this case, even if json cannot read those lines,ast.literal_eval can without any modification whatsoever:
import ast
with open("output.csv",newline="") as fw:
cw = csv.writer(fw)
for filename in filenames:
with open(filename) as f:
for line in f:
d = ast.literal_eval(line)
cw.writerow([d['nr'],d['data'])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

extracting hyperlinks from rtf with python - python

Related

Iterate through a html file and extract data to CSV file

how to replace HTML codes in HTML file using python?

csv.Error: iterable expected trying to save a CSV file from a For

Python Web Api to CSV

Merge txt files in a folder and replacing characters in python

Categories

Resources