I am running this program to get the page source of a website I enter. It saves the source to a file, and I want it to search that file for a specific string, which is basically # for the emails. However, I can't get it to work.
import requests
import re
url = 'https://www.youtube.com/watch?v=GdKEdN66jUc&app=desktop'
data = requests.get(url)
# dump resulting text to file
with open("data6.txt", "w") as out_f:
    out_f.write(data.text)
with open("data6.txt", "r") as f:
    searchlines = f.readlines()
for i, line in enumerate(searchlines):
    if "#" in line:
        for l in searchlines[i:i+3]: print((l))
You can use the regex method findall to find all email addresses in your text content. Use file.read() instead of file.readlines() so that you get the whole content as one string rather than a list of separate lines.
For example:
import re
with open("data6.txt", "r") as file:
    content = file.read()
emails = re.findall(r"[\w\.]+#[\w\.]+", content)
Maybe cast to a set for uniqueness afterwards, and then save to a file however you like.
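A minimal sketch of that last step. The helper names, the sample text, and the output filename emails.txt are all made up for illustration. The pattern is the one from the answer above, with the separator passed in as a parameter so it can be swapped (for example to @ for ordinary addresses):

```python
import os
import re
import tempfile

def unique_matches(content, sep="#"):
    """Return sorted, de-duplicated matches of <word chars><sep><word chars>."""
    pattern = r"[\w.]+" + re.escape(sep) + r"[\w.]+"
    return sorted(set(re.findall(pattern, content)))

def save_lines(items, path):
    """Write one item per line to path."""
    with open(path, "w") as out_f:
        for item in items:
            out_f.write(item + "\n")

# Hypothetical sample standing in for the saved page source.
sample = "contact a.b#example.com or c#example.org, again a.b#example.com"
emails = unique_matches(sample)

# Written to a temporary directory here so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "emails.txt")  # hypothetical filename
    save_lines(emails, out_path)
    with open(out_path) as in_f:
        saved = in_f.read()
```

Sorting after the set conversion is optional; it just makes the output order stable between runs.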
Related
What is the way to extract only the lines containing a specific word from a requests response (an online text file) and write them to a new text file? I am stuck here...
This is my code:
r = requests.get('http://website.com/file.txt'.format(x))
with open('data.txt', 'a') as f:
    if 'word' in line:
        f.write('\n')
        f.writelines(str(r.text))
        f.write('\n')
If I remove if 'word' in line:, it works, but for all lines, so it only copies every line from one file to another.
Any idea how to extract (filter) only the lines containing a specific word?
Update: This is working, but if that word exists in the requested file, it starts copying ALL lines. I need to copy only the line with 'SOME WORD'.
I have added this code:
for line in r.text.split('\n'):
    if 'SOME WORD' in line:
Thank you all for the answers, and sorry if I didn't make myself clear.
Perhaps this will help.
Whenever you make a GET, POST, or any other request, always check the HTTP status code of the response.
Now let's assume that the lines within the response text are delimited with newline ('\n') and that you want to write a new file (change the mode to 'a' if you want to append). Then:
import requests
(r := requests.get('SOME URL')).raise_for_status()
with open('SOME FILENAME', 'w') as outfile:
    for line in r.text.split('\n'):
        if 'SOME WORD' in line:
            print(line, file=outfile)
            break  # stops after the first match; remove this to write every matching line
Note:
You will need Python 3.8+ to use the walrus operator (:=) in this code.
I would suggest these steps for handling the file properly:
Step 1: Stream the downloaded file to a temporary file
Step 2: Read lines from the temporary file
Step 3: Generate the main file based on your filter
Step 4: Delete the temporary file
Below is code that follows these steps:
import requests
import os
def read_lines(file_name):
    with open(file_name,'r') as fp:
        for line in fp:
            yield line
if __name__=="__main__":
    word='ipsum'
    temp_file='temp_file.txt'
    main_file='main_file.txt'
    url = 'https://filesamples.com/samples/document/txt/sample3.txt'
    with open (temp_file,'wb') as out_file:
        content = requests.get(url, stream=True).content
        out_file.write(content)
    with open(main_file,'w') as mf:
        out=filter(lambda x: word in x,read_lines(temp_file))
        for i in out:
            mf.write(i)
    os.remove(temp_file)
Well, a line is missing: you need a loop that gives you each line so that the if statement has something to check.
import requests
r = requests.get('http://website.com/file.txt').text
with open('data.txt', 'a') as f:
    for line in r.splitlines(): #this is your loop where you get a hold of line.
        if 'word' in line: #so that you can check your 'word'
            f.write(line + '\n') # write the line that contains your word; splitlines() drops the newline, so add it back
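To try the filtering step on its own, without a network call, here is a small sketch. The function name lines_containing and the sample text are made up for illustration; in the real script the text would come from the response's .text attribute:

```python
def lines_containing(text, word):
    """Return the lines of text that contain word, without their line endings."""
    return [line for line in text.splitlines() if word in line]

# Hypothetical sample standing in for the downloaded file's text.
sample = "first word here\nnothing to see\nanother word\n"
matches = lines_containing(sample, "word")

# splitlines() removes the newline characters, so put one back per line
# when building the text that will be written to the output file.
output = "".join(line + "\n" for line in matches)
```

Keeping the filter separate from the download makes it easy to check that only the matching lines survive before pointing it at a real URL.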
I'm trying to replace all HTML codes in my HTML file in a for loop (not sure if this is the easiest approach) without changing the formatting of the original file. When I run the code below, the codes are not replaced. Does anyone know what could be wrong?
import re
tex=open('ALICE.per-txt.txt', 'r')
tex=tex.read()
for i in tex:
    if i =='&otilde;':
        i=='õ'
    elif i == '&ccedil;':
        i=='ç'
with open('Alice1.replaced.txt', "w") as f:
    f.write(tex)
    f.close()
You can use html.unescape.
>>> import html
>>> html.unescape('&otilde;')
'õ'
With your code:
import html
with open('ALICE.per-txt.txt', 'r') as f:
    html_text = f.read()
html_text = html.unescape(html_text)
with open('ALICE.per-txt.txt', 'w') as f:
    f.write(html_text)
Please note that I opened the files with a with statement. This takes care of closing the file after the with block - something you forgot to do when reading the file.
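As for why the loop in the question changes nothing: iterating over a string yields one character at a time, so the loop variable can never equal a multi-character entity such as &otilde;, and == only compares, it does not assign. A short sketch of html.unescape on a made-up sample string, showing that whitespace and line breaks are left alone:

```python
import html

# Hypothetical sample; the real text would come from the file being read.
source = "Cora&ccedil;&atilde;o\n  indented &otilde; line\n"

# unescape() converts every character reference in one pass and returns
# a new string; everything else, including indentation, is unchanged.
decoded = html.unescape(source)
```

Because strings are immutable, the result has to be assigned to a name (or written out) for the change to have any effect.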
Question
I have a text file that records the metadata of research papers requested through the SemanticScholar API. However, when I wrote the requested data, I forgot to add "\n" after each individual record. The result looks like
{<metadata1>}{<metadata2>}{<metadata3>}...
whereas, had I added "\n", it would be
{<metadata1>}
{<metadata2>}
{<metadata3>}
...
Now I would like to read the data. As all the metadata is stored on one line, I need some hacks.
First, I split the concatenated dicts on "{".
Then I try to convert each string back to a dict. Note that I allow for the possibility that a line is not in proper JSON format.
import json
with open("metadata.json", "r") as f:
    for line in f.readline().split("{"):
        print(json.loads("{" + line.replace("\'", "\"")))
However, there is still an error message
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
What should I do to recover all the metadata I collected?
MWE
Note: to produce the metadata.json file I use, run the following code; it should work out of the box.
import json
import urllib.parse  # urljoin lives in the urllib.parse submodule
import requests
baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
"b407a81019650fe8b0acf7e4f8f18451f9c803d5",
"ff118a6a74d1e522f147a9aaf0df5877fd66e377"]
for paperID in paperIDList:
    response = requests.get(urllib.parse.urljoin(baseURL, paperID))
    metadata = response.json()
    record = dict()
    record["title"] = metadata["title"]
    record["abstract"] = metadata["abstract"]
    record["paperId"] = metadata["paperId"]
    record["year"] = metadata["year"]
    record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
    record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
    with open("metadata.json", "a") as fileObject:
        fileObject.write(json.dumps(record))
The problem is that split("{") returns an empty first item, corresponding to the text before the opening {. Just skip the first element and everything works (I also added an r prefix to your quote replacements so that Python treats them as raw string literals and replaces them properly):
with open("metadata.json", "r") as f:
    for line in f.readline().split("{")[1:]:
        print(json.loads("{" + line.replace(r"\'", r"\"")))
As suggested in the comments, I would actually recommend recreating the file or saving a new version where you replace }{ by }\n{:
with open("metadata.json", "r") as f:
    data = f.read()
data_lines = data.replace("}{","}\n{")
with open("metadata_mod.json", "w") as f:
    f.write(data_lines)
That way you will have the metadata of a paper per line as you want.
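Both approaches above rely on the characters { and }{ never appearing inside a title or abstract. If that cannot be guaranteed, a different technique is to let the standard library's json.JSONDecoder.raw_decode find where each object ends. This is a sketch of that alternative, not the method from the answer; the function name and the sample data are made up:

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: two records with no separator, one with braces in a string.
sample = '{"title": "a {braced} title", "year": 2019}{"title": "b", "year": 2020}\n'
records = list(iter_concatenated_json(sample))
```

With the real file, the text would be the result of f.read() on metadata.json, and each yielded record could be written back out with one json.dumps(record) per line.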
I have two files: "invoiceencoded.txt"(base64 code) and "invoice.txt". I want to replace the word 'INPUT' in the second text file with the base64 code of the first text file. The purpose is to loop over the specific path for multiple examples of those, but that doesn't matter. I have the following code:
import re
import os
for f_name in os.listdir('C:/..'):
    if f_name.endswith('encoded.txt'):
        fin = open(f_name, "rt")
        filedata = fin.read()
        with open(f_name[:-11]+".txt", 'r+') as f:
            text = f.read()
            text = re.sub('INPUT', filedata, text)
            f.seek(0)
            f.write(text)
            f.truncate()
The 'INPUT' string is concatenated as 'abcINPUTdef'. However, instead of giving me
"abcbase64codedef", I get:
"abcbase64code
def"
Does anyone know how to remove this line break?
Thanks in advance
Probably the line break is at the end of your base64 string in invoiceencoded.txt.
I'd suggest that you remove those line breaks and rerun your script.
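If editing the encoded files by hand is not practical, the trailing line break can be dropped in the script instead. A minimal sketch, assuming the unwanted whitespace is only at the start or end of the base64 text; the function name and the sample strings are made up:

```python
def fill_placeholder(template, encoded, placeholder="INPUT"):
    """Replace placeholder in template with encoded, minus surrounding whitespace."""
    # strip() removes the trailing newline that most editors add to a file.
    # str.replace is enough here because the placeholder is a fixed string.
    return template.replace(placeholder, encoded.strip())

# Hypothetical file contents: the encoded file ends with a line break.
result = fill_placeholder("abcINPUTdef", "YmFzZTY0\n")
```

In the original loop this would correspond to calling .strip() on filedata before substituting it into text.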
I'm trying to write a web service in Python (I'm fairly new to it). I have access to an API that expects a URL in a specific format:
http://api.company-x.com/api/publickey/string/0/json
It is not a problem to perform GET requests one by one, but I would like to do them in a batch. So I have a text file with strings in it. For example:
string1,
string2,
string3,
I would like to write a Python script that iterates through that file, puts each string into the specific format, performs the requests, and writes the responses of the batch to a new text file. I've read the requests docs, which mention adding parameters to your URL, but that doesn't produce the specific format I need for this API.
My basic code so far without the loop looks like this:
import requests
r = requests.get('http://api.company-x.com/api/publickey/string/0/json')
print(r.url)
data = r.text
text_file = open("file.txt", "w")
text_file.write(data)
text_file.close()
First open the file that has the strings,
import requests
with open(filename) as file:
    data = file.read()
split_data = data.split(',')
Then iterate through the list,
for string in split_data:
    r = requests.get(string)
    (...your code...)
Is this what you wanted?
I've played around some more and this is what I wanted:
#requests to talk easily with API's
import requests
#strip() is a built-in string method, so no extra import is needed for it
#two variables to squeeze a string between these two so it will become a full uri
part1 = 'http://api.companyx.com/api/productkey/'
part2 = '/precision/format'
#open the outputfile before the for loop
text_file = open("uri.txt", "w")
#open the file which contains the strings
with open('strings.txt', 'r') as f:
    for i in f:
        uri = part1 + i.strip(' \n\t') + part2
        print(uri)
        text_file.write(uri)
        text_file.write("\n")
text_file.close()
#open a new file textfile for saving the responses from the api
text_file = open("responses.txt", "w")
#send every uri to the api and write the responses to a textfile
with open('uri.txt', 'r') as f2:
    for i in f2:
        uri = i.strip(' \n\t')
        batch = requests.get(uri)
        data = batch.text
        print(data)
        text_file.write(data)
        text_file.write('\n')
text_file.close()
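The same idea can be sketched without the intermediate uri.txt file. Everything named here (build_uris, fetch_all, the sample lines) is made up for illustration, and the fetch step is passed in as a function so the sketch runs offline. It also assumes the input lines may end with a comma, as in the question's example file:

```python
def build_uris(lines, prefix, suffix):
    """Build one URI per non-empty line, dropping whitespace and a trailing comma."""
    uris = []
    for line in lines:
        name = line.strip().rstrip(",")
        if name:
            uris.append(prefix + name + suffix)
    return uris

def fetch_all(uris, fetch):
    """Call fetch(uri) for each URI and collect the returned bodies."""
    return [fetch(uri) for uri in uris]

# Hypothetical input, shaped like the strings file in the question.
lines = ["string1,\n", "string2,\n", "\n", "string3,\n"]
uris = build_uris(lines, "http://api.companyx.com/api/productkey/", "/precision/format")

# Stand-in fetcher so nothing is sent over the network; with requests
# installed, pass `lambda uri: requests.get(uri).text` instead.
responses = fetch_all(uris, lambda uri: "response for " + uri)
```

The responses list can then be written to responses.txt one entry per line, as in the script above.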