How to put all these links in one text document - Python

So, here is the deal: I have this code below and it produces multiple results. How do I put all these results in a single document? I was wondering if it was possible to make all of this one list of links. It's coming out this way:
['http://acervo.estadao.com.br/pagina/#!/20171101-45305-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171004-45277-spo-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20171004-45277-nac-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20171109-45313-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171219-45353-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171122-45326-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171122-45326-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171229-45363-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171229-45363-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20180105-45370-nac-1-pri-a1-not/busca/minist%C3%A9rio']
['http://acervo.estadao.com.br/pagina/#!/20180202-45398-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20180202-45398-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20180131-45396-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20100702-42626-spo-1-pri-a1-not/busca/Ministro', 'http://acervo.estadao.com.br/pagina/#!/20101202-42779-spo-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20101220-42797-spo-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20100904-42690-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20101102-42749-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20100514-42577-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20100915-42701-spo-1-pri-a1-not/busca/Minist%C3%A9rio']
But I wanted something like a list, like this:
http://acervo.estadao.com.br/pagina/#!/20171101-45305-nac-1-pri-a1-not/busca/ministro
http://acervo.estadao.com.br/pagina/#!/20180202-45398-spo-1-pri-a1-not/busca/ministro
http://acervo.estadao.com.br/pagina/#!/20180131-45396-spo-1-pri-a1-not/busca/ministro
http://acervo.estadao.com.br/pagina/#!/20171101-45305-nac-1-pri-a1-not/busca/ministro
A bunch of links, in the order they were retrieved, in a .txt document. I have no idea how to start (I'm a newbie in programming).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

opts = Options()
opts.add_argument("user-agent=Mozilla/5.0")
driver = webdriver.Chrome(chrome_options=opts)

x = 1
driver.get("http://acervo.estadao.com.br/procura/#!/ministro%3B minist%C3%A9rio|||/Acervo/capa//1/2000|2010|2010///Primeira")
time.sleep(5)
page_number = driver.find_element_by_class_name("page-ultima-qtd").text

for i in range(int(page_number)):
    link = ("http://acervo.estadao.com.br/procura/#!/ministro%3B minist%C3%A9rio|||/Acervo/capa//{}/2000|2010|2010///Primeira").format(x)
    #driver.get(link)
    links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.LINK_TEXT, "LEIA ESTA EDIÇÃO")))
    references = [link.get_attribute("href") for link in links]
    driver.find_element_by_class_name("seta-right").click()
    time.sleep(1)
    print(references)
    x = x + 1
    #print(x)
    print(i)

import csv

list1 = ['a', 'b', 'c']
list2 = ['a', 'b', 'c']

# if the output you are getting is lists, you could put them all into one list first
master = list1 + list2  # concatenated lists
print(master)

# then simply send it to a file
with open("filenames.csv", 'w') as f:
    wr = csv.writer(f, lineterminator='\n')
    for row in master:
        wr.writerow([row])

Simplest solution: format your references list before printing, i.e.
# print(references)
print("\n".join(references))
or print them one by one (a bit more verbose, but it works):
# print(references)
for ref in references:
    print(ref)
and then use your OS's output redirection to send the output to a file (Linux example):
$ python yourscript.py > myurls.txt
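If you would rather have the script write the file itself, here is a minimal sketch that reuses the loop from the question and collects every page's links before writing them one per line (the myurls.txt name is just an example):

all_references = []  # collect the links from every result page

for i in range(int(page_number)):
    links = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.LINK_TEXT, "LEIA ESTA EDIÇÃO")))
    all_references.extend(l.get_attribute("href") for l in links)
    driver.find_element_by_class_name("seta-right").click()  # go to the next result page
    time.sleep(1)

# one link per line, in the order they were collected
with open("myurls.txt", "w") as f:
    f.write("\n".join(all_references))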

Related

Extracting information from multiple resumes all in PDF format

I have a data set with a column that has Google Drive links for resumes. I have 5000 rows, so there are 5000 links. I am trying to extract information like years of experience and salary from these resumes into 2 separate columns. So far I've seen many examples mentioned here on SO.
For example, the code mentioned below can only read the data from one file. How do I replicate this for multiple rows?
Please help me with this, or else I will have to manually go through 5000 resumes and fill in the data.
Hoping I'll get a solution for this painful problem.
import PyPDF2
import re

pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content.encode('utf-8'))

# to extract salary and experience using regular expressions
prog = re.compile("\s*(Name|name|nick).*")
result = prog.match("Name: Bob Exampleson")
if result:
    print(result.group(0))

result = prog.match("University: MIT")
if result:
    print(result.group(0))
Use a loop. Basically, you put your main code into a function (easier to read) and create a list of filenames. Then you iterate over this list, using the values from the list as the argument for your function:
Note: I didn't check your extraction code, just showing how to loop. There are also far more efficient ways to do this, but I'm assuming you're somewhat of a Python beginner, so let's keep it simple to start with.
# add your imports to the top
import re
import PyPDF2

# put the main code in a function for readability
def get_data(filename):
    pdf_file = open(filename, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))

    prog = re.compile("\s*(Name|name|nick).*")
    result = prog.match("Name: Bob Exampleson")
    if result:
        print(result.group(0))

    result = prog.match("University: MIT")
    if result:
        print(result.group(0))

# create a list of your filenames
files_list = ['a.pdf', 'b.pdf', 'c.pdf']

for filename in files_list:  # iterate over the list
    get_data(filename)
So now your next question might be: how do I create this list with 5000 filenames? This depends on what the files are called and where they are stored. If they are named sequentially, you could do something like:
files_list = []  # empty list
num_files = 5000  # total number of files

for i in range(1, num_files + 1):
    files_list.append(f'myfile-{i}.pdf')
This will create a list with 'myfile-1.pdf', 'myfile-2.pdf', etc.
Hopefully this is enough to get you started.
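If the filenames are not sequential, a common alternative (sketched here under the assumption that all the PDFs sit in one folder; the resumes/ path is made up) is to build the list with glob:

import glob

# every .pdf in the (hypothetical) resumes folder, in sorted order
files_list = sorted(glob.glob('resumes/*.pdf'))

for filename in files_list:
    get_data(filename)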
You can also use return in your function to build a new list with all of the output, which you can use later on, instead of printing the output as you go:
output = []

def doSomething(i):
    return i * 2

for i in range(1, 100):
    output.append(doSomething(i))

# output is now a list with values like:
# [2, 4, 6, 8, 10, 12, ...]
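Applied to the PDF loop above, a minimal sketch of that return-based pattern could have get_data return the first page's text instead of printing it, so all the results end up in one list:

def get_data(filename):
    # open the PDF and return the text of its first page
    pdf_file = open(filename, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    page_content = read_pdf.getPage(0).extractText()
    pdf_file.close()
    return page_content

# collect the text of every resume instead of printing it as you go
all_texts = [get_data(filename) for filename in files_list]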

python splitting array into 2 arrays

I'm trying to separate the information within this array. It has the title and the link: the first part of each pair is the title, and the last is the URL.
u'6 Essential Tips on How to Become a Full Stack Developer',
u'https://hackernoon.com/6-essential-tips-on-how-to-become-a-full-stack-developer-1d10965aaead'
u'6 Essential Tips on How to Become a Full Stack Developer',
u'https://hackernoon.com/6-essential-tips-on-how-to-become-a-full-stack-developer-1d10965aaead'
u'What is a Full-Stack Developer? - Codeup',
u'https://codeup.com/what-is-a-full-stack-developer/'
u'A Guide to Becoming a Full-Stack Developer in 2017 \u2013 Coderbyte ...',
u'https://medium.com/coderbyte/a-guide-to-becoming-a-full-stack-developer-in-2017-5c3c08a1600c'
I want to be able to push the titles into one list and the links into another, instead of having everything in one list.
Here is my current code:
main.py
from gsearch.googlesearch import search
from orderedset import OrderedSet
import re
import csv

results = search('Full Stack Developer')  # returns 10 or less results

myAraay = list()

for x in results:
    owl = re.sub('[\(\)\{\}<>]', '', str(x))
    myAraay.append(owl)

newArray = "\n".join(map(str, myAraay))
print(newArray)
Updated main.py (now getting "cannot concatenate 'str' and 'int' objects"):
from gsearch.googlesearch import search
from orderedset import OrderedSet
import re
import csv

results = search('Full Stack Developer')  # returns 10 or less results

myAraay = list()

for x in results:
    owl = re.sub('[\(\)\{\}<>]', '', str(x))
    myAraay.append(owl)

newArray = "\n".join(map(str, myAraay))

theDict = {}  # keep a dictionary for title:link
for idx, i in enumerate(newArray):  # loop through list
    if idx % 2 == 0:  # the title goes first...
        dict[i] = dict[i+1]  # add title and link to dictionary
    else:  # link comes second
        continue  # skip it, go to next title

print(theDict)
Assuming that the list is formatted such that the link always succeeds the title...
theDict = {}  # keep a dictionary for title:link
for idx, item in enumerate(OriginalList):  # loop through list
    if idx % 2 == 0:  # the title comes first...
        theDict[item] = OriginalList[idx + 1]  # add the title and its link to the dictionary
    else:  # the link comes second
        continue  # skip it, go to the next title
Here the resulting dictionary looks like:
{"How to be Cool" : "http://www.cool.org/cool"}
If you want lists instead of a dictionary, it's much the same...
articles = []  # keep a list of [title, link] pairs
for idx, item in enumerate(OriginalList):  # loop through list
    if idx % 2 == 0:  # the title comes first...
        articles.append([item, OriginalList[idx + 1]])  # add the title and its link to the list
    else:  # the link comes second
        continue  # skip it, go to the next title
And here the resulting list looks like:
[["How to be Cool", "http://www.cool.org/cool"]]

Unable to write extracted text as individual rows in csv

This may be considered the second part of the question Finding an element within an element using Selenium Webdriver.
What I'm doing here is, after extracting each text from the table, writing it into a CSV file.
Here is the code:
from selenium import webdriver
import os
import csv

chromeDriver = "/home/manoj/workspace2/RedTools/test/chromedriver"
os.environ["webdriver.chrome.driver"] = chromeDriver
driver = webdriver.Chrome(chromeDriver)
driver.get("https://www.betfair.com/exchange/football/coupon?id=2")

list2 = driver.find_elements_by_xpath('//*[@data-sportid="1"]')

couponlist = []
finallist = []

for game in list2[1:]:
    coup = game.find_element_by_css_selector('span.home-team').text
    print(coup)
    couponlist.append(coup)

print(couponlist)
print('its done')

outfile = open("./footballcoupons.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Games"])
writer.writerows(couponlist)
Results of 3 print statements:
Santos Laguna
CSMS Iasi
AGF
Besiktas
Malmo FF
Sirius
FCSB
Eibar
Newcastle
Pescara
[u'Santos Laguna', u'CSMS Iasi', u'AGF', u'Besiktas', u'Malmo FF', u'Sirius', u'FCSB', u'Eibar', u'Newcastle', u'Pescara']
its done
Now, you can see the code where I write these values into the CSV, but they end up written weirdly (see the snapshot). Can someone help me fix this, please?
According to the documentation, writerows takes a list of rows as its parameter, and
A row must be an iterable of strings or numbers for Writer objects
You are passing a list of strings, so writerows iterates over your strings, making a row out of each character.
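A quick way to see this, as a standalone sketch with an in-memory buffer instead of your file:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerows(['Santos Laguna'])  # a single string, not a list of rows
print(buf.getvalue())
# S,a,n,t,o,s, ,L,a,g,u,n,a
# the string itself was treated as a row, so each character became its own column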
You could use a loop:
for team in couponlist:
    writer.writerow([team])
or turn your list into a list of lists, then use writerows:
couponlist = [[team] for team in couponlist]
writer.writerows(couponlist)
But anyway, there's no need to use csv if you only have one column...
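For instance, if a plain text file with one team per line is enough, something like this would do (the filename is just an example):

with open("footballcoupons.txt", "w") as outfile:
    outfile.write("\n".join(couponlist))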

how to speed up the pattern search btw two lists : python

I have two fastq files like the one given below. Each record in the file starts with '#'. For two such files, my aim is to extract the records that are common between the two files.
#IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa`]baaaaa_aab]D^^`b`aYDW]abaa`^
#IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa`]`ba`]`aaaaYD\\_a``XT
I have tried this: first, I get a list of read IDs that are common between file1 and file2.
import sys

# ('reading files and storing all lines in a list')
data1 = open(sys.argv[1]).read().splitlines()
data2 = open(sys.argv[2]).read().splitlines()

# ('listing all read IDs from file1')
list1 = []
for item in data1:
    if '#' in item:
        list1.append(item)

# ('listing all read IDs from file2')
list2 = []
for item in data2:
    if '#' in item:
        list2.append(item)

# ('finding common reads in file1 and file2')
def intersect(a, b):
    return list(set(a) & set(b))

common = intersect(list1, list2)
Here, I search for the common IDs in the main file and export the data to a new file. The following code works fine for small files but freezes my computer if I try it with larger files. I believe the 'for' loop is taking too much memory:
# ('filtering read data from file1')
mod_data1 = open(sys.argv[1]).read().rstrip('\n').replace('#', ',#')
tab1 = open(sys.argv[1] + '_final', 'wt')
records1 = mod_data1.split(',')

for item in records1[1:]:
    if item.replace('\n', '\t').split('\t')[0] in common:
        tab1.write(item)
Please suggest what I should do with the code above so that it works on larger files (40-100 million records per file, each record being 4 lines).
Using list comprehensions, you could write:
list1 = [item for item in data1 if '#' in item]
list2 = [item for item in data2 if '#' in item]
You could also define them as sets directly using set comprehensions (depending on the version of Python you are using):
set1 = {item for item in data1 if '#' in item}
set2 = {item for item in data2 if '#' in item}
I'd expect creating the set from the beginning to be faster than creating a list and then making a set out of it.
As for the second part of the code, I am not quite sure yet what you are trying to achieve.
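That said, if the slow part is the filtering loop, note that the 'in common' test on a list is O(n) per record, which likely dominates for millions of records. Here is a minimal sketch combining both ideas, keeping the original filtering logic but using sets throughout (untested against real files, so treat it as an assumption about where the time goes):

# read IDs as sets straight away
ids1 = {item for item in data1 if '#' in item}
ids2 = {item for item in data2 if '#' in item}
common = ids1 & ids2  # set intersection; membership tests are now O(1)

# filtering loop unchanged apart from looking up in a set instead of a list
for item in records1[1:]:
    if item.replace('\n', '\t').split('\t')[0] in common:
        tab1.write(item)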

What does the following Python code do? It's like a list comprehension with parentheses.

I'm researching web crawlers made in Python, and I've stumbled across a pretty simple one. But, I don't understand the last few lines, highlighted in the following code:
import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)
    a = (links.pop(0) for _ in range(len(links)))  # What does this do?
    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.append(link)
That line looks like some kind of a list comprehension to me, but I'm not sure and I need an explanation.
It's a generator expression and it simply empties the list links as you iterate over it.
They could have replaced this part
a = (links.pop(0) for _ in range(len(links)))  # What does this do?
for link in a:
With this:
while links:
    link = links.pop(0)
And it would work the same. But since popping from the end of a list is more efficient, this would be better than either:
links.reverse()
while links:
    link = links.pop()
Of course, if you're fine with following the links in reverse order (I don't see why they need to be processed in order), it would be even more efficient to not reverse the links list and just pop off the end.
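In code, that last suggestion is simply:

while links:
    link = links.pop()  # pop() from the end is O(1); the links are just processed in reverse order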
It creates a generator which will take objects off the links list.
To explain:
range(len(links)) returns a list of numbers from 0 up to, but not including, the length of the links list. So if links contains [ "www.yahoo.com", "www.google.com", "www.python.org" ], then it will generate a list [ 0, 1, 2 ].
for _ in blah, just loops over the list, throwing away the result.
links.pop(0) removes the first item from links.
The entire expression returns a generator which pops items from the head of the links list.
And lastly, a demonstration in a python console:
>>> links = [ "www.yahoo.com", "www.google.com", "www.python.org "]
>>> a = (links.pop(0) for _ in range(len(links)))
>>> a.next()
'www.yahoo.com'
>>> links
['www.google.com', 'www.python.org ']
>>> a.next()
'www.google.com'
>>> links
['www.python.org ']
>>> a.next()
'www.python.org '
>>> links
[]
a = (links.pop(0) for _ in range(len(links)))
can also be written as:
a = []
for _ in range(len(links)):
    a.append(links.pop(0))
EDIT:
The only difference is that with the generator this is done lazily, so items are popped from links only as they are requested through a, not all at once. When dealing with lots of data this is much more efficient, and there is no simple way to get that lazy behaviour without a generator expression.
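A small illustration of that laziness:

links = ['a', 'b', 'c']
a = (links.pop(0) for _ in range(len(links)))
print(links)    # ['a', 'b', 'c'] -- creating the generator pops nothing yet
print(next(a))  # a
print(links)    # ['b', 'c'] -- items are popped only as they are requested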
