Extracting information from multiple resumes all in PDF format - python

I have a data set with a column that contains Google Drive links to resumes. There are 5000 rows, so 5000 links. I am trying to extract information such as years of experience and salary from these resumes into 2 separate columns. So far I've seen many examples mentioned here on SO.
For example: the code below can only read data from one file; how do I replicate this across multiple rows?
Please help me with this, otherwise I will have to go through 5000 resumes manually and fill in the data.
Hoping that I'll get a solution for this painful problem.
import re
import PyPDF2

pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content.encode('utf-8'))

# to extract salary and experience using regular expressions
prog = re.compile(r"\s*(Name|name|nick).*")
result = prog.match("Name: Bob Exampleson")
if result:
    print(result.group(0))
result = prog.match("University: MIT")
if result:
    print(result.group(0))

Use a loop. Basically, you put your main code into a function (easier to read) and create a list of filenames. Then you iterate over this list, using each value from the list as the argument for your function:
Note: I didn't check your scraping code, just showing how to loop. There are also far more efficient ways to do this, but I'm assuming you're somewhat of a Python beginner, so let's keep it simple to start with.
# add your imports to the top
import re
import PyPDF2

# put the main code in a function for readability
def get_data(filename):
    pdf_file = open(filename, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))
    prog = re.compile(r"\s*(Name|name|nick).*")
    result = prog.match("Name: Bob Exampleson")
    if result:
        print(result.group(0))
    result = prog.match("University: MIT")
    if result:
        print(result.group(0))

# create a list of your filenames
files_list = ['a.pdf', 'b.pdf', 'c.pdf']

for filename in files_list:  # iterate over the list
    get_data(filename)
So now your next question might be: how do I create this list with 5000 filenames? This depends on what the files are called and where they are stored. If they are named sequentially, you could do something like:
files_list = []  # empty list
num_files = 5000  # total number of files
for i in range(1, num_files + 1):
    files_list.append(f'myfile-{i}.pdf')
This will create a list with 'myfile-1.pdf', 'myfile-2.pdf', etc.
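If the filenames are not sequential, a directory listing works just as well; here is a minimal sketch using the standard glob module (assuming the PDFs sit in the working directory):

import glob

# collect every PDF in the current directory, in a stable order
files_list = sorted(glob.glob('*.pdf'))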
Hopefully this is enough to get you started.
You can also use return in your function to build a new list with all of the output, which you can use later on, instead of printing the output as you go:
output = []

def doSomething(i):
    return i * 2

for i in range(1, 100):
    output.append(doSomething(i))

# output is now a list with values like:
# [2, 4, 6, 8, 10, 12, ...]
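Putting the pieces together for the resume use case, here is a minimal sketch that collects the two fields into columns. It assumes the PDFs have already been downloaded locally from the Drive links, that pandas is available, and that the regex patterns are placeholders you would adapt to your resumes' actual wording:

import re
import PyPDF2
import pandas as pd

def extract_fields(filename):
    # read the first page of one resume and pull out rough matches
    with open(filename, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        text = reader.getPage(0).extractText()
    # placeholder patterns -- adapt these to how your resumes phrase things
    exp = re.search(r'(\d+)\+?\s*years', text, re.IGNORECASE)
    sal = re.search(r'salary[:\s]*([\d,]+)', text, re.IGNORECASE)
    return (exp.group(1) if exp else None,
            sal.group(1) if sal else None)

files_list = [f'myfile-{i}.pdf' for i in range(1, 5001)]
rows = [extract_fields(name) for name in files_list]
df = pd.DataFrame(rows, columns=['experience', 'salary'])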


split big pdf into multiple smaller pdfs of different page length based on specific string appearance in the big pdf using python

Problem
I have a long PDF file with many pages. I want this PDF split into many smaller files, whose length is derived from the text content of the long PDF. You can imagine the string as something that activates a scissors that cuts the long PDF and even gives the filename to the smaller PDF.
The "scissors" strings are generated by the following iterator and are represented by text:
for municipality in array_merged_zone_send:
    text = f'PANORAMICA DI {municipality.upper()}'
If I print text in the iterator, the result is this:
PANORAMICA DI BELLINZONA
PANORAMICA DI RIVIERA
PANORAMICA DI BLENIO
PANORAMICA DI ACQUAROSSA
The strings above are unique values, they appear only once.
Above I have shown only the first four; there are more, and EVERY item is written in the original PDF that I want to split. Every item appears exactly once in the original PDF, no more, no less (the match is always one to one), and the PDF never contains an additional "PANORAMICA DI........" that is not already an item obtained by the iteration. PANORAMICA means OVERVIEW in English.
Here is an example of the pages inside the original PDF that contain the string coming from the item "PANORAMICA DI BLENIO".
What I want to do: I want to split the original pdf every time that appears the string item.
In the image above, the original PDF has to be split in two: the first PDF ends on the page before "PANORAMICA DI BLENIO"; the second begins on the page with "PANORAMICA DI BLENIO" and ends on the page before the next "PANORAMICA DI {municipality.upper()}". The resulting PDF name is "zp_Blenio.pdf" for the second and "zp_Acquarossa" for the first. This should be no problem, because municipality without upper() is already right (in other words, it is "Acquarossa" and "Blenio").
Another example, to aid understanding, with a simplified simulation (my file has more pages):
Original PDF 12 pages long (note: this is not code, I only formatted it as code for readability):
page 1: "PANORAMICA DI RIVIERA"
page 2: no match with "text" item
page 3: no match with "text" item
page 4: "PANORAMICA DI ACQUAROSSA"
page 5: no match with "text" item
page 6: "PANORAMICA DI BLENIO"
page 7: no match with "text" item
page 8: no match with "text" item
page 9: no match with "text" item
page 10: no match with "text" item
page 11: "PANORAMICA DI BELLINZONA"
page 12: no match with "text" item
The results will be (again, not code, formatted this way for readability):
first created pdf is from page 1 to page 3
second created pdf is from page 4 to page 5
third pdf is from page 6 to 10
forth pdf is from page 11 to 12
The rule is: split at the page where the text appears, up to the page before the text appears again, and so on.
Take care: my original PDF is part of a long py code and the PDF changes every time, but the "PANORAMICA DI ....." rule does not change. In other words, the interval of pages between "PANORAMICA DI ACQUAROSSA" and "PANORAMICA DI BLENIO" may change. This rules out the workaround of manually setting the page intervals to split, ignoring the rules established above.
Attempt to solve the problem
The only solution to this issue that I have found is code that is obsolete and unchecked by its author, on this page: https://stackoverflow.com/a/62344714/13769033
I took the code, adapted it to the new functions and classes, and integrated the iteration to obtain text.
The result of the old code after my updating is the following:
from PyPDF2 import PdfWriter, PdfReader
import re

def getPagebreakList(file_name: str) -> list:
    pdf_file = PyPDF2.PdfReader(file_name)
    num_pages = len(pdf_file.pages)
    page_breaks = list()
    for i in range(0, num_pages):
        Page = pdf_file.pages[i]
        Text = PageObject.extract_text()
        for municipality in array_merged_zone_send:
            text = f'PANORAMICA DI {municipality.upper()}'
            if re.search(text, Text):
                page_breaks.append(i)
    return page_breaks

inputpdf = PdfReader(open("./report1.pdf", "rb"))
num_pages = len(inputpdf.pages)
page_breaks = getPagebreakList("./report1.pdf")

i = 0
while (i < num_pages):
    if page_breaks:
        page_break = page_breaks.pop(0)
    else:
        page_break = num_pages
    output = PdfWriter()
    while (i != page_break + 1):
        output.add_page(inputpdf.pages[i])
        i = i + 1
    with open(Path('.')/f'zp_{municipality}.pdf', "wb") as outputStream:
        output.write(outputStream)
Unfortunately, I don't understand a large part of the code.
Among the parts that I don't understand at all, where I don't know if the author made an error:
the indentation of "output = PdfWriter()"
the "getPagebreakList('./report1.pdf')" call, where I pass the same PDF that I want to split; the author instead wrote "getPagebreakList('yourPDF.pdf')", which was nevertheless different from PdfFileReader(open("80....pdf", "rb")). I assume it should have been yourPDF.pdf for both.
To be noted: "./report1.pdf" is the path of the PDF to split, and I am sure it is right.
The code is wrong: when I execute it I obtain "TypeError: 'list' object is not callable".
I would like someone to help me find the solution. You can modify my updated code or suggest another way to solve it. Thank you.
Suggestion to simulate
To simplify, at the beginning I suggest considering a static string of your PDF (a string that repeats every x pages) instead of part of an array.
In my case, I had considered:
Text = PageObject.extract_text()
text = 'PANORAMICA'
if re.search(text, Text):
    page_breaks.append(i)
....and changed even the path for the output.
You can simply use a long PDF with a fixed text that appears periodically but at irregular intervals (once after 3 pages, once after 5 pages, and so on).
Only once you find the solution should you integrate the iteration for municipality; a minimal sketch of this simplified test follows below. Integrating "municipality" into the text only serves to put "municipality" into the names of the new PDF files. Using only "PANORAMICA" has no impact on the length of the page intervals of the new PDFs.
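For that simplified simulation, a minimal, self-contained sketch of the page-break detection (assuming PyPDF2's current PdfReader API and a local report1.pdf) might look like this:

import re
from PyPDF2 import PdfReader

reader = PdfReader("./report1.pdf")
page_breaks = []
for i, page in enumerate(reader.pages):
    # guard against pages with no extractable text
    if re.search('PANORAMICA', page.extract_text() or ''):
        page_breaks.append(i)
print(page_breaks)  # indices of the pages where a new section starts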
My suggestion is to divide the problem into smaller ones, essentially a divide-and-conquer approach. With single-task functions, debugging in case of mistakes should be easier. Notice that getPagebreakList is slightly different here.
import re
from PyPDF2 import PdfWriter, PdfReader

def page_breaks(pdf_r: PdfReader) -> dict:
    p_breaks = {}
    for i in range(len(pdf_r.pages)):
        page_text = pdf_r.pages[i].extract_text()
        for municipality in array_merged_zone_send:
            pattern = f'PANORAMICA DI {municipality.upper()}'
            if re.search(pattern, page_text):
                p_breaks[municipality] = i
    return p_breaks

def filenames_range_mapper(pdf_r: PdfReader, page_indices: dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1]  # add last page as well
    # slice the pages from the reader object
    return {name: pdf_r.pages[start:end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}

def save(file_name: str, pdf_pages: list) -> None:
    # pass the pages to the writer
    pdf_w = PdfWriter()
    for p in pdf_pages:
        pdf_w.add_page(p)
    # write to file
    with open(file_name, "wb") as outputStream:
        pdf_w.write(outputStream)
    # message
    print(f'Pdf "{file_name}" created.')

# main
# ####

# initial location of the file
file_name = "./report1.pdf"
# create reader object
pdf_r = PdfReader(open(file_name, "rb"))
# get index locations of matches
p_breaks = page_breaks(pdf_r)
# dictionary of name-pages slice objects
mapper = filenames_range_mapper(pdf_r, p_breaks)
# template file name
template_output = './zp_{}.pdf'
# iterate over the location-pages mapper
for municipality, pages in mapper.items():
    # set file name
    new_file_name = template_output.format(municipality.title())  # eventually municipality.upper()
    # save the pages into a new file
    save(new_file_name, pages)
Test the code with an auxiliary function to avoid unwanted output.
For this it is enough to consider a slightly different implementation of filenames_range_mapper, in which the values are just lists of integers (and not page objects).
def filenames_range_mapper_tester(pdf_r: PdfReader, page_indices: dict) -> dict:
    num_pages = list(page_indices.values()) + [len(pdf_r.pages) + 1]  # add last page as well
    # slice the page numbers instead of the pages themselves
    return {name: list(range(len(pdf_r.pages) + 1))[start:end] for name, start, end in zip(page_indices, num_pages, num_pages[1:])}

# auxiliary test
file_name = "./report1.pdf"
pdf_r = PdfReader(open(file_name, "rb"))
p_breaks = page_breaks(pdf_r)
mapper = filenames_range_mapper_tester(pdf_r, p_breaks)
template_output = './zp_{}.pdf'
for name, pages in mapper.items():
    print(template_output.format(name.title()), pages)
If the output makes sense, then you can proceed with the non-testing code.
An abstraction of how to get the right pages:
# mimic return of "page_breaks"
page_breaks = {
    "RIVIERA": 1,
    "ACQUAROSSA": 4,
    "BLENIO": 6,
    "BELLINZONA": 11
}

# mimic "filenames_range_mapper"
last_page_of_pdf = 12 + 1  # increment the number of pages of the pdf by 1!
num_pages = list(page_breaks.values()) + [last_page_of_pdf]
# [1, 4, 6, 11, 13]

mapper = {name: list(range(start, end)) for name, start, end in zip(page_breaks, num_pages, num_pages[1:])}
# {'RIVIERA': [1, 2, 3],
#  'ACQUAROSSA': [4, 5],
#  'BLENIO': [6, 7, 8, 9, 10],
#  'BELLINZONA': [11, 12]}

How to get unique results for the same name with multiple values in Python

I have a large csv file whose URLs I compare against my txt files.
How do I get unique results when the same name has multiple values? And is there a better way to compare the two files for speed? The csv file is at least 1 GB.
file1.csv
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
script:
import csv

with open("file1.csv", 'r') as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war, srcip))
        for to in ko:
            with open("file2.txt", "r") as f:
                all_val = set()
                for i in f:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)
my output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
How do I get, when the URL is the same, all of its values grouped together and deduplicated?
How do I get results like this?
amazon.com 102.12.14.22
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com 192.10.77.95
Short answer: you can't do this directly. Well, you can, but with poor performance.
CSV is a good storage format, but for something like this you might want to store everything in another, custom data file. You could first parse your file to use short unique IDs instead of long strings (like amazon = 0, wakers = 1, and so on) to perform better and reduce comparison cost.
The thing is, repeated scans are pretty bad for a large csv; memory-mapping the file or building a database from your csv might also be a good idea (make the changes in the database, and only dump back to csv when you need to).
Look at How do quickly search through a .csv file in Python for a more complete answer.
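As a rough illustration of the database idea, here is a minimal sketch using the standard sqlite3 module (the database filename, table name, and column names are all illustrative):

import csv
import sqlite3

conn = sqlite3.connect('log.db')
conn.execute('CREATE TABLE IF NOT EXISTS hits (src_ip TEXT, url TEXT)')
with open('file1.csv') as f:
    # pull the source IP (column 2) and URL (column 6) out of each row
    rows = ((row[2], row[6]) for row in csv.reader(f))
    conn.executemany('INSERT INTO hits VALUES (?, ?)', rows)
conn.commit()

# let the database engine deduplicate and group the IPs per URL
query = 'SELECT url, GROUP_CONCAT(DISTINCT src_ip) FROM hits GROUP BY url'
for url, ips in conn.execute(query):
    print(url, ips)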
Problem solution
import csv
import re

def possible_urls(filename, category, category_position, url_position):
    # Here we read a txt file to create a list of domains that could correspond to shops
    domains = []
    with open(filename, "r") as file:
        file_content = file.read().splitlines()
    for line in file_content:
        info_in_line = line.split(" ")
        # Here I use a regular expression to parse the domain from the url.
        domain = re.sub('www.', '', info_in_line[url_position])
        if info_in_line[category_position] == category:
            domains.append(domain)
    return domains

def read_from_csv(filename, ip_position, url_position, possible_domains):
    # Here we create a dictionary that holds all the ips each domain can have.
    # The dictionary looks like this:
    # {domain_name: [list of possible ips]}
    domain_ip = {domain: [] for domain in possible_domains}
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            if len(line) < max(ip_position, url_position):
                print(f'Not enough items in line {line}, to obtain url or ip')
                continue
            ip = line[ip_position]
            url = line[url_position]
            # Using a python regular expression to get the domain name from the url.
            domain = re.search('//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
            if domain in domain_ip.keys():
                domain_ip[domain].append(ip)
    return domain_ip

def print_formatted_result(result):
    # Prints the formatted result
    for shop_domain in result.keys():
        print(f'{shop_domain}: ')
        for shop_ip in result[shop_domain]:
            print(f'    {shop_ip}')

def create_list_of_shops():
    # Function that first creates a list of possible domains, and
    # then reads the ips for those domains from the csv
    possible_domains = possible_urls('file2.txt', 'shop', 2, 1)
    shop_domains_with_ip = read_from_csv('file1.csv', 2, 6, possible_domains)
    # Display the result we got in the previous operations
    print(shop_domains_with_ip)
    print_formatted_result(shop_domains_with_ip)

create_list_of_shops()
Output
A dictionary of IPs where domains are keys, so you can get all possible IPs for a domain by giving the name of that domain:
{'amazon.com': ['102.12.14.22', '167.27.14.62', '167.27.14.62', '197.99.94.32', '157.87.34.72'], 'wakers.com': ['192.10.77.95']}
amazon.com: 
    102.12.14.22
    167.27.14.62
    167.27.14.62
    197.99.94.32
    157.87.34.72
wakers.com: 
    192.10.77.95
Regular expressions
A very useful thing you can learn from this solution is regular expressions. Regular expressions are tools that allow you to filter or retrieve information from lines in a very convenient way. They also greatly reduce the amount of code, which makes the code more readable and safe.
Let's consider your code for removing ports from strings and think about how we can replace it with a regex.
lines = url.replace(":443", "").replace(":8080", "")
Replacing ports in this way is fragile, because you can never be sure which port numbers will actually appear in a url. What if port number 5460 appears, or port number 1022, etc.? For each such port you would add a new replace, and soon your code would look something like this:
lines = url.replace(":443", "").replace(":8080", "").replace(":5460","").replace(":1022","")...
Not very readable. But with a regular expression you can describe a pattern. And the great news is that we actually know the pattern for urls with port numbers. They all look like this:
:some_digits. So if we know the pattern, we can describe it with a regular expression and tell python to find everything that matches it and replace it with the empty string '':
re.sub(':\d+', '', url)
This tells the python regular expression engine:
Look for all digits in the string url that come after : and replace them with the empty string. This solution is shorter, safer, and much more readable than the replace chain, so I suggest you read about regular expressions a little. A great resource to learn them is
this site. There you can also test your regexes.
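For example, a quick check of that substitution on one of the urls from the sample data:

import re

print(re.sub(r':\d+', '', 'http://www.google.com:443'))
# http://www.google.com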
Explanation of the regular expressions in the code
re.sub('www.', '', info_in_line[url_position])
Look for every www. in the string info_in_line[url_position] and replace it with the empty string.
re.search('www.(.[^/]*)[:|/]', url).group(1)
Let's split it into parts:
[^/] - here can be anything except /
(.[^/]*) - here I used a match group. It tells the engine which part of the match we are interested in.
[:|/] - the characters that can stand in that place. Long story short: after the capturing group there can be : or(|) /.
Summarizing, the regex can be expressed in words as follows:
Find all substrings that start with www. and end with : or /, and return everything that stands between them.
group(1) - means get the first match group.
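To see the capturing group in action, a tiny check using one of the urls from the sample data:

import re

m = re.search('www.(.[^/]*)[:|/]', 'http://www.google.com:443')
print(m.group(1))
# google.com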
Hope the answer is helpful!
If you used the URL as the key in a dictionary and made the sets of IP addresses the values of the dictionary, would that achieve what you intended?
my_dict = {
    'www.amazon.com': {
        '102.12.14.22',
        '167.27.14.62',
        '197.99.94.32',
        '157.87.34.72',
    },
    'www.wakers.com': {'192.10.77.95'},
}
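A minimal sketch of how such a dictionary of sets could be filled while reading the csv (reusing the URL-trimming logic and filenames from the question):

import csv
from collections import defaultdict

url_ips = defaultdict(set)  # sets deduplicate the IPs automatically
with open('file1.csv') as f:
    for row in csv.reader(f):
        srcip = row[2]
        url = row[6].replace(':443', '').replace(':8080', '')
        domain = url.split('//')[-1].split('/')[0].split('?')[0]
        url_ips[domain].add(srcip)

for domain, ips in url_ips.items():
    print(domain, sorted(ips))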
## I have used your code as a base to get your desired output
## Copy-paste the code & execute it to get the result
import csv

url_dict = {}

## STEP 1: Open file2.txt to get url names
with open("file2.txt", "r") as f:
    for i in f:
        val = i.strip().split(" ")[1]
        url_dict[val] = []

## STEP 2: 2.1 Open csv file 'file1.csv' to extract url name & ip address
##         2.2 Check if url from file2.txt is available in the extracted url from 'file1.csv'
##         2.3 Create a dictionary with the matched url & its ip address
##         2.4 Remove duplicates in ip addresses from same url
with open("file1.csv", 'r') as f:  ## 2.1
    reader = csv.reader(f)
    for k in reader:
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        for key, value in url_dict.items():
            if key in war:  ## 2.2
                url_dict[key].append(srcip)  ## 2.3

## 2.4
for key, value in url_dict.items():
    url_dict[key] = list(set(value))

## STEP 3: Print dictionary output to .TXT file
file3 = open('output_text.txt', 'w')
for key, value in url_dict.items():
    file3.write('\n' + key + '\n')
    for item in value:
        file3.write(' ' * 15 + item + '\n')
file3.close()

How to put all this links in one text document

So, here is the deal: I have this code below and it produces multiple results. How do I put all these results in a single document? I was wondering if it was possible to make all of this one list of links. It's coming out this way:
['http://acervo.estadao.com.br/pagina/#!/20171101-45305-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171004-45277-spo-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20171004-45277-nac-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20171109-45313-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171219-45353-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171122-45326-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171122-45326-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171229-45363-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20171229-45363-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20180105-45370-nac-1-pri-a1-not/busca/minist%C3%A9rio']
['http://acervo.estadao.com.br/pagina/#!/20180202-45398-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20180202-45398-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20180131-45396-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20100702-42626-spo-1-pri-a1-not/busca/Ministro', 'http://acervo.estadao.com.br/pagina/#!/20101202-42779-spo-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20101220-42797-spo-1-pri-a1-not/busca/Minist%C3%A9rio', 'http://acervo.estadao.com.br/pagina/#!/20100904-42690-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20101102-42749-spo-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20100514-42577-nac-1-pri-a1-not/busca/ministro', 'http://acervo.estadao.com.br/pagina/#!/20100915-42701-spo-1-pri-a1-not/busca/Minist%C3%A9rio']
But I wanted something like a list, like this:
http://acervo.estadao.com.br/pagina/#!/20171101-45305-nac-1-pri-a1-not/busca/ministro
http://acervo.estadao.com.br/pagina/#!/20180202-45398-spo-1-pri-a1-not/busca/ministro
http://acervo.estadao.com.br/pagina/#!/20180131-45396-spo-1-pri-a1-not/busca/ministro
http://acervo.estadao.com.br/pagina/#!/20171101-45305-nac-1-pri-a1-not/busca/ministro
A bunch of links, in the order they were retrieved, in a .txt document. I have no idea how to start (I'm a newbie in programming).
opts = Options()
opts.add_argument("user-agent=Mozilla/5.0")
driver = webdriver.Chrome(chrome_options=opts)
x = 1
driver.get("http://acervo.estadao.com.br/procura/#!/ministro%3B minist%C3%A9rio|||/Acervo/capa//1/2000|2010|2010///Primeira")
time.sleep(5)
page_number = driver.find_element_by_class_name("page-ultima-qtd").text
for i in range(int(page_number)):
    link = ("http://acervo.estadao.com.br/procura/#!/ministro%3B minist%C3%A9rio|||/Acervo/capa//{}/2000|2010|2010///Primeira").format(x)
    #driver.get(link)
    links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.LINK_TEXT, "LEIA ESTA EDIÇÃO")))
    references = [link.get_attribute("href") for link in links]
    driver.find_element_by_class_name("seta-right").click()
    time.sleep(1)
    print(references)
    x = x + 1
    #print(x)
print(i)
import csv

list1 = ['a', 'b', 'c']
list2 = ['a', 'b', 'c']

# if the output you're getting is lists, you could put them all into one list first
master = list1 + list2
# concatenated lists
print(master)

# then simply send to file
with open("filenames.csv", 'w') as f:
    wr = csv.writer(f, lineterminator='\n')
    for row in master:
        wr.writerow([row])
Simplest solution: format your references list before printing, i.e.
# print(references)
print("\n".join(references))
or print them one by one (might be a bit longer but well):
# print(references)
for ref in references:
    print(ref)
and then use your OS redirections to redirect the output to a file (linux example):
$ python yourscript.py > myurls.txt
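Alternatively, here is a sketch of writing the links to a file from within the script itself (the output filename is illustrative). Append mode makes each loop iteration add its links instead of overwriting the file:

# inside the loop, after references is built
with open('myurls.txt', 'a') as f:
    for ref in references:
        f.write(ref + '\n')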

Storing different lists in csv file- not overwriting last addition

I am new to Python and somehow can't quite get this simple task to run.
I have generated a randomization of some images I would like to later present to each participant in my experiment. The randomization assigns each participant a particular order for a list of images to be presented. The randomization looks as follows:
all_images_face = ["01_1.jpg", "02_1.jpg", "03_1.jpg", "04_1.jpg", "05_1.jpg",
                   "06_1.jpg", "07_1.jpg", "08_1.jpg", "09_1.jpg", "10_1.jpg"]
all_images_scene = ["01_2.jpg", "02_2.jpg", "03_2.jpg", "04_2.jpg", "05_2.jpg",
                    "06_2.jpg", "07_2.jpg", "08_2.jpg", "09_1.jpg", "10_2.jpg"]
ind_scene = range(len(all_images_scene))
len_scene = len(all_images_scene)

for p in range(0, participants):  # for each participant
    rand_ind_face = random.sample(range(len(all_images_face)), len(all_images_face)/2)
    TLS = []
    list_scene = []
    while True:
        for i in range(0, len_scene):
            print "loop round: ", i
            if ind_scene[i] not in rand_ind_face[:]:
                el = ind_scene[i]
                TLS.append(el)
                list_scene.append(el)
                print "list_scene: ", list_scene
                print "TLS: ", TLS
        if len(TLS) == 0:
            EF = "Error"

        list_all = TLS + rand_ind_face  # scenes + faces
        final_scene = []  # list to retrieve elements from index
        final_face = []
        for i in list_all[:len(all_images_face)/2]:  # first half scenes
            final_scene.append(all_images_scene[i])
        for i in list_all[len(all_images_face)/2:]:  # second half faces
            final_face.append(all_images_face[i])
        str_all = final_scene + final_face
        print str_all

        # needed data
        random.shuffle(str_all)  # shuffle order of scene/face of stimuli
        print str_all

        # write the str_all into csv
        fp = open('list_stim.csv', 'w')
        wr = csv.writer(fp, delimiter=',')
        wr.writerow(str_all)
        if p == participants:
            fp.close()
I end up with, for example, a list that looks like this for p == 1:
str_all = ['01_2.jpg', '06_2.jpg', '08_1.jpg', '04_2.jpg', '10_1.jpg',
'07_2.jpg', '02_1.jpg', '05_2.jpg', '09_1.jpg', '03_1.jpg']
For each participant this random list of string names is different. I would like to store each new str_all list as a new row of the same csv file, where each element corresponds to a column (meaning each added row is for a new participant). I manually created a csv file called list_stim.csv in Excel.
This last bit of code lets me add my newly created str_all list, but when I run the loop again (for p == 2) it does not add the new list; it overwrites the old one.
# write the str_all into csv
fp = open('list_stim.csv','w')
wr = csv.writer(fp,delimiter = ',')
wr.writerow(str_all)
In the code pasted above, I cannot see any loop which iterates through the length of str_all, even though you are calling str_all[i].
The solution which I think should work for you is:
fp = open('list_stim.csv','w')
wr = csv.writer(fp,delimiter = ',')
wr.writerow(str_all)
fp.close()
This will write str_all into the CSV. Each item of the list str_all will be a column in the CSV file.
From the question, it seems you want to write multiple such lists.
So you will have to define a list which contains all these lists. I am showing an example here:
str_all1 = ['01_2.jpg', '06_2.jpg', '08_1.jpg', '04_2.jpg', '10_1.jpg']
str_all2 = ['10_1.jpg', '01_2.jpg', '06_2.jpg', '08_1.jpg', '03_1.jpg']
str_all3 = ['06_2.jpg', '08_1.jpg', '01_2.jpg', '04_2.jpg', '10_1.jpg']
big_str = [str_all1, str_all2, str_all3]

fp = open('list_stim.csv', 'w')
wr = csv.writer(fp, delimiter=',')
for i in range(len(big_str)):
    wr.writerow(big_str[i])
fp.close()
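If the lists are produced one at a time inside the participant loop (as in the question), another option is to open the file in append mode, so each run adds a row instead of truncating the file; a minimal sketch:

import csv

# 'a' appends a new row per participant; 'w' would overwrite the file each time
fp = open('list_stim.csv', 'a')
wr = csv.writer(fp, delimiter=',')
wr.writerow(str_all)
fp.close()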
This is also easily stored in a JSON file. Here is an example of how:
import json
import os
import random

def func_returns_imagelist():
    # Do some magic here
    l = ['01_2.jpg', '06_2.jpg', '08_1.jpg', '04_2.jpg', '10_1.jpg',
         '07_2.jpg', '02_1.jpg', '05_2.jpg', '09_1.jpg', '03_1.jpg']
    random.shuffle(l)
    return l

def open_database(s):
    # Load the database if it exists, else return an empty dict
    if os.path.isfile(s):
        with open(s) as f:
            return json.load(f)
    else:
        return {}

def save_database(s, d):
    # Save to file
    with open(s, 'w') as f:
        json.dump(d, f, indent=2)

def main():
    pathtofile = 'mydb.json'
    d = open_database(pathtofile)
    d['participant{}'.format(len(d) + 1)] = func_returns_imagelist()
    save_database(pathtofile, d)

main()
This script will try to open a file called 'mydb.json' and return the data as a dictionary; if the file doesn't exist, it creates it.
It will add a participant with the function (func_returns_imagelist).
It will save the file back to JSON.
Try running this multiple times and you can see that the file ('mydb.json') grows each time.

Create Class Instance from String, and put into List

So I am reading lines from a CSV and trying to create class instances from those items. I am having problems getting the correct number of parameters / formatting my str from the CSV so that it recognizes them as separate objects. It says that I have only given 2 parameters (self, and one of the rows).
I tried using split() and strip(), but can't get it to work.
Any help would be appreciated. This has taken me way too long to figure out.
Here is an example of what the rows look like.
Current input:
whiskers,pugsley,Raja,Panders,Smokey
kitty,woofs,Tigger,Yin,Baluga
Current code:
import sys

class Animals:
    def__init__(self,cats,dogs,tiger,panda,bear)
        self.cats = cats
        self.dogs = dogs
        self.tiger = tiger
        self.panda = panda
        self.bear = bear

csv = open(file, 'r')
rowList = csv.readlines()
for row in rowList:
    animalList = Animals(row.split(','))  # Fails here...
    animals = []
    animals = animals.append(animalsList)  # Want to add to list
    print animals
You seem to have a couple of syntax errors; I went ahead and corrected those. Next, I decided to use the csv module to split up the rows properly. And then, when you're passing a list to a function (and you want the items parsed as separate arguments) you should use the * operator.
import sys, csv

file = "file.txt"  # added for testing purposes.

class Animals:
    def __init__(self, cats, dogs, tiger, panda, bear):
        self.cats = cats
        self.dogs = dogs
        self.tiger = tiger
        self.panda = panda
        self.bear = bear

csv = csv.reader(open(file, 'r'))
animalsList = []
for row in csv:
    animalClass = Animals(*row)  # unpack the row into constructor arguments
    animalsList.append(animalClass)  # add to list
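A quick check of the parsed objects (assuming file.txt holds the two sample rows shown in the question):

for animal in animalsList:
    print animal.cats, animal.bear
# whiskers Smokey
# kitty Baluga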
