I am using the following code to search and extract research documents on chemical compounds from PubMed. I am interested in the author, title of the document, abstract, etc. When I run the code I only get results for the last item on my list (see the example data in the code below). Yet when I do a manual search (i.e. one term at a time), I get results for all of them.
# example data list
import time
import pandas as pd
from pymed import PubMed

data = {'IUPACName': ['ethenyl(trimethoxy)silane', 'sodium;prop-2-enoate',
                      '2-methyloxirane;oxirane', '2-methylprop-1-ene;styrene',
                      'terephthalic acid', 'styrene']}
df = pd.DataFrame(data)
df_list = []
pubmed = PubMed(tool="PubMedSearcher", email="thomas.heiman@fda.hhs.gov")
data = []

for index, row in df.iterrows():
    ## PUT YOUR SEARCH TERM HERE ##
    search_term = row['IUPACName']
    time.sleep(3)  # because I don't want to slam them with requests
    #search_term = '3-hydroxy-2-(hydroxymethyl)-2-methylpropanoic '
    results = pubmed.query(search_term, max_results=500)
    articleList = []
    articleInfo = []

    for article in results:
        # The object can be either PubMedBookArticle or PubMedArticle;
        # convert it to a dictionary with the available function.
        articleDict = article.toDict()
        articleList.append(articleDict)

    # Generate a list of dict records holding all article details that could be
    # fetched from the PubMed API.
    for article in articleList:
        # Sometimes article['pubmed_id'] contains several IDs separated by newlines -
        # take the first pubmed_id in that list; that is the article's pubmed_id.
        pubmedId = article['pubmed_id'].partition('\n')[0]
        # Append article info to the dictionary
        try:
            articleInfo.append({u'pubmed_id': pubmedId,
                                u'title': article['title'],
                                u'keywords': article['keywords'],
                                u'journal': article['journal'],
                                u'abstract': article['abstract'],
                                u'conclusions': article['conclusions'],
                                u'methods': article['methods'],
                                u'results': article['results'],
                                u'copyrights': article['copyrights'],
                                u'doi': article['doi'],
                                u'publication_date': article['publication_date'],
                                u'authors': article['authors']})
        except KeyError as e:
            continue

    # Generate a pandas DataFrame from the list of dictionaries
    articlesPD = pd.DataFrame.from_dict(articleInfo)
    # Add the query to the first column
    articlesPD.insert(loc=0, column='Query', value=search_term)
    df_list.append(articlesPD)

data = pd.concat(df_list, axis=1)
all_export_csv = data.to_csv(r'C:\Users\Thomas.Heiman\Documents\pubmed_output\all_export_dataframe.csv', index=None, header=True)
# Print first 10 rows of dataframe
#print(all_export_csv.head(10))
Any ideas on what I am doing wrong? Thank you!
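Not necessarily the cause of the missing results, but one thing that stands out is the axis=1 in pd.concat near the end: axis=1 places the per-query frames side by side as extra columns, while axis=0 (the default) stacks them as rows. A tiny illustration with made-up frames:

import pandas as pd

a = pd.DataFrame({'Query': ['styrene'], 'title': ['paper A']})
b = pd.DataFrame({'Query': ['terephthalic acid'], 'title': ['paper B']})

print(pd.concat([a, b], ignore_index=True))  # two rows, one set of columns
print(pd.concat([a, b], axis=1))             # one row, duplicated column names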
I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve: sometimes the scraped items have two statuses and sometimes just one. The second status shifts the other values in the columns to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to make the columns merge if there are two statuses, or any other solution?
Maybe it is also an issue that I still need to separate the values into columns manually in Excel.
Here is my code
# call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)

# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' + str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)

df = pd.DataFrame(initiative_list)

# create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
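For example (a throwaway sketch with made-up label strings):

labels = ['OPEN', 'UPCOMING']
print(', '.join(labels))   # OPEN, UPCOMING
print(' | '.join(labels))  # OPEN | UPCOMING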
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and having each "row" represented as a dictionary (with the column names as the keys, so nothing gets misaligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like this: [screenshot omitted]
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields as the first three columns, followed by the status + date range columns as pairs. Finally, I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]

def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
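As a rough sketch of that direction (the selectors are borrowed from the other answer above and should be treated as assumptions about the page, not verified facts):

def extract_initiative(item):
    # one dictionary per result, with explicitly named fields instead of divs[0]..divs[5]
    def first_text(sel):
        found = item.find_elements(By.CSS_SELECTOR, sel)
        return found[0].text.strip() if found else None

    return {
        'title': first_text('div.search-result-title'),
        'status': first_text('div.field span.label'),
    }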
I'm trying to return multiple search results from Google based on two or more search terms. Example inputs:
digital economy gov.uk
digital economy gouv.fr
For about 50% of the search results I input, the script below works fine. However, for the remaining search terms, I receive:
ValueError: arrays must all be same length
Any ideas on how I can address this?
output_df1 = pd.DataFrame()

for input in inputs:
    query = input
    #query = urllib.parse.quote_plus(query)
    number_result = 20
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
    response = requests.get(google_url, {"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result_div = soup.find_all('div', attrs={'class': 'ZINbbc'})

    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Checks if each element is present, else, raise exception
        try:
            link = r.find('a', href=True)
            title = r.find('div', attrs={'class': 'vvjwJb'}).get_text()
            description = r.find('div', attrs={'class': 's3v9rd'}).get_text()
            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except:
            continue

    to_remove = []
    clean_links = []
    for i, l in enumerate(links):
        clean = re.search('\/url\?q\=(.*)\&sa', l)
        # Anything that doesn't fit the above pattern will be removed
        if clean is None:
            to_remove.append(i)
            continue
        clean_links.append(clean.group(1))

    output_dict = {
        'Search_Term': input,
        'Title': titles,
        'Description': descriptions,
        'URL': clean_links,
    }
    search_df = pd.DataFrame(output_dict, columns=output_dict.keys())

    # merging the data frames
    output_df1 = pd.concat([output_df1, search_df])
Based on this answer: Python Pandas ValueError Arrays Must be All Same Length I have also tried to use orient=index. While this does not give me the array error, it only returns one response for each search result:
a = {
    'Search_Term': input,
    'Title': titles,
    'Description': descriptions,
    'URL': clean_links,
}
search_df = pd.DataFrame.from_dict(a, orient='index')
search_df = search_df.transpose()

# merging the data frames
output_df1 = pd.concat([output_df1, search_df])
Edit: based on @Hammurabi's answer, I was able to at least pull 20 rows per input, but these appear to be duplicates. Any idea how I can get the unique results into separate rows?
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
for i in range(20):
    df_this_row = pd.DataFrame([[input, titles, descriptions, clean_links]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True)

# merging the data frames
output_df1 = pd.concat([output_df1, df])
Any thoughts on how I can address the array error so it works for all search terms, or on how to make the orient='index' method work for multiple search results? In my script I am trying to pull 20 results per search term.
Thanks for your help!
You are having trouble with columns of different lengths, maybe because sometimes you get more or fewer than 20 results per term. You can put dataframes together even if they have different lengths. I think you want to append the dataframes rather than merge them, because you have different search terms, so there is probably no matching to do between them.
I don't think you want orient='index', because in the example you post that puts whole lists into the df rather than separating the list items into different columns. Also, I don't think you want the built-in input as part of the df; it looks like you want to repeat the query for each relevant row. Maybe something is going wrong in the dictionary creation.
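One way to sidestep the unequal-length problem entirely is to build one dictionary per result inside the scraping loop and only turn the list of dictionaries into a DataFrame at the end. A rough sketch, reusing the titles, descriptions and clean_links lists your loop already builds (note that zip stops at the shortest list, so nothing mismatched is paired up, but unmatched items are silently dropped):

rows = [
    {'Search_Term': query, 'Title': t, 'Description': d, 'URL': u}
    for t, d, u in zip(titles, descriptions, clean_links)
]
search_df = pd.DataFrame(rows)
output_df1 = pd.concat([output_df1, search_df], ignore_index=True)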
You could consider appending 1 row at a time to your main dataframe, and skip the list and dictionary creation, after your line
if link != '' and title != '' and description != '':
Maybe simplifying the df creation will avoid the error. See this toy example:
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
query = 'search_term1'
for i in range(2):
    link = 'url' + str(i)
    title = 'title' + str(i)
    description = 'des' + str(i)
    df_this_row = pd.DataFrame([[query, title, description, link]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True)  # originally, every row has index 0
print(df)

#     Search_Term   Title Description   URL
# 0  search_term1  title0        des0  url0
# 1  search_term1  title1        des1  url1
Update: you mentioned that you are getting the same result 20 times. I suspect that is because you are only getting number_result = 20, and you probably want to iterate instead.
Your code fixes number_result at 20, then uses it in the url:
number_result = 20
# ...
google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
Try iterating instead:
for number_result in range(1, 21):  # if results start at 1 rather than 0
    # ...
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
I am trying to extract tables and the table names from a pdf file using camelot in python. Although I know how to extract tables (which is pretty straightforward) using camelot, I am struggling to find any help on how to extract the table name. The intention is to extract this information and show a visual of the tables and their names for a user to select relevant tables from the list.
I have tried extracting tables and then extracting text as well from pdfs. I am successful at both but not at connecting the table name to the table.
import glob
import os
import camelot
import PyPDF2

def tables_from_pdfs(filespath):
    pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
    print(pdffiles)
    dictionary = {}
    keys = []
    for file in pdffiles:
        print(file)
        n = PyPDF2.PdfFileReader(open(file, 'rb')).getNumPages()
        print(n)
        tables_dict = {}
        for i in range(n):
            tables = camelot.read_pdf(file, pages=str(i))
            tables_dict[i] = tables
        head, tail = os.path.split(file)
        tail = tail.replace(".pdf", "")
        keys.append(tail)
        dictionary[tail] = tables_dict
    return dictionary, keys
The expected result is a table and the name of the table as stated in the pdf file. For instance:
Table on page x of pdf name: Table 1. Blah Blah blah
'''Table'''
I was able to find a workable solution. It works for me, at least.
import os, PyPDF2, time, re, shutil
import pytesseract
from pdf2image import convert_from_path
import camelot
import datefinder
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarityAmt = 0.6  # find with 60% similarity

def find_table_name(dataframe, documentString):
    # Assuming the text was extracted from a PDF, it should be multi-lined. We split by line.
    stringsSeparated = documentString.split("\n")
    for i, string in enumerate(stringsSeparated):
        # Split by word
        words = string.split()
        for k, word in enumerate(words):
            # Get the keys from the dataframe as a list (it is initially extracted as a generator type)
            dfList = list(dataframe.keys())
            keys = str(dfList)
            # If the first key is a digit, we assume that the keys are from the row below the keys instead
            if keys[0].isdigit():
                keys = dataframe[dfList[0]]
            # Put all of the keys in a single string
            keysAll = ""
            for key in keys:
                keysAll += key
            # Since a row should be horizontal, we check the similarity against the text line by line.
            similarRating = similar(words, keysAll)
            if similarRating > similarityAmt:
                # The similarity rating is a ratio from 0 to 1; above the threshold, we approve of it.
                for j in range(10):
                    # Iterate up to 10 lines above until we find a line longer than
                    # 4 characters (an arbitrary number, just to skip blank lines).
                    try:
                        separatedString = stringsSeparated[i - j - 1]
                        if len(separatedString) > 4:
                            # Return the top two lines to hopefully have an accurate name
                            return stringsSeparated[i - j - 2] + separatedString
                        else:
                            continue
                    except:
                        continue
    return "Unnamed"

# Retrieve the text from the PDF
pages = convert_from_path(pdf_path, 500)  # pdf_path is the path of the PDF you extracted the table from
pdf_text = ""

# Add all page strings into a single string, so the entire PDF is one single string
for pageNum, imgBlob in enumerate(pages):
    extractedText = pytesseract.image_to_string(imgBlob, lang='eng')
    pdf_text += extractedText + "\n"

# Get the name of the table using the table itself and the pdf text
tableName = find_table_name(table.df, pdf_text)  # 'table' is a table you extracted earlier and want to name
Tables are represented by the TableList and Table classes in the camelot API, documented here:
https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.TableList
Start on that page where it says:
Lower-Lower-Level Classes
Camelot does not have a reference to the table name, just the cell data descriptions.
It does hand the data to you as pandas DataFrames, though, which may be able to carry the table name.
Combine usage of Camelot and pandas to get the table name.
Get the name of a pandas DataFrame
Appended update to the answer, taken from https://camelot-py.readthedocs.io/en/master/:
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True)  # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv')  # to_json, to_excel, to_html
>>> df_table = tables[0].df  # get a pandas DataFrame!
>>> # add
>>> df_table.name = 'name here'

# from https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'
print(df.name)
Note: the added 'name' attribute is not really part of the df; when the df is serialized, the added name attribute is lost.
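A quick sketch of that caveat (plain pandas, nothing camelot-specific; the temporary file name is just an example):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df.name = 'Ones'                        # ad-hoc attribute, lives only on this Python object
df.to_csv('tmp.csv', index=False)
round_tripped = pd.read_csv('tmp.csv')
print(hasattr(round_tripped, 'name'))   # False - the attribute did not survive the round trip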
A further addition to the answer: the 'name' attribute is actually called the 'index' in pandas.
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64
I'm attempting to get the last 5 orders from currency exchanges through their respective JSON APIs. Everything is working except that some coins have fewer than 5 orders (ask/bid), which causes errors when the table is written to Excel.
Here is what I have now:
import grequests
import json
import itertools

active_sheet("Livecoin Queries")
urls3 = [
    'https://api.livecoin.net/exchange/order_book?currencyPair=RBIES/BTC&depth=5',
    'https://api.livecoin.net/exchange/order_book?currencyPair=REE/BTC&depth=5',
]
requests = (grequests.get(u) for u in urls3)
responses = grequests.map(requests)

CellRange("B28:DJ48").clear()

def make_column(catalog_response, name):
    column = []
    catalog1 = catalog_response.json()[name]
    quantities1, rates1 = zip(*catalog1)
    for quantity, rate in zip(quantities1, rates1):
        column.append(quantity)
        column.append(rate)
    return column

bid_table = []
ask_table = []
for response in responses:
    try:
        bid_table.append(make_column(response, 'bids'))
        ask_table.append(make_column(response, 'asks'))
    except (KeyError, ValueError, AttributeError):
        continue

Cell(28, 2).table = zip(*ask_table)
Cell(39, 2).table = zip(*bid_table)
I've isolated the list of links down to just two with "REE" coin being the issue here.
I've tried:
for i in itertools.izip_longest(*bid_table):
    #Cell(28, 2).table = zip(*ask_table)
    #Cell(39, 2).table = zip(*i)
    print(i)
Which prints out nicely in the terminal:
[screenshot: itertools terminal output]
NOTE: As of right now "REE" has zero bid orders so it ends up creating an empty list:
[screenshot: empty list terminal output]
When printing to Excel I get a lot of strange output, none of which resembles what I see in the terminal. The way the information is set up in Excel requires it to be written with Cell(X, X).table.
My question is: how do I make zipping uneven lists play nice with tables in DataNitro?
EDIT1:
The problem is arising at catalog_response.json()[name]
def make_column(catalog_response, name):
    column = []
    catalog1 = catalog_response.json()[name]
    #quantities1, rates1 = list(itertools.izip_longest(*catalog1[0:5]))
    print(catalog1)
    #for quantity, rate in zip(quantities1, rates1):
    #    column.append(quantity)
    #    column.append(rate)
    #return column
Since there are zero bids, not even an empty list is created, which is why I'm unable to zip them together.
ValueError: need more than 0 values to unpack
I suggest that you build the structure myTable that you intend to write back to Excel. It should be a list of lists:
myTable = []
myRow = []
...build each myRow from your code...
If the list for myRow is too short, pad it with the proper number of None elements. In your case, if len(myRow) is 0 you need to append two None items:
myRow.append(None)
myRow.append(None)
Then add the row to the output table:
myTable.append(myRow)
When ready, you have a well-formed nn x n table to output via:
Cell(nn, n).table = myTable
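A minimal sketch of that idea applied to the code above; pad_column is a hypothetical helper added here, and the 10-cell width (five quantity/rate pairs per market) follows the original make_column, so treat both as assumptions:

def pad_column(column, length=10):
    # five (quantity, rate) pairs -> 10 cells; shorter columns get None padding
    return column + [None] * (length - len(column))

bid_table = []
for response in responses:
    catalog = response.json().get('bids') or []   # [] when that side has no orders at all
    column = []
    for quantity, rate in catalog[:5]:
        column.append(quantity)
        column.append(rate)
    bid_table.append(pad_column(column))

# every inner list now has the same length, so the transpose is rectangular
Cell(39, 2).table = [list(r) for r in zip(*bid_table)]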
I read from a file and stored the data into artists_tags with column names.
Now this file has multiple columns, and I need to generate a new data structure that has 2 columns from artists_tags as they are, plus the most frequent value from the 'Tag' column as the 3rd column value.
Here is what I have written as of now:
import pandas as pd
from collections import Counter

def parse_artists_tags(filename):
    df = pd.read_csv(filename, sep="|", names=["ArtistID", "ArtistName", "Tag", "Count"])
    return df

def parse_user_artists_matrix(filename):
    df = pd.read_csv(filename)
    return df

# artists_tags = parse_artists_tags(DATA_PATH + "\\artists-tags.txt")
artists_tags = parse_artists_tags("C:\\Users\\15-J001TX\\Documents\\ml_task\\artists-tags.txt")
#print(artists_tags)
user_art_mat = parse_user_artists_matrix("C:\\Users\\15-J001TX\\Documents\\ml_task\\userart-mat-training.csv")
#print ("Number of tags {0}".format(len(artists_tags))) # Change this line. Should be 952803
#print ("Number of artists {0}".format(len(user_art_mat))) # Change this line. Should be 17119

# TODO Implement this. You can change the function arguments if necessary
# Return a data structure that contains (artist id, artist name, top tag) for every artist
def calculate_top_tag(all_tags):
    temp = all_tags.Tag
    a = Counter(temp)
    a = a.most_common()
    print(a)
    top_tags = all_tags.ArtistID, all_tags.ArtistName, a
    return top_tags

top_tags = calculate_top_tag(artists_tags)

# Print the top tag for Nirvana
# Artist ID for Nirvana is 5b11f4ce-a62d-471e-81fc-a69a8278c7da
# Should be 'Grunge'
print("Top tag for Nirvana is {0}".format(top_tags))  # Complete this line
In the last method, calculate_top_tag, I don't understand how to choose the most frequent value from the 'Tag' column and put it as the third column of top_tags before returning it.
I am new to Python and my knowledge of syntax and data structures is limited. I did try the various solutions mentioned for finding the most frequent value from a list, but they seem to display the entire column rather than one particular value. I know this is some trivial syntax issue, but after having searched for a long time I still cannot figure it out.
edit 1 :
I need to find the most common tag for a particular artist and not the most common overall.
But again, I don't know how to.
edit 2 :
here is the link to the data files:
https://github.com/amplab/datascience-sp14/raw/master/hw2/hw2data.tar.gz
I'm sure there is a more succinct way of doing it, but this should get you started:
# group the counts by ArtistID and Tag
tag_counts = artists_tags.groupby(['ArtistID', 'Tag'])['Count']
# sum up tag counts and sort in descending order
tag_counts = tag_counts.sum().sort_values(ascending=False).reset_index()
# keep only the top-ranking tag per artist
top_tags = tag_counts.groupby('ArtistID').first()
# top_tags is now a dataframe which contains the top tag for every artist.
# We can simply look up the top tag for Nirvana via its index:
top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag']
# 'Grunge'
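For what it's worth, a more succinct variant (a sketch, not tested against the linked data files) can use idxmax to grab the highest-count row per artist in one pass:

# sum counts per (artist, tag), then pick the row with the largest total for each artist
totals = artists_tags.groupby(['ArtistID', 'ArtistName', 'Tag'], as_index=False)['Count'].sum()
top_tags = totals.loc[totals.groupby('ArtistID')['Count'].idxmax()].set_index('ArtistID')
print(top_tags.loc['5b11f4ce-a62d-471e-81fc-a69a8278c7da', 'Tag'])  # should print 'Grunge'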