pandas/python: splitting url and adding columns to a database

I have a one-column database with several URLs of the form
'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
'w.lejournal.fr/palmares/palmares-immobilier/',
'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
I want to create a three-column database whose first column contains these exact URLs, whose second column contains the main category of the page (actualite or palmares), and whose third column contains the sub-category of the page (politique, palmares-immobilier, or societe).
I can't show my code since I am not allowed to post URLs.
I want to use Python pandas.
Firstly: is this the right way to do it?
Secondly: how can I finish the concatenation?
Thank you very much.

With pure Python:
data = [
    'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
    'w.lejournal.fr/palmares/palmares-immobilier/',
    'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
]

result = []
for x in data:
    cols = x.split('/')
    result.append([x, cols[1], cols[2]])
print(result)
Output:
[
['w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html', 'actualite', 'politique'],
['w.lejournal.fr/palmares/palmares-immobilier/', 'palmares', 'palmares-immobilier'],
['w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html', 'actualite', 'societe']
]
All that is left is to read from and write to your database.
If all your URLs start with http://, then you will need cols[3] and cols[4]:
data = [
    'http://w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
    'http://w.lejournal.fr/palmares/palmares-immobilier/',
    'http://w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html'
]

result = []
for x in data:
    cols = x.split('/')
    result.append([x, cols[3], cols[4]])
print(result)
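If you do want to stay in pandas, a minimal sketch could look like the following (assuming the URLs are already in a DataFrame column named url; the column names main_category and sub_category are just placeholders, not anything from the question):

import pandas as pd

df = pd.DataFrame({'url': [
    'w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
    'w.lejournal.fr/palmares/palmares-immobilier/',
]})

# str.split('/') turns each url into a list of path segments;
# .str[1] and .str[2] pick the first and second segments after the domain
parts = df['url'].str.split('/')
df['main_category'] = parts.str[1]
df['sub_category'] = parts.str[2]
print(df)

If the URLs include the http:// prefix, use parts.str[3] and parts.str[4] instead, as above.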

No need for pandas, regex can do this quite efficiently:
import re

ts = ['w.lejournal.fr/actualite/politique/sarkozy-terminator_1557749.html',
      'w.lejournal.fr/palmares/palmares-immobilier/',
      'w.lejournal.fr/actualite/societe/adeline-hazan-devient-la-nouvelle-controleuse-des-lieux-de-privation-de-liberte_1558176.html']

# capture the two path segments that follow the domain
rgx = r'(?<=w\.lejournal\.fr/)([a-zA-Z]*)/([a-zA-Z_-]*)(?=/)'

for url_address in ts:
    found_group = re.findall(rgx, url_address)
    for item in found_group:
        print(item)
this is what it returns:
('actualite', 'politique')
('palmares', 'palmares-immobilier')
('actualite', 'societe')
Of course, in your case you would run this on the URLs read from your database rather than on a hard-coded list.

Related

no returned results from pubmed query

I am using the following code to search for and extract research documents on chemical compounds from PubMed. I am interested in the author, the name of the document, the abstract, etc. When I run the code I only get results for the last item on my list (see the example data in the code below). Yet when I do a manual search (i.e. one at a time), I get results for all of them.
import pandas as pd

#example data list
data = {'IUPACName': ['ethenyl(trimethoxy)silane', 'sodium;prop-2-enoate', '2-methyloxirane;oxirane',
                      '2-methylprop-1-ene;styrene', 'terephthalic acid', 'styrene']}
df = pd.DataFrame(data)
df_list = []

import time
from pymed import PubMed

pubmed = PubMed(tool="PubMedSearcher", email="thomas.heiman#fda.hhs.gov")
data = []

for index, row in df.iterrows():
    ## PUT YOUR SEARCH TERM HERE ##
    search_term = row['IUPACName']
    time.sleep(3)  # because I don't want to slam them with requests
    #search_term = '3-hydroxy-2-(hydroxymethyl)-2-methylpropanoic '
    results = pubmed.query(search_term, max_results=500)

articleList = []
articleInfo = []

for article in results:
    # Each result can be either a PubMedBookArticle or a PubMedArticle.
    # We need to convert it to a dictionary with the available function.
    articleDict = article.toDict()
    articleList.append(articleDict)

# Generate a list of dict records holding all article details that could be fetched from the PubMed API
for article in articleList:
    # Sometimes article['pubmed_id'] contains a comma-separated list - take the first pubmedId in that list; that's the article's pubmedId
    pubmedId = article['pubmed_id'].partition('\n')[0]
    # Append article info to the list of dictionaries
    try:
        articleInfo.append({u'pubmed_id': pubmedId,
                            u'title': article['title'],
                            u'keywords': article['keywords'],
                            u'journal': article['journal'],
                            u'abstract': article['abstract'],
                            u'conclusions': article['conclusions'],
                            u'methods': article['methods'],
                            u'results': article['results'],
                            u'copyrights': article['copyrights'],
                            u'doi': article['doi'],
                            u'publication_date': article['publication_date'],
                            u'authors': article['authors']})
    except KeyError as e:
        continue

# Generate a Pandas DataFrame from the list of dictionaries
articlesPD = pd.DataFrame.from_dict(articleInfo)
# Add the query to the first column
articlesPD.insert(loc=0, column='Query', value=search_term)
df_list.append(articlesPD)

data = pd.concat(df_list, axis=1)
all_export_csv = data.to_csv(r'C:\Users\Thomas.Heiman\Documents\pubmed_output\all_export_dataframe.csv', index=None, header=True)
#Print first 10 rows of dataframe
#print(all_export_csv.head(10))
Any ideas on what I am doing wrong? Thank you!
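Two things may be worth double-checking in the snippet above. First, everything from articleList = [] down to df_list.append(articlesPD) needs to be indented inside the for index, row in df.iterrows(): loop; if that block only runs after the loop finishes, only the last search term's results are processed, which matches the symptom described. Second, pd.concat(df_list, axis=1) places the per-term DataFrames side by side as extra columns, while stacking them as extra rows would be axis=0 (the default). A minimal illustration of the latter, with made-up frames:

import pandas as pd

a = pd.DataFrame({'Query': ['styrene'], 'title': ['paper A']})
b = pd.DataFrame({'Query': ['terephthalic acid'], 'title': ['paper B']})

print(pd.concat([a, b], axis=1))  # one row, the columns of both frames side by side
print(pd.concat([a, b], axis=0))  # two rows, one per query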

Filter list with regex multiple values from a JSON file

I am trying to get a filtered list using regex - specifically, to extract certain location codes from the results. I am able to read the results from a JSON file, but I am stuck on how to use multiple regex values to filter them.
This is how far I have got:
import json
import re

file_path = './response.json'
result = []

with open(file_path) as f:
    data = json.loads(f.read())
    for d in data:
        result.append(d['location_code'])

result = list(dict.fromkeys(result))

re_list = ['.*dk*', '.*se*', '.*fi*', '.*no*']
matches = []
for r in re_list:
    matches += re.findall(r, result)

# r = re.compile('.*denmark*', '', '', '')
# filtered_list = list(filter(r.match, result))

print(matches)
This is the output of the first pass over the JSON. I need to filter on country codes like dk, no, lv, fi, ee, etc. and keep only the entries that contain one of those specific country codes.
[
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
...
'2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]
Would appreciate any help. Thanks!
In that case, here is an approach that could work:
Set up multiple patterns, one per field.
For the first pattern you could use:
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|([^"]+)"
or
"2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|*"
or
for text:
.*?text"\s?:\s?"([\w\s]+)
for names:
.*?name"\s?:\s?"([\w\s]+)
Let me know if you are able to get it working.
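If you do want to stay with regex, one option is a single pattern that only matches the wanted codes when they appear as a dot-delimited segment (a sketch using the result list from the question, not tested against the full file):

import re

wanted = re.compile(r'\.(dk|se|fi|no)\.')
matches = [row for row in result if wanted.search(row)]
print(matches)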
This looks like a case where regex won't be the best tool; for example, .*fi.* will match sofia, which is probably not wanted, and even if we insist on periods before and after, all of the example rows contain .na. but probably shouldn't match a search for Namibia.
A better approach is probably to parse the string more carefully, using one or more of (a) the csv module (if the fields can contain quoting and escaping), (b) the split method, and/or (c) regular expressions, to retrieve the country code from each row. Once we have the country code, we can compare it explicitly.
For example, using the split method:
DATA = [
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.northern-europe.dk.na.copenhagen|chromium|87',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.nl.na.amsterdam|firefox|28',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|74',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.eastern-europe.bg.na.sofia|chromium|87',
    '2e4efc13-6a6a-45ba-a6aa-ec4eb1f4fb2b|europe.western-europe.de.na.frankfurt.amazon|chromium|87'
]

COUNTRIES = ['dk', 'se', 'fi', 'no']

def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    return country

filtered = [
    row for row in DATA
    if extract_country(row) in COUNTRIES
]

print(filtered)
or, if you prefer one-liners, you can skip the extract_country function:
filtered = [
    row for row in DATA
    if row.split('|')[1].split('.')[2] in COUNTRIES
]
Both of these split the row on | and take the second column to get the geographical area, then split the geo area on . and take the third item, which seems to be the country code. If you have documentation for your data source, you will be able to check whether this is true.
One additional check might be to verify that the extracted country code has exactly two letters, as a partial check for irregularities in the data:
import re

def extract_country(row):
    geo = row.split('|')[1]
    country = geo.split('.')[2]
    if not re.match('^[a-z]{2}$', country):
        raise ValueError(
            'Expected a two-letter country code, got "%s" in row %s'
            % (country, row)
        )
    return country

Pandas: ValueError: arrays must all be same length - when orient=index doesn't work

Grateful for your help.
I'm trying to return multiple search results from Google based on two or more search terms. Example inputs:
digital economy gov.uk
digital economy gouv.fr
For about 50% of the search terms I input, the script below works fine. However, for the remaining search terms, I receive:
ValueError: arrays must all be same length
Any ideas on how I can address this?
output_df1 = pd.DataFrame()

for input in inputs:
    query = input
    #query = urllib.parse.quote_plus(query)
    number_result = 20
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
    response = requests.get(google_url, {"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result_div = soup.find_all('div', attrs={'class': 'ZINbbc'})

    links = []
    titles = []
    descriptions = []
    for r in result_div:
        # Checks if each element is present, else, raise exception
        try:
            link = r.find('a', href=True)
            title = r.find('div', attrs={'class': 'vvjwJb'}).get_text()
            description = r.find('div', attrs={'class': 's3v9rd'}).get_text()

            # Check to make sure everything is present before appending
            if link != '' and title != '' and description != '':
                links.append(link['href'])
                titles.append(title)
                descriptions.append(description)
        # Next loop if one element is not present
        except:
            continue

    to_remove = []
    clean_links = []
    for i, l in enumerate(links):
        clean = re.search('\/url\?q\=(.*)\&sa', l)
        # Anything that doesn't fit the above pattern will be removed
        if clean is None:
            to_remove.append(i)
            continue
        clean_links.append(clean.group(1))

    output_dict = {
        'Search_Term': input,
        'Title': titles,
        'Description': descriptions,
        'URL': clean_links,
    }
    search_df = pd.DataFrame(output_dict, columns=output_dict.keys())

    #merging the data frames
    output_df1 = pd.concat([output_df1, search_df])
Based on this answer: Python Pandas ValueError Arrays Must be All Same Length, I have also tried to use orient='index'. While this does not give me the array error, it only returns one response for each search term:
a = {
    'Search_Term': input,
    'Title': titles,
    'Description': descriptions,
    'URL': clean_links,
}
search_df = pd.DataFrame.from_dict(a, orient='index')
search_df = search_df.transpose()

#merging the data frames
output_df1 = pd.concat([output_df1, search_df])
Edit: based on @Hammurabi's answer, I was able to at least pull 20 returns per input, but these appear to be duplicates. Any idea how I can put the unique returns on separate rows?
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
for i in range(20):
    df_this_row = pd.DataFrame([[input, titles, descriptions, clean_links]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True)

##merging the data frames
output_df1 = pd.concat([output_df1, df])
Any thoughts on how I can address the array error so it works for all search terms, or on how to make the orient='index' method work for multiple search results? In my script I am trying to pull 20 results per search term.
Thanks for your help!
You are having trouble with columns of different lengths, maybe because sometimes you get more or fewer than 20 results per term. You can put dataframes together even if they have different lengths. I think you want to append the dataframes, because you have different search terms, so there is probably no merging to do to consolidate matching search terms. I don't think you want orient='index', because in the example you post that puts whole lists into the df rather than separating the list items into different columns. Also, I don't think you want the built-in input as part of the df; it looks like you want to repeat the query for each relevant row. Maybe something is going wrong in the dictionary creation.
You could consider appending 1 row at a time to your main dataframe, and skip the list and dictionary creation, after your line
if link != '' and title != '' and description != '':
Maybe simplifying the df creation will avoid the error. See this toy example:
df = pd.DataFrame()
cols = ['Search_Term', 'Title', 'Description', 'URL']
query = 'search_term1'
for i in range(2):
    link = 'url' + str(i)
    title = 'title' + str(i)
    description = 'des' + str(i)
    df_this_row = pd.DataFrame([[query, title, description, link]], columns=cols)
    df = df.append(df_this_row)
df = df.reset_index(drop=True)  # originally, every row has index 0
print(df)

#     Search_Term   Title Description   URL
# 0  search_term1  title0        des0  url0
# 1  search_term1  title1        des1  url1
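Applied to the scraping loop in the question, a rough sketch (untested, reusing the question's variable names and selectors) would build one row per result right after the completeness check, so every column stays the same length by construction:

cols = ['Search_Term', 'Title', 'Description', 'URL']
output_df1 = pd.DataFrame()

for r in result_div:
    try:
        link = r.find('a', href=True)
        title = r.find('div', attrs={'class': 'vvjwJb'}).get_text()
        description = r.find('div', attrs={'class': 's3v9rd'}).get_text()
        if link != '' and title != '' and description != '':
            # same cleaning pattern as in the question, applied per link
            clean = re.search(r'/url\?q=(.*)&sa', link['href'])
            if clean is None:
                continue
            row = pd.DataFrame([[query, title, description, clean.group(1)]], columns=cols)
            output_df1 = pd.concat([output_df1, row])
    except:
        continue

output_df1 = output_df1.reset_index(drop=True)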
Update: you mentioned that you are getting the same result 20 times. I suspect that is because you are only getting number_result = 20, and you probably want to iterate instead.
Your code fixes number_result at 20, then uses it in the url:
number_result = 20
# ...
google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
Try iterating instead:
for number_result in range(1, 21): # if results start at 1 rather than 0
# ...
google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)

Python loop fails to iterate over all the values of a html table

I'm trying to fetch all the data from all the rows for one specific column of a table. The problem is that the loop only fetches the first row multiple times and never continues to the next row. Here is the relevant code.
numRows = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
numColumns = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))

print(numRows)
# Prints 139
print(numColumns)
# Prints 21

for i in range(numRows + 1):
    df = []
    value = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr['{}']/td[16]".format(i))
    df.append(value.text)
    print(df)
As is evident from the print calls, I have all the rows and columns of my table, so that part works. But when I try to iterate over all the rows for one specific column, I only get the first value. I have tried to solve this with the format() method, but that doesn't seem to help. Any idea how I can solve this?
Please try iterating through all the elements found at once instead of locating individual elements. I cannot test the code since I do not have access to the HTML file.
found_elements = driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr/td[16]")

df = []
for element in found_elements:
    df.append(element.text)
print(df)
I found the following code to work:
rader = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
#kolonner = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))

kolonneFinish = []
kolonneBib = []

for i in range(1, rader + 1):
    valueFinish = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr["+str(i)+"]/td[16]")
    kolonneFinish.append(valueFinish.text)
I have no idea what the "+str(i)+" part does, so if someone knows, please comment.
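For what it's worth, "+str(i)+" is just string concatenation: str(i) converts the loop counter to text, and the + operators splice it into the XPath, so the loop requests tr[1]/td[16], tr[2]/td[16], and so on. An f-string does the same thing and is arguably easier to read (a sketch, with the long XPath shortened here for illustration):

i = 3
xpath = "/html/body/.../table/tbody/tr[" + str(i) + "]/td[16]"   # concatenation
xpath = f"/html/body/.../table/tbody/tr[{i}]/td[16]"             # equivalent f-string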

Is it possible to scrape data from this page?

I'm having issues extracting the table from this page, and I really need this data for my paper. I came up with this code, but it gets stuck on the second row.
browser.get('https://www.eex.com/en/market-data/power/futures/french-futures#!/2018/02/01')
table = browser.find_element_by_xpath('//*[@id="content"]/div/div/div/div[1]/div/div/div')
html_table = html.fromstring(table.get_attribute('innerHTML'))
html_code = etree.tostring(html_table)
df = pd.read_html(html_code)[0]
df.drop(['Unnamed: 12', 'Unnamed: 13'], axis=1, inplace=True)
Any advice?
You can always parse the table manually.
I prefer to use BeautifulSoup since I find it much easier to work with.
from bs4 import BeautifulSoup
soup = BeautifulSoup(browser.page_source, "html.parser")
Let's parse the first table, and get the column names:
table = soup.select("table.table-horizontal")[0]
columns = [i.get_text() for i in table.find_all("th")][:-2] ## We don't want the last 2 columns
Now, let's go through the table row by row:
rs = []
for r in table.find_all("tr"):
    ds = []
    for d in r.find_all("td"):
        ds.append(d.get_text().strip())
    rs.append(ds[:-2])
You can write the same code more concisely using list comprehensions:
rs = [[d.get_text().strip() for d in r.find_all("td")][:-2] for r in table.find_all("tr")]
Next, we filter rs to remove lists with length != 12 (since we have 12 columns):
rs = [i for i in rs if len(i)==12]
Finally, we can put this into a DataFrame:
df = pd.DataFrame({k:v for k, v in zip(columns, zip(*rs))})
You can follow a similar procedure for the second table. Hope this helps!
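As a side note on the last step, building the frame directly from the list of rows is equivalent here and arguably simpler:

df = pd.DataFrame(rs, columns=columns)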
