Create column in Pandas based on dataframe query condition - python

I've got a pandas dataframe like this:
Title Pgid Views30 Title_to
===============================================================================
30 Хо_Ен_Чон 9048639 284950 Чон_Хо_Ён
98 Mail.ru_Group 9018641 153082 VK_(компания)
105 Паша_Техник 9070663 143053 Техник,_Паша
303 Freeware 6166716 79998 Бесплатное_программное_обеспечение
399 СССР 1007 69349 Союз_Советских_Социалистических_Республик
The data contains over 1.2 million entries from Wikipedia page data:
Title = page title
Pgid = page ID
Views30 = monthly page views
Title_to = title of the page that this page redirects to (or NaN if no redirect)
Now I want to make a new column Pgid_to with the page IDs of the redirect target pages for all pages with Title_to != NaN. That is, collect Pgid from Title = Title_to for all entries.
My current solution is straightforward:
def cond(title_to):
try:
# get Pgid of page whose title == title_to
return df.loc[df['Title'] == title_to, 'Pgid']
except:
# return NaN on failure to locate
return np.NaN
# make new column by applying search element-wise
df['Pgid_to'] = df['Title_to'].apply(cond)
However, this algorithm is likely to take polynomial time (N^2), which for 1.2 MM entries means 1.4 trillion operations! Is is possible to optimize? Possibly, is there a vectorized solution?

np.where() is vectorized and hopefully will save the day. Kindly try:
df['Pgid_to'] = np.where(df[Title'] == df['Title_co'], 'Pgid',np.nan)
If you want to compare it against nan:
df['Pgid_to'] = np.where(df[Title'].isna(),np.nan, 'Pgid')

OK, found a solution by simply merging the dataframe with itself!
df_merged = df.merge(df, 'left', left_on='Title',
right_on='Title_to', suffixes=(None, '_from'))
This produces a new dataframe with the 'Pgid_from' column (among others), which can then be used to group the data.

Related

How can I update my Dataframe with new columns and rows while webscraping?

I'm trying to create a webscraping tool that will update a Dataframe with data from multiple tables.
The page I'm working on has a base table in which every row has a link that directs you to a new URL that has a secondary table with the data I'm looking for.
My objective is to create a unique Dataframe comprehensive of all the data present on all secondary tables of the site.
Problem is, every secondary table can have different sets of columns from the previous one, depending on whether that secondary table has a value for that specific column or not, and I cannot know all the possibile column types
I tried multiple solutions. What I'm working on at the moment is to create a for loop that constantly create a new Dataframe out of the new tables and merge them to the previous one.
But I'm stuck on trying to merge the two Dataframes on all the columns they have in common.
Please forgive me if I made amateur mistakes, I've been using python only for a week.
#create the main DataFrame
link1= links[0]
url_linked = url_l + link1
page_linked = requests.get (url_linked)
soup_linked = BeautifulSoup(page_linked.text,'lxml')
table_linked= soup_linked.find('table', class_="XXXXX")
headers_link=[]
headers_unique=[]
for i in table_linked.find_all('th'):
title_link=i.text
title_link=map(str,title_link)
headers_link.append(title_link)
headers_unique=headers_link
mydata_link = pd.DataFrame(columns=headers_link)
count = 1
for link in links:
url_linked = url_l + link
page_linked = requests.get (url_linked)
soup_linked = BeautifulSoup(page_linked.text,'lxml')
table_linked= soup_linked.find('table', class_="table table-directory-responsive")
row2=[]
n_columns =len(table_linked.find_all('th'))
#populating the main dataframe
if count == 1:
for j in table_linked.find_all('tr'):
row_data=j.find_all('td')
row=[i.text for i in row_data]
row2.append(row)
lenght_link= len(mydata_link)
row2.remove(['']) #To get rid of empty rows that have no th
mydata_link.loc[lenght_link]=row2
print(mydata_link)
print('Completato link '+ str(count))
count= count+1
#creating the secondary Dataframe
else:
headers_test=[]
for i in table_linked.find_all('th'):
title_test=i.text
title_test=map(str,title_test)
headers_test.append(title_test)
mydata_temp=pd.DataFrame(columns=headers_test)
for j in table_linked.find_all('tr'):
row_data=j.find_all('td')
row=[i.text for i in row_data]
row2.append(row)
lenght_link= len(mydata_link)
row2.remove(['']) #To get rid of empty rows that have no th
mydata_temp.loc[lenght_link]=row2
print(mydata_temp)
#merge the two DataFrames based on the unique set of columns they both have
headers_unique= set(headers_unique).intersection(headers_test)
mydata_link=mydata_link.merge(mydata_temp, on=[headers_unique], how='outer')
print(mydata_link)
print('Completed link '+ str(count))
count= count+1
What I need is basically a function that, given these sample dataFrames:
A
B
C
1
2
3
C
A
D
E
4
5
6
7
Will return the following dataframe:
A
B
C
D
E
1
2
3
Nan
Nan
5
Nan
4
6
7
Just concatenating all the secondary tables should do - build a list of all the secondary DataFrames, and then pd.concat(dfList).
Btw, have you considered just using .read_html instead of looping through the cells?
#create the main DataFrame
link1 = links[0]
url_linked = url_l + link1
page_linked = requests.get (url_linked)
soup_linked = BeautifulSoup(page_linked.text, 'lxml')
table_linked = soup_linked.find('table', class_="XXXXX")
if table_linked:
primaryDf = pd.read_html(table_linked.prettify())[0]
headers_link = [h.get_text(' ').strip() for h in table_linked.find_all('th')]
dfList = [pd.DataFrame(columns=headers_link if headers_link else primaryDf.columns)]
else: primaryDf, dfList = None, []
count = 0
for link in links:
count += 1
url_linked = url_l + link
page_linked = requests.get (url_linked)
soup_linked = BeautifulSoup(page_linked.text, 'lxml')
table_linked = soup_linked.find('table', class_="table table-directory-responsive")
if not table_linked:
## to see if any response errors or redirects
print(f'[{page_linked.status_code} {page_linked.reason} from {page_linked.url}]')
## print error message and move to next link
print(f'Found no tables with required class at link#{count}', url_linked)
continue
tempDf = pd.read_html(table_linked.prettify())[0] ## read table as df [if found]
## get rid of empty rows and empty columns
tempDf = tempDf.dropna(axis='rows', 'how'='all').dropna(axis='columns', 'how'='all')
dfList.append(tempDf.loc[:]) ## .loc[:] to append a copy, not original (just in case)
print(f'Completed link#{count} with {len(tempDf)} rows from {url_linked}')
combinedDF = pd.concat(dfList)

Scraping OSHA website using BeautifulSoup

I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output as .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:
(1) Scraping the web site:
I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015. My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e. I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
As you can see from the URL as well as the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df.
EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
1285328.015 12/28/2017 referral 12/28/2017 06/21/2018 2
Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:
# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
case_status_template = html_soup.head.find('div', {"id" : "maincontain"},
class_ = "container").div.find('table', class_ = "table-bordered").find('strong').text
# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:
for table in html_soup.find_all('table', class_= "table-bordered"):
print(table.text)
# This gave me the output I needed (i.e. the Case Status for all five records on the page),
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.
# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.
insp_details_template = html_soup.head.find('div', {"id" : "maincontain"},
class_ = "container").div.find('table', class_ = "table-unbordered")
for div in html_soup.find_all('table', class_ = "table-unbordered"):
print(div.text)
# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:
url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
first_table = html_soup.find('table', class_ = "table-borderedu")
first_table_rows = first_table.find_all('tr')
for tr in first_table_rows:
td = tr.find_all('td')
row = [i.text for i in td]
print(row)
# Then, actually using pandas to get the data into a df and out as a .csv.
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df in dfs_osha:
print(df)
path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df[1,3] in dfs_osha:
df.to_csv(os.path.join(path,r'osha_output_table1_012320.csv'))
# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.
So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f. I was trying to work through it, and had gotten as far as coming up with this code:
for elem in all_content_raw_lxml:
wrappers = elem.find_all('div', class_ = "row-fluid")
for x in wrappers:
case_status = x.find('div', class_ = "text-center")
print(case_status)
insp_details = x.find('div', class_ = "table-responsive")
for tr in insp_details:
td = tr.find_all('td')
td_row = [i.text for i in td]
print(td_row)
violation_items = insp_details.find_next_sibling('div', class_ = "table-responsive")
for tr in violation_items:
tr = tr.find_all('tr')
tr_row = [i.text for i in tr]
print(tr_row)
print('---------------')
Unfortunately, I ran into too many bugs with this to be able to use it so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.
For this type of page you don't really need beautifulsoup; pandas is enough.
url = 'your url above'
import pandas as pd
#use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)
#Select from this list of tables only those tables you need:
incident = [] #initialize a list of inspections
for i, table in enumerate(tables): #we need to find the index position of this table in the list; more below
if table.shape[1]==5: #all relevant tables have this shape
case = [] #initialize a list of inspection items you are interested in
case.append(table.iat[1,0]) #this is the location in the table of this particular item
case.append(table.iat[1,2].split(' ')[2]) #the string in the cell needs to be cleaned up a bit...
case.append(table.iat[9,1])
case.append(table.iat[12,3])
case.append(table.iat[13,3])
case.append(tables[i+2].iat[0,1]) #this particular item is in a table which 2 positions down from the current one; this is where the index position of the current table comes handy
incident.append(case)
columns = ["inspection_id", "open_date", "inspection_type", "close_conference", "close_case", "violations_serious_initial"]
df2 = pd.DataFrame(incident,columns=columns)
df2
Output (pardon the formatting):
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
0 Nr: 1285328.015 12/28/2017 Referral 12/28/2017 06/21/2018 2
1 Nr: 1283809.015 12/18/2017 Complaint 12/18/2017 05/24/2018 5
2 Nr: 1284178.015 12/18/2017 Accident 05/17/2018 09/17/2018 1
3 Nr: 1283549.015 12/13/2017 Referral 12/13/2017 05/22/2018 3
4 Nr: 1282631.015 12/12/2017 Fat/Cat 12/12/2017 11/16/2018 1

Iterate through DIV Class using selenium on page

I'm looking to iterate through a set of rows on a page using selenium to scrape live results from the page in a quick manner. I have a code which appears to return the first row and print it, but doesn't look to be iterating through the set.
content = [browser.find_element_by_class_name('event')]
rows = [browser.find_elements_by_class_name('event__match')]
for rows in content:
goals = {}
goals['Home'] = rows.find_element_by_class_name("event__participant--home").text.strip()
goals['Away'] = rows.find_element_by_class_name("event__participant--away").text.strip()
goals['hScore'] = rows.find_element_by_class_name("event__scores").text.split("-")[1]
goals['aScore'] = rows.find_element_by_class_name("event__scores").text.split("-")[-1]
print(goals['Home'],goals['aScore'],goals['aScore'],goals['Away'])
gets me the result;
Team 1
0
0 Team 2
Which would be the expected result when it's only one match on the page - but there's 50 at the moment.
I feel like I'm missing something in my method here, it could be pretty simple and staring me in the face so apologies if that's the case!
You mistake is in for rows in content:, where content is parent div and you need rows. To iterate all through rows use code below:
rows = browser.find_elements_by_class_name('event__match')
for row in rows:
goals = {}
goals['Home'] = row.find_element_by_class_name("event__participant--home").text.strip()
goals['Away'] = row.find_element_by_class_name("event__participant--away").text.strip()
goals['hScore'] = row.find_element_by_class_name("event__scores").text.split("-")[1]
goals['aScore'] = row.find_element_by_class_name("event__scores").text.split("-")[-1]
print(goals['Home'],goals['aScore'],goals['aScore'],goals['Away'])

Match one to many columns in Pandas dataframe

I have 2 datasets in CSV file, using pandas each file is converted into 2 different dataframes.
I want to find similar companies based on their url. I'm able to find similar companies based on 1 field (Rule1), but I want to compare more efficiently as following:
Dataset 1
uuid, company_name, website
YAHOO,Yahoo,yahoo.com
CSCO,Cisco,cisco.com
APPL,Apple,
Dataset 2
company_name, company_website, support_website, privacy_website
Yahoo,,yahoo.com,yahoo.com
Google,google.com,,
Cisco,,,cisco.com
Result Dataset
company_name, company_website, support_website, privacy_website, uuid
Yahoo,,yahoo.com,yahoo.com,YAHOO
Google,google.com,,
Cisco,,,cisco.com,CSCO
Dataset1 contains ~50K records.
Dataset2 contains ~4M records.
Rules
If field website in dataset 1 is the same as field company_website in dataset 2, extract identifier.
If not match, check if field website in dataset 1 is the same as field support_website in dataset 2, extract identifier.
If not match, check if field website in dataset 1 is the same as field privacy_website in dataset 2, extract identifier.
If not match, check if field company_name in dataset 1 is the same as field company_name in dataset 2, extract identifier.
If not matches return record and identifier field (UUID) will be empty.
Here is my current function:
def MatchCompanies(
companies: pandas.Dataframe,
competitor_companies: pandas.Dataframe) -> Optional[Sequence[str]]:
"""Find Competitor companies in companies dataframe and generate a new list.
Args:
companies: A dataframe with company information from CSV file.
competitor_companies: A dataframe with Competitor information from CSV file.
Returns:
A sequence of matched companies and their UUID.
Raises:
ValueError: No companies found.
"""
if _IsEmpty(companies):
raise ValueError('No companies found')
# Clean up empty fields. Use extra space to avoid matching on empty TLD.
companies.fillna({'website': ' '}, inplace=True)
competitor_companies = competitor_companies.fillna('')
logging.info('Found: %d records.', len(competitor_companies))
# Rename column to TLD to compare matching companies.
companies.rename(columns={'website': 'tld'}, inplace=True)
logging.info('Cleaning up company name.')
companies.company_name = companies.company_name.apply(_NormalizeText)
competitor_companies.company_name = competitor_companies.company_name.apply(
_NormalizeText)
# Rename column to TLD since Competitor already contains TLD in company_website.
competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
logging.info('Extracting UUID')
merge_tld = competitor_companies.merge(
companies[['tld', 'uuid']], on='tld', how='left')
# Extracts UUID for company name matches.
competitor_companies = competitor_companies.merge(
companies[['company_name', 'uuid']], on='company_name', how='left')
# Combines dataframes.
competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
merge_tld['uuid'])
match_companies = len(
competitor_companies[competitor_companies['uuid'].notnull()])
total_companies = len(competitor_companies)
logging.info('Results found: %d out of %d', match_companies, total_companies)
competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
return competitor_companies
Looking for advise in which function to use?
Use map by Series with combine_first, but one requrement is necessary - always unique values in df1['website'] and df1['company_name']:
df1 = df1.dropna()
s1 = df1.set_index('website')['uuid']
s2 = df1.set_index('company_name')['uuid']
w1 = df2['company_website'].map(s1)
w2 = df2['support_website'].map(s1)
w3 = df2['privacy_website'].map(s1)
c = df2['company_name'].map(s2)
df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
print (df2)
company_name company_website support_website privacy_website uuid
0 Yahoo NaN yahoo.com yahoo.com YAHOO
1 Google google.com NaN NaN NaN
2 Cisco NaN NaN cisco.com CSCO
Take a look at dataframe.merge. Rename third column in A to company_website and do something like
A.merge(B, on='company_website', indicator=True)
should at least take care of the first rule.

How to reindex malformed columns retrived from pandas read_html?

I am retrieving some content from a website which has several tables with the same number of columns, with pandas read_html. When I read a single link that actually has several tables with the same number of columns, pandas effectively read all the tables as one (something like a flat/normalized table). However, I am interested in do the same for a list of links from a website (i.e. a single flat table for several links), so I tried the following:
In:
import multiprocessing
def process(url):
df_url = pd.read_html(url)
df = pd.concat(df_url, ignore_index=False)
return df_url
links = ['link1.com','link2.com','link3.com',...,'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
Nevertheless, I guess I am not specifiying corecctly to read_html() which are the columns, so I am getting this malformed list of lists:
Out:
[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.
Relation \
Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Disponibility Relation \
0 Product available NaN
2 Removed
3 Removed ]]
So my question which parameter should I move in order to get a flat pandas dataframe from the above nested list?. I tried to header=0, index_col=0, match='"columns"', none of them worked or do I need to do the flatting when I create the pandas dataframe with pd.Dataframe()?. My main objective is to have a pandas dataframe like with this columns:
form, Disponibility, Relation, drawbacks
1
2
...
n
IIUC you can do it this way:
first you want to return concatenated DF, instead of list of DFs (as read_html returns a list of DFs):
def process(url):
return pd.concat(pd.read_html(url), ignore_index=False)
and then concatenate them for all URLs:
df = pd.concat(pool.map(process, links), ignore_index=True)

Categories