Web Scrape Attempt With Selenium Yields Duplicate Entries - python

I am attempting to scrape some data off of this website.
I would greatly appreciate any assistance with this.
There are 30 entries per page and I'm currently trying to scrape information from within each of the links on each page. Here is my code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
print(driver.title)
driver.get("https://www.businesslist.com.ng/category/farming")
time.sleep(1)
select = driver.find_element_by_id("listings")
page_entries = [i.find_element_by_tag_name("a").get_attribute("href")
                for i in select.find_elements_by_tag_name("h4")]
columns = {"ESTABLISHMENT YEAR": [], "EMPLOYEES": [], "COMPANY MANAGER": [],
           "VAT REGISTRATION": [], "REGISTRATION CODE": []}
for i in page_entries:
    print(i)
    driver.get(i)
    listify_subentries = [i.text.strip().replace("\n", "") for i in
                          driver.find_elements_by_class_name("info")][:11]
Everything runs fine up to here. The problem is likely in the section below.
    for i in listify_subentries:
        for q in columns.keys():
            if q in i:
                item = i.replace(q, "")
                print(item)
                columns[q].append(item)
            else:
                columns[q].append("None given")
                print("None given")
Here's a picture of the layout for one entry. Sorry I can't yet embed images.
I'm trying to scrape some of the information under the "Working Hours" box (i.e. establishment year, company manager etc) from every business's page. You can find the exact information under the columns variable.
Because not all pages have the same amount of information under the "Working Hours" box (here is one with more details underneath it), I tried using dictionaries plus text manipulation to look up the available sub-entries and obtain the relevant information to their right. That is to say, obtain the name of the company manager, the year of establishment and so on; and if a page did not have a sub-entry, it would simply be tagged as "None given" under that sub-entry.
The idea is to collate all this information and export it to a dataframe later on. Inputting "None given" when a page is lacking a particular sub-entry allows me to preserve the integrity of the data structure so that the entries are sure to align.
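For reference, pd.DataFrame only builds a frame from a dict when every column list has the same length, which is why the padding matters; a quick illustration (the values are taken from one of the listings):
import pandas as pd
# two equal-length columns build fine; unequal lengths raise a
# ValueError about mismatched array lengths
pd.DataFrame({"EMPLOYEES": ["1-5", "None given"],
              "COMPANY MANAGER": ["Engr. Marcus Awoh", "None given"]})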
However, when I run the code the output I receive is completely off.
Here is the outer view of the columns dictionary once the code has run.
And if I click on the 'COMPANY MANAGER' section, you can see that there are multiple instances of "None given" before it gives the name of the company manager on the page. This is repeated for every other sub-entry, as you'll see if you run the code and scroll down. I'm not sure what went wrong, but it seems that the size of each list has been inflated by a factor of about 10, with extra "None given"s littered here and there. The size of each list should be 30, but it's 330.
I would greatly appreciate any assistance with this. Thank you.
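For what it's worth, the 330 lines up with the nested loop above: the else branch appends "None given" once for every info line that does not contain the key, so each key collects 11 entries per page (30 pages x 11 info lines = 330) instead of one per page. A minimal sketch of a per-page loop that appends exactly once per key (untested against the live site, and using the same find_elements_by_class_name call as the original):
for i in page_entries:
    driver.get(i)
    listify_subentries = [e.text.strip().replace("\n", "") for e in
                          driver.find_elements_by_class_name("info")][:11]
    for q in columns.keys():
        # first info line containing this label, if any
        match = next((s for s in listify_subentries if q in s), None)
        columns[q].append(match.replace(q, "").strip() if match else "None given")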

You can use the following example to iterate over all enterprises on that page and save the various info into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.businesslist.com.ng/category/farming"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
all_data = []
for a in soup.select("h4 > a"):
    u = "https://www.businesslist.com.ng" + a["href"]
    print(u)
    data = {"URL": u}
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    for info in s.select("div.info:has(.label)"):
        label = info.select_one(".label")
        label.extract()
        value = info.get_text(strip=True, separator=" ")
        data[label.get_text(strip=True)] = value
    all_data.append(data)
df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
URL Company name Address Phone Number Mobile phone Working hours Establishment year Employees Company manager Share this listing Location map Description Products & Services Listed in categories Keywords Website Registration code VAT registration E-mail Fax
0 https://www.businesslist.com.ng/company/198846/macmed-integrated-farms Macmed Integrated Farms 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria View Map 08033316905 09092245349 Monday: 8am-5pm Tuesday: 8am-5pm Wednesday: 8am-5pm Thursday: 8am-5pm Friday: 8am-5pm Saturday: 8am-5pm Sunday: 10am-4pm 2013 1-5 Engr. Marcus Awoh Show Map Expand Map 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria Get Directions Macmed Integrated Farms is into Poultry, Fish Farming (eggs, meat,day old chicks,fingerlings and grow-out) and animal Husbandry and sales of Farmlands land and facilities We also provide feasibility studies and business planning for all kind of businesses. Day old chicks WE are receiving large quantity of Day old Pullets, Broilers and cockerel in December 2016.\nInterested buyers are invited. PRICE: N 100 - N 350 Investors/ Partners We Macmed Integrated Farms a subsidiary of Macmed Cafe Limited RC (621444) are into poultry farming situated at Iponsinyi, behind (Nigerian National Petroleum Marketing Company)NNPMC at Mosimi, along... Commercial Hatchery Macmed Integrated Farms is setting up a Hatchery for chicken and other birds. We have 2 nos of fully automatic incubator imported from China with combined capacity of 1,500 eggs per setting.\nPlease book in advance.\nMarcus Awoh.\nfarm Operations Manager. PRICE: N100 - N250 Business Services Business Services / Consultants Business Services / Small Business Business Services / Small Business / Business Plans Business Services / Animal Shelters Manufacturing & Industry Manufacturing & Industry / Farming Manufacturing & Industry / Farming / Poultry Housing Suppliers Catfish Day old chicks Farming FINGERLINGS Fishery grow out and aquaculture Meat Poultry eggs spent Pol MORE +4 NaN NaN NaN NaN NaN
...
And saves data.csv (screenshot from LibreOffice):
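The question mentions 30 entries per listing page; to go beyond the first page, the same scraping loop can be wrapped in a loop over the listing pages. The "/category/farming/<n>" URL pattern below is only an assumption -- check the site's own pagination links and adjust it if it differs:
base = "https://www.businesslist.com.ng/category/farming"
# hypothetical pagination pattern -- verify against the site's "next page" links
page_urls = [base] + [f"{base}/{n}" for n in range(2, 6)]
for page_url in page_urls:
    page_soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    # ...then run the same soup.select("h4 > a") loop as above on page_soup...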

Related

How to extract values or data from a list of stored links using selenium python?

I am trying to scrape prices from a real estate website, namely this one, so I made a list of scraped links and wrote scripts to get prices from all those links. I tried googling and asking around but could not find a decent answer. I just want to get the price values from the list of links and store them in a way that can be converted into a csv file later on, with house name, location and price as headers along with the respective data. The output I am getting is: . The last list with a lot of prices is what I want. My code is as follows:
import pandas as pd
from selenium import webdriver

PATH = "C:/ProgramData/Anaconda3/scripts/chromedriver.exe"  # always keep chromedriver.exe inside scripts to save hours of debugging
driver = webdriver.Chrome(PATH)  # pretty important part
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
driver.implicitly_wait(10)
data_extract = pd.read_csv(r'F:\github projects\homie.csv')  # reading the csv file which contains 8 links
de = data_extract['Links'].tolist()  # converting the csv file to a list so that it can be iterated
data = []  # empty list to store extracted prices after scraping the links from homie.csv
for url in de[0:]:  # de has all the links I want to iterate and scrape prices from
    driver.get(url)
    prices = driver.find_elements_by_xpath("//div[@id='app']/div[1]/div[2]/div[1]/div[2]/div/p[1]")
    for price in prices:  # after finding the xpath, get the prices
        data.append(price.text)
    print(data)  # printing to the console just to check what kind of data I obtained
Any help will be appreciated. The output I am expecting is something like this: [[price of house inside link 0], [price of house inside link 1], and so on]. The links in homie.csv are as follows:
Links
https://www.nepalhomes.com/detail/bungalow-house-for-sale-at-mandikhatar
https://www.nepalhomes.com/detail/brand-new-house-for-sale-in-baluwakhani
https://www.nepalhomes.com/detail/bungalow-house-for-sale-in-bhangal-budhanilkantha
https://www.nepalhomes.com/detail/commercial-house-for-sale-in-mandikhatar
https://www.nepalhomes.com/detail/attractive-house-on-sale-in-budhanilkantha
https://www.nepalhomes.com/detail/house-on-sale-at-bafal
https://www.nepalhomes.com/detail/house-on-sale-in-madigaun-sunakothi
https://www.nepalhomes.com/detail/house-on-sale-in-chhaling-bhaktapur
There is no need to use Selenium to get the data you need. That page loads its data from an API endpoint.
The API endpoint:
https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a
You can directly make a request to that API endpoint using requests module and get your data.
This code will print all the prices.
import requests
url = 'https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a'
r = requests.get(url)
info = r.json()
for i in info['data']:
    print([i['basic']['title'], i['price']['value']])
['House on sale at Kapan near Karuna Hospital ', 15500000]
['House on sale at Banasthali', 70000000]
['Bungalow house for sale at Mandikhatar', 38000000]
['Brand new house for sale in Baluwakhani', 38000000]
['Bungalow house for sale in Bhangal, Budhanilkantha', 29000000]
['Commercial house for sale in Mandikhatar', 27500000]
['Attractive house on sale in Budhanilkantha', 55000000]
['House on sale at Bafal', 45000000]
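Since the end goal is a CSV, the same JSON response can be fed straight into pandas. This sketch only uses the title and price fields shown above; the payload may well contain more (address, location, etc.), but those field names are not verified here:
import requests
import pandas as pd

url = 'https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a'
info = requests.get(url).json()
# keep just the two fields used in the loop above
rows = [{'title': i['basic']['title'], 'price': i['price']['value']} for i in info['data']]
pd.DataFrame(rows).to_csv('prices.csv', index=False)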
I see several problems here:
I couldn't see any elements matching the text-3xl font-bold leading-none text-black class names on the https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a web page.
Even if there were such elements, for multiple class names you should use a CSS selector or XPath, so instead of
find_elements_by_class_name('text-3xl font-bold leading-none text-black')
it should be
find_elements_by_css_selector('.text-3xl.font-bold.leading-none.text-black')
find_elements method returns a list of web elements, so to get texts from these elements you have to iterate over the list and get text from each element, like following:
prices = driver.find_elements_by_css_selector('.text-3xl.font-bold.leading-none.text-black')
for price in prices:
    data.append(price.text)
UPD
With this locator it works correctly for me:
prices = driver.find_elements_by_xpath("//p[@class='text-xl leading-none text-black']/p[1]")
for price in prices:
    data.append(price.text)
Tried with the below xpath, and it retrieved the price.
price_list, nameprice_list = [], []
houses = driver.find_elements_by_xpath("//div[contains(@class,'table-list')]/a")
for house in houses:
    name = house.find_element_by_tag_name("h2").text
    address = house.find_element_by_xpath(".//p[contains(@class,'opacity-75')]").text
    price = (house.find_element_by_xpath(".//p[contains(@class,'text-xl')]/p").text).replace('Rs. ', '')
    price_list.append(price)
    nameprice_list.append((name, price))
    print("{}: {}".format(name, price))
And output:
House on sale at Kapan near Karuna Hospital: Kapan, Budhanilkantha Municipality,1,55,00,000
House on sale at Banasthali: Banasthali, Kathmandu Metropolitan City,7,00,00,000
...
[('House on sale at Kapan near Karuna Hospital', '1,55,00,000'), ('House on sale at Banasthali', '7,00,00,000'), ('Bungalow house for sale at Mandikhatar', '3,80,00,000'), ('Brand new house for sale in Baluwakhani', '3,80,00,000'), ('Bungalow house for sale in Bhangal, Budhanilkantha', '2,90,00,000'), ('Commercial house for sale in Mandikhatar', '2,75,00,000'), ('Attractive house on sale in Budhanilkantha', '5,50,00,000'), ('House on sale at Bafal', '4,50,00,000')]
['1,55,00,000', '7,00,00,000', '3,80,00,000', '3,80,00,000', '2,90,00,000', '2,75,00,000', '5,50,00,000', '4,50,00,000']
By first look, only 8 prices are visible, and if you just want to scrape them using Selenium:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
wait = WebDriverWait(driver, 20)
for price in driver.find_elements(By.XPATH, "//p[contains(@class,'leading')]/p[1]"):
    print(price.text.split('.')[1])
This will print all the prices, without the Rs. prefix.
This print statement should be outside the for loops to avoid staircase printing of output.
import pandas as pd
from selenium import webdriver

PATH = "C:/ProgramData/Anaconda3/scripts/chromedriver.exe"  # always keep chromedriver.exe inside scripts to save hours of debugging
driver = webdriver.Chrome(PATH)  # pretty important part
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
driver.implicitly_wait(10)
data_extract = pd.read_csv(r'F:\github projects\homie.csv')
de = data_extract['Links'].tolist()
data = []
for url in de[0:]:
    driver.get(url)
    prices = driver.find_elements_by_xpath("//div[@id='app']/div[1]/div[2]/div[1]/div[2]/div/p[1]")
    for price in prices:  # after finding the xpath, get the prices
        data.append(price.text)
print(data)

Index Error while webscraping news websites

I've been trying to web-scrape the titles of news articles, but I encounter an "IndexError" when running the following code. I'm facing a problem only in the last line of code.
import requests
from bs4 import BeautifulSoup
URL= 'https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation'
r1 = requests.get(URL)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html5lib')
coverpage_news = soup1.find_all('h3', class_='item-title')
coverpage_news[4].get_text()
This is the error:
IndexError Traceback (most recent call last)
<ipython-input-10-f7f1f6fab81c> in <module>
6 soup1 = BeautifulSoup(coverpage, 'html5lib')
7 coverpage_news = soup1.find_all('h3', class_='item-title')
----> 8 coverpage_news[4].get_text()
IndexError: list index out of range
Use soup1.select() to search for nested elements matching a CSS selector:
coverpage_news = soup1.select("h3 a.item-title")
This will find an a element with class="item-title" that's a descendant of an h3 element.
Try to change:
coverpage_news = soup1.find_all('h3', class_='item-title')
to
coverpage_news = soup1.find_all('h3', class_='list-txt')
Changing @Barmar's helpful answer a little bit to show all titles:
coverpage_news = soup1.select("h3 a.item-title")
for link in coverpage_news:
    print(link.text)
Output:
US Covid Infections Cross 9 Million, Record 1-Day Spike Of 94,000 Cases
Johnson & Johnson Plans To Test COVID-19 Vaccine On Youngsters Soon
Global Stock Markets Decline As Coronavirus Infection Rate Weighs
Cristiano Ronaldo Recovers From Coronavirus
Reliance's July-September Profit Falls 15% As Covid Slams Oil Business
"Likely To Know By December If We'll Have Covid Vaccine": Top US Expert
With No Local Case In A Record 200 Days, This Country Is World's Envy
Delhi Blames Pollution For Covid Spike At High-Level Health Ministry Meet
Delhi Covid Cases Above 5,000 For 3rd Straight Day, Spike In ICU Patients
2 Million Indians Returned From Abroad Under Vande Bharat Mission: Centre
Existing Lockdown Restrictions Extended In Maharashtra Till November 30
Can TB Vaccine Protect Elderly From Covid?
Is The Covid-19 Situation Worsening In Delhi?
What's The Truth Behind India's Falling Covid Numbers?
"Slight Laxity Can Lead To Spike": AIIMS Director As India Sees Drop In Covid Cases

Pandas read_html producing empty df with tuple column names

I want to retrieve the tables on the following website and store them in a pandas dataframe: https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs
However, the third table on the page returns an empty dataframe with all the table's data stored in tuples as the column headers:
Empty DataFrame
Columns: [(Service Providers, State of Colorado), (Cuban - Haitian Program, $0), (Refugee Preventive Health Program, $150,000.00), (Refugee School Impact, $450,000), (Services to Older Refugees Program, $0), (Targeted Assistance - Discretionary, $0), (Total FY, $600,000)]
Index: []
Is there a way to "flatten" the tuple headers into header + values, then append this to a dataframe made up of all four tables? My code is below -- it has worked on other similar pages but keeps breaking because of this table's formatting. Thanks!
import requests
import pandas as pd
from bs4 import BeautifulSoup

funds_df = pd.DataFrame()
url = 'https://www.acf.hhs.gov/programs/orr/resource/ffy-2011-12-state-of-colorado-orr-funded-programs'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
year = url.split('ffy-')[1].split('-orr')[0]
tables = page.content
df_list = pd.read_html(tables)
for df in df_list:
    df['URL'] = url
    df['YEAR'] = year
    funds_df = funds_df.append(df)
For this site, there's no need for beautifulsoup or requests
pandas.read_html creates a list of DataFrames, one for each <table> at the URL.
import pandas as pd
url = 'https://www.acf.hhs.gov/orr/resource/ffy-2012-13-state-of-colorado-orr-funded-programs'
# read the url
dfl = pd.read_html(url)
# see each dataframe in the list; there are 4 in this case
for i, d in enumerate(dfl):
    print(i)
    display(d)  # display works in Jupyter, otherwise use print
    print('\n')
dfl[0]
Service Providers Cash and Medical Assistance* Refugee Social Services Program Targeted Assistance Program TOTAL
0 State of Colorado $7,140,000 $1,896,854 $503,424 $9,540,278
dfl[1]
WF-CMA 2 RSS TAG-F CMA Mandatory 3 TOTAL
0 $3,309,953 $1,896,854 $503,424 $7,140,000 $9,540,278
dfl[2]
Service Providers Refugee School Impact Targeted Assistance - Discretionary Services to Older Refugees Program Refugee Preventive Health Program Cuban - Haitian Program Total
0 State of Colorado $430,000 $0 $100,000 $150,000 $0 $680,000
dfl[3]
Volag Affiliate Name Projected ORR MG Funding Director
0 CWS Ecumenical Refugee & Immigration Services $127,600 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
1 ECDC ECDC African Community Center $308,000 Jennifer Guddiche 5250 Leetsdale Drive Denver, CO 80246 303-399-4500
2 EMM Ecumenical Refugee Services $191,400 Ferdi Mevlani 1600 Downing St., Suite 400 Denver, CO 80218 303-860-0128
3 LIRS Lutheran Family Services Rocky Mountains $121,000 Floyd Preston 132 E Las Animas Colorado Springs, CO 80903 719-314-0223
4 LIRS Lutheran Family Services Rocky Mountains $365,200 James Horan 1600 Downing Street, Suite 600 Denver, CO 80218 303-980-5400
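If the tuple column headers do show up (as in the original read_html attempt), they can also be flattened back into a header row plus a single data row; a small sketch, reusing the url defined above:
import pandas as pd

def flatten_tuple_headers(df):
    # an empty frame whose columns are (header, value) tuples becomes a
    # one-row frame with proper headers; anything else passes through unchanged
    if df.empty and all(isinstance(c, tuple) and len(c) == 2 for c in df.columns):
        headers, values = zip(*df.columns)
        return pd.DataFrame([list(values)], columns=list(headers))
    return df

dfl = [flatten_tuple_headers(d) for d in pd.read_html(url)]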

How can I have python export my scraped variables as a .csv?

import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = 'https://mychesterfieldschools.com/mams/news-and-announcements/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_='col-sm-12 col-md-12')
for results in results:
    body = results.find('p')
    link = results.find('a')
    a = body.text
    b = link
    print(a)
    print(b)
The strings are weird, and I'm very new to Python. Please help! I've tried using Pandas but it didn't work for me.
Here is the desired output:
Chromebook for support for students is available as follows through July 30: Tuesdays at Thomas Dale HS from 8 – 10 am Thursdays at CTC#Hull from 2-4,
...Read Full Article
Here are some resources to keep your child engaged in Mathematics over the summer and prepare for the course they will enter in the fall. ALEKS – Your child has been using ALEKS during the school year in math class. ALEKS is an adaptive math program that provides each student with a personalized learning path […],
...Read Full Article
“Full STEAM Ahead is a conference by CodeVA dedicated to empowering young women through Science, Technology, Engineering, the Arts, and Mathematics. We inspire our students by connecting them with female role models in engaging hands-on workshops. Our speakers will share their experience as leaders in their respective industries, highlighting the importance of STEAM and the […],
...Read Full Article
As a result of the statewide closure of schools, Chesterfield County Public Schools is rescheduling opportunities for pre-kindergarten and kindergarten registration. Both in-person opportunities will be rescheduled when restrictions related to large gatherings are lifted or eased. In the meantime, there are opportunities for online registration for prospective prekindergarten and kindergarten students. The attached news release […],
...Read Full Article
I have created 2 lists to store the 2 different types of scraped data.
pandas.DataFrame() creates a dataframe object, and DataFrame.to_csv() writes the dataframe to a .csv file.
This may not be the most efficient code, but it works:
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://mychesterfieldschools.com/mams/news-and-announcements/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', class_='col-sm-12 col-md-12')

# declaring the 2 lists for storing your scraped data
text = []
a_tags = []

for results in results:
    body = results.find('p')
    link = results.find('a')
    a = body.text
    b = link
    print(a)  # prints the text (data type string)
    print(b)  # prints the tag (data type bs4.element.Tag object)
    # store the text in the text list
    text.append(a)
    # convert the tag to a string and store it in the a_tags list
    a_tags.append(str(b))

# prints the saved lists
print("text : ", text)
print("tags : ", a_tags)

# creates a pandas dataframe object from the above 2 lists
df = pd.DataFrame(
    {
        "Text": text,
        "A_tags": a_tags
    }
)

# converts to csv
df.to_csv("data.csv", index=False, encoding="utf-8")
Links to documentation :
pandas.DataFrame()
pandas.to_csv()
Output data.csv file appears in the same directory as your python script.
This is how the csv looks in Microsoft Office Excel:

Natural language processing - extracting data

I need help with processing unstructured data of day-trading/swing-trading/investment recommendations. I have the unstructured data in the form of a CSV.
Following are 3 sample paragraphs from which data needs to be extracted:
Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with
an intra-day target price of Rs 338 . The current market
price of Coal India Ltd. is 325.15 . Chandan Taparia recommended to
keep stop loss at Rs 318 .
Kotak Securities Limited has a buy call on Engineers India Ltd. with a
target price of Rs 335 .The current market price of Engineers India Ltd. is Rs 266.05 The analyst gave a year for Engineers
India Ltd. price to reach the defined target. Engineers India enjoys a
healthy market share in the Hydrocarbon consultancy segment. It enjoys
a prolific relationship with few of the major oil & gas companies like
HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a
recovery in the infrastructure spending in the hydrocarbon sector.
Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a
target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 The time period given by the analyst is 1-3 days
when Ceat Ltd. price can reach the defined target. Kunal Bothra
maintained stop loss at Rs 1240.
It's been a challenge extracting 4 pieces of information out of the paragraphs. Each recommendation is framed differently but essentially has:
Target Price
Stop Loss Price
Current Price
Duration
Not all the information will necessarily be available in every recommendation, but every recommendation will at least have a Target Price.
I was trying to use regular expressions, but was not very successful. Can anyone guide me on how to extract this information, maybe using nltk?
Code I have so far for cleaning the data:
import pandas as pd
import re
#etanalysis_final.csv has 4 columns with
#0th Column having data time
#1st Column having a simple hint like 'Sell Ceat Ltd. target Rs 1150 : Kunal Bothra,Sell Ceat Ltd. at a price target of Rs 1150 and a stoploss at Rs 1240 from entry point', not all the hints are same, I can rely on it for recommender, Buy or Sell, which stock.
#4th column has the detailed recommendation given.
df = pd.read_csv('etanalysis_final.csv',encoding='ISO-8859-1')
df.DATE = pd.to_datetime(df.DATE)
df.dropna(inplace=True)
df['RECBY'] = df['C1'].apply(lambda x: re.split(':|\x96',x)[-1].strip())
df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
df['STK'] = df['C1'].apply(lambda x: re.split('\.|\,|:| target| has| and|Buy|Sell| with',x)[1])
#Getting the target price - not always correct
df['TGT'] = df['C4'].apply(lambda x: re.findall('\d+.', x)[0])
#Getting the stop loss price - not always correct
df['STL'] = df['C4'].apply(lambda x: re.findall('\d+.\d+', x)[-1])
This is a hard question in that there are different possibilities in which each of the 4 pieces of information might be written. Here is a naive approach that might work, although it would require verification. I'll do the example for the target, but you can extend this to any of the others:
CONTEXT = 6

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def get_target_price(s):
    words = s.split()
    n = words.index('target')
    words_in_range = words[n-CONTEXT:n+CONTEXT]
    return float(list(filter(is_float, words_in_range))[0])  # returns any instance of a float
This is a simple approach to get you started but you can put extra checks to make this safer. Things to potentially improve:
Make sure that the word at the index just before the one where the proposed float is found is Rs (see the sketch after this list).
If no float is found in the context range, expand the context
Add user verification if there are ambiguities i.e. more than one instance of target or more than one float in the context range etc.
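A sketch of the first improvement, reusing CONTEXT and is_float from above and assuming the amounts are always written with a preceding "Rs" or "Rs." (as in the sample paragraphs):
def get_target_price_rs(s, context=CONTEXT):
    words = s.split()
    n = words.index('target')
    window = words[max(n - context, 0):n + context]
    # only accept a float whose preceding token is "Rs" / "Rs."
    for i, w in enumerate(window):
        if i > 0 and is_float(w) and window[i - 1].rstrip('.').lower() == 'rs':
            return float(w)
    return None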
I got the solution:
The code here contains only the solution part of the question asked. It should be possible to greatly improve this solution using the fuzzywuzzy library.
from nltk import word_tokenize

periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's", 'week', "week's", 'intra-day', 'intraday']
stop = ['target', 'current', 'stop', 'period', 'stoploss']

def isfloat(w):
    # helper not shown in the original snippet; same idea as is_float above
    try:
        float(w)
        return True
    except ValueError:
        return False

def extractinfo(row):
    if 'intra day' in row.lower():
        row = row.lower().replace('intra day', 'intra-day')
    tks = [w for w in word_tokenize(row) if any([w.lower() in stop, isfloat(w)])]
    tgt = ''
    crt = ''
    stp = ''
    prd = ''
    if 'target' in tks:
        if len(tks[tks.index('target'):tks.index('target')+2]) == 2:
            tgt = tks[tks.index('target'):tks.index('target')+2][-1]
    if 'current' in tks:
        if len(tks[tks.index('current'):tks.index('current')+2]) == 2:
            crt = tks[tks.index('current'):tks.index('current')+2][-1]
    if 'stop' in tks:
        if len(tks[tks.index('stop'):tks.index('stop')+2]) == 2:
            stp = tks[tks.index('stop'):tks.index('stop')+2][-1]
    prdd = set(periods).intersection(tks)
    if 'period' in tks:
        pdd = tks[tks.index('period'):tks.index('period')+3]
        prr = set(periods).intersection(pdd)
        if len(prr) > 0:
            if len(pdd) > 2:
                prd = ' '.join(pdd[-2::1])
            elif len(pdd) == 2:
                prd = pdd[-1]
    elif len(prdd) > 0:
        prd = list(prdd)[0]
    return (crt, tgt, stp, prd)
The solution is relatively self-explanatory; otherwise please let me know.
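One possible direction for the fuzzywuzzy improvement mentioned above (a sketch, not part of the original solution): fuzzy-match each token against the keyword list so that variants such as "stoploss", "stop-loss" or "stop loss" still map to the intended keyword.
from fuzzywuzzy import process

keywords = ['target', 'current', 'stop', 'period', 'stoploss']

def match_keyword(token, threshold=85):
    # returns the best-matching keyword, or None if nothing scores high enough
    best = process.extractOne(token.lower(), keywords)
    return best[0] if best and best[1] >= threshold else None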
