Natural language processing - extracting data - python

I need help processing unstructured day-trading/swing-trading/investment recommendations. I have the unstructured data in the form of a CSV.
Following are 3 sample paragraphs from which data needs to be extracted:
Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with
an intra-day target price of Rs 338 . The current market
price of Coal India Ltd. is 325.15 . Chandan Taparia recommended to
keep stop loss at Rs 318 .
Kotak Securities Limited has a buy call on Engineers India Ltd. with a
target price of Rs 335 .The current market price of Engineers India Ltd. is Rs 266.05 The analyst gave a year for Engineers
India Ltd. price to reach the defined target. Engineers India enjoys a
healthy market share in the Hydrocarbon consultancy segment. It enjoys
a prolific relationship with few of the major oil & gas companies like
HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a
recovery in the infrastructure spending in the hydrocarbon sector.
Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a
target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 The time period given by the analyst is 1-3 days
when Ceat Ltd. price can reach the defined target. Kunal Bothra
maintained stop loss at Rs 1240.
It has been a challenge extracting 4 pieces of information out of the paragraphs: each recommendation is framed differently but essentially has
Target Price
Stop Loss Price
Current Price
Duration
Not all the information will be available in every recommendation, but every recommendation will at least have a Target Price.
I was trying to use regular expressions without much success. Can anyone guide me on how to extract this information, maybe using nltk?
Code I've so far in cleaning the data:
import pandas as pd
import re
#etanalysis_final.csv has 4 columns with
#0th Column having date/time
#1st Column having a simple hint like 'Sell Ceat Ltd. target Rs 1150 : Kunal Bothra,Sell Ceat Ltd. at a price target of Rs 1150 and a stoploss at Rs 1240 from entry point'; not all the hints are the same, but I can rely on it for the recommender, Buy or Sell, and which stock.
#4th column has the detailed recommendation given.
df = pd.read_csv('etanalysis_final.csv', encoding='ISO-8859-1')
df.DATE = pd.to_datetime(df.DATE)
df.dropna(inplace=True)
df['RECBY'] = df['C1'].apply(lambda x: re.split(':|\x96', x)[-1].strip())
df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
df['STK'] = df['C1'].apply(lambda x: re.split(r'\.|,|:| target| has| and|Buy|Sell| with', x)[1])
#Getting the target price - not always correct
df['TGT'] = df['C4'].apply(lambda x: re.findall(r'\d+.', x)[0])
#Getting the stop loss price - not always correct
df['STL'] = df['C4'].apply(lambda x: re.findall(r'\d+.\d+', x)[-1])

This is a hard question in that there are different ways each of the 4 pieces of information might be written. Here is a naive approach that might work, though it would require verification. I'll do the example for the target, but you can extend this to any of the fields:
CONTEXT = 6

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def get_target_price(s):
    words = s.split()
    n = words.index('target')
    words_in_range = words[n-CONTEXT:n+CONTEXT]
    return float(list(filter(is_float, words_in_range))[0])  # returns the first float found
This is a simple approach to get you started, but you can add extra checks to make it safer. Things to potentially improve:
Make sure that the index before the one where the proposed float is found is Rs.
If no float is found in the context range, expand the context.
Add user verification if there are ambiguities, i.e. more than one instance of target or more than one float in the context range, etc.
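A sketch of the first two improvements, assuming the same `is_float` helper as above (the "Rs" check and the widening context window are the suggested additions, not code from the question):

```python
def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def get_target_price(s, context=6, max_context=12):
    words = s.split()
    if 'target' not in words:
        return None
    n = words.index('target')
    while context <= max_context:
        lo = max(n - context, 0)
        for i, w in enumerate(words[lo:n + context], start=lo):
            # only accept a float directly preceded by "Rs"
            if is_float(w) and i > 0 and words[i - 1].rstrip('.') == 'Rs':
                return float(w)
        context += 2  # no match: expand the context window
    return None

text = ("Kotak Securities Limited has a buy call on Engineers India Ltd. "
        "with a target price of Rs 335 . The current market price is Rs 266.05")
print(get_target_price(text))  # 335.0
```

Returning `None` instead of raising when nothing qualifies makes it easy to flag rows for manual verification later.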

I got the solution: the code here contains only the solution part of the question asked. It should be possible to greatly improve this solution using the fuzzywuzzy library.
from nltk import word_tokenize

# isfloat was not defined in the original post; it is assumed to be the
# same kind of helper as is_float in the answer above
def isfloat(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's",
           'week', "week's", 'intra-day', 'intraday']
stop = ['target', 'current', 'stop', 'period', 'stoploss']

def extractinfo(row):
    if 'intra day' in row.lower():
        row = row.lower().replace('intra day', 'intra-day')
    tks = [w for w in word_tokenize(row) if any([w.lower() in stop, isfloat(w)])]
    tgt = ''
    crt = ''
    stp = ''
    prd = ''
    if 'target' in tks:
        if len(tks[tks.index('target'):tks.index('target') + 2]) == 2:
            tgt = tks[tks.index('target'):tks.index('target') + 2][-1]
    if 'current' in tks:
        if len(tks[tks.index('current'):tks.index('current') + 2]) == 2:
            crt = tks[tks.index('current'):tks.index('current') + 2][-1]
    if 'stop' in tks:
        if len(tks[tks.index('stop'):tks.index('stop') + 2]) == 2:
            stp = tks[tks.index('stop'):tks.index('stop') + 2][-1]
    prdd = set(periods).intersection(tks)
    if 'period' in tks:
        pdd = tks[tks.index('period'):tks.index('period') + 3]
        prr = set(periods).intersection(pdd)
        if len(prr) > 0:
            if len(pdd) > 2:
                prd = ' '.join(pdd[-2::1])
            elif len(pdd) == 2:
                prd = pdd[-1]
    elif len(prdd) > 0:
        prd = list(prdd)[0]
    return (crt, tgt, stp, prd)
The solution is relatively self-explanatory - otherwise please let me know.

Related

Regex question - works in one row but does not work in another row [duplicate]

This question already has an answer here:
re.sub replaces only first two instances
I (a regex novice) am trying to replace certain key words in a dataframe. The dataframe looks like this:
df = pd.DataFrame({'A': ['awesome news this is tax and taxation. but vat rate conservative.',
                         'great news for taxidrivers. no taxation for people.',
                         'This is taxonomy country. That is fine.',
                         'terrible Taxation rates in the country. but vat rate is fine.',
                         ]})
I am trying to replace tax, taxation, vat etc. with 'MNOPQ', but words like taxonomy, taxi, conservative etc. should be left alone. I have copy-pasted from different sources and used the following function, vectorized with apply, as shown below:
def replace_specific_text(text):
    pattern = re.compile(r'( tax(?!i)(?!o)\w*|vat(?!\w))')
    text = re.sub(pattern, ' MNOPQ ', text, re.IGNORECASE)
    return text

df['A_1'] = df['A'].apply(replace_specific_text)
There are at least two issues that I am struggling with:
a) The 'vat' in the 4th row is replaced by MNOPQ as intended; but 'vat' in the first row is not. Why is this?
b) 'Taxation' in the 4th row with capital T is not replaced even though I have tried to use re.IGNORECASE
What am I getting wrong? Any suggestions will be appreciated. Thanks in advance.
Try:
import re

df = pd.DataFrame(
    {
        "A": [
            "awesome news this is tax and taxation. but vat rate conservative.",
            "great news for taxidrivers. no taxation for people.",
            "This is taxonomy country. That is fine.",
            "terrible Taxation rates in the country. but vat rate is fine.",
        ]
    }
)

words = ['tax', 'taxation', 'vat']
pat = r'(?i)\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
df['A_1'] = df['A'].str.replace(pat, 'MNOPQ', regex=True)
print(df)
Prints:
A A_1
0 awesome news this is tax and taxation. but vat rate conservative. awesome news this is MNOPQ and MNOPQ. but MNOPQ rate conservative.
1 great news for taxidrivers. no taxation for people. great news for taxidrivers. no MNOPQ for people.
2 This is taxonomy country. That is fine. This is taxonomy country. That is fine.
3 terrible Taxation rates in the country. but vat rate is fine. terrible MNOPQ rates in the country. but MNOPQ rate is fine.
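As to why the original function misbehaved: in `re.sub(pattern, ' MNOPQ ', text, re.IGNORECASE)` the flag is passed positionally into the `count` parameter (`re.IGNORECASE` has the integer value 2), so at most two substitutions are made and case is never ignored. A quick demonstration:

```python
import re

# re.IGNORECASE is an IntFlag equal to 2 ...
assert re.IGNORECASE == 2
# ... so passed positionally it becomes count=2: only two replacements happen
assert re.sub('tax', 'X', 'tax tax tax', re.IGNORECASE) == 'X X tax'
# passed as a keyword, the flag works as intended
assert re.sub('tax', 'X', 'Tax tax tax', flags=re.IGNORECASE) == 'X X X'
```

This is exactly the behaviour the linked duplicate ("re.sub replaces only first two instances") describes.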

Python Pandas problem while assigning a value to a column cell using either "loc" or "at" method inside a loop

I have a data-frame (df) with the following columns:
source company category header content published_date sentiment
0 Forbes General Electric None Is New England Baking The Books On Oil-Fired C... The rise of natural gas as the primary fuel fo... 2014-01-01 0
1 Forbes General Electric None DARPA Is Building A Vanishing Battery: This Po... Considering that batteries are typically desig... 2014-01-02 0
2 Forbes General Electric None Four High-Yielding ETFs For Growth And Income Growth & income exchange-traded funds typicall... 2014-01-02 0
3 Forbes Citigroup None Analyst Moves: BAC, DUK, PZZA This morning, Citigroup upgraded shares of Ban... 2014-01-02 0
4 WSJ JPMorgan MARKETS JPMorgan Broker Barred for Role in Insider Tra... Finra says information about merger, acquisiti... 2014-01-02 0
The expected result that I should get, after assigning the new values to the column sentiment, is the following (over 23,000 rows in total):
source company category header content published_date sentiment
0 Forbes General Electric None Is New England Baking The Books On Oil-Fired C... The rise of natural gas as the primary fuel fo... 2014-01-01 -1
1 Forbes General Electric None DARPA Is Building A Vanishing Battery: This Po... Considering that batteries are typically desig... 2014-01-02 1
2 Forbes General Electric None Four High-Yielding ETFs For Growth And Income Growth & income exchange-traded funds typicall... 2014-01-02 0
3 Forbes Citigroup None Analyst Moves: BAC, DUK, PZZA This morning, Citigroup upgraded shares of Ban... 2014-01-02 -1
4 WSJ JPMorgan MARKETS JPMorgan Broker Barred for Role in Insider Tra... Finra says information about merger, acquisiti... 2014-01-02 1
The algorithm that I'm using to update the sentiment column cell values is shown below.
Note: I verified the updated values before and after using 'at' or 'loc' inside the loop 'for e in ch_ix:'. The cell values do change, but only inside that loop.
If I try to verify by printing dfd['sentiment'], the resulting values are still the same 0s:
dfd = db_data.copy()
for index in range(len(stock_list_company_name)):
    cont = dfd.loc[dfd["company"] == stock_list_company_name[index]]
    # stock_data is another df which contains the columns
    # 'ticker, date, closed, volume, sentiment' and has more rows than dfd.
    cont2 = stock_data.loc[stock_data["ticker"] == ticker_list[index]]
    dates = cont2["date"].values
    for ix in range(len(dates)):
        if not cont.loc[cont["published_date"] == dates[ix]].empty:
            ch_ix = cont.loc[cont["published_date"] == dates[ix], "sentiment"].index
            for e in ch_ix:
                cont.at[e, "sentiment"] = cont2["sentiment"].values[ix]
print(dfd['sentiment'])  # values are still 0s
Can someone please tell me whether this is a memory issue in the loop or a chained-indexing problem? I still can't figure out why the values are not being updated.
Testing and running the code in this Google Colab => url
Inside the loop, you have this assignment:
cont = dfd.loc[dfd["company"] == stock_list_company_name[index]]
Note that this makes cont a new dataframe, not just a reference to a slice of the original dataframe. That's why when you change cont, it has no effect on dfd.
The easiest way to fix this is to assign the cont value back to the dfd slice, after you have made the changes to cont, still inside the loop:
dfd.loc[dfd["company"] == stock_list_company_name[index]] = cont
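A minimal sketch of that copy-versus-slice behaviour, with hypothetical toy data standing in for the question's dataframes:

```python
import pandas as pd

dfd = pd.DataFrame({"company": ["A", "A", "B"], "sentiment": [0, 0, 0]})

# boolean-mask indexing returns a new dataframe, not a view
cont = dfd.loc[dfd["company"] == "A"]
for e in cont.index:
    cont.at[e, "sentiment"] = 1

# the original is untouched ...
assert dfd["sentiment"].tolist() == [0, 0, 0]

# ... until the modified slice is assigned back
dfd.loc[dfd["company"] == "A"] = cont
assert dfd["sentiment"].tolist() == [1, 1, 0]
```

The assignment back aligns on the slice's index, so it must happen inside the loop, before `cont` is rebuilt for the next company.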

Web Scrape Attempt With Selenium Yields Duplicate Entries

I am attempting to scrape some data off of this website.
I would greatly appreciate any assistance with this.
There are 30 entries per page and I'm currently trying to scrape information from within each of the links on each page. Here is my code:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
print(driver.title)
driver.get("https://www.businesslist.com.ng/category/farming")
time.sleep(1)
select = driver.find_element_by_id("listings")
page_entries = [i.find_element_by_tag_name("a").get_attribute("href")
                for i in select.find_elements_by_tag_name("h4")]
columns = {"ESTABLISHMENT YEAR": [], "EMPLOYEES": [], "COMPANY MANAGER": [],
           "VAT REGISTRATION": [], "REGISTRATION CODE": []}
for i in page_entries:
    print(i)
    driver.get(i)
    listify_subentries = [i.text.strip().replace("\n", "") for i in
                          driver.find_elements_by_class_name("info")][:11]
Everything runs fine up to here. The problem is likely in the section below.
for i in listify_subentries:
    for q in columns.keys():
        if q in i:
            item = i.replace(q, "")
            print(item)
            columns[q].append(item)
        else:
            columns[q].append("None given")
            print("None given")
Here's a picture of the layout for one entry. Sorry I can't yet embed images.
I'm trying to scrape some of the information under the "Working Hours" box (i.e. establishment year, company manager etc) from every business's page. You can find the exact information under the columns variable.
Because the not all pages have the same amount of information under the "Working Hours" box (here is one with more details underneath it), I tried using dictionaries + text manipulation to look up the available sub-entries and obtain the relevant information to their right. That is to say, obtain the name of the company manager, the year of establishment and so on; and if a page did not have this, then it would simply be tagged as "None given" under the relevant sub-entry.
The idea is to collate all this information and export it to a dataframe later on. Inputting "None given" when a page is lacking a particular sub-entry allows me to preserve the integrity of the data structure so that the entries are sure to align.
However, when I run the code the output I receive is completely off.
Here is the outer view of the columns dictionary once the code has run.
And if I click on the 'COMPANY MANAGER' section, you can see that there are multiple instances of it saying "None given" before it gives the name of the company manager on the page. This is repeated for every other sub-entry, as you'll see if you run the code and scroll down. I'm not sure what went wrong, but it seems that the size of the list has been inflated by a factor of 11, with extra "None given"s littered here and there. The size of each list should be 30, but now it's 330.
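The inflation comes from the nested loop: each `columns[q]` list gets one append per sub-entry rather than one per page (11 sub-entries × 30 pages = 330). A hedged sketch of a fix that inverts the loops so each key is appended to exactly once per page (`collect_page` is a hypothetical helper name; the variables follow the question):

```python
columns = {"ESTABLISHMENT YEAR": [], "EMPLOYEES": [], "COMPANY MANAGER": [],
           "VAT REGISTRATION": [], "REGISTRATION CODE": []}

def collect_page(listify_subentries, columns):
    # exactly one append per key per page: first matching sub-entry or "None given"
    for q in columns:
        match = next((s for s in listify_subentries if q in s), None)
        columns[q].append(match.replace(q, "").strip() if match else "None given")

# toy sub-entries standing in for one scraped page
collect_page(["ESTABLISHMENT YEAR 2013", "COMPANY MANAGER Engr. Marcus Awoh"], columns)
print(columns["ESTABLISHMENT YEAR"])  # ['2013']
print(columns["EMPLOYEES"])  # ['None given']
```

With this shape every list grows by exactly one entry per page, so the columns stay aligned for the eventual dataframe.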
I would greatly appreciate any assistance with this. Thank you.
You can use the following example to iterate over all enterprises on that page and save the various info into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.businesslist.com.ng/category/farming"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for a in soup.select("h4 > a"):
    u = "https://www.businesslist.com.ng" + a["href"]
    print(u)
    data = {"URL": u}
    s = BeautifulSoup(requests.get(u).content, "html.parser")
    for info in s.select("div.info:has(.label)"):
        label = info.select_one(".label")
        label.extract()
        value = info.get_text(strip=True, separator=" ")
        data[label.get_text(strip=True)] = value
    all_data.append(data)

df = pd.DataFrame(all_data)
print(df)
df.to_csv("data.csv", index=None)
Prints:
URL Company name Address Phone Number Mobile phone Working hours Establishment year Employees Company manager Share this listing Location map Description Products & Services Listed in categories Keywords Website Registration code VAT registration E-mail Fax
0 https://www.businesslist.com.ng/company/198846/macmed-integrated-farms Macmed Integrated Farms 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria View Map 08033316905 09092245349 Monday: 8am-5pm Tuesday: 8am-5pm Wednesday: 8am-5pm Thursday: 8am-5pm Friday: 8am-5pm Saturday: 8am-5pm Sunday: 10am-4pm 2013 1-5 Engr. Marcus Awoh Show Map Expand Map 1, Gani Street, Ijegun Imore. Satellite Town. Lagos , Lagos , Lagos , Nigeria Get Directions Macmed Integrated Farms is into Poultry, Fish Farming (eggs, meat,day old chicks,fingerlings and grow-out) and animal Husbandry and sales of Farmlands land and facilities We also provide feasibility studies and business planning for all kind of businesses. Day old chicks WE are receiving large quantity of Day old Pullets, Broilers and cockerel in December 2016.\nInterested buyers are invited. PRICE: N 100 - N 350 Investors/ Partners We Macmed Integrated Farms a subsidiary of Macmed Cafe Limited RC (621444) are into poultry farming situated at Iponsinyi, behind (Nigerian National Petroleum Marketing Company)NNPMC at Mosimi, along... Commercial Hatchery Macmed Integrated Farms is setting up a Hatchery for chicken and other birds. We have 2 nos of fully automatic incubator imported from China with combined capacity of 1,500 eggs per setting.\nPlease book in advance.\nMarcus Awoh.\nfarm Operations Manager. PRICE: N100 - N250 Business Services Business Services / Consultants Business Services / Small Business Business Services / Small Business / Business Plans Business Services / Animal Shelters Manufacturing & Industry Manufacturing & Industry / Farming Manufacturing & Industry / Farming / Poultry Housing Suppliers Catfish Day old chicks Farming FINGERLINGS Fishery grow out and aquaculture Meat Poultry eggs spent Pol MORE +4 NaN NaN NaN NaN NaN
...
And saves data.csv.

Downloading key ratios for various Tickers with python library FundamentalAnalysis

I am trying to download key financial ratios from Yahoo Finance via the FundamentalAnalysis library. It's pretty easy for a single ticker. I have a df with tickers and names:
Ticker Company
0 A Agilent Technologies Inc.
1 AA ALCOA CORPORATION
2 AAC AAC Holdings Inc
3 AAL AMERICAN AIRLINES GROUP INC
4 AAME Atlantic American Corp.
I then tried to use a for-loop to download the ratios for every ticker with fa.ratios().
for i in range(3):
    i = 0
    i = i + 1
    Ratios = fa.ratios(tickers["Ticker"][i])
So basically it should download all ratios for one ticker, then the second, and so on. I also tried to change the df into a list, but that didn't work either. If I put them in a list manually like:
Symbol = ["TSLA" , "AAPL" , "MSFT"]
it works somehow. But as I want to work with data from 1000+ tickers, I don't want to type all of them into a list manually.
Maybe this question has already been answered elsewhere, in that case sorry, but I've not been able to find a thread that helps me. Any ideas?
You can get symbols using
symbols = df['Ticker'].to_list()
and then you could use a for-loop without range():
ratios = dict()
for s in symbols:
    ratios[s] = fa.ratios(s)
print(ratios)
Because some symbols may not give ratios, you should use try/except.
Minimal working example. I use io.StringIO only to simulate a file.
import FundamentalAnalysis as fa
import pandas as pd
import io

text = '''Ticker  Company
A  Agilent Technologies Inc.
AA  ALCOA CORPORATION
AAC  AAC Holdings Inc
AAL  AMERICAN AIRLINES GROUP INC
AAME  Atlantic American Corp.'''

df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}')
symbols = df['Ticker'].to_list()
#symbols = ["TSLA", "AAPL", "MSFT"]
print(symbols)

ratios = dict()
for s in symbols:
    try:
        ratios[s] = fa.ratios(s)
    except Exception as ex:
        print(s, ex)

for s, ratio in ratios.items():
    print(s, ratio)
EDIT: it seems fa.ratios() returns DataFrames, and if you keep them in a list then you can concatenate all the DataFrames into one DataFrame:
ratios = list()  # list instead of dictionary
for s in symbols:
    try:
        ratios.append(fa.ratios(s))  # append to list
    except Exception as ex:
        print(s, ex)

df = pd.concat(ratios, axis=1)  # convert list of DataFrames to one DataFrame
print(df.columns)
print(df)
Doc: pandas.concat()

django countries currency code

I am using django_countries to show the countries list. Now I have a requirement where I need to show the currency according to the country:
Norway - NOK, Europe & Africa (besides the UK) - EUR, UK - GBP, Americas & Asia - USD.
Could this be achieved through the django_countries project? Or are there any other packages in Python or Django which I could use for this?
Any other solution is welcome as well.
--------------------------- UPDATE -------------
After getting a lot of solutions, the main emphasis is on this:
Norway - NOK, Europe & Africa (besides the UK) - EUR, UK - GBP, Americas & Asia - USD.
---------------------------- SOLUTION --------------------------------
My solution was quite simple: when I realized that I couldn't find an ISO format or a package to get what I want, I decided to write my own script. It is just conditional logic:
from incf.countryutils import transformations
def getCurrencyCode(self, countryCode):
continent = transformations.cca_to_ctn(countryCode)
# print continent
if str(countryCode) == 'NO':
return 'NOK'
if str(countryCode) == 'GB':
return 'GBP'
if (continent == 'Europe') or (continent == 'Africa'):
return 'EUR'
return 'USD'
I don't know whether this is an efficient way or not; I would like to hear some suggestions.
Thanks everyone!
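The branching above can be sketched without the incf dependency; here `CONTINENT` is a hypothetical stand-in for `transformations.cca_to_ctn`, which makes the precedence (country overrides first, then continent rules, then the USD fallback) easy to test:

```python
# hypothetical minimal lookup standing in for transformations.cca_to_ctn
CONTINENT = {'NO': 'Europe', 'GB': 'Europe', 'DE': 'Europe',
             'NG': 'Africa', 'US': 'North America', 'JP': 'Asia'}

def get_currency_code(country_code):
    if country_code == 'NO':
        return 'NOK'
    if country_code == 'GB':
        return 'GBP'
    if CONTINENT.get(country_code) in ('Europe', 'Africa'):
        return 'EUR'
    return 'USD'  # Americas, Asia and anything unmapped

print(get_currency_code('NO'), get_currency_code('DE'), get_currency_code('US'))
# NOK EUR USD
```

Note the country-specific checks must come before the continent check, otherwise NO and GB would fall into the EUR branch.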
There are several modules out there:
pycountry:
import pycountry
country = pycountry.countries.get(name='Norway')
currency = pycountry.currencies.get(numeric=country.numeric)
print currency.alpha_3
print currency.name
prints:
NOK
Norwegian Krone
py-moneyed
import moneyed
country_name = 'France'
for currency, data in moneyed.CURRENCIES.iteritems():
    if country_name.upper() in data.countries:
        print currency
        break
prints EUR
python-money
import money
country_name = 'France'
for currency, data in money.CURRENCY.iteritems():
    if country_name.upper() in data.countries:
        print currency
        break
prints EUR
pycountry is regularly updated, py-moneyed looks great and has more features than python-money, plus python-money is not maintained now.
Hope that helps.
django-countries just hands you a field to couple to your model (and a static bundle with flag icons). The field can hold a 2 character ISO from the list in countries.py which is convenient if this list is up-to-date (haven't checked) because it saves a lot of typing.
If you wish to create a model with verbose data that's easily achieved, e.g.
class Country(models.Model):
    iso = CountryField()
    currency = ...  # m2m, fk, char or int field with pre-defined
                    # choices or whatever suits you
>> obj = Country.objects.create(iso='NZ', currency='NZD')
>> obj.iso.code
u'NZ'
>> obj.get_iso_display()
u'New Zealand'
>> obj.currency
u'NZD'
An example script for preloading data, which could later be exported to create a fixture, a nicer way of managing sample data:
from django_countries.countries import COUNTRIES

for key in dict(COUNTRIES).keys():
    Country.objects.create(iso=key)
I have just released country-currencies, a module that gives you a mapping of country codes to currencies.
>>> from country_currencies import get_by_country
>>> get_by_country('US')
('USD',)
>>> get_by_country('ZW')
('USD', 'ZAR', 'BWP', 'GBP', 'EUR')
