I am using django_countries to show the countries list. Now I have a requirement where I need to show the currency according to country:
Norway - NOK, Europe & Africa (besides UK) - EUR, UK - GBP, Americas & Asia - USD.
Could this be achieved through django_countries? Or are there other Python or Django packages I could use for this?
Any other solution is welcome as well.
--------------------------- UPDATE ---------------------------
After getting a lot of solutions, the main emphasis is on this:
Norway - NOK, Europe & Africa (besides UK) - EUR, UK - GBP, Americas & Asia - USD.
--------------------------- SOLUTION ---------------------------
My solution was quite simple: when I realized that I couldn't find any ISO format or package that gives me what I want, I decided to write my own script. It is just conditional logic:
from incf.countryutils import transformations

def getCurrencyCode(self, countryCode):
    continent = transformations.cca_to_ctn(countryCode)
    # print continent
    if str(countryCode) == 'NO':
        return 'NOK'
    if str(countryCode) == 'GB':
        return 'GBP'
    if (continent == 'Europe') or (continent == 'Africa'):
        return 'EUR'
    return 'USD'
Don't know whether this is an efficient way or not; would like to hear some suggestions.
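One possible tidy-up, offered only as a sketch (same behavior, not tested beyond the rules above): collapse the conditional chain into dictionary lookups, so new country or continent rules become data rather than code:

from incf.countryutils import transformations

COUNTRY_OVERRIDES = {'NO': 'NOK', 'GB': 'GBP'}             # country-specific rules win
CONTINENT_CURRENCIES = {'Europe': 'EUR', 'Africa': 'EUR'}

def get_currency_code(country_code):
    code = str(country_code)
    if code in COUNTRY_OVERRIDES:
        return COUNTRY_OVERRIDES[code]
    continent = transformations.cca_to_ctn(code)
    # fall back to USD for the Americas, Asia and everything else
    return CONTINENT_CURRENCIES.get(continent, 'USD')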
Thanks everyone!
There are several modules out there:
pycountry:
import pycountry

country = pycountry.countries.get(name='Norway')
currency = pycountry.currencies.get(numeric=country.numeric)
print(currency.alpha_3)
print(currency.name)
prints:
NOK
Norwegian Krone
py-moneyed
import moneyed

country_name = 'France'
for currency, data in moneyed.CURRENCIES.items():
    if country_name.upper() in data.countries:
        print(currency)
        break
prints EUR
python-money
import money

country_name = 'France'
for currency, data in money.CURRENCY.items():
    if country_name.upper() in data.countries:
        print(currency)
        break
prints EUR
pycountry is regularly updated; py-moneyed looks great and has more features than python-money, and python-money is no longer maintained.
Hope that helps.
django-countries just hands you a field to couple to your model (and a static bundle with flag icons). The field can hold a two-character ISO code from the list in countries.py, which is convenient if that list is up to date (haven't checked) because it saves a lot of typing.
If you wish to create a model with verbose data, that's easily achieved, e.g.
class Country(models.Model):
    iso = CountryField()
    currency = # m2m, fk, char or int field with pre-defined
               # choices or whatever suits you

>>> obj = Country.objects.create(iso='NZ', currency='NZD')
>>> obj.iso.code
u'NZ'
>>> obj.get_iso_display()
u'New Zealand'
>>> obj.currency
u'NZD'
An example script for preloading data; the result could later be exported as a fixture, which is a nicer way of managing sample data:
from django_countries.countries import COUNTRIES

for key in dict(COUNTRIES).keys():
    Country.objects.create(iso=key)
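Once those rows exist, a command like python manage.py dumpdata yourapp.Country --indent 2 > countries.json (with yourapp replaced by your actual app label) exports them as a reusable fixture.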
I have just released country-currencies, a module that gives you a mapping of country codes to currencies.
>>> from country_currencies import get_by_country
>>> get_by_country('US')
('USD',)
>>> get_by_country('ZW')
('USD', 'ZAR', 'BWP', 'GBP', 'EUR')
I'm trying to filter out bogus locations from a column in a data frame. The column is filled with locations taken from tweets, and some of them aren't real; I'm trying to separate them from the valid locations. Below is the code I have. However, it isn't producing the right output: it only returns France. I'm hoping someone can identify what I'm doing wrong here, or suggest another approach. Let me know if I haven't explained it well enough. Also, I assign variables both outside and inside the function for testing purposes.
import pandas as pd

cn_csv = pd.read_csv("~/Downloads/cntry_list.csv")  # a list of every country with its alpha-2 and alpha-3 codes; see the link below to download the csv
country_names = cn_csv['country']
results = pd.read_csv("~/Downloads/results.csv")  # a dataframe with multiple columns, one being "Source Location"; see the edit below that displays that column's data
src_locs = results["Source Location"]
locs_to_list = list(src_locs)
new_list = [entry.split(', ') for entry in locs_to_list]

def country_name_check(input_country_list):
    cn_csv = pd.read_csv("~/Downloads/cntrylst.csv")
    country_names = cn_csv['country']
    results = pd.read_csv("~/Downloads/results.csv")
    src_locs = results["Source Location"]
    locs_to_list = list(src_locs)
    new_list = [entry.split(', ') for entry in locs_to_list]
    valid_names = []
    tobe_checked = []
    for i in new_list:
        if i in country_names.values:
            valid_names.append(i)
        else:
            tobe_checked.append(i)
    return valid_names, tobe_checked

print(country_name_check(src_locs))
EDIT 1: Adding the link for the cntry_list.csv file; I downloaded the CSV of the table data: https://worldpopulationreview.com/country-rankings/country-codes
Since I am unable to share a file on here, here is the "Source Location" column data:
Source Location
She/her
South Carolina, USA
Torino
England, UK
trying to get by
Bemidiji, MN
St. Paul, MN
Stockport, England
Liverpool, England
EH7
DLR - LAX - PDX - SEA - GEG
Barcelona
Curitiba
kent
Paris, France
Moon
Denver, CO
France
If your goal is to find and list country names, both valid and not, you may filter the initial results DataFrame:
# make a list from unique values of Source Location that match values from country_names
valid_names = list(results[results['Source Location']
                           .isin(country_names)]['Source Location']
                   .unique())
# with ~, select unique values that don't match country_names values
tobe_checked = list(results[~results['Source Location']
                            .isin(country_names)]['Source Location']
                    .unique())
Your unwanted result of only France being returned could be solved by trying that simpler approach. However, as indicated by ScottC, the problem in your code may be the file being read: inside the function you read cntrylst.csv, while outside of it you read cntry_list.csv.
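If you'd rather keep the loop-based function, note that new_list holds lists (the results of split(', ')); comparing a list against the country values only behaves as expected for single-element lists, which is likely why only France (an entry with no comma) matched. A minimal sketch of a fixed version, reusing the results and country_names from your code:

def country_name_check(src_locs, country_names):
    valid_names, tobe_checked = [], []
    for entry in src_locs.dropna():
        # check each comma-separated part of the location string
        parts = [part.strip() for part in entry.split(',')]
        if any(part in country_names.values for part in parts):
            valid_names.append(entry)
        else:
            tobe_checked.append(entry)
    return valid_names, tobe_checked

valid_names, tobe_checked = country_name_check(results["Source Location"], country_names)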
I am trying to extract data from Bloomberg.
I need data for multiple fields in different currencies, both for a "specific time" and over a "period of time".
I can't get what I need from this answer: https://github.com/alpha-xone/xbbg
Can anyone help, please?
I tried the following code, but it didn't work:
blp.bdh(
    tickers='TPXDDVD Index,SCTOGAA LN Equity,VAPEJSI ID Equity',
    flds=['PX_LAST', 'FUND_NET_ASSET_VAL', 'FUND_TOTAL_ASSETS'],
    start_date='2018-10-01', end_date='2019-11-01', FX=['JPY', 'GBp', 'USD']
)
It is best to group similar securities together when pulling in historic data. In the OP's question, 'TPXDDVD Index' is a calculated total return index, hence it will not have the same fields available as the other two, which, as their tickers suggest, are quoted funds.
Taking just the two quoted funds, SCTOGAA LN Equity and VAPEJSI ID Equity, we can determine the default currency in which each field is denominated. This being Bloomberg, naming conventions are organic and may not be obvious! The two value fields are FUND_NET_ASSET_VAL and FUND_TOTAL_ASSETS, and the default currency might be different for each.
We can use the bdp() function to pull back these currencies as follows (NB: tickers are passed in a Python list):
from xbbg import blp

tickers = ['SCTOGAA LN Equity', 'VAPEJSI ID Equity']
df = blp.bdp(tickers, ['NAV_CRNCY', 'FUND_TOTAL_ASSETS_CRNCY'])
print(df)
With the result:
nav_crncy fund_total_assets_crncy
SCTOGAA LN Equity GBp GBP
VAPEJSI ID Equity USD EUR
So for VAPEJSI, the NAV and Total Assets are denominated in different currencies. And NB: GBp is not a typo by a data entry clerk; it means GBP pence.
You can override the currency with a single currency value applied to all fields in the function call.
from xbbg import blp

fields = ['FUND_NET_ASSET_VAL', 'FUND_TOTAL_ASSETS']
df = blp.bdh('SCTOGAA LN Equity', fields,
             start_date='2018-10-01', end_date='2019-11-01')
print(df.tail(2))

df = blp.bdh('SCTOGAA LN Equity', fields,
             start_date='2018-10-01', end_date='2019-11-01', Currency='USD')
print(df.tail(2))
which returns:
SCTOGAA LN Equity
FUND_NET_ASSET_VAL FUND_TOTAL_ASSETS
2019-10-31 70.81 1755.65
2019-11-01 70.99 1756.51
SCTOGAA LN Equity
FUND_NET_ASSET_VAL FUND_TOTAL_ASSETS
2019-10-31 0.91607 2271.28527
2019-11-01 0.91840 2272.40325
The asset value, in GBP pence, has been converted to US dollars. As an aside, if you had put Currency='USd' instead, you would have got the price in US cents. You have to love case-sensitivity ...
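Putting those points together, a sketch of how the OP's original request could be grouped: one call for the two funds with a single currency override, and a separate call for the total return index, whose field set differs:

from xbbg import blp

fund_tickers = ['SCTOGAA LN Equity', 'VAPEJSI ID Equity']
fields = ['FUND_NET_ASSET_VAL', 'FUND_TOTAL_ASSETS']

# one grouped call for the two funds, everything converted to USD
funds_df = blp.bdh(fund_tickers, fields,
                   start_date='2018-10-01', end_date='2019-11-01',
                   Currency='USD')

# the index only has a price field, so pull it separately
index_df = blp.bdh('TPXDDVD Index', ['PX_LAST'],
                   start_date='2018-10-01', end_date='2019-11-01')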
I intend to merge two data frames, Chicago crime and Redfin real estate data, but the Redfin data was collected by neighborhood in Chicago, while the crime data was collected by community area. To do so, I found a neighborhood map of Chicago and mostly figured out how to assign neighborhoods to community areas. The structure of the two dataframes is a bit different, so I did a few manipulation steps. Here are the details of my attempt:
example data snippet
here is the public gist where the example data snippet can be viewed.
here is the neighborhood mapping that I collected from an online source.
my solution
here is my first mapping solution:
code_pairs_neighborhoods = [[p[0], p[1]] for p in [pair.strip().split('\t') for pair in neighborhood_Map.strip().split('\n')]]
neighborhood_name_dic = {k[0]: k[1] for k in code_pairs_neighborhoods}  # neighborhood -> community area
chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
Redfin['neighborhood'] = Redfin['Region'].map(neighborhood_name_dic)
final_df = pd.merge(chicago_crime, chicago_crime, on='neighborhood')
but this solution didn't find the correct mapping, and neighborhood becomes NaN, which is wrong.
second mapping attempt:
without using the neighborhood mapping, I intuitively came up with this solution:
chicago_crime['community_name'] = [[y.split() for y in x] for x in chicago_crime['community_name']]
Redfin['Region'] = [[j.split() for j in i] for i in Redfin['Region']]

idx, datavalue = [], []
for i, dv in enumerate(chicago_crime['community_name']):
    for d in dv:
        if d in Redfin['Region'][i]:
            if i not in idx:
                idx.append(i)
                datavalue.append(d)

chicago_crime['merge_ref'] = datavalue
Redfin['merge_ref'] = datavalue
final_df = pd.merge(chicago_crime[['community_area', 'community_name', 'merge_ref']], Redfin, on='merge_ref')
but this solution gave me errors: ValueError: Length of values does not match length of index and AttributeError: 'list' object has no attribute 'split'.
How can I make this work? Based on the neighborhood mapping, how can I get the correct mapping for both the Redfin data and the Chicago crime data, and end up with the right merged dataframe? Any thoughts? Thanks in advance.
update:
I put all of my solution, including the dataset, in this GitHub repository: all solution and data on github
Ok, here's what I found:
there is a unicode character in the first line of neighborhood_Map that you probably want to remove: Cabrini\xe2\x80\x93Green -> Cabrini Green
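A sketch of that removal, assuming neighborhood_Map is a plain string (the bytes \xe2\x80\x93 are the UTF-8 encoding of an en dash, U+2013):

# replace the en dash with a space so 'Cabrini-Green' becomes 'Cabrini Green'
neighborhood_Map = neighborhood_Map.replace(u'\u2013', ' ')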
switch the key and value in neighborhood_name_dic around, since you want to map the existing 'Rogers Park' to the neighborhood 'East Rogers Park', like so:
neighborhood_name_dic = {k[1]:k[0] for k in code_pairs_neighborhoods}
We still don't know from your code how you're reading in your Redfin data, but I presume you'll have to remove the Chicago, IL - part in the Region column somewhere before you can merge?
Update: So I think I managed to understand your code (again, please try to clean these things up a bit before posting), and I think that Redfin is equal to house_df there. So instead of the line that says:
house_df=house_df.set_index('Region',drop=False)
I would suggest creating a neighborhood column:
house_df['neighborhood'] = house_df['Region'].map(lambda x: x.split(' - ')[-1])  # split on ' - ' rather than lstrip('Chicago, IL - '), which would strip a set of characters, not a prefix
and then you can merge on:
crime_finalDF = pd.merge(chicago_crime, house_df, left_on='neighborhood', right_on='neighborhood')
To test it, try:
mask = (crime_finalDF['neighborhood'] == u'Sheridan Park')
print(crime_finalDF[['robbery', 'neighborhood', u'2018-06-01 00:00:00']][mask])
which yields:
robbery neighborhood 2018-06-01 00:00:00
0 140.0 Sheridan Park 239.0
1 122.0 Sheridan Park 239.0
2 102.0 Sheridan Park 239.0
3 113.0 Sheridan Park 239.0
4 139.0 Sheridan Park 239.0
so a successful join of both datasets (I think).
Update 2, regarding the success of the merge().
This is how I read in and cleaned up your xlsx file:
house_df = pd.read_excel("./real_eastate_data_main.xlsx")
house_df = house_df.replace({'-': None})  # assign the result back: replace() is not in-place by default
house_df.columns = house_df.columns.astype(str)
house_df = house_df[house_df['Region'] != 'Chicago, IL']
house_df = house_df[house_df['Region'] != 'Chicago, IL metro area']
house_df['neighborhood'] = house_df['Region'].str.split(' - ')  ## note the surrounding spaces
house_df['neighborhood'] = house_df['neighborhood'].map(lambda x: list(x)[-1])

chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
## Lakeview and Humboldt park are not defined in neighborhood_name_dic
# print(chicago_crime[['community_name','neighborhood']][pd.isnull(chicago_crime['neighborhood'])])
chicago_crime = chicago_crime[~pd.isnull(chicago_crime['neighborhood'])]  ## remove them
Now we turn to finding all unique neighborhoods in both dfs:
cc = sorted(chicago_crime['neighborhood'].unique())
ho = sorted(house_df['neighborhood'].unique())

print(30 * u"-" + u" chicago_crime: " + 30 * u"-")
print(len(cc), cc)
print(30 * u"-" + u" house_df: " + 30 * u"-")
print(len(ho), ho)
print(60 * "-")
# print('\n'.join(cc))

set1 = set(cc)
set2 = set(ho)
missing = list(sorted(set1 - set2))
added = list(sorted(set2 - set1))
print('These {0} are missing in house_df: {1}'.format(len(missing), missing))
print(60 * "-")
print('These {0} are only in house_df: {1}'.format(len(added), added))
Which reveals that 29 are missing in house_df (e.g. 'East Pilsen') and 132 are found only within house_df (e.g. 'Albany Park'), i.e. we can "inner join" only 46 entries.
Now you have to decide how you want to continue. Best if you first read this about the way merging works (e.g. understand the Venn diagrams posted there), and then you can improve your code accordingly! Or clean up your data manually beforehand; sometimes there isn't a fully automatic solution!
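To see exactly which rows fall into each region of those Venn diagrams, an outer merge with indicator=True is a quick diagnostic; a sketch using the frames from above:

merged = pd.merge(chicago_crime, house_df, on='neighborhood',
                  how='outer', indicator=True)
# 'left_only' / 'right_only' counts are the rows an inner join would drop
print(merged['_merge'].value_counts())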
Background
I have 2 data frames which have no common key on which I can merge them. Both dfs have a column that contains an "entity name". One df contains 8000+ entities and the other close to 2000 entities.
Sample Data:
vendor_df=
Name of Vendor City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
regulator_df =
Name of Entity Committies
LACKEY SHEET METAL Private
PRIMUS STERILIZER COMPANY LLC Private
HELGET GAS PRODUCTS INC Autonomous
ORTHOQUEST LLC Governmant
Problem Statement:
I have to fuzzy match the entities of these two columns (Name of Vendor & Name of Entity) and get a score, i.e. I need to know whether the 1st value of dataframe 1 (vendor_df) matches any of the 2000 entities of dataframe 2 (regulator_df).
StackOverflow Links I checked:
fuzzy match between 2 columns (Python)
create new column in dataframe using fuzzywuzzy
Apply fuzzy matching across a dataframe column and save results in a new column
Code
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)
regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)

compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
                                      regulator_df['Name of Entity']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

# compare.apply(metrics) -- either this works or the line below
result = compare.apply(metrics).unstack().idxmax().unstack(0)
Problems with the above code:
The code works if the two dataframes are small, but it takes forever when I give it the complete dataset. The above code is taken from the 3rd link.
Is there any solution that makes the same thing work fast, or that works with a large dataset?
UPDATE 1
Can the above code be made faster if we pass or hard-code a score, say 80, to filter the series/dataframe to only matches with a fuzzy score > 80?
The solution below is faster than what I posted, but if someone has an even faster approach, please tell:
matched_vendors = []
for row in vendor_df.index:
    # note: get_value() is deprecated in newer pandas; .at[row, col] is the replacement
    vendor_name = vendor_df.get_value(row, "Name of vendor")
    for columns in regulator_df.index:
        regulated_vendor_name = regulator_df.get_value(columns, "Name of Entity")
        matched_token = fuzz.partial_ratio(vendor_name, regulated_vendor_name)
        if matched_token > 80:
            matched_vendors.append([vendor_name, regulated_vendor_name, matched_token])
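For a dataset of this size, a vectorized scorer can avoid the Python-level double loop entirely. A sketch using rapidfuzz (a faster reimplementation of fuzzywuzzy; this assumes it is installed and reuses the column names from above):

import numpy as np
from rapidfuzz import fuzz, process

vendors = vendor_df['Name of vendor'].tolist()
entities = regulator_df['Name of Entity'].tolist()

# scores[i, j] = partial_ratio(vendors[i], entities[j]); scores below the
# cutoff are returned as 0, so they are cheap to filter out
scores = process.cdist(vendors, entities, scorer=fuzz.partial_ratio,
                       score_cutoff=80, workers=-1)

matched_vendors = [(vendors[i], entities[j], scores[i, j])
                   for i, j in zip(*np.nonzero(scores))]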
I've implemented the code in Python with parallel processing, which is much faster than serial computation. Furthermore, only where a fuzzy metric score exceeds a threshold are the computations performed in parallel. Please see the link below for the code:
https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py
Version compatibility:
pandas version: 1.1.5
numpy version: 1.19.5
fuzzywuzzy version: 1.1.0
joblib version: 0.18.0
In my case I also need to look only at scores above 80. I modified your code for my use case; hope it helps.
compare = compare.apply(metrics)
compare_80 = compare[(compare['ratio'] > 80) & (compare['token'] > 80)]
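Since compare was built from a MultiIndex of (vendor, entity) pairs, resetting the index turns the filtered scores back into explicit columns; a sketch, with column names of my own choosing:

matches = compare_80.reset_index()
matches.columns = ['vendor', 'entity', 'ratio', 'token']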
I need help with processing unstructured data of day-trading/swing-trading/investment recommendations. I have the unstructured data in the form of a CSV.
Following are 3 sample paragraphs from which data needs to be extracted:
Chandan Taparia of Anand Rathi has a buy call on Coal India Ltd. with
an intra-day target price of Rs 338 . The current market
price of Coal India Ltd. is 325.15 . Chandan Taparia recommended to
keep stop loss at Rs 318 .
Kotak Securities Limited has a buy call on Engineers India Ltd. with a
target price of Rs 335 .The current market price of Engineers India Ltd. is Rs 266.05 The analyst gave a year for Engineers
India Ltd. price to reach the defined target. Engineers India enjoys a
healthy market share in the Hydrocarbon consultancy segment. It enjoys
a prolific relationship with few of the major oil & gas companies like
HPCL, BPCL, ONGC and IOC. The company is well poised to benefit from a
recovery in the infrastructure spending in the hydrocarbon sector.
Independent analyst Kunal Bothra has a sell call on Ceat Ltd. with a
target price of Rs 1150 .The current market price of Ceat Ltd. is Rs 1199.6 The time period given by the analyst is 1-3 days
when Ceat Ltd. price can reach the defined target. Kunal Bothra
maintained stop loss at Rs 1240.
It's been a challenge extracting 4 pieces of information from the paragraphs:
each recommendation is framed differently but essentially has
Target Price
Stop Loss Price
Current Price
Duration
and not all of the information will necessarily be available in every recommendation; each recommendation will at least have a Target Price.
I was trying to use regular expressions but have not been very successful. Can anyone guide me on how to extract this information, maybe using nltk?
Code I have so far for cleaning the data:
import pandas as pd
import re

# etanalysis_final.csv has 4 columns:
# 0th column has the date/time
# 1st column has a simple hint like 'Sell Ceat Ltd. target Rs 1150 : Kunal Bothra,
#   Sell Ceat Ltd. at a price target of Rs 1150 and a stoploss at Rs 1240 from entry point';
#   not all the hints are the same, but I can rely on it for recommender, Buy or Sell, and which stock
# 4th column has the detailed recommendation
df = pd.read_csv('etanalysis_final.csv', encoding='ISO-8859-1')
df.DATE = pd.to_datetime(df.DATE)
df.dropna(inplace=True)
df['RECBY'] = df['C1'].apply(lambda x: re.split(r':|\x96', x)[-1].strip())
df['ACT'] = df['C1'].apply(lambda x: x.split()[0].strip())
df['STK'] = df['C1'].apply(lambda x: re.split(r'\.|\,|:| target| has| and|Buy|Sell| with', x)[1])
# getting the target price - not always correct
df['TGT'] = df['C4'].apply(lambda x: re.findall(r'\d+.', x)[0])
# getting the stop loss price - not always correct
df['STL'] = df['C4'].apply(lambda x: re.findall(r'\d+.\d+', x)[-1])
This is a hard question in that there are different ways in which each of the 4 pieces of information might be written. Here is a naive approach that might work, albeit one that would require verification. I'll do the example for the target, but you can extend it to the others:
CONTEXT = 6

def is_float(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def get_target_price(s):
    words = s.split()
    n = words.index('target')
    words_in_range = words[max(0, n - CONTEXT):n + CONTEXT]  # clamp to 0 so a negative start doesn't wrap around
    return float(list(filter(is_float, words_in_range))[0])  # returns the first float found in the context window
This is a simple approach to get you started, but you can add extra checks to make it safer. Things to potentially improve:
Make sure that the index before the one where the proposed float is found is Rs (see the sketch after this list).
If no float is found in the context range, expand the context.
Add user verification if there are ambiguities, i.e. more than one instance of target or more than one float in the context range, etc.
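A sketch of the first improvement, reusing is_float from above; the assumption that amounts are always preceded by the token Rs comes from the sample paragraphs:

def get_target_price(s, context=6):
    words = s.split()
    if 'target' not in words:
        return None
    n = words.index('target')
    lo = max(0, n - context)
    # only accept a float whose preceding token is 'Rs'
    for i in range(lo, min(len(words), n + context)):
        if is_float(words[i]) and i > 0 and words[i - 1] == 'Rs':
            return float(words[i])
    return None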
I got the solution:
The code here contains only the solution part of the question asked. It should be possible to greatly improve this solution using the fuzzywuzzy library.
from nltk import word_tokenize

periods = ['year', "year's", 'day', 'days', "day's", 'month', "month's", 'week', "week's", 'intra-day', 'intraday']
stop = ['target', 'current', 'stop', 'period', 'stoploss']

def isfloat(x):
    # helper assumed by extractinfo; same idea as is_float in the answer above
    try:
        float(x)
        return True
    except ValueError:
        return False

def extractinfo(row):
    if 'intra day' in row.lower():
        row = row.lower().replace('intra day', 'intra-day')
    tks = [w for w in word_tokenize(row) if any([w.lower() in stop, isfloat(w)])]
    tgt = ''
    crt = ''
    stp = ''
    prd = ''
    if 'target' in tks:
        if len(tks[tks.index('target'):tks.index('target') + 2]) == 2:
            tgt = tks[tks.index('target'):tks.index('target') + 2][-1]
    if 'current' in tks:
        if len(tks[tks.index('current'):tks.index('current') + 2]) == 2:
            crt = tks[tks.index('current'):tks.index('current') + 2][-1]
    if 'stop' in tks:
        if len(tks[tks.index('stop'):tks.index('stop') + 2]) == 2:
            stp = tks[tks.index('stop'):tks.index('stop') + 2][-1]
    prdd = set(periods).intersection(tks)
    if 'period' in tks:
        pdd = tks[tks.index('period'):tks.index('period') + 3]
        prr = set(periods).intersection(pdd)
        if len(prr) > 0:
            if len(pdd) > 2:
                prd = ' '.join(pdd[-2::1])
            elif len(pdd) == 2:
                prd = pdd[-1]
    elif len(prdd) > 0:
        prd = list(prdd)[0]
    return (crt, tgt, stp, prd)
The solution is relatively self-explanatory; otherwise please let me know.
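A possible way of applying it to the dataframe from the question (a sketch; C4 is the detailed-recommendation column named in the cleaning code above):

df[['CRT', 'TGT', 'STL', 'PRD']] = df['C4'].apply(lambda x: pd.Series(extractinfo(x)))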