I want to use the census API to pull employment data that is identical to the CB1100A11 table (Screenshot attached). Each row of this table represents a different 2-digit NAICS sector. Although structuring this table is another task entirely, it appears that I am unable to get API data when I include additional variables.
I have had success with each of the example urls the Census Bureau provides, but I have not had any success with my own. I have included a code snippet below, minus my key, to show what this looks like. I am using Python 3 in Jupyter Notebooks and BS4 from BeautifulSoup.
I have already consulted the API users documentation and variable list without success.
import requests
example_vars = 'NAICS2007_TTL,GEO_TTL,EMP,LFO_TTL,ESTAB,PAYANN'
my_vars = 'NAICS2007,NAICS2007_TTL,GEO_TTL,EMP,LFO_TTL,ESTAB,PAYANN'
county_fips = '027'
state_fips = '42'
key = 'str'  # API key removed
url = 'https://api.census.gov/data/2011/cbp?get='+my_vars+'&for=county:'+county_fips+'&in=state:'+state_fips+'&key='+key
res = requests.get(url)
res.status_code
When I add additional variables like NAICS2007 I receive a status code 400, but when I use the example variables I get a 200. The common denominator seems to be my code. Can anyone help?
image of the CB1100A11 table
This should be moved to comments (I can't comment bc of rep) but as someone who's worked closely with the US Census API I highly recommend using the Census library:
https://github.com/datamade/census
One of my queries looks like this (where acs1dp is the database I am querying):
from census import Census
conn = Census("MY API KEY")
name = 'NAME'
agriculture = 'DP03_0033PE'
laborForce = 'DP03_0003PE'
travelTime = 'DP03_0025E'
highSchool = 'DP02_0066PE'
unemployed = 'DP03_0009PE'
poverty = 'DP03_0128PE'
payload = conn.acs1dp.get((name, travelTime, agriculture, poverty,
unemployed, laborForce, highSchool), {'for': 'state:*'})
which returns each of those column values for all of the states.
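If it helps, the result can go straight into pandas. This is only a minimal sketch: it assumes the payload is a list of dicts, one per state, keyed by variable name. The sample rows and their values below are made up for illustration and are not real Census figures.

```python
import pandas as pd

# Hypothetical stand-in for the payload of conn.acs1dp.get(...).
# Assumed shape: a list of dicts, one per state. All values are invented.
payload = [
    {"NAME": "State A", "DP03_0025E": "24.5", "DP03_0009PE": "5.1", "state": "01"},
    {"NAME": "State B", "DP03_0025E": "19.0", "DP03_0009PE": "6.3", "state": "02"},
]

df = pd.DataFrame(payload)

# Convert the measure columns to numbers; the label and FIPS code stay as text.
measure_cols = [c for c in df.columns if c not in ("NAME", "state")]
df[measure_cols] = df[measure_cols].apply(pd.to_numeric, errors="coerce")

print(df.sort_values("DP03_0009PE", ascending=False))
```

Keeping the state code as a string preserves its leading zero.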
I'm starting to analyse data from statistical institutions such as Eurostat using Python, and therefore pandas. I found two ways to get data from Eurostat:
pandas_datareader: it seems very easy to use, but I ran into problems getting some specific data
pandasdmx: it seems more complicated but promising, although the documentation is poor
I use a free Azure notebook (an online service), but I don't think that complicates my situation.
Let me explain the problems with pandas_datareader. The API section of the pandas documentation briefly documents this package, and the example shown there works. Problems arise with other tables. For example, I can get data on European house prices, whose table ID is prc_hpi_a, with this simple code:
import pandas_datareader.data as web
import datetime
df = web.DataReader('prc_hpi_a', 'eurostat')
But the table has three types of data about dwellings: TOTAL, EXISTING and NEW. I only got existing dwellings and I don't know how to get the others. Is there a way to do this kind of filtering?
The second route is pandasdmx, which is more complicated. My idea is to load all the data into a pandas DataFrame and then analyse it as I want. Easy to say, but I have not found many tutorials that explain this step: loading data into pandas structures. For example, I found this tutorial, but I'm stuck at the first step, which is instantiating a client:
import pandasdmx
from pandasdmx import client
#estat=client('Eurostat', 'milk.db')
and it returns:
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
in ()
      1 import pandasdmx
----> 2 from pandasdmx import client
      3 estat=client('Eurostat', 'milk.db')
ImportError: cannot import name 'client'
What's the problem here? I've looked around but found no answer to it.
I also followed this tutorial:
from pandasdmx import Request
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_une_rt_a').write()
metadata.codelist.iloc[8:18]
resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})
data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')
data.columns.names
data.columns.levels
data.loc[:, ('PC_ACT', 'TOTAL', 'T')]
I got the data, but my goal is to load it into a pandas structure (Series, DataFrame, etc.) so I can handle it easily in my work. How do I do that?
Actually, I managed it with this line (placed below the previous ones):
s=pd.DataFrame(data)
But it doesn't work when I try other data tables. Here is another example, using the Harmonised Index of Consumer Prices table:
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_prc_hicp_midx').write()
resp = estat.data('prc_hicp_midx')
data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')
It returns an error here, that is:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
in ()
      2 metadata = estat.datastructure('DSD_prc_hicp_midx').write()
      3 resp = estat.data('prc_hicp_midx')
----> 4 data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')
      5 #metadata.codelist
      6 #data.loc[:, ('TOTAL', 'INX_Q','EA', 'Q')]
~/anaconda3_501/lib/python3.6/site-packages/pandasdmx/api.py in __getattr__(self, name)
    622         Make Message attributes directly readable from Response instance
    623         '''
--> 624         return getattr(self.msg, name)
    625
    626     def _init_writer(self, writer):
AttributeError: 'DataMessage' object has no attribute 'data'
Why doesn't it get the data now? What's wrong?
I lost almost a day looking for clear examples and explanations. Can you suggest any? Is there complete and clear documentation? I also found this page with other examples explaining the use of category schemes, but it is not for Eurostat (as it explains at some point).
Both methods could work, apart from the issues described, but I would also like a suggestion for one definitive method to query Eurostat and many other institutions such as the OECD, the World Bank, etc.
Could you guide me to a definitive, working solution, even if it is different for each institution?
Here is my definitive answer to my own question; it works for every type of data I have collected from Eurostat. I'm posting it here because it may be useful to others.
Here are some examples. They produce three pandas Series (EU_unempl, EU_GDP, EU_intRates) with the data and correct time indexes:
import numpy
import pandas as pd
#----Unemployment Rate---------
dataEU_unempl=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/ei_lmhr_m?geo=EA&indic=LM-UN-T-TOT&s_adj=NSA&unit=PC_ACT',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_unempl['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_unempl['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_unempl['value'][str(i)])
EU_unempl=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_unempl['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_unempl['value'].keys())[0])]),format='%YM%m')), periods=len(x), freq='M')) #'1/1993'
#----GDP---------
dataEU_GDP=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/namq_10_gdp?geo=EA&na_item=B1GQ&s_adj=NSA&unit=CP_MEUR',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_GDP['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_GDP['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_GDP['value'][str(i)])
EU_GDP=pd.Series(x,index=pd.date_range((pd.Timestamp(sorted(dataEU_GDP['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_GDP['value'].keys())[0])])), periods=len(x), freq='Q'))
#----Money market interest rates---------
dataEU_intRates=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/irt_st_m?geo=EA&intrt=MAT_ON',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_intRates['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_intRates['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_intRates['value'][str(i)])
EU_intRates=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_intRates['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_intRates['value'].keys())[0])]),format='%YM%m')), periods=len(x), freq='M'))
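The three blocks above repeat the same steps, so they could be folded into one helper. The following is a rough sketch, not a tested replacement: it assumes the JSON has the same layout the code above relies on (a 'value' mapping keyed by position strings and a 'dimension' → 'time' → 'category' → 'index' mapping from period label to position) and that every dimension other than time is fixed to a single value. The function name and the sample payload are hypothetical.

```python
import pandas as pd

def jsonstat_to_series(payload):
    """Build a Series from a single-series payload (assumed layout, see above)."""
    time_index = payload["dimension"]["time"]["category"]["index"]  # label -> position
    values = payload["value"]                                       # "position" -> number
    labels = sorted(time_index, key=time_index.get)                 # labels in position order
    data = [values.get(str(time_index[label])) for label in labels]  # None where a value is missing
    return pd.Series(data, index=labels, dtype="float64")

# Hypothetical payload with a gap at position 1; the numbers are invented.
sample = {
    "value": {"0": 1.0, "2": 3.0},
    "dimension": {"time": {"category": {"index": {
        "2000M01": 0, "2000M02": 1, "2000M03": 2,
    }}}},
}

series = jsonstat_to_series(sample)
# Monthly labels look like "2000M01"; %m is the month number.
series.index = pd.to_datetime(series.index, format="%YM%m")
print(series)
```

Looking values up by position keeps missing observations as NaN instead of raising a KeyError.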
The general solution is to not rely on overly-specific APIs like datareader and instead go to the source. You can use datareader's source code as inspiration and as a guide for how to do it. But ultimately when you need to get data from a source, you may want to directly access that source and load the data.
One very popular tool for HTTP APIs is requests. You can easily use it to load JSON data from any website or HTTP(S) service. Once you have the JSON, you can load it into Pandas. Because this solution is based on general-purpose building blocks, it is applicable to virtually any data source on the Web (as opposed to e.g. pandaSDMX, which is only applicable to SDMX data sources).
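As a rough illustration of that approach, here is a minimal sketch. The helper names and the sample records are hypothetical, and the example.org URL in the comment is a placeholder, not a real service. The download helper is defined but not called here, so the snippet runs without network access.

```python
import pandas as pd
import requests

def fetch_json(url, params=None, timeout=30):
    """Download a JSON document and return it as Python objects."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.json()

def records_to_frame(records):
    """Flatten a list of (possibly nested) dicts into a DataFrame."""
    return pd.json_normalize(records)

# Hypothetical records, standing in for
# fetch_json("https://example.org/api/data", params={"geo": "EA"})
sample_records = [
    {"geo": "EA", "obs": {"period": "2020", "value": 1.5}},
    {"geo": "EA", "obs": {"period": "2021", "value": 2.0}},
]

frame = records_to_frame(sample_records)
print(frame)
```

Nested keys become dotted column names such as obs.value, which can be renamed afterwards.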
Load with read_csv and multiple separators
The problem with Eurostat data from the bulk download repository is that the files are tab-separated, except that the first 3 columns are separated by commas. Pandas read_csv() can handle multiple separators given as a regex if you specify engine="python". This works for some data sets, but the OP's dataset also contains flags, which cannot be ignored in the last column.
# Load the house price index from the Eurostat bulk download facility
import pandas
code = "prc_hpi_a"
url = f"https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F{code}.tsv.gz"
# pandas.read_csv could almost read it directly with a regex separator
df = pandas.read_csv(url, sep=",|\t| [^ ]?\t", na_values=":", engine="python")
# But the last column is a character column instead of a numeric because of the
# presence of a flag ": c" illustrated in the last line of the table extract
# below
# purchase,unit,geo\time\t 2006\t 2005
# DW_EXST,I10_A_AVG,AT\t :\t :
# DW_EXST,I10_A_AVG,BE\t 83.86\t 75.16
# DW_EXST,I10_A_AVG,BG\t 87.81\t 76.56
# DW_EXST,I10_A_AVG,CY\t :\t :
# DW_EXST,I10_A_AVG,CZ\t :\t :
# DW_EXST,I10_A_AVG,DE\t100.80\t101.10
# DW_EXST,I10_A_AVG,DK\t113.85\t 91.79
# DW_EXST,I10_A_AVG,EE\t156.23\t 98.69
# DW_EXST,I10_A_AVG,ES\t109.68\t :
# DW_EXST,I10_A_AVG,FI\t : c\t : c
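One possible way around the flags is to read every cell as text and strip the flags afterwards. The sketch below works on a small inline sample rather than the real file. The BE and FI rows follow the extract above, while the XX row with a "b" flag is invented to show a flagged number. Treat it as one option under these assumptions, not the only way to do it.

```python
import io
import pandas as pd

# Inline sample in the bulk-download layout. The XX row is hypothetical.
raw = (
    "purchase,unit,geo\\time\t2006 \t2005 \n"
    "DW_EXST,I10_A_AVG,BE\t83.86 \t75.16 \n"
    "DW_EXST,I10_A_AVG,FI\t: c\t: c\n"
    "DW_EXST,I10_A_AVG,XX\t101.10 b\t: \n"
)

# Read everything as text so flags such as ": c" do not affect type inference.
df = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)

# Split the comma-separated id column into separate columns.
id_col = df.columns[0]
ids = df[id_col].str.split(",", expand=True)
ids.columns = [name.replace("\\time", "") for name in id_col.split(",")]

# Keep only the numeric part of each cell; flags and ":" become NaN.
values = df.drop(columns=id_col)
values.columns = values.columns.str.strip()
number = r"(-?\d+(?:\.\d+)?)"
values = values.apply(
    lambda col: pd.to_numeric(col.str.extract(number, expand=False), errors="coerce")
)

tidy = pd.concat([ids, values], axis=1)
print(tidy)
```

The flags are discarded here; if they matter, the same extract step could capture them into separate columns.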
Load with the eurostat package
There is also a Python package called eurostat which makes it possible to search the bulk facility and load data sets from it into pandas data frames.
Load 2 different monthly exchange rate data sets:
import eurostat
df1 = eurostat.get_data_df(code)
The table of contents of the bulk download facility can be read with
toc_url = "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=table_of_contents_en.txt"
toc2 = pandas.read_csv(toc_url, sep="\t")
# Remove white spaces at the beginning and end of strings
toc2 = toc2.applymap(lambda x: x.strip() if isinstance(x, str) else x)
or with
toc = eurostat.get_toc_df()
toc0 = (eurostat.subset_toc_df(toc, "exchange"))
The last line searches for the datasets that have "exchange" in their title.
Reshape to long format
It might be useful to reshape the Eurostat data to long format with:
import re
if any(df.columns.str.contains("time")):
    time_column = df.columns[df.columns.str.contains("time")][-1]
    # Id columns are before the time columns
    id_columns = df.loc[:, :time_column].columns
    df = df.melt(id_vars=id_columns, var_name="period", value_name="value")
    # Remove "\time" from the rightmost column of the index
    df = df.rename(columns=lambda x: re.sub(r"\\time", "", x))
I'm new to using APIs, and I'm currently trying to use the Elsevier API. My goal is to extract the author (university) affiliations for each submission in a given journal. I've set up the API key and looked at the exampleProg.py found here.
The How-To guides also aren't very helpful with my specific task. Could someone point me in the right direction?
Using the pybliometrics package that we designed (we're Scopus users w/o Elsevier affiliation), it's very easy:
from pybliometrics.scopus import ScopusSearch
q = "ISSN(0036-8075)" # Query for a journal by its ISSN (0036-8075 is the print ISSN of Science)
s = ScopusSearch(q) # Handles access, retrieval and parsing
pubs = s.results # This is a list of namedtuples, one for each publication
data = []
for pub in pubs:
    if not pub.author_ids:
        continue
    authors = pub.author_ids.split(";")
    affs = pub.author_afids.split(";") # Multiple affiliations joined on hyphen!
    data.extend(list(zip(authors, affs)))
We designed the information such that missing affiliations are simply stored as empty string.
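From there, counting how often each affiliation appears is a short step. The sketch below uses invented IDs in the same (author, affiliations) shape as the list built above, and assumes hyphen-joined multiple affiliations and empty strings for missing ones, as described.

```python
from collections import Counter

# Hypothetical (author_id, affiliation_ids) pairs; all IDs are invented.
data = [
    ("7001", "60001"),
    ("7002", "60001-60002"),  # two affiliations joined on a hyphen
    ("7003", ""),             # missing affiliation stored as an empty string
]

aff_counts = Counter()
for author_id, aff_field in data:
    for aff_id in aff_field.split("-"):
        if aff_id:  # skip the empty strings that mark missing affiliations
            aff_counts[aff_id] += 1

print(aff_counts.most_common())
```

The resulting affiliation IDs would still need to be resolved to institution names separately.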
The following code returns None instead of keywords:
from rake_nltk import Rake
r=Rake()
testscenario='''This document is very important as it has a lot of business objectives mentioned in it.'''
defect='''Current day per security file is going to Bloomberg and we are getting data back from Bloomberg but it is not loading into the MarkIt tables. Last date on MarkIt tables for data loaded was June 29, 2016.BBG Run date for what is going into per security matcher is June 29th.See attached for screen shots.'''
print(r.extract_keywords_from_text(testscenario))
The output that I am getting is None.
The following code can be used. It worked for me.
from rake_nltk import Rake
r=Rake()
testscenario='This document is very important as it has a lot of business objectives mentioned in it.'
r.extract_keywords_from_text(testscenario)
print(r.get_ranked_phrases())
Reference: https://pypi.org/project/rake-nltk/
Refer to the README of the package. It clearly describes what you need to do to get ranked phrases.
r.extract_keywords_from_text(testscenario)
extracts the keywords from the given text and stores them on the Rake object. It does not return them, which is why printing its result shows None. Use
r.get_ranked_phrases()
r.get_ranked_phrases_with_scores()
to get the ranked phrases, without or with their scores.
Readme link : https://github.com/csurfer/rake-nltk/blob/master/README.md
How can I generate a summary, like IBM does, from the JSON returned by the Discovery News service with Python?
qopts = {'nested':'(enriched_text.entities)','filter':'(enriched_text.entities.type::Person)','term':'(enriched_text.entities.text,count:10)','filter':'(enriched_text.concepts.text:infosys)','filter':'(enriched_text.concepts.text:ceo)'}
my_query = discovery.query('system', 'news', qopts)
print(json.dumps(my_query, indent=2))
Is this query correct for finding the CEO of Infosys?
The output is a large JSON document. How do I identify the answer in it, or create a summary such as the top ten CEOs or people?
In short: how do I generate a summary from the JSON returned by the Discovery News service with Python? I run the query and get a large JSON output. How do I extract a proper summary from it, and is my query correct?
I believe there are two questions here.
In order to answer a question like "Who is the CEO of Infosys?" I would instead make use of the natural_language_query parameter as follows:
qopts = {'natural_language_query':'Who is the CEO of Infosys?','count':'5'}
response = discovery.query(environment_id='system',collection_id='news',query_options=qopts)
print(json.dumps(response,indent=2))
In order to make use of aggregations, they must be specified in a single aggregation parameter combined with filter aggregations in the query options as follows:
qopts = {'aggregation': 'nested(enriched_text.entities).filter(enriched_text.entities.type::Person).term(enriched_text.entities.text,count:10)', 'filter':'enriched_text.entities:(text:Infosys,type:Company)','count':'0'}
response = discovery.query(environment_id='system',collection_id='news',query_options=qopts)
print(json.dumps(response,indent=2))
Notice that aggregations are chained/combined with the . symbol.
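To turn the large JSON into a short "top ten people" summary, you can walk the aggregations in the response and pull out the term results. The sketch below assumes each aggregation is a dict with a "type", optional nested "aggregations", and, for term aggregations, a "results" list of "key" / "matching_results" entries. Please check that against your actual response. The sample response, names and counts are invented.

```python
def collect_term_results(aggregations):
    """Recursively gather (key, matching_results) pairs from term aggregations."""
    found = []
    for agg in aggregations or []:
        if agg.get("type") == "term":
            found.extend((r["key"], r["matching_results"]) for r in agg.get("results", []))
        # Nested and filter aggregations wrap further aggregations.
        found.extend(collect_term_results(agg.get("aggregations")))
    return found

# Hypothetical, heavily trimmed response in the assumed shape.
response = {
    "matching_results": 120,
    "aggregations": [
        {"type": "nested", "path": "enriched_text.entities", "aggregations": [
            {"type": "filter", "match": "enriched_text.entities.type::Person", "aggregations": [
                {"type": "term", "field": "enriched_text.entities.text", "results": [
                    {"key": "Person A", "matching_results": 42},
                    {"key": "Person B", "matching_results": 17},
                ]},
            ]},
        ]},
    ],
}

top_people = collect_term_results(response.get("aggregations"))
for name, count in top_people:
    print(f"{name}: {count}")
```

With count:10 in the term aggregation, the list holds at most ten names, ordered as the service returned them.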