r reticulate: rename duplicates from converted Python pandas dataframe - python

I am using the great new r package "reticulate" to merge Python and R to be able to use an API from a data provider (Thomson Reuters Eikon) in R which is only available for Python. I wish to do that as my R abilities are better than my (almost non-existent) Python abilities.
I use a function "get_news_headlines" from the Python module "eikon" which serves as the API to download data from Thomson Reuters Eikon. I automatically convert the resulting pandas dataframe to an r dataframe by setting the argument "convert" of the reticulate function "import" to TRUE.
The API sets the first column of the downloaded data, which contains the news publication dates, as the index. Because these dates contain duplicates, the automatic conversion to an R object fails with the following error message:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘2018-05-31 08:21:56’
Here is my code:
library(reticulate) #load reticulate package to combine Python with R
PYTHON_pandas <- import("pandas", convert = TRUE)
#import Python pandas via reticulate's function "import"
PYTHON_eikon <- import("eikon", convert = TRUE)
#import the Thomson Reuters API Python module for use of the API in R
#(I set convert to true to convert Python objects into their R equivalents)
#do not bother with the following line:
PYTHON_eikon$set_app_id('ADD EIKON APP ID HERE')
#set a Thomson Reuters application ID (step is necessary to download data from TR, any string works)
DF <- PYTHON_eikon$get_news_headlines(query = 'Topic:REAM AND Topic:US', count = 10L)
#save news data from the API in an R dataframe
#query is the Thomson Reuters code from their Eikon database
#count is the number of news articles to be downloaded, I arbitrarily set it to 10 articles here
So my problem is that I have to tell R to replace the duplicates from the pandas index before the conversion into an r dataframe happens to avoid the stated error message. When I set the argument count to a small number and coincidentally do not have any duplicates, the code works perfectly fine as it is now.
This is probably easy for people with some knowledge both in R and Python (so not for me as my Python knowledge is very limited). Unfortunately the code is not replicable as I want to use the Thomson Reuters data access.
Any help is highly appreciated!
EDIT:
Would it be an option to set the argument convert = FALSE in the import function, so that I first receive a pandas dataframe in R? Then I would need a way to manipulate the Python pandas dataframe within R, so that either the duplicates are removed or the pandas dataframe index is dropped before I manually convert the pandas dataframe to an R dataframe. Is that possible with reticulate?
The documentation for the eikon Python package is not really good yet as it is a pretty new Python module.
@Moody_Mudskipper:
str(PYTHON_eikon) only returns Module(eikon) as I am only fetching the respective Python module with the import function.
names(PYTHON_eikon) returns:
"data_grid" "eikonError" "EikonError" "get_app_id" "get_data" "get_news_headlines" "get_news_story" "get_port_number" "get_symbology" "get_timeout" "get_timeseries" "json_requests" "news_request" "Profile" "send_json_request" "set_app_id" "set_port_number" "set_timeout" "symbology" "time_series" "tools" "TR_Field"
None of the available eikon functions seems to help me with my issue.

In case this rather special problem is ever interesting for someone else, I briefly want to share the solution I found in the meantime (not perfect, but working):
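library(reticulate) #load reticulate package to combine Python with R
library(rlist) #provides the function "list.cbind" used below
library(data.table) #provides the function "rbindlist" used below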
PYTHON_pandas <- import("pandas", convert = TRUE)
#import Python pandas via reticulate's function "import"
PYTHON_eikon <- import("eikon", convert = TRUE)
#import the Thomson Reuters API Python module for use of the API in R
#(I set convert to true to convert Python objects into their R equivalents)
#do not bother with the following line:
PYTHON_eikon$set_app_id('ADD EIKON APP ID HERE')
#set a Thomson Reuters application ID (step is necessary to download data from TR, any string works)
#**Solution starts HERE:**
DF <- PYTHON_eikon$get_news_headlines(query = 'Topic:REAM AND Topic:US', count = 10L, raw_output = TRUE)
#use argument "raw_output" to receive a list instead of a dataframe
DF[c(2, 3)] <- NULL
#delete unrequired list-elements
DF <- list.cbind(DF)
#use "rlist" function "list.cbind" to column-bind list object "DF"
DF <- rbindlist(DF, fill = FALSE)
#use "data.table" function "rbindlist" to row-bind list object "DF"
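The idea raised in the question's EDIT (dropping the pandas index before the conversion to R) can be sketched on the Python side as follows. This is a minimal sketch, not the eikon API: the frame, its contents and the index name "versionCreated" are made up for illustration. From R with convert = FALSE, the equivalent would presumably be a call to DF$reset_index() followed by reticulate's py_to_r(), but that is an assumption and was not tested against eikon.

```python
import pandas as pd

# Made-up stand-in for the headlines frame: the timestamp index repeats.
df = pd.DataFrame(
    {"text": ["first", "second", "third"]},
    index=pd.to_datetime(
        ["2018-05-31 08:21:56", "2018-05-31 08:21:56", "2018-05-31 09:00:00"]
    ),
)
df.index.name = "versionCreated"  # hypothetical name, for illustration only

# The duplicated timestamps are what R rejects as row names.
print(df.index.is_unique)

# Move the index into an ordinary column; the new index is 0, 1, 2, ...
flat = df.reset_index()
print(flat.index.is_unique)
print(list(flat.columns))
```

Because the dates become a regular column, no information is lost, and the default integer index is unique by construction.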

Do you need to use the R package "reticulate" or could you look at other packages as well?
There is an open source wrapper for R available at GitHub: eikonapir. Although it is not officially supported, you might find it useful because it can execute your command without any problems:
get_news_headlines(query = 'Topic:REAM AND Topic:US', count = 10L)
Disclaimer: I am currently employed by Thomson Reuters.

Related

piplyr package python

Has anyone heard about "piplyr" package in python? It has several functions similar to "dplyr" and "tidyr".
I am trying to follow its instructions here: https://pypi.org/project/piplyr. However, I get an error saying that the 'DataFrameGroupBy' object has no attribute 'sort_values'. Below is the example:
import pandas as pd
from piplyr.main import piplyr
df = pd.DataFrame({"A":[2,1,20,10],"C":['a','a','b','b']})
pi = piplyr(df)
(
pi.
group_by('C').
sort_by("A")
).to_df
I can get to the results I am looking for using the following code:
df = pd.DataFrame({"A":[2,1,20,10],"C":['a','a','b','b']})
pi = piplyr(df)
(
pi.
sql_plyr('''
SELECT A, C
FROM df
GROUP BY C,A
ORDER BY A
''')
).to_df
Thanks for using piplyr. This is related to the fact that, after a group by clause is applied, pandas does not return a data frame but a 'groupby' object. So, I plan to remove the group_by function from the package in the next version. To solve this problem, I added four functions to the package: mutate_eval, mutate_func, sql_plyr, and summarize. These functions enable users to perform group by operations.
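For comparison, the same ordering can be reached in plain pandas without piplyr. This is only a sketch of one possible workaround, not something the package documents: sorting on the group key first and then on the value column orders the rows within each group.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 20, 10], "C": ["a", "a", "b", "b"]})

# groupby() returns a groupby object, not a DataFrame.
grouped = df.groupby("C")
print(type(grouped).__name__)

# Sort by the group key, then by A, to order rows inside each group.
result = df.sort_values(["C", "A"]).reset_index(drop=True)
print(result)
```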

Is there a way to search for a string and copy text in front until it reaches a comma?

I am new to python and wanted to store the recentAveragePrice inside a variable (from a string like this one)
{"assetStock":null,"sales":250694,"numberRemaining":null,"recentAveragePrice":731,"originalPrice":null,"priceDataPoints":[{"value":661,"date":"2022-08-11T05:00:00Z"},{"value":592,"date":"2022-08-10T05:00:00Z"},{"value":443,"date":"2022-08-09T05:00:00Z"}],"volumeDataPoints":[{"value":155,"date":"2022-08-11T05:00:00Z"},{"value":4595,"date":"2022-08-10T05:00:00Z"},{"value":12675,"date":"2022-08-09T05:00:00Z"},{"value":22179,"date":"2022-08-08T05:00:00Z"},{"value":15181,"date":"2022-08-07T05:00:00Z"},{"value":14541,"date":"2022-08-06T05:00:00Z"},{"value":15310,"date":"2022-08-05T05:00:00Z"},{"value":14146,"date":"2022-08-04T05:00:00Z"},{"value":13083,"date":"2022-08-03T05:00:00Z"},{"value":14460,"date":"2022-08-02T05:00:00Z"},{"value":16809,"date":"2022-08-01T05:00:00Z"},{"value":17571,"date":"2022-07-31T05:00:00Z"},{"value":23907,"date":"2022-07-30T05:00:00Z"},{"value":39007,"date":"2022-07-29T05:00:00Z"},{"value":38823,"date":"2022-07-28T05:00:00Z"}]}
My current solution is this:
var = sampleStr[78] + sampleStr[79] + sampleStr[80]
It works for the current string, but if the recentAveragePrice were above 999 it would stop working. I was wondering whether, instead of reading fixed positions, I could search for the value inside the string.
Your replit code shows that you're acquiring JSON data from some website. Here's an example based on the URL that you're using. It shows how you check the response status, acquire the JSON data as a Python dictionary then print a value associated with a particular key. If the key is missing, it will print None:
import requests
(r := requests.get('https://economy.roblox.com/v1/assets/10159617728/resale-data')).raise_for_status()
jdata = r.json()
print(jdata.get('recentAveragePrice'))
Output:
640
Since this is json you should just be able to parse it and access recentAveragePrice:
import json
sample_string = '''{"assetStock":null,"sales":250694,"numberRemaining":null,"recentAveragePrice":731,"originalPrice":null,"priceDataPoints":[{"value":661,"date":"2022-08-11T05:00:00Z"},{"value":592,"date":"2022-08-10T05:00:00Z"},{"value":443,"date":"2022-08-09T05:00:00Z"}],"volumeDataPoints":[{"value":155,"date":"2022-08-11T05:00:00Z"},{"value":4595,"date":"2022-08-10T05:00:00Z"},{"value":12675,"date":"2022-08-09T05:00:00Z"},{"value":22179,"date":"2022-08-08T05:00:00Z"},{"value":15181,"date":"2022-08-07T05:00:00Z"},{"value":14541,"date":"2022-08-06T05:00:00Z"},{"value":15310,"date":"2022-08-05T05:00:00Z"},{"value":14146,"date":"2022-08-04T05:00:00Z"},{"value":13083,"date":"2022-08-03T05:00:00Z"},{"value":14460,"date":"2022-08-02T05:00:00Z"},{"value":16809,"date":"2022-08-01T05:00:00Z"},{"value":17571,"date":"2022-07-31T05:00:00Z"},{"value":23907,"date":"2022-07-30T05:00:00Z"},{"value":39007,"date":"2022-07-29T05:00:00Z"},{"value":38823,"date":"2022-07-28T05:00:00Z"}]}'''
data = json.loads(sample_string)
recent_price = data['recentAveragePrice']
print(recent_price)
outputs:
731
Your data is in a popular format called JSON (JavaScript Object Notation). It's commonly used to exchange data between different systems like a server and a client, or a Python program and JavaScript program.
Now Python doesn't use JSON per-se, but it has a data type called a dictionary that behaves very similarly to JSON. You can access elements of a dictionary as simply as:
print(my_dictionary["recentAveragePrice"])
Python has a built-in library meant specifically to handle JSON data, and it includes a function called loads() that can convert a string into a Python dictionary. We'll use that.
Finally, putting all that together, here is a more robust program to help parse your string and pick out the data you need. Dictionaries can do a lot more cool stuff, so make sure you take a look at the links above.
# import the JSON library
# specifically, we import the `loads()` function, which will convert a JSON string into a Python object
from json import loads
# let's store your string in a variable
original_string = """
{"assetStock":null,"sales":250694,"numberRemaining":null,"recentAveragePrice":731,"originalPrice":null,"priceDataPoints":[{"value":661,"date":"2022-08-11T05:00:00Z"},{"value":592,"date":"2022-08-10T05:00:00Z"},{"value":443,"date":"2022-08-09T05:00:00Z"}],"volumeDataPoints":[{"value":155,"date":"2022-08-11T05:00:00Z"},{"value":4595,"date":"2022-08-10T05:00:00Z"},{"value":12675,"date":"2022-08-09T05:00:00Z"},{"value":22179,"date":"2022-08-08T05:00:00Z"},{"value":15181,"date":"2022-08-07T05:00:00Z"},{"value":14541,"date":"2022-08-06T05:00:00Z"},{"value":15310,"date":"2022-08-05T05:00:00Z"},{"value":14146,"date":"2022-08-04T05:00:00Z"},{"value":13083,"date":"2022-08-03T05:00:00Z"},{"value":14460,"date":"2022-08-02T05:00:00Z"},{"value":16809,"date":"2022-08-01T05:00:00Z"},{"value":17571,"date":"2022-07-31T05:00:00Z"},{"value":23907,"date":"2022-07-30T05:00:00Z"},{"value":39007,"date":"2022-07-29T05:00:00Z"},{"value":38823,"date":"2022-07-28T05:00:00Z"}]}
"""
# convert the string into a dictionary object
dictionary_object = loads(original_string)
# access the element you need
print(dictionary_object["recentAveragePrice"])
Output upon running this program:
$ python exp.py
731
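If the text really has to be searched as a plain string, as the title of the question asks, a regular expression can pick out the digits that follow the key. This is a minimal sketch on a shortened copy of the string from the question; find_number_after is a hypothetical helper name. Parsing with json as shown above remains the more robust choice.

```python
import re

# Shortened copy of the string from the question.
sample_str = (
    '{"assetStock":null,"sales":250694,"numberRemaining":null,'
    '"recentAveragePrice":731,"originalPrice":null}'
)


def find_number_after(key, text):
    """Return the integer that follows "key": in text, or None if absent."""
    pattern = r'"' + re.escape(key) + r'"\s*:\s*(-?\d+)'
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


price = find_number_after("recentAveragePrice", sample_str)
print(price)
```

Unlike fixed positions such as sampleStr[78], the match does not depend on how many digits the price has.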

How to limit open orders request to one pair with kraken api (python)?

In order to make my code more efficient, I'm trying to limit my api request for open orders to one single pair. I can't figure out how to correctly use the input parameters.
I'm using python3 and the krakenex package (which I could replace if there is one which works better)
client = krakenex.API(<<key>>, <<secret>>)
data = {'pair': 'ADAEUR'}
open_ord = client.query_private(method='OpenOrders',data = data) ['result']
open_ord_ = list(open_ord.values())[0]
---> this unfortunately returns the open orders of all my pairs and not only the "ADAEUR".
I guess one needs to adapt the data parameters which I was not able to figure out...
Would be awesome if someone could help me.
Many Thanks in advance
According to the Kraken API docs, there is no data argument for the getOpenOrders endpoint, so this explains why your results are not filtered.
Two methods:
Using the pykrakenapi package that neatly wraps all output in a Pandas DataFrame:
import krakenex
from pykrakenapi import KrakenAPI
api = krakenex.API(<<key>>, <<secret>>)
connection = KrakenAPI(api)
pairs = ['ADAEUR', 'XTZEUR']
open_orders = connection.get_open_orders()
open_orders = open_orders[open_orders['descr_pair'].isin(pairs)]
print(open_orders)
Using only krakenex and filtering from the JSON output:
import krakenex
api = krakenex.API(<<key>>, <<secret>>)
pairs = ['ADAEUR', 'XTZEUR']
open_orders = api.query_private(method='OpenOrders')['result']['open']
open_orders = [(o, open_orders[o]) for o in open_orders if open_orders[o]['descr']['pair'] in pairs]
print(open_orders)
Both methods are written so they can filter one or multiple pairs.
Method 1 returns a Pandas DataFrame; method 2 returns a list containing, for each open order, a tuple of (order ID (str), order info (dict)).
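The filtering step of the second method can be tried without an API key on a hand-made dictionary. This is a sketch only: the order IDs and order details below are invented, and the nested layout (order ID, then 'descr', then 'pair') is taken from the code above, not checked against a live response. filter_open_orders is a hypothetical helper name.

```python
def filter_open_orders(open_orders, pairs):
    """Keep only the orders whose descr.pair is one of the wanted pairs."""
    wanted = set(pairs)
    return {
        order_id: info
        for order_id, info in open_orders.items()
        if info["descr"]["pair"] in wanted
    }


# Invented sample in the shape that the code above reads from ['result']['open'].
sample_open = {
    "ORDER-1": {"descr": {"pair": "ADAEUR", "type": "buy"}},
    "ORDER-2": {"descr": {"pair": "XBTEUR", "type": "sell"}},
    "ORDER-3": {"descr": {"pair": "XTZEUR", "type": "buy"}},
}

filtered = filter_open_orders(sample_open, ["ADAEUR"])
print(filtered)
```

Returning a dictionary keeps the order ID as the key; the list of tuples from the second method is equivalent to list(filtered.items()).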

Looking for samples to multiple API calls in python

I am new to Python and still learning. As part of my learning, I am trying to do an API integration. I am getting a result, but it is limited to 100 records, while the total is around 7000 records. Is there a way to call the API multiple times and write the entire result to a CSV file? I am adding my code below and am not sure how to proceed further.
import requests
import pandas as pd
resp = requests.get('apipath' + '?company=XXXX', auth=('XXXXX', 'XXXXXX'))
dataframe = resp.json()
dataset = pd.DataFrame(dataframe["items"]).to_csv('dict_file.csv', header=True)
Please help.
You'll need to check the API documentation, but generally there will be a parameter such as "maxResults" (or similar) that you can add to the URL to retrieve more than the default number of results.
Your request (with the query string in the URL modified) would look something like this:
resp = requests.get('apipath' + '?company=XXXX&maxResults=1000', auth=('XXXXX', 'XXXXXX'))
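If the API caps the page size, the usual pattern is to request page after page until everything has been collected. The sketch below shows only that loop and runs against a fake in-memory API. The "items" key comes from the question; the "totalResults" key and the offset and limit arguments are assumptions that must be replaced with whatever the real API documents. fetch_all and fake_fetch_page are hypothetical names.

```python
def fetch_all(fetch_page, page_size=100):
    """Call fetch_page(offset, limit) repeatedly and collect every item."""
    items = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        batch = page["items"]
        items.extend(batch)
        offset += len(batch)
        # Stop on an empty page or once the reported total is reached.
        if not batch or offset >= page["totalResults"]:
            break
    return items


# Fake API for illustration: 250 records served in slices.
FAKE_RECORDS = [{"id": n} for n in range(250)]


def fake_fetch_page(offset, limit):
    # With requests this would be something like (parameter names assumed):
    # requests.get(url, params={"offset": offset, "limit": limit}, auth=...).json()
    return {
        "items": FAKE_RECORDS[offset:offset + limit],
        "totalResults": len(FAKE_RECORDS),
    }


all_items = fetch_all(fake_fetch_page)
print(len(all_items))
```

The collected list can then be passed to pd.DataFrame(all_items).to_csv(...) as in the question.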

pandas data mining from Eurostat

I'm starting a project to analyse data from statistical institutions like Eurostat using Python, and therefore pandas. I found two ways to get data from Eurostat:
pandas_datareader: it seems very easy to use, but I ran into problems getting some specific data
pandasdmx: I found it a bit complicated; it seems a promising solution, but the documentation is poor
I use a free Azure notebook (an online service), but I don't think that complicates my situation further.
Let me explain the problems with pandas_datareader. According to the pandas documentation, in the API section, there is this briefly documented package, and it works. The example shown there works nicely, but a problem arises with other tables. For example, I can get data about European house prices, whose table ID is prc_hpi_a, with this simple code:
import pandas_datareader.data as web
import datetime
df = web.DataReader('prc_hpi_a', 'eurostat')
But the table has three types of data about dwellings: TOTAL, EXISTING and NEW. I got only the existing dwellings and I don't know how to get the other ones. Do you have a solution for this kind of filtering?
Secondly, there is the path using pandasdmx. Here it is more complicated. My idea is to load all the data into a pandas DataFrame, and then analyse it as I want. Easy to say, but I have not found many tutorials that explain this step of loading data into pandas structures. For example, I found this tutorial, but I'm stuck at the first step, which is to instantiate a client:
import pandasdmx
from pandasdmx import client
#estat=client('Eurostat', 'milk.db')
and it returns:
--------------------------------------------------------------------------- ImportError Traceback (most recent call
last) in ()
1 import pandasdmx
----> 2 from pandasdmx import client
3 estat=client('Eurostat', 'milk.db')
ImportError: cannot import name 'client'
What's the problem here? I've looked around but no answer to this problem
I also followed this tutorial:
from pandasdmx import Request
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_une_rt_a').write()
metadata.codelist.iloc[8:18]
resp = estat.data('une_rt_a', key={'GEO': 'EL+ES+IE'}, params={'startPeriod': '2007'})
data = resp.write(s for s in resp.data.series if s.key.AGE == 'TOTAL')
data.columns.names
data.columns.levels
data.loc[:, ('PC_ACT', 'TOTAL', 'T')]
I got the data, but my goal is to load them into a pandas structure (Series, DataFrame, etc.) so I can handle them easily in my work. How can I do that?
Actually, I managed it with this working line (placed below the previous ones):
s=pd.DataFrame(data)
But it doesn't work if I try to get other data tables. Let me explain with another example, the Harmonised Index of Consumer Prices (HICP) table:
estat = Request('ESTAT')
metadata = estat.datastructure('DSD_prc_hicp_midx').write()
resp = estat.data('prc_hicp_midx')
data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')
It returns an error here, that is:
--------------------------------------------------------------------------- AttributeError Traceback (most recent call
last) in ()
2 metadata = estat.datastructure('DSD_prc_hicp_midx').write()
3 resp = estat.data('prc_hicp_midx')
----> 4 data = resp.write(s for s in resp.data.series if s.key.COICOP == 'CP00')
5 #metadata.codelist
6 #data.loc[:, ('TOTAL', 'INX_Q','EA', 'Q')]
~/anaconda3_501/lib/python3.6/site-packages/pandasdmx/api.py in
__getattr__(self, name)
622 Make Message attributes directly readable from Response instance
623 '''
--> 624 return getattr(self.msg, name)
625
626 def _init_writer(self, writer):
AttributeError: 'DataMessage' object has no attribute 'data'
Why does it not get the data now? What's wrong?
I lost almost a day looking around for clear examples and explanations. Can you suggest any? Is there full and clear documentation? I also found this page with other examples, explaining the use of category schemes, but it is not for Eurostat (as explained at some point).
Both methods could work, apart from the issues explained above, but I also need a suggestion for one definitive method to use, to query Eurostat and also many other institutions like OECD, World Bank, etc.
Could you guide me to a definitive and working solution, even if it is different for each institution?
Here is my definitive answer to my own question; it works for every type of data collected from Eurostat. I post it here because it may be useful to many.
Let me propose some examples. They produce three pandas Series (EU_unempl, EU_GDP, EU_intRates) with data and correct time indexes:
import numpy
import pandas as pd
#----Unemployment Rate---------
dataEU_unempl=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/ei_lmhr_m?geo=EA&indic=LM-UN-T-TOT&s_adj=NSA&unit=PC_ACT',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
# the keys are strings, so compare them as integers to find the first and last position
for i in range((sorted(int(v) for v in dataEU_unempl['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_unempl['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_unempl['value'][str(i)])
# '%YM%m' parses labels such as '1993M01' (%m is the month; %M would be minutes)
EU_unempl=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_unempl['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_unempl['value'].keys())[0])]),format='%YM%m')), periods=len(x), freq='M')) #'1/1993'
#----GDP---------
dataEU_GDP=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/namq_10_gdp?geo=EA&na_item=B1GQ&s_adj=NSA&unit=CP_MEUR',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_GDP['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_GDP['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_GDP['value'][str(i)])
EU_GDP=pd.Series(x,index=pd.date_range((pd.Timestamp(sorted(dataEU_GDP['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_GDP['value'].keys())[0])])), periods=len(x), freq='Q'))
#----Money market interest rates---------
dataEU_intRates=pd.read_json('http://ec.europa.eu/eurostat/wdds/rest/data/v2.1/json/en/irt_st_m?geo=EA&intrt=MAT_ON',typ='series',orient='table',numpy=True) #,typ='DataFrame',orient='table'
x=[]
for i in range((sorted(int(v) for v in dataEU_intRates['value'].keys())[0]),1+(sorted((int(v) for v in dataEU_intRates['value'].keys()),reverse=True))[0]):
    x=numpy.append(x,dataEU_intRates['value'][str(i)])
# '%YM%m' parses labels such as '1993M01' (%m is the month; %M would be minutes)
EU_intRates=pd.Series(x,index=pd.date_range((pd.to_datetime((sorted(dataEU_intRates['dimension']['time']['category']['index'].keys())[(sorted(int(v) for v in dataEU_intRates['value'].keys())[0])]),format='%YM%m')), periods=len(x), freq='M'))
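The three blocks above repeat the same decoding steps, which can be factored into one small function. The sketch below uses only the standard library and a made-up payload; it assumes the layout that the code above reads (a sparse 'value' mapping keyed by position as a string, and a time index mapping each period label to its position) and that time is the only dimension that varies. series_from_jsonstat is a hypothetical helper name.

```python
def series_from_jsonstat(payload):
    """Return (period label, value) pairs ordered by position.

    Positions missing from the sparse 'value' mapping yield None.
    """
    index = payload["dimension"]["time"]["category"]["index"]  # label -> position
    values = payload["value"]  # str(position) -> number
    ordered = sorted(index.items(), key=lambda item: item[1])
    return [(label, values.get(str(position))) for label, position in ordered]


# Made-up payload for illustration; position 2 has no value.
sample = {
    "value": {"0": 9.1, "1": 9.0, "3": 8.7},
    "dimension": {
        "time": {
            "category": {
                "index": {"2020M01": 0, "2020M02": 1, "2020M03": 2, "2020M04": 3}
            }
        }
    },
}

pairs = series_from_jsonstat(sample)
print(pairs)
```

Pairing each value with its own label avoids having to assume that the periods are contiguous, which the date_range construction above relies on.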
The general solution is to not rely on overly-specific APIs like datareader and instead go to the source. You can use datareader's source code as inspiration and as a guide for how to do it. But ultimately when you need to get data from a source, you may want to directly access that source and load the data.
One very popular tool for HTTP APIs is requests. You can easily use it to load JSON data from any website or HTTP(S) service. Once you have the JSON, you can load it into Pandas. Because this solution is based on general-purpose building blocks, it is applicable to virtually any data source on the Web (as opposed to e.g. pandaSDMX, which is only applicable to SDMX data sources).
Load with read_csv and multiple separators
The problem with Eurostat data from the bulk download repository is that they are tab-separated files in which the first 3 columns are separated by commas. Pandas read_csv() can handle multiple separators given as a regex if you specify engine="python". This works for some data sets, but the OP's dataset also contains flags, which cannot be ignored in the last column.
# Load the house price index from the Eurostat bulk download facility
import pandas
code = "prc_hpi_a"
url = f"https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F{code}.tsv.gz"
# pandas.read_csv could almost read it directly with a multiple separator
df = pandas.read_csv(url, sep=",|\t| [^ ]?\t", na_values=":", engine="python")
# But the last column is a character column instead of a numeric because of the
# presence of a flag ": c" illustrated in the last line of the table extract
# below
# purchase,unit,geo\time\t 2006\t 2005
# DW_EXST,I10_A_AVG,AT\t :\t :
# DW_EXST,I10_A_AVG,BE\t 83.86\t 75.16
# DW_EXST,I10_A_AVG,BG\t 87.81\t 76.56
# DW_EXST,I10_A_AVG,CY\t :\t :
# DW_EXST,I10_A_AVG,CZ\t :\t :
# DW_EXST,I10_A_AVG,DE\t100.80\t101.10
# DW_EXST,I10_A_AVG,DK\t113.85\t 91.79
# DW_EXST,I10_A_AVG,EE\t156.23\t 98.69
# DW_EXST,I10_A_AVG,ES\t109.68\t :
# DW_EXST,I10_A_AVG,FI\t : c\t : c
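One way to deal with the flags is to split each cell into a numeric value and a flag before building the data frame. The following is a minimal sketch using only the standard library on lines copied from the extract above; split_cell and parse_line are hypothetical helper names, and the sketch handles data lines only, not the header line.

```python
def split_cell(cell):
    """Split a cell such as ' 83.86', ' :' or ' : c' into (value, flag)."""
    parts = cell.split()
    if not parts:
        return None, ""
    raw = parts[0]
    flag = parts[1] if len(parts) > 1 else ""
    value = None if raw == ":" else float(raw)  # ':' marks a missing value
    return value, flag


def parse_line(line):
    """Return (comma-separated keys, list of (value, flag)) for one data line."""
    keys, *cells = line.split("\t")
    return keys.split(","), [split_cell(cell) for cell in cells]


print(parse_line("DW_EXST,I10_A_AVG,BE\t 83.86\t 75.16"))
print(parse_line("DW_EXST,I10_A_AVG,FI\t: c\t: c"))
```

Keeping the flag in its own field lets the value column stay numeric.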
Load with the eurostat package
There is also a python package called eurostat which makes it possible to search and load data set from the bulk facility into pandas data frames.
Load 2 different monthly exchange rate data sets:
import eurostat
df1 = eurostat.get_data_df(code)
The table of content of the bulk download facility can be read with
toc_url = "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=table_of_contents_en.txt"
toc2 = pandas.read_csv(toc_url, sep="\t")
# Remove white spaces at the beginning and end of strings
toc2 = toc2.applymap(lambda x: x.strip() if isinstance(x, str) else x)
or with
toc = eurostat.get_toc_df()
toc0 = (eurostat.subset_toc_df(toc, "exchange"))
The last line searches for the datasets that have "exchange" in their title
Reshape to long format
It might be useful to reshape the Eurostat data to long format with:
import re
if any(df.columns.str.contains("time")):
    time_column = df.columns[df.columns.str.contains("time")][-1]
    # Id columns are before the time columns
    id_columns = df.loc[:, :time_column].columns
    df = df.melt(id_vars=id_columns, var_name="period", value_name="value")
    # Remove "\time" from the rightmost column of the index
    df = df.rename(columns=lambda x: re.sub(r"\\time", "", x))
