Reading R dataframes in python in Jupyter

I want to read R objects back into Python in Jupyter. In the example below, a dataframe is created in Python and passed to R. In R I process this dataframe and create a new one, which I then want to read back into Python.
Python cell:
# enables the %%R magic, not necessary if you've already done this
%load_ext rpy2.ipython
import pandas as pd
df = pd.DataFrame({
    'cups_of_coffee': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'productivity': [2, 5, 6, 8, 9, 8, 0, 1, 0, -1]
})
R cell:
%%R -i df
# import df from global environment
df$time = 1
df_new = df
df_new
If I move to a new Python cell, the new dataframe df_new is not recognized there.
I tried this:
%Rget df_new
But I don't know how to assign the result to a pandas dataframe or pass it to a Python function.
How can I switch back to a Python cell and read the new dataframe created in the R cell?

I tried something myself and it worked, although I couldn't find good documentation for it.
You can simply assign the result of the line magic to a Python variable:
df_python = %Rget df_new
This worked for me.

Related

Appending data to a Google Sheet using Python

I have 3 different tables I'm looking to directly push to 3 separate tabs in a Google Sheet. I set up the GSpread connection and that's working well. I started to adjust my first print statement into what I thought would append the information to Tab A (waveData), but no luck.
I'm looking to append the information to the FIRST blank row in a tab. Basically, so that the data will be ADDED to what is already in there.
I'm trying to use append_rows to do this, but I am hitting the following error (truncated here): gspread.exceptions.APIError: {'code': 400, 'message': 'Invalid value at 'data.values' (type.googleapis.com/google.protobuf.ListValue)
I'm really new to this, just thought it would be a fun project to evaluate wave sizes in NJ across all major surf spots, but really in over my head (no pun intended).
Any thoughts?
import requests
import pandas as pd
import gspread
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
waveData = sh.get_worksheet(0)
tideData = sh.get_worksheet(1)
lightData = sh.get_worksheet(2)
# AddValue = ["Test", 25, "Test2"]
# lightData.insert_row(AddValue, 3)
id_list = [
    '/Belmar-Surf-Report/3683/',
    '/Manasquan-Surf-Report/386/',
    '/Ocean-Grove-Surf-Report/7945/',
    '/Asbury-Park-Surf-Report/857/',
    '/Avon-Surf-Report/4050/',
    '/Bay-Head-Surf-Report/4951/',
    '/Belmar-Surf-Report/3683/',
    '/Boardwalk-Surf-Report/9183/',
]
for x in id_list:
    waveData.append_rows(pd.read_html(requests.get('http://magicseaweed.com' + x).text)
        [2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].to_json(), value_input_option="USER_ENTERED")
# print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[0])
# print(pd.read_html(requests.get('http://magicseaweed.com' + x).text)[1])

From your reply below,
there really is no relationship between the 3. When I scrape with IMPORTHTML into Google sheets, those are just Tables at the locations 0,1, and 2. I'm basically just trying to have an output of each table on a separate tab
I understand that you want to retrieve the values with pd.read_html(requests.get('http://magicseaweed.com' + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]] for each entry in id_list and put them into a sheet in a Google Spreadsheet.
It seems that append_rows cannot use JSON data directly; it requires a 2-dimensional array (a list of rows). I'm also worried about the NaN values in the dataframe, so they are replaced with empty strings. With these points reflected in your script, how about the following modifications?
Modified script 1:
In this sample, all values are put into a sheet.
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
waveData = sh.get_worksheet(0)
id_list = [
"/Belmar-Surf-Report/3683/",
"/Manasquan-Surf-Report/386/",
"/Ocean-Grove-Surf-Report/7945/",
"/Asbury-Park-Surf-Report/857/",
"/Avon-Surf-Report/4050/",
"/Bay-Head-Surf-Report/4951/",
"/Belmar-Surf-Report/3683/",
"/Boardwalk-Surf-Report/9183/",
]
# I modified the below script.
res = []
for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].fillna("")
    values = [[x], df.columns.values.tolist(), *df.values.tolist()]
    res.extend(values)
    res.append([])
waveData.append_rows(res, value_input_option="USER_ENTERED")
When this script is run, the retrieved values are put into the 1st sheet. In this sample modification, the path is inserted before each table and a blank row after it. Please modify this for your actual situation.
Modified script 2:
In this sample, each value is put into each sheet.
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('152qSpr-4nK9V5uHOiYOWTWUx4ojjVNZMdSmFYov-n50')
id_list = [
"/Belmar-Surf-Report/3683/",
"/Manasquan-Surf-Report/386/",
"/Ocean-Grove-Surf-Report/7945/",
"/Asbury-Park-Surf-Report/857/",
"/Avon-Surf-Report/4050/",
"/Bay-Head-Surf-Report/4951/",
"/Belmar-Surf-Report/3683/",
"/Boardwalk-Surf-Report/9183/",
]
obj = {e.title: e for e in sh.worksheets()}
for e in id_list:
    if e not in obj:
        obj[e] = sh.add_worksheet(title=e, rows="1000", cols="26")
for x in id_list:
    df = pd.read_html(requests.get("http://magicseaweed.com" + x).text)[2].iloc[:9, [0, 1, 2, 3, 4, 6, 7, 12, 15]].fillna("")
    values = [df.columns.values.tolist(), *df.values.tolist()]
    obj[x].append_rows(values, value_input_option="USER_ENTERED")
When this script is run, it checks for sheets named after the values in id_list, creates any that are missing, and puts each table into its own sheet.
Reference:
append_rows
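The conversion step used in both scripts above (header row plus data rows, with NaN replaced by empty strings) can be isolated into a small helper. This is only a sketch: the name frame_to_rows and the sample column names are made up for illustration, and it only builds the list of rows, it does not call the Sheets API.

```python
import pandas as pd


def frame_to_rows(df):
    """Return [header, row1, row2, ...] as plain Python lists, with NaN -> ""."""
    # Cast to object first so that replacing NaN with "" never hits a dtype conflict.
    as_objects = df.astype(object)
    cleaned = as_objects.where(df.notna(), "")
    return [df.columns.tolist(), *cleaned.values.tolist()]


# Hypothetical sample data standing in for one scraped table.
sample = pd.DataFrame({"spot": ["Belmar", "Avon"], "height_ft": [3.0, float("nan")]})
rows = frame_to_rows(sample)
print(rows)
```

The resulting list of lists is the shape passed to append_rows in the scripts above, e.g. waveData.append_rows(rows, value_input_option="USER_ENTERED").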

Fast way to create a variable in a dataframe as a function of other variables and values in another dataframe?

I have a first dataframe of individuals (df_id) that enter the data at start_time and exit it at end_time.
I have another dataframe (df_time) that gives me the value of a variable x at every point in time.
I want to create a new variable in df_id that will give me, for each individual, the average of x from the individual's start time to end time.
I was only able to do this by looping over each individual one by one, which takes a very long time. Is there a faster way to do this?
Here is what I tried:
import pandas as pd
data_id = {'id':[1, 2, 3], 'start_time':[1, 2, 4], 'end_time':[2, 4, 5]}
df_id = pd.DataFrame(data_id)
data_time = {'time': list(range(1,6)), 'x': [2,2,4,5,3] }
df_time = pd.DataFrame(data_time)
# This works, but is way too slow
for i, row in df_id.iterrows():
    start = row['start_time']-1
    end = row['end_time']
    df_id.at[i,'mean_x'] = ((df_time['x'][start:end])).mean()
Many thanks!

Use apply() instead of iterrows(). This should cut the runtime roughly in half:
import pandas as pd
df_id = pd.DataFrame({'id':[1, 2, 3], 'start_time':[1, 2, 4], 'end_time':[2, 4, 5]})
df_time = pd.DataFrame({'time': list(range(1,6)), 'x': [2,2,4,5,3]})
df_id['mean_x'] = df_id.apply(lambda row: df_time['x'][row['start_time']-1:row['end_time']].mean(), axis=1)
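If apply() is still too slow, the per-row work can probably be removed entirely with prefix sums. The sketch below is not from the answer above; it assumes, as in the example data, that df_time is sorted and that time runs 1, 2, ..., n without gaps, so that a time value can be used directly as a position.

```python
import numpy as np
import pandas as pd

df_id = pd.DataFrame({'id': [1, 2, 3], 'start_time': [1, 2, 4], 'end_time': [2, 4, 5]})
df_time = pd.DataFrame({'time': list(range(1, 6)), 'x': [2, 2, 4, 5, 3]})

# Prefix sums with a leading zero: csum[k] is the sum of the first k values of x.
csum = np.concatenate(([0], df_time['x'].to_numpy().cumsum()))

start = df_id['start_time'].to_numpy()
end = df_id['end_time'].to_numpy()

# Sum of x over times start..end (inclusive) is csum[end] - csum[start - 1];
# dividing by the number of time points gives the mean, for all rows at once.
df_id['mean_x'] = (csum[end] - csum[start - 1]) / (end - start + 1)
print(df_id)
```

This gives the same values as the loop in the question on the sample data. If the times are not consecutive integers, the positions would first have to be looked up, for example with np.searchsorted on the sorted time column.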

Cannot use ggplot2 in python jupyter notebook

I am trying to run this code in a Jupyter notebook
# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')
# Load in the r magic
%load_ext rpy2.ipython
# We need ggplot2
%R require(ggplot2)
# Load in the pandas library
import pandas as pd
# Make a pandas DataFrame
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd','e', 'f', 'g', 'h','i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10, 11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
# Take the name of input variable df and assign it to an R variable of the same name
%R -i df
# Plot the DataFrame df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))
After running this code I get the error NameError: name 'ggplot' is not defined
In fact, when I run %R require(ggplot2) I get an array containing 0 (R's FALSE, which suggests the package was not loaded): array([0], dtype=int32)
UPDATE
Following @James's comment, if I use %R ggplot(data=df) + geom_point(aes(x=A, y=B, color=C)) instead of ggplot(data=df) + geom_point(aes(x=A, y=B, color=C)), then I get Error in ggplot(data = df) : could not find function "ggplot"
The two related posts I found did not work for me either.

I had the same problem.
The solution for me was to split the notebook up into different cells.
I assume you're running through https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook and the author doesn't explain where to split cells.
So you need to create a new cell for
%%R -i df
# Plot the DataFrame df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))
(Note the '%%' in front of R. I think it means it's cell magic and not line magic and cell magic has to happen at the very start of a cell.)

Applying different functions and their arguments using a general functions (Ta-Lib in particular)

I would appreciate help with a function that takes a pandas df, a function, the input columns it needs, and its args/kwargs.
import talib
The df is of the form:
                 Open       High        Low      Close   Volume
Date
1993-01-29  43.970001  43.970001  43.750000  43.939999  1003200
1993-02-01  43.970001  44.250000  43.970001  44.250000   480500
1993-02-02  44.220001  44.380001  44.130001  44.340000   201300
The following code works:
def ApplyIndicator(df, data_col, indicator_func,period):
    df_copy = df.copy()
    col_name = indicator_func.__name__
    df_copy[col_name]=df_copy[data_col].apply(lambda x:indicator_func(x,period))
    return df_copy
Sample:
new_df = ApplyIndicator(df,['Close'],talib.SMA,10)
However, I want a general ApplyIndicator that can handle different columns. For example, talib.STOCH takes more than one argument and needs different columns:
slowk, slowd = STOCH(input_arrays, 5, 3, 0, 3, 0, prices=['high', 'low', 'open'])
In this case, how can I write a general ApplyIndicator function that works with any talib function, assuming all required columns are already in df?
Thank you.
More details on the two functions:
SMA(real[, timeperiod=?])
and
STOCH(high, low, close[, fastk_period=?, slowk_period=?, slowk_matype=?, slowd_period=?, slowd_matype=?])

With the original ApplyIndicator, it can be done like this:
def slowk(arr, per):
    return STOCH(arr, 5, 3, 0, 3, 0, prices=['high', 'low', 'open'])[0]
new_df = ApplyIndicator(df,['Close'], slowk, None)
A lambda won't work here because its name is always "<lambda>", but with some smarter column naming lambdas should be fine too.
To make it slightly more elegant, we can accept an arbitrary number of arguments:
def ApplyIndicator(df, indicator_func, *args):
    col_name = indicator_func.__name__
    df[col_name] = df.apply(lambda x:indicator_func(x, *args))
    return df
new_df = ApplyIndicator(df[['Close']], talib.SMA, 10)
new_df = ApplyIndicator(df[...], STOCH, 5, 3, 0, 3, 0, ['high', 'low', 'open'])
But in fact, the whole function is so trivial it might be easier to replace it with a single call like this:
df[['slowk', 'slowd']] = df.apply(
    lambda idx, row: STOCH(row, 5, 3, 0, 3, 0, ['high', 'low', 'open']))
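Another way to generalize this is to pass the input column names explicitly and forward everything else as keyword arguments. The sketch below is not part of the answer above: apply_indicator, rolling_mean and gaps are hypothetical names, and the two stand-in indicators exist only so the example runs without TA-Lib installed. It assumes indicators that follow the function signatures quoted in the question, i.e. one array per input column, returning either one array or a tuple of arrays.

```python
import pandas as pd


def apply_indicator(df, indicator_func, input_cols, output_cols=None, **kwargs):
    """Call indicator_func on the named columns and attach its output(s) to a copy of df."""
    out = df.copy()
    # One float array per requested column, passed positionally in the given order.
    arrays = [out[c].to_numpy(dtype=float) for c in input_cols]
    result = indicator_func(*arrays, **kwargs)
    # Treat single-output and multi-output indicators the same way.
    results = result if isinstance(result, tuple) else (result,)
    if output_cols is None:
        name = indicator_func.__name__
        output_cols = [name] if len(results) == 1 else [f"{name}_{i}" for i in range(len(results))]
    for col, values in zip(output_cols, results):
        out[col] = values
    return out


# Stand-in indicators (hypothetical, not TA-Lib functions).
def rolling_mean(real, timeperiod=2):
    return pd.Series(real).rolling(timeperiod).mean().to_numpy()


def gaps(high, low, close):
    return high - close, close - low


prices = pd.DataFrame({"High": [3.0, 4.0, 5.0], "Low": [1.0, 2.0, 3.0], "Close": [2.0, 3.0, 4.0]})
with_mean = apply_indicator(prices, rolling_mean, ["Close"], timeperiod=2)
with_gaps = apply_indicator(prices, gaps, ["High", "Low", "Close"],
                            output_cols=["upper_gap", "lower_gap"])
print(with_gaps)
```

With TA-Lib available, the same call shape would presumably look like apply_indicator(df, talib.STOCH, ['High', 'Low', 'Close'], output_cols=['slowk', 'slowd'], fastk_period=5), but I have not run that here.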

Python - excel : writing to multiple cells takes time

I'm using win32com.client to write data to an excel file.
This takes too much time (the code below simulates the amount of data I want to update excel with, and it takes ~2 seconds).
Is there a way to update multiple cells (with different values) in one call rather than filling them one by one? or maybe using a different method which is more efficient?
I'm using python 2.7 and office 2010.
Here is the code:
from win32com.client import Dispatch
xlsApp = Dispatch('Excel.Application')
xlsApp.Workbooks.Add()
xlsApp.Visible = True
workSheet = xlsApp.Worksheets(1)
for i in range(300):
    for j in range(20):
        workSheet.Cells(i+1,j+1).Value = (i+10000)*j

A few suggestions:
ScreenUpdating off, manual calculation
Try the following:
xlsApp.ScreenUpdating = False
xlsApp.Calculation = -4135 # manual
try:
    #
    worksheet = ...
    for i in range(...):
        #
finally:
    xlsApp.ScreenUpdating = True
    xlsApp.Calculation = -4105 # automatic
Assign several cells at once
Using VBA, you can set a range's value to an array. Setting several values at once might be faster:
' VBA code
ActiveSheet.Range("A1:D1").Value = Array(1, 2, 3, 4)
I have never tried this from Python, but I suggest trying something like:
worksheet.Range("A1:D1").Value = [1, 2, 3, 4]
A different approach
Consider using openpyxl or xlwt. openpyxl lets you create .xlsx files without having Excel installed; xlwt does the same for .xls files.

Using the range suggestion from the other answer, I wrote this:
from win32com.client import Dispatch
def writeLineToExcel(wsh,line):
    # the column letter is derived from the line length (works for up to 26 columns, A-Z)
    wsh.Range( "A1:"+chr(len(line)+96).upper()+"1").Value=line
xlApp = Dispatch("Excel.Application")
xlApp.Visible = 1
xlDoc = xlApp.Workbooks.Open("test.xlsx")
wsh = xlDoc.Sheets("Sheet1")
writeLineToExcel(wsh,[1, 2, 3, 4])
You may also write multiple lines at once:
def writeLinesToExcel(wsh,lines): # assume that all lines have the same length
    # last column letter comes from the line length, last row number from the number of lines
    wsh.Range( "A1:"+chr(len(lines[0])+96).upper()+str(len(lines))).Value=lines
writeLinesToExcel(wsh,[ [1, 2, 3, 4],
                        [5, 6, 7, 8],
                        [9, 10,11,12],
                        [13,14,15,16],
                        ])

Note that you can easily set ranges via numeric addresses by using the following code:
cl1 = Sheet1.Cells(X1,Y1)
cl2 = Sheet1.Cells(X2,Y2)
Range = Sheet1.Range(cl1,cl2)
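To tie the suggestions together for the data in the question, the whole 300 x 20 grid can be built in Python first and then assigned in one call. The sketch below only covers the pure-Python part (building the grid and the A1-style address); column_letters and block_address are hypothetical helper names, and the actual COM assignment is left as a comment because it needs Excel on Windows.

```python
def column_letters(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 26 -> Z, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def block_address(rows):
    """Return the A1-style address of a block anchored at A1 that fits rows exactly."""
    n_rows = len(rows)
    n_cols = len(rows[0])  # assumes all rows have the same length
    return "A1:{}{}".format(column_letters(n_cols), n_rows)


# Same values as the nested loop in the question, built as a list of rows.
data = [[(i + 10000) * j for j in range(20)] for i in range(300)]
address = block_address(data)
print(address)

# With a worksheet object from win32com, the single assignment would then be:
# workSheet.Range(address).Value = data
```

Unlike the chr()-based version, this helper also handles more than 26 columns.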
