KeyError: 'data' when loading an RData file into Python using pyreadr

Until now I have been able to use the pyreadr package to load RData files into Python. This time, however, I keep getting a KeyError: 'data'.
# load the data set:
rdata_read = pyreadr.read_r("/content/GrowthData.rda")
data = rdata_read[ 'data' ]
n = data.shape[0]
Where does the error come from?
Furthermore, I found out that the loaded object is of type collections.OrderedDict, which is new to me and has never happened before. Consequently, I tried to convert it to a pandas DataFrame, but that fails as well with the error "must pass a 2-D array". Hence, I am very confused right now and don't know how I can access this data in Python and work with it. Appreciate any help!

Pyreadr's read_r function always returns an OrderedDict (think of it as a regular Python dictionary; the distinction was important in older versions of Python, not anymore), where each key is the name of the object (dataframe) as it was set in R, and the value is the dataframe itself. You can read about this in the README.
The reason it returns a dictionary is that an RData file can hold multiple objects (dataframes), so pyreadr needs a way to return several dataframes that you can recognize by their names.
In R you would do:
save(dataframe1, dataframe2, file="GrowthData.rda")
What I would suggest you do in Python is, after you have read the data, explore what keys you have in there:
# load the data set:
rdata_read = pyreadr.read_r("/content/GrowthData.rda")
print(rdata_read.keys())
# would print dataframe1, dataframe2 in the above example
This will tell you which objects have been saved in the RData file, and you can then retrieve them as you were doing before:
data = rdata_read['dataframe1']
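If you don't know the object's name in advance, you can also just take the first entry of the dictionary; the values are already pandas DataFrames, so no further conversion is needed. A minimal sketch, assuming the file contains at least one dataframe:
import pyreadr

rdata_read = pyreadr.read_r("/content/GrowthData.rda")

# grab the first (or only) object, whatever its name is
first_key = list(rdata_read.keys())[0]
data = rdata_read[first_key]

print(type(data))    # <class 'pandas.core.frame.DataFrame'>
print(data.shape)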

Related

How to write/read a (pickled) array of strings in sql using df.to_sql?

I'm trying to append scraped data to a database while the scrape is running.
Some of the columns contain an array of multiple strings, which are to be saved and post-processed into dummies after the scrape,
e.g. tags = array(['tag1','tag2']).
However, writing the arrays to the database and reading them back doesn't work.
I've tried different storage methods (CSV, pickling, HDF) and all of them fail for different reasons
(mainly problems with appending to a central database and with lists being stored as strings).
I also tried different databases (MySQL and Postgres) and the ARRAY dtype, but that requires an array of fixed, known-beforehand length.
From what I gather, I can go the JSON route or the pickle route.
I chose the pickle route since I don't need the database to do anything with the contents of the array.
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import PickleType, String

df = pd.DataFrame([], columns=['Name', 'Tags'])
df['Tags'] = np.array(['tag1', 'tag2'], dtype='<U8')   # the array column to be pickled (was assigned to 'Price' by mistake)
type_dict = {'Name': String, 'Tags': PickleType}
engine = create_engine('sqlite://', echo=False)
df.to_sql('test', con=engine, if_exists='append', index=False, dtype=type_dict)
df2 = pd.read_sql_table('test', con=engine)
expected output:
df2['Tags'].values
array(['tag1','tag2'], dtype='<U8')
actual output:
df2['Tags'].iloc[0]
b'\x80\x04\x95\xa4\x00\x00\x00\x00\x00\x00\x00\x8c\x15numpy.core.multiarray\x94\x8c\x0c_reconstruct\x94\x93\x94\x8c\x05numpy\x94\x8c\x07ndarray\x94\x93\x94K\x00\x85\x94C\x01b\x94\x87\x94R\x94(K\x01K\x01\x85\x94h\x03\x8c\x05dtype\x94\x93\x94\x8c\x02U8\x94K\x00K\x01\x87\x94R\x94(K\x03\x8c\x01<\x94NNNK K\x04K\x08t\x94b\x89C \xac \x00\x00\xac \x00\x00 \x00\x00\x00-\x00\x00\x00 \x00\x00\x00\xac \x00\x00\xac \x00\x00\xac \x00\x00\x94t\x94b.'
So something has gone wrong during pickling, and I can't figure out what.
Edit:
Okay, so np.loads(df2['Tags'].iloc[0]) gives the original array back. Is there a way to pass this to read_sql_table, so that I immediately get the "original" dataframe back?
So the problem occurs during reading: the arrays are pickled on the way in, but they are not automatically unpickled when read back. There is no way to pass dtype to read_sql_table, right?
import numpy as np

def unpickle_the_pickled(df):
    df2 = df.copy()
    for col in df.columns:
        # columns that came back as raw bytes were pickled on the way in
        if type(df[col].iloc[0]) == bytes:
            # np.loads is an alias for pickle.loads (removed in newer NumPy versions; use pickle.loads there)
            df2[col] = df[col].apply(np.loads)
    return df2
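Used like this (a quick sketch, assuming the engine and 'test' table from the snippet above):
df2 = unpickle_the_pickled(pd.read_sql_table('test', con=engine))
df2['Tags'].iloc[0]   # the original numpy array again instead of pickle bytes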
finally solved it, so happy!

How to pass a Python DataFrame to UiPath?

I'm calling a Python function using the UiPath Python Activities Pack (Get Python Object) and it returns a DataFrame in order to use it within UiPath. Unfortunately, UiPath is not able to convert the DataFrame to a .NET data type like a DataTable.
Even when I try to convert the DataFrame to any other format (string, numpy array, HTML etc.) it does not work, although the documentation explicitly mentions that all data types are supported. The Python script does its work and stores the content of the DataFrame in an Excel file, and I could, of course, just read the Excel file. I was just wondering whether there is a way to pass the data directly to UiPath instead of saving it first and reading it again.
Actually, I spent quite some time on this but finally figured out how to pass the pandas DataFrame to UiPath and make it available there as a DataTable. Here is how I did it:
Python Script:
I let the Python function that I call in the UiPath 'Invoke Python Method' activity return the pandas DataFrame as a JSON string, i.e.
return df.to_json(orient='records')
Get Python Object:
Save the JSON string in a variable of type string
Deserialize JSON:
Choose 'System.Data.DataTable' as TypeArgument and store the result in a variable of type dataTable
Now the data from the pandas DataFrame is available in a .NET DataTable in UiPath.
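A minimal sketch of the Python side (the function name and sample data are made up; the relevant part is the orient='records' JSON, which Deserialize JSON can map onto a DataTable):
import pandas as pd

def get_table_as_json():
    # build or load the DataFrame however your script already does
    df = pd.DataFrame({'Name': ['a', 'b'], 'Value': [1, 2]})
    # orient='records' -> '[{"Name":"a","Value":1},{"Name":"b","Value":2}]'
    return df.to_json(orient='records')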

unable to parse a column of json strings in modin dataframe (works in pandas)

I have a dataframe of json strings I want to convert to json objects.
df.col.apply(json.loads) works fine for pandas, but fails when using modin dataframes.
example:
import pandas
import modin.pandas
import json
pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(json.loads)
0 {}
Name: a, dtype: object
modin.pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(json.loads)
TypeError: the JSON object must be str, bytes or bytearray, not float
This issue was also raised on GitHub, and was answered here: https://github.com/modin-project/modin/issues/616
The error is coming from the error checking component of the run, where we call the apply (or agg) on an empty DataFrame to determine the return type and let pandas handle the error checking (Link).
Locally, I can reproduce this issue and have fixed it by changing the line to perform the operation on one line of the Series. This may affect the performance, so I need to do some more tuning to see if there is a way to speed it up and still be robust. After the fix the overhead of that check is ~10ms for 256 columns and I don't think we want error checking to take that long.
Until the fix is released, it is possible to work around this issue by using code that also works for empty data, for example:
import json

def safe_loads(x):
    try:
        return json.loads(x)
    except Exception:
        # modin probes apply() on an empty frame, which passes floats (NaN) here
        return None
modin.pandas.DataFrame.from_dict({'a': ['{}']}).a.apply(safe_loads)

storing python list into mysql and accessing it

How can I store Python list values in MySQL and access them later from the same database like a normal list?
I tried storing the list as a VARCHAR and it did store it. However, when accessing the data from MySQL I couldn't treat the stored value as a list; instead it behaves like a string, so accessing the list by index was no longer possible. Is it perhaps easier to store some data using the SET datatype? I see the MySQL datatype SET, but I'm unable to use it from Python. When I try to store a Python set into MySQL, it throws the following error: 'MySQLConverter' object has no attribute '_set_to_mysql'. Any help is appreciated.
P.S. I have to store the coordinates of an image in the list along with the image number, so it is going to be in the form [1,157,421].
Use a serialization library like json:
import json
l1 = [1,157,421]
s = json.dumps(l1)
l2 = json.loads(s)
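To put this together with MySQL, store the JSON string in a VARCHAR/TEXT column and parse it when you read it back. A rough sketch using MySQL Connector/Python (the table name, column size and credentials are placeholders):
import json
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user='user', password='secret', database='testdb')
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS coords (id INT PRIMARY KEY, data VARCHAR(255))")

# list -> JSON string on the way in
cur.execute("INSERT INTO coords (id, data) VALUES (%s, %s)", (1, json.dumps([1, 157, 421])))
conn.commit()

# JSON string -> list on the way out
cur.execute("SELECT data FROM coords WHERE id = %s", (1,))
coords = json.loads(cur.fetchone()[0])
print(coords[1])   # 157, indexing works again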
Are you using an ORM like SQLAlchemy?
Anyway, to answer your question directly, you can use json or pickle to convert your list to a string and store that. Then to get it back, you can parse it (as JSON or a pickle) and get the list back.
However, if your list is always a 3 point coordinate, I'd recommend making separate x, y, and z columns in your table. You could easily write functions to store a list in the correct columns and convert the columns to a list, if you need that.
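A minimal sketch of that idea, assuming a table with separate x, y and z columns as suggested above (the table and helper names are made up):
def list_to_params(lst):
    # [1, 157, 421] -> values for the three columns
    x, y, z = lst
    return (x, y, z)

def row_to_list(row):
    # a fetched (x, y, z) row -> a plain Python list again
    return list(row)

# e.g. cur.execute("INSERT INTO points (x, y, z) VALUES (%s, %s, %s)", list_to_params([1, 157, 421]))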

Store repetitive data in python?

I'm working on a small task with an Excel sheet and Python. The problem I'm facing is that I have a few lines of code that perform string manipulation on the data I fetch from the sheet. Since I have plenty of sheets, and sometimes only a limited number of them are needed while other times the whole workbook is, I can't write the same code everywhere. So I thought of performing the operation once and storing it as oldvalue : newvalue, so that whenever I read oldvalue again I don't have to do the manipulation, just fetch the newvalue. I tried using a dictionary, which seems the best way to do it, but the problem is that my keys and values can both repeat, and I don't want to overwrite my previous entry. As far as I know, that can't be achieved with a dictionary. So what I'm asking is: is there some different data type to store this in? Do we actually need one? Can you help me figure out a way to solve it without using any special data type?
EDIT:
The point is that I'm getting the data from an Excel sheet and performing string manipulation on it. Sometimes the key and the value repeat, and since I'm using a dictionary, it updates the previous value, which I don't want.
This will check if your dictionary contains a value for a specified key. If not, you can manipulate your string and save it for that key. If it does, it will grab that value and use it as your manipulated string.
""" Stuff is done. New string to manipulated is found """
if key not in dict:
value = ... #manipulated string
dict[key] = value
else:
manipulated_string = dict[key] #did this before, have the value already
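Wrapped into a small helper, the same idea looks like this (manipulate() is a stand-in for whatever string manipulation you already do):
cache = {}

def manipulate(s):
    # placeholder for your actual string manipulation
    return s.strip().upper()

def get_manipulated(old_value):
    # compute once, reuse on every later occurrence of the same old value
    if old_value not in cache:
        cache[old_value] = manipulate(old_value)
    return cache[old_value]

print(get_manipulated(' abc '))   # computed
print(get_manipulated(' abc '))   # fetched from the cache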
