rpy2 and pandas: PandasError: DataFrame constructor not properly called - python

I am trying to create a pandas DataFrame from an R Dataframe. I am encountering the following error, which I cannot figure out.
Traceback (most recent call last):
File "", line 1, in
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 291, in __init__
raise PandasError('DataFrame constructor not properly called!')
PandasError: DataFrame constructor not properly called!
The code I am using is:
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import r
robjects.r['load']("file.RData")
my_data = pd.DataFrame(r['ops.data'])
and the error comes after the last line.

You need to read the data in sequentially using a for loop. DataFrames don't easily read in data the way you are representing it; they are much better suited to dictionaries. Write the headers first, and then write the data underneath the headers.
Furthermore, r['ops.data'] does not name a column header: it looks up the whole R object called "ops.data", and the DataFrame constructor cannot take that object as it is.
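A minimal sketch of the dictionary approach mentioned above, assuming the column names and column values have already been pulled out of the R object. The names column_names and column_values and the sample numbers are hypothetical stand-ins, not real rpy2 output:

```python
import pandas as pd

# Hypothetical stand-ins for the headers and columns of the R data frame.
column_names = ["id", "score"]
column_values = [[1, 2, 3], [0.5, 0.7, 0.9]]

# Build a dictionary that maps each header to the data underneath it.
data = {}
for name, values in zip(column_names, column_values):
    data[name] = list(values)

# A dict of equal-length lists is an input the DataFrame constructor accepts.
my_data = pd.DataFrame(data)
print(my_data)
```

How the names and values are extracted from the R object depends on the rpy2 version, so that step is deliberately left out here.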

Related

Pandas version 0.22.0 - drop_duplicates() got an unexpected keyword argument 'keep'

I am trying to drop duplicates in my dataframe using drop_duplicates(subset=[''], keep=False). It works fine in my Jupyter Notebook, but when I execute it from the terminal as a .py file I get the error below:
Traceback (most recent call last):
File "/home/source/fork/PySpark_Analytics/Notebooks/Krish/beryllium_pandas.py", line 54, in <module>
dffsamelname = dffsameflname.drop_duplicates(subset=['INDIVIDUAL_LASTNAME'], keep=False)
File "/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 88, in wrapper
return func(*args, **kwargs)
TypeError: drop_duplicates() got an unexpected keyword argument 'keep'
I checked that the pandas version is >0.18, as I understand keep=False was introduced by then.
# Trying to drop both the records with same last name
dffsamelname = dffsameflname.drop_duplicates(subset=['INDIVIDUAL_LASTNAME'], keep=False)
I want both of the duplicate records to be dropped, hence keep=False is necessary.
It works fine if I remove keep=False.
It may be that your object is not a native pandas DataFrame but a PySpark DataFrame instead.
According to http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates, subset seems to be the only argument accepted there. Can you add your imports and the lines where you create the dataframe?
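One way to check this guess is to print, from the .py script itself, which pandas version is loaded and what kind of object the dataframe really is. The sketch below uses a small made-up frame and a hypothetical helper called describe_frame; the column name is taken from the question, the data is not:

```python
import pandas as pd


def describe_frame(obj):
    """Return the module and class name of obj, to tell pandas from pyspark."""
    return f"{type(obj).__module__}.{type(obj).__name__}"


# Made-up data: two records share a last name, one does not.
df = pd.DataFrame(
    {"INDIVIDUAL_LASTNAME": ["Smith", "Smith", "Jones"], "ID": [1, 2, 3]}
)

print(pd.__version__)      # the pandas version this interpreter actually uses
print(describe_frame(df))  # starts with "pandas" for a pandas DataFrame

# keep=False drops every record whose last name occurs more than once.
unique_only = df.drop_duplicates(subset=["INDIVIDUAL_LASTNAME"], keep=False)
print(unique_only)
```

If the version printed by the script differs from the one printed in the notebook, the terminal is picking up a different Python environment.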

Facing Weird Issue using pandas-0.24.2

I have been using python/pandas on Windows 10 since last month and did not face the issue below until it suddenly appeared. I have a csv file that is read with pandas. However, the dataframe arbitrarily joins the comma-separated headings into one and, while doing so, drops a few characters. As a result, the code, though very simple, fails. Has anyone seen this kind of problem? Suggestions to overcome it would be of great help.
I was trying to check that the dates are in 'yyyy-mm-dd' format. Since I got the error, I added a print statement to check the column names.
I reinstalled Python 3.6.8, pandas, etc., but that did not help.
import pandas as pd
df = pd.read_csv('Data.csv','r')
print(df.columns)
for pdt in df.PublicDate:
    try:
        dat = pdt[0:10]
        if dat[4] != '-' or dat[7] != '-':
            print('\nPub Date Format Error',dat)
    except TypeError as e:
        print(e)
Test Data csv file has:
PIC,PublicDate,Version,OriginalDate,BPD
ABCD,2019-06-15T19:25:22.000000000Z,1,2019-06-1519.25.22.0000000000,15-06-2019
EFGH,06/15/2019T19:26:22.000000000Z,,2019-06-1519.26.22.0000000000,15-06-2019
IJKL,2019-06-15T20:26:22.000000000Z,1,2019-06-1520.26.22.0000000000,6/25/2019
MNOP,,,2019-06-1520.26.22.0000000000,6/25/2019
QRST,2019-06-15T22:26:22.000000000Z,1,,6/25/2019
Expected:
dates of the format 6/25/2019 should be pointed out for not being in the format 2019-06-25
Actual Result: Below Error
=============== RESTART: H:\Python\DateFormat.py ===============
Index(['PIC,PublicDate,Ve', 'sion,O', 'iginalDate,BPD'], dtype='object')
Traceback (most recent call last):
File "H:\Program Files\Python\DateFormat.py", line 8, in <module>
for pdt in df.PublicDate:
File "G:\Program Files\lib\site-packages\pandas\core\generic.py", line 5067, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'PublicDate'
The problem is in the second parameter:
df = pd.read_csv('Data.csv','r')
Without it the example works fine:
df = pd.read_csv('Data.csv')
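This happens because the second parameter of read_csv is the separator (sep), not a file access mode. With 'r' as the separator, pandas can still read the file, but it splits every line on the letter "r" instead of on commas. That is why the header came out as 'PIC,PublicDate,Ve', 'sion,O', 'iginalDate,BPD' and there is no PublicDate column.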
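A small sketch that reproduces the effect, assuming a few rows of the test data held in a string instead of Data.csv. The helper bad_dates is a hypothetical name for the format check, and sep is passed by keyword because newer pandas releases no longer accept it positionally:

```python
import io

import pandas as pd

# A few rows of the test data from the question, held in memory.
CSV_TEXT = """PIC,PublicDate,Version,OriginalDate,BPD
ABCD,2019-06-15T19:25:22.000000000Z,1,2019-06-1519.25.22.0000000000,15-06-2019
EFGH,06/15/2019T19:26:22.000000000Z,,2019-06-1519.26.22.0000000000,15-06-2019
MNOP,,,2019-06-1520.26.22.0000000000,6/25/2019
"""

# Using the letter "r" as separator splits the header on every "r".
broken = pd.read_csv(io.StringIO(CSV_TEXT), sep="r")
print(list(broken.columns))

# With the default separator (a comma) the columns come out as expected.
fixed = pd.read_csv(io.StringIO(CSV_TEXT))
print(list(fixed.columns))


def bad_dates(series):
    """Return the non-missing values that do not start with yyyy-mm-dd."""
    values = series.dropna().astype(str)
    ok = values.str.match(r"\d{4}-\d{2}-\d{2}")
    return values[~ok].tolist()


print(bad_dates(fixed["PublicDate"]))
```

Dropping the missing values first also avoids the TypeError that slicing a NaN would raise in the original loop.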

Open and edit excel via python

I want to import an existing Excel file and edit it. But when I copy the Excel file and try to edit the copy, I get some errors. I did not get errors while executing the "write" command, but when I try to read values from a cell, I run into a problem.
import xlsxwriter
from xlrd import open_workbook
from xlwt import Workbook, easyxf
import xlwt
from xlutils.copy import copy
workbook=open_workbook("month.xlsx")
sheet=workbook.sheet_by_index(0)
print sheet.nrows
book = copy(workbook)
w_sheet=book.get_sheet(0)
print w_sheet.cell(0,0).value
Error: Traceback (most recent call last):
File "excel.py", line 18, in <module>
print w_sheet.cell(0,0).value
AttributeError: 'Worksheet' object has no attribute 'cell'
I haven't used this library, but looking at the documentation I think you are trying to do something it doesn't support. The worksheet documentation lists its functionality and cell() is not there.
I think this library is for writing excel only, not reading.
Perhaps try pandas read_excel() to read the excel documents you create?
You can then use pandas iloc on the resulting dataframe to get the value you want:
value=pd.read_excel("file.xlsx", sheet_name="sheet").iloc[0,0]
I think that's correct, although I can't run the code to check just now...

genfromtxt names changes the shape of the array

I am importing a decent amount of data from a CSV file. The original CSV file has around 40 columns; I am trimming it down and want to use my own column names (as shown below).
import numpy as np
import math
import sys
import matplotlib.pyplot as plt
print(sys.argv)
data1 = np.genfromtxt(
    sys.argv[1],
    skip_header=10,
    skip_footer=1,
    delimiter=',',
    usecols=(0,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26,27),
    names=['line','TPS','mph','rpm','load','engtemp','trq','month','day','T1','T2','T3','T4','T5','T6','T7','T8','T9','T10','T11','T12','T13','T14','T15','T16'])
Unfortunately this changes the shape of the array, and I don't know how to compensate for it, nor do I know how to refer to the columns by name. Originally I was hoping that it would save each column as its own list with the name specified. If you know of a function that does that, that would be neat!
So, the questions.
Why does the following not work when I try to reference a column for plotting
plt.plot(data1[:,9])
but when I remove the names argument from genfromtxt it works fine?
and
is there a way to do something like?
plt.plot(data1[T1])
Could it be because the 'names' argument is trying to name the rows, not the columns?
When names is used I get the error
Traceback (most recent call last):
File "HeatReport.py", line 48, in <module>
print(data1[3,9])
IndexError: too many indices for array
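For what it's worth, a minimal sketch of what the names argument does, assuming a tiny three-column input held in a string rather than the real file. With names, genfromtxt returns a one-dimensional structured array (one record per row), so a column is selected by its field name as a string and not by a second index:

```python
import io

import numpy as np

# Tiny made-up input: three rows, three comma-separated columns.
CSV_TEXT = "1,10.0,100.0\n2,20.0,200.0\n3,30.0,300.0\n"

# Without names: an ordinary 2-D float array, indexed as [row, column].
plain = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=",")
print(plain.shape)

# With names: a 1-D structured array whose fields are the columns.
named = np.genfromtxt(
    io.StringIO(CSV_TEXT), delimiter=",", names=["line", "TPS", "T1"]
)
print(named.shape)
print(named.dtype.names)

# Select a whole column by field name, or one value by row and field name.
t1_column = named["T1"]
one_value = named[1]["T1"]
print(t1_column, one_value)
```

Under the same assumption, plt.plot(data1['T1']) (with the name quoted) would plot that column, while data1[:,9] fails because the named array has only one dimension.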

How to read HDF table from pandas?

I have a my_file.h5 file that, presumably, contains data in HDF5 format (PyTables). I try to read this file using pandas:
import pandas as pd
store = pd.HDFStore('my_file.h5')
Then I try to use the store object:
print store
As a result I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.7/pandas/io/pytables.py", line 133, in __repr__
kind = v._v_attrs.pandas_type
File "/usr/lib/python2.7/dist-packages/tables/attributeset.py", line 302, in __getattr__
(name, self._v__nodePath)
AttributeError: Attribute 'pandas_type' does not exist in node: '/data'
Does anybody know what I am doing wrong? Can the problem be caused by the fact that my *.h5 is not really what I think it is (not data in HDF5 format)?
In your /usr/lib/pymodules/python2.7/pandas/io/pytables.py, line 133
kind = v._v_attrs.pandas_type
In my pytables.py I see
kind = getattr(n._v_attrs,'pandas_type',None)
By using getattr, if there is no pandas_type attribute, then kind is set to None. I'm guessing my version of Pandas
In [7]: import pandas as pd
In [8]: pd.__version__
Out[8]: '0.10.0'
is newer than yours. If so, the fix is to upgrade your pandas.
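The difference between the two lines can be seen with plain Python. In this sketch the Attrs class is a hypothetical stand-in for a node's attribute set that has no pandas_type; it is not the real PyTables class:

```python
class Attrs:
    """Hypothetical stand-in for an attribute set without pandas_type."""

    title = "raw table"


attrs = Attrs()

# Direct attribute access raises when the attribute is missing.
try:
    kind = attrs.pandas_type
except AttributeError:
    kind = "raised AttributeError"

# getattr with a default returns None instead of raising.
safe_kind = getattr(attrs, "pandas_type", None)

print(kind, safe_kind)
```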
I had an h5 table made with pytables independently of pandas, and I needed to turn it into a list of tuples and then import it into a df. This worked nicely because it let me use my pytables index to run a "where" on the input, which saves me from reading all the rows.
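A rough sketch of the list-of-tuples step, assuming the rows have already been read out of the table. The rows, the column names and the filter condition here are all made up, and the list comprehension only stands in for the "where" that pytables would run against its index:

```python
import pandas as pd

# Hypothetical rows, standing in for what a table query would yield.
rows = [(1, "a", 0.5), (2, "b", 1.5), (3, "c", 2.5)]

# Keep only the rows that satisfy the condition (a plain-Python "where").
selected = [row for row in rows if row[2] > 1.0]

# Turn the list of tuples into a DataFrame with explicit column names.
df = pd.DataFrame.from_records(selected, columns=["id", "label", "value"])
print(df)
```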
