I'm new to Python, so apologies in advance for any dumb questions.
I'm trying to import a CSV as a DataFrame in pandas, do a 'df.groupby' (based on 'mean'), and merge the result with other DataFrames.
The problem is that the values used for the 'mean' are read in as an object dtype:
Plant object
Component int64
PerUnitPrice object >> that's what I'm talking about
dtype: object
I did try converting with '.astype(float)' and got an error,
and with:
price_['PerUnitPrice'] = pd.to_numeric(price_['PerUnitPrice'], errors='coerce')
That worked only partially: it set all the values bigger than 999 to NaN, or at least that's what I think it did.
Here are some lines from the CSV that I'm importing:
csv lines
It looks like a problem with how the data is read in from the .csv. Looking at the image of the .csv you shared, there are thousand separators (commas) in the PerUnitPrice column, which are causing the column to be parsed incorrectly.
In your pd.read_csv call, set thousands=',', and specify the dtype with a dictionary, as in the documentation for read_csv. In your case, I think you'll want {'Plant': str, 'Component': str, 'PerUnitPrice': float}.
A tip in debugging this sort of issue is to use the info() method of a pd.DataFrame to check the dtypes, size, and non-null values of a DataFrame.
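Putting that together, a minimal sketch of the read_csv call (the file name is a placeholder for your actual path):

import pandas as pd

# thousands=',' strips the thousand separators so PerUnitPrice parses as a number;
# dtype pins each column so nothing is silently inferred as object
price_ = pd.read_csv('prices.csv',
                     thousands=',',
                     dtype={'Plant': str, 'Component': str, 'PerUnitPrice': float})

price_.info()  # confirm the dtypes, size, and non-null counts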
Related
While reading a DataFrame into Atoti with the following code, the error shown below occurred.
#Code
global_data=session.read_pandas(df,keys=["Row ID"],table_name="Global_Superstore")
#error
ArrowInvalid: Could not convert '2531' with type str: tried to convert to int64
How do I solve this? Please help.
I was trying to read a DataFrame using Atoti functions.
There are values with different types in that particular column. If you aren't going to preprocess the data and you're fine with that column being read as a string, then you should specify the exact datatypes of each of your columns (or that particular column), either when you load the dataframe with pandas, or when you read the data into a table with the function you're currently using:
import atoti as tt

global_superstore = session.read_pandas(
    df,
    keys=["Row ID"],
    table_name="Global_Superstore",
    types={
        "<invalid_column>": tt.type.STRING,
    },
)
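If you would rather fix it on the pandas side before handing the frame to Atoti, a minimal sketch (the column name is a placeholder, as above):

# force the mixed-type column to a single type before read_pandas
df["<invalid_column>"] = df["<invalid_column>"].astype(str)

global_superstore = session.read_pandas(df, keys=["Row ID"], table_name="Global_Superstore")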
I import an Excel file with pandas, and when I try to convert all columns to float64 for further manipulation, I have several columns whose type looks like:
0
column_name_1 float64
column_name_1 float64
dtype: object
and pandas is unable to do any calculations with them. How can I change this column type to float64?
If you want to change the datatype of columns in pandas, have you tried the astype() function? See the official documentation below for more information and usage examples:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
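For instance, a minimal sketch using the column name from your output (adjust the names to your actual columns):

# convert a single column
df['column_name_1'] = df['column_name_1'].astype('float64')

# or convert several columns at once with a dict
df = df.astype({'column_name_1': 'float64', 'column_name_2': 'float64'})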
I just solved it yesterday: it happened because I had two identical column names in the DataFrame, so when I accessed df['something'] it returned both columns combined, and the result became an object instead of float64.
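To illustrate, a small sketch with made-up data showing what a duplicated column name does:

import pandas as pd

# two columns sharing the same name
df = pd.DataFrame([[1.0, 2.0]], columns=['column_name_1', 'column_name_1'])

sub = df['column_name_1']   # returns a two-column DataFrame, not a Series
print(sub.dtypes)           # each column is float64, but the listing ends with 'dtype: object'

Renaming or dropping one of the duplicates gives you back a single float64 Series.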
My Python script does not run when executed from a bat file, but runs seamlessly in the editor.
The error is related to a datatype difference in a pd.merge call, even though the datatype given to both columns is the same in both dataframes.
df2a["supply"] = df2a["supply"].astype(str)
df2["supply_typ"] = df2["supply_typ"].astype(str)
df2a["supply_typ"] = df2a["supply_typ"].astype(str)
df = pd.merge(df2, df2a, how=join,
              on=['entity_id', 'pare', 'grome', 'buame', 'tame', 'prd', 'gsn',
                  'supply', 'supply_typ'],
              suffixes=['gs2', 'gs2x'])
While running the bat file I am getting the following error in pd.merge:
You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat
Not a direct answer, but contains code that cannot be formatted in a comment, and should be enough to solve the problem.
When pandas says that you are trying to merge on float64 and object columns, it is certainly right. It may not be evident because pandas relies on numpy, and a numpy object column can store any kind of data.
I ended with a simple function to diagnose all those data type problem:
def show_types(df):
    for i, c in enumerate(df.columns):
        print(df[c].dtype, type(df.iat[0, i]))
It shows both the pandas dtype of each column of a dataframe and the actual Python type of the first element of the column. It can help to see the difference between columns containing str elements and others containing datetime.datetime ones, when the dtype of both is just object.
Use that on both of your dataframes, and the problem should become evident...
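For example, a quick sketch of how you might use it on the two dataframes from the merge above:

show_types(df2)
show_types(df2a)
# compare the two outputs line by line: any merge key whose dtype or element type
# differs between the frames needs to be converted before calling pd.merge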
Update: I am using some example code from "Socrata Open Source API." I note the following comment in the code:
# First 2000 results, returned as JSON from API / converted to Python
# list of dictionaries by sodapy.
I am not very familiar with JSON.
I have downloaded a dataset, creating a DataFrame 'df' with a large number of columns.
df = pd.DataFrame.from_records(results)
When I attempt to use the describe() method, I get "TypeError: unhashable type: 'dict'":
df.describe()
...
TypeError: unhashable type: 'dict'
How can I identify the columns which are generating this error?
UPDATE 2:
Per Yuca's request, I include an extract from the df:
I came across the same problem today and did a bit of research about different versions of pyarrow. Here I found that in the past (<0.13), pyarrow would write real columns of data for the index, with names. In the most recent versions of pyarrow there is no column data, but a range-index metadata marker instead. It means parquet files produced with a newer version of pyarrow can't be read by older versions.
pandas 0.25.3 is OK with reading JSON containing dicts; apparently pandas 1.0.1 is not.
df = pd.read_json(path, lines=True)
TypeError: unhashable type: 'dict'
The above is thrown by pandas 1.0.1 for the same file that works in pandas 0.25.3.
The issue is tracked and apparently fixed in master, which I suppose will make it into the next version.
Thanks to the user community (h/t G Anderson), I pieced together a solution:
for i in df.columns:
    if (df[i].transform(type) == dict).any():
        df = df.drop(i, axis=1)
(df[i].transform(type) == dict).any() checks every element in column i, and the column is dropped if any element is of type dict.
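If you only want to identify the offending columns without dropping them, a small variation on the same idea:

dict_cols = [c for c in df.columns if (df[c].transform(type) == dict).any()]
print(dict_cols)  # columns whose elements include dicts, which is what trips up describe()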
Thanks to all!
I'm using a csv file from Excel to create a pandas data frame. Recently, I've encountered several ValueError messages regarding the dtypes of each column in the dataframe.
This is the most recent exception raised:
ValueError: could not convert string to float: 'OH'
After running pandas' dtypes method on my data frame, it shows that this particular column addr_state is an object, not a float.
I've pasted all my code below for clarification:
work_path = 'C:\\Users\\Projects\\loans.csv'
unfiltered_y_df = pd.read_csv(work_path, low_memory=False, encoding='latin-1')
print(unfiltered_y_df.dtypes)
filtered_y_df = unfiltered_y_df.loc[unfiltered_y_df['loan_status'].isin(['Fully Paid', 'Charged Off', 'Default'])]
X = StandardScaler().fit_transform(filtered_y_df[[column for column in filtered_y_df]])
Y = filtered_y_df['loan_status']
Also, is it possible to explicitly write out the dtypes for each column? Right now I feel like that's the only way to solve this. Thanks in advance!
So there are two issues here, I think:
To print out the types for each column, just use the dtypes (or the now-deprecated ftypes) method:
i.e.
unfiltered_y_df.dtypes
You say 'addr_state' is an object, not a float. Well, that is the problem: StandardScaler() will only work on numeric data, so it is trying to coerce your state 'OH' to a float and can't, hence the error.
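A minimal sketch of one way around it, assuming you only want the numeric columns scaled (the selection is illustrative, not your exact preprocessing):

from sklearn.preprocessing import StandardScaler

numeric_df = filtered_y_df.select_dtypes(include='number')  # drops object columns such as addr_state
X = StandardScaler().fit_transform(numeric_df)
Y = filtered_y_df['loan_status']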