Reading a DataFrame in atoti - python

While reading a DataFrame into atoti with the following code, the error shown below occurs.
#Code
global_data=session.read_pandas(df,keys=["Row ID"],table_name="Global_Superstore")
#error
ArrowInvalid: Could not convert '2531' with type str: tried to convert to int64
How can I solve this?
I was trying to read a DataFrame using atoti functions.

That column contains values of different types. If you aren't going to preprocess the data and you're fine with that column being read as a string, then you should specify the exact datatypes of your columns (or of that particular column), either when you load the DataFrame with pandas or when you read the data into a table with the function you're currently using:
import atoti as tt

global_superstore = session.read_pandas(
    df,
    keys=["Row ID"],
    table_name="Global_Superstore",
    types={"<invalid_column>": tt.type.STRING},
)
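Alternatively, you can force the type when loading the data with pandas itself. A minimal sketch, assuming the mixed-type column is called "Postal Code" and the data comes from a CSV file (both names are placeholders; use whatever column the error actually points at in your dataset):
import pandas as pd

# Read the problematic column as strings up front so every value has one type.
df = pd.read_csv("global_superstore.csv", dtype={"Postal Code": str})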

Related

Why Doesn't Python Recognize the Column Name (KeyError)

I imported stock/options data into a DataFrame and want to use pandas to manually filter for specific criteria. I renamed a few columns, and later on I tried to do a bit of cleaning so I can work with the data.
I tried to replace percentage signs then convert the data type to a float by doing this:
df = df['IV'].str.rstrip("%").astype(float)
df = df['IV_Rank'].str.rstrip("%").astype(float)/100
df = df['IV PCT'].str.rstrip("%").astype(float)/100
When I run that code I get the error message KeyError: 'IV'. I get the same error for the other columns when I run each line independently, even after copy-pasting the column names and trying the old names. I'm not sure what to do, so some help would be appreciated.
That's because you are overwriting the entire dataframe. This is what I think you are trying to do:
df['IV'] = df['IV'].str.rstrip("%").astype(float)
df['IV_Rank'] = df['IV_Rank'].str.rstrip("%").astype(float)/100
df['IV PCT'] = df['IV PCT'].str.rstrip("%").astype(float)/100
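If there are several percentage columns, the same cleanup can be written as a short loop; a sketch using the question's column names (adjust which columns get divided by 100):
# Strip the "%" sign and convert each percentage column to a float.
for col in ["IV_Rank", "IV PCT"]:
    df[col] = df[col].str.rstrip("%").astype(float) / 100
df["IV"] = df["IV"].str.rstrip("%").astype(float)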

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed data preprocessing with pandas, and now I want to convert my preprocessing function from pandas to pyspark. But when reading the CSV file with pyspark, a lot of values in a column that actually has data become null. If I perform any operation on this dataframe, it swaps the values of that column with other columns. I also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark (screenshot omitted): the "property_type" column shows null values, even though the actual dataframe has values there.
CSV file (screenshot omitted): the values are present in the source data.
pyspark does work fine with small datasets, though.
In our case we faced a similar issue. Things you need to check:
Check whether your data contains " (double quotes); pyspark would consider it a delimiter and the data would look malformed.
Check whether your CSV data has multiline values.
We handled this situation with the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline", "true").csv(schema_file_location)
Are you limited to the CSV file format?
Try parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.
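A minimal sketch of that approach, assuming an existing SparkSession named spark, a parquet engine such as pyarrow installed, and a placeholder file name:
# Write the pandas DataFrame to parquet instead of CSV...
df.to_parquet("data.parquet", index=False)

# ...and read it back in Spark; column types survive the round trip.
spark_df = spark.read.parquet("data.parquet")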

Trying to convert a column with strings to float via Pandas

Hi, I have looked on Stack Overflow but have not found a solution to my problem. Any help is highly appreciated.
After importing a CSV I noticed that all the column types are object rather than float.
My goal is to convert all the columns except the YEAR column to float. I have read that you first have to strip the columns to take blanks out, then convert NaNs to 0, and then try to convert the strings to floats. But with the code below I'm getting an error.
My code in my Jupyter notebook is:
And I get the following error.
How do I have to change the code?
All the columns except the YEAR column have to be set to float.
If you could also help me set the YEAR column to datetime, that would be very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
The easiest would be:
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so df['BBPWN'] returns a dataframe with both of those columns, and df['BBPWN'].str then fails.
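For the full cleanup described in the question, after renaming the duplicate BBPWN column, something like the following sketch should work (format="%Y" assumes YEAR holds four-digit years):
import pandas as pd

num_cols = [c for c in df.columns if c != "YEAR"]
for col in num_cols:
    # Strip blanks, coerce anything non-numeric to NaN, then fill with 0.
    df[col] = pd.to_numeric(df[col].astype(str).str.strip(), errors="coerce").fillna(0)

# Parse the YEAR column as a datetime.
df["YEAR"] = pd.to_datetime(df["YEAR"], format="%Y")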

How to write a dataframe to csv with a single-row header (5k columns)?

I am trying to export a pandas dataframe with to_csv so it can be processed by another tool before using it again with Python. It is a token dataset with 5k columns. When exported, the header is split across two rows. This might not be an issue for pandas, but in this case I need to export a CSV with a single header row. Is this a pandas limitation or a CSV format one?
So far, searching has returned no usable results. The only solution I came up with is writing the column names and the values separately, e.g. writing a list of column names first and then a numpy array to the CSV. Can this be implemented, and if so, how?
For me this problem was caused by a multi-level column index. The easiest way to resolve the issue is to specify your own headers. I found references to an option called tupleize_cols, but it no longer exists in current (1.2.2) pandas.
I was using the following aggregation:
df.groupby(["device"]).agg({
    "outage_length": ["count", "sum"],
}).to_csv("example.csv")
This resulted in the following csv output:
,outage_length,outage_length
,count,sum
device,,
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
I specified my own headers in the call to to_csv, excluding my group-by column, as follows:
}).to_csv("example.csv", header=("flaps", "downtime"))
And got the following csv output, which was much more pleasing to spreadsheet software:
device,flaps,downtime
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
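A more general alternative is to flatten the multi-level columns before writing, so any number of aggregations exports with a single header row. A sketch based on the same aggregation:
agg = df.groupby(["device"]).agg({"outage_length": ["count", "sum"]})
# Join each (column, aggregation) pair into one name, e.g. outage_length_count.
agg.columns = ["_".join(col) for col in agg.columns]
agg.to_csv("example.csv")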

Error inserting values into db with psycopg2 module

I am attempting to insert a dataframe into my Postgres database using the psycopg2 module together with SQLAlchemy. The process loads an Excel file into a pandas dataframe and then inserts the dataframe into the database via the predefined table schema.
I believe these are the relevant lines of code:
post_meta.reflect(schema="users")
df = pd.read_excel(path)
table = sql.Table(table_name, post_meta, schema="users")
dict_items = df.to_dict(orient='records')
connection.execute(table.insert().values(dict_items))
I'm getting the following error:
<class 'sqlalchemy.exc.ProgrammingError'>, ProgrammingError("(psycopg2.ProgrammingError) can't adapt type 'numpy.int64'",)
All data field types in the dataframe are int64.
I can't seem to find a similar question or any information about what this error means and why it occurs.
Any direction would be great.
Thanks
Looks like you're trying to insert numpy integers, and psycopg2 doesn't know how to handle those objects. You need to convert them to normal Python integers first. Maybe try calling the int() function on each value, and please provide more context with code if that fails.
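A minimal sketch of that conversion, keeping the rest of the question's code unchanged:
import numpy as np

# Convert numpy integers to plain Python ints before building the insert.
dict_items = [
    {key: (int(value) if isinstance(value, np.integer) else value)
     for key, value in row.items()}
    for row in df.to_dict(orient="records")
]
connection.execute(table.insert().values(dict_items))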
I also ran into this error, and then realized that I was trying to insert integer data into a SQLAlchemy Numeric column, which maps to float, not int. Changing the offending DataFrame column to float did the trick for me:
df[col] = df[col].astype(float)
Perhaps you are also trying to insert integer data into a non-integer column?
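If you're not sure which columns are affected, you can inspect the reflected table and cast the matching DataFrame columns; a sketch using the question's table object ("amount" is a hypothetical column name):
# Print the database-side type of every column in the reflected table.
for column in table.columns:
    print(column.name, column.type)

# Cast DataFrame columns that target Numeric/Float columns to float.
df["amount"] = df["amount"].astype(float)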
