PySpark causes mismatched columns when reading from CSV - python

Edit: The previous problem was solved by setting the argument multiLine to True in the spark.read.csv function. However, I discovered another problem when using spark.read.csv, this time with a different csv file from the same dataset described in the question, a review dataset from insideairbnb.com.
The csv file is like this:
But the output of the read.csv function concatenated several lines together and generated a weird format:
Any thoughts? Thank you for your time.
The following problem was solved by specifying the argument multiLine in the spark.read.csv function. The root cause was that one of the columns contained \r\n\n\r sequences, which the function treated as line separators rather than part of the string value.
I attempted to load a large csv file to a spark dataframe using PySpark.
listings = spark.read.csv("listings.csv")
# Register the dataframe as a temp view so it can be queried with Spark SQL
listings.createOrReplaceTempView("listings")
When I tried to get a glance at the result using Spark SQL with the following code:
listing_query = "SELECT * FROM listings LIMIT 20"
spark.sql(listing_query).show()
I got the following result:
This is very strange, considering that reading the csv with pandas produces the correctly formatted table without the mismatched columns.
Any idea about what caused this issue and how to fix it?
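As noted in the edits above, the fix was to enable multiline parsing when reading the file. A minimal sketch of the corrected read follows; the header and escape options are assumptions about the file layout rather than part of the original post:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine=True keeps quoted fields containing \r\n or \n inside a single
# record instead of splitting them into new (mismatched) rows.
listings = spark.read.csv(
    "listings.csv",
    header=True,      # assumption: the first line holds the column names
    multiLine=True,
    escape='"',       # assumption: embedded quotes are escaped by doubling
)

listings.createOrReplaceTempView("listings")
spark.sql("SELECT * FROM listings LIMIT 20").show()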

Related

Reading Data Frame in Atoti?

While reading a DataFrame in Atoti using the following code, the error shown below occurs.
#Code
global_data=session.read_pandas(df,keys=["Row ID"],table_name="Global_Superstore")
#error
ArrowInvalid: Could not convert '2531' with type str: tried to convert to int64
How can I solve this? Please help.
I was trying to read the DataFrame using atoti functions.
There are values with different types in that particular column. If you aren't going to preprocess the data and you're fine with that column being read as a string, then you should specify the exact datatypes of each of your columns (or that particular column), either when you load the dataframe with pandas, or when you read the data into a table with the function you're currently using:
import atoti as tt

global_superstore = session.read_pandas(
    df,
    keys=["Row ID"],
    table_name="Global_Superstore",
    types={
        "<invalid_column>": tt.type.STRING,
    },
)
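The other option mentioned above, forcing the type when pandas loads the data, would look roughly like this. This is a sketch only: the file name and the "Postal Code" column are placeholders for whichever column mixes strings and numbers in your data:
import pandas as pd

# Read the ambiguous column as str so pandas (and Arrow) never tries to
# coerce values like '2531' to int64.
df = pd.read_csv("global_superstore.csv", dtype={"Postal Code": str})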

Values of the columns are null and swapped in pyspark dataframe

I am using pyspark==2.3.1. I performed data preprocessing with pandas, and now I want to port my preprocessing function from pandas to pyspark. But when reading the CSV file with pyspark, a lot of values in a column that actually has data become null. If I then perform any operation on this dataframe, it swaps the values of that column with other columns. I also tried different versions of pyspark. Please let me know what I am doing wrong. Thanks
Result from pyspark:
The column "property_type" shows null values, but the actual data has values where pyspark shows null.
CSV File:
But pyspark works fine with small datasets, for example:
In our case we faced a similar issue. Things you need to check:
Check whether your data contains " [double quotes]; pyspark treats it as the quote character, so the data can look malformed.
Check whether your csv data contains multiline values.
We handled this situation with the following configuration:
spark.read.options(header=True, inferSchema=True, escape='"').option("multiline",'true').csv(schema_file_location)
Are you limited to the CSV file format?
Try parquet. Just save your DataFrame in pandas with .to_parquet() instead of .to_csv(). Spark works with this format really well.
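A minimal sketch of that round trip, assuming the placeholder file name listings.parquet and a pyarrow or fastparquet installation for the pandas side:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder frame standing in for the preprocessed pandas data.
pdf = pd.DataFrame({"property_type": ["Apartment", "House"], "price": [120.0, 250.0]})

# Write parquet instead of CSV; the schema travels with the file.
pdf.to_parquet("listings.parquet", index=False)

# Spark reads the column names and types directly from the file, so there is
# no quoting or multiline ambiguity as with CSV.
spark.read.parquet("listings.parquet").show()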

Invalid Character Error When Generating Tableau Hypers using API

I have a process that loops through several parquet files, converts them to CSVs, and then generates tableau hypers following the guidance provided here:
https://github.com/tableau/hyper-api-samples/blob/main/Tableau-Supported/Python/create_hyper_file_from_csv.py
However, for one of the files, I keep getting an error as the CSV is copied into the .hyper:
HyperException: invalid character in input string file:'/filepath/adds_time_of_day_error_drop_col.csv' line:2 column:12 Context: 0x5fdfad59
This is the schema of the problem parquet (before being converted to CSV):
The characters it calls "invalid" are decimals (e.g., 0.011 or 0.012) that worked fine in the other files I converted, so I'm not sure why it's not working here. Even when I drop the column, it just shifts the error over one column (so it would throw the error at line:2 column:11 instead of 12).
This seems similar to this issue on the tableau help forum, but I couldn't apply the solution there to this case because, as far as I can tell, there is no invalid character at line 2, column 12. I can't tell why only this CSV would have this issue when the others were created the same way with no problem.
https://community.tableau.com/s/question/0D54T00000F33g5SAB/hyper-api-copy-from-csv-to-hyper-failed-with-context-0x5fdfad59
I don't think it has to do with the Table Definition because I've tried different SqlTypes for the column, and it always fails.
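For context, the copy step from the linked sample looks roughly like the sketch below. The table definition, column names, and output path are placeholders, not the real schema; the point is that the "invalid character" error is raised by Hyper while it parses the CSV during the COPY command:
from tableauhyperapi import (
    Connection, CreateMode, HyperProcess, SqlType, TableDefinition, Telemetry,
    escape_string_literal,
)

# Placeholder table definition standing in for the real schema of the file.
adds_table = TableDefinition(
    table_name="adds_time_of_day",
    columns=[
        TableDefinition.Column("event_id", SqlType.text()),
        TableDefinition.Column("share", SqlType.double()),
    ],
)

csv_path = "/filepath/adds_time_of_day_error_drop_col.csv"

with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    with Connection(endpoint=hyper.endpoint,
                    database="adds_time_of_day.hyper",
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        connection.catalog.create_table(table_definition=adds_table)
        # The "invalid character in input string" error surfaces here, while
        # Hyper parses the CSV file referenced by the COPY command.
        connection.execute_command(
            f"COPY {adds_table.table_name} FROM {escape_string_literal(csv_path)} "
            f"WITH (format csv, NULL 'NULL', delimiter ',', header)"
        )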

Pandas, remove outer quote marks from specific columns on export?

I have a specific problem: we are moving from an old system to a new one. The old database was adjusted to the new one with Pandas. However, I am facing a problem.
If the file is opened with SQL or as csv, it has outer quotes,
"UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)"
I need to make sure it has no outer quotes, like this:
UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)
What would be a pandas solution to do this for specific columns when exporting/saving to SQL or csv? Because right now it is stored as a string and comes back like this.
If your problem is that the old system produces files like .csv with the quotes, you might just want to edit the .csv file itself as described here
If your problem is that pandas saves it as a string wrapped in double quotes, you can either run the same thing on the csv output of pandas, or you can disable quoting in the .to_csv() call by passing the argument
quoting=csv.QUOTE_NONE
for which you can find more info on this page
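A minimal sketch of that export, under the assumption that the field should appear verbatim in the output. Because the UUID_TO_BIN expression itself contains a comma, the sketch switches to a separator that does not occur in the data; otherwise an escapechar would be needed:
import csv
import pandas as pd

# Hypothetical frame mimicking the column from the question.
df = pd.DataFrame(
    {"id_expr": ["UUID_TO_BIN('5e6f7922-8ae9-11ea-a3bd-888888888788', true)"]}
)

# QUOTE_NONE disables quoting entirely, so the expression is written without
# surrounding double quotes; '|' is chosen because it never appears in the data.
df.to_csv("out.csv", sep="|", index=False, quoting=csv.QUOTE_NONE)
If the comma separator must be kept, pass escapechar="\\" as well so the commas inside the expression can be written without quoting.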
Try to read your data in Pandas using:
df = pd.read_csv(filename, sep=',').replace('"','', regex=True)
This line should strip the " characters from your data as it is read in.

How to write dataframe to csv with a single row header(5k columns)?

I am trying to export a pandas dataframe with to_csv so it can be processed by another tool before using it again with python. It is a token dataset with 5k columns. When exported, the header is split across two rows. This might not be an issue for pandas, but in this case I need to export a csv with a single header row. Is this a pandas limitation or a csv format one?
Currently, searching has returned no compatible results. The only solution I came up with is writing the column names and the values separately, e.g. writing a str column list first and then a numpy array to the csv. Can this be implemented, and if so, how?
For me this problem was caused by the columns having multiple index levels (a MultiIndex). The easiest way to resolve this issue is to specify your own headers. I found references to an option called tupleize_cols, but it doesn't exist in current (1.2.2) pandas.
I was using the following aggregation:
df.groupby(["device"]).agg({
"outage_length":["count","sum"],
}).to_csv("example.csv")
This resulted in the following csv output:
,outage_length,outage_length
,count,sum
device,,
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
I specified my own headers in the call to to_csv, excluding my group-by column, as follows:
}).to_csv("example.csv", header=("flaps", "downtime"))
And got the following csv output, which was much more pleasing to spreadsheet software:
device,flaps,downtime
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
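Put together, the full call is the same aggregation as before with only the header argument added:
df.groupby(["device"]).agg({
    "outage_length": ["count", "sum"],
}).to_csv("example.csv", header=["flaps", "downtime"])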
