I have a process that loops through several parquet files, converts them to CSVs, and then generates tableau hypers following the guidance provided here:
https://github.com/tableau/hyper-api-samples/blob/main/Tableau-Supported/Python/create_hyper_file_from_csv.py
However, for one of the files, I keep getting an error as the CSV is copied into the .hyper:
HyperException: invalid character in input string file:'/filepath/adds_time_of_day_error_drop_col.csv' line:2 column:12 Context: 0x5fdfad59
This is the schema of the problem parquet (before being converted to CSV):
[parquet schema screenshot]
The characters it calls "invalid" are decimals (e.g., 0.011 or 0.012) that worked fine in the other files I converted, so I'm not sure why it's not working here. Even when I drop the column, it just shifts the error over one column (so it would throw the error at line:2 column:11 instead of 12).
This seems similar to this issue on the tableau help forum, but I couldn't apply the solution there to this case because, as far as I can tell, there is no invalid character at line 2, column 12. I can't tell why only this CSV would have this issue when the others were created the same way with no problem.
https://community.tableau.com/s/question/0D54T00000F33g5SAB/hyper-api-copy-from-csv-to-hyper-failed-with-context-0x5fdfad59
I don't think it has to do with the Table Definition because I've tried different SqlTypes for the column, and it always fails.
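One way to narrow this down is to inspect the exact bytes on the offending line rather than the rendered text: a value like 0.011 can still be preceded by an invisible character (a UTF-8 BOM, a non-breaking space, a stray control byte) that the Hyper COPY may reject. A minimal diagnostic sketch, reusing the path from the error message:

with open("/filepath/adds_time_of_day_error_drop_col.csv", "rb") as f:
    f.readline()          # line 1: the header row
    line2 = f.readline()  # line 2: the row Hyper complains about

print(repr(line2))  # non-ASCII and control bytes show up as \x.. escapes
# Anything printed here besides an empty list is a suspect byte:
print([hex(b) for b in line2 if b > 126 or (b < 32 and b not in (10, 13))])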
Related
Hey everyone, my question is kind of silly, but I am new to Python.
I am writing a Python script for a C# application and I ran into a strange issue when working with a CSV document.
When I open it and work with the Date column, it works fine:
df=pd.read_csv("../Debug/StockHistoryData.csv")
df = df[['Date']]
But when I try to work with other columns, it throws an error:
df = df[['Close/Last']]
KeyError: "None of [Index(['Close/Last'], dtype='object')] are in the [columns]"
It says there is no such index, but when I print the whole table it works fine and shows all the columns.
[screenshot of the printed table]
[screenshot of the error]
Take a look at the first row of your CSV file. It contains the column names (comma-separated). From time to time this initial line contains a space after each comma. For a human being that is quite readable and even intuitive, but read_csv adds these spaces to the column names, which is sometimes difficult to discover.
To check this, run print(df.columns) after you read your file and look for any extra spaces in the column names, as in the sketch below.
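A short sketch of both the check and two possible fixes, using the file path from the question:

import pandas as pd

df = pd.read_csv("../Debug/StockHistoryData.csv")

# Show the column names exactly as pandas parsed them; a stray space would
# appear as ' Close/Last' instead of 'Close/Last'.
print(list(df.columns))

# Fix 1: strip whitespace from the parsed column names.
df.columns = df.columns.str.strip()

# Fix 2: let read_csv drop spaces that follow the delimiter while parsing.
df = pd.read_csv("../Debug/StockHistoryData.csv", skipinitialspace=True)

print(df[['Close/Last']].head())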
Edit: The previous problem was solved by specifying the argument multiLine=True in the spark.read.csv function. However, I discovered another problem when using spark.read.csv.
Another problem I encountered was with another csv file in the same dataset as described in the question. It is a review dataset from insideairbnb.com.
The csv file is like this:
[screenshot of the raw CSV]
But the output of the read.csv function concatenated several lines together and generated a weird format:
Any thoughts? Thank you for your time.
The following problem was solved by specifying the multiLine argument in the spark.read.csv function. The root cause was that one of the columns contained \r\n\n\r sequences, which the function treated as line separators instead of part of the string.
I attempted to load a large csv file to a spark dataframe using PySpark.
listings = spark.read.csv("listings.csv")
# Register the DataFrame as a temp view so it can be queried with Spark SQL
listings.createOrReplaceTempView("listings")
When I tried to get a glance at the result using Spark SQL with the following code:
listing_query = "SELECT * FROM listings LIMIT 20"
spark.sql(listing_query).show()
I got the following result:
This is very weird considering that reading the csv with pandas outputs the table in the correct format, without the mismatched columns.
Any idea about what caused this issue and how to fix it?
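For reference, a minimal sketch of the fix described in the edits above; the header, quote, and escape settings are assumptions about how the insideairbnb export is formatted:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# multiLine=True keeps newline sequences inside quoted fields (e.g. \r\n\n\r in
# a review text) from being treated as record separators.
listings = spark.read.csv(
    "listings.csv",
    header=True,      # assumption: the file has a header row
    multiLine=True,
    quote='"',
    escape='"',
)

listings.createOrReplaceTempView("listings")
spark.sql("SELECT * FROM listings LIMIT 20").show()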
I have some dataframe df in pySpark, which results from calling:
df = spark.sql("select A, B from org_table")
df = df.stuffIdo
I want to overwrite org_table at the end of my script.
Since overwriting input tables is forbidden, I checkpointed my data:
sparkContext.setCheckpointDir("hdfs:/directoryXYZ/PrePro_temp")
checkpointed = df.checkpoint(eager=True)
The lineage should be broken now, and I can also see my checkpointed data with checkpointed.show() (this works). What does not work is writing the table:
checkpointed.write.format('parquet')\
.option("checkpointLocation", "hdfs:/directoryXYZ/PrePro_temp")\
.mode('overwrite').saveAsTable('org_table')
This results in an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://org_table_path/org_table/part-00081-4e9d12ea-be6a-4a01-8bcf-1e73658a54dd-c000.snappy.parquet
I have tried several things like refreshing the org_table before doing the writing etc., but I'm puzzled here. How can I solve this error?
I would be careful with such operations where transformed input becomes the new output. The reason is that you can lose your data in case of any error. Imagine that your transformation logic was buggy and you generated invalid data, but you only noticed it a day later. Moreover, to fix the bug, you cannot use the data you've just transformed; you need the data from before the transformation. What do you do to make the data consistent again?
An alternative approach would be:
expose a view
at each batch, write a new table and at the end only replace the view with this new table (see the sketch after this list)
after some days, you can also schedule a cleaning job that deletes the batch tables older than X days
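A rough sketch of that pattern; the batch table and view names are hypothetical, and df and spark are the objects from the question:

from datetime import datetime

# Hypothetical per-batch table name plus a stable view that readers query.
batch_table = "org_table_batch_" + datetime.now().strftime("%Y%m%d_%H%M%S")

# 1. Write this batch's output to a brand-new table.
df.write.format("parquet").mode("overwrite").saveAsTable(batch_table)

# 2. Atomically repoint the stable name at the new table.
spark.sql("CREATE OR REPLACE VIEW org_table_view AS SELECT * FROM " + batch_table)

# 3. A separate cleanup job can later drop batch tables older than X days.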
If you want to stay with your solution, why not simply do the following instead of dealing with checkpointing?
df.write.mode('overwrite').parquet("hdfs:/directoryXYZ/PrePro_temp")
spark.read.parquet("hdfs:/directoryXYZ/PrePro_temp").write.format('parquet').mode('overwrite').saveAsTable('org_table')
Of course, you will read the data twice, but it looks less hacky than the checkpoint approach. Moreover, you could store your "intermediate" data in a different directory on every run, and thanks to that you can address the issue I described at the beginning: even if you had a bug, you can still restore a valid version of the data by simply choosing a good directory and writing it back to org_table.
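For the "different directory every run" variant, a small sketch; the base path comes from the question, and the timestamp suffix is just one way to make each run unique:

from datetime import datetime

staging_dir = "hdfs:/directoryXYZ/PrePro_temp/run_" + datetime.now().strftime("%Y%m%d_%H%M%S")

# Keep each run's intermediate output in its own directory, then publish it.
df.write.mode("overwrite").parquet(staging_dir)
spark.read.parquet(staging_dir).write.mode("overwrite").saveAsTable("org_table")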
I'm trying to set up a Python script that will be able to read in many fixed width data files and then convert them to csv. To do this I'm using pandas like this:
pandas.read_fwf('source.txt', colspecs=column_position_length).\
to_csv('output.csv', header=column_name, index=False, encoding='utf-8')
Where column_position_length and column_name are lists containing the information needed to read and write the data.
Within these files I have long strings of numbers representing test answers. For instance: 333133322122222223133313222222221222111133313333 represents the correct answers on a multiple choice test. So this is more of a code than a numeric value. The problem that I am having is pandas interpreting these values as floats and then writing these values in scientific notation into the csv (3.331333221222221e+47).
I found a lot of questions regarding this issue, but they didn't quite resolve my issue.
Solution 1 - I believe at this point the values have already been converted to floats, so this wouldn't help.
Solution 2 - according to the pandas documentation, dtype is not supported as an argument for read_fwf.
Solution 3 - use converters - the issue with converters is that you need to specify the column name or index for each column you want to convert, but I would like to read all of the columns as strings.
The second option seems to be the go-to answer for reading every column in as a string, but unfortunately it just isn't supported for read_fwf. Any suggestions?
So I think I figured out a solution, but I don't know why it works. Pandas was interpreting these values as floats because there were NaN values (blank lines) in the columns. Adding keep_default_na=False to the read_fwf() call resolved the issue. According to the documentation:
keep_default_na : bool, default True If na_values are specified and
keep_default_na is False the default NaN values are overridden,
otherwise they’re appended to.
I guess I'm not quite understanding how this is fixing my issue. Could anyone add any clarity on this?
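One plausible explanation, for what it's worth: with the default NA handling the blank lines become NaN and every remaining token in those columns looks numeric, so pandas ends up with a float column (which is what then shows up in scientific notation in the CSV). With keep_default_na=False the blanks stay as empty strings, the column contains a non-numeric token, and pandas leaves the values as strings. A minimal sketch of the call with that flag; the colspecs and names are placeholders:

import pandas as pd

# Placeholder specs: two fixed-width fields, the second holding the answer string.
column_position_length = [(0, 10), (10, 58)]
column_name = ["student_id", "answers"]

df = pd.read_fwf(
    "source.txt",
    colspecs=column_position_length,
    keep_default_na=False,  # blank fields stay "" instead of becoming NaN
)
df.to_csv("output.csv", header=column_name, index=False, encoding="utf-8")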
I am attempting to use python to pull a JSON array from a file and input it into ElasticSearch. The array looks as follows:
{"name": [["string1", 1, "string2"],["string3", 2, "string4"], ... (variable length) ... ["string n-1", 3, "string n"]]}
ElasticSearch throws a TransportError(400, mapper_parsing_exception, failed to parse) when attempting to index the array. I discovered that ElasticSearch sometimes throws the same error whenever I try to feed it an array containing both strings and integers. So, for example, the following will sometimes crash and sometimes succeed:
import json
from elasticsearch import Elasticsearch
es = Elasticsearch()
index = "test"  # placeholder; the real index name is defined elsewhere in the program
test = json.loads('{"test": ["a", 1, "b"]}')
print(test)
es.index(index=index, body=test)
This code is everything that remains after I commented out as much as I safely could without breaking the program. I put the JSON in the program instead of having it read from a file. The actual strings I'm inputting are quite long (or else I would just post them) and will always crash the program. Changing the JSON to "test": ["a"] will cause it to work. The current setup crashes if it last crashed, or works if it last worked. What is going on? Will some sort of mapping setup fix this? I haven't figured out how to set up a mapping for a variable-length array. I'd prefer to take advantage of the schema-less input, but I'll take whatever works.
It is possible you are running into type conflicts with your mapping. Since you have expressed a desire to stay "schema-less", I am assuming you have not explicitly provided a mapping for your index. That works fine; just recognize that the first document you index will determine the schema for your index. For each document you index afterwards that has the same fields (by name), those fields must conform to the same types as in the first document.
Elasticsearch has no issues with arrays of values. In fact, under the hood it treats all values as arrays (with one or more entries). What is slightly concerning is the example array you chose, which mixes string and numeric types. Since each value in your array gets mapped to the field named "test", and that field may only have one type, if the first value of the first document ES processes is numeric, it will likely assign that field the long type. Then future documents that contain a string that does not parse nicely into a number will cause an exception in Elasticsearch.
Have a look at the documentation on Dynamic Mapping.
It can be nice to go schema-less, but in your scenario you may have more success by explicitly declaring a mapping on your index for at least some of the fields in your documents. If you plan to index arrays full of mixed datatypes, you are better off declaring that field as a string type, as sketched below.
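A small sketch of that explicit mapping; the index name is a placeholder, and the syntax assumes a recent Elasticsearch (7.x+) where keyword/text replace the older string type mentioned above:

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index up front with "test" mapped as keyword, so a value like 1
# is coerced to the string "1" instead of locking the field to a numeric type.
es.indices.create(
    index="my-index",
    body={
        "mappings": {
            "properties": {
                "test": {"type": "keyword"}
            }
        }
    },
)

# Mixed string/number arrays now index without a mapper_parsing_exception.
es.index(index="my-index", body={"test": ["a", 1, "b"]})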