I would like to preface by saying I am very new to Spark. I have a working program in Pandas that I need to run on Spark. I am using Databricks to do this. After initializing 'sqlContext' and 'sc', I load in a CSV file and create a Spark dataframe. After doing this, I then convert this dataframe into a Pandas dataframe, where I have already written code to do what I need to do.
Objective: I need to load in a CSV file, identify the data types, and return the data type of every column. The tricky part is that dates come in a variety of formats, for which I have written (with help from this community) regular expressions to match. I do this for every data type. At the end, I convert the columns to the correct types and print each column type.
After successfully loading my Pandas dataframe in, I am getting this error: "TypeError: to_numeric() got an unexpected keyword argument 'downcast' "
The code that I am running that triggered this:
# Changing the column data types
if len(int_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
if len(float_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
if len(boolean_count) == len(str_count):
    df[lst[col]] = df[lst[col]].astype('bool')
if len(date_count) == len(str_count):
    df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')
'lst' is the list of column headers and 'col' is the variable I use to iterate through them. This code worked perfectly when running in PyCharm. I am not sure why I am getting this error on Spark.
Any help would be great!
From your comments:
I have tried to load the initial data directly into pandas df but it has consistently thrown me an error, saying the file doesn't exist, which is why I have had to convert it after loading it into Spark.
So, my answer has nothing to do with Spark, only with uploading data to Databricks Cloud (Community Edition), which seems to be your real issue here.
After initializing a cluster and uploading a file user_info.csv, the upload screen shows the actual path for our uploaded file.
Now, in a Databricks notebook, if you try to use this exact path with pandas, you'll get a File does not exist error:
import pandas as pd
pandas_df = pd.read_csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv")
[...]
IOError: File /FileStore/tables/1zpotrjo1499779563504/user_info.csv does not exist
because, as the instructions clearly mention, in that case (i.e. files you want loaded directly in pandas or R instead of Spark) you need to prepend the file path with /dbfs:
pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv") # works OK
pandas_df.head() # works OK
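For completeness, here is a minimal sketch of the route described in the question (read with Spark, then convert to pandas), using the same upload path; note that toPandas() collects the whole dataset to the driver, so it only suits data that fits in memory:
# Spark reads the DBFS path directly, no /dbfs prefix needed
spark_df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv"))
# Convert to pandas; this collects all rows to the driver
pandas_df = spark_df.toPandas()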
Related
I am new to data, so after a few lessons on importing data in Python, I tried the following code in my Jupyter notebook but keep getting an error saying df is not defined. I need help.
The code I wrote is as follows:
import pandas as pd
url = "https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv"
df = pd.read_csv(url)
After running the third line, I got a series of reports in the Jupyter notebook, but the one that stood out was "df not defined".
The problem here is that your data is a ZIP file containing multiple CSV files. You need to download the data, unpack the ZIP file, and then read one CSV file at a time.
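For illustration, a minimal sketch of that approach using requests and zipfile; the skiprows=4 value is an assumption about the World Bank CSV layout (a few metadata rows before the header) and may need adjusting:
import io
import zipfile
import pandas as pd
import requests

url = "https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv"

# Download the ZIP archive and open it in memory
resp = requests.get(url)
archive = zipfile.ZipFile(io.BytesIO(resp.content))

# See which CSV files the archive contains, then read one of them
csv_names = [name for name in archive.namelist() if name.endswith(".csv")]
print(csv_names)

# skiprows=4 is an assumption: World Bank indicator CSVs usually start with metadata rows
df = pd.read_csv(archive.open(csv_names[0]), skiprows=4)
print(df.head())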
If you can give more details on the problem (e.g. screenshots), debugging will become easier.
One possibility for the error is that the response content accessed by the URL (https://api.worldbank.org/v2/en/indicator/SH.TBS.INCD?downloadformat=csv) is a zip file, which may prevent pandas from processing it further.
I'm trying to import a parquet file in Databricks (pyspark) and keep getting the error
df = spark.read.parquet(inputFilePath)
AnalysisException: Column name "('my data (beta)', "Meas'd Qty")" contains invalid character(s). Please use alias to rename it.
I tried the suggestions in this post, using .withColumnRenamed like in this post, and also using alias like
(spark.read.parquet(inputFilePath)).select(col("""('my data (beta)', "Meas'd Qty")""").alias("col")).show()
but always get the same error. How do I go through each column to replace any invalid characters with underscore _ or even just delete all invalid characters?
How was the old file generated? The file was saved with column names that are not allowed by Spark.
It is better to fix this issue at the source, when the file is generated.
A few approaches you can try in Spark to resolve it are:
In the select statement, put the column name in backticks, like:
(spark.read.parquet(inputFilePath)).select(col("""`('my data (beta)', "Meas'd Qty")`""").alias("col")).show()
Try renaming with toDF; you need to pass all the column names for the output df (toDF takes the names as separate arguments rather than a list).
(spark.read.parquet(inputFilePath)).toDF("col_a", "col_b", ...).show()
Try reading the file using pyarrow, rename the columns, and save the result. After that, read it using pyspark and continue with your tasks.
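To illustrate the pyarrow route, here is a minimal sketch that replaces every character assumed to be rejected by Spark in parquet column names (" ,;{}()\n\t=") with an underscore and saves a cleaned copy you can then read with spark.read.parquet:
import re
import pyarrow.parquet as pq

# Characters assumed to be rejected by Spark in parquet column names
INVALID_CHARS = r'[ ,;{}()\n\t=]'

# inputFilePath is the same path variable used in the question
table = pq.read_table(inputFilePath)

# Replace each invalid character in every column name with an underscore
clean_names = [re.sub(INVALID_CHARS, '_', name) for name in table.column_names]
table = table.rename_columns(clean_names)

# Save the cleaned file; spark.read.parquet should now accept it
pq.write_table(table, 'cleaned.parquet')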
Initially I tried to read a 4 GB CSV file with pandas pd.read_csv, but my system ran out of memory (I guess) and the kernel kept restarting or the system hung.
So I tried using the vaex library to convert the CSV to HDF5 and do operations (aggregations, group by) on it. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record in the CSV file as the header (the column names, to be precise), and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the CSV, and somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
So, reading the pandas documentation: with header='infer' (the default), the reader takes the first row of the file as the column names unless you pass them explicitly; to treat the first row as data, pass header=None, and you can supply the column names manually via the names kwarg. The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. Then you can use those options with vaex, together with the convert and chunk_size arguments.
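For example, a minimal sketch (the column names here are placeholders; substitute your own), since the extra kwargs are passed straight through to pandas.read_csv:
import vaex

# header=None tells the underlying pandas reader that the first row is data,
# and names supplies the column names manually (placeholder names below)
df = vaex.from_csv('Wager-Win_April-Jul.csv',
                   header=None,
                   names=['wager_id', 'win_amount', 'wager_amount'],
                   convert=True,
                   chunk_size=5000000)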
I use:
Python 3.7
SAS v7.1 Enterprise
I want to export some data (from library) from SAS to CSV. After that I want to import this CSV to Pandas Dataframe and use it.
I have problem, because when I export data from SAS with this code:
proc export data=LIB.NAME
outfile='path\to\export\file.csv'
dbms=csv
replace;
run;
Every column was exported correctly except the column with dates. In SAS I see something like:
06NOV2018
16APR2018
and so on... In the CSV it looks the same. But if I import this CSV into a DataFrame, unfortunately, Python sees the date column as object/string instead of a date type.
So here is my question: how can I export a whole library from SAS to CSV with the correct column types (especially the date column)? Maybe I should convert something before the export? Please help me with this; I'm new to SAS, and I just want to import data from it and use it in Python.
Before you write something, keep in mind that I have already tried the pandas read_sas function, but that call threw the following exception:
df1 = pd.read_sas(path)
ValueError: Unexpected non-zero end_of_first_byte
Exception ignored in: 'pandas.io.sas._sas.Parser.process_byte_array_with_data'
Traceback (most recent call last):
  File "pandas\io\sas\sas.pyx", line 31, in pandas.io.sas._sas.rle_decompress
I added a fillna call and it showed the same error :/
df = pd.DataFrame.fillna((pd.read_sas(path)), value="")
I tried the sas7bdat module in Python, but got the same error.
Then I tried the sas7bdat_converter module, but the CSV has the same values in the date column, so the dtype problem will come back after converting the CSV to a DataFrame.
Have you got any suggestions? I've spent two days trying to figure this out, without any positive results :/
Regarding the read_sas error, a GitHub issue has been reported but was closed for lack of a reproducible example. However, I can easily import SAS data files with Pandas using .sas7bdat files generated from SAS 9.4 base (possibly the v7.1 Enterprise version is the issue).
However, consider using the parse_dates argument of read_csv, as it can convert your DDMMMYYYY date format to datetime during import. No change is needed to your SAS-exported dataset.
sas_df = pd.read_csv(r"path\to\export\file.csv", parse_dates = ['DATE_COLUMN'])
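If parse_dates does not pick up the 06NOV2018 style on its own, a fallback sketch (DATE_COLUMN is a placeholder for your actual column name) is to convert explicitly after import:
import pandas as pd

sas_df = pd.read_csv(r"path\to\export\file.csv")
# %d%b%Y matches values like 06NOV2018 (SAS DATE9. style)
sas_df['DATE_COLUMN'] = pd.to_datetime(sas_df['DATE_COLUMN'], format='%d%b%Y')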
I have a parquet file which has a simple file schema with a few columns. I read it into python using the code below
from fastparquet import ParquetFile
pf = ParquetFile('inout_files.parquet')
This runs fine, but when I convert it into pandas using the code below I get the following error:
df = pf.to_pandas()
The error is:
NotImplementedError: Encoding 4
To find the source of the error, I ran df = pf.to_pandas(columns=col_to_retrieve), adding the columns one at a time, and noticed the error is raised by one of the columns whose cells each contain a list of strings (e.g. ("a","b","c")).
Do you know how to convert it to pandas, given that there is a column with type set(string)?
After re-reading the question I'm concerned my answer may be a non sequitur...
I am having a related problem with a very large dataframe/parquet and getting the error:
"BinaryArray cannot contain more than 2147483646 bytes".
It appears that fastparquet can read my large table without errors and pyarrow can write it without issues, as long as I don't have category types. So this is my current workaround until this issue is solved:
0) Take dataframe without category columns and make a table:
import pyarrow as pa
table = pa.Table.from_pandas(df)
1) write my tables using pyarrow.parquet:
import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')
2) read my tables using fastparquet:
from fastparquet import ParquetFile
pf = ParquetFile('example.parquet')
3) convert to pandas using fastparquet:
df = pf.to_pandas()