Using alias to rename pyspark columns - python

I'm trying to import a parquet file in Databricks (pyspark) and keep getting the error
df = spark.read.parquet(inputFilePath)
AnalysisException: Column name "('my data (beta)', "Meas'd Qty")" contains invalid character(s). Please use alias to rename it.
I tried the suggestions in this post, using .withColumnRenamed like in this post, and also using alias like
(spark.read.parquet(inputFilePath)).select(col("('my data (beta)', "Meas'd Qty")").alias("col")).show()
but always get the same error. How do I go through each column to replace any invalid characters with underscore _ or even just delete all invalid characters?

How is the original file generated? It was saved with column names that are not allowed by Spark.
It is better to fix this issue at the source, when the file is generated.
A few approaches you can try in Spark to work around it are:
In the select statement, wrap the column name in backticks, like
(spark.read.parquet(inputFilePath)).select(col("""`('my data (beta)', "Meas'd Qty")`""").alias("col")).show()
Try renaming with toDF. You need to pass all the column names of the output df.
(spark.read.parquet(inputFilePath)).toDF("col_a", "col_b", ...).show()
Try reading the file using pyarrow, rename the columns, and save the result. After that, read it with pyspark and continue with your tasks (a sketch of this follows below).
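A minimal sketch of the pyarrow route, assuming the whole table fits in memory and that mapping every character outside letters, digits, and underscores to _ is acceptable; cleanedFilePath is a hypothetical output path:
import re
import pyarrow.parquet as pq

# Read the parquet file with pyarrow, which does not enforce Spark's column-name rules.
table = pq.read_table(inputFilePath)

# Replace every invalid character in each column name with an underscore.
clean_names = [re.sub(r"[^0-9a-zA-Z_]", "_", name) for name in table.column_names]
pq.write_table(table.rename_columns(clean_names), cleanedFilePath)

# The rewritten file can now be read by Spark without the AnalysisException.
df = spark.read.parquet(cleanedFilePath)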

Related

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This is the script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File" in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows", as you can see. The data in the file is separated column-wise by lines, at least that's how it appears in Notepad.
The problem is when I try to extract the first column using "usecol" it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly? Not understanding what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:, 0]. With iloc you can extract a column or row by its integer position. [:, 0] selects all rows of the 1st column.
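Since the question describes vertical bars between the values, the file may be pipe-delimited rather than comma-delimited; a small sketch under that assumption (the separator and the skiprows value are guesses to adapt):
import pandas as pd

# Assumption: values are separated by "|" (the vertical bars described in the question).
dataset = pd.read_csv("location\\filename", sep="|", skiprows=12)

first_column = dataset.iloc[:, 0]  # all rows of the first column
print(first_column.head())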
The file is incorrectly named.
I expect that you are reading a csv, xlsx, or txt file, so the (Windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'
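One quick, generic way to narrow this down (not part of the original answer) is to check the path before calling read_csv:
from pathlib import Path

input_file = Path("C:\\python\\tests\\test_csv.csv")  # put the real path to your file here
print(input_file.exists())  # False means the path or the file name is wrong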

Renaming the columns in Vaex

I initially tried to read a 4 GB csv file with pandas pd.read_csv, but my system runs out of memory (I guess) and the kernel restarts or the system hangs.
So I tried using the vaex library to convert the csv to HDF5 and do operations (aggregations, group by) on that. For that I've used:
df = vaex.from_csv('Wager-Win_April-Jul.csv',column_names = None, convert=True, chunk_size=5000000)
and
df = vaex.from_csv('Wager-Win_April-Jul.csv',header = None, convert=True, chunk_size=5000000)
But I'm still getting the first record in the csv file as the header (the column names, to be precise) and I'm unable to change the column names. I tried to find a function to change the names but didn't come across any. Please help me with that. Thanks :)
The column names 1559104, 10289, 991... are actually the first record in the csv; somehow vaex is taking the first row as my column names, which I want to avoid.
vaex.from_csv is a wrapper around pandas.read_csv with a few extra options for the conversion.
Reading the pandas documentation: header='infer' (which is the default) makes the csv reader take the 1st row of the file as the header unless you also supply column names. To avoid that, pass the column names manually via the names kwarg (optionally together with header=None). The same holds true for both vaex and pandas.
I would read the pandas.read_csv documentation to better understand all the options. You can then use those options with vaex, together with its convert and chunk_size arguments.
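Put together, a sketch along those lines (the column names here are placeholders to replace with the real ones, and it assumes vaex forwards these kwargs to pandas.read_csv as described above):
import vaex

# Placeholder column names; replace them with the real ones for the file.
names = ["wager_id", "win_amount", "bet_date"]

df = vaex.from_csv(
    'Wager-Win_April-Jul.csv',
    header=None,         # the file has no header row
    names=names,         # supply the column names manually
    convert=True,        # write the converted HDF5 file next to the csv
    chunk_size=5000000,  # convert the csv in chunks to limit memory use
)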

Deleting Columns from a CSV in Python

I know similar questions to this have been asked, but I couldn't find any that were dealing with the error I'm getting (though I apologize if I'm missing something!). I am trying to remove a few columns from a CSV that wouldn't load in Excel so I couldn't just delete them within the file. I have the following code:
import os
import pandas as pd
os.chdir(r"C:\Users\maria\Desktop\Project\North American Breeding Bird Survey")
data = pd.read_csv("NABBSStateData.csv")
data.drop(["CountryNum", "Route", "RPID"], axis = 1, inplace = True)
but when I run it I get this error message:
c:\program files (x86)\microsoft visual studio\2019\professional\common7\ide\extensions\microsoft\python\core\Packages\ptvsd\_vendored\pydevd\pydevd.py:1664: DtypeWarning: Columns (0,1,2,3,4,5,6,7,8,9,10,11,12,13) have mixed types. Specify dtype option on import or set low_memory=False.
return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
I am relatively new to python/visual studio, and I am having a hard time figuring out what this error message is saying and how to fix it. Thank you!!
Edit: The CSV in question is the state files from this site concatenated together, so you can open one of the state files to see the columns/data types.
That message is a DtypeWarning (a warning, not an error): it looks like you have mixed data types in some of your columns (here columns 0 to 13).
Mixed data types means that in one column, say column 'a', most rows are numbers, but some rows contain strings as well.
Try using the dtype option of pd.read_csv to specify the column types. If you are not sure about a type, use object or str.
This is an example:
df = pd.read_csv('D:\\foo.csv', header=0, dtype={'currency':str, 'v1':object, 'v2':object})
See the read_csv documentation for usage details and for the list of all the types you can specify.
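Applied to the file from the question, a sketch might look like this (reading every column as a string is the blunt but safe option; low_memory=False is the alternative suggested by the warning itself):
import pandas as pd

# Read every column as a string so pandas never has to guess mixed types.
data = pd.read_csv("NABBSStateData.csv", dtype=str)
# Alternative: let pandas scan the whole file before choosing column types.
# data = pd.read_csv("NABBSStateData.csv", low_memory=False)

data.drop(["CountryNum", "Route", "RPID"], axis=1, inplace=True)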

Read CSV file with specific columns as String in PySpark

I have a large file which is dynamically generated, a small sample of which is given below:
ID,FEES,I_CLSS
11,5555,00000110
12,5555,654321
13,5555,000030
14,5555,07640
15,5555,14550
17,5555,99070
19,5555,090090
My issue is that this file will always have a column like I_CLSS whose values start with leading zeros. I'd like to read the file into a Spark dataframe with the I_CLSS column as StringType.
For this, in python I can do something like,
df = pandas.read_csv('INPUT2.csv',dtype={'I_CLSS': str})
But is there an alternative to this command in pyspark?
I understand that I can manually specify the schema of a file in Pyspark. But it would be extremely difficult to do for a file whose columns are dynamically generated.
So I'd appreciate it if somebody could help me with this.
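One common workaround (a sketch under the assumption that only I_CLSS needs to be forced to string; not an answer from the original thread) is to let Spark infer the schema once, override just the I_CLSS field, and read the file again with the corrected schema:
from pyspark.sql.types import StringType, StructField, StructType

path = 'INPUT2.csv'  # path taken from the question's pandas example

# Infer the schema once, then force I_CLSS to StringType.
inferred = spark.read.csv(path, header=True, inferSchema=True).schema
fixed = StructType([
    StructField(f.name, StringType(), f.nullable) if f.name == 'I_CLSS' else f
    for f in inferred.fields
])

# Re-read with the corrected schema; all other columns keep their inferred types.
df = spark.read.csv(path, header=True, schema=fixed)
df.printSchema()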

Error with Pandas command on Spark?

I would like to preface by saying I am very new to Spark. I have a working program on Pandas that I need to run on Spark. I am using Databricks to do this. After initializing 'sqlContext' and 'sc', I load in a CSV file and create a Spark dataframe. After doing this, I then convert this dataframe into a Pandas dataframe, for which I have already written the code to do what I need to do.
Objective: I need to load in a CSV file and identify the data types and return the data types of each and every column. The tricky part is that dates come in a variety of formats, for which I have written (with help from this community) regular expressions to match. I do this for every data type. At the end, I convert the columns to the correct type and print each column type.
After successfully loading my Pandas dataframe in, I am getting this error: "TypeError: to_numeric() got an unexpected keyword argument 'downcast' "
The code that I am running that triggered this:
# Changing the column data types
if len(int_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='integer')
if len(float_count) == len(str_count):
    df[lst[col]] = pd.to_numeric(df[lst[col]], errors='coerce', downcast='float')
if len(boolean_count) == len(str_count):
    df[lst[col]] = df[lst[col]].astype('bool')
if len(date_count) == len(str_count):
    df[lst[col]] = pd.to_datetime(df[lst[col]], errors='coerce')
'lst' is the list of column headers and 'col' is a variable I use to iterate through the column headers. This code worked perfectly when running in PyCharm. Not sure why I am getting this error on Spark.
Any help would be great!
From your comments:
I have tried to load the initial data directly into pandas df but it has consistently thrown me an error, saying the file doesn't exist, which is why I have had to convert it after loading it into Spark.
So, my answer has nothing to do with Spark, only with uploading data to Databricks Cloud (Community Edition), which seems to be your real issue here.
After initializing a cluster and uploading a file user_info.csv, the upload screen shows the actual path for the uploaded file.
Now, in a Databricks notebook, if you try to use this exact path with pandas, you'll get a File does not exist error:
import pandas as pd
pandas_df = pd.read_csv("/FileStore/tables/1zpotrjo1499779563504/user_info.csv")
[...]
IOError: File /FileStore/tables/1zpotrjo1499779563504/user_info.csv does not exist
because, as the instructions clearly mention, in that case (i.e. files you want loaded directly in pandas or R instead of Spark) you need to prepend the file path with /dbfs:
pandas_df = pd.read_csv("/dbfs/FileStore/tables/1zpotrjo1499779563504/user_info.csv") # works OK
pandas_df.head() # works OK
