PySpark fails to select column after pivot - python

I have a dataframe with a Timestamp column, a Tag column and a Value column.
I did a pivot like this:
df = df.groupBy("Timestamp").pivot("Tag").mean()
Which works well, gives me what I want. When I show columns, I get
df.columns
----------------------------------------
['Timestamp', 'TAG:Tag1.val', 'TAG:Tag2.val', 'TAG:Tag3.val']
But then when I try to select a column, I have this error:
df.select('TAG:Tag1.val')
----------------------------------------
org.apache.spark.sql.AnalysisException: cannot resolve '`TAG:Tag1.val`' given input columns: [Timestamp, TAG:Tag1.val, TAG:Tag2.val, TAG:Tag3.val];;
I tried by giving the name directly, by using df.columns[0], df.schema.fieldNames(), by doing df=df.toDF(*df.schema.fieldNames()) before select.
Always the same error message. Do you know why it is doing so?
I also tried to hardcode the column's list in .pivot("Tag", list_tags), got the same result.
I also need to tell you that selecting Timestamp works perfectly well.

You need to wrap the column name in backticks, because otherwise Spark parses the dot in the name as access to a nested field:
df.select('`TAG:Tag1.val`').show()
To check all columns, you can do:
df.select([f'`{x}`' for x in df.columns]).show()
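The backtick quoting above can be wrapped in a small helper. This is only a sketch: quote_col is a hypothetical name, not part of PySpark, and doubling an embedded backtick is an assumption that matters only if a column name itself contains a backtick.

```python
def quote_col(name: str) -> str:
    """Wrap a column name in backticks so it is read as a single identifier."""
    # Assumption: an embedded backtick is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


# Column names taken from the question.
columns = ["Timestamp", "TAG:Tag1.val", "TAG:Tag2.val", "TAG:Tag3.val"]
quoted = [quote_col(c) for c in columns]
print(quoted)
```

With a real DataFrame this would be used as df.select(quoted), or df.select(quote_col('TAG:Tag1.val')) for a single column.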

I solved it by replacing every ':' in the tags with '_'.
For some reason I don't understand, PySpark did not handle the ':' correctly.
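If renaming is acceptable, the replacement can be applied to every column name at once. The sketch below is an assumption-laden variant of that idea: sanitize is a hypothetical helper, and it replaces '.' as well as ':' because the dot is the character Spark reads as nested-field access.

```python
import re


def sanitize(name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "_", name)


# Column names taken from the question.
columns = ["Timestamp", "TAG:Tag1.val", "TAG:Tag2.val", "TAG:Tag3.val"]
clean = [sanitize(c) for c in columns]
print(clean)  # ['Timestamp', 'TAG_Tag1_val', 'TAG_Tag2_val', 'TAG_Tag3_val']
```

On the pivoted DataFrame the new names could then be applied with df = df.toDF(*clean), after which df.select('TAG_Tag1_val') needs no backticks.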

Related

SnowparkFetchDataException: (1406): Failed to fetch a Pandas Dataframe. The error is: Found non-unique column index

While running some code like this:
session = ...
return session.table([DB,SCHEMA, MANUAL_METRICS_BY_SIZE]).select("TECHNOLOGY","OBJECTTYPE","OBJECTTYPE","SIZE","EFFORT").to_pandas()
I got this error.
Any idea of what might be causing this?
Well, it was easier than I thought.
I had a duplicated column name ("OBJECTTYPE" is selected twice above), and pandas doesn't like that.
Just check your columns, for example with df.columns, and remove the duplicated column.
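A quick way to spot the repeated name before calling to_pandas() is to count the names in the select list. This is a plain-Python sketch; duplicated is a hypothetical helper, not a Snowpark or pandas function.

```python
from collections import Counter


def duplicated(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# The select list from the question.
selected = ["TECHNOLOGY", "OBJECTTYPE", "OBJECTTYPE", "SIZE", "EFFORT"]
print(duplicated(selected))  # ['OBJECTTYPE']
```

The same check works on list(df.columns) for a DataFrame that already exists.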

Why Doesn't Python Recognize the Column Name (KeyError)

I imported stock/options data into a data frame and want to use pandas to manually filter for specific criteria. I renamed a few columns and then later on I tried to do a bit of cleaning so I can work with the data.
I tried to replace percentage signs then convert the data type to a float by doing this:
df = df['IV'].str.rstrip("%").astype(float)
df = df['IV_Rank'].str.rstrip("%").astype(float)/100
df = df['IV PCT'].str.rstrip("%").astype(float)/100
When I run that code I get the error message KeyError: 'IV'. I got this error for the other columns as well when I ran each line independently, even after copying and pasting the column names and trying the old names. I am not sure what to do, so some help would be appreciated.
That's because each line overwrites the entire dataframe with a single Series, so the next df[...] lookup no longer finds the column. I think this is what you are trying to do:
df['IV'] = df['IV'].str.rstrip("%").astype(float)
df['IV_Rank'] = df['IV_Rank'].str.rstrip("%").astype(float)/100
df['IV PCT'] = df['IV PCT'].str.rstrip("%").astype(float)/100
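To see the difference on a small frame, here is a minimal sketch. The sample values are made up, since the real data is not shown in the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"IV": ["12.5%", "30%"], "IV_Rank": ["50%", "75%"]})

# Assign back to the column, not to df, so the dataframe keeps its other columns.
df["IV"] = df["IV"].str.rstrip("%").astype(float)
df["IV_Rank"] = df["IV_Rank"].str.rstrip("%").astype(float) / 100

print(df)
```

After these lines df is still a DataFrame with both columns, so later lookups such as df['IV_Rank'] keep working.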

Dropping rows in a Data Frame

I am trying to drop specific rows in a DataFrame df where the column Time is anything except 06:00:00. I tried the following code, but it doesn't seem to work. I even tried adding another column, Index, to my file to aid the process, but it is still not working. Can you please help me? I am attaching the screenshots.
The val just contains the specific time 06:00:00. Also, please ignore the variable req. Thanks a lot.
In pandas, drop is not an in-place operation by default. Try specifying df.drop(j, inplace=True).
Have you tried this? It drops every row whose Time is not 06:00:00:
df = df.drop(df[df["Time"] != "06:00:00"].index)
Or even better, keep only the matching rows:
df = df[df.a.str.contains("06:00:00")]
Where a is the name of the column you want to search the time in.
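The row filtering discussed above can be tried on a small frame. This is a sketch with made-up data, and it assumes the Time column holds strings; a datetime column would need a different comparison.

```python
import pandas as pd

# Hypothetical sample data; Time is assumed to be stored as strings.
df = pd.DataFrame({
    "Time": ["05:00:00", "06:00:00", "07:00:00", "06:00:00"],
    "Value": [1, 2, 3, 4],
})

# Keep only the rows whose Time equals 06:00:00; everything else is dropped.
kept = df[df["Time"] == "06:00:00"]

# Equivalent with drop: remove the rows where the condition does NOT hold.
dropped = df.drop(df[df["Time"] != "06:00:00"].index)

print(kept)
```

Both forms return a new DataFrame and leave df untouched, which is why the result has to be assigned to a name.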

Unable to rename and remove pandas Index - Python

I have a dataframe like the one shown below.
With the help of the command below, I know the column names are 'FR', 'ig' and 'te'.
dataFramesDict['Tri'].columns
What does name = 'level_1' mean here? Moreover, I also don't see subject_ID in the columns or index list. What is subject_ID here?
How do I get the output to look like the one shown below?
I tried the code below to rename 'level_1' to 'subject_ID', but it doesn't work:
dataFramesDict['Tri'].index = dataFramesDict['Tri'].index.rename('subject_ID')
Please note that this is just sample data. I am only interested in changing the first column name and dropping 'level_1'; it has nothing to do with the data itself.
I am unable to create a dataframe of this type through sample code, because it is the result of other, more complex code. So I have provided a screenshot of the dataframe.
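Try this. Here 'level_1' is the name of the columns axis and subject_ID is the name of the index, so clear the former and reset the latter:
df.columns.name = ''
df.reset_index(inplace=True)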
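Since the original dataframe is only available as a screenshot, here is a small sketch of the same two steps on a made-up frame. It assumes the index is named 'subject_ID' and the columns axis is named 'level_1'; the data values are invented, and None is used instead of '' to remove the name entirely.

```python
import pandas as pd

# Hypothetical sample: index named 'subject_ID', columns axis named 'level_1'.
df = pd.DataFrame(
    {"FR": [1, 2], "ig": [3, 4], "te": [5, 6]},
    index=pd.Index(["s1", "s2"], name="subject_ID"),
)
df.columns.name = "level_1"

df.columns.name = None   # remove the 'level_1' label from the columns axis
df = df.reset_index()    # turn the subject_ID index into a regular column

print(df.columns.tolist())  # ['subject_ID', 'FR', 'ig', 'te']
```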

How to search in a pandas dataframe column with the space in the column name

If I need to check whether a value exists in a pandas data frame column whose name has no spaces, I simply do something like this:
if value in df.Timestamp.values
This will work if the column name is Timestamp. However, I have got plenty of data with column names as 'Date Time'. How do I use the if in statement in that case?
If there is no easy way to check for this using the if in statement, can I search for the existence of the value in some other way? Note that I just need to search for the existence of the value. Also, this is not an index column.
Thank you for any inputs
It's better practice to use the square bracket notation:
df["Date Time"].values
Which does exactly the same thing
There are two ways of indexing columns in pandas: dot notation, which you are using, and square brackets. Both return the same column, but only square brackets work when the name contains a space:
if value in df["Date Time"].values
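A minimal sketch of the membership check on a column whose name contains a space. The sample values are made up; the .eq(...).any() line is an alternative way of asking the same question, not something from the answers above.

```python
import pandas as pd

# Hypothetical sample data with a space in the column name.
df = pd.DataFrame({"Date Time": ["2020-01-01 06:00", "2020-01-01 07:00"]})

value = "2020-01-01 06:00"

# Square brackets accept any column name, including ones with spaces.
found = value in df["Date Time"].values

# Alternative: compare element-wise, then check whether any row matched.
found_alt = bool(df["Date Time"].eq(value).any())

print(found, found_alt)
```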
If you want to work with a column whose header contains spaces, but you don't want the header changed because you may have to forward the file, one way is to rename it, do whatever you want with the new space-free name, then rename it back:
# e.g. to drop the rows with the value "DUMMY" in the column 'Recipient Fullname'
df.rename(columns={'Recipient Fullname':'Recipient_Fullname'}, inplace=True)
df = df[(df.Recipient_Fullname != "DUMMY")]
df.rename(columns={'Recipient_Fullname':'Recipient Fullname'}, inplace=True)
