Spark Dataframe column name change does not reflect - python

I am trying to replace some special characters in the column names of my Spark dataframe. For some weird reason, printing the schema shows the updated column names, but any attempt to access the data results in an error complaining about the old column name. Here is what I am trying:
# Original Schema
upsertDf.columns
# Output: ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
for c in upsertDf.columns:
    upsertDf = upsertDf.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
upsertDf.columns
# Works and returns expected result
# Output: ['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']
# Print contents of dataframe
# Throws error for original attribute name
upsertDf.show()
AnalysisException: 'Attribute name "col 0" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'
I have tried other options to rename the column (using alias etc.) and they all return the same error. It's almost as if the show operation is using a cached version of the schema, but I can't figure out how to force it to use the new names.
Has anyone run into this issue before?

Have a look at this minimal example (using your renaming code, run in a pyspark shell, version 3.3.1):
df = spark.createDataFrame(
    [("test", "test", "test", "test", "test", "test")],
    ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
)
df.columns
['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
df.columns
['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']
df.show()
+-----+---------+-----------+------+---------+----------+
|col_0|col___0__|col____0___|col__0|col_____0|col______0|
+-----+---------+-----------+------+---------+----------+
| test|     test|       test|  test|     test|      test|
+-----+---------+-----------+------+---------+----------+
As you see, this executes successfully. So your renaming functionality is OK.
Since you haven't shared all your code (how upsertDf is defined), we can't really know exactly what's going on. But looking at your error message, this comes from ParquetSchemaConverter.scala in a Spark version earlier than 3.2.0 (this error message changed in 3.2.0, see SPARK-34402).
Make sure that you read in your data and then immediately rename the columns, without doing any other operation.
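As a minimal sketch of "read, then immediately rename", the helper below applies the question's replacements to every column in a single toDF call. The names sanitize and rename_all are hypothetical helpers, not part of PySpark, and the path in the usage comment is a placeholder. Whether this removes the error in your case depends on how upsertDf is built, which the question does not show.

```python
# Same replacements as in the question, expressed as one translation table.
_REPLACEMENTS = str.maketrans({
    " ": "_", "=": "_",
    "(": "__", ")": "__",
    "{": "___", "}": "___",
    ",": "____",
    ";": "_____",
})


def sanitize(name):
    """Apply the question's character replacements to one column name."""
    return name.translate(_REPLACEMENTS)


def rename_all(df):
    """Rename all columns at once; df is assumed to be a Spark DataFrame."""
    return df.toDF(*[sanitize(c) for c in df.columns])


# Hypothetical usage, directly after reading and before any other operation:
# upsertDf = rename_all(spark.read.parquet("<your path>"))
print([sanitize(c) for c in ["col 0", "col (0)", "col; 0"]])
```

Renaming everything in one call also avoids building a chain of withColumnRenamed projections, one per column.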

Related

TypeError: list indices must be integers or slices, not str - Dash app

Within the Dash code I have
dcc.Store(id='store-data', data=[], storage_type='memory')
Where I created and stored some variables.
Later, I used a callback to calculate some results based on the stored data.
The stored data returns a dictionary and looks something like this:
dict_values = {
    'Value A': [1, 2, 3],
    'Value B': [4, 5, 6],
    'Value C': [7, 8, 9],
}
Now, within the callback that has as input the stored data I tried to create a DataFrame like this:
df = pd.DataFrame(dict_values['Value A'])
That's when I get the error TypeError: list indices must be integers or slices, not str
Does anyone know why, and how I can fix it?
I have to mention that even though I get this error, the Dash app still runs and works without any problem.
Rename your dictionary:
import pandas as pd
d = {
    'Value A': [1, 2, 3],
    'Value B': [4, 5, 6],
    'Value C': [7, 8, 9],
}
pd.DataFrame(d['Value A'])
   0
0  1
1  2
2  3
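One possible cause, given that the store is initialised with data=[]: if the callback fires before the dictionary has been written to the store, the stored value is still a list, and indexing a list with a string raises exactly this TypeError. That is an assumption about the app, since the callback itself is not shown. A minimal sketch of a guard, using a hypothetical helper values_for:

```python
def values_for(stored, key):
    """Return stored[key] if the store holds a dict, else an empty list."""
    if not isinstance(stored, dict):
        # Initial store contents ([]) or anything else that is not a dict.
        return []
    return stored.get(key, [])


initial = []  # what the dcc.Store in the question starts with
filled = {"Value A": [1, 2, 3], "Value B": [4, 5, 6]}

try:
    initial["Value A"]  # reproduces the error from the question
except TypeError as exc:
    print(exc)

print(values_for(initial, "Value A"))
print(values_for(filled, "Value A"))
```

Inside the callback, pd.DataFrame(values_for(data, 'Value A')) would then build an empty frame until the store has been filled.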

Pandas when renaming a column, key error is encountered

I have the below code to rename a column
df.rename(columns = {'Long Name 1':'Court'}, inplace = True)
But encounter the below error
KeyError: "['Long Name 1'] not in index"
I'm not sure why there is an error. When I print the columns of the df, the column exists:
print(df.columns)
Result:
Index(['Activity', 'Date', 'Hirer Category', 'No of Slots', 'Slot Status', 'Start Time', 'Court', 'Long Name 1'], dtype='object')
Why am I not able to rename column 'Long Name 1'?
I can't reproduce your problem. I tested your code with dummy values and got no error, so the rename itself works fine.
Here is a link to the screenshot, as I don't have enough reputation to embed it.
Hope this helps. Thank you

How to filter a pandas column by list of strings?

The standard code for filtering through pandas would be something like:
output = df['Column'].str.contains('string')
strings = ['string 1', 'string 2', 'string 3']
Instead of 'string', though, I want to filter against every string in the list strings. So I tried something like:
output = df['Column'].str.contains('*strings')
This is the closest solution I could find, but it did not work:
How to filter pandas DataFrame with a list of strings
Edit: I should note that I'm aware of the | (or) operator. However, I'm wondering how to handle the general case, where the list strings changes and I end up looping through several lists of varying lengths.
You can create a regex string and search using this string.
Like this:
df['Column'].str.contains('|'.join(strings),regex=True)
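If the strings may contain regex metacharacters such as . or (, joining them unescaped changes what the pattern matches. A small sketch that escapes each entry first; build_pattern is a hypothetical helper name, and the sample values are made up for illustration:

```python
import re


def build_pattern(strings):
    """Join literal strings into one alternation, escaping regex metacharacters."""
    return "|".join(re.escape(s) for s in strings)


strings = ["string 1", "a.b", "(x)"]
pattern = build_pattern(strings)

values = ["has string 1 inside", "aXb", "a.b", "call (x) now", "nothing"]
matches = [v for v in values if re.search(pattern, v)]
print(matches)

# With pandas, the same pattern can be passed to str.contains:
# mask = df['Column'].str.contains(pattern, regex=True)
```

Note that an empty list produces an empty pattern, which matches every value, so it may be worth checking for that case when looping over lists of varying lengths.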
You should probably look into the isin() function (pandas.Series.isin). Note that isin matches whole values exactly rather than substrings.
Check the code below:
df = pd.DataFrame({'Column':['string 1', 'string 1', 'string 2', 'string 2', 'string 3', 'string 4', 'string 5']})
strings = ['string 1', 'string 2', 'string 3']
output = df.Column.isin(strings)
df[output]
output:
     Column
0  string 1
1  string 1
2  string 2
3  string 2
4  string 3

Access dynamically created data frames

Hello Python community,
I have a problem with my code.
I wrote code that dynamically creates dataframes in a for loop. The problem is that I don't know how to access them.
Here is a part of code
list = ['Group 1', 'Group 2', 'Group 3']
for i in list:
    exec('df{} = pd.DataFrame()'.format(i))
for i in list:
    print(df+i)
The dataframes are created, but I cannot access them.
Could someone help me please?
Thank you in advance
I'm not sure exactly how your data is stored/accessed but you could create a dictionary to pair your list items with each dataframe as follows:
import numpy as np
import pandas as pd

list_ = ['Group 1', 'Group 2', 'Group 3']
dataframe_dict = {}
for i in list_:
    data = np.random.rand(3, 3)  # create data for dataframe here
    dataframe_dict[i] = pd.DataFrame(data, columns=["your_column_one", "two", "etc"])
You can then retrieve each dataframe by using its group name as the dictionary key:
for key in dataframe_dict.keys():
    print(key)
    print(dataframe_dict[key])

Web2Py - using starred expression for rendering HTML table

This question is an extension of: Web2Py - rendering AJAX response as HTML table
Basically, I come up with a dynamic list of response rows that I need to display on the UI as an HTML table.
Essentially the code looks like this,
response_results = []
row_one = ['1', 'Col 11', 'Col 12', 'Col 13']
response_results.append(row_one)
row_two = ['2', 'Col 21', 'Col 22', 'Col 23']
response_results.append(row_two)
html = DIV(TABLE(THEAD(TR(TH('Row #'), TH('Col 1'), TH('Col 2'), TH('Col 3')),
_id=0), TR([*response for response in response_results]),
_id='records_table', _class='table table-bordered'),
_class='table-responsive')
return html
When I use this kind of code: TR([request.vars[input] for input in inputs]) or TR(*the_list), it works fine.
However, I have come up with a need to use a hybrid of these two i.e. TR([*response for response in response_results]). But, it fails giving an error message:
"Python version 2.7 does not support this syntax. Starred expressions are not allowed as assignment targets in Python 2."
When I run this code instead i.e. without a '*': TR([response for response in response_results]) it runs fine but puts all the columns of my row together in the first column of the generated HTML table, leaving all other columns blank.
Can someone kindly help me resolve this issue and show how I can display each column of each row in its proper place in the generated HTML table?
You need to generate a TR for each item in response_results, which means you need a list of TR elements with which you can then use Python argument expansion (i.e., the * syntax) to treat each TR as a positional argument to TABLE.
html = DIV(TABLE(THEAD(TR(TH('Row #'), TH('Col 1'), TH('Col 2'), TH('Col 3')), _id=0),
*[TR(response) for response in response_results],
_id='records_table', _class='table table-bordered'),
_class='table-responsive')
Note, because each response is itself a list, you could also use argument expansion within the TR:
*[TR(*response) for response in response_results]
But that is not necessary, as TR optionally takes a list, converting each item in the list into a table cell.
Another option is to make response_results a list of TR elements, starting with the THEAD element, and then just pass that list to TABLE:
response_results = [THEAD(TR(TH('Row #'), TH('Col 1'), TH('Col 2'), TH('Col 3')), _id=0)]
row_one = ['1', 'Col 11', 'Col 12', 'Col 13']
response_results.append(TR(row_one))
row_two = ['2', 'Col 21', 'Col 22', 'Col 23']
response_results.append(TR(row_two))
html = DIV(TABLE(response_results, _id='records_table', _class='table table-bordered'),
_class='table-responsive')
Again, you could do TABLE(*response_results, ...), but the * is not necessary, as TABLE can take a list of row elements.
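The difference between passing a list and expanding it with * can be seen in plain Python, without web2py. In this sketch, row and table are hypothetical stand-ins that only record their positional arguments; they are not the web2py TR and TABLE helpers:

```python
def row(*cells):
    """Stand-in for a row builder: records the positional arguments it receives."""
    return ("row", cells)


def table(*rows):
    """Stand-in for a table builder: records the positional arguments it receives."""
    return ("table", rows)


response_results = [["1", "Col 11"], ["2", "Col 21"]]

# One row per response, each expanded into separate cell arguments,
# then the list of rows expanded into separate arguments of table.
expanded = table(*[row(*response) for response in response_results])

# Without the outer *, table receives a single argument: the whole list.
unexpanded = table([row(*response) for response in response_results])

print(len(expanded[1]), len(unexpanded[1]))
```

The * is only valid in front of the whole list in a call, which is why [*response for response in response_results] is rejected while *[TR(response) for response in response_results] is accepted.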
