I am trying to strip everything after 'H' from the values and store the result in a column.
df['col1'] = df['col1'].str.split('H').str[0]
But PySpark gives me an error: 'Column' object is not callable
One possible solution is to add expand=True so that split returns a DataFrame, and then select the second column:
df['col1'] = df['col1'].str.split('H', expand=True).iloc[:, 1]
Or:
df['col1'] = df['col1'].str.split('H', expand=True)[1]
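Note that the quoted error ('Column' object is not callable) comes from PySpark, where the pandas .str accessor does not exist. If df is actually a PySpark DataFrame, a rough equivalent is pyspark.sql.functions.split (a sketch; pick the index that matches the piece you want):
from pyspark.sql.functions import split

df = df.withColumn('col1', split(df['col1'], 'H').getItem(1))  # getItem(0) keeps the part before 'H'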
I have a column in my df whose values end with one of ['-A','-B','-T','-Z','-EQ','-BE','-BL','-BT','-GC','-IL','-IQ'], and I need to remove those suffixes.
I tried the below and got an error:
df['name'] = df['name'].str.replace(['-A','-B','-T','-Z','-EQ','-BE','-BL','-BT','-GC','-IL','-IQ'],'', regex=True)
TypeError: unhashable type: 'list'
Use Series.replace instead of Series.str.replace:
df['name'] = df['name'].replace(['-A','-B','-T','-Z','-EQ','-BE','-BL','-BT','-GC','-IL','-IQ'],'', regex=True)
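If you prefer to stay with Series.str.replace, an alternative (a sketch, assuming the suffixes only ever appear at the end of the string) is to join the patterns into a single regex anchored at the end:
import pandas as pd

# hypothetical sample data, for illustration only
df = pd.DataFrame({'name': ['ACME-EQ', 'FOO-BE', 'BAR-A']})

suffixes = ['-A','-B','-T','-Z','-EQ','-BE','-BL','-BT','-GC','-IL','-IQ']
pattern = '(' + '|'.join(suffixes) + ')$'   # anchor the alternation to the end of the string
df['name'] = df['name'].str.replace(pattern, '', regex=True)
# name is now ['ACME', 'FOO', 'BAR']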
I am trying to add empty columns to a dataframe df1 for every column of a second dataframe df2 that is not already in df1. So, given
df2.columns = ['a', 'b', 'c', 'd']
df1.columns = ['a', 'b']
I would like to add columns with names 'c' and 'd' to dataframe df1.
For performance reasons, I would like to avoid a loop with multiple withColumn() statements:
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

for column in df2.columns:
    if column not in df1.columns:
        df1 = df1.withColumn(column, lit(None).cast(StringType()))
My first attempt
df1 = df1.select(col('*'),
                 (lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns))
is throwing an error
TypeError: Invalid argument, not a string or column: <generator object
myfunction.. at 0x7f60e2bcc8e0> of type <class
'generator'>. For column literals, use 'lit', 'array', 'struct' or
'create_map' function.
You first need to materialize the generator into a list (for example with list() or a list comprehension), and then unpack it when passing it to select():
df1.select(col('*'),
           *[lit(None).alias(col_name) for col_name in df2.columns if col_name not in df1.columns])
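For reference, here is a minimal end-to-end sketch of that approach (the DataFrame contents are made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 2)], ['a', 'b'])
df2 = spark.createDataFrame([(1, 2, 3, 4)], ['a', 'b', 'c', 'd'])

# add .cast('string') inside the comprehension to mirror the withColumn version above
missing = [lit(None).alias(c) for c in df2.columns if c not in df1.columns]
df1 = df1.select(col('*'), *missing)
df1.printSchema()  # df1 now also has nullable columns 'c' and 'd'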
I have a dataframe with 22 columns and 65 rows. The data comes in from a CSV file.
Each value in the dataframe has an extra unwanted whitespace character. So if I loop over the 'Year' column with len() I get:
2019 5
2019 5
2018 5
...
This one extra whitespace character appears in every value throughout the DataFrame. I tried running .strip() on the DataFrame, but no such attribute exists.
I tried looping over the columns with df[column].str.strip(), but the columns hold various data types (dtypes: float64(6), int64(4), object(14)), so this errors.
Any ideas on how to apply a function to the entire DataFrame, and if so, which function/method? If not, what is the best way to handle this?
Handle the error:
for col in df.columns:
    try:
        df[col] = df[col].str.strip()
    except AttributeError:
        pass
Normally, I'd say select the object dtypes, but that can still be problematic if the data are messy enough to store numeric data in an object container.
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': ['seven ']*3})
df['foo2'] = df.foo.astype(object)

for col in df.select_dtypes('object'):
    df[col] = df[col].str.strip()
# AttributeError: Can only use .str accessor with string values!
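One way around that pitfall is to only strip the cells that are actually strings and leave every other value alone. A sketch using element-wise DataFrame.applymap (on pandas 2.1+ the equivalent method is DataFrame.map):
# numbers hidden in object columns (like 'foo2' above) pass through untouched
df = df.applymap(lambda v: v.strip() if isinstance(v, str) else v)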
You should use the apply() function to do this:
df['Year'] = df['Year'].apply(lambda x: x.strip())
You can apply this function to each column separately:
for column in df.columns:
    df[column] = df[column].apply(lambda x: x.strip())
Try this:
for column in df.columns:
    df[column] = df[column].apply(lambda x: str(x).replace(' ', ''))
Why not try this?
for column in df.columns:
    df[column] = df[column].apply(lambda x: str(x).strip())
I want the following to happen:
for every column in df, check if its type is numeric; if not, use a label encoder to map the str/object values to numeric classes (e.g. 0, 1, 2, 3, ...).
I am trying to do it in the following way:
for col in df:
    if not np.issubdtype(df[col].dtype, np.number):
        df[col] = LabelEncoder().fit_transform(df[col])
I see a few problems here.
First, column names can repeat, and thus df[col] returns more than one column, which is not what I want.
Second, df[col].dtype throws an error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume might arise from issue #1, i.e. multiple columns being returned, but I am not confident.
Third, would assigning df[col] = LabelEncoder().fit_transform(df[col]) substitute the column in df, or should I do some esoteric df partitioning and concatenation?
Thank you
Since LabelEncoder supports only one column at a time, iteration over columns is your only option. You can make this a little more concise using select_dtypes to select the columns, and then df.apply to apply the LabelEncoder to each column.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))
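For illustration, here is what either variant does on a small made-up frame (assumes scikit-learn is installed; the column names are hypothetical):
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'city': ['NY', 'LA', 'NY'], 'temp': [20.1, 25.3, 18.7]})
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
print(df)
#    city  temp
# 0     1  20.1
# 1     0  25.3
# 2     1  18.7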
I have a dataframe with the following columns:
Index([u'PRODUCT',u'RANK', u'PRICE', u'STARS', u'SNAPDATE', u'CAT_NAME'], dtype='object')
For each product of that dataframe I can have NaN values for a specific date.
The goal is to replace for each product the NaN values by the mean of the existing values.
Here is what I tried without success:
for product in df['PRODUCT'].unique():
    df = df[df['PRODUCT'] == product]['RANK'].fillna((df[df['PRODUCT'] == product]['RANK'].mean()), inplace=True)
    print df
gives me:
TypeError: 'NoneType' object has no attribute '__getitem__'
What am I doing wrong?
You can use groupby to create a mean series:
s = df.groupby('PRODUCT')['RANK'].mean()
Then use this series to fillna values:
df['RANK'] = df['RANK'].fillna(df['PRODUCT'].map(s))
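On a small made-up example (values are illustrative only), that looks like:
import numpy as np
import pandas as pd

df = pd.DataFrame({'PRODUCT': ['a', 'a', 'b', 'b'],
                   'RANK': [1.0, np.nan, 3.0, np.nan]})
s = df.groupby('PRODUCT')['RANK'].mean()
df['RANK'] = df['RANK'].fillna(df['PRODUCT'].map(s))
# RANK is now [1.0, 1.0, 3.0, 3.0]
An equivalent one-liner builds the fill values with transform:
df['RANK'] = df['RANK'].fillna(df.groupby('PRODUCT')['RANK'].transform('mean'))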
You're getting this error because of your use of inplace in fillna. Unfortunately, the documentation there is wrong:
Returns: filled : Series
This shows otherwise, though:
>>> df = pd.DataFrame({'a': [3]})
>>> type(df.a.fillna(6, inplace=True))
NoneType
>>> type(df.a.fillna(6))
pandas.core.series.Series
So when you assign
df = df[df['PRODUCT'] == product]['RANK'].fillna((df[df['PRODUCT'] == product]['RANK'].mean()), inplace=True)
you're assigning df = None, and the next iteration fails with the error you get.
You can omit the assignment df =, or, better yet, use the other answer.
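If you do want to keep the loop, here is a sketch of it rewritten without inplace= and without chained indexing, so the filled values actually land back in df:
for product in df['PRODUCT'].unique():
    mask = df['PRODUCT'] == product
    df.loc[mask, 'RANK'] = df.loc[mask, 'RANK'].fillna(df.loc[mask, 'RANK'].mean())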