Changing string values - python

I have the following list:
weather = ['sunny', 'foggy', 'cloudy', 22]
I would like to use it to build the schema string sche for a Spark DataFrame in the following way:
sche = "f'status_{weather[0]}_today' string, f'temprature_{weather[3]_today}' int"
So that in the end I get two columns in my new dataframe, as follows:
First column: status_sunny_today
Second column: temprature_22_today
But when I run the code it raises an error and does not recognize the format of sche above. If I just print sche, it returns: f'status_{weather[0]}_today' string, f'temprature_{weather[3]_today}' int

This is the correct way to use an f-string in Python:
sche = f"status_{weather[0]}_today string, temprature_{weather[3]}_today int"
Put the f before the whole string, not inside it, and keep only the expression (here, the list index) inside the braces.
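As a minimal sketch, the whole schema string can be built with a single f-string (the question's spelling temprature is kept so the column names match the desired output):

```python
# Build the Spark DDL schema string with one f-string.
weather = ['sunny', 'foggy', 'cloudy', 22]

# The list index goes inside the braces; the rest of the column name stays outside.
sche = f"status_{weather[0]}_today string, temprature_{weather[3]}_today int"
print(sche)  # status_sunny_today string, temprature_22_today int
```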

Related

How to convert a string type column to list type in pandas dataframe?

I have a dataframe as below,
df = pd.DataFrame({'URL_domains':[['wa.me','t.co','goo.gl','fb.com'],['tinyurl.com','bit.ly'],['test.in']]})
Here the column URL_domains has three observations, each holding a list of domains.
I would like to know the length of each observation's URL domain list:
df['len_of_url_list'] = df['URL_domains'].map(len)
That works fine for the case above, where the column holds actual lists.
In my case, however, the list observations are stored as strings, so map(len) returns the length of each string rather than the number of domains.
How do I convert the column from string type to list type in pandas?
Use ast.literal_eval, because eval is bad practice:
import ast
df['URL_domains'] = df['URL_domains'].map(ast.literal_eval)
df['len_of_url_list'] = df['URL_domains'].map(len)
Avoid df["URL_domains"].apply(eval): it gives the same result here, but eval will execute any code hidden in the strings.

How to split a column twice using Pandas?

I have a column additional_data whose values look like {'duration': 0, 'is_incoming': False}
I want to fetch 0 and False out of this. How do I split it using Python (pandas)?
I tried - data["additional_data"] = data["additional_data"].apply(lambda x :",".join(x.split(":")[:-1]))
I want two columns Duration and Incoming_Time
How do I do this?
You can try converting those strings to actual dicts:
from ast import literal_eval
Finally:
out = pd.DataFrame(df['additional_data'].astype(str).map(literal_eval).tolist())
Now if you print out, you will get your expected output.
If needed, use the join() method:
df = df.join(out)
Now if you print df, you will get your expected result.
If your column additional_data contains real dict / json, you can directly use the string accessor .str[] to get the dict values by keys, as follows:
data['Duration'] = data['additional_data'].str['duration']
data['Incoming_Time'] = data['additional_data'].str['is_incoming']
If your column additional_data contains strings of dict (enclosing dict with a pair of single quotes or double quotes), you need to convert the string to dict first, by:
from ast import literal_eval
data['Duration'] = data['additional_data'].map(literal_eval).str['duration']
data['Incoming_Time'] = data['additional_data'].map(literal_eval).str['is_incoming']
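A runnable sketch of the literal_eval route; a second sample row is assumed here for illustration:

```python
from ast import literal_eval

import pandas as pd

data = pd.DataFrame({'additional_data': ["{'duration': 0, 'is_incoming': False}",
                                         "{'duration': 5, 'is_incoming': True}"]})

# Parse each string into a dict, then pull the values out by key with .str[].
parsed = data['additional_data'].map(literal_eval)
data['Duration'] = parsed.str['duration']
data['Incoming_Time'] = parsed.str['is_incoming']
print(data['Duration'].tolist())  # [0, 5]
```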

ValueError while trying to check for a "W" in dataset

Dataset
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
Code and Error Message
The variable combined.WINorLOSS is a pandas Series. Comparing a Series to a string does not yield a single True/False but a whole boolean Series, which raises a ValueError when used as an if condition. I think you meant to do:
for i in combined.WINorLOSS:
    if i == 'W':
        hteamw += 1
    else:
        ateamw += 1
You can't compare a Series of values (like your WINorLOSS dataframe column) to a single string value in an if condition. However, you can use the following to count the 'W' and 'L' values in your column:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']
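A small sketch with toy data (the real column comes from the question's dataset):

```python
import pandas as pd

combined = pd.DataFrame({'WINorLOSS': ['W', 'L', 'W', 'W', 'L']})

# value_counts tallies each distinct value in the column.
counts = combined['WINorLOSS'].value_counts()
hteamw = counts['W']
hteaml = counts['L']
print(hteamw, hteaml)  # 3 2

# Equivalent alternative: sum a boolean mask.
assert (combined['WINorLOSS'] == 'W').sum() == hteamw
```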

Convert dataframe with whitespaces to numeric, obstacle - whitespaces (e.g. 3 014.0 i.e. '3\xa0014.0')

I have a dataframe where, instead of the expected numerical values, data of type "object" were stored. The values look like 3 014.0, i.e. '3\xa0014.0', instead of 3014.0; the whitespace characters ('\xa0', non-breaking spaces) create a problem for conversion.
Question: Is there some way to convert it to numeric?
Strange thing: it appears that I can convert a single element:
float( df.iloc[0,0].replace('\xa0', '') )  # works
but the same does NOT work for the whole series:
df['p1'].astype('str').replace('\xa0', '')  # does nothing
I also tried pd.to_numeric, which gives: Unable to parse string.
Data example:
df.iloc[0:3,0]
2017-10-10 11:32:49.895023 3 014.0
2017-10-10 11:33:11.612169 3 013.5
2017-10-10 11:33:22.488124 3 013.0
Name: p1, dtype: object
df.iloc[0,0]:
'3\xa0014.0'
Use this instead: df['p1'] = df['p1'].apply(lambda x: float(x.replace('\xa0','')))
df.iloc[0,0] is a string, while df['p1'] is a pandas Series. The replace method of a string is different from that of a Series: when you call replace on a Series, pandas attempts to replace whole elements, not substrings within them.
For example,
df = pd.DataFrame({'name': ['alexander']})
df['name'].replace('a', 'x')  # does nothing: no element equals 'a'
df['name'].replace('alexander', 'x')  # replaces the element 'alexander' with 'x'
df['p1'].apply(lambda x: float(x.replace('\xa0',''))) applies the replace method to each element (which happens to be a string) in the column p1. You can read more about apply in the pandas documentation.
Hope this makes things clearer :)
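The element-wise fix can also be written with the .str accessor, which applies the string replace to every element before numeric conversion (a sketch on the question's sample values):

```python
import pandas as pd

df = pd.DataFrame({'p1': ['3\xa0014.0', '3\xa0013.5', '3\xa0013.0']})

# .str.replace works element-wise on the strings; plain Series.replace
# only substitutes whole matching values.
df['p1'] = pd.to_numeric(df['p1'].str.replace('\xa0', '', regex=False))
print(df['p1'].tolist())  # [3014.0, 3013.5, 3013.0]
```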

spark cannot create LabeledPoint

I always get this error:
AnalysisException: u"cannot resolve 'substring(l,1,-1)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'l' is of array type.;"
Quite confused because l[0] is a string, and matches arg 1.
dataframe has only one column named 'value', which is a comma separated string.
And I want to convert this original dataframe to another dataframe of object LabeledPoint, with the first element to be 'label' and the others to be 'features'.
from pyspark.mllib.regression import LabeledPoint
def parse_points(dataframe):
    df1 = df.select(split(dataframe.value, ',').alias('l'))
    u_label_point = udf(LabeledPoint)
    df2 = df1.select(u_label_point(col('l')[0], col('l')[1:-1]))
    return df2
parsed_points_df = parse_points(raw_data_df)
I think you want to create the LabeledPoint inside the dataframe. You can:
def parse_points(df):
    df1 = df.select(split(df.value, ',').alias('l'))
    # map applies the lambda to each row; on Spark 2+ go through .rdd, since DataFrame has no .map
    df2 = df1.rdd.map(lambda seq: LabeledPoint(float(seq[0][0]), seq[0][1:]))
    return df2.toDF()  # converts the RDD back to a dataframe
parsed_points_df = parse_points(raw_data_df)
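For clarity, the per-row logic of that lambda can be sketched in plain Python (no Spark needed), assuming each row holds one comma-separated string with the label first:

```python
# Plain-Python sketch of what the lambda does to each split row.
def parse_row(value):
    parts = value.split(',')
    label = float(parts[0])                    # first element becomes the label
    features = [float(x) for x in parts[1:]]   # the rest become the features
    return label, features

print(parse_row('1.0,2.0,3.0'))  # (1.0, [2.0, 3.0])
```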
