How to substitute NaN for text in a DataFrame? - python

I have a DataFrame and I need to change the content of the cells of a specific column to a text value (for example "not registered").
I have tried different options; these are some of them:
dftotal.fillna({"Computer_OS":"not registered", "Computer_OS_version":"not registered"}, inplace=True)
dftotal.loc[(dftotal["Computer_OS"]=="NaN"),"Computer_OS"] = "not registered"

This assumes that all values in the Computer_OS column are strings; otherwise you would need to change the datatype first.
import numpy as np
import pandas as pd
import re
def txt2nan(x):
    """
    Return NaN if the given string x starts with a letter
    (re.match anchors at the start of the string), otherwise
    return x unchanged.

    Parameters
    ----------
    x : str
    """
    if re.match('[a-zA-Z]', x):
        return np.nan
    else:
        return x
df = pd.DataFrame({"os":["tsd", "ssad d", "sd", "1","2","3"]})
df["os"] = df["os"].apply(txt2nan)
A better solution is to vectorize the above operation:
df["os"] = np.where(df["os"].str.match('[a-zA-Z]'), np.nan, df["os"])

Related

How to remove 'nan' values in a list of string values?

I have a list made of values from various rows; those values have to be converted into strings. The problem is that when a row is empty, a 'nan' value is displayed, which I would like to remove.
My code:
import pandas as pd
test = []
for index, row in df.iterrows():
    x = str(row['Date']) + ' | ' + str(row['Time'])
    test.append(x)
print(test)
I tried multiple things:
import numpy as np
list_clean = test[np.logical_not(np.isnan(test))]
print(list_clean)
Which says:
TypeError: only integer scalar arrays can be converted to a scalar index
I tried:
import math
from numpy import nan
list_clean = [item for item in test if not(math.isnan(item) == False)]
And it says:
TypeError: must be real number, not str
I tried:
list_clean = [item for item in list if not(pd.isnull(item) == True)]
The result is that the list is still displayed with nan values:
06/07/35 | nan
Some help would be very welcome please, thank you.
Try something like the following:
import pandas as pd
def remove_nan(row):
    return [str(x) for x in row if not pd.isnull(x)]
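A usage sketch tying this back to the loop in the question; the Date and Time column names come from the question, and the two sample rows are made up:

import numpy as np
import pandas as pd

def remove_nan(row):
    return [str(x) for x in row if not pd.isnull(x)]

# made-up sample standing in for the real dataframe
df = pd.DataFrame({"Date": ["06/07/35", "07/07/35"], "Time": [np.nan, "12:30"]})

test = [" | ".join(remove_nan(row)) for _, row in df.iterrows()]
print(test)  # ['06/07/35', '07/07/35 | 12:30']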

I want to add an additional column to my df based on multiple conditions on another column; all these conditions involve string matching.

import pandas as pd
name = r"C:\Users\saarif2201\Desktop\Classification\Vidio_Indonesia\VidioDataset-Apr'22.xlsx"
df = pd.read_excel(name)
consist = ["Episode","Ep"]
def cat_marking(x):
    if consist in (x):
        return "Series"
    else:
        return ""
df['Content_Category'] = df['vod_episode_name'].apply(cat_marking)
I have written this to add a column named Content_Category and mark some of its rows as "Series" based on the condition (that it contains Episode or Ep as a substring) on the vod_episode_name column.
I can't access the .xlsx file you posted, but I think that the following code should work for you if you want to apply your function to the vod_episode_name column.
import pandas as pd
name = r"C:\Users\saarif2201\Desktop\Classification\Vidio_Indonesia\VidioDataset-Apr'22.xlsx"
df = pd.read_excel(name)
consist = ["Episode","Ep"]
def cat_marking(x):
    if any(ext in x for ext in consist):
        return "Series"
    else:
        return ""
df['Content_Category'] = df['vod_episode_name'].apply(lambda x: cat_marking(x))
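As an alternative to apply, the same marking can be done with pandas' vectorized str.contains. A sketch with made-up sample data, since the Excel file is not available (na=False keeps missing episode names out of the match):

import pandas as pd

# made-up sample standing in for the Excel file
df = pd.DataFrame({"vod_episode_name": ["Show A Episode 1", "Movie B", "Show C Ep 2"]})
consist = ["Episode", "Ep"]

# build a regex alternation from the keywords and map matches to the category label
pattern = "|".join(consist)
df['Content_Category'] = df['vod_episode_name'].str.contains(pattern, na=False).map({True: "Series", False: ""})
print(df)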

Dataframe with arrays and key-pairs

I have a JSON structure which I need to convert into a data-frame. I have converted it using the pandas library, but I am having issues with two columns: one is an array and the other one is a key-value pair.
Pito Value
{"pito-key": "Number"} [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How can I break these columns out into data-frame columns?
As far as I understood your question, you can apply regular expressions to do that.
import pandas as pd
import re
data = {'pito':['{"pito-key": "Number"}'], 'value':['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)
def get_value(s):
    s = s[1]
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])
def get_pito(s):
    s = s[0]
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]
df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create two functions that transform your scary strings into the values you want them to have.
Let me know if that's not what you meant.
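Since the cells already hold JSON text, a hedged alternative is to parse them with json.loads instead of regular expressions; this sketch assumes the strings are valid JSON exactly as shown above:

import json
import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

# parse the JSON text and pull out the fields directly
df['pito'] = df['pito'].apply(lambda s: json.loads(s)["pito-key"])
df['value'] = df['value'].apply(lambda s: int(json.loads(s)[0]["VALUE"]))
print(df)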

In python pandas, how to get the value of a dynamic series/column data transformation

I am trying to merge two csv files and transform the values in one csv by looking up constant values in the other csv. I am able to get a Series, but not the correct cell value. Can you please suggest?
I am calling the below function while reading the main csv and transforming the language column:
dataDF['language'] = dataDF['language'].apply(translateLanguagetest)

def translateLanguagetest(keystring):
    print("keystring" + keystring)
    ref_Data_File = Path('C:\sampletest') / "constant.csv"
    refDataDF = pd.read_csv(ref_Data_File)
    refDataDF['refKey'] = refDataDF['sourcedomain'] + "#" + refDataDF['value'] + "#" + refDataDF['targetdomain']
    refDataDF['refValue'] = refDataDF['target']
    modRef = refDataDF['refValue'].where(refDataDF['refKey'] == 'languageSRC#' + keystring + '#languagetarget')
    print("modRef: " + modRef)
    cleanedRef = modRef.dropna()
    f(cleanedRef)
    print(cleanedRef)
    value = cleanedRef.loc[('refValue')]
    return value
The contents of constant.csv is
value,sourcedomain,targetdomain,target
ita,languageSRC,languagetarget,it
eng,languageSRC,languagetarget,en
Got the solution, and it was a simple one. Being new to Python, it took me some time to find the answer. I now read the constants csv up front and pass the constants dataframe as a parameter to the method that transforms the column value.
import unittest
from pathlib import Path
import pandas as pd

class AdvancedTestSuite(unittest.TestCase):
    """Advanced test cases."""

    def test_transformation(self):
        data_File = Path('C:\Test_python\stackflow') / "data.csv"
        data_mod_File = Path('C:\Test_python\stackflow') / "data_mod.csv"
        dataDF = pd.read_csv(data_File)
        ref_Data_File = Path('C:\Test_python\stackflow') / "constant.csv"
        refDataDF = pd.read_csv(ref_Data_File)
        refDataDF['refKey'] = refDataDF['sourcedomain'] \
            + "#" + refDataDF['value'] + "#" + refDataDF['targetdomain']
        refDataDF['refValue'] = refDataDF['target']
        dataDF['language'] = dataDF['language'].apply(
            lambda x: translateLanguagetest(x, refDataDF))
        dataDF['gender'] = dataDF['gender'].apply(
            lambda x: translateGendertest(x, refDataDF))
        dataDF.to_csv(data_mod_File, index=False)

def translateLanguagetest(keystring, refDataDF):
    print("keystring" + keystring)
    modRef = refDataDF['refValue'].where(refDataDF['refKey'] ==
        'languageSRC#' + keystring + '#languagetarget')
    # drop the NaN rows; modRef is a pandas Series
    cleanedRef = modRef.dropna()
    # after clean-up only one row remains, so item() selects the single value
    value = cleanedRef.item()
    return value

def translateGendertest(keystring, refDataDF):
    print("keystring" + keystring)
    modRef = refDataDF['refValue'].where(refDataDF['refKey'] ==
        'genderSRC#' + keystring + '#gendertarget')
    # drop the NaN rows; modRef is a pandas Series
    cleanedRef = modRef.dropna()
    # after clean-up only one row remains, so item() selects the single value
    value = cleanedRef.item()
    return value

if __name__ == '__main__':
    unittest.main()
The data.csv before transformation
Id,language,gender
1,ita,male
2,eng,female
The constant.csv
value,sourcedomain,targetdomain,target
ita,languageSRC,languagetarget,it
eng,languageSRC,languagetarget,en
male,genderSRC,gendertarget,Male
female,genderSRC,gendertarget,Female
The csv after transformation:
Id,language,gender
1,it,Male
2,en,Female
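For what it's worth, the same transformation can be done without a per-row where/dropna lookup by building plain lookup dicts and using Series.map. A sketch with the constant.csv and data.csv contents from above inlined so it runs on its own:

import pandas as pd

# constant.csv contents, inlined for a self-contained example
refDataDF = pd.DataFrame({
    "value": ["ita", "eng", "male", "female"],
    "sourcedomain": ["languageSRC", "languageSRC", "genderSRC", "genderSRC"],
    "targetdomain": ["languagetarget", "languagetarget", "gendertarget", "gendertarget"],
    "target": ["it", "en", "Male", "Female"],
})
# data.csv contents, inlined
dataDF = pd.DataFrame({"Id": [1, 2], "language": ["ita", "eng"], "gender": ["male", "female"]})

# one lookup dict per source domain, then a vectorized map per column
lang_map = refDataDF.loc[refDataDF["sourcedomain"] == "languageSRC"].set_index("value")["target"].to_dict()
gender_map = refDataDF.loc[refDataDF["sourcedomain"] == "genderSRC"].set_index("value")["target"].to_dict()
dataDF["language"] = dataDF["language"].map(lang_map)
dataDF["gender"] = dataDF["gender"].map(gender_map)
print(dataDF)  # Id 1 -> it/Male, Id 2 -> en/Female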

Drop 0 values, NaN values, and empty strings

import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)
filevalues = filevalues.fillna(int(0))
int_series = filevalues.astype(int)
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
So I have hundreds of csv files with many empty spots for values. Some of the blank spaces are detected as NaNs and others are empty strings. This has forced me to structure my code the way it is right now: I need to apply a formula to each value, so I changed all such NaNs and empty strings to 0 so that I could apply the formula (in this example, 1/1.2). The problem is that I do not want to see values that are 0, NaN, or empty strings when printing my dataframe.
I have tried to use the following:
filevalues = filevalues.dropna()
But because certain csv files have empty strings, this method does not fully work, and I get the error:
ValueError: invalid literal for int() with base 10: ' '
I have also tried the following after converting all values to 0:
filevalues = filevalues.loc[:, (filevalues != 0).all(axis=0)]
and
mask = np.any(np.isnan(filevalues) | np.equal(a, 0), axis=1)
Every method seems to be giving different errors. Is there a clean way to not count these types of values when I am printing my pandas dataframe? Please let me know if an example csv file is needed.
Got it to work! Here is the answer if it is of use to anyone.
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(" ", "", regex=True)
filevalues.replace("", np.nan, inplace=True) # replace empty string with np.nan
filevalues.dropna(inplace=True) # drop nan values
int_series = filevalues.astype(int) # change type
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
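An alternative sketch that handles blanks, empty strings, and NaN in one step with pd.to_numeric; since the original 50.csv is not available, a small made-up series stands in for the filtered 'Jul-18\nQty' column:

import numpy as np
import pandas as pd

# made-up stand-in for the filtered quantity column
filevalues = pd.Series(["10", " ", np.nan, "24", "0"])

# coerce anything non-numeric (blanks, empty strings) to NaN, then drop NaN and zeros
numeric = pd.to_numeric(filevalues, errors="coerce")
numeric = numeric[numeric.notna() & (numeric != 0)]
calculated_series = numeric * (1 / 1.2)
print(calculated_series)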
