apply lstrip on every element in dataframe of lists - python

In my dataframe each cell is a list of strings, and each string has leading whitespace:
import pandas as pd

a = {'names': [[' Peter', ' Alex'], [' Josh', ' Hans']]}
df = pd.DataFrame(a)
I want to remove the whitespaces.
For a single list I would use
y = []
x = [' ab', ' de', ' cd']
for i in x:
    d = i.strip()
    y.append(d)

print(y)
['ab', 'de', 'cd']
so I tried to construct something similar for a dataframe:
stripped = []
df = pd.DataFrame(a)
for index, row in df.iterrows():
    d = df.names.apply(lambda x: x.lstrip())
    stripped.append(d)
print(stripped)
which returns
'list' object has no attribute 'lstrip'
and if I call
for index, row in df.iterrows():
    d = df.names.str.lstrip()
    stripped.append(d)
print(stripped)
it returns lists of NaN.

This should work:
df['names'] = df['names'].apply(lambda x: [i.strip() for i in x])
Output
           names
0  [Peter, Alex]
1   [Josh, Hans]
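As an alternative sketch (assuming the cells are real Python lists, as above), you can also explode the column, strip with the vectorized .str accessor, and rebuild the lists by index; note that an empty list would come back as [nan] with this approach:
import pandas as pd

df = pd.DataFrame({'names': [[' Peter', ' Alex'], [' Josh', ' Hans']]})
# one string per row after explode; strip it, then re-aggregate into lists per original index
df['names'] = (df['names'].explode()
                          .str.strip()
                          .groupby(level=0)
                          .agg(list))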

Related

How to delete specific values from a list-column in pandas

I've used POS tagging (in German, so nouns are tagged "NN" and "NE") and now I am having trouble extracting the nouns into a new column of the pandas dataframe.
Example:
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
df
df["nouns"] = df["tagged"].apply(lambda x: [word for word, tag in x if tag in ["NN", "NE"]])
Results in the following error message: "ValueError: too many values to unpack (expected 2)"
I think the code would work if I were able to delete the first value of each tagged tuple, but I cannot figure out how to do that.
Because the tuples have 3 values, unpack them into three variables word1, word2, and tag:
df["nouns"] = df["tagged"].apply(lambda x: [word2 for word1, word2, tag
                                            in x if tag in ["NN", "NE"]])
Or use the same logic in a list comprehension:
df["nouns"] = [[word2 for word1, word2, tag in x if tag in ["NN", "NE"]]
               for x in df["tagged"]]
print(df)
                                         tagged          nouns
0        [(waffe, Waffe, NN), (haus, Haus, NN)]  [Waffe, Haus]
1  [(groß, groß, ADJD), (bereich, Bereich, NN)]      [Bereich]
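If the tagged lists can mix tuple lengths (which is exactly what raises "too many values to unpack"), a sketch that indexes positions instead of unpacking is more forgiving; it assumes the word form is always the second element and the tag the last:
df["nouns"] = df["tagged"].apply(
    lambda x: [t[1] for t in x if t[-1] in ("NN", "NE")]
)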
I think it would be easier with a function call. This creates a list of the NN or NE tags found in each row. If you would like to deduplicate, you need to update the function.
data = {"tagged": [[("waffe", "Waffe", "NN"), ("haus", "Haus", "NN")], [("groß", "groß", "ADJD"), ("bereich", "Bereich", "NN")]]}
df = pd.DataFrame(data=data)
# function
def getNoun(obj):
    ret = []                     # declare empty list as default value
    for l in obj:                # iterate over the word tuples in the row
        for tag in l:            # iterate over the elements of each tuple
            if tag in ['NN', 'NE']:
                ret.append(tag)  # add to return list
    return ret

# create the new column
df['noun'] = df['tagged'].apply(getNoun)
# result
print(df['noun'])
# output:
# 0    [NN, NN]
# 1        [NN]
# Name: noun, dtype: object
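If you wanted the noun word forms themselves, deduplicated, a small variation of the same function could look like this (a sketch assuming 3-tuples as in the sample data; getNounWords is a made-up name):
def getNounWords(obj):
    ret = []
    for word1, word2, tag in obj:    # assumes 3-tuples as in the sample data
        if tag in ('NN', 'NE') and word2 not in ret:
            ret.append(word2)        # keep the first occurrence only
    return ret

df['noun_words'] = df['tagged'].apply(getNounWords)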

Filtering a column on pandas based on a string

I'm trying to filter a column in pandas based on a string, but the issue I'm facing is that the rows are lists, not plain strings.
A small example of the column
tags
['get_mail_mail']
['app', 'oflline_hub', 'smart_home']
['get_mail_mail', 'smart_home']
['web']
[]
[]
['get_mail_mail']
and I'm using this
df[df["tags"].str.contains("smart_home", case=False, na=False)]
but it's returning an empty dataframe.
You can explode, then compare and aggregate with groupby.any:
m = (df['tags'].explode()
               .str.contains('smart_home', case=False, na=False)
               .groupby(level=0).any()
     )
out = df[m]
Or join each list into a single delimited string and use str.contains:
out = df[df['tags'].agg('|'.join).str.contains('smart_home')]
Or use a list comprehension:
out = df[[any(s == 'smart_home' for s in l) for l in df['tags']]]
output:
tags
1 [app, oflline_hub, smart_home]
2 [get_mail_mail, smart_home]
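Note that str.contains does substring matching, so a tag like 'smart_home_v2' would also match. For exact membership in each list, a simple map with the in operator is enough (a sketch, assuming the cells are real Python lists rather than stringified ones):
out = df[df['tags'].map(lambda l: 'smart_home' in l)]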
You could try:
# define list of search patterns
pattern = ["smart_home"]

df.loc[df.apply(lambda x: any(m in str(v)
                              for v in x.values
                              for m in pattern),
                axis=1)]
Output
tags
-- ------------------------------------
1 ['app', 'oflline_hub', 'smart_home']
2 ['get_mail_mail', 'smart_home']

Pandas: replace string with special characters

I have a dataframe (see below) in which two columns contain either a list of patients or a list holding a single empty string (like this: ['']). I want to remove the empty lists.
What I have:
Homozygous_list       heterozygous_list
[Patient1,Patient2]   ['']
['']                  [Patient1]
What I want:
Homozygous_list       heterozygous_list
[Patient1,Patient2]
                      [Patient1]
I tried several things like:
variants["Homozygous_list"].replace("['']","", regex=True, inplace=True)
or
variants["Homozygous_list"].replace("\[''\]","", regex=True, inplace=True)
or
variants["Homozygous_list"] = variants["Homozygous_list"].replace("['']","", regex=True)
etc., but nothing seems to work.
If you really have lists of strings, use applymap:
df = df.applymap(lambda x: '' if x==[''] else x) # or pd.NA in place of ''
output:
        Homozygous_list heterozygous_list
0  [Patient1, Patient2]
1                              [Patient1]
used input:
df = pd.DataFrame({'Homozygous_list': [['Patient1', 'Patient2'], ['']],
                   'heterozygous_list': [[''], ['Patient1']]})
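As a side note, DataFrame.applymap was deprecated in pandas 2.1 in favor of the element-wise DataFrame.map, so on recent versions the equivalent call would be:
df = df.map(lambda x: '' if x == [''] else x)  # pandas >= 2.1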

Use list comprehension to create a list of tuples for two different conditionals

Is there a way to use a list comprehension to create a list of tuples with two different conditions?
I am iterating through a pandas DataFrame and I want to return an entire row as a tuple if it matches either condition. The first is if the DataFrame has NaN values in any column.
The other is if a column in the DataFrame called ODFS_FILE_CREATE_DATETIME doesn't match the regex pattern for the date column. The date column is supposed to look like this: 2005242132, i.e. 10 digits. So if the df returns something like 2004dg, it should be picked up as an error and the row should be added to my list of tuples.
My sad pathetic attempt:
[tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values or x in odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME'])) ]
Full function that contains the two separate lists of tuples:
def process_csv_formatting(csv):
    odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
    odfscsv_df['CSV_FILENAME'] = csv.name
    odfscdate_re = re.compile(r"\d{10}")
    #print(odfscsv_df)
    #odfscsv_df = odfscsv_df.replace('', np.nan)
    errortup = [(odfsname, "Bad_ODFS_FILE_CREATE_DATETIME= " + str(cdatetime), csv.name) for odfsname, cdatetime in zip(odfscsv_df['ODFS_LOG_FILENAME'], odfscsv_df['ODFS_FILE_CREATE_DATETIME']) if not odfscdate_re.search(str(cdatetime))]
    emptypdf = pd.DataFrame(columns=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
    print([tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values])
    [tuple(x) for x in odfscsv_df[odfscsv_df.isna().any(1)].values or x in odfscdate_re.search(str(odfscsv_df['ODFS_FILE_CREATE_DATETIME']))]
    #print(odfscsv_df[(odfscsv_df[column_name].notnull()) & (odfscsv_df[column_name] != u'')].index)
    for index, row in odfscsv_df.iterrows():
        #print((row['WAFER_SCRIBE']))
        print((row['ODFS_FILE_CREATE_DATETIME']))
    #errortup = [x for x in odfscsv_df['ODFS_FILE_CREATE_DATETIME']]
    if len(errortup) != 0:
        #print(errortup) #put this in log file statement somehow
        #print(errortup[0][2])
        return emptypdf
    else:
        return odfscsv_df
Sample CSV data. The commas delineate the cells:
2005091432_943SK1J.00J.SK1J-23.FPD.FMGN520.Jx6D36ny5EO53qAtX4.log,,W943SK10,MGN520,0Z0RK072TCD2
2005230137_014SF1J.00J.SF1J-23.WCPC.FMGN520.XlwHcgyP5eFCpZm5cf.log,,W014SF10,MGN520,DM4MU129SEC1
2005240909_001914J.E0J.914J-15.WRO3PC.FMGN520.nZKn7OvjGKw1i4pxiu.log,,K001914E,MGN520,DM3FZ226SEE3
2005242132_001914J.E0J.914J-15.WRO4PC.FMGN520.V8dcLhEgygRj2rP2Df.log,2005242132,K001914E,MGN520,DM3FZ226SEE3
2005251037_001914J.E0J.914J-15.WRO4PC.FMGN520.dyixmQ5r4SvbDFkivY.log,2005251037,K001914E,MGN520,DM3FZ226SEE3
2005251215_949949J.E0J.949J-21.WRO2PP.FMGN520.yp1i4e7a7D1ighkdB7.log,2005251215,K949949E,MGN520,DG2KV122SEF6
2005251231_949949J.E0J.949J-25.WRO2PP.FMGN520.oLQGhc2whAlhC3dSuR.log,2005251231,K949949E,MGN520,DG2KV333SEF3
2005260105_001914J.E0J.914J-15.WRO4PC.FMGN520.wOQMUOfZgkQK9iHJS5.log,2005260105,K001914E,MGN520,DM3FZ226SEE3
2006111130_950909J.00J.909J-22.FPC.FMGN520.UuqeGtw9xP6lLDUW9N.log,2006111130,K9509090,MGN520,DG7LW031SEE7
2006111612_950909J.00J.909J-22.FPC.FMGN520.hoDl3QSNPKhcs4oA2N.log,2006111612,K9509090,MGN520,DG7LW031SEE7
2006120638_006914J.E0J.914J-15.CZPC.FMGN520.qCgFUH2H21ieT641i9.log,2006120638,K006914E,MGN520,DM8KJ568SEC3
2006122226_006914J.E0J.914J-15.CZPC.FMGN520.nSHSp7klxjrQlVTcCu.log,2006122226,K006914E,MGN520,DM8KJ568SEC3
2006130919_006914J.E0J.914J-15.CZPC.FMGN520.Zd6DrMUsCjuEVBFwvn.log,2006130919,K006914E,MGN520,DM8KJ568SEC3
2006140457_007911J.E0J.911J-25.RDR2PC.FMGN520.QPX9r59TnXObXyfibv.log,2006140457,K007911E,MGN520,DN4AU351SED1
2006141722_007911J.E0J.911J-25.WCPC.FMGN520.dNQLkvQlPTplEjJspB.log,2006141722,K007911E,MGN520,DN4AU351SED1
2006160332_007911J.E0J.911J-25.WCPC.FMGN520.DQiH82Ze9fCoaLVbDE.log,2006160332,K007911E,MGN520,DN4AU351SED1
2006170539_007911J.E0J.911J-25.WCPC.FMGN520.TjakhXkmhmlGhfLheo.log,2006170539,K007911E,MGN520,DN4AU351SED1
Add the dtype parameter to read 'ODFS_FILE_CREATE_DATETIME' as a string when you call read_csv:
odfscsv_df = pd.read_csv(csv, header=None,
                         names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'],
                         dtype={'ODFS_FILE_CREATE_DATETIME': str})

m1 = odfscsv_df.isna().any(axis=1)
s = odfscsv_df['ODFS_FILE_CREATE_DATETIME']
m2 = ~s.astype(str).str.isnumeric()
m3 = s.astype(str).str.len().ne(10)

[tuple(x) for x in odfscsv_df[m1 | m2 | m3].values]
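To illustrate why the dtype matters and how the masks compose, here is a minimal sketch with made-up rows (without dtype=str, a fully numeric column would be read as integers, defeating the string checks):
import pandas as pd

demo = pd.DataFrame({
    'ODFS_LOG_FILENAME': ['a.log', 'b.log', 'c.log'],
    'ODFS_FILE_CREATE_DATETIME': ['2005242132', '2004dg', None],
})
s = demo['ODFS_FILE_CREATE_DATETIME']
m1 = demo.isna().any(axis=1)         # row 2: missing datetime
m2 = ~s.astype(str).str.isnumeric()  # row 1 ('2004dg') and row 2 ('None')
m3 = s.astype(str).str.len().ne(10)  # anything that is not exactly 10 characters
print([tuple(x) for x in demo[m1 | m2 | m3].values])
# -> the b.log and c.log rows are flagged as errors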

Python remove uppercase and empty elements from dataframe

I am new to dealing with lists in a data frame. I have a data frame with one column that contains list-like values. I am trying to remove empty and all-uppercase elements from this column. Here is what I tried; what am I missing in this code?
Data csv:
id,list_col
1,"['',' books','PARAGRAPH','ISBN number','Harry Potter']"
2,"['',' books','TESTS','events 1234','Harry Potter',' 1 ']"
3,
4,"['',' books','PARAGRAPH','','PUBLISHES number','Garden Guide', '']"
5,"['',' books','PARAGRAPH','PUBLISHES number','The Lord of the Rings']"
Code:
df = pd.read_csv('sample.csv')
# (1) # trying to remove empty list but not working
df['list_col'] = list(filter(None, [w[2:] for w in df['list_col'].astype(str)]))
df['list_col']
# (2) remove upper case elements in the dataframe
#AttributeError: 'map' object has no attribute 'upper'
df['list_col'] = [t for t in (w for w in df['list_col'].astype(str)) != t.upper()]
Output I'm looking for:
id list_col
1 [' books','ISBN number','Harry Potter']
2 [' books','events 1234','Harry Potter',' 1 ']
3
4 [' books','PUBLISHES number','Garden Guide']
5 [' books','PUBLISHES number','The Lord of the Rings']
When pandas loads your CSV, it loads each list as a quoted string; that string can be converted back into a Python list with eval, and then re.match can drop the uppercase elements.
Code:
import pandas as pd
from re import compile

regex = compile('^[A-Z]*$')  # matches empty strings and all-uppercase tokens
df = pd.read_csv(r'./input.csv')
null_indices = df.loc[:, 'list_col'].isna()
df.loc[~null_indices, 'list_col'] = df.loc[~null_indices, 'list_col']\
    .apply(lambda x: eval(x))\
    .apply(lambda y: list(filter(lambda z: regex.match(z) is None, y))
           if isinstance(y, list) else list())
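Since eval executes arbitrary code, a safer sketch uses ast.literal_eval; str.isupper can also replace the regex (this assumes the column holds stringified Python lists, as in the sample CSV):
import pandas as pd
from ast import literal_eval

df = pd.read_csv('./input.csv')
mask = df['list_col'].notna()
df.loc[mask, 'list_col'] = df.loc[mask, 'list_col'].apply(
    # drop empty strings and all-uppercase tokens such as 'PARAGRAPH'
    lambda s: [e for e in literal_eval(s) if e and not e.isupper()]
)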
