Iterating over elements in a column and adding a new value - Pandas - Python

When iterating through the elements of a column (Y, Y, nan, Y in my case), for some reason I can't add a new element when a condition is met: if 'Y' is encountered twice in a row, I want to replace the last 'Y' with "encountered", or simply add or overwrite it, since I keep track of the index number.
I have a dataframe:
   col0 col1
1  A    Y
2  B    Y
3  B    nan
4  C    Y
code:
count = 0
for i, e in enumerate(df['col1']):
    if 'Y' in e:
        count += 1
    else:
        count = 0
    if count == 2:
        df['col1'][i] = 'encountered'  # IndexError: list index out of range
error message:
IndexError: list index out of range
Even if I try to directly specify the index of the column cell I would like to add the message to, I get the same error:
code:
df['col1'][1] = 'or this'
Direct example of the main idea:
df['col1'][2] = 'under index 2 in column1 add this msg'
Is it because PyPDF2/utils is interfering?
warning:
File "C:\Users\path\lib\site-packages\PyPDF2\utils.py", line 69, in formatWarning
file = filename.replace("/", "\\").rsplit("\\", 1)[1] # find the file name
error:
IndexError: list index out of range

You can find the last row where col1 equals 'Y' and overwrite it with .loc:
last_index = df[df['col1'] == 'Y'].index[-1]
df.loc[last_index, 'col1'] = 'encountered'

Here's how I would go about solving this:
prev_val = None
# Iterate through rows to utilize the index
for idx, row in df[['col1']].iterrows():
    # unpack the row; a bit more overhead, but highly readable
    val = row['col1']
    # Use the previous value instead of a counter – easier to read and more accurate
    if val == 'Y' and val == prev_val:
        df.loc[idx, 'col1'] = 'encountered'
    # now set the previous value to the current one:
    prev_val = val

The issue with your code is likely the way you are iterating over your dataframe, and also the indexing. Another issue is that you are overwriting the very values you are iterating over; that is bound to give you issues later.
Does this work for you?
import numpy as np

count = 0
df['encounter'] = np.nan
for i in df.itertuples():
    if getattr(i, 'col1') == 'Y':
        count += 1
    else:
        count = 0
    if count == 2:
        df.loc[i[0], 'encounter'] = 'encountered'
print(df)
col0 col1 encounter
0 A Y NaN
1 B Y encountered
2 B NaN NaN
3 C Y NaN
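As an aside, this can also be done without any explicit loop by comparing the column against a shifted copy of itself. A minimal sketch (assuming the same dataframe as above; it flags every 'Y' whose immediate predecessor is also 'Y', which matches the prev_val approach, not the counter):
import pandas as pd

df = pd.DataFrame({'col0': list('ABBC'), 'col1': ['Y', 'Y', None, 'Y']})
# Mark rows where col1 is 'Y' and the previous row's col1 was also 'Y'
mask = (df['col1'] == 'Y') & (df['col1'].shift() == 'Y')
df.loc[mask, 'col1'] = 'encountered'
print(df)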

Related

Data Cleaning with Pandas

I have a dataframe column consisting of text data and I need to filter it according to the following conditions:
The character "M", if present in the string, can only be at the n-2 position.
The n-1 position of the string always has to be a "D".
ex:
KFLL
KSDS
KMDK
MDDL
In this case, for example, I would have to remove the first string, since the character at the n-1 position is not a "D", and the last one, since the character "M" appears outside the n-2 position.
How can I apply this to a whole dataframe column?
Here's a version with a list comprehension:
l = ['KFLL', 'KSDS', 'KMDK', 'MDDL']
[x for x in l if ((('M' not in x) or (x[-3] == 'M')) and (x[-2] == 'D'))]
Output:
['KSDS', 'KMDK']
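Since the question asks about a dataframe column, the same condition can be dropped into Series.apply. A minimal sketch, assuming a column named "code" (the name is just for illustration):
import pandas as pd

df = pd.DataFrame({'code': ['KFLL', 'KSDS', 'KMDK', 'MDDL']})
# Keep rows where M (if any) sits third from the end and D sits second from the end
keep = df['code'].apply(lambda x: (('M' not in x) or (x[-3] == 'M')) and (x[-2] == 'D'))
print(df[keep])  # rows KSDS and KMDK survive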
This does what you want. It could probably be written more compactly with list comprehensions, but at least this is readable. It assumes that the strings all have at least 3 characters; otherwise you get an IndexError, in which case you would need to add a try/except.
from collections import Counter
import pandas as pd

df = pd.DataFrame(data=list(["KFLL", "KSDS", "KMDK", "MDDL"]), columns=["code"])
print("original")
print(df)

mask = list()
for code in df["code"]:
    flag = False
    if code[-2] == "D":
        counter = Counter(list(code))
        if counter["M"] == 0 or (counter["M"] == 1 and code[-3] == "M"):
            flag = True
    mask.append(flag)

df["mask"] = mask
df2 = df[df["mask"]].copy()
df2.drop("mask", axis=1, inplace=True)
print("new")
print(df2)
Output looks like this:
original
code
0 KFLL
1 KSDS
2 KMDK
3 MDDL
new
code
1 KSDS
2 KMDK
Thank you all for your help.
I ended up implementing it like this:
l = {"Sequence": [ 'KFLL', 'KSDS', 'KMDK', 'MDDL', "MMMD"]}
df = pd.DataFrame(data= l)
print(df)
df = df[df.Sequence.str[-2] == 'D']
df = df[~df.Sequence.apply(lambda x: ("M" in x and x[-3]!='M') or x.count("M") >1 )]
print(df)
Output:
Sequence
0 KFLL
1 KSDS
2 KMDK
3 MDDL
4 MMMD
Sequence
1 KSDS
2 KMDK
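For what it's worth, both filters can be collapsed into a single regular expression: at most one 'M', allowed only third from the end, with 'D' second from the end. A sketch assuming the same Sequence column (str.match anchors at the start only, hence the trailing $; na=False drops missing values):
df = df[df.Sequence.str.match(r'[^M]*M?D[^M]$', na=False)]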

Concatenate columns of dataframe in array

I'm trying to make a data visualization app which takes a CSV file and then lets you select the columns to represent (not all columns are represented). I already have the function to select only a few variables, but now I need to join those columns into a single data frame to work with. I tried this:
for i in range(0, len(data1.columns)):
    i = 0
    df = np.array(data1[data1.columns[i]])
    i += 1
print(df)
But I only get the same column repeated as many times as there are selected columns (i.e. if I select 5 columns, the same column is returned 5 times).
How do I ensure that each iteration inserts a different column, not always the same one?
The problem of the repeated column lies in the rewriting of i.
# For example, `data1.columns` is ["a", "b", "c", "d", "e"]
# Your code:
for i in range(0, len(data1.columns)):
    i = 0  # Here, on every iteration, i is set back to 0
    print(i, data1.columns[i], sep=": ")
    i += 1
# Output:
# 0: a
# 0: a
# 0: a
# 0: a
# 0: a
i = 0 & i += 1 are useless because you already get i fromrange, ranging from 0 to len (data1.columns).
Fixed version:
for i in range(0, len(data1.columns)):
    print(i, data1.columns[i], sep=": ")
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
Versions using a manual increment of i plus iteration over the elements:
# First step, iterate over the columns
for col in data1.columns:
    print(col)
# Output:
# a
# b
# c
# d
# e

# Step two, manual increment to obtain the list (array) index
i = 0
for col in data1.columns:
    print(i, col, sep=": ")
    i += 1
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
Helpful to know: enumerate.
The function enumerate(iterable) is handy for obtaining both the index and the value itself.
print(list(enumerate(["Hello", "world"])))
# Output:
[
    (0, "Hello"),
    (1, "world")
]
Usage:
for i, col in enumerate(data1.columns):
    print(i, col, sep=": ")
# Output:
# 0: a
# 1: b
# 2: c
# 3: d
# 4: e
In the end I solved it by declaring an empty list before the loop, iterating over the selected variables, and saving the indexes in this list. That way I get a list with the indexes I should use for my visualization.
def get_index(name):
    '''
    Return the index of a column name.
    '''
    for column in df.columns:
        if column == name:
            index = df.columns.get_loc(column)
            return index

result = []
for i in range(len(selected)):
    X = get_index(selected[i])
    result.append(X)

df = df[df.columns[result]]
x = df.values
Here 'selected' is the list of selected variables (filter first by column name, then get its index number). I don't know if it's the most elegant way to do this, but it works well.
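Note that pandas can index by a list of column names directly, so the manual index lookup may be unnecessary. A minimal sketch, assuming selected holds valid column names:
df = df[selected]  # select the chosen columns by name
x = df.values      # same array as before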

Search for a value anywhere in a pandas DataFrame

This seems like a simple question, but I couldn't find it asked before (this and this are close but the answers aren't great).
The question is: how do I search for a value somewhere in my df (I don't know which column it's in) and return all rows with a match?
What's the most Pandaic way to do it? Is there anything better than:
for col in list(df):
    try:
        df[col] == var
        return df[df[col] == var]
    except TypeError:
        continue
?
You can perform an equality comparison across the entire DataFrame:
df[df.eq(var).any(axis=1)]
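A quick demonstration on a toy frame (the column names here are just for illustration):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['bal1', 'x']})
var = 'bal1'
print(df[df.eq(var).any(axis=1)])  # keeps only the row containing var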
You should use isin; this returns the columns. If you want a row check, see cold's answer :-)
df.isin(['bal1']).any()
A False
B True
C False
CLASS False
dtype: bool
Or
df[df.isin(['bal1'])].stack()  # level 0 of the index is the row index, level 1 is the column containing the value
0 B bal1
1 B bal1
dtype: object
You can try the code below:
import pandas as pd

x = pd.read_csv(r"filePath")
x.columns = x.columns.str.lower().str.replace(' ', '_')
y = x.columns.values
z = y.tolist()
print("Note: it takes case-sensitive values.")
keyWord = input("Type a keyword to search: ")
try:
    for k in range(len(z)):
        l = x[x[z[k]].str.match(keyWord)]
        print(l.head(10))
except Exception:
    print("")
This is a solution which will return the actual column you need.
df.columns[df.isin(['Yes']).any()]
Minimal solution:
import pandas as pd
import numpy as np

def locate_in_df(df, value):
    a = df.to_numpy()
    row = np.where(a == value)[0][0]
    col = np.where(a == value)[1][0]
    return row, col
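Usage might look like this; note the sketch assumes the value is present, since np.where returns empty arrays otherwise and the [0][0] indexing would raise an IndexError:
df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'bal1']})
print(locate_in_df(df, 'bal1'))  # (1, 1): second row, second column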

Extracting the n-th elements from a list of named tuples in pandas Python?

I am trying to extract the n-th element from a list of named tuples stored in a df looking like this:
df['text'] = [Tag(word='Come', pos='adj', lemma='Come'), Tag(word='on', pos='nounpl', lemma='on'), Tag(word='Feyenoord', pos='adj', lemma='Feyenoord')]
I am trying to extract only elements that contain the pos information from each tuple. This is the outcome that I would like to achieve:
df['text'] = ['adj', 'nounpl', 'adj']
This is what I have tried this far:
d = []
count = 0
while count < df['text'].size:
    d.append([item[1] for item in df['text'][count]])
    count += 1
dfpos = pd.DataFrame({'text': d})
df['text'] = pd.DataFrame({'text': d})
df['text'] = df['text'].apply(lambda x: ', '.join(x))
And this is the error: IndexError: tuple index out of range
What am I missing?
Solution: it seems that the easiest fix is to turn the tuples into lists. I am not sure if this is the best solution, but it works.
d = []
count = 0
while count < df['text'].size:
    temp = [list(item[1:-1]) for item in df['text'][count]]
    d.append(sum(temp, []))
    count += 1
df['text'] = pd.DataFrame({'text': d})
df['text2'] = df['text'].apply(lambda x: ', '.join(x))
Try indexing using apply if Tag is your named tuple, i.e.
Data preparation:
from collections import namedtuple
Tag = namedtuple('Tag', 'word pos lemma')
li = [Tag(word='Come', pos='adj', lemma='Come'), Tag(word='on', pos='nounpl', lemma='on'), Tag(word='Feyenoord', pos='adj', lemma='Feyenoord')]
df = pd.DataFrame({'text': li})
For attribute-based selection, use . in apply, since it's a named tuple, i.e.
df['new'] = df['text'].apply(lambda x: x.pos)
If you need index-based selection, then use
df['new'] = df['text'].apply(lambda x: x[1] if len(x) > 1 else np.nan)
Output of df['new']:
0 adj
1 nounpl
2 adj
Name: text, dtype: object
Another solution is to use str[1] to select the value from the namedtuple:
df['text1'] = df['text'].str[1]
print (df)
text text1
0 (Come, adj, Come) adj
1 (on, nounpl, on) nounpl
2 (Feyenoord, adj, Feyenoord) adj
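On larger columns, a plain list comprehension over the attribute is often faster than apply. A minimal sketch, assuming every element really is a Tag:
df['text1'] = [t.pos for t in df['text']]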

How to find the most probable pair of a string from a two-column dataset?

Given columns A and B, how can I find the most probable item in column B for each of the items in column A? What about something based on nested hash maps? I want to do this in Python.
INPUT:
a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5
a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5
a,abd37534c7d9a2efb9465fghfghfghfghfghrewresdasdzfdghhgfhg
a,abd3753dfrtdgfdg563ae98078d6dfgfdgdfghdgasdaSADFBVFDGFD5
b,c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17
OUTPUT:
a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5
b,c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17
I will assume "most probable" means the value with the highest occurrence count for each key in {a, b}.
The following will likely work, though it may have some syntax issues. In any case, it should give you an idea of how to approach the problem (if not solve it for you).
tupleList = [('a', 'abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5'),
             ('a', 'abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5'),
             ('a', 'abd37534c7d9a2efb9465fghfghfghfghfghrewresdasdzfdghhgfhg'),
             ('a', 'abd3753dfrtdgfdg563ae98078d6dfgfdgdfghdgasdaSADFBVFDGFD5'),
             ('b', 'c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17')]
# Load your list of a,blah into tupleList

# Count occurrences of each col2 value per col1 key
myHashMap = {}
for col1, col2 in tupleList:
    if col1 not in myHashMap:
        myHashMap[col1] = {}
    if col2 not in myHashMap[col1]:
        myHashMap[col1][col2] = 0
    myHashMap[col1][col2] += 1

# Now iterate over it to find the one with the highest occurrence
for col in myHashMap:
    maxKey = ''
    maxVal = 0
    for col2 in myHashMap[col]:
        if myHashMap[col][col2] > maxVal:
            maxVal = myHashMap[col][col2]
            maxKey = col2
    print('Most probable for %s is %s' % (col, maxKey))
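If pandas is available, the same result is a short groupby. A sketch assuming the data is in a two-column frame (mode()[0] picks the most frequent value per group):
import pandas as pd

df = pd.DataFrame(tupleList, columns=['A', 'B'])
print(df.groupby('A')['B'].agg(lambda s: s.mode()[0]))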
