How do I know if there is a value in another column? - python

I have a df something like this:
import pandas as pd

lst = [[30029509, 37337567, 41511334, 41511334, 41511334]]
lst2 = [35619048]
lst3 = [[41511334, 37337567, 41511334]]
lst4 = [[37337567, 41511334]]
df = pd.DataFrame()
df['0'] = lst, lst2, lst3, lst4
I need to count how many times '41511334' appears in every row.
I tried this code:
df['new'] = '41511334' in str(df['0'])
and I got True in every row, but that is wrong for the second row.
What's wrong?
Thanks

str(df['0']) gives a string representation of the whole column 0 and so includes all the data. You will then see that
'41511334' in str(df['0'])
gives True, and that single value gets assigned to every row of the 'new' column. You are looking for something like
df['new'] = df['0'].apply(lambda x: '41511334' in str(x))
or
df['new'] = df['0'].astype(str).str.contains('41511334')
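Since the question asks for a count rather than a boolean, a minimal sketch counting per-row occurrences (assuming the column holds flat lists of ints, a slight simplification of the question's nested sample data):

```python
import pandas as pd

lst = [30029509, 37337567, 41511334, 41511334, 41511334]
lst2 = [35619048]
lst3 = [41511334, 37337567, 41511334]
lst4 = [37337567, 41511334]
df = pd.DataFrame({'0': [lst, lst2, lst3, lst4]})

# Count occurrences of the value in each row's list
df['count'] = df['0'].apply(lambda x: x.count(41511334))
# Boolean presence flag, equivalent to the contains check above
df['new'] = df['count'] > 0
```

Here df['count'] comes out as [3, 0, 2, 1] and df['new'] as [True, False, True, True], so the second row is correctly False.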

Related

Python - Pandas - drop specific columns (axis)?

So I have a numeric list [0-12] that matches the number of columns in my spreadsheet, and I replaced the column headers with that list via df.columns = list.
Now I want to drop specific columns out of that spreadsheet, like this.
To create the list of numbers matching the number of columns I have this:
listOfNumbers = []
column_name = []
for i in range(0, len(df.columns)):
    listOfNumbers.append(i)
df.columns = listOfNumbers
for i in range(1, len(df.columns)):
    for j in range(1, len(df.columns)):
        if i != colList[j]:
            df.drop(i, inplace=True)
And I got the list [1, 2, 3] as seen in the picture.
But I always get this error:
KeyError: '[1] not found in axis'
I tried to replace df.drop(i, inplace=True) with df.drop(i, axis=1, inplace=True), but that didn't work either.
Any suggestions? Thanks.
The proper way is:
columns_to_remove = [1, 2, 3] # columns to delete
df = df.drop(columns=df.columns[columns_to_remove])
So for your use case:
for i in range(1, len(df.columns)):
    for j in range(1, len(df.columns)):
        if i != colList[j]:
            df.drop(columns=df.columns[i], inplace=True)
If you want to drop every column that does not appear in colList, this code does it using a set difference:
setOfNumbers = set(range(df.shape[1]))
setRemainColumns = set(colList)
for dropColumn in setOfNumbers.difference(setRemainColumns):
    df.drop(dropColumn, axis=1, inplace=True)
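A short runnable sketch of the set-difference approach, on a made-up one-row frame where colList holds the integer labels of the columns to keep:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40]], columns=[0, 1, 2, 3])
colList = [1, 2, 3]  # hypothetical: integer labels of the columns to keep

# Labels to drop = all integer labels minus the ones to keep
toDrop = set(range(df.shape[1])).difference(colList)
df = df.drop(columns=list(toDrop))
```

Dropping all unwanted labels in one call also avoids the KeyError in the question, which comes from dropping inside a loop while the axis shrinks underneath the loop indices.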

Generating concatenated dataframe output from for loop over a function

I have a function:
def func(x):
    y = pd.read_csv(x)
    return y
If I loop this function over several inputs, I expect the output to be a dataframe combining all those inputs:
list = ["a.csv", "b.csv", "d.csv", "e.csv"]
for i in list:
    m = func(i)
How can I get m as the combined dataframe from all the input files?
To combine DataFrames, use concat with a list comprehension:
Note: don't use list as a variable name, because it shadows the Python built-in.
L = ["a.csv", "b.csv", "d.csv", "e.csv"]
df1 = pd.concat([func(i) for i in L])
Or in a loop:
out = []
for i in L:
    m = func(i)
    out.append(m)
df1 = pd.concat(out)
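A self-contained sketch of the loop-and-concat pattern; io.StringIO buffers stand in for the CSV files so the example runs without files on disk:

```python
import io
import pandas as pd

def func(x):
    return pd.read_csv(x)

# StringIO objects stand in for "a.csv" and "b.csv"
files = [io.StringIO("a,b\n1,2\n"), io.StringIO("a,b\n3,4\n")]

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
df1 = pd.concat([func(f) for f in files], ignore_index=True)
```

The result is one frame with both files' rows stacked vertically.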

Search values from a list in dataframe cell list and add another column with results

I am trying to create a column with the result of a comparison between a DataFrame cell that contains a list and another list.
I have this dataframe with list values:
df = pd.DataFrame({'A': [['KB4525236', 'KB4485447', 'KB4520724', 'KB3192137', 'KB4509091']], 'B': [['a', 'b']]})
and a list with this value:
findKBs = ['KB4525236','KB4525202']
The expected result :
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
I don't know how to iterate my list against the cell list and find the non-matches. Can you help me?
You can simply compare the two lists: loop through the values of findKBs and keep them in a new list if they are not in df['A'][0]:
df['C'] = [[x for x in findKBs if x not in df['A'][0]]]
Result:
A B C
0 [KB4525236, KB4485447, KB4520724, KB3192137, K... [a, b] [KB4525202]
There's probably a more pandas-centric way to do it, but this also works:
df['C'] = [list(filter(lambda el: el not in df['A'][0], findKBs))]
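If the order of findKBs does not matter, a set difference gives the same non-matches; sets do not preserve order, so sorted() is used here only to make the result deterministic:

```python
import pandas as pd

df = pd.DataFrame({'A': [['KB4525236', 'KB4485447', 'KB4520724', 'KB3192137', 'KB4509091']],
                   'B': [['a', 'b']]})
findKBs = ['KB4525236', 'KB4525202']

# Elements of findKBs that do not appear in the cell's list
df['C'] = [sorted(set(findKBs) - set(df['A'][0]))]
```

When the original ordering of findKBs must be preserved, the list comprehension above is the safer choice.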

ValueError: Columns must be same length as key

I have a problem running the code below.
data is my dataframe, X is the list of columns for the training data, and L is a list of categorical features with numeric values.
I want to one-hot encode my categorical features, so I do the following. But the last line throws "ValueError: Columns must be same length as key", and after much research I still don't understand why.
def turn_dummy(df, prop):
    dummies = pd.get_dummies(df[prop], prefix=prop, sparse=True)
    df.drop(prop, axis=1, inplace=True)
    return pd.concat([df, dummies], axis=1)

L = ['A', 'B', 'C']
for col in L:
    data_final[X] = turn_dummy(data_final[X], col)
It appears that this is a problem of dimensionality. It would be like the following:
Say I have a list like so:
mylist = [0, 0, 0, 0]
It is of length 4. If I wanted to do a 1:1 mapping of elements of a new list into that one:
otherlist = ['a', 'b']
for i in range(len(mylist)):
    mylist[i] = otherlist[i]
Obviously this will throw an IndexError, because it tries to get elements that otherlist just doesn't have.
Much the same is occurring here: the frame returned by turn_dummy has a different number of columns than the key data_final[X] names, so the columns are not the same length as the key. Try:
data_final[X] = turn_dummy(data_final[X], L)
Assuming len(L) = number_of_rows
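As an aside, pd.get_dummies can encode several columns in one call via its columns parameter, which sidesteps the column-assignment mismatch entirely; a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical data: A, B, C are numeric-coded categoricals, num is a plain feature
data_final = pd.DataFrame({'A': [1, 2], 'B': [0, 1], 'C': [5, 5], 'num': [1.0, 2.0]})
L = ['A', 'B', 'C']

# One call encodes every column in L and passes the rest through unchanged
encoded = pd.get_dummies(data_final, columns=L)
```

The encoded frame replaces A, B, C with prefixed dummy columns (A_1, A_2, B_0, ...) while num is kept as-is, so no manual drop/concat is needed.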

How can I iterate over the rows of a dataframe using the index?

I am looking to apply a loop over the indices of a dataframe in python.
My loop is like:
for index in DataFrame:
    if index <= 10:
        index = index + 1
        return rows(index)
Use DataFrame.iterrows():
for row, srs in pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).iterrows():
    ...do something...
Try this:
for index, row in df.iterrows():
    if index <= 10:
        print(row)
This prints the rows with index 0 through 10.
If a condition is required, build the list of indices first, then collect the rows as a list of Series:
all_index = []
for i in index:
    l1 = list(range(i - 10, i + 2))
    all_index.extend(l1)
all_index = list(set(all_index))
Then take the list of Series:
all_series = []
for i in all_index:
    a = df.iloc[i, :]
    all_series.append(a)
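For the original goal of getting the first rows up to index 10, boolean indexing or head is usually simpler than an explicit loop; a minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': range(20)})

first_rows = df[df.index <= 10]  # rows with index 0..10
same_rows = df.head(11)          # equivalent for a default RangeIndex
```

Both select eleven rows; the boolean form also works when the index is not a plain 0..n-1 range.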
