I have a numeric list [0-12] that matches the number of columns in my spreadsheet, and I replaced the column headers with that list using df.columns = list.
Now I want to drop specific columns from that spreadsheet, like this.
To create the list of numbers matching the number of columns, I have this:
listOfNumbers = []
column_name = []
for i in range(0, len(df.columns)):
    listOfNumbers.append(i)
df.columns = listOfNumbers
for i in range(1, len(df.columns)):
    for j in range(1, len(df.columns)):
        if i != colList[j]:
            df.drop(i, inplace=True)
And I have the list [1,2,3], as seen in the picture.
But I always get this error:
KeyError: '[1] not found in axis'
I tried replacing df.drop(i, inplace=True) with df.drop(i, axis=1, inplace=True), but that didn't work either.
Any suggestions? Thanks.
The proper way is:
columns_to_remove = [1, 2, 3] # columns to delete
df = df.drop(columns=df.columns[columns_to_remove])
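So for your use case, assuming colList holds the positions to keep (as the != test in your loop suggests), collect the positions to drop first and drop them in one call, since dropping inside the nested loops shifts the remaining positions:
# start at 1, as in your loop, so column 0 is never dropped
cols_to_drop = [i for i in range(1, len(df.columns)) if i not in colList]
df = df.drop(columns=df.columns[cols_to_drop])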
If you want to drop every column that does not appear in colList, this code does it, using set difference:
setOfNumbers = set(range(df.shape[1]))
setRemainColumns = set(colList)
for dropColumn in setOfNumbers.difference(setRemainColumns):
    df.drop(dropColumn, axis=1, inplace=True)
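To make the two readings of the question concrete, here is a minimal sketch on a made-up two-row frame (the data and the names `dropped` and `kept` are illustrative assumptions, not from the question). It shows dropping the positions in colList versus keeping only those positions.

```python
import pandas as pd

# Hypothetical sample data; columns get the default integer labels 0..4
df = pd.DataFrame([[10, 11, 12, 13, 14],
                   [20, 21, 22, 23, 24]])
colList = [1, 2, 3]

# Reading 1: colList names the positions to drop
dropped = df.drop(columns=df.columns[colList])

# Reading 2: colList names the positions to keep
kept = df.iloc[:, colList]

print(list(dropped.columns))
print(list(kept.columns))
```

Both calls return a new frame and leave df untouched, which avoids the shifting-position problem that comes with dropping inside a loop.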
cols = [2, 4, 6, 8, 10, 12, 14, 16, 18]  # the columns I want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b = np.array(df1)
b
Output:
b = [['WV5 6NY' 'RE4 9VU' 'BU4 N90' 'TU3 5RE' 'NE5 4F']
['SA8 7TA' 'BA31 0PO' 'DE3 2FP' 'LR98 4TS' 0]
['MN0 4NU' 'RF5 5FG' 'WA3 0MN' 'EA15 8RE' 'BE1 4RE']
['SB7 0ET' 'SA7 0SB' 'BT7 6NS' 'TA9 0LP' 'BA3 1OE']]
a = np.concatenate(b) #concatenated to get a single array, this worked well
a = np.array([x for x in a if x != 'nan'])
a = a[np.where(a != '0')] #removed the nan
print(np.sort(a)) # to sort alphabetically
#Sorted array
['BA3 1OE' 'BA31 0PO' 'BE1 4RE' 'BT7 6NS' 'BU4 N90'
'DE3 2FP' 'EA15 8RE' 'LR98 4TS' 'MN0 4NU' 'NE5 4F' 'RE4 9VU'
'RF5 5FG' 'SA7 0SB' 'SA8 7TA' 'SB7 0ET' 'TA9 0LP' 'TU3 5RE'
'WA3 0MN' 'WV5 6NY']
# Find the index position of all elements of b in a (sorted array)
def findall_index(b, a):
    result = []
    for i in range(len(a)):
        for j in range(len(a[i])):
            if b[i][j] == a:
                result.append((i, j))
    return result
print(findall_index(0, result))
I am still very new to Python. I tried to find the index positions of all elements of b in a above, but the code block above doesn't seem to give me any result. Could someone please help me?
Thank you in advance.
One way you could approach this is by zipping (creating pairs) the index of elements in b with the actual elements and then sorting this new array based on the elements only. Now you have a mapping from indices of the original array to the new sorted array. You can then just loop over the sorted pairs to map the current index to the original index.
I would strongly suggest coding this yourself, since it will help you learn!
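For reference once you have had a go, the pairing idea can be sketched roughly as follows. The short list `values` is a made-up stand-in for the flattened array, and the names `sorted_to_original` and `original_to_sorted` are illustrative assumptions.

```python
# Hypothetical stand-in for the flattened (unsorted) array
values = ['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE']

# Pair each original index with its element, then sort the pairs by element only
pairs = sorted(enumerate(values), key=lambda pair: pair[1])

# sorted_to_original[k] is the original index of the k-th element in sorted order
sorted_to_original = [original for original, _ in pairs]

# Invert the mapping: original index -> position in the sorted order
original_to_sorted = {original: position
                      for position, original in enumerate(sorted_to_original)}

print(sorted_to_original)
print(original_to_sorted)
```

For a 2-D array, the same idea should carry over by flattening first and converting flat positions back to (row, column) pairs.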
How can I detect the rows and columns of a dataframe whose string elements contain any character other than the desired ones?
The desired characters are A, B, C, a, b, c, 1, 2, 3, &, %, =, /
Dataframe:
Col1  Col2  Col3
Abc   Øa    12
bbb   +     }
The output should be the elements Øa, +, } and their locations in the dataframe.
I find it really difficult to locate an element for a condition directly in pandas, so I converted the dataframe to a nested list first, then proceeded to work with the list. Try this:
import pandas as pd
import numpy as np
#creating your sample dataframe
array = np.array([['Abc','Øa','12'],['bbb','+','}']])
columns = ['Col1','Col2','Col3']
df = pd.DataFrame(data=array, columns=columns)
#convert dataframe to nested list
pd_list = df.values.tolist()
#return any characters other than the ones in 'var'
all_chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;=>?#[\\]^_`{|}~Ø'
var = 'ABCabc123&%=//'
for a in var:
    all_chars = all_chars.replace(a, "")
#stores previously detected elements to prevent duplicate
temp_storage = []
#loops through the nested list to get the elements' indexes
for x in all_chars:
    for i in pd_list:
        for n in i:
            if x in n:
                #check if element is duplicate
                if n not in temp_storage:
                    temp_storage.append(n)
                    print(f'found {n}: row={pd_list.index(i)}; col={i.index(n)}')
Output:
> found +: row=1; col=1
> found }: row=1; col=2
> found Øa: row=0; col=1
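As an alternative that does not need a hand-written list of every possible character, the same check can be sketched with a negated regular-expression character class. This is a minimal sketch, not the answer's method; the names `allowed`, `pattern` and `offenders` are assumptions introduced for illustration.

```python
import re
import pandas as pd

# Same sample frame as in the question
df = pd.DataFrame({'Col1': ['Abc', 'bbb'],
                   'Col2': ['Øa', '+'],
                   'Col3': ['12', '}']})

# Characters that are allowed; anything else marks the element as an offender
allowed = 'ABCabc123&%=/'
pattern = re.compile(f'[^{re.escape(allowed)}]')

# Collect (row label, column name, value) for every element with a disallowed character
offenders = [
    (row_label, col_name, value)
    for col_name in df.columns
    for row_label, value in df[col_name].items()
    if pattern.search(str(value))
]

for row_label, col_name, value in offenders:
    print(f'found {value}: row={row_label}; col={col_name}')
```

Because each cell is visited by position, duplicate values in different cells are all reported, which the list.index approach would not do.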
I have a df something like this:
lst = [[30029509,37337567,41511334,41511334,41511334]]
lst2 = [35619048]
lst3 = [[41511334,37337567,41511334]]
lst4 = [[37337567,41511334]]
df = pd.DataFrame()
df['0'] = lst, lst2, lst3, lst4
I need to count how many times '41511334' appears in each row.
I use this code:
df['new'] = '41511334' in str(df['0'])
And I get True in every row, which is wrong for the second row.
What's wrong?
Thanks
str(df['0']) gives a string representation of column 0 and so includes all the data. You will then see that
'41511334' in str(df['0'])
gives True, and you assign this to every row of the 'new' column. You are looking for something like
df['new'] = df['0'].apply(lambda x: '41511334' in str(x))
or
df['new'] = df['0'].astype(str).str.contains('41511334')
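Since the question asks for a count rather than a yes/no flag, here is a minimal sketch that adds one with Series.str.count. The frame uses flat lists as a simplified stand-in for the question's nested ones, and the `as_text` variable and `count` column are names introduced here for illustration.

```python
import pandas as pd

# Simplified stand-in for the question's data: one list of IDs per row
df = pd.DataFrame({'0': [
    [30029509, 37337567, 41511334, 41511334, 41511334],
    [35619048],
    [41511334, 37337567, 41511334],
    [37337567, 41511334],
]})

# Convert each row's list to its own string, so rows are checked independently
as_text = df['0'].astype(str)

df['new'] = as_text.str.contains('41511334')   # True/False per row
df['count'] = as_text.str.count('41511334')    # occurrences per row

print(df[['new', 'count']])
```

Note that this is substring matching on the text form, so a longer number that happens to contain the same digits would also be counted.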
I have a problem running the code below.
data is my dataframe, X is the list of columns for the training data, and L is a list of categorical features with numeric values.
I want to one-hot encode my categorical features, so I do the following. But a "ValueError: Columns must be same length as key" is thrown for the last line, and after a lot of research I still don't understand why.
def turn_dummy(df, prop):
    dummies = pd.get_dummies(df[prop], prefix=prop, sparse=True)
    df.drop(prop, axis=1, inplace=True)
    return pd.concat([df, dummies], axis=1)
L = ['A', 'B', 'C']
for col in L:
    data_final[X] = turn_dummy(data_final[X], col)
It appears that this is a problem of dimensionality. It would be like the following:
Say I have a list like so:
mylist = [0, 0, 0, 0]
It is of length 4. If I wanted to do 1:1 mapping of elements of a new list into that one:
otherlist = ['a', 'b']
for i in range(len(mylist)):
    mylist[i] = otherlist[i]
Obviously this throws an IndexError, because it tries to get elements that otherlist just doesn't have.
Much the same is occurring here. You are trying to insert a string (len=1) into a column of length n>1. Try:
data_final[X] = turn_dummy(data_final[X], L)
assuming len(L) = number_of_rows.
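A different way around the error, sketched under the assumption that the mismatch comes from assigning a frame with a different number of columns back into data_final[X]: let pd.get_dummies encode all the listed columns at once and keep the result in a new variable. The sample values and the name `encoded` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: A, B, C are categorical with numeric values, D is numeric
data_final = pd.DataFrame({'A': [1, 2, 1],
                           'B': [0, 1, 0],
                           'C': [3, 3, 4],
                           'D': [0.5, 1.5, 2.5]})
X = ['A', 'B', 'C', 'D']
L = ['A', 'B', 'C']

# Encode every column in L in one call; the other columns are kept as they are.
# The result has more columns than X, so it goes into a new variable
# instead of being assigned back to data_final[X].
encoded = pd.get_dummies(data_final[X], columns=L)

print(list(encoded.columns))
```

The untouched columns come first in the result, followed by the dummy columns named with the original column as prefix.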
I am looking to apply a loop over the indices of a dataframe in Python.
My loop looks like this:
for index in DataFrame:
    if index <= 10:
        index = index + 1
        return rows(index)
Use DataFrame.iterrows():
for row, srs in pd.DataFrame({'a': [1,2], 'b': [3,4]}).iterrows():
    ...  # do something with the index label `row` and the row Series `srs`
Try this:
for index, row in df.iterrows():
    if index <= 10:
        print(row)
This prints the rows whose index is at most 10, i.e. the first 11 rows with a default integer index.
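If all you need is the rows that satisfy the index condition, a boolean mask on the index avoids the explicit loop. A minimal sketch on a made-up 15-row frame (the data and the name `subset` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..14
df = pd.DataFrame({'a': range(15), 'b': range(15, 30)})

# Keep only the rows whose index label is at most 10
subset = df[df.index <= 10]

print(len(subset))
```

The mask compares index labels, not positions, so on a frame with a non-default index the selected rows may differ from the first eleven.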
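If a condition is required, first collect the index positions you need; the rows can then be gathered as a list of Series:
all_index = []
for i in index:  # index: the positions that met your condition
    l1 = list(range(i-10, i+2))
    all_index.extend(l1)
all_index = list(set(all_index))  # remove duplicate positions
Then take the list of Series:
all_series = []
for i in all_index:
    a = df.iloc[i, :]
    all_series.append(a)  # append returns None, so do not reassign all_series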