Groupby with row index operation? - python

How can I select rows by a row-index condition (say, only even rows, or only rows where row # % 5 == 0) in pandas?
Let's say I have a dataframe df [120 rows x 10 columns], and I want to create two dataframes out of it: one from the even rows, df1 [60 rows x 10 columns], and one from the odd rows [60 rows x 10 columns].

You can slice the dataframe using normal list-style slicing semantics:
first = df.iloc[::2]
second = df.iloc[1::2]
The first steps every 2 rows starting from the first row; the second does the same but starts from row 1, the second row.

As stated already, you may use iloc
df0 = df.iloc[::2]
df1 = df.iloc[1::2]
If you have a more complex selection scheme you may pass a boolean vector to iloc, e.g.,
def filter_by(idx):
    # param idx: a positional row index
    # returns True if idx % 4 == 0 or idx % 4 == 1
    return idx % 4 == 0 or idx % 4 == 1

# a boolean vector is created by means of filter_by
df_new = df.iloc[[filter_by(i) for i in range(df.shape[0])]]
The filtering above then becomes:
df0 = df.iloc[[idx % 2 == 0 for idx in range(df.shape[0])]]
df1 = df.iloc[[idx % 2 == 1 for idx in range(df.shape[0])]]
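As a quick sanity check, the two slices can be verified on a throwaway frame whose 120x10 shape just mirrors the question:

```python
import numpy as np
import pandas as pd

# a throwaway 120x10 frame mirroring the question's shape
df = pd.DataFrame(np.arange(1200).reshape(120, 10))

df_even = df.iloc[::2]   # rows 0, 2, 4, ...
df_odd = df.iloc[1::2]   # rows 1, 3, 5, ...

print(df_even.shape, df_odd.shape)  # (60, 10) (60, 10)
```

Note that iloc slicing keeps the original index labels, which is handy for tracing rows back to df.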


Dataframe group by only groups with 2 or more rows

Is there a way to groupby only the groups with 2 or more rows?
Or can I delete the groups that contain only 1 row from a grouped dataframe?
Thank you very much for your help!
Yes, there is a way. Here is an example:
df = pd.DataFrame(
    np.array([['A', 'A', 'B', 'C', 'C'], [1, 2, 1, 1, 2]]).T,
    columns=['type', 'value']
)
groups = df.groupby('type')
groups_without_single_row_df = [g for g in groups if len(g[1]) > 1]
Iterating over a groupby yields tuples.
Here, the 'type' key (A, B or C) is the first element of the tuple and the sub-dataframe is the second element.
You can check the length of each sub-dataframe with len(), as in [g for g in groups if len(g[1]) > 1], where we check the length of the second element of the tuple.
If len() is greater than 1, the group is included in the output list.
Hope it helps
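If you want the result back as a single dataframe rather than a list of tuples, pandas' GroupBy.filter does the same size check in one call; a sketch on the same toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([['A', 'A', 'B', 'C', 'C'], [1, 2, 1, 1, 2]]).T,
    columns=['type', 'value']
)

# keep only the rows whose group has 2 or more members
filtered = df.groupby('type').filter(lambda g: len(g) > 1)
print(filtered)  # the single-row 'B' group is dropped
```

filter returns the surviving rows concatenated back into one dataframe, with the original index preserved.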

Python - Pandas - drop specific columns (axis)?

So I have a numeric list [0-12] that matches the number of columns in my spreadsheet, and I replaced the column headers with that list via df.columns = list.
Now I want to drop specific columns from that spreadsheet, like this.
To create the list of numbers matching the number of columns I wrote:
listOfNumbers = []
column_name = []
for i in range(0, len(df.columns)):
    listOfNumbers.append(i)
df.columns = listOfNumbers

for i in range(1, len(df.columns)):
    for j in range(1, len(df.columns)):
        if i != colList[j]:
            df.drop(i, inplace=True)
And I got the list [1, 2, 3] as seen in the picture.
But I always get this error:
KeyError: '[1] not found in axis'
I tried to replace df.drop(i, inplace=True) with df.drop(i, axis=1, inplace=True), but that didn't work either.
Any suggestions? Thanks.
The proper way would be:
columns_to_remove = [1, 2, 3]  # columns to delete
df = df.drop(columns=df.columns[columns_to_remove])
So for your use case:
for i in range(1, len(df.columns)):
    for j in range(1, len(df.columns)):
        if i != colList[j]:
            df.drop(columns=df.columns[i], inplace=True)
If you want to drop every column that does not appear in colList, this code does it, using set difference:
setOfNumbers = set(range(df.shape[1]))
setRemainColumns = set(colList)
for dropColumn in setOfNumbers.difference(setRemainColumns):
    df.drop(dropColumn, axis=1, inplace=True)
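The set-difference loop can also be collapsed into a single drop call. A minimal sketch on a made-up 5-column frame with integer labels, keeping the question's colList = [1, 2, 3]:

```python
import pandas as pd

# made-up frame with integer column labels, as in the question
df = pd.DataFrame([[10, 11, 12, 13, 14]], columns=range(5))
colList = [1, 2, 3]  # columns to keep

# drop everything not in colList in one call
to_drop = set(range(df.shape[1])) - set(colList)
df = df.drop(columns=list(to_drop))
print(list(df.columns))  # [1, 2, 3]
```

Dropping once avoids the shifting-labels problem that makes the nested-loop version raise KeyError.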

Searching index position in python

cols = [2,4,6,8,10,12,14,16,18] # selected the columns i want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b= np.array(df1)
b
outcome
b = [['WV5 6NY' 'RE4 9VU' 'BU4 N90' 'TU3 5RE' 'NE5 4F']
['SA8 7TA' 'BA31 0PO' 'DE3 2FP' 'LR98 4TS' 0]
['MN0 4NU' 'RF5 5FG' 'WA3 0MN' 'EA15 8RE' 'BE1 4RE']
['SB7 0ET' 'SA7 0SB' 'BT7 6NS' 'TA9 0LP' 'BA3 1OE']]
a = np.concatenate(b)  # concatenated to get a single array, this worked well
a = np.array([x for x in a if x != 'nan'])  # removed the nan
a = a[np.where(a != '0')]  # removed the zeros
print(np.sort(a))  # to sort alphabetically
# Sorted array
['BA3 1OE' 'BA31 0PO' 'BE1 4RE' 'BT7 6NS' 'BU4 N90'
 'DE3 2FP' 'EA15 8RE' 'LR98 4TS' 'MN0 4NU' 'NE5 4F' 'RE4 9VU'
 'RF5 5FG' 'SA7 0SB' 'SA8 7TA' 'SB7 0ET' 'TA9 0LP' 'TU3 5RE'
 'WA3 0MN' 'WV5 6NY']
# Find the index position of all elements of b in a (the sorted array)
def findall_index(b, a):
    result = []
    for i in range(len(a)):
        for j in range(len(a[i])):
            if b[i][j] == a:
                result.append((i, j))
    return result

print(findall_index(0, result))
I am still very new to Python. I tried to find the index positions of all elements of b in a above, but the code block underneath doesn't seem to give me any result. Please can someone help me?
Thank you in advance.
One way you could approach this is by zipping (creating pairs of) the indices of elements in b with the actual elements, and then sorting this new array based on the elements only. Now you have a mapping from the indices of the original array to the new sorted array, and you can just loop over the sorted pairs to map each current index back to its original index.
I would highly suggest you code this yourself, since it will help you learn!
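For reference, the zip-and-sort idea can be sketched like this; the two-row b below is just illustrative data, not the question's full array:

```python
# illustrative 2x2 data standing in for the question's array
b = [['WV5 6NY', 'RE4 9VU'],
     ['SA8 7TA', 'BA31 0PO']]

# pair each element with its (row, col) position, then sort by element
pairs = [(val, (i, j))
         for i, row in enumerate(b)
         for j, val in enumerate(row)]
pairs.sort(key=lambda p: p[0])

for val, pos in pairs:
    print(val, pos)  # sorted value alongside its original position
```

Each entry of pairs carries the element and where it came from, so sorting never loses the original index.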

Detect specific characters in pandas dataframe

How can I detect the columns and rows of a dataframe whose string elements contain a character other than the desired ones?
The desired characters are A, B, C, a, b, c, 1, 2, 3, &, %, =, /
dataframe -
Col1  Col2  Col3
Abc   Øa    12
bbb   +     }
The output will be the elements Øa, +, } and their locations in the dataframe.
I find it really difficult to locate an element by a condition directly in pandas, so I converted the dataframe to a nested list first, then worked with the list. Try this:
import pandas as pd
import numpy as np

# creating your sample dataframe
array = np.array([['Abc', 'Øa', '12'], ['bbb', '+', '}']])
columns = ['Col1', 'Col2', 'Col3']
df = pd.DataFrame(data=array, columns=columns)

# convert dataframe to nested list
pd_list = df.values.tolist()

# keep any characters other than the ones in 'var'
all_chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~Ø'
var = 'ABCabc123&%=//'
for a in var:
    all_chars = all_chars.replace(a, "")

# stores previously detected elements to prevent duplicates
temp_storage = []

# loop through the nested list to get the elements' indexes
for x in all_chars:
    for i in pd_list:
        for n in i:
            if x in n:
                # check if element is a duplicate
                if n not in temp_storage:
                    temp_storage.append(n)
                    print(f'found {n}: row={pd_list.index(i)}; col={i.index(n)}')
Output:
> found +: row=1; col=1
> found }: row=1; col=2
> found Øa: row=0; col=1
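A vectorized alternative, in case it is useful: pandas' string methods can flag the offending cells directly with a negated character class. The regex below assumes the allowed set from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['Abc', 'Øa', '12'], ['bbb', '+', '}']],
                  columns=['Col1', 'Col2', 'Col3'])

# True wherever a cell contains a character outside the allowed set
bad = df.apply(lambda col: col.str.contains(r'[^ABCabc123&%=/]', regex=True))

# positional (row, col) indices of the flagged cells
rows, cols = np.where(bad)
found = [(df.iat[r, c], r, c) for r, c in zip(rows, cols)]
print(found)  # [('Øa', 0, 1), ('+', 1, 1), ('}', 1, 2)]
```

This also catches characters that a hand-built all_chars string might miss, since anything outside the whitelist matches.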

How can I iterate rows of a dataframe using the index?

I am looking to apply a loop over the indices of a dataframe in Python.
My loop is like:
for index in DataFrame:
    if index <= 10:
        index = index + 1
        return rows(index)
Use DataFrame.iterrows():
for row, srs in pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).iterrows():
    ...  # do something
Try this:
for index, row in df.iterrows():
    if index <= 10:
        print(row)
This is going to print the first 11 rows (index labels 0 through 10, assuming a default RangeIndex).
If a condition is required, we can build a list of indices first and then collect the corresponding rows as a list of Series:
all_index = []
for i in index:
    l1 = list(range(i - 10, i + 2))
    all_index.extend(l1)
all_index = list(set(all_index))

# take the rows as a list of Series
all_series = []
for i in all_index:
    all_series.append(df.iloc[i, :])
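A minimal runnable sketch of the iterrows pattern above, on a made-up 20-row frame with a default RangeIndex:

```python
import pandas as pd

df = pd.DataFrame({'a': range(20), 'b': range(20, 40)})

# collect the 'a' values of rows whose index label is <= 10
selected = [row['a'] for index, row in df.iterrows() if index <= 10]
print(selected)  # the first 11 values of column 'a'
```

Note that index here is the dataframe's index label, not a guaranteed position; the two coincide only with a default RangeIndex.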
