Looping over dataframe with extracting different columns at each iteration - python

I am trying to loop over a dataframe df and I would like to extract different columns at each iteration.
say I have Columns: ['A', 'B', 'C', 'D', 'E', 'F'] in my df
column_names=[['A','B'],['A','C','D']]
for index,row in df: //lets assume index starts with 0
row[column_names[index]] // However you can not apply this syntax for rows like you could for a df to get a sub dataframe.
What are my options? I have tried itertuples and iterrows but you can not select different columns by passing a list of column names
Thanks

The easiest way to loop over columns and retrieving a dataframe would be to invert your loops :
for col in column_names:
for ix in df.index:
print(df.loc[ix, col])

With iterrows() you get a tuple with index at 0th position and row at the 1st. You might want to use iterrows() as:
column_names=[['A',"B"],['A','C','D']]
for row in df.iterrows():
print(row[1][column_names[row[0]]].to_frame())
For a df of ones i.e.:
A B C D E F
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
You get:
A 1.0
B 1.0
Name: 0, dtype: float64
A 1.0
C 1.0
D 1.0
Name: 1, dtype: float64

Related

convert a dict with key and list of values into pandas Dataframe where values are column names

Given a dict like this:
d={'paris':['a','b'],
'brussels':['b','c'],
'mallorca':['a','d']}
#when doing:
df = pd.DataFrame(d)
df.T
I dont get the expected result.
What I would like to get is a one_hot_encoding DF, in which the columns are the capitals and the value 1 or 0 corresponds to every of the letters that every city includes being paris, mallorca ect
The desired result is:
df = pd.DataFrame([[1,1,0,0],[0,1,1,0],[1,0,0,1]], index=['paris','brussels','mallorca'], columns=list('abcd'))
df.T
Any clever way to do this without having to multiloop over the first dict to transform it into another one?
Solution 1:
Combine df.apply with series.value_counts and append df.fillna to fill NaN values with zeros.
out = df.apply(pd.Series.value_counts).fillna(0)
print(out)
paris brussels mallorca
a 1.0 0.0 1.0
b 1.0 1.0 0.0
c 0.0 1.0 0.0
d 0.0 0.0 1.0
Solution 1:
Transform your df using df.melt and then use the result inside pd.crosstab.
Again use df.fillna to change NaN values to zeros. Finally, reorder the columns based on the order in the original df.
out = df.melt(value_name='index')
out = pd.crosstab(index=out['index'], columns=out['variable'])\
.fillna(0).loc[:, df.columns]
print(out)
paris brussels mallorca
index
a 1 0 1
b 1 1 0
c 0 1 0
d 0 0 1
I don't know how 'clever' my solution is but it works and it is pretty consise and readable.
import pandas as pd
d = {'paris': ['a', 'b'],
'brussels': ['b', 'c'],
'mallorca': ['a', 'd']}
df = pd.DataFrame(d).T
df.columns = ['0', '1']
df = pd.concat([df['0'], df['1']])
df = pd.crosstab(df, columns=df.index)
print(df)
Yields:
brussels mallorca paris
a 0 1 1
b 1 0 1
c 1 0 0
d 0 1 0

How to drop duplicate data with different column names in pandas?

I have a DataFrame with columns with duplicate data with different names:
In[1]: df
Out[1]:
X1 X2 Y1 Y2
0.0 0.0 6.0 6.0
3.0 3.0 7.1 7.1
7.6 7.6 1.2 1.2
I know .drop(columns = ) exists but is there a way more efficient way to drop these without having to list down the column names? or not.. please let me know as i can just use .drop()
We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns.
df.drop_duplicates only removes duplicate rows.
Return DataFrame with duplicate rows removed.
We can create a function around np.unique to drop duplicate columns.
def drop_duplicate_cols(df):
uniq, idxs = np.unique(df, return_index=True, axis=1)
return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])
drop_duplicate_cols(X)
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
Online Demo
NB: np.unique docs:
Returns the sorted unique elements of an array.
Workaround: To retain the original order, sort the idxs.
Using .T on dataframe having multiple dtypes is going to mess with your actual dtypes.
df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D':[2.1, 3.1]})
df.dtypes
A int64
B object
C int64
D float64
dtype: object
df.T.T.dtypes
A object
B object
C object
D object
dtype: object
# To get back original `dtypes` we can use `.astype`
df.T.T.astype(df.dtypes).dtypes
A int64
B object
C int64
D float64
dtype: object
You could transpose with T and drop_duplicates then transpose back:
>>> df.T.drop_duplicates().T
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
>>>
Or with loc and duplicated:
>>> df.loc[:, df.T.duplicated(keep='last')]
X1 Y1
0 0.0 6.0
1 3.0 7.1
2 7.6 1.2
>>>

Loop through dataframe (cols and rows) and replace data

I have:
df = pd.DataFrame([[1, 2,3], [2, 4,6],[3, 6,9]], columns=['A', 'B','C'])
and I need to calculate de difference between the i+1 and i value of each row and column, and store it again in the same column. The output needed would be:
Out[2]:
A B C
0 1 2 3
1 1 2 3
2 1 2 3
I have tried to do this, but I finally get a list with all values appended, and I need to have them stored separately (in lists, or in the same dataframe).
Is there a way to do it?
difs=[]
for column in df:
for i in range(len(df)-1):
a = df[column]
b = a[i+1]-a[i]
difs.append(b)
for x in difs:
for column in df:
df[column]=x
You can use pandas function shift to achieve your intended goal. This is what it does (more on it on the docs):
Shift index by desired number of periods with an optional time freq.
for col in df:
df[col] = df[col] - df[col].shift(1).fillna(0)
df
Out[1]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0
Added
In case you want to use the loop, probably a good approach is to use iterrows (more on it here) as it provides (index, Series) pairs.
difs = []
for i, row in df.iterrows():
if i == 0:
x = row.values.tolist() ## so we preserve the first row
else:
x = (row.values - df.loc[i-1, df.columns]).values.tolist()
difs.append(x)
difs
Out[1]:
[[1, 2, 3], [1, 2, 3], [1, 2, 3]]
## Create new / replace old dataframe
cols = [col for col in df.columns]
new_df = pd.DataFrame(difs, columns=cols)
new_df
Out[2]:
A B C
0 1.0 2.0 3.0
1 1.0 2.0 3.0
2 1.0 2.0 3.0

Selecting a subset using dropna() to select multiple columns

I have the following DataFrame:
df = pd.DataFrame([[1,2,3,3],[10,20,2,],[10,2,5,],[1,3],[2]],columns = ['a','b','c','d'])
From this DataFrame, I want to drop the rows where all values in the subset ['b', 'c', 'd'] are NA, which means the last row should be dropped.
The following code works:
df.dropna(subset=['b', 'c', 'd'], how = 'all')
However, considering that I will be working with larger data frames, I would like to select the same subset using the range ['b':'d']. How do I select this subset?
IIUC, use loc, retrieve those columns, and pass that to dropna.
c = df.loc[0, 'b':'d'].columns # retrieve only the 0th row for efficiency
df = df.dropna(subset=c, how='all')
print(df)
a b c d
0 1 2.0 3.0 3.0
1 10 20.0 2.0 NaN
2 10 2.0 5.0 NaN
3 1 3.0 NaN NaN
Similar to #ayhan's idea - using df.columns.slice_indexer:
In [25]: cols = df.columns[df.columns.slice_indexer('b','d')]
In [26]: cols
Out[26]: Index(['b', 'c', 'd'], dtype='object')
In [27]: df.dropna(subset=cols, how='all')
Out[27]:
a b c d
0 1 2.0 3.0 3.0
1 10 20.0 2.0 NaN
2 10 2.0 5.0 NaN
3 1 3.0 NaN NaN
You could also slice the column list numerically:
c = df.columns[1:4]
df = df.dropna(subset=c, how='all')
If using numbers is impractical (i.e. too many to count), there is a somewhat cumbersome work-around:
start, stop = df.columns.get_loc('b'), df.columns.get_loc('d')
c = df.columns[start:stop+1]
df = df.dropna(subset=c, how='all')

Adding items to empty pandas DataFrame

I want to dynamically extend an empty pandas DataFrame in the following way:
df=pd.DataFrame()
indices=['A','B','C']
colums=['C1','C2','C3']
for colum in colums:
for index in indices:
#df[index,column] = anyValue
Where both indices and colums can have arbitrary sizes which are not known in advance, i.e. I cannot create a DataFrame with the correct size in advance.
Which pandas function can I use for
#df[index,column] = anyValue
?
I think you can use loc:
df = pd.DataFrame()
df.loc[0,1] = 10
df.loc[2,8] = 100
print(df)
1 8
0 10.0 NaN
2 NaN 100.0
Faster solution with DataFrame.set_value:
df = pd.DataFrame()
indices = ['A', 'B', 'C']
columns = ['C1', 'C2', 'C3']
for column in columns:
for index in indices:
df.set_value(index, column, 1)
print(df)
C1 C2 C3
A 1.0 1.0 1.0
B 1.0 1.0 1.0
C 1.0 1.0 1.0
loc works very well, but...
For single assignments use at
df = pd.DataFrame()
indices = ['A', 'B', 'C']
columns = ['C1', 'C2', 'C3']
for column in columns:
for index in indices:
df.at[index, column] = 1
df
.at vs .loc vs .set_value timing

Categories