Selecting a subset using dropna() to select multiple columns - python

I have the following DataFrame:
df = pd.DataFrame([[1, 2, 3, 3], [10, 20, 2], [10, 2, 5], [1, 3], [2]], columns=['a', 'b', 'c', 'd'])
From this DataFrame, I want to drop the rows where all values in the subset ['b', 'c', 'd'] are NA, which means the last row should be dropped.
The following code works:
df.dropna(subset=['b', 'c', 'd'], how = 'all')
However, considering that I will be working with larger data frames, I would like to select the same subset using the range ['b':'d']. How do I select this subset?

IIUC, use loc, retrieve those columns, and pass that to dropna.
c = df.loc[0, 'b':'d'].index  # slice only the 0th row for efficiency; a single row is a Series, so grab .index, not .columns
df = df.dropna(subset=c, how='all')
print(df)
    a     b    c    d
0   1   2.0  3.0  3.0
1  10  20.0  2.0  NaN
2  10   2.0  5.0  NaN
3   1   3.0  NaN  NaN

Similar to @ayhan's idea - using df.columns.slice_indexer:
In [25]: cols = df.columns[df.columns.slice_indexer('b','d')]
In [26]: cols
Out[26]: Index(['b', 'c', 'd'], dtype='object')
In [27]: df.dropna(subset=cols, how='all')
Out[27]:
    a     b    c    d
0   1   2.0  3.0  3.0
1  10  20.0  2.0  NaN
2  10   2.0  5.0  NaN
3   1   3.0  NaN  NaN

You could also slice the column list numerically:
c = df.columns[1:4]
df = df.dropna(subset=c, how='all')
If using numbers is impractical (i.e. too many to count), there is a somewhat cumbersome work-around:
start, stop = df.columns.get_loc('b'), df.columns.get_loc('d')
c = df.columns[start:stop+1]
df = df.dropna(subset=c, how='all')
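For reference, the label slice and the dropna call can also be written as a single expression; this is just a sketch of the same idea, slicing the column range over all rows:
df = df.dropna(subset=df.loc[:, 'b':'d'].columns, how='all')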

Why isn't transpose working to allow me to concat dataframe with df.loc["Y"] row

I have 2 dataframes:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]},
                  index=['X', 'Y', 'Z'])
and
df1 = pd.DataFrame({'M': [10, 20, 30],
                    'N': [40, 50, 60]},
                   index=['S', 'T', 'U'])
I want to append a row from df to df1.
I use the following code to extract the row:
row = df.loc['Y']
When I print this I get:
A    2
B    5
Name: Y, dtype: int64
A and B are the key values, i.e. the column heading names, so I transpose this with
row_25 = row.transpose()
I print row_25 and get:
A    2
B    5
Name: Y, dtype: int64
This is the same as row, so it seems the transpose didn't happen.
I then add this code to append the row to df1:
result = pd.concat([df1, row_25], axis=0, ignore_index=False)
print(result)
When I print result I get:
      M     N    0
S  10.0  40.0  NaN
T  20.0  50.0  NaN
U  30.0  60.0  NaN
A   NaN   NaN  2.0
B   NaN   NaN  5.0
I want A and B to be the column headings (key values) and the name of the row (Y) to be the row index.
What am I doing wrong?
Try
pd.concat([df1, df.loc[['Y']]])
It generates:
      M     N    A    B
S  10.0  40.0  NaN  NaN
T  20.0  50.0  NaN  NaN
U  30.0  60.0  NaN  NaN
Y   NaN   NaN  2.0  5.0
Not sure if this is what you want.
To exclude column names 'M' and 'N' from the result you can rename the columns beforehand:
>>> df1.columns = ['A', 'B']
>>> pd.concat([df1, df.loc[['Y']]])
    A   B
S  10  40
T  20  50
U  30  60
Y   2   5
The reason why you need double square brackets is that single square brackets return a 1-D Series, for which a transpose is a no-op, while double brackets return a 2-D DataFrame. (In general, double brackets are used to reference several rows or columns at once, like df.loc[['X', 'Y']]; this is called 'fancy indexing' in NumPy.)
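A quick way to see the difference (a minimal sketch using the df defined in the question):
print(df.loc['Y'].shape)    # (2,)  -> 1-D Series; .T / .transpose() returns it unchanged
print(df.loc[['Y']].shape)  # (1, 2) -> 2-D DataFrame whose index is ['Y']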
If you are allergic to double brackets, use
pd.concat([df1.rename(columns={'M': 'A', 'N': 'B'}),
           df.filter(['Y'], axis=0)])
Finally, if you really want to transpose something, you can convert the series to a frame and transpose it:
>>> df.loc['Y'].to_frame().T
   A  B
Y  2  5
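Putting the two ideas together, a minimal sketch that combines the rename with the transposed one-row frame (it gives the same result as the renamed concat above):
pd.concat([df1.rename(columns={'M': 'A', 'N': 'B'}),
           df.loc['Y'].to_frame().T])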

Convert a dict with keys and lists of values into a pandas DataFrame where the values are the column names

Given a dict like this:
d = {'paris': ['a', 'b'],
     'brussels': ['b', 'c'],
     'mallorca': ['a', 'd']}
#when doing:
df = pd.DataFrame(d)
df.T
I don't get the expected result.
What I would like to get is a one-hot-encoded DataFrame, in which the columns are the cities and the values 1 or 0 indicate whether each of the letters belongs to that city (paris, mallorca, etc.).
The desired result is:
df = pd.DataFrame([[1,1,0,0],[0,1,1,0],[1,0,0,1]], index=['paris','brussels','mallorca'], columns=list('abcd'))
df.T
Any clever way to do this without having to multiloop over the first dict to transform it into another one?
Solution 1:
Combine df.apply with Series.value_counts and chain df.fillna to fill the NaN values with zeros.
out = df.apply(pd.Series.value_counts).fillna(0)
print(out)
   paris  brussels  mallorca
a    1.0       0.0       1.0
b    1.0       1.0       0.0
c    0.0       1.0       0.0
d    0.0       0.0       1.0
Solution 2:
Transform your df using df.melt and then use the result inside pd.crosstab.
Again, use df.fillna to change NaN values to zeros. Finally, reorder the columns to match the order in the original df.
out = df.melt(value_name='index')
out = pd.crosstab(index=out['index'], columns=out['variable'])\
        .fillna(0).loc[:, df.columns]
print(out)
       paris  brussels  mallorca
index
a          1         0         1
b          1         1         0
c          0         1         0
d          0         0         1
I don't know how 'clever' my solution is, but it works and it is pretty concise and readable.
import pandas as pd
d = {'paris': ['a', 'b'],
     'brussels': ['b', 'c'],
     'mallorca': ['a', 'd']}
df = pd.DataFrame(d).T
df.columns = ['0', '1']
df = pd.concat([df['0'], df['1']])
df = pd.crosstab(df, columns=df.index)
print(df)
Yields:
   brussels  mallorca  paris
a         0         1      1
b         1         0      1
c         1         0      0
d         0         1      0
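For completeness, one more route that is not in the answers above (a sketch, assuming pandas >= 0.25 for Series.explode): build one long Series straight from the dict and cross-tabulate it, which skips the intermediate wide frame entirely.
s = pd.Series(d).explode()                                   # index = city, values = letters
out = pd.crosstab(index=s.values, columns=s.index)[list(d)]  # reorder columns to the dict order
print(out)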

Summing multiple columns using a regular expression to select which columns to sum

I would like to perform the following:
test = pd.DataFrame({'A1': [1, 1, 1, 1],
                     'A2': [1, 2, 2, 1],
                     'A3': [1, 1, 1, 1],
                     'B1': [1, 1, 1, 1],
                     'B2': [pd.NA, 1, 1, 1]})
result = pd.DataFrame({'A': test.filter(regex='A').sum(axis=1),
                       'B': test.filter(regex='B').sum(axis=1)})
I was wondering whether there is a better method to do this, when we have more columns and more "regex"-matches.
Use a dict comprehension instead of repeating the code for each match:
L = ['A','B']
df = pd.DataFrame({x: test.filter(regex=x).sum(axis=1) for x in L})
Or, if it is possible to simplify the solution by selecting only the first letter of each column name, use:
df = test.groupby(lambda x: x[0], axis=1).sum()
print (df)
   A    B
0  3  1.0
1  4  2.0
2  4  2.0
3  3  2.0
If the regexes should be joined by | to get the matching substring of every column name, use:
vals = test.columns.str.extract('(A|B)', expand=False)
print (vals)
Index(['A', 'A', 'A', 'B', 'B'], dtype='object')
df = test.groupby(vals, axis=1).sum()
print (df)
   A    B
0  3  1.0
1  4  2.0
2  4  2.0
3  3  2.0
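If the list of prefixes is not known in advance, it can also be derived from the column names themselves; here is a small sketch building on the dict-comprehension idea above (like the groupby version, it assumes the group is the first character of each column name):
prefixes = sorted(set(c[0] for c in test.columns))   # ['A', 'B']
df = pd.DataFrame({p: test.filter(regex=f'^{p}').sum(axis=1) for p in prefixes})
print(df)   # same output as above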

merge new data into old data by replacing old data while appending new rows

I have 2 data frames with the same column names: the old data frame old_df and the new data frame new_df, with one column as the key.
I am trying to merge the 2 data frames into a single data frame under the following conditions:
If a key is missing in the new table, the data from old_df should be kept.
If a key is missing in the old table, the data from new_df should be added.
If a key is present in both tables, the data from new_df should overwrite the data from old_df.
Below is my code snippet that I am trying to play with.
new_data = pd.read_csv(filepath)
new_data.set_index(['Name'])
old_data = pd.read_sql_query("select * from dbo.Details", con=engine)
old_data.set_index(['Name'])
merged_result = pd.merge(new_data[['Name', 'RIC', 'Volatility', 'Sector']],
                         old_data,
                         on='Name',
                         how='outer')
I am thinking of using np.where from this point onwards but am not sure how to proceed. Please advise.
I believe you need DataFrame.combine_first with DataFrame.set_index to match on the Name column:
merged_result = (new_data.set_index('Name')[['RIC', 'Volatility', 'Sector']]
                 .combine_first(old_data.set_index('Name'))
                 .reset_index())
Sample data:
old_data = pd.DataFrame({'RIC': range(6),
                         'Volatility': [5, 3, 6, 9, 2, 4],
                         'Name': list('abcdef')})
print (old_data)
   RIC  Volatility Name
0    0           5    a
1    1           3    b
2    2           6    c
3    3           9    d
4    4           2    e
5    5           4    f
new_data = pd.DataFrame({'RIC': range(4),
                         'Volatility': [10, 20, 30, 40],
                         'Name': list('abhi')})
print (new_data)
   RIC  Volatility Name
0    0          10    a
1    1          20    b
2    2          30    h
3    3          40    i
merged_result = (new_data.set_index('Name')
                 .combine_first(old_data.set_index('Name'))
                 .reset_index())
print (merged_result)
  Name  RIC  Volatility
0    a  0.0        10.0
1    b  1.0        20.0
2    c  2.0         6.0
3    d  3.0         9.0
4    e  4.0         2.0
5    f  5.0         4.0
6    h  2.0        30.0
7    i  3.0        40.0
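As a quick check of the priority order (a minimal sketch on the sample frames above): combine_first keeps the caller's non-NaN values, so for a key present in both frames the value from new_data wins.
print(merged_result.set_index('Name').loc['a', 'Volatility'])   # 10.0 -> taken from new_data, not 5 from old_data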
@jezrael's answer looks good. You may also try splitting the dataset on those conditions and concatenating the old and new dataframes.
In the following example, I'm taking col1 as the key and producing a result that complies with the combination rules in your question.
import pandas as pd
old_data = {'col1': ['a', 'b', 'c', 'd', 'e'], 'col2': ['A', 'B', 'C', 'D', 'E']}
new_data = {'col1': ['a', 'b', 'e', 'f', 'g'], 'col2': ['V', 'W', 'X', 'Y', 'Z']}
old_df = pd.DataFrame(old_data)
new_df = pd.DataFrame(new_data)
old_df:
  col1 col2
0    a    A
1    b    B
2    c    C
3    d    D
4    e    E
new_df:
  col1 col2
0    a    V
1    b    W
2    e    X
3    f    Y
4    g    Z
Now,
df = pd.concat([new_df, old_df[~old_df['col1'].isin(new_df['col1'])]], axis=0).reset_index(drop=True)
Which gives us
df:
  col1 col2
0    a    V
1    b    W
2    e    X
3    f    Y
4    g    Z
5    c    C
6    d    D
Hope this helps.

Calculate difference between first valid value and last valid value in DataFrame per row?

I'm trying to find the difference between the first valid value and the last valid value in a DataFrame per row.
I have a working code with a for loop and looking for something faster.
Here's an example of what I'm doing currently:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(16).astype(float).reshape(4, 4),
    columns=['a', 'b', 'c', 'd'])
# Fill some NaN
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan
print(df)
#      a    b     c     d
# 0  NaN  1.0   2.0   NaN
# 1  4.0  5.0   NaN   NaN
# 2  8.0  NaN  10.0  11.0
# 3  NaN  NaN   NaN   NaN
diffs = pd.Series(index=df.index, dtype=float)  # float dtype avoids the object-dtype default
for i in df.index:
    row = df.loc[i]
    min_i = row.first_valid_index()
    max_i = row.last_valid_index()
    if min_i is None or min_i == max_i:  # 0 or 1 valid values
        continue
    diffs[i] = df.loc[i, max_i] - df.loc[i, min_i]
df['diff'] = diffs
print(df)
#      a    b     c     d  diff
# 0  NaN  1.0   2.0   NaN   1.0
# 1  4.0  5.0   NaN   NaN   1.0
# 2  8.0  NaN  10.0  11.0   3.0
# 3  NaN  NaN   NaN   NaN   NaN
One way would be to back-fill and forward-fill the missing values, and then just compare the first and last columns.
df2 = df.ffill(axis=1).bfill(axis=1)
df['diff'] = df2.iloc[:, -1] - df2.iloc[:, 0]
If you want to do it in one line, without creating a new dataframe:
df['diff'] = df.ffill(axis=1).bfill(axis=1).apply(lambda r: r.d - r.a, axis=1)
Pandas makes your life easy, one method (first_valid_index()) at a time. Note that you'll have to drop any rows that are all NaN first (no point in having these anyway):
For first valid values:
a = [df.loc[x, i] for x, i in enumerate(df.apply(lambda row: row.first_valid_index(), axis=1))]
For last valid values:
b = [df.loc[x, i] for x, i in enumerate(df.apply(lambda row: row[::-1].first_valid_index(), axis=1))]
Subtract (last minus first) to get the final result:
np.array(b) - np.array(a)
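Put together as a runnable sketch (my own assembly of the steps above: drop the all-NaN row first, then take last minus first so the sign matches the expected output in the question):
valid = df[['a', 'b', 'c', 'd']].dropna(how='all')
first = valid.apply(lambda r: r[r.first_valid_index()], axis=1)
last = valid.apply(lambda r: r[r.last_valid_index()], axis=1)
df['diff'] = last - first   # row 3 is not in valid, so it stays NaN after index alignment
print(df['diff'])
# 0    1.0
# 1    1.0
# 2    3.0
# 3    NaN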
