How do I drop row number i of a DataFrame?
I tried the code below, but it is not working.
DF = DF.drop(i)
So I wonder what I am missing here.
You must pass a label to drop. Here, drop tries to use i as a label and fails (with a KeyError) since your index probably contains other values. Worse, if the index were composed of integers in random order, you might drop the wrong row without noticing it.
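For instance, a minimal sketch of the silent-failure case (using a hypothetical shuffled integer index):
import pandas as pd

df = pd.DataFrame({'col': range(4)}, index=[2, 0, 3, 1])
i = 1
df.drop(i)  # silently drops the row labelled 1 (the last row), not the row at position 1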
Use:
df.drop(df.index[i])
Example:
df = pd.DataFrame({'col': range(4)}, index=list('ABCD'))
out = df.drop(df.index[2])
output:
col
A 0
B 1
D 3
Pitfall:
In case of duplicated indices, you might remove unwanted rows!
df = pd.DataFrame({'col': range(4)}, index=list('ABAD'))
out = df.drop(df.index[2])
output (A is incorrectly dropped!):
col
B 1
D 3
Workaround:
import numpy as np
out = df[np.arange(len(df)) != i]
To drop several rows by position:
import numpy as np
out = df[~np.isin(np.arange(len(df)), [i, j])]
You need to add square brackets:
df = df.drop([i])
Try this:
df.drop(df.index[i])
I've been trying to write a Pandas dataframe out to HTML and have entire rows highlighted if the value of one specific column for that row is over a threshold. I've looked through the Pandas Styler slicing docs and tried to adapt the highlight_max function for this use, but I seem to be failing miserably; if I try, say, to replace the is_max with a check for whether a given row's value is above said threshold (e.g., something like
is_x = df['column_name'] >= threshold
), it isn't apparent how to properly pass such a thing or what to return.
I've also tried to simply define it elsewhere using df.loc, but that hasn't worked too well either.
Another concern came up: if I drop that column (currently the criterion) afterwards, will the styling still hold? I am wondering if a df.loc would prevent such a thing from being a problem.
This solution allows you to pass a column label or a list of column labels and highlights the entire row if the value in those column(s) exceeds the threshold.
import pandas as pd
import numpy as np
np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
               axis=1)
df.iloc[0, 2] = np.nan
def highlight_greaterthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] >= threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]
df.style.apply(highlight_greaterthan, threshold=1.0, column=['C', 'B'], axis=1)
Output:
Or for one column
df.style.apply(highlight_greaterthan, threshold=1.0, column='E', axis=1)
Here is a simpler approach:
Assume you have a 100 x 10 dataframe, df. Also assume you want to highlight all the rows corresponding to a column, say "duration", greater than 5.
You first need to define a function that highlights the cells. The real trick is that you need to return a row, not a single cell. For example:
def highlight(s):
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)
Note that the return value should be a list of 10 elements (corresponding to the number of columns). This is the key part.
Now you can apply this to the dataframe style as:
df.style.apply(highlight, axis=1)
Assume you have the following dataframe and you want to highlight in red the rows where id is greater than 3.
id char date
0 0 s 2022-01-01
1 1 t 2022-02-01
2 2 y 2022-03-01
3 3 l 2022-04-01
4 4 e 2022-05-01
5 5 r 2022-06-01
You can try Styler.set_properties with pandas.IndexSlice
# Subset your original dataframe with condition
df_ = df[df['id'].gt(3)]
# Pass the subset dataframe index and column to pd.IndexSlice
slice_ = pd.IndexSlice[df_.index, df_.columns]
s = df.style.set_properties(**{'background-color': 'red'}, subset=slice_)
s.to_html('test.html')
You can also try Styler.apply with axis=None which passes the whole dataframe.
def styler(df):
    color = 'background-color: {}'.format
    mask = pd.concat([df['id'].gt(3)] * df.shape[1], axis=1)
    style = np.where(mask, color('red'), color('green'))
    return style
s = df.style.apply(styler, axis=None)
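As with the set_properties approach above, the styled result can then be written out, for example:
s.to_html('test.html')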
I have a big dataframe of items, simplified below. I am looking for a good way to find the item (A, B, C) in each row that is repeated at least 2 times.
For example, in row 1 it is A and in row 2 the result is B.
simplified df:
df = pd.DataFrame({'C1': ['A', 'B', 'A', 'A', 'C'],
                   'C2': ['B', 'A', 'A', 'C', 'B'],
                   'C3': ['A', 'B', 'A', 'C', 'C']},
                  index=['ro1', 'ro2', 'ro3', 'ro4', 'ro5'])
As mozway suggested, we don't know what your desired output is. I will assume you need a list.
You can try something like this.
import pandas as pd
from collections import Counter
holder = []
for index in range(len(df)):
    temp = Counter(df.iloc[index, :].values)
    holder.append(','.join([key for key, value in temp.items() if value >= 2]))
As you have three columns and always a non-unique value, you can conveniently use mode:
df.mode(1)[0]
Output:
ro1 A
ro2 B
ro3 A
ro4 C
ro5 C
Name: 0, dtype: object
If you might have all unique values (e.g. A/B/C), you need to check that the mode is not unique:
m = df.mode(1)[0]
m2 = df.eq(m, axis=0).sum(1).le(1)
m.mask(m2)
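To illustrate, a small sketch extending the example above with a hypothetical all-unique row 'ro6':
df6 = pd.concat([df, pd.DataFrame({'C1': ['A'], 'C2': ['B'], 'C3': ['C']}, index=['ro6'])])
m = df6.mode(1)[0]
m2 = df6.eq(m, axis=0).sum(1).le(1)
m.mask(m2)  # ro1-ro5 keep their repeated value; ro6 becomes NaN since nothing repeats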
I have a dataframe of unique strings and I want to find the row and column for a given string. I want these values because I'll eventually be exporting this dataframe to an Excel spreadsheet. The easiest way I've found so far to get these values is the following:
jnames = list(df.iloc[0].to_frame().index)
for i in jnames:
    for k in df[i]:
        if 'searchstring' in str(k):
            print('Column: {}'.format(jnames.index(i) + 1))
            print('Row: {}'.format(list(df[i]).index('searchstring')))
            break
Can anyone advise a solution that takes better advantage of the inherent capabilities of pandas?
Without reproducible code / data, I'm going to make up a dataframe and show one simple way:
Setup
import pandas as pd, numpy as np
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'b']])
The dataframe looks like this:
0 1 2
0 a b c
1 d e f
2 g h b
Solution
result = list(zip(*np.where(df.values == 'b')))
Result
[(0, 1), (2, 2)]
Explanation
df.values accesses the numpy array underlying the dataframe.
np.where creates an array of coordinates satisfying the provided condition.
zip(*...) transforms [x-coords-array, y-coords-array] into (x, y) coordinate pairs.
Try using str.contains. This will return a dataframe of the rows that contain the slice you are looking for.
df[df['<my_col>'].str.contains('<my_string_slice>')]
Similarly, you can use match for a direct match.
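For a direct match it would look similar (a sketch using the same placeholder names; str.match anchors the pattern at the start of the string):
df[df['<my_col>'].str.match('<my_string_slice>')]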
This is my approach, avoiding nested for loops:
value_to_search = "c"
print(df[[x for x in df.columns if value_to_search in df[x].unique()]].index[0])
print(df[[x for x in df.columns if value_to_search in df[x].unique()]].columns[0])
The first will return the index and the second will return the column name. Combined together, you get the index-column combination. Since you mentioned that all values in the df are unique, both lines will return exactly one value.
You might need a try-except in case value_to_search is not in the data frame.
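A minimal sketch of such a guard (the searched value here is hypothetical; columns[0] raises an IndexError when no column contains it):
value_to_search = "zzz"
matching = [x for x in df.columns if value_to_search in df[x].unique()]
try:
    print(df[matching].columns[0])
except IndexError:
    print("value not found in the dataframe")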
By using stack (data from jpp):
df[df=='b'].stack()
Out[211]:
0 1 b
2 2 b
dtype: object
I have a list of columns in a dataframe that I want to run through and perform an operation on. The columns hold datetimes or nothing.
For each column in the list, I would like to trim every value in the column that contains "20" in it to the first 10 characters, otherwise leave it as is.
I've tried this a few ways, but get a variety of errors or imperfect results.
The following version throws an error of "'str' object has no attribute 'apply'", but if I don't use ".astype(str)", then I get an error of "argument of type 'datetime.datetime' is not iterable".
df_combined[dateColumns] = df_combined[dateColumns].fillna(notFoundText).astype(str)
print(dateColumns)
for column in dateColumns:
    for row in range(len(column)):
        print(df_combined[column][row])
        if "20" in (df_combined[column][row]):
            df_combined[column][row].apply(lambda x: x[:10], axis=1)
            print(df_combined[column][row])
Halp. Thanks in advance.
Loops are considered an abomination in pandas. I'd recommend just doing something like this, with str.contains + np.where.
for c in df.columns:
# df[c] = df[c].astype(str) # uncomment this if your columns aren't dtype=str
df[c] = np.where(df[c].str.contains("20"), df[c].str[:10], df[c])
IIUC:
You want to do this over the entire dataframe.
If so, here is a vectorized way using numpy over the entire dataframe at once.
Setup
df = pd.DataFrame([
    ['xxxxxxxx20yyyy', 'z' * 14, 'wwwwwwww20vvvv'],
    ['k' * 14, 'dddddddd20ffff', 'a' * 14]
], columns=list('ABC'))
df
A B C
0 xxxxxxxx20yyyy zzzzzzzzzzzzzz wwwwwwww20vvvv
1 kkkkkkkkkkkkkk dddddddd20ffff aaaaaaaaaaaaaa
Solution
Using numpy.core.defchararray.find and np.where
from numpy.core.defchararray import find
v = df.values.astype(str)
i, j = np.where(find(v, '20') > -1)
v[i, j] = v[i, j].astype('<U10')
df.loc[:] = v
df
A B C
0 xxxxxxxx20 zzzzzzzzzzzzzz wwwwwwww20
1 kkkkkkkkkkkkkk dddddddd20 aaaaaaaaaaaaaa
If you don't want to overwrite the old dataframe, you can create a new one:
pd.DataFrame(v, df.index, df.columns)
A B C
0 xxxxxxxx20 zzzzzzzzzzzzzz wwwwwwww20
1 kkkkkkkkkkkkkk dddddddd20 aaaaaaaaaaaaaa
I have two large dataframes I want to compare. I want a comparison result that gives column- and/or row-wise similarity percentages. This part is simple. However, I want to be able to make the comparison ignore differences based upon value criteria. A small example is below.
d1 = {'Sample': pd.Series([101, 102, 103]),
      'Col1': pd.Series(['AA', '--', 'BB']),
      'Col2': pd.Series(['AB', 'AA', 'BB'])}
d2 = {'Sample': pd.Series([101, 102, 103]),
      'Col1': pd.Series(['BB', 'AB', '--']),
      'Col2': pd.Series(['AB', 'AA', 'AB'])}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df1 = df1.set_index('Sample')
df2 = df2.set_index('Sample')
comparison = df1.eq(df2)
# for column stats
comparison.sum(axis=0) / float(len(df1.index))
# for row stats
comparison.sum(axis=1) / float(len(df1.columns))
My problem is that when value1='AA' and value2='--' I want them to be treated as equal (so whenever one of the two values is '--', the comparison should be true), but otherwise perform a normal Boolean comparison. I need an efficient way to do this that doesn't involve excessive looping, as the datasets are quite large.
Below, I'm interpreting "when one is '--' basically always be true" to mean that any comparison against '--' (no matter what the other value is) should return True. In that case, you could use
mask = (df1=='--') | (df2=='--')
to find every location where either df1 or df2 is equal to '--' and then use
comparison |= mask
to update comparison. For example,
import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2015)
N = 10000
df1, df2 = [pd.DataFrame(
    np.random.choice(list(map(''.join, IT.product(list('ABC'), repeat=2))) + ['--'],
                     size=(N, 2)),
    columns=['Col1', 'Col2']) for i in range(2)]
comparison = df1.eq(df2)
mask = (df1=='--') | (df2=='--')
comparison |= mask
# for column stats
column_stats = comparison.sum(axis=0) / float(len(df1.index))
# for row stats
row_stats = comparison.sum(axis=1) / float(len(df1.columns))
I think a list comprehension should be quite fast:
new_columns = []
for col in df1.columns:
    new_columns.append([x == y or x == '--' or y == '--' for x, y in zip(df1[col], df2[col])])
results = pd.DataFrame(new_columns).T
results.index = df1.index
This outputs the full true/false df.
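From there, the same percentage statistics from the question apply to this result, for example:
# for column stats
results.sum(axis=0) / float(len(df1.index))
# for row stats
results.sum(axis=1) / float(len(df1.columns))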