delete pandas dataframe row if every value is equal - python

If I have a pandas dataframe with a row of float values that are all equal, how do I delete that row from the dataframe?

Use DataFrame.nunique to count the number of unique values per row, then filter out the constant rows with Series.ne and boolean indexing:
df1 = df[df.nunique(axis=1).ne(1)]
Or compare every column against the first column with DataFrame.ne, and keep rows where at least one value differs, using DataFrame.any:
df1 = df[df.ne(df.iloc[:, 0], axis=0).any(axis=1)]
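For instance, on a small frame with one constant row (a hypothetical example, not from the original post), the first solution keeps only the non-constant rows:
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [1.0, 3.0], 'C': [1.0, 4.0]})
df1 = df[df.nunique(axis=1).ne(1)]  # drops row 0, where every value is 1.0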
EDIT: If you want to remove all rows and all columns with identical values, the solution should be extended to also test the columns, using loc and axis=0:
df = pd.DataFrame({
    'B': [4, 4, 4, 4, 4, 4],
    'C': [4, 4, 9, 4, 2, 3],
    'D': [4, 4, 5, 7, 1, 0],
})
print (df)
B C D
0 4 4 4
1 4 4 4
2 4 9 5
3 4 4 7
4 4 2 1
5 4 3 0
df2 = df.loc[df.nunique(axis=1).ne(1), df.nunique(axis=0).ne(1)]
And for second solution:
df2 = df.loc[df.ne(df.iloc[:, 0], axis=0).any(axis=1), df.ne(df.iloc[0], axis=1).any(axis=0)]
print (df2)
C D
2 9 5
3 4 7
4 2 1
5 3 0

You can use DataFrame.diff over axis=1 (per row):
# Example dataframe:
df = pd.DataFrame({'Col1': [1, 2, 3],
                   'Col2': [2, 2, 5],
                   'Col3': [4, 2, 9]})
Col1 Col2 Col3
0 1 2 4
1 2 2 2 # <-- row with all same values
2 3 5 9
df[df.diff(axis=1).fillna(0).ne(0).any(axis=1)]
Col1 Col2 Col3
0 1 2 4
2 3 5 9
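Since the question mentions float values, note that exact equality can be brittle for floats. A minimal sketch, assuming a tolerance-based comparison with numpy.isclose is acceptable (the rtol value here is an arbitrary choice; tune it for your data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [1.0 + 1e-12, 3.0], 'C': [1.0, 4.0]})

vals = df.to_numpy()
# A row counts as "constant" if every value is approximately equal to its first value
all_close = np.isclose(vals, vals[:, [0]], rtol=1e-9).all(axis=1)
df1 = df[~all_close]  # drops row 0 despite the tiny rounding difference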

Related

how to insert a list value into a dataframe by row and column number?

How do I insert the values of a list into a dataframe at a specific row and column?
For example say I have the dataframe
source col 1 col 2
0 a xxx xxx
1 b xxx xxx
2 c xxx xxx
3 a xxx xxx
My list is
list_value = [5,"text"]
How do I insert this list into the dataframe starting at row 1 and column 1 (col 1), so that I get
source col 1 col 2
0 a xxx xxx
1 b 5 xxx
2 c text xxx
3 a xxx xxx
EDIT
@Dev Arora
When I run your code I get this error:
d = {'col1': [1, 2,3,5,6,7], 'col2': [3, 4,5,"",5,6]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
2 3 5
3 5
4 6 5
5 7 6
list_value = [5,"text"]
df.at[1, 'col2'] = list_value
df
col1 col2
0 1 3
1 2 [5, 'text']
2 3 5
3 5
4 6 5
5 7 6
Instead I want it to be
col1 col2
0 1 3
1 2 5
2 3 'text'
3 5
4 6 5
5 7 6
Assuming we're looking at pandas dataframes:
I think the df.at operator is what you're looking for:
df = pd.read_csv("./test.csv")
list_value = [5,"text"]
string_to_input = ""
for val in list_value:
    string_to_input += str(val) + " "
df.at[<row_num>, "<col_name>"] = string_to_input
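As an aside, the string building above can be written more idiomatically with str.join (same result apart from the trailing space):
string_to_input = " ".join(str(val) for val in list_value)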
EDIT: If you're looking to add the values in as a single list, you can also do
df = pd.read_csv("./test.csv")
list_value = [5,"text"]
df.at[<row_num>, "<col_name>"] = list_value
EDIT: Hmm, okay, let's take this from the top. As per the desired information in the post, i.e. how to insert a value into a dataframe by row and column number, we're looking at df.at. What df.at does is insert a value into a dataframe at the specific row number and column number given. So in the example:
d = {'col1': [1, 2,3,5,6,7], 'col2': [3, 4,5,"",5,6]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4
2 3 5
3 5
4 6 5
5 7 6
list_value = [5,"text"]
df.at[1, 'col2'] = list_value
df
col1 col2
0 1 3
1 2 [5, 'text']
2 3 5
3 5
4 6 5
5 7 6
That is exactly what has happened. This is not an error.
The command df.at[1, 'col2'] = list_value specifies that at row 1 and col2 insert the list_value which is [5, 'text'].
If you want a dataframe that looks like this by specifically indicating the desired row and column for each insertion:
col1 col2
0 1 3
1 2 5
2 3 'text'
3 5
4 6 5
5 7 6
Something like this is required:
df.at[1, "col2"] = 5
df.at[2, "col2"] = 'text'
The above code specifies that at row 1, col2 insert 5, and at row 2 col2 insert 'text'. Hope this helps!
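If you would rather spread the list over consecutive rows in one statement instead of one df.at call per element, a minimal sketch using .loc with a row slice (note that .loc slices are inclusive on both ends, and the list length must match the number of selected rows):
import pandas as pd

d = {'col1': [1, 2, 3, 5, 6, 7], 'col2': [3, 4, 5, "", 5, 6]}
df = pd.DataFrame(data=d)

list_value = [5, "text"]
# Element-wise assignment: row 1 gets 5, row 2 gets "text"
df.loc[1:2, 'col2'] = list_value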

pandas: groupby two columns and get random selection of groups such that each value in the first column will be represented by a single group

It's similar to this question, but with an additional level of complexity.
In my case, I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': list('aaabbbabababbaaa'), 'col2': list('cdddccdsssssddcd'), 'val': range(0, 16)})
output:
col1 col2 val
0 a c 0
1 a d 1
2 a d 2
3 b d 3
4 b c 4
5 b c 5
6 a d 6
7 b s 7
8 a s 8
9 b s 9
10 a s 10
11 b s 11
12 b d 12
13 a d 13
14 a c 14
15 a d 15
My goal is to select random groups from groupby(['col1', 'col2']) such that each value of col1 is represented by exactly one group.
This can be executed by the following code:
import numpy as np

g = df.groupby('col1')
indexes = []
for _, group in g:
    g_ = group.groupby('col2')
    a = np.arange(g_.ngroups)
    np.random.shuffle(a)
    indexes.extend(group[g_.ngroup().isin(a[:1])].index.tolist())
output:
print(df[df.index.isin(indexes)])
col1 col2 val
4 b c 4
5 b c 5
8 a s 8
10 a s 10
However, I'm looking for a more concise and pythonic way to solve this.
Another option is to shuffle your two columns with sample and drop_duplicates by col1, so that you keep only one (col1, col2) pair per col1 value. Then merge the result with df to select all the rows with these pairs.
print(df.merge(df[['col1','col2']].sample(frac=1).drop_duplicates('col1')))
col1 col2 val
0 b s 7
1 b s 9
2 b s 11
3 a s 8
4 a s 10
Or, with groupby and sample, much the same idea, but selecting only one row per col1 value and merging afterwards:
df.merge(df[['col1','col2']].groupby('col1').sample(n=1))
EDIT: to get both the selected rows and the remaining rows, you can use the indicator parameter of merge and do a left merge, then query each part separately:
m = df.merge(df[['col1','col2']].groupby('col1').sample(1), how='left', indicator=True)
print(m)
select_ = m.query('_merge=="both"')[df.columns]
print(select_)
comp_ = m.query('_merge=="left_only"')[df.columns]
print(comp_)
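For reproducible selections, groupby.sample also accepts a random_state; a minimal sketch (the seed value 0 is an arbitrary choice):
import pandas as pd

df = pd.DataFrame({'col1': list('aaabbbabababbaaa'),
                   'col2': list('cdddccdsssssddcd'),
                   'val': range(0, 16)})

# One (col1, col2) pair per col1 value, the same one on every run
picked = df[['col1', 'col2']].groupby('col1').sample(n=1, random_state=0)
result = df.merge(picked)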

How to slice a pandas dataframe based on the position after a specified row

The following code:
import pandas as pd
df = pd.DataFrame({'col':['A', '1', '2', '3', 'B', '4', '5', 'C', '7', '8', '10']})
Produces the following dataframe:
col
0 A
1 1
2 2
3 3
4 B
5 4
6 5
7 C
8 7
9 8
10 10
I would like to come up with a good, pandas-friendly way of slicing the dataframe based on the occurrence of the letters 'A', 'B' or 'C'. The expected result is as follows:
col col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
How can I achieve this?
Create a mask that finds the rows to split on. To create col2, boolean-index with the mask, reindex with the full original index, and forward-fill the missing values. For col1, copy the original col. Then build the final df and index it with the negation of the mask.
mask = df['col'].isin(['A', 'B', 'C']) # could use df['col'].str.isalpha() also
col2 = df['col'][mask].reindex(df.index).ffill()
col1 = df['col']
df = pd.DataFrame({'col1':col1, 'col2':col2})[~mask]
Result (df):
col1 col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
Just find the letters you are looking for in the column, then explode and ffill the result. You can then keep only the rows where col and col2 have different values.
df['col2'] = df['col'].str.findall('A|B|C').explode().ffill()
df[df['col']!=df['col2']]
col col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
One way is to form a mask with isalpha to mark the letters, and group each letter together with "its digits" via cumsum. Transforming with "first" then gives almost exactly col2, except that the letter rows repeat their own value; those are dropped with the mask formed at the start:
mask = df.col.str.isalpha()
grouper = mask.cumsum()
new_df = df.assign(col2=df.groupby(grouper)["col"].transform("first"))[~mask]
to get
>>> new_df
col col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
Just needs a clever way of slicing:
1. Find all of the letters
2. Mask anything that is not a letter as NaN
3. Forward-fill the column
4. Remove the original rows that were letters
is_letters = df["col"].str.isalpha()
new_df = df.where(is_letters).ffill().loc[~is_letters]
print(new_df)
col
1 A
2 A
3 A
5 B
6 B
8 C
9 C
10 C
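Note that the where/ffill output above replaces the digit values with the letters. To keep the original values alongside, a minimal sketch combining the same idea with assign (the col2 name follows the expected output in the question):
import pandas as pd

df = pd.DataFrame({'col': ['A', '1', '2', '3', 'B', '4', '5', 'C', '7', '8', '10']})

is_letters = df['col'].str.isalpha()
# Keep the letters only, fill them downwards, then drop the letter rows
new_df = df.assign(col2=df['col'].where(is_letters).ffill())[~is_letters]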

keep the same factorizing between two data

We have two data sets with one variable, col1.
Some levels are missing in the second data set. For example, let
import pandas as pd
df1 = pd.DataFrame({'col1':["A","A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
When we factorize df1
df1["f_col1"]= pd.factorize(df1.col1)[0]
df1
we get
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
But when we do it for df2
df2["f_col1"]= pd.factorize(df2.col1)[0]
df2
we get
col1 f_col1
0 A 0
1 B 1
2 D 2
3 E 3
This is not what I want. I want to keep the same factorizing between the data sets, i.e. in df2 we should have something like
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
Thanks.
PS: The two data sets are not always available at the same time, so I cannot concat them. The values should be stored from df1 and used in df2 when it becomes available.
You could concatenate the two DataFrames, then apply pd.factorize once to the entire column:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df = pd.concat({'df1':df1, 'df2':df2})
df['f_col1'], uniques = pd.factorize(df['col1'])
print(df)
yields
col1 f_col1
df1 0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
df2 0 A 0
1 B 1
2 D 3
3 E 4
To extract df1 and df2 from df you could use df.loc:
In [116]: df.loc['df1']
Out[116]:
col1 f_col1
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
In [117]: df.loc['df2']
Out[117]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
(But note that since performance of vectorized operations improve if you can apply them once to large DataFrames instead of multiple times to smaller DataFrames, you might be better off keeping df and ditching df1 and df2...)
Alternatively, if you must generate df1['f_col1'] first, and then compute
df2['f_col1'] later, you could use merge to join df1 and df2 on col1:
import pandas as pd
df1 = pd.DataFrame({'col1':["A","B","C","D","E"]})
df2 = pd.DataFrame({'col1':["A","B","D","E"]})
df1['f_col1'], uniques = pd.factorize(df1['col1'])
df2 = pd.merge(df2, df1, how='left')
print(df2)
yields
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
You could reuse the f_col1 column of df1 and map the values of df2.col1 after setting col1 as the index of df1:
In [265]: df2.col1.map(df1.set_index('col1').f_col1)
Out[265]:
0 0
1 1
2 3
3 4
Details
In [266]: df2['f_col1'] = df2.col1.map(df1.set_index('col1').f_col1)
In [267]: df2
Out[267]:
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
In case df1 has duplicate records, drop them using drop_duplicates:
In [290]: df1
Out[290]:
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
In [291]: df2.col1.map(df1.drop_duplicates().set_index('col1').f_col1)
Out[291]:
0 0
1 1
2 3
3 4
Name: col1, dtype: int32
You want to get unique values across both sets of data. Then create a series or a dictionary. This is your factorization that can be used across both data sets. Use map to get the output you are looking for.
import numpy as np

u = np.unique(np.append(df1.col1.values, df2.col1.values))
f = pd.Series(range(len(u)), u)  # this is the factorization
Assign with map
df1['f_col1'] = df1.col1.map(f)
df2['f_col1'] = df2.col1.map(f)
print(df1)
col1 f_col1
0 A 0
1 A 0
2 B 1
3 C 2
4 D 3
5 E 4
print(df2)
col1 f_col1
0 A 0
1 B 1
2 D 3
3 E 4
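Another option, if you store the uniques array that pd.factorize returns for df1, is to encode later data with pd.Categorical using those values as fixed categories; the codes then match df1, and any value unseen in df1 gets code -1. A minimal sketch of this idea:
import pandas as pd

df1 = pd.DataFrame({'col1': ["A", "A", "B", "C", "D", "E"]})
df2 = pd.DataFrame({'col1': ["A", "B", "D", "E"]})

# Factorize df1 once and keep the uniques for later
df1['f_col1'], uniques = pd.factorize(df1['col1'])

# When df2 becomes available, reuse the stored categories
df2['f_col1'] = pd.Categorical(df2['col1'], categories=uniques).codes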

pandas - number of unique rows occurrences in dataframe

How can I count the number of occurrences of each unique row in a DataFrame?
data = {'x1': ['A','B','A','A','B','A','A','A'], 'x2': [1,3,2,2,3,1,2,3]}
df = pd.DataFrame(data)
df
x1 x2
0 A 1
1 B 3
2 A 2
3 A 2
4 B 3
5 A 1
6 A 2
7 A 3
And I would like to obtain
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
IIUC you can group by both columns and take the size of each group; reset_index turns the group keys back into columns and names the count column:
In [100]:
df.groupby(['x1','x2']).size().reset_index(name='count')
Out[100]:
x1 x2 count
0 A 1 2
1 A 2 3
2 A 3 1
3 B 3 2
You could also drop duplicated rows:
In [4]: df.shape[0]
Out[4]: 8
In [5]: df.drop_duplicates().shape[0]
Out[5]: 4
There are two ways you can find the unique rows in your dataframe.
1st: Using drop_duplicates
df.drop_duplicates().sort_values('x1',ignore_index=True)
2nd: Using groupby.nunique
df.groupby(['x1','x2'], as_index=False).nunique()
For finding the number of occurrences, the answer from @EdChum will work precisely.
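On newer pandas (1.1+), DataFrame.value_counts gives the same counts in one line; a minimal sketch (sort_index only serves to match the ordering shown above):
import pandas as pd

df = pd.DataFrame({'x1': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'A'],
                   'x2': [1, 3, 2, 2, 3, 1, 2, 3]})

counts = df.value_counts().sort_index().reset_index(name='count')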
