I want to filter out of my df the rows whose value in column A appears less frequently than some threshold. I am currently using a trick with two value_counts() calls. To explain what I mean:
df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])
'''
A B C
0 1 2 3
1 1 4 5
2 6 7 8
'''
I want to remove any row whose value in column A appears fewer than 2 times in that column. I currently do this:
df = df[df['A'].isin(df.A.value_counts()[df.A.value_counts() >= 2].index)]
Does Pandas have a method to do this which is cleaner than having to call value_counts() twice?
It's probably easiest to filter by group size, with the groups formed on column A.
df.groupby('A').filter(lambda x: len(x) >= 2)
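A possible alternative, sketched here as an assumption rather than as part of the answer above, is to compute each row's group size with transform('size') and use it as a boolean mask. This avoids calling a Python lambda per group. The names threshold, group_sizes and filtered are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])

# Size of the group each row belongs to, aligned with df's index.
group_sizes = df.groupby('A')['A'].transform('size')

# Keep only rows whose 'A' value occurs at least `threshold` times.
threshold = 2  # illustrative name for the cutoff
filtered = df[group_sizes >= threshold]
print(filtered)
```

Because the mask shares df's index, the original row order and labels are preserved.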
I am running an experiment to observe the impact of missing values on query results, using Python pandas. Suppose I have a dataframe df that holds the complete data. My real data has many columns and thousands of rows.
I made a copy of df called df_copy. I run the experiment on df_copy and treat df as the ground truth. I put NaN values into df_copy at random positions.
I have some ideas for fixing the missing values in df_copy using heuristics. Currently I can do this easily with row operations in pandas. For instance, if I want to fix a row in df_copy, I can look up the row by its id in df_copy, drop it, and replace it with the row from df.
My question is: how can I do a cell-based operation in pandas? For instance, how can I get the (x, y) index of every missing value so that, when I want to fix a missing cell, I can just replace the value in that cell with the one from the ground truth at the same (x, y) index?
Example:
df
df = pd.DataFrame([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]],
                  columns=['a', 'b', 'c'])
   a  b  c
0  x  2  3
1  y  5  6
2  z  8  9
df_copy
# Built from plain lists: wrapping the mixed values in np.array would turn np.nan into the string 'nan'
df_copy = pd.DataFrame([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]],
                       columns=['a', 'b', 'c'])
     a    b    c
0    x  NaN  3.0
1    y  5.0  NaN
2  NaN  8.0  9.0
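One way this could be approached is sketched below. It assumes the frames hold real NaN values, so they are built from plain lists here, since np.array would coerce the mixed values to strings. It also assumes df and df_copy share the same shape and ordering. The name missing is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]],
                  columns=['a', 'b', 'c'])
df_copy = pd.DataFrame([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]],
                       columns=['a', 'b', 'c'])

# Positional (row, column) pairs of every missing cell, in row-major order.
missing = [(int(r), int(c)) for r, c in np.argwhere(df_copy.isna().to_numpy())]
print(missing)

# Repair each cell from the ground truth, addressing it by position.
for r, c in missing:
    df_copy.iat[r, c] = df.iat[r, c]
print(df_copy)
```

The loop can be replaced by whatever heuristic decides which cells to repair. If every missing cell should simply be taken from the ground truth, df_copy.fillna(df) does that in one call.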
I want to fill a column of df2 (~100,000 rows) with values from the same column of df (~1,000,000 rows). df often contains several rows for the same 'A' value, some of them with wrong data, so I always want to take the first value of column 'C'.
df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
columns=['A', 'B', 'C'])
df2=pd.DataFrame([[100,0],[101,0]], columns=['A', 'C'])
for i in range(len(df2.index)):
    # My question: how do I set df2.loc[i, 'C'] to the first value of column 'C' in df
    # where column 'A' matches in both dataframes? E.g. the first value for 100
    # would be 2, and the first value for 101 would be 8.
    pass
In the end, my output should be a table like this:
df2=pd.DataFrame([[100,2],[101,8]], columns=['A', 'C'])
You can try this:
df2['C'] = df.groupby('A')['C'].first().values
Which will give you:
A C
0 100 2
1 101 8
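first() returns the first value of each group.
You then want to assign these values to the df2 column. Unfortunately, you cannot assign the result directly like this:
df2['C'] = df.groupby('A')['C'].first()
The grouped result is indexed by the values of 'A' (100, 101), while df2 is indexed 0, 1, so pandas finds no matching labels when it aligns the two, and the line above results in: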
A C
0 100 NaN
1 101 NaN
(You can read about the cause here: Adding new column to pandas DataFrame results in NaN)
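Note that assigning .values works by position, so it relies on df2 listing each 'A' value exactly once and in the same sorted order as the groups. A hedged alternative that matches on the key instead is to map df2['A'] through the grouped result. In this sketch df2 is deliberately given in a different order to show the difference, and first_c is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
                  columns=['A', 'B', 'C'])
# Deliberately not sorted by 'A', unlike the example above.
df2 = pd.DataFrame([[101, 0], [100, 0]], columns=['A', 'C'])

# Series indexed by the values of 'A', holding the first 'C' of each group.
first_c = df.groupby('A')['C'].first()

# Look up each row's 'A' value in that Series; unmatched keys would become NaN.
df2['C'] = df2['A'].map(first_c)
print(df2)
```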
I want to add a row to a pandas dataframe using df.loc[rowname] = s (where s is a Series).
However, I constantly get ValueError: cannot reindex from a duplicate axis.
I presume this is due to having duplicate column names in df as well as duplicate index names in s (the index of s is identical to df.columns).
However, when I try to reproduce this error on a small example, I don't get this error. What could the reason for this behavior be?
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
b=pd.DataFrame(columns=a.columns)
b.loc['mean'] = a.replace('',np.nan).mean(skipna=True)
print(b)
a b a
mean 3.0 3.0 6.0
I think duplicate column names should be avoided, because they can lead to weird errors.
It seems the index of the Series does not match the columns of the DataFrame:
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
a.loc['mean'] = pd.Series([2,5,4], index=list('abb'))
print(a)
ValueError: cannot reindex from a duplicate axis
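To see whether this is the situation in the real data, a quick diagnostic along the following lines might help. It is only a sketch: it compares the labels on both sides and lists the ones that repeat, without attempting the failing assignment. The variable names are illustrative.

```python
import pandas as pd

a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5]])
s = pd.Series([2, 5, 4], index=list('abb'))

# True only if both sides carry identical labels in the same order.
labels_match = s.index.equals(a.columns)

# Labels that occur more than once on each side.
dup_in_series = s.index[s.index.duplicated()].unique().tolist()
dup_in_columns = a.columns[a.columns.duplicated()].unique().tolist()

print(labels_match, dup_in_series, dup_in_columns)
```

If labels_match is False while either side contains duplicates, pandas has to align ambiguous labels, which is where the error appears to come from.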
One possible solution is to deduplicate the column names by renaming them:
s = a.columns.to_series()
a.columns = s.add(s.groupby(s).cumcount().astype(str).replace('0',''))
print(a)
a b a1
0 1 2 7
1 5 4 5
2
Or drop duplicated columns:
a = a.loc[:, ~a.columns.duplicated()]
print(a)
a b
0 1 2
1 5 4
2
Suppose you have a data frame
df = pd.DataFrame({'a':[1,2,3,4],'b':[2,4,6,8],'c':[2,4,5,6]})
and you want to replace specific values in columns 'a' and 'c' (but not 'b'). For example, replacing 2 with 20, and 4 with 40.
The following will not work since it is setting values on a copy of a slice of the DataFrame:
df[['a','c']].replace({2:20, 4:40}, inplace=True)
A loop will work:
for col in ['a','c']:
    # Assign back: inplace=True on df[col] may act on a temporary object in recent pandas
    df[col] = df[col].replace({2:20, 4:40})
But a loop seems inefficient. Is there a better way to do this?
According to the documentation on replace, you can specify a dictionary for each column:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [2, 4, 5, 6]})
lookup = {col : {2: 20, 4: 40} for col in ['a', 'c']}
df.replace(lookup, inplace=True)
print(df)
Output
a b c
0 1 2 20
1 20 4 40
2 3 6 5
3 40 8 6
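Another option, offered as a sketch rather than as the documented recommendation, is to run replace on the column subset and assign the result back to those columns. This avoids both inplace=True and the nested dictionary. The name cols is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [2, 4, 5, 6]})

cols = ['a', 'c']  # columns in which values should be replaced
# replace() returns a new frame; assigning it back updates only these columns.
df[cols] = df[cols].replace({2: 20, 4: 40})
print(df)
```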
I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe this method is not optimal, especially if I have many original columns besides a and b. Is there a shorter and/or more efficient way to achieve this? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
    a        b
   -1  0  1 -1  0  1
id
1   1  1  2  1  2  1
2   1  2  1  1  1  2
3   1  2  0  1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_count/fillna with something cleaner and more efficient, but at the moment it eludes me...
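One candidate for a cleaner formulation, not necessarily the trick alluded to above, is to melt the frame into long format and count with pd.crosstab, which fills absent combinations with 0. The sketch below also flattens the resulting MultiIndex columns into the a_neg / a_zero / a_pos names from the question. The names tidy, counts, suffix and flat are illustrative.

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
        'a': [-1, 1, 1, 0, 0, 0, -1, 1, -1, 0, 0],
        'b': [1, 0, 0, -1, 0, 1, 1, -1, -1, 1, 0]}
df = pd.DataFrame(data)

# Long format: one row per (id, original column name, value).
tidy = df.melt(id_vars='id')

# Count occurrences per id; combinations that never occur become 0.
counts = pd.crosstab(tidy['id'], [tidy['variable'], tidy['value']])

# Optional: flatten the column MultiIndex into names such as a_neg, a_zero, a_pos.
suffix = {-1: 'neg', 0: 'zero', 1: 'pos'}
flat = counts.copy()
flat.columns = [f"{col}_{suffix[int(val)]}" for col, val in counts.columns]
flat = flat.reset_index()
print(flat)
```

This scales to more value columns without extra code, since melt picks up every column other than id.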