I'd like to replace zero values in a dataframe with the value found in the last column of each row. I can solve this with a for loop over the columns or the rows, but that didn't seem very pythonic to me.
In short, I have a dataframe like this:
col1 col2 col3 nonzero
1 2 0 10
1 0 3 20
and I'd like to do an operation like
df[df==0] = df.nonzero
so I'd get
col1 col2 col3 nonzero
1 2 10 10
1 20 3 20
This, however, does not work, as df == 0 is itself a DataFrame of True/False values. How can this be done?
One option is to use the apply method to loop through the rows of the data frame and replace zeros with the last element of each row:
df.apply(lambda row: row.where(row != 0, row.iat[-1]), axis=1)
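On the sample frame this returns:
   col1  col2  col3  nonzero
0     1     2    10       10
1     1    20     3       20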
You can also modify the data frame in place:
df[df == 0] = (df == 0).mul(df['nonzero'], axis=0)
This yields the same result as above. Here, (df == 0).mul(df['nonzero'], axis=0) creates a data frame whose zero entries are replaced by the values in the nonzero column and whose other entries are zero. Combined with boolean indexing and assignment, this conditionally modifies the zero entries in the original data frame:
(df == 0).mul(df['nonzero'], axis=0)
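which, for the sample frame, evaluates to:
   col1  col2  col3  nonzero
0     0     0    10        0
1     0    20     0        0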
I have the following dataframe:
ID col1 col2 col3
0 ['a','b'] ['d','c'] ['e','d']
1 ['s','f'] ['f','a'] ['d','aaa']
Give an input string = 'a'
I want to receive a dataframe like this:
ID col1 col2 col3
0 1 0 0
1 0 1 0
I see how to do it with a for loop, but that takes forever, and there must be a method I'm missing.
Processing columns of lists is not vectorized in pandas, so performance is worse than with scalar values.
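For reference, the sample frame can be reconstructed like this (the exact constructor is an assumption; the list values come from the question):
import pandas as pd

df = pd.DataFrame({
    'ID': [0, 1],
    'col1': [['a', 'b'], ['s', 'f']],
    'col2': [['d', 'c'], ['f', 'a']],
    'col3': [['e', 'd'], ['d', 'aaa']],
})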
A first idea is to reshape the list columns into a Series with DataFrame.stack, unpack the lists into scalars with Series.explode, compare against 'a', test for a match per ID/column pair by grouping on the first two index levels with any, and finally reshape back, converting the boolean mask to integers:
df1 = df.set_index('ID').stack().explode().eq('a').groupby(level=[0, 1]).any().unstack().astype(int)
print(df1)
col1 col2 col3
ID
0 1 0 0
1 0 1 0
Alternatively, you can use DataFrame.applymap for element-wise testing with a lambda function and the in operator:
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
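On pandas 2.1+, applymap is deprecated in favor of the equivalent DataFrame.map; a sketch of the same logic:
df1 = df.set_index('ID').map(lambda x: 'a' in x).astype(int)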
Or build a DataFrame from each list column, so you can test for 'a' with DataFrame.any:
f = lambda x: pd.DataFrame(x.tolist(), index=x.index).eq('a').any(axis=1)
df1 = df.set_index('ID').apply(f).astype(int)
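For col1, for example, the helper first builds this intermediate frame from the sample data:
    0  1
ID
0   a  b
1   s  f
and .eq('a').any(axis=1) then reduces it to the boolean column [True, False].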
I have an output that counts the number of NA values in my dataframe using this logic:
df.isna().sum()
col1 8
col2 0
and I would like the same thing, but for duplicates, although I don't see a whole-df approach to this, only column by column.
How can I leverage something like
df.duplicated().any().sum()
Without specifying column by column like df['col1'].duplicated().any().sum()
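A minimal sketch of a whole-frame version, applying Series.duplicated down every column at once:
df.apply(lambda col: col.duplicated().sum())
This returns one duplicate count per column, in the same shape as the df.isna().sum() output above.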
Suppose I have the following data frame.
X = pd.DataFrame([["A","Z"],["A","Z"],["A","Z"],["B","Y"],["B","Y"]],columns=["COL1","COL2"])
COL1 contains three A's and two B's; COL2 contains three Z's and two Y's.
What I'm trying to do is search each column and find the rows where a value appears fewer than i times (e.g. in this case, search each column and find the rows with fewer than 3 occurrences).
In this case I have a bunch of duplicate entries but it's just presented like that for simplicity.
Link to my previous question:
Pandas: How do I loop through and remove rows where a column has a single entry
Please let me know if clarification is needed.
You can use the subset and keep=False parameters of duplicated:
X = X[X.duplicated(subset=list(X.columns), keep=False)]
output (keep=False marks every row that has a duplicate, so all five rows survive here):
  COL1 COL2
0    A    Z
1    A    Z
2    A    Z
3    B    Y
4    B    Y
You can do
i=3
X[X.groupby(X.columns.tolist()).COL1.transform('count') >= i]
COL1 COL2
0 A Z
1 A Z
2 A Z
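To instead keep the rows that appear fewer than i times, as the question asks, flip the comparison:
X[X.groupby(X.columns.tolist()).COL1.transform('count') < i]
  COL1 COL2
3    B    Y
4    B    Y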
I have the following input:
col1 col2 col3
1 4 0
0 12 2
2 12 4
3 2 1
I want to sort the DataFrame according to the values in the columns, e.g. sorting it primarily for df[df==0].count() and secondarily for df.sum() would produce the output:
col2 col3 col1
4 0 1
12 2 0
12 4 2
2 1 3
pd.DataFrame.sort() takes a column object as its argument, which does not apply here, so how can I achieve this?
Firstly, I think your zero count increases from left to right whereas your sum decreases, i.e. the two keys sort in opposite directions, so you may need to clarify that. You can get the number of zeros per column simply with (df == 0).sum().
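For the sample input that gives:
col1    1
col2    0
col3    1
dtype: int64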
To sort by a single aggregate, you can do something like:
col_order = (df == 0).sum().sort_values().index
df[col_order]
This sorts the series of aggregates by its values and the resulting index is the columns of df in the order you want.
To sort on two sets of values would be more awkward/tricky but you could do something like
aggs = pd.DataFrame({'zero_count': (df == 0).sum(), 'sum': df.sum()})
col_order = aggs.sort_values(['zero_count', 'sum']).index
df[col_order]
Note that sort_values takes an ascending parameter, which accepts either a single boolean or a list of booleans of the same length as the number of keys you are sorting on, e.g.
df.sort_values(['a', 'b'], ascending=[True, False])
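Putting it together for the sample input (the ascending flags are an assumption chosen to match the expected output):
aggs = pd.DataFrame({'zero_count': (df == 0).sum(), 'sum': df.sum()})
# ascending zero count, then descending sum
col_order = aggs.sort_values(['zero_count', 'sum'], ascending=[True, False]).index
print(df[col_order])  # column order: col2, col3, col1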
I'm creating a Pandas DataFrame to store data. Unfortunately, I can't know the number of rows of data that I'll have ahead of time. So my approach has been the following.
First, I declare an empty DataFrame.
df = DataFrame(columns=['col1', 'col2'])
Then, I append a row of missing values.
df = df.append([None] * 2, ignore_index=True)
Finally, I can insert values into this DataFrame one cell at a time. (Why I have to do this one cell at a time is a long story.)
df['col1'][0] = 3.28
This approach works perfectly fine, except that the append statement inserts an additional column into my DataFrame. At the end of the process, the output I see when I type df looks like this (with 100 rows of data):
<class 'pandas.core.frame.DataFrame'>
Data columns (total 2 columns):
0 0 non-null values
col1 100 non-null values
col2 100 non-null values
df.head() looks like this.
0 col1 col2
0 None 3.28 1
1 None 1 0
2 None 1 0
3 None 1 0
4 None 1 1
Any thoughts on what is causing this 0 column to appear in my DataFrame?
The append is trying to append a column to your dataframe. The column it is trying to append is not named and has two None/NaN elements in it, which pandas will (by default) place in a column named 0.
In order to do this successfully, the column names of the object passed to append must be consistent with the current data frame's column names, or else new columns will be created (by default):
# you need to explicitly name the columns of the incoming parameter in the append statement
df = DataFrame(columns=['col1', 'col2'])
print(df.append(Series([None] * 2, index=['col1', 'col2']), ignore_index=True))

# as an aside
df = DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
dfRowImproper = [1, 2, 3, 4]
# dfRowProper = DataFrame(np.arange(4) + 1, columns=['A', 'B', 'C', 'D'])  # will not work! np.arange returns a vector, whereas DataFrame expects a matrix/array
dfRowProper = DataFrame([np.arange(4) + 1], columns=['A', 'B', 'C', 'D'])  # will work

print(df.append(dfRowImproper))  # will create the 0-named column, with 4 additional rows defined in that column
print(df.append(dfRowProper))  # will work as you would like, since the column names are consistent
print(df.append(DataFrame(np.random.randn(1, 4))))  # will define four additional columns (0-3) and one additional row
print(df.append(Series(dfRowImproper, index=['A', 'B', 'C', 'D']), ignore_index=True))  # works as you want
You could use a Series for row insertion:
df = pd.DataFrame(columns=['col1', 'col2'])
df = df.append(pd.Series([None] * 2, index=['col1', 'col2']), ignore_index=True)
df["col1"][0] = 3.28
df looks like:
col1 col2
0 3.28 NaN
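As an aside, DataFrame.append was removed in pandas 2.0; a minimal sketch of the same flow with pd.concat:
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
# append one row of missing values, keeping the column names consistent
df = pd.concat([df, pd.DataFrame([[None, None]], columns=['col1', 'col2'])], ignore_index=True)
df.loc[0, 'col1'] = 3.28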