Cell operations in Python Pandas

I am running an experiment to observe the impact of missing values on query results, using Python Pandas. Consider a dataframe df holding the complete data; my real data consists of many columns and thousands of rows.
I made a copy of df called df_copy. I run the experiment on df_copy, with df as the ground truth, after putting some NaN values into df_copy at random.
I have some heuristic ideas for fixing the missing values in df_copy. Currently I can do this easily with row operations in pandas: to fix any row in df_copy, I just look it up by id, drop it, and replace it with the corresponding row from df.
My question is: how can I do cell-based operations in pandas? For instance, how can I get the (x, y) index of every missing value, so that when I want to fix a missing cell I can replace its value with the ground truth simply by addressing that (x, y) index?
Example:

df:

df = pd.DataFrame(np.array([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]]),
                  columns=['a', 'b', 'c'])

   a  b  c
0  x  2  3
1  y  5  6
2  z  8  9

df_copy:

df_copy = pd.DataFrame(np.array([["x", np.nan, 3], ["y", 5, np.nan], [np.nan, 8, 9]]),
                       columns=['a', 'b', 'c'])

     a    b    c
0    x  nan    3
1    y    5  nan
2  nan    8    9
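A minimal sketch of the cell-level workflow described above: collect the (x, y) position of every missing cell with np.where, translate the positions to labels, and patch each cell from the ground truth. Note this sketch builds the frames with numeric dtypes rather than the mixed np.array from the example, since a mixed np.array would coerce np.nan into the string "nan" instead of a real missing value.

```python
import numpy as np
import pandas as pd

# Ground truth and a copy with some cells blanked out. Numeric dtypes are
# used so NaN is a genuine missing value.
df = pd.DataFrame({"a": ["x", "y", "z"], "b": [2.0, 5.0, 8.0], "c": [3.0, 6.0, 9.0]})
df_copy = df.copy()
df_copy.loc[0, "b"] = np.nan
df_copy.loc[1, "c"] = np.nan

# Positional (x, y) indices of every missing cell, translated to labels.
rows, cols = np.where(df_copy.isna())
missing = [(df_copy.index[r], df_copy.columns[c]) for r, c in zip(rows, cols)]

# Repair each missing cell from the ground truth by its (x, y) index.
for r, c in missing:
    df_copy.loc[r, c] = df.loc[r, c]
```

After the loop, df_copy matches the ground truth again; only the cells that were actually missing are touched.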

Related

Does Pandas Have an Alternative to This Syntax I'm Currently Using?

I want to filter my df down to only those rows whose value in column A appears at least some threshold number of times. I currently use a trick with two value_counts() calls. To explain what I mean:

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])
'''
   A  B  C
0  1  2  3
1  1  4  5
2  6  7  8
'''

I want to remove any row whose value in column A appears fewer than 2 times in that column. I currently do this:

df = df[df['A'].isin(df.A.value_counts()[df.A.value_counts() >= 2].index)]

Does Pandas have a method to do this that is cleaner than calling value_counts() twice?
It's probably easiest to filter by group size, where the groups are formed on column A:

df.groupby('A').filter(lambda x: len(x) >= 2)
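A runnable sketch of the groupby-filter answer, plus an equivalent transform-based variant that builds a boolean mask without a Python-level lambda (often faster on large frames):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])

# Keep rows whose value in 'A' occurs at least twice.
kept = df.groupby('A').filter(lambda g: len(g) >= 2)

# Equivalent alternative: transform('size') broadcasts each group's size
# back onto its rows, yielding a boolean mask aligned with df.
kept2 = df[df.groupby('A')['A'].transform('size') >= 2]
```

Both keep the two rows where A == 1 and drop the row where A == 6.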

Filter for rows in pandas dataframe where values in a column are greater than x or NaN

I'm trying to figure out how to filter a pandas dataframe so that the values in a certain column are either greater than a certain value or are NaN. Let's say my dataframe looks like this:
df = pd.DataFrame({"col1":[1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})
I've tried:
df = df[df["col2"] >= 5 | df["col2"] == np.nan]
and:
df = df[df["col2"] >= 5 | np.isnan(df["col2"])]
But the first causes an error, and the second excludes rows where the value is NaN. How can I get the result to be this:
pd.DataFrame({"col1":[2, 3, 4], "col2":[5, np.nan, 7]})
Try:

df[df.col2.isna() | df.col2.gt(4)]

   col1  col2
1     2   5.0
2     3   NaN
3     4   7.0

Also, you can fill NaN with the threshold before comparing:

df[df["col2"].fillna(5) >= 5]
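The first attempt in the question fails for two reasons worth spelling out: `|` binds more tightly than comparison operators, so each condition needs its own parentheses, and NaN never compares equal to anything (including itself), so `== np.nan` can never match. A sketch of the working filter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 5, np.nan, 7]})

# NaN never compares equal to itself, so `== np.nan` cannot detect it:
assert np.nan != np.nan

# `|` binds more tightly than `>=`, so each condition needs parentheses:
result = df[(df["col2"] >= 5) | df["col2"].isna()]
```

This keeps the rows with col2 equal to 5, NaN, and 7, matching the desired output.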

Filling a column with values from another dataframe

I want to fill the 'C' column of df2 (~100,000 rows) with values from the same column of df (~1,000,000 rows). df often contains the same row several times but with wrong data, so I always want to take the first value of my column 'C'.
df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
                  columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[100, 0], [101, 0]], columns=['A', 'C'])

for i in range(0, len(df2.index)):
    # My question: set df2.loc[i, 'C'] to the first value of the 'C' column of df
    # where the 'A' column matches in both dataframes. E.g. the first value for
    # 100 would be 2, and the first value for 101 would be 8.
In the end, my output should be a table like this:
df2=pd.DataFrame([[100,2],[101,8]], columns=['A', 'C'])
You can try this:
df2['C'] = df.groupby('A')['C'].first().values
Which will give you:

     A  C
0  100  2
1  101  8

first() returns the first value of every group.
Then you want to assign the values to the df2 column; unfortunately, you cannot assign the result directly like this:

df2['C'] = df.groupby('A')['C'].first()

because that line results in:

     A   C
0  100 NaN
1  101 NaN
(You can read about the cause here: Adding new column to pandas DataFrame results in NaN)
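The `.values` assignment in the answer works only because df2's rows happen to be in the same order as the groups. A map-based sketch aligns by key instead, so it also survives a reordered df2 (shown here with the rows deliberately reversed):

```python
import pandas as pd

df = pd.DataFrame([[100, 1, 2], [100, 3, 4], [100, 5, 6], [101, 7, 8], [101, 9, 10]],
                  columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[101, 0], [100, 0]], columns=['A', 'C'])  # note: reversed order

# first() yields a Series indexed by 'A'; map() looks each key up by label,
# so the result is correct regardless of df2's row order.
df2['C'] = df2['A'].map(df.groupby('A')['C'].first())
```

Each row of df2 now carries the first 'C' value for its own 'A' key.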

Indexing with pandas df.loc with duplicated columns - sometimes it works, sometimes it doesn't

I want to add a row to a pandas dataframe using df.loc[rowname] = s (where s is a Series).
However, I constantly get the ValueError: cannot reindex from a duplicate axis.
I presume this is due to having duplicate column names in df as well as duplicate index names in s (the index of s is identical to df.columns).
However, when I try to reproduce this error on a small example, I don't get it. What could be the reason for this behavior?
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
b = pd.DataFrame(columns=a.columns)
b.loc['mean'] = a.replace('', np.nan).mean(skipna=True)
print(b)

        a    b    a
mean  3.0  3.0  6.0
I think duplicated column names should be avoided, because they can cause weird errors. It seems there are non-matching values between the index of the Series and the columns of the DataFrame:
a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5], ['', '', '']])
a.loc['mean'] = pd.Series([2, 5, 4], index=list('abb'))
print(a)

ValueError: cannot reindex from a duplicate axis
One possible solution is to deduplicate the column names by renaming them:

s = a.columns.to_series()
a.columns = s.add(s.groupby(s).cumcount().astype(str).replace('0', ''))
print(a)

   a  b a1
0  1  2  7
1  5  4  5
2

Or drop the duplicated columns:

a = a.loc[:, ~a.columns.duplicated()]
print(a)

   a  b
0  1  2
1  5  4
2
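A runnable sketch of the rename approach. The suffixing logic here uses a plain list comprehension over the cumulative occurrence count, which sidesteps arithmetic on a Series carrying a duplicated index:

```python
import pandas as pd

a = pd.DataFrame(columns=['a', 'b', 'a'], data=[[1, 2, 7], [5, 4, 5]])

# Number repeated column names: the first occurrence keeps its name,
# later ones get a numeric suffix ('a', 'a1', 'a2', ...).
cols = pd.Series(a.columns)
occurrence = cols.groupby(cols).cumcount()
a.columns = [c if i == 0 else f"{c}{i}" for c, i in zip(cols, occurrence)]
```

With unique column names, row assignments via a.loc[rowname] = s no longer risk the duplicate-axis reindexing error.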

Why get different results when comparing two dataframes?

I am comparing two dataframes: .equals() gives me False, but if I append the two dataframes together and call drop_duplicates(), nothing is left over. Can someone explain this?
TL;DR
These are completely different operations and I'd have never expected them to produce the same results.
pandas.DataFrame.equals
Will return a boolean value depending on whether Pandas determines that the dataframes being compared are the "same". That means that the index of one is the "same" as the index of the other, the columns of one are the "same" as the columns of the other, and the data of one is the "same" as the data of the other.
See docs
It is NOT the same as pandas.DataFrame.eq which will return a dataframe of boolean values.
Setup
Consider these three dataframes
df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])
df2 = pd.DataFrame([[0, 1], [2, 3]], ['foo', 'bar'], ['A', 'B'])

  df0        df1          df2
   A  B       B  A           A  B
0  0  1    0  1  0    foo    0  1
1  2  3    1  3  2    bar    2  3
If we check whether df1 equals df0, we get
df0.equals(df1)
False
Even though all elements are the same
df0.eq(df1).all().all()
True
And that is because the columns are not aligned. If I sort the columns then ...
df0.equals(df1.sort_index(axis=1))
True
pandas.DataFrame.drop_duplicates
Compares the values in rows and doesn't care about the index.
So, both of these produce the same-looking result:
df0.append(df2).drop_duplicates()
and
df0.append(df1, sort=True).drop_duplicates()
A B
0 0 1
1 2 3
When I append (or use pandas.concat), Pandas aligns the columns and adds the appended dataframe as new rows. Then drop_duplicates does its thing. It is that inherent column alignment that accomplishes what I did above explicitly with sort_index and axis=1.
Maybe the rows in the two dataframes are not ordered the same way? Dataframes are equal only when the rows corresponding to the same index are the same.
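A minimal reproduction of both behaviors. Note that DataFrame.append, used above, was removed in pandas 2.0, so this sketch uses pd.concat, which performs the same column alignment:

```python
import pandas as pd

df0 = pd.DataFrame([[0, 1], [2, 3]], index=[0, 1], columns=['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], index=[0, 1], columns=['B', 'A'])

# equals() is label-order sensitive: same data, differently ordered columns.
assert not df0.equals(df1)
assert df0.equals(df1.sort_index(axis=1))

# concat aligns columns by label before stacking, so every row of df1
# duplicates a row of df0 and drop_duplicates removes them all.
combined = pd.concat([df0, df1]).drop_duplicates()
```

The concatenation's duplicate-free remainder is just df0 itself, which is why "appending then deduplicating" reported no difference even though equals() returned False.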
