Pandas duplicate equivalent to isna sum - python

I have an output that counts the number of NA values in my dataframe using this logic:
df.isna().sum()
col1 8
col2 0
I would like the same thing, but for duplicates, although I don't see a whole-DataFrame approach to this - only column by column.
How can I leverage something like
df.duplicated().any().sum()
without specifying column by column, like df['col1'].duplicated().any().sum()?
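For reference, a minimal whole-frame sketch (not an answer from this thread, just one possible reading of the question): apply duplicated per column and sum, mirroring isna().sum().
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": [3, 4, 5]})

# Per-column count of duplicated values, analogous to df.isna().sum()
dup_counts = df.apply(lambda s: s.duplicated().sum())
print(dup_counts)
# col1    1
# col2    0
# dtype: int64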

Related

How to subtract all column values of two PySpark dataframe?

Hi, I have a case where I need to subtract all the column values between two PySpark dataframes, like this:
df1:
col1 col2 ... col100
1 2 ... 100
df2:
col1 col2 ... col100
5 4 ... 20
And I want to get the final dataframe with df1 - df2:
new df:
col1 col2 ... col100
-4 -2 ... 80
I found that a possible solution is to subtract two columns like:
new_df = df1.withColumn('col1', df1['col1'] - df2['col1'])
But I have 101 columns; how can I simply traverse the whole thing and avoid writing 101 similar pieces of logic?
Any answers are super appreciated!
For 101 columns, how do I simply traverse all the columns and subtract their values?
You can create a for loop to iterate over the columns and create new columns in the dataframe with the subtracted values. Here's one way to do it in PySpark:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col, df1[col] - df2[col])
This will create a new dataframe with the subtracted values for each column.
Edit (to address @Kay's comments):
The error you're encountering is due to a duplicate column name in the output dataframe. You can resolve this by using a different name for the new columns in the output dataframe. Try using the alias method together with withColumn:
columns = df1.columns
for col in columns:
    df1 = df1.withColumn(col + "_diff", df1[col] - df2[col]).alias(col)
That way you will add a suffix "_diff" to the new columns in the output dataframe to avoid the duplicate column name issue.
Within a single select, using a Python list comprehension:
columns = df1.columns
df1 = df1.select(*[(df1[col] - df2[col]).alias(col) for col in columns])
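A self-contained sketch of the single-select idea (not from the thread): in Spark, the two frames generally need to be lined up by a join before columns from both can appear in one select, so this hypothetical example assumes a shared key column named id.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 1, 2)], ["id", "col1", "col2"])
df2 = spark.createDataFrame([(1, 5, 4)], ["id", "col1", "col2"])

value_cols = [c for c in df1.columns if c != "id"]
joined = df1.alias("a").join(df2.alias("b"), on="id")

# Subtract every value column of df2 from df1 in a single select
diff = joined.select(
    "id",
    *[(F.col("a." + c) - F.col("b." + c)).alias(c) for c in value_cols]
)
diff.show()  # expect col1 = -4, col2 = -2 for this toy data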

How to merge rows by same value in different columns using Python (Pandas)

I have a data frame, something like this:
Id  Col1  Col2  Paired_Id
1   a           A
2   c           B
A         b     1
B         d     2
I would like to merge the rows to get output something like this, deleting the paired row after merging.
Id Col1 Col2 Paired_Id
1 a b A
2 c d B
Any hint?
So:
Merging rows (Id) with their Paired_Id entries.
Is this possible with Pandas?
Assuming NaNs in the empty cells, I would use a groupby.first with a frozenset of the two IDs as grouper:
group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)
out = df.groupby(group, as_index=False).first()
Output:
Id Col1 Col2 Paired_Id
0 1 a b A
1 2 c d B
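For completeness, a runnable version of the snippet above (a sketch, assuming NaN in the empty cells as stated):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Id":        ["1", "2", "A", "B"],
    "Col1":      ["a", "c", np.nan, np.nan],
    "Col2":      [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

# Each row and its partner share the same frozenset {Id, Paired_Id}, so
# groupby(...).first() collapses the pair, keeping the first non-null value per column.
group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)
out = df.groupby(group, as_index=False).first()
print(out)
#   Id Col1 Col2 Paired_Id
# 0  1    a    b         A
# 1  2    c    d         B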
I don't have a lot of information about the structure of your dataframe, so I will just assume a few things - please correct me if I'm wrong:
A line with an entry in Col1 will never have an entry in Col2.
Corresponding lines appear in the same sequence (lines 1, 2, 3..., then the corresponding lines 1, 2, 3...).
Every line has a corresponding second line later on in the dataframe.
If all those assumptions are correct, you could split your data into two dataframes: df_upperhalf containing the Col1 values, df_lowerhalf the Col2 values.
df_upperhalf = df.iloc[:len(df.index)//2]
df_lowerhalf = df.iloc[-(len(df.index)//2):]
Then you can easily combine those values:
df_combined = df_upperhalf.copy()
df_combined['Col2'] = df_lowerhalf['Col2'].values  # assign by position, not by index
If some of my assumptions are incorrect, this will of course not produce the results you want.
There are also quite a few ways to do it in fewer lines of code, but I think this way you end up with nicer dataframes and the code should be easily readable.
Edit:
I think this would be quite a bit faster:
df_upperhalf = df.head(len(df.index)//2)
df_lowerhalf = df.tail(len(df.index)//2)
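Putting the pieces together, a hypothetical end-to-end sketch of this split-and-combine approach, under the assumptions listed above (sample data as in the question):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Id":        ["1", "2", "A", "B"],
    "Col1":      ["a", "c", np.nan, np.nan],
    "Col2":      [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

half = len(df.index) // 2
df_upperhalf = df.head(half)
df_lowerhalf = df.tail(half)

df_combined = df_upperhalf.copy()
# .values aligns by position rather than by index
df_combined["Col2"] = df_lowerhalf["Col2"].values
print(df_combined)
#   Id Col1 Col2 Paired_Id
# 0  1    a    b         A
# 1  2    c    d         B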

Appending only rows that are not yet in a pandas dataframe

I have the same dataset but over different weeks (so later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week because some rows get deleted over the weeks.
I tried to use the following code, but somehow my final_info dataframe still contains some non-unique values:
final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
    df_diff = pd.concat([data[week]['all_info'], final_info]).drop_duplicates(subset='project_slug', keep=False)
    final_info = final_info.append(df_diff).reset_index(drop=True)
Does somebody see where it goes wrong?
If I understand your question, you are just trying to add the unique rows from one dataframe to another dataframe. I don't think there is any need to iterate through the keys like you are doing. There is an example on another question that I think can help you, and I think it is conceptually easier to follow. I'll try to walk through an example to make it more clear.
So if you have a dataframe A:
col1 col2
1 2
2 3
3 4
and a dataframe B:
col1 col2
1 2
2 3
6 4
These two dataframes have the same first two rows but different last rows. If you wanted to get all the unique rows into one dataframe, you could first get all the unique rows from just one of the dataframes. So for this example you could get the unique row in dataframe B; let's call it df_diff. The code to do this would be:
df_diff = B[~B.col1.isin(A.col1)]
output: col1 col2
6 4
The line of code above makes what's called a boolean mask and then negates it using ~, so that you get all rows in dataframe B where the col1 value is not in dataframe A.
You could then merge this dataframe, df_diff, with the first dataframe A. We can call this df_full. This step is done with:
df_full = pd.concat([A, df_diff], ignore_index=True)
The ignore_index=True just resets the index of the resulting dataframe. This will give you:
col1 col2
1 2
2 3
3 4
6 4
Now the above dataframe has the new row from dataframe B plus the original rows from dataframe A.
I think this would work for your situation and may be fewer lines of code.
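As a compact, hedged sketch of the whole approach on the toy frames above:
import pandas as pd

A = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
B = pd.DataFrame({"col1": [1, 2, 6], "col2": [2, 3, 4]})

# Rows of B whose col1 value does not appear in A
df_diff = B[~B.col1.isin(A.col1)]

# All of A plus only the new rows from B, with a fresh index
df_full = pd.concat([A, df_diff], ignore_index=True)
print(df_full)
#    col1  col2
# 0     1     2
# 1     2     3
# 2     3     4
# 3     6     4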

How to add a pandas Series to a DataFrame ignoring indices?

I have a DataFrame with random, unsorted row indices, which is a result of removing some 'noise' from the original DataFrame.
row_index col1 col2
2 1 2
19 3 4
432 4 1
I would like to add a pd.Series to this DataFrame. The Series has its index sorted from 0 to n, where n is the number of rows, which equals the number of rows in the DataFrame.
Having tried multiple ways of adding the Series to my DataFrame, I realized that the data from the Series gets mixed up because (I believe) pandas is matching records by their indices.
Is there a way I can add the Series to the DataFrame, ignoring the indices, so that my data doesn't get mixed up?
Convert the Series into a DataFrame first, then concatenate:
df = pd.DataFrame(df)  # here df starts out as the Series you want to add
result = pd.concat([df1, df], axis=1, ignore_index=True)
df1 is the DataFrame you want to add to; df is the Series converted to a DataFrame.
Alternatively, assign the underlying values directly, which bypasses index alignment:
df['new_col'] = other_df['column'].values
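A small sketch contrasting the two behaviours (the index and column names here are illustrative, not from the question):
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 4], "col2": [2, 4, 1]}, index=[2, 19, 432])
s = pd.Series([10, 20, 30])  # index 0, 1, 2

# Plain assignment would align on the index and produce NaN here;
# .values assigns by position instead.
df["new_col"] = s.values
print(df)
#      col1  col2  new_col
# 2       1     2       10
# 19      3     4       20
# 432     4     1       30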

Conditionally assign values from another column in a DataFrame

I'd like to replace zero values in a dataframe with the value found in the last column of each row. I can solve this with a for loop over the columns or the rows, but that didn't seem very pythonic to me.
In short, I have a dataframe like this:
col1 col2 col3 nonzero
1 2 0 10
1 0 3 20
and I'd like to do an operation like
df[df==0] = df.nonzero
so I'd get
col1 col2 col3 nonzero
1 2 10 10
1 20 3 20
This, however, does not work, as df == 0 is itself a DataFrame of True/False values. How can this be done?
One option is to use the apply method to loop through the rows of the data frame and replace zeros with the last element of each row:
df.apply(lambda row: row.where(row != 0, row.iat[-1]), axis=1)
You can also modify the data frame in place:
df[df == 0] = (df == 0).mul(df.nonzero, axis=0)
which yields the same result as above. In this method, (df == 0).mul(df.nonzero, axis=0) creates a data frame that holds the value from the nonzero column wherever the original entry is zero, and zero everywhere else; combined with boolean indexing and assignment, this conditionally modifies the zero entries in the original data frame.
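A minimal runnable sketch of the in-place variant, using the question's data:
import pandas as pd

df = pd.DataFrame({"col1": [1, 1], "col2": [2, 0], "col3": [0, 3],
                   "nonzero": [10, 20]})

# (df == 0).mul(df["nonzero"], axis=0) holds the nonzero value where df is 0
# and 0 everywhere else; the boolean-mask assignment then copies those values in.
df[df == 0] = (df == 0).mul(df["nonzero"], axis=0)
print(df)
#    col1  col2  col3  nonzero
# 0     1     2    10       10
# 1     1    20     3       20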
