Pandas - df.compare() how to change self/other labels? - python

Using df.compare in Pandas, is it possible to change the labels of self/other from the output?
I need to send this output directly to less technically savvy users and would like to change them to more descriptive labels.
My code:
if df_1.equals(df_2):
    return None
else:
    return df_1.compare(df_2, align_axis=0)

You can rename the index level to something more obvious:
df1 = pd.DataFrame([[1,2,3,4], [1,2,3,4]])
df2 = pd.DataFrame([[1,2,5,4], [5,2,3,1]])
df1.compare(df2, align_axis=0).rename(index={'self': 'left', 'other': 'right'}, level=-1)
           0    2    3
0 left   NaN  3.0  NaN
  right  NaN  5.0  NaN
1 left   1.0  NaN  4.0
  right  5.0  NaN  1.0
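The rename can also be folded into the function from the question. This is only a sketch: the function name describe_differences and the default labels "before" and "after" are made up for illustration, and it assumes the align_axis=0 layout shown above.

```python
import pandas as pd


def describe_differences(df_1, df_2, labels=("before", "after")):
    """Return None when the frames are equal, else the differences
    with the self/other index labels replaced by `labels`."""
    if df_1.equals(df_2):
        return None
    diff = df_1.compare(df_2, align_axis=0)
    # With align_axis=0 the self/other labels are the innermost index level
    return diff.rename(index={"self": labels[0], "other": labels[1]}, level=-1)


df1 = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4]])
df2 = pd.DataFrame([[1, 2, 5, 4], [5, 2, 3, 1]])
result = describe_differences(df1, df2)
print(result)
```

If you are on pandas 1.5 or later, compare() should also accept a result_names argument that sets both labels directly, which would make the rename step unnecessary.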

Related

How to filter multiple columns/rows in a python dataframe?

Okay so I have a dataframe called ski_data that has two columns named AdultWeekend and AdultWeekday which shows the price for the weekend and weekdays of each ski resort.
However, for some of the resorts, both price columns are NaN and I need to filter that out.
My approach was this:
Step 1)
Create a new Series that gives each ski resort a value of 0, 1, or 2 based on how many of its two prices are NaN. The code was:
missing_data = ski_data[['AdultWeekend', 'AdultWeekday']].isnull().sum(axis=1)
Step 2)
Create a counter variable and an empty list. Then iterate over missing_data and, if the value == 2, append the counter to the list, which keeps track of the matching index in the ski_data dataframe.
counter = 0
missingList = []
for x in missing_data:
    if x == 2:
        missingList.append(counter)
    counter += 1
Step 3)
Iterate over ski_data dataframe and drop location of each index that was appended to the list.
for i in missingList:
    ski_data.drop(labels=[i], axis=0, inplace=True)
However, I get multiple errors. One of them involves the first index label appended to missingList, which is 98, yet ski_data.loc[98] raises a KeyError. Can anyone explain or help me?
https://github.com/seungsooim32/stackoverflow/blob/main/02_data_wrangling.ipynb
code task #28
You can try:
ski_data.dropna(how="all", subset=['AdultWeekend', 'AdultWeekday'])
This returns a new dataframe without the rows where both columns are NaN. If you want to modify the existing dataframe, add the argument inplace=True.
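A minimal runnable sketch of that call, using a small made-up ski_data (the five rows below are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up sample: row 4 is the only one with both prices missing
ski_data = pd.DataFrame({
    "AdultWeekend": [10.0, np.nan, 13.0, 20.0, np.nan],
    "AdultWeekday": [15.0, 12.0, np.nan, 19.0, np.nan],
})
price_cols = ["AdultWeekend", "AdultWeekday"]

# how="all" drops a row only when every column in `subset` is NaN
cleaned = ski_data.dropna(how="all", subset=price_cols)
print(cleaned)
```

Rows with only one missing price are kept, and ski_data itself is left unchanged.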
Don't drop NaN; keep the right rows instead:
>>> ski_data
   AdultWeekend  AdultWeekday
0          10.0          15.0
1           NaN          12.0
2          13.0           NaN
3          20.0          19.0
4           NaN           NaN
>>> ski_data.loc[ski_data[['AdultWeekend', 'AdultWeekday']].notna().any(axis=1)]
   AdultWeekend  AdultWeekday
0          10.0          15.0
1           NaN          12.0
2          13.0           NaN
3          20.0          19.0
>>> ski_data.loc[~ski_data[['AdultWeekend', 'AdultWeekday']].isna().all(axis=1)]
   AdultWeekend  AdultWeekday
0          10.0          15.0
1           NaN          12.0
2          13.0           NaN
3          20.0          19.0
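As for the KeyError in the question: a likely cause (an assumption, since the notebook is not reproduced here) is that counter is a position, while drop() and .loc work with index labels. The two only agree when the index is exactly 0..n-1. A sketch with a made-up frame whose index has gaps, taking the labels from missing_data itself instead of counting:

```python
import numpy as np
import pandas as pd

# Made-up data with a gappy index, as you would get after earlier row drops
ski_data = pd.DataFrame(
    {"AdultWeekend": [10.0, np.nan, 20.0, np.nan],
     "AdultWeekday": [15.0, 12.0, 19.0, np.nan]},
    index=[0, 1, 3, 7],
)
price_cols = ["AdultWeekend", "AdultWeekday"]
missing_data = ski_data[price_cols].isnull().sum(axis=1)

# Take the index labels (not positions) of rows missing both prices
labels_to_drop = missing_data.index[missing_data == 2]
cleaned = ski_data.drop(index=labels_to_drop)
print(cleaned)
```

Here the row to drop sits at position 3 but carries label 7, so a positional counter would point at the wrong label.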

Replace all NaN values with value from other column

I have the following dataframe:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
I want to do a ffill() on column B with df["B"].ffill(inplace=True) which results in the following df:
     A    B    C    D
0  NaN  2.0  NaN  0.0
1  3.0  4.0  NaN  1.0
2  NaN  4.0  5.0  NaN
3  NaN  3.0  NaN  4.0
Now I want to replace all NaN values with their corresponding value from column B. The documentation states that you can give fillna() a Series, so I tried df.fillna(df["B"], inplace=True). This results in the exact same dataframe as above.
However, if I put in a simple value (e.g. df.fillna(0, inplace=True)), then it does work:
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  4.0  5.0  0.0
3  0.0  3.0  0.0  4.0
The funny thing is that the fillna() does seem to work with a Series as value parameter when operated on another Series object. For example, df["A"].fillna(df["B"], inplace=True) results in:
     A    B    C    D
0  2.0  2.0  NaN  0.0
1  3.0  4.0  NaN  1.0
2  4.0  4.0  5.0  NaN
3  3.0  3.0  NaN  4.0
My real dataframe has a lot of columns and I would hate to manually fillna() all of them. Am I overlooking something here? Didn't I understand the docs correctly perhaps?
EDIT I have clarified my example in such a way that 'ffill' with axis=1 does not work for me. In reality, my dataframe has many, many columns (hundreds) and I am looking for a way to not have to explicitly mention all the columns.
Try changing the axis to 1 (columns):
df = df.ffill(axis=1).bfill(axis=1)
If you need to specify the columns, you can do something like this:
df[["B","C"]] = df[["B","C"]].ffill(axis=1)
EDIT:
Since you need something more general and df.fillna(df.B, axis = 1) is not implemented yet, you can try with:
df = df.T.fillna(df.B).T
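Or in place, though this only changes df when df.T happens to return a view of the same data, so the assignment above is the safer form:
df.T.fillna(df.B, inplace=True)
This works because the index of df.B coincides with the columns of df.T, so pandas knows which value to use for each column. From the docs: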
value: scalar, dict, Series, or DataFrame.
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
So, for example, the NaN in column 0 at row A (in df.T) will be replaced by the value with index 0 in df.B.
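Put together on the example frame from the question, the transpose approach looks roughly like this. The apply() variant is an alternative I am adding for comparison, not something from the docs quote above; it fills each column separately, relying on Series.fillna aligning on the row index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))
df["B"] = df["B"].ffill()

# After transposing, the columns are the original row labels 0..3,
# so the Series df["B"] (indexed 0..3) supplies one fill value per column.
filled = df.T.fillna(df["B"]).T

# Alternative sketch: fill column by column, no transpose needed
filled_alt = df.apply(lambda col: col.fillna(df["B"]))
print(filled)
```

Both variants avoid naming the columns explicitly, which is what matters with hundreds of them. Keep in mind that transposing a frame with mixed dtypes upcasts everything to object, so the apply() form may be the better fit there.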

Pandas update doesn't do anything

I have a dataframe with some information on it. I created another dataframe that is larger and has default values in it. I want to update the default dataframe with the values from the first dataframe. I'm using df.update but nothing is happening. Here is the code:
new_df = pd.DataFrame(index=range(25))
new_df['Column1'] = 1
new_df['Column2'] = 2
new_df.update(old_df)
Here, old_df has 2 rows, indexed 5,6 with some random values in Column1 and Column2 and nothing else. I'm expecting these rows to overwrite the default values in new_df, what am I doing wrong?
This works for me, so I assume the problem is in the part of the code you haven't shown us.
import pandas as pd
import numpy as np
new_df = pd.DataFrame(index=range(25))
old_df = pd.DataFrame(index=[5,6])
new_df['Column1'] = 1
new_df['Column2'] = 2
old_df['Column1'] = np.nan
old_df['Column2'] = np.nan
old_df.loc[5,'Column1'] = 9
old_df.loc[6,'Column2'] = 7
new_df.update(old_df)
print(new_df.head(10))
Output:
   Column1  Column2
0      1.0      2.0
1      1.0      2.0
2      1.0      2.0
3      1.0      2.0
4      1.0      2.0
5      9.0      2.0
6      1.0      7.0
7      1.0      2.0
8      1.0      2.0
9      1.0      2.0
Since you don't show how you construct or get old_df, make sure before the update that both indexes have the same type:
new_df.index = new_df.index.astype('int64')
old_df.index = old_df.index.astype('int64')
An int is not equal to a string: 1 != '1'. If the index types differ, update() finds no common rows in your dataframes and has nothing to do.
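To see the effect, here is a small sketch with a hypothetical old_df whose index holds strings (as might happen when the data comes from a file). The values 9, 8, 7, 6 are made up.

```python
import pandas as pd

new_df = pd.DataFrame({"Column1": 1.0, "Column2": 2.0}, index=range(25))

# Hypothetical old_df: same row numbers, but stored as strings
old_df = pd.DataFrame({"Column1": [9.0, 8.0], "Column2": [7.0, 6.0]},
                      index=["5", "6"])

before = new_df.copy()
new_df.update(old_df)                 # '5' and '6' match no integer label
unchanged = new_df.equals(before)     # so nothing was overwritten

old_df.index = old_df.index.astype("int64")
new_df.update(old_df)                 # now rows 5 and 6 line up
print(new_df.loc[4:7])
```

Comparing new_df.index.dtype and old_df.index.dtype before calling update() is a quick way to spot this kind of mismatch.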

python- flagging a second set of items in a series

I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400 and may or may not be repeated, and a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that finds the first set of 1-1400 and records a '1' at the same index and, if a second set of 1-1400 exists, marks that down as a '2' in the new column.
I can think of some roundabout ways using temporary placeholders and some other kind of checks, but I was wondering if there was a 1-3 liner to do this operation.
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# set np.nan as it is
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment of 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
You can use diff to check when the difference between two consecutive values is negative, which marks the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# create a dataframe with two columns; my ranges only go up to 12, but 1400 works the same way
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have NaN values, you need ffill to fill each NaN with the previous value. You then select the opposite (using ~) of the rows where the diff is greater than or equal to 0. Selecting rows where the diff is less than 0 sounds equivalent, but it would miss the first row of the dataframe, whose diff is NaN. For column 'a', for example:
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0    1.0
6    2.0
Name: a, dtype: float64
You get the two rows where a new range starts. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several columns, you can do:
list_col = ['a','b']
for col in list_col:
    df.loc[~(df[col].ffill().diff()>=0),col+'1'] = 1
    mask = df[col].notnull()
    df.loc[mask,col+'1'] = df.loc[mask,col+'1'].cumsum().ffill()
With my input, you get:
       a     b   a1   b1
0    1.0   1.0  1.0  1.0
1    4.0   2.0  1.0  1.0
2    NaN   3.0  NaN  1.0
3    5.0   4.0  1.0  1.0
4   10.0   NaN  1.0  NaN
5   12.0   6.0  1.0  1.0
6    2.0   7.0  2.0  1.0
7    3.0   8.0  2.0  1.0
8    4.0   NaN  2.0  NaN
9    NaN  10.0  NaN  1.0
10   8.0  11.0  2.0  1.0
11  12.0  12.0  2.0  1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
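The one-liner can be wrapped in a small helper so it applies to any number of columns. This is a sketch: the name flag_runs is made up, and it assumes that values increase within each set and that a new set begins with a smaller value than the one before it.

```python
import numpy as np
import pandas as pd


def flag_runs(series):
    """Number each ascending run 1, 2, ... and leave NaN rows as NaN."""
    values = series.dropna()
    # A negative diff marks the start of a new run; the first row's diff
    # is NaN, so fill it with -1 to count it as a start as well.
    starts = values.diff().fillna(-1).lt(0)
    return starts.cumsum().reindex(series.index)


df = pd.DataFrame({"a": [1, 4, np.nan, 5, 10, 12, 2, 3, 4, np.nan, 8, 12],
                   "b": range(1, 13)})
df.loc[[4, 8], "b"] = np.nan

for col in ["a", "b"]:
    df[col + "1"] = flag_runs(df[col])
print(df)
```

If a second set could start with a value equal to or larger than the last value of the first set, the diff would not be negative and the boundary would be missed, so check that assumption against your data.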

Identify and count unique patterns in a pandas dataframe

You'll find snippets with reproducible input and an example of desired output at the end of the question.
The challenge:
I have a dataframe like this:
The dataframe has two columns with patterns of 1 and 0 like this:
Or this:
The number of columns will vary, and so will the length of the patterns.
However, the only numbers in the dataframe will be 0 or 1.
I would like to identify these patterns, count each occurrence of them, and build a dataframe containing the results. To simplify the whole thing, I'd like to focus on the ones, and ignore the zeros. The desired output in this particular case would be:
I'd like the procedure to identify that, as an example, the pattern [1,1,1] occurs two times in column_A, and not at all in column_B. Notice that I've used the sums of the patterns as indexes in the dataframe.
Reproducible input:
import pandas as pd
df = pd.DataFrame({'column_A':[1,1,1,0,0,0,1,0,0,1,1,1],
                   'column_B':[1,1,1,1,1,0,0,0,1,1,0,0]})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
# pd.datetime was removed from recent pandas versions; pd.Timestamp.today() replaces it
datelist = pd.date_range(pd.Timestamp.today().strftime('%Y-%m-%d'), periods=len(df)).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
print(df)
Desired output:
df2 = pd.DataFrame({'pattern':[5,3,2,1],
                    'column_A':[0,2,0,1],
                    'column_B':[1,0,1,0]})
df2 = df2.set_index(['pattern'])
print(df2)
My attempts so far:
I've been working on a solution that includes nested for loops where I calculate running sums that are reset each time an observation equals zero. It also includes functions such as df.apply(lambda x: x.value_counts()). But it's messy to say the least, and so far not 100% correct.
Thank you for any other suggestions!
Here's my attempt:
def fun(ser):
    ser = ser.dropna()
    ser = ser.diff().fillna(ser)
    return ser.value_counts()
df.cumsum().where((df == 1) & (df != df.shift(-1))).apply(fun)
Out:
     column_A  column_B
1.0       1.0       NaN
2.0       NaN       1.0
3.0       2.0       NaN
5.0       NaN       1.0
The first part (df.cumsum().where((df == 1) & (df != df.shift(-1)))) produces the cumulative sums:
            column_A  column_B
dates
2017-08-04       NaN       NaN
2017-08-05       NaN       NaN
2017-08-06       3.0       NaN
2017-08-07       NaN       NaN
2017-08-08       NaN       5.0
2017-08-09       NaN       NaN
2017-08-10       4.0       NaN
2017-08-11       NaN       NaN
2017-08-12       NaN       NaN
2017-08-13       NaN       7.0
2017-08-14       NaN       NaN
2017-08-15       7.0       NaN
Each remaining value is the cumulative sum at the last row of a run of ones. If we ignore the NaNs and take the diffs, we get the length of each run. That's what the function does: it drops the NaNs and then takes the differences, so the values are no longer cumulative. Finally, it returns the value counts.
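To get from that result to the exact layout of the desired output, a few extra steps seem to be enough. This is a sketch on the question's input without the date index; the names run_length_counts, run_ends and result are mine.

```python
import pandas as pd

df = pd.DataFrame({"column_A": [1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1],
                   "column_B": [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]})


def run_length_counts(ser):
    """Turn cumulative sums at run ends into run lengths and count them."""
    ends = ser.dropna()
    lengths = ends.diff().fillna(ends)  # first run keeps its own sum
    return lengths.value_counts()


# Keep the cumulative sum only on the last row of each run of ones
run_ends = df.cumsum().where((df == 1) & (df != df.shift(-1)))
counts = run_ends.apply(run_length_counts)

# Missing patterns become 0, counts become ints, longest pattern first
result = counts.fillna(0).astype(int).sort_index(ascending=False)
result.index = result.index.astype(int)
result.index.name = "pattern"
print(result)
```

The fillna(0) step is what turns "pattern never occurs in this column" into an explicit zero, matching df2 in the question.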
