I have a dataframe with some information on it. I created another dataframe that is larger and has default values in it. I want to update the default dataframe with the values from the first dataframe. I'm using df.update but nothing is happening. Here is the code:
new_df = pd.DataFrame(index=range(25))
new_df['Column1'] = 1
new_df['Column2'] = 2
new_df.update(old_df)
Here, old_df has 2 rows, indexed 5 and 6, with some random values in Column1 and Column2 and nothing else. I'm expecting these rows to overwrite the default values in new_df. What am I doing wrong?
This works for me, so I assume the problem is in the part of the code you haven't shown us.
import pandas as pd
import numpy as np
new_df = pd.DataFrame(index=range(25))
old_df = pd.DataFrame(index=[5,6])
new_df['Column1'] = 1
new_df['Column2'] = 2
old_df['Column1'] = np.nan
old_df['Column2'] = np.nan
old_df.loc[5,'Column1'] = 9
old_df.loc[6,'Column2'] = 7
new_df.update(old_df)
print(new_df.head(10))
Output:
Column1 Column2
0 1.0 2.0
1 1.0 2.0
2 1.0 2.0
3 1.0 2.0
4 1.0 2.0
5 9.0 2.0
6 1.0 7.0
7 1.0 2.0
8 1.0 2.0
9 1.0 2.0
Since you haven't shown how you construct or obtain old_df: before doing the update, make sure that both indexes have the same type.
new_df.index = new_df.index.astype('int64')
old_df.index = old_df.index.astype('int64')
An int is not equal to a string: 1 != '1'. If the index types differ, update() finds no common rows between your dataframes and has nothing to do.
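A minimal sketch of that failure mode. The string labels '5' and '6' and the float defaults are assumptions made for illustration; the original old_df was never shown.

```python
import pandas as pd

# Defaults stored as floats so that update() does not need to change dtypes.
new_df = pd.DataFrame({"Column1": 1.0, "Column2": 2.0}, index=range(25))

# Assumption: old_df arrived with string labels, e.g. after reading a file.
old_df = pd.DataFrame(
    {"Column1": [9.0, 8.0], "Column2": [7.0, 6.0]}, index=["5", "6"]
)

before = new_df.copy()
new_df.update(old_df)                 # no label matches, so nothing is updated
unchanged = new_df.equals(before)

# After casting the index to integers, the labels 5 and 6 line up.
old_df.index = old_df.index.astype("int64")
new_df.update(old_df)

print(unchanged)
print(new_df.loc[4:7])
```

Comparing new_df.index.dtype with old_df.index.dtype before calling update() is a quick way to spot this kind of mismatch.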
I have a table named df with two columns - Name and Data. The table is something as follows
I am trying to create all possible combinations of values from the Data column and concat the results as separate columns to the existing table. Basically, in every subsequent column, two of the names will take values as 2 and 1.5 and the rest will take the value as 1. I am looking for output as similar to the following table:
I have been able to work out the combinations of names that will take the values 2 and 1.5 in the next column using the following code:
from itertools import combinations
for index in combinations(df[['Name']].index, 2):
    # index is a tuple of two row labels; pass it to .loc as a list
    print(df[['Name']].loc[list(index), :])
    print('\n')
However, I am stuck on how to create the fresh columns as mentioned above. Any help on the same is highly appreciated.
I think you are looking for permutations, not combinations. In this case we can generate those and transpose the data. After the transpose we can rename the columns.
import pandas as pd
from itertools import permutations
df = pd.DataFrame({'Name':['A','B','C','D'],
'Data':[1,2,1,1.5]})
df = pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T
df.columns = [f'Data{x+1}' for x in df.columns]
df.reset_index(inplace=True)
df.rename(columns={'index':'Name'}, inplace=True)
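Or, as a single chain applied to the original df (note that add_prefix numbers the columns from Data0 rather than Data1):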
pd.DataFrame(list(permutations(df.Data.values,4)), columns=df.Name.values).T.add_prefix('Data').rename_axis('Name').reset_index()
Output
Name Data1 Data2 Data3 Data4 Data5 Data6 Data7 Data8 Data9 ... \
0 A 1.0 1.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 ...
1 B 2.0 2.0 1.0 1.0 1.5 1.5 1.0 1.0 1.0 ...
2 C 1.0 1.5 2.0 1.5 2.0 1.0 1.0 1.5 1.0 ...
3 D 1.5 1.0 1.5 2.0 1.0 2.0 1.5 1.0 1.5 ...
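Because the value 1 occurs twice in Data, the 24 permutations contain repeated columns. A hedged sketch that keeps only the distinct orderings, assuming duplicate columns are not wanted (the name wide is introduced here for illustration):

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Data": [1, 2, 1, 1.5]})

# dict.fromkeys drops repeated tuples while keeping first-seen order.
unique_perms = list(dict.fromkeys(permutations(df["Data"].tolist(), 4)))

# One row per name, one column per distinct ordering.
wide = pd.DataFrame(unique_perms, columns=df["Name"].tolist()).T
wide.columns = [f"Data{i + 1}" for i in wide.columns]
wide = wide.rename_axis("Name").reset_index()

print(wide.shape)
```

With two identical values among four, this leaves 12 data columns instead of 24.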
Using df.compare in Pandas, is it possible to change the labels of self/other from the output?
I need to send this output directly to less technically savvy users and would like to change them to more descriptive labels.
My code:
if df_1.equals(df_2):
    return None
else:
    return df_1.compare(df_2, align_axis=0)
You can rename the index level to something more obvious:
df1 = pd.DataFrame([[1,2,3,4], [1,2,3,4]])
df2 = pd.DataFrame([[1,2,5,4], [5,2,3,1]])
df1.compare(df2, align_axis=0).rename(index={'self': 'left', 'other': 'right'}, level=-1)
           0    2    3
0 left   NaN  3.0  NaN
  right  NaN  5.0  NaN
1 left   1.0  NaN  4.0
  right  5.0  NaN  1.0
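As an alternative sketch, assuming pandas 1.5 or later is available, compare() accepts a result_names argument that sets the labels directly, so no rename is needed afterwards:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4]])
df2 = pd.DataFrame([[1, 2, 5, 4], [5, 2, 3, 1]])

# result_names replaces the default ('self', 'other') labels.
diff = df1.compare(df2, align_axis=0, result_names=("left", "right"))
print(diff)
```

On older pandas versions this argument does not exist, and the rename on the last index level remains the way to go.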
I want to calculate a column in a pandas dataframe, but some rows contain missing values. For those rows, I want to use a different formula. Let's say:
If column B contains a value, then subtract A from B
If column B does not contain a value, then subtract A from C
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
df['calc'] = df['b']-df['a']
results in:
print(df)
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 NaN
3 4 1.0 2 -3.0
Approach 1: fill the NaN rows using .where:
df['calc'].where(df['b'].isnull()) = df['c']-df['a']
which results in SyntaxError: cannot assign to function call.
Approach 2: fill the NaN rows using .iterrows():
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
This executes without errors and the calculation is correct; these i values are printed to the console:
0.0
-1.0
-1.0
-3.0
but the values are not written into df['calc']; the dataframe remains as is:
print(df['calc'])
0 0.0
1 -1.0
2 NaN
3 -3.0
What is the correct way of overwriting the NaN values?
Finally, I stumbled over .fillna:
df['calc'] = df['calc'].fillna( df['c']-df['a'] )
gets the job done! Can anyone explain what is wrong with above two approaches...?
Approach 2:
You are assigning the result to the local variable i, which does not modify your original dataframe. Write it back with df.loc:
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
    df.loc[index,'calc'] = i #<------------- here
Also, avoid iterrows() where you can; it is very slow compared with vectorized operations.
Approach 1:
The pandas where() method keeps the values in the rows where the condition is True and replaces the rest with the second argument (NaN by default). It returns a new object, so the result has to be assigned back rather than assigned to.
Syntactically, it should be:
df['calc'] = df['calc'].where(df['b'].isnull(), df['c']-df['a'])
but this keeps calc only in the rows where b is null and overwrites all the other rows with c - a, which is the opposite of what you want.
Use:
df['calc'] = df['calc'].where(~df['b'].isnull(), df['c']-df['a'])
OR
df['calc'] = np.where(df['b'].isnull(), df['c']-df['a'], df['calc'])
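To make the direction of the condition concrete, here is a small sketch on the question's data comparing where(), its mirror image mask(), and np.where(). The variable names are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1, 1, None, 1], "c": [2, 2, 2, 2]})

calc = df["b"] - df["a"]        # NaN wherever b is missing
fallback = df["c"] - df["a"]    # formula to use for those rows

# where(): keep calc where the condition is True, otherwise take fallback.
via_where = calc.where(df["b"].notnull(), fallback)

# mask(): replace calc where the condition is True.
via_mask = calc.mask(df["b"].isnull(), fallback)

# np.where(): pick fallback where the condition is True, otherwise calc.
via_np = pd.Series(np.where(df["b"].isnull(), fallback, calc), index=df.index)

print(via_where.tolist())
```

All three produce the same column; which one reads best is mostly a matter of taste.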
Instead of subtracting a from b and then a from c, you can first fill the NaN values in column b with the values from column c, and then subtract column a:
df['calc'] = df['b'].fillna(df['c']) - df['a']
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 -1.0
3 4 1.0 2 -3.0
How do I combine the values from two rows that share the same index label and have no overlapping non-null values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Use stack(), which drops all NaNs, and then unstack():
df.stack().unstack()
If it is enough to take the first non-missing value per index label, a simpler solution is GroupBy.first:
df1 = df.groupby(level=0).first()
For the sample data, summing per label gives the same output (DataFrame.sum(level=0) was removed in pandas 2.0, so group by the index level instead):
df1 = df.groupby(level=0).sum()
If there can be multiple non-missing values per group, you need to specify the expected output; that case is obviously more complicated.
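A short sketch of why that last case matters, using a modified version of the sample in which label 'b' has two non-missing values in column 0 (the value 7 is made up for illustration):

```python
import pandas as pd

# Column 0 now has two non-missing values for label 'b': 4 and 7.
df = pd.DataFrame([[1, 2, 3], [4, None, None], [7, 5, 6]], index=["a", "b", "b"])

first = df.groupby(level=0).first()   # keeps 4, the first non-missing value
total = df.groupby(level=0).sum()     # adds 4 and 7

print(first)
print(total)
```

Here first() and sum() no longer agree, so the right choice depends on what the combined row is supposed to mean.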
I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400 and may or may not be repeated, and a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that marks every row of the first 1-1400 run with a '1' at the same index and, if a second run of 1-1400 exists, marks its rows with a '2'.
I can think of some roundabout ways using temporary placeholders and some other kind of checks, but I was wondering if there was a 1-3 liner to do this operation
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]
You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# take groupby for each value in column `a` and `b` and count each value
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# set np.nan as it is
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
EDIT (after receiving a comment of 'does not work'):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan
You can use diff to check where the difference between two consecutive values is negative, which marks the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# to create a dataframe with two columns my range go up to 12 but 1400 is the same
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because the column contains NaN, use ffill to fill each NaN with the previous value before taking the diff. Then select the rows where the diff is not greater than or equal to 0, using ~(... >= 0). This is not the same as testing for < 0: the diff of the first row is NaN, so a < 0 test would miss it, whereas the negated >= 0 test includes it. For column 'a', for example:
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0 1.0
6 2.0
Name: a, dtype: float64
You get the two rows where a "new" range starts. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several columns, you can do:
list_col = ['a','b']
for col in list_col:
    df.loc[~(df[col].ffill().diff()>=0),col+'1'] = 1
    mask = df[col].notnull()
    df.loc[mask,col+'1'] = df.loc[mask,col+'1'].cumsum().ffill()
and with my input, you get:
a b a1 b1
0 1.0 1.0 1.0 1.0
1 4.0 2.0 1.0 1.0
2 NaN 3.0 NaN 1.0
3 5.0 4.0 1.0 1.0
4 10.0 NaN 1.0 NaN
5 12.0 6.0 1.0 1.0
6 2.0 7.0 2.0 1.0
7 3.0 8.0 2.0 1.0
8 4.0 NaN 2.0 NaN
9 NaN 10.0 NaN 1.0
10 8.0 11.0 2.0 1.0
11 12.0 12.0 2.0 1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
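The one-liner can also be wrapped in a small helper so it can be applied to any column. This is only a sketch; the function name label_runs is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def label_runs(values: pd.Series) -> pd.Series:
    """Number each ascending run (1, 2, ...), leaving NaN rows as NaN."""
    non_null = values.dropna()
    # A negative difference marks the start of a new run; the first
    # non-null row has no predecessor, so it is forced to count as a start.
    starts = non_null.diff().fillna(-1).lt(0)
    # Reindexing puts NaN back in the rows that were dropped.
    return starts.cumsum().reindex(values.index)


a = pd.Series([1, 4, np.nan, 5, 10, 12, 2, 3, 4, np.nan, 8, 12])
a1 = label_runs(a)
print(a1.tolist())
```

Usage on a dataframe would then be df['a1'] = label_runs(df['a']), and likewise for 'b'.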