Create a log of value changes in a pandas dataframe - python

I have been working with a dataframe in the following format (the actual table has many more rows (ids) and columns (value_3, value_4, etc.)). For each id, the status column holds 'new' if this is the first entry for that id, and 'modified' if any of the value_1, value_2 columns have changed compared to their previous value. I would like to create a log of any changes made in the table: one record per change, with the column name, the new value, and the previous value.
Ideally, I would like to avoid using loops, so could you please suggest a more efficient, pythonic way to achieve this?
I have seen the answers posted for this question: Determining when a column value changes in pandas dataframe
They partly do what I want (using shift or diff) by identifying the rows where a change happened, and I was wondering whether that is the best approach to build on for my case, or whether there is a more efficient way to speed things up. Ideally, I would like something that works for both numeric and non-numeric values in the value_1, value_2, etc. columns.
Code for creating the sample data shown above:
import pandas as pd

data = [[1, 2, 5, 'new'], [1, 1, 5, 'modified'], [1, 0, 5, 'modified'],
        [2, 5, 2, 'new'], [2, 5, 3, 'modified'], [2, 5, 4, 'modified']]
df = pd.DataFrame(data, columns=['id', 'value_1', 'value_2', 'status'])
df
Many thanks in advance for any suggestion/help!

We need melt first, then drop_duplicates, then a grouped shift:
# long format: one row per (id, column, value); keep only the first
# occurrence of each distinct value per id and column
s = df.melt(['id', 'status']).drop_duplicates(['id', 'variable', 'value'])
# previous value within each (id, column) group
s['new'] = s.groupby(['id', 'variable'])['value'].shift()
s  # or s.sort_values('id')
    id    status variable  value  new
0    1       new  value_1      2  NaN
1    1  modified  value_1      1  2.0
2    1  modified  value_1      0  1.0
3    2       new  value_1      5  NaN
6    1       new  value_2      5  NaN
9    2       new  value_2      2  NaN
10   2  modified  value_2      3  2.0
11   2  modified  value_2      4  3.0
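One caveat with the drop_duplicates approach: if a value ever changes back to something it held earlier, that re-occurrence is silently dropped from the log. A minimal sketch that instead compares each row to its predecessor per id and column, and that works for non-numeric values too:

import pandas as pd

value_cols = ['value_1', 'value_2']  # extend with value_3, value_4, ...

long = df.melt(id_vars=['id', 'status'], value_vars=value_cols)
long['old'] = long.groupby(['id', 'variable'])['value'].shift()
# keep only genuine changes; != works for strings as well as numbers
log = long[long['old'].notna() & (long['value'] != long['old'])]
print(log)
#     id    status variable  value  old
# 1    1  modified  value_1      1  2.0
# 2    1  modified  value_1      0  1.0
# 10   2  modified  value_2      3  2.0
# 11   2  modified  value_2      4  3.0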

Related

Columns getting appended to wrong row in pandas

So I have a dataframe like this:
       0          1           2 ...
0  Index  Something  Something2 ...
1      1          5           8 ...
2      2          6           9 ...
3      3          7          10 ...
Now, I want to append some columns in between those "Something" column names, for which I have used this code:
j = 1
for i in range(2, 51):
    if i % 2 != 0 and i != 4:
        df.insert(i, f"% Difference {j}", " ")
        j += 1
where df is the dataframe. Now what happens is that the columns do get inserted, but like this:
       0          1  Difference 1           2 ...
0  Index  Something           NaN  Something2 ...
1      1          5           NaN           8 ...
2      2          6           NaN           9 ...
3      3          7           NaN          10 ...
whereas what I wanted was this:
       0          1             2           3 ...
0  Index  Something  Difference 1  Something2 ...
1      1          5           NaN           8 ...
2      2          6           NaN           9 ...
3      3          7           NaN          10 ...
Edit 1: Using jezrael's logic:
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print(df)
The output of that is still this:
       0          1           2 ...
0  Index  Something  Something2 ...
1      1          5           8 ...
2      2          6           9 ...
3      3          7          10 ...
Any ideas or suggestions as to where or how I am going wrong?
If your dataframe looks like what you've shown in your first code block, your column names aren't Index, Something, etc. - they're actually 0, 1, etc.
Pandas is seeing Index, Something, etc. as data in row 0, NOT as column names (which exist above row 0). So when you add a column with the name Difference 1, you're adding a column above row 0, which is where the range of integers is located.
A couple of potential solutions to this:
If you'd like the actual column names to be Index, Something, etc., then the best solution is to import the data with that row as the headers. What is the source of your data? If it's a csv, make sure NOT to use the header=None option. If it's from somewhere else, there is likely an option to pass in a list of column names to use. I can't think of any reason why you'd want a range of integer values as your column names rather than the more descriptive names you have listed.
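For illustration, a sketch of the difference at import time, assuming a hypothetical data.csv whose first line holds the column names:

import pandas as pd

df = pd.read_csv('data.csv')                   # first line becomes the column names
df_raw = pd.read_csv('data.csv', header=None)  # columns are 0, 1, 2, ... and the
                                               # header line lands in row 0 as data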
Alternatively, you can do what @jezrael suggested and convert your first row of data to column names, then delete that data row. I'm not sure why their solution isn't working for you, since the code seems to work fine in my testing. Here's what it's doing:
df.columns = df.iloc[0].tolist()
df.columns tells pandas what to (re)name the columns of the dataframe. df.iloc[0].tolist() creates a list out of the first row of data, which in your case is the column names that you actually want.
df = df.iloc[1:].reset_index(drop=True)
This grabs the 2nd through last rows of data to recreate the dataframe. So you have new column names based on the first row, and then you recreate the dataframe starting at the second row. The .reset_index(drop=True) isn't strictly necessary; it just restarts your actual data rows with an index value of 0 rather than 1.
If for some reason you want to keep the column names as they currently exist (as integers rather than labels), you could do something like the following under the if statement in your for loop:
df.insert(i, i, np.nan, allow_duplicates=True)
df.iat[0, i] = f"%Difference {j}"
df.columns = np.arange(len(df.columns))
The first line inserts a column with an integer label, filled with NaN values to start (assuming you have numpy imported). You need to allow duplicates, otherwise you'll get an error, since the integer value is already the name of a pre-existing column.
The second line changes the value in the 1st row of the newly-created column to what you want.
The third line resets the column names to be a range of integers like you had to start with.
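For completeness, a runnable sketch of that idea on the sample frame; the header label is supplied at insert time, a slight variation on the three lines above that sidesteps changing the new column's dtype after creation:

import numpy as np
import pandas as pd

df = pd.DataFrame([['Index', 'Something', 'Something2'],
                   [1, 5, 8], [2, 6, 9], [3, 7, 10]])

j = 1
for i in (2,):  # one insert for this small frame; widen the range as needed
    new_col = [f"% Difference {j}"] + [np.nan] * (len(df) - 1)
    df.insert(i, i, new_col, allow_duplicates=True)
    j += 1
df.columns = np.arange(len(df.columns))  # back to 0..n-1 labels
print(df)
#        0          1               2           3
# 0  Index  Something  % Difference 1  Something2
# 1      1          5             NaN           8
# 2      2          6             NaN           9
# 3      3          7             NaN          10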
As @jezrael suggested, it seems like you might be a little unclear about the difference between column names, indices, and data rows and columns. An index is its own thing, so it's not usually necessary to have a column named Index like you have in your dataframe, especially since that column has the same values as the actual index. Clarifying those sorts of things at import can prevent a lot of hassle later on, so I'd recommend taking a good look at your data source to see if you can create a clearer dataframe to start with!
I want to append some columns in between those "Something" column names
No, there are no column names Something; for that, you need to set the first row of data as the column names:
print (df.columns)
Int64Index([0, 1, 2], dtype='int64')
print (df.iloc[0].tolist())
['Index', 'Something', 'Something2']
df.columns = df.iloc[0].tolist()
df = df.iloc[1:].reset_index(drop=True)
print (df)
  Index Something Something2
0     1         5          8
1     2         6          9
2     3         7         10
print (df.columns)
Index(['Index', 'Something', 'Something2'], dtype='object')
Then your solution creates the Difference columns, but the output is different - there are no columns 0, 1, 2, 3.

Python pandas merge with condition and no duplicates

I have 2 dataframes derived from 2 excel files. The first is a sort of template with a condition column; the other has the same format but contains inputs for different time periods. I would like to create an output dataframe that is basically a copy of the template, populated with the inputs where the condition is met.
When I use something like df1.merge(df2.assign(Condition='yes'), on=['Condition'], how='left') I sort of get something in line with what I'm after but it contains duplicates. What could I do instead?
thanks
Example below
Code
import pandas as pd

df1 = {'reference': [1, 2], 'condition': ['yes', 'no'],
       '31/12/2021': [0, 0], '31/01/2022': [0, 0]}
df1 = pd.DataFrame.from_dict(df1)
df2 = {'reference': [1, 2], 'condition': ['', ''],
       '31/12/2021': [101, 231], '31/01/2022': [3423, 3242]}
df2 = pd.DataFrame.from_dict(df2)
df1.merge(df2.assign(condition='yes'), on=['condition'], how='left')
You could use df.update for this:
# only `update` from column index `2` onwards: ['31/12/2021', '31/01/2022']
df2.update(df1.loc[df1.condition == 'no', list(df1.columns)[2:]])
print(df2)
   reference condition  31/12/2021  31/01/2022
0          1                  101.0      3423.0
1          2                    0.0         0.0
Alternative solution using df.where:
df2.iloc[:, 2:] = df2.iloc[:, 2:].where(df1.condition == 'yes', df1.iloc[:, 2:])
print(df2)
   reference condition  31/12/2021  31/01/2022
0          1                    101        3423
1          2                      0           0
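For reference, a self-contained run of the where variant, rebuilding both frames from the question. Note that update works in place and upcasts the touched columns to float (hence 101.0 above), while where keeps the integer dtypes:

import pandas as pd

df1 = pd.DataFrame({'reference': [1, 2], 'condition': ['yes', 'no'],
                    '31/12/2021': [0, 0], '31/01/2022': [0, 0]})
df2 = pd.DataFrame({'reference': [1, 2], 'condition': ['', ''],
                    '31/12/2021': [101, 231], '31/01/2022': [3423, 3242]})

# keep df2's inputs where the template says 'yes', otherwise fall back to df1
df2.iloc[:, 2:] = df2.iloc[:, 2:].where(df1.condition == 'yes', df1.iloc[:, 2:])
print(df2)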

Fill empty columns with values from another column of another row based on an identifier

I am trying to fill a dataframe, containing repeated elements, based on an identifier.
My Dataframe is as follows:
   Code Value
0  SJHV
1  SJIO   96B
2  SJHV   33C
3  CPO3   22A
4  CPO3   22A
5  SJHV   33C   # <-- numbers stored as strings
6  TOY
7  TOY          # <-- these aren't NaN, they are empty strings
I would like to remove the empty 'Value' rows only if a non-empty 'Value' row exists. To be clear, I would want my output to look like:
   Code Value
0  SJHV   33C
1  SJIO   96B
2  CPO3   22A
3  TOY
My attempt was as follows:
import numpy as np

df['Value'].replace('', np.nan, inplace=True)
df2 = df.dropna(subset=['Value']).drop_duplicates('Code')
As expected, this code also drops the 'TOY' Code. Any suggestions?
If you sort by Value in descending order, the empty strings go to the bottom, and then you can just drop duplicates.
import pandas as pd

df = pd.DataFrame({'Code': ['SJHV', 'SJIO', 'SJHV', 'CPO3', 'CPO3', 'SJHV', 'TOY', 'TOY'],
                   'Value': ['', '96B', '33C', '22A', '22A', '33C', '', '']})
df = (
    df.sort_values(by=['Value'], ascending=False)
      .drop_duplicates(subset=['Code'], keep='first')
      .sort_index()
)
Output
   Code Value
1  SJIO   96B
2  SJHV   33C
3  CPO3   22A
6  TOY
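An alternative sketch that does not depend on sort order: treat empty strings as missing, then take the first non-missing Value per Code. groupby(...).first() skips NaN, and an all-NaN group such as TOY is kept rather than dropped:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Code': ['SJHV', 'SJIO', 'SJHV', 'CPO3', 'CPO3', 'SJHV', 'TOY', 'TOY'],
                   'Value': ['', '96B', '33C', '22A', '22A', '33C', '', '']})

out = (df.replace({'Value': {'': np.nan}})
         .groupby('Code', sort=False, as_index=False)
         .first())
out['Value'] = out['Value'].fillna('')  # back to empty strings if preferred
print(out)
#    Code Value
# 0  SJHV   33C
# 1  SJIO   96B
# 2  CPO3   22A
# 3   TOY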

Create a multiindex DataFrame from existing delimited column names

I have a pandas DataFrame that looks like the following
            A_value  A_avg  B_value  B_avg
date
2020-01-01        1      2        3      4
2020-02-01        5      6        7      8
and my goal is to create a MultiIndex DataFrame that looks like this:
                A        B
            value avg value avg
date
2020-01-01      1   2     3   4
2020-02-01      5   6     7   8
So the part of the column name before the '_' should be the first level of the column index and the part after it the second level. The first part is unstructured; the second always comes from the same fixed set of 4 endings.
I tried to solve it with pd.wide_to_long() but I think that is the wrong path, as I don't want to change the df itself. The real df is much larger, so creating it manually is not an option. I'm stuck here and did not find a solution.
You can split the columns by the delimiter and expand to create a MultiIndex:
df.columns = df.columns.str.split("_", expand=True)
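A short runnable version, rebuilding the frame from the sample above:

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.Index(['2020-01-01', '2020-02-01'], name='date'),
    columns=['A_value', 'A_avg', 'B_value', 'B_avg'],
)
df.columns = df.columns.str.split('_', expand=True)
print(df)
#                 A        B
#             value avg value avg
# date
# 2020-01-01      1   2     3   4
# 2020-02-01      5   6     7   8

If the first, unstructured part can itself contain underscores, splitting once from the right keeps it intact: df.columns.str.rsplit('_', n=1, expand=True).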

Adding a new row to a dataframe with correct mapping in pandas

I have a dataframe something like below:
  carrier_plan_identifier  ...  hios_issuer_identifier
1                    AUSK  ...                 99806.0
2                    AUSM  ...                 99806.0
3                    AUSN  ...                 99806.0
4                    AUSS  ...                 99806.0
5                    AUST  ...                 99806.0
I need to pick a particular column, let's say wellthie_issuer_identifier, and query the database based on this column's values. My select query will look something like this:
select id, wellthie_issuer_identifier from issuers where wellthie_issuer_identifier in(....)
I need to add id column back to my existing dataframe with respect to the wellthie_issuer_identifier.
I have searched a lot but am not clear on how this can be done.
Try this:
1.) Pick a particular column, let's say wellthie_issuer_identifier:
t = tuple(df.wellthie_issuer_identifier)
This will give you a tuple like (1,0,1,1)
2.) Query the database based on this column value.
You need to substitute the above tuple into your query:
query = """select id, wellthie_issuer_identifier from issuers
           where wellthie_issuer_identifier in {}"""
Create a cursor to the database, execute this query, and create a DataFrame from the result:
cur.execute(query.format(t))
df_new = pd.DataFrame(cur.fetchall())
df_new.columns = ['id','wellthie_issuer_identifier']
Now your df_new will have the columns id and wellthie_issuer_identifier. You need to add this id column back to the existing df.
Do this:
df = pd.merge(df,df_new, on='wellthie_issuer_identifier',how='left')
It will add an id column to df which will have values if a match is found on wellthie_issuer_identifier, otherwise it will put NaN.
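As a hedged variant of the same flow, you can let the database driver do the substitution instead of str.format, which avoids quoting and injection problems. Here conn is assumed to be an open DB-API connection, and the placeholder style ('%s' vs '?') depends on the driver:

import pandas as pd

ids = tuple(df['wellthie_issuer_identifier'])
placeholders = ', '.join(['%s'] * len(ids))  # use '?' for sqlite3
query = ("select id, wellthie_issuer_identifier from issuers "
         f"where wellthie_issuer_identifier in ({placeholders})")
df_new = pd.read_sql_query(query, conn, params=ids)
df = df.merge(df_new, on='wellthie_issuer_identifier', how='left')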
Let me know if this helps.
You can add another column to a dataframe with pandas if the column is not too long. For example:
import pandas as pd

df = pd.read_csv('just.csv')
df
   id  user_id  name
0   1        1  tolu
1   2        5    jb
2   3        6   jbu
3   4        7   jab
4   5        9   jbb

# to add a new column to the data above
df['new_column'] = ['jdb', 'biwe', 'iuwfb', 'ibeu', 'igu']  # new values
df
   id  user_id  name new_column
0   1        1  tolu        jdb
1   2        5    jb       biwe
2   3        6   jbu      iuwfb
3   4        7   jab       ibeu
4   5        9   jbb        igu
# this should help if the dataset is not too large
Then you can go on querying your database.
This will not fetch the id values from the database, but if, as you said, all the wellthie_issuer_identifier values are already there, then the below should work for you:
df1 = df.assign(id=df['wellthie_issuer_identifier'].astype('category').cat.codes)
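To illustrate what cat.codes produces, a small hypothetical sample; note these are local codes starting at 0, not the ids stored in the issuers table:

import pandas as pd

df = pd.DataFrame({'wellthie_issuer_identifier': [99806.0, 99806.0, 99807.0]})
df1 = df.assign(id=df['wellthie_issuer_identifier'].astype('category').cat.codes)
print(df1)
#    wellthie_issuer_identifier  id
# 0                     99806.0   0
# 1                     99806.0   0
# 2                     99807.0   1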
