I have 2 dataframes that contain 3 account indicators per account number. The account numbers match one-to-one in the column labelled "account". I am trying to modify dataframe 2 so that it matches dataframe 1 in terms of having NaN values in the same positions in each column.
Dataframe 1:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1234567890,1,np.nan,'G'],
[7854567890,np.nan,100,np.nan],
[7854567899,np.nan,np.nan,np.nan],
[7854567893,np.nan,100,np.nan],
[7854567893,np.nan,np.nan,np.nan],
[9632587415,np.nan,np.nan,'B']],
columns = ['account','ind_1','ind_2','ind_3'])
df
Output:
account ind_1 ind_2 ind_3
0 1234567890 1.0 NaN G
1 7854567890 NaN 100.0 NaN
2 7854567899 NaN NaN NaN
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 NaN NaN B
Dataframe 2:
df2 = pd.DataFrame([[1234567890,5,np.nan,'GG'],
[7854567890,1,106,np.nan],
[7854567899,np.nan,100,'N'],
[7854567893,np.nan,100,np.nan],
[7854567893,np.nan,np.nan,np.nan],
[9632587415,3,np.nan,'B']],
columns = ['account','ind_1','ind_2','ind_3'])
df2
Output:
account ind_1 ind_2 ind_3
0 1234567890 5.0 NaN GG
1 7854567890 1.0 106.0 NaN
2 7854567899 NaN 100.0 N
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 3.0 NaN B
Problem:
I need to change dataframe 2 so that it has NaN values in the same positions as dataframe 1.
For example: column ind_1 has values at index 0, 1 and 5 in df2, whereas in df1 it only has a value at index 0. I need to replace the values at index 1 and 5 in df2 with NaN to match the NaN pattern in df1. The same logic should be applied to the other two columns.
Expected outcome:
account ind_1 ind_2 ind_3
0 1234567890 5.0 NaN GG
1 7854567890 NaN 106.0 NaN
2 7854567899 NaN NaN NaN
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 NaN NaN B
Is there any easy way to achieve this?
Thanks in advance
Alan
Try this:
df2[~df.isna()]
df.isna() checks where df has NaN values and creates a Boolean mask; inverting it with ~ and indexing df2 with the result keeps df2's values only where df has values and gives NaN everywhere else. Note that this returns a new DataFrame rather than modifying df2 in place.
df.isna() returns a DataFrame with True wherever your first dataframe has NaN values. You can use this as a Boolean mask to set the corresponding cells to NaN in your second DataFrame:
df2[df.isna()] = None
Be careful: I assume you also need these NaNs to end up in rows with the same value for account. This solution does not ensure that; it assumes the values in the account column appear in exactly the same order in both dataframes.
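If you want to guard against that, here is a minimal sketch (mask_like is a hypothetical helper name; it assumes both frames have the same length and index, with df and df2 as defined in the question):
import pandas as pd
import numpy as np

# Hypothetical helper: copy the NaN pattern of df_ref onto df_target,
# but only after checking that the account columns line up row-for-row.
def mask_like(df_ref, df_target):
    if not df_ref['account'].equals(df_target['account']):
        raise ValueError("account columns are not aligned row-for-row")
    # mask() sets a cell to NaN wherever the condition is True.
    return df_target.mask(df_ref.isna())

df2_masked = mask_like(df, df2)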
Related
I have a dataframe that looks like the following:
ID Value
0x3000 nan
nan 1
nan 2
nan 3
0x4252 nan
nan 10
nan 12
now, I'm looking for a way to get these two groups out of this dataframe, like so:
ID Value
0x3000 nan
nan 1
nan 2
nan 3
and
ID Value
0x4252 nan
nan 10
nan 12
so a group basically starts at a hex value and contains its connected values all the way until the next occurrence of a valid hex value.
How can this be done effectively in pandas without manually looping through the rows and collecting row by row, until the condition (valid hex value) is met?
You can use groupby with a custom group to generate a list of DataFrames:
l = [g for _,g in df.groupby(df['ID'].notna().cumsum())]
output:
[ ID Value
0 0x3000 NaN
1 NaN 1.0
2 NaN 2.0
3 NaN 3.0,
ID Value
4 0x4252 NaN
5 NaN 10.0
6 NaN 12.0]
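For reference, the group key here is just a running count of non-NaN ID values, so each valid hex value starts a new group (a quick sketch reusing the df from the question):
# Each non-NaN ID increments the counter, so rows between hex values share a group number.
key = df['ID'].notna().cumsum()
print(key.tolist())  # [1, 1, 1, 1, 2, 2, 2]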
I'm having some problems iteratively filling a pandas DataFrame with two different types of values. As a simple example, please consider the following initialization:
IN:
df = pd.DataFrame(data=np.nan,
index=range(5),
columns=['date', 'price'])
df
OUT:
date price
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
When I try to fill one row of the DataFrame, it won't adjust the value in the date column. Example:
IN:
df.iloc[0]['date'] = '2022-05-06'
df.iloc[0]['price'] = 100
df
OUT:
date price
0 NaN 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I'm suspecting it has something to do with the fact that the default np.nan value cannot be replaced by a str type value, but I'm not sure how to solve it. Please note that changing the date column's type to str does not seem to make a difference.
This doesn't work because df.iloc[0] creates a temporary Series, which is what you update, not the original DataFrame.
If you need to mix positional and label indexing you can use:
df.loc[df.index[0], 'date'] = '2022-05-06'
df.loc[df.index[0], 'price'] = 100
output:
date price
0 2022-05-06 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
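If you prefer staying fully positional, an alternative sketch is to look up each column's integer position once and assign through .iloc (note that, depending on your pandas version, assigning a string into a float column may raise a warning about incompatible dtypes):
# Purely positional assignment that still modifies the original frame.
df.iloc[0, df.columns.get_loc('date')] = '2022-05-06'
df.iloc[0, df.columns.get_loc('price')] = 100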
Using .loc as shown below may work better:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.nan,
index=range(5),
columns=['date', 'price'])
print(df)
df.loc[0, 'date'] = '2022-05-06'
df.loc[0, 'price'] = 100
print(df)
Output:
date price
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
date price
0 2022-05-06 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
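For single scalar assignments like these, .at is the scalar counterpart of .loc and can be slightly faster (a small sketch with the same frame):
# .at uses the same label-based semantics as .loc, but only for single cells.
df.at[0, 'date'] = '2022-05-06'
df.at[0, 'price'] = 100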
I have a df that has a column with some company IDs. How can I split these IDs into separate columns?
In this column a cell can hold anywhere from 0 IDs (NaN) to more than 5 IDs. How do I divide each of them into separate columns?
Here is an example of the column:
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
The split would be at each comma; I imagine an output like this:
columnA columnB columnC
4773300 NaN NaN
NaN NaN NaN
6201501 6319400 6202300
8230001 NaN NaN
And so on depending on the number of IDs
You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.
>>> df
ID
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
>>> df['ID'].str.split(',', expand=True)
0 1 2 3 4 5
0 4773300 None None None None None
1 NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 None None None
3 8230001 None None None None None
4 NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470
You can also clean up the output a little for better aesthetics:
replace None with NaN
use alphabetic column names (though I would opt not to do this, as you'll hit errors if a given entry in the ID column has more than 26 IDs in it)
join back to the original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
df['ID'].str.split(',', expand=True)
.replace({None: float('nan')})
.pipe(lambda d:
d.set_axis(
pd.Series(list(ascii_uppercase))[d.columns],
axis=1
)
)
.add_prefix("column")
.join(df)
)
columnA columnB columnC columnD columnE columnF ID
0 4773300 NaN NaN NaN NaN NaN 4773300
1 NaN NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 NaN NaN NaN 6201501,6319400,6202300
3 8230001 NaN NaN NaN NaN NaN 8230001
4 NaN NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470 4742300,4744004,4744003,7319002,4729699,475470
Consider each entry as a string, and parse the string to get to individual values.
from ast import literal_eval
import pandas as pd

# Parse each entry of the 'company' column as a Python literal (a comma-separated
# entry becomes a tuple of IDs), then flatten the values into a single list.
df = pd.read_csv('sample.csv', converters={'company': literal_eval})
words = []
for items in df['company']:
    for word in items:
        words.append(word)
FYI, this is just a starting point. I don't know what output format you need, since your question is somewhat incomplete.
I'm analyzing Excel files generated by an organization that publishes yearly reports in Excel files. Each year, the column names (Year, A1, B1, C1, etc.) remain identical, but each year those column names start at different row and column numbers.
Each year I manually search for the starting row and column, but it's tedious work given the number of years of reports to wade through.
So I'd like something like this:
...
df = pd.read_excel('test.xlsx')
start_row,start_col = df.find_columns('Year','A1','B1')
...
Thanks.
Let's say you have three .xlsx files on your desktop prefixed with Yearly_Report. When read and combined in Python with something like df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]), the result looks like this:
0 1 2 3 4 5 6 7 8 9 10
0 A B C NaN NaN NaN NaN NaN NaN NaN NaN
1 1 2 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN A B C NaN NaN NaN NaN NaN NaN
4 NaN NaN 4 5 6 NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN A B C
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
As you can see, the columns and values are scattered across various columns and rows. The following steps would get you the desired result. First, you need to pd.concat the files and .dropna rows. Then, transpose the dataframe with .T before removing all cells with NaN values. Next, revert the dataframe back with another transpose .T. Finally, simply name the columns and drop rows that are equal to the column headers.
import glob, os
import pandas as pd
main_folder = 'Desktop/'
yearly_files = glob.glob(f'{main_folder}Yearly_Report*.xlsx')
df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]) \
       .dropna(how='all').T \
       .apply(lambda x: pd.Series(x.dropna().values)).T
df.columns = ['A','B','C']
df = df[df['A'] != 'A']
df
output:
A B C
1 1 2 3
4 4 5 6
2 4 5 6
Something like this; not totally sure what you are looking for:
df = pd.read_excel('test.xlsx')
for i in df.index:
print(df.loc[i,'Year'])
print(df.loc[i, 'A1'])
print(df.loc[i, "B1"])
I am using json_normalize to parse the JSON entries of a pandas column, but the output is a dataframe with multiple rows, each having only one non-null entry. I want to combine all these rows into one row.
currency custom.gt custom.eq price.gt price.lt
0 NaN 4.0 NaN NaN NaN
1 NaN NaN NaN 999.0 NaN
2 NaN NaN NaN NaN 199000.0
3 NaN NaN other NaN NaN
4 USD NaN NaN NaN NaN
You can use ffill (forward fill) and bfill (backfill), which are methods for filling NA values in pandas.
# fill NA values
# option 1:
df = df.ffill().bfill()
# option 2 (note: fillna(method=...) is deprecated in recent pandas versions):
df = df.fillna(method='ffill').fillna(method='bfill')
print(df)
currency custom.gt custom.eq price.gt price.lt
0 USD 4.0 other 999.0 199000.0
1 USD 4.0 other 999.0 199000.0
2 USD 4.0 other 999.0 199000.0
3 USD 4.0 other 999.0 199000.0
4 USD 4.0 other 999.0 199000.0
You can then drop the duplicated rows using drop_duplicates, keeping the first one:
df = df.drop_duplicates(keep='first')
print(df)
currency custom.gt custom.eq price.gt price.lt
0 USD 4.0 other 999.0 199000.0
Depending on how many times you have to repeat the task, I might also take a look at how the JSON file is structured to see if using a dictionary comprehension could help clean things up so that json_normalize can parse it more easily the first time.
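Another compact option (a sketch; it assumes, as in the question, that each column holds at most one non-null value) is to backfill and keep only the first row:
# Pull each column's single non-null value up into row 0, then keep just that row.
collapsed = df.bfill().iloc[[0]]
print(collapsed)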
You could do:
import pandas as pd
from functools import reduce
df = pd.DataFrame.from_dict({"a":["1", None, None],"b" : [None, None, 1], "c":[None, 3, None]})
def red_func(x, y):
    # Keep the first non-null value seen so far; otherwise take the next one.
    if pd.isna(x):
        return y
    return x

result = [*map(lambda row: reduce(red_func, row), [list(row) for i, row in df.iterrows()])]
Output:
In [135]: df
Out[135]:
a b c
0 1 NaN NaN
1 None NaN 3.0
2 None 1.0 NaN
In [136]: [*map(lambda row: reduce(red_func, row), [list(row) for i, row in df.iterrows()])]
Out[136]: ['1', 3.0, 1.0]
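An alternative without functools.reduce (a sketch; it assumes each row has at least one non-null value) is a row-wise apply that grabs the first non-null value:
# First non-null value in each row, without an explicit Python-level reduce.
first_valid = df.apply(lambda row: row.dropna().iloc[0], axis=1)
print(first_valid.tolist())  # ['1', 3.0, 1.0]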