I have two dataframes:
print(d)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 53 53
1 2020 3443 455 455 455
2 2021 6777 123 123 123
3 2019 5466 313 313 313
4 2020 4656 545 545 545
5 2021 4565 775 775 775
6 2019 4654 567 567 567
7 2020 7867 657 657 657
8 2021 6766 567 567 567
print(d1)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 73 63
import pandas as pd
d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount1': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount2': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
d1 = pd.DataFrame({
    'Year': [2019],
    'Salary': [1200],
    'Amount': [53],
    'Amount1': [73],
    'Amount2': [63]
})
I want to compare the 'Salary' value of dataframe d1 (i.e. 1200) with all the values of 'Salary' in dataframe d and count how many are >= or < it (a Boolean comparison). The same should be done for all the other columns (Amount, Amount1, Amount2, etc.). If the value in any column of d1 is NaN/None, no comparison needs to be done for that column. The column names are always the same in both dataframes, so it is basically a one-to-one column comparison.
My approach and thoughts -
I can get the values of d1 in a list by doing -
l = []
for i in range(len(d1.columns.values)):
    if i == 0:  # skip the 'Year' column
        continue
    num = d1.iloc[0, i]
    l.append(num)
print(l)
# list comprehension equivalent
lst = [d1.iloc[0, i] for i in range(len(d1.columns.values)) if i != 0]
[1200, 53, 73, 63]
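A simpler equivalent, assuming the first column (Year) is the only one to skip:

lst = d1.iloc[0, 1:].tolist()  # [1200, 53, 73, 63]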
and then use iterrows to iterate over all the rows and columns of dataframe d, or
iterate over d and perform a similar comparison by looping over d1 - but both would be time-consuming for a high-dimensional dataframe (d in this case).
What would be a more efficient or Pythonic way of doing this?
IIUC, you can do:
(d >= d1.values).sum()
Output:
Year 9
Salary 9
Amount 9
Amount1 8
Amount2 8
dtype: int64
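The mirrored count works the same way, and the NaN requirement comes for free: a comparison against NaN evaluates to False in both directions, so a column that is NaN in d1 simply contributes a count of 0 in both results.

ge_counts = (d >= d1.values).sum()  # per column: rows of d that are >= d1
lt_counts = (d < d1.values).sum()   # per column: rows of d that are < d1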
The following shows the first few columns of a data frame. I want to calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc., where the paired variable names differ only by the last character (either x or y).
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Address': ['xx', 'yy', 'zz','ww'], 'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543], 'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x']-df['V1_y']
new_df['Col2'] = df['V2_x']-df['V2_y']
new_df['Col3'] = df['V3_x']-df['V3_y']
Is there an approach I can use to check whether column names differ only by the last character, and difference them automatically if so?
Try creating a MultiIndex header using .str.split, then reshape the dataframe and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
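If you prefer not to reshape, a minimal sketch that pairs the columns directly, assuming every V<n>_x column has a matching V<n>_y partner:

bases = sorted({c[:-2] for c in df.columns if c.endswith('_x')})
for b in bases:
    df[f'{b}_diff'] = df[f'{b}_x'] - df[f'{b}_y']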
I have the following data:
q
1 0.83 97 0.7 193 0.238782 289 0.129692 385 0.090692
2 0.75 98 0.7 194 0.238782 290 0.129692 386 0.090692
...
96 0.94693 192 0.299753 288 0.145046 384 0.0965338 480 0.0823061
This data comes from somewhere else, and it has been split. However, the values correspond to a single variable 'q', along with its indices. To clarify, even though there are many columns, they all correspond to one column 'q', plus an index column (notice that the starting index of each column is the continuation of the end of the previous column).
How can I read the data with pandas? I believe I can do it by assigning names to each column and then merging them all together, but I was looking for a more elegant solution. Plus, the number of columns is not fixed.
This is the code that I am using at the moment:
q_param = pd.read_csv('Initial_solutions/initial_q_20y.dat', delim_whitespace=True)
Which does not do the trick. I would prefer to use pandas to solve this issue, but I can also work without it.
EDIT:
At the request of @user17242583, the following command:
print(q_param.head().to_dict())
Gives this output:
{'q': {(1, 0.83, 97, 0.7, 193, 0.238782, 289, 0.129692, 385): 0.090692, (2, 0.75, 98, 0.7, 194, 0.238782, 290, 0.129692, 386): 0.090692, (3, 0.64, 99, 0.64, 195, 0.238782, 291, 0.129692, 387): 0.090692, (4, 0.7, 100, 0.7, 196, 0.238782, 292, 0.129692, 388): 0.0884839, (5, 0.64, 101, 0.64, 197, 0.238782, 293, 0.129692, 389): 0.090692}}
It seems most of your data is index. Try:
# unpack each (index-tuple, value) row into flat (position, value) pairs
pairs = [list(k) + [v] for k, v in q_param['q'].items()]
flat = {pos: val for lst in pairs for pos, val in zip(lst[::2], lst[1::2])}
df = pd.DataFrame(flat, index=['q']).T.sort_index()
Try this:
data = {
    0: pd.concat(q[c] for c in q.columns[0::2]).reset_index(drop=True),
    1: pd.concat(q[c] for c in q.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 1 0.830000
1 2 0.750000
2 3 0.640000
3 4 0.700000
4 5 0.640000
5 97 0.700000
6 98 0.700000
7 99 0.640000
8 100 0.700000
9 101 0.640000
10 193 0.238782
11 194 0.238782
12 195 0.238782
13 196 0.238782
14 197 0.238782
15 289 0.129692
16 290 0.129692
17 291 0.129692
18 292 0.129692
19 293 0.129692
20 385 0.090692
21 386 0.090692
22 387 0.090692
23 388 0.088484
24 389 0.090692
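If a single 'q' series is wanted afterwards, a possible final step (the name q_series is illustrative) is:

# column 0 holds the original positions, column 1 the values
q_series = df.set_index(0)[1].rename('q').sort_index()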
So I want to show only the rows in which the x and y values of the other IDs match the x and y of ID 0. For example, show ID 0 and ID 250017920 (row 9), as their x and y match out of the first 20 rows. This process would need to be repeated for all rows, so that all we have left are the rows where x and y match those of ID 0 as its x and y change.
import pandas as pd

d = {'ID': [0, 2398794, 3987694, 987957, 9875987, 76438739, 2474654, 1983209, 2874050, 250017920, 38764902],
     'x': [-46, 8769, 432, 426, 132, 93, 124, 475, 857, -46, 67],
     'y': [2562, 987, 987, 252, 234, 123, 765, 1452, 542, 2562, 5876],
     'z': [5, 7, 6, 2, 7, 7, 4, 5, 1, 9, 3]}
data = pd.DataFrame(data=d)
ID x y z
0 0 -46 2562 5
1 2398794 8769 987 7
2 3987694 432 987 6
3 987957 426 252 2
4 9875987 132 234 7
5 76438739 93 123 7
6 2474654 124 765 4
7 1983209 475 1452 5
8 2874050 857 542 1
9 250017920 -46 2562 9
10 38764902 67 5876 3
For the following dataframe
d = {'ID': [999, 2398794, 3987694, 987957, 9875987],
     'x': [132, 8769, 432, 132, 132],
     'y': [563, 987, 987, 563, 234],
     'z': [5, 7, 6, 2, 7]}
data = pd.DataFrame(data=d)
print(data)
ID x y z
0 999 132 563 5
1 2398794 8769 987 7
2 3987694 432 987 6
3 987957 132 563 2
4 9875987 132 234 7
Get the index of the ID whose values in columns x and y you want to match - here, let's say ID 999.
# [0] refers to the first occurrence of this ID in the dataframe,
# in case the same ID appears multiple times
ind = data.index[data.ID == 999][0]
Now get the rows whose x and y values match the x and y values for ID = 999:
data[(data['x'] == data['x'][ind]) & (data['y'] == data['y'][ind])]
ID x y z
0 999 132 563 5
3 987957 132 563 2
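Applied to the original frame in the question, the same two lines with ID 0 return rows 0 and 9:

ind = data.index[data.ID == 0][0]
data[(data['x'] == data['x'][ind]) & (data['y'] == data['y'][ind])]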
I have a dataframe that looks like below -
Year Salary Amount
0 2019 1200 53
1 2020 3443 455
2 2021 6777 123
3 2019 5466 313
4 2020 4656 545
5 2021 4565 775
6 2019 4654 567
7 2020 7867 657
8 2021 6766 567
Python script to create the dataframe above -
import pandas as pd
import numpy as np
d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
I want to calculate certain percentile values for all the columns grouped by 'Year'.
Desired output - one percentile value per column within each 'Year' group (original screenshot omitted).
I am running the Python script below to calculate the percentile values -
df_percentile = pd.DataFrame()
p_list = [0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99]
c_list = []
p_values = []
for cols in d.columns[1:]:
    for p in p_list:
        c_list.append(cols + '_' + str(p))
        # note: np.percentile expects q in the range [0, 100], so this
        # computes e.g. the 0.05th percentile, not the 5th (p * 100)
        p_values.append(np.percentile(d[cols], p))
print(len(c_list), len(p_values))
df_percentile['Name'] = pd.Series(c_list)
df_percentile['Value'] = pd.Series(p_values)
print(df_percentile)
Output -
Name Value
0 Salary_0.05 1208.9720
1 Salary_0.1 1217.9440
2 Salary_0.25 1244.8600
3 Salary_0.5 1289.7200
4 Salary_0.75 1334.5800
5 Salary_0.95 1370.4680
6 Salary_0.99 1377.6456
7 Amount_0.05 53.2800
8 Amount_0.1 53.5600
9 Amount_0.25 54.4000
10 Amount_0.5 55.8000
11 Amount_0.75 57.2000
12 Amount_0.95 58.3200
13 Amount_0.99 58.5440
How can I get the output in the required format without having to do extra data manipulation/formatting or in fewer lines of code?
You can try pivot followed by quantile:
(d.pivot(columns='Year')
 .quantile([0.01, 0.05, 0.75, 0.95, 0.99])
 .stack('Year')
)
Output:
Salary Amount
Year
0.01 2019 1269.08 58.20
2020 3467.26 456.80
2021 4609.02 131.88
0.05 2019 1545.40 79.00
2020 3564.30 464.00
2021 4785.10 167.40
0.75 2019 5060.00 440.00
2020 6261.50 601.00
2021 6771.50 671.00
0.95 2019 5384.80 541.60
2020 7545.90 645.80
2021 6775.90 754.20
0.99 2019 5449.76 561.92
2020 7802.78 654.76
2021 6776.78 770.84
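For reference, a groupby-based sketch with the question's own percentile list gives a similar shape; note that pandas' quantile, unlike np.percentile, expects fractions in [0, 1]:

# yields a frame with a (Year, quantile) MultiIndex and one column
# per numeric field
d.groupby('Year').quantile([0.05, 0.10, 0.25, 0.50, 0.75, 0.95, 0.99])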
I have the following dataframe, for which I want to create a column named 'Value' using numpy for fast looping while at the same time referring to the previous row's value in the same column.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
        "Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
        "Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
        "Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
    }
)

Desired output, where Value is the column to be created:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
The formula for the Value column in pseudocode is:
if ['Is First?'] = 'Yes' then [Value] = [Inbound] + [Outbound]
else [Value] = [Previous Value] - [Outbound]
The ideal way of creating the Value column right now is a for loop that refers to the previous row's value (which I was not able to make work with shift). But since I will be applying this over a giant dataset, I want to use the numpy vectorization method instead.
for i in range(len(df)):
    if df.loc[i, "Is First?"] == "Yes":
        df.loc[i, "Value"] = df.loc[i, "Inbound"] + df.loc[i, "Outbound"]
    else:
        # previous row's Value minus this row's Outbound
        df.loc[i, "Value"] = df.loc[i - 1, "Value"] - df.loc[i, "Outbound"]
One way:
You may use np.subtract.accumulate with transform:
s = df['Is First?'].eq('Yes').cumsum()
df['value'] = ((df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'), df.Outbound)
               .groupby(s)
               .transform(np.subtract.accumulate))
Output:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
Another way:
Assign the value for the 'Yes' rows. Create a group id s to use for the groupby. Group and shift Outbound to calculate a cumulative sum, and subtract it from each group's 'Yes' value. Finally, use the result to fillna:
df['value'] = (df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'))
s = df['Is First?'].eq('Yes').cumsum()
s1 = df.value.ffill() - df.Outbound.shift(-1).groupby(s).cumsum().shift()
df['value'] = df.value.fillna(s1)
Output:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125.0
1 A 220 20 No 105.0
2 A 200 24 No 81.0
3 A 402 52 No 29.0
4 B 313 40 Yes 353.0
5 B 434 12 No 341.0
6 B 321 43 No 298.0
7 C 343 23 Yes 366.0
8 C 120 16 No 350.0
This is not a trivial task; the difficulty lies in the consecutive 'No' rows. It's necessary to group each run of 'No' rows together with its preceding 'Yes' row; the code below should do it:
col_sum = df.Inbound+df.Outbound
mask_no = df['Is First?'].eq('No')
mask_yes = df['Is First?'].eq('Yes')
consec_no = mask_yes.cumsum()
result = (col_sum.groupby(consec_no).transform('first')
          - df['Outbound'].where(mask_no, 0).groupby(consec_no).cumsum())
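The snippet above computes the series without writing it back; assign it with:

df['Value'] = result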
Use:
df.loc[df['Is First?'].eq('Yes'), 'Value'] = df['Inbound'] + df['Outbound']
g = df['Is First?'].eq('Yes').cumsum()
# forward-fill each group's 'Yes' value and subtract the per-group
# running total of Outbound (zeroed on the 'Yes' rows themselves)
df['Value'] = df['Value'].ffill() - df['Outbound'].mask(df['Is First?'].eq('Yes'), 0).groupby(g).cumsum()
Annotated numpy code:
## 1. line up values to sum
ob = -df["Outbound"].values
# get yes indices
fi, = np.where(df["Is First?"].values == "Yes")
# insert yes formula at yes positions
ob[fi] = df["Inbound"].values[fi] - ob[fi]
## 2. calculate block sums and subtract each from the
## first element of the **next** block
ob[fi[1:]] -= np.add.reduceat(ob,fi)[:-1]
# now simply taking the cumsum will reset after each block
df["Value"] = ob.cumsum()
Result:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350