Getting data from one column based on the value of other column - python

I am having trouble coming up with an algorithm for the following problem:
I have two data frames, df1 and df2 (the following are just an example):
import pandas as pd
df1 = pd.DataFrame({'Col1': [1, 7, 10, 50, 73, 80 ], 'Col2': [1,2,3,4,5,6]})
df2 = pd.DataFrame({'Col1': [0, 4, 10, 80], 'Col3': [7,6,8,9]})
As you can see, both have a Col1 column. The values do not always coincide, but they are in ascending order in both frames. I want to write a function that adds a new column to df1, let's call it Col4. The values in this column have to come from df2 according to these rules:
1) If df1 and df2 have the same value in Col1, the value in Col4 should be the corresponding value in Col3.
2) If they do not share the same value in Col1, Col4 should be the average of the Col3 values that correspond to the Col1 values immediately before and after it.
For example:
As df2 does not have a value in Col1 for 1, the first entry in Col4 should be the average between 7 and 6 (1 is between 0 and 4).
I don't know if I made myself very clear, but the final result for Col4 should be:
(7+6)/2, (6+8)/2, 8, (8+9)/2, (8+9)/2, 9
It would be nice to have a function because I will have to make this operation on many different data frames.
I know it is a weird problem, but thanks for the help!

You can accomplish what you want with pandas.merge_asof.
Merge df1 with df2 on Col1 in both directions, backward (the default) and forward, then average the two results. Here the two merges are concatenated column-wise into one DataFrame, with the Col3 columns renamed so they do not end up with the same name.
import pandas as pd
df = pd.concat([pd.merge_asof(df1, df2, on='Col1').rename(columns={'Col3': 'Col4_1'}),
pd.merge_asof(df1, df2, on='Col1', direction='forward')[['Col3']].rename(columns={'Col3': 'Col4_2'})], axis=1)
print(df)
# Col1 Col2 Col4_1 Col4_2
#0 1 1 7 6
#1 7 2 6 8
#2 10 3 8 8
#3 50 4 8 9
#4 73 5 8 9
#5 80 6 9 9
# Calculate the average you want, drop helper columns.
df['Col4'] = (df.Col4_1 + df.Col4_2)/2
df.drop(columns=['Col4_1', 'Col4_2'], inplace=True)
print(df)
# Col1 Col2 Col4
#0 1 1 6.5
#1 7 2 7.0
#2 10 3 8.0
#3 50 4 8.5
#4 73 5 8.5
#5 80 6 9.0
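Since the question asks for a reusable function, the same two-direction merge can be wrapped up as below. This is only a sketch: the function name add_neighbor_average and its parameter names are made up for illustration, and it assumes both frames are already sorted by the key column, as merge_asof requires. A key in df1 that lies outside the range of df2 has no neighbour on one side, so its result would be NaN.

```python
import pandas as pd

def add_neighbor_average(left, right, on="Col1", value="Col3", new="Col4"):
    """Hypothetical helper: average of the nearest backward and forward matches."""
    lookup = right[[on, value]]
    # Nearest key in `right` that is <= the key in `left`.
    backward = pd.merge_asof(left[[on]], lookup, on=on, direction="backward")[value]
    # Nearest key in `right` that is >= the key in `left`.
    forward = pd.merge_asof(left[[on]], lookup, on=on, direction="forward")[value]
    out = left.copy()
    # Assign by position so a non-default index on `left` is preserved.
    out[new] = ((backward + forward) / 2).to_numpy()
    return out

df1 = pd.DataFrame({'Col1': [1, 7, 10, 50, 73, 80], 'Col2': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'Col1': [0, 4, 10, 80], 'Col3': [7, 6, 8, 9]})
result = add_neighbor_average(df1, df2)
print(result)
```

When the key exists in both frames, the backward and forward matches are the same row, so the average is just the matching Col3 value.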

Related

How to output each missing row when comparing two CSV using pandas in python

I am looking to compare two CSVs. Both CSVs will have nearly identical data, however the second CSV will have 2 identical rows that CSV 1 does not have. I would like the program to output both of those 2 rows so I can see which row is present in CSV 2, but not CSV 1, and how many times that row is present.
Here is my current logic:
import csv
import pandas as pd
import numpy as np
data1 = {"Col1": [0,1,1,2],
"Col2": [1,2,2,3],
"Col3": [5,2,1,1],
"Col4": [1,2,2,3]}
data2 = {"Col1": [0,1,1,2,4,4],
"Col2": [1,2,2,3,4,4],
"Col3": [5,2,1,1,4,4],
"Col4": [1,2,2,3,4,4]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)
print(df)
Here is my current outcome:
Col1 Col2 Col3 Col4
0 4 4 4 4
Here is my desired outcome:
Col1 Col2 Col3 Col4
0 4 4 4 4
1 4 4 4 4
As of right now, it only outputs the row once even though CSV 2 has the row twice. What can I do so that it not only shows the missing row, but also shows it once for each time it appears in the second CSV? Thanks in advance!
There is almost always a built-in pandas function meant to do what you want that will be better than trying to re-invent the wheel.
df = df2[~df2.isin(df1).all(axis=1)]
# OR df = df2[df2.ne(df1).all(axis=1)]
print(df)
Output:
Col1 Col2 Col3 Col4
4 4 4 4 4
5 4 4 4 4
You can use:
df2[~df2.eq(df1).all(axis=1)]
Result:
Col1 Col2 Col3 Col4
4 4 4 4 4
5 4 4 4 4
Or (if you want the index to be 0 and 1):
df2[~df2.eq(df1).all(axis=1)].reset_index(drop=True)
Result:
Col1 Col2 Col3 Col4
0 4 4 4 4
1 4 4 4 4
N.B.
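For this data you can also use df2[df2.ne(df1).all(axis=1)] instead of df2[~df2.eq(df1).all(axis=1)]. The two are not equivalent in general: ne(...).all(axis=1) selects a row only if every column differs, whereas ~eq(...).all(axis=1) selects it if any column differs.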
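Note that the isin, eq and ne expressions above compare the two frames row by row, aligned on the index, so they rely on the shared rows sitting at the same positions in both files. If the rows could be in a different order, one alternative is a left merge with indicator=True, sketched here under the assumption that both frames have the same columns. It is a swapped-in technique, not the method used in the answers above.

```python
import pandas as pd

df1 = pd.DataFrame({"Col1": [0, 1, 1, 2],
                    "Col2": [1, 2, 2, 3],
                    "Col3": [5, 2, 1, 1],
                    "Col4": [1, 2, 2, 3]})
df2 = pd.DataFrame({"Col1": [0, 1, 1, 2, 4, 4],
                    "Col2": [1, 2, 2, 3, 4, 4],
                    "Col3": [5, 2, 1, 1, 4, 4],
                    "Col4": [1, 2, 2, 3, 4, 4]})

# De-duplicate df1 so the left merge never multiplies rows of df2.
keys = df1.drop_duplicates()
merged = df2.merge(keys, how="left", on=list(df2.columns), indicator=True)

# Rows flagged "left_only" exist in df2 but nowhere in df1; duplicates are kept.
missing = (merged.loc[merged["_merge"] == "left_only"]
                 .drop(columns="_merge")
                 .reset_index(drop=True))
print(missing)
```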

How to conditionally remove first N rows of a dataframe out of 2 dataframes

I have 2 dataframes as following:
d = {'col1': [1, 2, 3, 4], 'col2': ["2010-01-01", "2011-01-01", "2012-01-01", "2013-01-01"]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 2010-01-01
1 2 2011-01-01
2 3 2012-01-01
3 4 2013-01-01
The other dataframe may look like the following:
d2 = {'col1': [11, 22, 33, 44], 'col2': ["2011-01-01", "2011-05-01", "2012-02-01", "2012-06-01"]}
df2 = pd.DataFrame(data=d2)
df2
col1 col2
0 11 2011-01-01
1 22 2011-05-01
2 33 2012-02-01
3 44 2012-06-01
In both dataframes, col2 includes dates (in String format, not as a date object) and these dates are placed in ascending order.
In my use case, both of these dataframes are supposed to start with the same value in col2.
The first col2 value of df is "2010-01-01".
The first col2 value of df2 is "2011-01-01".
In this particular example, since "2010-01-01" does not exist in df2 and "2011-01-01" is the second row item of col2 in df dataframe, the first row of df needs to be removed so that both dataframes start with the same date String value in col2. So, df is supposed to look as following after the change:
col1 col2
1 2 2011-01-01
2 3 2012-01-01
3 4 2013-01-01
(Please note that the index is not reset after the change.)
But, this could have been the opposite and we could need to remove row(s) from df2 instead to be able to make these dataframes start with the same value in col2.
In some cases, we may need to remove several rows, not just one, if the match is not found in the second row either.
Corner case handling:
The logic should also handle such cases (without throwing errors) where it is not possible to make these 2 dataframes start with the same col2 value, in case there is no match. In that case, all rows from both dataframes are supposed to be removed.
Is there an elegant way of developing this logic without writing too many lines of code?
First find the earliest col2 value that occurs in both DataFrames. Then, in each DataFrame, keep that row and every row after it by comparing col2 with this value and applying Series.cummax to the resulting boolean mask (rows may need to be removed from either DataFrame). If there is no common value, both DataFrames end up empty.
both = df.merge(df2, on='col2')['col2'].min()
df = df[df['col2'].eq(both).cummax()]
print (df)
col1 col2
1 2 2011-01-01
2 3 2012-01-01
3 4 2013-01-01
df2 = df2[df2['col2'].eq(both).cummax()]
print (df2)
col1 col2
0 11 2011-01-01
1 22 2011-05-01
2 33 2012-02-01
3 44 2012-06-01
# No match: none of the df2.col2 values occur in df.col2
d2 = {'col1': [11, 22, 33, 44], 'col2': ["2011-01-21", "2011-05-01",
"2012-02-01", "2003-01-01"]}
df2 = pd.DataFrame(data=d2)
both = df.merge(df2, on='col2')['col2'].min()
df = df[df['col2'].eq(both).cummax()]
print (df)
Empty DataFrame
Columns: [col1, col2]
Index: []
df2 = df2[df2['col2'].eq(both).cummax()]
print (df2)
Empty DataFrame
Columns: [col1, col2]
Index: []
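The two filtering steps above can be collected into one helper. The following is a sketch only: the name align_starts and its key parameter are invented for illustration, and it assumes the dates are ISO-formatted strings (so that string order matches date order) sorted in ascending order in both frames.

```python
import pandas as pd

def align_starts(a, b, key="col2"):
    """Hypothetical helper: trim both frames to start at their first common key."""
    # Smallest key present in both frames; NaN when there is no overlap.
    first_common = a.merge(b, on=key)[key].min()
    # cummax turns the mask True from the first match onwards.
    keep_a = a[key].eq(first_common).cummax()
    keep_b = b[key].eq(first_common).cummax()
    return a[keep_a], b[keep_b]

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': ["2010-01-01", "2011-01-01", "2012-01-01", "2013-01-01"]})
df2 = pd.DataFrame({'col1': [11, 22, 33, 44],
                    'col2': ["2011-01-01", "2011-05-01", "2012-02-01", "2012-06-01"]})
a, b = align_starts(df, df2)

# No common date at all: both results are expected to be empty.
df3 = pd.DataFrame({'col1': [11, 22], 'col2': ["2011-01-21", "2011-05-01"]})
empty_a, empty_b = align_starts(df, df3)
```

The original index labels are kept, as the question requires.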

Find name of column which is non nan

I have a Dataframe defined like :
df1 = pd.DataFrame({"col1":[1,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
"col2":[np.nan,3,np.nan,4,np.nan,np.nan,np.nan,5,6],
"col3":[np.nan,np.nan,7,np.nan,np.nan,8,9,np.nan, np.nan]})
I want to transform it into a DataFrame like:
df2 = pd.DataFrame({"col_name":['col1','col2','col3','col2','col1',
'col3','col3','col2','col2'],
"value":[1,3,7,4,2,8,9,5,6]})
If possible, can we reverse this process too? By that I mean convert df2 into df1.
I don't want to go through the DataFrame iteratively as it becomes too computationally expensive.
You can stack it:
out = (df1.stack().astype(int).droplevel(0)
.rename_axis('col_name').reset_index(name='value'))
Output:
col_name value
0 col1 1
1 col2 3
2 col3 7
3 col2 4
4 col1 2
5 col3 8
6 col3 9
7 col2 5
8 col2 6
To go from out back to df1, you could pivot:
out1 = out.reset_index().pivot(index='index', columns='col_name', values='value')
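For reference, here is the whole round trip as one runnable sketch. Two details are assumptions on my part rather than part of the answer: an explicit dropna() is added after stack() as a precaution in case the installed pandas version keeps NaN entries when stacking, and the pivot call uses keyword arguments and relies on the existing row index instead of a reset one.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "col1": [1, np.nan, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan],
    "col2": [np.nan, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, 6],
    "col3": [np.nan, np.nan, 7, np.nan, np.nan, 8, 9, np.nan, np.nan],
})

# Wide -> long: one (column name, value) pair per row of df1.
long_df = (df1.stack().dropna().astype(int).droplevel(0)
              .rename_axis("col_name").reset_index(name="value"))

# Long -> wide: the row index of long_df says which row each value belongs to.
wide_df = (long_df.pivot(columns="col_name", values="value")
                  .rename_axis(columns=None))

print(long_df)
print(wide_df)
```

The reverse step only works because every original row holds exactly one non-NaN value, so the row position in the long frame identifies the original row.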

python/pandas: update a column based on a series holding sums of that same column

I have a dataframe with a non-unique col1 like the following
col1 col2
0 a 1
1 a 1
2 a 2
3 b 3
4 b 3
5 c 2
6 c 2
Some of the values of col1 repeat lots of times and others not so. I'd like to take the bottom (80%/50%/10%) and change the value to 'other' ahead of plotting.
I've got a series which contains the codes in col1 (as the index) and the amount of times that they appear in the df in descending order by doing the following:
df2 = df.groupby(['col1']).size().sort_values(ascending=False)
I've also got my cut-off point (bottom 80%)
cutOff = round(len(df2)/5)
I'd like to update col1 in df with the value 'others' when col1 appears after the cutOff in the index of the series df2.
I don't know how to go about checking and updating. I figured that the best way would be to do a groupby on col1 and then loop through, but it starts to fall apart, should I create a new groupby object? Or do I call this as an .apply() for each row? Can you update a column that is being used as the index for a dataframe? I could do with some help about how to start.
edit to add:
So if the 'b's in col1 were not in the top 20% most populous values in col1 then I'd expect to see:
col1 col2
0 a 1
1 a 1
2 a 2
3 others 3
4 others 3
5 c 2
6 c 2
import pandas as pd
# Rebuild the example frame from the question.
data = [["a", 1],
["a", 1],
["a", 2],
["b", 3],
["b", 3],
["c", 2],
["c", 2]]
df = pd.DataFrame(data, columns=["col1", "col2"])
print(df)
df2 = df.groupby(['col1']).size().sort_values(ascending=False)
print(df2)
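cutOff = round(len(df2) / 5)
# iloc[cutOff + 1:] leaves the first cutOff + 1 values untouched;
# use iloc[cutOff:] instead to keep exactly cutOff values.
others = df2.iloc[cutOff + 1:]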
print(others)
result = df.copy()
result.loc[result["col1"].isin(others.index), "col1"] = "others"
print(result)
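The same idea can be packaged as a function so the cut-off fraction (80%/50%/10%) becomes a parameter. This is a sketch under assumptions: the name lump_rare and its arguments are hypothetical, it keeps exactly the top keep_fraction of distinct values (at least one), and ties in the counts are broken by whatever order value_counts returns.

```python
import pandas as pd

def lump_rare(frame, column="col1", keep_fraction=0.2, label="others"):
    """Hypothetical helper: relabel all but the most frequent values of a column."""
    counts = frame[column].value_counts()  # sorted by frequency, descending
    n_keep = max(1, round(len(counts) * keep_fraction))
    rare = counts.index[n_keep:]           # everything past the cut-off
    out = frame.copy()
    out.loc[out[column].isin(rare), column] = label
    return out

df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "c", "c"],
                   "col2": [1, 1, 2, 3, 3, 2, 2]})
result = lump_rare(df, keep_fraction=0.2)
print(result)
```

With three distinct values and keep_fraction=0.2, only the single most frequent value ("a") is kept and both "b" and "c" become "others".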

How to add a list of values to a pandas column

I have a pandas dataframe.
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
col1 col2
0 1 3
1 2 4
I want to add the list lst=[10, 20] element-wise to 'col1' to have the following dataframe.
col1 col2
0 11 3
1 22 4
How to do that?
If you want to edit the column in place, you could do:
df['col1'] += lst
after which df will be,
col1 col2
0 11 3
1 22 4
Similarly, other types of mathematical operations are possible, such as,
df['col1'] *= lst
df['col1'] /= lst
If you want to create a new dataframe after addition,
df1 = df.copy()
df1['col1'] = df['col1'].add(lst, axis=0) # df['col1'].add(lst) outputs a series, df['col1']+lst also works
Now df1 is:
col1 col2
0 11 3
1 22 4
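One detail worth keeping in mind: a plain list is added by position, while a pandas Series is aligned on index labels first. The sketch below uses a made-up index of [10, 11] on the frame purely to make the difference visible.

```python
import pandas as pd

# Same data as above, but with a non-default index (an assumption for the demo).
df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=[10, 11])
lst = [10, 20]

# A list is matched element by element, regardless of the index labels.
by_position = df["col1"] + lst

# A Series carries its own index (0, 1), which shares no label with (10, 11),
# so every aligned entry is NaN.
by_label = df["col1"] + pd.Series(lst)

print(by_position)
print(by_label)
```

So if the values to add already live in a Series with a different index, converting them with .tolist() or .to_numpy() first gives the positional behaviour shown in the answer.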
