Difference many columns from a baseline column in pandas - python

I have a baseline column (base) in a pandas data frame and I want to difference all other columns x* from this column while preserving two groups group1, group2:
The easiest way is to simply subtract column by column:
import pandas as pd

df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
                   'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [5, 6, 7, 8]})
df['diff_x1'] = df['x1'] - df['base']
df['diff_x2'] = df['x2'] - df['base']
group1 group2 base x1 x2 diff_x1 diff_x2
0 0 2 0 3 5 3 5
1 0 2 1 4 6 3 5
2 1 3 2 5 7 3 5
3 1 4 3 6 8 3 5
But I have hundreds of columns I need to do this for, so I'm looking for a more efficient way.

You can subtract a Series from a DataFrame column-wise using the sub method with axis=0, which saves you from doing the subtraction for each column individually:
to_sub = df.filter(regex='x.*')  # filter based on your actual logic
pd.concat([
    df,
    to_sub.sub(df.base, axis=0).add_prefix('diff_')
], axis=1)
# group1 group2 base x1 x2 diff_x1 diff_x2
#0 0 2 0 3 5 3 5
#1 0 2 1 4 6 3 5
#2 1 3 2 5 7 3 5
#3 1 4 3 6 8 3 5

Another way is to use df.drop(..., axis=1) to remove the columns you are not differencing, then call sub(..., axis=0) on everything that remains. This guarantees you catch all columns and preserve their order, and you don't even need a regex:
df_diff = df.drop(['group1','group2','base'], axis=1).sub(df['base'], axis=0).add_prefix('diff_')
diff_x1 diff_x2
0 3 5
1 3 5
2 3 5
3 3 5
Hence your full solution is:
pd.concat([df, df_diff], axis=1)
group1 group2 base x1 x2 diff_x1 diff_x2
0 0 2 0 3 5 3 5
1 0 2 1 4 6 3 5
2 1 3 2 5 7 3 5
3 1 4 3 6 8 3 5
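With hundreds of columns, a set difference over the column names covers the general case without writing a regex at all. A minimal sketch, assuming (as here) that the only columns to exclude are the group keys and the baseline:
import pandas as pd

df = pd.DataFrame({'group1': [0, 0, 1, 1], 'group2': [2, 2, 3, 4],
                   'base': [0, 1, 2, 3], 'x1': [3, 4, 5, 6], 'x2': [5, 6, 7, 8]})

# Difference every column except the group keys and the baseline itself.
cols = df.columns.difference(['group1', 'group2', 'base'])
out = pd.concat([df, df[cols].sub(df['base'], axis=0).add_prefix('diff_')], axis=1)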

Python-way time series transform

Good day!
There is the following time series dataset:
Time Value
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 4
11 4
12 5
I need to split and group data by value like this:
Value Time start Time end
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
How can I do this fast and in the most functional style in Python? Various libraries can be used, for example pandas or numpy.
Try with pandas, grouping on Value and aggregating Time:
df.groupby('Value')['Time'].agg(['min', 'max'])
We can use pandas for this.
Solution:
import pandas as pd

data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        'Value': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5]}
df = pd.DataFrame(data, columns=['Time', 'Value'])
res = df.groupby('Value').agg(['min', 'max'])
f_res = res.rename(columns={'min': 'Start Time', 'max': 'End Time'}, inplace=False)
print(f_res)
Output:
Time
Start Time End Time
Value
1 1 3
2 4 7
3 8 9
4 10 11
5 12 12
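On pandas 0.25 or newer, named aggregation can fold the rename into the agg call itself; a small variant of the same idea:
# Named aggregation: the keyword names become the output column names.
f_res = df.groupby('Value')['Time'].agg(**{'Start Time': 'min', 'End Time': 'max'})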
Another approach: first, get the count of each Value:
result = df.groupby('Value').agg(['count'])
result.columns = result.columns.get_level_values(1) #drop multi-index
result
count
Value
1 3
2 4
3 2
4 2
5 1
Then use cumcount to get the time start:
s = df.groupby('Value').cumcount()
result["time start"] = s[s == 0].index.tolist()
result
count time start
Value
1 3 0
2 4 3
3 2 7
4 2 9
5 1 11
Finally:
result["time start"] += 1
result["time end"] = result["time start"] + result['count'] - 1
result
count time start time end
Value
1 3 1 3
2 4 4 7
3 2 8 9
4 2 10 11
5 1 12 12
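Since the question asks for a functional style, plain itertools.groupby also works without pandas here, because equal values always sit in consecutive runs. A minimal sketch over the same data:
from itertools import groupby

data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
        'Value': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5]}

# Group consecutive (time, value) pairs on the value,
# keeping each run's first and last time.
runs = [(value, times[0], times[-1])
        for value, grp in groupby(zip(data['Time'], data['Value']), key=lambda tv: tv[1])
        for times in [[t for t, _ in grp]]]
print(runs)  # [(1, 1, 3), (2, 4, 7), (3, 8, 9), (4, 10, 11), (5, 12, 12)]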

How can I merge two dataframes that have the same columns but different row values? [duplicate]

This question already has an answer here:
What is the difference between combine_first and fillna?
(1 answer)
Closed 2 years ago.
I'm trying to put together two dataframes that have the same columns and number of rows, but one of them has nan in some rows where the other doesn't.
This example uses 2 dataframes, but I have to do this with around 50 and get them all merged into 1.
DF1:
id b c
0 1 15 1
1 2 nan nan
2 3 2 3
3 4 nan nan
DF2:
id b c
0 1 nan nan
1 2 26 6
2 3 nan nan
3 4 60 3
Desired output:
id b c
0 1 15 1
1 2 26 6
2 3 2 3
3 4 60 3
If you have
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.nan, index=[0, 1], columns=[0, 1])
df2 = pd.DataFrame([[0, np.nan], [0, np.nan]], index=[0, 1], columns=[0, 1])
df3 = pd.DataFrame([[np.nan, 1], [np.nan, 1]], index=[0, 1], columns=[0, 1])
Then you can update df1
for df in [df2, df3]:
    df1.update(df)
print(df1)
0 1
0 0.0 1.0
1 0.0 1.0
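Since the closed-as-duplicate note points at combine_first, and there are around 50 frames to merge, folding combine_first over the list with functools.reduce is a natural fit. A minimal sketch using frames shaped like the question's DF1 and DF2:
from functools import reduce
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4], 'b': [15, np.nan, 2, np.nan], 'c': [1, np.nan, 3, np.nan]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4], 'b': [np.nan, 26, np.nan, 60], 'c': [np.nan, 6, np.nan, 3]})

# combine_first fills nulls in the left frame from the right frame,
# and reduce folds it across the whole list of frames.
merged = reduce(lambda left, right: left.combine_first(right), [df1, df2])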

Combine data from two columns into one, except if second is already occupied in pandas

Say I have two columns in a data frame, one of which is incomplete.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, '', 6, '']})
df
Out:
a b
0 1 5
1 2
2 3 6
3 4
Is there a way to fill the empty values in column b with the corresponding values in column a, whilst leaving the rest of column b intact, such that you obtain the following without iterating over the column?
df
Out:
a b
0 1 5
1 2 2
2 3 6
3 4 4
I think you can use the apply method, but I am not sure. For reference, the dataset I'm dealing with is quite large (approx. 1 GB), which is why iteration (my first attempt) was not a good idea.
If blanks are empty strings, you could
In [165]: df.loc[df['b'] == '', 'b'] = df['a']
In [166]: df
Out[166]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
However, if your blanks are NaNs, you could use fillna
In [176]: df
Out[176]:
a b
0 1 5.0
1 2 NaN
2 3 6.0
3 4 NaN
In [177]: df['b'] = df['b'].fillna(df['a'])
In [178]: df
Out[178]:
a b
0 1 5.0
1 2 2.0
2 3 6.0
3 4 4.0
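As an aside, for NaN blanks Series.combine_first gives the same result as fillna here; a minimal equivalent:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, np.nan, 6, np.nan]})
# combine_first keeps b where it is non-null and falls back to a elsewhere.
df['b'] = df['b'].combine_first(df['a'])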
You can use np.where to evaluate df.b: where it is truthy (non-empty), keep its value; otherwise take df.a instead.
df.b = np.where(df.b, df.b, df.a)
df
Out[33]:
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use pd.Series.where with a boolean version of df.b, because '' resolves to False:
df.assign(b=df.b.where(df.b.astype(bool), df.a))
a b
0 1 5
1 2 2
2 3 6
3 4 4
You can use replace and ffill with axis=1:
df.replace('', np.nan).ffill(axis=1).astype(df.a.dtypes)
Output:
a b
0 1 5
1 2 2
2 3 6
3 4 4

Comparing values across columns and rows in two dataframes

I have two dataframes of different sizes and I would like to compare all the values in four different columns (two sets of two).
Essentially, I would like to find where df1['A'] == df2['A'] and df1['B'] == df2['B'], and return df1['C']'s value plus df2['C']'s value:
import pandas as pd
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 3], "B": [2, 5, 4, 7, 5], "C": [1, 2, 8, 0, 0]})
df2 = pd.DataFrame({"A": [1, 3, 2, 4, 8], "B": [5, 4, 5, 9, 1], "C": [1, 3, 4, 4, 6]})
df1:
A B C
0 1 2 1
1 2 5 2
2 3 4 8
3 4 7 0
4 3 5 0
...
df2:
A B C
0 1 5 1
1 3 4 3
2 2 5 4
3 4 9 4
4 8 1 6
...
in: df1['A'] == df2['A'] & where df1['B'] == df2['B']
df1['D'] = df1['C'] + df2['C']
out: df1:
A B C D
0 1 2 1 nan
1 2 5 2 6
2 3 4 8 11
3 4 7 0 nan
4 3 5 0 nan
My actual dataframes are much larger (around 120,000 rows, with values in both 'A' columns ranging from 1 to 700 and in 'B' from 1 to 300), so I know it might be a longer process.
You can merge the two DataFrames on columns A and B. Since you want to keep all values from df1, do a left merge of df1 and df2. The merged column C from df2 will be null wherever A and B don't match. After the merge, it's just a matter of renaming the merged column and doing a sum.
# Do a left merge, keeping df1 column names unchanged.
df1 = pd.merge(df1, df2, how='left', on=['A', 'B'], suffixes=('', '_2'))
# Add the two columns, fill locations that don't match with zero, and rename.
df1['C_2'] = df1['C_2'].add(df1['C']).fillna(0)
df1.rename(columns={'C_2': 'D'}, inplace=True)
You could first merge the two dataframes
In [145]: dff = pd.merge(df1, df2, on=['A', 'B'], how='left')
In [146]: dff
Out[146]:
A B C_x C_y
0 1 2 1 NaN
1 2 5 2 4
2 3 4 8 3
3 4 7 0 NaN
Then take a row-wise sum over the C_* columns with skipna=False, so that rows containing a null value yield NaN, and fill those NaN with zero.
In [147]: dff['C'] = dff.filter(regex='C_').sum(skipna=False, axis=1).fillna(0)
In [148]: dff
Out[148]:
A B C_x C_y C
0 1 2 1 NaN 0
1 2 5 2 4 6
2 3 4 8 3 11
3 4 7 0 NaN 0
Finally, you can drop or pick the required columns.
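If you want to keep NaN (rather than 0) where A and B don't match, exactly as in the desired output, you can simply skip the fillna step; a minimal sketch:
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3, 4, 3], "B": [2, 5, 4, 7, 5], "C": [1, 2, 8, 0, 0]})
df2 = pd.DataFrame({"A": [1, 3, 2, 4, 8], "B": [5, 4, 5, 9, 1], "C": [1, 3, 4, 4, 6]})

# Left-merge on the key columns, then add the two C columns;
# rows with no match keep NaN because NaN propagates through +.
merged = df1.merge(df2, how='left', on=['A', 'B'], suffixes=('', '_2'))
df1['D'] = merged['C'] + merged['C_2']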

Duplicating a Pandas DF N times

So right now, if I multiply a list, i.e. x = [1, 2, 3] * 2, I get x as [1, 2, 3, 1, 2, 3], but this doesn't work with pandas.
So if I want to duplicate a pandas DF, I have to make a column a list and multiply it:
col_x_duplicates = list(df['col_x']) * N
new_df = pd.DataFrame(col_x_duplicates, columns=['col_x'])
Then do a join on the original data:
pd.merge(new_df, df, on='col_x', how='left')
This duplicates the pandas DF N times. Is there an easier way, or even a quicker way?
Actually, since you want to duplicate the entire dataframe (and not each element), numpy.tile() may be better:
In [69]: import pandas as pd; import numpy as np
In [70]: arr = np.array([[1, 2, 3], [4, 5, 6]])
In [71]: arr
Out[71]:
array([[1, 2, 3],
[4, 5, 6]])
In [72]: df = pd.DataFrame(np.tile(arr, (5, 1)))
In [73]: df
Out[73]:
0 1 2
0 1 2 3
1 4 5 6
2 1 2 3
3 4 5 6
4 1 2 3
5 4 5 6
6 1 2 3
7 4 5 6
8 1 2 3
9 4 5 6
[10 rows x 3 columns]
In [75]: df = pd.DataFrame(np.tile(arr, (1, 3)))
In [76]: df
Out[76]:
0 1 2 3 4 5 6 7 8
0 1 2 3 1 2 3 1 2 3
1 4 5 6 4 5 6 4 5 6
[2 rows x 9 columns]
Here is a one-liner that makes a DataFrame with n copies of a DataFrame df:
n_df = pd.concat([df] * n)
Example:
df = pd.DataFrame(
    data=[[34, 'null', 'mark'], [22, 'null', 'mark'], [34, 'null', 'mark']],
    columns=['id', 'temp', 'name'],
    index=pd.Index([1, 2, 3], name='row')
)
n = 4
n_df = pd.concat([df] * n)
Then n_df is the following DataFrame:
id temp name
row
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
1 34 null mark
2 22 null mark
3 34 null mark
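If the repeated row index is unwanted, concat can renumber the rows as it stacks the copies; a short variant:
# ignore_index=True drops the repeated 1, 2, 3 index in favor of a fresh 0..n*len(df)-1 range.
n_df = pd.concat([df] * n, ignore_index=True)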
