Trying to get the cross of 2 series of a pandas table - python

I am stuck with an issue on a massive pandas table. I would like to get a boolean to check where 2 series cross.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8]})
I would like to add one column to my DataFrame to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 1, 2, 8],
                   'C': [0, -1, 0, 1]})
So basically, C should be:
0 when there is no cross between series B and A
-1 when series B crosses down through series A
1 when series B crosses up through series A
I need a vectorized calculation because my real table has more than one million rows.
Thank you

You can compute the relative position of the 2 columns with lt, then convert to integer and compute the diff:
m = df['A'].lt(df['B'])
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')
output:
   A   B  C
0  1  10  0
1  2   1 -1
2  3   2  0
3  4   8  1
Visual of A/B: (plot omitted)
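To reproduce the visual, a minimal sketch assuming matplotlib is installed:
import matplotlib.pyplot as plt

# the crossings show up where the two lines intersect
df[['A', 'B']].plot()
plt.show()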

Related

Fill panel data with ranked timepoints in pandas

Given a DataFrame that represents instances of called customers:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({"customer_id" : [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5]})
The data is ordered by time such that every customer is a time-series and every customer has different timestamps. Thus I need a column that consists of the ranked timepoints:
df_2 = pd.DataFrame({"customer_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5],
                     "call_nr": [0, 1, 2, 0, 1, 0, 1, 2, 3, 0, 0, 1]})
After trying different approaches I came up with this to create call_nr:
np.concatenate([np.arange(df_1["customer_id"].value_counts().loc[i]) for i in df_1["customer_id"].unique()])
It works, but I doubt this is best practice. Is there a better solution?
A simpler solution would be to groupby your 'customer_id' and use cumcount:
>>> df_1.groupby('customer_id').cumcount()
0     0
1     1
2     2
3     0
4     1
5     0
6     1
7     2
8     3
9     0
10    0
11    1
dtype: int64
which you can assign back as a column in your dataframe.
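For example, assuming the new column should be named call_nr as in the question:

df_1['call_nr'] = df_1.groupby('customer_id').cumcount()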

Automatically create multiple python datasets based on column names

I have a huge data set with columns like "Eas_1", "Eas_2", and so on up to "Eas_40", and "Nor_1" to "Nor_40". I want to automatically create multiple separate data sets, each consisting of all the columns that end with the same number (grouped by the number in the column name), with that number pasted as the values of a new column (Bin).
My data frame:
df = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Eas_2": [4, 5, 10, 2],
    "Nor_1": [9, 7, 9, 2],
    "Nor_2": [10, 8, 10, 3],
    "Error_1": [2, 5, 1, 6],
    "Error_2": [5, 0, 3, 2],
})
I don't know how to create the Bin column and paste the column name values, but I could separate the data sets manually like this:
df1 = df.filter(regex='_1')
df2 = df.filter(regex='_2')
This would take a lot of effort for me, plus I would have to change the script every time I get new data. This is how I imagine the end result:
df1 = pd.DataFrame({
    "Eas_1": [3, 4, 9, 1],
    "Nor_1": [9, 7, 9, 2],
    "Error_1": [2, 5, 1, 6],
    "Bin": [1, 1, 1, 1],
})
Thanks in advance!
You can extract the suffixes with .str.extract, then groupby on those:
suffixes = df.columns.str.extract(r'(\d+)$', expand=False)

for label, data in df.groupby(suffixes, axis=1):
    print('-'*10, label, '-'*10)
    print(data)
Note: to collect your dataframes, you can do:
dfs = [data for _, data in df.groupby(suffixes, axis=1)]
# access the second dataframe
dfs[1]
Output:
---------- 1 ----------
   Eas_1  Nor_1  Error_1
0      3      9        2
1      4      7        5
2      9      9        1
3      1      2        6
---------- 2 ----------
   Eas_2  Nor_2  Error_2
0      4     10        5
1      5      8        0
2     10     10        3
3      2      3        2
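The question also asks for a Bin column holding the column-name number; a minimal sketch of one way to build it on top of the same groupby, assuming the suffix should be stored as an integer:

dfs = []
for label, data in df.groupby(suffixes, axis=1):
    part = data.copy()
    part['Bin'] = int(label)  # paste the column-name number as the Bin value
    dfs.append(part)

# dfs[0] now matches the df1 shown in the question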

Average of two time series in pandas

Ultimate goal description
My goal is to compute the average of two time series (red and green) stored in pandas DataFrames. However, while both time series have the same columns, they differ in their precise time points. What I want to implement is a function average which computes the average time series from the two given series such that if a value is missing for a particular time point, it is interpolated. For example:
import pandas as pd
green_df = pd.DataFrame({'A': [4, 2, 5], 'B': [1, 2, 3]}, index=[1, 3, 6])
red_df = pd.DataFrame({'A': [4, 2.5, 8, 2, 4], 'B': [4, 2, 2, 4, 1]}, index=[1, 2, 4, 5, 6])
average_grey_df = pd.DataFrame({'A': [4, 2.7, 3.75, 5.5, 3, 4.5], 'B': [...]},
                               index=[1, 2, 3, 4, 5, 6])
assert average_grey_df.equals(average(green_df, red_df))
It is obvious when displayed graphically (values shown for column A, but the same should be done with all columns; precise values are just illustrative): (figure omitted)
Approach
So far I was not able to find a completely working solution. I was thinking about dividing it into three steps:
(1) extend both time series by time points from the other time series such that missing data are nan
           A | ...              A | ...
        -------              -------
        1 |  4 |             1 |  4  |
        2 | nan|             2 | 2.5 |
green:  3 |  2 |     red:    3 | nan |
        4 | nan|             4 |  8  |
        5 | nan|             5 |  2  |
        6 |  5 |             6 |  4  |
(2) fill the missing data by interpolating both dataframes (direct usage of the DataFrame interpolate method)
(3) finally compute the average of these two time series as follows:
averages = (green_df.stack() + red_df.stack()) / 2
average_grey_df = averages.unstack()
Additionally, the dropna method can be used to drop the created NaNs. Maybe there is a better method I haven't discovered.
Question
I was not able to figure out how to compute part (1) at all. I checked methods like join, merge and concat with their various examples, but none of them seems to do the job. Any suggestions? I am also open to other approaches.
Thank you
To perform step (1) you can do this:
# union of the indexes
union_idx = green_df.index.union(red_df.index)

# reindex with the union
green_df = green_df.reindex(union_idx)
red_df = red_df.reindex(union_idx)

# the interpolation
green_df = green_df.interpolate(method='linear', limit_direction='forward', axis=0)
red_df = red_df.interpolate(method='linear', limit_direction='forward', axis=0)

grey_df = pd.concat([green_df, red_df])
grey_df = grey_df.groupby(level=0).mean()
I get the following (I didn't pay attention to displaying the correct colors): (plot omitted)
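A minimal sketch to reproduce that plot, assuming matplotlib is installed:

import matplotlib.pyplot as plt

green_df['A'].plot(color='green')
red_df['A'].plot(color='red')
grey_df['A'].plot(color='grey')
plt.show()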
You can merge the two dfs. From there, you can interpolate the NA values:
green_df = pd.DataFrame({'A': [4, 2, 5], 'B': [1, 2, 3]}, index=[1, 3, 6])
red_df = pd.DataFrame({'A': [4, 2.5, 8, 2, 4], 'B': [4, 2, 2, 4, 1]}, index=[1, 2, 4, 5, 6])
combined_df = pd.merge(green_df, red_df, suffixes=('_green', '_red'),
                       left_index=True, right_index=True, how='outer')
combined_df = combined_df.interpolate()
combined_df['A_avg'] = combined_df[["A_green", "A_red"]].mean(axis=1)
combined_df['B_avg'] = combined_df[["B_green", "B_red"]].mean(axis=1)
These can then be plotted using .plot():
combined_df[['A_green', 'A_red', 'A_avg']].plot(color=['green', 'red', 'gray'])

How to get the n most frequent values from each column in pandas

I know how to get the most frequent value of each column in a dataframe using "mode". For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
   A
0  2
But I am unable to find the "n" most frequent values of each column of a dataframe. For example, for the mentioned dataframe, I would like the following output for n=2:
   A
0  2
1  1
Any pointers?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
#    A
# 0  2
# 1  1
Use value_counts and select the index values by indexing. This works on each column separately, so you need apply or a dict comprehension with the DataFrame constructor. Casting to Series is necessary for a more general solution, in case a column has fewer than N distinct values (the missing entries become NaN), e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1]})

N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in df.columns})
print (df)
   A    B
0  2  1.0
1  1  NaN
For a more general solution, select only the numeric columns first with select_dtypes:
import numpy as np

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})

N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))

Or:

N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
   A    B
0  2  1.0
1  1  NaN
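If you also want the top values of the non-numeric columns, value_counts works on object columns too; a minimal sketch, assuming df is the original A/B/C frame from above (note that every value in C occurs exactly once here, so which two values appear is arbitrary):

N = 2
out = pd.DataFrame({c: pd.Series(df[c].value_counts().index[:N]) for c in df.columns})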

Dataframe failed to mask rows due to string values

I wanted to use column values in one csv file to mask rows in another csv,
as in:
df6 = pd.read_csv('py_all1a.csv')    # file with multiple columns
df7 = pd.read_csv('artexclude1.csv') # file with multiple columns
#
# csv df6 col 1 has the same header and data type as col 8 in df7.
# I want to mask rows in df6 that have a matching col value to any
# in df7. The data in each column is a text value (single word).
#
mask = df6.iloc[:,1].isin(df7.iloc[:,8])
df6[~mask].to_csv('py_all1b.csv', index=False)
On that last line, I tried [mask] with the tilde (resulting in no change to the df6 file, py_all1b.csv) and without the tilde (producing the file with just the column headers).
An answer using a specific data set is provided below, but at first it did not work because there were inconsistencies between the text values, namely, one entry had a space while another did not.
The below answer is correct, and I have added a paragraph to show how the text issue can also be resolved.
Try converting to a set first:
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
This ensures your comparison is against the values themselves.
Example
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
#     0   1   2
# 0   1   2   3
# 1   4   5   6
# 2   7   8   9
# 3  10  11  12

df2 = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
#    0  1  2
# 0  1  2  3
# 1  1  2  3
# 2  1  2  3
# 3  1  2  3

mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
#    0  1  2
# 0  1  2  3
With strings
It still works:
df1 = pd.DataFrame([['a', 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
df2 = pd.DataFrame([['a', 2, 3], ['a', 2, 3], ['a', 2, 3], ['a', 2, 3]])

mask = df1.iloc[:,0].isin(set(df2.iloc[:,0]))
df1[mask]
#    0  1  2
# 0  a  2  3
When you are dealing with string data, there may be problems with whitespace that can cause matches to be missed. As described in this answer, you may need to instead use:
df6 = pd.read_csv('py_all1a.csv', skipinitialspace=True) # file with multiple columns
df7 = pd.read_csv('artexclude1.csv', skipinitialspace=True) # file with multiple columns
mask = df6.iloc[:,1].isin(set(df7.iloc[:,8]))
df6[~mask].to_csv('py_all1b.csv', index=False)
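If the stray whitespace is not only after the delimiter, a hedged alternative is to strip the values on both sides before comparing:

mask = df6.iloc[:,1].str.strip().isin(set(df7.iloc[:,8].str.strip()))
df6[~mask].to_csv('py_all1b.csv', index=False)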
