Python Pandas dataframe

I have one dataframe (df1) like the following:
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
I have another very big dataframe (df2) which has a column named Absolute_Time. Absolute_Time has the same format as ATime of df1. So what I want to do is, for example, for all Absolute_Time values that lie in the range between row 0 and row 1 of ATime of df1, I want to subtract row 0 of Difference of df1, and so on.

Here's an attempt to accomplish what you might be looking for, starting with:
print(df1)
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
Next, create a new DataFrame with random times within the range of df1:
import numpy as np
import pandas as pd
from random import randrange

df2 = pd.DataFrame({'Absolute Time':[randrange(start=df1.ATime.iloc[0], stop=df1.ATime.iloc[-1]) for i in range(100)]})
df2 = df2.sort_values('Absolute Time').reset_index(drop=True)
np.searchsorted provides you with the index positions where df2 should be inserted in df1 (for the columns in question):
df2.index = np.searchsorted(df1.ATime.values, df2.loc[:, 'Absolute Time'].values)
Assigning the new index and merging produces a new DataFrame. Forward-filling the missing Difference values allows the subtraction in the next step:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='left').fillna(method='ffill').dropna().astype(int)
df['Absolute Time Adjusted'] = df['Absolute Time'].sub(df.Difference)
print(df.head())
ATime ETime Difference Absolute Time Absolute Time Adjusted
1 1444911144979 1715672 1444909429307 1444911018916 1589609
1 1444911144979 1715672 1444909429307 1444911138087 1708780
2 1444911285683 1856374 1444909429309 1444911138087 1708778
3 1444911432742 2003430 1444909429312 1444911303233 1873921
3 1444911432742 2003430 1444909429312 1444911359690 1930378
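As a side note (not part of the original answer, just a sketch assuming the same df1 and df2 as above), pd.merge_asof with direction='forward' performs the same matching as searchsorted followed by the forward fill:
matched = pd.merge_asof(df2.sort_values('Absolute Time'),
                        df1[['ATime', 'Difference']].sort_values('ATime'),
                        left_on='Absolute Time', right_on='ATime',
                        direction='forward')
matched['Absolute Time Adjusted'] = matched['Absolute Time'] - matched['Difference']
Rows whose Absolute Time falls past the last ATime get no match (NaN) and can be dropped with dropna().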

Related

changing index of 1 row in pandas

I have the below df, built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise, please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table:
Since you can't insert the row and push the others back directly, a clever trick is to create a new ordering column:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
0
9 9
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
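If you prefer the reindex route the asker mentioned, building the new label order explicitly also works; a minimal sketch, assuming string week labels as noted in the comments above:
# put '53' first, keep every other week label in its existing order
new_order = ['53'] + [w for w in df.index if w != '53']
df = df.reindex(new_order)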

Groupby count per category per month (Current month vs Remaining past months) in separate columns in pandas

Let's say I have the following dataframe:
I am trying to get something like this.
I was thinking to maybe use the rolling function and have separate dataframes for each count type (current month and past 3 months) and then merge them based on ID.
I am new to python and pandas, so please bear with me if it's a simple question. I am still learning :)
EDIT:
@furas so I started with calculating the cumulative sum for all the counts as separate columns
df['f_count_cum'] = df.groupby(["ID"])['f_count'].transform(lambda x: x.expanding().sum())
df['t_count_cum'] = df.groupby(["ID"])['t_count'].transform(lambda x: x.expanding().sum())
and then just get the current month df by
df_current = df[df.index == max(df.index)]
df_past_month = df[df.index == max(df.index - 1)]
and then just merge the two dataframes based on the ID?
I am not sure if it's correct, but this is my first take on this.
A few assumptions, looking at the input sample:
The Month index is of datetime64[ns] type. If not, use the line below to typecast it.
df['Month'] = pd.to_datetime(df.Month)
The Month column is the index. If not, set it as the index.
df = df.set_index('Month')
The last month of the df is taken as the current month and the first 3 months as the 'past 3 months'. If not, modify the last and first calls accordingly in df1 and df2 respectively.
Code
df1 = df.last('M').groupby('ID').sum().reset_index().rename(
columns={'f_count':'f_count(current month)',
't_count':'t_count(current month)'})
df2 = df.first('3M').groupby('ID').sum().reset_index().rename(
columns={'f_count':'f_count(past 3 months)',
't_count':'t_count(past 3 months)'})
df = pd.merge(df1, df2, on='ID', how='inner').reindex(columns = [ 'ID',
'f_count(current month)', 'f_count(past 3 months)',
't_count(current month)','t_count(past 3 months)'
])
Output
ID f_count(current month) f_count(past 3 months) t_count(current month) t_count(past 3 months)
0 A 3 13 8 14
1 B 3 5 7 5
2 C 1 3 2 4
Another version of the same code, if you prefer a function and a single statement:
def get_df(freq):
    if freq == 'M':
        return df.last('M').groupby('ID').sum().reset_index()
    return df.first('3M').groupby('ID').sum().reset_index()
df = pd.merge(get_df('M').rename(
columns={'f_count':'f_count(current month)',
't_count':'t_count(current month)'}),
get_df('3M').rename(
columns={'f_count':'f_count(past 3 months)',
't_count':'t_count(past 3 months)'}),
on='ID').reindex(columns = [ 'ID',
'f_count(current month)', 'f_count(past 3 months)',
't_count(current month)','t_count(past 3 months)'])
EDIT:
For the previous two months before the current month (we can use different combinations of the first and last functions as needed):
df2 = df.last('3M').first('2M').groupby('ID').sum().reset_index().rename(
columns={'f_count':'f_count(past 3 months)',
't_count':'t_count(past 3 months)'})
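To sanity-check the first/last selection on a toy frame, a minimal sketch could look like the following (the Month dates, IDs and counts are made up purely for illustration and are not the asker's data):
import pandas as pd

# hypothetical sample frame with a monthly DatetimeIndex named 'Month'
idx = pd.to_datetime(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'])
df = pd.DataFrame({'ID': ['A', 'B', 'A', 'B'],
                   'f_count': [1, 2, 3, 4],
                   't_count': [5, 6, 7, 8]},
                  index=pd.Index(idx, name='Month'))

current = df.last('M').groupby('ID').sum()    # rows from the last calendar month only
past = df.first('3M').groupby('ID').sum()     # rows from the first three months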

pandas take pairwise difference between a lot of columns

I have a pandas dataframe with a lot of actual columns (column names ending with _act) and projected columns (column names ending with _proj). Other than actual and projected, there's also a date column. Now I want to add an error column for each pair (in that order, i.e., beside its projected column). Sample dataframe:
date a_act a_proj b_act b_proj .... z_act z_proj
2020 10 5 9 11 .... 3 -1
.
.
What I want:
date a_act a_proj a_error b_act b_proj b_error .... z_act z_proj z_error
2020 10 5 5 9 11 -2 .... 3 -1 4
.
.
What's the best way to achieve this, as I have a lot of actual and projected columns?
You could do:
df = df.set_index('date')
# create new columns
columns = df.columns[df.columns.str.endswith('act')].str.replace('act', 'error')
# compute differences
diffs = pd.DataFrame(data=df.values[:, ::2] - df.values[:, 1::2], index=df.index, columns=columns)
# concat
res = pd.concat((df, diffs), axis=1)
# reorder columns
res = res.reindex(sorted(res.columns), axis=1)
print(res)
Output
a_act a_error a_proj b_act b_error b_proj z_act z_error z_proj
date
2020 10 5 5 9 -2 11 3 4 -1
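A more explicit alternative (just a sketch, assuming every *_act column has a matching *_proj column and date is already the index as above) is to loop over the base column names:
bases = [c[:-4] for c in df.columns if c.endswith('_act')]
for base in bases:
    df[base + '_error'] = df[base + '_act'] - df[base + '_proj']
# interleave the columns as act, proj, error for each base name
df = df[[c for base in bases for c in (base + '_act', base + '_proj', base + '_error')]]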

Efficient way to check if 2 columns of DataFrame are subsets of each other

I have 2 DataFrames, each with a column whose values are sets of 8-digit integers.
df1 (contains around 200k rows)
id s1
0 0 {43649632, 95799329, 40649644, 23335890, 81779...
1 1 {69900026, 74441229}
2 2 {85195648, 55750338, 98936902, 82000264, 43544...
3 3 {21916700, 13627806}
4 4 {62929026, 38592365, 44179790, 38355127}
df2 (contains around 900k rows)
id s1
0 0 {58209736, 25405713, 28691898, 94682562}
1 1 {81089732, 82343077}
2 2 {59692896, 33234306, 40445479, 18728345, 24464...
3 3 {71406042, 69900026, 74441229}
4 4 {62929026}
I want to know the FASTEST way to find the pairs of ids from df1 and df2 that match ONE OF THESE conditions:
df1.s1 is a subset of df2.s1
OR
df2.s1 is a subset of df1.s1
For example,
id=1 of df1 is a subset of id=3 of df2, so (1, 3) is a valid pair
id=4 of df2 is a subset of id=4 of df1, so (4, 4) is a valid pair
I have tried the code below, but it's going to take about 20 hours:
from tqdm import tqdm

id_pairs = []
for i in tqdm(list(df2.itertuples(index=False))):
    for j in df1.itertuples(index=False):
        if i.s1.issubset(j.s1) or j.s1.issubset(i.s1):
            id_pairs.append((i.id, j.id))
Is there a faster or more efficient way to do this?
You could do a cartesian join and then apply the condition:
df1["key"] = 0
df2["key"] = 0
merged = df1.merge(df2, how="outer", on="key", suffixes=("_1", "_2"))

def subset(row):
    # after the merge the overlapping id/s1 columns carry the _1/_2 suffixes
    if row.s1_1.issubset(row.s1_2) or row.s1_2.issubset(row.s1_1):
        return (row.id_1, row.id_2)
    return None

merged.apply(subset, axis=1).dropna()
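On pandas 1.2 or newer the dummy key can be dropped in favour of how="cross". A sketch (bear in mind that with roughly 200k x 900k rows the full cartesian product will not fit in memory, so either variant is only practical on smaller chunks):
merged = df1.merge(df2, how="cross", suffixes=("_1", "_2"))
mask = [a.issubset(b) or b.issubset(a) for a, b in zip(merged.s1_1, merged.s1_2)]
id_pairs = merged.loc[mask, ["id_1", "id_2"]]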

How to get unmatching data from 2 dataframes based on one column. (Pandas)

I have 2 data frames; sample output is here.
My code for getting those and formatting the date column is here:
First df:
csv_data_df = pd.read_csv(os.path.join(path_to_data+'\\Data\\',appendedfile))
csv_data_df['Date_Formatted'] = pd.to_datetime(csv_data_df['DATE']).dt.strftime('%Y-%m-%d')
csv_data_df.head(3)
Second df:
new_Data_df = pd.read_csv(io.StringIO(response.decode('utf-8')))
new_Data_df['Date_Formatted'] = pd.to_datetime(new_Data_df['DATE']).dt.strftime('%Y-%m-%d')
new_Data_df.head(3)
I want to construct a third dataframe where only the rows with un-matching dates from the second dataframe go into the third one. Is there any method to do that? The date-formatted column can be seen in the screenshot.
You could set the index of both dataframes to your desired join column, then
use df1.combine_first(df2). For your specific example here, that could look like the below line.
csv_data_df.set_index('Date_Formatted').combine_first(new_Data_df.set_index('Date_Formatted')).reset_index()
Ex:
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'), index=list(range(1, 6)))
df2 = pd.DataFrame(np.random.randn(8, 3), columns=list('abc'))
df
Out[10]:
a b c
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
df2
Out[11]:
a b c
0 1.732251 -1.977803 0.720292
1 0.048229 1.125277 1.016083
2 -1.684013 2.136061 0.553824
3 -0.022957 1.237249 0.236923
4 -0.998079 1.714126 1.291391
5 0.955464 -0.049673 1.629146
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
df.combine_first(df2)
Out[13]:
a b c
0 1.732251 -1.977803 0.720292
1 -1.357517 -0.925239 0.974483
2 0.362472 -1.881582 1.263237
3 0.785508 0.227835 -0.604377
4 -0.386585 -0.511583 3.080297
5 0.660516 -1.393421 1.363900
6 0.865864 1.137120 1.117207
7 -0.126944 1.003784 -0.180811
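If the goal is strictly the rows of the second dataframe whose dates have no match in the first, a boolean isin filter is a simpler alternative; a minimal sketch, reusing the Date_Formatted columns from the code above:
unmatched_df = new_Data_df[~new_Data_df['Date_Formatted'].isin(csv_data_df['Date_Formatted'])]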
