Auto-join a pandas DataFrame to update it

I would like to perform a self-join (auto-join) on a pandas DataFrame to update it.
Here is the situation, I have a first df with three columns:
In, Out & Date. It means that at a specific date the item "Out" is replaced by "In".
import pandas as pd
import numpy as np
from datetime import datetime
data = [[1,10,"2017-01-01"],[2,10,"2017-01-01"],[10,11,"2017-06-01"],[4,14,"2017-04-01"],[5,14,"2017-12-01"]]
label = ["Out","In","Date"]
df = pd.DataFrame(data,columns=label)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Out In Date
0 1 10 2017-01-01
1 2 10 2017-01-01
2 10 11 2017-06-01
3 4 14 2017-04-01
4 5 14 2017-12-01
For example it means here that as of first of Jan 2017, item #1 is replaced by item #10.
The trick is that as of june 2017, this item #10 is also replaced by item #11. So that #1 becomes #10 that becomes #11.
Now I would like to populate a final table that gives the final relationships up to a certain date.
If date = 2017-08-01, I would get this table
date = pd.to_datetime("2017-08-01")
data = [[1,11],[2,11],[10,11],[4,14]]
df_final = pd.DataFrame(data,columns=["Out","In"])
print(df_final)
Out In
0 1 11
1 2 11
2 10 11
3 4 14
Would you know how to perform such an auto join?
Thanks,

You can iterate over the rows and use .loc to follow each replacement chain to its end.
import pandas as pd
import numpy as np
from datetime import datetime
data = [[1,10,"2017-01-01"],[2,10,"2017-01-01"],[10,11,"2017-06-01"],[4,14,"2017-04-01"],[5,14,"2017-12-01"],[11,18,"2017-12-01"]]
label = ["Out","In","Date"]
df = pd.DataFrame(data,columns=label)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Out In Date
0 1 10 2017-01-01
1 2 10 2017-01-01
2 10 11 2017-06-01
3 4 14 2017-04-01
4 5 14 2017-12-01
5 11 18 2017-12-01
L = []
for row in df.iterrows():
    x = row[1]['Out']
    y = row[1]['In']
    while y in df.Out.values.tolist():
        y = df.loc[df['Out'] == y, 'In'].iloc[0]
    L.append((x, y))
df2 = pd.DataFrame(L, columns=['Out', 'In'])
print(df2)
   Out  In
0    1  18
1    2  18
2   10  18
3    4  14
4    5  14
5   11  18
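Note that the loop above ignores the Date column, while the question asks for the relationships up to a certain date. A sketch that filters by the as-of date first and also guards against cycles (the dict-based chain walk is my own variant, not taken from the answer):

```python
import pandas as pd

data = [[1, 10, "2017-01-01"], [2, 10, "2017-01-01"], [10, 11, "2017-06-01"],
        [4, 14, "2017-04-01"], [5, 14, "2017-12-01"]]
df = pd.DataFrame(data, columns=["Out", "In", "Date"])
df["Date"] = pd.to_datetime(df["Date"])

date = pd.to_datetime("2017-08-01")
sub = df[df["Date"] <= date]  # keep only replacements effective by `date`

# Build a direct mapping Out -> In, then follow each chain to its end.
mapping = dict(zip(sub["Out"], sub["In"]))
rows = []
for out, in_ in mapping.items():
    seen = {out}
    while in_ in mapping and in_ not in seen:  # guard against cycles
        seen.add(in_)
        in_ = mapping[in_]
    rows.append((out, in_))

df_final = pd.DataFrame(rows, columns=["Out", "In"])
print(df_final)
```

With the question's data this reproduces the expected table for 2017-08-01, since the (5, 14) row dated December is excluded.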

Related

How to stack two columns of a pandas dataframe in python

I want to stack two columns on top of each other
So I have Left and Right values in one column each, and want to combine them into a single one. How do I do this in Python?
I'm working with Pandas Dataframes.
Basically from this
Left Right
0 20 25
1 15 18
2 10 35
3 0 5
To this:
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
It doesn't matter how they are combined as I will plot it anyway, and the new column name also doesn't matter because I can rename it.
You can create a list of the columns, calling squeeze on each so the data doesn't try to align on column names, and then call concat on that list. Passing ignore_index=True creates a new index; otherwise you'd get the column names repeated as index values:
cols = [df[col].squeeze() for col in df]
pd.concat(cols, ignore_index=True)
Many options, stack, melt, concat, ...
Here's one:
>>> df.melt(value_name='New Name').drop(columns='variable')
New Name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
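Since stack is mentioned above but not shown, here is a sketch with it; the result is interleaved row by row rather than stacked column-on-column, which the question says is acceptable:

```python
import pandas as pd

df = pd.DataFrame({"Left": [20, 15, 10, 0], "Right": [25, 18, 35, 5]})

# stack() flattens row by row, interleaving Left and Right values;
# reset_index drops the (row, column) MultiIndex it produces.
out = df.stack().reset_index(drop=True).rename("New Name").to_frame()
print(out)
```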
You can also use np.ravel:
import numpy as np
out = pd.DataFrame(np.ravel(df.values.T), columns=['New name'])
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Update
If you have only 2 cols:
out = pd.concat([df['Left'], df['Right']], ignore_index=True).to_frame('New name')
print(out)
# Output
New name
0 20
1 15
2 10
3 0
4 25
5 18
6 35
7 5
Solution with unstack
df2 = df.unstack()
# recreate index
df2.index = np.arange(len(df2))
A solution with np.ravel on the selected columns.
# Your data
import numpy as np
import pandas as pd
df = pd.DataFrame({"Left":[20,15,10,0], "Right":[25,18,35,5]})
# Ravel the selected columns row by row (interleaves Left and Right)
df2 = pd.DataFrame({"New Name":np.ravel(df[["Left","Right"]])})
df2
New Name
0 20
1 25
2 15
3 18
4 10
5 35
6 0
7 5
I ended up using this solution, seems to work fine
df1 = dfTest[['Left']].copy()
df2 = dfTest[['Right']].copy()
df2.columns=['Left']
df3 = pd.concat([df1, df2],ignore_index=True)

"Dynamic" column selection

The problem:
The input table is, let's say, a merge of calls and bills, with these columns: the TIME of the call and one column per billing month. The idea is to build a table holding the last 3 bills the person paid, counting back from the time of the call, thereby putting the bills in the context of the call.
The Example input and output:
# INPUT:
# df
# TIME ID 2019-08-01 2019-09-01 2019-10-01 2019-11-01 2019-12-01
# 2019-12-01 1 1 2 3 4 5
# 2019-11-01 2 6 7 8 9 10
# 2019-10-01 3 11 12 13 14 15
# EXPECTED OUTPUT:
# df_context
# TIME ID 0 1 2
# 2019-12-01 1 3 4 5
# 2019-11-01 2 7 8 9
# 2019-10-01 3 11 12 13
EXAMPLE INPUT CREATION:
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
The code I have got so far:
# HOW DOES ONE GET THE col_to FOR EVERY ROW?
col_to = df.columns.get_loc(df['TIME'].astype(str).values[0])
col_from = col_to - 3
df_context = pd.DataFrame()
df_context = df_context.append(pd.DataFrame(df.iloc[:, col_from : col_to].values))
df_context["TIME"] = df["TIME"]
cols = df_context.columns.tolist()
df_context = df_context[cols[-1:] + cols[:-1]]
df_context.head()
OUTPUT of my code:
# OUTPUTS:
# TIME 0 1 2
# 0 2019-12-01 2 3 4 should be 3 4 5
# 1 2019-11-01 7 8 9 all good
# 2 2019-10-01 12 13 14 should be 11 12 13
What my code seems to lack is a for loop or two for the first two lines, to do what I want it to do, but I just can't believe there isn't a better solution than the one I am concocting right now.
I would suggest the following steps so that you can avoid dynamic column selection altogether.
Convert the wide table (reference date as columns) to a long table (reference date as rows)
Compute the difference in months between time of the call TIME and reference date
Select only those with difference >= 0 and difference < 3
Format the output table (add a running number, pivot it) according to your requirements
# Initialize dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],
})
# Convert the wide table to a long table by melting the date columns
# Name the new date column as REF_TIME, and the bill column as BILL
date_cols = ['2019-08-01', '2019-09-01', '2019-10-01', '2019-11-01', '2019-12-01']
df = df.melt(id_vars=['TIME','ID'], value_vars=date_cols, var_name='REF_TIME', value_name='BILL')
# Convert TIME and REF_TIME to datetime type
df['TIME'] = pd.to_datetime(df['TIME'])
df['REF_TIME'] = pd.to_datetime(df['REF_TIME'])
# Find out difference between TIME and REF_TIME
df['TIME_DIFF'] = (df['TIME'] - df['REF_TIME']).dt.days
df['TIME_DIFF'] = (df['TIME_DIFF'] / 30).round()
# Keep only the preceding 3 months (including the month = TIME)
selection = (
(df['TIME_DIFF'] < 3) &
(df['TIME_DIFF'] >= 0)
)
# Apply selection, sort the columns and keep only columns needed
df_out = (
df[selection]
.sort_values(['TIME','ID','REF_TIME'])
[['TIME','ID','BILL']]
)
# Add a running number, lets call this BILL_NO
df_out = df_out.assign(BILL_NO = df_out.groupby(['TIME','ID']).cumcount() + 1)
# Pivot the output table to the format needed
df_out = df_out.pivot(index=['ID','TIME'], columns='BILL_NO', values='BILL')
Output:
BILL_NO 1 2 3
ID TIME
1 2019-12-01 3 4 5
2 2019-11-01 7 8 9
3 2019-10-01 11 12 13
Here is my (newbie's) solution; it will only work if the dates in the column names are in ascending order:
# Initializing Dataframe
df = pd.DataFrame({
'TIME': ['2019-12-01','2019-11-01','2019-10-01'],
'ID': [1,2,3],
'2019-08-01': [1,6,11],
'2019-09-01': [2,7,12],
'2019-10-01': [3,8,13],
'2019-11-01': [4,9,14],
'2019-12-01': [5,10,15],})
cols = list(df.columns)
new_df = pd.DataFrame([], columns=["0","1","2"])
# Iterating over rows, selecting desired slices and appending them to a new DataFrame:
for i in range(len(df)):
    searched_date = df.iloc[i, 0]
    searched_column_index = cols.index(searched_date)
    searched_row = df.iloc[[i], searched_column_index-2:searched_column_index+1]
    mapping_column_names = {searched_row.columns[0]: "0", searched_row.columns[1]: "1", searched_row.columns[2]: "2"}
    searched_df = searched_row.rename(mapping_column_names, axis=1)
    new_df = pd.concat([new_df, searched_df], ignore_index=True)
new_df = pd.merge(df.iloc[:, 0:2], new_df, left_index=True, right_index=True)
new_df
Output:
TIME ID 0 1 2
0 2019-12-01 1 3 4 5
1 2019-11-01 2 7 8 9
2 2019-10-01 3 11 12 13
Anyway I think @Toukenize's solution is better since it doesn't require iterating.

How do I reorder by column totals?

For example, how do I reorder the rows and columns of the following data by their sums, after appending a row of column totals?
import pandas as pd
data=[['fileA',47,15,3,5,7],['fileB',33,13,4,7,2],['fileC',25,17,9,3,5],
['fileD',25,7,1,4,2],['fileE',19,15,3,8,4], ['fileF',11,17,8,4,5]]
df = pd.DataFrame(data, columns=['filename','rows_cnt','cols_cnt','col_A','col_B','col_C'])
print(df)
filename rows_cnt cols_cnt col_A col_B col_C
0 fileA 47 15 3 5 7
1 fileB 33 13 4 7 2
2 fileC 25 17 9 3 5
3 fileD 25 7 1 4 2
4 fileE 19 15 3 8 4
5 fileF 11 17 8 4 5
df.loc[6]= df.sum(0)
filename rows_cnt cols_cnt col_A col_B col_C
0 fileA 47 15 3 5 7
1 fileB 33 13 4 7 2
2 fileC 25 17 9 3 5
3 fileD 25 7 1 4 2
4 fileE 19 15 3 8 4
5 fileF 11 17 8 4 5
6 fileA... 160 84 28 31 25
I made an image of the question.
How do I reorder the red frame in this image by the standard?
df.reindex([2,5,0,4,1,3,6], axis='index')
Is the only way to create the index manually like this?
data=[['fileA',47,15,3,5,7],['fileB',33,13,4,7,2],['fileC',25,17,9,3,5],
['fileD',25,7,1,4,2],['fileE',19,15,3,8,4], ['fileF',11,17,8,4,5]]
df = pd.DataFrame(data, columns=['filename','rows_cnt','cols_cnt','col_A','col_B','col_C'])
df = df.sort_values(by='cols_cnt', axis=0, ascending=False)
df.loc[6]= df.sum(0)
# keep the original index numbering
df = df.reset_index(drop=False)
# move the filename column out of the data: sorting by a row (axis=1)
# cannot compare str and integer values
df = df.set_index('filename', drop=True)
df = df.sort_values(by=df.index[-1], axis=1, ascending=False)
# restore the original index
df = df.reset_index(drop=False)
df = df.set_index('index', drop=True)
# keep the first three columns fixed
fixed_cols = ['filename', 'rows_cnt', 'cols_cnt']
# new order: the fixed columns first, then the remaining (sorted) columns
new_cols = fixed_cols + df.columns.drop(fixed_cols).tolist()
df[new_cols]
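For comparison, reordering just the col_* block by its column totals can be done in one pass; this is a sketch assuming the first three columns from the question stay fixed in front:

```python
import pandas as pd

data = [['fileA',47,15,3,5,7],['fileB',33,13,4,7,2],['fileC',25,17,9,3,5],
        ['fileD',25,7,1,4,2],['fileE',19,15,3,8,4],['fileF',11,17,8,4,5]]
df = pd.DataFrame(data, columns=['filename','rows_cnt','cols_cnt','col_A','col_B','col_C'])

# Order the col_* columns by their column sums, descending.
value_cols = ['col_A', 'col_B', 'col_C']
ordered = df[value_cols].sum().sort_values(ascending=False).index.tolist()
df = df[['filename', 'rows_cnt', 'cols_cnt'] + ordered]
print(df.columns.tolist())
```

This avoids mixing the string filename column into the sort entirely, so no index juggling is needed.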

How to fill an alphanumeric series in a column in a pandas dataframe?

I have certain pandas dataframe which has a structure like this
A B C
1 2 2
2 2 2
...
I want to create a new column called ID and fill it with an alphanumeric series which looks somewhat like this
ID A B C
GT001 1 2 2
GT002 2 2 2
GT003 2 2 2
...
I know how to fill it with either letters or numerals, but I couldn't figure out whether there is a "pandas native" method that would allow me to fill an alphanumeric series. What would be the best way to do this?
Welcome to Stack Overflow!
If you want a custom ID, then you can build a list with the desired values:
ids = []
for i in range(1, df.shape[0] + 1):  # df.shape[0] is the number of rows
    ids.append(f'GT{i:03d}')  # f-string; 03d pads with leading zeros
df['ID'] = ids
And if you want to set that as an index do df.set_index('ID', inplace=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({'player': np.linspace(0,20,20)})
n = 21
data = ['GT' + '0'*(3-len(str(i))) + str(i) for i in range(1, n)]
df['ID'] = data
Output:
player ID
0 0.000000 GT001
1 1.052632 GT002
2 2.105263 GT003
3 3.157895 GT004
4 4.210526 GT005
5 5.263158 GT006
6 6.315789 GT007
7 7.368421 GT008
8 8.421053 GT009
9 9.473684 GT010
10 10.526316 GT011
11 11.578947 GT012
12 12.631579 GT013
13 13.684211 GT014
14 14.736842 GT015
15 15.789474 GT016
16 16.842105 GT017
17 17.894737 GT018
18 18.947368 GT019
19 20.000000 GT020
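If you prefer to avoid the explicit loop, here is a vectorized sketch using str.zfill (assuming, as in the examples above, that three-digit zero-padding is wanted):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": [2, 2, 2], "C": [2, 2, 2]})

# Build "GT" + zero-padded row position without a Python-level loop.
ids = "GT" + pd.Series(range(1, len(df) + 1), index=df.index).astype(str).str.zfill(3)
df.insert(0, "ID", ids)  # place ID as the first column
print(df)
```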

Transform rows to columns by the values of two rows in pandas

I have a large dataset which has two columns Name, Value and it looks like this:
import pandas as pd
data = [['code',10],['classe',12],['series','B'], ['code',12],['classe',1],
['series','C'],['code',16],['classe',18],['series','A']]
df1 = pd.DataFrame(data,columns=['Name','Value'])
df1
Output
Name Value
0 code 10
1 classe 12
2 series B
3 code 12
4 classe 1
5 series C
6 code 16
7 classe 18
8 series A
And I want some thing like that:
code classe series
0 10 10 B
1 12 1 C
2 16 18 A
In my dataset this pattern repeats N times, and I want to transform it into three columns: code, classe, series.
Thanks for your help in advance!
You can accomplish this using .pivot
df2 = df1.pivot(columns='Name', values='Value')
pd.concat([df2[series].dropna().reset_index(drop=True) for series in df2], axis=1)
Output
classe code series
0 12 10 B
1 1 12 C
2 18 16 A
Moreover, if the order of the data changes, you still get an output:
import pandas as pd
data = [['code',10],['classe',12],['classe', 14], ['series','B'], ['series', 'C'], ['code',12],['classe',1],
['series','C'],['code',16],['classe',18],['series','A']]
df1 = pd.DataFrame(data,columns=['Name','Value'])
df1
Name Value
0 code 10
1 classe 12
2 classe 14 #Added classe
3 series B
4 series C #Added Series
5 code 12
6 classe 1
7 series C
8 code 16
9 classe 18
10 series A
The output will be:
classe code series
0 12 10 B
1 14 12 C
2 1 16 C
3 18 NaN A
Option 1
pd.concat with a groupby should do it.
pd.concat(
    [pd.Series(v.values, name=k) for k, v in df1.groupby('Name')['Value']],
    axis=1
)
classe code series
0 12 10 B
1 1 12 C
2 18 16 A
Option 2
pivot
Flaky pivot hack, don't count on it! This solution assumes values inside Name alternate regularly - code, classe, series, code, classe, series, ... and so on. Won't work otherwise.
df1.assign(Index=df1.index // 3).pivot(index='Index', columns='Name', values='Value')
Name classe code series
Index
0 12 10 B
1 1 12 C
2 18 16 A
Create a new key using cumsum, then unstack:
df1['new']=(df1.Name=='code').cumsum()
df1.set_index(['new','Name']).Value.unstack()
Out[80]:
Name classe code series
new
1 12 10 B
2 1 12 C
3 18 16 A
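For reference, a variant that numbers each occurrence of a Name with groupby(...).cumcount() and pivots on that counter; it behaves like Option 1 but goes through pivot (a sketch of my own, not from the answers above):

```python
import pandas as pd

data = [['code',10],['classe',12],['series','B'],['code',12],['classe',1],
        ['series','C'],['code',16],['classe',18],['series','A']]
df1 = pd.DataFrame(data, columns=['Name','Value'])

# Number each occurrence of a Name (0, 1, 2, ...) and pivot on that counter,
# so the k-th code/classe/series land on the same row.
df1['occ'] = df1.groupby('Name').cumcount()
out = df1.pivot(index='occ', columns='Name', values='Value')
print(out[['code', 'classe', 'series']])
```

Unlike the cumsum trick, this does not assume every group starts with 'code', only that the k-th values of each Name belong together.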
