I am trying to use the entries from df1 to cap the amounts in df2, then add them up by type and summarize the result in df3. I'm not sure how to get there; the for loop using iterrows below is my best guess, but it's incomplete.
Code:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Caps':['25','50','100']})
df2 = pd.DataFrame({'Amounts': ['45', '25', '65', '35', '85', '105', '80'],
                    'Type': ['a', 'b', 'b', 'c', 'a', 'b', 'd']})
df3 = pd.DataFrame({'Type': ['a' ,'b' ,'c' ,'d']})
df1['Caps'] = df1['Caps'].astype(float)
df2['Amounts'] = df2['Amounts'].astype(float)
for index1, row1 in df1.iterrows():
    for index2, row2 in df3.iterrows():
        df3[str(row1['Caps']) + 'limit'] = df2['Amounts'].where(
            df2['Type'] == row2['Type']).where(
            df2['Amounts'] <= row1['Caps'], row1['Caps']).sum()
# My ideal output would be this:
df3 = pd.DataFrame({'Type': ['a', 'b', 'c', 'd'],
                    'Total': ['130', '195', '35', '80'],
                    '25limit': ['50', '75', '25', '25'],
                    '50limit': ['95', '125', '35', '50'],
                    '100limit': ['130', '190', '35', '80'],
                    })
Output:
>>> df3
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
Use numpy to compare all Amounts values with Caps by broadcasting to a 2d array a, then create a DataFrame with the constructor, sum per column level, transpose with DataFrame.T and add the suffix with DataFrame.add_suffix.
For the aggregated Total column, use DataFrame.insert to place the GroupBy.sum result as the first column:
df1['Caps'] = df1['Caps'].astype(int)
df2['Amounts'] = df2['Amounts'].astype(int)
am = df2['Amounts'].to_numpy()
ca = df1['Caps'].to_numpy()
#pandas below 0.24
#am = df2['Amounts'].values
#ca = df1['Caps'].values
a = np.where(am <= ca[:, None], am[None, :], ca[:, None])
df1 = (pd.DataFrame(a, columns=df2['Type'], index=df1['Caps'])
         .sum(axis=1, level=0).T.add_suffix('limit'))
df1.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
df1 = df1.reset_index().rename_axis(None, axis=1)
print (df1)
Type Total 25limit 50limit 100limit
0 a 130 50 95 130
1 b 195 75 125 190
2 c 35 25 35 35
3 d 80 25 50 80
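Note that DataFrame.sum(axis=1, level=0) is deprecated in newer pandas and removed in 2.0. I believe an equivalent there is to group the transposed frame by its index level instead; a sketch that assumes a, df2 and the original df1['Caps'] from above:
# same result on newer pandas: group the duplicated 'Type' labels after transposing
res = (pd.DataFrame(a, columns=df2['Type'], index=df1['Caps'])
         .T.groupby(level=0).sum()     # replaces .sum(axis=1, level=0).T
         .add_suffix('limit'))
res.insert(0, 'Total', df2.groupby('Type')['Amounts'].sum())
res = res.reset_index().rename_axis(None, axis=1)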
Here is my solution without numpy; however, it is about two times slower than @jezrael's solution (10.5 ms vs. 5.07 ms).
limcols = df1.Caps.to_list()
df2 = df2.reindex(columns=["Amounts", "Type"] + limcols)
df2[limcols] = df2[limcols].transform(
    lambda sc: np.where(df2.Amounts.le(sc.name), df2.Amounts, sc.name))
# Summations:
g=df2.groupby("Type")
df3= g[limcols].sum()
df3.insert(0,"Total", g.Amounts.sum())
# Renaming columns:
c_dic={ lim:f"{lim:.0f}limit" for lim in limcols}
df3= df3.rename(columns=c_dic).reset_index()
# Cleanup:
#df2=df2.drop(columns=limcols)
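For reference, the same capping-and-summing idea can also be written with Series.clip per cap. This is a sketch of my own, not part of either answer above, assuming the sample df1, df2 and df3 from the question:
# cap Amounts at each value in Caps, then sum per Type
res = df3.set_index('Type')
res['Total'] = df2.groupby('Type')['Amounts'].sum()
for cap in df1['Caps']:
    res[f'{cap:.0f}limit'] = df2['Amounts'].clip(upper=cap).groupby(df2['Type']).sum()
res = res.reset_index()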
I want to know if this is possible with pandas:
From df2, I want to create new1 and new2.
new1 is the latest date I can find in df1 that matches on both columns A and B.
new2 is the latest date I can find in df1 that matches on column A but not on B.
I managed to get new1 but not new2.
Code:
import pandas as pd
d1 = [['1/1/19', 'xy','p1','54'], ['1/1/19', 'ft','p2','20'], ['3/15/19', 'xy','p3','60'],['2/5/19', 'xy','p4','40']]
df1 = pd.DataFrame(d1, columns = ['Name', 'A','B','C'])
d2 =[['12/1/19', 'xy','p1','110'], ['12/10/19', 'das','p10','60'], ['12/20/19', 'fas','p50','40']]
df2 = pd.DataFrame(d2, columns = ['Name', 'A','B','C'])
d3 = [['12/1/19', 'xy','p1','110','1/1/19','3/15/19'], ['12/10/19', 'das','p10','60','0','0'], ['12/20/19', 'fas','p50','40','0','0']]
dfresult = pd.DataFrame(d3, columns = ['Name', 'A','B','C','new1','new2'])
Updated!
IIUC, you want to add two columns to df2: new1 and new2.
First I modified two things:
df1 = pd.DataFrame(d1, columns = ['Name1', 'A','B','C'])
df2 = pd.DataFrame(d2, columns = ['Name2', 'A','B','C'])
df1.Name1 = pd.to_datetime(df1.Name1)
I renamed Name to Name1 and Name2 for ease of use, and turned Name1 into a real date so we can get the maximum date per group.
Then we merge df2 with df1 on the A column. This gives us the rows that match on that column:
aux = df2.merge(df1, on='A')
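With the sample data, only the 'xy' rows of df1 match, so aux has one row per matching df1 entry; the overlapping B and C columns get _x (from df2) and _y (from df1) suffixes. Roughly (my illustration, not part of the original answer):
print(aux[['A', 'B_x', 'B_y', 'Name1']])
#     A B_x B_y      Name1
# 0  xy  p1  p1 2019-01-01
# 1  xy  p1  p3 2019-03-15
# 2  xy  p1  p4 2019-02-05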
Then, where the B column is the same in both dataframes, we pull Name1 out of it:
df2['new1'] = df2.index.map(aux[aux.B_x==aux.B_y].Name1).fillna(0)
If they're different, we get the maximum date for every A group:
df2['new2'] = df2.A.map(aux[aux.B_x!=aux.B_y].groupby('A').Name1.max()).fillna(0)
Output:
Name2 A B C new1 new2
0 12/1/19 xy p1 110 2019-01-01 00:00:00 2019-03-15 00:00:00
1 12/10/19 das p10 60 0 0
2 12/20/19 fas p50 40 0 0
You can do this by:
- a standard merge based on A
- removing all entries where the B values match
- sorting by date
- dropping duplicates (one row per df2 entry), keeping the last date (n.b. this assumes the dates are in date format, not strings!)
- merging back on Name
Thus:
source = df1.copy() # renamed
v = df2.merge(source, on='A', how='left') # get all values where df2.A == source.A
v = v[v['B_x'] != v['B_y']] # drop entries where B values are the same
nv = v.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nv[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new2', 'Name_x': 'Name'}),
          on='Name', how='left') # keeps non-matching, consider inner
This yields:
Out[94]:
Name A B C new2
0 12/1/19 xy p1 110 3/15/19
1 12/10/19 das p10 60 NaN
2 12/20/19 fas p50 40 NaN
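For completeness, new1 could be produced the same way by keeping the rows where the B values do match. A sketch of my own, again assuming the Name columns have been converted to real dates so the sort is correct:
w = df2.merge(source, on='A', how='left')
w = w[w['B_x'] == w['B_y']]  # keep entries where B values are the same
nw = w.sort_values(by=['Name_y']).drop_duplicates(subset=['Name_x'], keep='last')
df2.merge(nw[['Name_y', 'Name_x']].rename(columns={'Name_y': 'new1', 'Name_x': 'Name'}),
          on='Name', how='left')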
My initial thought was to do something like the below. Sadly, it is not elegant. Generally, this way of determining a value is frowned upon, mostly because it fails to scale and gets especially slow with large data.
def find_date(row, source=df1):  # renamed df1 to source
    t = source[source['B'] != row['B']]
    t = t[t['A'] == row['A']]
    if t.empty:  # no df1 row matches on A with a different B
        return pd.NaT
    return t.sort_values(by='Name', ascending=False)['Name'].iloc[0]
df2['new2'] = df2.apply(find_date, axis=1)
I'll try to keep this short and to the point (with simplified data). I have a table of data with four columns (keep in mind more columns may be added later), none of which are unique on their own, but the three columns 'ID', 'ID2', 'DO' must be unique as a group. I will bring this table into one dataframe, and the updated version of the table into another dataframe.
If df is the 'original data' and df2 is the 'updated data', is this the most accurate/efficient way to find what changes occurred to the original data?
import pandas as pd
#Sample Data:
df = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
>>> df
DATA DO ID ID2
0 ORIG 3 546 AUSER
1 ORIG 6 107 BUSER
2 ORIG 8 478 CUSER
3 ORIG 4 546 AUSER
4 ORIG 6 478 EUSER
>>> df2
DATA DO ID ID2
0 CHANGE 6 107 BUSER
1 CHANGE 3 546 AUSER
2 CHANGE 2 123 DUSER
3 ORIG 4 546 AUSER
4 CHANGE 3 123 FUSER
#Compare Dataframes
merged = df2.merge(df, indicator=True, how='outer')
#Split the merged comparison into:
# - original records that will be updated or deleted
# - new records that will be inserted or update the original record.
df_original = merged.loc[merged['_merge'] == 'right_only'].drop(columns=['_merge']).copy()
df_new = merged.loc[merged['_merge'] == 'left_only'].drop(columns=['_merge']).copy()
#Create another merge to determine if the new records will either be updates or inserts
check = pd.merge(df_new,df_original, how='left', left_on=['ID','ID2','DO'], right_on = ['ID','ID2','DO'], indicator=True)
in_temp = check[['ID','ID2','DO']].loc[check['_merge']=='left_only']
upd_temp = check[['ID','ID2','DO']].loc[check['_merge']=='both']
#Create dataframes for each Transaction:
# - removals: Remove records based on provided key values
# - updates: Update entire record based on key values
# - inserts: Insert entire record
removals = pd.concat([df_original[['ID','ID2','DO']],df_new[['ID','ID2','DO']],df_new[['ID','ID2','DO']]]).drop_duplicates(keep=False)
updates = df2.loc[(df2['ID'].isin(upd_temp['ID']))&(df2['ID2'].isin(upd_temp['ID2']))&(df2['DO'].isin(upd_temp['DO']))].copy()
inserts = df2.loc[(df2['ID'].isin(in_temp['ID']))&(df2['ID2'].isin(in_temp['ID2']))&(df2['DO'].isin(in_temp['DO']))].copy()
results:
>>> removals
ID ID2 DO
6 478 CUSER 8
8 478 EUSER 6
>>> updates
DATA DO ID ID2
0 CHANGE 6 107 BUSER
1 CHANGE 3 546 AUSER
>>> inserts
DATA DO ID ID2
2 CHANGE 2 123 DUSER
4 CHANGE 3 123 FUSER
To restate the questions: Will this logic consistently and correctly identify the differences between two dataframes with the specified key columns? Is there a more efficient or more pythonic approach to this?
Updated Sample Data with more records and the corresponding results.
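For what it's worth, here is a more compact pattern (a sketch I'm adding, not from the original post) that classifies every row in one pass by merging on the key columns with indicator=True. It uses the 5-row df/df2 from the question and assumes ['ID', 'ID2', 'DO'] is the key and DATA is the only non-key column to compare:
key = ['ID', 'ID2', 'DO']
m = df.merge(df2, on=key, how='outer', suffixes=('_old', '_new'), indicator=True)
inserts  = m.loc[m['_merge'] == 'right_only', key + ['DATA_new']]
removals = m.loc[m['_merge'] == 'left_only',  key]
updates  = m.loc[(m['_merge'] == 'both') & (m['DATA_old'] != m['DATA_new']), key + ['DATA_new']]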
import numpy as np
import pandas as pd
#Sample Data:
df = pd.DataFrame({'ID':[546,107,478,546], 'ID2':['AUSER','BUSER','CUSER','AUSER'], 'DO':[3,6,8,4], 'DATA':['ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546], 'ID2':['BUSER','AUSER','DUSER','AUSER'], 'DO':[6,3,2,4], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG']})
For changed:
#Concatenate df and df2; whenever an identical full row appears in both, drop both copies
df3 = pd.concat([df, df2]).drop_duplicates(keep = False)
#Whenever the size of the following groupby on the key columns is 2, the row was changed.
#Change
df3 = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
         .size()\
         .reset_index()\
         .query('DATA == 2')
df3.loc[:, 'DATA'] = 'CHANGE'
ID ID2 DO DATA
0 107 BUSER 6 CHANGE
3 546 AUSER 3 CHANGE
For Inserts:
#We can compare the key columns of df and df2 to see what's new in df2
#Inserts
df2[(np.logical_not(df2['ID'].isin(df['ID'])))&
    (np.logical_not(df2['ID2'].isin(df['ID2'])))&
    (np.logical_not(df2['DO'].isin(df['DO'])))]
ID ID2 DO DATA
2 123 DUSER 2 CHANGE
For Removals:
#Similar logic as above but flipped: keep rows of df whose keys do not appear in df2.
#Removals
df[(np.logical_not(df['ID'].isin(df2['ID'])))&
   (np.logical_not(df['ID2'].isin(df2['ID2'])))&
   (np.logical_not(df['DO'].isin(df2['DO'])))]
ID ID2 DO DATA
2 478 CUSER 8 ORIG
EDIT
df = pd.DataFrame({'ID':[546,107,478,546,478], 'ID2':['AUSER','BUSER','CUSER','AUSER','EUSER'], 'DO':[3,6,8,4,6], 'DATA':['ORIG','ORIG','ORIG','ORIG','ORIG']})
df2 = pd.DataFrame({'ID':[107,546,123,546,123], 'ID2':['BUSER','AUSER','DUSER','AUSER','FUSER'], 'DO':[6,3,2,4,3], 'DATA':['CHANGE','CHANGE','CHANGE','ORIG','CHANGE']})
New dataframes. For changes we do it the exact same way:
df3 = pd.concat([df, df2]).drop_duplicates(keep = False)
#Change
Change = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
            .size()\
            .reset_index()\
            .query('DATA == 2')
Change.loc[:, 'DATA'] = 'CHANGE'
ID ID2 DO DATA
0 107 BUSER 6 CHANGE
5 546 AUSER 3 CHANGE
For inserts/removals we do the same groupby as above, except we query for the rows that appear only once. Then we follow up with an inner join against df2 and df respectively to see what has been added or removed.
InsertRemove = df3.groupby(['ID', 'ID2', 'DO'])['DATA']\
                  .size()\
                  .reset_index()\
                  .query('DATA == 1')
#Inserts
Inserts = InsertRemove.merge(df2, how = 'inner', left_on= ['ID', 'ID2', 'DO'], right_on = ['ID', 'ID2', 'DO'])\
                      .drop('DATA_x', axis = 1)\
                      .rename({'DATA_y':'DATA'}, axis = 1)
ID ID2 DO DATA
0 123 DUSER 2 CHANGE
1 123 FUSER 3 CHANGE
#Removals
Remove = InsertRemove.merge(df, how = 'inner', left_on= ['ID', 'ID2', 'DO'], right_on = ['ID', 'ID2', 'DO'])\
                     .drop('DATA_x', axis = 1)\
                     .rename({'DATA_y':'DATA'}, axis = 1)
ID ID2 DO DATA
0 478 CUSER 8 ORIG
1 478 EUSER 6 ORIG
I have a dictionary:
#file1 mentions 2 columns while file2 mentions 3
dict2 = ({'file1' : ['colA', 'colB'],'file2' : ['colY','colS','colX'], etc..})
First of all, how can I build the dictionary so that it somehow separates the values destined for the one-column concatenation from the columns that need to remain in the final dataframe unaffected?
The columns will not have the same names in each file, and it is very difficult to automate such a customized process. What do you think?
I want to do a concatenation of the mentioned columns in a new column for each file.
This should be automated.
for k, v in dict1.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v)) #reads to a df
    df['new'] = df.astype(str).apply(' '.join, axis=1) #concatenation
How can I make this work every time, independent of the number of columns listed for each file in the dictionary?
Example:
a = {'colA' : [123,124,112,165],'colB' :['alpha','beta','gamma','delta']}
file1 = pd.DataFrame(data = a)
file1
colA colB
123 alpha
124 beta
112 gamma
165 delta
b = {'colY' : [123,124,112,165],'colS' :['alpha','beta','gamma','delta'], 'colX' :[323,326,378,399] }
file2 = pd.DataFrame(data = b)
file2
colY colS colX
123 alpha 323
124 beta 326
112 gamma 378
165 delta 399
Result:
file1
col_all
123 alpha
124 beta
112 gamma
165 delta
file2
col_all
123 alpha 323
124 beta 326
112 gamma 378
165 delta 399
NOTE
file2, for example, could have 5 more columns, but only 3 of them should be concatenated into one column. How do I build the initial dict so it defines which columns get concatenated and which just remain there unaffected?
So you have to select the column names for the concatenation, e.g. here the first 3 columns are selected by position:
for k, v in dict1.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v)) #reads to a df
    df['new'] = df.iloc[:, :3].astype(str).apply(' '.join, axis=1) #concatenation
If you create a list of possible column names, use Index.intersection:
for k, v in dict1.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v)) #reads to a df
    L = ['colA', 'colB', 'colS']
    cols = df.columns.intersection(L)
    df['new'] = df[cols].astype(str).apply(' '.join, axis=1) #concatenation
Or filtering:
for k, v in dict1.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(k, v)) #reads to a df
    L = ['colA', 'colB', 'colS']
    mask = df.columns.isin(L)
    df['new'] = df.loc[:, mask].astype(str).apply(' '.join, axis=1) #concatenation
EDIT:
If you want to create another data structure with another list of the necessary column names, a possible solution is a list of tuples:
L = [('file1', ['colA', 'colB'], ['colA', 'colB']),
     ('file2', ['colY', 'colS', 'colX'], ['colY', 'colS'])]
for i, j, k in L:
    print(i)
    print(j)
    print(k)
file1
['colA', 'colB']
['colA', 'colB']
file2
['colY', 'colS', 'colX']
['colY', 'colS']
So your solution should be rewritten as:
for i, j, k in L:
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(i, j)) #reads to a df
    df['new'] = df[k].astype(str).apply(' '.join, axis=1) #concatenation
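Alternatively, the same information could live in a dict keyed by file name, mapping to a tuple of (columns to read, columns to concatenate). A sketch under the same arcpy workflow; spec and col_all are names I made up:
spec = {'file1': (['colA', 'colB'], ['colA', 'colB']),
        'file2': (['colY', 'colS', 'colX'], ['colY', 'colS'])}
for name, (read_cols, concat_cols) in spec.items():
    df = pd.DataFrame.from_records(data=arcpy.da.SearchCursor(name, read_cols),
                                   columns=read_cols)  # name the columns explicitly
    df['col_all'] = df[concat_cols].astype(str).apply(' '.join, axis=1)  # concatenation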
I extracted multiple dataframes from an Excel sheet by passing coordinates (start & end).
I used the function below to extract them according to the coordinates, but when I convert the result into a dataframe, I'm not sure where the indexes that appear as columns in the df are coming from.
I want to remove these index columns and make the second row the column headers. This is my dataframe:
0 1 2 3 4 5 6
Cols/Rows A A2 B B2 C C2
0 A 50 50 150 150 200 200
1 B 200 200 250 300 300 300
2 C 350 500 400 400 450 450
def extract_dataframes(sheet):
    ws = sheet['pivots']
    cordinates = [('A1', 'M8'), ('A10', 'Q17'), ('A19', 'M34'), ('A36', 'Q51')]
    multi_dfs_list = []
    for i in cordinates:
        data_rows = []
        for row in ws[i[0]:i[1]]:
            data_cols = []
            for cell in row:
                data_cols.append(cell.value)
            data_rows.append(data_cols)
        multi_dfs_list.append(data_rows)
    multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
    return multi_dfs
I tried to delete the index, but it's not working.
Note: when I check the columns of the first dataframe:
>>> multi_dfs[0].columns # first dataframe
RangeIndex(start=0, stop=13, step=1)
Change
multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
to
multi_dfs = {i: pd.DataFrame(df[1:], columns=df[0]) for i, df in enumerate(multi_dfs_list)}
From the Docs,
columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
I think you need:
df = pd.read_excel(file, skiprows=1)
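If the frames from extract_dataframes have already been built, a post-hoc fix is also possible. A sketch of my own, assuming the multi_dfs dict from the question and, as in the answer above, that the header values sit in the first row:
d = multi_dfs[0]
d.columns = d.iloc[0]                   # promote the first row ('Cols/Rows', 'A', ...) to the header
d = d.iloc[1:].reset_index(drop=True)   # drop that row and renumber the index
d.columns.name = None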