Proj_Com_Sum  comp_1  comp_2  comp_3  Proj_Val_sum  val_1  val_2  val_3
          70      10      20      35            67     20     30     15
         100      50      30      25            70     25     30     15
Given the above as a pandas DataFrame df, I would like to add the columns Com_total, Val_total, and Proj_Tot_Diff,
Where
Com_total = comp_1 + comp_2 + comp_3
Val_total = val_1 + val_2 + val_3
Proj_Tot_Diff = Com_total - Proj_Com_Sum
Since I have about 58 comp columns, it would be long code to write:
Com_total = comp_1 + comp_2 + comp_3 + ... + comp_58
Please note: the column names may not follow a regex-friendly pattern;
they could be some state names like Florida, NY, etc.
All we know is that the 2nd through 58th columns are to be added.
Hence, I want some code like:
df['Com_total'] = df[cols 2:58].sum()  # pseudocode: what's the correct syntax?
How do I specify the in-between columns in precise notation? Please help with the correct syntax.
You can use slicing if your column headers are ordered properly; however, it is safer to use @piRSquared's method with filter:
df['Com_total'] = df.loc[:, 'comp_1':'comp_3'].sum(axis=1)
df['Val_total'] = df.loc[:, 'val_1':'val_3'].sum(axis=1)
df['Proj_Tot_diff'] = df['Com_total'] - df['Proj_Com_Sum']
print(df)
Output:
Proj_Com_Sum comp_1 comp_2 comp_3 Proj_Val_sum val_1 val_2 val_3 \
0 70 10 20 35 67 20 30 15
1 100 50 30 25 70 25 30 15
Com_total Val_total Proj_Tot_diff
0 65 65 -5
1 105 70 5
filter and assign
df.assign(
    Com_total=df.filter(regex=r'comp_\d+').sum(axis=1),
    Val_total=df.filter(regex=r'val_\d+').sum(axis=1),
    Proj_Tot_Diff=lambda d: d.Com_total - d.Proj_Com_Sum
)
Proj_Com_Sum comp_1 comp_2 comp_3 Proj_Val_sum val_1 val_2 val_3 \
0 70 10 20 35 67 20 30 15
1 100 50 30 25 70 25 30 15
Com_total Val_total Proj_Tot_Diff
0 65 65 -5
1 105 70 5
Edit: as in your edit, to get the sum of the consecutive 2nd through 58th columns, just use .iloc with 1:58 on the columns, because integer positions start from 0 and iloc excludes the right edge number.
df['Com_total'] = df.iloc[:, 1:58].sum(axis=1)
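If the exact positions are unknown but the boundary labels are, the bounds can be derived with get_loc; a small sketch, assuming (as in the sample frame) that comp_1 opens the block and Proj_Val_sum is the first column after it:
start = df.columns.get_loc('comp_1')
stop = df.columns.get_loc('Proj_Val_sum')  # iloc excludes the right edge
df['Com_total'] = df.iloc[:, start:stop].sum(axis=1)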
Original:
This is a crazy/fun solution using extract on the column names, then groupby and sum. Finally, join back to df.
# extract returns 'comp_' / 'val_' for each column (NaN for the Proj_* columns,
# which groupby then drops), so grouping along axis=1 sums each block of columns
df.join(df.groupby(df.columns.str.extract('(comp_|val_)', expand=False), axis=1).sum()
          .add_suffix('total')
          .assign(Proj_Tot_Diff=lambda x: x.comp_total - df.Proj_Com_Sum))
Out[1958]:
Proj_Com_Sum comp_1 comp_2 comp_3 Proj_Val_sum val_1 val_2 val_3 \
0 70 10 20 35 67 20 30 15
1 100 50 30 25 70 25 30 15
comp_total val_total Proj_Tot_Diff
0 65 65 -5
1 105 70 5
I have a pandas data frame that looks like this:
id age weight group
1 12 45 [10-20]
1 18 110 [10-20]
1 25 25 [20-30]
1 29 85 [20-30]
1 32 49 [30-40]
1 31 70 [30-40]
1 37 39 [30-40]
I am looking for a data frame that would look like this: (sd=standard deviation)
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
[10-20]
[20-30]
[30-40]
Here the second and third columns are the mean and SD for that group; the fourth and fifth columns are the mean and SD for the rest of the groups combined.
Here's a way to do it:
res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group == group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight', 'group_sd_weight', 'rest_mean_weight', 'rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
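As a quick check against the input: group [10-20] contains weights 45 and 110, whose mean is 77.5, while the remaining weights 25, 85, 49, 70 and 39 average 53.6, matching the first row above.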
An alternative way to get the same result is:
res = (pd.DataFrame(
           df.group.drop_duplicates().to_frame()
             .apply(lambda x: [
                 df.loc[df.group == x.group, 'weight'].mean(),
                 df.loc[df.group == x.group, 'weight'].std(),
                 df.loc[df.group != x.group, 'weight'].mean(),
                 df.loc[df.group != x.group, 'weight'].std()], axis=1, result_type='expand')
             .to_numpy(),
           index=list(df.group.drop_duplicates()),
           columns=['group_mean_weight', 'group_sd_weight', 'rest_mean_weight', 'rest_sd_weight'])
       .reset_index().rename(columns={'index': 'group'}))
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"
To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.
wgtCols = ['weight', 'weight2']
res = (pd.concat([pd.DataFrame(
           df.group.drop_duplicates().to_frame()
             .apply(lambda x: [
                 df.loc[df.group == x.group, wgtCol].mean(),
                 df.loc[df.group == x.group, wgtCol].std(),
                 df.loc[df.group != x.group, wgtCol].mean(),
                 df.loc[df.group != x.group, wgtCol].std()], axis=1, result_type='expand')
             .to_numpy(),
           index=list(df.group.drop_duplicates()),
           columns=[f'group_mean_{wgtCol}', f'group_sd_{wgtCol}',
                    f'rest_mean_{wgtCol}', f'rest_sd_{wgtCol}'])
       for wgtCol in wgtCols], axis=1)
       .reset_index().rename(columns={'index': 'group'}))
Input:
id age weight weight2 group
0 1 12 45 55 [10-20]
1 1 18 110 120 [10-20]
2 1 25 25 35 [20-30]
3 1 29 85 95 [20-30]
4 1 32 49 59 [30-40]
5 1 31 70 80 [30-40]
6 1 37 39 49 [30-40]
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight group_mean_weight2 group_sd_weight2 rest_mean_weight2 rest_sd_weight2
0 [10-20] 77.500000 45.961941 53.60 24.016661 87.500000 45.961941 63.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411 65.000000 42.426407 72.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596 62.666667 15.821926 76.25 38.378596
I have extracted table data from an image containing multiple tables using Amazon Textract, and I am trying to map all the extracted data into an output template CSV.
However, there are multiple tables in the extracted input CSV file, listed one below another; there are approximately 7 tables stacked in each CSV.
Please suggest how to map the values from the input CSV to the output.
Input CSV file:
S.No Item Item_code 1st 2nd 3rd 4th Avg
1 Math_book BK001 27 36 35 23 30
2 Phy_book BJ008 30 40 40 30 35
3 Hin_book NK103 50 50 30 30 40
4 Che_book CH001 40 40 40 20 35
S.No Item_Name Item_code 1st 2nd 3rd 4th Avg
1 Math_book BK001 27 36 35 23 30
2 Phy_book BJ008 30 40 40 30 35
3 Hin_book NK103 50 50 30 30 40
S.No Product Item_code 1st 2nd 3rd 4th Avg
1 Phy_book BJ008 30 40 40 30 35
2 Hin_book NK103 50 50 30 30 40
3 Che_book CH001 40 40 40 20 35
4 Bio_book BI005 50 30 40 60 45
Expected output:
S.No Product Item_code 1st 2nd 3rd 4th
1 Math_book BK001 54 72 70 46
2 Phy_book BJ008 90 120 120 90
3 Hin_book NK103 150 150 90 90
4 Che_book CH001 80 80 80 60
5 Bio_book BI005 50 30 40 60
Code I have been trying to use:
df = pd.read_csv(r'input.csv')
df2 = pd.read_csv(r'output.csv')
How can I add up all the values with groupby, treating the (Item, Item_Name, Product) columns as the same field, and put the result in df2? Please suggest.
Just use groupby:
df = df[df['S.No'] != 'S.No'].drop('S.No', axis=1)   # drop the embedded header rows
df[df.columns[2:]] = df[df.columns[2:]].astype(int)  # convert the score columns to int
df = df.groupby(['Item', 'Item_code'], as_index=False).sum()
df.to_csv('out.csv', index_label='S.No', sep='\t')   # specify the output file name here
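One small follow-up to match the expected output's Product header: since pandas keeps only the first table's header row, the grouped column ends up named Item, so it can be renamed just before the to_csv step above (a hedged addition, assuming that first header is the one you want to keep):
df = df.rename(columns={'Item': 'Product'})  # the expected output labels this column 'Product'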
If my data looks like this
Index Country ted_Val1 sam_Val1 ... ted_Val10 sam_Val10
1 Australia 1 3 ... 20 5
2 Bambua 12 33 ... 15 56
3 Tambua 14 34 ... 10 58
df = pd.DataFrame([["Australia", 1, 3, 20, 5],
["Bambua", 12, 33, 15, 56],
["Tambua", 14, 34, 10, 58]
], columns=["Country", "ted_Val1", "sam_Val1", "ted_Val10", "sam_Val10"]
)
I'd like to subtract each 'sam_' column from its matching 'ted_' column using a list, creating a new column starting with 'diff_', such that:
Index  Country    ted_Val1  sam_Val1  diff_Val1  ...  ted_Val10  sam_Val10  diff_Val10
1      Australia         1         3         -2  ...         20          5          15
2      Bambua           12        33        -21  ...         15         56         -41
3      Tambua           14        34        -20  ...         10         58         -48
so far I've got:
calc_vars = ['ted_Val1',
'sam_Val1',
'ted_Val10',
'sam_Val10']
for i in calc_vars:
    df_diff['dif_' + str(i)] = df.['ted_' + str(i)] - df.['sam_' + str(i)]
but I'm getting errors and am not sure where to go from here. As a warning, this is dummy data and there can be several underscores in the names.
IIUC you can use filter to choose the columns for subtraction (assuming your columns are properly sorted like your sample):
print(pd.concat([df,
                 pd.DataFrame(df.filter(like="ted").to_numpy() - df.filter(like="sam").to_numpy(),
                              columns=["diff" + i.split("_")[-1] for i in df.columns if "ted_Val" in i])],
                axis=1))
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diffVal1 diffVal10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
Try this:
calc_vars = ['ted_Val1', 'sam_Val1', 'ted_Val10', 'sam_Val10']

# pair the even & odd entries of calc_vars:
# ['ted_Val1', 'ted_Val10'] with ['sam_Val1', 'sam_Val10']
for ted, sam in zip(calc_vars[::2], calc_vars[1::2]):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Edit: if the columns are not sorted,
ted_cols = sorted(df.filter(regex=r"ted_Val\d+"), key=lambda x: x.split("_")[-1])
sam_cols = sorted(df.filter(regex=r"sam_Val\d+"), key=lambda x: x.split("_")[-1])
for ted, sam in zip(ted_cols, sam_cols):
    df['diff_' + ted.split("_")[-1]] = df[ted] - df[sam]
Country ted_Val1 sam_Val1 ted_Val10 sam_Val10 diff_Val1 diff_Val10
0 Australia 1 3 20 5 -2 15
1 Bambua 12 33 15 56 -21 -41
2 Tambua 14 34 10 58 -20 -48
I have a dataframe called df_location:
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to search the list_of_locations for each unique location and merge it to df_location in a way where each island_id will correspond to a specific location.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
          'temperature_value': [20,21,22,23,24,25,26,27,28,29],
          'humidity_value': [60,61,62,63,64,65,66,67,68,69],
          'island_id': [10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)

islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
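One caveat: this assumes list_of_locations holds real Python lists. If the lists came back from a CSV as strings (the next answer hints at the same situation with eval), they can be parsed first; a minimal sketch using the standard library:
from ast import literal_eval

# parse strings like "[2, 3]" back into Python lists before exploding
df_islands['list_of_locations'] = df_islands['list_of_locations'].apply(literal_eval)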
The df.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
    lambda x: [
        df_islands['island_id'][i]
        for i in df_islands.index
        if x in df_islands['list_of_locations'][i]
        # use the line below instead if the list is stored as a string
        # if x in eval(df_islands['list_of_locations'][i])
    ][0]
)
First we select the final value we want when the if condition is True: df_islands['island_id'][i].
Then we loop over each row in df_islands by using df_islands.index.
The if condition then checks whether the value of df_location['location_id'] appears in that row's list_of_locations, and returns True if it does.
Finally, since we must contain this long statement in square brackets, it is a list. However, we know there is only one matching value, so we can index it with [0] at the end.
I hope this helps, and I'm happy for other editors to make the answer more legible!
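Taking up that invitation, here is one possible tidier arrangement of the same logic; a sketch only, with find_island as a hypothetical helper name:
def find_island(loc_id):
    """Return the first island_id whose list_of_locations contains loc_id."""
    matches = df_islands.loc[
        df_islands['list_of_locations'].apply(lambda locs: loc_id in locs),
        'island_id']
    return matches.iloc[0]

df_location['island_id'] = df_location['location_id'].apply(find_island)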
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
I want to train a binary classification ML model with some data that I have; something like this:
df
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
0 20 89 62 23 3 74
1 51 64 19 2 83 0
0 14 58 2 71 31 48
1 32 28 2 30 92 91
1 51 36 51 66 15 14
...
My target (y) depends on three characteristics from two groups; however, I have an imbalance in my data: a value count of my target y reveals that I have more zeros than ones, in a ratio of about 2.68. I correct this by looping over each row and randomly swapping values from group 1 to group 2 and vice versa, like this:
for index, row in df.iterrows():
    choice = np.random.choice([0, 1])
    if row['y'] != choice:
        df.loc[index, 'y'] = choice
        for column in df.columns[1:]:
            key = column.replace('g1', 'g2') if 'g1' in column else column.replace('g2', 'g1')
            df.loc[index, column] = row[key]
Doing this reduces the ratio to no more than 1.3, so I was wondering if there is a more direct approach using pandas methods. Does anyone have an idea how to accomplish this?
Setting aside whether swapping columns solves the class imbalance, I would swap the whole data set and randomly choose between the original and the swapped:
# Step 1: swap the columns
# the regex [^(_g1)]$ keeps every column whose last character is not one of
# ( _ g 1 ) -- i.e. 'y' and the *_g2 columns -- then the *_g1 columns follow
df1 = pd.concat((df.filter(regex='[^(_g1)]$'),
                 df.filter(regex='_g1$')),
                axis=1)
# Step 2: rename the columns so positions line up with the original
df1.columns = df.columns
# random choice between original and swapped, row by row
np.random.seed(1)
is_original = np.random.choice([True, False], size=len(df))
# concat to make the new dataset
pd.concat((df[is_original], df1[~is_original]))
Output:
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
2 0 14 58 2 71 31 48
3 1 32 28 2 30 92 91
0 0 23 3 74 20 89 62
1 1 2 83 0 51 64 19
4 1 66 15 14 51 36 51
Notice that the rows with indexes 0, 1 and 4 have g1 swapped with g2.
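As a quick sanity check on the sketch above (using the objects already defined): only the feature blocks trade places, so the target column itself is unchanged, which is exactly the caveat raised at the start of this answer:
new_df = pd.concat((df[is_original], df1[~is_original]))
# y is untouched by the swap; only the g1/g2 feature blocks are exchanged
assert new_df['y'].sort_index().equals(df['y'])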