I have two pandas dataframes. The first one contains some data that I want to multiply with the second dataframe, which is a reference table.
In my example I want to get a new column in df1 for every column in my reference table, where each value is the sum of the row-wise products.
Like this (Index 205368421 with R21 17): (1205 * 0.526499) + (7562 * 0.003115) + (1332 * 0.000267) = 658
In Excel VBA I iterated through both tables and did it that way, but it took very long. I've read that pandas handles this much better without iterating.
import pandas as pd

df1 = pd.DataFrame({'Index': ['205368421', '206321177', '202574796', '200212811', '204376114'],
                    'L1.09A': [1205, 1253, 1852, 1452, 1653],
                    'L1.10A': [7562, 7400, 5700, 4586, 4393],
                    'L1.10C': [1332, 0, 700, 1180, 290]})
df2 = pd.DataFrame({'WorkerID': ['L1.09A', 'L1.10A', 'L1.10C'],
                    'R21 17': [0.526499, 0.003115, 0.000267],
                    'R21 26': [0.458956, 0, 0.001819]})
Index L1.09A L1.10A L1.10C
205368421 1205 7562 1332
206321177 1253 7400 0
202574796 1852 5700 700
200212811 1452 4586 1180
204376114 1653 4393 290
WorkerID R21 17 R21 26
L1.09A 0.526499 0.458956
L1.10A 0.003115 0
L1.10C 0.000267 0.001819
I want this:
Index L1.09A L1.10A L1.10C R21 17 R21 26
205368421 1205 7562 1332 658 555
206321177 1253 7400 0 683 575
202574796 1852 5700 700 993 851
200212811 1452 4586 1180 779 669
204376114 1653 4393 290 884 759
I would be okay with some hints. Someone told me this might be matrix multiplication, so .dot() would be helpful. Is this the right direction?
Edit:
I have now done the following:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df_multiplied = df1_sorted @ df2_sorted
This works with my example dataframes, but not with my real dataframes.
My real ones have these dimensions: df1_sorted (10429, 69) and df2_sorted (69, 18).
It should work, but my df_multiplied is full of NaN.
Alright, I did it!
I had to replace all NaN with 0.
So the final solution is:
df1 = df1.set_index('Index')
df2 = df2.set_index('WorkerID')
common_cols = list(set(df1.columns).intersection(df2.index))
df2 = df2.loc[common_cols]
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.sort_index(axis=0)
df1_sorted = df1_sorted.fillna(0)
df2_sorted = df2_sorted.fillna(0)
df_multiplied = df1_sorted @ df2_sorted
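If you also want the original columns next to the new result columns, as in the desired output above, a minimal sketch (my addition, not part of the original solution, assuming the indexes set above) is to join the multiplied result back onto the sorted input:

# Join the multiplied columns (R21 17, R21 26, ...) back onto the original data;
# both frames share the 'Index' index at this point.
df_result = df1_sorted.join(df_multiplied)
# Optionally round the new columns, as in the example output above.
df_result[df_multiplied.columns] = df_result[df_multiplied.columns].round(0)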
Related
I have a data set that can be found here: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset.
What we need, exactly, is for every employee to get their own set of rows listing all the employees they share the same age with.
The desired output would be to add these rows to the data frame like so:
source  target
Bob     Tom
Bob     Carl
Tom     Bob
Tom     Carl
Carl    Bob
Carl    Tom
I am using pandas to create the data frame from the csv file with pd.read_csv.
I am struggling with creating the loop to produce my desired output.
This is where I am at so far:
import pandas as pd

path = r"C:\CNT\IBM.csv"
df = pd.read_csv(path)

def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val

df['source'] = ''
df['target'] = ''
df2 = df.loc[df['Age'] == 18]
print(df2)
This produces:
Age EmployeeNumber MonthlyIncome source target
296 18 297 1420
301 18 302 1200
457 18 458 1878
727 18 728 1051
828 18 829 1904
972 18 973 1611
1153 18 1154 1569
1311 18 1312 1514
My desired output is this
Age EmployeeNumber MonthlyIncome source target
296 18 297 1420 297 302
301 18 302 1200 297 458
457 18 458 1878 297 728
727 18 728 1051 297 829
828 18 829 1904 297 973
972 18 973 1611 297 1154
1153 18 1154 1569 297 1312
1311 18 1312 1514
Where do I go from here?
This will need some modification because I don't have the extra features that you do, but it copies the EmployeeNumber into a new column, target, and shifts the values up, which would leave the last EmployeeNumber in each group as NaN. I added a small modification so that the last row in each group gets an empty target value instead. The line would need further changes if the source column contained strings as well. The main point is using .shift(periods=-1) together with groupby().
import pandas as pd
import numpy as np

path = r"C:\CNT\IBM.csv"
df = pd.read_csv(path)

def f(row):
    val = np.where(row['A'] == row['B'], 0, np.where(row['A'] >= row['B'], 1, -1))
    return val

# Copy EmployeeNumber into source, then shift it up within each Age group for target;
# the last row of each group gets an empty string instead of NaN.
df['source'] = df['EmployeeNumber']
df['target'] = df.groupby('Age')['EmployeeNumber'].shift(periods=-1).fillna(0).astype(int).replace(0, '')
print(df)

df2 = df.loc[df['Age'] == 18]
df3 = df2[['Age', 'EmployeeNumber', 'source', 'target']]
print(df3)
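If you actually need every same-age pair (as in the Bob/Tom/Carl example above) rather than just the next employee in each group, a hedged sketch (my addition, not part of this answer, assuming the same column names) would be a self-merge on Age:

# Pair every employee with every other employee of the same age.
pairs = df.merge(df, on='Age', suffixes=('_src', '_tgt'))
pairs = pairs[pairs['EmployeeNumber_src'] != pairs['EmployeeNumber_tgt']]
pairs = pairs[['Age', 'EmployeeNumber_src', 'EmployeeNumber_tgt']].rename(
    columns={'EmployeeNumber_src': 'source', 'EmployeeNumber_tgt': 'target'})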
I have a numpy array of size (352, 5) and also have a pandas DataFrame.
The objective is to check whether the first and second columns of the numpy array fall within the ranges given by two pairs of pandas columns, and if so, get the index of that particular row and then do something.
Example:
stats = np.array([[ 246, 1102, 1678, 2214,  172182],
                  [ 678, 1005, 1688, 2214, 3528850],
                  [1031,  241,   17,   23,     331]])
df:
hpos hpos_end vpos vpos_end
245 298 1100 1124
672 685 1000 1010
Result:
stats[0] is present in the very first row of df, since 246 lies between 245 and 298 and 1102 lies between 1100 and 1124, and the same goes for the next element of stats. I want to obtain the index of the row it lies in (if it does).
My approach till now:

for x, y, w, h, area in stats[:]:
    for row in df.itertuples():
        if x in range(int(df['HPOS'][row.Index]), int(df['HPOS_END'][row.Index])) and y in range(
                int(df['VPOS'][row.Index]), int(df['VPOS_END'][row.Index])):
            desired_index = row.Index
Is there a faster/optimal way to achieve my objective? Iterating over a df would be the last thing I would like to do.
Note: both the numpy array and the df are already sorted in ascending order, based on the first two columns for the numpy array and on ['hpos', 'vpos'] for the df.
Any help will be appreciated, Thank you :)
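No answer is recorded here, but one possible vectorized sketch (my addition; it assumes the lowercase column names shown in the df above and the same half-open intervals as range()) compares every stats row against every df row at once with numpy broadcasting:

import numpy as np

# Compare each stats (x, y) against all df intervals at once.
x = stats[:, 0][:, None]                      # shape (len(stats), 1)
y = stats[:, 1][:, None]
in_h = (x >= df['hpos'].values) & (x < df['hpos_end'].values)
in_v = (y >= df['vpos'].values) & (y < df['vpos_end'].values)
hit = in_h & in_v                             # shape (len(stats), len(df))
stat_rows, df_rows = np.nonzero(hit)          # matching (stats row, df row) pairs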
I have written a program (code below) that gives me a data frame for each file in a folder. The data frame contains the quarters of the year from the file and the counts (how often each quarter occurs in the file). The output for one file in the loop looks, for example, like this:
2008Q4 230
2009Q1 186
2009Q2 166
2009Q3 173
2009Q4 246
2010Q1 341
2010Q2 336
2010Q3 200
2010Q4 748
2011Q1 625
2011Q2 690
2011Q3 970
2011Q4 334
2012Q1 573
2012Q2 53
How can I create a big data frame where the counts for the quarters are summed up for all files in the folder?
path = "crisisuser"
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format("csv"))]
os.chdir("..")
for i in result:
df = pd.read_csv("crisisuser/"+i)
df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
df=df['quarter'].value_counts().sort_index()
I think you need to append all the Series to a list, then use concat and sum per index values:
out = []
for i in result:
    df = pd.read_csv("crisisuser/" + i)
    df['quarter'] = pd.PeriodIndex(df.time, freq='Q')
    out.append(df['quarter'].value_counts().sort_index())

s = pd.concat(out).sum(level=0)
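As a side note (my addition, not part of the answer): newer pandas versions removed the level argument of sum, so the equivalent there would be:

s = pd.concat(out).groupby(level=0).sum()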
I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve the rows which have the smallest time associated with x and y. Basically, for every value of y I want to find the row with the smallest time value, but I want to exclude those that have time 0.0 (this happens when x has the same value as y).
So, for example, the fastest way to get to y=0 is by starting from x=225, and so on; therefore it could be the case that x repeats itself, but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
So far I have tried groupby, but I'm only getting rows where x is the same as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing: I'm open to other options, not just groupby, if there are other ways.
So you need to remove the rows where time equals 0 first, by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative, filtering with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
I have this data frame and I would like to calculate a new column as the mean of salary_1, salary_2 and salary_3:
df = pd.DataFrame({
'salary_1': [230, 345, 222],
'salary_2': [235, 375, 292],
'salary_3': [210, 385, 260]
})
salary_1 salary_2 salary_3
0 230 235 210
1 345 375 385
2 222 292 260
How can I do it in pandas in the most efficient way? Actually I have many more columns and I don't want to write this one by one.
Something like this:
salary_1 salary_2 salary_3 salary_mean
0 230 235 210 (230+235+210)/3
1 345 375 385 ...
2 222 292 260 ...
Use .mean(). By specifying the axis you can take the average across each row or each column.
df['average'] = df.mean(axis=1)
df
returns
salary_1 salary_2 salary_3 average
0 230 235 210 225.000000
1 345 375 385 368.333333
2 222 292 260 258.000000
If you only want the mean of a few you can select only those columns. E.g.
df['average_1_3'] = df[['salary_1', 'salary_3']].mean(axis=1)
df
returns
salary_1 salary_2 salary_3 average_1_3
0 230 235 210 220.0
1 345 375 385 365.0
2 222 292 260 241.0
An easy way to solve this problem is shown below:
col = df.loc[: , "salary_1":"salary_3"]
where "salary_1" is the start column name and "salary_3" is the end column name
df['salary_mean'] = col.mean(axis=1)
df
This will give you the dataframe with a new column showing the mean of the selected columns.
This approach is really helpful when you have a large set of columns, or when you only need to operate on a selection of columns rather than all of them.
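If the columns of interest share a common name prefix rather than forming a contiguous block, a hedged alternative (my addition, assuming the salary_ prefix used above) is to select them by name pattern:

# Average all columns whose name contains 'salary_'.
df['salary_mean'] = df.filter(like='salary_').mean(axis=1)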