I have two DataFrames of exactly the same dimensions, and I would like to multiply just one specific column from each of them together:
My first DataFrame is:
In [834]: patched_benchmark_df_sim
Out[834]:
    build_number      name  cycles
0            390     adpcm   21598
1            390       aes    5441
2            390  blowfish     NaN
3            390     dfadd     463
....
284          413      jpeg  766742
285          413      mips    4263
286          413     mpeg2    2021
287          413       sha  348417
[288 rows x 3 columns]
My second DataFrame is:
In [835]: patched_benchmark_df_syn
Out[835]:
    build_number      name    fmax
0            390     adpcm  143.45
1            390       aes  309.60
2            390  blowfish     NaN
3            390     dfadd  241.02
....
284          413      jpeg  197.75
285          413      mips  202.39
286          413     mpeg2  291.29
287          413       sha  243.19
[288 rows x 3 columns]
And I would like to take each element of the cycles column of patched_benchmark_df_sim, multiply it by the corresponding element of the fmax column of patched_benchmark_df_syn, and store the result in a new DataFrame with exactly the same structure, containing the build_number and name columns, but where the last column of numerical data is called latency: the product of fmax and cycles.
So the output DataFrame has to look something like this:
  build_number   name  latency
0          390  adpcm  ## each value here has to be the product of cycles and fmax, and they must correspond to one another ##
......
I tried doing a straightforward patched_benchmark_df_sim * patched_benchmark_df_syn, but that did not work because my DataFrames have the name column, which is of string type. Is there no builtin pandas method that can do this for me? How could I proceed with the multiplication to get the result I need?
Thank you very much.
The simplest thing to do is to add a new column to the df, then select the columns you want and, if you want, assign them to a new df:
In [356]:
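# assuming here that df is patched_benchmark_df_sim and df1 is patched_benchmark_df_syn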
df['latency'] = df['cycles'] * df1['fmax']
df
Out[356]:
    build_number      name  cycles       latency
0            390     adpcm   21598  3.098233e+06
1            390       aes    5441  1.684534e+06
2            390  blowfish     NaN           NaN
3            390     dfadd     463  1.115923e+05
284          413      jpeg  766742  1.516232e+08
285          413      mips    4263  8.627886e+05
286          413     mpeg2    2021  5.886971e+05
287          413       sha  348417  8.473153e+07
In [357]:
new_df = df[['build_number', 'name', 'latency']]
new_df
Out[357]:
    build_number      name       latency
0            390     adpcm  3.098233e+06
1            390       aes  1.684534e+06
2            390  blowfish           NaN
3            390     dfadd  1.115923e+05
284          413      jpeg  1.516232e+08
285          413      mips  8.627886e+05
286          413     mpeg2  5.886971e+05
287          413       sha  8.473153e+07
As you've found, you can't multiply non-numeric dfs together the way you tried. The above assumes that the build_number and name columns are the same in both dfs.
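If you cannot rely on the rows of the two frames lining up, a safer variant is to merge on the key columns first and multiply afterwards. A minimal sketch, assuming build_number and name together identify a row:

# Align the two frames on their shared keys before multiplying, so the
# result does not depend on row order; frame and column names are taken
# from the question.
merged = patched_benchmark_df_sim.merge(patched_benchmark_df_syn,
                                        on=['build_number', 'name'])
merged['latency'] = merged['cycles'] * merged['fmax']
new_df = merged[['build_number', 'name', 'latency']]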
I have a df with numbers in the second column. Each number represents the length of a DNA sequence. I would like to create two new columns, where the first says where the sequence starts and the second says where it ends.
This is my current df:
             Names  LEN
0    Ribosomal_S9:  121
1    Ribosomal_S8:  129
2   Ribosomal_L10:  100
3            GrpE:  166
4          DUF150:  141
..             ...  ...
115     TIGR03632:  117
116     TIGR03654:  175
117     TIGR03723:  314
118     TIGR03725:  212
119     TIGR03953:  188
[120 rows x 2 columns]
And this is what I am trying to get
             Names  LEN  Start  End
0    Ribosomal_S9:  121      0  121
1    Ribosomal_S8:  129    121  250
2   Ribosomal_L10:  100    250  350
3            GrpE:  166    350  516
4          DUF150:  141    516  657
..             ...  ...    ...  ...
115     TIGR03632:  117
116     TIGR03654:  175
117     TIGR03723:  314
118     TIGR03725:  212
119     TIGR03953:  188
[120 rows x 4 columns]
Can anyone please point me in the right direction?
Use DataFrame.assign with new columns created by Series.cumsum; for the Start column, shift the cumulative sum with Series.shift:
#convert column to integers
df['LEN'] = df['LEN'].astype(int)
#alternative for replace non numeric to missing values
#df['LEN'] = pd.to_numeric(df['LEN'], errors='coerce')
s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
print (df)
            Names  LEN  Start  End
0   Ribosomal_S9:  121      0  121
1   Ribosomal_S8:  129    121  250
2  Ribosomal_L10:  100    250  350
3           GrpE:  166    350  516
4         DUF150:  141    516  657
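For completeness, a tiny self-contained sketch of the same cumsum/shift pattern, with made-up names and lengths:

import pandas as pd

# Hypothetical data just to illustrate the pattern.
df = pd.DataFrame({'Names': ['a:', 'b:', 'c:'], 'LEN': [121, 129, 100]})
s = df['LEN'].cumsum()
df = df.assign(Start=s.shift(fill_value=0), End=s)
print(df)
#   Names  LEN  Start  End
# 0    a:  121      0  121
# 1    b:  129    121  250
# 2    c:  100    250  350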
I'm given a set of the following data:
week    A    B    C    D    E
   1  243  857  393  621  194
   2  644  576  534  792  207
   3  946  252  453  547  436
   4  560  100  864  663  949
   5  712  734  308  385  303
I’m asked to find the sum of each column for specified rows/a specified number of weeks, and then plot those numbers onto a bar chart to compare A-E.
Assuming I have the rows I need (e.g. df.iloc[2:4,:]), what should I do next? My assumption is that I need to create a mask with a single row that includes the sum of each column, but I'm not sure how I go about doing that.
I know how to do the final step (i.e. .plot(kind='bar')); I just need to know what the middle step is to obtain the sums I need.
You can select by position with iloc, then call sum and Series.plot.bar:
df.iloc[2:4].sum().plot.bar()
Or, if you want to select by index labels (here the weeks), use loc:
df.loc[2:4].sum().plot.bar()
The difference is that iloc excludes the last position, while loc includes it:
print (df.loc[2:4])
        A    B    C    D    E
week
2     644  576  534  792  207
3     946  252  453  547  436
4     560  100  864  663  949
print (df.iloc[2:4])
        A    B    C    D    E
week
3     946  252  453  547  436
4     560  100  864  663  949
And if you also need to filter columns by position:
df.iloc[2:4, :4].sum().plot.bar()
And by column names (with week labels for the rows):
df.loc[2:4, list('ABCD')].sum().plot.bar()
All you need to do is call .sum() on your subset of the data:
df.iloc[2:4,:].sum()
Returns:
week       7
A       1506
B        352
C       1317
D       1210
E       1385
dtype: int64
Furthermore, for plotting, I think you can probably get rid of the week column (as the sum of week numbers is unlikely to mean anything):
df.iloc[2:4,1:].sum().plot(kind='bar')
# or
df[list('ABCDE')].iloc[2:4].sum().plot(kind='bar')
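Note that the two answers above assume slightly different shapes: in the first, week is the index; in the second, it is an ordinary column. If week is a plain column, one option (a sketch, not the only way) is to set it as the index first, so it stays out of the sums entirely:

# Make week the index so that only columns A-E are summed and plotted.
df.set_index('week').iloc[2:4].sum().plot(kind='bar')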
I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
       x    y       time
0    225    0  20.295270
1    225    1  21.134015
2    225    2  21.382298
3    225    3  20.704367
4    225    4  20.152735
5    225    5  19.213522
.......
900  437  900  27.748966
901  437  901  20.898460
902  437  902  23.347935
903  437  903  22.011992
904  437  904  21.231041
905  437  905  28.769945
906  437  906  21.662975
.... and so on
What I want to do is retrieve the rows that have the smallest time associated with each x and y. Basically, for every element on the y, I want to find the one with the smallest time value, but I want to exclude those that have time 0.0. That happens when x has the same value as y.
So, for example, the fastest way to get to y-0 is by starting from x-225, and so on; it could therefore be the case that x repeats itself, but for a different y.
e.g.
  x    y       time
225    0  20.295270
438    1  19.648954
 27   20   4.342732
  9  438  17.884423
225  907  24.560400
So far I have tried groupby, but I'm only getting rows where x is the same as y.
print(df.groupby('y', sort=False)['time'].idxmin())
y
0    0
1    1
2    2
3    3
4    4
The one below just returns the df that I already have.
df.loc[df.groupby("y")["time"].idxmin()]
Just to point out one thing: I'm open to options, not just groupby; if there are other ways, that is very good.
You first need to remove the rows where time equals 0 with boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative is to filter with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
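A small end-to-end sketch with a few hypothetical rows, showing the effect of dropping the time == 0 rows before taking the per-y minimum:

import pandas as pd

# Rows where x == y carry time 0.0 and must not win the minimum.
df = pd.DataFrame({'x': [225, 0, 438, 1],
                   'y': [0, 0, 1, 1],
                   'time': [20.295270, 0.0, 19.648954, 0.0]})
df2 = df[df['time'] != 0].sort_values(['y', 'time']).drop_duplicates('y')
print(df2)
#      x  y       time
# 0  225  0  20.295270
# 2  438  1  19.648954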
I have this data frame and I would like to calculate a new column as the mean of salary_1, salary_2 and salary_3:
df = pd.DataFrame({
'salary_1': [230, 345, 222],
'salary_2': [235, 375, 292],
'salary_3': [210, 385, 260]
})
   salary_1  salary_2  salary_3
0       230       235       210
1       345       375       385
2       222       292       260
How can I do it in pandas in the most efficient way? Actually I have many more columns and I don't want to write this one by one.
Something like this:
   salary_1  salary_2  salary_3      salary_mean
0       230       235       210  (230+235+210)/3
1       345       375       385              ...
2       222       292       260              ...
Use .mean. By specifying the axis, you can take the average across rows or down columns.
df['average'] = df.mean(axis=1)
df
returns
   salary_1  salary_2  salary_3     average
0       230       235       210  225.000000
1       345       375       385  368.333333
2       222       292       260  258.000000
If you only want the mean of a few you can select only those columns. E.g.
df['average_1_3'] = df[['salary_1', 'salary_3']].mean(axis=1)
df
returns
   salary_1  salary_2  salary_3  average_1_3
0       230       235       210        220.0
1       345       375       385        365.0
2       222       292       260        241.0
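Since the question mentions many more columns, one way to pick up every salary column without listing them one by one is DataFrame.filter. A sketch, assuming the relevant columns all contain 'salary' in their names:

# Average across all columns whose name contains 'salary'; this assumes
# no unrelated column shares that substring.
df['salary_mean'] = df.filter(like='salary').mean(axis=1)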
An easy way to solve this problem is shown below:
col = df.loc[: , "salary_1":"salary_3"]
where "salary_1" is the start column name and "salary_3" is the end column name
df['salary_mean'] = col.mean(axis=1)
df
This will give you a new dataframe with a new column that shows the mean of the selected columns.
This approach is really helpful when you have a large set of columns, and also when you need to operate on only some selected columns rather than all of them.
I have a Series named 'graph' in pandas that looks like this:
Wavelength
450      37
455      31
460       0
465       0
470       0
475       0
480     418
485    1103
490    1236
495     894
500     530
505      85
510       0
515     168
520       0
525       0
530     691
535     842
540    5263
545    4738
550    6237
555    1712
560     767
565     620
570       0
575     757
580    1324
585    1792
590     659
595    1001
600     601
605     823
610       0
615     134
620    3512
625     266
630     155
635     743
640     648
645       0
650     583
Name: A1, dtype: object
I am graphing the curve using graph.plot().
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my series into x and y columns. Right now I can access the "Wavelength" column using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?
You don't need to pass an x and a y to savgol_filter. You just need the y values, which get passed automatically when you pass graph to it. What you are missing are the window length and polynomial order parameters that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
As you are using a Series instead of a DataFrame, some libraries cannot access the index to use it as a column. Use:
df = df.reset_index()
It will convert the index into an extra column that you can use in savgol_filter or anything else.
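Putting the two answers together, a hedged sketch of the full flow: reset the index to get explicit x/y columns, coerce the object-dtype values to numbers, then smooth. The window length 7 and polynomial order 3 are taken from the earlier answer:

import pandas as pd
from scipy.signal import savgol_filter

df = graph.reset_index()            # columns: 'Wavelength' and 'A1'
df['A1'] = pd.to_numeric(df['A1'])  # the series printed as dtype: object
df['smooth'] = savgol_filter(df['A1'], 7, 3)
df.plot(x='Wavelength', y='smooth')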