I have a data frame with 4 columns; the 1st column is the counter, which holds hexadecimal values.
Data
counter frequency resistance phase
0 15000.000000 698.617126 -0.745298
1 16000.000000 647.001708 -0.269421
2 17000.000000 649.572265 -0.097540
3 18000.000000 665.282775 0.008724
4 19000.000000 690.836975 -0.011101
5 20000.000000 698.051025 -0.093241
6 21000.000000 737.854003 -0.182556
7 22000.000000 648.586792 -0.125149
8 23000.000000 643.014160 -0.172503
9 24000.000000 634.954223 -0.126519
a 25000.000000 631.901733 -0.122870
b 26000.000000 629.401123 -0.123728
c 27000.000000 629.442016 -0.156490
Expected output
| counter | sampling frequency | time |
| --------| ------------------ |---------|
| 0 | - |t0=0 |
| 1 | 1 |t1=t0+sf |
| 2 | 1 |t2=t1+sf |
| 3 | 1 |t3=t2+sf |
The time column is the new column added to the original data frame. I want to plot time on the x-axis and frequency, resistance, and phase on the y-axis.
Because the value of any row depends on the value of the previous row, you may have to use a for loop for this problem.
For a constant frequency, you could just calculate it in advance, no need to operate on the dataframe:
sampling_freq = 1
df['time'] = [sampling_freq * i for i in range(len(df))]
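Equivalently, you could build the same column with NumPy (just a small variation on the above, assuming numpy is imported as np):
import numpy as np

sampling_freq = 1
df['time'] = sampling_freq * np.arange(len(df))  # 0, sf, 2*sf, ...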
If you need to operate on the dataframe (say the frequency may change at some point), you can use this suggestion to access each cell by row number and column name. The syntax would be a lot simpler using numbers for both row and column, but I prefer to refer to 'time' instead of 2.
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
    df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + df.iloc[i, df.columns.get_loc('sampling frequency')]
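If it helps, the same result can be obtained without an explicit loop via a cumulative sum (just a sketch; it assumes the first row's time should be 0, as above):
sf = df['sampling frequency']
df['time'] = sf.cumsum() - sf.iloc[0]  # time[0] = 0, time[i] = time[i-1] + sf[i]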
Or, alternatively, resetting the index so you can iterate through consecutive numbers:
df['time'] = np.zeros(len(df))
df = df.reset_index()
for i in range(1, len(df)):
    df.loc[i, 'time'] = df.loc[i-1, 'time'] + df.loc[i, 'sampling frequency']
df = df.set_index('counter')
Note that, because your sampling frequency is likely constant in the whole experiment, you could simplify it like:
sampling_freq = 1
df['time'] = np.zeros(len(df))
for i in range(1, len(df)):
    df.iloc[i, df.columns.get_loc('time')] = df.iloc[i-1, df.columns.get_loc('time')] + sampling_freq
But it's not going to be better than just creating the time series as in the first example.
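For the plotting part of the question, once the time column exists, something along these lines should work (just a sketch using matplotlib; the stacked-subplot layout is one option among many):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, sharex=True)
for ax, col in zip(axes, ['frequency', 'resistance', 'phase']):
    ax.plot(df['time'], df[col])   # time on the x-axis, each measurement on its own y-axis
    ax.set_ylabel(col)
axes[-1].set_xlabel('time')
plt.show()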
Related
I have two datasets (df1 and df2) of values with a certain range (Start and End) in both of them.
I would like to annotate the first one (df1) with values from column Num of the corresponding overlapping range of values (Start/End) on df2.
Example: the first row in df1 ranges from 0-2300000; since 2300000 is lower than the End in the first row of df2 and the whole range 0-2300000 overlaps the range 62920-121705338, it would be annotated with Num 3. Similarly, row 2 of df1, whose range 2300000-5400000 also overlaps the range 62920-121705338, would also be annotated with Num 3.
However, in the case of the last row of df1, the range overlaps two rows of df2, so the output needs to be the sum of Num over those two rows of df2.
The desired output would be df3
df1.head()
|Start |End |Tag |
|---------|---------|-------|
|0 |2300000 |gneg45 |
|2300000 |5400000 |gpos25 |
|143541857|200000000|gneg34 |
df2.head()
| Start | End | Num |
|---------|---------|--------|
|62920 |121705338| 3 |
|143541857|147901334| 2 |
|147901760|151020217| 5 |
df3 =
|Start |End |Num |
|---------|---------|-------|
|0 |2300000 |3 |
|2300000 |5400000 |3 |
|143541857|200000000|7 |
I tried pandas merge creating a key and query based on a range of columns, but nothing really worked.
Thanks in advance!!
From your description, you are looking for overlapping ranges between df1 and df2, so that each row of df1 takes the Num value from the overlapping row(s) of df2.
To formulate the overlapping-range condition, let's first illustrate the opposite, the non-overlapping case:
Either:
                                    |<-------------->|
                                df2.Start        df2.End
|<------------->|
df1.Start   df1.End
or:
|<-------------->|
df2.Start     df2.End
                                    |<------------->|
                                df1.Start      df1.End
This non-overlapping range condition can be formulated as:
Either (df1.End < df2.Start) or (df1.Start > df2.End)
Therefore, the overlapping range condition, being the opposite, is the negation of the above conditions, that is:
~ ((df1.End < df2.Start) | (df1.Start > df2.End))
which is equivalent to:
(df1.End >= df2.Start) & (df1.Start <= df2.End)
[Note: we deduce the overlapping condition by considering the opposite and negating it because the overlapping condition has more scenarios. There are 4 cases: (1) df1 covering the entire df2 range and more; (2) df1 being entirely contained within the df2 range; (3) overlapping on the left end only; (4) overlapping on the right end only. Our approach simplifies the logic.]
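For reference, before going through the solutions, the sample frames from the question can be rebuilt like this (a minimal sketch using the values shown above), so the code below is easy to reproduce:
import pandas as pd

df1 = pd.DataFrame({'Start': [0, 2300000, 143541857],
                    'End': [2300000, 5400000, 200000000],
                    'Tag': ['gneg45', 'gpos25', 'gneg34']})
df2 = pd.DataFrame({'Start': [62920, 143541857, 147901760],
                    'End': [121705338, 147901334, 151020217],
                    'Num': [3, 2, 5]})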
Solution 1: Simple Solution for small dataset
Step 1: For a small dataset, you can cross join df1 and df2 by .merge(), then filter by the overlapping condition using .query(), as follows:
df3 = (df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:
df3 = (df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_df1', '_df2')).drop('key', axis=1)
.query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
.rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
[['Start', 'End', 'Num']]
)
Intermediate result:
print(df3)
Start End Num
0 0 2300000 3
3 2300000 5400000 3
7 143541857 200000000 2
8 143541857 200000000 5
Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():
df3 = df3.groupby(['Start', 'End'])['Num'].sum().reset_index()
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Solution 2: Numpy Solution for large dataset
For a large dataset where performance is a concern, you can use numpy broadcasting (instead of cross join and filtering) to speed up the execution time:
Step 1:
d1_S = df1.Start.to_numpy()
d1_E = df1.End.to_numpy()
d2_S = df2.Start.to_numpy()
d2_E = df2.End.to_numpy()
# filter for overlapping range condition and get the respective row indexes of `df1`, `df2` in `i` and `j`
i, j = np.where((d1_E[:, None] >= d2_S) & (d1_S[:, None] <= d2_E))
df3 = pd.DataFrame(
np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns + '_df2')
)
Intermediate result:
print(df3)
Start End Tag Start_df2 End_df2 Num_df2
0 0 2300000 gneg45 62920 121705338 3
1 2300000 5400000 gpos25 62920 121705338 3
2 143541857 200000000 gneg34 143541857 147901334 2
3 143541857 200000000 gneg34 147901760 151020217 5
Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():
df3 = df3.groupby(['Start', 'End'])['Num_df2'].sum().reset_index(name='Num')
Result:
print(df3)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
Building off the logic from #SeaBean, one option is with conditional_join from pyjanitor, followed by a groupby:
# pip install pyjanitor
import pandas as pd
import janitor
(
df1
.conditional_join(
# add suffix here
# to avoid MultiIndex, which happens
# if the columns overlap
df2.add_suffix('_y'),
# column from left, column from right, comparator
('Start', 'End_y', '<='),
('End', 'Start_y', '>='))
.rename(columns={'Num_y':'Num'})
.groupby(['Start', 'End'], as_index = False)
.Num
.sum()
)
Start End Num
0 0 2300000 3
1 2300000 5400000 3
2 143541857 200000000 7
I have a dataframe_1 as such:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 3.459 min begin test
3 7.009 min end of test
And I would like to add multiple new rows in between each of dataframe_1's rows, where the Time column for each new row would add an additional minute until reaching dataframe_1's next row's time (and corresponding Label). For example, the above table should ultimately look like this:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 3.459 min begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 7.009 min end of test
Using Timedelta type via pd.to_timedelta() is perfectly fine.
I thought the best way to do this would be to break up each row of dataframe_1 into its own dataframe, then add rows for each added minute, and then concatenate the dataframes back together. However, I am unsure of how to accomplish this.
Should I use a nested for-loop to [first] iterate over the rows of dataframe_1 and then [second] iterate over a counter so I can create new rows with added minutes?
I was previously not splitting up the individual rows into new dataframes, and I was doing the second iteration like this:
baseline_row = df_legend[df_legend['Label'] == 'baseline']
[baseline_index] = baseline_row.index
baseline_time = baseline_row['Time']
interval_mins = 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
cutoff_time = pd.to_timedelta(cutoff_time_np)
while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):
    new_row = baseline_row.copy()
    new_row['Label'] = f'minute {interval_mins}'
    new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
    new_row.index = [baseline_index + interval_mins - 0.5]
    df_legend = df_legend.append(new_row, ignore_index=False)
    df_legend = df_legend.sort_index().reset_index(drop=True)
    pdb.set_trace()
    interval_mins += 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
But since I want to do this for each row in the original dataframe_1, I was thinking of splitting it up into separate dataframes and putting them back together. I'm just not sure what the best way is to do that, especially since pandas is apparently very slow when iterating over rows.
I would really appreciate some guidance.
This might be faster than your solution.
df.Time = pd.to_timedelta(df.Time)
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)
df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
2 00:03:27.540000 begin test 3
3 00:07:00.540000 end of test 0
Then use iterrows to get your desired output.
new_df = []
for _, row in df.iterrows():
    val = row.counts
    if val == 0:
        new_df.append(row)
    else:
        new_df.append(row)
        new_row = row.copy()
        label = row.Label
        for i in range(val):
            new_row = new_row.copy()
            new_row.Time += pd.Timedelta('1 min')
            new_row.Label = f'{label} + {i+1}min'
            new_df.append(new_row)
new_df = pd.DataFrame(new_df)
new_df
new_df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
1 00:01:02.749000 baseline + 1min 3
1 00:02:02.749000 baseline + 2min 3
1 00:03:02.749000 baseline + 3min 3
2 00:03:27.540000 begin test 3
2 00:04:27.540000 begin test + 1min 3
2 00:05:27.540000 begin test + 2min 3
2 00:06:27.540000 begin test + 3min 3
3 00:07:00.540000 end of test 0
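If you don't want the helper column or the repeated index in the final result, a small cleanup step could follow (a sketch):
new_df = new_df.drop(columns='counts').reset_index(drop=True)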
I assume that you converted the Time column from the "number unit" format to a string representation of the time. Something like:
Time Label
Index
0 00:00:00.000 Segment 1
1 00:00:02.749 baseline
2 00:03:27.540 begin test
3 00:07:00.540 end of test
Then, to get your result:
Compute timNxt - the Time column shifted by 1 position and converted
to datetime:
timNxt = pd.to_datetime(df.Time.shift(-1))
Define the following "replication" function:
def myRepl(row):
    timCurr = pd.to_datetime(row.Time)
    timNext = timNxt[row.name]
    tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
    if pd.notna(timNext):
        n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
        tbl.extend([[(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
                     row.Label + f' + {i}min'] for i in range(1, n)])
    return pd.DataFrame(tbl, columns=row.index)
Apply it to each row of your df and concatenate results:
result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
The result is:
Time Label
0 00:00:00.000000 Segment 1
1 00:00:02.749000 baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 00:03:27.540000 begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 00:07:00.540000 end of test
The resulting DataFrame has the Time column also as strings, but at least the fractional part of a second has 6 digits everywhere.
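Since the question mentions that Timedelta values are perfectly fine, the string column could also be converted back at the end (a sketch):
result['Time'] = pd.to_timedelta(result['Time'])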
So I have this data frame df:
Author | Score
A | 10
B | 4
C | 8
A | 9
B | 7
C | 6
D | 4
E | 3
I want to be able to make a box plot with x = author and y = score, limited to authors that appear more than once. So the chart will only display authors A, B, and C. The reason I want to set this limit is that the actual data frame I'm working with contains a rather large number of authors, and the box plot ends up looking extremely cluttered and unreadable. Is there a way to do this?
You can use groupby + transform('size') to create a mask that limits your DataFrame to Authors with more than 1 row. Then boxplot this subset.
m = df.groupby('Author')['Score'].transform('size').gt(1)
df.loc[m].boxplot(by='Author', column='Score')
That method allows you to easily generalize to an arbitrary number of rows as your threshold. In this special case of more than 1 row you could also use duplicated to slice the original:
df[df.duplicated('Author', keep=False)].boxplot(by='Author', column='Score')
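For instance, to require more than 5 rows per author instead, only the threshold changes (a sketch):
m = df.groupby('Author')['Score'].transform('size').gt(5)
df.loc[m].boxplot(by='Author', column='Score')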
First count the rows per Author by grouping, then filter the data by Counts.
import pandas as pd
import matplotlib.pyplot as plt
# add counts column
df['Counts'] = df.groupby(['Author']).transform('count')
# filter by value > 1
df = df[df['Counts'] > 1]
# plot
df.boxplot(by='Author', column=['Score'])
plt.show()
Output:
I have a dataset that looks like this:
country | year | supporting_nation | eco_sup | mil_sup
------------------------------------------------------------------
Fake 1984 US 1 1
Fake 1984 SU 0 1
In this fake example, a nation is playing both sides during the cold war and receiving support from both.
I am reshaping the dataset in two ways:
I removed all non US / SU instances of support, I am only interested in these two countries
I want to reduce it to 1 line per year per country, meaning that I am adding US / SU specific dummy variables for each variable
Like so:
country | year | US_SUP | US_eco_sup | US_mil_sup | SU_SUP | SU_eco_sup | SU_mil_sup |
------------------------------------------------------------------------------------------
Fake 1984 1 1 1 1 1 1
Fake 1985 1 1 1 1 1 1
florp 1984 0 0 0 1 1 1
florp 1985 0 0 0 1 1 1
I added all of the dummies and the US_SUP and SU_SUP columns have been populated with the correct values.
However, I am having trouble with giving the right value to the other variables.
To do so, I wrote the following function:
def get_values(x):
    cols = ['eco_sup', 'mil_sup']
    nation = ''
    if x['SU_SUP'] == 1:
        nation = 'SU_'
    if x['US_SUP'] == 1:
        nation = 'US_'
    support_vars = x[['eco_sup', 'mil_sup']]
    # Since each line contains only one measure of support I can
    # automatically assume that the support_vars are from
    # the correct nation
    support_cols = [nation + x for x in cols]
    x[support_cols] = support_vars
The plan is then to use a df.groupby(...).agg('max') operation, but I never get to this step, as the function above returns 0 for each new dummy column regardless of the values in the dataframe.
So in the last table all of the US/SU_mil/eco_sup variables would be 0.
Does anyone know what I am doing wrong / why the columns are getting the wrong value?
I solved my problem by abandoning the .apply function and using this instead (where old is a list of the old variable names)
for index, row in df.iterrows():
    if row['SU_SUP'] == 1:
        nation = 'SU_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
    if row['US_SUP'] == 1:
        nation = 'US_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
This did the trick!
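As an aside, a vectorized alternative to iterating row by row could be to pivot on supporting_nation and then take the per-country/year max, roughly in the spirit of the groupby/agg('max') plan mentioned in the question; this is only a sketch, assuming the long-format columns shown above:
import pandas as pd

# wide table of per-nation support measures, with flattened names such as 'US_eco_sup'
wide = df.pivot_table(index=['country', 'year'],
                      columns='supporting_nation',
                      values=['eco_sup', 'mil_sup'],
                      aggfunc='max', fill_value=0)
wide.columns = [f'{nation}_{var}' for var, nation in wide.columns]

# a nation counts as supporting if it appears at all for that country-year
sup = (pd.crosstab([df['country'], df['year']], df['supporting_nation']) > 0).astype(int)
sup.columns = [f'{nation}_SUP' for nation in sup.columns]

result = sup.join(wide).reset_index()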
I have two data frames df and ctr.
df contains a column position, with values between 1 and 100, and a column Average monthly searches, which contains an integer.
position | Average monthly searches
1 | 250
2 | 10
3 | 30
2 | 40
4 | 100
ctr contains a column Position, with values between 1 and 100, and a column Decay Ctr, a percentage that reflects the decay at each position.
Position | Decay Ctr
1 | 27.18%
2 | 18.27%
3 | 12.66%
4 | 9.13%
5 | 6.90%
What I want to do is, for each row in df, look up that position in ctr and multiply Average monthly searches by the correct Decay Ctr.
with open("C:\Environments\ENV\Export npower Report Rebuild KWs.csv",newline='') as csvfile:
with open("C:\Environments\ENV\ctr_csv.csv", newline='') as ctrfile:
ctr = pd.read_csv(ctrfile)
df = pd.read_csv(csvfile)
But I am unsure how to extract the correct elements to put it into a new column visibility in df. I tried using an apply statement but was unsure how to reference ctr correctly.
df['visibility'] = df.apply(numpy.multiply(df['Average monthly searches'] , ctr[ctr[]] ), axis = 0 )
You can just merge the two frames and then create a new column.
comb_df = df.merge(ctr, left_on='position', right_on='Position')
comb_df['visibility'] = comb_df['Average monthly searches'] * comb_df['Decay Ctr']
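One thing to watch: Decay Ctr in the question looks like a percentage string (e.g. '27.18%'). If it is read in that way, it would need converting to a number before multiplying (a sketch, assuming that string format):
ctr['Decay Ctr'] = ctr['Decay Ctr'].str.rstrip('%').astype(float) / 100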
I am unsure of what you are asking, but if I understand it correctly, what you want to do is actually to merge the two data frames; then you can perform whatever operations you like.
First way
df1 = pd.merge(df, ctr, left_on='position', right_on='Position')
# Then you can multiply both columns however you like
df1['Visibility'] = numpy.multiply(df1['Average monthly searches'], df1['Decay Ctr'])
Second Way
df1 = pd.merge(df, ctr, left_on='position', right_on='Position')
def multiply(row):
    return row['Average monthly searches'] * row['Decay Ctr']
df1['Visibility'] = df1.apply(multiply, axis=1)