Sort index order with duplicate index numbers - python

I have two dataframes
df1:
Type Number
24 variation 2.5
25 variation 2.6
26 variation 4
27 variation 4
dfn:
Type Number
24 variable
26 variable
I'm trying to append these two dataframes and sort them by index:
dfn = dfn.append(df1).sort_index()
The end result should be
Type Number
24 variable
24 variation 2.5
25 variation 2.6
26 variable
26 variation 4
27 variation 4
However, I am getting results like:
Type Number
24 variable
24 variation 2.5
25 variation 2.6
26 variation 4
26 variable
27 variation 4
I want the row with the variable type above the variation type for each index value; this works fine for the first index (24) but not for the next one (26), and so on. How can I get the desired result?

Try appending, resetting the index, sorting by multiple columns, and then dropping the reset-index column, as follows:
df1.append(dfn).reset_index().sort_values(['index', 'Type', 'Number']).drop(columns='index')
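Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal sketch of the same idea with pd.concat, assuming the df1 and dfn frames shown above:
import pandas as pd

# same approach with pd.concat (DataFrame.append was removed in pandas 2.0)
out = (pd.concat([df1, dfn])
         .reset_index()
         .sort_values(['index', 'Type', 'Number'])
         .drop(columns='index'))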

Let us try
dfn = df1.append(dfn).sort_index()
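A likely explanation for the unexpected ordering is that sort_index's default sort is not stable for rows that share an index label. As a hedged sketch (not part of the answers above), keeping dfn first and requesting a stable sort preserves the 'variable' rows above the 'variation' rows:
import pandas as pd

# a stable sort (kind='stable' or 'mergesort') keeps the pre-sort order
# for rows with equal index labels, so the dfn rows stay on top
out = pd.concat([dfn, df1]).sort_index(kind='stable')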

Related

How to change the data of a particular column and multiply them depending upon the specific values in another column using pandas?

I want to select the year '2019/0' (a string) from the column 'year of entry' and multiply only those rows' 'Grades' (which is in another column) by 2.
year of entry  Grades
2019/0         14
2010/0         21
2019/0         15
This is what I have tried so far:
df.loc[df("Year of Entry"),'2018/9'] = df("Grades")*2
It's been giving me an error and I'm not sure if this is the right method.
You can use:
df.loc[df['year of entry'].eq('2019/0'), 'Grades'] *= 2
NB. the modification is in place.
modified df:
year of entry Grades
0 2019/0 28
1 2010/0 21
2 2019/0 30
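If you'd rather not modify df in place, a small sketch of a copy-based variant (same frame and column names assumed):
# build a new frame instead of mutating df: double Grades only where the year matches
mask = df['year of entry'].eq('2019/0')
df2 = df.assign(Grades=df['Grades'].where(~mask, df['Grades'] * 2))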

How can I drop rows which aren't in a given time period?

I'm sure this title alone is not really helpful and could mean a lot of things, so I'll try to explain the question with an example.
My goal is to delete rows in a DataFrame like the following one if the row cannot be part of a run of consecutive days at least as long as a given time period t. If t is, for example, 3, then the last row needs to be deleted, because there is a gap between it and the row before. If t were 4, the first three rows would also have to be deleted, since 07.04.2012 (or 03.04.2012) is missing. Hopefully you can understand what I'm trying to explain here.
Date        Value
04.04.2012  24
05.04.2012  21
06.04.2012  20
08.04.2012  21
09.04.2012  23
10.04.2012  21
11.04.2012  26
13.04.2012  24
My attempt was to iterate over the values in the 'Date' column and, for every element x, check whether the date at position x minus the date at position x + t equals -t days. If this is not the case, the whole row of that element should be deleted. But while searching for how to iterate over a DataFrame, I read several times that this is not recommended, because it needs a lot of computing time for big DataFrames. Unfortunately, I couldn't find any other method or function that could do this. Therefore, I would be really glad if someone could help me out here. Thank you! :)
With the dates as the index, you can expand the index of the dataframe to include the missing days. The new dates will create NaN values. Create a new group at every NaN value with .isna().cumsum() and count the size of each group. Finally, select the rows with a count greater than or equal to the desired time period.
period = 3
df.set_index('Date', inplace=True)
df[df.groupby(df.reindex(pd.date_range(df.index.min(), df.index.max()))
                .Value.isna().cumsum())
     .transform('count').ge(period).Value].reset_index()
Output
Date Value
0 2012-04-04 24
1 2012-04-05 21
2 2012-04-06 20
3 2012-04-08 21
4 2012-04-09 23
5 2012-04-10 21
6 2012-04-11 26
To create the dataframe used in this solution
t = '''
Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24
'''
import io
import pandas as pd
from datetime import datetime

df = pd.read_csv(io.StringIO(t), sep=r'\s+', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y'))
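For comparison, a hedged sketch of an alternative that skips the reindexing step and labels consecutive-day runs directly from the date differences (assuming df still has 'Date' as a datetime column, i.e. before the set_index call above, and reusing period from above):
# a gap (or the very first row) starts a new run of consecutive days
new_run = df['Date'].diff().dt.days.ne(1)
run_id = new_run.cumsum()
# keep only rows whose run is at least `period` days long
out = df[df.groupby(run_id)['Date'].transform('size') >= period]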

python sort a list of strings based on substrings using pandas

I have an Excel sheet with 4 columns: Filename, SNR, Dynamic Range, Level.
Filename                                                                       SNR  Dynamic Range  Level
1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx    5    11             8
19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx  15   31             23
10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx  10   21             24
28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx  20   41             23
37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx    25   51             12
I need to reorganize the table by its first column, Filename, so that the FS part of each filename (FS8, FS16, FS32, FS48) is in numeric order from least to greatest, i.e.:
Filename                                                                       SNR  Dynamic Range  Level
1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx    5    11             8
37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx    25   51             12
10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx  10   21             24
19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx  15   31             23
28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx  20   41             23
I don't want to change the actual excel file. I was hoping to use pandas because I am doing some other manipulation later on.
I tried this
df.sort_values(by='Xls Filename', key=lambda col: col.str.contains('_FS'),ascending=True)
but it didn't work.
Thank you in advance!
Extract the pattern, find the sort index using argsort and then sort with the sort index:
# extract the number to sort by into a Series
fs = df.Filename.str.extract(r'FS(\d+)_\w+\.xlsx$', expand=False)
# find the sort order with `argsort` and reorder the data frame with it
df.loc[fs.astype(int).argsort()]
# Filename ... Level
#0 1___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HPOF.xlsx ... 8
#4 37___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS8_HP4.xlsx ... 12
#2 10___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS16_HPOF.xlsx ... 24
#1 19___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS32_HPOF.xlsx ... 23
#3 28___SLATE_FPGA_BESBEV_TX_AMIC_9.6MHz_Normal_IN1_G0_0_HQ_DEC0_FS48_HPOF.xlsx ... 23
Here the regex FS(\d+)_\w+\.xlsx$ captures the digits that immediately follow FS and precede _\w+.xlsx.
In case some filenames don't match the pattern, convert to float instead of int to allow for the resulting NaNs:
df.loc[fs.astype(float).values.argsort()]
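For completeness, a hedged sketch of the sort_values approach from the question: it works once the key extracts the FS number instead of just testing for '_FS' (requires pandas >= 1.1 for the key argument, and assumes every filename contains an FS<number> token):
# sort rows by the numeric FS value embedded in each filename
df.sort_values(
    by='Filename',
    key=lambda col: col.str.extract(r'FS(\d+)', expand=False).astype(int),
)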

Average for similar looking data in a column using Pandas

I'm working on a large dataset with more than 60K rows.
I have a continuous measurement of current in a column. A code is measured for about a second, during which the equipment takes 14/15/16/17 readings depending on the equipment speed; then the measurement moves to the next code and again takes 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The first 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where the similar data from the 14/15/16/17 readings of each code measurement is averaged into a separate column. I have been thinking of doing this with pandas.
I want the data to look like:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
Need some help to get this done. Please help
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indices you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values, but as a dataframe rather than a list), convert the list to a pd.Series and reset_index().
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index()
index 0
0 0 1.349073
1 1 1.545564
2 2 1.749863
3 3 1.960851
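If you prefer to stay entirely in pandas, a sketch of the same grouping without np.split, using the jump mask to label groups (same 0.15 threshold assumed):
# label each code's block of readings by cumulatively counting the jumps,
# then average Curr(mA) within each block
group_id = df['Curr(mA)'].diff().gt(0.15).cumsum()
averages = df.groupby(group_id)['Curr(mA)'].mean().reset_index(drop=True)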

How to get maximum difference of column "through time" in pandas dataframe

I have a pandas dataframe named 'stock_data' with a MultiIndex index of ('Date', 'StockID') and a column 'Price'. The rows are ordered by date, so for the same stock a later date will have a higher row index. I want to add a new column that for each stock (i.e. group by stock) contains a number with the maximum positive difference between the prices of the stock through time, as in max_price - min_price.
To explain this further, one could calculate this in O(stocks*rows^2) by:
for each stock:
    max = 0.0
    for i in range(len(rows) - 1):
        for j in range(i + 1, len(rows)):
            if price[j] - price[i] > max:
                max = price[j] - price[i]
How do I do this in pandas without actually calculating every value and assigning it to the right spot of a new column one at a time, like the above algorithm (which could probably be improved by sorting, but that is beside the point)?
So far, I have only figured out that I can group by 'StockID' with:
stock_data.groupby(level='StockID') and pick the column stock_data.groupby(level='StockID')['Price']. But something like:
stock_data.groupby(level='StockID')['Price'].max() - stock_data.groupby(level='StockID')['Price'].min()
is not what I described above, because there is no restriction that the max() must come after the min().
Edit: The accepted solution works. Now I am also wondering if there is a way to penalize that difference by how far apart the max and the min are in time, so that shorter-term gains are weighted higher (and therefore preferred) over long-term ones with a somewhat bigger difference.
For example, maybe we could do cumsum() only up to a certain length after the min and not till the end? Somehow?
Let's try [::-1] to reverse the order to be able to get the maximum "in the future", then cummin and cummax after the groupby.
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
stock_data = pd.DataFrame({'Price': np.random.randint(0, 100, size=14)},
                          index=pd.MultiIndex.from_product(
                              [pd.date_range('2020-12-01', '2020-12-07', freq='D'),
                               list('ab')],
                              names=['date', 'stock']))
and assuming the dates are ordered in time, you can do:
stock_data['diff'] = (stock_data.loc[::-1, 'Price'].groupby(level='stock').cummax()
                      - stock_data.groupby(level='stock')['Price'].cummin())
print(stock_data)
Price diff
date stock
2020-12-01 a 37 42
b 12 59
2020-12-02 a 72 42
b 9 62
2020-12-03 a 75 42
b 5 66
2020-12-04 a 79 42
b 64 66
2020-12-05 a 16 60
b 1 70
2020-12-06 a 76 60
b 71 70
2020-12-07 a 6 0
b 25 24
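If what you ultimately want is one number per stock (the largest positive difference where the max comes after the min), the per-row diff column can be reduced afterwards; a small sketch assuming the frame above:
# one value per stock: the largest "future max minus running min" gap
max_gain = stock_data.groupby(level='stock')['diff'].max()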
