Referencing the next index in iterrows() - Python

I have a Pandas DataFrame which looks like this:
      top         heading  page_no
0  000000           Intro        0
1  100164         Summary        1
2  100451      Experience        1
3  200131          Awards        2
4  200287          Skills        2
5  300147       Education        3
6  300273          Awards        3
7  300329       Interests        3
8  300434  Certifications        3
9  401135             End        4
I use this dataframe in a filter to get the contents from another dataframe. It needs to filter everything between the tops, i.e. from 000000 to 100164, and so on up to 300434 to 401135.
for index, row in df_heads.iterrows():
    begin = int(row['top'])
    end = ???
    filter_result = result['data'][(result.top < end) & (result.top > begin)]
    print(row['heading'])
    print(filter_result)
    sections[row['heading']] = filter_result
    end = begin
What should end be initialized to so that the filter picks up the contents correctly?

I think you can create a new column with shift and then replace the last NaN with 0, if necessary, using fillna:
df_heads['shifted_top'] = df_heads['top'].shift(-1).fillna(0)
print(df_heads)
      top         heading  page_no  shifted_top
0       0           Intro        0     100164.0
1  100164         Summary        1     100451.0
2  100451      Experience        1     200131.0
3  200131          Awards        2     200287.0
4  200287          Skills        2     300147.0
5  300147       Education        3     300273.0
6  300273          Awards        3     300329.0
7  300329       Interests        3     300434.0
8  300434  Certifications        3     401135.0
9  401135             End        4          0.0
for index, row in df_heads.iterrows():
    begin = int(row['top'])
    end = int(row['shifted_top'])
    print(begin, end)
0 100164
100164 100451
100451 200131
200131 200287
200287 300147
300147 300273
300273 300329
300329 300434
300434 401135
401135 0

You cannot access a different row's data from within a for index, row in df_heads.iterrows() loop. You need an additional column created outside of the loop that holds the other row's data, as in the example above:
df_heads['shifted_top'] = df_heads['top'].shift(-1).fillna(0)
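Putting the answer together with the loop from the question, a minimal sketch (assuming result and sections exist as in the question; note the last row gets end = 0, so the 'End' heading matches nothing):
df_heads['shifted_top'] = df_heads['top'].shift(-1).fillna(0)

sections = {}
for index, row in df_heads.iterrows():
    begin = int(row['top'])
    end = int(row['shifted_top'])
    # keep only the content rows lying strictly between this heading and the next
    filter_result = result['data'][(result.top > begin) & (result.top < end)]
    sections[row['heading']] = filter_result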

Related

Design trailing stop column in pandas

I created a dataframe with pandas and calculated the percentage earned or lost. I would like to design two columns, entered market and trail, for a trailing stop backtest (for example, a 5% stop), like this:
earning/losing  entered market  trail
             0               0      0
             1               1      1
             2               1      2
             3               1      3
             7               1      7
             4               1      7
             5               1      7
             8               1      8
             2               0      0
             5               0      0
             4               0      0
I tried using a numpy condition to create it, but I can't complete the rest of the condition:
import numpy as np

condition = [(df['earning/losing'] > 0) &
             (df['earning/losing'] > df['earning/losing'].shift(-1)) &
             (df['earning/losing'] - df['earning/losing'].shift(-1) < 5)]
value = [df['earning/losing']]
df['trail'] = np.select(condition, value, default=0)
I think if I could create a column like trail, then I could judge the trailing condition, but I don't know how to create the trail column in pandas. Can anyone help me out? Thanks a lot!
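A minimal sketch of one way to build the two columns (assumptions inferred from the sample output above: you enter when earning/losing first turns positive, exit once the drawdown from the running peak reaches 5, and do not re-enter afterwards):
import pandas as pd

# hypothetical data matching the sample output above
df = pd.DataFrame({'earning/losing': [0, 1, 2, 3, 7, 4, 5, 8, 2, 5, 4]})

entered, trail, stopped = 0, 0, False  # stopped: no re-entry, per the sample
entered_col, trail_col = [], []

for val in df['earning/losing']:
    if not entered and not stopped and val > 0:
        entered, trail = 1, val            # enter the market
    elif entered:
        trail = max(trail, val)            # track the running peak
        if trail - val >= 5:               # trailing stop of 5 hit
            entered, trail, stopped = 0, 0, True
    entered_col.append(entered)
    trail_col.append(trail)

df['entered market'] = entered_col
df['trail'] = trail_col
print(df)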

In Python, how to check for specific words/tags in a column and show their presence in new related columns

I am new to Python. I have a column in an MS Excel file in which four tags are used: LOC, ORG, PER and MISC. The given data looks like this:
1 LOC/Thai Buddhist temple;
2 PER/louis;
3 ORG/WikiLeaks;LOC/Southern Ocean;
4 ORG/queen;
5 PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;
6
7 PER/Thomas Watson;
... # continues up to 2,000 rows
I want a result that shows, for each row, which tags are present: if a tag is present, put a 1 in its column, and if not, put a 0. I want four new columns (LOC, ORG, PER and MISC) added to this Excel file as the 2nd, 3rd, 4th and 5th columns, with the given data as the first column. The file contains almost 2,815 rows, and every row has a different mix of these tags.
My goal is to count, from the new columns, the total number of LOC, ORG, PER and MISC tags.
The result will be like this:
  given data                          LOC  ORG  PER  MISC
1 LOC/Thai Buddhist temple;             1    0    0     0   # only LOC is present
2 PER/louis;                            0    0    1     0   # only PER is present
3 ORG/WikiLeaks;LOC/Southern Ocean;     1    1    0     0   # LOC and ORG are present
4 PER/Eli Wallach;MISC/The Good;        0    0    1     1   # PER and MISC are present
5 ...
6                                       0    0    0     0   # no tag is present
7 ...
... # continues up to 2,815 rows
I am a beginner in Python, so I have tried my best to search for a solution, but I cannot find any program related to my problem; that is why I am posting here. I would appreciate any help.
I assume you have successfully read the data from Excel and created a dataframe in Python using pandas (to read the Excel file: df1 = pd.read_excel("File/path/name.xls")).
Here is the layout of your dataframe df1
Colnum | Tagstring
1 |LOC/Thai Buddhist temple;
2 |PER/louis;
3 |ORG/WikiLeaks;LOC/Southern Ocean;
4 |ORG/queen;
5 |PER/Sanchez;PER/Eli Wallach;MISC/The Good, The Bad and the Ugly;
6 |PER/Thomas Watson;
Now, there are a couple of ways to search for text in a string.
I will demonstrate the find function:
Syntax: str.find(sub, beg=0, end=len(string))
str1 = "LOC";
str2 = "PER";
str3 = "ORG";
str4 = "MISC";
df1["LOC"] = (if Tagstring.find(str1) >= 0 then 1 else 0).astype('int')
df1["PER"] = (if Tagstring.find(str2) >= 0 then 1 else 0).astype('int')
df1["ORG"] = (if Tagstring.find(str3) >= 0 then 1 else 0).astype('int')
df1["MISC"] = (if Tagstring.find(str4) >= 0 then 1 else 0).astype('int')
If you have read your data into df, then you can do:
pd.concat([df, pd.DataFrame({i: df.Tagstring.str.contains(i).astype(int)
                             for i in 'LOC ORG PER MISC'.split()})], axis=1)
Out[716]:
Tagstring LOC ORG PER MISC
Colnum
1 LOC/Thai Buddhist temple; 1 0 0 0
2 PER/louis; 0 0 1 0
3 ORG/WikiLeaks;LOC/Southern Ocean; 1 1 0 0
4 ORG/queen; 0 1 0 0
5 PER/Sanchez;PER/Eli Wallach;MISC/The Good, The... 0 0 1 1
6 PER/Thomas Watson; 0 0 1 0
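Since the stated goal is the per-tag totals, summing the new indicator columns gives them; a small follow-up sketch (column names as above):
totals = df1[['LOC', 'ORG', 'PER', 'MISC']].sum()
print(totals)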

Using Python loop for next i rows in dataframe

I'm a new Python user (making the shift from VBA) and am having trouble figuring out Python's loop syntax. I have a dataframe df, and I want to create a column of values based on a condition being met in other columns, checked in a loop. Something like the below:
cycle = 5
dummy = 1
for i in range(1, cycle + 1):
    if (df["high"].iloc[i] >= df["exit"].iloc[i]
            and df["low"].iloc[i] <= df["exit"].iloc[i]):
        df["signal"] = dummy
        break
    elif i == cycle:
        df["signal"] = cycle + 1
        break
    else:
        dummy = dummy + 1
Basically, I'm trying to find in which of the next rows, up to the cycle variable, the conditions in the if statement are met, and if they're never met, assign cycle + 1. So df["signal"] will be a column of numbers ranging from 1 to (cycle + 1). Also, there are some NaN values in df["exit"]; I'm not sure how that affects the loop.
I've found fairly extensive documentation on row iterations on the site, I feel like this is close to where I need to get to, but can't figure out how to adapt it. Thanks for any advice!
EDIT: INCLUDED DATA SAMPLE FROM EXCEL CELLS BELOW:
high  low  EXIT  test  signal/(OUTPUT COLUMN)
   4    3     4     1       1
   2    2     2     1       1
   2    3     5     0       6
   4    3     1     0       5
   2    5     2     0       4
   5    5     1     0       3
   3    1     5     0       2
   5    1     5     1       1
   1    1     4     0       0
EDIT 2: FURTHER CLARIFICATION AROUND SCRIPT
Once the condition
df["high"].iloc[i] >= df["exit"].iloc[i] and
df["low"].iloc[i] <= df["exit"].iloc[i]
is met in the loop, it should terminate for that particular instance/row.
EDIT 3: EXPECTED OUTPUT
The expected output is the df["signal"] column - it is the first instance in the loop where the condition
df["high"].iloc[i] >= df["exit"].iloc[i] and
df["low"].iloc[i] <= df["exit"].iloc[i]
is met in any given row. The output in df["signal"] is effectively i from the loop, or the given iteration.
Here is how I would solve the problem; the column 'gr' must not exist before doing this:
# first check all the rows meeting the conditions and add 1 in a temporary column gr
df.loc[(df["high"] >= df["exit"]) & (df["low"] <= df["exit"]), 'gr'] = 1
# manipulate column gr to use groupby after
df['gr'] = df['gr'].cumsum().bfill()
# use cumcount after groupby to recalculate signal
df.loc[:,'signal'] = df.groupby('gr').cumcount(ascending=False).add(1)
# cut the value in signal to the value cycle + 1
df.loc[df['signal'] > cycle, 'signal'] = cycle + 1
# drop column gr
df = df.drop(columns='gr')
and you get
high low exit signal
0 4 3 4 1
1 2 2 2 1
2 2 3 5 6
3 4 3 1 5
4 2 5 2 4
5 5 5 1 3
6 3 1 5 2
7 5 1 5 1
8 1 1 4 1
Note: the last row does not come out right, because no later row ever meets the condition; I'm not sure how that will look in the full data or how you want to handle it. You may consider adding df = df.dropna(subset=['gr']) after the line starting with df['gr'] = ... to drop those trailing rows, up to you.
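For comparison, a plain look-ahead loop that reproduces the same signal semantics (a hedged sketch: the row itself counts as attempt 1, and rows with no match within cycle rows get cycle + 1, which is also where the last row diverges from the sample):
cycle = 5

def first_hit(idx):
    # check the current row and up to cycle - 1 rows ahead for the first match
    for i in range(cycle):
        if idx + i >= len(df):
            break
        if (df["high"].iloc[idx + i] >= df["exit"].iloc[idx + i]
                and df["low"].iloc[idx + i] <= df["exit"].iloc[idx + i]):
            return i + 1
    return cycle + 1

df["signal"] = [first_hit(i) for i in range(len(df))]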

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated continuous occurrences of zero values I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and leave the remaining single 0s to make a new df: [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of zero runs to delete is a parameter that has to be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) |
                 ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive
Following the example in the link below, add a new column to track consecutive occurrences and later check it to filter (the threshold here is 2, per the question):
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]
We need to build a new grouping parameter here, then use drop_duplicates:
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :],
               df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) finds the values equal to 0
# diff().ne(0).cumsum() puts consecutive equal values into the same group
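For reference, a parameterised sketch of the run-length idea behind both answers (assuming a column A and the question's threshold of 2; run lengths come from grouping on the same cumsum trick):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})

max_zeros = 2  # keep zero runs no longer than this

run_id = df['A'].ne(df['A'].shift()).cumsum()        # label each run of equal values
run_len = df.groupby(run_id)['A'].transform('size')  # length of the run each row is in
result = df[~((df['A'] == 0) & (run_len > max_zeros))]
print(result['A'].tolist())  # [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]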

Pandas Time-Series: Find previous value for each ID based on year and semester

I realize this is a fairly basic question, but I couldn't find what I'm looking for through searching (partly because I'm not sure how to summarize what I want). In any case:
I have a dataframe that has the following columns:
* ID (each one represents a specific college course)
* Year
* Term (0 = fall semester, 1 = spring semester)
* Rating (from 0 to 5)
My goal is to create another column for Previous Rating. This column would be equal to the course's rating the last time the course was held, and would be NaN for the first offering of the course. The goal is to use the course's rating from the last time the course was offered in order to predict the current semester's enrollment. I am struggling to figure out how to find the last offering of each course for a given row.
I'd appreciate any help in performing this operation! I am working in Pandas but could move my data to R if that'd make it easier. Please let me know if I need to clarify my question.
I think there are two critical points: (1) sorting by Year and Term so that the order corresponds to temporal order; and (2) using groupby to collect on IDs before selecting and shifting the Rating. So, from a frame like
>>> df
ID Year Term Rating
0 1 2010 0 2
1 2 2010 0 2
2 1 2010 1 1
3 2 2010 1 0
4 1 2011 0 3
5 2 2011 0 3
6 1 2011 1 4
7 2 2011 1 0
8 2 2012 0 4
9 2 2012 1 4
10 1 2013 0 2
We get
>>> df = df.sort_values(["ID", "Year", "Term"])
>>> df["Previous_Rating"] = df.groupby("ID")["Rating"].shift()
>>> df
ID Year Term Rating Previous_Rating
0 1 2010 0 2 NaN
2 1 2010 1 1 2
4 1 2011 0 3 1
6 1 2011 1 4 3
10 1 2013 0 2 4
1 2 2010 0 2 NaN
3 2 2010 1 0 2
5 2 2011 0 3 0
7 2 2011 1 0 3
8 2 2012 0 4 0
9 2 2012 1 4 4
Note that we didn't actually need to sort by ID -- the groupby would have worked equally well without it -- but this way it's easier to see that the shift has done the right thing. Reading up on the split-apply-combine pattern might be helpful.
Use this function to create the new column:
DataFrame.shift(periods=1, freq=None, axis=0, **kwds)
Shift index by desired number of periods with an optional time freq.
Let's say you have a dataframe like this...
ID Rating Term Year
1 1 0 2002
2 2 1 2003
3 3 0 2004
2 4 0 2005
where ID is the course ID and you have multiple entries for each ID based on year and semester. Your goal is to find the row for a given ID with the most recent year and term.
For that you can do this...
df[((df['Year'] == max(df.Year)) & (df['ID'] == 2) & (df['Term'] == 0))]
Here we find the last offering of the course with the given ID and term. If you want the rating, then you can do:
df[((df['Year'] == max(df.Year)) & (df['ID'] == 2) & (df['Term'] == 0))].Rating
Hope this is the result you were trying to accomplish.
Thanks.
