I am trying to compare all rows within a group to check whether a condition is fulfilled. If the condition is not fulfilled, I set the new column to True, else False. The issue I am having is finding a neat way to compare all rows within each group. I have something that works, but it breaks down when a group contains many rows.
for i in range(8):
    n = -i-1
    cond=(((df['age']-df['age'].shift(n))*(df['weight']-df['weight'].shift(n)))<0)&(df['ref']==df['ref'].shift(n))&(df['age']<7)&(df['age'].shift(n)<7)
    df['x'+str(i)] = cond.groupby(df['ref']).transform('any')
df.loc[:,'WFA'] = 0
df.loc[(df['x0']==False)&(df['x1']==False)&(df['x2']==False)&(df['x3']==False)&(df['x4']==False)&(df['x5']==False)&(df['x6']==False)&(df['x7']==False),'WFA'] = 1
To compare rows, I have created a loop that compares adjacent rows (using shift), where each iteration looks at the next adjacent row. In effect, I can compare all rows within a group as long as the group has 8 or fewer rows. As you can imagine, this becomes pretty cumbersome as the number of rows grows.
Instead of creating a column for each shift period, I want to check whether any row matches the condition with any other row, and then set the new column 'WFA' True or False.
If anyone is interested, I am posting the answer to my own question here (although it is very slow):
df.loc[:,'WFA'] = 0
for ref, gref in df.groupby('ref'):
    count = 0
    for r_idx, row in gref.iterrows():
        cond = ((((row['age']-gref.loc[gref['age']<7, 'age'])*(row['weight']-gref.loc[gref['age']<7, 'weight']))<0).any())&(row['age']<7)
        if not cond:
            count += 1
    if count == len(gref):
        df.loc[df['ref']==ref, 'WFA'] = 1
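For anyone who needs something faster, here is a rough vectorized sketch. It is only an alternative I put together under the same column names (ref, age, weight) and assumes an n-by-n comparison per group fits in memory; it is not a verified drop-in replacement:

import numpy as np

def has_discordant_pair(g):
    # Only rows with age < 7 take part in the comparison.
    sub = g.loc[g['age'] < 7]
    a = sub['age'].to_numpy()
    w = sub['weight'].to_numpy()
    # Broadcasting compares every row of the group against every other row.
    prod = (a[:, None] - a[None, :]) * (w[:, None] - w[None, :])
    return bool((prod < 0).any())

flag = df.groupby('ref').apply(has_discordant_pair)  # True if any pair in the group violates the condition
df['WFA'] = (~df['ref'].map(flag)).astype(int)       # WFA = 1 when no pair violates it

Whether this is actually faster depends on the group sizes, since it builds an n-by-n matrix for each group.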
I'm working with a large dataset that includes all police stops in my city since 2014. The dataset has millions of rows, but sometimes there are multiple rows for a single stop (so, if the police stopped a group of 4 people, it is included in the database as 4 separate rows even though it's all the same stop). I'm looking to create a new column in the dataset, orderInStop, which numbers the people stopped sequentially. The first person caught up in the stop would have a value of 1, the second person a value of 2, and so on.
To do so, I have used the groupby() function to group all rows that match on time & location, which is the indication that the rows are all part of the same stop. I can manage to create a new column that includes the TOTAL count of the number of people in the stop (so, if there were 4 rows with the same time & location, all four rows have a value of 4 for the new orderInStop variable). But I need the first row in the group to have a value of 1, the second a value of 2, the third 3, and the fourth 4.
Below is my code attempt at iterating through each group I've created to sequentially count each row within each group, but the code doesn't quite work (it populates the entire column rather than each row within the groups). Any help to tweak this code would be much appreciated!
Note: I also tried using logical operators in a for loop, to essentially ask IF the time & location column values match for the current and previous rows, but ran into too many problems with 'the truth value of a Series is ambiguous' errors, so instead I'm trying to use groupby().
Attempt that creates a total count rather than sequential count:
df['order2'] = df.groupby(by=["Date_Time_Occur", "Location"])['orderInStop'].transform('count')
Attempt that fails, to iterate through each row in each group:
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for row in groups:
        df['order3'] = count
        count = count + 1
In your example for row in groups iterates over the column names, since groups is a DataFrame.
To iterate over each row you could do
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for i, row in groups.iterrows():  # i is the row's index label, row a pandas Series
        df.loc[i, 'order3'] = count   # assign to the single row, not the whole column
        count = count + 1
Note that your solution relies on pandas groupby to preserve row order. This should be the case, see this question, but there is very likely a shorter & safer solution (see fsimonjetz comment for a starting point).
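For reference, a minimal sketch of such a shorter solution, using the same grouping columns as in the question, is groupby().cumcount(), which numbers the rows of each group starting at 0:

# cumcount gives 0, 1, 2, ... within each (time, location) group; add 1 to start counting at 1
df['orderInStop'] = df.groupby(by=["Date_Time_Occur", "Location"]).cumcount() + 1

cumcount follows the order in which the rows appear, so the numbering matches the row order within each stop.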
I have a pandas DataFrame with more than 100 thousand rows. The index represents the time, and two columns represent the sensor data and the condition.
When the condition becomes 1, I want to start calculating a score card (average and standard deviation) until the next 1 comes. This needs to be calculated for the whole dataset.
Here is a picture of the DataFrame for a specific time span:
What I thought of is to iterate through the index and items of the df and, when the condition is met, start calculating the descriptive statistics.
cycle = 0
for i, row in df_b.iterrows():
    if row['condition'] == 1:
        print('Condition is changed')
        cycle += 1
        print('cycle: ', cycle)
        #start = ?
        #end = ?
        #df_b.loc[start:end]
I am not sure how to calculate start and end for this DataFrame. The end of one cycle will be the start of the next cycle. Additionally, I think this iteration is not optimal because it takes quite a long time. I appreciate any idea or solution for this problem.
Maybe start out with getting the rows where condition == 1:
cond_1_df = df.loc[df['condition'] == 1]
This dataframe will only contain the rows that meet your condition (being 1).
From here on, you can access the timestamps pairwise, meaning that the first element is the beginning and the second element is the end, as sketched below:
former = 0
stamp_pairs = []
df = cond_1_df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    if former != 0:
        beginning = former
        end = row["timestamp"]
        former = row["timestamp"]
    else:
        beginning = 0
        end = row["timestamp"]
        former = row["timestamp"]
    stamp_pairs.append([beginning, end])
This should give you something like this:
[[stamp0, stamp1], [stamp1,stamp2], [stamp2, stamp3]...]
For each of these pairs, you can again create a df containing only the subset of rows where stamp_x < timestamp < stamp_x+1:
time_cond_df = df.loc[(df['timestamp'] > stamp_x) & (df['timestamp'] < stamp_x+1)]
Finally, you get one time_cond_df per timestamp tuple, on which you can perform your score calculations.
Just make sure that your timestamps are comparable with the operators ">" and "<"! We can't tell, since you did not show how you produced the timestamps.
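As a side note, an alternative sketch skips the explicit loop altogether by labelling every cycle with a cumulative sum of the condition column and grouping on that label. The sensor column is assumed to be called 'sensor' here, since its real name was not shown:

# Each row where condition == 1 starts a new cycle, so the cumulative sum
# of the condition column assigns a cycle number to every row.
df_b['cycle'] = (df_b['condition'] == 1).cumsum()

# Average and standard deviation of the sensor values per cycle.
score_card = df_b.groupby('cycle')['sensor'].agg(['mean', 'std'])

Rows before the first 1 end up in cycle 0 and can be dropped if they should not be scored.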
I am pretty new to pandas and trying to learn it. So, any advice would be appreciated :)
This is just a small part of my whole dataframe DF2:
   Chromosome_Name Sequence_Source Sequence_Feature   Start     End Strand            Gene_ID        Gene_Name
0                1  ensembl_havana             gene   14363   34806      -  "ENSG00000227232"         "WASH7P"
1                1          havana             gene   89295  138566      -  "ENSG00000238009"   "RP11-34P13.7"
2                1          havana             gene  141474  178862      -  "ENSG00000241860"  "RP11-34P13.13"
3                1          havana             gene  227615  272253      -  "ENSG00000228463"     "AP006222.2"
4                1  ensembl_havana             gene  312720  453948      +  "ENSG00000237094"  "RP4-669L17.10"
These are my conditions:
Condition 1: Reference row's "Start" value <= Other row's "End" value.
Condition 2: Reference row's "End" value >= Other row's "Start" value.
This is what I have done so far:
chromosome_list = ["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y"]
dataFrame = DF2.groupby(["Chromosome_Name"])
for chromosome in chromosome_list:
    CHR = dataFrame.get_group(chromosome)
    for i in range(0, len(CHR)-1):
        for j in range(i+1, len(CHR)):
            Overlap_index = DF2[(DF2.loc[i, ["Chromosome_Name"] == chromosome]) & (DF2.loc[i, ["Start"]] <= DF2.loc[j, ["End"]]) & (DF2.loc[i, ["End"]] >= DF2.loc[j, ["Start"]])].index
            DF2 = DF2.drop(Overlap_index)
The chromosome_list is all the unique values of column "Chromosome_Name".
Mainly, I want to check, for each row, whether the "Start" and "End" column values satisfy the conditions above. I believe I need to compare a single row (the reference row) against the other rows in the data frame. However, to achieve this I need to take the value of the first column, "Chromosome_Name", into account.
More specifically, every row in DF2 should be checked according to the conditions stated above, but, for example, a row with Chromosome_Name = 5 shouldn't be checked against a row with Chromosome_Name = 12. Therefore, I first thought I should split the dataframe with pd.groupby() according to Chromosome_Name and then, using those sub-dataframes' indexes, drop the matching rows from DF2. However, it did not work :)
P.S. After DF2 is split into sub-dataframes (according to the unique Chromosome_Name values), each sub-dataframe has a different size, e.g. there are 641 rows for Chromosome_Name = X but 19342 rows for Chromosome_Name = 1.
If you know how to correct my code or provide me another solution, I would be glad.
Thanks in advance.
I am new to pandas too, so I do not want to give you wrong insights or advice, but have you ever thought of converting the Start and End columns to lists? That way you could use plain if statements if you are not comfortable with pandas and your task is urgent. However, I am aware that converting the dataframe into lists is somewhat the opposite of what pandas is made for.
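If you would rather stay in pandas, here is a rough sketch under one possible interpretation of the task: within each chromosome, sort by Start and keep only the first interval of every overlapping pair. It is only a starting point, not a verified solution:

import pandas as pd

def drop_overlaps(group):
    # Sort by Start so each row only needs to be checked against the last kept interval.
    g = group.sort_values("Start")
    keep = []
    prev_end = None
    for idx, row in g.iterrows():
        # Overlap: Start <= previous End (previous Start <= current End already holds after sorting).
        if prev_end is not None and row["Start"] <= prev_end:
            continue  # overlapping row, drop it
        keep.append(idx)
        prev_end = row["End"]
    return g.loc[keep]

DF2_no_overlap = DF2.groupby("Chromosome_Name", group_keys=False).apply(drop_overlaps)

Depending on which of two overlapping genes you actually want to keep, the keep/drop decision inside the loop would need to change.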
Using openpyxl I am creating a python script that will loop through the rows of data and find rows in which some of the columns are empty - these will be deleted. The range of rows is 3 to 1800.
I am not exactly sure how to delete these rows - please see the code I have come up with so far.
What I was trying to achieve is to iterate through the rows and check if the values in columns 4 and 7 are None. If so, I want to collect the row number in a suitable collection (I need advice on which one would be best for this) and then create another loop that deletes those rows in reverse order, so that the deletions don't shift the remaining rows and end up removing rows that contain data.
I believe there may be an easier function to do this, but I could not find this particular answer.
rows_to_delete = []
for i in worksheet.iter_rows(min_row=3, max_row=1800):
    emp_id = i[4]
    full_name = i[7]
    if emp_id.value is None and full_name.value is None:
        rows_to_delete.append(i)
Your iteration looks good.
OpenPyXl offers worksheet.delete_rows(idx, amt=1) where idx is the index of the first row to delete (1-based) and amt is the amount of rows to delete beyond that index.
So worksheet.delete_rows(3, 4) will delete rows 3, 4, 5, and 6 and worksheet.delete_rows(2, 1) will just delete row 2.
In your case, you'd probably want to do something like worksheet.delete_rows(i, 1).
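Putting that together with the loop from the question, a sketch of the reversed deletion could look like this (it keeps the question's positions 4 and 7 within each row tuple; adjust them if your columns differ):

rows_to_delete = []
for row in worksheet.iter_rows(min_row=3, max_row=1800):
    if row[4].value is None and row[7].value is None:
        rows_to_delete.append(row[0].row)  # .row is the 1-based worksheet row number

# Delete from the bottom up so earlier row numbers are not shifted by the deletions.
for row_number in reversed(rows_to_delete):
    worksheet.delete_rows(row_number, 1)

Deleting in reverse is what keeps the collected row numbers valid, which matches the reversed deletion described in the question.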
I have to loop over a data frame twice with two pointers, where the first pointer goes from 0 to len(df) and the second pointer goes from the first pointer to len(df). The inner loop also breaks when a certain condition is met. So I tried the approaches below.
for upper_index, upper_row in df.iterrows():
    for index, row in df[(upper_index+1):len(data)].iterrows():
        if condition met:
            break

for upper in df.itertuples():
    for inner in df.iloc[(upper.Index+1):].itertuples():
        if condition met:
            break

for upper in df.values:
    for inner in df[(counter+1):].values:
        if condition met:
            break
The "condition met" part compares column values of the two rows (the first and second pointer) and decides whether to break.
For the third method, counter is a variable I keep to track the first pointer's index.
But surprisingly all methods are slow and take many hours, even though my df has only 40,000 rows.
Is there a better way to do it?
Any help is appreciated.