From a SQL query I pull a list of activities that happen in the shop; one of the columns is labeled 'STATUS'.
I want to go down the dataframe and pull the whole row every time the value in 'STATUS' changes.
I've created a dataframe from the query and called it df
Changed the column types to what I needed
Created a list of all the column headers
Used that list to create an empty dataframe to which I intended to append
Tried creating a for loop with the condition described above
headerlist = df.columns.values.tolist()
newdf = pd.DataFrame(columns=headerlist)
for index, row in df.iterrows():
    if index > 0 and df.STATUS[index] != df.STATUS[index - 1]:
        newdf = newdf.append(row)
I've uploaded an image here that represents what I'm trying to achieve: https://imgur.com/a/XXEOlKs
Thank you in advance
Not sure that's the best answer (I'm kinda new to Python too).
Let's say your original dataframe is df:
for (_, even), (_, odd) in zip(df.iloc[::2].iterrows(), df.iloc[1::2].iterrows()):
    if even.STATUS != odd.STATUS:
        # do your append here
        pass
You should give a runnable df example so we can run our code before posting it, to be sure it really works.
I think I got it! I tried for an hour before posting and as soon as I posted it clicked
It's a lot simpler than a for loop.
I created a new column called 'MATCH' that holds a boolean for whether or not the value of 'STATUS' is equal to the previous one.
Then I just filter by df.MATCH == False:
df['MATCH'] = df.STATUS.eq(df.STATUS.shift())
df[df.MATCH == False]
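For example, on a small made-up frame (the data here is hypothetical):

import pandas as pd

df = pd.DataFrame({'STATUS': ['run', 'run', 'idle', 'idle', 'run']})
df['MATCH'] = df.STATUS.eq(df.STATUS.shift())
print(df[df.MATCH == False])  # rows 0, 2 and 4: each point where STATUS changes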
Try this
df[df['STATUS'].ne(df['STATUS'].shift().bfill())]
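(The .bfill() fills the NaN that .shift() leaves in the first position with the first real value, so the first row is not flagged as a change.)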
Related
I know that if we want to check if one value exists in a dataframe we use isin(). However, I want the position or positions where it is found in the other dataframe.
Like df1['Column1'].isin(df2['Column2']) only returns True if it is contained in df2. But I want the position where it is found in df2.
I do not want to loop over the dataframes because I have a very large dataset. Is there any function in pandas, or a quick way to do it, without having to loop?
Every row in a pandas dataframe has an index (0, 1, 2, ... by default, or whatever you set it to). If you would like to get the positions, try .index:
df1[df1['Column1'].isin(df2['Column2'])].index
Updated:
df1_index = df1.index[df1['col1'].isin(df2['col1'])]
df2_index = df2.index[df2['col1'].isin(df1['col1'])]
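If you want, for each row of df1, the position where its value appears in df2, one vectorized way is to build a lookup Series and map it (a sketch; it assumes the values in df2['Column2'] are unique):

import pandas as pd

# df2's index, keyed by the values of its lookup column
lookup = pd.Series(df2.index, index=df2['Column2'])
df1['df2_index'] = df1['Column1'].map(lookup)  # NaN where there is no match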
You might try this:
filtered_df = df1[df1['Column1'].isin(df2['Column2'])]
print(filtered_df)
Does this work?
I have a few datasets that share the same columns, so I concatenated them together to form one large dataframe. My idea is to filter a goals_per_90 column by > .5, so it will create a new dataframe showing the whole rows of all the players with a value greater than .5. I'm thinking of something like this at the moment, but I get stuck:
def gettopplayers(Dataframe):
    if Dataframe.loc[Dataframe['goals_per_90_overall'] > .5]:
        apply.
I'm getting lost as to where to append this row.
Any help would be greatly appreciated. Thank you!
The Python code below will make a new dataframe with all the rows where the condition is met. No need for the if condition.
df_new = Dataframe.loc[(Dataframe['goals_per_90_overall'] > .5)]
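If you still want this wrapped in a function like your gettopplayers sketch, a minimal version could look like this (the threshold parameter is just illustrative):

def get_top_players(df, threshold=0.5):
    # keep only the rows whose goals_per_90_overall exceeds the threshold
    return df.loc[df['goals_per_90_overall'] > threshold]

top_players = get_top_players(Dataframe)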
I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
# 1. Create the empty dataframe
emptydf = Claims[0:0]

# 2. Parse the dataframe by one Part ID number and append to the empty dataframe
Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)

As you can see, I can only hard-code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!
empty_df = list(Claims.groupby(Claims['Part_ID']))
This will create a list of tuples, one for each Part ID. Each tuple has two elements: the first is the Part ID and the second is the subset of Claims for that Part ID.
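If the goal is to process each Part ID's rows one at a time, you can also iterate over the groupby directly instead of materializing the list (a sketch, assuming the column is named Part_ID):

# iterate the groups without hard-coding any Part ID
for part_id, subset in Claims.groupby('Part_ID'):
    # `subset` holds all the rows for this Part ID; process or append it here
    ...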
I have a task of reading the columns of a Cassandra table into a dataframe to perform some operations. I want to feed the data incrementally: if a table has 5 columns, I want
the first column in the first iteration,
the first and second columns in the second iteration, fed to the same dataframe,
and so on.
I need generic code. Has anyone tried something similar to this? Please help me out with an example.
This will work:
df2 = pd.DataFrame()
for i in range(len(df.columns)):
    df2 = df2.append(df.iloc[:, 0:i + 1], sort=True)
Since append aligns the slices on column names, and a dataframe cannot have the same column name twice, each iteration keeps adding rows (with NaN in the columns that a narrower slice does not contain).
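Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on a recent pandas the same build-up can be written with pd.concat:

import pandas as pd

# one slice per iteration: the first column, the first two columns, and so on
df2 = pd.concat(
    [df.iloc[:, 0:i + 1] for i in range(len(df.columns))],
    sort=True,
)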
You can extract the names from the dataframe's schema and then access each particular column and use it the way you want to.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
# df[columns] -- use it the way you want
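To get the incremental feed the question describes (the first column, then the first two, and so on), one sketch is to widen the selection by one name per iteration; df[columns[:i + 1]] is the pandas spelling, while a Spark dataframe would use df.select(columns[:i + 1]):

for i in range(len(columns)):
    subset = df[columns[:i + 1]]  # columns 0..i
    # perform your operations on `subset` here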
I am looking for a cleaner way to achieve the following:
I have a DataFrame with certain columns that I want to update when new information arrives. This "new information", in the form of a pandas DataFrame (from a CSV file), can have more or fewer rows; however, I am only interested in updating the rows that already exist, not adding new ones.
Original DataFrame
DataFrame with new information
(Note the missing name "c" here and the change in "status" for name "a")
Now, I wrote the following "inconvenient" code to update the original DataFrame with the new information
Updating the "status" column based on the "name" column
for idx, row in df_base.iterrows():
    if not df_upd[df_upd['name'] == row['name']].empty:
        df_base.loc[idx, 'status'] = df_upd.loc[df_upd['name'] == row['name'], 'status'].values
It achieves exactly what I want, but it neither looks nice nor seems efficient, and I hope that there might be a cleaner way. I tried the pd.merge method, however, the problem is that it would be adding new columns instead of "updating" the cells in that column.
pd.merge(left=df_base, right=df_upd, on=['name'], how='left')
I am looking forward to your tips and ideas.
You could set_index("name") and then call .update:
>>> df_base = df_base.set_index("name")
>>> df_upd = df_upd.set_index("name")
>>> df_base.update(df_upd)
>>> df_base
      status
name
a          0
b          1
c          0
d          1
More generally, you can set the index to whatever seems appropriate, update, and then reset as needed.
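For instance, to keep "name" as a regular column afterwards (a sketch of the same steps):

df_base = df_base.set_index('name')
df_base.update(df_upd.set_index('name'))
df_base = df_base.reset_index()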