I am looking to slice strings in a dataframe column based on conditions. I understand I can assign specific values to rows in my df column based on given conditions using .loc; however, I need the condition only to determine how much to slice.
For example, if the row starts with 'A', I would like the first 6 chars ([:6]), whereas if it starts with 'B' I would like the first 8 chars ([:8]).
I am doing this in order to get the data into the correct format before I perform an inner join with another dataframe using pd.merge().
I can use df.loc[df['column'].str[:1] == 'A'], but that doesn't give me the index of the rows that satisfy the condition. The best solution I can think of is building a list of all the indexes that satisfy the condition and then manipulating each row one by one. Is there a better way to do this?
You can do this with np.select:
import numpy as np

# boolean masks: strings starting with 'A' and strings starting with 'B'
m1 = df.col.str[0] == 'A'
m2 = df.col.str[0] == 'B'
# first 6 chars where m1, first 8 where m2, otherwise the original value
df['NewCol'] = np.select([m1, m2], [df.col.str[:6], df.col.str[:8]], default=df.col)
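A quick demo on made-up data (the column name col and the values are just for illustration):

import pandas as pd
import numpy as np

df = pd.DataFrame({'col': ['ABCDEFGHIJ', 'BCDEFGHIJK', 'CDEFGHIJKL']})
m1 = df.col.str[0] == 'A'
m2 = df.col.str[0] == 'B'
df['NewCol'] = np.select([m1, m2], [df.col.str[:6], df.col.str[:8]], default=df.col)
# NewCol: 'ABCDEF' (6 chars), 'BCDEFGHI' (8 chars), 'CDEFGHIJKL' (unchanged)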
I have the following condition, which flags the last row of each group when that last row does not equal 'no' in the outcome column:
m1 = df.groupby(['id'])['outcome'].tail(1) != 'no'
I then use this condition to drop these rows from the dataframe:
df = df.drop(m1[m1].index)
However, I do not know how to do the opposite: instead of dropping these rows from the original df, extract the entire rows that satisfy the m1 condition. Any suggestions?
From the comments:
df.loc[m1[m1].index, :]
will work.
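A small illustration with made-up data (column names from the question, values hypothetical):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'outcome': ['no', 'yes', 'no', 'no']})
m1 = df.groupby(['id'])['outcome'].tail(1) != 'no'
df.drop(m1[m1].index)    # drops the last row of id 1 ('yes')
df.loc[m1[m1].index, :]  # extracts that same row instead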
I have a data frame DF in Python and I want to filter its rows based on 2 columns.
In particular, I want to remove the rows where orderdate is earlier than startdate.
How can I reverse/negate the condition inside the following code to achieve what I want?
DF = DF.loc[DF['orderdate']<DF['startdate']]
I could reframe the code like below, but it won't cover rows that have NaT, and I want to keep those:
DF = DF.loc[DF['orderdate']>=DF['startdate']]
Inserting ~ in front of the parenthesized condition negates it, so .loc keeps exactly the rows that do not satisfy the original condition. Comparisons involving NaT evaluate to False, so those rows are kept as well.
DF = DF.loc[~(DF['orderdate']<DF['startdate'])]
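A minimal sketch of why this keeps the NaT rows (the dates are made up):

import pandas as pd

DF = pd.DataFrame({'orderdate': pd.to_datetime(['2021-01-01', '2021-03-01', None]),
                   'startdate': pd.to_datetime(['2021-02-01', '2021-02-01', '2021-02-01'])})
# orderdate < startdate is False for the NaT row, so ~ turns it True
DF.loc[~(DF['orderdate'] < DF['startdate'])]
# keeps row 1 (orderdate after startdate) and row 2 (NaT)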
1 - .loc compares each row of the 'orderdate' column with the corresponding row of the 'startdate' column. Where the condition is True, the row's index label is returned, and these labels are stored in ids. Since the goal is to remove the rows where orderdate is earlier than startdate, the condition to select for dropping is orderdate < startdate; this comparison is also False for NaT, so those rows survive.
2 - The drop method deletes rows from the dataframe. Its parameters are the array with the row indices and inplace=True, which ensures the operation is performed on the dataframe itself; with inplace=False it would return a modified copy instead.
# Get the index labels of rows where orderdate < startdate
ids = DF.loc[DF['orderdate'] < DF['startdate']].index
# Delete these rows from the dataframe
DF.drop(ids, inplace=True)
I have a single df with two columns of interest, df['A'] and df['B'] (df['C'] is a timestamp). Column A holds a username, and column B holds a number.
I want to extract rows where the username+number values are (a) the same and (b) different, i.e. to show where a username is associated with more than one distinct number.
Is that possible?
I tested with set(df.A+df.B) to get unique values, but I can't do anything with this.
EDIT:
I need to make this more clear....
I'm picturing a loop whereby I start at index 0, grab its values in df['A'] and df['B'], then iterate through the remaining rows looking for a match on index 0's df['B']; if a match exists, check whether that match's df['A'] differs from index 0's df['A'], and if it does, print both rows' data; then move to the next index and repeat the process. Does that make sense?
So this will basically only print data from a dataframe df where the username string (in df['A']) is associated with different numbers (df['B'] values).
You can look for duplicated combinations of two columns with:
df[df[['A', 'B']].duplicated()]
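For the second part, showing usernames that appear with more than one distinct number, one possible sketch (not from the original answer) uses groupby/transform:

# rows whose username in A occurs with more than one distinct value in B
df[df.groupby('A')['B'].transform('nunique') > 1]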
I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found in DF1 indicating where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to zone in on just the values in DF2 where aID == 'Text'.
I believe the below gets me the first part of this question; however, I'm unsure how to incorporate the where clause.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows where aID == 'Text', giving a reduced DF from which to select the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match; note that when isin() is given a DataFrame, it aligns on both index and column labels. Then .all(axis=1) returns True for a row only when both columns are True. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DFs used:
import pandas as pd

df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
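Assuming the index alignment described above, the result on these frames would be:

#    pID  yID  Found
# 0    1   10      1
# 1    2   20      0
# 2    3   30      0
# 3    4   40      0
# 4    5   50      1

Rows 0 and 4 match a 'Text' row of df2 at the same index; row 2's yID matches but its pID does not.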
If it does not matter where those matches occur, then merge the two dataframes on the common 'pID' and 'yID' columns as the key, carrying df1's original index through the merge so the matching row labels can be recovered afterwards.
Access these indices, which indicate the matches found, and assign the value 1 to a new column named Found, filling its missing elements with 0s throughout.
# reset_index() keeps df1's original index available as the 'index' column
matched = df1_sub.reset_index().merge(df2_sub, on=['pID', 'yID'])['index']
df1.loc[matched, 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0).astype(int)
df1 is modified accordingly by the steps above.
Given a dataframe C with columns time and session, I want to check whether the values in the two columns match in each row, and then do some operation where they do. I'm wondering if there is a vectorized solution to this; currently this is what I'm doing:
for i in range(len(C['time'])):
    if C['time'][i] == C['session'][i]:
        # do something
You can index your original dataframe with your equality condition and then operate on the result:
C.loc[C['time'] == C['session'], ] = ...result of some operation...
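For instance, a hypothetical sketch that writes a flag column only on the matching rows (the column name matched and the data are made up):

import pandas as pd

C = pd.DataFrame({'time': [1, 2, 3], 'session': [1, 5, 3]})
# assign only where the two columns agree; unmatched rows are left as NaN
C.loc[C['time'] == C['session'], 'matched'] = True
C['matched'] = C['matched'].fillna(False)
# rows 0 and 2 end up True, row 1 False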