I am trying to find rows that hit specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will hold multiple pieces of info, so I need to split the value in myCol without changing the dataframe and check whether any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way to do this without actually splitting and saving the result in an extra variable?
Currently, I do not know how the multiple pieces of info will be represented (either separated by a character or simply concatenated one after another, each consisting of 4 alphanumeric characters).
Let's say you need to split on "-" for your myCol column.
sep = '-'
# expand=True returns one column per split token, leaving df untouched
deconcat = df['myCol'].str.split(sep, expand=True)
new_df = df.join(deconcat)
The new_df DataFrame will have the same index as df, so you can do whatever you need with new_df and then join the result back to df to filter it as required.
You can do the above .isin code for each of the new split columns to get your desired result.
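If you'd rather not keep the split columns around at all, here is a minimal sketch (assuming myInfo is a list of tokens and that a row should get 'ok' when any of its split values appears in myInfo; flip the mask with ~ to mirror the original isin logic instead):

# Boolean mask: True where any split token of myCol is in myInfo
mask = df['myCol'].str.split(sep, expand=True).isin(myInfo).any(axis=1)
df.loc[mask, 'col'] = 'ok'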
Source:
Code taken from the pyjanitor documentation, which has a built-in function, deconcatenate_column, that does this; see the source code for deconcatenate_column.
I've attempted to search the forum for this question, but I believe I may not be asking it correctly, so here it goes.
I have a large data set with many columns. Originally, I needed to sum all columns for each row by multiple groups based on a name pattern of variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my two factor variables (group and id), with all the columns and the final sum column I need. However, now I want to return just the final sum column back into the original dataframe. The above code puts the entire modified dataframe into my sum column. I know this is achievable in R by simply adding a .$sum at the end of a piped chain. Any ideas on how to get this in pandas?
My hoped-for output is just the addition of the final "sum" variable from the above lines of code into my original dataframe.
Edit: To clarify, the code above returns the entire modified dataframe; all I want returned is the final sum column.
Is this what you need?
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
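For illustration, a small made-up frame (column names assumed) showing why transform works here:

import pandas as pd

data = pd.DataFrame({
    'id': [1, 1, 2],
    'group': ['a', 'a', 'b'],
    'x_name': [1, 2, 3],
    'y_name': [10, 20, 30],
})
cols = data.filter(regex=r'_name$').columns
# transform('sum') broadcasts each group's column totals back to every
# original row, so the result keeps data's index and can be assigned
# directly; summing across columns then yields one value per row.
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
# data['sum'] is now [33, 33, 33]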
I'm trying to combine two dataframes in pandas using a left merge on common columns, only when I do that, the merged data doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that's trying to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment', 'subject', 'date', 'procedure'], how='left')
The output follows (the id, ts, and tastant columns should match the first dataframe correctly, but they don't).
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports; it could be assuming numbers are int in one dataframe and object in another.
For the string columns, check for additional whitespace. It can appear in datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
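A minimal sketch of both checks (the key columns are taken from the merge call above; which columns actually need casting or stripping is an assumption):

# Compare dtypes side by side to spot mismatches
print(first.dtypes)
print(second.dtypes)

key_cols = ['trial', 'experiment', 'subject', 'date', 'procedure']
for col in key_cols:
    # Cast both sides to string and strip stray whitespace so the
    # join keys compare equal
    first[col] = first[col].astype(str).str.strip()
    second[col] = second[col].astype(str).str.strip()

combined = first.merge(second, on=key_cols, how='left')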
I was wondering how I would be able to expand out a list in a cell without repeating variables in other cells.
The goal is to get the list expanded without the first column being repeated. I know how to expand the list out, but I would prefer not to repeat the first column's values if that is possible. Thank you for any help!
In order to get what you're asking for, you still have to use explode(); you just have to take it a step further and change the values of the first column. Please note that this will destroy the association between the elements of the list and the letter of the row they were first in: you would be creating a third value for the column (an empty string) that is repeated for every record not beginning with 1.
If you want to eliminate the value from the rows you are talking about but still want to have those records associated with the value that their list was associated with, you can't. It's not logically possible for a value to both be in a given cell but also not be in that cell. So, I will show you the steps for eliminating the original association.
For this example, I named the columns since they are not provided.
import pandas as pd

data = [
    ["a", ["1 hey", "2 hi", "3 hello"]],
    ["b", ["1 what", "2 how", "3 say"]],
]
df = pd.DataFrame(data, columns=["first", "second"])

# One row per list element; exploded rows keep the original index
df = df.explode("second")
# Keep 'first' only where the list element starts with '1'; blank it elsewhere
df['first'] = df.apply(lambda x: x['first'] if x['second'][0] == '1' else '', axis=1)
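For reference, the frame then looks roughly like this (the index values repeat because explode() keeps the original index):

  first   second
0     a    1 hey
0            2 hi
0         3 hello
1     b   1 what
1           2 how
1          3 say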
I am working on a dataframe with multiple columns, where one of the columns contains string values in many rows (more than 1000). Kindly check the below table for more details:
In the above table, I want to change the string values in the column Group_Number to numbers by picking the value from the first column (MasterGroup) and incrementing by one (01), so that the values look like below:
I also need to verify that if a string is duplicated, then instead of being given a new number it is replaced with the number already assigned. For example, in the above table ANAYSIM is duplicated, and instead of a new sequence number I want the already-given number repeated for the duplicate string.
I have checked different links, but they focus on assigning values supplied by the user:
Pandas DataFrame: replace all values in a column, based on condition
Change one value based on another value in pandas
Conditional Replace Pandas
Any help with achieving the desired outcome is highly appreciated.
We could use cumcount with groupby:
# Per-MasterGroup running count 1, 2, ... -> '10', '20', ...
s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
# Values that can't be parsed as dates become NaT, flagging the string
# rows that need a generated number
t = pd.to_datetime(df.Group_Number, errors='coerce')
Then we assign
df.loc[t.isnull(), 'Group_Number'] = df.MasterGroup.astype(str) + s
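Note that cumcount hands every string row a fresh number, duplicates included. A minimal sketch for the duplicate requirement (assuming a repeated string within the same MasterGroup should reuse the number it was given first):

mask = t.isnull()
# Number only the unique (MasterGroup, string) pairs...
firsts = df.loc[mask].drop_duplicates(['MasterGroup', 'Group_Number'])
codes = (firsts.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
mapping = dict(zip(zip(firsts['MasterGroup'], firsts['Group_Number']),
                   firsts['MasterGroup'].astype(str) + codes))
# ...then map every occurrence, duplicates included, to that number
df.loc[mask, 'Group_Number'] = [
    mapping[k] for k in zip(df.loc[mask, 'MasterGroup'],
                            df.loc[mask, 'Group_Number'])
]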
I've got some big CSVs. They can easily have over 300k rows and 500 columns, so obviously I'd like to get rid of some unneeded data in the resulting dataframe to save resources.
There are some fixed, labeled columns and also a variable number of columns with similar labels that are numbered.
example=pd.DataFrame(columns=["fix","variable 1","variable 2","waste 1","waste 2"])
I want to get all these variable columns, which I can get via
example.filter(regex="var")
but I want to include "fix" as well. As df.loc doesn't accept a regex and df.filter only supports a single filtering argument, is there a smooth way to do this? Or do I have to create a quite complex callable?
Thanks in advance.
Just modify your regex to do a full match for "fix":
df.filter(regex=r"var|(^fix$)")
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
Another option is using Index.str.contains in the same fashion:
df.loc[:, df.columns.str.contains(r'var|(?:^fix$)')]
Empty DataFrame
Columns: [fix, variable 1, variable 2]
Index: []
I made the group non-capturing, otherwise pandas complains: str.contains warns when the pattern contains capture groups, since those are meant for str.extract.
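To see the selection on actual rows rather than the empty frame above, a quick check with made-up values:

import pandas as pd

example = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["fix", "variable 1", "variable 2", "waste 1", "waste 2"],
)
# Both approaches keep 'fix' plus the numbered variable columns
# and drop the waste columns
print(example.filter(regex=r"var|(^fix$)"))
#    fix  variable 1  variable 2
# 0    1           2           3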