I am trying to split string values from a column into as many columns as there are strings in each row.
I am creating a new dataframe with three columns and I have the string values in the third column. I want to split them into new columns (which already have headers), but the number of strings, separated by semicolons, differs in each row.
If I use this code:
df['string']= df['string'].str.split(';', expand=True)
then I will be left with only one value in the column, while the rest of the string values are not split but eliminated.
Can you advise how this line of code should be modified in order to get the right output?
Many thanks in advance.
Instead of overwriting the original column, you can take the result of split and join it with the original DataFrame:
df = pd.DataFrame({'my_string':['car;war;bus','school;college']})
df = df.join(df['my_string'].str.split(';',expand=True))
print(df)
my_string 0 1 2
0 car;war;bus car war bus
1 school;college school college None
If you instead only want to keep the first value in the original column, use .str[0] on the unexpanded split:
df['string'] = df['string'].str.split(';').str[0]
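Putting it together, a minimal sketch that splits into pre-named header columns no matter how many pieces each row has (the header names here are hypothetical, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'my_string': ['car;war;bus', 'school;college']})

# Split into as many columns as the longest row needs...
parts = df['my_string'].str.split(';', expand=True)

# ...then relabel them with the pre-existing header names and join back.
headers = ['word1', 'word2', 'word3']
parts.columns = headers[:parts.shape[1]]
df = df.join(parts)
print(df)
```

Rows with fewer pieces are padded with None, so no values are lost.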
I have my original pandas dataframe as such, where the 'comment' column consists of unseparated lists of strings, and another column called 'direction' indicates whether the overall content in the 'comment' column suggests positive or negative comments, where 1 represents positive comments and 0 represents negative ones.
Now I wish to create a new dataframe by separating all the strings under 'comment' by the delimiter '<END>' and assigning each new list of strings to a separate row with its original 'direction'. So it would look something like this new dataframe.
How should I achieve this?
Try:
df['comment'] = df['comment'].str.split('<END>')
df = df.explode('comment')
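A minimal runnable sketch of that split-then-explode approach (the comment texts are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'comment': ['good service<END>fast delivery', 'slow<END>rude staff'],
    'direction': [1, 0],
})

# Split each cell into a list, then explode each list element into its own row;
# the other columns ('direction') are repeated for every exploded row.
df['comment'] = df['comment'].str.split('<END>')
df = df.explode('comment').reset_index(drop=True)
print(df)
```

reset_index(drop=True) is optional; without it, exploded rows keep their original index.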
I have a column which contains lists of variable sizes. The lists contain a limited amount of short text values. Around 60 unique values all together.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to make these values columns in my dataframe, where the value of each column would be 1 if the value is in that row and 0 if not.
I know I could expand the list and then call unique and set those as new columns. But after that I don't know what to do.
Here's one way:
df = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
NOTE: grouping on level=0 groups by the index, which explode preserves, so the exploded rows collapse back to one row per original list. (sum(level=0) did the same thing but is deprecated in recent pandas.) So I prefer to use this right after exploding the dataframe.
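Here's a self-contained sketch of the explode + get_dummies + group-by-index pattern, encoding just the 'val' column from the example data:

```python
import pandas as pd

df = pd.DataFrame({'val': [['AC', 'BB'], ['AD', 'CB', 'FF'],
                           ['AA', 'CC'], ['CA', 'BB'], ['AA']]})

# explode: one row per list element, keeping the original row index.
# get_dummies: one indicator column per unique value.
# groupby(level=0).sum(): collapse back to one row per original list.
dummies = pd.get_dummies(df.explode('val')['val']).groupby(level=0).sum()
print(dummies)
```

Each row now has 1 in the columns whose value appeared in that row's list and 0 elsewhere.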
I'm having trouble working out how to split a single column into multiple columns using pandas. I have a main column that could have up to ten words separated by commas. I only have eight columns to split these words into (no more).
I'm currently using the code below to split words into multiple columns. This code works as long as I know exactly how many words are in the longest cell. Example: in the case below, one of the cells in the original file must have exactly eight words for this to work properly. Otherwise, I get an error (Columns must be same length as key). In testing, I have found that I must have the same number of columns as are needed to split the longest cell. No more, no less.
df[['column1','column2','column3','column4','column5','column6','column7','column8']] = df['main'].str.split(',', expand=True)
What I'd like to see happen is a way to not worry about how many words are in the cells of the main column. If longest cell contains 6 words then split them out to 6 columns. If longest cell contains 8 words then split them out to 8 columns. If longest cell contains 10 words then drop last two words and split the rest out using 8 columns.
About the original file main column. I will not know how many words exist in each of the cells. I just have 8 columns so the first eight (if that many) get the honor of splitting to a column. The rest of the words (if any) will get dropped.
Question: why do I need to have the exact number of columns in the code above if my longest cell with words doesn't exceed my column count? I'm not understanding something.
Any help with the logic would be appreciated.
cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
df2 = df['main'].str.split(',', expand=True, n=8)
#df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
#-------
if 0 in df2.columns:
    df['column1'] = np.where(df2[0].isnull(), df['column1'], df2[0])
You can limit the number of splits with n and then trim the last column:
cols = ['column1','column2','column3','column4','column5','column6','column7','column8']
df2 = df['main'].str.split(',', expand=True, n=7)
df = df.assign(**df2.set_axis(cols[:df2.shape[1]], axis=1))
df['column8'] = df['column8'].str.split(',').str[0]
With n=7 the split yields at most 8 pieces, so the result never exceeds your 8 columns; the last piece still contains any overflow words, so splitting it again and keeping the first element drops them. Use a list of labels rather than df.columns so the pieces don't land in the first df2.shape[1] columns of df (which would overwrite 'main' itself).
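A runnable sketch of the capped split, assuming a 'main' column and the eight target headers from the question; the guard covers frames whose longest cell has fewer than eight words:

```python
import pandas as pd

cols = ['column1', 'column2', 'column3', 'column4',
        'column5', 'column6', 'column7', 'column8']

df = pd.DataFrame({'main': ['a,b,c', 'a,b,c,d,e,f,g,h,i,j']})

# n=7 yields at most 8 pieces; the 8th piece still holds any overflow words.
parts = df['main'].str.split(',', expand=True, n=7)
parts.columns = cols[:parts.shape[1]]

# Trim the overflow down to the 8th word (words 9 and 10 are dropped).
if 'column8' in parts:
    parts['column8'] = parts['column8'].str.split(',').str[0]

df = df.join(parts)
print(df)
```

Shorter cells are padded with None, so rows of any length fit the same eight columns.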
I have a dataframe in pandas that i need to split up. It is much larger than this, but here is an example:
ID A B
a 0 0
b 1 1
c 2 2
and I have a list: keep_list = ['ID','A']
and another list: recode_list = ['ID','B']
I'd like to split the dataframe up by the column headers into two dataframes: one with those columns and values whose headers match keep_list, and one with those whose headers match recode_list. Every code I have tried thus far has not worked, as it tries to compare the values to the list rather than the column names.
Thank you so much in advance for your help!
Assuming your DataFrame's name is df:
you can simply do
df[keep_list] and df[recode_list] to get what you want.
You can do this with Index.intersection:
df1 = df[df.columns.intersection(keep_list)]
df2 = df[df.columns.intersection(recode_list)]
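A quick sketch showing both selections against the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'c'], 'A': [0, 1, 2], 'B': [0, 1, 2]})
keep_list = ['ID', 'A']
recode_list = ['ID', 'B']

# Plain label selection works when every name in the list exists;
# intersection is the safer choice if some names might be missing.
df1 = df[df.columns.intersection(keep_list)]
df2 = df[df.columns.intersection(recode_list)]
print(df1)
print(df2)
```

df[keep_list] raises a KeyError on a missing label, while intersection silently keeps only the columns that exist.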
I am trying to select a subset of the object-type column cells with str.split:
dataset['pictures'].str.split(pat=",")
I want to get the values of the numbers 40092 and 39097 and the two dates of the pictures as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame with one of the columns containing JSON strings (or any other string that needs to be parsed into multiple columns)
E.g.
df = pd.DataFrame({'pictures': [
'{"col1":"40092","picture_date":"2017-11-06"}',
'{"col1":"39097","picture_date":"2017-10-31"}']
})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json

def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']

And use the pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
0 1
0 40092 2017-11-06
1 39097 2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
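As an alternative, if the strings really are JSON, pandas' json_normalize can do the expansion in one call; a sketch using the same sample frame (the ID/DATE renames are just the names the question asked for):

```python
import json

import pandas as pd

df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})

# Parse each string to a dict, expand the dicts into columns,
# and join them back onto the original frame.
parsed = pd.json_normalize(df['pictures'].apply(json.loads).tolist())
df = df.join(parsed).rename(columns={'col1': 'ID', 'picture_date': 'DATE'})
print(df)
```

This avoids writing a row-parsing function by hand when the payload is well-formed JSON.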
Thanks for the replies but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist= dataset['pictures'].values.tolist()
And afterwards creating a dataframe from the list made from the pictures column and concatenating it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)
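For reference, a runnable version of that approach, assuming the 'pictures' column holds dicts rather than raw strings:

```python
import pandas as pd

dataset = pd.DataFrame({'pictures': [
    {"col1": "40092", "picture_date": "2017-11-06"},
    {"col1": "39097", "picture_date": "2017-10-31"},
]})

# Turn the column of dicts into a list, let DataFrame expand the keys
# into columns, then glue them on beside the remaining columns.
picturelist = dataset['pictures'].values.tolist()
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat(
    [dataset.drop(columns='pictures'), two_new_columns], axis=1
)
print(new_dataset)
```

This relies on the two frames sharing the same default integer index; reset the index first if the original has been filtered or reordered.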