Create a list from row values with no duplicates - python

I would need to extract the following words from a dataframe.
car+ferrari
The dataset is:
                 Owner   Sold
type
car+ferrari      J.G     £500,000
car+ferrari      R.R.T.  £276,550
car+ferrari
motobike+ducati
motobike+ducati
...
I need to create a list of the words from type, with each term kept separate. So in this case I need only car and ferrari.
The list should be:
my_list = ['car', 'ferrari']
with no duplicates.
So what I should do is select type car+ferrari and extract all of its words, adding them into a list as shown above, without duplicates (I have many car+ferrari rows, but since I only need the terms themselves, each should appear in the list once).
Any help will be appreciated
EDIT: type column is an index

def lister(x):  # split a 'type' value on '+'
    return set(x.split('+'))

df['listcol'] = df['type'].apply(lister)  # apply the function to the 'type' column and save the output to a new column
Adding @AMC's suggestion of pandas' built-in way to split a Series:
df['type'].str.split(pat='+')
For details, see pandas.Series.str.split.
Converting pandas index to series:
pd.Series(df.index)
Apply a function on index:
pd.Series(df.index).apply(lister)
or
pd.Series(df.index).str.split(pat = '+')
or
df.index.to_series().str.split("+")
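Putting the pieces together, here is a minimal sketch that builds the deduplicated list straight from the index. The frame below is made-up data mimicking the question, since the full dataset isn't shown:

```python
import pandas as pd

# Toy frame mimicking the question: 'type' is the index (hypothetical data).
df = pd.DataFrame(
    {'Owner': ['J.G', 'R.R.T.', None]},
    index=pd.Index(['car+ferrari', 'car+ferrari', 'motobike+ducati'], name='type'),
)

# Select the wanted type, split each unique label once, and deduplicate.
wanted = df.index[df.index == 'car+ferrari'].unique()
my_list = sorted({term for label in wanted for term in label.split('+')})
print(my_list)  # ['car', 'ferrari']
```

Splitting only the unique index labels (rather than every row) avoids re-splitting the many duplicate car+ferrari rows.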

Related

I have a dataframe containing arrays; is there a way to collect all of the elements and store them in a separate dataframe?

I can't seem to find a way to split all of the array values from the column of a dataframe.
I have managed to get all the array values using this code:
The dataframe is as follows:
I want to use value_counts() on the dataframe and I get this
I want the array values that are clubbed together to be split so that I can get the accurate count of every value.
Thanks in advance!
You could try .explode(), which would create a new row for every value in each list.
df_mentioned_id_exploded = pd.DataFrame(df_mentioned_id.explode('entities.user_mentions'))
With the above code you would create a new dataframe df_mentioned_id_exploded with a single column entities.user_mentions, which you could then use .value_counts() on.
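A runnable sketch of that idea on toy data (the list values here are hypothetical stand-ins for the arrays in the question):

```python
import pandas as pd

# Hypothetical lists standing in for 'entities.user_mentions'.
df = pd.DataFrame({'entities.user_mentions': [['a', 'b'], ['a'], ['b', 'b']]})

# explode() emits one row per list element, so value_counts()
# counts individual values instead of whole lists.
counts = df['entities.user_mentions'].explode().value_counts()
print(counts)  # b appears 3 times, a appears 2 times
```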

.isin() only returning one value as opposed to entire list

I have a large pandas dataframe with ~500,000 rows and 6 columns, where each row is uniquely identified by the 'Names' column. The other 5 columns contain characteristic information about the corresponding 'Names' entry. I also have a separate list of ~40,000 individual names, all of which are subsumed within the larger dataframe. I want to use this smaller list to extract all of the corresponding information in the larger dataframe, and am using:
info = df[df['Names'].isin(ListNames)]
where df is the large dataframe and ListNames is the list of names I want to get the information for. However, when I run this, only one row is extracted from the overall dataframe as opposed to ~40000. I have also tried using ListNames as a 'Series' datatype instead of 'List' datatype but this returned the same thing as before. Would be super grateful for any advice - thanks!
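No answer is shown here, but one common cause worth checking: .isin() matches strings exactly, so stray whitespace or case differences make rows silently fail to match. A diagnostic sketch on made-up names (the data below is hypothetical):

```python
import pandas as pd

# Hypothetical data with whitespace and case mismatches.
df = pd.DataFrame({'Names': ['Alice ', 'Bob', 'carol']})
ListNames = ['Alice', 'Carol']

# Exact matching misses both intended rows.
raw = int(df['Names'].isin(ListNames).sum())

# Normalizing whitespace and case before matching recovers them.
normalized = int(
    df['Names'].str.strip().str.casefold()
    .isin([n.casefold() for n in ListNames])
    .sum()
)
print(raw, normalized)  # 0 2
```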

Sorting multiple Pandas Dataframe Columns based on the sorting of one column

I have a dataframe with two columns in it, 'item' and 'calories'. I have sorted the 'calories' column numerically using a selection sort algorithm, but I need the 'item' column to change so each calorie value matches the correct item.
menu = pd.read_csv("menu.csv", encoding="utf-8")  # Read in the csv file
menu_df = pd.DataFrame(menu, columns=['Item', 'Calories'])  # Keep just the Item and Calories columns
print(menu_df)  # Display un-sorted data
# print(menu_df.at[4,'Calories'])  # Practise calling individual elements within the dataframe
# Start of selection sort
for outerloopindex in range(len(menu_df)):
    smallest_value_index = outerloopindex
    for innerloopindex in range(outerloopindex + 1, len(menu_df)):
        if menu_df.at[smallest_value_index, 'Calories'] > menu_df.at[innerloopindex, 'Calories']:
            smallest_value_index = innerloopindex
    # Changing the order of the Calorie column.
    menu_df.at[outerloopindex, 'Calories'], menu_df.at[smallest_value_index, 'Calories'] = menu_df.at[smallest_value_index, 'Calories'], menu_df.at[outerloopindex, 'Calories']
# End of selection sort
print(menu_df)
Any help on how to get the 'Item' column to match the corresponding 'Calorie' values after the sort would be really really appreciated.
Thanks
Martin
You can replace df.at[...] with df.loc[...], which lets you reference multiple columns instead of a single one.
So replace this line:
menu_df.at[outerloopindex,'Calories'],menu_df.at[smallest_value_index,'Calories']=menu_df.at[smallest_value_index,'Calories'],menu_df.at[outerloopindex,'Calories']
With this line:
menu_df.loc[outerloopindex,['Calories','Item']],menu_df.loc[smallest_value_index,['Calories','Item']]=menu_df.loc[smallest_value_index,['Calories','Item']],menu_df.loc[outerloopindex,['Calories','Item']]
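If the selection sort itself isn't the point of the exercise, the same pairing can be had in one call with sort_values, which reorders whole rows at once. A sketch on hypothetical menu data:

```python
import pandas as pd

# Hypothetical menu rows.
menu_df = pd.DataFrame({'Item': ['Burger', 'Salad', 'Fries'],
                        'Calories': [550, 150, 320]})

# sort_values moves entire rows, so each Item stays paired with its Calories.
menu_sorted = menu_df.sort_values('Calories', ignore_index=True)
print(menu_sorted)
```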

Select a subset of an object type cell in a pandas DataFrame

I am trying to select a subset of the object-type column cells with str.split(pat="'"):
dataset['pictures'].str.split(pat=",")
I want to get the numbers 40092 and 39097 and the two picture dates as two columns, ID and DATE, but as a result I get one column consisting of NaNs.
'pictures' column:
{"col1":"40092","picture_date":"2017-11-06"}
{"col1":"39097","picture_date":"2017-10-31"}
...
Here's what I understood from your question:
You have a pandas DataFrame with one of the columns containing JSON strings (or any other string that needs to be parsed into multiple columns)
E.g.
df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})
You want to parse the two elements ('col1' and 'picture_date') into two separate columns for further processing (or perhaps just one of them)
Define a function for parsing the row:
import json

def parse_row(r):
    j = json.loads(r['pictures'])
    return j['col1'], j['picture_date']
And use the pandas DataFrame.apply() method as follows:
df1 = df.apply(parse_row, axis=1, result_type='expand')
The result is a new dataframe with two columns - each containing the parsed data:
       0           1
0  40092  2017-11-06
1  39097  2017-10-31
If you need just one column you can return a single element from parse_row (instead of a two element tuple in the above example) and just use df.apply(parse_row).
If the values are not in json format, just modify parse_row accordingly (Split, convert string to numbers, etc.)
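As an alternative sketch, pandas' json_normalize can build the columns after parsing and keeps the original key names:

```python
import json

import pandas as pd

df = pd.DataFrame({'pictures': [
    '{"col1":"40092","picture_date":"2017-11-06"}',
    '{"col1":"39097","picture_date":"2017-10-31"}',
]})

# Parse each JSON string into a dict, then normalize into columns.
parsed = pd.json_normalize(df['pictures'].map(json.loads).tolist())
print(parsed)  # columns 'col1' and 'picture_date'
```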
Thanks for the replies, but I solved it by loading the 'pictures' column from the dataset into a list:
picturelist = dataset['pictures'].values.tolist()
Afterwards I created a dataframe from that list and concatenated it with the original dataset without the pictures column:
two_new_columns = pd.DataFrame(picturelist)
new_dataset = pd.concat([dataset.drop(columns='pictures'), two_new_columns], axis=1)

Pandas: Find string in a column and replace them with numbers with incrementing values

I am working on a dataframe with multiple columns; one of the columns has more than 1000 rows containing string values. Kindly check the below table for more details:
In the above image I want to change the string values in the column Group_Number to numbers, by taking the value from the first column (MasterGroup) and incrementing by one (01), so the values look like below:
I also need to handle duplicates: if a string repeats, it should get the number it was already assigned instead of a new one. For example, in the above image ANAYSIM is duplicated, and instead of a new sequence number I want the already-given number for the repeated string.
I have checked different links, but they focus on values supplied by the user:
Pandas DataFrame: replace all values in a column, based on condition
Change one value based on another value in pandas
Conditional Replace Pandas
Any help with achieving the desired outcome is highly appreciated.
We could use cumcount with groupby:
s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
t = pd.to_datetime(df.Group_number, errors='coerce')
Then we assign:
df.loc[t.isnull(), 'Group_number'] = df.MasterGroup.astype(str) + s
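The cumcount above numbers every row; if duplicated strings must reuse the same number, as the question asks, a per-group factorize is one alternative. A sketch on toy data with hypothetical values, assuming every Group_Number entry is a string to be replaced:

```python
import pandas as pd

# Hypothetical data: all Group_Number entries are strings to replace.
df = pd.DataFrame({
    'MasterGroup': [100, 100, 200, 100],
    'Group_Number': ['ANAYSIM', 'OTHER', 'ANAYSIM', 'ANAYSIM'],
})

# Within each MasterGroup, give each distinct string an incrementing
# code; repeated strings get the code they were already assigned.
codes = (df.groupby('MasterGroup')['Group_Number']
           .transform(lambda s: pd.factorize(s)[0] + 1))
df['Group_Number'] = df['MasterGroup'].astype(str) + codes.map('{:02d}'.format)
print(df['Group_Number'].tolist())  # ['10001', '10002', '20001', '10001']
```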
