I have a data frame like this:
The dataset in the CSV file is here.
This data was extracted from the IMDb dataset.
But I have a problem: I am not able to remove actor names that are repeated in the same row. For example, in row number 4 I want to drop 'Marie Gruber', which appears in both the name and the actors columns.
I tried to use apply and all kinds of conditions, but the comparison never works as expected.
Like this code:
data[data['name'] != data['actors']]
There are trailing spaces in the actors column, so first remove them with Series.str.strip:
data['actors'] = data['actors'].str.strip()
data[data['name'] != data['actors']]
Or use skipinitialspace=True in read_csv:
data = pd.read_csv(file, skipinitialspace=True)
data[data['name'] != data['actors']]
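A small self-contained frame (with made-up names in place of the CSV) shows why the filter fails before stripping and works afterwards:

```python
import pandas as pd

# tiny frame reproducing the trailing-space problem described above;
# 'Marie Gruber ' has a trailing space, so a plain != comparison keeps it
data = pd.DataFrame({'name': ['Marie Gruber', 'Tom Hanks'],
                     'actors': ['Marie Gruber ', 'Rita Wilson']})
data['actors'] = data['actors'].str.strip()
out = data[data['name'] != data['actors']]
```

After stripping, only the 'Tom Hanks' row survives the filter.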
Use the pandas.DataFrame.drop function.
data.drop(data[data.apply(lambda x: x['name'] in x['actors'], axis = 1)].index)
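A minimal demo of this drop-based variant, with made-up data: because it tests membership with `in` rather than equality, it also catches the case where actors holds several comma-separated names.

```python
import pandas as pd

# hypothetical frame; 'actors' may contain several names, so membership
# is tested with `in` rather than equality
data = pd.DataFrame({'name': ['Marie Gruber', 'Tom Hanks'],
                     'actors': ['Marie Gruber, Axel Prahl', 'Rita Wilson']})
out = data.drop(data[data.apply(lambda x: x['name'] in x['actors'], axis=1)].index)
```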
[screenshot of dataframe]
I have a dataframe with multiple columns. One of them contains names of French suppliers such as "Intermarché", "Carrefour", "Leclerc" (as you can see in the framed column of the attached screenshot). Unfortunately, the names are typed by hand and are not standardized at all. From the "Distributeurs" column, I would like to create a new column with the names unified into a list per cell, so that I can then use the .explode() function and get one product and one distributor per row. I would like to keep a selection of about 30 suppliers and put 'other suppliers' for the rest. I feel like I have to use regular expressions and loops, but I'm totally lost. Could someone help me? Thanks
I tried this, but I'm lost:
df['test'] = ''
distrib_list = ["Leclerc", "Monoprix", 'Franprix', 'Intermarché', 'Carrefour', 'Lidl', 'Proxi', 'Grand Frais', 'Fresh', 'Cora', 'Casino', "Relais d'Or", 'Biocoop', 'Métro', 'Match', 'Super U', 'Aldi', 'Spar', 'Colruyt', 'Auchan']
for n in df['Distributeurs']:
    if n in distrib_list:
        df['test'].append
You'll need to first split your Distributeurs on the comma by doing something along the lines of df['Distributeurs'].str.split(','). Once that is done, you can iterate over the rows of your dataframe, getting the idx and the row in question, and then iterate over your split Distributeurs cell. I also make the match case-insensitive (for Unicode you might need to add things to this if statement).
Then you can create a new dataframe with this information (and add whatever other information you wish) by first building a list and transforming it into a dataframe. This can be accomplished with something along the lines of:
import pandas as pd

test = []
distList = ['name1', 'name2', 'name3']
data = [['Product1', ['Name1', 'Name2']], ['Product2', ['Name1', 'Name2', 'Name3']], ['Product3', ['Name4', 'Name5']]]
df = pd.DataFrame(data, columns=['Product', 'Distributor'])
for idx, x in df.iterrows():
    for i in range(len(x['Distributor'])):
        if x['Distributor'][i].lower() in distList:
            test.append({
                'Product': df['Product'][idx],
                'Distributor': x['Distributor'][i]
            })
        else:
            pass
test_df = pd.DataFrame(test)
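For completeness, the one-row-per-distributor shape the question asks for can also be reached with str.split plus DataFrame.explode. A minimal sketch with made-up products and a shortened supplier list (the real list would hold the ~30 names):

```python
import pandas as pd

# shortened, lowercased stand-in for the asker's ~30-supplier list
distrib_list = ['leclerc', 'carrefour', 'intermarché']
df = pd.DataFrame({
    'Product': ['p1', 'p2'],
    'Distributeurs': ['Leclerc, Carrefour', 'Lidl,Intermarché'],
})
# split on commas, strip spaces, map unknown names to 'other suppliers'
df['Distributor'] = (
    df['Distributeurs']
    .str.split(',')
    .apply(lambda names: [n.strip() if n.strip().lower() in distrib_list
                          else 'other suppliers'
                          for n in names])
)
tidy = df.explode('Distributor')  # one product / one distributor per row
```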
I am working on a pandas data frame where I want to merge two columns, putting a comma between the merged values, and then enclosing the whole cell in [].
Example:
I have this kind of data frame (note: the sample data is uploaded at this link):
bboxes class_names
[[0,0,2336,2836],[0,0,2336,2836],[0,0,2336,2836]] ['No finding','No finding','No finding']
and I want to merge the two columns, adding a comma between their contents, and then enclose the merged cell in [] like below:
final_bboxes
[[[0,0,2336,2836],[0,0,2336,2836],[0,0,2336,2836]],['No finding','No finding','No finding']]
Thank you so much
You first need to convert the lists stored as strings into actual lists before combining them. I have used ast.literal_eval to do this safely.
import ast
df["final_bboxes"] = df.apply(lambda row: [ast.literal_eval(row["bboxes"]), ast.literal_eval(row["class_names"])], axis=1)
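A tiny runnable version of this answer, with one sample row copied from the question (the real df comes from the linked data):

```python
import ast
import pandas as pd

# the cells are *strings* that look like lists, so they must be parsed
# with ast.literal_eval before they can be nested into one list
df = pd.DataFrame({
    'bboxes': ['[[0,0,2336,2836],[0,0,2336,2836]]'],
    'class_names': ["['No finding','No finding']"],
})
df['final_bboxes'] = df.apply(
    lambda row: [ast.literal_eval(row['bboxes']),
                 ast.literal_eval(row['class_names'])],
    axis=1,
)
```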
Try this:
df['new_col'] = [[x,y] for [x,y] in df[['bboxes', 'class_names']].values]
If you run into a string/quote issue:
df['new_col'] = [[eval(x), eval(y)] for [x, y] in df[['bboxes', 'class_names']].values]  # you can use the ast module's literal_eval as well
# I am not a fan of eval, tbh; don't use it unless you have no other way
I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa','bb','cc']],
'B1':dfx[['dd']],
'C1':dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values for outerKey, innerDict in nest.items() for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried with df.loc (what I would normally use in this case), but I get an error. Similarly for iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or is there some other function that would allow me to proceed?
Note: the DF is destined to be moved on an Excel file (I use ExcelWriter).
The type of result I want to achieve in the end is this one (gray "SUM" row):
I came up with a sort of solution on my own.
I created a separate DataFrame in Pandas that contains the summary.
I used ExcelWriter to have both dataframes on the same excel worksheet.
Technically, it would then be possible to style and format the data in Excel (xlsxwriter or styleframe seem to be popular modules for this). Alternatively, one could do it manually.
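The original error can also be fixed directly: a row assigned with .loc must supply a value for every column of the MultiIndex frame, not just one value per category. A minimal sketch with made-up numbers (the real dfx comes from the question); each category total is placed under that category's first inner column and the remaining cells are left empty:

```python
import pandas as pd

# made-up stand-in for the question's dfx
dfx = pd.DataFrame({'aa': [1, 2], 'bb': [3, 4], 'cc': [5, 6],
                    'dd': [7, 8], 'ee': [9, 10], 'ff': [11, 12]})
nest = {'A1': dfx[['aa', 'bb', 'cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values
          for outerKey, innerDict in nest.items()
          for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)

# grand total per outer category (A1, B1, C1)
cat_totals = dfzx.sum().groupby(level=0).sum()

# build a full-width row: the 'mismatched columns' error came from
# assigning fewer values than there are columns
total_row = pd.Series(index=dfzx.columns, dtype=float, name='Total')
for cat in cat_totals.index:
    total_row[(cat, dfzx[cat].columns[0])] = cat_totals[cat]
dfzx.loc['Total'] = total_row
```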
I have an Excel file that I import into a dataframe. I want to split the contents of one column into several columns.
Here is original
After importing it into pandas in Python, I get this data with '\n'.
So, I want to extract the contents of the column. Could you share an idea or some code?
My expected columns are....
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
    x = data['Details'][n].split()
    # note: lstrip('Vector:') would strip a *set of characters*, not the
    # prefix, so remove the literal prefix instead
    x[2] = x[2][len('Vector:'):] if x[2].startswith('Vector:') else x[2]
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this: first you split every element of the Details column into a list of strings. Second, you handle the 'Vector:....' special case and filter out the column-name labels. Third, you store all the values in a list, which is in turn converted to a NumPy array with shape (length, 3). Finally, you drop the old 'Details' column and concatenate with the df created from the split strings.
You may want to try a more efficient way to transform your data at read time by applying these ideas inside the pd.read_excel call using converters.
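As an alternative sketch, if the Details strings really follow the 'Type: ... Vector:... Mission: ...' pattern described above, a single str.extract with named groups replaces the manual loop. The sample strings and the column layout below are assumptions based on the question, not the real file:

```python
import pandas as pd

# hypothetical Details strings matching the description in the answer
data = pd.DataFrame({'Details': [
    'Type: Attack\nVector:Air\nMission: Strike',
    'Type: Recon\nVector:Sea\nMission: Patrol',
]})
# one named group per target column; \s+ also matches the '\n' separators
pattern = r'Type:\s*(?P<Type>\S+)\s+Vector:(?P<Vector>\S+)\s+Mission:\s*(?P<Mission>\S+)'
final = pd.concat([data.drop(columns='Details'),
                   data['Details'].str.extract(pattern)], axis=1)
```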
I know we can access columns using table.cols.somecolumn, but I need to apply the same operation on 10-15 columns of my table. So I'd like an iterative solution. I have the names of the columns as strings in a list : ['col1','col2','col3'].
So I'm looking for something along the lines of:
for col in columnlist:
    thiscol = table.cols[col]
    # apply whatever operation
Try this:
columnlist = ['col1', 'col2', 'col3']
for col in columnlist:
    thiscol = getattr(table.cols, col)
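A quick way to see the getattr pattern in action without an actual HDF5 file is a stand-in object (Cols and Table below are hypothetical, not real PyTables classes):

```python
# stand-in objects demonstrating name-based attribute access;
# in real code, table.cols would come from an open PyTables table
class Cols:
    col1 = [1, 2]
    col2 = [3, 4]
    col3 = [5, 6]

class Table:
    cols = Cols()

table = Table()
columnlist = ['col1', 'col2', 'col3']
totals = {col: sum(getattr(table.cols, col)) for col in columnlist}
```

PyTables itself also provides table.cols._f_col(name) for string-based column lookup, which avoids getattr entirely.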