[Screenshot of the dataframe]
I have a dataframe with multiple columns. One of these contains names of French suppliers like "Intermarché", "Carrefour", "Leclerc" (as you can see in the framed column on the attached screenshot). Unfortunately, the names are typed by hand and are not standardized at all. From the "Distributeurs" column, I would like to create a new column with the unified names as a list in each cell, so that I can then use the function .explode() and get one product and one distributor per row. I would like to select about 30 suppliers and put 'other suppliers' for the rest. I feel like I have to use regular expressions and loops, but I'm totally lost. Could someone help me? Thanks
I tried this, but I'm lost:
df['test']=''
distrib_list=["Leclerc","Monoprix",'Franprix','Intermarché','Carrefour','Lidl','Proxi','Grand Frais','Fresh','Cora','Casino',"Relais d'Or",'Biocoop','Métro','Match','Super U','Aldi','Spar','Colruyt','Auchan']
for n in df['Distributeurs']:
    if n in distrib_list:
        df['test'].append
You'll first need to split your Distributeurs column on the comma with something along the lines of df['Distributeurs'].str.split(','). Once that is done, you can iterate over the rows of your dataframe, getting the index and the row in question, and then iterate over the split Distributeurs cell. I also make the comparison case-insensitive (for Unicode, such as accented names, you might need to add more to this if statement).
Then you can create a new dataframe with this information (and add whatever other information you wish) by first building a list and then transforming it into a dataframe. This can be accomplished with something along the lines of:
import pandas as pd

test = []
distList = ['name1', 'name2', 'name3']  # reference names, lower-cased
data = [['Product1', ['Name1', 'Name2']], ['Product2', ['Name1', 'Name2', 'Name3']], ['Product3', ['Name4', 'Name5']]]
df = pd.DataFrame(data, columns=['Product', 'Distributor'])
for idx, x in df.iterrows():
    for name in x['Distributor']:
        # case-insensitive match against the reference list
        if name.lower() in distList:
            test.append({
                'Product': x['Product'],
                'Distributor': name
            })
test_df = pd.DataFrame(test)
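A more compact variant of the same idea can be sketched with a lookup table and explode. The sample data, the `canonical` mapping and the `normalise` helper are made up for illustration; matching is on the lower-cased exact name, so accent or spelling variants would each need their own entry (or a regular expression).

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataframe.
df = pd.DataFrame({
    'Product': ['P1', 'P2'],
    'Distributeurs': ['leclerc, CARREFOUR', 'Lidl,Epicerie du coin'],
})

# Hypothetical lookup: lower-cased spelling -> standardised name.
canonical = {'leclerc': 'Leclerc', 'carrefour': 'Carrefour', 'lidl': 'Lidl'}

def normalise(cell):
    """Split a comma-separated cell and map each name to its standard form."""
    names = [part.strip() for part in cell.split(',') if part.strip()]
    return [canonical.get(name.lower(), 'Other suppliers') for name in names]

df['Distributeurs_clean'] = df['Distributeurs'].apply(normalise)

# One product and one distributor per row.
long_df = df.explode('Distributeurs_clean', ignore_index=True)
print(long_df[['Product', 'Distributeurs_clean']])
```

Anything that is not in the lookup falls into 'Other suppliers', which matches the "about 30 suppliers plus a catch-all" requirement from the question.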
I am trying to create datasets from the column names of a dataframe, where the columns are ['NAME1', 'EMAIL1', 'NAME2', 'EMAIL2', 'NAME3', 'EMAIL3', etc].
I'm trying to split the dataframe on the 'EMAIL' columns using a loop, but it's not working properly.
I need it to be a JSON, because there is the possibility that between each 'EMAILn' column there may be a difference in number of columns.
This is my input:
I need this:
This is my code:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))).columns:
    df_groups = df_temp_1.groupby(i)
    df_detail = df_groups.get_group(i)
    display(df_detail)
What would you recommend I do?
Thank you very much in advance.
Regards
filter returns a copy of your dataframe with only the matching columns, but you're trying to loop over just the column names. Just add .columns:
for i in df_entities.filter(regex=('^(EMAIL)' + str(i))).columns:
    ...  # the trailing .columns is the important part
From your input and desired output, simply call pandas.wide_to_long:
long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=["NAME", "EMAIL"],
    i="index",
    j="version"
)
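To see what this produces, here is a small sketch on invented data (the names and e-mail addresses are placeholders, and `json_by_version` is a hypothetical helper name). It also shows one possible way to get a JSON string per numeric suffix, which is an assumption about the output the question wants.

```python
import pandas as pd

# Hypothetical input with numbered NAME/EMAIL column pairs.
df_entities = pd.DataFrame({
    'NAME1': ['Ann', 'Bob'],
    'EMAIL1': ['ann@example.com', 'bob@example.com'],
    'NAME2': ['Cid', 'Dee'],
    'EMAIL2': ['cid@example.com', 'dee@example.com'],
})

long_df = pd.wide_to_long(
    df_entities.reset_index(),
    stubnames=['NAME', 'EMAIL'],
    i='index',
    j='version',
)

# One JSON array of records per numeric suffix (1, 2, ...).
json_by_version = {
    int(version): group.to_json(orient='records')
    for version, group in long_df.groupby(level='version')
}
print(json_by_version[1])
```

The numeric suffix ends up in the 'version' index level, so each group holds the NAME/EMAIL pairs that shared a suffix in the wide layout.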
I have a data frame like this:
and the dataset in the CSV file is here.
this data was extracted from the IMDb dataset.
But I have a problem: I am not able to remove the actor names that are repeated within the same row. For example, in row number 4 I want to drop 'Marie Gruber', which appears in both the name and actors columns.
I tried using apply and various conditions, but the result is always the same.
For example, with this code:
data[data['name'] != data['actors']]
There are trailing spaces in the actors column, so first remove them with Series.str.strip:
data['actors'] = data['actors'].str.strip()
data[data['name'] != data['actors']]
Or use skipinitialspace=True in read_csv:
data = pd.read_csv(file, skipinitialspace=True)
data[data['name'] != data['actors']]
Use the pandas.DataFrame.drop function:
data.drop(data[data.apply(lambda x: x['name'] in x['actors'], axis = 1)].index)
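The effect of the stray whitespace is easy to reproduce on a toy frame. The rows below are invented for illustration and only mimic the shape of the IMDb extract.

```python
import pandas as pd

# Hypothetical rows; note the leading space in the second 'actors' value.
data = pd.DataFrame({
    'name': ['Marie Gruber', 'Marie Gruber', 'Tom Hanks'],
    'actors': ['Ulrich Tukur', ' Marie Gruber', 'Meg Ryan'],
})

# Without stripping, ' Marie Gruber' != 'Marie Gruber', so no row is dropped.
unstripped = data[data['name'] != data['actors']]

# Strip whitespace first, then keep only rows where the two columns differ.
data['actors'] = data['actors'].str.strip()
filtered = data[data['name'] != data['actors']]
print(len(unstripped), len(filtered))
```

Series.str.strip removes whitespace on both sides, so it covers leading as well as trailing spaces.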
I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa','bb','cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values for outerKey, innerDict in nest.items() for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried with df.loc (what I would normally use in this case) but I get an error. The same happens with iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or else is there any other function that would allow me to proceed?
Note: the DF is destined to be moved on an Excel file (I use ExcelWriter).
The type of result I want to achieve in the end is this one (gray row "SUM"):
I came up with a sort of solution on my own.
I created a separate DataFrame in Pandas that contains the summary.
I used ExcelWriter to have both dataframes on the same excel worksheet.
Technically, it would then be possible to style and format the data in Excel (xlsxwriter or framestyle seem to be popular modules for this). Alternatively, one could do that manually.
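If a total row inside the same dataframe is still wanted, one possible sketch is below. The error in the question comes from passing fewer values than there are columns; dfzx.sum() yields one value per column, so it fits. The numbers are invented, and `with_total` and `category_totals` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for dfzx: two-level columns (category, field).
dfzx = pd.DataFrame({
    ('A1', 'aa'): [1, 2], ('A1', 'bb'): [3, 4], ('A1', 'cc'): [5, 6],
    ('B1', 'dd'): [7, 8],
    ('C1', 'ee'): [9, 10], ('C1', 'ff'): [11, 12],
})

# A new row needs one value per column; dfzx.sum() provides exactly that.
with_total = dfzx.copy()
with_total.loc['Total'] = dfzx.sum()

# Collapse the column sums to one total per top-level category.
category_totals = dfzx.sum().groupby(level=0).sum()
print(category_totals)
```

The per-category series can serve as the separate summary table described above, for example written below the main table with the startrow argument of to_excel.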
I have an Excel file that I import into a dataframe. I want to split the contents of one column into several columns.
Here is the original:
After importing into pandas in Python, I get this data with '\n':
So I want to extract the parts inside that column. Could you share an idea or some code?
My expected columns are:
Don't worry, no one is born knowing everything about SO. Considering the data you gave, especially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np

data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
    # split on any whitespace, including the '\n' separators
    x = data['Details'][n].split()
    # cut the 'Vector:' label off explicitly
    # (str.lstrip strips a set of characters, not a prefix)
    x[2] = x[2].split(':', 1)[1]
    x = [v for v in x if v not in ['Type:', 'Mission:']]
    ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First, you split each element of the Details column into a list of strings. Second, you deal with the 'Vector:...' special case and filter out the label names. Third, you store all the values in a list, which is in turn converted to a NumPy array with shape (length, 3). Finally, you drop the old 'Details' column and concatenate the rest with the df created from the split strings.
You may want to try a more efficient way of transforming your data at read time by applying these ideas inside pd.read_excel via its converters argument.
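As an alternative to the explicit loop, the same extraction can be sketched with Series.str.extract and named groups. The cell format below is an assumption based on the description (labels separated by '\n', no space after 'Vector:'), and like the split() approach it only handles values without internal spaces.

```python
import pandas as pd

# Hypothetical cell format: 'Vector:' has no space after the colon.
data = pd.DataFrame({
    'ID': [1, 2],
    'Details': ['Type: A\nVector:V1\nMission: M1',
                'Type: B\nVector:V2\nMission: M2'],
})

# Named groups become the column names of the extracted frame.
pattern = (r'Type:\s*(?P<Type>\S+)\s+'
           r'Vector:\s*(?P<Vector>\S+)\s+'
           r'Mission:\s*(?P<Mission>\S+)')
parts = data['Details'].str.extract(pattern)

final = pd.concat([data.drop(columns='Details'), parts], axis=1)
print(final)
```

Rows that do not match the pattern come back as NaN instead of raising, which makes malformed cells easier to spot.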
I have data saved in a CSV. I am querying this data using Python and turning it into a pandas DataFrame. I have a column called team_members in my dataframe. It has a dictionary of values. The column looks like this when called:
dt.team_members[1]
Output:
"[{'name': 'LearnFromScratch', 'tier': 'novice tier'}, {'name': 'Scratch', 'tier': 'novice tier'}]"
I looked at this explanation and other similar ones:
Splitting multiple Dictionaries within a Pandas Column
But they do not work for me.
I want to get a column called name with the names of the members of the team and another with tier of each member.
Can you help me?
Thanks!!
I assume the output of dt.team_members[1] is a list.
If so, you can directly pass that list to create a dataframe something like:
pd.DataFrame(dt.team_members[1])
You can extract the names by looping over it:
list(map(lambda x: x.get("name"), dt.team_members[1]))
If you need a new dataframe, follow @vivek's answer:
pd.DataFrame(dt.team_members[1])
You can try this:
members = dt.team_members[1]  # avoid shadowing the built-in name `list`
li = []
for member in members:
    t = {member['name']: member['tier']}
    li.append(t)
print(li)
The output is:
[{'LearnFromScratch': 'novice tier'}, {'Scratch': 'novice tier'}]
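Note that the output shown in the question is wrapped in double quotes, which suggests each cell is a string rather than a real list. Under that assumption, a sketch that parses the strings with ast.literal_eval and builds name and tier columns could look like this (the 'team' column and the sample values are invented):

```python
import ast
import pandas as pd

# Hypothetical frame; each cell holds the list as text, as read from a CSV.
dt = pd.DataFrame({
    'team': ['t1', 't2'],
    'team_members': [
        "[{'name': 'Ann', 'tier': 'pro tier'}]",
        "[{'name': 'LearnFromScratch', 'tier': 'novice tier'}, "
        "{'name': 'Scratch', 'tier': 'novice tier'}]",
    ],
})

# Turn each string into a real list of dicts.
parsed = dt['team_members'].apply(ast.literal_eval)

# One row per team member.
exploded = dt.assign(team_members=parsed).explode('team_members', ignore_index=True)

# Expand each dict into its own 'name' and 'tier' columns.
members = pd.concat(
    [exploded.drop(columns='team_members'),
     pd.DataFrame(exploded['team_members'].tolist())],
    axis=1,
)
print(members)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for text read from a file.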