Pandas reorder raw content - python

I have an Excel file that I read into a DataFrame, dropping two columns, using the code below:
df = pd.read_excel(self.file)
df.drop(['Name', 'Scopus ID'], axis=1, inplace=True)
Now my goal is to swap the order of every name in the df.
For example,
the first name is Adedokun, Babatunde Olubayo,
which I would like to convert to Babatunde Olubayo Adedokun.
How can I do that for the entire df, whatever the name is?

Split each name on the comma and concatenate the parts in reverse order.
import pandas as pd
data = {'Name': ['Adedokun, Babatunde Olubayo', "Uwizeye, Dieudonné"]}
df = pd.DataFrame(data)
def swap_name(name):
    name = name.split(', ')
    return name[1] + ' ' + name[0]
df['Name'] = df['Name'].apply(swap_name)
df
Output:
> Name
> 0 Babatunde Olubayo Adedokun
> 1 Dieudonné Uwizeye

Let's assume you want to do the operation on the "Other Names1" column. Split on ", " and join the parts in reverse order:
df.loc[:, "Other Names1"] = df["Other Names1"].str.split(", ").apply(lambda parts: " ".join(reversed(parts)))

You can use the str accessor:
df['Name'] = df['Name'].str.split(', ').str[::-1].str.join(' ')
print(df)
# Output
Name
0 Babatunde Olubayo Adedokun
1 Dieudonné Uwizeye
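The answers above assume that every value contains exactly one ', ' separator. A minimal sketch for a column that may also hold names without a comma or missing values follows; the helper name swap_name_safe and the extra sample rows are made up for illustration and do not come from the original file.

```python
import pandas as pd

def swap_name_safe(name):
    """Turn 'Last, First' into 'First Last'; leave anything else unchanged."""
    # Missing values and names without a comma are returned as they are.
    if not isinstance(name, str) or "," not in name:
        return name
    last, first = name.split(",", 1)
    return f"{first.strip()} {last.strip()}"

# Hypothetical sample data: two real-looking rows plus two edge cases.
df = pd.DataFrame(
    {"Name": ["Adedokun, Babatunde Olubayo", "Uwizeye, Dieudonné", "Madonna", None]}
)
df["Name"] = df["Name"].apply(swap_name_safe)
print(df)
```

Splitting with a limit of 1 keeps any further commas in the given-name part instead of raising an error.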

Related

dataframe if string contains substring divide by 1000

I have a dataframe like the one below.
import pandas as pd
data = [['yellow', '800test' ], ['red','900ui'], ['blue','900test'], ['indigo','700test'], ['black','400ui']]
df = pd.DataFrame(data, columns = ['Donor', 'value'])
In the value field, if a string contains, say, 'test', I'd like to divide the number in it by 1000. What would be the best way to do this?
You can do this with apply and a lambda function:
df['value_2'] = df.apply(lambda x: str(int(x.value.replace('test',''))/1000)+'test' if x.value.find('test') > -1 else x.value, axis=1)
df
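Output:
Donor value value_2
0 yellow 800test 0.8test
1 red 900ui 900ui
2 blue 900test 0.9test
3 indigo 700test 0.7test
4 black 400ui 400ui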
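Alternatively, split each value into its numeric part and its suffix, scale the numeric part, and join the two again:
# Raw strings avoid invalid-escape warnings for \d and \D
df["value_num"] = df["value"].str.findall(r"\d+").str[0].astype(float)
df["value_suffix"] = df["value"].str.findall(r"\D+").str[0]
df.loc[df["value_suffix"] == "test", "value_num"] = df["value_num"] / 1000
# "{:g}" drops a trailing ".0", so 900.0 is written back as "900"
df["value"] = df["value_num"].map("{:g}".format) + df["value_suffix"]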
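Another option is a single regular-expression replacement with a callable. This is only a sketch: it assumes the values that should be scaled consist of digits followed directly by the literal suffix 'test', and the helper name scale_test is made up for illustration.

```python
import pandas as pd

data = [['yellow', '800test'], ['red', '900ui'], ['blue', '900test'],
        ['indigo', '700test'], ['black', '400ui']]
df = pd.DataFrame(data, columns=['Donor', 'value'])

def scale_test(match):
    # match.group(1) is the run of digits in front of the literal 'test'
    return f"{int(match.group(1)) / 1000:g}test"

# Only strings that match the whole pattern are rewritten; others stay as they are.
df['value'] = df['value'].str.replace(r'^(\d+)test$', scale_test, regex=True)
print(df)
```

Values such as '900ui' do not match the pattern, so they pass through unchanged and no helper columns are left behind.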

How to reformat dataframe using pandas?

I have the following dataframe:
data = {'Names':['Abbey','English','Maths','Billy','English','Maths','Charlie','English','Maths'],'Subject Grade':['Student Name',85,91,'Student Name',82,74,'Student Name',83,96]}
df = pd.DataFrame(data, columns = ['Names','Subject Grade'])
I would like to reformat the dataframe in order for the names, subject and grades to all be in their respective columns as follows:
data2 = {'Names':['Abbey','Abbey','Billy','Billy','Charlie','Charlie'],'Subject':['English','Maths','English','Maths','English','Maths'],'Grade':[85,91,82,74,83,96]}
df2 = pd.DataFrame(data2, columns = ['Names','Subject','Grade'])
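You can forward-fill the student name and then drop the "Student Name" marker rows:
df['name'] = df['Names'].mask(df['Subject Grade'] != "Student Name")
df['name'] = df['name'].ffill()
df = df.query('`Subject Grade` != "Student Name"')
df = df.rename(columns={'Names':'Subject', 'Subject Grade':'Grade', 'name':'Names'})
# put the columns in the requested order and renumber the rows
df = df[['Names', 'Subject', 'Grade']].reset_index(drop=True)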

how to split a column by another column in pandas dataframe

I am cleaning data in a pandas dataframe and want to split one column by another column.
I want to split column 'id' by column 'eNBID', but I don't know how.
import pandas as pd
id_list = ['4600375067649','4600375077246','460037495681','460037495694']
eNBID_list = ['750676','750772','749568','749569']
df=pd.DataFrame({'id':id_list,'eNBID':eNBID_list})
df.head()
id eNBID
4600375067649 750676
4600375077246 750772
460037495681 749568
460037495694 749569
What I want:
df.head()
id eNBID
460-03-750676-49 750676
460-03-750772-46 750772
460-03-749568-1 749568
460-03-749569-4 749569
# Column 'eNBID' is the third part of column 'id'; the items in column 'eNBID' are 6 or 7 characters long.
Assuming the prefix 46003 stays the same for all ids:
import re
df['id'] = df.apply(lambda x: '-'.join([i[:3] + '-' + i[3:] if '460' in i else i for i in re.findall(r'(\w*)(' + x.eNBID + r')(\w*)', x.id)[0]]), axis=1)
Output
id eNBID
0 460-03-750676-49 750676
1 460-03-750772-46 750772
2 460-03-749568-1 749568
3 460-03-749569-4 749569
Alternatively, insert '-' after the 3rd, 5th, and 11th characters (this assumes every eNBID has exactly 6 characters):
df['id'] = df['id'].apply(lambda s: s[:3] + '-' + s[3:5] + '-' + s[5:11] + '-' + s[11:])
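Since the question says an eNBID can be 6 or 7 characters long, fixed slice positions may not always fit. The following sketch locates the eNBID inside the id instead; it assumes a 5-character prefix split as 3 + 2 (as in 460-03), and the helper name split_id is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['4600375067649', '4600375077246', '460037495681', '460037495694'],
    'eNBID': ['750676', '750772', '749568', '749569'],
})

def split_id(row):
    raw, enbid = row['id'], row['eNBID']
    # Look for the eNBID at or after the assumed 5-character prefix.
    start = raw.find(enbid, 5)
    if start == -1:
        return raw  # eNBID not found: leave the id unchanged
    end = start + len(enbid)
    return f"{raw[:3]}-{raw[3:start]}-{enbid}-{raw[end:]}"

df['id'] = df.apply(split_id, axis=1)
print(df)
```

Because the end position is derived from len(enbid), the same function works for 6- and 7-character eNBID values.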

the comma in the csv text sentence is not readable in the output of python pandas read

Text data in csv file:
Example1:
id,name,class
1,hendro,bandung
The result:
id name class
1 hendro bandung
Example2:
id,name,class
1,hendro,"bandung,semarang"
The result:
id name class
1,hendro,"bandung,semarang" NaN NaN
I tried pandas.read_csv():
import pandas as pd
train = pd.read_csv('book1.csv')
train
My expectation:
the result for Example2 should look like this:
id name class
1 hendro bandung,semarang
What's wrong? How can I fix it?
You can try the following approach for this case.
Step 1: Open your CSV and replace the double quotes (") with single quotes (').
Step 2: Run the code below.
df = pd.read_csv('Workbook1.csv', sep=',', quotechar="'")
print(df)
# rename the first and last columns, because an extra '"' is attached to them
df = df.rename(columns={'"id':'id','class"':'class'})
# remove all the '"' characters from the data
df = df.apply(lambda col: col.astype(str).str.replace('"', '', regex=False))
print(df)
Output:
"id name class"
0 "1 hendro bandung,semarang"
1 "2 he'sn hen's"
id name class
0 1 hendro bandung,semarang
1 2 he'sn hen's
The data looks like this when opened in Notepad:
"id,name,class"
"1,hendro,'bandung,semarang'"
"2,he'sn,hen's"
Try adding the following argument to your code:
import pandas as pd
pd.read_csv('book1.csv', quotechar = '"')
Try this:
import pandas as pd
df = pd.read_csv('book1.csv', sep=",", names= ['id','Name','From','To'])
df = df.iloc[1:]
df['class'] = df['From'] +','+ df['To']
df = df[['id','Name','class']]
df
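As a point of comparison, here is a small sketch that uses in-memory text instead of book1.csv. The first part shows that read_csv handles a quoted field containing a comma with its default settings. The second part is only an assumption about what might have gone wrong: it supposes the file was exported with each whole line wrapped in an extra pair of quotes, and unwraps the lines with the standard csv module before parsing.

```python
import csv
import io

import pandas as pd

# A well-formed file: only the field that contains a comma is quoted.
good = 'id,name,class\n1,hendro,"bandung,semarang"\n'
df_good = pd.read_csv(io.StringIO(good))
print(df_good)

# Hypothetical broken export: every whole line is a single quoted field.
wrapped = '"id,name,class"\n"1,hendro,""bandung,semarang"""\n'
# First pass: csv.reader strips the outer quotes and un-doubles the inner ones.
inner_lines = [row[0] for row in csv.reader(io.StringIO(wrapped)) if row]
# Second pass: parse the recovered lines as ordinary CSV.
df_fixed = pd.read_csv(io.StringIO("\n".join(inner_lines)))
print(df_fixed)
```

Opening the real file in a plain text editor shows which of the two shapes it has, and therefore whether the second pass is needed at all.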

How to extract entire part of string after certain character in dataframe column?

I am using the code below to extract the last number from each pandas dataframe column name.
names = df.columns.values
new_df = pd.DataFrame()
for name in names:
    if ('.value.' in name) and df[name][0]:
        last_number = int(name[-1])
        print(last_number)
        key, value = my_dict[last_number]
        try:
            new_df[value][0] = list(new_df[value][0]) + [key]
        except:
            new_df[value] = [key]
name is a string that looks like this:
'data.answers.1234567890.value.0987654321'
I want to take the entire number after .value. rather than only the last digit. How would I do this inside the if statement above?
Use str.split, and extract the last slice with -1 (also gracefully handles false cases):
df = pd.DataFrame(columns=[
'data.answers.1234567890.value.0987654321', 'blahblah.value.12345', 'foo'])
df.columns = df.columns.str.split('value.').str[-1]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
Another alternative is splitting inside a listcomp:
df.columns = [x.split('value.')[-1] for x in df.columns]
df.columns
# Index(['0987654321', '12345', 'foo'], dtype='object')
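To apply the same idea inside the loop from the question, the suffix can be extracted per name. This is a sketch only; the helper name value_suffix is made up, and the names list simply reuses the example column names from above.

```python
def value_suffix(name, marker='.value.'):
    """Return the text after the last marker, or None if the marker is absent."""
    head, sep, tail = name.rpartition(marker)
    return tail if sep else None

names = ['data.answers.1234567890.value.0987654321', 'blahblah.value.12345', 'foo']
for name in names:
    suffix = value_suffix(name)
    if suffix is not None:
        # int() drops leading zeros; keep the string if they matter as a key.
        last_number = int(suffix)
        print(name, '->', last_number)
```

Returning None for names without the marker lets the loop skip them explicitly, instead of treating the whole name as the suffix the way split(...)[-1] does for 'foo'.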
