I have a CSV file with only one column, "notes". I want to merge rows of the DataFrame based on a condition.
Input_data={'notes':
['aaa','bbb','*','hello','**','my name','is xyz',
'(1)','this is','temp','name',
'(2)','BTW','how to','solve this',
'(3)','with python','I don’t want this to be added ',
'I don’t want this to be added ']}
df_in = pd.DataFrame(Input_data)
The input looks like the above. The desired output:
output_Data={'notes':
['aaa','bbb','*hello','**my name is xyz',
'(1) this is temp name',
'(2) BTW how to solve this',
'(3) with python','I don’t want this to be added ',
'I don’t want this to be added ']}
df_out=pd.DataFrame(output_Data)
I want to merge each row into the preceding row that contains either "*" or "(number)", so the output will look like the above.
Rows which cannot be merged should be left as they are.
Also, for the last marker row there is no clean way to know how far to merge, so just merge only the one next row.
I solved this, but my solution is very long. Is there a simpler way?
df = pd.DataFrame(Input_data)
notes = []; temp = []; flag = ''; value = ''; c = 0; chk_star = 'yes'
for i, row in df.iterrows():
    row[0] = str(row[0])
    if '*' in row[0].strip()[:5] and chk_star == 'yes':
        value = row[0].strip()
        temp = temp + [value]
        value = ''
        continue
    if '(' in row[0].strip()[:5]:
        chk_star = 'no'
        temp = temp + [value]
        value = ''; c = 0
        flag = 'continue'
        value = row[0].strip()
    if flag == 'continue' and '(' not in row[0][:5]:
        value = value + row[0]
        c = c + 1
        if c > 4:
            temp = temp + [value]
            print("111", value, temp)
            break
if '' in temp:
    temp.remove('')
df = pd.DataFrame({'notes': temp})
The solution below recognises special markers like *, ** and (number) at the start of a sentence and merges the following rows into that row, except for the last row.
import pandas as pd
import re
df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
'(1)','this is','temp','name',
'(2)','BTW','how to','solve this',
'(3)','with python','I don’t want this to be added ',
'I don’t want this to be added ']})
pattern = r"^\(\d+\)|^\*+"  # Pattern to identify strings starting with (number), * or **
#print(df)
#Selecting index based on the above pattern
selected_index = df[df["row"].str.contains(re.compile(pattern))].index.values
delete_index = []
for index in selected_index:
    i = 1
    # Merge rows until the next selected index is found; record merged rows in delete_index
    while index + i not in selected_index and index + i < len(df) - 1:
        df.at[index, 'row'] += ' ' + df.at[index + i, 'row']
        delete_index.append(index + i)
        i += 1
df.drop(delete_index, inplace=True)
#print(df)
Output:
row
0 aaa
1 bbb
2 * hello
4 ** my name is xyz
7 (1) this is temp name
11 (2) BTW how to solve this
15 (3) with python I don’t want this to be added
18 I don’t want this to be added
You can reset the index if you want, using df.reset_index().
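For instance, a minimal sketch (with made-up rows) of rebuilding a clean 0-based index after a drop; drop=True discards the old index instead of keeping it as a column:

```python
import pandas as pd

df = pd.DataFrame({'row': ['aaa', 'bbb', '* hello', '** my name is xyz']})
df = df.drop([1])               # dropping a row leaves a gap in the index: 0, 2, 3
df = df.reset_index(drop=True)  # rebuild a clean 0..n-1 index
```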
I think it is easier to design your logic by separating df_in into 3 parts: top, middle and bottom, keeping top and bottom intact while joining the middle part, and finally concatenating the 3 parts together into df_out.
First, create the m1 and m2 masks to separate df_in into the 3 parts:
m1 = df_in.notes.str.strip().str.contains(r'^\*+|\(\d+\)$').cummax()
m2 = ~df_in.notes.str.strip().str.contains(r'^I don’t want this to be added$')
top = df_in[~m1].notes
middle = df_in[m1 & m2].notes
bottom = df_in[~m2].notes
Next, create groupby_mask to group the rows, then groupby and join:
groupby_mask = middle.str.strip().str.contains(r'^\*+|\(\d+\)$').cumsum()
middle_join = middle.groupby(groupby_mask).agg(' '.join)
Out[3110]:
notes
1 * hello
2 ** my name is xyz
3 (1) this is temp name
4 (2) BTW how to solve this
5 (3) with python
Name: notes, dtype: object
Finally, use pd.concat to concatenate top, middle_join and bottom:
df_final = pd.concat([top, middle_join, bottom], ignore_index=True).to_frame()
Out[3114]:
notes
0 aaa
1 bbb
2 * hello
3 ** my name is xyz
4 (1) this is temp name
5 (2) BTW how to solve this
6 (3) with python
7 I don’t want this to be added
8 I don’t want this to be added
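The heart of this answer is that .cumsum() over a boolean marker mask produces a group id that increments at each marker row and stays constant until the next one; a standalone sketch of just that trick, on a shortened made-up series:

```python
import pandas as pd

s = pd.Series(['*', 'hello', '(1)', 'this is', 'temp'])
# True on marker rows ('*', '**', '(1)', ...); cumsum turns the markers into group ids
group_id = s.str.contains(r'^\*+$|^\(\d+\)$').cumsum()
# group ids are [1, 1, 2, 2, 2]; each group is joined into one string
joined = s.groupby(group_id).agg(' '.join)
```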
You can use a mask to avoid the for loop:
df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
'(1)','this is ','temp ','name',
'(2)','BTW ','how to ','solve this',
'(3)','with python ','I don’t want this to be added ',
'I don’t want this to be added ']})
special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))
# We find the indexes where we will have to merge
index_to_merge = df[df['row'].isin(special)].index.values
for idx, val in enumerate(index_to_merge):
    if idx != len(index_to_merge) - 1:
        df.loc[val, 'row'] += ' ' + df.loc[val + 1:index_to_merge[idx + 1] - 1, 'row'].values.sum()
    else:
        df.loc[val, 'row'] += ' ' + df.loc[val + 1:, 'row'].values.sum()
# We delete the rows that we just used to merge
df = df.drop([x for x in range(len(df)) if x not in index_to_merge])
Out:
row
2 * hello
4 ** my nameis xyz
7 (1) this is temp name
11 (2) BTW how to solve this
15 (3) with python I don’t want this to be added ..
You could also convert your column into a numpy array and use numpy functions to simplify what you did. First, use np.where and np.isin to find the indexes where you will have to merge; that way you don't have to iterate over your whole array with a for loop.
Then you can do the merges on the corresponding indexes. Finally, you can delete the values that have been merged. Here is what it could look like:
import numpy as np

list_to_merge = np.array(['aaa','bbb','*','hello','**','my name','is xyz',
'(1)','this is','temp','name',
'(2)','BTW','how to','solve this',
'(3)','with python','I don’t want this to be added ',
'I don’t want this to be added '])
special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))
ix = np.isin(list_to_merge, special)
rows_to_merge = np.where(ix)[0]
# We merge the rows
for index_to_merge in rows_to_merge:
    # Check that we are not trying to merge with an out-of-bounds value
    if index_to_merge != len(list_to_merge) - 1:
        list_to_merge[index_to_merge] = list_to_merge[index_to_merge] + ' ' + list_to_merge[index_to_merge + 1]
# We delete the rows that have just been used to merge:
rows_to_delete = rows_to_merge + 1
list_to_merge = np.delete(list_to_merge, rows_to_delete)
Out:
['aaa', 'bbb', '* hello', '** my name', 'is xyz', '(1) this is',
'temp', 'name', '(2) BTW', 'how to', 'solve this',
'(3) with python', 'I don’t want this to be added ',
'I don’t want this to be added ']
So I've got a pandas dataframe that contains a ton of address info, with columns like:
AddressNumber
StreetNamePrefix
StreetName
StreetNameSuffix
StreetNamePreDirectional
StreetNamePostDirectional
OccupancySuite
I'd like to combine everything except OccupancySuite into Address1.
I can get Address2 easily enough; it's just OccupancySuite.
What I'm getting hung up on is combining the rest of the columns, separated by a space, while ignoring a column AND its space when it's null. I'd rather not have multiple spaces between address parts caused by multiple null columns.
What I have currently is probably pretty hacky, but it gets me there minus the additional spaces between the columns/words.
#Example Pandas DF with two addresses
import pandas as pd
data = [['123','','','easy','st','',''],['500','N','County Road','3932','','East','']]
df = pd.DataFrame(data,columns=['AddressNumber','StreetNamePreDirectional','StreetNamePrefix','StreetName','StreetNameSuffix','StreetNamePostDirectional','OccupancySuite'])
df['Address1']= df['AddressNumber'].fillna('') + ' ' + df['StreetNamePreDirectional'].fillna('') + ' ' + df['StreetNamePrefix'].fillna('') + ' ' + df['StreetName'].fillna('') + ' ' + df['StreetNameSuffix'].fillna('') + ' ' + df['StreetNamePostDirectional'].fillna('')
df.to_csv('localpath\\cleaned_addresses.csv')
If you open said csv, you'll see
123   easy st
500 N County Road 3932  East
What I'm needing is
123 easy st
500 N County Road 3932 East
You can fix your output by replacing multiple spaces with a single space in pandas:
df['Address1'] = df['Address1'].str.replace(r'\s+', ' ', regex=True)
Also, you could concatenate your strings more succinctly with an apply:
concat_cols = ['AddressNumber',
'StreetNamePreDirectional',
'StreetNamePrefix',
'StreetName',
'StreetNameSuffix',
'StreetNamePostDirectional']
df['Address1'] = df[concat_cols].apply(lambda x:' '.join(x.values), axis=1)
df['Address1'] = df['Address1'].str.replace(r'\s+',' ')
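A variant worth knowing (a sketch, assuming all the address parts are strings with NaN already filled): join only the non-empty parts per row, so no second pass over the spaces is needed:

```python
import pandas as pd

df = pd.DataFrame([['123', '', '', 'easy', 'st', ''],
                   ['500', 'N', 'County Road', '3932', '', 'East']],
                  columns=['AddressNumber', 'StreetNamePreDirectional',
                           'StreetNamePrefix', 'StreetName',
                           'StreetNameSuffix', 'StreetNamePostDirectional'])
# join only the non-empty parts, so no duplicate spaces ever appear
df['Address1'] = df.fillna('').apply(
    lambda row: ' '.join(v for v in row if v), axis=1)
```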
I hope this helps you:
I added the column "Address1" to the data frame.
Then you can run a for loop over the length of the data frame (to work with the rows) and over the columns of the data frame.
With an if statement you skip the last two columns, "OccupancySuite" and "Address1", and also skip null values.
df["Address1"] = ''
for a in range(0, len(df)):
    for element in df.columns:
        if element in ["OccupancySuite", "Address1"]:
            continue
        values = df[element].iloc[a]
        if not values:
            continue
        else:
            df.loc[a, "Address1"] += df[element].iloc[a] + ' '
And if the value is not null, the info plus a space is appended (last line).
Here you can see more info about the iloc method.
df.to_csv('localpath\\cleaned_addresses.csv')
then you will have the correct spaces.
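As an aside, since chained iloc assignment on a column (df["Address1"].iloc[a] += ...) can trigger pandas' SettingWithCopyWarning, here is a minimal sketch of doing the assignment through a single .loc or .iloc call instead:

```python
import pandas as pd

df = pd.DataFrame({'Address1': ['', '']})
# one-step .loc assignment: row label and column label together
df.loc[0, 'Address1'] = '123 easy st'
# positional equivalent with .iloc: row 1, column 0
df.iloc[1, 0] = '500 N County Road 3932 East'
```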
I need to remove whitespaces in pandas df column. My data looks like this:
industry magazine
Home "Goodhousekeeping.com"; "Prevention.com";
Fashion "Cosmopolitan"; " Elle"; "Vogue"
Fashion " Vogue"; "Elle"
Below is my code:
# split magazine column values, create a new column in df
df['magazine_list'] = dfl['magazine'].str.split(';')
# strip the leading whitespace from strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN, I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the white spaces either.
Use a list comprehension with strip on the split values; also strip each value before splitting, to remove trailing ;, spaces and " characters:
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print (df)
industry magazine \
0 Home Goodhousekeeping.com; "Prevention.com";
1 Fashion Cosmopolitan; " Elle"; "Vogue"
2 Fashion Vogue; "Elle
magazine_list
0 [Goodhousekeeping.com, Prevention.com]
1 [Cosmopolitan, Elle, Vogue]
2 [Vogue, Elle]
jezrael provides a good solution. It is useful to know that pandas has string accessors for similar operations, without the need for list comprehensions. A list comprehension is usually faster, but depending on the use case the pandas built-in functions can be more readable or simpler to code.
df['magazine'] = (
df['magazine']
.str.replace(' ', '', regex=False)
.str.replace('"', '', regex=False)
.str.strip(';')
.str.split(';')
)
Output
industry magazine
0 Home [Goodhousekeeping.com, Prevention.com]
1 Fashion [Cosmopolitan, Elle, Vogue]
2 Fashion [Vogue, Elle]
An image was added so that you can see how my dataframe df2 looks.
I have written code to check a condition and, if it matches, update the items of my list, but it isn't working at all: it returns the same, un-updated list. Is this code wrong?
Please suggest.
emp1 = []
for j in range(8, df.shape[0], 10):
    for i in range(2, len(df.columns)):
        b = df.iloc[j][3]
        # values are appended from dataframe to list, like ['3 : 3', '4 : 4', ...]

ess = []
for i in range(df2.shape[0]):
    a = df2.iloc[i][2]
    ess.append(a)  # values taken from file (3, 4, 5, 6, 7, 8, ... i.e. unique id numbers)

nm = []
for i in range(df2.shape[0]):
    b = df2.iloc[i][3]
    nm.append(b)  # this list contains the names of the employees

ap = [i.split(' : ', 1)[0] for i in emp1]  # split on ' : '; if '3 : 3', stores the left 3
bp = [i.split(' : ', 1)[1] for i in emp1]  # if '3 : 3', stores the right 3
cp = ' : '

# the purpose is to replace the right 3 with the name, i.e. '3 : nameabc', and then rejoin
for i in range(len(emp1)):
    for j in range(len(ess)):
        # print(i, j)
        if ap[i] == ess[j]:
            bp[i] = nm[j]

for i in range(df.shape[0]):
    ap[i] = ap[i] + cp  # adding ' : ' after the left integer
emp = [i + j for i, j in zip(ap, bp)]  # joining both values
expected output:
if emp1 contains 3 : 3, then after processing it should show 3 : nameabc.
Maybe I missed something, but I don't see you assigning any value to emp1. It's empty, and for ap and bp you are looping over that empty emp1. That may be what's causing the problem.
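Assuming emp1 does get filled somewhere (e.g. with values like '3 : 3'), the replacement itself can be done with a plain dict lookup instead of the four parallel lists; a sketch with hypothetical stand-in data, since df2's real columns aren't shown:

```python
# hypothetical stand-ins for df2's id column (ess) and name column (nm)
ess = ['3', '4']
nm = ['nameabc', 'namexyz']
id_to_name = dict(zip(ess, nm))

emp1 = ['3 : 3', '4 : 4']
# replace the right-hand value with the mapped name, keeping it if no match
emp = ['{} : {}'.format(left, id_to_name.get(left, right))
       for left, right in (e.split(' : ', 1) for e in emp1)]
```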
I have 3 different columns in different dataframes that look like this.
Column 1 has sentence templates, e.g. "He would like to [action] this week".
Column 2 has pairs of words, e.g. "exercise, swim".
The third column has the type for the word pair, e.g. [action].
I assume there should be something similar to "melt" in R, but I'm not sure how to do the replacement.
I would like to create a new column/dataframe which will have all the possible options for each sentence template (one sentence per row):
He would like to exercise this week.
He would like to swim this week.
The number of templates is significantly lower than the number of words I have. There are several types of word pairs (action, description, object, etc).
#a simple example of what I would like to achieve
import pandas as pd
#input1
templates = pd.DataFrame(columns=list('AB'))
templates.loc[0] = [1,'He wants to [action] this week']
templates.loc[1] = [2,'She noticed a(n) [object] in the distance']
templates
#input 2
words = pd.DataFrame(columns=list('AB'))
words.loc[0] = ['exercise, swim', 'action']
words.loc[1] = ['bus, shop', 'object']
words
#output
result = pd.DataFrame(columns=list('AB'))
result.loc[0] = [1, 'He wants to exercise this week']
result.loc[1] = [2, 'He wants to swim this week']
result.loc[2] = [3, 'She noticed a(n) bus in the distance']
result.loc[3] = [4, 'She noticed a(n) shop in the distance']
result
First, create new columns with Series.str.extract using the words from words['B'], and then use Series.map to get the replacement values:
import re

pat = '|'.join(r"\[{}\]".format(re.escape(x)) for x in words['B'])
templates['matched'] = templates['B'].str.extract('(' + pat + ')', expand=False).fillna('')
templates['repl'] = (templates['matched']
                     .map(words.set_index('B')['A'].rename(lambda x: '[' + x + ']'))
                     .fillna(''))
print (templates)
A B matched repl
0 1 He wants to [action] this week [action] exercise, swim
1 2 She noticed a(n) [object] in the distance [object] bus, shop
And then replace in list comprehension:
z = zip(templates['B'],templates['repl'], templates['matched'])
result = pd.DataFrame({'B':[a.replace(c, y) for a,b,c in z for y in b.split(', ')]})
result.insert(0, 'A', result.index + 1)
print (result)
A B
0 1 He wants to exercise this week
1 2 He wants to swim this week
2 3 She noticed a(n) bus in the distance
3 4 She noticed a(n) shop in the distance
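An alternative sketch of the same task (assuming pandas >= 0.25 for DataFrame.explode): split the word pairs to one word per row, then merge on the placeholder type and substitute:

```python
import pandas as pd

templates = pd.DataFrame({'A': [1, 2],
                          'B': ['He wants to [action] this week',
                                'She noticed a(n) [object] in the distance']})
words = pd.DataFrame({'A': ['exercise, swim', 'bus, shop'],
                      'B': ['action', 'object']})

# one word per row: 'exercise, swim' -> two rows tagged 'action'
w = words.assign(A=words['A'].str.split(', ')).explode('A')
# pull the [placeholder] name out of each template
t = templates.assign(key=templates['B'].str.extract(r'\[(\w+)\]', expand=False))
# match each template with its candidate words, then substitute the placeholder
merged = t.merge(w, left_on='key', right_on='B', suffixes=('', '_w'))
merged['B'] = [s.replace('[' + k + ']', word)
               for s, k, word in zip(merged['B'], merged['key'], merged['A_w'])]
result = merged[['B']].reset_index(drop=True)
result.insert(0, 'A', result.index + 1)
```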
I'm working with an xlsx file with pandas and I would like to add the word "bodypart" in a column if the preceding column contains a word in a predefined list of bodyparts.
Original Dataframe:
Sentence Type
my hand NaN
the fish NaN
Result Dataframe:
Sentence Type
my hand bodypart
the fish NaN
Nothing I've tried works. I feel I'm missing something very obvious. Here's my last (failed) attempt:
import pandas as pd
import numpy as np
bodyparts = ['lip ', 'lips ', 'foot ', 'feet ', 'heel ', 'heels ', 'hand ', 'hands ']
df = pd.read_excel(file)
for word in bodyparts:
    if word in df["Sentence"]:
        df["Type"] = df["Type"].replace(np.nan, "bodypart", regex=True)
I also tried this, with "NaN" and NaN as variants for the first argument of str.replace:
if word in df['Sentence'] : df["Type"] = df["Type"].str.replace("", "bodypart")
Any help would be greatly appreciated!
You can create a regex to search on word boundaries and then use that as an argument to str.contains, eg:
import pandas as pd
import numpy as np
import re
bodyparts = ['lips?', 'foot', 'feet', 'heels?', 'hands?', 'legs?']
rx = re.compile('|'.join(r'\b{}\b'.format(el) for el in bodyparts))
df = pd.DataFrame({
'Sentence': ['my hand', 'the fish', 'the rabbit leg', 'hand over', 'something', 'cabbage', 'slippage'],
'Type': [np.nan] * 7
})
df.loc[df.Sentence.str.contains(rx), 'Type'] = 'bodypart'
Gives you:
Sentence Type
0 my hand bodypart
1 the fish NaN
2 the rabbit leg bodypart
3 hand over bodypart
4 something NaN
5 cabbage NaN
6 slippage NaN
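If the sentences can also contain capitalised forms ('My Hand'), the same word-boundary pattern can be compiled case-insensitively; a small sketch assuming the same kind of setup:

```python
import re
import pandas as pd
import numpy as np

bodyparts = ['lips?', 'hands?']
# re.IGNORECASE makes the word-boundary pattern match regardless of case
rx = re.compile('|'.join(r'\b{}\b'.format(el) for el in bodyparts), re.IGNORECASE)
df = pd.DataFrame({'Sentence': ['My Hand', 'the fish'], 'Type': [np.nan] * 2})
df.loc[df.Sentence.str.contains(rx), 'Type'] = 'bodypart'
```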
A dirty solution would involve checking the intersection of two sets:
set A is your list of body parts, set B is the set of words in the sentence.
df['Type'] = df['Sentence']\
    .apply(lambda x: 'bodypart' if set(x.split())
           .intersection(bodyparts) else None)
The simplest way:
df.loc[df.Sentence.isin(bodyparts), 'Type'] = 'bodypart'
But first you must discard the spaces in bodyparts:
bodyparts = {'lip','lips','foot','feet','heel','heels','hand','hands'}
df.Sentence.isin(bodyparts) selects the matching rows, and Type is the column to set; .loc is the indexer that permits the modification.