Preprocessing text data with different notation in Python - python

Using Python 3, I work with a data frame which requires text preprocessing.
The data frame consists of historical sales for many different medical products with many different strengths. For simplification, the code below only shows a part of the strength column.
import pandas as pd
df = pd.DataFrame({'Strength': ['20 mg / 120 mg', ' 40/320 mg', '20mg/120mg', '150+750mg', '20/120MG', '62.5mg/375mg', '100 mg', 'Product1 20 mg, Product2 120 mg', '40mg/320mg', 'Product 20mg/120mg', 'Product1 20mg Product2 120mg', '100mg/1ml', '20 mg./ 120 mg..', '62.5 mg / 375 mg', '40/320mg 9s', '40/320', '50/125', '100mg..', '20/120']})
Strength
0 20 mg / 120 mg
1 40/320 mg
2 20mg/120mg
3 150+750mg
4 20/120MG
5 62.5mg/375mg
6 100 mg
7 Product1 20 mg, Product2 120 mg
8 40mg/320mg
9 Product 20mg/120mg
10 Product1 20mg Product2 120mg
11 100mg/1ml
12 20 mg./ 120 mg..
13 62.5 mg / 375 mg
14 40/320mg 9s
15 40/320
16 50/125
17 100mg..
18 20/120
As you can see, there are different spellings for products which actually have the same strength. For example, '20 mg / 120 mg' and 'Product1 20 mg, Product2 120 mg' denote the same strength.
Setting the text to lowercase, removing whitespace and replacing + with /, as shown in the following code, brings some standardization, but several rows that clearly denote the same strength still differ.
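df['Strength'] = df['Strength'].str.lower()
df['Strength'] = df['Strength'].str.replace(' ', '', regex=False)
# regex=False treats '+' as a literal character rather than a regex quantifier
df['Strength'] = df['Strength'].str.replace('+', '/', regex=False)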
Adding commands like the following reduces the number of different notations further, but this is far too manual.
mask = (df['Strength'].str.contains('Product1', case=False)
        & df['Strength'].str.contains('Product2', case=False))
df.loc[mask, 'Strength'] = '20mg/120mg'  # one .loc call avoids chained assignment
Do you have any approaches for reducing the number of unique notations in an efficient way?

Add a new column with a fixed label for each strength, train a suitable ML classifier on the labelled rows, and use it to predict the appropriate strength for each new item.
Whenever a new notation appears, assign it a label manually and retrain.
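Before labelling anything by hand, a rule-based pass may already collapse most of the notations shown in the question. The following is only a sketch of that alternative, not the classifier approach described above. It assumes that every strength is a list of numbers with an optional mg or ml unit, that a number without a unit shares the unit of its neighbour (or mg when none is given), and that trailing tokens such as 9s are noise. The name normalize_strength is hypothetical.

```python
import re

# A number that is not glued to a preceding letter or digit (so the "1" in
# "Product1" is skipped), followed either by a unit or by a non-letter
# (so the "9" in "9s" is skipped).
DOSE = re.compile(
    r'(?<![A-Za-z\d.])(\d+(?:\.\d+)?)(?!\d)\s*(?:(mg|ml)\b|(?![A-Za-z]))',
    re.IGNORECASE,
)

def normalize_strength(text, default_unit='mg'):
    """Rebuild a strength string as 'NUMBERunit/NUMBERunit' (hypothetical helper)."""
    parts = DOSE.findall(text)
    if not parts:
        # Nothing recognisable: fall back to the lightly cleaned original.
        return text.strip().lower()
    # A component without its own unit borrows the last unit found in the
    # string; if the string has no unit at all, default_unit is assumed.
    units = [unit.lower() for _, unit in parts if unit]
    fallback = units[-1] if units else default_unit
    return '/'.join(num + (unit.lower() or fallback) for num, unit in parts)

samples = ['20 mg / 120 mg', ' 40/320 mg', '150+750mg',
           'Product1 20 mg, Product2 120 mg', '40/320mg 9s', '100mg/1ml']
normalized = [normalize_strength(s) for s in samples]
print(sorted(set(normalized)))
```

On the data frame from the question this could be applied with df['Strength'].map(normalize_strength). Rows that the rules still get wrong are then the only ones that need manual labels.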

Related

How do I calculate an average of a range from a series within a dataframe?

I'm new to Python and working with data manipulation.
I have a dataframe
df3
Out[22]:
Breed Lifespan
0 New Guinea Singing Dog 18
1 Chihuahua 17
2 Toy Poodle 16
3 Jack Russell Terrier 16
4 Cockapoo 16
.. ... ...
201 Whippet 12--15
202 Wirehaired Pointing Griffon 12--14
203 Xoloitzcuintle 13
204 Yorkie--Poo 14
205 Yorkshire Terrier 14--16
As you observe above, some of the lifespans are in a range like 14--16. The datatype of [Lifespan] is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want it to reflect the average of these two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
Using split and expand=True
df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1)
                  )
df
# Breed Lifespan
# 0 Dog1 12.0
# 1 Dog2 14.5

Display mean and deviation values on grouped boxplot in Python

I want to display mean and standard deviation values above each of the boxplots in the grouped boxplot (see picture).
My code is
import pandas as pd
import seaborn as sns
from os.path import expanduser as ospath
df = pd.read_excel(ospath('~/Documents/Python/Kandidatspeciale/TestData.xlsx'),'Ark1')
bp = sns.boxplot(y='throw angle', x='incident angle',
                 data=df,
                 palette="colorblind",
                 hue='Bat type')
bp.set_title('Rubber Comparison',fontsize=15,fontweight='bold', y=1.06)
bp.set_ylabel('Throw Angle [degrees]',fontsize=11.5)
bp.set_xlabel('Incident Angle [degrees]',fontsize=11.5)
Where my dataframe, df, is
Bat type incident angle throw angle
0 euro 15 28.2
1 euro 15 27.5
2 euro 15 26.2
3 euro 15 27.7
4 euro 15 26.4
5 euro 15 29.0
6 euro 30 12.5
7 euro 30 14.7
8 euro 30 10.2
9 china 15 29.9
10 china 15 31.1
11 china 15 24.9
12 china 15 27.5
13 china 15 31.2
14 china 15 24.4
15 china 30 9.7
16 china 30 9.1
17 china 30 9.5
I tried with the following code. It needs to be independent of number of x (incident angles), for instance it should do the job for more angles of 45, 60 etc.
m=df.mean(axis=0) #Mean values
st=df.std(axis=0) #Standard deviation values
for i, line in enumerate(bp['medians']):
    x, y = line.get_xydata()[1]
    text = ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i])
    bp.annotate(text, xy=(x, y))
Can somebody help?
This question brought me here since I was also looking for a similar solution with seaborn.
After some trial and error, you just have to change the for loop to:
for i in range(len(m)):
    bp.annotate(
        ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i]),
        xy=(i, m[i]),
        horizontalalignment='center'
    )
This change worked for me (although I just wanted to print the actual median values). You can also add changes like the fontsize, color or style (i.e., weight) just by adding them as arguments in annotate.
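For the grouped plot in the question, df.mean(axis=0) yields one value per column rather than one per box, so the per-box statistics probably have to come from a groupby instead. The sketch below computes the label text and a position for every box. It rests on assumptions about seaborn's default layout that are worth checking against your version: numeric x categories are drawn in sorted order, hue levels in order of first appearance, and the boxes of one x category share a total width of 0.8. The helper name box_labels is hypothetical.

```python
import pandas as pd

def box_labels(df, x, y, hue, width=0.8):
    """Return (x_position, y_position, text) for every box of a grouped boxplot."""
    x_levels = sorted(df[x].unique())        # assumed order along the x axis
    hue_levels = list(df[hue].unique())      # assumed hue order: first appearance
    stats = df.groupby([x, hue])[y].agg(['mean', 'std', 'max'])
    box_width = width / len(hue_levels)      # assumed width of a single box
    labels = []
    for i, x_val in enumerate(x_levels):
        for j, hue_val in enumerate(hue_levels):
            if (x_val, hue_val) not in stats.index:
                continue                     # no data for this combination
            row = stats.loc[(x_val, hue_val)]
            # Centre of the j-th box inside the i-th x category.
            offset = (j - (len(hue_levels) - 1) / 2) * box_width
            text = 'μ={:.2f}\nσ={:.2f}'.format(row['mean'], row['std'])
            labels.append((i + offset, row['max'], text))
    return labels

# The data frame from the question.
df = pd.DataFrame({
    'Bat type': ['euro'] * 9 + ['china'] * 9,
    'incident angle': ([15] * 6 + [30] * 3) * 2,
    'throw angle': [28.2, 27.5, 26.2, 27.7, 26.4, 29.0, 12.5, 14.7, 10.2,
                    29.9, 31.1, 24.9, 27.5, 31.2, 24.4, 9.7, 9.1, 9.5],
})
labels = box_labels(df, x='incident angle', y='throw angle', hue='Bat type')
```

Each tuple can then be drawn with bp.annotate(text, xy=(x_pos, y_pos), ha='center', va='bottom'). Because the positions are derived from the data, the same loop covers additional incident angles such as 45 or 60.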

Transposing the columns and organizing unstructured csv files with pandas

I have many messy csv files and I am trying to extract information from them. There is a random number of unnecessary columns at the beginning of each file. However, the columns that I am interested in always have the same index. Let me explain through an example:
RandomInfo XX
Random2 ZZ
Random3 VV
Random4 KK
Companyname: Apple
VisitsMay ImpressionsMay VisitsApril ImpressionsApril...
Information
International 100 250 90 260
Local 10 22 12 26
With Proxy 5 12 8 16
I want to convert this to:
Companyname Month International Local With Proxy
Apple VisitsMay 100 10 5
Apple ImpressionsMay 250 22 12
Apple VisitsApril 90 12 8
Apple ImpressionsApril 260 26 16
Example files here
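One possible way to approach this, sketched under assumptions the question does not confirm: the cells are tab-separated, the line starting with Companyname: is followed directly by the month header line and then by the Information line, and every remaining line is a data row with integer values. The helper name tidy_report and the sample string are hypothetical.

```python
import pandas as pd

def tidy_report(text, sep='\t'):
    """Reshape one report (assumed layout) into one row per month column."""
    # Drop blank lines so the offsets used below are stable.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Everything before the "Companyname:" line is the random preamble.
    start = next(i for i, ln in enumerate(lines) if ln.startswith('Companyname:'))
    company = lines[start].split(':', 1)[1].strip()
    months = [c.strip() for c in lines[start + 1].split(sep) if c.strip()]
    data = {}
    for ln in lines[start + 3:]:             # start + 2 is the "Information" line
        cells = [c.strip() for c in ln.split(sep) if c.strip()]
        data[cells[0]] = [int(v) for v in cells[1:]]
    # One column per original row label, one row per month header.
    tidy = pd.DataFrame(data, index=pd.Index(months, name='Month')).reset_index()
    tidy.insert(0, 'Companyname', company)
    return tidy

sample = (
    "RandomInfo\tXX\n"
    "Random2\tZZ\n"
    "Companyname: Apple\n"
    "VisitsMay\tImpressionsMay\tVisitsApril\tImpressionsApril\n"
    "Information\n"
    "International\t100\t250\t90\t260\n"
    "Local\t10\t22\t12\t26\n"
    "With Proxy\t5\t12\t8\t16\n"
)
result = tidy_report(sample)
print(result)
```

For many files, the same function could be called on each file's contents and the results combined with pd.concat.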

.csv loading repeats all entries from one column in every cell

I am attempting to load a given csv file with the following structure:
Then, I'd like to join all the words with the same "Sent_ID" into one row, with the following code:
train = pd.read_csv("train.csv")
# Create a dataframe of sentences.
sentence_df = pd.DataFrame(train["Sent_ID"].drop_duplicates(), columns=["Sent_ID", "Sentence", "Target"])
for _, row in train.iterrows():
    print(str(row["Word"]))
    sentence_df.loc[sentence_df["Sent_ID"] == row["Sent_ID"], ["Sentence"]] = str(row["Word"])
However, the result of the print(str(row["Word"])) is:
Name: Word, Length: 4543833, dtype: object
0 Obesity
1 in
2 Low-
3 and
4 Middle-Income
5 Countries
...
i.e. every single word in the column, for any given row. This occurs for all rows.
Printing the entire row gives:
id 89
Doc_ID 1
Sent_ID 4
Word 0 Obesity\n1 ...
tag O
Name: 88, dtype: object
This again suggests that every element of the "Word" column is present in each cell. (The 88th entry is not "Obesity\n1" in the .csv file.)
I have tried changing the quoting argument in the read_csv function, as well as manually inserting the headers in the names argument, to no avail.
How do I ensure each Dataframe entry only contains its own word?
I've added a pastebin with some of the samples here (the pastebin will expire a week after this edit).
Building on @Aravind's answer, here is the working example the OP asked for:
from io import StringIO
import pandas as pd
csv = StringIO('''
<paste csv snippet here>
''')
df = pd.read_csv(csv)
# Print first 5 rows
print(df.head())
id Doc_ID Sent_ID Word tag
0 1 1 1 Obesity O
1 2 1 1 in O
2 3 1 1 Low- O
3 4 1 1 and O
4 5 1 1 Middle-Income O
Now that the data is loaded as a pandas.DataFrame, we can use the groupby method to combine the words into sentences.
df = df.groupby('Sent_ID').Word.apply(' '.join).reset_index()
print(df)
Sent_ID Word
0 1 Obesity in Low- and Middle-Income Countries : ...
1 2 We have reviewed the distinctive features of e...
2 3 Obesity is rising in every region of the world...
3 4 In LMICs , overweight is higher in women compa...
4 5 Overweight occurs alongside persistent burdens...
5 6 Changes in the global diet and physical activi...
6 7 Emerging risk factors include environmental co...
7 8 Data on effective strategies to prevent the on...
8 9 Expanding the research in this area is a key p...
9 10 MICROCEPHALIA VERA
10 11 Excellent reproducibility of laser speckle con...
11 12 We compared the inter-day reproducibility of p...
12 13 We also tested whether skin blood flow assessm...
13 14 Skin blood flow was evaluated during PORH and ...
14 15 Data are expressed as cutaneous vascular condu...
15 16 Reproducibility is expressed as within subject...
16 17 Twenty-eight healthy participants were enrolle...
17 18 The reproducibility of the PORH peak CVC was b...
18 19 Inter-day reproducibility of the LTH plateau w...
19 20 Finally , we observed significant correlation ...
20 21 The recently developed LSCI technique showed v...
21 22 Moreover , we showed significant correlation b...
22 23 However , more data are needed to evaluate the...
23 24 Positive inotropic action of cholinesterase on...
24 25 The putative chloride channel hCLCA2 has a sin...
25 26 Calcium-activated chloride channel ( CLCA ) pr...
26 27 Genetic and electrophysiological studies have ...
27 28 The human CLCA2 protein is expressed as a 943-...
28 29 Earlier investigations of transmembrane geomet...
29 30 However , analysis by the more recently derive...
Use groupby()
df = df.groupby('Sent_ID')['Word'].apply(' '.join).reset_index()
You can also group by multiple columns by passing them as a list, like so:
df.groupby(['Doc_ID','Sent_ID','tag'])
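As a minimal sketch of the multi-column case, the toy frame below uses made-up rows in which Sent_ID restarts in every document, so grouping by Sent_ID alone would merge sentences from different documents. The column names follow the question; the data and the Sentence column name are assumptions for illustration.

```python
import pandas as pd

# Toy data (not from the original file): Sent_ID restarts for each Doc_ID.
words = pd.DataFrame({
    'Doc_ID':  [1, 1, 1, 1, 2, 2],
    'Sent_ID': [1, 1, 2, 2, 1, 1],
    'Word':    ['Obesity', 'in', 'We', 'have', 'MICROCEPHALIA', 'VERA'],
})

# Group by both keys, then join the words of each group in their original order.
sentences = (words.groupby(['Doc_ID', 'Sent_ID'])['Word']
                  .agg(' '.join)
                  .reset_index(name='Sentence'))
print(sentences)
```

Including tag in the list of keys as well would split each sentence into one row per tag, which is only useful if that is the grouping you want.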

Find max of group return entire row pandas

Forgive me if this is a duplicate. It seems like it should be, but I've searched all the suggested questions and more.
I have this table
Look_Back_Months Total Spread Return sector
10 11 0.038961 Apartment
20 21 0.078029 Apartment
30 31 0.079272 Apartment
40 5 0.013499 Office
50 15 0.018679 Office
60 25 -0.003378 Office
I'd like to return
Look_Back_Months Total Spread Return sector
30 31 0.079272 Apartment
50 15 0.018679 Office
I have tried groupby and agg, but I keep getting either the maximum of both Look_Back_Months and Total Spread Return, or just one of the two.
Thanks
By using
df.sort_values('TotalSpreadReturn').drop_duplicates('sector',keep='last')
Out[270]:
Look_Back_Months TotalSpreadReturn sector
50 15 0.018679 Office
30 31 0.079272 Apartment
You can use groupby.max with transform.
g = df.groupby('sector')['TotalSpreadReturn'].transform('max')
res = df[df['TotalSpreadReturn'] == g]
print(res)
Look_Back_Months TotalSpreadReturn sector
30 31 0.079272 Apartment
50 15 0.018679 Office
If it matters, this includes duplicate maximums and maintains index order.
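The difference between the two answers shows up only when a sector has tied maxima. The sketch below rebuilds the table from the question and appends one made-up row (index 70) that ties the Office maximum. The idxmax variant at the end is an additional option not mentioned in the answers above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Look_Back_Months': [11, 21, 31, 5, 15, 25, 35],
        'TotalSpreadReturn': [0.038961, 0.078029, 0.079272,
                              0.013499, 0.018679, -0.003378, 0.018679],
        'sector': ['Apartment'] * 3 + ['Office'] * 4,
    },
    index=[10, 20, 30, 40, 50, 60, 70],  # row 70 is a hypothetical tie
)
col = 'TotalSpreadReturn'

# Sort and keep the last row per sector: exactly one row per sector.
one_per_sector = df.sort_values(col).drop_duplicates('sector', keep='last')

# Compare each row with its group maximum: tied rows are all kept.
with_ties = df[df[col] == df.groupby('sector')[col].transform('max')]

# idxmax gives the label of the first maximum in each group.
first_max = df.loc[df.groupby('sector')[col].idxmax()]

print(with_ties)
```

Here with_ties keeps both Office rows, while the other two results keep a single row per sector.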
