I've got an issue with Pandas not replacing certain bits of text correctly...
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").replace(" N/A", "Non")
Yet when I print, nothing has been replaced, as seen below by running print csvdata[-50:].head(50):
Pole KI DE Score STAT CTemp
4429 NaN NaN NaN 42 NaN Data N/A
4430 NaN NaN NaN 23.43 NaN Data (AMI)
4431 NaN NaN NaN 7.05 NaN Data (AMI)
4432 NaN NaN NaN 9.78 NaN Data
4433 NaN NaN NaN 169.68 NaN Data (AMI)
4434 NaN NaN NaN 26.29 NaN Data N/A
4435 NaN NaN NaN 83.11 NaN Data N/A
NOTE: The CSV is rather big so I have to use pandas.set_option('display.max_columns', 250) to be able to print the above.
Anyone know how I can make it replace those parts correctly in pandas?
EDIT: I've tried .str.replace("", "") and also just .replace("", "").
Example CSV:
No,CDPure,Blank
1,Data Test,
2,Test N/A,
3,Data N/A,
4,Test Data,
5,Bla,
5,Stack,
6,Over (AMI),
7,Flow (AMI),
8,Test (AMI),
9,Data,
10,Ryflex (AMI),
Example Code:
# Import pandas
import pandas
# Open csv (I have to keep it all as dtype object otherwise I can't do the rest of my script)
csvdata = pandas.read_csv('test.csv', dtype=object)
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").str.replace(" N/A", " Non")
# Print
print csvdata.head(11)
Output:
No CDPure Blank CTemp
0 1 Data Test NaN Data Test
1 2 Test N/A NaN Test Non
2 3 Data N/A NaN Data Non
3 4 Test Data NaN Test Data
4 5 Bla NaN Bla
5 5 Stack NaN Stack
6 6 Over (AMI) NaN Over (AMI)
7 7 Flow (AMI) NaN Flow (AMI)
8 8 Test (AMI) NaN Test (AMI)
9 9 Data NaN Data
10 10 Ryflex (AMI) NaN Ryflex (AMI)
str.replace interprets its argument as a regular expression, so you need to escape the parentheses: dcol.str.replace(r" \(AMI\)", "").str.replace(" N/A", " Non").
This does not appear to be adequately documented; the docs mention that split and replace "take regular expressions, too", but don't make it clear that they always interpret their argument as a regular expression.
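For reference, here is a minimal runnable sketch of the fix (with stand-in data; note that on pandas >= 2.0 str.replace defaults to regex=False, so passing regex=True makes the regex intent explicit):

import pandas as pd

df = pd.DataFrame({"CDPure": ["Test N/A", "Over (AMI)", "Data"]})

# Escape the parentheses so they match literally instead of opening a regex group.
df["CTemp"] = (df["CDPure"]
               .str.replace(r" \(AMI\)", "", regex=True)
               .str.replace(r" N/A", " Non", regex=True))

print(df)
#        CDPure     CTemp
# 0    Test N/A  Test Non
# 1  Over (AMI)      Over
# 2        Data      Data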
Related
I have this data, and I need to merge the two selected columns into the other row, because the duplicated rows come from my code.
So, how could I do this?
Here is a way to do what your question asks:
df[['State_new', 'Solution_new']] = df[['Power State', 'Recommended Solution']].shift()
mask = ~df['State_new'].isna()
df.loc[mask, 'State'] = df.loc[mask, 'State_new']
df.loc[mask, 'Recommended Solutuin'] = df.loc[mask, 'Solution_new']
df = df.drop(columns=['State_new', 'Solution_new', 'Power State', 'Recommended Solution'])[~df['State'].isna()].reset_index(drop=True)
Explanation:
create versions of the important data from your code shifted down by one row
create a boolean mask indicating which of these shifted rows are not empty
use this mask to overwrite the content of the State and Recommended Solutuin columns (NOTE: using original column labels verbatim from OP's question) with the updated data from your code contained in the shifted columns
drop the columns used to perform the update as they are no longer needed
use reset_index to create a new integer range index without gaps.
In case it's helpful, here is sample code to pull the dataframe in from Excel:
import pandas as pd
df = pd.read_excel('TestBook.xlsx', sheet_name='TestSheet', usecols='AD:AM')
Here's the input dataframe:
MAC RLC RLC 2 PDCCH Down PDCCH Uplink Unnamed: 34 Recommended Solutuin State Power State Recommended Solution
0 122.9822 7119.503 125.7017 1186.507 784.9464 NaN Downtitlt antenna serving cell is overshooting NaN NaN
1 4.1000 7119.503 24.0000 11.000 51.0000 NaN Downtitlt antenna serving cell is overshooting NaN NaN
2 121.8900 2127.740 101.3300 1621.000 822.0000 NaN uptilt antenna bad coverage NaN NaN
3 86.5800 2085.250 94.6400 1650.000 880.0000 NaN uptilt antenna bad coverage NaN NaN
4 64.7500 1873.540 63.8600 1259.000 841.0000 NaN uptilt antenna bad coverage NaN NaN
5 84.8700 1735.070 60.3800 1423.000 474.0000 NaN uptilt antenna bad coverage NaN NaN
6 49.3400 1276.190 59.9600 1372.000 450.0000 NaN uptilt antenna bad coverage NaN NaN
7 135.0200 2359.840 164.1300 1224.000 704.0000 NaN NaN NaN Bad Power Check hardware etc.
8 135.0200 2359.840 164.1300 1224.000 704.0000 NaN uptilt antenna bad coverage NaN NaN
9 163.7200 1893.940 90.0300 1244.000 753.0000 NaN NaN NaN Bad Power Check hardware etc.
10 163.7200 1893.940 90.0300 1244.000 753.0000 NaN uptilt antenna bad coverage NaN NaN
11 129.6400 1163.140 154.3200 663.000 798.0000 NaN NaN NaN Bad Power Check hardware etc.
12 129.6400 1163.140 154.3200 663.000 798.0000 NaN uptilt antenna bad coverage NaN NaN
Here is the sample output:
MAC RLC RLC 2 PDCCH Down PDCCH Uplink Unnamed: 34 Recommended Solutuin State
0 122.9822 7119.503 125.7017 1186.507 784.9464 NaN Downtitlt antenna serving cell is overshooting
1 4.1000 7119.503 24.0000 11.000 51.0000 NaN Downtitlt antenna serving cell is overshooting
2 121.8900 2127.740 101.3300 1621.000 822.0000 NaN uptilt antenna bad coverage
3 86.5800 2085.250 94.6400 1650.000 880.0000 NaN uptilt antenna bad coverage
4 64.7500 1873.540 63.8600 1259.000 841.0000 NaN uptilt antenna bad coverage
5 84.8700 1735.070 60.3800 1423.000 474.0000 NaN uptilt antenna bad coverage
6 49.3400 1276.190 59.9600 1372.000 450.0000 NaN uptilt antenna bad coverage
7 135.0200 2359.840 164.1300 1224.000 704.0000 NaN Check hardware etc. Bad Power
8 163.7200 1893.940 90.0300 1244.000 753.0000 NaN Check hardware etc. Bad Power
9 129.6400 1163.140 154.3200 663.000 798.0000 NaN Check hardware etc. Bad Power
You can use groupby on the shared columns to combine the rows:
df = pd.DataFrame(data)
# Group on the identifying columns so the duplicated rows collapse into one.
new_df = df.groupby(['MAC', 'RLC1', 'RLC2', 'POCCH', 'POCCH Up']).sum()
new_df = new_df.reset_index()
You can do something like:
import numpy as np

fill_cols = ['Power State', 'Recommended Solution 2']
dup_cols = ['MAC_UL', 'RLC_Through_1', 'RLC_Through_2', 'PDCCH Down', 'PDCCH Up']

# Flag every row that belongs to a duplicated group.
m = df.duplicated(subset=dup_cols, keep=False)
df_fill = df.loc[m, fill_cols]

# Treat empty strings as missing so ffill can fill them.
df_fill.loc[df_fill['Power State'] == '', 'Power State'] = np.nan
df_fill.loc[df_fill['Recommended Solution 2'] == '', 'Recommended Solution 2'] = np.nan

df.loc[m, fill_cols] = df_fill.ffill()
Get duplicated rows using duplicated
Fill empty values with NaN
Then use ffill
I have a pandas data frame in which one of the columns looks like this.
INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...
The information I want to extract into new columns is gene|ESNT12345. For the same example it should be
gene1 gene2 gene3
Na Na Na
BHAT12|ESNT12345 Na Na
JHV87|ESNT12345 HJJUB2|ESNT12345 Na
GFTREF|ESNT12345 321lkj|ESNT12345 16-YHGT|ESNT12345
How can I do this with pandas? I have been trying with .apply(lambda x: x.split("|")), but I don't know how many gene_name|ESNT12345 entries my dataset has, and this will be used in an application that will process thousands of different data frames, so I am looking for a way of dynamically creating the necessary columns.
How can I do this?
IIUC, you could use a regex and str.extractall.
joining to the original data:
new_df = df.join(
df['INFO']
.str.extractall(r'(\w+\|ESNT\d+)')[0]
.unstack(level='match')
.add_prefix('gene_')
)
output:
INFO gene_0 gene_1 gene_2
0 SVTYPE=CNV;END=401233 NaN NaN NaN
1 SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345| BHAT12|ESNT12345 NaN NaN
2 SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345| JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|... GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
without joining to the original data:
new_df = (df['INFO']
.str.extractall(r'(\w+\|ESNT\d+)')[0]
.unstack(level='match')
.add_prefix('gene_')
.reindex(df.index)
)
output:
match gene_0 gene_1 gene_2
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
regex hack to have gene1, gene2…
If you really want the gene counter to start at 1, you could use this small regex hack (match the beginning of the string as match 0 and drop it):
new_df = (df['INFO']
.str.extractall(r'(^|\w+\|ESNT\d+)')[0]
.unstack(level='match')
.iloc[:, 1:]
.add_prefix('gene')
.reindex(df.index)
)
output:
match gene1 gene2 gene3
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
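One caveat not covered in the answer above: \w does not match a hyphen, which is why 16-YHGT comes out as YHGT in these outputs. If hyphenated names must be kept, a character class that allows hyphens should capture them:

new_df = (df['INFO']
          .str.extractall(r'([\w-]+\|ESNT\d+)')[0]
          .unstack(level='match')
          .add_prefix('gene_')
          .reindex(df.index)
)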
I would like to check a list of orchids (input_df) for orchid species that are on one of six checklists. I import these checklists from an xlsx file with six sheets as a dictionary containing the six lists as DataFrames (orchid_checklists).
import pandas as pd
orchid_checklists = pd.read_excel('\\orchid_checklists.xlsx', sheet_name=None)
input_df = pd.read_excel('\\input.xlsx')
input_df['Orchideen-Checkliste'] = ''
With the following for loop and if condition I am trying to write the name of the checklist into the 'Orchideen-Checkliste' column, in the row corresponding to each item in input_df['Input Name'], to show which checklist one should refer to.
for item in input_df['Input Name']:
for list_name, sheet in orchid_checklists.items():
genus = item.split(' ')[0]
if genus in sheet['referenced'].values:
input_df['Orchideen-Checkliste'] = list_name
else:
pass
In my test input list there is one species called "Bulbophyllum pachyrachis" that should be found. Unfortunately the name of the list "CL_Bulbophyllum" is put into all rows, and I can't figure out why.
In the next step I want to check whether the species name is also in the "exceptions" column of any of my checklists. In that case, that list would not be the correct checklist; in these cases the full species name (e.g. "Aerangis ellisii", see CL_App_I and CL_III below) is found in the "referenced" column of another list.
I haven't started coding this exception because I am still stuck on the part before, but any pointers on how to approach this are highly welcome.
This is the input data:
Input Name Orchideen-Checkliste
0 Sobralia madisonii
1 Stelis cocornaensis
2 Stelis gelida
3 Braemia vittata
4 Brassia escobariana
5 Aspasia silvana
6 Bulbophyllum maximum
7 Bulbophyllum pachyrachis
8 Chondroscaphe amabilis
9 Dresslerella hispida
10 Elleanthus sodiroi
11 Maxillaria mathewsii
orchid_checklists:
CL_III
referenced exceptions
0 Aerangis Aerangis ellisii
1 Angraecum NaN
2 Ascocentrum NaN
3 Bletilla NaN
4 Brassavola NaN
5 Calanthe NaN
6 Catasetum NaN
7 Miltonia NaN
8 Miltoniopsis NaN
9 Renanthera NaN
10 Renantherella NaN
11 Rhynchostylis NaN
12 Rossioglossum NaN
13 Vanda NaN
14 Vandopsis NaN
CL_App_I
referenced exceptions
0 Paphiopedilum NaN
1 Phragmipedium NaN
2 Aerangis ellisii NaN
3 Cattleya jongheana NaN
4 Cattleya lobata NaN
5 Dendrobium cruentum NaN
6 Mexipedium xerophyticum NaN
7 Peristeria elata NaN
8 Renanthera imshootiana NaN
CL_Bulbophyllum
referenced exceptions
0 Acrochaene NaN
1 Bulbophyllum NaN
2 Chaseella NaN
3 Codonosiphon NaN
4 Drymoda NaN
5 Monomeria NaN
6 Monosepalum NaN
7 Pedilochilus NaN
8 Succoglossum NaN
9 Sunipia NaN
10 Trias NaN
Thank you in advance for your help!
input_df['Orchideen-Checkliste'] = list_name
Assigns a value to every item of that column because you did not specify a row indexer.
Without changing your process too much: enumerate the items in input_df['Input Name'] when iterating and use the enumeration to specify the row for the assignment.
for index,item in enumerate(input_df['Input Name']):
for list_name, sheet in orchid_checklists.items():
genus = item.split(' ')[0]
if genus in sheet['referenced'].values:
input_df.loc[index,'Orchideen-Checkliste'] = list_name
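For the second part of the question (the "exceptions" column, which the fix above does not address), here is one possible extension. It assumes, per the question, that a species listed in a sheet's "exceptions" column belongs to a different checklist, whose "referenced" column contains the full species name:

for index, item in enumerate(input_df['Input Name']):
    genus = item.split(' ')[0]
    for list_name, sheet in orchid_checklists.items():
        # A sheet that names the full species in "exceptions" is not
        # the correct checklist for it, so skip that sheet.
        if item in sheet['exceptions'].dropna().values:
            continue
        # Match either the genus or the full species name in "referenced".
        if genus in sheet['referenced'].values or item in sheet['referenced'].values:
            input_df.loc[index, 'Orchideen-Checkliste'] = list_name
            break

Note that the break stops at the first matching sheet, so the result depends on sheet order if a species could match more than one checklist.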
I want to extract dataframe from HTML using URL.
The page contains 59 table/dataframe.
I want to extract 1 particular table which can be identified by its ID "ctl00_Menu1"
Following is my trial, which gives an error.
import pandas as pd
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S12",attrs = {'id': 'ctl00_Menu1'})
As I am at a very early stage with Python there may be a simple solution, but I am unable to find it. I'd appreciate any help.
I would look at how the URL passes parameters and probably try to read a dataframe directly from it. I'm not sure whether you are trying to develop a function, a script, or just practicing.
If you do (notice the 58 at the end of the url)
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258",attrs = {'id':
'ctl00_Menu1'})
It works and gives you table 59.
[ 0 1 2 \
0 Partywise Partywise NaN
1 Partywise NaN NaN
2 Constituencywise-All Candidates NaN NaN
3 Constituencywise Trends NaN NaN
3 4 5 \
0 Constituencywise-All Candidates Constituencywise-All Candidates NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
6 7
0 Constituencywise Trends Constituencywise Trends
1 NaN NaN
2 NaN NaN
3 NaN NaN ]
Unsure if that's the table you want to extract, but most of the time it's easier to pass it as a URL parameter. If you try it without the 58 it works too; I believe the 'ElectionResult' argument might not be a table identifier, which is why you can't find any tables with that name.
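More generally (a sketch; the URL above may no longer resolve), pd.read_html returns a list of DataFrames, one per HTML table it finds, so the wanted table is picked by index or narrowed down with attrs:

import pandas as pd

url = "http://eciresults.nic.in/statewiseS12.htm?st=S12"

# read_html parses every <table> on the page into a list of DataFrames.
tables = pd.read_html(url)
print(len(tables))  # 59 tables, per the question

# Narrowing by an HTML attribute also returns a list, usually of length 1.
menu = pd.read_html(url, attrs={'id': 'ctl00_Menu1'})[0]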
Based on this post on Stack Overflow I tried the value_counts function like this:
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
and it works fine, apart from the fact that although my data has 22 unique genres, after the split I get 42 values, which of course are not all unique.
Data example:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
(I have pasted the header and the first row only)
I have a feeling that the problem is caused by my original data. My column (genres) was a list of lists, which contained brackets,
for example: [Action,Indie]
so when Python read it, it would treat "[Action", "Action" and "Action]" as different values, and the output was 303 different values.
So what I did is this:
new = []
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        i = i[1:-1]
        new.append(i)
    else:
        new.append('nan')
You have to remove the first and last [] from the column genres with the function str.strip, and then replace spaces with an empty string using str.replace:
import pandas as pd
df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")
df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')
df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))
# temporarily display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
print df
# output removed for clarity
print df.columns
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
u'initial_price', u'is_free', u'metacritic', u'release_date',
u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
u'WebPublishing'],
dtype='object')
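For what it's worth, on current pandas the same one-hot encoding can be done in a single step with str.get_dummies (a sketch with stand-in data; applying pd.value_counts row by row is deprecated in recent pandas versions):

import pandas as pd

df = pd.DataFrame({'genres': ['[Action, Indie]', '[Casual]', None]})

# Strip the brackets and spaces as above, then let str.get_dummies split
# on the separator and build the indicator columns in one go.
genres = df['genres'].str.strip('[]').str.replace(' ', '')
df = df.join(genres.str.get_dummies(sep=','))
print(df)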