Deleting part of a value in dataframe cells [closed] - python
I want to convert my dataset's columns to a numeric type, but I can't because some values contain two dots. I'm calling df.apply(pd.to_numeric). The error I get is as follows:
ValueError: Unable to parse string "1.232.2" at position 1
My dataset looks like this:
Price Value
1.232.2 1.235.3
2.345.2 1.234.2
3.343.5 5.433.3
I need to remove the first dot. For example:
Price Value
1232.2 1235.3
2345.2 1234.2
3343.5 5433.3
I'm waiting for help. Thank you.
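For reference, a minimal snippet that reproduces the error (the column names come from the data above; the exact construction is an assumption):

import pandas as pd

df = pd.DataFrame({'Price': ['1.232.2', '2.345.2', '3.343.5'],
                   'Value': ['1.235.3', '1.234.2', '5.433.3']})

df = df.apply(pd.to_numeric)  # raises ValueError because of the extra dots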
Here's a way to do this.
Convert string to float format (multiple dots to single dot)
You can use a regex to solve this.
regex expression: '\.(?=.*\.)'
Explanation:
\. matches a literal dot.
(?=.*\.) is a lookahead that only matches if there is another dot later in the string, so every dot except the last one is matched.
Each match is replaced with ''.
The code for this is:
df['Price'] = df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True)
df['Value'] = df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True)
If you also want to convert the result to numeric, you can do it directly:
df['Price'] = pd.to_numeric(df['Price'].str.replace(r'\.(?=.*\.)', '', regex=True))
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\.(?=.*\.)', '', regex=True))
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 1232.2 1235.3
1 2345.2 1234.2
2 3343.5 5433.3
3 123.45 45625.5
4 0.825 00.0
5 000.2 55.5
6 1234 4567
7 NaN NaN
The pd.to_numeric() version of the solution will look like this:
After Cleansing DataFrame:
Note: pandas displays the Price column with three decimal places because one of its values has three decimal places.
Price Value
0 1232.200 1235.3
1 2345.200 1234.2
2 3343.500 5433.3
3 123.450 45625.5
4 0.825 0.0
5 0.200 55.5
6 1234.000 4567.0
7 NaN NaN
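A compact variant (not part of the original answer, so treat it as a sketch): apply the same regex to both columns in one pass and coerce anything that still fails to parse to NaN. The column names are the ones from the question.

cols = ['Price', 'Value']
df[cols] = df[cols].apply(
    lambda col: pd.to_numeric(col.str.replace(r'\.(?=.*\.)', '', regex=True),
                              errors='coerce'))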
Discard data if more than one period (.) in data
If you want to process all the columns in the dataframe, you can use applymap(); if you only want to process a specific column, use apply(). Also use pd.isnull() to check whether a value is NaN so you can skip processing it.
The code below handles NaNs, numbers without decimal places, numbers with one period, and numbers with multiple periods. It assumes the data in the columns are either NaNs or strings made up of digits and periods, with no alphabetic or other non-digit characters (apart from the dots). If you need the code to check for digits only, let me know.
The code also assumes that you want to discard the leading segments. If you instead want to keep all the digits, a different solution is needed (for example, 1.2345.67 becomes 2345.67 and the leading 1 is discarded; 1.2.3.4.5 becomes 4.5 and 1.2.3 is discarded). If this is NOT what you want, the code needs to change; see the sketch after the apply() example below.
You can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Price': ['1.232.2', '2.345.2', '3.343.5', '123.45', '0.825', '0.0.0.2', '1234', np.nan],
                   'Value': ['1.235.3', '1.234.2', '5.433.3', '456.25.5', '0.0.0', '5.5.5', '4567', np.nan]})
print(df)

def remove_dots(x):
    # keep NaN as-is; otherwise keep only the last two dot-separated segments
    return x if pd.isnull(x) else '.'.join(x.rsplit('.', 2)[-2:])

df = df.applymap(remove_dots)
print(df)
The output of this will be:
Before Cleansing DataFrame:
Price Value
0 1.232.2 1.235.3
1 2.345.2 1.234.2
2 3.343.5 5.433.3
3 123.45 456.25.5
4 0.825 0.0.0
5 0.0.0.2 5.5.5
6 1234 4567
7 NaN NaN
After Cleansing DataFrame:
Price Value
0 232.2 235.3
1 345.2 234.2
2 343.5 433.3
3 123.45 25.5
4 0.825 0.0
5 0.2 5.5
6 1234 4567
7 NaN NaN
If you want to change specific columns only, then you can use apply.
df['Price'] = df['Price'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
df['Value'] = df['Value'].apply(lambda x: x if pd.isnull(x) else '.'.join(x.rsplit('.',2)[-2:]))
print(df)
The before and after output will be the same as shown above for applymap().
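As noted earlier, rsplit discards the leading segments. If you would rather keep all of the digits (matching the behaviour of the regex approach in the first section), here is a minimal sketch under the same assumptions (values are NaN or strings of digits and dots):

def keep_all_digits(x):
    if pd.isnull(x):
        return x
    head, sep, tail = x.rpartition('.')           # split at the last dot
    return head.replace('.', '') + sep + tail     # drop any remaining dots from the head

df = df.applymap(keep_all_digits)

With this, 1.2345.67 becomes 12345.67 and 1.2.3.4.5 becomes 1234.5.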
Related
How can I change the value of a row with indexing?
I've scraped the crypto.com website to get the current prices of crypto coins in DataFrame form. It worked perfectly with pandas, but the 'Price' values are mixed. Here's the output:

    Name              Price                      24H CHANGE
0   BBitcoinBTC       16.678,36$16.678,36+0,32%  +0,32%
1   EEthereumETH      $1.230,40$1.230,40+0,52%   +0,52%
2   UTetherUSDT       $1,02$1,02-0,01%           -0,01%
3   BBNBBNB           $315,46$315,46-0,64%       -0,64%
4   UUSD CoinUSDC     $1,00$1,00+0,00%           +0,00%
5   BBinance USDBUSD  $1,00$1,00+0,00%           +0,00%
6   XXRPXRP           $0,4067$0,4067-0,13%       -0,13%
7   DDogecoinDOGE     $0,1052$0,1052+13,73%      +13,73%
8   ACardanoADA       $0,3232$0,3232+0,98%       +0,98%
9   MPolygonMATIC     $0,8727$0,8727+1,20%       +1,20%
10  DPolkadotDOT      $5,48$5,48+0,79%           +0,79%

I created a regex to filter the mixed data:

import re

pattern = re.compile(r'(\$.*)(\$)')
for value in df['Price']:
    value = pattern.search(value)
    print(value.group(1))

output:

$16.684,53
$1.230,25
$1,02
$315,56
$1,00
$1,00
$0,4078
$0,105
$0,3236
$0,8733

but I couldn't find a way to change the values. Which is the best way to do it? Thanks.
If your regex expression is good, this would work:

df['Price'] = df['Price'].apply(lambda x: pattern.search(x).group(1))
Can you try this?

df['price_v2'] = df['Price'].apply(lambda x: '$' + x.split('$')[1])
'''
0     $16.678,36+0,32%
1            $1.230,40
2                $1,02
3              $315,46
4                $1,00
5                $1,00
6              $0,4067
7              $0,1052
8              $0,3232
9              $0,8727
10               $5,48
Name: price, dtype: object
'''

Also, BTC looks different from the others. Is this a typo you made, or is this the response from the API? If there are pairs that look like the BTC row, we can add an if/else block to the code:

df['price'] = df['Price'].apply(lambda x: '$' + x.split('$')[1] if x.startswith('$') else '$' + x.split('$')[0])
'''
0     $16.678,36
1      $1.230,40
2          $1,02
3        $315,46
4          $1,00
5          $1,00
6        $0,4067
7        $0,1052
8        $0,3232
9        $0,8727
10         $5,48
'''

Detail:

string = '$1,02$1,02-0,01%'
values = string.split('$')   # output --> ['', '1,02', '1,02-0,01%']
final_value = values[1]      # we need only the price; that's why I take the second element and apply this to the whole dataframe
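Not part of either answer above, just a possible follow-up sketch: once the '$...' piece is isolated (here the cleaned column from the if/else version, called price), you can also derive a numeric column. This assumes European number formatting, with '.' as the thousands separator and ',' as the decimal separator.

df['price_num'] = (df['price']
                   .str.lstrip('$')
                   .str.replace('.', '', regex=False)   # drop thousands separators
                   .str.replace(',', '.', regex=False)  # comma -> decimal point
                   .astype(float))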
Preserving id columns in dataframe after applying assign and groupby
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:

PregnancyID  MotherID  gestationalAgeInWeeks  abdomCirc
0            0         14                     150
0            0         21                     200
1            1         20                     294
1            1         25                     315
1            1         30                     350
2            2         8                      170
2            2         9                      180
2            2         18                     NaN

Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):

(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)

This results in the following output:

tm                        1      2      3
MotherID  PregnancyID
0         0             NaN  200.0    NaN
1         1             NaN  294.0  350.0
2         2           180.0    NaN    NaN

However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1, 2 and 3. The id columns only appear when running df.head(), as can be seen in the dataframe above. I need to preserve the id columns because I want to use them to merge this dataframe with another one using the ids. Therefore, my question is: how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:

(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   # .drop(columns = 'gestationalAgeInWeeks')  # don't need this
   .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc']  # change here
   .max().add_prefix('abdomCirc_')  # here
   .unstack()
   .reset_index()  # and here
)

Or a more friendly version with pivot_table:

(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .pivot_table(index=['MotherID', 'PregnancyID'],
                columns='tm',
                values='abdomCirc',
                aggfunc='max')
   .add_prefix('abdomCirc_')  # remove this if you don't want the prefix
   .reset_index()
)

Output:

tm      MotherID     PregnancyID  abdomCirc_1  abdomCirc_2  abdomCirc_3
0    abdomCirc_0     abdomCirc_0          NaN        200.0          NaN
1    abdomCirc_1     abdomCirc_1          NaN        315.0        350.0
2    abdomCirc_2     abdomCirc_2        180.0          NaN          NaN
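As a follow-up to the asker's stated goal of merging on the ids (this is not part of the original answer; summary and other_df are hypothetical names): once MotherID and PregnancyID are ordinary columns again, the merge is straightforward.

summary = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
             .pivot_table(index=['MotherID', 'PregnancyID'],
                          columns='tm', values='abdomCirc', aggfunc='max')
             .add_prefix('abdomCirc_')
             .reset_index())

# other_df is whatever second dataframe holds the same MotherID / PregnancyID keys
merged = summary.merge(other_df, on=['MotherID', 'PregnancyID'], how='left')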
Use the .map() function depending on another column's value when there are more than two conditions [duplicate]
This question already has an answer here: pandas map column data based on value from another column using if to determine which dict to use (1 answer). Closed 4 years ago.

I have the following dataframe:

  gene sample          SQ assay
0  FAM  coop1  842.400000   SIQ
1  FAM      2         NaN   SIQ
2  HEX      2         NaN   EEK
3  FAM      3         NaN   SIQ
4  HEX      3    6.225000   TSI

I want to replace the values under gene according to these dictionaries:

SIQ_map = {'FAM':'qnrS','Texas Red':'IC','HEX':"Sul1"}
TSI_map = {'FAM':'blaSHV','Texas Red':'Int1','HEX':'TetB'}
MOA_map = {'FAM':'blaOXA','Texas Red':'Aph3a','HEX':'MecA'}
EEK_map = {'FAM':'','Texas Red':'','HEX':'blaKPC'}
BAM_map = {'FAM':'TetM','Texas Red':'VanB','HEX':'VanA'}

I've used the .map() function in dataframes with only one type of assay, but how can I choose a different dictionary for mapping depending on the assay value when there is more than one? What I want is this output:

  gene sample          SQ assay
0 qnrS  coop1  842.400000   SIQ
1 qnrS      2         NaN   SIQ
2             2        NaN   EEK
3 qnrS      3         NaN   SIQ
4 TetB      3    6.225000   TSI

I saw np.where() used in another question, but that seems useful only for cases with two conditions. In my case I have 5 conditions (SIQ, TSI, MOA, EEK and BAM). How would I get the desired output in that case?
I'd recommend using a dictionary of dictionaries.

assay_map = {'SIQ': {'FAM':'qnrS','Texas Red':'IC','HEX':"Sul1"},
             'TSI': {'FAM':'blaSHV','Texas Red':'Int1','HEX':'TetB'},
             'MOA': {'FAM':'blaOXA','Texas Red':'Aph3a','HEX':'MecA'},
             'EEK': {'FAM':'','Texas Red':'','HEX':'blaKPC'},
             'BAM': {'FAM':'TetM','Texas Red':'VanB','HEX':'VanA'}}

This way you can address any gene based on the assay. Now we can map the data:

df['New Column'] = [a[g] for g, a in zip(df['gene'], df['assay'].map(assay_map))]
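A self-contained sketch of the same idea, using the assay_map defined above (the sample frame is reconstructed from the question, and writing the result back into the gene column is an assumption about what the asker wants):

import numpy as np
import pandas as pd

df = pd.DataFrame({'gene': ['FAM', 'FAM', 'HEX', 'FAM', 'HEX'],
                   'sample': ['coop1', '2', '2', '3', '3'],
                   'SQ': [842.4, np.nan, np.nan, np.nan, 6.225],
                   'assay': ['SIQ', 'SIQ', 'EEK', 'SIQ', 'TSI']})

# look up the per-assay dictionary first, then the gene within it
df['gene'] = [assay_map[a][g] for g, a in zip(df['gene'], df['assay'])]
print(df)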
round pandas column with precision but no trailing 0
Not a duplicate, because I'm asking about pandas round(). I have a dataframe with some columns containing numbers. I run:

df = df.round(decimals=6)

That successfully truncated the long decimals: instead of 15.36785699998 it correctly writes 15.367857, but I still get 1.0 or 16754.0 with a trailing zero. How do I get rid of the trailing zeros in all the columns once I have run pandas df.round()? I want to save the dataframe as a csv, and need the data to show the way I wish.
df = df.round(decimals=6).astype(object)

Converting to object will allow mixed representations. But keep in mind that this is not very useful from a performance standpoint.

df

           A         B
0   0.149724 -0.770352
1   0.606370 -1.194557
2  10.000000 10.000000
3  10.000000 10.000000
4   0.843729 -1.571638
5  -0.427478 -2.028506
6  -0.583209  1.114279
7  -0.437896  0.929367
8  -1.025460  1.156107
9   0.535074  1.085753

df.round(6).astype(object)

           A         B
0   0.149724 -0.770352
1    0.60637  -1.19456
2         10        10
3         10        10
4   0.843729  -1.57164
5  -0.427478  -2.02851
6  -0.583209   1.11428
7  -0.437896  0.929367
8   -1.02546   1.15611
9   0.535074   1.08575
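Since the end goal is a CSV, another option (not from the original answer, so treat it as a sketch) is to keep the columns numeric and only strip the trailing zeros when writing. The file name here is hypothetical.

# float_format as a callable requires a reasonably recent pandas;
# otherwise pre-format the columns as strings before writing
df.round(6).to_csv('output.csv',
                   index=False,
                   float_format=lambda x: f'{x:.6f}'.rstrip('0').rstrip('.'))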
Finding the percent change of values in a Series
I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.

In [19]: status
Out[19]:
   seconds  questions
0   751479    9005591
1   751539    9207129
2   751599    9208994
3   751659    9210429
4   751719    9211944
5   751779    9213287
6   751839    9214916
7   751899    9215924
8   751959    9216676
9   752019    9217533

I need the change in percent of the 'questions' column and then to sort on it. This does not work:

status.pct_change('questions').sort('questions').head()

Any suggestions?
Try this way instead:

>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
   questions  seconds    change
0    9005591   751479       NaN
1    9207129   751539  0.022379
2    9208994   751599  0.000203
6    9214916   751839  0.000177
4    9211944   751719  0.000164
3    9210429   751659  0.000156
5    9213287   751779  0.000146
7    9215924   751899  0.000109
9    9217533   752019  0.000093
8    9216676   751959  0.000082

pct_change can be performed on Series as well as DataFrames, and it accepts an integer argument for the number of periods you want to calculate the change over (the default is 1). I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first.
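For example (a small illustrative sketch, not from the original answer), the change over 3 rows instead of 1:

status['change_3'] = status['questions'].pct_change(periods=3)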