See if the values in a column contain % in a pandas dataframe - python

I have a dataframe that has columns whose values contain % (literal percentage sign). I am trying to create a function to automatically convert these values to a decimal.
For example, with the below dataframe:
var1 var2 var3 var4
id
0 1.4515 1.52% -0.5709 4%
1 1.57 1.605% -0.012 8%
2 1.69253 1.657% -0.754 9%
3 1.66331 1.686% -0.0012 5%
4 1.739 1.716% -0.04 12%
5 1.7447 1.61% -0.0023 11%
def pct_to_dec(df):
    for col in df:
        print(col)
        if '%' in df[col].astype(str):
            print(col)
            df[col] = df[col].replace({'%': ''}, regex=True)
            df[col] = df[col] / 100
The function should print var2 and var4, and convert the values in both columns to decimal format. Through troubleshooting I have found that Python is not seeing the percentage characters, since when I run this code:
df.isin(['%'])
It prints a dataframe of "False".
Lastly, I have tried to see if I'm using the wrong escape character. I've tried %%, /%, and \%.
I am interested in seeing if I am on the right track, as well as if there is a simpler way to do what I'm trying to do.

You can also use .str.endswith, as in the following example:
for col in df.select_dtypes('object'):
    indexer_percent = df[col].str.endswith('%')
    df.loc[indexer_percent, col] = df.loc[indexer_percent, col].str.strip('%')
    df[col] = df[col].astype('float32')
    df.loc[indexer_percent, col] /= 100.0
On your data, this results in:
var1 var2 var3 var4
id
0 1.45150 0.01520 -0.5709 0.04
1 1.57000 0.01605 -0.0120 0.08
2 1.69253 0.01657 -0.7540 0.09
3 1.66331 0.01686 -0.0012 0.05
4 1.73900 0.01716 -0.0400 0.12
5 1.74470 0.01610 -0.0023 0.11
The data is created by:
import pandas as pd
import io
infile = io.StringIO(
"""id var1 var2 var3 var4
0 1.4515 1.52% -0.5709 4%
1 1.57 1.605% -0.012 8%
2 1.69253 1.657% -0.754 9%
3 1.66331 1.686% -0.0012 5%
4 1.739 1.716% -0.04 12%
5 1.7447 1.61% -0.0023 11%"""
)
df = pd.read_csv(infile, index_col=0, sep=r'\s+')

You can easily check this using the Series method .str.contains.
It lets you check which rows of a Series contain the string you passed. For example, if you run this code:
df['var2'].str.contains('%')
You'll get a Series back in which every row is True. So you just need to loop over it, get the indices of the rows with True values, and do whatever you want with them.
Note that if your rows aren't str type you'll get NaN back, so be aware of the types of your columns.
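Looping isn't strictly necessary, though: the boolean Series can be used directly as a mask. A minimal sketch of that idea (the column name is illustrative, not from the original function):
mask = df['var2'].str.contains('%', na=False)  # na=False turns non-string rows into False instead of NaN
df.loc[mask, 'var2'] = df.loc[mask, 'var2'].str.rstrip('%').astype(float) / 100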

Related

How do you add the value for a certain column from a previous row to your current row in Python Pandas? [duplicate]

In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a dataframe full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas

url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values('Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
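diff() is really just shorthand for subtracting a shifted copy of the column, which also answers the "reference the previous row" part of the question directly. A sketch of the explicit version, using the same frame:
# Line each row up with the previous one via shift(1)
data['Close_diff'] = data['Close'] - data['Close'].shift(1)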
To calculate the difference for one column only, here is what you can do.
df =
   A   B
0  10  56
1  45  48
2  26  48
3  32  65
We want to compute the row difference in A only, and then keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df =
   A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif'] < 15]
df =
   A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
(Note the NaN row drops out too, since NaN < 15 evaluates to False.)
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, that might be of some help even if you need to use pandas:
import csv
import urllib

# This basically retrieves the CSV file and loads it into a list, converting
# all numeric values to floats
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
reader = csv.reader(urllib.urlopen(url), delimiter=',')
# We sort the output list so the records are ordered by date
cleaned = sorted([[r[0]] + map(float, r[1:]) for r in list(reader)[1:]])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<id>, <item>)
    if i == 0:
        continue  # row 0 has no previous row (cleaned[-1] would silently wrap around)
    # This will calculate the difference of each numeric field with the same
    # field in the row before this one
    print row[0], [(row[j] - cleaned[i-1][j]) for j in range(1, 7)]
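The code above is Python 2. A rough Python 3 equivalent might look like this (the ichart endpoint has since been retired, so treat the URL as illustrative):
import csv
import io
import urllib.request

url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
text = urllib.request.urlopen(url).read().decode('utf-8')
reader = csv.reader(io.StringIO(text), delimiter=',')
# Skip the header row, convert the numeric fields to float, and sort by date
cleaned = sorted([row[0]] + [float(v) for v in row[1:]] for row in list(reader)[1:])
for i, row in enumerate(cleaned):
    if i == 0:
        continue  # the first row has no previous row to diff against
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, 7)])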

fillna() not allowing floating values

I'm testing a simple imputation method on the side using a copy of my dataset. I'm essentially trying to impute missing values with categorical means grouped by the target variable.
df_test_2 = train_df.loc[:, ['Survived','Age']].copy()  # copy of dataset for testing

# creating impute function
def impute(df, variable):
    if 'Survived' == 0:
        df[variable] = df[variable].fillna(30.7)
    else:
        df[variable] = df[variable].fillna(28.3)

# imputing
impute(df_test_2, 'Age')
The imputation runs successfully, but the values added are 30 and 28 instead of 30.7 and 28.3.
'Age' is float64.
Thank you
Edit: I simply copied the old code for calling the function here and have corrected it now. It wasn't the issue in my original code; the problem persists.
Have a look at this to see what may be going on
To test it I set up a simple case
import pandas as pd
import numpy as np
data = {'Survived' : [0,1,1,0,0,1], 'Age' :[12.2,45.4,np.nan,np.nan,64.3,44.3]}
df = pd.DataFrame(data)
df
This got the data set
Survived Age
0 0 12.2
1 1 45.4
2 1 NaN
3 0 NaN
4 0 64.3
5 1 44.3
I ran your function exactly
def impute(df, variable):
    if 'Survived' == 0:
        df[variable] = df[variable].fillna(30.7)
    else:
        df[variable] = df[variable].fillna(28.3)
and this yielded this result
Survived Age
0 0 12.2
1 1 45.4
2 1 28.3
3 0 28.3
4 0 64.3
5 1 44.3
As you can see, at index 3 the Age got filled with the wrong value. The problem is the condition 'Survived' == 0: it is always False, because it compares the literal string 'Survived' against 0 rather than checking the column's values.
What you may want is
df2 = df[df['Survived'] == 0].fillna(30.7)
df3 = df[df['Survived'] == 1].fillna(28.3)
dfout = pd.concat([df2, df3])
and the output is
Survived Age
0 0 12.2
3 0 30.7
4 0 64.3
1 1 45.4
2 1 28.3
5 1 44.3
Anish
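Note that filtering and re-concatenating fills NaNs in every column, not just Age, and changes the row order. If you only want to touch Age and keep the original order, a masked assignment is an alternative; a sketch with the same fill values:
mask = df['Survived'] == 0
df.loc[mask, 'Age'] = df.loc[mask, 'Age'].fillna(30.7)
df.loc[~mask, 'Age'] = df.loc[~mask, 'Age'].fillna(28.3)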
I think it is better to use the apply() method available in pandas. This method applies a custom function over a dataframe (row-wise or column-wise).
Here is a related post: Stack Question
Pandas documentation: Doc Apply df
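If the goal is to fill each group's NaNs with that group's mean, a groupby/transform pattern does it in one line. A sketch, assuming 30.7 and 28.3 are in fact the per-group means of Age:
# Fill each missing Age with the mean Age of its Survived group
df['Age'] = df['Age'].fillna(df.groupby('Survived')['Age'].transform('mean'))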
regards,

Python Pandas: How to update values for other column in groupby?

I have a dataframe with time series.
meter date value
0 1002 19501 0.362
1 1002 19502 0.064
2 1002 19503 0.119
3 1002 19504 0.023
4 1002 19505 0.140
Now I need to change the date to numeric order (1, 2, 3, etc. until 336) for each unique value in meter. There are 336 rows for each unique meter value, so that shouldn't be too difficult, but I am stuck on getting the right result here.
I tried the following:
def change_timestamp(df):
    timestamp_uniform = [i for i in range(1, 337)]
    timestamp = pd.Series(data=timestamp_uniform)
    df.date = timestamp.values
    return df.date

by_meter = meters_weekly.groupby('meter')
by_meter.apply(change_timestamp)
but the output was just dates repeated.
Any ideas on how to fix that?
You could try something like this. Note that cumcount numbers the rows within each group starting at 0, so add 1 to get 1 through 336:
df['new_date'] = df.groupby('meter').cumcount() + 1
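A quick illustration on made-up data (the second meter id here is hypothetical):
import pandas as pd

df = pd.DataFrame({'meter': [1002, 1002, 1002, 1003, 1003],
                   'date':  [19501, 19502, 19503, 19501, 19502]})
df['new_date'] = df.groupby('meter').cumcount() + 1
print(df)  # meter 1002 is numbered 1..3, meter 1003 is numbered 1..2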

how to apply filter condition in percentage string column using pandas?

I am working on the df below but am unable to apply a filter on the percentage field, even though it works fine in Excel.
I need to apply the filter condition > 100.00% to that particular field using pandas.
I tried reading the data from HTML, CSV, and Excel into pandas, but was unable to use the condition.
It requires a float conversion, but that doesn't work with the given data.
I am assuming that the values you have are read as strings in Pandas:
import pandas as pd

data = ['4,700.00%', '3,900.00%', '1,500.00%', '1,400.00%', '1,200.00%', '0.15%', '0.13%', '0.12%', '0.10%', '0.08%', '0.07%']
df = pd.DataFrame(data)
df.columns = ['data']
printing the df:
data
0 4,700.00%
1 3,900.00%
2 1,500.00%
3 1,400.00%
4 1,200.00%
5 0.15%
6 0.13%
7 0.12%
8 0.10%
9 0.08%
10 0.07%
then:
df['data'] = df['data'].str.rstrip('%').str.replace(',','').astype('float')
df_filtered = df[df['data'] > 100]
Results:
data
0 4700.0
1 3900.0
2 1500.0
3 1400.0
4 1200.0
I have used the code below as well: .str.rstrip('%') and .str.replace(',','').astype('float'). It is working fine.

Using python print max and min values and date associated with the max and min values

I am new to programming and am trying to write a program that evaluates and prints the max AVE_SPEED value and the date associated with that value from a CSV file.
This would be an example of the file data set:
STATION DATE AVE_SPEED
0 US68 2018-03-22 0.00
1 US68 2018-03-23 0.00
2 US68 2018-03-24 0.00
3 US68 2018-03-26 0.24
4 US68 2018-03-27 2.28
5 US68 2018-03-28 0.21
6 US10 2018-03-29 0.04
7 US10 2018-03-30 0.00
8 US10 2018-03-31 0.00
9 US10 2018-04-01 0.00
10 US10 2018-04-02 0.02
This is what I have come up with so far but it just prints the entire set at the end.
import pandas as pd

df = pd.read_csv(r'data_01.csv')
max1 = df['AVE_SPEED'].max()
print('Max Speed in MPH: ' + str(max1))
groupby_max1 = df.groupby(['DATE']).max()
print('Maximum Average Speed Value and Date of Occurrence: ' + str(groupby_max1))
Your initial average speed max is correct in pandas.
To find the corresponding date, I would do the following:
import pandas as pd

df = pd.read_csv(r'data_01.csv')
max1 = df['AVE_SPEED'].max()
print('Max Speed in MPH: ' + str(max1))
date_of_max = df[df['AVE_SPEED'] == max1]['DATE'].values[0]
Effectively, you're creating another dataframe where any "AVE_SPEED" must equal the max speed (it should be a single row unless there are multiple instances of the same max speed). From there, you return the 'DATE' value of that dataframe/row.
You can then print/return the max velocity and corresponding date as needed.
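As a possible shortcut, idxmax returns the row label of the maximum, which avoids building the intermediate dataframe. A sketch on the same frame:
# Look up the DATE on the row where AVE_SPEED is largest
date_of_max = df.loc[df['AVE_SPEED'].idxmax(), 'DATE']
print('Date of max speed:', date_of_max)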
I would like to suggest a non-pandas approach to this as a lot of new programmers focus on learning pandas instead of learning python -- especially here it might be easier to understand what plain python is doing instead of using a dataframe:
with open('data_01.csv') as f:
    data = f.readlines()[1:]  # ditch the header
data = [x.split() for x in data]  # turn each line into a list of its values
data.sort(key=lambda x: -float(x[-1]))  # sort by the last item in each list (the speed), descending
print(data[0][2])  # print the date (index 2) from the first item in your sorted data
