Dropping rows based on a string in a table - python

Code to drop rows based on a partial string is not working.
Very simple code, and it runs fine but doesn't drop the rows I want.
The original table in the pdf looks like this:
Chemical                               Value   Unit   Type
Fluoride                               0.23    ug/L   Lab
Mercury                                0.15    ug/L   Lab
Sum of Long Chained Polymers           0.33
Partialsum of Short Chained Polymers   0.40
What I did:
import csv
import tabula
dfs = tabula.read_pdf("Test.pdf", pages='all')
file = "Test.pdf"
tables = tabula.read_pdf(file, pages=2, stream=True, multiple_tables=True)
table1 = tables[1]
table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
table1.to_csv('test.csv')
The result is that it drops the two columns I don't want, so that's fine. But it did not delete the rows with the words "sum" or "Sum" in them. Any insights?

You are close. Your filter expression does select the right rows, but you never save the result back to table1.
import pandas as pd

example = {'Chemical': ['Fluoride', 'Mercury', 'Sum of Long Chained Polymers',
                        'Partialsum of Short Chained Polymers'],
           'Value': [0.23, 0.15, 0.33, 0.4],
           'Unit': ['ug/L', 'ug/L', '', ''],
           'Type': ['Lab', 'Lab', '', '']}
table1 = pd.DataFrame(example)

table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)

discard = ['sum', 'Sum']
# The key fix: assign the filtered result back to table1
table1 = table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
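Applied to your tabula pipeline, the fix is the same single assignment. A sketch (the na=False argument is my own addition, to keep the mask boolean in case tabula produced NaN cells in the Chemical column):
discard = ['sum', 'Sum']
# Assign the filtered frame back before writing the CSV; na=False treats NaN cells as "no match"
table1 = table1[~table1.Chemical.str.contains('|'.join(discard), na=False)]
table1.to_csv('test.csv')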

You can use pd.Series.str.contains with the argument case=False to ignore case:
Also, it's not a hard rule, but using inplace=True is often considered poor practice, in part because it leads to exactly the kind of confusion you're experiencing (a short illustration follows the output below).
Given df:
Chemical Value Unit Type
0 Fluoride 0.23 ug/L Lab
1 Mercury 0.15 ug/L Lab
2 Sum of Long Chained Polymers 0.33 NaN NaN
3 Partialsum of Short Chained Polymers 0.40 NaN NaN
Doing:
df = (df.drop(['Unit', 'Type'], axis=1)
        .loc[~df.Chemical.str.contains('sum', case=False)])
Output:
Chemical Value
0 Fluoride 0.23
1 Mercury 0.15
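To illustrate that inplace warning, here is a minimal sketch (my own, not from the answer): methods called with inplace=True return None, so assigning their result discards the DataFrame.
import pandas as pd

df = pd.DataFrame({'Chemical': ['Fluoride', 'Sum of X'], 'Value': [0.23, 0.33]})

# inplace=True mutates df and returns None, so this assignment loses the frame
result = df.drop('Value', axis=1, inplace=True)
print(result)   # None

# The non-inplace style keeps the data flow explicit
df2 = pd.DataFrame({'Chemical': ['Fluoride', 'Sum of X'], 'Value': [0.23, 0.33]})
df2 = df2.drop('Value', axis=1)
print(df2)      # the frame without the dropped column, as expected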

Related

Issue in executing a specific type of nested 'for' loop on columns of a panda dataframe

I have a pandas dataframe with values like below, though in reality I am working with many more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried the following:
for col in df:
    for cols in df:
        cf[col + cols] = df[col] * df[cols]
But it generates a table with unnecessary values like AUDAUD and USDUSD, or duplicate values like AUDUSD and USDAUD. I think if I could somehow make the inner loop start at "col + 1" and run to the end of df, I should be able to resolve the issue, but I don't know how to do that.
Result i am looking for is a table with below columns and their values
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations

combos = list(combinations(df.columns, 2))

out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)

out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
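An equivalent way to write the same idea (a sketch of my own, not from the answer), assembling the products with a dict comprehension over the same combinations:
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})

# One product column per unordered pair of columns
out = pd.DataFrame({a + b: df[a] * df[b] for a, b in combinations(df.columns, 2)})
print(out)
#    AUDUSD  AUDJPY  AUDEUR  USDJPY  USDEUR  JPYEUR
# 0    0.67    93.8  0.7035     140    1.05   147.0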
Your first instinct was to use an inner/outer loop, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiate the second DataFrame
cf = pd.DataFrame()
# Use integer positions for the column index
for i in range(len(df.columns)):
    # Start the inner index at i + 1, so you aren't looking at the same column twice,
    # and limit the range to the number of columns
    for j in range(i + 1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0

Winsorize dataframe columns per month while ignoring NaN's

I have a dataframe with monthly data and the following columns: date, bm and cash.
date bm cash
1981-09-30 0.210308 2.487146
1981-10-31 0.241291 2.897529
1981-11-30 0.221529 2.892758
1981-12-31 0.239002 2.726372
1981-09-30 0.834520 4.387087
1981-10-31 0.800472 4.297658
1981-11-30 0.815778 4.459382
1981-12-31 0.836681 4.895269
Now I want to winsorize my data per month while keeping NaN values in the data. I.e. I want to group the data per month and overwrite observations above the 0.99 percentile and below the 0.01 percentile with the 0.99 and 0.01 percentile values respectively. From "Winsorizing data by column in pandas with NaN" I found that I should do this with the clip function. My code looks as follows:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df_grouped = df.groupby(pd.Grouper(freq='M'))
cols = df.columns
for c in cols:
    df[c] = df_grouped[c].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
I get the following output: ValueError: cannot reindex from a duplicate axis
P.S. I realize that I have not included my required output, but I hope that the required output is clear. Otherwise I can try to put something together.
Edit: The solution from @Allolz is already a great help, but it does not work exactly as it is supposed to. Before I ran the code from @Allolz, I ran:
df_in.groupby(pd.Grouper(freq='M', key='date'))['secured'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which returned:
date
1980-01-31 0.00 1.580564e+00
0.01 1.599805e+00
0.25 2.388106e+00
0.50 6.427071e+00
0.75 1.200685e+01
0.99 5.133111e+01
1.00 5.530329e+01
After winsorizing I get:
date
1980-01-31 0.00 1.599805
0.01 1.617123
0.25 2.388106
0.50 6.427071
0.75 12.006854
0.99 47.756152
1.00 51.331114
It is clear that the new 0.00 and 1.00 quantiles are equal to the original 0.01 and 0.99 quantiles, which is what we would expect. However, the new 0.01 and 0.99 quantiles are not equal to the original 0.01 and 0.99 quantiles, where I would expect these to remain the same. What can cause this and what could solve it? My hunch is that it might have to do with NaNs in the data, but I'm not sure if that is really the cause.
One method which will be faster requires you to create helper columns. We will use groupby + transform to broadcast the 0.01 and 0.99 quantiles (for each month group) back to the DataFrame as columns, and then you can use those Series to clip the original all at once (clip leaves NaN alone, so it satisfies that requirement too). Then, if you want, remove the helper columns (I'll leave them in for clarity).
Sample Data
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2010-01-01', freq='MS', periods=12), N),
                   'val': np.random.normal(1, 0.95, N)})
Code
gp = df.groupby(pd.Grouper(freq='M', key='date'))['val']
# Assign the lower-bound ('lb') and upper-bound ('ub') for Winsorizing
df['lb'] = gp.transform('quantile', 0.01)
df['ub'] = gp.transform('quantile', 0.99)
# Winsorize
df['val_wins'] = df['val'].clip(upper=df['ub'], lower=df['lb'])
Output
The majority of rows will not be changed (only those outside of the 1st-99th percentile range), so we can check the small subset of rows that did change to see that it works. You can see that rows for the same month have the same bounds, and the winsorized value ('val_wins') is properly clipped to the bound it exceeds.
df[df['val'] != df['val_wins']]
# date val lb ub val_wins
#42 2010-09-01 -1.686566 -1.125862 3.206333 -1.125862
#96 2010-04-01 -1.255322 -1.243975 2.995711 -1.243975
#165 2010-08-01 3.367880 -1.020273 3.332030 3.332030
#172 2010-09-01 -1.813011 -1.125862 3.206333 -1.125862
#398 2010-09-01 3.281198 -1.125862 3.206333 3.206333
#... ... ... ... ... ...
#9626 2010-12-01 3.626950 -1.198967 3.249161 3.249161
#9746 2010-11-01 3.472490 -1.259557 3.261329 3.261329
#9762 2010-09-01 3.460467 -1.125862 3.206333 3.206333
#9768 2010-06-01 -1.625013 -1.482529 3.295520 -1.482529
#9854 2010-12-01 -1.475515 -1.198967 3.249161 -1.198967
#
#[214 rows x 5 columns]
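As noted above, the helper columns can be removed once the clipping is done; a small follow-up sketch using the same column names:
# Optionally keep only the winsorized values and drop the helper bounds
df['val'] = df['val_wins']
df = df.drop(columns=['lb', 'ub', 'val_wins'])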

See if the values in a column contain % in a pandas dataframe

I have a dataframe that has columns whose values contain % (literal percentage sign). I am trying to create a function to automatically convert these values to a decimal.
For example, with the below dataframe:
var1 var2 var3 var4
id
0 1.4515 1.52% -0.5709 4%
1 1.57 1.605% -0.012 8%
2 1.69253 1.657% -0.754 9%
3 1.66331 1.686% -0.0012 5%
4 1.739 1.716% -0.04 12%
5 1.7447 1.61% -0.0023 11%
def pct_to_dec(df):
    for col in df:
        print(col)
        if '%%' in df[col].astype(str):
            print(col)
            df[col] = df[col].replace({'%%': ''}, regex=True)
            df[col] = df[col] / 100
The function should print var2 and var4, and convert the values in both columns to decimal format. Through troubleshooting I have found that Python is not seeing the percentage characters, since when I run this code:
df.isin(['%%'])
it prints a dataframe that is entirely False.
Lastly, I have tried to see if I'm using the wrong escape character. I've tried %%, /%, and \%.
I am interested in seeing if I am on the right track, as well as if there is a simpler way to do what I'm trying to do.
You can also use .str.endswith, as in the following example:
for col in df.select_dtypes('object'):
    indexer_percent = df[col].str.endswith('%')
    df.loc[indexer_percent, col] = df.loc[indexer_percent, col].str.strip('%')
    df[col] = df[col].astype('float32')
    df.loc[indexer_percent, col] /= 100.0
On your data, this results in:
var1 var2 var3 var4
id
0 1.45150 0.01520 -0.5709 0.04
1 1.57000 0.01605 -0.0120 0.08
2 1.69253 0.01657 -0.7540 0.09
3 1.66331 0.01686 -0.0012 0.05
4 1.73900 0.01716 -0.0400 0.12
5 1.74470 0.01610 -0.0023 0.11
The data is created by:
import pandas as pd
import io
infile=io.StringIO(
"""id var1 var2 var3 var4
0 1.4515 1.52% -0.5709 4%
1 1.57 1.605% -0.012 8%
2 1.69253 1.657% -0.754 9%
3 1.66331 1.686% -0.0012 5%
4 1.739 1.716% -0.04 12%
5 1.7447 1.61% -0.0023 11%"""
)
df= pd.read_csv(infile, index_col=0, sep='\s+')
You can easily check this using the Series method .str.contains.
It lets you check which rows of a Series contain the string you passed. For example, if you run this code:
df['var2'].str.contains('%')
you'll get a Series back with all rows equal to True. So you just need to loop over the columns, find the rows with True values, and do whatever you want with them; a sketch follows below.
Note that for rows that aren't of str type you'll get NaN back, so be aware of the types of the columns.
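A minimal sketch of that loop, assuming the same df as in the answer above (na=False is my own addition, so the mask stays boolean for non-string entries):
for col in df.columns:
    # Work on a string view of the column; na=False keeps the mask boolean
    mask = df[col].astype(str).str.contains('%', regex=False, na=False)
    if mask.any():
        print(col)  # should print var2 and var4
        df[col] = df[col].str.rstrip('%').astype(float) / 100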

Change column names in Pandas Dataframe from a list

Is it possible to change column names using data in a list?
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
I have my new labels as below:
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
Is it possible to change the names using the data in the above list? My original data set has 100 columns, and I did not want to do it manually for each column.
I was trying to do this with df.rename but keep getting errors. Thanks!
You can use this:
df.columns = New_Labels
Using rename is a more formally correct approach. You just have to provide a dictionary that maps your current column names to the new ones (which guarantees the expected result even if the columns are in a different order):
new_names = {'A':'NaU', 'B':'MgU', 'C':'Alu', 'D':'SiU'}
df.rename(index=str, columns=new_names)
Notice you can provide entries only for the names you want to substitute; the rest will remain the same. A small example follows.
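A sketch of such a partial rename (only 'A' is mapped; 'B', 'C' and 'D' keep their names). Since rename returns a new DataFrame by default, assign the result back:
df = df.rename(columns={'A': 'NaU'})
print(df.columns.tolist())  # ['NaU', 'B', 'C', 'D']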
df = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65],
                   [2, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55],
                   [1, 3.4, 2.0, 0.25, 0.55], [3, 3.4, 2.0, 0.25, 0.55]],
                  columns=["ID", "A", "B", "C", "D"]).set_index('ID')
New_Labels = ['NaU', 'MgU', 'AlU', 'SiU']
New_Labels=['NaU', 'MgU', 'AlU', 'SiU']
df.columns = New_Labels
this will make df look like this:
NaU MgU AlU SiU
ID
1 1.00 2.3 0.20 0.53
2 3.35 2.0 0.20 0.65
2 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
1 3.40 2.0 0.25 0.55
3 3.40 2.0 0.25 0.55
df.columns = New_Labels
Take care that the order of the new column names matches the order of the existing columns; see the sketch below.
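One way to make the pairing between old and new names visible before applying it (a sketch of my own, not from the answer) is to zip the existing columns with the new labels and pass the resulting dictionary to rename:
New_Labels = ['NaU', 'MgU', 'AlU', 'SiU']

# Pair each existing column with its new label, in order, so the mapping is explicit
df = df.rename(columns=dict(zip(df.columns, New_Labels)))
print(df.columns.tolist())  # ['NaU', 'MgU', 'AlU', 'SiU']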
The accepted rename answer is fine, but it's mainly for mapping old→new names. If we just want to wipe out the column names with a new list, there's no need to create an intermediate mapping dictionary. Just use set_axis directly.
set_axis
To set a list as the columns, use set_axis along axis=1 (the default axis=0 sets the index values):
df.set_axis(New_Labels, axis=1)
# NaU MgU AlU SiU
# ID
# 1 1.00 2.3 0.20 0.53
# 2 3.35 2.0 0.20 0.65
# 2 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
# 1 3.40 2.0 0.25 0.55
# 3 3.40 2.0 0.25 0.55
Note that set_axis is similar to modifying df.columns directly, but set_axis allows method chaining, e.g.:
df.some_method().set_axis(New_Labels, axis=1).other_method()
Theoretically, set_axis should also provide better error checking than directly modifying an attribute, though I can't find a concrete example at the moment.
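To make the chaining point concrete, here is a sketch (my own, using a shortened version of the question's data) that builds the frame, sets the index and relabels the columns in one expression:
import pandas as pd

New_Labels = ['NaU', 'MgU', 'AlU', 'SiU']

# Build the frame, set the index and relabel the columns in one chained expression
df = (pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53], [2, 3.35, 2.0, 0.2, 0.65]],
                   columns=['ID', 'A', 'B', 'C', 'D'])
        .set_index('ID')
        .set_axis(New_Labels, axis=1))
print(df)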

Dropping Dataframe rows based on name

I have the following dataframe df where I am trying to drop all rows having curv_typ as PYC_RT or YCIF_RT.
curv_typ maturity bonds 2015M06D19 2015M06D18 2015M06D17 \
0 PYC_RT Y1 GBAAA -0.24 -0.25 -0.23
1 PYC_RT Y1 GBA_AAA -0.05 -0.05 -0.05
2 PYC_RT Y10 GBAAA 0.89 0.92 0.94
My code to do this is as follows. However, for some reason df turns out to be exactly the same as above after running the code below:
df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False)
df[df["curv_typ"] != "PYC_RT"]
df[df["curv_typ"] != "YCIF_RT"]
Use isin and negate the boolean condition with ~ to build the mask:
In [76]:
df[~df['curv_typ'].isin(['PYC_RT', 'YCIF_RT'])]
Out[76]:
Empty DataFrame
Columns: [curv_typ, maturity, bonds, 2015M06D19, 2015M06D18, 2015M06D17]
Index: []
Note that this returns an empty DataFrame on your sample data, because all three sample rows have curv_typ equal to PYC_RT.
You need to assign the resulting DataFrame back to the original DataFrame (thus overwriting it):
df = df[df["curv_typ"] != "PYC_RT"]
df = df[df["curv_typ"] != "YCIF_RT"]
