I have the following dataframe df and am trying to drop all rows where curv_typ is PYC_RT or YCIF_RT.
curv_typ maturity bonds 2015M06D19 2015M06D18 2015M06D17 \
0 PYC_RT Y1 GBAAA -0.24 -0.25 -0.23
1 PYC_RT Y1 GBA_AAA -0.05 -0.05 -0.05
2 PYC_RT Y10 GBAAA 0.89 0.92 0.94
My code to do this is as follows. However, for some reason df turns out to be exactly the same as above after running the code below:
df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False)
df[df["curv_typ"] != "PYC_RT"]
df[df["curv_typ"] != "YCIF_RT"]
Use isin and negate the boolean condition with ~ to build the mask:
In [76]:
df[~df['curv_typ'].isin(['PYC_RT', 'YCIF_RT'])]
Out[76]:
Empty DataFrame
Columns: [curv_typ, maturity, bonds, 2015M06D19, 2015M06D18, 2015M06D17]
Index: []
Note that this returns nothing on your sample dataset, since every row shown has curv_typ equal to PYC_RT.
You need to assign the resulting DataFrame back to the original DataFrame (thus overwriting it):
df = df[df["curv_typ"] != "PYC_RT"]
df = df[df["curv_typ"] != "YCIF_RT"]
I have a pandas dataframe that has values like below, though in reality I am working with a lot more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried the below:
for col in df:
    for cols in df:
        cf[col+cols] = df[col]*df[cols]
But it generates a table with unnecessary values like AUDAUD, USDUSD, or duplicate values like AUDUSD and USDAUD. I think if I can somehow set "cols = col+1 till end of df" in the second for loop I should be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the below columns and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this :
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
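As a small variation (not part of the answer above, just a sketch assuming the same one-row df), the pair products can also be collected with a dict comprehension, which avoids the separate column-renaming step:
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
# one column per unordered pair of currencies, named e.g. 'AUDUSD'
out = pd.DataFrame({a + b: df[a] * df[b] for a, b in combinations(df.columns, 2)})
print(out)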
I thought it intuitive that your first approach was to use an inner/outer loop, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiate the second DataFrame
cf = pd.DataFrame()
# Iterate over the column positions as integers
for i in range(len(df.columns)):
    # Start at i + 1 so you aren't looking at the same column twice,
    # and limit the range to the number of columns
    for j in range(i+1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the two column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column Series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
Code to drop rows based on a partial string is not working.
Very simple code, and it runs fine but doesn't drop the rows I want.
The original table in the pdf looks like this:
Chemical                              Value  Unit  Type
Fluoride                              0.23   ug/L  Lab
Mercury                               0.15   ug/L  Lab
Sum of Long Chained Polymers          0.33
Partialsum of Short Chained Polymers  0.40
What I did:
import csv
import tabula
dfs = tabula.read_pdf("Test.pdf", pages='all')
file = "Test.pdf"
tables = tabula.read_pdf(file, pages=2, stream=True, multiple_tables=True)
table1 = tables[1]
table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
table1.to_csv('test.csv')
The result is that it drops the 2 columns I don't want, so that's fine. But it did not delete the rows with the words "sum" or "Sum" in them. Any insights?
You are close. You did drop the rows, but you didn't save the result.
import pandas as pd
example = {'Chemical': ['Fluoride', 'Mercury', 'Sum of Long Chained Polymers',
                        'Partialsum of Short Chained Polymers'],
           'Value': [0.23, 0.15, 0.33, 0.4],
           'Unit': ['ug/L', 'ug/L', '', ''],
           'Type': ['Lab', 'Lab', '', '']}
table1 = pd.DataFrame(example)
table1.drop('Unit', axis=1, inplace=True)
table1.drop('Type', axis=1, inplace=True)
discard = ['sum','Sum']
table1 = table1[~table1.Chemical.str.contains('|'.join(discard))]
print(table1)
You can use pd.Series.str.contains with the argument case=False to ignore case:
Also, it's not law, but it is often considered poor practice to use inplace=True... in part because it leads to confusion like the one you're experiencing.
Given df:
Chemical Value Unit Type
0 Fluoride 0.23 ug/L Lab
1 Mercury 0.15 ug/L Lab
2 Sum of Long Chained Polymers 0.33 NaN NaN
3 Partialsum of Short Chained Polymers 0.40 NaN NaN
Doing:
df = (df.drop(['Unit', 'Type'], axis=1)
.loc[~df.Chemical.str.contains('sum', case=False)])
Output:
Chemical Value
0 Fluoride 0.23
1 Mercury 0.15
I have a dataframe with monthly data and the following columns: date, bm and cash.
date bm cash
1981-09-30 0.210308 2.487146
1981-10-31 0.241291 2.897529
1981-11-30 0.221529 2.892758
1981-12-31 0.239002 2.726372
1981-09-30 0.834520 4.387087
1981-10-31 0.800472 4.297658
1981-11-30 0.815778 4.459382
1981-12-31 0.836681 4.895269
Now I want to winsorize my data per month while keeping NaN values in the data, i.e. I want to group the data per month and overwrite observations above the 0.99 percentile and below the 0.01 percentile with the 0.99 percentile and 0.01 percentile values, respectively. From Winsorizing data by column in pandas with NaN I found that I should do this with the "clip" function. My code looks as follows:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df_grouped = df.groupby(pd.Grouper(freq='M'))
cols = df.columns
for c in cols:
    df[c] = df_grouped[c].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
I get the following output: ValueError: cannot reindex from a duplicate axis
P.S. I realize that I have not included my required output, but I hope that the required output is clear. Otherwise I can try to put something together.
Edit: This solution from @Allolz is already a great help, but it does not work exactly as it is supposed to. Before I ran the code from @Allolz, I ran:
df_in.groupby(pd.Grouper(freq='M', key='date'))['secured'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which returned:
date
1980-01-31 0.00 1.580564e+00
0.01 1.599805e+00
0.25 2.388106e+00
0.50 6.427071e+00
0.75 1.200685e+01
0.99 5.133111e+01
1.00 5.530329e+01
After winsorizing I get:
date
1980-01-31 0.00 1.599805
0.01 1.617123
0.25 2.388106
0.50 6.427071
0.75 12.006854
0.99 47.756152
1.00 51.331114
It is clear that the new 0.0 and 1.0 quantiles are equal to the original 0.01 and 0.99 quantiles, which is what we would expect. However, the new 0.01 and 0.99 quantiles are not equal to the original 0.01 and 0.99 quantiles, where I would expect that these should remain the same. What can cause this and what could solve it? My hunch is that it might have to do with NaNs in the data, but I'm not sure if that is really the cause.
One method which will be faster requires you to create helper columns. We will use groupby + transform to broadcast columns for the 0.01 and 0.99 quantile (for that month group) back to the DataFrame, and then you can use those Series to clip the original at once. (clip will leave NaN alone, so it satisfies that requirement too.) Then, if you want, remove the helper columns (I'll leave them in for clarity).
Sample Data
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2010-01-01', freq='MS', periods=12), N),
                   'val': np.random.normal(1, 0.95, N)})
Code
gp = df.groupby(pd.Grouper(freq='M', key='date'))['val']
# Assign the lower-bound ('lb') and upper-bound ('ub') for Winsorizing
df['lb'] = gp.transform('quantile', 0.01)
df['ub'] = gp.transform('quantile', 0.99)
# Winsorize
df['val_wins'] = df['val'].clip(upper=df['ub'], lower=df['lb'])
Output
The majority of rows will not be changed (only those outside of the 1st-99th percentile), so we can check the small subset of rows that did change to see that it works. You can see that rows for the same month have the same bounds and that the winsorized value ('val_wins') is properly clipped to the bound it exceeds.
df[df['val'] != df['val_wins']]
# date val lb ub val_wins
#42 2010-09-01 -1.686566 -1.125862 3.206333 -1.125862
#96 2010-04-01 -1.255322 -1.243975 2.995711 -1.243975
#165 2010-08-01 3.367880 -1.020273 3.332030 3.332030
#172 2010-09-01 -1.813011 -1.125862 3.206333 -1.125862
#398 2010-09-01 3.281198 -1.125862 3.206333 3.206333
#... ... ... ... ... ...
#9626 2010-12-01 3.626950 -1.198967 3.249161 3.249161
#9746 2010-11-01 3.472490 -1.259557 3.261329 3.261329
#9762 2010-09-01 3.460467 -1.125862 3.206333 3.206333
#9768 2010-06-01 -1.625013 -1.482529 3.295520 -1.482529
#9854 2010-12-01 -1.475515 -1.198967 3.249161 -1.198967
#
#[214 rows x 5 columns]
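If you then want to remove the helper columns, as mentioned above, something like this should do (a sketch, assuming the column names used in this answer):
# drop the intermediate bound columns once 'val_wins' has been computed
df = df.drop(columns=['lb', 'ub'])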
I am trying to create a set of new columns derived from existing columns in a dataframe using a function. Here is sample code that produces errors, and I wonder if there is a better, more efficient way to accomplish this than the loop.
import numpy as np
import pandas as pd
dates = pd.date_range('1/1/2000', periods=100, freq='M')
long_df = pd.DataFrame(np.random.randn(100, 4),index=dates, columns=['Colorado', 'Texas', 'New York', 'Ohio'])
mylist=['Colorado', 'Texas', 'New York', 'Ohio']
def trnsfrm_1_10(a, b):
    b = (a-np.min(a))/(np.max(a)-np.min(a))*9+1
    return b
for a in mylist:
    b = a+"_T"
    long_df[b] = long_df.apply(lambda row: trnsfrm_1_10(row[a], row[b]), axis=1)
To clarify the above question, here is an example of a DataFrame that has input columns (Colorado, Texas, New York) and output variables (T_Colorado, T_Texas, T_New York). Let's assume that, for each input variable, below are the minimum and maximum of each column; then by applying the equation b = (a-min)/(max-min)*9+1 to each column, the output variables are T_Colorado, T_Texas and T_New York. I had to simulate this process in Excel based on just 5 rows, but it would be great to compute the minimum and maximum as part of the function because I will have a lot more rows in the real data. I am relatively new to Python and Pandas and I really appreciate your help.
These are example min and max
Colorado Texas New York
min 0.03 -1.26 -1.04
max 1.17 0.37 0.86
This is example of a DataFrame
Index Colorado Texas New York T_Colorado T_Texas T_New York
1/31/2000 0.03 0.37 0.09 1.00 10.00 6.35
2/29/2000 0.4 0.26 -1.04 3.92 9.39 1.00
3/31/2000 0.35 -0.06 -0.75 3.53 7.63 2.37
4/30/2000 1.17 -1.26 -0.61 10.00 1.00 3.04
5/31/2000 0.46 -0.79 0.86 4.39 3.60 10.00
IIUC, you should take advantage of broadcasting
long_df2= (long_df - long_df.min())/(long_df.max() - long_df.min()) * 9 + 1
Then concat
pd.concat([long_df, long_df2.add_suffix('_T')], axis=1)
In your code, the error is that when you define trnsfrm_1_10, b is a parameter while actually it's only your output. It should not be a parameter, especially as it's the value of the new column you want to create during the for loop. So the code would be more like:
def trnsfrm_1_10(a):
    b = (a-np.min(a))/(np.max(a)-np.min(a))*9+1
    return b

for a in mylist:
    b = a+"_T"
    long_df[b] = long_df.apply(lambda row: trnsfrm_1_10(row[a]), axis=1)
The other thing is that you calculate np.min(a) in trnsfrm_1_10, which will actually be equal to a itself (same with max), because you apply the function row-wise, so a is just the single value at the row and column you are in. I assume what you mean is more like np.min(long_df[a]), which can also be written long_df[a].min().
If I understand well, what you try to perform is actually:
dates = pd.date_range('1/1/2000', periods=100, freq='M')
long_df = pd.DataFrame(np.random.randn(100, 4), index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])
mylist=['Colorado', 'Texas', 'New York', 'Ohio']
for a in mylist:
    long_df[a+"_T"] = (long_df[a]-long_df[a].min())/(long_df[a].max()-long_df[a].min())*9+1
giving then:
long_df.head()
Out[29]:
Colorado Texas New York Ohio Colorado_T Texas_T \
2000-01-31 -0.762666 1.413276 0.857333 0.648960 3.192754 7.768111
2000-02-29 0.148023 0.304971 1.954966 0.656787 4.676018 6.082177
2000-03-31 0.531195 1.283100 0.070963 1.098968 5.300102 7.570091
2000-04-30 -0.385679 0.425382 1.330285 0.496238 3.806763 6.265344
2000-05-31 -0.047057 -0.362419 -2.276546 0.297990 4.358285 5.066955
New York_T Ohio_T
2000-01-31 6.390972 5.659870
2000-02-29 8.242445 5.676254
2000-03-31 5.064533 6.601876
2000-04-30 7.188740 5.340175
2000-05-31 1.104787 4.925180
where all the values in the columns ending in _T are calculated from the corresponding column.
Ultimately, to avoid a for loop over the columns, you can do:
long_df_T = (((long_df - long_df.min(axis=0))/(long_df.max(axis=0) - long_df.min(axis=0))*9 + 1)
             .add_suffix('_T'))
to create a dataframe with all the _T columns at once. Then a few options are available to add them to long_df; one way is with join:
long_df = long_df.join(long_df_T)
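Another option, purely as a sketch, is pd.concat, which gives the same result here because both frames share the same index:
long_df = pd.concat([long_df, long_df_T], axis=1)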
I want to apply a function to row slices of a dataframe in pandas, for each row, and get back a dataframe in which every value has been calculated within its own slice.
So, for example
import numpy
import pandas
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)), 2))
f = lambda x: (x - x.mean())
What I want is to apply the lambda function f to columns 0 to 5 and to columns 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5,:]))
but this is only for the first slice. How can I include the second slice in the code, so that my resulting output frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice?
I hope it makes sense. What would be the right way to go about this?
thank you.
You can simply reassign the result to original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
    return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# bind df1 to df here (note: this is just another name for df, not an independent copy)
df1 = df
# just reassign the slices back
# edited, omit the DataFrame part.
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5,:]), f(df.T.iloc[5:,:])
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
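For reference, the same per-slice demeaning can also be written without transposing at all; a minimal sketch, assuming the 2x10 df and the f from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.round(np.random.normal(size=(2, 10)), 2))

# split the columns into the two slices, subtract each row's slice mean
# from every value in that slice, then glue the slices back together
halves = [df.iloc[:, :5], df.iloc[:, 5:]]
out = pd.concat([h.sub(h.mean(axis=1), axis=0) for h in halves], axis=1)
print(out)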