I am trying to create a set of new columns derived from existing columns in a DataFrame using a function. Here is sample code that produces errors, and I wonder if there is a better, more efficient way to accomplish this than the loop:
import numpy as np
import pandas as pd
dates = pd.date_range('1/1/2000', periods=100, freq='M')
long_df = pd.DataFrame(np.random.randn(100, 4),index=dates, columns=['Colorado', 'Texas', 'New York', 'Ohio'])
mylist=['Colorado', 'Texas', 'New York', 'Ohio']
def trnsfrm_1_10(a, b):
    b = (a - np.min(a)) / (np.max(a) - np.min(a)) * 9 + 1
    return b

for a in mylist:
    b = a + "_T"
    long_df[b] = long_df.apply(lambda row: trnsfrm_1_10(row[a], row[b]), axis=1)
To clarify the question above, here is an example of a DataFrame with input columns (Colorado, Texas, New York) and output columns (T_Colorado, T_Texas, T_New York). Assume the values below are the minimum and maximum of each input column; applying the equation b = (a - min)/(max - min)*9 + 1 to each column produces the output columns T_Colorado, T_Texas, and T_New York. I simulated this process in Excel on just 5 rows, but it would be great to compute the minimum and maximum as part of the function, because the real data has many more rows. I am relatively new to Python and pandas, and I really appreciate your help.
These are example min and max
Colorado Texas New York
min 0.03 -1.26 -1.04
max 1.17 0.37 0.86
This is example of a DataFrame
Index Colorado Texas New York T_Colorado T_Texas T_New York
1/31/2000 0.03 0.37 0.09 1.00 10.00 6.35
2/29/2000 0.4 0.26 -1.04 3.92 9.39 1.00
3/31/2000 0.35 -0.06 -0.75 3.53 7.63 2.37
4/30/2000 1.17 -1.26 -0.61 10.00 1.00 3.04
5/31/2000 0.46 -0.79 0.86 4.39 3.60 10.00
IIUC, you should take advantage of broadcasting
long_df2= (long_df - long_df.min())/(long_df.max() - long_df.min()) * 9 + 1
Then concat
pd.concat([long_df, long_df2.add_suffix('_T')], axis=1)
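As a runnable sketch of that idea (the values here are just the first few rows from the example, not the real data):

```python
import pandas as pd

# tiny illustrative frame; names and values are taken from the example above
long_df = pd.DataFrame({'Colorado': [0.03, 0.40, 1.17],
                        'Texas': [0.37, 0.26, -1.26]})

# broadcasting: min and max are computed once per column,
# then applied element-wise to every row
long_df2 = (long_df - long_df.min()) / (long_df.max() - long_df.min()) * 9 + 1

# attach the scaled columns next to the originals
result = pd.concat([long_df, long_df2.add_suffix('_T')], axis=1)
```

Each column's minimum maps to 1 and its maximum to 10, matching the Excel simulation in the question.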
In your code, the error is that when you define trnsfrm_1_10, b is a parameter, while it is actually only your output. It should not be a parameter, especially since it is the value of the new column you want to create during the for loop. The code should look more like this:
def trnsfrm_1_10(a):
    b = (a - np.min(a)) / (np.max(a) - np.min(a)) * 9 + 1
    return b

for a in mylist:
    b = a + "_T"
    long_df[b] = long_df.apply(lambda row: trnsfrm_1_10(row[a]), axis=1)
The other thing is that you calculate np.min(a) inside trnsfrm_1_10, which will actually be equal to a itself (same with the max), because you apply the function row-wise, so a is the single value at the row and column you are in. What you presumably mean is the column minimum, np.min(long_df[a]), which can also be written long_df[a].min().
If I understand correctly, what you are trying to perform is actually:
dates = pd.date_range('1/1/2000', periods=100, freq='M')
long_df = pd.DataFrame(np.random.randn(100, 4),index=dates,
columns=['Colorado', 'Texas', 'New York', 'Ohio'])
mylist=['Colorado', 'Texas', 'New York', 'Ohio']
for a in mylist:
    long_df[a + "_T"] = (long_df[a] - long_df[a].min()) / (long_df[a].max() - long_df[a].min()) * 9 + 1
which then gives:
long_df.head()
Out[29]:
Colorado Texas New York Ohio Colorado_T Texas_T \
2000-01-31 -0.762666 1.413276 0.857333 0.648960 3.192754 7.768111
2000-02-29 0.148023 0.304971 1.954966 0.656787 4.676018 6.082177
2000-03-31 0.531195 1.283100 0.070963 1.098968 5.300102 7.570091
2000-04-30 -0.385679 0.425382 1.330285 0.496238 3.806763 6.265344
2000-05-31 -0.047057 -0.362419 -2.276546 0.297990 4.358285 5.066955
New York_T Ohio_T
2000-01-31 6.390972 5.659870
2000-02-29 8.242445 5.676254
2000-03-31 5.064533 6.601876
2000-04-30 7.188740 5.340175
2000-05-31 1.104787 4.925180
where all the values in the _T columns are calculated from the corresponding original column.
Ultimately, to avoid a for loop over the columns, you can do:
long_df_T =(((long_df -long_df.min(axis=0))/(long_df.max(axis=0) -long_df.min(axis=0))*9 +1)
.add_suffix('_T'))
to create a DataFrame with all the _T columns at once. A few options are then available to add them to long_df; one way is with join:
long_df = long_df.join(long_df_T)
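As a quick check that the vectorized version matches the per-column loop, here is a sketch on a tiny frame (the data is made up for illustration):

```python
import pandas as pd

long_df = pd.DataFrame({'Colorado': [0.0, 1.0, 2.0], 'Texas': [10.0, 30.0, 20.0]})

# per-column loop version
looped = long_df.copy()
for a in long_df.columns:
    looped[a + "_T"] = (long_df[a] - long_df[a].min()) / (long_df[a].max() - long_df[a].min()) * 9 + 1

# whole-frame version: same arithmetic broadcast over all columns at once
long_df_T = (((long_df - long_df.min(axis=0))
              / (long_df.max(axis=0) - long_df.min(axis=0)) * 9 + 1)
             .add_suffix('_T'))
joined = long_df.join(long_df_T)
```

Both produce the same columns and values; the second avoids Python-level iteration entirely.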
Related
I have a pandas DataFrame with values like below, though in reality I am working with many more columns and with historical data:
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a DataFrame with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR, where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried below:
for col in df:
    for cols in df:
        cf[col + cols] = df[col] * df[cols]
But it generates a table with unnecessary columns like AUDAUD and USDUSD, and duplicates like AUDUSD and USDAUD. I think that if I could somehow start the second loop at the column after col and run it to the end of df, I could resolve the issue, but I don't know how to do that.
Result i am looking for is a table with below columns and their values
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this:
from itertools import combinations
combos = list(combinations(df.columns, 2))
out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)
out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
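Putting the pieces together as a self-contained sketch (using the one-row frame from the question):

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})

# combinations yields each unordered pair exactly once,
# so there is no AUDAUD and no AUDUSD/USDAUD duplicate
combos = list(combinations(df.columns, 2))
out = pd.concat([df[b].mul(df[a]) for a, b in combos], axis=1, keys=combos)

# flatten the tuple column labels into names like 'AUDUSD'
out.columns = out.columns.map("".join)
```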
I thought it was intuitive that your first approach used an inner/outer loop, and I think this solution works in the same spirit:
# Added a second row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)

# Instantiate the second DataFrame
cf = pd.DataFrame()

# Iterate over the column positions as integers
for i in range(len(df.columns)):
    # Start at i + 1 so you aren't looking at the same column twice,
    # and limit the range to the number of columns
    for j in range(i + 1, len(df.columns)):
        print(f'{df.columns[i]}{df.columns[j]}')  # VERIFY
        # Build the combined column name
        combine = f'{df.columns[i]}{df.columns[j]}'
        # Assign the new column as the product of the two column Series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]

print(cf)  # VERIFY
The console log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
I am trying to add a row to my existing pandas DataFrame, and the value of the new row should be a computation.
I have a dataframe that looks like the below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
I want to add a row called "Metric", which is the sum of the "LE_St" variable for "Rating" >= 4 and < 6, divided by the "LE_St" value for "All", i.e. Metric = (0.05 + 1.77) / 10.17.
My output dataframe should look like below:
Rating LE_St % Total
1.00 7.58 74.55
2.00 0.56 5.55
3.00 0.21 2.04
5.00 0.05 0.44
6.00 1.77 17.42
All 10.17 100.00
Metric 0.44
I believe your approach to the DataFrame is wrong.
Usually rows hold values that correlate with the columns in a way that makes sense, rather than unrelated summary information. The power of pandas and Python is in holding and manipulating data: you can easily compute a value from one column, or even all columns, and store it in a separate "summary" DataFrame or in standalone variables. That might help you here as well.
For computation on a column (i.e., a Series object) you can use the .sum() method (or any of the other computational tools) and slice your DataFrame by values in the "Rating" column.
For one-off computation of small statistics you may be better off with Excel :)
An example of a solution might look like this:
total = 10.17  # I don't know where this value comes from
# coerce 'Rating' to numeric so the literal 'All' row drops out of the comparison
rating = pd.to_numeric(df['Rating'], errors='coerce')
sliced_df = df[rating.between(4, 6)]
metric = sliced_df['LE_St'].sum() / total
print(metric)  # or store it somewhere, however you like
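A fuller sketch of the same idea, including appending the summary row. The two-column frame below is a trimmed version of the example, and the bounds follow the arithmetic shown in the question, which includes the 6.00 row:

```python
import pandas as pd

# trimmed version of the example; 'Rating' mixes numbers with a literal 'All'
df = pd.DataFrame({'Rating': [1.0, 2.0, 3.0, 5.0, 6.0, 'All'],
                   'LE_St': [7.58, 0.56, 0.21, 0.05, 1.77, 10.17]})

# coerce 'Rating' to numeric so the 'All' row becomes NaN and drops out
rating = pd.to_numeric(df['Rating'], errors='coerce')
numerator = df.loc[rating.between(4, 6), 'LE_St'].sum()
denominator = df.loc[df['Rating'] == 'All', 'LE_St'].iloc[0]
metric = numerator / denominator

# append the summary row at the next integer index position
df.loc[len(df)] = ['Metric', round(metric, 2)]
```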
I have the following dataframe df where I am trying to drop all rows having curv_typ as PYC_RT or YCIF_RT.
curv_typ maturity bonds 2015M06D19 2015M06D18 2015M06D17 \
0 PYC_RT Y1 GBAAA -0.24 -0.25 -0.23
1 PYC_RT Y1 GBA_AAA -0.05 -0.05 -0.05
2 PYC_RT Y10 GBAAA 0.89 0.92 0.94
My code to do this is as follows. However, for some reason df turns out to be exactly the same as above after running the code below:
df = pd.read_csv("ECB.tsv", sep="\t", index_col=False)
df[df["curv_typ"] != "PYC_RT"]
df[df["curv_typ"] != "YCIF_RT"]
Use isin and negate ~ the boolean condition for the mask:
In [76]:
df[~df['curv_typ'].isin(['PYC_RT', 'YCIF_RT'])]
Out[76]:
Empty DataFrame
Columns: [curv_typ, maturity, bonds, 2015M06D19, 2015M06D18, 2015M06D17]
Index: []
Note that this returns an empty DataFrame on your sample data, since every sample row has curv_typ equal to PYC_RT.
You need to assign the resulting DataFrame to the original DataFrame (thus, over-writing it):
df = df[df["curv_typ"] != "PYC_RT"]
df = df[df["curv_typ"] != "YCIF_RT"]
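A quick sketch of the isin approach (the SPOT_RT row is invented here so the filter leaves something behind):

```python
import pandas as pd

# sample data; 'SPOT_RT' is a made-up value that the filter keeps
df = pd.DataFrame({'curv_typ': ['PYC_RT', 'PYC_RT', 'YCIF_RT', 'SPOT_RT'],
                   'maturity': ['Y1', 'Y10', 'Y1', 'Y1']})

# build one mask for both unwanted values, negate it, and assign back
df = df[~df['curv_typ'].isin(['PYC_RT', 'YCIF_RT'])]
```

The key point in both answers is the assignment back to df; filtering alone returns a new DataFrame without modifying the original.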
I want to apply a function to slices of a DataFrame, for each row, and get back a DataFrame of the same shape where each value is computed within its slice.
So, for example
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)),2))
f = lambda x: (x - x.mean())
What I want is to apply lambda function f from column 0 to 5 and from column 5 to 10.
I did this:
a = pandas.DataFrame(f(df.T.iloc[0:5, :]))
but this is only for the first slice. How can I include the second slice in the code, so that my resulting output frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice?
I hope it makes sense. What would be the right way to go about this?
Thank you.
You can simply reassign the result to the original df, like this:
import pandas as pd
import numpy as np
# I'd rather use a function than lambda here, preference I guess
def f(x):
return x - x.mean()
df = pd.DataFrame(np.round(np.random.normal(size=(2,10)), 2))
df.T
0 1
0 0.92 -0.35
1 0.32 -1.37
2 0.86 -0.64
3 -0.65 -2.22
4 -1.03 0.63
5 0.68 -1.60
6 -0.80 -1.10
7 -0.69 0.05
8 -0.46 -0.74
9 0.02 1.54
# make a copy of df here
df1 = df.copy()
# reassign the transformed slices back to the copy
df1.iloc[:, :5] = f(df.T.iloc[0:5, :]).T.to_numpy()
df1.iloc[:, 5:] = f(df.T.iloc[5:, :]).T.to_numpy()
df1.T
0 1
0 0.836 0.44
1 0.236 -0.58
2 0.776 0.15
3 -0.734 -1.43
4 -1.114 1.42
5 0.930 -1.23
6 -0.550 -0.73
7 -0.440 0.42
8 -0.210 -0.37
9 0.270 1.91
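An equivalent sketch without the transpose, subtracting each row's slice mean directly with sub(..., axis=0); the ascending-integer frame is just for easy checking:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20, dtype=float).reshape(2, 10))

df1 = df.copy()
# for each row, subtract the mean of that row's first five columns...
df1.iloc[:, :5] = df.iloc[:, :5].sub(df.iloc[:, :5].mean(axis=1), axis=0).to_numpy()
# ...and, separately, the mean of its last five columns
df1.iloc[:, 5:] = df.iloc[:, 5:].sub(df.iloc[:, 5:].mean(axis=1), axis=0).to_numpy()
```

Each slice of each row now sums to zero, since its own mean was removed.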
I have data that consist of 1,000 samples from a distribution of a rate for several different countries stored in a pandas DataFrame:
s1 s2 ... s1000 pop
region country
NA USA 0.25 0.27 0.23 300
CAN 0.16 0.14 0.13 35
LA MEX ...
I need to multiply each sample by the population. To accomplish this, I currently have:
for column in data.filter(regex='sample'):
    data[column] = data[column] * data['pop']
While this works, iterating over columns feels like it goes against the spirit of Python and NumPy. Is there a more natural way I'm not seeing? I would normally use apply, but I don't know how to use apply and still get the unique population value for each row.
More context: the reason I need this multiplication is that I want to aggregate the data by region, collapsing USA and CAN into North America, for example. However, because my data are rates, I cannot simply add them; I must first multiply by population to turn them into counts.
I might do something like
>>> df
s1 s2 s1000 pop
region country
NaN USA 0.25 0.27 0.23 300
CAN 0.16 0.14 0.13 35
[2 rows x 4 columns]
>>> df.iloc[:,:-1] = df.iloc[:, :-1].mul(df["pop"], axis=0)
>>> df
s1 s2 s1000 pop
region country
NaN USA 75.0 81.0 69.00 300
CAN 5.6 4.9 4.55 35
[2 rows x 4 columns]
where instead of iloc-ing every column except the last you could use any other loc-based filter.
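A runnable sketch with the two rows from the question (the index here is simplified to just the country):

```python
import pandas as pd

df = pd.DataFrame({'s1': [0.25, 0.16], 's2': [0.27, 0.14], 'pop': [300, 35]},
                  index=['USA', 'CAN'])

# mul with axis=0 broadcasts the 'pop' Series down the rows,
# multiplying every sample column by that row's population
df.iloc[:, :-1] = df.iloc[:, :-1].mul(df['pop'], axis=0)
```

No per-column Python loop is needed; the row-wise broadcast handles the "unique population value for each row" requirement directly.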