My issue is the following: I'm creating a pandas DataFrame from a dictionary, and it ends up with shape [70k, 300]. I'm trying to normalise each cell, either by columns and then by rows, or the other way around (rows, then columns).
I asked a similar question before, but that was for a square [70k, 70k] DataFrame, and it worked just by doing this:
dfNegInfoClearRev = (df - df.mean(axis=1)) / df.std(axis=1).replace(0, 1)
dfNegInfoClearRev = (dfNegInfoClearRev - dfNegInfoClearRev.mean(axis=0)) / dfNegInfoClearRev.std(axis=0).replace(0, 1)
print(dfNegInfoClearRev)
This did what I needed for the [70k, 70k] case. A problem came up when I tried the same principle with a [70k, 300] DataFrame. If I do this:
dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)
dfRINegInfoRows = (dfRINegInfo - dfRINegInfo.mean(axis=1)) / dfRINegInfo.std(axis=1).replace(0, 1)
I somehow end up with a [70k, 70k+300] full of NaNs with the same names.
I ended up doing this:
dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1), axis=0).fillna(1).replace(0, 1)
print(dfRINegInfoRows)
But I'm not sure this is what I was trying to do, and I don't really understand why, after the column normalisation (which works and gives [70k, 300]), the row normalisation gives me [70k, 70k+300]. Any help?
I think your new code is doing what you want.
If we look at a 3x3 toy example:
df = pd.DataFrame([
[1, 2, 3],
[2, 4, 6],
[3, 6, 9],
])
The axis=1 mean is:
df.mean(axis=1)
# 0 2.0
# 1 4.0
# 2 6.0
# dtype: float64
And the subtraction aligns the Series index with the column labels, so the same vector is subtracted from each row (i.e., [1,2,3] - [2,4,6] element-wise, [2,4,6] - [2,4,6], and [3,6,9] - [2,4,6]):
df - df.mean(axis=1)
# 0 1 2
# 0 -1.0 -2.0 -3.0
# 1 0.0 0.0 0.0
# 2 1.0 2.0 3.0
So if we have df2 shaped 3x2:
df2 = pd.DataFrame([
[1,2],
[3,6],
[5,10],
])
The axis=1 mean is still length 3:
df2.mean(axis=1)
# 0 1.5
# 1 4.5
# 2 7.5
# dtype: float64
And subtraction will result in the 3rd column being nan (i.e., [1,2,nan] - [1.5,4.5,7.5] element-wise, [3,6,nan] - [1.5,4.5,7.5], and [5,10,nan] - [1.5,4.5,7.5]):
df2 - df2.mean(axis=1)
# 0 1 2
# 0 -0.5 -2.5 NaN
# 1 1.5 1.5 NaN
# 2 3.5 5.5 NaN
If you make the subtraction itself along axis=0 then it works as expected:
df2.sub(df2.mean(axis=1), axis=0)
# 0 1
# 0 -0.5 0.5
# 1 -1.5 1.5
# 2 -2.5 2.5
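So when you use a default subtraction between a (70000, 300) DataFrame and the length-70000 Series of row means, pandas aligns the Series index with the 300 column labels and takes the union of the two. With default integer labels that leaves 69700 columns of NaN; if none of the column names match a row label, every column is NaN, which is the [70k, 70k+300] result you saw.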
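A minimal sketch of the two-step normalisation on a small non-square frame, assuming string row and column labels (the names r0..r4 and a, b, c are made up for illustration). It contrasts the plain operator, which aligns the row means against the column labels, with .sub/.div using axis=0:

```python
import numpy as np
import pandas as pd

# Hypothetical 5x3 frame standing in for the (70000, 300) one.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(5, 3)),
    index=[f"r{i}" for i in range(5)],
    columns=["a", "b", "c"],
)

# Column-wise z-score: the mean/std Series are indexed by column label,
# so the default alignment of the plain operators is already correct.
by_cols = (df - df.mean(axis=0)) / df.std(axis=0).replace(0, 1)

# Row-wise z-score: the mean/std Series are indexed by row label,
# so the alignment axis has to be given explicitly.
row_std = by_cols.std(axis=1).replace(0, 1)
by_rows = by_cols.sub(by_cols.mean(axis=1), axis=0).div(row_std, axis=0)

# Plain operator for comparison: row labels are matched against column
# labels, nothing overlaps, and the result is the all-NaN union.
misaligned = by_cols - by_cols.mean(axis=1)

print(by_rows.shape)     # (5, 3)
print(misaligned.shape)  # (5, 8)
```

Replacing a zero standard deviation before dividing, as in the question's first snippet, avoids the division by zero up front, so no fillna is needed afterwards.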
I am a newbie to Python. I am working with Python 3.6.
I have the following pandas dataframe:
import pandas as pd
data = [[1.5, 2,1.5,0.8], [1.2, 2,1.5,3], [2, 2,1.5,1]]
df = pd.DataFrame(data, columns = ['Floor', 'V1','V2','V3'])
df
Essentially, for each row, if the value in column V1 is lower than the value of Floor, I want to set V1 equal to Floor.
The operation needs to be applied to every row and to every column from V1 to V3 (each row has a single Floor value).
The result would be the following in this example:
data = [[1.5, 2,1.5,1.5], [1.2, 2,1.5,3], [2, 2,2,2]]
Any idea how to achieve this? I was looking at the function where but I am not sure how to deploy it.
Many thanks in advance.
You can use clip:
df = df.clip(lower=df['Floor'], axis=0)
output:
>>> df
Floor V1 V2 V3
0 1.5000 2 1.5000 1.5000
1 1.2000 2 1.5000 3.0000
2 2.0000 2 2.0000 2.0000
If you have other columns that should not be clipped, select the V columns first:
cols = df.filter(regex=r'V\d+').columns
df[cols] = df[cols].clip(lower=df['Floor'], axis=0)
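As a quick check, here is a self-contained sketch that applies the clip approach to the question's data and compares it with the expected result. The column list is written out by hand here, which is an assumption about the layout:

```python
import pandas as pd

data = [[1.5, 2, 1.5, 0.8], [1.2, 2, 1.5, 3], [2, 2, 1.5, 1]]
df = pd.DataFrame(data, columns=["Floor", "V1", "V2", "V3"])

# Columns to clip; Floor itself is left untouched.
value_cols = ["V1", "V2", "V3"]

clipped = df.copy()
# axis=0 aligns the Floor Series with the rows, giving one lower bound per row.
clipped[value_cols] = df[value_cols].clip(lower=df["Floor"], axis=0)

# Expected result as given in the question.
expected = pd.DataFrame(
    [[1.5, 2, 1.5, 1.5], [1.2, 2, 1.5, 3], [2, 2, 2, 2]],
    columns=df.columns,
)
print((clipped.to_numpy() == expected.to_numpy()).all())
```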
I would suggest using numpy's np.where(). It lets you compare and update a column based on an if/else criterion. Try:
import numpy as np
df['V1'] = np.where(df['V1'] < df['Floor'], df['Floor'], df['V1'])
Followed by the same for V2 and V3.
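One way to avoid repeating the line by hand is a short loop over the columns. This is only a sketch, assuming the three column names from the question:

```python
import numpy as np
import pandas as pd

data = [[1.5, 2, 1.5, 0.8], [1.2, 2, 1.5, 3], [2, 2, 1.5, 1]]
df = pd.DataFrame(data, columns=["Floor", "V1", "V2", "V3"])

# Apply the same element-wise rule to each value column in turn:
# take Floor where the value is below it, otherwise keep the value.
for col in ["V1", "V2", "V3"]:
    df[col] = np.where(df[col] < df["Floor"], df["Floor"], df[col])

print(df)
```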
Use:
df.update( df.mask(df.loc[:, 'V1':'V3'].lt(df['Floor'], axis=0), df['Floor'], axis=0))
print (df)
Floor V1 V2 V3
0 1.5 2 1.5 1.5
1 1.2 2 1.5 3.0
2 2.0 2 2.0 2.0
I am trying to multiply two columns (ActualSalary * FTE) within the dataframe (OPR) to create a new column (FTESalary), but somehow it stops at row 21357. I don't understand what went wrong or how to fix it. The two columns came from importing a CSV file using the line: OPR = pd.read_csv('OPR.csv', encoding='latin1')
[In] OPR
[out]
ActualSalary FTE
44600 1
58,000.00 1
70,000.00 1
17550 1
34693 1
15674 0.4
[In] OPR["FTESalary"] = OPR["ActualSalary"].str.replace(",", "").astype("float")*OPR["FTE"]
[In] OPR
[out]
ActualSalary FTE FTESalary
44600 1 44600
58,000.00 1 58000
70,000.00 1 70000
17550 1 NaN
34693 1 NaN
15674 0.4 NaN
I am not expecting any NULL values as an output at all, I am really struggling with this. I would really appreciate the help.
Many thanks in advance! (I am new to both coding and here, please let me know via message if I have made mistakes or can improve the way I post questions here)
Sharing the data #oppresiveslayer
[In] OPR[0:6].to_dict()
[out]
{'ActualSalary': {0: '44600',
1: '58,000.00',
2: '70,000.00',
3: '39,780.00',
4: '0.00',
5: '78,850.00'},
'FTE': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}}
For more information on the two columns #charlesreid1
[in] OPR['ActualSalary'].astype
[out]
Name: ActualSalary, Length: 21567, dtype: object>
[in] OPR['FTE'].astype
[out]
Name: FTE, Length: 21567, dtype: float64>
The version I am using:
python: 3.7.3, pandas: 0.25.1 on jupyter Notebook 6.0.0
I believe that your ActualSalary column is a mix of strings and integers. That is the only way I've been able to recreate your error:
df = pd.DataFrame(
{'ActualSalary': ['44600', '58,000.00', '70,000.00', 17550, 34693, 15674],
'FTE': [1, 1, 1, 1, 1, 0.4]})
>>> df['ActualSalary'].str.replace(',', '').astype(float) * df['FTE']
0 44600.0
1 58000.0
2 70000.0
3 NaN
4 NaN
5 NaN
dtype: float64
The issue arises when you try to remove the commas:
>>> df['ActualSalary'].str.replace(',', '')
0 44600
1 58000.00
2 70000.00
3 NaN
4 NaN
5 NaN
Name: ActualSalary, dtype: object
First convert them to strings, before converting back to floats.
fte_salary = (
df['ActualSalary'].astype(str).str.replace(',', '') # Remove commas in string, e.g. '55,000.00' -> '55000.00'
.astype(float) # Convert string column to floats.
.mul(df['FTE']) # Multiply the salary column by the Full-Time-Equivalent (FTE) column.
)
>>> df.assign(FTESalary=fte_salary) # Assign new column to dataframe.
ActualSalary FTE FTESalary
0 44600 1.0 44600.0
1 58,000.00 1.0 58000.0
2 70,000.00 1.0 70000.0
3 17550 1.0 17550.0
4 34693 1.0 34693.0
5 15674 0.4 6269.6
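To see which rows are affected before fixing them, a small sketch along these lines may help. It assumes a mixed string/integer column like the one reconstructed above, with made-up values:

```python
import pandas as pd

# Hypothetical column mixing strings and a plain integer.
df = pd.DataFrame({
    "ActualSalary": ["44600", "58,000.00", 17550],
    "FTE": [1.0, 1.0, 0.5],
})

# Count the Python types actually stored in the object column.
print(df["ActualSalary"].map(type).value_counts())

# The .str accessor returns NaN for every non-string element,
# which is where the missing values come from.
lost = df["ActualSalary"].str.replace(",", "").isna()

# Casting to str first keeps every row; to_numeric with errors="coerce"
# turns anything still unparseable into NaN instead of raising.
cleaned = pd.to_numeric(
    df["ActualSalary"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)
df["FTESalary"] = cleaned * df["FTE"]
print(df)
```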
This should work:
OPR['FTESalary'] = OPR.apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1)
output
ActualSalary FTE FTESalary
0 44600 1.0 44600.0
1 58,000.00 1.0 58000.0
2 70,000.00 1.0 70000.0
3 17550 1.0 17550.0
4 34693 1.0 34693.0
5 15674 0.4 6269.6
OK, I think you need to do this:
OPR['FTESalary'] = OPR.reset_index().apply(lambda x: pd.to_numeric(x['ActualSalary'].replace(",", ""), errors='coerce') * x['FTE'], axis=1).to_numpy().tolist()
I was able to do it in a couple steps, but with list comprehension which might be less readable for a beginner. It makes an intermediate column, which does the float conversion, since your ActualSalary column is full of strings at the start.
OPR["X"] = [float(x.replace(",","")) for x in OPR["ActualSalary"]]
OPR["FTESalary"] = OPR["X"]*OPR["FTE"]
I am working with a large panel dataset (longitudinal data) with 500k observations. Currently, I am trying to fill the missing data (at most 30% of observations) using the mean of each variable up to time t. (The reason I do not fill the data with the overall mean is to avoid the forward-looking bias that arises from using data only available at a later point in time.)
I wrote the following function, which does the job but runs extremely slowly (5 hours for 500k rows!). In general, I find that filling missing data in pandas is computationally expensive. How do you normally fill missing values, and how do you make it run fast?
Function to fill with mean till time "t":
import time
import numpy as np
def meanTillTimeT(x, cols):
    start = time.time()
    print('Started')
    x.reset_index(inplace=True)
    for i in cols:
        # Non-zero, non-NaN values seen so far in column i.
        l1 = []
        for j in range(x.shape[0]):
            if x.loc[j, i] != 0 and np.isnan(x.loc[j, i]) == False:
                l1.append(x.loc[j, i])
            elif np.isnan(x.loc[j, i]) == True:
                # Fill the gap with the mean of the values seen so far.
                x.loc[j, i] = np.mean(l1)
    end = time.time()
    print("time elapsed:", end - start)
    return x
Let us build a DataFrame for illustration:
import pandas as pd
import numpy as np
df = pd.DataFrame({"value1": [1, 2, 1, 5, np.nan, np.nan, 8, 3],
"value2": [0, 8, 1, np.nan, np.nan, 8, 9, np.nan]})
Here is the DataFrame:
value1 value2
0 1.0 0.0
1 2.0 8.0
2 1.0 1.0
3 5.0 NaN
4 NaN NaN
5 NaN 8.0
6 8.0 9.0
7 3.0 NaN
I suggest first computing the cumulative sums with pandas.DataFrame.cumsum, together with the running count of non-NaN values, to get the running means. Then forward-fill those means across the gaps and use them to fill the NaNs in the original DataFrame with pandas.DataFrame.fillna. These vectorised operations are much faster than Python loops:
df_mean = df.cumsum() / (~df.isna()).cumsum()
df_mean = df_mean.ffill()
df = df.fillna(value = df_mean)
The result is:
value1 value2
0 1.00 0.0
1 2.00 8.0
2 1.00 1.0
3 5.00 3.0
4 2.25 3.0
5 2.25 8.0
6 8.00 9.0
7 3.00 5.2
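An alternative sketch of the same idea uses expanding().mean(), which averages the non-NaN values seen up to each row. The second half assumes a hypothetical panel layout with an entity column (the names entity and value are made up), so that the running mean restarts for each entity:

```python
import numpy as np
import pandas as pd

# Same frame as in the answer above.
df = pd.DataFrame({"value1": [1, 2, 1, 5, np.nan, np.nan, 8, 3],
                   "value2": [0, 8, 1, np.nan, np.nan, 8, 9, np.nan]})

# Running mean of the observed values up to and including each row,
# used only where the original value is missing.
filled = df.fillna(df.expanding().mean())
print(filled)

# Hypothetical panel: one running mean per entity, so observations from
# one entity never leak into another.
panel = pd.DataFrame({
    "entity": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "value": [1, 2, np.nan, 5, np.nan, 4, np.nan, 8],
})
running = panel.groupby("entity")["value"].transform(
    lambda s: s.expanding().mean()
)
panel["value"] = panel["value"].fillna(running)
print(panel)
```

A gap at the very start of an entity stays NaN here, since there is no earlier observation to average. Unlike the original loop, neither version excludes zeros from the running mean.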
Columns are attributes, rows are observations.
I would like to extract the rows where the sum of any two attributes exceeds a specified value (say 0.7). Then, in two new columns, I want to list the column headers with the bigger and the smaller contribution to that sum.
I am new to Python, so I am stuck on how to proceed after generating my dataframe.
You can do this:
import pandas as pd
from itertools import combinations
THRESHOLD = 8.0
def valuation_formula(row):
    l = [sorted(x) for x in combinations(row, r=2) if sum(x) > THRESHOLD]
    if(len(l) == 0):
        row["smaller"], row["larger"] = None, None
    else:
        row["smaller"], row["larger"] = l[0] # since not specified by OP, we take the first such pair
    return row
contribution_df = df.apply(lambda row: valuation_formula(row), axis=1)
So that, if
df = pd.DataFrame({"a" : [1.0, 2.0, 4.0], "b" : [5.0, 6.0, 7.0]})
a b
0 1.0 5.0
1 2.0 6.0
2 4.0 7.0
then, contribution_df is
a b smaller larger
0 1.0 5.0 NaN NaN
1 2.0 6.0 NaN NaN
2 4.0 7.0 4.0 7.0
HTH.
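The code above stores the two values rather than the column headers the question asks for. A possible variation that records the headers and keeps only the matching rows could look like this. It is a sketch on the same toy data, and label_pair is a made-up helper name:

```python
from itertools import combinations

import pandas as pd

THRESHOLD = 8.0

def label_pair(row):
    """Return the headers of the first column pair whose sum exceeds THRESHOLD."""
    for left, right in combinations(row.index, 2):
        if row[left] + row[right] > THRESHOLD:
            # Order the two headers by the value each contributes.
            smaller, larger = sorted((left, right), key=lambda c: row[c])
            return pd.Series({"smaller": smaller, "larger": larger})
    return pd.Series({"smaller": None, "larger": None})

df = pd.DataFrame({"a": [1.0, 2.0, 4.0], "b": [5.0, 6.0, 7.0]})

labels = df.apply(label_pair, axis=1)
# Keep only the rows where some pair exceeded the threshold.
result = df.join(labels).dropna(subset=["smaller"])
print(result)
```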
In R, breaking ties randomly when using the rank function is simple:
rank(my_vec, ties.method = "random")
However, though both scipy (scipy.stats.rankdata) and pandas (pandas.Series.rank) have ranking functions, neither offers a method that breaks ties randomly.
Is there a simple way to do this with a Python library? The order of the list has to remain the same.
Pandas' rank allows for these methods:
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
To "simply" accomplish your goal we can use 'first' after having randomized the Series.
Assume my series is named my_vec
my_vec.sample(frac=1).rank(method='first')
You can then put it back in the same order it was with
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
Example Runs
my_vec = pd.Series([1, 2, 3, 1, 2, 3])
Trial 1
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 2.0 <- I expect this and
1 4.0
2 6.0
3 1.0 <- this to be first ranked
4 3.0
5 5.0
dtype: float64
Trial 2
my_vec.sample(frac=1).rank(method='first').reindex_like(my_vec)
0 1.0 <- Still first ranked
1 3.0
2 6.0
3 2.0 <- but order has switched
4 4.0
5 5.0
dtype: float64
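The shuffle-then-rank idea can be wrapped in a small helper. This is a sketch only: rank_random_ties and its seed parameter are names made up here, with the seed passed to sample's random_state so that a run can be reproduced:

```python
import pandas as pd

def rank_random_ties(values, seed=None):
    """Rank a Series, breaking ties in random order.

    Shuffling first means method='first' sees tied values in random order;
    reindex_like then restores the original row order.
    """
    shuffled = values.sample(frac=1, random_state=seed)
    return shuffled.rank(method="first").reindex_like(values)

my_vec = pd.Series([1, 2, 3, 1, 2, 3])
ranks = rank_random_ties(my_vec, seed=0)
print(ranks)
```

Whatever the seed, the two 1s share ranks 1 and 2, the two 2s share ranks 3 and 4, and the two 3s share ranks 5 and 6. Only the order within each tied pair varies.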