I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Desired output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is a datatype error in the numerical columns (Leaves, Salary, Performance).
If a numerical column contains strings, then that row should be deleted from the data frame. I tried:
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors='coerce')
but this will convert the values to NaN.
Let's start with a note concerning your sample data:
It contains 'Nan' strings, which are not among the strings automatically
recognized as NaN.
To treat them as NaN, I read the source text with read_fwf,
passing na_values=['Nan'].
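For reference, a minimal sketch of that read step ('data.txt' is a placeholder for the actual source file):
import pandas as pd

# 'Nan' is not among the strings pandas recognizes as missing by default,
# so it is listed explicitly; every other value stays a plain string
df = pd.read_fwf('data.txt', na_values=['Nan'])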
And now let's get down to the main task.
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
    # NaN values and the bare string 'unknown' are acceptable as-is
    if pd.isna(cell) or cell == 'unknown':
        return True
    # otherwise every character must be a digit or a decimal point
    return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also accept a cell if it contains only the string unknown, but you don't
accept a cell if that word is enclosed in e.g. quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
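A few quick checks of how this function classifies the sample values (assuming the cells were read as strings, as above):
isAcceptable('3000.0')      # True  - digits and a dot only
isAcceptable('kkkk')        # False - letters
isAcceptable("'51.6'")      # False - the surrounding quotes are not digits or dots
isAcceptable('unknown')     # True  - explicitly allowed
isAcceptable(float('nan'))  # True  - NaN is allowed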
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
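Putting it together, a minimal sketch of the whole filter (note that on pandas 2.1+ applymap is deprecated in favour of DataFrame.map, which takes the same element-wise function):
mask = df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)
result = df[mask].reset_index(drop=True)
print(result)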
I have two dataframes, which I am calling df1 and df2.
df1 has the columns KPI and Context and it looks like this.
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column 'Keyword'.
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I wanted to create another dataframe out of these two dataframes wherein, if a particular value from the 'Keyword' column of df2 is present in the 'Context' of df1, I simply write its count.
For this I have used pd.crosstab(), however I suspect that it's not giving me the expected output.
Here's what I have tried so far.
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
The new_df looks like this.
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems. First, explode works with list-like (already split) values, not with plain strings. Then, to extract the Keyword matches from Context you need Series.str.findall, and crosstab should use columns from the same DataFrame, not two different ones:
import re
# build one regex that matches any keyword as a whole word
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
# find every (case-insensitive) keyword occurrence in each Context cell -> list of matches per row
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
# one row per (KPI, matched keyword) pair
new_df = df1.explode('new')
# count occurrences of each keyword per KPI
out = pd.crosstab(new_df['KPI'], new_df['new'])
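One possible follow-up, assuming you also want keywords that never matched anywhere to appear as zero-count columns (and assuming the matches come back in the same case as the keywords in df2):
out = out.reindex(columns=df2['Keyword'], fill_value=0)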
I have a pandas dataframe containing NaN values in one of its columns.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
    print(f"Total count: {pandas_df.count()}")
    print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
    pandas_df[column_name].fillna(default_value, inplace=True)
    print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
    return pandas_df
df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and the list of columns are too long, so it's difficult to show all the steps (join with another dataframe, drop useless columns, add useful columns, join with other dataframes, filter, etc.).
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
The type of the column before applying fillna is float64, the same as default_value.
Therefore, my question is: what could be the potential reasons for this problem?
What kind of transformation can lead to this problem? This is the method that works for another, similar data frame; the only difference between them lies in the chain of transformations.
BTW, there is a system log at this place:
/home/hadoop/.local/lib/python3.6/site-packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
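For context, this SettingWithCopyWarning hints that pandas_df may be a copy of a slice produced by one of the earlier transformations, so the inplace fillna can land on a throw-away object. A minimal sketch of a variant that would sidestep that (an assumption about the cause, not a confirmed diagnosis):
# work on an explicit copy and assign the result instead of relying on inplace=True
pandas_df = pandas_df.copy()
pandas_df[column_name] = pandas_df[column_name].fillna(default_value)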
So, my dataframe is
price model_year model condition cylinders fuel odometer transmission type paint_color is_4wd date_posted days_listed
0 9400 2011.0 bmw x5 good 6.0 gas 145000.0 automatic SUV NaN True 2018-06-23 19
1 25500 NaN ford f-150 good 6.0 gas 88705.0 automatic pickup white True 2018-10-19 50
2 5500 2013.0 hyundai sonata like new 4.0 gas 110000.0 automatic sedan red False 2019-02-07 79
3 1500 2003.0 ford f-150 fair 8.0 gas NaN automatic pickup NaN False 2019-03-22 9
4 14900 2017.0 chrysler 200 excellent 4.0 gas 80903.0 automatic sedan black False 2019-04-02 28
As you can see, row 1's model is the same as row 3's, but row 1's model year is missing. It would naturally follow that I can replace row 1's model year with row 3's so there isn't a NaN there. I'm aware I can change it manually, but the dataframe is over 50,000 rows long and there are many more values just like that. Is there an automated way I can go about replacing these values?
Edit: After looking over the df just now, I've realized that I can't really replace the model year like that, as it can change even within the same model, although I would still love to know how it's done, if possible, for future reference.
You can merge the dataframe with itself and fill the NaNs:
# unique (model, model_year) pairs with a known year, merged back onto the original frame
df_want = df.merge(df[['model_year','model']].dropna().drop_duplicates(), on='model', how='left')
df_want['model_year'] = df_want['model_year_x'].fillna(df_want['model_year_y'])
df_want = df_want.drop(['model_year_x','model_year_y'], axis=1)
Yes, you can replace all NaN model years with a non-NaN entry for the same model like this:
models = df['model'].unique()
for m in models:
    # first non-NaN year recorded for this model
    year = df.loc[(df['model_year'].notna()) & (df['model'] == m), 'model_year'].values[0]
    df.loc[(df['model_year'].isna()) & (df['model'] == m), 'model_year'] = year
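A vectorised alternative that avoids the explicit loop (a sketch, not part of the original answers): fill each missing model_year with the first non-null year seen for that model.
df['model_year'] = df['model_year'].fillna(
    df.groupby('model')['model_year'].transform('first')
)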
I have a dataframe lexdata and I want to check and count the number of null values and also detect invalid values in the 'sales' column.
sample data
city year month sales
0 Abilene 2000 1 72.0
1 Abilene 2000 2 ola-k
2 Abilene 2000 3 130.0
3 Abilene 2000 4 lee
4 Abilene 2000 5 141.0
I successfully checked and counted the null values with the following code:
lexdata.isnull().sum()
The challenge is to check for invalid values (strings) in the sales column.
You can use pd.to_numeric and pass 'coerce' to the errors parameter. It will try to convert the values to numeric; any value that cannot be converted becomes NaN, so you can count the null values after the conversion:
pd.to_numeric(df['sales'], errors='coerce').isnull().sum()
#output: 2
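If you also want to separate genuinely missing values from invalid strings (an extension of the above, not part of the original answer), compare the coerced column with the original:
coerced = pd.to_numeric(lexdata['sales'], errors='coerce')
n_null = lexdata['sales'].isnull().sum()                           # values that were already missing
n_invalid = (coerced.isnull() & lexdata['sales'].notnull()).sum()  # non-null but not numeric
print(n_null, n_invalid)   # 0 2 for the sample data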
I have a column named Volume in a pandas data frame and, for each row, I want to look back at the previous 5 volumes and find the 40th percentile.
The Volume data is as follows:
1200
3400
5000
2300
4502
3420
5670
5400
4320
7890
8790
For the first 5 values we don't have enough data to look back, but from the 6th value (3420) we should find the 40th percentile of the previous 5 volumes 1200, 3400, 5000, 2300, 4502, and keep doing this for the rest of the data, always taking the previous 5 values before the current one.
Not sure if I understand correctly since there is no MCVE.
However, it sounds like you want a rolling quantile:
>>> s.rolling(5).quantile(0.4)
0 NaN
1 NaN
2 NaN
3 NaN
4 2960.0
5 3412.0
6 4069.2
7 4069.2
8 4429.2
9 4968.0
10 5562.0
dtype: float64
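Note that rolling(5) includes the current row in each window. If you want the quantile of the 5 values strictly before the current row, as the question describes, one option is to shift the result by one:
>>> s.rolling(5).quantile(0.4).shift(1)
Index 5 then becomes 2960.0 (the 40th percentile of the first five values), index 6 becomes 3412.0, and so on.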