I have a pandas dataframe containing NaN values for some column.
I'm trying to fill them with a default value (30), but it doesn't work.
Original dataframe:
type avg_speed
0 CAR 32.0
1 CAR NaN
2 CAR NaN
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR NaN
8 BIKE NaN
9 BIKE 35.1
...
Desired result:
type avg_speed
0 CAR 32.0
1 CAR 30
2 CAR 30
3 BIKE 16.2
4 CAR 28.5
5 SCOOTER 29.7
6 CAR 30.7
7 CAR 30
8 BIKE 30
9 BIKE 35.1
My code:
def fill_with_default(pandas_df, column_name, default_value):
print(f"Total count: {pandas_df.count()}")
print(f"Count of Nan BEFORE: {pandas_df[column_name].isna().sum()}")
pandas_df[column_name].fillna(default_value, inplace=True)
print(f"Count of Nan AFTER: {pandas_df[column_name].isna().sum()}")
return pandas_df
df = fill_with_default(df, "avg_speed", 30)
Output:
Total count: 105018
Count of Nan BEFORE: 49514
Count of Nan AFTER: 49514
The chain of dataframe transformations and list of columns are too long, so it's difficult to show all steps (join with another dataframe, drop useless columns, add usefull columns, join with other dataframes, filter etc.)
I've tried other options but they also don't work:
#pandas_df.fillna({column_name: default_value}, inplace=True)
#pandas_df.loc[pandas_df[column_name].isnull(),column_name] = default_value
...
Type of column before applying "fillna" is fload64, the same as default_value
Therefore, my question is: what could be the potential reasons of this problem?
What kind of transformation can lead to this problem? Because this is the method that works for another similar data frame. The only difference between them lies in the chain of transformations.
BTW, there is a system log at this place:
/home/hadoop/.local/lib/python3.6/site-
packages/pandas/core/generic.py:6287: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation:
http://pandas.pydata.org/pandas-
docs/stable/user_guide/indexing.html#returning-a-view-versus-a-
copy
self._update_inplace(new_data)
Related
I have a 2dataframes, which I am calling as df1 and df2.
df1 has columns like KPI and context and it looks like this.
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column 'keyword'
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I wanted to create another dataframe out of these two dataframe wherein if a particular value from 'Keyword' column of df2 is present in the 'Context' of df1 then simply write the count of it.
for which I have used pd.crosstab() however I suspect that its not giving me the expected output.
here's what I have tried so far.
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
the new_df looks like this.
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
what exactly am I missing? please let me know, thanks!
There is multiple problems - first explode working with splitted values, not with strings. Then for extract Keyword from Context need Series.str.findall and for crosstab use columns in same DataFrame, not 2 different:
import re
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
in my code I've generated a range of dates using pd.date_range in an effort to compare it to a column of dates read in from excel using pandas. The generated range of dates is refered to as "all_dates".
all_dates=pd.date_range(start='1998-12-31', end='2020-06-23')
for i, date in enumerate(period): # where 'Period' is the column of excel dates
if date==all_dates[i]: # loop until date from excel doesn't match date from generated dates
continue
else:
missing_dates_stock.append(i) # keep list of locations where dates are missing
stock_data.insert(i,"NaN") # insert 'NaN' where missing date is found
This results in TypeError: argument of type 'Timestamp' is not iterable. How can I make the data types match such that I can iterate and compare them? Apologies as I am not very fluent in Python.
I think you are trying to create a NaN row if the date does not exist in the excel file.
Here's a way to do it. You can use the df.merge option.
I am creating df1 to simulate the excel file. It has two columns sale_dt and sale_amt. If the sale_dt does not exist, then we want to create a separate row with NaN in the columns. To ensure we simulate it, I am creating a date range from 1998-12-31 through 2020-06-23 skipping 4 days in between. So we have a dataframe with 4 missing date between each two rows. The solution should create 4 dummy rows with the correct date in ascending order.
import pandas as pd
import random
#create the sales dataframe with missing dates
df1 = pd.DataFrame({'sale_dt':pd.date_range(start='1998-12-31', end='2020-06-23', freq='5D'),
'sale_amt':random.sample(range(1, 2000), 1570)
})
print (df1)
#now create a dataframe with all the dates between '1998-12-31' and '2020-06-23'
df2 = pd.DataFrame({'date':pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')})
print (df2)
#now merge both dataframes with outer join so you get all the rows.
#i am also sorting the data in ascending order so you can see the dates
#also dropping the original sale_dt column and renaming the date column as sale_dt
#then resetting index
df1 = (df1.merge(df2,left_on='sale_dt',right_on='date',how='outer')
.drop(columns=['sale_dt'])
.rename(columns={'date':'sale_dt'})
.sort_values(by='sale_dt')
.reset_index(drop=True))
print (df1.head(20))
The original dataframe was:
sale_dt sale_amt
0 1998-12-31 1988
1 1999-01-05 1746
2 1999-01-10 1395
3 1999-01-15 538
4 1999-01-20 1186
... ... ...
1565 2020-06-03 560
1566 2020-06-08 615
1567 2020-06-13 858
1568 2020-06-18 298
1569 2020-06-23 1427
The output of this will be (first 20 rows):
sale_amt sale_dt
0 1988.0 1998-12-31
1 NaN 1999-01-01
2 NaN 1999-01-02
3 NaN 1999-01-03
4 NaN 1999-01-04
5 1746.0 1999-01-05
6 NaN 1999-01-06
7 NaN 1999-01-07
8 NaN 1999-01-08
9 NaN 1999-01-09
10 1395.0 1999-01-10
11 NaN 1999-01-11
12 NaN 1999-01-12
13 NaN 1999-01-13
14 NaN 1999-01-14
15 538.0 1999-01-15
16 NaN 1999-01-16
17 NaN 1999-01-17
18 NaN 1999-01-18
19 NaN 1999-01-19
I have the following data:
df =
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 Rocky 10 Casual kkkk 22.4
2 jenifer 50 Emergency 2500.6 '51.6'
3 Tom 10 sick Nan 46.2
4 Harry nn Casual 1800.1 '58.3'
5 Julie 22 sick 3600.2 'unknown'
6 Sam 5 Casual Nan 47.2
7 Mady 6 sick unknown Nan
Output:
Emp_Name Leaves Leave_Type Salary Performance
0 Christy 20 sick 3000.0 56.6
1 jenifer 50 Emergency 2500.6 51.6
2 Tom 10 sick Nan 46.2
3 Sam 5 Casual Nan 47.2
4 Mady 6 sick unknown Nan
I want to delete records where there is datatype error in numerical columns(Leaves,Salary,Performance).
If numerical columns contains strings then that row show be deleted from data frame?
df[['Leaves','Salary','Performance']].apply(pd.to_numeric, errors = 'coerce')
but this will covert values to Nan.
Let's start from a note concerning your sample data:
It contains Nan strings, which are not among strings automatically
recognized as NaNs.
To treat them as NaN, I read the source text with read_fwf,
passing na_values=['Nan'].
And now get down to the main task:
Define a function to check whether a cell is acceptable:
def isAcceptable(cell):
if pd.isna(cell) or cell == 'unknown':
return True
return all(c.isdigit() or c == '.' for c in cell)
I noticed that you accept NaN values.
You also a cell if it contains only unknown string, but you don't
accept a cell if such word is enclosed between e.g. quotes.
If you change your mind about what is / is not acceptable, change the
above function accordingly.
Then, to leave only rows with all acceptable values in all 3 mentioned
columns, run:
df[df[['Leaves', 'Salary', 'Performance']].applymap(isAcceptable).all(axis=1)]
I have a column named volume in a pandas data frame and I wanted to look back previous 5 volumes from current column # and find 40 percentile .
Volume data - as follows
1200
3400
5000
2300
4502
3420
5670
5400
4320
7890
8790
For 1st 5 values we don’t have enough data to look back , but from 6th value 3420 we should find percentile (40) of previous 5 volumes 1200,3400,5000,2300,4502 and keep doing this for rest of the data by taking previous 5 data from current value.
Not sure if I understand correctly since there is no mcve
However, sounds like you want a rolling quantile
>>> s.rolling(5).quantile(0.4)
0 NaN
1 NaN
2 NaN
3 NaN
4 2960.0
5 3412.0
6 4069.2
7 4069.2
8 4429.2
9 4968.0
10 5562.0
dtype: float64
I have a dataframe, with recordings of statistics in multiple columns.
I have a list of the column names: stat_columns = ['Height', 'Speed'].
I want to combine the data to get one row per id.
The data comes sorted with the newest records on the top. I want the most recent data, so I must use the first value of each column, by id.
My dataframe looks like this:
Index id Height Speed
0 100007 8.3
1 100007 54
2 100007 8.6
3 100007 52
4 100035 39
5 100014 44
6 100035 5.6
And I want it to look like this:
Index id Height Speed
0 100007 54 8.3
1 100014 44
2 100035 39 5.6
I have tried a simple groupby myself:
df_stats = df_path.groupby(['id'], as_index=False).first()
But this seems to only give me a row with the first statistic found.
For me your solution working, maybe is necessary replace empty values to NaNs:
df_stats = df_path.replace('',np.nan).groupby('id', as_index=False).first()
print (df_stats)
id Index Height Speed
0 100007 0 54.0 8.3
1 100014 5 44.0 NaN
2 100035 4 39.0 5.6